This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[incubator/sparkoperator] introduce imagePullSecret at sparkoperator deployments #18162

Merged
merged 3 commits into helm:master on Oct 28, 2019

Conversation

rolandjohann
Contributor

Which issue this PR fixes

  • missing imagePullSecrets at deployments so we can use sparkoperator images from private docker repos
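
For context, a change like this usually wires a new values key into the pod spec of the operator's deployment template. A minimal sketch, assuming a values key named imagePullSecrets (the chart's actual key name, defaults, and indentation may differ):

# values.yaml (sketch; regcred is an illustrative secret name)
imagePullSecrets:
  - name: regcred

# templates/deployment.yaml (sketch, pod spec excerpt)
spec:
  template:
    spec:
{{- if .Values.imagePullSecrets }}
      imagePullSecrets:
{{ toYaml .Values.imagePullSecrets | indent 8 }}
{{- end }}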

Special notes for your reviewer:

Checklist


  • DCO signed
  • Chart Version bumped
  • Variables are documented in the README.md
  • Title of the PR starts with chart name (e.g. [stable/chart])

Signed-off-by: Roland Johann <roland.johann@phenetic.io>
@helm-bot helm-bot added Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 21, 2019
@k8s-ci-robot
Contributor

Hi @rolandjohann. Thanks for your PR.

I'm waiting for a helm member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 21, 2019
@liyinan926
Collaborator

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liyinan926, rolandjohann

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 21, 2019
@liyinan926
Collaborator

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 21, 2019
@rolandjohann
Contributor Author

/retest

@zanhsieh
Collaborator

@rolandjohann
I think the CRD does not get cleaned up properly. You might consider implementing a hook. See other charts for details.

I1021 18:28:57.015] Error: release sparkoperator-voyujv17r9 failed: customresourcedefinitions.apiextensions.k8s.io "sparkapplications.sparkoperator.k8s.io" already exists
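
For readers unfamiliar with the pattern being suggested, a hook-triggered CRD cleanup Job looks roughly like this (an illustrative sketch, not this chart's actual manifest; it assumes a service account with RBAC permission to delete CRDs):

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-crd-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: crd-cleanup
          image: bitnami/kubectl  # illustrative image choice
          command:
            - kubectl
            - delete
            - crd
            - sparkapplications.sparkoperator.k8s.io
            - --ignore-not-found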

@rolandjohann
Contributor Author

rolandjohann commented Oct 22, 2019

@zanhsieh saw that, too. There is already a hook-triggered CRD cleaner, and I didn't introduce any additional CRDs - I just defined the imagePullSecrets property in the k8s deployment manifests.

The log suggests to me that the chart install fails because of a failure in the previous run, where the chart deletion was forced and the CRDs were left behind.

I faced similar issues when we enabled PSPs and the CRD cleanup job pod couldn't be scheduled because of policy violations.

The log is not verbose about the failure of the previous run's apply, and I don't have access to the k8s cluster to actually see what's going wrong there.

Please correct me if I'm wrong: the changes I introduced here can't cause these tests to fail.

@rolandjohann
Contributor Author

I'm not sure how to proceed from here.

Maybe someone with sufficient permissions can inspect the cluster to determine the root cause of the failing tests.

Can someone point me to the repository with the test container code, so we can run the tests manually?

@liyinan926
Collaborator

@zanhsieh
Collaborator

/retest

@zanhsieh
Collaborator

@vsliouniaev
Do you have a clue about this? They've already implemented CRD deletion.

@vsliouniaev
Collaborator

@zanhsieh from my experience with the tests in this repo, the cluster used for all PRs is the same. If the clean-up job failed to execute, then CRDs may already exist in the cluster. To solve this you will need to do the following two separate things:

1. Set up the chart

In order to solve this problem in the stable/prometheus-operator chart I had to do the following:

  1. Install the CRDs using the helm pre-install CRD hooks (which is necessary because we're creating resources that we're provisioning the CRDs for in the same chart)
  2. Use a CI-only values.yaml to specify that the CRDs should be deleted if they already exist. It is critical that this behaviour is not enabled in the actual release values. If it is, you may end up deleting user-provisioned resources at some point later.

Here is how this is done for one of the CRDs in the prometheus-operator chart:

  annotations:
{{- if .Values.prometheusOperator.cleanupCustomResourceBeforeInstall }}
    "helm.sh/hook-delete-policy": "before-hook-creation"
{{- end }}
    "helm.sh/hook": crd-install

And then we have a separate values YAML into which we sed a few custom CI settings whenever we update the main one: https://github.com/helm/charts/blob/master/stable/prometheus-operator/ci/test-values.yaml#L1025

2. Set up a hack PR for cases where Helm breaks

Helm does not properly support CRDs, despite arguments that it does and all the hooks that are available for them. It may work 95% of the time, but will still fail in the other 5% of cases. There should at least be an escape hatch in the chart that installs it by provisioning the CRDs first and then the rest of the components. This will become more obvious once the chart begins to include resources for which it creates CRDs.

Since this is likely to keep biting you in this repo, you should also create a hack PR that simply executes a post-install / pre-install job to delete the CRDs in the cluster. If you ever get stuck in a loop like the one you're in now, you can re-run tests on that PR to have it clean up the test cluster for you.
Here's the one for prometheus-operator: #18184


It has been suggested to me that it may be easier to move the charts out of this repo and maintain them externally, adding the repo to helm hub.

@rolandjohann
Contributor Author

@vsliouniaev thanks for the insights. Is there a chance we can inspect the logs of the k8s resources to find the root cause of the failing test? According to the log, the helm chart gets uninstalled, so the CRDs should be deleted as well. Is it possible to manually fix the test cluster's broken state so we can run the tests for this PR?

Maybe we should create a separate issue for that, to align on how to fix the tests.

@vsliouniaev
Collaborator

vsliouniaev commented Oct 23, 2019

Is there a chance we can inspect the logs of the k8s resources to find the root cause of the failing test

I don't really understand what you mean. The error in the test run is:

I1022 21:05:24.872] Error: release sparkoperator-34w1g8vn2j failed: customresourcedefinitions.apiextensions.k8s.io "sparkapplications.sparkoperator.k8s.io" already exists

According to the log, the helm chart gets uninstalled, so the CRDs should be deleted as well.

There is absolutely nothing to guarantee that helm will run everything end-to-end. In an ideal scenario a delete will trigger the hooks, which will most likely perform the operation you intend. But that cannot be guaranteed to occur 100% of the time.

Is it possible to manually fix the test cluster's broken state so we can run the tests for this PR?

Yes, I believe I outlined how to get this done in part 2 here: #18162 (comment)

@zanhsieh
Collaborator

@vsliouniaev
Thank you for the detailed explanation.

@rolandjohann
Please try implementing what @vsliouniaev suggested. Once done, let's try the e2e test again.

@vsliouniaev
Collaborator

vsliouniaev commented Oct 24, 2019

The stable/prometheus-operator chart has started suffering from this bug during CI: #9241

I didn't have a workaround for this at first, and imagined it would be pretty involved to figure one out. I now have a possible work-around; the longer-term fix should be to update the version of helm used by the helm testing image. More details are here: #18271

@rolandjohann rolandjohann reopened this Oct 28, 2019
@k8s-ci-robot k8s-ci-robot merged commit 9806bd7 into helm:master Oct 28, 2019
demisx added a commit to beocommedia/charts that referenced this pull request Oct 28, 2019
demisx added a commit to beocommedia/charts that referenced this pull request Oct 28, 2019
anandsinghkunwar pushed a commit to anandsinghkunwar/charts that referenced this pull request Oct 30, 2019
…deployments (helm#18162)

* introduce imagePullSecret at sparkoperator deployments

Signed-off-by: Roland Johann <roland.johann@phenetic.io>

* bump chart version

Signed-off-by: Roland Johann <roland.johann@phenetic.io>

* document additional var at README.md

Signed-off-by: Roland Johann <roland.johann@phenetic.io>
hakman pushed a commit to hakman/charts that referenced this pull request Dec 5, 2019