This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[incubator/sparkoperator] introduce imagePullSecret at sparkoperator deployments #18162

Merged
merged 3 commits into helm:master on Oct 28, 2019

Conversation

rolandjohann
Contributor

Which issue this PR fixes

  • missing imagePullSecrets at deployments so we can use sparkoperator images from private docker repos
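
For context, a change like this usually wires a new values key into the pod spec of the operator's deployment template. A minimal sketch, assuming a values key named imagePullSecrets (the chart's actual key name, defaults, and indentation may differ):

# values.yaml (sketch; regcred is an illustrative secret name)
imagePullSecrets:
  - name: regcred

# templates/deployment.yaml (sketch, pod spec excerpt)
spec:
  template:
    spec:
{{- if .Values.imagePullSecrets }}
      imagePullSecrets:
{{ toYaml .Values.imagePullSecrets | indent 8 }}
{{- end }}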

Special notes for your reviewer:

Checklist


  • DCO signed
  • Chart Version bumped
  • Variables are documented in the README.md
  • Title of the PR starts with chart name (e.g. [stable/chart])

Signed-off-by: Roland Johann <roland.johann@phenetic.io>
@helm-bot helm-bot added Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 21, 2019
@k8s-ci-robot
Contributor

Hi @rolandjohann. Thanks for your PR.

I'm waiting for a helm member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 21, 2019
@liyinan926
Collaborator

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liyinan926, rolandjohann

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 21, 2019
@liyinan926
Collaborator

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 21, 2019
@rolandjohann
Contributor Author

/retest

@zanhsieh
Collaborator

@rolandjohann
I think the CRD does not get cleaned up properly. You might consider implementing a hook. See other charts for details.

I1021 18:28:57.015] Error: release sparkoperator-voyujv17r9 failed: customresourcedefinitions.apiextensions.k8s.io "sparkapplications.sparkoperator.k8s.io" already exists
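
For readers unfamiliar with the pattern being suggested, a hook-triggered CRD cleanup Job looks roughly like this (an illustrative sketch, not this chart's actual manifest; it assumes a service account with RBAC permission to delete CRDs):

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-crd-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: crd-cleanup
          image: bitnami/kubectl  # illustrative image choice
          command:
            - kubectl
            - delete
            - crd
            - sparkapplications.sparkoperator.k8s.io
            - --ignore-not-found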

@rolandjohann
Contributor Author

rolandjohann commented Oct 22, 2019

@zanhsieh saw that, too. There is already a hook-triggered CRD cleaner, and I didn't introduce any additional CRDs - I just defined the imagePullSecrets property in the k8s deployment manifests.

The log suggests to me that the chart install fails because of a failure in the previous run, where the chart deletion was forced and the CRDs were left behind.

I faced similar issues when we enabled PSPs and the CRD cleanup job pod couldn't be scheduled because of policy violations.

The log is not verbose about the failure of the previous run's apply, and I don't have access to the k8s cluster to actually see what's going wrong there.

Please correct me if I'm wrong: the changes I introduced here can't cause these tests to fail.

@rolandjohann
Contributor Author

I'm not sure how to proceed from here.

Maybe someone with sufficient permissions can inspect the cluster to determine the root cause of the failing tests.

Can someone point me to the repository with the test container code, so we can run the tests manually?

@liyinan926
Collaborator

@zanhsieh
Collaborator

/retest

@zanhsieh
Collaborator

@vsliouniaev
Do you have a clue about this? They've already implemented CRD deletion.

@vsliouniaev
Collaborator

@zanhsieh from my experience with the tests in this repo, the cluster used for all PRs is the same. If the clean-up job failed to execute, then CRDs may already exist in the cluster. To solve this you will need to do the following two separate things:

1. Set up the chart

In order to solve this problem in the stable/prometheus-operator chart I had to do the following:

  1. Install the CRDs using the helm pre-install CRD hooks (which is necessary because we're creating resources that we're provisioning the CRDs for in the same chart)
  2. Use a CI-only values.yaml to specify that the CRDs should be deleted if they already exist. It is critical that this behaviour is not enabled in the actual release values. If it is, you may end up deleting user-provisioned resources at some point later.

Here is how this is done for one of the CRDs in the prometheus-operator chart:

  annotations:
{{- if .Values.prometheusOperator.cleanupCustomResourceBeforeInstall }}
    "helm.sh/hook-delete-policy": "before-hook-creation"
{{- end }}
    "helm.sh/hook": crd-install

And then we have a separate values YAML into which we sed a few custom CI settings whenever we update the main one: https://github.com/helm/charts/blob/master/stable/prometheus-operator/ci/test-values.yaml#L1025

2. Set up a hack PR for cases where Helm breaks

Helm does not properly support CRDs, despite arguments that it does and all the hooks that are available for them. It may work 95% of the time, but will still fail in the other 5% of cases. There should at least be an escape hatch in the chart that installs it by provisioning the CRDs first and then the rest of the components. This will become more obvious once the chart begins to include resources for which it creates CRDs.

Since this is likely to keep biting you in this repo, you should also create a hack PR that simply executes a post-install / pre-install job to delete the CRDs in the cluster. If you ever get stuck in a loop like the one you're in now, you can re-run tests on that PR to have it clean up the test cluster for you.
Here's the one for prometheus-operator: #18184


It has been suggested to me that it may be easier to move the charts out of this repo and maintain them externally, adding the repo to helm hub.

@rolandjohann
Contributor Author

@vsliouniaev thanks for the insights. Is there a chance we can inspect the logs of the k8s resources to find the root cause of the failing test? According to the log, the helm chart gets uninstalled, so the CRDs should be deleted as well. Is it possible to manually fix the test cluster's broken state so we can run the tests for this PR?

Maybe we should create a separate issue for that, to align on how to fix the tests.

@vsliouniaev
Collaborator

vsliouniaev commented Oct 23, 2019

Is there a chance we can inspect the logs of the k8s resources to find the root cause of the failing test

I don't really understand what you mean. The error in the test run is:

I1022 21:05:24.872] Error: release sparkoperator-34w1g8vn2j failed: customresourcedefinitions.apiextensions.k8s.io "sparkapplications.sparkoperator.k8s.io" already exists

According to the log, the helm chart gets uninstalled, so the CRDs should be deleted as well.

There is absolutely nothing to guarantee that helm will run everything end-to-end. In an ideal scenario a delete will trigger the hooks, which will most likely perform the operation you intend. But that cannot be guaranteed to occur 100% of the time.

Is it possible to manually fix the test cluster's broken state so we can run the tests for this PR?

Yes, I believe I outlined how to get this done in part 2 here: #18162 (comment)

@zanhsieh
Collaborator

@vsliouniaev
Thank you for the detailed explanation.

@rolandjohann
Please try implementing what @vsliouniaev suggested. Once done, let's try the e2e test again.

@vsliouniaev
Collaborator

vsliouniaev commented Oct 24, 2019

The stable/prometheus-operator chart has started suffering from this bug during CI: #9241

I didn't have a workaround for this at first, and imagined it would be pretty involved to figure one out. I now have a possible work-around; the longer-term fix should be to update the version of helm used by the helm testing image. More details are here: #18271

@rolandjohann rolandjohann reopened this Oct 28, 2019
@k8s-ci-robot k8s-ci-robot merged commit 9806bd7 into helm:master Oct 28, 2019
demisx added a commit to beocommedia/charts that referenced this pull request Oct 28, 2019
demisx added a commit to beocommedia/charts that referenced this pull request Oct 28, 2019
anandsinghkunwar pushed a commit to anandsinghkunwar/charts that referenced this pull request Oct 30, 2019
…deployments (helm#18162)

* introduce imagePullSecret at sparkoperator deployments

Signed-off-by: Roland Johann <roland.johann@phenetic.io>

* bump chart version

Signed-off-by: Roland Johann <roland.johann@phenetic.io>

* document additional var at README.md

Signed-off-by: Roland Johann <roland.johann@phenetic.io>
hakman pushed a commit to hakman/charts that referenced this pull request Dec 5, 2019