Skip to content
This repository has been archived by the owner on Aug 25, 2021. It is now read-only.

tls-init-cleanup can run if pre-install fails #419

Merged
merged 2 commits into from
Apr 9, 2020
Merged

tls-init-cleanup can run if pre-install fails #419

merged 2 commits into from
Apr 9, 2020

Conversation

lkysow
Copy link
Member

@lkysow lkysow commented Apr 8, 2020

If the pre-install hooks failed, then no non-hook resources get created.
This means that the tls-init-cleanup service account doesn't get
created. Then if the user runs helm delete, the tls-init-cleanup job
tries to run but never starts because it doesn't have its service
account. helm delete then hangs forever.

The fix is to ensure that the resources needed by the cleanup job get
created even if the pre-install hook failed. To do this, we mark them as
part of the pre-delete hook and Helm will ensure they get created.

Another snag is that Helm creates the Job before the service account. To
fix this, we set the weight of the job to 1 so that it is created after
the service account. This is a known helm issue: helm/helm#7447

Also adds before-hook-creation to the deletion policy for the tls-init-job so that if a user does run a helm delete, they can run a helm install and the job will get deleted instead of helm complaining that the job already exists.

Fixes #418

Reproduction

To reproduce, you need to cause a failure in the pre-install hooks. An easy way to do this is to set the tls-init-job's service account to one that doesn't exist:

# tls-init-job.yaml
      serviceAccountName: not-exist

Run helm install with a values file that has tls.enabled=true. The helm install will hang forever and eventually time out. Ctrl-C it so you don't have to wait.

Then run helm delete. You'll see the tls-init-cleanup job is created but it never completes. If you describe the job it'll say its service account doesn't exist. The helm delete will hang forever. If you Ctrl-C it and then try to run helm install you'll get an error that the release already exists. You have to manually delete the helm secret to fix the issue (or manually create the cleanup job's service account).

Now check out this branch and perform the same steps (you'll need to make sure you delete the init job's service account and job first). The helm delete should succeed.

If the pre-install hooks failed, then no non-hook resources get created.
This means that the tls-init-cleanup service account doesn't get
created. Then if the user runs helm delete, the tls-init-cleanup job
tries to run but never starts because it doesn't have its service
account. helm delete then hangs forever.

The fix is to ensure that the resources needed by the cleanup job get
created even if the pre-install hook failed. To do this, we mark them as
part of the pre-delete hook and Helm will ensure they get created.

Another snag is that Helm creates the Job before the service account. To
fix this, we set the weight of the job to 1 so that it is created after
the service account. This is a known helm issue: helm/helm#7447
@lkysow lkysow requested a review from a team April 8, 2020 18:41
@lkysow lkysow added area/chart-only Related to changes that simply require yaml chart changes, e.g. exposing a new field bug Something isn't working labels Apr 8, 2020
@lkysow
Copy link
Member Author

lkysow commented Apr 8, 2020

I think I need to add before-hook-creation to the tls-init-job so it can be re-run if the first install fails.

Without this setting, if the tls init job failed, users wouldn't be
able to re-run helm install without manually deleting the job.
@lkysow
Copy link
Member Author

lkysow commented Apr 8, 2020

I think I need to add before-hook-creation to the tls-init-job so it can be re-run if the first install fails.

Updated to add this.

@ishustava ishustava self-assigned this Apr 8, 2020
@ishustava
Copy link
Contributor

I think I need to add before-hook-creation to the tls-init-job so it can be re-run if the first install fails.

@lkysow Could you give a bit more details? Are you referring to the helm install? If helm install fails due to tls-init, then the job shouldn't be deleted because Helm will delete it only if the job suceeds.

@lkysow
Copy link
Member Author

lkysow commented Apr 9, 2020

It's easiest to show via reproduction:

  1. Check out my branch but remove the before-hook-creation
  2. run helm install with the tls-init-job set to an invalid service account to replicate a failed pre-install.
  3. Ctrl-C the install since it will hang forever
  4. Run helm delete which will succeed
  5. Run kubectl get jobs you'll see that tls-init is still there.
  6. Run helm install again. It will fail because it says that there's already a tls-init job.

The solution is to add before-hook-creation so that in Step 6, the install succeeds by deleting the old tls-init job that was hanging around.

Copy link
Contributor

@ishustava ishustava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lkysow this is awesome! Thanks so much for this fix. I originally didn't make serviceaccount, clusterrolebinding, and clusterrole for the tls-init-cleanup job a hook because then it would get deleted immediately after if I'm using hook-succeeded. But I didn't think of using weights to solve it!

@lkysow lkysow merged commit c80ed63 into master Apr 9, 2020
@lkysow lkysow deleted the pre-install branch April 9, 2020 17:46
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/chart-only Related to changes that simply require yaml chart changes, e.g. exposing a new field bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Helm delete hangs if tls-init-cleanup-serviceaccount was never created
2 participants