
Helm should delete job after it is successfully finished #1769

Closed
dzavalkinolx opened this issue Dec 29, 2016 · 43 comments

@dzavalkinolx

I have a job bound to post-install,post-upgrade,post-rollback:

    "helm.sh/hook": post-install,post-upgrade,post-rollback

When I upgrade the chart I get Error: UPGRADE FAILED: jobs.batch "opining-quetzal-dns-upsert" already exists.
kubectl get jobs returns

opining-quetzal-dns-upsert   1         1            12m

So how are we supposed to use jobs as hooks if they are not deleted after finishing successfully?
There is no way to upgrade a chart that has such a job.

@thomastaylor312 (Contributor)

I would love to see this, and I will try to get around to submitting a PR for it if I have the time. In the meantime, I tack this onto the end of the job name as a workaround:

metadata:
  name: {{ template "fullname" . }}-{{ randAlphaNum 5 | lower }}

@dzavalkinolx (Author)

I'm currently using the workaround from @thomastaylor312, and it is a very bad way to do it.
There are two issues:

  1. If the job doesn't finish successfully (for whatever reason), it is rescheduled again and again. I think we need something like a maxRetry parameter here.
  2. If the chart is deleted, the job is not deleted by helm, so it continues to be scheduled and fail indefinitely.

As a result I've just run into a performance issue with a storm of kill container / start container events.

@thomastaylor312 (Contributor)

@dzavalkinolx: We should probably bring this up in the Kubernetes issue tracker. This isn't so much a problem with Helm as with how Kubernetes Jobs work. Have you also tried setting restartPolicy: OnFailure so that it just restarts the container instead of creating a new pod?
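
As a sketch of that suggestion (the Job name and image are placeholders, and backoffLimit arrived in later Kubernetes releases): restartPolicy: OnFailure makes the kubelet restart the failed container in place instead of the Job controller replacing the pod:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: dns-upsert                       # placeholder name
      annotations:
        "helm.sh/hook": post-install,post-upgrade
    spec:
      backoffLimit: 3                        # cap retries so a broken hook cannot loop forever
      template:
        spec:
          restartPolicy: OnFailure           # restart the container in place
          containers:
            - name: dns-upsert
              image: example/dns-upsert:1.0  # placeholder image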

@technosophos (Member)

This is a tricky one. Helm doesn't implicitly manage the lifecycles of things like this. In particular, it never deletes a user-created resource without receiving an explicit request from a user.

Now, we recently introduced an annotation called resource-policy that we could use for something like this. Currently, the only defined resource policy is keep, which is used to tell Tiller not to delete that resource on a normal deletion request. I suppose we could implement another policy with something like delete-on-completion that deleted a Pod or Job on completion.

Since Tiller does not actively monitor resources once they are deployed, I'm not sure this would be a terribly powerful annotation, but it could work on hooks because we do watch hooks for lifecycle events.
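
For illustration only, the proposed policy might look like the annotation below; delete-on-completion is hypothetical and never shipped, and the feature that eventually landed is the "helm.sh/hook-delete-policy" annotation shown later in this thread:

    metadata:
      annotations:
        # hypothetical policy sketched in the comment above; the shipped
        # equivalent is "helm.sh/hook-delete-policy": hook-succeeded
        "helm.sh/resource-policy": delete-on-completion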

@technosophos added this to the 2.3.0-Triage milestone Jan 3, 2017
@longseespace (Contributor)

@dzavalkinolx Have you tried activeDeadlineSeconds?
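
For reference, activeDeadlineSeconds sits on the Job spec and terminates the Job (and its pods) once the deadline passes, successful or not; the value here is illustrative:

    apiVersion: batch/v1
    kind: Job
    spec:
      activeDeadlineSeconds: 300   # terminate the Job and its pods after 5 minutes
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: example/task:1.0   # placeholder image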

@thomastaylor312 (Contributor)

@longseespace I tried that too, but it just kills the job after the amount of time and spins up a new one

@javiercr (Contributor) commented Feb 3, 2017

I ran into this problem today too. I found out about Helm hooks recently and thought they would be perfect for implementing pre-upgrade database migrations (rake db:migrate, for those of you familiar with Rails). But then I found that Helm does not clean up completed jobs, so this doesn't work in a CI environment.

I'm probably missing something, but what's the point of running a job in a pre-upgrade / post-upgrade hook if, once the job has run on the first helm upgrade, every subsequent upgrade fails?

@technosophos (Member)

That's why we recommend appending a random string to the job name if you know for sure you are going to re-run it.

@javiercr (Contributor)

I see. I'm not sure I like the approach of leaving behind a new successful job for every deployment. For now I've added a step to our deployment script that runs kubectl delete job db-migrate-job --ignore-not-found before our helm upgrade.

@johnw188 commented Apr 5, 2017

It feels like Helm should be managing the full lifecycle of its hooks, as they're documented as the approach to take for executing lifecycle events such as migrations. Someone without a strong understanding of Kubernetes could end up with hundreds of useless jobs.

I almost feel like the ideal solution here would be to have Tiller delete an old job when it hits a name conflict with a hook job it attempts to create. You could verify that the job belongs to Helm by ensuring the correct hook annotation is present. You could also put this behavior behind a command-line flag, such as --overwrite-hooks, with a better error message for users:

Error: UPGRADE FAILED: jobs.batch "opining-quetzal-dns-upsert" already exists
Rerun your upgrade with --overwrite-hooks to automatically replace existing hooks

@docmatrix

I am having exactly the same challenges with a Python/Django app. Does anyone have a mechanism such that Helm aborts an upgrade if the pre-upgrade job fails?

@thomastaylor312 added the feature and help wanted labels May 18, 2017
@thomastaylor312 (Contributor) commented May 18, 2017

I may take this once I get some other work done for 2.5. Assigning it to myself for now; if someone else wants to take it before I get to it, let me know.

@DoctorZK (Contributor)

@thomastaylor312 Have you finished this feature yet? If not, I would like to take this work.

@thomastaylor312 (Contributor)

@DoctorZK Feel free to take it. Thank you for offering to do it!

@thomastaylor312 removed the help wanted label Jun 15, 2017
@gianrubio

What about a simple annotation, helm.sh/resource-policy: delete-job-after-run? When the job runs successfully, Helm deletes it.

@DoctorZK (Contributor) commented Jun 21, 2017

Good suggestion. I have thought of two approaches to solve this problem:

  1. Add a simple annotation in the hook templates, which is the easiest to implement. However, it cannot handle this case: Helm fails during the pre-install/pre-upgrade process, but users try to install/upgrade the release with the same chart again, which causes a resource-object name conflict in Kubernetes. Therefore, with this approach we would need another annotation, such as helm.sh/resource-policy: delete-job-if-job-fails.
  2. Add flags to the install/upgrade/rollback/delete commands (e.g., upgrade $release_name --include-hooks), which solves the name conflict problem but would also remove hooks that users do not intend to delete, such as configmaps and secrets that are designed to be reused across versions of the same release.

I prefer the first one, which controls hooks at a finer granularity.

@gianrubio

@DoctorZK As you suggest, I would like to have another annotation, helm.sh/resource-policy: delete-job-if-succeed. This is important when you're deleting a release and have a cleanup job; for now it's not possible to delete that job without running another job.

Are you willing to work on this? If not, I can take care of it.

@DoctorZK (Contributor)

Thanks for your help. I have finished the coding and it is now under test. I will submit the pull request as soon as possible.

@libesz (Contributor) commented Jun 27, 2017

For those who want a workaround until a final solution is implemented: instead of appending random characters to the Job object's name, the Job can delete itself from the API server as its last task before exiting (see the sketch after this list). It is equally "elegant" but does not leak Job objects.
Sad facts:

  • You need to pass API server credentials into the Job to do this.
  • You lose the Job logs, as those are deleted as well.
  • It works only for objects that contain running code (Jobs, Pods..., but not PVCs, third-party resources, etc.).
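
A rough sketch of that workaround, assuming a ServiceAccount bound to a Role that allows deleting Jobs; the names, image, and the run-migrations command are placeholders:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: self-deleting-job                # placeholder
      annotations:
        "helm.sh/hook": post-install,post-upgrade
    spec:
      template:
        spec:
          serviceAccountName: job-deleter    # must be allowed to delete Jobs in this namespace
          restartPolicy: Never
          containers:
            - name: main
              image: bitnami/kubectl         # any image that ships kubectl
              command: ["/bin/sh", "-c"]
              args:
                # do the real work, then remove this Job from the API server
                - run-migrations && kubectl delete job self-deleting-job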

@DonMartin76

@thomastaylor312 Did this land in 2.7?

@bacongobbler (Member)

Yes, everything currently in master landed in 2.7.

@macropin commented Nov 8, 2017

Just to be clear... if I add "helm.sh/hook-delete-policy": hook-succeeded to my job, then every deployment should recreate and re-run that job? Because that's not what I'm seeing here.

@thomastaylor312 (Contributor)

@macropin Is your hook defined as post-install,post-upgrade (or pre-, as your case may be)? If it only has post-install, it will only run the first time.

@macropin

@thomastaylor312 It's defined as post-install,post-upgrade, so you're saying it should be working?

@thomastaylor312 (Contributor)

It will create a new object (generally a Pod or Job) each time you release. If you use the feature mentioned here, it will delete the job when it is done running. If for some reason a hook isn't deploying, that would be a separate issue.

@macropin

The job has the following annotations:

      annotations:
        "helm.sh/hook": post-install,post-upgrade
        "helm.sh/hook-weight": "5"
        "helm.sh/hook-delete-policy": hook-succeeded,hook-failed

The job only runs once on the first install, and never again on subsequent upgrades. Running Helm v2.7.0. Should I create a separate issue for this?

@thomastaylor312 (Contributor)

@macropin Yes. Could you please create another issue with details about your cluster and, if possible, an example chart that reproduces the issue?

omar-nahhas added a commit to omar-nahhas/charts that referenced this issue Jan 18, 2018
Without the annotation, helm upgrade fails.
helm/helm#1769
omar-nahhas added a commit to omar-nahhas/charts that referenced this issue Jan 18, 2018
Without those annotations, helm upgrade fails because of:
helm/helm#1769
k8s-ci-robot pushed a commit to helm/charts that referenced this issue Jan 27, 2018
* Update job.yaml

Without those annotations, helm upgrade fails because of:
helm/helm#1769

* Increasing version number
@sohel2020 commented Nov 13, 2019

The job has the following annotations:

      annotations:
        "helm.sh/hook": post-install,post-upgrade
        "helm.sh/hook-weight": "5"
        "helm.sh/hook-delete-policy": hook-succeeded,hook-failed

The job only runs once on the first install, and never again on subsequent upgrades. Running Helm v2.7.0. Should I create a separate issue for this?

@macropin How did you solve it? I'm facing a similar issue: the job is never created on subsequent upgrades. I'm using the same annotations as yours.

helm version: v2.14.3

@guice commented Dec 13, 2019

@sohel2020 and @macropin - Same boat here, on Helm v3. The job is never re-run on subsequent upgrades.

@thecrazzymouse

What is the solution for rerunning jobs on helm v3?

@renepardon

Same problem with Helm version: version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.13.8"}

Jobs are neither deleted nor run on subsequent upgrade commands.

@jabdoa2 commented May 12, 2020

We regularly hit this one too in Helm 3.1.

almerico added a commit to almerico/alfresco-identity-service that referenced this issue Aug 3, 2020
vol-clean-{{ template "alfresco-identity.fullname" . }} updated to name: vol-clean-{{ template "alfresco-identity.fullname" . }}-{{ randAlphaNum 5 | lower }}
based on helm/helm#1769; our issue is https://issues.alfresco.com/jira/browse/AAE-3212
endrec pushed a commit to Rungway/charts-we-use that referenced this issue Aug 14, 2020
* Update job.yaml

Without those annotations, helm upgrade fails because of:
helm/helm#1769

* Increasing version number
torstenwalter pushed a commit to grafana/helm-charts that referenced this issue Sep 4, 2020
* Update job.yaml

Without those annotations, helm upgrade fails because of:
helm/helm#1769

* Increasing version number
@paologallinaharbur

In the future I believe we will also be able to rely on the Job TTL mechanism.
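
That refers to Kubernetes' TTL-after-finished mechanism: setting ttlSecondsAfterFinished on a Job (stable since Kubernetes 1.23) makes the cluster garbage-collect the Job and its pods after it completes:

    apiVersion: batch/v1
    kind: Job
    spec:
      ttlSecondsAfterFinished: 300   # delete the Job and its pods 5 minutes after it finishes
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: example/task:1.0   # placeholder image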

@schollii

@paologallinaharbur Only partially: for example, if the job TTL is 5 minutes and you push a new commit within that window while the previous commit's job failed and is still there, you will hit the same issue.

The two options that have worked for me:

  1. before a helm upgrade, run kubectl delete job on the hook job;
  2. use the "helm.sh/hook-delete-policy": hook-succeeded,hook-failed policy, which is a better approach than item 1, BUT it drops the logs of a failed job, which can be detrimental for troubleshooting.

The best option would be a hook policy that is applied before a hook is run, e.g. something like

annotations:
  "helm.sh/hook-delete-policy": previous-hook-failed

MichaelMorrisEst pushed a commit to Nordix/helm that referenced this issue Nov 17, 2023
@joelmathew003 commented Jan 18, 2024

What is the solution for rerunning jobs on helm v3?

@thecrazzymouse Were you able to find a solution?
