Job's backoffLimit needs to take number of failed pods into account #70251
Comments
@soltysh I'm not trying to add a hack to the backoff, but want to remove the use of
Based on the original design doc of backoffLimit https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apps/job.md#backoff-policy-and-failed-pod-limit:
I also find it vague and confusing what "number of job restarts" actually means, especially for jobs with
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
I am also experiencing this issue. Is #64787 planned for any upcoming release?
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen
Is there any plan for this issue? See #64787. The same problem occurs in k8s v1.19.
@yasongxu: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
The Job controller uses NumRequeues of the Job workqueue to calculate backoffLimit. This makes backoffLimit unpredictable, especially when a Job has parallelism larger than 1 and its restartPolicy is Never.
Take this job for example:
The job will eventually fail after it creates 64 failed pods, instead of 10 failed pods.
If restartPolicy is set to OnFailure, the total number of container restarts is taken into account. This seems more predictable. However, if we change the restartPolicy of the same Job to OnFailure, all the failed Pods will be deleted when they are restarted:
kubernetes/pkg/controller/job/job_controller.go
Line 520 in b6fd5d9
The job will fail when it creates 10 failed pods, but will end up with 0 pods.
@kubernetes/sig-apps-bugs @soltysh @kow3ns
/kind bug