Fix job's backoff limit for restart policy OnFailure #58972
Conversation
pkg/controller/job/job_controller.go
Outdated
	jobFailed = true
	failureReason = "BackoffLimitExceeded"
	failureMessage = "Job has reach the specified backoff limit"
} else if jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit) {
Combine the first two if-cases
I intentionally put them separately to avoid having a monstrous if condition, but since you ask ;-)
And I don't like to repeat myself, so I guess yeah.
pkg/controller/job/job_controller.go
Outdated
if job.Spec.Template.Spec.RestartPolicy == v1.RestartPolicyOnFailure && pastBackoffLimit(&job, pods) {
	jobFailed = true
	failureReason = "BackoffLimitExceeded"
	failureMessage = "Job has reach the specified backoff limit"
typo: has reached
Thx, good catch.
Updated, ptal.
pkg/controller/job/job_controller.go
Outdated
// check if the number of failed jobs increased since the last syncJob
if jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit) {
if (job.Spec.Template.Spec.RestartPolicy == v1.RestartPolicyOnFailure && pastBackoffLimit(&job, pods)) ||
	(jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit)) {
I intentionally put them separately to avoid having monstrous if condition, but since you ask ;-)
This can be improved by creating one `bool` value for each of these two conditions, or by turning them into a function that returns a `bool`.
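A minimal sketch of that refactor, as a fragment of the sync logic (the variable name `exceedsBackoffLimit` matches what shows up later in this thread; the surrounding variables are assumed from the snippets above):

```go
// Name the "retries exhausted" half of the condition so the if statement stays readable.
exceedsBackoffLimit := jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit)

if exceedsBackoffLimit || pastBackoffLimitOnFailure(&job, pods) {
	jobFailed = true
	failureReason = "BackoffLimitExceeded"
	failureMessage = "Job has reached the specified backoff limit"
}
```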
pkg/controller/job/job_controller.go
Outdated
}
for j := range po.Status.ContainerStatuses {
	stat := po.Status.ContainerStatuses[j]
	result += stat.RestartCount
From the `ContainerStatus` spec, this number is calculated from the number of dead containers and will be capped at 5 by GC. If the BackoffLimit is set to > 5 * (# of containers), the job may never pass the backoff limit.
// The number of times the container has been restarted, currently based on
// the number of dead containers that have not yet been removed.
// Note that this is calculated from dead containers. But those containers are subject to
// garbage collection. This value will get capped at 5 by GC.
RestartCount int32
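To put assumed numbers on that: with two containers in the pod template and each `RestartCount` capped at 5 by GC, the controller can observe at most 2 × 5 = 10 restarts in total, so a backoffLimit of 11 or more could never be exceeded through this check alone.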
Agree, but I'm not sure I'd want to address this with some hacky approach in the controller. This is a thing a user should be aware of, more or less.
pkg/controller/job/job_controller.go
Outdated
if po.Status.Phase != v1.PodRunning {
	continue
}
for j := range po.Status.ContainerStatuses {
`InitContainerStatuses` should be taken into consideration too.
@janetkuo updated, ptal
[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process
// pastBackoffLimitOnFailure checks if container restartCounts sum exceeds BackoffLimit
// this method applies only to pods with restartPolicy == OnFailure
func pastBackoffLimitOnFailure(job *batch.Job, pods []*v1.Pod) bool {
	if job.Spec.Template.Spec.RestartPolicy != v1.RestartPolicyOnFailure {
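For context, a rough sketch of how the rest of such a helper might look, summing `RestartCount` over both init and regular container statuses as discussed above. This is a reconstruction from the snippets in this thread, not necessarily the exact code merged in the PR; the import paths are the ones this controller file already uses:

```go
package job

import (
	batch "k8s.io/api/batch/v1"
	"k8s.io/api/core/v1"
)

// pastBackoffLimitOnFailure (sketch) reports whether the sum of observed
// container restarts for the Job's pods has reached spec.backoffLimit.
// It only applies when the pod template uses restartPolicy: OnFailure,
// and it assumes BackoffLimit has already been defaulted (non-nil).
func pastBackoffLimitOnFailure(job *batch.Job, pods []*v1.Pod) bool {
	if job.Spec.Template.Spec.RestartPolicy != v1.RestartPolicyOnFailure {
		return false
	}
	result := int32(0)
	for i := range pods {
		po := pods[i]
		if po.Status.Phase != v1.PodRunning {
			continue
		}
		// Per the review above, restarts of init containers count too.
		for j := range po.Status.InitContainerStatuses {
			result += po.Status.InitContainerStatuses[j].RestartCount
		}
		for j := range po.Status.ContainerStatuses {
			result += po.Status.ContainerStatuses[j].RestartCount
		}
	}
	return result >= *job.Spec.BackoffLimit
}
```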
Should we check the pods' restart policy?
I don't think that's needed. I can't think of a reasonable way they would differ (other than a hacky one) 😉
// check if the number of failed jobs increased since the last syncJob
if jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit) {
if exceedsBackoffLimit || pastBackoffLimitOnFailure(&job, pods) {
Shouldn't it be `if exceedsBackoffLimit || (jobHaveNewFailure && pastBackoffLimitOnFailure(&job, pods))`? Because you only check the pod restarts/job retries when the job fails? If so, please add a test for this case too.
No, the current check is needed. `jobHaveNewFailure` verifies actual pod failures, which for the `OnFailure` restart policy won't happen (unless the kubelet kills the pod). So we always need to check these numbers as is.
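(For readers of this thread: `jobHaveNewFailure` itself is not shown in the snippets above; presumably it is set when the number of failed pods counted in the current sync exceeds the count recorded in the Job's status after the previous sync, along the lines of the assumed fragment below.)

```go
// Assumption, not code shown in this PR's diff: "failed" is the number of
// failed pods counted during this sync, and job.Status.Failed is the value
// recorded after the previous sync.
jobHaveNewFailure := failed > job.Status.Failed
```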
I think the test case I've added in this PR nicely covers it.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: janetkuo, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest Review the full test history for this PR. Silence the bot with an
/retest
Automatic merge from submit-queue (batch tested with PRs 61962, 58972, 62509, 62606). If you want to cherry-pick this change to another branch, please follow the instructions here.
#63650-upstream-release-1.10 Automatic merge from submit-queue. Automated cherry pick of #58972: Fix job's backoff limit for restart policy OnFailure, #63650: Never clean backoff in job controller.

Cherry pick of #58972 #63650 on release-1.10:

#58972: Fix job's backoff limit for restart policy OnFailure
#63650: Never clean backoff in job controller

Fixes #62382.

**Release Note:**

```release-note
Fix regression in `v1.JobSpec.backoffLimit` that caused failed Jobs to be restarted indefinitely.
```
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #54870
Release note:
/assign janetkuo