
Fix job's backoff limit for restart policy OnFailure #58972

Merged
merged 1 commit into kubernetes:master on Apr 19, 2018

Conversation

soltysh
Contributor

@soltysh soltysh commented Jan 29, 2018

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #54870

Release note:

NONE

/assign janetkuo

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 29, 2018
	jobFailed = true
	failureReason = "BackoffLimitExceeded"
	failureMessage = "Job has reach the specified backoff limit"
} else if jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit) {
Member

Combine the first two if-cases

Contributor Author

I intentionally put them separately to avoid having a monstrous if condition, but since you ask ;-)

Contributor Author

And I don't like to repeat myself, so I guess yeah.

if job.Spec.Template.Spec.RestartPolicy == v1.RestartPolicyOnFailure && pastBackoffLimit(&job, pods) {
	jobFailed = true
	failureReason = "BackoffLimitExceeded"
	failureMessage = "Job has reach the specified backoff limit"
Member

typo: has reached

Contributor Author

Thx, good catch.

@soltysh
Contributor Author

soltysh commented Jan 30, 2018

Updated, ptal.

// check if the number of failed jobs increased since the last syncJob
if jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit) {
if (job.Spec.Template.Spec.RestartPolicy == v1.RestartPolicyOnFailure && pastBackoffLimit(&job, pods)) ||
	(jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit)) {
Member

> I intentionally put them separately to avoid having a monstrous if condition, but since you ask ;-)

This can be improved by creating one bool value for each of these two conditions, or by turning them into a function that returns a bool.
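A minimal sketch of that suggestion, using the variable names already visible in the snippets above; the bool names here are illustrative, and the PR ultimately introduces exceedsBackoffLimit, as shown further down:

```go
// Name each condition so the final check stays readable.
exceedsBackoffLimit := jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit)
pastOnFailureLimit := job.Spec.Template.Spec.RestartPolicy == v1.RestartPolicyOnFailure &&
	pastBackoffLimit(&job, pods)

if exceedsBackoffLimit || pastOnFailureLimit {
	jobFailed = true
	failureReason = "BackoffLimitExceeded"
	failureMessage = "Job has reached the specified backoff limit"
}
```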

}
for j := range po.Status.ContainerStatuses {
	stat := po.Status.ContainerStatuses[j]
	result += stat.RestartCount
Member

From the ContainerStatus spec, this number is calculated from the number of dead containers and will be capped at 5 by GC. If the BackoffLimit is set to more than 5 * (# of containers), the job may never pass the backoff limit.

	// The number of times the container has been restarted, currently based on
	// the number of dead containers that have not yet been removed.
	// Note that this is calculated from dead containers. But those containers are subject to
	// garbage collection. This value will get capped at 5 by GC.
	RestartCount int32 

Contributor Author

Agree, but I'm not sure I'd want to address this with some hacky approach in the controller. This is something a user should be aware of, more or less.
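To put illustrative numbers on the reviewer's point (these figures are not from the PR): if the pod template has two containers and GC caps each container's RestartCount at 5, the summed restart count the controller can ever observe is 10, so a backoffLimit of, say, 12 would never be reached by this check.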

if po.Status.Phase != v1.PodRunning {
	continue
}
for j := range po.Status.ContainerStatuses {
Member

InitContainerStatuses should be taken into consideration too.
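A hedged sketch of what that addition could look like, mirroring the ContainerStatuses loop quoted above:

```go
for j := range po.Status.ContainerStatuses {
	stat := po.Status.ContainerStatuses[j]
	result += stat.RestartCount
}
// Count init container restarts as well, per the suggestion above.
for j := range po.Status.InitContainerStatuses {
	stat := po.Status.InitContainerStatuses[j]
	result += stat.RestartCount
}
```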

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 16, 2018
@soltysh soltysh added this to the v1.11 milestone Mar 19, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 19, 2018
@soltysh soltysh added area/batch needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 19, 2018
@soltysh
Contributor Author

soltysh commented Mar 20, 2018

@janetkuo updated, ptal

@soltysh soltysh added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Mar 20, 2018
@kow3ns kow3ns added this to In Progress in Workloads Mar 29, 2018
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@janetkuo @soltysh

Pull Request Labels
  • sig/apps: Pull Request will be escalated to these SIGs if needed.
  • priority/important-longterm: Escalate to the pull request owners; move out of the milestone after 1 attempt.
  • kind/bug: Fixes a bug discovered during the current release.

// pastBackoffLimitOnFailure checks if container restartCounts sum exceeds BackoffLimit
// this method applies only to pods with restartPolicy == OnFailure
func pastBackoffLimitOnFailure(job *batch.Job, pods []*v1.Pod) bool {
	if job.Spec.Template.Spec.RestartPolicy != v1.RestartPolicyOnFailure {
Member

Should we check the pods' restart policy?

Contributor Author

I don't think that's needed. I can't think of a reasonable way they would differ (other than something hacky) 😉
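Pieced together from the snippets quoted in this thread, the helper roughly takes the following shape. This is a sketch, not the merged code verbatim; in particular, the final comparison against BackoffLimit is an assumption, since that line is not quoted in the conversation.

```go
// pastBackoffLimitOnFailure checks if the sum of container restart counts
// exceeds the job's BackoffLimit; it applies only when the pod template's
// restartPolicy is OnFailure.
func pastBackoffLimitOnFailure(job *batch.Job, pods []*v1.Pod) bool {
	if job.Spec.Template.Spec.RestartPolicy != v1.RestartPolicyOnFailure {
		return false
	}
	result := int32(0)
	for i := range pods {
		po := pods[i]
		if po.Status.Phase != v1.PodRunning {
			continue
		}
		for j := range po.Status.ContainerStatuses {
			result += po.Status.ContainerStatuses[j].RestartCount
		}
		for j := range po.Status.InitContainerStatuses {
			result += po.Status.InitContainerStatuses[j].RestartCount
		}
	}
	// Assumed comparison: treat reaching the limit as exceeding it.
	return result >= *job.Spec.BackoffLimit
}
```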


// check if the number of failed jobs increased since the last syncJob
if jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit) {
if exceedsBackoffLimit || pastBackoffLimitOnFailure(&job, pods) {
Member

Shouldn't it be

if exceedsBackoffLimit || (jobHaveNewFailure && pastBackoffLimitOnFailure(&job, pods)) 

?

Because you only check the pod restarts/job retries when the job fails?

If so, please add a test for this case too.

Contributor Author

No, the current check is needed. jobHaveNewFailure verifies actual pod failures, which won't happen with the OnFailure restart policy (unless the kubelet kills the pod), since crashing containers are restarted in place rather than failing the pod. So we always need to check these numbers as is.

Contributor Author

I think the test case I've added in this PR nicely covers it.

@janetkuo
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: janetkuo, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@janetkuo
Member

/retest

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 61962, 58972, 62509, 62606). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 139309f into kubernetes:master Apr 19, 2018
Workloads automation moved this from In Progress to Done Apr 19, 2018
@soltysh soltysh deleted the issue54870 branch April 20, 2018 14:26
k8s-github-robot pushed a commit that referenced this pull request Jun 11, 2018
#63650-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #58972: Fix job's backoff limit for restart policy OnFailure #63650: Never clean backoff in job controller

Cherry pick of #58972 #63650 on release-1.10.

#58972: Fix job's backoff limit for restart policy OnFailure
#63650: Never clean backoff in job controller

Fixes #62382.

**Release Note:**
```release-note
Fix regression in `v1.JobSpec.backoffLimit` that caused failed Jobs to be restarted indefinitely.
```
Labels
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/batch
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/bug Categorizes issue or PR as related to a bug.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • release-note-none Denotes a PR that doesn't merit a release note.
  • sig/apps Categorizes an issue or PR as relevant to SIG Apps.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
  • Workloads: Done
Development

Successfully merging this pull request may close these issues.

Job backoffLimit does not cap pod restarts when restartPolicy: OnFailure
5 participants