
Fix job's backoff limit for restart policy OnFailure #58972

Merged
merged 1 commit into kubernetes:master on Apr 19, 2018

Conversation

soltysh
Contributor

@soltysh soltysh commented Jan 29, 2018

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #54870

Release note:

NONE

/assign janetkuo

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 29, 2018
	jobFailed = true
	failureReason = "BackoffLimitExceeded"
	failureMessage = "Job has reach the specified backoff limit"
} else if jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit) {
Member

Combine the first two if-cases

Contributor Author

I intentionally put them separately to avoid having a monstrous if condition, but since you ask ;-)

Contributor Author

And I don't like to repeat myself, so I guess yeah.

if job.Spec.Template.Spec.RestartPolicy == v1.RestartPolicyOnFailure && pastBackoffLimit(&job, pods) {
	jobFailed = true
	failureReason = "BackoffLimitExceeded"
	failureMessage = "Job has reach the specified backoff limit"
Member

typo: has reached

Contributor Author

Thx, good catch.

@soltysh
Contributor Author

soltysh commented Jan 30, 2018

Updated, ptal.

// check if the number of failed jobs increased since the last syncJob
if jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit) {
if (job.Spec.Template.Spec.RestartPolicy == v1.RestartPolicyOnFailure && pastBackoffLimit(&job, pods)) ||
	(jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit)) {
Member

> I intentionally put them separately to avoid having a monstrous if condition, but since you ask ;-)

This can be improved by creating one bool value for each of these two conditions, or by turning them into a function that returns a bool.
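A minimal sketch of that suggestion, using the variable names already visible in the snippets above; the bool names here are illustrative, and the PR ultimately introduces exceedsBackoffLimit, as shown further down:

```go
// Name each condition so the final check stays readable.
exceedsBackoffLimit := jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit)
pastOnFailureLimit := job.Spec.Template.Spec.RestartPolicy == v1.RestartPolicyOnFailure &&
	pastBackoffLimit(&job, pods)

if exceedsBackoffLimit || pastOnFailureLimit {
	jobFailed = true
	failureReason = "BackoffLimitExceeded"
	failureMessage = "Job has reached the specified backoff limit"
}
```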

}
for j := range po.Status.ContainerStatuses {
	stat := po.Status.ContainerStatuses[j]
	result += stat.RestartCount
Member

From the ContainerStatus spec, this number is calculated from the number of dead containers and will be capped at 5 by GC. If the BackoffLimit is set to more than 5 * (# of containers), the job may never pass the backoff limit.

	// The number of times the container has been restarted, currently based on
	// the number of dead containers that have not yet been removed.
	// Note that this is calculated from dead containers. But those containers are subject to
	// garbage collection. This value will get capped at 5 by GC.
	RestartCount int32 

Contributor Author

Agree, but I'm not sure I'd want to address this with some hacky approach in the controller. This is something a user should be aware of, more or less.
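To put illustrative numbers on the reviewer's point (these figures are not from the PR): if the pod template has two containers and GC caps each container's RestartCount at 5, the summed restart count the controller can ever observe is 10, so a backoffLimit of, say, 12 would never be reached by this check.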

if po.Status.Phase != v1.PodRunning {
	continue
}
for j := range po.Status.ContainerStatuses {
Member

InitContainerStatuses should be taken into consideration too.
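A hedged sketch of what that addition could look like, mirroring the ContainerStatuses loop quoted above:

```go
for j := range po.Status.ContainerStatuses {
	stat := po.Status.ContainerStatuses[j]
	result += stat.RestartCount
}
// Count init container restarts as well, per the suggestion above.
for j := range po.Status.InitContainerStatuses {
	stat := po.Status.InitContainerStatuses[j]
	result += stat.RestartCount
}
```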

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 16, 2018
@soltysh soltysh added this to the v1.11 milestone Mar 19, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 19, 2018
@soltysh soltysh added area/batch needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 19, 2018
@soltysh
Contributor Author

soltysh commented Mar 20, 2018

@janetkuo updated, ptal

@soltysh soltysh added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Mar 20, 2018
@kow3ns kow3ns added this to In Progress in Workloads Mar 29, 2018
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@janetkuo @soltysh

Pull Request Labels
  • sig/apps: Pull Request will be escalated to these SIGs if needed.
  • priority/important-longterm: Escalate to the pull request owners; move out of the milestone after 1 attempt.
  • kind/bug: Fixes a bug discovered during the current release.

// pastBackoffLimitOnFailure checks if container restartCounts sum exceeds BackoffLimit
// this method applies only to pods with restartPolicy == OnFailure
func pastBackoffLimitOnFailure(job *batch.Job, pods []*v1.Pod) bool {
	if job.Spec.Template.Spec.RestartPolicy != v1.RestartPolicyOnFailure {
Member

Should we check the pods' restart policy?

Contributor Author

I don't think that's needed. I can't think of a reasonable way they would differ (other than something hacky) 😉
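Pieced together from the snippets quoted in this thread, the helper roughly takes the following shape. This is a sketch, not the merged code verbatim; in particular, the final comparison against BackoffLimit is an assumption, since that line is not quoted in the conversation.

```go
// pastBackoffLimitOnFailure checks if the sum of container restart counts
// exceeds the job's BackoffLimit; it applies only when the pod template's
// restartPolicy is OnFailure.
func pastBackoffLimitOnFailure(job *batch.Job, pods []*v1.Pod) bool {
	if job.Spec.Template.Spec.RestartPolicy != v1.RestartPolicyOnFailure {
		return false
	}
	result := int32(0)
	for i := range pods {
		po := pods[i]
		if po.Status.Phase != v1.PodRunning {
			continue
		}
		for j := range po.Status.ContainerStatuses {
			result += po.Status.ContainerStatuses[j].RestartCount
		}
		for j := range po.Status.InitContainerStatuses {
			result += po.Status.InitContainerStatuses[j].RestartCount
		}
	}
	// Assumed comparison: treat reaching the limit as exceeding it.
	return result >= *job.Spec.BackoffLimit
}
```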


// check if the number of failed jobs increased since the last syncJob
if jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit) {
if exceedsBackoffLimit || pastBackoffLimitOnFailure(&job, pods) {
Member

Shouldn't it be

if exceedsBackoffLimit || (jobHaveNewFailure && pastBackoffLimitOnFailure(&job, pods)) 

?

Because you only check the pod restarts/job retries when the job fails?

If so, please add a test for this case too.

Contributor Author

No, the current check is needed. jobHaveNewFailure verifies actual pod failures, which won't happen with the OnFailure restart policy (unless the kubelet kills the pod), since crashing containers are restarted in place rather than failing the pod. So we always need to check these numbers as is.

Contributor Author

I think the test case I've added in this PR nicely covers it.

@janetkuo
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: janetkuo, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@janetkuo
Member

/retest

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 61962, 58972, 62509, 62606). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 139309f into kubernetes:master Apr 19, 2018
Workloads automation moved this from In Progress to Done Apr 19, 2018
@soltysh soltysh deleted the issue54870 branch April 20, 2018 14:26
k8s-github-robot pushed a commit that referenced this pull request Jun 11, 2018
#63650-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #58972: Fix job's backoff limit for restart policy OnFailure #63650: Never clean backoff in job controller

Cherry pick of #58972 #63650 on release-1.10.

#58972: Fix job's backoff limit for restart policy OnFailure
#63650: Never clean backoff in job controller

Fixes #62382.

**Release Note:**
```release-note
Fix regression in `v1.JobSpec.backoffLimit` that caused failed Jobs to be restarted indefinitely.
```
Labels
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/batch
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/bug Categorizes issue or PR as related to a bug.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • release-note-none Denotes a PR that doesn't merit a release note.
  • sig/apps Categorizes an issue or PR as relevant to SIG Apps.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
  • Workloads: Done
Development

Successfully merging this pull request may close these issues.

Job backoffLimit does not cap pod restarts when restartPolicy: OnFailure
5 participants