Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix job backoffLimit #63990

Closed
wants to merge 3 commits into from
Closed

Conversation

janetkuo
Copy link
Member

@janetkuo janetkuo commented May 17, 2018

What this PR does / why we need it:
Job backoffLimit is broken in 1.10, which is a regression. This PR does:

  1. Test job backoffLimit correctly. The test should fail and catch the backoffLimit bug.
  2. Never clean backoff in job controller. This fixes the backoffLimit bug and makes the test pass.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #62382

Special notes for your reviewer:

Release note:

Fix backoffLimit of Job

@janetkuo janetkuo self-assigned this May 17, 2018
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 17, 2018
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 17, 2018
@janetkuo janetkuo changed the title WIP: Fix job backoffLimit Fix job backoffLimit May 17, 2018
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 17, 2018
@janetkuo janetkuo added this to the v1.10 milestone May 17, 2018
@janetkuo janetkuo assigned kow3ns and unassigned janetkuo May 17, 2018
@janetkuo janetkuo added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/apps Categorizes an issue or PR as relevant to SIG Apps. status/approved-for-milestone labels May 17, 2018
Copy link
Member

@kow3ns kow3ns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@kow3ns
Copy link
Member

kow3ns commented May 17, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 17, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: janetkuo, kow3ns

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@janetkuo
Copy link
Member Author

/retest

@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 17, 2018
@janetkuo
Copy link
Member Author

Test seems to be flaky. Sometimes there's only 1 failed pod when the job allows 1 retries (should have 2 failed pods.) I wasn't able to reproduce it locally so added a temp commit for logging.

@janetkuo
Copy link
Member Author

/retest

Unrelated failures

@janetkuo
Copy link
Member Author

janetkuo commented May 18, 2018

From the test of a failing Job with backoffLimit = 1 (retry once), I saw two different kinds of failures:

  1. Job never retries (log): the code uses NumRequeues to see the number of retries. However, the job can failed to be synced just because of an update conflict (the object has been modified; please apply your changes to the latest version and try again). That shouldn't be seen as a retry.
    // retrieve the previous number of retry
    previousRetry := jm.queue.NumRequeues(key)
  2. Job retries more than needed (log): Job controller creates more pods than needed because it doesn't realize itself has failed and exceeded backoffLimit earlier. This may be due to stale cache or that the job status failed to be updated earlier.

I think Job backoffLimit is still broken with the fix.

@bryanlarsen
Copy link

Any status on this PR? We've got jobs with activeDeadlineSeconds set to 86400 and backoffLimit set to 3. If they fail immediately in 1.10.X, they can spawn enough failed jobs to take down the cluster. A partial fix may be better than no fix.

@dims
Copy link
Member

dims commented Jun 3, 2018

/milestone v1.11

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.10, v1.11 Jun 3, 2018
@jberkus
Copy link

jberkus commented Jun 4, 2018

/retest

@jberkus
Copy link

jberkus commented Jun 4, 2018

Raising priority so that this stays in 1.11.

/priority critical-urgent
/remove-priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jun 4, 2018
@k8s-github-robot
Copy link

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@janetkuo @kow3ns

Pull Request Labels
  • sig/apps: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@dims
Copy link
Member

dims commented Jun 4, 2018

/test pull-kubernetes-e2e-gce

@dims
Copy link
Member

dims commented Jun 4, 2018

/test pull-kubernetes-e2e-kops-aws

1 similar comment
@dims
Copy link
Member

dims commented Jun 4, 2018

/test pull-kubernetes-e2e-kops-aws

@dims
Copy link
Member

dims commented Jun 4, 2018

/hold

holding in favor of #63650

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 4, 2018
@soltysh
Copy link
Contributor

soltysh commented Jun 5, 2018

/close
since #63650 is updated

Workloads automation moved this from In Progress to Done Jun 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
Workloads
  
Done
Development

Successfully merging this pull request may close these issues.

Backoff Limit for Job does not work on Kubernetes 1.10.0
8 participants