Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backoff only when failed pod shows up #60985

Merged
merged 1 commit into from Mar 16, 2018

Conversation

soltysh
Copy link
Contributor

@soltysh soltysh commented Mar 9, 2018

What this PR does / why we need it:
Upon introducing the backoff policy we started to delay sync runs for the job when it failed several times before. This leads to failed jobs not reporting status right away in cases that are not related to failed pods, eg. a successful run. This PR ensures the backoff is applied only when updatePod receives a failed pod.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #59918 #59527

/assign @janetkuo @kow3ns

Release note:

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 9, 2018
@soltysh
Copy link
Contributor Author

soltysh commented Mar 9, 2018

/cc @clamoriniere1A @csrwng

@k8s-ci-robot
Copy link
Contributor

@soltysh: GitHub didn't allow me to request PR reviews from the following users: clamoriniere1A.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @clamoriniere1A @csrwng

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested a review from csrwng March 9, 2018 15:44
@soltysh
Copy link
Contributor Author

soltysh commented Mar 9, 2018

@janetkuo @kow3ns not sure if we want to have this for 1.10, I'll mark it as such, but feel free to drop it from there.

@soltysh soltysh added kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. status/approved-for-milestone labels Mar 9, 2018
@soltysh soltysh added this to the v1.10 milestone Mar 9, 2018
@soltysh
Copy link
Contributor Author

soltysh commented Mar 9, 2018

Fixed the bazel update.

backoff := getBackoff(jm.queue, key)
var backoff time.Duration
if immediate {
backoff = time.Duration(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that forcing the backoff value to 0 will only do half of the job because it didn't clear the counter of the previous failures in the rateLimitedQueue.

what do you think about the following code?

if immediate {
  jm.queue.Forget(key)
}
backoff := getBackoff(jm.queue, key)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm

@jdumars
Copy link
Member

jdumars commented Mar 12, 2018

If this gets an LGTM ASAP we can get this into 1.10. Also this needs a SIG and priority assigned.

@k8s-github-robot k8s-github-robot removed this from the v1.10 milestone Mar 12, 2018
@soltysh soltysh added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Mar 12, 2018
@soltysh
Copy link
Contributor Author

soltysh commented Mar 12, 2018

Comments addressed.
@clamoriniere1A @janetkuo @kow3ns ptal

@clamoriniere1A
Copy link
Contributor

/lgtm

@k8s-ci-robot
Copy link
Contributor

@clamoriniere1A: changing LGTM is restricted to assignees, and only kubernetes org members may be assigned issues.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

immediate := true
if curPod.Status.Phase == v1.PodFailed {
immediate = false
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

immediate := curPod.Status.Phase != v1.PodFailed

return
}

// Retrieves the backoff duration for this Job
if immediate {
jm.queue.Forget(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, this will cause backoff == time.Duration(0), right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this will remove the key from the queue which will result in a fresh add with a zero backoff.

newPod.Status.Phase = tc.phase
manager.updatePod(oldPod, newPod)

if queue.duration.Nanoseconds() != tc.backoff {
Copy link
Member

@janetkuo janetkuo Mar 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It's probably simpler to move * DefaultJobBackOff.Nanoseconds() here

"1st success": {0, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()},
"2nd success": {1, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()},
"1st running": {0, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()},
"2nd running": {1, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

running cases and success cases are the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they are, but I've decided to put them for clarity :-)

@janetkuo janetkuo added this to the v1.10 milestone Mar 13, 2018
@k8s-github-robot
Copy link

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@janetkuo @kow3ns @soltysh

Pull Request Labels
  • sig/apps: Pull Request will be escalated to these SIGs if needed.
  • priority/important-longterm: Escalate to the pull request owners; move out of the milestone after 1 attempt.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@jdumars
Copy link
Member

jdumars commented Mar 13, 2018

@janetkuo when satisfied, could you add the LGTM so it ships in 1.10? Thanks!

@soltysh
Copy link
Contributor Author

soltysh commented Mar 14, 2018

@janetkuo nits addressed, ptal

@jdumars
Copy link
Member

jdumars commented Mar 15, 2018

Hi @janetkuo could you PTAL so we can wrap this up for 1.10? Thanks!

@kow3ns kow3ns added this to Backlog in Workloads Mar 15, 2018
@kow3ns kow3ns moved this from Backlog to In Progress in Workloads Mar 15, 2018
@janetkuo
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 16, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clamoriniere1A, janetkuo, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 60978, 60985). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 5d67222 into kubernetes:master Mar 16, 2018
Workloads automation moved this from In Progress to Done Mar 16, 2018
@soltysh soltysh deleted the issue59918 branch March 16, 2018 16:30
k8s-github-robot pushed a commit that referenced this pull request Jun 6, 2018
Automatic merge from submit-queue (batch tested with PRs 64009, 64780, 64354, 64727, 63650). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Never clean backoff in job controller

**What this PR does / why we need it**:
In #60985 I've added a mechanism which allows immediate job status update, unfortunately that broke the backoff logic seriously. I'm sorry for that. I've changed the `immediate` mechanism so that it NEVER cleans the backoff, but for the cases when we want fast status update it uses a zero backoff. 

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #62382

**Special notes for your reviewer**:
/assign @janetkuo 

**Release note**:
```release-note
None
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Workloads
  
Done
Development

Successfully merging this pull request may close these issues.

Jobs - takes a while to update job status if job succeeds after retrying many times
7 participants