Backoff only when failed pod shows up #60985
Conversation
@soltysh: GitHub didn't allow me to request PR reviews from the following users: clamoriniere1A. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Fixed the bazel update.
pkg/controller/job/job_controller.go (Outdated)

```diff
-	backoff := getBackoff(jm.queue, key)
+	var backoff time.Duration
+	if immediate {
+		backoff = time.Duration(0)
```
I think forcing the backoff value to 0 only does half of the job, because it doesn't clear the counter of previous failures in the rate-limited queue.
What do you think about the following code?
```go
if immediate {
	jm.queue.Forget(key)
}
backoff := getBackoff(jm.queue, key)
```
sgtm
If this gets an LGTM ASAP, we can get this into 1.10. Also, this needs a SIG and priority assigned.

Comments addressed.

/lgtm

@clamoriniere1A: changing LGTM is restricted to assignees, and only kubernetes org members may be assigned issues. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
pkg/controller/job/job_controller.go (Outdated)

```go
immediate := true
if curPod.Status.Phase == v1.PodFailed {
	immediate = false
}
```
nit:

```go
immediate := curPod.Status.Phase != v1.PodFailed
```
```go
		return
	}

	// Retrieves the backoff duration for this Job
	if immediate {
		jm.queue.Forget(key)
```
Just to confirm, this will cause `backoff == time.Duration(0)`, right?
Yes, this will remove the key from the queue, which will result in a fresh add with a zero backoff.
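For context, here is a minimal standalone sketch (not from the PR) of the client-go workqueue rate-limiter semantics being discussed; the base/max delays and the key are illustrative. `Forget` erases the per-key failure history, so the next computed delay starts from the base again:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Exponential backoff: 10ms base delay, capped at 1s (illustrative values).
	rl := workqueue.NewItemExponentialFailureRateLimiter(10*time.Millisecond, time.Second)
	key := "default/my-job"

	// Each When call records one more failure for the key and doubles the delay.
	fmt.Println(rl.When(key), rl.NumRequeues(key)) // 10ms 1
	fmt.Println(rl.When(key), rl.NumRequeues(key)) // 20ms 2
	fmt.Println(rl.When(key), rl.NumRequeues(key)) // 40ms 3

	// Forget drops the key's failure history entirely...
	rl.Forget(key)

	// ...so the next When starts over at the base delay.
	fmt.Println(rl.When(key), rl.NumRequeues(key)) // 10ms 1
}
```

The job controller's `getBackoff` derives its delay from `queue.NumRequeues(key)` in the same spirit, which is why a forgotten key comes back with a zero backoff on the next add.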
```go
newPod.Status.Phase = tc.phase
manager.updatePod(oldPod, newPod)

if queue.duration.Nanoseconds() != tc.backoff {
```
nit: It's probably simpler to move `* DefaultJobBackOff.Nanoseconds()` here.
"1st success": {0, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()}, | ||
"2nd success": {1, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()}, | ||
"1st running": {0, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()}, | ||
"2nd running": {1, v1.PodSucceeded, int64(0) * DefaultJobBackOff.Nanoseconds()}, |
running cases and success cases are the same?
Yes, they are, but I've decided to keep them for clarity :-)
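To illustrate the nit above, a hedged sketch of how the table could carry only the multiplier, applying `DefaultJobBackOff.Nanoseconds()` once in the loop. This is a fragment, not a drop-in patch: the field names (`requeues`, `phase`, `backoff`) and the `manager`/`queue` test harness are assumed from the snippets above, not copied from the PR:

```go
testCases := map[string]struct {
	requeues int         // assumed: number of prior requeues for the job key
	phase    v1.PodPhase // phase the updated pod reports
	backoff  int64       // expected backoff, as a multiple of DefaultJobBackOff
}{
	"1st success": {0, v1.PodSucceeded, 0},
	"2nd success": {1, v1.PodSucceeded, 0},
	"1st running": {0, v1.PodSucceeded, 0},
	"2nd running": {1, v1.PodSucceeded, 0},
}

for name, tc := range testCases {
	// ... build oldPod/newPod, seed the requeue count, call manager.updatePod ...
	expected := tc.backoff * DefaultJobBackOff.Nanoseconds()
	if queue.duration.Nanoseconds() != expected {
		t.Errorf("%s: expected backoff %d ns, got %d ns", name, expected, queue.duration.Nanoseconds())
	}
}
```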
[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process
@janetkuo when satisfied, could you add the LGTM so it ships in 1.10? Thanks!

@janetkuo nits addressed, ptal

Hi @janetkuo could you PTAL so we can wrap this up for 1.10? Thanks!

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: clamoriniere1A, janetkuo, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Automatic merge from submit-queue (batch tested with PRs 60978, 60985). If you want to cherry-pick this change to another branch, please follow the instructions here.
Automatic merge from submit-queue (batch tested with PRs 64009, 64780, 64354, 64727, 63650). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Never clean backoff in job controller

**What this PR does / why we need it**: In #60985 I added a mechanism that allows immediate job status updates; unfortunately, it seriously broke the backoff logic. I'm sorry for that. I've changed the `immediate` mechanism so that it NEVER cleans the backoff; for the cases where we want a fast status update, it uses a zero backoff instead.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*: Fixes #62382

**Special notes for your reviewer**:
/assign @janetkuo

**Release note**:
```release-note
None
```
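A hedged sketch of what that follow-up describes, reusing the names from the snippets above (the merged code may differ in detail): the key's failure counter is never forgotten, and an immediate sync simply enqueues with a zero delay:

```go
// Enqueue the job either immediately or after its accumulated backoff.
// Deliberately no jm.queue.Forget(key) here: the failure history stays
// intact, so genuinely failing jobs keep backing off exponentially.
backoff := time.Duration(0)
if !immediate {
	backoff = getBackoff(jm.queue, key)
}
jm.queue.AddAfter(key, backoff)
```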
What this PR does / why we need it:
Upon introducing the backoff policy, we started delaying sync runs for a job that had failed several times before. This leads to failed jobs not reporting status right away in cases that are unrelated to failed pods, e.g. a successful run. This PR ensures the backoff is applied only when `updatePod` receives a failed pod (see the sketch after this description).

Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #59918 #59527

/assign @janetkuo @kow3ns

Release note:
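Putting the reviewed pieces together, a hedged sketch of the mechanism as it landed in this PR (simplified; the call site and variable plumbing are assumed rather than quoted): the updated pod's phase decides whether the sync is immediate, and an immediate sync forgets the key so `getBackoff` starts from zero:

```go
// In updatePod: only a pod that just failed should keep the accumulated backoff.
immediate := curPod.Status.Phase != v1.PodFailed
jm.enqueueController(job, immediate)

// In enqueueController: an immediate sync clears the key's failure history
// before the delay is computed, so getBackoff returns zero.
if immediate {
	jm.queue.Forget(key)
}
backoff := getBackoff(jm.queue, key)
jm.queue.AddAfter(key, backoff)
```

As the follow-up merge message above explains, the `Forget` call was later dropped in favor of enqueueing with a plain zero backoff, since erasing the failure history broke the backoff guarantees.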