Never clean backoff in job controller #63650
Conversation
LGTM 👍
The change looks good. Would you add a test to catch this regression?
[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process. Pull Request Labels:
Haven't heard back from you, so I opened #63990 with your commit and a test for it.
/milestone v1.11
/retest
As proof, I'm just reaching 300 iterations of the test without a flake. So I guess the flakiness is low probability, as well as hard to fix :/
pkg/controller/job/job_controller.go
Outdated
```diff
@@ -393,7 +393,9 @@ func (jm *JobController) processNextWorkItem() bool {
 	}

 	utilruntime.HandleError(fmt.Errorf("Error syncing job: %v", err))
-	jm.queue.AddRateLimited(key)
+	if !errors.IsConflict(err) {
+		jm.queue.AddRateLimited(key)
```
If status update failed, we still need to requeue this job; otherwise, the job status may never be updated.
As a workaround, we can probably retry on update conflict. But in the long run we should not use # of requeues to compare against backoff limit.
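For illustration, a minimal sketch of that workaround using client-go's `retry.RetryOnConflict` helper, assuming pre-1.18 client-go method signatures (no context argument); `updateStatusRetryingOnConflict` is a hypothetical helper, not the controller's actual code:

```go
import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateStatusRetryingOnConflict retries the Job status update on a version
// conflict instead of pushing the key back through the rate-limited queue,
// whose requeue count is (ab)used to track the backoff limit.
func updateStatusRetryingOnConflict(cs kubernetes.Interface, job *batchv1.Job) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-fetch the latest version so the update is based on the current
		// resourceVersion; a GET error ends the retries immediately.
		fresh, err := cs.BatchV1().Jobs(job.Namespace).Get(job.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		fresh.Status = job.Status
		_, err = cs.BatchV1().Jobs(fresh.Namespace).UpdateStatus(fresh)
		return err
	})
}
```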
Yeah, I totally agree that the re-queues are not good for that. They're causing us more harm than good :(
I've created #64787 to track this.
pkg/controller/job/job_controller.go
Outdated
```go
// new failures happen when status does not reflect the failures and active
// is different than parallelism, otherwise the previous controller loop
// failed updating status so even if we pick up failure it is not a new one
exceedsBackoffLimit := jobHaveNewFailure && (active != *job.Spec.Parallelism) &&
```
Would you explain why `active != *job.Spec.Parallelism` is needed?
This is for the case when a controller failed an update but went through the whole process, which means that the number of active pods is exactly what it should be. There's no other way right now to tell that the previous loop failed.
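To make that reasoning concrete, here is a small, purely illustrative helper (not the controller's actual code) capturing the distinction described above:

```go
// isNewFailure reports whether a pod failure observed in this sync loop
// should count as a fresh retry against the backoff limit. If the previous
// loop did all of its work and only the final status update failed, the
// number of active pods already matches spec.Parallelism, so a failure count
// higher than job.Status.Failed is a leftover from that loop, not a new one.
func isNewFailure(failed, statusFailed, active, parallelism int32) bool {
	sawMoreFailuresThanStatus := failed > statusFailed
	previousLoopOnlyMissedStatusUpdate := active == parallelism
	return sawMoreFailuresThanStatus && !previousLoopOnlyMissedStatusUpdate
}
```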
pkg/controller/job/job_controller.go
Outdated
```diff
@@ -494,8 +496,12 @@ func (jm *JobController) syncJob(key string) (bool, error) {
 	var failureReason string
 	var failureMessage string

-	jobHaveNewFailure := failed > job.Status.Failed
-	exceedsBackoffLimit := jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit)
+	jobHaveNewFailure := (failed > job.Status.Failed)
```
This could happen when Job Status failed to be updated?
This did not change; it was like this before I tried different scenarios, so that part is nothing new. During tests I've just added parentheses here.
@janetkuo updated, ptal once again
Force-pushed from a430523 to abd9e0d.
pkg/controller/job/job_controller.go
Outdated
```go
st := job.Status

var err error
for i, job := 0, job; i < statusUpdateRetries; i, job = i+1, refresh(jobClient, job) {
```
`i <= statusUpdateRetries`
pkg/controller/job/job_controller.go
Outdated
```go
if err == nil {
	return newJob
} else {
	return job
```
nit: we should probably give up updating status when GET returns an error
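A rough sketch of what that could look like, assuming pre-1.18 client-go method signatures; `updateJobStatusWithRetries` is a hypothetical helper written to illustrate the suggestion, not the merged code:

```go
import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientbatchv1 "k8s.io/client-go/kubernetes/typed/batch/v1"
)

const statusUpdateRetries = 3

// updateJobStatusWithRetries writes job.Status, re-fetching the object on a
// conflict and retrying up to statusUpdateRetries times. If the re-fetch
// (GET) itself fails, it gives up immediately rather than retrying with a
// stale object.
func updateJobStatusWithRetries(jobClient clientbatchv1.JobInterface, job *batchv1.Job) error {
	status := job.Status
	var err error
	for i := 0; i < statusUpdateRetries; i++ {
		job.Status = status
		if _, err = jobClient.UpdateStatus(job); err == nil || !errors.IsConflict(err) {
			return err
		}
		// Conflict: fetch the latest version; bail out if the GET fails.
		fresh, getErr := jobClient.Get(job.Name, metav1.GetOptions{})
		if getErr != nil {
			return getErr
		}
		job = fresh
	}
	return err
}
```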
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: janetkuo, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
@soltysh: The following test failed. Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/test pull-kubernetes-verify
Automatic merge from submit-queue (batch tested with PRs 64009, 64780, 64354, 64727, 63650). If you want to cherry-pick this change to another branch, please follow the instructions here.
#63650-upstream-release-1.10: Automatic merge from submit-queue. Automated cherry pick of #58972 (Fix job's backoff limit for restart policy OnFailure) and #63650 (Never clean backoff in job controller) on release-1.10. Fixes #62382.

**Release Note:**

```release-note
Fix regression in `v1.JobSpec.backoffLimit` that caused failed Jobs to be restarted indefinitely.
```
@soltysh I ran the test locally; it failed 4 times.
related log:
fail scenario 2:
related log:
For posterity, this was backported in 1.10.5 -> #64813
What this PR does / why we need it:
In #60985 I've added a mechanism which allows immediate job status updates; unfortunately, that seriously broke the backoff logic. I'm sorry for that. I've changed the `immediate` mechanism so that it NEVER cleans the backoff, but for the cases when we want a fast status update it uses a zero backoff instead.

Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #62382
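To sketch the idea: instead of calling `queue.Forget(key)`, which would reset the rate limiter's failure count, an immediate resync simply re-adds the key with a zero delay. The `enqueueWithBackoff` helper and `immediate` flag below are illustrative names, not verbatim code from this PR:

```go
import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// enqueueWithBackoff re-adds a key to the work queue. An immediate resync is
// requested by adding the key with a zero delay; the rate limiter's memory of
// previous failures is never cleared (queue.Forget is not called here), so
// the retry count compared against spec.backoffLimit stays intact.
func enqueueWithBackoff(queue workqueue.RateLimitingInterface, key string, immediate bool) {
	if immediate {
		queue.AddAfter(key, 0*time.Second)
		return
	}
	queue.AddRateLimited(key)
}
```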
Special notes for your reviewer:
/assign @janetkuo
Release note: