pkg/controller/job: re-honor exponential backoff delay #114516
Conversation
/assign @alculquicondor
Thanks for confirming. It was simpler than I was thinking.
Please leave the legacy path untouched if it's not needed.
And add an integration test that verifies that the second pod created is
delayed at least 10 seconds.
We need to cherry-pick down to 1.24.
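A minimal sketch of the kind of assertion such an integration test could make, assuming the test harness already has the job's pods sorted by creation time (`verifySecondPodBackoff` and `sortedPods` are hypothetical names; the test actually added in this PR may assert differently):

```go
import (
	"testing"
	"time"

	v1 "k8s.io/api/core/v1"
)

// verifySecondPodBackoff checks that the controller waited at least the base
// backoff (10s by default) before creating the replacement pod.
func verifySecondPodBackoff(t *testing.T, sortedPods []v1.Pod) {
	if len(sortedPods) < 2 {
		t.Fatalf("expected at least 2 pods, got %d", len(sortedPods))
	}
	delay := sortedPods[1].CreationTimestamp.Sub(sortedPods[0].CreationTimestamp.Time)
	if delay < 10*time.Second {
		t.Errorf("second pod created after %v; want at least 10s of backoff", delay)
	}
}
```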
On Fri, Dec 16, 2022, Sathyanarayanan Saravanamuthu commented on this pull request, in pkg/controller/job/job_controller.go (#114516 (comment)):
> @@ -879,7 +881,9 @@ func (jm *Controller) syncJob(ctx context.Context, key string) (forget bool, rEr
// returning an error will re-enqueue Job after the backoff period
I tried validating the changes in a kind cluster. I took the job definition given in the linked issue for testing. I did two runs and found similar results in both cases. The sorted pod start times are as follows.
Run 1
2022-12-16T05:54:36Z
2022-12-16T05:54:40Z
2022-12-16T05:54:53Z
2022-12-16T05:55:16Z
2022-12-16T05:56:39Z
2022-12-16T05:59:22Z
2022-12-16T06:04:45Z
2022-12-16T06:10:49Z
2022-12-16T06:16:52Z
2022-12-16T06:22:54Z
Run 2
2022-12-16T06:30:45Z
2022-12-16T06:30:48Z
2022-12-16T06:31:01Z
2022-12-16T06:31:24Z
2022-12-16T06:32:47Z
2022-12-16T06:35:30Z
2022-12-16T06:40:53Z
2022-12-16T06:46:57Z
2022-12-16T06:53:00Z
The time diffs in seconds for the above two runs are:
Run 1: 4, 13, 23, 83, 162, 320, 363, 363, 363
Run 2: 3, 13, 23, 83, 163, 323, 364, 363
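For reference, these deltas roughly double until they hit a cap near 360s, which matches a per-item exponential rate limiter with a 10s base and 360s cap (plus a few seconds of pod-creation latency). A standalone sketch of that delay sequence using the client-go rate limiter the job controller builds on; the 10s/360s values mirror its defaults, and this is not the controller's exact wiring:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff: baseDelay * 2^failures, capped at maxDelay.
	rl := workqueue.NewItemExponentialFailureRateLimiter(10*time.Second, 360*time.Second)
	key := "default/example-job"
	for i := 0; i < 8; i++ {
		fmt.Println(rl.When(key)) // 10s, 20s, 40s, 80s, ..., capped at 360s
	}
}
```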
@alculquicondor, from what I observed, the second pod is created immediately after the first pod fails. This may happen because failure detection and new pod creation are done in the same reconcile loop. There is a 10 second gap between the 2nd pod's failure and the 3rd pod's creation.
Performed further testing. Details are as follows.
Job manifest: created a manifest that fails the pods after 1 minute.
Start and end: the space-separated container start and end times are as follows:
After each pod failure, the next pod starts at an interval of approximately 3 seconds. I don't see exponential backoff in this case. Let me know if this makes sense!
Slightly changing the logic fixes this:
Force-pushed 498654b to 970e133.
Made a few changes and added an integration test, but I'm still not sure how to set `forget = true` correctly without hitting #114516 (comment).
Force-pushed 970e133 to 7e97407.
Force-pushed 609f60e to 215c59e.
Should we add a sentence in the release note that a higher level of parallelism might lead to pods being recreated faster?
/lgtm
LGTM label has been added. Git tree hash: 3f049f3c0d00fa03a44e601ec9144263a3d0fc10
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alculquicondor, nikhita
This commit makes the job controller re-honor exponential backoff for failed pods. Before this commit, the controller created pods without any backoff. This is a regression because the controller used to create pods with an exponential backoff delay (10s, 20s, 40s, ...).

The issue occurs only when the JobTrackingWithFinalizers feature is enabled (which is enabled by default right now). With this feature, we get an extra pod update event when the finalizer of a failed pod is removed.

Note that pod failure detection and new pod creation happen in the same reconcile loop, so the 2nd pod is created immediately after the 1st pod fails. The backoff is only applied on the 2nd pod failure, which means the 3rd pod is created 10s after the 2nd pod, the 4th pod is created 20s after the 3rd pod, and so on.

This commit fixes a few bugs:

1. Right now, each time `uncounted != nil` and the job does not see a _new_ failure, `forget` is set to true and the job is removed from the queue. This means the condition is also triggered each time the finalizer for a failed pod is removed, so `NumRequeues` is reset, which results in a backoff of 0s.
2. Updates `updatePod` to only apply backoff when we see a particular pod fail for the first time. This is necessary to ensure that the controller does not apply backoff when it sees a pod update event for finalizer removal of a failed pod.
3. If the `JobsReadyPods` feature is enabled and backoff is 0s, the job is now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s.

The unit test for this check also had a few bugs:

- `DefaultJobBackOff` is overwritten to 0 in certain unit tests, which meant that `DefaultJobBackOff` was considered to be 0, effectively not running any meaningful checks.
- `JobsReadyPods` was not enabled for test cases that required the feature gate to be enabled.
- The check for expected and actual backoff had incorrect calculations.
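A simplified sketch of the worker-loop pattern behind bug 1, using the standard client-go workqueue (this is not the controller's exact code): calling `Forget` on the wrong path is what resets `NumRequeues` and collapses the next backoff to 0s.

```go
import (
	"k8s.io/client-go/util/workqueue"
)

// processNextWorkItem mimics a controller worker loop. AddRateLimited
// computes its delay from NumRequeues(key), so a stray Forget(key) -- e.g.
// triggered by the pod update event for a finalizer removal -- makes the
// next failure wait only the base delay again.
func processNextWorkItem(queue workqueue.RateLimitingInterface, syncJob func(key string) (forget bool, err error)) {
	item, quit := queue.Get()
	if quit {
		return
	}
	key := item.(string)
	defer queue.Done(key)

	forget, err := syncJob(key)
	if err != nil {
		queue.AddRateLimited(key) // re-enqueue; delay doubles with each requeue
		return
	}
	if forget {
		queue.Forget(key) // resets NumRequeues(key) to 0
	}
}
```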
Force-pushed 215c59e to fd8d92a.
Updated the release note and also created cherry-pick PRs:
"might be"? We can't guarantee either behavior for now.
You are missing a cherry pick for 1.26
Updated the release note in this PR + all cherry-pick PRs.
oops, created #115027
@alculquicondor the lgtm was removed due to the rebase. Can you re-apply? Thanks!
/lgtm
LGTM label has been added. Git tree hash: b067de2a89d49893d78658fcc1f1246de597ec28
What type of PR is this?
/kind regression
What this PR does / why we need it:
This commit makes the job controller honor exponential backoff for failed pods. Before this commit, the controller created pods without any backoff. This is a regression because the controller used to create pods with an exponential backoff delay before (10s, 20s, 40s ...).
The issue occurs only when the JobTrackingWithFinalizers feature is enabled (which is enabled by default right now). With this feature, we get an extra pod update event when the finalizer of a failed pod is removed.
Note that pod failure detection and new pod creation happen in the same reconcile loop, so the 2nd pod is created immediately after the 1st pod fails. The backoff is only applied on the 2nd pod failure, which means the 3rd pod is created 10s after the 2nd pod, the 4th pod is created 20s after the 3rd pod, and so on.
This commit fixes a few bugs:

1. Right now, each time `uncounted != nil` and the job does not see a _new_ failure, `forget` is set to true and the job is removed from the queue. This means the condition is also triggered each time the finalizer for a failed pod is removed, so `NumRequeues` is reset, which results in a backoff of 0s.
2. Updates `updatePod` to only apply backoff when we see a particular pod fail for the first time. This is necessary to ensure that the controller does not apply backoff when it sees a pod update event for finalizer removal of a failed pod.
3. While updating the job status and removing finalizers, we returned an error to handle transient failures. Returning an error would mean that the job is re-enqueued with backoff applied. However, an error was also returned in cases where we get stale info from the informer cache, which can lead to the backoff being calculated twice. This case is now handled by checking `apierrors.IsConflict(err)` and not returning an error, so that we start a fresh reconcile loop to pick up the latest version of the job from the informer cache (see the sketch after this list).
4. If the `JobsReadyPods` feature is enabled, we see the 1st failed pod, and backoff is 0s, the job is now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s.

The unit test for this check also had a few bugs:

- `DefaultJobBackOff` is overwritten to 0 in certain unit tests, which meant that `DefaultJobBackOff` was considered to be 0, effectively not running any meaningful checks.
- `JobsReadyPods` was not enabled for test cases that required the feature gate to be enabled.
- Some unit tests modified `DefaultJobBackOff` but did not set it back to its original default value. This commit adds `defer` statements for such tests to set it back to 10s.
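A hedged sketch of the conflict handling in point 3, with `handleStatusUpdate` and `updateStatusAndRemoveFinalizers` as hypothetical stand-ins for the real update path:

```go
import (
	"context"

	batch "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// handleStatusUpdate sketches point 3 above. On a conflict caused by a stale
// object from the informer cache, swallow the error: returning it would
// re-enqueue with backoff and effectively count the failure twice. The next
// reconcile loop picks up the fresh object instead.
func handleStatusUpdate(ctx context.Context, job *batch.Job,
	updateStatusAndRemoveFinalizers func(context.Context, *batch.Job) error) error {
	if err := updateStatusAndRemoveFinalizers(ctx, job); err != nil {
		if apierrors.IsConflict(err) {
			return nil // fresh reconcile will re-read the job from the cache
		}
		return err // transient failure: re-enqueue with backoff
	}
	return nil
}
```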
Which issue(s) this PR fixes:
For #114391
Special notes for your reviewer:
The fix will need to be cherry-picked down to 1.23.
Update: Cherry-pick PRs are mentioned below.
Does this PR introduce a user-facing change?