
pkg/controller/job: re-honor exponential backoff delay #114516

Merged
merged 1 commit into kubernetes:master on Jan 13, 2023

Conversation

nikhita
Member

@nikhita nikhita commented Dec 15, 2022

What type of PR is this?

/kind regression

What this PR does / why we need it:

This commit makes the job controller honor the exponential backoff delay for failed pods again. Before this commit, the controller created replacement pods without any backoff. This is a regression: the controller previously created replacement pods with an exponential backoff delay (10s, 20s, 40s, ...).

The issue occurs only when the JobTrackingWithFinalizers feature is enabled (which is enabled by default right now). With this feature, we get an extra pod update event when the finalizer of a failed pod is removed.

Note that pod failure detection and new pod creation happen in the same reconcile loop, so the 2nd pod is created immediately after the 1st pod fails. The backoff is only applied on the 2nd pod failure, which means that the 3rd pod is created 10s after the 2nd pod, the 4th pod 20s after the 3rd pod, and so on.
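
For context on where the 10s, 20s, 40s delays come from: the controller's work queue uses an exponential failure rate limiter, which doubles the per-key delay on every requeue and resets it when the key is forgotten. Below is a minimal, standalone sketch using client-go's workqueue package; the 10s base and 360s cap mirror the controller's DefaultJobBackOff/MaxJobBackOff defaults (consistent with the ~363s intervals in the test runs quoted later), and the key name is made up.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// 10s base delay, doubling on every requeue, capped at 360s.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(10*time.Second, 360*time.Second)

	key := "default/example-job"
	for i := 0; i < 5; i++ {
		// When() returns the next delay for this key and bumps its failure
		// count: 10s, 20s, 40s, 1m20s, 2m40s, ...
		fmt.Printf("requeue %d: wait %v\n", i+1, limiter.When(key))
	}

	// Forget() resets the failure count, so the next delay starts at 10s again.
	// Forgetting the key at the wrong moment is exactly how the backoff
	// collapses to ~0s in the bugs described below.
	limiter.Forget(key)
	fmt.Printf("after Forget: wait %v\n", limiter.When(key))
}
```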

This commit fixes a few bugs:

  1. Right now, each time uncounted != nil and the job does not see a new failure, forget is set to true and the job is removed from the queue. This means the condition is also triggered each time the finalizer of a failed pod is removed, so NumRequeues is reset and the backoff collapses to 0s.

  2. Updates updatePod to only apply backoff when we see a particular pod fail for the first time. This is necessary to ensure that the controller does not apply backoff when it sees the pod update event for the finalizer removal of a failed pod.

  3. While updating the job status and removing finalizers, we returned an error to handle transient failures; returning an error re-enqueues the job with backoff applied. However, an error was also returned when we got stale info from the informer cache, which could lead to the backoff being calculated twice. This case is now handled by checking apierrors.IsConflict(err) and not returning an error, so that a fresh reconcile loop picks up the latest version of the job from the informer cache (a minimal sketch of this pattern follows the list).

  4. If the JobsReadyPods feature is enabled and we see the 1st failed pod (backoff of 0s), the job is now enqueued after podUpdateBatchPeriod seconds instead of 0s. The unit test for this check also had a few bugs:

    • DefaultJobBackOff is overwritten to 0 in certain unit tests, so the backoff checks effectively ran against a 0s backoff and verified nothing meaningful.
    • JobsReadyPods was not enabled for test cases that required the feature gate to be enabled.
    • The check for expected and actual backoff had incorrect calculations.
  5. Some unit tests modified DefaultJobBackOff but did not set it back to its original default value. This commit adds defer statements for such tests to set it back to 10s (see the second sketch after this list).
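
To make item 3 concrete, here is a minimal sketch of the error-handling shape it describes. The updateStatusAndRemoveFinalizers helper and the finishSync wrapper are hypothetical stand-ins for illustration only; the real controller function, signature, and return values differ.

```go
package jobsketch

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// finishSync sketches the tail of a sync pass as described in item 3.
// updateStatusAndRemoveFinalizers is a hypothetical stand-in for the
// status-update / finalizer-removal step.
func finishSync(ctx context.Context, key string,
	updateStatusAndRemoveFinalizers func(context.Context, string) error) error {

	if err := updateStatusAndRemoveFinalizers(ctx, key); err != nil {
		if apierrors.IsConflict(err) {
			// Stale object from the informer cache: do not return an error
			// (that would add a second backoff on top of the one already
			// applied); end this sync so a fresh reconcile picks up the
			// latest version of the Job from the cache.
			return nil
		}
		// Genuinely transient failure: returning the error re-enqueues the
		// Job with rate-limited backoff.
		return fmt.Errorf("tracking job status and removing finalizers: %w", err)
	}
	return nil
}
```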
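
And for item 5, this is the kind of cleanup the added defer statements provide, sketched against a stand-in package-level variable (in the real tests the variable lives in pkg/controller/job):

```go
package jobsketch

import (
	"testing"
	"time"
)

// DefaultJobBackOff stands in for the variable of the same name in
// pkg/controller/job; it is declared here only so the sketch is self-contained.
var DefaultJobBackOff = 10 * time.Second

func TestWithZeroBackoff(t *testing.T) {
	// Restore the package-level default on exit so later tests run against
	// the real 10s backoff instead of this test's override.
	origBackoff := DefaultJobBackOff
	defer func() { DefaultJobBackOff = origBackoff }()

	DefaultJobBackOff = 0 // this test wants no delay
	// ... exercise the controller ...
}
```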

Which issue(s) this PR fixes:

For #114391

Special notes for your reviewer:

The fix will need to be cherry-picked down to 1.23.
Update: Cherry-pick PRs are mentioned below.

Does this PR introduce a user-facing change?

Fix regression in 1.25 default configurations so failed pods associated with a job with `parallelism = 1` are recreated by the job controller honoring exponential backoff delay again. However, for jobs with `parallelism > 1`, pods might be created without exponential backoff delay.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/regression Categorizes issue or PR as related to a regression from a prior release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 15, 2022
@k8s-ci-robot
Contributor

@nikhita: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 15, 2022
@nikhita
Member Author

nikhita commented Dec 15, 2022

/assign @alculquicondor

@alculquicondor
Member

alculquicondor commented Dec 16, 2022 via email

Thanks for confirming. It was simpler than I was thinking. Please leave the legacy path untouched if it's not needed. And add an integration test that verifies that the second pod created is delayed by at least 10 seconds. We need to cherry-pick down to 1.24.

> On Fri, Dec 16, 2022, Sathyanarayanan Saravanamuthu commented on this pull request, on pkg/controller/job/job_controller.go (#114516 (comment)):
> I tried validating the changes in a kind cluster, using the job definition given in the linked issue. I did two runs and found similar results in both cases. The sorted pod start times are as follows.
> Run 1: 2022-12-16T05:54:36Z, 2022-12-16T05:54:40Z, 2022-12-16T05:54:53Z, 2022-12-16T05:55:16Z, 2022-12-16T05:56:39Z, 2022-12-16T05:59:22Z, 2022-12-16T06:04:45Z, 2022-12-16T06:10:49Z, 2022-12-16T06:16:52Z, 2022-12-16T06:22:54Z
> Run 2: 2022-12-16T06:30:45Z, 2022-12-16T06:30:48Z, 2022-12-16T06:31:01Z, 2022-12-16T06:31:24Z, 2022-12-16T06:32:47Z, 2022-12-16T06:35:30Z, 2022-12-16T06:40:53Z, 2022-12-16T06:46:57Z, 2022-12-16T06:53:00Z
> The time diffs in seconds for the two runs are:
> Run 1: 4, 13, 23, 83, 162, 320, 363, 363, 363
> Run 2: 3, 13, 23, 83, 163, 323, 364, 363

@sathyanarays
Contributor

@alculquicondor, from what I observed, the second pod is created immediately after the first pod fails. This may happen because new failure detection and pod creation are done in the same reconcile loop. There is a 10 second gap between the 2nd pod failure and the 3rd pod creation.

@sathyanarays
Contributor

Performed further testing. Details are as follows.

Job manifest

Created a manifest that fails the pods after 1 minute

apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-exp
spec:
  backoffLimit: 5
  completions: 25
  template:
    spec:
      containers:
        - name: die
          image: myubuntu:1
          command: ["/bin/sh","-c"]
          args: ["/bin/sleep 60 && /bin/false"]
          imagePullPolicy: IfNotPresent
      restartPolicy: Never

Start and End

The space-separated container start and end times are as follows:

2022-12-17T06:23:08Z 2022-12-17T06:24:08Z
2022-12-17T06:24:12Z 2022-12-17T06:25:12Z
2022-12-17T06:25:15Z 2022-12-17T06:26:15Z
2022-12-17T06:26:18Z 2022-12-17T06:27:18Z
2022-12-17T06:27:21Z 2022-12-17T06:28:21Z
2022-12-17T06:28:24Z 2022-12-17T06:29:24Z

After each pod failure, the next pod starts at an interval of approximately 3 seconds. I don't see exponential backoff in this case. Let me know if this makes sense!

@sathyanarays
Contributor

> Performed further testing. […] After each pod failure, the next pod starts at an interval of approximately 3 seconds. I don't see exponential backoff in this case.

Slightly changing the logic fixes this:

if finishedCondition != nil {
    forget = true
}
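
For readers less familiar with the controller work loop, the forget result feeds the standard workqueue pattern sketched below: Forget resets the per-key failure count (and therefore the exponential delay), while AddRateLimited re-enqueues with the next backoff step. Names and structure here are illustrative, not the job controller's exact code.

```go
package jobsketch

import (
	"k8s.io/client-go/util/workqueue"
)

// processNextWorkItem sketches the generic controller work loop that a
// (forget, err) result feeds into.
func processNextWorkItem(queue workqueue.RateLimitingInterface,
	sync func(key string) (forget bool, err error)) bool {

	key, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(key)

	forget, err := sync(key.(string))
	if err == nil {
		if forget {
			// Reset the failure count for this key, so the next failure
			// starts the exponential backoff from the base delay again.
			// Setting forget on the finalizer-removal update (the issue
			// discussed above) collapses the delay back to ~0s.
			queue.Forget(key)
		}
		return true
	}
	// On error, re-enqueue with the next exponential backoff step.
	queue.AddRateLimited(key)
	return true
}
```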

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 2, 2023
Member Author

@nikhita nikhita left a comment

Made a few changes and added an integration test, but I'm still not sure how to set forget = true correctly without hitting #114516 (comment).

@alculquicondor
Member

Should we add a sentence in the release note that a higher level of parallelism might lead to pods being recreated faster?

/lgtm
/approve
/hold
for thoughts on the release note.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 12, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 3f049f3c0d00fa03a44e601ec9144263a3d0fc10

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, nikhita

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 12, 2023
The message of the single commit in this PR:

This commit makes the job controller re-honor exponential backoff for
failed pods. Before this commit, the controller created pods without any
backoff. This is a regression because the controller used to
create pods with an exponential backoff delay before (10s, 20s, 40s ...).

The issue occurs only when the JobTrackingWithFinalizers feature is
enabled (which is enabled by default right now). With this feature, we
get an extra pod update event when the finalizer of a failed pod is
removed.

Note that the pod failure detection and new pod creation happen in the
same reconcile loop so the 2nd pod is created immediately after the 1st
pod fails. The backoff is only applied on 2nd pod failure, which means
that the 3rd pod is created 10s after the 2nd pod, the 4th pod is created 20s
after the 3rd pod and so on.

This commit fixes a few bugs:

1. Right now, each time `uncounted != nil` and the job does not see a
_new_ failure, `forget` is set to true and the job is removed from the
queue. This means that this condition is also triggered each time the
finalizer for a failed pod is removed and `NumRequeues` is reset, which
results in a backoff of 0s.

2. Updates `updatePod` to only apply backoff when we see a particular
pod failed for the first time. This is necessary to ensure that the
controller does not apply backoff when it sees a pod update event
for finalizer removal of a failed pod.

3. If `JobsReadyPods` feature is enabled and backoff is 0s, the job is
now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s. The
unit test for this check also had a few bugs:
    - `DefaultJobBackOff` is overwritten to 0 in certain unit tests,
    which meant that `DefaultJobBackOff` was considered to be 0,
    effectively not running any meaningful checks.
    - `JobsReadyPods` was not enabled for test cases that ran tests
    which required the feature gate to be enabled.
    - The check for expected and actual backoff had incorrect
    calculations.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 12, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2023
@nikhita
Member Author

nikhita commented Jan 12, 2023

Should we add a sentence in the release note that a higher level of parallelism might lead to pods being recreated faster?

Updated the release note and also created cherry-pick PRs:

@alculquicondor
Member

pods are created without exponential backoff delay.

"might be"? We can't guarantee either behavior for now.

@alculquicondor
Member

You are missing a cherry pick for 1.26

@nikhita
Member Author

nikhita commented Jan 12, 2023

"might be"? We can't guarantee either behavior for now.

Updated the release note in this PR + all cherry-pick PRs.

You are missing a cherry pick for 1.26

oops, created #115027

@nikhita
Member Author

nikhita commented Jan 13, 2023

@alculquicondor the lgtm was removed due to the rebase. Can you re-apply? Thanks!

@alculquicondor
Member

/lgtm
/hold cancel
Thanks!

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jan 13, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: b067de2a89d49893d78658fcc1f1246de597ec28

@k8s-ci-robot k8s-ci-robot merged commit c0c386b into kubernetes:master Jan 13, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Jan 13, 2023
@liggitt liggitt added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Jan 30, 2023