
pkg/controller/job: re-honor exponential backoff delay #114516

Merged
merged 1 commit into kubernetes:master on Jan 13, 2023

Conversation

nikhita
Member

@nikhita nikhita commented Dec 15, 2022

What type of PR is this?

/kind regression

What this PR does / why we need it:

This commit makes the job controller honor the exponential backoff delay for failed pods again. Before this commit, the controller created replacement pods without any backoff. This is a regression: the controller previously created replacement pods with an exponential backoff delay (10s, 20s, 40s, ...).

The issue occurs only when the JobTrackingWithFinalizers feature is enabled (which is enabled by default right now). With this feature, we get an extra pod update event when the finalizer of a failed pod is removed.

Note that pod failure detection and new pod creation happen in the same reconcile loop, so the 2nd pod is created immediately after the 1st pod fails. The backoff is only applied on the 2nd pod failure, which means that the 3rd pod is created 10s after the 2nd pod, the 4th pod 20s after the 3rd pod, and so on.
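
For context on where the 10s, 20s, 40s delays come from: the controller's work queue uses an exponential failure rate limiter, which doubles the per-key delay on every requeue and resets it when the key is forgotten. Below is a minimal, standalone sketch using client-go's workqueue package; the 10s base and 360s cap mirror the controller's DefaultJobBackOff/MaxJobBackOff defaults (consistent with the ~363s intervals in the test runs quoted later), and the key name is made up.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// 10s base delay, doubling on every requeue, capped at 360s.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(10*time.Second, 360*time.Second)

	key := "default/example-job"
	for i := 0; i < 5; i++ {
		// When() returns the next delay for this key and bumps its failure
		// count: 10s, 20s, 40s, 1m20s, 2m40s, ...
		fmt.Printf("requeue %d: wait %v\n", i+1, limiter.When(key))
	}

	// Forget() resets the failure count, so the next delay starts at 10s again.
	// Forgetting the key at the wrong moment is exactly how the backoff
	// collapses to ~0s in the bugs described below.
	limiter.Forget(key)
	fmt.Printf("after Forget: wait %v\n", limiter.When(key))
}
```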

This commit fixes a few bugs:

  1. Right now, each time uncounted != nil and the job does not see a new failure, forget is set to true and the job is removed from the queue. This means the condition is also triggered each time the finalizer of a failed pod is removed, so NumRequeues is reset and the backoff collapses to 0s.

  2. Updates updatePod to only apply backoff when we see a particular pod fail for the first time. This is necessary to ensure that the controller does not apply backoff when it sees the pod update event for the finalizer removal of a failed pod.

  3. While updating the job status and removing finalizers, we returned an error to handle transient failures; returning an error re-enqueues the job with backoff applied. However, an error was also returned when we got stale info from the informer cache, which could lead to the backoff being calculated twice. This case is now handled by checking apierrors.IsConflict(err) and not returning an error, so that a fresh reconcile loop picks up the latest version of the job from the informer cache (a minimal sketch of this pattern follows the list).

  4. If the JobsReadyPods feature is enabled and we see the 1st failed pod (backoff of 0s), the job is now enqueued after podUpdateBatchPeriod seconds instead of 0s. The unit test for this check also had a few bugs:

    • DefaultJobBackOff is overwritten to 0 in certain unit tests, so the backoff checks effectively ran against a 0s backoff and verified nothing meaningful.
    • JobsReadyPods was not enabled for test cases that required the feature gate to be enabled.
    • The check for expected and actual backoff had incorrect calculations.
  5. Some unit tests modified DefaultJobBackOff but did not set it back to its original default value. This commit adds defer statements for such tests to set it back to 10s (see the second sketch after this list).
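
To make item 3 concrete, here is a minimal sketch of the error-handling shape it describes. The updateStatusAndRemoveFinalizers helper and the finishSync wrapper are hypothetical stand-ins for illustration only; the real controller function, signature, and return values differ.

```go
package jobsketch

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// finishSync sketches the tail of a sync pass as described in item 3.
// updateStatusAndRemoveFinalizers is a hypothetical stand-in for the
// status-update / finalizer-removal step.
func finishSync(ctx context.Context, key string,
	updateStatusAndRemoveFinalizers func(context.Context, string) error) error {

	if err := updateStatusAndRemoveFinalizers(ctx, key); err != nil {
		if apierrors.IsConflict(err) {
			// Stale object from the informer cache: do not return an error
			// (that would add a second backoff on top of the one already
			// applied); end this sync so a fresh reconcile picks up the
			// latest version of the Job from the cache.
			return nil
		}
		// Genuinely transient failure: returning the error re-enqueues the
		// Job with rate-limited backoff.
		return fmt.Errorf("tracking job status and removing finalizers: %w", err)
	}
	return nil
}
```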
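
And for item 5, this is the kind of cleanup the added defer statements provide, sketched against a stand-in package-level variable (in the real tests the variable lives in pkg/controller/job):

```go
package jobsketch

import (
	"testing"
	"time"
)

// DefaultJobBackOff stands in for the variable of the same name in
// pkg/controller/job; it is declared here only so the sketch is self-contained.
var DefaultJobBackOff = 10 * time.Second

func TestWithZeroBackoff(t *testing.T) {
	// Restore the package-level default on exit so later tests run against
	// the real 10s backoff instead of this test's override.
	origBackoff := DefaultJobBackOff
	defer func() { DefaultJobBackOff = origBackoff }()

	DefaultJobBackOff = 0 // this test wants no delay
	// ... exercise the controller ...
}
```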

Which issue(s) this PR fixes:

For #114391

Special notes for your reviewer:

The fix will need to be cherry-picked down to 1.23.
Update: Cherry-pick PRs are mentioned below.

Does this PR introduce a user-facing change?

Fix regression in 1.25 default configurations so failed pods associated with a job with `parallelism = 1` are recreated by the job controller honoring exponential backoff delay again. However, for jobs with `parallelism > 1`, pods might be created without exponential backoff delay.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/regression Categorizes issue or PR as related to a regression from a prior release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 15, 2022
@k8s-ci-robot
Contributor

@nikhita: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 15, 2022
@nikhita
Member Author

nikhita commented Dec 15, 2022

/assign @alculquicondor

@alculquicondor
Member

alculquicondor commented Dec 16, 2022 via email

Thanks for confirming. It was simpler than I was thinking. Please leave the legacy path untouched if it's not needed. And add an integration test that verifies that the second pod created is delayed by at least 10 seconds. We need to cherry-pick down to 1.24.

> On Fri, Dec 16, 2022, Sathyanarayanan Saravanamuthu commented on this pull request, on pkg/controller/job/job_controller.go (#114516 (comment)):
> I tried validating the changes in a kind cluster, using the job definition given in the linked issue. I did two runs and found similar results in both cases. The sorted pod start times are as follows.
> Run 1: 2022-12-16T05:54:36Z, 2022-12-16T05:54:40Z, 2022-12-16T05:54:53Z, 2022-12-16T05:55:16Z, 2022-12-16T05:56:39Z, 2022-12-16T05:59:22Z, 2022-12-16T06:04:45Z, 2022-12-16T06:10:49Z, 2022-12-16T06:16:52Z, 2022-12-16T06:22:54Z
> Run 2: 2022-12-16T06:30:45Z, 2022-12-16T06:30:48Z, 2022-12-16T06:31:01Z, 2022-12-16T06:31:24Z, 2022-12-16T06:32:47Z, 2022-12-16T06:35:30Z, 2022-12-16T06:40:53Z, 2022-12-16T06:46:57Z, 2022-12-16T06:53:00Z
> The time diffs in seconds for the two runs are:
> Run 1: 4, 13, 23, 83, 162, 320, 363, 363, 363
> Run 2: 3, 13, 23, 83, 163, 323, 364, 363

@sathyanarays
Contributor

@alculquicondor, from what I observed, the second pod is created immediately after the first pod fails. This may happen because new failure detection and pod creation are done in the same reconcile loop. There is a 10 second gap between the 2nd pod failure and the 3rd pod creation.

@sathyanarays
Contributor

Performed further testing. Details are as follows.

Job manifest

Created a manifest that fails the pods after 1 minute

apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-exp
spec:
  backoffLimit: 5
  completions: 25
  template:
    spec:
      containers:
        - name: die
          image: myubuntu:1
          command: ["/bin/sh","-c"]
          args: ["/bin/sleep 60 && /bin/false"]
          imagePullPolicy: IfNotPresent
      restartPolicy: Never

Start and End

The space-separated container start and end times are as follows:

2022-12-17T06:23:08Z 2022-12-17T06:24:08Z
2022-12-17T06:24:12Z 2022-12-17T06:25:12Z
2022-12-17T06:25:15Z 2022-12-17T06:26:15Z
2022-12-17T06:26:18Z 2022-12-17T06:27:18Z
2022-12-17T06:27:21Z 2022-12-17T06:28:21Z
2022-12-17T06:28:24Z 2022-12-17T06:29:24Z

After each pod failure, the next pod starts at an interval of approximately 3 seconds. I don't see exponential backoff in this case. Let me know if this makes sense!

@sathyanarays
Contributor

> Performed further testing. […] After each pod failure, the next pod starts at an interval of approximately 3 seconds. I don't see exponential backoff in this case.

Slightly changing the logic fixes this:

if finishedCondition != nil {
    forget = true
}
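
For readers less familiar with the controller work loop, the forget result feeds the standard workqueue pattern sketched below: Forget resets the per-key failure count (and therefore the exponential delay), while AddRateLimited re-enqueues with the next backoff step. Names and structure here are illustrative, not the job controller's exact code.

```go
package jobsketch

import (
	"k8s.io/client-go/util/workqueue"
)

// processNextWorkItem sketches the generic controller work loop that a
// (forget, err) result feeds into.
func processNextWorkItem(queue workqueue.RateLimitingInterface,
	sync func(key string) (forget bool, err error)) bool {

	key, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(key)

	forget, err := sync(key.(string))
	if err == nil {
		if forget {
			// Reset the failure count for this key, so the next failure
			// starts the exponential backoff from the base delay again.
			// Setting forget on the finalizer-removal update (the issue
			// discussed above) collapses the delay back to ~0s.
			queue.Forget(key)
		}
		return true
	}
	// On error, re-enqueue with the next exponential backoff step.
	queue.AddRateLimited(key)
	return true
}
```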

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 2, 2023
Member Author

@nikhita nikhita left a comment

Made a few changes and added an integration test, but I'm still not sure how to set forget = true correctly without hitting #114516 (comment).

@alculquicondor
Member

Should we add a sentence in the release note that a higher level of parallelism might lead to pods being recreated faster?

/lgtm
/approve
/hold
for thoughts on the release note.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 12, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 3f049f3c0d00fa03a44e601ec9144263a3d0fc10

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, nikhita

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 12, 2023
The message of the single commit in this PR:

This commit makes the job controller re-honor exponential backoff for
failed pods. Before this commit, the controller created pods without any
backoff. This is a regression because the controller used to
create pods with an exponential backoff delay before (10s, 20s, 40s ...).

The issue occurs only when the JobTrackingWithFinalizers feature is
enabled (which is enabled by default right now). With this feature, we
get an extra pod update event when the finalizer of a failed pod is
removed.

Note that the pod failure detection and new pod creation happen in the
same reconcile loop so the 2nd pod is created immediately after the 1st
pod fails. The backoff is only applied on 2nd pod failure, which means
that the 3rd pod is created 10s after the 2nd pod, the 4th pod is created 20s
after the 3rd pod and so on.

This commit fixes a few bugs:

1. Right now, each time `uncounted != nil` and the job does not see a
_new_ failure, `forget` is set to true and the job is removed from the
queue. This means that this condition is also triggered each time the
finalizer for a failed pod is removed and `NumRequeues` is reset, which
results in a backoff of 0s.

2. Updates `updatePod` to only apply backoff when we see a particular
pod failed for the first time. This is necessary to ensure that the
controller does not apply backoff when it sees a pod update event
for finalizer removal of a failed pod.

3. If `JobsReadyPods` feature is enabled and backoff is 0s, the job is
now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s. The
unit test for this check also had a few bugs:
    - `DefaultJobBackOff` is overwritten to 0 in certain unit tests,
    which meant that `DefaultJobBackOff` was considered to be 0,
    effectively not running any meaningful checks.
    - `JobsReadyPods` was not enabled for test cases that ran tests
    which required the feature gate to be enabled.
    - The check for expected and actual backoff had incorrect
    calculations.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 12, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2023
@nikhita
Member Author

nikhita commented Jan 12, 2023

Should we add a sentence in the release note that a higher level of parallelism might lead to pods being recreated faster?

Updated the release note and also created cherry-pick PRs:

@alculquicondor
Member

pods are created without exponential backoff delay.

"might be"? We can't guarantee either behavior for now.

@alculquicondor
Member

You are missing a cherry pick for 1.26

@nikhita
Member Author

nikhita commented Jan 12, 2023

"might be"? We can't guarantee either behavior for now.

Updated the release note in this PR + all cherry-pick PRs.

You are missing a cherry pick for 1.26

oops, created #115027

@nikhita
Member Author

nikhita commented Jan 13, 2023

@alculquicondor the lgtm was removed due to the rebase. Can you re-apply? Thanks!

@alculquicondor
Member

/lgtm
/hold cancel
Thanks!

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jan 13, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: b067de2a89d49893d78658fcc1f1246de597ec28

@k8s-ci-robot k8s-ci-robot merged commit c0c386b into kubernetes:master Jan 13, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Jan 13, 2023
@liggitt liggitt added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Jan 30, 2023