
Kubernetes Jobs API rapid-fire scheduling doesnt honor exponential backoff characteristics #114391

Closed
jayunit100 opened this issue Dec 9, 2022 · 18 comments · Fixed by #114768
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@jayunit100
Member

What happened?

Problem

The job controller appears to "rapidly reschedule" several pods at the same time (without backing off), which ultimately looks almost like a batch or gang-scheduling workflow.

This is the opposite of what the Job docs say, because they clearly state:

 The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes.

So, rather than:

  • 10, 20, 40, 80 second wait times between retries, we are seeing
  • 0, 3, 3, 3, 3 second wait times before pods are retried

This is in the 1.25 release; I'm not sure whether it also happens earlier. You can see this in the latest kind version. A short sketch of the delay sequence the docs describe follows.
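For reference, a minimal sketch (based only on the documented behavior quoted above, not the controller code) of the expected delays: a 10s base doubled on each retry and capped at six minutes.

// A sketch of the documented backoff sequence, not the job controller's code.
package main

import (
	"fmt"
	"time"
)

func main() {
	delay := 10 * time.Second
	maxDelay := 6 * time.Minute
	for i := 0; i < 8; i++ {
		fmt.Println(delay) // 10s 20s 40s 1m20s 2m40s 5m20s 6m0s 6m0s
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}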

What did you expect to happen?

Jobs wouldn't schedule replacement pods at rapid-fire intervals; retries would honor the documented exponential backoff.

How can we reproduce it (as minimally and precisely as possible)?

Details

Thanks to Ryan for this reproducer:

apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-exp
spec:
  backoffLimit: 100
  template:
    spec:
      containers:
        - name: die
          image: ubuntu
          command: [ "/bin/false" ]
          imagePullPolicy: IfNotPresent
      restartPolicy: Never

Now once this is running, you can graph the values like so (requires GNU awk for the four-argument split):

kubectl get pods -o wide | awk '
  BEGIN {
    a["d"] = 24*(\
    a["h"] = 60*(\
    a["m"] = 60*(\
    a["s"] = 1)));
  }
  {
    print fx($5); # kubernetes time elapsed in seconds
  }
  function fx(ktym,    f,u,n,t,i) {
    n = split(ktym, f, /[dhms]/, u)
    t = 0
    for (i=1; i<n; i++) {
      t += f[i] * a[u[i]]
    }
    return t
  }
' | sort -n

You'll see a table like this, where the integers below are the ages of the pods in seconds. The pod ages are roughly 3 seconds apart, as opposed to 10, 20, 40, 80, ... or some other exponentially increasing spacing; the sketch after the table computes the gaps.

69
72
75
78
81
84
87
90
94
96
99
102
105
108
111
119
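For what it's worth, a tiny Go sketch (not part of the reproducer) that turns the ages above into the gaps between consecutive pods; the gaps stay at roughly 3 seconds instead of doubling.

// Compute the spacing between consecutive pod ages reported above.
package main

import "fmt"

func main() {
	ages := []int{69, 72, 75, 78, 81, 84, 87, 90, 94, 96, 99, 102, 105, 108, 111, 119}
	for i := 1; i < len(ages); i++ {
		fmt.Printf("%ds ", ages[i]-ages[i-1]) // 3s 3s 3s 3s 3s 3s 3s 4s 2s 3s 3s 3s 3s 3s 8s
	}
	fmt.Println()
}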

Anything else we need to know?

No response

Kubernetes version

1.25

Cloud provider

all

OS version

all

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@jayunit100 jayunit100 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 9, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 9, 2022
@k8s-ci-robot
Contributor

@jayunit100: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jayunit100 jayunit100 changed the title Kubernetes Jobs API scheduling erratic Kubernetes Jobs API rapid-fire scheduling doesnt honor exponential backoff characteristics Dec 9, 2022
@MadhavJivrajani
Contributor

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 13, 2022
@nikhita
Member

nikhita commented Dec 13, 2022

/assign

@nikhita
Member

nikhita commented Dec 13, 2022

This is broken when the JobTrackingWithFinalizers feature gate is enabled. Note that it was disabled and re-enabled before moving to GA:

If the feature gate is disabled, the pods are retried by honoring exponential backoff characteristics.

From a quick glance, there are a few reasons why this happens:

  1. Given that the code below is triggered every time uncounted != nil, the queue is made to forget the key, so we always end up with a backoff of zero. (The workqueue sketch at the end of this comment illustrates how forgetting a key resets its backoff.)

if uncounted != nil {
    needsStatusUpdate := suspendCondChanged || active != job.Status.Active || !equalReady(ready, job.Status.Ready)
    job.Status.Active = active
    job.Status.Ready = ready
    err = jm.trackJobStatusAndRemoveFinalizers(ctx, &job, pods, prevSucceededIndexes, *uncounted, expectedRmFinalizers, finishedCondition, needsStatusUpdate)
    if err != nil {
        return false, fmt.Errorf("tracking status: %w", err)
    }
    jobFinished := IsJobFinished(&job)
    if jobHasNewFailure && !jobFinished {
        // returning an error will re-enqueue Job after the backoff period
        return forget, fmt.Errorf("failed pod(s) detected for job key %q", key)
    }
    forget = true
    return forget, manageJobErr
}

  2. Even if we updated the above code to only trigger when a status update is required and got the right backoff period, the backoff would be calculated twice for each pod failure, i.e. it'll end up being something like 10, 10, 20, 20, 40, 40, ...

Looking at:

immediate := curPod.Status.Phase != v1.PodFailed
// Don't check if oldPod has the finalizer, as during ownership transfer
// finalizers might be re-added and removed again in behalf of the new owner.
// If all those Pod updates collapse into a single event, the finalizer
// might be removed in oldPod and curPod. We want to record the latest
// state.
finalizerRemoved := !hasJobTrackingFinalizer(curPod)
curControllerRef := metav1.GetControllerOf(curPod)
oldControllerRef := metav1.GetControllerOf(oldPod)
controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)
if controllerRefChanged && oldControllerRef != nil {
    // The ControllerRef was changed. Sync the old controller, if any.
    if job := jm.resolveControllerRef(oldPod.Namespace, oldControllerRef); job != nil {
        if finalizerRemoved {
            key, err := controller.KeyFunc(job)
            if err == nil {
                jm.finalizerExpectations.finalizerRemovalObserved(key, string(curPod.UID))
            }
        }
        jm.enqueueControllerPodUpdate(job, immediate)
    }
}

We can end up with a scenario where:

  • Pod A fails, triggering a resync. Since the pod has failed, backoff is calculated and applied.
  • Finalizer is removed from Pod A and job status is updated.
  • We receive the job update event, but the pod informer cache is stale. This triggers a sync.
  • We notice that Pod A has the status Failed, so the backoff is applied again.

I'll send a PR to fix both issues.
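For illustration of point 1, a minimal standalone sketch (a sketch only: it assumes a per-item exponential rate limiter shaped like the documented 10s base with a 6-minute cap, and the key name is hypothetical) showing that once a key is forgotten, the next delay drops back to the base instead of continuing the exponential series.

// Demonstrates how Forget resets a workqueue item's exponential backoff.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Exponential per-item rate limiter: 10s base, capped at 6 minutes.
	rl := workqueue.NewItemExponentialFailureRateLimiter(10*time.Second, 6*time.Minute)
	key := "default/backoff-exp" // hypothetical job key

	for i := 0; i < 4; i++ {
		fmt.Println(rl.When(key)) // 10s 20s 40s 1m20s
	}

	// Forget drops the per-item failure count, so the next When starts
	// over at the base delay.
	rl.Forget(key)
	fmt.Println(rl.When(key)) // 10s again
}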

@nikhita
Member

nikhita commented Dec 15, 2022

/kind regression

Created a PR - #114516

@k8s-ci-robot k8s-ci-robot added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Dec 15, 2022
@alculquicondor
Member

This is broken when the JobTrackingWithFinalizers feature gate is enabled.

Did you confirm in older versions of k8s?

@alculquicondor
Member

IIUC, this issue is not exclusive to job tracking with finalizers.

Let's backport to all supported versions of k8s (back to 1.23).

@nikhita
Member

nikhita commented Jan 17, 2023

All cherry-picks have merged.

Keeping this issue open to fix the backoff for jobs with parallelism > 1. This would be fixed by #114768.

/assign @sathyanarays

@k8s-ci-robot
Contributor

@nikhita: GitHub didn't allow me to assign the following users: sathyanarays.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

All cherry-picks have merged.

Keeping this issue open to fix the backoff for jobs with parallelism > 1. This would be fixed by #114768.

/assign @sathyanarays

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sathyanarays
Contributor

/assign

@liggitt
Member

liggitt commented Feb 24, 2023

Keeping this issue open to fix the backoff for jobs with parallelism > 1. This would be fixed by #114768.

Are master and release branches still in a regressed state, or is this a follow-up to fix/improve a pre-existing issue? If any regressions are now resolved, can we track #114768 in a new issue?

@nikhita
Member

nikhita commented Feb 24, 2023

Are master and release branches still in a regressed state

master and release branches have been fixed for parallelism = 1 but are still in a regressed state for parallelism > 1.

#114768 will fix the issue for parallelism > 1.

@liggitt
Member

liggitt commented Feb 24, 2023

#114768 is a really big change... is there a more scoped change we can backport with lower risk?

@nikhita
Member

nikhita commented Feb 24, 2023

#114768 is a really big change... is there a more scoped change we can backport with lower risk?

#114768 is only expected to land on master and not intended to be backported. The issue with parallelism > 1 has been around for a while so we aren't treating it as a regression, and aren't intending to backport a fix.

The fix for parallelism > 1 is pretty involved and we decided in #114516 that a tightly scoped change might not be possible.

ref: #114516 (comment)

Happy to track #114768 in a new issue if it's easier though.

@liggitt
Member

liggitt commented Feb 24, 2023

The issue with parallelism > 1 has been around for a while so we aren't treating it as a regression, and aren't intending to backport a fix.

That's helpful to know... having one issue for the regression fixed in #114516 and one for the other issue still to be fixed might be helpful

@alculquicondor
Member

let's track it separately

/close

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

let's track it separately

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
