
Kubernetes Jobs API rapid-fire scheduling doesnt honor exponential backoff characteristics #114391

Closed
jayunit100 opened this issue Dec 9, 2022 · 18 comments · Fixed by #114768
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@jayunit100
Member

What happened?

Problem

The job controller appears to "rapidly reschedule" several pods at the same time (without backing off), which ultimately looks almost like a batch or gang-scheduling workflow.

This is the opposite of what the Job docs say, because they clearly state:

 The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes.

So, rather than:

  • 10, 20, 40, 80 second wait times between retries, we are seeing
  • 0, 3, 3, 3, 3 second wait times before pods are retried

This is in the 1.25 release; I'm not sure whether it also happens earlier. You can see this in the latest kind version. A short sketch of the delay sequence the docs describe follows.
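For reference, a minimal sketch (based only on the documented behavior quoted above, not the controller code) of the expected delays: a 10s base doubled on each retry and capped at six minutes.

// A sketch of the documented backoff sequence, not the job controller's code.
package main

import (
	"fmt"
	"time"
)

func main() {
	delay := 10 * time.Second
	maxDelay := 6 * time.Minute
	for i := 0; i < 8; i++ {
		fmt.Println(delay) // 10s 20s 40s 1m20s 2m40s 5m20s 6m0s 6m0s
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}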

What did you expect to happen?

Jobs wouldn't schedule replacement pods at rapid-fire intervals; retries would honor the documented exponential backoff.

How can we reproduce it (as minimally and precisely as possible)?

Details

Thanks to Ryan for this reproducer:

apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-exp
spec:
  backoffLimit: 100
  template:
    spec:
      containers:
        - name: die
          image: ubuntu
          command: [ "/bin/false" ]
          imagePullPolicy: IfNotPresent
      restartPolicy: Never

Now once this is running, you can graph the values like so (requires GNU awk for the four-argument split):

kubectl get pods -o wide | awk '
  BEGIN {
    a["d"] = 24*(\
    a["h"] = 60*(\
    a["m"] = 60*(\
    a["s"] = 1)));
  }
  {
    print fx($5); # kubernetes time elapsed in seconds
  }
  function fx(ktym,    f,u,n,t,i) {
    n = split(ktym, f, /[dhms]/, u)
    t = 0
    for (i=1; i<n; i++) {
      t += f[i] * a[u[i]]
    }
    return t
  }
' | sort -n

You'll see a table like this, where the integers below are the ages of the pods in seconds. The pod ages are roughly 3 seconds apart, as opposed to 10, 20, 40, 80, ... or some other exponentially increasing spacing; the sketch after the table computes the gaps.

69
72
75
78
81
84
87
90
94
96
99
102
105
108
111
119
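For what it's worth, a tiny Go sketch (not part of the reproducer) that turns the ages above into the gaps between consecutive pods; the gaps stay at roughly 3 seconds instead of doubling.

// Compute the spacing between consecutive pod ages reported above.
package main

import "fmt"

func main() {
	ages := []int{69, 72, 75, 78, 81, 84, 87, 90, 94, 96, 99, 102, 105, 108, 111, 119}
	for i := 1; i < len(ages); i++ {
		fmt.Printf("%ds ", ages[i]-ages[i-1]) // 3s 3s 3s 3s 3s 3s 3s 4s 2s 3s 3s 3s 3s 3s 8s
	}
	fmt.Println()
}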

Anything else we need to know?

No response

Kubernetes version

1.25

Cloud provider

all

OS version

all

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@jayunit100 jayunit100 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 9, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 9, 2022
@k8s-ci-robot
Contributor

@jayunit100: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jayunit100 jayunit100 changed the title Kubernetes Jobs API scheduling erratic Kubernetes Jobs API rapid-fire scheduling doesnt honor exponential backoff characteristics Dec 9, 2022
@MadhavJivrajani
Contributor

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 13, 2022
@nikhita
Member

nikhita commented Dec 13, 2022

/assign

@nikhita
Member

nikhita commented Dec 13, 2022

This is broken when the JobTrackingWithFinalizers feature gate is enabled. Note that it was disabled and re-enabled before moving to GA:

If the feature gate is disabled, the pods are retried by honoring exponential backoff characteristics.

From a quick glance, there are a few reasons why this happens:

  1. Given that the code below is triggered every time uncounted != nil, the queue is made to forget the key, so we always end up with a backoff of zero. (The workqueue sketch at the end of this comment illustrates how forgetting a key resets its backoff.)

if uncounted != nil {
    needsStatusUpdate := suspendCondChanged || active != job.Status.Active || !equalReady(ready, job.Status.Ready)
    job.Status.Active = active
    job.Status.Ready = ready
    err = jm.trackJobStatusAndRemoveFinalizers(ctx, &job, pods, prevSucceededIndexes, *uncounted, expectedRmFinalizers, finishedCondition, needsStatusUpdate)
    if err != nil {
        return false, fmt.Errorf("tracking status: %w", err)
    }
    jobFinished := IsJobFinished(&job)
    if jobHasNewFailure && !jobFinished {
        // returning an error will re-enqueue Job after the backoff period
        return forget, fmt.Errorf("failed pod(s) detected for job key %q", key)
    }
    forget = true
    return forget, manageJobErr
}

  2. Even if we updated the above code to only trigger when a status update is required and got the right backoff period, the backoff would be calculated twice for each pod failure, i.e. it'll end up being something like 10, 10, 20, 20, 40, 40, ...

Looking at:

immediate := curPod.Status.Phase != v1.PodFailed
// Don't check if oldPod has the finalizer, as during ownership transfer
// finalizers might be re-added and removed again in behalf of the new owner.
// If all those Pod updates collapse into a single event, the finalizer
// might be removed in oldPod and curPod. We want to record the latest
// state.
finalizerRemoved := !hasJobTrackingFinalizer(curPod)
curControllerRef := metav1.GetControllerOf(curPod)
oldControllerRef := metav1.GetControllerOf(oldPod)
controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)
if controllerRefChanged && oldControllerRef != nil {
    // The ControllerRef was changed. Sync the old controller, if any.
    if job := jm.resolveControllerRef(oldPod.Namespace, oldControllerRef); job != nil {
        if finalizerRemoved {
            key, err := controller.KeyFunc(job)
            if err == nil {
                jm.finalizerExpectations.finalizerRemovalObserved(key, string(curPod.UID))
            }
        }
        jm.enqueueControllerPodUpdate(job, immediate)
    }
}

We can end up with a scenario where:

  • Pod A fails, triggering a resync. Since the pod has failed, backoff is calculated and applied.
  • Finalizer is removed from Pod A and job status is updated.
  • We receive the job update event, but the pod informer cache is stale. This triggers a sync.
  • We notice that Pod A has the status Failed, so the backoff is applied again.

I'll send a PR to fix both issues.
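For illustration of point 1, a minimal standalone sketch (a sketch only: it assumes a per-item exponential rate limiter shaped like the documented 10s base with a 6-minute cap, and the key name is hypothetical) showing that once a key is forgotten, the next delay drops back to the base instead of continuing the exponential series.

// Demonstrates how Forget resets a workqueue item's exponential backoff.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Exponential per-item rate limiter: 10s base, capped at 6 minutes.
	rl := workqueue.NewItemExponentialFailureRateLimiter(10*time.Second, 6*time.Minute)
	key := "default/backoff-exp" // hypothetical job key

	for i := 0; i < 4; i++ {
		fmt.Println(rl.When(key)) // 10s 20s 40s 1m20s
	}

	// Forget drops the per-item failure count, so the next When starts
	// over at the base delay.
	rl.Forget(key)
	fmt.Println(rl.When(key)) // 10s again
}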

@nikhita
Member

nikhita commented Dec 15, 2022

/kind regression

Created a PR - #114516

@k8s-ci-robot k8s-ci-robot added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Dec 15, 2022
@alculquicondor
Member

This is broken when the JobTrackingWithFinalizers feature gate is enabled.

Did you confirm in older versions of k8s?

@alculquicondor
Member

IIUC, this issue is not exclusive to job tracking with finalizers.

Let's backport to all supported versions of k8s (back to 1.23).

@nikhita
Member

nikhita commented Jan 17, 2023

All cherry-picks have merged.

Keeping this issue open to fix the backoff for jobs with parallelism > 1. This would be fixed by #114768.

/assign @sathyanarays

@k8s-ci-robot
Contributor

@nikhita: GitHub didn't allow me to assign the following users: sathyanarays.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

All cherry-picks have merged.

Keeping this issue open to fix the backoff for jobs with parallelism > 1. This would be fixed by #114768.

/assign @sathyanarays

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sathyanarays
Contributor

/assign

@liggitt
Member

liggitt commented Feb 24, 2023

Keeping this issue open to fix the backoff for jobs with parallelism > 1. This would be fixed by #114768.

Are master and release branches still in a regressed state, or is this a follow-up to fix/improve a pre-existing issue? If any regressions are now resolved, can we track #114768 in a new issue?

@nikhita
Member

nikhita commented Feb 24, 2023

Are master and release branches still in a regressed state

master and release branches have been fixed for parallelism = 1 but are still in a regressed state for parallelism > 1.

#114768 will fix the issue for parallelism > 1.

@liggitt
Member

liggitt commented Feb 24, 2023

#114768 is a really big change... is there a more scoped change we can backport with lower risk?

@nikhita
Member

nikhita commented Feb 24, 2023

#114768 is a really big change... is there a more scoped change we can backport with lower risk?

#114768 is only expected to land on master and not intended to be backported. The issue with parallelism > 1 has been around for a while so we aren't treating it as a regression, and aren't intending to backport a fix.

The fix for parallelism > 1 is pretty involved and we decided in #114516 that a tightly scoped change might not be possible.

ref: #114516 (comment)

Happy to track #114768 in a new issue if it's easier though.

@liggitt
Member

liggitt commented Feb 24, 2023

The issue with parallelism > 1 has been around for a while so we aren't treating it as a regression, and aren't intending to backport a fix.

That's helpful to know... having one issue for the regression fixed in #114516 and one for the other issue still to be fixed might be helpful

@alculquicondor
Member

let's track it separately

/close

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

let's track it separately

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
