
Job back-offs get reset on controller-manager restart #114650

Closed

sathyanarays opened this issue Dec 22, 2022 · 10 comments
Labels
  • kind/bug Categorizes issue or PR as related to a bug.
  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • wg/batch Categorizes an issue or PR as relevant to WG Batch.

Comments

@sathyanarays
Contributor

What happened?

When a pod of a Job fails, an exponential back-off kicks in before the next pod is created. The back-off logic is delegated to the controller's in-memory workqueue. When the controller-manager restarts, the workqueue state is lost, so the back-off is recomputed as if the Job were brand new.
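
A minimal sketch of the mechanics, not the actual controller code: the Job controller hands retry delays to an in-memory exponential-failure rate limiter on its workqueue, so the accumulated failure count exists only in process memory. The 10s/360s delays below are assumed for illustration.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Exponential back-off keyed by item, held entirely in memory.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(10*time.Second, 360*time.Second)

	key := "default/failing-job"
	for i := 0; i < 4; i++ {
		// Each failed sync asks the limiter when to retry; the delay doubles: 10s, 20s, 40s, 80s.
		fmt.Printf("delay after failure %d: %v\n", i+1, limiter.When(key))
	}

	// A controller-manager restart constructs a brand-new limiter, so the next
	// retry is treated like the first one and gets the base delay again.
	restarted := workqueue.NewItemExponentialFailureRateLimiter(10*time.Second, 360*time.Second)
	fmt.Println("delay after restart:", restarted.When(key))
}
```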

What did you expect to happen?

Pods should be created with the correct back-off even when the controller-manager restarts.

How can we reproduce it (as minimally and precisely as possible)?

  1. Start a Job that always fails (see the sketch after this list).
  2. Let the Job fail until a visible back-off builds up.
  3. Stop and restart the controller-manager container.
  4. Observe that a new pod is scheduled immediately after the restart, irrespective of the previously computed back-off.
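
For step 1, any Job whose pods always exit non-zero works; a plain YAML manifest applied with kubectl is enough. As one hedged example, the client-go program below creates such a Job (names, namespace, and kubeconfig path are illustrative); the growing back-off can then be watched with `kubectl get pods -w`.

```go
package main

import (
	"context"
	"os"
	"path/filepath"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	home, _ := os.UserHomeDir()
	config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(home, ".kube", "config"))
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	backoffLimit := int32(6) // enough retries to make the growing back-off visible
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "always-fails"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "fail",
						Image:   "busybox",
						Command: []string{"sh", "-c", "exit 1"},
					}},
				},
			},
		},
	}

	// Create the always-failing Job in the default namespace.
	if _, err := client.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```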

Anything else we need to know?

No response

Kubernetes version

In master as of commit d2504c9
Also present in the latest release.

Cloud provider

NA

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@sathyanarays sathyanarays added the kind/bug Categorizes issue or PR as related to a bug. label Dec 22, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 22, 2022
@k8s-ci-robot
Contributor

@sathyanarays: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sathyanarays
Contributor Author

/wg batch

@k8s-ci-robot k8s-ci-robot added wg/batch Categorizes an issue or PR as relevant to WG Batch. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 22, 2022
@sathyanarays
Contributor Author

related to #114391 and #114651

@alculquicondor
Member

There is a more immediate problem for me:
All job syncs are tied to the workqueue delays. We should only be delaying pod creation, not job status updates. This is especially important when the job hits the backoffLimit: we would wait X seconds before declaring the Job failed, even though that could be done immediately.

I think the solution is fairly simple: keep track of the elapsed time within the sync instead of in the workqueue, and restrict only pod creation. The time should be measured from the Pod status so that it is resilient to restarts. But even if we can't do that immediately, just tracking it in memory wouldn't be a bad start.
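
A minimal sketch of that idea, not the actual fix: derive the back-off from observable Pod status (how many pods have failed and when the latest failure finished) instead of from workqueue state, and apply the delay only to pod creation. The 10s base and 6-minute cap mirror the documented Job back-off policy; the helper name and signature are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseBackoff = 10 * time.Second
	maxBackoff  = 6 * time.Minute
)

// remainingBackoff returns how much longer pod creation should be delayed,
// given the number of observed pod failures and the time of the latest one.
// Because both inputs come from the API server, the result is the same
// before and after a controller-manager restart.
func remainingBackoff(failures int32, lastFailure, now time.Time) time.Duration {
	if failures == 0 {
		return 0
	}
	backoff := maxBackoff
	if failures < 7 { // 10s * 2^6 already exceeds the 6m cap
		backoff = baseBackoff * time.Duration(1<<uint(failures-1))
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	if elapsed := now.Sub(lastFailure); elapsed < backoff {
		return backoff - elapsed
	}
	return 0
}

func main() {
	last := time.Now().Add(-15 * time.Second)
	// Three failures => 40s back-off; ~15s have elapsed, so wait ~25s more.
	fmt.Println(remainingBackoff(3, last, time.Now()))
}
```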

@alculquicondor
Member

cc @mimowo

@sathyanarays
Contributor Author

@alculquicondor, @mimowo, please provide early feedback on whether these changes are in the right direction!

#114684

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 24, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 23, 2023
@alculquicondor
Member

/close

I can't remember where we said this, but I think we agreed that handling this is overkill.

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

/close

I can't remember where we said this, but I think we agreed that handling this is overkill.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
