
Job back-offs get reset on controller-manager restart #114650

Closed

sathyanarays opened this issue Dec 22, 2022 · 10 comments
Labels
  • kind/bug Categorizes issue or PR as related to a bug.
  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • wg/batch Categorizes an issue or PR as relevant to WG Batch.

Comments

@sathyanarays
Contributor

What happened?

When a pod of a Job fails, an exponential back-off kicks in before the next pod is created. The back-off logic is delegated to the controller's in-memory workqueue. When the controller-manager restarts, the workqueue state is lost, so the back-off is recomputed as if the Job were brand new.
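
A minimal sketch of the mechanics, not the actual controller code: the Job controller hands retry delays to an in-memory exponential-failure rate limiter on its workqueue, so the accumulated failure count exists only in process memory. The 10s/360s delays below are assumed for illustration.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Exponential back-off keyed by item, held entirely in memory.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(10*time.Second, 360*time.Second)

	key := "default/failing-job"
	for i := 0; i < 4; i++ {
		// Each failed sync asks the limiter when to retry; the delay doubles: 10s, 20s, 40s, 80s.
		fmt.Printf("delay after failure %d: %v\n", i+1, limiter.When(key))
	}

	// A controller-manager restart constructs a brand-new limiter, so the next
	// retry is treated like the first one and gets the base delay again.
	restarted := workqueue.NewItemExponentialFailureRateLimiter(10*time.Second, 360*time.Second)
	fmt.Println("delay after restart:", restarted.When(key))
}
```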

What did you expect to happen?

Pods should be created with the correct back-off even when the controller-manager restarts.

How can we reproduce it (as minimally and precisely as possible)?

  1. Start a Job that always fails (see the sketch after this list).
  2. Let the Job fail until a visible back-off builds up.
  3. Stop and restart the controller-manager container.
  4. Observe that a new pod is scheduled immediately after the restart, irrespective of the previously computed back-off.
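
For step 1, any Job whose pods always exit non-zero works; a plain YAML manifest applied with kubectl is enough. As one hedged example, the client-go program below creates such a Job (names, namespace, and kubeconfig path are illustrative); the growing back-off can then be watched with `kubectl get pods -w`.

```go
package main

import (
	"context"
	"os"
	"path/filepath"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	home, _ := os.UserHomeDir()
	config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(home, ".kube", "config"))
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	backoffLimit := int32(6) // enough retries to make the growing back-off visible
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "always-fails"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "fail",
						Image:   "busybox",
						Command: []string{"sh", "-c", "exit 1"},
					}},
				},
			},
		},
	}

	// Create the always-failing Job in the default namespace.
	if _, err := client.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```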

Anything else we need to know?

No response

Kubernetes version

In master as of commit d2504c9
Also present in the latest release.

Cloud provider

NA

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@sathyanarays sathyanarays added the kind/bug Categorizes issue or PR as related to a bug. label Dec 22, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 22, 2022
@k8s-ci-robot
Contributor

@sathyanarays: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sathyanarays
Contributor Author

/wg batch

@k8s-ci-robot k8s-ci-robot added wg/batch Categorizes an issue or PR as relevant to WG Batch. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 22, 2022
@sathyanarays
Contributor Author

related to #114391 and #114651

@alculquicondor
Member

There is a more immediate problem for me:
All job syncs are tied to the workqueue delays. We should only be delaying pod creation, not job status updates. This is especially important when the job hits the backoffLimit: we would wait X seconds before declaring the Job failed, even though that could be done immediately.

I think the solution is fairly simple: keep track of the elapsed time within the sync instead of in the workqueue, and restrict only pod creation. The time should be measured from the Pod status so that it is resilient to restarts. But even if we can't do that immediately, just tracking it in memory wouldn't be a bad start.
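
A minimal sketch of that idea, not the actual fix: derive the back-off from observable Pod status (how many pods have failed and when the latest failure finished) instead of from workqueue state, and apply the delay only to pod creation. The 10s base and 6-minute cap mirror the documented Job back-off policy; the helper name and signature are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseBackoff = 10 * time.Second
	maxBackoff  = 6 * time.Minute
)

// remainingBackoff returns how much longer pod creation should be delayed,
// given the number of observed pod failures and the time of the latest one.
// Because both inputs come from the API server, the result is the same
// before and after a controller-manager restart.
func remainingBackoff(failures int32, lastFailure, now time.Time) time.Duration {
	if failures == 0 {
		return 0
	}
	backoff := maxBackoff
	if failures < 7 { // 10s * 2^6 already exceeds the 6m cap
		backoff = baseBackoff * time.Duration(1<<uint(failures-1))
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	if elapsed := now.Sub(lastFailure); elapsed < backoff {
		return backoff - elapsed
	}
	return 0
}

func main() {
	last := time.Now().Add(-15 * time.Second)
	// Three failures => 40s back-off; ~15s have elapsed, so wait ~25s more.
	fmt.Println(remainingBackoff(3, last, time.Now()))
}
```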

@alculquicondor
Member

cc @mimowo

@sathyanarays
Contributor Author

@alculquicondor, @mimowo, please provide early feedback on whether these changes are in the right direction!

#114684

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 24, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 23, 2023
@alculquicondor
Member

/close

I can't remember where we said this, but I think we agreed that handling this is overkill.

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

/close

I can't remember where we said this, but I think we agreed that handling this is overkill.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
