
Job failure back-off delay and limit are not configurable #114651

Open
sathyanarays opened this issue Dec 22, 2022 · 15 comments
Assignees
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
wg/batch: Categorizes an issue or PR as relevant to WG Batch.

Comments

@sathyanarays
Contributor

What happened?

The job failure back-off parameters are computed from hardcoded constants in the job controller. There is no way to override these back-off parameters at the Job level.

What did you expect to happen?

The user should be able to provide back-off parameters as part of the Job spec.

How can we reproduce it (as minimally and precisely as possible)?

NA

Anything else we need to know?

No response

Kubernetes version

NA

Cloud provider

NA


@sathyanarays added the kind/bug label on Dec 22, 2022
@k8s-ci-robot added the needs-sig label on Dec 22, 2022
@k8s-ci-robot
Contributor

@sathyanarays: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage label on Dec 22, 2022
@sathyanarays
Contributor Author

/wg batch

@k8s-ci-robot added the wg/batch label and removed the needs-sig label on Dec 22, 2022
@sathyanarays
Contributor Author

Related to #114391 and #114650.

@alculquicondor
Member

These could be good additions to the PodFailurePolicy struct:

type PodFailurePolicy struct {

@liggitt
Member

liggitt commented Dec 24, 2022

/remove-kind bug
/kind feature

@k8s-ci-robot added the kind/feature label and removed the kind/bug label on Dec 24, 2022
@zeusbee

zeusbee commented Dec 27, 2022

/assign

@alculquicondor
Member

@zeusbee note that this change requires a KEP

@xadhix

xadhix commented Mar 26, 2023

Hi @alculquicondor @zeusbee, I created a KEP for this. Could you please take a look at it?

@alculquicondor
Member

@sathyanarays @xadhix, what configuration of backoff do you expect to use?

I wonder if it's feasible and good enough for you to just reduce the backoff that we currently have, or make it purely exponential: 1s, 2s, 4s, 8s, etc., as suggested by @mimowo in another thread.

Since the backoff wasn't properly working until recently, I don't expect users to currently rely on the existing delays.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 22, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Mar 22, 2024
@mimowo
Contributor

mimowo commented Sep 12, 2024

/reopen
/remove-lifecycle rotten
Reopening to see if there is still interest in the community in doing this.

One use case it could have is to speed up the Job e2e tests, as some test cases wait 10s or more for the replacement Pods, significantly impacting their execution time:

  • "should execute all indexes despite some failing when using backoffLimitPerIndex" - 33s
  • "should run a job to completion when tasks sometimes fail and are locally restarted" - 48s
  • "should run a job to completion when tasks sometimes fail and are not locally restarted" - 1min 13s
  • "should fail to exceed backoffLimit" - 30s

From this use case's perspective, it would be enough to allow configuring the DefaultJobPodFailureBackOff value per Job (today the controller applies the same constant to every Job). I think we could start with an API like spec.podFailureBackoff.baseSeconds, and for safety we could require it to be >= 1s.

/cc @atiratree @alculquicondor @tenzen-y @soltysh @sathyanarays

@k8s-ci-robot
Contributor

@mimowo: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot reopened this on Sep 12, 2024
@k8s-ci-robot removed the lifecycle/rotten label on Sep 12, 2024