
Job failure back-off delay and limit are not configurable #114651

Open
sathyanarays opened this issue Dec 22, 2022 · 15 comments
Assignees
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
wg/batch: Categorizes an issue or PR as relevant to WG Batch.

Comments

@sathyanarays
Contributor

What happened?

The job failure back-off parameters are computed from hardcoded constants in the job controller. There is no way to override these back-off parameters at the Job level.

What did you expect to happen?

The user should be able to provide back-off parameters as part of the Job spec.

How can we reproduce it (as minimally and precisely as possible)?

NA

Anything else we need to know?

No response

Kubernetes version

NA

Cloud provider

NA


@sathyanarays added the kind/bug label on Dec 22, 2022
@k8s-ci-robot added the needs-sig label on Dec 22, 2022
@k8s-ci-robot
Contributor

@sathyanarays: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage label on Dec 22, 2022
@sathyanarays
Contributor Author

/wg batch

@k8s-ci-robot added the wg/batch label and removed the needs-sig label on Dec 22, 2022
@sathyanarays
Contributor Author

Related to #114391 and #114650.

@alculquicondor
Member

These could be good additions to the PodFailurePolicy struct:

type PodFailurePolicy struct {

@liggitt
Member

liggitt commented Dec 24, 2022

/remove-kind bug
/kind feature

@k8s-ci-robot added the kind/feature label and removed the kind/bug label on Dec 24, 2022
@zeusbee

zeusbee commented Dec 27, 2022

/assign

@alculquicondor
Member

@zeusbee note that this change requires a KEP

@xadhix

xadhix commented Mar 26, 2023

Hi @alculquicondor @zeusbee, I created a KEP for this. Could you please take a look at it?

@alculquicondor
Member

@sathyanarays @xadhix, what configuration of backoff do you expect to use?

I wonder if it's feasible and good enough for you to just reduce the backoff that we currently have, or make it purely exponential: 1s, 2s, 4s, 8s, etc., as suggested by @mimowo in another thread.

Since the backoff wasn't properly working until recently, I don't expect users to currently rely on the existing delays.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 22, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Mar 22, 2024
@mimowo
Contributor

mimowo commented Sep 12, 2024

/reopen
/remove-lifecycle rotten
Reopening to see if there is still interest in the community in doing this.

One use case it could have is to speed up the Job e2e tests, as some test cases wait 10s or more for the replacement Pods, significantly impacting their execution time:

  • "should execute all indexes despite some failing when using backoffLimitPerIndex" - 33s
  • "should run a job to completion when tasks sometimes fail and are locally restarted" - 48s
  • "should run a job to completion when tasks sometimes fail and are not locally restarted" - 1min 13s
  • "should fail to exceed backoffLimit" - 30s

From this use case's perspective, it would be enough to allow configuring the DefaultJobPodFailureBackOff value per Job (today the controller applies the same constant to every Job). I think we could start with an API like spec.podFailureBackoff.baseSeconds, and for safety we could require it to be >= 1s.

/cc @atiratree @alculquicondor @tenzen-y @soltysh @sathyanarays

@k8s-ci-robot
Contributor

@mimowo: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot reopened this on Sep 12, 2024
@k8s-ci-robot removed the lifecycle/rotten label on Sep 12, 2024