Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Throw an error from jobReady() if the job exceeds its BackoffLimit #9950

Merged
merged 2 commits into from
Jul 13, 2023

Conversation

jeffrosenberg
Copy link
Contributor

Closes #9285

Signed-off-by: Rosenberg, Jeff jeff.rosenberg@icfnext.com

What this PR does / why we need it:

As noted in #9285, if --wait-for-jobs is specified, and a job exceeds its backoff limit, Helm will always wait for the full timeout before failing, even though there is no chance of its succeeding. This PR modifies the jobReady() function to return an error. Based on the comments on k8s.io/apimachinery/pkg/util/wait#PollImmediateUntil, it appears that returning an error should stop that function from polling.

Special notes for your reviewer:

I've updated the tests in ready_test.go, but there is no test coverage in client_test.go to validate that this works as expected. Ideally this would have a test in client_test.go, but these comments indicate that there haven't been tests for waits since the feature was introduced.

If applicable:

  • this PR contains documentation
  • this PR contains unit tests
  • this PR has been tested for backwards compatibility

Closes helm#9285

Signed-off-by: Rosenberg, Jeff <jeff.rosenberg@icfnext.com>
@helm-bot helm-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 16, 2021
@jeffrosenberg
Copy link
Contributor Author

In local testing, this speeds up how quickly we fail, but not yet to the extent I'd like. This still waits for all deployments to be ready, and then once the failed job is the last thing we're waiting for, it fails immediately rather than spinning on the failed job until reaching the timeout. That's still a significant improvement, but I'd love to figure out if there's a way to fail the install immediately on job failure.

@zonggen
Copy link
Member

zonggen commented Sep 10, 2021

This still waits for all deployments to be ready, and then once the failed job is the last thing we're waiting for

Yes because the ResourceList ordering: https://github.com/helm/helm/blob/main/pkg/kube/wait.go#L54

Test cases passed locally as expected. I think this PR is sane and able to fix the original issue.

@jeffrosenberg
Copy link
Contributor Author

Thanks for the review @zonggen; FYI, I've resolved the merge conflict.

Copy link
Contributor

@joejulian joejulian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, but I'm not an approver.

@stuartdrennan
Copy link
Contributor

End user test

chart.zip

Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl",  "-Mbignum=bpi", "-wle", "sleep(30) && print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 0

Baseline Test

Released Helm version 3.9.1

helm install d . --wait-for-jobs --wait --debug --timeout 60s

client.go:128: [debug] creating 1 resource(s)
wait.go:48: [debug] beginning wait for 1 resources with timeout of 1m0s
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:231: [debug] Job is not completed: default/pi
ready.go:227: [debug] Job is failed: default/pi
ready.go:227: [debug] Job is failed: default/pi
ready.go:227: [debug] Job is failed: default/pi
ready.go:227: [debug] Job is failed: default/pi
ready.go:227: [debug] Job is failed: default/pi
ready.go:227: [debug] Job is failed: default/pi
ready.go:227: [debug] Job is failed: default/pi
ready.go:227: [debug] Job is failed: default/pi
ready.go:227: [debug] Job is failed: default/pi
ready.go:227: [debug] Job is failed: default/pi
ready.go:227: [debug] Job is failed: default/pi
Error: INSTALLATION FAILED: timed out waiting for the condition
helm.go:84: [debug] timed out waiting for the condition
INSTALLATION FAILED
main.newInstallCmd.func2
        helm.sh/helm/v3/cmd/helm/install.go:127
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.4.0/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.4.0/command.go:974
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.4.0/command.go:902
main.main
        helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
        runtime/proc.go:255
runtime.goexit
        runtime/asm_arm64.s:1133

Summary

Job failed but Helm waited until timeout before exiting. Not the desired response but expected.
Same result with backoffLimit: 0 and backoffLimit: 1

PR Test

install.go:192: [debug] Original chart version: ""
install.go:209: [debug] CHART PATH: /Users/sdrennan/code/tmp/gg/chart

client.go:128: [debug] creating 1 resource(s)
wait.go:66: [debug] beginning wait for 1 resources with timeout of 1m0s
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:235: [debug] Job is not completed: default/pi
ready.go:230: [debug] Job is failed: default/pi
Error: INSTALLATION FAILED: job is failed: default/pi
helm.go:84: [debug] job is failed: default/pi
INSTALLATION FAILED
main.newInstallCmd.func2
        helm.sh/helm/v3/cmd/helm/install.go:141
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.5.0/command.go:872
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.5.0/command.go:990
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.5.0/command.go:918
main.main
        helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
        runtime/proc.go:250
runtime.goexit
        runtime/asm_arm64.s:1165

Summary

Helm exited correctly within timeout period
Same result with backoffLimit: 0 and backoffLimit: 1

@jouve
Copy link
Contributor

jouve commented Oct 31, 2022

another blocker for this PR ?

I could help if needed :)

@joejulian joejulian added this to the 3.10.2 milestone Oct 31, 2022
@mattfarina mattfarina modified the milestones: 3.10.2, 3.10.3 Nov 10, 2022
@hickeyma hickeyma modified the milestones: 3.10.3, 3.11.0 Dec 14, 2022
@mattfarina mattfarina modified the milestones: 3.11.0, 3.12.0 Jan 18, 2023
@joejulian joejulian modified the milestones: 3.12.0, 3.13.0 May 5, 2023
@armstrongli
Copy link

this PR works for me. 👍

@armstrongli
Copy link

@mattfarina , need your help on this PR?

@armstrongli
Copy link

@jouve , need your help on this PR

@armstrongli
Copy link

/assign mattfarina

@joejulian joejulian merged commit 99e1dce into helm:main Jul 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

--wait-for-jobs does not fails if job failed
9 participants