Job failure policy controller support #51153
Conversation
Hi @clamoriniere1A. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with the command listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @soltysh
/ok-to-test
Force-pushed from 086943d to d4c0433
Force-pushed from ee66e84 to b102ad2
Job failure policy integration in JobController. From `JobSpec.BackoffLimit` the JobController defines the backoff duration between Job retries. It uses the `workqueue.RateLimitingInterface` to store the number of "retries" as "requeues", and the default Job backoff initial duration is set during the initialization of the `workqueue.RateLimiter`. Since the number of retries for each job is stored in a local structure (`JobController.queue`), if the JobController restarts, the retry count is lost and the backoff duration resets to 0. Also adds an e2e test for the Job backoff failure policy.
Force-pushed from fc03232 to 1dbef2f
Rebased, so /lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: clamoriniere1A, soltysh Associated issue: 48075 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing
/retest Review the full test history for this PR.
/retest
@clamoriniere1A: The following test failed.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Automatic merge from submit-queue
…ckoff Automatic merge from submit-queue (batch tested with PRs 52091, 52071)

Bugfix: Improve how JobController uses the queue for backoff

**What this PR does / why we need it**: In some cases, the backoff delay for a given Job is reset unnecessarily. This PR improves how the JobController uses the queue for backoff:
- Centralize the key "forget" and "re-queue" process in only one method.
- Change the signature of the `syncJob` method so that it returns whether the backoff delay should be forgotten for a given key.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes # Links to #51153

**Special notes for your reviewer**:

**Release note**:
```release-note
```
The e2e test that was added in this PR ([sig-apps] Job should exceed backoffLimit) seems to be flaking recently: https://k8s-testgrid.appspot.com/release-1.8-blocking#gci-gce-1.8.
@nikhiljindal yes, sure. Looking at some flaky iterations, it seems that the timeout used in the test while waiting for the job status to become "Failed" is too short. Kubelet takes more than 30s to properly start the Pod... (search "backofflimit-q57jk" in the logs: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-stable1/111/artifacts/bootstrap-e2e-minion-group-2292/kubelet.log) I will open a PR that updates this e2e test to use the generic timeout duration used in other tests.
This fix is linked to PR kubernetes#51153, which introduced `JobSpec.BackoffLimit`. Previously the timeout used in the test was too aggressive and generated flaky test executions. Now it uses the default `framework.JobTimeout` used in other tests.
…flaky Automatic merge from submit-queue

Bugfix: Fix e2e flaky Apps/Job BackoffLimit test

This fix is linked to PR #51153, which introduced `JobSpec.BackoffLimit`. Previously the timeout used in the test was too aggressive and generated flaky test executions. Now it uses the default `framework.JobTimeout` used in other tests.

**What this PR does / why we need it**: This PR should fix the flaky "[sig-apps] Job should exceed backoffLimit" test, which failed due to a too-short timeout duration.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #51153

**Special notes for your reviewer**:

**Release note**:
```release-note
```
What this PR does / why we need it:
Start implementing support for the "Backoff policy and failed pod limit" in the JobController, as defined in kubernetes/community#583. This PR depends on a previous PR, #48075, that updates the K8s API types.
TODO:
- `JobSpec.BackoffLimit` support

implements kubernetes/community#583
Special notes for your reviewer:
Release note:
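For context, a minimal Job manifest exercising the `backoffLimit` field this work implements (the metadata name, image, and limit value are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backofflimit-example
spec:
  backoffLimit: 3          # mark the Job Failed after 3 retries
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: fail
        image: busybox
        command: ["sh", "-c", "exit 1"]
```

With a container that always exits non-zero, the controller applies the exponential backoff between retries and fails the Job once the limit is exceeded, which is the behavior the e2e test "[sig-apps] Job should exceed backoffLimit" verifies.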