Job failure policy controller support #51153

clamoriniere1A · 2017-08-22T22:36:34Z

What this PR does / why we need it:
Start implementing the support of the "Backoff policy and failed pod limit" in the JobController defined in kubernetes/community#583.
This PR depends on a previous PR #48075 that updates the K8s API types.

TODO:

Implement JobSpec.BackoffLimit support
Rebase when [API] Feature/job failure policy #48075 has been merged.
Implement end2end tests

implements kubernetes/community#583

Special notes for your reviewer:

Release note:

Add backoff policy and failed pod limit for a job

k8s-ci-robot · 2017-08-22T22:36:47Z

Hi @clamoriniere1A. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

clamoriniere1A · 2017-08-22T22:37:24Z

/assign @soltysh

dims · 2017-08-25T12:36:55Z

/ok-to-test
/release-note-none

Job failure policy integration in JobController. From the JobSpec.BackoffLimit the JobController will define the backoff duration between Job retry. It use the ```workqueue.RateLimitingInterface``` to store the number of "retry" as "requeue" and the default Job backoff initial duration is set during the initialization of the ```workqueue.RateLimiter. Since the number of retry for each job is store in a local structure "JobController.queue" if the JobController restarts the number of retries will be lost and the backoff duration will be reset to 0. Add e2e test for Job backoff failure policy

soltysh · 2017-09-03T10:11:25Z

Rebased so

/lgtm
/approve

k8s-github-robot · 2017-09-03T10:11:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clamoriniere1A, soltysh

Associated issue: 48075

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

~~pkg/controller/job/OWNERS~~ [soltysh]
~~test/OWNERS~~ [soltysh]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

fejta-bot · 2017-09-03T16:09:50Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

soltysh · 2017-09-03T18:22:38Z

/retest

k8s-ci-robot · 2017-09-03T20:08:20Z

@clamoriniere1A: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
pull-kubernetes-e2e-kops-aws	`1dbef2f`	link	`/test pull-kubernetes-e2e-kops-aws`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-github-robot · 2017-09-03T20:13:17Z

Automatic merge from submit-queue

…ckoff Automatic merge from submit-queue (batch tested with PRs 52091, 52071) Bugfix: Improve how JobController use queue for backoff **What this PR does / why we need it**: In some cases, the backoff delay for a given Job is reset unnecessarily. the PR improves how JobController uses queue for backoff: - Centralize the key "forget" and "re-queue" process in only on method. - Change the signature of the syncJob method in order to return the information if it is necessary to forget the backoff delay for a given key. **Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes # Links to #51153 **Special notes for your reviewer**: **Release note**: ```release-note ```

nikhiljindal · 2017-09-13T21:34:16Z

The e2e test that was added in this PR ([sig-apps] Job should exceed backoffLimit) seems to be flaking recently: https://k8s-testgrid.appspot.com/release-1.8-blocking#gci-gce-1.8.
@clamoriniere1A Can you please take a look?

clamoriniere1A · 2017-09-14T00:21:32Z

@nikhiljindal yes sure

Looking at some flaky iterations, it seems that the timeout used in the test for waiting that the job status becomes "Failed" is too short. Kubelet takes more than the 30sec to start properly the Pod... (search "backofflimit-q57jk" in logs https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-stable1/111/artifacts/bootstrap-e2e-minion-group-2292/kubelet.log)

I will do a PR that updates this e2e test with the generic timeout duration used in others tests.

This fix is linked to the PR kubernetes#51153 that introduce the JobSpec.BackoffLimit. Previously the Timeout used in the test was too agressive and generates flaky test execution. Now it used the default framework.JobTimeout used in others tests.

…flaky Automatic merge from submit-queue Bugfix: Fix e2e Flaky Apps/Job BackoffLimit test This fix is linked to the PR #51153 that introduce the `JobSpec.BackoffLimit`. Previously the Timeout used in the test was too aggressive and generates flaky test execution. Now it used the default `framework.JobTimeout` used in others tests. **What this PR does / why we need it**: This PR should fix flaky "[sig-apps] Job should exceed backoffLimit" test, due to a too short timeout duration. **Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes # fixes #51153 **Special notes for your reviewer**: **Release note**: ```release-note ```

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 22, 2017

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 22, 2017

k8s-github-robot assigned janetkuo and piosz Aug 22, 2017

k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Aug 22, 2017

k8s-ci-robot assigned soltysh Aug 22, 2017

piosz requested review from sttts and removed request for sttts August 23, 2017 09:09

piosz removed their assignment Aug 23, 2017

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. release-note-label-needed labels Aug 25, 2017

clamoriniere1A force-pushed the feature/job_failure_policy_controller branch from 086943d to d4c0433 Compare August 28, 2017 11:55

clamoriniere1A force-pushed the feature/job_failure_policy_controller branch 6 times, most recently from ee66e84 to b102ad2 Compare August 28, 2017 14:12

soltysh added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Aug 29, 2017

k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 3, 2017

soltysh force-pushed the feature/job_failure_policy_controller branch from fc03232 to 1dbef2f Compare September 3, 2017 10:09

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 3, 2017

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 3, 2017

clamoriniere mentioned this pull request Sep 3, 2017

[API] Feature/job failure policy #48075

Merged

k8s-github-robot merged commit b63abc9 into kubernetes:master Sep 3, 2017

This was referenced Sep 5, 2017

Failure policy for Jobs kubernetes/enhancements#298

Closed

Update Job workload documentation with backoff failure policy kubernetes/website#5319

Merged

Bugfix: Improve how JobController use queue for backoff #52091

Merged

soltysh mentioned this pull request Sep 8, 2017

Limit retries for a (Cron)Job #43512

Closed

soltysh mentioned this pull request Sep 8, 2017

Job being constanly recreated despite RestartPolicy: Never #20255

Closed

clamoriniere1A mentioned this pull request Sep 14, 2017

Bugfix: Fix e2e Flaky Apps/Job BackoffLimit test #52476

Merged

soltysh mentioned this pull request Jan 17, 2018

ProgressDeadlineSeconds for a job #58384

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Job failure policy controller support #51153

Job failure policy controller support #51153

clamoriniere1A commented Aug 22, 2017 •

edited by 0xmichalis

k8s-ci-robot commented Aug 22, 2017

clamoriniere1A commented Aug 22, 2017

dims commented Aug 25, 2017

soltysh commented Sep 3, 2017

k8s-github-robot commented Sep 3, 2017

fejta-bot commented Sep 3, 2017

soltysh commented Sep 3, 2017

k8s-ci-robot commented Sep 3, 2017 •

edited

k8s-github-robot commented Sep 3, 2017

nikhiljindal commented Sep 13, 2017

clamoriniere1A commented Sep 14, 2017 •

edited

Job failure policy controller support #51153

Job failure policy controller support #51153

Conversation

clamoriniere1A commented Aug 22, 2017 • edited by 0xmichalis

k8s-ci-robot commented Aug 22, 2017

clamoriniere1A commented Aug 22, 2017

dims commented Aug 25, 2017

soltysh commented Sep 3, 2017

k8s-github-robot commented Sep 3, 2017

fejta-bot commented Sep 3, 2017

soltysh commented Sep 3, 2017

k8s-ci-robot commented Sep 3, 2017 • edited

k8s-github-robot commented Sep 3, 2017

nikhiljindal commented Sep 13, 2017

clamoriniere1A commented Sep 14, 2017 • edited

clamoriniere1A commented Aug 22, 2017 •

edited by 0xmichalis

k8s-ci-robot commented Sep 3, 2017 •

edited

clamoriniere1A commented Sep 14, 2017 •

edited