Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Job failure policy controller support #51153

Conversation

clamoriniere1A
Copy link
Contributor

@clamoriniere1A clamoriniere1A commented Aug 22, 2017

What this PR does / why we need it:
Start implementing the support of the "Backoff policy and failed pod limit" in the JobController defined in kubernetes/community#583.
This PR depends on a previous PR #48075 that updates the K8s API types.

TODO:

implements kubernetes/community#583

Special notes for your reviewer:

Release note:

Add backoff policy and failed pod limit for a job

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 22, 2017
@k8s-ci-robot
Copy link
Contributor

Hi @clamoriniere1A. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 22, 2017
@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Aug 22, 2017
@clamoriniere1A
Copy link
Contributor Author

/assign @soltysh

@piosz piosz requested review from sttts and removed request for sttts August 23, 2017 09:09
@piosz piosz removed their assignment Aug 23, 2017
@dims
Copy link
Member

dims commented Aug 25, 2017

/ok-to-test
/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. release-note-label-needed labels Aug 25, 2017
@clamoriniere1A clamoriniere1A force-pushed the feature/job_failure_policy_controller branch from 086943d to d4c0433 Compare August 28, 2017 11:55
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 28, 2017
@clamoriniere1A clamoriniere1A force-pushed the feature/job_failure_policy_controller branch 6 times, most recently from ee66e84 to b102ad2 Compare August 28, 2017 14:12
@soltysh soltysh added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Aug 29, 2017
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 3, 2017
Job failure policy integration in JobController. From the
JobSpec.BackoffLimit the JobController will define the backoff
duration between Job retry.

It use the ```workqueue.RateLimitingInterface``` to store the number of
"retry" as "requeue" and the default Job backoff initial duration is set
during the initialization of the ```workqueue.RateLimiter.

Since the number of retry for each job is store in a local structure
"JobController.queue" if the JobController restarts the number of retries
will be lost and the backoff duration will be reset to 0.

Add e2e test for Job backoff failure policy
@soltysh soltysh force-pushed the feature/job_failure_policy_controller branch from fc03232 to 1dbef2f Compare September 3, 2017 10:09
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 3, 2017
@k8s-github-robot k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Sep 3, 2017
@soltysh
Copy link
Contributor

soltysh commented Sep 3, 2017

Rebased so

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 3, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clamoriniere1A, soltysh

Associated issue: 48075

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@soltysh
Copy link
Contributor

soltysh commented Sep 3, 2017

/retest

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Sep 3, 2017

@clamoriniere1A: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-kops-aws 1dbef2f link /test pull-kubernetes-e2e-kops-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit b63abc9 into kubernetes:master Sep 3, 2017
k8s-github-robot pushed a commit that referenced this pull request Sep 8, 2017
…ckoff

Automatic merge from submit-queue (batch tested with PRs 52091, 52071)

Bugfix: Improve how JobController use queue for backoff

**What this PR does / why we need it**:

In some cases,  the backoff delay for a given Job is reset unnecessarily. 

the PR improves how JobController uses queue for backoff:
- Centralize the key "forget" and "re-queue" process in only on method.
- Change the signature of the syncJob method in order to return the
information if it is necessary to forget the backoff delay for a given
key.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
Links to #51153

**Special notes for your reviewer**:

**Release note**:

```release-note
```
@nikhiljindal
Copy link
Contributor

The e2e test that was added in this PR ([sig-apps] Job should exceed backoffLimit) seems to be flaking recently: https://k8s-testgrid.appspot.com/release-1.8-blocking#gci-gce-1.8.
@clamoriniere1A Can you please take a look?

@clamoriniere1A
Copy link
Contributor Author

clamoriniere1A commented Sep 14, 2017

@nikhiljindal yes sure

Looking at some flaky iterations, it seems that the timeout used in the test for waiting that the job status becomes "Failed" is too short. Kubelet takes more than the 30sec to start properly the Pod... (search "backofflimit-q57jk" in logs https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-stable1/111/artifacts/bootstrap-e2e-minion-group-2292/kubelet.log)

I will do a PR that updates this e2e test with the generic timeout duration used in others tests.

clamoriniere1A added a commit to clamoriniere1A/kubernetes that referenced this pull request Sep 14, 2017
This fix is linked to the PR kubernetes#51153 that introduce the
JobSpec.BackoffLimit.
Previously the Timeout used in the test was too agressive and generates
flaky test execution. Now it used the default framework.JobTimeout used
in others tests.
k8s-github-robot pushed a commit that referenced this pull request Sep 15, 2017
…flaky

Automatic merge from submit-queue

Bugfix: Fix e2e Flaky Apps/Job BackoffLimit test

This fix is linked to the PR #51153 that introduce the `JobSpec.BackoffLimit`.

Previously the Timeout used in the test was too aggressive and generates flaky test execution. Now it used the default `framework.JobTimeout` used in others tests.



**What this PR does / why we need it**:
This PR should fix flaky "[sig-apps] Job should exceed backoffLimit" test, due to a too short timeout duration.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
fixes #51153 

**Special notes for your reviewer**:

**Release note**:

```release-note
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants