Support handling of pod failures with respect to the configured rules #111113

Merged

Contributor

mimowo commented Jul 13, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces support for the podFailurePolicy job configuration. In particular, it lets users configure rules for
handling pod failures based on container exit codes and on the conditions of the failed pod.

Example job configuration using this feature

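apiVersion: batch/v1
kind: Job
metadata:
  name: example-job  # hypothetical name, added so the manifest is complete
spec:
  template:
    spec:
      restartPolicy: Never  # Jobs that set podFailurePolicy must use restartPolicy: Never
      containers:
      - name: job-container
        image: job-image
        command: ["./program"]
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: Ignore  # disrupted pods are not counted against backoffLimit
      onPodConditions:
      - type: DisruptionTarget
    - action: FailJob  # fail the whole Job when a container exits with one of these codes
      onExitCodes:
        operator: In
        values: [40, 41, 42]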
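
To illustrate how the two rules in the example might be applied, here is a minimal Go sketch. It is not the job controller code from this PR. It assumes rules are evaluated in order and the first matching rule decides the action, as the linked KEP describes. The types Rule and FailedPod, the ActionDefault placeholder, and the helpers matchAction and exampleRules are hypothetical names made up for this sketch, and details such as per-container matching are left out.

```go
package main

import "fmt"

// Action mirrors the action names used in the example configuration.
type Action string

const (
	ActionIgnore  Action = "Ignore"
	ActionFailJob Action = "FailJob"
	// ActionDefault is a hypothetical placeholder meaning "no rule matched",
	// i.e. the failure is handled the usual way and counts against backoffLimit.
	ActionDefault Action = "Default"
)

// Rule is a simplified, hypothetical stand-in for one podFailurePolicy rule.
type Rule struct {
	Action          Action
	OnExitCodesIn   []int32  // matches if any container exit code is in this list
	OnPodConditions []string // matches if the pod has any of these condition types
}

// FailedPod is a hypothetical summary of a failed pod.
type FailedPod struct {
	ExitCodes  []int32
	Conditions []string
}

func hasExitCode(codes, values []int32) bool {
	for _, c := range codes {
		for _, v := range values {
			if c == v {
				return true
			}
		}
	}
	return false
}

func hasCondition(conds, types []string) bool {
	for _, c := range conds {
		for _, t := range types {
			if c == t {
				return true
			}
		}
	}
	return false
}

// matchAction returns the action of the first rule that matches the pod,
// or ActionDefault when no rule matches.
func matchAction(rules []Rule, pod FailedPod) Action {
	for _, r := range rules {
		if hasExitCode(pod.ExitCodes, r.OnExitCodesIn) ||
			hasCondition(pod.Conditions, r.OnPodConditions) {
			return r.Action
		}
	}
	return ActionDefault
}

// exampleRules encodes the two rules from the example configuration.
func exampleRules() []Rule {
	return []Rule{
		{Action: ActionIgnore, OnPodConditions: []string{"DisruptionTarget"}},
		{Action: ActionFailJob, OnExitCodesIn: []int32{40, 41, 42}},
	}
}

func main() {
	pods := []FailedPod{
		{ExitCodes: []int32{42}},
		{ExitCodes: []int32{42}, Conditions: []string{"DisruptionTarget"}},
		{ExitCodes: []int32{1}},
	}
	for _, p := range pods {
		fmt.Printf("%+v -> %s\n", p, matchAction(exampleRules(), p))
	}
}
```

Because the Ignore rule is listed first in this sketch, a pod that carries the DisruptionTarget condition is ignored even if its container exited with code 42, so rule order matters.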

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Yes. It adds support for the podFailurePolicy API, which handles pod failures based on container exit codes and pod conditions.

Introduces support for handling pod failures with respect to the configured pod failure policy rules

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures 

@k8s-ci-robot added the release-note, do-not-merge/work-in-progress, kind/feature, size/XXL, cncf-cla: yes, do-not-merge/needs-sig, and needs-triage labels Jul 13, 2022
Contributor

k8s-ci-robot commented Jul 13, 2022

Hi @mimowo. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test and needs-priority labels Jul 13, 2022
@k8s-ci-robot added the area/code-generation, area/test, kind/api-change, sig/api-machinery, sig/apps, sig/auth, sig/scheduling, and sig/testing labels and removed the do-not-merge/needs-sig label Jul 13, 2022
@mimowo force-pushed the retriable-pod-failures-job-controller branch from 34a5db1 to 142b1b6 Jul 13, 2022
@k8s-ci-robot added the size/L label and removed the size/XXL label Jul 13, 2022
@mimowo force-pushed the retriable-pod-failures-job-controller branch from 142b1b6 to 87cb426 Jul 13, 2022
Resolved review threads (outdated): pkg/apis/batch/types.go (1), pkg/apis/core/types.go (4), pkg/controller/job/job_controller.go (4), pkg/features/kube_features.go (1)
@mimowo force-pushed the retriable-pod-failures-job-controller branch 2 times, most recently from 086287d to 20273f9 Jul 14, 2022
Contributor

leilajal commented Jul 14, 2022

/remove-sig api-machinery

@k8s-ci-robot removed the sig/api-machinery label Jul 14, 2022
Member

alculquicondor commented Aug 4, 2022

Ready to squash

@mimowo force-pushed the retriable-pod-failures-job-controller branch from 6a43f28 to bdcb6c5 Aug 4, 2022
Contributor Author

mimowo commented Aug 4, 2022

> Ready to squash

done

Member

alculquicondor commented Aug 4, 2022

/lgtm

@k8s-ci-robot added the lgtm label Aug 4, 2022
Member

alculquicondor commented Aug 4, 2022

/assign @janetkuo @liggitt

Member

@liggitt liggitt left a comment

one question on the validation, API changes lgtm otherwise

I haven't reviewed the implementation or functional tests, so this still needs apps approval

Resolved review thread: pkg/apis/batch/validation/validation.go
@liggitt moved this from Changes requested to API review completed, 1.25 in API Reviews Aug 4, 2022
@k8s-ci-robot removed the lgtm label Aug 4, 2022
@mimowo force-pushed the retriable-pod-failures-job-controller branch from 2a9242f to defcb24 Aug 4, 2022
@mimowo force-pushed the retriable-pod-failures-job-controller branch from defcb24 to bf9ce70 Aug 4, 2022
Member

alculquicondor commented Aug 4, 2022

/lgtm

@k8s-ci-robot added the lgtm label Aug 4, 2022
Member

@janetkuo janetkuo left a comment

Approved from sig-apps

Member

janetkuo commented Aug 4, 2022

/approve

Member

liggitt commented Aug 4, 2022

/approve
for API changes

Contributor

k8s-ci-robot commented Aug 4, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: janetkuo, liggitt, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Aug 4, 2022
Member

alculquicondor commented Aug 4, 2022

/hold cancel
/priority important-soon

@k8s-ci-robot added the priority/important-soon label and removed the do-not-merge/hold and needs-priority labels Aug 4, 2022
@k8s-ci-robot merged commit eefcf6a into kubernetes:master Aug 4, 2022
16 checks passed
@mimowo deleted the retriable-pod-failures-job-controller branch Sep 30, 2022
mimowo added a commit to mimowo/kubernetes that referenced this pull request Oct 17, 2022
I think I'm ready to start reviewing code in this package, but not
necessarily for the entire sig-apps.

My PRs to the package:
kubernetes#110292
kubernetes#111113
kubernetes#112948
mimowo added a commit to mimowo/kubernetes that referenced this pull request Oct 20, 2022
…gration/job

I think I'm ready to start review and LGTM code changes within this
package, but not necessarily for the entire sig-apps.

My PRs to the packages:
kubernetes#110292
kubernetes#111113
kubernetes#112948

PRs to the packages I contributed reviews to:
kubernetes#113166
kubernetes#110294
Labels
api-review, approved, area/code-generation, area/test, cncf-cla: yes, kind/api-change, kind/feature, lgtm, ok-to-test, priority/important-soon, release-note, sig/api-machinery, sig/apps, sig/auth, sig/scheduling, sig/testing, size/XXL, triage/accepted
Projects
API Reviews
API review completed, 1.25
Development

Successfully merging this pull request may close these issues.

None yet