Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement configurable failure policy. #537

Merged

Conversation

jedwins1998
Copy link
Contributor

This pull request is to implement configurable failure policy.

There is one difference to note from the KEP. I added a new field to the JobSetStatus that tracks the number of restarts which count towards the restart limit. I then use this variable to allow some restarts to not count towards the maximum number of restarts.

This resolves #262.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 19, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @jedwins1998. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 19, 2024
Copy link

netlify bot commented Apr 19, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit 6a1ac21
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/664bd459b5d8bd00088bff49

@jedwins1998
Copy link
Contributor Author

jedwins1998 commented Apr 19, 2024

There is one TODO left for an additional test I would like to add and I still need to implement Webhook validation for OnJobFailureReasons. Besides that, I consider the code ready for review.

@danielvegamyhre
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 20, 2024
Copy link
Contributor

@danielvegamyhre danielvegamyhre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move all the helpers in jobset_controller.go specific to failure policies to a failure_policy.go file, and add unit tests for any important ones in failure_policy_test.go? Same as success_policy.go and success_policy_test.go.

I'll take a deeper look next week. Thanks for working on this!

@jedwins1998
Copy link
Contributor Author

Can we move all the helpers in jobset_controller.go specific to failure policies to a failure_policy.go file, and add unit tests for any important ones in failure_policy_test.go? Same as success_policy.go and success_policy_test.go.

I'll take a deeper look next week. Thanks for working on this!

Can do.

@jedwins1998 jedwins1998 force-pushed the dev-configurable-failure-policy branch 4 times, most recently from b94061d to afc73b0 Compare April 25, 2024 22:09
Copy link
Contributor

@danielvegamyhre danielvegamyhre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick pass while you are working on the refactor, looks good so far!

pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
@jedwins1998 jedwins1998 force-pushed the dev-configurable-failure-policy branch from afc73b0 to 591d2ee Compare April 26, 2024 21:30
@jedwins1998
Copy link
Contributor Author

jedwins1998 commented Apr 26, 2024

Can we move all the helpers in jobset_controller.go specific to failure policies to a failure_policy.go file, and add unit tests for any important ones in failure_policy_test.go? Same as success_policy.go and success_policy_test.go.

I'll take a deeper look next week. Thanks for working on this!

This is now done.

@jedwins1998 jedwins1998 force-pushed the dev-configurable-failure-policy branch 4 times, most recently from 72b6e44 to c1db56b Compare April 26, 2024 23:20
Copy link
Contributor

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look at the integration tests yet

api/jobset/v1alpha2/jobset_types.go Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/failure_policy.go Outdated Show resolved Hide resolved
pkg/controllers/failure_policy.go Outdated Show resolved Hide resolved
pkg/controllers/failure_policy.go Outdated Show resolved Hide resolved
pkg/webhooks/jobset_webhook.go Outdated Show resolved Hide resolved
@jedwins1998 jedwins1998 force-pushed the dev-configurable-failure-policy branch 2 times, most recently from 1bbfe17 to 0283adf Compare April 29, 2024 20:44
@jedwins1998 jedwins1998 force-pushed the dev-configurable-failure-policy branch from 8381433 to 5b8f55d Compare May 17, 2024 17:54
I added `[failure policy]` to the begin of the name of each
test related to failure policies so that it is easier to select only
those tests to run.

I also updated tests to check that `RestartsCountTowardsMax` is
incrementing only when expected.
Copy link
Contributor

@danielvegamyhre danielvegamyhre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after a few final comments are addressed. In a follow-up PR, we should add an example JobSet spec to the examples/ folder showcasing how the feature works, which we can use to do manual testing as well.

pkg/constants/constants.go Outdated Show resolved Hide resolved
pkg/controllers/failure_policy.go Outdated Show resolved Hide resolved
pkg/webhooks/jobset_webhook.go Outdated Show resolved Hide resolved
pkg/controllers/failure_policy.go Outdated Show resolved Hide resolved
RestartJobSetAndIgnoreMaxRestarts FailurePolicyAction = "RestartJobSetAndIgnoreMaxRestarts"
)

// FailurePolicyRule defines a FailurePolicyAction to be executed if a child job
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is a general comment unrelated to this file) Now that feature gate support has been merged in #557 I think we should add a feature gate (default on) for this feature. If the feature is not enabled, fall back to the current behavior.

We can do this in a follow up PR.

I want to avoid a scenario where we publish the v0.6.0 release and an important customer is using this feature, then they encounter a bug that slipped through the cracks, and we can't simply downgrade to v0.5.0 to mitigate because their JobSet spec (often defined in Python/Go code checked into their codebase) is using fields which only exist in v0.6.0 - thus requiring some emergency rollout on their end to revert their Python/Go code to a spec usable by JobSet v0.5.0, and then downgrade JobSet deployment to v0.5.0.

pkg/controllers/jobset_controller_test.go Show resolved Hide resolved
pkg/controllers/failure_policy.go Outdated Show resolved Hide resolved
@danielvegamyhre
Copy link
Contributor

/lgtm

Thanks for working on this! Will leave approval for @ahg-g

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2024
@ahg-g
Copy link
Contributor

ahg-g commented May 20, 2024

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, jedwins1998

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 20, 2024
@ahg-g
Copy link
Contributor

ahg-g commented May 21, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2024
@k8s-ci-robot k8s-ci-robot merged commit ac8cd93 into kubernetes-sigs:main May 21, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: Configurable Failure Policy API
5 participants