Add handling for podFailurePolicy #269

danielvegamyhre · 2023-08-24T23:20:13Z

Fixes #262

Note that I didn't modify the FailurePolicy API as I originally proposed in #262, but rather made respecting the podFailurePolicy the default behavior. Let me know if you have any feedback on this design decision; I am happy to discuss/revisit it.

cc @kannon92 @alculquicondor

k8s-ci-robot · 2023-08-24T23:20:19Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [danielvegamyhre]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kannon92 · 2023-08-25T01:31:18Z

pkg/controllers/jobset_controller.go

+func (r *JobSetReconciler) triggeredPodFailurePolicy(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) bool {
+	log := ctrl.LoggerFrom(ctx)
+	for _, failedJob := range ownedJobs.failed {
+		for _, c := range failedJob.Status.Conditions {


You could use IsStatusConditionTrue from Kubernetes here.

Looking at this more, I don't think you would be able to use this. It lets you check whether you have a FailedJob condition, but it doesn't match on the reason. I guess you could match on the FailedJob condition and then check the reason.
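
As a rough illustration of the "match on the condition, then check the reason" idea above, here is a minimal sketch. It assumes a local `JobCondition` stand-in type so that it compiles without Kubernetes dependencies; the real code would iterate over the Job's `Status.Conditions`, and the helper name `failedDueToPodFailurePolicy` is hypothetical rather than the function in this PR.

```go
package main

import "fmt"

// JobCondition is a local stand-in for the Job API's condition type,
// reduced to the fields this sketch needs (an assumption for illustration).
type JobCondition struct {
	Type   string // e.g. "Failed"
	Status string // "True", "False", or "Unknown"
	Reason string // machine-readable reason set by the Job controller
}

// Hard-coded here, as in the PR, because the reason was not yet exposed
// as a constant in the Job API at the time of this discussion.
const jobConditionReasonPodFailurePolicy = "PodFailurePolicy"

// failedDueToPodFailurePolicy (hypothetical name) reports whether any
// condition is a Failed condition with status True and reason PodFailurePolicy.
func failedDueToPodFailurePolicy(conds []JobCondition) bool {
	for _, c := range conds {
		if c.Type == "Failed" && c.Status == "True" && c.Reason == jobConditionReasonPodFailurePolicy {
			return true
		}
	}
	return false
}

func main() {
	conds := []JobCondition{
		{Type: "Failed", Status: "True", Reason: "PodFailurePolicy"},
	}
	fmt.Println(failedDueToPodFailurePolicy(conds))
}
```

Checking the status as well as the type and reason guards against a condition that is present but not currently true.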

kannon92 · 2023-08-25T01:48:50Z

pkg/controllers/jobset_controller.go

-	apiGVStr    = jobset.GroupVersion.String()
+	jobOwnerKey                        = ".metadata.controller"
+	apiGVStr                           = jobset.GroupVersion.String()
+	JobConditionReasonPodFailurePolicy = "PodFailurePolicy"


https://github.com/kubernetes/kubernetes/blob/714e77595c8b19b693925bda2a96ab80c307d38f/pkg/controller/job/job_controller.go#L59

I know that this change comes from this location, and if we go this route I think we'd maybe want to version the constant as an API for Job.

I think moving the PodFailurePolicy reason into staging for job would make it so that it would be more difficult to change the constant. Maybe I’m being paranoid but if someone changes this constant name, it could break you pretty easily. It’d be nice to reference this constant value as part of the API rather than a hard coded constant.

@alculquicondor wdyt?

Yeah it would be ideal if we could simply reference this constant as part of the Job API but since we currently can't, in the interest of velocity I just hard coded a constant here.

You did the correct thing. I have kubernetes/kubernetes#120175, where we would add these to the API, but it would probably be a few releases until you can use it.

n-2 support and all.


kannon92 · 2023-08-25T16:46:06Z

pkg/controllers/jobset_controller.go

@@ -459,12 +460,30 @@ func (r *JobSetReconciler) executeFailurePolicy(ctx context.Context, js *jobset.
 }

 func (r *JobSetReconciler) executeRestartPolicy(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) error {
-	if js.Spec.FailurePolicy.MaxRestarts == 0 {
+	if js.Spec.FailurePolicy.MaxRestarts == 0 || r.triggeredPodFailurePolicy(ctx, js, ownedJobs) {


So just checking: we only want to obey PodFailurePolicy if RestartPolicy is specified? Default behavior would be to fail on any failure?

If no jobset failure policy is specified, the jobset will fail immediately without restarts anyway. So the only place we need to do this podFailurePolicy check is if restarting is an option.

Not sure if this behavior would be immediately intuitive to users:
If one Job fails due to PodFailure policy, the entire job set fails.

Either make sure this is properly documented or make it a JobSet's FailurePolicy whether to respect the Job's.
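
To make the behavior under discussion concrete, the condition in the diff above can be sketched as a small decision function. This is a simplified illustration, not the controller code: the `action` type and `decide` function are hypothetical names, and other checks the controller may perform (such as comparing the restart count against `MaxRestarts`) are deliberately left out.

```go
package main

import "fmt"

// action is a hypothetical label for what the JobSet controller does next.
type action string

const (
	actionFailJobSet    action = "FailJobSet"
	actionRestartJobSet action = "RestartJobSet"
)

// decide mirrors only the condition shown in the diff:
// fail the whole JobSet when restarts are disabled (maxRestarts == 0)
// or when a child Job failed because its podFailurePolicy was triggered.
func decide(maxRestarts int, podFailurePolicyTriggered bool) action {
	if maxRestarts == 0 || podFailurePolicyTriggered {
		return actionFailJobSet
	}
	return actionRestartJobSet
}

func main() {
	// Restarts are allowed, but one Job hit its podFailurePolicy:
	// under this PR's default behavior the entire JobSet fails.
	fmt.Println(decide(3, true))
	// Restarts are allowed and the failure was an ordinary one.
	fmt.Println(decide(3, false))
}
```

The first call is the case flagged as potentially unintuitive: a single Job failing due to its podFailurePolicy overrides the JobSet's remaining restart budget.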

kannon92

Implementation looks good to me! It's very clean.

Just a few comments.

Not sure if you want @alculquicondor's or @ahg-g's thoughts on it.
/hold
/lgtm

kannon92 · 2023-08-25T17:57:40Z

/lgtm

k8s-ci-robot · 2023-08-25T22:45:57Z

New changes are detected. LGTM label has been removed.


alculquicondor · 2023-08-28T19:39:13Z

pkg/controllers/jobset_controller.go

@@ -459,12 +460,30 @@ func (r *JobSetReconciler) executeFailurePolicy(ctx context.Context, js *jobset.
 }

 func (r *JobSetReconciler) executeRestartPolicy(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) error {
-	if js.Spec.FailurePolicy.MaxRestarts == 0 {
+	if js.Spec.FailurePolicy.MaxRestarts == 0 || r.triggeredPodFailurePolicy(ctx, js, ownedJobs) {


Not sure if this behavior would be immediately intuitive to users:
If one Job fails due to PodFailure policy, the entire job set fails.

Either make sure this is properly documented or make it a JobSet's FailurePolicy whether to respect the Job's.

danielvegamyhre · 2023-08-28T20:53:31Z

Not sure if this behavior would be immediately intuitive to users:
If one Job fails due to PodFailure policy, the entire job set fails.
Either make sure this is properly documented or make it a JobSet's FailurePolicy whether to respect the Job's.

Yeah in my original proposal I wanted to have it configurable as part of the JobSet's FailurePolicy, but I realized modifying the API would probably require a new minor version, and v0.3.0 isn't scheduled until early October, and in the meantime we have users who want this functionality sooner rather than later.

Personally I would strongly prefer it be a configurable option in the JobSet Failure Policy, though, so we may have no choice but to wait. I would be curious to get others' thoughts on this.

kannon92 · 2023-08-29T13:31:12Z

Not sure if this behavior would be immediately intuitive to users:
If one Job fails due to PodFailure policy, the entire job set fails.
Either make sure this is properly documented or make it a JobSet's FailurePolicy whether to respect the Job's.

Yeah in my original proposal I wanted to have it configurable as part of the JobSet's FailurePolicy, but I realized modifying the API would probably require a new minor version, and v0.3.0 isn't scheduled until early October, and in the meantime we have users who want this functionality sooner rather than later.

Personally I would strongly prefer it be a configurable option in the JobSet Failure Policy, though, so we may have no choice but to wait. I would be curious to get others' thoughts on this.

So I was thinking about this a bit more and I realize that we have a few cases we should consider with RestartPolicy.

PodFailurePolicy is one case but what about other FailedJob cases. ActiveDeadlineExceeded, BackoffLimitExceeded.

Another area I was thinking about is failure policies on different replicas. Could we see a case where a parent RJ is allowed to fail but workers are not?

alculquicondor · 2023-08-29T14:36:34Z

PodFailurePolicy is one case but what about other FailedJob cases. ActiveDeadlineExceeded, BackoffLimitExceeded.

I wonder the same. These failures are also indicative of a problem in the user workload.

Let's flip the question: in which scenarios (or use cases) does it make sense to retry the entire job?

kannon92 · 2023-09-14T12:45:46Z

PodFailurePolicy is one case but what about other FailedJob cases. ActiveDeadlineExceeded, BackoffLimitExceeded.

I wonder the same. These failures are also indicative of a problem in the user workload.

Let's flip the question: in which scenarios (or use cases) does it make sense to retry the entire job?

@danielvegamyhre wdyt?

I can see a more general implementation to obey all failures of a job.

kannon92 · 2023-09-19T15:58:33Z

/retest

danielvegamyhre · 2023-10-16T22:58:34Z

@kannon92 as long as there is an easy way to check if a Job failed due to ActiveDeadlineExceeded or BackoffLimitExceeded then we can include those, and perhaps change the name of the configuration to RespectNonRetriableErrors or something? We can probably think of a better name, but the general idea seems fine to me.

kannon92 · 2023-10-17T02:07:28Z

@kannon92 as long as there is an easy way to check if a Job failed due to ActiveDeadlineExceeded or BackoffLimitExceeded then we can include those, and perhaps change the name of the configuration to RespectNonRetriableErrors or something? We can probably think of a better name, but the general idea seems fine to me.

We pushed a change in 1.29 to start exposing these as published APIs in the reason field. Both of these work like PodFailurePolicy.

kubernetes/kubernetes@a62eb45
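
A more general check along the lines discussed here might look like the sketch below. It is only an illustration under stated assumptions: `JobCondition` is a local stand-in type, `nonRetriableReason` is a hypothetical helper, and the reason strings are written out as literals. In particular, the literal `DeadlineExceeded` for the active-deadline case is an assumption that should be verified against the published Job API constants before relying on it.

```go
package main

import "fmt"

// JobCondition is a local stand-in for the Job API's condition type
// (an assumption so the sketch has no Kubernetes dependencies).
type JobCondition struct {
	Type   string
	Status string
	Reason string
}

// nonRetriableReasons lists failure reasons that would end the JobSet
// instead of triggering a restart. The string values are assumptions;
// real code should reference the constants published in the Job API.
var nonRetriableReasons = map[string]bool{
	"PodFailurePolicy":     true,
	"BackoffLimitExceeded": true,
	"DeadlineExceeded":     true, // assumed reason for an exceeded active deadline
}

// nonRetriableReason (hypothetical name) returns the reason of the first
// true Failed condition whose reason is in the non-retriable set.
func nonRetriableReason(conds []JobCondition) (string, bool) {
	for _, c := range conds {
		if c.Type == "Failed" && c.Status == "True" && nonRetriableReasons[c.Reason] {
			return c.Reason, true
		}
	}
	return "", false
}

func main() {
	conds := []JobCondition{
		{Type: "Failed", Status: "True", Reason: "BackoffLimitExceeded"},
	}
	reason, found := nonRetriableReason(conds)
	fmt.Println(reason, found)
}
```

Keeping the reasons in a set means a configuration option such as the RespectNonRetriableErrors name floated above could cover new reasons without changing the matching logic.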

k8s-ci-robot · 2023-11-10T14:07:51Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2023-12-18T21:26:52Z

@danielvegamyhre: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-jobset-test-e2e-main-1-29 | `81ae7d5` | link | true | `/test pull-jobset-test-e2e-main-1-29` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

danielvegamyhre · 2024-01-16T20:06:56Z

Closing in favor of revised API discussed in #262

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 24, 2023

k8s-ci-robot requested a review from ahg-g August 24, 2023 23:20

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 24, 2023

add handling for podFailurePolicy

94c3e70

danielvegamyhre force-pushed the podfailurepolicy branch from 609397c to 94c3e70 Compare August 25, 2023 00:09

kannon92 reviewed Aug 25, 2023


kannon92 reviewed Aug 25, 2023

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 25, 2023

k8s-ci-robot assigned kannon92 Aug 25, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2023

add comment to JobConditionReasonPodFailurePolicy

1dfa069

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2023

add podFailurePolicy to integration test

99b7cb5

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2023

alculquicondor reviewed Aug 28, 2023

check that condition is failed

81ae7d5

danielvegamyhre mentioned this pull request Aug 31, 2023

Bump k8s version to next 1.27.x patch release to include upstream podFailurePolicy bug fix #285

Closed

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 10, 2023

danielvegamyhre closed this Jan 16, 2024
