Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publish finer-grained failure reason for podFailurePolicy #122972

Open
ahg-g opened this issue Jan 25, 2024 · 8 comments
Open

Publish finer-grained failure reason for podFailurePolicy #122972

ahg-g opened this issue Jan 25, 2024 · 8 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/apps Categorizes an issue or PR as relevant to SIG Apps. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/batch Categorizes an issue or PR as relevant to WG Batch.

Comments

@ahg-g
Copy link
Member

ahg-g commented Jan 25, 2024

What would you like to be added?

When a Job failure is triggered via PodFailurePolicy, we currently set a generic reason on the failure condition, which is "PodFailurePolicy"

It would be great if the reason is more specific about the failure reason, which exact rule failed the job. There are few ways of doing that:

  1. Add a Reason field to PodFailurePolicyRule to allow users to optionally decide what reason should show up on the condition when the rule is triggered. We verify that it should be a single word with a specific length limit (because reasons are typically machine readable codes).

  2. In the case of onExitCodes to have the code appended to the reason, like "PodFailurePolicy-ExitCode143". But this will be tricky to do for the OnPodConditions case.

My preference is the first option.

Why is this needed?

This will give higher order APIs that use Job as a building block (such as JobSet) control in how they react to the child Job failure based on the container exit code. See kubernetes-sigs/jobset#381 for more details.

/wg batch
/sig apps

@ahg-g ahg-g added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 25, 2024
@k8s-ci-robot k8s-ci-robot added wg/batch Categorizes an issue or PR as relevant to WG Batch. sig/apps Categorizes an issue or PR as relevant to SIG Apps. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 25, 2024
@ahg-g
Copy link
Member Author

ahg-g commented Jan 25, 2024

/cc @mwielgus @soltysh @alculquicondor @liggitt

@alculquicondor
Copy link
Member

+1 from me for option 1

@danielvegamyhre
Copy link
Member

danielvegamyhre commented Jan 26, 2024

+1 for option 1, this would be super useful for building a configurable failure policy API for JobSet, as described in kubernetes-sigs/jobset#381.

I can work on a KEP for this.

@mimowo
Copy link
Contributor

mimowo commented Jan 26, 2024

+1, I also think option 1 is preferable, because the rules let you group exit codes, say 1-10, so it is likely the user would want to group them under one reason, rather than needed to deal with 10 different reasons, one per exit code.

As for the specific API we can discuss under the KEP, but two comments:

  • I find Reason ambiguous. In case of pod conditions it might suggest this is the reason expected on the matching pod condition. Maybe SetFailureReason or such?
  • instead of letting the users to configure the reason entirely I would rather let them configure just the suffix. For example AddFailureReasonSuffix: ExitCode143 would result in the reason as PodFailurePolicy-ExitCode143. This provides the same expressiveness for JobSet, yet prevents users from shooting in their own foot with non-descriptive reasons. This may matter for user / admin interactions.

@kannon92
Copy link
Contributor

/triage accepted

Since there is a few +1, I think the feature is valid. @danielvegamyhre I'm not sure your timeline on the feature but you may want to bring this up to a sig-apps meeting. Feature planning for 1.30 may have finished for most sigs so you might find it difficult to get reviews this cycle.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 26, 2024
@danielvegamyhre
Copy link
Member

danielvegamyhre commented Jan 27, 2024

/triage accepted

Since there is a few +1, I think the feature is valid. @danielvegamyhre I'm not sure your timeline on the feature but you may want to bring this up to a sig-apps meeting. Feature planning for 1.30 may have finished for most sigs so you might find it difficult to get reviews this cycle.

Ok I added an agenda item to discuss this at the next Batch WG meeting on Feb 1st, hopefully we can make it work. I'll try to have the KEP ready before then as well.

@tenzen-y
Copy link
Member

tenzen-y commented Feb 2, 2024

  • instead of letting the users to configure the reason entirely I would rather let them configure just the suffix. For example AddFailureReasonSuffix: ExitCode143 would result in the reason as PodFailurePolicy-ExitCode143. This provides the same expressiveness for JobSet, yet prevents users from shooting in their own foot with non-descriptive reasons. This may matter for user / admin interactions.

@mimowo @danielvegamyhre As I can see in the API recommendation, it seems that we should use the camelCase in the Reason. So, I think that we need to consider alternative approaches.

In condition types, and everywhere else they appear in the API, Reason is intended to be a one-word, CamelCase representation of the category of cause of the current status

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

@danielvegamyhre
Copy link
Member

danielvegamyhre commented Feb 5, 2024

  • instead of letting the users to configure the reason entirely I would rather let them configure just the suffix. For example AddFailureReasonSuffix: ExitCode143 would result in the reason as PodFailurePolicy-ExitCode143. This provides the same expressiveness for JobSet, yet prevents users from shooting in their own foot with non-descriptive reasons. This may matter for user / admin interactions.

@mimowo @danielvegamyhre As I can see in the API recommendation, it seems that we should use the camelCase in the Reason. So, I think that we need to consider alternative approaches.

In condition types, and everywhere else they appear in the API, Reason is intended to be a one-word, CamelCase representation of the category of cause of the current status

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

That was just an example, adhering to CamelCase reason is fine and something we can validate as well.

In the example above, we could remove the "-" character and have PodFailurePolicyExitCode3, or similar.

Also the KEP has been published, please review when you have time: kubernetes/enhancements#4479

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/apps Categorizes an issue or PR as relevant to SIG Apps. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/batch Categorizes an issue or PR as relevant to WG Batch.
Projects
Status: Needs Triage
Development

No branches or pull requests

7 participants