Publish finer-grained failure reason for podFailurePolicy #122972
Comments
+1 from me for option 1.
+1 for option 1, this would be super useful for building a configurable failure policy API for JobSet, as described in kubernetes-sigs/jobset#381. I can work on a KEP for this.
+1, I also think option 1 is preferable, because the rules let you group exit codes, say 1-10, so the user would likely want to group them under one reason rather than having to deal with 10 different reasons, one per exit code. As for the specific API we can discuss under the KEP, but two comments:
/triage accepted
Since there are a few +1s, I think the feature is valid. @danielvegamyhre I'm not sure of your timeline for the feature, but you may want to bring this up at a sig-apps meeting. Feature planning for 1.30 may have finished for most SIGs, so you might find it difficult to get reviews this cycle.
OK, I added an agenda item to discuss this at the next Batch WG meeting on Feb 1st; hopefully we can make it work. I'll try to have the KEP ready before then as well.
@mimowo @danielvegamyhre According to the API recommendation, it seems that we should use CamelCase for the Reason. So I think we need to consider alternative approaches.
That was just an example; adhering to CamelCase for the reason is fine and something we can validate as well. In the example above, we could remove the "-" character and have
Also, the KEP has been published, please review when you have time: kubernetes/enhancements#4479
What would you like to be added?
When a Job failure is triggered via PodFailurePolicy, we currently set a generic reason on the failure condition, which is "PodFailurePolicy".
It would be great if the reason were more specific about the cause of the failure, namely which exact rule failed the Job. There are a few ways of doing that:
1. Add a Reason field to PodFailurePolicyRule to allow users to optionally decide what reason should show up on the condition when the rule is triggered. We would verify that it is a single word with a specific length limit (because reasons are typically machine-readable codes).
2. In the case of onExitCodes, have the code appended to the reason, like "PodFailurePolicy-ExitCode143". But this will be tricky to do for the OnPodConditions case.
My preference is the first option.
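To make the first option concrete, here is a minimal Go sketch. It is an illustration under stated assumptions, not the actual Kubernetes API: the `Rule` struct is a hypothetical, trimmed-down stand-in for PodFailurePolicyRule, and the `validateReason` and `conditionReason` helpers, the 128-character limit, and the CamelCase pattern are all assumptions. The real field shape and validation rules would be settled in the KEP.

```go
package main

import (
	"fmt"
	"regexp"
)

// defaultReason is the generic reason set on the failure condition today.
const defaultReason = "PodFailurePolicy"

// maxReasonLen is an assumed limit; the issue only asks for
// "a specific length limit" without naming a number.
const maxReasonLen = 128

// camelCaseWord matches a single word that starts with an upper-case
// letter and contains only letters and digits (assumed validation rule).
var camelCaseWord = regexp.MustCompile(`^[A-Z][A-Za-z0-9]*$`)

// Rule is a hypothetical stand-in for PodFailurePolicyRule with the
// proposed optional Reason field added.
type Rule struct {
	Action    string
	ExitCodes []int32
	Reason    string // optional; empty means "use the generic reason"
}

// validateReason accepts an empty reason or a single CamelCase word
// within the assumed length limit.
func validateReason(reason string) error {
	if reason == "" {
		return nil
	}
	if len(reason) > maxReasonLen {
		return fmt.Errorf("reason %q is longer than %d characters", reason, maxReasonLen)
	}
	if !camelCaseWord.MatchString(reason) {
		return fmt.Errorf("reason %q must be a single CamelCase word", reason)
	}
	return nil
}

// conditionReason returns the reason to publish on the Job failure
// condition when the given rule is the one that failed the Job.
func conditionReason(rule Rule) string {
	if rule.Reason != "" {
		return rule.Reason
	}
	return defaultReason
}

func main() {
	rule := Rule{Action: "FailJob", ExitCodes: []int32{1, 2, 3}, Reason: "NonRetriableExitCode"}
	if err := validateReason(rule.Reason); err != nil {
		fmt.Println("invalid rule:", err)
		return
	}
	fmt.Println(conditionReason(rule))
}
```

Because one rule can cover a whole range of exit codes, a single user-chosen reason per rule keeps the number of distinct reasons small, and a rule that leaves the field empty falls back to the existing generic reason.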
Why is this needed?
This will give higher-order APIs that use Job as a building block (such as JobSet) control over how they react to a child Job failure based on the container exit code. See kubernetes-sigs/jobset#381 for more details.
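As a rough sketch of what that control could look like, the Go snippet below maps the reason on a child Job's failure condition to an action taken by the parent. Everything here is hypothetical and only for illustration: the `JobCondition` struct is a simplified stand-in for the real Job condition type, and the `Action` values, `failureReason`, and `decide` are invented names, not part of JobSet or Kubernetes.

```go
package main

import "fmt"

// JobCondition is a simplified, hypothetical stand-in for a Job status condition.
type JobCondition struct {
	Type   string
	Status string
	Reason string
}

// Action is a hypothetical parent-level reaction to a child Job failure.
type Action string

const (
	RestartAll Action = "RestartAll"
	FailAll    Action = "FailAll"
)

// failureReason returns the reason of the first condition of type
// "Failed" with status "True", and false if the Job has not failed.
func failureReason(conds []JobCondition) (string, bool) {
	for _, c := range conds {
		if c.Type == "Failed" && c.Status == "True" {
			return c.Reason, true
		}
	}
	return "", false
}

// decide looks up the failure reason in a user-supplied mapping and
// returns the fallback action when the reason is not listed.
// The second result is false when the Job has not failed at all.
func decide(conds []JobCondition, byReason map[string]Action, fallback Action) (Action, bool) {
	reason, failed := failureReason(conds)
	if !failed {
		return "", false
	}
	if action, ok := byReason[reason]; ok {
		return action, true
	}
	return fallback, true
}

func main() {
	// A finer-grained reason lets the parent tell failures apart;
	// with only the generic reason every failure hits the fallback.
	byReason := map[string]Action{"NonRetriableExitCode": FailAll}
	conds := []JobCondition{{Type: "Failed", Status: "True", Reason: "NonRetriableExitCode"}}
	if action, failed := decide(conds, byReason, RestartAll); failed {
		fmt.Println(action)
	}
}
```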
/wg batch
/sig apps