Retriable and non-retriable Pod failures for Jobs #3329
Comments
/sig apps |
/assign |
/assign |
/sig scheduling |
Hello @alculquicondor Just checking in as we approach the enhancements freeze at 18:00 PT on Thursday, June 23, 2022, which is just over 2 days from now. For note, this enhancement is targeting stage alpha for v1.25. Here's where this enhancement currently stands:
The open PR #3374 is addressing all the listed criteria above. We would just need to get it merged by the Enhancements Freeze. For note, the status of this enhancement is currently marked as at risk. |
With KEP PR #3374 merged, the enhancement is ready for the 1.25 Enhancements Freeze. For note, the status is now marked as tracked. |
Hello @alculquicondor Please follow the steps detailed in the documentation to open a docs placeholder PR against the dev-1.25 branch in the k/website repo. |
@Atharva-Shinde @alculquicondor there is one more PR that should be included before the code freeze: kubernetes/kubernetes#111475 |
thank you @mimowo, |
Hi @mimowo and @alculquicondor
For this enhancement, it looks like the following PRs are open and need to be merged before code freeze. Please let me know what other PRs in k/k I should be tracking for this KEP. |
Hi @mimowo and @alculquicondor, I have found an issue with API-initiated eviction when used together with PodDisruptionConditions (kubernetes/kubernetes#116552); I am trying to fix it in kubernetes/kubernetes#116554 |
/milestone v1.28 since there will be work for 1.28 |
/label lead-opted-in |
I took another skim thru this and I noticed that the failure-policy covers the whole pod, but speaks in terms like "exit code":
Pods don't have exit codes - containers do. What happens when there's more than one container? I still wonder if it would be better to delegate some aspects of this to the Pod definition - like:
or even:
We could even consider |
Hi @mimowo and @alculquicondor Just checking in as we approach the enhancements freeze at 01:00 UTC on Friday, 16 June 2023. This enhancement is targeting v1.28. Here's where this enhancement currently stands:
For this KEP, we would just need to update the following:
The status of this enhancement is currently marked as at risk. |
Pod failure policy allows specifying rules on exit codes per container with the podFailurePolicy field:

  rules:
  - action: FailJob
    onExitCodes:
      containerName: container1
      operator: In
      values: [1,2,3]
  - action: FailJob
    onExitCodes:
      containerName: container2
      operator: NotIn
      values: [40,41,42]

Regarding the support for exit code ranges (as implied in the API samples below): it was discussed and originally proposed, but we deferred it as a potential future improvement, once needed. For example, Kubeflow's TFJob introduces a retry convention based on exit codes (https://www.kubeflow.org/docs/components/training/tftraining/), but it is just a handful of them, which can be easily enumerated.
Not sure I understand what that would mean exactly. We would still need to communicate the Kubelet decision to the Job controller to allow for handling by users. For example, based on the exit codes, users might want to Ignore, FailJob, or Count the failure (this is the basic set of actions needed by TFJob). One way I imagine this could work is that the pod spec has a set of rules, and for each matching rule the Kubelet adds a pod condition (then the pod condition is matched by the Job's pod failure policy). It isn't clear how the set of conditions would be defined; alternatives I see:
Note that it is important that we can introduce new actions specific to the Job controller without a need for changes in the Kubelet. For example, in this KEP we are planning to add a new action. |
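As a concrete reference for the condition-matching half that already exists, here is a minimal sketch (the name, image, and exit code value are made up) of a Job whose pod failure policy reacts both to a pod condition added on disruption (DisruptionTarget) and to a container exit code; the pod-spec rules that would let the Kubelet add further custom conditions, as discussed above, remain hypothetical and are not shown:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: example-job                 # hypothetical name
  spec:
    backoffLimit: 6
    podFailurePolicy:
      rules:
      - action: Ignore                # a disruption does not count towards backoffLimit
        onPodConditions:
        - type: DisruptionTarget
          status: "True"
      - action: FailJob               # a known non-retriable exit code fails the whole Job
        onExitCodes:
          containerName: main
          operator: In
          values: [42]
    template:
      spec:
        restartPolicy: Never          # failed containers are not restarted in place
        containers:
        - name: main
          image: registry.example.com/app:latest   # placeholder image

Whichever component sets the condition on the pod, the Job controller only needs to match on the condition type, which is what keeps new Job-level actions decoupled from Kubelet changes.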
@npolshakova I added the KEP update PR to the description |
I was hand-waving that a Pod can fail or it can perma-fail. A perma-fail means "give up". This would give Pods the ability to handle some retries locally. You still need the higher-level control plane (Job) to react to those specific cases. It just doesn't need to be in the loop for ALL POSSIBLE exit codes. We can do it all in the higher-level control plane, but WHY? I am reminded of another system I worked on which did exactly this, and ultimately had to add the ability to make more localized retry decisions in order to scale. |
@thockin thanks for looking into this! On second thought, I'm starting to see how this could play together nicely.
For (2.) we need "Add a new field maxRestartTimes to podSpec when running into RestartPolicyOnFailure" (cc @kerthcet) to make sure the pod is eventually marked as Failed. Then, container restart rules are needed to support the use cases that we have (such as allowing a transition from TFJob). This could also play nicely with Backoff Limit Per Index. An example API I think of:

  apiVersion: batch/v1
  kind: Job
  spec:
    parallelism: 10
    completions: 10
    completionMode: Indexed
    backoffLimitPerIndex: 2 # it only takes into account pods with phase=Failed
    podFailurePolicy: # job-level API for handling failed pods (with phase=Failed)
      rules:
      - action: Ignore # do not count towards backoffLimitPerIndex
        onPodConditions:
        - type: DisruptionTarget
          status: "True"
    template:
      spec:
        restartPolicy: OnFailure
        maxRestartTimes: 4
        containers:
        - name: myapp1
          restartRules: # pod-level API for specifying container restart rules before the pod's phase=Failed
          - action: FailPod # short-circuit and fail the pod (set phase=Failed)
            onExitCodes:
              values: [1]
          - action: Count # count towards maxRestartTimes
            onExitCodes:
              values: [2]
          - action: Ignore # completely ignore and restart
            onExitCodes:
              values: [3]

So to me the question is how to get there, and whether supporting this should be split into separate KEPs. I feel that smaller KEPs are better because they allow for separate prioritization and independent graduation. |
That is true. The Job controller doesn't need to take all the decisions and it could delegate to the kubelet. But it still needs to know what to do after a pod has failed. For example, let's imagine that we have a restartPolicy in the PodSpec with exit code support. Once the kubelet decides that the pod cannot be retried, the Job controller still has to decide whether to fail the whole Job, ignore the failure, or just count it. That's why we started with an API in the Job spec, and limited it to Pods with restartPolicy: Never. But we can still introduce an API at the Pod spec, both for the number of retries and for policies on specific exit codes! That's why I welcome @kerthcet's KEP. The way I see it working is that a user writing a Job manifest would only set the API at the Job spec, and the kubelet-level retries would be governed by what the podTemplate specifies. |
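For reference, this is the restriction as it stands today; a minimal fragment of a Job spec (values are illustrative) showing that a pod failure policy is only accepted together with restartPolicy: Never in the pod template:

  spec:
    podFailurePolicy:
      rules:
      - action: FailJob
        onExitCodes:
          operator: In
          values: [42]        # illustrative non-retriable exit code
    template:
      spec:
        restartPolicy: Never  # validation currently rejects OnFailure here when podFailurePolicy is set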
I don't think this is very useful. If a user wants to restrict the number of restarts for an application, it doesn't matter whether they happen in the same Pod or across recreations. |
@alculquicondor Are you proposing to move forward with a Job-level API and maybe-later-maybe-not loosen the restartPolicy restriction? I don't want to stand in the way of solving real problems, but I worry that this becomes conceptual debt that we will never pay off (or even remember!) |
Yes, that is my proposal
@kerthcet already has an open WIP KEP :) And we have already received good user feedback about the failure policy at the job level. |
The max-restarts KEP isn't the same as restart-rules, though, right? They
all seem complementary but not the same. |
That's the point I think Job API can leverage the
Yes, currently max-restarts KEP only accounts for the |
Yes, these are the pieces of work that need to be done to fully support the restart policy.
Note that:
IMO doing (1.)-(4.) under this KEP would prolong its graduation substantially, and the different points have different priorities, so it is useful to decouple them. Recently we got this Slack ask to support |
Enhancement Description
One-line enhancement description (can be used as a release note): An API to influence retries based on exit codes and/or pod deletion reasons.
Kubernetes Enhancement Proposal: https://git.k8s.io/enhancements/keps/sig-apps/3329-retriable-and-non-retriable-failures
Discussion Link: "RFE: ability to define special exit code to terminate existing job" (kubernetes#17244)
Primary contact (assignee): @alculquicondor
Responsible SIGs: apps, api-machinery, scheduling
Enhancement target (which target equals to which milestone):

Alpha
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
- Docs (k/website) update PR(s): Add docs for KEP-3329 Retriable and non-retriable Pod failures for Jobs website#35219

Beta
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
  - v1.28:
- Docs (k/website) update(s):