
Retriable and non-retriable Pod failures for Jobs #3329

Open
8 tasks done
alculquicondor opened this issue Jun 1, 2022 · 64 comments
Assignees
Labels
  • lead-opted-in: Denotes that an issue has been opted in to a release
  • sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
  • stage/beta: Denotes an issue tracking an enhancement targeted for Beta status
  • wg/batch: Categorizes an issue or PR as relevant to WG Batch.
Milestone

Comments

@alculquicondor
Member

alculquicondor commented Jun 1, 2022

Enhancement Description

@k8s-ci-robot added the needs-sig label Jun 1, 2022
@alculquicondor
Member Author

/sig apps
/wg batch

@k8s-ci-robot added the sig/apps and wg/batch labels and removed the needs-sig label Jun 1, 2022
@alculquicondor
Member Author

/assign

@mimowo
Contributor

mimowo commented Jun 2, 2022

/assign

@alculquicondor changed the title from Retriable and non-retriable Job pod failures to Retriable and non-retriable Pod failures for Jobs Jun 6, 2022
@alculquicondor
Member Author

/sig scheduling
/sig api-machinery

@k8s-ci-robot added the sig/scheduling and sig/api-machinery labels Jun 9, 2022
@Priyankasaggu11929
Member

Hello @alculquicondor 👋, 1.25 Enhancements team here.

Just checking in as we approach enhancements freeze on 18:00 PT on Thursday June 23, 2022, which is just over 2 days from now.

For note, this enhancement is targeting stage alpha for 1.25 (correct me if otherwise).

Here's where this enhancement currently stands:

  • KEP file using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable
  • KEP has an updated detailed test plan section filled out
  • KEP has up to date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements.

The open PR #3374 addresses all the criteria listed above. We would just need to get it merged by the Enhancements Freeze.

For note, the status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

@Priyankasaggu11929 added the tracked/yes label Jun 22, 2022
@Priyankasaggu11929 added this to the v1.25 milestone Jun 22, 2022
@Priyankasaggu11929
Member

With KEP PR #3374 merged, the enhancement is ready for the 1.25 Enhancements Freeze.

For note, the status is now marked as tracked. Thank you so much!

@kcmartin

Hello @alculquicondor 👋, 1.25 Release Docs Lead here.
This enhancement is marked as ‘Needs Docs’ for the 1.25 release.

Please follow the steps detailed in the documentation to open a PR against the dev-1.25 branch in the k/website repo. This PR can be just a placeholder at this time, and must be created by August 4.
Also, take a look at Documenting for a release to familiarize yourself with the docs requirements for the release.

Thank you!

@Atharva-Shinde
Contributor

Atharva-Shinde commented Jul 25, 2022

Hi @alculquicondor @mimowo, Enhancements team here again 👋

Checking in as we approach Code Freeze at 01:00 UTC on Wednesday, 3rd August 2022.

Please ensure that the following items are completed before the code-freeze:

Let me know if there are any additional k/k PRs besides the ones listed above

Currently, the status of the enhancement is marked as at-risk

Thanks :)

@mimowo
Contributor

mimowo commented Jul 29, 2022

@Atharva-Shinde @alculquicondor there is one more PR that should be included before the code freeze: kubernetes/kubernetes#111475

@Atharva-Shinde
Contributor

thank you @mimowo,
I have updated my comment with the PR and have also tagged you for future reference :)

@shatoboar

Hi @mimowo and @alculquicondor 👋,
Checking in as we approach 1.27 code freeze at 17:00 PDT on Tuesday 14th March 2023.
Please ensure the following items are completed:

  • All PRs to the Kubernetes repo that are related to your enhancement are linked in the above issue description (for tracking purposes).
  • All PRs are fully merged by the code freeze deadline.

For this enhancement, it looks like the following PRs are open and need to be merged before code freeze:

Please let me know what other PRs in k/k I should be tracking for this KEP.
As always, we are here to help should questions come up. Thanks!

@atiratree
Member

Hi @mimowo and @alculquicondor, I have found an issue with API-initiated eviction when used together with PodDisruptionConditions (kubernetes/kubernetes#116552). I am trying to fix it in kubernetes/kubernetes#116554.
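For context, the DisruptionTarget condition added under PodDisruptionConditions is what a Job-level podFailurePolicy typically matches in order to ignore disruptions such as API-initiated eviction; a minimal sketch, with a placeholder name and image:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                # illustrative name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Ignore               # evictions/disruptions do not count towards backoffLimit
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never         # required for podFailurePolicy
      containers:
      - name: main
        image: registry.example.com/app:latest   # placeholder image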

@SergeyKanzhelev
Member

/milestone v1.28

since there will be work for 1.28

@k8s-ci-robot modified the milestones: v1.27, v1.28 May 5, 2023
@Atharva-Shinde removed the tracked/yes and lead-opted-in labels May 14, 2023
@soltysh
Contributor

soltysh commented May 18, 2023

/label lead-opted-in

@k8s-ci-robot added the lead-opted-in label May 18, 2023
@thockin
Member

thockin commented May 30, 2023

I took another skim thru this and I noticed that the failure-policy covers the whole pod, but speaks in terms like "exit code":

  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        operator: NotIn
        values: [40,41,42]

Pods don't have exit codes - containers do. What happens when there's more than one container?

I still wonder if it would be better to delegate some aspects of this to the Pod definition - like:

Pod.spec: {
    restartPolicy: Never
    containers:
    - name: myapp
      onExit:
      - statusCode:
        valueRange:
          from: 40
          to: 42
        action:
          failWithPrejudice{}

or even:

Pod.spec: {
    restartPolicy: Always
    containers:
    - name: myapp
      onExit:
      - statusCode:
        valueRange:
          from: 42
        action:
          fail{}
      - signal:
        name: SIGBUS
        action:
          failWithPrejudice{}

We could even consider restartPolicy: Bespoke to trigger this processing.

@npolshakova

Hi @mimowo and @alculquicondor 👋, Enhancements team here!

Just checking in as we approach enhancements freeze on 01:00 UTC Friday, 16th June 2023.

This enhancement is targeting stage beta for 1.28 (correct me if otherwise).

Here's where this enhancement currently stands:

  • KEP readme using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable for latest-milestone: 1.28
  • KEP readme has an updated detailed test plan section filled out
  • KEP readme has up to date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements.

For this KEP, we would just need to update the following:

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

@mimowo
Contributor

mimowo commented May 31, 2023

I took another skim thru this and I noticed that the failure-policy covers the whole pod, but speaks in terms like "exit code":
Pods don't have exit codes - containers do. What happens when there's more than one container?

Pod failure policy allows specifying rules on exit codes per container with the containerName field. For example:

  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: container1
        operator: In
        values: [1,2,3]
    - action: FailJob
      onExitCodes:
        containerName: container2
        operator: NotIn
        values: [40,41,42]

When containerName isn't specified, the rule matches all containers. The syntax without containerName keeps the policy concise, as batch workloads often have a single container.
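
A minimal sketch of that shorthand (assuming a single-container pod; the exit code is illustrative):

  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        # no containerName: the rule is checked against every container in the pod
        operator: In
        values: [42]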

Regarding support for exit code ranges (as implied in the API samples below): it was discussed and originally proposed, but we deferred it as a potential future improvement, once needed. For example, Kubeflow's TFJob introduces a retry convention based on exit codes (https://www.kubeflow.org/docs/components/training/tftraining/), but it is just a handful of codes which can easily be enumerated.
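
For instance, a TFJob-style policy can simply enumerate the handful of codes rather than use a range (the specific values below are illustrative, not necessarily the exact TFJob convention):

  podFailurePolicy:
    rules:
    - action: FailJob            # any exit code outside the enumerated "retryable" set fails the Job
      onExitCodes:
        operator: NotIn
        values: [130, 137, 143]  # illustrative retryable codes; failures with these codes are retried and counted as usual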

I still wonder if it would be better to delegate some aspects of this to the Pod definition - like:

Pod.spec: {
    restartPolicy: Never
    containers:
    - name: myapp
      onExit:
      - statusCode:
        valueRange:
          from: 40
          to: 42
        action:
          failWithPrejudice{}

or even:

Pod.spec: {
    restartPolicy: Always
    containers:
    - name: myapp
      onExit:
      - statusCode:
        valueRange:
          from: 42
        action:
          fail{}
      - signal:
        name: SIGBUS
        action:
          failWithPrejudice{}

We could even consider restartPolicy: Bespoke to trigger this processing.

Not sure I understand what failWithPrejudice is.

We would still need to communicate the Kubelet's decision to the Job controller to allow for handling by users. For example, based on the exit codes, users might want to Ignore, FailJob, or Count the failure (this is the basic set of actions needed by TFJob).

One way I imagine this could work: the pod spec has a set of rules, and for each matching rule Kubelet adds a pod condition (the pod condition is then matched by the Job's pod failure policy). It isn't clear how the set of conditions would be defined; the alternatives I see are:

  1. a single FailedWithPrejudice condition, where the reason would be FailJob, Count or Ignore (the downside is that we need matching based on Reason, which we want to avoid due to the risk of breaking changes; it also leaks Job knowledge into Kubelet)
  2. a fixed set: FailedMatchingIgnore, FailedMatchingFailJob, FailedMatchingCount (requires changes to both kubelet and job controller to introduce a new action)
  3. a user-defined set of conditions in the pod spec (I guess this could work; it means Kubelet starts to add conditions which are not strongly typed, and it also means that the user needs to be careful not to make typos when using pod condition names in the pod spec and the pod failure policy).

Note that it is important that we can introduce new actions specific to the Job controller without needing changes in Kubelet. For example, in this KEP we are planning to add a new action, FailIndex (#3967).
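
For illustration, a hedged sketch of how such a Job-controller-only action could be used once #3967 lands (field names as proposed there, subject to change):

  completionMode: Indexed
  backoffLimitPerIndex: 2
  podFailurePolicy:
    rules:
    - action: FailIndex    # fail only the index whose pod exited with this code, instead of the whole Job
      onExitCodes:
        operator: In
        values: [42]       # illustrative exit code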

@alculquicondor
Member Author

@npolshakova I added the KEP update PR to the description

@thockin
Member

thockin commented Jun 1, 2023

Not sure I understand what failWithPrejudice is.

I was hand-waving that a Pod can fail or it can perma-fail. A perma-fail means "give up". This would give Pods the ability to use restartPolicy=OnFailure (e.g. OOM) and still recognize that some failures are intentional and should not be restarted. It seems unfortunate to give up on restartPolicy when what you really want is a slightly different policy. I'm trying to emphasize that, in general, we want to pass information down as far as possible and enable the most-local control loop to make correct decisions. In this case, we can tell kubelet "restart this pod until it exits cleanly, or until it hits a specific case".

You still need the higher-level control-plane (Job) to react to those specific cases. It just doesn't need to be in the loop for ALL POSSIBLE exit codes.

We can do it all in the higher level control plane, but WHY? I am reminded of another system I worked on which did exactly this, and ultimately had to add the ability to make more localized retry decisions in order to scale.

@mimowo
Contributor

mimowo commented Jun 1, 2023

@thockin thanks for looking into this! On second thought, I am starting to see how this could play together nicely.

podFailurePolicy is currently limited to restartPolicy: Never, because:

  1. a preexisting issue with tracking failures for backoffLimit in the Job controller (IMO originating from the Job controller running into the competence of Kubelet: When restartPolicy=OnFailure the calculation for number of retries is not accurate kubernetes#109870).
  2. The API in Kubelet wasn’t ready to support our use cases

For (2.) we need "Add a new field maxRestartTimes to podSpec when running into RestartPolicyOnFailure" (cc @kerthcet) to make sure the pod is eventually marked failed.

Then, container restart rules are needed to support the use cases that we have (such as allowing a transition from TFJob). However, we could enable podFailurePolicy for restartPolicy: OnFailure as soon as we have maxRestartTimes and we modify backoffLimit to skip counting individual container restarts in this mode.

This could also play nicely with Backoff Limit Per Index, which is also currently limited to restartPolicy=Never, but we could extend it to restartPolicy=OnFailure and define it in this case to skip counting individual container restarts.

An example API I think of:

apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  backoffLimitPerIndex: 2  # it only takes into account pods with phase=Failed
  podFailurePolicy:  # job-level API for handling failed pods (with phase=Failed)
    rules:
    - action: Ignore # do not count towards backoffLimitPerIndex
      onPodConditions:
      - type: DisruptionTarget
        status: "True"
  template:
    spec:
      restartPolicy: OnFailure
      maxRestartTimes: 4
      containers:
      - name: myapp1
        restartRules:   # pod-level API for specifying container restart rules before pod's phase=Failed
        - action: FailPod # short-circuit and fail the pod (set phase=Failed)
          onExitCodes:
            values: [1]
        - action: Count # count towards maxRestartTimes
          onExitCodes:
            values: [2]
        - action: Ignore # completely ignore and restart
          onExitCodes:
            values: [3]

So to me the question is how to get there, and whether supporting podFailurePolicy when restartPolicy=OnFailure should be part of this KEP or a follow-up once we have the required APIs from Kubelet.

I feel that smaller KEPs are better because they allow for separate prioritization and independent graduation.

/cc @alculquicondor @kerthcet

@alculquicondor
Member Author

alculquicondor commented Jun 1, 2023

You still need the higher-level control-plane (Job) to react to those specific cases. It just doesn't need to be in the loop for ALL POSSIBLE exit codes.

We can do it all in the higher level control plane, but WHY? I am reminded of another system I worked on which did exactly this, and ultimately had to add the ability to make more localized retry decisions in order to scale.

That is true. The job controller doesn't need to make all the decisions and it could delegate to the kubelet. But it still needs to know what to do after a pod has failed.

For example, let's imagine that we have a restartPolicy in the PodSpec with exit code support. Once the kubelet decides that the pod cannot be retried because of exitCode=10, the job controller still needs to know whether it should create a replacement Pod. It's not clear from the API in the PodSpec whether the job should recreate the Pod.

That's why we started with an API in the Job spec, and limited it to Pods with restartPolicy=Never, as @mimowo is pointing out. This gives the job controller clear expectations about what to do with pods.

But we can still introduce an API at the Pod spec, both for the number of retries and for policies for specific exit codes! That's why I welcome @kerthcet's KEP.

The way I see it working is that a user writing a Job manifest would only set the API at the Job spec; then, if the podTemplate has restartPolicy != Never, the job controller would create pods with the PodSpec-level policy set. This gives the job controller full control over the number of retries in the case of Pod recreations: the first Pod gets the full number of retries, and subsequent recreated Pods get the remaining number of retries. In Job validation we can restrict users from setting the policy at the PodSpec level.
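
To make that flow concrete, a purely hypothetical sketch (it assumes a pod-level maxRestartTimes field along the lines of @kerthcet's KEP; the injected values below do not exist in the API today):

apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 5            # the user only sets the Job-level retry budget
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: myapp
        image: registry.example.com/app:latest   # placeholder image

# What the job controller could inject into the Pods it creates (hypothetical field):
#   first Pod:          spec.maxRestartTimes: 5   # full retry budget
#   if that Pod fails after restarting twice, its replacement would get:
#   replacement Pod:    spec.maxRestartTimes: 3   # remaining retry budget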

@alculquicondor
Member Author

backoffLimitPerIndex: 2  # it only takes into account pods with phase=Failed

I don't think this is very useful. If a user wants to restrict the number of restarts for an application, it doesn't matter whether they happen in the same Pod or across recreations.

@thockin
Member

thockin commented Jun 2, 2023

@alculquicondor Are you proposing to move forward with a Job-level API and maybe-later-maybe-not loosen the restartPolicy = Never requirement via a pod-level API (as sketched by @mimowo)?

I don't want to stand in the way of solving real problems, but I worry that this becomes conceptual debt that we will never pay off (or even remember!)

@alculquicondor
Member Author

Yes, that is my proposal

I don't want to stand in the way of solving real problems, but I worry that this becomes conceptual debt that we will never pay off (or even remember!)

@kerthcet already has an open WIP KEP :)
And we have already received good user feedback about the failure policy at the job level.

@thockin
Member

thockin commented Jun 2, 2023 via email

@kerthcet
Member

kerthcet commented Jun 2, 2023

This could also play nicely with #3967 . Also currently limited to restartPolicy=Never, but we could extend to restartPolicy=OnFailure, and define it in this case to skip counting individual container restarts

That's the point: I think the Job API can leverage maxRestartTimesOnFailure to achieve Backoff Limit Per Index.

The max-restarts KEP isn't the same as restart-rules, though, right? They
all seem complementary but not the same.

Yes, currently the max-restarts KEP only accounts for restartPolicy=OnFailure, because restartPolicy=Never & maxRestartTimes=x looks conflicting.

@mimowo
Contributor

mimowo commented Jun 2, 2023

The max-restarts KEP isn't the same as restart-rules, though, right? They
all seem complementary but not the same.

Yes, these are the pieces of work that need to be done to fully support the restart policies Never and OnFailure:

  1. restartPolicy=Never (this KEP, currently Beta with no blocking issues open)
  2. maxRestartTimes (started as a new KEP)
  3. enable restartPolicy=OnFailure (not started, but seems like a small follow-up KEP), plus a change to the Job's backoffLimit when maxRestartTimes is specified
  4. containerRestartPolicy in the pod spec (not started, AFAIK purely node changes), enabling an API similar to the one in #3329 (comment)

Note that:

  • (2.) is a prerequisite for (3.)
  • (3.) and (4.) are technically independent

IMO doing (1.)-(4.) under this KEP would prolong its graduation substantially, and the different points have different priorities, so it is useful to de-couple them. Recently we got a Slack ask to support OnFailure. We will need to keep track of how much demand there is for (3.) and (4.) and prioritize accordingly; de-coupling allows that.
