
A mechanism to fail a Pod that is stuck due to an invalid Image #122300

Open
alculquicondor opened this issue Dec 13, 2023 · 30 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • wg/batch: Categorizes an issue or PR as relevant to WG Batch.

Comments

@alculquicondor
Member

What would you like to be added?

A mechanism to set a Pod to phase=Failed when the image pull has failed a number of times (perhaps configurable).
Currently, the Pod just stays in phase=Pending.

Why is this needed?

This is especially problematic for Jobs submitted through a queueing system.

In a queued environment, the time when the job starts running (pods created) might be hours or even days after the Job is created. Then, the user that submitted the job might not realize their mistake until it's too late. Since these Pods block resources in the cluster, it might cause other pending Jobs not to start.

If the Pods stay in phase=Pending, the job controller cannot do anything about them, as it only does "failure handling" once the Pods actually terminate with a phase=Failed.
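
For illustration, a minimal out-of-tree sketch of the kind of check being requested, written against client-go types; the deadline constant (stuckImageDeadline), the helper name, and the use of the Pod's start time as a proxy for pull duration are all assumptions, not existing Kubernetes behavior:

```go
// Sketch only: a hypothetical out-of-tree check that decides whether a Pending Pod
// should be failed because a container is stuck on an image pull error. The deadline,
// the helper name, and the use of Pod start time as a proxy for pull duration are
// illustrative assumptions, not existing Kubernetes behavior.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const stuckImageDeadline = 30 * time.Minute // hypothetical; would be configurable

// shouldFailForImagePull reports whether the Pod is still Pending with at least one
// container waiting on an image pull error for longer than the deadline.
func shouldFailForImagePull(pod *corev1.Pod, now time.Time) bool {
	if pod.Status.Phase != corev1.PodPending || pod.Status.StartTime == nil {
		return false
	}
	for _, cs := range pod.Status.ContainerStatuses {
		w := cs.State.Waiting
		if w == nil {
			continue
		}
		switch w.Reason {
		case "ErrImagePull", "ImagePullBackOff", "InvalidImageName":
			// Using the Pod's start time is a rough proxy; a real implementation would
			// track per-container timestamps or count pull attempts.
			if now.Sub(pod.Status.StartTime.Time) > stuckImageDeadline {
				return true
			}
		}
	}
	return false
}

func main() {
	pod := &corev1.Pod{Status: corev1.PodStatus{
		Phase:     corev1.PodPending,
		StartTime: &metav1.Time{Time: time.Now().Add(-time.Hour)},
		ContainerStatuses: []corev1.ContainerStatus{{
			Name:  "main",
			State: corev1.ContainerState{Waiting: &corev1.ContainerStateWaiting{Reason: "ImagePullBackOff"}},
		}},
	}}
	fmt.Println(shouldFailForImagePull(pod, time.Now())) // true: stuck past the deadline
}
```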

@alculquicondor alculquicondor added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 13, 2023
@alculquicondor
Member Author

/wg batch
/sig node

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 13, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. wg/batch Categorizes an issue or PR as relevant to WG Batch. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 13, 2023
@alculquicondor
Member Author

cc @kannon92

@alculquicondor
Member Author

Prior work was in kubernetes/enhancements#3816

@kannon92
Contributor

A more general issue was here.

@BenTheElder
Member

xref: https://github.com/kubernetes/test-infra/blob/954bd25d267d71e210848cd2783410c4297e7ce2/prow/apis/prowjobs/v1/types.go#L532-L540

Prow sees incorrect images fairly often, and the catch-all podPendingTimeout has been helpful to mitigate this. We have a controller-wide default with a per-ProwJob override.
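
To make the pattern concrete, here is a hedged sketch of a controller-wide default with a per-job override; the ControllerConfig and JobConfig types below are illustrative stand-ins, not the actual Prow types in the file linked above:

```go
// Sketch only: the "controller-wide default with per-job override" pattern described
// above. The type and field names are illustrative, not the linked Prow types.
package main

import (
	"fmt"
	"time"
)

type ControllerConfig struct {
	// DefaultPodPendingTimeout applies to every job unless overridden.
	DefaultPodPendingTimeout time.Duration
}

type JobConfig struct {
	// PodPendingTimeout, when set, overrides the controller-wide default for this job.
	PodPendingTimeout *time.Duration
}

// effectivePendingTimeout resolves the timeout the controller would enforce for one job.
func effectivePendingTimeout(ctrl ControllerConfig, job JobConfig) time.Duration {
	if job.PodPendingTimeout != nil {
		return *job.PodPendingTimeout
	}
	return ctrl.DefaultPodPendingTimeout
}

func main() {
	ctrl := ControllerConfig{DefaultPodPendingTimeout: 60 * time.Minute}
	short := 10 * time.Minute
	fmt.Println(effectivePendingTimeout(ctrl, JobConfig{}))                          // 1h0m0s
	fmt.Println(effectivePendingTimeout(ctrl, JobConfig{PodPendingTimeout: &short})) // 10m0s
}
```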

@kannon92
Contributor

The phrase "invalid image" seems misleading here.

You essentially want a limit on the number of image pull attempts, after which the Pod is declared failed.

"Invalid image" bothers me because there is validation on image names (InvalidImageName) as well as other kinds of validation failures that make the image pull fail. I don't know if all of these cases will cause the Pod to go into image pull backoff.

You also have cases where the pull secrets are invalid, so the Pod sits in the ErrImagePull state.
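
As a hedged illustration of those distinct states, a small helper that surfaces image-related waiting reasons from a Pod's container statuses; the reason strings (InvalidImageName, ErrImagePull, ImagePullBackOff) are the ones the kubelet reports today, and treating them as a stable contract is an assumption:

```go
// Sketch only: collect image-pull-related waiting reasons from a Pod's container
// statuses. Relying on these reason strings as a stable API is an assumption.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// imagePullProblems returns one human-readable line per container that is stuck
// on an image-related error, e.g. "app: ImagePullBackOff (Back-off pulling image ...)".
func imagePullProblems(pod *corev1.Pod) []string {
	var problems []string
	for _, cs := range pod.Status.ContainerStatuses {
		w := cs.State.Waiting
		if w == nil {
			continue
		}
		switch w.Reason {
		case "InvalidImageName", "ErrImagePull", "ImagePullBackOff":
			problems = append(problems, fmt.Sprintf("%s: %s (%s)", cs.Name, w.Reason, w.Message))
		}
	}
	return problems
}

func main() {
	pod := &corev1.Pod{Status: corev1.PodStatus{
		ContainerStatuses: []corev1.ContainerStatus{{
			Name: "app",
			State: corev1.ContainerState{Waiting: &corev1.ContainerStateWaiting{
				Reason:  "ErrImagePull",
				Message: "pull access denied or image not found",
			}},
		}},
	}}
	for _, p := range imagePullProblems(pod) {
		fmt.Println(p)
	}
}
```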

@kannon92
Contributor

xref: https://github.com/kubernetes/test-infra/blob/954bd25d267d71e210848cd2783410c4297e7ce2/prow/apis/prowjobs/v1/types.go#L532-L540

Prow sees incorrect images fairly often, and the catch-all podPendingTimeout has been helpful to mitigate this. We have a controller-wide default with a per-ProwJob override.

That is an interesting idea, but I worry that ML workloads may experience much slower image pull times. Do you do a lot of tuning of these pull times in Prow?

@BenTheElder
Member

BenTheElder commented Dec 14, 2023

We haven't needed to change that timeout much; it's usually set to something fairly generous and isn't exclusively covering image pulls. I could imagine certain environments with extremely slow pulls of large images over a slow network, but I bet those already hit other timeouts that need tuning.

And yes, good point on the phrasing.

These are potentially valid image references: it would be a layering violation to check whether the image exists (plus auth concerns, etc.), and even if we could check, the image could come into existence later, so we have to assume the reference is valid as long as the format is right. So "invalid" isn't the right word.

I think the point remains that a Job referring to an image that doesn't exist and will never become pullable can get stuck forever and waste resources. Maybe call it "non-existent images".

@AxeZhan
Member

AxeZhan commented Dec 14, 2023

/cc

@sftim
Contributor

sftim commented Dec 14, 2023

BTW, there may already be community solutions for this; for example, I think Kyverno can spot this and fail the Pod.

@kannon92
Contributor

Reposting my thoughts here:

cc @kannon92 you thought about this problem before. Maybe it's time to pursue it.

Yes, this was the crux of kubernetes/enhancements#3816. If anyone wants to pick that up, please feel free. I don't think I will have the bandwidth to lead that anytime soon.

I decided to focus on PodReadyToStartContainers since that was already an existing feature; it covers issues in pod sandbox creation. It was promoted to beta in 1.29, so it will at least surface a condition for sandbox issues.

My 2c on this is that there are two lifecycle areas that are not well covered by pod conditions/errors: image issues and pod lifecycle (i.e., pod hooks or config setup).

If anything goes wrong during image pull (validation, secrets, etc.), the Pod can end up in ImagePullBackOff. The PR I linked was mostly about adding conditions for these cases.

So what I have been thinking is that there may be a need for two conditions (image lifecycle and pod lifecycle). I don't know whether I should split that KEP into two separate ones, one per condition.
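
If a dedicated image-lifecycle condition were added, it might look roughly like this sketch; the condition type PodImagePullFailure and its reason are hypothetical placeholders, not an existing Kubernetes API:

```go
// Sketch only: what an image-lifecycle Pod condition might look like. The condition
// type "PodImagePullFailure" used here is a hypothetical placeholder, not part of
// the Kubernetes API today.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	cond := corev1.PodCondition{
		Type:               corev1.PodConditionType("PodImagePullFailure"), // hypothetical type
		Status:             corev1.ConditionTrue,
		Reason:             "ImagePullBackOff",
		Message:            "container \"app\": back-off pulling image \"example.com/app:does-not-exist\"",
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%s=%s: %s (%s)\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```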

@charles-chenzz
Member

@kannon92 I might have bandwidth to take on KEP-3815. Can I pick it up and see if we can push it forward?

@AxeZhan
Member

AxeZhan commented Dec 15, 2023

I looked into kubernetes/enhancements#3816, and it seems that KEP aims to add a condition to the affected Pods, but this issue also aims to change the Pod's phase?

@alculquicondor
Member Author

Both would be useful. It can be the same KEP, though.

@kannon92
Contributor

Since @alculquicondor has a use case in mind for this, I would prioritize this feature over mine. I brought it up because the main issue you will run into if you take this on is that there is no stable condition or field for these cases. That said, most of them are probably not as important operationally as image pulls.

Maybe as part of this KEP, we could add a condition for image errors and also add this functionality.

@kannon92
Contributor

Also remember to compare different container runtimes, as I'm not sure whether my findings in that KEP transfer to CRI-O.

@alculquicondor
Member Author

@kannon92 I don't know if you already had agreement from sig-node that this feature is feasible. Otherwise, I would recommend attending a SIG Node meeting before wasting too much time on a KEP.

@kannon92
Contributor

@kannon92 I don't know if you already had agreement from sig-node that this feature is feasible. Otherwise, I would recommend attending a SIG Node meeting before wasting too much time on a KEP.

Which feature? I never got much attention from sig-node on my original feature. Whoever takes this on, I agree that they should bring it to sig-node and see if it is viable.

@tenzen-y
Member

A potential concrete use case is here: kubeflow/training-operator#2001

@kannon92
Contributor

kannon92 commented Mar 5, 2024

I am going to take this to sig-node on Mar 5th and see if it's a viable option for someone to pursue.

edit: pushed to Mar 12th.

@dchen1107
Member

+1 on adding the upper limit.

@dchen1107 dchen1107 added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2024
@kannon92
Contributor

kannon92 commented Mar 12, 2024

In sig-node, @mrunalp brought up that maybe we should consider both ImagePullBackOff and CrashLoopBackOff for pod lifecycle. One is related to images and the other is related to failures in the pod lifecycle.

It may be worth considering both in this feature.

@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jun 12, 2024
@BenTheElder
Member

@SergeyKanzhelev
Member

I think it also falls nicely under the unscheduling-of-pods category: https://docs.google.com/document/d/1fOx4PTN61sDv6u9C-t-WpOH8hi2IpenJWPc-Za8UAQE/edit#heading=h.1tvyczqnfmzb. The biggest problem, as mentioned above, is distinguishing recoverable from non-recoverable image pull errors, and deciding what value to set the retry limit to.
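
A minimal sketch of that distinction, assuming InvalidImageName can be treated as terminal while pull errors are retried up to a configurable limit; both the classification and the maxPullAttempts value are assumptions, not settled design:

```go
// Sketch only: one possible way to separate non-recoverable image errors from ones
// worth retrying, with a configurable attempt limit. Treating InvalidImageName as
// terminal and ErrImagePull/ImagePullBackOff as retryable is an assumption for
// illustration, not settled upstream behavior.
package main

import "fmt"

const maxPullAttempts = 5 // hypothetical; would be configurable

func recoverable(reason string) bool {
	switch reason {
	case "InvalidImageName":
		return false // the reference can never become pullable as written
	case "ErrImagePull", "ImagePullBackOff":
		return true // registry outages, tags pushed later, fixed pull secrets, ...
	default:
		return true // be conservative about reasons we do not know
	}
}

// shouldFail decides whether to give up, given the current waiting reason and the
// number of pull attempts observed so far.
func shouldFail(reason string, attempts int) bool {
	if !recoverable(reason) {
		return true
	}
	return attempts >= maxPullAttempts
}

func main() {
	fmt.Println(shouldFail("InvalidImageName", 1)) // true
	fmt.Println(shouldFail("ImagePullBackOff", 3)) // false
	fmt.Println(shouldFail("ImagePullBackOff", 7)) // true
}
```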

@lauralorenz

Hello! This is interesting. I do think this can be worked on independently of CrashLoopBackoff. Addressing the ImagePull backoff is explicitly out of scope for what I'm working on related to that in kubernetes/enhancements#4603 (technically they share some configuration /today/ but as part of my work I will be separating that so that I don't affect image pull backoff with my changes).

I think it /should/ be worked on separately because, from conversations I've had around this space, image pull failures (compared to application crashes controlled by CrashLoopBackOff) are expected to be easier for Kubernetes to introspect and act on (for example, a 404 or a 413 is probably not going to get fixed any time soon, so you could fail early or at least loop differently). So we should research those use cases separately, and they should have their own backoff logic.

So while I think they both should be worked on, if I'm working on CrashLoopBackOff and you are working on ImagePullBackOff, I think that works great! And we can work together for ideas at least, since we're hanging out in the same code area.
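
As a sketch of "looping differently" per error class, under the assumption that the kubelet or an external controller could classify registry responses; the error classes and delays below are illustrative only:

```go
// Sketch only: image-pull-specific backoff that reacts to the class of registry
// error. The error classes and the chosen delays are illustrative assumptions,
// not existing kubelet behavior or configuration.
package main

import (
	"fmt"
	"time"
)

type pullErrorClass int

const (
	classNotFound  pullErrorClass = iota // e.g. HTTP 404: manifest or tag missing
	classTooLarge                        // e.g. HTTP 413: will not succeed on retry either
	classTransient                       // network errors, 5xx, throttling
)

// nextAction returns whether to give up, and if not, how long to wait before retrying.
func nextAction(class pullErrorClass, attempt int) (giveUp bool, delay time.Duration) {
	switch class {
	case classNotFound, classTooLarge:
		return true, 0 // fail early instead of backing off forever
	default:
		// Exponential backoff capped at 5 minutes for errors that may resolve on their own.
		d := time.Duration(1<<uint(attempt)) * 10 * time.Second
		if d > 5*time.Minute {
			d = 5 * time.Minute
		}
		return false, d
	}
}

func main() {
	fmt.Println(nextAction(classNotFound, 2))  // true 0s
	fmt.Println(nextAction(classTransient, 3)) // false 1m20s
}
```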

@kannon92
Contributor

Sounds good. I'm not sure if I will get this into 1.32 at the moment, but I wanted to see if this is a use case we could address. If not, then we can target ImagePullBackOff separately.

@tallclair
Member

This is somewhat tangential to this issue, but we need a better feedback mechanism to controllers (particularly replicaset) for pods failing with non-recoverable configuration errors. Otherwise we can end up in a hot-looping situation where the controller repeatedly recreates a pod that's immediately marked as failed.

I don't think that would block this issue since the initial image pull backoff would prevent a hot-loop situation.

@kannon92
Contributor

Agree. I sorta started that discussion when I joined the k8s community: kubernetes/enhancements#3816

It was the first KEP I tried to tackle, so I got a bit overwhelmed.
