
A mechanism to fail a Pod that is stuck due to an invalid Image #122300

Open
alculquicondor opened this issue Dec 13, 2023 · 30 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • wg/batch: Categorizes an issue or PR as relevant to WG Batch.

Comments

@alculquicondor
Member

What would you like to be added?

A mechanism to set a Pod to phase=Failed when the image pull has failed a number of times (perhaps configurable).
Currently, the Pod just stays in phase=Pending.

Why is this needed?

This is especially problematic for Jobs submitted through a queueing system.

In a queued environment, the time when the job starts running (pods created) might be hours or even days after the Job is created. Then, the user that submitted the job might not realize their mistake until it's too late. Since these Pods block resources in the cluster, it might cause other pending Jobs not to start.

If the Pods stay in phase=Pending, the job controller cannot do anything about them, as it only does "failure handling" once the Pods actually terminate with a phase=Failed.
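
For illustration, a minimal out-of-tree sketch of the kind of check being requested, written against client-go types; the deadline constant (stuckImageDeadline), the helper name, and the use of the Pod's start time as a proxy for pull duration are all assumptions, not existing Kubernetes behavior:

```go
// Sketch only: a hypothetical out-of-tree check that decides whether a Pending Pod
// should be failed because a container is stuck on an image pull error. The deadline,
// the helper name, and the use of Pod start time as a proxy for pull duration are
// illustrative assumptions, not existing Kubernetes behavior.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const stuckImageDeadline = 30 * time.Minute // hypothetical; would be configurable

// shouldFailForImagePull reports whether the Pod is still Pending with at least one
// container waiting on an image pull error for longer than the deadline.
func shouldFailForImagePull(pod *corev1.Pod, now time.Time) bool {
	if pod.Status.Phase != corev1.PodPending || pod.Status.StartTime == nil {
		return false
	}
	for _, cs := range pod.Status.ContainerStatuses {
		w := cs.State.Waiting
		if w == nil {
			continue
		}
		switch w.Reason {
		case "ErrImagePull", "ImagePullBackOff", "InvalidImageName":
			// Using the Pod's start time is a rough proxy; a real implementation would
			// track per-container timestamps or count pull attempts.
			if now.Sub(pod.Status.StartTime.Time) > stuckImageDeadline {
				return true
			}
		}
	}
	return false
}

func main() {
	pod := &corev1.Pod{Status: corev1.PodStatus{
		Phase:     corev1.PodPending,
		StartTime: &metav1.Time{Time: time.Now().Add(-time.Hour)},
		ContainerStatuses: []corev1.ContainerStatus{{
			Name:  "main",
			State: corev1.ContainerState{Waiting: &corev1.ContainerStateWaiting{Reason: "ImagePullBackOff"}},
		}},
	}}
	fmt.Println(shouldFailForImagePull(pod, time.Now())) // true: stuck past the deadline
}
```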

@alculquicondor alculquicondor added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 13, 2023
@alculquicondor
Member Author

/wg batch
/sig node

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 13, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. wg/batch Categorizes an issue or PR as relevant to WG Batch. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 13, 2023
@alculquicondor
Member Author

cc @kannon92

@alculquicondor
Member Author

Prior work was in kubernetes/enhancements#3816

@kannon92
Contributor

A more general issue was here.

@BenTheElder
Member

xref: https://github.com/kubernetes/test-infra/blob/954bd25d267d71e210848cd2783410c4297e7ce2/prow/apis/prowjobs/v1/types.go#L532-L540

Prow sees incorrect images fairly often, and the catch-all podPendingTimeout has been helpful to mitigate this. We have a controller-wide default with a per-ProwJob override.
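
To make the pattern concrete, here is a hedged sketch of a controller-wide default with a per-job override; the ControllerConfig and JobConfig types below are illustrative stand-ins, not the actual Prow types in the file linked above:

```go
// Sketch only: the "controller-wide default with per-job override" pattern described
// above. The type and field names are illustrative, not the linked Prow types.
package main

import (
	"fmt"
	"time"
)

type ControllerConfig struct {
	// DefaultPodPendingTimeout applies to every job unless overridden.
	DefaultPodPendingTimeout time.Duration
}

type JobConfig struct {
	// PodPendingTimeout, when set, overrides the controller-wide default for this job.
	PodPendingTimeout *time.Duration
}

// effectivePendingTimeout resolves the timeout the controller would enforce for one job.
func effectivePendingTimeout(ctrl ControllerConfig, job JobConfig) time.Duration {
	if job.PodPendingTimeout != nil {
		return *job.PodPendingTimeout
	}
	return ctrl.DefaultPodPendingTimeout
}

func main() {
	ctrl := ControllerConfig{DefaultPodPendingTimeout: 60 * time.Minute}
	short := 10 * time.Minute
	fmt.Println(effectivePendingTimeout(ctrl, JobConfig{}))                          // 1h0m0s
	fmt.Println(effectivePendingTimeout(ctrl, JobConfig{PodPendingTimeout: &short})) // 10m0s
}
```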

@kannon92
Contributor

The phrase "invalid image" seems misleading here.

You essentially want a limit on the number of image pull attempts, after which the Pod is declared failed.

"Invalid image" bothers me because there is validation on image names (InvalidImageName) as well as other kinds of validation failures that make the image pull fail. I don't know if all of these cases will cause the Pod to go into image pull backoff.

You also have cases where the pull secrets are invalid, so the Pod sits in the ErrImagePull state.
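
As a hedged illustration of those distinct states, a small helper that surfaces image-related waiting reasons from a Pod's container statuses; the reason strings (InvalidImageName, ErrImagePull, ImagePullBackOff) are the ones the kubelet reports today, and treating them as a stable contract is an assumption:

```go
// Sketch only: collect image-pull-related waiting reasons from a Pod's container
// statuses. Relying on these reason strings as a stable API is an assumption.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// imagePullProblems returns one human-readable line per container that is stuck
// on an image-related error, e.g. "app: ImagePullBackOff (Back-off pulling image ...)".
func imagePullProblems(pod *corev1.Pod) []string {
	var problems []string
	for _, cs := range pod.Status.ContainerStatuses {
		w := cs.State.Waiting
		if w == nil {
			continue
		}
		switch w.Reason {
		case "InvalidImageName", "ErrImagePull", "ImagePullBackOff":
			problems = append(problems, fmt.Sprintf("%s: %s (%s)", cs.Name, w.Reason, w.Message))
		}
	}
	return problems
}

func main() {
	pod := &corev1.Pod{Status: corev1.PodStatus{
		ContainerStatuses: []corev1.ContainerStatus{{
			Name: "app",
			State: corev1.ContainerState{Waiting: &corev1.ContainerStateWaiting{
				Reason:  "ErrImagePull",
				Message: "pull access denied or image not found",
			}},
		}},
	}}
	for _, p := range imagePullProblems(pod) {
		fmt.Println(p)
	}
}
```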

@kannon92
Contributor

xref: https://github.com/kubernetes/test-infra/blob/954bd25d267d71e210848cd2783410c4297e7ce2/prow/apis/prowjobs/v1/types.go#L532-L540

Prow sees incorrect images fairly often, and the catch-all podPendingTimeout has been helpful to mitigate this. We have a controller-wide default with a per-ProwJob override.

That is an interesting idea, but I worry that ML workloads may experience much slower image pull times. Do you do a lot of tuning of these pull times in Prow?

@BenTheElder
Member

BenTheElder commented Dec 14, 2023

We haven't needed to change that timeout much; it's usually set to something fairly generous and isn't exclusively covering image pulls. I could imagine certain environments with extremely slow pulls of large images over a slow network, but I bet those already hit other timeouts that need tuning.

And yes, good point on the phrasing.

These are potentially valid image references: it would be a layering violation to check whether the image exists (plus auth concerns, etc.), and even if we could check, the image could come into existence later, so we have to assume the reference is valid as long as the format is right. So "invalid" isn't the right word.

I think the point remains that a Job referring to an image that doesn't exist and will never become pullable can get stuck forever and waste resources. Maybe call it "non-existent images".

@AxeZhan
Member

AxeZhan commented Dec 14, 2023

/cc

@sftim
Contributor

sftim commented Dec 14, 2023

BTW, there may already be community solutions for this; for example, I think Kyverno can spot this and fail the Pod.

@kannon92
Contributor

Reposting my thoughts here:

cc @kannon92 you thought about this problem before. Maybe it's time to pursue it.

Yes, this was the crux of kubernetes/enhancements#3816. If anyone wants to pick that up, please feel free. I don't think I will have the bandwidth to lead that anytime soon.

I decided to focus on PodReadyToStartContainers since that was already an existing feature; it covers issues in pod sandbox creation. It was promoted to beta in 1.29, so it will at least surface a condition for sandbox issues.

My 2c on this is that there are two lifecycle areas that are not well covered by pod conditions/errors: image issues and pod lifecycle (i.e., pod hooks or config setup).

If anything goes wrong during image pull (validation, secrets, etc.), the Pod can end up in ImagePullBackOff. The PR I linked was mostly about adding conditions for these cases.

So what I have been thinking is that there may be a need for two conditions (image lifecycle and pod lifecycle). I don't know whether I should split that KEP into two separate ones, one per condition.
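
If a dedicated image-lifecycle condition were added, it might look roughly like this sketch; the condition type PodImagePullFailure and its reason are hypothetical placeholders, not an existing Kubernetes API:

```go
// Sketch only: what an image-lifecycle Pod condition might look like. The condition
// type "PodImagePullFailure" used here is a hypothetical placeholder, not part of
// the Kubernetes API today.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	cond := corev1.PodCondition{
		Type:               corev1.PodConditionType("PodImagePullFailure"), // hypothetical type
		Status:             corev1.ConditionTrue,
		Reason:             "ImagePullBackOff",
		Message:            "container \"app\": back-off pulling image \"example.com/app:does-not-exist\"",
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%s=%s: %s (%s)\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```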

@charles-chenzz
Member

@kannon92 I might have bandwidth to take on KEP-3815. Can I pick it up and see if we can push it forward?

@AxeZhan
Member

AxeZhan commented Dec 15, 2023

I looked into kubernetes/enhancements#3816, and it seems that KEP aims to add a condition to the affected Pods, but this issue also aims to change the Pod's phase?

@alculquicondor
Member Author

Both would be useful. It can be the same KEP, though.

@kannon92
Contributor

Since @alculquicondor has a use case in mind for this, I would prioritize this feature over mine. I brought it up because the main issue you will run into if you take this on is that there is no stable condition or field for these cases. That said, most of them are probably not as important operationally as image pulls.

Maybe as part of this KEP, we could add a condition for image errors and also add this functionality.

@kannon92
Contributor

Also remember to compare different container runtimes, as I'm not sure whether my findings in that KEP transfer to CRI-O.

@alculquicondor
Member Author

@kannon92 I don't know if you already had agreement from sig-node that this feature is feasible. Otherwise, I would recommend attending a SIG Node meeting before wasting too much time on a KEP.

@kannon92
Contributor

@kannon92 I don't know if you already had agreement from sig-node that this feature is feasible. Otherwise, I would recommend attending a SIG Node meeting before wasting too much time on a KEP.

Which feature? I never got much attention from sig-node on my original feature. Whoever takes this on, I agree that they should bring it to sig-node and see if it is viable.

@tenzen-y
Member

A potential concrete use case is here: kubeflow/training-operator#2001

@kannon92
Contributor

kannon92 commented Mar 5, 2024

I am going to take this to sig-node on Mar 5th and see if it's a viable option for someone to pursue.

edit: pushed to Mar 12th.

@dchen1107
Member

+1 on adding the upper limit.

@dchen1107 dchen1107 added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2024
@kannon92
Contributor

kannon92 commented Mar 12, 2024

In sig-node, @mrunalp brought up that maybe we should consider both ImagePullBackOff and CrashLoopBackOff for pod lifecycle. One is related to images and the other is related to failures in the pod lifecycle.

It may be worth considering both in this feature.

@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jun 12, 2024
@BenTheElder
Member

@SergeyKanzhelev
Member

I think it also falls nicely under the unscheduling-of-pods category: https://docs.google.com/document/d/1fOx4PTN61sDv6u9C-t-WpOH8hi2IpenJWPc-Za8UAQE/edit#heading=h.1tvyczqnfmzb. The biggest problem, as mentioned above, is distinguishing recoverable from non-recoverable image pull errors, and deciding what value to set the retry limit to.
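
A minimal sketch of that distinction, assuming InvalidImageName can be treated as terminal while pull errors are retried up to a configurable limit; both the classification and the maxPullAttempts value are assumptions, not settled design:

```go
// Sketch only: one possible way to separate non-recoverable image errors from ones
// worth retrying, with a configurable attempt limit. Treating InvalidImageName as
// terminal and ErrImagePull/ImagePullBackOff as retryable is an assumption for
// illustration, not settled upstream behavior.
package main

import "fmt"

const maxPullAttempts = 5 // hypothetical; would be configurable

func recoverable(reason string) bool {
	switch reason {
	case "InvalidImageName":
		return false // the reference can never become pullable as written
	case "ErrImagePull", "ImagePullBackOff":
		return true // registry outages, tags pushed later, fixed pull secrets, ...
	default:
		return true // be conservative about reasons we do not know
	}
}

// shouldFail decides whether to give up, given the current waiting reason and the
// number of pull attempts observed so far.
func shouldFail(reason string, attempts int) bool {
	if !recoverable(reason) {
		return true
	}
	return attempts >= maxPullAttempts
}

func main() {
	fmt.Println(shouldFail("InvalidImageName", 1)) // true
	fmt.Println(shouldFail("ImagePullBackOff", 3)) // false
	fmt.Println(shouldFail("ImagePullBackOff", 7)) // true
}
```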

@lauralorenz

Hello! This is interesting. I do think this can be worked on independently of CrashLoopBackoff. Addressing the ImagePull backoff is explicitly out of scope for what I'm working on related to that in kubernetes/enhancements#4603 (technically they share some configuration /today/ but as part of my work I will be separating that so that I don't affect image pull backoff with my changes).

I think it /should/ be worked on separately because, from conversations I've had around this space, image pull failures (compared to application crashes controlled by CrashLoopBackOff) are expected to be easier for Kubernetes to introspect and act on (for example, a 404 or a 413 is probably not going to get fixed any time soon, so you could fail early or at least loop differently). So we should research those use cases separately, and they should have their own backoff logic.

So while I think they both should be worked on, if I'm working on CrashLoopBackOff and you are working on ImagePullBackOff, I think that works great! And we can work together for ideas at least, since we're hanging out in the same code area.
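
As a sketch of "looping differently" per error class, under the assumption that the kubelet or an external controller could classify registry responses; the error classes and delays below are illustrative only:

```go
// Sketch only: image-pull-specific backoff that reacts to the class of registry
// error. The error classes and the chosen delays are illustrative assumptions,
// not existing kubelet behavior or configuration.
package main

import (
	"fmt"
	"time"
)

type pullErrorClass int

const (
	classNotFound  pullErrorClass = iota // e.g. HTTP 404: manifest or tag missing
	classTooLarge                        // e.g. HTTP 413: will not succeed on retry either
	classTransient                       // network errors, 5xx, throttling
)

// nextAction returns whether to give up, and if not, how long to wait before retrying.
func nextAction(class pullErrorClass, attempt int) (giveUp bool, delay time.Duration) {
	switch class {
	case classNotFound, classTooLarge:
		return true, 0 // fail early instead of backing off forever
	default:
		// Exponential backoff capped at 5 minutes for errors that may resolve on their own.
		d := time.Duration(1<<uint(attempt)) * 10 * time.Second
		if d > 5*time.Minute {
			d = 5 * time.Minute
		}
		return false, d
	}
}

func main() {
	fmt.Println(nextAction(classNotFound, 2))  // true 0s
	fmt.Println(nextAction(classTransient, 3)) // false 1m20s
}
```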

@kannon92
Contributor

Sounds good. I'm not sure if I will get this into 1.32 at the moment, but I wanted to see if this is a use case we could address. If not, then we can target ImagePullBackOff separately.

@tallclair
Member

This is somewhat tangential to this issue, but we need a better feedback mechanism to controllers (particularly replicaset) for pods failing with non-recoverable configuration errors. Otherwise we can end up in a hot-looping situation where the controller repeatedly recreates a pod that's immediately marked as failed.

I don't think that would block this issue since the initial image pull backoff would prevent a hot-loop situation.

@kannon92
Contributor

Agree. I sorta started that discussion when I joined the k8s community: kubernetes/enhancements#3816

It was the first KEP I tried to tackle, so I got a bit overwhelmed.
