A mechanism to fail a Pod that is stuck due to an invalid Image #122300
/wg batch
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
cc @kannon92
Prior work was in kubernetes/enhancements#3816
More general issue was here
Prow sees incorrect images fairly often, and a catch-all podPendingTimeout has been helpful to mitigate this. We have a controller-wide default with a per-ProwJob override.
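For context, a rough sketch of what that catch-all timeout looks like in a Prow configuration. This is illustrative only: the field names approximate Prow's Plank controller configuration and may not match the current schema exactly.

```yaml
# Illustrative only: approximates Prow's Plank configuration; check the
# current Prow config schema before relying on these field names.
plank:
  # Controller-wide default: abort ProwJobs whose pods stay Pending
  # longer than this, which also catches unpullable images.
  pod_pending_timeout: 15m
```

Per the comment above, Prow also supports overriding this timeout per ProwJob; the override mechanism is not shown here.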
The phrase "invalid image" seems misleading here. You essentially want a limit on the number of tries for an image pull, after which you declare it failed. "Invalid image" bothers me because you can have validation on the image reference itself. You also have cases where the secrets are invalid, so the pod is stuck in an image-pull error state as well.
That is an interesting idea, but I worry that ML workloads may experience much slower image pull times. Do you have a lot of tuning for these pull times in Prow?
We haven't needed to change that timeout much. It's usually set to something fairly generous and isn't exclusively covering image pulls. I could imagine certain environments with extremely slow pulls of large images over a slow network, but I bet those hit other timeouts that need tuning already. And yes, good point on the phrasing. These are potentially valid image references, and it would be a layering violation to check whether the image exists (plus auth concerns, etc.); even if we could check, the image could come to exist later, so we have to assume a reference is valid if the format is right. So "invalid" isn't the right word. The point remains, though: a job can refer to an image that doesn't exist and will never become pullable, get stuck forever, and waste resources. Maybe call these "non-existent images".
/cc |
BTW, there may already be community solutions for this; for example, I think Kyverno can spot this and fail the Pod. |
Reposting my thoughts here:
Yes, this was the crux of kubernetes/enhancements#3816. If anyone wants to pick that up, please feel free. I don't think I will have the bandwidth to lead that anytime soon, and I decided to focus my time elsewhere. My 2c on this is that there are two lifecycle issues that are not well covered by pod conditions/errors: image issues and pod lifecycle (i.e. pod hooks or config setup). If you have any issues in image pull (validation, secrets, etc.), the pod can end up stuck in an image-pull error state. So what I have thought is that maybe there is a need for two conditions (image lifecycle and pod lifecycle). I don't know if I should consider separating that KEP into two different ones, one per condition.
I looked into kubernetes/enhancements#3816. It seems that KEP aims to add a condition to the corresponding pods, but this issue aims to also change the pod's phase?
Both would be useful. It can be the same KEP, though. |
So since @alculquicondor has a use case in mind for this, I would prioritize this feature over mine. I brought it up because the main problem you will run into if you take this issue is that there is no stable condition or field for these cases. Most of them are probably not as important for operations as image pulls, so maybe as part of this KEP we could add a condition for image errors and also add this functionality.
Also remember to compare different container runtimes, as I'm not sure whether my findings in that KEP were transferable to CRI-O.
@kannon92 I don't know if you already had agreement from sig-node that this feature is feasible. Otherwise, I would recommend attending a SIG Node meeting before wasting too much time on a KEP. |
Which feature? I never got much attention from sig-node on my original feature. For whoever takes this, I agree that they should bring this feature to sig-node and see if it is viable.
The potential concrete use-case is here: kubeflow/training-operator#2001 |
I am going to take this to sig-node on Mar 5th and see if it's a viable option for someone to pursue. edit: pushed to Mar 12th.
+1 on adding the upper limit. |
In sig-node, @mrunalp brought up that maybe we should consider both ImagePullBackOff and CrashLoopBackOff for pod lifecycle. One is related to images and the other is related to failures in the pod lifecycle. It may be worth considering both in this feature. |
This issue is labeled with `triage/accepted`. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/ /remove-triage accepted |
I think it also falls nicely under the "unscheduling of pods" category: https://docs.google.com/document/d/1fOx4PTN61sDv6u9C-t-WpOH8hi2IpenJWPc-Za8UAQE/edit#heading=h.1tvyczqnfmzb. The biggest problem, as mentioned above, is distinguishing recoverable from non-recoverable image pull errors, and choosing a value for the retry limit.
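To make the recoverable vs. non-recoverable distinction concrete, here is a minimal sketch in Python. The groupings of waiting reasons are assumptions for illustration (based on the reason strings kubelet reports in container statuses), not a statement of how kubelet or any KEP actually classifies them.

```python
# Hypothetical classification of image-pull waiting reasons, illustrating
# the recoverable vs. non-recoverable distinction discussed above.
# The groupings are assumptions, not kubelet behavior.

# Reasons unlikely to resolve without a spec change: the image reference
# itself failed validation, so no number of retries will help.
PERMANENT_REASONS = {"InvalidImageName"}

# Reasons that may be transient (registry outage, a pull secret an operator
# could fix, an image pushed later), where a retry limit is the safer knob.
TRANSIENT_REASONS = {"ErrImagePull", "ImagePullBackOff"}

def is_recoverable(waiting_reason: str) -> bool:
    """Return True if retrying the image pull could plausibly succeed."""
    if waiting_reason in PERMANENT_REASONS:
        return False
    # Default to recoverable: as noted in the thread, a currently missing
    # image may exist later, so non-existence cannot be proven from the
    # pull error alone.
    return True
```

This also shows why a retry limit is needed at all: for the transient reasons, classification alone can never tell you when to give up.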
Hello! This is interesting. I do think this can be worked on independently of CrashLoopBackOff. Addressing the ImagePullBackOff is explicitly out of scope for what I'm working on in kubernetes/enhancements#4603 (technically they share some configuration /today/, but as part of my work I will be separating that so that I don't affect image pull backoff with my changes).
I think it /should/ be worked on separately because, from conversations I've had around this space, image pull failures (compared to application crashes governed by CrashLoopBackOff) are expected to be easier for Kubernetes to introspect and take action on (e.g. a 404 or a 413 is probably not going to get fixed asap, so you could fail early or at least back off differently). So we should research those use cases separately, and they should have their own backoff logic.
So while I think both should be worked on, if I'm working on CrashLoopBackOff and you're working on ImagePullBackOff, I think that works great! And we can at least trade ideas, since we're hanging out in the same code area.
Sounds good. I'm not sure if I will get this into 1.32 at the moment, but I wanted to see if this is a use case we could address. If not, then we can target ImagePullBackOff separately.
This is somewhat tangential to this issue, but we need a better feedback mechanism to controllers (particularly replicaset) for pods failing with non-recoverable configuration errors. Otherwise we can end up in a hot-looping situation where the controller repeatedly recreates a pod that's immediately marked as failed. I don't think that would block this issue since the initial image pull backoff would prevent a hot-loop situation. |
Agree. I sorta started that discussion when I joined the k8s community: kubernetes/enhancements#3816. It was the first KEP I tried to tackle, so I got a bit overwhelmed.
What would you like to be added?
A mechanism to set a Pod into phase=Failed when the image pull has failed for a number of times (perhaps configurable).
Currently, the Pod just stays in phase=Pending.
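The requested mechanism can be sketched as a simple state transition. This is a minimal illustration assuming a hypothetical, configurable maximum number of pull attempts; the names (`PullState`, `record_pull_failure`, `max_attempts`) are invented for this sketch and are not part of any Kubernetes API.

```python
# Sketch of the proposed behavior: after a configurable number of failed
# image pulls, the pod moves to phase=Failed instead of staying Pending
# forever. All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class PullState:
    attempts: int = 0  # failed pull attempts observed so far

def record_pull_failure(state: PullState, max_attempts: int) -> str:
    """Record one failed pull and return the phase the pod should be in."""
    state.attempts += 1
    if state.attempts >= max_attempts:
        return "Failed"   # today the pod would remain Pending indefinitely
    return "Pending"      # keep retrying with backoff below the limit
```

Once the pod reaches phase=Failed, the Job controller's existing failure handling (restart policy, pod failure policy, backoff limit) can take over, which is the point of the request.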
Why is this needed?
This is especially problematic for Jobs submitted through a queueing system.
In a queued environment, the time when the job starts running (pods created) might be hours or even days after the Job is created. Then, the user that submitted the job might not realize their mistake until it's too late. Since these Pods block resources in the cluster, it might cause other pending Jobs not to start.
If the Pods stay in phase=Pending, the job controller cannot do anything about them, as it only does "failure handling" once the Pods actually terminate with a phase=Failed.