Node reboot leaving existing pod using resources stuck with error UnexpectedAdmissionError #125579
Comments
While this issue can be easily reproduced with a node reboot, I believe it can also happen if the device plugin suddenly becomes unhealthy after the pod is scheduled and before kubelet allocates the resources. The expected behavior is that the failure should be retried, just like CNI failures.
/sig node
I can confirm this issue. Luckily for us the pods are managed by Deployments, so they are restarted. But we're still stuck with 'ghost' pods in an
Hi! This behavior is meant to make evident and recoverable a previously hidden breakage where the pod actually started, but no one allocated the devices it requested: #109595. I don't know the mechanics of that specific GPU device plugin, but it seems likely to me that this problem can also happen for GPU devices. IOW, from what I gathered it was not device-specific but rather a flaw in how devices are handled in kubelet, hence the flag.
This behavior can indeed be surprising, but when a pod reaches a terminal state the system doesn't try to recover it; this is why the platform introduced and recommends higher-level controllers like Deployments.
Yes, this is something we've been discussing for a while, but there's unfortunately nothing concrete yet :\
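As a stopgap along the lines of the Deployment workaround mentioned above, a minimal sketch of a Deployment-managed workload requesting a device-plugin resource might look like this (the resource name `example.com/device` and the image are illustrative, not from the report):

```yaml
# Hypothetical sketch: a Deployment lets the ReplicaSet controller
# replace pods that end up in a terminal state such as
# UnexpectedAdmissionError, instead of leaving a bare pod stuck.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: device-user
spec:
  replicas: 1
  selector:
    matchLabels:
      app: device-user
  template:
    metadata:
      labels:
        app: device-user
    spec:
      containers:
      - name: app
        image: busybox
        command: ["sleep", "infinity"]
        resources:
          limits:
            example.com/device: "1"  # hypothetical device-plugin resource
```

Note that the failed pod object itself is not cleaned up automatically; only the replacement is created, which matches the 'ghost' pods observed above.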
/triage accepted kubelet retries have been discussed for a while and there's general agreement that it's desirable behavior
/assign @swatisehgal |
What happened?
When a node is rebooted, pods using resources allocated by a device plugin will encounter an UnexpectedAdmissionError as below:
What makes it really bad is that if it's a bare pod (not managed by a controller), it gets stuck in that state and never recovers.
What did you expect to happen?
The pod admission should be retried until the device plugin is ready.
How can we reproduce it (as minimally and precisely as possible)?
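The report leaves this section empty; a minimal sketch of the reproduction, assuming a hypothetical device plugin that exposes a resource named `example.com/device` (the name, node name, and image are illustrative): deploy the device plugin, create a bare pod requesting the resource, then reboot the node.

```yaml
# Hypothetical bare pod requesting a device-plugin resource.
# After a node reboot, kubelet may re-admit this pod before the
# device plugin re-registers and reports healthy devices, at which
# point admission fails with UnexpectedAdmissionError.
apiVersion: v1
kind: Pod
metadata:
  name: device-user
spec:
  nodeName: worker-1        # pin to the node that will be rebooted (illustrative)
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      limits:
        example.com/device: "1"   # hypothetical device-plugin resource
```

Once the pod is Running, reboot the node; because the bare pod is not managed by a controller, the failed admission is never retried.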
Anything else we need to know?
The behavior is introduced with #116376
And there are various issues opened around kubelet restart #118559 #124345
But this issue is about node restart. When a node is restarted, kubelet starts to rerun existing pods in random order, so a pod can run into this issue before the device plugin pod is healthy on the node.
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)