[Node Graceful Shutdown] kubelet sometimes doesn't finish killing pods before node shutdown within 30s and leaves the pods in Running state #110755
Comments
/cc @bobbypage |
Container runtime (CRI) and version? It seems as if containerd has the same bug. See containerd/containerd#7076 |
/sig node |
It looks like the kubelet didn't have enough time to report the pod status to the API server.
When the graceful shutdown time of the pod is the same as the graceful shutdown time of the node, the pod is killed and the inhibit lock is released at the same time, so the kubelet does not have enough time to report the pod's exit status. |
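For context, the two periods in question come from the kubelet configuration. A minimal sketch, with illustrative values (not a recommendation):

```yaml
# KubeletConfiguration sketch -- values are illustrative only.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Total time the kubelet delays node shutdown via the systemd inhibitor lock.
shutdownGracePeriod: 30s
# Portion of shutdownGracePeriod reserved for critical pods: regular pods
# get the first 30s - 10s = 20s, critical pods get the last 10s.
shutdownGracePeriodCriticalPods: 10s
```

If a pod's own terminationGracePeriodSeconds fills its entire window, the kill completes at the same instant the inhibitor lock is released, which is exactly the race described above.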
I am not a big fan of solving race conditions with "wait a little bit more". |
/triage accepted |
If the pods are cleaned up early, the kubelet will also release the lock early; the configured grace period is only the maximum time the kubelet will inhibit shutdown. Once the inhibit time is up, the kubelet can't stop systemd from killing it; it can only request more time from systemd when it initializes itself. There is no other way. |
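For readers unfamiliar with the mechanism: the kubelet takes a systemd "delay" inhibitor lock over D-Bus, and logind only honors it up to InhibitDelayMaxSec. A minimal sketch of acquiring such a lock with github.com/godbus/dbus/v5 (this is not the kubelet's actual code, just the logind API it relies on):

```go
package main

import (
	"fmt"
	"os"

	"github.com/godbus/dbus/v5"
)

func main() {
	conn, err := dbus.ConnectSystemBus()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Ask logind for a "delay" inhibitor: shutdown is paused (up to
	// InhibitDelayMaxSec in logind.conf) until the returned fd is closed.
	logind := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")
	var fd dbus.UnixFD
	err = logind.Call("org.freedesktop.login1.Manager.Inhibit", 0,
		"shutdown",                 // what
		"my-agent",                 // who (illustrative name)
		"waiting for pods to stop", // why
		"delay",                    // mode: delay, not block
	).Store(&fd)
	if err != nil {
		panic(err)
	}
	lock := os.NewFile(uintptr(fd), "inhibitor")
	fmt.Println("holding delay lock; closing it lets shutdown proceed")

	// ... cleanup work goes here; systemd proceeds once the fd is closed
	// or InhibitDelayMaxSec elapses, whichever comes first.
	lock.Close()
}
```

Note that a delay lock can only postpone shutdown, never veto it, which is why the kubelet cannot buy itself more time once shutdown has begun.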
I'm also a bit uncertain of this approach of artificially adding extra time during shutdown (#110804 (comment)). In offline testing, I've seen cases where the pod status manager can have quite high latency even after pod resources are reclaimed. For example, when a large number of pods are being terminated, the latency from the pod actually being terminated to the time the status update is sent can be quite large (I've seen >= 10 seconds). One of the reasons is that we only report the Terminated state after the pod resources are reclaimed, which can take some time (see pkg/kubelet/status/status_manager.go, lines 735 to 740 at 1ad457b).
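To make that ordering concrete, here is a toy model (not kubelet code; all names are made up) of why the reported-terminal time can trail the actual kill:

```go
package main

import (
	"fmt"
	"time"
)

// podTimeline models the gating described above: a terminal phase is only
// published once the pod's resources (volumes, cgroups, sandboxes) are
// reclaimed, not when the containers die.
type podTimeline struct {
	containersDead     time.Time // runtime finished killing containers
	resourcesReclaimed time.Time // cleanup of volumes/cgroups finished
}

// terminalReportedAt returns the earliest time a terminal status update
// could be sent under that gating rule.
func terminalReportedAt(p podTimeline) time.Time {
	if p.resourcesReclaimed.After(p.containersDead) {
		return p.resourcesReclaimed
	}
	return p.containersDead
}

func main() {
	killed := time.Now()
	p := podTimeline{
		containersDead:     killed,
		resourcesReclaimed: killed.Add(10 * time.Second), // reclamation lag
	}
	fmt.Println("status update trails the kill by",
		terminalReportedAt(p).Sub(p.containersDead))
}
```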
In general, we never have a full guarantee that the status update will be sent -- the API server can be down, for example, or kubelet API QPS can be throttled. I think we need to figure out a better solution here; perhaps there are some other avenues to explore. |
* Disable Kubelet Graceful Node Shutdown on worker nodes (enabled in Kubernetes v1.25.0, #1222)
* Graceful node shutdown allows 30s for critical pods and 15s for regular pods to shut down before releasing the inhibitor lock to allow the host to shut down
* Unfortunately, without further configuration options, regular pods and the node are shut down at the same time at the end of the 45s period. In practice, enabling this feature leaves Error or Completed pods in kube-apiserver state until manually cleaned up. This feature is not ready for general use
* Fix issue where Error/Completed pods accumulate whenever any node restarts (or auto-updates), visible in kubectl get pods
* This issue wasn't apparent in initial testing and seems to only affect non-critical pods (due to critical pods being killed earlier), but it's very apparent on our real clusters

Rel: kubernetes/kubernetes#110755
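For reference, disabling the feature amounts to zeroing both grace periods in the kubelet config. A minimal sketch:

```yaml
# Setting both periods to zero disables graceful node shutdown
# (the kubelet then takes no systemd inhibitor lock at all).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
```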
In our v1.25.0 clusters, when a node is gracefully shut down, many non-critical pods end up in an Error or Completed state after enabling graceful node shutdown. From my observations, this is not rare tail latency. As nodes auto-update, you end up with an increasing number of Error/Completed pods cluttering the cluster. They're technically harmless, but manually cleaning them up is a non-starter. Inspecting the container runtime, those pods were indeed killed. Ultimately, disabling kubelet graceful node shutdown resolved the issue. |
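For anyone hitting the same clutter, a possible cleanup is to delete pods by terminal phase (Succeeded covers pods shown as Completed, Failed covers Error; double-check the selector against your workloads before running it):

```
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
kubectl delete pods --all-namespaces --field-selector=status.phase=Succeeded
```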
Filed separately as #113278 |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
This issue has not been updated in over 1 year and should be re-triaged. For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/ /remove-triage accepted |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
What happened?
In some cases, we saw that in a cluster with graceful node shutdown enabled, the kubelet started the shutdown for some pods but ended up not finishing it, and those pods behaved the same as they would without graceful shutdown on a preempted VM.
What did you expect to happen?
Pods are set to a terminated or Failed status after the node is shut down.
How can we reproduce it (as minimally and precisely as possible)?
It depends on how quickly the kubelet kills pods. You could probably reproduce it by creating lots of pods that each have a preStop hook delaying termination, as sketched below.
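A hypothetical repro manifest along those lines (the name, replica count, and timings are made up for the sketch): a Deployment of pods whose preStop hook sleeps long enough to consume the whole termination grace window. Apply it, then reboot the node (e.g. `systemctl reboot`) and watch whether the pods' final status ever reaches the API server.

```yaml
# Illustrative repro manifest -- names and values are made up.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-shutdown
spec:
  replicas: 50
  selector:
    matchLabels:
      app: slow-shutdown
  template:
    metadata:
      labels:
        app: slow-shutdown
    spec:
      # Match the node's shutdown grace period so the kill finishes
      # right as the inhibitor lock is released.
      terminationGracePeriodSeconds: 30
      containers:
        - name: sleeper
          image: busybox:1.36
          command: ["sh", "-c", "sleep 1000000"]
          lifecycle:
            preStop:
              exec:
                # Eat the entire grace period before SIGTERM is delivered.
                command: ["sh", "-c", "sleep 30"]
```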
Anything else we need to know?
No response
Kubernetes version
v1.23+
Cloud provider
GCP
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)