
[Node Graceful Shutdown] kubelet sometimes doesn't finish killing pods before node shutdown within 30s and leaves the pods in Running state #110755

Open
Dragoncell opened this issue Jun 23, 2022 · 16 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@Dragoncell
Member

Dragoncell commented Jun 23, 2022

What happened?

In some cases, we saw that in a cluster with graceful node shutdown enabled, the kubelet started shutting down some pods but ended up not finishing, and these pods then behaved the same as they would without graceful shutdown on a preemptible VM.

2022-06-23 08:42:16.744 PDT
gke-dev-default-pool-XXX
I0623 15:42:16.744070    1748 nodeshutdown_manager_linux.go:316] "Shutdown manager killing pod with gracePeriod" pod="XX/client-64ccc5cc77-6jrfk" gracePeriod=15

>> Expected to have `Shutdown manager finished killing pod` logs 

>> After the new node comes up, the pod skips the scheduling process and tries to start before the node is ready 
2022-06-23 08:44:33.989 PDT
gke-dev-default-pool-XXX
E0623 15:44:33.989839    1746 pod_workers.go:951] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="XX/client-64ccc5cc77-6jrfk" podUID=4ee81e47-0e79-42fb-a7f6-40cd8e1a47a9

What did you expect to happen?

Pods should be set to a Terminated or Failed status after the node is shut down.

How can we reproduce it (as minimally and precisely as possible)?

It depends on how quickly the kubelet can kill pods. One way to reproduce it would probably be to create lots of pods and give them preStop hooks so that each pod takes longer to terminate; see the sketch below.
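A minimal sketch of such a reproducer, assuming the hook meant above is a preStop hook that delays termination (the Deployment name, replica count, image, and sleep durations are illustrative, not taken from the affected cluster):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-shutdown            # hypothetical name
spec:
  replicas: 50                   # enough pods that the kubelet has a lot to terminate at once
  selector:
    matchLabels:
      app: slow-shutdown
  template:
    metadata:
      labels:
        app: slow-shutdown
    spec:
      terminationGracePeriodSeconds: 30    # at or above the node's shutdown grace period
      containers:
      - name: sleeper
        image: busybox:1.36
        command: ["sh", "-c", "sleep 3600"]
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 30"]   # keeps each pod terminating for most of the window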

Anything else we need to know?

No response

Kubernetes version

v1.23+

Cloud provider

GCP

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Dragoncell Dragoncell added the kind/bug Categorizes issue or PR as related to a bug. label Jun 23, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 23, 2022
@Dragoncell Dragoncell changed the title [Node Graceful Shutdown] kubelet sometimes doesn't finish kill pods before node shutdown before the 30s are up and left the pods in running state still [Node Graceful Shutdown] kubelet sometimes doesn't finish kill pods before node shutdown within 30s and left the pods in running state still Jun 23, 2022
@Dragoncell
Member Author

/cc @bobbypage

@wangyysde
Member

Which container runtime (CRI) and version? It looks as if containerd has the same bug; see containerd/containerd#7076

@pacoxu
Member

pacoxu commented Jun 24, 2022

/sig node
/cc @wzshiming

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 24, 2022
@pacoxu pacoxu added this to Triage in SIG Node Bugs Jun 24, 2022
@wzshiming
Member

It looks like the kubelet didn't have enough time to report the Pod status to the API server.

@wzshiming
Member

wzshiming commented Jun 27, 2022

When the Pod's graceful shutdown period is the same as the node's graceful shutdown period, the Pod is killed and the inhibit lock is released at the same moment, so the kubelet does not have enough time to report the Pod's exit status.
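For concreteness, a sketch of the two KubeletConfiguration fields involved, with illustrative durations (not the values from the cluster in the report):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: "30s"              # total time the kubelet delays node shutdown via the systemd inhibitor lock
shutdownGracePeriodCriticalPods: "10s"  # tail of that window reserved for critical pods
# Regular pods get 30s - 10s = 20s. A pod whose termination uses the full 20s is
# still being killed when the window expires, so the kubelet releases the inhibitor
# lock with no time left to post the final Pod status to the API server.

Leaving the node window longer than the pods' terminationGracePeriodSeconds gives the kubelet some time to post the final status, though, as the comments below note, that only shrinks the race rather than removing it.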

@matthyx
Contributor

matthyx commented Aug 29, 2022

I am not a big fan of solving race conditions with "wait a little bit more".
Do you think there are other ways of protecting this status report?

@matthyx matthyx moved this from Triage to Triaged in SIG Node Bugs Aug 29, 2022
@matthyx
Contributor

matthyx commented Aug 29, 2022

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 29, 2022
@wzshiming
Member

wzshiming commented Aug 29, 2022

I am not a big fan of solving race conditions with "wait a little bit more".
Do you think there are other ways of protecting this status report?

@matthyx

If the Pods are cleaned up early, the kubelet also releases the shutdown early; the configured period is only the maximum time the kubelet will inhibit shutdown. Once the inhibit time is up, the kubelet can't stop systemd from killing it; it can only request more time from systemd when it initializes itself. There is no other way.

@bobbypage
Member

bobbypage commented Sep 8, 2022

I'm also a bit uncertain about this approach of artificially adding extra time during shutdown (#110804 (comment)). In offline testing, I've seen cases where the pod status manager can have quite high latency even after pod resources are reclaimed. For example, when a large number of pods are being terminated, the latency from the pod actually terminating to the time the status update is sent can be quite large (I've seen >= 10 seconds).

One of the reasons is that we only report the Terminated state after the pod's resources are reclaimed, which can take some time:

func (m *manager) canBeDeleted(pod *v1.Pod, status v1.PodStatus) bool {
    if pod.DeletionTimestamp == nil || kubetypes.IsMirrorPod(pod) {
        return false
    }
    return m.podDeletionSafety.PodResourcesAreReclaimed(pod, status)
}

In general, we never have a full guarantee that the status update will be sent; the API server can be down, for example, or the kubelet's API QPS can be throttled. I think we need to figure out a better solution here.

Perhaps some other avenues to explore:

  1. Find out what the general pod status update latency is: @smarterclayton has a PR in progress to add a metric for it: kubelet: Record a metric for latency of pod status update #107896
  2. See if there are ways we can optimize and reduce the status update latency in the general case
  3. Consider coming up with a new pod phase "terminating" (see Status of pods can become "OutOfCpu" when many pods are created and completed in a short time on the same node. #106884 (comment) for prior discussion). If the node is deleted or shut down before the pod is "terminated", the status should be updated on the server side.

dghubble added a commit to poseidon/typhoon that referenced this issue Sep 10, 2022
* Disable Kubelet Graceful Node Shutdown on worker nodes (enabled in
Kubernetes v1.25.0 #1222)
* Graceful node shutdown allows 30s for critical pods and 15s for
regular pods to shut down before releasing the inhibitor lock so the
host can shut down
* Unfortunately, without further configuration options, regular pods
and the node are shut down at the same time at the end of the 45s
period. In practice, enabling this feature leaves Error or Completed
pods in kube-apiserver state until manually cleaned up; the feature
is not ready for general use
* Fix an issue where Error/Completed pods accumulate whenever any
node restarts (or auto-updates), visible in kubectl get pods
* This issue wasn't apparent in initial testing and seems to only
affect non-critical pods (since critical pods are killed earlier),
but it's very apparent on our real clusters

Rel: kubernetes/kubernetes#110755
@dghubble
Contributor

dghubble commented Sep 10, 2022

In our v1.25.0 clusters, when a node is gracefully shutdown, many non-critical pods end up in an Error or Completed state, with

Message:      Pod was terminated in response to imminent node shutdown.
Reason:       Terminated

after enabling graceful node shutdown. From my observations, it's not rare tail latency. As nodes auto-update, you end up with an increasing amount of Error/Completed Pod clutter. They're technically harmless, but manually cleaning them up is a non-starter. Inspecting the container runtime, those pods were indeed killed. Ultimately, disabling kubelet graceful node shutdown resolved the issue.
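For reference, disabling the feature comes down to zeroing both durations in the kubelet configuration; a minimal sketch (the exact way Typhoon applies its kubelet settings is not shown here):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Setting both durations to zero disables Graceful Node Shutdown entirely,
# so the kubelet no longer takes a systemd inhibitor lock at shutdown.
shutdownGracePeriod: "0s"
shutdownGracePeriodCriticalPods: "0s"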

dghubble added a commit to poseidon/typhoon that referenced this issue Sep 10, 2022
dghubble-robot pushed a commit to poseidon/terraform-azure-kubernetes that referenced this issue Sep 10, 2022
dghubble-robot pushed a commit to poseidon/terraform-onprem-kubernetes that referenced this issue Sep 10, 2022
dghubble-robot pushed a commit to poseidon/terraform-digitalocean-kubernetes that referenced this issue Sep 10, 2022
dghubble-robot pushed a commit to poseidon/terraform-aws-kubernetes that referenced this issue Sep 10, 2022
dghubble-robot pushed a commit to poseidon/terraform-google-kubernetes that referenced this issue Sep 10, 2022
@dghubble
Contributor

Filed separately as #113278

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2023
@dghubble
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2023
Snaipe pushed a commit to aristanetworks/monsoon that referenced this issue Apr 13, 2023
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 20, 2024