
[Node Graceful Shutdown] kubelet sometimes doesn't finish killing pods before node shutdown within 30s and leaves the pods in Running state #110755

Open
Dragoncell opened this issue Jun 23, 2022 · 16 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@Dragoncell
Member

Dragoncell commented Jun 23, 2022

What happened?

In some cases, we saw that in a cluster with graceful node shutdown enabled, the kubelet started shutting down some pods but ended up not finishing, and these pods then behaved the same as they would without graceful shutdown on a preemptible VM.

2022-06-23 08:42:16.744 PDT
gke-dev-default-pool-XXX
I0623 15:42:16.744070    1748 nodeshutdown_manager_linux.go:316] "Shutdown manager killing pod with gracePeriod" pod="XX/client-64ccc5cc77-6jrfk" gracePeriod=15

>> Expected to have `Shutdown manager finished killing pod` logs 

>> After the new node comes up, the pod skips the scheduling process and tries to start before the node is ready 
2022-06-23 08:44:33.989 PDT
gke-dev-default-pool-XXX
E0623 15:44:33.989839    1746 pod_workers.go:951] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="XX/client-64ccc5cc77-6jrfk" podUID=4ee81e47-0e79-42fb-a7f6-40cd8e1a47a9

What did you expect to happen?

Pods should be set to a Terminated or Failed status after the node is shut down.

How can we reproduce it (as minimally and precisely as possible)?

It depends on how quickly the kubelet can kill pods. One way to reproduce it would probably be to create lots of pods and give them preStop hooks so that each pod takes longer to terminate; see the sketch below.
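A minimal sketch of such a reproducer, assuming the hook meant above is a preStop hook that delays termination (the Deployment name, replica count, image, and sleep durations are illustrative, not taken from the affected cluster):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-shutdown            # hypothetical name
spec:
  replicas: 50                   # enough pods that the kubelet has a lot to terminate at once
  selector:
    matchLabels:
      app: slow-shutdown
  template:
    metadata:
      labels:
        app: slow-shutdown
    spec:
      terminationGracePeriodSeconds: 30    # at or above the node's shutdown grace period
      containers:
      - name: sleeper
        image: busybox:1.36
        command: ["sh", "-c", "sleep 3600"]
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 30"]   # keeps each pod terminating for most of the window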

Anything else we need to know?

No response

Kubernetes version

v1.23+

Cloud provider

GCP

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Dragoncell Dragoncell added the kind/bug Categorizes issue or PR as related to a bug. label Jun 23, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 23, 2022
@Dragoncell Dragoncell changed the title [Node Graceful Shutdown] kubelet sometimes doesn't finish kill pods before node shutdown before the 30s are up and left the pods in running state still [Node Graceful Shutdown] kubelet sometimes doesn't finish kill pods before node shutdown within 30s and left the pods in running state still Jun 23, 2022
@Dragoncell
Member Author

/cc @bobbypage

@wangyysde
Member

Which container runtime (CRI) and version? It looks as if containerd has the same bug; see containerd/containerd#7076

@pacoxu
Member

pacoxu commented Jun 24, 2022

/sig node
/cc @wzshiming

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 24, 2022
@pacoxu pacoxu added this to Triage in SIG Node Bugs Jun 24, 2022
@wzshiming
Member

It looks like the kubelet didn't have enough time to report the Pod status to the API server.

@wzshiming
Member

wzshiming commented Jun 27, 2022

When the Pod's graceful shutdown period is the same as the node's graceful shutdown period, the Pod is killed and the inhibit lock is released at the same moment, so the kubelet does not have enough time to report the Pod's exit status.
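For concreteness, a sketch of the two KubeletConfiguration fields involved, with illustrative durations (not the values from the cluster in the report):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: "30s"              # total time the kubelet delays node shutdown via the systemd inhibitor lock
shutdownGracePeriodCriticalPods: "10s"  # tail of that window reserved for critical pods
# Regular pods get 30s - 10s = 20s. A pod whose termination uses the full 20s is
# still being killed when the window expires, so the kubelet releases the inhibitor
# lock with no time left to post the final Pod status to the API server.

Leaving the node window longer than the pods' terminationGracePeriodSeconds gives the kubelet some time to post the final status, though, as the comments below note, that only shrinks the race rather than removing it.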

@matthyx
Contributor

matthyx commented Aug 29, 2022

I am not a big fan of solving race conditions with "wait a little bit more".
Do you think there are other ways of protecting this status report?

@matthyx matthyx moved this from Triage to Triaged in SIG Node Bugs Aug 29, 2022
@matthyx
Contributor

matthyx commented Aug 29, 2022

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 29, 2022
@wzshiming
Member

wzshiming commented Aug 29, 2022

I am not a big fan of solving race conditions with "wait a little bit more".
Do you think there are other ways of protecting this status report?

@matthyx

If the Pods are cleaned up early, the kubelet also releases the shutdown early; the configured period is only the maximum time the kubelet will inhibit shutdown. Once the inhibit time is up, the kubelet can't stop systemd from killing it; it can only request more time from systemd when it initializes itself. There is no other way.

@bobbypage
Member

bobbypage commented Sep 8, 2022

I'm also a bit uncertain about this approach of artificially adding extra time during shutdown (#110804 (comment)). In offline testing, I've seen cases where the pod status manager can have quite high latency even after pod resources are reclaimed. For example, when a large number of pods are being terminated, the latency from the pod actually terminating to the time the status update is sent can be quite large (I've seen >= 10 seconds).

One of the reasons is that we only report the Terminated state after the pod's resources are reclaimed, which can take some time:

func (m *manager) canBeDeleted(pod *v1.Pod, status v1.PodStatus) bool {
    if pod.DeletionTimestamp == nil || kubetypes.IsMirrorPod(pod) {
        return false
    }
    return m.podDeletionSafety.PodResourcesAreReclaimed(pod, status)
}

In general, we never have a full guarantee that the status update will be sent; the API server can be down, for example, or the kubelet's API QPS can be throttled. I think we need to figure out a better solution here.

Perhaps some other avenues to explore:

  1. Find out what the general pod status update latency is: @smarterclayton has a PR in progress to add a metric for it: kubelet: Record a metric for latency of pod status update #107896
  2. See if there are ways we can optimize and reduce the status update latency in the general case
  3. Consider coming up with a new pod phase "terminating" (see Status of pods can become "OutOfCpu" when many pods are created and completed in a short time on the same node. #106884 (comment) for prior discussion). If the node is deleted or shut down before the pod is "terminated", the status should be updated on the server side.

dghubble added a commit to poseidon/typhoon that referenced this issue Sep 10, 2022
* Disable Kubelet Graceful Node Shutdown on worker nodes (enabled in
Kubernetes v1.25.0 #1222)
* Graceful node shutdown allows 30s for critical pods and 15s for
regular pods to shut down before releasing the inhibitor lock so the
host can shut down
* Unfortunately, without further configuration options, regular pods
and the node are shut down at the same time at the end of the 45s
period. In practice, enabling this feature leaves Error or Completed
pods in kube-apiserver state until manually cleaned up; the feature
is not ready for general use
* Fix an issue where Error/Completed pods accumulate whenever any
node restarts (or auto-updates), visible in kubectl get pods
* This issue wasn't apparent in initial testing and seems to only
affect non-critical pods (since critical pods are killed earlier),
but it's very apparent on our real clusters

Rel: kubernetes/kubernetes#110755
@dghubble
Contributor

dghubble commented Sep 10, 2022

In our v1.25.0 clusters, when a node is gracefully shutdown, many non-critical pods end up in an Error or Completed state, with

Message:      Pod was terminated in response to imminent node shutdown.
Reason:       Terminated

after enabling graceful node shutdown. From my observations, it's not rare tail latency. As nodes auto-update, you end up with an increasing amount of Error/Completed Pod clutter. They're technically harmless, but manually cleaning them up is a non-starter. Inspecting the container runtime, those pods were indeed killed. Ultimately, disabling kubelet graceful node shutdown resolved the issue.
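For reference, disabling the feature comes down to zeroing both durations in the kubelet configuration; a minimal sketch (the exact way Typhoon applies its kubelet settings is not shown here):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Setting both durations to zero disables Graceful Node Shutdown entirely,
# so the kubelet no longer takes a systemd inhibitor lock at shutdown.
shutdownGracePeriod: "0s"
shutdownGracePeriodCriticalPods: "0s"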

dghubble added a commit to poseidon/typhoon that referenced this issue Sep 10, 2022
dghubble-robot pushed a commit to poseidon/terraform-azure-kubernetes that referenced this issue Sep 10, 2022
dghubble-robot pushed a commit to poseidon/terraform-onprem-kubernetes that referenced this issue Sep 10, 2022
dghubble-robot pushed a commit to poseidon/terraform-digitalocean-kubernetes that referenced this issue Sep 10, 2022
dghubble-robot pushed a commit to poseidon/terraform-aws-kubernetes that referenced this issue Sep 10, 2022
dghubble-robot pushed a commit to poseidon/terraform-google-kubernetes that referenced this issue Sep 10, 2022
@dghubble
Contributor

Filed separately as #113278

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2023
@dghubble
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2023
Snaipe pushed a commit to aristanetworks/monsoon that referenced this issue Apr 13, 2023
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 20, 2024