
Graceful node shutdown doesn't wait for volume teardown #115148

Open
msau42 opened this issue Jan 18, 2023 · 20 comments · May be fixed by #125070
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@msau42
Member

msau42 commented Jan 18, 2023

What happened?

We have observed that after a node is preempted, pods using persistent volumes take 6+ minutes to restart. This happens when a volume could not be cleanly unmounted (and so is never removed from Node.Status.VolumesInUse), which causes the Attach/Detach controller (in kube-controller-manager) to wait 6 minutes before issuing a force detach. In addition to the long pod restart delay, an unclean unmount can also cause data corruption.

From the logs we've observed the following non-ideal behaviors, which I think reduce to the same root issue: we only check that containers have terminated, not that their volumes have been torn down.

  1. The CSI driver running on the node gets terminated before the workload's volumes are unmounted, which causes unmount to fail. In our case, the CSI driver is in the `system-node-critical` priority class, but since we don't wait for volumes to be unmounted, we move on to shutting down the critical pods.

  2. The OS gets shut down before volume teardown completes, even if there is still time left in the shutdownGracePeriod. In our case, the shutdownGracePeriod is set to 30 seconds, yet the node shut down within 10 seconds, while unmount had still not succeeded.

kubelet.log

Jan 17 21:48:18 ip-10-0-1-208 kubelet[950]: I0117 21:48:18.460409     950 setters.go:548] "Node became not ready" node="ip-10-0-1-208" condition={Type:Ready Status:False LastHeartbeatTime:2023-01-17 21:48:18.46035065 +0000 UTC m=+1235.025577092 LastTransitionTime:2023-01-17 21:48:18.46035065 +0000 UTC m=+1235.025577092 Reason:KubeletNotReady Message:node is shutting down}
...
Jan 17 21:48:27 ip-10-0-1-208 kubelet[950]: E0117 21:48:27.126414     950 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-00477959568e1db98 podName: nodeName:}" failed. No retries permitted until 2023-01-17 21:48:35.126400526 +0000 UTC m=+1251.691626967 (durationBeforeRetry 8s). Error: UnmountDevice failed for volume "pvc-513e457c-86c1-4564-911c-6c0670cb839c" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-00477959568e1db98") on node "ip-10-0-1-208" : kubernetes.io/csi: attacher.UnmountDevice failed to create newCsiDriverClient: driver name ebs.csi.aws.com not found in the list of registered CSI drivers
...
Jan 17 21:48:29 ip-10-0-1-208 kubelet[950]: E0117 21:48:29.006761     950 kubelet.go:2475] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Connection to 10.0.1.208 closed by remote host.
Connection to 10.0.1.208 closed.

Some code tracing:

  1. The logic for releasing the inhibitor lock waits for pods to be terminated within the grace period:

     m.dbusCon.ReleaseInhibitLock(m.inhibitLock)

  2. It depends on IsTerminated to indicate that the pod has been killed.

  3. IsTerminated depends on terminatedAt, which is set when all containers have stopped:

     terminatedAt time.Time

  4. However, volume teardown for a Pod only begins once all of its containers have stopped:

     if !dswp.podStateProvider.ShouldPodRuntimeBeRemoved(volumeToMount.Pod.UID) {
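
For illustration, a toy Go model of the race described above (all names below are illustrative stand-ins, not the real kubelet types): the shutdown manager's check becomes true the moment containers stop, which is exactly when volume teardown is first allowed to start, so the inhibit lock can be released while unmount is still in flight.

    package main

    import (
            "fmt"
            "time"
    )

    type podState struct {
            terminatedAt   time.Time // set when all containers have stopped (see terminatedAt above)
            volumesCleaned bool      // set when the volume manager finishes unmount/unstage
    }

    // isTerminated mirrors what the shutdown manager checks today: container state only.
    func isTerminated(p podState) bool { return !p.terminatedAt.IsZero() }

    // isFullyTornDown is the stronger condition this issue argues the shutdown
    // manager should wait for before releasing the inhibit lock.
    func isFullyTornDown(p podState) bool { return isTerminated(p) && p.volumesCleaned }

    func main() {
            p := podState{terminatedAt: time.Now()} // containers just stopped; teardown not done yet

            fmt.Println("shutdown manager would proceed:", isTerminated(p)) // true
            fmt.Println("volumes actually torn down:", isFullyTornDown(p))  // false -> race window
    }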

What did you expect to happen?

Node graceful shutdown to also wait for volumes to be torn down.

How can we reproduce it (as minimally and precisely as possible)?

Deploy a StatefulSet pod using an attachable PVC, and then trigger node shutdown.

Anything else we need to know?

No response

Kubernetes version

Tested on 1.25, 1.26

Cloud provider

Observed on GCP, AWS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@msau42 msau42 added the kind/bug Categorizes issue or PR as related to a bug. label Jan 18, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 18, 2023
@msau42
Member Author

msau42 commented Jan 18, 2023

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 18, 2023
@msau42 msau42 changed the title Graceful node shutdown releases inhibitor lock without waiting for volume teardown Graceful node shutdown doesn't wait for volume teardown Jan 18, 2023
@msau42
Member Author

msau42 commented Jan 18, 2023

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jan 18, 2023
@bobbypage
Member

/cc

@MadhavJivrajani
Contributor

/cc @sonasingh46

@sftim
Contributor

sftim commented Jan 18, 2023

For this CSI driver, what do the preStop hooks do?

I'm wondering if there's an opportunity to detect the shutdown and perform cleanup, or at least a more graceful detach.

@msau42
Member Author

msau42 commented Jan 18, 2023

CSI drivers depend on the kubelet (volume manager) to tell them when it's safe to unmount. The key signal for the volume manager is that the Pod's containers have terminated; we do not want to unmount a volume while a container could still be writing to it.

Here's a diagram I made that illustrates the race. The main challenge is that the node shutdown manager and the volume manager are operating off of the same signal (pod terminated) concurrently.

[diagram: node shutdown manager and volume manager racing on the "pod terminated" signal]

There are two possible solutions I can see:

  1. A preStop or SIGTERM handler in the CSI driver that detects shutdown and waits for all mounts to be unmounted. We would have to recommend that every CSI driver implement this.
  2. The node shutdown manager also considers volume cleanup as a signal (see the sketch below).
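
For illustration, a rough sketch of what option 2 could look like: after a pod's containers terminate, the shutdown manager keeps polling the volume manager until the pod's volumes are gone or the group's grace period runs out. The volumeStateProvider interface and the wiring are hypothetical stand-ins (the real kubelet volume manager exposes a similar GetMountedVolumesForPod method), not the actual fix.

    package nodeshutdownsketch

    import (
            "context"
            "fmt"
            "time"
    )

    // volumeStateProvider is a minimal stand-in for the kubelet volume manager.
    type volumeStateProvider interface {
            // MountedVolumeCount returns how many volumes are still mounted for the pod.
            MountedVolumeCount(podUID string) int
    }

    // waitForVolumeTeardown blocks until the pod has no mounted volumes left or the
    // remaining shutdown grace period (carried by ctx) expires.
    func waitForVolumeTeardown(ctx context.Context, vm volumeStateProvider, podUID string) error {
            ticker := time.NewTicker(300 * time.Millisecond)
            defer ticker.Stop()
            for {
                    if vm.MountedVolumeCount(podUID) == 0 {
                            return nil
                    }
                    select {
                    case <-ctx.Done():
                            return fmt.Errorf("pod %s still has mounted volumes at shutdown deadline: %w", podUID, ctx.Err())
                    case <-ticker.C:
                    }
            }
    }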

@sftim
Contributor

sftim commented Jan 18, 2023

I'd imagined that the CSI driver could be aware of all the mounted volumes for that particular driver, and complete its preStop once those are all unpublished / unstaged. Is that viable?

@msau42
Member Author

msau42 commented Jan 18, 2023

Many CSI drivers don't keep state of what's mounted. They could potentially do that by scanning mounts on the host.
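
For illustration, a minimal sketch of that host-scan approach, assuming the conventional kubelet mount path layout; the path prefixes (and the idea of gating a preStop wait on this count) are assumptions, not something any particular driver implements today.

    package mountscan

    import (
            "bufio"
            "os"
            "strings"
    )

    // countCSIMounts scans /proc/mounts and counts mount points that look like
    // kubelet-managed CSI mounts (per-pod volume mounts and global staging mounts).
    // A real driver would additionally filter on its own driver name in the path.
    func countCSIMounts() (int, error) {
            f, err := os.Open("/proc/mounts")
            if err != nil {
                    return 0, err
            }
            defer f.Close()

            count := 0
            scanner := bufio.NewScanner(f)
            for scanner.Scan() {
                    fields := strings.Fields(scanner.Text())
                    if len(fields) < 2 {
                            continue
                    }
                    mountPoint := fields[1]
                    if strings.Contains(mountPoint, "/volumes/kubernetes.io~csi/") ||
                            strings.Contains(mountPoint, "/plugins/kubernetes.io/csi/") {
                            count++
                    }
            }
            return count, scanner.Err()
    }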

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Jan 18, 2023
@SergeyKanzhelev
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 18, 2023
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Triaged in SIG Node Bugs Jan 18, 2023
@mauriciopoppe
Member

Adding on top of #115148 (comment), I think the "wait for volumes unmounted" workaround (in either preStop hook or CSI Driver SIGTERM handler impls) might not work in the upgrade scenario:

  • Conditions
    • node has CSI Driver v1 with a "wait for volumes unmounted" impl
    • there's a Pod p1 running with a volume attached
  • The DaemonSet is updated to CSI Driver v2.
  • kubelet attempts to delete the CSI Driver v1 pod; because of the "wait for volumes unmounted" logic, it will block until all volumes are unmounted. In reality Pod p1 is still running fine and doesn't need its volume unmounted, so the CSI Driver v1 pod's termination is delayed until it's forcefully terminated.

This could be a reason why we should pursue the solution: "node shutdown manager also considers volume cleanup as a signal"

@sftim
Contributor

sftim commented Jan 20, 2023

If a normal-priority Pod is still running, then the preStop hook I'm imagining has no cleanup to do there. I'm picturing a situation where a CSI driver is shutting down, there are now no running containers that require storage from the driver, but the local cleanup for the mounted PersistentVolumes / supporting infrastructure has not completed.

In other words, this cleanup should happen after "kill normal pods" as illustrated in #115148 (comment), and before the kubelet considers that all Pods have terminated.

@msau42
Member Author

msau42 commented Jan 20, 2023

As @mauriciopoppe pointed out, the challenge with adding a preStop/SIGTERM hook is that there are situations where the driver pod is shutting down but users' pods and mounts are expected to stay up and running. The most common scenario is a rolling update of the CSI driver DaemonSet.

@sftim
Contributor

sftim commented Jan 21, 2023

@msau42 are you also saying that it's difficult for a CSI driver to tell if it's in use? I'm suggesting that a preStop hook could use that information (whether or not any running containers are still using the driver) and adjust its behavior accordingly.

For example, a Pod shutdown due to a DaemonSet rolling update could do an early exit during preStop after detecting that a running Pod (or container - either is fine) depends on the storage that this driver Pod is providing.

@msau42
Member Author

msau42 commented Jan 23, 2023

are you also saying that it's difficult for a CSI driver to tell if it's in use? I'm suggesting that a preStop hook could use that information (whether or not any running containers are still using the driver) and adjust its behavior accordingly.

For example, a Pod shutdown due to a DaemonSet rolling update could do an early exit during preStop after detecting that a running Pod (or container - either is fine) depends on the storage that this driver Pod is providing.

Yes, this is difficult because CSI drivers are designed to be Kubernetes-agnostic, so things like listing and inspecting Pods are not something we require CSI drivers to do. In addition, the check would be racy as pods come and go, and it would be difficult to scale (we try to avoid per-node components making expensive List calls).

@sftim
Contributor

sftim commented Jan 23, 2023

💭 is there a way that the kubelet or host OS can provide extra context to the DaemonSet Pods?

Analogy: /etc/nologin which marks that a POSIX-like machine shouldn't allow logins. If either a systemd unit or the kubelet were to create a file as soon as the node begins a graceful shutdown (possibly also other actions, like a full drain), then we could mount that path into CSI driver Pods and allow them to customize their behavior.

Other parts of the Kubernetes ecosystem, not just storage, might find it useful to know that the kubelet is shutting down all Pods in sequence. For example, a network driver might unarp or release IP addresses if the driver Pod termination is triggered by a node shutdown. Lots of other examples.

Separately, but also relevant: if you want to set up a node so that stopping the kubelet does trigger a local drain, this change I'm imagining could be part of that design - and only needs minimal or no changes to the kubelet itself.
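
For illustration, a minimal sketch of the marker-file idea from this comment, in Go. The marker path, its creation by a systemd unit or the kubelet, and its mounting into the CSI driver pod are all hypothetical; nothing like this exists today.

    package prestopsketch

    import (
            "errors"
            "os"
    )

    // shutdownMarker is a hypothetical path that a systemd unit or the kubelet would
    // create when node graceful shutdown begins (analogous to /etc/nologin), mounted
    // into the CSI driver pod.
    const shutdownMarker = "/var/run/node-shutdown"

    // nodeIsShuttingDown reports whether the marker file exists. A CSI driver's preStop
    // hook could use this to choose between exiting early (DaemonSet rolling update) and
    // waiting for its volumes to be unpublished/unstaged (node shutdown).
    func nodeIsShuttingDown() (bool, error) {
            _, err := os.Stat(shutdownMarker)
            if err == nil {
                    return true, nil
            }
            if errors.Is(err, os.ErrNotExist) {
                    return false, nil
            }
            return false, err
    }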

@msau42
Member Author

msau42 commented Jan 25, 2023

Having the kubelet advertise some sort of file-based signal could be an option.

From the CSI perspective, I think a vast majority of drivers would all want the same behavior and would end up implementing the same logic. Given that there are over 100 CSI drivers, I would prefer to pursue an option that makes it easier to maintain consistent behavior.

@mauriciopoppe
Member

mauriciopoppe commented Jan 27, 2023

I'm sharing an idea that I posted internally. TL;DR: add a flag to the shutdownGracePeriodByPodPriority entries in the KubeletConfiguration that makes the kubelet wait out the full shutdownGracePeriod for a group. We can then configure workloads in a group that the node shutdown manager terminates first (waiting for that group's full shutdownGracePeriod), with critical workloads like CSI drivers terminated later in a different group; that gives the CSI driver the chance to complete the volume teardown CSI calls.


processShutdownEvent does the following:

  • Get pods and group them by priority.
    • Grouping configuration is set through the KubeletConfiguration shutdown* keys: we can set (shutdownGracePeriod and shutdownGracePeriodCriticalPods) or shutdownGracePeriodByPodPriority; if shutdownGracePeriod and shutdownGracePeriodCriticalPods are set, the code migrates them to the shutdownGracePeriodByPodPriority config anyway.
  • Let's assume we have 2 groups (a workload group where the StatefulSet is, and a critical group where the CSI Driver is). The following loop iterates over all the Pods in a group and terminates each one in its own goroutine, tracking overall group completion with a sync.WaitGroup:

        for _, group := range groups {
                // If there are no pods in a particular range,
                // then do not wait for pods in that priority range.
                if len(group.Pods) == 0 {
                        continue
                }

                var wg sync.WaitGroup
                wg.Add(len(group.Pods))
                for _, pod := range group.Pods {
                        go func(pod *v1.Pod, group podShutdownGroup) {
                        ...

    At the end we either return when all the pods in the group are terminated, or time out with the configured timeout (i.e. the shutdownGracePeriod for this group):

        go func() {
                defer close(doneCh)
                wg.Wait()
        }()

        select {
        case <-doneCh:
                timer.Stop()
        case <-timer.C():
                m.logger.V(1).Info("Shutdown manager pod killing time out", "gracePeriod", group.ShutdownGracePeriodSeconds, "priority", group.Priority)
        }

  • The idea is to add an additional parameter to ShutdownGracePeriodByPodPriority (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go#L642) that keeps the kubelet waiting for the full shutdownGracePeriod even if all Pods in this group have completed. If that's set, then in the select above we'd no longer react to the sync.WaitGroup completing, only to the timeout.

KubeletConfiguration example:

kind: KubeletConfiguration
shutdownGracePeriodByPodPriority:
  - priority: 0
    shutdownGracePeriodSeconds: 10
    respectShutdownGracePeriod: true  # new
  - priority: inf
    shutdownGracePeriodSeconds: 20
loop:
        for {
                select {
                case <-doneCh:
                        if group.RespectShutdownGracePeriod {
                                // Don't finish on the sync.WaitGroup signal; disable this case
                                // (a nil channel never receives) and keep waiting for the timer
                                // so the full grace period is respected.
                                doneCh = nil
                                break
                        }
                        timer.Stop()
                        break loop
                case <-timer.C():
                        m.logger.V(1).Info("Shutdown manager pod killing time out", "gracePeriod", group.ShutdownGracePeriodSeconds, "priority", group.Priority)
                        break loop
                }
        }
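
For concreteness, a sketch of the proposed config change in Go: the first two fields mirror the existing ShutdownGracePeriodByPodPriority type in pkg/kubelet/apis/config/types.go, while RespectShutdownGracePeriod is the new field proposed here and does not exist upstream.

    type ShutdownGracePeriodByPodPriority struct {
            // Priority is the priority value associated with the shutdown grace period.
            Priority int32
            // ShutdownGracePeriodSeconds is the shutdown grace period in seconds.
            ShutdownGracePeriodSeconds int64
            // RespectShutdownGracePeriod (proposed) makes the shutdown manager wait out the
            // full grace period for this group even if all of its pods have already terminated,
            // leaving time for volume teardown before the next group is killed.
            RespectShutdownGracePeriod bool
    }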

@smarterclayton
Contributor

Once we add context cancellation, we should be much closer to terminating pods in deterministic time (right now a long syncPod can take longer than the graceful shutdown period), which means we could and should also allocate some time for other subsystems of the node to reach a terminal state. I'll keep that in mind when we look at pointing components downstream of the pod worker at pod worker state instead of pod manager.

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Mar 13, 2024
@msau42
Copy link
Member Author

msau42 commented Mar 18, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 18, 2024