
Graceful node shutdown doesn't wait for volume teardown #115148

Open
msau42 opened this issue Jan 18, 2023 · 20 comments · May be fixed by #125070
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@msau42
Member

msau42 commented Jan 18, 2023

What happened?

We have observed that after a node is preempted, pods using persistent volumes take 6+ minutes to restart. This happens when a volume could not be cleanly unmounted (and so is never removed from Node.Status.VolumesInUse), which causes the Attach/Detach controller (in kube-controller-manager) to wait 6 minutes before issuing a force detach. In addition to the long pod restart delay, an unclean unmount can also cause data corruption.

From the logs we've observed the following non-ideal behaviors, which I think reduce to the same root issue: we only check that containers have terminated, not that their volumes have been torn down.

  1. The CSI driver running on the node gets terminated before the workload's volumes are unmounted, which causes unmount to fail. In our case, the CSI driver is in the `system-node-critical` priority class, but since we don't wait for volumes to be unmounted, we move on to shutting down the critical pods.

  2. The OS gets shut down before volume teardown completes, even if there is still time left in the shutdownGracePeriod. In our case, the shutdownGracePeriod is set to 30 seconds, yet the node shut down within 10 seconds, while unmount had still not succeeded.

kubelet.log

Jan 17 21:48:18 ip-10-0-1-208 kubelet[950]: I0117 21:48:18.460409     950 setters.go:548] "Node became not ready" node="ip-10-0-1-208" condition={Type:Ready Status:False LastHeartbeatTime:2023-01-17 21:48:18.46035065 +0000 UTC m=+1235.025577092 LastTransitionTime:2023-01-17 21:48:18.46035065 +0000 UTC m=+1235.025577092 Reason:KubeletNotReady Message:node is shutting down}
...
Jan 17 21:48:27 ip-10-0-1-208 kubelet[950]: E0117 21:48:27.126414     950 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-00477959568e1db98 podName: nodeName:}" failed. No retries permitted until 2023-01-17 21:48:35.126400526 +0000 UTC m=+1251.691626967 (durationBeforeRetry 8s). Error: UnmountDevice failed for volume "pvc-513e457c-86c1-4564-911c-6c0670cb839c" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-00477959568e1db98") on node "ip-10-0-1-208" : kubernetes.io/csi: attacher.UnmountDevice failed to create newCsiDriverClient: driver name ebs.csi.aws.com not found in the list of registered CSI drivers
...
Jan 17 21:48:29 ip-10-0-1-208 kubelet[950]: E0117 21:48:29.006761     950 kubelet.go:2475] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Connection to 10.0.1.208 closed by remote host.
Connection to 10.0.1.208 closed.

Some code tracing:

  1. The logic for releasing the inhibitor lock waits for pods to be terminated within the grace period:

     m.dbusCon.ReleaseInhibitLock(m.inhibitLock)

  2. It depends on IsTerminated to indicate that the pod has been killed.

  3. IsTerminated depends on terminatedAt, which is set when all containers have stopped:

     terminatedAt time.Time

  4. However, volume teardown for a Pod only begins once all of its containers have stopped:

     if !dswp.podStateProvider.ShouldPodRuntimeBeRemoved(volumeToMount.Pod.UID) {
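
For illustration, a toy Go model of the race described above (all names below are illustrative stand-ins, not the real kubelet types): the shutdown manager's check becomes true the moment containers stop, which is exactly when volume teardown is first allowed to start, so the inhibit lock can be released while unmount is still in flight.

    package main

    import (
            "fmt"
            "time"
    )

    type podState struct {
            terminatedAt   time.Time // set when all containers have stopped (see terminatedAt above)
            volumesCleaned bool      // set when the volume manager finishes unmount/unstage
    }

    // isTerminated mirrors what the shutdown manager checks today: container state only.
    func isTerminated(p podState) bool { return !p.terminatedAt.IsZero() }

    // isFullyTornDown is the stronger condition this issue argues the shutdown
    // manager should wait for before releasing the inhibit lock.
    func isFullyTornDown(p podState) bool { return isTerminated(p) && p.volumesCleaned }

    func main() {
            p := podState{terminatedAt: time.Now()} // containers just stopped; teardown not done yet

            fmt.Println("shutdown manager would proceed:", isTerminated(p)) // true
            fmt.Println("volumes actually torn down:", isFullyTornDown(p))  // false -> race window
    }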

What did you expect to happen?

Node graceful shutdown to also wait for volumes to be torn down.

How can we reproduce it (as minimally and precisely as possible)?

Deploy a StatefulSet pod using an attachable PVC, and then trigger node shutdown.

Anything else we need to know?

No response

Kubernetes version

Tested on 1.25, 1.26

Cloud provider

Observed on GCP, AWS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@msau42 msau42 added the kind/bug Categorizes issue or PR as related to a bug. label Jan 18, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 18, 2023
@msau42
Member Author

msau42 commented Jan 18, 2023

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 18, 2023
@msau42 msau42 changed the title Graceful node shutdown releases inhibitor lock without waiting for volume teardown Graceful node shutdown doesn't wait for volume teardown Jan 18, 2023
@msau42
Member Author

msau42 commented Jan 18, 2023

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jan 18, 2023
@bobbypage
Member

/cc

@MadhavJivrajani
Contributor

/cc @sonasingh46

@sftim
Contributor

sftim commented Jan 18, 2023

For this CSI driver, what do the preStop hooks do?

I'm wondering if there's an opportunity to detect the shutdown and perform cleanup, or at least a more graceful detach.

@msau42
Member Author

msau42 commented Jan 18, 2023

CSI drivers depend on the kubelet (volume manager) to tell them when it's safe to unmount. The key signal for the volume manager is that the Pod's containers have terminated; we do not want to unmount a volume while a container could still be writing to it.

Here's a diagram I made that illustrates the race. The main challenge is that the node shutdown manager and the volume manager are operating off of the same signal (pod terminated) concurrently.

[diagram: node shutdown manager and volume manager racing on the "pod terminated" signal]

There are two possible solutions I can see:

  1. A preStop or SIGTERM handler in the CSI driver that detects shutdown and waits for all mounts to be unmounted. We would have to recommend that every CSI driver implement this.
  2. The node shutdown manager also considers volume cleanup as a signal (see the sketch below).
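
For illustration, a rough sketch of what option 2 could look like: after a pod's containers terminate, the shutdown manager keeps polling the volume manager until the pod's volumes are gone or the group's grace period runs out. The volumeStateProvider interface and the wiring are hypothetical stand-ins (the real kubelet volume manager exposes a similar GetMountedVolumesForPod method), not the actual fix.

    package nodeshutdownsketch

    import (
            "context"
            "fmt"
            "time"
    )

    // volumeStateProvider is a minimal stand-in for the kubelet volume manager.
    type volumeStateProvider interface {
            // MountedVolumeCount returns how many volumes are still mounted for the pod.
            MountedVolumeCount(podUID string) int
    }

    // waitForVolumeTeardown blocks until the pod has no mounted volumes left or the
    // remaining shutdown grace period (carried by ctx) expires.
    func waitForVolumeTeardown(ctx context.Context, vm volumeStateProvider, podUID string) error {
            ticker := time.NewTicker(300 * time.Millisecond)
            defer ticker.Stop()
            for {
                    if vm.MountedVolumeCount(podUID) == 0 {
                            return nil
                    }
                    select {
                    case <-ctx.Done():
                            return fmt.Errorf("pod %s still has mounted volumes at shutdown deadline: %w", podUID, ctx.Err())
                    case <-ticker.C:
                    }
            }
    }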

@sftim
Contributor

sftim commented Jan 18, 2023

I'd imagined that the CSI driver could be aware of all the mounted volumes for that particular driver, and complete its preStop once those are all unpublished / unstaged. Is that viable?

@msau42
Member Author

msau42 commented Jan 18, 2023

Many CSI drivers don't keep state of what's mounted. They could potentially do that by scanning mounts on the host.
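
For illustration, a minimal sketch of that host-scan approach, assuming the conventional kubelet mount path layout; the path prefixes (and the idea of gating a preStop wait on this count) are assumptions, not something any particular driver implements today.

    package mountscan

    import (
            "bufio"
            "os"
            "strings"
    )

    // countCSIMounts scans /proc/mounts and counts mount points that look like
    // kubelet-managed CSI mounts (per-pod volume mounts and global staging mounts).
    // A real driver would additionally filter on its own driver name in the path.
    func countCSIMounts() (int, error) {
            f, err := os.Open("/proc/mounts")
            if err != nil {
                    return 0, err
            }
            defer f.Close()

            count := 0
            scanner := bufio.NewScanner(f)
            for scanner.Scan() {
                    fields := strings.Fields(scanner.Text())
                    if len(fields) < 2 {
                            continue
                    }
                    mountPoint := fields[1]
                    if strings.Contains(mountPoint, "/volumes/kubernetes.io~csi/") ||
                            strings.Contains(mountPoint, "/plugins/kubernetes.io/csi/") {
                            count++
                    }
            }
            return count, scanner.Err()
    }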

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Jan 18, 2023
@SergeyKanzhelev
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 18, 2023
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Triaged in SIG Node Bugs Jan 18, 2023
@mauriciopoppe
Member

Adding on top of #115148 (comment), I think the "wait for volumes unmounted" workaround (in either preStop hook or CSI Driver SIGTERM handler impls) might not work in the upgrade scenario:

  • Conditions
    • node has CSI Driver v1 with a "wait for volumes unmounted" impl
    • there's a Pod p1 running with a volume attached
  • The DaemonSet is updated to CSI Driver v2.
  • kubelet attempts to delete the CSI Driver v1 pod; because of the "wait for volumes unmounted" logic, it will block until all volumes are unmounted. In reality Pod p1 is still running fine and doesn't need its volume unmounted, so the CSI Driver v1 pod's termination is delayed until it's forcefully terminated.

This could be a reason why we should pursue the solution: "node shutdown manager also considers volume cleanup as a signal"

@sftim
Contributor

sftim commented Jan 20, 2023

If a normal-priority Pod is still running, then the preStop hook I'm imagining has no cleanup to do there. I'm picturing a situation where a CSI driver is shutting down, there are now no running containers that require storage from the driver, but the local cleanup for the mounted PersistentVolumes / supporting infrastructure has not completed.

In other words, this cleanup should happen after "kill normal pods" as illustrated in #115148 (comment), and before the kubelet considers that all Pods have terminated.

@msau42
Member Author

msau42 commented Jan 20, 2023

As @mauriciopoppe pointed out, the challenge with adding a preStop/SIGTERM hook is that there are situations where the driver pod is shutting down but users' pods and mounts are expected to stay up and running. The most common scenario is a rolling update of the CSI driver DaemonSet.

@sftim
Contributor

sftim commented Jan 21, 2023

@msau42 are you also saying that it's difficult for a CSI driver to tell if it's in use? I'm suggesting that a preStop hook could use that information (whether or not any running containers are still using the driver) and adjust its behavior accordingly.

For example, a Pod shutdown due to a DaemonSet rolling update could do an early exit during preStop after detecting that a running Pod (or container - either is fine) depends on the storage that this driver Pod is providing.

@msau42
Member Author

msau42 commented Jan 23, 2023

are you also saying that it's difficult for a CSI driver to tell if it's in use? I'm suggesting that a preStop hook could use that information (whether or not any running containers are still using the driver) and adjust its behavior accordingly.

For example, a Pod shutdown due to a DaemonSet rolling update could do an early exit during preStop after detecting that a running Pod (or container - either is fine) depends on the storage that this driver Pod is providing.

Yes, this is difficult because CSI drivers are designed to be Kubernetes-agnostic, so things like listing and inspecting Pods are not something we require CSI drivers to do. In addition, the check would be racy as pods come and go, and it would be difficult to scale (we try to avoid per-node components making expensive List calls).

@sftim
Contributor

sftim commented Jan 23, 2023

💭 is there a way that the kubelet or host OS can provide extra context to the DaemonSet Pods?

Analogy: /etc/nologin which marks that a POSIX-like machine shouldn't allow logins. If either a systemd unit or the kubelet were to create a file as soon as the node begins a graceful shutdown (possibly also other actions, like a full drain), then we could mount that path into CSI driver Pods and allow them to customize their behavior.

Other parts of the Kubernetes ecosystem, not just storage, might find it useful to know that the kubelet is shutting down all Pods in sequence. For example, a network driver might unarp or release IP addresses if the driver Pod termination is triggered by a node shutdown. Lots of other examples.

Separately, but also relevant: if you want to set up a node so that stopping the kubelet does trigger a local drain, this change I'm imagining could be part of that design - and only needs minimal or no changes to the kubelet itself.
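
For illustration, a minimal sketch of the marker-file idea from this comment, in Go. The marker path, its creation by a systemd unit or the kubelet, and its mounting into the CSI driver pod are all hypothetical; nothing like this exists today.

    package prestopsketch

    import (
            "errors"
            "os"
    )

    // shutdownMarker is a hypothetical path that a systemd unit or the kubelet would
    // create when node graceful shutdown begins (analogous to /etc/nologin), mounted
    // into the CSI driver pod.
    const shutdownMarker = "/var/run/node-shutdown"

    // nodeIsShuttingDown reports whether the marker file exists. A CSI driver's preStop
    // hook could use this to choose between exiting early (DaemonSet rolling update) and
    // waiting for its volumes to be unpublished/unstaged (node shutdown).
    func nodeIsShuttingDown() (bool, error) {
            _, err := os.Stat(shutdownMarker)
            if err == nil {
                    return true, nil
            }
            if errors.Is(err, os.ErrNotExist) {
                    return false, nil
            }
            return false, err
    }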

@msau42
Member Author

msau42 commented Jan 25, 2023

Having the kubelet advertise some sort of file-based signal could be an option.

From the CSI perspective, I think a vast majority of drivers would all want the same behavior and would end up implementing the same logic. Given that there are over 100 CSI drivers, I would prefer to pursue an option that makes it easier to maintain consistent behavior.

@mauriciopoppe
Member

mauriciopoppe commented Jan 27, 2023

I'm sharing an idea that I posted internally. TL;DR: add a flag to the shutdownGracePeriodByPodPriority entries in the KubeletConfiguration that makes the kubelet wait out the full shutdownGracePeriod for a group. We can then configure workloads in a group that the node shutdown manager terminates first (waiting for that group's full shutdownGracePeriod), with critical workloads like CSI drivers terminated later in a different group; that gives the CSI driver the chance to complete the volume teardown CSI calls.


processShutdownEvent does the following:

  • Get pods and group them by priority.
    • Grouping configuration is set through the KubeletConfiguration shutdown* keys: we can set (shutdownGracePeriod and shutdownGracePeriodCriticalPods) or shutdownGracePeriodByPodPriority; if shutdownGracePeriod and shutdownGracePeriodCriticalPods are set, the code migrates them to the shutdownGracePeriodByPodPriority config anyway.
  • Let's assume we have 2 groups (a workload group where the StatefulSet is, and a critical group where the CSI Driver is). The following loop iterates over all the Pods in a group and terminates each one in its own goroutine, tracking overall group completion with a sync.WaitGroup:

        for _, group := range groups {
                // If there are no pods in a particular range,
                // then do not wait for pods in that priority range.
                if len(group.Pods) == 0 {
                        continue
                }

                var wg sync.WaitGroup
                wg.Add(len(group.Pods))
                for _, pod := range group.Pods {
                        go func(pod *v1.Pod, group podShutdownGroup) {
                        ...

    At the end we either return when all the pods in the group are terminated, or time out with the configured timeout (i.e. the shutdownGracePeriod for this group):

        go func() {
                defer close(doneCh)
                wg.Wait()
        }()

        select {
        case <-doneCh:
                timer.Stop()
        case <-timer.C():
                m.logger.V(1).Info("Shutdown manager pod killing time out", "gracePeriod", group.ShutdownGracePeriodSeconds, "priority", group.Priority)
        }

  • The idea is to add an additional parameter to ShutdownGracePeriodByPodPriority (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go#L642) that keeps the kubelet waiting for the full shutdownGracePeriod even if all Pods in this group have completed. If that's set, then in the select above we'd no longer react to the sync.WaitGroup completing, only to the timeout.

KubeletConfiguration example:

kind: KubeletConfiguration
shutdownGracePeriodByPodPriority:
  - priority: 0
    shutdownGracePeriodSeconds: 10
    respectShutdownGracePeriod: true  # new
  - priority: inf
    shutdownGracePeriodSeconds: 20
loop:
        for {
                select {
                case <-doneCh:
                        if group.RespectShutdownGracePeriod {
                                // Don't finish on the sync.WaitGroup signal; disable this case
                                // (a nil channel never receives) and keep waiting for the timer
                                // so the full grace period is respected.
                                doneCh = nil
                                break
                        }
                        timer.Stop()
                        break loop
                case <-timer.C():
                        m.logger.V(1).Info("Shutdown manager pod killing time out", "gracePeriod", group.ShutdownGracePeriodSeconds, "priority", group.Priority)
                        break loop
                }
        }
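
For concreteness, a sketch of the proposed config change in Go: the first two fields mirror the existing ShutdownGracePeriodByPodPriority type in pkg/kubelet/apis/config/types.go, while RespectShutdownGracePeriod is the new field proposed here and does not exist upstream.

    type ShutdownGracePeriodByPodPriority struct {
            // Priority is the priority value associated with the shutdown grace period.
            Priority int32
            // ShutdownGracePeriodSeconds is the shutdown grace period in seconds.
            ShutdownGracePeriodSeconds int64
            // RespectShutdownGracePeriod (proposed) makes the shutdown manager wait out the
            // full grace period for this group even if all of its pods have already terminated,
            // leaving time for volume teardown before the next group is killed.
            RespectShutdownGracePeriod bool
    }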

@smarterclayton
Contributor

Once we add context cancellation, we should be much closer to terminating pods in deterministic time (right now a long syncPod can take longer than the graceful shutdown period), which means we could and should also allocate some time for other subsystems of the node to reach a terminal state. I'll keep that in mind when we look at pointing components downstream of the pod worker at pod worker state instead of pod manager.

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Mar 13, 2024
@msau42
Copy link
Member Author

msau42 commented Mar 18, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 18, 2024