Graceful node shutdown doesn't wait for volume teardown #115148
Comments
/sig node
/sig storage
/cc
/cc @sonasingh46
For this CSI driver, what do the teardown steps actually do? I'm wondering if there's an opportunity to detect the shutdown and perform cleanup, or at least a more graceful detach.
CSI drivers depend on kubelet (volume manager) to tell them when it's safe to do an unmount. The key signal for volume manager is that the Pod's containers have terminated. We do not want to unmount a volume while a container could still be writing to it. Here's a diagram I made that illustrates the race. The main challenge is that the node shutdown manager and the volume manager are operating off of the same signal (pod terminated) concurrently. There are two possible solutions I can see: have the node shutdown manager also consider volume cleanup as a signal, or have the CSI driver itself wait for volumes to be unmounted before it shuts down (e.g. in a preStop hook or SIGTERM handler).
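A minimal Go-style sketch of that race, with illustrative names and timings only (none of this is the real kubelet API); it shows how two components that both key off the same "pod terminated" signal can run concurrently, so the shutdown path can win against the unmount:

```go
// Illustrative sketch only: both the node shutdown manager and the volume
// manager wait for the same "pod terminated" signal, and nothing orders the
// OS halt after the unmount completes.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	podTerminated := make(chan struct{}) // signal: all containers have stopped
	var wg sync.WaitGroup

	// Node shutdown manager: as soon as the pod reports terminated, it moves
	// on to the next priority group (critical pods such as the CSI driver)
	// and eventually releases the shutdown inhibitor.
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-podTerminated
		fmt.Println("shutdown manager: pod terminated, killing critical pods and letting the OS halt")
	}()

	// Volume manager: it also waits for the pod to terminate before it is
	// safe to unmount, but the unmount itself takes time and still needs the
	// CSI driver to be running.
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-podTerminated
		fmt.Println("volume manager: pod terminated, starting NodeUnpublish/NodeUnstage")
		time.Sleep(200 * time.Millisecond) // unmount in flight...
		fmt.Println("volume manager: unmount finished (possibly too late)")
	}()

	close(podTerminated) // containers stopped: both react concurrently
	wg.Wait()
}
```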
I'd imagined that the CSI driver could be aware of all the mounted volumes for that particular driver, and complete its cleanup of those mounts before exiting.
Many CSI drivers don't keep state of what's mounted. They could potentially do that by scanning mounts on the host.
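As a rough illustration of the "scan mounts on the host" approach, a minimal sketch that lists mount points under an assumed CSI path prefix; the prefix and layout are assumptions, since actual staging/publish paths depend on the kubelet root dir and the driver:

```go
// Sketch: discover a driver's remaining mounts by scanning /proc/mounts.
// The path prefix below is an assumption made for illustration.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	const prefix = "/var/lib/kubelet/plugins/kubernetes.io/csi/" // assumed layout

	f, err := os.Open("/proc/mounts")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// /proc/mounts format: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && strings.HasPrefix(fields[1], prefix) {
			fmt.Println("still mounted:", fields[1])
		}
	}
}
```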
/triage accepted
Adding on top of #115148 (comment), I think the "wait for volumes unmounted" workaround (in either preStop hook or CSI Driver SIGTERM handler impls) might not work in the upgrade scenario:
This could be a reason why we should pursue the solution: "node shutdown manager also considers volume cleanup as a signal"
If a normal-priority Pod is still running, then the cleanup can't start yet. In other words, this cleanup should happen after killing normal pods, as illustrated in #115148 (comment), and before the kubelet considers that all Pods have terminated.
As @mauriciopoppe pointed out, the challenge with adding a preStop/SIGTERM hook is that there are situations where the driver pod is shutting down but users' pods and mounts are expected to stay up and running. The most common scenario would be a rolling update of the CSI driver DaemonSet.
@msau42 are you also saying that it's difficult for a CSI driver to tell if it's still in use? I'm suggesting that a preStop hook could check for this. For example, a Pod shutdown due to a DaemonSet rolling update could do an early exit during preStop.
Yes, this is difficult because CSI drivers are designed to be Kubernetes-agnostic, so things like listing and inspecting Pods are not things we require CSI drivers to do. In addition, the check would be racy as pods come and go, and it would also be difficult to scale (we try to avoid per-node components making expensive List calls).
💭 Is there a way that the kubelet or host OS can provide extra context to the DaemonSet Pods? Analogy: other parts of the Kubernetes ecosystem, not just storage, might find it useful to know that the kubelet is shutting down all Pods in sequence. For example, a network driver might unarp or release IP addresses if the driver Pod's termination is triggered by a node shutdown. Lots of other examples.

Separately, but also relevant: if you want to set up a node so that stopping the kubelet does trigger a local drain, the change I'm imagining could be part of that design, and it only needs minimal or no changes to the kubelet itself.
Having the kubelet advertise some sort of file-based signal could be an option. From the CSI perspective, I think the vast majority of drivers would all want the same behavior and would end up implementing the same logic. Given that there are over 100 CSI drivers, I would prefer to pursue an option that makes it easier to maintain consistent behavior.
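To make the file-based signal concrete, a hypothetical sketch of a driver-side check; the marker path `/var/lib/kubelet/shutting-down` is invented for illustration and is not something the kubelet provides today:

```go
// Hypothetical: assumes a marker file that only exists while the node is
// shutting down, so a driver's SIGTERM handler could distinguish a node
// shutdown from an ordinary DaemonSet rolling update.
package main

import (
	"fmt"
	"os"
)

// nodeIsShuttingDown reports whether the (invented) marker file exists.
func nodeIsShuttingDown() bool {
	_, err := os.Stat("/var/lib/kubelet/shutting-down") // assumed path
	return err == nil
}

func main() {
	if nodeIsShuttingDown() {
		fmt.Println("node shutdown: wait for unmounts / do extra cleanup")
	} else {
		fmt.Println("likely a rolling update: exit promptly, leave mounts alone")
	}
}
```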
I'm sharing an idea that I posted internally; tl;dr: it's about adding a flag to the KubeletConfiguration that is honored in processShutdownEvent, which then does the following:
KubeletConfiguration example:

```yaml
kind: KubeletConfiguration
shutdownGracePeriodByPodPriority:
  - priority: 0
    shutdownGracePeriodSeconds: 10
    respectShutdownGracePeriod: true # new
  - priority: inf
    shutdownGracePeriodSeconds: 20
```

and the corresponding change to the loop in processShutdownEvent:

```go
loop:
	for {
		select {
		case <-doneCh:
			if group.RespectShutdownGracePeriod {
				// we don't react to the sync.WaitGroup finishing;
				// instead we wait for the timer to signal
				doneCh = nil // avoid spinning on the closed channel
				break
			}
			timer.Stop()
			break loop
		case <-timer.C():
			m.logger.V(1).Info("Shutdown manager pod killing time out", "gracePeriod", group.ShutdownGracePeriodSeconds, "priority", group.Priority)
			break loop
		}
	}
```
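As I read this sketch, when `respectShutdownGracePeriod` is true for a priority group, the kubelet waits out that group's full `shutdownGracePeriodSeconds` even if every pod in the group has already reported terminated, which leaves a window for asynchronous cleanup such as volume unmounts to complete before moving on to the next group.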
Once we add context cancellation, we should be much closer to terminating pods in deterministic time (right now a long syncPod can take longer than the graceful shutdown period), which means we could and should also allocate some time for other subsystems of the node to reach a terminal state. I'll keep that in mind when we look at components downstream of the pod worker pointing at pod worker state instead of pod manager.
This issue has not been updated in over 1 year, and should be re-triaged. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
What happened?
We have observed that after a node is preempted, pods using persistent volumes take 6+ minutes to restart. This happens when a volume was not able to be cleanly unmounted (which would normally be reflected by updating `Node.Status.VolumesInUse`), which causes the Attach/Detach controller (in kube-controller-manager) to wait for 6 minutes before issuing a force detach (a simplified sketch of that decision follows the list below). In addition to a long pod restart period, an unclean unmount could also cause data corruption.

From the logs we've observed the following non-ideal behaviors, which I think reduce to the same issue: we only check for containers being terminated, and not their volumes.
1. The CSI driver running on the node gets terminated before the workload's volumes get unmounted, which causes the unmount to fail. In our case, the CSI driver is in the `system-node-critical` priority class, but since we don't wait for volumes to be unmounted, we move on to shutting down the critical pods.
2. The OS gets shut down before volume teardown is completed, even if there is still time left in the `shutdownGracePeriod`. In our case, the `shutdownGracePeriod` is set to 30 seconds, yet we see that the node got shut down in 10 seconds, even though the unmount had not succeeded (see kubelet.log).
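For context on the 6-minute delay mentioned above, here is a simplified, illustrative sketch of the attach/detach controller's decision as described in this issue (the names and structure are invented for illustration, not the actual controller code):

```go
// Simplified illustration: if the node still lists the volume in
// Node.Status.VolumesInUse, the controller waits up to a fixed timeout
// before force-detaching.
package main

import (
	"fmt"
	"time"
)

const maxWaitForUnmount = 6 * time.Minute // the delay observed in this issue

func shouldForceDetach(volumeStillInUseOnNode bool, detachRequestedAt, now time.Time) bool {
	if !volumeStillInUseOnNode {
		return true // clean unmount already happened, safe to detach
	}
	// Unclean shutdown: the node never cleared VolumesInUse, so the
	// controller waits out the timeout before forcing the detach.
	return now.Sub(detachRequestedAt) > maxWaitForUnmount
}

func main() {
	requested := time.Now().Add(-10 * time.Second)
	fmt.Println(shouldForceDetach(true, requested, time.Now()))  // false: still waiting
	fmt.Println(shouldForceDetach(false, requested, time.Now())) // true: detach immediately
}
```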
Some code tracing:

kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go, line 326 in 37e73b4

It depends on `IsTerminated` to indicate that the pod has been killed:

kubernetes/pkg/kubelet/pod_workers.go, line 676 in 37e73b4

`IsTerminated` depends on `terminatedAt`, which is set when all containers have stopped:

kubernetes/pkg/kubelet/pod_workers.go, line 273 in 37e73b4

However, volume teardown for a Pod also only begins when all containers have stopped:

kubernetes/pkg/kubelet/volumemanager/populator/desired_state_of_world_populator.go, line 249 in 37e73b4
What did you expect to happen?
Node graceful shutdown to also wait for volumes to be torn down.
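One way to read "wait for volumes to be torn down" is a bounded polling loop inside the shutdown manager, gated on the remaining grace period. The sketch below is illustrative only; `mountedVolumeCount` is a stand-in for a query against the kubelet volume manager and not a real API:

```go
// Illustrative only: a polling loop the node shutdown manager could run
// after killing a priority group, before moving on to the next group or
// releasing the shutdown inhibitor.
package main

import (
	"fmt"
	"time"
)

// waitForVolumeTeardown polls until no volumes remain mounted for the killed
// pods, or until the deadline (the remaining grace period) expires.
func waitForVolumeTeardown(deadline time.Time, mountedVolumeCount func() int) bool {
	for time.Now().Before(deadline) {
		if mountedVolumeCount() == 0 {
			return true // all volumes unmounted within the grace period
		}
		time.Sleep(500 * time.Millisecond)
	}
	return false // grace period exhausted; proceed with shutdown anyway
}

func main() {
	remaining := 2
	ok := waitForVolumeTeardown(time.Now().Add(5*time.Second), func() int {
		if remaining > 0 {
			remaining-- // pretend unmounts complete over time
		}
		return remaining
	})
	fmt.Println("volumes torn down before deadline:", ok)
}
```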
How can we reproduce it (as minimally and precisely as possible)?
Deploy a StatefulSet pod using an attachable PVC, and then trigger node shutdown.
Anything else we need to know?
No response
Kubernetes version
Tested on 1.25, 1.26
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)