Running pods with devices are terminated if kubelet is restarted #118559
I was also able to reproduce it without KubeVirt. I used https://github.com/k8stopologyawareschedwg/sample-device-plugin The steps are:
My observation is that the original issue #109595 was focused on the node-reboot scenario, and PR #116376 fixed that. But when only the kubelet is restarted and some workloads are still running, device plugins may take time to initialize and report back healthy devices; meanwhile, the kubelet terminates the running pods because it treats them as unexpected admissions.
/sig node
/triage accepted
we have reproducers
/cc
I'm looking into this issue and will update shortly. At this point I can say that yes, #116376 made the devicemanager stricter and leads to this behavior. We may need another way to fix the inconsistency reported in #109595, and we will likely need to tighten up/fix the e2e tests about device plugins. But there could be a (partially?) mismatched expectation as well, because AFAIK the kubelet will run admission on restart (and in general on initialization) and thus may kill running pods. In other words, in addition to the follow-up fix, it would be beneficial to clarify whether running containers are or are not guaranteed to survive a kubelet restart.
From the user's point of view, I guess kubelet should keep the running pods. Kubelet can be restarted for various reasons, but IMHO that should not affect critical workloads. Hm... I thought that always trying to keep the pods was actually the intended behavior.
I agree this is desirable and expected behavior, if nothing else out of habit, and it is what the kubelet implementation did. However, the deeper I look, the less I'm sure it is a guaranteed behavior. There are well-known circumstances in which kubelet may reserve the option to kill running pods when it restarts, e.g. if the machine config changes. Granted, this is NOT the case reported here (nothing changed across the restart, hence we want a follow-up fix of some kind), but I'm convinced that clarifying the guarantees on kubelet restart should be part of the ongoing conversation.
Important to note in this discussion, regardless of how the kubelet should behave: please recognize that kubelet behaved like this for many years, and there might be a few projects out there that assume this behavior. Again: I'm not judging whether it's right or wrong, only pointing out that we would be "breaking userspace", if this were the kernel. And just like the kernel, we should avoid these things.
/cc
/cc
/cc
/priority critical-urgent
It's true we changed a long-established behavior, so let's find a consensus quickly about the resolution.
/cc @klueska (device manager) @SergeyKanzhelev @mrunalp @dchen1107 @derekwaynecarr (sig-node)
It sounds like this happens because the device admission plugin is still not able to authoritatively state whether a device is available at the time the restarted pod is run? We need to make some changes to admission generally to solve this case completely, but until we do, is it possible to have the admission plugin safely accept a pod that is a) never before seen by the device plugin and b) in the Running phase? Or are there other reasons why that has been tried and found not to work?
Thanks Clayton for chiming in.
The device manager is invoked as part of admission and performs allocation. Until the device plugins register themselves, the device manager indeed cannot provide authoritative answers about device availability; any answer is essentially wrong at this stage. For context: the current behaviour loudly breaks all pods on restart, while the previous one silently broke some pods.
This is one of the options I'm evaluating, and I'm willing to explore this path. My concern is that it will require the device manager to gain knowledge about running pods, asking the pod cache or the kuberuntime. If this is acceptable design-wise, I'll start to sketch an implementation we can iterate upon. I have another POC which lets the kubelet wait (up to a timeout) for device plugins to register themselves before it starts syncing pods; I'll upload it for reference very shortly.
At admission time on a restart you won't actually know whether the pod is running, because the kubelet itself won't know: admission on restart is the transition from "unknown state" to "should still be running". And the kubelet implementation is very naive: we assume admission is fast, reentrant, and cheap, none of which is completely true with the more complex admission requirements of the cpu manager, device manager, etc. I think asking the pod cache or kuberuntime directly would not be appropriate. However, at admission time you do know which other pods have previously been admitted (passed as an argument), and you also know which pods you've seen. So if you see a Running pod submitted to admission, and the device plugin internal state does not account for that pod's devices, you can treat it as "I previously admitted this" and accept it. If that requires significant changes to device manager internal state, it might be worth discussing what other changes to device manager might make sense.
This could be problematic in some cases, like starting static pods. Do device plugins typically consult control-plane resources during initialization? For some distros, the control plane runs in static pods, so that could potentially block restart in a failure scenario. If the device plugins are only checking local state, and registration is fast, that might be ok. However, I would say this is the wrong approach in general: we only want pods that need devices to be impacted by a slow device plugin. That argues that we should have a "no decision" option for admission, and then we simply keep retrying via the kubelet resync loop (HandlePodCleanups, which attempts to start pods that are in config but not running). I'm slightly leaning towards adding a "no decision, retry later" option because it should work pretty cleanly and be safe to backport to at least 1.27 (we made retry-later possible in 1.27, so that could be an input here). But it would be ok to have two separate fixes: one for 1.25/1.26 and one for 1.27+.
OK, time for me to ask some silly questions, because we're fast reaching the edge of my knowledge of the kubelet outside the resource managers :)
Thanks for pointing out, I'll play with this code a bit
Just to be sure, does this mean that admission plugins can trust the pod status they receive (e.g. is it up to date)? Otherwise, how can they detect that a pod is Running?
Typically AFAIK no, but we don't know for sure.
We can't however guarantee that, so it can break randomly; it seems to me this is a dealbreaker for this approach: too fragile, with known failure cases.
I concur
OK, but resource managers (device, cpu, ...) run as part of kubelet
Assuming the retry approach turns out clean, would invasive backports to 1.26/1.25 be needed to include the prerequisite work?
It appears to me that we have two options:
I understand Option 1 would require additional information to be passed to the device manager admission check; it would be confined to the kubelet and hence more focused on addressing this specific problem. On the other hand, Option 2 would not only help resolve this issue but also a broader set of issues, e.g. a topology-unaware scheduler can cause runaway pod creation, which could be another reason to pursue this option, in addition to backportability as mentioned above. @smarterclayton Could you please clarify a couple of questions based on your vision of the "no decision" option:
Perhaps decoupling admission and allocation could ease some of the pain here? I wonder what was the reason for coupling the two in the first place. It seems reasonable to re-admit pods on kubelet restart but I am not sure why resource allocation needs to happen again for already running pods. |
/cc |
When the kubelet initializes, it runs admission for pods and possibly allocates the requested resources. We need to distinguish between node reboot (no containers running) and kubelet restart (containers potentially running). Running pods should always survive a kubelet restart. This means that device allocation on admission should not be attempted, because if a container requires devices and is still running when the kubelet restarts, that container already has devices allocated and working. Thus, we need to properly detect this scenario in the allocation step and handle it explicitly, informing the devicemanager about which pods are already running. Note that if the container runtime is down when the kubelet restarts, the approach implemented here won't work: on kubelet restart, containers will again fail admission, hitting kubernetes#118559 again. This scenario should however be pretty rare. Signed-off-by: Francesco Romani <fromani@redhat.com>
One of the factors that makes issues kubernetes#118559 and kubernetes#109595 hard to debug and fix is that the devicemanager has very few logs in important flows, so it is unnecessarily hard to reconstruct the state from logs. We add minimal logging to improve troubleshooting while staying backport-friendly, deferring a more comprehensive review of logging to later PRs. Signed-off-by: Francesco Romani <fromani@redhat.com>
I plan to backport #118635 soon, starting with 1.27 and then down to 1.25 (included). |
What happened?
In the KubeVirt project, we now see a regression when running on Kubernetes 1.25.10 / 1.26.5 / 1.27.2. If kubelet is restarted on a node, all existing and running workloads that use devices are terminated with UnexpectedAdmissionError. KubeVirt runs virtual machines inside pods and uses a device plugin to advertise e.g. /dev/kvm on the nodes.
Presumably, this PR changed the behavior: #116376
Original issue: #109595
What did you expect to happen?
A potential restart of kubelet should not interrupt the running workloads.
How can we reproduce it (as minimally and precisely as possible)?
with KubeVirt:
pkill kubelet
or with https://github.com/k8stopologyawareschedwg/sample-device-plugin
Anything else we need to know?
No response
Kubernetes version
This affects the 1.25.x, 1.26.x and 1.27.x branches.
1.25.10 | 1.26.5 | 1.27.2
Cloud provider
N/A
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)