Running pods with devices are terminated if kubelet is restarted #118559

Closed
vasiliy-ul opened this issue Jun 8, 2023 · 28 comments · Fixed by #118635
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@vasiliy-ul

vasiliy-ul commented Jun 8, 2023

What happened?

In the KubeVirt project, we now see a regression when running on Kubernetes 1.25.10 | 1.26.5 | 1.27.2. If kubelet is restarted on a node, then all existing, running workloads that use devices are terminated with UnexpectedAdmissionError:

Warning  UnexpectedAdmissionError  45s   kubelet            Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/kvm, which is unexpected
Normal   Killing                   42s   kubelet            Stopping container compute

KubeVirt runs virtual machines inside pods and uses a device plugin to advertise e.g. /dev/kvm on the nodes.

Presumably, this PR changed the behavior: #116376
Original issue: #109595

What did you expect to happen?

A restart of kubelet should not interrupt running workloads.

How can we reproduce it (as minimally and precisely as possible)?

with KubeVirt:

  • run a KubeVirt VM
  • pkill kubelet
  • observe that the workload pod gets terminated

or with https://github.com/k8stopologyawareschedwg/sample-device-plugin

  • make deploy
  • make test-both
  • pkill kubelet
  • the pod gets restarted

Anything else we need to know?

No response

Kubernetes version

This affects the 1.25.x, 1.26.x and 1.27.x branches.

1.25.10 | 1.26.5 | 1.27.2

Cloud provider

N/A

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@vasiliy-ul vasiliy-ul added the kind/bug Categorizes issue or PR as related to a bug. label Jun 8, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 8, 2023
@vasiliy-ul
Author

I was also able to reproduce it without KubeVirt. I used https://github.com/k8stopologyawareschedwg/sample-device-plugin

The steps are:

  • make deploy
  • make test-both
  • pkill kubelet
Warning  UnexpectedAdmissionError  30s   kubelet            Allocate failed due to no healthy devices present; cannot allocate unhealthy devices example.com/deviceA, which is unexpected

@vasiliy-ul
Author

vasiliy-ul commented Jun 8, 2023

My observation is that the original issue #109595 was focused on the node reboot scenario, and PR #116376 fixed that. But when only kubelet is restarted and some workloads are still running, there is the problem that device plugins may take time to initialize and report back healthy devices. Meanwhile, kubelet terminates the running pods because it treats the failed re-admission as unexpected.
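
To make the race concrete, here is a deliberately simplified, hypothetical sketch (not the real devicemanager code) of the allocation check that fails while no device plugin has re-registered yet:

```go
// Simplified, hypothetical sketch of why re-admission fails right after a
// kubelet restart: the device plugin has not re-registered yet, so the manager
// sees zero healthy devices and rejects every pod that requests the resource.
package main

import "fmt"

type deviceManager struct {
	// resource name -> IDs of devices currently reported healthy by plugins
	healthyDevices map[string][]string
}

func (m *deviceManager) allocate(resource string, needed int) error {
	if len(m.healthyDevices[resource]) < needed {
		// Surfaces to the user as UnexpectedAdmissionError.
		return fmt.Errorf("no healthy devices present; cannot allocate unhealthy devices %s", resource)
	}
	return nil
}

func main() {
	// Right after a kubelet restart, no plugin has registered yet.
	m := &deviceManager{healthyDevices: map[string][]string{}}
	if err := m.allocate("devices.kubevirt.io/kvm", 1); err != nil {
		fmt.Println("admission rejected:", err) // the still-running pod gets killed
	}
}
```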

@vaibhav2107
Member

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 8, 2023
@ffromani
Contributor

ffromani commented Jun 8, 2023

/triage accepted

we have reproducers

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 8, 2023
@ffromani
Contributor

ffromani commented Jun 8, 2023

/cc

@ffromani
Contributor

ffromani commented Jun 8, 2023

I'm looking into this issue and I'll be updating shortly. At this point in time I can say that yes, #116376 made the devicemanager stricter and leads to this behavior. We may need another way to fix the inconsistency reported in #109595, and likely we will need to tighten up/fix the e2e tests about device plugins.

But there could be a (partially?) mismatched expectation as well, because AFAIK the kubelet will run admission on restart (and in general on initialization) and thus may kill running pods.

In other words, in addition to the follow-up fix, it would be beneficial to clarify whether running containers are, in general, guaranteed to survive a kubelet restart.

@vasiliy-ul
Author

In other words, in addition to the follow-up fix, it would be beneficial to clarify whether running containers are, in general, guaranteed to survive a kubelet restart.

From the user's point of view, I guess, kubelet should keep the running pods. Kubelet can be restarted for various reasons, but IMHO it should not affect critical workloads. Hm... I thought it was actually the intended behavior to always try to keep the pods.

@ffromani
Contributor

ffromani commented Jun 8, 2023

In other words, in addition to the follow-up fix, it would be beneficial to clarify whether running containers are, in general, guaranteed to survive a kubelet restart.

From the user's point of view, I guess, kubelet should keep the running pods. Kubelet can be restarted for various reasons, but IMHO it should not affect critical workloads. Hm... I thought it was actually the intended behavior to always try to keep the pods.

I agree this is a desirable and expected behavior, if nothing else out of habit. This is what the kubelet implementation did.

However, the deeper I look, the less I'm sure it is a guaranteed behavior.

There are well-known circumstances in which the kubelet may reserve the option to kill running pods when it restarts, e.g. if the machine config changes. Granted, this is NOT the case reported here (nothing changed across the restart, hence we want a follow-up fix of some kind), but I'm convinced that clarifying the guarantees around kubelet restart should be part of the ongoing conversation.

@fabiand
Contributor

fabiand commented Jun 8, 2023

Important to note in this discussion, regardless of how the kubelet should behave: please recognize the fact that kubelet has behaved like this for many years, and there might be a few projects out there that assume this behavior.

Again: I'm not judging whether it's right or wrong, only pointing out that we would be "breaking userspace" if this were the kernel. And just like the kernel, we should avoid these things.

@swatisehgal
Contributor

/cc

@rmohr
Contributor

rmohr commented Jun 8, 2023

/cc

@rthallisey

/cc

@ffromani
Contributor

ffromani commented Jun 9, 2023

/priority critical-urgent

it's true we changed a long-established behavior, so let's find a consensus quickly about the resolution.
I want to help with a follow-up change, and I'd rather the project not have to choose between this regression and regressing over #109595 (still very much real), so I'll be posting a POC to accommodate both needs shortly.

/cc @klueska (device manager) @SergeyKanzhelev @mrunalp @dchen1107 @derekwaynecarr (sig-node)

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 9, 2023
@smarterclayton
Contributor

  1. Running pods should always survive kubelet restart.
  2. Admission is re-run on every kubelet restart (it must, because the kubelet is stateless and we have coupled admission with allocation)
  3. It is the responsibility of every admission plugin to handle the scenario of kubelet restart correctly (by identifying when it can start making admission decisions)
  4. We probably lack all the tools to correctly handle admission + allocation, and we need to identify which ones to add.
  5. Admission is processed in a serial (and mostly random) order and therefore admission plugins cannot safely "block" until initialization is complete

It sounds like this is because the device admission plugin is still not able to authoritatively state whether a device is available at the time the restarted pod is run?

We need to make some changes to admission generally to solve this case completely, but until we do, is it possible to have the admission plugin safely accept a pod that is a) never before seen by the device plugin and b) in the running phase? Or are there other reasons why that has been tried and found not to work?

@ffromani
Contributor

ffromani commented Jun 9, 2023

Thanks Clayton for chiming in.

  1. Running pods should always survive kubelet restart.
  2. Admission is re-run on every kubelet restart (it must, because the kubelet is stateless and we have coupled admission with allocation)
  3. It is the responsibility of every admission plugin to handle the scenario of kubelet restart correctly (by identifying when it can start making admission decisions)
  4. We probably lack all the tools to correctly handle admission + allocation, and we need to identify which ones to add.
  5. Admission is processed in a serial (and mostly random) order and therefore admission plugins cannot safely "block" until initialization is complete

It sounds like this is because the device admission plugin is still not able to authoritatively state whether a device is available at the time the restarted pod is run?

The device manager is invoked as part of admission and it performs allocation. Up until the device plugins register themselves, the device manager indeed cannot provide authoritative answers about device availability. Any answer is essentially wrong at this stage. For context, the current behaviour loudly breaks all the pods on restart, while the previous one silently broke some pods.

We need to make some changes to admission generally to solve this case completely, but until we do, is it possible to have the admission plugin safely accept a pod that is a) never before seen by the device plugin and b) in the running phase? Or are there other reasons why that has been tried and found not to work?

This is one of the options I'm evaluating. I'm willing to explore this path; my concern is that it will require the device manager to gain knowledge about running pods, asking the pod cache or the kuberuntime. If this is acceptable design-wise, I'll start to sketch an implementation we can iterate upon.

I have another POC which is about letting the kubelet wait (up until a timeout) for device plugins to register themselves before it starts syncing pods; I'll upload it for reference very shortly.

@smarterclayton
Contributor

my concern is that it will require the device manager to gain knowledge about running pods, asking the pod cache or the kuberuntime

At admission time on a restart you won't actually know whether the pod is running, because the kubelet itself won't know, because admission on restart is the transition from "unknown state" to "should still be running". And the kubelet implementation is very naive - we assume admission is fast, reentrant, and cheap, none of which are completely true with the more complex admission requirements of cpu manager, device manager, etc.

I think asking pod cache or kuberuntime directly would not be appropriate. However, at admission time you do know what other pods have been previously admitted (passed as an argument), and you also know what pods you've seen. So if you see a Running pod submitted to admission, and the device plugin internal state does not account for the pod's devices, you can treat that as "I previously admitted this" and accept it. If that requires significant changes to device manager internal state, it might be worth discussing what other changes to device manager might make sense.
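
To illustrate the idea, a minimal hypothetical sketch of such a check (the type and field names are invented, not the real device manager internals):

```go
// Hypothetical sketch: during admission after a restart, a Running pod with no
// recorded device allocation is assumed to have been admitted by the previous
// kubelet instance and is accepted without re-running allocation.
package sketch

import v1 "k8s.io/api/core/v1"

type allocationState struct {
	// pod UID -> device IDs previously allocated (e.g. restored from a checkpoint file)
	allocated map[string][]string
}

func (s *allocationState) shouldSkipAllocation(pod *v1.Pod) bool {
	_, known := s.allocated[string(pod.UID)]
	// A Running pod we have never allocated for must have been admitted before
	// the restart; re-running allocation now would only produce a bogus rejection.
	return pod.Status.Phase == v1.PodRunning && !known
}
```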

I have another POC which is about letting the kubelet wait (up until a timeout) for device plugins to register themselves before it starts syncing pods; I'll upload it for reference very shortly.

This could be problematic in some cases, like starting static pods. Do device plugins typically consult control plane resources during initialization? For some distros, the control plane runs in static pods so that would potentially block restart in a failure scenario. If the device plugins are only checking local state, and registration is fast, that might be ok. However, I would say this is the wrong approach in general - we only want pods that need devices to be impacted by a slow device plugin. That argues that we should have a "no decision" option for admission, and then we simply keep retrying via the kubelet resync loop (HandlePodCleanups, which attempts to start pods that are in config but not running).

I'm slightly leaning towards adding a "no decision, retry later" option because it should work pretty cleanly and be safe to backport to at least 1.27 (we made retry-later possible in 1.27, so that could be an input here). But it would be ok to have two separate fixes - one for 1.25/1.26 and one for 1.27+.
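
A rough sketch of what such a three-valued admission outcome could look like (purely illustrative; the names are hypothetical and this is not an existing kubelet API):

```go
// Illustrative sketch of a "no decision, retry later" admission outcome.
package sketch

type admitDecision int

const (
	admit admitDecision = iota
	reject
	retryLater // plugin not initialized yet: leave the pod pending and let the resync loop retry
)

type admitResult struct {
	decision admitDecision
	reason   string
}

// A device-manager-style handler could return retryLater while no device
// plugin has registered, instead of a hard rejection that kills a running pod.
func admitWithDevices(pluginsRegistered, devicesAvailable bool) admitResult {
	switch {
	case !pluginsRegistered:
		return admitResult{decision: retryLater, reason: "device plugins not registered yet"}
	case !devicesAvailable:
		return admitResult{decision: reject, reason: "requested devices unavailable"}
	default:
		return admitResult{decision: admit}
	}
}
```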

@ffromani
Contributor

ffromani commented Jun 9, 2023

OK, time for me to ask some silly questions, because we're very quickly reaching the edge of my knowledge of the kubelet outside the resource managers :)

my concern is that it will require the device manager to gain knowledge about running pods, asking the pod cache or the kuberuntime

At admission time on a restart you won't actually know whether the pod is running, because the kubelet itself won't know, because admission on restart is the transition from "unknown state" to "should still be running". And the kubelet implementation is very naive - we assume admission is fast, reentrant, and cheap, none of which are completely true with the more complex admission requirements of cpu manager, device manager, etc.

I think asking pod cache or kuberuntime directly would not be appropriate. However, at admission time you do know what other pods have been previously admitted (passed as an argument), and you also know what pods you've seen.

Thanks for pointing that out, I'll play with this code a bit.

So if you see a Running pod submitted to admission, and the device plugin internal state does not account for the pod's devices, you can treat that as "I previously admitted this" and accept it. If that requires significant changes to device manager internal state, it might be worth discussing what other changes to device manager might make sense.

Just to be sure, does this mean that admission plugins can trust the pod status they receive (i.e., is it up to date)? Otherwise, how can they detect that a pod is Running?

I have another POC which is about letting the kubelet wait (up until a timeout) for device plugins to register themselves before it starts syncing pods; I'll upload it for reference very shortly.

This could be problematic in some cases, like starting static pods. Do device plugins typically consult control plane resources during initialization?

Typically AFAIK no, but we don't know for sure.

For some distros, the control plane runs in static pods so that would potentially block restart in a failure scenario. If the device plugins are only checking local state, and registration is fast, that might be ok.

We can't guarantee that, however, so it can break randomly. It seems to me this is a dealbreaker for this approach - too fragile, with known failure cases.

However, I would say this is the wrong approach in general - we only want pods that need devices to be impacted by a slow device plugin.

I concur

That argues that we should have a "no decision" option for admission, and then we simply keep retrying via the kubelet resync loop (HandlePodCleanups, which attempts to start pods that are in config but not running).

OK, but resource managers (device, cpu, ...) run as part of the kubelet admitHandlers (vs softAdmitHandlers), so their rejection is always final. Or is there a way to signal "dunno, retry later" from these handlers that I'm missing? (xref: https://github.com/kubernetes/kubernetes/blob/v1.28.0-alpha.2/pkg/kubelet/kubelet.go#LL1250-L1257 and https://github.com/kubernetes/kubernetes/blob/v1.28.0-alpha.2/pkg/kubelet/kubelet.go#L2572)
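
For reference, the admit handlers return roughly the following result type (paraphrased from pkg/kubelet/lifecycle); note that there is no way to express "no decision, retry later" in it today:

```go
// Roughly the shape of the result admit handlers return today: Admit is a
// plain boolean, so a rejection is final and there is no retry signal.
package lifecycle

// PodAdmitResult provides the result of a pod admission decision.
type PodAdmitResult struct {
	// Admit indicates whether the pod should be admitted.
	Admit bool
	// Reason is a brief single-word reason why the pod could not be admitted.
	Reason string
	// Message is a brief message explaining why the pod could not be admitted.
	Message string
}
```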

I'm slightly leaning towards adding a "no decision, retry later" option because it should work pretty cleanly and be safe to backport to at least 1.27 (we made retry-later possible in 1.27, so that could be an input here). But it would be ok to have two separate fixes - one for 1.25/1.26 and one for 1.27+.

Assuming the retry approach turns out clean, would invasive backports to 1.26/1.25 be needed to include the prerequisite work?

@swatisehgal
Contributor

swatisehgal commented Jun 9, 2023

It appears to me that we have two options:

  1. Make the device manager smart enough to maintain an internal state of the pods it previously admitted, and take that into consideration when running pods are re-admitted on kubelet restart.
  2. A "no decision" option for admission, with a retry later.

I understand Option 1 would require additional information to be passed to the device manager admission check and would be confined to the kubelet, hence more focused on addressing this specific problem. On the other hand, Option 2 would not only help resolve this issue but also a broader set of issues (e.g. the scheduler being topology-unaware can cause runaway pod creation), which could be another reason to pursue this option, in addition to the backportability mentioned above.

@smarterclayton Could you please clarify a couple of questions based on your vision of the "no decision" option:

  • How would we distinguish between the system being in a steady state, where pods are supposed to go through the admission flow as normal, and cases like kubelet restart, where we might want delayed/no-decision admission?
  • Is this going to be an opt-in capability at a pod level or a node level?

@swatisehgal
Contributor

Admission is re-run on every kubelet restart (it must, because the kubelet is stateless and we have coupled admission with allocation)

Perhaps decoupling admission and allocation could ease some of the pain here? I wonder what was the reason for coupling the two in the first place. It seems reasonable to re-admit pods on kubelet restart but I am not sure why resource allocation needs to happen again for already running pods.

@yanirq

yanirq commented Jun 9, 2023

/cc

ffromani added a commit to ffromani/kubernetes that referenced this issue Jul 13, 2023
When kubelet initializes, it runs admission for pods and possibly
allocates the requested resources. We need to distinguish between
a node reboot (no containers running) and a kubelet restart
(containers potentially running).

Running pods should always survive kubelet restart.
This means that device allocation on admission should not be attempted,
because if a container requires devices and is still running when kubelet
is restarting, that container already has devices allocated and working.

Thus, we need to properly detect this scenario in the allocation step
and handle it explicitly. We need to inform
the devicemanager about which pods are already running.

Note that if the container runtime is down when kubelet restarts, the
approach implemented here won't work. In that scenario, containers will
fail admission again on kubelet restart, hitting
kubernetes#118559 again.
This scenario should, however, be pretty rare.

Signed-off-by: Francesco Romani <fromani@redhat.com>
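
A hypothetical sketch of the approach the commit message above describes (names and signatures are invented for illustration; this is not the actual patch in #118635):

```go
// Hypothetical sketch: the kubelet tells the device manager which pods are
// already running, and the allocation step becomes a no-op for those pods.
package sketch

import "sync"

type runningPodsAwareManager struct {
	mu          sync.Mutex
	runningPods map[string]bool // UIDs of pods found running at kubelet startup
}

// UpdateRunningPods is a hypothetical hook the kubelet would call after listing
// containers from the runtime, before re-admitting the existing pods.
func (m *runningPodsAwareManager) UpdateRunningPods(uids []string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.runningPods = make(map[string]bool, len(uids))
	for _, uid := range uids {
		m.runningPods[uid] = true
	}
}

// Allocate skips device allocation for pods that were already running before
// the restart: their devices were allocated by the previous kubelet instance
// and are still attached to the running containers.
func (m *runningPodsAwareManager) Allocate(podUID string, allocateFn func() error) error {
	m.mu.Lock()
	running := m.runningPods[podUID]
	m.mu.Unlock()
	if running {
		return nil
	}
	return allocateFn()
}
```
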
ffromani added a commit to ffromani/kubernetes that referenced this issue Jul 13, 2023
One of the factors that makes issues kubernetes#118559 and kubernetes#109595
hard to debug and fix is that the devicemanager has very few logs in
important flows, so it's unnecessarily hard to reconstruct its state from
the logs.

We add minimal logging to improve troubleshooting. We keep the logging
minimal to stay backport-friendly, deferring a more comprehensive review
of logging to later PRs.

Signed-off-by: Francesco Romani <fromani@redhat.com>
SIG Node Bugs automation moved this from High Priority to Done Jul 15, 2023
@ffromani
Contributor

I plan to backport #118635 soon, starting with 1.27 and then down to 1.25 (inclusive).

ffromani added a commit to ffromani/kubernetes that referenced this issue Jul 19, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Jul 19, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023