Running pods with devices are terminated if kubelet is restarted #118559

Closed
vasiliy-ul opened this issue Jun 8, 2023 · 28 comments · Fixed by #118635
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@vasiliy-ul

vasiliy-ul commented Jun 8, 2023

What happened?

In the KubeVirt project, we now see a regression when running on Kubernetes 1.25.10 | 1.26.5 | 1.27.2. If kubelet is restarted on a node, then all existing, running workloads that use devices are terminated with UnexpectedAdmissionError:

Warning  UnexpectedAdmissionError  45s   kubelet            Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/kvm, which is unexpected
Normal   Killing                   42s   kubelet            Stopping container compute

KubeVirt runs virtual machines inside pods and uses a device plugin to advertise e.g. /dev/kvm on the nodes.

Presumably, this PR changed the behavior: #116376
Original issue: #109595

What did you expect to happen?

A restart of kubelet should not interrupt running workloads.

How can we reproduce it (as minimally and precisely as possible)?

with KubeVirt:

  • run a KubeVirt VM
  • pkill kubelet
  • observe that the workload pod gets terminated

or with https://github.com/k8stopologyawareschedwg/sample-device-plugin

  • make deploy
  • make test-both
  • pkill kubelet
  • the pod gets restarted

Anything else we need to know?

No response

Kubernetes version

This affects the 1.25.x, 1.26.x and 1.27.x branches.

1.25.10 | 1.26.5 | 1.27.2

Cloud provider

N/A

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@vasiliy-ul vasiliy-ul added the kind/bug Categorizes issue or PR as related to a bug. label Jun 8, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 8, 2023
@vasiliy-ul
Author

I was also able to reproduce it without KubeVirt. I used https://github.com/k8stopologyawareschedwg/sample-device-plugin

The steps are:

  • make deploy
  • make test-both
  • pkill kubelet
Warning  UnexpectedAdmissionError  30s   kubelet            Allocate failed due to no healthy devices present; cannot allocate unhealthy devices example.com/deviceA, which is unexpected

@vasiliy-ul
Author

vasiliy-ul commented Jun 8, 2023

My observation is that the original issue #109595 was focused on the node reboot scenario, and PR #116376 fixed that. But when only kubelet is restarted and some workloads are still running, there is the problem that device plugins may take time to initialize and report back healthy devices. Meanwhile, kubelet terminates the running pods because it treats the failed re-admission as unexpected.
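
To make the race concrete, here is a deliberately simplified, hypothetical sketch (not the real devicemanager code) of the allocation check that fails while no device plugin has re-registered yet:

```go
// Simplified, hypothetical sketch of why re-admission fails right after a
// kubelet restart: the device plugin has not re-registered yet, so the manager
// sees zero healthy devices and rejects every pod that requests the resource.
package main

import "fmt"

type deviceManager struct {
	// resource name -> IDs of devices currently reported healthy by plugins
	healthyDevices map[string][]string
}

func (m *deviceManager) allocate(resource string, needed int) error {
	if len(m.healthyDevices[resource]) < needed {
		// Surfaces to the user as UnexpectedAdmissionError.
		return fmt.Errorf("no healthy devices present; cannot allocate unhealthy devices %s", resource)
	}
	return nil
}

func main() {
	// Right after a kubelet restart, no plugin has registered yet.
	m := &deviceManager{healthyDevices: map[string][]string{}}
	if err := m.allocate("devices.kubevirt.io/kvm", 1); err != nil {
		fmt.Println("admission rejected:", err) // the still-running pod gets killed
	}
}
```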

@vaibhav2107
Member

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 8, 2023
@ffromani
Contributor

ffromani commented Jun 8, 2023

/triage accepted

we have reproducers

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 8, 2023
@ffromani
Contributor

ffromani commented Jun 8, 2023

/cc

@ffromani
Contributor

ffromani commented Jun 8, 2023

I'm looking into this issue and I'll be updating shortly. At this point in time I can say that yes, #116376 made the devicemanager stricter and leads to this behavior. We may need another way to fix the inconsistency reported in #109595, and likely we will need to tighten up/fix the e2e tests about device plugins.

But there could be a (partially?) mismatched expectation as well, because AFAIK the kubelet will run admission on restart (and in general on initialization) and thus may kill running pods.

In other words, in addition to the follow-up fix, it would be beneficial to clarify whether running containers are, in general, guaranteed to survive a kubelet restart.

@vasiliy-ul
Author

In other words, in addition to the follow-up fix, it would be beneficial to clarify whether running containers are, in general, guaranteed to survive a kubelet restart.

From the user's point of view, I guess, kubelet should keep the running pods. Kubelet can be restarted for various reasons, but IMHO it should not affect critical workloads. Hm... I thought it was actually the intended behavior to always try to keep the pods.

@ffromani
Contributor

ffromani commented Jun 8, 2023

In other words, in addition to the follow-up fix, it would be beneficial to clarify whether running containers are, in general, guaranteed to survive a kubelet restart.

From the user's point of view, I guess, kubelet should keep the running pods. Kubelet can be restarted for various reasons, but IMHO it should not affect critical workloads. Hm... I thought it was actually the intended behavior to always try to keep the pods.

I agree this is a desirable and expected behavior, if nothing else out of habit. This is what the kubelet implementation did.

However, the deeper I look, the less I'm sure it is a guaranteed behavior.

There are well-known circumstances in which the kubelet may reserve the option to kill running pods when it restarts, e.g. if the machine config changes. Granted, this is NOT the case reported here (nothing changed across the restart, hence we want a follow-up fix of some kind), but I'm convinced that clarifying the guarantees around kubelet restart should be part of the ongoing conversation.

@fabiand
Contributor

fabiand commented Jun 8, 2023

Important to note in this discussion, regardless of how the kubelet should behave: please recognize the fact that kubelet has behaved like this for many years, and there might be a few projects out there that assume this behavior.

Again: I'm not judging whether it's right or wrong, only pointing out that we would be "breaking userspace" if this were the kernel. And just like the kernel, we should avoid these things.

@swatisehgal
Contributor

/cc

@rmohr
Contributor

rmohr commented Jun 8, 2023

/cc

@rthallisey

/cc

@ffromani
Contributor

ffromani commented Jun 9, 2023

/priority critical-urgent

it's true we changed a long-established behavior, so let's find a consensus quickly about the resolution.
I want to help with a follow-up change, and I'd rather the project not have to choose between this regression and regressing over #109595 (still very much real), so I'll be posting a POC to accommodate both needs shortly.

/cc @klueska (device manager) @SergeyKanzhelev @mrunalp @dchen1107 @derekwaynecarr (sig-node)

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 9, 2023
@smarterclayton
Contributor

  1. Running pods should always survive kubelet restart.
  2. Admission is re-run on every kubelet restart (it must, because the kubelet is stateless and we have coupled admission with allocation)
  3. It is the responsibility of every admission plugin to handle the scenario of kubelet restart correctly (by identifying when it can start making admission decisions)
  4. We probably lack all the tools to correctly handle admission + allocation, and we need to identify which ones to add.
  5. Admission is processed in a serial (and mostly random) order and therefore admission plugins cannot safely "block" until initialization is complete

It sounds like this is because the device admission plugin is still not able to authoritatively state whether a device is available at the time the restarted pod is run?

We need to make some changes to admission generally to solve this case completely, but until we do, is it possible to have the admission plugin safely accept a pod that is a) never before seen by the device plugin and b) in the running phase? Or are there other reasons why that has been tried and found not to work?

@ffromani
Contributor

ffromani commented Jun 9, 2023

Thanks Clayton for chiming in.

  1. Running pods should always survive kubelet restart.
  2. Admission is re-run on every kubelet restart (it must, because the kubelet is stateless and we have coupled admission with allocation)
  3. It is the responsibility of every admission plugin to handle the scenario of kubelet restart correctly (by identifying when it can start making admission decisions)
  4. We probably lack all the tools to correctly handle admission + allocation, and we need to identify which ones to add.
  5. Admission is processed in a serial (and mostly random) order and therefore admission plugins cannot safely "block" until initialization is complete

It sounds like this is because the device admission plugin is still not able to authoritatively state whether a device is available at the time the restarted pod is run?

The device manager is invoked as part of admission and it performs allocation. Up until the device plugins register themselves, the device manager indeed cannot provide authoritative answers about device availability. Any answer is essentially wrong at this stage. For context, the current behaviour loudly breaks all the pods on restart, while the previous one silently broke some pods.

We need to make some changes to admission generally to solve this case completely, but until we do, is it possible to have the admission plugin safely accept a pod that is a) never before seen by the device plugin and b) in the running phase? Or are there other reasons why that has been tried and found not to work?

This is one of the options I'm evaluating. I'm willing to explore this path; my concern is that it will require the device manager to gain knowledge about running pods, asking the pod cache or the kuberuntime. If this is acceptable design-wise, I'll start to sketch an implementation we can iterate upon.

I have another POC which is about letting the kubelet wait (up until a timeout) for device plugins to register themselves before it starts syncing pods; I'll upload it for reference very shortly.

@smarterclayton
Contributor

my concern is that it will require the device manager to gain knowledge about running pods, asking the pod cache or the kuberuntime

At admission time on a restart you won't actually know whether the pod is running, because the kubelet itself won't know, because admission on restart is the transition from "unknown state" to "should still be running". And the kubelet implementation is very naive - we assume admission is fast, reentrant, and cheap, none of which are completely true with the more complex admission requirements of cpu manager, device manager, etc.

I think asking pod cache or kuberuntime directly would not be appropriate. However, at admission time you do know what other pods have been previously admitted (passed as an argument), and you also know what pods you've seen. So if you see a Running pod submitted to admission, and the device plugin internal state does not account for the pod's devices, you can treat that as "I previously admitted this" and accept it. If that requires significant changes to device manager internal state, it might be worth discussing what other changes to device manager might make sense.
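
To illustrate the idea, a minimal hypothetical sketch of such a check (the type and field names are invented, not the real device manager internals):

```go
// Hypothetical sketch: during admission after a restart, a Running pod with no
// recorded device allocation is assumed to have been admitted by the previous
// kubelet instance and is accepted without re-running allocation.
package sketch

import v1 "k8s.io/api/core/v1"

type allocationState struct {
	// pod UID -> device IDs previously allocated (e.g. restored from a checkpoint file)
	allocated map[string][]string
}

func (s *allocationState) shouldSkipAllocation(pod *v1.Pod) bool {
	_, known := s.allocated[string(pod.UID)]
	// A Running pod we have never allocated for must have been admitted before
	// the restart; re-running allocation now would only produce a bogus rejection.
	return pod.Status.Phase == v1.PodRunning && !known
}
```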

I have another POC which is about letting the kubelet wait (up until a timeout) for device plugins to register themselves before it starts syncing pods; I'll upload it for reference very shortly.

This could be problematic in some cases, like starting static pods. Do device plugins typically consult control plane resources during initialization? For some distros, the control plane runs in static pods so that would potentially block restart in a failure scenario. If the device plugins are only checking local state, and registration is fast, that might be ok. However, I would say this is the wrong approach in general - we only want pods that need devices to be impacted by a slow device plugin. That argues that we should have a "no decision" option for admission, and then we simply keep retrying via the kubelet resync loop (HandlePodCleanups, which attempts to start pods that are in config but not running).

I'm slightly leaning towards adding a "no decision, retry later" option because it should work pretty cleanly and be safe to backport to at least 1.27 (we made retry-later possible in 1.27, so that could be an input here). But it would be ok to have two separate fixes - one for 1.25/1.26 and one for 1.27+.
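
A rough sketch of what such a three-valued admission outcome could look like (purely illustrative; the names are hypothetical and this is not an existing kubelet API):

```go
// Illustrative sketch of a "no decision, retry later" admission outcome.
package sketch

type admitDecision int

const (
	admit admitDecision = iota
	reject
	retryLater // plugin not initialized yet: leave the pod pending and let the resync loop retry
)

type admitResult struct {
	decision admitDecision
	reason   string
}

// A device-manager-style handler could return retryLater while no device
// plugin has registered, instead of a hard rejection that kills a running pod.
func admitWithDevices(pluginsRegistered, devicesAvailable bool) admitResult {
	switch {
	case !pluginsRegistered:
		return admitResult{decision: retryLater, reason: "device plugins not registered yet"}
	case !devicesAvailable:
		return admitResult{decision: reject, reason: "requested devices unavailable"}
	default:
		return admitResult{decision: admit}
	}
}
```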

@ffromani
Contributor

ffromani commented Jun 9, 2023

OK, time for me to ask some silly questions, because we're very quickly reaching the edge of my knowledge of the kubelet outside the resource managers :)

my concern is that it will require the device manager to gain knowledge about running pods, asking the pod cache or the kuberuntime

At admission time on a restart you won't actually know whether the pod is running, because the kubelet itself won't know, because admission on restart is the transition from "unknown state" to "should still be running". And the kubelet implementation is very naive - we assume admission is fast, reentrant, and cheap, none of which are completely true with the more complex admission requirements of cpu manager, device manager, etc.

I think asking pod cache or kuberuntime directly would not be appropriate. However, at admission time you do know what other pods have been previously admitted (passed as an argument), and you also know what pods you've seen.

Thanks for pointing that out, I'll play with this code a bit.

So if you see a Running pod submitted to admission, and the device plugin internal state does not account for the pod's devices, you can treat that as "I previously admitted this" and accept it. If that requires significant changes to device manager internal state, it might be worth discussing what other changes to device manager might make sense.

Just to be sure, does this mean that admission plugins can trust the pod status they receive (i.e., is it up to date)? Otherwise, how can they detect that a pod is Running?

I have another POC which is about letting the kubelet wait (up until a timeout) for device plugins to register themselves before it starts syncing pods; I'll upload it for reference very shortly.

This could be problematic in some cases, like starting static pods. Do device plugins typically consult control plane resources during initialization?

Typically AFAIK no, but we don't know for sure.

For some distros, the control plane runs in static pods so that would potentially block restart in a failure scenario. If the device plugins are only checking local state, and registration is fast, that might be ok.

We can't guarantee that, however, so it can break randomly. It seems to me this is a dealbreaker for this approach - too fragile, with known failure cases.

However, I would say this is the wrong approach in general - we only want pods that need devices to be impacted by a slow device plugin.

I concur

That argues that we should have a "no decision" option for admission, and then we simply keep retrying via the kubelet resync loop (HandlePodCleanups, which attempts to start pods that are in config but not running).

OK, but resource managers (device, cpu, ...) run as part of the kubelet admitHandlers (vs softAdmitHandlers), so their rejection is always final. Or is there a way to signal "dunno, retry later" from these handlers that I'm missing? (xref: https://github.com/kubernetes/kubernetes/blob/v1.28.0-alpha.2/pkg/kubelet/kubelet.go#LL1250-L1257 and https://github.com/kubernetes/kubernetes/blob/v1.28.0-alpha.2/pkg/kubelet/kubelet.go#L2572)
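
For reference, the admit handlers return roughly the following result type (paraphrased from pkg/kubelet/lifecycle); note that there is no way to express "no decision, retry later" in it today:

```go
// Roughly the shape of the result admit handlers return today: Admit is a
// plain boolean, so a rejection is final and there is no retry signal.
package lifecycle

// PodAdmitResult provides the result of a pod admission decision.
type PodAdmitResult struct {
	// Admit indicates whether the pod should be admitted.
	Admit bool
	// Reason is a brief single-word reason why the pod could not be admitted.
	Reason string
	// Message is a brief message explaining why the pod could not be admitted.
	Message string
}
```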

I'm slightly leaning towards adding a "no decision, retry later" option because it should work pretty cleanly and be safe to backport to at least 1.27 (we made retry-later possible in 1.27, so that could be an input here). But it would be ok to have two separate fixes - one for 1.25/1.26 and one for 1.27+.

Assuming the retry approach turns out clean, would invasive backports to 1.26/1.25 be needed to include the prerequisite work?

@swatisehgal
Contributor

swatisehgal commented Jun 9, 2023

It appears to me that we have two options:

  1. Make the device manager smart enough to maintain an internal state of the pods it previously admitted, and take that into consideration when running pods are re-admitted on kubelet restart.
  2. A "no decision" option for admission, with a retry later.

I understand Option 1 would require additional information to be passed to the device manager admission check and would be confined to the kubelet, hence more focused on addressing this specific problem. On the other hand, Option 2 would not only help resolve this issue but also a broader set of issues (e.g. the scheduler being topology-unaware can cause runaway pod creation), which could be another reason to pursue this option, in addition to the backportability mentioned above.

@smarterclayton Could you please clarify a couple of questions based on your vision of the "no decision" option:

  • How would we distinguish between the system being in a steady state, where pods are supposed to go through the admission flow as normal, and cases like kubelet restart, where we might want delayed/no-decision admission?
  • Is this going to be an opt-in capability at a pod level or a node level?

@swatisehgal
Contributor

Admission is re-run on every kubelet restart (it must, because the kubelet is stateless and we have coupled admission with allocation)

Perhaps decoupling admission and allocation could ease some of the pain here? I wonder what was the reason for coupling the two in the first place. It seems reasonable to re-admit pods on kubelet restart but I am not sure why resource allocation needs to happen again for already running pods.

@yanirq

yanirq commented Jun 9, 2023

/cc

ffromani added a commit to ffromani/kubernetes that referenced this issue Jul 13, 2023
When kubelet initializes, it runs admission for pods and possibly
allocates the requested resources. We need to distinguish between
a node reboot (no containers running) and a kubelet restart
(containers potentially running).

Running pods should always survive kubelet restart.
This means that device allocation on admission should not be attempted,
because if a container requires devices and is still running when kubelet
is restarting, that container already has devices allocated and working.

Thus, we need to properly detect this scenario in the allocation step
and handle it explicitly. We need to inform
the devicemanager about which pods are already running.

Note that if the container runtime is down when kubelet restarts, the
approach implemented here won't work. In that scenario, containers will
fail admission again on kubelet restart, hitting
kubernetes#118559 again.
This scenario should, however, be pretty rare.

Signed-off-by: Francesco Romani <fromani@redhat.com>
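
A hypothetical sketch of the approach the commit message above describes (names and signatures are invented for illustration; this is not the actual patch in #118635):

```go
// Hypothetical sketch: the kubelet tells the device manager which pods are
// already running, and the allocation step becomes a no-op for those pods.
package sketch

import "sync"

type runningPodsAwareManager struct {
	mu          sync.Mutex
	runningPods map[string]bool // UIDs of pods found running at kubelet startup
}

// UpdateRunningPods is a hypothetical hook the kubelet would call after listing
// containers from the runtime, before re-admitting the existing pods.
func (m *runningPodsAwareManager) UpdateRunningPods(uids []string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.runningPods = make(map[string]bool, len(uids))
	for _, uid := range uids {
		m.runningPods[uid] = true
	}
}

// Allocate skips device allocation for pods that were already running before
// the restart: their devices were allocated by the previous kubelet instance
// and are still attached to the running containers.
func (m *runningPodsAwareManager) Allocate(podUID string, allocateFn func() error) error {
	m.mu.Lock()
	running := m.runningPods[podUID]
	m.mu.Unlock()
	if running {
		return nil
	}
	return allocateFn()
}
```
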
ffromani added a commit to ffromani/kubernetes that referenced this issue Jul 13, 2023
One of the factors that makes issues kubernetes#118559 and kubernetes#109595
hard to debug and fix is that the devicemanager has very few logs in
important flows, so it's unnecessarily hard to reconstruct its state from
the logs.

We add minimal logging to improve troubleshooting. We keep the logging
minimal to stay backport-friendly, deferring a more comprehensive review
of logging to later PRs.

Signed-off-by: Francesco Romani <fromani@redhat.com>
SIG Node Bugs automation moved this from High Priority to Done Jul 15, 2023
@ffromani
Contributor

I plan to backport #118635 soon, starting with 1.27 and then down to 1.25 (inclusive).

ffromani added a commit to ffromani/kubernetes that referenced this issue Jul 19, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Jul 19, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023
ffromani added a commit to ffromani/kubernetes that referenced this issue Aug 8, 2023