
Bug: VMs have low-priority CPU use, distributed equally under load (not based on VM size!) #591

Closed
sharnoff opened this issue Oct 30, 2023 · 0 comments · Fixed by #738
Labels
c/autoscaling/neonvm Component: autoscaling: NeonVM t/bug Issue Type: Bug

Comments

@sharnoff
Member

This issue is largely adapted from a Slack message from a couple of months ago.

As of 2023-10-29, this is not urgent. It is a theoretical issue that is highly unlikely to occur, but should be considered in light of certain planned changes.

Background

Kubernetes has "Quality of Service" (QoS) classifications for pods. The QoS given to a pod is determined exclusively by the resources.requests and resources.limits used (or not) for its containers.

For the most part, QoS is descriptive, rather than prescriptive. The QoS classes are roughly:

  1. "Guaranteed" — all containers have cpu/memory requests equal to limits
  2. "Burstable" — at least one container has cpu or memory requests or limits
  3. "BestEffort" — no containers have CPU/memory requests or limits

As a refresher, Kubernetes guarantees that any "requested" resources are always available to the container, should it need them.
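Kubernetes records the class it assigned in each pod's status, so it's easy to double-check. For example (the pod name here is a placeholder, and the output is whichever class applies):

$ kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
Guaranteed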

What's actually going on becomes clearer if we look at how it's implemented in EKS (at least as of 2023-10-29, on k8s 1.25).

Cgroups are fundamental to the implementation of containers, so we can learn a lot about how certain Kubernetes features are implemented by looking at how they affect the relevant cgroups. If we SSH into a Kubernetes node and run ls /sys/fs/cgroup/cpu/kubepods.slice, we might see something like:

$ ls /sys/fs/cgroup/cpu/kubepods.slice
cgroup.clone_children  cpuacct.usage_percpu_sys   cpu.rt_period_us           kubepods-pod1035a83e_09cb_46e0_b56b_0ba59af9465c.slice  notify_on_release
cgroup.procs           cpuacct.usage_percpu_user  cpu.rt_runtime_us          kubepods-pod17f65edb_b99c_4530_8d93_3ff62c64f50f.slice  tasks
cpuacct.stat           cpuacct.usage_sys          cpu.shares                 kubepods-pod22bf9367_60ee_4481_a762_addb5f83a056.slice
cpuacct.usage          cpuacct.usage_user         cpu.stat                   kubepods-pod7bbd817f_b4d6_4e6f_8ac1_50bc6e270a2d.slice
cpuacct.usage_all      cpu.cfs_period_us          kubepods-besteffort.slice  kubepods-pod7daf1144_d3b5_4531_9a59_221f96a36a2b.slice
cpuacct.usage_percpu   cpu.cfs_quota_us           kubepods-burstable.slice   kubepods-podcb6dc9c2_eda7_4e8a_8e4a_948169e57cdc.slice

There are a few things of note here:

  1. The kubepods-besteffort.slice cgroup
  2. The kubepods-burstable.slice cgroup
  3. The individual kubepods-pod<UUID> cgroups

The pods are relatively easy to track down: the UUID is just the UID of the pod; we can find the last one, for example, with:

$ kubectl get pod -A -o json | jq -r '.items[] | select(.metadata.uid == "cb6dc9c2-eda7-4e8a-8e4a-948169e57cdc") | "\(.metadata.namespace)/\(.metadata.name)"'
overprovisioning/overprovisioning-6f7d78c958-2228t

These are pods with a "Guaranteed" QoS.

The other cgroups mentioned are relatively self-explanatory. Inside them, we can see a similar situation: the "burstable" cgroup is full of kubepods-burstable-pod<UUID> cgroups, and the "besteffort" cgroup is full of kubepods-besteffort-pod<UUID> cgroups.
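As a cross-check, Kubernetes also records the assigned QoS class in each pod's status, so we can list, say, every BestEffort pod directly (the exact output will of course vary by cluster):

$ kubectl get pod -A -o json | jq -r '.items[] | select(.status.qosClass == "BestEffort") | "\(.metadata.namespace)/\(.metadata.name)"'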

VMs get the "BestEffort" QoS because we can't set resources.requests (changing a pod's resources.requests is only supported on k8s 1.27+). So: what cgroup settings are used for BestEffort pods?

The relevant setting for resources.requests is the cpu.shares setting, which determines how the kernel's scheduler assigns CPU time if there isn't enough to go around. Basically, if the host is at 100% CPU usage, then all cgroups with runnable processes will get CPU time proportional to their number of cpu.shares, relative to whatever the total count is.
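To make that concrete: if, during contention, only two cgroups have runnable work, one with 2048 shares and one with 1024, the first gets roughly 2048 / (2048 + 1024) ≈ 67% of the CPU time and the second roughly 33%.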

If we look at an arbitrary "Guaranteed" pod (say, the same one as before), we can see that if it requests 2 CPUs:

$ kubectl get pod -n overprovisioning overprovisioning-6f7d78c958-2228t -o jsonpath='{.spec.containers[*].resources.requests.cpu}'
2

... then it gets 2048 CPU shares:

$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-podcb6dc9c2_eda7_4e8a_8e4a_948169e57cdc.slice/cpu.shares
2048

So we'd expect about 1 CPU share for each 1m of CPU in resources.requests.
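(That's consistent with shares being derived as roughly milliCPU × 1024 / 1000: 2000m → 2048 shares here, and 270m → 276 shares in the Burstable example below.)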

If we look at some pods with the "Burstable" QoS, we see roughly the same relationship:

$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod050b4a85_f401_4ddb_a585_3d2120ee0132.slice/cpu.shares
276
$ kubectl get pod -A -o json | jq -r '.items[] | select(.metadata.uid == "050b4a85-f401-4ddb-a585-3d2120ee0132") | "\(.metadata.namespace)/\(.metadata.name)"'
default/compute-orange-recipe-53139292-5d85955d54-5hk7x
$ kubectl get pod compute-orange-recipe-53139292-5d85955d54-5hk7x -o jsonpath='{.spec.containers[*].resources.requests.cpu}'
250m 10m 10m

We can also see that the total number of shares given to the cgroup containing the "Burstable" pods is approximately equal to the sum of the shares inside it (i.e. there's no dilution):

$ cd /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/
$ cat cpu.shares
32076
$ # Sum of all cpu.shares for child cgroups:
$ sh -c 'for c in kubepods-burstable-*; do cat $c/cpu.shares; done' | awk '{s+=$1} END {print s}'
32052

Unfortunately, there are no such guarantees for BestEffort pods. This makes sense, though: after all, they haven't asked for any resources to be guaranteed!

If we look at the cgroup containing the BestEffort pods, we can see that it gets just 2 CPU shares! Essentially the equivalent of 0.002 CPUs guaranteed, split between all BestEffort pods.

$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-besteffort.slice/cpu.shares
2

And within that cgroup, all of the BestEffort pods are given an equal number of CPU shares:

$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-besteffort.slice/kubepods-*/cpu.shares | sort | uniq
2

Current state of affairs

What all of that background means is this:

If a Kubernetes node is under high CPU load, there are basically no guarantees that any VM will get CPU time (aside from that minuscule amount), and any CPU time allotted to VMs as a whole will be split equally between them, regardless of their size.
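For a rough sense of scale: on the node above, the Burstable slice alone held ~32,000 shares, so under full contention the 2 shares for the entire BestEffort slice amount to well under 0.01% of host CPU time, shared across every VM on the node.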

So:

  1. High CPU usage by any non-VM pod can starve VMs
  2. CPU starvation will result in bigger users being disproportionately throttled

Note: This is not currently an issue, because our node CPU usage is so low. Once we implement overcommit (see #517), this may change and we may want to resolve this before we have an incident because of it.

Possible solutions

There are a variety of ideas here, ranging in scope and impact. Here's an assortment:

  1. Upgrade to k8s 1.27 and accurately set resources.requests on VM runner pods according to their spec.guest.{cpus,memorySlots}.use values
  2. Set resources.requests according to the minimum values in the VM's spec
  3. Revert #441 (neonvm/runner: Create QEMU cgroup inside current one), moving the QEMU cgroup back out of kubepods.slice, and set cpu.shares appropriately
  4. Ditch kubernetes entirely 😛
sharnoff added the t/bug (Issue Type: Bug) and c/autoscaling/neonvm (Component: autoscaling: NeonVM) labels on Oct 30, 2023
sharnoff added a commit that referenced this issue Jan 15, 2024
We recently realized[^1] that under cgroups v2, Kubernetes uses cgroup
namespaces, which has a few effects:

1. The output of /proc/self/cgroup shows as if the container were at the
   root of the hierarchy
2. It's very difficult for us to determine the actual cgroup that the
   container corresponds to on the host
3. We still can't directly create a cgroup in the container's namespace
   because /sys/fs/cgroup is mounted read-only

So, neonvm-runner currently *does not* work as expected with cgroups v2;
it creates a new cgroup for the VM, at the top of the hierarchy, and
doesn't clean it up on exit.
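
For reference, this is roughly what the namespaced view typically looks
like from inside such a container under cgroups v2 (output illustrative):

$ cat /proc/self/cgroup
0::/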

How do we fix this? The aim of this PR is to remove the special cgroup
handling entirely, and "just" go through the Container Runtime Interface
(CRI) exposed by containerd to modify the existing container we're
running in.

This requires access to /run/containerd/containerd.sock, which a
malicious user could use to perform privileged operations on the host
(or in any other container on the host).
Obviously we'd like to prevent that as much as possible, so the CPU
handling now runs alongside neonvm-runner as a separate container.
neonvm-runner does not have access to the containerd socket.

On the upside, one key benefit we get from this is being able to set cpu
shares, the abstraction underlying container resources.requests.
The other options weren't looking so great[^2], so if this works, this
would be a nice compromise.

[^1]: https://neondb.slack.com/archives/C03TN5G758R/p1705092611188719
[^2]: #591
sharnoff added a commit that referenced this issue Feb 2, 2024
NB: This PR is conditionally enabled via the --enable-container-mgr flag
on neonvm-controller. There are no effects without that.

---

We recently realized[^1] that under cgroups v2, Kubernetes uses cgroup
namespaces, which has a few effects:

1. The output of /proc/self/cgroup shows as if the container were at the
   root of the hierarchy
2. It's very difficult for us to determine the actual cgroup that the
   container corresponds to on the host
3. We still can't directly create a cgroup in the container's namespace
   because /sys/fs/cgroup is mounted read-only

So, neonvm-runner currently *does not* work as expected with cgroups v2;
it creates a new cgroup for the VM, at the top of the hierarchy, and
doesn't clean it up on exit.

How do we fix this? The aim of this PR is to remove the special cgroup
handling entirely, and "just" go through the Container Runtime Interface
(CRI) exposed by containerd to modify the existing container we're
running in.

This requires access to /run/containerd/containerd.sock, which a
malicious user could use to perform privileged operations on the host
(or in any other container on the host).
Obviously we'd like to prevent that as much as possible, so the CPU
handling now runs alongside neonvm-runner as a separate container.
neonvm-runner does not have access to the containerd socket.

On the upside, one key benefit we get from this is being able to set cpu
shares, the abstraction underlying container resources.requests.
The other options weren't looking so great[^2], so if this works, this
would be a nice compromise.

[^1]: https://neondb.slack.com/archives/C03TN5G758R/p1705092611188719
[^2]: #591