
Bug: VMs have low-priority CPU use, distributed equally under load (not based on VM size!) #591

Closed
sharnoff opened this issue Oct 30, 2023 · 0 comments · Fixed by #738
Labels
c/autoscaling/neonvm Component: autoscaling: NeonVM t/bug Issue Type: Bug

Comments

@sharnoff
Member

This issue is largely adapted from a Slack message from a couple of months ago.

As of 2023-10-29, this is not urgent. It is a theoretical issue that is highly unlikely to occur, but should be considered in light of certain planned changes.

Background

Kubernetes has "Quality of Service" (QoS) classifications for pods. The QoS given to a pod is determined exclusively by the resources.requests and resources.limits used (or not) for its containers.

For the most part, QoS is descriptive, rather than prescriptive. The QoS classes are roughly:

  1. "Guaranteed" — all containers have cpu/memory requests equal to limits
  2. "Burstable" — at least one container has cpu or memory requests or limits
  3. "BestEffort" — no containers have CPU/memory requests or limits

As a refresher, Kubernetes guarantees that any "requested" resources are always available to the container, should it need them.
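Kubernetes records the class it assigned in each pod's status, so it's easy to double-check. For example (the pod name here is a placeholder, and the output is whichever class applies):

$ kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
Guaranteed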

What's actually going on becomes clearer if we look at how it's implemented in EKS (at least as of 2023-10-29, on k8s 1.25).

Cgroups are fundamental to the implementation of containers, so we can learn a lot about how certain Kubernetes features are implemented by looking at how they affect the relevant cgroups. If we SSH into a Kubernetes node and run ls /sys/fs/cgroup/cpu/kubepods.slice, we might see something like:

$ ls /sys/fs/cgroup/cpu/kubepods.slice
cgroup.clone_children  cpuacct.usage_percpu_sys   cpu.rt_period_us           kubepods-pod1035a83e_09cb_46e0_b56b_0ba59af9465c.slice  notify_on_release
cgroup.procs           cpuacct.usage_percpu_user  cpu.rt_runtime_us          kubepods-pod17f65edb_b99c_4530_8d93_3ff62c64f50f.slice  tasks
cpuacct.stat           cpuacct.usage_sys          cpu.shares                 kubepods-pod22bf9367_60ee_4481_a762_addb5f83a056.slice
cpuacct.usage          cpuacct.usage_user         cpu.stat                   kubepods-pod7bbd817f_b4d6_4e6f_8ac1_50bc6e270a2d.slice
cpuacct.usage_all      cpu.cfs_period_us          kubepods-besteffort.slice  kubepods-pod7daf1144_d3b5_4531_9a59_221f96a36a2b.slice
cpuacct.usage_percpu   cpu.cfs_quota_us           kubepods-burstable.slice   kubepods-podcb6dc9c2_eda7_4e8a_8e4a_948169e57cdc.slice

There are a few things of note here:

  1. The kubepods-besteffort.slice cgroup
  2. The kubepods-burstable.slice cgroup
  3. The individual kubepods-pod<UUID> cgroups

The pods are relatively easy to track down: the UUID is just the UID of the pod; we can find the last one, for example, with:

$ kubectl get pod -A -o json | jq -r '.items[] | select(.metadata.uid == "cb6dc9c2-eda7-4e8a-8e4a-948169e57cdc") | "\(.metadata.namespace)/\(.metadata.name)"'
overprovisioning/overprovisioning-6f7d78c958-2228t

These are pods with a "Guaranteed" QoS.

The other cgroups mentioned are relatively self-explanatory. Inside them, we can see a similar situation: the "burstable" cgroup is full of kubepods-burstable-pod<UUID> cgroups, and the "besteffort" cgroup is full of kubepods-besteffort-pod<UUID> cgroups.
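As a cross-check, Kubernetes also records the assigned QoS class in each pod's status, so we can list, say, every BestEffort pod directly (the exact output will of course vary by cluster):

$ kubectl get pod -A -o json | jq -r '.items[] | select(.status.qosClass == "BestEffort") | "\(.metadata.namespace)/\(.metadata.name)"'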

VMs get the "BestEffort" QoS because we can't set resources.requests (changing a pod's resources.requests is only supported on k8s 1.27+). So: what cgroup settings are used for BestEffort pods?

The relevant setting for resources.requests is the cpu.shares setting, which determines how the kernel's scheduler assigns CPU time if there isn't enough to go around. Basically, if the host is at 100% CPU usage, then all cgroups with runnable processes will get CPU time proportional to their number of cpu.shares, relative to whatever the total count is.
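To make that concrete: if, during contention, only two cgroups have runnable work, one with 2048 shares and one with 1024, the first gets roughly 2048 / (2048 + 1024) ≈ 67% of the CPU time and the second roughly 33%.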

If we look at an arbitrary "Guaranteed" pod (say, the same one as before), we can see that if it requests 2 CPUs:

$ kubectl get pod -n overprovisioning overprovisioning-6f7d78c958-2228t -o jsonpath='{.spec.containers[*].resources.requests.cpu}'
2

... then it gets 2048 CPU shares:

$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-podcb6dc9c2_eda7_4e8a_8e4a_948169e57cdc.slice/cpu.shares
2048

So we'd expect about 1 CPU share for each 1m of CPU in resources.requests.
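(That's consistent with shares being derived as roughly milliCPU × 1024 / 1000: 2000m → 2048 shares here, and 270m → 276 shares in the Burstable example below.)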

If we look at some pods with the "Burstable" QoS, we see roughly the same relationship:

$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod050b4a85_f401_4ddb_a585_3d2120ee0132.slice/cpu.shares
276
$ kubectl get pod -A -o json | jq -r '.items[] | select(.metadata.uid == "050b4a85-f401-4ddb-a585-3d2120ee0132") | "\(.metadata.namespace)/\(.metadata.name)"'
default/compute-orange-recipe-53139292-5d85955d54-5hk7x
$ kubectl get pod compute-orange-recipe-53139292-5d85955d54-5hk7x -o jsonpath='{.spec.containers[*].resources.requests.cpu}'
250m 10m 10m

We can also see that the total number of shares given to the cgroup containing the "Burstable" pods is approximately equal to the sum of the shares inside it (i.e. there's no dilution):

$ cd /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/
$ cat cpu.shares
32076
$ # Sum of all cpu.shares for child cgroups:
$ sh -c 'for c in kubepods-burstable-*; do cat $c/cpu.shares; done' | awk '{s+=$1} END {print s}'
32052

Unfortunately, there are no such guarantees for BestEffort pods. This makes sense, though: after all, they haven't asked for any resources to be guaranteed!

If we look at the cgroup containing the BestEffort pods, we can see that it gets just 2 CPU shares! Essentially the equivalent of 0.002 CPUs guaranteed, split between all BestEffort pods.

$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-besteffort.slice/cpu.shares
2

And within that cgroup, all of the BestEffort pods are given an equal number of CPU shares:

$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-besteffort.slice/kubepods-*/cpu.shares | sort | uniq
2

Current state of affairs

What all of that background means is this:

If a Kubernetes node is under high CPU load, there are basically no guarantees that any VM will get CPU time (aside from that minuscule amount), and any CPU time allotted to VMs as a whole will be split equally between them, regardless of their size.
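For a rough sense of scale: on the node above, the Burstable slice alone held ~32,000 shares, so under full contention the 2 shares for the entire BestEffort slice amount to well under 0.01% of host CPU time, shared across every VM on the node.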

So:

  1. High CPU usage by any non-VM pod can starve VMs
  2. CPU starvation will result in bigger users being disproportionately throttled

Note: This is not currently an issue, because our node CPU usage is so low. Once we implement overcommit (see #517), this may change and we may want to resolve this before we have an incident because of it.

Possible solutions

There are a variety of ideas here, ranging in scope and impact. Here's an assortment:

  1. Upgrade to k8s 1.27 and accurately set resources.requests on VM runner pods according to their spec.guest.{cpus,memorySlots}.use values
  2. Set resources.requests according to the minimum values in the VM's spec
  3. Revert #441 (neonvm/runner: Create QEMU cgroup inside current one), moving the QEMU cgroup back out of kubepods.slice, and set cpu.shares appropriately
  4. Ditch kubernetes entirely 😛
sharnoff added the t/bug (Issue Type: Bug) and c/autoscaling/neonvm (Component: autoscaling: NeonVM) labels on Oct 30, 2023
sharnoff added a commit that referenced this issue Jan 15, 2024
We recently realized[^1] that under cgroups v2, Kubernetes uses cgroup
namespaces, which has a few effects:

1. The output of /proc/self/cgroup shows as if the container were at the
   root of the hierarchy
2. It's very difficult for us to determine the actual cgroup that the
   container corresponds to on the host
3. We still can't directly create a cgroup in the container's namespace
   because /sys/fs/cgroup is mounted read-only

So, neonvm-runner currently *does not* work as expected with cgroups v2;
it creates a new cgroup for the VM, at the top of the hierarchy, and
doesn't clean it up on exit.
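
For reference, this is roughly what the namespaced view typically looks
like from inside such a container under cgroups v2 (output illustrative):

$ cat /proc/self/cgroup
0::/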

How do we fix this? The aim of this PR is to remove the special cgroup
handling entirely, and "just" go through the Container Runtime Interface
(CRI) exposed by containerd to modify the existing container we're
running in.

This requires access to /run/containerd/containerd.sock, which a
malicious user could use to perform privileged operations on the host
(or in any other container on the host).
Obviously we'd like to prevent that as much as possible, so the CPU
handling now runs alongside neonvm-runner as a separate container.
neonvm-runner does not have access to the containerd socket.

On the upside, one key benefit we get from this is being able to set cpu
shares, the abstraction underlying container resources.requests.
The other options weren't looking so great[^2], so if this works, this
would be a nice compromise.

[^1]: https://neondb.slack.com/archives/C03TN5G758R/p1705092611188719
[^2]: #591
sharnoff added a commit that referenced this issue Feb 2, 2024
NB: This PR is conditionally enabled via the --enable-container-mgr flag
on neonvm-controller. There are no effects without that.

---

We recently realized[^1] that under cgroups v2, Kubernetes uses cgroup
namespaces, which has a few effects:

1. The output of /proc/self/cgroup shows as if the container were at the
   root of the hierarchy
2. It's very difficult for us to determine the actual cgroup that the
   container corresponds to on the host
3. We still can't directly create a cgroup in the container's namespace
   because /sys/fs/cgroup is mounted read-only

So, neonvm-runner currently *does not* work as expected with cgroups v2;
it creates a new cgroup for the VM, at the top of the hierarchy, and
doesn't clean it up on exit.

How do we fix this? The aim of this PR is to remove the special cgroup
handling entirely, and "just" go through the Container Runtime Interface
(CRI) exposed by containerd to modify the existing container we're
running in.

This requires access to /run/containerd/containerd.sock, which a
malicious user could use to perform privileged operations on the host
(or in any other container on the host).
Obviously we'd like to prevent that as much as possible, so the CPU
handling now runs alongside neonvm-runner as a separate container.
neonvm-runner does not have access to the containerd socket.

On the upside, one key benefit we get from this is being able to set cpu
shares, the abstraction underlying container resources.requests.
The other options weren't looking so great[^2], so if this works, this
would be a nice compromise.

[^1]: https://neondb.slack.com/archives/C03TN5G758R/p1705092611188719
[^2]: #591