Bug: VMs have low-priority CPU use, distributed equally under load (not based on VM size!) #591
sharnoff added commits that referenced this issue between Jan 15 and Feb 2, 2024. The commit message:

NB: This PR is conditionally enabled via the --enable-container-mgr flag on neonvm-controller. There are no effects without that.

---

We recently realized[^1] that under cgroups v2, kubernetes uses cgroup namespaces, which has a few effects:

1. The output of /proc/self/cgroup shows as if the container were at the root of the hierarchy
2. It's very difficult for us to determine the actual cgroup that the container corresponds to on the host
3. We still can't directly create a cgroup in the container's namespace, because /sys/fs/cgroup is mounted read-only

So, neonvm-runner currently *does not* work as expected with cgroups v2: it creates a new cgroup for the VM at the top of the hierarchy, and doesn't clean it up on exit.

How do we fix this? The aim of this PR is to remove the special cgroup handling entirely, and "just" go through the Container Runtime Interface (CRI) exposed by containerd to modify the existing container we're running in. This requires access to /run/containerd/containerd.sock, which a malicious user could use to perform privileged operations on the host (or in any other container on the host). Obviously we'd like to prevent that as much as possible, so the CPU handling now runs alongside neonvm-runner as a separate container; neonvm-runner does not have access to the containerd socket.

On the upside, one key benefit we get from this is being able to set cpu shares, the abstraction underlying container resources.requests. The other options weren't looking so great[^2], so if this works, this would be a nice compromise.

[^1]: https://neondb.slack.com/archives/C03TN5G758R/p1705092611188719
[^2]: #591
This issue is largely adapted from a Slack message from a couple of months ago.
As of 2023-10-29, this is not urgent. It is a theoretical issue that is highly unlikely to occur, but should be considered in light of certain planned changes.
Background
Kubernetes has "Quality of Service" (QoS) classifications for pods. The QoS given to a pod is determined exclusively by the resources.requests and resources.limits used (or not) for its containers.
For the most part, QoS is descriptive rather than prescriptive. The QoS classes are roughly:

- Guaranteed: every container sets resources.requests equal to resources.limits, for both CPU and memory
- Burstable: at least one container sets resources.requests or resources.limits, but the pod doesn't qualify as Guaranteed
- BestEffort: no container sets any resources.requests or resources.limits
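The class Kubernetes assigned is visible in the pod's status; a quick sketch (the pod name is a placeholder):

```console
$ kubectl get pod <name> -o jsonpath='{.status.qosClass}'
Burstable
```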
As a refresher, Kubernetes guarantees that any "requested" resources are always available to the container, should it need them.
What's actually going on is clearer if we look at how it's implemented in EKS (at least, as of 2023-10-29, k8s 1.25).
Cgroups are fundamental to the implementation of containers, so we can gather a lot about how certain kubernetes features are implemented by looking at how they affect the relevant cgroups. If we ssh into a kubernetes node and run `ls /sys/fs/cgroup/cpu/kubepods.slice`, we might see something like:
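Illustrative output (a sketch; the control files vary by node, but the three kinds of `kubepods-*` entries called out below are the point):

```console
$ ls /sys/fs/cgroup/cpu/kubepods.slice
cpu.cfs_quota_us  kubepods-besteffort.slice  kubepods-pod<UUID>.slice
cpu.shares        kubepods-burstable.slice   ...
```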
There are a few things of note here:

- the `kubepods-besteffort.slice` cgroup
- the `kubepods-burstable.slice` cgroup
- the `kubepods-pod<UUID>` cgroups

The pods are relatively easy to track down: the UUID is just the UID of the pod; we can find the last one, for example, with:
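A sketch of one way to do that lookup (assuming kubectl access; the jsonpath is illustrative, not from the original message — and note that in the slice name the UID's dashes appear as underscores under the systemd cgroup driver):

```console
$ kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.uid}{"\t"}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}' | grep '<UUID>'
```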
These are pods with a "Guaranteed" QoS.
The other cgroups mentioned are relatively self-explanatory. Inside them, we can see a similar situation: the "burstable" cgroup is full of `kubepods-burstable-pod<UUID>` cgroups, and the "besteffort" cgroup is full of `kubepods-besteffort-pod<UUID>` cgroups.

VMs get the "BestEffort" QoS because we can't set resources.requests (changing a pod's resources.requests is only supported for k8s 1.27+), so: what cgroup settings are used for the BestEffort pods?
The relevant setting for resources.requests is the `cpu.shares` setting, which determines how the kernel's scheduler assigns CPU time when there isn't enough to go around. Basically, if the host is at 100% CPU usage, then all cgroups with runnable processes get CPU time proportional to their number of `cpu.shares`, relative to whatever the total count is.

If we look at an arbitrary "Guaranteed" pod (say, the same one as before), we can see that if it has 2 CPUs, then it gets 2048 CPU shares:
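A sketch of that check (the pod name, UID, and exact paths are placeholders for illustration):

```console
$ kubectl get pod <name> -o jsonpath='{.spec.containers[0].resources.requests.cpu}'
2
$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-pod<UUID>.slice/cpu.shares
2048
```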
So we'd expect about 1 CPU share for 1m of CPU in resources.requests.
If we look at some pods with the "Burstable" QoS, we see that this is roughly the same:
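For instance (a sketch; the values are illustrative), a Burstable pod requesting 250m of CPU gets 250 * 1024 / 1000 = 256 shares:

```console
$ kubectl get pod <name> -o jsonpath='{.spec.containers[0].resources.requests.cpu}'
250m
$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UUID>.slice/cpu.shares
256
```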
We can also see that the total number of shares given to the cgroup containing "Burstable" pods is approximately equal to the sum of the shares inside the cgroup (i.e. there's no dilution).
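A sketch of how one might check that (the glob, paths, and totals are illustrative):

```console
$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/cpu.shares
1280
$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod*.slice/cpu.shares | awk '{sum += $1} END {print sum}'
1280
```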
Unfortunately, there are no such guarantees for BestEffort pods. This, however, makes sense - after all, they haven't asked for any resources to be guaranteed!
If we look at the cgroup containing the BestEffort pods, we can see that it gets just 2 CPU shares! Essentially the equivalent of 0.002 CPUs guaranteed, split between all BestEffort pods.
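On the node, that looks like the following (path per the cgroups v1 layout above):

```console
$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-besteffort.slice/cpu.shares
2
```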
And within that cgroup, all of the BestEffort pods are given an equal number of CPU shares:
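Illustrative output, assuming three BestEffort pods on the node (each gets the minimum of 2 shares, regardless of size):

```console
$ cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod*.slice/cpu.shares
2
2
2
```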
Current state of affairs
What all of that background means is this:
If a Kubernetes node is under high CPU load, there are basically no guarantees that any VM will get CPU time (aside from that minuscule amount), and any CPU time allotted to VMs as a whole will be split equally between them, regardless of their total size.

So: VMs have low-priority CPU use, distributed equally under load, not based on VM size.
Note: This is not currently an issue, because our node CPU usage is so low. Once we implement overcommit (see #517), this may change and we may want to resolve this before we have an incident because of it.
Possible solutions
There's a variety of ideas here, ranging in scope and impact. Here's an assortment:

- Set resources.requests based on the `spec.guest.{cpus,memorySlots}.use` values