agent: Run container workload in its own cgroup namespace (cgroup v2 guest only) #9125
Conversation
Run cargo-clippy to reduce noise in the actual functional changes. Signed-off-by: Greg Kurz <groug@kaod.org>
When cgroup v2 is in use, a container should only see its own part of the unified hierarchy in `/sys/fs/cgroup`, not the full hierarchy created at the OS level. Similarly, `/proc/self/cgroup` inside the container should display `0::/`, rather than a full path such as:

0::/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podde291f58_8f20_4d44_aa89_c9e538613d85.slice/crio-9e1823d09627f3c2d42f30d76f0d2933abdbc033a630aab732339c90334fbc5f.scope

What is needed here is isolation from the OS. Do that by running the container in its own cgroup namespace. This matches what runc and other non-VM-based runtimes do.

Fixes kata-containers#9124

Signed-off-by: Greg Kurz <groug@kaod.org>
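As an illustrative aside (not part of the PR itself), the expected behavior is easy to check by reading /proc/self/cgroup inside the container: on cgroup v2, once the namespace is unshared, the single entry should read `0::/`. A minimal sketch in Rust; the helper name is hypothetical:

```rust
use std::fs;

/// Returns true if cgroup v2 `/proc/self/cgroup` content shows the
/// process at the root of its (namespaced) hierarchy, i.e. `0::/`.
/// Hypothetical helper, for illustration only.
fn sees_cgroup_ns_root(proc_self_cgroup: &str) -> bool {
    proc_self_cgroup.lines().any(|line| line.trim() == "0::/")
}

fn main() {
    // Inside a properly isolated cgroup v2 container this prints "true";
    // without cgroup namespace isolation it prints "false".
    match fs::read_to_string("/proc/self/cgroup") {
        Ok(content) => println!("{}", sees_cgroup_ns_root(&content)),
        Err(e) => eprintln!("cannot read /proc/self/cgroup: {e}"),
    }
}
```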
/test
if cgroups::hierarchies::is_cgroup2_unified_mode() {
    sched::unshare(CloneFlags::CLONE_NEWCGROUP)?;
}
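For context, the gating condition asks the cgroups crate whether the guest runs in cgroup v2 unified mode. A common heuristic for that check, sketched here as a hypothetical standalone helper (the PR itself uses cgroups::hierarchies::is_cgroup2_unified_mode(), not this code), is that a unified hierarchy exposes cgroup.controllers at its root. The mount root is taken as a parameter for testability; the real mount point is /sys/fs/cgroup:

```rust
use std::path::Path;

/// Heuristic stand-in for cgroups::hierarchies::is_cgroup2_unified_mode():
/// on a unified (v2-only) hierarchy, the root directory exposes a
/// `cgroup.controllers` file. Hypothetical helper, for illustration only.
fn is_cgroup2_unified_mode(root: &Path) -> bool {
    root.join("cgroup.controllers").exists()
}

fn main() {
    // On a real system the root is /sys/fs/cgroup.
    println!("{}", is_cgroup2_unified_mode(Path::new("/sys/fs/cgroup")));
}
```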
Is this isolation required for cgroup v1?
Hi Xavier, I was kinda expecting this question 😉
Cgroup v1 doesn't have the problem with /sys/fs/cgroup, as the agent bind mounts the appropriate directories in the container. There is some leaking in /proc/self/cgroup though, as it partially displays details that belong to the guest OS. For example, this is what we get inside a kata container on OpenShift 4.11 (soon reaching EOL):
bash-5.2$ cat /proc/self/cgroup
12:memory:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
11:blkio:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
10:hugetlb:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
9:cpuset:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
8:rdma:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
7:cpu,cpuacct:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
6:devices:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
5:net_cls,net_prio:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
4:pids:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
3:freezer:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
2:perf_event:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
1:name=systemd:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
The container should not see that CRI-O is involved, but this is really minor and hasn't caused any concern since the beginning.
I did try to unshare the cgroup namespace for cgroup v1 as well, as an experiment, and it resulted in the container not starting. Since cgroup v1 in the guest isn't really my use case, I'll leave it to someone who cares and stick to fixing the cgroup v2 experience only in this PR (I've updated the PR title to make this explicit).
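The v1 leak described above is easy to spot programmatically: in `ID:controllers:path` lines, any path other than `/` exposes a host-side detail such as the crio scope. A hypothetical sketch of such a check, for illustration only:

```rust
/// Given cgroup v1 style `/proc/self/cgroup` content (one
/// `ID:controllers:path` entry per line), return the paths that reveal
/// host-side details, i.e. any path other than "/".
/// Hypothetical helper, not part of the PR.
fn leaked_v1_paths(proc_self_cgroup: &str) -> Vec<String> {
    proc_self_cgroup
        .lines()
        // The path may itself contain ':' in theory, so split at most
        // twice and keep the third field whole.
        .filter_map(|line| line.splitn(3, ':').nth(2))
        .filter(|path| *path != "/")
        .map(str::to_string)
        .collect()
}
```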
Sounds good to me, thanks!
lgtm
Thanks @gkurz
lgtm, thanks @gkurz!
This adds some missing namespace isolation in cgroup v2 guests. Some linting is apparently needed in main; do this as a preparatory patch to avoid noise in the actual fix.
Fixes #9124