
agent: Run container workload in its own cgroup namespace (cgroup v2 guest only) #9125

Merged
merged 2 commits into kata-containers:main on Feb 23, 2024

Conversation

@gkurz (Member) commented Feb 21, 2024

This adds some missing namespace isolation in cgroup v2 guests. Some linting is apparently needed in main; do it as a preparatory patch to avoid noise in the actual fix.

Fixes #9124

@katacontainersbot added the size/small (Small and simple task) label Feb 21, 2024
@gkurz self-assigned this Feb 21, 2024
@gkurz added the ok-to-test label and removed the size/small (Small and simple task) label Feb 21, 2024
Run cargo-clippy to reduce noise in actual functional changes.

Signed-off-by: Greg Kurz <groug@kaod.org>
When cgroup v2 is in use, a container should only see its part of the
unified hierarchy in `/sys/fs/cgroup`, not the full hierarchy created
at the OS level. Similarly, `/proc/self/cgroup` inside the container
should display `0::/`, rather than a full path such as:

0::/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podde291f58_8f20_4d44_aa89_c9e538613d85.slice/crio-9e1823d09627f3c2d42f30d76f0d2933abdbc033a630aab732339c90334fbc5f.scope

What is needed here is isolation from the OS. Do that by running the
container in its own cgroup namespace. This matches what runc and
other non-VM-based runtimes do.

Fixes kata-containers#9124

Signed-off-by: Greg Kurz <groug@kaod.org>
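
As a quick illustration of the behaviour described in the commit message above, here is a minimal Rust sketch (hypothetical, not part of this PR) that prints /proc/self/cgroup from inside the container; on a cgroup v2 guest with this change applied, the expected content is a single 0::/ line:

use std::fs;

fn main() -> std::io::Result<()> {
    // Show the cgroup membership of the current process as the workload sees it.
    let contents = fs::read_to_string("/proc/self/cgroup")?;
    print!("{contents}");

    // With the container in its own cgroup namespace, the unified hierarchy
    // entry is reported relative to the namespace root rather than the
    // OS-level path.
    if contents.trim() == "0::/" {
        println!("cgroup view is isolated");
    } else {
        println!("OS-level cgroup paths are visible");
    }
    Ok(())
}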
@katacontainersbot added the size/small (Small and simple task) label Feb 21, 2024
@gkurz (Member, Author) commented Feb 21, 2024

/test

Comment on lines +559 to +561
if cgroups::hierarchies::is_cgroup2_unified_mode() {
sched::unshare(CloneFlags::CLONE_NEWCGROUP)?;
}
A reviewer (Member) commented:

Is this isolation required for cgroup v1?

@gkurz (Member, Author) replied:

> Is this isolation required for cgroup v1?

Hi Xavier, I was kinda expecting this question 😉

Cgroup v1 doesn't have the problem with /sys/fs/cgroup, as the agent bind-mounts the appropriate directories in the container.

There is some leaking in /proc/self/cgroup though, as it partially displays details that belong to the guest OS. For example, this is what we get inside a Kata container on OpenShift 4.11 (soon reaching EOL):

bash-5.2$ cat /proc/self/cgroup 
12:memory:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
11:blkio:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
10:hugetlb:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
9:cpuset:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
8:rdma:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
7:cpu,cpuacct:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
6:devices:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
5:net_cls,net_prio:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
4:pids:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
3:freezer:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
2:perf_event:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4
1:name=systemd:/crio/bfca835403a5c2629942b254fe8d850c069576be14a292ac3cd3a77f9b1958b4

The container should not see that CRI-O is involved, but this is really minor and hasn't caused any concern since the beginning.

I did try to unshare the cgroup namespace for cgroup v1 as well as an experiment, and it resulted in the container not starting. Since cgroup v1 in the guest isn't really my use case, I'll leave it to someone who cares and stick to fixing the cgroup v2 experience only in this PR (I've updated the PR title to make that explicit).
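
For reference, a self-contained sketch of the check-and-unshare pattern quoted above, assuming the nix crate for unshare(2) and the cgroups-rs crate imported as cgroups (as in the reviewed snippet); the helper name is hypothetical:

use cgroups::hierarchies::is_cgroup2_unified_mode;
use nix::sched::{unshare, CloneFlags};

// Hypothetical helper: give the workload its own cgroup namespace, but only
// on a cgroup v2 (unified) guest, since unsharing on cgroup v1 was observed
// to prevent the container from starting.
fn isolate_cgroup_namespace() -> nix::Result<()> {
    if is_cgroup2_unified_mode() {
        unshare(CloneFlags::CLONE_NEWCGROUP)?;
    }
    Ok(())
}

When CLONE_NEWCGROUP is unshared, the calling process's current cgroup becomes the root of its new namespace, which is why /proc/self/cgroup then reads 0::/.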

The reviewer (Member) replied:

Sounds good to me, thanks!

@littlejawa (Contributor) left a comment:

lgtm
Thanks @gkurz

@fidencio (Member) left a comment:

lgtm, thanks @gkurz!

@fidencio (Member) commented:

@gkurz, I won't block this PR from getting merged, but I'd love to see some tests for this case at some point.
Please let's sync with @wainersm, @ldoktor, and @GabyCT on how to add those.

@gkurz changed the title from "agent: Run container workload in its own cgroup namespace" to "agent: Run container workload in its own cgroup namespace (cgroup v2 guest only)" on Feb 22, 2024
@justxuewei merged commit 89c76d7 into kata-containers:main on Feb 23, 2024
330 of 520 checks passed
Successfully merging this pull request may close these issues.

agent: container sees full cgroup v2 hierarchy