runtime: Reduce the mount points with namespace isolation #8760

fadecoder · 2024-01-02T08:25:32Z

We notice a 100% CPU load systemd process in the host when deploying more than 400 Kata Containers Pods (using go runtime) through k8s in one node, as described in issue #8758 .

Inspired by rust runtime, we add the mount namespace isolation to go runtime, and it can prevent the systemd process from seeing more mounts, thereby reducing the burden on the systemd, making it faster to deploy more Pods.

Considering the runtime-rs VMMs are under heavy developments, adding such isolation is still useful for many user-cases.

Fixes: #8758

WenyuanLau · 2024-01-02T08:54:09Z

Thanks! @fadecoder

WenyuanLau · 2024-01-02T12:10:42Z

/test

WenyuanLau

LGTM!

zvonkok · 2024-01-10T14:01:51Z

Interesting, I would like to see how OpenRC behaves here ... and if we can save some CPU cycles using OpenRC vs. systemd. @fidencio FYI #8760

WenyuanLau · 2024-01-10T14:09:06Z

Interesting, I would like to see how OpenRC behaves here ... and if we can save some CPU cycles using OpenRC vs. systemd. @fidencio FYI #8760

The systemd process mentioned in the PR is in the host machine, not in the VM, and I updated the description just now. I'm not quite sure that we are talking about the same process. :)

gkurz

Allowing untrusted users to hog the host CPU and add measurable latency in a critical component like the host systemd is a security issue. Maybe not critical but severe enough to waste precious cycles and $$ on a big system.

Adding the missing isolation not only helps with the k8s deployment density, but it is true to kata's provide the workload isolation and security advantages of VMs. motto, that all users legitimately expect.

Thanks @fadecoder !

danmihai1 · 2024-01-12T23:25:08Z

/test

fidencio · 2024-01-15T14:21:24Z

@fadecoder, bear with me here in case I'm wrong, but I think I understand why CRI-O tests are failing.

Creating a new mount namespace requires that the process creating it has CAP_SYS_ADMIN capabilities. On the CRI-O side, that's not part of the default capabilites, and you'd have to expand that by adding CAP_SYS_ADMIN to

kata-containers/tests/common.bash

Lines 590 to 601 in 0dc00ae

    
           default_capabilities = [ 
        
                  "CHOWN", 
        
                  "DAC_OVERRIDE", 
        
                  "FSETID", 
        
                  "FOWNER", 
        
                  "SETGID", 
        
                  "SETUID", 
        
                  "SETPCAP", 
        
                  "NET_BIND_SERVICE", 
        
                  "KILL", 
        
                  "SYS_CHROOT", 
        
           ]

That will make the CI pass, most likely.

Whether or not giving this set of permissions is a good thing, is not exactly up for discussion here (mainly as we're already allowing on DAC_OVERRIDE). But, again, this discussion is out of the scope of this PR.

lifupan · 2024-01-17T05:05:29Z

/test

gkurz · 2024-02-01T08:54:43Z

This PR needs to be rebased to main in order to get rid of the bogus pending build checks in CI.

This patch can reduce load on systemd process, and increase the k8s deployment density when using go runtime. Fixes: kata-containers#8758 Signed-off-by: Zhigang Wang <wangzhigang17@huawei.com> Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>

WenyuanLau · 2024-02-02T07:26:12Z

/test

WenyuanLau · 2024-02-02T08:54:33Z

CI have passed, I will get it merged. Thanks! @fadecoder

PR kata-containers#8760 tentatively tried to have the shim to run in its own mount namespace for the sake of improving isolation between the sandbox and the host. This was inspired by the runtime-rs implementation. The monitor checks in CI have been failing since then with CRI-O. The precise path of code that got broken by kata-containers#8760 isn't known at this time but in any case, only the shim daemon, that will process the actual mount operations, needs to live in a separate namespace. The runtime-rs shim does exactly that by checking its command line arguments. If it doesn't find an action, e.g "start" or "delete", as typically passed the container runtime, it assumes it is the running as the shim daemon and unshares its mount namespace. Since the shim command line parsing bits are hidden deep in containerd vendor code, it isn't practical to do the exactly the same in the go runtime. Go for a variant : introduce a function that heuristicaly validates thet current process was spawned by the NewCommand() function. A C-style assert is added in NewCommand() to ensure any code change that would break the assumptions of the validating function are detected by CI. The panic won't fire in production. This is a bit hacky and could certainly be handled cleanlier with a change in the containerd shim package. The go shim is expected to sunset soon though. It thus doesn't seem worth the pain. Signed-off-by: Greg Kurz <groug@kaod.org>

PR kata-containers#8760 tentatively tried to have the shim to run in its own mount namespace for the sake of improving isolation between the sandbox and the host. This was inspired by the runtime-rs implementation. Unfortunately, the kata-monitor checks in CI have been failing since then with CRI-O. The precise path of code that got broken by kata-containers#8760 isn't known at this time. Looking closer at runtime-rs, it appears that only the shim daemon unshares its mount namespace. Do the same with the go runtime. Similarly to what runtime-rs does, look at the command line arguments to detect that the current process is the daemon. Since the go shim command line parsing belongs to containerd vendored code, it isn't really practical to do exactly the same though. Go for a variant : introduce a function that heuristicaly validates that the current process was spawned by the NewCommand() function. A C-style assert is added in NewCommand() to ensure any code change that would break the assumptions of the validating function are detected by CI. The panic *cannot* happen in production. This is a bit hacky and could certainly be handled cleanlier with a change in the containerd shim package. The go shim is expected to sunset soon though. It thus doesn't seem worth the pain. Fixes kata-containers#9028 Signed-off-by: Greg Kurz <groug@kaod.org>

PR kata-containers#8760 tentatively tried to have the shim to run in its own mount namespace for the sake of improving isolation between the sandbox and the host. Thus crio storage drivers shouldn't create a PRIVATE bind mount on their home directory. Otherwise, the container's rootfs mount wouldn't be propagated to kata runtime's mount namespace, and kata runtime couldn't access the container's rootfs files. So, when kata cooperated with crio, crio should set skip_mount_home=true for its storage overlay. Fixes: kata-containers#9028 Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>

katacontainersbot added the size/small Small and simple task label Jan 2, 2024

WenyuanLau added the no-backport-needed label Jan 2, 2024

WenyuanLau self-requested a review January 2, 2024 08:41

WenyuanLau added the ok-to-test label Jan 3, 2024

WenyuanLau changed the title ~~mount: Reduce the mount points with namespace isolation~~ runtime: Reduce the mount points with namespace isolation Jan 5, 2024

fadecoder force-pushed the reduce_go_runtime_mounts branch from c31cc76 to b148e2b Compare January 10, 2024 04:40

WenyuanLau approved these changes Jan 10, 2024

View reviewed changes

zvonkok mentioned this pull request Jan 10, 2024

WIP: RFC: Add OpenRC Support for Kata builds of guest initrd, image #7613

Closed

gkurz approved these changes Jan 10, 2024

View reviewed changes

fadecoder force-pushed the reduce_go_runtime_mounts branch from b148e2b to 091cf1b Compare January 17, 2024 01:12

lifupan approved these changes Jan 17, 2024

View reviewed changes

fadecoder force-pushed the reduce_go_runtime_mounts branch from 091cf1b to 9317e23 Compare February 1, 2024 10:35

WenyuanLau merged commit cb88851 into kata-containers:main Feb 2, 2024
287 of 292 checks passed

gkurz mentioned this pull request Feb 8, 2024

CI: "run-monitor (qemu, crio)" job failing on CreateContainer #9028

Closed

gkurz mentioned this pull request Feb 22, 2024

runtime: Only unshare mount namespace in the shim daemon #9063

Closed

lifupan mentioned this pull request Mar 3, 2024

CI: fix the issue of ci failure on crio #9206

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: Reduce the mount points with namespace isolation #8760

runtime: Reduce the mount points with namespace isolation #8760

fadecoder commented Jan 2, 2024 •

edited by WenyuanLau

WenyuanLau commented Jan 2, 2024

WenyuanLau commented Jan 2, 2024

WenyuanLau left a comment

zvonkok commented Jan 10, 2024

WenyuanLau commented Jan 10, 2024 •

edited

gkurz left a comment •

edited

danmihai1 commented Jan 12, 2024

fidencio commented Jan 15, 2024

lifupan commented Jan 17, 2024

gkurz commented Feb 1, 2024

WenyuanLau commented Feb 2, 2024

WenyuanLau commented Feb 2, 2024

runtime: Reduce the mount points with namespace isolation #8760

runtime: Reduce the mount points with namespace isolation #8760

Conversation

fadecoder commented Jan 2, 2024 • edited by WenyuanLau

WenyuanLau commented Jan 2, 2024

WenyuanLau commented Jan 2, 2024

WenyuanLau left a comment

Choose a reason for hiding this comment

zvonkok commented Jan 10, 2024

WenyuanLau commented Jan 10, 2024 • edited

gkurz left a comment • edited

Choose a reason for hiding this comment

danmihai1 commented Jan 12, 2024

fidencio commented Jan 15, 2024

lifupan commented Jan 17, 2024

gkurz commented Feb 1, 2024

WenyuanLau commented Feb 2, 2024

WenyuanLau commented Feb 2, 2024

fadecoder commented Jan 2, 2024 •

edited by WenyuanLau

WenyuanLau commented Jan 10, 2024 •

edited

gkurz left a comment •

edited