Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: Reduce the mount points with namespace isolation #8760

Merged

Conversation

fadecoder
Copy link
Contributor

@fadecoder fadecoder commented Jan 2, 2024

We notice a 100% CPU load systemd process in the host when deploying more than 400 Kata Containers Pods (using go runtime) through k8s in one node, as described in issue #8758 .

Inspired by rust runtime, we add the mount namespace isolation to go runtime, and it can prevent the systemd process from seeing more mounts, thereby reducing the burden on the systemd, making it faster to deploy more Pods.

Considering the runtime-rs VMMs are under heavy developments, adding such isolation is still useful for many user-cases.

Fixes: #8758

@katacontainersbot katacontainersbot added the size/small Small and simple task label Jan 2, 2024
@WenyuanLau WenyuanLau self-requested a review January 2, 2024 08:41
@WenyuanLau
Copy link
Contributor

Thanks! @fadecoder

@WenyuanLau
Copy link
Contributor

/test

@WenyuanLau WenyuanLau changed the title mount: Reduce the mount points with namespace isolation runtime: Reduce the mount points with namespace isolation Jan 5, 2024
Copy link
Contributor

@WenyuanLau WenyuanLau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@zvonkok
Copy link
Contributor

zvonkok commented Jan 10, 2024

Interesting, I would like to see how OpenRC behaves here ... and if we can save some CPU cycles using OpenRC vs. systemd. @fidencio FYI #8760

@WenyuanLau
Copy link
Contributor

WenyuanLau commented Jan 10, 2024

Interesting, I would like to see how OpenRC behaves here ... and if we can save some CPU cycles using OpenRC vs. systemd. @fidencio FYI #8760

The systemd process mentioned in the PR is in the host machine, not in the VM, and I updated the description just now. I'm not quite sure that we are talking about the same process. :)

Copy link
Member

@gkurz gkurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowing untrusted users to hog the host CPU and add measurable latency in a critical component like the host systemd is a security issue. Maybe not critical but severe enough to waste precious cycles and $$ on a big system.

Adding the missing isolation not only helps with the k8s deployment density, but it is true to kata's provide the workload isolation and security advantages of VMs. motto, that all users legitimately expect.

Thanks @fadecoder !

@danmihai1
Copy link
Contributor

/test

@fidencio
Copy link
Member

@fadecoder, bear with me here in case I'm wrong, but I think I understand why CRI-O tests are failing.

Creating a new mount namespace requires that the process creating it has CAP_SYS_ADMIN capabilities. On the CRI-O side, that's not part of the default capabilites, and you'd have to expand that by adding CAP_SYS_ADMIN to

default_capabilities = [
"CHOWN",
"DAC_OVERRIDE",
"FSETID",
"FOWNER",
"SETGID",
"SETUID",
"SETPCAP",
"NET_BIND_SERVICE",
"KILL",
"SYS_CHROOT",
]

That will make the CI pass, most likely.

Whether or not giving this set of permissions is a good thing, is not exactly up for discussion here (mainly as we're already allowing on DAC_OVERRIDE). But, again, this discussion is out of the scope of this PR.

@lifupan
Copy link
Member

lifupan commented Jan 17, 2024

/test

@gkurz
Copy link
Member

gkurz commented Feb 1, 2024

This PR needs to be rebased to main in order to get rid of the bogus pending build checks in CI.

This patch can reduce load on systemd process, and
increase the k8s deployment density when using go runtime.

Fixes: kata-containers#8758

Signed-off-by: Zhigang Wang <wangzhigang17@huawei.com>
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
@WenyuanLau
Copy link
Contributor

/test

@WenyuanLau
Copy link
Contributor

CI have passed, I will get it merged. Thanks! @fadecoder

@WenyuanLau WenyuanLau merged commit cb88851 into kata-containers:main Feb 2, 2024
287 of 292 checks passed
gkurz added a commit to gkurz/kata-containers that referenced this pull request Feb 13, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

The monitor checks in CI have been failing since then with
CRI-O. The precise path of code that got broken by kata-containers#8760
isn't known at this time but in any case, only the shim
daemon, that will process the actual mount operations, needs
to live in a separate namespace.

The runtime-rs shim does exactly that by checking its
command line arguments. If it doesn't find an action,
e.g "start" or "delete", as typically passed the
container runtime, it assumes it is the running as
the shim daemon and unshares its mount namespace.

Since the shim command line parsing bits are hidden deep
in containerd vendor code, it isn't practical to do the
exactly the same in the go runtime.

Go for a variant : introduce a function that heuristicaly
validates thet current process was spawned by the NewCommand()
function. A C-style assert is added in NewCommand() to ensure
any code change that would break the assumptions of the
validating function are detected by CI. The panic won't fire
in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Signed-off-by: Greg Kurz <groug@kaod.org>
gkurz added a commit to gkurz/kata-containers that referenced this pull request Feb 22, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
gkurz added a commit to gkurz/kata-containers that referenced this pull request Feb 22, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
gkurz added a commit to gkurz/kata-containers that referenced this pull request Feb 22, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
fidencio pushed a commit to fidencio/kata-containers that referenced this pull request Feb 28, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
fidencio pushed a commit to fidencio/kata-containers that referenced this pull request Feb 28, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
fidencio pushed a commit to fidencio/kata-containers that referenced this pull request Feb 28, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
fidencio pushed a commit to fidencio/kata-containers that referenced this pull request Feb 29, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
fidencio pushed a commit to fidencio/kata-containers that referenced this pull request Feb 29, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
lifupan pushed a commit to lifupan/kata-containers that referenced this pull request Mar 3, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its own mount
namespace for the sake of improving isolation between the sandbox and
the host. Thus crio storage drivers shouldn't create a PRIVATE
bind mount on their home directory. Otherwise, the container's rootfs
mount wouldn't be propagated to kata runtime's mount namespace, and
kata runtime couldn't access the container's rootfs files.

So, when kata cooperated with crio, crio should set
skip_mount_home=true for its storage overlay.

Fixes: kata-containers#9028

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test size/small Small and simple task
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mounts: Too many mount points in k8s high-density deployment with Kata go runtime
8 participants