libcontainer: remove all mount logic from nsexec #3985

With open_tree(OPEN_TREE_CLONE), it is possible to implement both the id-mapped mounts and bind-mount source file descriptor logic entirely in Go without requiring any complicated handling from nsexec. However, implementing it the naive way (do the OPEN_TREE_CLONE in the host namespace before the rootfs is set up -- which is what the existing implementation did) exposes issues in how mount ordering (in particular when handling mount sources from inside the container rootfs, but also in relation to mount propagation) was handled for idmapped mounts and bind-mount sources. In order to solve this problem completely, it is necessary to spawn a thread which joins the container mount namespace and provides mountfds when requested by the rootfs setup code (ensuring that the mount order and mount propagation of the source of the bind-mount are handled correctly). While the need to join the mount namespace leads to other complicated (such as with the usage of /proc/self -- fixed in a later patch) the resulting code is still reasonable and is the only real way to solve the issue. This allows us to reduce the amount of C code we have in nsexec, as well as simplifying a whole host of places that were made more complicated with the addition of id-mapped mounts and the bind sourcefd logic. Because we join the container namespace, we can continue to use regular O_PATH file descriptors for non-id-mapped bind-mount sources (which means we don't have to raise the kernel requirement for that case). In addition, we can easily add support for id-mappings that don't match the container's user namespace. The approach taken here is to use Go's officially supported mechanism for spawning a process in a user namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a different process. The most efficient way to implement this would be to do clone() in cgo directly to run a function that just does kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out this approach is too slow. It should be noted that the included micro-benchmark seems to indicate this is Fast Enough(TM): goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/userns cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkSpawnProc BenchmarkSpawnProc-8 1670 770065 ns/op Fixes: fda12ab ("Support idmap mounts on volumes") Fixes: 9c44407 ("Open bind mount sources from the host userns") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we can now handle arbitrary mappings without issue, so remove the primary artificial limit of mappings (must use the same mapping as the container's userns) and add some tests. We still only support idmap mounts for bind-mounts because configuring mappings for other filesystems would require switching our entire mount machinery to the new mount API. The current design would easily allow for this but we would need to convert new mount options entirely to the fsopen/fsconfig/fsmount API. This can be done in the future. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

The primary change is a switch to using /proc/thread-self, which is needed for when we add a CLONE_FS thread to runc. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Our previous test for whether we can mount on top of /proc incorrectly assumed that it would only be called with bind-mount sources. This meant that having a non bind-mount entry for a pseudo-filesystem (like overlayfs) with a dummy source set to /proc on the host would let you bypass the check, which could easily lead to security issues. In addition, the check should be applied more uniformly to all mount types, so fix that as well. And add some tests for some of the tricky cases to make sure we protect against them properly. Fixes: 331692b ("Only allow proc mount if it is procfs") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

If a user specifies a configuration like "rro, rrw", we should have similar behaviour to "ro, rw" where we clear the previous flags so that the last specified flag takes precendence. Fixes: 382eba4 ("Support recursive mount attrs ("rro", "rnosuid", "rnodev", ...)") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

ridmap indicates that the id mapping should be applied recursively (only really relevant for rbind mount entries), and idmap indicates that it should not be applied recursively (the default). If no mappings are specified for the mount, we use the userns configuration of the container. This matches the behaviour in the currently-unreleased runtime-spec. This includes a minor change to the state.json serialisation format, but because there has been no released version of runc with commit fbf183c ("Add uid and gid mappings to mounts"), we can safely make this change without affecting running containers. Doing it this way makes it much easier to handle m.IsIDMapped() and indicating that a mapping has been specified. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Our previous implementation of idmapped mounts and bind-mount sources would open all of the source paths before we did any mounts, meaning that mounts using sources from inside the container rootfs would not be correct. This has been fixed with the new on-demand system, and so add some regression tests. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libcontainer: remove all mount logic from nsexec #3985

libcontainer: remove all mount logic from nsexec #3985

Commits on Dec 14, 2023