Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libcontainer: remove all mount logic from nsexec #3985

Merged
merged 8 commits into from
Dec 18, 2023
Merged

libcontainer: remove all mount logic from nsexec #3985

merged 8 commits into from
Dec 18, 2023

Commits on Dec 14, 2023

  1. libcontainer: remove all mount logic from nsexec

    With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
    id-mapped mounts and bind-mount source file descriptor logic entirely in
    Go without requiring any complicated handling from nsexec.
    
    However, implementing it the naive way (do the OPEN_TREE_CLONE in the
    host namespace before the rootfs is set up -- which is what the existing
    implementation did) exposes issues in how mount ordering (in particular
    when handling mount sources from inside the container rootfs, but also
    in relation to mount propagation) was handled for idmapped mounts and
    bind-mount sources. In order to solve this problem completely, it is
    necessary to spawn a thread which joins the container mount namespace
    and provides mountfds when requested by the rootfs setup code (ensuring
    that the mount order and mount propagation of the source of the
    bind-mount are handled correctly). While the need to join the mount
    namespace leads to other complicated (such as with the usage of
    /proc/self -- fixed in a later patch) the resulting code is still
    reasonable and is the only real way to solve the issue.
    
    This allows us to reduce the amount of C code we have in nsexec, as well
    as simplifying a whole host of places that were made more complicated
    with the addition of id-mapped mounts and the bind sourcefd logic.
    Because we join the container namespace, we can continue to use regular
    O_PATH file descriptors for non-id-mapped bind-mount sources (which
    means we don't have to raise the kernel requirement for that case).
    
    In addition, we can easily add support for id-mappings that don't match
    the container's user namespace. The approach taken here is to use Go's
    officially supported mechanism for spawning a process in a user
    namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
    different process. The most efficient way to implement this would be to
    do clone() in cgo directly to run a function that just does
    kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
    this approach is too slow. It should be noted that the included
    micro-benchmark seems to indicate this is Fast Enough(TM):
    
      goos: linux
      goarch: amd64
      pkg: github.com/opencontainers/runc/libcontainer/userns
      cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
      BenchmarkSpawnProc
      BenchmarkSpawnProc-8        1670            770065 ns/op
    
    Fixes: fda12ab ("Support idmap mounts on volumes")
    Fixes: 9c44407 ("Open bind mount sources from the host userns")
    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    cyphar committed Dec 14, 2023
    Configuration menu
    Copy the full SHA
    ba0b5e2 View commit details
    Browse the repository at this point in the history
  2. idmap: allow arbitrary idmap mounts regardless of userns configuration

    With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we
    can now handle arbitrary mappings without issue, so remove the primary
    artificial limit of mappings (must use the same mapping as the
    container's userns) and add some tests.
    
    We still only support idmap mounts for bind-mounts because configuring
    mappings for other filesystems would require switching our entire mount
    machinery to the new mount API. The current design would easily allow
    for this but we would need to convert new mount options entirely to the
    fsopen/fsconfig/fsmount API. This can be done in the future.
    
    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    cyphar committed Dec 14, 2023
    Configuration menu
    Copy the full SHA
    5ae88da View commit details
    Browse the repository at this point in the history
  3. vendor: update to github.com/moby/sys/mountinfo@v0.7.1

    The primary change is a switch to using /proc/thread-self, which is
    needed for when we add a CLONE_FS thread to runc.
    
    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    cyphar committed Dec 14, 2023
    Configuration menu
    Copy the full SHA
    a04d88e View commit details
    Browse the repository at this point in the history
  4. tree-wide: use /proc/thread-self for thread-local state

    With the idmap work, we will have a tainted Go thread in our
    thread-group that has a different mount namespace to the other threads.
    It seems that (due to some bad luck) the Go scheduler tends to make this
    thread the thread-group leader in our tests, which results in very
    baffling failures where /proc/self/mountinfo produces gibberish results.
    
    In order to avoid this, switch to using /proc/thread-self for everything
    that is thread-local. This primarily includes switching all file
    descriptor paths (CLONE_FS), all of the places that check the current
    cgroup (technically we never will run a single runc thread in a separate
    cgroup, but better to be safe than sorry), and the aforementioned
    mountinfo code. We don't need to do anything for the following because
    the results we need aren't thread-local:
    
     * Checks that certain namespaces are supported by stat(2)ing
       /proc/self/ns/...
    
     * /proc/self/exe and /proc/self/cmdline are not thread-local.
    
     * While threads can be in different cgroups, we do not do this for the
       runc binary (or libcontainer) and thus we do not need to switch to
       the thread-local version of /proc/self/cgroups.
    
     * All of the CLONE_NEWUSER files are not thread-local because you
       cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
       is blocked for multi-threaded programs).
    
    Note that we have to use runtime.LockOSThread when we have an open
    handle to a tid-specific procfs file that we are operating on multiple
    times. Go can reschedule us such that we are running on a different
    thread and then kill the original thread (causing -ENOENT or similarly
    confusing errors). This is not strictly necessary for most usages of
    /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
    only operating on the actual inodes associated with the tid requires
    this locking, but because of the pre-3.17 fallback for CentOS, we have
    to do this in most cases.
    
    In addition, CentOS's kernel is too old for /proc/thread-self, which
    requires us to emulate it -- however in rootfs_linux.go, we are in the
    container pid namespace but /proc is the host's procfs. This leads to
    the incredibly frustrating situation where there is no way (on pre-4.1
    Linux) to figure out which /proc/self/task/... entry refers to the
    current tid. We can just use /proc/self in this case.
    
    Yes this is all pretty ugly. I also wish it wasn't necessary.
    
    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    cyphar committed Dec 14, 2023
    Configuration menu
    Copy the full SHA
    8e8b136 View commit details
    Browse the repository at this point in the history
  5. rootfs: fix 'can we mount on top of /proc' check

    Our previous test for whether we can mount on top of /proc incorrectly
    assumed that it would only be called with bind-mount sources. This meant
    that having a non bind-mount entry for a pseudo-filesystem (like
    overlayfs) with a dummy source set to /proc on the host would let you
    bypass the check, which could easily lead to security issues.
    
    In addition, the check should be applied more uniformly to all mount
    types, so fix that as well. And add some tests for some of the tricky
    cases to make sure we protect against them properly.
    
    Fixes: 331692b ("Only allow proc mount if it is procfs")
    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    cyphar committed Dec 14, 2023
    Configuration menu
    Copy the full SHA
    cdff09a View commit details
    Browse the repository at this point in the history
  6. specconv: handle recursive attribute clearing more consistently

    If a user specifies a configuration like "rro, rrw", we should have
    similar behaviour to "ro, rw" where we clear the previous flags so that
    the last specified flag takes precendence.
    
    Fixes: 382eba4 ("Support recursive mount attrs ("rro", "rnosuid", "rnodev", ...)")
    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    cyphar committed Dec 14, 2023
    Configuration menu
    Copy the full SHA
    7795ca4 View commit details
    Browse the repository at this point in the history
  7. mount: add support for ridmap and idmap

    ridmap indicates that the id mapping should be applied recursively (only
    really relevant for rbind mount entries), and idmap indicates that it
    should not be applied recursively (the default). If no mappings are
    specified for the mount, we use the userns configuration of the
    container. This matches the behaviour in the currently-unreleased
    runtime-spec.
    
    This includes a minor change to the state.json serialisation format, but
    because there has been no released version of runc with commit
    fbf183c ("Add uid and gid mappings to mounts"), we can safely make
    this change without affecting running containers. Doing it this way
    makes it much easier to handle m.IsIDMapped() and indicating that a
    mapping has been specified.
    
    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    cyphar committed Dec 14, 2023
    Configuration menu
    Copy the full SHA
    3b57e45 View commit details
    Browse the repository at this point in the history
  8. tests: mounts: add some tests to check mount ordering

    Our previous implementation of idmapped mounts and bind-mount sources
    would open all of the source paths before we did any mounts, meaning
    that mounts using sources from inside the container rootfs would not be
    correct.
    
    This has been fixed with the new on-demand system, and so add some
    regression tests.
    
    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    cyphar committed Dec 14, 2023
    Configuration menu
    Copy the full SHA
    fa93c8b View commit details
    Browse the repository at this point in the history