[CVE-2019-19921]: Volume mount race condition with shared mounts #2197

leoluk · 2020-01-01T13:07:23Z

Disclosed in #2190.

Here's the original report to security@opencontainers.org:

Hi all,

An attacker who controls the container image for two containers that share a volume can race volume mounts during container initialization by adding a symlink to the rootfs that points to a directory on the volume. The second container cannot see the actual mount, but it can race it by modifying the mount point on the volume.

This can be exploited for a full container breakout by racing readonly/mask mounts, allowing writes to dangerous paths like /proc/sys/kernel/core_pattern.

Example:

The rootfs of container A has a symlink /proc -> /evil/level1
Container A specifies a named volume mounted to /evil
Container B, started before container A, shares this named volume and repeatedly swaps /evil/level1 and /evil/level1~
Container A mounts procfs to /evil/level1~/level2, but when it remounts /proc/sys, it does so at /evil/level1/level2/sys.
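
The effect described in the example, where the same path names two different objects depending on when it is resolved, can be reproduced harmlessly in a scratch directory. The following is a minimal sketch, not the PoC: the directory names mirror the example, the marker files and the helper names swapDirs and readTwice are made up for illustration, and the swap uses three plain renames rather than an atomic exchange.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// swapDirs exchanges two directories using three renames. This is not
// atomic, but it is enough to show the effect on path resolution.
func swapDirs(a, b string) error {
	tmp := a + ".swap"
	if err := os.Rename(a, tmp); err != nil {
		return err
	}
	if err := os.Rename(b, a); err != nil {
		return err
	}
	return os.Rename(tmp, b)
}

// readTwice builds a scratch "rootfs" whose proc entry is a symlink into a
// scratch "volume", then reads the same path before and after the two
// volume directories are swapped.
func readTwice() (string, string, error) {
	base, err := os.MkdirTemp("", "swap-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(base)

	volume := filepath.Join(base, "volume")
	for name, tag := range map[string]string{"level1": "original", "level1~": "swapped-in"} {
		dir := filepath.Join(volume, name, "level2")
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return "", "", err
		}
		if err := os.WriteFile(filepath.Join(dir, "marker"), []byte(tag), 0o644); err != nil {
			return "", "", err
		}
	}

	rootfs := filepath.Join(base, "rootfs")
	if err := os.Mkdir(rootfs, 0o755); err != nil {
		return "", "", err
	}
	// rootfs/proc is a symlink that leads into the shared volume.
	if err := os.Symlink(filepath.Join(volume, "level1"), filepath.Join(rootfs, "proc")); err != nil {
		return "", "", err
	}

	target := filepath.Join(rootfs, "proc", "level2", "marker")
	first, err := os.ReadFile(target)
	if err != nil {
		return "", "", err
	}
	if err := swapDirs(filepath.Join(volume, "level1"), filepath.Join(volume, "level1~")); err != nil {
		return "", "", err
	}
	// Same path string as before, but it now resolves to the other directory.
	second, err := os.ReadFile(target)
	if err != nil {
		return "", "", err
	}
	return string(first), string(second), nil
}

func main() {
	first, second, err := readTwice()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("first read:  %s\nsecond read: %s\n", first, second)
}
```

Any code that resolves a path once to check or mount it and then resolves it again for a follow-up step is exposed to this kind of change if another party can write to a directory along the way.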

This can reliably be reproduced using runc and podman on Fedora 30 (takes about 0-5s to win the race for me): https://gist.github.com/leoluk/82965ad9df58247202aa0e1878439092

SELinux would ordinarily prevent the exploit by disallowing container_t from writing usermodehelper_t, but it can be disabled by symlinking /proc/self/task/1/attr/exec to something benign like /proc/self/sched (bypassing the procfs check). AppArmor can be disabled similarly.

Docker specifies the mounts in a different order and mounts procfs after it mounts the volumes, mounting over the /proc symlink, which appears to prevent at least the /proc approach. I haven't tested other runc usage scenarios, for instance, k8s+cri-o might be vulnerable as well.

Fabian of Cure53 (in CC) created a minimal PoC that uses runc directly: https://gist.github.com/LiveOverflow/c937820b688922eb127fb760ce06dab9

There are other container init steps after the volume mount that can be raced. The obvious ones are utils.CloseExecFrom and the AppArmor/SELinux attrs, but there might be others, especially in mountToRootfs (for example, tricking remount into mounting the rootfs as rshared if another volume specifies the flag, though I haven't tried that).

This is similar to the vulnerability I reported that Adam Iwaniuk disclosed during their Dragon Sector CTF (#2128) and a similar crun one (containers/crun#111).

The fix for the mounts is probably what Aleksa outlined here, using /proc/self/fd to resolve the path: containers/crun#111 (comment)
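
One way to read the /proc/self/fd idea is: open the destination once, ask the kernel which path the descriptor really refers to, and refuse to continue if that differs from the expected path. The sketch below is only an illustration under assumptions: verifyTarget and demoVerify are hypothetical names, it uses a plain os.Open instead of an O_PATH descriptor to stay within the standard library, and it needs Linux with procfs mounted at /proc.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// verifyTarget (hypothetical helper) opens dest and reads the
// /proc/self/fd/N link to learn which path the descriptor refers to.
// On success the open file is returned so the caller can keep using the
// descriptor instead of resolving dest a second time.
func verifyTarget(dest string) (*os.File, error) {
	f, err := os.Open(dest) // assumption: plain open; real code would likely use O_PATH
	if err != nil {
		return nil, err
	}
	actual, err := os.Readlink(fmt.Sprintf("/proc/self/fd/%d", f.Fd()))
	if err != nil {
		f.Close()
		return nil, err
	}
	if actual != dest {
		f.Close()
		return nil, fmt.Errorf("%q actually resolves to %q", dest, actual)
	}
	return f, nil
}

// demoVerify checks a real directory and a symlink pointing at it.
// It reports whether the directory was accepted and the symlink rejected.
func demoVerify() (bool, bool, error) {
	base, err := os.MkdirTemp("", "fd-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(base)
	// Normalise the scratch path so the string comparison is meaningful.
	base, err = filepath.EvalSymlinks(base)
	if err != nil {
		return false, false, err
	}
	realDir := filepath.Join(base, "real")
	link := filepath.Join(base, "link")
	if err := os.Mkdir(realDir, 0o755); err != nil {
		return false, false, err
	}
	if err := os.Symlink(realDir, link); err != nil {
		return false, false, err
	}

	f, err := verifyTarget(realDir)
	dirOK := err == nil
	if f != nil {
		f.Close()
	}
	f, err = verifyTarget(link)
	linkRejected := err != nil
	if f != nil {
		f.Close()
	}
	return dirOK, linkRejected, nil
}

func main() {
	dirOK, linkRejected, err := demoVerify()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("directory accepted:", dirOK)
	fmt.Println("symlink rejected:  ", linkRejected)
}
```

The comparison only helps if the expected path is already clean and absolute, and the check trusts the procfs it reads from, so it is a mitigation rather than a complete defence.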

cyphar · 2020-01-01T14:06:09Z

My proposed ("stop the bleeding") patch was something like the following:

commit 81a9af6677b1f87e70b87e9a655cb4f4d06a0503 (HEAD -> fix-double-volume-attack)
Author: Aleksa Sarai <asarai@suse.de>
Date:   Sat Dec 21 23:40:17 2019 +1100

    rootfs: do not permit /proc mounts to non-directories
    
    mount(2) will blindly follow symlinks, which is a problem because it
    allows a malicious container to trick runc into mounting /proc to an
    entirely different location (and thus within the attacker's control for
    a rename-exchange attack).
    
    This is just a hotfix, and the more complete fix would be finish
    libpathrs and port runc to it (to avoid these types of attacks entirely,
    and defend against a variety of other /proc-related attacks).
    
    Fixes: CVE-YYYY-XXXX
    Signed-off-by: Aleksa Sarai <asarai@suse.de>

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index 291021440a1a..6e896bc4fdaa 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -297,17 +297,49 @@ func mountToRootfs(m *configs.Mount, rootfs, mountLabel string, enableCgroupns b
                dest = filepath.Join(rootfs, dest)
        }
 
+       // For "special" filesystems, we have to be quite careful about mounting --
+       // we must make sure that the destination is what we expect. This is done
+       // by opening the destination as an O_PATH descriptor, and using the
+       // /proc/self/fd/... as the mount target. Unfortunately this is actually
+       // possible to bypass with a little bit of thought, but the complete
+       // solution for this will be to port runc to libpathrs.
        switch m.Device {
-       case "proc", "sysfs":
+       case "proc", "sysfs", "mqueue":
+               // NOTE: If the container controls any part of dest, this is unsafe.
                if err := os.MkdirAll(dest, 0755); err != nil {
                        return err
                }
+               destFd, err := unix.Open(dest, unix.O_PATH|unix.O_CLOEXEC, 0)
+               if err != nil {
+                       return err
+               }
+               defer unix.Close(destFd)
+
+               // Check that the path is exactly what we expect.
+               // NOTE: If the path contains an attacker-controlled bind-mount, this
+               //       check won't do anything. In addition, if procfs is fraudulent,
+               //       it will also be useless. As above, the solution is to switch
+               //       to libpathrs.
+               destFdPath := fmt.Sprintf("/proc/self/fd/%d", destFd)
+               destUnsafePath, err := os.Readlink(destFdPath)
+               if err != nil {
+                       return err
+               }
+               if destUnsafePath != dest {
+                       return fmt.Errorf("detected possible breakout: trying to mount '%s' on '%s' was actually targeted to '%s'", m.Device, dest, destUnsafePath)
+               }
+
+               // Okay, now we can use destFdPath.
+               dest = destFdPath
+               m.Destination = destFdPath
+       }
+
+       // Now actually do the mount.
+       switch m.Device {
+       case "proc", "sysfs":
                // Selinux kernels do not support labeling of /proc or /sys
                return mountPropagate(m, rootfs, "")
        case "mqueue":
-               if err := os.MkdirAll(dest, 0755); err != nil {
-                       return err
-               }
                if err := mountPropagate(m, rootfs, mountLabel); err != nil {
                        // older kernels do not support labeling of /dev/mqueue
                        if err := mountPropagate(m, rootfs, ""); err != nil {

Unfortunately this is not sufficient if / is shared with another container, because then you can do the same trick (but this time on / directly). It also needs some more work to work around the fact that there are m.Destination-based checks elsewhere in rootfs_linux.go.

leoluk · 2020-01-04T15:20:55Z

Your patch does stop the bleeding, though - most runc use cases do not share the rootfs. Mounting a volume on / breaks all kinds of things. Haven't managed to do anything useful using either cri-o or podman.

cyphar · 2020-01-04T22:12:59Z

Alright, I'll prepare a PR. Thanks @leoluk -- and sorry for the response time issues (as well as how the disclosure happened).

liggitt · 2020-01-13T19:29:41Z

any ETA on the workaround to unblock rc10?

cyphar · 2020-01-14T02:31:26Z

I've been off the face of the earth for the past 2ish weeks. I will prepare a PR tomorrow.

cyphar · 2020-01-14T04:42:16Z

#2207 contains a very simplified version of the above patch (the patch I posted above doesn't work because rootfs_linux.go has a very fun relationship with pathnames that I don't have time to debug right now).

Beuc · 2023-02-20T16:24:09Z

Hi,

I'm part of the Debian Long Term Support (LTS) team, and I'm attempting to fix CVE-2019-19921 in our past releases that package "runc".
(apologies for digging up this old issue :))

I'm still able to reproduce the vulnerability (using the runc reproducer linked in the original topic), in the following situations:

backporting the fix 2fc03cc to 1.0.0~rc6 (Debian 10 "buster"/"old-stable")
more annoyingly, with 1.0.0~rc93, as shipped in Debian 11 "bullseye"/current; for reference the fix was pushed to rc10

AFAICS the fix does make the exploit less likely, but does not stop it entirely: within a few minutes I'm still able to overwrite my root system's /proc/sys/kernel/core_pattern from container-2.

Is this expected (as in, it's a mitigation but not a bullet-proof fix)?
Or is there a follow-up fix that I missed?

Thanks for your attention and best regards.

leoluk · 2023-02-20T17:20:38Z

2fc03cc should completely prevent the exploit. It adds a check that refuses to mount procfs on /proc in the rootfs unless the target is a directory or does not exist yet, which makes it impossible to point it to an attacker-controlled bind mount. It's not possible to race /proc itself in this setup (the rootfs is not attacker-accessible during early setup).
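
A check of this kind can be sketched with os.Lstat, which does not follow symlinks. This is a minimal illustration in the spirit of the fix, not the verbatim commit: checkSpecialMountTarget and runChecks are hypothetical names, and the error text is borrowed from the message quoted later in this thread.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkSpecialMountTarget (hypothetical name) accepts a destination that
// is a plain directory or does not exist, and rejects everything else.
// Lstat reports a symlink as a symlink, so a symlinked /proc is refused.
func checkSpecialMountTarget(device, dest string) error {
	fi, err := os.Lstat(dest)
	if err != nil {
		if os.IsNotExist(err) {
			return nil // absent: the caller may create the directory
		}
		return err
	}
	if !fi.IsDir() {
		return fmt.Errorf("filesystem %q must be mounted on ordinary directory", device)
	}
	return nil
}

// runChecks applies the check to a directory, a missing path, and a
// symlink to a directory, all inside a scratch directory.
func runChecks() (map[string]error, error) {
	base, err := os.MkdirTemp("", "lstat-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(base)

	dir := filepath.Join(base, "proc")
	link := filepath.Join(base, "proc-link")
	if err := os.Mkdir(dir, 0o755); err != nil {
		return nil, err
	}
	if err := os.Symlink(dir, link); err != nil {
		return nil, err
	}
	return map[string]error{
		"directory": checkSpecialMountTarget("proc", dir),
		"absent":    checkSpecialMountTarget("proc", filepath.Join(base, "missing")),
		"symlink":   checkSpecialMountTarget("proc", link),
	}, nil
}

func main() {
	results, err := runChecks()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, name := range []string{"directory", "absent", "symlink"} {
		fmt.Printf("%-9s -> %v\n", name, results[name])
	}
}
```

The check is only as strong as the guarantee that nothing can replace the destination between the Lstat and the mount, which is the assumption about the rootfs stated above.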

Either there's a regression or something's wrong with the Debian backport.

leoluk · 2023-02-20T17:43:49Z

The code is definitely included in Debian 1.0.0~rc93: https://salsa.debian.org/go-team/packages/runc/-/blob/debian/1.0.0_rc93+ds1-5+deb11u2/libcontainer/rootfs_linux.go#L314

Beuc · 2023-02-20T22:21:38Z

Thanks for your fast feedback!

Debian might have different dependency versions, because it mostly removes vendor/* and uses the packaged versions.
So I tried with an Ubuntu Focal (20.04) VM, where 'runc' is built with the built-in vendor/*, to check whether that was the reason.

Interestingly:

1.0.0~rc10-0ubuntu1 correctly blocks the mount attempt early and 'runc run container-[12]' fails ("must be mounted on ordinary directory")
1.0.0~rc95-0ubuntu1~20.04.2 is vulnerable to the PoC
1.1.0-0ubuntu1~20.04.2 is vulnerable to the PoC

So AFAICS, despite the presence of the fix in all versions, some other commit re-introduced the issue.
(and similarly the fix alone didn't appear to fix ~rc6 in my previous message)

If you've got further insights I'd be grateful :)
Otherwise I can try and bisect to pinpoint when the fix lost its effectiveness (probably tomorrow).

Beuc · 2023-02-21T10:12:40Z

After a bit of digging, ironically it looks like the fix for this vulnerability (CVE-2019-19921) was broken by the one for CVE-2021-30465: 0ca91f4
This sounds like a regression as you suspected.

Do you want me to open a new ticket for this?
And register a new CVE (if you confirm)?

butterflyhack · 2023-04-24T11:30:02Z

Hi, I cannot reproduce the vulnerability. I use Debian 10 with kernel version: Linux runc 4.19.0-23-amd64 #1 SMP Debian 4.19.269-1 (2022-12-20) x86_64 x86_64 x86_64 GNU/Linux, and runc version 1.0.0~rc93+ds1-5+deb11u2. When I run pwn in container-1, the error is "SYS_renameat2: Permission denied" and it cannot change "/poc/layer".

leoluk mentioned this issue Jan 1, 2020

Adding Security audit #2190

Merged

liggitt mentioned this issue Jan 13, 2020

Fix race checking for process exit and waiting for exec fifo #2185

Merged

liggitt mentioned this issue Jan 14, 2020

"OCI runtime start failed" runc race causes many CI failures/flakes kubernetes/kubernetes#86312

Closed

cyphar mentioned this issue Jan 14, 2020

rootfs: do not permit /proc mounts to non-directories #2207

Merged

mrunalp closed this as completed in #2207 Jan 22, 2020

CameronNemo added a commit to CameronNemo/void-packages that referenced this issue Jan 24, 2020

runc: update to 1.0.0rc9.

6762c1d

Fix CVE-2019-19921 See opencontainers/runc#2197

CameronNemo mentioned this issue Jan 24, 2020

runc: update to 1.0.0rc10. void-linux/void-packages#18526

Merged

CameronNemo added a commit to CameronNemo/void-packages that referenced this issue Jan 24, 2020

runc: update to 1.0.0rc10.

90c94ff

Fix CVE-2019-19921 See opencontainers/runc#2197

Hoshpak pushed a commit to void-linux/void-packages that referenced this issue Jan 24, 2020

runc: update to 1.0.0rc10.

1702166

Fix CVE-2019-19921 See opencontainers/runc#2197

This was referenced Jan 24, 2020

Bump to opencontainers/runc new version - v1.0.0-rc10 containerd/containerd#3973

Merged

update runc to v1.0.0-rc10 (CVE-2019-19921) moby/moby#40404

Merged

atweiden pushed a commit to atweiden/voidpkgs that referenced this issue Jan 24, 2020

runc: update to 1.0.0rc10.

4bbe655

Fix CVE-2019-19921 See opencontainers/runc#2197 void-linux/void-packages@1702166

kolyshkin mentioned this issue Jan 5, 2021

tmpfs and symlink resolution #2683

Closed

Beuc mentioned this issue Feb 24, 2023

CVE-2019-19921 re-introduction/regression #3751

Closed

eshafaq1 mentioned this issue Aug 8, 2023

gosu binary Vuln with thirdparty github.com/opencontainers/runc (CVE-2023-27561) tianon/gosu#130

Closed
