[CVE-2019-19921]: Volume mount race condition with shared mounts #2197
Comments
My proposed ("stop the bleeding") patch was something like the following:
commit 81a9af6677b1f87e70b87e9a655cb4f4d06a0503 (HEAD -> fix-double-volume-attack)
Author: Aleksa Sarai <asarai@suse.de>
Date:   Sat Dec 21 23:40:17 2019 +1100

    rootfs: do not permit /proc mounts to non-directories

    mount(2) will blindly follow symlinks, which is a problem because it
    allows a malicious container to trick runc into mounting /proc to an
    entirely different location (and thus within the attacker's control for
    a rename-exchange attack).

    This is just a hotfix, and the more complete fix would be finish
    libpathrs and port runc to it (to avoid these types of attacks entirely,
    and defend against a variety of other /proc-related attacks).

    Fixes: CVE-YYYY-XXXX
    Signed-off-by: Aleksa Sarai <asarai@suse.de>

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index 291021440a1a..6e896bc4fdaa 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -297,17 +297,49 @@ func mountToRootfs(m *configs.Mount, rootfs, mountLabel string, enableCgroupns b
 		dest = filepath.Join(rootfs, dest)
 	}
 
+	// For "special" filesystems, we have to be quite careful about mounting --
+	// we must make sure that the destination is what we expect. This is done
+	// by opening the destination as an O_PATH descriptor, and using the
+	// /proc/self/fd/... as the mount target. Unfortunately this is actually
+	// possible to bypass with a little bit of thought, but the complete
+	// solution for this will be to port runc to libpathrs.
 	switch m.Device {
-	case "proc", "sysfs":
+	case "proc", "sysfs", "mqueue":
+		// NOTE: If the container controls any part of dest, this is unsafe.
 		if err := os.MkdirAll(dest, 0755); err != nil {
 			return err
 		}
+		destFd, err := unix.Open(dest, unix.O_PATH|unix.O_CLOEXEC, 0)
+		if err != nil {
+			return err
+		}
+		defer unix.Close(destFd)
+
+		// Check that the path is exactly what we expect.
+		// NOTE: If the path contains an attacker-controlled bind-mount, this
+		//       check won't do anything. In addition, if procfs is fraudulent,
+		//       it will also be useless. As above, the solution is to switch
+		//       to libpathrs.
+		destFdPath := fmt.Sprintf("/proc/self/fd/%d", destFd)
+		destUnsafePath, err := os.Readlink(destFdPath)
+		if err != nil {
+			return err
+		}
+		if destUnsafePath != dest {
+			return fmt.Errorf("detected possible breakout: trying to mount '%s' on '%s' was actually targeted to '%s'", m.Device, dest, destUnsafePath)
+		}
+
+		// Okay, now we can use destFdPath.
+		dest = destFdPath
+		m.Destination = destFdPath
+	}
+
+	// Now actually do the mount.
+	switch m.Device {
+	case "proc", "sysfs":
 		// Selinux kernels do not support labeling of /proc or /sys
 		return mountPropagate(m, rootfs, "")
 	case "mqueue":
-		if err := os.MkdirAll(dest, 0755); err != nil {
-			return err
-		}
 		if err := mountPropagate(m, rootfs, mountLabel); err != nil {
 			// older kernels do not support labeling of /dev/mqueue
 			if err := mountPropagate(m, rootfs, ""); err != nil {
Unfortunately this is not sufficient if
Your patch does stop the bleeding, though - most runc use cases do not share the rootfs. Mounting a volume on
Alright, I'll prepare a PR. Thanks @leoluk -- and sorry for the response time issues (as well as how the disclosure happened).
Any ETA on the workaround to unblock rc10?
I've been off the face of the earth for the past 2ish weeks. I will prepare a PR tomorrow.
#2207 contains a very simplified version of the above patch (the patch I posted above doesn't work because
Hi, I'm part of the Debian Long Term Support (LTS) team, and I'm attempting to fix CVE-2019-19921 in our past releases that package "runc". I'm still able to reproduce the vulnerability (using the runc reproducer linked in the original topic), in the following situations:
AFAICS the fix does make the exploit less likely, but does not stop it entirely: within a few minutes I'm still able to overwrite my root system's /proc/sys/kernel/core_pattern from container-2. Is this expected (as in, it's a mitigation but not a bullet-proof fix)? Thanks for your attention and best regards.
2fc03cc should completely prevent the exploit. It adds a check to avoid mounting procfs to Either there's a regression or something's wrong with the Debian backport.
The code is definitely included in Debian 1.0.0~rc93: https://salsa.debian.org/go-team/packages/runc/-/blob/debian/1.0.0_rc93+ds1-5+deb11u2/libcontainer/rootfs_linux.go#L314
Thanks for your fast feedback! Debian might have different dependency versions, because it mostly removes vendor/* and uses the packaged versions. Interestingly:
So AFAICS, despite the presence of the fix in all versions, some other commit re-introduced the issue. If you've got further insights I'd be grateful :)
After a bit of digging, ironically it looks like the fix for this vulnerability (CVE-2019-19921) was broken by the one for CVE-2021-30465: 0ca91f4. Do you want me to open a new ticket for this?
Hi, I cannot reproduce the vulnerability. I'm using Debian 10, kernel version: Linux runc 4.19.0-23-amd64 #1 SMP Debian 4.19.269-1 (2022-12-20) x86_64 x86_64 x86_64 GNU/Linux; runc version: 1.0.0~rc93+ds1-5+deb11u2. When I run pwn in container-1, the error is "SYS_renameat2: Permission denied" and it cannot change "/poc/layer".
Disclosed in #2190.
Here's the original report to security@opencontainers.org:
Hi all,
an attacker who controls the container image for two containers that share a volume can race volume mounts during container initialization, by adding a symlink to the rootfs that points to a directory on the volume. The second container won't be able to see the actual mount, but it can race it by modifying the mount point on the volume.
This can be exploited for a full container breakout by racing readonly/mask mounts, allowing writes to dangerous paths like /proc/sys/kernel/core_pattern.
Example:
/proc -> /evil/level1
/evil
/evil/level1 and /evil/level1~
/evil/level1~/level2, but when it remounts /proc/sys, it does so at /evil/level1/level2/sys.
This can reliably be reproduced using runc and podman on Fedora 30 (takes about 0-5s to win the race for me): https://gist.github.com/leoluk/82965ad9df58247202aa0e1878439092
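To make the symlink part of the report concrete, here is a minimal Go sketch showing how a /proc entry in the rootfs that is really a symlink onto a volume sends the mount target outside the rootfs. The function name escapesRootfs and the scratch directory layout are made up for illustration and are not runc code. A check like this is also racy by itself: the other container can swap the target between the check and the mount, which is exactly the window the report describes.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// escapesRootfs (hypothetical helper) reports whether dest, after symlink
// resolution, ends up outside rootfs. The answer is only valid at the moment
// of the call; it does not protect a later mount(2) on the same path.
func escapesRootfs(rootfs, dest string) (bool, error) {
	root, err := filepath.EvalSymlinks(rootfs)
	if err != nil {
		return false, err
	}
	resolved, err := filepath.EvalSymlinks(filepath.Join(rootfs, dest))
	if err != nil {
		return false, err
	}
	rel, err := filepath.Rel(root, resolved)
	if err != nil {
		return false, err
	}
	return rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)), nil
}

// demo builds a scratch rootfs whose /proc is a symlink to a directory on a
// pretend volume, plus an ordinary /sys directory, and checks both.
func demo() (symlinked, plain bool, err error) {
	tmp, err := os.MkdirTemp("", "rootfs-escape")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(tmp)

	rootfs := filepath.Join(tmp, "rootfs")
	volume := filepath.Join(tmp, "evil", "level1")
	if err := os.MkdirAll(filepath.Join(rootfs, "sys"), 0o755); err != nil {
		return false, false, err
	}
	if err := os.MkdirAll(volume, 0o755); err != nil {
		return false, false, err
	}
	if err := os.Symlink(volume, filepath.Join(rootfs, "proc")); err != nil {
		return false, false, err
	}
	if symlinked, err = escapesRootfs(rootfs, "/proc"); err != nil {
		return false, false, err
	}
	plain, err = escapesRootfs(rootfs, "/sys")
	return symlinked, plain, err
}

func main() {
	symlinked, plain, err := demo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("/proc escapes rootfs:", symlinked)
	fmt.Println("/sys escapes rootfs:", plain)
}
```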
SELinux would ordinarily prevent the exploit by disallowing container_t from writing usermodehelper_t, but it can be disabled by symlinking /proc/self/task/1/attr/exec to something benign like /proc/self/sched (bypassing the procfs check). AppArmor can be disabled similarly.
Docker specifies the mounts in a different order and mounts procfs after it mounts the volumes, mounting over the /proc symlink, which appears to prevent at least the /proc approach. I haven't tested other runc usage scenarios, for instance, k8s+cri-o might be vulnerable as well.
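The attr symlink trick above is easier to see with a sketch of what a procfs check typically looks like. The Go snippet below assumes the check is a filesystem-type test via fstatfs(2), which is an assumption about the check the report refers to. The helper name onProcfs is made up, and /proc/self/status is used only because it exists on every Linux system. It shows that a symlink pointing at a different procfs file still passes such a check.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// procSuperMagic is PROC_SUPER_MAGIC from linux/magic.h.
const procSuperMagic = 0x9fa0

// onProcfs (hypothetical helper) opens path, following symlinks, and reports
// whether the opened file lives on a procfs mount.
func onProcfs(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	var st syscall.Statfs_t
	if err := syscall.Fstatfs(int(f.Fd()), &st); err != nil {
		return false, err
	}
	// Statfs_t.Type has a different integer width per architecture.
	return int64(st.Type) == procSuperMagic, nil
}

// demo checks a regular file and a symlink that points into procfs.
func demo() (regular, link bool, err error) {
	tmp, err := os.MkdirTemp("", "procfs-check")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(tmp)

	regularPath := filepath.Join(tmp, "regular")
	if err := os.WriteFile(regularPath, []byte("x"), 0o600); err != nil {
		return false, false, err
	}
	linkPath := filepath.Join(tmp, "exec")
	if err := os.Symlink("/proc/self/status", linkPath); err != nil {
		return false, false, err
	}
	if regular, err = onProcfs(regularPath); err != nil {
		return false, false, err
	}
	link, err = onProcfs(linkPath)
	return regular, link, err
}

func main() {
	regular, link, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("regular file on procfs:", regular)
	fmt.Println("symlink into procfs passes the check:", link)
}
```

The check confirms the filesystem type only, not that the opened file is the intended one, so redirecting the write to another procfs file goes unnoticed.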
Fabian of Cure53 (in CC) created a minimal PoC that uses runc directly: https://gist.github.com/LiveOverflow/c937820b688922eb127fb760ce06dab9
There are other container init steps after the volume mount that can be raced, obvious ones being utils.CloseExecFrom and the AppArmor/SELinux attrs but there might be others, especially in mountToRootfs (like tricking remount into mounting the rootfs as rshared if there's another volume that specifies the flag, but I haven't tried that).
This is similar to the vulnerability I reported that Adam Iwaniuk disclosed during their Dragon Sector CTF (#2128) and a similar crun one (containers/crun#111).
The fix for the mounts is probably what Aleksa outlined here, using /proc/self/fd to resolve the path: containers/crun#111 (comment)
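The /proc/self/fd approach can be sketched in standalone Go as follows. This is only an illustration of the idea, not the patch that was merged: the helper name openVerified is made up, and the O_PATH value is hard-coded as an assumption because the standard syscall package does not export it (real code would use unix.O_PATH from golang.org/x/sys/unix). The descriptor pins whatever the path resolved to, and reading the /proc/self/fd link reveals whether that is the expected location.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// oPath is the Linux O_PATH flag value on common architectures (assumption;
// not exported by the standard syscall package).
const oPath = 0x200000

// openVerified (hypothetical helper) opens dest without reading it and then
// asks procfs which path the descriptor refers to. It returns the
// /proc/self/fd/N path only if that matches dest, meaning no symlink
// redirected the open. The caller must close the returned descriptor.
func openVerified(dest string) (string, int, error) {
	fd, err := syscall.Open(dest, oPath|syscall.O_CLOEXEC, 0)
	if err != nil {
		return "", -1, err
	}
	fdPath := fmt.Sprintf("/proc/self/fd/%d", fd)
	actual, err := os.Readlink(fdPath)
	if err != nil {
		syscall.Close(fd)
		return "", -1, err
	}
	if actual != dest {
		syscall.Close(fd)
		return "", -1, fmt.Errorf("possible breakout: %q resolved to %q", dest, actual)
	}
	return fdPath, fd, nil
}

// demo tries openVerified on a real directory and on a symlink to it.
func demo() (realErr, linkErr, setupErr error) {
	tmp, err := os.MkdirTemp("", "fd-check")
	if err != nil {
		return nil, nil, err
	}
	defer os.RemoveAll(tmp)

	// Resolve the scratch directory first so only our own symlink can differ.
	base, err := filepath.EvalSymlinks(tmp)
	if err != nil {
		return nil, nil, err
	}
	realDir := filepath.Join(base, "real")
	link := filepath.Join(base, "link")
	if err := os.Mkdir(realDir, 0o755); err != nil {
		return nil, nil, err
	}
	if err := os.Symlink(realDir, link); err != nil {
		return nil, nil, err
	}
	if _, fd, err := openVerified(realDir); err != nil {
		realErr = err
	} else {
		syscall.Close(fd)
	}
	if _, fd, err := openVerified(link); err != nil {
		linkErr = err
	} else {
		syscall.Close(fd)
	}
	return realErr, linkErr, nil
}

func main() {
	realErr, linkErr, err := demo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("real directory:", realErr)
	fmt.Println("symlink:", linkErr)
}
```

As the comments in the proposed patch note, a comparison like this does not help when the path crosses an attacker-controlled bind mount or when procfs itself cannot be trusted.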