Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsenter: cloned_binary: "memfd" cleanups #1984

Merged
merged 5 commits into from Mar 7, 2019

Conversation

Projects
None yet
@cyphar
Copy link
Member

cyphar commented Feb 15, 2019

For a variety of reasons, sendfile(2) can end up doing a short-copy so
we need to just loop until we hit the binary size. Since /proc/self/exe
is tautologically our own binary, there's no chance someone is going to
modify it underneath us (or changing the size).

In order to get around the memfd_create(2) requirement, 0a8e411
("nsenter: clone /proc/self/exe to avoid exposing host binary to
container") added an O_TMPFILE fallback. However, this fallback was
flawed in two ways:

  • It required O_TMPFILE which is relatively new (having been added to
    Linux 3.11).

  • The fallback choice was made at compile-time, not runtime. This
    results in several complications when it comes to running binaries
    on different machines to the ones they were built on.

The easiest way to resolve these things is to have fallbacks work in a
more procedural way (though it does make the code unfortunately more
complicated) and to add a new fallback that uses mkotemp(3).

The usage of memfd_create(2) and other copying techniques is quite
wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
memfd_create(2) added ~10M of memory usage to the cgroup associated with
the container, which can result in some setups getting OOM'd (or just
hogging the hosts' memory when you have lots of created-but-not-started
containers sticking around).

The easiest way of solving this is by creating a read-only bind-mount of
the binary, opening that read-only bindmount, and then umounting it to
ensure that the host won't accidentally be re-mounted read-write. This
avoids all copying and cleans up naturally like the other techniques
used. Unfortunately, like the O_TMPFILE fallback, this requires being
able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting
over the most obvious path -- /proc/self/exe -- is a very bad idea).

Unfortunately detecting this isn't fool-proof -- on a system with a
read-only root filesystem (that might become read-write during "runc
init" execution), we cannot tell whether we have already done an ro
remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY
environment variable which is checked alongside the protection being
present.

Fixes #1979
Fixes #1980
Fixes #1988
Signed-off-by: Aleksa Sarai asarai@suse.de

@cyphar

This comment has been minimized.

Copy link
Member Author

cyphar commented Feb 15, 2019

This new setup would be an opportunity to fix #1980, by making the O_TMPFILE be the statedir of runc. But we'd have to allow you to disable memfd_create(2) usage at runtime which is just going to be ugly.

@cyphar cyphar force-pushed the cyphar:memfd-cleanups branch 3 times, most recently from 56babdc to ff3b993 Feb 15, 2019

@cyphar cyphar force-pushed the cyphar:memfd-cleanups branch from ff3b993 to 2dbb8e1 Feb 17, 2019

Show resolved Hide resolved libcontainer/nsenter/cloned_binary.c Outdated
Show resolved Hide resolved libcontainer/nsenter/cloned_binary.c Outdated
Show resolved Hide resolved libcontainer/nsenter/cloned_binary.c
@thaJeztah

This comment has been minimized.

Copy link

thaJeztah commented Feb 18, 2019

Show resolved Hide resolved libcontainer/nsenter/cloned_binary.c Outdated
Show resolved Hide resolved libcontainer/nsenter/cloned_binary.c Outdated
Show resolved Hide resolved libcontainer/nsenter/cloned_binary.c Outdated
@kolyshkin

This comment has been minimized.

Copy link
Contributor

kolyshkin commented Feb 18, 2019

Thanks for writing this, you beat me to it :) Left some minor comments in the code. The only issue remaining is memfd_create() is not used in a scenario when compiled with older includes (i.e. no __NR_memfd_create) and later run under a kernel that supports it. The fix would be to add something like

/* Use our own wrapper for memfd_create. */
#if !defined(__NR_memfd_create)
#  ifdef __i386__
#    define __NR_memfd_create 356
#  elif __x86_64__
#    define __NR_memfd_create 319
#  elif __powerpc64__
#    define __NR_memfd_create 360
#  elif __aarch64__
#    define __NR_memfd_create 279
#  elif __arm__
#    define __NR_memfd_create 385
/* FIXME more arches here */
#  endif
#endif /* !__NR_memfd_create */

but this is borderline ugly, and in the light of #1988 and #1980 maybe it makes sense to just not use memfd_create() at all, only leaving other methods? I'm not proposinig it, just an idea to discuss...

@cyphar

This comment has been minimized.

Copy link
Member Author

cyphar commented Feb 19, 2019

@kolyshkin We don't do that for any other syscall and I really don't like the idea of hard-coding syscall numbers. Yeah, it's part of the Linux ABI and thus won't break, but I'm really not a fan of it.

@brauner

This comment has been minimized.

Copy link
Contributor

brauner commented Feb 19, 2019

@cyphar, did you think about dropping O_TMPFILE? We now have merged the old kernel pieces and now have

https://github.com/lxc/lxc/blob/2d8bd1db234462356c73b12f238a615b2d1cb550/src/lxc/rexec.c#L106

@cyphar cyphar force-pushed the cyphar:memfd-cleanups branch 2 times, most recently from e344a46 to 89c0755 Feb 19, 2019

@cyphar

This comment has been minimized.

Copy link
Member Author

cyphar commented Feb 19, 2019

@brauner I don't have a strong opinion about whether I should keep O_TMPFILE. On the one hand having an extra fallback just means that the mkostemp path won't be tested as well, but on the other hand O_TMPFILE is in principle the "better" way of doing it.

@cyphar

This comment has been minimized.

Copy link
Member Author

cyphar commented Feb 20, 2019

Any strong opinions on this one (in particular, should we keep O_TMPFILE or drop it). Or even drop memfd_create entirely?

/cc @opencontainers/runc-maintainers

@thaJeztah

This comment has been minimized.

Copy link

thaJeztah commented Feb 20, 2019

ping @justincormack as well; PTAL

@mrunalp

This comment has been minimized.

Copy link
Contributor

mrunalp commented Feb 20, 2019

@cyphar I am in favor of retaining O_TMPFILE.

@mrunalp

This comment has been minimized.

Copy link
Contributor

mrunalp commented Feb 20, 2019

I like the approach you have taken in the PR. I am on the fence on whether we should remove memfd_create right away as you provide a way to disable it.

@mrunalp

This comment has been minimized.

Copy link
Contributor

mrunalp commented Feb 20, 2019

I am approving this but will be happy to re-approve even if majority thinks that we should drop memfd_create. Thanks!

@mrunalp

This comment has been minimized.

Copy link
Contributor

mrunalp commented Feb 20, 2019

LGTM

Approved with PullApprove

@kolyshkin

This comment has been minimized.

Copy link
Contributor

kolyshkin commented Feb 21, 2019

TL;DR: it looks like memfd_create() is not properly ported to RHEL7 kernels (including the latest and greatest RHEL 7.6). Tested on CentOS, not real RHEL.

All the gory details

I tested `memfd_create` and friends, using memfd_test.c from the kernel sources.

On CentOS 7.2 kernel (3.10.0-327.el7.x86_64) and it fails right there, giving EINVAL on MFD_ALLOW_SEALING flag.

I tested it on latest CentOS 7.6 kernel (3.10.0-957.5.1.el7.x86_64) and it fails mid-way on SHARE_OPEN test in

https://github.com/torvalds/linux/blob/f6163d67cc31b8f2a946c4df82be3c6dd918412d/tools/testing/selftests/memfd/memfd_test.c#L881

with:

memfd: SHARE-OPEN 
open(/proc/self/fd/3) failed: Text file busy
Aborted

Here's a relevant portion of strace:

write(1, "memfd: SHARE-OPEN \n", 19memfd: SHARE-OPEN 
)    = 19
memfd_create("kern_memfd_share_open", MFD_CLOEXEC|MFD_ALLOW_SEALING) = 3
ftruncate(3, 8192)                      = 0
fcntl(3, F_GET_SEALS)                   = 0
openat(AT_FDCWD, "/proc/self/fd/3", O_RDWR) = 4
fcntl(3, F_GET_SEALS)                   = 0
fcntl(3, F_ADD_SEALS, F_SEAL_WRITE)     = 0
fcntl(3, F_GET_SEALS)                   = 0x8 (seals F_SEAL_WRITE)
fcntl(4, F_GET_SEALS)                   = 0x8 (seals F_SEAL_WRITE)
fcntl(4, F_GET_SEALS)                   = 0x8 (seals F_SEAL_WRITE)
fcntl(4, F_ADD_SEALS, F_SEAL_SHRINK)    = 0
fcntl(3, F_GET_SEALS)                   = 0xa (seals F_SEAL_SHRINK|F_SEAL_WRITE)
fcntl(4, F_GET_SEALS)                   = 0xa (seals F_SEAL_SHRINK|F_SEAL_WRITE)
close(3)                                = 0
openat(AT_FDCWD, "/proc/self/fd/4", O_RDONLY) = 3
fcntl(3, F_GET_SEALS)                   = 0xa (seals F_SEAL_SHRINK|F_SEAL_WRITE)
fcntl(3, F_ADD_SEALS, F_SEAL_SEAL)      = -1 EPERM (Operation not permitted)
fcntl(3, F_GET_SEALS)                   = 0xa (seals F_SEAL_SHRINK|F_SEAL_WRITE)
fcntl(4, F_GET_SEALS)                   = 0xa (seals F_SEAL_SHRINK|F_SEAL_WRITE)
close(4)                                = 0
openat(AT_FDCWD, "/proc/self/fd/3", O_RDWR) = -1 ETXTBSY (Text file busy)
write(1, "open(/proc/self/fd/3) failed: Te"..., 45open(/proc/self/fd/3) failed: Text file busy
) = 45

Note that the above test case was part of initial implementation of memfd_create(). I have retried test with selinux disabled -- same failure.

I didn't look whether this is the way that this runc patch uses memfd_create (probably not).

All this gives me an impression that memfd_create() and friends are in bad shape even in latest RHEL7 kernels.

@kolyshkin

This comment has been minimized.

Copy link
Contributor

kolyshkin commented Feb 21, 2019

Now, for my great surprise, I ran the same test binary on Debian Jessie (which is 3.16 but have memfd stuff backported), and it passes with flying colors:

root@kir-deb-jessie:~# cat /etc/debian_version 
8.10
root@kir-deb-jessie:~# uname -a
Linux kir-deb-jessie 3.16.0-5-amd64 #1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08) x86_64 GNU/Linux
root@kir-deb-jessie:~# ./memfd_test 
memfd: CREATE
memfd: BASIC
memfd: SEAL-WRITE
memfd: SEAL-SHRINK
memfd: SEAL-GROW
memfd: SEAL-RESIZE
memfd: SHARE-DUP 
memfd: SHARE-MMAP 
memfd: SHARE-OPEN 
memfd: SHARE-FORK 
memfd: SHARE-DUP (shared file-table)
memfd: SHARE-MMAP (shared file-table)
memfd: SHARE-OPEN (shared file-table)
memfd: SHARE-FORK (shared file-table)
memfd: DONE
@kolyshkin

This comment has been minimized.

Copy link
Contributor

kolyshkin commented Feb 21, 2019

@cyphar cyphar force-pushed the cyphar:memfd-cleanups branch from f1e42e5 to 2d4a37b Mar 1, 2019

@Ace-Tang

This comment has been minimized.

Copy link
Contributor

Ace-Tang commented Mar 1, 2019

Sloved !

I don't understand why it would fail on `4.9.93` -- `O_TMPFILE` was added in 3.11 but that's probably due to an old `glibc`

so which gcc version it should be larger than ? 😕

@cyphar

This comment has been minimized.

Copy link
Member Author

cyphar commented Mar 1, 2019

It's not gcc, it's glibc. I'm really not sure, you would need to go into the git history. Not to mention that CentOS will have backports and so even then the version number would be useless.

@cyphar

This comment has been minimized.

Copy link
Member Author

cyphar commented Mar 4, 2019

@@ -88,28 +91,61 @@ static void *must_realloc(void *ptr, size_t size)
static int is_self_cloned(void)

This comment has been minimized.

@lifubang

lifubang Mar 4, 2019

Contributor

How about just use CLONED_BINARY_ENV to avoid potential loop that we have not found yet?
I don't know whether this is more safety to avoid loop or not!

For example:

static int is_self_cloned(void) {
    return getenv(CLONED_BINARY_ENV) && strcmp(getenv(CLONED_BINARY_ENV), "1")==0;
}

This comment has been minimized.

@lifubang

lifubang Mar 4, 2019

Contributor

I think there is just one disadvantage, even if we have protected runc binary file by administrator, we will call ensure_clone_binary first.

This comment has been minimized.

@cyphar

cyphar Mar 5, 2019

Author Member

Because I don't want to be in a situation where someone passes CLONED_BINARY_ENV to runc and that results in an unsafe setup.

In fact, the only reason why CLONED_BINARY_ENV exists in this PR is to stop us being tricked by potentially-unsafe setups -- the case where /usr/bin/runc is on a read-only mount (or is an unlinked file) and we think that makes the binary safe -- but in those cases it is not safe (and we need to apply the protection).

I don't want to provide anyone the ability to opt out of this protection -- if you want to, you can bind-mount a sealed memfd over /usr/bin/runc if you want to "opt out" (as in, set up the protection as an administrator). There isn't really another protection an administrator can add -- there is chattr +i but as discussed whether that's safe is determined by what capabilities you give your containers.

I don't know whether this is more safety to avoid loop or not!

It would avoid more loops in theory (but the only loop issue we've had is a broken RHEL kernel -- and the current code takes care of that too). I think worrying about loops here doesn't make too much sense.

This comment has been minimized.

@lifubang

lifubang Mar 6, 2019

Contributor

For this topic, I have no other opinions.

@Ace-Tang

This comment has been minimized.

Copy link
Contributor

Ace-Tang commented Mar 7, 2019

@crosbymichael @cyphar , any update on this patch, can this be merged ?

@crosbymichael

This comment has been minimized.

Copy link
Member

crosbymichael commented Mar 7, 2019

LGTM

Approved with PullApprove

1 similar comment
@mrunalp

This comment has been minimized.

Copy link
Contributor

mrunalp commented Mar 7, 2019

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit 2b18fe1 into opencontainers:master Mar 7, 2019

2 checks passed

code-review/pullapprove Approved by crosbymichael, mrunalp
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

thaJeztah added a commit to thaJeztah/containerd that referenced this pull request Mar 7, 2019

update runc to 2b18fe1d885ee5083ef9f0838fee39b62d653e30
This includes an improved fix for CVE-2019-5736 to reduce the
increased memory-consumption introduced by the original patch,
RHEL 7.6 getting into a loop due to a kernel bug in those kernels,
and improve compatibility with older kernels.

changes included:

- opencontainers/runc#1973 Vendor opencontainers/runtime-spec 29686dbc
- opencontainers/runc#1978 Remove detection for scope properties, which have always been broken
- opencontainers/runc#1963 Vendor in go-criu and use it for CRIU's RPC definition
- opencontainers/runc#1995 exec: expose --preserve-fds
- opencontainers/runc#2000 fix preserve-fds flag may cause runc hang
- opencontainers/runc#1968 Create bind mount mountpoints during restore
- opencontainers/runc#1984 nsenter: cloned_binary: "memfd" cleanups

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>

thaJeztah added a commit to thaJeztah/containerd that referenced this pull request Mar 7, 2019

update runc to 2b18fe1d885ee5083ef9f0838fee39b62d653e30
This includes an improved fix for CVE-2019-5736 to reduce the
increased memory-consumption introduced by the original patch,
RHEL 7.6 getting into a loop due to a kernel bug in those kernels,
and improve compatibility with older kernels.

changes included:

- opencontainers/runc#1973 Vendor opencontainers/runtime-spec 29686dbc
- opencontainers/runc#1978 Remove detection for scope properties, which have always been broken
- opencontainers/runc#1963 Vendor in go-criu and use it for CRIU's RPC definition
- opencontainers/runc#1995 exec: expose --preserve-fds
- opencontainers/runc#2000 fix preserve-fds flag may cause runc hang
- opencontainers/runc#1968 Create bind mount mountpoints during restore
- opencontainers/runc#1984 nsenter: cloned_binary: "memfd" cleanups

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit b8d40b3)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>

thaJeztah added a commit to thaJeztah/containerd that referenced this pull request Mar 7, 2019

update runc to 2b18fe1d885ee5083ef9f0838fee39b62d653e30
This includes an improved fix for CVE-2019-5736 to reduce the
increased memory-consumption introduced by the original patch,
RHEL 7.6 getting into a loop due to a kernel bug in those kernels,
and improve compatibility with older kernels.

changes included:

- opencontainers/runc#1973 Vendor opencontainers/runtime-spec 29686dbc
- opencontainers/runc#1978 Remove detection for scope properties, which have always been broken
- opencontainers/runc#1963 Vendor in go-criu and use it for CRIU's RPC definition
- opencontainers/runc#1995 exec: expose --preserve-fds
- opencontainers/runc#2000 fix preserve-fds flag may cause runc hang
- opencontainers/runc#1968 Create bind mount mountpoints during restore
- opencontainers/runc#1984 nsenter: cloned_binary: "memfd" cleanups

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit b8d40b3)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>

@cyphar cyphar deleted the cyphar:memfd-cleanups branch Mar 8, 2019

@cyphar

This comment has been minimized.

Copy link
Member Author

cyphar commented Mar 8, 2019

Ah dammit, this wasn't meant to be merged -- the bind-mount approach actually isn't fool-proof (if you can do a mount, you can get around it). However, the default profiles actually disallow mounting so this isn't the end of the world. I'll submit a follow-up PR that uses overlayfs rather than bind-mounts.

@Ace-Tang

This comment has been minimized.

Copy link
Contributor

Ace-Tang commented Mar 8, 2019

Bad news... we plan to appy this patch to our production environment, looks like still need to wait

@AkihiroSuda

This comment has been minimized.

Copy link
Contributor

AkihiroSuda commented Mar 8, 2019

I'll submit a follow-up PR that uses overlayfs rather than bind-mounts.

It won't work in userns?

@cyphar

This comment has been minimized.

Copy link
Member Author

cyphar commented Mar 8, 2019

It won't work in userns?

No, but that's why we have fallbacks. I've submitted #2006.

However I've come to the conclusion that running a privileged container with CAP_SYS_ADMIN is simply an insecure setup, so maybe it's better if we just leave it in this state -- where arguably it's not secure with CAP_SYS_ADMIN (but there's almost certainly other issues with running CAP_SYS_ADMIN). CAP_SYS_ADMIN allows you to disable seccomp protections, do privileged device ioctls, among other things.

{
switch (fdtype) {
case EFD_MEMFD:
return fcntl(*fd, F_ADD_SEALS, RUNC_MEMFD_SEALS);

This comment has been minimized.

@lifubang

lifubang Mar 16, 2019

Contributor

How about modify fallback like try_bindfd? For example:
try_memfd()
try_otmpfile()
try_oldschooltemp()

Is there a fail situation like this: memfd_create success, but F_ADD_SEALS fail?

This comment has been minimized.

@cyphar

cyphar Mar 16, 2019

Author Member

If you want to submit a patch to do that, sure.

However, I spent several weeks working on it or answering questions about it, and given painfully trivial the fix for the CVE is ("make a copy and execute that"), working on it for any more time feels like a waste of time to me.

If F_ADD_SEALS fails (with memfd_create and MFD_ALLOW_SEALING working) then the kernel is broken -- and while that can happen with some distributions I'd prefer to actually see it be an existing problem before trying to fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.