
[CVE-2019-5736]: Runc uses more memory during start up after the fix #1980

Closed
Random-Liu opened this issue Feb 13, 2019 · 35 comments · Fixed by #1984

Comments

@Random-Liu (Author) commented Feb 13, 2019

We observed higher memory usage (likely during container startup) after the CVE fix (0a8e411).

We have a test that specifies a 10M container memory cgroup limit. It never failed before, but now the container gets OOM-killed frequently. For example: https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-containerd-node-e2e-1-2/2500.

kernel: runc:[2:INIT] invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=998
kernel: runc:[2:INIT] cpuset=80e651c417ebd71d83e5023ee59b281e585497468bd71ee7c7b3ae6730d9ec8f mems_allowed=0
kernel: CPU: 0 PID: 333 Comm: runc:[2:INIT] Not tainted 4.4.64+ #1
kernel: Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
kernel:  0000000000000000 ffff880003e87ca8 ffffffff9f317334 ffff880003e87d88
kernel:  ffff8800bb3e8000 ffff880003e87d18 ffffffff9f1a8fb4 ffff880003e87ce0
kernel:  ffffffff9f13e780 ffff8800bb3eb500 0000000000000206 ffff880003e87cf0
kernel: Call Trace:
kernel:  [<ffffffff9f317334>] dump_stack+0x63/0x8f
kernel:  [<ffffffff9f1a8fb4>] dump_header+0x65/0x1d4
kernel:  [<ffffffff9f13e780>] ? find_lock_task_mm+0x20/0xb0
kernel:  [<ffffffff9f13ef1d>] oom_kill_process+0x28d/0x430
kernel:  [<ffffffff9f1a3e6b>] ? mem_cgroup_iter+0x1db/0x390
kernel:  [<ffffffff9f1a6374>] mem_cgroup_out_of_memory+0x284/0x2d0
kernel:  [<ffffffff9f1a6de9>] mem_cgroup_oom_synchronize+0x2f9/0x310
kernel:  [<ffffffff9f1a1ab0>] ? memory_high_write+0xc0/0xc0
kernel:  [<ffffffff9f13f5f8>] pagefault_out_of_memory+0x38/0xa0
kernel:  [<ffffffff9f045a27>] mm_fault_error+0x77/0x150
kernel:  [<ffffffff9f046264>] __do_page_fault+0x414/0x420
kernel:  [<ffffffff9f046292>] do_page_fault+0x22/0x30
kernel:  [<ffffffff9f5b1f98>] page_fault+0x28/0x30

This seems to be caused by the memory spike introduced by the binary copy. Should we always enforce a minimum memory limit for runc containers in the future?

My runc binary is statically linked:

$ ls -alh usr/local/sbin/runc 
-rwxr-xr-x 1 lantaol primarygroup 7.8M Feb 12 13:46 usr/local/sbin/runc
@cyphar (Member) commented Feb 13, 2019

Should we always enforce a minimum memory limit for runc containers in the future?

Probably. This is something that has always been quite difficult to do sanely -- because all container limits are applied while runc is still executing. While we do get out of the way eventually, the only safe way of constraining the user process is to limit ourselves first -- so knowing what the minimum limit should be is quite difficult (very low pids limits are hard to set, for instance). You could try to do a runc update afterwards...

But yes this would definitely be caused by the copying procedure. One idea I had was to create a temporary overlayfs such that the binary would not be overwritable -- but that has a lot of other issues that made it implausible.

@Ace-Tang (Contributor) commented Feb 13, 2019

We hit the memory problem too. My runc binary is 11M; one of our tests sets a 10M memory limit, and runc create fails with this error:

\"process_linux.go:424: container init caused \\\"process_linux.go:390: setting cgroup config for procHooks process caused \\\\\\\"failed to write 10485760 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/default/96cd18b0b87ba67c64b6fe535607c29af30119229328add6658404239a479905/memory.limit_in_bytes: device or resource busy\\\\\\\"\\\"\"
@Random-Liu (Author) commented Feb 13, 2019

This is a regression. I gathered some data with Docker.

Test Environment

  • Machine type: GCE n1-standard-1 (1 vCPU, 3.75 GB memory)
  • Docker version:
$ docker version
Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.11.2
 Git commit:        4d60db4
 Built:             Wed Jan 23 19:35:04 2019
 OS/Arch:           linux/amd64
 Experimental:      false
Server:
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.2
  Git commit:       4d60db4
  Built:            Wed Jan 23 19:34:06 2019
  OS/Arch:          linux/amd64
  Experimental:     false
  • runc binaries:
$ ls -alh
total 37M
drwxr-xr-x 2 lantaol lantaol 4.0K Feb 13 10:12 .
drwxr-xr-x 4 lantaol lantaol 4.0K Feb 13 10:11 ..
-rwxr-xr-x 1 lantaol lantaol  11M Feb 13 10:11 runc-dynamic-12f6a991
-rwxr-xr-x 1 lantaol lantaol  11M Feb 13 10:10 runc-dynamic-6635b4f0
-rwxr-xr-x 1 lantaol lantaol 7.7M Feb 13 10:05 runc-static-12f6a991
-rwxr-xr-x 1 lantaol lantaol 7.8M Feb 13 09:57 runc-static-6635b4f0

Dynamically linked binaries are built with make BUILDTAGS="seccomp apparmor".
Statically linked binaries are built with make static BUILDTAGS="seccomp apparmor".
12f6a99 is the last runc version we use in containerd.
6635b4f is the new runc version with the CVE fix.

  • Golang version:
$ go version
go version go1.11.2 linux/amd64

Test Result

  • runc-dynamic-12f6a991:
# Memory limit too low, can't even create the container.
$ docker run -m=4m busybox ls
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 4194304 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/5031f5b2cf99b84da41e24836524fb4ae736d6bc4886ce2e0e75c1f43820803f/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.

# Memory limit is high enough to create the container, but the init process gets OOM killed right away.
$ docker run -m=5m busybox ls
docker: Error response from daemon: cannot start a stopped process: unknown.

# Memory limit is enough to run the container.
$ docker run -m=6m busybox ls
# ok
  • runc-dynamic-6635b4f0:
$ docker run -m=15m busybox ls
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 15938355 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/6650dbb7f2a8fc8ae58b3ce3d365be8c716321d253d8d29f140e1d45e0dfa818/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.

$ docker run -m=15.5m busybox ls
docker: Error response from daemon: cannot start a stopped process: unknown.

$ docker run -m=16m busybox ls
# ok
  • runc-static-12f6a991:
$ docker run -m=4m busybox ls
# ok
  • runc-static-6635b4f0:
$ docker run -m=9.2m busybox ls
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 9646899 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/bf1ec70dc29ee0a823ba6aebf2a88633cd875e08715f12651451073e86437fc3/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.

$ docker run -m=9.3m busybox ls
docker: Error response from daemon: cannot start a stopped process: unknown.

$ docker run -m=10m busybox ls
# ok

Conclusion

  • We need to set a higher memory limit for the container to run, and the minimum limit is larger than the runc binary size.
  • Before 6635b4f, the minimum limit was not that high, and was much lower than the runc binary size.
  • This is a regression for users: their existing workloads may not run without tweaking memory limits.

@crosbymichael (Member) commented Feb 13, 2019

A good long-term fix is to move all the runc init code to C. This should be fairly simple, as most of it is system-level syscalls. cgroups can remain implemented in Go, since they are set by the calling process.

@giuseppe (Contributor) commented Feb 13, 2019

A good long term fix is to move all the runc init code to C. This should be fairly simple as most of it is system level syscalls. cgroups can remain implemented in Go as it is set by the calling process.

would that be a separate binary?

@crosbymichael (Member) commented Feb 13, 2019

@giuseppe I was thinking the same binary; we would just never allow it to exec into the Go runtime. Maybe that is not possible and we would have the same issue -- it's something we would have to look into.

I know you have a C implementation, it would be interesting to see where we can combine the two as there are still some areas that are easier to write and maintain in Go and others where C makes more sense, like the init.

@giuseppe (Contributor) commented Feb 13, 2019

I know you have a C implementation, it would be interesting to see where we can combine the two as there are still some areas that are easier to write and maintain in Go and others where C makes more sense, like the init.

that would be great. If there is anything I can do to help out, just let me know :-)

@keloyang (Contributor) commented Feb 14, 2019

If we restrict the init process in the container and don't allow it to exec /proc/self/exe, can CVE-2019-5736 be blocked?

@cyphar (Member) commented Feb 14, 2019

@keloyang The problem is that you cannot safely verify (in userspace) whether or not you are going to execute /proc/self/exe -- and because of #!/proc/self/exe even O_PATH and execveat(AT_EMPTY_PATH) won't save you.

deiwin added a commit to salemove/pipeline-lib that referenced this issue Feb 14, 2019
A new version of runc [doesn't work well][1] with very small memory limits.

[1]: opencontainers/runc#1980 (comment)
@Random-Liu (Author) commented Feb 14, 2019

This seems to be affecting many people.

I heard that if we put runc on a read-only filesystem (mount -o remount,ro), we won't be affected by the CVE.

In this case, is it possible to opt out of the fix?

@cyphar (Member) commented Feb 14, 2019

Yes, it would be possible to add the ability to opt out of the fix -- but I'm worried about what happens if someone decides to remount the filesystem as read-write. There's no way for us to deal with that (luckily runc's design means there are no long-running processes, but what if the remount happens while runc is doing an operation -- then we have a race where you can attack the binary).

I am currently working on a patch which will expand the O_TMPFILE fallback (which doesn't use memory -- though it does create a file on tmpfs) -- and we can discuss making memfd_create opt-outable such that the O_TMPFILE fallback will be used. But I'm absolutely against making it easy to entirely disable the fix -- that seems like a really bad idea.

@Random-Liu (Author) commented Feb 14, 2019

@cyphar tmpfs memory usage is also charged to the memory cgroup, right? Why is it better than memfd_create? Just for my education.

Actually, I tried hard-coding runc to completely skip the memfd_create code path, but got the same result:

$ ls -alh binary/runc-static-patched-6635b4f0 
-rwxr-xr-x 1 lantaol lantaol 7.8M Feb 14 22:58 binary/runc-static-patched-6635b4f0

$ docker run -m=9.2m busybox ls
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 9646899 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/f44f29ca39e754892e73d15c03d3ff12b46388fb13ea4c01100835b78f7afa75/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.

$ docker run -m=9.3m busybox ls
docker: Error response from daemon: cannot start a stopped process: unknown.

$ docker run -m=10m busybox ls
ok

Diff:

index c8a42c23..1817bc72 100644
--- a/libcontainer/nsenter/cloned_binary.c
+++ b/libcontainer/nsenter/cloned_binary.c
@@ -32,23 +32,6 @@
 #include <sys/sendfile.h>
 #include <sys/syscall.h>
 
-/* Use our own wrapper for memfd_create. */
-#if !defined(SYS_memfd_create) && defined(__NR_memfd_create)
-#  define SYS_memfd_create __NR_memfd_create
-#endif
-#ifdef SYS_memfd_create
-#  define HAVE_MEMFD_CREATE
-/* memfd_create(2) flags -- copied from <linux/memfd.h>. */
-#  ifndef MFD_CLOEXEC
-#    define MFD_CLOEXEC       0x0001U
-#    define MFD_ALLOW_SEALING 0x0002U
-#  endif
-int memfd_create(const char *name, unsigned int flags)
-{
-	return syscall(SYS_memfd_create, name, flags);
-}
-#endif
-
 /* This comes directly from <linux/fcntl.h>. */
 #ifndef F_LINUX_SPECIFIC_BASE
 #  define F_LINUX_SPECIFIC_BASE 1024
@@ -65,11 +48,6 @@ int memfd_create(const char *name, unsigned int flags)
 #endif
 
 #define RUNC_SENDFILE_MAX 0x7FFFF000 /* sendfile(2) is limited to 2GB. */
-#ifdef HAVE_MEMFD_CREATE
-#  define RUNC_MEMFD_COMMENT "runc_cloned:/proc/self/exe"
-#  define RUNC_MEMFD_SEALS \
-	(F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE)
-#endif
 
 static void *must_realloc(void *ptr, size_t size)
 {
@@ -93,15 +71,10 @@ static int is_self_cloned(void)
 	if (fd < 0)
 		return -ENOTRECOVERABLE;
 
-#ifdef HAVE_MEMFD_CREATE
-	ret = fcntl(fd, F_GET_SEALS);
-	is_cloned = (ret == RUNC_MEMFD_SEALS);
-#else
 	struct stat statbuf = {0};
 	ret = fstat(fd, &statbuf);
 	if (ret >= 0)
 		is_cloned = (statbuf.st_nlink == 0);
-#endif
 	close(fd);
 	return is_cloned;
 }
@@ -203,11 +176,7 @@ static int clone_binary(void)
 	int binfd, memfd;
 	ssize_t sent = 0;
 
-#ifdef HAVE_MEMFD_CREATE
-	memfd = memfd_create(RUNC_MEMFD_COMMENT, MFD_CLOEXEC | MFD_ALLOW_SEALING);
-#else
 	memfd = open("/tmp", O_TMPFILE | O_EXCL | O_RDWR | O_CLOEXEC, 0711);
-#endif
 	if (memfd < 0)
 		return -ENOTRECOVERABLE;
 
@@ -220,11 +189,6 @@ static int clone_binary(void)
 	if (sent < 0)
 		goto error;
 
-#ifdef HAVE_MEMFD_CREATE
-	int err = fcntl(memfd, F_ADD_SEALS, RUNC_MEMFD_SEALS);
-	if (err < 0)
-		goto error;
-#else
 	/* Need to re-open "memfd" as read-only to avoid execve(2) giving -EXTBUSY. */
 	int newfd;
 	char *fdpath = NULL;
@@ -238,7 +202,6 @@ static int clone_binary(void)
 
 	close(memfd);
 	memfd = newfd;
-#endif
 	return memfd;
 
 error:
@cyphar (Member) commented Feb 15, 2019

tmpfs memory usage is also charged to the memory cgroup, right? Why is it better than memfd_create. Just for my education.

Yes, you're right -- I incorrectly assumed it's charged to kmem not mem. In that case it's not any better -- though it does work on older kernels (which is the primary reason why I'm working on improving the fallbacks).

There is a way to do it without using extra memory (and I proposed it internally when we were discussing solutions for this vulnerability), but it has the downside that it can't be done with rootless containers and is a bit ugly. We create a temporary overlayfs mount for the runc binary (with a tmpfs as the upper layer) and then use the merged version as /proc/self/exe. This allows us to avoid any copying -- and any attempt to overwrite the binary will just modify the upperdir (which would be unique per-container). It also gives us page-cache sharing for the runc binary.

The main downside is that we now would require overlayfs support, and in the case of rootless containers we'd need to make a copy anyway. Not to mention we'd have to have some pretty ugly code to get it all to work -- since we need to set up the mount namespace before we've started doing any operations with the containers' namespaces (otherwise we're poisoning the host mount namespace).

But as @crosbymichael said, if we separate out runc init and make it all C code then the binary size would be a much smaller problem (no pun intended).

@cyphar (Member) commented Feb 15, 2019

Another idea would be to use O_TMPFILE on the runc state directory (which runc needs to have write access to) and thus we'd be able to avoid charging memcg when using that. The main problem is that we aren't able to get ENOMEM with memfd_create -- because the only way we grow the size of the memfd is by writing to it (and thus we get hit with an OOM).

deiwin added a commit to salemove/pipeline-lib that referenced this issue Feb 15, 2019
A new version of runc [doesn't work well][1] with very small memory limits.

[1]: opencontainers/runc#1980 (comment)
@Random-Liu (Author) commented Feb 15, 2019

It would still be better to find a way to eliminate this. :/

If we really can't find a better solution, we should at least broadly advertise this, so that users know that they should increase their memory limit. :) In the new GKE release, we are going to mention that because our ubuntu image is going to carry the runc fix.

We may need a better channel to advertise that, e.g. runc release note, tweet?

@cgwalters (Contributor) commented Feb 15, 2019

@cyphar I was arguing for a while to have a way to make regular files content-immutable.

@cgwalters (Contributor) commented Feb 16, 2019

Or another approach, maybe: have runc fork off a once-per-uid instance of itself (lockfile in /run?) that holds the binary open. Or require distributors to ship a systemd unit that does that -- add a runc sleep subcommand and ExecStart=/usr/bin/runc sleep?

@cyphar (Member) commented Feb 17, 2019

While that would somewhat solve the problem, I really don't like that solution (and I'm worried about what happens if the service is restarted -- how sure are you that the attacking process won't get a chance to overwrite the host binary before it's killed?).

Personally I think #1984 could be changed so that instead of /tmp we use the runc state directory and add an environment variable you can use to disable the use of memfd_create. I think that would be a much simpler solution that doesn't require removing the actual protection -- because while /proc/self/exe re-opening can be blocked by just having the process stick around I really don't trust passing a struct file to a host binary to a container.

@lifubang (Contributor) commented Feb 18, 2019

How about copying the runc binary from /usr/bin to /tmp, under a new random file name, before we run runc init, and using that new binary to exec runc init? Because the copy does not happen inside runc init, its memory usage will not be charged to the container. After the runc init process has started, we delete the temporary runc binary in /tmp.
The only problem is: even though the temporary runc binary in /tmp is deleted, the link /proc/self/exe can still be exec'd via syscall.Exec when starting the container's init command, so CVE-2019-5736 can still change the content of the temporary runc binary in /tmp. If an administrator then runs one of these temporary binaries in /tmp just for fun, they may run a virus. But I think it may not affect docker and k8s.

I have tested this on my server, based on v1.0.0-rc5. The memory usage is the same as before.

root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers/ubuntu# docker run --rm -m 4m busybox ls
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
bin
dev
etc
home
proc
root
sys
tmp
usr
var

But I'm not sure it works for fixing CVE-2019-5736.
If it makes sense, I'll send a PR.

@cyphar (Member) commented Feb 18, 2019

@lifubang

How about copy runc binary file from /usr/bin to /tmp with a new random file name before we run runc init?

That is precisely what #1984 does -- and the current "no memfd_create(2)" fallback does as well. It gets around the problem you mention by unlinking it before using it, so that you can't accidentally use the binary (since the only reference to it is the file descriptor).

@lifubang (Contributor) commented Feb 18, 2019

I think there is a small difference:
If I understand correctly, in #1984 the copy happens inside runc init, so it counts against the container's memory.
I think we can move the copy to before we call runc init.

With the patch #1984, and _LIBCONTAINER_DISABLE_MEMFD_CLONE=1 set, we sometimes get these errors:

root@iZ2ze1o61blvco5p5ducnnZ:/opt# docker run --rm -m 4m busybox ls
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:403: container init caused \"process_linux.go:368: setting cgroup config for procHooks process caused \\\"failed to write 4194304 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/24e534c572e33aca2d7812c217776b740c3970de7214997c972b8f63821cb61e/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled
@cyphar (Member) commented Feb 18, 2019

I'm not a fan of moving it before runc init -- we tried this during the embargo (and the first version of my patch did it that way) and it simply didn't work properly. You can do it as part of the constructor for all runc calls but then things like runc list become really expensive.

It should be noted that quite a lot of distributions don't actually use tmpfs for /tmp (openSUSE and Ubuntu both use the host filesystem), and on those distributions this isn't a huge problem because the copy won't be accounted to the memory cgroup (aside from some minor blkio usage).

I am working on an update to #1984 which will change it so that it uses c.root (the runc state directory) as the O_TMPFILE target -- which means that even on distributions that use /tmp as tmpfs you won't have this issue.

Another option would be to make cgroup attachment happen later (which we can do now because of the cgroup namespace code) -- in fact I could've sworn that cgroup attachment happens later and so this shouldn't be a problem (weird -- I will look into that).

@lifubang (Contributor) commented Feb 18, 2019

You can do it as part of the constructor for all runc calls but then things like runc list become really expensive.

I just do it before running the runc init command, not in every command. Please see #1987 -- I don't know how to explain my solution well in words, so please see the code.

@Random-Liu (Author) commented Feb 22, 2019

I am working on an update to #1984 which will change it so that it uses c.root (the runc state directory) as the O_TMPFILE target -- which means that even on distributions that use /tmp as tmpfs you won't have this issue.

@cyphar SG!

And actually we don't see this problem on Ubuntu yet -- maybe just because, as you said, /tmp is not a tmpfs on our Ubuntu.

@lifubang (Contributor) commented Feb 22, 2019

With the latest patch #1984 (dynamic build), it works fine most of the time, but there is roughly a 1-in-10 to 1-in-30 chance of getting OOM-killed.

Build:

root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers# git clone https://github.com/opencontainers/runc
Cloning into 'runc'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 19197 (delta 1), reused 1 (delta 0), pack-reused 19193
Receiving objects: 100% (19197/19197), 7.15 MiB | 1.39 MiB/s, done.
Resolving deltas: 100% (12414/12414), done.
Checking connectivity... done.
root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers# cd runc
root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers/runc# git fetch origin pull/1984/head:pr-1984
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (9/9), done.
Unpacking objects: 100% (22/22), done.
remote: Total 22 (delta 14), reused 13 (delta 10), pack-reused 3
From https://github.com/opencontainers/runc
 * [new ref]          refs/pull/1984/head -> pr-1984
root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers/runc# git checkout pr-1984
Switched to branch 'pr-1984'
root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers/runc# make  BUILDTAGS="apparmor selinux seccomp"
go build -buildmode=pie  -ldflags "-X main.gitCommit="89c07553177c66cfe07de11fcdc17b59621ead07" -X main.version=1.0.0-rc6+dev " -tags "apparmor selinux seccomp" -o runc .
root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers/runc# cp ./runc /usr/bin/docker-runc

With -m 4m (fails roughly 1 time in 10):

root@iZ2ze1o61blvco5p5ducnnZ:/opt# time docker run --rm -m 4m busybox ls
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
bin
...

real	0m1.026s
user	0m0.080s
sys	0m0.008s
root@iZ2ze1o61blvco5p5ducnnZ:/opt# time docker run --rm -m 4m busybox ls
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 4194304 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/a3f6c2cbe4b04f03f51526c8654c9b7ff7b671dc1e0946920ddbada0f2ff19e5/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled 

real	0m0.741s
user	0m0.064s
sys	0m0.024s

With -m 6m (fails roughly 1 time in 30):

root@iZ2ze1o61blvco5p5ducnnZ:/opt# time docker run --rm -m 6m busybox ls
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
bin
...

real	0m1.083s
user	0m0.076s
sys	0m0.012s
root@iZ2ze1o61blvco5p5ducnnZ:/opt# time docker run --rm -m 6m busybox ls
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
docker: Error response from daemon: cannot start a stopped process: unknown.
ERRO[0001] error waiting for container: context canceled 

real	0m1.050s
user	0m0.080s
sys	0m0.008s
@cyphar (Member) commented Feb 22, 2019

@lifubang By default it's still using memfd_create(2) -- you would need to pass _LIBCONTAINER_DISABLE_MEMFD_CLONE=1 as an environment variable to stop it. However, I am thinking of just fixing when cgroup.Apply() occurs so that this doesn't happen anymore. I don't like having to pass an environment variable to disable something like that.

@lifubang (Contributor) commented Feb 22, 2019

Thank you for your work.
_LIBCONTAINER_DISABLE_MEMFD_CLONE=1 didn't work at the time. Please see:
#1984 (comment)

@justinsb commented Feb 22, 2019

Is the chattr +i /sbin/runc approach equivalent to the copying (e.g. https://seclists.org/oss-sec/2019/q1/134)? I would imagine that approach would not have any significant memory overhead.

One approach could be to verify that the immutable attribute was set on the runc binary (and skip the copy if so, or warn/error if not?)

@cyphar (Member) commented Feb 22, 2019

chattr +i is fine so long as you haven't given CAP_LINUX_IMMUTABLE to the container -- but I'm not really a fan of having to figure out whether that's true as part of the mitigation; the code is already complicated enough.

The easier solution is hopefully going to be to evaluate whether we can delay cgroup application until after the copy -- which should be possible and would be much more fool-proof than that.

@Random-Liu (Author) commented Feb 26, 2019

The easier solution is hopefully going to be to evaluate whether we can delay cgroup application until after the copy -- which should be possible and would be much more fool-proof than that.

@cyphar Are you working on this?

@cyphar (Member) commented Feb 26, 2019

I just pushed an update to #1984 -- it uses a read-only bind-mount. It works, and adds nothing to memory usage. For rootless containers it currently won't work, but that's a fairly niche use case for now.

So this problem has been solved -- with the caveat that runc must be run with sufficient privileges to create a bind-mount (which is the case for Kubernetes and Docker, even if you have user namespaces enabled).

@rhatdan (Contributor) commented Feb 26, 2019

Rootless containers don't need the fix anyway, since a process running as non-root cannot overwrite a file owned by root.

I guess if you were running a runc in your homedir, this could be a problem, but I know of no one who does this.

@cyphar (Member) commented Feb 26, 2019

I guess if you were running a runc in your homedir, this could be a problem, but I know of no one who does this.

The whole purpose of (and my initial justification for working on) rootless containers is precisely to be able to do that, without requiring an admin to install binaries for you (if they can install binaries for you, they can install a setuid binary for you too -- defeating the purpose). So the fix is needed regardless. The usernetes distribution works like this, and I hope it will eventually gain wider usage, but I definitely want to protect rootless containers.

(These days, "rootless containers" has become a misnomer -- the original idea was that there was no administrator intervention required -- and that's still possible today but it really doesn't help when the concept isn't agreed upon and the default actually uses setuid binaries contrary to the whole point.)

@giuseppe (Contributor) commented Feb 26, 2019

For the rootless implementations in buildah and podman: we create the user namespace before we call into the OCI runtime, so the remount to read-only should just work. Usernetes does the same thing.

@cyphar (Member) commented Feb 26, 2019

Ah right, because it uses rootlesskit -- sorry for misstating that. 😉

However, the fix is still definitely needed, because the "canonical" usernetes installation has runc installed as the same user as the person who runs it. Obviously with newuidmap you are protected, but we still need protections for the "true rootless" (no funny business) mode.
