Split lxc_setup to unshare cgroup before capabilities drop #1597

superboum · 2017-05-29T21:15:03Z

Hi everyone,

Abstract

I'm proposing to move the capabilities drop part of lxc_setup in a new function named lxc_late_setup which is run after the cgroup unshare. I've also added a hook named privileged-start in lxc_late_setup which is run with all the capabilities and in all the namespaces (including the cgroup one). It allows me to start a systemd container without CAP_SYS_ADMIN on a kernel supporting the cgroup namespaces.

systemd without `CAP_SYS_ADMIN`

If you want to run systemd without CAP_SYS_ADMIN, you have to mount everything "by hand" before its start. It works for most of the filesystems but cause problems with the cgroup mount.

If your kernel supports the cgroup namespace, you can't use:

lxc.mount.auto = cgroup as the function in the lxc code directly returns without doing anything.
lxc.mount.entry or lxc.hook.mount because both commands are not executed in the cgroup namespace (due to reasons you can find in the LXC code) - so you'll mount your host hierarchy.
lxc.hook.start as it is executed after the capabilities drop - so you'll have a Permission Error if you try to mount your cgroup hierarchy.

How to use it

Add something like this in your container's configuration:

# path of the file on the host: /var/lib/lxc/test/config
# ... (the default configuration)
lxc.mount.auto = proc:rw sys:rw
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=50m,create=dir 0 0
lxc.hook.priv-start = /bin/mount-systemd-cgroups
lxc.cap.drop = sys_admin

And create this script:

#!/bin/bash
# path of the file on the host: /var/lib/lxc/test/rootfs/bin/mount-systemd-cgroups
mount -t tmpfs tmpfs /sys/fs/cgroup
mkdir -p /sys/fs/cgroup/systemd
mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,xattr,name=systemd cgroup /sys/fs/cgroup/systemd

I'm staying open to suggestions, my first goal is to start systemd without CAP_SYS_ADMIN 😄

Notes: This pull request is a rework of GH-1593.

lxc-jenkins · 2017-05-29T21:15:05Z

This pull request didn't trigger Jenkins as its author isn't in the whitelist.

An organization member must perform one of the following:

To have this branch tested by Jenkins, use the "ok to test" command.
To have a one time test done, use the "test this please" command.

Those commands are simple Github comments of the format: "jenkins: COMMAND"

brauner · 2017-05-29T21:21:09Z

It shouldn't be necessary to add a new hook just for that. I'd rather have the mount hook run. We shouldn't just add random hooks for very very specific use cases. :)

brauner · 2017-05-30T04:05:35Z

I might actually look into hooking up the lxc.hook.mount such that it is run after all namespaces have been setup but before capabilities have been dropped. If I get annoyed, I'll ping you. :)

superboum · 2017-05-30T05:25:42Z

Thanks for your help. I understand the need to limit hooks proliferation. I didn't modify the lxc.mount.hook as I thought it could break things as we don't control for what purpose people use it.

Moreover, it's a bit complicated because we have to change the time where pivot_root is called, as lxc.mount.hook is called outside of the new root now but would be called inside it if we move it in lxc_late_setup.

brauner · 2017-05-30T05:26:40Z

@superboum, yeah I saw that as well. I might have to sleep on that one for a bit. :)

Signed-off-by: Quentin Dufour <quentin@dufour.tk>

superboum · 2017-05-31T07:29:08Z

I tried to write a patch without adding a new hook. So i moved lxc_setup after the cgroup logic. I had to keep two calls before the cgroup logic that I put in lxc_early_setup(). I didn't test extensively the patch but it seems to work for my use case.

So, here is the configuration I used (with Debian 9):

# path of the file on the host: /var/lib/lxc/test/config
# ... (the default configuration)
lxc.mount.auto = proc:rw sys:rw
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=50m,create=dir 0 0
lxc.hook.mount = /opt/mount-systemd-cgroups
lxc.cap.drop = sys_admin

#!/bin/bash
# path of the file on the host: /opt/mount-systemd-cgroups
mount -t tmpfs tmpfs $LXC_ROOTFS_MOUNT/sys/fs/cgroup
mkdir -p $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd
mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,xattr,name=systemd cgroup $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd

brauner · 2017-06-01T01:56:15Z

src/lxc/start.c

@@ -850,6 +849,12 @@ static int do_start(void *data)
 		INFO("Unshared CLONE_NEWCGROUP.");
 	}

+	/* Setup the container, ip, names, utsname, ... */
+	if (lxc_setup(handler)) {


So moving the unshare(CLONE_NEWCGROUP) after the pivot_root() syscall is something which I need to test in a POC before merging as I'm not sure if there isn't a possible interaction between the two. I doubt it but security firsts.

I mean it really shouldn't make a difference but I've seen stranger things.

Ok, so I confirmed that pivot_root() and CLONE_NEWCGROUP do not hate each other. :) That's a good sign.

brauner · 2017-06-01T02:52:04Z

src/lxc/conf.c

+	if (send_ttys_to_parent(handler) < 0) {
+		ERROR("failure sending console info to parent");
+		return -1;
+	}


So what is the reasoning for putting the tty creation and sending before unsharing the cgroup namespace?

Especially since the tty setup itself is done after the cgroup namespace has been unshared.
So I think, barring any crucial reasons for doing so, we want the ttys to be created after the cgroup namespace has been unshared. Otherwise the tty file descriptors refer to the parent's cgroup namespace which is not ideal security-wise.

brauner

Actually, looking closer at the patch its wrong. By moving lxc_create_tty() before lxc_setup_devpts() you're allocating tty file descriptors from the host and not from the container. That's a security issue. So we unfortunately can't merge this in its current form.

brauner · 2017-06-01T03:23:03Z

Right, I have a working patch for what you want to achieve. I'm closing this one and give you a suggested-by line since you had the original idea. :)

When the running kernel supports cgroup namespace and users want to manually set up cgroups via lxc.hook.mount before the init binary starts the cgroup namespace needs to be already unshared. Otherwise the view on the cgroup mounts is wrong. This commit places the call to lxc_setup() after the LXC_SYNC_POST_CGROUP barrier. This idea also made me aware of a security improvement which should harden our tty handling code a bit. Before this commit, the tty fds we allocate from a fresh devpts instance in the container's namespaces before the init binary starts were referring to the hosts cgroup namespace since lxc_setup() was called before unshare(CLONE_NEWCGROUP). This leaves room for the vague possibility of setns()ing to the parent's cgroup namespaces via one of those fds. This commit makes sure that the ttys are allocated after *all* namespaces have been created. Adding a Suggested-by line for the lxc.mount.hook fix for Quentin. Closes lxc#1597. Suggested-by: Quentin Dufour <quentin@dufour.tk> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

When the running kernel supports cgroup namespaces and users want to manually set up cgroups via lxc.hook.mount before the init binary starts the cgroup namespace needs to be already unshared. Otherwise the view on the cgroup mounts is wrong. This commit places the call to lxc_setup() after the LXC_SYNC_POST_CGROUP barrier. Before this commit, the tty fds we allocate from a fresh devpts instance in the container's namespaces before the init binary starts were referring to the host's cgroup namespace since lxc_setup() was called before unshare(CLONE_NEWCGROUP). Although not a security risk at this point since setns() restricts its calls to /proc/<self>/ns files it's still better to do it *after* the cgroup namespace has been unshared. Adding a Suggested-by line for the lxc.mount.hook fix for Quentin. Closes lxc#1597. Suggested-by: Quentin Dufour <quentin@dufour.tk> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

When the running kernel supports cgroup namespaces and users want to manually set up cgroups via lxc.hook.mount before the init binary starts the cgroup namespace needs to be already unshared. Otherwise the view on the cgroup mounts is wrong. This commit places the call to lxc_setup() after the LXC_SYNC_POST_CGROUP barrier. Before this commit, the tty fds we allocate from a fresh devpts instance in the container's namespaces before the init binary starts were referring to the host's cgroup namespace since lxc_setup() was called before unshare(CLONE_NEWCGROUP). Although not a security risk at this point since setns() restricts its calls to /proc/<self>/ns files it's still better to do it *after* the cgroup namespace has been unshared. Adding a Suggested-by line for the lxc.mount.hook fix for Quentin. Closes #1597. Suggested-by: Quentin Dufour <quentin@dufour.tk> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

brauner added the Incomplete Waiting on more information from reporter label May 30, 2017

brauner mentioned this pull request May 30, 2017

lxc list -- IPV4 table element not being populated on Gentoo canonical/lxd#3357

Closed

5 tasks

superboum force-pushed the systemd-without-sys_admin branch from 19a0264 to 5bf017a Compare May 31, 2017 07:17

Run most of lxc_setup after cgroup unshare

a421eca

Signed-off-by: Quentin Dufour <quentin@dufour.tk>

superboum force-pushed the systemd-without-sys_admin branch from 5bf017a to a421eca Compare May 31, 2017 07:23

brauner reviewed Jun 1, 2017

View reviewed changes

brauner removed the Incomplete Waiting on more information from reporter label Jun 1, 2017

brauner reviewed Jun 1, 2017

View reviewed changes

brauner requested changes Jun 1, 2017

View reviewed changes

brauner added the Blocked Waiting on an external task label Jun 1, 2017

brauner mentioned this pull request Jun 1, 2017

call lxc_setup() after unshare(CLONE_NEWCGROUP) #1606

Merged

stgraber closed this Jul 2, 2017

brauner mentioned this pull request Sep 18, 2017

Cannot start privileged containers without cap sys_admin on Linux Kernel 4.6 and newer #1737

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split lxc_setup to unshare cgroup before capabilities drop #1597

Split lxc_setup to unshare cgroup before capabilities drop #1597

superboum commented May 29, 2017

lxc-jenkins commented May 29, 2017

brauner commented May 29, 2017

brauner commented May 30, 2017

superboum commented May 30, 2017 •

edited

brauner commented May 30, 2017

superboum commented May 31, 2017 •

edited

brauner Jun 1, 2017

brauner Jun 1, 2017

brauner Jun 1, 2017

brauner Jun 1, 2017

brauner Jun 1, 2017 •

edited

brauner left a comment

brauner commented Jun 1, 2017

Split lxc_setup to unshare cgroup before capabilities drop #1597

Split lxc_setup to unshare cgroup before capabilities drop #1597

Conversation

superboum commented May 29, 2017

Abstract

systemd without CAP_SYS_ADMIN

How to use it

lxc-jenkins commented May 29, 2017

brauner commented May 29, 2017

brauner commented May 30, 2017

superboum commented May 30, 2017 • edited

brauner commented May 30, 2017

superboum commented May 31, 2017 • edited

brauner Jun 1, 2017

Choose a reason for hiding this comment

brauner Jun 1, 2017

Choose a reason for hiding this comment

brauner Jun 1, 2017

Choose a reason for hiding this comment

brauner Jun 1, 2017

Choose a reason for hiding this comment

brauner Jun 1, 2017 • edited

Choose a reason for hiding this comment

brauner left a comment

Choose a reason for hiding this comment

brauner commented Jun 1, 2017

systemd without `CAP_SYS_ADMIN`

superboum commented May 30, 2017 •

edited

superboum commented May 31, 2017 •

edited

brauner Jun 1, 2017 •

edited