Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split lxc_setup to unshare cgroup before capabilities drop #1597

Closed
wants to merge 1 commit into from

Conversation

superboum
Copy link

Hi everyone,

Abstract

I'm proposing to move the capabilities drop part of lxc_setup in a new function named lxc_late_setup which is run after the cgroup unshare. I've also added a hook named privileged-start in lxc_late_setup which is run with all the capabilities and in all the namespaces (including the cgroup one). It allows me to start a systemd container without CAP_SYS_ADMIN on a kernel supporting the cgroup namespaces.

systemd without CAP_SYS_ADMIN

If you want to run systemd without CAP_SYS_ADMIN, you have to mount everything "by hand" before its start. It works for most of the filesystems but cause problems with the cgroup mount.

If your kernel supports the cgroup namespace, you can't use:

  • lxc.mount.auto = cgroup as the function in the lxc code directly returns without doing anything.
  • lxc.mount.entry or lxc.hook.mount because both commands are not executed in the cgroup namespace (due to reasons you can find in the LXC code) - so you'll mount your host hierarchy.
  • lxc.hook.start as it is executed after the capabilities drop - so you'll have a Permission Error if you try to mount your cgroup hierarchy.

How to use it

Add something like this in your container's configuration:

# path of the file on the host: /var/lib/lxc/test/config
# ... (the default configuration)
lxc.mount.auto = proc:rw sys:rw
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=50m,create=dir 0 0
lxc.hook.priv-start = /bin/mount-systemd-cgroups
lxc.cap.drop = sys_admin

And create this script:

#!/bin/bash
# path of the file on the host: /var/lib/lxc/test/rootfs/bin/mount-systemd-cgroups
mount -t tmpfs tmpfs /sys/fs/cgroup
mkdir -p /sys/fs/cgroup/systemd
mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,xattr,name=systemd cgroup /sys/fs/cgroup/systemd

I'm staying open to suggestions, my first goal is to start systemd without CAP_SYS_ADMIN 馃槃

Notes: This pull request is a rework of GH-1593.

@lxc-jenkins
Copy link

This pull request didn't trigger Jenkins as its author isn't in the whitelist.

An organization member must perform one of the following:

  • To have this branch tested by Jenkins, use the "ok to test" command.
  • To have a one time test done, use the "test this please" command.

Those commands are simple Github comments of the format: "jenkins: COMMAND"

@brauner
Copy link
Member

brauner commented May 29, 2017

It shouldn't be necessary to add a new hook just for that. I'd rather have the mount hook run. We shouldn't just add random hooks for very very specific use cases. :)

@brauner brauner added the Incomplete Waiting on more information from reporter label May 30, 2017
@brauner
Copy link
Member

brauner commented May 30, 2017

I might actually look into hooking up the lxc.hook.mount such that it is run after all namespaces have been setup but before capabilities have been dropped. If I get annoyed, I'll ping you. :)

@superboum
Copy link
Author

superboum commented May 30, 2017

Thanks for your help. I understand the need to limit hooks proliferation. I didn't modify the lxc.mount.hook as I thought it could break things as we don't control for what purpose people use it.

Moreover, it's a bit complicated because we have to change the time where pivot_root is called, as lxc.mount.hook is called outside of the new root now but would be called inside it if we move it in lxc_late_setup.

@brauner
Copy link
Member

brauner commented May 30, 2017

@superboum, yeah I saw that as well. I might have to sleep on that one for a bit. :)

Signed-off-by: Quentin Dufour <quentin@dufour.tk>
@superboum
Copy link
Author

superboum commented May 31, 2017

I tried to write a patch without adding a new hook. So i moved lxc_setup after the cgroup logic. I had to keep two calls before the cgroup logic that I put in lxc_early_setup(). I didn't test extensively the patch but it seems to work for my use case.

So, here is the configuration I used (with Debian 9):

# path of the file on the host: /var/lib/lxc/test/config
# ... (the default configuration)
lxc.mount.auto = proc:rw sys:rw
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=50m,create=dir 0 0
lxc.hook.mount = /opt/mount-systemd-cgroups
lxc.cap.drop = sys_admin
#!/bin/bash
# path of the file on the host: /opt/mount-systemd-cgroups
mount -t tmpfs tmpfs $LXC_ROOTFS_MOUNT/sys/fs/cgroup
mkdir -p $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd
mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,xattr,name=systemd cgroup $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd

@@ -850,6 +849,12 @@ static int do_start(void *data)
INFO("Unshared CLONE_NEWCGROUP.");
}

/* Setup the container, ip, names, utsname, ... */
if (lxc_setup(handler)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So moving the unshare(CLONE_NEWCGROUP) after the pivot_root() syscall is something which I need to test in a POC before merging as I'm not sure if there isn't a possible interaction between the two. I doubt it but security firsts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean it really shouldn't make a difference but I've seen stranger things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I confirmed that pivot_root() and CLONE_NEWCGROUP do not hate each other. :) That's a good sign.

@brauner brauner removed the Incomplete Waiting on more information from reporter label Jun 1, 2017
if (send_ttys_to_parent(handler) < 0) {
ERROR("failure sending console info to parent");
return -1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what is the reasoning for putting the tty creation and sending before unsharing the cgroup namespace?

Copy link
Member

@brauner brauner Jun 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially since the tty setup itself is done after the cgroup namespace has been unshared.
So I think, barring any crucial reasons for doing so, we want the ttys to be created after the cgroup namespace has been unshared. Otherwise the tty file descriptors refer to the parent's cgroup namespace which is not ideal security-wise.

Copy link
Member

@brauner brauner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, looking closer at the patch its wrong. By moving lxc_create_tty() before lxc_setup_devpts() you're allocating tty file descriptors from the host and not from the container. That's a security issue. So we unfortunately can't merge this in its current form.

@brauner
Copy link
Member

brauner commented Jun 1, 2017

Right, I have a working patch for what you want to achieve. I'm closing this one and give you a suggested-by line since you had the original idea. :)

@brauner brauner added the Blocked Waiting on an external task label Jun 1, 2017
brauner pushed a commit to brauner/lxc that referenced this pull request Jun 1, 2017
When the running kernel supports cgroup namespace and users want to manually
set up cgroups via lxc.hook.mount before the init binary starts the cgroup
namespace needs to be already unshared. Otherwise the view on the cgroup mounts
is wrong. This commit places the call to lxc_setup() after the
LXC_SYNC_POST_CGROUP barrier.

This idea also made me aware of a security improvement which should harden our
tty handling code a bit. Before this commit, the tty fds we allocate from a
fresh devpts instance in the container's namespaces before the init binary
starts were referring to the hosts cgroup namespace since lxc_setup() was
called before unshare(CLONE_NEWCGROUP). This leaves room for the vague
possibility of setns()ing to the parent's cgroup namespaces via one of those
fds. This commit makes sure that the ttys are allocated after *all* namespaces
have been created.

Adding a Suggested-by line for the lxc.mount.hook fix for Quentin.

Closes lxc#1597.

Suggested-by: Quentin Dufour <quentin@dufour.tk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
brauner pushed a commit to brauner/lxc that referenced this pull request Jun 1, 2017
When the running kernel supports cgroup namespaces and users want to manually
set up cgroups via lxc.hook.mount before the init binary starts the cgroup
namespace needs to be already unshared. Otherwise the view on the cgroup mounts
is wrong. This commit places the call to lxc_setup() after the
LXC_SYNC_POST_CGROUP barrier.

Before this commit, the tty fds we allocate from a fresh devpts instance in the
container's namespaces before the init binary starts were referring to the
host's cgroup namespace since lxc_setup() was called before
unshare(CLONE_NEWCGROUP). Although not a security risk at this point since
setns() restricts its calls to /proc/<self>/ns files it's still better to do it
*after* the cgroup namespace has been unshared.

Adding a Suggested-by line for the lxc.mount.hook fix for Quentin.

Closes lxc#1597.

Suggested-by: Quentin Dufour <quentin@dufour.tk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
@stgraber stgraber closed this Jul 2, 2017
brauner pushed a commit to brauner/lxc that referenced this pull request Jul 24, 2017
When the running kernel supports cgroup namespaces and users want to manually
set up cgroups via lxc.hook.mount before the init binary starts the cgroup
namespace needs to be already unshared. Otherwise the view on the cgroup mounts
is wrong. This commit places the call to lxc_setup() after the
LXC_SYNC_POST_CGROUP barrier.

Before this commit, the tty fds we allocate from a fresh devpts instance in the
container's namespaces before the init binary starts were referring to the
host's cgroup namespace since lxc_setup() was called before
unshare(CLONE_NEWCGROUP). Although not a security risk at this point since
setns() restricts its calls to /proc/<self>/ns files it's still better to do it
*after* the cgroup namespace has been unshared.

Adding a Suggested-by line for the lxc.mount.hook fix for Quentin.

Closes lxc#1597.

Suggested-by: Quentin Dufour <quentin@dufour.tk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
stgraber pushed a commit that referenced this pull request Aug 14, 2017
When the running kernel supports cgroup namespaces and users want to manually
set up cgroups via lxc.hook.mount before the init binary starts the cgroup
namespace needs to be already unshared. Otherwise the view on the cgroup mounts
is wrong. This commit places the call to lxc_setup() after the
LXC_SYNC_POST_CGROUP barrier.

Before this commit, the tty fds we allocate from a fresh devpts instance in the
container's namespaces before the init binary starts were referring to the
host's cgroup namespace since lxc_setup() was called before
unshare(CLONE_NEWCGROUP). Although not a security risk at this point since
setns() restricts its calls to /proc/<self>/ns files it's still better to do it
*after* the cgroup namespace has been unshared.

Adding a Suggested-by line for the lxc.mount.hook fix for Quentin.

Closes #1597.

Suggested-by: Quentin Dufour <quentin@dufour.tk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocked Waiting on an external task
Development

Successfully merging this pull request may close these issues.

None yet

4 participants