New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split lxc_setup to unshare cgroup before capabilities drop #1597
Conversation
This pull request didn't trigger Jenkins as its author isn't in the whitelist. An organization member must perform one of the following:
Those commands are simple Github comments of the format: "jenkins: COMMAND" |
It shouldn't be necessary to add a new hook just for that. I'd rather have the mount hook run. We shouldn't just add random hooks for very very specific use cases. :) |
I might actually look into hooking up the |
Thanks for your help. I understand the need to limit hooks proliferation. I didn't modify the Moreover, it's a bit complicated because we have to change the time where |
@superboum, yeah I saw that as well. I might have to sleep on that one for a bit. :) |
19a0264
to
5bf017a
Compare
Signed-off-by: Quentin Dufour <quentin@dufour.tk>
5bf017a
to
a421eca
Compare
I tried to write a patch without adding a new hook. So i moved So, here is the configuration I used (with Debian 9): # path of the file on the host: /var/lib/lxc/test/config
# ... (the default configuration)
lxc.mount.auto = proc:rw sys:rw
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=50m,create=dir 0 0
lxc.hook.mount = /opt/mount-systemd-cgroups
lxc.cap.drop = sys_admin #!/bin/bash
# path of the file on the host: /opt/mount-systemd-cgroups
mount -t tmpfs tmpfs $LXC_ROOTFS_MOUNT/sys/fs/cgroup
mkdir -p $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd
mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,xattr,name=systemd cgroup $LXC_ROOTFS_MOUNT/sys/fs/cgroup/systemd |
@@ -850,6 +849,12 @@ static int do_start(void *data) | |||
INFO("Unshared CLONE_NEWCGROUP."); | |||
} | |||
|
|||
/* Setup the container, ip, names, utsname, ... */ | |||
if (lxc_setup(handler)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So moving the unshare(CLONE_NEWCGROUP)
after the pivot_root()
syscall is something which I need to test in a POC before merging as I'm not sure if there isn't a possible interaction between the two. I doubt it but security firsts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean it really shouldn't make a difference but I've seen stranger things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so I confirmed that pivot_root()
and CLONE_NEWCGROUP
do not hate each other. :) That's a good sign.
if (send_ttys_to_parent(handler) < 0) { | ||
ERROR("failure sending console info to parent"); | ||
return -1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what is the reasoning for putting the tty creation and sending before unsharing the cgroup namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Especially since the tty setup itself is done after the cgroup namespace has been unshared.
So I think, barring any crucial reasons for doing so, we want the ttys to be created after the cgroup namespace has been unshared. Otherwise the tty file descriptors refer to the parent's cgroup namespace which is not ideal security-wise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, looking closer at the patch its wrong. By moving lxc_create_tty()
before lxc_setup_devpts()
you're allocating tty file descriptors from the host and not from the container. That's a security issue. So we unfortunately can't merge this in its current form.
Right, I have a working patch for what you want to achieve. I'm closing this one and give you a suggested-by line since you had the original idea. :) |
When the running kernel supports cgroup namespace and users want to manually set up cgroups via lxc.hook.mount before the init binary starts the cgroup namespace needs to be already unshared. Otherwise the view on the cgroup mounts is wrong. This commit places the call to lxc_setup() after the LXC_SYNC_POST_CGROUP barrier. This idea also made me aware of a security improvement which should harden our tty handling code a bit. Before this commit, the tty fds we allocate from a fresh devpts instance in the container's namespaces before the init binary starts were referring to the hosts cgroup namespace since lxc_setup() was called before unshare(CLONE_NEWCGROUP). This leaves room for the vague possibility of setns()ing to the parent's cgroup namespaces via one of those fds. This commit makes sure that the ttys are allocated after *all* namespaces have been created. Adding a Suggested-by line for the lxc.mount.hook fix for Quentin. Closes lxc#1597. Suggested-by: Quentin Dufour <quentin@dufour.tk> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When the running kernel supports cgroup namespaces and users want to manually set up cgroups via lxc.hook.mount before the init binary starts the cgroup namespace needs to be already unshared. Otherwise the view on the cgroup mounts is wrong. This commit places the call to lxc_setup() after the LXC_SYNC_POST_CGROUP barrier. Before this commit, the tty fds we allocate from a fresh devpts instance in the container's namespaces before the init binary starts were referring to the host's cgroup namespace since lxc_setup() was called before unshare(CLONE_NEWCGROUP). Although not a security risk at this point since setns() restricts its calls to /proc/<self>/ns files it's still better to do it *after* the cgroup namespace has been unshared. Adding a Suggested-by line for the lxc.mount.hook fix for Quentin. Closes lxc#1597. Suggested-by: Quentin Dufour <quentin@dufour.tk> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When the running kernel supports cgroup namespaces and users want to manually set up cgroups via lxc.hook.mount before the init binary starts the cgroup namespace needs to be already unshared. Otherwise the view on the cgroup mounts is wrong. This commit places the call to lxc_setup() after the LXC_SYNC_POST_CGROUP barrier. Before this commit, the tty fds we allocate from a fresh devpts instance in the container's namespaces before the init binary starts were referring to the host's cgroup namespace since lxc_setup() was called before unshare(CLONE_NEWCGROUP). Although not a security risk at this point since setns() restricts its calls to /proc/<self>/ns files it's still better to do it *after* the cgroup namespace has been unshared. Adding a Suggested-by line for the lxc.mount.hook fix for Quentin. Closes lxc#1597. Suggested-by: Quentin Dufour <quentin@dufour.tk> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When the running kernel supports cgroup namespaces and users want to manually set up cgroups via lxc.hook.mount before the init binary starts the cgroup namespace needs to be already unshared. Otherwise the view on the cgroup mounts is wrong. This commit places the call to lxc_setup() after the LXC_SYNC_POST_CGROUP barrier. Before this commit, the tty fds we allocate from a fresh devpts instance in the container's namespaces before the init binary starts were referring to the host's cgroup namespace since lxc_setup() was called before unshare(CLONE_NEWCGROUP). Although not a security risk at this point since setns() restricts its calls to /proc/<self>/ns files it's still better to do it *after* the cgroup namespace has been unshared. Adding a Suggested-by line for the lxc.mount.hook fix for Quentin. Closes #1597. Suggested-by: Quentin Dufour <quentin@dufour.tk> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Hi everyone,
Abstract
I'm proposing to move the capabilities drop part of
lxc_setup
in a new function namedlxc_late_setup
which is run after the cgroup unshare. I've also added a hook namedprivileged-start
inlxc_late_setup
which is run with all the capabilities and in all the namespaces (including the cgroup one). It allows me to start a systemd container withoutCAP_SYS_ADMIN
on a kernel supporting the cgroup namespaces.systemd without
CAP_SYS_ADMIN
If you want to run systemd without
CAP_SYS_ADMIN
, you have to mount everything "by hand" before its start. It works for most of the filesystems but cause problems with the cgroup mount.If your kernel supports the cgroup namespace, you can't use:
lxc.mount.auto = cgroup
as the function in the lxc code directly returns without doing anything.lxc.mount.entry
orlxc.hook.mount
because both commands are not executed in the cgroup namespace (due to reasons you can find in the LXC code) - so you'll mount your host hierarchy.lxc.hook.start
as it is executed after the capabilities drop - so you'll have a Permission Error if you try to mount your cgroup hierarchy.How to use it
Add something like this in your container's configuration:
And create this script:
I'm staying open to suggestions, my first goal is to start systemd without
CAP_SYS_ADMIN
馃槃Notes: This pull request is a rework of GH-1593.