Cannot start privileged containers without cap sys_admin on Linux Kernel 4.6 and newer #1737

Closed
3 tasks done
kartoffelheinz opened this issue Aug 9, 2017 · 18 comments
Labels
Bug Confirmed to be a bug

Comments

@kartoffelheinz

kartoffelheinz commented Aug 9, 2017

This bug has been present for at least a year now, but since we use Debian only and its stock kernel was well below 4.6 (3.16, to be precise), we thought it might get fixed "by itself". Unfortunately, that was not the case: Debian 9 has now arrived with kernel 4.9, and the bug prevents us from upgrading.

Required information

  • Distribution: Debian amd64
  • Distribution version: (8 and 9)
  • The output of
    • lxc-start --version: 2.0.7
    • lxc-checkconfig:

--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled
Multiple /dev/pts instances: enabled
--- Control groups ---
Cgroup: enabled
Cgroup clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled
--- Misc ---
Veth pair device: enabled
Macvlan: enabled
Vlan: enabled
Bridges: enabled
Advanced netfilter: enabled
CONFIG_NF_NAT_IPV4: enabled
CONFIG_NF_NAT_IPV6: enabled
CONFIG_IP_NF_TARGET_MASQUERADE: enabled
CONFIG_IP6_NF_TARGET_MASQUERADE: enabled
CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled
FUSE (for use with lxcfs): enabled
--- Checkpoint/Restore ---
checkpoint restore: enabled
CONFIG_FHANDLE: enabled
CONFIG_EVENTFD: enabled
CONFIG_EPOLL: enabled
CONFIG_UNIX_DIAG: enabled
CONFIG_INET_DIAG: enabled
CONFIG_PACKET_DIAG: enabled
CONFIG_NETLINK_DIAG: enabled
File capabilities: enabled
Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig

  • cat /proc/self/cgroup

10:devices:/user.slice/user-0.slice/session-c5.scope
9:cpu,cpuacct:/user.slice/user-0.slice/session-c5.scope
8:freezer:/user/root/0
7:memory:/user.slice/user-0.slice/session-c5.scope
6:perf_event:/
5:pids:/user.slice/user-0.slice/session-c5.scope
4:cpuset:/
3:net_cls,net_prio:/
2:blkio:/user.slice/user-0.slice/session-c5.scope
1:name=systemd:/user.slice/user-0.slice/session-c5.scope
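
The listing above uses the cgroup v1 format `hierarchy-ID:controller-list:path`. As a rough aid for comparing such listings between hosts, here is a minimal Python sketch of a parser; `CgroupEntry` and `parse_proc_cgroup` are hypothetical names chosen for illustration, not part of LXC.

```python
from typing import List, NamedTuple, Tuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: Tuple[str, ...]
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse lines of the form 'hierarchy-ID:controller-list:path'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only, so a ':' in the path survives.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = tuple(c for c in controllers.split(",") if c)
        entries.append(CgroupEntry(int(hierarchy_id), names, path))
    return entries


# Three lines copied from the listing in this report.
sample = """\
9:cpu,cpuacct:/user.slice/user-0.slice/session-c5.scope
8:freezer:/user/root/0
1:name=systemd:/user.slice/user-0.slice/session-c5.scope
"""
for entry in parse_proc_cgroup(sample):
    print(entry.hierarchy_id, ",".join(entry.controllers), entry.path)
```

Note that in this listing the freezer hierarchy sits under /user/root/0 while most other controllers sit under the systemd session scope.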

  • cat /proc/1/mounts

sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,nosuid,relatime,size=65943932k,nr_inodes=16485983,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=13197976k,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset,clone_children 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=19932 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
tmpfs /tmp tmpfs rw,relatime,size=52791904k,nr_inodes=2048000 0 0
tmpfs /run/shm tmpfs rw,nosuid,nodev,relatime,size=13197976k 0 0
/dev/sda3 /boot ext3 rw,noatime,data=ordered 0 0
lxcfs /var/lib/lxcfs fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=13197976k,mode=700,uid=1000,gid=1000 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=13197976k,mode=700 0 0

Issue description

If you try to start a privileged container with lxc.cap.drop = sys_admin (anyone not insane will want this) on a kernel newer than 4.5, the container will not boot (it hangs on init) and emits the following error:

Failed to mount tmpfs at /sys/fs/cgroup: Operation not permitted
Failed to mount cgroup at /sys/fs/cgroup/systemd: No such file or directory
[!!!!!!] Failed to mount API filesystems, freezing.
Freezing execution.

This works just fine on kernels up to and including 4.5. I remember that something in the cgroup architecture changed around that time, but I'm no kernel developer.
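
One way to check whether a host falls into the affected range is to compare the kernel release against 4.6, the release in which cgroup namespaces (discussed further down this thread) were added, and to test for /proc/self/ns/cgroup. A minimal Python sketch, assuming a Linux host; the function names are hypothetical:

```python
import os
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Compare the leading 'X.Y' of a kernel release string against major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def cgroup_namespaces_available(ns_path: str = "/proc/self/ns/cgroup") -> bool:
    """The kernel exposes this file only when cgroup namespaces are supported."""
    return os.path.exists(ns_path)


if __name__ == "__main__":
    release = os.uname().release
    print("kernel release:", release)
    print("4.6 or newer:", kernel_at_least(release, 4, 6))
    print("cgroup namespaces available:", cgroup_namespaces_available())
```

The file check is the more reliable of the two, since distribution kernels may carry backported features.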

Steps to reproduce

  1. Install latest Debian version (Debian Stretch)
  2. lxc-create -t download -n test -- --dist debian --release stretch --arch amd64
  3. Modify the config of the test container to include the following (the mount entries are also needed for younger kernels)

lxc.cap.drop = sys_admin
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,size=128m,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,size=128m,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0

  4. Start the container in the foreground or in the background and watch the log file
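
For repeated test runs, the config change in step 3 can be scripted. The following is a minimal Python sketch that appends the four settings to a config text only if they are missing; `add_repro_settings` is a hypothetical helper, and the path in the trailing comment assumes the default location for privileged containers.

```python
REPRO_LINES = [
    "lxc.cap.drop = sys_admin",
    "lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,size=128m,create=dir 0 0",
    "lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,size=128m,create=dir 0 0",
    "lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0",
]


def add_repro_settings(config_text: str) -> str:
    """Return config_text with any missing reproduction lines appended."""
    existing = {line.strip() for line in config_text.splitlines()}
    missing = [line for line in REPRO_LINES if line not in existing]
    if not missing:
        return config_text
    # Make sure the appended block starts on its own line.
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return config_text + separator + "\n".join(missing) + "\n"


# Usage (assumed path, adjust to your lxcpath):
#   path = "/var/lib/lxc/test/config"
#   with open(path) as fh:
#       updated = add_repro_settings(fh.read())
#   with open(path, "w") as fh:
#       fh.write(updated)
```

Running the helper twice leaves the config unchanged, so it is safe to call before every test start.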

Information to attach

@corsac-s

Hi,
I have had basically the same issue since I upgraded a Jessie host to Stretch. I've reported a bug to the Debian BTS at https://bugs.debian.org/875733 and @evgeni noticed that when cgroup namespaces are available, LXC 2.0 won't mount /sys/fs/cgroup at all, even with lxc.mount.auto = cgroup:mixed (https://github.com/lxc/lxc/blob/master/src/lxc/cgroups/cgfsng.c#L1627-L1628).

Note that I was running Jessie with a 4.9 kernel (so cgroup namespaces were already available) but LXC 1.0 didn't behave the same way.

@evgeni
Contributor

evgeni commented Sep 18, 2017

@hallyn @brauner can one of you have a look please? This breaks using systemd with SYS_ADMIN dropped, as it tries to mount /sys/fs/cgroup itself and fails.

@brauner
Member

brauner commented Sep 18, 2017

@evgeni, so, right. Off the top of my head, this would require an implementation of cgroup-mixed in cgfsng, which I'm currently not sure we have. But it shouldn't be too difficult to hack into the driver. The other option is to update to LXC 2.1: I moved various sync operations during container startup around so that lxc.hook.mount can be used to pre-mount the container's cgroups before the capability drop. I added this because someone had the exact same use case: #1597 (comment).

@evgeni
Contributor

evgeni commented Sep 19, 2017

@brauner I think we do, or am I misreading cgfsng_mount?

I removed https://github.com/lxc/lxc/blob/master/src/lxc/cgroups/cgfsng.c#L1627-L1628 and was able to start a container with /bin/sh as init and it had cgroups mounted fine.

@brauner
Member

brauner commented Sep 19, 2017

Cool. So I assume @hallyn didn't put this there purely because cgroup namespaces make this feature unnecessary. I assume it is because, before my patch to correctly sync between cgroup_enter() and CLONE_NEWCGROUP, the view on the hierarchy would have been wrong if we had allowed this codepath. That doesn't seem to be the case anymore. What we likely want now is that the default when cgroup namespaces are enabled is still not to mount ourselves, but that when users explicitly request it, we perform the mount for them.

@evgeni
Contributor

evgeni commented Sep 19, 2017

Yeah, that code is there for a reason. I just don't understand it properly, but you seem to ;)

@kartoffelheinz
Author

Has there been any recent progress on this issue? Anything we can do to help here?

@brauner
Member

brauner commented Oct 30, 2017

Oh, we somehow never followed up on this. Sorry, my bad. I'm testing a patch now.

@brauner brauner self-assigned this Oct 30, 2017
@brauner brauner added the Bug Confirmed to be a bug label Oct 30, 2017
@brauner
Member

brauner commented Oct 30, 2017

I'm sending a patch that enables pre-mounting the cgroup filesystems when CAP_SYS_ADMIN has been dropped. You should note however, that this requires a co-operative version of systemd. In general, I'm not sure whether the issue you are seeing is actually just systemd and not the kernel. I assume that you didn't just upgrade your kernel but also your systemd version. We have a similar problem in:

#1669

and @evverx is tracking this in

systemd/systemd#6477

@kartoffelheinz
Author

Great to see someone working on this. I will check with my coworker if we might be able to test your patch.

I think I can actually rule out systemd as the culprit here (at least the Debian versions). We tested Debian Jessie and Debian Stretch, each with new and old kernels, and both gave the same result: the old kernel worked with both Debian/systemd versions, the new kernel did not.

@evgeni
Contributor

evgeni commented Oct 30, 2017

@brauner wrong ev... ;)

brauner pushed a commit to brauner/lxc that referenced this issue Oct 30, 2017
In case cgroup namespaces are supported but we do not have CAP_SYS_ADMIN we
need to mount cgroups for the container. This patch enables both privileged and
unprivileged containers without CAP_SYS_ADMIN.

Closes lxc#1737.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
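
Read together with the earlier comments, the commit message above describes a small decision rule. The following Python sketch models that rule as described in this thread only; it is not the C code in cgfsng.c, and both function names are hypothetical.

```python
def skips_mount_before_patch(cgns_supported: bool) -> bool:
    """As reported above: with cgroup namespaces available, LXC 2.0 did not
    mount /sys/fs/cgroup for the container at all."""
    return cgns_supported


def skips_mount_after_patch(cgns_supported: bool, has_cap_sys_admin: bool) -> bool:
    """As the commit message describes: still skip by default when cgroup
    namespaces are available, but mount for the container when it lacks
    CAP_SYS_ADMIN and so cannot mount the hierarchies itself."""
    return cgns_supported and has_cap_sys_admin


for cgns in (False, True):
    for sys_admin in (False, True):
        print(
            f"cgns={cgns!s:5} CAP_SYS_ADMIN={sys_admin!s:5} "
            f"skip before={skips_mount_before_patch(cgns)!s:5} "
            f"skip after={skips_mount_after_patch(cgns, sys_admin)}"
        )
```

The only row that changes is cgroup namespaces available with CAP_SYS_ADMIN dropped, which is the configuration from the original report.
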
@brauner
Member

brauner commented Oct 30, 2017

@kartoffelheinz, I sent a patch that should fix your problem.

@brauner
Member

brauner commented Oct 31, 2017

There's one more tweak needed here. Currently we only mount writable cgroups, which for privileged containers means all controllers but for unprivileged containers means only a subset of them. That's not a big deal, since none of the others are writable, but we should still mount them.

brauner pushed a commit that referenced this issue Nov 9, 2017
In case cgroup namespaces are supported but we do not have CAP_SYS_ADMIN we
need to mount cgroups for the container. This patch enables both privileged and
unprivileged containers without CAP_SYS_ADMIN.

Closes #1737.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
@matthijskooijman
Contributor

matthijskooijman commented Jan 12, 2018

Any plans to backport this fix to the 2.0.x releases? Or does this depend on significant rewrites that are only present in 2.1?

I tried backporting this to 2.0.8 for Debian stretch (edit: I later noticed it was actually 2.0.7), completely unhindered by any actual knowledge of how this code works of course, and it got complex fast: my backport of #1888 also needed 6328fd9 to apply, then failed to compile because must_make_path() was missing, and ccb4cab looked over my head to backport (edit: I later found that it was not missing but only defined in cgfsng.c, so I just needed to backport 04ad7ff too). Then I tried backporting #1888 to 2.0.9 (since that one is also included in Debian already), which compiled without additional changes other than fixing some trivial merge conflicts, but that did not solve the problem.

@matthijskooijman
Contributor

A bit more digging suggests that the fix backported to 2.0.9 did not work, because:

  • The fix is only applied in the cgfsng backend, not the cgfs backend (is the latter intended to be removed at some point?)
  • The cgfsng backend was not properly loading, so lxc fell back to the cgfs backend.

The problem with loading cgfsng was apparently that I had lxc.cgroup.use = @all, which seems to mean "all cgroups" for cgfs, but is interpreted as a literal cgroup name by cgfsng (which needs the option unset to mean "all cgroups", AFAICS). The manpage seems to document only the latter behaviour (and has never documented the former).

I wanted to test whether removing lxc.cgroup.use makes my containers work again, but I just lost access to the machine running this due to a network outage, so that will have to wait until later.
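
To check whether a configuration sets this key at all, a small scan is enough. Here is a minimal Python sketch; `find_cgroup_use` is a hypothetical helper that works on config text from wherever the key is set on your system, and it only reports the value without judging how either backend would interpret it.

```python
from typing import List, Tuple


def find_cgroup_use(config_text: str) -> List[Tuple[int, str]]:
    """Return (line number, value) for every lxc.cgroup.use assignment."""
    hits = []
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "lxc.cgroup.use":
            hits.append((lineno, value.strip()))
    return hits


sample = """\
# example config text
lxc.cap.drop = sys_admin
lxc.cgroup.use = @all
"""
for lineno, value in find_cgroup_use(sample):
    print(f"line {lineno}: lxc.cgroup.use = {value}")
```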

@brauner
Member

brauner commented Jan 13, 2018

> The problem with loading cgfsng was apparently that I had lxc.cgroup.use = @all, which seems to mean "all cgroups" for cgfs, but is interpreted as a literal cgroup name by cgfsng (which needs the option unset to mean "all cgroups", AFAICS). The manpage seems to document only the latter behaviour (and has never documented the former).

Oh really? If so that'd be a bug. If you can show/reproduce this, please open a new issue.

@matthijskooijman
Contributor

matthijskooijman commented Jan 14, 2018

I just confirmed that removing the lxc.cgroup.use config on my system makes lxc use cgfsng, and then the fix in #1888 works on top of 2.0.9 as well. I also again tried backporting #1888 on top of 2.0.7 (for Debian stretch), which additionally required 04ad7ff and 6328fd9, which also works on my system.

The relevant Debian bug about this issue is here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=875733

geaaru pushed a commit to geaaru/lxc that referenced this issue Apr 25, 2018
In case cgroup namespaces are supported but we do not have CAP_SYS_ADMIN we
need to mount cgroups for the container. This patch enables both privileged and
unprivileged containers without CAP_SYS_ADMIN.

Closes lxc#1737.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
matthijskooijman added a commit to matthijskooijman/Tika that referenced this issue May 1, 2018
This breaks the "cgfsng" backend within lxc, which prevents the fix for
lxc/lxc#1737 from working. It seems that the
default, when this config value is not defined, is to use all cgroups
anyway, so removing this should not break anything (see
lxc/lxc#2084 about this).
@corsac-s

I did some testing with the patch from
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=875733#31 on top of stretch
LXC (1:2.0.7-2+deb9u2). At first sight it seems to work (I can start a container
with lxc.cap.drop = sys_admin), but somehow I can no longer start multiple
containers correctly.

For example the sequence:

lxc-attach -n test "echo OK"
OK
lxc-start -n test2
lxc-attach -n test2 "echo OK"
OK
lxc-attach -n test "echo OK"
lxc-attach: cgroups/cgfsng.c: cgfsng_attach: 1830 No such file or directory - Failed to attach 14680 to /sys/fs/cgroup/systemd//lxc/www-1/cgroup.procs
lxc-attach: attach.c: lxc_attach: 992 Expected to receive sequence number 0: No such file or directory.

It might be some kind of race condition because it doesn't always happen with
two containers, sometimes it's three.
