Cannot start privileged containers without cap sys_admin on Linux Kernel 4.6 and newer #1737

kartoffelheinz · 2017-08-09T15:08:21Z

This bug has been present for at least a year now, but since we use Debian only and the stock kernel was way below 4.6 (3.16 to be precise), we thought it might get fixed "by itself". Unfortunately, this was not the case, and now Debian 9 has arrived with kernel 4.9 and the bug prevents us from upgrading.

Required information

Distribution: Debian amd64
Distribution version: (8 and 9)
The output of
- lxc-start --version: 2.0.7
- lxc-checkconfig:

--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled
Multiple /dev/pts instances: enabled
--- Control groups ---
Cgroup: enabled
Cgroup clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled
--- Misc ---
Veth pair device: enabled
Macvlan: enabled
Vlan: enabled
Bridges: enabled
Advanced netfilter: enabled
CONFIG_NF_NAT_IPV4: enabled
CONFIG_NF_NAT_IPV6: enabled
CONFIG_IP_NF_TARGET_MASQUERADE: enabled
CONFIG_IP6_NF_TARGET_MASQUERADE: enabled
CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled
FUSE (for use with lxcfs): enabled
--- Checkpoint/Restore ---
checkpoint restore: enabled
CONFIG_FHANDLE: enabled
CONFIG_EVENTFD: enabled
CONFIG_EPOLL: enabled
CONFIG_UNIX_DIAG: enabled
CONFIG_INET_DIAG: enabled
CONFIG_PACKET_DIAG: enabled
CONFIG_NETLINK_DIAG: enabled
File capabilities: enabled
Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig

uname -a: Linux ws 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) x86_64 GNU/Linux
cat /proc/self/cgroup

10:devices:/user.slice/user-0.slice/session-c5.scope
9:cpu,cpuacct:/user.slice/user-0.slice/session-c5.scope
8:freezer:/user/root/0
7:memory:/user.slice/user-0.slice/session-c5.scope
6:perf_event:/
5:pids:/user.slice/user-0.slice/session-c5.scope
4:cpuset:/
3:net_cls,net_prio:/
2:blkio:/user.slice/user-0.slice/session-c5.scope
1:name=systemd:/user.slice/user-0.slice/session-c5.scope

cat /proc/1/mounts

sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,nosuid,relatime,size=65943932k,nr_inodes=16485983,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=13197976k,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset,clone_children 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=19932 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
tmpfs /tmp tmpfs rw,relatime,size=52791904k,nr_inodes=2048000 0 0
tmpfs /run/shm tmpfs rw,nosuid,nodev,relatime,size=13197976k 0 0
/dev/sda3 /boot ext3 rw,noatime,data=ordered 0 0
lxcfs /var/lib/lxcfs fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=13197976k,mode=700,uid=1000,gid=1000 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=13197976k,mode=700 0 0

Issue description

If you try to start a privileged container with lxc.cap.drop = sys_admin (anyone not insane will want this) on a kernel newer than 4.5, the container will not boot (it hangs on init) and emits the following error:

Failed to mount tmpfs at /sys/fs/cgroup: Operation not permitted
Failed to mount cgroup at /sys/fs/cgroup/systemd: No such file or directory
[!!!!!!] Failed to mount API filesystems, freezing.
Freezing execution.

This works just fine with kernels up to and including 4.5. I remember that something about the cgroup architecture changed around that time, but I'm no kernel developer.

Steps to reproduce

Install latest Debian version (Debian Stretch)
lxc-create -t download -n test -- --dist debian --release stretch --arch amd64
Modify the config of test to include the following (the mount entries are also needed for younger kernels)

lxc.cap.drop = sys_admin
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,size=128m,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,size=128m,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0

Start the container in foreground or in background and watch logfile
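
The config edit from the steps above can be scripted. This is only a minimal sketch: `append_repro_config` is a hypothetical helper name, the config path is whatever you pass in (for a privileged container it is commonly /var/lib/lxc/<name>/config, but that depends on your lxc.lxcpath), and the appended lines are copied verbatim from the step above.

```shell
# append_repro_config CONFIG_FILE
# Hypothetical helper: appends the capability drop and the tmpfs mount
# entries from the reproduction steps to an existing container config.
append_repro_config() {
    config_file="$1"
    # Refuse to create a new file; the container must already exist.
    [ -f "$config_file" ] || { echo "no such config: $config_file" >&2; return 1; }
    cat >> "$config_file" <<'EOF'
lxc.cap.drop = sys_admin
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,size=128m,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,size=128m,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
EOF
}
```

For example, `append_repro_config /var/lib/lxc/test/config` followed by `lxc-start -n test -o test.log -l DEBUG` should reproduce the hang on an affected host.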

Information to attach

container log (The file from running lxc-start -n <c> -o <log> -l DEBUG) test.log.txt
console.log (from lxc-start -L ) console.txt
the containers configuration file test.config.txt


corsac-s · 2017-09-17T09:57:59Z

Hi,
I have basically the same issue since I upgraded a Jessie host to Stretch. I've reported a bug to the Debian BTS at https://bugs.debian.org/875733 and @evgeni noticed that when cgroup namespaces are available, LXC 2.0 won't mount /sys/fs/cgroup at all, even with lxc.mount.auto = cgroup:mixed (https://github.com/lxc/lxc/blob/master/src/lxc/cgroups/cgfsng.c#L1627-L1628).

Note that I was running Jessie with a 4.9 kernel (so cgroup namespaces were already available) but LXC 1.0 didn't behave the same way.
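
Whether a host falls into the "cgroup namespaces available" case can be probed from a shell. A minimal sketch, assuming (as far as I know) that cgroup namespaces first appeared in Linux 4.6 and that a kernel supporting them exposes /proc/self/ns/cgroup; `has_cgroup_ns` and `kernel_at_least` are hypothetical helper names, not part of LXC.

```shell
# has_cgroup_ns [PATH]
# Succeeds if the cgroup namespace entry exists. The optional PATH argument
# is only there so the check can be pointed elsewhere for testing.
has_cgroup_ns() {
    ns_path="${1:-/proc/self/ns/cgroup}"
    [ -e "$ns_path" ]
}

# kernel_at_least RELEASE MAJOR MINOR
# Compares a `uname -r` style release string against MAJOR.MINOR.
kernel_at_least() {
    release="$1"; want_major="$2"; want_minor="$3"
    major="${release%%.*}"        # "4.9.0-3-amd64" -> "4"
    rest="${release#*.}"          # "4.9.0-3-amd64" -> "9.0-3-amd64"
    minor="${rest%%[!0-9]*}"      # "9.0-3-amd64"   -> "9"
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}
```

Something like `kernel_at_least "$(uname -r)" 4 6 && has_cgroup_ns && echo "cgroup namespaces available"` would then tell the Jessie (3.16) and Stretch (4.9) stock kernels apart. The /proc check is the more direct of the two, since a distribution kernel may be built without the feature.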

evgeni · 2017-09-18T17:41:26Z

@hallyn @brauner can one of you have a look please? This breaks using systemd with SYS_ADMIN dropped, as it tries to mount /sys/fs/cgroup itself and fails.

brauner · 2017-09-18T18:11:06Z

@evgeni, so, right. Off the top of my head, this would require an implementation of cgroup-mixed with cgfsng, which I'm currently not sure we have. But it shouldn't be too difficult to hack into the driver. The other option is to update to LXC 2.1. I moved various sync operations during container startup around such that lxc.hook.mount can be used to pre-mount the container's cgroups before the capability drop. I added this because someone had the exact same use case: #1597 (comment).

evgeni · 2017-09-19T08:08:55Z

@brauner I think we do, or am I misreading cgfsng_mount?

I removed https://github.com/lxc/lxc/blob/master/src/lxc/cgroups/cgfsng.c#L1627-L1628 and was able to start a container with /bin/sh as init and it had cgroups mounted fine.

brauner · 2017-09-19T09:55:22Z

Cool. So I assume @hallyn didn't put this there purely because cgroup namespaces make this feature unnecessary. I assume it is because, before my patch to correctly sync between cgroup_enter() and CLONE_NEWCGROUP, the view on the hierarchy would have been wrong if we allowed this codepath. That doesn't seem to be the case anymore. What we likely want now is that the default when cgroup namespaces are enabled is still to not mount ourselves, but when users explicitly request it, we perform the mount for them.

evgeni · 2017-09-19T11:17:24Z

Yeah, that code is there for a reason. I just don't understand it properly, but you seem to ;)

kartoffelheinz · 2017-10-30T09:51:39Z

Has there been any recent progress on this issue? Anything we can do to help here?

brauner · 2017-10-30T10:05:09Z

Oh, we somehow never followed up on this. Sorry, my bad. I'm testing a patch now.

brauner · 2017-10-30T10:51:17Z

I'm sending a patch that enables pre-mounting the cgroup filesystems when CAP_SYS_ADMIN has been dropped. You should note however, that this requires a co-operative version of systemd. In general, I'm not sure whether the issue you are seeing is actually just systemd and not the kernel. I assume that you didn't just upgrade your kernel but also your systemd version. We have a similar problem in:

#1669

and @evverx is tracking this in

systemd/systemd#6477

kartoffelheinz · 2017-10-30T11:02:41Z

Great to see someone working on this. I will check with my coworker if we might be able to test your patch.

I think I can actually rule out systemd as the culprit here (at least the Debian versions). We tested Debian Jessie and Debian Stretch, each with new and old kernels, and both showed the same behaviour: the old kernel worked on both Debian/systemd versions, the new kernel did not.

evgeni · 2017-10-30T11:05:13Z

@brauner wrong ev... ;)

In case cgroup namespaces are supported but we do not have CAP_SYS_ADMIN we need to mount cgroups for the container. This patch enables both privileged and unprivileged containers without CAP_SYS_ADMIN. Closes lxc#1737. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

brauner · 2017-10-30T13:37:12Z

@kartoffelheinz, sent a patch that should fix your problem.

brauner · 2017-10-31T08:07:09Z

There's one more tweak needed. Currently we only mount writable cgroups, which for privileged containers means all controllers but for unprivileged containers means only a subset of them. While that's not a big deal, since all of the others are not writable, we should still mount them.

matthijskooijman · 2018-01-12T21:15:25Z

Any plans to backport this fix to the 2.0.x releases? Or does this depend on significant rewrites that are only present in 2.1?

I tried backporting this to ~~2.0.8~~ (edit: I actually tried 2.0.7, I noticed later) (for Debian stretch), completely unhindered by any actual knowledge of how this code works of course, which got complex fast (my backport of #1888 also needed 6328fd9 to apply, then failed to compile because must_make_path() was missing, and ccb4cab looked over my head to backport [edit: Later I found it was not missing, but only defined in cgfsng.c, so I just needed to backport 04ad7ff too]). Then I tried backporting #1888 to 2.0.9 (since that one is also included in Debian already), which compiled without additional changes (other than fixing some trivial merge conflicts), but that did not solve the problem.

matthijskooijman · 2018-01-12T23:14:07Z

A bit more digging suggests that the fix backported to 2.0.9 did not work, because:

- The fix is only applied in the cgfsng backend, not the cgfs backend (is the latter intended to be removed at some point?)
- The cgfsng backend was not properly loading, so lxc fell back to the cgfs backend.

The problem with loading cgfsng was apparently that I had lxc.cgroup.use = @all, which seems to mean "all cgroups" for cgfs, but is interpreted as a literal cgroup name by cgfsng (which needs the option unset to mean "all cgroups", AFAICS). The manpage seems to only document the latter behaviour (and has never documented the former).

I wanted to check whether removing lxc.cgroup.use makes my containers work again, but I just lost access to the machine running this due to a network outage, so that will have to wait until later.
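
To see whether a container config still sets lxc.cgroup.use, something along these lines could be used. It is only a sketch: `uses_cgroup_use` is a hypothetical helper name, and the config path you pass in depends on your setup.

```shell
# uses_cgroup_use CONFIG_FILE
# Hypothetical helper: prints every uncommented lxc.cgroup.use line and
# succeeds only if at least one was found (grep's own exit status).
uses_cgroup_use() {
    grep -E '^[[:space:]]*lxc\.cgroup\.use[[:space:]]*=' "$1"
}
```

For example, `uses_cgroup_use /var/lib/lxc/test/config && echo "lxc.cgroup.use is set"`; commented-out lines are ignored on purpose.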

brauner · 2018-01-13T11:17:14Z

> The problem with loading cgfsng was apparently that I had lxc.cgroup.use = @all, which seems to mean "all cgroups" for cgfs, but is interpreted as a literal cgroup name by cgfsng (which needs the option unset to mean "all cgroups", AFAICS). The manpage seems to only document the latter behaviour (and has never documented the former).

Oh really? If so that'd be a bug. If you can show/reproduce this, please open a new issue.

matthijskooijman · 2018-01-14T14:32:28Z

I just confirmed that removing the lxc.cgroup.use config on my system makes lxc use cgfsng, and then the fix in #1888 works on top of 2.0.9 as well. I also again tried backporting #1888 on top of 2.0.7 (for Debian stretch), which additionally required 04ad7ff and 6328fd9, which also works on my system.

The relevant Debian bug about this issue is here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=875733

This breaks the "cgfsng" backend within lxc, which prevents the fix for lxc/lxc#1737 from working. It seems that the default, when this config value is not defined, is to use all cgroups anyway, so removing this should not break anything (see lxc/lxc#2084 about this).

corsac-s · 2018-11-20T13:20:37Z

I did some testing with the patch from
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=875733#31 on top of stretch
LXC (1:2.0.7-2+deb9u2). At first sight, it seems to work (I can start LXC with
lxc.cap.drop = sys_admin), but somehow I can't start multiple LXC correctly
anymore.

For example the sequence:

lxc-attach -n test "echo OK"
OK
lxc-start -n test2
lxc-attach -n test2 "echo OK"
OK
lxc-attach -n test "echo OK"
lxc-attach: cgroups/cgfsng.c: cgfsng_attach: 1830 No such file or directory - Failed to attach 14680 to /sys/fs/cgroup/systemd//lxc/www-1/cgroup.procs
lxc-attach: attach.c: lxc_attach: 992 Expected to receive sequence number 0: No such file or directory.

It might be some kind of race condition because it doesn't always happen with
two containers, sometimes it's three.

brauner self-assigned this Oct 30, 2017

brauner added the Bug label Oct 30, 2017

brauner mentioned this issue Oct 30, 2017

cgroups: enable container without CAP_SYS_ADMIN #1888

Merged

hallyn closed this as completed in #1888 Oct 30, 2017

matthijskooijman mentioned this issue Jan 14, 2018

Uncertainties about cgfsng support of lxc.cgroup.use config #2084

Closed
