archlinux in unprivileged lxc cannot start #1678

Closed
ss1h2a3tw opened this issue Jul 5, 2017 · 35 comments

Comments

@ss1h2a3tw
Contributor

ss1h2a3tw commented Jul 5, 2017

It seems that current mainline systemd performs some new operations that unprivileged LXC blocks.
It works fine in privileged LXC.
I can confirm that systemd 232-8 still works.

Required information

  • Distribution: archlinux
  • The output of
    • lxc-start --version 2.0.8
    • lxc-checkconfig all green
    • uname -a 4.11.7-1-userns (the linux-userns in AUR)
    • cat /proc/self/cgroup

10:memory:/lxc
9:perf_event:/lxc
8:freezer:/lxc
7:net_cls,net_prio:/lxc
6:devices:/lxc
5:cpuset:/lxc
4:blkio:/lxc
3:pids:/lxc
2:cpu,cpuacct:/lxc
1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
0::/user.slice/user-1000.slice/session-c1.scope

  • cat /proc/1/mounts

proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sys /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
dev /dev devtmpfs rw,nosuid,relatime,size=61761116k,nr_inodes=15440279,mode=755 0 0
run /run tmpfs rw,nosuid,nodev,relatime,mode=755 0 0
/dev/sda2 / btrfs rw,relatime,space_cache,subvolid=257,subvol=/root 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs rw,mode=755 0 0
cgroup /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
...

Issue description

Use the lxc-download template to download the archlinux amd64 image, then try to run it.

systemd's error

Welcome to Arch Linux!
Set hostname to .
Failed to read AF_UNIX datagram queue length, ignoring: No such file or directory
Failed to install release agent, ignoring: No such file or directory
Failed to create /init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object, freezing.
Freezing execution.

@brauner
Member

brauner commented Jul 5, 2017

Hm, I'm able to boot Archlinux just fine here even with a new systemd version. I might have to install it in a VM and try for myself.

@ss1h2a3tw
Contributor Author

ss1h2a3tw commented Jul 6, 2017

@brauner
https://drive.google.com/file/d/0B4G7EMVYmxcjS1FlOHdHWGdYbkk/view?usp=sharing
I have created a simple VM; the password for both root and lxc is "lxc".
There is a cgroup setup script and a startup script in the lxc user's home directory.
I have created an archlinux container named "test", and it cannot start, as described.

Here is how I built it:
  • Basic archlinux install.
  • Create the lxc user.
  • Install linux-userns from AUR (here is my prebuilt binary: https://user.paga.moe/ss1h2a3tw/pub/linux-userns-4.11.7-1-x86_64.pkg.tar.xz)
  • Install lxc from the archlinux mirror.
  • Configure subuid, subgid and the lxc default configs.
  • Use lxc-download to download archlinux.

@stgraber
Member

stgraber commented Jul 6, 2017

@brauner 172.17.16.238 from canonical-lxd

@stgraber
Member

stgraber commented Jul 6, 2017

It's got to do with unprivileged user starting the container vs root starting the container.
I confirmed that starting that container as root while still keeping the container unprivileged does get it to start properly.

lxc-start -n test -o debug -l debug -F -P /home/lxc/.local/share/lxc/

So it's got to be something in your cgroup setup that's making systemd unhappy. My guess is that it's related to the unified hierarchy and possibly has to do with your user not owning that part of the tree.

@stgraber
Member

stgraber commented Jul 6, 2017

I suspect just installing libpam-cgfs would fix your problem.

@stgraber
Member

stgraber commented Jul 6, 2017

Ah, it probably would if there was a version of it with the needed cherry-picks for the unified hierarchy...

@stgraber
Member

stgraber commented Jul 6, 2017

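#!/bin/sh
# Usage (as root): fix-cgroups <username> <shell-pid>
uid=$(id -u "${1}")
gid=$(id -g "${1}")
# cpuset needs clone_children so new child cgroups inherit cpus and mems.
echo 1 > /sys/fs/cgroup/cpuset/cgroup.clone_children
for cgroup in /sys/fs/cgroup/*; do
    slice="${cgroup}/user.slice/user-${uid}.slice"
    mkdir -p "${slice}"
    chown -R "${uid}:${gid}" "${slice}"

    # Move the shell into the new cgroup on every v1 hierarchy. The unified
    # (v2) hierarchy has no tasks file, so it is skipped.
    if [ "$(basename "${cgroup}")" != "unified" ]; then
        echo "${2}" > "${slice}/tasks"
    fi
done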

Run that as root, passing the username as the first argument and the shell's PID as the second. That sets up all your cgroups cleanly, at which point your container will happily start.
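To check what the script above did, one option is to print which cgroup a process sits in on each hierarchy. This is only a sketch: the function name cgroup_paths and its optional file argument are made up for illustration and are not part of LXC.

```shell
# Print "controllers path" for every hierarchy a process belongs to.
# $1: PID (or "self")
# $2: optional file to read instead of /proc/$1/cgroup (hypothetical
#     parameter, only there so the function can be tried on sample data)
cgroup_paths() (
    file="${2:-/proc/${1}/cgroup}"
    # Each line has the form  hierarchy-id:controllers:path
    while IFS=: read -r _id controllers path; do
        # The cgroup v2 entry has an empty controller field.
        [ -n "${controllers}" ] || controllers="unified"
        printf '%s %s\n' "${controllers}" "${path}"
    done < "${file}"
)
```

Calling cgroup_paths with the shell's PID after running the script should list a path under /user.slice/user-<uid>.slice for each v1 hierarchy, assuming the directory layout the script creates.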

@stgraber
Member

stgraber commented Jul 6, 2017

stgraber@castiana:~/Downloads$ ssh lxc@172.17.16.238 -J ubuntu@canonical-lxd.stgraber.org
lxc@172.17.16.238's password: 
Last login: Thu Jul  6 05:47:07 2017 from 76.10.144.41
[lxc@lxc-test ~]$ lxc-start -n test -F
lxc-start: cgroups/cgfs.c: lxc_cgroupfs_create: 909 Could not set clone_children to 1 for cpuset hierarchy in parent cgroup.
lxc-start: cgroups/cgfs.c: cgroup_rmdir: 209 Read-only file system - cgroup_rmdir: failed to delete /sys/fs/cgroup/blkio/
lxc-start: cgroups/cgfs.c: cgroup_rmdir: 209 Permission denied - cgroup_rmdir: failed to delete /sys/fs/cgroup/devices/user.slice
lxc-start: cgroups/cgfs.c: cgroup_rmdir: 209 Read-only file system - cgroup_rmdir: failed to delete /sys/fs/cgroup/memory/
lxc-start: cgroups/cgfs.c: cgroup_rmdir: 209 Read-only file system - cgroup_rmdir: failed to delete /sys/fs/cgroup/cpuset/
lxc-start: cgroups/cgfs.c: cgroup_rmdir: 209 Read-only file system - cgroup_rmdir: failed to delete /sys/fs/cgroup/perf_event/
lxc-start: cgroups/cgfs.c: cgroup_rmdir: 209 Read-only file system - cgroup_rmdir: failed to delete /sys/fs/cgroup/cpu,cpuacct/
lxc-start: cgroups/cgfs.c: cgroup_rmdir: 209 Read-only file system - cgroup_rmdir: failed to delete /sys/fs/cgroup/freezer/
lxc-start: cgroups/cgfs.c: cgroup_rmdir: 209 Permission denied - cgroup_rmdir: failed to delete /sys/fs/cgroup/pids/user.slice/user-1000.slice/session-c1.scope
lxc-start: cgroups/cgfs.c: cgroup_rmdir: 209 Read-only file system - cgroup_rmdir: failed to delete /sys/fs/cgroup/net_cls,net_prio/
lxc-start: cgroups/cgfs.c: cgroup_rmdir: 209 Permission denied - cgroup_rmdir: failed to delete /sys/fs/cgroup/systemd/user.slice/user-1000.slice/session-c1.scope
lxc-start: start.c: lxc_spawn: 1123 Failed creating cgroups.
lxc-start: start.c: __lxc_start: 1358 Failed to spawn container "test".
lxc-start: tools/lxc_start.c: main: 366 The container failed to start.
lxc-start: tools/lxc_start.c: main: 370 Additional information can be obtained by setting the --logfile and --logpriority options.


[lxc@lxc-test ~]$ echo $$
247
[lxc@lxc-test ~]$ su
Password: 
[root@lxc-test lxc]# /root/fix-cgroups lxc 247
[root@lxc-test lxc]# exit


[lxc@lxc-test ~]$ lxc-start -n test -F
systemd 232 running in system mode. (+PAM -AUDIT -SELINUX -IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
Detected virtualization lxc.
Detected architecture x86-64.

Welcome to Arch Linux!

Set hostname to <test>.
Failed to read AF_UNIX datagram queue length, ignoring: No such file or directory
Failed to install release agent, ignoring: No such file or directory
[  OK  ] Listening on Process Core Dump Socket.
[  OK  ] Listening on Device-mapper event daemon FIFOs.
[  OK  ] Listening on Network Service Netlink Socket.
user.slice: Failed to reset devices.list: Operation not permitted
user.slice: Failed to set invocation ID on control group /user.slice, ignoring: Operation not permitted
[  OK  ] Created slice User and Session Slice.
[  OK  ] Reached target Swap.
[  OK  ] Started Forward Password Requests to Wall Directory Watch.
[  OK  ] Reached target Remote File Systems.
[  OK  ] Listening on LVM2 metadata daemon socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Listening on Journal Socket.
system.slice: Failed to reset devices.list: Operation not permitted
system.slice: Failed to set invocation ID on control group /system.slice, ignoring: Operation not permitted
[  OK  ] Created slice System Slice.
systemd-remount-fs.service: Failed to reset devices.list: Operation not permitted
systemd-remount-fs.service: Failed to set invocation ID on control group /system.slice/systemd-remount-fs.service, ignoring: Operation not permitted
         Starting Remount Root and Kernel File Systems...
tmp.mount: Failed to reset devices.list: Operation not permitted
tmp.mount: Failed to set invocation ID on control group /system.slice/tmp.mount, ignoring: Operation not permitted
         Mounting Temporary Directory...
system-getty.slice: Failed to reset devices.list: Operation not permitted
system-getty.slice: Failed to set invocation ID on control group /system.slice/system-getty.slice, ignoring: Operation not permitted
[  OK  ] Created slice system-getty.slice.
dev-mqueue.mount: Failed to reset devices.list: Operation not permitted
dev-mqueue.mount: Failed to set invocation ID on control group /system.slice/dev-mqueue.mount, ignoring: Operation not permitted
         Mounting POSIX Message Queue File System...
[  OK  ] Reached target Slices.
system-container\x2dgetty.slice: Failed to reset devices.list: Operation not permitted
system-container\x2dgetty.slice: Failed to set invocation ID on control group /system.slice/system-container\x2dgetty.slice, ignoring: Operation not permitted
[  OK  ] Created slice system-container\x2dgetty.slice.
systemd-sysctl.service: Failed to reset devices.list: Operation not permitted
systemd-sysctl.service: Failed to set invocation ID on control group /system.slice/systemd-sysctl.service, ignoring: Operation not permitted
         Starting Apply Kernel Variables...
[  OK  ] Started Dispatch Password Requests to Console Directory Watch.
[  OK  ] Reached target Paths.
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Listening on Journal Socket (/dev/log).
systemd-journald.service: Failed to reset devices.list: Operation not permitted
systemd-journald.service: Failed to set invocation ID on control group /system.slice/systemd-journald.service, ignoring: Operation not permitted
         Starting Journal Service...
dev-random.mount: Failed to reset devices.list: Operation not permitted
dev-tty4.mount: Failed to reset devices.list: Operation not permitted
dev-ptmx.mount: Failed to reset devices.list: Operation not permitted
dev-null.mount: Failed to reset devices.list: Operation not permitted
dev-tty.mount: Failed to reset devices.list: Operation not permitted
dev-zero.mount: Failed to reset devices.list: Operation not permitted
dev-tty6.mount: Failed to reset devices.list: Operation not permitted
dev-urandom.mount: Failed to reset devices.list: Operation not permitted
dev-full.mount: Failed to reset devices.list: Operation not permitted
sys-devices-virtual-net.mount: Failed to reset devices.list: Operation not permitted
dev-tty5.mount: Failed to reset devices.list: Operation not permitted
proc-sys-net.mount: Failed to reset devices.list: Operation not permitted
-.mount: Failed to reset devices.list: Operation not permitted
dev-tty1.mount: Failed to reset devices.list: Operation not permitted
dev-tty3.mount: Failed to reset devices.list: Operation not permitted
proc-sysrq\x2dtrigger.mount: Failed to reset devices.list: Operation not permitted
dev-tty2.mount: Failed to reset devices.list: Operation not permitted
init.scope: Failed to reset devices.list: Operation not permitted
[  OK  ] Mounted Temporary Directory.
[  OK  ] Started Remount Root and Kernel File Systems.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
[  OK  ] Started Apply Kernel Variables.
[  OK  ] Started Journal Service.
         Starting Flush Journal to Persistent Storage...
[  OK  ] Started Flush Journal to Persistent Storage.
         Starting Create Volatile Files and Directories...
[  OK  ] Started Create Volatile Files and Directories.
         Starting Update UTMP about System Boot/Shutdown...
[  OK  ] Started Update UTMP about System Boot/Shutdown.
[  OK  ] Reached target System Initialization.
[  OK  ] Started Daily rotation of log files.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Started Daily verification of password and group files.
[  OK  ] Started Daily man-db cache update.
[  OK  ] Started Daily Cleanup of Temporary Directories.
[  OK  ] Reached target Timers.
[  OK  ] Reached target Basic System.
[  OK  ] Started D-Bus System Message Bus.
         Starting Network Service...
         Starting Login Service...
[  OK  ] Started Network Service.
[  OK  ] Reached target Network.
         Starting Permit User Sessions...
         Starting Network Name Resolution...
[  OK  ] Started Login Service.
[  OK  ] Started Permit User Sessions.
[  OK  ] Started Getty on lxc/tty2.
[  OK  ] Started Getty on lxc/tty6.
[  OK  ] Started Getty on lxc/tty3.
[  OK  ] Started Container Getty on /dev/pts/4.
[  OK  ] Started Container Getty on /dev/pts/1.
[  OK  ] Started Getty on lxc/tty4.
[  OK  ] Started Container Getty on /dev/pts/2.
[  OK  ] Started Console Getty.
[  OK  ] Started Getty on lxc/tty1.
[  OK  ] Started Container Getty on /dev/pts/3.
[  OK  ] Started Container Getty on /dev/pts/0.
[  OK  ] Started Getty on lxc/tty5.
[  OK  ] Started Container Getty on /dev/pts/5.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started Network Name Resolution.
[  OK  ] Reached target Multi-User System.

Arch Linux 4.11.7-1-userns (console)

test login: 

@stgraber
Member

stgraber commented Jul 6, 2017

So not actually an LXC bug, the issue was a bad cgroup setup on the host which was preventing systemd in the container from mounting the "unified" controller (which is what changed in new systemd).

If you configure your host so that your user owns every one of the controllers that are used, then the container starts up properly.
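A rough way to spot hierarchies where that ownership is missing is sketched below. The helper name unowned_dirs is hypothetical, and it relies on GNU stat (-c %u), which is an assumption about the host.

```shell
# Print every directory from the argument list that is NOT owned by a UID.
# $1: numeric UID; remaining arguments: directories to inspect.
# Directories that do not exist are silently skipped.
unowned_dirs() (
    uid="${1}"
    shift
    for dir in "$@"; do
        # GNU stat prints the numeric owner with -c %u.
        owner=$(stat -c %u "${dir}" 2>/dev/null) || continue
        [ "${owner}" = "${uid}" ] || printf '%s\n' "${dir}"
    done
)
```

For example, unowned_dirs "$(id -u)" /sys/fs/cgroup/*/user.slice/user-"$(id -u)".slice prints nothing when the user owns their slice everywhere, assuming the user.slice layout used earlier in this thread.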

@stgraber stgraber closed this as completed Jul 6, 2017
@brauner
Member

brauner commented Jul 6, 2017

Sorry, I was off yesterday. :) Yes, that's exactly what I suspected. The approach @stgraber outlined is OK for now but problematic in the long run: it currently only works with the unified hierarchy because it is the empty hierarchy, i.e. there are no controllers enabled in it. Also, I'd be interested in the full trace log for the booting container that @stgraber just started after making those cgroup changes. The fact that it looks like the cgfs driver is used is weird. Actually, the cgfsng driver should be used and should just work fine.

@stgraber
Member

stgraber commented Jul 6, 2017

@brauner VM is still up, you can go poke at it :)

@ss1h2a3tw
Contributor Author

ss1h2a3tw commented Jul 13, 2017

@stgraber sorry, I found that I made a mistake: the lxc-download template downloads an archlinux rootfs with systemd 232.
After doing:
sudo cp /var/cache/pacman/pkg/systemd* ~/.local/share/lxc/test/rootfs/root/
lxc-attach -n test
pacman -U /root/systemd-*
the container will not start, even after executing fixcgroup.sh.

@ss1h2a3tw
Contributor Author

@stgraber
It seems that systemd 233 changed a lot of its behavior around the unified cgroup controller.
Maybe it is related to this problem?
systemd/systemd@a1e2ef7...d60c527

@ss1h2a3tw
Contributor Author

ss1h2a3tw commented Jul 14, 2017

@stgraber
I found that with the rootfs from lxc-download, after the container has started (after executing fixcgroup), running
lxc-attach -n test
ls /sys/fs/cgroup
outputs
blkio cpu cpuacct cpu,cpuacct cpuset devices freezer memory net_cls net_cls,net_prio net_prio perf_event pids systemd
There is no unified in there.

and in privileged lxc with systemd 233
the output is
blkio cpu cpuacct cpu,cpuacct cpuset devices freezer memory net_cls net_cls,net_prio net_prio perf_event pids systemd unified
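Listing the directory only shows names, so a more direct comparison is to ask the mount table whether a cgroup2 filesystem is mounted at all. A small sketch, with a made-up function name cgroup2_mounts:

```shell
# Print the mount point of every cgroup2 filesystem in a mounts file.
# $1: optional mounts file, defaults to /proc/self/mounts.
cgroup2_mounts() (
    # Fields per line: source, mount point, filesystem type, options, ...
    while read -r _src target fstype _rest; do
        if [ "${fstype}" = "cgroup2" ]; then
            printf '%s\n' "${target}"
        fi
    done < "${1:-/proc/self/mounts}"
)
```

Run inside each container, empty output would mean that no unified hierarchy is mounted there, which is what the first listing above suggests for the unprivileged case.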

@jjb2016

jjb2016 commented Jul 20, 2017

This is still a problem for me. I can't update systemd in any of my unprivileged containers as it breaks the container. I don't really understand cgroups / systemd / unprivileged containers and how they interact with each other (I have some reading to do!). So I don't know which piece of software is at fault here. The issue is not resolved. Why is this thread closed? Is there a solution in here that I've missed?

Also, I have just created a new unprivileged "test" container using the lxc-download template. The container starts up with no problem, but it is using systemd 232-8. When I upgrade it to systemd 233.75-3 it breaks. I can't even stop the container: after executing the lxc-stop -n test command it just sits there doing nothing and the command never completes. Perhaps I haven't been waiting long enough, but I've been having to reboot the host machine to get it to stop.

@evverx
Contributor

evverx commented Jul 24, 2017

By the way, the only thing that prevents systemd from "freezing" on Ubuntu is AppArmor, which doesn't allow systemd to mount cgroup2:

AVC apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/sys/fs/cgroup/unified/" pid=1925 comm="systemd" fstype="cgroup2" srcname="cgroup" flags="rw, nosuid, nodev, noexec"

Adding lxc.aa_profile = unconfined to the config of a container can be used to reproduce the issue.

There is systemd/systemd#6408, but I'm wondering which part of lxc doesn't set up /sys/fs/cgroup/unified/.../session-*.scope for use by unprivileged containers. Does anybody know if there is an issue about that?

@brauner
Member

brauner commented Jul 24, 2017

@evverx, yeah, I had some other work to finish first but I'm going to take care of this likely during this week. :)

@ss1h2a3tw
Contributor Author

@jjb2016
lxc.init_cmd = /sbin/init systemd.legacy_systemd_cgroup_controller=yes
in the config file will be a temporary workaround
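If several containers need the workaround, appending the line by script avoids duplicates. This is only a sketch: the function name is made up, and the config path has to be supplied by the caller (for an unprivileged container it usually lives under ~/.local/share/lxc/<name>/config).

```shell
# Append the workaround line to an LXC config file unless it is already there.
# $1: path to the container's config file.
add_legacy_cgroup_workaround() (
    line='lxc.init_cmd = /sbin/init systemd.legacy_systemd_cgroup_controller=yes'
    # -x matches whole lines only, -F treats the pattern as a fixed string.
    grep -qxF "${line}" "${1}" 2>/dev/null || printf '%s\n' "${line}" >> "${1}"
)
```

The container has to be restarted for the new init command to take effect.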

@jjb2016

jjb2016 commented Jul 24, 2017

This is great guys - thanks a lot for taking notice of this!

@brauner brauner reopened this Jul 26, 2017
brauner pushed a commit to brauner/lxc that referenced this issue Jul 26, 2017
Closes lxc#1669.
Closes lxc#1678.
Relates to systemd/systemd#6408.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
@brauner
Member

brauner commented Jul 26, 2017

Hey everyone, I just sent a branch to implement support for the empty cgroup v2 hierarchy. Note, for unprivileged containers run by unprivileged users two conditions must be met:

  • the current cgroup of the unprivileged user on the host in the cgroup v2 hierarchy must be writeable by the unprivileged user on the host, i.e. in the usual case this means that it must be chown()ed to the unprivileged user's uid and gid
  • the unprivileged user's current cgroup's cgroup.procs file in the cgroup v2 hierarchy must be writeable

The second condition is caused by the specific delegation model the cgroup v2 hierarchy implements. A visual guide to what I mean is:

> cat /proc/self/cgroup | grep "0::"
0::/user.slice/user-1000.slice/session-1.scope

where the line 0::/user.slice/user-1000.slice/session-1.scope indicates the current cgroup of my user in the cgroup v2 hierarchy. This cgroup must at least look like this:

> ls -al | grep session-1.scope
drwxr-xr-x 4 chb  chb  0 Jul 26 15:26 session-1.scope

and

> ls -al | grep "cgroup.procs"
-rw-r--r-- 1 chb  chb  0 Jul 26 15:26 cgroup.procs

If you have a recent enough version of LXCFS installed then our pam module will allow you to take care of this by placing the following line in the corresponding configuration folder:

session	optional	pam_cgfs.so -c freezer,memory,name=systemd,unified
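The two conditions above can be checked by hand with something like the following sketch. Both function names are hypothetical, and the mount point of the cgroup v2 hierarchy has to be passed in because it differs between hosts (/sys/fs/cgroup/unified in the mount table from the original report).

```shell
# Extract the cgroup v2 path (the "0::" entry) from a cgroup file.
# $1: optional file, defaults to /proc/self/cgroup.
v2_cgroup_path() (
    sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
)

# Succeed only if the calling user can write to the cgroup directory
# and to its cgroup.procs file.
# $1: mount point of the cgroup v2 hierarchy
# $2: cgroup path inside it, as printed by v2_cgroup_path
# Note: run this as the unprivileged user; root passes the -w tests anyway.
v2_delegation_ok() (
    dir="${1}${2}"
    [ -d "${dir}" ] && [ -w "${dir}" ] && [ -w "${dir}/cgroup.procs" ]
)
```

A possible call is v2_delegation_ok /sys/fs/cgroup/unified "$(v2_cgroup_path)" && echo ok, which stays silent when one of the two conditions is not met.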

stgraber pushed a commit that referenced this issue Aug 14, 2017
Closes #1669.
Closes #1678.
Relates to systemd/systemd#6408.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
@qknight

qknight commented Nov 10, 2017

i think we still got this issue on NixOS in our nixcloud-container abstraction:

<<< NixOS Stage 2 >>>                                                                                                                       

running activation script...                            
setting up /etc...               
mount: /dev: permission denied.                         
mount: /dev/pts: permission denied.           
mount: /proc: permission denied.                      
NOTE: Under Linux, effective file capabilities must either be empty, or
      exactly match the union of selected permitted and inheritable bits.
Failed to set capabilities on file `/run/wrappers/wrappers.ow2HGqwU8c/ping' (Operation not permitted)
starting systemd...                 
systemd 234 running in system mode. (+PAM +AUDIT -SELINUX +IMA +APPARMOR -SMACK -SYSVINIT +UTMP -LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN2 -IDN default-hierarchy=hybrid)
Detected virtualization lxc.                  
Detected architecture x86-64.                        

Welcome to NixOS 18.03pre119292.cfafd6f5a8 (Impala)!   

Set hostname to <v34>.                              
Failed to read AF_UNIX datagram queue length, ignoring: No such file or directory
[  OK  ] Started Dispatch Password Requests to Console Directory Watch.
[UNSUPP] Starting of proc-sys-fs-binfmt_misc.automount not supported.
systemd-journald-audit.socket: Failed to listen on sockets: Operation not permitted
[FAILED] Failed to listen on Journal Audit Socket.
See 'systemctl status systemd-journald-audit.socket' for details.
systemd-journald-audit.socket: Unit entered failed state.
[  OK  ] Reached target All Network Interfaces (deprecated).
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Reached target Swap.                                                             
[  OK  ] Listening on Journal Socket (/dev/log).
[  OK  ] Reached target Remote File Systems.
user.slice: Failed to set invocation ID on control group /user.slice, ignoring: Operation not permitted
[  OK  ] Created slice User and Session Slice.                                         
[  OK  ] Listening on Journal Socket. 
system.slice: Failed to set invocation ID on control group /system.slice, ignoring: Operation not permitted
[  OK  ] Created slice System Slice.           
system-container\x2dgetty.slice: Failed to set invocation ID on control group /system.slice/system-container\x2dgetty.slice, ignoring: Operation not permitted
[  OK  ] Created slice system-container\x2dgetty.slice.
system-getty.slice: Failed to set invocation ID on control group /system.slice/system-getty.slice, ignoring: Operation not permitted
[  OK  ] Created slice system-getty.slice.
[  OK  ] Reached target Slices.       
systemd-sysctl.service: Failed to set invocation ID on control group /system.slice/systemd-sysctl.service, ignoring: Operation not permitted
         Starting Apply Kernel Variables...
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
systemd-update-utmp.service: Failed to set invocation ID on control group /system.slice/systemd-update-utmp.service, ignoring: Operation not permitted
         Starting Update UTMP about System Boot/Shutdown...          
systemd-journald.service: Failed to set invocation ID on control group /system.slice/systemd-journald.service, ignoring: Operation not permitted
         Starting Journal Service...  
systemd-journal-catalog-update.service: Failed to set invocation ID on control group /system.slice/systemd-journal-catalog-update.service, ignoring: Operation not permitted
         Starting Rebuild Journal Catalog...
[  OK  ] Started Forward Password Requests to Wall Directory Watch.
[  OK  ] Reached target Paths.
[  OK  ] Started Apply Kernel Variables.

if someone wants to see the configuration, here it is:

lxc.utsname = v34
# Fixme also support other architectures?
lxc.arch = x86_64
# Not needed, just spares a few CPU cycles as LXC doesn't have
# to detect the backend.
lxc.rootfs.backend = dir
lxc.rootfs = /var/lib/lxc/v34/rootfs
# Ensures correct functionality with user namespaces. Since mknod is not possible stuff like
# /dev/console, /dev/tty, /dev/urandom, etc. need to be bind mounted. Note the order
# of the file inclusion here is important.
lxc.include = /nix/store/035kw10gm7786acxd81yxsv235j0n3hf-lxc-2.1.0/share/lxc/config/common.conf
lxc.include = /nix/store/035kw10gm7786acxd81yxsv235j0n3hf-lxc-2.1.0/share/lxc/config/userns.conf
## Network
# see also https://wiki.archlinux.org/index.php/Linux_Containers
lxc.network.type = veth
lxc.network.name = eth0
lxc.network.ipv4 = 10.101.0.2
lxc.network.ipv4.gateway = 10.101.0.1
#FIXME create option for this
lxc.network.flags = up
lxc.network.link = brNC
# Specify {u,g}id mapping.
lxc.id_map = u 0 100000 65536
lxc.id_map = g 0 100000 65536
# FIXME apparmor support
# Nixos does not provide AppArmor support.
lxc.aa_profile = unconfined
lxc.aa_allow_incomplete = 1
# Tweaks for systemd.
lxc.autodev = 1
#lxc.kmsg = 0
# Additional mount entries.
lxc.mount.entry = /nix/store nix/store none defaults,bind,ro 0 0
# Mount entries that lead to a cleaner boot experience.
lxc.mount.entry = /sys/kernel/debug sys/kernel/debug none bind,optional 0 0
lxc.mount.entry = /sys/kernel/security sys/kernel/security none bind,optional 0 0
lxc.mount.entry = /sys/fs/pstore sys/fs/pstore none bind,optional 0 0
lxc.mount.entry = mqueue dev/mqueue mqueue rw,relatime,create=dir,optional 0 0
# LXC autostart
lxc.start.auto = 0
lxc.rootfs.path = dir:/var/lib/lxc/v34/rootfs

@stgraber the script you proposed on #1678 (comment) does not work for us as we start the lxc-start as root user. now i wonder if that is a bad idea and if we should always start lxc-start with a 'normal' user. any thoughts on that?

@ss1h2a3tw does your workaround degrade security?

@brauner what you wrote in #1678 (comment) we don't use LXCFS at all IIRC but you seem to execute your stuff as normal user 'chb' as well. is this a must? also you added your change into 2.1 which is what we use so i had assumed that this error shouldn't have appeared after all.

help?

@brauner
Member

brauner commented Nov 10, 2017

@stgraber the script you proposed on #1678 (comment) does not work for us as we start the lxc-start as root user. now i wonder if that is a bad idea and if we should always start lxc-start with a 'normal' user. any thoughts on that?

It seems you're starting an unprivileged container as root. That is perfectly fine. It means the setup process for the container runs as root but the container itself runs unprivileged.

@ss1h2a3tw does your workaround degrade security?

That's just an argument to systemd itself and it doesn't degrade security. It's a valid boot option that systemd itself exposes.

@brauner what you wrote in #1678 (comment) we don't use LXCFS at all IIRC but you seem to execute your stuff as normal user 'chb' as well. is this a must? also you added your change into 2.1 which is what we use so i had assumed that this error shouldn't have appeared after all.

If you start the container as root the pam module LXCFS ships is not needed and won't be used by LXC so that's ok.

@qknight, the boot you're showing seems fine to me. What exactly is the problem you're observing? Could you please boot the container with:

lxc-start -n <container-name> -l trace -o <container-name>.log

and append/paste the contents of <container-name>.log here? Once you started the container, please retrieve the init pid of the container's init process via lxc-info -n <container-name> and show the output of:

cat /proc/<init-pid>/cgroup
cat /proc/self/cgroup
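The last two commands can be wrapped so the output is labelled, as in this sketch. The function name dump_cgroups and its optional second parameter are illustrative only. The init PID is assumed to come from lxc-info -n <container-name> -p -H, which should print just the number.

```shell
# Print the cgroup membership of a container's init process and of the caller.
# $1: init PID of the container
# $2: optional proc root, defaults to /proc (hypothetical parameter that
#     lets the function be tried against a fake directory tree)
dump_cgroups() (
    proc="${2:-/proc}"
    for who in "${1}" self; do
        printf '== %s/%s/cgroup ==\n' "${proc}" "${who}"
        cat "${proc}/${who}/cgroup"
    done
)
```

Redirecting the output to a file makes it easy to attach next to the trace log.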

@qknight

qknight commented Nov 13, 2017

@brauner i think my problem is completely different from the OP's posting. i've just seen the same error messages and assumed it was the same. i'm just concerned about:

Failed to set invocation ID on control group /system.slice/systemd-update-utmp.service, ignoring: Operation not permitted

is that a different problem?

@evverx
Contributor

evverx commented Nov 13, 2017

This should be fixed once poettering/systemd@f0f0fe3 is merged. It would be great if you could check that systemd/systemd#7246 works for you.

evverx added a commit to evverx/systemd that referenced this issue Nov 21, 2017
…roup/unified`

It's possible for `systemd` inside an unprivileged user namespace container
to be able to mount `cgroup2` on `/sys/fs/cgroup/unified` without being able
to create directories there.  When this happens, `systemd` fails to boot, making
it impossible to reexecute itself without restarting the container runtime.

In this patch the issue is avoided by trying to create a temporary directory
after mounting `cgroup2` and falling back to `v1` if `mkdir` fails.

Closes systemd#6408 and lxc/lxc#1678.
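The idea in that commit message can be illustrated in shell. This is not the systemd patch itself, which is C code inside systemd; the function name probe_cgroup2 and the labels it prints are invented for the sketch.

```shell
# Decide which cgroup setup to use for a directory where cgroup2 is mounted.
# Prints "unified" if a scratch directory can be created there and
# "legacy" otherwise, mirroring the mkdir probe described above.
# $1: mount point to probe
probe_cgroup2() (
    probe="${1}/.probe-$$"
    if mkdir "${probe}" 2>/dev/null; then
        # Creation worked, so clean up and keep using the v2 hierarchy.
        rmdir "${probe}"
        printf 'unified\n'
    else
        # Mounted but not writable: fall back to v1.
        printf 'legacy\n'
    fi
)
```

Probing with a real mkdir instead of checking permission bits matters here because, in an unprivileged user namespace, the mount can succeed while directory creation is still refused.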
@qknight

qknight commented Nov 21, 2017

@evverx thanks for your help! i tried to apply all patches on 2.34 you linked from the poettering PR but it failed. then i tried to apply systemd/systemd@d3070fb only, which also failed.

should i use the 'master' version for testing instead? i've also checked what contains the patch d3070fbdf6077d7da9dbafa198fff8dea712d2ff, but only master has it.

@brauner
Member

brauner commented Nov 21, 2017

> @brauner i think my problem is completely different from the OP's posting. i've just seen the same error messages and assumed it was the same. i'm just concerned about:

As said before, it looks like your container is starting just fine. If not, please, as requested before, append the required debug output from your container.

> Failed to set invocation ID on control group /system.slice/systemd-update-utmp.service, ignoring: Operation not permitted
>
> is that a different problem?

That shouldn't be fatal and shouldn't really be a problem for your container.

@qknight

qknight commented Nov 21, 2017

@brauner YES, you are right. the container starts and my problem isn't the OP's problem. sorry for the thread hijacking.

@brauner
Member

brauner commented Nov 21, 2017

@qknight, np. So the utmp service should show you a clean exit status when you do systemctl status on it. systemd seems to ignore it when it gets EPERM.

@evverx
Contributor

evverx commented Nov 21, 2017

@qknight, I think that the log level of those messages should be simply downgraded to debug so as not to intimidate people. I've asked whether it makes any sense in systemd/systemd#4441 (comment).

@qknight

qknight commented Nov 21, 2017

@brauner regarding the utmp service i have this:

systemctl status systemd-update-utmp.service
● systemd-update-utmp.service - Update UTMP about System Boot/Shutdown
   Loaded: loaded (/nix/store/hxaj535hm6p7gi24zv2k2ifvqadc3js2-systemd-234/example/systemd/system/systemd-update-utmp.service; enabled; vendor preset: enabled)
  Drop-In: /nix/store/9hhz80d2lf6byxwmhd6yy5pm0wc142h4-system-units/systemd-update-utmp.service.d
           └─overrides.conf
   Active: active (exited) since Tue 2017-11-14 11:52:48 UTC; 1 weeks 0 days ago
     Docs: man:systemd-update-utmp.service(8)
           man:utmp(5)
  Process: 205 ExecStart=/nix/store/hxaj535hm6p7gi24zv2k2ifvqadc3js2-systemd-234/lib/systemd/systemd-update-utmp reboot (code=exited, status=0/SUCCESS)
 Main PID: 205 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/systemd-update-utmp.service

Nov 14 11:52:48 v34 systemd[1]: Started Update UTMP about System Boot/Shutdown.
Nov 14 11:52:48 v34 systemd[1]: systemd-update-utmp.service: Failed to set invocation ID on control group /system.slice/systemd-update-utmp.service, ignoring: Operation not permitted

@brauner
Member

brauner commented Nov 21, 2017 via email

@qknight

qknight commented Nov 21, 2017

@brauner @evverx thanks for your support! awesome!

brauner pushed a commit to brauner/systemd that referenced this issue Nov 22, 2017
When systemd is running inside a container employing user
namespaces it currently mounts the unified cgroup hierarchy
without being able to write to it. This causes systemd to
freeze during boot.
This patch checks whether the unified cgroup hierarchy
is writable. If it is not it will not mount it.

This solution is based on a patch by Evgeny Vereshchagin.

Closes systemd#6408.
Closes lxc/lxc#1678.
@marcosps
Contributor

When lxc uses the cgfsng cgroup driver, lxc-start works as expected, but cgfsng fails for an unprivileged container, so the cgfs driver is used instead.

When a new container is started as the root user, cgfsng works and the container starts as expected.

Here is the output of lxc-ls with cgfsng debug enabled. While the unprivileged start fails in cgfsng, the root user prints a handful of hierarchies. Does this help to debug the problem further, @brauner?

root_cgfsng_output.txt
user_cgfsng_output.txt

@brauner
Member

brauner commented Dec 21, 2017

@marcosps, your issue is not really related to this one. Let's move the discussion to #1998.

@kulak

kulak commented May 1, 2020

#!/bin/sh
# Usage (as root): pass the username as $1 and the shell's PID as $2.
uid=$(id -u "$1")
gid=$(id -g "$1")
# Let newly created child cpusets inherit the parent's settings.
echo 1 > /sys/fs/cgroup/cpuset/cgroup.clone_children
for cgroup in /sys/fs/cgroup/*; do
    slice="${cgroup}/user.slice/user-${uid}.slice"
    mkdir -p "${slice}"
    chown -R "${uid}:${gid}" "${slice}"

    # The unified (cgroup2) hierarchy has no "tasks" file, so skip it.
    if [ "$(basename "${cgroup}")" != "unified" ]; then
        echo "$2" > "${slice}/tasks"
    fi
done

Run that as root, passing the username as the first argument and the shell's PID as the second argument. That will set up all your cgroups cleanly, at which point your container will happily start.

This script really works in my environment, which is current Manjaro with an Ubuntu container installed with:

lxc-create -n tuman -t download -- --dist ubuntu --release focal --arch amd64

But, I have to rerun the script after every reboot.
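
The cgroup hierarchies are virtual filesystems, so the directories the script creates do not survive a reboot. A small sketch of a check for whether the per-user slice is currently in place, using the same path layout as the script above. The function name `slice_owned_by` is made up for illustration, and `stat -c` assumes GNU coreutils.

```sh
#!/bin/sh
# Hypothetical check: succeed if the per-user slice directory exists under
# the given cgroup hierarchy and is owned by the given numeric UID.
slice_owned_by() {
    root=$1
    uid=$2
    slice="${root}/user.slice/user-${uid}.slice"
    [ -d "${slice}" ] && [ "$(stat -c %u "${slice}")" = "${uid}" ]
}
```

For example, `slice_owned_by /sys/fs/cgroup/memory "$(id -u)"` returns non-zero after a fresh boot until the setup script has been run again.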

I have never dealt with cgroups, so I am not sure what I should do to preserve the configuration properly. Thank you

By the way, I am managing lxc without installation of lxd.
