Read-only file system (--init=systemd) [cgroupv2 not supported yet] #349

Closed
archbung opened this issue Apr 24, 2021 · 24 comments

Comments

@archbung

Hi,

I was trying to run Steam like so:

x11-docker steam --init=systemd --gpu --pulseaudio --home=/home/archbung/.local/share/Steam -V

where steam is a Docker image built using the following Dockerfile:

FROM ubuntu:20.10

ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Europe/Berlin

# Update and install packages
RUN dpkg --add-architecture i386 \
    && apt-get update -y \
    && apt-get install -y gdebi \
    libc6:i386 \
    libgl1-mesa-dri:i386 \
    libgl1:i386 \
    pciutils \
    wget \
    xdg-desktop-portal \
    xdg-desktop-portal-gtk \
    xdg-utils \
    xterm

WORKDIR /tmp

RUN wget http://media.steampowered.com/client/installer/steam.deb && gdebi -n steam.deb
CMD ["steam"]

However, x11docker terminated with the following error

Welcome to Ubuntu 20.10!

Set hostname to <ba7666b47c2c>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

Could you give me some tips on troubleshooting this issue? The full x11docker.log can be found here.

Cheers,

@eine
Contributor

eine commented Apr 24, 2021

x11docker runs with an unprivileged user by default. If you want to run with user root (the default in docker run), you need to tell it explicitly. See --user and --sudouser in https://github.com/mviereck/x11docker#security.

@mviereck
Owner

mviereck commented Apr 24, 2021

Set hostname to <ba7666b47c2c>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

This is a new issue I recently detected, too.
It originates in a change of systemd and a new cgroup setup, cgroupv2. This causes --init=systemd to fail.
Currently there is no usable solution: https://serverfault.com/questions/1053187/systemd-fails-to-run-in-a-docker-container-when-using-cgroupv2-cgroupns-priva

For your setup, try without --init=systemd.

@mviereck mviereck added the bug label Apr 24, 2021
@mviereck mviereck changed the title Read-only file system Read-only file system (--init=systemd) Apr 24, 2021
@archbung
Author

@eine I forgot to mention that my user is part of the docker group, so I can already run docker commands without sudo.

@archbung
Author

@mviereck Without --init=systemd, the container runs. However, none of the games are working so far (even though they should, judging from their ProtonDB pages). I'll look into whether this is related to turning off --init=systemd.

@mviereck
Owner

I found a workaround to get back the old behaviour.

You can set a kernel boot option in GRUB to enforce cgroupv1 for systemd:

systemd.unified_cgroup_hierarchy=0

@grigio

grigio commented Jun 8, 2021

Thanks

GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"

in /etc/default/grub fixed the issue

@mviereck
Owner

GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"
in /etc/default/grub fixed the issue

Just want to add that sudo update-grub is required afterwards.
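
The edit to /etc/default/grub can also be scripted. The following is only a minimal sketch: the helper name add_grub_arg is made up for illustration, and it assumes GNU sed and a file that already contains a double-quoted GRUB_CMDLINE_LINUX="..." line.

```shell
# Hypothetical helper (not part of x11docker): append a kernel argument to
# GRUB_CMDLINE_LINUX in a GRUB defaults file. Assumes GNU sed and that the
# file already contains a double-quoted GRUB_CMDLINE_LINUX="..." line.
add_grub_arg() {
    file=$1
    arg=$2
    # Nothing to do if the argument is already on that line.
    if grep -q "^GRUB_CMDLINE_LINUX=.*$arg" "$file"; then
        return 0
    fi
    # Insert the argument just before the closing quote of the value.
    sed -i "s|^\(GRUB_CMDLINE_LINUX=\"[^\"]*\)\"|\1 $arg\"|" "$file"
}
```

Run as root, a call like add_grub_arg /etc/default/grub systemd.unified_cgroup_hierarchy=0 would be followed by update-grub and a reboot. Keeping a backup copy of the file first is advisable.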

@mviereck mviereck changed the title Read-only file system (--init=systemd) Read-only file system (--init=systemd) [cgroupv2] Jan 2, 2022
@mviereck mviereck removed the bug label Jan 9, 2022
@mviereck mviereck changed the title Read-only file system (--init=systemd) [cgroupv2] Read-only file system (--init=systemd) [cgroupv2 not supported yet] Jan 9, 2022
@mviereck
Owner

mviereck commented Feb 4, 2022

Curiously, the issue does not appear in Debian bullseye, even without setting the kernel option in GRUB. Maybe some smart fix on the systemd side?

Edit:
Somewhere I've seen a message mentioning something like a hybrid system; maybe cgroupv1 and cgroupv2 can be used side by side now.
It's worth noting that I've updated most of my images to Debian bullseye as well. Having a more recent systemd in the image might make a difference.

@lukts30

lukts30 commented Feb 8, 2022

The issue is still present on a minimal untweaked Arch Linux install with podman and crun (cgroupv2 only).

But it can be easily fixed without modifying the grub cmdline.
Just passing --systemd=always to podman makes the issue disappear.

Does not work:
x11docker --xephyr --desktop --init=systemd localhost/ubuntu:gnome

Does work:
x11docker --xephyr --desktop --init=systemd -- --systemd=always -- localhost/ubuntu:gnome

Blog post about --systemd=always and podman cgroupv2

@mviereck
Owner

mviereck commented Feb 9, 2022

Thank you! Quite interesting that podman allows systemd with option --systemd. I'll include that in x11docker, so at least for podman there is a clean solution.

The blog post is helpful, but unfortunately does not describe exactly what podman does, vaguely speaking of some cgroup setup:

First, we want Podman to run systemd inside a container. Running systemd in a container requires Podman to set up certain mounts required by systemd. For instance, tmpfs mounts on /run, /run/lock, /tmp, and /var/log/journald, plus there is some configuration of /sys/fs/cgroup (depending on whether the system is in cgroup V1 or V2 mode). Podman does this automatically if the entry point of the container is either /usr/sbin/init or /usr/sbin/systemd. You can also use the --systemd=always flag on the command line.

man podman-run says nothing about the cgroup setup at all:

   --systemd=true|false|always
       Run container in systemd mode. The default is true.

       The value always enforces the systemd mode is enforced without looking at the executable
       name.  Otherwise, if set to true and the command you are running inside the container is
       systemd, /usr/sbin/init, /sbin/init or /usr/local/sbin/init.

       If the command you are running inside of the container  is  systemd  Podman  will  setup
       tmpfs mount points in the following directories:

              • /run

              • /run/lock

              • /tmp

              • /sys/fs/cgroup/systemd

              • /var/lib/journal

       It will also set the default stop signal to SIGRTMIN+3.

       This allows systemd to run in a confined container without any modifications.

       Note that on SELinux systems, systemd attempts to write to the cgroup file system.
       Containers writing to the cgroup file system are denied by default. The
       container_manage_cgroup boolean must be enabled for this to be allowed on an SELinux
       separated system.

              setsebool -P container_manage_cgroup true

Maybe I can find out more about this setup done by podman, so I can reproduce it for docker, too.

I have set boot option cgroup_no_v1=all to have cgroupv2 only to test some setups. Currently only podman with --systemd=always works.


Edit:
One obvious issue if running with Docker: Although the host uses cgroupv2 only, systemd in container assumes a hybrid cgroup setup (at end of long line: default-hierarchy=hybrid):

Failed to set up the root directory for shared mount propagation: Operation not permitted
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected architecture x86-64.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <347101025c7e>.
Failed to create /init.scope control group: No such file or directory
Failed to allocate manager object: No such file or directory
[!!!!!!] Failed to allocate manager object.
Freezing execution.

It might help if I could tell systemd in container to use cgroupv2 only. But I found no direct systemd option for this except kernel options in https://www.man7.org/linux/man-pages/man1/init.1.html.
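
To compare what different images report, the default-hierarchy= value can be pulled out of the startup banner shown above. This is a small sketch; the function name default_hierarchy is hypothetical, and it assumes the banner line reaches the container log.

```shell
# Print the value of default-hierarchy= from systemd's startup banner,
# read from standard input. Prints nothing if the field is absent.
default_hierarchy() {
    sed -n 's/.*default-hierarchy=\([a-z]*\).*/\1/p'
}
```

For example, the output of docker logs for a container (with stderr redirected to stdout) could be piped into default_hierarchy.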

@lukts30

lukts30 commented Feb 10, 2022

The recommended way for running systemd in docker seems to be [1][2]:
--privileged --cgroupns=host -v /sys/fs/cgroup:/sys/fs/cgroup:rw

Since I generally do not like to use =host options I tried to replicate what podman does with the docker cli. It is a bit hacky but it seems to be working. Tested on a headless Debian 11 system with docker.io+runc (container is fedora httpd with systemd). Systemd correctly detects and uses cgroupv2 (default-hierarchy=unified).
I did not have time to check how one would integrate this with x11docker.

Based on podman container_internal_linux.go.

options=rw,rprivate,nosuid,nodev
docker run \
--tmpfs /run:$options \
--tmpfs /run/lock:$options \
--tmpfs /tmp:$options \
--tmpfs /var/log/journal:$options \
--cgroupns=private \
--rm \
--name t1 \
-d \
sysd \
/bin/sh -c 'sleep infinity; exec /sbin/init'
nsenter -t $(docker inspect -f '{{.State.Pid}}' t1) -m -p /bin/sh -c 'mount -o remount,rw /sys/fs/cgroup/ ; pkill sleep'
root@x11docker-test:~# docker top t1
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                20649               20631               0                   22:43               ?                   00:00:00            /bin/sh -c sleep infinity; exec /sbin/init
root                20682               20649               0                   22:43               ?                   00:00:00            sleep infinity
root@x11docker-test:~# nsenter -t $(docker inspect -f '{{.State.Pid}}' t1) -m -p /bin/sh -c 'mount -o remount,rw /sys/fs/cgroup/ ; pkill sleep'
root@x11docker-test:~# docker top t1
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                20649               20631               0                   22:43               ?                   00:00:00            /sbin/init
81                  20745               20649               0                   22:43               ?                   00:00:00            /usr/bin/dbus-broker-launch --scope system --audit
81                  20746               20745               0                   22:43               ?                   00:00:00            dbus-broker --log 4 --controller 9 --machine-id 7ef69b02670d444395b88e5c297fdbd0 --max-bytes 536870912 --max-fds 4096 --max-matches 16384 --audit
root                20747               20649               0                   22:43               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
48                  20748               20747               0                   22:43               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
48                  20749               20747               0                   22:43               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
48                  20750               20747               0                   22:43               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
48                  20752               20747               0                   22:43               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
root                20727               20649               0                   22:43               ?                   00:00:00            /usr/lib/systemd/systemd-journald
root                20743               20649               0                   22:43               ?                   00:00:00            /usr/lib/systemd/systemd-logind
systemd+            20734               20649               0                   22:43               ?                   00:00:00            /usr/lib/systemd/systemd-oomd
193                 20735               20649               0                   22:43               ?                   00:00:00            /usr/lib/systemd/systemd-resolved
root                20737               20649               0                   22:43               ?                   00:00:00            /usr/lib/systemd/systemd-userdbd
root                20738               20737               0                   22:43               ?                   00:00:00            systemd-userwork
root                20739               20737               0                   22:43               ?                   00:00:00            systemd-userwork
root                20740               20737               0                   22:43               ?                   00:00:00            systemd-userwork
root                20742               20737               0                   22:43               ?                   00:00:00            systemd-userwork


systemd v249.9-1.fc35 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization docker.
Detected architecture x86-64.

Initially, I tried using docker exec --privileged, but that does not work as one would think [3].
Creating a container with CAP_SYS_ADMIN, remounting rw, dropping the capability, and then exec'ing is also a problem, since AppArmor blocks that.
The nsenter method does not grant extra capabilities and also works without having to disable AppArmor.
You can also wrap the nsenter command inside a docker run --privileged --pid=host [4].

@mviereck
Owner

Great, much thanks for your investigation!
Your proposals work well, I'll have a look how to include it smoothly in x11docker.

Nice: even slim images like alpine provide nsenter, so I can likely rely on its availability. So x11docker neither needs to ask for root privileges (for nsenter from host) nor to provide an additional nsenter image.

You can also wrap the nsenter command inside a docker run --privileged --pid=host

Instead of --privileged one can use:

docker run --cap-add SYS_ADMIN --cap-add=SYS_PTRACE --security-opt apparmor=unconfined --pid=host [...]

@lukts30

lukts30 commented Feb 10, 2022

I have just reread the nsenter man page and it might be good to also join the cgroup namespace (-C) in addition to the mount and PID namespace.
A remount seems to work because it is atomic. If one instead did an umount followed by a mount, being in the same cgroup namespace would be required. At least that is how I understand it.

Additionally, instead of joining the host PID NS, it is also possible to join the other container's PID NS, so docker inspect is no longer needed.

docker run --cap-add SYS_ADMIN --security-opt apparmor=unconfined --pid=container:t1 --rm nsfed \
nsenter -t 1 -m -p -C /bin/sh -c 'mount -o remount,rw /sys/fs/cgroup/ ; pkill sleep'

EDIT:
If I apply the same procedure to a rootless podman container that was created with --systemd=false the remount fails with EPERM but doing

umount /sys/fs/cgroup/ && mount -t cgroup2 cgroup2 /sys/fs/cgroup/ -o rw

after a podman run ... nsenter -t 1 -m -p -C still works. Should not really matter since podman has systemd support built-in but interesting to know.

The same does not work with a docker container where the daemon is running with --userns-remap.
In this case /sys/fs/cgroup/ is already mounted rw but owned by real root and therefore appears to be owned by nobody from inside the container.

@mviereck
Owner

Additionally, instead of joining the host PID NS, it is also possible to join the other container's PID NS, so docker inspect is no longer needed.

Good catch!

I've almost literally integrated your command in x11docker, works like a charm now. I still have to add --cap-add=SYS_PTRACE, did you remove it intentionally?

--init=systemd now works out of the box on hybrid systems and on cgroupv2-only systems.
It still fails if I set the (previously recommended) GRUB kernel option systemd.unified_cgroup_hierarchy=0.
One has to set the x11docker option --sharecgroup to enable the old setup.

Currently I lack a way to detect whether a system is set up with cgroupv1 only although the kernel supports cgroupv2.
The check grep -q cgroup2 /proc/filesystems && Cgroupversion="v2" || Cgroupversion="v1" always results in "v2".
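
The check always yields "v2" because /proc/filesystems lists the filesystem types the running kernel supports, not what is actually mounted. One possible alternative, sketched here as an assumption and not as what x11docker does, is to look at the mount table instead. The function name cgroup2_is_root_mount is hypothetical.

```shell
# Succeeds if the given mount table (default: /proc/mounts) shows a cgroup2
# filesystem mounted exactly at /sys/fs/cgroup. The file argument exists
# only so the function can be tried against saved mount tables.
cgroup2_is_root_mount() {
    awk '$2 == "/sys/fs/cgroup" && $3 == "cgroup2" { found = 1 }
         END { exit !found }' "${1:-/proc/mounts}"
}
```

On a hybrid or cgroupv1 setup, /sys/fs/cgroup is typically a tmpfs with per-controller mounts below it, so this function would fail there.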

Curious:
Debian buster containers still report default-hierarchy=hybrid (but work nonetheless), while Debian bullseye containers report default-hierarchy=unified.

@lukts30

lukts30 commented Feb 10, 2022

I still have to add --cap-add=SYS_PTRACE, did you remove it intentionally?

Yes in my tests I did not need cap_sys_ptrace. I tested with nsenter from ubuntu 20.04 (2.34) & nsenter from fedora:35 (2.37.2).
As well as the BusyBox nsenter that is used by alpine. The BusyBox nsenter does not support entering cgroup namespace (-C).


root@x11docker-test:~# uname -a
Linux x11docker-test 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux

root@x11docker-test:~# options=rw,rprivate,nosuid,nodev
root@x11docker-test:~# docker run --tmpfs /run:$options --tmpfs /run/lock:$options --tmpfs /tmp:$options --tmpfs /var/log/journal:$options --cgroupns=private --rm --name t1 -d sysd /bin/sh -c 'sleep infinity; exec /sbin/init'
3c11269044e87e42294972daddbe69e2566694e859ab94e857bfb1d82ff374dc
root@x11docker-test:~# docker run --cap-add SYS_ADMIN --security-opt apparmor=unconfined --pid=container:t1 --rm -it nsfed /bin/bash
[root@cc8ed9729879 /]# ps au
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           8  0.2  0.0   4832  3892 pts/0    Ss   18:58   0:00 /bin/bash
root          24  0.0  0.0   7620  3376 pts/0    R+   18:58   0:00 ps au
[root@cc8ed9729879 /]# capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap=eip
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Current IAB: cap_chown,cap_dac_override,!cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,!cap_linux_immutable,cap_net_bind_service,!cap_net_broadcast,!cap_net_admin,cap_net_raw,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,cap_sys_chroot,!cap_sys_ptrace,!cap_sys_pacct,cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,cap_mknod,!cap_lease,cap_audit_write,!cap_audit_control,cap_setfcap,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1'b0 (no-new-privs=0)
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)
groups=
Guessed mode: UNCERTAIN (0)
[root@cc8ed9729879 /]# nsenter -t 1 -m -p -C /bin/sh -c 'mount -o remount,rw /sys/fs/cgroup/ ; pkill sleep'
[root@cc8ed9729879 /]#  mount | grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
[root@cc8ed9729879 /]# pkill sleep
[root@cc8ed9729879 /]# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.1  20176 12096 ?        Ss   18:57   0:00 /sbin/init
root           8  0.0  0.0   4832  4064 pts/0    Ss   18:58   0:00 /bin/bash
root          39  0.0  0.1  33792 10480 ?        Ss   18:59   0:00 /usr/lib/systemd/systemd-journald
systemd+      46  0.0  0.0  17412  6588 ?        Ss   18:59   0:00 /usr/lib/systemd/systemd-oomd
systemd+      47  0.0  0.1  28804 16484 ?        Ss   18:59   0:00 /usr/lib/systemd/systemd-resolved
root          49  0.0  0.0  17252  6584 ?        Ss   18:59   0:00 /usr/lib/systemd/systemd-userdbd
root          51  0.0  0.0  17596  8212 ?        S    18:59   0:00 systemd-userwork
root          52  0.0  0.0  17596  8216 ?        S    18:59   0:00 systemd-userwork
root          53  0.0  0.0  17736  8188 ?        S    18:59   0:00 systemd-userwork
root          54  0.0  0.0  17508  6520 ?        Ss   18:59   0:00 /usr/lib/systemd/systemd-logind
root          56  0.0  0.0  17596  8224 ?        S    18:59   0:00 systemd-userwork
dbus          57  0.0  0.0   9924  4212 ?        Ss   18:59   0:00 /usr/bin/dbus-broker-launch --scope system --audit
dbus          58  0.0  0.0   4960  2840 ?        S    18:59   0:00 dbus-broker --log 4 --controller 9 --machine-id 7ef69b02670d444395b88e5c297fdbd0 --max-bytes 536870912 --max-fds 4096 --max-matches 16384 --audit
root          59  0.0  0.1  18852 11068 ?        Ss   18:59   0:00 /usr/sbin/httpd -DFOREGROUND
apache        60  0.0  0.0  18976  6432 ?        S    18:59   0:00 /usr/sbin/httpd -DFOREGROUND
apache        61  0.0  0.1 2420108 18316 ?       Sl   18:59   0:00 /usr/sbin/httpd -DFOREGROUND
apache        63  0.0  0.1 2157908 16268 ?       Sl   18:59   0:00 /usr/sbin/httpd -DFOREGROUND
apache        64  0.0  0.1 2223436 16268 ?       Sl   18:59   0:00 /usr/sbin/httpd -DFOREGROUND
root         243  0.0  0.0   7620  3372 pts/0    R+   19:01   0:00 ps aux

mviereck added a commit that referenced this issue Feb 10, 2022
@mviereck
Owner

I still have to add --cap-add=SYS_PTRACE, did you remove it intentionally?

Yes in my tests I did not need cap_sys_ptrace. I tested with nsenter from ubuntu 20.04 (2.34) & nsenter from fedora:35 (2.37.2).
As well as the BusyBox nsenter that is used by alpine.

Maybe the host system makes a difference here? I run Debian bullseye.

The BusyBox nsenter does not support entering cgroup namespace (-C).

Good to know. I've removed it from the command.

--init=systemd now works out of the box on hybrid systems and on cgroupv2-only systems.
It still fails if I set the (previously recommended) GRUB kernel option systemd.unified_cgroup_hierarchy=0.
One has to set the x11docker option --sharecgroup to enable the old setup.
Currently I lack a way to detect whether a system is set up with cgroupv1 only although the kernel supports cgroupv2.
The check grep -q cgroup2 /proc/filesystems && Cgroupversion="v2" || Cgroupversion="v1" always results in "v2".

Hopefully this is fixed now. I found a hint close to bottom of https://www.man7.org/linux/man-pages/man7/cgroups.7.html :

/proc/[pid]/cgroup (since Linux 2.6.24)
              This file describes control groups to which the process
              with the corresponding PID belongs.  The displayed
              information differs for cgroups version 1 and version 2
              hierarchies.

              For each cgroup hierarchy of which the process is a
              member, there is one entry containing three colon-
              separated fields:

                  hierarchy-ID:controller-list:cgroup-path

              For example:

                  5:cpuacct,cpu,cpuset:/daemons

              The colon-separated fields are, from left to right:

              1. For cgroups version 1 hierarchies, this field contains
                 a unique hierarchy ID number that can be matched to a
                 hierarchy ID in /proc/cgroups.  For the cgroups version
                 2 hierarchy, this field contains the value 0.

x11docker now checks whether /proc/self/cgroup contains a line beginning with 0:. If so, cgroupv2 is available. However, I am not sure whether this is foolproof.
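
A check along those lines could look like the sketch below. It is not x11docker's actual code: the function name has_cgroup2_entry and the optional file argument are assumptions added for illustration.

```shell
# Succeeds if the given file (default: /proc/self/cgroup) contains an entry
# whose hierarchy ID is 0, the value cgroups(7) documents for the
# cgroups version 2 hierarchy.
has_cgroup2_entry() {
    grep -q '^0:' "${1:-/proc/self/cgroup}"
}
```

Note that a 0: entry only says the process is a member of a version 2 hierarchy; by itself it does not say whether version 1 hierarchies are in use as well.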

The same does not work with a docker container where the daemon is running with --userns-remap.
In this case /sys/fs/cgroup/ is already mounted rw but owned by real root and therefore appears to be owned by nobody from inside the container.

This answer indicates that it might help to change the ownership of the cgroup folder to the root user uid of container: https://serverfault.com/a/1054414

However, x11docker disables userns to allow sharing files with the host. For docker it sets --userns=host, for podman it sets --userns=keep-id. (userns is a pita.)

@mviereck
Owner

mviereck commented Feb 10, 2022

Side note: capsh --print seems not to be reliable. I found cases where it prints capabilities that were not available. Compare nestybox/sysbox#453 (comment)

@lukts30

lukts30 commented Feb 10, 2022

Maybe the host system makes a difference here? I run Debian bullseye.

Strange. I did all my Docker tests inside a Debian bullseye VM managed by lxc/lxd (QEMU-based, not an LXC container). But I could imagine that there is a sysctl that forces extra checks on your system. Just a hypothesis; I'm not really sure.

x11docker now checks whether /proc/self/cgroup contains a line beginning with 0:. If so, cgroupv2 is available. However, I am not sure whether this is foolproof.

It seems the intended way is to use statfs and compare the reported type against some magic numbers. [1]

If you wonder how to detect which of these three modes is currently used, use statfs() on /sys/fs/cgroup/. If it reports CGROUP2_SUPER_MAGIC in its .f_type field, then you are in unified mode. If it reports TMPFS_MAGIC then you are either in legacy or hybrid mode. To distinguish these two cases, run statfs() again on /sys/fs/cgroup/unified/. If that succeeds and reports CGROUP2_SUPER_MAGIC you are in hybrid mode, otherwise not. From a shell, you can check the Type in stat -f /sys/fs/cgroup and stat -f /sys/fs/cgroup/unified.

cgroup2 filesystem has the magic number 0x63677270 (“cgrp”) [2] [3]
CGROUP_SUPER_MAGIC 0x27e0eb /* Cgroup pseudo FS */
CGROUP2_SUPER_MAGIC 0x63677270 /* Cgroup v2 pseudo FS */
TMPFS_MAGIC 0x01021994

root@x11docker-test:~#  stat -c"%t" -f /sys/fs/cgroup
63677270
root@x11docker-test:~#  stat -c"%T" -f /sys/fs/cgroup
cgroup2fs
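
The three-way decision from the quoted documentation can be sketched with stat as follows. The function names are hypothetical, and the sketch assumes GNU stat, whose %T format prints cgroup2fs for a cgroup2 mount (as shown above) and tmpfs for a tmpfs mount.

```shell
# Map the filesystem type names reported by `stat -f -c %T` for
# /sys/fs/cgroup (first argument) and /sys/fs/cgroup/unified (second
# argument, may be empty) to one of: unified, hybrid, legacy, unknown.
classify_cgroup_mode() {
    case "$1" in
        cgroup2fs) echo unified ;;
        tmpfs)
            if [ "$2" = cgroup2fs ]; then echo hybrid; else echo legacy; fi ;;
        *) echo unknown ;;
    esac
}

# Query the live system. The optional root argument exists only for testing.
detect_cgroup_mode() {
    root=${1:-/sys/fs/cgroup}
    # stat fails on a missing path; treat that as an empty type.
    top=$(stat -f -c %T "$root" 2>/dev/null) || top=
    sub=$(stat -f -c %T "$root/unified" 2>/dev/null) || sub=
    classify_cgroup_mode "$top" "$sub"
}
```

Keeping the classification separate from the stat calls makes the rule easy to check without needing hosts in all three modes.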

@mviereck
Owner

mviereck commented Feb 11, 2022

Thanks for the investigation!
x11docker now uses stat for the cgroup version check.

Confusing:
Contrary to what I assumed, my Debian bullseye installation seems to run cgroupv2 only by default (i.e. without kernel options). The nsenter setup succeeds.

If I set kernel option systemd.unified_cgroup_hierarchy=0 to have cgroupv1 only, I seem to get a hybrid system according to the check of /sys/fs/cgroup/unified. But the nsenter setup fails in this case. The old setup with shared host cgroups is needed.

I don't know of an option to get a real hybrid setup.

So x11docker would need two checks:

  • Running --init=systemd on a pure cgroupv1 system. (Not sure if any are out in the wild.)
  • Running --init=systemd on a real hybrid system.

Currently x11docker is configured to use the nsenter setup only on a pure cgroupv2 system. For cgroupv1 and hybrid it falls back to the old behaviour sharing host cgroups.
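
Expressed as code, that selection rule might look like this sketch. The function name and the labels nsenter and sharecgroup are placeholders for illustration, not x11docker internals; the input is assumed to be a mode string such as unified, hybrid, or legacy.

```shell
# Hypothetical sketch of the selection rule: pick the systemd setup
# for a given cgroup mode string.
systemd_setup_for() {
    case "$1" in
        unified) echo nsenter ;;      # pure cgroupv2: remount cgroup fs via nsenter
        *)       echo sharecgroup ;;  # cgroupv1 or hybrid: share host cgroups
    esac
}
```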

@mviereck
Owner

Today I could run a test on Debian buster with a hybrid cgroup. It runs well with the old setup sharing cgroup from host.
So I believe that --init=systemd now works with all cgroup setups in the wild. The issue is solved.

Much thanks again @lukts30 , this would not have been possible without your help!

@larssb

larssb commented Feb 20, 2022

I'm adding a comment here for information and context.

Thank you very much! This helped me solve an issue I've had for ages: running a Concourse CI worker on a QNAP NAS, where Docker is in cgroupv1 mode and the worker would report: mounting cgroup to rootfs at /sys/fs/cgroup caused: invalid argument: unknown.

Using --privileged --cgroupns=host -v /sys/fs/cgroup:/sys/fs/cgroup:rw, as @lukts30 suggested in this comment, makes the difference.

The only bummer in my case is that I need to use the docker run ... command, as docker-compose does not support setting cgroupns in a docker-compose.yaml file. See this link for more.

@mviereck
Owner

Using --privileged --cgroupns=host -v /sys/fs/cgroup:/sys/fs/cgroup:rw, as @lukts30 suggested in this #349 (comment), makes the difference.

Great that this helped you! However, just want to note that this setup exposes your host to the container and is quite insecure. Don't use it if there is any reason to distrust the container because basically no isolation is left.

@larssb

larssb commented Feb 20, 2022

Thank you @mviereck - that sucks yes. I'm doing what I can to protect the workload in other ways. E.g. it's definitely not going to be publicly exposed.

Thanks

@mviereck
Owner

that sucks yes. I'm doing what I can to protect the workload in other ways

I am not sure why your unprivileged setup fails. If you like, you could try running your worker with x11docker and its option --init=systemd to check whether a more secure setup would work.
