Unable to run systemd in docker with ro /sys/fs/cgroup after systemd 248 host upgrade #42275

Open
fthiery opened this issue Apr 8, 2021 · 37 comments

@fthiery

fthiery commented Apr 8, 2021


BUG REPORT INFORMATION

I used to run docker containers with systemd as CMD without having to expose /sys/fs/cgroup as rw; this worked until systemd 248 on the host. Now it fails with

Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

I opened a related issue on the systemd github repo: systemd/systemd#19245

Workarounds

  • boot host with systemd.unified_cgroup_hierarchy=0
  • remove ro flag from docker run arg -v /sys/fs/cgroup:/sys/fs/cgroup:ro but this contaminates the host cgroup, causing e.g. docker top to get confused:
docker top debian-systemd
Error response from daemon: runc did not terminate successfully: container_linux.go:186: getting all container pids from cgroups caused: lstat /sys/fs/cgroup/system.slice/docker-817dfec3facbeb10c64d7b0fae478804b1177ae949e695e111b7c693569dd21a.scope: no such file or directory
: unknown
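
Both workarounds depend on which cgroup mode the host actually booted into. A minimal sketch for checking this, assuming GNU `stat` is available. The function names are hypothetical and not part of Docker or systemd:

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup mode.
# cgroup2fs -> unified hierarchy (cgroup v2 only)
# tmpfs     -> legacy or hybrid (v1 controllers are mounted below it)
classify_cgroup_fstype() {
    case "$1" in
        cgroup2fs) echo "unified" ;;
        tmpfs)     echo "legacy-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Report the mode of the machine (or container) this runs on.
host_cgroup_mode() {
    classify_cgroup_fstype "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
}
```

Running `host_cgroup_mode` on a host with `default-hierarchy=unified` should report `unified`. After booting with systemd.unified_cgroup_hierarchy=0 it should report `legacy-or-hybrid`.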

Steps to reproduce the issue:

Dockerfile:

FROM debian:buster-slim

ENV container docker
ENV LC_ALL C
ENV DEBIAN_FRONTEND noninteractive

USER root
WORKDIR /root

RUN set -x

RUN apt-get update -y \
    && apt-get install --no-install-recommends -y systemd \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
    && rm -f /var/run/nologin

RUN rm -f /lib/systemd/system/multi-user.target.wants/* \
    /etc/systemd/system/*.wants/* \
    /lib/systemd/system/local-fs.target.wants/* \
    /lib/systemd/system/sockets.target.wants/*udev* \
    /lib/systemd/system/sockets.target.wants/*initctl* \
    /lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup* \
    /lib/systemd/system/systemd-update-utmp*

VOLUME [ "/sys/fs/cgroup" ]

CMD ["/lib/systemd/systemd"]

Expected behaviour

systemd 247 (247.4-2-arch)
+PAM +AUDIT -SELINUX -IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid
$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <bf431002c7c1>.
Couldn't move remaining userspace processes, ignoring: Input/output error
File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[  OK  ] Listening on Journal Socket.
...
[  OK  ] Reached target Graphical Interface.

Actual behaviour

Since systemd v248

$ /lib/systemd/systemd --version
systemd 248 (248-3-arch)
+PAM +AUDIT -SELINUX -APPARMOR -IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified

$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <fbb4fc19cb95>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

Output of docker version:

$ docker version
Client:
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.16
 Git commit:        55c4c88966
 Built:             Wed Mar  3 16:51:54 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.5
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16
  Git commit:       363e9a88a1
  Built:            Wed Mar  3 16:51:28 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e.m
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-tp-docker)

Server:
 Containers: 10
  Running: 1
  Paused: 0
  Stopped: 9
 Images: 61
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e.m
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.11.11-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 7.712GiB
 Name: homepc
 ID: 67YO:62DZ:3NIF:TZT3:HTXP:BU6I:YBR3:XETA:7YCB:YGNN:MV6Q:QYN4
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://mirror.gcr.io/
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

x86_64 Intel hw, Arch Linux 5.11.11-arch1-1

@fthiery

fthiery commented Apr 8, 2021

Related: https://serverfault.com/questions/1053187/systemd-fails-to-run-in-a-docker-container-when-using-cgroupv2-cgroupns-priva/1054414#1054414

@kaedwen

kaedwen commented Apr 28, 2021

Same here
It was working with 247

@skast96

skast96 commented Mar 14, 2022

Is there already a fix for this?

@skast96

skast96 commented Mar 18, 2022

Is there already a fix for this?

For reference, it is possible with namespace isolation. https://docs.docker.com/engine/security/userns-remap/
Or simply install podman.

@x-yuri

x-yuri commented May 2, 2022

remove ro flag from docker run arg -v /sys/fs/cgroup:/sys/fs/cgroup:ro

It didn't help. I'm running Ubuntu 21.10 (Impish Indri).

For reference, it is possible with namespace isolation.

@skast96, it didn't help either. I edited /etc/docker/daemon.json:

{"userns-remap": "default"}

Restarted docker. The dockremap user was created, as were the entries in /etc/sub{uid,gid}. The /var/lib/docker/100000.100000 dir was created. docker image ls produced no output. Then:

$ docker run -it --tmpfs /tmp --tmpfs /run --tmpfs /run/lock -v /sys/fs/cgroup:/sys/fs/cgroup jrei/systemd-ubuntu
systemd 245.4-4ubuntu3.16 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Ubuntu 20.04.4 LTS!

Set hostname to <1bdd4443336d>.
Failed to create /init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

So the only workaround is supposedly to switch to the cgroup v1 mode (systemd.unified_cgroup_hierarchy=0):

  • /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="systemd.unified_cgroup_hierarchy=0"
  • update-grub
  • reboot
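
The first step can be scripted so that existing kernel parameters are kept rather than overwritten. This is a sketch only: `add_kernel_param` is a hypothetical helper, and it assumes the usual double-quoted `GRUB_CMDLINE_LINUX_DEFAULT="..."` line format.

```shell
# add_kernel_param FILE PARAM
# Append PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE unless it is already there.
add_kernel_param() {
    file=$1
    param=$2
    key=GRUB_CMDLINE_LINUX_DEFAULT
    if grep -q "^${key}=.*${param}" "$file"; then
        # Already present: nothing to do.
        return 0
    elif grep -q "^${key}=\"" "$file"; then
        # Insert the parameter before the closing quote of the existing value.
        sed "s|^\(${key}=\"[^\"]*\)\"|\1 ${param}\"|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        # No such line yet: add one.
        printf '%s="%s"\n' "$key" "$param" >> "$file"
    fi
}
```

For example, `add_kernel_param /etc/default/grub systemd.unified_cgroup_hierarchy=0` run as root, followed by update-grub and a reboot as listed above. Trying it on a copy of the file first is safer.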

UPD: --cgroupns=host combined with -v /sys/fs/cgroup:/sys/fs/cgroup (without :ro) also works, e.g.:

$ docker run -it --cgroupns=host --tmpfs /tmp --tmpfs /run --tmpfs /run/lock \
    -v /sys/fs/cgroup:/sys/fs/cgroup jrei/systemd-ubuntu

@LewisGaul

Under cgroups v2 the default for --cgroupns switches from 'host' to 'private'. When passing in the entirety of the host's /sys/fs/cgroup explicitly then it's completely expected for this to fail in combination with the container runtime trying to create a private cgroup namespace inside it, as the cgroup path inside the container won't match up with the cgroup namespace...

As you noted, passing --cgroupns=host can make this work. However, passing the host's /sys/fs/cgroup into the container as rw seems very unadvisable (might as well just use --privileged?), and the solution in https://serverfault.com/questions/1053187/systemd-fails-to-run-in-a-docker-container-when-using-cgroupv2-cgroupns-priva/1054414#1054414 involving creating a systemd slice on the host seems more suitable (although I haven't tried getting it to work personally).

Aside from workarounds, it would be good to know what the Docker community's official advice on the matter is (where fundamentally we just want the container's /sys/fs/cgroup to be writable in a non-privileged container).
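
To see the mismatch between cgroup path and cgroup namespace, it can help to print the cgroup path a process reports for itself. A minimal sketch, assuming the cgroup v2 format of /proc/<pid>/cgroup, where the unified hierarchy appears as a `0::<path>` line. `unified_cgroup_path` is a hypothetical helper name:

```shell
# Read /proc/<pid>/cgroup content on stdin and print the cgroup v2 path.
# The unified hierarchy has hierarchy ID 0 and an empty controller list,
# so its line looks like "0::/some/path".
unified_cgroup_path() {
    sed -n 's/^0:://p'
}
```

Usage: `unified_cgroup_path < /proc/self/cgroup`. With a private cgroup namespace, the path is reported relative to the namespace root, so a container's PID 1 would typically see `/`. With --cgroupns=host it would see the full path below the host's root.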

@LewisGaul

Related issue: #42040

@skast96

skast96 commented May 4, 2022

@x-yuri the docker approach is not working that great tbh. It works with namespace isolation when you create an extra slice for docker and add this slice to the docker run command like so:

docker run -it \
    --cgroup-parent=docker.slice \
    --cgroupns private \
    --tmpfs /tmp \
    --tmpfs /run \
    --tmpfs /run/lock \
    mySystemdImage:latest 

That kinda worked for me. However, our other containers stopped working with namespace isolation because they were not configured for it. That meant too much work just to run one container with systemd.

So I suggest you just install podman. I experienced no drawbacks on my Arch Linux with both docker and podman installed. Even the commands are the same. With podman you would start your systemd container like this:

podman run -it mySystemdImage:latest 

@x-yuri

x-yuri commented May 5, 2022

Actually for now I'm planning to employ the hybrid/legacy systemd mode (cgroup v1), which seems tolerable in my case. But podman sounds like an interesting option (haven't tried it).

@skast96

skast96 commented May 5, 2022

@x-yuri sounds like a plan. My reason for not using v1 is that I needed cgroups v2 to work.

mleiner added a commit to noris-network/puppet-exim that referenced this issue May 12, 2022
In the past, I used vagrant -> libvirt to run acceptance test, but
after upgrading the workstation to Ubuntu 22.04 LTS (jammy), this stopped
working because of incompatibilities.

Tests are run using pdk bundle exec rake ... and this uses the pdk
environment with ruby 2.7.x, while my operating system (and thus vagrant)
have been ported to ruby 3. This creates confusion within the tools, because
the Gemfile (from PDK) depends on modules that depend on the ruby version in
their name (puppet-module-posix-...-r3.0). After days of debugging and
finding seemingly inactive issues on github and jira, I decided that I do not
want to waste even more time on this.

I like docker as much or less as vagrant (not much) and converted all litmus
tests to use docker.

Docker itself breaks on Ubuntu 22.04 LTS in unprivileged mode, too. Luckily,
from my experience with Gentoo, I already suspected trouble with cgroups.
Since systemd 248 (no likey either) something changed in the cgroup handling
which causes containers using systemd to fail under further conditions. As
much as I understand the discussion, systemd devs expect the container people
to change their containers or container infrastructure to their ideas.
I don't want to investigate this any deeper.

The solution for using unprivileged docker on Ubuntu 22.04 LTS?
Add "systemd.unified_cgroup_hierarchy=0" to your kernel cmdline.

References for the ruby3 issues:
puppetlabs/puppet-module-gems#166
=> https://tickets.puppetlabs.com/browse/MODULES-11161

References for the docker/systemd/ubuntu22 issue:
moby/moby#42275

@aki-k

aki-k commented Sep 4, 2023

@marco-a-itl

This is the mount shown in the container:

# findmnt | grep cgroup
│ └─/sys/fs/cgroup      cgroup      cgroup2  rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot

@marco-a-itl

@marco-a-itl

This is the mount shown in the container:

# findmnt | grep cgroup
│ └─/sys/fs/cgroup      cgroup      cgroup2  rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot

This is ok (mode is rw). However I assume that you obtained this result with userns-remapping.

I think that it should be possible to have the same result without such daemon option, with the proper modifications on the docker engine, like podman does.
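
The rw/ro check on the mount can be repeated inside any container with a small helper. This is a sketch with hypothetical function names. It assumes `findmnt` from util-linux is installed in the image.

```shell
# Print "rw" or "ro" for a comma-separated mount option string.
mount_access_mode() {
    case ",$1," in
        *,rw,*) echo "rw" ;;
        *,ro,*) echo "ro" ;;
        *)      echo "unknown" ;;
    esac
}

# Report the access mode of the cgroup mount in the current mount namespace.
cgroup_mount_mode() {
    mount_access_mode "$(findmnt -n -o OPTIONS /sys/fs/cgroup 2>/dev/null)"
}
```

Note that `rw` here only describes the mount flags. As the "Permission denied" output earlier in the thread shows, a writable mount can still be refused if the container's root user has no permission on the cgroup directory.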

@aki-k

aki-k commented Sep 4, 2023

@marco-a-itl

However I assume that you obtained this result with userns-remapping.

That's correct

@Vlad1mir-D

For everyone lurking around

Since the discussion seems to continue and people are not able to find it:
Workaround (even without the usage of --privileged) already mentioned ->HERE<-

@pbecotte

Spent some time looking at this today trying to run a systemd container under rootless docker. The docker daemon is running under a --user systemd unit. I am on wsl with cgroups v2 enabled.

docker run -it --rm --tmpfs /tmp --tmpfs /run registry.access.redhat.com/ubi8/ubi-init:8.8
Failed to create /init.scope control group: Read-only file system.
docker creates the cgroup mount in the container, mounted readonly

docker run -it --rm --tmpfs /tmp --tmpfs /run -v /sys/fs/cgroup:/sys/fs/cgroup registry.access.redhat.com/ubi8/ubi-init:8.8
Failed to create /init.scope control group: Permission denied.
The "fake root" inside the container doesn't have permission to modify the cgroup that I mounted

docker run -it --rm --tmpfs /tmp --tmpfs /run --cgroupns=host registry.access.redhat.com/ubi8/ubi-init:8.8
Failed to create /user.slice/user-1000.slice/user@1000.service/user.slice/..../init.scope control group: Read-only file system
Docker mounts the host cgroupns and systemd tries to create a cgroup at the appropriate level, but docker mounted the filesystem readonly still

docker run -it --rm --tmpfs /tmp --tmpfs /run -v /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service:/sys/fs/cgroup/user.slice/user@1000.service --cgroupns=host registry.access.redhat.com/ubi8/ubi-init:8.8
This works. It correctly creates the cgroup under the docker slice under the user slice. (You can mount the whole cgroupns rw, but it wasn't necessary.)

docker run -it --rm --tmpfs /tmp --tmpfs /run -v /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service:/sys/fs/cgroup/user.slice/user@1000.service registry.access.redhat.com/ubi8/ubi-init:8.8
error mounting "/sys/fs...": read-only file system
I can't mount that folder r/w into the container with private cgroupns mode, presumably because docker set up the fake mount readonly
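
The host-side path in the working command is hard-coded for UID 1000. A small sketch for deriving it for the current user, assuming the standard systemd user-slice layout seen in the commands above. `user_service_cgroup` is a hypothetical helper:

```shell
# Print the cgroup directory of a user's systemd instance (user@UID.service)
# on a cgroup v2 host. Defaults to the calling user's UID.
user_service_cgroup() {
    uid=${1:-$(id -u)}
    echo "/sys/fs/cgroup/user.slice/user-${uid}.slice/user@${uid}.service"
}
```

The output can serve as the source side of the -v bind mount, e.g. `src=$(user_service_cgroup)`, together with --cgroupns=host as in the working command.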

I can not find any documentation at all of what the expected behavior of cgroupns=private is supposed to be. Should it be a transparent mapping to a parent context? If so- should probably be mounted rw rather than ro. Also- systemd docs https://systemd.io/CONTAINER_INTERFACE/ seem to imply that's not a best practice anyway.

It seems to me that the best approach for my situation is to just set the default cgroupns back to 'host' to get this working properly.
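
For reference, the daemon-wide default can be changed with the `default-cgroupns-mode` key in daemon.json. The sketch below only writes a fragment to a scratch file for review. The helper name is hypothetical, and the fragment needs to be merged by hand into the real config, which is /etc/docker/daemon.json for a rootful daemon and, as an assumption for rootless setups, ~/.config/docker/daemon.json.

```shell
# write_cgroupns_fragment DEST
# Write a daemon.json fragment that makes "host" the default cgroup
# namespace mode. DEST should be a scratch file, not the live config,
# because the file is overwritten.
write_cgroupns_fragment() {
    dest=$1
    cat > "$dest" <<'EOF'
{
  "default-cgroupns-mode": "host"
}
EOF
}
```

After merging the key and restarting the daemon, containers started without an explicit --cgroupns flag should use the host cgroup namespace. Individual containers can still opt back in with --cgroupns=private.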

@LewisGaul

I can not find any documentation at all of what the expected behavior of cgroupns=private is supposed to be. Should it be a transparent mapping to a parent context? If so- should probably be mounted rw rather than ro. Also- systemd docs https://systemd.io/CONTAINER_INTERFACE/ seem to imply that's not a best practice anyway.

--cgroupns=private means that the container runtime will create a cgroup namespace for the container (podman's docs are more explicit about this).

You may find my blog post informative: https://lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/

I also have some tests that exercise different container setup modes for running systemd: https://github.com/LewisGaul/systemd-containers

@lubo

lubo commented Apr 15, 2024

We need something like what's described in containers/podman#14322 (reply in thread) (--security-opt unmask=/sys/fs/cgroup).

@atopion

atopion commented Jun 17, 2024

I've just tested it, it seems to work flawlessly with
docker run --rm -it -v /sys/fs/cgroup/warewulf.scope:/sys/fs/cgroup:rw --tmpfs /run --tmpfs /run/lock warewulf-1:latest /sbin/init

docker version: 26.1.4
Host runs: Arch, Kernel 6.9.4, systemd 255.7
Container runs: Debian 12.5 with systemd 252.22
Also works with a container running: Rockylinux 9.3, with systemd 252.32

@celesteking

Note that if host is running older cgroupv1, the /sys/fs/cgroup on the host is a tmpfs that's mounted as ro and as such a lot of solutions from here won't work.

@whg517

whg517 commented Jul 31, 2024

I'm having the same problem with Docker Desktop on macOS (M1). Does anyone have a workaround already?

@Antik9421

Works ideally:

docker run --rm --cgroupns=private --name freeipa-server-almalinux9 -ti \
    -h ipa.hwdomain.lan --read-only --sysctl net.ipv6.conf.all.disable_ipv6=0 \
    -v /sys/fs/cgroup/warewulf.scope:/sys/fs/cgroup/warewulf.scope:ro \
    -v ~/freeipa-data:/data:Z freeipa-almalinux9

@oldium

oldium commented Aug 17, 2024

I just faced the same issue and running the container with sysbox-runc runtime helped. With docker run -it --rm --runtime=sysbox-runc my-image the container started.

@darkdragon-001

I've just tested it, it seems to work flawlessly with docker run --rm -it -v /sys/fs/cgroup/warewulf.scope:/sys/fs/cgroup:rw --tmpfs /run --tmpfs /run/lock warewulf-1:latest /sbin/init

docker version: 26.1.4 Host runs: Arch, Kernel 6.9.4, systemd 255.7 Container runs: Debian 12.5 with systemd 252.22 Also works with a container running: Rockylinux 9.3, with systemd 252.32

How did you create /sys/fs/cgroup/warewulf.scope?

@oldium

oldium commented Aug 29, 2024

How did you create /sys/fs/cgroup/warewulf.scope?

The scope is an ordinary folder, so it can be created by Docker itself during the volume mount. However, this does not work for me: you get a scope, but systemd is not running within the mounted cgroup.
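
Whether any process actually lives in such a cgroup can be checked through its cgroup.procs file. A minimal sketch with a hypothetical helper name. It reads the file instead of testing its size, because cgroupfs pseudo-files report a size of 0:

```shell
# cgroup_has_procs DIR
# Succeed if the cgroup directory DIR lists at least one PID in cgroup.procs.
cgroup_has_procs() {
    grep -q . "$1/cgroup.procs" 2>/dev/null
}
```

For example, `cgroup_has_procs /sys/fs/cgroup/warewulf.scope && echo populated || echo empty` on the host. A directory that was only created by the bind mount would be expected to report `empty`, which matches the observation above.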

cidrblock added a commit to ansible/ansible-dev-tools that referenced this issue Sep 20, 2024
Change the server in container url to 0.0.0.0, which should be safer long-term and resolve some odd errors found with podman related to pasta.
Log the container run command for easier troubleshooting locally outside the test suite.
Add an execution environment build test
Note the failure in this test run:
opening file /sys/fs/cgroup/cgroup.subtree_control for writing: Read-only file system
https://github.com/ansible/ansible-dev-tools/actions/runs/10930266208/job/30342982168?pr=377

This is why unmask=/sys/fs/cgroup is added after the initial addition of the EE test which works for podman.

For docker based on: moby/moby#42275 (comment)
--privileged was added (not ideal, but few options)

On macOS/intel/podman desktop the following errors were found:
Error: crun: mknod /dev/null: Operation not permitted: OCI permission denied

the following was added to resolve this error:

--cap-add=mknod (docker gets this by default)

this allowed all tests to pass on macOS/intel/podman desktop

277.32s call     tests/integration/test_container.py::test_builder
6.21s call     tests/integration/test_container.py::test_nav_playbook
4.99s call     tests/integration/test_container.py::test_nav_collections
3.56s call     tests/integration/test_container.py::test_navigator_simple_c_in_c
3.18s call     tests/integration/test_container.py::test_nav_collection
2.77s call     tests/integration/test_container.py::test_navigator_simple
2.58s call     tests/integration/test_container.py::test_podman
1.23s call     tests/integration/test_container.py::test_nav_images
1.15s setup    tests/integration/test_container.py::test_nav_collections
0.78s setup    tests/integration/test_container.py::test_nav_playbook
======================================= 34 passed, 1 warning in 310.65s (0:05:10) =======================================
Additional changes necessary for Windows user include the addition of
    "--cap-add=NET_ADMIN",
to avoid bpf query: Operation failed errors when building an EE

---------

Co-authored-by: Brad Thornton <bthornto@bthornto-mac.lan>