
Processes launched via docker exec are not placed into the correct cgroup #42704

Closed

raxod502 opened this issue Aug 1, 2021 · 7 comments

raxod502 commented Aug 1, 2021

Description

Processes launched via docker exec are not placed into the correct cgroup when using the --cgroup-parent option (here set via the cgroup-parent key in daemon.json) with the systemd cgroup driver.

Steps to reproduce the issue:

1. Create /etc/systemd/system/my-example-cgroup.slice with the following contents:

[Unit]
Description=My example cgroup
Before=slices.target

[Slice]
TasksAccounting=true
TasksMax=10

2. Edit /etc/docker/daemon.json as follows:

{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "cgroup-parent": "my-example-cgroup.slice"
}

3. Reload the systemd and Docker configuration: sudo systemctl daemon-reload, then sudo systemctl restart docker.

4. Now start a container and run something within it, e.g.:

$ sudo docker run -it --rm alpine
/ # tail -f /dev/null this-will-show-in-ps
tail: can't open 'this-will-show-in-ps': No such file or directory
<command hangs>

From another terminal, we can verify that the cgroup is set correctly:

$ pgrep -f this-will-show-in-ps
23770
$ systemctl status 23770 | grep CGroup
Warning: The unit file, source configuration file or drop-ins of docker-f903f3cc12f1d6c801d64c8ae04638faeb899c438d6f5d65f3352810d0181007.scope changed on disk. Run 'systemctl daemon-reload' to reload units.
     CGroup: /my-example-cgroup.slice/docker-f903f3cc12f1d6c801d64c8ae04638faeb899c438d6f5d65f3352810d0181007.scope

However, now let's use docker exec:

$ sudo docker ps | grep alpine | awk '{ print $1 }'
f903f3cc12f1
$ sudo docker exec -it f903f3cc12f1 sh
/ # tail -f /dev/null this-will-also-show-in-ps
tail: can't open 'this-will-also-show-in-ps': No such file or directory
<command hangs>

From another terminal, we can see that the cgroup is not set correctly (it should be the same as for the previous process):

$ pgrep -f this-will-also-show-in-ps
24418
$ systemctl status 24418 | grep CGroup
     CGroup: /system.slice/containerd.service

Consequently, cgroup resource limits are not enforced for any processes launched via docker exec.

Describe the results you received:
Processes started via docker run are placed into the container's --cgroup-parent cgroup, but processes started via docker exec are placed into the default /system.slice/containerd.service cgroup.

Describe the results you expected:
All processes in the container, no matter how they are started, are placed into the container's --cgroup-parent cgroup.

Output of docker version:

Client: Docker Engine - Community
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        f0df350
 Built:             Wed Jun  2 12:00:45 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:58:56 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker info:

Client:                                                                                        
 Context:    default              
 Debug Mode: false
 Plugins:                             
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
  scan: Docker Scan (Docker Inc., v0.8.0)
                                               
Server:                     
 Containers: 5
  Running: 5             
  Paused: 0
  Stopped: 0
 Images: 226
 Server Version: 20.10.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e25210fe30a0a703442421b0f60afac609f950a3
 runc version: v1.0.1-0-g4144b63
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.11.0-1014-aws
 Operating System: Ubuntu 21.04
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 1.895GiB
 Name: ip-172-31-8-109
 ID: P7WI:4PFY:EWY7:Z3SG:JZGL:KHEE:UC6J:KEB4:SOPD:5MX2:ZARW:HSK6
 Docker Root Dir: /mnt/riju/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.): This is on an EC2 instance.

@thaJeztah (Member)

I'm wondering whether this is a bug or an omission in containerd. From some initial searching, the container's cgroup parent is set when creating the container;

moby/daemon/oci_linux.go

Lines 808 to 814 in 9674540

if useSystemd {
	cgroupsPath = parent + ":" + scopePrefix + ":" + c.ID
	logrus.Debugf("createSpec: cgroupsPath: %s", cgroupsPath)
} else {
	cgroupsPath = filepath.Join(parent, c.ID)
}
s.Linux.CgroupsPath = cgroupsPath

That property (s.Linux.CgroupsPath) is part of the OCI runtime spec's config;

Linux *Linux `json:"linux,omitempty" platform:"linux"`
// CgroupsPath specifies the path to cgroups that are created and/or joined by the container.
// The path is expected to be relative to the cgroups mountpoint.
// If resources are specified, the cgroups at CgroupsPath will be updated based on resources.
CgroupsPath string `json:"cgroupsPath,omitempty"`
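
To make the two formats concrete, here is a minimal standalone sketch of the values that code produces. The parent and container ID are taken from the repro above; the "docker" scope prefix is an assumption based on the observed docker-<id>.scope unit name:

package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// Illustrative values from the repro above; "docker" as the scope
	// prefix is an assumption based on the observed docker-<id>.scope
	// unit name.
	parent := "my-example-cgroup.slice"
	scopePrefix := "docker"
	id := "f903f3cc12f1d6c801d64c8ae04638faeb899c438d6f5d65f3352810d0181007"

	// systemd cgroup driver: "slice:prefix:name", which systemd expands
	// to <slice>/<prefix>-<name>.scope
	fmt.Println(parent + ":" + scopePrefix + ":" + id)

	// cgroupfs driver: a plain path relative to the cgroup mountpoint
	fmt.Println(filepath.Join(parent, id))
}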

However, when doing a docker exec, containerd's Exec accepts an oci.Process type;

p, err = t.Exec(ctx, processID, spec, func(id string) (cio.IO, error) {
	rio, err = c.createIO(fifos, containerID, processID, stdinCloseSync, attachStdio)
	return rio, err
})

// Exec creates a new process inside the task
Exec(context.Context, string, *specs.Process, cio.Creator) (Process, error)

The oci.Process type allows setting various options for the process (command to run, User, SELinux label, etc.), but does not have an option to specify the cgroup path;

// Process contains information to start a specific application inside the container.
type Process struct {

Based on that, I would expect containerd (?) to inherit this from the container itself. If that's indeed what should happen but isn't the case, that may be a bug in containerd 🤔

@thaJeztah (Member)

From a Slack discussion I had with a containerd maintainer:

Is that a bug in runc maybe? Not 100% sure offhand but I'm pretty sure the shims rely on runc to create and manage the exec processes and don't do too much manual cgroup manipulation.

@AkihiroSuda perhaps you know; could this be a bug/issue in runc?

@kolyshkin (Contributor)

Here is what happens in your case, @raxod502.

  1. Apparently you're using the hybrid cgroup hierarchy (meaning cgroup v1 controllers are mounted at /sys/fs/cgroup/$controller, and cgroup v2 is mounted at /sys/fs/cgroup/unified).
  2. runc puts the process run by docker exec into all proper cgroups for v1, but not for v2.

This can be seen with, e.g.:

root@ubu2004:/home/kir# docker exec ed0892fae38b cat /proc/self/cgroup
12:memory:/my.slice/my-example.slice/my-example-cgroup.slice/docker-ed0892fae38beb03ad3211cac14af97bbb09ceff17f81ab1455741a5a995e218.scope
11:cpuset:/my.slice/my-example.slice/my-example-cgroup.slice/docker-ed0892fae38beb03ad3211cac14af97bbb09ceff17f81ab1455741a5a995e218.scope
10:blkio:/my.slice/my-example.slice/my-example-cgroup.slice/docker-ed0892fae38beb03ad3211cac14af97bbb09ceff17f81ab1455741a5a995e218.scope
9:perf_event:/my.slice/my-example.slice/my-example-cgroup.slice/docker-ed0892fae38beb03ad3211cac14af97bbb09ceff17f81ab1455741a5a995e218.scope
8:net_cls,net_prio:/my.slice/my-example.slice/my-example-cgroup.slice/docker-ed0892fae38beb03ad3211cac14af97bbb09ceff17f81ab1455741a5a995e218.scope
7:pids:/my.slice/my-example.slice/my-example-cgroup.slice/docker-ed0892fae38beb03ad3211cac14af97bbb09ceff17f81ab1455741a5a995e218.scope
6:cpu,cpuacct:/my.slice/my-example.slice/my-example-cgroup.slice/docker-ed0892fae38beb03ad3211cac14af97bbb09ceff17f81ab1455741a5a995e218.scope
5:rdma:/
4:devices:/my.slice/my-example.slice/my-example-cgroup.slice/docker-ed0892fae38beb03ad3211cac14af97bbb09ceff17f81ab1455741a5a995e218.scope
3:hugetlb:/my.slice/my-example.slice/my-example-cgroup.slice/docker-ed0892fae38beb03ad3211cac14af97bbb09ceff17f81ab1455741a5a995e218.scope
2:freezer:/my.slice/my-example.slice/my-example-cgroup.slice/docker-ed0892fae38beb03ad3211cac14af97bbb09ceff17f81ab1455741a5a995e218.scope
1:name=systemd:/my.slice/my-example.slice/my-example-cgroup.slice/docker-ed0892fae38beb03ad3211cac14af97bbb09ceff17f81ab1455741a5a995e218.scope
0::/system.slice/containerd.service

The thing is, runc does not really support the hybrid cgroup hierarchy, which is why it does not add the process being executed to the proper cgroup v2 scope (identified by the 0:: entry in the /proc/$PID/cgroup file).
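
For reference, here is a minimal sketch (in Go) to check which cgroup layout a host is using; the same information is also visible in the mount table:

package main

import (
	"fmt"
	"os"
)

func main() {
	// cgroup.controllers exists only at the root of a cgroup v2 mount.
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		fmt.Println("unified: cgroup v2 mounted at /sys/fs/cgroup")
	} else if _, err := os.Stat("/sys/fs/cgroup/unified/cgroup.controllers"); err == nil {
		fmt.Println("hybrid: cgroup v2 at /sys/fs/cgroup/unified alongside v1 controllers")
	} else {
		fmt.Println("legacy: cgroup v1 only")
	}
}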

Having said that, there's a runc PR to support the hybrid cgroup hierarchy (opencontainers/runc#2087), which can't be merged yet (and will probably land later as part of opencontainers/runc#3059).

Now, what you see is systemd looking only at cgroup v2 when reporting the process cgroup, which may or may not be a bug in systemd.


raxod502 commented Sep 3, 2021

Well, systemd looking only at cgroup v2 arguably seems like the correct behavior, because the resource limits imposed by cgroup v1 seem to be ignored when using systemd-defined cgroups.

Do you think this is the kind of issue that can be worked around by changing something about the system environment? The end goal is simply to use docker exec and have a set of resource restrictions be enforced. Ideally they could be written down in a systemd slice file, because that is simple to deploy and manage, but that's not a requirement.

@kolyshkin (Contributor)

the resource limits imposed by cgroup v1 seem to be ignored when using systemd-defined cgroups.

Can you please elaborate on this (and/or provide a quick repro demonstrating this)?

I don't think that's true (because if it is, we have a huge issue in runc w.r.t. the hybrid cgroup hierarchy).

Here's my repro:

kir@ubu2004:~$ sudo docker ps
CONTAINER ID   IMAGE     COMMAND     CREATED        STATUS        PORTS     NAMES
ed0892fae38b   alpine    "/bin/sh"   46 hours ago   Up 46 hours             intelligent_leavitt

kir@ubu2004:~$ sudo docker update --pids-limit 16 ed0892fae38b
ed0892fae38b

kir@ubu2004:~$ sudo docker exec -it ed0892fae38b sh
/ # sleep 1h &
/ # sleep 1h &
/ # sleep 1h &
/ # sleep 1h &
/ # sleep 1h &
/ # sleep 1h &
/ # sleep 1h &
/ # sleep 1h &
/ # sleep 1h &
sh: can't fork: Resource temporarily unavailable

As you can see, the pids limit is enforced for docker exec.


raxod502 commented Oct 3, 2021

Now that I've done further testing, I'm finding that although the cgroup parent is not set as I would expect, the configured resource limits for the container's cgroup parent are still applied correctly. This is not what I was observing before, so I must have had something else misconfigured on the system. Closing this issue, as I can't provide a reproducible example of the bad behavior. Thanks for your help!

raxod502 closed this as completed Oct 3, 2021

jdahm commented Feb 1, 2023

I had the same issue this week. Stress testing and systemctl status my-example-cgroup.slice showed that the tasks were not placed in the cgroup.

The solution was to add a leading slash before the slice name. Example /etc/docker/daemon.json:

{
    ...
    "cgroup-parent": "/my-example-cgroup.slice"
}
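
After restarting Docker, placement can be double-checked directly from /proc; here is a minimal sketch (in Go; the slice name matches the example above, and the PID would come from pgrep as earlier in the thread):

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Usage: go run check.go <pid>
	// Reads /proc/<pid>/cgroup and reports whether the process ended up
	// under the expected slice.
	data, err := os.ReadFile("/proc/" + os.Args[1] + "/cgroup")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if strings.Contains(string(data), "my-example-cgroup.slice") {
		fmt.Println("process is under my-example-cgroup.slice")
	} else {
		fmt.Println("process is NOT under my-example-cgroup.slice")
	}
}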
