Propose better way to run docker from a unit file #6791

@ibuildthecloud

Systemd does a lot of stuff. Docker does a lot of stuff. That stuff may or may not overlap. I don't really care. I just need to solve one very specific problem: I need a sane way to launch Docker containers in a systemd environment as a system service. As it stands today, the only way I know of is to run docker start -a, or docker run ... without -d. dockerd then launches the container in the background and systemd essentially monitors the docker client. There are two problems with this. First, whether or not the docker client is running says very little about whether the actual container is running. Second, I'm left with a rather large docker run process in memory that provides little value beyond streaming stdout/stderr to journald.

So I hacked up the below script to make things better, or really just to see if it was possible to make things better since the script is just a dirty hack. You don't really need to read the script, just skip down and I'll explain what it does.

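#!/bin/bash
# Launch a container, then move its processes into child cgroups of the
# calling systemd unit so systemd can track them. Proof of concept only.
set -e

# "$@" is the docker command line; with -d, docker prints the container ID.
ID=$(/usr/bin/docker "$@")
PID=$(/usr/bin/docker inspect -f '{{.State.Pid}}' "$ID")

declare -A SRC DEST

# Map each cgroup hierarchy of the container to its cgroup path.
while IFS=: read -r _ NAME LOC; do
        SRC[${NAME##name=}]=$LOC
done < <(grep slice "/proc/$PID/cgroup")

# Do the same for this script, which runs inside the unit's cgroups.
while IFS=: read -r _ NAME LOC; do
        DEST[${NAME##name=}]=$LOC
done < <(grep slice "/proc/$$/cgroup")

for type in "${!SRC[@]}"; do
        from=/sys/fs/cgroup/${type}${SRC[$type]}
        to=/sys/fs/cgroup/${type}${DEST[$type]}/$(basename "${SRC[$type]}")

        echo "$from => $to"
        mkdir -p "$to"
        # Move every process; only the PIDs move, not the cgroup settings.
        for p in $(<"$from/cgroup.procs"); do
                echo "$p" > "$to/cgroup.procs"
        done
done

# Record the container's init PID for systemd's PIDFile=.
echo "$PID" > /var/run/test.pid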

Then I wrote the following unit file:

[Unit]
Description=My Service
After=docker.service
Requires=docker.service

[Service]
ExecStart=/opt/bin/docker-wrapper.sh run -d busybox /bin/sh -c "while true; do echo Hello World; sleep 1; done"
Type=forking
PIDFile=/var/run/test.pid

[Install]
WantedBy=multi-user.target

So what this does (and I know it's a hack, but I wanted to see if my proposal has any chance of working) is this: after the container is launched, I look up the PID of the container and all of its cgroups. I then create child cgroups under the unit's systemd cgroups and move the PIDs from the original cgroups into those child cgroups. Once that is done, I write the PID of the container to a file. I end up with the systemd cgroups as the parent and a child cgroup under each of them, which looks something like this:

  ├─test.service
  │ └─docker-8a0ff7503e0fca4f44d48f76a24cbcae82079818e3ad4d0d707ccf5765698184.scope
  │   ├─19103 /bin/sh -c while true; do echo Hello World; sleep 1; done
  │   └─19169 sleep 1
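
The path arithmetic the script uses to pick the destination cgroup can be isolated as a small function. This is only a minimal sketch, assuming a cgroup v1 layout mounted at /sys/fs/cgroup as in the script above; child_cgroup_path is a hypothetical helper name, not something Docker or systemd provides.

```shell
#!/bin/bash
# Hypothetical helper: compute where a container cgroup lands once it is
# re-parented under a systemd unit's cgroup (cgroup v1 layout assumed).
#   $1 = hierarchy name,          e.g. "cpu" or "systemd"
#   $2 = container's cgroup path, e.g. "/system.slice/docker-abc.scope"
#   $3 = unit's cgroup path,      e.g. "/system.slice/test.service"
child_cgroup_path() {
        local type=$1 src=$2 dest=$3
        # Keep only the last component of the container's cgroup and hang
        # it under the unit's cgroup in the same hierarchy.
        printf '/sys/fs/cgroup/%s%s/%s\n' "$type" "$dest" "$(basename "$src")"
}

# Example with made-up paths:
child_cgroup_path cpu /system.slice/docker-abc.scope /system.slice/test.service
```

With the example paths, the container's scope ends up directly below test.service in the cpu hierarchy, which matches the shape of the tree shown above.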

Also, since I told systemd to use a PIDFile, systemd monitors the container's init process (PID 1 inside the container), because that is the PID I wrote to the file. So now if I do either docker stop or systemctl stop, things just work (at least they seem to), and I don't have a useless docker client hanging around in memory. Now, if you look at the script, you'll notice I'm just moving the PIDs, not the settings. So yeah, it's a total hack that defeats the purpose of the original cgroup, but that's not the point right now.
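
To check by hand that the PID recorded in the PID file still refers to a live process, something like the following could be used. It is a rough sketch, not how systemd itself tracks the main PID; pid_file_alive is a hypothetical helper name, and /var/run/test.pid is the path the wrapper script writes.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the PID recorded in a PID file is alive.
#   $1 = path to the PID file (the wrapper writes /var/run/test.pid)
pid_file_alive() {
        local pid
        [[ -r $1 ]] || return 1
        pid=$(<"$1")
        # Reject anything that is not a plain number before signalling.
        [[ $pid =~ ^[0-9]+$ ]] || return 1
        # kill -0 sends no signal; it only checks that the process exists
        # and that we are allowed to signal it.
        kill -0 "$pid" 2>/dev/null
}
```

For example, `pid_file_alive /var/run/test.pid && echo running` run as root after systemctl start would report whether the container's init process is still there.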

Here's what I propose to make systemd and Docker integration a tad bit better. When you want to run Docker in a systemd unit, you run docker run/start --yo-dawg-use-my-cgroups-as-your-parent ..., which reads the client's current /proc/$$/cgroup and passes it to dockerd. dockerd then just creates its cgroups as children of the cgroups passed in, if the subsystem exists. I think this means we can remove the systemd cgroup code and just use the cgroup-fs-based code (but Docker will still have to write to the name=systemd fs). So now systemd can set up the parent cgroups however it wishes and Docker can set up the child cgroups however it wishes.
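
The client half of that idea might look roughly like the sketch below: read the caller's cgroup memberships and turn them into hierarchy=path pairs that could be handed to the daemon. This is purely illustrative; parent_cgroups is a hypothetical helper, the pair format is an assumption, and no such flag or wire format exists in Docker today.

```shell
#!/bin/bash
# Sketch of the client side of the proposal: collect the caller's cgroup
# memberships so they could be passed to the daemon as parent cgroups.
# Hypothetical helper; nothing like this exists in the docker CLI.
#   $1 = cgroup file to read (defaults to this shell's own)
parent_cgroups() {
        local file=${1:-/proc/$$/cgroup} name path
        # Each line has the form  hierarchy-ID:controller-list:cgroup-path
        while IFS=: read -r _ name path; do
                # Skip entries with no controller name (the cgroup v2 line).
                [[ -n $name ]] || continue
                # Print "cpu=/path" or, for name=systemd, "systemd=/path".
                printf '%s=%s\n' "${name#name=}" "$path"
        done < "$file"
}
```

Run from inside a unit, parent_cgroups would print one line per hierarchy, and the daemon would create the container's cgroup below each listed path where that subsystem exists.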

Is this the best solution? Probably not. But it seems a lot better than what we have today and it solves a current pain point.

Is this just plain stupid, or has it already been thought of and shot down?

Labels: area/api, area/cli, area/systemd, kind/feature
