
gocd-agent-docker-dind:v23.1.0 image can sometimes fail docker tasks run after agent start #11378

Closed
chadlwilson opened this issue Mar 12, 2023 · 1 comment · Fixed by #11406

Comments


chadlwilson commented Mar 12, 2023

Issue Type

  • Bug Report

Summary

Tasks run within a job on the gocd-agent-docker-dind:v23.1.0 image can fail with Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? when they are run almost immediately after agent start-up.

Re-running the job on the same static agent will usually succeed, but on an elastic agent it will usually fail.

Steps to Reproduce

  1. Create a job with a task that needs the daemon, e.g. docker version.
  2. Set the job to use a Docker, ECS, or Kubernetes elastic agent based on gocd-agent-docker-dind:v23.1.0. Alternatively, trigger the job from step 1, but then start a static agent on gocd-agent-docker-dind:v23.1.0.
  3. If the agent starts, registers and obtains work in < 15s, you will see Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (A minimal local sketch of the same race follows below.)
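
For a quicker local check of the same race, without wiring a pipeline to an elastic agent, the sketch below starts the agent image directly and immediately asks the embedded daemon for its version (the container name and server URL are placeholders, not taken from this issue):

# Start the dind agent image (the GO_SERVER_URL value is a placeholder).
docker run -d --name dind-agent --privileged \
  -e GO_SERVER_URL=https://my-server:8153/go \
  gocd/gocd-agent-docker-dind:v23.1.0

# Run this within the first ~15 seconds of start-up: it typically fails with
# "Cannot connect to the Docker daemon at unix:///var/run/docker.sock"
# because dockerd is still sleeping before it binds the socket.
docker exec dind-agent docker version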

Expected Results

Tasks that require the Docker daemon should work normally even when run immediately after agent start-up.

Actual Results

Tasks can fail with Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? if they run too quickly after container start.

Possible Fix

At root the issue is caused by:

$(which dind) dockerd --host=unix:///var/run/docker.sock --host=tcp://0.0.0.0:2375 > /var/log/dockerd.log 2>&1 &

Binding the API to a TCP address without TLS is an insecure way to launch the daemon, so dockerd deliberately sleeps for 15 seconds at start-up to make sure this is what you intended:
https://github.com/moby/moby/blame/dfd89ede4b3c190817bb1b528d5f47ffd12dd6b3/cmd/dockerd/daemon.go#L676-L683

Previously the agent started and picked up work slowly enough that this background sleep 15 had always completed by the time the agent began doing work. Since #11286 the agent starts much faster, so it sometimes picks up work that needs to interact with the docker daemon before the daemon has actually started, exposing a latent bug that technically exists in earlier GoCD versions too.
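
To confirm the delay quickly (a check added here for illustration, not part of the original report), you can inspect the daemon log that the start-up command above writes, from inside a freshly started agent container:

# Log path taken from the daemon start-up command above.
cat /var/log/dockerd.log
# Expect a warning about serving the API on a TCP address without TLS, then a
# pause of roughly 15 seconds before dockerd reports that the API is listening
# on /var/run/docker.sock.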

To address the root issue, we need to either start the TCP listener on loopback only, turn on TLS, or deliberately opt out of listening on TCP at all. This needs some digging into why the daemon is started this way. We may also want to consider going back to re-using the docker-entrypoint from the docker:dind base image, since it is rather sophisticated in its handling and the environment variables it supports.
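
As a rough illustration of that last point, re-using the base image's entrypoint might look like the sketch below. This assumes the dockerd-entrypoint.sh script from the docker:dind base layer is still present in the image and behaves as in the upstream image, which has not been verified here:

# Sketch only: let the docker:dind entrypoint handle TLS set-up. With
# DOCKER_TLS_CERTDIR set, the upstream entrypoint generates certificates and
# serves the API over TLS on port 2376 instead of plain TCP on 2375.
export DOCKER_TLS_CERTDIR=/certs
dockerd-entrypoint.sh > /var/log/dockerd.log 2>&1 &
disown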

Workarounds for 23.1.0

Recommended: Option 1: Override daemon start-up to listen on TCP on localhost only, or disable TCP entirely

This option is probably the most secure - and gives the fastest agent start-up.

Override the built-in run-docker-daemon.sh to tell it to either

  1. Listen on localhost only
  2. Not listen on TCP at all (remove --host=tcp://localhost:2375 from the script below entirely)

Create a file run-docker-daemon.sh to override the built-in one:

#!/bin/bash
$(which dind) dockerd --host=unix:///var/run/docker.sock --host=tcp://localhost:2375 > /var/log/dockerd.log 2>&1 &
disown

When running, override with a bind-mount (or, if building a custom image, override the file from the base layer):

chmod a+x run-docker-daemon.sh
docker run -v $(pwd)/run-docker-daemon.sh:/run-docker-daemon.sh --privileged -e GO_SERVER_URL=https://my-server:8153/go gocd/gocd-agent-docker-dind:v23.1.0

Option 2: Make agent start-up wait for the daemon to be ready

Create a docker-entrypoint script that waits for the daemon to start, and mount it into /docker-entrypoint.d/ (or place it there if building a custom child image).

echo 'until docker stats --no-stream; do sleep 1; done' > wait-for-docker-daemon.sh
chmod a+x wait-for-docker-daemon.sh
docker run -v $(pwd)/wait-for-docker-daemon.sh:/docker-entrypoint.d/wait-for-docker-daemon.sh --privileged -e GO_SERVER_URL=https://my-server:8153/go gocd/gocd-agent-docker-dind:v23.1.0
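
A small variant of the same idea (an untested sketch, not part of the original workaround) bounds the wait so a daemon that never comes up cannot stall agent start-up forever; it probes with docker version instead of docker stats, which serves the same purpose here:

cat > wait-for-docker-daemon.sh <<'EOF'
#!/bin/sh
# Give dockerd up to ~30 seconds to come up, then carry on with a warning.
for i in $(seq 1 30); do
  docker version > /dev/null 2>&1 && exit 0
  sleep 1
done
echo "dockerd did not become ready within 30s" >&2
EOF
chmod a+x wait-for-docker-daemon.sh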

Option 3: Edit tasks in pipelines

  • Add a sleep 15 command task as the first task in each job that uses an elastic agent.

This does not scale well. It is also wasteful, and not sensible with static dind agents, since the wait happens on every run of the job rather than only when the agent first starts.

It should really only be used as a simple workaround if you have few pipelines.

Possible other workarounds (to be validated)

  1. Stick with using gocd-agent-docker-dind:v22.3.0 for now
    • This might work by luck, since it forces the agent launcher to "upgrade" each time it launches, which can add some wait time, but it is not guaranteed.

chadlwilson commented Mar 13, 2023

Hey hey @ketan or @arvindsv - do you happen to know/remember whether any thought was given to which situations actually need --host=tcp://0.0.0.0:2375 (bind to all interfaces) for the Docker daemon in a dind image, and whether --host=tcp://localhost:2375 would suffice for most folks, or whether listening on TCP could be disabled entirely?

I probably need to do some research, but my understanding was that most tools/CLIs can either talk to the unix socket directly (which has appropriate permissions), or that listening on localhost is sufficient for those libraries which can only talk TCP/HTTP to the Docker daemon API.
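
For example (an illustrative sketch, not verified against any specific library), a TCP-only client inside the container could simply be pointed at the loopback listener via the standard DOCKER_HOST convention:

# Point TCP-only clients at the loopback listener instead of 0.0.0.0.
export DOCKER_HOST=tcp://localhost:2375
docker version   # or any client library that only speaks HTTP/TCP to the API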

I can't really see why, for a build agent, we'd want to bind to all network interfaces by default, especially if you could be running with host network mode or within Kubernetes or something like that, which might allow other pods or containers to interact with the daemon in an unexpected way?
