
gocd-agent-docker-dind:v23.1.0 image can sometimes fail docker tasks run after agent start #11378

Closed
chadlwilson opened this issue Mar 12, 2023 · 1 comment · Fixed by #11406

Comments


chadlwilson commented Mar 12, 2023

Issue Type

  • Bug Report

Summary

Tasks run within a job on the gocd-agent-docker-dind:v23.1.0 image can fail with Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? when they are run almost immediately after agent start-up.

Re-running the job on the same static agent will usually succeed, but on an elastic agent it will usually fail.

Steps to Reproduce

  1. Create a job with a task that needs the daemon, e.g. docker version.
  2. Set the job to use a Docker, ECS, or Kubernetes elastic agent based on gocd-agent-docker-dind:v23.1.0. Alternatively, trigger the job from step 1, but then start a static agent on gocd-agent-docker-dind:v23.1.0.
  3. If the agent starts, registers and obtains work in < 15s, you will see Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (A minimal local sketch of the same race follows below.)
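
For a quicker local check of the same race, without wiring a pipeline to an elastic agent, the sketch below starts the agent image directly and immediately asks the embedded daemon for its version (the container name and server URL are placeholders, not taken from this issue):

# Start the dind agent image (the GO_SERVER_URL value is a placeholder).
docker run -d --name dind-agent --privileged \
  -e GO_SERVER_URL=https://my-server:8153/go \
  gocd/gocd-agent-docker-dind:v23.1.0

# Run this within the first ~15 seconds of start-up: it typically fails with
# "Cannot connect to the Docker daemon at unix:///var/run/docker.sock"
# because dockerd is still sleeping before it binds the socket.
docker exec dind-agent docker version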

Expected Results

Tasks that require the Docker daemon should work normally even when run immediately after agent start-up.

Actual Results

Tasks can fail with Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? if they run too quickly after container start.

Possible Fix

At root the issue is caused by:

$(which dind) dockerd --host=unix:///var/run/docker.sock --host=tcp://0.0.0.0:2375 > /var/log/dockerd.log 2>&1 &

Binding the API to a TCP address without TLS is an insecure way to launch the daemon, so dockerd deliberately sleeps for 15 seconds at start-up to make sure this is what you intended:
https://github.com/moby/moby/blame/dfd89ede4b3c190817bb1b528d5f47ffd12dd6b3/cmd/dockerd/daemon.go#L676-L683

Previously the agent started and picked up work slowly enough that this background sleep 15 had always completed by the time the agent began doing work. Since #11286 the agent starts much faster, so it sometimes picks up work that needs to interact with the docker daemon before the daemon has actually started, exposing a latent bug that technically exists in earlier GoCD versions too.
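
To confirm the delay quickly (a check added here for illustration, not part of the original report), you can inspect the daemon log that the start-up command above writes, from inside a freshly started agent container:

# Log path taken from the daemon start-up command above.
cat /var/log/dockerd.log
# Expect a warning about serving the API on a TCP address without TLS, then a
# pause of roughly 15 seconds before dockerd reports that the API is listening
# on /var/run/docker.sock.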

To address the root issue, we need to either start the TCP listener on loopback only, turn on TLS, or deliberately opt out of listening on TCP at all. This needs some digging into why the daemon is started this way. We may also want to consider going back to re-using the docker-entrypoint from the docker:dind base image, since it is rather sophisticated in its handling and the environment variables it supports.
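
As a rough illustration of that last point, re-using the base image's entrypoint might look like the sketch below. This assumes the dockerd-entrypoint.sh script from the docker:dind base layer is still present in the image and behaves as in the upstream image, which has not been verified here:

# Sketch only: let the docker:dind entrypoint handle TLS set-up. With
# DOCKER_TLS_CERTDIR set, the upstream entrypoint generates certificates and
# serves the API over TLS on port 2376 instead of plain TCP on 2375.
export DOCKER_TLS_CERTDIR=/certs
dockerd-entrypoint.sh > /var/log/dockerd.log 2>&1 &
disown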

Workarounds for 23.1.0

Recommended: Option 1: Override daemon start-up to listen on TCP on localhost only, or disable TCP entirely

This option is probably the most secure - and gives the fastest agent start-up.

Override the built-in run-docker-daemon.sh to tell it to either

  1. Listen on localhost only
  2. Not listen on TCP at all (remove --host=tcp://localhost:2375 from the script below entirely)

Create a file run-docker-daemon.sh to override the built-in one:

#!/bin/bash
$(which dind) dockerd --host=unix:///var/run/docker.sock --host=tcp://localhost:2375 > /var/log/dockerd.log 2>&1 &
disown

When running, override with a bind-mount (or, if building a custom image, override the file from the base layer):

chmod a+x run-docker-daemon.sh
docker run -v $(pwd)/run-docker-daemon.sh:/run-docker-daemon.sh --privileged -e GO_SERVER_URL=https://my-server:8153/go gocd/gocd-agent-docker-dind:v23.1.0

Option 2: Make agent start-up wait for the daemon to be ready

Create a docker-entrypoint script that waits for the daemon to start, and mount it into /docker-entrypoint.d/ (or place it there if building a custom child image).

echo 'until docker stats --no-stream; do sleep 1; done' > wait-for-docker-daemon.sh
chmod a+x wait-for-docker-daemon.sh
docker run -v $(pwd)/wait-for-docker-daemon.sh:/docker-entrypoint.d/wait-for-docker-daemon.sh --privileged -e GO_SERVER_URL=https://my-server:8153/go gocd/gocd-agent-docker-dind:v23.1.0
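
A small variant of the same idea (an untested sketch, not part of the original workaround) bounds the wait so a daemon that never comes up cannot stall agent start-up forever; it probes with docker version instead of docker stats, which serves the same purpose here:

cat > wait-for-docker-daemon.sh <<'EOF'
#!/bin/sh
# Give dockerd up to ~30 seconds to come up, then carry on with a warning.
for i in $(seq 1 30); do
  docker version > /dev/null 2>&1 && exit 0
  sleep 1
done
echo "dockerd did not become ready within 30s" >&2
EOF
chmod a+x wait-for-docker-daemon.sh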

Option 3: Edit tasks in pipelines

  • Add a sleep 15 command task as the first task in each job that uses an elastic agent.

This does not scale well. It is also wasteful, and not sensible with static dind agents, since the wait happens on every run of the job rather than only when the agent first starts.

It should really only be used as a simple workaround if you have few pipelines.

Possible other workarounds (to be validated)

  1. Stick with using gocd-agent-docker-dind:v22.3.0 for now
    • This might work by luck, since it forces the agent launcher to "upgrade" each time it launches, which can add some wait time, but it is not guaranteed.

chadlwilson commented Mar 13, 2023

Hey hey @ketan or @arvindsv - do you happen to know/remember whether any thought was given to which situations actually need --host=tcp://0.0.0.0:2375 (bind to all interfaces) for the Docker daemon in a dind image, and whether --host=tcp://localhost:2375 would suffice for most folks, or whether listening on TCP could be disabled entirely?

I probably need to do some research, but my understanding was that most tools/CLIs can either talk to the unix socket directly (which has appropriate permissions), or that listening on localhost is sufficient for those libraries which can only talk TCP/HTTP to the Docker daemon API.
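
For example (an illustrative sketch, not verified against any specific library), a TCP-only client inside the container could simply be pointed at the loopback listener via the standard DOCKER_HOST convention:

# Point TCP-only clients at the loopback listener instead of 0.0.0.0.
export DOCKER_HOST=tcp://localhost:2375
docker version   # or any client library that only speaks HTTP/TCP to the API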

I can't really see why, for a build agent, we'd want to bind to all network interfaces by default, especially if you could be running with host network mode or within Kubernetes or something like that, which might allow other pods or containers to interact with the daemon in an unexpected way?
