Summary
Tasks run within a job on the gocd-agent-docker-dind:v23.1.0 image can fail with "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" when they are run almost immediately after agent start-up.
Re-running the job on the same static agent will usually succeed, but on an elastic agent it will usually fail.
Steps to Reproduce
1. Create a job with a task that needs the daemon, e.g. docker version.
2. Set the job to use a Docker, ECS or Kubernetes elastic agent based on gocd-agent-docker-dind:v23.1.0. Alternatively, trigger the job from step 1, then start a static agent on gocd-agent-docker-dind:v23.1.0.
3. If the agent starts, registers and obtains work in less than 15 seconds, you will see "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
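The timing problem can probably also be observed outside GoCD by calling the daemon inside the container right after it starts. The following is only a rough sketch under that assumption: the function name repro_early_docker_call is made up for this example, and the server URL is the placeholder used elsewhere in this issue.

```shell
#!/bin/sh
# Hypothetical reproduction helper (not part of GoCD): start a dind agent
# container and call the Docker daemon inside it straight away.
IMAGE="gocd/gocd-agent-docker-dind:v23.1.0"
GO_SERVER_URL="${GO_SERVER_URL:-https://my-server:8153/go}"   # placeholder URL

repro_early_docker_call() {
  # Start the agent container detached; --privileged is needed for dind.
  cid=$(docker run -d --privileged -e GO_SERVER_URL="$GO_SERVER_URL" "$IMAGE") || return 2
  # Talk to the inner daemon immediately, much as a fast-starting agent's first task would.
  if docker exec "$cid" docker version >/dev/null 2>&1; then
    echo "daemon ready"
  else
    echo "daemon not ready"
  fi
  # Clean up the test container.
  docker rm -f "$cid" >/dev/null 2>&1
}
```

Calling repro_early_docker_call on a host with Docker installed should print "daemon not ready" while the inner daemon is still starting up, if the analysis in this issue is right.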
Expected Results
Tasks that need the Docker daemon should work normally, even when they run immediately after agent start-up.
Actual Results
Tasks can fail with "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" if they run too quickly after container start.
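Possible Fix
At root, the issue is caused by line 17 of gocd/buildSrc/src/main/resources/gocd-docker-agent/run-docker-daemon.sh (at commit 0f58107). That line launches the daemon listening on TCP without TLS, which is insecure, so Docker puts a sleep 15 in the start-up to make sure it is what you intended: https://github.com/moby/moby/blame/dfd89ede4b3c190817bb1b528d5f47ffd12dd6b3/cmd/dockerd/daemon.go#L676-L683
Previously the agent started and picked up work so slowly that this background sleep 15 had always finished by the time the agent began doing work. Since #11286 the agent starts much faster, so it sometimes picks up work that needs the Docker daemon before the daemon has actually started. This exposes an earlier bug that technically exists in previous GoCD versions too.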
To address the root issue, we need to either start TCP only on loopback, turn on TLS, or deliberately opt out. This needs some digging into why the daemon is started this way. We may also want to consider going back to re-using the docker-entrypoint in the docker:dind base image, since it is rather sophisticated in its handling and in the env vars it supports.
Workarounds for 23.1.0
Option 1 (recommended): Override daemon start-up to listen for TCP on localhost only, or disable TCP entirely
This option is probably the most secure - and gives the fastest agent start-up.
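Override run-docker-daemon.sh to tell the daemon to either:
- listen on localhost only, or
- not listen on TCP at all (remove the --host=tcp://localhost:2375 option entirely)
Create a file run-docker-daemon.sh to override the one built-in. When running, override it with a bind-mount (or, if building a custom image, override the file from the base layer).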
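The override script from the original issue is not reproduced in this copy, so the following is only a minimal sketch of what such a file might contain. It assumes the built-in script does little more than launch dockerd in the background. TCP_LISTEN, dockerd_host_args and start_docker_daemon are names invented for this example; only dockerd and its --host flag come from Docker itself.

```shell
#!/bin/sh
# Sketch of a replacement run-docker-daemon.sh (NOT the original from the issue).
# TCP_LISTEN is a hypothetical switch introduced for this example:
#   localhost -> also listen on tcp://localhost:2375
#   anything else (e.g. none) -> unix socket only
TCP_LISTEN="${TCP_LISTEN:-localhost}"

dockerd_host_args() {
  # The unix socket is always served; the docker CLI uses it by default.
  printf '%s' "--host=unix:///var/run/docker.sock"
  if [ "$1" = "localhost" ]; then
    # Loopback-only TCP for clients that can only speak TCP/HTTP.
    printf ' %s' "--host=tcp://localhost:2375"
  fi
}

start_docker_daemon() {
  # Word splitting of the generated argument string is intentional.
  # shellcheck disable=SC2046
  dockerd $(dockerd_host_args "$TCP_LISTEN") &
}
```

A real override would end with a call to start_docker_daemon, and would be bind-mounted over the built-in script's location inside the image. Check the image for the exact path, which is not stated here.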
Option 2: Make agent start-up wait for the daemon to be ready
Create a docker-entrypoint script that waits for the daemon to start, and mount it into /docker-entrypoint.d/ (or place it there if building a custom child image).
echo 'until docker stats --no-stream; do sleep 1; done' > wait-for-docker-daemon.sh
chmod a+x wait-for-docker-daemon.sh
docker run -v "$(pwd)/wait-for-docker-daemon.sh:/docker-entrypoint.d/wait-for-docker-daemon.sh" --privileged -e GO_SERVER_URL=https://my-server:8153/go gocd/gocd-agent-docker-dind:v23.1.0
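The one-line loop above waits forever if the daemon never comes up. A variant with an upper bound might look like the sketch below. The function name wait_for_docker_daemon and its two arguments are hypothetical, and docker info is used here simply as another command that fails until the daemon accepts connections.

```shell
#!/bin/sh
# Sketch: wait for the Docker daemon, but give up after a bounded number of tries.
# wait_for_docker_daemon and its arguments are hypothetical names for this example.
wait_for_docker_daemon() {
  max_tries="${1:-60}"   # attempts before giving up
  delay="${2:-1}"        # seconds to sleep between attempts
  tries=0
  # docker info needs a reachable daemon, so it fails until the daemon is up.
  until docker info >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "Docker daemon not ready after $tries attempts" >&2
      return 1
    fi
    sleep "$delay"
  done
  return 0
}
```

In the entrypoint script this would be invoked as, say, wait_for_docker_daemon 60 1. Whether a failure should stop the agent from starting is a design choice left open here.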
Option 3: Edit tasks in pipelines
Add a sleep 15 command task as the first task in each job that uses an elastic agent.
This will not scale very well. It is also wasteful, and not sensible with static dind agents, since the wait will happen every time the job runs, not just when the agent first starts.
It should really only be used as a simple workaround if you have few pipelines.
Possible other workarounds (to be validated)
Stick with using gocd-agent-docker-dind:v22.3.0 for now
This might work by luck, by forcing the agent launcher to "upgrade" each time it launches, which can add some wait time, but it is not guaranteed.
Hey hey @ketan or @arvindsv - do you happen to know/remember if any thought was given to which situations --host=tcp://0.0.0.0:2375 (bind to all interfaces) is needed for the Docker daemon in a dind image, and whether --host=tcp://localhost:2375 would suffice for most folks, or whether listening on TCP can be generally disabled?
I probably need to do some research, but my understanding was that most tools/CLIs can either deal with the unix socket directly (which has appropriate permissions) OR listening on localhost is sufficient for those libraries which can only talk TCP/HTTP to the Docker Daemon API.
I can't really see why, for a build agent, we'd want to bind to all network interfaces by default, especially if you could be running with host networking (--network host) or within Kubernetes or something like that, which might allow other pods or containers to interact with the daemon in an unexpected way.
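One way to check which endpoints a given tool could actually use is to probe them from inside the agent container. A small sketch, where check_endpoints is a made-up helper name and -H is the docker CLI's standard host option:

```shell
#!/bin/sh
# Sketch: report which Docker daemon endpoints answer from the current environment.
# check_endpoints is a hypothetical helper, not part of Docker or GoCD.
check_endpoints() {
  for endpoint in "$@"; do
    # docker -H selects the daemon endpoint for this one invocation.
    if docker -H "$endpoint" version >/dev/null 2>&1; then
      echo "$endpoint reachable"
    else
      echo "$endpoint unreachable"
    fi
  done
}
```

For example, check_endpoints unix:///var/run/docker.sock tcp://localhost:2375 run inside the container would show whether the unix socket alone is enough for the tools in use.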
chadlwilson added a commit to gocd-contrib/gocd-oss-cookbooks that referenced this issue on Mar 14, 2023