Docker in Docker stopped working #31889

Closed
tosi3k opened this issue Feb 8, 2024 · 22 comments
Assignees: BenTheElder
Labels:
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.

Comments

tosi3k (Member) commented on Feb 8, 2024:

What happened:

Docker in Docker doesn't work. For example, in https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/1755589711310622720 we get the following error when DinD is enabled:

Docker in Docker enabled, initializing...
================================================================================
/etc/init.d/docker: 62: ulimit: error setting limit (Invalid argument)
Waiting for docker to be ready, sleeping for 1 seconds.
Waiting for docker to be ready, sleeping for 2 seconds.
Waiting for docker to be ready, sleeping for 3 seconds.
Waiting for docker to be ready, sleeping for 4 seconds.
Waiting for docker to be ready, sleeping for 5 seconds.
Reached maximum attempts, not waiting any longer...
================================================================================
Done setting up docker in docker.

What you expected to happen:

DinD works.

How to reproduce it (as minimally and precisely as possible):

Execute any Prow job whose image is based on the bootstrap image, with DOCKER_IN_DOCKER_ENABLED set to true:

# Check if the job has opted-in to docker-in-docker availability.
export DOCKER_IN_DOCKER_ENABLED=${DOCKER_IN_DOCKER_ENABLED:-false}
if [[ "${DOCKER_IN_DOCKER_ENABLED}" == "true" ]]; then
  echo "Docker in Docker enabled, initializing..."
  printf '=%.0s' {1..80}; echo
  # If we have opted in to docker in docker, start the docker daemon,
  service docker start
  # the service can be started but the docker socket not ready, wait for ready
  WAIT_N=0
  MAX_WAIT=5
  while true; do
    # docker ps -q should only work if the daemon is ready
    docker ps -q > /dev/null 2>&1 && break
    if [[ ${WAIT_N} -lt ${MAX_WAIT} ]]; then
      WAIT_N=$((WAIT_N+1))
      echo "Waiting for docker to be ready, sleeping for ${WAIT_N} seconds."
      sleep ${WAIT_N}
    else
      echo "Reached maximum attempts, not waiting any longer..."
      break
    fi
  done
  printf '=%.0s' {1..80}; echo
  echo "Done setting up docker in docker."
fi

service docker start simply fails with /etc/init.d/docker: 62: ulimit: error setting limit (Invalid argument).

Please provide links to example occurrences, if any:

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/1755589711310622720

Anything else we need to know?:

I see this is an issue in the newest Docker release: docker/cli#4807.

A workaround is mentioned in https://forums.docker.com/t/etc-init-d-docker-62-ulimit-error-setting-limit-invalid-argument-problem/139424.

@tosi3k tosi3k added the kind/bug Categorizes issue or PR as related to a bug. label Feb 8, 2024
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 8, 2024
tosi3k (Member, Author) commented on Feb 8, 2024:

/sig testing

@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 8, 2024
tosi3k (Member, Author) commented on Feb 8, 2024:

IIUC, to fix this we can either:

  • Replace the -Hn argument to ulimit in /etc/init.d/docker with -n, restoring the previous behavior, or
  • Bump the hard limit value for the number of open file descriptors in /etc/security/limits.conf

both via the images/bootstrap/runner.sh file (a sketch of the first option follows below).
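
A minimal sketch of the first option, assuming runner.sh patches the init script before starting the daemon; the sed expression is illustrative, not the exact change that landed:

# Illustrative: rewrite the hard-limit-only ulimit call in the Docker init
# script back to the soft+hard form used before Docker 25, then start it.
sed -i 's/ulimit -Hn/ulimit -n/' /etc/init.d/docker
service docker start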

tosi3k (Member, Author) commented on Feb 8, 2024:

CC @BenTheElder

mboersma (Contributor) commented on Feb 8, 2024:

Same here: all our Cluster API Provider Azure jobs started failing with the same error as above. See kubernetes-sigs/cluster-api-provider-azure#4553.

mboersma (Contributor) commented on Feb 8, 2024:

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 8, 2024
BenTheElder (Member) commented:

This also changed with picking up a new docker version, right? (haven't had time to look myself)

We probably also want to revert Docker to 24 overall in the meantime; we've had other 25.x breakage like kubernetes-sigs/kind#3487.

tosi3k (Member, Author) commented on Feb 8, 2024:

This also changed with picking up a new docker version, right?

Exactly, it started failing with Docker 25.0.0.

Could this get prioritized? It completely blocks submission of any pull requests in k/k as well.

BenTheElder (Member) commented on Feb 8, 2024:

The images and job configs are in this repo, PRs welcome?
AFAIK nobody's day job is working on these images; I'm trying to get to it, but we're also in the enhancements freeze crunch and in meetings I can't skip.

BenTheElder (Member) commented:

The quickest option is to roll back the images in the job config. Ideally, whoever rolled them forward would do that.

BenTheElder (Member) commented:

#31900

@BenTheElder BenTheElder self-assigned this Feb 8, 2024
BenTheElder (Member) commented:

We'll also need to actually deal with the docker-in-docker changes, but this should've been rolled back first; then we can roll forward with a fix.

BenTheElder (Member) commented:

If you see something like this in the future, please ping #testing-ops to raise visibility faster than the issue tracker.

BenTheElder (Member) commented:

The roll-forward is being discussed in #sig-testing on Slack.

dims (Member) commented on Feb 9, 2024:

Here's how to test the kubekins image:

IMAGE=gcr.io/k8s-staging-test-infra/kubekins-e2e:v20240207-d8632cc3bc-1.29

(sudo mkdir -p /tmp/docker-graph && sudo chmod -R 777 /tmp/docker-graph && cd /tmp/docker-graph && rm -rf *)

docker run -e DOCKER_IN_DOCKER_ENABLED=true --privileged --rm \
  --entrypoint=/bin/bash \
  -it \
  -v /tmp/docker-graph:/docker-graph \
  -v $HOME/go/src/k8s.io/kubernetes:/go/src/k8s.io/kubernetes \
  $IMAGE -c "cd /go/src/k8s.io/kubernetes; /bin/bash"

When you get dropped into the command line, you can try starting the Docker daemon, printing the version, etc.:

root@f1ae308f4420:/go/src/k8s.io/kubernetes# docker version
Client: Docker Engine - Community
 Version:           25.0.2
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        29cf629
 Built:             Thu Feb  1 00:23:17 2024
 OS/Arch:           linux/amd64
 Context:           default
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
root@f1ae308f4420:/go/src/k8s.io/kubernetes# service docker start
/etc/init.d/docker: 62: ulimit: error setting limit (Invalid argument)
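
The error message points at line 62 of the init script; to look at the offending lines from inside that container, something like this works:

# Print the lines around the failing ulimit call in the init script:
sed -n '60,64p' /etc/init.d/docker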

dims (Member) commented on Feb 9, 2024:

Looks like we need to fix /etc/init.d/docker as mentioned here:
https://forums.docker.com/t/etc-init-d-docker-62-ulimit-error-setting-limit-invalid-argument-problem/139424

Currently it looks like:

		# Only set the hard limit (soft limit should remain as the system default of 1024):
		ulimit -Hn 524288

We need to switch from -Hn to just -n; the changed line is sketched below.
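
A sketch of the changed line (plain -n raises both the soft and the hard limit, the pre-25.0 behavior):

		# Raise the open-file limit (soft and hard), restoring pre-25.0 behavior:
		ulimit -n 524288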

kannon92 (Contributor) commented on Feb 9, 2024:

I'm having a nasty issue in kubernetes-sigs/jobset#400 and I wonder if it's related.

I'm seeing a segmentation fault in controller-tools, but it's not clear to me whether this has been fixed for my jobs yet.

BenTheElder (Member) commented:

I wouldn't expect a unit test job to have docker in docker enabled, is it even using docker?

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_jobset/400/pull-jobset-test-unit-main/1756049400766926848

That looks like a nil deref error in the code under test?

BenTheElder (Member) commented:

DinD should be fixed now; I believe @dims's roll-forward fix is now out?

kannon92 (Contributor) commented on Feb 9, 2024:

I wouldn't expect a unit test job to have docker in docker enabled, is it even using docker?

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_jobset/400/pull-jobset-test-unit-main/1756049400766926848

That looks like a nil deref error in the code under test?

Yeah, we could move things around, but we use Docker to generate some Python SDK code and run some unit tests in Python.

Either way, I also see the same failure in the e2e tests where I am using kind: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_jobset/400/pull-jobset-test-e2e-main-1-29/1756049401039556608

So I don't think the unit test code is the issue.

We are seeing this on a few of our PRs, so it's not related to any code in the PRs.

kannon92 (Contributor) commented on Feb 9, 2024:

Maybe it's just a coincidence. I'll look into this failure on my end.

kannon92 (Contributor) commented on Feb 9, 2024:

Circling back to this: @alculquicondor pointed out to me that the images we use for these jobs use Go 1.22, and that seems to have broken the controller-gen/controller-tools version I was using.

I updated my controller-tools to 0.14.0 and the CI for jobset seems happy again: kubernetes-sigs/jobset#403. (A sketch of the equivalent manual bump follows.)
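
For reference, a sketch of the equivalent manual bump, assuming controller-gen is installed directly from its module path rather than through jobset's Makefile:

# Install the controller-tools release that works with the Go 1.22 images:
go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.14.0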

BenTheElder (Member) commented:

That sounds right, see:
https://kubernetes.slack.com/archives/CHGFYJVAN/p1707478161123049

FWIW, with Go 1.21+ you can now use either the go.mod toolchain directive or the GOTOOLCHAIN env var to control the Go version.
Previously, repos like kind and then k/k controlled the Go version themselves with gimme plus some extra bash, but now you can just use https://go.dev/doc/toolchain in a standard way and control Go upgrades via a PR to your repo under test.
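
For example, a minimal sketch (the pinned version is illustrative):

# Pin the repo's Go toolchain via go.mod (Go 1.21+):
go mod edit -toolchain=go1.21.6

# Or override the toolchain for a single invocation via the environment:
GOTOOLCHAIN=go1.21.6 go test ./...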

I'm going to close this as fixed for now; we can follow up on another issue if we should revert Go 1.22.

Sawthis pushed a commit to kyma-project/k8s-prow that referenced this issue on Feb 12, 2024:
…images and k8s-staging-test-infra AR images"

This reverts commit 2bf070c.

See: kubernetes#31889