Docker Daemon gets stuck when containerd fails to create a container #33828

Open
eugene-dounar opened this issue Jun 26, 2017 · 9 comments

Comments

@eugene-dounar

eugene-dounar commented Jun 26, 2017

Description
When containerd fails to create a container within 2 minutes, Docker stops responding to docker ps and other commands.

Steps to reproduce the issue:

  1. Run a container that writes to AUFS heavily enough to trigger "container did not start before the specified timeout" errors: docker run --name writer myimage
  2. Run docker run --name test hello-world (gets stuck)
  3. Run docker ps (also gets stuck)
  4. Wait more than two minutes and run docker ps again (still stuck)
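
When reproducing steps 2 to 4, it can help to bound each command so that the shell itself does not get stuck. The following is a minimal sketch, assuming the coreutils `timeout` utility is installed; `responds_within` is a hypothetical helper name, not part of Docker.

```shell
# responds_within SECONDS CMD [ARGS...]
# Returns 0 if CMD finishes successfully within SECONDS.
# Returns non-zero if CMD fails or is killed by the timeout.
responds_within() {
  limit=$1
  shift
  timeout "$limit" "$@" >/dev/null 2>&1
}

# Example use against the daemon (left commented out, requires Docker):
# responds_within 10 docker ps || echo "docker ps did not answer within 10s"
```

A non-zero result here does not distinguish a hung daemon from a missing or failing docker binary, so it is only a rough probe.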

Describe the results you received:
docker ps is stuck

Describe the results you expected:
docker ps shows a single container named writer

Additional information you deem important (e.g. issue happens only occasionally):
We had a few containers writing heavily to container storage. The issue did not occur after those containers were stopped.
I cannot attach all the logs here as they contain confidential information (I'll attach them to the Commercial Support ticket). Here are the syslog parts related to the "hello-world" container that caused Docker to hang:

Jun 23 16:23:09 docker-linux-4-dh dockerd[79073]: time="2017-06-23T16:23:09.979001725Z" level=debug msg="Calling POST /v1.24/containers/951541ba1eaf92253df623203b35e9409cda2e78e090cb0d700a3a8935b9096e/start"
Jun 23 16:23:09 docker-linux-4-dh dockerd[79073]: time="2017-06-23T16:23:09.980299932Z" level=debug msg="container mounted via layerStore: /opt/io1/docker/aufs/mnt/b2f8fcec0e7627457a2df0f032dd2239bc7aad092df04d218096fa6a245b6de2"
Jun 23 16:23:23 docker-linux-4-dh dockerd[79073]: time="2017-06-23T16:23:23.664683692Z" level=debug msg="Calling POST /v1.24/containers/951541ba1eaf92253df623203b35e9409cda2e78e090cb0d700a3a8935b9096e/kill?signal=TERM"
Jun 23 16:23:23 docker-linux-4-dh dockerd[79073]: time="2017-06-23T16:23:23.664790900Z" level=debug msg="Sending kill signal 15 to container 951541ba1eaf92253df623203b35e9409cda2e78e090cb0d700a3a8935b9096e"
Jun 23 16:25:10 docker-linux-4-dh dockerd[79073]: time="2017-06-23T16:25:10.014134874Z" level=error msg="containerd: start container" error="containerd: container did not start before the specified timeout" id=951541ba1eaf92253df623203b35e9409cda2e78e090cb0d700a3a8935b9096e
Jun 23 16:25:10 docker-linux-4-dh dockerd[79073]: time="2017-06-23T16:25:10.016980643Z" level=error msg="Create container failed with error: containerd: container did not start before the specified timeout"
Jun 23 16:25:10 docker-linux-4-dh dockerd[79073]: time="2017-06-23T16:25:10.491039628Z" level=debug msg="Failed to unmount b2f8fcec0e7627457a2df0f032dd2239bc7aad092df04d218096fa6a245b6de2 aufs: device or resource busy"
Jun 23 16:25:10 docker-linux-4-dh dockerd[79073]: time="2017-06-23T16:25:10.491130334Z" level=error msg="Error unmounting container 951541ba1eaf92253df623203b35e9409cda2e78e090cb0d700a3a8935b9096e: device or resource busy"
Jun 23 16:25:14 docker-linux-4-dh dockerd[79073]: time="2017-06-23T16:25:14.171515399Z" level=debug msg="Calling GET /v1.27/containers/951541ba1eaf92253df623203b35e9409cda2e78e090cb0d700a3a8935b9096e/json"
Jun 23 16:28:09 docker-linux-4-dh dockerd[79073]: time="2017-06-23T16:28:09.983929075Z" level=debug msg="Calling DELETE /v1.24/containers/951541ba1eaf92253df623203b35e9409cda2e78e090cb0d700a3a8935b9096e?force=1&v=1"
Jun 23 16:28:47 docker-linux-4-dh dockerd[79073]: time="2017-06-23T16:28:47.196358619Z" level=debug msg="Calling GET /v1.27/containers/951541ba1eaf92253df623203b3/json"

Goroutine dump: goroutine-stacks-2017-06-23T162427Z.txt

Output of docker version:
Docker was upgraded to 17.03.2-ee-4 on this exact host, so I've copied version/info from another one with the same configuration.

Client:
 Version:      17.03.1-ee-3
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   3fcee33
 Built:        Thu Mar 30 20:06:11 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.1-ee-3
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   3fcee33
 Built:        Thu Mar 30 20:06:11 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 137
 Running: 110
 Paused: 0
 Stopped: 27
Images: 55
Server Version: 17.03.1-ee-3
Storage Driver: aufs
 Root Dir: /srv/docker_root/aufs
 Backing Filesystem: extfs
 Dirs: 1058
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-78-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 960.7 GiB
Name: docker-linux-5-dh
ID: IMBQ:LPGX:6MBG:HCA4:ZWFL:BGNB:MJ5Y:XG5H:JU2Q:E7B5:4LEL:INMI
Docker Root Dir: /srv/docker_root
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

Additional environment details (AWS, VirtualBox, physical, etc.):
AWS x1.16xlarge instance

@jtmoon79

Similar to #32995

@thaJeztah
Member

Looks like #22226, which led to the timeout being bumped to 2 minutes in #23176

2 minutes is awfully long though (containers should normally start in milliseconds, or seconds in extreme cases).

A big change was made in Docker 17.07 and later to prevent some occurrences of docker ps hanging (see #31273)

ping @mlaventure anything that can be done here?

@l1va

l1va commented Nov 30, 2017

I have the same problem with Version: 17.09.0-ce
docker ps is not hanging, but I get "containerd: container did not start before the specified timeout". I cannot run any container, even hello-world :(

@l1va

l1va commented Dec 1, 2017

Solved by increasing memory and rebooting.

@lav-patel

@l1va did you increase the memory allocation for the container or for the machine itself?

@udalrich

@thaJeztah We have an image preloaded with a lot of data, which copies the data into a Docker volume. When the volume is first created, dozens of GB are copied to the volume. This takes minutes. Once the volume is initialized, the container does start quickly.

@thaJeztah
Member

@udalrich it may be difficult to take that use-case into account; raising the limit too far could mean that containers with an actual problem would never be marked as "faulty". Wondering if an entrypoint script that copies the data to the volume location would help in your case.

@udalrich

@thaJeztah

So something like

if (running usual command)
then
  if data not copied
  then
    copy data to volume
    update db config to point to volume
  fi
  start db
else
  run requested cmd
fi
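
The pseudocode above could be fleshed out roughly as follows. This is only a sketch under stated assumptions: the seed data is assumed to live in the image at a path outside the volume mount point, and `SEED_DIR`, `VOLUME_DIR`, `DB_CMD`, the `.initialized` marker file, and both function names are hypothetical placeholders, not anything from the image discussed here.

```shell
#!/bin/sh
# Hypothetical defaults; none of these names come from the thread.
SEED_DIR="${SEED_DIR:-/seed-data}"       # data baked into the image, outside the volume mount
VOLUME_DIR="${VOLUME_DIR:-/var/lib/db}"  # where the Docker volume is mounted
DB_CMD="${DB_CMD:-start-db}"             # the image's usual command

# Copy the seed data into the volume once.
# The marker file is written only after the copy succeeds,
# so an interrupted copy is retried on the next start.
init_volume() {
  src=$1
  dst=$2
  if [ ! -f "$dst/.initialized" ]; then
    mkdir -p "$dst"
    # A real image would also point the db config at "$dst" here.
    cp -R "$src"/. "$dst"/ && touch "$dst/.initialized"
  fi
}

# Initialize the volume only when the usual command is requested,
# then hand over to whatever command was asked for.
entrypoint() {
  if [ "$1" = "$DB_CMD" ]; then
    init_volume "$SEED_DIR" "$VOLUME_DIR"
  fi
  exec "$@"
}

# In an actual entrypoint script the last line would be: entrypoint "$@"
```

With this shape the copy runs inside the container's own process rather than before it starts, which is presumably why it would no longer count against the start timeout.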

@thaJeztah
Member

Yes, that's what I was thinking (a bit similar to the approach taken by the official WordPress image).

Let me know if that works 👍
