
docker stop command hangs after daemon reload and restart - docker 17.03 #31997

Closed
kopnok opened this issue Mar 22, 2017 · 18 comments
Labels
area/runtime, kind/bug, kind/experimental, status/more-info-needed, version/17.03

Comments

@kopnok

kopnok commented Mar 22, 2017

Description
The docker stop command hangs and the container is not stopped.
docker ps hangs as well.

Steps to reproduce the issue:

  1. sudo systemctl reload docker
  2. sudo systemctl reload-or-restart docker
  3. sudo systemctl restart docker
  4. docker stop container

Describe the results you received:
The command produces no output; it hangs.

Describe the results you expected:
Container is stopped and the command is successfully finished.

Output of docker version:

Docker version 17.03.0-ce, build 60ccb22

Output of docker info:

Containers: 19
 Running: 18
 Paused: 0
 Stopped: 1
Images: 18
Server Version: 17.03.0-ce
Storage Driver: devicemapper
 Pool Name: docker-253:0-134295373-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 4.546 GB
 Data Space Total: 107.4 GB
 Data Space Available: 26.09 GB
 Metadata Space Used: 11.89 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.136 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2015-10-14)
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 3.10.0-514.10.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 23.39 GiB
Name: Vulture-Manager
ID: XPLT:7GXH:4MZW:57AC:VBG6:N3RA:O5EO:ZXXY:DWDQ:GYTE:K6Y4:P32K
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Labels:
 containerslots=25
Experimental: false
Cluster Store: consul://10.42.44.96:8500
Cluster Advertise: 10.0.0.42:2375
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

daemon-data-2017-03-22T071457Z.txt
goroutine-stacks-2017-03-22T071457Z.txt
strace-docker-stop.txt

@cpuguy83
Member

Thanks, looks stuck here: https://github.com/docker/docker/blob/v17.03.0-ce/daemon/monitor.go#L42

goroutine 721 [semacquire, 1 minutes]:
sync.runtime_Semacquire(0xc42080004c)
	/usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*WaitGroup).Wait(0xc420800040)
	/usr/local/go/src/sync/waitgroup.go:131 +0x97
github.com/docker/docker/daemon.(*Daemon).StateChanged(0xc4203a8200, 0xc4215cc780, 0x40, 0xc4215e1658, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/monitor.go:42 +0x2c6
github.com/docker/docker/libcontainerd.(*container).handleEvent.func1()
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/libcontainerd/container_unix.go:217 +0x82
github.com/docker/docker/libcontainerd.(*queue).append.func1(0xc420b19d00, 0x0, 0xc42148ba40, 0xc4215b7080)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/libcontainerd/queue_unix.go:28 +0x30
created by github.com/docker/docker/libcontainerd.(*queue).append
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/libcontainerd/queue_unix.go:30 +0x170

@mlaventure It looks like there's some case where the container I/O streams are not closed.

I think it's blocked on this select:

goroutine 346 [select, 2 minutes]:
github.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read(0xc420a21440, 0xc420a9c000, 0x8000, 0x8000, 0xc42029ed78, 0x421a99, 0x16f9ec0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:138 +0x2a8
bufio.(*Reader).fill(0xc42023d860)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).WriteTo(0xc42023d860, 0x242a620, 0xc4200ed340, 0xc42029ee58, 0x424c01, 0xc42023d860)
	/usr/local/go/src/bufio/bufio.go:472 +0x273
io.copyBuffer(0x242a620, 0xc4200ed340, 0x24294a0, 0xc42023d860, 0x0, 0x0, 0x0, 0x0, 0xffffffffffffffff, 0xffffffffffffffff)
	/usr/local/go/src/io/io.go:380 +0x37b
io.Copy(0x242a620, 0xc4200ed340, 0x24294a0, 0xc42023d860, 0x1, 0xc42029ef80, 0x4184db)
	/usr/local/go/src/io/io.go:360 +0x68
github.com/docker/docker/pkg/pools.Copy(0x242a620, 0xc4200ed340, 0x7fc676a4e4f8, 0xc420a21440, 0xe, 0x5, 0x18eff99)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/pools/pools.go:60 +0x7b
github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1(0x242a620, 0xc4200ed340, 0x7fc676a4e4f8, 0xc420a21440, 0xc420800040)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/container/stream/streams.go:119 +0x55
created by github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/container/stream/streams.go:123 +0x84
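
The overall shape of the hang can be illustrated with a small standalone Go program (a sketch only; the function and variable names are made up, not the moby code): the exit handler waits on a WaitGroup that is only released when the stream-copy goroutines return, and those goroutines only return once the container's stdout/stderr streams are closed.

package main

import (
	"fmt"
	"io"
	"sync"
	"time"
)

// copyStream mirrors the role of stream.(*Config).CopyToPipe: one goroutine
// per stream, tracked by wg, copying until the source is closed.
func copyStream(wg *sync.WaitGroup, dst io.Writer, src io.Reader) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		io.Copy(dst, src) // blocks until src returns EOF or an error
	}()
}

func main() {
	var wg sync.WaitGroup

	// stdoutR stands in for the container's stdout fifo. If that fifo is
	// never closed after the live-restore, the copy goroutine never exits.
	stdoutR, _ := io.Pipe()
	copyStream(&wg, io.Discard, stdoutR)

	// This plays the role of daemon.(*Daemon).StateChanged (monitor.go:42):
	// the exit event cannot be processed until every copy goroutine is done.
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("streams drained; exit event processed")
	case <-time.After(2 * time.Second):
		// This branch fires: the exit event is stuck behind wg.Wait(), and
		// every later call that needs the container (docker stop, docker ps)
		// queues up behind it.
		fmt.Println("deadlock: still waiting on a stream copy that never finishes")
	}
}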

@kopnok
Author

kopnok commented Mar 23, 2017

The issue can be observed only if --live-restore is set on the docker daemon.
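
For anyone reproducing this, a minimal sketch of enabling live restore (assuming the daemon reads /etc/docker/daemon.json; the dockerd --live-restore flag is equivalent):

$ cat /etc/docker/daemon.json
{
    "live-restore": true
}
$ sudo systemctl reload docker   # SIGHUPs the daemon so it picks up the new setting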

@cpuguy83 added the kind/bug label Mar 31, 2017
@mlaventure
Contributor

@kopnok sorry for the long delay.

Could you provide the associated daemon logs (if you still have them)? If you can reproduce it with the daemon in debug mode, that would be a plus too.

Do you have an idea of what output the containers are generating?

Next time it occurs, can you also send a SIGUSR1 to docker-containerd? It will dump its stack trace into the docker daemon log.
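
For example (a rough sketch, assuming docker-containerd runs as its own process and the daemon logs go to journald, as in the docker info above):

$ sudo kill -USR1 "$(pidof docker-containerd)"
$ sudo journalctl -u docker.service | tail -n 200   # the goroutine dump ends up in the daemon log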

@kopnok
Author

kopnok commented Apr 11, 2017

Please find the attached file with the requested logs.
In this case the issue is with stopping an nginx container.
The nginx output is a standard REST request/response flow.
docker.log.txt

@mlaventure
Contributor

@tonistiigi it looks to me like during the restore some of the fifos got stuck on OpenFile. Would you have any idea why this happens? I can't think of a scenario for it at the moment, especially since there are at least 40 seconds between the container being restored and the stop signal being sent.
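
For reference, the blocking behavior itself is easy to show outside of Docker: on Linux, opening a FIFO read-only does not return until something opens the write end. A minimal standalone sketch (not moby code; paths and names are made up):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
	"time"
)

func main() {
	dir, err := os.MkdirTemp("", "fifo-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "stdout")
	if err := syscall.Mkfifo(path, 0o600); err != nil {
		panic(err)
	}

	done := make(chan struct{})
	go func() {
		// Blocks until a writer opens the other end. If the container
		// process never reopens its side after the daemon restart, this
		// open never completes.
		f, err := os.OpenFile(path, os.O_RDONLY, 0)
		fmt.Println("open returned:", f, err)
		close(done)
	}()

	select {
	case <-done:
	case <-time.After(2 * time.Second):
		fmt.Println("still stuck in OpenFile: no writer on the other end")
	}
}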

@pangzheng

pangzheng commented Apr 17, 2017

The same problem here:

  • docker run -d --name 2048 image/2048
  • service docker restart
  • docker stop 2048

strace docker stop 2048

...
openat(AT_FDCWD, "/root/.docker/config.json", O_RDONLY|O_CLOEXEC) = 4
read(4, "{\n\t\"auths\": {\n\t\t\"106.75.87.37\": "..., 512) = 472
close(4)                                = 0
futex(0xc42002ad10, FUTEX_WAKE, 1)      = 1
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
setsockopt(4, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(4, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, 23) = 0
epoll_create1(EPOLL_CLOEXEC)            = 5
epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=766271696, u64=140665240182992}}) = 0
getsockname(4, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(4, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc42002a910, FUTEX_WAKE, 1)      = 1
read(4, 0xc4203e6000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5, {}, 128, 0)               = 0
epoll_wait(5, {{EPOLLOUT, {u32=766271696, u64=140665240182992}}}, 128, -1) = 1
epoll_wait(5, {{EPOLLIN|EPOLLOUT, {u32=766271696, u64=140665240182992}}}, 128, -1) = 1
read(4, "HTTP/1.1 200 OK\r\nApi-Version: 1."..., 4096) = 200
futex(0xc4203e8110, FUTEX_WAKE, 1)      = 1
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fef2da31000
futex(0xc4203e8110, FUTEX_WAKE, 1)      = 1
read(4, 0xc4203e6000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
sched_yield()                           = 0
futex(0x11674e0, FUTEX_WAKE, 1)         = 0
futex(0xc4203e8510, FUTEX_WAKE, 1)      = 1
write(4, "POST /v1.27/containers/jolly_eli"..., 157) = 157
futex(0xc4203e8510, FUTEX_WAKE, 1)      = 1
futex(0x11624d0, FUTEX_WAIT, 0, NULL^[[B^CProcess 31636 detached
 <detached ...>

@mlaventure
Contributor

@pangzheng if you could provide the same information I requested earlier (i.e. daemon log, daemon and container stacktraces), it would be useful

@elianka

elianka commented Sep 20, 2017

@mlaventure @cpuguy83 any progress on this?
I encountered this problem also, and I found it only happens when running systemd in the container.
My docker version is 1.13.1, and my OS is CentOS 7.3, kernel 3.10.0-514.16.1.el7.x86_64.

The process to reproduce this issue:

  1. enable --live-restore for the docker daemon
  2. start a container A (whose init process is systemd)
  3. restart the docker service: service docker restart
  4. run docker stop A or docker rm A: it hangs, and every docker client command after that hangs too
  5. run "service docker restart" again: all docker client commands recover, and container A is stopped

@elianka

elianka commented Sep 27, 2017

@mlaventure
(screenshot)
I found the commit that causes the problem: commit 25810f3 reproduces it, while commit 37be2d4 is OK, but produces this error message after the docker service restart:

Sep 27 11:10:45 thinkcenter dockerd[28828]: time="2017-09-27T11:10:45.505958509+08:00" level=error msg="stream copy error: reading from a closed fifo\ngithub.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read\n\t/root/go/src/github.com/docker/docker/.gopath/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:480\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:380\nio.Copy\n\t/usr/local/go/src/io/io.go:360\ngithub.com/docker/docker/pkg/pools.Copy\n\t/root/go/src/github.com/docker/docker/.gopath/src/github.com/docker/docker/pkg/pools/pools.go:60\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/root/go/src/github.com/docker/docker/.gopath/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2197"
Sep 27 11:10:57 thinkcenter dockerd[28828]: time="2017-09-27T11:10:57.426203322+08:00" level=debug msg="Calling GET /_ping"
Sep 27 11:10:57 thinkcenter dockerd[28828]: time="2017-09-27T11:10:57.426529709+08:00" level=debug msg="Calling GET /v1.25/containers/json"

@mlaventure
Contributor

@elianka I vaguely remember this causing an issue, but I think we fixed it in a later patch release. Try updating to 17.03.2 (despite the name change, this is really just 1.13.4)

@elianka

elianka commented Sep 28, 2017

@mlaventure
Unfortunately, it still exists in v17.03.2-ce and even in 17.05.0-ce.
In the screenshot below you can see that 'docker stop' is hanging; the right part shows the dockerd logs.
(screenshot)

@CodeJuan

Hi @elianka, I think I have the same issue. Are there any updates?
Thank you.

@mrousselet

Same issue for me. I've updated my daemon from 1.06 to the latest one, and all my docker commands are stuck. Any ideas?

@maheshmadpathi

I have the same issue.

docker --version
Docker version 17.03.2-ce, build f5ec1e2

@cpuguy83
Member

Docker >= 17.12 is totally different with the introduction of containerd 1.0. Debugging against an older, unsupported version of Docker is difficult, and it can't really be fixed here.

That said, we do have timeouts on fifo open to fix a deadlock that was primarily caused by having lots of healthchecks after updating to containerd 1.0.
I do not think this is an issue at this point on a modern version of Docker. 17.03 is over a year old and is beyond the support window for Docker CE releases.
Are you able to test with Docker 18.03 or 18.06?
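
For illustration, the timeout mentioned above follows the usual Go pattern of doing the potentially blocking open in a goroutine and giving up after a deadline rather than wedging the event loop. A rough sketch under that assumption (hypothetical helper name and path, not the actual moby/containerd code):

package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// openFifoWithTimeout abandons the open if it does not finish before the
// context deadline, so a missing writer can no longer wedge the caller.
func openFifoWithTimeout(ctx context.Context, path string) (*os.File, error) {
	type result struct {
		f   *os.File
		err error
	}
	ch := make(chan result, 1)
	go func() {
		f, err := os.OpenFile(path, os.O_RDONLY, 0) // may block until a writer appears
		ch <- result{f, err}
	}()
	select {
	case r := <-ch:
		return r.f, r.err
	case <-ctx.Done():
		return nil, ctx.Err() // report an error instead of hanging forever
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	// "/tmp/some-fifo" is just a placeholder path for the sketch.
	if _, err := openFifoWithTimeout(ctx, "/tmp/some-fifo"); err != nil {
		fmt.Println("fifo open failed:", err)
	}
}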

@leopoldxx

leopoldxx commented Oct 28, 2018

I encountered the same issue, and found the root cause to be the stderr/stdout redirects.

  1. run a container (I simplified the entrypoint.sh):

# cat entrypoint.sh
#! /bin/sh
exec /test &> /tmp/log

  2. restart dockerd (with live-restore enabled)
  3. stop the container: it hangs
  4. docker ps also hangs

dockerd hangs at:
https://github.com/moby/moby/blob/v17.03.2-ce/daemon/monitor.go#L42
https://github.com/moby/moby/blob/v17.03.2-ce/container/stream/streams.go#L119

If I remove the stdout/stderr redirects, everything works fine:

#! /bin/sh
exec /test

@godliness

godliness commented Jun 17, 2020

Are there any updates on this issue?

@sam-thibault
Contributor

I don't see any recent activity on this issue. I will close it as stale.

@sam-thibault closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 27, 2023