Docker ps hang #28889

Closed
dmyerscough opened this Issue Nov 28, 2016 · 45 comments

dmyerscough commented Nov 28, 2016

Description

The Docker daemon becomes unresponsive and docker ps hangs; the containers themselves continue to run and function fine. A stack trace of the Docker daemon:

https://gist.github.com/dmyerscough/ced7616a5e8072315e7ea82ef797414c

Steps to reproduce the issue:

  1. The Docker daemon runs for 11 hours, then becomes unresponsive

Describe the results you received:

Docker daemon becomes unresponsive.

Describe the results you expected:

Docker daemon shouldn't become unresponsive.

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:02:53 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:02:53 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 19
 Running: 19
 Paused: 0
 Stopped: 0
Images: 78
Server Version: 1.12.1
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null overlay host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.32
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 24
Total Memory: 125.8 GiB
Name: appdocker242-dca1
ID: 36EB:Y3SF:VHJR:3NIB:IXSX:PDVE:IJ7J:W4KQ:HUOH:V3EQ:FRTX:L6EM
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 10.0.0.0/8
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

Physical

mlaventure (Contributor) commented Nov 28, 2016

This should be fixed in 1.12.3 I think.

Could you update and see if it fixes it for you?

dmyerscough (Author) commented Nov 28, 2016

@mlaventure I can upgrade one or two hosts. However, we have a 2000+ node cluster; would you be able to confirm this is fixed in 1.12.3 and whether it's a known issue?

mlaventure (Contributor) commented Nov 28, 2016

@dmyerscough

I had a look; unfortunately, it seems the fix for this race is only in master at the moment 😞

There are a few issues with execs that were fixed in 1.12.3 though, so if you make use of execs and can afford to upgrade, you may still want to do so

hjacobs commented Nov 29, 2016

@mlaventure we also have the problem (we tried with Docker 1.11, 1.12.1 and 1.12.3) that docker ps hangs. Fun fact: we can "always" resolve this by doing strace -p {dockerd-pid} -f.

Do you have a link to the fix/patch in master?

mlaventure (Contributor) commented Nov 29, 2016

@hjacobs

Those two PRs should do the trick:

hjacobs commented Nov 29, 2016

@mlaventure thanks, we will try the v1.13.0-rc2 release which should contain the mentioned fixes.

dmyerscough (Author) commented Nov 29, 2016

I just tried strace -p {docker-pid} -f and it unblocked the process, but I'm not sure how or why.

hjacobs commented Dec 5, 2016

@mlaventure @dmyerscough Docker 1.13 RC fixed the issue for us! 😄 Now we just have to wait for 1.13 release and CoreOS inclusion...

cpuguy83 (Contributor) commented Dec 5, 2016

@hjacobs I believe we are pulling these stdio fixes into the 1.12 branch as well.

hjacobs commented Dec 5, 2016

@cpuguy83 cool 👍

dmyerscough (Author) commented Dec 5, 2016

@cpuguy83 do you know when the 1.12 branch will include these fixes?

cpuguy83 (Contributor) commented Dec 5, 2016

@dmyerscough We haven't set a release date, but it's merged in the branch at least (mostly #29095 (merged) and #29141).

cpuguy83 (Contributor) commented Dec 14, 2016

1.12.4 is released and includes these fixes.

Thanks!

cpuguy83 closed this Dec 14, 2016

mcluseau commented Feb 10, 2017

Hi,
I consistently have this issue in my cluster (CoreOS 1235.9.0, Docker 1.12.6), but strace doesn't seem to help. It happens with both the btrfs and devicemapper storage drivers, on 6 nodes. dockerd shows high CPU usage when it happens, and it starts replying again once CPU usage returns to normal (<1%).

Should I open another issue?

top output:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                  
 2014 root      20   0 4668452 103732  30812 S 126.2  0.1   9:03.70 dockerd                  

strace output:

# strace -p 2014   
Process 2014 attached
futex(0x2b06a08, FUTEX_WAIT, 0, NULL^CProcess 2014 detached
 <detached ...>

docker info:

# docker info
Containers: 44
 Running: 20
 Paused: 0
 Stopped: 24
Images: 136
Server Version: 1.12.6
Storage Driver: btrfs
 Build Version: Btrfs v4.4.1
 Library Version: 101
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge overlay null host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.7.3-coreos-r2
Operating System: Container Linux by CoreOS 1235.9.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 125.9 GiB
Name: kube-6
ID: KUKE:NYSF:KEGL:EMV4:AHNU:PPQJ:3HYF:DYJM:I4GS:N454:7FHX:R4CW
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

Thanks!

mlaventure (Contributor) commented Feb 10, 2017

@MikaelCluseau send a SIGUSR1 to the daemon; it will dump its stack trace

mcluseau commented Feb 10, 2017

@mlaventure thanks!

We observed this on pulls (for instance, pulls staying in the waiting state for long periods). This is when the stack was taken (during a docker pull docker:1.12.6-dind).

The stacks are big, here's the gist: https://gist.github.com/MikaelCluseau/062c3ce10b6083041dbc9146be5df5a4

mlaventure (Contributor) commented Feb 13, 2017

@MikaelCluseau I couldn't see anything unusual with that stack trace. Pull may take a while if the network is not stable or fast.

Were other commands slow/stuck when executed in parallel?

mcluseau commented Feb 13, 2017

@mlaventure I think I found what happened, but I wanted to let the system show stability for at least a night before reporting. We had a pretty low memory limit of 100MiB on our Docker Hub mirror registry (image registry:2.5.1), and it had been OOM killed at least once. I raised the limit to 1GiB, and since then I haven't had any problems with timeouts when speaking to the Docker daemon.

If I'm on the right track, I should have a reliable reproduction path in a few minutes.

mcluseau commented Feb 13, 2017

Well... for now the registry doesn't get OOM killed; I suppose I need the same storage backend (Swift) to reproduce. FWIW, here's the current draft:

# Pull-through registry mirror with a deliberately low memory limit, to try to trigger the OOM kill
docker run -d --name mirror --memory 20M -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io registry:2.5.1
# Docker-in-Docker instance that pulls through the mirror
docker run -d --name dind --privileged --link mirror:mirror -p 2375:2375 docker:1.12.6-dind --registry-mirror=http://mirror:5000

mcluseau commented Feb 15, 2017

@mlaventure Got docker ps locked up again; or, more exactly, it took a long time (many minutes) to return.

I think the relevant part of the stack dump is:

goroutine 36695 [semacquire, 1 minutes]:
sync.runtime_Semacquire(0xc82180dc04)
	/usr/lib/go1.6/src/runtime/sema.go:47 +0x26
sync.(*Mutex).Lock(0xc82180dc00)
	/usr/lib/go1.6/src/sync/mutex.go:83 +0x1c4
github.com/docker/docker/daemon.(*Daemon).reducePsContainer(0xc820385a00, 0xc82403c5a0, 0xc8248fd080, 0xc8247a73d0, 0x0, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.12.6-r1/work/docker-1.12.6/.gopath/src/github.com/docker/docker/daemon/list.go:204 +0x49
github.com/docker/docker/daemon.(*Daemon).reduceContainers(0xc820385a00, 0xc82473f580, 0xc8247a73d0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.12.6-r1/work/docker-1.12.6/.gopath/src/github.com/docker/docker/daemon/list.go:187 +0x1a5
github.com/docker/docker/daemon.(*Daemon).Containers(0xc820385a00, 0xc82473f580, 0x0, 0x0, 0x0, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.12.6-r1/work/docker-1.12.6/.gopath/src/github.com/docker/docker/daemon/list.go:101 +0x76
github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersJSON(0xc82060af40, 0x7f71c61dfd80, 0xc8232017a0, 0x7f71c61efcd0, 0xc821ece9c0, 0xc820f1b500, 0xc823201710, 0x0, 0x0)
[...]

The full dump is here: https://gist.github.com/MikaelCluseau/b2518769f693e667f9fc9575c835c54b

mcluseau commented Feb 15, 2017

Same on another host:

goroutine 76630 [semacquire, 7 minutes]:
sync.runtime_Semacquire(0xc824d8d504)
	/usr/lib/go1.6/src/runtime/sema.go:47 +0x26
sync.(*Mutex).Lock(0xc824d8d500)
	/usr/lib/go1.6/src/sync/mutex.go:83 +0x1c4
github.com/docker/docker/daemon.(*Daemon).reducePsContainer(0xc8204a6000, 0xc821e76d20, 0xc824a64600, 0xc8259093e0, 0x0, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.12.6-r1/work/docker-1.12.6/.gopath/src/github.com/docker/docker/daemon/list.go:204 +0x49
github.com/docker/docker/daemon.(*Daemon).reduceContainers(0xc8204a6000, 0xc825fb0540, 0xc8259093e0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.12.6-r1/work/docker-1.12.6/.gopath/src/github.com/docker/docker/daemon/list.go:187 +0x1a5
github.com/docker/docker/daemon.(*Daemon).Containers(0xc8204a6000, 0xc825fb0540, 0x0, 0x0, 0x0, 0x0, 0x0)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.12.6-r1/work/docker-1.12.6/.gopath/src/github.com/docker/docker/daemon/list.go:101 +0x76

gist -> https://gist.github.com/MikaelCluseau/8b89528a75c189da9c2575e6172eb080

cpuguy83 (Contributor) commented Feb 15, 2017

@MikaelCluseau Yours seems to be stuck traversing a directory structure while applying an SELinux label:

syscall.Syscall(0xd9, 0x16e, 0xc8233f4000, 0x1000, 0x30, 0x1000, 0x0)
	/usr/lib/go1.6/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Getdents(0x16e, 0xc8233f4000, 0x1000, 0x1000, 0x64, 0x0, 0x0)
	/usr/lib/go1.6/src/syscall/zsyscall_linux_amd64.go:508 +0x5f
syscall.ReadDirent(0x16e, 0xc8233f4000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/lib/go1.6/src/syscall/syscall_linux.go:770 +0x4d
os.(*File).readdirnames(0xc8249ba5f0, 0xffffffffffffffff, 0xc824530000, 0x0, 0x64, 0x0, 0x0)
	/usr/lib/go1.6/src/os/dir_unix.go:39 +0x215
os.(*File).Readdirnames(0xc8249ba5f0, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/lib/go1.6/src/os/doc.go:138 +0x85
path/filepath.readDirNames(0xc8224f80f0, 0xe8, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:406 +0xc0
path/filepath.walk(0xc8224f80f0, 0xe8, 0x7f0da07362b0, 0xc822cac0d0, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:361 +0x1a2
path/filepath.walk(0xc823120000, 0xdd, 0x7f0da07362b0, 0xc821a040d0, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:374 +0x4fc
path/filepath.walk(0xc823fa97a0, 0xd7, 0x7f0da07362b0, 0xc8243f15f0, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:374 +0x4fc
path/filepath.walk(0xc824cdfc00, 0xd1, 0x7f0da07362b0, 0xc823729c70, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:374 +0x4fc
path/filepath.walk(0xc823729860, 0xc9, 0x7f0da07362b0, 0xc823729930, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:374 +0x4fc
path/filepath.walk(0xc8222f3930, 0xc3, 0x7f0da07362b0, 0xc8222f3a00, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:374 +0x4fc
path/filepath.walk(0xc8240a4d80, 0xbd, 0x7f0da07362b0, 0xc822f44c30, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:374 +0x4fc
path/filepath.walk(0xc822c32900, 0xb9, 0x7f0da07362b0, 0xc8232a61a0, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:374 +0x4fc
path/filepath.walk(0xc820c93800, 0x78, 0x7f0da07362b0, 0xc8223d1ba0, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:374 +0x4fc
path/filepath.walk(0xc820c93580, 0x74, 0x7f0da07362b0, 0xc8223d1ad0, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:374 +0x4fc
path/filepath.walk(0xc8234c1170, 0x70, 0x7f0da07362b0, 0xc825daaa90, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:374 +0x4fc
path/filepath.Walk(0xc8234c1170, 0x70, 0xc823cbf840, 0x0, 0x0)
	/usr/lib/go1.6/src/path/filepath/path.go:396 +0xe1
github.com/opencontainers/runc/libcontainer/selinux.Chcon(0xc8234c1170, 0x70, 0xc8253a4f30, 0x2f, 0x2b2b201, 0x0, 0x0)

Sounds more like an issue on the system?
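
The call at the bottom of that trace is a recursive relabel: it walks the whole tree and sets the SELinux label on every entry, which is why a very large or slow directory structure keeps the daemon busy here while other operations wait. A rough sketch of that pattern (an illustration only, not the actual runc/libcontainer code; the function name, path, and label below are hypothetical):

package main

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// chconLike recursively applies an SELinux label to every entry under root,
// the same shape of work as the filepath.Walk seen in the stack trace above.
func chconLike(root, label string) error {
	return filepath.Walk(root, func(path string, _ os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		// One lsetxattr syscall per entry; on a huge or slow tree this walk
		// can take many minutes, and anything waiting on the same container
		// lock (such as docker ps) waits with it.
		return unix.Lsetxattr(path, "security.selinux", []byte(label), 0)
	})
}

func main() {
	// Hypothetical usage; requires SELinux and sufficient privileges.
	if err := chconLike("/var/lib/docker/volumes/example", "system_u:object_r:svirt_sandbox_file_t:s0"); err != nil {
		os.Exit(1)
	}
}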

mcluseau commented Feb 16, 2017

@cpuguy83 Thanks for catching this. It's a PXE-booted CoreOS on a local SSD.

I should probably file this part of the issue with CoreOS, but shouldn't we consider that there's something to do in Docker too? I mean, "semacquire, 7 minutes" in the stack is probably a sign that it's worth trying to use a finer-grained lock.

cpuguy83 (Contributor) commented Feb 16, 2017

The locking issue is well known; there are a few issues and proposals about it. Unfortunately it's not a simple change.
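
To make the trade-off concrete, here is a minimal sketch (not Docker's actual daemon code; the Container type and fields are hypothetical): a list path that takes each container's mutex blocks for as long as any slow operation holds that same lock, whereas copying the needed fields under the lock and releasing it before the slow work keeps listing responsive:

package main

import (
	"sync"
	"time"
)

// Container is a hypothetical stand-in for the daemon's container object.
type Container struct {
	mu    sync.Mutex
	Name  string
	State string
}

// listCoarse mimics a ps-style listing that takes each container's mutex;
// it blocks whenever another operation holds the same lock.
func listCoarse(cs []*Container) []string {
	var out []string
	for _, c := range cs {
		c.mu.Lock()
		out = append(out, c.Name+" "+c.State)
		c.mu.Unlock()
	}
	return out
}

// slowOperation shows the finer-grained alternative: snapshot the fields it
// needs under the lock, release it, then do the slow work outside the
// critical section so listings are not blocked.
func slowOperation(c *Container, work func(name string)) {
	c.mu.Lock()
	name := c.Name
	c.mu.Unlock()
	work(name)
}

func main() {
	c := &Container{Name: "web", State: "running"}
	go slowOperation(c, func(string) { time.Sleep(100 * time.Millisecond) })
	_ = listCoarse([]*Container{c}) // returns promptly with the snapshot style
	time.Sleep(200 * time.Millisecond)
}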

cafuego commented Feb 23, 2017

Seven minutes doesn't look that bad compared to what mine just spat out. The max is 2950 minutes!

I'm on CentOS 7, docker 1.13.1. The system was apparently operating fine until I tried to stop a set of three containers using docker-compose. The first two containers stopped OK, the final one did not.

$ docker-compose stop
Stopping letsencrypt-nginx-proxy-generator ... done
Stopping letsencrypt-nginx-proxy-companion ... done
Stopping letsencrypt-nginx-proxy ... 

ERROR: for letsencrypt-nginx-proxy  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=70)
Traceback (most recent call last):
  File "<string>", line 3, in <module>
  File "compose/cli/main.py", line 61, in main
  File "compose/cli/main.py", line 113, in perform_command
  File "contextlib.py", line 35, in __exit__
  File "compose/cli/errors.py", line 56, in handle_connection_errors
TypeError: log_timeout_error() takes exactly 1 argument (0 given)
docker-compose returned -1

Corresponding with that, I have an entry in the system log:

Feb 22 21:01:50 568elmp03 dockerd: time="2017-02-22T21:01:50.434258974-05:00" level=info msg="Container 4463a91d7f2263269786427e51aa3a0e4fe9325fe2fec04bf551f66ccd598b16 failed to exit within 10 seconds of signal 15 - using the force"

I don't see the container running anymore when I check for a containerd process, but the json blob under /var/lib/docker/containers/4463a91d7f2263269786427e51aa3a0e4fe9325fe2fec04bf551f66ccd598b16 still thinks the container is in state Running.

Any docker commands (apart from info and version) just sit there and don't do anything.

The strace trick didn't have any effect, but I did get the Docker daemon to do a stack trace dump. https://gist.github.com/cafuego/8e1ebc0a0b2580c6588fe1ee3f5fdb25 (2.5MB)

Restarting the Docker daemon allows me to control containers again (but only for a while; the last Docker restart was about a week ago).

tonistiigi (Member) commented Feb 23, 2017

@mlaventure Seems to be blocked in

goroutine 3201617 [semacquire, 17 minutes]:
sync.runtime_Semacquire(0xc4215af98c)
	/usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*WaitGroup).Wait(0xc4215af980)
	/usr/local/go/src/sync/waitgroup.go:131 +0x97
github.com/docker/docker/daemon.(*Daemon).StateChanged(0xc4201abc00, 0xc4252981c0, 0x40, 0xc4263595c8, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/monitor.go:42 +0x2c6
github.com/docker/docker/libcontainerd.(*container).handleEvent.func1()
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/libcontainerd/container_unix.go:217 +0x82
github.com/docker/docker/libcontainerd.(*queue).append.func1(0xc428a67d00, 0x0, 0xc4295c5810, 0xc425bc0660)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/libcontainerd/queue_unix.go:28 +0x30
created by github.com/docker/docker/libcontainerd.(*queue).append
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/libcontainerd/queue_unix.go:30 +0x170

mlaventure (Contributor) commented Feb 23, 2017

@tonistiigi correct, thanks.

It means that one of the CopyToPipe calls is stuck waiting for the fifo to finish after a container exit event.
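
As an illustration of that pattern (a sketch only, not the actual libcontainerd code; the names are hypothetical and an os.Pipe stands in for the container fifo): the exit handler waits on a WaitGroup tracking the stdio copy goroutines, and if the write side of the fifo is never closed, io.Copy never sees EOF and the wait blocks indefinitely:

package main

import (
	"fmt"
	"io"
	"os"
	"sync"
	"time"
)

func main() {
	// Stand-in for the container's stdout fifo.
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		io.Copy(io.Discard, r) // blocks until the write end is closed (EOF)
	}()

	// Exit-event handler: waits for the stdio copies to finish before
	// updating container state, like StateChanged in the trace above.
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("state updated")
	case <-time.After(500 * time.Millisecond):
		// If w is never closed we end up here: the handler is stuck, and
		// anything waiting on the same container lock hangs with it.
		fmt.Println("stuck waiting for the fifo")
	}
	w.Close() // closing the write end lets io.Copy return
}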

@cafuego any chance you can send a USR1 to containerd to get its stack trace? Also, a bit of the daemon log would help to figure out which container is stuck (hoping your daemon is in debug mode); knowing what command it was running may be helpful too.

cafuego commented Feb 23, 2017

@mlaventure Sorry, this is a production system (not in debug mode) and I've had to bounce docker fairly quickly after this happened... so the processes have all restarted and I can't get any more info out until it happens again :-/

If it does happen again, I'll try and get a stack trace out of containerd as well.

Hexta commented Feb 24, 2017

cpuguy83 (Contributor) commented Feb 24, 2017

@Hexta I'm not sure yours is the same issue, as it seems to be stuck trying to acquire a lock on the logger.
Can you open a new issue and provide full details of the system? Thanks!

tonistiigi (Member) commented Feb 24, 2017

@Hexta Looking at your trace, your issue seems to be the same as #30223

Hexta commented Feb 25, 2017

@tonistiigi I see, thanks.

cafuego commented Mar 3, 2017

I strongly suspect that #31487 is a duplicate of what I'm experiencing.

So, it looks like my original issue was due to corruption on the underlying XFS device that was mapped into the container; that container then failed to quit, which caused the docker commands to stop working.

The same thing just happened again, with a different container, but possibly with corruption on the underlying filesystem (hitting the reset button isn't a great idea, who knew?)

$ docker-compose stop mysql
Stopping staging-mysql ... 

ERROR: for staging-mysql  HTTPConnectionPool(host='localhost', port=2375): Read timed out. (read timeout=70)
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

I then went to check for a containerd instance to see if I could get a trace, but it looks like it had in fact exited.

The container directories still exist under /var/lib/docker, but since docker commands won't run, I can't delete the container or restart it.

mlaventure (Contributor) commented Mar 3, 2017

@cafuego the fact that containerd is not alive will indeed prevent dockerd from processing any commands.

In that sense, it's not a duplicate of #31487.

Is there any trace in the daemon logs regarding a panic in containerd?

keyingliu commented Jul 27, 2017

cpuguy83 (Contributor) commented Jul 27, 2017

keyingliu commented Jul 28, 2017

@cpuguy83 thanks for looking into it. The kernel log seems fine; I only see many OOM kill messages for containers. I will keep looking into it and report back here. Thanks.

keyingliu commented Aug 3, 2017

@cpuguy83 About the Docker hang: I found that the dockerd process was still there, but docker-containerd was gone. After I manually ran docker-containerd without restarting dockerd, dockerd seemed to become active again and immediately forked a new docker-containerd (which meant there were two docker-containerd processes). I killed the one I had started manually, and after a while the hang was gone.

jacknlliu commented Aug 8, 2017

I met the same issue.

[pid  6001] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=20000}, NULL <unfinished ...>
[pid 12169] futex(0xc420e9f110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 12408] <... epoll_wait resumed> [{EPOLLOUT, {u32=2236816896, u64=140447667396096}}], 128, -1) = 1
[pid 12408] epoll_wait(4,  <unfinished ...>
[pid  6001] <... pselect6 resumed> )    = 0 (Timeout)
[pid  6001] futex(0x26785f8, FUTEX_WAIT, 0, {tv_sec=60, tv_nsec=0} <unfinished ...>
[pid 12408] <... epoll_wait resumed> [{EPOLLIN|EPOLLOUT, {u32=2236816896, u64=140447667396096}}], 128, -1) = 1
[pid 12408] futex(0x26785f8, FUTEX_WAKE, 1) = 1
[pid  6001] <... futex resumed> )       = 0
[pid 12408] read(6,  <unfinished ...>
[pid  6001] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=20000}, NULL <unfinished ...>
[pid 12408] <... read resumed> "\0\0\2\1\4\0\0$\305\210\300\0\0\7\0\0\0\0$\305\0\0\0\0\2\10\1\0\0\2\1\5"..., 32768) = 38
[pid 12408] futex(0xc420e9f110, FUTEX_WAKE, 1) = 1
[pid 12169] <... futex resumed> )       = 0
[pid  6001] <... pselect6 resumed> )    = 0 (Timeout)
[pid 12169] epoll_wait(4,  <unfinished ...>
[pid 12408] read(6,  <unfinished ...>
[pid 12169] <... epoll_wait resumed> [{EPOLLOUT, {u32=2236816896, u64=140447667396096}}], 128, 0) = 1
[pid 12408] <... read resumed> 0xc4200fc000, 32768) = -1 EAGAIN (Resource temporarily unavailable)
[pid 12169] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
[pid  6001] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=20000}, NULL <unfinished ...>
[pid 12408] epoll_wait(4, [], 128, 0)   = 0
[pid 12169] <... pselect6 resumed> )    = 0 (Timeout)
[pid 12408] epoll_wait(4,  <unfinished ...>
[pid 12169] futex(0xc420e9f110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  6001] <... pselect6 resumed> )    = 0 (Timeout)
[pid  6001] futex(0x26785f8, FUTEX_WAIT, 0, {tv_sec=60, tv_nsec=0} <unfinished ...>
[pid  6006] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  6006] futex(0x26785f8, FUTEX_WAKE, 1) = 1
[pid  6001] <... futex resumed> )       = 0
[pid  6006] futex(0xc420e9f110, FUTEX_WAKE, 1 <unfinished ...>
[pid  6001] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=20000}, NULL <unfinished ...>
[pid 12169] <... futex resumed> )       = 0
[pid  6006] <... futex resumed> )       = 1
[pid 12169] futex(0xc420f73910, FUTEX_WAKE, 1 <unfinished ...>
[pid  6006] write(6, "\0\0\17\1\4\0\0$\307\203\206\310\315\314\313\312~\206\23\357\276\370Z\267\0\0\5\0\1\0\0$"..., 38 <unfinished ...>
[pid 12169] <... futex resumed> )       = 1
[pid  7594] <... futex resumed> )       = 0
[pid 12169] futex(0x2677cf8, FUTEX_WAIT, 0, {tv_sec=0, tv_nsec=248591041} <unfinished ...>
[pid  7594] futex(0xc420f73910, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  6006] <... write resumed> )       = 38
[pid  6001] <... pselect6 resumed> )    = 0 (Timeout)
[pid 12408] <... epoll_wait resumed> [{EPOLLOUT, {u32=2236816896, u64=140447667396096}}], 128, -1) = 1
[pid  6006] futex(0xc42043b110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 12408] epoll_wait(4,  <unfinished ...>
[pid  6001] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=20000}, NULL) = 0 (Timeout)
[pid 12408] <... epoll_wait resumed> [{EPOLLIN|EPOLLOUT, {u32=2236816896, u64=140447667396096}}], 128, -1) = 1
[pid  6001] futex(0x26785f8, FUTEX_WAIT, 0, {tv_sec=60, tv_nsec=0} <unfinished ...>
[pid 12408] futex(0x26785f8, FUTEX_WAKE, 1 <unfinished ...>
[pid  6001] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 12408] <... futex resumed> )       = 0
[pid  6001] sched_yield( <unfinished ...>
[pid 12408] read(6,  <unfinished ...>
[pid  6001] <... sched_yield resumed> ) = 0
[pid 12408] <... read resumed> "\0\0\2\1\4\0\0$\307\210\300\0\0\7\0\0\0\0$\307\0\0\0\0\2\10\1\0\0\2\1\5"..., 32768) = 38
[pid  6001] futex(0x2678530, FUTEX_WAKE, 1) = 0
[pid 12408] futex(0xc42043b110, FUTEX_WAKE, 1 <unfinished ...>
[pid  6001] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=20000}, NULL <unfinished ...>
[pid 12408] <... futex resumed> )       = 1
[pid  6006] <... futex resumed> )       = 0
[pid  6006] epoll_wait(4,  <unfinished ...>
[pid 12408] read(6,  <unfinished ...>
[pid  6006] <... epoll_wait resumed> [{EPOLLOUT, {u32=2236816896, u64=140447667396096}}], 128, 0) = 1
[pid 12408] <... read resumed> 0xc4200fc000, 32768) = -1 EAGAIN (Resource temporarily unavailable)
[pid  6006] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
[pid  6001] <... pselect6 resumed> )    = 0 (Timeout)
[pid  6001] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=20000}, NULL <unfinished ...>
[pid 12408] epoll_wait(4, [], 128, 0)   = 0
[pid 12408] epoll_wait(4,  <unfinished ...>
[pid  6006] <... pselect6 resumed> )    = 0 (Timeout)
[pid  6001] <... pselect6 resumed> )    = 0 (Timeout)
[pid  6006] futex(0xc42043b110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  6001] futex(0x26785f8, FUTEX_WAIT, 0, {tv_sec=60, tv_nsec=0} <unfinished ...>
[pid 12169] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 12169] futex(0x26785f8, FUTEX_WAKE, 1) = 1
[pid  6001] <... futex resumed> )       = 0
[pid 12169] futex(0xc42043b110, FUTEX_WAKE, 1 <unfinished ...>
[pid  6001] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=20000}, NULL <unfinished ...>
[pid 12169] <... futex resumed> )       = 1
[pid  6006] <... futex resumed> )       = 0
[pid 12169] futex(0x2677cf8, FUTEX_WAIT, 0, {tv_sec=0, tv_nsec=249146868} <unfinished ...>
[pid  6006] futex(0xc42043b110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  6001] <... pselect6 resumed> )    = 0 (Timeout)
[pid  6001] futex(0x26785f8, FUTEX_WAIT, 0, {tv_sec=60, tv_nsec=0}^Cstrace: Process 6000 detached

cpuguy83 (Contributor) commented Aug 8, 2017

@jacknlliu unfortunately an strace of the client is not helpful.
Also there can be many things that cause such a hang. Can you open a new issue and fill in the relevant details in the issue template?

If you could also get a stack dump from dockerd, that would help; we'll need it to figure out where it's stuck. To do this, send SIGUSR1 to the dockerd process (kill -SIGUSR1 $(pidof dockerd)). From there, depending on the Docker version, you'll either have the stack dump in the Docker logs or a reference in the logs to a file containing the dump.

Thanks!

cpuguy83 (Contributor) commented Aug 8, 2017

@keyingliu The OOM message is interesting, so the system could just be running out of memory. Unfortunately, containerd itself can also get OOM killed, and there happens to be a bug (fixed in the most recent Docker versions) where dockerd does not react to a containerd OOM correctly. If this isn't fixed in 17.06.0, it's definitely fixed in 17.06.1 (currently in RC). There are actually two fixes: the containerd OOM score is adjusted so that it is less likely to be killed by the kernel in OOM scenarios, and dockerd's containerd supervisor is fixed to properly handle a containerd OOM kill.
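
For context, the OOM-score adjustment mentioned above boils down to writing a more negative value to /proc/<pid>/oom_score_adj for the containerd process. A minimal sketch of the mechanism (not dockerd's actual supervisor code; the function name and value are hypothetical):

package main

import (
	"fmt"
	"os"
	"strconv"
)

// setOOMScoreAdj writes to /proc/<pid>/oom_score_adj (range -1000..1000);
// lower values make the kernel's OOM killer less likely to pick the process.
// Writing a negative value normally requires root privileges.
func setOOMScoreAdj(pid, score int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(strconv.Itoa(score)), 0644)
}

func main() {
	// Hypothetical example: make the current process less attractive to the OOM killer.
	if err := setOOMScoreAdj(os.Getpid(), -500); err != nil {
		fmt.Fprintln(os.Stderr, "oom_score_adj:", err)
	}
}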

tj13 commented Nov 24, 2017

I got the same issue, and the workaround 'strace -p {dockerd-pid} -f' failed.
The command 'kill -SIGUSR1 $(pidof dockerd)' also failed to generate the stack trace dump.
OS version: CentOS Linux release 7.3.1611 (Core)
Docker version: docker-1.12.6-61.git85d7426.el7.centos.x86_64

gservat commented Jan 2, 2018

@cpuguy83 we've also seen "docker ps" hang issues but only in GCP (we run the very same kernel/docker/distro combination in AWS without any problems).

Stack trace: https://gist.github.com/gservat/ba78f2d110b1759beabf65e2e422f16b

Distro: Ubuntu 16.04.3
Kernel: 4.4.0-81-generic
Docker: 1.11.2

Any ideas?

cpuguy83 (Contributor) commented Jan 2, 2018

@gservat @tj13 These versions of Docker are way past their support period. There are numerous fixes for the docker ps hang symptom. The most recent "stable" release is 17.12 (hot off the presses); alternatively, 17.09 is still supported for another month.

chestack commented Mar 2, 2018

docker version: 17.03.2-ce

stack trace: https://gist.github.com/chestack/02e44f948fafc74622745d9b67db6273

The timestamp of the latest Docker log entry is 07:18:18, and I dumped the stack trace at 11:18, so 4 hours had passed. So maybe the goroutines stuck for around 240 minutes are the root cause.

@cpuguy83, what's the recommended Docker version? Thanks.
