Docker ps hang #28889
Comments
GordonTheTurtle added the version/1.12 label Nov 28, 2016
mlaventure added the area/runtime label Nov 28, 2016
This should be fixed in 1.12.3, I think. Could you update and see if it fixes it for you?
@mlaventure I can upgrade one or two hosts. However, we have a 2000+ host cluster; would you be able to confirm this is fixed in 1.12.3 and whether it's a known issue?
I had a look; unfortunately, it seems the fix for this race is only in master at the moment. There are a few issues with …
hjacobs commented Nov 29, 2016
@mlaventure we also have the problem (we tried with Docker 1.11, 1.12.1 and 1.12.3) that … Do you have a link to the fix/patch in master?
hjacobs commented Nov 29, 2016
@mlaventure thanks, we will try the v1.13.0-rc2 release which should contain the mentioned fixes.
I just tried …
hjacobs commented Dec 5, 2016
@mlaventure @dmyerscough Docker 1.13 RC fixed the issue for us!
@hjacobs I believe we are pulling these stdio fixes into the 1.12 branch as well.
hjacobs commented Dec 5, 2016
@cpuguy83 cool
@cpuguy83 do you know when the 1.12 branch will include these fixes?
@dmyerscough We haven't set a release date, but it's merged in the branch at least (mostly #29095 (merged) and #29141).
1.12.4 is released and includes these fixes. Thanks!
cpuguy83 closed this Dec 14, 2016
mcluseau commented Feb 10, 2017
Hi, should I open another issue?
top output: …
strace output: …
docker info: …
Thanks!
@MikaelCluseau send a …
mcluseau commented Feb 10, 2017
@mlaventure thanks! We observed this on pulls (staying in a waiting state for long periods, for instance). This is when the stack was taken (during …). The stacks are big; here's the gist: https://gist.github.com/MikaelCluseau/062c3ce10b6083041dbc9146be5df5a4
@MikaelCluseau I couldn't see anything unusual in that stack trace. Pull may take a while if the network is not stable or fast. Were other commands slow or stuck when executed in parallel?
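One way to answer that question the next time it happens is to time-box a few read-only commands and see which ones return; a minimal sketch (the 10-second limit is arbitrary):

```
# Run each read-only command with a timeout so a hung daemon doesn't block the check
for c in "docker version" "docker info" "docker ps" "docker images"; do
  timeout 10 $c >/dev/null 2>&1
  echo "$c -> exit $?"   # exit 124 means the command timed out
done
```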
mcluseau commented Feb 13, 2017
@mlaventure I think I found what happened, but I wanted to let the system show stability for at least a night before reporting. We had a pretty low memory limit of 100MiB on our Docker Hub mirror registry (image registry:2.5.1), and it had been OOM killed at least once. I raised the limit to 1GiB, and since then I haven't had any problem with timeouts speaking to the Docker daemon. If I'm on the right track, I should get a reliable reproduction path in a few minutes.
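A sketch of the kind of check and change described above, assuming the mirror runs as a pull-through-cache container; the container name, port, and limit values are illustrative:

```
# Look for OOM kills of the registry (or of dockerd/containerd) in the kernel log
dmesg -T | grep -i -E 'out of memory|oom|killed process'

# Re-create the mirror with a 1 GiB memory limit instead of 100 MiB
docker rm -f registry-mirror
docker run -d --name registry-mirror --memory 1g \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  -p 5000:5000 registry:2.5.1
```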
mcluseau commented Feb 13, 2017
Well... for now the registry doesn't get OOM killed; I suppose I need to have the same backend (swift) to reproduce. FWIW, here's the current draft: …
mcluseau commented Feb 15, 2017
@mlaventure Got docker ps locked again, or, more exactly, it took a long time (many minutes) to return. I think the relevant part of the stack dump is: …
The full dump is here: https://gist.github.com/MikaelCluseau/b2518769f693e667f9fc9575c835c54b
mcluseau commented Feb 15, 2017
Same on another host: …
gist -> https://gist.github.com/MikaelCluseau/8b89528a75c189da9c2575e6172eb080
@MikaelCluseau Yours seems to be stuck traversing a directory structure while applying an SELinux label: …
Sounds more like an issue on the system?
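For context, one common way to end up in that code path is a volume mounted with the `:z`/`:Z` option on an SELinux-enabled host: the daemon then recursively relabels every file under the host path before the container starts, which can take a long time on a large or slow tree. A sketch (the paths are illustrative, and this is only one possible cause of such a stack):

```
# A bind mount with :Z makes the daemon relabel /data/huge-tree recursively
# before the container starts; on a big tree this step can run for minutes.
docker run --rm -v /data/huge-tree:/data:Z busybox true

# Check whether the tree already carries the expected SELinux context
ls -dZ /data/huge-tree
```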
mcluseau commented Feb 16, 2017
@cpuguy83 Thanks for catching this. It's a PXE-booted CoreOS on a local SSD. I should probably file this part of the issue with CoreOS, but shouldn't we consider that there's something to do in Docker too? I mean, "semacquire, 7 minutes" in the stack is probably a sign that it's worth trying to have a finer-grained lock.
Locking is a well-known issue; there are a few issues and proposals on it. Unfortunately it is not a simple change.
cafuego commented Feb 23, 2017
7 minutes doesn't look that bad compared to what mine just spat out. The max is 2950 minutes! I'm on CentOS 7, Docker 1.13.1. The system was apparently operating fine until I tried to stop a set of three containers using …
Corresponding with that, I have an entry in the system log: …
I don't see the container running anymore when I check for a containerd process, but the json blob under /var/lib/docker/containers/4463a91d7f2263269786427e51aa3a0e4fe9325fe2fec04bf551f66ccd598b16 still thinks the container is in state Running. Any docker commands (apart from info and version) just sit there and don't do anything. The strace trick didn't have any effect, but I did get the docker daemon to do a stack trace dump: https://gist.github.com/cafuego/8e1ebc0a0b2580c6588fe1ee3f5fdb25 (2.5MB). Restarting the docker daemon allows me to control containers again (but only for a while; the last docker restart was about a week ago).
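For reference, a sketch of how one can compare the daemon's persisted view of that container with what is actually running, assuming the `State` layout found in `config.v2.json` for this Docker generation:

```
CID=4463a91d7f2263269786427e51aa3a0e4fe9325fe2fec04bf551f66ccd598b16

# The daemon's on-disk view of the container state
jq '.State | {Running, Pid, StartedAt, FinishedAt}' \
  /var/lib/docker/containers/$CID/config.v2.json

# Is the recorded PID still alive?
ps -p "$(jq -r '.State.Pid' /var/lib/docker/containers/$CID/config.v2.json)" -o pid,comm
```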
@mlaventure Seems to be blocked in …
@tonistiigi correct, thanks. It means that one of the CopyToPipe calls is stuck waiting for the fifo to finish after a container exit event. @cafuego any chance you can send a …
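The request above is cut off, but the follow-up suggests it is asking for a containerd stack trace. A sketch of how that is usually collected on this Docker generation, assuming the bundled `docker-containerd` dumps its goroutine stacks to the logs when it receives SIGUSR1 (the process name depends on packaging):

```
# Ask the engine's containerd to dump its goroutine stacks
kill -USR1 "$(pidof -s docker-containerd)"

# The dump lands in the daemon/system journal
journalctl -u docker --since "10 minutes ago" | grep -i -A 5 'goroutine'
```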
cafuego commented Feb 23, 2017
@mlaventure Sorry, this is a production system (not in debug mode) and I've had to bounce Docker fairly quickly after this happened... so the processes have all restarted and I can't get any more info out until it happens again :-/ If it does happen again, I'll try to get a stack trace out of containerd as well.
Hexta commented Feb 24, 2017
@mlaventure We have the same issue on our test cloud.
@Hexta I'm not sure yours is the same, as it seems to be stuck trying to acquire a lock on the logger.
Hexta commented Feb 25, 2017
@tonistiigi I see, thanks.
cafuego commented Mar 3, 2017
I strongly suspect that #31487 is a duplicate of what I'm experiencing. So, it looks like my original issue was due to corruption on an underlying XFS device that was mapped into the container that failed to quit, which then caused the docker commands to no longer work. The same thing just happened again, with a different container, but possibly with corruption on the underlying filesystem (hitting the reset button isn't a great idea, who knew?).
I then went to check for a containerd instance to see if I could get a trace, but it looks like that had in fact exited. The container directories still exist under /var/lib/docker, but since docker commands won't run, I can't delete the container or restart it.
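A sketch of how one might confirm the filesystem-corruption theory above before pointing at the daemon (the device node and mount point are illustrative):

```
# XFS corruption shows up in the kernel log as errors or forced shutdowns
dmesg -T | grep -i -E 'xfs.*(corrupt|error|shutdown)'

# With the filesystem unmounted, xfs_repair -n reports problems without changing anything
umount /mnt/data
xfs_repair -n /dev/sdb1
```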
keyingliu commented Jul 27, 2017
docker hung with version 1.12.6
Seems you are blocked in an `unmount` syscall:
https://gist.github.com/keyingliu/ea6e4a4e8f9193a3422292912eb452be#file-gistfile1-txt-L7982
This is blocking an inspect:
https://gist.github.com/keyingliu/ea6e4a4e8f9193a3422292912eb452be#file-gistfile1-txt-L8178
Which is blocking `docker ps`.
`docker ps` couldn't be blocked in this scenario on 17.07, but the main culprit here is that, for some reason, unmount is stuck, which may be worth tracking down on your system. Do you have kernel logs?
The request is blocked for 78 minutes:
`goroutine 10826773 [syscall, 78 minutes]:`
…On Thu, Jul 27, 2017 at 5:34 AM, keyingliu wrote:
docker hung with version 1.12.6:
https://gist.github.com/keyingliu/ea6e4a4e8f9193a3422292912eb452be
--
Brian Goff
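A sketch of the kernel-log check being asked for above: a syscall stuck in `unmount` for over an hour usually leaves hung-task or filesystem errors in the kernel ring buffer (the storage-driver patterns below are illustrative):

```
# Hung-task warnings and filesystem/device-mapper errors around the stuck unmount
dmesg -T | grep -i -E 'hung_task|blocked for more than|xfs|ext4|device-mapper'

# Which docker-related mounts are still present?
grep /var/lib/docker /proc/mounts
```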
keyingliu commented Jul 28, 2017
@cpuguy83 thanks for looking at it. The kernel log seems fine; I only see many OOM kill messages for containers. I will keep looking into it and report back here. Thanks.
keyingliu commented Aug 3, 2017
@cpuguy83 about the docker hang, I found the …
jacknlliu commented Aug 8, 2017
I met the same issue.
@jacknlliu unfortunately an strace of the client is not helpful. If you could also get a stack dump from dockerd, that would help (we'll need it to figure out where it's stuck). To do this you need to send … Thanks!
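The instruction is cut off here; the usual way to get such a dump is to send SIGUSR1 to dockerd, which makes it write all goroutine stacks to its log (newer releases write them to a `goroutine-stacks-*.log` file under the daemon's run directory; the exact path varies by version). A minimal sketch:

```
# Ask the daemon to dump its goroutine stacks
kill -USR1 "$(pidof -s dockerd)"

# The dump (or the path it was written to) appears in the daemon log
journalctl -u docker --since "2 minutes ago" | grep -i 'goroutine'
ls /var/run/docker/goroutine-stacks-*.log 2>/dev/null
```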
@keyingliu the OOM message is interesting, so the system could just be OOM. Unfortunately, containerd can also get OOM killed, and there happens to be a bug (fixed in the most recent Docker versions) where dockerd does not react to a containerd OOM correctly... If this isn't fixed in 17.06.0, it's definitely fixed in 17.06.1 (currently in RC)... actually two fixes: containerd's OOM score is adjusted so that it is less likely to be killed by the kernel in OOM scenarios, AND dockerd's containerd supervisor is fixed to properly handle a containerd OOM kill.
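A sketch of how to check for both conditions on an affected host; the process names below are the ones this Docker generation uses and are an assumption for other packagings:

```
# Was containerd (or dockerd) itself an OOM-killer victim?
dmesg -T | grep -i -E 'out of memory|oom' | grep -i -E 'containerd|dockerd'

# Current OOM score adjustment of the daemon processes
for p in dockerd docker-containerd; do
  pid=$(pidof -s "$p") && echo "$p oom_score_adj=$(cat /proc/$pid/oom_score_adj)"
done
```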
tj13 commented Nov 24, 2017
Got the same issue, and the workaround 'strace -p {dockerd-pid} -f' failed.
gservat commented Jan 2, 2018
@cpuguy83 we've also seen "docker ps" hang issues, but only in GCP (we run the very same kernel/docker/distro combination in AWS without any problems).
Stack trace: https://gist.github.com/gservat/ba78f2d110b1759beabf65e2e422f16b
Distro: Ubuntu 16.04.3
Any ideas?
chestack commented Mar 2, 2018
docker version: 17.03.2-ce
Stack trace: https://gist.github.com/chestack/02e44f948fafc74622745d9b67db6273
The timestamp of the latest docker log entry is 07:18:18, and I dumped the stack trace at 11:18, so 4 hours had passed. So maybe the goroutines stuck for around 240 minutes are the root cause. @cpuguy83, what's the recommended docker version? Thanks
dmyerscough commented Nov 28, 2016
Description
The Docker daemon becomes unresponsive and causes `docker ps` to hang; containers still continue to run and function fine. A stack trace of the Docker daemon: https://gist.github.com/dmyerscough/ced7616a5e8072315e7ea82ef797414c
Steps to reproduce the issue:
Describe the results you received:
Docker daemon becomes unresponsive.
Describe the results you expected:
Docker daemon shouldn't become unresponsive.
Additional information you deem important (e.g. issue happens only occasionally):
Output of `docker version`: …
Output of `docker info`: …
Additional environment details (AWS, VirtualBox, physical, etc.):
Physical