docker ps command hangs #12606

Closed
daniruddha29 opened this Issue Apr 21, 2015 · 159 comments

Hi All,

We have a large setup of Postgres databases running in Docker containers. The host machine has 3 TB of RAM. When we try to execute the docker ps command, it hangs and returns nothing.
Any help on this would be highly appreciated.

Regards,
Aniruddha

@crosbymichael (Contributor) commented Apr 21, 2015

i've seen this also

@daniruddha29 commented Apr 21, 2015

Is there anything you follow to check what the problem is?

@LK4D4 (Contributor) commented Apr 21, 2015

@daniruddha29 Do you use links or restart policies?

@daniruddha29 commented Apr 21, 2015

No, we don't. Does that have an impact?
I hope this is not a serious issue, as the environment is production.

@LK4D4 (Contributor) commented Apr 21, 2015

@daniruddha29 I just recall similar issues caused by restart policies and links. I'll try to reproduce; any clues about your containers?

@daniruddha29 commented Apr 21, 2015

we are using RHEL 6.5

@crosbymichael (Contributor) commented Apr 21, 2015

@daniruddha29 thanks, we can reproduce and will look for a fix and keep you updated

@LK4D4 (Contributor) commented Apr 21, 2015

@daniruddha29 Did you use some volumes for your containers?

@daniruddha29 commented Apr 21, 2015

yes we are using volumes to host our database mount.

@daniruddha29 commented Apr 21, 2015

let me know if you need any specific info.

@LK4D4 (Contributor) commented Apr 21, 2015

@daniruddha29 Any info will be helpful, docker version, docker info, Dockerfile, command line for running containers.

@daniruddha29 commented Apr 21, 2015

docker version
Client version: 1.2.0
Client API version: 1.14
Go version (client): go1.3.1
Git commit (client): fa7b24f
OS/Arch (client): linux/amd64
Server version: 1.2.0
Server API version: 1.14
Go version (server): go1.3.1
Git commit (server): fa7b24f
++++++++++++++++++++++++++++++++
docker info
Containers: 18
Images: 96
Storage Driver: devicemapper
Pool Name: docker-252:0-656075-pool
Pool Blocksize: 64 Kb
Data file: /var/lib/docker/devicemapper/devicemapper/data
Metadata file: /var/lib/docker/devicemapper/devicemapper/metadata
Data Space Used: 18131.5 Mb
Data Space Total: 102400.0 Mb
Metadata Space Used: 17.6 Mb
Metadata Space Total: 2048.0 Mb
Execution Driver: native-0.2
Kernel Version: 3.8.13-44.1.1.el6uek.x86_64
Operating System:
+++++++++++++++++++++++++++++++++++++
I will update the Dockerfile shortly.

LK4D4 removed the kind/regression label Apr 21, 2015

@daniruddha29 commented Apr 21, 2015

It's too early to ask... but did you find anything around that?

@LK4D4 (Contributor) commented Apr 21, 2015

@daniruddha29 Nope, sorry :/ Can you try updating your Docker?

@daniruddha29 commented Apr 21, 2015

Need to check, as this is a production environment.
One question, maybe a very silly one: what will be the impact if we continue using it?

@LK4D4 (Contributor) commented Apr 21, 2015

@daniruddha29 Even if it is an unfixed issue, a fix won't be backported to 1.2 :(

@daniruddha29 commented Apr 21, 2015

Haha... definitely not.
One more question: is it a bug? Is it already known?

@LK4D4 (Contributor) commented Apr 21, 2015

@daniruddha29 Looks like a bug, but we can't reproduce it. Maybe it has been fixed since version 1.2.

@daniruddha29 commented Apr 21, 2015

That is quite useful information.
Thank you very much for all the help.

@aidanhs (Contributor) commented Apr 24, 2015

Seeing this issue at the moment. Info:

ahobsons@docker03:~$ docker version
Client version: 1.5.0
Client API version: 1.17
Go version (client): go1.4.1
Git commit (client): a8a31ef
OS/Arch (client): linux/amd64
Server version: 1.5.0
Server API version: 1.17
Go version (server): go1.4.1
Git commit (server): a8a31ef
ahobsons@docker03:~$ docker info
Containers: 24
Images: 298
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 346
Execution Driver: native-0.2
Kernel Version: 3.13.0-46-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 4
Total Memory: 7.63 GiB
Name: docker03
ID: 2JDT:3FKJ:P4FW:C3YE:TQ54:USOD:6X7J:XDF5:SH3A:NTKY:X7GJ:CUUM
WARNING: No swap limit support

Machine has spare memory, low load and very little i/o - the daemon looks idle, even when the docker ps command is hung.

Bizarrely, this machine is part of a swarm cluster...and doing docker ps via that works fine.

@aidanhs (Contributor) commented Apr 24, 2015

Ah, swarm appears to have cached information. New containers I start don't appear.
FWIW, docker images, docker run work fine when run directly on the machine.

@LK4D4 (Contributor) commented Apr 24, 2015

@aidanhs Yeah, it means that some container has acquired its lock. So you can't do ps because of that one locked container, and you also can't stop or kill it from the CLI. Such bugs are very hard to fix without a reproduction case :( It is quite possible that this was fixed in 1.6, because the code around running containers was rewritten to use the new libcontainer API.
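
To illustrate the kind of hang being described, here is a minimal, hypothetical Go sketch (not the actual daemon code; the type and function names are invented) of how one container whose lock is held by a stuck operation makes a listing operation such as docker ps block:

package main

import (
    "fmt"
    "sync"
    "time"
)

// container stands in for a daemon-side container object that embeds a
// mutex protecting its state.
type container struct {
    sync.Mutex
    ID string
}

// listContainers walks every container, locking each one to read its
// state, roughly the way a ps-style listing would. If any container's
// lock is already held, the walk blocks there and the whole listing hangs.
func listContainers(all []*container) []string {
    var out []string
    for _, c := range all {
        c.Lock() // blocks forever on a container stuck mid-operation
        out = append(out, c.ID)
        c.Unlock()
    }
    return out
}

func main() {
    stuck := &container{ID: "builder"}
    healthy := &container{ID: "agent"}

    // Simulate an operation (e.g. a start) that takes the lock and never
    // releases it, for instance because it is waiting on the storage driver.
    stuck.Lock()

    done := make(chan struct{})
    go func() {
        listContainers([]*container{healthy, stuck})
        close(done)
    }()

    select {
    case <-done:
        fmt.Println("listing finished")
    case <-time.After(2 * time.Second):
        fmt.Println("listing hung: a container lock is held elsewhere")
    }
}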

@aidanhs (Contributor) commented Apr 24, 2015

I assume you're thinking of a deadlock then? Or are there other things that can cause this in Go? I assume care is taken to always defer unlock after acquiring a lock.

Have you (Docker Inc as a whole, not you specifically) not got any scripts to loop over the goroutines in a running daemon and dump out what they're blocked on? If it was a deadlock that would presumably help.

I've just compiled the daemon with symbols and then loaded those symbols into a gdb attached to the docker daemon. I can now write python scripts to traverse all the goroutines looking for particular things...if I know what I'm looking for. Do you have any links to previous tickets like this?

@estesp (Contributor) commented Apr 24, 2015

@aidanhs see PR #10786 for the dump capability added to the Docker daemon. If you can reproduce easily, add this patch (if possible) and you can then dump goroutines.
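
For context, the general technique such a patch enables is a signal handler that dumps every goroutine's stack, so a hung daemon can be inspected without attaching a debugger. Below is a minimal, self-contained Go sketch of that idea; it is illustrative only and is not the code from PR #10786:

package main

import (
    "os"
    "os/signal"
    "runtime/pprof"
    "syscall"
)

// dumpStacksOnSignal registers a handler so that, on SIGUSR1, every
// goroutine's stack trace is written to stderr.
func dumpStacksOnSignal() {
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGUSR1)
    go func() {
        for range c {
            // debug=2 prints full stacks, similar to an unrecovered panic.
            pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
        }
    }()
}

func main() {
    dumpStacksOnSignal()
    select {} // block forever; send SIGUSR1 to this process to dump stacks
}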

@aidanhs (Contributor) commented Apr 24, 2015

I can dump goroutine traces with gdb. Unfortunately there are 618 goroutines :)
I've scripted their dumping, result at https://gist.github.com/aidanhs/960beaf2db1de622a2dd. This is pretty big.

But! If we grep out the callers to Lock then we find the callsites goroutines are blocked at:

root@docker03:~# grep -A1 Lock gdb.txt | grep -v '^--$' | grep -v 'Mutex' | sort | uniq -c
      1 #5  0x00000000004bc5ba in github.com/docker/docker/daemon.(*Daemon).ContainerInspect (daemon=0xc2080691e0, job=0xc208857980, ~r1=0) at /go/src/github.com/docker/docker/daemon/inspect.go:17
      1 #5  0x00000000004c861f in github.com/docker/docker/daemon.(*State).IsRunning (s=0xc2081c79d0, ~r0=54) at /go/src/github.com/docker/docker/daemon/state.go:116
      1 #5  0x00000000004c861f in github.com/docker/docker/daemon.(*State).IsRunning (s=0xc2091c2e70, ~r0=246) at /go/src/github.com/docker/docker/daemon/state.go:116
     41 #5  0x00000000004d4f20 in github.com/docker/docker/daemon.func·025 (container=0xc209060a80, ~r1=...) at /go/src/github.com/docker/docker/daemon/list.go:79
      3 #5  0x00000000004d4f20 in github.com/docker/docker/daemon.func·025 (container=0xc209060fc0, ~r1=...) at /go/src/github.com/docker/docker/daemon/list.go:79
      1 #68 0x00000000005eb5bf in sync.(*Pool).getSlow (p=0x12eac50 <net/http.textprotoReaderPool>, x=...) at /usr/local/go/src/sync/pool.go:129

Since I have gdb I can (in theory) print variables from goroutines and so on if you like.
Will have more of a look a bit later. I strongly suspect one of the first three lines is the culprit.

@LK4D4 (Contributor) commented Apr 24, 2015

I personally fail to see how it can produce a deadlock :)

@elephantfries commented Apr 25, 2015

@LK4D4 This problem is still present in 1.6.0. I just had the case yesterday. We use docker in a build system. A build hung and the agent abandoned it (without trying to stop or kill the build container). The agent and the build were running in two different containers. After that, docker ps hung with no way to recover but a reboot. This was on centos atomic 4a524a58cb with docker 1.6.0.

@LK4D4 (Contributor) commented Apr 25, 2015

@elephantfries It would help if you described the flags you pass to the daemon and to docker run.

@elephantfries commented Apr 25, 2015

Docker daemon (after reboot), from systemctl status docker:

CGroup: /system.slice/docker.service
├─ 843 /usr/bin/docker -d --selinux-enabled --insecure-registry <my.company.registry> --storage-opt dm.loopdatasize=150G
└─1051 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8811 -container-ip 172.17.0.1 -container-port 8811

The agent container, from systemctl status:

CGroup: /system.slice/agent.service
└─1009 /usr/bin/docker run -a STDOUT -a STDERR --name agent -p 8811:8811 -h agent23 --security-opt label:disable -v /usr/bin/docker:/usr/bin/docker -v /var/run/docker.sock:/var/run/docker.sock -v /home/build:/home/build -v /var/srv/data:/opt/data <my.company.registry/image-name> /root/run

The build container, from the build script:

docker run -t --rm --name builder -e JOB=$proj -c 256 -u build -w /home/build --security-opt label:disable -v /opt/data/scripts/docker:/home/build/scripts -v $workdir/build:/home/build/workdir -v /home/build:/home/build/homedir -e MAVEN_OPTS=$maven_opts $(cat builder-image.txt) /home/build/scripts/build-maven $proj $target

No other containers were running. After the build gave up, a subsequent build started which included a script that kills any left-over containers. That script uses docker ps to find out what's running. That script hung and attempts at docker ps from host also hung.

@LK4D4 (Contributor) commented Apr 26, 2015

So hard to reproduce :( Thanks for the info though.
@elephantfries @aidanhs If you are able to reproduce this at least from time to time, let me know; we can build a debug binary with deadlock detection.

@elephantfries commented Apr 26, 2015

I do see it from time to time. I should be able to use a special build.

@aidanhs (Contributor) commented Apr 26, 2015

I've not yet had a chance to use my (still) running process to debug, but I do have some thoughts after a quick look.

In my case, all of the container.Lock operations are blocking because of the two (*State).IsRunning stacks - you can see from the full stack dump that these are being called during container start, which locks the container as literally the first thing it does (daemon/container.go Start).
So the question now becomes "Why is state.Lock deadlocking?".
Scanning daemon/state.go doesn't reveal anything that even calls out to somewhere else, so it can't really produce a deadlock on its own. There must be something though, as the state.Lock lines are definitely hung. I'm not sure how I'd identify where else the state gets locked...
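
For reference, the accessor pattern being discussed is roughly of this shape (a simplified sketch, not the exact daemon/state.go code): the method only takes the state lock, reads a boolean and releases it, so it cannot deadlock by itself; it can only hang if some other code path is holding the same lock and never releasing it.

package main

import (
    "fmt"
    "sync"
)

// State is a simplified sketch of a container-state object guarded by a mutex.
type State struct {
    sync.Mutex
    Running bool
}

// IsRunning takes the state lock, reads one field and releases the lock.
// Nothing here can deadlock by itself; the Lock() call only blocks if
// another code path already holds this state's mutex and never releases it.
func (s *State) IsRunning() bool {
    s.Lock()
    res := s.Running
    s.Unlock()
    return res
}

func main() {
    s := &State{Running: true}
    fmt.Println(s.IsRunning()) // prints "true" as long as nobody else holds the lock
}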

@elephantfries commented Apr 26, 2015

Is the lock held for the entire life of the container? In my case, the agent is a long-running container; it gets locked days after it started. The other containers are either running for a while (30 mins), in the process of exiting, or have exited. Next time, I'll try to determine exactly what state they're in.

@aidanhs (Contributor) commented Apr 26, 2015

Well, the problem in my case appears to be a deadlock occurring during a container start, and so future attempts to lock that container (ps iterates over all containers, locking them, retrieving information and unlocking them) will also deadlock.

To me it sounds like your build container is the one hanging during startup, which then causes the ps hang? Or are you saying the build container is running a while before it hangs?

@elephantfries commented Apr 26, 2015

To be precise it is the agent container that hangs. That container is long running, days. When it gets a new job it tries to cleanup after old possibly broken builds and runs a script with docker ps to find out if any leftover containers need to be killed before starting any new build containers. In the case described above, there was a previous broken build. Unfortunately, I did not determine what the state of the old build container(s) was. I suspect that old broken build is critical to creating the deadlock because otherwise the agent is the only container running.

What may also be interesting to note is that the 'broken' build did succeed without trouble next time around so I would say there was something that made the build break and that something may have also deadlocked ps.

I would be happy to run instrumented docker with those builds if available.

@LK4D4 (Contributor) commented Apr 26, 2015

I'll build a binary on Monday.

@aidanhs (Contributor) commented Apr 26, 2015

Got a repro, looking into a fix.

@elephantfries commented May 26, 2015

All right. Mystery solved. The debug messages show with:

journalctl -u systemd-udevd (not journalctl -u udevd)

Unfortunately, I rebooted the machine so I lost my logs.
Let me try another trace.

@elephantfries commented May 26, 2015

@rhvgoyal Deadlock 2

qba23 ~ # dmsetup udevcookies
Cookie Semid Value Last semop time Last change time
0xd4d9b78 4390912 1 Tue May 26 19:51:22 2015 Tue May 26 19:51:22 2015

Docker log last lines:

http://pastebin.com/nSUuczJb

Udevd log last lines:

http://pastebin.com/cqN9DALu

@rhvgoyal (Contributor) commented May 26, 2015

@elephantfries

I am looking at udev logs and can't find any "dmsetup udevcomplete" calls. In fact, I can't seem to find any messages related to dm in udev logs. Not sure if rules are being executed properly or not.

@elephantfries commented May 26, 2015

Here's the full docker log, covers the entire build session.

http://pastebin.com/WR1bqQvE

@elephantfries commented May 26, 2015

Sorry, wrong log. This is the full udevd log:

http://pastebin.com/h2ULXutb

@rhvgoyal (Contributor) commented May 26, 2015

Ok, looking at the udev logs and the libdevmapper logs, some observations:

  • The last udevcomplete was done on cookie 4248975 (0x40D58F).
  • This seems to match the second-to-last cookie (0xd4dd58f) used for device removal.
  • After that, docker tried to activate the actual container device and created a new cookie (0xd4d9b78), and I don't see a corresponding udevcomplete.

So on the surface it seems as if udev did not call udevcomplete. It is not clear why that happened.

@rhvgoyal (Contributor) commented May 26, 2015

@elephantfries

Can you provide "dmsetup table" output please.

@elephantfries commented May 27, 2015

qba23 ~ # dmsetup table
docker-8:17-5242883-7bed6f54f55c6e5eaeb90dd1282b9e56fa30309e807581495031b423b400d4c3: 0 20971520 thin 253:0 122
docker-8:17-5242883-d0dabf387dc55d67ec3f8ffbbe31bb4ff31b773057876c7b4f635c631cc43bc1: 0 20971520 thin 253:0 111
docker-8:17-5242883-pool: 0 314572800 thin-pool 7:1 7:0 128 32768 1 skip_block_zeroing
docker-8:17-5242883-3c038f93e80f55e283c9785eb906b768d4c543d51e3ab9d93caffb4bfb7da1a5: 0 20971520 thin 253:0 116
docker-8:17-5242883-d01af288d675afbdea460a55194c7f1fab0426127fd2303c2bbbe7bf965f2a60: 0 20971520 thin 253:0 117
docker-8:17-5242883-7a19daadc8c30d9f311d9e0aa5442b33a22ed3de34d9d9060dde37ec05101cde: 0 20971520 thin 253:0 91

@elephantfries commented Jun 8, 2015

Switching to using an lvm thin pool directly improves the situation. That solution has its own set of problems, but it seems to work for now. Perhaps the strategy of targeting overlayfs is best, if it is believed to be the successor.

I left a couple of machines in the original configuration in case we want to put more effort into fixing this bug.

@rhvgoyal (Contributor) commented Jun 8, 2015

@elephantfries

I had a suspicion that the udev flags were wrong, but they seem to be right for snapshot device creation.

Alasdair had looked at the logs and noticed that one of the udev threads is being killed, and it is possible that a probe command failed. Looks like the logs are gone now.

Using an lvm thin pool is the right thing to do. Loop devices are unreliable, and docker does not have any management utilities to do pool management (grow the pool, etc.).

BTW, what are the issues you are facing with the lvm thin pool?

@elephantfries commented Jun 8, 2015

@rhvgoyal I can re-post the logs if it would be useful.

With a direct thin pool, it seems that we're limited to 100G. I tried larger amounts, like 160G, but then I ran into devmapper confusion. I forget the exact error, but somebody else posted it on the #atomic list as well. If you want, I can most likely recreate that problem. The other issues are minor, manageability-type things. I can't really complain about them, but to give you an idea of what the thinking is: for example, it forces us onto LVM partitions. We'd much rather give 1T to the / partition as ext4 and be done with it. Those atomic hosts are commodities; the simpler the better. Nobody's expanding disks on them; we'd switch to another host if necessary. Another example: when we decide to rm -rf /var/lib/docker (because of unrelated issues), the pool is unusable and we have to re-create it from scratch. It's the cost of living on the cutting edge of technology, and we understand that.

@rhvgoyal (Contributor) commented Jun 8, 2015

@elephantfries

Ok, you mentioned a lot of issues; let us handle the most important one first, this 100G limitation. It looks like I am not facing this issue.

I created an lvm thin pool (using docker-storage-setup) of size 800G and used around 500G for the docker thin pool.

I started docker with the option --storage-opt dm.basesize=160G and everything seems to be fine. I have a container root of size 160G.

[root@d2cc496561d6 /]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/docker-253:1-2763198-d2cc496561d6d520cbc0236b4ba88c362c446a7619992123f11c809cded25b47 160G 230M 160G 1% /

I am using upstream docker on F21 with a dynamically built binary.

So this might be something Atomic-specific and might even have been fixed in the latest version. If not, this should be fixable.
So this might be something atomic specific and might have even got fixed into latest. If not, this should be something fixable.

@rhvgoyal (Contributor) commented Jun 8, 2015

@elephantfries

The issue of manageability has many pieces to it.

  • W.r.t. planning how to divide space between root and the thin pool, I agree that using loop devices is simpler.
  • W.r.t. being able to grow the pool later, I think using a loop device is not best, as docker does not have the ability to grow the pool. You mention that you don't care as you will move to a different atomic host. So you don't care about all the containers which are already in the previous thin pool either?
  • Sometimes the pool might be in bad shape. lvm checks pool health during activation; docker does none of this. That means your operations on the pool later might fail and you might not have an easy or automatic way to deal with it.
  • Loop devices are not considered very reliable. A thin pool is more reliable.

IIUC, it looks like for you, not having to think about how to partition existing storage is more important than the other downsides of loop. I guess it's your choice.

@rhvgoyal (Contributor) commented Jun 8, 2015

@elephantfries

If you have to do "rm -rf /var/lib/docker", then you have to issue two more commands to delete and recreate the pool:
"lvremove docke/docker-pool"
"systemctl start docker-storage-setup"

And that should take care of destroying your existing pool and creating a new one for docker use.

That doesn't sound like a lot of work, does it?

@elephantfries commented Jun 8, 2015

@rhvgoyal Right, your points are valid and I said that I cannot really complain about them. I only mention them to give you an idea into what we're thinking. So, again, nothing wrong with lvm but life is easier without it. Too bad loop devices are unreliable, let's hope overlayfs stabilizes soon. However, devmapper has other uses, beyond docker.

So let's focus on hard problems. I'll try to re-create the devmapper confusion with thin pool larger than 100G. This has been reported and possibly fixed but I'll try anyway.

@elephantfries commented Jun 26, 2015

@rhvgoyal It may be time to close this issue. I met vbatts at DockerCon. He was rather pessimistic about loop devices, so maybe that's just how it is: won't fix. I switched to a thin pool while setting my sights on overlayfs. Devmapper with a thin pool is not without issues, but they're less frequent. I could not reproduce 'no space' on thin pools larger than 100G, but I did run into "cannot start container" a couple of times. Restarting docker brings it back to operation. Not ideal, but I can live with it.

@rhvgoyal (Contributor) commented Jun 26, 2015

@elephantfries

Ok, let's close it. Reopen it once we have more concrete data we can go after and fix things.

@vbatts, can you please close this?

@LK4D4 (Contributor) commented Jun 26, 2015

Closing

LK4D4 closed this Jun 26, 2015

@syzer commented Sep 2, 2015

On Mac I got:

An error occurred trying to connect: Get https://192.168.59.103:2376/v1.19/containers/json: dial tcp 192.168.59.103:2376: i/o timeout

@oncletom commented Sep 22, 2015

I face a similar issue on an Ubuntu 14.04 EC2 machine – using Docker 1.8.2 with /var/lib/docker mounted on an EBS volume.

I am not sure if it happens during the exited-container cleanup (docker rm $(docker ps -a -q -f "status=exited" -f "status=dead") every 10 minutes, with hundreds of exited containers).

Let me know if you need any additional details.

Linux ip-10-0-0-52 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Client:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:19:00 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:19:00 UTC 2015
 OS/Arch:      linux/amd64
Containers: 16
Images: 129
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 161
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: syslog
Kernel Version: 3.13.0-63-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 4
Total Memory: 14.69 GiB
Name: ip-10-0-0-52
ID: 447J:75KW:SP3L:DK2P:YL7N:YLT2:SAZ4:UX3T:264B:V2KX:7FT7:MODF
WARNING: No swap limit support

@khimaros commented Sep 30, 2015

Also seeing this with CoreOS 766.3.0 and default Docker configuration on Google Compute Engine.

docker version

# docker version
Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 2c2c52b-dirty
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 2c2c52b-dirty
OS/Arch (server): linux/amd64

docker info

# docker info
Containers: 8
Images: 55
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.1.6-coreos-r1
Operating System: CoreOS 766.3.0
CPUs: 1
Total Memory: 3.618 GiB

docker-compose -v

# docker-compose -v
docker-compose version: 1.4.0

I seem to be able to reproduce this by executing many commands of the following form:

# docker-compose run --rm --entrypoint=${command} ${container} /bin/sh -c '<complex command>'

Working on a consistent reproduction case with a simple container.

@yifan-gu commented Oct 20, 2015

Also seeing this on CoreOS 766.4 on GCE.

$ docker info
Containers: 140
Images: 110
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.1.7-coreos
Operating System: CoreOS 766.4.0
CPUs: 2
Total Memory: 7.309 GiB
Name: e2e-test-yifan-minion-9wkj
ID: 4MAS:EPAP:ECO5:BZ76:PGPB:7IGD:4XCP:ANLE:GQBW:GGFR:3OCY:EZEN
$ docker version
Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): df2f73d-dirty
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): df2f73d-dirty
OS/Arch (server): linux/amd64

@thaJeztah (Member) commented Oct 21, 2015

@sllawap @yifan-gu the versions you're running on CoreOS are maintained by CoreOS. First of all, those are older versions of Docker (which means the issue might already have been fixed here), but also, the CoreOS builds may contain patches that make them act differently from the "vanilla" Docker version (see the -dirty in the version).

I recommend opening an issue in the CoreOS issue tracker first.

@sjwoodr commented Nov 2, 2015

I also encountered this (again) today with docker 1.8.2 on ubuntu 14.04 in EC2. I had to reboot. Restarting docker daemon itself was of no help.

@dazraf commented Feb 17, 2016

Why is this issue not reopened? Is it really fixed?

@thaJeztah (Member) commented Feb 17, 2016

@dazraf I think this thread has become a "catch all" for multiple issues that may not be related, but have similar results; if you're encountering this on 1.10, it's better to open a new issue

@alexanderkjeldaas commented Feb 22, 2016

If you're encountering this on 1.10, it might simply be that it is migrating images.

@kevinoriordan commented Feb 25, 2016

Also seeing this on Amazon Linux 2015.09

Docker version details:

Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5/1.9.1
Built:
OS/Arch: linux/amd64

Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5/1.9.1
Built:
OS/Arch: linux/amd64

@jhberges commented Apr 28, 2016

Also seeing this on

docker@xxxx:~$ uname -a
Linux xxxx 3.16.0-45-generic #60~14.04.1-Ubuntu SMP Fri Jul 24 21:16:23 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

docker@xxxx:~$ docker version
Client:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:54:52 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:54:52 2016
 OS/Arch:      linux/amd64

@brouberol commented May 3, 2016

Also seeing this on

# uname -a
Linux 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-7~bpo8+1 (2016-01-19) x86_64 GNU/Linux

# docker version
Client:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:38:58 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   20f81dd
 Built:        Thu Mar 10 15:38:58 2016
 OS/Arch:      linux/amd64

The funny thing is, I have NO running container. I initially thought this hang could be related to a high number of containers running.

@LouisKottmann commented May 13, 2016

Also seeing this on:

# docker version
Client:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Tue Apr 26 23:30:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Tue Apr 26 23:30:23 2016
 OS/Arch:      linux/amd64
# uname -a
Linux ip-10-0-1-66 3.13.0-86-generic #130-Ubuntu SMP Mon Apr 18 18:27:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.4 LTS
Release:        14.04
Codename:       trusty

@thaJeztah (Member) commented May 13, 2016

@brouberol @LouisKottmann please open a new issue, and also provide the output of docker info.

I'm going to lock this issue; it has become a collection of many issues that are not related to each other but result in the same behavior ("docker not responding"). If you're having an issue like this, please open a new issue and try to provide information that can be used to reproduce or debug it. Reporting only "I have this too" likely won't be sufficient to find the cause, or to see whether there's an actual bug or a configuration issue.

moby locked and limited conversation to collaborators May 13, 2016
