Docker Daemon Hangs under load #13885

Open
mjsalinger opened this Issue Jun 11, 2015 · 174 comments

The Docker daemon hangs under heavy load. The scenario is starting/stopping/killing/removing many containers per second, i.e. high utilization. Each container exposes one port and is run without logging and in detached mode. The container receives an incoming TCP connection, does some work, sends a response, and then exits. An outside process cleans up by killing/removing the container and starting a new one.

I cannot get docker info from an actual hung instance, because once it hangs I can't get Docker to run again without a reboot. The info below is from one of the instances that has had the problem, taken after a reboot.

We also have instances where something completely locks up and the instance cannot even be ssh'd into. This usually happens sometime after the docker lockup occurs.

Docker Info

Containers: 8
Images: 65
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Kernel Version: 3.18.0-031800-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 2
Total Memory: 3.675 GiB
Name:
ID: FAEG:2BHA:XBTO:CNKH:3RCA:GV3Z:UWIB:76QS:6JAG:SVCE:67LH:KZBP
WARNING: No swap limit support

Docker Version

Client version: 1.6.0
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 4749651
OS/Arch (client): linux/amd64
Server version: 1.6.0
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 4749651
OS/Arch (server): linux/amd64

uname -a
Linux 3.18.0-031800-generic #201412071935 SMP Mon Dec 8 00:36:34 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 14972
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 14972
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

duglin (Contributor) commented Jun 11, 2015

Would you happen to have a reduced test case to show this? Perhaps a small Dockerfile for what's in the container and a bash script that does the work of starting/stopping/... the containers?


mjsalinger commented Jun 11, 2015

The container image is on Docker Hub: kinvey/blrunner:v0.3.8

Using the remote API with the following options:

CREATE
Image: 'kinvey/blrunner:v0.3.8'
AttachStdout: false
AttachStderr: false
ExposedPorts: {'7000/tcp': {}}
Tty: false
HostConfig:
PublishAllPorts: true
CapDrop: [
"CHOWN"
"DAC_OVERRIDE"
"FOWNER"
"KILL"
"SETGID"
"SETPCAP"
"NET_BIND_SERVICE"
"NET_RAW"
"SYS_CHROOT"
"MKNOD"
"SETFCAP"
"AUDIT_WRITE"
]
LogConfig:
Type: "none"
Config: {}

START

container.start

REMOVE
force:true

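For anyone who wants to approximate the workload without our remote-API client, here's a rough bash sketch of the churn loop (not our actual harness: it substitutes the plain docker CLI for the remote API, a stock image for kinvey/blrunner, and /bin/true for the real TCP worker):

#!/bin/bash
# Churn containers as fast as possible: run detached with no logging and
# all ports published, then force-remove, in several parallel loops.
IMAGE=ubuntu        # stand-in image, assumed to be pulled already
CONCURRENCY=8       # we keep ~8 containers in flight per instance

churn() {
    while true; do
        id=$(docker run -d -P --log-driver=none "$IMAGE" /bin/true) || continue
        docker rm -f "$id" >/dev/null 2>&1
    done
}

for i in $(seq 1 "$CONCURRENCY"); do churn & done
wait

docker rm -f covers the kill+remove cleanup step in a single call.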

cpuguy83 (Contributor) commented Jun 11, 2015

Hmm, are you seeing excessive resource usage?
The fact that ssh isn't even working makes me suspicious.
It could very well be an inode issue with overlay, or too many open FDs, etc.

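To check both of those quickly on a host that's still responsive, something along these lines should do (paths assume a stock install under /var/lib/docker):

# inode usage of the filesystem backing the overlay graph
df -i /var/lib/docker

# file descriptors currently held by the daemon
ls /proc/$(pidof docker)/fd | wc -l

# system-wide open-file count vs. the kernel limit
cat /proc/sys/fs/file-nr

Note that pidof docker also matches a running docker client, so run it while no client commands are active.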

mjsalinger commented Jun 11, 2015

Not that we've noticed in terms of excessive resource usage... but the early symptom is Docker completely hanging while other processes hum along happily.

Important to note: we only have 8 containers running at any one time on any one instance.


mjsalinger commented Jun 12, 2015

Captured some stats while Docker is no longer responsive:

lsof | wc -l

shows 1025.

However, an error appears several times:
lsof: WARNING: can't stat() overlay file system /var/lib/docker/overlay/aba7820e8cb01e62f7ceb53b0d04bc1281295c38815185e9383ebc19c30325d0/merged
Output information may be incomplete.

Sample output of top:

top - 00:16:53 up 12:22, 2 users, load average: 2.01, 2.05, 2.05
Tasks: 123 total, 1 running, 121 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 3853940 total, 2592920 used, 1261020 free, 172660 buffers
KiB Swap: 0 total, 0 used, 0 free. 1115644 cached Mem

24971 kinvey 20 0 992008 71892 10796 S 1.3 1.9 9:11.93 node
902 root 20 0 1365860 62800 12108 S 0.3 1.6 30:06.10 docker
29901 ubuntu 20 0 27988 6720 2676 S 0.3 0.2 3:58.17 tmux
1 root 20 0 33612 4152 2716 S 0.0 0.1 14:22.00 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.03 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:04.40 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root 20 0 0 0 0 S 0.0 0.0 2:21.81 rcu_sched
8 root 20 0 0 0 0 S 0.0 0.0 0:01.91 rcu_bh


unclejack (Contributor) commented Jun 13, 2015

@mjsalinger The setup you're using is unsupported: Ubuntu 14.04 with a custom kernel.

Where does that 3.18.0-031800 kernel come from? Did you notice that this kernel build is outdated? The kernel you're using was built in December of last year.

I'm sorry, but there's nothing to be debugged here. This issue might actually be a kernel bug related to overlay, or some other already-fixed kernel bug which is no longer an issue in the latest 3.18 kernel.

I'm going to close this issue. Please try again with an up-to-date 3.18 or newer kernel and check whether you still run into the problem. Please keep in mind that there are multiple issues opened against overlay and that you'll probably experience problems with overlay even after updating to the latest kernel version and the latest Docker version.


unclejack closed this Jun 13, 2015

mjsalinger commented Jun 14, 2015

@unclejack @cpuguy83 @LK4D4 Please reopen this issue. The configuration we're using was specifically advised by the Docker team, plus our own experimentation. We tried newer kernels (3.19+), but they have a kernel panic bug of some kind that we kept running into, so the advice was to go with a pre-December 3.18 build, because a known kernel bug introduced after that causes a kernel panic we were hitting and, to my knowledge, has not yet been fixed.

As for OverlayFS, that was also presented to me as the ideal filesystem for Docker after we experienced numerous performance problems with AUFS. If this isn't a supported configuration, can someone help me find a performant, stable configuration that will work for this use case? We've been pushing to get this stable for several months.


cpuguy83 reopened this Jun 16, 2015

cpuguy83 (Contributor) commented Jun 16, 2015

@mjsalinger Can you provide the inode usage for the volume that overlay is running on? (df -i /var/lib/docker ought to do).


mjsalinger commented Jun 16, 2015

Thanks for reopening. If the answer is a different kernel, that's fine, I just want to get to a stable scenario.

df -i /var/lib/docker

Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/xvda1 1638400 643893 994507 40% /


unclejack (Contributor) commented Jun 16, 2015

Overlay still has a bunch of problems: https://github.com/docker/docker/issues?q=is%3Aopen+is%3Aissue+label%3A%2Fsystem%2Foverlay

I wouldn't use overlay in production. Others have commented on this issue tracker that they're using AUFS in production and that it's been stable for them.

Kernel 3.18 is unsupported on Ubuntu 14.04. Canonical doesn't provide support for that.


mjsalinger commented Jun 16, 2015

AUFS in production is not performant at all and has been anything but stable for us - I would routinely run into I/O bottlenecks, freezes, etc.

See:
#11228, #12758, #12962

Switching to Overlay resolved all of the above issues - we only have this one issue remaining.

also:

http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/
http://qconlondon.com/system/files/presentation-slides/Docker%20Clustering.pdf

It seems that Overlay is being presented as the driver of choice by the community in general. Fine if it's not stable yet, but neither is AUFS, and there has to be some way to get the performance and stability I need with Docker. I'm all for trying new things, but in our previous configurations (AUFS on Ubuntu 12.04 and AUFS on Ubuntu 14.04) we could not get either stability or performance. At least with Overlay we get good performance and better stability - we just need to resolve this one remaining problem.


mjsalinger commented Jun 16, 2015

I also experienced symptoms similar to these running default Ubuntu instances under AUFS (12.04, 14.04, and 15.04).

#10355 #13940

Both of these issues disappeared after switching to the current Kernel and OverlayFS.


unclejack (Contributor) commented Jun 16, 2015

@mjsalinger I'd recommend using Ubuntu 14.04 with the latest kernel 3.13 packages. I'm using that myself and I haven't run into any of those problems at all.


mjsalinger commented Jun 16, 2015

@unclejack We tried that and ran into the issues listed above under heavy, high-volume usage (creating/destroying lots of containers), and AUFS was incredibly non-performant. So that's not an option.


cpuguy83 (Contributor) commented Jun 16, 2015

@mjsalinger Are you using upstart to start docker? What does /etc/init/docker.conf look like?


mjsalinger commented Jun 16, 2015

Yes, using upstart.

/etc/init/docker.conf

description "Docker daemon"

start on (local-filesystems and net-device-up IFACE!=lo)
stop on runlevel [!2345]
limit nofile 524288 1048576
limit nproc 524288 1048576

respawn

pre-start script
        # see also https://github.com/tianon/cgroupfs-mount/blob/master/cgroupfs-mount
        if grep -v '^#' /etc/fstab | grep -q cgroup \
                || [ ! -e /proc/cgroups ] \
                || [ ! -d /sys/fs/cgroup ]; then
                exit 0
        fi
        if ! mountpoint -q /sys/fs/cgroup; then
                mount -t tmpfs -o uid=0,gid=0,mode=0755 cgroup /sys/fs/cgroup
        fi
        (
                cd /sys/fs/cgroup
                for sys in $(awk '!/^#/ { if ($4 == 1) print $1 }' /proc/cgroups); do
                        mkdir -p $sys
                        if ! mountpoint -q $sys; then
                                if ! mount -n -t cgroup -o $sys cgroup $sys; then
                                        rmdir $sys || true
                                fi
                        fi
                done
        )
end script

script
        # modify these in /etc/default/$UPSTART_JOB (/etc/default/docker)
        DOCKER=/usr/bin/$UPSTART_JOB
        DOCKER_OPTS=
        if [ -f /etc/default/$UPSTART_JOB ]; then
                . /etc/default/$UPSTART_JOB
        fi
        exec "$DOCKER" -d $DOCKER_OPTS
end script

# Don't emit "started" event until docker.sock is ready.
# See https://github.com/docker/docker/issues/6647
post-start script
        DOCKER_OPTS=
        if [ -f /etc/default/$UPSTART_JOB ]; then
                . /etc/default/$UPSTART_JOB
        fi
        if ! printf "%s" "$DOCKER_OPTS" | grep -qE -e '-H|--host'; then
                while ! [ -e /var/run/docker.sock ]; do
                        initctl status $UPSTART_JOB | grep -qE "(stop|respawn)/" && exit 1
                        echo "Waiting for /var/run/docker.sock"
                        sleep 0.1
                done
                echo "/var/run/docker.sock is up"
        fi
end script

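One caveat about limits: the open files 1024 in the ulimit -a output earlier is the interactive shell's limit, not the daemon's; the daemon should be picking up the limit nofile / limit nproc values from this job. A quick way to confirm what the running daemon actually got (assuming a single daemon process):

# limits upstart applied to the running daemon
cat /proc/$(pidof docker)/limits | egrep 'open files|processes'
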
mjsalinger commented Jun 17, 2015

We are also now seeing the below when running any Docker command on some instances, for example...

sudo docker ps

FATA[0000] Get http:///var/run/docker.sock/v1.18/containers/json?all=1: dial unix /var/run/docker.sock: resource temporarily unavailable. Are you trying to connect to a TLS-enabled
daemon without TLS?


LK4D4 (Contributor) commented Jun 17, 2015

@mjsalinger It's just a crappy error message. In most cases it means that the daemon crashed.

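A rough way to tell a crashed daemon from a hung one (a sketch; it assumes upstart, the default socket path, and an nc with unix-socket support, i.e. the OpenBSD variant Ubuntu ships):

# is the daemon process still there?
status docker
pidof docker

# does the API socket answer at all? a hung daemon accepts the
# connection but never replies; a dead one refuses it outright
printf 'GET /version HTTP/1.0\r\n\r\n' | nc -U /var/run/docker.sock

If the process is alive but the socket never answers, that points at a deadlock rather than a crash.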

cpuguy83 (Contributor) commented Jun 17, 2015

@mjsalinger What do the docker daemon logs say during this? /var/log/upstart/docker.log should be the location.


mjsalinger commented Jun 17, 2015

Frozen, no new entries coming in. Here are the last entries in the log:

INFO[46655] -job log(create, 48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46655] -job create() = OK (0)
INFO[46655] POST /containers/48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a/start
INFO[46655] +job start(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a)
INFO[46655] +job allocate_interface(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a)
INFO[46655] -job allocate_interface(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a) = OK (0)
INFO[46655] +job allocate_port(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a)
INFO[46655] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[46655] +job create()
INFO[46655] DELETE /containers/4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187?force=true
INFO[46655] +job rm(4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187)
INFO[46656] -job allocate_port(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a) = OK (0)
INFO[46656] +job log(start, 48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(start, 48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] +job log(create, 7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(create, 7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job create() = OK (0)
INFO[46656] +job log(die, 4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(die, 4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] +job release_interface(4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187)
INFO[46656] POST /containers/7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f/start
INFO[46656] +job start(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f)
INFO[46656] +job allocate_interface(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f)
INFO[46656] -job allocate_interface(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f) = OK (0)
INFO[46656] +job allocate_port(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f)
INFO[46656] -job release_interface(4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187) = OK (0)
INFO[46656] DELETE /containers/cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b?force=true
INFO[46656] +job rm(cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b)
INFO[46656] +job log(destroy, 4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(destroy, 4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job rm(4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187) = OK (0)
INFO[46656] +job log(die, cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(die, cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] +job release_interface(cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b)
INFO[46656] DELETE /containers/1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20?force=true
INFO[46656] +job rm(1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20)
INFO[46656] -job allocate_port(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f) = OK (0)
INFO[46656] +job log(start, 7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(start, 7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job start(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a) = OK (0)
INFO[46656] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[46656] +job create()
INFO[46656] +job log(create, 1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(create, 1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job create() = OK (0)
INFO[46656] +job log(die, 1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(die, 1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] +job release_interface(1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20)
INFO[46656] GET /containers/48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a/json
INFO[46656] +job container_inspect(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a)
INFO[46656] -job container_inspect(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a) = OK (0)
INFO[46656] POST /containers/1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830/start
INFO[46656] +job start(1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830)
INFO[46656] +job allocate_interface(1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830)
INFO[46656] -job allocate_interface(1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830) = OK (0)
INFO[46656] +job allocate_port(1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830)
INFO[46656] -job release_interface(cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b) = OK (0)
INFO[46656] -job start(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f) = OK (0)
INFO[46656] GET /containers/7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f/json
INFO[46656] +job container_inspect(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f)
INFO[46656] -job release_interface(1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20) = OK (0)
INFO[46656] -job container_inspect(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f) = OK (0)
INFO[46656] +job log(destroy, cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(destroy, cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job rm(cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b) = OK (0)
INFO[46656] +job log(destroy, 1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(destroy, 1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job rm(1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20) = OK (0)
INFO[46656] DELETE /containers/4cfeb48701f194cfd40f71d7883d82906d54a538084fa7be6680345e4651aa60?force=true
INFO[46656] +job rm(4cfeb48701f194cfd40f71d7883d82906d54a538084fa7be6680345e4651aa60)
INFO[46656] -job allocate_port(1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830) = OK (0)
INFO[46656] +job log(start, 1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(start, 1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830, kinvey/blrunner:v0.3.8) = OK (0)

mjsalinger commented Jun 19, 2015

@cpuguy83 Was the log helpful at all?


cpuguy83 (Contributor) commented Jun 19, 2015

@mjsalinger Makes me think there's a deadlock somewhere since there's nothing else indicating an issue.


mjsalinger commented Jun 19, 2015

@cpuguy83 That would make sense given the symptoms. Is there anything I can do to help further trace this issue and where it comes from?


cpuguy83 (Contributor) commented Jun 19, 2015

Maybe we can get an strace to see that it's actually hanging on a lock.


mjsalinger commented Jun 23, 2015

OK, we'll work on getting that. We wanted to try 1.7 first; we did, but we did not notice any improvement.


mjsalinger commented Jun 23, 2015

@cpuguy83 From one of the affected hosts:

root@<host>:~# strace -q -y -v -p 899
futex(0x134f1b8, FUTEX_WAIT, 0, NULL^C <detached ...>

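For what it's worth, I gather the main thread of a Go daemon sits in futex() even when it's healthy, so the single-PID trace above may not say much on its own. Here's what I'm collecting in addition (the SIGUSR1 goroutine dump landing in the daemon log is my assumption for 1.6/1.7, so please correct me if that's not in these versions):

# follow every daemon thread, not just the main one
sudo strace -f -tt -p $(pidof docker) 2>&1 | head -n 200

# ask the daemon to dump its goroutine stacks, then check
# /var/log/upstart/docker.log for the output
sudo kill -USR1 $(pidof docker)
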
mjsalinger commented Jun 2015

@cpuguy83 Any ideas?

mjsalinger commented Jun 25, 2015

Seeing the following in 1.7, with containers not being killed/started. This seems to be a precursor to the problem (note: we didn't see these errors in 1.6, but we did see a volume of dead containers start to build up even though a command had been issued to kill/remove them):

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container 09c12c9f72d461342447e822857566923d5532280f9ce25659d1ef3e54794484: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container 5e29407bb3154d6f5778676905d112a44596a23fd4a1b047336c3efaca6ada18: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container be22e8b24b70e24e5269b860055423236e4a2fca08969862c3ae3541c4ba4966: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container c067d14b67be8fb81922b87b66c0025ae5ae1ebed3d35dcb4052155afc4cafb4: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container 7f21c4fd8b6620eba81475351e8417b245f86b6014fd7125ba0e37c6684e3e42: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container b31531492ab7e767158601c438031c8e9ef0b50b9e84b0b274d733ed4fbe03a0: Link not found

Error spawning container: Error: HTTP code is 500 which indicates error: server error - Cannot start container 477dc3c691a12f74ea3f07af0af732082094418f6825f7c3d320bda0495504a1: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 32822 -j DNAT --to-destination 172.17.0.44:7000 ! -i docker0: iptables: No chain/target/match by that name.
 (exit status 1)

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container 6965eec076db173f4e3b9a05f06e1c87c02cfe849821bea4008ac7bd0f1e083a: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container 7c721dd2b8d31b51427762ac1d0fe86ffbb6e1d7462314fdac3afe1f46863ff1: Link not found

Error spawning container: Error: HTTP code is 500 which indicates error: server error - Cannot start container c908d059631e66883ee1a7302c16ad16df3298ebfec06bba95232d5f204c9c75: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 32837 -j DNAT --to-destination 172.17.0.47:7000 ! -i docker0: iptables: No chain/target/match by that name.
 (exit status 1)

Error spawning container: Error: HTTP code is 500 which indicates error: server error - Cannot start container c3e11ffb82fe08b8a029ce0a94e678ad46e3d2f3d76bed7350544c6c48789369: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 32847 -j DNAT --to-destination 172.17.0.48:7000 ! -i docker0: iptables: No chain/target/match by that name.
 (exit status 1)

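When the iptables errors start appearing I'm also checking whether the DOCKER nat chain still exists at all, since anything that flushes the nat table would explain "No chain/target/match by that name". Rough check, assuming stock iptables:

# list the chain that -A DOCKER is failing to find
sudo iptables -t nat -L DOCKER -n --line-numbers

# restarting the daemon recreates the chain if it was flushed
sudo service docker restart
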
mjsalinger commented Jun 26, 2015

Tail of docker.log from a recent incident:

INFO[44089] DELETE /containers/4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25?force=true
INFO[44089] +job rm(4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25)
INFO[44089] DELETE /containers/a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96?force=true
INFO[44089] +job rm(a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96)
INFO[44089] +job log(die, 4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25, kinvey/blrunner:v0.3.8)
INFO[44089] -job log(die, 4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44089] +job release_interface(4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25)
INFO[44089] -job release_interface(4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25) = OK (0)
INFO[44089] +job log(die, a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96, kinvey/blrunner:v0.3.8)
INFO[44089] -job log(die, a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44089] +job release_interface(a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96)
INFO[44089] -job release_interface(a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96) = OK (0)
INFO[44092] +job log(destroy, 285274ee9c5b3bfa9fcea4d93b75e7e51949752b8d0eb101a31ea4f9aec5dad6, kinvey/blrunner:v0.3.8)
INFO[44092] -job log(destroy, 285274ee9c5b3bfa9fcea4d93b75e7e51949752b8d0eb101a31ea4f9aec5dad6, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44092] -job rm(285274ee9c5b3bfa9fcea4d93b75e7e51949752b8d0eb101a31ea4f9aec5dad6) = OK (0)
INFO[44092] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[44092] +job create()
INFO[44097] +job log(destroy, 4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25, kinvey/blrunner:v0.3.8)
INFO[44097] -job log(destroy, 4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44097] -job rm(4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25) = OK (0)
INFO[44097] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[44097] +job create()
INFO[44098] +job log(destroy, c80a39f060f200f1aff8ae52538779542437745e4184ed02793f8873adcb9cd4, kinvey/blrunner:v0.3.8)
INFO[44098] -job log(destroy, c80a39f060f200f1aff8ae52538779542437745e4184ed02793f8873adcb9cd4, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44098] -job rm(c80a39f060f200f1aff8ae52538779542437745e4184ed02793f8873adcb9cd4) = OK (0)
INFO[44098] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[44098] +job create()
INFO[44098] +job log(create, 3b9a4635c068989ddb1983aa12460083e874d50fd42c743033ed3a08000eb7e9, kinvey/blrunner:v0.3.8)
INFO[44098] -job log(create, 3b9a4635c068989ddb1983aa12460083e874d50fd42c743033ed3a08000eb7e9, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44098] -job create() = OK (0)
INFO[44098] POST /containers/3b9a4635c068989ddb1983aa12460083e874d50fd42c743033ed3a08000eb7e9/start
INFO[44098] +job start(3b9a4635c068989ddb1983aa12460083e874d50fd42c743033ed3a08000eb7e9)
INFO[44102] +job log(destroy, a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96, kinvey/blrunner:v0.3.8)
INFO[44102] -job log(destroy, a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44102] -job rm(a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96) = OK (0)
INFO[44102] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[44102] +job create()

mjsalinger commented Jul 1, 2015

@cpuguy83 @LK4D4 Any ideas/updates?


LK4D4 (Contributor) commented Jul 1, 2015

I have no idea what to think. It isn't a software deadlock, because even docker info is hanging. Could this be a memory leak?


mjsalinger commented Jul 1, 2015

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
root 20 0 1314832 89688 11568 S 0.3 2.3 65:47.56 docker

That's what the Docker process appears to be using; it doesn't look much like a memory leak to me...


nicolasnoble commented Jul 10, 2015

I'll comment on this one to say "us too". We're using Docker for our testing infrastructure, and we're hit by this under the same conditions / scenarios. It's going to impact our ability to use Docker in any meaningful way if it locks up under load.

I'd be happy to contribute troubleshooting results if you point me at what information you need.


filex commented Aug 5, 2015

I just ran into the same issue on a CentOS7 host.

$ docker version
Client version: 1.6.2
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): ba1f6c3/1.6.2
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): ba1f6c3/1.6.2
OS/Arch (server): linux/amd64
$ docker info
Containers: 7
Images: 672
Storage Driver: devicemapper
 Pool Name: docker-253:2-67171716-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: xfs
 Data file: /dev/mapper/vg01-docker--data
 Metadata file: /dev/mapper/vg01-docker--metadata
 Data Space Used: 54.16 GB
 Data Space Total: 59.06 GB
 Data Space Available: 4.894 GB
 Metadata Space Used: 53.88 MB
 Metadata Space Total: 5.369 GB
 Metadata Space Available: 5.315 GB
 Udev Sync Supported: true
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Kernel Version: 3.10.0-229.7.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 8
Total Memory: 31.25 GiB

This happened on our CI system. I changed the number of parallel jobs and then started 4 jobs in parallel, so there were 4-6 containers running. Load was about 10 (with some of the 8 cores still idle).

While all containers were running fine, dockerd itself was stuck. docker images would still show images, but all daemon commands like docker ps or docker info stalled.

My strace was similar to the one above:

strace -p 1194
Process 1194 attached
futex(0x11d2838, FUTEX_WAIT, 0, NULL

After a while all containers were done with their jobs (compiling, testing, ...), but none of them would "return". It looks as if they were waiting on the dockerd themselves.

When I finally killed dockerd, the containers terminated with a message like this:

time="2015-08-05T15:59:32+02:00" level=fatal msg="Post http:///var/run/docker.sock/v1.18/containers/9117731bd16a451b89fd938f569c39761b5f8d6df505256e172738e0593ba125/wait: EOF. Are you trying to connect to a TLS-enabled daemon without TLS?" 

@fishnix

fishnix commented Aug 12, 2015

We are seeing the same thing as @filex on CentOS 7 in our CI environment.

Containers: 1
Images: 19
Storage Driver: devicemapper
 Pool Name: docker--vg-docker--pool
 Pool Blocksize: 524.3 kB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 2.611 GB
 Data Space Total: 32.17 GB
 Data Space Available: 29.56 GB
 Metadata Space Used: 507.9 kB
 Metadata Space Total: 54.53 MB
 Metadata Space Available: 54.02 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-229.11.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 2
Total Memory: 7.389 GiB
Name: ip-10-1-2-234
ID: 5YVL:O32X:4NNA:ICSJ:RSYS:CNCI:6QVC:C5YR:XGR4:NQTW:PUSE:YFTA
Client version: 1.7.1
Client API version: 1.19
Package Version (client): docker-1.7.1-108.el7.centos.x86_64
Go version (client): go1.4.2
Git commit (client): 3043001/1.7.1
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Package Version (server): docker-1.7.1-108.el7.centos.x86_64
Go version (server): go1.4.2
Git commit (server): 3043001/1.7.1
OS/Arch (server): linux/amd64

@liquid-sky

liquid-sky commented Nov 8, 2016

I think this might be related to this issue, in the sense that we run a container every minute with --rm and Docker hangs on that particular container; however, some nodes where docker ps is hanging do not have the "unregistered loopback device" message.

Just in case, a snippet of strace -f of Docker daemon:

[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  1766] write(23, "\2\0\0\0\0\0\0\321I1108 10:48:12.429657   "..., 217) = 217
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1766] futex(0xc822075d08, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  1507] clock_gettime(CLOCK_MONOTONIC, {514752, 140621361}) = 0
[pid  1875] <... epoll_wait resumed> {{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1298424080, u64=140528333381904}}}, 128, -1) = 1
[pid  1507] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1875] epoll_wait(6,  <unfinished ...>
[pid  1507] <... clock_gettime resumed> {1478602092, 431376727}) = 0
[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  1514] <... read resumed> "I1108 10:48:12.431656    12 mast"..., 32768) = 1674
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1514] futex(0xc822075d08, FUTEX_WAKE, 1 <unfinished ...>
[pid  1507] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid  1766] <... futex resumed> )       = 0
[pid  1514] <... futex resumed> )       = 1
[pid  1766] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] <... clock_gettime resumed> {514752, 142476308}) = 0
[pid  1514] <... clock_gettime resumed> {1478602092, 432632124}) = 0
[pid  1507] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.431656   "..., 349 <unfinished ...>
[pid  1507] <... clock_gettime resumed> {1478602092, 432677401}) = 0
[pid  1514] <... write resumed> )       = 349
[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  1514] clock_gettime(CLOCK_REALTIME, {1478602092, 432812895}) = 0
[pid  1766] <... select resumed> )      = 0 (Timeout)
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.431763   "..., 349 <unfinished ...>
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1514] <... write resumed> )       = 349
[pid  1507] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] <... clock_gettime resumed> {514752, 142831922}) = 0
[pid  1514] <... clock_gettime resumed> {1478602092, 432948135}) = 0
[pid  1507] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.431874   "..., 349 <unfinished ...>
[pid  1507] <... clock_gettime resumed> {1478602092, 432989394}) = 0
[pid  1514] <... write resumed> )       = 349
[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  1766] read(44, "I1108 10:48:12.432255    12 mast"..., 32768) = 837
[pid  1514] clock_gettime(CLOCK_REALTIME, {1478602092, 433114452}) = 0
[pid  1766] read(44,  <unfinished ...>
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.431958   "..., 349) = 349
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid  1514] <... clock_gettime resumed> {1478602092, 433272783}) = 0
[pid  1507] <... clock_gettime resumed> {514752, 143176397}) = 0
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.432035   "..., 349 <unfinished ...>
[pid  1507] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1514] <... write resumed> )       = 349
[pid  1507] <... clock_gettime resumed> {1478602092, 433350170}) = 0
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  1514] <... clock_gettime resumed> {1478602092, 433416138}) = 0
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.432126   "..., 349) = 349
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1514] <... clock_gettime resumed> {1478602092, 433605782}) = 0
[pid  1507] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.432255   "..., 349 <unfinished ...>
[pid  1507] <... clock_gettime resumed> {514752, 143537029}) = 0
[pid  1514] <... write resumed> )       = 349
[pid  1507] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1514] clock_gettime(CLOCK_REALTIME, {1478602092, 433748730}) = 0
[pid  1507] <... clock_gettime resumed> {1478602092, 433752245}) = 0
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.432343   "..., 348 <unfinished ...>
[pid  1507] futex(0xc8211b2d08, FUTEX_WAKE, 1 <unfinished ...>
[pid  1514] <... write resumed> )       = 348
[pid  1507] <... futex resumed> )       = 1
[pid 10488] <... futex resumed> )       = 0
[pid 10488] write(23, "\2\0\0\0\0\0\t\317I1108 10:48:12.431656   "..., 2519 <unfinished ...>
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 10488] <... write resumed> )       = 2519
[pid 10488] futex(0xc8211b2d08, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  1875] <... epoll_wait resumed> {{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1298424080, u64=140528333381904}}}, 128, -1) = 1
[pid  1514] <... clock_gettime resumed> {1478602092, 434094221}) = 0
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.432445   "..., 349) = 349
[pid  1514] futex(0xc820209908, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1507] clock_gettime(CLOCK_MONOTONIC, {514752, 144370753}) = 0
[pid  1507] futex(0x224fe10, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
[pid  1875] epoll_wait(6,  <unfinished ...>
[pid  1683] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1683] clock_gettime(CLOCK_MONOTONIC, {514752, 292709791}) = 0
[pid  1683] futex(0x224fe10, FUTEX_WAKE, 1) = 1
[pid  1507] <... futex resumed> )       = 0

@depay

depay commented Nov 9, 2016

I am using devicemapper with loopback on CentOS 7.

Not sure what happened, but dmsetup udevcookies followed by dmsetup udevcomplete <cookie> works for me.

@thaJeztah could you help explain what happened?
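
A minimal sketch of that workaround (these are standard lvm2 dmsetup commands; the cookie value below is only an illustration, use whatever udevcookies reports on your host):

# List outstanding device-mapper udev cookies (the semaphores the daemon may be waiting on)
sudo dmsetup udevcookies

# Complete one stuck cookie reported above (0xd4d9263 is a made-up example value)
sudo dmsetup udevcomplete 0xd4d9263

# Or complete every outstanding cookie at once
sudo dmsetup udevcomplete_all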

@tonistiigi

tonistiigi commented Nov 9, 2016

Member

I'm repeating what has been commented many times before in here: if you encounter a hang, please send a SIGUSR1 signal to the daemon process and copy the trace that is written to the logs. strace output is not useful for debugging hangs; it is likely from a client process and usually doesn't even show whether the process is stuck or just sleeping.
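
A minimal sketch of capturing that dump (the daemon binary is dockerd on 1.12+ and docker on earlier releases; the journalctl unit name assumes a systemd host):

# Ask the daemon to dump all goroutine stacks into its log
sudo kill -USR1 "$(pidof dockerd || pidof docker)"

# Then pull the trace out of the daemon logs, e.g. with systemd:
sudo journalctl -u docker --no-pager | tail -n 500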

@crawford referenced this issue in coreos/bugs Nov 9, 2016: docker ps hangs #1654 (Closed)

@AaronDMarasco-VSI

AaronDMarasco-VSI commented Nov 17, 2016

I have a USR1 dump, as @tonistiigi asked for, from a hang-under-load.

$ docker info
Containers: 18
 Running: 15
 Paused: 0
 Stopped: 3
Images: 232
Server Version: 1.12.3
Storage Driver: devicemapper
 Pool Name: vg_ex-docker--pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 107.4 GB
 Backing Filesystem: xfs
 Data file: 
 Metadata file: 
 Data Space Used: 132.8 GB
 Data Space Total: 805.3 GB
 Data Space Available: 672.5 GB
 Metadata Space Used: 98.46 MB
 Metadata Space Total: 4.001 GB
 Metadata Space Available: 3.903 GB
 Thin Pool Minimum Free Space: 80.53 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null bridge host overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.36.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 56
Total Memory: 251.6 GiB
Name: server.masked.example.com
ID: Z7J7:CLEG:KZPK:TNEI:QLYL:JBTO:XNM4:NX2X:KPQC:VHPC:6SFS:G4GR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

docker.txt

@coolljt0725

coolljt0725 commented Nov 17, 2016

Contributor

From the provided log, it seems it is stuck on dm_udev_wait:

goroutine 2747 [syscall, 12 minutes, locked to thread]:
github.com/docker/docker/pkg/devicemapper._Cfunc_dm_udev_wait(0xc80d4d887c, 0xc800000000)
\t??:0 +0x41
github.com/docker/docker/pkg/devicemapper.dmUdevWaitFct(0xd4d887c, 0x44288e)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper_wrapper.go:222 +0x22
github.com/docker/docker/pkg/devicemapper.UdevWait(0xc821c7e8f8, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:259 +0x3b
github.com/docker/docker/pkg/devicemapper.activateDevice(0xc82021a880, 0x1e, 0xc8202f0cc0, 0x57, 0x9608, 0x1900000000, 0x0, 0x0, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:771 +0x8b8
github.com/docker/docker/pkg/devicemapper.ActivateDevice(0xc82021a880, 0x1e, 0xc8202f0cc0, 0x57, 0x9608, 0x1900000000, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:732 +0x78
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).activateDeviceIfNeeded(0xc8200b6340, 0xc8208e0a40, 0xc8200b6300, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:539 +0x58d
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).MountDevice(0xc8200b6340, 0xc820a00200, 0x40, 0xc820517ab0, 0x61, 0x0, 0x0, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:2269 +0x2cd
github.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).Get(0xc82047bf40, 0xc820a00200, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:185 +0x62f
github.com/docker/docker/daemon/graphdriver.(*NaiveDiffDriver).Get(0xc8201c2680, 0xc820a00200, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
\t<autogenerated>:30 +0xa6

similar issues #27900 #27543

@AaronDMarasco-VSI Have you tried dmsetup udevcomplete_all?

@AaronDMarasco-VSI

AaronDMarasco-VSI commented Nov 17, 2016

@coolljt0725 No, sorry, after I dumped the log for reporting, I restarted the daemon without playing with it any more.

@liquid-sky

liquid-sky commented Nov 17, 2016

@tonistiigi, full stack trace from Docker daemon after SIGUSR1 can be found here

@tonistiigi

tonistiigi commented Nov 17, 2016

Member

The trace from @liquid-sky seems to be blocked on a netlink socket:

goroutine 13038 [syscall, 9838 minutes, locked to thread]:
syscall.Syscall6(0x2d, 0x16, 0xc820fc3000, 0x1000, 0x0, 0xc8219bc028, 0xc8219bc024, 0x0, 0x300000002, 0x66b5a6)
    /usr/lib/go1.6/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.recvfrom(0x16, 0xc820fc3000, 0x1000, 0x1000, 0x0, 0xc8219bc028, 0xc8219bc024, 0x427289, 0x0, 0x0)
    /usr/lib/go1.6/src/syscall/zsyscall_linux_amd64.go:1712 +0x9e
syscall.Recvfrom(0x16, 0xc820fc3000, 0x1000, 0x1000, 0x0, 0x1000, 0x0, 0x0, 0x0, 0x0)
    /usr/lib/go1.6/src/syscall/syscall_unix.go:251 +0xb9
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0xc820c6c5e0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go:341 +0xae
github.com/vishvananda/netlink/nl.(*NetlinkRequest).Execute(0xc820dded20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go:228 +0x20b
github.com/vishvananda/netlink.LinkSetMasterByIndex(0x7f1e88c67c20, 0xc820e00630, 0x3, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:231 +0x3b0
github.com/vishvananda/netlink.LinkSetMaster(0x7f1e88c67c20, 0xc820e00630, 0xc8219bc550, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:205 +0xb7
github.com/docker/libnetwork/drivers/bridge.addToBridge(0xc820c78830, 0xb, 0x177bf88, 0x7, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/docker/libnetwork/drivers/bridge/bridge.go:782 +0x275
github.com/docker/libnetwork/drivers/bridge.(*driver).CreateEndpoint(0xc820316640, 0xc8201f6840, 0x40, 0xc821879500, 0x40, 0x7f1e88c67bd8, 0xc820f38580, 0xc821a87bf0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/docker/libnetwork/drivers/bridge/bridge.go:1004 +0x1436
github.com/docker/libnetwork.(*network).addEndpoint(0xc820f026c0, 0xc8209f2300, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/docker/libnetwork/network.go:749 +0x21b
github.com/docker/libnetwork.(*network).CreateEndpoint(0xc820f026c0, 0xc820eb1909, 0x6, 0xc8210af160, 0x4, 0x4, 0x0, 0x0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/docker/libnetwork/network.go:813 +0xaa7
github.com/docker/docker/daemon.(*Daemon).connectToNetwork(0xc82044e480, 0xc820e661c0, 0x177aea8, 0x6, 0xc820448c00, 0x0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/daemon/container_operations.go:539 +0x39f
github.com/docker/docker/daemon.(*Daemon).allocateNetwork(0xc82044e480, 0xc820e661c0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/daemon/container_operations.go:401 +0x373
github.com/docker/docker/daemon.(*Daemon).initializeNetworking(0xc82044e480, 0xc820e661c0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/daemon/container_operations.go:680 +0x459
github.com/docker/docker/daemon.(*Daemon).containerStart(0xc82044e480, 0xc820e661c0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/daemon/start.go:125 +0x219
github.com/docker/docker/daemon.(*Daemon).ContainerStart(0xc82044e480, 0xc820ed61d7, 0x40, 0x0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/daemon/start.go:75 +0x575

cc @aboch

@aboch

aboch commented Nov 17, 2016

Contributor

As others have suggested, this is likely a side effect of something very bad happening in the kernel.
We are adding docker/libnetwork#1557 to attempt to mitigate this situation.

@mblaschke

mblaschke commented Nov 26, 2016

I may have another two or three stack traces with some error messages (taken from the journalctl syslog).
I'm running some rspec/serverspec tests for Docker images, and in this state Docker seems not to respond. The test has been stuck for more than 20 minutes (it will retry 5 times if it fails).
Each image's tests normally finish in less than 5 minutes; this run has now been going for at least 30 to 60 minutes.

https://gist.github.com/mblaschke/d5d38cb34f6178e50e465c6e1f02131c

uname -a:
Linux webdevops.infogene.fr 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

docker info:
Containers: 5
Running: 1
Paused: 0
Stopped: 4
Images: 1208
Server Version: 1.12.3
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 449
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-47-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 7
Total Memory: 11.73 GiB
Name: webdevops.infogene.fr
ID: DGNM:G3MV:ZMMV:HY6K:UPLU:CMEV:VHCA:RVY3:CRV7:FS5M:KMVQ:MQO5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8

@mblaschke

Here is also the result output from serverspec:

https://gist.github.com/mblaschke/21a521367a2a2b71301607482647a748

@tonistiigi

tonistiigi commented Nov 28, 2016

Member

@mblaschke I looked over your traces (https://gist.github.com/tonistiigi/0fb0abfb068a89975c072b68e6ed07ce for a better view). I can't find anything suspicious in there, though. All long-running goroutines are from open io copies, which is normal if there are running containers or execs, as these goroutines don't hold any locks. From the traces I would expect that other commands like docker ps or docker exec/stop were not blocked in your case, and that the hang you saw was only that you expected a specific container or exec to finish but it didn't. This may be related to either the application itself hanging on something or containerd not sending us any events about the container (cc @mlaventure).

The warnings you have in the logs should be fixed with https://github.com/docker/containerd/pull/351 . They should only cause spam and not be anything serious. Because debug logging is not enabled I can't see if there are any suspicious commands sent to the daemon. There don't seem to be any meaningful logs for minutes before you took the stacktrace.

@mblaschke

mblaschke commented Nov 29, 2016

The same code works with Docker 1.10.3, but not after Docker 1.11.x. The serverspec tests fail randomly with timeouts.
I have the feeling that it happens more often when we run the tests in parallel (e.g. running 2 or 4 tests on a 4-core CPU) than in serial mode, but the result is the same: there is no test run which completes successfully.
With Docker 1.11 the whole Docker daemon was hanging; since Docker 1.12 only the exec fails or runs into a timeout.

@mlaventure

mlaventure commented Nov 29, 2016

Contributor

@mblaschke I had a look at the trace too; it really looks like the exec is just not finishing with its IO.

What exec program are you executing? Does it fork new processes within its own session id?
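
As a quick sketch of how to check that from outside: the sid column in ps is the session id, so a helper that daemonizes itself inside the container will show a SID equal to its own PID (this assumes the image ships a procps-style ps that understands -eo; <container_name> is a placeholder):

# Compare PID, parent PID and session ID of everything in the container
docker exec <container_name> ps -eo pid,ppid,sid,comm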

@GameScripting

GameScripting commented Nov 29, 2016

We are executing docker exec ip addr on a regular basis in all containers to find their IP addresses assigned by https://github.com/rancher/rancher , which are unfortunately not part of the Docker metadata (yet).

We are seeing the hanging problem on a regular basis within our production cluster.
With ip addr there should not be much magic with processes running within their own session id, right?
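
For context, a rough sketch of that kind of polling (the one-minute interval and the eth0 interface name are assumptions for illustration, not taken from their setup):

# Every minute, ask each running container for its eth0 address via docker exec
while true; do
  for c in $(docker ps --format '{{.Names}}'); do
    echo -n "$c "
    docker exec "$c" ip -4 -o addr show eth0 | awk '{print $4}'
  done
  sleep 60
done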

@mblaschke

mblaschke commented Nov 29, 2016

@mlaventure
We're using serverspec/rspec for testing and are executing very simple tests (file checks, simple command checks, simple application checks). Most of the time even simple file permission checks are failing.

Tests are here (but they need some environment settings for execution):
https://github.com/webdevops/Dockerfile/tree/develop/tests/serverspec

You could try it with our code base:

  1. make requirements for fetching all the python (pip) and ruby modules
  2. bin/console docker:pull --threads=auto (fetch ~180 images from the hub)
  3. bin/console test:serverspec --threads=auto to run all tests in parallelized mode, or bin/console test:serverspec in serial mode

@mlaventure

mlaventure commented Nov 29, 2016

Contributor
@GameScripting I'm getting confused now 😅 . Are you referring to the same context that @mblaschke is running from?

If not, which docker version are you using?

And to answer your question, no, it's unlikely that ip addr would do anything like this.

Which image are you using? What is the exact docker exec command being used?

@GameScripting

GameScripting commented Nov 29, 2016

Sorry, it was not my intention to add more confusion.
I am not referring to the same context, and I do not use the test suite he was referring to. I wanted to provide more (hopefully useful) context and/or info on what we are doing that might be causing the issue.

The main issue in resolving this bug is that no one has yet been able to come up with stable, reproducible steps to trigger the hang.

It seems like @mblaschke found something, so he is able to reliably trigger the bug.

@wallneradam

wallneradam commented Nov 29, 2016

I use CoreOS stable (1185.3.0) with Docker 1.11.2.
I run a watch with docker exec against my Redis container to check some variables. After a day or so, the Docker daemon hangs.

But I've found a workaround until you find a solution. I use the ctr utility from containerd (https://github.com/docker/containerd) to start a process inside the running container. It can also be used to start and stop containers. I think it is integrated into Docker 1.11.2.
Unfortunately it has another bug on CoreOS which I reported here: https://github.com/docker/containerd/issues/356#event-874038823 (they have since fixed it), so you need to regularly clean /tmp, but at least it has been working for me for a week in production.

The following docker exec example:

docker exec -i container_name /bin/sh 'echo "Hello"'

can be translated to ctr:

/usr/bin/ctr --address=/run/docker/libcontainerd/docker-containerd.sock containers exec --id=${id_of_the_running_container} --pid=any_unique_process_name -a --cwd=/ /bin/sh -c 'echo "Hello"'

So you can translate scripts temporarily to ctr as a workaround.

@tonistiigi

tonistiigi commented Nov 30, 2016

Member

@mblaschke The commands you posted did fail for me, but they don't look like Docker failures: https://gist.github.com/tonistiigi/86badf5a41dff3fe53bd68d8e83e4ec4 Could you enable debug logs? The master build also stores more debug data about daemon internals and allows tracing containerd as well with SIGUSR1 signals. Because we are tracking a stuck process, even ps aux could help.
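
For reference, a minimal sketch of enabling daemon debug logging (the daemon.json path and the SIGHUP reload assume Docker 1.12+ on a standard systemd Linux host; merge the setting into any existing daemon.json rather than overwriting it):

# Add  { "debug": true }  to /etc/docker/daemon.json, then ask the daemon to
# reload its configuration ("debug" is hot-reloadable, so no restart is needed):
sudo kill -HUP "$(pidof dockerd)"

# Follow the debug-level output:
sudo journalctl -u docker -f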

@mblaschke

mblaschke commented Nov 30, 2016

@tonistiigi
I forgot the alpine blacklist (there are currently build problems), so please run: bin/console test:serverspec --threads=auto --blacklist=:alpine

Sorry :(

@tonistiigi

tonistiigi commented Dec 1, 2016

Member

@mblaschke This still doesn't look like a Docker issue. It is hanging on docker exec <id> /usr/bin/php -i. If I trace the image being used and start it manually, I see that this image doesn't have real PHP installed:

root@254424aecc57:/# php -v
HipHop VM 3.11.1 (rel)
Compiler: 3.11.1+dfsg-1ubuntu1
Repo schema: 2f678922fc70b326c82e56bedc2fc106c2faca61

And this HipHop VM doesn't support -i; it just blocks.

@mblaschke

mblaschke commented Dec 1, 2016

@tonistiigi
I've been hunting this bug across recent Docker versions for a long time and didn't spot this problem in our test suite. We've fixed it, thanks and sorry /o\

I've searched the build logs and found one test failure (a random issue, not in the hhvm tests) when using 1.12.3. We will continue to stress Docker and try to find the issue.

@AaronDMarasco-VSI

AaronDMarasco-VSI commented Dec 6, 2016

@coolljt0725 I just came back to work to find Docker hung again. docker ps hung in one session, and sudo service docker stop hung as well. Opened a third session:

$ sudo lvs
  LV          VG     Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool vg_ex  twi-aot--- 750.00g             63.49  7.88                            
$ sudo dmsetup udevcomplete_all
This operation will destroy all semaphores with keys that have a prefix 3405 (0xd4d).
Do you really want to continue? [y/n]: y
2 semaphores with keys prefixed by 3405 (0xd4d) destroyed. 0 skipped.

As soon as that completed, the other two sessions unblocked. I was checking disk space because that had been a problem once before, so it was the first place I looked. The 2 semaphores may have been the two calls I had, but there were many hung docker calls from my Jenkins server, which is why I was poking around at all...

An example of hung output after I did the dmsetup call; the docker run had been started days ago:

docker run -td -v /opt/Xilinx/:/opt/Xilinx/:ro -v /opt/Modelsim/:/opt/Modelsim/:ro -v /data/jenkins_workspace_modelsim/workspace/examples_hdl/APPLICATION/bias/HDL_PLATFORM/modelsim_pf/TARGET_OS/7:/build --name jenkins-examples_hdl-APPLICATION--bias--HDL_PLATFORM--modelsim_pf--TARGET_OS--7-546 jenkins/build:v3-C7 /bin/sleep 10m
docker: An error occurred trying to connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/create?name=jenkins-examples_hdl-APPLICATION--bias--HDL_PLATFORM--modelsim_pf--TARGET_OS--7-546: EOF.
See 'docker run --help'.

@Calpicow

This comment has been minimized.
@tonistiigi

tonistiigi commented Dec 21, 2016

Member

The trace from @Calpicow seems to be stuck on devicemapper, but it doesn't look like the udev_wait case. @coolljt0725 @rhvgoyal

goroutine 488 [syscall, 10 minutes, locked to thread]:
github.com/docker/docker/pkg/devicemapper._C2func_dm_task_run(0x7fbffc010f80, 0x0, 0x0, 0x0)
	??:0 +0x47
github.com/docker/docker/pkg/devicemapper.dmTaskRunFct(0x7fbffc010f80, 0xc800000001)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper_wrapper.go:96 +0x21
github.com/docker/docker/pkg/devicemapper.(*Task).run(0xc8200266d8, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:155 +0x37
github.com/docker/docker/pkg/devicemapper.SuspendDevice(0xc821b07380, 0x55, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:627 +0x99
github.com/docker/docker/pkg/devicemapper.CreateSnapDevice(0xc821013f80, 0x25, 0x1a, 0xc821b07380, 0x55, 0x17, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:759 +0x92
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).createRegisterSnapDevice(0xc8203be9c0, 0xc82184d1c0, 0x40, 0xc821a78f40, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:860 +0x557
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).AddDevice(0xc8203be9c0, 0xc82184d1c0, 0x40, 0xc8219c47c0, 0x40, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:1865 +0x81f
github.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).Create(0xc8200ff770, 0xc82184d1c0, 0x40, 0xc8219c47c0, 0x40, 0x0, 0x0, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:124 +0x5f
github.com/docker/docker/daemon/graphdriver.(*NaiveDiffDriver).Create(0xc820363940, 0xc82184d1c0, 0x40, 0xc8219c47c0, 0x40, 0x0, 0x0, 0x0, 0x0)
	<autogenerated>:24 +0xaa
github.com/docker/docker/layer.(*layerStore).Register(0xc820363980, 0x7fc02faa3898, 0xc821a6ae00, 0xc821ace370, 0x47, 0x0, 0x0, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/layer/layer_store.go:266 +0x382
github.com/docker/docker/distribution/xfer.(*LayerDownloadManager).makeDownloadFunc.func1.1(0xc8210cd740, 0xc8210cd680, 0x7fc02fb6b758, 0xc820f422d0, 0xc820ee15c0, 0xc820ee1680, 0xc8210cd6e0, 0xc820e8e0b0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/distribut
ion/xfer/download.go:316 +0xc01
created by github.com/docker/docker/distribution/xfer.(*LayerDownloadManager).makeDownloadFunc.func1
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/distribution/xfer/download.go:341 +0x191
Member

tonistiigi commented Dec 21, 2016

The trace from @Calpicow seems to be stuck on devicemapper, but it doesn't look like the udev_wait case. @coolljt0725 @rhvgoyal

goroutine 488 [syscall, 10 minutes, locked to thread]:
github.com/docker/docker/pkg/devicemapper._C2func_dm_task_run(0x7fbffc010f80, 0x0, 0x0, 0x0)
	??:0 +0x47
github.com/docker/docker/pkg/devicemapper.dmTaskRunFct(0x7fbffc010f80, 0xc800000001)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper_wrapper.go:96 +0x21
github.com/docker/docker/pkg/devicemapper.(*Task).run(0xc8200266d8, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:155 +0x37
github.com/docker/docker/pkg/devicemapper.SuspendDevice(0xc821b07380, 0x55, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:627 +0x99
github.com/docker/docker/pkg/devicemapper.CreateSnapDevice(0xc821013f80, 0x25, 0x1a, 0xc821b07380, 0x55, 0x17, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:759 +0x92
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).createRegisterSnapDevice(0xc8203be9c0, 0xc82184d1c0, 0x40, 0xc821a78f40, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:860 +0x557
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).AddDevice(0xc8203be9c0, 0xc82184d1c0, 0x40, 0xc8219c47c0, 0x40, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:1865 +0x81f
github.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).Create(0xc8200ff770, 0xc82184d1c0, 0x40, 0xc8219c47c0, 0x40, 0x0, 0x0, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:124 +0x5f
github.com/docker/docker/daemon/graphdriver.(*NaiveDiffDriver).Create(0xc820363940, 0xc82184d1c0, 0x40, 0xc8219c47c0, 0x40, 0x0, 0x0, 0x0, 0x0)
	<autogenerated>:24 +0xaa
github.com/docker/docker/layer.(*layerStore).Register(0xc820363980, 0x7fc02faa3898, 0xc821a6ae00, 0xc821ace370, 0x47, 0x0, 0x0, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/layer/layer_store.go:266 +0x382
github.com/docker/docker/distribution/xfer.(*LayerDownloadManager).makeDownloadFunc.func1.1(0xc8210cd740, 0xc8210cd680, 0x7fc02fb6b758, 0xc820f422d0, 0xc820ee15c0, 0xc820ee1680, 0xc8210cd6e0, 0xc820e8e0b0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/distribution/xfer/download.go:316 +0xc01
created by github.com/docker/docker/distribution/xfer.(*LayerDownloadManager).makeDownloadFunc.func1
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/distribution/xfer/download.go:341 +0x191
Contributor

coolljt0725 commented Dec 22, 2016

@Calpicow Do you have any dmesg logs? And can you show the output of dmsetup status?

Contributor

ahmetb commented Jan 2, 2017

My docker ps hangs, but other commands like info/restart/images work just fine.

I have a huge SIGUSR1 dump below as well (didn't fit here). https://gist.github.com/ahmetalpbalkan/34bf40c02a78e319eaf5710acb15cf9a

  • docker Server Version: 1.11.2
  • Storage Driver: overlay
  • Kernel Version: 4.7.3-coreos-r3
  • Operating System: CoreOS 1185.5.0 (MoreOS)

It looks like I have a ton (like 700) of these goroutines:

...
goroutine 1149 [chan send, 1070 minutes]:
github.com/vishvananda/netlink.LinkSubscribe.func2(0xc821159d40, 0xc820fa4c60)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:898 +0x2de
created by github.com/vishvananda/netlink.LinkSubscribe
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:901 +0x107

goroutine 442 [chan send, 1129 minutes]:
github.com/vishvananda/netlink.LinkSubscribe.func2(0xc8211eb380, 0xc821095920)
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:898 +0x2de
created by github.com/vishvananda/netlink.LinkSubscribe
	/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:901 +0x107
...
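For context, the [chan send, 1070 minutes] lines above are the classic signature of a goroutine parked on a channel send that no one reads any more. The sketch below is not the netlink code itself, just a self-contained illustration of that failure mode and the usual escape hatch (a done channel in a select):

// Illustrative sketch of the leak pattern in the traces above: a producer
// goroutine blocked on "chan send" forever because the consumer stopped
// reading. A done channel lets it exit instead of piling up.
package main

import (
	"fmt"
	"time"
)

func subscribe(ch chan<- int, done <-chan struct{}) {
	for i := 0; ; i++ {
		select {
		case ch <- i: // a plain send here would block forever once the reader is gone
		case <-done: // gives the goroutine a way out, avoiding the pile-up
			return
		}
	}
}

func main() {
	ch := make(chan int)
	done := make(chan struct{})
	go subscribe(ch, done)

	fmt.Println(<-ch) // consume one event, then walk away
	close(done)       // without this, the goroutine parks in "chan send" for good
	time.Sleep(100 * time.Millisecond)
}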
Contributor

cpuguy83 commented Jan 2, 2017

@ahmetalpbalkan You look blocked waiting on a netlink socket to return.
This is a bug in the kernel, but 1.12.5 should at least have a timeout on this netlink socket.
If I had to guess, you'd have something in your dmesg output like device_count = 1; waiting for <interface> to become free
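A rough sketch of what such a timeout can look like at the socket level (not the actual 1.12.5 change, just an illustration built on golang.org/x/sys/unix): with SO_RCVTIMEO set, a receive the kernel never answers comes back as EAGAIN instead of parking the goroutine forever.

// Minimal sketch, assuming a Linux host: open a netlink route socket, set a
// receive deadline, and treat EAGAIN as "the kernel never replied".
package main

import (
	"errors"
	"fmt"

	"golang.org/x/sys/unix"
)

func recvWithTimeout(seconds int64) error {
	fd, err := unix.Socket(unix.AF_NETLINK, unix.SOCK_RAW, unix.NETLINK_ROUTE)
	if err != nil {
		return err
	}
	defer unix.Close(fd)

	// Without SO_RCVTIMEO, Recvfrom blocks until the kernel replies; with it,
	// a stalled request surfaces as EAGAIN after the deadline instead.
	tv := unix.Timeval{Sec: seconds}
	if err := unix.SetsockoptTimeval(fd, unix.SOL_SOCKET, unix.SO_RCVTIMEO, &tv); err != nil {
		return err
	}

	buf := make([]byte, 4096)
	_, _, err = unix.Recvfrom(fd, buf, 0)
	if errors.Is(err, unix.EAGAIN) { // EWOULDBLOCK == EAGAIN on Linux
		return fmt.Errorf("netlink receive timed out after %ds", seconds)
	}
	return err
}

func main() {
	if err := recvWithTimeout(5); err != nil {
		fmt.Println(err)
	}
}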

Contributor

ahmetb commented Jan 2, 2017

@cpuguy83 Yeah, I saw coreos/bugs#254, which looked similar to my case; however, I don't see the "waiting" messages in the kernel logs that you and the other reporter mentioned.

It looks like 1.12.5 hasn't even hit the CoreOS alpha stream yet. Is there a kernel/docker version I can downgrade to and have this working?

Contributor

cpuguy83 commented Jan 2, 2017

@ahmetalpbalkan Yay, another kernel bug.
This is why we introduced the timeout on the netlink socket... in all likelihood you won't be able to start/stop any new containers, but at least Docker won't be blocked.

GameScripting Jan 2, 2017

Is it known what exactly the bug is? Was the kernel bug reported upstream? Is there a kernel version where this bug has been fixed?


Contributor

cpuguy83 commented Jan 2, 2017

@GameScripting The issue will have to be reported to whatever distro this was produced on, and as you can see, we have more than one issue causing the same effect here as well.

@schuylr referenced this issue in rancher/rancher on Jan 5, 2017

Closed

etcd Catalog Disaster Recovery is broken #7310

Calpicow commented Jan 9, 2017

Here's another one with Docker v1.12.3

dump
dmesg
dmsetup

Relevant syslog:

Jan  6 01:41:19 ip-100-64-32-70 kernel: INFO: task kworker/u31:1:91 blocked for more than 120 seconds.
Jan  6 01:41:19 ip-100-64-32-70 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  6 01:41:19 ip-100-64-32-70 kernel: kworker/u31:1   D ffff880201fc98e0     0    91      2 0x00000000
Jan  6 01:41:19 ip-100-64-32-70 kernel: Workqueue: kdmremove do_deferred_remove [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff88020141bcf0 0000000000000046 ffff8802044ce780 ffff88020141bfd8
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff88020141bfd8 ffff88020141bfd8 ffff8802044ce780 ffff880201fc98d8
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff880201fc98dc ffff8802044ce780 00000000ffffffff ffff880201fc98e0
Jan  6 01:41:20 ip-100-64-32-70 kernel: Call Trace:
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8163b959>] schedule_preempt_disabled+0x29/0x70
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81639655>] __mutex_lock_slowpath+0xc5/0x1c0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81638abf>] mutex_lock+0x1f/0x2f
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0392e9d>] __dm_destroy+0xad/0x340 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa03947e3>] dm_destroy+0x13/0x20 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0398d6d>] dm_hash_remove_all+0x6d/0x130 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa039b50a>] dm_deferred_remove+0x1a/0x20 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0390dae>] do_deferred_remove+0xe/0x10 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81645818>] ret_from_fork+0x58/0x90
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan  6 01:41:20 ip-100-64-32-70 kernel: INFO: task dockerd:31587 blocked for more than 120 seconds.
Jan  6 01:41:20 ip-100-64-32-70 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  6 01:41:20 ip-100-64-32-70 kernel: dockerd         D 0000000000000000     0 31587      1 0x00000080
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff8800e768fab0 0000000000000086 ffff880034215c00 ffff8800e768ffd8
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff8800e768ffd8 ffff8800e768ffd8 ffff880034215c00 ffff8800e768fbf0
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff8800e768fbf8 7fffffffffffffff ffff880034215c00 0000000000000000
Jan  6 01:41:20 ip-100-64-32-70 kernel: Call Trace:
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8163a879>] schedule+0x29/0x70
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81638569>] schedule_timeout+0x209/0x2d0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8108e4cd>] ? mod_timer+0x11d/0x240
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8163ac46>] wait_for_completion+0x116/0x170
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810b8c10>] ? wake_up_state+0x20/0x20
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810ab676>] __synchronize_srcu+0x106/0x1a0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810ab190>] ? call_srcu+0x70/0x70
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81219e3f>] ? __sync_blockdev+0x1f/0x40
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810ab72d>] synchronize_srcu+0x1d/0x20
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa039318d>] __dm_suspend+0x5d/0x220 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0394c9a>] dm_suspend+0xca/0xf0 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0399fe0>] ? table_load+0x380/0x380 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa039a174>] dev_suspend+0x194/0x250 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0399fe0>] ? table_load+0x380/0x380 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa039aa25>] ctl_ioctl+0x255/0x500 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8112482d>] ? call_rcu_sched+0x1d/0x20
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa039ace3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff811f1e75>] do_vfs_ioctl+0x2e5/0x4c0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8128bbee>] ? file_has_perm+0xae/0xc0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81640d01>] ? __do_page_fault+0xb1/0x450
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff811f20f1>] SyS_ioctl+0xa1/0xc0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
Contributor

cpuguy83 commented Jan 9, 2017

@Calpicow Thanks, yours looks like devicemapper has stalled.

github.com/docker/docker/pkg/devicemapper._C2func_dm_task_run(0x7fd3a40231b0, 0x7fd300000000, 0x0, 0x0)
	??:0 +0x4c
github.com/docker/docker/pkg/devicemapper.dmTaskRunFct(0x7fd3a40231b0, 0xc821a91620)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper_wrapper.go:96 +0x75
github.com/docker/docker/pkg/devicemapper.(*Task).run(0xc820345838, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:155 +0x37
github.com/docker/docker/pkg/devicemapper.SuspendDevice(0xc8219c8600, 0x5a, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:648 +0x99
github.com/docker/docker/pkg/devicemapper.CreateSnapDevice(0xc821a915f0, 0x25, 0x21, 0xc8219c8600, 0x5a, 0x1f, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:780 +0x92
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).createRegisterSnapDevice(0xc820433040, 0xc821084080, 0x40, 0xc8210842c0, 0x140000000, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:861 +0x550
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).AddDevice(0xc820433040, 0xc821084080, 0x40, 0xc82025b5e0, 0x45, 0x0, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:1887 +0xa5c
github.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).Create(0xc82036af00, 0xc821084080, 0x40, 0xc82025b5e0, 0x45, 0x0, 0x0, 0x0, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:131 +0x6f
github.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).CreateReadWrite(0xc82036af00, 0xc821084080, 0x40, 0xc82025b5e0, 0x45, 0x0, 0x0, 0x0, 0x0, 0x0)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:126 +0x86
github.com/docker/docker/daemon/graphdriver.(*NaiveDiffDriver).CreateReadWrite(0xc82021c500, 0xc821084080, 0x40, 0xc82025b5e0, 0x45, 0x0, 0x0, 0x0, 0x0, 0x0)
	<autogenerated>:28 +0xbe
github.com/docker/docker/layer.(*layerStore).CreateRWLayer(0xc82021c580, 0xc82154f980, 0x40, 0xc82025b2c0, 0x47, 0x0, 0x0, 0xc820981380, 0x0, 0x0, ...)
	/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/layer/layer_store.go:476 +0x5a9
github.com/docker/docker/daemon.(*Daemon).setRWLayer(0xc820432ea0, 0xc8216d12c0, 0x0, 0x0)

Can you open a separate issue with all the details?
Thanks!
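Because these devicemapper ioctls block inside cgo and cannot be cancelled from Go, one pragmatic mitigation (a sketch only, not what the daemon actually does) is to wrap the blocking call in a watchdog that logs when it has been stuck far longer than expected, so the stall shows up in the logs before the whole daemon appears frozen:

// Coarse stall detector around a blocking call such as the dm_task_run ioctl
// in the trace above: the call itself keeps running, but the watchdog at
// least surfaces that it has been stuck longer than expected.
package main

import (
	"log"
	"time"
)

func runWithStallWarning(name string, timeout time.Duration, fn func() error) error {
	done := make(chan error, 1)
	go func() { done <- fn() }()

	t := time.NewTimer(timeout)
	defer t.Stop()
	for {
		select {
		case err := <-done:
			return err
		case <-t.C:
			// The blocked call is still in flight; we only report the stall.
			log.Printf("%s still blocked after %s (possible devicemapper stall)", name, timeout)
			t.Reset(timeout)
		}
	}
}

func main() {
	_ = runWithStallWarning("dm_task_run", 30*time.Second, func() error {
		time.Sleep(65 * time.Second) // stand-in for a stalled ioctl
		return nil
	})
}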
