
Docker ps hang by an incompletely exited container #31685

Closed
adaiguoguo opened this issue Mar 9, 2017 · 5 comments
Labels
area/daemon, kind/bug, version/1.13

Comments


adaiguoguo commented Mar 9, 2017

Output of docker version:

[root@adct-eless-mesos-9 data]# docker version
Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:38:28 2017
 OS/Arch:      linux/amd64
Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:38:28 2017
 OS/Arch:      linux/amd64
 Experimental: false
[root@adct-eless-mesos-9 data]# uname -r
3.10.0-514.6.1.el7.x86_64

Output of docker info:

[root@adct-eless-mesos-9 data]# docker info
Containers: 7
 Running: 2
 Paused: 0
 Stopped: 5
Images: 15
Server Version: 1.13.1
Storage Driver: devicemapper
 Pool Name: docker-253:17-4456452-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 3.151 GB
 Data Space Total: 107.4 GB
 Data Space Available: 96.11 GB
 Metadata Space Used: 5.353 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.142 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /data/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /data/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.6.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 62.76 GiB
Name: adct-eless-mesos-9.vm
ID: IFPR:2AE3:EOYQ:KVIN:MABO:SCMJ:HF2P:2AYR:LGIU:6WT2:DRS4:LXH6
Docker Root Dir: /data/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 docker-registry.tools.elenet.me:5000
 127.0.0.0/8
Live Restore Enabled: false

What is the bug behavior?

[root@adct-eless-mesos-9 ~]# docker ps

docker ps hangs, and the container can't be stopped.

How I tracked down the root cause:

1: Check the dockerd process and its open files

[root@adct-eless-mesos-9 data]# ps -ef|grep dockerd
root        1784       1  0 Feb21 ?        00:58:37 /usr/bin/dockerd --exec-opt native.cgroupdriver=cgroupfs --insecure-registry docker-registry.tools.elenet.me:5000 -g /data/docker -H tcp://127.0.0.1:4243 -H unix:///var/run/docker.sock --bip=10.99.5.129/28 --mtu=1450
[root@adct-eless-mesos-9 data]# lsof -p 1784
dockerd 1784 root   23u     unix 0xffff880aecb92c00       0t0   13743016 /var/run/docker.sock
dockerd 1784 root   24u     unix 0xffff880bfe629c00       0t0   13775439 /var/run/docker.sock
dockerd 1784 root   25u     unix 0xffff880f6d75f400       0t0   14510073 /var/run/docker.sock
dockerd 1784 root   26u     unix 0xffff880f6d758400       0t0   14526831 /var/run/docker.sock
dockerd 1784 root   27w      REG             253,17 283825669    4457362 /data/docker/containers/e8e8a9e36e6ce633077467c7c7954b1b0cfff3df177a936d4d9b47b7b5e3a676/e8e8a9e36e6ce633077467c7c7954b1b0cfff3df177a936d4d9b47b7b5e3a676-json.log
dockerd 1784 root   28r      REG             253,17 283825669    4457362 /data/docker/containers/e8e8a9e36e6ce633077467c7c7954b1b0cfff3df177a936d4d9b47b7b5e3a676/e8e8a9e36e6ce633077467c7c7954b1b0cfff3df177a936d4d9b47b7b5e3a676-json.log
dockerd 1784 root   29u     sock                0,7       0t0      17167 protocol: NETLINK
dockerd 1784 root   30u     unix 0xffff880f6d75ec00       0t0   14531849 /var/run/docker.sock
dockerd 1784 root   31u     unix 0xffff880ae8e18c00       0t0   14531856 /var/run/docker.sock
dockerd 1784 root   32u     unix 0xffff880fef600400       0t0   14534909 /var/run/docker.sock

dockerd holds the log file of container e8e8a9e36e6ce633077467c7c7954b1b0cfff3df177a936d4d9b47b7b5e3a676 open for both reading and writing.
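The lsof check above can be wrapped in a small helper. This is just a sketch, assuming lsof's default output format (path in the last column) as shown in the listing; `open_container_logs` is my name, not a Docker tool. It reduces an lsof listing on stdin to the unique container json-log paths the daemon still holds open:

```shell
# Filter an `lsof -p <dockerd-pid>` listing (on stdin) down to the unique
# container json-log paths the daemon still holds open.
open_container_logs() {
  awk '/-json\.log$/ { print $NF }' | sort -u
}

# Usage against a live daemon (pidof/lsof assumed available):
#   lsof -p "$(pidof dockerd)" | open_container_logs
```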

2: Docker inspect

[root@adct-eless-mesos-9 data]# docker inspect e8e8a9e36e6ce633077467c7c7954b1b0cfff3df177a936d4d9b47b7b5e3a676

docker inspect also hangs on this container ID.

3: Find the PID from the container ID

[root@adct-eless-mesos-9 data]# cat /data/docker/containers/e8e8a9e36e6ce633077467c7c7954b1b0cfff3df177a936d4d9b47b7b5e3a676/config.v2.json |python -m json.tool|grep Pid
        "Pid": 7294,
[root@adct-eless-mesos-9 data]# ps -ef|grep 7294
root      661378  660472  0 16:04 pts/0    00:00:00 grep --color=auto 7294

The container process 7294 has disappeared.
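Step 3's PID lookup can be scripted without python. A minimal sketch, assuming the config.v2.json layout shown above; the helper name `container_pid` is mine, not part of Docker:

```shell
# Extract the recorded "Pid" from a container's config.v2.json and report
# whether that process still exists on the host.
container_pid() {
  grep -o '"Pid": *[0-9]*' "$1" | head -n1 | tr -dc '0-9'
}

# Usage (Docker root dir /data/docker as in this report):
#   pid=$(container_pid /data/docker/containers/$cid/config.v2.json)
#   if [ -d "/proc/$pid" ]; then echo "process $pid alive"; else echo "process $pid gone"; fi
```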

4: Determine when the container exited

[root@adct-eless-mesos-9 data]# cat /run/docker/libcontainerd/containerd/events.log|grep e8e8a9e36e6ce633077467c7c7954b1b0cfff3df177a936d4d9b47b7b5e3a676
{"id":"e8e8a9e36e6ce633077467c7c7954b1b0cfff3df177a936d4d9b47b7b5e3a676","type":"start-container","timestamp":"2017-02-21T11:54:43.827455953+08:00"}
{"id":"e8e8a9e36e6ce633077467c7c7954b1b0cfff3df177a936d4d9b47b7b5e3a676","type":"exit","timestamp":"2017-03-08T19:54:27.877604017+08:00","pid":"init","status":137}
"status":137 128 + 9 = 137 (9 coming from SIGKILL)

So my guess:

The container received a SIGKILL from Mesos, but for some reason the exit handling failed partway. A later docker ps then tries to fetch info from that container, whose process no longer exists, and hangs.

@thaJeztah added the area/daemon and kind/bug labels Mar 9, 2017
@thaJeztah
Member

Can you send "SIGUSR1" to the dockerd process and paste the stack trace from the daemon logs into a gist?

@adaiguoguo
Author

I followed your suggestion:

[root@adct-eless-mesos-9 ~]# ps -ef|grep docker
root        1784       1  0 Feb21 ?        00:59:15 /usr/bin/dockerd --exec-opt native.cgroupdriver=cgroupfs --insecure-registry docker-registry.tools.elenet.me:5000 -g /data/docker -H tcp://127.0.0.1:4243 -H unix:///var/run/docker.sock --bip=10.99.5.129/28 --mtu=1450
root        2227    1784  0 Feb21 ?        00:22:45 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc
root      654785    2227  0 12:14 ?        00:00:00 docker-containerd-shim 97894deb4371cd7cdf7d33cfbcd64604aa1068190ceee407fdf1ebfe5a8ec47a /var/run/docker/libcontainerd/97894deb4371cd7cdf7d33cfbcd64604aa1068190ceee407fdf1ebfe5a8ec47a docker-runc
root      670717  670613  0 21:40 pts/0    00:00:00 grep --color=auto docker
[root@adct-eless-mesos-9 ~]# kill -10 1784
[root@adct-eless-mesos-9 ~]# ps -ef|grep docker
root        1784       1  0 Feb21 ?        00:59:16 /usr/bin/dockerd --exec-opt native.cgroupdriver=cgroupfs --insecure-registry docker-registry.tools.elenet.me:5000 -g /data/docker -H tcp://127.0.0.1:4243 -H unix:///var/run/docker.sock --bip=10.99.5.129/28 --mtu=1450
root        2227    1784  0 Feb21 ?        00:22:45 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc
root      654785    2227  0 12:14 ?        00:00:00 docker-containerd-shim 97894deb4371cd7cdf7d33cfbcd64604aa1068190ceee407fdf1ebfe5a8ec47a /var/run/docker/libcontainerd/97894deb4371cd7cdf7d33cfbcd64604aa1068190ceee407fdf1ebfe5a8ec47a docker-runc
root      670727  670613  0 21:40 pts/0    00:00:00 grep --color=auto docker
[root@adct-eless-mesos-9 ~]# cat /var/log/daemon.log | grep docker
2017-03-09 12:13:34.815 info dockerd[1784]: time="2017-03-09T12:13:34.814928757+08:00" level=info msg="Container e0bfa13ac279bc21f6550c328d712d729722bba2c8d04825553a54288249bf72 failed to exit within 10 seconds of signal 15 - using the force"
2017-03-09 21:40:55.458 info dockerd[1784]: time="2017-03-09T21:40:55.458033682+08:00" level=info msg="goroutine stacks written to /var/run/docker/goroutine-stacks-2017-03-09T214055+0800.log"
2017-03-09 21:40:55.832 info dockerd[1784]: time="2017-03-09T21:40:55.832573438+08:00" level=info msg="daemon datastructure dump written to /var/run/docker/daemon-data-2017-03-09T214055+0800.log"
[root@adct-eless-mesos-9 ~]# cat /var/run/docker/goroutine-stacks-2017-03-09T214055+0800.log

The gist is https://gist.github.com/adaiguoguo/a391567020ccc443a4336f5a37ca664d

@thaJeztah
Member

/cc @cpuguy83

@cpuguy83
Member

cpuguy83 commented Mar 9, 2017

Thanks, this is fixed in 17.03, which is basically what 1.13.2 would have been.

@cpuguy83 cpuguy83 closed this as completed Mar 9, 2017
@ohmystack
Contributor

@adaiguoguo, I found it was fixed by PR #30223
