
Docker does not catch container exit #33820

Closed
titpetric opened this issue Jun 26, 2017 · 36 comments

@titpetric

titpetric commented Jun 26, 2017

Description

Occasionally/often (within days/weeks), a container will exit, but docker doesn't catch this exit or clean up after the container (--rm). The container is "stuck", and the process running docker run waits indefinitely.

Steps to reproduce the issue:

  1. days/weeks of time before the issue occurs,
  2. somewhere after API 1.24 (that one works stably),
  3. a kernel after Linux dev2 4.6.0-1-amd64 #1 SMP Debian 4.6.2-2 (2016-06-25) x86_64 GNU/Linux (again, stable)

Describe the results you received:

This is the output of docker events for the container which exited:

root@docker2:~# docker events --since $(date -d '2017-06-26 00:00:00' "+%s")
2017-06-26T08:56:49.683281660+02:00 container create d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e (image=registry.rtvslo.si/shell-php, name=agitated_perlman)
2017-06-26T08:56:49.690299160+02:00 container attach d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e (image=registry.rtvslo.si/shell-php, name=agitated_perlman)
2017-06-26T08:56:49.846012360+02:00 network connect da8583186519df97c129e6791f90c246631b41969913aeaa3206e56cc8b71beb (container=d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e, name=bridge, type=bridge)
2017-06-26T08:56:50.239283260+02:00 container start d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e (image=registry.rtvslo.si/shell-php, name=agitated_perlman)
2017-06-26T09:08:05.624284023+02:00 container exec_create: bash  d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e (image=registry.rtvslo.si/shell-php, name=agitated_perlman)
2017-06-26T09:08:05.628605923+02:00 container exec_start: bash  d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e (image=registry.rtvslo.si/shell-php, name=agitated_perlman)
# docker ps -a
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS              PORTS                    NAMES
d4370401db24        registry.rtvslo.si/shell-php         "/start.sh php ./u..."   22 minutes ago      Up 22 minutes                                agitated_perlman

The full run cmd:

# pstree -laSg 56268
docker,56228 run -i --rm -v /etc/config:/etc/config -v /etc/cron.scripts:/etc/cron.scripts -v /usr/local/bin:/usr/local/bin -v /root/.ssh:/root/.ssh -h docker2 -v /var/www/oddaje/oddaje:/app -w /app/public_html registry.rtvslo.si/shell-php php ./update-resource-versions
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  └─{docker},56228

Describe the results you expected:

I expected dockerd to handle/catch that the container has exited and clean up after it. The process tree doesn't indicate that any container process is still running, yet docker didn't catch the exit.

Additional information you deem important (e.g. issue happens only occasionally):

This issue was occurring on overlay2, and recently (a few weeks ago) I switched to aufs to see if it improved behaviour (it did not). The server is mostly idle (no live traffic except deployments), and pretty much the same issue was occurring in every docker-ce version since 17.01. I had a discussion with @thaJeztah some months back; he suggested it was related to an issue where removal was skipped due to a client/server API version mismatch. Unfortunately, pretty much every docker-ce edge build since then has had the same issue.

Issuing docker kill has no effect, I can't exec into the container (because no root PID is alive, I suspect), docker stop doesn't do anything either, and docker rm -f outputs this:

root@docker2:~# docker rm -f agitated_perlman
Error response from daemon: Unable to remove filesystem for d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e: remove /var/lib/docker/containers/d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e/shm: device or resource busy

But it doesn't affect the parent docker run command (it did not break out of that), even though the container is now actually removed from the docker ps output. The only way to move forward is to kill -9 the docker run command. The next docker run command gets stuck as well. So, once the issue starts occurring, it occurs all the time.

Output of docker version:

# docker version
Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:18:28 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:18:28 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

# docker info
Containers: 7
 Running: 7
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 17.05.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 48
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.0-3-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 6
Total Memory: 7.698GiB
Name: docker2
ID: G2Z3:XLWE:P3V3:FTZR:U2Y6:2ABJ:6HTP:PIV2:KRHA:2ATV:ZMPQ:SHMJ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

Server is a VM running on Hyper-V, latest debian stretch install.

Any thoughts on how this could be resolved? We're very happy and stable on the 1.12/1.13 versions with a slightly older installation (i.e., no such painful issues). If I can inspect dockerd more deeply somehow (i.e., with more verbosity), please let me know how.

I'm going to upgrade the server(s) to 17.06 and again hope for the best. I can't seem to find any directly related issues on the tracker; it seems that some people have issues with versions 1.12/1.13, but those literally work like a charm for us. I'd hate to downgrade :(

Edit 1: formatting, and to note that service docker restart does fix the issue temporarily. That suggests to me that there's some internal dockerd state that went wrong and isn't handled properly (best guess). I'm not sure I should put service docker restart into a cron/health check, however... :)
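
For what it's worth, the crude stopgap alluded to above would just be a scheduled restart, e.g. a crontab entry along these lines (a sketch, not something actually deployed here):

# blunt workaround: restart the docker service nightly
0 4 * * * /usr/sbin/service docker restart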

@titpetric
Author

Just had an occurrence of the issue after 21 days. Since 17.06 is released now, I'm performing the upgrade. I'd still appreciate some guide/link on how to debug this further at the internal level, so I can make a good stab at either diagnosing the root cause of the issue or trying to reproduce it.

@thaJeztah
Member

I think there actually may have been fixes in 17.06 in this area, so keep us posted if 17.06 works for you

@titpetric
Author

@thaJeztah is there some parent issue or PR/commit that you could refer me to? I'm interested in figuring out what the underlying issue is and which part of the runtime it falls into, and in reviewing some of the source changes myself. I don't expect to understand all of it; I would just like to either attribute this to human error and leave it in the testing phase, or better understand the technical reasons for why this happens, to see if there is some kind of work-around I could apply (i.e., not using the ext4 filesystem or similar). Any additional info is appreciated.

@thaJeztah
Member

hm, I don't have a direct PR to refer to; possibly @mlaventure can point to it, but I think he has this week off 😊

@mlaventure
Contributor

I can't think of a particular PR.

This looks like docker didn't receive the exit event from containerd, or didn't process it correctly.

If you could have a daemon running in debug mode, that would help check that assumption, although it would generate quite a bit of data if it takes about 3 weeks to occur.

@titpetric
Author

titpetric commented Aug 1, 2017 via email

@mlaventure
Contributor

@titpetric

  1. Update /etc/docker/daemon.json to add "debug": true. E.g.:
{
  "debug": true
}
  2. Reload the daemon configuration: kill -HUP $(cat /run/docker.pid)
  3. Reproduce the issue :) (a quick verification of the first two steps is sketched below)
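
A quick way to check that the HUP reload actually enabled debug logging (a sketch; the exact wording of the docker info output may vary by version):

# after the HUP, the daemon should report debug mode as enabled
docker info 2>/dev/null | grep -i 'debug mode'
# e.g.: Debug Mode (server): true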

@titpetric
Author

titpetric commented Aug 1, 2017 via email

@mlaventure
Contributor

@titpetric the output would be in your system logger, yes
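
On a systemd-based host like this Debian stretch server, one way to follow that output is via the journal (a sketch, assuming dockerd runs as the docker.service unit):

journalctl -u docker.service -f | grep -E 'level=(debug|warning|error)'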

@titpetric
Author

Regarding #34381 - while I'm not hitting OOM conditions (no process is unexpectedly killed off), there are some circumstances which might be influencing this behavior, and memory usage is indeed one of them. There's one container which uses about 3 GB of the 8 GB of total RAM, and the error, when it occurs, is usually visible on this container.

The way I'm reloading these containers is something akin to:

docker stop $CONTAINER > /dev/null 2>&1
docker rm $CONTAINER > /dev/null 2>&1

DOCKER_ARGS=$(docker_args $BACKEND)
DOCKER_IMAGE=$(docker_image $BACKEND)

OUTPUT=$(docker run -h $CONTAINER --name $CONTAINER --restart=always $DOCKER_ARGS -d $DOCKER_IMAGE 2>&1)
check_error $? "$OUTPUT"

The combination of these commands somehow triggers the issue. If I'm right, it should be easy enough to replicate the issue by writing a program that allocates a few GBs of RAM and looping through the stop/rm/run commands for a while, as sketched below. I'll try to figure something out.
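
A minimal version of such a stress loop might look like the sketch below (the image, container name, allocation size and iteration count are placeholders, not the exact values used here):

#!/bin/bash
# repeatedly stop/rm/run a memory-hungry container to try to trigger the stuck-exit state
CONTAINER=stress-test
IMAGE=debian:stretch

for i in $(seq 1 1000); do
    docker stop "$CONTAINER" > /dev/null 2>&1
    docker rm "$CONTAINER" > /dev/null 2>&1
    # allocate roughly 1 GB inside the container, then exit
    docker run -d --name "$CONTAINER" "$IMAGE" \
        sh -c 'head -c 1G /dev/zero | tail > /dev/null'
    sleep 5
done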

Also: confirming that I added the debug options to the daemon on the last occurrence of this issue.

@titpetric
Author

Ok, I've had the issue occur today. This is what I filtered from the syslog:

Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.213419350+02:00" level=debug msg="Calling GET /_ping"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.215965623+02:00" level=debug msg="Calling POST /v1.30/containers/create"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.216531306+02:00" level=debug msg="form data: {\"AttachStderr\":true,\"AttachStdin\":true,\"AttachStdout\":true,\"Cmd\":[\"/var/www/monocms-novi/monocms/bin/get-entrypoints
\"],\"Domainname\":\"\",\"Entrypoint\":null,\"Env\":[],\"HostConfig\":{\"AutoRemove\":true,\"Binds\":[\"/etc/config/:/etc/config/\",\"/var/www/monocms-novi/monocms:/var/www/monocms-novi/monocms\",\"/src/spiderweb/conf/nginx.conf:/etc/ngi
nx/nginx.conf\",\"/src/spiderweb/bin/build/monocms-novi:/etc/nginx/sites/\"],\"BlkioDeviceReadBps\":null,\"BlkioDeviceReadIOps\":null,\"BlkioDeviceWriteBps\":null,\"BlkioDeviceWriteIOps\":null,\"BlkioWeight\":0,\"BlkioWeightDevice\":null
,\"CapAdd\":null,\"CapDrop\":null,\"Cgroup\":\"\",\"CgroupParent\":\"\",\"ConsoleSize\":[0,0],\"ContainerIDFile\":\"\",\"CpuCount\":0,\"CpuPercent\":0,\"CpuPeriod\":0,\"CpuQuota\":0,\"CpuRealtimePeriod\":0,\"CpuRealtimeRuntime\":0,\"CpuS
hares\":0,\"CpusetCpus\":\"\",\"CpusetMems\":\"\",\"DeviceCgroupRules\":null,\"Devices\":[],\"DiskQuota\":0,\"Dns\":[],\"DnsOptions\":[],\"DnsSearch\":[],\"ExtraHosts\":[\"dbhost:10.55.8.56\",\"dbhost_api:10.55.8.53\",\"cache1:10.55.8.74
\",\"cache2:10.55.8.73\",\"cache3:10.55.8.74\",\"fsng1:10.55.8.73\",\"fsng2:10.55.8.74\",\"api1:10.55.8.151\",\"api2:10.55.8.152\",\"api3:10.55.8.153\",\"ava3api.rtvslo.si:10.55.8.21\",\"docker:172.25.0.1\",\"mail:172.25.0.1\",\"docker2:
172.25.0.1\"],\"GroupAdd\":null,\"IOMaximumBandwidth\":0,\"IOMaximumIOps\":0,\"IpcMode\":\"\",\"Isolation\":\"\",\"KernelMemory\":0,\"Links\":null,\"LogConfig\":{\"Config\":{},\"Type\":\"\"},\"Memory\":0,\"MemoryReservation\":0,\"MemoryS
wap\":0,\"MemorySwappiness\":-1,\"NanoCpus\":0,\"NetworkMode\":\"party\",\"OomKillDisable\":false,\"OomScoreAdj\":0,\"PidMode\":\"\",\"PidsLimit\":0,\"PortBindings\":{},\"Privileged\":false,\"PublishAllPorts\":false,\"ReadonlyRootfs\":fa
lse,\"RestartPolicy\":{\"MaximumRetryCount\":0,\"Name\":\"no\"},\"SecurityOpt\":null,\"ShmSize\":0,\"UTSMode\":\"\",\"Ulimits\":null,\"UsernsMode\":\"\",\"VolumeDriver\":\"\",\"VolumesFrom\":n
Aug 24 13:10:51 docker2 dockerd[61989]: ull},\"Hostname\":\"\",\"Image\":\"registry.rtvslo.si/stretch-web-php\",\"Labels\":{},\"NetworkingConfig\":{\"EndpointsConfig\":{}},\"OnBuild\":null,\"OpenStdin\":true,\"StdinOnce\":true,\"Tty\":fa
lse,\"User\":\"\",\"Volumes\":{},\"WorkingDir\":\"\"}"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.297225332+02:00" level=debug msg="container mounted via layerStore: /var/lib/docker/aufs/mnt/75b9afaa3383c297831d3d322b5f1a59eaa06fb4eb2279572933f8d0fee48656"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.353545378+02:00" level=debug msg="Calling POST /v1.30/containers/ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022/attach?stderr=1&stdin=1&stdout=1&stream=1
"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.354000906+02:00" level=debug msg="attach: stderr: begin"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.354018115+02:00" level=debug msg="attach: stdin: begin"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.354226219+02:00" level=debug msg="attach: stdout: begin"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.354546879+02:00" level=debug msg="Calling POST /v1.30/containers/ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022/wait?condition=removed"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.355556583+02:00" level=debug msg="Calling POST /v1.30/containers/ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022/start"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.359023716+02:00" level=debug msg="container mounted via layerStore: /var/lib/docker/aufs/mnt/75b9afaa3383c297831d3d322b5f1a59eaa06fb4eb2279572933f8d0fee48656"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.359668338+02:00" level=debug msg="Assigning addresses for endpoint quirky_jang's interface on network party"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.359725567+02:00" level=debug msg="RequestAddress(LocalDefault/172.25.0.0/24, <nil>, map[])"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.377647623+02:00" level=debug msg="Assigning addresses for endpoint quirky_jang's interface on network party"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.390483738+02:00" level=debug msg="291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979 (50d0edd).addSvcRecords(ead59e503d78, 172.25.0.11, <nil>, true) updateSv
cRecord sid:291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.390638115+02:00" level=debug msg="291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979 (50d0edd).addSvcRecords(ead59e503d78, 172.25.0.11, <nil>, false) updateS
vcRecord sid:291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.399951870+02:00" level=debug msg="291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979 (50d0edd).addSvcRecords(ead59e503d78, 172.25.0.11, <nil>, true) updateSv
cRecord sid:291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.400034111+02:00" level=debug msg="291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979 (50d0edd).addSvcRecords(ead59e503d78, 172.25.0.11, <nil>, false) updateS
vcRecord sid:291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.404364375+02:00" level=debug msg="Programming external connectivity on endpoint quirky_jang (291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979)"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.408908346+02:00" level=debug msg="EnableService ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022 START"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.408983984+02:00" level=debug msg="EnableService ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022 DONE"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.881215608+02:00" level=debug msg="sandbox set key processing took 225.13944ms for container ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022"
Aug 24 13:10:53 docker2 dockerd[61989]: time="2017-08-24T13:10:53.546508639+02:00" level=debug msg="containerd: process exited" id=ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022 pid=init status=0 systemPid=18732
Aug 24 13:17:30 docker2 systemd[1]: Started Session 107031 of user root.
Aug 24 13:17:35 docker2 systemd[1]: Stopping Docker Application Container Engine...

Nothing obvious pops out at me. The last two lines are me already running service docker restart. There's a bunch of errors after that, anything from:

Aug 24 13:17:35 docker2 dockerd[61989]: time="2017-08-24T13:17:35.835988806+02:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container feb3f25d279d935c93167e8f8d21ba166684
755d9fcc25249855a13f4185ef0c: rpc error: code = 2 desc = containerd: container not found"
Aug 24 13:17:35 docker2 dockerd[61989]: time="2017-08-24T13:17:35.836061342+02:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ead59e503d78c316935aaa340e4e1e24c7d4
21b9c4aef02337edcbbf7ef6b022: rpc error: code = 2 desc = containerd: container not found"
Aug 24 13:17:35 docker2 dockerd[61989]: time="2017-08-24T13:17:35.836171497+02:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 76467dc50e57cf36902bd92c6d235501a606
2afa4b52d7554ad9d4ba42efe123: rpc error: code = 2 desc = containerd: container not found"
Aug 24 13:17:35 docker2 dockerd[61989]: time="2017-08-24T13:17:35.836231827+02:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container de77de908d972b25f76a860fee651112e38b
038eaf905d5f17d6cd871ea25d20: rpc error: code = 2 desc = containerd: container not found"

and then timeouts that result in kill -9:

Aug 24 13:17:45 docker2 dockerd[61989]: time="2017-08-24T13:17:45.829687931+02:00" level=info msg="Container 30fed9ae935e1a11381cfc81b61d5a2e957eaff579bf7c0a584c328c114cd165 failed to exit within 10 seconds of signal 15 - using the force
"
Aug 24 13:17:45 docker2 dockerd[61989]: time="2017-08-24T13:17:45.829772973+02:00" level=debug msg="Sending kill signal 9 to container 30fed9ae935e1a11381cfc81b61d5a2e957eaff579bf7c0a584c328c114cd165"

And more kills failed after that. This is running:

Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:17:22 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:16:12 2017
 OS/Arch:      linux/amd64
 Experimental: false

So, latest upgrade on the same host (no reinstall). If you need a docker info, LMK

@titpetric
Author

Also, we had limited success in reproducing the issue with a program that allocates a significant amount of memory and is then removed. Docker was failing both in OOM conditions and in a simple run/allocate/exit loop with less memory than required for an OOM. We didn't finish the tests fully though, other priorities. If you need one, LMK and I'll try to push it to a higher priority.
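
For reference, the OOM variant of that loop can be approximated by giving the container less memory than it tries to allocate (a sketch; the limit and allocation size are arbitrary):

# with a 256 MB limit, the ~1 GB allocation below should get OOM-killed
docker run --rm --memory 256m debian:stretch \
    sh -c 'head -c 1G /dev/zero | tail > /dev/null'
echo "exit status: $?"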

@Raffo

Raffo commented Dec 21, 2017

We are experiencing the same issue running the following docker version:

Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.4
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:24:58 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.4
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:24:58 2017
 OS/Arch:      linux/amd64
 Experimental: false

There is an easy way for us to reproduce it on an instance that is experiencing the problem:

$ docker run -it ubuntu /bin/bash
root@943b8935e38e:/# exit
exit

[nothing happens]

The container can't be found with ps aux, but it is present in docker ps. docker kill on the container also doesn't help: docker thinks it killed the container, but it is still listed if we run docker ps, and still not running as a process.

@thaJeztah
Member

@Raffo anything in the logs that could be useful? Also, what does your setup look like? Are you connected to a local daemon, or a remote one? What distro (and version) are you running? (perhaps include the docker info output)

@Raffo

Raffo commented Dec 21, 2017

Sure, here's the info:

Containers: 33
 Running: 26
 Paused: 0
 Stopped: 7
Images: 21
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: v0.13.2 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.13.16-coreos-r2
Operating System: Container Linux by CoreOS 1576.4.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.799GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Live Restore Enabled: false

The setup is just that: https://github.com/zalando-incubator/kubernetes-on-aws

@thaJeztah
Member

Looks like you're running on CoreOS, which is not a supported platform (i.e., there are no official packages for CoreOS); could you report that issue with them, since CoreOS maintains its own packages for Docker?

@Raffo

Raffo commented Dec 21, 2017

I can and I will totally do it, but I guess this will still be a Docker/Moby bug and we should try to figure it out together, especially given the similarity to the one reported originally in this issue.

@qingbo

qingbo commented Dec 28, 2017

I have a small kubernetes cluster of 3 nodes, and one of them is suffering from the same issue. I provisioned the nodes (same hardware specs) with Ansible, so I believe they should be consistent, but the problem only happens on one node. I have to watch k8s for pods stuck in the "Terminating" state, and when it happens, I restart the docker daemon on that node.

docker info output:

Containers: 143
 Running: 120
 Paused: 0
 Stopped: 23
Images: 89
Server Version: 17.09.0-ce
Storage Driver: devicemapper
 Pool Name: docker-images
 Pool Blocksize: 65.54kB
 Base Device Size: 10.74GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 13.17GB
 Data Space Total: 107.4GB
 Data Space Available: 94.21GB
 Metadata Space Used: 40.87MB
 Metadata Space Total: 104.9MB
 Metadata Space Available: 63.99MB
 Thin Pool Minimum Free Space: 10.74GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.13.8-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 56
Total Memory: 251.8GiB
ID: BESE:G7P4:LFG2:RZNZ:I2DG:RY4N:GHFD:56HV:WWZI:HSTK:MS7B:LOUX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 2409
 Goroutines: 2949
 System Time: 2017-12-28T02:35:37.793930552-05:00
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

Like @Raffo described above, it's very easy to reproduce when the issue happens, but I have no idea how the docker daemon enters such a state. Sometimes it can run for 3-4 days without problems and sometimes I have to restart docker every day.

Interesting logs (I tried docker kill):

...
Dec 28 02:31:10 kub03s dockerd: time="2017-12-28T02:31:10.305483576-05:00" level=debug msg="containerd: process exited" id=7e2f63ce6867bc28d3660b58f82159bc7a309bbdc2eb05851c97d375ea97336b pid=init status=0 systemPid=23306
Dec 28 02:32:39 kub03s dockerd: time="2017-12-28T02:32:39.726743261-05:00" level=debug msg="Calling POST /v1.32/containers/7e2f63ce6867/kill?signal=KILL"
Dec 28 02:32:39 kub03s dockerd: time="2017-12-28T02:32:39.726836346-05:00" level=debug msg="Sending kill signal 9 to container 7e2f63ce6867bc28d3660b58f82159bc7a309bbdc2eb05851c97d375ea97336b"
Dec 28 02:32:39 kub03s dockerd: time="2017-12-28T02:32:39.727128965-05:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 7e2f63ce6867bc28d3660b58f82159bc7a309bbdc2eb05851c97d375ea97336b: rpc error: code = Unknown desc = containerd: container not found"
Dec 28 02:32:49 kub03s dockerd: time="2017-12-28T02:32:49.727333467-05:00" level=info msg="Container 7e2f63ce6867 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Dec 28 02:32:49 kub03s dockerd: time="2017-12-28T02:32:49.727385529-05:00" level=debug msg="Cannot kill process (pid=23306) with signal 9: no such process."

dockerd knows that process 23306 has exited, yet it still tries to kill that process?

@titpetric
Author

titpetric commented Dec 28, 2017 via email

@Raffo

Raffo commented Dec 28, 2017

@thaJeztah Opened with CoreOS: coreos/bugs#2306. Any further help would be appreciated; in the meantime we are forced to downgrade to a previous docker release to see if the problem disappears.

@thaJeztah
Member

Thanks @Raffo

Ok, I've had the issue occur today. This is what I filtered from the syslog:

@titpetric in that output, I notice "AutoRemove":true. Starting with Docker 1.13 (and API version 1.25), the --rm option is no longer handled by the client, but daemon-side. This means that the moment the container exits, it's automatically removed. Possibly the containerd: container not found error is a race condition, where the container did not exit within x seconds, but did soon after (causing it to be removed), after which the second kill failed because the container was already gone.

This may not be an explanation for all situations you reported, but was something that stood out (and could explain some)

Unless I'm significantly mistaken, k8s only uses libcontainer (or some library) to load the docker images, but the actual containers that it runs are not run with docker.

Most k8s installations currently use Docker (through the cri-docker wrapper). Note that the docker daemon itself has not contained a "runtime" for a long time now, but uses OCI-compliant runtimes (runc being the default), controlled through containerd, to actually run the containers. In such a setup, the same things could occur. Docker versions up to 17.11 use the containerd 0.2 branch.

Docker 17.11 and up use containerd 1.0; the 1.0 version of containerd was developed from the ground up to be used as a runtime, both for docker (directly using its gRPC API), and kubernetes (using the https://github.com/kubernetes-incubator/cri-containerd project to provide the CRI interface), and is targeted to become the default runtime for kubernetes.

containerd is a better fit as a runtime for kubernetes, because it focuses on the parts that are needed to deploy containers on kubernetes; for that, it's kept minimal, and does not provide things such as networking, orchestration, a CLI, or features such as a "builder" (docker build) to build images.

@ghost

ghost commented Jan 11, 2018

We're seeing this too since we upgraded our CoreOS version.
We went from docker 1.12.6 to 17.09-ce, and within a matter of hours we had containers that refuse to stop.
Same issue: the PID reported by docker inspect is killed, but docker ps still shows the container as running.
Kubernetes reports the pod in a "Terminating" state.
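
One way to see that mismatch directly is to compare the PID docker reports with the process table (a sketch; <container> is the stuck container's name or ID):

PID=$(docker inspect -f '{{.State.Pid}}' <container>)
echo "reported pid: $PID"
# if the container process really exited, this directory no longer exists
ls -d /proc/$PID 2>/dev/null || echo "no such process"
docker ps --filter "name=<container>" --format '{{.Names}}: {{.Status}}'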

@Raffo

Raffo commented Jan 11, 2018

@dada941 We "fixed" it by rolling back to docker 1.12.6. Please note that, as far as I know, Kubernetes 1.8 is not compatible with docker 17; I'm not sure why Container Linux decided to ship this version given the number of people that are using it to run Kubernetes.

@thaJeztah thaJeztah added this to backlog in maintainers-session Jan 11, 2018
@dghubble

Kubernetes 1.8 is officially validated against Docker 1.11.2, 1.12.6, 1.13.1, and 17.03.2. It's in the release notes here, and a blog post was written about that unblocking Container Linux from revving the Docker version here.

@titpetric
Author

titpetric commented Jan 14, 2018

@thaJeztah A race condition would explain it. Some research I did a while back pointed to file descriptors (or something similar, along the lines of bind mounts) which wouldn't be cleaned up by the kernel/docker, and every issue I came across ended along the lines of "this is a kernel bug". Did you perhaps discuss some sort of stress test with the run/stop/rm cycle yet?

Edit: well, as you said, it would explain some of it. I'm running that infinite loop on a Digital Ocean instance right now, with a run -d + stop + rm, because that's the more usual (and unfortunately, very sporadic) way to reproduce the full stop. At that point it's not a race condition anymore, but more something along the lines of an unresponsive docker API (only service docker restart cures it).

@qianzhangxa

qianzhangxa commented Jan 25, 2018

@titpetric @dada941 When this issue happens, how does docker stop behave for a container which is actually not running but still shows up in docker ps as running? Will it fail with an error message, or will it do nothing (not output anything) and exit with 0? And how does docker stop behave for a container which is actually running? Can it stop the container?

@thaJeztah thaJeztah removed this from backlog in maintainers-session Feb 1, 2018
asfgit pushed a commit to apache/mesos that referenced this issue Feb 14, 2018
Due to a Docker issue (moby/moby#33820), the Docker daemon can fail
to catch a container exit, i.e., the container process has already
exited but the command `docker ps` shows the container as still
running. This will lead to the "docker run" command that we execute
in the Docker executor never returning, and it will also cause the
`docker stop` command to take no effect, i.e., it will return without
error but `docker ps` shows the container still running, so the task
will be stuck in the `TASK_KILLING` state.

To work around this Docker issue, in this patch we made the Docker
executor reap the container process directly, so the Docker executor
will be notified once the container process exits.

Review: https://reviews.apache.org/r/65518
Gilbert88 pushed a commit to Gilbert88/mesos that referenced this issue Mar 3, 2018
asfgit pushed a commit to apache/mesos that referenced this issue Mar 7, 2018
Gilbert88 pushed a commit to mesosphere/mesos that referenced this issue Mar 7, 2018
(each carrying the same commit message as above)
@euank
Contributor

euank commented May 16, 2018

Based on the description of the bug here, I think this is a duplicate of #36010 / #36048 and can thus be closed as fixed in 17.12.1 and newer.

@thaJeztah
Member

@titpetric can you confirm if this is resolved for you in 18.03.1 or up?

@thaJeztah
Member

(given that 17.12.1 reached EOL)

@qingbo

qingbo commented May 17, 2018

We upgraded to 18.03.0-ce a couple months ago and have not seen this issue since.

@thaJeztah
Member

We upgraded to 18.03.0-ce a couple months ago and have not seen this issue since.

Thanks for confirming!
On a side-note: I would recommend updating to 18.03.1-ce (as it has several bug fixes)

Let me go ahead and close this issue, as I think this should be fixed

@titpetric
Author

Generally, I'm running 17.09 stably, and didn't notice the issue on 18.01 either, but thanks for the heads up. I'll do some upgrades shortly 👍 thanks for all the hard work (and listening to complaints / bug reports!)

@thaJeztah
Member

You're welcome! Let us know if you're still running into issues 😅

@webPageDev

@qingbo Do you know the commit that fixed this issue?

@webPageDev

@thaJeztah @mlaventure @titpetric Because of an OOM, I am facing a similar problem: dockerd fails to receive containerd's exit signal when the container process has been OOM-killed. The container's process, runc, and containerd have all exited, but docker ps shows the container as UP, and docker stop, docker kill, and docker exec no longer work as expected. For example:
docker exec -it MySQL-for-Developer_MySQL-for-Developer_1 sh
rpc error: code = 2 desc = containerd: container not found
Are you sure which docker version fixes this? Do you know the commit for this issue?

@honkiko

honkiko commented Sep 9, 2019

I created an issue with the same result but more information: docker/for-linux#779

Could you check some more status? (a combined check is sketched below)

  • is the container process alive? (get the PID via docker inspect)
  • what's the state of the containerd task? (via "docker-container-ctr --namespace moby --address /var/run/docker/containerd/docker-containerd.sock task ls")
  • is the shim process alive? (ps -elf | grep $containerID)
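
A rough combined check for those three points (a sketch; the ctr binary name and socket path are copied verbatim from the list above and can differ between Docker versions):

CONTAINER_ID=<full container id>

# 1. is the container process still alive?
PID=$(docker inspect -f '{{.State.Pid}}' "$CONTAINER_ID")
ls -d /proc/$PID 2>/dev/null || echo "container process $PID is gone"

# 2. what does containerd think the task state is?
docker-container-ctr --namespace moby \
  --address /var/run/docker/containerd/docker-containerd.sock task ls | grep "$CONTAINER_ID"

# 3. is the shim process still alive?
ps -elf | grep -v grep | grep "$CONTAINER_ID"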
