
Docker does not catch container exit #33820

Closed
titpetric opened this issue Jun 26, 2017 · 36 comments

@titpetric

titpetric commented Jun 26, 2017

Description

Occasionally/often (within days/weeks), a container will exit, but docker doesn't catch this exit or clean up after the container (--rm). The container is "stuck", and the process running docker run waits indefinitely.

Steps to reproduce the issue:

  1. days/weeks of time before the issue occurs,
  2. somewhere after API 1.24 (that one works stably),
  3. a kernel after Linux dev2 4.6.0-1-amd64 #1 SMP Debian 4.6.2-2 (2016-06-25) x86_64 GNU/Linux (again, stable)

Describe the results you received:

This is the output of docker events for the container which exited:

root@docker2:~# docker events --since $(date -d '2017-06-26 00:00:00' "+%s")
2017-06-26T08:56:49.683281660+02:00 container create d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e (image=registry.rtvslo.si/shell-php, name=agitated_perlman)
2017-06-26T08:56:49.690299160+02:00 container attach d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e (image=registry.rtvslo.si/shell-php, name=agitated_perlman)
2017-06-26T08:56:49.846012360+02:00 network connect da8583186519df97c129e6791f90c246631b41969913aeaa3206e56cc8b71beb (container=d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e, name=bridge, type=bridge)
2017-06-26T08:56:50.239283260+02:00 container start d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e (image=registry.rtvslo.si/shell-php, name=agitated_perlman)
2017-06-26T09:08:05.624284023+02:00 container exec_create: bash  d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e (image=registry.rtvslo.si/shell-php, name=agitated_perlman)
2017-06-26T09:08:05.628605923+02:00 container exec_start: bash  d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e (image=registry.rtvslo.si/shell-php, name=agitated_perlman)
# docker ps -a
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS              PORTS                    NAMES
d4370401db24        registry.rtvslo.si/shell-php         "/start.sh php ./u..."   22 minutes ago      Up 22 minutes                                agitated_perlman

The full run cmd:

# pstree -laSg 56268
docker,56228 run -i --rm -v /etc/config:/etc/config -v /etc/cron.scripts:/etc/cron.scripts -v /usr/local/bin:/usr/local/bin -v /root/.ssh:/root/.ssh -h docker2 -v /var/www/oddaje/oddaje:/app -w /app/public_html registry.rtvslo.si/shell-php php ./update-resource-versions
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  ├─{docker},56228
  └─{docker},56228

Describe the results you expected:

I expected dockerd to handle/catch that the container has exited and clean up after it. The process tree doesn't indicate that any container process is still running, yet docker didn't catch the exit.

Additional information you deem important (e.g. issue happens only occasionally):

This issue was occurring on overlay2, and recently (a few weeks ago) I switched to aufs to see if it improved behaviour (it did not). The server is mostly idle (no live traffic except deployments), and pretty much the same issue was occurring in every docker-ce version since 17.01. I had a discussion with @thaJeztah some months back; he suggested it was related to an issue where removal was skipped due to a client/server API version mismatch. Unfortunately, pretty much every docker-ce edge build since then has had the same issue.

Issuing docker kill has no effect, I can't exec into the container (because no root PID is alive, I suspect), docker stop doesn't do anything either, and docker rm -f outputs this:

root@docker2:~# docker rm -f agitated_perlman
Error response from daemon: Unable to remove filesystem for d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e: remove /var/lib/docker/containers/d4370401db245d209415ce49b6b8642ce99ef3874845a815cf8ed3fbb1e0d74e/shm: device or resource busy

But it doesn't affect the parent docker run command (it did not break out of that), even though the container is now actually removed from the docker ps output. The only way to move forward is to kill -9 the docker run command. The next docker run command gets stuck as well. So, once the issue starts occurring, it occurs all the time.

Output of docker version:

# docker version
Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:18:28 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:18:28 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

# docker info
Containers: 7
 Running: 7
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 17.05.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 48
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.0-3-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 6
Total Memory: 7.698GiB
Name: docker2
ID: G2Z3:XLWE:P3V3:FTZR:U2Y6:2ABJ:6HTP:PIV2:KRHA:2ATV:ZMPQ:SHMJ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

Server is a VM running on Hyper-V, latest debian stretch install.

Any thoughts on how this could be resolved? We're very happy and stable on the 1.12/1.13 versions with a slightly older installation (i.e., no such painful issues). If I can inspect dockerd more deeply somehow (i.e., with more verbosity), please let me know how.

I'm going to upgrade the server(s) to 17.06 and again hope for the best. I can't seem to find any directly related issues on the tracker; it seems that some people have issues with versions 1.12/1.13, but those literally work like a charm for us. I'd hate to downgrade :(

Edit 1: formatting, and to note that service docker restart does fix the issue temporarily. That suggests to me that there's some internal dockerd state that went wrong and isn't handled properly (best guess). I'm not sure I should put service docker restart into a cron/health check, however... :)
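
For what it's worth, the crude stopgap alluded to above would just be a scheduled restart, e.g. a crontab entry along these lines (a sketch, not something actually deployed here):

# blunt workaround: restart the docker service nightly
0 4 * * * /usr/sbin/service docker restart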

@titpetric
Author

Just had an occurrence of the issue after 21 days. Since 17.06 is released now, I'm performing the upgrade. I'd still appreciate some guide/link on how to debug this further at the internal level, so I can make a good stab at either diagnosing the root cause of the issue or trying to reproduce it.

@thaJeztah
Member

I think there actually may have been fixes in 17.06 in this area, so keep us posted if 17.06 works for you

@titpetric
Author

@thaJeztah is there some parent issue or PR/commit that you could refer me to? I'm interested in figuring out what the underlying issue is and which part of the runtime it falls into, and in reviewing some of the source changes myself. I don't expect to understand all of it; I would just like to either attribute this to human error and leave it in the testing phase, or better understand the technical reasons for why this happens, to see if there is some kind of work-around I could apply (i.e., not using the ext4 filesystem or similar). Any additional info is appreciated.

@thaJeztah
Member

hm, I don't have a direct PR to refer to; possibly @mlaventure can point to it, but I think he has this week off 😊

@mlaventure
Contributor

I can't think of a particular PR.

This looks like docker didn't receive the exit event from containerd, or didn't process it correctly.

If you could have a daemon running in debug mode, that would help check that assumption, although it would generate quite a bit of data if it takes about 3 weeks to occur.

@titpetric
Author

titpetric commented Aug 1, 2017 via email

@mlaventure
Contributor

@titpetric

  1. Update /etc/docker/daemon.json to add "debug": true. E.g.:
{
  "debug": true
}
  2. Reload the daemon configuration: kill -HUP $(cat /run/docker.pid)
  3. Reproduce the issue :) (a quick verification of the first two steps is sketched below)
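
A quick way to check that the HUP reload actually enabled debug logging (a sketch; the exact wording of the docker info output may vary by version):

# after the HUP, the daemon should report debug mode as enabled
docker info 2>/dev/null | grep -i 'debug mode'
# e.g.: Debug Mode (server): true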

@titpetric
Author

titpetric commented Aug 1, 2017 via email

@mlaventure
Contributor

@titpetric the output would be in your system logger, yes
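
On a systemd-based host like this Debian stretch server, one way to follow that output is via the journal (a sketch, assuming dockerd runs as the docker.service unit):

journalctl -u docker.service -f | grep -E 'level=(debug|warning|error)'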

@titpetric
Author

Regarding #34381 - while I'm not hitting OOM conditions (no process is unexpectedly killed off), there are some circumstances which might be influencing this behavior, and memory usage is indeed one of them. There's one container which uses about 3 GB of the 8 GB of total RAM, and the error, when it occurs, is usually visible on this container.

The way I'm reloading these containers is something akin to:

docker stop $CONTAINER > /dev/null 2>&1
docker rm $CONTAINER > /dev/null 2>&1

DOCKER_ARGS=$(docker_args $BACKEND)
DOCKER_IMAGE=$(docker_image $BACKEND)

OUTPUT=$(docker run -h $CONTAINER --name $CONTAINER --restart=always $DOCKER_ARGS -d $DOCKER_IMAGE 2>&1)
check_error $? "$OUTPUT"

The combination of these commands somehow triggers the issue. If I'm right, it should be easy enough to replicate the issue by writing a program that allocates a few GBs of RAM and looping through the stop/rm/run commands for a while, as sketched below. I'll try to figure something out.
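
A minimal version of such a stress loop might look like the sketch below (the image, container name, allocation size and iteration count are placeholders, not the exact values used here):

#!/bin/bash
# repeatedly stop/rm/run a memory-hungry container to try to trigger the stuck-exit state
CONTAINER=stress-test
IMAGE=debian:stretch

for i in $(seq 1 1000); do
    docker stop "$CONTAINER" > /dev/null 2>&1
    docker rm "$CONTAINER" > /dev/null 2>&1
    # allocate roughly 1 GB inside the container, then exit
    docker run -d --name "$CONTAINER" "$IMAGE" \
        sh -c 'head -c 1G /dev/zero | tail > /dev/null'
    sleep 5
done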

Also: confirming that I added the debug options to the daemon on the last occurrence of this issue.

@titpetric
Author

Ok, I've had the issue occur today. This is what I filtered from the syslog:

Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.213419350+02:00" level=debug msg="Calling GET /_ping"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.215965623+02:00" level=debug msg="Calling POST /v1.30/containers/create"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.216531306+02:00" level=debug msg="form data: {\"AttachStderr\":true,\"AttachStdin\":true,\"AttachStdout\":true,\"Cmd\":[\"/var/www/monocms-novi/monocms/bin/get-entrypoints
\"],\"Domainname\":\"\",\"Entrypoint\":null,\"Env\":[],\"HostConfig\":{\"AutoRemove\":true,\"Binds\":[\"/etc/config/:/etc/config/\",\"/var/www/monocms-novi/monocms:/var/www/monocms-novi/monocms\",\"/src/spiderweb/conf/nginx.conf:/etc/ngi
nx/nginx.conf\",\"/src/spiderweb/bin/build/monocms-novi:/etc/nginx/sites/\"],\"BlkioDeviceReadBps\":null,\"BlkioDeviceReadIOps\":null,\"BlkioDeviceWriteBps\":null,\"BlkioDeviceWriteIOps\":null,\"BlkioWeight\":0,\"BlkioWeightDevice\":null
,\"CapAdd\":null,\"CapDrop\":null,\"Cgroup\":\"\",\"CgroupParent\":\"\",\"ConsoleSize\":[0,0],\"ContainerIDFile\":\"\",\"CpuCount\":0,\"CpuPercent\":0,\"CpuPeriod\":0,\"CpuQuota\":0,\"CpuRealtimePeriod\":0,\"CpuRealtimeRuntime\":0,\"CpuS
hares\":0,\"CpusetCpus\":\"\",\"CpusetMems\":\"\",\"DeviceCgroupRules\":null,\"Devices\":[],\"DiskQuota\":0,\"Dns\":[],\"DnsOptions\":[],\"DnsSearch\":[],\"ExtraHosts\":[\"dbhost:10.55.8.56\",\"dbhost_api:10.55.8.53\",\"cache1:10.55.8.74
\",\"cache2:10.55.8.73\",\"cache3:10.55.8.74\",\"fsng1:10.55.8.73\",\"fsng2:10.55.8.74\",\"api1:10.55.8.151\",\"api2:10.55.8.152\",\"api3:10.55.8.153\",\"ava3api.rtvslo.si:10.55.8.21\",\"docker:172.25.0.1\",\"mail:172.25.0.1\",\"docker2:
172.25.0.1\"],\"GroupAdd\":null,\"IOMaximumBandwidth\":0,\"IOMaximumIOps\":0,\"IpcMode\":\"\",\"Isolation\":\"\",\"KernelMemory\":0,\"Links\":null,\"LogConfig\":{\"Config\":{},\"Type\":\"\"},\"Memory\":0,\"MemoryReservation\":0,\"MemoryS
wap\":0,\"MemorySwappiness\":-1,\"NanoCpus\":0,\"NetworkMode\":\"party\",\"OomKillDisable\":false,\"OomScoreAdj\":0,\"PidMode\":\"\",\"PidsLimit\":0,\"PortBindings\":{},\"Privileged\":false,\"PublishAllPorts\":false,\"ReadonlyRootfs\":fa
lse,\"RestartPolicy\":{\"MaximumRetryCount\":0,\"Name\":\"no\"},\"SecurityOpt\":null,\"ShmSize\":0,\"UTSMode\":\"\",\"Ulimits\":null,\"UsernsMode\":\"\",\"VolumeDriver\":\"\",\"VolumesFrom\":n
Aug 24 13:10:51 docker2 dockerd[61989]: ull},\"Hostname\":\"\",\"Image\":\"registry.rtvslo.si/stretch-web-php\",\"Labels\":{},\"NetworkingConfig\":{\"EndpointsConfig\":{}},\"OnBuild\":null,\"OpenStdin\":true,\"StdinOnce\":true,\"Tty\":fa
lse,\"User\":\"\",\"Volumes\":{},\"WorkingDir\":\"\"}"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.297225332+02:00" level=debug msg="container mounted via layerStore: /var/lib/docker/aufs/mnt/75b9afaa3383c297831d3d322b5f1a59eaa06fb4eb2279572933f8d0fee48656"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.353545378+02:00" level=debug msg="Calling POST /v1.30/containers/ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022/attach?stderr=1&stdin=1&stdout=1&stream=1
"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.354000906+02:00" level=debug msg="attach: stderr: begin"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.354018115+02:00" level=debug msg="attach: stdin: begin"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.354226219+02:00" level=debug msg="attach: stdout: begin"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.354546879+02:00" level=debug msg="Calling POST /v1.30/containers/ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022/wait?condition=removed"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.355556583+02:00" level=debug msg="Calling POST /v1.30/containers/ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022/start"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.359023716+02:00" level=debug msg="container mounted via layerStore: /var/lib/docker/aufs/mnt/75b9afaa3383c297831d3d322b5f1a59eaa06fb4eb2279572933f8d0fee48656"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.359668338+02:00" level=debug msg="Assigning addresses for endpoint quirky_jang's interface on network party"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.359725567+02:00" level=debug msg="RequestAddress(LocalDefault/172.25.0.0/24, <nil>, map[])"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.377647623+02:00" level=debug msg="Assigning addresses for endpoint quirky_jang's interface on network party"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.390483738+02:00" level=debug msg="291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979 (50d0edd).addSvcRecords(ead59e503d78, 172.25.0.11, <nil>, true) updateSv
cRecord sid:291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.390638115+02:00" level=debug msg="291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979 (50d0edd).addSvcRecords(ead59e503d78, 172.25.0.11, <nil>, false) updateS
vcRecord sid:291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.399951870+02:00" level=debug msg="291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979 (50d0edd).addSvcRecords(ead59e503d78, 172.25.0.11, <nil>, true) updateSv
cRecord sid:291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.400034111+02:00" level=debug msg="291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979 (50d0edd).addSvcRecords(ead59e503d78, 172.25.0.11, <nil>, false) updateS
vcRecord sid:291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.404364375+02:00" level=debug msg="Programming external connectivity on endpoint quirky_jang (291d82a829689faf31b774fa896dc99fd12b4fd3ce155e5c0be8c16424a99979)"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.408908346+02:00" level=debug msg="EnableService ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022 START"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.408983984+02:00" level=debug msg="EnableService ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022 DONE"
Aug 24 13:10:51 docker2 dockerd[61989]: time="2017-08-24T13:10:51.881215608+02:00" level=debug msg="sandbox set key processing took 225.13944ms for container ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022"
Aug 24 13:10:53 docker2 dockerd[61989]: time="2017-08-24T13:10:53.546508639+02:00" level=debug msg="containerd: process exited" id=ead59e503d78c316935aaa340e4e1e24c7d421b9c4aef02337edcbbf7ef6b022 pid=init status=0 systemPid=18732
Aug 24 13:17:30 docker2 systemd[1]: Started Session 107031 of user root.
Aug 24 13:17:35 docker2 systemd[1]: Stopping Docker Application Container Engine...

Nothing obvious pops out at me. The last two lines are me already running service docker restart. There's a bunch of errors after that, anything from:

Aug 24 13:17:35 docker2 dockerd[61989]: time="2017-08-24T13:17:35.835988806+02:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container feb3f25d279d935c93167e8f8d21ba166684
755d9fcc25249855a13f4185ef0c: rpc error: code = 2 desc = containerd: container not found"
Aug 24 13:17:35 docker2 dockerd[61989]: time="2017-08-24T13:17:35.836061342+02:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ead59e503d78c316935aaa340e4e1e24c7d4
21b9c4aef02337edcbbf7ef6b022: rpc error: code = 2 desc = containerd: container not found"
Aug 24 13:17:35 docker2 dockerd[61989]: time="2017-08-24T13:17:35.836171497+02:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 76467dc50e57cf36902bd92c6d235501a606
2afa4b52d7554ad9d4ba42efe123: rpc error: code = 2 desc = containerd: container not found"
Aug 24 13:17:35 docker2 dockerd[61989]: time="2017-08-24T13:17:35.836231827+02:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container de77de908d972b25f76a860fee651112e38b
038eaf905d5f17d6cd871ea25d20: rpc error: code = 2 desc = containerd: container not found"

and then timeouts that result in kill -9:

Aug 24 13:17:45 docker2 dockerd[61989]: time="2017-08-24T13:17:45.829687931+02:00" level=info msg="Container 30fed9ae935e1a11381cfc81b61d5a2e957eaff579bf7c0a584c328c114cd165 failed to exit within 10 seconds of signal 15 - using the force
"
Aug 24 13:17:45 docker2 dockerd[61989]: time="2017-08-24T13:17:45.829772973+02:00" level=debug msg="Sending kill signal 9 to container 30fed9ae935e1a11381cfc81b61d5a2e957eaff579bf7c0a584c328c114cd165"

And more kills failed after that. This is running:

Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:17:22 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:16:12 2017
 OS/Arch:      linux/amd64
 Experimental: false

So, latest upgrade on the same host (no reinstall). If you need a docker info, LMK

@titpetric
Author

Also, we had limited success in reproducing the issue with a program that allocates a significant amount of memory and is then removed. Docker was failing both in OOM conditions and in a simple run/allocate/exit loop with less memory than required for an OOM. We didn't finish the tests fully though, other priorities. If you need one, LMK and I'll try to push it to a higher priority.
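
For reference, the OOM variant of that loop can be approximated by giving the container less memory than it tries to allocate (a sketch; the limit and allocation size are arbitrary):

# with a 256 MB limit, the ~1 GB allocation below should get OOM-killed
docker run --rm --memory 256m debian:stretch \
    sh -c 'head -c 1G /dev/zero | tail > /dev/null'
echo "exit status: $?"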

@Raffo

Raffo commented Dec 21, 2017

We are experiencing the same issue running the following docker version:

Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.4
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:24:58 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.4
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:24:58 2017
 OS/Arch:      linux/amd64
 Experimental: false

There is an easy way for us to reproduce it on an instance that is experiencing the problem:

$ docker run -it ubuntu /bin/bash
root@943b8935e38e:/# exit
exit

[nothing happens]

The container can't be found with ps aux, but it is present in docker ps. docker kill on the container also doesn't help: docker thinks it killed the container, but it is still listed if we run docker ps, and still not running as a process.

@thaJeztah
Member

@Raffo anything in the logs that could be useful? Also, what does your setup look like? Are you connected to a local daemon, or a remote one? What distro (and version) are you running? (perhaps include the docker info output)

@Raffo

Raffo commented Dec 21, 2017

Sure, here's the info:

Containers: 33
 Running: 26
 Paused: 0
 Stopped: 7
Images: 21
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: v0.13.2 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.13.16-coreos-r2
Operating System: Container Linux by CoreOS 1576.4.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.799GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Live Restore Enabled: false

The setup is just that: https://github.com/zalando-incubator/kubernetes-on-aws

@thaJeztah
Member

Looks like you're running on CoreOS, which is not a supported platform (i.e., there are no official packages for CoreOS); could you report that issue with them, since CoreOS maintains its own packages for Docker?

@Raffo

Raffo commented Dec 21, 2017

I can and I will totally do it, but I guess this will still be a Docker/Moby bug and we should try to figure it out together, especially given the similarity to the one reported originally in this issue.

@qingbo

qingbo commented Dec 28, 2017

I have a small kubernetes cluster of 3 nodes, and one of them is suffering from the same issue. I provisioned the nodes (same hardware specs) with Ansible, so I believe they should be consistent, but the problem only happens on one node. I have to watch k8s for pods stuck in the "Terminating" state, and when it happens, I restart the docker daemon on that node.

docker info output:

Containers: 143
 Running: 120
 Paused: 0
 Stopped: 23
Images: 89
Server Version: 17.09.0-ce
Storage Driver: devicemapper
 Pool Name: docker-images
 Pool Blocksize: 65.54kB
 Base Device Size: 10.74GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 13.17GB
 Data Space Total: 107.4GB
 Data Space Available: 94.21GB
 Metadata Space Used: 40.87MB
 Metadata Space Total: 104.9MB
 Metadata Space Available: 63.99MB
 Thin Pool Minimum Free Space: 10.74GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.13.8-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 56
Total Memory: 251.8GiB
ID: BESE:G7P4:LFG2:RZNZ:I2DG:RY4N:GHFD:56HV:WWZI:HSTK:MS7B:LOUX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 2409
 Goroutines: 2949
 System Time: 2017-12-28T02:35:37.793930552-05:00
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

Like @Raffo described above, it's very easy to reproduce when the issue happens, but I have no idea how the docker daemon enters such a state. Sometimes it can run for 3-4 days without problems and sometimes I have to restart docker every day.

Interesting logs (I tried docker kill):

...
Dec 28 02:31:10 kub03s dockerd: time="2017-12-28T02:31:10.305483576-05:00" level=debug msg="containerd: process exited" id=7e2f63ce6867bc28d3660b58f82159bc7a309bbdc2eb05851c97d375ea97336b pid=init status=0 systemPid=23306
Dec 28 02:32:39 kub03s dockerd: time="2017-12-28T02:32:39.726743261-05:00" level=debug msg="Calling POST /v1.32/containers/7e2f63ce6867/kill?signal=KILL"
Dec 28 02:32:39 kub03s dockerd: time="2017-12-28T02:32:39.726836346-05:00" level=debug msg="Sending kill signal 9 to container 7e2f63ce6867bc28d3660b58f82159bc7a309bbdc2eb05851c97d375ea97336b"
Dec 28 02:32:39 kub03s dockerd: time="2017-12-28T02:32:39.727128965-05:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 7e2f63ce6867bc28d3660b58f82159bc7a309bbdc2eb05851c97d375ea97336b: rpc error: code = Unknown desc = containerd: container not found"
Dec 28 02:32:49 kub03s dockerd: time="2017-12-28T02:32:49.727333467-05:00" level=info msg="Container 7e2f63ce6867 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Dec 28 02:32:49 kub03s dockerd: time="2017-12-28T02:32:49.727385529-05:00" level=debug msg="Cannot kill process (pid=23306) with signal 9: no such process."

dockerd knows that process 23306 has exited, yet it still tries to kill that process?

@titpetric
Author

titpetric commented Dec 28, 2017 via email

@Raffo

Raffo commented Dec 28, 2017

@thaJeztah Opened with CoreOS: coreos/bugs#2306. Any further help would be appreciated; in the meantime we are forced to downgrade to a previous docker release to see if the problem disappears.

@thaJeztah
Member

Thanks @Raffo

Ok, I've had the issue occur today. This is what I filtered from the syslog:

@titpetric in that output, I notice "AutoRemove":true. Starting with Docker 1.13 (and API version 1.25), the --rm option is no longer handled by the client, but daemon-side. This means that the moment the container exits, it's automatically removed. Possibly the containerd: container not found error is a race condition, where the container did not exit within x seconds, but did soon after (causing it to be removed), after which the second kill failed because the container was already gone.

This may not be an explanation for all situations you reported, but was something that stood out (and could explain some)

Unless I'm significantly mistaken, k8s only uses libcontainer (or some library) to load the docker images, but the actual containers that it runs are not run with docker.

Most k8s installations currently use Docker (through the cri-docker wrapper). Note that the docker daemon itself has not contained a "runtime" for a long time now, but uses OCI-compliant runtimes (runc being the default), controlled through containerd, to actually run the containers. In such a setup, the same things could occur. Docker versions up to 17.11 use the containerd 0.2 branch.

Docker 17.11 and up use containerd 1.0; the 1.0 version of containerd was developed from the ground up to be used as a runtime, both for docker (directly using its gRPC API), and kubernetes (using the https://github.com/kubernetes-incubator/cri-containerd project to provide the CRI interface), and is targeted to become the default runtime for kubernetes.

containerd is a better fit as a runtime for kubernetes, because it focuses on the parts that are needed to deploy containers on kubernetes; for that, it's kept minimal, and does not provide things such as networking, orchestration, a CLI, or features such as a "builder" (docker build) to build images.

@ghost

ghost commented Jan 11, 2018

We're seeing this too since we upgraded our CoreOS version.
We went from docker 1.12.6 to 17.09-ce, and within a matter of hours we had containers that refuse to stop.
Same issue: the PID reported by docker inspect is killed, but docker ps still shows the container as running.
Kubernetes reports the pod in a "Terminating" state.
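
One way to see that mismatch directly is to compare the PID docker reports with the process table (a sketch; <container> is the stuck container's name or ID):

PID=$(docker inspect -f '{{.State.Pid}}' <container>)
echo "reported pid: $PID"
# if the container process really exited, this directory no longer exists
ls -d /proc/$PID 2>/dev/null || echo "no such process"
docker ps --filter "name=<container>" --format '{{.Names}}: {{.Status}}'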

@Raffo

Raffo commented Jan 11, 2018

@dada941 We "fixed" it by rolling back to docker 1.12.6. Please note that, as far as I know, Kubernetes 1.8 is not compatible with docker 17; I'm not sure why Container Linux decided to ship this version given the number of people that are using it to run Kubernetes.

@thaJeztah thaJeztah added this to backlog in maintainers-session Jan 11, 2018
@dghubble

Kubernetes 1.8 is officially validated against Docker 1.11.2, 1.12.6, 1.13.1, and 17.03.2. It's in the release notes here, and a blog post was written about that unblocking Container Linux from revving the Docker version here.

@titpetric
Author

titpetric commented Jan 14, 2018

@thaJeztah A race condition would explain it. Some research I did a while back pointed to file descriptors (or something similar, along the lines of bind mounts) which wouldn't be cleaned up by the kernel/docker, and every issue I came across ended along the lines of "this is a kernel bug". Did you perhaps discuss some sort of stress test with the run/stop/rm cycle yet?

Edit: well, as you said, it would explain some of it. I'm running that infinite loop on a Digital Ocean instance right now, with a run -d + stop + rm, because that's the more usual (and unfortunately, very sporadic) way to reproduce the full stop. At that point it's not a race condition anymore, but more something along the lines of an unresponsive docker API (only service docker restart cures it).

@qianzhangxa

qianzhangxa commented Jan 25, 2018

@titpetric @dada941 When this issue happens, how does docker stop behave for a container which is actually not running but still shows up in docker ps as running? Will it fail with an error message, or will it do nothing (not output anything) and exit with 0? And how does docker stop behave for a container which is actually running? Can it stop the container?

@thaJeztah thaJeztah removed this from backlog in maintainers-session Feb 1, 2018
asfgit pushed a commit to apache/mesos that referenced this issue Feb 14, 2018
Due to a Docker issue (moby/moby#33820), the Docker daemon can fail
to catch a container exit, i.e., the container process has already
exited but the command `docker ps` shows the container as still
running. This will lead to the "docker run" command that we execute
in the Docker executor never returning, and it will also cause the
`docker stop` command to take no effect, i.e., it will return without
error but `docker ps` shows the container still running, so the task
will be stuck in the `TASK_KILLING` state.

To work around this Docker issue, in this patch we made the Docker
executor reap the container process directly, so the Docker executor
will be notified once the container process exits.

Review: https://reviews.apache.org/r/65518
Gilbert88 pushed a commit to Gilbert88/mesos that referenced this issue Mar 3, 2018
asfgit pushed a commit to apache/mesos that referenced this issue Mar 7, 2018
Gilbert88 pushed a commit to mesosphere/mesos that referenced this issue Mar 7, 2018
(each carrying the same commit message as above)
@euank
Contributor

euank commented May 16, 2018

Based on the description of the bug here, I think this is a duplicate of #36010 / #36048 and can thus be closed as fixed in 17.12.1 and newer.

@thaJeztah
Member

@titpetric can you confirm if this is resolved for you in 18.03.1 or up?

@thaJeztah
Member

(given that 17.12.1 reached EOL)

@qingbo

qingbo commented May 17, 2018

We upgraded to 18.03.0-ce a couple months ago and have not seen this issue since.

@thaJeztah
Member

We upgraded to 18.03.0-ce a couple months ago and have not seen this issue since.

Thanks for confirming!
On a side-note: I would recommend updating to 18.03.1-ce (as it has several bug fixes)

Let me go ahead and close this issue, as I think this should be fixed

@titpetric
Author

Generally, I'm running 17.09 stably, and didn't notice the issue on 18.01 either, but thanks for the heads up. I'll do some upgrades shortly 👍 thanks for all the hard work (and listening to complaints / bug reports!)

@thaJeztah
Member

You're welcome! Let us know if you're still running into issues 😅

@webPageDev

@qingbo Do you know the commit that fixed this issue?

@webPageDev

@thaJeztah @mlaventure @titpetric Because of an OOM, I am facing a similar problem: dockerd fails to receive containerd's exit signal when the container process has been OOM-killed. The container's process, runc, and containerd have all exited, but docker ps shows the container as UP, and docker stop, docker kill, and docker exec no longer work as expected. For example:
docker exec -it MySQL-for-Developer_MySQL-for-Developer_1 sh
rpc error: code = 2 desc = containerd: container not found
Are you sure which docker version fixes this? Do you know the commit for this issue?

@honkiko

honkiko commented Sep 9, 2019

I created an issue with the same result but more information: docker/for-linux#779

Could you check some more status? (a combined check is sketched below)

  • is the container process alive? (get the PID via docker inspect)
  • what's the state of the containerd task? (via "docker-container-ctr --namespace moby --address /var/run/docker/containerd/docker-containerd.sock task ls")
  • is the shim process alive? (ps -elf | grep $containerID)
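
A rough combined check for those three points (a sketch; the ctr binary name and socket path are copied verbatim from the list above and can differ between Docker versions):

CONTAINER_ID=<full container id>

# 1. is the container process still alive?
PID=$(docker inspect -f '{{.State.Pid}}' "$CONTAINER_ID")
ls -d /proc/$PID 2>/dev/null || echo "container process $PID is gone"

# 2. what does containerd think the task state is?
docker-container-ctr --namespace moby \
  --address /var/run/docker/containerd/docker-containerd.sock task ls | grep "$CONTAINER_ID"

# 3. is the shim process still alive?
ps -elf | grep -v grep | grep "$CONTAINER_ID"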
