Docker does not catch container exit #33820
Comments
Just had an occurrence of the issue after 21 days. Since 17.06 is released now, I'm performing the upgrade. I'd still appreciate a guide/link on how to debug this further at the internals level, so I can make a good stab at either diagnosing the root cause of the issue or trying to reproduce it. |
I think there actually may have been fixes in 17.06 in this area, so keep us posted if 17.06 works for you |
@thaJeztah is there some parent issue or PR/commit that you could refer me to? I'm interested in figuring out what the underlying issue is, which part of the runtime this falls into, and in reviewing some of the source changes myself. I don't expect to understand all of it; I would just like to either attribute this to human error and leave it in the testing phase, or better understand the technical reasons for why this happens, to see if there is some kind of workaround I can apply (i.e., not use the ext4 filesystem or similar). Any additional info is appreciated. |
hm, don't have a direct PR to refer to; possibly @mlaventure can point to it, but I think he's having this week off 😊 |
I can't think of a particular PR. This looks like docker didn't receive the exit event from containerd, or didn't process it correctly. If you could have a daemon running in debug mode, that would help check that assumption, though it would generate quite a bit of data if it takes about 3 weeks to occur. |
I would be fine with that. Can you link me to or give a short how-to for this?
Best, Tit
|
@titpetric
1. Update /etc/docker/daemon.json to add "debug": true. E.g.:
{
  "debug": true
}
2. Reload the daemon configuration: kill -HUP $(cat /run/docker.pid)
3. Reproduce the issue :)
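To confirm the setting took effect, a quick sketch (assuming a systemd host and a local daemon, neither of which is stated in the thread):
~~~
# Should print "true" once the daemon has reloaded its configuration:
docker info --format '{{.Debug}}'
# Debug-level entries should now appear in the system logger, e.g.:
journalctl -u docker.service | grep level=debug | tail
~~~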
|
Simple. Is the output in syslog, or via some docker command like docker events? I'll figure it out, thanks.
|
@titpetric the output would be in your system logger, yes |
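For reference, a couple of ways to pull those entries out of the system logger (a sketch; unit and file names vary by distro):
~~~
# systemd-based hosts:
journalctl -u docker.service --since today | grep -E 'level=(debug|warning|error)'
# classic syslog setups:
grep dockerd /var/log/syslog | tail -n 100
~~~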
Regarding #34381 - while I'm not hitting OOM conditions (no process is unexpectedly killed off), there are some circumstances which might be influencing this behavior, and memory usage is indeed one. There's one container which uses about 3GB / 8GB RAM total, and the error, when it occurs, is usually visible on this container. The way I'm reloading these containers is something akin to:
The combination of these commands somehow triggers the issue. If I'm correct, it should be easy enough to replicate the issue by writing a program that allocates a few GBs of RAM and looping through the stop/rm/run commands for a while; see the sketch below. I'll try to figure something out. Also: confirming that I added the debug opts to the daemon on the last occurrence of this issue. |
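A minimal sketch of that run/allocate/exit loop; the image, allocation size, and sleep duration are illustrative placeholders, not from the original report:
~~~
#!/bin/bash
# Cycle a memory-hungry container through run/stop/rm until dockerd
# misses an exit event. Allocates ~3GiB inside the container.
while true; do
    docker run -d --name memhog \
        python:3 python -c 'b = bytearray(3 * 1024**3); import time; time.sleep(30)'
    docker stop memhog
    docker rm memhog
done
~~~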
Ok, I've had the issue occur today. This is what I filtered from the syslog:
Nothing obvious pops out at me. Last two lines are me already running
and then timeouts that result in kill -9:
And more kills failed after that. This is running:
So, latest upgrade on the same host (no reinstall). If you need a docker info, LMK |
Also, we had limited success in reproducing the issue with a program that allocates a significant amount of memory and is then removed. Docker was failing both in OOM conditions and in a simple run/allocate/exit loop with less memory than required for OOM. We didn't finish the tests fully though, due to other priorities. If you need one, LMK and I'll try to push it to a higher prio. |
We are experiencing the same issue running the following docker version:
There is an easy way to reproduce it for us on an instance that is experiencing the problem:
The container can't be found with |
@Raffo anything in the logs that could be useful? Also, what does your setup look like? Are you connected to a local daemon, or a remote one? What distro (and version) are you running? (perhaps |
Sure, here's the info:
The setup is just that: https://github.com/zalando-incubator/kubernetes-on-aws |
Looks like you're running on CoreOS, which is not a supported platform (i.e., there are no official packages for CoreOS); can you report that issue with them, given that CoreOS maintains its own packages for Docker? |
I can and I will totally do it, but I guess this will still be a Docker/Moby bug and we should try to figure it out together, especially given the similarity to the one reported originally in this issue. |
I have a small kubernetes cluster of 3 nodes and one of them is suffering from the same issue. I provisioned the nodes (same hardware specs) with Ansible, so I believe they should be consistent, but the problem only happens on one node. I have to watch k8s for pods stuck in "Terminating" state, and when it happens, I restart the docker daemon on that node. docker info output:
~~~
Containers: 143
Running: 120
Paused: 0
Stopped: 23
Images: 89
Server Version: 17.09.0-ce
Storage Driver: devicemapper
Pool Name: docker-images
Pool Blocksize: 65.54kB
Base Device Size: 10.74GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 13.17GB
Data Space Total: 107.4GB
Data Space Available: 94.21GB
Metadata Space Used: 40.87MB
Metadata Space Total: 104.9MB
Metadata Space Available: 63.99MB
Thin Pool Minimum Free Space: 10.74GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.13.8-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 56
Total Memory: 251.8GiB
ID: BESE:G7P4:LFG2:RZNZ:I2DG:RY4N:GHFD:56HV:WWZI:HSTK:MS7B:LOUX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 2409
Goroutines: 2949
System Time: 2017-12-28T02:35:37.793930552-05:00
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: true
~~~
Like @Raffo described above, it's very easy to reproduce when the issue happens, but I have no idea how the docker daemon enters such a state. Sometimes it can run for 3-4 days without problem and sometimes I have to restart docker every day. Interesting logs (I tried docker kill):
~~~
Dec 28 02:31:10 kub03s dockerd: time="2017-12-28T02:31:10.305483576-05:00" level=debug msg="containerd: process exited" id=7e2f63ce6867bc28d3660b58f82159bc7a309bbdc2eb05851c97d375ea97336b pid=init status=0 systemPid=23306
Dec 28 02:32:39 kub03s dockerd: time="2017-12-28T02:32:39.726743261-05:00" level=debug msg="Calling POST /v1.32/containers/7e2f63ce6867/kill?signal=KILL"
Dec 28 02:32:39 kub03s dockerd: time="2017-12-28T02:32:39.726836346-05:00" level=debug msg="Sending kill signal 9 to container 7e2f63ce6867bc28d3660b58f82159bc7a309bbdc2eb05851c97d375ea97336b"
Dec 28 02:32:39 kub03s dockerd: time="2017-12-28T02:32:39.727128965-05:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 7e2f63ce6867bc28d3660b58f82159bc7a309bbdc2eb05851c97d375ea97336b: rpc error: code = Unknown desc = containerd: container not found"
Dec 28 02:32:49 kub03s dockerd: time="2017-12-28T02:32:49.727333467-05:00" level=info msg="Container 7e2f63ce6867 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Dec 28 02:32:49 kub03s dockerd: time="2017-12-28T02:32:49.727385529-05:00" level=debug msg="Cannot kill process (pid=23306) with signal 9: no such process."
~~~
dockerd knows that process 23306 has exited, yet it still tries to kill that process? |
This is the smallest code that has been able, with time, to reproduce the issue:
~~~
#!/bin/bash
while true
do
docker run -d --name test [a large docker image]
docker stop test
docker rm test
done
~~~
Unfortunately, the stable versions are in the 1.12 tree. That is to say, we migrated down because we were forced to by stability issues. So far, known stable versions:
Docker version 1.12.3, build 6b644ec
Docker version 1.12.6, build 78d1802
I don't feel bad about saying this, but hitting this issue doesn't have a production workaround (or even a proper diagnosis as to the root cause). At the current time this also means that the rollout of docker in production is being stopped, kubernetes is being evaluated as it has a different container runtime, and it seems that docker itself will be abandoned for anything other than development. We're sort of hoping that some already-production deployments don't experience this issue. So far, these have been "stable" (only one crash in roughly half a year):
Ubuntu 16.04, Docker version 17.10.0-ce, build f4ffd25
Debian stretch, Docker version 17.09.0-ce, build afdb6d4 (this one has 1 crash against it)
Debian stretch, Docker version 17.06.0-ce, build 02c1d87
The key to stability seems to be load and frequency. The more containers you have, or the more you touch them (docker exec from cron jobs, for example), the more the instability shows itself.
As to the Kubernetes comment above - the issue seems to be only the fact that you use Docker to run your k8s. Install it on the host directly (a clean VM, no docker), and you should have a workaround. Unless I'm significantly mistaken, k8s only uses libcontainer (or some library) to load the docker images, but the actual containers that it runs are not run with docker. As such, my reasoning is that you might regain stability with k8s, without throwing the baby out with the bathwater.
Best,
Tit
|
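To automate catching the failure in a loop like the one above, one could wrap the stop in a timeout (a sketch; the 60-second bound is an arbitrary assumption, and the image placeholder comes from the original script):
~~~
#!/bin/bash
while true; do
    docker run -d --name test [a large docker image]
    # docker stop defaults to a 10s grace period; anything much longer
    # suggests dockerd lost track of the container's exit.
    if ! timeout 60 docker stop test; then
        echo "docker stop hung - daemon likely missed the exit event" >&2
        break
    fi
    docker rm test
done
~~~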
@thaJeztah Opened to CoreOS: coreos/bugs#2306 . Any further help would be appreciated; in the meantime we are forced to downgrade to a previous docker release to see if the problem disappears. |
Thanks @Raffo
@titpetric in that output, I notice This may not be an explanation for all situations you reported, but was something that stood out (and could explain some)
Most k8s installations currently use Docker (through the ...). Docker 17.11 and up use containerd 1.0; the 1.0 version of containerd was developed from the ground up to be used as a runtime, both for docker (directly using its gRPC API) and for kubernetes (using the https://github.com/kubernetes-incubator/cri-containerd project to provide the CRI interface), and is targeted to become the default runtime for kubernetes. containerd is a better fit as a runtime for kubernetes because it focuses on the parts that are needed to deploy containers on kubernetes; for that, it's kept minimal, and does not provide things such as networking, orchestration, a CLI, or features such as a "builder" ( |
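As an aside, one way to see which containerd build a given daemon is wired to (a sketch; output formats differ across docker versions):
~~~
docker info 2>/dev/null | grep -i containerd    # reported containerd commit
ps -ef | grep containerd | grep -v grep         # the running containerd process(es)
~~~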
We're seeing this too since we upgraded CoreOS version. |
@dada941 We "fixed" it by rolling back to docker 1.12.06. Please note that Kubernetes 1.8, as far as I know, is not compatible with docker 17; I'm not sure why Container Linux decided to ship this version, given the number of people that are using it to run Kubernetes. |
@thaJeztah A race condition would explain it. Some research I did a while back was related to file descriptors (or something similar, along the lines of bind mounts) which wouldn't be cleaned up by the kernel/docker, and every issue I came across ended along the lines of "this is a kernel bug". Did you perhaps discuss some sort of stress test with the run/stop/rm cycle yet? Edit: well, as you said, it would explain some of it. I'm running that infinite loop on a digital ocean instance right now, with a |
@titpetric @dada941 When this issue happened, how will |
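Answering the docker inspect question in general terms: a hypothetical way to compare dockerd's view with the kernel's when a container looks stuck (the container id is a placeholder):
~~~
CID=7e2f63ce6867   # placeholder: id of the stuck container
docker inspect -f 'status={{.State.Status}} pid={{.State.Pid}}' "$CID"
PID=$(docker inspect -f '{{.State.Pid}}' "$CID")
# If dockerd missed the exit, status still says "running" but the PID is gone:
kill -0 "$PID" 2>/dev/null && echo "process alive" || echo "no such process"
~~~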
Due to a Docker issue (moby/moby#33820), the Docker daemon can fail to catch a container exit, i.e., the container process has already exited but `docker ps` shows the container still running. This leads to the `docker run` command that we execute in the Docker executor never returning, and it also causes the `docker stop` command to take no effect, i.e., it returns without error but `docker ps` shows the container still running, so the task will get stuck in `TASK_KILLING` state. To work around this Docker issue, this patch makes the Docker executor reap the container process directly, so the Docker executor is notified once the container process exits. Review: https://reviews.apache.org/r/65518
@titpetric can you confirm if this is resolved for you in 18.03.1 or up? |
(given that 17.12.1 reached EOL) |
We upgraded to 18.03.0-ce a couple months ago and have not seen this issue since. |
Thanks for confirming! Let me go ahead and close this issue, as I think this should be fixed |
Generally, I'm running 17.09 stably, and didn't notice the issue on 18.01 either, but thanks for the heads up. I'll do some upgrades shortly 👍 thanks for all the hard work (and listening to complaints / bug reports!) |
You're welcome! Let us know if you're still running into issues 😅 |
@qingbo Do you know which commit fixed this issue? |
@thaJeztah @mlaventure @titpetric Because of an OOM, I am facing a similar problem: dockerd fails to receive containerd's exit signal when the container process has been OOM-killed. The container's process, runc, and containerd have all exited, but docker ps shows the container as Up, and docker stop, docker kill, and docker exec no longer work as expected, such as below: |
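Two hypothetical checks that can help confirm the OOM-kill sequence described here (the container name is a placeholder):
~~~
docker inspect -f 'OOMKilled={{.State.OOMKilled}} Status={{.State.Status}}' mycontainer
dmesg | grep -i -E 'out of memory|killed process'   # kernel OOM-killer traces
~~~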
I created an issue with the same result but more information: docker/for-linux#779. Could you check some more status details? |
|
Description
Occasionally/often (within days/weeks), a container would exit, but docker doesn't catch this exit or clean up after this container (--rm). The container is "stuck", and the process running docker run waits indefinitely.
Steps to reproduce the issue:
Describe the results you received:
This is the output of docker events for the container which exited:
The full run cmd:
Describe the results you expected:
I expected dockerd to handle/catch that the container has exited and clean up after it. The process tree doesn't indicate that any container process is still running, and docker didn't catch it.
Additional information you deem important (e.g. issue happens only occasionally):
This issue was occurring on overlay2, and recently (a few weeks ago) I switched to aufs to see if it improved behaviour (it did not). The server is mostly idle (no live traffic except deployments), and pretty much the same issue was occurring in every docker-ce version since 17.01. I had a discussion with @thaJeztah some months back; he suggested it was related to an issue where removal was skipped due to a client/server API version mismatch. Pretty much every docker-ce edge build since then had the same issue, unfortunately.
Issuing docker kill has no effect, I can't exec into the container (because no root PID is alive, I suspect), docker stop doesn't do anything either, and docker rm -f issues this:
But it doesn't affect the master docker run command (did not break out of that), even if the container is now actually removed from docker ps output. The only way to move forward is to kill -9 the docker run command. The following docker run command gets stuck as well. So, once the issue starts occurring, it's occurring all the time.
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.):
Server is a VM running on Hyper-V, latest debian stretch install.
Any thoughts on how this could be resolved? We're very happy and stable on 1.12/1.13 versions with a slightly older installation (i.e., no such painful issues). Let me know if I can inspect dockerd somehow deeper (i.e., with more verbosity).
I'm going to upgrade the server(s) to 17.06 and again hope for the best. I can't seem to find any directly related issues on the tracker; it seems that some people have issues with version 1.12/1.13, but that one literally works like a charm for us. I'd hate to downgrade :(
Edit 1: formatting, and to note that service docker restart does fix the issue temporarily. That suggests to me that there's some internal dockerd state that went wrong and isn't handled properly (best guess). I'm not sure I should put service docker restart into a cron/health check, however... :)
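For completeness, a minimal sketch of such a cron health check, with the author's own caveat that it papers over the bug rather than fixing it; the image and timeout are illustrative:
~~~
#!/bin/bash
# If a trivial container can't complete within a generous bound,
# assume dockerd is wedged and restart it.
if ! timeout 120 docker run --rm busybox true; then
    echo "docker appears wedged; restarting" | logger -t docker-health
    service docker restart
fi
~~~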