dockerd enters State=D, uninterruptible sleep; fsnotify_destroy_group() #38750
Comments
While the overall ticket appears to be describing a different situation, the observations made in kubernetes/kubernetes#70229 (comment) are very similar to mine - including Azure + Kubernetes + Docker + Datadog.
I do see
I have a strong suspicion that dockerd is a victim here, and that https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802021 aka https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?h=dev&id=1a05c0cd2fee234a10362cc8f66057557cbb291f is the cause. That bug professes to be very difficult to reproduce deterministically, so it will be hard to be certain it's fixed. I really don't expect anyone to take action for this issue -- it's a placeholder to point at which I hope will be happily closed in ~2 weeks with no further effort.
Closing this in favour of Azure/AKS#838
I have upgraded to 4.15.0-1040-azure. Now waiting to see if the problem happens again. |
I have observed 2 weeks of uptime on 8 nodes without observation of the original symptoms since upgrading the AKS node kernel to 4.15.0-1040-azure. I am confident the kernel patch has resolved our problem.
I am experiencing this too, basically exactly the same kernel dump, only not on Azure but on VMware-based Ubuntu boxen. All Ubuntu 16.04 with
It was an upstream bug in many distributions' Linux kernels. The distribution-specific bit is going to be the fix vector. I guess you may want
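For the Ubuntu-on-Azure case above, the fix path is roughly the following sketch (package names assume the standard linux-azure metapackage convention; other distributions will need their own equivalent):

```bash
# Which kernel is this node currently running?
uname -r                                        # e.g. 4.15.0-1037-azure (affected)

# Pull in the patched Azure kernel (4.15.0-1040-azure or newer), then reboot
sudo apt-get update
sudo apt-get install --only-upgrade linux-azure
sudo reboot
```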
Description
After 3-15 days of uptime, I observe dockerd stop responding.
Steps to reproduce the issue:
I do not know the exact recipe. I deployed an Azure AKS 1.11.5 cluster with 8 nodes, added a Datadog DaemonSet, created a bunch of deployments that CD updated ~5 times a day, waited. After a week of uptime, dockerd on at least one of the VMs goes into this state. If the VM is power-cycled, it'll experience this again within about 2 weeks.
Describe the results you received:
This has been observed in AKS with the versions detailed under "Additional environment details" below.
The first k8s-centric observation is the node going into state=NotReady, with an event recorded: "ContainerGCFailed: rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?". The kubelet continues running, but can no longer interact with dockerd. The Kubernetes master tries to Terminate the pods and recreate them on other nodes - the termination blocks (since the node is NotReady) and the new replicas come up without issue.

At the same time, the VM's 5-minute load average starts rapidly increasing, jumping from below 5 to above 300 in <30 minutes. The combination of the above symptoms results in someone getting paged.
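For anyone watching for this, the node condition and the ContainerGCFailed event are visible via kubectl (the node name below is hypothetical):

```bash
# The affected node flips to NotReady
kubectl get nodes

# The ContainerGCFailed event shows up in the node's event stream
kubectl describe node aks-nodepool1-12345678-0 | grep -A 3 ContainerGCFailed
```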
Existing pods/containers on the node/vm continue to run and are responsive to network communication, though no containers can be stopped, started, exec'd or have logs fetched.
At this point someone can ssh to the VM fine. Running docker ps reveals the same error text, "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?". The dockerd process exists.
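A quick way to confirm that combination from the node itself (a sketch; the ping uses Docker's standard /_ping endpoint on the unix socket):

```bash
# The CLI cannot reach the daemon...
docker ps

# ...but the daemon process is still present
pgrep -a dockerd

# ...and the API socket does not answer either (expect this to hang or fail)
curl --unix-socket /var/run/docker.sock http://localhost/_ping
```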
Describe the results you expected:

dockerd responds to docker cli / kubelet
Additional information you deem important (e.g. issue happens only occasionally):
/proc/DOCKERPID/status shows that dockerd is in State=D -- uninterruptible sleep. Attempting to restart it with systemctl restart docker.service times out. Attaching to dockerd with strace or gdb in this state predictably fails, as does kill -9 -- they're all going to block on the syscall completing.

Inspecting the threads comprising dockerd in this state shows many threads blocked on (presumably long-running) syscalls doing inotify / fsnotify things.
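A minimal sketch of how to see that from the node (standard procfs; reading the per-thread kernel stacks needs root):

```bash
DOCKERPID=$(pidof dockerd)

# "State: D" means uninterruptible sleep
grep '^State' /proc/$DOCKERPID/status

# Kernel function each thread is currently sleeping in
for t in /proc/$DOCKERPID/task/*; do
  echo "$(basename "$t") $(cat "$t"/wchan)"
done

# Full kernel stacks; on an affected host the fsnotify/inotify frames
# (e.g. fsnotify_destroy_group from the issue title) appear here
sudo cat /proc/$DOCKERPID/task/*/stack
```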
In contrast, a post-reboot dockerd has threads which look like
The only semi-unusual thing I'm doing that might contribute to inotify stuff is running the Datadog Agent https://github.com/DataDog/datadog-agent as a DaemonSet, following the standard guidance at https://docs.datadoghq.com/agent/kubernetes/daemonset_setup/. That mounts the host's /proc and /sys/fs/cgroup read-only, plus /var/run/docker.sock read-write, all to monitor status and fetch container logs (the relevant mounts are sketched below). The Datadog agent can't talk to dockerd either once this problem starts, though it is still fetching system metrics (e.g. loadavg) for at least 1h.

This doesn't look like #31007 (not at startup) or #31648 (no large number of other processes in State=D) or #15204 (that has a working dockerd).
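For reference, the hostPath mounts described above, as they appear in that DaemonSet guidance, look roughly like this (a sketch, not the exact manifest in use):

```yaml
# Excerpt of the agent container/pod spec
volumeMounts:
  - name: procdir
    mountPath: /host/proc
    readOnly: true
  - name: cgroups
    mountPath: /host/sys/fs/cgroup
    readOnly: true
  - name: dockersocket
    mountPath: /var/run/docker.sock    # read-write; used to talk to dockerd
volumes:
  - name: procdir
    hostPath:
      path: /proc
  - name: cgroups
    hostPath:
      path: /sys/fs/cgroup
  - name: dockersocket
    hostPath:
      path: /var/run/docker.sock
```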
This does look very similar to https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1798212 which points to https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802021 which appears to be a linux kernel bug. I have not verified the applicability of that bug.
Output of docker version (taken from a failed machine post-reboot):
Output of docker info (taken from a failed machine post-reboot):
Additional environment details (AWS, VirtualBox, physical, etc.):
I have observed this in Azure's AKS 1.11.5+1.11.7, with Ubuntu 16.04.x, with an Azure kernel based on 4.15, with Moby 3.0.1+3.0.5, with DatadogAgent 6.7.0+6.8.2+6.9.0.
Linux version 4.15.0-1037-azure (buildd@lgw01-amd64-039) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)) #39~16.04.1-Ubuntu SMP Tue Jan 15 17:20:47 UTC 2019