dockerd enters State=D, uninterruptible sleep; fsnotify_destroy_group() #38750

Closed
ellieayla opened this issue Feb 18, 2019 · 8 comments


ellieayla commented Feb 18, 2019

Description

After 3-15 days of uptime, I observe dockerd stop responding.

# docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Steps to reproduce the issue:

I do not know the exact recipe. I deployed an Azure AKS 1.11.5 cluster with 8 nodes, added a Datadog DaemonSet, created a number of deployments that CD updates roughly 5 times a day, and waited. After a week of uptime, dockerd on at least one of the VMs goes into this state. If the VM is power-cycled, it experiences this again within about 2 weeks.

Describe the results you received:

This has been observed in AKS with:

  • kubernetes 1.11.5 + moby runtime 3.0.1 + Ubuntu 16.04.5
  • kubernetes 1.11.7 + moby runtime 3.0.4 + Ubuntu 16.04.10

The first Kubernetes-level symptom is the node going into state=NotReady, with a ContainerGCFailed event recorded: rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?. The kubelet keeps running but can no longer talk to dockerd. The Kubernetes master tries to terminate the pods and recreate them on other nodes; the terminations block (since the node is NotReady) while the new replicas come up without issue.
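
For anyone triaging a node in this state, the NotReady condition and the ContainerGCFailed event can be confirmed with standard kubectl commands; this is an illustrative sketch (the node name is just the one from this report):

kubectl get nodes                                 # affected node shows STATUS NotReady
kubectl describe node aks-nodepool1-22789428-10   # Conditions and Events sections carry the ContainerGCFailed error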

At the same time, the VM's 5-minute load average starts rapidly increasing, jumping from below 5 to above 300 in <30 minutes. The combination of the above symptoms results in someone getting paged.

[Graph: 5-minute system load climbing during the incident]
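
Tasks in uninterruptible sleep count toward the Linux load average, which is consistent with the spike above. A quick way to watch this on an affected node (an illustrative sketch, not taken from the original report):

# 1-, 5- and 15-minute load averages
cat /proc/loadavg

# count threads stuck in uninterruptible sleep (state D), grouped by command name
ps -eLo stat,comm | awk '$1 ~ /^D/ {print $2}' | sort | uniq -c | sort -rn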

Existing pods/containers on the node/VM continue to run and respond to network traffic, though no containers can be stopped, started, exec'd, or have their logs fetched.

At this point someone can ssh to the VM fine. Running docker ps reveals the same error-text, "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?". The dockerd process exists.

Describe the results you expected:

dockerd responds to docker cli / kubelet

Additional information you deem important (e.g. issue happens only occasionally):

/proc/DOCKERPID/status shows that dockerd is in State=D -- uninterruptible sleep.

Name:	dockerd
Umask:	0022
State:	D (disk sleep)
Tgid:	4223
Ngid:	0
Pid:	4223
PPid:	1
TracerPid:	23605
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	2048
Groups:	 
NStgid:	4223
NSpid:	4223
NSpgid:	4223
NSsid:	4223
VmPeak:	10807220 kB
VmSize:	10807220 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	  417684 kB
VmRSS:	  408324 kB
RssAnon:	  370940 kB
RssFile:	   37384 kB
RssShmem:	       0 kB
VmData:	 8748436 kB
VmStk:	     132 kB
VmExe:	   26844 kB
VmLib:	       8 kB
VmPTE:	    5048 kB
VmSwap:	       0 kB
HugetlbPages:	       0 kB
CoreDumping:	0
Threads:	1019
SigQ:	1/128602
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	ffffffffffc1feff
CapInh:	0000000000000000
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000
NoNewPrivs:	0
Seccomp:	0
Speculation_Store_Bypass:	vulnerable
Cpus_allowed:	f
Cpus_allowed_list:	0-3
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	143376
nonvoluntary_ctxt_switches:	35258
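
For reference, this state can be read straight out of procfs without attaching anything to the process; a minimal sketch using the PID from this report:

# scheduler state of the main dockerd thread (D = uninterruptible disk sleep)
grep '^State' /proc/4223/status

# number of threads in the process (matches the Threads: field above)
ls /proc/4223/task | wc -l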

Attempting to restart it with systemctl restart docker.service times out. Attaching to dockerd with strace or gdb in this state predictably fails, as does kill -9 -- they all end up blocked waiting for the stuck syscall to complete.

Inspecting the threads that make up dockerd in this state shows many of them blocked in (presumably long-running) syscalls doing inotify/fsnotify work.

/proc/4223/task# for THREAD in */stack; do cat $THREAD | cut -d ' ' -f 2 | tr "\n" " "; echo; done | sort | uniq -c | sort -rn
    958 flush_work+0x129/0x1e0 flush_delayed_work+0x3f/0x50 fsnotify_wait_marks_destroyed+0x15/0x20 fsnotify_destroy_group+0x48/0xd0 inotify_release+0x1e/0x50 __fput+0xea/0x220 ____fput+0xe/0x10 task_work_run+0x8a/0xb0 exit_to_usermode_loop+0xc4/0xd0 do_syscall_64+0xf4/0x130 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 0xffffffffffffffff 
      4 flush_work+0x129/0x1e0 flush_delayed_work+0x3f/0x50 fsnotify_wait_marks_destroyed+0x15/0x20 fsnotify_destroy_group+0x48/0xd0 inotify_new_group+0x10c/0x120 SyS_inotify_init1+0x2d/0x70 do_syscall_64+0x73/0x130 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 0xffffffffffffffff

In contrast, a post-reboot dockerd has threads whose kernel stacks look like this:

    141 futex_wait_queue_me+0xc4/0x120 futex_wait+0x119/0x260 do_futex+0x128/0x500 SyS_futex+0x7f/0x180 do_syscall_64+0x73/0x130 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 0xffffffffffffffff 
      6 ep_poll+0x2fc/0x3b0 SyS_epoll_wait+0xb8/0xd0 do_syscall_64+0x73/0x130 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 0xffffffffffffffff 
      1 hrtimer_nanosleep+0xd8/0x1f0 SyS_nanosleep+0x8f/0xa0 do_syscall_64+0x73/0x130 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 0xffffffffffffffff 
      1 ep_poll+0x2fc/0x3b0 SyS_epoll_pwait+0x1d5/0x200 do_syscall_64+0x73/0x130 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 0xffffffffffffffff 
      1 do_wait+0x1be/0x240 kernel_waitid+0xa4/0xd0 SYSC_waitid+0x138/0x150 SyS_waitid+0xe/0x10 do_syscall_64+0x73/0x130 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 0xffffffffffffffff 

The only semi-unusual thing I'm doing that might contribute to inotify activity is running the Datadog Agent (https://github.com/DataDog/datadog-agent) as a DaemonSet, following the standard guidance at https://docs.datadoghq.com/agent/kubernetes/daemonset_setup/. That mounts the host's /proc and /sys/fs/cgroup read-only, plus /var/run/docker.sock read-write, to monitor status and fetch container logs. Once the problem starts, the Datadog agent can't talk to dockerd either, though it keeps reporting system metrics (e.g. loadavg) for at least an hour.
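
For context, those mounts are roughly what you would get by running the agent directly with Docker, as in the sketch below; this is illustrative only -- the actual DaemonSet manifest differs, the image tag is an example, and DD_API_KEY is a placeholder:

# illustrative sketch of the host mounts described above, not the real DaemonSet
docker run -d --name datadog-agent \
  -e DD_API_KEY=<YOUR_API_KEY> \
  -v /var/run/docker.sock:/var/run/docker.sock:rw \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup/:ro \
  datadog/agent:6.9.0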

This doesn't look like #31007 (not at startup), #31648 (no large number of other processes in State=D), or #15204 (that one has a working dockerd).

This does look very similar to https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1798212, which points to https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802021, which appears to be a Linux kernel bug. I have not verified that that bug applies here.

Output of docker version:

Taken from a failed machine post-reboot:

Client:
 Version:           3.0.3
 API version:       1.40
 Go version:        go1.11.4
 Git commit:        48bd4c6d
 Built:             Wed Jan 23 16:17:56 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          3.0.4
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.11.4
  Git commit:       8ecd530
  Built:            Fri Jan 25 01:45:38 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.2.2
  GitCommit:        9754871865f7fe2f4e74d43e2fc7ccd237edcbce
 runc:
  Version:          1.0.0-rc6+dev
  GitCommit:        96ec2177ae841256168fcf76954f7177af9446eb
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Taken from a failed machine post-reboot:

Containers: 74
 Running: 73
 Paused: 0
 Stopped: 1
Images: 65
Server Version: 3.0.4
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 96ec2177ae841256168fcf76954f7177af9446eb
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-1037-azure
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 31.42GiB
Name: aks-nodepool1-22789428-10
ID: PR6J:NMEH:LPVE:DUOR:XMI2:6BRX:P7OF:MIY2:WL6C:QSGE:UBEY:ZBQE
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

I have observed this on Azure AKS 1.11.5 and 1.11.7, with Ubuntu 16.04.x, an Azure kernel based on 4.15, Moby 3.0.1 and 3.0.5, and Datadog Agent 6.7.0, 6.8.2 and 6.9.0.

Linux version 4.15.0-1037-azure (buildd@lgw01-amd64-039) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)) #39~16.04.1-Ubuntu SMP Tue Jan 15 17:20:47 UTC 2019

@ellieayla (Author) commented:

While the overall ticket appears to be describing a different situation, the observations made in kubernetes/kubernetes#70229 (comment) are very similar to mine - including Azure + Kubernetes + Docker + Datadog.

@ellieayla (Author) commented:

I do see "INFO: task dockerd:4227 blocked for more than 120 seconds." in the kernel ring buffer, similar to https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802021 -- @davecore82, do you think that's the underlying issue?

[125062.156295] INFO: task dockerd:4227 blocked for more than 120 seconds.
[125062.162186]       Not tainted 4.15.0-1037-azure #39~16.04.1-Ubuntu
[125062.169886] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[125062.176139] dockerd         D    0  4227      1 0x00000000
[125062.176142] Call Trace:
[125062.176147]  __schedule+0x3d6/0x8b0
[125062.176149]  schedule+0x36/0x80
[125062.176152]  schedule_timeout+0x1db/0x370
[125062.176154]  ? try_to_wake_up+0x59/0x5f0
[125062.176156]  wait_for_completion+0xb4/0x140
[125062.176158]  ? wake_up_q+0x70/0x70
[125062.176161]  flush_work+0x129/0x1e0
[125062.176163]  ? worker_detach_from_pool+0xb0/0xb0
[125062.176165]  flush_delayed_work+0x3f/0x50
[125062.176167]  fsnotify_wait_marks_destroyed+0x15/0x20
[125062.176169]  fsnotify_destroy_group+0x48/0xd0
[125062.176171]  inotify_release+0x1e/0x50
[125062.176174]  __fput+0xea/0x220
[125062.176177]  ____fput+0xe/0x10
[125062.176179]  task_work_run+0x8a/0xb0
[125062.176182]  exit_to_usermode_loop+0xc4/0xd0
[125062.176184]  do_syscall_64+0xf4/0x130
[125062.176186]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[125062.176188] RIP: 0033:0x4bd440
[125062.176190] RSP: 002b:000000c0058a1d10 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
[125062.176192] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000004bd440
[125062.176193] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000278
[125062.176194] RBP: 000000c0058a1d50 R08: 0000000000000000 R09: 0000000000000000
[125062.176195] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000008
[125062.176196] R13: 0000000000000033 R14: 0000000001dbf22a R15: 0000000000000000
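
These warnings can be pulled out of the kernel ring buffer with something like the following (illustrative, not part of the original report):

# hung-task warnings plus a chunk of the call trace that follows each one
dmesg -T | grep -A 20 'blocked for more than 120 seconds'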

@ellieayla (Author) commented:

I have a strong suspicion that dockerd is a victim here, and that https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802021, aka https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?h=dev&id=1a05c0cd2fee234a10362cc8f66057557cbb291f, is the cause. That bug is reportedly very difficult to reproduce deterministically, so it will be hard to be certain it's fixed.

I really don't expect anyone to take action on this issue -- it's a placeholder to point at, which I hope will be happily closed in ~2 weeks with no further effort.

@ellieayla (Author) commented:

Closing this in favour of Azure/AKS#838

@ellieayla (Author) commented:

I have upgraded to 4.15.0-1040-azure. Now waiting to see if the problem happens again.

@ellieayla (Author) commented:

I have observed 2 weeks of uptime on 8 nodes without any recurrence of the original symptoms since upgrading the AKS node kernel to 4.15.0-1040-azure. I am confident the kernel patch has resolved our problem.


victort commented May 15, 2019

I am experiencing this too, with essentially the same kernel dump, only not on Azure: VMware-based Ubuntu boxes, all Ubuntu 16.04 with 4.15.0-32-generic kernels.

@ellieayla (Author) commented:

It was an upstream bug in the Linux kernels of many distributions; the distribution-specific part is only how the fix is delivered. I guess you want 4.15.0-47.50, probably with -generic in the package name?
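
If it helps, you can check what is currently running and what is installed with something like this (a sketch; exact package names vary by kernel flavour):

# kernel currently booted
uname -r

# installed kernel image packages (look for 4.15.0-47 or newer)
dpkg -l 'linux-image-*' | grep '^ii'

# what the -generic meta-package would currently pull in
apt-cache policy linux-image-generic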
