
docker daemon hang for futex #42921

Open
zhangxingdeppon opened this issue Oct 8, 2021 · 7 comments

zhangxingdeppon commented Oct 8, 2021

Description
Hi team, I've hit a problem where the docker daemon hangs on a futex. It looks unlike #25321 and #28754.

Steps to reproduce the issue:

  1. Use strace; see below:

strace: Process 13664 attached
futex(0x563b8eb4d6c8, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 13664 detached
 <detached ...>

  2. Debug the daemon: dump the goroutine stacks with kill -SIGUSR1 $(pidof dockerd)

goroutine-stacks-2021-10-08T153230+0800.log

  3. Use pprof to check the goroutines; many are blocked on a mutex, for example:

18 @ 0x563b8add0c46 0x563b8ade1aa4 0x563b8ade1a8f 0x563b8ade1809 0x563b8ae1c50e 0x563b8b69f5fc 0x563b8b69f602 0x563b8c28f839 0x563b8c4fa0aa 0x563b8c2f0b37 0x563b8c2f118b 0x563b8bda9506 0x563b8c2f7371 0x563b8b116dd6 0x563b8bdabbe4 0x563b8c2f5ca5 0x563b8b11a256 0x563b8b115bb7 0x563b8adff921
#	0x563b8ade1808	sync.runtime_SemacquireMutex+0x48								/usr/local/go/src/runtime/sema.go:71
#	0x563b8ae1c50d	sync.(*Mutex).lockSlow+0xfd									/usr/local/go/src/sync/mutex.go:138
#	0x563b8b69f5fb	sync.(*Mutex).Lock+0xbb										/usr/local/go/src/sync/mutex.go:81
#	0x563b8b69f601	github.com/docker/docker/container.(*State).SetRemovalInProgress+0xc1				/root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/container/state.go:344
#	0x563b8c28f838	github.com/docker/docker/daemon.(*Daemon).ContainerRm+0xb8					/root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/daemon/delete.go:30
#	0x563b8c4fa0a9	github.com/docker/docker/api/server/router/container.(*containerRouter).deleteContainers+0x179	/root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router/container/container_routes.go:528
#	0x563b8c2f0b36	github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1+0x176	/root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/middleware/experimental.go:26
#	0x563b8c2f118a	github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1+0x5fa	/root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/middleware/version.go:62
#	0x563b8bda9505	github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1+0x825		/root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/pkg/authorization/middleware.go:59
#	0x563b8c2f7370	github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1+0x240			/root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/server.go:142
#	0x563b8b116dd5	net/http.HandlerFunc.ServeHTTP+0x45								/usr/local/go/src/net/http/server.go:2036
#	0x563b8bdabbe3	github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP+0xe3			/root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:212
#	0x563b8c2f5ca4	github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP+0x74				/root/rpmbuild/BUILD/src/engine/.gopath/src/github.com/docker/docker/api/server/router_swapper.go:29
#	0x563b8b11a255	net/http.serverHandler.ServeHTTP+0xa5								/usr/local/go/src/net/http/server.go:2831
#	0x563b8b115bb6	net/http.(*conn).serve+0x876									/usr/local/go/src/net/http/server.go:1919

The full details:
goroutine.log
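
For anyone reproducing this: when the daemon runs with debug enabled ("debug": true in daemon.json, or dockerd -D), the same goroutine dump is also exposed over the API socket via pprof. A rough sketch, assuming the default socket path:

# Fetch a full goroutine dump over the API socket (requires daemon debug mode)
curl --unix-socket /var/run/docker.sock \
  'http://localhost/debug/pprof/goroutine?debug=2' > goroutines.txt

# Count goroutines parked on the container-state mutex in the SIGUSR1 dump
grep -c 'SetRemovalInProgress' goroutine-stacks-2021-10-08T153230+0800.log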

Describe the results you received:

Commands such as docker rm -f and docker run -it hang and cannot be used.
Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client: Docker Engine - Community
 Version:           20.10.5
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        55c4c88
 Built:             Tue Mar  2 20:33:55 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          19.03.15
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       99e3ed8919
  Built:            Sat Jan 30 03:16:33 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

[root@d20808u19-72-131 ~]# docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 110
  Running: 88
  Paused: 0
  Stopped: 22
 Images: 71
 Server Version: 19.03.15
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1062.el7.x86_64
 Operating System: Red Hat Enterprise Linux Server 7.7 (Maipo)
 OSType: linux
 Architecture: x86_64
 CPUs: 64
 Total Memory: 251.4GiB
 Name: d20808u19-72-131
 ID: IPHD:P4TD:SFTL:WKRP:77D5:HGOB:JFFJ:BM62:D2LN:H5U4:NN2H:GQE3
 Docker Root Dir: /data1/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):


Linux d20808u19-72-131 3.10.0-1062.el7.x86_64 #1 SMP Thu Jul 18 20:25:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.7 (Maipo)
@zq-david-wang
Contributor

3.10.0-1062 is known to have a kernel memory leak, easily reproduced by frequently restarting containers.
Do any kernel error logs show up in dmesg -T?
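
A quick scan for the usual suspects (a generic sketch, not the exact commands used in this thread):

# Hung-task warnings, OOM kills and allocation failures in the kernel log
dmesg -T | grep -iE 'hung task|blocked for more than|out of memory|oom|page allocation failure'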

@zhangxingdeppon
Author

@zq-david-wang Okay, I searched the dmesg -T output but did not find any related information. What might I have missed? The details are below:
dmesg.log


@zq-david-wang
Contributor

I do not see a memory error log in the file you attached. The strange thing is the timestamps: the last line's timestamp, [Sat Oct 9 17:21:11 2021], indicates that nothing significant happened for about 4 days.
Maybe this is an I/O issue. Is there any process stuck in the "D" (uninterruptible sleep) state?
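
A sketch for checking that, including the kernel wait channel each process sleeps in:

# Processes in uninterruptible sleep ("D"), with their kernel wait channel
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'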

@zhangxingdeppon
Author

@zq-david-wang I used the command ps aux | awk '$8 == "D"' and it returned nothing, so I cannot find any process stuck in the "D" state. Is there any other possible reason? In addition, I searched all the dockerd logs and found a warning message:

Oct  8 14:28:01 d20808u19 dockerd: time="2021-10-08T14:28:01.637852298+08:00" level=warning msg="Error getting v2 registry: Get https://X.X.X.X:8080/v2/: http: server gave HTTP response to HTTPS client"
Oct  8 14:28:01 d20808u19 dockerd: time="2021-10-08T14:28:01.637909750+08:00" level=info msg="Attempting next endpoint for pull after error: Get https://X.X.X.X:8080/v2/: http: server gave HTTP response to HTTPS client"

Is that a possible reason? But I had already changed the configuration successfully before Oct 8 14:28:01, as docker info shows:

Insecure Registries:
  X.X.X.X:8080

@zq-david-wang
Contributor

Any chance that this runc 1.0.0-rc93 bug, opencontainers/runc#2871, is relevant?
The failure signatures are:

  1. docker run ... would hang
  2. docker ps shows several containers stuck in the Created state
  3. ps aux | grep " runc" shows several runc processes (normally there should be no long-lived runc processes; I forgot to check whether those runc processes were stuck in the 'D' state or not...)

If the signature matches, you could upgrade your Docker. To fix just the runc issue (see the sketch after this list):

  1. kill all runc processes
  2. replace runc 1.0.0-rc93 with runc 1.0.0-rc95 or above
  3. docker run should no longer hang
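
A rough sketch of that workaround (the install path and release asset name are assumptions; verify where your distribution puts runc):

# 1. Kill the stuck runc processes
pkill -9 -x runc

# 2. Replace the binary with 1.0.0-rc95 (check the real path with: command -v runc)
curl -Lo /usr/bin/runc \
  https://github.com/opencontainers/runc/releases/download/v1.0.0-rc95/runc.amd64
chmod +x /usr/bin/runc
runc --version

# 3. Confirm that container creation works again
docker run --rm hello-world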

@zhangxingdeppon

@pawcykca

pawcykca commented Apr 25, 2022

Hi, I have a similar problem on my GitLab Runners.
After a few hours of running pipelines, the docker socket stops responding, as shown below.
The problem appears on two different VMs (CentOS 7 and CentOS Stream 8).

[root@szkk-gitlab-runner ~]# uname -a
Linux szkk-gitlab-runner 4.18.0-383.el8.x86_64 #1 SMP Wed Apr 20 15:38:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

[root@szkk-gitlab-runner ~]# dnf list installed | grep -E 'docker|containerd.io'
containerd.io.x86_64                     1.5.11-3.1.el8                           @docker-ce-stable
docker-ce.x86_64                         3:20.10.14-3.el8                         @docker-ce-stable
docker-ce-cli.x86_64                     1:20.10.14-3.el8                         @docker-ce-stable
docker-ce-rootless-extras.x86_64         20.10.14-3.el8                           @docker-ce-stable
docker-scan-plugin.x86_64                0.17.0-3.el8                             @docker-ce-stable

[root@szkk-gitlab-runner ~]# docker ps
CONTAINER ID   IMAGE          COMMAND                  CREATED        STATUS        PORTS     NAMES
06a0d4ae7b55   0e8f69d1c5dd   "/usr/bin/dumb-init …"   37 hours ago   Up 37 hours             runner-wzinv33--project-10483-concurrent-1-bda010d00716d2ad-predefined-3
0dda81348438   0e8f69d1c5dd   "/usr/bin/dumb-init …"   37 hours ago   Up 37 hours             runner-wzinv33--project-10483-concurrent-9-cache-904f6ed42e0fa2b14c1d7a2ed6f1875e-set-permission-b4c78e255abe7e57
96e6e3e60642   0e8f69d1c5dd   "/usr/bin/dumb-init …"   37 hours ago   Up 37 hours             runner-wzinv33--project-10483-concurrent-5-5d63554d4bbef892-predefined-0
97716c4f215e   0e8f69d1c5dd   "/usr/bin/dumb-init …"   37 hours ago   Up 37 hours             runner-wzinv33--project-10483-concurrent-3-0b535ab4cf10124d-predefined-1
637adfa68e7f   b4d90f357ab4   "sh -c 'if [ -x /usr…"   37 hours ago   Up 37 hours             runner-wzinv33--project-10483-concurrent-10-8471caa5e2322218-build-4
4423a122b1a7   b4d90f357ab4   "sh -c 'if [ -x /usr…"   37 hours ago   Up 37 hours             runner-wzinv33--project-10483-concurrent-2-be7c1dccc0bfa10a-build-4

[root@szkk-gitlab-runner ~]# ps ax | grep docker
   1289 ?        Ssl   52:56 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
  15730 ?        Sl     0:00 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/c1fae3130bdad345217e618a86734387a77f24e63f6ca57fc23e647b26fc5f5c/log.json --log-format json create --bundle /run/containerd/io.containerd.runtime.v2.task/moby/c1fae3130bdad345217e618a86734387a77f24e63f6ca57fc23e647b26fc5f5c --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/c1fae3130bdad345217e618a86734387a77f24e63f6ca57fc23e647b26fc5f5c/init.pid c1fae3130bdad345217e618a86734387a77f24e63f6ca57fc23e647b26fc5f5c
  15757 ?        Sl     0:00 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/76e74309a937bc0f10274a38029c1d218b58aa29564350c583dfc60cdc073cb5/log.json --log-format json create --bundle /run/containerd/io.containerd.runtime.v2.task/moby/76e74309a937bc0f10274a38029c1d218b58aa29564350c583dfc60cdc073cb5 --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/76e74309a937bc0f10274a38029c1d218b58aa29564350c583dfc60cdc073cb5/init.pid 76e74309a937bc0f10274a38029c1d218b58aa29564350c583dfc60cdc073cb5
  15792 ?        Sl     0:15 libnetwork-setkey -exec-root=/var/run/docker c1fae3130bdad345217e618a86734387a77f24e63f6ca57fc23e647b26fc5f5c f7e558ac49f7
  15815 ?        Sl     0:15 libnetwork-setkey -exec-root=/var/run/docker 76e74309a937bc0f10274a38029c1d218b58aa29564350c583dfc60cdc073cb5 f7e558ac49f7
  15874 ?        Sl     0:00 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/a4f938b1bb8e322cf5e0f31437286e460f1501a8eaf6b6fb1b2ddbfe13c554c9/log.json --log-format json create --bundle /run/containerd/io.containerd.runtime.v2.task/moby/a4f938b1bb8e322cf5e0f31437286e460f1501a8eaf6b6fb1b2ddbfe13c554c9 --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/a4f938b1bb8e322cf5e0f31437286e460f1501a8eaf6b6fb1b2ddbfe13c554c9/init.pid a4f938b1bb8e322cf5e0f31437286e460f1501a8eaf6b6fb1b2ddbfe13c554c9
  15923 ?        Sl     0:15 libnetwork-setkey -exec-root=/var/run/docker a4f938b1bb8e322cf5e0f31437286e460f1501a8eaf6b6fb1b2ddbfe13c554c9 f7e558ac49f7
  32717 ?        Ss     0:00 bash /usr/share/gitlab-runner/clear-docker-cache
  32737 ?        Sl     0:06 docker system prune --volumes -af --filter label=com.gitlab.gitlab-runner.managed=true
 138212 pts/0    S+     0:00 grep --color=auto docker

[root@szkk-gitlab-runner ~]# journalctl -f -u docker
-- Logs begin at Sat 2022-04-23 15:44:34 UTC. --
Apr 23 18:44:05 szkk-gitlab-runner dockerd[1289]: time="2022-04-23T18:44:05.946120657Z" level=info msg="ignoring event" container=76c0a3110f8ff30776d2223c6d054d71e0239ad75699328b8444a4a951b09831 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 23 18:44:06 szkk-gitlab-runner dockerd[1289]: time="2022-04-23T18:44:06.482445096Z" level=info msg="ignoring event" container=9a2255a4a6b018158059f613c64ad73f61a58e748f52030d1c57a12744a7dbd6 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 23 18:44:06 szkk-gitlab-runner dockerd[1289]: time="2022-04-23T18:44:06.699391992Z" level=info msg="ignoring event" container=8c094bf0fd174653cddaa340ba6d1f38410720310258402e1891647f36b04cf0 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 23 18:44:07 szkk-gitlab-runner dockerd[1289]: time="2022-04-23T18:44:07.159161636Z" level=info msg="ignoring event" container=96e6e3e60642dcdc3513d41d81c399bbb91e5c2d732745d897558dbc45a81afc module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 23 18:44:07 szkk-gitlab-runner dockerd[1289]: time="2022-04-23T18:44:07.364510885Z" level=info msg="ignoring event" container=0dda81348438350a925277de62ef8079a1974bfced0fdfcf6ae26dc1376ef1de module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 23 18:44:07 szkk-gitlab-runner dockerd[1289]: time="2022-04-23T18:44:07.463338598Z" level=info msg="ignoring event" container=97716c4f215ea327bf2004cbf2d79996c1795030d4b4342906330f91aad20895 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 23 18:44:07 szkk-gitlab-runner dockerd[1289]: time="2022-04-23T18:44:07.481941529Z" level=info msg="ignoring event" container=06a0d4ae7b552b133895ed03c29bef17ff438cd0ba94e80726f5b17bbc4227af module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 23 18:49:08 szkk-gitlab-runner dockerd[1289]: time="2022-04-23T18:49:08.177250441Z" level=info msg="ignoring event" container=637adfa68e7f64c259dc39dbb943fcac7933c12769f68f69f289c0dbb3863075 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 23 18:51:01 szkk-gitlab-runner dockerd[1289]: time="2022-04-23T18:51:01.249597956Z" level=info msg="ignoring event" container=4423a122b1a70b628f2f9f69c730428fc3d90a6d5d03e02e3710887c6fb2320a module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 25 07:29:30 szkk-gitlab-runner dockerd[1289]: time="2022-04-25T07:29:30.263287242Z" level=info msg="goroutine stacks written to /var/run/docker/goroutine-stacks-2022-04-25T072930Z.log"
[root@szkk-gitlab-runner ~]# lsof /var/run/docker.sock | wc -l
15972
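
Given that descriptor count, one way to see which client processes actually hold the socket open (a generic follow-up diagnostic, not from the original report):

# Group open docker.sock descriptors by owning command
lsof /var/run/docker.sock 2>/dev/null | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn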

[root@szkk-gitlab-runner ~]# pidof docker
32737

[root@szkk-gitlab-runner ~]# strace -tt -f -o /tmp/docker.info.strace docker info

https://gist.github.com/pawcykca/2e4fc245a08e7ee62ccdd26233870ad4

[root@szkk-gitlab-runner ~]# strace -tt -f -o /tmp/docker.strace -p32737
strace: Process 32737 attached with 8 threads

https://gist.github.com/pawcykca/292ec81a8810136e957f1ded6a72c172

[root@szkk-gitlab-runner ~]# kill -SIGUSR1 $(pidof dockerd)

https://gist.github.com/pawcykca/7c2bf80127e7c0f0d53672fb6ce4d040

[root@szkk-gitlab-runner ~]# cat /proc/32737/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             30766                30766                processes
Max open files            1024                 262144               files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       30766                30766                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

Any ideas what the root cause might be?

@rlanore

rlanore commented Nov 22, 2023

I encountered this today and I can reproduce it in a lab.
For us it's due to forwarding all container stdout to a central syslog server.
This morning the central syslog server filled up and stopped responding.
Docker hung: all containers hang on the epoll_wait system call.
To fix it temporarily, we deleted this docker daemon config:

{
  "log-driver": "syslog",
  "log-opts": {
    "syslog-address": "tcp://foo.bar:10514",
    "tag": "{{.Name}}_hosted_bar_foo"
  }
}
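
A less drastic mitigation, assuming a reasonably recent Docker (the mode and max-buffer-size log-opts are generic logging options, not syslog-specific): switch log delivery to non-blocking, so an unreachable syslog endpoint drops messages into a ring buffer instead of stalling the containers' stdout pipeline:

{
  "log-driver": "syslog",
  "log-opts": {
    "syslog-address": "tcp://foo.bar:10514",
    "tag": "{{.Name}}_hosted_bar_foo",
    "mode": "non-blocking",
    "max-buffer-size": "4m"
  }
}

The trade-off is possible log loss when the buffer fills.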

We use Docker 20.10 on the impacted host:

Client: Docker Engine - Community
 Version:           20.10.22
 API version:       1.41
 Go version:        go1.18.9
 Git commit:        3a2c30b
 Built:             Thu Dec 15 22:28:04 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.22
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.9
  Git commit:       42c8b31
  Built:            Thu Dec 15 22:25:49 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.25
  GitCommit:        d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
 runc:
  Version:          1.1.10
  GitCommit:        v1.1.10-0-g18a0cb0
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
