docker stops responding on some machines #42386

Open
shapiroj opened this issue May 17, 2021 · 3 comments

@shapiroj

Description

The docker daemon process on some of our machines seems to get into a bad state where many threads are blocked on mutexes. Some of the problems we see:

  1. Containers cannot resolve hostnames because DNS hangs.
  2. Running docker inspect on some containers hangs forever. Most inspect calls on other containers work. docker images and docker ps calls work. docker info hangs.

Steps to reproduce the issue:

  1. No reliable way to reproduce the issue. We just notice problems after instances have been running for days.

Describe the results you received:
docker inspect hangs for some containers (not all)
DNS forwarding hangs; cannot resolve hostnames from within container
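
Since only some containers are affected, a small probe can help list which ones hang. The following is a minimal sketch, not something used in the original report: it assumes `docker ps -aq` still responds (as described above) and treats any `docker inspect` that exceeds a timeout as hung. The function names and the 10 second default are illustrative choices.

```python
import subprocess


def list_container_ids(run=subprocess.run):
    """Return all container IDs reported by `docker ps -aq`."""
    out = run(["docker", "ps", "-aq"], capture_output=True, text=True,
              timeout=30, check=True)
    return out.stdout.split()


def find_hanging_containers(container_ids, timeout_s=10.0, run=subprocess.run):
    """Return the IDs whose `docker inspect` did not finish within timeout_s.

    `run` is injectable so the logic can be exercised without a daemon.
    """
    hanging = []
    for cid in container_ids:
        try:
            run(["docker", "inspect", cid],
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising, so a hung
            # CLI call does not linger.
            hanging.append(cid)
    return hanging
```

On an affected machine, `find_hanging_containers(list_container_ids())` would return the IDs to look for in the goroutine dump.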

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):
Goroutine dump attached. The file is huge (64 MB) compared with the same dump on working machines (under 1 MB). It was captured with:
curl --unix-socket /var/run/docker.sock 'http://./debug/pprof/goroutine?debug=2'
docker_dump.log.gz
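
A dump this large is easier to read after summarizing it. The sketch below counts goroutines per wait state, relying on the standard `debug=2` header layout `goroutine N [state, duration]:`. It is only an illustration: the function name is made up, and the state labels depend on the Go version (with go1.13, goroutines waiting on a mutex typically show up as `semacquire`).

```python
import re
from collections import Counter

# Matches headers such as "goroutine 42 [semacquire, 1234 minutes]:"
HEADER = re.compile(r"^goroutine (\d+) \[([^\]]+)\]:$")


def count_states(dump_text):
    """Count goroutines per wait state in a debug=2 goroutine dump."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = HEADER.match(line)
        if match:
            # Drop the optional ", N minutes" / ", locked to thread" suffix.
            state = match.group(2).split(",")[0]
            counts[state] += 1
    return counts
```

After decompressing the attachment, `count_states(open("docker_dump.log").read()).most_common(5)` would show the dominant states. Comparing that against a dump from a healthy machine should make the difference obvious.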

Output of docker version:

Client: Docker Engine - Community
 Version:           20.10.2
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        2291f61
 Built:             Mon Dec 28 16:11:26 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8891c58
  Built:            Mon Dec 28 16:15:23 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker info:

docker info on similar, working machine:
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 34
  Running: 32
  Paused: 0
  Stopped: 2
 Images: 521
 Server Version: 20.10.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  userns
 Kernel Version: 5.4.71-amd64-3e298f2a7ded99f4
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 117.9GiB
 Name: ***********
 ID: GV2J:HMGA:6CBQ:PG3I:GFYT:WUBH:7FAD:ISTV:NXJ5:YKTO:JRT5:M64K
 Docker Root Dir: /***/docker/100000.100000
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

WARNING: No blkio weight support
WARNING: No blkio weight_device support

Additional environment details (AWS, VirtualBox, physical, etc.):

@thaJeztah
Member

Just to be sure: 20.10.2 is several patch releases behind the current release (20.10.6). If you have a test environment where this occurs, are you still able to reproduce the issue on 20.10.6?

@shapiroj
Author

shapiroj commented Jun 14, 2021

I've reproduced this with an updated docker:

Client:
 Version:           20.10.6
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        370c289
 Built:             Fri Apr  9 22:42:10 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.6
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8728dd2
  Built:            Fri Apr  9 22:46:14 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 60
  Running: 46
  Paused: 0
  Stopped: 14
 Images: 1035
 Server Version: 20.10.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  userns
 Kernel Version: 5.4.98-5.4.4-amd64-14794deb7c18c4e7
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 58.88GiB
 Name: ***
 ID: KC3X:GR7B:MNWN:IWHW:KAEY:NRKD:NI5P:EY34:GPXC:H3CX:LLYF:J3AM
 Docker Root Dir: /***/docker/100000.100000
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

docker.dump.txt.2.gz
docker.dump.txt.gz

@shapiroj
Author

Hello, I still see this problem frequently. Is there anything I can do to start troubleshooting the cause? It seems to happen more often on machines that have more containers running.
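
One possible starting point is to group the blocked goroutines in the dump by the first frame outside the `runtime` and `sync` packages, which hints at the code path waiting on the lock. This is a rough sketch under assumptions: the dump uses the usual `debug=2` layout (blank line between goroutines, function lines unindented, file lines tab-indented), the blocked state is labelled `semacquire`, and the function name `blocked_call_sites` is purely illustrative.

```python
from collections import Counter


def blocked_call_sites(dump_text, state="semacquire",
                       skip_prefixes=("runtime.", "sync.")):
    """Count blocked goroutines by their first non-runtime, non-sync frame."""
    sites = Counter()
    for block in dump_text.split("\n\n"):
        lines = block.strip().splitlines()
        if not lines or not lines[0].startswith("goroutine "):
            continue
        # Header looks like "goroutine 7 [semacquire, 12 minutes]:"
        header_state = lines[0].partition("[")[2].split(",")[0].rstrip("]:")
        if header_state != state:
            continue
        for line in lines[1:]:
            # File/line entries are tab-indented; "created by" marks the spawner.
            if line.startswith(("\t", " ", "created by ")):
                continue
            if line.startswith(skip_prefixes):
                continue
            # Strip the argument list: "pkg.Func(0x1, 0x2)" -> "pkg.Func"
            sites[line.rpartition("(")[0]] += 1
            break
    return sites
```

The call sites with the highest counts are the ones most goroutines are queued behind. Finding the goroutine that actually holds the lock still requires reading the stacks by hand.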
