
dockerd doesn't seem to recover from low memory on host #29854

Open
RRAlex opened this issue Jan 3, 2017 · 9 comments


RRAlex commented Jan 3, 2017

Steps to reproduce the issue:
TL;DR: The Docker daemon doesn't seem to recover from a low-memory situation; it even times out trying to kill containers or to answer a simple docker ps.

After the host runs low on memory, docker stats only gives empty output, all the containers fall into unhealthy mode, docker ps hangs, etc. Even when something gets OOM-killed (a container via mem_limit:, or at the system level) or memory is otherwise freed, the daemon stays in that broken state until it is restarted.

CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
cbb96093e5d0        --                  -- / --             --                  -- / --             -- / --             --
fcfef32b9dea        --                  -- / --             --                  -- / --             -- / --             --
f096ca6c6729        --                  -- / --             --                  -- / --             -- / --             --
...

A daemon restart is basically required to regain any control over it after the host has run low on memory.
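
Roughly, the condition can be approximated with any memory hog and then checking whether the daemon still answers. stress-ng below is only an illustration of that, not what actually happened on this host, and the numbers are arbitrary:

stress-ng --vm 4 --vm-bytes 95% --timeout 180s &   # drive the host close to OOM
timeout 30 docker ps || echo "dockerd did not answer within 30s"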

ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 120).
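
(Raising the compose client timeout only hides the symptom, but for reference it is just an environment variable; 300 is an arbitrary value:)

export COMPOSE_HTTP_TIMEOUT=300
docker-compose ps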

Interesting log lines (cherry-picked):

Dec 21 20:50:34 server0 dockerd[1380]: time="2016-12-21T20:50:34.489742521-05:00" level=warning msg="Health check error: context cancelled"
Dec 21 20:50:34 server0 dockerd[1380]: time="2016-12-21T20:50:34.489782799-05:00" level=warning msg="Health check error: context cancelled"
Dec 22 01:39:04 server0 dockerd[1380]: time="2016-12-22T01:39:04.146478388-05:00" level=error msg="stream copy error: reading from a closed fifo\ngithub.com/tonistiigi/fifo.(*fifo).Read\n\t/usr/src/docker/vendor/src/github.com/tonistiigi/fifo/fifo.go:141\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:471\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:370\nio.Copy\n\t/usr/local/go/src/io/io.go:350\ngithub.com/docker/docker/pkg/pools.Copy\n\t/usr/src/docker/.gopath/src/github.com/docker/docker/pkg/pools/pools.go:64\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/usr/src/docker/.gopath/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1998"
Dec 23 10:26:49 server0 dockerd[1380]: time="2016-12-23T10:26:49.506267503-05:00" level=error msg="containerd: start container" error="fork/exec /usr/bin/docker-containerd-shim: cannot allocate memory" id=91fe76066cd4825794aa6c5480e7ce197dd1a1472ad8d4200e77a46f2aa0bb21
Dec 23 10:26:49 server0 dockerd[1380]: time="2016-12-23T10:26:49.518211618-05:00" level=error msg="libcontainerd: error restarting rpc error: code = 2 desc = fork/exec /usr/bin/docker-containerd-shim: cannot allocate memory"
Dec 23 10:26:49 server0 dockerd[1380]: time="2016-12-23T10:26:49.519956311-05:00" level=error msg="libcontainerd: rpc error: code = 2 desc = fork/exec /usr/bin/docker-containerd-shim: cannot allocate memory"
Dec 23 10:29:08 server0 dockerd[1380]: time="2016-12-23T10:28:52.155014792-05:00" level=warning msg="Connect failed: dial udp 8.8.8.8:53: i/o timeout"

Describe the results you received:
The daemon stays stuck when the machine is low on RAM and doesn't recover once RAM is freed; a restart is required.

Describe the results you expected:
The daemon's API should either answer the CLI, return an error saying it is in an unrecoverable state, or simply recover from low-memory situations.
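
As a stop-gap, assuming the daemon runs under systemd, dockerd itself could at least be made less likely to be OOM-killed with a drop-in; this is only a sketch and does not help the docker-containerd-shim forks failing in the log above:

# /etc/systemd/system/docker.service.d/oom.conf
[Service]
OOMScoreAdjust=-500

# then reload and restart the service:
systemctl daemon-reload && systemctl restart docker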

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      1.12.5
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   7392c3b
 Built:        Fri Dec 16 02:42:17 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.5
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   7392c3b
 Built:        Fri Dec 16 02:42:17 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 97
 Running: 97
 Paused: 0
 Stopped: 0
Images: 31
Server Version: 1.12.5
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 488
 Dirperm1 Supported: true
Logging Driver: syslog
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-58-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 20
Total Memory: 19.61 GiB
Name: preview
ID: 5KZB:ZJ4B:LL7T:4XFX:JUSJ:IT6I:ADUZ:AEAN:PRUW:5V4H:GEK7:2ODH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 environment=staging
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):
KVM & Ubuntu Xenial LTS

(Originally posted in the unrelated issue #29635 (comment).)

@thaJeztah (Member)

ping @mlaventure


RRAlex commented Jan 10, 2017

I'm also seeing plenty of these, which seem to be associated with the daemon not responding to docker ps.
That is even with free -m showing more than 8 GB of available RAM...

Jan  9 05:38:10 server0 dockerd[1396]: time="2017-01-09T05:38:10.837042399-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:10 server0 dockerd[1396]: time="2017-01-09T05:38:10.855022070-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:11 server0 dockerd[1396]: time="2017-01-09T05:38:11.233570154-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:16 server0 dockerd[1396]: time="2017-01-09T05:38:16.657212338-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:16 server0 dockerd[1396]: time="2017-01-09T05:38:16.882623786-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:25 server0 dockerd[1396]: time="2017-01-09T05:38:25.983520891-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:27 server0 dockerd[1396]: time="2017-01-09T05:38:27.757829382-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:34 server0 dockerd[1396]: time="2017-01-09T05:38:34.567578370-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:40 server0 dockerd[1396]: time="2017-01-09T05:38:40.080797671-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:45 server0 dockerd[1396]: time="2017-01-09T05:38:45.732896006-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:45 server0 dockerd[1396]: time="2017-01-09T05:38:45.774281372-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:45 server0 dockerd[1396]: time="2017-01-09T05:38:45.778320711-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:45 server0 dockerd[1396]: time="2017-01-09T05:38:45.784983937-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
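
(For context, these health checks are exec probes run inside each container; their timeout and retry budget are configurable, for example in a compose file of version 2.1 or later. The test command below is a placeholder:)

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/"]
  interval: 30s
  timeout: 10s
  retries: 3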

@mlaventure (Contributor)

@RRAlex then your issue might be #29369


RRAlex commented Jan 12, 2017

Do you mean that, because I have more than 10 containers running, containerd could lose contact with the daemon under low memory, and the containers then wouldn't be able to reconnect because of their number (more than 10 not being queued/managed/...)?

(I'm a bit confused by the other issue as I'm simply running containers, not restoring them nor playing with any CRIU features.)

@mlaventure (Contributor)

@RRAlex from your description and the log provided, you don't seem to be hitting a low-memory scenario, since your system reports more than 8 GB available.

Here the health check times out, meaning the exec command it tries to run wasn't processed by containerd in time.

If you try to do an exec manually (or run a brand-new container), does it go through?
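
Concretely, something along these lines, with a real container name substituted (alpine is only an example image):

timeout 30 docker exec <some-running-container> true
timeout 30 docker run --rm alpine true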


RRAlex commented Jan 12, 2017

The low-memory scenario (seen in resource graphs) happened before memory was freed by the OOM killer, though.

I'll try the exec / docker run next time and will keep you posted!

@mlaventure (Contributor)

@RRAlex Sorry, I was referring to your second message, not the original issue.

@yogo1212

I'm pretty sure I just hit this with Docker version 20.10.18, build b40c2f6, on Rocky Linux 9.
The machine has 2 GB of RAM. Processing incoming mail (including amavis) took ~5 minutes, measured from seeing the sending connection to amavis processing the email; after amavis, emails showed up almost instantly.
There were regular messages about a health check failing (health check for container .. error: context deadline exceeded), plus 2 lines about a closed fifo (level=error msg="stream copy error: reading from a closed fifo").
That was after I added a swap file for another 2 GB (before that there were a lot of other error messages, including the OOM killer from time to time).
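
(For reference, a swap file like that is typically created roughly as follows; the path is illustrative:)

fallocate -l 2G /swapfile && chmod 600 /swapfile
mkswap /swapfile && swapon /swapfile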

Compose down/up didn't solve the problem, and I'm pretty sure I restarted dockerd at one point (not containerd, though) with no success.
The messages went away after a reboot (and the delivery times went down).


ndtreviv commented Oct 6, 2022

I'm experiencing something similar to @yogo1212. It's on a system I don't have access to but only get logs for.
I'm seeing a lot of "health check for container .. error: context deadline exceeded" and "stream copy error: reading from a closed fifo", then a ton of "connect failed: dial udp [IP]:53 i/o timeout", and then the entire machine becomes an unresponsive black screen and requires a hard reset.
