
dockerd doesn't seem to recover from low memory on host #29854

Open
RRAlex opened this issue Jan 3, 2017 · 9 comments


RRAlex commented Jan 3, 2017

Steps to reproduce the issue:
TL;DR: The Docker daemon doesn't seem to recover from a low-memory situation; it even times out trying to kill containers or to answer a simple docker ps.

After the host runs low on memory, docker stats only gives empty output, all the containers fall into unhealthy mode, docker ps hangs, etc. Even when something gets OOM-killed (a container via mem_limit:, or at the system level) or memory is otherwise freed, the daemon stays in that broken state until it is restarted.

CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
cbb96093e5d0        --                  -- / --             --                  -- / --             -- / --             --
fcfef32b9dea        --                  -- / --             --                  -- / --             -- / --             --
f096ca6c6729        --                  -- / --             --                  -- / --             -- / --             --
...

A daemon restart is basically required to regain any control over it after the host has run low on memory.
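
Roughly, the condition can be approximated with any memory hog and then checking whether the daemon still answers. stress-ng below is only an illustration of that, not what actually happened on this host, and the numbers are arbitrary:

stress-ng --vm 4 --vm-bytes 95% --timeout 180s &   # drive the host close to OOM
timeout 30 docker ps || echo "dockerd did not answer within 30s"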

ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 120).
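
(Raising the compose client timeout only hides the symptom, but for reference it is just an environment variable; 300 is an arbitrary value:)

export COMPOSE_HTTP_TIMEOUT=300
docker-compose ps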

Interesting log lines (cherry-picked):

Dec 21 20:50:34 server0 dockerd[1380]: time="2016-12-21T20:50:34.489742521-05:00" level=warning msg="Health check error: context cancelled"
Dec 21 20:50:34 server0 dockerd[1380]: time="2016-12-21T20:50:34.489782799-05:00" level=warning msg="Health check error: context cancelled"
Dec 22 01:39:04 server0 dockerd[1380]: time="2016-12-22T01:39:04.146478388-05:00" level=error msg="stream copy error: reading from a closed fifo\ngithub.com/tonistiigi/fifo.(*fifo).Read\n\t/usr/src/docker/vendor/src/github.com/tonistiigi/fifo/fifo.go:141\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:471\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:370\nio.Copy\n\t/usr/local/go/src/io/io.go:350\ngithub.com/docker/docker/pkg/pools.Copy\n\t/usr/src/docker/.gopath/src/github.com/docker/docker/pkg/pools/pools.go:64\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/usr/src/docker/.gopath/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1998"
Dec 23 10:26:49 server0 dockerd[1380]: time="2016-12-23T10:26:49.506267503-05:00" level=error msg="containerd: start container" error="fork/exec /usr/bin/docker-containerd-shim: cannot allocate memory" id=91fe76066cd4825794aa6c5480e7ce197dd1a1472ad8d4200e77a46f2aa0bb21
Dec 23 10:26:49 server0 dockerd[1380]: time="2016-12-23T10:26:49.518211618-05:00" level=error msg="libcontainerd: error restarting rpc error: code = 2 desc = fork/exec /usr/bin/docker-containerd-shim: cannot allocate memory"
Dec 23 10:26:49 server0 dockerd[1380]: time="2016-12-23T10:26:49.519956311-05:00" level=error msg="libcontainerd: rpc error: code = 2 desc = fork/exec /usr/bin/docker-containerd-shim: cannot allocate memory"
Dec 23 10:29:08 server0 dockerd[1380]: time="2016-12-23T10:28:52.155014792-05:00" level=warning msg="Connect failed: dial udp 8.8.8.8:53: i/o timeout"

Describe the results you received:
The daemon stays stuck when the machine is low on RAM and doesn't recover once RAM is freed; a restart is required.

Describe the results you expected:
The daemon's API should either answer the CLI, return an error saying it is in an unrecoverable state, or simply recover from low-memory situations.
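
As a stop-gap, assuming the daemon runs under systemd, dockerd itself could at least be made less likely to be OOM-killed with a drop-in; this is only a sketch and does not help the docker-containerd-shim forks failing in the log above:

# /etc/systemd/system/docker.service.d/oom.conf
[Service]
OOMScoreAdjust=-500

# then reload and restart the service:
systemctl daemon-reload && systemctl restart docker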

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      1.12.5
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   7392c3b
 Built:        Fri Dec 16 02:42:17 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.5
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   7392c3b
 Built:        Fri Dec 16 02:42:17 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 97
 Running: 97
 Paused: 0
 Stopped: 0
Images: 31
Server Version: 1.12.5
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 488
 Dirperm1 Supported: true
Logging Driver: syslog
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-58-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 20
Total Memory: 19.61 GiB
Name: preview
ID: 5KZB:ZJ4B:LL7T:4XFX:JUSJ:IT6I:ADUZ:AEAN:PRUW:5V4H:GEK7:2ODH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 environment=staging
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):
KVM & Ubuntu Xenial LTS

(Originally posted in the unrelated issue #29635 (comment).)

@thaJeztah (Member)

ping @mlaventure


RRAlex commented Jan 10, 2017

I'm also seeing plenty of these, which seem to be associated with the daemon not responding to docker ps.
That is even with free -m showing more than 8 GB of available RAM...

Jan  9 05:38:10 server0 dockerd[1396]: time="2017-01-09T05:38:10.837042399-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:10 server0 dockerd[1396]: time="2017-01-09T05:38:10.855022070-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:11 server0 dockerd[1396]: time="2017-01-09T05:38:11.233570154-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:16 server0 dockerd[1396]: time="2017-01-09T05:38:16.657212338-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:16 server0 dockerd[1396]: time="2017-01-09T05:38:16.882623786-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:25 server0 dockerd[1396]: time="2017-01-09T05:38:25.983520891-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:27 server0 dockerd[1396]: time="2017-01-09T05:38:27.757829382-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:34 server0 dockerd[1396]: time="2017-01-09T05:38:34.567578370-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:40 server0 dockerd[1396]: time="2017-01-09T05:38:40.080797671-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:45 server0 dockerd[1396]: time="2017-01-09T05:38:45.732896006-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:45 server0 dockerd[1396]: time="2017-01-09T05:38:45.774281372-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:45 server0 dockerd[1396]: time="2017-01-09T05:38:45.778320711-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
Jan  9 05:38:45 server0 dockerd[1396]: time="2017-01-09T05:38:45.784983937-05:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
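
(For context, these health checks are exec probes run inside each container; their timeout and retry budget are configurable, for example in a compose file of version 2.1 or later. The test command below is a placeholder:)

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/"]
  interval: 30s
  timeout: 10s
  retries: 3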

@mlaventure (Contributor)

@RRAlex then your issue might be #29369


RRAlex commented Jan 12, 2017

Do you mean that, because I have more than 10 containers running, containerd could lose contact with the daemon under low memory, and the containers then wouldn't be able to reconnect because of their number (more than 10 not being queued/managed/...)?

(I'm a bit confused by the other issue as I'm simply running containers, not restoring them nor playing with any CRIU features.)

@mlaventure (Contributor)

@RRAlex from your description and the log provided, you don't seem to be hitting a low-memory scenario, since your system reports more than 8 GB available.

Here the health check times out, meaning the exec command it tries to run wasn't processed by containerd in time.

If you try to do an exec manually (or run a brand-new container), does it go through?
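
Concretely, something along these lines, with a real container name substituted (alpine is only an example image):

timeout 30 docker exec <some-running-container> true
timeout 30 docker run --rm alpine true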


RRAlex commented Jan 12, 2017

The low-memory scenario (seen in resource graphs) happened before memory was freed by the OOM killer, though.

I'll try the exec / docker run next time and will keep you posted!

@mlaventure (Contributor)

@RRAlex Sorry, I was referring to your second message, not the original issue.

@yogo1212

I'm pretty sure I just hit this with Docker version 20.10.18, build b40c2f6, on Rocky Linux 9.
The machine has 2 GB of RAM. Processing incoming mail (including amavis) took ~5 minutes, measured from seeing the sending connection to amavis processing the email; after amavis, emails showed up almost instantly.
There were regular messages about a health check failing (health check for container .. error: context deadline exceeded), plus 2 lines about a closed fifo (level=error msg="stream copy error: reading from a closed fifo").
That was after I added a swap file for another 2 GB (before that there were a lot of other error messages, including the OOM killer from time to time).
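
(For reference, a swap file like that is typically created roughly as follows; the path is illustrative:)

fallocate -l 2G /swapfile && chmod 600 /swapfile
mkswap /swapfile && swapon /swapfile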

Compose down/up didn't solve the problem, and I'm pretty sure I restarted dockerd at one point (not containerd, though) with no success.
The messages went away after a reboot (and the delivery times went down).


ndtreviv commented Oct 6, 2022

I'm experiencing something similar to @yogo1212. It's on a system I don't have access to but only get logs for.
I'm seeing a lot of "health check for container .. error: context deadline exceeded" and "stream copy error: reading from a closed fifo", then a ton of "connect failed: dial udp [IP]:53 i/o timeout", and then the entire machine becomes an unresponsive black screen and requires a hard reset.
