Docker container stopped under heavy load: RPC error in /var/log/upstart/docker.log #34377

Closed
samsun387 opened this issue Aug 3, 2017 · 8 comments

Comments

@samsun387

samsun387 commented Aug 3, 2017

Description
Docker containers unexpectedly stop under heavy load, with RPC errors in /var/log/upstart/docker.log.
We are using Jenkins to start multiple Docker containers (40+ per host) and run builds inside them. When the host is fully loaded, all Docker containers on it stop and I see the RPC errors in the log.

The command Jenkins uses to start docker container is
docker run -t -d -u 1000:1000 --privileged -u root --memory=1400m --cpus 1.0 --oom-kill-disable

Describe the results you received:
time="2017-08-02T16:01:59.170520197-07:00" level=error msg="Error running exec in container: rpc error: code = 13 desc = transport is closing"
time="2017-08-02T16:01:59.179258675-07:00" level=error msg="Error running exec in container: rpc error: code = 13 desc = transport is closing"
time="2017-08-02T16:01:59.206258346-07:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"
time="2017-08-02T16:01:59.208830605-07:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"

Additional information you deem important (e.g. issue happens only occasionally):
The issue seems to only happen under heavy load.
Another entry I keep seeing in the log is:
level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
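That swap-limit warning is separate from the RPC errors; on Ubuntu it usually just means the kernel was booted without cgroup swap accounting. A quick check, sketched here with the usual Ubuntu GRUB procedure (not taken from this issue):

```shell
# Check whether the kernel was booted with cgroup swap accounting,
# which Docker needs in order to enforce swap limits.
if grep -q "swapaccount=1" /proc/cmdline; then
    echo "swap accounting: enabled"
else
    echo "swap accounting: disabled"
fi
# To enable it on Ubuntu: add swapaccount=1 to GRUB_CMDLINE_LINUX in
# /etc/default/grub, then run update-grub and reboot.
```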

Output of docker version:

Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:10:36 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:10:36 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 35
 Running: 35
 Paused: 0
 Stopped: 0
Images: 14
Server Version: 17.03.1-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 105
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 apparmor
Kernel Version: 4.4.0-79-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 56
Total Memory: 62.68 GiB
Name: jenkins-dell-03
ID: LE6R:54UE:ALZ4:XGES:RHC4:AJJV:X3AU:P4RA:5VN4:H6WL:G7H7:QSLX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 172.16.181.203:5000
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
The host is a vm under VMWare.

@cpuguy83
Member

cpuguy83 commented Aug 3, 2017

I believe this is because containerd is getting OOM killed and there was an error in how docker's containerd supervisor was handling this situation.

This won't be addressed in 17.03 but should be fixed in the upcoming 17.06 patch release.
Can you try with the 17.06 rc?
The main fix is to adjust containerd's OOM score so the kernel doesn't kill it under stress.
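A quick way to see whether a daemon is protected from the OOM killer, sketched under the assumption of 17.03-era process names (dockerd, docker-containerd); a negative oom_score_adj means the kernel will prefer other victims:

```shell
# Print the OOM score adjustment of the Docker daemons, if running.
# Lower (more negative) values make the kernel less likely to kill them.
for name in dockerd docker-containerd containerd; do
    pid=$(pidof -s "$name" 2>/dev/null) || continue
    printf '%s (pid %s): oom_score_adj=%s\n' \
        "$name" "$pid" "$(cat "/proc/$pid/oom_score_adj")"
done
```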

@samsun387
Author

Hi @cpuguy83 , Thanks for the quick reply!

What you described matches what we are suspecting as well, though I still have two questions:
#1 We suspect there is some issue related to memory, hence we added '--memory=1400m --cpus 1.0 --oom-kill-disable' to the options. Wouldn't --oom-kill-disable stop the kernel from killing the container?

#2 I thought 17.06.0-ce was already out, or do you mean there's another patch coming on top of 17.06.0-ce? Do you happen to have the PR that I can reference?

Thank you very much. This error has been giving me so much trouble lately...

@thaJeztah
Member

I'd strongly recommend not using --oom-kill-disable; setting this on a container means that if the system runs out of memory, the kernel starts to kill "random" processes instead of the container. Basically, it will kill your system before killing the container.

It's better to have a container killed than (e.g.) containerd, because that kills all your containers.
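The mechanism behind both comments is the kernel's per-process oom_score_adj: --oom-kill-disable effectively pins the container's processes at -1000 ("never kill"), so under memory pressure the OOM killer must pick other victims, potentially containerd or dockerd. A minimal look at the knobs (the values described are general kernel behavior, not taken from this issue):

```shell
# oom_score_adj ranges from -1000 (never OOM-kill) to +1000 (kill first);
# oom_score is the kernel's computed "badness" used for victim selection.
cat /proc/self/oom_score_adj   # 0 for an ordinary, unadjusted process
cat /proc/self/oom_score       # higher = more likely to be chosen as victim
```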

@cpuguy83
Member

cpuguy83 commented Aug 3, 2017

because that kills all your containers

Well, it certainly messes up Docker, because the error handling for the containerd health check was wrong in this case and containerd is never automatically restarted like it should be... but containers should stick around just fine unless the kernel OOM-kills them.

But yes, you want dockerd and containerd to be less likely to be OOM killed than most of your other processes.

@cpuguy83
Member

cpuguy83 commented Aug 3, 2017

#2 I thought 17.06.0-ce is already out, or do you mean there's another patch coming on top of 17.06.0-ce? Do you happen to have the PR that I can reference?

Honestly, I can't remember where the fixes came in. I think they're not in the .0 release, but they could be... I'd have to do some poking around.

@totoroliu

When a container OOM happens, does it print to any of the log files (such as docker.log, the host's syslog, or the container's syslog)? Right now we can't find any OOM trace in these log files.

I know that under Ubuntu, when an OOM happens, the kernel prints an error with the OOM details into syslog.

@yuexiao-wang
Contributor

@totoroliu You can find OOM messages in /var/log/syslog
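To confirm whether the kernel OOM killer fired, a search like the following usually works (log paths assume Ubuntu with rsyslog, as on the reporter's 14.04 host; on systemd hosts use journalctl -k instead):

```shell
# Search the host logs and the kernel ring buffer for OOM-killer activity.
grep -iE "out of memory|oom-killer|killed process" \
    /var/log/syslog /var/log/kern.log 2>/dev/null
dmesg 2>/dev/null | grep -iE "oom|killed process"
```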

@thaJeztah
Member

Let me close this issue, because I don't think there's a bug at hand here, but feel free to continue the conversation.
