
Dockerd fails to receive containerd's exit signal when the container process has been OOM-killed #33192

Open
BSWANG opened this issue May 15, 2017 · 23 comments


@BSWANG
Contributor

BSWANG commented May 15, 2017

Description

Dockerd fails to receive containerd's exit signal when the container process has been OOM-killed. The container's process, runc, and containerd have all exited, but docker ps shows the container as UP, and docker stop, docker kill, and docker exec no longer work as expected. For example:

docker exec -it MySQL-for-Developer_MySQL-for-Developer_1 sh
rpc error: code = 2 desc = containerd: container not found

Steps to reproduce the issue:

  1. Run a container with a memory limit.
  2. Let the container use more memory than the limit.
  3. The container process is OOM-killed by the kernel (a reproduction sketch follows these steps).
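
A minimal reproduction sketch, assuming the progrium/stress image (the image, container name, and limit values here are illustrative, not from the original report):

docker run -d --name oom-test -m 64m progrium/stress --vm 1 --vm-bytes 256M --vm-hang 0
# wait for the kernel OOM killer to terminate the container's main process, then:
docker ps --filter name=oom-test   # if the bug triggers, the container still shows as "Up"
docker-runc list                   # while it no longer appears here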

Describe the results you received:

docker ps shows the container still "Up", and docker stop, docker kill, and docker exec no longer work as expected.

docker exec -it MySQL-for-Developer_MySQL-for-Developer_1 sh
rpc error: code = 2 desc = containerd: container not found

Describe the results you expected:

When the container's process has been killed, the container should show as "Exited", not "Up".

Additional information you deem important (e.g. issue happens only occasionally):

The issue happens only occasionally.

Output of docker version:

Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   0801b25
 Built:        Tue Mar 28 08:41:15 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   0801b25
 Built:        Tue Mar 28 08:41:15 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 14
Server Version: 17.03.1-ce
Storage Driver: overlay
 Backing Filesystem: extfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local ossfs acd nas
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.4.41-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.67 GiB
Name: app6
ID: xxx
Docker Root Dir: /work1/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled

Additional environment details (AWS, VirtualBox, physical, etc.):

physical

docker daemon's log:

May 15 14:36:28 app6 dockerd: time="2017-05-15T14:36:28.987367854+08:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 3f08d72500fc972a4f194afa5b7038dfe7a2be1d4b7355d857b82bd8b0b5332d: rpc error: code = 2 desc = containerd: container not found"
May 15 14:36:29 app6 dockerd: time="2017-05-15T14:36:29.617753767+08:00" level=error msg="collecting stats for 3f08d72500fc972a4f194afa5b7038dfe7a2be1d4b7355d857b82bd8b0b5332d: rpc error: code = 2 desc = containerd: container not found"
May 15 14:36:30 app6 dockerd: time="2017-05-15T14:36:30.617792626+08:00" level=error msg="collecting stats for 3f08d72500fc972a4f194afa5b7038dfe7a2be1d4b7355d857b82bd8b0b5332d: rpc error: code = 2 desc = containerd: container not found"
May 15 14:36:31 app6 dockerd: time="2017-05-15T14:36:31.617938914+08:00" level=error msg="collecting stats for 3f08d72500fc972a4f194afa5b7038dfe7a2be1d4b7355d857b82bd8b0b5332d: rpc error: code = 2 desc = containerd: container not found"
@thaJeztah
Member

ping @mlaventure

@mlaventure
Contributor

Wasn't able to reproduce with a quick try.

@BSWANG if you have an easy way to reproduce it, could you put your daemon in debug mode and provide its logs?

@BSWANG
Contributor Author

BSWANG commented May 24, 2017

I saw similar issues in rancher/rancher#6922 and #31614, but I cannot reproduce it easily.

@fragpit

fragpit commented May 31, 2017

Facing this right now.

# docker exec -it cddb83de5cf9 exit
rpc error: code = 2 desc = containerd: container not found

It began after the OOM killer killed the container's process.

# docker info
Containers: 43
 Running: 43
 Paused: 0
 Stopped: 0
Images: 20
Server Version: 17.03.1-ce
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.10.2.el7.x86_64
Operating System: Scientific Linux 7.3 (Nitrogen)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.64 GiB
Name: webbox6
ID: HYHP:WBBD:LBM4:4J2X:NDE5:ALGF:4XDG:H3JQ:RPMY:JDDI:Z57T:RBYB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

@corruptmem

I believe I am also experiencing this. After an OOM kill, containerd seems to get into a state where all subsequent container terminations, normal or otherwise, are ignored by the daemon. Getting docker back into a usable state requires restarting containerd.

I cannot reliably reproduce this right now, but I can confirm that I'm also seeing this behaviour. I'm running 17.03.1-ce on AWS (EC2 Container Service).
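
For reference, on 17.03 the docker-containerd process is launched and supervised by dockerd, so restarting the docker service is the usual way to restart both (a recovery sketch for a systemd host, not a fix for the underlying bug):

sudo systemctl restart docker
# with "live-restore": true in /etc/docker/daemon.json, running containers are
# not stopped when the daemon restarts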

@stieler-it

I think we have a similar problem. I am pretty sure it is not an OOM issue but our application terminating itself due to a detected performance problem (probably off-topic). However, the container still shows as up and running.

$ docker inspect 35eef8a31559 
...
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 19883,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2017-07-11T13:33:51.870304165Z",
            "FinishedAt": "2017-07-11T13:29:46.615889316Z"
        },
...

It is not possible to exec into the container:

$ sudo docker exec -it 35eef8a31559 /bin/bash
rpc error: code = 2 desc = containerd: container not found

It is not possible to stop the container:

$ sudo docker stop 35eef8a31559
35eef8a31559
$ sudo docker ps | grep 35
35eef8a31559        repos/image                  "/.r/r /bin/sh -c ..."   6 days ago          Up 6 days                               r-stack-service-10-1d306d05

Other containers work as expected. The same thing happened two weeks ago, and I could only fix it by restarting the docker daemon.

$ sudo docker info
Containers: 13
 Running: 13
 Paused: 0
 Stopped: 0
Images: 22
Server Version: 17.03.1-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 128
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-78-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.796 GiB
Name: machine-name-hidden
ID: Z4U6:KLQU:WJDV:NCO6:DWDF:V5UQ:LO4B:FISF:CGIR:JAGW:HIAU:BTTK
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

@thaJeztah
Member

ping @mlaventure

@mlaventure
Contributor

Can everyone affected check whether the container marked as "not found" appears in the output of docker-runc list?

Also, in your docker daemon logs, can you see a stacktrace or an indication that containerd was restarted?

If you have a way to reproduce the issue, or had the daemon in debug mode when it occurred, please provide the logs; they would be helpful.
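
A sketch of how to gather the above (assumes a systemd host; adjust paths as needed):

docker-runc list                    # does the "not found" container still appear here?
journalctl -u docker.service | grep -iE 'panic|goroutine|containerd'   # stacktraces or containerd restarts
# enable debug logging by adding {"debug": true} to /etc/docker/daemon.json,
# then reload the daemon configuration with: sudo kill -HUP $(pidof dockerd)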

@BSWANG
Contributor Author

BSWANG commented Aug 1, 2017

The stale container cannot be found in docker-runc list. @mlaventure

@mlaventure
Contributor

@BSWANG what about a stacktrace in the docker daemon log, or a line mentioning a restart of containerd?

@firelyu

firelyu commented Aug 3, 2017

I have faced the same issue with 17.03.1-ce. The host has 128 GB of total memory and 200+ containers running. Each container has a memory limit, and the sum of those limits exceeds 128 GB. After about a day I can no longer docker stop some containers.
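
A quick way to check that kind of overcommit (an illustrative sketch, not part of the original report) is to sum the configured limits and compare them with the host's memory:

docker ps -q | xargs docker inspect --format '{{.HostConfig.Memory}}' \
  | awk '{sum += $1} END {printf "sum of limits: %.1f GiB\n", sum/1024/1024/1024}'
# limits are reported in bytes; 0 means no limit set
free -g   # host total memory for comparison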

@quantonganh

quantonganh commented Oct 19, 2017

Can everyone affected check whether the container marked as "not found" appears in the output of docker-runc list?

# docker-runc list | grep c134d8c8e747
#

or a line mentioning a restart of containerd?

dockerd[5283]: time="2017-10-18T00:02:25.000031528Z" level=error msg="libcontainerd: error restarting containerd: fork/exec /usr/bin/docker-containerd: cannot allocate memory"

@rprieto

rprieto commented Dec 14, 2017

I suspect I have the same issue happening right now. I didn't notice the OOM but the host was running very close to 100% before it happened. I only noticed many hours later that the container did not terminate.

  • docker -v: Docker version 17.05.0-ce, build 9f07f0e-synology
  • docker ps: container is listed with status = Up 13 hours
  • docker stop|kill <container id>: hangs for 10 seconds, then prints the container ID. The container still shows as running after that.
  • docker-runc list: empty
  • docker inspect <container id>: as below
"State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 19395,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2017-12-13T20:11:18.876338839Z",
            "FinishedAt": "2017-12-13T20:11:05.616491875Z",
            "StartedTs": 1513195878,
            "FinishedTs": 1513195865
        }

Is it odd that FinishedAt is earlier than StartedAt?

@prblm

prblm commented Jan 22, 2018

Hi!

We have a similar problem.
After the out-of-memory killer runs on our system, the container does not restart.

docker version
Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:41:23 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:42:49 2017
 OS/Arch:      linux/amd64
 Experimental: false

Jan 08 20:07:05 dc kernel: runc:[2:INIT] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Jan 08 20:07:05 dc kernel: Task in /docker/0abce43b1fec97241cdf5374a881a08767a5a758a6590ed3c991ff5d73c6c896 killed as a result of limit of /docker/0abce43b1fec97241cdf5374a881a08767a5a758a6590ed3c991ff5d73c6c896

As we can see, docker service ps shows that service_0001_00001 is "Running 2 weeks ago", but docker ps -a shows that the service_1 container exited 30 minutes ago with code 137.

docker service ls

nbdcwyna40z6        service_0001_00001       replicated          1/1                 dc:6111/tsk/service_1:0.0.03    *:6212->6112/tcp,*:60023->9111/tcp

docker service ps service_0001_00001

ID                  NAME                       IMAGE                         NODE DESIRED STATE       CURRENT STATE          ERROR               PORTS
elzie64arymw        service_0001_00001.1       dc:6111/tsk/service_1:0.0.03   dc   Running             Running 2 weeks ago
lfud1xir2qwm         \_ service_0001_00001.1   dc:6111/tsk/service_1:0.0.02   dc   Shutdown            Shutdown 3 weeks ago
tx9abyne6to0         \_ service_0001_00001.1   dc:6111/tsk/service_1:0.0.01   dc   Shutdown            Shutdown 5 weeks ago

docker ps -a

0abce43b1fec        dc:6111/tsk/service_1:0.0.03    "dotnet service_00..."   3 weeks ago         Exited (137) 30 minutes ago                                         service_0001_00001.1.elzie64arymwc2jdcw3isc0ba

@SmilingNavern

SmilingNavern commented Jan 14, 2019

Any updates on this? We have a similar problem with Docker 17.03. Does updating to a newer version fix this?

@rishiloyola

Any updates on this? I am facing this issue a lot.

@SmilingNavern

@rishiloyola which docker version do you run?

@thaJeztah
Member

@rishiloyola also: what kernel version? There was a bug in some recent kernel versions that prevented OOM events from being read; see containerd/cgroups#74
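
A rough way to check whether the kernel still exposes OOM state for a container's memory cgroup (assumes cgroup v1 with the cgroupfs driver, as in the reports above; <full-container-id> is a placeholder):

cat /sys/fs/cgroup/memory/docker/<full-container-id>/memory.oom_control
# shows oom_kill_disable and under_oom; containerd learns about OOM kills via
# eventfd notifications registered against this file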

@rishiloyola

@SmilingNavern I am using Docker version 17.09.0-ce, build afdb6d4

@thaJeztah I am using following kernel version
Linux VM-6bfe78a3-e1c8-4bf0-b38e-f249faa06317 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

@SmilingNavern

@rishiloyola I suggest you try Docker 18.03.

@webPageDev

@rishiloyola @SmilingNavern We have a similar problem. Are you sure which Docker version fixes this? Do you know the commit that addresses this issue?

@webPageDev

@thaJeztah Are you sure which Docker version fixes this? Do you know the commit that addresses this issue?

@SmilingNavern

@webPageDev Docker with containerd > 1.0 is good to go. I believe it was fixed inside containerd, but I don't know the specific commit.
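
To check which containerd a given engine is using (newer engine packages bundle containerd 1.0 or later):

docker info | grep -i 'containerd version'   # commit of the containerd the daemon talks to
containerd --version                         # where containerd ships as a separate daemon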
