
Cannot start container: Getting the final child's pid from pipe caused "EOF" #40835

Open
ceecko opened this issue Apr 20, 2020 · 54 comments
Labels
area/runtime kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. version/19.03

Comments

@ceecko

ceecko commented Apr 20, 2020

Description

Intermittently containers cannot be started and docker returns the following error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown.

Restarting the machine resolves the issue.

Steps to reproduce the issue:

  1. docker run --rm -it img /bin/bash

Describe the results you received:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown.

Describe the results you expected:
Container should start

Additional information you deem important (e.g. issue happens only occasionally):
The issue happens only occasionally.
It appears to be connected to #37722 and docker/for-linux#856
Feel free to close if there's nothing Docker can do about it.

Output of docker version:

Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b
 Built:             Wed Mar 11 01:27:04 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b
  Built:            Wed Mar 11 01:25:42 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 19
 Server Version: 19.03.8
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: journald
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1062.18.1.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 6.492GiB
 Name: fs2
 ID: K2B7:KJGO:JNN3:PQWZ:E7AS:YUUY:YCUH:ALUD:2QXI:TEPE:Q6V3:NIEZ
 Docker Root Dir: /data/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true

Additional environment details (AWS, VirtualBox, physical, etc.):
VM

/var/log/messages entries

Apr 20 09:13:21 xxx kernel: runc:[1:CHILD]: page allocation failure: order:4, mode:0xc0d0
Apr 20 09:13:21 xxx kernel: CPU: 1 PID: 3522 Comm: runc:[1:CHILD] Kdump: loaded Tainted: G               ------------ T 3.10.0-1062.18.1.el7.x86_64 #1
Apr 20 09:13:21 xxx kernel: Hardware name: OpenStack Foundation OpenStack Nova, BIOS 2:1.10.2-58953eb7 04/01/2014
Apr 20 09:13:21 xxx kernel: Call Trace:
Apr 20 09:13:21 xxx kernel: [<ffffffffa6d7b416>] dump_stack+0x19/0x1b
Apr 20 09:13:21 xxx kernel: [<ffffffffa67c3fc0>] warn_alloc_failed+0x110/0x180
Apr 20 09:13:21 xxx kernel: [<ffffffffa6d7698a>] __alloc_pages_slowpath+0x6bb/0x729
Apr 20 09:13:21 xxx kernel: [<ffffffffa67c8636>] __alloc_pages_nodemask+0x436/0x450
Apr 20 09:13:21 xxx kernel: [<ffffffffa6816c58>] alloc_pages_current+0x98/0x110
Apr 20 09:13:21 xxx kernel: [<ffffffffa67e3658>] kmalloc_order+0x18/0x40
Apr 20 09:13:21 xxx kernel: [<ffffffffa6822216>] kmalloc_order_trace+0x26/0xa0
Apr 20 09:13:21 xxx kernel: [<ffffffffa68261a1>] __kmalloc+0x211/0x230
Apr 20 09:13:21 xxx kernel: [<ffffffffa683f041>] memcg_alloc_cache_params+0x81/0xb0
Apr 20 09:13:21 xxx kernel: [<ffffffffa67e3304>] do_kmem_cache_create+0x74/0xf0
Apr 20 09:13:21 xxx kernel: [<ffffffffa67e3482>] kmem_cache_create+0x102/0x1b0
Apr 20 09:13:21 xxx kernel: [<ffffffffc069bdd1>] nf_conntrack_init_net+0xf1/0x260 [nf_conntrack]
Apr 20 09:13:21 xxx kernel: [<ffffffffc069c6d4>] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack]
Apr 20 09:13:21 xxx kernel: [<ffffffffa6c44054>] ops_init+0x44/0x150
Apr 20 09:13:21 xxx kernel: [<ffffffffa6c44203>] setup_net+0xa3/0x160
Apr 20 09:13:21 xxx kernel: [<ffffffffa6c449a5>] copy_net_ns+0xb5/0x180
Apr 20 09:13:21 xxx kernel: [<ffffffffa66cb599>] create_new_namespaces+0xf9/0x180
Apr 20 09:13:21 xxx kernel: [<ffffffffa66cb7da>] unshare_nsproxy_namespaces+0x5a/0xc0
Apr 20 09:13:21 xxx kernel: [<ffffffffa669afeb>] SyS_unshare+0x1cb/0x340
Apr 20 09:13:21 xxx kernel: [<ffffffffa6d8dede>] system_call_fastpath+0x25/0x2a
Apr 20 09:13:21 xxx kernel: Mem-Info:
Apr 20 09:13:21 xxx kernel: active_anon:58019 inactive_anon:86369 isolated_anon:0#012 active_file:75955 inactive_file:1180913 isolated_file:0#012 unevictable:6102 dirty:95 writeback:0 unstable:0#012 slab_reclaimable:90514 slab_unreclaimable:48526#012 mapped:35231 shmem:143 pagetables:3532 bounce:0#012 free:62074 free_pcp:0 free_cma:0
Apr 20 09:13:21 xxx kernel: Node 0 DMA free:15908kB min:156kB low:192kB high:232kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 20 09:13:21 xxx kernel: lowmem_reserve[]: 0 2830 6628 6628
Apr 20 09:13:21 xxx kernel: Node 0 DMA32 free:127088kB min:28792kB low:35988kB high:43188kB active_anon:76456kB inactive_anon:137404kB active_file:221232kB inactive_file:2012980kB unevictable:4900kB isolated(anon):0kB isolated(file):0kB present:3129200kB managed:2898768kB mlocked:4900kB dirty:184kB writeback:0kB mapped:107104kB shmem:304kB slab_reclaimable:155508kB slab_unreclaimable:62240kB kernel_stack:1584kB pagetables:4852kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 20 09:13:21 xxx kernel: lowmem_reserve[]: 0 0 3798 3798
Apr 20 09:13:21 xxx kernel: Node 0 Normal free:105300kB min:38632kB low:48288kB high:57948kB active_anon:155620kB inactive_anon:208072kB active_file:82588kB inactive_file:2710672kB unevictable:19508kB isolated(anon):0kB isolated(file):0kB present:4022272kB managed:3892156kB mlocked:19508kB dirty:196kB writeback:0kB mapped:33820kB shmem:268kB slab_reclaimable:206548kB slab_unreclaimable:131864kB kernel_stack:2256kB pagetables:9276kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 20 09:13:21 xxx kernel: lowmem_reserve[]: 0 0 0 0
Apr 20 09:13:21 xxx kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Apr 20 09:13:21 xxx kernel: Node 0 DMA32: 9256*4kB (UEM) 8488*8kB (UEM) 1288*16kB (UEM) 25*32kB (UM) 9*64kB (M) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 127040kB
Apr 20 09:13:21 xxx kernel: Node 0 Normal: 14942*4kB (UEM) 5699*8kB (UEM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 105360kB
Apr 20 09:13:21 xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Apr 20 09:13:21 xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 20 09:13:21 xxx kernel: 1260617 total pagecache pages
Apr 20 09:13:21 xxx kernel: 1355 pages in swap cache
Apr 20 09:13:21 xxx kernel: Swap cache stats: add 45752, delete 44397, find 6635088/6637528
Apr 20 09:13:21 xxx kernel: Free swap  = 4148988kB
Apr 20 09:13:21 xxx kernel: Total swap = 4194300kB
Apr 20 09:13:21 xxx kernel: 1791866 pages RAM
Apr 20 09:13:21 xxx kernel: 0 pages HighMem/MovableOnly
Apr 20 09:13:21 xxx kernel: 90158 pages reserved
Apr 20 09:13:21 xxx kernel: kmem_cache_create(nf_conntrack_ffff9b3827d29480) failed with error -12
Apr 20 09:13:21 xxx kernel: CPU: 1 PID: 3522 Comm: runc:[1:CHILD] Kdump: loaded Tainted: G               ------------ T 3.10.0-1062.18.1.el7.x86_64 #1
Apr 20 09:13:21 xxx kernel: Hardware name: OpenStack Foundation OpenStack Nova, BIOS 2:1.10.2-58953eb7 04/01/2014
Apr 20 09:13:21 xxx kernel: Call Trace:
Apr 20 09:13:21 xxx kernel: [<ffffffffa6d7b416>] dump_stack+0x19/0x1b
Apr 20 09:13:21 xxx kernel: [<ffffffffa67e3507>] kmem_cache_create+0x187/0x1b0
Apr 20 09:13:21 xxx kernel: [<ffffffffc069bdd1>] nf_conntrack_init_net+0xf1/0x260 [nf_conntrack]
Apr 20 09:13:21 xxx kernel: [<ffffffffc069c6d4>] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack]
Apr 20 09:13:21 xxx kernel: [<ffffffffa6c44054>] ops_init+0x44/0x150
Apr 20 09:13:21 xxx kernel: [<ffffffffa6c44203>] setup_net+0xa3/0x160
Apr 20 09:13:21 xxx kernel: [<ffffffffa6c449a5>] copy_net_ns+0xb5/0x180
Apr 20 09:13:21 xxx kernel: [<ffffffffa66cb599>] create_new_namespaces+0xf9/0x180
Apr 20 09:13:21 xxx kernel: [<ffffffffa66cb7da>] unshare_nsproxy_namespaces+0x5a/0xc0
Apr 20 09:13:21 xxx kernel: [<ffffffffa669afeb>] SyS_unshare+0x1cb/0x340
Apr 20 09:13:21 xxx kernel: [<ffffffffa6d8dede>] system_call_fastpath+0x25/0x2a
Apr 20 09:13:21 xxx kernel: Unable to create nf_conn slab cache
Apr 20 09:13:21 xxx containerd: time="2020-04-20T09:13:21.565957368+02:00" level=info msg="shim reaped" id=a8da2d54ebf451c4ee6118a276b2e4a10f3d6f61ebc52b853c34419ba4c132bb
Apr 20 09:13:21 xxx dockerd: time="2020-04-20T09:13:21.579186951+02:00" level=error msg="stream copy error: reading from a closed fifo"
@kleysonr

kleysonr commented Apr 29, 2020

Same issue.

$ docker run -it --restart unless-stopped --name imgesrv -p 8080:8080 image
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown.
ERRO[0000] error waiting for container: context canceled 

$ docker version
Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b
 Built:             Wed Mar 11 01:27:04 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b
  Built:            Wed Mar 11 01:25:42 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
$ docker info
Client:
 Debug Mode: false

Server:
 Containers: 1
  Running: 0
  Paused: 0
  Stopped: 1
 Images: 1
 Server Version: 19.03.8
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
  selinux
  userns
 Kernel Version: 3.10.0-1127.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 19.45GiB
 Name: zeus.agroneural.com
 ID: ZVFK:5NM6:QJGU:7RFF:A6TA:IMMJ:RXKF:NIKF:6P5P:YJGN:WDWR:ZX2U
 Docker Root Dir: /var/lib/docker/1002.1002
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true

@vioan

vioan commented May 9, 2020

I have the same issue ...

@thaJeztah
Member

@kolyshkin ptal

@chrisjohnson

We are seeing this as well. Does anybody have any findings on the culprit, or a potential workaround? Right now the only recourse we have is to restart the entire docker service.

@deanmax

deanmax commented Jun 10, 2020

Same issue here. This tends to happen when there is a large number of containers running on the host (33 this time).

@ClepToManix

ClepToManix commented Jun 12, 2020

I'm also affected.

Client:
 Debug Mode: false

Server:
 Containers: 20
  Running: 18
  Paused: 0
  Stopped: 2
 Images: 24
 Server Version: 19.03.11
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.9.0
 Operating System: Debian GNU/Linux 9 (stretch)
 OSType: linux
 Architecture: x86_64
 CPUs: 6
 Total Memory: 16GiB
 Name: h2883899.stratoserver.net
 ID: OKUB:5PLM:JBOD:3L7X:ET2T:FBPH:M3P3:PC73:EQLM:7QJE:TX3F:247B
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

@sudo-bmitch

Seeing this on a fresh lab deploy, no other containers running, CentOS 7, with userns enabled. My Debian 10 environment isn't seeing any issues. Disabling userns made the issue go away.

[root@vm-1 docker]# docker version
Client: Docker Engine - Community
 Version:           19.03.11
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        42e35e61f3
 Built:             Mon Jun  1 09:13:48 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.11
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       42e35e61f3
  Built:            Mon Jun  1 09:12:26 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

[root@vm-1 docker]# cat /etc/docker/daemon.json
{
  "experimental": false,
  "features": {"buildkit": true },
  "hosts": ["unix:///var/run/docker.sock"],
  "labels": ["from_ansible=true"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "labels": "com.docker.stack.namespace,com.docker.swarm.service.name,environment"
  },
  "storage-driver": "overlay2",
  "userns-remap": "dockerns:dockerns"
}

[root@vm-1 docker]# more /etc/subuid
dockerns:100000:65536

[root@vm-1 docker]# more /etc/subgid
dockerns:100000:65536

[root@vm-1 docker]# docker run -it --rm busybox echo hello
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox
76df9210b28c: Pull complete
Digest: sha256:95cf004f559831017cdf4628aaf1bb30133677be8702a8c5f2994629f637a209
Status: Downloaded newer image for busybox:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown.

[bmitch@vm-1 ~]$ cat /etc/os-release
NAME="CentOS Linux"       
VERSION="7 (Core)"                                          
ID="centos"
ID_LIKE="rhel fedora"         
VERSION_ID="7"                                              
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"         
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
                      
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"                
REDHAT_SUPPORT_PRODUCT="centos"   
REDHAT_SUPPORT_PRODUCT_VERSION="7"

[root@vm-1 docker]# journalctl -u docker | tail
Jun 15 21:21:21 vm-1 dockerd[1638]: time="2020-06-15T21:21:21.335146186Z" level=info msg="Loading containers: start."
Jun 15 21:21:21 vm-1 dockerd[1638]: time="2020-06-15T21:21:21.756251741Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jun 15 21:21:21 vm-1 dockerd[1638]: time="2020-06-15T21:21:21.860590393Z" level=info msg="Loading containers: done."
Jun 15 21:21:21 vm-1 dockerd[1638]: time="2020-06-15T21:21:21.894862938Z" level=info msg="Docker daemon" commit=42e35e61f3 graphdriver(s)=overlay2 version=19.03.11
Jun 15 21:21:21 vm-1 dockerd[1638]: time="2020-06-15T21:21:21.894947518Z" level=info msg="Daemon has completed initialization"
Jun 15 21:21:21 vm-1 systemd[1]: Started Docker Application Container Engine.
Jun 15 21:21:21 vm-1 dockerd[1638]: time="2020-06-15T21:21:21.976800756Z" level=info msg="API listen on /var/run/docker.sock"
Jun 15 21:21:26 vm-1 dockerd[1638]: time="2020-06-15T21:21:26.626754125Z" level=error msg="stream copy error: reading from a closed fifo"
Jun 15 21:21:26 vm-1 dockerd[1638]: time="2020-06-15T21:21:26.876435720Z" level=error msg="96f1eba1fcc0ecaf53eace842a337ea14bcf3c463eb8ab0f2fb8c6cb754929a6 cleanup: failed to delete container from containerd: no such container"
Jun 15 21:21:26 vm-1 dockerd[1638]: time="2020-06-15T21:21:26.923436311Z" level=error msg="Handler for POST /v1.40/containers/96f1eba1fcc0ecaf53eace842a337ea14bcf3c463eb8ab0f2fb8c6cb754929a6/start returned error: OCI runtime create failed: container_linux.go:349: starting container process caused \"process_linux.go:319: getting the final child's pid from pipe caused \\\"EOF\\\"\": unknown"

@sudo-bmitch

Just found the following indicating that it's a configuration issue on my side:

sysctl -w user.max_user_namespaces=15000

docker/docs#7962 (comment)
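
To make that survive a reboot, the sysctl can also be persisted via a drop-in file; a minimal sketch (the file name under /etc/sysctl.d is just an example):

# apply now
sysctl -w user.max_user_namespaces=15000
# persist across reboots
echo 'user.max_user_namespaces=15000' > /etc/sysctl.d/99-userns.conf
sysctl --system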

@Manvi07

Manvi07 commented Jun 17, 2020

I am facing a similar issue when trying to build a docker image:

Step 3/5 : RUN pip install -r /scripts/requirements.txt
---> Running in c69c338a8f1c
OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown 

The above solution doesn't seem to work.

@ClepToManix

Just found the following indicating that it's a configuration issue on my side:

sysctl -w user.max_user_namespaces=15000

docker/docker.github.io#7962 (comment)

I will try it ASAP and report back.

@deanmax

deanmax commented Jun 24, 2020

The user.max_user_namespaces kernel parameter doesn't seem to help in my case

~> sysctl -n user.max_user_namespaces
15000
~> docker run --rm -it busybox date
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown.

@obilixilido

It may not matter, but I got the same error message when creating a deployment on GKE (ver. 15.12.2).
After investigating, I found a mistake in my manifest file:

        resources:
          limits:
            cpu: 1100m
            memory: 3000m  # correct: 3000M
          requests:
            cpu: 1100m
            memory: 3000m # correct: 3000M

JFYI.

@snyff

snyff commented Jul 1, 2020

I ran into the same issue on a brand new Debian 10 and a new Debian 9, both provisioned using docker-machine.

There were 2 files under /etc/systemd/system/:
/etc/systemd/system/docker.service.d/10-machine.conf and /etc/systemd/system/docker.service

I rm'd /etc/systemd/system/docker.service and restarted the service.
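
For anyone else cleaning this up, roughly the commands involved (a sketch, assuming the stray unit is the file mentioned above):

rm /etc/systemd/system/docker.service
systemctl daemon-reload
systemctl restart docker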

@slackab

slackab commented Jul 21, 2020

How can this be fixed?

@jamshid
Contributor

jamshid commented Jul 21, 2020

There are probably multiple causes of this error. I ran into it because k3s had also been installed on my CentOS 7.6 Docker server.

[root@XX ~]# docker run hello-world
docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown.

Uninstalling k3s (https://rancher.com/docs/k3s/latest/en/installation/uninstall/) appears to have fixed it.

@dontub

dontub commented Jul 24, 2020

I solved it by removing MountFlags=private from the systemd config of the docker service, though I can neither say when exactly this started to become an issue, nor do I remember why I added it.
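
In case it helps others, a rough way to locate and drop that setting (typical unit locations shown; adjust to your distribution):

grep -r MountFlags /etc/systemd/system /usr/lib/systemd/system 2>/dev/null
# remove or comment out the MountFlags= line in the file reported, then:
systemctl daemon-reload
systemctl restart docker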

@taylormgeorge91

I fixed the issue in my case by increasing the resource limits of the pod.

In a previous case it was the cores, in the most recent case it was the memory.

@Aetherus

Aetherus commented Jul 31, 2020

I solved it by removing MountFlags=private from the systemd config of the docker service. Though I neither can say when exactly this started to become an issue, nor do I remember why I've added it.

In my case (CentOS 7.8), it was MountFlags=slave, and solved it by removing the whole line and restarting the docker daemon.

@ClepToManix

I solved it by removing MountFlags=private from the systemd config of the docker service. Though I neither can say when exactly this started to become an issue, nor do I remember why I've added it.

I read that the fix was to remove the value behind the =.
I found this setting in my configuration, but I'm unable to find where that configuration is stored. I'm running the latest version of Debian Linux. Maybe someone can help me find where the config is stored.

@ClepToManix

Okay, just found another interesting post on another forum.
If you are running a VPS that is virtualized with Virtuozzo, your hosting provider may have limited your tasks...
I'm using Strato and it seems they have limited my server. Under /proc/user_beancounters you can find those limits. numproc is set to 700 and my current held value is 661. Starting a bigger Docker stack seems to be impossible...

You can find more in this post https://serverfault.com/questions/1017994/docker-compose-oci-runtime-create-failed-pthread-create-failed/1018402

It seems there is no bug after all...
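
For anyone wanting to check the same thing, a quick look at the relevant counters (the file only exists on Virtuozzo/OpenVZ guests and needs root; the last column is the fail counter):

grep -E 'numproc|numiptent' /proc/user_beancounters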

@ceecko
Author

ceecko commented Aug 2, 2020

In my case the OS is running on a dedicated server without virtualization.

@MitchellBot

Ran into the same issue with Debian on WSL1 while following this: https://nickjanetakis.com/blog/setting-up-docker-for-windows-and-wsl-to-work-flawlessly

docker-compose -version was complaining that I needed WSL2. I made sure to install docker-compose via pip3 and placed its path before the other bins.

@VictorLee0321

I just restarted the docker service (systemctl restart docker.service) and it worked for me.
See: microsoft/vscode-docker#1963 (comment)

@ngrilly

ngrilly commented Aug 26, 2020

Same as @deanmax. Commenting out network_mode: host on all services in our docker-compose.yaml "fixes" it. Any idea what the root cause is?

@yogevyuval

Setting user.max_user_namespaces or restarting docker seems to fix the issue only temporarily; it keeps coming back. Any update from someone on the moby team?

@cpuguy83
Member

This error would typically be seen due to OOM.

  1. Do you have memory limits applied? Are they enough to bootstrap the container (runc mem + hooks)?
  2. Is the system just OOM?
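
A quick way to check for OOM-killer activity in the kernel log (a sketch; the exact wording varies by kernel version):

dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
# or, on systemd hosts:
journalctl -k | grep -iE 'out of memory|oom-killer'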

@ceecko
Author

ceecko commented Oct 16, 2020

@cpuguy83 My system is not running any containers and has 30 GB of free memory. The issue appears before a container is started (or maybe during startup). It occurs regularly, every ~30 days. Only a reboot helps.

@cpuguy83
Member

@ceecko The error is during container bootstrap. While it is "before the container starts", it is part of the startup process.

Explaining how runc works is kind of tricky on a forum... but it basically requires multiple re-execs... that is, runc executes, sets up some pipes, then re-executes and sends some data over the pipes back to the original process. The EOF comes because the pipe writer (the re-exec'd process) closed the pipe before the reader got the data it was expecting.
This would pretty much be due to a crash... which I would generally pin onto an OOM kill.

Because of the nature of the crash, it is really difficult to debug where it is coming from.
Since a reboot helps, it would seem to be resource related.

@chrisjohnson

This error has nothing to do with OOM; we see it all the time without any OOM notifications or hitting any memory threshold.

@ceecko
Author

ceecko commented Oct 16, 2020

@cpuguy83 once it occurs, it's easy to reproduce - happens in 100% of cases :)
If there's any information I can provide you, I'd be happy to help. Just let me know.

@adwski

adwski commented Apr 21, 2021

For me the problem was solved by downgrading docker to 18.06.0.

@myclau

myclau commented May 10, 2021

In my case the swap size was not enough, and it can be quick-fixed as follows.

I had the same problem and fixed it by increasing the PID limit.

Way 1:

$ sysctl -n kernel.pid_max
32768
$ sysctl -w kernel.pid_max=100000

Way 2:

$ sysctl -n user.max_user_namespaces
0  
# if zero try this
$ sysctl -w user.max_user_namespaces=15000

Way 3:

$ grep -w 'runc:\[1:CHILD\]: page allocation failure' /var/log/messages | tail -n 4
Nov 20 16:13:54 ETL010080 kernel: runc:[1:CHILD]: page allocation failure: order:4, mode:0x10c0d0
Nov 20 16:15:46 ETL010080 kernel: runc:[1:CHILD]: page allocation failure: order:4, mode:0x10c0d0
Nov 20 16:16:28 ETL010080 kernel: runc:[1:CHILD]: page allocation failure: order:4, mode:0x10c0d0
Nov 20 16:16:41 ETL010080 kernel: runc:[1:CHILD]: page allocation failure: order:4, mode:0x10c0d0

Solution 1:

echo 3 > /proc/sys/vm/drop_caches

Solution 2:

echo 1 > /proc/sys/vm/compact_memory
# or
sysctl -w vm.compact_memory=1

For me, only the way-3 fix was needed, and you can put it in a cron job to run every few hours or days (see the sketch below).
The long-term fix is to increase the swap size of the machine(s).
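
For example, a root crontab entry along these lines would run it every 6 hours (the schedule and sysctl path are just examples; adjust for your distribution):

0 */6 * * * /sbin/sysctl -w vm.drop_caches=3 vm.compact_memory=1 >/dev/null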

@attila123

attila123 commented Aug 6, 2021

I have this problem occasionally occurring on one of our Jenkins machines (what is the politically correct name for a Jenkins slave?), while I did not see it on the other(s). I have a sandbox job and printed some info. It looks like the kernel version and docker version differ between these servers. The sysctl values differ too, I guess because of the (quite) different kernel versions. Also, the problematic server does not have any swap at all (not sure whether that could have any effect, as it has plenty of RAM).
Wanted to share this, hope it helps...

This is where I did not see this docker problem happen (I did not check any logs etc, just as a simple Jenkins user):
[Pipeline] sh
14:44:48 + docker --version
14:44:48 Docker version 20.10.6, build 370c289
[Pipeline] sh
14:44:48 + cat /etc/centos-release
14:44:48 CentOS Linux release 7.8.2003 (Core)
[Pipeline] sh
14:44:48 + uname -rvm
14:44:48 5.8.4-1.el7.elrepo.x86_64 #1 SMP Mon Aug 24 18:27:53 EDT 2020 x86_64
[Pipeline] sh
14:44:49 + free -h
14:44:49 total used free shared buff/cache available
14:44:49 Mem: 125G 17G 19G 4.1G 88G 102G
14:44:49 Swap: 4.0G 750M 3.3G
[Pipeline] sh
14:44:49 + /usr/sbin/sysctl -n kernel.pid_max
14:44:49 32768
[Pipeline] sh
14:44:49 + /usr/sbin/sysctl -n user.max_user_namespaces
14:44:49 515138

This is where it occasionally happens.
(Colleagues will update stuff on this machine and hopefully this will be fixed after that.)
14:46:01 + docker --version
14:46:01 Docker version 19.03.11, build 42e35e61f3
[Pipeline] sh
14:46:01 + cat /etc/centos-release
14:46:01 CentOS Linux release 7.8.2003 (Core)
[Pipeline] sh
14:46:02 + uname -rvm
14:46:02 3.10.0-1127.10.1.el7.x86_64 #1 SMP Wed Jun 3 14:28:03 UTC 2020 x86_64
[Pipeline] sh
14:46:02 + free -h
14:46:03 total used free shared buff/cache available
14:46:03 Mem: 251G 46G 3.7G 4.1G 201G 199G
14:46:03 Swap: 0B 0B 0B
[Pipeline] sh
14:46:04 + /usr/sbin/sysctl -n kernel.pid_max
14:46:05 4194304
[Pipeline] sh
14:46:05 + /usr/sbin/sysctl -n user.max_user_namespaces
14:46:05 0

dduportal added a commit to dduportal/jenkins-infra that referenced this issue Aug 12, 2021
dduportal added a commit to jenkins-infra/jenkins-infra that referenced this issue Aug 12, 2021
@ysoftman

ysoftman commented Aug 14, 2021

In my k8s environment, it seems the sandbox container couldn't find the user-created container's resources (pid, files, etc.), as the user-created container had finished too quickly.
The container job just prints the date and ends within 1 second.
I created the following container with a one-minute cronjob and tested it.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: ysoftman-cronjob-test
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: ysoftman-cronjob-test
            image: busybox
            imagePullPolicy: Always
            command:
            - /bin/sh
            - -c
            - date

The error no longer occurred after I added a 'sleep 5' command to my container:

  command:
  - /bin/sh
  - -c
  - date; sleep 5

@Tobias-UniBwM

I've had the same issue starting WebODM. It turns out there was an option called oom_score_adj in their docker-compose.yml, which adjusts how readily the Linux kernel's OOM killer will kill a process that is consuming too much memory.

Commenting out that option for redis and opendronemap/webodm_db solved it for me.

@f18m

f18m commented Jun 28, 2022

I'm also getting this problem on build machines that stay up and running for a long time and spawn a lot of containers.
Here's my docker info:

[root@bilcentos7-build21-6 ~]# docker info
Client:
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 287
 Server Version: 19.03.15
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc version: v1.0.2-0-g52b36a2
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1160.49.1.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.51GiB
 Name: bilcentos7-build21-6
 ID: TX7Z:IFWO:EYPL:VUO3:2KB3:2BB2:CSJP:WSDK:DVEW:5NOM:E5ZR:3ABQ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: *****
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

I get an error like:

------
 > [3/40] RUN echo "List of YUM Fedora repositories:" &&     ls -l /etc/yum.repos.d/*.repo:
#7 0.070 container_linux.go:380: starting container process caused: process_linux.go:402: getting the final child's pid from pipe caused: EOF
------
failed to solve with frontend dockerfile.v0: failed to build LLB: executor failed running [/bin/sh -c echo "List of YUM Fedora repositories:" &&     ls -l /etc/yum.repos.d/*.repo]: runc did not terminate sucessfully

every time I try to do a "docker build" or "docker run".
The workaround I found is simply:

systemctl restart containerd

so my guess is that the actual bug is in the containerd layer.

@thaJeztah
Member

I see you're running an older version of docker and a version of containerd (v1.4) that reached EOL; if you have a system to test on, perhaps you could check whether the problem still occurs on current versions of containerd (and docker).

@VanHoevenTR

VanHoevenTR commented Jul 2, 2022

Getting the same issue running hello-world on LXD in Container Station.

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: can't get final child's PID from pipe: EOF: unknown.
ERRO[0000] error waiting for container: context canceled 

I can run hello-world directly, but docker commands don't work with Container Station and it has issues opening ports, whereas LXD can open ports without any issues.

Is there a way to get it work on LXD?

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.2-docker)
  compose: Docker Compose (Docker Inc., v2.6.0)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 5
  Running: 0
  Paused: 0
  Stopped: 5
 Images: 1
 Server Version: 20.10.17
 Storage Driver: vfs
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc version: v1.1.2-0-ga916309
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.60-qnap
 Operating System: Ubuntu 18.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.39GiB
 Name: stremio
 ID: 4U7W:FAIE:5LX2:MHAK:UDCV:6PLD:ZF45:HXLK:EAVO:AF6F:BJSU:LAPK
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpu shares support

@thaJeztah
Member

Is there a way to get it work on LXD?

Haven't done so myself, but there are some tutorials here;

@VanHoevenTR

Never mind, I got it to work by enabling privileged mode.
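
In case it helps anyone else on plain LXD, a sketch of the equivalent from the command line (the container name docker-host is just an example; security.nesting is usually also needed to run Docker inside LXD):

lxc config set docker-host security.privileged true
lxc config set docker-host security.nesting true
lxc restart docker-host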


@pciavald

pciavald commented Jul 14, 2022

Same issue here:

root@localhost:~# curl -fsSL https://get.docker.com | bash
# Executing docker install script, commit: b2e29ef7a9a89840d2333637f7d1900a83e7153f
+ sh -c 'apt-get update -qq >/dev/null'
+ sh -c 'DEBIAN_FRONTEND=noninteractive apt-get install -y -qq apt-transport-https ca-certificates curl gnupg >/dev/null'
+ sh -c 'mkdir -p /etc/apt/keyrings && chmod -R 0755 /etc/apt/keyrings'
+ sh -c 'curl -fsSL "https://download.docker.com/linux/ubuntu/gpg" | gpg --dearmor --yes -o /etc/apt/keyrings/docker.gpg'
+ sh -c 'chmod a+r /etc/apt/keyrings/docker.gpg'
+ sh -c 'echo "deb [arch=arm64 signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" > /etc/apt/sources.list.d/docker.list'
+ sh -c 'apt-get update -qq >/dev/null'
+ sh -c 'DEBIAN_FRONTEND=noninteractive apt-get install -y -qq --no-install-recommends docker-ce docker-ce-cli containerd.io docker-compose-plugin >/dev/null'
E: Sub-process /usr/bin/dpkg returned an error code (1)

root@localhost:~# apt-get install -y -qq --no-install-recommends docker-ce docker-ce-cli containerd.io docker-compose-plugin
Setting up docker-ce (5:20.10.17~3-0~ubuntu-jammy) ...
Job for docker.service failed because the control process exited with error code.
See "systemctl status docker.service" and "journalctl -xeu docker.service" for details.
invoke-rc.d: initscript docker, action "start" failed.
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Thu 2022-07-14 10:50:18 UTC; 38ms ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
    Process: 2556 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=1/FAILURE)
   Main PID: 2556 (code=exited, status=1/FAILURE)
        CPU: 316ms
dpkg: error processing package docker-ce (--configure):
 installed docker-ce package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 docker-ce
E: Sub-process /usr/bin/dpkg returned an error code (1)

root@localhost:~# update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives: using /usr/sbin/iptables-legacy to provide /usr/sbin/iptables (iptables) in manual mode

root@localhost:~# systemctl restart docker
root@localhost:~# docker ps -a
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

root@localhost:~# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04 LTS
Release:	22.04
Codename:	jammy

root@localhost:~# docker --version
Docker version 20.10.17, build 100c701

root@localhost:~# docker run --rm hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
7050e35b49f5: Pull complete 
Digest: sha256:53f1bbee2f52c39e41682ee1d388285290c5c8a76cc92b42687eecf38e0af3f0
Status: Downloaded newer image for hello-world:latest
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: can't get final child's PID from pipe: EOF: unknown.

root@localhost:~# sysctl -w kernel.pid_max=100000
kernel.pid_max = 100000

root@localhost:~# sysctl -w user.max_user_namespaces=15000
user.max_user_namespaces = 15000

root@localhost:~# sysctl -w vm.compact_memory=1
vm.compact_memory = 1

root@localhost:~# echo 3 > /proc/sys/vm/drop_caches

root@localhost:~# docker run --rm hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: can't get final child's PID from pipe: EOF: unknown.

@uyw4687

uyw4687 commented Oct 4, 2022

I'm using CentOS 7.9 and it was fixed by setting max_user_namespaces to 15000.
https://superuser.com/questions/1294215/is-it-safe-to-enable-user-namespaces-in-centos-7-4-and-how-to-do-it

@TimHal

TimHal commented Jan 16, 2023

Okay, just found another interesting post on another forum. If you are running a VPS that is virtualized with Virtuozzo, your hosting provider may have limited your tasks... I'm using Strato and it seems they have limited my server. Under /proc/user_beancounters you can find those limits. numproc is set to 700 and my current held value is 661. Starting a bigger Docker stack seems to be impossible...

You can find more in this post https://serverfault.com/questions/1017994/docker-compose-oci-runtime-create-failed-pthread-create-failed/1018402

It seems to be there is no bug...

You are a life saver!

In my case the problem was related to fail2ban filling up the available iptables rules. The last line of
cat /proc/user_beancounters
was
numiptent 2000 2000 2000 2000 89
so all I had to do was clear the fail2ban list (fail2ban-client unban --all) and everything works like a charm.
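
For reference, on recent fail2ban versions the current bans can be listed before clearing them (a sketch):

fail2ban-client banned       # list currently banned IPs
fail2ban-client unban --all  # clear them all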

@maxiloEmmmm

(Quoting @myclau's comment above, with ways 1-3 and solutions 1 and 2.)

Solution 1 works. Good!

@jackgray

I made a change to a cloudflare container config file, not even adding lines but modifying a pre-existing one. I ran docker compose down as usual, and docker compose up -d throws the error. Trying to restart sites with 1 second of planned downtime has turned into 30 minutes and counting :(

pid max is set high af (> 4 million)
user namespaces max > 500,000
1.6GB of swap and 96GB real memory available
plenty of space on the system drive

I only have 5 containers running, and was running this same cloudflare container perfectly fine with 3 times as many for the past several months.

Restarting containerd did not help.
Docker version 24.0.7, build 311b9ff
I have not updated the docker version or OS (5.15.0-86-generic #96~20.04.1-Ubuntu)

This only seems to be an issue on this cloudflare container, and I tried rolling back to two previous releases that would have been latest during development. I can create new services, but am afraid to take down any existing services to find out if those will work.

sleep 5 before and after the run command did nothing either. I'm out of ideas.

@septatrix

Okay, just found another interesting post on another forum. If you are running a VPS that is virtualized with Virtuozzo, your hosting provider may have limited your tasks... I'm using Strato and it seems they have limited my server. Under /proc/user_beancounters you can find those limits. numproc is set to 700 and my current held value is 661. Starting a bigger Docker stack seems to be impossible...

Strato's new V-Server generation (around for a few months already) does not use Virtuozzo any more and luckily no longer seems to have this limitation.

@b19g3r

b19g3r commented Apr 20, 2024

(Quoting @myclau's comment above, with ways 1-3 and solutions 1 and 2.)

works for me with

echo 3 > /proc/sys/vm/drop_caches

@manojsitapara

Same issue here. This tends to happen when there is a large number of containers running on the host (33 this time).

Have you found any solution?
