On out-of-memory (OOM) restarts all Swarm-based containers, plus some inconsistencies #29941

joonas-fi opened this Issue Jan 6, 2017 · 8 comments

joonas-fi commented Jan 6, 2017

tl;dr: OOM causes:

  • All Swarm-based tasks restarting for (apparently) no reason, while, curiously, non-Swarm containers are not restarted.
  • Containers wrongly marked as running.
  • The wrong number of replicas running.

Possibly related to #29854, but not sure.

How to reproduce:

Issues occur with:

  • Docker version 1.13.0-rc4, build 88862e7 (Ubuntu 16.04.1 LTS, Kernel 4.4.0-31-generic)
  • Docker version 1.13.0-rc4, build 88862e7 (Ubuntu 15.10, Kernel 4.2.0-36-generic)
  • Docker version 1.12.1, build 23cf638 (Ubuntu 15.10, Kernel 4.2.0-36-generic)

Steps to reproduce:

  1. Make sure the kernel can overcommit memory freely, so that we actually hit OOM:
$ echo 1 > /proc/sys/vm/overcommit_memory

# it doesn't seem to matter much whether swap is on or off; with swap, the issue probability may increase
  2. Have any Swarm services running:
$ docker swarm init # I had also --advertise-addr 192.168.56.61
$ docker service create --name whoami1 emilevauge/whoami
$ docker service create --name whoami2 emilevauge/whoami
  3. Have any non-Swarm container running, to demonstrate that non-Swarm tasks don't exit on OOM:
$ docker run -it alpine sh

# open another SSH session to keep this alive
  4. Then run (this allocates all RAM and ends up in "Killed"):
$ docker run -it joonas/my-memory-hog # >/dev/null to suppress progress logging

(Another oddity: sometimes after "Killed", stdin is left attached to the original docker run process, neither Ctrl+C nor Ctrl+D works, and I have to open another SSH session; other times stdin gets closed and control returns to the shell.)

I have observed a couple of possible outcomes, occurring randomly:

Outcome a) nothing special happens

Memory hog gets oomkilled of course, and $ docker ps shows that Swarm tasks survived.

Just run the memory hog command again to observe the more interesting outcomes.

Outcome b) all Swarm tasks restart (most probable) but alpine container survives

Pretty much instantly you can see the effect of Swarm-based tasks restarting:

$ docker ps
CONTAINER ID        IMAGE                                                                                       COMMAND             CREATED              STATUS              PORTS               NAMES
eca47a5f29dd        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"           About a minute ago   Up About a minute   80/tcp              whoami1.1.vb89hlixavtrrthrzt9ewzkzd
980c65ac73e2        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"           About a minute ago   Up About a minute   80/tcp              whoami2.1.koyvg8en8fus9a60bdk6xv62u
f9c18be1d660        alpine                                                                                      "sh"                4 minutes ago        Up 4 minutes                            jolly_heisenberg

# ^-- notice above that the Swarm-based containers are newer than the "alpine" container, i.e. they got restarted

$ docker service ps whoami1
ID            NAME           IMAGE                     NODE    DESIRED STATE  CURRENT STATE           ERROR                             PORTS
vb89hlixavtr  whoami1.1      emilevauge/whoami:latest  master  Running        Running 3 minutes ago                
lbu78bmju9hf   \_ whoami1.1  emilevauge/whoami:latest  master  Shutdown       Failed 3 minutes ago    "starting container failed: gr…"
0cgrmsfzf9qm   \_ whoami1.1  emilevauge/whoami:latest  master  Shutdown       Shutdown 3 minutes ago

Logs (09:30:25 was when the kernel OOM'd):

Jan 06 09:29:30 master dockerd[2321]: time="2017-01-06T09:29:30.383838586Z" level=error msg="agent: session failed" error="rpc error: code = 4 desc = context deadline exceeded" module="node/agent"
Jan 06 09:29:38 master dockerd[2321]: time="2017-01-06T09:29:38.365619018Z" level=info msg="libcontainerd: new containerd process, pid: 3297"
Jan 06 09:29:38 master dockerd[2321]: time="2017-01-06T09:29:38.509523623Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Jan 06 09:29:38 master dockerd[2321]: time="2017-01-06T09:29:38.784627884Z" level=error msg="agent: session failed" error="session initiation timed out" module="node/agent"
Jan 06 09:29:41 master dockerd[2321]: time="2017-01-06T09:29:40.833939658Z" level=info msg="libcontainerd: new containerd process, pid: 3302"
Jan 06 09:29:43 master dockerd[2321]: time="2017-01-06T09:29:43.334178522Z" level=info msg="libcontainerd: new containerd process, pid: 3307"
Jan 06 09:29:45 master dockerd[2321]: time="2017-01-06T09:29:45.292822822Z" level=error msg="agent: session failed" error="session initiation timed out" module="node/agent"
Jan 06 09:29:47 master dockerd[2321]: time="2017-01-06T09:29:46.524736489Z" level=info msg="libcontainerd: new containerd process, pid: 3312"
Jan 06 09:29:50 master dockerd[2321]: time="2017-01-06T09:29:50.390015749Z" level=info msg="libcontainerd: new containerd process, pid: 3318"
Jan 06 09:29:52 master dockerd[2321]: time="2017-01-06T09:29:52.521972655Z" level=info msg="libcontainerd: new containerd process, pid: 3323"
Jan 06 09:29:52 master dockerd[2321]: time="2017-01-06T09:29:52.475072094Z" level=error msg="agent: session failed" error="session initiation timed out" module="node/agent"
Jan 06 09:29:57 master dockerd[2321]: time="2017-01-06T09:29:56.619997292Z" level=info msg="libcontainerd: new containerd process, pid: 3328"
Jan 06 09:30:03 master dockerd[2321]: time="2017-01-06T09:30:00.763142215Z" level=error msg="agent: session failed" error="session initiation timed out" module="node/agent"
Jan 06 09:30:07 master dockerd[2321]: time="2017-01-06T09:30:05.336964258Z" level=info msg="libcontainerd: new containerd process, pid: 3333"
Jan 06 09:30:16 master dockerd[2321]: time="2017-01-06T09:30:13.686260200Z" level=error msg="agent: session failed" error="session initiation timed out" module="node/agent"
Jan 06 09:30:25 master kernel: dockerd invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=-500
.. snip ..
Jan 06 09:30:25 master kernel: Out of memory: Kill process 3288 (php) score 939 or sacrifice child
Jan 06 09:30:25 master kernel: Killed process 3288 (php) total-vm:3405544kB, anon-rss:2923288kB, file-rss:0kB
.. above php process was my memory hog ..
Jan 06 09:30:25 master dockerd[2321]: time="2017-01-06T09:30:22.718651495Z" level=info msg="libcontainerd: new containerd process, pid: 3338"
Jan 06 09:30:25 master dockerd[2321]: time="2017-01-06T09:30:25.673212148Z" level=info msg="libcontainerd: new containerd process, pid: 3345"
Jan 06 09:30:25 master dockerd[2321]: time="2017-01-06T09:30:25.815682595Z" level=error msg="containerd: deleting container" error="exit status 1: \"container 7095e54c3c29e06eb738af715c90062c1c3f66318e5f3c0f4a5df6fb64fddd57 does not exist\\none or more of the container deletions failed\\n\""
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:25.999762563Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/229d6a6b57c5472a4be7fdc2da5a287416fe0f43a8d58c976f2b2a7ed06584cd-init: exec: \"auplink\": executable file not found in $PATH"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.055335129Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/f33cee7f7187d53582c98a3981ce883c744938398ac880020e6d10620172d81f-init: exec: \"auplink\": executable file not found in $PATH"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.077351364Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/229d6a6b57c5472a4be7fdc2da5a287416fe0f43a8d58c976f2b2a7ed06584cd: exec: \"auplink\": executable file not found in $PATH"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.100488064Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/f33cee7f7187d53582c98a3981ce883c744938398ac880020e6d10620172d81f: exec: \"auplink\": executable file not found in $PATH"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.309512027Z" level=warning msg="Your kernel does not support swap memory limit."
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.310093274Z" level=warning msg="Your kernel does not support cgroup rt period"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.310172107Z" level=warning msg="Your kernel does not support cgroup rt runtime"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.338330156Z" level=error msg="stream copy error: reading from a closed fifo\ngithub.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read\n\t/go/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:472\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:380\nio.Copy\n\t/usr/local/go/src/io/io.go:360\ngithub.com/docker/docker/pkg/pools.Copy\n\t/go/src/github.com/docker/docker/pkg/pools/pools.go:60\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/go/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2086"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.339767897Z" level=error msg="Create container failed with error: grpc: the connection is unavailable"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.338353634Z" level=error msg="stream copy error: reading from a closed fifo\ngithub.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read\n\t/go/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:472\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:380\nio.Copy\n\t/usr/local/go/src/io/io.go:360\ngithub.com/docker/docker/pkg/pools.Copy\n\t/go/src/github.com/docker/docker/pkg/pools/pools.go:60\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/go/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2086"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.350374936Z" level=error msg="Could not delete service state for endpoint whoami1.1.lbu78bmju9hf6veb0gihny6lk from cluster: cannot delete entry as the entry in table endpoint_table with network id ewhrmtv7gonbxjywv5lia22rl and key c8e1c7fb605f34adcc0a80200e996d9b0a68ed19c9a94a668370b7b354372d1d does not exist"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.476976774Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/229d6a6b57c5472a4be7fdc2da5a287416fe0f43a8d58c976f2b2a7ed06584cd: exec: \"auplink\": executable file not found in $PATH"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.494986682Z" level=error msg="fatal task error" error="starting container failed: grpc: the connection is unavailable" module="node/agent/taskmanager" task.id=lbu78bmju9hf6veb0gihny6lk
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.827461779Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/61a8624ffc0b413584b2de775caf1df4e9a4df3359663bea78895896219c9285-init: exec: \"auplink\": executable file not found in $PATH"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26.861682835Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/61a8624ffc0b413584b2de775caf1df4e9a4df3359663bea78895896219c9285: exec: \"auplink\": executable file not found in $PATH"
Jan 06 09:30:26 master dockerd[2321]: time="2017-01-06T09:30:26Z" level=info msg="Firewalld running: false"
Jan 06 09:30:27 master dockerd[2321]: time="2017-01-06T09:30:27.260784060Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/59d4e04d8eea43143d07d41c2af550e59c76b193d0c8d5c4512d645eb3d4281d: exec: \"auplink\": executable file not found in $PATH"
Jan 06 09:30:28 master dockerd[2321]: time="2017-01-06T09:30:28.881129567Z" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jan 06 09:30:28 master dockerd[2321]: time="2017-01-06T09:30:28.961591911Z" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jan 06 09:30:28 master dockerd[2321]: time="2017-01-06T09:30:28.982868819Z" level=error msg="containerd: deleting container" error="exit status 1: \"container 5e35175795ba7a5d5b04ef15cdd96a67c1ecb20e2467750e2e2e11f2a3abe10f does not exist\\none or more of the container deletions failed\\n\""
Jan 06 09:30:29 master dockerd[2321]: time="2017-01-06T09:30:29.059754475Z" level=error msg="containerd: deleting container" error="exit status 1: \"container 5c5971441b83443007e39042f1bde8647927a3d6e8755051a4e09a30fe06607a does not exist\\none or more of the container deletions failed\\n\""
Jan 06 09:30:29 master dockerd[2321]: time="2017-01-06T09:30:29.228747986Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/af70a61ad0924d6824cd4a830f733f9ba10c2826c5d3469ec8506216e1935cc3: exec: \"auplink\": executable file not found in $PATH"
Jan 06 09:30:29 master dockerd[2321]: time="2017-01-06T09:30:29.517846505Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/7d8b3ddd40c9c5d7e81e0c4ab0eab69ee4d272a28310e47efc001b80fffeb18d: exec: \"auplink\": executable file not found in $PATH"
Jan 06 09:30:30 master dockerd[2321]: time="2017-01-06T09:30:30.771296279Z" level=warning msg="Your kernel does not support swap memory limit."
Jan 06 09:30:30 master dockerd[2321]: time="2017-01-06T09:30:30.771692177Z" level=warning msg="Your kernel does not support cgroup rt period"
Jan 06 09:30:30 master dockerd[2321]: time="2017-01-06T09:30:30.771947156Z" level=warning msg="Your kernel does not support cgroup rt runtime"
Jan 06 09:30:31 master dockerd[2321]: time="2017-01-06T09:30:31.920910384Z" level=warning msg="Your kernel does not support swap memory limit."
Jan 06 09:30:31 master dockerd[2321]: time="2017-01-06T09:30:31.921021442Z" level=warning msg="Your kernel does not support cgroup rt period"
Jan 06 09:30:31 master dockerd[2321]: time="2017-01-06T09:30:31.921034281Z" level=warning msg="Your kernel does not support cgroup rt runtime"
Jan 06 09:30:32 master dockerd[2321]: time="2017-01-06T09:30:32Z" level=info msg="Firewalld running: false"
Jan 06 09:30:32 master dockerd[2321]: time="2017-01-06T09:30:32Z" level=info msg="Firewalld running: false"

Outcome c) memory hog killed, but still listed as "running"

Memory hog got "Killed" but still listed as running:

$ docker ps
CONTAINER ID        IMAGE                                                                                       COMMAND                  CREATED             STATUS              PORTS               NAMES
3e0f9104ada8        joonas/my-memory-hog                                                                        "/bin/sh -c 'php s..."   5 minutes ago       Up 5 minutes                            eager_varahamihira
337805c6cebb        alpine                                                                                      "sh"                     9 minutes ago       Up 9 minutes                            elegant_chandrasekhar
441ed67348c7        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                9 minutes ago       Up 9 minutes        80/tcp              whoami2.1.m75hq7dgze146rpzsfdhjiwbb
a12c54def820        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                9 minutes ago       Up 9 minutes        80/tcp              whoami1.1.8f7v9c6d43uuh0d6fvwpg754e

$ docker exec -it 3e0f9104ada8 sh
rpc error: code = 2 desc = containerd: container not found

Logs:

Jan 06 10:01:55 master dockerd[2320]: time="2017-01-06T10:01:54.914143452Z" level=info msg="libcontainerd: new containerd process, pid: 3328"
Jan 06 10:01:55 master dockerd[2320]: time="2017-01-06T10:01:55.011557668Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Jan 06 10:02:02 master kernel: dockerd invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=-500
Jan 06 10:02:02 master kernel: Out of memory: Kill process 3308 (php) score 936 or sacrifice child
Jan 06 10:02:02 master kernel: Killed process 3308 (php) total-vm:3395924kB, anon-rss:2911252kB, file-rss:0kB
Jan 06 10:02:02 master dockerd[2320]: time="2017-01-06T10:02:02.521181015Z" level=info msg="libcontainerd: new containerd process, pid: 3333"
Jan 06 10:02:02 master dockerd[2320]: time="2017-01-06T10:02:02.581465781Z" level=warning msg="failed to retrieve containerd version: rpc error: code = 14 desc = grpc: the connection is unavailable"
Jan 06 10:02:02 master dockerd[2320]: time="2017-01-06T10:02:02.777440541Z" level=error msg="containerd: deleting container" error="exit status 1: \"container 3e0f9104ada8020a566e7fd0f3220703c3be25596d459929c2d56c872d087742 does not exist\\none or more of the container deletions failed\\n\""
Jan 06 10:02:03 master dockerd[2320]: time="2017-01-06T10:02:03.228024702Z" level=info msg="libcontainerd: new containerd process, pid: 3366"
Jan 06 10:02:04 master dockerd[2320]: time="2017-01-06T10:02:04.754699573Z" level=info msg="libcontainerd: new containerd process, pid: 3374"
Jan 06 10:02:06 master dockerd[2320]: time="2017-01-06T10:02:06.227803910Z" level=info msg="libcontainerd: new containerd process, pid: 3382"
Jan 06 10:08:04 master dockerd[2320]: time="2017-01-06T10:08:04.671547242Z" level=error msg="Error running exec in container: rpc error: code = 2 desc = containerd: container not found"
Jan 06 10:08:04 master dockerd[2320]: time="2017-01-06T10:08:04.673423649Z" level=error msg="Handler for POST /v1.25/exec/5a8279fe74b5b32b246bdd830f9f4016d0ef4561fb4be07820bf10d4a199440f/resize returned error: rpc error: code = 2 desc = containerd: container not found"

Outcome d) Swarm tasks restarted AND the wrong number of replicas

The running container list should be one alpine container and two whoami containers, but:

$ docker ps
CONTAINER ID        IMAGE                                                                                       COMMAND                  CREATED              STATUS              PORTS               NAMES
b3b122ec43d4        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                About a minute ago   Up About a minute   80/tcp              whoami1.1.x245sakfoct0gb0666rs62u13
1961db38cab9        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                About a minute ago   Up About a minute   80/tcp              whoami2.1.8f1ru8th4bz48j9ci8gtj1pxb
2509f8c1cb64        joonas/my-memory-hog                                                                        "/bin/sh -c 'php s..."   2 minutes ago        Up 2 minutes                            serene_franklin
a4a18ed92e8a        alpine                                                                                      "sh"                     4 minutes ago        Up 4 minutes                            relaxed_ritchie
0d9fdd1c1488        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                4 minutes ago        Up 4 minutes        80/tcp              whoami2.1.m77p4asz4h2m03x79k0oaqj9w

=>

  • One whoami container too many.
  • Memory hog still marked as running, but docker exec -it 2509f8c1cb64 sh => rpc error: code = 2 desc = containerd: container not found.

Logs:

Jan 06 10:51:33 master dockerd[2317]: time="2017-01-06T10:51:28.655021857Z" level=error msg="agent: session failed" error="rpc error: code = 4 desc = context deadline exceeded" module="node/agent"
Jan 06 10:51:38 master dockerd[2317]: time="2017-01-06T10:51:36.178713681Z" level=info msg="libcontainerd: new containerd process, pid: 3409"
Jan 06 10:51:40 master dockerd[2317]: time="2017-01-06T10:51:36.181792704Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Jan 06 10:51:48 master dockerd[2317]: time="2017-01-06T10:51:47.819068355Z" level=info msg="libcontainerd: new containerd process, pid: 3414"
Jan 06 10:51:57 master kernel: php invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
.. snip ..
Jan 06 10:51:57 master kernel: Out of memory: Kill process 3399 (php) score 925 or sacrifice child
Jan 06 10:51:57 master kernel: Killed process 3399 (php) total-vm:2873064kB, anon-rss:2846304kB, file-rss:0kB
Jan 06 10:51:57 master dockerd[2317]: time="2017-01-06T10:51:57.531048160Z" level=info msg="libcontainerd: new containerd process, pid: 3419"
Jan 06 10:51:57 master dockerd[2317]: time="2017-01-06T10:51:57.611194228Z" level=warning msg="failed to retrieve containerd version: rpc error: code = 14 desc = grpc: the connection is unavailable"
Jan 06 10:51:57 master dockerd[2317]: time="2017-01-06T10:51:57.71671837Z" level=error msg="containerd: deleting container" error="exit status 1: \"container 2509f8c1cb641d31be3d843f71f9561ddadca73f6e3df221a7f53f2981d54230 does not exist\\none or more of the container deletions failed\\n\""
Jan 06 10:51:58 master kernel: aufs au_opts_verify:1597:dockerd[2569]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:51:58 master dockerd[2317]: time="2017-01-06T10:51:58.181038741Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/9211b7e3ee30c4459258332d9c345922b12dae5f52e5537f804f46b5bbcb89c2-init: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:51:58 master kernel: aufs au_opts_verify:1597:dockerd[3042]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:51:58 master dockerd[2317]: time="2017-01-06T10:51:58.209125593Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/b0398bbb9d4f847667ec9065b6ddf0276b33915127a0731f47114c61142edf67-init: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:51:58 master kernel: aufs au_opts_verify:1597:dockerd[2569]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:51:58 master dockerd[2317]: time="2017-01-06T10:51:58.231281260Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/9211b7e3ee30c4459258332d9c345922b12dae5f52e5537f804f46b5bbcb89c2: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:51:58 master dockerd[2317]: time="2017-01-06T10:51:58.248888533Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/b0398bbb9d4f847667ec9065b6ddf0276b33915127a0731f47114c61142edf67: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:51:58 master kernel: aufs au_opts_verify:1597:dockerd[3042]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:51:58 master dockerd[2317]: time="2017-01-06T10:51:58.421582459Z" level=info msg="libcontainerd: new containerd process, pid: 3452"
Jan 06 10:51:59 master kernel: IPVS: __ip_vs_del_service: enter
Jan 06 10:51:59 master dockerd[2317]: time="2017-01-06T10:51:59Z" level=info msg="Firewalld running: false"
Jan 06 10:51:59 master dockerd[2317]: time="2017-01-06T10:51:59.921715871Z" level=info msg="libcontainerd: new containerd process, pid: 3473"
Jan 06 10:52:01 master dockerd[2317]: time="2017-01-06T10:52:01.003368348Z" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jan 06 10:52:01 master dockerd[2317]: time="2017-01-06T10:52:01.003998774Z" level=error msg="fatal task error" error="Cannot stop container whoami2.1.m77p4asz4h2m03x79k0oaqj9w: Cannot kill container 0d9fdd1c1488ac3f99220c8f4aff4b437b90a9f46ec43125976f25c4c6e3852e: rpc error: code = 14 desc = grpc: the connection is unavailable" module="node/agent/taskmanager" task.id=m77p4asz4h2m03x79k0oaqj9w
Jan 06 10:52:01 master dockerd[2317]: time="2017-01-06T10:52:01.212696972Z" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jan 06 10:52:01 master dockerd[2317]: time="2017-01-06T10:52:01.212843576Z" level=error msg="fatal task error" error="Cannot stop container whoami1.1.gfe0wap68qzbu992vbh6fo47o: Cannot kill container 3592deaa8455a24cf55242689c14fa23e9d913dbc6289076db9d8767afaf6c5a: rpc error: code = 14 desc = grpc: the connection is unavailable" module="node/agent/taskmanager" task.id=gfe0wap68qzbu992vbh6fo47o
Jan 06 10:52:01 master dockerd[2317]: time="2017-01-06T10:52:01.430232819Z" level=info msg="libcontainerd: new containerd process, pid: 3481"
Jan 06 10:52:02 master kernel: aufs au_opts_verify:1597:dockerd[3248]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:02 master kernel: aufs au_opts_verify:1597:dockerd[2317]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:02 master kernel: device veth18aece1 entered promiscuous mode
Jan 06 10:52:02 master kernel: IPv6: ADDRCONF(NETDEV_UP): veth18aece1: link is not ready
Jan 06 10:52:02 master systemd-udevd[3491]: Could not generate persistent MAC address for veth18aece1: No such file or directory
Jan 06 10:52:02 master systemd-udevd[3490]: Could not generate persistent MAC address for vethdceac5b: No such file or directory
Jan 06 10:52:02 master systemd-udevd[3497]: Could not generate persistent MAC address for veth494309f: No such file or directory
Jan 06 10:52:02 master systemd-udevd[3496]: Could not generate persistent MAC address for veth1dc16b3: No such file or directory
Jan 06 10:52:02 master kernel: veth3: renamed from veth494309f
Jan 06 10:52:02 master kernel: device veth3 entered promiscuous mode
Jan 06 10:52:02 master kernel: IPv6: ADDRCONF(NETDEV_UP): veth3: link is not ready
Jan 06 10:52:02 master kernel: br0: port 3(veth3) entered forwarding state
Jan 06 10:52:02 master kernel: br0: port 3(veth3) entered forwarding state
Jan 06 10:52:02 master kernel: device veth633e688 entered promiscuous mode
Jan 06 10:52:02 master kernel: IPv6: ADDRCONF(NETDEV_UP): veth633e688: link is not ready
Jan 06 10:52:02 master kernel: docker_gwbridge: port 3(veth633e688) entered forwarding state
Jan 06 10:52:02 master kernel: docker_gwbridge: port 3(veth633e688) entered forwarding state
Jan 06 10:52:02 master systemd-udevd[3532]: Could not generate persistent MAC address for veth2eb3523: No such file or directory
Jan 06 10:52:02 master systemd-udevd[3533]: Could not generate persistent MAC address for veth633e688: No such file or directory
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.910166492Z" level=warning msg="Your kernel does not support swap memory limit."
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.910715804Z" level=warning msg="Your kernel does not support cgroup rt period"
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.911050800Z" level=warning msg="Your kernel does not support cgroup rt runtime"
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.921929868Z" level=info msg="libcontainerd: new containerd process, pid: 3551"
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.924703978Z" level=error msg="Create container failed with error: grpc: the connection is unavailable"
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.925370341Z" level=error msg="stream copy error: reading from a closed fifo\ngithub.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read\n\t/go/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:472\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:380\nio.Copy\n\t/usr/local/go/src/io/io.go:360\ngithub.com/docker/docker/pkg/pools.Copy\n\t/go/src/github.com/docker/docker/pkg/pools/pools.go:60\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/go/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2086"
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.925444314Z" level=error msg="stream copy error: reading from a closed fifo\ngithub.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read\n\t/go/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:472\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:380\nio.Copy\n\t/usr/local/go/src/io/io.go:360\ngithub.com/docker/docker/pkg/pools.Copy\n\t/go/src/github.com/docker/docker/pkg/pools/pools.go:60\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/go/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2086"
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.935313592Z" level=warning msg="Your kernel does not support swap memory limit."
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.935381612Z" level=warning msg="Your kernel does not support cgroup rt period"
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.935393677Z" level=warning msg="Your kernel does not support cgroup rt runtime"
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.942180702Z" level=error msg="stream copy error: reading from a closed fifo\ngithub.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read\n\t/go/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:472\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:380\nio.Copy\n\t/usr/local/go/src/io/io.go:360\ngithub.com/docker/docker/pkg/pools.Copy\n\t/go/src/github.com/docker/docker/pkg/pools/pools.go:60\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/go/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2086"
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.943748666Z" level=error msg="stream copy error: reading from a closed fifo\ngithub.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read\n\t/go/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142\nbufio.(*Reader).fill\n\t/usr/local/go/src/bufio/bufio.go:97\nbufio.(*Reader).WriteTo\n\t/usr/local/go/src/bufio/bufio.go:472\nio.copyBuffer\n\t/usr/local/go/src/io/io.go:380\nio.Copy\n\t/usr/local/go/src/io/io.go:360\ngithub.com/docker/docker/pkg/pools.Copy\n\t/go/src/github.com/docker/docker/pkg/pools/pools.go:60\ngithub.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1\n\t/go/src/github.com/docker/docker/container/stream/streams.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:2086"
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.943799839Z" level=error msg="Create container failed with error: grpc: the connection is unavailable"
Jan 06 10:52:02 master kernel: docker0: port 4(veth18aece1) entered disabled state
Jan 06 10:52:02 master kernel: device veth18aece1 left promiscuous mode
Jan 06 10:52:02 master kernel: docker0: port 4(veth18aece1) entered disabled state
Jan 06 10:52:02 master dockerd[2317]: time="2017-01-06T10:52:02.953653410Z" level=error msg="Could not delete service state for endpoint whoami1.1.k835pmog8zygngm0d4lr47dt7 from cluster: cannot delete entry as the entry in table endpoint_table with network id ok72tkrsi0ujuoym5acj2vsm6 and key 6380bc98ceccc5cb16d226d9a176ca188bebbe95ece8cb2004b7a02c49f26532 does not exist"
Jan 06 10:52:02 master kernel: docker_gwbridge: port 3(veth633e688) entered disabled state
Jan 06 10:52:02 master kernel: device veth633e688 left promiscuous mode
Jan 06 10:52:02 master kernel: docker_gwbridge: port 3(veth633e688) entered disabled state
Jan 06 10:52:03 master dockerd[2317]: time="2017-01-06T10:52:02.999895226Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/9211b7e3ee30c4459258332d9c345922b12dae5f52e5537f804f46b5bbcb89c2: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:03 master dockerd[2317]: time="2017-01-06T10:52:03.013583622Z" level=error msg="fatal task error" error="starting container failed: grpc: the connection is unavailable" module="node/agent/taskmanager" task.id=f9x292ougy8lk0m61a444vh4l
Jan 06 10:52:03 master kernel: br0: port 3(veth3) entered disabled state
Jan 06 10:52:03 master kernel: device veth3 left promiscuous mode
Jan 06 10:52:03 master kernel: br0: port 3(veth3) entered disabled state
Jan 06 10:52:03 master dockerd[2317]: time="2017-01-06T10:52:03.084578239Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/b0398bbb9d4f847667ec9065b6ddf0276b33915127a0731f47114c61142edf67: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:03 master dockerd[2317]: time="2017-01-06T10:52:03.103431524Z" level=error msg="fatal task error" error="starting container failed: grpc: the connection is unavailable" module="node/agent/taskmanager" task.id=k835pmog8zygngm0d4lr47dt7
Jan 06 10:52:03 master kernel: aufs au_opts_verify:1597:dockerd[3046]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:03 master dockerd[2317]: time="2017-01-06T10:52:03.304273463Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/1b02b1d207beb01eaa5199502b76c79b0f1ba18c71ecc89c9a6df1f294b4cc76-init: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:03 master kernel: aufs au_opts_verify:1597:dockerd[2881]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:03 master dockerd[2317]: time="2017-01-06T10:52:03.325366170Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/cb617162b4df614a4849abda254cb22686d453be29138ed1119b22a335cb8886-init: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:03 master dockerd[2317]: time="2017-01-06T10:52:03.342564413Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/1b02b1d207beb01eaa5199502b76c79b0f1ba18c71ecc89c9a6df1f294b4cc76: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:03 master kernel: aufs au_opts_verify:1597:dockerd[3046]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:03 master dockerd[2317]: time="2017-01-06T10:52:03.368127248Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/cb617162b4df614a4849abda254cb22686d453be29138ed1119b22a335cb8886: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:03 master kernel: aufs au_opts_verify:1597:dockerd[2317]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:04 master dockerd[2317]: time="2017-01-06T10:52:04.428161566Z" level=info msg="libcontainerd: new containerd process, pid: 3588"
Jan 06 10:52:05 master dockerd[2317]: time="2017-01-06T10:52:05.921402940Z" level=info msg="libcontainerd: new containerd process, pid: 3595"
Jan 06 10:52:07 master dockerd[2317]: time="2017-01-06T10:52:07.445130322Z" level=info msg="libcontainerd: new containerd process, pid: 3602"
Jan 06 10:52:08 master kernel: aufs au_opts_verify:1597:dockerd[2881]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:08 master kernel: device vethaf09675 entered promiscuous mode
Jan 06 10:52:08 master kernel: IPv6: ADDRCONF(NETDEV_UP): vethaf09675: link is not ready
Jan 06 10:52:08 master systemd-udevd[3611]: Could not generate persistent MAC address for veth06cccf6: No such file or directory
Jan 06 10:52:08 master kernel: aufs au_opts_verify:1597:dockerd[3042]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:08 master systemd-udevd[3612]: Could not generate persistent MAC address for vethaf09675: No such file or directory
Jan 06 10:52:08 master dockerd[2317]: time="2017-01-06T10:52:08.247029901Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/1b02b1d207beb01eaa5199502b76c79b0f1ba18c71ecc89c9a6df1f294b4cc76: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:08 master dockerd[2317]: time="2017-01-06T10:52:08.264974480Z" level=error msg="fatal task error" error="starting container failed: Address already in use" module="node/agent/taskmanager" task.id=f9wh93fyk9dcb7pk7or9nhqau
Jan 06 10:52:08 master dockerd[2317]: time="2017-01-06T10:52:08.265123288Z" level=warning msg="Your kernel does not support swap memory limit."
Jan 06 10:52:08 master dockerd[2317]: time="2017-01-06T10:52:08.265174203Z" level=warning msg="Your kernel does not support cgroup rt period"
Jan 06 10:52:08 master dockerd[2317]: time="2017-01-06T10:52:08.265185542Z" level=warning msg="Your kernel does not support cgroup rt runtime"
Jan 06 10:52:08 master kernel: IPVS: Creating netns size=2192 id=9
Jan 06 10:52:08 master kernel: eth0: renamed from veth06cccf6
Jan 06 10:52:08 master kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethaf09675: link becomes ready
Jan 06 10:52:08 master kernel: docker0: port 4(vethaf09675) entered forwarding state
Jan 06 10:52:08 master kernel: docker0: port 4(vethaf09675) entered forwarding state
Jan 06 10:52:08 master kernel: aufs au_opts_verify:1597:dockerd[2950]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:08 master dockerd[2317]: time="2017-01-06T10:52:08.481246477Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/867989a4338d0050631a3cf885f0e27edbb87926220ad46f4e72c6b90ea4ee5b-init: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:08 master kernel: aufs au_opts_verify:1597:dockerd[2950]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:08 master dockerd[2317]: time="2017-01-06T10:52:08.515322323Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/867989a4338d0050631a3cf885f0e27edbb87926220ad46f4e72c6b90ea4ee5b: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:13 master kernel: aufs au_opts_verify:1597:dockerd[2317]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:13 master dockerd[2317]: time="2017-01-06T10:52:13.450403247Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/867989a4338d0050631a3cf885f0e27edbb87926220ad46f4e72c6b90ea4ee5b: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:13 master dockerd[2317]: time="2017-01-06T10:52:13.470237429Z" level=error msg="fatal task error" error="starting container failed: Address already in use" module="node/agent/taskmanager" task.id=7j38kwipsr1nqmojmsn5tclmt
Jan 06 10:52:13 master kernel: aufs au_opts_verify:1597:dockerd[2340]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:13 master dockerd[2317]: time="2017-01-06T10:52:13.805155797Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/c385e5b5e1b545fc3d733c42479f40eb076e33a5141b059a395de3b007825cf8-init: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:13 master kernel: aufs au_opts_verify:1597:dockerd[2340]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:13 master dockerd[2317]: time="2017-01-06T10:52:13.825701355Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/c385e5b5e1b545fc3d733c42479f40eb076e33a5141b059a395de3b007825cf8: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:18 master kernel: aufs au_opts_verify:1597:dockerd[2950]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:18 master dockerd[2317]: time="2017-01-06T10:52:18.801947457Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/c385e5b5e1b545fc3d733c42479f40eb076e33a5141b059a395de3b007825cf8: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:18 master dockerd[2317]: time="2017-01-06T10:52:18.815524866Z" level=error msg="fatal task error" error="starting container failed: Address already in use" module="node/agent/taskmanager" task.id=3krcdg9fdi32b5oq3t960cbpq
Jan 06 10:52:18 master dockerd[2317]: time="2017-01-06T10:52:18.909893679Z" level=warning msg="underweighting node p30oex9lu1xtwubx1r7aw3yup for service ibe2hx6qn8cm3okv0ywxdy6cy because it experienced 5 failures or rejections within 5m0s"
Jan 06 10:52:19 master kernel: aufs au_opts_verify:1597:dockerd[2317]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:19 master dockerd[2317]: time="2017-01-06T10:52:19.085938737Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/4443390b640ed9c00db6f3bfdec6427f47925e6af582563983f9f069bec37ad9-init: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:19 master dockerd[2317]: time="2017-01-06T10:52:19.114821042Z" level=warning msg="Couldn't run auplink before unmount /var/lib/docker/aufs/mnt/4443390b640ed9c00db6f3bfdec6427f47925e6af582563983f9f069bec37ad9: exec: \"auplink\": executable file not found in $PATH"
Jan 06 10:52:19 master kernel: aufs au_opts_verify:1597:dockerd[2317]: dirperm1 breaks the protection by the permission bits on the lower branch
Jan 06 10:52:19 master dockerd[2317]: time="2017-01-06T10:52:19.747569499Z" level=error msg="containerd: deleting container" error="exit status 1: \"container 3592deaa8455a24cf55242689c14fa23e9d913dbc6289076db9d8767afaf6c5a does not exist\\none or more of the container deletions failed\\n\""

Misc details

Docker info:

Containers: 7
 Running: 3
 Paused: 0
 Stopped: 4
Images: 3
Server Version: 1.13.0-rc4
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 19
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: xvo3jj9kh3bjk6g04y82d9szi
 Is Manager: true
 ClusterID: js0kgxxjbda8jk69g3cw06tix
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 192.168.56.61
 Manager Addresses:
  192.168.56.61:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 51371867a01c467f08af739783b8beafc154c4d7
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-31-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 3
Total Memory: 2.938 GiB
Name: master
ID: 362A:QIEY:OSPQ:ZX3U:R4G6:S5MT:IZAO:VAZU:RHQB:UCVS:76BB:CXLU
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

My environment: Win10 host -> Vagrant -> VirtualBox -> Ubuntu VM.

Keywords for search:

  • Out of memory
  • OOM
  • Swarm tasks
  • oomkiller
  • Task restart
  • Container restart
  • Wrong amount of replicas

thaJeztah commented Jan 6, 2017

All Swarm-based tasks restarting for (apparently) no reason, but curiously does not restart non-Swarm containers.

This looks to be what should happen; an OOM kill is done by the kernel. Swarm is meant to make sure that your services / tasks stay "up". If a task was killed by the kernel, Swarm will create a new task to replace it.

Outcome b) all Swarm tasks restart (most probable) but alpine container survives

Anything is possible if your system is OOM. The kernel will start killing off random processes; in this case it decided not to kill the alpine container.

There's not really anything that Docker can do about that; by default, a negative oom-score is set on the daemon (a894aec) to make it less likely (but not impossible) to be killed by the kernel if the system is OOM.
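
For what it's worth, a quick way to verify the daemon's OOM score adjustment (a minimal sketch, assuming a single dockerd process; the -500 matches the kernel log lines above):

$ cat /proc/$(pidof dockerd)/oom_score_adj
-500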

Outcome c) memory hog killed, but still listed as "running"

This sounds like a bug; however, strange things can happen if a machine completely runs out of memory <-- @mlaventure

Outcome d) Swarm tasks restarted AND with inappropriate amount of replicas

Can you check:

  • whether the extra replica is actually running, or if it's also incorrectly reported as running (like outcome C)?
  • what options are set on that service? Does it have a restart policy set? (One way to check is sketched below.)
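
A sketch of checking the restart policy via docker service inspect's Go template support (whoami1 is the service from the reproduction steps):

$ docker service inspect --format '{{json .Spec.TaskTemplate.RestartPolicy}}' whoami1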

Note that Swarm will stop monitoring a container if it's stopped/restarted out-of-band (e.g. if I manually docker restart <some-task>, the container is no longer managed by Swarm, and a new task will be started to replace it).

In this case, it's possible that Swarm was temporarily unable to monitor the container (due to the OOM), or that another process (such as docker-containerd-shim) was killed by the kernel. If Swarm no longer considers the container "managed", you'll likely see that docker service scale whoami1 will stop one of these containers, but not the other one. (Also see #23704.)

The reason for not forcibly stopping the container is that exited tasks are kept around so that you can inspect/analyze them (e.g. to find out why they exited), which may involve starting them.
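
A hypothetical illustration of that out-of-band behavior (the task container name is an example taken from the output earlier in this issue):

$ docker restart whoami1.1.vb89hlixavtrrthrzt9ewzkzd
$ docker service ps whoami1   # should show a new task replacing the restarted one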

In general it is recommended to:

  • Always restrict resources (cpu, memory) on containers and services. Without limits, a container can consume unlimited resources, which can bring the host "down" (a minimal example follows this list).
  • When using Swarm services, don't put load on the manager. If a manager runs out of resources, you can lose control of the Swarm.
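
A minimal sketch of the first recommendation, for both a plain container and a service (the limit values are arbitrary examples):

$ docker run -it --memory 256m alpine sh
$ docker service create --name whoami1 --limit-memory 256m --limit-cpu 0.5 emilevauge/whoami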

ping @aaronlehmann @mlaventure, perhaps you can have a look to see whether there are things that can be improved around this?


joonas-fi commented Jan 6, 2017

This looks to be what should happen; an OOM kill is done by the kernel. Swarm is meant to make sure that your services / tasks stay "up". If a task was killed by the kernel, Swarm will create a new task to replace it.

Swarm tasks were never killed by the kernel. The oom-killer only ever killed the most offending process: the container from image joonas/my-memory-hog (started interactively, not via Swarm). But the Swarm tasks were restarted anyway (why restart them if they kept running smoothly? => unexpected). Non-Swarm containers were never restarted => expected.

What options are set on that service? Does it have a restart policy set?

All the commands I used were listed in the issue description - the service was created with docker service create --name <name> <image>.

or possibly another process (such as docker-containerd-shim) was killed by the kernel.

According to the logs, the oom-killer always killed the most offending process: the memory hog.
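
For the record, this is easy to confirm from the kernel log (the output line below is copied from the logs above):

$ journalctl -k | grep -i 'out of memory'
Jan 06 10:51:57 master kernel: Out of memory: Kill process 3399 (php) score 925 or sacrifice child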

"Always restrict resources (cpu, memory) on containers and services" ... "When using Swarm / services, don't put load on the manager"

Those are sound recommendations, but I feel this kind of infrastructure should be robust even under the most hostile of circumstances. And no matter what you recommend, dedicating an entire VM (or 3 for an HA setup) to the manager role will be a money issue for hobby use/small businesses, so in practice it is not always feasible to do what's recommended, only what we can afford.

Note that Swarm will stop monitoring a container if it's stopped/restarted out-of-band (e.g. if I manually docker restart, the container is no longer managed by Swarm, and a new task will be started to replace it).

I didn't touch the Swarm-based containers manually. All I used were the commands mentioned in the issue, and I repeated these tests so many times (tearing down the VM and starting fresh again) that I am confident this is not user error. These symptoms are really easy to replicate: I reproduced them by following this issue description 1:1.

Can you check if: the extra replica is actually running or if it's also incorrectly reported as running (like outcome C)?

I will check this and report back!


thaJeztah commented Jan 6, 2017

But Swarm tasks were restarted anyway (why restart them if they continued running smoothly? => unexpected).

I suspect Swarm was temporarily unable to monitor the processes, resulting in it falsely regarding the containers as not running. Perhaps @aaronlehmann has more insight into this.


joonas-fi commented Jan 6, 2017

It took me about four tries to produce a situation where the same Swarm task now has two replicas (outcome d). Here's everything I did:

$ echo 1 > /proc/sys/vm/overcommit_memory
$ docker swarm init # I had also --advertise-addr 192.168.56.61
$ docker service create --name whoami1 emilevauge/whoami
$ docker service create --name whoami2 emilevauge/whoami
$ docker service create --name whoami3 emilevauge/whoami
$ docker service create --name whoami4 emilevauge/whoami

# this I had to repeat 3-4 times to have outcome d)
$ docker run -it joonas/my-memory-hog
Killed

$ docker ps
CONTAINER ID        IMAGE                                                                                       COMMAND                  CREATED              STATUS              PORTS               NAMES
730e4bef8113        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                About a minute ago   Up 56 seconds       80/tcp              whoami3.1.x700kdacaib2daxvuutlyy0xo
973e3281fa4b        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                About a minute ago   Up 56 seconds       80/tcp              whoami4.1.nheyzafzw9lodxsiixpi1mjsq
6c6eeb5493ea        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                About a minute ago   Up 56 seconds       80/tcp              whoami2.1.5lls6hn38w684fbawk1ksigyk
fa386b4f635e        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                About a minute ago   Up 56 seconds       80/tcp              whoami1.1.01glyzhh8q7k9wn4eyqsj1odd
78b3797327b0        joonas/my-memory-hog                                                                        "/bin/sh -c 'php s..."   2 minutes ago        Up 2 minutes                            relaxed_mayer
87bdf0aa25c2        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                2 minutes ago        Up 2 minutes        80/tcp              whoami1.1.6khhwdva9fdeli0bv5i7mszks
e540988a2b9e        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                2 minutes ago        Up 2 minutes        80/tcp              whoami3.1.xvr2y12dccsvk4xkijjbiymxc
ea9548ae8fa9        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                2 minutes ago        Up 2 minutes        80/tcp              whoami4.1.q8cqnksjqglq2pmdgsfy01kc8
abeb1f180a41        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                2 minutes ago        Up 2 minutes        80/tcp              whoami2.1.xfrczg5lsdvakvbyf4m8u803l
4b7674c59996        joonas/my-memory-hog                                                                        "/bin/sh -c 'php s..."   4 minutes ago        Up 4 minutes                            zen_swartz

# Observe:
# - two containers of memory hog still "running" even though they were killed
# - two replicas of each whoami{1,2,3,4} service, each of which should have 1 replica

$ ps aux | grep whoamI
root      3892  0.0  0.0   7520   512 ?        Ssl  14:03   0:00 /whoamI
root      3900  0.0  0.0   9184   908 ?        Ssl  14:03   0:00 /whoamI
root      3911  0.0  0.0   7520   620 ?        Ssl  14:03   0:00 /whoamI
root      3920  0.0  0.0   7520   508 ?        Ssl  14:03   0:00 /whoamI
root      4407  0.0  0.1   7520  4336 ?        Ssl  14:04   0:00 /whoamI
root      4504  0.0  0.1   7520  4364 ?        Ssl  14:04   0:00 /whoamI
root      4536  0.0  0.1   7776  4632 ?        Ssl  14:04   0:00 /whoamI
root      4584  0.0  0.1   7520  4364 ?        Ssl  14:04   0:00 /whoamI

# => 8 Swarm-based container processes are *really* running on the host

$ docker ps | grep hog
78b3797327b0        joonas/my-memory-hog                                                                        "/bin/sh -c 'php s..."   9 minutes ago       Up 9 minutes                            relaxed_mayer
4b7674c59996        joonas/my-memory-hog                                                                        "/bin/sh -c 'php s..."   11 minutes ago      Up 11 minutes                           zen_swartz

$ docker exec -it relaxed_mayer sh
rpc error: code = 2 desc = containerd: container not found

$ docker exec -it zen_swartz sh
rpc error: code = 2 desc = containerd: container not found

# => each memory hog is actually dead

$ docker ps | grep whoami1
fa386b4f635e        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                6 minutes ago       Up 6 minutes        80/tcp              whoami1.1.01glyzhh8q7k9wn4eyqsj1odd
87bdf0aa25c2        emilevauge/whoami@sha256:348eb7b296538f972afc8ccb3c97d88ae444f98944d58e812b00606407396bbc   "/whoamI"                7 minutes ago       Up 7 minutes        80/tcp              whoami1.1.6khhwdva9fdeli0bv5i7mszks

$ docker service ps whoami1
ID            NAME           IMAGE                     NODE    DESIRED STATE  CURRENT STATE            ERROR                             PORTS
01glyzhh8q7k  whoami1.1      emilevauge/whoami:latest  master  Running        Running 10 minutes ago               
es4f5b8ozksy   \_ whoami1.1  emilevauge/whoami:latest  master  Shutdown       Failed 10 minutes ago    "starting container failed: gr…"
6khhwdva9fde   \_ whoami1.1  emilevauge/whoami:latest  master  Shutdown       Failed 10 minutes ago    "Cannot stop container whoami1…"
ssnciemx7b5s   \_ whoami1.1  emilevauge/whoami:latest  master  Shutdown       Shutdown 11 minutes ago              

edit: logs of above in separate gist


aaronlehmann commented Jan 6, 2017

Perhaps something to do with this?

Jan 06 14:04:38 master dockerd[2325]: time="2017-01-06T14:04:38.063162259Z" level=error msg="fatal task error" error="Cannot stop container whoami2.1.xfrczg5lsdvakvbyf4m8u803l: Cannot kill container abeb1f180a413d511a07e7bebfd50508e19f4ab06b7530429a433bba48aeab33: rpc error: code = 14 desc = grpc: the connection is unavailable" module="node/agent/taskmanager" task.id=xfrczg5lsdvakvbyf4m8u803l

mlaventure commented Jan 6, 2017

It also looks like containerd is being terminated:

Jan 06 14:04:15 master dockerd[2325]: time="2017-01-06T14:04:13.208839706Z" level=info msg="libcontainerd: new containerd process, pid: 4140"
Jan 06 14:04:17 master dockerd[2325]: time="2017-01-06T14:04:13.273558318Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"

Given that the message is "transport is closing", it looks like a SIGTERM was sent to it.
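
If it helps debugging, one way to watch whether the containerd process gets respawned (a sketch; docker-containerd is the binary name shipped with 1.12/1.13, and the PID should change across the OOM event):

$ ps -o pid,ppid,etime,cmd -C docker-containerd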


DanielBodnar commented Oct 28, 2018

@joonas-fi did you ever figure out a fix for this? I'm having an identical issue, and setting vm.swappiness=0 helps but doesn't completely resolve the issue.

Thanks!
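
For reference, the setting mentioned above can be applied at runtime and persisted across reboots via the standard sysctl interface:

$ sysctl -w vm.swappiness=0
$ echo 'vm.swappiness=0' >> /etc/sysctl.conf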


joonas-fi commented Oct 29, 2018

@DanielBodnar unfortunately nope, I didn't figure out a solution.

And I haven't tested this scenario with recent Docker versions, but it's sad to hear if this still happens, because being left with an incorrect number of replicas running sounds like a serious issue that should be addressed.

As a "fix" for this I've been stressingly obsessed with the memory use of my containers to try to avoid OOM issue in production. It's a good practice to set max memory limits per container.
