
openshift node becomes dysfunctional because of devicemapper issues with docker #14601

Closed
rajatchopra opened this issue Jun 12, 2017 · 19 comments
Labels
component/containers kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/P2


@rajatchopra
Contributor

The OpenShift node becomes dysfunctional because of devicemapper issues. This happens on a running cluster where the node was previously functioning well.

See issue: moby/moby#23089

Version

openshift v3.6.94
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

Steps To Reproduce
  1. On a large cluster (100+ nodes), create several thousand pods at once (4000)
  2. See that some nodes have trouble creating pods
Current Result

Node not ready

Origin node gives this error in the log:

 19246 kuberuntime_gc.go:123] Failed to remove container "f2d792518c47a670a633696c69b8360340d2cab70e8cc7087c510844bb5cad32": rpc error: code = 2 desc = failed to remove container "f2d792518c47a670a633696c69b8360340d2cab70e8cc7087c510844bb5cad32": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem f2d792518c47a670a633696c69b8360340d2cab70e8cc7087c510844bb5cad32: failed to remove device d6a003f88219386bd95462cfc7e78aecde7cd605fb358ff17fd2e1fa2731fa90:devicemapper: Can not set cookie: dm_task_set_cookie failed"}
Expected Result

Node should be ready and pods should run

Additional Information

Docker daemon restart gives this error:

[openshift@b10-h06-r620 ~]$ sudo systemctl restart docker
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
[openshift@b10-h06-r620 ~]$ journalctl -flu docker
-- Logs begin at Sun 2017-06-11 22:41:55 UTC. --
Jun 12 17:59:59 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: docker.service failed.
Jun 12 18:31:14 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: Starting Docker Application Container Engine...
Jun 12 18:31:14 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:14.688724739Z" level=info msg="libcontainerd: new containerd process, pid: 14832"
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:15.708582348Z" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man docker` to refer to dm.thinpooldev section."
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:15.715165962Z" level=error msg="[graphdriver] prior storage driver \"devicemapper\" failed: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie dm_task_set_cookie failed"
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:15.715998695Z" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie dm_task_set_cookie failed"
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: Failed to start Docker Application Container Engine.
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: Unit docker.service entered failed state.
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: docker.service failed.
@rajatchopra
Contributor Author

Likely core issue: moby/moby#33603

@pweil- pweil- added component/containers kind/bug Categorizes issue or PR as related to a bug. priority/P2 labels Jun 13, 2017
@jwhonce
Contributor

jwhonce commented Jun 13, 2017

@rhvgoyal FYI -- I know you are active on the upstream issue

@pravisankar

Related devicemapper issue: https://bugzilla.redhat.com/show_bug.cgi?id=1461370#c6

Pod creation failed with error:
Error response from daemon: {"message":"devmapper: Error activating devmapper device for '1b87bd5a693a2b9fef9008e122b07237d63ed69d913111063c85adf99c5c912e-init': devicemapper: Can't set cookie dm_task_set_cookie failed"}

@rajatchopra
Contributor Author

rajatchopra commented Jun 15, 2017

Two workarounds so far (both may be needed):

  • The semaphore leak soon exhausts the system's default limit of 128 semaphore arrays. Raise the limit (SEMMNI, the fourth field of kernel.sem) to 8192:
printf 'kernel.sem = 250\t32000\t32\t8192\n' | sudo tee /etc/sysctl.d/99-kernelsem.conf
sudo sysctl --system

That lets me push the issue further out into the future.

  • And if you do hit the issue anyway, clean up the system:
sudo systemctl stop origin-node
sudo systemctl stop docker
sudo dmsetup remove_all
sudo dmsetup -y udevcomplete_all
sudo systemctl start docker
sudo systemctl start origin-node
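To see how close a node is to the 128-array limit before (or after) applying the sysctl workaround, the current cap and usage can be read from /proc and `ipcs`. A minimal sketch, assuming a Linux host with util-linux installed; the kernel.sem field layout is SEMMSL SEMMNS SEMOPM SEMMNI:

```shell
#!/bin/sh
# Read the four kernel.sem fields; SEMMNI (4th) is the system-wide cap on
# semaphore arrays -- the stock value of 128 is what the leak exhausts.
read -r semmsl semmns semopm semmni < /proc/sys/kernel/sem
echo "SEMMNI (max semaphore arrays): $semmni"

# Count the arrays currently allocated; data rows of `ipcs -s` start with
# a 0x key. Each leaked docker udev cookie holds one array.
in_use=$(ipcs -s 2>/dev/null | grep -c '^0x')
echo "semaphore arrays in use: $in_use"
```

When `in_use` creeps toward `semmni` on a node that is otherwise idle, the leak is in progress and the cleanup steps above will be needed eventually.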

@rhvgoyal

I updated the original issue. We suspect that a recent commit in docker caused this cookie/semaphore leak issue.

@nhorman is looking into writing a patch for it now.

@rhvgoyal

Check if the following PR fixes the issue.

moby/moby#33732

@bboreham

Hi, I'm trying to help someone who is having a similar issue with Docker 1.12.6. The Docker issue and PR referenced above all seem to point at recent changes, but you wouldn't be using such a recent Docker with Kubernetes, would you?

@nsu700

nsu700 commented Jun 29, 2017

printf 'kernel.sem = 250\t32000\t32\t8192\n' | sudo tee /etc/sysctl.d/99-kernelsem.conf
sudo sysctl --system

This works for me, since it permanently keeps the semaphore limit at 8192. One can run `ipcs -su` to see how many semaphores are in use, but how can I tell which process is using these semaphores? Can anyone help me? Thank you.
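SysV semaphores carry no owner field, so there is no direct answer; the closest clue is the pid column of `ipcs -s -i <semid>`, which records the last process to perform an operation on each semaphore. A rough sketch, where `sem_ids` is a helper introduced here and the awk field positions are assumptions about util-linux `ipcs` output:

```shell
#!/bin/sh
# Pull semaphore ids out of `ipcs -s` output: data rows start with a 0x key
# and carry the numeric semid in column 2.
sem_ids() { awk '$1 ~ /^0x/ && $2 ~ /^[0-9]+$/ {print $2}'; }

# For each array, show the pid that last ran a semop on it (column 5 of the
# per-semaphore table printed by `ipcs -s -i`).
ipcs -s | sem_ids | while read -r id; do
    last_pid=$(ipcs -s -i "$id" | awk 'NF == 5 && $1 ~ /^[0-9]+$/ {print $5; exit}')
    printf 'semid %s last touched by pid %s (%s)\n' \
        "$id" "$last_pid" "$(ps -o comm= -p "$last_pid" 2>/dev/null || echo unknown)"
done
```

On a node hit by this bug, the last-pid column would be expected to point at dockerd, since the leaked cookies come from its devicemapper driver.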

@rhvgoyal

We have backported the fix into projectatomic/docker as well, so please take the latest docker build from your source; it should include the fix.

projectatomic/docker#256
projectatomic/docker#255
projectatomic/docker#254

@xThomo

xThomo commented Jul 3, 2017

We are using docker-1.12.6-28.git1398f24.el7.centos.x86_64.rpm and saw the following errors after a docker version upgrade.

Jun 29 18:49:13 openshift-master-01 systemd[1]: Starting Docker Application Container Engine...
Jun 29 18:49:13 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:13.924855449Z" level=info msg="libcontainerd: new containerd process, pid: 124068"
Jun 29 18:49:14 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:14.946117449Z" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use --storage-opt dm.thinpooldev or use man docker to re...hinpooldev section."
Jun 29 18:49:14 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:14.946590822Z" level=error msg="[graphdriver] prior storage driver "devicemapper" failed: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set co...k_set_cookie failed"
Jun 29 18:49:14 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:14.946758304Z" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie ...k_set_cookie failed"
Jun 29 18:49:14 openshift-master-01 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jun 29 18:49:14 openshift-master-01 systemd[1]: Failed to start Docker Application Container Engine.
Jun 29 18:49:14 openshift-master-01 systemd[1]: Unit docker.service entered failed state.
Jun 29 18:49:14 openshift-master-01 systemd[1]: docker.service failed.

It looks like this is fixed in the upcoming docker-1.12.6-32.git88a4867.el7 version.

An alternative, according to the Bugzilla report, is to downgrade to (or remain at) docker-1.12.6-16.el7.

Edit:
Now available in the CentOS mirror:
http://mirror.centos.org/centos/7/extras/x86_64/Packages/docker-1.12.6-32.git88a4867.el7.centos.x86_64.rpm
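To check whether an installed package already carries a given build, a rough version comparison with GNU `sort -V` is enough. A sketch, assuming GNU coreutils; `ver_ge` is a helper introduced here, and the build string is the fixed version named above:

```shell
#!/bin/sh
# ver_ge A B: succeeds when version string A sorts at or above B (GNU sort -V).
ver_ge() { [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]; }

fixed="1.12.6-32.git88a4867"
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' docker 2>/dev/null)
if [ -n "$installed" ] && ver_ge "$installed" "$fixed"; then
    echo "docker $installed should include the fix"
else
    echo "docker ${installed:-not installed}; upgrade to >= $fixed"
fi
```

Note that `sort -V` is only an approximation of rpm's own version-comparison rules, so treat the result as a hint rather than a guarantee.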

@DanyC97
Contributor

DanyC97 commented Oct 22, 2017

I'm afraid the issue is not fixed in the docker version mentioned above:

rpm -qa|grep docker
docker-client-1.12.6-32.git88a4867.el7.centos.x86_64
docker-common-1.12.6-32.git88a4867.el7.centos.x86_64
docker-1.12.6-32.git88a4867.el7.centos.x86_64
origin-docker-excluder-1.4.1-1.el7.noarch

and

openshift version
openshift v1.4.1
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

as I still got hit by it.

@nhorman

nhorman commented Oct 22, 2017

It is actually likely PR #33376 that you are after:
moby/moby#33376

@xThomo

xThomo commented Oct 22, 2017

@nhorman do you know if that's back-ported to an existing 1.12.6 version?

@nhorman

nhorman commented Oct 22, 2017

@xThomo not off the top of my head, no, but it should be pretty easy to go and see; it's a very small patch.

@DanyC97
Contributor

DanyC97 commented Oct 23, 2017

cheers @nhorman !

@jwhonce
Contributor

jwhonce commented Oct 23, 2017

@xThomo, @nhorman That should have been available since build docker-2:1.12.6-29.git97ba2c0

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 23, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 25, 2018
@rajatchopra
Contributor Author

I don't see this issue anymore, plus the workaround steps unblock me anyway. Closing this issue. Please reopen if needed.
