
openshift node becomes dysfunctional because of devicemapper issues with docker #14601

Closed
rajatchopra opened this issue Jun 12, 2017 · 19 comments
Labels
component/containers kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/P2


@rajatchopra
Contributor

The OpenShift node becomes dysfunctional because of devicemapper issues. This happens on a running cluster where the node was previously functioning well.

See issue: moby/moby#23089

Version

openshift v3.6.94
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

Steps To Reproduce
  1. On a large cluster (100+ nodes), create several thousand pods at once (4000)
  2. See that some nodes have trouble creating pods
Current Result

Node not ready

Origin node gives this error in the log:

 19246 kuberuntime_gc.go:123] Failed to remove container "f2d792518c47a670a633696c69b8360340d2cab70e8cc7087c510844bb5cad32": rpc error: code = 2 desc = failed to remove container "f2d792518c47a670a633696c69b8360340d2cab70e8cc7087c510844bb5cad32": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem f2d792518c47a670a633696c69b8360340d2cab70e8cc7087c510844bb5cad32: failed to remove device d6a003f88219386bd95462cfc7e78aecde7cd605fb358ff17fd2e1fa2731fa90:devicemapper: Can not set cookie: dm_task_set_cookie failed"}
Expected Result

Node should be ready and pods should run

Additional Information

Docker daemon restart gives this error:

[openshift@b10-h06-r620 ~]$ sudo systemctl restart docker
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
[openshift@b10-h06-r620 ~]$ journalctl -flu docker
-- Logs begin at Sun 2017-06-11 22:41:55 UTC. --
Jun 12 17:59:59 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: docker.service failed.
Jun 12 18:31:14 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: Starting Docker Application Container Engine...
Jun 12 18:31:14 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:14.688724739Z" level=info msg="libcontainerd: new containerd process, pid: 14832"
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:15.708582348Z" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man docker` to refer to dm.thinpooldev section."
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:15.715165962Z" level=error msg="[graphdriver] prior storage driver \"devicemapper\" failed: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie dm_task_set_cookie failed"
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com dockerd-current[14823]: time="2017-06-12T18:31:15.715998695Z" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie dm_task_set_cookie failed"
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: Failed to start Docker Application Container Engine.
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: Unit docker.service entered failed state.
Jun 12 18:31:15 b10-h06-r620.rdu.openstack.engineering.redhat.com_4.example.com systemd[1]: docker.service failed.
@rajatchopra
Contributor Author

Likely core issue: moby/moby#33603

@pweil- pweil- added component/containers kind/bug Categorizes issue or PR as related to a bug. priority/P2 labels Jun 13, 2017
@jwhonce
Contributor

jwhonce commented Jun 13, 2017

@rhvgoyal FYI -- I know you are active on the upstream issue

@pravisankar

Related devicemapper issue: https://bugzilla.redhat.com/show_bug.cgi?id=1461370#c6

Pod creation failed with error:
Error response from daemon: {"message":"devmapper: Error activating devmapper device for '1b87bd5a693a2b9fef9008e122b07237d63ed69d913111063c85adf99c5c912e-init': devicemapper: Can't set cookie dm_task_set_cookie failed"}

@rajatchopra
Contributor Author

rajatchopra commented Jun 15, 2017

Two workarounds so far (both may be needed):

  • The semaphore leak soon exhausts the system's default limit of 128 semaphore arrays. Raise the limit (SEMMNI, the fourth field of kernel.sem) to 8192:
printf 'kernel.sem = 250\t32000\t32\t8192\n' | sudo tee /etc/sysctl.d/99-kernelsem.conf
sudo sysctl --system

That lets me push the issue further out into the future.

  • And if you do hit the issue anyway, clean up the system:
sudo systemctl stop origin-node
sudo systemctl stop docker
sudo dmsetup remove_all
sudo dmsetup -y udevcomplete_all
sudo systemctl start docker
sudo systemctl start origin-node
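To see how close a node is to the 128-array limit before (or after) applying the sysctl workaround, the current cap and usage can be read from /proc and `ipcs`. A minimal sketch, assuming a Linux host with util-linux installed; the kernel.sem field layout is SEMMSL SEMMNS SEMOPM SEMMNI:

```shell
#!/bin/sh
# Read the four kernel.sem fields; SEMMNI (4th) is the system-wide cap on
# semaphore arrays -- the stock value of 128 is what the leak exhausts.
read -r semmsl semmns semopm semmni < /proc/sys/kernel/sem
echo "SEMMNI (max semaphore arrays): $semmni"

# Count the arrays currently allocated; data rows of `ipcs -s` start with
# a 0x key. Each leaked docker udev cookie holds one array.
in_use=$(ipcs -s 2>/dev/null | grep -c '^0x')
echo "semaphore arrays in use: $in_use"
```

When `in_use` creeps toward `semmni` on a node that is otherwise idle, the leak is in progress and the cleanup steps above will be needed eventually.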

@rhvgoyal

I updated the original issue. We suspect that a recent commit in docker caused this cookie/semaphore leak issue.

@nhorman is looking into writing a patch for it now.

@rhvgoyal

Check if the following PR fixes the issue.

moby/moby#33732

@bboreham

Hi, I'm trying to help someone who is having a similar issue with Docker 1.12.6. The Docker issue and PR referenced above all seem to point at recent changes, but you wouldn't be using such a recent Docker with Kubernetes, would you?

@nsu700

nsu700 commented Jun 29, 2017

printf 'kernel.sem = 250\t32000\t32\t8192\n' | sudo tee /etc/sysctl.d/99-kernelsem.conf
sudo sysctl --system

This works for me, since it permanently keeps the semaphore limit at 8192. One can run `ipcs -su` to see how many semaphores are in use, but how can I tell which process is using these semaphores? Can anyone help me? Thank you.
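SysV semaphores carry no owner field, so there is no direct answer; the closest clue is the pid column of `ipcs -s -i <semid>`, which records the last process to perform an operation on each semaphore. A rough sketch, where `sem_ids` is a helper introduced here and the awk field positions are assumptions about util-linux `ipcs` output:

```shell
#!/bin/sh
# Pull semaphore ids out of `ipcs -s` output: data rows start with a 0x key
# and carry the numeric semid in column 2.
sem_ids() { awk '$1 ~ /^0x/ && $2 ~ /^[0-9]+$/ {print $2}'; }

# For each array, show the pid that last ran a semop on it (column 5 of the
# per-semaphore table printed by `ipcs -s -i`).
ipcs -s | sem_ids | while read -r id; do
    last_pid=$(ipcs -s -i "$id" | awk 'NF == 5 && $1 ~ /^[0-9]+$/ {print $5; exit}')
    printf 'semid %s last touched by pid %s (%s)\n' \
        "$id" "$last_pid" "$(ps -o comm= -p "$last_pid" 2>/dev/null || echo unknown)"
done
```

On a node hit by this bug, the last-pid column would be expected to point at dockerd, since the leaked cookies come from its devicemapper driver.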

@rhvgoyal

We have backported the fix into projectatomic/docker as well, so please take the latest docker build from your source; it should include the fix.

projectatomic/docker#256
projectatomic/docker#255
projectatomic/docker#254

@xThomo

xThomo commented Jul 3, 2017

We are using docker-1.12.6-28.git1398f24.el7.centos.x86_64.rpm and saw the following errors after a docker version upgrade.

Jun 29 18:49:13 openshift-master-01 systemd[1]: Starting Docker Application Container Engine...
Jun 29 18:49:13 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:13.924855449Z" level=info msg="libcontainerd: new containerd process, pid: 124068"
Jun 29 18:49:14 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:14.946117449Z" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use --storage-opt dm.thinpooldev or use man docker to re...hinpooldev section."
Jun 29 18:49:14 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:14.946590822Z" level=error msg="[graphdriver] prior storage driver "devicemapper" failed: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set co...k_set_cookie failed"
Jun 29 18:49:14 openshift-master-01 dockerd-current[124060]: time="2017-06-29T18:49:14.946758304Z" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie ...k_set_cookie failed"
Jun 29 18:49:14 openshift-master-01 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jun 29 18:49:14 openshift-master-01 systemd[1]: Failed to start Docker Application Container Engine.
Jun 29 18:49:14 openshift-master-01 systemd[1]: Unit docker.service entered failed state.
Jun 29 18:49:14 openshift-master-01 systemd[1]: docker.service failed.

It looks like this is fixed in the upcoming docker-1.12.6-32.git88a4867.el7 version.

An alternative, according to the Bugzilla report, is to downgrade to (or remain at) docker-1.12.6-16.el7.

Edit:
Now available in the CentOS mirror:
http://mirror.centos.org/centos/7/extras/x86_64/Packages/docker-1.12.6-32.git88a4867.el7.centos.x86_64.rpm
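To check whether an installed package already carries a given build, a rough version comparison with GNU `sort -V` is enough. A sketch, assuming GNU coreutils; `ver_ge` is a helper introduced here, and the build string is the fixed version named above:

```shell
#!/bin/sh
# ver_ge A B: succeeds when version string A sorts at or above B (GNU sort -V).
ver_ge() { [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]; }

fixed="1.12.6-32.git88a4867"
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' docker 2>/dev/null)
if [ -n "$installed" ] && ver_ge "$installed" "$fixed"; then
    echo "docker $installed should include the fix"
else
    echo "docker ${installed:-not installed}; upgrade to >= $fixed"
fi
```

Note that `sort -V` is only an approximation of rpm's own version-comparison rules, so treat the result as a hint rather than a guarantee.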

@DanyC97
Contributor

DanyC97 commented Oct 22, 2017

I'm afraid the issue is not fixed in the docker version mentioned above:

rpm -qa|grep docker
docker-client-1.12.6-32.git88a4867.el7.centos.x86_64
docker-common-1.12.6-32.git88a4867.el7.centos.x86_64
docker-1.12.6-32.git88a4867.el7.centos.x86_64
origin-docker-excluder-1.4.1-1.el7.noarch

and

openshift version
openshift v1.4.1
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

as I still got hit by it.

@nhorman

nhorman commented Oct 22, 2017

It is actually likely PR #33376 that you are after:
moby/moby#33376

@xThomo

xThomo commented Oct 22, 2017

@nhorman do you know if that's back-ported to an existing 1.12.6 version?

@nhorman

nhorman commented Oct 22, 2017

@xThomo not off the top of my head, no, but it should be pretty easy to go and see; it's a very small patch.

@DanyC97
Contributor

DanyC97 commented Oct 23, 2017

cheers @nhorman !

@jwhonce
Contributor

jwhonce commented Oct 23, 2017

@xThomo, @nhorman That should have been available since build docker-2:1.12.6-29.git97ba2c0

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 23, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 25, 2018
@rajatchopra
Contributor Author

I don't see this issue anymore, plus the workaround steps unblock me anyway. Closing this issue. Please reopen if needed.
