
Docker Engine fails to restart - Base Device UUID and Filesystem verification failed #23089

Closed
rkharya opened this issue May 29, 2016 · 25 comments

@rkharya commented May 29, 2016

Output of docker version:

Docker version 1.11.1-cs2, build 7cd1c0b

Output of docker info:

Cannot connect to the Docker daemon. Is the docker daemon running on this host?

Additional environment details (AWS, VirtualBox, physical, etc.):

Physical environment

Steps to reproduce the issue:

  1. Docker was installed without the docker_device config.
  2. Used Contiv Cluster Manager to make the host part of a Contiv cluster, which uses the 'device_mapper' functionality.
  3. Restarting the Docker Engine fails with the error below, and the Ansible playbook run to commission the node into the contiv-cluster fails.

Describe the results you received:
Docker Engine fails to restart with the error below:

May 29 07:16:42 contiv-aci-scale-1.cisco.com systemd[1]: Starting Docker Application Container Engine...
May 29 07:16:43 contiv-aci-scale-1.cisco.com docker[32236]: time="2016-05-29T07:16:43.000531244-07:00" level=info msg="New containerd process, pid: 32241\n"
May 29 07:16:44 contiv-aci-scale-1.cisco.com docker[32236]: time="2016-05-29T07:16:44.018947601-07:00" level=error msg="[graphdriver] prior storage driver "devicemapper" failed: devmapper: Base Device UUID and Filesystem verification failed.devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed"
May 29 07:16:44 contiv-aci-scale-1.cisco.com docker[32236]: time="2016-05-29T07:16:44.019375151-07:00" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed.devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed"
May 29 07:16:44 contiv-aci-scale-1.cisco.com systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
May 29 07:16:44 contiv-aci-scale-1.cisco.com docker[32236]: time="2016-05-29T07:16:44-07:00" level=info msg="stopping containerd after receiving terminated"
May 29 07:16:44 contiv-aci-scale-1.cisco.com systemd[1]: Failed to start Docker Application Container Engine.
May 29 07:16:44 contiv-aci-scale-1.cisco.com systemd[1]: Unit docker.service entered failed state.
May 29 07:16:44 contiv-aci-scale-1.cisco.com systemd[1]: docker.service failed.

Describe the results you expected:
Docker Engine should start gracefully when it is reconfigured with the docker_device functionality.

Additional information you deem important (e.g. issue happens only occasionally):
Consistently reproducible failure.

The workaround is manual: clean up /var/lib/docker and reboot the node. Re-issuing the cluster-commissioning task then gets through, and the Docker Engine with the 'docker_device' config does start successfully.

@rkharya (Author) commented May 29, 2016

Attaching the previous Docker Engine state and debug output from when it was reconfigured with 'docker_device'.
docker-bug.txt

@rkharya (Author) commented May 29, 2016

# rm -rf /var/lib/docker
# reboot

Re-commissioning the node for the Contiv cluster then fixes the issue, and the Docker Engine does start with the docker_device config.

@HackToday (Contributor) commented May 30, 2016

  1. Docker was installed without docker_device config

How did you install Docker? It seems like a customized way of installing it. Could you specify how?

@rkharya (Author) commented May 30, 2016

Yes, through Ansible playbook tasks:

[cluster-admin@contiv-aci-scale-1 tasks]$ pwd
/home/cluster-admin/ansible/roles/docker/tasks
[cluster-admin@contiv-aci-scale-1 tasks]$ cat main.yml

---
# This role contains tasks for configuring and starting docker service
#

- name: check docker version
  shell: docker --version
  register: docker_installed_version
  ignore_errors: yes
  tags:
    - prebake-for-dev

- include: ubuntu_install_tasks.yml
  when: (ansible_os_family == "Debian") and not (docker_installed_version.stdout | match("Docker version {{ docker_version }}, build.*"))
  tags:
    - prebake-for-dev

- include: redhat_install_tasks.yml
  when: (ansible_os_family == "RedHat") and not (docker_installed_version.stdout | match("Docker version {{ docker_version }}, build.*"))
  tags:
    - prebake-for-dev

- name: create docker daemon's config directory
  file: path=/etc/systemd/system/docker.service.d state=directory
  tags:
    - prebake-for-dev

- name: setup docker daemon's environment
  template: src=env.conf.j2 dest=/etc/systemd/system/docker.service.d/env.conf
  tags:
    - prebake-for-dev

- name: setup iptables for docker
  shell: >
      ( iptables -L INPUT | grep "{{ docker_rule_comment }} ({{ item }})" ) || \
      iptables -I INPUT 1 -p tcp --dport {{ item }} -j ACCEPT -m comment --comment "{{ docker_rule_comment }} ({{ item }})"
  become: true
  with_items:
    - "{{ docker_api_port }}"

- name: copy systemd units for docker(enable cluster store) (debian)
  template: src=docker-svc.j2 dest=/lib/systemd/system/docker.service
  when: ansible_os_family == "Debian"

- name: copy systemd units for docker(enable cluster store) (redhat)
  template: src=docker-svc.j2 dest=/usr/lib/systemd/system/docker.service
  when: ansible_os_family == "RedHat"

- name: check docker-tcp socket state
  shell: systemctl status docker-tcp.socket | grep 'Active.*active' -o
  ignore_errors: true
  register: docker_tcp_socket_state

- include: create_docker_device.yml
  when: docker_device != ""

@HackToday (Contributor) commented May 31, 2016

I don't know Ansible well, but from the logs the error message is:

Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed.devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed"

It seems devicemapper related; system-specific configuration info is needed to track this down.

@dfmadrid commented Aug 31, 2016

I found this problem after a yum update on Red Hat. Delete (or move) the devicemapper folder and start Docker; it will be recreated and Docker will start normally:

$ rm -rf /var/lib/docker/devicemapper
$ systemctl start docker

@anannaya commented Sep 2, 2016

I found the same issue on Docker 1.12 and tried all the possibilities; the issue is still not resolved.

Workaround 1: cleaned up all the loop devices.
Workaround 2: rm -rf /var/lib/docker/devicemapper
Workaround 3: deleted and re-created the LVM.
Workaround 4: removed Docker and reinstalled it.

Still the issue is not resolved.

@rkharya (Author) commented Sep 2, 2016

The workaround is: rm -rf /var/lib/docker and restart Docker (sometimes a host reboot is needed).

@anannaya commented Sep 6, 2016

Thanks, it worked.

@thaJeztah (Member) commented Sep 27, 2016

Looks like the issue was resolved for @rkharya. Also, there are some other issues on this topic that may contain useful information: #15721, #19055, #21606, #16344, #13894.

@thaJeztah closed this Sep 27, 2016

@liyaka commented Jan 5, 2017

I have the same issue on RHEL 7.2 with Docker version 1.12.1, build 23cf638.
It's really problematic to remove /var/lib/docker/devicemapper; I will lose my data.
Any other workaround?

@grizzlyfred commented Jan 17, 2017

I think this is very annoying. I have this on a Raspberry Pi 3 with aarch64 openSUSE.
I moved the whole /var/lib/docker to a different btrfs subvolume, because /var/lib/docker/devicemapper/devicemapper contains a 2 GB and a 100 GB sparse file, which explodes my rsync backup of the root partition... and then I got this UUID mismatch. Again, simply deleting everything in the new partition (rather than reusing the moved-over contents) helped.
Fortunately, I only had to rebuild the one container I had just made and had not yet really done anything with.
I would have appreciated being able to move Docker data around to whatever physical medium, network, or cluster filesystem without that hassle.
On the other hand, it's rather the distro's shortcoming that there seems to be no simple way to get AUFS installed on SUSE Leap, and that is the only running aarch64 distro for the Pi 3 right now, AFAIK.

@rtnpro commented Jun 12, 2017

I am facing the same issue on CentOS 7.3.1611 with Docker version 1.12.6, build 1398f24/1.12.6. The Docker service fails to start. The above-mentioned workarounds did not work for me.

Debug info

[root@nxx ~]# docker --version
Docker version 1.12.6, build 1398f24/1.12.6

[root@nxx ~]# systemctl status docker -l
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2017-06-12 08:14:41 BST; 1min 22s ago
     Docs: http://docs.docker.com
  Process: 300 ExecStart=/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY (code=exited, status=1/FAILURE)
 Main PID: 300 (code=exited, status=1/FAILURE)

Jun 12 08:14:40 n41.pufty.ci.centos.org systemd[1]: Starting Docker Application Container Engine...
Jun 12 08:14:40 n41.pufty.ci.centos.org dockerd-current[300]: time="2017-06-12T08:14:40.732651146+01:00" level=warning msg="[!] DON'T BIND ON ANY IP ADDRESS WITHOUT setting -tlsverify IF YOU DON'T KNOW WHAT YOU'RE DOING [!]"
Jun 12 08:14:40 n41.pufty.ci.centos.org dockerd-current[300]: time="2017-06-12T08:14:40.742206601+01:00" level=info msg="libcontainerd: new containerd process, pid: 308"
Jun 12 08:14:41 n41.pufty.ci.centos.org dockerd-current[300]: time="2017-06-12T08:14:41.758615859+01:00" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man docker` to refer to dm.thinpooldev section."
Jun 12 08:14:41 n41.pufty.ci.centos.org dockerd-current[300]: time="2017-06-12T08:14:41.759174439+01:00" level=error msg="[graphdriver] prior storage driver \"devicemapper\" failed: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie dm_task_set_cookie failed"
Jun 12 08:14:41 n41.pufty.ci.centos.org dockerd-current[300]: time="2017-06-12T08:14:41.759576207+01:00" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Can't set cookie dm_task_set_cookie failed"
Jun 12 08:14:41 n41.pufty.ci.centos.org systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jun 12 08:14:41 n41.pufty.ci.centos.org systemd[1]: Failed to start Docker Application Container Engine.
Jun 12 08:14:41 n41.pufty.ci.centos.org systemd[1]: Unit docker.service entered failed state.
Jun 12 08:14:41 n41.pufty.ci.centos.org systemd[1]: docker.service failed.

@rhvgoyal (Contributor) commented Jun 12, 2017

Are there any more logs in the journal, like running out of semaphores or something being wrong with the thin pool? Try running "dmsetup table" and "dmsetup status" as well and paste the output here.

If nothing gives a clue, we need to enable more debugging of libdevmapper in Docker. Right now we suppress libdevmapper messages because there are too many. We would have to enable them, recompile Docker, re-run, and try to reproduce.

My guess is that the journal logs contain some more information hinting at why setting the cookie failed.
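
For reference, a sketch of gathering that information (all standard device-mapper / SysV IPC / systemd commands; nothing here is specific to this issue):

# device-mapper state used by the devicemapper graphdriver
dmsetup table
dmsetup status
dmsetup udevcookies        # outstanding udev cookies; leaked semaphores show up here

# SysV semaphore limits and current usage (udev cookies are backed by semaphores)
cat /proc/sys/kernel/sem
ipcs -s

# recent daemon logs from the journal
journalctl -u docker --no-pager -n 100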

@rhvgoyal (Contributor) commented Jun 12, 2017

@rtnpro I suspect you are running into the following issue:

#33603

@rhvgoyal (Contributor) commented Jun 12, 2017

Somehow cookies/semaphores are leaking, which leads to hitting the system's maximum semaphore limit; that means dm_task_set_cookie() fails and Docker fails to start.

I am not clear, though, on why this issue happens only on CentOS. What about Fedora/RHEL? Has anybody noticed it there too?

@faraway commented Jun 17, 2017

@rhvgoyal I'm having the same problem on RHEL 7.1.
My Docker version is 1.12.6, build 1398f24/1.12.6.
As mentioned by others, rm -rf /var/lib/docker and a reboot fixes the issue... for now...

@rtnpro commented Jun 17, 2017

@thaJeztah (Member) commented Jun 17, 2017

@rtnpro running with loop devices is also strongly discouraged for production use; it can lead to various issues. See the warning in your logs (and the output of docker info):

devmapper: Usage of loopback devices is strongly discouraged for production use. Please use --storage-opt dm.thinpooldev or use man docker to refer to dm.thinpooldev section.
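
For anyone who wants to act on that warning, here is a minimal sketch of switching devicemapper from loopback files to a dedicated thin pool, roughly following the dm.thinpooldev suggestion. The block device (/dev/sdb) and the volume group/pool names (docker/thinpool) are illustrative assumptions, not taken from this thread:

# create an LVM thin pool on a spare block device (device and names are illustrative)
pvcreate /dev/sdb
vgcreate docker /dev/sdb
lvcreate --wipesignatures y -n thinpool docker -l 95%VG
lvcreate --wipesignatures y -n thinpoolmeta docker -l 1%VG
lvconvert -y --zero n -c 512K --thinpool docker/thinpool --poolmetadata docker/thinpoolmeta

# point the daemon at the pool instead of the loopback files
cat > /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.thinpooldev=/dev/mapper/docker-thinpool",
    "dm.use_deferred_removal=true"
  ]
}
EOF
systemctl restart docker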

@rhvgoyal (Contributor) commented Jun 19, 2017

@rhvgoyal (Contributor) commented Jun 19, 2017

I had a conversation with Neil Horman, and we suspect that the issue is happening due to a recent change:

5206d45

With the current code, we are not calling UdevWait() in case of a dm_task_run() error, and that can leak the cookie/semaphore. Previously we used a deferred UdevWait, which made sure dm_udev_wait() was called even if dm_task_run() failed.

@nhorman is looking into fixing this and should soon create a PR.

This one is easy to reproduce. Keep a container device busy and then try to remove the container; that should leak a cookie:

  • Disable deferred removal
  • Run a container
  • In a separate terminal, run "unshare -m"
  • Exit container
  • Remove container

Container removal should fail; now check cookies with dmsetup udevcookies and there should be a bunch of leaked cookies (a rough shell transcript of these steps is sketched below).
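
A rough transcript of those steps (the container name and image are illustrative; it assumes deferred removal is already disabled in the daemon's devicemapper storage options):

# terminal 1: run a container on the devicemapper graphdriver
docker run -d --name leaktest busybox sleep 3600

# terminal 2: clone the mount namespace so the container's device stays busy;
# leave this shell open
unshare -m

# terminal 1: stop ("exit") the container, then try to remove it
docker stop leaktest
docker rm leaktest         # expected to fail while the device is held busy

# check for leaked udev cookies (each one pins a SysV semaphore)
dmsetup udevcookies
ipcs -s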

@rhvgoyal (Contributor) commented Jun 19, 2017

This PR should fix the issue.

#33732

@BenoitAverty commented Aug 18, 2017

A possible workaround is to raise the maximum number of semaphores on the system.

It worked for us, and the Docker daemon could start correctly.

This is described in issue #33603.
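
For reference, a minimal sketch of raising the semaphore limits via sysctl; the values below are illustrative and not taken from this thread:

# current limits, in order: SEMMSL SEMMNS SEMOPM SEMMNI
cat /proc/sys/kernel/sem

# raise SEMMNS (system-wide semaphore count) and SEMMNI (max semaphore arrays);
# illustrative values
echo "kernel.sem = 250 256000 32 4096" > /etc/sysctl.d/99-semaphores.conf
sysctl -p /etc/sysctl.d/99-semaphores.conf

# then try starting the daemon again
systemctl start docker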

@kiemrong08 commented Aug 30, 2017

@rtnpro: I faced your problem, but I issued the command:

dmsetup udevcomplete_all

and pressed y to confirm. And it worked like a charm.
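
Spelled out as a sketch (the daemon restart afterwards is my assumption, not part of the comment above):

# clear all outstanding device-mapper udev cookies; answer "y" at the prompt
dmsetup udevcomplete_all

# then try starting the daemon again (assumption)
systemctl start docker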

@richajoy commented Sep 14, 2017

@rkharya I should thank you for this page and for explaining the issue in depth.
I faced the same error while adding an additional node to my OpenShift cluster, and the docker service was failing with a devicemapper error.

[dockerd-current[117028]: time="2017-09-13T18:26:40.456865612-07:00" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed"]

The playbook used to scale up the OpenShift cluster node is per https://docs.openshift.com/container-platform/3.5/install_config/adding_hosts_to_existing_cluster.html:
# ansible-playbook [-i /path/to/file] /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-node/scaleup.yml

After cleaning up the /var/lib/docker directory and rebooting the EC2 instance, I re-ran the playbook to get the docker service started and add the node back to the OpenShift cluster.
