Docker Engine fails to restart - Base Device UUID and Filesystem verification failed #23089
Comments
Attaching the previous Docker Engine state and debug output from when it was reconfigured with 'docker_device'.

---
Running `rm -rf /var/lib/docker` and re-commissioning the node for the contiv cluster fixes the issue, and the Docker Engine does start with the docker_device configs.
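A minimal sketch of that manual recovery on a systemd host (destructive: it deletes all local images, containers, and volumes; the contiv re-commissioning step itself is tooling-specific and not shown):

```bash
# Stop the daemon before touching its state directory
systemctl stop docker

# Remove all Docker state, including the devicemapper metadata that
# fails the Base Device UUID / filesystem verification on restart
rm -rf /var/lib/docker

# The graph driver re-initializes from scratch on the next start
systemctl start docker
```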
---

How did you install Docker? It seems to be a customized installation; could you specify how?

---
Yes, through Ansible playbook tasks; from the role's tasks directory on contiv-aci-scale-1:
```yaml
---
# This role contains tasks for configuring and starting the docker service
#
- name: check docker version
  shell: docker --version
  register: docker_installed_version
  ignore_errors: yes
  tags:
    - prebake-for-dev

- include: ubuntu_install_tasks.yml
  when: (ansible_os_family == "Debian") and not (docker_installed_version.stdout | match("Docker version {{ docker_version }}, build.*"))
  tags:
    - prebake-for-dev

- include: redhat_install_tasks.yml
  when: (ansible_os_family == "RedHat") and not (docker_installed_version.stdout | match("Docker version {{ docker_version }}, build.*"))
  tags:
    - prebake-for-dev

- name: create docker daemon's config directory
  file: path=/etc/systemd/system/docker.service.d state=directory
  tags:
    - prebake-for-dev

- name: setup docker daemon's environment
  template: src=env.conf.j2 dest=/etc/systemd/system/docker.service.d/env.conf
  tags:
    - prebake-for-dev

- name: setup iptables for docker
  shell: >
    ( iptables -L INPUT | grep "{{ docker_rule_comment }} ({{ item }})" ) ||
    iptables -I INPUT 1 -p tcp --dport {{ item }} -j ACCEPT -m comment --comment "{{ docker_rule_comment }} ({{ item }})"
  become: true
  with_items:
    - "{{ docker_api_port }}"

- name: copy systemd units for docker (enable cluster store) (debian)
  template: src=docker-svc.j2 dest=/lib/systemd/system/docker.service
  when: ansible_os_family == "Debian"

- name: copy systemd units for docker (enable cluster store) (redhat)
  template: src=docker-svc.j2 dest=/usr/lib/systemd/system/docker.service
  when: ansible_os_family == "RedHat"

- name: check docker-tcp socket state
  shell: systemctl status docker-tcp.socket | grep 'Active.*active' -o
  ignore_errors: true
  register: docker_tcp_socket_state

- include: create_docker_device.yml
  when: docker_device != ""
```
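For context, a hypothetical invocation of a playbook that uses this role with the docker_device variable set (the inventory and playbook names and the device path are illustrative, not from the report):

```bash
# docker_device selects the block device handed to create_docker_device.yml
ansible-playbook -i hosts cluster.yml -e "docker_device=/dev/sdb"
```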
---

I don't know Ansible well, but from the logs the error message seems devicemapper-related; we need the system's specific configuration info to track that down.

---
I found this problem after a yum update on Red Hat. Delete (or move) the devicemapper folder and start Docker; it will be recreated and Docker will start normally: $ rm -rf /var/lib/docker/devicemapper
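A slightly gentler variant of that recovery, moving the folder aside instead of deleting it so the old metadata can still be inspected (assumes systemd; existing images and containers stored there become unavailable):

```bash
systemctl stop docker

# Docker recreates this directory on the next start
mv /var/lib/docker/devicemapper /var/lib/docker/devicemapper.bak

systemctl start docker
```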
---
I found the same issue on Docker 1.12. Tried all the possibilities; the issue is still not resolved. WA3: deleted and re-created the LVM. Still the issue is not resolved.

---
The workaround is: rm -rf /var/lib/docker and restart Docker (sometimes a host reboot is needed).

---
Thanks. It worked.

---
I have the same issue on RHEL 7.2 with Docker version 1.12.1, build 23cf638.

---
I think this is very annoying; I have this on a Raspberry Pi 3 with aarch64 openSUSE.

---
I am facing the same issue on CentOS 7.3.1611 with Docker version 1.12.6, build 1398f24/1.12.6. The Docker service fails to start, and the above-mentioned workarounds did not work for me. Debug info

---
Are there any more logs in the journal, like running out of semaphores, or something being wrong with the thin pool? Try running "dmsetup table" and "dmsetup status" as well and paste the output here. If nothing gives a clue, we need to enable more debugging of libdevmapper in docker; right now we suppress libdevmapper messages as there are too many. We would have to enable it, recompile docker, re-run, and try to reproduce. My guess is to check the journal logs: there may be more information hinting at why setting the cookie failed. (A sketch of these diagnostics follows below.)
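A sketch of those diagnostics, plus a semaphore check since that is the suspected leak (the journalctl flags here are my choice, not from the thread):

```bash
# Thin-pool and device tables/status from device-mapper
dmsetup table
dmsetup status

# Recent docker daemon logs from the journal
journalctl -u docker --no-pager -n 100

# Currently allocated System V semaphore arrays (cookies live here)
ipcs -s
```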
---
Somehow cookies/semaphores are leaking, and that leads to hitting the system's maximum semaphore limit, which means set_task_cookie() fails and docker fails to start. I am not very clear, though, on why this issue happens only on CentOS. What about Fedora/RHEL? Has anybody noticed it there too? (A sketch for checking the limits follows below.)
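A quick way to check whether the semaphore limit is being hit, using standard util-linux/procfs interfaces (these exact commands are my suggestion, not from the thread):

```bash
# Allocated semaphore arrays (the count includes a few header lines)
ipcs -s | wc -l

# Kernel limits, in order: semmsl semmns semopm semmni
cat /proc/sys/kernel/sem

# Summary of semaphore usage versus limits
ipcs -u
```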
---
@rhvgoyal I'm having the same problem on RHEL 7.1.

---
On Jun 17, 2017 10:33 AM, "Yuan" <notifications@github.com> wrote:

> @rhvgoyal I'm having the same problem on RHEL 7.1. My docker version is 1.12.6, build 1398f24/1.12.6. As mentioned by others, rm -rf /var/lib/docker and reboot fixes the issue... for now...

Yes, reboot fixes the issue, but it's not a great thing to do in production :(

---
@rtnpro Running with loop devices is also strongly discouraged for production use; it can lead to various issues. See the warning in your logs (and the output of

---
cc @nhorman

---
I had a conversation with Neil Horman and we suspect the issue is happening due to a recent change. With the current code, we are not calling UdevWait() in case of a dm_task_run() error, and that can lead to leakage of the cookie/semaphore. Previously we were using
@nhorman is looking into fixing this and should soon create a PR. This one is easy to reproduce: keep a container's device busy and then try to remove the container; that should leak a cookie.
The container removal should fail; now check the cookies with
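The checking command above is truncated in the thread. A speculative reproduction sketch with standard tooling (the image name, the device-name pattern, and the trick for holding the device busy are all illustrative assumptions):

```bash
# Start a throwaway container (image name is illustrative)
docker run -d --name leaktest busybox sleep 3600

# Hold one of the container's devicemapper devices open from the host so
# that deactivation fails on removal (pick a layer device, not the pool;
# the name pattern varies by setup)
dev=$(ls /dev/mapper/docker-* | grep -v pool | head -n 1)
exec 3< "$dev"

# Removal should now fail while the device is busy, leaking a cookie
docker rm -f leaktest

# Each leaked cookie shows up as a stale System V semaphore array
ipcs -s

# Release the device again
exec 3<&-
```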
---
This PR should fix the issue.

---
A possible workaround is to raise the maximum number of semaphores on the system. It worked for us, and the docker daemon could start correctly. This is described in issue #33603. (A sketch follows below.)
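A minimal sketch of raising the System V semaphore limits with sysctl; kernel.sem is the standard tunable, but the example values here are illustrative, not taken from the thread:

```bash
# Current limits, in order: semmsl semmns semopm semmni
cat /proc/sys/kernel/sem

# Raise the limits on the running system (example values only)
sysctl -w kernel.sem="250 32000 32 1024"

# Persist the change across reboots
echo "kernel.sem = 250 32000 32 1024" > /etc/sysctl.d/99-semaphores.conf
```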
---
@rtnpro I faced your problem, but I issued the command:
And it worked like a charm.

---
@rkharya I should thank you for this page and for explaining the issue in depth. I hit the same fatal error:

dockerd-current[117028]: time="2017-09-13T18:26:40.456865612-07:00" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed"

The playbook used to scale the OpenShift cluster node follows https://docs.openshift.com/container-platform/3.5/install_config/adding_hosts_to_existing_cluster.html. After cleaning up the /var/lib/docker directory and rebooting the EC2 instance, I re-ran the playbook to get the docker service started and add the node back to the OpenShift cluster.

---
Output of docker version:

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.):
Physical environment

Steps to reproduce the issue:

Describe the results you received:
Docker Engine fails to restart with the error below:
May 29 07:16:42 contiv-aci-scale-1.cisco.com systemd[1]: Starting Docker Application Container Engine...
May 29 07:16:43 contiv-aci-scale-1.cisco.com docker[32236]: time="2016-05-29T07:16:43.000531244-07:00" level=info msg="New containerd process, pid: 32241\n"
May 29 07:16:44 contiv-aci-scale-1.cisco.com docker[32236]: time="2016-05-29T07:16:44.018947601-07:00" level=error msg="[graphdriver] prior storage driver "devicemapper" failed: devmapper: Base Device UUID and Filesystem verification failed.devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed"
May 29 07:16:44 contiv-aci-scale-1.cisco.com docker[32236]: time="2016-05-29T07:16:44.019375151-07:00" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed.devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed"
May 29 07:16:44 contiv-aci-scale-1.cisco.com systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
May 29 07:16:44 contiv-aci-scale-1.cisco.com docker[32236]: time="2016-05-29T07:16:44-07:00" level=info msg="stopping containerd after receiving terminated"
May 29 07:16:44 contiv-aci-scale-1.cisco.com systemd[1]: Failed to start Docker Application Container Engine.
May 29 07:16:44 contiv-aci-scale-1.cisco.com systemd[1]: Unit docker.service entered failed state.
May 29 07:16:44 contiv-aci-scale-1.cisco.com systemd[1]: docker.service failed.
Describe the results you expected:
Docker Engine should start gracefully when it is configured with the docker_device functionality.
Additional information you deem important (e.g. issue happens only occasionally):
Consistently reproducible failure.
The workaround is manual: clean up /var/lib/docker and reboot the node. Re-issuing the cluster-commissioning task then gets through, and the Docker Engine does start successfully with the 'docker_device' config.