error removing container (1.10, 1.11/master) with AUFS #21704
Comments
Can you give the output of
How recent is your master? I think we fixed this.
Oh, and #17902
ping @anusha-ragunathan would you be able to look into this? I linked various related / similar issues above
I think the daemon logs from one successful run of our integration tests and one run that caused this error will be helpful, but I'm not sure if I can share them publicly here. If you have access, check these out: success: https://ci.qa.aws.dckr.io/job/dtr-deploy/2749/artifact/integration/results/docker.log They are not exactly from the same PR, but they are very similar. There is a potentially relevant one earlier in the logs:
This is the complete error at the time of the failed container delete:
It's interesting that the same error log appears when we restart the daemon earlier in the test.
If I had to guess, I'd say there are leftover processes referencing the same layers from when the daemon tried to restart and failed to properly kill them.
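If that hypothesis is right, leftover container processes should still be parked in Docker's cgroups after the failed restart. A minimal sketch of how one might check for that (the cgroup path pattern and the output format are assumptions, not something shown in this thread):

```sh
#!/bin/sh
# Print PID, container ID, and command name for every process whose cgroup
# still points at a docker container. Anything here that the daemon no longer
# lists in `docker ps -aq --no-trunc` is a likely leftover holding a layer busy.
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    cid=$(grep -o 'docker[/-][0-9a-f]\{64\}' "/proc/$pid/cgroup" 2>/dev/null | head -n1)
    [ -n "$cid" ] && printf '%s\t%s\t%s\n' "$pid" "$cid" "$(cat /proc/$pid/comm 2>/dev/null)"
done
```

Comparing that list against `docker ps -aq --no-trunc` would show which of those processes belong to containers the daemon still knows about.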
@vikstrous: I cannot access the Jenkins logs. Can you create a gist of the logs? I tried a quick test of creating and removing containers in a loop of 15 (not concurrent) on AUFS and didn't observe this issue. Is there a deterministic way to repro the issue? Can you confirm that the containers start successfully? If yes, then a couple of things to proceed on:
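For reference, a rough sketch of the kind of loop described above (image and container names are arbitrary placeholders; the original test setup isn't shown in this thread):

```sh
#!/bin/sh
# Sequentially create and remove 15 containers on an aufs-backed daemon and
# report any removal that fails (e.g. with "device or resource busy").
for i in $(seq 1 15); do
    docker run -d --name "aufs-repro-$i" busybox top >/dev/null
    docker rm -f "aufs-repro-$i" >/dev/null || echo "removal failed on iteration $i"
done
```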
@vikstrous @anusha-ragunathan please post it on slack if the jenkins log contains information that should not be shared publicly 👍
I haven't seen this bug since last time I posted in this thread. It's possible that it was fixed. I'll update you if I see it again.
I have a similar issue, please have a look.

test.yml:

```yaml
version: "2"
services:
  browser:
    image: elgalu/selenium:2.53.0e
    ports:
      - "5920:25900"
    #  - "4444:24444"
    # volumes:
    #   - "/dev/shm:/dev/shm"
    environment:
      - "VNC_PASSWORD=test"
      - "FIREFOX=false"
      - "CHROME=true"
    networks:
      my-net:
        aliases:
          - browser

networks:
  my-net:
    driver: bridge
```

Stdout:

```console
$ docker-compose --version
docker-compose version 1.6.2, build 4d72027
$ docker --version
Docker version 1.10.3, build 20f81dd
$ docker info
Containers: 3
 Running: 1
 Paused: 0
 Stopped: 2
Images: 142
Server Version: 1.10.3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 245
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 3.13.0-85-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.58 GiB
Name: tk9
ID: CYDT:VFSD:2M77:W5P7:OQUD:J6G7:EWQR:KJWR:SOUX:JCLZ:2SBG:J7GX
WARNING: No swap limit support
$ docker-compose -f test.yml up -d
Creating network "homelocal_my-net" with driver "bridge"
Creating homelocal_browser_1
$ docker-compose -f test.yml stop
Stopping homelocal_browser_1 ...
ERROR: for homelocal_browser_1 ('Connection aborted.', BadStatusLine("''",))
ERROR: Couldn't connect to Docker daemon at http+docker://localunixsocket - is it running?
If it's at a non-standard location, specify the URL with the DOCKER_HOST environment variable.
$ docker-compose -f test.yml rm -f
Going to remove homelocal_browser_1
Removing homelocal_browser_1 ... error
ERROR: for homelocal_browser_1 Driver aufs failed to remove root filesystem 0e6e88bcc931eb13e141ac871b4ba965d01aae880a20255a5e974f15dff40b0e: rename /var/lib/docker/aufs/mnt/d4e6ee5ebd3ac40e256afa4492451e25cbea87f5041a1dce0bec7a302f41cc45 /var/lib/docker/aufs/mnt/d4e6ee5ebd3ac40e256afa4492451e25cbea87f5041a1dce0bec7a302f41cc45-removing: device or resource busy
$ docker-compose -f test.yml rm -f
Going to remove homelocal_browser_1
Removing homelocal_browser_1 ... error
```
And this is probably related as well: #21845
@cpuguy83 looks like it is not related to aufs... Trying to solve this, I switched to overlayfs and see the same picture:

Error response from daemon: Driver overlay failed to remove root filesystem 8b21bec99eccde191ca98e944003274c5b45bbf6f1e4cc08560c0e454e5d3719: readdirent: no such file or directory

```console
$ docker info
Containers: 4
 Running: 0
 Paused: 0
 Stopped: 4
Images: 44
Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 3.19.0-58-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.58 GiB
Name: tk9
ID: CYDT:VFSD:2M77:W5P7:OQUD:J6G7:EWQR:KJWR:SOUX:JCLZ:2SBG:J7GX
WARNING: No swap limit support
```
@FelikZ: Can you upgrade to docker-engine 1.11 rc4 and try again?
ping @FelikZ do you still see this on 1.11.0?
Seeing this frequently on different machines, all on 1.11.1 plus aufs. Most are on 14.04 LTS (3.13 kernel).
I just saw this when trying to start a container using the
I couldn't find reference to that filesystem or any of those directories when groveling around in |
I've seen this several times recently also.
Same here with Ubuntu 14.04.
@Puneeth-n I already did it, I even restarted my computer, but the container is still there. Do you have some link to instruct me how to change to
EDIT: I changed to
@lazize yes, you lose everything because the storage driver is different.
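For anyone else looking for the mechanics of switching: a minimal sketch of changing the storage driver via the daemon configuration, assuming a systemd host, the default config path, no existing /etc/docker/daemon.json to preserve, and a kernel new enough for overlay2. Images and containers created under aufs will no longer be visible afterwards (they stay on disk under /var/lib/docker/aufs until you clean them up).

```sh
sudo mkdir -p /etc/docker
# Overwrites any existing daemon.json -- merge by hand if you already have one.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "storage-driver": "overlay2"
}
EOF
sudo systemctl restart docker
docker info | grep -i 'storage driver'
```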
@lazize can you open a new issue with details? Docker 17.06 has some changes related to removal of containers; if you're running docker 17.06, we may have to look into that (e.g. ignoring errors where the container's file system was already removed)
@thaJeztah As I changed to
@lazize if you still have logs from around the time it happened, that would be welcome (be sure to check them for confidential information)
@thaJeztah Unfortunately systemd wasn't configured to persist logs, which means I only have logs from this morning's boot. I've changed it to persist logs now, so if it happens again I will be able to help much more. Sorry about that!
No worries, thanks!
I am not a Linux expert by any means. I installed docker-ce in order to use a container called elabftw. It does not start, and the error is the one continually referenced in this thread. I am su throughout. I am using 17.09.0-ce, there are in fact dead containers, and I tried service docker restart but the problem persists. A reboot did not fix the problem either. This is Debian 8 (jessie) amd64. Some captures are included below. Thanks for any help anyone can give me.

FROM THE END OF THE DOCKER INSTALLATION

pauldeyoung@local-root-analysis-3:~$ sudo docker run hello-world
Hello from Docker!
To generate this message, Docker took the following steps:
To try something more ambitious, you can run an Ubuntu container with:
Share images, automate workflows, and more with a free Docker ID:
For more examples and ideas, visit:

TRYING TO START THE CONTAINER

Removing mysql

INFO ABOUT INSTALLED VERSION OF DOCKER

root@local-root-analysis-3:/home/pauldeyoung# docker version
Server:

TRYING A DOCKER RESTART AND GETTING SAME ERROR

root@local-root-analysis-3:/home/pauldeyoung# service docker restart
Removing mysql

RUNNING A COMMAND FROM THE THREAD TO DIAGNOSE DOCKER ISSUE

root@local-root-analysis-3:/home/pauldeyoung# docker ps -a --filter status="dead"
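As a point of comparison, the usual way to batch-clean those dead containers is a sketch like the one below; on the affected versions the removal may keep failing with the same busy/rename error until whatever holds the mount goes away or the host is rebooted.

```sh
# List only the dead containers, then try to force-remove them.
docker ps -aq --filter status=dead | xargs -r docker rm -f
```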
This is supremely annoying. I have the same issue confirmed on: Docker version 17.06.2-ce, build cec0b72

Completely stopped all containers and the docker service trying to release this directory. No joy. The only thing that is odd is that the directory in question (/var/lib/docker/aufs/diff/1db9ace3915e17dc1a4753b7368598a4d0aacbde37e2f01c5589d3a3ea52f851) is mostly empty directories. The mysqld directory that was in it had a user:group of 999:docker.

It appears that the aufs driver is what has a hold of it. Can't rmmod aufs because it's in use by PID #2. Tried rmmod -f (at my own peril) and got a seg fault. My only recourse at the moment is to reboot when it happens.

Prior to using docker, I rebooted my Linux workstation once a year. It now has to be rebooted with the frequency of a Windows XP machine. Please fix this or post a workaround. Something, somewhere is not properly closing a file handle to this directory.
...if changing to a different storage driver would help, I'm open to that.
This should be fixed on master and I am working on backports for docker 17.12. Typically what happens here is a mount has leaked into another namespace (could be a container, or even another service started by systemd).
btw, you can use this to sniff out what's holding onto the mount reference: https://github.com/rhvgoyal/misc/blob/master/find-busy-mnt.sh
That one is for devmapper, but can be modified for aufs.
Re: find-busy-mnt.sh
And actually looking closer, should work for any graphdriver that does mounts.
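In the same spirit as find-busy-mnt.sh, a small aufs-flavored sketch (the path argument is taken from the errors quoted above; everything else is an assumption, not code from that repository): given an aufs mnt/diff path from the error message, it walks /proc/*/mountinfo to find which processes' mount namespaces still reference it.

```sh
#!/bin/sh
# Usage: ./find-busy-aufs-mnt.sh /var/lib/docker/aufs/mnt/<id>
# Prints PID and command name for every process whose mount namespace still
# contains the given mount path.
mnt_path="$1"

for f in /proc/[0-9]*/mountinfo; do
    if grep -qF "$mnt_path" "$f" 2>/dev/null; then
        pid="${f#/proc/}"; pid="${pid%/mountinfo}"
        printf '%s\t%s\n' "$pid" "$(cat /proc/"$pid"/comm 2>/dev/null)"
    fi
done
```

A container shim or an unrelated systemd-started service showing up here would match the "mount leaked into another namespace" explanation above.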
You can grab a nightly static binary from https://master.dockerproject.org/ and replace
Not really announced yet, but we also now have nightly builds in our apt/yum repository, e.g.: https://download.docker.com/linux/ubuntu/dists/xenial/pool/nightly/
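A rough sketch of the static-binary route mentioned above, assuming a systemd host, dockerd installed at /usr/bin/dockerd, and whatever bundle you pick from https://master.dockerproject.org/ for your platform (the exact URL below is a placeholder, not taken from this thread):

```sh
NIGHTLY_TGZ="https://master.dockerproject.org/linux/x86_64/docker.tgz"   # placeholder path, adjust to the site's layout
curl -fSL "$NIGHTLY_TGZ" -o /tmp/docker-nightly.tgz
tar -xzf /tmp/docker-nightly.tgz -C /tmp           # typically unpacks to /tmp/docker/
sudo systemctl stop docker
sudo cp /usr/bin/dockerd /usr/bin/dockerd.bak      # keep the packaged binary around
sudo cp /tmp/docker/dockerd /usr/bin/dockerd
sudo systemctl start docker
docker version
```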
FWIW, we switched to overlay2 everywhere and haven't had any (graphdriver) issues since. aufs just doesn't appear to work very well (also solves the core-dumps-in-images issue).
Closing because this is fixed in 17.12.1 and 18.03+
I still get this error occasionally, on Debian 8.11 with kernel 3.16.0-6-amd64 and docker-ce 18.06.0. @cpuguy83 can you please reopen?
@moritz What's the exact error you received and how did you get it? Do you have daemon logs for this time period?
@cpuguy83 I'm having this error with
Corresponding daemon log:
Note that Docker 18.06 reached EOL and is no longer maintained; in addition, Ubuntu 14.04 reached EOL on April 30, so I'd recommend upgrading both. Docker 18.06.1 uses a version of runc that has a critical vulnerability (CVE-2019-5736), which has been addressed in Docker 18.06.3, which ships with a patched version of runc; however, running that version requires you to upgrade to a 4.x kernel. If possible, I'd recommend upgrading, and consider using the overlay2 storage driver (which is now the default for all distros). Switching storage drivers means you will no longer have access to existing local images and containers, so you may have to push them to a registry first.
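A sketch of the "push them to a registry first" step, with a placeholder registry and image name (adjust to your setup); docker save is an alternative if you'd rather keep a local archive:

```sh
# Push the images you still need to a registry before switching storage drivers.
docker tag myapp:latest registry.example.com/myapp:latest
docker push registry.example.com/myapp:latest

# Or keep a local tarball instead:
docker save -o /tmp/myapp.tar myapp:latest
```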
This version of docker-ce is the last one that supports Trusty on a 3.x Linux kernel [1]. I'm upgrading as we seem to be experiencing an issue with aufs (see below) on one of our machines. My investigation indicates that this may potentially be a bug in docker-ce that was fixed in 17.12.1 and 18.03 [2], so I expect updating to 18.06.1 will resolve it; if not, at least we're on a more recent version.

```
17:45:53 driver "aufs" failed to remove root filesystem for 1689bc31da539840e42f29043cd62bd7ebab3da05d0d5094aafcaebb90ea1958: could not remove diff path for id 0be2d29298c44335f47beb6a20151f1fa1b9516a551f7a6beea00b62aa880faf: error preparing atomic delete: rename /var/lib/docker/aufs/diff/0be2d29298c44335f47beb6a20151f1fa1b9516a551f7a6beea00b62aa880faf /var/lib/docker/aufs/diff/0be2d29298c44335f47beb6a20151f1fa1b9516a551f7a6beea00b62aa880faf-removing: device or resource busy
```

[1]: https://docs.docker.com/engine/release-notes/18.06/
[2]: moby/moby#21704
I've been seeing this error in our integration tests a lot recently:
This happens when a container is being removed and causes our tests to fail. I've seen it only on aufs so far.
Output of docker version:
Additional environment details (AWS, VirtualBox, physical, etc.):
This is happening on AWS with AUFS
Steps to reproduce the issue:
unknown
Describe the results you received:
500 error from the daemon
Describe the results you expected:
the container should be removed without an error
Additional information you deem important (e.g. issue happens only occasionally):
It happens less than half of the time