
Error removing intermediate container 20dbfe8b8d9d: Driver devicemapper failed to remove root filesystem #12923

Closed
bparees opened this issue Feb 10, 2017 · 13 comments

bparees (Contributor) commented Feb 10, 2017

2017-02-10T20:30:34.782053000Z Error removing intermediate container 20dbfe8b8d9d: Driver devicemapper failed to remove root filesystem 20dbfe8b8d9d836609fb35fcb7afe644e0a606fdbc3ca769744d65c57240f46a: remove /var/lib/docker/devicemapper/mnt/2ba0d9d550fe1b30b3c59c0c2228deaf81d21635ee5816b9829fdb4be195483b: device or resource busy

as seen in:
https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/93/consoleFull#-62719347577f0ce7e4b0b14b5836ce6d

We previously had #9548 and #9490 in this space, but they got pretty messy and were ultimately closed, and it's not clear to me whether we think this should be working at this point (or maybe we have a bad Docker in our AMIs), so I'm starting the conversation here. Is the situation that:

  1. we still have no idea
  2. we thought it was fixed, apparently not
  3. it is fixed, but our AMI doesn't contain the fix yet
  4. our AMI used to contain the fix and our AMI regressed for some reason
  5. we know the issue but we're still waiting for the fix to make it into our docker distribution
  6. it was fixed and the docker package now contains a regression

?

@stevekuznetsov @jwhonce @runcom

bparees added the component/containers, kind/test-flake, and priority/P1 labels Feb 10, 2017
stevekuznetsov (Contributor) commented

Docker version on that test is:

Client:
 Version:         1.12.5
 API version:     1.24
 Package version: docker-common-1.12.5-8.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1d8f205
 Built:           Wed Dec 21 08:37:50 2016
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.5
 API version:     1.24
 Package version: docker-common-1.12.5-8.el7.x86_64
 Go version:      go1.7.4
 Git commit:      1d8f205
 Built:           Wed Dec 21 08:37:50 2016
 OS/Arch:         linux/amd64

We have not bumped it in a while (we have had it locked to 1.12.5), so maybe we recently got a newer patch?

jwhonce (Contributor) commented Feb 22, 2017

@stevekuznetsov There is a new docker build coming that will add the ID of the device that cannot be deleted. Using that ID and a script from https://bugzilla.redhat.com/show_bug.cgi?id=1391665 we will be able to determine which application is holding that device busy during the attempted delete.
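For reference, a minimal sketch of the kind of check such a script performs, not the Bugzilla script itself: given the mount ID from the error message, look for processes whose mount namespace still references that devicemapper mount. The argument handling and output format here are assumptions.

```bash
#!/bin/bash
# Rough sketch (not the Bugzilla script): list processes whose mount namespace
# still references a devicemapper mount that docker failed to remove.
# Pass the directory name from the error message, e.g.
#   2ba0d9d550fe1b30b3c59c0c2228deaf81d21635ee5816b9829fdb4be195483b
MNT_ID="$1"
[ -n "$MNT_ID" ] || { echo "usage: $0 <devicemapper-mnt-id>" >&2; exit 1; }

for proc in /proc/[0-9]*; do
    # A process holding the mount in another mount namespace will still show
    # the path in its /proc/<pid>/mountinfo after docker unmounts it in its own.
    if grep -q "devicemapper/mnt/${MNT_ID}" "${proc}/mountinfo" 2>/dev/null; then
        pid="${proc#/proc/}"
        echo "pid ${pid}: $(tr '\0' ' ' < "${proc}/cmdline")"
    fi
done
```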

stevekuznetsov (Contributor) commented

@jwhonce these failures happen on ephemeral test VMs, so it will be impossible for someone to come in and run the script. Are you suggesting that we need to create an exit check for the "failed to remove root filesystem ... device or resource busy" message in the log and then run the script?
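Something along those lines could be wired into the VM teardown step; a rough, untested sketch follows. The docker journal unit name, the artifact directory, and the find-busy-mnt.sh script name are assumptions, not part of the existing CI setup.

```bash
#!/bin/bash
# Rough sketch of an exit check: if the docker journal shows the removal
# failure, dump the journal and probe each stuck mount before the VM goes away.
# ARTIFACT_DIR and find-busy-mnt.sh are placeholders for whatever the CI uses.
ARTIFACT_DIR="${ARTIFACT_DIR:-/tmp/artifacts}"
PATTERN="failed to remove root filesystem.*device or resource busy"

if journalctl -u docker --no-pager | grep -q "$PATTERN"; then
    mkdir -p "$ARTIFACT_DIR"
    journalctl -u docker --no-pager > "${ARTIFACT_DIR}/docker-journal.log"
    # Pull out the mount IDs that could not be removed and check who holds each.
    grep -o "devicemapper/mnt/[0-9a-f]*" "${ARTIFACT_DIR}/docker-journal.log" \
        | sed 's|devicemapper/mnt/||' | sort -u \
        | while read -r mnt_id; do
              ./find-busy-mnt.sh "$mnt_id" >> "${ARTIFACT_DIR}/busy-mounts.log"
          done
fi
```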

nanomad commented Feb 23, 2017

Our single-node cluster (ha!) has actually been stuck with this issue for a couple of days (probably due to some upgrade, or sheer coincidence), and we are not able to run some builds.

Any help on how to debug this issue is more than welcome.

I've tried running the script ex post, but I always get "No Pid" as output.

nanomad commented Feb 23, 2017

A bit of differential analysis between a build "type" that succeeds and one that fails, in case it is of any help:

  • They are both based on the same builder image (jorgemoralespou/s2i-java with hash bd7903)
  • They both use maven 3.3.9 to build
The failing one runs npm and webpack during the maven package phase to build the front-end assets (node 7.5.0, npm 4.1.2)
  • Altering the source location of the failing one to a repository that does not use npm and webpack makes the build succeed

It could be that nodejs, npm, or webpack fails to release some kind of resource, which in turn creates issues for Docker or OpenShift.

stevekuznetsov (Contributor) commented

@nanomad Are you able to reproduce the failure consistently with the build that uses npm? We would be very interested in how you set up your cluster (if possible, your MasterConfig) and in the steps to recreate a namespace like the one in which you see the failure. What versions of Docker and OpenShift are you running?
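For anyone collecting that information, a quick sketch of the commands involved; the master-config path assumes a default openshift-ansible layout on the node and may differ on other installs.

```bash
# Versions and storage configuration on the affected node; the master-config
# path below is an assumption based on a default openshift-ansible install.
docker version
docker info | grep -iA3 'storage driver'
oc version
cat /etc/origin/master/master-config.yaml
```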

nanomad commented Feb 24, 2017

@stevekuznetsov We can even go a step further and provide SSH access to RH / OpenShift developers if needed; the machine is not a production environment to begin with. If you would like to do so, just send me a message on GitHub.

The installation details are the following:

stevekuznetsov (Contributor) commented

Cool. @jwhonce would be interested in SSH access, I believe; thank you for your help. Having a reproducer is huge -- I don't think we had one before.

jwhonce (Contributor) commented Feb 24, 2017

@stevekuznetsov The more failures we can capture, the faster we can zero in on the root cause. If it's possible to automate the capture, even better! Thanks.

nanomad commented Feb 24, 2017

@stevekuznetsov, @jwhonce: you should receive an email from me shortly.

imcleod commented Mar 6, 2017

@jwhonce Should this enter the same MODIFIED/believed-to-be-fixed state as https://bugzilla.redhat.com/show_bug.cgi?id=1391665 ? (Or are we uncertain whether this is tracking the same issue?)

cc: @stevekuznetsov

jwhonce assigned bparees and unassigned jwhonce Mar 6, 2017
jwhonce (Contributor) commented Mar 6, 2017

Re-assigned to @bparees as he provided the fix for the bug.

bparees (Contributor, Author) commented Mar 6, 2017

We still see these issues during normal Docker builds when Docker is trying to clean up intermediate containers; however, they do not appear to cause failures. I'll close this for now (since the failure mode should be fixed by the s2i changes that allow the commit to take longer) and open a new issue for the Docker build errors we're seeing.
