Fix EBUSY errors under overlayfs and v4.13+ kernels #34948
This removes and recreates the merged dir with each umount/mount
It's fairly easy to accidentally leak mountpoints (even if moby doesn't,
As of recently, overlayfs reacts to these mounts being leaked (see
One trick to force an unmount is to remove the mounted directory and
- Description for the changelog
Fix upperdir in use warnings under overlayfs and v4.13+ kernels
- A picture of a cute animal (not mandatory but encouraged)
referenced this pull request
Sep 22, 2017
Ok, so kernel commit
Secondly, even if you make
I feel we can try to minimize mount point leaks but being able to take care of all corner cases will be very hard. So I am wondering if a solution should come from kernel side instead.
I am wondering whether it would make sense to convert the kernel error message to a warning message. Or is there a way for the kernel to figure out that we are essentially using the same set of lower/upper/work directories and, instead of instantiating a new super block, reuse the existing super block (which is still around due to the busy mount)? Amir will know much more, so I am cc'ing him.
@rhvgoyal overlayfs is right to complain, so a kernel warning is not good enough.
@tonistiigi The root cause of the problem IMO is the chroot() does not Unmount
That's not true. I can reliably reproduce a mount point leak to runc (leaked for 100ms to 1s) without this patch. With this patch I can't. It therefore still makes sense to try and avoid it.
Because the code here uses MNT_DETACH as of #33638 we're opting in to async behavior either way. I think that PR is technically wrong, but that's something to address separate from this.
False. runc makes the mountpoints rslave. See either the original issue that I say this fixes (and my discussion there) or my PR against runc and the defaulting code it references.
If they do
A better solution would be something like @cyphar suggests imo.
A good first step there would be to not do "mount umount mount umount mount" to start a container, but instead just mount it once and leave it mounted.
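The "mount it once and leave it mounted" idea above can be sketched as a reference-counted mounter. This is only an illustration under stated assumptions: `refMounter`, `mountFunc`, `unmountFunc`, and the path `/example/merged` are hypothetical names invented for this sketch, not moby or runc APIs, and the hooks stand in for the real mount and unmount calls.

```go
package main

import (
	"fmt"
	"sync"
)

// mountFunc and unmountFunc are hypothetical hooks standing in for the
// real mount/unmount calls; they are not moby APIs.
type mountFunc func(path string) error
type unmountFunc func(path string) error

// refMounter keeps a per-path reference count so the underlying mount
// happens once and is torn down only when the last user releases it.
type refMounter struct {
	mu      sync.Mutex
	counts  map[string]int
	mount   mountFunc
	unmount unmountFunc
}

func newRefMounter(m mountFunc, u unmountFunc) *refMounter {
	return &refMounter{counts: make(map[string]int), mount: m, unmount: u}
}

// Get mounts path on first use and otherwise only bumps the count.
func (r *refMounter) Get(path string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.counts[path] == 0 {
		if err := r.mount(path); err != nil {
			return err
		}
	}
	r.counts[path]++
	return nil
}

// Put drops one reference and unmounts only when none remain.
func (r *refMounter) Put(path string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	n := r.counts[path]
	if n == 0 {
		return fmt.Errorf("%s is not mounted", path)
	}
	if n > 1 {
		r.counts[path] = n - 1
		return nil
	}
	delete(r.counts, path)
	return r.unmount(path)
}

func main() {
	mounts, unmounts := 0, 0
	r := newRefMounter(
		func(string) error { mounts++; return nil },
		func(string) error { unmounts++; return nil },
	)
	// Three users of the same path trigger one mount and one unmount.
	for i := 0; i < 3; i++ {
		_ = r.Get("/example/merged")
	}
	for i := 0; i < 3; i++ {
		_ = r.Put("/example/merged")
	}
	fmt.Println("mounts:", mounts, "unmounts:", unmounts)
}
```

With the counting in one place, a container start sequence would call `Get` and `Put` instead of mounting and unmounting directly, so repeated setup steps would not cycle the mount.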
Read my original issue (#34672) where I talk about the actual issue I'm trying to solve. There may be more issues, but the issue I'm trying to fix does not actually use any of that code so it's not closely related to what I see as the root problem.
I agree the things you pointed out are problems (though I think the solution is making it
This PR is for fixing something else though. This is to fix the case where mounting
Are you willing to review this change, or do you have specific questions about this issue that aren't answered already by my comments in #34672?
I can only ACK the change that doesn't make home dir private. I don't know enough about the big picture to say if var/lib/docker/overlay is always shared to begin with.
I suppose it makes little sense to make home dir private for other graph drivers? And the rslave change should be made to chroot() as well. Otherwise this fix is correct but partial. You shouldn't fix just the problem you see in front of you when you know there is a bigger issue to solve. But it's fine with me if this is the first step.
@amir73il I apologize if I came off a little roughly. The internet does make it hard to convey tone properly.
Thanks! What I'm looking for here is making sure this change is in the right direction
I think it generally will be, especially in a systemd world. It's possible this should be swapping
It might, but I'm not an expert. I do know there are some related issues floating around for devicemapper at the least.
Yup, happy to look at that and changing the MNT_DETACH bit later. One step at a time.
@amir73il Thinking more about it, it feels like a regression to me (from the userspace point of view). Sure, we can try modifying user space to not leak mount points, but a kernel upgrade will still break existing container runtimes.
And making sure the setup is completely right and none of the mount points are leaking is hard.
So if I take a device
IMHO, either we need to implement same behavior for overlay or reduce the level from error to warning.
And this is irrespective of docker changes. Sure try to reduce the amount of leaked mount points, that can only help. But it still feels like a regression.
@amir73il In some ways it feels like a DoS to me. If an overlay mount point has leaked somewhere, then root cannot create a new overlay mount point and get to its data.
Now root can't access its own data due to a leaked mount point in some process's mount namespace. And that does not sound desirable to me (given how easy it is to leak mount points).
referenced this pull request
Sep 29, 2017
In my limited testing, this patch indeed helps with EBUSY on /merged on removal, which I can reproduce easily on a RHEL 7.4 system with
This is not the ultimate fix though, as I still see occasional EBUSY on /shm removal (which is somewhat
Is there a separate issue or bugzilla filed for the
Also, review ping! It seems like this PR has stalled out, but I don't think I've gotten any actionable requests / reasons why it's stalled.
Even if overlayfs lets us get away with leaking mounts, I don't think we should be and so IMO this should still be merged.
In that model, I've currently got it using
On a 4.13 kernel, that will either fail, or on newer 4.13 it'll give a scary warning in dmesg and succeed.
However, on any kernel you can see that the mounts shown after "c2: still in runc init, current view of mounts" include the same c1 mount with the same mount ID, not a new mount with a different mount ID.
That's the leak I'm trying to squash.
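A check like the one described above (the same mount ID showing up in a second process's view) could be automated by comparing two mountinfo dumps. The sketch below assumes the documented `/proc/[pid]/mountinfo` layout, where the first field is the mount ID and the fifth is the mount point; the sample lines, IDs, and paths are made up for illustration and are not taken from the test in this thread.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// mountEntry holds the two mountinfo fields this sketch cares about.
type mountEntry struct {
	ID         int
	MountPoint string
}

// parseMountinfo reads mountinfo-formatted text: field 1 is the mount ID
// and field 5 is the mount point.
func parseMountinfo(data string) ([]mountEntry, error) {
	var out []mountEntry
	for _, line := range strings.Split(data, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) < 5 {
			return nil, fmt.Errorf("short mountinfo line: %q", line)
		}
		id, err := strconv.Atoi(fields[0])
		if err != nil {
			return nil, err
		}
		out = append(out, mountEntry{ID: id, MountPoint: fields[4]})
	}
	return out, nil
}

// sharedMountIDs returns the mount IDs from b that also appear in a.
func sharedMountIDs(a, b []mountEntry) []int {
	seen := make(map[int]bool, len(a))
	for _, e := range a {
		seen[e.ID] = true
	}
	var ids []int
	for _, e := range b {
		if seen[e.ID] {
			ids = append(ids, e.ID)
		}
	}
	return ids
}

func main() {
	// Hypothetical dumps; real input would come from /proc/<pid>/mountinfo.
	viewA := "120 36 0:50 / /example/c1/merged rw,relatime - overlay overlay rw\n"
	viewB := "120 36 0:50 / /example/c1/merged rw,relatime - overlay overlay rw\n" +
		"131 36 0:51 / /example/c2/merged rw,relatime - overlay overlay rw\n"

	a, err := parseMountinfo(viewA)
	if err != nil {
		panic(err)
	}
	b, err := parseMountinfo(viewB)
	if err != nil {
		panic(err)
	}
	fmt.Println("mount IDs visible in both views:", sharedMountIDs(a, b))
}
```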
(note: the below explanation might be wrong in some way; experimentally the above is true, but I wouldn't be surprised if I'm misunderstanding some intricacy of mount subtrees here)
Using rslave doesn't fix that leak because the leak is really two layers deep (runc has its own rslave subtree), and thus the umount still doesn't forward, since slave mounts are only defined to receive from their master, not to propagate as shared mounts do.
If we switch it to explicitly be