Do not make graphdriver homes private mounts. #36047
Conversation
cfd9b42 to 4f1a959
4f1a959 to 362f582
The idea behind making the graphdrivers private is to prevent leaking mounts into other namespaces. Unfortunately this is not really what happens. There is one case where this does work, and that is when the namespace was created before the daemon's namespace. However, with systemd each system service winds up with its own mount namespace. This causes a race between daemon startup and other system services as to whether the mount is actually private. It also has a negative impact when other system services are started while the daemon is running. Basically there are too many things that the daemon does not have control over (nor should it) to be able to protect against these kinds of leakages. One thing is certain: setting the graphdriver roots to private disconnects the mount ns hierarchy, preventing propagation of unmounts. New mounts are of course not propagated either, but the behavior is racy (or just bad in the case of restarting services), so it is better to keep mount propagation intact. It also does not protect situations like `-v /var/lib/docker:/var/lib/docker` where all mounts are recursively bound into the container anyway. Signed-off-by: Brian Goff <cpuguy83@gmail.com>
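To make the propagation point concrete, here is a minimal sketch (not from the PR; run as root, with the paths, tmpfs mounts and the `sleep` placeholder all illustrative) of how marking a mount private cuts an already-running namespace off from later unmount events:

```sh
# Create a shared mount, then a long-lived mount namespace that sees it as a
# slave (roughly what a systemd service gets).
mkdir -p /tmp/demo
mount -t tmpfs tmpfs /tmp/demo
mount --make-shared /tmp/demo

unshare --mount --propagation slave sleep 300 &
NSPID=$!
sleep 1

# A new mount made while /tmp/demo is shared propagates into the other ns.
mkdir -p /tmp/demo/sub
mount -t tmpfs tmpfs /tmp/demo/sub
nsenter --mount -t "$NSPID" findmnt /tmp/demo/sub   # visible in the other ns

# What the graphdriver code used to do: mark the root private.
mount --make-private /tmp/demo

# The unmount no longer propagates, so the other namespace keeps a reference.
umount /tmp/demo/sub
nsenter --mount -t "$NSPID" findmnt /tmp/demo/sub   # still mounted -> leaked
```

In the racy case described above, whether a service's namespace ends up holding such references depends only on start order, which the daemon cannot control.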
362f582 to 9803272
ping @kolyshkin @euank @cyphar PTAL
This LGTM; I included something like this previously, but ended up removing it to get it merged more easily based on comments (#34948 (comment)). @cyphar, if you can recall the reason you preferred leaving it out then, that would be good to discuss here.
LGTM. I remember the previous implementation from @euank and it's a pity it got lost; I think it was fine.
I disagree with this point. There is something that Docker could do to prevent most (if not all) of the leakages we see (though the current […]). When discussing mountpoint leaks with the upstream kernel folks, I had to continually explain that while Docker currently does the "sloppy" thing (we mount everything in the same namespace, which is then implicitly shared by all containers before […]).
That is true. I believe I opened an issue on runc (but maybe it was containerd, since it's initializing runc) to request being able to pass a mount ns for runc to start off with... May be worthwhile to bring this up again. Still, we wouldn't want to set these dirs to private, I think.
Intended effect of this patch is music to my ears. Some testing notes below. Environment (DM deferred features intentionally disabled):
dockerd running in host mount namespace:
simulate effect of this patch:
up the stakes:
test:
Container stop and/or removal would otherwise fail with `device or resource busy`.
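As a hypothetical reconstruction of the kind of steps described above, not the exact commands used (the devicemapper home path and the restarted unit are assumptions):

```sh
# Before the patch the daemon marks the graphdriver home private.
findmnt -o TARGET,PROPAGATION /var/lib/docker/devicemapper   # -> private

# Simulate the effect of this patch: keep the graphdriver home in the host's
# propagation tree.
mount --make-rshared /var/lib/docker/devicemapper

# "Up the stakes": restart some other unit so it captures a fresh copy of the
# mount tree in its own namespace, then exercise container stop/removal.
systemctl restart rsyslog.service
CID=$(docker run -d busybox top)
docker stop "$CID" && docker rm "$CID"   # should not fail with "device or resource busy"
```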
@trapier The issue you're talking about should already have been fixed by #34573 (though you're on an older kernel and they might not have backported the relevant kernel fix that makes detach_mounts() work). @cpuguy83 I still feel that we should make this rslave.
@cyphar `rslave` would translate into private for submounts and have the same problem.
LGTM
@cyphar: I was indeed talking about older kernels without detach_mounts(). Last set of notes I posted indicated benefit when mounts are […]. Before this patch, on a kernel without detach_mounts(), the […]. Same setup as last time:
Simulate effect of this patch:
Test (PASS):
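Again only as a hypothetical illustration rather than the original notes (devicemapper layout assumed, and it assumes no other containers are running):

```sh
# After stopping a container, no mount namespace should still reference its
# layer mount, so removal works even without detach_mounts() in the kernel.
CID=$(docker run -d busybox top)
docker stop "$CID"

# Any process whose mount namespace still held the layer would show up here.
grep -l "devicemapper/mnt/" /proc/*/mountinfo || echo "no leaked layer mounts"

docker rm "$CID"   # PASS: no "device or resource busy"
```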
This was added in moby#36047 just as a way to make sure the tree is fully unmounted on shutdown. For ZFS this could be a breaking change since there was no unmount before; someone could have set up the zfs tree themselves. It would be better, if we really do want the cleanup, to actually walk the unpacked layers checking for mounts rather than doing a blind recursive unmount of the root. BTRFS does not use mounts and does not need to unmount anyway. There was only an unmount to begin with because for some reason the btrfs tree was being mounted with `private` propagation. For the other graphdrivers that still have a recursive unmount here, these were already being unmounted, and performing the recursive unmount shouldn't break anything. If anyone had anything mounted at the graphdriver location it would have been unmounted on shutdown anyway. Signed-off-by: Brian Goff <cpuguy83@gmail.com>
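A rough shell equivalent of the cleanup being discussed (the daemon does this internally in Go; the zfs path is illustrative):

```sh
# Show anything still mounted under the driver root, then recursively unmount
# it; this is the blind recursive unmount the message above calls risky for a
# user-managed ZFS tree.
findmnt -R /var/lib/docker/zfs
umount -R /var/lib/docker/zfs
```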
@jingxiaolu Would have to do some testing around this.
The idea behind making the graphdrivers private is to prevent leaking
mounts into other namespaces.
Unfortunately this is not really what happens.
There is one case where this does work, and that is when the namespace
was created before the daemon's namespace.
However, with systemd each system service winds up with its own mount
namespace. This causes a race between daemon startup and other system
services as to whether the mount is actually private to the namespace or whether it gets propagated as a private reference.
This also means there is a negative impact when other system services
are started while the daemon is running.
Basically there are too many things that the daemon does not have
control over (nor should it) to be able to protect against these kinds
of leakages. One thing is certain: setting the graphdriver roots to
private disconnects the mount ns hierarchy, preventing propagation of
unmounts. New mounts are of course not propagated either, but the
behavior is racy (or just bad in the case of restarting services), so
it's better to keep mount propagation intact.
Setting to private also does not protect situations like
-v /var/lib/docker:/var/lib/docker
where all mounts are recursively bound into the container anyway (this can be mitigated by using unbindable mounts, but that needs to be explored separately for the impact it could have on graphdrivers). This change should fix some cases of `Device or resource busy` errors on container removal (it should cover all of them except the above case, where the daemon root has been bound into a container without the right propagation). Newer kernels would not generally see this error because they use detached unmounts; however, detached unmounts are a band-aid: the issue is still present, just with the resources freed up lazily once all references to the mountpoint are gone, meaning the resources still exist on the host (likely until the next reboot).
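A sketch of that remaining problem case, with illustrative container names and image:

```sh
# "victim" is an ordinary container; its layer mount exists in the host ns.
docker run -d --name victim busybox top

# "pinner" bind-mounts the daemon root, which recursively copies every existing
# layer mount (including victim's) into pinner's mount namespace.
docker run -d --name pinner -v /var/lib/docker:/var/lib/docker busybox top

# On older kernels without detach_mounts(), removing victim can now fail with
# "device or resource busy" because pinner still holds a reference to the layer
# mount; keeping the graphdriver root non-private does not help in this case.
docker rm -f victim
```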