Do not make graphdriver homes private mounts. #36047

Merged
merged 1 commit into moby:master
Jan 26, 2018

Conversation

cpuguy83
Member

@cpuguy83 cpuguy83 commented Jan 18, 2018

The idea behind making the graphdrivers private is to prevent leaking
mounts into other namespaces.
Unfortunately this is not really what happens.

There is one case where this does work, and that is when the namespace
was created before the daemon's namespace.
However, with systemd each system service winds up with its own mount
namespace. This causes a race between daemon startup and other system
services as to whether the mount is actually private to the namespace or gets propagated as a private reference.

This also means there is a negative impact when other system services
are started while the daemon is running.

Basically there are too many things that the daemon does not have
control over (nor should it) to be able to protect against these kinds
of leakages. One thing is certain: setting the graphdriver roots to
private disconnects the mount ns hierarchy, preventing propagation of
unmounts. New mounts are of course not propagated either, but the
behavior is racy (or just bad in the case of restarting services), so
it's better to keep mount propagation intact.
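
To make the propagation point concrete, here is a minimal sketch (not from the PR; the tmpfs and paths are made up purely for the demo) of how marking a shared mount private cuts off unmount propagation to a namespace that already holds a copy of it:

# Create a shared mount in the host namespace (scratch tmpfs, illustrative only)
sudo mkdir -p /tmp/propdemo
sudo mount -t tmpfs tmpfs /tmp/propdemo
sudo mount --make-shared /tmp/propdemo

# Hold a copy of it in a second mount namespace that keeps the existing
# propagation (roughly what a systemd service with its own mount ns does)
sudo unshare --mount --propagation unchanged sleep 600 &

# Making the mount private in the host breaks the peer relationship, so the
# unmount below no longer propagates; the copy inside the other namespace
# keeps the filesystem referenced (inspect it with
# nsenter --target <sleep-pid> --mount findmnt /tmp/propdemo)
sudo mount --make-private /tmp/propdemo
sudo umount /tmp/propdemo

With the mount left shared instead, that final umount would propagate into the other namespace as well.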

Setting to private also does not protect against situations like `-v /var/lib/docker:/var/lib/docker`, where all mounts are recursively bound into the container anyway (this can be mitigated by using unbindable mounts, but the impact that would have on graphdrivers needs to be explored separately).
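
As a hedged aside on that unbindable idea (just a sketch of the mechanism, not something this change does; it assumes the graphdriver home can be turned into a mountpoint with a self bind):

# Make the graphdriver home a mountpoint, then mark it unbindable so that a
# recursive bind of /var/lib/docker (e.g. -v /var/lib/docker:/var/lib/docker)
# prunes this subtree instead of copying its mounts into the container
sudo mount --bind /var/lib/docker/devicemapper /var/lib/docker/devicemapper
sudo mount --make-runbindable /var/lib/docker/devicemapper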

This change should fix some cases of `Device or resource busy` errors on container removal (all cases, in fact, except the one above where the daemon root has been bound into a container without the right propagation).
Newer kernels would generally not see this error because they use detached unmounts; however, detached unmounts are a band-aid. The issue is still present, it's just that the resources are freed lazily once all references to the mountpoint are gone, meaning they still exist on the host (likely until the next reboot).
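
For reference, a detached unmount is essentially a lazy unmount; a one-line sketch (the mountpoint path here is hypothetical):

# The mount disappears from the tree immediately, but the filesystem is only
# torn down once the last holder, e.g. a copy leaked into another mount
# namespace, releases it
sudo umount --lazy /var/lib/docker/devicemapper/mnt/example   # hypothetical path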

The idea behind making the graphdrivers private is to prevent leaking
mounts into other namespaces.
Unfortunately this is not really what happens.

There is one case where this does work, and that is when the namespace
was created before the daemon's namespace.
However with systemd each system service winds up with its own mount
namespace. This causes a race between daemon startup and other system
services as to whether the mount is actually private.

This also means there is a negative impact when other system services
are started while the daemon is running.

Basically there are too many things that the daemon does not have
control over (nor should it) to be able to protect against these kinds
of leakages. One thing is certain, setting the graphdriver roots to
private disconnects the mount ns hierarchy preventing propagation of
unmounts... new mounts are of course not propagated either, but the
behavior is racy (or just bad in the case of restarting services)... so
it's better to just be able to keep mount propagation intact.

It also does not protect situations like `-v
/var/lib/docker:/var/lib/docker` where all mounts are recursively bound
into the container anyway.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@thaJeztah
Member

ping @kolyshkin @euank @cyphar PTAL

@euank
Contributor

euank commented Jan 18, 2018

This LGTM; I included something like this previously, but ended up removing it based on review comments so it could be merged more easily (#34948 (comment)).

@cyphar, if you can recall the reason you preferred leaving it out then, that would be good to discuss here.

@kolyshkin
Contributor

LGTM

I remember the previous implementation from @euank, and it's a pity it got lost; I think it was fine.

@cyphar
Contributor

cyphar commented Jan 19, 2018

Basically there are too many things that the daemon does not have
control over (nor should it) to be able to protect against these kinds
of leakages.

I disagree with this point. There is something that Docker could do to prevent most (if not all) of the leakages we see (though the current containerd+dockerd architecture makes it even more difficult). For each container spawned, we create a new mount namespace and then mount the image in an rslave or rprivate mount so it doesn't propagate to the host. We then run runc underneath this new mount namespace so it can see the rootfs, but the host (and any other containers) cannot see the rootfs. This would prevent almost all leaks. However, the problem is that due to how containerd "inherits" the mounts from dockerd (or at least, that's how it used to work last I checked), it's non-trivial to implement the runc spawning from inside the mount namespace created when the rootfs is mounted.
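
For what it's worth, a rough shell sketch of that idea (purely illustrative; the overlay paths, bundle path, and container id are hypothetical, and this is not how dockerd/containerd are wired today): mount the rootfs inside a fresh mount namespace with slave propagation and exec runc from there, so the rootfs never shows up in the host or in other containers.

# New mount namespace with (r)slave propagation: mounts made in here do not
# propagate back to the host, while host mounts still propagate in
sudo unshare --mount --propagation slave sh -c '
  # Mount the container rootfs so that only this namespace sees it (hypothetical paths)
  mount -t overlay overlay \
    -o lowerdir=/var/lib/docker/overlay2/l/lower,upperdir=/var/lib/docker/overlay2/up/diff,workdir=/var/lib/docker/overlay2/up/work \
    /run/bundle/rootfs
  # Launch the container from inside the same namespace so runc can see the rootfs
  exec runc run --bundle /run/bundle demo-container
'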

When discussing mountpoint leaks with the upstream kernel folks, I had to continually explain that while Docker currently does the "sloppy" thing (we mount everything in the same namespace, which is then implicitly shared by all containers before pivot_root, which leads us to the whole mountpoint-leaking mess), there are some underlying kernel bugs that mean even if Docker did the right thing we'd still have DoS opportunities.

@cpuguy83
Member Author

That is true. I believe I previously opened an issue on runc (but maybe it was containerd, since it's the one initializing runc) requesting the ability to pass in a mount ns for runc to start off with. May be worthwhile to bring this up again.

Still we wouldn't want to set these dirs to private, I think.

@trapier
Contributor

trapier commented Jan 24, 2018

The intended effect of this patch is music to my ears. Some testing notes below.

environment (DM deferred features intentionally disabled):

[vagrant@node02 ~]$ docker info |grep -Ee '^(Server|Kernel) Version' -e 'Storage' -e 'Deferred.*Enabled' -e 'Pool Name'
WARNING: bridge-nf-call-ip6tables is disabled
Server Version: 17.12.0-ce
Storage Driver: devicemapper
 Pool Name: docker-thinpool
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
Kernel Version: 3.10.0-693.5.2.el7.x86_64

dockerd running in host mount namespace:

[vagrant@node02 ~]$ sudo readlink /proc/{1,$(pidof dockerd)}/ns/mnt
mnt:[4026531840]
mnt:[4026531840]

simulate effect of this patch:

[vagrant@node02 ~]$ findmnt -o TARGET,PROPAGATION /
TARGET PROPAGATION
/      shared
[vagrant@node02 ~]$ findmnt -o TARGET,PROPAGATION /var/lib/docker/devicemapper
TARGET                       PROPAGATION
/var/lib/docker/devicemapper private
[vagrant@node02 ~]$ sudo mount --make-shared /var/lib/docker/devicemapper
[vagrant@node02 ~]$ findmnt -o TARGET,PROPAGATION /var/lib/docker/devicemapper
TARGET                       PROPAGATION
/var/lib/docker/devicemapper shared

up the stakes:

[vagrant@node02 ~]$ sysctl fs.may_detach_mounts
fs.may_detach_mounts = 1
[vagrant@node02 ~]$ sudo sysctl -w fs.may_detach_mounts=0
fs.may_detach_mounts = 0

test:

[vagrant@node02 ~]$ docker run -d --name idbeazombieifhomewereprivate nginx
256cd9ede28eb8c68c8844d14341206d1944af0a56a65482a3639868907944c9
[vagrant@node02 ~]$ sudo systemctl restart ntpd
[vagrant@node02 ~]$ docker stop idbeazombieifhomewereprivate
idbeazombieifhomewereprivate
[vagrant@node02 ~]$ docker rm idbeazombieifhomewereprivate
idbeazombieifhomewereprivate

Container stop and/or removal would otherwise fail with `device or resource busy`, because private propagation on /var/lib/docker/$graphdriver prevents the container unmount from propagating to ntpd's namespace.
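
For anyone reproducing this, one way to double-check where the mounts ended up (a sketch using the same tools as above; it assumes nsenter is available and ntpd is the restarted service from the test):

# Are ntpd and PID 1 (still) in the same mount namespace?
sudo readlink /proc/1/ns/mnt /proc/$(pidof ntpd)/ns/mnt
# List any docker-related mounts visible from inside ntpd's mount namespace
sudo nsenter --target "$(pidof ntpd)" --mount -- findmnt -o TARGET,PROPAGATION | grep /var/lib/docker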

@cyphar
Contributor

cyphar commented Jan 25, 2018

@trapier The issue you're talking about should already have been fixed by #34573 (though you're on an older kernel and they might not have backported the relevant kernel fix that makes rmdir force an unmount in all non-bind cases). As we discussed in #34542, doing an rmdir is a much more solid fix for that particular leak issue than changing the propagation (because a malicious process can change the propagation as an unprivileged user anyway -- it's not a perfect fix because of bind-mounts but that's a kernel issue at the moment).

@cpuguy83 I still feel that we should make this rslave.

@cpuguy83
Member Author

@cyphar rslave would translate into private for submounts and have the same problem.

@crosbymichael
Contributor

LGTM

@crosbymichael crosbymichael merged commit 2c05aef into moby:master Jan 26, 2018
@cpuguy83 cpuguy83 deleted the graphdriver_improvements branch January 26, 2018 18:55
@trapier
Contributor

trapier commented Jan 26, 2018

@cyphar: I was indeed talking about older kernels without detach_mounts().

The last set of notes I posted showed the benefit for mounts that end up unshared into the namespaces of other processes started from the host mount namespace. There is an additional benefit for mounts shared between containers.

Before this patch, on a kernel without detach_mounts(), the `rslave` on `-v /var/lib/docker:/hostdocker:rslave` did not have the desired effect for mounts under /var/lib/docker/$graphdriver, because private propagation there blocks umount propagation.

Same setup as last time:

[vagrant@node02 ~]$ docker info |grep -Ee '^(Server|Kernel) Version' -e 'Storage' -e 'Deferred.*Enabled' -e 'Pool Name'
WARNING: bridge-nf-call-ip6tables is disabled
Server Version: 17.12.0-ce
Storage Driver: devicemapper
 Pool Name: docker-thinpool
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
Kernel Version: 3.10.0-693.5.2.el7.x86_64
[vagrant@node02 ~]$ sudo sysctl -w fs.may_detach_mounts=0
fs.may_detach_mounts = 0

Simulate effect of this patch:

[vagrant@node02 ~]$ sudo mount --make-shared /var/lib/docker/devicemapper

Test (PASS):

[vagrant@node02 ~]$ docker run -d --name clungto kubernetes/pause                                                                                                                                                                             
e4e719ee90b1d0e69637901e2d115233d612ee10816ad6b1eaf4f7940f0ad927
[vagrant@node02 ~]$ docker run -d --name clingy -v /var/lib/docker:/var/lib/docker:rslave kubernetes/pause
fea3737eae8df753437f65ea8ab0c411d275fd0f80ff263a40fbb5542f1734dd
[vagrant@node02 ~]$ docker stop clungto
clungto
[vagrant@node02 ~]$ docker rm clungto
clungto

cpuguy83 added a commit to cpuguy83/docker that referenced this pull request Feb 7, 2018
This was added in moby#36047 just as a way to make sure the tree is fully
unmounted on shutdown.

For ZFS this could be a breaking change since there was no unmount before.
Someone could have set up the zfs tree themselves. It would be better, if
we really do want the cleanup, to check the unpacked layers for mounts
rather than doing a blind recursive unmount of the root.

BTRFS does not use mounts and does not need to unmount anyway.
There was only an unmount to begin with because for some reason the
btrfs tree was being mounted with `private` propagation.

For the other graphdrivers that still have a recursive unmount here...
these were already being unmounted and performing the recursive unmount
shouldn't break anything. If anyone had anything mounted at the
graphdriver location it would have been unmounted on shutdown anyway.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@jingxiaolu

@cpuguy83 I do agree with you that remounting the root as rprivate may be meaningless, but I found that @rhatdan added this remount in #4884 to fix a scaling issue.

What's your comments on this scaling issue?

@cpuguy83
Member Author

@jingxiaolu Would have to do some testing around this.
The problem is the remount is utterly broken and prevents layer removal.
