Do not make graphdriver homes private mounts. #36047

Merged
merged 1 commit into moby:master
Jan 26, 2018

Conversation

cpuguy83
Member

@cpuguy83 cpuguy83 commented Jan 18, 2018

The idea behind making the graphdrivers private is to prevent leaking
mounts into other namespaces.
Unfortunately this is not really what happens.

There is one case where this does work, and that is when the namespace
was created before the daemon's namespace.
However, with systemd each system service winds up with its own mount
namespace. This causes a race between daemon startup and other system
services as to whether the mount is actually private to the namespace or gets propagated as a private reference.

This also means there is a negative impact when other system services
are started while the daemon is running.

Basically there are too many things that the daemon does not have
control over (nor should it) to be able to protect against these kinds
of leakages. One thing is certain: setting the graphdriver roots to
private disconnects the mount ns hierarchy, preventing propagation of
unmounts. New mounts are of course not propagated either, but the
behavior is racy (or just bad in the case of restarting services), so
it's better to keep mount propagation intact.
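
To make the propagation point concrete, here is a minimal sketch (not from the PR; the tmpfs and paths are made up purely for the demo) of how marking a shared mount private cuts off unmount propagation to a namespace that already holds a copy of it:

# Create a shared mount in the host namespace (scratch tmpfs, illustrative only)
sudo mkdir -p /tmp/propdemo
sudo mount -t tmpfs tmpfs /tmp/propdemo
sudo mount --make-shared /tmp/propdemo

# Hold a copy of it in a second mount namespace that keeps the existing
# propagation (roughly what a systemd service with its own mount ns does)
sudo unshare --mount --propagation unchanged sleep 600 &

# Making the mount private in the host breaks the peer relationship, so the
# unmount below no longer propagates; the copy inside the other namespace
# keeps the filesystem referenced (inspect it with
# nsenter --target <sleep-pid> --mount findmnt /tmp/propdemo)
sudo mount --make-private /tmp/propdemo
sudo umount /tmp/propdemo

With the mount left shared instead, that final umount would propagate into the other namespace as well.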

Setting to private also does not protect against situations like `-v /var/lib/docker:/var/lib/docker`, where all mounts are recursively bound into the container anyway (this can be mitigated by using unbindable mounts, but the impact that would have on graphdrivers needs to be explored separately).
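
As a hedged aside on that unbindable idea (just a sketch of the mechanism, not something this change does; it assumes the graphdriver home can be turned into a mountpoint with a self bind):

# Make the graphdriver home a mountpoint, then mark it unbindable so that a
# recursive bind of /var/lib/docker (e.g. -v /var/lib/docker:/var/lib/docker)
# prunes this subtree instead of copying its mounts into the container
sudo mount --bind /var/lib/docker/devicemapper /var/lib/docker/devicemapper
sudo mount --make-runbindable /var/lib/docker/devicemapper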

This change should fix some cases of `Device or resource busy` errors on container removal (all cases, in fact, except the one above where the daemon root has been bound into a container without the right propagation).
Newer kernels would generally not see this error because they use detached unmounts; however, detached unmounts are a band-aid. The issue is still present, it's just that the resources are freed lazily once all references to the mountpoint are gone, meaning they still exist on the host (likely until the next reboot).
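
For reference, a detached unmount is essentially a lazy unmount; a one-line sketch (the mountpoint path here is hypothetical):

# The mount disappears from the tree immediately, but the filesystem is only
# torn down once the last holder, e.g. a copy leaked into another mount
# namespace, releases it
sudo umount --lazy /var/lib/docker/devicemapper/mnt/example   # hypothetical path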

The idea behind making the graphdrivers private is to prevent leaking
mounts into other namespaces.
Unfortunately this is not really what happens.

There is one case where this does work, and that is when the namespace
was created before the daemon's namespace.
However with systemd each system service winds up with its own mount
namespace. This causes a race between daemon startup and other system
services as to whether the mount is actually private.

This also means there is a negative impact when other system services
are started while the daemon is running.

Basically there are too many things that the daemon does not have
control over (nor should it) to be able to protect against these kinds
of leakages. One thing is certain, setting the graphdriver roots to
private disconnects the mount ns hierarchy preventing propagation of
unmounts... new mounts are of course not propagated either, but the
behavior is racy (or just bad in the case of restarting services)... so
it's better to just be able to keep mount propagation intact.

It also does not protect situations like `-v
/var/lib/docker:/var/lib/docker` where all mounts are recursively bound
into the container anyway.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@thaJeztah
Member

ping @kolyshkin @euank @cyphar PTAL

@euank
Contributor

euank commented Jan 18, 2018

This LGTM; I included something like this previously, but ended up removing it based on review comments so it could be merged more easily (#34948 (comment)).

@cyphar, if you can recall the reason you preferred leaving it out then, that would be good to discuss here.

@kolyshkin
Contributor

LGTM

I remember the previous implementation from @euank, and it's a pity it got lost; I think it was fine.

@cyphar
Contributor

cyphar commented Jan 19, 2018

Basically there are too many things that the daemon does not have
control over (nor should it) to be able to protect against these kinds
of leakages.

I disagree with this point. There is something that Docker could do to prevent most (if not all) of the leakages we see (though the current containerd+dockerd architecture makes it even more difficult). For each container spawned, we create a new mount namespace and then mount the image in an rslave or rprivate mount so it doesn't propagate to the host. We then run runc underneath this new mount namespace so it can see the rootfs, but the host (and any other containers) cannot see the rootfs. This would prevent almost all leaks. However, the problem is that due to how containerd "inherits" the mounts from dockerd (or at least, that's how it used to work last I checked), it's non-trivial to implement the runc spawning from inside the mount namespace created when the rootfs is mounted.
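
For what it's worth, a rough shell sketch of that idea (purely illustrative; the overlay paths, bundle path, and container id are hypothetical, and this is not how dockerd/containerd are wired today): mount the rootfs inside a fresh mount namespace with slave propagation and exec runc from there, so the rootfs never shows up in the host or in other containers.

# New mount namespace with (r)slave propagation: mounts made in here do not
# propagate back to the host, while host mounts still propagate in
sudo unshare --mount --propagation slave sh -c '
  # Mount the container rootfs so that only this namespace sees it (hypothetical paths)
  mount -t overlay overlay \
    -o lowerdir=/var/lib/docker/overlay2/l/lower,upperdir=/var/lib/docker/overlay2/up/diff,workdir=/var/lib/docker/overlay2/up/work \
    /run/bundle/rootfs
  # Launch the container from inside the same namespace so runc can see the rootfs
  exec runc run --bundle /run/bundle demo-container
'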

When discussing mountpoint leaks with the upstream kernel folks, I had to continually explain that while Docker currently does the "sloppy" thing (we mount everything in the same namespace, which is then implicitly shared by all containers before pivot_root, which leads us to the whole mountpoint-leaking mess), there are some underlying kernel bugs that mean even if Docker did the right thing we'd still have DoS opportunities.

@cpuguy83
Member Author

That is true. I believe I previously opened an issue on runc (but maybe it was containerd, since it's the one initializing runc) requesting the ability to pass in a mount ns for runc to start off with. May be worthwhile to bring this up again.

Still we wouldn't want to set these dirs to private, I think.

@trapier
Contributor

trapier commented Jan 24, 2018

The intended effect of this patch is music to my ears. Some testing notes below.

environment (DM deferred features intentionally disabled):

[vagrant@node02 ~]$ docker info |grep -Ee '^(Server|Kernel) Version' -e 'Storage' -e 'Deferred.*Enabled' -e 'Pool Name'
WARNING: bridge-nf-call-ip6tables is disabled
Server Version: 17.12.0-ce
Storage Driver: devicemapper
 Pool Name: docker-thinpool
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
Kernel Version: 3.10.0-693.5.2.el7.x86_64

dockerd running in host mount namespace:

[vagrant@node02 ~]$ sudo readlink /proc/{1,$(pidof dockerd)}/ns/mnt
mnt:[4026531840]
mnt:[4026531840]

simulate effect of this patch:

[vagrant@node02 ~]$ findmnt -o TARGET,PROPAGATION /
TARGET PROPAGATION
/      shared
[vagrant@node02 ~]$ findmnt -o TARGET,PROPAGATION /var/lib/docker/devicemapper
TARGET                       PROPAGATION
/var/lib/docker/devicemapper private
[vagrant@node02 ~]$ sudo mount --make-shared /var/lib/docker/devicemapper
[vagrant@node02 ~]$ findmnt -o TARGET,PROPAGATION /var/lib/docker/devicemapper
TARGET                       PROPAGATION
/var/lib/docker/devicemapper shared

up the stakes:

[vagrant@node02 ~]$ sysctl fs.may_detach_mounts
fs.may_detach_mounts = 1
[vagrant@node02 ~]$ sudo sysctl -w fs.may_detach_mounts=0
fs.may_detach_mounts = 0

test:

[vagrant@node02 ~]$ docker run -d --name idbeazombieifhomewereprivate nginx
256cd9ede28eb8c68c8844d14341206d1944af0a56a65482a3639868907944c9
[vagrant@node02 ~]$ sudo systemctl restart ntpd
[vagrant@node02 ~]$ docker stop idbeazombieifhomewereprivate
idbeazombieifhomewereprivate
[vagrant@node02 ~]$ docker rm idbeazombieifhomewereprivate
idbeazombieifhomewereprivate

Container stop and/or removal would otherwise fail with `device or resource busy`, because private propagation on /var/lib/docker/$graphdriver prevents the container unmount from propagating to ntpd's namespace.
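
For anyone reproducing this, one way to double-check where the mounts ended up (a sketch using the same tools as above; it assumes nsenter is available and ntpd is the restarted service from the test):

# Are ntpd and PID 1 (still) in the same mount namespace?
sudo readlink /proc/1/ns/mnt /proc/$(pidof ntpd)/ns/mnt
# List any docker-related mounts visible from inside ntpd's mount namespace
sudo nsenter --target "$(pidof ntpd)" --mount -- findmnt -o TARGET,PROPAGATION | grep /var/lib/docker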

@cyphar
Contributor

cyphar commented Jan 25, 2018

@trapier The issue you're talking about should already have been fixed by #34573 (though you're on an older kernel and they might not have backported the relevant kernel fix that makes rmdir force an unmount in all non-bind cases). As we discussed in #34542, doing an rmdir is a much more solid fix for that particular leak issue than changing the propagation (because a malicious process can change the propagation as an unprivileged user anyway -- it's not a perfect fix because of bind-mounts but that's a kernel issue at the moment).

@cpuguy83 I still feel that we should make this rslave.

@cpuguy83
Member Author

@cyphar rslave would translate into private for submounts and have the same problem.

@crosbymichael
Contributor

LGTM

@crosbymichael crosbymichael merged commit 2c05aef into moby:master Jan 26, 2018
@cpuguy83 cpuguy83 deleted the graphdriver_improvements branch January 26, 2018 18:55
@trapier
Contributor

trapier commented Jan 26, 2018

@cyphar: I was indeed talking about older kernels without detach_mounts().

The last set of notes I posted showed the benefit for mounts that end up unshared into the namespaces of other processes started from the host mount namespace. There is an additional benefit for mounts shared between containers.

Before this patch, on a kernel without detach_mounts(), the `rslave` on `-v /var/lib/docker:/hostdocker:rslave` did not have the desired effect for mounts under /var/lib/docker/$graphdriver, because private propagation there blocks umount propagation.

Same setup as last time:

[vagrant@node02 ~]$ docker info |grep -Ee '^(Server|Kernel) Version' -e 'Storage' -e 'Deferred.*Enabled' -e 'Pool Name'
WARNING: bridge-nf-call-ip6tables is disabled
Server Version: 17.12.0-ce
Storage Driver: devicemapper
 Pool Name: docker-thinpool
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
Kernel Version: 3.10.0-693.5.2.el7.x86_64
[vagrant@node02 ~]$ sudo sysctl -w fs.may_detach_mounts=0
fs.may_detach_mounts = 0

Simulate effect of this patch:

[vagrant@node02 ~]$ sudo mount --make-shared /var/lib/docker/devicemapper

Test (PASS):

[vagrant@node02 ~]$ docker run -d --name clungto kubernetes/pause                                                                                                                                                                             
e4e719ee90b1d0e69637901e2d115233d612ee10816ad6b1eaf4f7940f0ad927
[vagrant@node02 ~]$ docker run -d --name clingy -v /var/lib/docker:/var/lib/docker:rslave kubernetes/pause
fea3737eae8df753437f65ea8ab0c411d275fd0f80ff263a40fbb5542f1734dd
[vagrant@node02 ~]$ docker stop clungto
clungto
[vagrant@node02 ~]$ docker rm clungto
clungto

cpuguy83 added a commit to cpuguy83/docker that referenced this pull request Feb 7, 2018
This was added in moby#36047 just as a way to make sure the tree is fully
unmounted on shutdown.

For ZFS this could be a breaking change since there was no unmount before.
Someone could have set up the zfs tree themselves. It would be better, if
we really do want the cleanup, to check the unpacked layers for mounts
rather than doing a blind recursive unmount of the root.

BTRFS does not use mounts and does not need to unmount anyway.
There was only an unmount to begin with because for some reason the
btrfs tree was being mounted with `private` propagation.

For the other graphdrivers that still have a recursive unmount here...
these were already being unmounted and performing the recursive unmount
shouldn't break anything. If anyone had anything mounted at the
graphdriver location it would have been unmounted on shutdown anyway.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@jingxiaolu

@cpuguy83 I do agree with you that remounting the root as rprivate may be meaningless, but I found that @rhatdan added this remount in #4884 to fix a scaling issue.

What's your comments on this scaling issue?

@cpuguy83
Member Author

@jingxiaolu Would have to do some testing around this.
The problem is the remount is utterly broken and prevents layer removal.
