Skip to content

daemon mount namespaces #10225

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 22, 2015
Merged

Conversation

vbatts
Copy link
Contributor

@vbatts vbatts commented Jan 20, 2015

This change covers two of the contrib init's. The daemon ought to be invoked in its own mount namespace, unshared from the root mount namespace.

Some of the error most often seen affect the devicemapper graph driver, seen as EBUSY or "device or resource busy".
Though others are likely to run in to this as folks will have containers that will have mounts or even devices being mounted.

If any other pid outside of the docker daemon unshares their mount namespace, and that pid's mountinfo includes a mount reference of a container, then even when the container exits, and attempts to remove, there kernel will return that the path is busy, due to the mount reference by the outside pid.

I've written about this here: http://blog.hashbangbash.com/2014/11/docker-devicemapper-fix-for-device-or-resource-busy-ebusy/

I did not include a commit for the init scripts with start-stop-daemon, as I would like them to get more testing first. See:

(and for completeness, here is the change for ubuntu's upstart http://pastebin.com/D5MsVCUK)

This systemd.exec setting will construct a new mount namespace for the
docker daemon, and use slave shared-subtree mounts so that volume mounts
propogate correctly into containers.

By having an unshared mount namespace for the daemon it ensures that
mount references are not held by other pids outside of the docker
daemon. Frequently this can be seen in EBUSY or "device or resource
busy" errors.

Signed-off-by: Vincent Batts <vbatts@redhat.com>
unshare the mount namespace of the docker daemon to avoid other pids
outside the daemon holding mount references of docker containers.

Signed-off-by: Vincent Batts <vbatts@redhat.com>
@vbatts vbatts force-pushed the vbatts-init_mount_namespaces branch from b2f2aea to 6bb6586 Compare January 20, 2015 21:03
@tianon
Copy link
Member

tianon commented Jan 20, 2015

LGTM, although I think we should wait until after 1.5 (and then include the start-stop-daemon and upstart changes too)

@tianon
Copy link
Member

tianon commented Jan 20, 2015

Unless you can make a compelling argument for this not breaking other stuff so close to our freeze window. 😉

@vbatts
Copy link
Contributor Author

vbatts commented Jan 20, 2015

@vbatts
Copy link
Contributor Author

vbatts commented Jan 20, 2015

and, this is ./contrib/ ;-)

@thaJeztah
Copy link
Member

^^ Add a line to those files: "In case of emergency, call Vincent +1 555 ...", then it should be good to merge 😸

@vbatts
Copy link
Contributor Author

vbatts commented Jan 20, 2015

@thaJeztah woah there, cowboy ;-)

@thaJeztah
Copy link
Member

I'm not very familiar with this stuff, but would this also have a positive effect on AUFS not freeing inodes I saw reported?

@vbatts
Copy link
Contributor Author

vbatts commented Jan 20, 2015

@thaJeztah do you have a link to that conversation?

@thaJeztah
Copy link
Member

Yup, read it earlier today, that's why I thought of it; #9755

@dqminh
Copy link
Contributor

dqminh commented Jan 21, 2015

@vbatts Do you think this will fix aufs device and resource busy bugs too ? I had a few reports about it recently, and the symptoms look familiar.

@tianon
Copy link
Member

tianon commented Jan 21, 2015

@vbatts For the redhat sysvinit change, I agree 100% that the only repercussions would be to the redhat packagers who are quite familiar with creating out-of-band patches.

For the systemd change, this is used in the packages we release, so I'm much more leery (since a problem there would mean we have to re-release depending on the severity and the number of times it gets reported 😞).

Regardless, we'll need our systemd maintainers to weigh in: @lsm5 @philips @jfrazelle

@tianon
Copy link
Member

tianon commented Jan 21, 2015

@cpuguy83
Copy link
Member

Didn't @crosbymichael already try doing this, although within Docker itself? #4599

Also, what happens in situations where someone has mounted an fs on the host and wants to then bind-mount that in a container?

@unclejack
Copy link
Contributor

@cpuguy83 No, that wasn't merged because there were other problems to take into account.

@vbatts
Copy link
Contributor Author

vbatts commented Jan 21, 2015

@tianon well, fedora and centos7 had been using MountFlags=private for a number of versions. So it has had exposure and testing. They have recently switched to MountFlags=slave, for the purpose of ensuring the shared-subtree mounts propagate in the right direction.

@tianon
Copy link
Member

tianon commented Jan 21, 2015

@vbatts OK, that's fair; so, if the contrib/init/systemd maintainers (@lsm5 @philips @jfrazelle) sign-off on this, I'm OK to merge it for 1.5

I think it's probably prudent that we at least ping @maxamillion for the sysvinit change. 😄

Also, for future reference, the argument that "it's just contrib" doesn't hold any water with me. We use bits of contrib in our releases (as I linked above).

@vbatts
Copy link
Contributor Author

vbatts commented Jan 21, 2015

@tianon fair. I like to keep contrib fresh and current too. ❤️

@jessfraz
Copy link
Contributor

LGTM

@@ -6,6 +6,7 @@ Requires=docker.socket

[Service]
ExecStart=/usr/bin/docker -d -H fd://
MountFlags=slave
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

jessfraz pushed a commit that referenced this pull request Jan 22, 2015
@jessfraz jessfraz merged commit fcc4abc into moby:master Jan 22, 2015
@crosbymichael
Copy link
Contributor

Y'all know that you can no longer go into /var/lib/docker from the host with this change and inspect/backup a containers filesystems because the mounts happen in a new namespace. And by y'all i mean @vbatts ;)

@jessfraz
Copy link
Contributor

I am kinda in favor of possibly reverting this now because there are instances when people need to backup a stopped container etc.

@dqminh
Copy link
Contributor

dqminh commented Jan 22, 2015

Being able to access current active mounts in /var/lib/docker is currently important for some of our use cases too ( we use aufs, but i guess that would apply to other drivers ), since we do have some tasks that runs outside of container and exposing the container's active filesystem.

@vbatts
Copy link
Contributor Author

vbatts commented Jan 22, 2015

works fine on overlayfs even. I don't know what to say for aufs. Really a shame that you'd improve many folks experience, reduce errors, but determine a need to revert because of mystical behavior on aufs.

@jessfraz
Copy link
Contributor

Is there another solution maybe :(

On Wednesday, January 21, 2015, Vincent Batts notifications@github.com
wrote:

works fine on overlayfs even. I don't know what to say for aufs. Really a
shame that you'd improve many folks experience, reduce errors, but
determine a need to revert because of mystical behavior on aufs.


Reply to this email directly or view it on GitHub
#10225 (comment).

@vbatts
Copy link
Contributor Author

vbatts commented Jan 22, 2015

@jfrazelle i even tried https://gist.github.com/vbatts/7257da4823c8bc8c61c1 and the directories are still hidden on aufs.

@jessfraz
Copy link
Contributor

shakes fist in air ok @crosbymichael you think we should revert?

On Wednesday, January 21, 2015, Vincent Batts notifications@github.com
wrote:

@jfrazelle https://github.com/jfrazelle i even tried
https://gist.github.com/vbatts/7257da4823c8bc8c61c1 and the directories
are still hidden on aufs.


Reply to this email directly or view it on GitHub
#10225 (comment).

@jessfraz
Copy link
Contributor

We could also add something to docs about how this recommended for
devicemapper

On Wednesday, January 21, 2015, Jessica Frazelle jess@docker.com wrote:

shakes fist in air ok @crosbymichael you think we should revert?

On Wednesday, January 21, 2015, Vincent Batts <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

@jfrazelle https://github.com/jfrazelle i even tried
https://gist.github.com/vbatts/7257da4823c8bc8c61c1 and the directories
are still hidden on aufs.


Reply to this email directly or view it on GitHub
#10225 (comment).

@dqminh
Copy link
Contributor

dqminh commented Jan 22, 2015

I think the PR is safe as-is right ? It only modifies systemd / sysvinit-redhat init script and aufs is not generally used on those systems while devicemapper and btrfs are.

It's probably better to keep this and add some docs on how this doesnt work with aufs.

@jessfraz
Copy link
Contributor

I have debian aufs machines, actually the core machine is debian aufs.

On Wednesday, January 21, 2015, Daniel, Dao Quang Minh <
notifications@github.com> wrote:

I think the PR is safe as-is right ? It only modifies systemd /
sysvinit-redhat init script and aufs is not generally used on those systems
while devicemapper and btrfs are.

It's probably better to keep this and add some docs on how this doesnt
work with aufs.


Reply to this email directly or view it on GitHub
#10225 (comment).

@dqminh
Copy link
Contributor

dqminh commented Jan 22, 2015

@jfrazelle debian should use sysvinit-debian script, which is not affected by this change right ?

@jessfraz
Copy link
Contributor

Not debian jessie ;)

On Wednesday, January 21, 2015, Daniel, Dao Quang Minh <
notifications@github.com> wrote:

@jfrazelle https://github.com/jfrazelle debian should use
sysvinit-debian script, which is not affected by this change right ?


Reply to this email directly or view it on GitHub
#10225 (comment).

@jessfraz
Copy link
Contributor

As much as I would love to be the only debian jessie user running aufs I do
not think that is the case

On Wednesday, January 21, 2015, Daniel, Dao Quang Minh <
notifications@github.com> wrote:

@jfrazelle https://github.com/jfrazelle debian should use
sysvinit-debian script, which is not affected by this change right ?


Reply to this email directly or view it on GitHub
#10225 (comment).

@jessfraz
Copy link
Contributor

Also what about arch linux users with aufs

On Wednesday, January 21, 2015, Jessica Frazelle jess@docker.com wrote:

Not debian jessie ;)

On Wednesday, January 21, 2015, Daniel, Dao Quang Minh <
notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

@jfrazelle https://github.com/jfrazelle debian should use
sysvinit-debian script, which is not affected by this change right ?


Reply to this email directly or view it on GitHub
#10225 (comment).

@dqminh
Copy link
Contributor

dqminh commented Jan 22, 2015

oh noes 🔥

@dqminh
Copy link
Contributor

dqminh commented Jan 22, 2015

@vbatts weird, https://gist.github.com/vbatts/7257da4823c8bc8c61c1 works on my aufs system.

# without setns
> ls /var/lib/docker/aufs/mnt/f714499277a27743ad55e052d4ab1f7946a8baf98a64463e521aa9a0f25f242f

# with setns, 30411 is docker daemon pid
> gcc -o test-setns main.c
> sudo ./test-setns /proc/30411/ns/mnt ls /var/lib/docker/aufs/mnt/f714499277a27743ad55e052d4ab1f7946a8baf98a64463e521aa9a0f25f242f
bin  dev  etc  file.txt  home  lib  lib64  linuxrc  media  mnt  opt  proc  root  run  sbin  sys  tmp  usr  var

@vbatts
Copy link
Contributor Author

vbatts commented Jan 22, 2015

@dqminh oh good. is that on a running or stopped container?

Perhaps this is the start of a contrib tool for debugging containers in their namespace.

@dqminh
Copy link
Contributor

dqminh commented Jan 22, 2015

@vbatts it's on a running container. If the container is stopped, then /var/lib/docker/aufs/mnt/container-id should be empty right, because it's unmounted.

@philips
Copy link
Contributor

philips commented Jan 22, 2015

@dqminh Now I am confused. Does your version of aufs differ from @jfrazelle 's?

@crosbymichael
Copy link
Contributor

Running containers, we don't leave them mounted if they are stopped

@vbatts
Copy link
Contributor Author

vbatts commented Jan 22, 2015

#10281 is up for debate

davedoesdev added a commit to davedoesdev/heddle that referenced this pull request Apr 14, 2015
http://blog.hashbangbash.com/2014/11/docker-devicemapper-fix-for-device-or-resource-busy-ebusy/
moby/moby#10225

Also, we'll need an extra option in the future - commented out now, uncomment
when it's introduced (probably Docker 1.7.0).
@tianon
Copy link
Member

tianon commented Apr 17, 2015

@vbatts heh, looks like we forgot to push this over to sysvinit, upstart, and openrc too

@vbatts
Copy link
Contributor Author

vbatts commented Apr 20, 2015

@tianon

6bb6586

is one, but the others were hackish, with start-stop-daemon. I may have a branch floating around here though...

@tianon
Copy link
Member

tianon commented Apr 20, 2015

Yeah, I remember that we discussed it quite a bit and thought we'd found a solution to the start-stop-daemon problems. 😢

@tianon
Copy link
Member

tianon commented Apr 21, 2015

@vbatts
Copy link
Contributor Author

vbatts commented Apr 21, 2015

"I want to die" that was the solution there
On Apr 21, 2015 5:49 PM, "Tianon Gravi" notifications@github.com wrote:

https://botbot.me/freenode/docker-maintainers/msg/30055754/


Reply to this email directly or view it on GitHub
#10225 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants