Running kubelet in a container: mounts #6848

Closed
pmorie opened this issue Apr 15, 2015 · 33 comments
Labels
sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@pmorie
Member

pmorie commented Apr 15, 2015

Currently, when the kubelet is run in a container, the mounts that the kubelet performs are not visible to the containers in pods, because the kubelet runs in its own mount namespace whose shared-subtree propagation mode is private. In order to run the kubelet in a container, we must find a way to have the kubelet perform mounts in the host's mount namespace.

How to do the mount

The mount must be performed from the root mount namespace. Ideally, it would be possible to run a container in the host's root mount namespace with something like docker run --mnt='host'. However, this is currently not possible, although it has been requested. That being the case, one option is the super privileged container concept. The basic formula for a super privileged container to execute a command in the host's mount namespace is:

  1. The container's filesystem must have the nsenter binary
  2. Bind mount the host's filesystem into the container wholesale, viz: docker run -v /:/host
  3. Locate the host's mount namespace file via the bind-mounted /proc:
    /host/proc/1/ns/mnt
  4. Execute a command in the host's mount namespace by execing a subprocess (nsenter's --mount option takes the path to the namespace file):
    nsenter --mount=/host/proc/1/ns/mnt <some command>
  5. A caveat: once you enter the host's mount namespace, you lose what's mounted in the container's mount namespace; the command nsenter execs resolves paths relative to the host's filesystem.
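
From Go, the formula above amounts to shelling out to nsenter. A minimal sketch, assuming the host's filesystem is bind-mounted at /host as in step 2 (mountInHostNS is an illustrative helper, not existing kubelet code):

package main

import (
  "fmt"
  "os/exec"
)

// hostMountNS is the host's mount namespace file, visible because the
// container was started with `docker run -v /:/host`.
const hostMountNS = "/host/proc/1/ns/mnt"

// mountInHostNS runs mount(8) in the host's mount namespace via nsenter.
// Per the caveat above, source and target are resolved against the host's
// filesystem, not the container's.
func mountInHostNS(source, target, fstype string, options []string) error {
  args := []string{"--mount=" + hostMountNS, "--", "mount", "-t", fstype}
  for _, o := range options {
    args = append(args, "-o", o)
  }
  args = append(args, source, target)
  if out, err := exec.Command("nsenter", args...).CombinedOutput(); err != nil {
    return fmt.Errorf("nsenter mount failed: %v: %s", err, out)
  }
  return nil
}

func main() {
  // Example: mount an NFS export read-only at /mnt/data on the host.
  if err := mountInHostNS("server:/export", "/mnt/data", "nfs", []string{"ro"}); err != nil {
    fmt.Println(err)
  }
}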

Wrinkle: mount helpers

When mount -t <fstype> is invoked, it looks for a mount helper named mount.<fstype> to delegate the mount operation to. In order for a containerized kubelet to be able to mount all filesystem types in the manner described here, the mount helpers would need to be installed on the host.
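
For illustration only: before attempting such a mount, a containerized kubelet could probe for the helper. mount(8) looks for /sbin/mount.<fstype>; the sketch below approximates that with a PATH lookup, and would itself need to run via nsenter to answer the question for the host rather than for the container:

package main

import (
  "fmt"
  "os/exec"
)

// hasMountHelper reports whether a mount helper for fstype can be found.
// mount(8) actually looks for /sbin/mount.<fstype>; a PATH lookup is a
// close approximation on most hosts.
func hasMountHelper(fstype string) bool {
  _, err := exec.LookPath("mount." + fstype)
  return err == nil
}

func main() {
  for _, fs := range []string{"nfs", "glusterfs", "cifs"} {
    fmt.Printf("mount.%s helper present: %v\n", fs, hasMountHelper(fs))
  }
}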

Factoring kubernetes

Currently the volume plugins use the mount.Interface interface to perform mounts, which serves our purpose of keeping volume plugin code orthogonal to whatever performs the mount. We can make an implementation of this interface that execs a subprocess which nsenters into the host's mount namespace to perform the mount, without requiring any changes to the interface itself.

Currently the volume plugins create the instance of mount.Interface directly using the mount.New() method. In order to facilitate injecting an nsentering mounter, we could instead provide volume plugins with a new MounterFactory (via the Host interface) which they can use to get a mounter:

package mount

type MounterFactory interface {
  New() Interface
}

package volume

type Host interface {
  // other methods omitted
  MounterFactory() mount.MounterFactory
}

The MounterFactory should be an exported field on the Kubelet so that the creator of the Kubelet can provide whatever MounterFactory implementation they want. This will facilitate using alternate implementations in downstream projects, integration tests, etc.
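
To make the shape concrete, here is a sketch of an nsenter-backed factory. The NsenterMounterFactory name is hypothetical, and mount.Interface is reduced to a single method for brevity:

package mount

import "os/exec"

// Interface is reduced to one method here; the real interface also covers
// unmounting and mount-point inspection.
type Interface interface {
  Mount(source, target, fstype string, options []string) error
}

type MounterFactory interface {
  New() Interface
}

// NsenterMounterFactory (hypothetical name) produces mounters that perform
// every mount in the host's mount namespace.
type NsenterMounterFactory struct {
  HostMountNS string // e.g. "/host/proc/1/ns/mnt"
}

func (f *NsenterMounterFactory) New() Interface {
  return &nsenterMounter{ns: f.HostMountNS}
}

type nsenterMounter struct {
  ns string
}

func (m *nsenterMounter) Mount(source, target, fstype string, options []string) error {
  args := []string{"--mount=" + m.ns, "--", "mount", "-t", fstype}
  for _, o := range options {
    args = append(args, "-o", o)
  }
  args = append(args, source, target)
  return exec.Command("nsenter", args...).Run()
}

The creator of the Kubelet would then inject either this factory or one producing the ordinary syscall-based mounter, depending on whether the kubelet runs in a container.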

@pmorie
Member Author

pmorie commented Apr 15, 2015

Note: thanks to @vbatts @eparis @mrunalp @rootfs for their help understanding this topic.

@pmorie
Member Author

pmorie commented Apr 15, 2015

@thockin @bgrant0607 @smarterclayton @vmarmol @dchen1107 @eparis

@thockin
Member

thockin commented Apr 15, 2015

It might work. Where's the proof-of-concept?

I don't see why you need a MounterFactory - a single injected mount.Interface should suffice, no?

@pmorie
Member Author

pmorie commented Apr 15, 2015

@thockin no POC yet; wanted to get a write-up out there first.

@thockin
Member

thockin commented Apr 15, 2015

Didn't mean to be curt. I really mean "it might work, it's worth investing time into" :)

Quick test bears fruit:

$ sudo ls -l /proc/1/ns/mnt
lrwxrwxrwx 1 root root 0 Apr 14 22:08 /proc/1/ns/mnt -> mnt:[4026531840]

$ docker run -ti --privileged -v /proc:/realproc busybox ls -l /proc/1/ns/mnt /realproc/1/ns/mnt
lrwxrwxrwx 1 root root 0 Apr 15 05:10 /proc/1/ns/mnt -> mnt:[4026532441]
lrwxrwxrwx 1 root root 0 Apr 15 05:08 /realproc/1/ns/mnt -> mnt:[4026531840]

@pmorie
Member Author

pmorie commented Apr 15, 2015

@thockin

Didn't mean to be curt. I really mean "it might work, it's worth investing time into" :)

Natch 👍

I did think you were talking about a POC of the kubelet running in a container with the above changes. I've done the same test you did:

$ docker run -it --privileged -v /:/host busybox
/ # ls -l /host/proc/1/ns/
total 0
lrwxrwxrwx    1 root     root             0 Apr 15 05:47 ipc -> ipc:[4026531839]
lrwxrwxrwx    1 root     root             0 Apr 13 21:19 mnt -> mnt:[4026531840]
lrwxrwxrwx    1 root     root             0 Apr 15 05:47 net -> net:[4026531957]
lrwxrwxrwx    1 root     root             0 Apr 15 05:47 pid -> pid:[4026531836]
lrwxrwxrwx    1 root     root             0 Apr 15 05:47 user -> user:[4026531837]
lrwxrwxrwx    1 root     root             0 Apr 15 05:47 uts -> uts:[4026531838]

@thockin
Member

thockin commented Apr 15, 2015

Well then, the next step is to hack kubelet to do volume mounts through an nsenter-mount rig, and see if you can get a trivial pod running.

@pmorie
Member Author

pmorie commented Apr 15, 2015

@thockin yep, agree

@pmorie
Member Author

pmorie commented Apr 15, 2015

For the record, this also works:

/ # readlink /host/proc/1/ns/mnt
mnt:[4026531840]

I didn't realize you could readlink an ns file in /proc before.
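
For completeness, the same check from Go is just os.Readlink (a trivial sketch):

package main

import (
  "fmt"
  "os"
)

func main() {
  // The ns entries are magic symlinks whose targets encode the namespace
  // inode, e.g. "mnt:[4026531840]".
  target, err := os.Readlink("/host/proc/1/ns/mnt")
  if err != nil {
    fmt.Println(err)
    return
  }
  fmt.Println(target)
}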

@smarterclayton
Contributor

smarterclayton commented Apr 15, 2015

Yeah, when you said Factory I thought (interface?). It's true, Java does unpleasant things to your brain....

@rootfs
Contributor

rootfs commented Apr 15, 2015

The mount helpers are indeed a tricky issue. I implemented a POC (aka hack) by intercepting the mount(8) call with LD_PRELOAD. You can find the details in my code here.

@vmarmol
Contributor

vmarmol commented Apr 15, 2015

Nice hack, I like it!

As an alternative to fork/exec-ing mount, we could run a process in our Kubelet container/pod (one that runs in the root mnt namespace) that does the mounting, and have the Kubelet talk to it for mount operations. Short term it may be more trouble than it's worth, but long term it would give us flexibility and be more maintainable, I think.

@pmorie
Member Author

pmorie commented Apr 15, 2015

@vmarmol Yep, this is definitely a short-term hack.
@smarterclayton @thockin This is your brain on Java

@dchen1107 dchen1107 added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Apr 15, 2015
@vishh
Contributor

vishh commented Apr 15, 2015

👍

@eparis
Contributor

eparis commented Apr 15, 2015

I would say that getting mount to live in a container, but be able to mount things on the host, would be the holy grail here.

So, for example, you could have a container with /usr/sbin/mount.glusterfs instead of it needing to live in the host's root filesystem, as is required today (or with this hack).

I'm not arguing against this hack, but more progress in line with what @rootfs is talking about could make it unnecessary (some day).

@pmorie
Member Author

pmorie commented Apr 15, 2015

agree re: holy grail @eparis

@thockin
Member

thockin commented Apr 15, 2015

The downside to putting mount.glusterfs into a container is that container now needs to track some upstream source for updates.

@eparis
Contributor

eparis commented Apr 15, 2015

but isn't updating such a container easier than updating binaries in your host/vm (which also have to track some upstream source)?

@eparis
Contributor

eparis commented Apr 15, 2015

also, if this is something like fuse, where you must have a daemon running for the mount to function, wouldn't it be nice if that daemon was in a container, rather than running on the host itself?

@vmarmol
Contributor

vmarmol commented Apr 15, 2015

The host mount namespace should have access to the container's filesystem. Kinda dirty (break out of the container's mount namespace, find the container, use that FS for some things), but possible. It'd save us from depending on the host's.

@thockin
Member

thockin commented Apr 15, 2015

updating binaries in your host/vm (which also have to track some upstream source)

yeah, but that is Someone Else's Problem :)

@thockin
Member

thockin commented Apr 15, 2015

also, if this is something like fuse, where you must have a daemon running for the mount to function, wouldn't it be nice if that daemon was in a container, rather than running on the host itself?

Well, yes, of course, but I am not sure I see the connection. Maybe I misunderstood your point? I assumed you meant that mount.{ext{2,3,4},xfs,glusterfs,ceph,nfs} would be bundled into the kubelet container. Is that not what you meant?

@eparis
Contributor

eparis commented Apr 16, 2015

I'm suggesting a completely separate mount_gluster container, which the kubelet could use to get a glusterfs mounted on the host (which docker can then put into another unrelated container). Now both the host and the kubelet container can be completely ignorant of gluster. It also wouldn't then matter whether the kubelet was in a container or not, since the thing doing the mount would always be in a container.

Paul's trick here works so long as the host has what it needs to mount the filesystem in question. But given a system with, say, the functionality of boot2docker, his trick cannot solve the problem. He escapes into the host, but the host doesn't have the functionality.

Even putting the binaries in the kubelet container doesn't help a ton. You could, I guess, escape to the host, do some crazy LD_PRELOAD and PATH magic such that you ran the stuff back in the kubelet container, and execute the mount that way. But I'm not certain how to really make that work when you need a daemon running (like all FUSE filesystems).

Nothing about mounting + mount namespaces is easy :)

And everything related to mounting gluster could be delivered by the "gluster" team. Same for other filesystems. (Although a generic mount(8) container could work for ext[2-4], tmpfs, and xfs, since they don't need helpers.)

@smarterclayton
Contributor

smarterclayton commented Apr 16, 2015

Makes sense. I wish we had a per-node pod controller.....

@pmorie
Member Author

pmorie commented Apr 16, 2015

@erictune @thockin @smarterclayton @eparis @vmarmol @rootfs

One thing I ran into today is that the NFS plugin uses its own mounter abstraction with a different API, one that calls mount in a shell instead of making the syscall. I'm factoring that into its own mounter implementation that conforms to the interface, but it got me thinking (and I think the discussion here headed in the same direction) about the need to line up different mounter implementations with different plugins under different circumstances. Presumably there will be a need to differentiate the mounters different plugins need when running in a container.

I think if we can make all the plugins use, and be injected with, the same mount interface, it will be a good start toward allowing the above. We can differentiate the mounter used for each plugin type at the call site (kubelet.newVolumeBuilderFromPlugins) in a way that lines up the plugin with the right mounter depending on whether the kubelet is running on the host or in a container, as sketched below.
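
A sketch of what that call-site selection might look like; the containerized field and mounterForPlugin helper are hypothetical wiring (only MounterFactory is part of the proposal above), and the import path reflects the repository layout at the time:

package kubelet

import (
  "github.com/GoogleCloudPlatform/kubernetes/pkg/util/mount"
)

// Kubelet fields shown here are illustrative; containerized is a
// hypothetical flag.
type Kubelet struct {
  containerized  bool
  MounterFactory mount.MounterFactory
}

// mounterForPlugin picks the mounter handed to a volume plugin at the
// kubelet.newVolumeBuilderFromPlugins call site. Today every plugin gets
// the same answer, but per-plugin special cases can hang off pluginName.
func (kl *Kubelet) mounterForPlugin(pluginName string) mount.Interface {
  if kl.containerized {
    // Containerized kubelet: mounts must happen in the host's mount
    // namespace, e.g. via an nsenter-backed factory.
    return kl.MounterFactory.New()
  }
  // Kubelet on the host: plain mounter issuing mount(2) directly.
  return mount.New()
}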

Really, the specific method I've suggested here using the super privileged container approach is a temporary measure. The long-term value we'll get from this is probably more in making it easier to implement new strategies for dealing with running in a container on a plugin-by-plugin basis.

@pmorie
Member Author

pmorie commented Apr 16, 2015

Addendum: IMO, all of the strategies we've discussed in this issue so far would be implementable behind mount.Interface. Anyone disagree?

@pmorie
Member Author

pmorie commented Apr 16, 2015

@smarterclayton

I wish we had a per-node pod controller.....

Why can't we have it? We deserve nice things!

@thockin
Member

thockin commented Apr 16, 2015

There's already a PR open that implements mount.Interface in terms of exec.

@pmorie
Member Author

pmorie commented Apr 16, 2015

@thockin thanks for the heads up, I like the nascent interface in #6400 better; I would love for this PR to go in on top of that.

So, that said, I'm not going to try to rework all the volume stuff now -- just get the basic volumes working in a container. Once #6400 goes in, I'll rebase on top of it and pick up the new interface.

@pmorie
Member Author

pmorie commented Apr 16, 2015

Throwing something out there: if you run boot2docker, couldn't you run a service on the boot2docker VM that performs the mount and has a RESTful interface? In that case the mounter implementation could make a REST call to the mounter service.
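
A sketch of that idea, with everything made up for illustration (the /mount endpoint, port, and wire format are all assumptions): a tiny mount service running in the VM's root mount namespace, plus a client-side helper a mounter implementation could call:

package main

import (
  "bytes"
  "encoding/json"
  "fmt"
  "net/http"
  "os/exec"
)

// mountRequest is a made-up wire format for illustration.
type mountRequest struct {
  Source string `json:"source"`
  Target string `json:"target"`
  Fstype string `json:"fstype"`
}

// serveMounts is the service side, run on the boot2docker VM in the root
// mount namespace.
func serveMounts() {
  http.HandleFunc("/mount", func(w http.ResponseWriter, r *http.Request) {
    var req mountRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
      http.Error(w, err.Error(), http.StatusBadRequest)
      return
    }
    if out, err := exec.Command("mount", "-t", req.Fstype, req.Source, req.Target).CombinedOutput(); err != nil {
      http.Error(w, fmt.Sprintf("%v: %s", err, out), http.StatusInternalServerError)
      return
    }
    w.WriteHeader(http.StatusNoContent)
  })
  http.ListenAndServe(":9090", nil) // port chosen arbitrarily
}

// remoteMount is the kubelet side: a mounter implementation would forward
// each mount to the service instead of calling mount(2) itself.
func remoteMount(endpoint, source, target, fstype string) error {
  body, _ := json.Marshal(mountRequest{Source: source, Target: target, Fstype: fstype})
  resp, err := http.Post(endpoint+"/mount", "application/json", bytes.NewReader(body))
  if err != nil {
    return err
  }
  defer resp.Body.Close()
  if resp.StatusCode != http.StatusNoContent {
    return fmt.Errorf("mount service returned %s", resp.Status)
  }
  return nil
}

func main() {
  go serveMounts() // in reality the service and the kubelet are separate processes
  _ = remoteMount("http://127.0.0.1:9090", "server:/export", "/mnt/data", "nfs")
}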

@pmorie
Member Author

pmorie commented May 5, 2015

I think we can close this out now that PRs are going into master and open new issues as things develop.

@pmorie
Member Author

pmorie commented Jul 31, 2015

Related to: #4869

@bogdando

This looks related to moby/moby#17034.
