Running kubelet in a container: mounts #6848

Closed
pmorie opened this issue Apr 15, 2015 · 33 comments
Labels
sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@pmorie
Member

pmorie commented Apr 15, 2015

Currently, when the kubelet is run in a container, the mounts that the kubelet performs are not visible to the containers in pods, because the kubelet runs in its own mount namespace whose shared-subtree propagation mode is private. In order to run the kubelet in a container, we must find a way to have the kubelet perform mounts in the host's mount namespace.

How to do the mount

The mount must be performed from the root mount namespace. Ideally, it would be possible to run a container in the host's root mount namespace with something like docker run --mnt='host'. However, this is currently not possible, although it has been requested. That being the case, one option is the super privileged container concept. The basic formula for a super privileged container to execute a command in the host's mount namespace is:

  1. The container's filesystem must have the nsenter binary
  2. Bind mount the host's filesystem into the container wholesale, viz: docker run -v /:/host
  3. Locate the host's mount namespace file via the bind-mounted /proc:
    /host/proc/1/ns/mnt
  4. Execute a command in the host's mount namespace by execing a subprocess (nsenter's --mount option takes the path to the namespace file):
    nsenter --mount=/host/proc/1/ns/mnt <some command>
  5. A caveat: once you enter the host's mount namespace, you lose what's mounted in the container's mount namespace; the command nsenter execs resolves paths relative to the host's filesystem.
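
From Go, the formula above amounts to shelling out to nsenter. A minimal sketch, assuming the host's filesystem is bind-mounted at /host as in step 2 (mountInHostNS is an illustrative helper, not existing kubelet code):

package main

import (
  "fmt"
  "os/exec"
)

// hostMountNS is the host's mount namespace file, visible because the
// container was started with `docker run -v /:/host`.
const hostMountNS = "/host/proc/1/ns/mnt"

// mountInHostNS runs mount(8) in the host's mount namespace via nsenter.
// Per the caveat above, source and target are resolved against the host's
// filesystem, not the container's.
func mountInHostNS(source, target, fstype string, options []string) error {
  args := []string{"--mount=" + hostMountNS, "--", "mount", "-t", fstype}
  for _, o := range options {
    args = append(args, "-o", o)
  }
  args = append(args, source, target)
  if out, err := exec.Command("nsenter", args...).CombinedOutput(); err != nil {
    return fmt.Errorf("nsenter mount failed: %v: %s", err, out)
  }
  return nil
}

func main() {
  // Example: mount an NFS export read-only at /mnt/data on the host.
  if err := mountInHostNS("server:/export", "/mnt/data", "nfs", []string{"ro"}); err != nil {
    fmt.Println(err)
  }
}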

Wrinkle: mount helpers

When mount -t <fstype> is invoked, it looks for a mount helper named mount.<fstype> to delegate the mount operation to. In order for a containerized kubelet to be able to mount all filesystem types in the manner described here, the mount helpers would need to be installed on the host.
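
For illustration only: before attempting such a mount, a containerized kubelet could probe for the helper. mount(8) looks for /sbin/mount.<fstype>; the sketch below approximates that with a PATH lookup, and would itself need to run via nsenter to answer the question for the host rather than for the container:

package main

import (
  "fmt"
  "os/exec"
)

// hasMountHelper reports whether a mount helper for fstype can be found.
// mount(8) actually looks for /sbin/mount.<fstype>; a PATH lookup is a
// close approximation on most hosts.
func hasMountHelper(fstype string) bool {
  _, err := exec.LookPath("mount." + fstype)
  return err == nil
}

func main() {
  for _, fs := range []string{"nfs", "glusterfs", "cifs"} {
    fmt.Printf("mount.%s helper present: %v\n", fs, hasMountHelper(fs))
  }
}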

Factoring kubernetes

Currently the volume plugins use the mount.Interface interface to perform mounts, which serves our purpose of keeping volume plugin code orthogonal to whatever performs the mount. We can make an implementation of this interface that execs a subprocess which nsenters into the host's mount namespace to perform the mount, without requiring any changes to the interface itself.

Currently the volume plugins create the instance of mount.Interface directly using the mount.New() method. In order to facilitate injecting an nsentering mounter, we could instead provide volume plugins with a new MounterFactory (via the Host interface) which they can use to get a mounter:

package mount

type MounterFactory interface {
  New() Interface
}

package volume

type Host interface {
  // other methods omitted
  MounterFactory() mount.MounterFactory
}

The MounterFactory should be an exported field on the Kubelet so that the creator of the Kubelet can provide whatever MounterFactory implementation they want. This will facilitate using alternate implementations in downstream projects, integration tests, etc.
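
To make the shape concrete, here is a sketch of an nsenter-backed factory. The NsenterMounterFactory name is hypothetical, and mount.Interface is reduced to a single method for brevity:

package mount

import "os/exec"

// Interface is reduced to one method here; the real interface also covers
// unmounting and mount-point inspection.
type Interface interface {
  Mount(source, target, fstype string, options []string) error
}

type MounterFactory interface {
  New() Interface
}

// NsenterMounterFactory (hypothetical name) produces mounters that perform
// every mount in the host's mount namespace.
type NsenterMounterFactory struct {
  HostMountNS string // e.g. "/host/proc/1/ns/mnt"
}

func (f *NsenterMounterFactory) New() Interface {
  return &nsenterMounter{ns: f.HostMountNS}
}

type nsenterMounter struct {
  ns string
}

func (m *nsenterMounter) Mount(source, target, fstype string, options []string) error {
  args := []string{"--mount=" + m.ns, "--", "mount", "-t", fstype}
  for _, o := range options {
    args = append(args, "-o", o)
  }
  args = append(args, source, target)
  return exec.Command("nsenter", args...).Run()
}

The creator of the Kubelet would then inject either this factory or one producing the ordinary syscall-based mounter, depending on whether the kubelet runs in a container.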

@pmorie
Member Author

pmorie commented Apr 15, 2015

Note: thanks to @vbatts @eparis @mrunalp @rootfs for their help understanding this topic.

@pmorie
Member Author

pmorie commented Apr 15, 2015

@thockin @bgrant0607 @smarterclayton @vmarmol @dchen1107 @eparis

@thockin
Member

thockin commented Apr 15, 2015

It might work. Where's the proof-of-concept?

I don't see why you need a MounterFactory - a single injected mount.Interface should suffice, no?

@pmorie
Member Author

pmorie commented Apr 15, 2015

@thockin no POC yet; wanted to get a write-up out there first.

@thockin
Member

thockin commented Apr 15, 2015

Didn't mean to be curt. I really mean "it might work, it's worth investing time into" :)

Quick test bears fruit:

$ sudo ls -l /proc/1/ns/mnt
lrwxrwxrwx 1 root root 0 Apr 14 22:08 /proc/1/ns/mnt -> mnt:[4026531840]

$ docker run -ti --privileged -v /proc:/realproc busybox ls -l /proc/1/ns/mnt /realproc/1/ns/mnt
lrwxrwxrwx 1 root root 0 Apr 15 05:10 /proc/1/ns/mnt -> mnt:[4026532441]
lrwxrwxrwx 1 root root 0 Apr 15 05:08 /realproc/1/ns/mnt -> mnt:[4026531840]

@pmorie
Member Author

pmorie commented Apr 15, 2015

@thockin

Didn't mean to be curt. I really mean "it might work, it's worth investing time into" :)

Natch 👍

I did think you were talking about a POC of the kubelet running in a container with the above changes. I've done the same test you did:

$ docker run -it --privileged -v /:/host busybox
/ # ls -l /host/proc/1/ns/
total 0
lrwxrwxrwx    1 root     root             0 Apr 15 05:47 ipc -> ipc:[4026531839]
lrwxrwxrwx    1 root     root             0 Apr 13 21:19 mnt -> mnt:[4026531840]
lrwxrwxrwx    1 root     root             0 Apr 15 05:47 net -> net:[4026531957]
lrwxrwxrwx    1 root     root             0 Apr 15 05:47 pid -> pid:[4026531836]
lrwxrwxrwx    1 root     root             0 Apr 15 05:47 user -> user:[4026531837]
lrwxrwxrwx    1 root     root             0 Apr 15 05:47 uts -> uts:[4026531838]

@thockin
Member

thockin commented Apr 15, 2015

Well then, the next step is to hack kubelet to do volume mounts through an nsenter-mount rig, and see if you can get a trivial pod running.

@pmorie
Member Author

pmorie commented Apr 15, 2015

@thockin yep, agree

@pmorie
Member Author

pmorie commented Apr 15, 2015

For the record, this also works:

/ # readlink /host/proc/1/ns/mnt
mnt:[4026531840]

I didn't realize you could readlink an ns file in /proc before.
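
For completeness, the same check from Go is just os.Readlink (a trivial sketch):

package main

import (
  "fmt"
  "os"
)

func main() {
  // The ns entries are magic symlinks whose targets encode the namespace
  // inode, e.g. "mnt:[4026531840]".
  target, err := os.Readlink("/host/proc/1/ns/mnt")
  if err != nil {
    fmt.Println(err)
    return
  }
  fmt.Println(target)
}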

@smarterclayton
Contributor

smarterclayton commented Apr 15, 2015

Yeah, when you said Factory I thought (interface?). It's true, Java does unpleasant things to your brain....

@rootfs
Contributor

rootfs commented Apr 15, 2015

The mount helpers are indeed a tricky issue. I implemented a POC (aka hack) by intercepting the mount(8) call with LD_PRELOAD. You can find the details in my code here.

@vmarmol
Contributor

vmarmol commented Apr 15, 2015

Nice hack, I like it!

As an alternative to fork/exec-ing mount, we could run a process in our Kubelet container/pod (one that runs in the root mnt namespace) that does the mounting, and have the Kubelet talk to it for mount operations. Short term it may be more trouble than it's worth, but long term it would give us flexibility and be more maintainable, I think.

@pmorie
Member Author

pmorie commented Apr 15, 2015

@vmarmol Yep, this is definitely a short-term hack.
@smarterclayton @thockin This is your brain on Java

@dchen1107 dchen1107 added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Apr 15, 2015
@vishh
Contributor

vishh commented Apr 15, 2015

👍

@eparis
Contributor

eparis commented Apr 15, 2015

I would say that getting mount to live in a container, but be able to mount things on the host, would be the holy grail here.

So, for example, you could have a container with /usr/sbin/mount.glusterfs instead of it needing to live in the host's root filesystem, as is required today (or with this hack).

I'm not arguing against this hack, but more progress in line with what @rootfs is talking about could make it unnecessary (some day).

@pmorie
Member Author

pmorie commented Apr 15, 2015

agree re: holy grail @eparis

@thockin
Member

thockin commented Apr 15, 2015

The downside to putting mount.glusterfs into a container is that container now needs to track some upstream source for updates.

@eparis
Contributor

eparis commented Apr 15, 2015

but isn't updating such a container easier than updating binaries in your host/vm (which also have to track some upstream source)?

@eparis
Contributor

eparis commented Apr 15, 2015

also, if this is something like fuse, where you must have a daemon running for the mount to function, wouldn't it be nice if that daemon was in a container, rather than running on the host itself?

@vmarmol
Contributor

vmarmol commented Apr 15, 2015

The host mount namespace should have access to the container's filesystem. Kinda dirty (break out of the container's mount namespace, find the container, use that FS for some things), but possible. It'd save us from depending on the host's.

@thockin
Member

thockin commented Apr 15, 2015

updating binaries in your host/vm (which also have to track some upstream source)

yeah, but that is Someone Else's Problem :)

@thockin
Member

thockin commented Apr 15, 2015

also, if this is something like fuse, where you must have a daemon running for the mount to function, wouldn't it be nice if that daemon was in a container, rather than running on the host itself?

Well, yes, of course, but I am not sure I see the connection. Maybe I misunderstood your point? I assumed you meant that mount.{ext{2,3,4},xfs,glusterfs,ceph,nfs} would be bundled into the kubelet container. Is that not what you meant?

@eparis
Contributor

eparis commented Apr 16, 2015

I'm suggesting a completely separate mount_gluster container, which the kubelet could use to get a glusterfs mounted on the host (which docker can then put into another unrelated container). Now both the host and the kubelet container can be completely ignorant of gluster. It also wouldn't then matter whether the kubelet was in a container or not, since the thing doing the mount would always be in a container.

Paul's trick here works so long as the host has what it needs to mount the filesystem in question. But given a system with, say, the functionality of boot2docker, his trick cannot solve the problem. He escapes into the host, but the host doesn't have the functionality.

Even putting the binaries in the kubelet container doesn't help a ton. You could, I guess, escape to the host, do some crazy LD_PRELOAD and PATH magic such that you ran the stuff back in the kubelet container, and execute the mount that way. But I'm not certain how to really make that work when you need a daemon running (like all FUSE filesystems).

Nothing about mounting + mount namespaces is easy :)

And everything related to mounting gluster could be delivered by the "gluster" team. Same for other filesystems. (Although a generic mount(8) container could work for ext[2-4], tmpfs, and xfs, since they don't need helpers.)

@smarterclayton
Contributor

smarterclayton commented Apr 16, 2015

Makes sense. I wish we had a per-node pod controller.....

@pmorie
Member Author

pmorie commented Apr 16, 2015

@erictune @thockin @smarterclayton @eparis @vmarmol @rootfs

One thing I ran into today is that the NFS plugin uses its own mounter abstraction with a different API, one that calls mount in a shell instead of making the syscall. I'm factoring that into its own mounter implementation that conforms to the interface, but it got me thinking (and I think the discussion here headed in the same direction) about the need to line up different mounter implementations with different plugins under different circumstances. Presumably there will be a need to differentiate the mounters different plugins need when running in a container.

I think if we can make all the plugins use, and be injected with, the same mount interface, it will be a good start toward allowing the above. We can differentiate the mounter used for each plugin type at the call site (kubelet.newVolumeBuilderFromPlugins) in a way that lines up the plugin with the right mounter depending on whether the kubelet is running on the host or in a container, as sketched below.
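
A sketch of what that call-site selection might look like; the containerized field and mounterForPlugin helper are hypothetical wiring (only MounterFactory is part of the proposal above), and the import path reflects the repository layout at the time:

package kubelet

import (
  "github.com/GoogleCloudPlatform/kubernetes/pkg/util/mount"
)

// Kubelet fields shown here are illustrative; containerized is a
// hypothetical flag.
type Kubelet struct {
  containerized  bool
  MounterFactory mount.MounterFactory
}

// mounterForPlugin picks the mounter handed to a volume plugin at the
// kubelet.newVolumeBuilderFromPlugins call site. Today every plugin gets
// the same answer, but per-plugin special cases can hang off pluginName.
func (kl *Kubelet) mounterForPlugin(pluginName string) mount.Interface {
  if kl.containerized {
    // Containerized kubelet: mounts must happen in the host's mount
    // namespace, e.g. via an nsenter-backed factory.
    return kl.MounterFactory.New()
  }
  // Kubelet on the host: plain mounter issuing mount(2) directly.
  return mount.New()
}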

Really, the specific method I've suggested here using the super privileged container approach is a temporary measure. The long-term value we'll get from this is probably more in making it easier to implement new strategies for dealing with running in a container on a plugin-by-plugin basis.

@pmorie
Member Author

pmorie commented Apr 16, 2015

Addendum: IMO, all of the strategies we've discussed in this issue so far would be implementable behind mount.Interface. Anyone disagree?

@pmorie
Member Author

pmorie commented Apr 16, 2015

@smarterclayton

I wish we had a per-node pod controller.....

Why can't we have it? We deserve nice things!

@thockin
Member

thockin commented Apr 16, 2015

There's already a PR open that implements mount.Interface in terms of exec.

@pmorie
Member Author

pmorie commented Apr 16, 2015

@thockin thanks for the heads up, I like the nascent interface in #6400 better; I would love for this PR to go in on top of that.

So, that said, I'm not going to try to rework all the volume stuff now -- just get the basic volumes working in a container. Once #6400 goes in, I'll rebase on top of it and pick up the new interface.

@pmorie
Member Author

pmorie commented Apr 16, 2015

Throwing something out there: if you run boot2docker, couldn't you run a service on the boot2docker VM that performs the mount and has a RESTful interface? In that case the mounter implementation could make a REST call to the mounter service.
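
A sketch of that idea, with everything made up for illustration (the /mount endpoint, port, and wire format are all assumptions): a tiny mount service running in the VM's root mount namespace, plus a client-side helper a mounter implementation could call:

package main

import (
  "bytes"
  "encoding/json"
  "fmt"
  "net/http"
  "os/exec"
)

// mountRequest is a made-up wire format for illustration.
type mountRequest struct {
  Source string `json:"source"`
  Target string `json:"target"`
  Fstype string `json:"fstype"`
}

// serveMounts is the service side, run on the boot2docker VM in the root
// mount namespace.
func serveMounts() {
  http.HandleFunc("/mount", func(w http.ResponseWriter, r *http.Request) {
    var req mountRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
      http.Error(w, err.Error(), http.StatusBadRequest)
      return
    }
    if out, err := exec.Command("mount", "-t", req.Fstype, req.Source, req.Target).CombinedOutput(); err != nil {
      http.Error(w, fmt.Sprintf("%v: %s", err, out), http.StatusInternalServerError)
      return
    }
    w.WriteHeader(http.StatusNoContent)
  })
  http.ListenAndServe(":9090", nil) // port chosen arbitrarily
}

// remoteMount is the kubelet side: a mounter implementation would forward
// each mount to the service instead of calling mount(2) itself.
func remoteMount(endpoint, source, target, fstype string) error {
  body, _ := json.Marshal(mountRequest{Source: source, Target: target, Fstype: fstype})
  resp, err := http.Post(endpoint+"/mount", "application/json", bytes.NewReader(body))
  if err != nil {
    return err
  }
  defer resp.Body.Close()
  if resp.StatusCode != http.StatusNoContent {
    return fmt.Errorf("mount service returned %s", resp.Status)
  }
  return nil
}

func main() {
  go serveMounts() // in reality the service and the kubelet are separate processes
  _ = remoteMount("http://127.0.0.1:9090", "server:/export", "/mnt/data", "nfs")
}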

@pmorie
Member Author

pmorie commented May 5, 2015

I think we can close this out now that PRs are going into master and open new issues as things develop.

@pmorie
Member Author

pmorie commented Jul 31, 2015

Related to: #4869

@bogdando

This looks related to moby/moby#17034.
