Proposal: containerized mount utilities in pods #589

Merged
merged 1 commit into kubernetes:master from jsafrane:containerized-mount on Nov 9, 2017

Conversation

### Controller
* There will be new parameters for kube-controller-manager and kubelet:
* `--experimental-mount-namespace`, which specifies a dedicated namespace where all pods with mount utilities reside.
* `--experimental-mount-plugins`, which contains a comma-separated list of all volume plugins that should run their utilities in pods instead of on the host. The list must also include all flex volume drivers. Without this option, controllers and kubelet would not know whether a plugin should use a pod with mount utilities or the host directly, especially on startup, when the daemon set may not yet be fully deployed on all nodes.
* If so, it finds a pod in the dedicated namespace with label `mount.kubernetes.io/foo=true` (or `mount.kubernetes.io/flexvolume/foo=true` for flex volumes) and calls the volume plugin with `VolumeExec` pointing to the pod. All utilities that the volume plugin executes for attach/detach/provision/delete run in the pod as `kubectl exec <pod> xxx` (of course, we'll use the clientset interface instead of executing `kubectl`).
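
As an illustration of the discovery step above, a minimal sketch using client-go; the helper name and error handling are hypothetical, the label and namespace follow the convention described here, and `List` signatures differ slightly between client-go releases:

```go
package mountdiscovery

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// findMountPod looks for any running pod in the dedicated mount namespace
// that carries the label for the given volume plugin, e.g. "glusterfs".
// Hypothetical helper; recent client-go versions take a context argument.
func findMountPod(client kubernetes.Interface, namespace, plugin string) (*v1.Pod, error) {
	selector := fmt.Sprintf("mount.kubernetes.io/%s=true", plugin)
	pods, err := client.CoreV1().Pods(namespace).List(context.TODO(),
		metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return nil, err
	}
	for i := range pods.Items {
		if pods.Items[i].Status.Phase == v1.PodRunning {
			return &pods.Items[i], nil
		}
	}
	return nil, fmt.Errorf("no running pod with mount utilities for plugin %s", plugin)
}
```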

This comment has been minimized.

@jsafrane

jsafrane Apr 28, 2017

Member

I admit I don't like this experimental-mount-plugins cmdline option. Can anyone find a bulletproof way for kubelet / a controller to reliably determine whether a volume plugin should execute its utilities on the host or in a pod? Especially when kubelet starts in a fresh cluster, the pod with mount utilities may not be running yet, and kubelet must know whether it should wait for it or not. Kubelet must not try to run the utilities on the host, because the utilities there may be the wrong version or have the wrong configuration.

## User story
Admin wants to run Kubernetes on a distro that does not ship `mount.glusterfs` that's needed for GlusterFS volumes.
1. Admin installs Kubernetes in any way.
2. Admin modifies the controller-manager and kubelet command lines to include `--experimental-mount-namespace=foo` and `--experimental-mount-plugins=kubernetes.io/glusterfs` so Kubernetes knows which volume plugins should use utilities in pods and in which namespace to find those pods. This should probably be part of the installation in the future.

This comment has been minimized.

@vishh

vishh Apr 28, 2017

Member

Why not leave discovery of the appropriate mount plugins as a vendor-specific requirement? Kubelet execs a script or binary that knows which container or service to talk to for each type of storage.

This comment has been minimized.

@jsafrane

jsafrane May 11, 2017

Member

You then need to deploy the script to all nodes and masters, and that's exactly what I'd like to avoid. Otherwise I could deploy the mount utilities directly there, right? I see GCI, Atomic Host and CoreOS as mostly immutable images with some configuration in /etc that just starts Kubernetes with the right options (and even that is complicated enough!)

This comment has been minimized.

@vishh

vishh May 12, 2017

Member

All these container-optimized distros do have some writable stateful partitions. That would be necessary for other parts of the system like CNI. How does this align with CSI?

This comment has been minimized.

@jsafrane

jsafrane May 15, 2017

Member

CSI does not dictate any specific way its drivers will run. @saad-ali expects they will run as a daemonset.

With the CNI approach (a script in /opt/cni/bin), we would need a way to deploy it on a master. This is OK for OpenShift, but would it be fine for GKE, where users do not have access to the masters and so can't install an attach/detach/provision/delete script for their favorite storage? And how would it talk to Kubernetes to find the right pod in which to do the attaching/provisioning?

This comment has been minimized.

@yujuhong

yujuhong May 16, 2017

Contributor

Why does GKE need to support this on the masters? User pods will not be scheduled to the masters and they would not need to have the binaries installed.

This comment has been minimized.

@jsafrane

jsafrane May 17, 2017

Member

The PV controller on the master needs a way to execute Ceph utilities when creating a volume, and the attach/detach controller uses the same utilities to attach/detach the volume. Now it uses plain exec. When we move the Ceph utilities from the master somewhere else, we need to tell controller-manager where the utilities are.

* User creates a pod that uses a GlusterFS volume. Kubelet finds a sidecar template for gluster, injects it into the pod and runs it before any mount operation. It then uses `docker exec mount <what> <where>` to mount Gluster volumes for the pod (see the sketch below). After that, it starts init containers and the "real" pod containers.
* User deletes the pod. Kubelet kills all "real" containers in the pod and uses the sidecar container to unmount gluster volumes. Finally, it kills the sidecar container.
-> User does not need to configure anything and sees the pod Running as usual.
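
A purely illustrative sketch of that `docker exec` step as kubelet might issue it; the sidecar container name and the mount arguments are made up:

```go
package sidecarmount

import (
	"fmt"
	"os/exec"
)

// mountInSidecar runs `mount` inside the gluster sidecar container via the
// Docker CLI. Container name, source and target are purely illustrative.
func mountInSidecar(container, source, target string) error {
	cmd := exec.Command("docker", "exec", container,
		"mount", "-t", "glusterfs", source, target)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("mount in %s failed: %v, output: %s", container, err, out)
	}
	return nil
}
```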

This comment has been minimized.

@vishh

vishh Apr 28, 2017

Member

Have you considered packaging mount scripts into the infra container? Kubelet will then have to exec into the infra container to mount a volume. This will alter the pod lifecycle in the kubelet, where volumes are now set up prior to starting a pod.
The advantage is that all storage-related processes belonging to a pod are contained within the pod's boundary and their lifecycle is tied to that of the pod.

This comment has been minimized.

@jsafrane

jsafrane Apr 28, 2017

Member
  1. rkt does not use an infrastructure container; it "holds" the network NS in another way.
  2. Using long-running pods better reflects CSI, as it will run one long-running process on each node. @saad-ali, can you confirm?

I will add a note about it to the proposal.

This comment has been minimized.

@jsafrane

jsafrane May 11, 2017

Member

Note added. To be honest, the infrastructure container would look compelling to me if we did not want to mimic the long-running processes of CSI.

This comment has been minimized.

@yujuhong

yujuhong May 16, 2017

Contributor

Infra container is an implementation detail for the docker integration, I'd not recommend using it. In fact, CRI in its current state would not allow you to exec into an "infra" container.

This comment has been minimized.

@jsafrane

jsafrane May 17, 2017

Member

Note added (and thanks for spotting this)

## Implementation notes
Flex volumes won't be changed in the alpha implementation of this PR. They will still need their utilities (and binaries in /usr/libexec/kubernetes) on all hosts.

This comment has been minimized.

@jingxu97

jingxu97 May 2, 2017

Contributor

Is there some reason for this flex volume note?

This comment has been minimized.

@gnufied

gnufied May 3, 2017

Member

As mentioned above, we are hoping that flex utils will eventually be moved to pods as well - with label `mount.kubernetes.io/flexvolume/foo=true` - but we are not considering that as part of the alpha implementation.

* User creates a pod that uses a GlusterFS volume. Kubelet finds a sidecar template for gluster, injects it into the pod and runs it before any mount operation. It then uses `docker exec mount <what> <where>` to mount Gluster volumes for the pod. After that, it starts init containers and the "real" pod containers.
* User deletes the pod. Kubelet kills all "real" containers in the pod and uses the sidecar container to unmount gluster volumes. Finally, it kills the sidecar container.
-> User does not need to configure anything and sees the pod Running as usual.

This comment has been minimized.

@yujuhong

yujuhong May 16, 2017

Contributor

Infra container is an implementation detail for the docker integration, I'd not recommend using it. In fact, CRI in its current state would not allow you to exec into an "infra" container.

Disadvantages:
* One container for all mount utilities. Admin needs to make a single container that holds utilities for e.g. both gluster and nfs and whatnot.
* Needs some refactoring in kubelet - now kubelet mounts everything and then starts containers. We would need kubelet to start some container(s) first, then mount, then run the rest. This is probably possible, but needs better analysis (and I got lost in kubelet...)

This comment has been minimized.

@yujuhong

yujuhong May 16, 2017

Contributor

The sidecar container approach above also requires about the same level of kubelet refactoring. Might want to add it to the "drawbacks" of side-car too.

This comment has been minimized.

@jsafrane

jsafrane May 17, 2017

Member

Added to drawbacks.

Disadvantages:
* One container for all mount utilities. Admin needs to make a single container that holds utilities for e.g. both gluster and nfs and whatnot.
* Needs some refactoring in kubelet - now kubelet mounts everything and then starts containers. We would need kubelet to start some container(s) first, then mount, then run the rest. This is probably possible, but needs better analysis (and I got lost in kubelet...)
* Short-living processes instead of long-running ones that would mimic CSI.

This comment has been minimized.

@yujuhong

yujuhong May 16, 2017

Contributor

What's the advantage of the long-running processes?

This comment has been minimized.

@jsafrane

jsafrane May 17, 2017

Member

The advantage is that it mimics our current design of CSI and we can catch bugs or even discover that it's not ideal before CSI is standardized.

### Infrastructure containers
Mount utilities could also be part of the infrastructure container that holds the network namespace (when using Docker). Today it's typically a simple `pause` container that does not do anything; it could hold the mount utilities too.

This comment has been minimized.

@yujuhong

yujuhong May 16, 2017

Contributor

As mentioned above, this'd only work for the legacy, pre-CRI docker integration.

## User story
Admin wants to run Kubernetes on a distro that does not ship `mount.glusterfs` that's needed for GlusterFS volumes.
1. Admin installs Kubernetes in any way.
2. Admin modifies the controller-manager and kubelet command lines to include `--experimental-mount-namespace=foo` and `--experimental-mount-plugins=kubernetes.io/glusterfs` so Kubernetes knows which volume plugins should use utilities in pods and in which namespace to find those pods. This should probably be part of the installation in the future.

This comment has been minimized.

@yujuhong

yujuhong May 16, 2017

Contributor

Why does GKE need to support this on the masters? User pods will not be scheduled to the masters and they would not need to have the binaries installed.

## User story
Admin wants to run Kubernetes on a distro that does not ship `mount.glusterfs` that's needed for GlusterFS volumes.
1. Admin installs Kubernetes in any way.
2. Admin modifies the controller-manager and kubelet command lines to include `--experimental-mount-namespace=foo` and `--experimental-mount-plugins=kubernetes.io/glusterfs` so Kubernetes knows which volume plugins should use utilities in pods and in which namespace to find those pods. This should probably be part of the installation in the future.

This comment has been minimized.

@yujuhong

yujuhong May 16, 2017

Contributor

I think the discovery process is far from ideal. One would need to enumerate a list of plugins via a kubelet flag (which is static) before kubelet starts and before the (dynamic) DaemonSet pods are created. Any change to the plugin list will require restarting kubelet. Can we try finding other discovery methods?

This comment has been minimized.

@jsafrane

jsafrane May 17, 2017

Member

I agree that the discovery is not ideal. What are other options? AFAIK there is no magic way to configure kubelet dynamically. Is it possible to have a config object somewhere from which all kubelets and controller-manager would reliably get the list of volume plugins that are supposed to be containerized?

The list is needed only at startup, when kubelet gets its first pods from the scheduler - a pod that uses e.g. a gluster volume may be scheduled before the daemon set for gluster is started or before the daemon set controller spawns a pod on the node. With the list, kubelet knows that it should wait. Without the list, it blindly tries to mount the Gluster volume on the host, which is likely to fail with something as ugly as `wrong fs type, bad option, bad superblock on 192.168.0.1:/foo, missing codepage or helper program, or other error`. mount stderr and exit codes are not helpful at all here.

When all daemon sets are up and running, we don't need --experimental-mount-plugins at all and dynamic discovery works.

This comment has been minimized.

@jsafrane

jsafrane May 22, 2017

Member

I removed --experimental-mount-plugins for now, but it will behave exactly as I described in the previous comment - weird errors may appear in pod events during kubelet startup before a pod with mount utilities is scheduled and started.

@jsafrane

This comment has been minimized.

Member

jsafrane commented May 22, 2017

I updated the proposal with current development:

  • Added Terminology section to clear some confusion
  • Removed --experimental-mount-plugins option
  • During alpha (hopefully 1.7), this feature must be explicitly enabled using kubelet --experimental-mount-namespace=foo so we don't break working clusters accidentally. This may change during beta/GA!
  • During alpha, no controller changes will be done, as the Ceph RBD provisioner is the only one that needs to execute stuff on the master. I may implement it if time permits, I am just not sure...
* `--experimental-mount-namespace`, which specifies a dedicated namespace where all pods with mount utilities reside. It would default to `kube-mount`.
* Whenever the PV or attach/detach controller needs to call a volume plugin, it looks for *any* running pod in the specified namespace with label `mount.kubernetes.io/foo=true` (or `mount.kubernetes.io/flexvolume/foo=true` for flex volumes) and calls the volume plugin so that all mount utilities are executed as `kubectl exec <pod> xxx` (of course, we'll use the clientset interface instead of executing `kubectl`; see the sketch below).
* If such a pod does not exist, it executes the mount utilities on the host as usual.
* During alpha, no controller-manager changes will be done. That means the Ceph RBD provisioner will still require `/usr/bin/rbd` installed on the master. All other volume plugins will work without any problem, as they don't execute any utility when attaching/detaching/provisioning/deleting a volume.
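
A hedged sketch of that clientset-based exec, using client-go's remotecommand package; the function name, placeholders and trimmed error handling are mine, and exact signatures vary slightly across client-go releases:

```go
package mountexec

import (
	"bytes"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	restclient "k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execInMountPod runs a mount utility inside the pod with mount utilities,
// the same way `kubectl exec` does. Namespace, pod and command are placeholders.
func execInMountPod(config *restclient.Config, client kubernetes.Interface,
	namespace, pod string, command []string) (string, string, error) {

	req := client.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(namespace).Name(pod).
		SubResource("exec").
		VersionedParams(&v1.PodExecOptions{
			Command: command,
			Stdout:  true,
			Stderr:  true,
		}, scheme.ParameterCodec)

	executor, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return "", "", err
	}
	var stdout, stderr bytes.Buffer
	err = executor.Stream(remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr})
	if err != nil {
		return stdout.String(), stderr.String(), fmt.Errorf("exec failed: %v", err)
	}
	return stdout.String(), stderr.String(), nil
}
```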

This comment has been minimized.

@kfox1111

kfox1111 Jul 21, 2017

the rbd provisioner has been pulled out to here:
https://github.com/kubernetes-incubator/external-storage/tree/master/ceph/rbd

so the container can be built with the right ceph version already.

* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for one volume plugin, including `mkfs` and `fsck` utilities if they're needed.
* E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exceptions are kernel modules. They are not portable across distros and they should be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.

This comment has been minimized.

@lucab

lucab Jul 24, 2017

Are mounts constrained to be performed under /var/lib/kubelet? If so, this seems to be a contract detail between controller/kubelet/daemonset that should be mentioned.

This comment has been minimized.

@jsafrane

jsafrane Jul 25, 2017

Member

No, any directory can be shared. It's up to the system admin and the author of the privileged pods to make sure it can be shared (i.e. it's on a mount with shared mount propagation) and that it's safe to share (e.g. systemd inside a container does not like /sys/fs/cgroup being shared to the host; I don't remember the exact error message, but it simply won't start).
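
To make the propagation and privilege requirements above concrete, here is a hedged sketch of the relevant container and volume pieces using the Go API types; the image, names and labels are placeholders, and the `mountPropagation` field was an alpha feature around the time this was written, so availability depends on the cluster version:

```go
package mountpod

import (
	v1 "k8s.io/api/core/v1"
)

// mountUtilityContainer sketches the container of a pod with mount utilities:
// privileged, with the host's /var/lib/kubelet mounted with bidirectional
// (shared) propagation so kubelet sees mounts created inside the container.
func mountUtilityContainer() v1.Container {
	privileged := true
	propagation := v1.MountPropagationBidirectional
	return v1.Container{
		Name:  "gluster-mount-utils",           // placeholder
		Image: "example.com/gluster-utils:1.0", // placeholder
		SecurityContext: &v1.SecurityContext{
			Privileged: &privileged,
		},
		VolumeMounts: []v1.VolumeMount{{
			Name:             "kubelet-dir",
			MountPath:        "/var/lib/kubelet",
			MountPropagation: &propagation,
		}},
	}
}

// kubeletDirVolume is the matching hostPath volume for the pod spec.
func kubeletDirVolume() v1.Volume {
	return v1.Volume{
		Name: "kubelet-dir",
		VolumeSource: v1.VolumeSource{
			HostPath: &v1.HostPathVolumeSource{Path: "/var/lib/kubelet"},
		},
	}
}
```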

@jsafrane jsafrane force-pushed the jsafrane:containerized-mount branch 6 times, most recently from 0486ab2 to 85b39a3 Jul 28, 2017

@jsafrane

This comment has been minimized.

Member

jsafrane commented Jul 28, 2017

I updated the proposal with the latest discussion on sig-node and with @tallclair.

  • Kubelet won't use docker exec <pod> mount <what> <where> to mount things in a pod; it will talk to a gRPC service running in the pod instead (following the docker shim example).
    • Allows for much easier discovery of mount pods by kubelet, no magic namespaces or labels.
    • Opens the question of how controller-manager will talk to these pods, see open items at the bottom.

This is basically a new proposal and needs a complete re-review. I left the original proposal as a separate commit so we can roll back easily.

@jsafrane jsafrane force-pushed the jsafrane:containerized-mount branch from 85b39a3 to f803f7b Jul 28, 2017

We considered this user story:
* Admin installs Kubernetes.
* Admin configures Kubernetes to use a sidecar container with template XXX for glusterfs mount/unmount operations and a pod with template YYY for glusterfs provision/attach/detach/delete operations. These templates would be yaml files stored somewhere.
* User creates a pod that uses a GlusterFS volume. Kubelet finds a sidecar template for gluster, injects it into the pod and runs it before any mount operation. It then uses `docker exec mount <what> <where>` to mount Gluster volumes for the pod. After that, it starts init containers and the "real" pod containers.

This comment has been minimized.

@humblec

humblec Jul 28, 2017

Contributor

Do we need to extend the pod spec for this (sidecar template injection) operation, or can it be done with the existing pod spec, or achieved by kube jobs or a similar mechanism?

-> User does not need to configure anything and sees the pod Running as usual.
-> Admin needs to set up the templates.
Similarly, when attaching/detaching a volume, attach/detach controller would spawn a pod on a random node and the controller would then use `kubectl exec <the pod> <any attach/detach utility>` to attach/detach the volume. E.g. Ceph RBD volume plugin needs to execute things during attach/detach. After the volume is attached, the controller would kill the pod.

This comment has been minimized.

@humblec

humblec Jul 28, 2017

Contributor

Can it be a random node, or should it be the same node where the pod is getting scheduled?

This comment has been minimized.

@kfox1111

kfox1111 Jul 28, 2017

I think it may be driver specific. For some drivers, it's probably best to exec into the container on the host that is going to have the volume?

For my k8s systems, I tend to have user-provided code running in containers. I usually segregate these onto differently labeled nodes than the control plane. In this configuration, the container doing the reach-back to, say, OpenStack to move volumes around from VM to VM should never run on the user-reachable nodes, as access to the secret for volume manipulation would be really bad. With k8s 1.7+, the secret is inaccessible to nodes that don't have a pod referencing the secret. So targeted exec would be much, much better for that use case.

This comment has been minimized.

@jsafrane

jsafrane Aug 11, 2017

Member

For the Attach/Detach operation a random node is IMO OK. All state is kept in the attach/detach controller and volume plugins, not in the utilities that are executed by a volume plugin. Note that Ceph RBD is the only plugin that executes something during attach.

For chasing secrets, that's actually a benefit of a pod with mount utilities - any secrets that are needed to talk to the backend storage can easily be made available to the pod via env variables or a Secret volume. And since only os.exec will be delegated to a pod, the whole command line will be provided to the pod, incl. all necessary credentials.
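
A small illustration of that point, with entirely hypothetical Secret and key names, showing how credentials could be surfaced inside the pod with mount utilities:

```go
package mountpodsecrets

import (
	v1 "k8s.io/api/core/v1"
)

// cephAdminKeyEnv exposes a Ceph admin key from a Secret as an environment
// variable in the pod with mount utilities. Secret name and key are made up.
func cephAdminKeyEnv() v1.EnvVar {
	return v1.EnvVar{
		Name: "CEPH_ADMIN_SECRET",
		ValueFrom: &v1.EnvVarSource{
			SecretKeyRef: &v1.SecretKeySelector{
				LocalObjectReference: v1.LocalObjectReference{Name: "ceph-admin-secret"},
				Key:                  "key",
			},
		},
	}
}
```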

Similarly, when attaching/detaching a volume, attach/detach controller would spawn a pod on a random node and the controller would then use `kubectl exec <the pod> <any attach/detach utility>` to attach/detach the volume. E.g. Ceph RBD volume plugin needs to execute things during attach/detach. After the volume is attached, the controller would kill the pod.
Advantages:
* It's probably easier to update the templates than update the DaemonSet.

This comment has been minimized.

@humblec

humblec Jul 28, 2017

Contributor

I have one doubt here: how are we going to control the version of the required mount utils? For example, if the mount utils need to be of a particular version, can we specify that in the template? Does that also mean there can be more than one sidecar container if the user wishes?

* It's probably easier to update the templates than update the DaemonSet.
Drawbacks:
* Admin needs to store the templates somewhere. Where?

This comment has been minimized.

@humblec

humblec Jul 28, 2017

Contributor

Can't we make use of a ConfigMap or a similar mechanism for these templates? Just a thought 👍

* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed.
* E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exceptions are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.

This comment has been minimized.

@humblec

humblec Jul 28, 2017

Contributor

If we have one DaemonSet per volume plugin and we share /dev among these containers, there is a risk or security concern, isn't there?

This comment has been minimized.

@kfox1111

kfox1111 Jul 28, 2017

Not sure it's any worse than what is there today.

This comment has been minimized.

@fabiand

fabiand Aug 18, 2017

There are some things to take care of when mounting /dev into a container, e.g. you need to take care of the pts device so as not to break the console. And there are other things to take care of as well.

Because of this I wonder if it makes sense to add an API flag to signal that a container should get the host's proc, sys, and dev paths. If we had such a flag, it would be much better defined what a container gets when it is told to get the host view of these three directories.

Also a side note: we cannot prevent it, but mounting the host's dev directory into a privileged container can cause quite a lot of confusion (actually, any setup where more than one udev is run can cause problems).

This comment has been minimized.

@fabiand

fabiand Aug 21, 2017

To be more precise on the "things" - please take a look at https://github.com/kubevirt/libvirt/blob/master/libvirtd.sh#L5-L42 to see the workarounds (*cough* hacks *cough*) we need to do to have libvirt running "properly" in a container (we don't use all features, just a subset, and they work well so far).

This comment has been minimized.

@jsafrane

jsafrane Aug 21, 2017

Member

We can't influence how Docker (or another container runtime) creates/binds /dev and /sys. Once such a flag is available in Docker/Moby and CRI we could expose it via the Kubernetes API, but it's a long process. Until then we're stuck with workarounds done inside the container.

I'll make sure we ship a well-documented sample of such a mount container. That's why it's an alpha feature - so we learn all these workarounds before going to beta/stable.

* All volume plugins need to be updated to use a new `mount.Exec` interface to call external utilities like `mount`, `mkfs`, `rbd lock` and such. The implementation of the interface will be provided by the caller and will lead either to a simple `os.exec` on the host or to a gRPC call to a socket in the `/var/lib/kubelet/plugin-sockets/` directory.
### Controllers
TODO: how will controller-manager talk to a remote pod? It's relatively easy to do something like `kubectl exec <mount pod>` from controller-manager, however it's harder to *discover* the right pod.

This comment has been minimized.

@humblec

humblec Jul 28, 2017

Contributor

Maybe we could make use of a labelling/selector mechanism based on the pod content.

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

What are the tradeoffs of using exec vs. http to serve this? My hunch is that this should just be a service model, with a Kubernetes service that provides the volume plugin (how the controller manager identifies the service could be up for debate - predefined name? labels? namespace?). The auth{n/z} is a bit more complicated with that model though.

This comment has been minimized.

@jsafrane

jsafrane Jul 31, 2017

Member

kubectl exec is easy to implement, does not need a new protocol and can be restricted by RBAC. With HTTP, we need to define and maintain the protocol and its implementation, have a db for auth{n,z}, generate certificates, ...

how the controller manager identifies the service could be up for debate - predefined name? labels? namespace

Getting rid of namespaces / labels was the reason why we have gRPC over UNIX sockets. If we have half of the system using gRPC and the second half using kubectl exec, why don't we use kubectl exec (or gRPC) for everything?
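
For reference, a minimal sketch of the `mount.Exec`-style interface mentioned in the excerpt above, with a trivial host-side implementation; the exact upstream definition may differ, so treat names and signatures as illustrative:

```go
package volumeexec

import (
	"os/exec"
)

// Exec is a sketch of the abstract interface described above: volume plugins
// call it instead of os/exec directly, and the caller decides whether the
// command runs on the host or is forwarded to a pod with mount utilities.
type Exec interface {
	Run(cmd string, args ...string) ([]byte, error)
}

// hostExec is the trivial host implementation: plain exec on the node.
type hostExec struct{}

func (hostExec) Run(cmd string, args ...string) ([]byte, error) {
	return exec.Command(cmd, args...).CombinedOutput()
}

// A second implementation (not shown) would marshal the same command into a
// gRPC call to the socket in /var/lib/kubelet/plugin-sockets/ or into a
// remote exec against the pod with mount utilities.
```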

* Update the pod.
* Remove the taint.
Is there a way to do this with a DaemonSet rolling update? Is there any better way to do this upgrade?

This comment has been minimized.

@humblec

humblec Jul 28, 2017

Contributor

IIRC, rolling update is yet to come in DaemonSets; I need to check the current status though. However, there is an option called --cascade=false and it is possible to do a rolling update manually or in a scripted way - not sure if that is what you are looking for.

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

DaemonSets support rolling update as of 1.6 (https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/)

This comment has been minimized.

@jsafrane

jsafrane Jul 31, 2017

Member

I am asking if there are some tricks to do a DaemonSet rolling update that would drain the node first before updating the pod. Otherwise I need to fall back to --cascade=false and do the update manually as @humblec suggests.

## Goal
Kubernetes should be able to run all utilities that are needed to provision/attach/mount/unmount/detach/delete volumes in *pods* instead of running them on *the host*. The host can be a minimal Linux distribution without tools to create e.g. Ceph RBD or mount GlusterFS volumes.
## Secondary objectives

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

What are the goals around adding (or removing) volume plugins dynamically? In other words, do you expect the pods serving the volume plugins to be deployed at cluster creation time, or at a later time? How about removing plugins?

This comment has been minimized.

@jsafrane

jsafrane Jul 31, 2017

Member

Volume plugins are not real plugins; they're hardcoded in Kubernetes.

It does not really matter when the pods with mount utilities are deployed - I would expect that they should be deployed during Kubernetes installation because the cluster admin plans storage ahead (e.g. has an existing NFS server); however, I can imagine that a cluster admin could deploy pods for Gluster volumes later, as the NFS server becomes full or so.

The only exception is flex plugin drivers. In 1.7, they needed to be installed before kubelet and controller-manager started. In #833 we're trying to change that to a more dynamic model, where flex drivers can be added/removed dynamically, and this proposal could be easily extended to flex drivers running in pods. So admins could dynamically install/remove flex drivers running in pods. Again, I would expect that this would be mostly done during installation of a cluster. And #833 is the better place to discuss it.

## Requirements on DaemonSets with mount utilities
These are rules that need to be followed by DaemonSet authors:
* One DaemonSet can serve mount utilities for one or more volume plugins. We expect that one volume plugin per DaemonSet will be the most popular choice.
* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed.

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

that are needed to provision

Provisioning is a "cluster level" operation, and is handled by the volume controller rather than the Kubelet, right? In that case, I don't think they need to be handled by the same pod. In practice its probably often the same utilities that handle both, but I don't think it should be a hard requirement.

This comment has been minimized.

@jsafrane

jsafrane Jul 31, 2017

Member

Yes, technically it does not need to be the same pod.

On the other hand, the only internal volume plugin that needs to execute something during provisioning or attach/detach (i.e. initiated by controller-manager) is Ceph RBD, which needs /usr/bin/rbd. The same utility is then needed by kubelet to finish attachment of the device.

* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed.
* E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exceptions are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

Especially /var/lib/kubelet must be mounted with shared mount propagation so kubelet can see mounts created by the pods.

This only applies if the Kubelet is running in a container, right? Also, it needs slave mount propagation, not shared, right? (Pardon my ignorance of this subject)

This comment has been minimized.

@kfox1111

kfox1111 Jul 28, 2017

no, shared is needed.

slave:
(u)mount events on the host show up in containers; events in the containers don't affect the host.

shared:
(u)mount events that are initiated from either the host or the container show up on the other side.

If you want an (u)mount event in the mount utility container to show up to kubelet, it needs shared.

This comment has been minimized.

@jsafrane

jsafrane Jul 31, 2017

Member

This only applies if the Kubelet is running in a container, right

No. Kubelet running on the host must see mounts mounted by a pod. Therefore we need shared mount propagation from the pod to the host. With slave propagation in the pod the mount would be visible only in the pod and not on the host.
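
For readers unfamiliar with mount propagation, the host-side prerequisite boils down to roughly the following Linux-only sketch (assuming /var/lib/kubelet is not already a separate mount point, hence the bind mount onto itself; real deployments often do this in a systemd unit or in the distro image instead):

```go
package hostprep

import (
	"syscall"
)

// makeShared bind-mounts the kubelet directory onto itself (so it becomes a
// mount point) and then marks it as a recursively shared mount, which is the
// Go equivalent of `mount --bind DIR DIR && mount --make-rshared DIR`.
func makeShared(dir string) error {
	if err := syscall.Mount(dir, dir, "", syscall.MS_BIND, ""); err != nil {
		return err
	}
	return syscall.Mount("", dir, "", syscall.MS_SHARED|syscall.MS_REC, "")
}
```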

* E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exceptions are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
* The pods with mount utilities should run some simple init as PID 1 that reaps zombies of potential fuse daemons.

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

that reaps zombies of potential fuse daemons.

What does this mean? I believe the zombie process issue was fixed in 1.6 (kubernetes/kubernetes#36853)

This comment has been minimized.

@jsafrane

jsafrane Jul 31, 2017

Member

@yujuhong says in #589 (comment) that the infrastructure pod ("pause") is an implementation detail of the docker integration and other container engines may not use it.

This comment has been minimized.

@tallclair

tallclair Jul 31, 2017

Member

Yes, but if zombie processes are an issue for other runtimes, they should have a built-in way of dealing with them. It shouldn't be necessary to implement reaping in the pod unless it's expected to generate a lot of zombie processes, I believe. ( @yujuhong does this sound right? )

This comment has been minimized.

@fabiand

fabiand Aug 18, 2017

Could it be an option to provide a base container for these mount util containers, which has a sane pid 1?

This comment has been minimized.

@jsafrane

jsafrane Aug 21, 2017

Member

I'd like to stay distro agnostic here and let the DaemonSet authors use anything they want. For NFS, a simple Alpine Linux + busybox init could be enough; for Gluster and Ceph, a more powerful distro is needed.

* The only exceptions are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
* The pods with mount utilities should run some simple init as PID 1 that reaps zombies of potential fuse daemons.
* The pods with mount utilities run a daemon with a gRPC server that implements the `VolumeExecService` defined below.

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

VolumExecService

nit: I'd prefer VolumePluginService, or some other variation. I think Exec in this case is a bit unclear.

### gRPC API
`VolumeExecService` is a simple gRPC service that allows executing anything via gRPC:

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

Is there a CSI API proposal out? Does this align with that? It might be worth using the CSI API in it's current state, if it's sufficient.

This comment has been minimized.

@jsafrane

jsafrane Jul 31, 2017

Member

CSI is too complicated. Also, this would require a completely new implementation of at least gluster, nfs, CephFS, Ceph RBD, git volume, iSCSI, FC and ScaleIO volume plugins which is IMO too much. Keeping the plugins as they are, just using an interface that would defer os.Exec to a pod where appropriate is IMO much simpler and without risk of breaking existing (and tested!) volume plugins.

message ExecRequest {
  // Command to execute
  string cmd = 1;

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

This should be abstracted so that the Kubelet doesn't need to understand the specifics of the volume type. I believe this is what the volume interfaces defined in https://github.com/kubernetes/kubernetes/blob/4a73f19aed1f95b3fde1177074aee2a8bec1196e/pkg/volume/volume.go do? In that case, this API should probably mirror those interfaces.

This comment has been minimized.

@jsafrane

jsafrane Jul 31, 2017

Member

Again, that would require me to rewrite the volume plugins. Volume plugins need e.g. access to the CloudProvider or SecretManager; I can't put them into pods easily. And this pod would have access to all Kubernetes secrets...

The whole idea of ExecRequest/Response is to take the existing and tested volume plugins and replace all os.Exec calls with <abstract exec interface>.Exec. Kubelet would provide the right interface implementation, leading to os.Exec or gRPC. No big changes in the volume plugins*, simple changes in Kubelet, one common VolumeExec server daemon for all pods with mount utilities.

It does not leak any specific volume knowledge to kubelet / controller-manager. It's a dumb exec interface, common to all volume plugins.

*) One or two plugins would still need nontrivial refactoring to pass the interface from the place where it's available to the place where it's needed, but that's another story.
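
To make the "dumb exec interface" concrete, here is a hedged sketch of what the daemon inside the pod with mount utilities could do for each request; the struct and function names are mine and the gRPC wiring is omitted:

```go
package volumeexecd

import (
	"os/exec"
)

// execResult mirrors an ExecResponse-like message: captured output plus the
// command's exit code. Field names here are illustrative, not the proposal's.
type execResult struct {
	Output   []byte
	ExitCode int32
}

// runCommand is what the volume-exec daemon would do for each ExecRequest:
// run the requested utility inside the pod and report output and exit code.
func runCommand(cmd string, args []string) execResult {
	out, err := exec.Command(cmd, args...).CombinedOutput()
	var code int32
	if exitErr, ok := err.(*exec.ExitError); ok {
		code = int32(exitErr.ExitCode())
	} else if err != nil {
		code = -1 // the command could not be started at all
	}
	return execResult{Output: out, ExitCode: code}
}
```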

* Authors of container images with mount utilities can then add this `volume-exec` daemon to their image, they don't need to care about anything else.
### Upgrade
Upgrade of the DaemonSet with pods with mount utilities needs to be done node by node and with extra care. The pods may run fuse daemons, and killing such a pod with a glusterfs fuse daemon would kill all pods that use glusterfs on the same node.

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

Would it kill the pods, or just cause IO errors?

This comment has been minimized.

@jsafrane

jsafrane Jul 31, 2017

Member

IO errors. I guess health probe should fail and the pod should be rescheduled (or deployment / replication set will create a new one).

* Update the pod.
* Remove the taint.
Is there a way to do this with a DaemonSet rolling update? Is there any better way to do this upgrade?

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

DaemonSets support rolling update as of 1.6 (https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/)

* Authors of container images with mount utilities can then add this `volume-exec` daemon to their image, they don't need to care about anything else.
### Upgrade

This comment has been minimized.

@tallclair

tallclair Jul 28, 2017

Member

What happens if the kubelet can't reach the pod serving a volume plugin (either due to an update, or some other error) when a pod with a volume is deleted? Will the Kubelet keep retrying until it is able to unmount the volume? What are the implications of being unable to unmount the volume?

This comment has been minimized.

@jsafrane

jsafrane Jul 31, 2017

Member

Yes, kubelet tries indefinitely.

And if the pod with mount utilities is not available for a longer time... I checked the volume plugins; most (if not all) run umount on the host. So the volume gets unmounted cleanly and data won't be corrupted. Detaching an iSCSI/FC/Ceph RBD disk may be a different story. The disk may stay attached forever, and then it depends on the backend whether it supports attaching the volume to a different node.

As I wrote, update of the daemon set is a very tricky operation and the node should be drained first.

This comment has been minimized.

@tallclair

tallclair Jul 31, 2017

Member

Does unmounting block pod deletion? I.e. will the pod be stuck in a terminated state until the volume utility pod is able to be reached?

This comment has been minimized.

@jsafrane

jsafrane Aug 1, 2017

Member

No, unmounting happens after a pod is deleted.

-> User does not need to configure anything and sees the pod Running as usual.
-> Admin needs to set up the templates.
Similarly, when attaching/detaching a volume, attach/detach controller would spawn a pod on a random node and the controller would then use `kubectl exec <the pod> <any attach/detach utility>` to attach/detach the volume. E.g. Ceph RBD volume plugin needs to execute things during attach/detach. After the volume is attached, the controller would kill the pod.

This comment has been minimized.

@liggitt

liggitt Jul 31, 2017

Member

would the pod need to be a highly privileged pod, likely with hostpath volume mount privileges?

This comment has been minimized.

@jsafrane

This comment has been minimized.

@liggitt

liggitt Jul 31, 2017

Member

This is the first instance I'm aware of where a controller would be required to have the ability to create privileged pods. Not necessarily a blocker, but that is a significant change.

This comment has been minimized.

@humblec

humblec Aug 1, 2017

Contributor

IIUC, for mounts we only need CAP_SYS_ADMIN; however, if we export `/dev` we need privileged pods.
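
For comparison, the two security contexts mentioned here look roughly like this with the Go API types (purely illustrative):

```go
package mountsecurity

import (
	v1 "k8s.io/api/core/v1"
)

// capSysAdminOnly grants just CAP_SYS_ADMIN, which may be enough for plain
// mount/umount without exposing the host's /dev.
func capSysAdminOnly() *v1.SecurityContext {
	return &v1.SecurityContext{
		Capabilities: &v1.Capabilities{Add: []v1.Capability{"SYS_ADMIN"}},
	}
}

// fullyPrivileged is what a pod needs once host devices such as /dev are
// exposed to it, e.g. for iSCSI or Ceph RBD attach.
func fullyPrivileged() *v1.SecurityContext {
	privileged := true
	return &v1.SecurityContext{Privileged: &privileged}
}
```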

@tallclair

This comment has been minimized.

Member

tallclair commented Jul 31, 2017

I misunderstood the original intent of this proposal. I thought the goal was to get much closer to the desired end-state of true CSI plugins. However, I now see that this is just providing the binary utilities for the existing (hard-coded) plugins.

Given that, I'm afraid I want to go back on my original suggestions. Since this really is an exec interface, I think the original proposal of using the native CRI exec (specifically, ExecSync) makes sense.

@jsafrane

This comment has been minimized.

Member

jsafrane commented Aug 11, 2017

@tallclair ExecSync looks usable.

I'd like to revisit the UNIX sockets. We still need a way to run stuff in pods with mount utilities from controller-manager, which cannot use UNIX sockets. So there must be a way (namespaces, labels) to find these pods. Why can't kubelet use the same mechanism instead of UNIX sockets? It's easy to do kubectl exec <pod> mount -t glusterfs ... from kubelet (using ExecSync in the background) and it's currently the easiest way to reach pods from controller-manager.

@jsafrane

This comment has been minimized.

Member

jsafrane commented Aug 14, 2017

@tallclair I just had a meeting with @saad-ali and @thockin and we agreed that UNIX sockets are better for now; we care about mount in 1.8 and we'll see if we ever need to implement attach/detach, and how.

So, ExecSync indeed looks usable. I am not sure about the whole RuntimeService service - can the gRPC endpoint that runs in a pod with mount utilities implement just the ExecSync part of it? Would it be better to create a new RemoteExecService service with just ExecSync in it? Is it possible to have two services with the same ExecSync rpc function?
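
For reference, the ExecSync call under discussion is the CRI one; a hedged client-side sketch follows, with the caveat that the generated package's import path has moved between Kubernetes releases and should be treated as a placeholder:

```go
package criexec

import (
	"context"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1" // import path varies by release
)

// execSync runs a single command synchronously in an already-running
// container over the CRI connection and returns stdout, stderr and exit code.
func execSync(conn *grpc.ClientConn, containerID string, cmd []string) (*runtimeapi.ExecSyncResponse, error) {
	client := runtimeapi.NewRuntimeServiceClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	return client.ExecSync(ctx, &runtimeapi.ExecSyncRequest{
		ContainerId: containerID,
		Cmd:         cmd,
		Timeout:     60, // seconds
	})
}
```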

@bassam

This comment has been minimized.

bassam commented Sep 1, 2017

@jsafrane it seems to me that if we are willing to run containers on the master, we should be willing to schedule pods. The bigger point here is whether we can use Kubernetes itself to run storage plugins.

I see your point about having NFS and others work without having to install/run anything. I think that could be achieved by introducing a "storage-addon", like we do for DNS for example. Kubeadm and installers could run the relevant pods on installation.

@jsafrane

This comment has been minimized.

Member

jsafrane commented Sep 14, 2017

Trying to resurrect the discussion, I am still interested in this proposal.

@tallclair, looking at the device plugin gRPC API, it looks better to me to follow that approach and introduce a "container exec API" with an ExecSync call instead of re-using CRI. What do you think about it?

@tallclair

This comment has been minimized.

Member

tallclair commented Sep 18, 2017

The device plugin api is a higher level abstraction than just arbitrary exec. I wasn't a part of the meeting where it was decided to stick with a socket interface, but I don't see the value in implementing an alternative arbitrary command exec interface rather than relying on ExecSync. I'm happy to join in another discussion.

@jsafrane jsafrane force-pushed the jsafrane:containerized-mount branch 3 times, most recently from 6802ff5 to 72189ee Oct 3, 2017

@jsafrane

This comment has been minimized.

Member

jsafrane commented Oct 3, 2017

Reworked according to the result of the latest discussion:

  • Use CRI ExecSync to execute stuff in pods with mount utilities (= no new gRPC interface)

  • Use files in /var/lib/kubelet/plugin-containers/<plugin name> for discovery of pods with mount utilities.

  • Added a note about containerized kubelet for completeness, no extra changes necessary.

@tallclair @dchen1107 @thockin @vishh PTAL
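
Purely as an illustration of the file-based discovery bullet above; the assumption that each file under /var/lib/kubelet/plugin-containers/ simply holds a container identifier is mine and only for this sketch, the proposal text defines the real contract:

```go
package mountregistry

import (
	"os"
	"path/filepath"
	"strings"
)

// lookupMountContainer reads /var/lib/kubelet/plugin-containers/<plugin>.
// The assumption that the file holds a single container ID is illustrative.
func lookupMountContainer(plugin string) (string, bool) {
	path := filepath.Join("/var/lib/kubelet/plugin-containers", plugin)
	data, err := os.ReadFile(path)
	if err != nil {
		return "", false // no pod with mount utilities registered; use the host
	}
	return strings.TrimSpace(string(data)), true
}
```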

@jsafrane jsafrane force-pushed the jsafrane:containerized-mount branch 3 times, most recently from a02d582 to f003e82 Oct 3, 2017

@jsafrane

This comment has been minimized.

Member

jsafrane commented Oct 4, 2017

Implementation of this proposal is at kubernetes/kubernetes#53440 - it's quite small and well contained.

@castrojo

This comment has been minimized.

Contributor

castrojo commented Oct 10, 2017

This change is Reviewable

To sum it up, it's just a daemon set that spawns privileged pods, running a simple init and registering itself into Kubernetes by placing a file into a well-known location.
**Note**: It may be quite difficult to create a pod that sees the host's `/dev` and `/sys`, contains the necessary kernel modules, does the initialization right and reaps zombies. We're going to provide a template with all this. During alpha, it is expected that this template will be polished as we encounter new bugs, corner cases, and systemd / udev / docker weirdness.

This comment has been minimized.

@tallclair

tallclair Oct 13, 2017

Member

During alpha,

Is this expected to ever leave alpha? I thought this was a temporary hack while we wait for CSI?

This comment has been minimized.

@jsafrane

jsafrane Oct 18, 2017

Member

I removed all notes about alpha in the text and added a note about the feature gate and that it's going to be alpha forever.

@jsafrane jsafrane force-pushed the jsafrane:containerized-mount branch 2 times, most recently from d797c65 to 69c780f Oct 18, 2017

@jsafrane

This comment has been minimized.

Member

jsafrane commented Oct 18, 2017

I squashed all the commits, the PR is ready to be merged.

In a personal meeting with @tallclair and @saad-ali we agreed that all volume plugins are going to be moved to CSI eventually, so this proposal has a limited lifetime. CSI drivers will have a different discovery mechanism, and all the kubelet changes proposed here won't be needed.

I still think this PR is useful, as it allows us to create tests for internal volume plugins so we can check their CSI counterparts for regressions in e2e tests. Wherever the CSI drivers will live, Kubernetes still needs to keep its backward compatibility and make sure that old PVs keep working.

@jsafrane

This comment has been minimized.

Member

jsafrane commented Oct 27, 2017

/assign @tallclair

@tallclair

This comment has been minimized.

Member

tallclair commented Oct 31, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Oct 31, 2017

@jsafrane

This comment has been minimized.

Member

jsafrane commented Nov 9, 2017

Why is this not merged? "pull-community-verify — Waiting for status to be reported"
/test all

@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Nov 9, 2017

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Nov 9, 2017

Automatic merge from submit-queue.

@k8s-merge-robot k8s-merge-robot merged commit f04cbac into kubernetes:master Nov 9, 2017

2 of 3 checks passed:
* Submit Queue: Required GitHub CI test is not green: pull-community-verify
* cla/linuxfoundation: jsafrane authorized
* pull-community-verify: Job succeeded

k8s-merge-robot added a commit to kubernetes/kubernetes that referenced this pull request Nov 14, 2017

Merge pull request #53440 from jsafrane/mount-container4-10-03
Automatic merge from submit-queue (batch tested with PRs 54005, 55127, 53850, 55486, 53440). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Containerized mount utilities

This is implementation of kubernetes/community#589

@tallclair @vishh @dchen1107 PTAL
@kubernetes/sig-node-pr-reviews 

**Release note**:
```release-note
Kubelet supports running mount utilities and the final mount in a container instead of running them on the host.
```