
Dynamic Flexvolume plugin discovery proposal. #833

Merged
merged 4 commits from the flexdeploy branch into kubernetes:master on Aug 22, 2017

Conversation

@verult (Contributor) commented Jul 22, 2017

This lays out the design for the improved deployment story of Flexvolume, making it easier to work with out-of-tree volume plugins before CSI comes to fruition.

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Jul 22, 2017
@verult commented Jul 24, 2017

/assign @chakri-nelluri @msau42 @saad-ali
/cc @thockin

@k8s-ci-robot (Contributor):
@verult: GitHub didn't allow me to assign the following users: chakri-nelluri.

Note that only kubernetes members can be assigned.

In response to this:

/assign @chakri-nelluri @msau42 @saad-ali
/cc @thockin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msau42 (Member) commented Jul 24, 2017

/sig storage

@k8s-ci-robot added the sig/storage label (categorizes an issue or PR as relevant to SIG Storage) on Jul 24, 2017

## **Background**

Beginning in version 1.8, the Kubernetes Storage SIG will stop accepting in-tree volume plugins and advises all storage providers to implement out-of-tree plugins. Currently, there are two recommended implementations: Container Storage Interface (CSI) and Flexvolume. CSI provides a single interface that storage vendors can implement to make their storage solutions work across many different container orchestrators, and its volume plugins are out-of-tree by design. However, CSI is a large effort whose full implementation is several quarters away, so there is a need for an immediate solution that lets storage vendors continue adding volume plugins. Flexvolume is an in-tree plugin, available today, that can run any storage solution by executing volume commands against a user-provided driver on the Kubernetes host. However, the process of setting up Flexvolume is very manual, pushing it out of consideration for many users. Problems include having to copy the driver to a specific location on each node, manually restarting kubelet, and users' limited access to machines.
Member:

nit: split this up into smaller paragraphs to make it easier to read and comment on.




* Alice is a storage plugin author and would like to deploy a Flexvolume driver on all node instances. She creates an image by copying her Flexvolume driver directory, with driver names in `[vendor~]driver/driver` format (e.g. `flexvolume/k8s~nfs/nfs`), to the Flexvolume deployment base image at `/flexvolume`. Then she makes her image available to Bob, a cluster admin.
Member:

Maybe split this up into steps per persona.

Could you add an example of an image Dockerfile and DaemonSet?

* Bob modifies the existing deployment DaemonSet spec with the name of the given image, and creates the DaemonSet.
* Charlie, an end user, creates volumes using the installed plugin.
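
Responding to the request above, here is a minimal sketch of the two artifacts. The base image name, registry, and driver are illustrative assumptions, not part of the proposal.

# Hypothetical Dockerfile for Alice's driver image (base image name assumed).
FROM example.com/flexvolume-deploy-base:latest
# Lay the driver out exactly as it should appear in the host's driver directory.
COPY nfs /flexvolume/k8s~nfs/nfs

# Hypothetical DaemonSet spec that Bob edits; only the image name changes per driver.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: flexvolume-deploy
spec:
  selector:
    matchLabels:
      name: flexvolume-deploy
  template:
    metadata:
      labels:
        name: flexvolume-deploy
    spec:
      containers:
      - name: deploy
        image: example.com/alice/flexvolume-nfs:v1
        securityContext:
          privileged: true              # required to write into the hostPath
        volumeMounts:
        - name: flexvolume-dir
          mountPath: /host/flexvolume
      volumes:
      - name: flexvolume-dir
        hostPath:
          # The default Flexvolume driver directory on the host.
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/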

The user story for driver update is similar: Alice creates a new image with her new drivers, and Bob deploys it using the DaemonSet spec. Note that the `/flexvolume` directory must look exactly like what is desired in the Flexvolume directory on the host; thus, in order to add a new driver without removing existing ones, existing drivers must also appear in `/flexvolume`.
Member:

Can you describe exactly "what is desired"?


### Driver Installation Script

The script will copy the existing content of `/flexvolume` on the host to a location in `/tmp`, and then attempt to copy user-provided drivers to that directory. If the copy fails, the original drivers are restored. This script will not perform any driver validation.
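
A minimal sketch of such a script, assuming the host's driver directory is mounted into the pod at `/host/flexvolume` and the image ships its drivers under `/flexvolume` (both paths are assumptions):

#!/bin/sh
# Back up the host's current drivers, then attempt the copy; restore on failure.
BACKUP=$(mktemp -d /tmp/flexvolume-backup.XXXXXX)
cp -a /host/flexvolume/. "$BACKUP"/
if ! cp -a /flexvolume/. /host/flexvolume/; then
  # The copy failed part-way: clear the partial state and restore the backup.
  find /host/flexvolume -mindepth 1 -delete
  cp -a "$BACKUP"/. /host/flexvolume/
  exit 1
fi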
Member:

How are failures shown to the user?

Contributor Author:

Not sure what's the best way. I'll add that to Open Questions.

Contributor:

Can we use the exit status from the DaemonSet to report the error?

Member:

I don't understand the "are restored"? A driver should never ever touch anything that isn't its own.

The install process, for most drivers, should be:

while true; do 
  if [ -f /flexvolume/myvendor~mydriver/mydriver ]; then
    sleep 60
  else
    mkdir /flexvolume/myvendor~mydriver
    cp -a /mydriver /flexvolume/myvendor~mydriver
  fi
done

Contributor Author:

The model in the current design is the entire Flexvolume directory containing all drivers get replaced, so when a copy fails, the backup of the entire directory is restored.

After moving to the single-driver install model, this will be changed to backing up a single driver instead.

Member:

I think the hard line is at "you should never ever touch something you don't own".


### Dynamic Plugin Discovery

Create a `PluginStub` interface containing a single method `Init()`, and have `VolumePlugin` extend it. Create a `PluginProber` type which extends `PluginStub` and includes the methods `Init()` and `Probe()`. `Init()` initializes fsnotify, creates a watch on the driver directory as well as its subdirectories (if any), and spawns a goroutine listening for signals from the watch. When the goroutine receives a signal that a new directory was created, it creates a watch for that directory so that driver changes can be seen. `Probe()` scans the driver directory only when the goroutine sets a flag. If the flag is set, it returns true (indicating that new plugins are available) and the list of plugins; otherwise, it returns false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see Security Considerations). Because inotify (used by fsnotify) can only watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists.
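
A rough Go sketch of the types this paragraph introduces; method signatures beyond the names `Init()` and `Probe()` are assumptions:

package volume

// PluginStub is the minimal interface shared by volume plugins and the prober.
type PluginStub interface {
	Init() error
}

// VolumePlugin extends PluginStub with the existing volume plugin methods.
type VolumePlugin interface {
	PluginStub
	// ... existing VolumePlugin methods (GetPluginName, CanSupport, etc.)
}

// PluginProber extends PluginStub. Probe reports whether the set of
// Flexvolume drivers changed and, if so, returns the rebuilt plugin list.
type PluginProber interface {
	PluginStub
	Probe() (updated bool, plugins []VolumePlugin, err error)
}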
Member:

Split this up into steps instead of one giant block of text


### Dynamic Plugin Discovery

Create a `PluginStub` interface containing a single method `Init()`, and have `VolumePlugin` extend it. Create a `PluginProber` type which extends `PluginStub` and includes the methods `Init()` and `Probe()`. `Init()` initializes fsnotify, creates a watch on the driver directory as well as its subdirectories (if any), and spawns a goroutine listening for signals from the watch. When the goroutine receives a signal that a new directory was created, it creates a watch for that directory so that driver changes can be seen. `Probe()` scans the driver directory only when the goroutine sets a flag. If the flag is set, it returns true (indicating that new plugins are available) and the list of plugins; otherwise, it returns false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see Security Considerations). Because inotify (used by fsnotify) can only watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists.
Member:

Who creates this interface?


### Copying Driver File to Image

Using a publicly available deployment image containing the script as the base image, the plugin author copies the Flexvolume driver directory (e.g. `flexvolume/k8s~nfs/nfs`) to `/flexvolume` and makes the image available to the cluster admin.
Member:

Maybe also have a section for the public deployment base image.


Because Flexvolume has two separate plugin instantiations (attachable and non-attachable), it's worth considering the case when a driver that implements attach/detach is replaced with a driver that does not, or vice versa. This does not cause an issue because plugins are recreated every time the driver directory is changed.

There is a possibility that a probe occurs at the same time the DaemonSet updates the driver, so that the prober's view of drivers is inconsistent. However, this is very rare, and when it does occur, the next `Probe()` call, which occurs shortly after, will be consistent.
Member:

There are some parts of the volume code that iterate through all the volume plugins. This will also need to deal with synchronization with those calls.

@verult force-pushed the flexdeploy branch 2 times, most recently from 22ce9e0 to d8d5536, on July 24, 2017 21:45
@kokhang left a comment:

Also how does this work in conjunction with kubernetes/enhancements#278? It seems that we can utilize the base image to install/add mount tools to be available in the host.


Inside `InitPlugins()` from `volume/plugins.go`, if the `PluginStub` is an instance of `PluginProber`, only call its `Init()` and nothing else. Add an additional field, `flexVolumePluginList`, in `VolumePluginMgr` as a cache. For every iteration of the plugin list, call `Probe()` and update `flexVolumePluginList` if true is returned, and iterate through the new plugin list. If the return value is false, iterate through the existing `flexVolumePluginList`.
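
A rough Go sketch of that lookup path, building on the interface sketch above; the `Spec` type, the `CanSupport` method, and the struct fields are assumptions taken from this paragraph, not the final implementation:

type VolumePluginMgr struct {
	prober               PluginProber   // the Flexvolume PluginProber
	staticPlugins        []VolumePlugin // plugins registered at InitPlugins() time
	flexVolumePluginList []VolumePlugin // cache, refreshed when Probe() reports changes
}

func (pm *VolumePluginMgr) FindPluginBySpec(spec *Spec) (VolumePlugin, error) {
	// Consult the prober on every lookup, but rescan the driver
	// directory only when it reports that something changed.
	if updated, plugins, err := pm.prober.Probe(); err == nil && updated {
		pm.flexVolumePluginList = plugins // refresh the cache
	}
	for _, p := range append(pm.staticPlugins, pm.flexVolumePluginList...) {
		if p.CanSupport(spec) {
			return p, nil
		}
	}
	return nil, errors.New("no volume plugin matched")
}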

Because Flexvolume has two separate plugin instantiations (attachable and non-attachable), it's worth considering the case when a driver that implements attach/detach is replaced with a driver that does not, or vice versa. This does not cause an issue because plugins are recreated every time the driver directory is changed.
Comment:

If the flexvolume implements attach/detach interface, how are you going to extend the controller-manager (assuming you are using a daemonset)?

Contributor Author:

What do you mean by "extend"?

If you are asking how the drivers get deployed through DaemonSet to a master: DaemonSet will add a pod to the master node regardless of whether the node is set to schedulable.

If you are asking how the controller-manager picks up the newly added attach/detach procedures when the driver is replaced: during plugin probe a FlexVolumeAttachablePlugin is created, which replaces the previous plugin. The attachable plugin will enable attach/detach calls from AttachDetachController.

Member:

DaemonSet will add a pod to the master node

Not all masters run kubelet. E.g. GKE.


Pros: Designed for containers that eventually terminate. No need to have the container go into an infinite loop.

Cons: Does not guarantee every node has a pod running.
Comment:

Can we use affinity/placement to target all nodes? Although you would still have issues if new nodes are added to the k8s cluster.

Contributor Author:

That's a good point, I'll add it to the doc. Yes like you said, with Jobs you have to specify how many pods you'd like to see running to completion, which is a problem when new nodes come up.


Pros: Simpler to implement.

Cons: Kubelet or the controller manager iterates through the plugin list many times, so Probe() is called very frequently. Using this model would cause unnecessary disk access. This issue is mitigated if we guarantee that `PluginProber` is the last `PluginStub` in the iteration, and only call `Probe()` if no other plugin is matched, but this logic adds additional complexity.
Comment:

I agree. Flexvolume drivers are not something that will be added or modified often.

Member:

Why is probe called so often?

Contributor Author:

It's called for every FindPluginBy*() call, because it currently iterates through all plugins and errors out when multiple plugins are found. As for why FindPluginBy*() is called so often, I don't know; I can only tell that from logs.

Member:

we should track that backwards - it is surprising to me.

Contributor Author:

Seems like there are (at least) two reasons:

  1. In every DSW populator loop, it's called for volumes of every existing pod, in order to re-associate each pod with its required volumes.

  2. Certain volume types are set to constantly remount, triggering a Probe every time the remount occurs.

@verult commented Jul 24, 2017

Just to clarify: the containerized mounts feature doesn't install mount utilities on the host directly; rather, it keeps them in a container, and volume plugins need to do a container exec to access them.

The deployment script in this design can be composed with the mount utility container, and the DaemonSet spec here can be combined with the containerized mounts DaemonSet spec. The Flexvolume drivers will have to be able to perform the container execs to access containerized mount utilities, but otherwise it should just work.


### High Level Design

The DaemonSet mounts a hostpath volume exposing the host's Flexvolume driver directory onto every pod. The base deployment image contains a script that copies drivers in the image to the hostpath. This copy triggers the filesystem watch held by kubelet or the controller manager. During volume creation, if the watch has signaled, kubelet or the controller manager probes the driver directory and loads the currently installed drivers as volume plugins.
Member:

Why does the flexvolume content have to be copied to every pod? Assuming a non-containerized kubelet, shouldn't this just be on the host?

Contributor Author:

I meant to say every DaemonSet pod. Will clarify.


`Probe()` scans the driver directory only when the goroutine sets a flag. If the flag is set, it returns true (indicating that new plugins are available) and the list of plugins; otherwise, it returns false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see [Security Considerations](#security-considerations)). Because inotify (used by fsnotify) can only watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists.

Inside `InitPlugins()` from `volume/plugins.go`, if the `PluginStub` is an instance of `PluginProber`, only call its `Init()` and nothing else. Add an additional field, `flexVolumePluginList`, in `VolumePluginMgr` as a cache. For every iteration of the plugin list, call `Probe()` and update `flexVolumePluginList` if true is returned, and iterate through the new plugin list. If the return value is false, iterate through the existing `flexVolumePluginList`.
Member:

if the PluginStub is an instance of PluginProber

I thought both PluginStub and PluginProber were interfaces. Do you mean if the plugin in the InitPlugins loop is an instance of PluginProber? I assume all volume plugins will be instances of PluginStub, since it would be the type extended by VolumePlugin.

Member:

Okay, so the idea is: each time any function on VolumePluginMgr is executed, we call Probe on the plugin, and if that returns true we rescan the directories and return the updated plugin list. I am wondering if we can simply rescan the directory whenever a call to VolumePluginMgr is made, that is, keep the polling on-demand rather than periodic; then we won't need the inotify hook.

Contributor Author:

Do you mean if the plugin in the InitPlugins loop is an instance of PluginProber? I assume all volume plugins will be instances of PluginStub, since it would be the type extended by VolumePlugin.

Yes that's correct. Will clarify.

I am wondering, if we have can simply rescan the directory always whenever a call to VolumePluginMgr is made

This is discussed in Alternative Designs (2).


* Alice is a storage plugin author and would like to deploy a Flexvolume driver on all node instances. She
1. prepares her Flexvolume driver directory, with driver names in `[vendor~]driver/driver` format (e.g. `flexvolume/k8s~nfs/nfs`, see [Flexvolume documentation](https://github.com/kubernetes/community/blob/master/contributors/devel/flexvolume.md#prerequisites)).
2. creates an image by copying her driver to the Flexvolume deployment base image at `/flexvolume`.
@liggitt (Member) commented Jul 25, 2017:

it might simplify this proposal to focus it on dynamic discovery of new plugins, and reprobing of plugins from the kubelet/controllermanager, etc. Deployment via daemonset could be one mechanism, but shouldn't be prescriptive.


### High Level Design

The DaemonSet mounts a hostpath volume exposing the host's Flexvolume driver directory onto every pod. The base deployment image contains a script that copies drivers in the image to the hostpath. This copy triggers the filesystem watch held by kubelet or the controller manager. During volume creation, if the watch has signaled, kubelet or the controller manager probes the driver directory and loads the currently installed drivers as volume plugins.
Contributor:

The DaemonSet has to be privileged to copy the driver.

Contributor:

@chakri-nelluri Could you clarify exactly which operations require privileged mode?

Contributor Author:

@php-coder Kubelet executes Flexvolume drivers outside the container's namespace, so driver files have to be copied into the hostpath volume in privileged mode.


The user story for driver update is similar: Alice creates a new image with her new drivers, and Bob deploys it using the DaemonSet spec.

Note that the `/flexvolume` directory must look exactly like what is desired in the Flexvolume directory on the host (as described in the [Flexvolume documentation](https://github.com/kubernetes/community/blob/master/contributors/devel/flexvolume.md#prerequisites)). The deployment will replace the existing driver directory on the host with contents in `/flexvolume`. Thus, in order to add a new driver without removing existing ones, existing drivers must also appear in `/flexvolume`.
Member:

in order to add a new driver without removing existing ones, existing drivers must also appear in /flexvolume

that is unexpected. I'd expect a flexvolume deployment to be focused on installing a single driver, e.g. populating one driver under /usr/libexec/kubernetes/kubelet-plugins/volume/exec/<vendor>~<driver>/<driver>

Member:

Yeah, I think it has to be possible to install multiple of these


Cons: Depending on the polling period, either it's needlessly frequent, or it's too infrequent to pick up driver updates quickly.

4) Using Jobs instead of DaemonSets to deploy.
Member:

The deployment method should be separated out of this proposal, IMO. The core need is for kubelet and the controller manager to detect and use added flex plugins without a restart. If that requirement is met, you can use all sorts of methods for installing/maintaining the drivers.

Member:

Agree. We should provide an example. That's all.


Because Flexvolume has two separate plugin instantiations (attachable and non-attachable), it's worth considering the case when a driver that implements attach/detach is replaced with a driver that does not, or vice versa. This does not cause an issue because plugins are recreated every time the driver directory is changed.

There is a possibility that a probe occurs at the same time the DaemonSet updates the driver, so the prober's view of drivers is inconsistent. However, this is very rare, and when it does occur, the next `Probe()` call, which occurs shortly after, will be consistent.
Contributor:

+1 for this approach.


### Driver Installation Script

The script will copy the existing content of `/flexvolume` on the host to a location in `/tmp`, and then attempt to copy user-provided drivers to that directory. If the copy fails, the original drivers are restored. This script will not perform any driver validation.
Contributor:

Also, what happens if the probe fails or the plugin fails to initialize?
We might have to document how to revert to a working older version.


### Driver Installation Script

The script will copy the existing content of `/flexvolume` on the host to a location in `/tmp`, and then attempt to copy user-provided drivers to that directory. If the copy fails, the original drivers are restored. This script will not perform any driver validation.
Contributor:

Please add that this is a recommended way. Users are free to innovate :)

Member:

We're planning to provide a driver installation mechanism as part of this proposal, but like @liggitt mentioned, it can be separated out.


The user story for driver update is similar: Alice creates a new image with her new drivers, and Bob deploys it using the DaemonSet spec.

Note that the `/flexvolume` directory must look exactly like what is desired in the Flexvolume directory on the host (as described in the [Flexvolume documentation](https://github.com/kubernetes/community/blob/master/contributors/devel/flexvolume.md#prerequisites)). The deployment will replace the existing driver directory on the host with contents in `/flexvolume`. Thus, in order to add a new driver without removing existing ones, existing drivers must also appear in `/flexvolume`.
Member:

Yeah, I think it has to be possible to install multiple of these


### Driver Installation Script

The script will copy the existing content of `/flexvolume` on the host to a location in `/tmp`, and then attempt to copy user-provided drivers to that directory. If the copy fails, the original drivers are restored. This script will not perform any driver validation.
Member:

I don't understand the "are restored"? A driver should never ever touch anything that isn't its own.

The install process, for most drivers, should be:

while true; do 
  if [ -f /flexvolume/myvendor~mydriver/mydriver ]; then
    sleep 60
  else
    mkdir /flexvolume/myvendor~mydriver
    cp -a /mydriver /flexvolume/myvendor~mydriver
  fi
done


In the volume plugin code, introduce a `PluginStub` interface containing a single method `Init()`, and have `VolumePlugin` extend it. Create a `PluginProber` type which extends `PluginStub` and includes methods `Init()` and `Probe()`.

`Init()` initializes fsnotify, creates a watch on the driver directory as well as its subdirectories (if any), and spawns a goroutine listening for signals from the watch. When the goroutine receives a signal that a new directory was created, it creates a watch for that directory so that driver changes can be seen.
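
A minimal Go sketch of `Init()` using the fsnotify library; the struct fields are illustrative, and the 1-second debounce described below is elided:

package volume

import (
	"os"
	"sync/atomic"

	"github.com/fsnotify/fsnotify"
)

type flexVolumeProber struct {
	pluginDir string      // e.g. /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
	updated   atomic.Bool // set by the watch goroutine, cleared by Probe()
}

func (p *flexVolumeProber) Init() error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	// Watch the driver directory itself; subdirectories are added as they appear.
	if err := watcher.Add(p.pluginDir); err != nil {
		return err
	}
	go func() {
		for event := range watcher.Events {
			// A newly created subdirectory is a candidate driver:
			// watch it too, so changes to its contents are seen.
			if event.Op&fsnotify.Create == fsnotify.Create {
				if fi, err := os.Stat(event.Name); err == nil && fi.IsDir() {
					_ = watcher.Add(event.Name)
				}
			}
			p.updated.Store(true) // tell Probe() a rescan is needed
		}
	}()
	return nil
}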
Member:

why use notify, if you are just going to cache the results?

Contributor Author:

This is discussed in Alternative Design (3).

Member:

If probe() were called a reasonable amount of times per second, would you reconsider this point? Just trying to bound complexity.

Contributor Author:

Yes definitely. Unfortunately I was seeing bursts of tens of milliseconds between calls. If we change the Find*() logic so that if there's already a match then don't check Flex, then we don't need a watch.


`Probe()` scans the driver directory only when the goroutine sets a flag. If the flag is set, it returns true (indicating that new plugins are available) and the list of plugins; otherwise, it returns false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see [Security Considerations](#security-considerations)). Because inotify (used by fsnotify) can only watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists.

Inside `InitPlugins()` from `volume/plugins.go`, if the `PluginStub` is an instance of `PluginProber`, only call its `Init()` and nothing else. Add an additional field, `flexVolumePluginList`, in `VolumePluginMgr` as a cache. For every iteration of the plugin list, call `Probe()` and update `flexVolumePluginList` if true is returned, and iterate through the new plugin list. If the return value is false, iterate through the existing `flexVolumePluginList`.
Member:

I don't think anything related to flex should leak outside the flex directory. It should just be part of the initial handshake to figure out if each thing is a plugin or a meta-plugin. Meta-plugins are probed later.


Pros: Simpler to implement.

Cons: Kubelet or the controller manager iterates through the plugin list many times, so Probe() is called very frequently. Using this model would cause unnecessary disk access. This issue is mitigated if we guarantee that `PluginProber` is the last `PluginStub` in the iteration, and only call `Probe()` if no other plugin is matched, but this logic adds additional complexity.
Member:

Why is probe called so often?


Cons: Depending on the polling period, either it's needlessly frequent, or it's too infrequent to pick up driver updates quickly.

4) Using Jobs instead of DaemonSets to deploy.
Member:

Agree. We should provide an example. That's all.


`Init()` initializes fsnotify, creates a watch on the driver directory as well as its subdirectories (if any), and spawns a goroutine listening for signals from the watch. When the goroutine receives a signal that a new directory was created, it creates a watch for that directory so that driver changes can be seen.

`Probe()` scans the driver directory only when the goroutine sets a flag. If the flag is set, it returns true (indicating that new plugins are available) and the list of plugins; otherwise, it returns false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see [Security Considerations](#security-considerations)). Because inotify (used by fsnotify) can only watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists.
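
Continuing the prober sketch above, `Probe()` could look like the following; `newFlexVolumePlugin` is a hypothetical constructor, the `os` and `strings` imports are assumed, and the dot-file check anticipates the atomic-install discussion below:

func (p *flexVolumeProber) Probe() (bool, []VolumePlugin, error) {
	if !p.updated.Swap(false) {
		return false, nil, nil // no filesystem events since the last scan
	}
	entries, err := os.ReadDir(p.pluginDir)
	if err != nil {
		return false, nil, err
	}
	var plugins []VolumePlugin
	for _, e := range entries {
		// Skip dot-prefixed names: they are in-progress (not yet renamed) installs.
		if e.IsDir() && !strings.HasPrefix(e.Name(), ".") {
			plugins = append(plugins, newFlexVolumePlugin(p.pluginDir, e.Name()))
		}
	}
	// (Refreshing the watch list for new subdirectories is elided.)
	return true, plugins, nil
}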
Member:

So, the mere presence of a new directory means that there is a new flex driver there? File copy is not an atomic operation, and it may happen that only part of the driver script was copied there. You should wait until the whole driver has been copied there. And the big question is how you recognize that...

IMO, installation of a new driver (or a driver update) should be an atomic operation on the fs, e.g. link(2).

Member:

Jan, excellent point. The only atomic file op is rename, so we should be sure that flex EXPLICITLY ignores files that start with a `.`. The installer must copy to flex/.../.mydriver and then rename that to mydriver. Let's make this as explicit as possible.
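
In shell, the install step for a single-file driver would then be (driver and vendor names illustrative):

# Copy to a dot-prefixed name on the same filesystem, then rename into place.
# rename(2) is atomic, and dot-files are ignored by the prober.
cp /mydriver /flexvolume/myvendor~mydriver/.mydriver
mv -f /flexvolume/myvendor~mydriver/.mydriver /flexvolume/myvendor~mydriver/mydriver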

Member:

That assumes that the driver is a single file. The installer must rename multiple files one by one (and rename does not work well with directories).

I am ok with requiring the driver to be a single file, with no helper scripts around, but it must be clearly defined somewhere in flex documentation and a release note.

## **Open Questions**

* How does this system work with containerized kubelet?
* What if drivers are updated while Flexvolume plugin is executing commands?
Member:

This is a very important question and I'd like to see some answer. In particular, kubelet must not ever execute a half-copied driver; running a partly copied shell script as root sounds scary to me.

Member:

Rename, as described above, should not suffer this problem, I think.

Contributor Author:

Other than the scenario Jan mentioned, there are two other possible bad scenarios:

  1. When a driver command is called, part of the driver command is loaded into memory and is being executed. The driver gets modified, and the remaining part of the command (which is new) gets executed. I'm not sure if this could happen.
  2. The driver gets modified while the command is read into memory, before execution even occurs. This could definitely happen.

I don't think renaming solves either of these issues. A write should never happen while an execution is underway. Ideally the solution fixes both the issues above and Jan's scenario.

My first thought is a RWLock implementation, but I'd like to avoid locking if possible.

Member:

  1. should get an ETXTBSY error for the person trying to modify the binary

I don't understand 2?

In any case, rename is atomic. As soon as I open the file descriptor, if anyone renames another file over this file, I still get the old file. It is pinned until I close that fd. Happy to explain more - it's a good trick to have in your toolbox.

Contributor Author:

That's really neat. In that case yes renaming will solve both of these scenarios.

What I tried to describe in (2) is: before the driver gets executed, part of its commands are read into memory, then the remaining commands are modified, and lastly the modified commands are read in.

@thockin (Member) commented Jul 27, 2017 via email

@verult commented Aug 2, 2017

If we go with the route of having a separate credentials file, then like @jsafrane and @thockin mentioned before, we need to make sure that driver command execution doesn't interleave with driver installation. The deployment script in this proposal assumes there is a single file inside the driver directory.

This problem only comes up if the credentials file needs to be changed. One solution is to install each file individually (copy to a .dot file & rename) rather than installing the entire directory all at once.

By the way, the credentials file doesn't have to have a .dot prefix. The Flexvolume plugin uses the driver directory name to compute the name of the driver file.

@verult commented Aug 7, 2017

@kokhang Have you tried using the master IP directly?

/cc @MrHohn

@MrHohn (Member) commented Aug 7, 2017

I tried doing this but I got an error because the K8S API IP was an in-network service IP and not routable by the host.

Kubernetes service VIP should be routable on the host as well, assuming you have kube-proxy running on the same host?

@kokhang commented Aug 7, 2017

@verult @MrHohn How do I know what the master IP is?
I have been testing on Minikube. What I tried was constructing the kubeconfig file using information from the daemonset that contains the flex driver. I tried getting the serviceaccount certs, and the kubernetes host and port, by doing something similar to https://github.com/kubernetes/client-go/blob/master/rest/config.go#L310.

However, on minikube, the IP I get from the KUBERNETES_SERVICE_HOST env is not routable.

@MrHohn commented Aug 8, 2017

How do i know what the master IP is?

@kokhang For Minikube, probably try minikube ip.

@liggitt (Member) commented Aug 8, 2017

How do i know what the master IP is?

@kokhang For Minikube, probably try minikube ip.

The point is that flex plugins need a reliable way to know how to reach the API server. If they have to try to figure out what type of kube deployment they are on, they'll do it inconsistently at best and fail to handle some kube deployments at worst.

In-process volume plugins are given an API client (via the host interface) as part of their Init() call. Passing information about the API server to the flexplugin's init() call seems like a parallel to that. I could see the following possibilities:

  • no api server info (kubelet is running in standalone mode)
  • full path to a kubeconfig file containing api server address, CA, credentials, etc.

For now, the kubeconfig file could be the same as the one the kubelet uses, but if a need for a separate identity came up in the future, a different kubeconfig could be fed to the plugin.
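
For reference, a minimal kubeconfig of the kind that could be handed to the plugin; all values are illustrative:

apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://10.0.0.1:443
    certificate-authority: /etc/kubernetes/pki/ca.crt
users:
- name: flex-driver
  user:
    tokenFile: /var/run/secrets/flex-driver/token   # or a client cert/key pair
contexts:
- name: local
  context:
    cluster: local
    user: flex-driver
current-context: local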

@saad-ali (Member) commented Aug 8, 2017

We had a productive meeting today.

We identified three problems for the Rook Flex Driver:

  1. Discovery of the k8s API endpoint by the Rook Flex Driver.
  2. Credentials the Rook Flex Driver should use against the k8s API endpoint.
  3. Discovery of the Flex directory that the deployment DaemonSet pod should drop the Rook Flex Driver in (since this directory is configurable).

Agreed-upon solution:

  • Rook has a "proxy" pod deployed as a DaemonSet that proxies calls between the Flex volume driver and the k8s API server. The proxy exposes a Unix domain socket via HostPath to the host that the Flex driver talks to.
    • With this approach, inside the pod, the Rook driver can do a DNS lookup (kube-dns) to discover the k8s API endpoint in a reliable manner, and the pod can run with a service account that has the correct permissions to create CRDs.
  • Rook will investigate using the node API to read the kubelet's configz and extract the Flex driver path for kubelets.

Once @verult's dynamic discovery of Flex drivers work is merged, we should add E2E test for Flex to prevent regressions (accidental changes to the Flex API).

@verult commented Aug 8, 2017

A PR for e2e Flexvolume test was recently merged and is in the process of being added to the testgrid: kubernetes/kubernetes#48366

Currently it uses kubelet restart. Once dynamic discovery is finished, this can be changed to make the test non-disruptive.

@k8s-github-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Aug 15, 2017

Because Flexvolume has two separate plugin instantiations (attachable and non-attachable), it's worth considering the case when a driver that implements attach/detach is replaced with a driver that does not, or vice versa. This does not cause an issue because plugins are recreated every time the driver directory is changed.

There is a possibility that a Flexvolume command execution occurs at the same time as the DaemonSet updates the driver, which leads to a bad execution. This cannot be solved within the Kubernetes system without an overhaul. Instead, it is discussed in [Atomic Driver Installation](#atomic-driver-installation) as part of the deployment mechanism. As part of the solution, the Prober will ignore all files that begin with "." in the driver directory.
Contributor:

This is the first time DaemonSet was mentioned and it caught me by surprise that it is being referred to as the DaemonSet.

Could the proposal be extended to describe that DaemonSet is expected to be the mechanism to run on every node something which will then copy scripts to the host, before discussing the race condition?

Member:

Or avoid discussing specific deployment mechanisms here and just say "…possibility that a Flexvolume command execution occurs at the same time as the driver is updated…"

Contributor Author:

Updated. Thanks for catching this!

@fabiand left a comment:

To me the solution proposed by @jsafrane can provide the same functionality, but

  1. aligns better with the mount-utils delivery work
  2. does not touch the host, which avoids having to keep track of host-side state

To me these are two pretty strong points.

Just my 2ct

@verult commented Aug 18, 2017

I think @jsafrane's proposal would be a natural evolution of Flex. The dynamic plugin discovery work here will build towards that goal, offering the ability to discover a Unix domain socket on the fly.

@verult force-pushed the flexdeploy branch 2 times, most recently from 61fe071 to ebee98b, on August 18, 2017 18:14
@saad-ali (Member):
Merging this.

@saad-ali (Member):
/lgtm

@k8s-ci-robot added the lgtm label ("looks good to me", indicates that a PR is ready to be merged) on Aug 22, 2017
@k8s-github-robot:
Automatic merge from submit-queue

@k8s-github-robot merged commit 6a7241b into kubernetes:master on Aug 22, 2017