
Dynamic Flexvolume plugin discovery proposal. #833

Merged
merged 4 commits from the flexdeploy branch into kubernetes:master on Aug 22, 2017

Conversation

@verult (Contributor) commented Jul 22, 2017

This lays out the design for the improved deployment story of Flexvolume, making it easier to work with out-of-tree volume plugins before CSI comes to fruition.

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Jul 22, 2017
@verult commented Jul 24, 2017

/assign @chakri-nelluri @msau42 @saad-ali
/cc @thockin

@k8s-ci-robot (Contributor):
@verult: GitHub didn't allow me to assign the following users: chakri-nelluri.

Note that only kubernetes members can be assigned.

In response to this:

/assign @chakri-nelluri @msau42 @saad-ali
/cc @thockin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msau42 (Member) commented Jul 24, 2017

/sig storage

@k8s-ci-robot added the sig/storage label (categorizes an issue or PR as relevant to SIG Storage) on Jul 24, 2017

## **Background**

Beginning in version 1.8, the Kubernetes Storage SIG will stop accepting in-tree volume plugins and advises all storage providers to implement out-of-tree plugins. Currently, there are two recommended implementations: Container Storage Interface (CSI) and Flexvolume. CSI provides a single interface that storage vendors can implement to make their storage solutions work across many different container orchestrators, and its volume plugins are out-of-tree by design. However, CSI is a large effort whose full implementation is several quarters away, so there is a need for an immediate solution that lets storage vendors continue adding volume plugins. Flexvolume is an in-tree plugin, available today, that can run any storage solution by executing volume commands against a user-provided driver on the Kubernetes host. However, the process of setting up Flexvolume is very manual, pushing it out of consideration for many users. Problems include having to copy the driver to a specific location on each node, manually restarting kubelet, and users' limited access to machines.
Member:

nit: split this up into smaller paragraphs to make it easier to read and comment on.




* Alice is a storage plugin author and would like to deploy a Flexvolume driver on all node instances. She creates an image by copying her Flexvolume driver directory, with driver names in `[vendor~]driver/driver` format (e.g. `flexvolume/k8s~nfs/nfs`), to the Flexvolume deployment base image at `/flexvolume`. Then she makes her image available to Bob, a cluster admin.
Member:

Maybe split this up into steps per persona.

Could you add an example of an image Dockerfile and DaemonSet?

* Bob modifies the existing deployment DaemonSet spec with the name of the given image, and creates the DaemonSet.
* Charlie, an end user, creates volumes using the installed plugin.
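
Responding to the request above, here is a minimal sketch of the two artifacts. The base image name, registry, and driver are illustrative assumptions, not part of the proposal.

# Hypothetical Dockerfile for Alice's driver image (base image name assumed).
FROM example.com/flexvolume-deploy-base:latest
# Lay the driver out exactly as it should appear in the host's driver directory.
COPY nfs /flexvolume/k8s~nfs/nfs

# Hypothetical DaemonSet spec that Bob edits; only the image name changes per driver.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: flexvolume-deploy
spec:
  selector:
    matchLabels:
      name: flexvolume-deploy
  template:
    metadata:
      labels:
        name: flexvolume-deploy
    spec:
      containers:
      - name: deploy
        image: example.com/alice/flexvolume-nfs:v1
        securityContext:
          privileged: true              # required to write into the hostPath
        volumeMounts:
        - name: flexvolume-dir
          mountPath: /host/flexvolume
      volumes:
      - name: flexvolume-dir
        hostPath:
          # The default Flexvolume driver directory on the host.
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/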

The user story for driver update is similar: Alice creates a new image with her new drivers, and Bob deploys it using the DaemonSet spec. Note that the `/flexvolume` directory must look exactly like what is desired in the Flexvolume directory on the host; thus, in order to add a new driver without removing existing ones, existing drivers must also appear in `/flexvolume`.
Member:

Can you describe exactly "what is desired"?


### Driver Installation Script

The script will copy the existing content of `/flexvolume` on the host to a location in `/tmp`, and then attempt to copy user-provided drivers to that directory. If the copy fails, the original drivers are restored. This script will not perform any driver validation.
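
A minimal sketch of such a script, assuming the host's driver directory is mounted into the pod at `/host/flexvolume` and the image ships its drivers under `/flexvolume` (both paths are assumptions):

#!/bin/sh
# Back up the host's current drivers, then attempt the copy; restore on failure.
BACKUP=$(mktemp -d /tmp/flexvolume-backup.XXXXXX)
cp -a /host/flexvolume/. "$BACKUP"/
if ! cp -a /flexvolume/. /host/flexvolume/; then
  # The copy failed part-way: clear the partial state and restore the backup.
  find /host/flexvolume -mindepth 1 -delete
  cp -a "$BACKUP"/. /host/flexvolume/
  exit 1
fi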
Member:

How are failures shown to the user?

Contributor Author:

Not sure what's the best way. I'll add that to Open Questions.

Contributor:

Can we use the exit status from the DaemonSet to report the error?

Member:

I don't understand the "are restored"? A driver should never ever touch anything that isn't its own.

The install process, for most drivers, should be:

while true; do 
  if [ -f /flexvolume/myvendor~mydriver/mydriver ]; then
    sleep 60
  else
    mkdir /flexvolume/myvendor~mydriver
    cp -a /mydriver /flexvolume/myvendor~mydriver
  fi
done

Contributor Author:

The model in the current design is the entire Flexvolume directory containing all drivers get replaced, so when a copy fails, the backup of the entire directory is restored.

After moving to the single-driver install model, this will be changed to backing up a single driver instead.

Member:

I think the hard line is at "you should never ever touch something you don't own".


### Dynamic Plugin Discovery

Create a `PluginStub` interface containing a single method `Init()`, and have `VolumePlugin` extend it. Create a `PluginProber` type which extends `PluginStub` and includes the methods `Init()` and `Probe()`. `Init()` initializes fsnotify, creates a watch on the driver directory as well as its subdirectories (if any), and spawns a goroutine listening for signals from the watch. When the goroutine receives a signal that a new directory was created, it creates a watch for that directory so that driver changes can be seen. `Probe()` scans the driver directory only when the goroutine sets a flag. If the flag is set, it returns true (indicating that new plugins are available) and the list of plugins; otherwise, it returns false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see Security Considerations). Because inotify (used by fsnotify) can only watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists.
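
A rough Go sketch of the types this paragraph introduces; method signatures beyond the names `Init()` and `Probe()` are assumptions:

package volume

// PluginStub is the minimal interface shared by volume plugins and the prober.
type PluginStub interface {
	Init() error
}

// VolumePlugin extends PluginStub with the existing volume plugin methods.
type VolumePlugin interface {
	PluginStub
	// ... existing VolumePlugin methods (GetPluginName, CanSupport, etc.)
}

// PluginProber extends PluginStub. Probe reports whether the set of
// Flexvolume drivers changed and, if so, returns the rebuilt plugin list.
type PluginProber interface {
	PluginStub
	Probe() (updated bool, plugins []VolumePlugin, err error)
}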
Member:

Split this up into steps instead of one giant block of text


### Dynamic Plugin Discovery

Create a `PluginStub` interface containing a single method `Init()`, and have `VolumePlugin` extend it. Create a `PluginProber` type which extends `PluginStub` and includes the methods `Init()` and `Probe()`. `Init()` initializes fsnotify, creates a watch on the driver directory as well as its subdirectories (if any), and spawns a goroutine listening for signals from the watch. When the goroutine receives a signal that a new directory was created, it creates a watch for that directory so that driver changes can be seen. `Probe()` scans the driver directory only when the goroutine sets a flag. If the flag is set, it returns true (indicating that new plugins are available) and the list of plugins; otherwise, it returns false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see Security Considerations). Because inotify (used by fsnotify) can only watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists.
Member:

Who creates this interface?


### Copying Driver File to Image

Using a publicly available deployment image containing the script as the base image, the plugin author copies the Flexvolume driver directory (e.g. `flexvolume/k8s~nfs/nfs`) to `/flexvolume` and makes the image available to the cluster admin.
Member:

Maybe also have a section for the public deployment base image.


Because Flexvolume has two separate plugin instantiations (attachable and non-attachable), it's worth considering the case when a driver that implements attach/detach is replaced with a driver that does not, or vice versa. This does not cause an issue because plugins are recreated every time the driver directory is changed.

There is a possibility that a probe occurs at the same time the DaemonSet updates the driver, so that the prober's view of drivers is inconsistent. However, this is very rare, and when it does occur, the next `Probe()` call, which occurs shortly after, will be consistent.
Member:

There are some parts of the volume code that iterate through all the volume plugins. This will also need to deal with synchronization with those calls.

@verult force-pushed the flexdeploy branch 2 times, most recently from 22ce9e0 to d8d5536, on July 24, 2017 21:45
@kokhang left a comment:

Also how does this work in conjunction with kubernetes/enhancements#278? It seems that we can utilize the base image to install/add mount tools to be available in the host.


Inside `InitPlugins()` from `volume/plugins.go`, if the `PluginStub` is an instance of `PluginProber`, only call its `Init()` and nothing else. Add an additional field, `flexVolumePluginList`, in `VolumePluginMgr` as a cache. For every iteration of the plugin list, call `Probe()` and update `flexVolumePluginList` if true is returned, and iterate through the new plugin list. If the return value is false, iterate through the existing `flexVolumePluginList`.
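
A rough Go sketch of that lookup path, building on the interface sketch above; the `Spec` type, the `CanSupport` method, and the struct fields are assumptions taken from this paragraph, not the final implementation:

type VolumePluginMgr struct {
	prober               PluginProber   // the Flexvolume PluginProber
	staticPlugins        []VolumePlugin // plugins registered at InitPlugins() time
	flexVolumePluginList []VolumePlugin // cache, refreshed when Probe() reports changes
}

func (pm *VolumePluginMgr) FindPluginBySpec(spec *Spec) (VolumePlugin, error) {
	// Consult the prober on every lookup, but rescan the driver
	// directory only when it reports that something changed.
	if updated, plugins, err := pm.prober.Probe(); err == nil && updated {
		pm.flexVolumePluginList = plugins // refresh the cache
	}
	for _, p := range append(pm.staticPlugins, pm.flexVolumePluginList...) {
		if p.CanSupport(spec) {
			return p, nil
		}
	}
	return nil, errors.New("no volume plugin matched")
}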

Because Flexvolume has two separate plugin instantiations (attachable and non-attachable), it's worth considering the case when a driver that implements attach/detach is replaced with a driver that does not, or vice versa. This does not cause an issue because plugins are recreated every time the driver directory is changed.
Comment:

If the flexvolume implements attach/detach interface, how are you going to extend the controller-manager (assuming you are using a daemonset)?

Contributor Author:

What do you mean by "extend"?

If you are asking how the drivers get deployed through DaemonSet to a master: DaemonSet will add a pod to the master node regardless of whether the node is set to schedulable.

If you are asking how the controller-manager picks up the newly added attach/detach procedures when the driver is replaced: during plugin probe a FlexVolumeAttachablePlugin is created, which replaces the previous plugin. The attachable plugin will enable attach/detach calls from AttachDetachController.

Member:

DaemonSet will add a pod to the master node

Not all masters run kubelet. E.g. GKE.


Pros: Designed for containers that eventually terminate. No need to have the container go into an infinite loop.

Cons: Does not guarantee every node has a pod running.
Comment:

Can we use affinity/placement to target all nodes? Although you would still have issues if new nodes are added to the k8s cluster.

Contributor Author:

That's a good point, I'll add it to the doc. Yes like you said, with Jobs you have to specify how many pods you'd like to see running to completion, which is a problem when new nodes come up.


Pros: Simpler to implement.

Cons: Kubelet or the controller manager iterates through the plugin list many times, so Probe() is called very frequently. Using this model would cause unnecessary disk access. This issue is mitigated if we guarantee that `PluginProber` is the last `PluginStub` in the iteration, and only call `Probe()` if no other plugin is matched, but this logic adds additional complexity.
Comment:

I agree. Flexvolume drivers are not something that will be added or modified often.

Member:

Why is probe called so often?

Contributor Author:

It's called for every FindPluginBy*() call, because it currently iterates through all plugins and errors out when multiple plugins are found. As for why FindPluginBy*() is called so often, I don't know; I can only tell that from logs.

Member:

we should track that backwards - it is surprising to me.

Contributor Author:

Seems like there are (at least) two reasons:

  1. In every DSW populator loop, it's called for volumes of every existing pod, in order to re-associate each pod with its required volumes.

  2. Certain volume types are set to constantly remount, triggering a Probe every time the remount occurs.

@verult commented Jul 24, 2017

Just to clarify: the containerized mounts feature doesn't install mount utilities on the host directly; rather, it keeps them in a container, and volume plugins need to do a container exec to access them.

The deployment script in this design can be composed with the mount utility container, and the DaemonSet spec here can be combined with the containerized mounts DaemonSet spec. The Flexvolume drivers will have to be able to perform the container execs to access containerized mount utilities, but otherwise it should just work.


### High Level Design

The DaemonSet mounts a hostpath volume exposing the host's Flexvolume driver directory onto every pod. The base deployment image contains a script that copies drivers in the image to the hostpath. This copy triggers the filesystem watch held by kubelet or the controller manager. During volume creation, if the watch has signaled, kubelet or the controller manager probes the driver directory and loads the currently installed drivers as volume plugins.
Member:

Why does the flexvolume content have to be copied to every pod? Assuming a non-containerized kubelet, shouldn't this just be on the host?

Contributor Author:

I meant to say every DaemonSet pod. Will clarify.


`Probe()` scans the driver directory only when the goroutine sets a flag. If the flag is set, it returns true (indicating that new plugins are available) and the list of plugins; otherwise, it returns false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see [Security Considerations](#security-considerations)). Because inotify (used by fsnotify) can only watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists.

Inside `InitPlugins()` from `volume/plugins.go`, if the `PluginStub` is an instance of `PluginProber`, only call its `Init()` and nothing else. Add an additional field, `flexVolumePluginList`, in `VolumePluginMgr` as a cache. For every iteration of the plugin list, call `Probe()` and update `flexVolumePluginList` if true is returned, and iterate through the new plugin list. If the return value is false, iterate through the existing `flexVolumePluginList`.
Member:

if the PluginStub is an instance of PluginProber

I thought both PluginStub and PluginProber were interfaces. Do you mean if the plugin in the InitPlugins loop is an instance of PluginProber? I assume all volume plugins will be instances of PluginStub, since it would be the type extended by VolumePlugin.

Member:

Okay, so the idea is: each time any function on VolumePluginMgr is executed, we call Probe on the plugin, and if that returns true we rescan the directories and return the updated plugin list. I am wondering if we can simply rescan the directory whenever a call to VolumePluginMgr is made, that is, keep the polling on-demand rather than periodic; then we won't need the inotify hook.

Contributor Author:

Do you mean if the plugin in the InitPlugins loop is an instance of PluginProber? I assume all volume plugins will be instances of PluginStub, since it would be the type extended by VolumePlugin.

Yes that's correct. Will clarify.

I am wondering, if we have can simply rescan the directory always whenever a call to VolumePluginMgr is made

This is discussed in Alternative Designs (2).


* Alice is a storage plugin author and would like to deploy a Flexvolume driver on all node instances. She
1. prepares her Flexvolume driver directory, with driver names in `[vendor~]driver/driver` format (e.g. `flexvolume/k8s~nfs/nfs`, see [Flexvolume documentation](https://github.com/kubernetes/community/blob/master/contributors/devel/flexvolume.md#prerequisites)).
2. creates an image by copying her driver to the Flexvolume deployment base image at `/flexvolume`.
@liggitt (Member) commented Jul 25, 2017:

it might simplify this proposal to focus it on dynamic discovery of new plugins, and reprobing of plugins from the kubelet/controllermanager, etc. Deployment via daemonset could be one mechanism, but shouldn't be prescriptive.


### High Level Design

The DaemonSet mounts a hostpath volume exposing the host's Flexvolume driver directory onto every pod. The base deployment image contains a script that copies drivers in the image to the hostpath. This copy triggers the filesystem watch held by kubelet or the controller manager. During volume creation, if the watch has signaled, kubelet or the controller manager probes the driver directory and loads the currently installed drivers as volume plugins.
Contributor:

The DaemonSet has to be privileged to copy the driver.

Contributor:

@chakri-nelluri Could you clarify exactly which operations require privileged mode?

Contributor Author:

@php-coder Kubelet executes Flexvolume drivers outside the container's namespace, so driver files have to be copied into the hostpath volume in privileged mode.


The user story for driver update is similar: Alice creates a new image with her new drivers, and Bob deploys it using the DaemonSet spec.

Note that the `/flexvolume` directory must look exactly like what is desired in the Flexvolume directory on the host (as described in the [Flexvolume documentation](https://github.com/kubernetes/community/blob/master/contributors/devel/flexvolume.md#prerequisites)). The deployment will replace the existing driver directory on the host with contents in `/flexvolume`. Thus, in order to add a new driver without removing existing ones, existing drivers must also appear in `/flexvolume`.
Member:

in order to add a new driver without removing existing ones, existing drivers must also appear in /flexvolume

that is unexpected. I'd expect a flexvolume deployment to be focused on installing a single driver, e.g. populating one driver under /usr/libexec/kubernetes/kubelet-plugins/volume/exec/<vendor>~<driver>/<driver>

Member:

Yeah, I think it has to be possible to install multiple of these


Cons: Depending on the polling period, either it's needlessly frequent, or it's too infrequent to pick up driver updates quickly.

4) Using Jobs instead of DaemonSets to deploy.
Member:

The deployment method should be separated out of this proposal, IMO. The core need is for kubelet and the controller manager to detect and use added flex plugins without a restart. If that requirement is met, you can use all sorts of methods for installing/maintaining the drivers.

Member:

Agree. We should provide an example. That's all.


Because Flexvolume has two separate plugin instantiations (attachable and non-attachable), it's worth considering the case when a driver that implements attach/detach is replaced with a driver that does not, or vice versa. This does not cause an issue because plugins are recreated every time the driver directory is changed.

There is a possibility that a probe occurs at the same time the DaemonSet updates the driver, so the prober's view of drivers is inconsistent. However, this is very rare, and when it does occur, the next `Probe()` call, which occurs shortly after, will be consistent.
Contributor:

+1 for this approach.


### Driver Installation Script

The script will copy the existing content of `/flexvolume` on the host to a location in `/tmp`, and then attempt to copy user-provided drivers to that directory. If the copy fails, the original drivers are restored. This script will not perform any driver validation.
Contributor:

Also, what happens if the probe fails or the plugin fails to initialize?
We might have to document how to revert to a working older version.


### Driver Installation Script

The script will copy the existing content of `/flexvolume` on the host to a location in `/tmp`, and then attempt to copy user-provided drivers to that directory. If the copy fails, the original drivers are restored. This script will not perform any driver validation.
Contributor:

Please add that this is a recommended way. Users are free to innovate :)

Member:

We're planning to provide a driver installation mechanism as part of this proposal, but like @liggitt mentioned, it can be separated out.


The user story for driver update is similar: Alice creates a new image with her new drivers, and Bob deploys it using the DaemonSet spec.

Note that the `/flexvolume` directory must look exactly like what is desired in the Flexvolume directory on the host (as described in the [Flexvolume documentation](https://github.com/kubernetes/community/blob/master/contributors/devel/flexvolume.md#prerequisites)). The deployment will replace the existing driver directory on the host with contents in `/flexvolume`. Thus, in order to add a new driver without removing existing ones, existing drivers must also appear in `/flexvolume`.
Member:

Yeah, I think it has to be possible to install multiple of these


### Driver Installation Script

The script will copy the existing content of `/flexvolume` on the host to a location in `/tmp`, and then attempt to copy user-provided drivers to that directory. If the copy fails, the original drivers are restored. This script will not perform any driver validation.
Member:

I don't understand the "are restored"? A driver should never ever touch anything that isn't its own.

The install process, for most drivers, should be:

while true; do 
  if [ -f /flexvolume/myvendor~mydriver/mydriver ]; then
    sleep 60
  else
    mkdir /flexvolume/myvendor~mydriver
    cp -a /mydriver /flexvolume/myvendor~mydriver
  fi
done


In the volume plugin code, introduce a `PluginStub` interface containing a single method `Init()`, and have `VolumePlugin` extend it. Create a `PluginProber` type which extends `PluginStub` and includes methods `Init()` and `Probe()`.

`Init()` initializes fsnotify, creates a watch on the driver directory as well as its subdirectories (if any), and spawns a goroutine listening for signals from the watch. When the goroutine receives a signal that a new directory was created, it creates a watch for that directory so that driver changes can be seen.
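
A minimal Go sketch of `Init()` using the fsnotify library; the struct fields are illustrative, and the 1-second debounce described below is elided:

package volume

import (
	"os"
	"sync/atomic"

	"github.com/fsnotify/fsnotify"
)

type flexVolumeProber struct {
	pluginDir string      // e.g. /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
	updated   atomic.Bool // set by the watch goroutine, cleared by Probe()
}

func (p *flexVolumeProber) Init() error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	// Watch the driver directory itself; subdirectories are added as they appear.
	if err := watcher.Add(p.pluginDir); err != nil {
		return err
	}
	go func() {
		for event := range watcher.Events {
			// A newly created subdirectory is a candidate driver:
			// watch it too, so changes to its contents are seen.
			if event.Op&fsnotify.Create == fsnotify.Create {
				if fi, err := os.Stat(event.Name); err == nil && fi.IsDir() {
					_ = watcher.Add(event.Name)
				}
			}
			p.updated.Store(true) // tell Probe() a rescan is needed
		}
	}()
	return nil
}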
Member:

why use notify, if you are just going to cache the results?

Contributor Author:

This is discussed in Alternative Design (3).

Member:

If probe() were called a reasonable amount of times per second, would you reconsider this point? Just trying to bound complexity.

Contributor Author:

Yes definitely. Unfortunately I was seeing bursts of tens of milliseconds between calls. If we change the Find*() logic so that if there's already a match then don't check Flex, then we don't need a watch.


`Probe()` scans the driver directory only when the goroutine sets a flag. If the flag is set, it returns true (indicating that new plugins are available) and the list of plugins; otherwise, it returns false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see [Security Considerations](#security-considerations)). Because inotify (used by fsnotify) can only watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists.

Inside `InitPlugins()` from `volume/plugins.go`, if the `PluginStub` is an instance of `PluginProber`, only call its `Init()` and nothing else. Add an additional field, `flexVolumePluginList`, in `VolumePluginMgr` as a cache. For every iteration of the plugin list, call `Probe()` and update `flexVolumePluginList` if true is returned, and iterate through the new plugin list. If the return value is false, iterate through the existing `flexVolumePluginList`.
Member:

I don't think anything related to flex should leak outside the flex directory. It should just be part of the initial handshake to figure out if each thing is a plugin or a meta-plugin. Meta-plugins are probed later.


Pros: Simpler to implement.

Cons: Kubelet or the controller manager iterates through the plugin list many times, so Probe() is called very frequently. Using this model would cause unnecessary disk access. This issue is mitigated if we guarantee that `PluginProber` is the last `PluginStub` in the iteration, and only call `Probe()` if no other plugin is matched, but this logic adds additional complexity.
Member:

Why is probe called so often?


Cons: Depending on the polling period, either it's needlessly frequent, or it's too infrequent to pick up driver updates quickly.

4) Using Jobs instead of DaemonSets to deploy.
Member:

Agree. We should provide an example. That's all.


`Init()` initializes fsnotify, creates a watch on the driver directory as well as its subdirectories (if any), and spawns a goroutine listening for signals from the watch. When the goroutine receives a signal that a new directory was created, it creates a watch for that directory so that driver changes can be seen.

`Probe()` scans the driver directory only when the goroutine sets a flag. If the flag is set, it returns true (indicating that new plugins are available) and the list of plugins; otherwise, it returns false and nil. After the scan, the watch is refreshed to include the new list of subdirectories. The goroutine should only record a signal if there has been a 1-second delay since the last signal (see [Security Considerations](#security-considerations)). Because inotify (used by fsnotify) can only watch an existing directory, the goroutine needs to maintain the invariant that the driver directory always exists.
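
Continuing the prober sketch above, `Probe()` could look like the following; `newFlexVolumePlugin` is a hypothetical constructor, the `os` and `strings` imports are assumed, and the dot-file check anticipates the atomic-install discussion below:

func (p *flexVolumeProber) Probe() (bool, []VolumePlugin, error) {
	if !p.updated.Swap(false) {
		return false, nil, nil // no filesystem events since the last scan
	}
	entries, err := os.ReadDir(p.pluginDir)
	if err != nil {
		return false, nil, err
	}
	var plugins []VolumePlugin
	for _, e := range entries {
		// Skip dot-prefixed names: they are in-progress (not yet renamed) installs.
		if e.IsDir() && !strings.HasPrefix(e.Name(), ".") {
			plugins = append(plugins, newFlexVolumePlugin(p.pluginDir, e.Name()))
		}
	}
	// (Refreshing the watch list for new subdirectories is elided.)
	return true, plugins, nil
}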
Member:

So, the mere presence of a new directory means that there is a new flex driver there? File copy is not an atomic operation, and it may happen that only part of the driver script was copied there. You should wait until the whole driver has been copied there. And the big question is how you recognize that...

IMO, installation of a new driver (or a driver update) should be an atomic operation on the fs, e.g. link(2).

Member:

Jan, excellent point. The only atomic file op is rename, so we should be sure that flex EXPLICITLY ignores files that start with a `.`. The installer must copy to flex/.../.mydriver and then rename that to mydriver. Let's make this as explicit as possible.
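
In shell, the install step for a single-file driver would then be (driver and vendor names illustrative):

# Copy to a dot-prefixed name on the same filesystem, then rename into place.
# rename(2) is atomic, and dot-files are ignored by the prober.
cp /mydriver /flexvolume/myvendor~mydriver/.mydriver
mv -f /flexvolume/myvendor~mydriver/.mydriver /flexvolume/myvendor~mydriver/mydriver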

Member:

That assumes that the driver is a single file. The installer must rename multiple files one by one (and rename does not work well with directories).

I am ok with requiring the driver to be a single file, with no helper scripts around, but it must be clearly defined somewhere in flex documentation and a release note.

## **Open Questions**

* How does this system work with containerized kubelet?
* What if drivers are updated while Flexvolume plugin is executing commands?
Member:

This is a very important question and I'd like to see some answer. In particular, kubelet must not ever execute a half-copied driver; running a partly copied shell script as root sounds scary to me.

Member:

Rename, as described above, should not suffer this problem, I think.

Contributor Author:

Other than the scenario Jan mentioned, there are two other possible bad scenarios:

  1. When a driver command is called, part of the driver command is loaded into memory and is being executed. The driver gets modified, and the remaining part of the command (which is new) gets executed. I'm not sure if this could happen.
  2. The driver gets modified while the command is read into memory, before execution even occurs. This could definitely happen.

I don't think renaming solves either of these issues. A write should never happen while an execution is underway. Ideally the solution fixes both the issues above and Jan's scenario.

My first thought is a RWLock implementation, but I'd like to avoid locking if possible.

Member:

  1. should get an ETXTBSY error for the person trying to modify the binary

I don't understand 2?

In any case, rename is atomic. As soon as I open the file descriptor, if anyone renames another file over this file, I still get the old file. It is pinned until I close that fd. Happy to explain more - it's a good trick to have in your toolbox.

Contributor Author:

That's really neat. In that case yes renaming will solve both of these scenarios.

What I tried to describe in (2) is: before the driver gets executed, part of its commands are read into memory, then the remaining commands are modified, and lastly the modified commands are read in.

@thockin (Member) commented Jul 27, 2017 via email

@verult commented Aug 2, 2017

If we go with the route of having a separate credentials file, then like @jsafrane and @thockin mentioned before, we need to make sure that driver command execution doesn't interleave with driver installation. The deployment script in this proposal assumes there is a single file inside the driver directory.

This problem only comes up if the credentials file needs to be changed. One solution is to install each file individually (copy to a .dot file & rename) rather than installing the entire directory all at once.

By the way, the credentials file doesn't have to have a .dot prefix. The Flexvolume plugin uses the driver directory name to compute the name of the driver file.

@verult commented Aug 7, 2017

@kokhang Have you tried using the master IP directly?

/cc @MrHohn

@MrHohn (Member) commented Aug 7, 2017

I tried doing this but I got an error because the K8S API IP was an in-network service IP and not routable by the host.

Kubernetes service VIP should be routable on the host as well, assuming you have kube-proxy running on the same host?

@kokhang commented Aug 7, 2017

@verult @MrHohn How do I know what the master IP is?
I have been testing on Minikube. What I tried was constructing the kubeconfig file using information from the daemonset that contains the flex driver. I tried getting the serviceaccount certs, and the kubernetes host and port, by doing something similar to https://github.com/kubernetes/client-go/blob/master/rest/config.go#L310.

However, on minikube, the IP I get from the KUBERNETES_SERVICE_HOST env is not routable.

@MrHohn commented Aug 8, 2017

How do i know what the master IP is?

@kokhang For Minikube, probably try minikube ip.

@liggitt (Member) commented Aug 8, 2017

How do i know what the master IP is?

@kokhang For Minikube, probably try minikube ip.

The point is that flex plugins need a reliable way to know how to reach the API server. If they have to try to figure out what type of kube deployment they are on, they'll do it inconsistently at best and fail to handle some kube deployments at worst.

In-process volume plugins are given an API client (via the host interface) as part of their Init() call. Passing information about the API server to the flexplugin's init() call seems like a parallel to that. I could see the following possibilities:

  • no api server info (kubelet is running in standalone mode)
  • full path to a kubeconfig file containing api server address, CA, credentials, etc.

For now, the kubeconfig file could be the same as the one the kubelet uses, but if a need for a separate identity came up in the future, a different kubeconfig could be fed to the plugin.
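
For reference, a minimal kubeconfig of the kind that could be handed to the plugin; all values are illustrative:

apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://10.0.0.1:443
    certificate-authority: /etc/kubernetes/pki/ca.crt
users:
- name: flex-driver
  user:
    tokenFile: /var/run/secrets/flex-driver/token   # or a client cert/key pair
contexts:
- name: local
  context:
    cluster: local
    user: flex-driver
current-context: local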

@saad-ali (Member) commented Aug 8, 2017

We had a productive meeting today.

We identified three problems for the Rook Flex Driver:

  1. Discovery of the k8s API endpoint by the Rook Flex Driver.
  2. Credentials the Rook Flex Driver should use against the k8s API endpoint.
  3. Discovery of the Flex directory that the deployment DaemonSet pod should drop the Rook Flex Driver in (since this directory is configurable).

Agreed-upon solution:

  • Rook has a "proxy" pod deployed as a DaemonSet that proxies calls between the Flex volume driver and the k8s API server. The proxy exposes a Unix domain socket via HostPath to the host that the Flex driver talks to.
    • With this approach, inside the pod, the Rook driver can do a DNS lookup (kube-dns) to discover the k8s API endpoint in a reliable manner, and the pod can run with a service account that has the correct permissions to create CRDs.
  • Rook will investigate using the node API to read the kubelet's configz and extract the Flex driver path for kubelets.

Once @verult's dynamic discovery of Flex drivers work is merged, we should add E2E test for Flex to prevent regressions (accidental changes to the Flex API).

@verult commented Aug 8, 2017

A PR for e2e Flexvolume test was recently merged and is in the process of being added to the testgrid: kubernetes/kubernetes#48366

Currently it uses kubelet restart. Once dynamic discovery is finished, this can be changed to make the test non-disruptive.

@k8s-github-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Aug 15, 2017

Because Flexvolume has two separate plugin instantiations (attachable and non-attachable), it's worth considering the case when a driver that implements attach/detach is replaced with a driver that does not, or vice versa. This does not cause an issue because plugins are recreated every time the driver directory is changed.

There is a possibility that a Flexvolume command execution occurs at the same time as the DaemonSet updates the driver, which leads to a bad execution. This cannot be solved within the Kubernetes system without an overhaul. Instead, it is discussed in [Atomic Driver Installation](#atomic-driver-installation) as part of the deployment mechanism. As part of the solution, the Prober will ignore all files that begin with "." in the driver directory.
Contributor:

This is the first time DaemonSet was mentioned and it caught me by surprise that it is being referred to as the DaemonSet.

Could the proposal be extended to describe that DaemonSet is expected to be the mechanism to run on every node something which will then copy scripts to the host, before discussing the race condition?

Member:

Or avoid discussing specific deployment mechanisms here and just say "…possibility that a Flexvolume command execution occurs at the same time as the driver is updated…"

Contributor Author:

Updated. Thanks for catching this!

@fabiand left a comment:

To me the solution proposed by @jsafrane can provide the same functionality, but

  1. aligns better with the mount-utils delivery work
  2. does not touch the host, which avoids having to keep track of host-side state

To me these are two pretty strong points.

Just my 2ct

@verult commented Aug 18, 2017

I think @jsafrane's proposal would be a natural evolution of Flex. The dynamic plugin discovery work here will build towards that goal, offering the ability to discover a Unix domain socket on the fly.

@verult force-pushed the flexdeploy branch 2 times, most recently from 61fe071 to ebee98b, on August 18, 2017 18:14
@saad-ali (Member):
Merging this.

@saad-ali (Member):
/lgtm

@k8s-ci-robot added the lgtm label ("looks good to me", indicates that a PR is ready to be merged) on Aug 22, 2017
@k8s-github-robot:
Automatic merge from submit-queue

@k8s-github-robot merged commit 6a7241b into kubernetes:master on Aug 22, 2017