Kubernetes Volume System Redesign Proposal #18333

Closed
saad-ali opened this Issue Dec 8, 2015 · 19 comments

@saad-ali
Member

saad-ali commented Dec 8, 2015

Objective

The purpose of this document is to consolidate the major Kubernetes Volume issues/feature requests and introduce high-level designs that ensure alignment among all of them.

Background

There are three big designs currently being considered for Kubernetes Volume Storage:

  1. Dynamic Provisioning of Volumes
  2. Volume Attach Detach Controller
  3. Flex Volume Plugin

The solutions to all three can be considered and designed in isolation; however, they touch on overlapping issues. Therefore, it makes sense to first agree on an overall design direction for the Kubernetes volume architecture before finalizing the specifics of each of these.

Goals

  • Introduce Dynamic Provisioning
    • Introduce ability for cluster to automatically (dynamically) create (provision) volumes instead of relying on a cluster admin to manually create them ahead of time.
    • Update (Jul 1, 2016): Introduced in v1.2 (#14537) as alpha. Being fully fleshed out in v1.4 (#26908).
  • Make Attach/Detach Robust
    • Make volume attachment and detachment independent of individual node availability.
    • Update (Jul 1, 2016): Introduced in v1.3 (#26351).
  • Improve Plugin Model
    • Establish a plugin model that allows new storage plugins to be written without recompiling Kubernetes code.
    • Update (Jul 1, 2016): discussion is ongoing.
  • Enable Modularity and Improve Deployment
    • Make it easy for cluster admins to select which volume types they want to support, and easily configure/deploy the plugin(s) across the cluster.
  • Volume Selector
    • Enable users to select specific storage based on implementation-specific characteristics (a specific rack, etc.).
    • Update (Jul 1, 2016): Introduced in v1.3 #25917.
  • Volume Classes
    • Create an abstraction layer (classes) for Kubernetes volumes users so they can request different grades of storage without worrying about the specifics of any one storage implementation.
    • Update (Jul 1, 2016): Being fully fleshed out as part of the dynamic provisioning design in v1.4 (#26908).

Proposal 1: Single Controller

Replace the existing persistent volume controllers that do binding (volumes to claims) and recycling (wiping volumes for reuse) with a single controller that does provisioning (creating new volumes), binding, and recycling. This controller could monitor PersistentVolume and PersistentVolumeClaim objects via the API server to determine when a new volume needs to be created.

Another new controller would be responsible for attaching and detaching volumes by monitoring PersistentVolume and Pod objects to determine when to attach/detach a volume from the node.

For a seamless transition, in the immediate term, the existing volume plugin model could be used largely as-is (although some refactoring would be needed). To implement a storage plugin, a third party would need to provide methods to create/delete/attach/detach and, optionally, mount/unmount a volume. The new controllers and kubelet would call out to these methods as needed.
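
For concreteness, a rough sketch in Go of what such a contract might look like. The type and method names below are assumptions for illustration only, not the existing Kubernetes volume plugin interface:

```go
// Illustrative sketch of a hypothetical plugin contract under Proposal 1.
// Names and signatures are assumptions, not the actual Kubernetes API.
package volumeplugin

// VolumeSpec is a plugin-specific description of a volume.
type VolumeSpec struct {
	VolumeID string            // identifier assigned by the storage provider
	Options  map[string]string // opaque, plugin-specific parameters
}

// ControllerOps would be invoked by the central controllers on the master.
type ControllerOps interface {
	Create(sizeGB int, options map[string]string) (*VolumeSpec, error) // provision a new volume
	Delete(spec *VolumeSpec) error                                     // tear the volume down
	Attach(spec *VolumeSpec, nodeName string) (devicePath string, err error)
	Detach(spec *VolumeSpec, nodeName string) error
}

// NodeOps are optional and would be invoked by the kubelet on the node.
type NodeOps interface {
	Mount(spec *VolumeSpec, devicePath, mountPath string) error
	Unmount(mountPath string) error
}
```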

The Flex Volume plugin (which adds support for exec-based plugins) would be a first step towards enabling creators of third-party storage plugins to add support for new volume types without having to add code to Kubernetes. The problem with the Flex plugin is that it doesn't provide a good deployment mechanism: in order to support a new volume plugin, an admin needs to drop scripts into the correct directories on each node. However, once all the volume controllers move to the master, the exec scripts would only need to be dropped onto the master (not the nodes, since the Flex plugin currently doesn't support custom mount/unmount logic).
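
As a purely illustrative example, an exec-based driver could be as small as a single executable that reads an operation name from its arguments and prints a JSON result. It is written here in Go for consistency with the other sketches, though any executable (e.g. a shell script) would do; the operation names and output shape are assumptions, not the finalized Flex contract:

```go
// Minimal sketch of an exec-style driver. Operation names and the JSON
// result shape are illustrative assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type result struct {
	Status  string `json:"status"`            // e.g. "Success" or "Failure"
	Message string `json:"message,omitempty"` // human-readable detail
}

func reply(r result) {
	out, _ := json.Marshal(r)
	fmt.Println(string(out))
}

func main() {
	if len(os.Args) < 2 {
		reply(result{Status: "Failure", Message: "no operation given"})
		os.Exit(1)
	}
	switch op := os.Args[1]; op {
	case "init":
		reply(result{Status: "Success"})
	case "attach", "detach", "mount", "unmount":
		// A real driver would call out to the storage backend here.
		reply(result{Status: "Success", Message: op + " is a no-op in this sketch"})
	default:
		reply(result{Status: "Failure", Message: "unsupported operation: " + op})
		os.Exit(1)
	}
}
```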

Longer term, all third-party plugin code would be containerized and removed from the Kubernetes code base. The controllers on the master could “docker run” the plugin container with the correct parameters to create/delete/attach/detach volumes. In order to support fully custom mount/unmount logic, the containerized plugin could contain mount/unmount code that the kubelet could similarly “docker run” (this will have to wait until Docker makes the changes needed to allow an executable running inside a container to mount to the host).

The controllers and kubelet will need to know which plugin container corresponds to which volume plugin, along with any other plugin-specific configuration information. This will require the cluster to maintain a one-to-one mapping of volume-type strings to a plugin container path plus any cluster-wide configuration for that plugin (e.g. GCE maps to {container: “k8s-volume-plugin-gce”, maxVolumesToProvision: 20, … }). This mapping must be available to both the controllers and the kubelet (it could be stored in config data or a new API object).

When a cluster administrator wants to add support for a new volume-type they would simply add a new key-value pair to this mapping.
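
A minimal sketch of what that mapping might look like, using Go for illustration. The "k8s-volume-plugin-gce" container name and maxVolumesToProvision setting mirror the example above; the structure itself and the nfs entry are assumptions:

```go
// Illustrative sketch of the cluster-wide volume-type -> plugin mapping
// that the controllers and kubelet would consult.
package main

import "fmt"

// PluginBinding ties a volume-type string to the container that implements it
// plus any cluster-wide configuration for that plugin.
type PluginBinding struct {
	Container string            // image implementing create/delete/attach/detach
	Config    map[string]string // cluster-wide, plugin-specific settings
}

var supportedPlugins = map[string]PluginBinding{
	"gce": {
		Container: "k8s-volume-plugin-gce",
		Config:    map[string]string{"maxVolumesToProvision": "20"},
	},
	// Adding support for a new volume type is just another entry here.
	"nfs": {
		Container: "k8s-volume-plugin-nfs",
		Config:    map[string]string{"server": "nfs.example.com"},
	},
}

func main() {
	for volumeType, binding := range supportedPlugins {
		fmt.Printf("%s -> %s %v\n", volumeType, binding.Container, binding.Config)
	}
}
```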

Pros:

  • More concrete contract between Kubernetes and volume plugins.
  • Easier to implement volume plugins
    • Plugin writers just have to provide implementations for methods like create, attach, detach, etc., and NOT the controller logic, which is more complicated.
  • All plugins benefit from fixes/improvements to common controllers.
  • More performant to run a single controller vs n controllers.
  • Smaller incremental steps required to get to proposal’s end state.

Cons:

  • Kubernetes core still responsible for maintaining a lot of volume logic.
  • Less flexibility for individual plugins (i.e will the API be sufficient for all possible volume plugins?).

Proposal 2: Multiple Controllers

Alternatively, instead of a set of Kubernetes-provided controllers, the entire controller implementation could be left up to individual plugins. Each plugin would have a controller that monitors the API server for PV, PVC, and Pod objects; when it finds a PVC that it can fulfill, it claims it, binds it, and is responsible for it through attachment and detachment, all the way to recycling.

Similar to the other proposal, plugins should be containerized. But instead of the container containing a binary that is triggered for specific tasks (attach, provision, etc.), it would contain the entire controller and run for the life of the cluster (or until the volume type is no longer needed), perhaps via a replication controller (to ensure availability). Fully custom mount/unmount logic could be supported similar to the other proposal: by containerizing the mount/unmount code and having the kubelet execute it via a “docker run” (requires Docker support).
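
A very rough sketch of the shape of such a per-plugin controller. The client interface and types below are simplified stand-ins, not the real Kubernetes client library:

```go
// Sketch of Proposal 2: each plugin ships its own long-running controller
// that watches the API server and owns the full lifecycle of the claims it
// can satisfy. APIClient and Claim are hypothetical stand-ins.
package main

import (
	"log"
	"time"
)

// Claim is a stripped-down stand-in for a PersistentVolumeClaim.
type Claim struct {
	Name        string
	RequestedGB int
}

// APIClient is a hypothetical, minimal view of the API server.
type APIClient interface {
	ListUnboundClaims() ([]Claim, error)
	Bind(claimName, volumeID string) error
}

// runController is the per-plugin loop: find claims this plugin can fulfill,
// provision storage for them, and bind them.
func runController(client APIClient, provision func(sizeGB int) (volumeID string, err error)) {
	for {
		claims, err := client.ListUnboundClaims()
		if err != nil {
			log.Printf("listing claims: %v", err)
		}
		for _, c := range claims {
			volumeID, err := provision(c.RequestedGB)
			if err != nil {
				log.Printf("provisioning for %s: %v", c.Name, err)
				continue
			}
			if err := client.Bind(c.Name, volumeID); err != nil {
				log.Printf("binding %s: %v", c.Name, err)
			}
		}
		time.Sleep(10 * time.Second) // a real controller would use a watch, not polling
	}
}

func main() {
	// Wiring up a real APIClient and provisioner is left out of this sketch.
	_ = runController
}
```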

Pros:

  • More flexibility for plugin writers.
    • Authors can decide how to implement pretty much everything.
  • Less Kubernetes exposure.
    • Since the plugin configuration can be passed directly to the controller rather than through kubernetes, that’s one less thing that must be added to config data or a new API object.
  • Slightly simpler installation
    • Support for a plugin can be added simply by running the plugin container. (Although the other proposal is not much more complicated.)
  • Aligns with Kubernetes Networking plugin design

Cons:

  • Would lead to code duplication.
    • Most controllers will likely be implementing very similar logic for the controller portion, leading to code duplication/proliferation of bugs.
    • Could be worked around by creating a library that plugin writers could use for standard implementations.
  • Less performant.
    • Having to run a controller for each plugin adds overhead, especially for small clusters.

Comparison

Dynamic Provisioning and Attach/Detach Controller

Differs for each proposal: Single controller vs multiple controller. See proposals above for details.

Improve Plugin Model

Differs for each proposal. See proposals above for details.

Enable Modularity and Improve Deployment

Once all plugins are containerized as both plans propose, we should get the kind of plugin modularity we’re hoping for.

To deploy a new volume-type the cluster administrator would have to:

  • Install plugin client binaries on each node.
    • Short term: the admin must manually copy these files.
    • Long term: these should be containerized and deployed as DaemonSet pods.
  • Configure volume classes (optional)--see below for details.
    • Short term: Config data created and maintained by admin.
    • Long term: Kubectl command for adding and removing classes that does this automatically.

In addition, only for proposal 1, the cluster admin would have to:

  • Add an entry to the key-value list that maps supported volume types to a plugin container path and plugin config.
    • Short term: Config data manually updated by admin.
    • Long term: Kubectl command for adding and removing supported plugins (i.e. admin specifies a volume-type and config, and Kubectl automatically adds it to Config Data).

And, for proposal 2, the cluster admin would have to:

  • Start the controller container (maybe as a replication controller) with the correct parameters (plugin config) so that it can start watching for PVCs, etc.

Volume Selector

Volume selection support can be added the same way regardless of which proposal is implemented: add “Labels” to the PersistentVolume object and “Selectors” to the PersistentVolumeClaim object. The binding controller(s) must resolve labels as they try to fulfill PVCs.
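
For illustration, a minimal sketch of the label matching the binding controller(s) would perform; the maps stand in for the real API objects:

```go
// Sketch of PV label vs. PVC selector matching (types simplified).
package main

import "fmt"

// matchesSelector reports whether every key/value the claim selects for is
// present on the volume's labels.
func matchesSelector(volumeLabels, claimSelector map[string]string) bool {
	for key, want := range claimSelector {
		if got, ok := volumeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	pvLabels := map[string]string{"rack": "r12", "disktype": "ssd"}
	pvcSelector := map[string]string{"rack": "r12"}
	fmt.Println(matchesSelector(pvLabels, pvcSelector)) // true: this PV can satisfy the PVC
}
```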

Volume Classes

Volume Classes are a way to create an abstraction layer over Kubernetes volumes so that users can request different grades of storage (classes) without worrying about the specifics of any one storage implementation.

Implementation of classes cannot be pushed into the plugins, because if individual plugins define the classes they support, then the whole point of the abstraction is lost. Instead, the cluster administrator must define the set of classes the cluster should support and the mapping of classes to the “knobs” exposed by individual plugins. More concretely, this means, for both proposals, that the cluster must maintain a mapping of admin-defined class strings to a list of parameters for each plugin that fulfills that class (e.g. “Gold” maps to “GCE plugin with parameter SSD” and “NFS with parameter XYZ”). This mapping must be maintained outside of the plugin (maybe in config data or a new API object).

Whether the blob is a simple string, a list of key-value pairs, or structured JSON is up for debate. It can be argued that a simple string or key-value pair list may not be sufficient to express some of the more complicated configuration options possible for some plugins. If that is the case, the map could maintain a structured JSON blob that the plugin would be responsible for parsing. For convenience, plugin writers could provide a “JSON blob creation tool” to make it easier for cluster admins to generate the blob.
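
A sketch of what the admin-maintained class mapping could look like if expressed as structured data. The “Gold”-to-GCE-SSD entry follows the example above; the exact shape is an open question, as noted, so treat this as one possible encoding:

```go
// Illustrative class -> fulfilling-plugins mapping maintained by the admin.
package main

import "fmt"

// ClassFulfiller names a plugin and the plugin-specific "knobs" that realize
// a given class on that plugin.
type ClassFulfiller struct {
	Plugin     string
	Parameters map[string]string
}

var classes = map[string][]ClassFulfiller{
	"Gold": {
		{Plugin: "gce", Parameters: map[string]string{"type": "ssd"}},
		{Plugin: "nfs", Parameters: map[string]string{"tier": "fast"}},
	},
	"Bronze": {
		{Plugin: "gce", Parameters: map[string]string{"type": "standard"}},
	},
}

func main() {
	for class, fulfillers := range classes {
		for _, f := range fulfillers {
			fmt.Printf("class %q: plugin %s with %v\n", class, f.Plugin, f.Parameters)
		}
	}
}
```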

CC @thockin @kubernetes/goog-cluster @kubernetes/rh-storage

We'll use this document to drive the discussion at the Storage Special Interest Group meeting on Dec. 8, 2015 11 AM PST.

@jsafrane

Member

jsafrane commented Dec 8, 2015

I won't make it to today's call. I slightly prefer Proposal 1 with one controller (in kube-controller-manager or anywhere else, it does not really matter), which would execute individual containers (or pods) just to provision/delete a volume (maybe also attach/detach, as suggested).

Benefits:

  • Single state machine. All the different platforms / storage volumes would behave exactly the same. This is hard to enforce when everyone writes their own provisioner in a different style.
  • Easy synchronization. With multiple different controllers we would need some explicit synchronization between them anyway to prevent multiple controllers from creating a PV for a single PVC. In addition, with a single controller, we can easily implement some ordering of provisioners, making sure that a default one (perhaps cheap and slow) is executed last, when no other provisioner "wants" to provision a PVC.

Volume Selector

  • IMO, there should also be some selector in namespaces to specify which provisioners (or storage classes?) can be used by pods there. E.g. "the marketing namespace can use SSD (or 'gold'?)" and "R&D can use HDD only".
@thockin

Member

thockin commented Dec 8, 2015

on proposal 1:

  • I don't think "flex plugin ... doesn’t support custom mount/unmount logic" will stand for very long. If we want to eject things like ceph/gluster/nfs as flex volumes we will need to be able to call custom mount logic. To play devil's advocate: there's a lot of stuff we can do if volume plugins are built-in that is much harder if they are not (e.g. iterate). This is still a topic of internal debate for me, but it's not the topic of this doc per se.

on proposal 2:

  • Pro: radically simpler on-node plugins

For me it really comes down to:

a) A relatively complicated plugin API with prescriptive rules (that we have to maintain across the flexi-volume boundary); a required config API (classes); minimal code per-plugin
b) A simple per-node API; config by-convention but maybe not required (classes); some duplicate code and logic (some tricky) per plugin

Playing it out, we should think about how it evolves from an operational point of view. How do we debug when something goes wrong? If you assume the trend towards flexi-volumes, then both models arrive at a place where you have to ask "what version of the driver are you using" in addition to the Kubernetes version. In proposal 2, that's also "what version of the controller are you running".

Meeting is soon, sending now.

@thockin

Member

thockin commented Dec 8, 2015

Some thoughts after that. Maybe we can draw some inspiration from established systems that have drivers and "plugins". Specifically I am thinking of Linux. Linux (the kernel) has multiple levels of abstraction in the storage subsystem. There are low-level drivers that adhere to pretty well-defined, prescriptive interfaces. If you want to be a SCSI driver you implement methods X, Y, and Z and voila, you're done. There are also mid-level subsystems. If SCSI isn't right for you, you can produce your own subsystem. It's a lot more work and a lot more duplicative, but it grants you a lot of freedom.

To then connect this to our conversation, we could think of it this way: Built-in volume plugins (drivers) are handled by our built-in controller(s) (subsystem). If the API of our subsystem is good enough for you, that's the path of least resistance. By having drivers in-tree you get all the same benefits that Linux kernel drivers have - we will build, release, and version them for you. If APIs change, we will refactor for you. As long as you are present as a maintainer, it will be good. But if our API is not good for you - you don't want to publish code, you need hooks we don't have, whatever - you can always write your own subsystem. It's a lot more work and a lot more duplicative, but it grants you a lot of freedom.

Then the question of flex-volumes - maybe that is an escape hatch for trying things out or even for distributing out-of-tree drivers for the default subsystem.

Again, not sure this is the answer, just writing down thoughts.

@markturansky

Member

markturansky commented Dec 8, 2015

  1. Built-in plugins are fairly easy to write and give the user a good OOTB experience. There are only a handful of them.
  2. A plugin for the driver model allows 'exec' on the master node as the 1st escape hatch.
  3. A 2nd escape hatch is a plugin for "run this pod and watch until completion", which is what the recycler does.

Between 2 and 3, I like to think we've provided enough extensibility for a great many use cases.

@chakri-nelluri

Contributor

chakri-nelluri commented Dec 9, 2015

I agree. It gives the best of both worlds.
Folks who want to implement extra features that are not provided by the default subsystem can write their own subsystem.
Others can use the default subsystem with built-in drivers, and use flex volumes for out-of-tree drivers and for trying things out.

@jdef

Contributor

jdef commented Dec 9, 2015

Mesos (and DCOS) will eventually want complete control over volume binding. Proposal 2 seems compatible with that goal, as does the suggestion from @thockin to offer the ability to implement the larger "subsystem" - as long as that includes control over the actual mounting of the volume.

@thockin

Member

thockin commented Dec 10, 2015

I'm curious what this means - is there something about the way we bind volumes that is lacking?


@jdef

Contributor

jdef commented Dec 10, 2015

Mesos is heading down the path of implementing its own APIs for storage management. The current k8s volume management impl makes the assumption that it controls the mount ns on the host that it's running on (similar to how k8s assumes control over other node things like iptables, docker, ...). In a Mesos world, k8s is not the only thing running on a node, and so Mesos will very likely want to control access to a node's mount ns.

In the context of this discussion, it seems that implementing either a proposal-2 controller or else something along the lines of a "volume subsystem" (vs. a lightweight plugin) would be more in alignment with the goals of the Mesos project. A Mesos-oriented subsystem implementation would coordinate with the Mesos runtime to handle volume lifecycle (mount, unmount, recycle) operations, as well as attach/detach to/from pod operations: k8s would never touch the mount ns of the node it runs on when it's running on Mesos.


@pmorie

Member

pmorie commented Dec 11, 2015

Interesting info @jdef. I think in the context of proposal-1 we would need to be able to swap the actual implementation of the plugins within the controller to accomplish your use case. It seems possible, but proposal-2 definitely seems like less work for your use-case.

Expanding on @thockin's question:

For me it really comes down to:

a) A relatively complicated plugin API with prescriptive rules (that we have to maintain across the flexi-volume boundary); a required config API (classes); minimal code per-plugin
b) A simple per-node API; config by-convention but maybe not required (classes); some duplicate code and logic (some tricky) per plugin

Agree that this is the fundamental question we are examining with this issue. We have also been grappling with this issue with ownership management. There, we decided that a prescriptive set of rules for externalizing functionality was hard to get right with any degree of accuracy, so we went with a more opaque approach, internalizing the problem of setting ownership correctly. To me, option 2 seems like the analogous option here, especially in light of @jdef's specific use-case. However, there does exist a middle-ground approach: have an option-2 controller implementation of option 1, with the ability to configure which plugins are supported. So, if you want to use a special implementation of JUST NFS, as an example, it would look like:

  1. Run the built-in controller with NFS disabled.
  2. Run your own option-2 controller that handles NFS

...and the two controllers play nice because (1) ignores NFS and (2) only services NFS.

Playing it out, we should think about how it evolves from an operational point of view. How do we debug when something goes wrong? If you assume the trend towards flexi-volumes, then both models arrive at a place where you have to ask "what version of the driver are you using" in addition to the Kubernetes version. In proposal 2, that's also "what version of the controller are you running".

No doubt the problem space gets more complicated with debugging. The quality of the debugging experience always varies by implementation with plugin-based architectures. If we go with option 2 (which I am starting to think will be the best way), we should publish guidance on what type of information controllers should log.

@markturansky

Member

markturansky commented Dec 17, 2015

Attach/detach will require an interlock between the controller trying to delete PVs: #18832

wrong crosslink

@saad-ali

Member Author

saad-ali commented Dec 24, 2015

Why do we need to invent a completely brand new protocol? A lot of drivers already support the Docker volume plugin interface/protocol. Can that simply be used here?

@thockin might have some thoughts on this.

The way I see it, one of the goals of Kubernetes is portability for end users. So where possible, we try to achieve that for users by introducing abstractions that they can choose to deploy against and isolate themselves from implementation details. To this end, we'd like to be container implementation agnostic where possible. Docker is the most popular container implementation in town, but it's not the only one (see rkt).

@saad-ali

Member Author

saad-ali commented Dec 24, 2015

Proposal 3: Keep Third-Party Plugin Code “In-Tree”

Both proposals 1 and 2 propose containerizing and removing third-party plugin code from the Kubernetes code base. The primary reason for this is to decouple Kubernetes and third-party plugin code so that each can be maintained independently. Both of those proposals require Kubernetes to expose an API that third-party developers can develop against (the two proposals only differ in how low-level the API should be).

However, @thockin has an interesting suggestion: look at the Linux Device Driver Model for inspiration. Linux made a conscious decision to check third-party driver code into the mainline kernel (i.e. “in-tree”) instead of exposing an API that can be used by driver developers to write drivers independent of the Linux kernel. The reasons for this decision are detailed here: http://www.linuxfoundation.org/collaborate/workgroups/technical-advisory-board-tab/linuxdevicedrivermodel

This proposal takes that approach, and therefore questions one of the stated goals:

Improve Plugin Model

  • Establish a plugin model that allows new storage plugins to be written without recompiling Kubernetes code.

In this world the volume controllers would be maintained by Kubernetes. They would call out to volume plugins, also maintained “in-tree”. The Flex volume plugin would act as a way for plugin developers to experiment with out-of-tree plugins; no guarantees of backwards compatibility would be provided. Once the Flex volume plugin is stable, we could introduce a second such plugin that operates on containers instead of scripts, as @markturansky mentioned above. Vendors who desire a high level of customization can write and swap out their own volume controller(s) (lots of work, but lots of control).

Pros of maintaining 3rd-party-plugins “in-tree” as part of Kubernetes (similar to Linux Driver Model):

  • Once a plugin is working on a given version of Kubernetes, that support continues through all future versions.
  • Users don’t have to worry about what version of the plugin they use (they’re all baked in).
  • Large/breaking changes to the plugin model are possible, since all plugins are checked in to Kubernetes and can be updated together with the changes.
  • Plugins are maintained by the community instead of just the plugin vendor; more eyeballs generally lead to more bug fixes/stability.

Cons of maintaining 3rd-party-plugins “in-tree” as part of Kubernetes:

  • Plugin development is tightly coupled with Kubernetes core development.
    • Kubernetes developers/community are responsible for maintaining all volume plugins instead of just maintaining a stable plugin API.
    • Plugin developers cannot revise their plugins independently of Kubernetes.
  • Plugin developers are forced to make plugin source code available and cannot choose to release only a binary (unless they choose to rewrite and deploy the entire volume controller).
  • Buggy plugins could crash the entire Kubernetes volume controller instead of just the plugin.
    • Counterpoint: plugins could be containerized and still maintained “in-tree”.

We can discuss and finalize these options during the next Storage SIG meeting (Jan 5, 2016, 11 AM PST). Happy holidays!

@thockin

Member

thockin commented Dec 31, 2015

re: Docker volumes

Docker volumes have a distinct advantage of letting us share drivers with Docker, which is a net win for anyone writing drivers. I have not written a driver myself but my understanding is:

  • run a daemon
  • implement a JSON-RPC endpoint in that daemon
  • drop a UNIX socket or other pointer to your daemon into a dir
  • receive a handful of commands on the socket, react
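
For illustration only, a minimal Go sketch of that shape: a daemon serving a JSON endpoint over a UNIX socket. The /VolumeDriver.* path follows the Docker volume plugin convention as I understand it; treat the details (socket path, request/response fields) as assumptions rather than a faithful implementation of that protocol:

```go
// Sketch of a volume driver daemon: JSON over a UNIX socket.
package main

import (
	"encoding/json"
	"log"
	"net"
	"net/http"
)

type mountRequest struct {
	Name string `json:"Name"`
}

type mountResponse struct {
	Mountpoint string `json:"Mountpoint,omitempty"`
	Err        string `json:"Err,omitempty"`
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/VolumeDriver.Mount", func(w http.ResponseWriter, r *http.Request) {
		var req mountRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			json.NewEncoder(w).Encode(mountResponse{Err: err.Error()})
			return
		}
		// A real driver would attach and mount the backing storage here.
		json.NewEncoder(w).Encode(mountResponse{Mountpoint: "/mnt/volumes/" + req.Name})
	})

	// Drop the socket where the runtime discovers plugins (path is illustrative).
	listener, err := net.Listen("unix", "/run/example-volume-driver.sock")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(http.Serve(listener, mux))
}
```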

Issues:

All options are opaque - no type checking or input validation at our API server. That will have to wait until a Pod gets bound to a node and we actually try to start it (clumsy UX).

It's unclear whether drivers are supposed to allow multiple simultaneous connections (e.g. from Docker itself and from Kubelet). We could thunk THROUGH docker if docker is the runtime of choice.

No concept of attach/detach distinct from mount/unmount. This means every node must have creds to attach/detach (which is something we are trying to get away from).

These are not inconsequential faults. History shows that Docker (inc) is not particularly willing to adapt to our needs as a centralized manager, but maybe with Swarm in flight they would be open to changes here. I don't even know who owns this API over there - maybe @lukemarsden can offer some guidance on this.

If we decided we want to use Docker's volume plugins and we can actually get over these hurdles, we still have open design issues:

  • Do we make all of our drivers be docker drivers or do we make one of our drivers be a thunk-through to Docker drivers?
  • Do we call their drivers ourselves (we'll have to in rkt and other runtimes) or do we pass it through to the docker daemon on docker runtimes?
  • Intersect this with ALL of the goals in this larger doc.

I'm very much not against using Docker volumes (less choice is better when it comes to APIs that vendors have to support), and from what I understand people are reasonably happy with this API (as compared to libnetwork :)), if only we can sort out how.

@jdef

Contributor

jdef commented Dec 31, 2015

+1 for keeping plugin code in-tree.

$0.02: The Kubernetes-Mesos project used to be an external repo and vendored the k8s source tree. It was very hard to keep up with the pace at which k8s moves. We (the k8s-mesos team) have been much happier since pushing the project into the k8s source tree, even if it does live in contrib/ and isn't compiled by default for everyone. In addition to more eyeballs on the code, we know sooner rather than later when breaking API changes are made and can react quickly -- super important since we're a small team and k8s evolves fast. Breaking up the monolithic k8s code base into federated repos is a larger problem that needs to be tackled at some point; I'm not convinced that we need to solve it here first.


@falanjones


falanjones commented Apr 13, 2016

Why was the alternative to k8s built-in plug-ins a single-shot container?
The general interaction of most microservices is through REST, not 'docker run'.
I'm not a fan of the built-in model because it ties development for new storage features to the relatively long k8s release cycle and potentially longer customer adoption.
Incompatibilities in an API are handled by negotiating the version between endpoints and potentially supporting more than one.
I've also heard that a built-in model could provide vendor-specific spec validation.
However, the information needed for this validation could also be exchanged through an API.
In the design I'm proposing there is an issue of discovering the endpoint.
If it is cluster-wide, a normal IP and port is good.
If it continues to be kubelet-local, we need some form of localhost access.

I'm not convinced these approaches are incompatible. Just like there could be a generic iSCSI driver, there could be a generic API driver that passes additional misc options from the spec.

@saad-ali

Member Author

saad-ali commented Apr 20, 2016

I'm not a fan of the built-in model because it ties development for new storage features to the relatively long k8s release cycle and potentially longer customer adoption.

In-tree volumes give us the most flexibility at the moment while the volume model is still under development. Longer term we'd like to be able to offer a stronger API that would make out-of-tree development the default, but we want to do that in a careful manner, making sure the API encompasses all the requirements we will have (around deployment, API guarantees, etc.). In the meantime, the Flex volume plugin acts as an escape hatch.

Why was the alternative to k8s built-in plug-ins a single-shot container? The general interaction of most microservices is through REST, not 'docker run'.

Containerizing plugins does not necessitate docker run. Ideally it would be a k8s pod or daemon-set. One of the underlying goals is that we want to make it easy to add support for new volume types. A cluster admin should be able to deploy support for new volume types (requisite binaries for mounting, etc.) without having to manually install software on each node. If they could just deploy a Kubernetes pod or daemon-set and get additional functionality, that would be pretty powerful. @childsb has been looking into the specifics of what that could look like, see #22216. The ideas that have been discussed closely resemble what you are proposing: deploying "sidecar" containers that expose a REST API that the k8s volume code would know how to call out to. That said, there is nothing concrete yet, and any further work will likely be post v1.3.

@thockin

Member

thockin commented Jun 9, 2016

@saad-ali It might be worthwhile to either close this and open a new issue or issues on the remaining work, or at least post a summary / status message here and link to it from the first comment.

@saad-ali

Member Author

saad-ali commented Jun 9, 2016

Will do

@saad-ali

Member Author

saad-ali commented Jul 1, 2016

Closing this.

The Flex Volume plugin was merged as part of v1.2.
The attach/detach controller was introduced in v1.3.
Dynamic volume provisioning was introduced as alpha in v1.2 and is being fully fleshed out for v1.4.

The discussion around in-tree vs out-of-tree plugins is continuously being revisited as the project matures: https://groups.google.com/forum/#!topic/kubernetes-sig-storage/9o1vA4jFwqk

@saad-ali saad-ali closed this Jul 1, 2016
