Kubernetes Volume System Redesign Proposal #18333
The purpose of this document is to consolidate the major Kubernetes Volume issues and feature requests, and to introduce high-level designs that ensure alignment between all of them.
There are three big designs currently being considered for Kubernetes Volume Storage:
- Dynamic Provisioning and Attach/Detach Controller
- Improve Plugin Model
- Enable Modularity and Improve Deployment
The solutions to all three could be considered and designed in isolation; however, they touch on overlapping issues. Therefore, it makes sense to first agree on an overall design direction for the Kubernetes Volume architecture before finalizing the specifics of each of these.
Proposal 1: Single Controller
Replace the existing persistent volume controllers that do binding (volumes to claims) and recycling (wiping volumes for reuse) with one controller that does provisioning (creating new volumes), binding, and recycling. This controller could monitor the API server for PersistentVolume and PersistentVolumeClaim objects.
Another new controller would be responsible for attaching and detaching volumes by monitoring the API server for pods that are scheduled to nodes and reference volumes.
For a seamless transition, in the immediate term, the existing volume plugin model could be used largely as is (although some refactoring would be needed). To implement a storage plugin a third party would need to provide methods to create/delete/attach/detach and optionally mount/unmount a volume. The new controllers and kubelet would call out to these methods as needed.
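As a rough illustration of the kind of interface those create/delete/attach/detach (and optional mount/unmount) methods could form, here is a minimal Go sketch; all names and signatures are hypothetical, not the actual Kubernetes plugin API:

```go
// Hypothetical sketch of the operations a third-party storage plugin would
// implement; names and signatures are illustrative only.
package volumeplugin

// VolumeSpec carries whatever plugin-specific parameters describe a volume.
type VolumeSpec struct {
	Name       string
	SizeGB     int
	Parameters map[string]string // opaque, plugin-specific options
}

// Plugin is the contract the controllers and kubelet would call out to.
type Plugin interface {
	Create(spec VolumeSpec) (volumeID string, err error) // provision a new volume
	Delete(volumeID string) error                        // tear it down / recycle
	Attach(volumeID, nodeName string) (devicePath string, err error)
	Detach(volumeID, nodeName string) error
}

// Mounter is optional; plugins without custom logic would fall back to a
// default mount/unmount implementation.
type Mounter interface {
	Mount(devicePath, mountPath string, options map[string]string) error
	Unmount(mountPath string) error
}
```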
The Flex Volume plugin (which adds support for exec-based plugins) would be a first step towards enabling creators of third-party storage plugins to add support for new volume types without having to add code to Kubernetes. The problem with the Flex plugin is that it doesn't provide a good deployment mechanism. In order to support a new volume plugin, an admin needs to drop scripts into the correct directories on each node. However, once all the volume controllers move to the master, the exec scripts would only need to be dropped onto the master (not the nodes, since the Flex plugin currently doesn't support custom mount/unmount logic).
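To make the exec model concrete, here is a hypothetical skeleton of such a driver binary; the subcommand names and JSON result shape approximate the Flex-style contract but are not its exact specification:

```go
// Hypothetical skeleton of an exec-based driver, seen from the driver's side:
// it is invoked with a subcommand plus JSON options and replies with a JSON
// status on stdout. Subcommands and the result shape are approximations.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type result struct {
	Status  string `json:"status"`            // e.g. "Success" or "Failure"
	Message string `json:"message,omitempty"` // human-readable detail
	Device  string `json:"device,omitempty"`  // device path returned by attach
}

func reply(r result) {
	out, _ := json.Marshal(r)
	fmt.Println(string(out))
}

func main() {
	if len(os.Args) < 2 {
		reply(result{Status: "Failure", Message: "missing subcommand"})
		os.Exit(1)
	}
	switch os.Args[1] {
	case "attach":
		// os.Args[2] would carry plugin-specific JSON options.
		reply(result{Status: "Success", Device: "/dev/disk/by-id/example"})
	case "detach":
		reply(result{Status: "Success"})
	default:
		reply(result{Status: "Failure", Message: "unsupported operation"})
		os.Exit(1)
	}
}
```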
Longer term all third party plugin code would be containerized and removed from the Kubernetes code base. The controllers on master could “docker run” the plugin container with the correct parameters to create/delete/attach/detach volumes. In order to support fully custom mount/unmount logic, the containerized plugin could contain mount/unmount code that Kubelet could similarly “docker run” (this will have to wait until Docker makes the changes needed to allow an executable running inside a container to mount to the host).
The controllers and kubelet will need to know which plugin container corresponds to which volume plugin, along with other plugin-specific configuration information. This will require the cluster to maintain a one-to-one mapping of volume-type strings to a plugin container path plus any cluster-wide plugin configuration for that plugin (e.g.
When a cluster administrator wants to add support for a new volume-type they would simply add a new key-value pair to this mapping.
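A sketch of what that mapping and the "docker run" invocation from the earlier paragraph might look like; the volume-type key, image name, flags, and config fields are all made up for illustration:

```go
// Illustrative sketch only: a mapping of volume-type strings to plugin
// container images plus cluster-wide plugin configuration, and how a
// controller might "docker run" the right plugin for an operation.
package main

import (
	"fmt"
	"os/exec"
)

type pluginEntry struct {
	Image  string            // plugin container image to "docker run"
	Config map[string]string // cluster-wide configuration for this plugin
}

// Maintained by the cluster administrator, e.g. in config data or an API object.
var pluginRegistry = map[string]pluginEntry{
	"example.com/fast-block": {
		Image:  "example.com/fast-block-plugin:v1",
		Config: map[string]string{"endpoint": "https://storage.example.com"},
	},
}

func runPluginOp(volumeType, op string, args ...string) error {
	entry, ok := pluginRegistry[volumeType]
	if !ok {
		return fmt.Errorf("no plugin registered for volume type %q", volumeType)
	}
	cmdArgs := append([]string{"run", "--rm", entry.Image, op}, args...)
	out, err := exec.Command("docker", cmdArgs...).CombinedOutput()
	fmt.Printf("%s\n", out)
	return err
}

func main() {
	// Hypothetical attach call made on behalf of the attach/detach controller.
	_ = runPluginOp("example.com/fast-block", "attach", "--volume-id=vol-123", "--node=node-1")
}
```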
Proposal 2: Multiple Controllers
Alternatively, instead of a set of Kubernetes-provided controllers, the entire controller implementation could be left up to individual plugins. Each plugin would have a controller that monitors the API Server for PV, PVC, and Pod objects; when it finds a PVC it can fulfill, it would claim it, bind it, and remain responsible for it through attachment and detachment, all the way to recycling.
Similar to the other proposal, plugins should be containerized. But instead of the container containing a binary that is triggered for specific tasks (attach, provision, etc.), it would contain the entire controller and run for the life of the cluster (or until the volume type is no longer needed) maybe via a replication controller (to ensure availability). Fully custom mount/unmount logic could be supported similar to the other proposal: by containerizing the mount/unmount code and having Kubelet execute it via a “docker run” (requires Docker support).
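Roughly, such a plugin-specific controller could be structured as below; the client interface, object shapes, and claim-matching logic are placeholders rather than real Kubernetes API calls:

```go
// Very rough sketch of a plugin-specific controller (Proposal 2); the client
// interface and object types are placeholders, not the real Kubernetes API.
package nfscontroller

import "time"

type Claim struct {
	Name       string
	VolumeType string
	SizeGB     int
}

// apiClient stands in for a watch/list client against the API server.
type apiClient interface {
	PendingClaims() []Claim
	Bind(claim Claim, volumeID string) error
}

type nfsController struct {
	client apiClient
}

// run is the long-lived loop the containerized controller would execute,
// e.g. under a replication controller for availability.
func (c *nfsController) run() {
	for {
		for _, claim := range c.client.PendingClaims() {
			if claim.VolumeType != "nfs" {
				continue // only claim PVCs this plugin can fulfill
			}
			volumeID := provisionNFSExport(claim.SizeGB) // plugin-specific provisioning
			_ = c.client.Bind(claim, volumeID)
			// This controller would remain responsible through attach,
			// detach, and eventual recycling of the volume.
		}
		time.Sleep(10 * time.Second)
	}
}

func provisionNFSExport(sizeGB int) string { return "nfs-vol-1" } // placeholder
```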
Dynamic Provisioning and Attach/Detach Controller
Differs for each proposal: single controller vs. multiple controllers. See the proposals above for details.
Improve Plugin Model
Differs for each proposal. See proposals above for details.
Enable Modularity and Improve Deployment
Once all plugins are containerized as both plans propose, we should get the kind of plugin modularity we’re hoping for.
To deploy a new volume-type the cluster administrator would have to:
In addition, only for proposal 1, the cluster admin would have to:
And, for proposal 2, the cluster admin would have to:
Volume selection support can be added the same way regardless of which proposal is implemented: add "Labels" to the PersistentVolume object and a label selector to the PersistentVolumeClaim, so that a claim can constrain which volumes it is willing to bind to.
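A minimal sketch of the idea, using hypothetical struct shapes rather than the real API types (the selector field is the part this proposal would add):

```go
// Hypothetical sketch of label-based volume selection; these are not the
// real Kubernetes API types, just the shape of the idea.
package volumeselection

type PersistentVolume struct {
	Name   string
	Labels map[string]string // e.g. {"zone": "us-central1-a", "tier": "ssd"}
}

type PersistentVolumeClaim struct {
	Name     string
	Selector map[string]string // claim only binds to PVs whose labels match
}

// matches reports whether a claim's selector is satisfied by a volume's labels.
func matches(claim PersistentVolumeClaim, pv PersistentVolume) bool {
	for k, v := range claim.Selector {
		if pv.Labels[k] != v {
			return false
		}
	}
	return true
}
```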
Volume Classes are a way to create an abstraction layer over Kubernetes volumes so that users can request different grades of storage (classes) without worrying about the specifics of any one storage implementation.
Implementation of classes cannot be pushed into the plugin, because if individual plugins define the classes they support, then the whole point of the abstraction is lost. Instead, the cluster administrator must define the set of classes the cluster should support and the mapping of classes to the "knobs" exposed by individual plugins. More concretely, this means, for both proposals, that the cluster must maintain a mapping of admin-defined class strings to a list of parameters for each plugin that fulfills that class (e.g. "Gold" maps to "GCE plugin with parameter SSD" and "NFS with parameter XYZ"). This mapping must be maintained outside of the plugin (maybe in config data or a new API object).
Whether the blob is a simple string, a list of key-value pairs, or structured JSON is up for debate. It can be argued that a simple string or key-value pair list may not be sufficient to express some of the more complicated configuration options possible for some plugins. If that is the case, the map could maintain a structured JSON blob that the plugin would be responsible for parsing. For convenience, plugin writers could provide a "JSON blob creation tool" to make it easier for cluster admins to generate the blob.
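However the blob ends up being represented, the mapping itself could look something like this sketch (the class name and parameters are just the example from the text; the plugin names and keys are illustrative):

```go
// Illustrative only: an admin-maintained mapping from class names to
// per-plugin parameter blobs. The representation of the parameters (string,
// key-value pairs, or structured JSON) is still an open question.
package classes

// pluginParams pairs a plugin with the opaque parameters that realize a class.
type pluginParams struct {
	Plugin string
	Params map[string]string
}

// classMap would live outside the plugins, in config data or a new API object.
var classMap = map[string][]pluginParams{
	"Gold": {
		{Plugin: "gce-pd", Params: map[string]string{"type": "ssd"}},
		{Plugin: "nfs", Params: map[string]string{"tier": "xyz"}},
	},
}
```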
CC @thockin @kubernetes/goog-cluster @kubernetes/rh-storage
We'll use this document to drive the discussion at the Storage Special Interest Group meeting on Dec. 8, 2015 11 AM PST.
I won't make it to today's call, so here are my thoughts: I slightly prefer Proposal 1, with one controller (in kube-controller-manager or anywhere else, it does not really matter) which would execute individual containers (or pods) just to provision / delete a volume (maybe also attach/detach, as suggested).
on proposal 1:
on proposal 2:
For me it really comes down to:
a) A relatively complicated plugin API with prescriptive rules (that we have to maintain across the flexi-volume boundary); a required config API (classes); minimal code per-plugin
Playing it out, we should think about how it evolves from an operational point of view. How do we debug when something goes wrong? If you assume the trend towards flexi-volumes, then both models arrive at a place where you have to ask "what version of the driver are you using" in addition to the Kubernetes version. In proposal 2, that's also "what version of the controller are you running".
Meeting is soon, sending now.
Some thoughts after that. Maybe we can draw some inspiration from established systems that have drivers and "plugins". Specifically I am thinking of Linux. Linux (the kernel) has multiple levels of abstraction in the storage subsystem. There are low-level drivers that adhere to pretty well-defined, prescriptive interfaces. If you want to be a SCSI driver you implement methods X, Y, and Z and voila, you're done. There are also mid-level subsystems. If SCSI isn't right for you, you can produce your own subsystem. It's a lot more work and a lot more duplicative, but it grants you a lot of freedom.
To then connect this to our conversation, we could think of it this way: Built-in volume plugins (drivers) are handled by our built-in controller(s) (subsystem). If the API of our subsystem is good enough for you, that's the path of least resistance. By having drivers in-tree you get all the same benefits that Linux kernel drivers have - we will build, release, and version them for you. If APIs change, we will refactor for you. As long as you are present as a maintainer, it will be good. But if our API is not good for you - you don't want to publish code, you need hooks we don't have, whatever - you can always write your own subsystem. It's a lot more work and a lot more duplicative, but it grants you a lot of freedom.
Then the question of flex-volumes - maybe that is an escape hatch for trying things out or even for distributing out-of-tree drivers for the default subsystem.
Again, not sure this is the answer, just writing thoughts.
Between 2 and 3, I like to think we've provided enough extensibility for a great many use cases.
I agree. It gives the best of both worlds.
Mesos (and DCOS) will eventually want complete control over volume binding. Proposal 2 seems compatible with that goal, as does the suggestion from @thockin to offer the ability to implement the larger "subsystem" - as long as that includes control over the actual mounting of the volume.
I'm curious what this means - is there something about the way we bind
Mesos is heading down the path of implementing its own APIs for storage
In the context of this discussion, it seems that implementing either a
Interesting info @jdef. I think in the context of proposal-1 we would need to be able to swap the actual implementation of the plugins within the controller to accomplish your use case. It seems possible, but proposal-2 definitely seems like less work for your use case.
Expanding on @thockin's question:
Agree that this is the fundamental question we are examining with this issue. We have also been grappling with it for ownership management. There, we decided that a prescriptive set of rules for externalizing functionality was hard to get right with any degree of accuracy, and so we went with a more opaque approach, internalizing the problem of setting ownership correctly. To me option 2 seems like the analogous option here, especially in light of @jdef's specific use case. However, there does exist a middle-ground approach: have an option-2 controller implementation of option 1, with the ability to configure which plugins are supported. So, if you want to use a special implementation of JUST NFS, as an example, it would look like:
...and the two controllers play nice because (1) ignores NFS and (2) only services NFS.
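As a sketch of how the two could avoid stepping on each other, each controller could be configured with a simple volume-type filter; the mechanism below is hypothetical:

```go
// Hypothetical sketch: each controller decides up front which volume types
// it services, so a generic controller and a special-purpose one can coexist.
package coexist

// handles is the predicate each controller applies before claiming a PVC.
type handles func(volumeType string) bool

// Generic option-1 style controller, configured to ignore NFS.
var genericControllerHandles handles = func(volumeType string) bool {
	return volumeType != "nfs"
}

// Special option-2 style controller that services only NFS.
var nfsControllerHandles handles = func(volumeType string) bool {
	return volumeType == "nfs"
}
```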
No doubt the problem space gets more complicated with debugging. The quality of the debugging experience always varies by implementation in plugin-based architectures. If we go with option 2 (which I am starting to think will be the best way), we should publish guidance on what type of information controllers should log.
@thockin might have some thoughts on this.
The way I see it, one of the goals of Kubernetes is portability for end users. So where possible, we try to achieve that for users by introducing abstractions that they can choose to deploy against and isolate themselves from implementation details. To this end, we'd like to be container implementation agnostic where possible. Docker is the most popular container implementation in town, but it's not the only one (see rkt).
Proposal 3: Keep Third-Party Plugin Code “In-Tree”
Both proposal 1 and 2 propose containerizing and removing third party plugin code from the Kubernetes code base. The primary reason for this is to decouple Kubernetes and third party plugin code so that each can be maintained independently. Both of those proposals require Kubernetes to expose an API that third party developers can develop against (the two proposals only differ in how low level the API should be).
However, @thockin has an interesting suggestion: look at the Linux device driver model for inspiration. Linux made a conscious decision to check third-party driver code into the mainline kernel (i.e. "in-tree") instead of exposing an API that driver developers could use to write drivers independently of the Linux kernel. The reasons for this decision are detailed here.
This proposal takes that approach, and therefore questions one of the stated goals:
In this world the volume controllers would be maintained by Kubernetes. They would call out to volume plugins, also maintained "in-tree". The Flex volume plugin would act as a way for plugin developers to experiment with out-of-tree plugins; no guarantees of backwards compatibility would be provided. Once the Flex volume plugin is stable, we could introduce a second such plugin that would operate on containers instead of scripts, as @markturnasky mentioned above. Vendors who desire a high level of customization can write and swap in their own volume controller(s) (lots of work, but lots of control).
Pros of maintaining 3rd-party-plugins “in-tree” as part of Kubernetes (similar to Linux Driver Model):
Cons of maintaining 3rd-party-plugins “in-tree” as part of Kubernetes:
We can discuss and finalize on these options during the next Storage SIG meeting (Jan 5, 2016, 11 AM PST). Happy holidays!
re: Docker volumes
Docker volumes have a distinct advantage of letting us share drivers with Docker, which is a net win for anyone writing drivers. I have not written a driver myself but my understanding is:
All options are opaque - no type checking or input validation at our API server. That will have to wait until a Pod gets bound to a node and we actually try to start it (clumsy UX). (A rough sketch of the request shape follows this list.)
It's unclear whether drivers are supposed to allow multiple simultaneous connections (e.g. from Docker itself and from Kubelet). We could thunk THROUGH docker if docker is the runtime of choice.
No concept of attach/detach distinct from mount/unmount. This means every node must have creds to attach/detach (which is something we are trying to get away from).
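For reference, a rough sketch of the Docker volume plugin request/response shapes (reconstructed from memory, so treat as approximate; consult the Docker plugin docs for the authoritative contract). The opaque Opts map is where the lack of type checking comes from, and the absence of a separate attach step is visible in the API surface:

```go
// Approximate shape of Docker volume plugin requests; for illustration only.
// The point here is that Opts is a free-form map our API server cannot
// validate, and there is no attach/detach distinct from mount/unmount.
package dockervolume

// CreateRequest is sent to /VolumeDriver.Create.
type CreateRequest struct {
	Name string            `json:"Name"`
	Opts map[string]string `json:"Opts"` // opaque, driver-specific options
}

// MountRequest is sent to /VolumeDriver.Mount; mounting implies any
// attach-like work, so every node needs the credentials to do it.
type MountRequest struct {
	Name string `json:"Name"`
}

// Response is the common reply shape.
type Response struct {
	Mountpoint string `json:"Mountpoint,omitempty"`
	Err        string `json:"Err,omitempty"`
}
```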
These are not inconsequential faults. History shows that Docker (inc) is not particularly willing to adapt to our needs as a centralized manager, but maybe with Swarm in flight they would be open to changes here. I don't even know who owns this API over there - maybe @lukemarsden can offer some guidance on this.
If we decided we want to use Docker's volume plugins and we can actually get over these hurdles, we still have open design issues
I'm very much not against using Docker volumes (less choice is better when it comes to APIs that vendors have to support), and from what I understand people are reasonably happy with this API (as compared to libnetwork :) if only we can sort out how.
+1 for keeping plugin code in-tree.
$0.02: The Kubernetes-Mesos project used to be an external repo and
Why was the alternative to a Kubernetes built-in plugin a single-shot container?
I'm not convinced these approaches are incompatible. Just like there could be a generic iSCSI driver there could be a generic API driver that passes additional misc options from the spec.
In-tree volumes give us the most flexibility at the moment while the volume model is still under development. Longer term we'd like to be able to offer a stronger API that would make out-of-tree development the default, but we want to do that in a careful manner, making sure the API encompasses all the requirements we will have (around deployment, API guarantees, etc.). In the meantime, the Flex volume plugin acts as an escape hatch.
Containerizing plugins does not necessitate
Flex Volumes was merged as part of v1.2.
The discussion around in-tree vs out-of-tree plugins is continuously being revisited as the project matures: https://groups.google.com/forum/#!topic/kubernetes-sig-storage/9o1vA4jFwqk