Durable data #1515
I'm not looking to actually merge this PR, just get some feedback on two possible approaches.
If there is agreement on one approach, then I will code it up, and convert this design document into user documentation.
I'm inclined to add a new type of REST resource to model Durable local data. I know new resources are always controversial, so I've written this doc to compare the alternatives. Comments welcome from anyone. I'd particularly like comments from @dchen1107 @bgrant0607 @thockin @brendanburns @jbeda @lavalamp @smarterclayton.
One thought that appeals to me is that a New Resource Kind can handle a central lock independent of scheduling (i.e., the request to attach / mount the data from the New Resource Kind Source can block / fail in a way that makes the container fail, which offers central coordination per data element). Perhaps the new Resource Kind can be broken into two parts: allocation (which ultimately is a lot like allocating a GCE volume) and mount. The mount semantics can also be managed by the server controlling replication (allows replication, must have recent updates within X).
@eparis While this issue is titled "durable data", that adjective is relative. :-) Unreplicated data are inherently neither highly durable nor highly available. Not only are nodes subject to physical failures, but they also suffer reduced durability and/or availability due to management operations, and sharing among multiple containers could also require migrating workloads away from the data. For these reasons, internally, we generally require that applications be able to cope with empty unreplicated storage volumes automatically. Populating an unreplicated storage volume with a one-time complex workflow would be considered an anti-pattern. Instead, we should get an HA storage system running on/with Kubernetes, from which applications could simply fetch data (e.g., database snapshots) to unreplicated local storage.
I'm just thinking about @thockin's statement: "Yes, if we have a run-once that installs data, we need to add a new state to distinguish "installing data" from "running after data was installed". I see three variations that could make sense:"
So he really is talking about "Populating an unreplicated storage volume with a one-time complex workflow".
Maybe the question is that I was suggesting the possibility of giving an admin the ability to design a 'complex' initialization system, while he would only accept a single-step initialization?
@bgrant0607 I'm just spitballin' here. I haven't actually thought through all (any) of the details. And it's certainly possible all of them should be ignored/belittled/mocked/etc...
In my made up example I'd think you could start a replication controller (restart policy=always) with a single Pod1 which requires a /data with the 'uninitialized' label. Next, start a replication controller with a single Pod2a and a third with a single Pod2b which require a /data with the 'initialized 1' label. Now you launch a replication controller(s) with whatever number of Pods you want to use this initialized data. These last pods should require both the 'initialized 1' and 'initialized 2' labels on their /data resource. Whew, so at this moment you actually have 5+ pods which the replication controller wants to schedule, but can't because it doesn't have the /data resource available...
Now create the data replication controller which creates two /data resources both with the 'uninitialized' label. So Pod1 will run (against one of those /data resources) and relabel it to include 'initialized 1'. As soon as it finishes, the replication controller will try to schedule Pod1 again. This second time Pod1 will find the second uninitialized data resource and will thus initialize the second resource. When Pod1 exits this time there will be no /data resource for it to run against, so it will be unschedulable.
Next Pod2a/b will both find a /data, already on a minion, and will run on that minion to initialize them. (although it would suck if Pod2a found and initialized both of them, I guess I hadn't thought through that, so maybe you'd have to make these distinctly different /data resources somehow) They will both return to be 'unschedulable.' Of course lastly the actual pods that need to do work, will find their respective /data (because of the label) and will be scheduled where it exists.
Of course this falls down wildly (as does any blind/uninformed scheduling of /data without consideration of the /pods that will use it) if the /data gets scheduled on a node without enough resources to allow the scheduling of any/all of the pods in the processing pipeline. Likely pods 1 and 2 in my example would have minimal cpu/memory/resource constraints, but pod3 might have very large resource requirements. So I don't know how the /data scheduler can know to land the durable data resource on a node that will be able to satisfy all of the possible future pod scheduling constraints....
@eparis Yes, something like what you propose probably could be made to work. But, yes, the scheduling issue you point out is a real concern.
It would be simpler to tie initialization to the pod lifecycle if the common case is one-step initialization. We could create a pod PreStart lifecycle hook that could initialize volumes if necessary.
My perspective on this issue may or may not be relevant considering my newness to this project, but I approach K8s through the lens of several of my most recent jobs where I was responsible for IT infrastructure, app development, and ops support.
(tl;dr: Refactor volumes into top level domain object and run-once docker containers are independent first steps that can be implemented before supporting new volumes in a multitude of scenarios. Following these tasks, implement new types of volumes)
After reading (and re-reading) the various issues, I have some thoughts and questions:
1. Pod and volumes succeed or fail together?
Do pods that require a volume fail to launch if the volume fails to mount for any reason? Are they a required pair? I believe the answer is yes.
If so, does this give responsibility for mounting to the Kubelet before it launches the pod? A goroutine for mounting still supports the concurrency patterns in Go, but I imagine the Kubelet would then block and wait on a response from that channel indicating success or failure. No reason to attempt a pod launch if the volume failed to mount on the host. My volume/data wouldn't be there and my app would fail.
2. Optional run-once docker containers
A hook to run something before a pod launches is a valid use case for application developers. Only the first run would be "initialize the database" but subsequent versions of the application would have various database migrations that need to run for the current version of my app.
This feature is standalone, I believe, and not dependent on durable storage. It can be developed independently.
3. Different audiences require different functionality
I had applications deployed to both AWS and SoftLayer. A big-ish company I worked for had their own hardware in a data center (which also fits the "run openshift on bare metal" requirement for RH enterprise customers).
The former scenario allows dynamic clusters. K8s can use the APIs to bring up new nodes, make new storage volumes, etc. K8s would keep track of the various names/IDs assigned. Selectors match volumes and pods with K8s managing it all.
The latter, on the other hand, is not as dynamic. If I have my own hardware, I know how many blades are in my cluster, for example, and what my NAS looks like. They are all named and known. Pods across this cluster are still cattle, but I think the NAS is a pet. Pods need to know what they want for mounted volumes, which I think is different than the cloud API version in the previous paragraph.
4. "New" resource kind or refactored "old" resource?
Moving volume from the pod manifest as is to a top level domain object is a straightforward refactoring task. Move it and make everything work like it did before. Keep the same 3 types (emptydir, hostdir, gcepersistentdisk). This task can happen independently of durable data and storage.
5. Plugins for providers
Depending on the audience, we'd need different plugins based on type (strategy pattern). I know this is already the pattern in place for 'cloud provider', and I'm sure everyone already assumes this much.
One plugin per type. After the refactoring in point 4 above is complete, we can add new types to support new functionality.
6. REST API
Expand the number of "source" volumes supported and add whatever attributes are required by that provider/plugin.
Perhaps there are good reasons to refactor this API, too, but the existing API already supports many types of volumes. It seems like this is easily expanded to accommodate more types of volumes.
I believe refactoring volumes as-is into a top level domain object is an independent task that is a pre-req to any durable storage volume. This task can happen now before new types of volumes are defined and the details worked through.
Triggers for run-once docker containers also seem like an independent task. If pods and volumes are a required pair, I don't see a need for forgiveness. Mounting a volume is a pre-req for the pod. It happens or it doesn't. The timeout for this operation can be long (if that's what you mean by forgiveness), but it seems sequential in nature before launching the pod.
Only after the refactor of volumes can we add new types. Just adding one new type will expand our (my) knowledge greatly.
When the rubber needs to meet the road, I think we can/should do numbers 2 and 4 above first while hashing out details for the first new type of volume.
Responses to the last @markturansky post:
This is not intended for the NAS case.
Not seeing the benefit of volume-as-own-object yet. Happy to listen more, though.
Agree with the rest of what you said.
Do pods that require a volume fail to launch if the volume fails to mount for any reason?
Yes. And the rest of what you said in item 1 is also true. (I'm intentionally not answering the vaguer question of "Do Pod and volumes succeed or fail together".)
Optional run-once docker containers
I agree this is a feature we could consider separately from durable volumes. There is a bunch of discussion in this thread about run-once. But, the proposal in this PR doesn't try to address that.
Pods on "own hardware" are cattle, but the NAS is a pet
I agree that NAS seems like a pet. But, I wasn't considering NAS when I wrote this proposal; I was just thinking about the "rack of 1u servers each with their own hdd(s) and/or ssd(s)" case.
Interested in your thoughts about NAS.
Moving volume from the pod manifest as is to a top level domain object
I don't agree that the existing volume types should be made into their own objects. I make a distinction between the mounting and the data being mounted. Considering each volume-type:
For hostdir and gcepersistentdisk, they only represent mounting, not data. I don't see that the "mounting" has meaning independent of the pod.
For "emptydir", I agree that you could make the "data" part of this into its own object. In a sense, that is exactly what this proposal is -- to make a top-level object for this sort of data.
However, if you buy that a common case is to want to allocate that data right before a pod starts and delete it right after the pod terminates, then it would be convenient to not have to create a pair of objects (a pod and a data-thing).
I'll have to think about this more.
Plugins for providers
Yes. Suggestions on other plugins welcome.
Thanks for the thoughtful feedback @erictune.
I think I understand the error in my understanding of the domain model. The new top level object would be data, because that is what has durable identity (it's where my data lives). Volumes would use selectors to access data and provide what security tokens are necessary. Volumes map to hostDirs or a type/implementation per provider, volumeMounts go on the pods. Refactoring volumes out is the wrong code design.
I thought, too, that I read a general consensus across the digital ink spilled on durable data that a New Resource Kind was the preferred implementation. That'd be the one thing you could define that would span pods, according to my understanding of the domain model. Is this accurate?
I have a few concerns with some of the ideas. Chaining a pod to a specific minion just to make data "durable" seems like the wrong way to go. It gives a wrong sense of durability and security.
So I am not keen on only pinning pods to hosts, which does not make me feel secure, since it exposes one to hardware-failure issues. I agree that pinning to specific hardware characteristics is a need, but not to specific hardware. This was discussed in #2342; it therefore seems already possible to use specific attributes such as ssd, gpu, etc. in scheduling.
With a defined common interface, containers could then easily provide data. Basically, the plugins above are the equivalent of Side Containers (Data Volume Containers with specific logic). These methods would enable a mariadb image, for example, to be used with Host Volumes in development, with Data Volume Containers in staging, and, depending on the durability needs, with Side Containers exposing distributed storage systems or a completely integrated approach called Volume as a Service on GKE.
For Kubernetes, we could use this standard interface to mount volumes and make it unnecessary to pin a pod to a minion, instead using Data Volume Containers and moving them with the pod. Easiest solution: stop - pipe to a new Data Volume Container - start the new container.
The solution I favour is a more integrated approach. Side Containers could be the custom method for users, but providing infrastructure-specific, integrated Side Containers and exposing them as Volumes as a Service would be ideal. K8s could expose a distributed data volume, which basically maps the data to a distributed persistent disk. In a self-hosted scenario, VaaS could be an admin-provided Side Container binding the volume not to a Google persistent disk but to Ceph, Gluster, or the like.
The actual proposal can be found here: moby/moby#9277
My two cents -- permanent storage is such a basic requirement that I think it should be a service in the same way that docker or etcd are. It's all entirely in my head at this stage, but my plan for PaaS world domination will involve running something like gluster on a bunch of instances inside my cluster and mounting a single massive shared filesystem on all machines. Individual containers can then just mount parts of that filesystem as volumes. So now, at the container level, I never have to worry about which machine I come up on or whether the permanent storage will be attached, because it's always available, everywhere.
Then the problem becomes the platform-specific one of managing the actual storage that the gluster instances use (EBS volumes or PDs or whatever), but that should be a rare task. Once gluster is up and running on multiple machines, I shouldn't have to worry about it any more, and gluster should be redundant enough that losing any individual gluster machine won't hurt.
@hjwp I agree with most parts.
Durable is the wrong word. Let's forget I used it. When I get back to working on this after the holidays, I plan to choose a different name for the concept. I'm thinking of calling the new resource Kind a "nodeVol", because its two essential properties are that it is tied to a node and that it can be referenced in a VolumeSource.
There are two main types of storage we will have:
@hjwp made a good distinction between using a cluster filesystem and admining a cluster filesystem. I think kubernetes should support both, and that you need "local" as a building block to provide a service that enables "remote".
I like your example of a mariadb that uses different volume sources at different stages of development. I had considered this. It is closely related to the need to have config be portable across cloud providers and different on-premise setups. To address this, I expect that configuration would be written as templates, where the pod's volumeSource is left unspecified, to be filled in by different instantiations of the template (prod vs dev, etc).
I've taken a look at moby/moby#9277, and I've subscribed to the discussion.
One thing to note is that some forms of node-local storage don't manifest themselves as a filesystem, so filesystem standards may not apply directly. For example:
These are the issues/solutions:
So I agree with most of your points and would love to see both better support for passing through disks and raw access to hardware for local usage (still favoring /con/data as the default).
I never really understood the point of volumes-from and data-volume containers. No doubt that's me being a noob, but consider our specific "remote storage" use case: I have a container app that needs access to some permanent storage and doesn't want to be tied to a particular node; it just needs to know that, wherever it comes up, it can access said storage. Let's assume said storage is available as a mounted filesystem on the underlying machines (which in the background is implemented using whatever distributed-filesystem voodoo we like). Why would I want the extra layer of indirection of a data volume container, rather than just mounting the remote storage paths directly into my app container?
@hjwp Because then you either bake the logic of a specific distributed mount inside your container, which is counterintuitive to decoupling, or you mount a node-specific directory. Even if every node has this mountpoint and passes it to some distributed storage, you are defining special cases. With a data volume container, for example, you don't have to mount distributed storage for project A on every node even if it only runs on 1/10th of the nodes. Additionally, updating the binding software for such a mount is then not easily upgradable via containers, but involves node updates.
So there are a few negative aspects to running node-specific stuff, not decoupling, etc. Especially the decoupling into one container per concern is something one has to wrap one's head around. Sure, you could just run apache + storage-endpoint + mysql + whatever in one container, but then one could just use a VM. Even though you take on a bit more complexity with data-volume containers or volumes-from, you gain the ability to decouple and a more service-oriented infrastructure.
Data containers will always have a place, but in production systems such as k8s they will most likely be replaced with Volumes as a Service aka k8s mounts k8s provided volumes, which then map to git, secret store, distributed data store (GCE persistent disk) etc..
Based on what I learned in the discussion, I think a good first step before implementing durable data would be to develop a working prototype of a database (e.g. mysql master and slave instance) used by a replicated frontend layer.
Attributes of prototype:
individually label two nodes: database-master and database-slave.
@brendanburns I will make the example as part of my work for #2609. This is very close to the example I was already planning on building, but with a persistent disk instead of hostDir -- though in the near term I was planning on using my local host as the "persistent" disk through hostDir. Very copacetic.
I no longer believe that we should implement the durable data concept described in this PR. Therefore I am going to close this PR.
I now see that there are two use cases that should be handled separately:
Why not handle both cases with one concept? Because the thing you end up with:
For the "easy to run things that want attach to networked storage" case, we should do:
For the "possible to run things which must have local storage device access" case, we should do:
At least for the types of networked applications I'm aware of, the utility of this snapshot is more for simple singleton processes that could potentially tolerate some loss of data (the last 10 minutes) in preference to having no data. An admin team could modify their restart / node-replacement processes in order to effectively maintain a possibly reasonable SLA, without the rest of the system being aware, for instance by adding a tool that tries to snapshot all the instances of this volume type on a node prior to beginning evacuation.
I'm thinking of things like Git repositories, Jenkins servers, simple Redis cache services, test databases, QA workloads, simple collaboration style apps, etc. Losing even 15 minutes of state is unlikely to impact those sorts of tools, and most of them are inherently single pod stateful. There's probably a cost / effort sweet spot for loose consistency of data here.