Durable local storage #598
Is there a way to pin a pod to a minion? For example, we have some data stored on the host disk that is persistent between reboots, and I need to tell the replication controller that this container should be pinned to, for example, minion1.
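For concreteness, a minimal sketch of the kind of pod being described, written in later Kubernetes syntax (the volume type was called hostDir when this issue was filed, later hostPath); the image and path are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: stateful-app
spec:
  containers:
    - name: app
      image: example/app          # illustrative image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      hostPath:
        path: /srv/app-data       # survives reboots, but only exists on this particular host
```

The data survives container restarts and host reboots, but because it lives on one host's disk, the pod is only useful if it lands on that same host - hence the pinning question.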
If something happens to minion1, then your pod can't run. Is it practical for you to do one of the following: [...]
What Eric said. We may be forced to add such constraints in the future, but we're going to try hard not to. :) We (@Sarsate) are working on additional volume types to make this easy.
Unfortunately it is not that simple, as the application in the Docker container will be updating the data all the time, and we cannot use NFS as that has much too much overhead for the access latencies we need. Ideally, in the future this data would be stored on an SSD volume mounted on the minion. But for now I am happy with it being on the host, though it does need to be pinned.
Yeah, SSD access is one of the things that will probably force us to add some sort of constraint to keep your pod co-located with its SSD.
Paging @bgrant0607.
I renamed this issue to narrow it to the specific use case.

Support for durable local storage is an issue that has been raised by several partners in discussions, and is evident in every example application we've looked at (Guestbook, Acme Air, Drupal). This is a requirement for running a database, other storage systems (e.g., HDFS, Zookeeper, etcd), an SSD-based cache, etc. Support for more types of volumes (#97) is maybe necessary but definitely not sufficient. We also need to represent the storage devices as allocatable resources (#168).

As I mentioned in #146, pods are currently relatively disposable and don't have durable identities. So, I think the main design question is this: do we conflate the identity of the storage with the identity of the pod and try to increase the durability of the pod, or do we represent durable local volumes as objects with an identity and lifetime independent of the pods? The latter would permit/require creation of a new pod that could be attached to pre-existing storage. The latter is somewhat attractive, but it would obstruct local restarts, which are desirable for high availability and bootstrapping, and it wouldn't interact well with replicationController, due to the need to create/manage an additional object and also to match individual pods and volumes, which would reduce the fungibility of the pods.

So, I'm going to suggest we go with pod durability. Rather than a single pin/persistence bit, I suggest we go with forgiveness: a list of (event type, optional duration, optional rate) of disruption events (e.g., host unreachability) the pod will tolerate. We could support an "any" event type and infinite duration for pods that want to be pinned regardless of what happens. This approach would generalize nicely for cases where, for example, applications wanted to endure reboots but give up in the case of extended outages or in the case that the disk goes bad. We're also going to want to use a similar spec for availability requirements / failure tolerances of sets of pods.

Ideally, the pod could be restarted/recreated by Kubelet directly. This would likely require checkpointing (#489), but initially we'd at least need to be able to: [...]
Regarding the former, we probably need to introduce some indication of outages into the pod status - probably not in the primary state enum, but in a separate readiness field.

Regarding the latter, there are cases where it is convenient to place the storage on the host in a user-specified location, to facilitate debugging, data recovery, etc. without needing to look up long, host-specific, system-generated identifiers, though that's probably not a requirement for v0.

It might be nice for a durable pod to have a way to request to delete itself without making an API call. Some people have suggested that run-until-success (i.e., exit 0) is not a sufficiently reliable way to convey this. Perhaps we could use an empty volume on exit as the signal. Certainly that would mean there wasn't any valuable data to worry about, and it would be easy for an application to drop an empty file there if it just wanted to stay put.

Support for raw SSD should be filed as a separate issue, if desired.
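To make the forgiveness proposal above concrete, here is a purely hypothetical sketch of what such a per-pod spec might look like - none of these field names exist in the API; they only illustrate the (event type, optional duration, optional rate) tuples described:

```yaml
# Hypothetical syntax - illustrating the proposal, not an implemented API.
forgiveness:
  - event: HostUnreachable
    duration: 10m          # tolerate up to 10 minutes of unreachability before giving up
  - event: HostReboot      # no duration: tolerate reboots indefinitely
  - event: DiskFailure
    rate: 1/24h            # tolerate at most one such event per day
  - event: Any
    duration: infinite     # a pod that wants to stay pinned no matter what happens
```

A descendant of this idea eventually shipped in Kubernetes as taints and tolerations, with tolerationSeconds playing roughly the role of the duration here.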
Can we start with clear statements of requirement? What we have with local storage [...]. Alternatively, maybe the answer is "don't use local storage", just like [...]. But that is getting ahead - I don't feel like I really understand the requirements yet.
The durable pod described above matches our experience with a broad range of real-world use cases - many organizations are willing to support reasonable durability of containers in bulk, as long as the operational characteristics are understood ahead of time. They eventually want to move applications to more stateless models, but accepting outages and focusing on mean-time-to-recovery is a model they already tolerate. Furthermore, this allows operators to focus on durability in bulk (at a host level), with a corresponding reduction in effort over their previous single-use systems. We'd be willing to describe a clear requirement for a way to indicate that certain pods should tolerate disruption, with a best-effort attempt to preserve local volumes and the container until such time as the operator declares a host "lost". The suggestion to indicate a pod is done by clearing its storage is elegant, although in practice it's either user intervention or the container idling out of use.
User-specified data locations are also not significant for us in the near term.
+1 to the forgiveness model. Let's make sure that it's possible to list the same reason (especially [...]).
For a regular pod without a replication controller: absolutely nothing. For the replication controller case, except for dockerd death, they result in a new pod being spun up somewhere. And then if the old pod shows up again, one of the pods will get killed. Not sure exactly what a dockerd death would cause.
Dockerd death leaves orphaned processes that the daemon doesn't know are running (last I checked). EDIT: corrected - right now, child processes stay running until the daemon starts, at which point the daemon loops over all containers and kills them, and then will not restart them unless the daemon's AutoRestart is true (daemon/daemon.go#175).
On Thu, Jul 24, 2014 at 10:36 PM, bgrant0607 wrote: [...]

- Nothing happens.
- All containers die; the kubelet probably craps itself trying to talk to [...]
- Unless someone moves the pod from that host (in etcd), the pod comes back.
- Nothing, unless someone moves the pod in etcd.

What do we want to happen?
I think host-pinning and forgiveness/stickiness are going to be unavoidable.
Related to this (has someone created an issue on this yet?) is the need for [...].
@thockin Pinning to a specific host could be achieved either using constraints or the forthcoming direct scheduling API, in addition to forgiveness.
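As a sketch, both routes look roughly like this in later API terms (these fields postdate this comment; the node name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod
spec:
  nodeSelector:
    kubernetes.io/hostname: minion1   # constraint-based pinning via a node label
  # nodeName: minion1                 # or direct placement, bypassing the scheduler entirely
  containers:
    - name: app
      image: example/app
```

The constraint governs where the pod may run; forgiveness would govern how long it is kept around when that host misbehaves.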
I usually agree with @bgrant0607 but I'm going to explore a contrary position on this issue.
Some thoughts:
An assumption I had been working from is that the replication controller is the abstraction that provides the illusion of recreating an object without user intervention; the scheduler does not reschedule containers. Instead, the reconciliation loop of the replication controller forms a dynamic tension with the scheduler by deleting containers that no longer fit an appropriate definition of health. The scheduler's responsibility is thus reduced - it only attempts to place new pods, and never reschedules. In this model, the scheduler does not need to know about forgiveness; the replication controller does, and the replication controller is the one that needs to make the decision about health.

If that's not the case, then where does that responsibility lie? Would the scheduler be responsible for deleting pods from one location, placing them on another, and determining whether that transition is appropriate? If so, that seems like a growth in the scheduler's responsibility - every scheduler would then need to deal with the complexity of knowing whether a given transition is appropriate, and with hardcoding the list of transitions.

The former model seems more flexible - for instance, replication controller types of arbitrary complexity can be created, including forgiveness, with each replication controller dealing with the consequences of when deletion is appropriate. Ultimately, determining when something should be deleted is often specific to the use case (a task vs. a service vs. a build vs. a stateful service, etc.).
I wasn't suggesting that we should not support stateful apps, just that we should not support stateful apps with only the Pod object. Brian said earlier: [...] I was arguing for the second option: that we have a different object type to represent that persistent data. Those data objects would need their own replication control: data could be more widely replicated than Pods, and could be in different states.
I can at least talk to running moderately dense container hosts with a single shared durable storage volume per host (that each container was allocated storage on). The storage is network attached, and snapshot-able for backup purposes. For most outages, the volume was detached and reattached to a new host with the same identity as the old host, with the containers not rescheduled. This has been a reasonable solution for most operations teams running OpenShift - trading off some availability for the reduced complexity of managing a single volume during recovery. And in those types of outages, the most current and accurate data for those containers is on the volume, so replication is unlikely to be faster than restoring the volume. This is just one particular scenario, but it's a sort of local minimum of availability for stateful containers at reasonable density and familiarity to ops teams (at lower densities, individual attached volumes are probably better). And in this case, forgiveness does seem to model the tradeoff better - waiting longer before deciding the state is gone. However, to your point, if that particular volume is never coming back, having a well-set-up model for distributing state and tracking independent volumes reduces your vulnerability to total loss. Also, the planned-reallocation case works better under your model - if I decide to evacuate a host for maintenance I may very well want to rebalance to other hosts, and that requires a certain volume with a certain set of data in place on that other host.
Revisiting older topics that I think are important: This tapered off with no clear resolution. I still don't feel like I understand the behavioral requirements. We've discussed a lot of considerations of a couple of implementations, but have not discussed exactly what we are trying to achieve. Are we trying to enable data objects to have a lifetime that is decoupled from any one pod? Are we trying to allow pods to have $large "local" data (i.e. filesystem, not DB or other storage service)? other? |
I think data objects decoupled from pods are modeled with sufficient granularity in volumes today. Being able to define some level of pod stability that does not cause significant scheduling difficulties has value for places where large local data exists. Perhaps this belongs as a scheduler problem, where an integrator can determine that a volume type has a corresponding impact on scheduling decisions.
Volumes have a lifetime coupled perfectly to their pod. If we are arguing that there's a need for durable data that outlives any pod, we have not really started that design.
I think a major use case for this is Pod software upgrades. Right now, upgrading software deployed inside Containers in a Pod is, afaict, a destructive operation.
I don't know how Kubernetes or Docker works internally, but shouldn't durable local storage work with a data-only container if native Docker is used? It seems there is no solution coming in the near future?
I am not against some variant of data containers, but I don't really know what the goal is. Is the goal just to get a stable/recoverable host dir to write into? [...]
Data containers are persistent and survive reboots. After reading some issues, hostDir seems to have some permission problems. Data containers could be a reboot-safe volume solution, and simpler to move to another minion than hostDir data if needed(?)
@thockin could you explain for me the difference between a data container and a persistent volume? I'd like to understand this issue better.
I discussed this in another issue, but I'd prefer a GC'd directory identified by a unique value that anyone who knows it could reuse. Given proper uid support, that content is protected by Unix rules and would satisfy the "I need a dir that is best-effort reused per host across multiple pods" use case, e.g. as a build cache or scratch dir. But it would need time-based GC after the last reference is released.
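A hypothetical illustration of that request - no such volume type exists, and the field names are invented purely to show the shape of the idea:

```yaml
# Hypothetical syntax - a best-effort, per-host directory keyed by a well-known value.
volumes:
  - name: build-cache
    sharedHostDir:
      key: maven-cache-v1      # any pod on the host that knows the key reuses the same directory
      ttlAfterLastUse: 72h     # time-based GC once nothing has referenced it for this long
```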
To understand data containers, we need to remember that container != image. In the Docker model, you can make changes to your container's filesystem, stop the container / reboot the host, and then restart the container and it will still have those FS changes. In the kube model, we tend to believe that containers are always launched cleanly from an image, and storage/filesystem changes should be done 'outside' of the container.

"Data containers" are much more in the Docker way of thinking. You can create a container and make some changes to the filesystem in that container. Docker can then mount the filesystem from one container into another container, and the changes live in the 'data container'. It's like what we do with volumes, but they do it with containers (and, like many things Docker, it is only really elegant on a single host).

An example of a 'data container' could be for configuration. You could create a container filled with your rsyslog configuration and another container which actually has rsyslog. You launch the rsyslog container, mounting the /etc/ files from the configuration container into the container with the daemon. Now you can update the rsyslog container independently of the config, and the config independently of the binary.

Another example would be a container to hold stateful data. Create a container which just has /var/lib/etcd/. Now mount that container's /var/lib/etcd/ into your etcd container. You can update/change the etcd container without worrying about the data. You can also 'save' the data container as an image and docker push / docker pull to get the data onto another host, in case you wanted to migrate the data.

I haven't read this issue, so the following statements are likely worth exactly 0. But in general, I am not a fan of Docker's mutability of containers. I like that Kubernetes has a clear and rational expectation that running the same command 2 times will give the same results. If we choose to use containers under the covers for some type of data storage, that is fine, but I really hope we don't expose that to the user. We should expose some functionality to the system user, not some underlying detail...
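For contrast, the closest Kubernetes-style analogue of the rsyslog example is two containers in one pod sharing a volume, rather than one container mounting another container's filesystem. A sketch, with illustrative image names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rsyslog
spec:
  volumes:
    - name: rsyslog-config
      emptyDir: {}                      # shared filesystem; lives only as long as the pod
  containers:
    - name: config                       # plays the role of the "data container"
      image: example/rsyslog-config      # hypothetical image carrying the config files
      command: ["sh", "-c", "cp -r /config/. /out/ && sleep 1000000"]
      volumeMounts:
        - name: rsyslog-config
          mountPath: /out
    - name: rsyslog
      image: example/rsyslog             # hypothetical daemon image
      volumeMounts:
        - name: rsyslog-config
          mountPath: /etc/rsyslog.d
```

Here the shared data is explicit in a volume rather than hidden in a mutable container filesystem, which is the distinction being drawn above.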
It's been close to a year since this issue was opened, is Kubernetes really no closer to providing durable local storage and nominal/stateful services (#260)? |
Why not create a data volume which creates a data-only container in the background?
Folding this together with #7562.