Durable data #1515

Closed
10 participants
@erictune
Member

erictune commented Sep 30, 2014

I'm not looking to actually merge this PR, just to get feedback on two possible approaches.

If there is agreement on one approach, then I will code it up, and convert this design document into user documentation.

I'm inclined to add a new type of REST resource to model Durable local data. I know new resources are always controversial, so I've written this doc to compare the alternatives. Comments welcome from anyone. I'd particularly like comments from @dchen1107 @bgrant0607 @thockin @brendanburns @jbeda @lavalamp @smarterclayton.

Addresses #598.

@smarterclayton
Contributor

smarterclayton commented Oct 1, 2014

One thought that appeals to me is that a New Resource Kind can handle a central lock independent of scheduling (i.e., the request to attach / mount the data from the New Resource Kind Source can block / fail in a way that makes the container fail, which offers central coordination per data element). Perhaps the New Resource Kind can be broken into two parts: allocation (which ultimately is a lot like allocating a GCE volume) and mount. The mount semantics can be managed by the server controlling replication as well (allows replication, must have recent updates within X).
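The allocation/mount split described above could be sketched roughly as follows. This is a hypothetical illustration only (the `DurableData` class, its phases, and method names are invented for this sketch, not an actual Kubernetes API); it shows how a failed second mount could provide the central per-data-element coordination smarterclayton mentions:

```python
from enum import Enum

class DataPhase(Enum):
    REQUESTED = "requested"   # resource object created, no disk yet
    ALLOCATED = "allocated"   # disk space reserved on a specific minion
    MOUNTED = "mounted"       # attached to exactly one running pod

class DurableData:
    """Hypothetical two-phase durable-data resource: allocate, then mount."""

    def __init__(self, name, size_mb):
        self.name = name
        self.size_mb = size_mb
        self.phase = DataPhase.REQUESTED
        self.minion = None
        self.pod_uid = None

    def allocate(self, minion):
        # Much like allocating a GCE volume: reserve space before any pod runs.
        if self.phase is not DataPhase.REQUESTED:
            raise RuntimeError("already allocated")
        self.minion = minion
        self.phase = DataPhase.ALLOCATED

    def mount(self, pod_uid):
        # Central lock: a second mount attempt fails, which in turn fails the
        # container, giving per-data-element coordination.
        if self.phase is not DataPhase.ALLOCATED:
            raise RuntimeError("data is %s, cannot mount" % self.phase.value)
        self.pod_uid = pod_uid
        self.phase = DataPhase.MOUNTED
```

In this sketch, replication constraints ("must have recent updates within X") would be checked by whatever server owns the `mount` transition.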

@erictune erictune added this to the v0.7 milestone Oct 1, 2014

@erictune
Member

erictune commented Oct 1, 2014

Thanks for the feedback, smarterclayton. Hoping for feedback from others on this soon so I can begin implementation in time for milestone 0.7.

@erictune erictune referenced this pull request Oct 1, 2014

Closed

Durable local storage #598

@erictune erictune modified the milestones: v1.0, v0.7 Oct 1, 2014

@erictune
Member

erictune commented Oct 1, 2014

Apparently this is not due until 1.0.

@erictune erictune modified the milestones: v0.8, v1.0 Oct 1, 2014

docs/proposals/durable_data.md
+A Pod has a (current and desired) PodState, which has ContainerManifest that lists the Volumes and Containers of a Pod.
+A Container always has a writeable filesystem, and it can attach Volumes, which are also writeable. Writes that do not
+go to a volume are not visible to other containers and are not preserved past container failures. Writes to an
+EmptyDirectory volume are visible to other containers in the pod. An EmptyDirectory is only shared by containers which

@thockin

thockin Oct 1, 2014

Member

EmptyDir writes are also visible across container restarts

docs/proposals/durable_data.md
+ - Bit(s) on a Pod which indicates whether durability is requested.
+ - Kubelet sees whole pod object (currently has only ContainerManifest) to get UID.
+ - Kubelet preserves EmptyDirectory when stopping pod which wants durable data.
+ - Kubelet attaches preserved EmptyDirectory when starting pod that wants durable data and has matching UID.

@thockin

thockin Oct 1, 2014

Member

This implies that someone is intentionally re-using UIDs, right? I would probably argue that we need a different ID here, and at that point this sort of becomes similar to your next option.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

My interpretation is that we wouldn't be reusing the UID in a new pod -- the durable pod would just remain present in etcd.

@erictune

erictune Oct 2, 2014

Member

correct.

+ - Pod Update
+ - No change.
+ - Replication Count Reduced
+ - *Changed*. A /data remains until explicitly removed. This allows the Pod Replication Count to be lowered and then raised again

@thockin

thockin Oct 1, 2014

Member

This sort of implies a replication controller for /data

@bgrant0607

bgrant0607 Oct 2, 2014

Member

Right.

In addition to lowering/raising, it allows for a new pod replication controller to create a pod that matches the /data.

+the Data represents state without computation. This has an attractive simplicity. Minor point for New Resource Kind.
+
+Under the Durable Pods alternative, the NodeController has to reason about both minions (whether they are likely to be
+operating correctly) and about pods (what forgivenesses they want). Under the New Resource Kind alternative, the

@thockin

thockin Oct 1, 2014

Member

Forgiveness may sneak in again later anyway.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

Agreed. High-availability services with a lot of local state (even in memory rather than on disk/ssd) will likely want to wait longer before giving up than, say, 2-second compilation tasks. It's all about expected time to repair.

@erictune

erictune Oct 2, 2014

Member

Yes. I expect forgiveness to come into either solution at some point.

docs/proposals/durable_data.md
+ - A pod name cannot be changed to reflect a gradual change in the responsibilities of a pod due to a series of
+ updates. Minor point in favor of New Resource Kind.
+
+New Resource Kind allows scaling data independently of servers, such as to prepare for a spike in traffic without spending

@thockin

thockin Oct 1, 2014

Member

I think this is more than a minor point

@bgrant0607

bgrant0607 Oct 2, 2014

Member

It could also facilitate pre-fetching (without arbitrary scaling).

@bgrant0607

bgrant0607 Oct 2, 2014

Member

Here's an interesting idea: Bootstrapping. Someone could create /data by dropping a kubelet config file, but then schedule the pod dynamically in order to simplify pod config updates.

@erictune

erictune Oct 2, 2014

Member

Like a /data that starts out as a HostDirectory rather than an EmptyDirectory? Interesting.

docs/proposals/durable_data.md
+fails to avoid stale locks or other stale state; and an SSD volume which should be durable. This seems like a minor
+point in favor of New Resource Kind.
+
+New Resource Kind supports the use case where several pods run different versions of the software but share a single large chunk of readonly durable

@thockin

thockin Oct 1, 2014

Member

I was assuming that, in New Resource Kind model, data gets bound to exactly one pod. Are you proposing that it could be bound to more than one pod in read-only mode?

@erictune

erictune Oct 1, 2014

Member

If we get consensus to do New Resource Kind, it will be "exactly one pod", as you say.

On line 213, I am hinting at something we might do much later on, after more discussion.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

I also find this point confusing. I would also drop it, especially since it doesn't affect the outcome.

@erictune

erictune Oct 2, 2014

Member

removed

docs/proposals/durable_data.md
+## Requirements
+
+Kubernetes requirements:
+ - Do not provide a direct mechanism to pin pods to minions.

@bgrant0607

bgrant0607 Oct 1, 2014

Member

Do you mean it should not be necessary to pin a specific pod to a specific minion in order to get durable local storage? We may support host constraints and/or daemon scheduling in the future.

Or are you ruling out one of the design alternatives: making pods more durable?

@erictune

erictune Oct 2, 2014

Member

I'm saying that pinning a pod to a minion is not an end in itself, and we should avoid endorsing that by providing a mechanism that is described as doing just that.

Instead, we should provide targeted solutions to most of the cases where people think they need to pin a pod to a minion, but which model the specific problem in a way that does not require making mention of a specific minion. Examples:

  • need specific hardware: model that as a resource or attribute, and use a resource request or constraint on that attribute
  • need specific kernel or kubelet version: model as attribute
  • need specific network capabilities of machine: model as attribute or make part of cloudprovider + service
  • need durable data: see this proposal
  • need failure domain spreading: some specific mechanism
  • grouping of pods for network locality: specific mechanism

Some of those specific mechanisms may well allow a pod to be pinned to a single machine as a side effect.
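The "model it as an attribute" pattern above amounts to constraint matching rather than naming a minion. A toy sketch (the attribute names and the `minions` structure are invented for illustration; this is not the scheduler's actual data model):

```python
# Hypothetical sketch: instead of pinning a pod to "node-2" by name, the pod
# states constraints, and any minion whose attributes satisfy them is an
# acceptable placement.
minions = [
    {"name": "node-1", "attrs": {"gpu": False, "kernel": "3.10"}},
    {"name": "node-2", "attrs": {"gpu": True,  "kernel": "3.16"}},
    {"name": "node-3", "attrs": {"gpu": True,  "kernel": "3.10"}},
]

def feasible(constraints, minion):
    """A minion is feasible if every required attribute matches exactly."""
    return all(minion["attrs"].get(k) == v for k, v in constraints.items())

def candidates(constraints):
    """All minions where a pod with these constraints could be scheduled."""
    return [m["name"] for m in minions if feasible(constraints, m)]
```

A constraint set specific enough to match one minion pins the pod to a single machine only as a side effect, which is the distinction the comment is drawing.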

+ - Desirable for high availability and bootstrapping.
+ - A container spec with no volumes should have an identical filesystem view on every container instantiation.
+ - Allows writing truly stateless containers.
+

@bgrant0607

bgrant0607 Oct 1, 2014

Member

Other requirements:

  • No external orchestration should be required in order to set up the durable storage volume (e.g., creating/mounting host directories)
  • It should be possible to update a pod (e.g., change a container's image version) without losing the durable data

@erictune

erictune Oct 2, 2014

Member

added

docs/proposals/durable_data.md
+
+Application requirements:
+ - have a chunk of writeable data which is not automatically deleted by Kubernetes due to certain _events_, described
+ below. (EmptyDirectory does not provide this, nor does the containers writeable file layer.)

@bgrant0607

bgrant0607 Oct 1, 2014

Member

typo: container's

@erictune

erictune Oct 2, 2014

Member

fixed

docs/proposals/durable_data.md
+garbage collection is not merged, but this description is written assuming it is.)
+
+How specific events are handled:
+ - *Container Failure*: Container exits abnormally (out of memory, segfault, assertion failed, etc), or normally, or fails a healthcheck.

@bgrant0607

bgrant0607 Oct 1, 2014

Member

s/healthcheck/liveness probe/

@erictune

erictune Oct 2, 2014

Member

fixed

docs/proposals/durable_data.md
+
+Changes:
+ - Bit(s) on a Pod which indicates whether durability is requested.
+ - Kubelet sees whole pod object (currently has only ContainerManifest) to get UID.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

We plan to make this change in v1beta3, regardless: Kubelet should get the whole pod metadata + pod spec.

docs/proposals/durable_data.md
+Changes:
+ - Bit(s) on a Pod which indicates whether durability is requested.
+ - Kubelet sees whole pod object (currently has only ContainerManifest) to get UID.
+ - Kubelet preserves EmptyDirectory when stopping pod which wants durable data.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

When would this happen? A durable pod shouldn't be stopped/deleted. Its containers might need to be recreated, such as in the case of a VM reboot.

@erictune

erictune Oct 2, 2014

Member

You are right. Removed.

docs/proposals/durable_data.md
+ - Kubelet to inventory existing EmptyDirectory objects after reboot.
+ - Kubelet may need checkpoint or the equivalent to track the preserved EmptyDirectory dirs on disk, and to remember
+ the UID of stopped-and-about-to-be-recreated pods.
+ - A new model for rolling upgrades which uses updates rather than delete-recreate of pods.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

In-place updates

@erictune

erictune Oct 2, 2014

Member

Yes, in-place updates of pods. This is also about having a replicationController that has two pod templates: one for old config and one for new.

docs/proposals/durable_data.md
+ - Kubelet preserves EmptyDirectory when stopping pod which wants durable data.
+ - Kubelet attaches preserved EmptyDirectory when starting pod that wants durable data and has matching UID.
+ - Kubelet to inventory existing EmptyDirectory objects after reboot.
+ - Kubelet may need checkpoint or the equivalent to track the preserved EmptyDirectory dirs on disk, and to remember

@bgrant0607

bgrant0607 Oct 2, 2014

Member

We could embed the UIDs in the directory names in order to make them discoverable.

@erictune

erictune Oct 2, 2014

Member

That could work. I think you still need a checkpoint to do in-place updates. I think you can't easily complete the update in a single atomic step in all cases, and you need to remember the old and the new state and what you have done so far. Even if you find a way to do this by sticking state in docker or file paths, this becomes the moral equivalent of a checkpoint.
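Embedding the UID in the directory name, as bgrant0607 suggests, could look like the sketch below. The base path, separator, and helper names are all invented for this illustration (this is not the Kubelet's actual on-disk layout); the point is only that the directory name alone lets a rebooted Kubelet rediscover preserved volumes without a separate index:

```python
import os
import re

BASE = "/var/lib/kubelet/durable"  # hypothetical location for preserved dirs

def dir_for(pod_uid, volume):
    # Embed the pod UID in the path so a Kubelet can rediscover preserved
    # EmptyDirectory volumes after a reboot just by listing BASE.
    return os.path.join(BASE, "%s~%s" % (pod_uid, volume))

def inventory(entries):
    """Parse directory names back into (pod_uid, volume) pairs.

    `entries` is the list of names found under BASE; anything that does not
    match the naming scheme is ignored.
    """
    found = []
    for name in entries:
        m = re.match(r"^([0-9a-f-]+)~(.+)$", name)
        if m:
            found.append((m.group(1), m.group(2)))
    return found
```

As erictune notes, this only solves discovery; tracking the progress of an in-place update would still need checkpoint-like state beyond what a path can encode.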

docs/proposals/durable_data.md
+ - Container Failure
+ - No change from current.
+ - Pod Update
+ - No change from current.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

It would be worth clarifying that rolling updates would change from create/delete to in-place updates, making them fall into this case.

docs/proposals/durable_data.md
+ - Once a /data is bound, it stays bound until the user deletes the /data.
+ - Kubelet changes to support creating a Data, and to detect an existing Data after reboot.
+ - Kubecfg support for /data.
+ - Replication control for Data?

@bgrant0607

bgrant0607 Oct 2, 2014

Member

We agreed that a DataReplicationController, separate from ReplicationController, would be necessary, in order to accommodate my proposed approach to rolling updates.

@erictune

erictune Oct 2, 2014

Member

Let's talk about this f2f. I know we talked about a data replication controller, but it is now less clear to me that such a thing is required to do rolling updates.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

What would be responsible for ensuring that enough data objects exist for the pods that want them?

+How events are handled:
+ - Container Failure
+ - If RestartAlways selected, all volumes (EmptyDirectory, /data volumes) remain. No change.
+ - *Changed.* If other restart policy selected, a new UID but same config pod will reattach to an existing Data.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

I don't see how this is related to restart policy. As you pointed out, replication controller is only used with RestartAlways.

@erictune

erictune Oct 2, 2014

Member

People might build their own pod management flows that involve a pod exiting with success when it is ready to be upgraded, or multi-stage map-reduce pipelines where the output of one stage is persisted on local storage for the next stage to run.

docs/proposals/durable_data.md
+ without loss of data.
+ - Node Reboots
+ - Kubelet will see what pods and data object the node should have. It will see existing data objects and leave them
+ alone. It sees missing pods, so it starts them and attaches them to their data.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

You mean missing containers, I think.

@erictune

erictune Oct 2, 2014

Member

fixed.

docs/proposals/durable_data.md
+ - No change.
+ - Storage Device Fails
+ - No change. If controller is added to detect failed storage hardware, then it would delete the lost /data object,
+ and either delete the pods too, or put the pod back in a pending state to be rescheduled.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

Pods never go back to pending. Delete is the right answer.

@erictune

erictune Oct 2, 2014

Member

Done. As an aside, how do you expect to handle preemptions, when we get there? With a replication controller or job controller?

@bgrant0607

bgrant0607 Oct 2, 2014

Member

Yes, a replication controller or job controller would replace the preempted/evicted pod.

docs/proposals/durable_data.md
+Either alternative can be adapted to allow hedging bets: simultaneously starting a new pod/data pair while waiting for a
+lost pod/data to come back. This is not a deciding factor.
+
+Either alternative can be adapted to support rolling restarts. The approaches are different. This is probably not a

@bgrant0607

bgrant0607 Oct 2, 2014

Member

The approaches are very different. The durable pod approach would significantly compromise the flexibility and fungibility intended for replication controllers. See my replication controller writeup in #1527.

@erictune

erictune Oct 2, 2014

Member

Updated.

+
+In the Durable Pods approach, one has to favor updating an existing pod (keep same Name and UID) over deleting a pod and creating a new
+one. It seems like this may have a number of effects:
+ - A config push is not as simple as deleting all the old objects and pushing new ones. It has to match existing pods

@bgrant0607

bgrant0607 Oct 2, 2014

Member

It also means updates of ~all fields need to be supported, which may be complicated.

+
+A hybrid solution, which requires more thought, is to reuse EmptyDirectories using a key which is longer lived than Pod
+UID, but not to expose a separate REST api object for the data of that EmptyDirectory. A problem would be that
+something would need to garbage collect those objects, and perhaps in the future preempt them and report on their resource

@bgrant0607

bgrant0607 Oct 2, 2014

Member

Hidden resources is the path to a monolithic system. I'd rather put another layer atop the existing k8s API. It's already obvious that we're going to need one.

docs/proposals/durable_data.md
+stricter controls on Data, to prevent accidental erasure, while allowing more flexibility to change Pods (e.g.
+autosizing, setting debugging flags, etc). This seems like a minor point in favor of New Resource Kind.
+
+Overall, it seems like there are somewhat more points in favor of New Resource Kind.

@bgrant0607

bgrant0607 Oct 2, 2014

Member

Agree.

@erictune
Member

erictune commented Oct 2, 2014

@smarterclayton @derekwaynecarr can one of you read through this specifically to see if either model fits better with use cases you are familiar with?

@bgrant0607 bgrant0607 modified the milestones: v0.8, v1.0 Oct 4, 2014

@bgrant0607
Member

bgrant0607 commented Oct 14, 2014

@eparis While this issue is titled "durable data", that adjective is relative. :-) Unreplicated data are inherently neither highly durable nor highly available. Not only are nodes subject to physical failures, but also reduced durability and/or unavailability due to management operations, and sharing amongst multiple containers could also require workload migration away from the data. For these reasons, internally, we generally require that applications be able to cope with empty unreplicated storage volumes automatically. Populating an unreplicated storage volume with a one-time complex workflow would be considered an anti-pattern. Instead, we should get a HA storage system running on/with Kubernetes, from which applications could simply fetch data (e.g., database snapshots) to unreplicated local storage.

@eparis
Member

eparis commented Oct 14, 2014

I'm just thinking about @thockin's statement: "Yes, if we have a run-once that installs data, we need to add a new state to distinguish "installing data" from "running after data was installed". I see three variations that could make sense:"

So he really is talking about "Populating an unreplicated storage volume with a one-time complex workflow"

Maybe the question is that I was suggesting the possibility of giving an admin the ability to design a 'complex' system, and he refuses to accept anything other than a single-step initialization?

@bgrant0607
Member

bgrant0607 commented Oct 14, 2014

@eparis What mechanism do you imagine would automatically provision new initialized /data instances in the case that a replication controller replaced pods that were running on failed nodes (minions)?

@bgrant0607
Member

bgrant0607 commented Oct 14, 2014

BTW, I think using labels to represent the state of the /data instances is a good idea, regardless of what patterns we support/encourage.

@eparis

eparis Oct 14, 2014

Member

@bgrant0607 I'm just spitballin' here. I haven't actually thought through all (any) of the details. And it's certainly possible all of them should be ignored/belittled/mocked/etc...

In my made-up example, I'd think you could start a replication controller (restart policy=always) with a single Pod1, which requires a /data with the 'uninitialized' label. Next, start a replication controller with a single Pod2a and a third with a single Pod2b, which require a /data with the 'initialized 1' label. Now you launch replication controller(s) with whatever number of Pods you want to use this initialized data. These last pods should require both the 'initialized 1' and 'initialized 2' labels on their /data resource. Whew, so at this moment you actually have 5+ pods which the replication controllers want to schedule, but can't, because the /data resource isn't available...

Now create the data replication controller, which creates two /data resources, both with the 'uninitialized' label. So Pod1 will run (against one of those /data resources) and relabel it to include 'initialized 1'. As soon as it finishes, the replication controller will try to schedule Pod1 again. This second time, Pod1 will find the second uninitialized data resource and will thus initialize the second resource. When Pod1 exits this time, there will be no /data resource for it to run against, so it will be unschedulable.

Next, Pod2a/b will both find a /data already on a minion, and will run on that minion to initialize them. (Although it would suck if Pod2a found and initialized both of them; I guess I hadn't thought through that, so maybe you'd have to make these distinctly different /data resources somehow.) They will both return to being 'unschedulable.' Of course, lastly, the actual pods that need to do work will find their respective /data (because of the labels) and will be scheduled where it exists.

Now if the minion (oh sorry, foolish political correctness) node dies, eventually the data replication controller will mark the data resource dead and will schedule a new data resource, which will be uninitialized, on another minion. The pod replication controllers will take care of the two-step data initialization on the new minion, and eventually Pod3 will be relaunched on the new minion.
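The multi-stage workflow above can be sketched as a toy simulation. This is purely illustrative (the `DataResource` class, label strings, and matching rules are invented here, not any real API); a run-once pod claims the first /data whose labels match its requirement and which it has not already relabeled:

```python
# Toy model of the label-driven /data workflow sketched above.
class DataResource:
    def __init__(self, name):
        self.name = name
        self.labels = {"uninitialized"}  # every new /data starts here

def run_once_pod(resources, requires, adds):
    """One run of an init pod: claim the first /data that carries every
    `requires` label but has not yet received this pod's `adds` labels,
    then relabel it. Skipping already-relabeled resources sidesteps the
    "Pod2a initializes both" concern noted above."""
    for r in resources:
        if requires <= r.labels and not adds <= r.labels:
            r.labels |= adds
            return r.name
    return None  # no matching /data: the pod stays unschedulable

resources = [DataResource("data-a"), DataResource("data-b")]

# Pod1 is scheduled twice, once per uninitialized /data resource...
run_once_pod(resources, {"uninitialized"}, {"initialized 1"})
run_once_pod(resources, {"uninitialized"}, {"initialized 1"})
# ...and a third attempt would find nothing left to initialize.

# Pod2a and Pod2b each perform the second initialization step.
run_once_pod(resources, {"initialized 1"}, {"initialized 2"})
run_once_pod(resources, {"initialized 1"}, {"initialized 2"})

# The worker pods require both labels; both /data resources now qualify.
ready = [r.name for r in resources
         if {"initialized 1", "initialized 2"} <= r.labels]
```

The node-failure case then corresponds to dropping a `DataResource` from the list and appending a fresh one with only the 'uninitialized' label, after which the same sequence of run-once pods re-initializes it.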

@eparis

eparis Oct 14, 2014

Member

Of course this falls down wildly (as does any blind/uninformed scheduling of /data without consideration of the /pods that will use it) if the /data gets scheduled on a node without enough resources to allow the scheduling of any/all of the pods in the processing pipeline. Likely pods 1 and 2 in my example would have minimal cpu/memory/resource constraints, but pod3 might have very large resource requirements. So I don't know how the /data scheduler can know to land the durable data resource on a node that will be able to satisfy all of the possible future pod scheduling constraints....

@bgrant0607

bgrant0607 Oct 14, 2014

Member

@eparis Yes, something like what you propose probably could be made to work. But, yes, the scheduling issue you point out is a real concern.

It would be simpler to tie initialization to the pod lifecycle if the common case is one-step initialization. We could create a pod PreStart lifecycle hook that could initialize volumes if necessary.
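For the one-step case, such a hook might look roughly like this in a pod manifest. The `preStart` field and the init script are purely hypothetical (no such hook existed in the API at the time); the volume stanza mirrors the existing style:

```json
{
  "containers": [
    {
      "name": "db",
      "image": "mysql",
      "lifecycle": {
        "preStart": { "exec": { "command": ["/init-db.sh", "/data"] } }
      },
      "volumeMounts": [{ "name": "mydata", "mountPath": "/data" }]
    }
  ],
  "volumes": [{ "name": "mydata", "source": { "emptyDir": {} } }]
}
```

The kubelet would run the `preStart` command to completion (and require it to succeed) before starting the container proper, which ties initialization to the pod lifecycle without any extra resource kinds.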

@markturansky

markturansky Oct 25, 2014

Member

My perspective on this issue may or may not be relevant considering my newness to this project, but I approach K8s through the lens of several of my most recent jobs where I was responsible for IT infrastructure, app development, and ops support.

(tl;dr: Refactoring volumes into a top-level domain object and adding run-once docker containers are independent first steps that can be implemented before supporting new volumes in a multitude of scenarios. Following these tasks, implement new types of volumes.)

After reading (and re-reading) the various issues, I have some thoughts and questions:

1. Pod and volumes succeed or fail together?

Do pods that require a volume fail to launch if the volume fails to mount for any reason? Are they a required pair? I believe the answer is yes.

If so, does this give responsibility for mounting to the Kubelet before it launches the pod? A goroutine for mounting still supports the concurrency patterns in Go, but I imagine the Kubelet would then block and wait on a response from that goroutine's channel indicating success or failure. There is no reason to attempt a pod launch if the volume failed to mount on the host: my volume/data wouldn't be there and my app would fail.
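That "mount concurrently, but block before launch" ordering can be sketched in a few lines. This is pure illustration, not kubelet code; `mount_volume` is a stand-in for the real mount operation, and the pod/volume dicts are made up:

```python
# Illustration of the "mount first, then launch" gating described above.
from concurrent.futures import ThreadPoolExecutor

def mount_volume(volume):
    # Stand-in for the real mount; raises on failure.
    if volume.get("source") is None:
        raise RuntimeError("volume %s has no source" % volume["name"])
    return "/mnt/%s" % volume["name"]

def launch_pod(pod):
    with ThreadPoolExecutor() as pool:
        # Start every mount concurrently (the goroutine analogue)...
        futures = [pool.submit(mount_volume, v) for v in pod["volumes"]]
        try:
            # ...but block until all of them report success.
            mounts = [f.result() for f in futures]
        except RuntimeError:
            return "Failed"   # a mount failed, so the pod never starts
    return "Running"          # all volumes present; safe to start containers

pod = {"volumes": [{"name": "mymount", "source": {"hostDir": {"path": "/data"}}}]}
```

`Future.result()` re-raises any exception from the mount, so a single failed mount fails the whole pod, exactly the required-pair semantics described above.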

2. Optional run-once docker containers

A hook to run something before a pod launches is a valid use case for application developers. Only the first run would be "initialize the database" but subsequent versions of the application would have various database migrations that need to run for the current version of my app.

This feature is standalone, I believe, and not dependent on durable storage. It can be developed independently.

3. Different audiences require different functionality

I had applications deployed to both AWS and SoftLayer. A big-ish company I worked for had their own hardware in a data center (which also fits the "run openshift on bare metal" requirement for RH enterprise customers).

The former scenario allows dynamic clusters. K8s can use the APIs to bring up new nodes, make new storage volumes, etc. K8s would keep track of the various names/IDs assigned. Selectors match volumes and pods with K8s managing it all.

The latter, on the other hand, is not as dynamic. If I have my own hardware, I know how many blades are in my cluster, for example, and what my NAS looks like. They are all named and known. Pods across this cluster are still cattle, but I think the NAS is a pet. Pods need to know what they want for mounted volumes, which I think is different than the cloud API version in the previous paragraph.

4. "New" resource kind or refactored "old" resource?

Moving volumes, as-is, from the pod manifest to a top-level domain object is a straightforward refactoring task. Move it and make everything work like it did before. Keep the same 3 types (emptydir, hostdir, gcepersistentdisk). This task can happen independently of durable data and storage.

5. Plugins for providers

Depending on the audience, we'd need different plugins based on type (strategy pattern). I know this is already the pattern in place for 'cloud provider', and I'm sure everyone already assumes this much.

One plugin per type. After the refactoring in point 4 above is complete, we can add new types to support new functionality.

6. REST API

Expand the number of "source" volumes supported and add whatever attributes are required by that provider/plugin.

"volumes": [
    { "name": "mymount", "source": { "hostDir": { "path": "/data" }}},
    { "name": "mynfsmount", "source": { "nfsDir": { "path": "/data", "nfsServer":  /* various NFS related attributes */ }}},
    { "name": "myawsmount", "source": { "ebsDir": { "path": "/data", "awsCredentials":  /* various AWS related attributes */ }}}
]

Perhaps there are good reasons to refactor this API, too, but the existing API already supports many types of volumes. It seems like this is easily expanded to accommodate more types of volumes.

Next steps

I believe refactoring volumes as-is into a top level domain object is an independent task that is a pre-req to any durable storage volume. This task can happen now before new types of volumes are defined and the details worked through.

Triggers for run-once docker containers also seems like an independent task. If pods and volumes are a required pair, I don't see a need for forgiveness. Mounting a volume is a pre-req for the pod. It happens or it doesn't. The timeout for this operation can be long (if that's what you mean by forgiveness), but it seems sequential in nature before launching the pod.

Only after the refactor of volumes can we add new types. Just adding one new type will expand our (my) knowledge greatly.

When the rubber needs to meet the road, I think we can/should do numbers 2 and 4 above first while hashing out details for the first new type of volume.

My $0.02.

@erictune

erictune Oct 25, 2014

Member

Responses to last @markturansky post:

tl;dr

This is not intended for the NAS case.

Not seeing benefit of volume-as-own-object, yet. Happy to listen more though.

Agree with the rest of what you said.

Do pods that require a volume fail to launch if the volume fails to mount for any reason?

Yes. And the rest of what you said in item 1 is also true. (I'm intentionally not answering the vaguer question of "Do Pod and volumes succeed or fail together".)

Optional run-once docker containers

I agree this is a feature we could consider separately from durable volumes. There is a bunch of discussion in this thread about run-once. But, the proposal in this PR doesn't try to address that.

Pods on "own hardware" are cattle, but the NAS is a pet

I agree that NAS seems like a pet. But, I wasn't considering NAS when I wrote this proposal; I was just thinking about the "rack of 1u servers each with their own hdd(s) and/or ssd(s)" case.

Interested in your thoughts about NAS.

Moving volume from the pod manifest as is to a top level domain object

I don't agree that the existing volume types should be made into their own objects. I make a distinction between the mounting and the data being mounted. Considering each volume-type:

  • emptydir represents both the mounting and the data itself. Kubelet creates the dir and deletes the data when no longer needed. For nodes with local storage, we plan that the k8s scheduler and kubelet will allocate bytes for emptydir out of total node-local storage bytes. For a node with only non-local storage, I haven't given this much thought yet.
  • hostdir represents just the mounting. It is not the data: Kubelet never creates the directory (dentry) of a HostDir, nor does it ever delete it or its contents. It also won't account for the disk bytes in it. (IOPS for hostdir is tricky. Let's ignore that.)
  • gcepersistentdisk represents just the mounting (and I'm lumping device-attach in with mounting). It is not the data: kubelet never touches the data on the GCE PD itself; the storage space and IOPS on the PD are out of scope for k8s to manage.

So hostdir and gcepersistentdisk only represent mounting, not data. I don't see that the "mounting" has meaning independent of the pod.

For "emptydir", I agree that you could make the "data" part of this into its own object. In a sense, that is exactly what this proposal is -- to make a top-level object for this sort of data.

However, if you buy that a common case is to want to allocate that data right before a pod starts and delete it right after the pod terminates, then it would be convenient to not have to create a pair of objects (a pod and a data-thing).

I'll have to think about this more.

Plugins for providers

Yes. Suggestions on other plugins welcome.

@markturansky

markturansky Oct 26, 2014

Member

Thanks for the thoughtful feedback @erictune.

I think I understand the error in my understanding of the domain model. The new top-level object would be data, because that is what has durable identity (it's where my data lives). Volumes would use selectors to access data and provide whatever security tokens are necessary. Volumes map to hostDirs or a type/implementation per provider; volumeMounts go on the pods. Refactoring volumes out is the wrong code design.
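Under that reading of the model, the shape might look something like this. The `Data` kind and `dataSelector` source are hypothetical names invented here only to illustrate the idea, in the same style as the volume snippets earlier in the thread:

```json
{ "kind": "Data", "id": "mysql-data-1", "labels": { "stage": "initialized" } }

"volumes": [
    { "name": "mydata", "source": { "dataSelector": { "stage": "initialized" } /* matches the Data object above */ }}
]
```

The Data object carries the durable identity and labels; the pod's volume merely selects it, so pods stay cattle while the data is the thing with a lifecycle of its own.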

I thought, too, that I read a general consensus across the digital ink spilled on durable data that a New Resource Kind was the preferred implementation. That would be the one thing you could define that spans pods, according to my understanding of the domain model. Is this accurate?

Thanks again!

@eparis

eparis Oct 27, 2014

Member

@erictune I think a lot of Mark's comments are influenced heavily by my comments in #2003. He got to see it on Friday...

@stp-ip

stp-ip Nov 21, 2014

Member

I have a few concerns with some of the ideas. Chaining a pod to a specific minion just to make data "durable" seems like the wrong way to go. It gives a wrong sense of durability and security.
I suggest using a more pluggable and modular interface, where the choice the user makes determines the durability, so that he knowingly chooses to be exposed to disk failure instead of merely trusting "durable" data methods.
In the proposal, the container/pod uses a specific directory structure. For data it would use /con/data, so it does not interfere and can be used with a lot of different providers. I agree that the simplest solution would be to bind a pod to a specific host and just use Host Volumes. These can easily be mounted and used for "does not lose data on restart" durability. On the other hand, as @smarterclayton already said, there are a lot of different projects that already provide durable data, such as Ceph, Gluster, and the actual services in AWS, GCE, etc. The best way to use these is to have a common way to switch from Host Volumes to the (in my opinion) more favoured Data Volume Containers to Side Containers. Side Containers basically provide additional logic to bind the exposed volume to Ceph, for example.

So I'd rather not only pin pods to hosts, which does not make me feel secure, as it exposes one to hardware failure. I agree that pinning to specific hardware characteristics is a need, but not pinning to specific hardware. This was discussed in #2342; it already seems possible to let specific attributes, such as ssd or gpu, be used in scheduling.

With the defined common interface, containers could then easily provide data. Basically, the above plugins are the equivalent of Side Containers (Data Volume Containers with specific logic). These methods would enable a mariadb image, for example, to be used with Host Volumes in development, with Data Volume Containers in staging, and, depending on the durability need, with Side Containers exposing distributed storage systems or a completely integrated approach called Volume as a Service on GKE.

For Kubernetes we could use this standard interface to mount volumes and then make it unnecessary to pin a pod to a minion, but to use Data Volume Containers instead and move them with the pod. Easiest solution: stop - pipe to new Data Volume Container - start new container.
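As a concrete, simplified illustration of the stop/pipe/start idea with plain Docker data volume containers (the `mydata`, `app`, and `myapp` names are made up; the commands are standard docker CLI):

```shell
# Create a bare data volume container exposing /con/data.
docker create -v /con/data --name mydata busybox

# The application container mounts the same volume.
docker run -d --name app --volumes-from mydata myapp

# "stop - pipe to new Data Volume Container - start":
docker stop app
docker run --rm --volumes-from mydata -v "$PWD":/backup busybox \
    tar cf /backup/con-data.tar /con/data   # pipe the data out for the move
docker start app
```

The archive could then be unpacked into a fresh data volume container next to the pod's new location, which is the "move them with the pod" step.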

The solution I favour is a more integrated approach. Side Containers could be the custom method for users, but providing infrastructure-specific and integrated Side Containers and exposing them as Volumes as a Service would be ideal. K8s could expose a distributed data volume, which basically maps the data to a distributed persistent disk. In a self-hosted scenario, VaaS could be an admin-provided Side Container binding the volume not to a Google persistent disk, but to Ceph, Gluster, or so.

The actual proposal can be found here: moby/moby#9277

@hjwp

hjwp Nov 21, 2014

My two cents -- permanent storage is such a basic requirement that I think it should be a service in the same way docker or etcd are. It's all entirely in my head at this stage, but my plan for paas world domination will involve running something like gluster on a bunch of instances inside my cluster, and mounting a single massive shared filesystem on all machines. Individual containers can then just mount parts of that filesystem as volumes. So now, at the container level, I never have to worry about which machine I come up on and whether the permanent storage will be attached, because it's always available, everywhere.

Then the problem becomes the platform-specific one of managing the actual storage that the gluster instances use, EBS volumes or PDs or whatever, but that should be a rare task. Once gluster is up and running on multiple machines, I shouldn't have to worry about it any more, and gluster should be redundant enough that losing any individual gluster machine won't hurt...

@stp-ip


stp-ip Nov 21, 2014

Member

@hjwp I agree with most parts.
So basically you set up your Gluster pod, for example. It needs the logic to run Gluster and to know where the data lives.
Then you use this Gluster Service in special Side Containers, which expose simple volumes.
These volumes can then easily be mounted in each pod you want it to be available.
This way you have just decoupled the Base Container, the binding of volume and distributed storage, the distributed storage deployment.
This is mostly outlined in moby/moby#9277. I agree that storage is a primitive, but that doesn't mean we should force durability to be one thing; rather, we should enable durability via modular solutions.


@erictune


erictune Nov 25, 2014

Member

@stp-ip

Durable is the wrong word. Let's forget I used it. When I get back to working on this after the holidays, I plan to choose a different name for the concept. I'm thinking of calling the new resource Kind a "nodeVol", because its two essential properties are that it is tied to a node and that it can be referenced in a VolumeSource.

There are two main types of storage we will have:

  1. local storage, which exposes the user to hardware failures and is tied to a single node, but which may have improved performance characteristics. Resource allocation is done by Kubernetes.
  2. remote storage, which is not tied to a single node, and which typically provides durable storage through replication. Resource allocation is done by an external system.

We definitely need both. "Local" is used for:

  1. a building block to implement services like Ceph and Gluster on top of Kubernetes.
  2. applications that need the higher performance of a local source and are willing to deal with failures.

@hjwp made a good distinction between using a cluster filesystem and admining a cluster filesystem. I think kubernetes should support both, and that you need "local" as a building block to provide a service that enables "remote".

@stp-ip
You make a distinction between pods depending on a hardware type versus depending on specific hardware that has specific data on it (for the fast restart case.) I think we need to support both cases. I believe that the "new resource Kind" proposal in this PR allows users to implement either behavior.

I like your example of a mariadb that uses different volume sources at different stages of development. I had considered this. It is closely related to the need to have config be portable across cloud providers and different on-premise setups. To address this, I expect that configuration would be written as templates, where the pod's volumeSource is left unspecified, to be filled in by different instantiations of the template (prod vs dev, etc).

I've taken a look at moby/moby#9277, and I've subscribed to the discussion.
The key parts of that proposal, IIUC, are directory conventions, and new flavors of docker containers and volumes. The most important aspect of what is being proposed here, is a way to allocate, account for and control access to node storage resources, independent of how they are mapped into containers/volumes. Therefore, I think that the two proposals are largely complementary.

One thing to note is that some forms of node-local storage don't manifest themselves as a filesystem, so filesystem standards may not apply directly. For example:

  • a mysql database in a pod that uses innodb with raw block device access
  • an application that uses an NVMe SSD via ioctls on the block device.
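To make the "nodeVol" idea concrete, here is a purely hypothetical sketch of what such a resource and its use from a pod might look like. None of these Kind or field names exist in the API; they only illustrate the two essential properties named above (tied to a node, referenceable as a VolumeSource), loosely following the manifest shape of the day.

```yaml
# Hypothetical sketch only: a "nodeVol" resource Kind for a
# node-local storage allocation managed by Kubernetes.
kind: NodeVol
apiVersion: v1beta1          # placeholder version
id: mysql-scratch
node: node-17.example.com    # the node this allocation is tied to
size: 100Gi
---
# A pod referencing it through a hypothetical VolumeSource type.
kind: Pod
apiVersion: v1beta1
id: mysql
desiredState:
  manifest:
    version: v1beta1
    containers:
      - name: mysql
        image: mysql
        volumeMounts:
          - name: data
            mountPath: /var/lib/mysql
    volumes:
      - name: data
        source:
          nodeVol:             # hypothetical volume source
            name: mysql-scratch
```

A scheduler that understood such a resource would have to place the pod on the named node, which is exactly the loss of pod mobility discussed later in this thread.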
@kubernetes-bot


kubernetes-bot Nov 26, 2014

Can one of the admins verify this patch?


@stp-ip


stp-ip Nov 27, 2014

Member

@erictune
The directory structure in moby/moby#9277 could be used with raw devices too. The easiest method was just mounting volumes and using directories, but one could in my opinion just mount a raw device at that mountpoint or most likely similar for ioctls.
Part of the suggestions in the proposal are of a more technical nature and talk about how projects can give users easier tools for volumes. I agree that there are distinctions to be made.

These are the issues/solutions:

  • Standardized directory/mountpoints for modular images and easier deployment for users
  • "Local" way to provide volumes (host volumes)
  • "Remote" way to provide volumes (Data Volume Container)
  • Additional logic for mounted volumes (Side Containers) providing config generation or similar
  • Platform integration of volumes (VaaS)
    • Remote volume, such as a git-based volume or a connector/connector Side Container for GCE persistent disk etc.
    • Local volume enabling passthrough abilities for raw devices and scheduling of the underlying disk utility, e.g. to run Ceph on top of k8s

So I agree with most of your points and would love to see both better support for passing through disks and raw access to hardware for local usage (still favoring /con/data as the default).
Additionally, I'd like to see the integration of remote volumes as outlined in the volume plugin PR (which I can't find right now).


@hjwp


hjwp Nov 29, 2014

I never really understood the point of volumes-from and data-volume containers. No doubt that's me being a noob, but, in the specific case of our "remote storage" use case, ie, i have a container app that needs access to some permanent storage, and that doesn't want to be tied to a particular node, just know that wherever it comes up, it can access said storage. let's assume said storage is available as a mounted filesystem on the underlying machines (which in the background is implemented using whatever distributed filesystem voodoo we like). Why would i want the extra layer of indirection of a data volume container, rather than just mounting in the remote storage paths directly into my app container?
Forgive me if that's a stupid question.


@stp-ip


stp-ip Nov 29, 2014

Member

@hjwp because then you either have the logic of a specific distributed mount inside your container, which is counterintuitive to decoupling, or you mount a node-specific directory. Even if every node has this mountpoint and passes it through to some distributed storage, you are defining special cases. With a data volume container, for example, you don't have to mount distributed storage for project A on every node, even if it only runs on 1/10th of the nodes. Additionally, updating the binding software for such a mount is then not easily upgradable via containers, but involves node updates.

So there are a few negative aspects to running node-specific stuff, not decoupling, etc. Especially the decoupling into one container for each concern is something one has to wrap one's head around. Sure, you could just run apache+storage-endpoint+mysql+whatever in one container, but then one could just use a VM. Even though you have a bit more complexity with data-volume containers or volumes-from, you add the ability for decoupling and a more service-oriented infrastructure.

Data containers will always have a place, but in production systems such as k8s they will most likely be replaced with Volumes as a Service, i.e. k8s mounts k8s-provided volumes, which then map to git, a secret store, a distributed data store (GCE persistent disk), etc.
If you want to get some ideas and some insights in why one would use data volume containers or Side Containers (data volume containers with additional logic) you can take a look at moby/moby#9277.

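As a rough illustration of the decoupling described above, the following sketch pairs an app container with a storage-endpoint side container that share a pod-scoped volume, so the app only ever sees a plain directory. The image names, the /con/data mountpoint, and the exact manifest fields are illustrative, loosely following the v1beta1-era shape.

```yaml
kind: Pod
apiVersion: v1beta1
id: app-with-side-container
desiredState:
  manifest:
    version: v1beta1
    containers:
      - name: storage-endpoint          # side container: holds all the
        image: example/gluster-binder   # distributed-storage logic
        volumeMounts:
          - name: shared-data
            mountPath: /con/data
      - name: app                       # app container: sees only a
        image: example/webapp           # plain directory, no storage logic
        volumeMounts:
          - name: shared-data
            mountPath: /con/data
    volumes:
      - name: shared-data               # pod-scoped shared volume
        source:
          emptyDir: {}
```

Upgrading the binding software then means shipping a new side-container image, not updating nodes.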

@brendandburns


brendandburns Dec 16, 2014

Contributor

I'm marking this P2 as it is speculative and a design doc (and also we have limited the scope of data for v1). Also, this is kind of redundant with #2598.


@brendandburns


brendandburns Dec 16, 2014

Contributor

Copying a quote from @erictune from #598

Based on what I learned in the discussion, I think a good first step before implementing durable data would be to develop a working prototype of a database (e.g. mysql master and slave instance) used by a replicated frontend layer.

Attributes of prototype:

  • individually label two nodes: database-master and database-slave.
  • constrain the mysql master pod to database-master with a node selector, and likewise for the slave.
  • constrain frontend pods to the remaining machines.
  • give mysql pods "hostDir" access so that they can have direct disk access and so that the lifetime of the tables is the same as that of the VM.
  • make it possible to narrow the scope of hostDir; make hostDir a capability that can be granted to individual pods (e.g. to mysql, but not frontends).
  • demonstrate deleting a mysql pod and then starting a new version, with the tables still working.
  • demonstrate rolling upgrades on the frontends.
  • demonstrate a service with a fixed IP address with just the master as an endpoint. Then demonstrate updating the service to fail over to the slave, in conjunction with whatever mysql commands are needed to promote the slave.

I think we will learn quite a bit from such an exercise that will improve an eventual durable data implementation. And it will give users a template for what to do until we have a durable data implementation.

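The prototype steps quoted above could be sketched roughly as follows. The labels, paths, and exact manifest fields are illustrative (hostDir was the volume type of the day for direct node disk access), and the labeling command is hypothetical.

```yaml
# First, label the node out of band, e.g. something like:
#   kubecfg label node-1 role=database-master   (illustrative command)
kind: Pod
apiVersion: v1beta1
id: mysql-master
nodeSelector:
  role: database-master      # constrains scheduling to the labeled node
desiredState:
  manifest:
    version: v1beta1
    containers:
      - name: mysql
        image: mysql
        volumeMounts:
          - name: tables
            mountPath: /var/lib/mysql
    volumes:
      - name: tables
        source:
          hostDir:             # table files live on the node itself,
            path: /data/mysql  # so they outlive the pod
```

Deleting this pod and creating a new version with the same nodeSelector and hostDir should find the tables intact, which is the "tables still working" demonstration listed above.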

@markturansky


markturansky Dec 18, 2014

Member

@brendanburns I will make the example as part of my work for #2609. This is very close to the example I was already planning on building, but with a persistent disk instead of hostDir -- though in the near term I was planning on using my local host as the "persistent" disk through hostDir. Very copacetic.


@erictune


erictune Jan 12, 2015

Member

I no longer believe that we should implement the durable data concept described in this PR. Therefore I am going to close this PR.

I now see that there are two use cases that should be handled separately:

  1. make it easy to run things that want to attach to networked, truly persistent (replicated or tape-backed-up) storage.
  2. make it possible to run things which must have local storage device access.

Why not handle both cases with one concept? Because the thing you end up with:

  • has muddled concepts because it is trying to model too many cases
  • adds extra API complexity (e.g. new object which is similar to but not the same as a pod, with own lifetime)
  • adds scheduler complexity (dealing with pairing durable data).
  • an attractive nuisance which reduces pod mobility, which will block lots of things (upgrades, autoscaling, rescheduling, etc.)

For the "easy to run things that want attach to networked storage" case, we should do:

  • each cluster has one or more types of networked storage available that, to a first approximation, are accessible from every minion.
  • use something like the #2598 volumes framework, extended for various networked storage solutions.
  • maybe provide a way to allow admins to export and access-control subsets of those networked storage solutions to users, along the general lines of #3318
  • pods are still completely mobile (not bound to specific machines with specific local data).
  • Do some kind of hack to deal with the GCE one-writer limitation.
  • nfs and ceph clients are good examples of this category.

For the "possible to run things which must have local storage device access" case, we should do:

  • make a hostDir capability, and perhaps narrower capabilities to use specific file systems or devices.
  • use policy to limit which pods can use those capabilities. In mature installations, only "infrastructure" pods would typically get that (ones which e.g. implement ceph, cassandra, hdfs, etc servers).
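For contrast, the networked-storage case keeps the pod fully mobile. A hedged sketch, using an nfs volume type of the sort a #2598-style plugin framework could add (the field names and server address are illustrative, not an existing API):

```yaml
kind: Pod
apiVersion: v1beta1
id: wiki
desiredState:
  manifest:
    version: v1beta1
    containers:
      - name: wiki
        image: example/wiki
        volumeMounts:
          - name: pages
            mountPath: /var/www/data
    volumes:
      - name: pages
        source:
          nfs:                        # reachable from any minion, so the
            server: nfs.example.com   # pod is not bound to a machine
            path: /exports/wiki
```

Because any node can mount the export, the scheduler needs no special pairing logic and upgrades, autoscaling, and rescheduling are unaffected.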

@erictune erictune closed this Jan 12, 2015

@erictune erictune referenced this pull request Jan 12, 2015

Closed

WIP: Persistent Storage #3318

@erictune


erictune Jan 14, 2015

Member

@bgrant0607 we talked just now about how I don't think we should do durable data in any of the forms previously discussed. See my post above.


@smarterclayton


smarterclayton Jan 14, 2015

Contributor

----- Original Message -----

I no longer believe that we should implement the durable data concept
described in this PR. Therefore I am going to close this PR.

I now see that there are two use cases that should be handled separately:

  1. make it easy to run things that want to attach to networked, truly
    persistent (replicated or tape-backed-up) storage.
  2. make it possible to run things which must have local storage device
    access.

Why not handle both cases with one concept? Because the thing you end up
with:

  • has muddled concepts because it is trying to model too many cases
  • adds extra API complexity (e.g. new object which is similar to but not
    the same as a pod, with own lifetime)
  • adds scheduler complexity (dealing with pairing durable data).
  • an attractive nuisance which reduces pod mobility, which will block lots
    of things (upgrades, autoscaling, rescheduling, etc.)

For the "easy to run things that want attach to networked storage" case, we
should do:

  • each cluster has one or more types of networked storage available, that,
    to a first degree, are accessible to every minion.
  • use something like #2598 volumes framework to extension for various
    networked storage solutions.
  • maybe provide a way to allow admins to export and access control subsets
    of those network storage soultions to users, along the general lines of
    #3318
  • pods are still completely mobile (not bound to specific machines with
    specific local data.
  • Do some kind of hack to deal with the GCE 1 Writer limitation.
  • nfs and ceph clients are better examples of this category.
  • Down the road, create a PersistentVolume type that pretends to offer network volume, but is really handled by a ride-along pod that periodically snapshots the volumes to some durable storage, and offer a volume type that inits to latest snapshot or uses what's on disk.

For the "possible to run things which must have local storage device access"
case, we should do:

  • make a hostDir capability, and perhaps narrower capabilities to use
    specific file systems or devices.
  • use policy to limit which pods can use those capabilities. In mature
    installations, only "infrastructure" pods would typically get that (ones
    which e.g. implement ceph, cassandra, hdfs, etc servers).
@erictune


erictune Jan 15, 2015

Member

@smarterclayton
with the periodic snapshots idea: if application has other state that is not snapshottable (state on remote services), then on a recovery, it will have skewed local/remote state. Not sure what fraction of apps would be able to use this?


@smarterclayton


smarterclayton Jan 15, 2015

Contributor

At least for the types of networked applications I'm aware of, the utility of this snapshot is more for simple singleton processes that could potentially tolerate some loss of data (the last 10 minutes) in preference to having no data. An admin team could modify their restart / node replacement processes in order to maintain a reasonably effective SLA, without the rest of the system being aware. For instance, by adding a tool that tries to snapshot all the instances of this volume type on a node prior to beginning evacuation.

I'm thinking of things like Git repositories, Jenkins servers, simple Redis cache services, test databases, QA workloads, simple collaboration style apps, etc. Losing even 15 minutes of state is unlikely to impact those sorts of tools, and most of them are inherently single pod stateful. There's probably a cost / effort sweet spot for loose consistency of data here.

On Jan 15, 2015, at 10:45 AM, Eric Tune notifications@github.com wrote:

@smarterclayton
with the periodic snapshots idea: if application has other state that is not snapshottable (state on remote services), then on a recovery, it will have skewed local/remote state. Not sure what fraction of apps would be able to use this?


Reply to this email directly or view it on GitHub.

