
Volume topology aware scheduling design #1054

Merged
merged 2 commits into kubernetes:master on Oct 4, 2017

Conversation


@msau42 msau42 commented Sep 13, 2017

Proposal for a smarter scheduler that influences PV binding.

Part of kubernetes/enhancements#121

/sig storage
/sig scheduling
/cc @kubernetes/sig-storage-proposals @kubernetes/sig-scheduling-proposals

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/design Categorizes issue or PR as related to design. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 13, 2017

msau42 commented Sep 13, 2017

/assign @davidopp @jsafrane @saad-ali @thockin

preemption. Preempting a pod that uses a PV will not free up capacity on that
node because the PV lifecycle is independent of the Pod’s lifecycle.

##### Other scheduler predicates
Contributor

@resouer resouer Sep 13, 2017

Could you please also evaluate if this new predicate need special care in factory.go according to: https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go#L70-L76

Member Author

If the PVC or PV objects change, then this new predicate needs to be invalidated. Also if the Node labels change. I'll add that to the predicate section.

Contributor

Cool, thanks!

and prebinding them to the PVCs. The PV controller completes the binding
transaction, handling it as a prebound PV scenario.

Prebound PVCs and PVs will still immediately be bound by the PV controller.
Member

What do you mean by this? If they were prebound, why do they need binding?

Member Author

The user only needs to prebind one side, either the PVC or the PV. And then the PV controller finishes binding the other direction.

So if the user set PV.ClaimRef, then the PV controller will set PVC.volumeName.
And if the user set PVC.volumeName, then the PV controller will set PV.ClaimRef.
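
A minimal Go sketch of the two prebinding directions described above, using the core/v1 types; the helper functions are illustrative only, not an existing API:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// Direction 1: the user pre-binds the PV by setting PV.Spec.ClaimRef.
// The PV controller then sets PVC.Spec.VolumeName on the referenced claim.
func prebindPVToClaim(pv *v1.PersistentVolume, pvc *v1.PersistentVolumeClaim) {
	pv.Spec.ClaimRef = &v1.ObjectReference{
		Kind:      "PersistentVolumeClaim",
		Namespace: pvc.Namespace,
		Name:      pvc.Name,
		UID:       pvc.UID,
	}
}

// Direction 2: the user pre-binds the PVC by setting PVC.Spec.VolumeName.
// The PV controller then sets PV.Spec.ClaimRef on the referenced volume.
func prebindClaimToPV(pvc *v1.PersistentVolumeClaim, pv *v1.PersistentVolume) {
	pvc.Spec.VolumeName = pv.Name
}
```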

Member

if the user set PVC.volumeName, then the PV controller will set PV.ClaimRef.

That should not be the case... a PVC user cannot target a specific PV by name.

Member Author

I just tried it and it works. I created 2 PVs of different sizes, and then created a PVC that requests the size of the smaller PV, but specifies the name of the larger PV. And the volume got bound to the specified PV name. From my understanding, it's been this way since the beginning.

Member

From my understanding, it's been this way since the beginning.

I don't think so. It looks like that was added in 1.3 in https://github.com/kubernetes/kubernetes/pull/24331/files#diff-3aaafc34911c234537e0076b5ed40b37R202. I would not expect a namespaced user to be able to target a particular PV.

Member Author

Regardless, it's been like this for many releases now. If we want to change it to not allow prebinding PVCs, then it can be considered separately from this proposal. This proposal is not changing the current prebinding behavior.

Member

@liggitt, I thought it was common practice - a Pod author can set pod.spec.nodeName to bypass scheduling. This is the same thing.

Member

@liggitt liggitt Sep 30, 2017

a Pod author can set pod.spec.nodeName to bypass scheduling

That is prevented in some clusters. Regarding PV/PVC, as long as precedence is given to the PV's claim ref, and a reference by a PVC doesn't allow it to be bound to a PV that doesn't match the PVC's resource requests, that seems OK.

following use cases:
* While a Pod is running, a user deletes the PVC and PV, and then recreates them.
* With the current behavior, the PVC will rebind to the same or different PV.
* With this new behavior, the PVC will not rebind because binding would only
Member

On EBS at least, you can't delete a volume that is mounted and in use. Deleting a dynamically bound PVC results in the PV being "Released" (I think). So when the user creates a new PVC with the same name (because the user does not want to change the pod's yaml), a new PV is created. So we get 2 PVs (one released and one bound) mapped to the same PVC.

Member Author

@msau42 msau42 Sep 13, 2017

Not all storage providers may be able to detect if a volume is in use before deleting. But assuming it can detect and fail the deletion, let's break this up into 2 scenarios actually:

  1. Delete PVC, PV is released, but not deleted. Recreate PVC.
    If the container restarts, the old volume is still used in the new container. If the Pod restarts, then it goes back through scheduling, and we rebind the PVC at that time. Pod termination works because we only need the PV, not the PVC object. So no regression here.

  2. Delete both PVC and PV objects. Recreate PVC and PV objects.
    Pod termination still works because PV object is still in the actual state. If kubelet restarts, volume reconstruction will recreate the PV object. No regression here.

So actually, I made a wrong assumption that we needed the PVC object to be bound for Pod teardown. kubelet volume manager only needs the PV object, and it can reconstruct it too if kubelet crashes. Of course, this also assumes that the volume plugin implements reconstruction correctly.

I'll remove this section unless we can think of another scenario that won't work with the new design. I'll leave the discussion of whether we want to prevent PVC and PV deletion while still in use to a separate issue.

4. Find best matching PV for the PVC where PV topology is satisfied by the Node.
5. Temporarily cache this PV in the PVC object, keyed by Node, for fast
processing later in the priority and bind functions.
6. Return true if all PVCs are matched.
Member

What happens if matched PVs are deleted or the PVC is resized after the scheduler started matching? I am just trying to think; there appears to be a delay between when the PVs/node are matched and when binding happens. Does the pod go through MatchUnboundPVCs again if something changed since matching began?

Member Author

The same window occurs today in the PV controller. If the PV object changes, then when we try to modify it, the update will fail because we're trying to update an older version. If it fails, then we fail scheduling and retry.

If the PVC gets resized, then the binding will still happen, and afterwards, the resize controller will kick in.


```
BindUnboundPVCs(pod *v1.Pod, node *v1.Node) (err error)
```
Member

Do we check again if PVC constraints are still met? Can we clarify how this function accesses the PV cache? From what I can see it only gets the pod, and hence it must only have the PVC name.

Member Author

I need to iron out the exact details, but I'm thinking of keeping an in-memory state in the PVC object with the chosen PV name.

Before binding, we can do some sanity checks again, but even then there can still be a race between when you check and when you actually update the object. There, we rely on the api update to fail if the PV object has changed in the meantime.
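
A rough sketch of the kind of per-node in-memory match cache described here; all names are hypothetical, and the real safeguard remains the API server rejecting stale updates:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// volumeDecisions maps "namespace/name" of a PVC to the chosen PV name.
type volumeDecisions map[string]string

// assumeCache is a hypothetical scheduler-local cache: for each node evaluated
// by the predicate, remember which PV was matched for each unbound PVC so the
// priority and bind steps can reuse the decision without re-matching. The
// eventual API update still relies on resourceVersion conflicts to detect PVs
// that changed in the meantime.
type assumeCache struct {
	decisions map[string]volumeDecisions // keyed by node name
}

func (c *assumeCache) assume(node *v1.Node, pvc *v1.PersistentVolumeClaim, pv *v1.PersistentVolume) {
	if c.decisions[node.Name] == nil {
		c.decisions[node.Name] = volumeDecisions{}
	}
	c.decisions[node.Name][pvc.Namespace+"/"+pvc.Name] = pv.Name
}
```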

An alternative approach is to trigger the new behavior only for volumes that
have topology constraints. A new annotation can be added to the StorageClass to
indicate that its volumes have constraints and will need to use the new
topology-aware scheduling logic.
Contributor

@NickrenREN NickrenREN Sep 14, 2017

That is to say: when the PV controller synchronizes an unbound PVC, it will always have to check the StorageClass's annotation to see whether it should go into the claim.Spec.VolumeName == "" block?

Member Author

Yes the alternative is to have the trigger be in the storageclass, so you need to check for the annotation every time you handle unbound volumes to determine if you should delay binding or not.
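
A sketch of what that per-StorageClass check could look like; the annotation key is hypothetical, since the proposal does not name one:

```go
package sketch

import storagev1 "k8s.io/api/storage/v1"

// annDelayBinding is a hypothetical annotation key; the proposal does not fix a
// name. It marks classes whose volumes have topology constraints and should be
// bound by the scheduler instead of immediately by the PV controller.
const annDelayBinding = "volume.alpha.kubernetes.io/delay-binding"

// delayBinding sketches the extra check the PV controller would perform for
// every unbound PVC under this alternative.
func delayBinding(class *storagev1.StorageClass) bool {
	return class != nil && class.Annotations[annDelayBinding] == "true"
}
```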


In addition, this new design will cause functional regression for the
following use cases:
* While a Pod is running, a user deletes the PVC and PV, and then recreates them.
Member

Nothing special happens when a Pod is running and someone deletes PV or PVC today - the Pod is still running having the old PV mounted. Kubelet will clean up the mount properly (it should not need the PV for it) and A/D controller should detach the volume too (and it's a bug if it does not).

So everything is fine until someone stops the pod and tries to start a new one with the same PVC (or a PVC with the same name). With this PR, the Pod needs to go through the scheduler and its PVCs will get bound to something and the pod is started, so the user does not observe anything wrong.

Member Author

Yeah I realized kubelet doesn't need the PVC or PV object so it's ok.

With the current PV controller, it has similar behavior, except the binding happens earlier instead of waiting for the Pod to be rescheduled.

Member

So you unmount the Pod's storage and it doesn't die? What application can tolerate having a volume unmounted from under it without killing the underlying application?

Member Author

The volume does not get unmounted while the Pod is still running, even if the PVC/PV are deleted. Unmount only happens when the Pod is terminated.

However, if PVC is deleted, and reclaim policy is Delete, then PV controller will immediately try to delete the volume. Luckily, nothing bad happens because it depends on the storage provider to prevent volume deletion while the disk is still attached to a node.

But this is potentially a problem for local storage, which doesn't have that protection. I need to think about whether there's a reliable way to detect and prevent it (in a separate issue).

be triggered when the Pod has to go through scheduling. This could cause
problems with Pod volume cleanup because the PVC will no longer be bound.
* But deleting the PVC and PV (especially both) while a Pod is still using it
is also dangerous and should be prevented in the first place. There could be
Member

Yes, we should prevent it somehow, but it's for separate PR / proposal. See kubernetes/kubernetes#45143

Looking into the future, with the potential for NUMA-aware scheduling, you could
have a sub-scheduler on each node to handle the pod scheduling within a node. It
could make sense to have the volume binding as part of this sub-scheduler, to make
sure that the volume selected will have NUMA affinity with the rest of the
Member

I don't know much about NUMA, but it sounds odd to me that a volume (local or remote) would have different "distance" (speed) to different CPUs.

Member Author

Newer CPUs have PCI-e slots attached directly to the IO controller per socket. I found a random diagram showing the architecture here.

So this is definitely something to consider for PCI-e based local SSDs. I agree, remote storage doesn't matter here.

concrete proposals in this area yet.

### Binding multiple PVCs in one transaction
For the alpha phase, this is not required. Since the scheduler is serialized, a
Member

This PR assumes that it's only the scheduler who binds PVs/PVCs and they are mostly static. There is no recovery mechanism in case of failed write.

PV controller is very robust in this regard and expects that any PV or PVC can be changed or deleted at any time - users are unpredictable and they can change their mind or they run external controllers that add / remove annotations/labels to objects. It's not that uncommon that saving a PV or PVC fails.

There should be at least some design how to make the scheduler safe from failed saves. We might paint ourselves into a corner in beta otherwise.

FYI, multi-write transactions would help a lot here but they were rejected in kubernetes/kubernetes#27548

Member Author

For a single PV binding, if the PV API update fails, we'll fail scheduling and retry. So I think at least for a single PV, we are just as robust as the current PV controller.

The challenge is how to create a transaction for binding multiple PVCs together, and being able to rollback if only a subset of them bind successfully. It has the same rollback challenges as you mentioned below, and I think we could design a common rollback mechanism to handle both scenarios.

I feel like it's ok to not handle this in alpha, because even today, the PV controller doesn't have any idea of multiple PVCs being related. So this design is at least at feature parity, if not slightly better.
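
A small sketch of the single-PV failure path described above, using apimachinery's conflict check; updatePV is a hypothetical stand-in for the real client call:

```go
package sketch

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// bindOnePV sketches the single-PV case: prebind by updating the PV's ClaimRef,
// and if the write races with another writer, surface the conflict so this
// scheduling pass simply fails and the pod is retried.
func bindOnePV(updatePV func() error) error {
	if err := updatePV(); err != nil {
		if apierrors.IsConflict(err) {
			// The PV changed since we matched it; our cached copy is stale.
			return fmt.Errorf("PV changed during binding, will retry scheduling: %v", err)
		}
		return err
	}
	return nil
}
```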

The internal and external provisioning APIs need to be updated to take in a node
parameter.


Member

I miss a chapter about external schedulers. Are they still supported? Should they support binding too? How are they going to compete over unbound PVs in a consistent way?

Member Author

This is definitely going to make it harder to run your own scheduler, and I think this is also one of the reasons why the scheduling SIG added the scheduler extender interface, so that you can still use the default scheduler + your custom extensions. In general, the scheduling team has been adding so many advanced scheduling features to the default scheduler that it's already becoming harder to run your own.

If you don't use the default scheduler, and you use persistent volumes, then you'll have to pull this code and integrate it into your custom scheduler. Of course, we will have to give notice well in advance.

Member

I thought more about it: if two pods use the same unbound PVC and two different schedulers bind this PVC to different PVs (i.e. one saves pv1.spec.claimRef = pvc and the second one saves pv2.spec.claimRef = pvc), the PV controller is smart enough to recover from this conflict by un-binding one of these PVs, i.e. it clears, say, pv1.spec.claimRef = nil (and deletes it when necessary).

If both schedulers implement step "9. **NEW:** If PVC binding or provisioning is required, we do NOT AssumePod ... then **wait until the PVCs are bound successfully** or encounter failure" then the PVC gets eventually bound to one of these PVs. Not necessarily the one that a particular scheduler has chosen, but that should not be a problem, right?

So, we're fine as long as both schedulers implement the rules you specified. They may generate some noise by provisioning an unnecessary PV or do extra binding, but PV controller should clean up this mess.

Member Author

Thanks for the analysis. It should be fine, but the only potential problem is if the 2nd scheduler bound the PVC in a way that cannot satisfy all the predicates in the 1st scheduler. In general, we should recommend that only one scheduler handle PVC binding.

Can @kubernetes/sig-scheduling-pr-reviews comment on the practice of running two schedulers against the same Pod? Is that a supported configuration, or do we only support a model where:

  • You run your own scheduler by itself
  • You run default scheduler + your custom extensions

### Recovering from kubelet rejection of pod
Again, for alpha phase, manual intervention will be required. For beta, we’ll
have to consider some sort of rollback mechanism for more robustness. Perhaps
this rollback mechanism can also be used for binding multiple PVCs in one
Member

Be very careful about this rollback. It may be necessary in some cases, but it may destroy data in others.

Consider this flow:

  • Two pods want to use the same RWX PVC.
  • Scheduler binds the first pod to a node and binds its PVC to a LocalVolume there.
  • Scheduler binds the second pod to the same node (PVC is already bound there)
  • Kubelet rejects the first pod.
  • Kubelet starts the second pod.
  • Scheduler must not roll back anything!

Again, some design here would be nice. Should we mark the PVCs as "clean" during binding and "dirty" when they're admitted by kubelet? Kubelet must make sure that it marked them as dirty before any pod containers are started and scheduler must not roll back any dirty PVCs.

Member Author

Exactly. We need a way to know if it's safe to rollback a binding, and in my initial brainstorm, I was thinking of adding some additional state to the PVC/PV to mark as clean and dirty.

I think the rollback also needs to change the state machine. We shouldn't need to move to the Retain or Released state during a rollback since the volume wasn't used. Although maybe if the volume was dynamically provisioned, then we should still delete it during the rollback.

Also this potentially could bring back the same issues that we had with the state changes for recycling? I'm not familiar with what the exact issues were.
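
A sketch of how such clean/dirty state could be checked before rollback; the annotation and its values are hypothetical, since the proposal has not settled on a representation:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// annBindState is a hypothetical annotation: the scheduler would set it to
// "clean" when it initiates binding, and kubelet would flip it to "dirty"
// before starting any container that uses the claim.
const annBindState = "volume.alpha.kubernetes.io/bind-state"

// safeToRollback sketches the check the scheduler would make before reverting
// a binding: only claims still marked clean (never used by a pod) may be
// rolled back.
func safeToRollback(pvc *v1.PersistentVolumeClaim) bool {
	return pvc.Annotations[annBindState] == "clean"
}
```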

Member

@jsafrane jsafrane Sep 15, 2017

I think that recycler (and deleter) is safe here, PV controller should be robust enough already.

On the other hand, any rollback in recycler scheduler may conflict with binding in PV controller. Scheduler sets pv.spec.claimRef and leaves the rest to PV controller, right? PV controller will save pvc.spec.volumeName, update status and some annotations. In case the scheduler wants to revert the binding, clearing pv.spec.claimRef won't be enough, it needs to clear everything. And it will race with the PV controller trying to fix the binding, it tries to fix the status and binding if something overwrites it.

Member Author

Yah we will need to add some new state so that the PV controller will not try to fix the binding on a rollback.

Member

edit: rollback in recycler -> rollback in scheduler

1. For static PV binding:
   1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field.
   2. If the prebind fails, revert the cache updates.
Member

Is pod.spec.nodeName already set? Will it be reverted? It probably should.

Member

I can't tell if you mean the pod's node name is set in the API. If so, that's irreversible

Member Author

We won't let the pod reach node binding/assume until all PVCs are bound.

But I thought of a potential problem. If a user sets pod.spec.NodeName, does it bypass the scheduling? Then it would not trigger binding code and PVCs would never bind. Hmmmmm...

Member Author

Another hole is DaemonSets, which does its own scheduling.

1. For static PV binding:
   1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field.
   2. If the prebind fails, revert the cache updates.
Member

If I understand it correctly, at least in alpha it won't roll back any previously saved PVs, right? Does it then revert this binding from cache? It should probably stay there.

Member Author

We won't rollback any API updates. But it will rollback any cache-only updates made in AssumePVCs. I'm thinking the rollback can be done by reloading the latest API object into the cache.

6. Return true if all PVCs are matched.
7. If there are still unmatched PVCs, check if dynamic provisioning is possible.
For this alpha phase, the provisioner is not topology aware, so the predicate
will just return true if there is a provisioner specified (internal or external).
Contributor

What does this mean?

if there is a provisioner specified

Do you mean the PVC provisioner annotation is set?

Member Author

This means that:

  1. The StorageClass exists.
  2. The provisioner field is set.
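
In other words, the alpha predicate's provisioning check reduces to something like the following sketch (getClass is a hypothetical stand-in for a StorageClass lister lookup):

```go
package sketch

import storagev1 "k8s.io/api/storage/v1"

// provisioningPossible sketches the alpha-phase check: dynamic provisioning is
// assumed possible if the PVC's StorageClass exists and names a provisioner
// (internal or external); topology is not considered yet.
func provisioningPossible(className string, getClass func(string) (*storagev1.StorageClass, error)) bool {
	class, err := getClass(className)
	if err != nil || class == nil {
		return false
	}
	return class.Provisioner != ""
}
```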

scheduler assumes and binds the Pod to a Node.
11. Kubelet starts the Pod.

This new workflow will have the scheduler handle unbound PVCs by choosing PVs
Member

moving the binding responsibility from the binder controller into the scheduler means you no longer have a single writer, and that all external schedulers must also implement PVC binding, right?

Member Author

Yes, if you use a custom scheduler, and not the default scheduler, then you will have to pull in the PV matching and binding logic into your own scheduler. We can try to modularize the library calls as much as possible to make it easier to import the methods. Note that we don't pull in the entire binding state machine into the scheduler. The scheduler's responsibilities are to:

  1. Find an appropriate PV for unbound PVCs
  2. Initiate (but not complete) the binding by setting the PV.ClaimRef.

Member

The scheduler's responsibilities are to:

Find an appropriate PV for unbound PVCs
Initiate (but not complete) the binding by setting the PV.ClaimRef.

the split of binding responsibilities definitely makes the PV binder more difficult to reason about...

Member Author

How about this?

  • The scheduler handles choosing an appropriate PV for the PVC based on all the pod's requirements. It's like a user manually picking their volumes, except automated.
  • The PV controller handles the mechanics of binding the PVC and PV together, and the cleanup lifecycle.

Instead, a new bind function, BindPVCs, will be called asynchronously, passing
in the selected node. The bind function will prebind the PV to the PVC, or
trigger dynamic provisioning, and then wait until the PVCs are bound
successfully or encounter failure. Then, it always sends the Pod through the
Member

@bsalamat bsalamat Sep 16, 2017

IIUC, scheduler revisits scheduling of the pod to the node in a second round. What if the pod is no longer schedulable on the node, because, for example, a different pod is already scheduled there and the node no longer has enough resources. Doesn't this cause PV stranding and starvation of the pod?

Member Author

@msau42 msau42 Sep 16, 2017

Yes that is possible with this design (and also is the current behavior). One possible fix that we're thinking of for beta, is to implement a PVC binding "rollback" if the volumes have not been mounted by any pods yet.

Also in general for local storage once it's bound, since your node choices get constrained to 1, we are recommending that those pods have high priority to avoid starvation.

Member

Yeah, I think having a way of "unbinding" volumes is necessary, but even then the logic would be a bit complex. It will be something like this:

  1. Try to find a node for the pod.
  2. Bind the volumes.
  3. Try to schedule the pod again (probably by recording which node was picked at step 1 and try that node)
  4. If step 3 fails, unbind the volumes and go to 1 (after wiping the previously chosen node)

Step 3 is not an orthodox way of scheduling. It probably needs some changes to the guts of scheduler.

Member Author

We also need to remember though, that rolling back a volume binding is a lot harder and should be less common than rolling back pod bindings. Once a pod starts using the volume we cannot unbind it, even if scheduling fails.

For zonal storage, failing scheduling after binding is rarer because you have more nodes to choose from. For local storage, it's more likely since there's only one node, so your pods should be higher priority to force preemption.

* The more constraints you add to your pod, the less flexible it becomes
in terms of placement. Because of this, tightly constrained storage, such as
local storage, is only recommended for specific use cases, and the pods should
have high priority.
Contributor

/high priority./higher scheduling priority than unconstrained pods./


## Problem
Volumes can have topology constraints that restrict the set of nodes that the
volume is accessible from. For example, a GCE PD can only be accessed from a
Contributor

/the volume is accessible from./can mount them./

alternatively /can access them./

## Problem
Volumes can have topology constraints that restrict the set of nodes that the
volume is accessible from. For example, a GCE PD can only be accessed from a
single zone, and a local disk can only be accessible from a single node. In the
Contributor

/accessible/accessed/

1. How close the PVC’s requested capacity and PV’s capacity are.
2. Matching static PVs is preferred over dynamic provisioning.

TODO (beta): figure out weights and exact calculation
Contributor

Assuming that multiple nodes fit the constraints and are all equal in weight, is the ranked order of these nodes random or deterministic, based on some identifier?

Member Author

Today, the scheduler does a round robin selection of nodes with equal priority by incrementing an index.
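
Roughly, that tie-break works like the sketch below (lastIndex stands in for the scheduler's internal round-robin state):

```go
package sketch

// pickNode sketches the round-robin tie-break: among the nodes that share the
// top priority score, rotate through them by incrementing an index that is
// kept across scheduling cycles. Assumes topNodes is non-empty.
func pickNode(topNodes []string, lastIndex *int) string {
	node := topNodes[*lastIndex%len(topNodes)]
	*lastIndex++
	return node
}
```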

1. For each Node, get the cached PV matches for the Pod’s PVCs.
2. Compute a priority score for the Node using the following factors:
   1. How close the PVC’s requested capacity and PV’s capacity are.
   2. Matching static PVs is preferred over dynamic provisioning.
Contributor

Why are static PVs preferred over dynamic?

Member Author

It's assuming that you had specifically statically created the PVs for your application.
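
A hedged sketch of how those two factors might combine for a single PVC/PV match; the weights here are placeholders, since the exact calculation is still a TODO in the proposal:

```go
package sketch

import "k8s.io/apimachinery/pkg/api/resource"

// volumeScore sketches the two factors for a single PVC/PV match. The weights
// are placeholders, not values from the design. Assumes pvCapacity > 0.
func volumeScore(requested, pvCapacity resource.Quantity, staticMatch bool) int64 {
	// Factor 1: prefer PVs whose capacity is close to the request
	// (less wasted space), scored as a 0-10 fraction.
	score := requested.Value() * 10 / pvCapacity.Value()
	// Factor 2: prefer an existing (static) PV over dynamic provisioning.
	if staticMatch {
		score += 10
	}
	return score
}
```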


TODO: flow chart

The proposed new workflow for volume binding is:
Member

Does this assume that workload spreading across failure domains will be achieved by anti-affinity predicates?

Member Author

Yes

* The scheduler itself only handles one pod at a time. It’s possible the
two pods may not run at the same time either, so there’s no guarantee that you
will know both pod’s requirements at once.
* For two+ pods simultaneously sharing a PVC, this scenario may require an
Member

This applies to two Pods sharing a PVC bound to a local volume, correct? If the volume is remote and zonal, then the only issue would be if you wanted to mount read-many volumes across zones. I would think specifying that combination of affinity predicate and storage would be a user error. For two Pods sharing the same local volume, couldn't a user achieve the same results by combining the containers of both into a single Pod? If all the containers must reference the same local storage, then they must be scheduled on the same Node anyway.

Member Author

Yes, I'm mainly referring to local volumes, and I agree, the simplest scenario is to combine all your containers into one Pod so that they will be scheduled together. But if, for whatever reason, a user wants two different Pods (maybe they're not running simultaneously), then this is trying to say that we can't guarantee scheduling of the second Pod if its scheduling constraints are not compatible with the first Pod.

We definitely want to support the use case of specifying pod anti-affinity/affinity along with local storage, and having the volume binding take that into account (spread my pods and local storage across nodes/racks/zones).


msau42 commented Sep 29, 2017

/assign @bsalamat

to be bound. The bound assumption needs to be changed in order to work with
this new workflow.

TODO: how to handle race condition of PVCs becoming bound in the middle of
Member

As you mentioned, running PVC binding in parallel with the scheduler will introduce a race condition and ordering of the predicates will not address the problem. IIUC, synchronization will be needed and synchronization could significantly impact scheduler's performance.

Member Author

One possible way to deal with this is to mark the Pod at the very beginning of scheduling if any PVCs are unbound. Then at the end, we can check for this to determine if we need to send the Pod through a second pass.

schedulable, and it does this by re-evaluating predicates without the lower
priority Pod.

If we had put the MatchUnboundPVCs predicate at the end, then pod preemption
Member

This can possibly be addressed by registering extra functions that preemption logic should call to filter nodes further. This might be an easier change compared to the current proposal.

Member Author

Yes I think that would eliminate the need for some of the caching from the current design. But it will still require a second pass afterwards.

Member

That's right. The second pass will still be needed.

clearly communicate these changes to the community before changing the behavior
by default in GA.

### Integrating binding with scheduling
Member

From a software engineering point of view, mixing the responsibilities of modules and removing the isolation of responsibilities is not a great practice. With this design, the scheduler process would perform volume binding, which is not one of the usual responsibilities of a cluster scheduler. Other than removing the modularity, this could introduce runtime problems as well. For example, a null dereference in the volume binding logic would cause a scheduler crash. High memory or CPU consumption in the volume binding logic could cause the scheduler to run out of memory or starve.

Member

Let me clarify that I am not saying that we shouldn't move forward with this design, but when we consider pros and cons of various approaches, this should be considered as a negative point of this design.

Member Author

I think we do need a tighter coupling (somehow) between scheduler and volume binding in order to address the problems I outlined in the beginning. At a high level:

  • Scheduler needs to be aware of a volume's topology wrt the node when it is scheduling a pod
  • The volume binding decision needs to respect the pod's other scheduling constraints

This design proposes a very tight coupling by putting the volume binding directly into the scheduler. You are correct in that it's expanding the scheduler's responsibilities to more than just Pod->Node binding. The major risk I see with this current approach is that it's trying to add in the volume binding using the existing scheduling mechanics, which may be designed with the assumption that Pod->Node binding is the only thing the scheduler does. So this approach breaks when you consider the scenario where volumes are unbound, but the user has already manually scheduled the Pod to a Node. Fixing that I think is going to require some fundamental changes in the scheduler, which could greatly increase the scope of this design. But it may be something still worth considering, especially if more objects also need to integrate their binding with scheduling. I know @vishh and @yguo0905 are also looking into extending the scheduler for binding other resources (although may not be as complex as volume binding), and they may also run into the same issues as me.

I did also consider an alternate approach with loose coupling, where it tries to keep scheduler and PV controller separate. It's similar to this alternative discussion. The basic idea is the PV controller waits for the Pod.Spec.NodeName to be set before it does the binding. However, even though the binding process is decoupled from the scheduler, the binding decision is much more prone to split brain problems. We still need a new scheduler predicate so that the scheduler can pick an appropriate node, but because the binding happens completely asynchronously in the PV controller, it's possible that the actual binding will pick a different volume, or fail completely. Then you need a way to tell the scheduler to revert its PV cache update, and also a way for the PV controller to tell kubelet to reject the Pod so that it can be sent back through scheduling. So while this approach keeps the binding process and code separate, I found that it:

  • is more prone to split brain failures due to scheduler and PV controller trying to make the same decisions, but operating on different and out-of-sync caches.
  • required very tight, complex, and delicate coordination between scheduler, PV controller, and kubelet processes to handle recovering from these split brain failures.

I am open to revisiting this alternative if you think it looks more promising than the current approach, but from my initial investigation, it seemed much more complex and fragile.

Member

With this design, the scheduler process would perform volume binding which is not one of the usual responsibilities of a cluster scheduler.

I am not sure what a "cluster scheduler" is; however, with this PR we're moving from a Pod scheduler to a Pod and PV scheduler. And I am sure more resources will come in the future.

E.g. setting pod.spec.nodeName should not bypass the scheduler as it should schedule the storage too.

Member Author

I think we could benefit from extending the scheduler to support multiple object scheduling/binding in a modular manner. Each module would need to:

  • Given a Pod object, evaluate if it needs to run through the scheduler
  • Register assume/bind functions
  • Register rollback function

Of course, there is a ton of detail that needs to be worked out
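
One way to picture such a module is a hypothetical interface like the sketch below; this is not an existing scheduler API, just an illustration of the registration points listed above:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// ResourceBinder is a hypothetical extension point, not an existing scheduler
// API. Each resource type that needs scheduler-driven binding (volumes,
// devices, ...) would implement it and register with the scheduler.
type ResourceBinder interface {
	// Relevant reports whether this pod needs the module during scheduling.
	Relevant(pod *v1.Pod) bool
	// Assume records the binding decision in the scheduler's in-memory cache.
	Assume(pod *v1.Pod, node *v1.Node) error
	// Bind performs the (possibly asynchronous) API updates for the decision.
	Bind(pod *v1.Pod, node *v1.Node) error
	// Rollback undoes Assume/Bind if the pod ultimately fails to schedule.
	Rollback(pod *v1.Pod) error
}
```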


msau42 commented Oct 3, 2017

Updated to add downsides of proposed approach, more details about the alternative approach, and an idea on how to handle PV binding rollback.


bsalamat commented Oct 4, 2017

As discussed offline and in order to unblock SIG Storage, I LGTM this PR. We will hopefully have a better mechanism in the scheduler that allows an easier and cleaner way of performing post-scheduling tasks, such as volume or TPU binding.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2017
@k8s-github-robot

Automatic merge from submit-queue.

@k8s-github-robot k8s-github-robot merged commit f20eb8c into kubernetes:master Oct 4, 2017

msau42 commented Oct 5, 2017

Ah sorry, there are still some TODO items that I need to sort out, especially regarding how to handle scheduler bypass scenarios. Please continue commenting/discussing.


bsalamat commented Oct 5, 2017

Yeah, I was not expecting it to be auto-merged upon my lgtm. Anyway, you may want to send another PR to address the remaining items.
