Storage Capacity Constraints for Pod Scheduling #1353
This KEP explains how CSI drivers can expose how much capacity a storage system has available via the API server. This information can then be used by the Kubernetes Pod scheduler to make more intelligent decisions when placing Pods.
Enhancement issue: #1472
@cdickmann just created #1347, which is a similar proposal but with a different API and different goals. We learned about each others work last week and agreed that the best approach for reconciling the two proposals would be to create KEP PRs and then use the KEP process to discuss them.
The goal has to be to define one API extension which works for both purposes (Kubernetes scheduler here, more intelligent operators in #1347).
If I compare this KEP to ours (#1347, hopefully my CLA issue is resolved today, waiting for tech support) then I see some key differences:
I feel it is useful to separate two topics:
True. The downside is that it is also more complicated and can't be implemented without extensions to the CSI standard. The proposal here was meant to work with just the existing mechanisms (CSI topology and the existing GetCapacity call).
I still need to think about your proposal and how the Kubernetes scheduler itself could make use of it with a CSI driver that hasn't been modified.
After going through your proposal once more I am not so sure about that anymore. It seems to be focused exclusively on a flow where a PVC is somehow tied to a storage pool and then gets provisioned from that pool.
The generic late-binding case (Pod uses PVC which refers to a storage class and only gets provisioned after the Pod has been scheduled) doesn't seem to be covered.
As pointed out in #1347, there may be other attributes than just capacity that need to be tracked (like failures). Also, there may be different storage pools within a single node. Avoiding "capacity" in the name of the data structures and treating it as just one field in the leafs of the data structure allows future extensions for those use cases without a major API change. For the same reason all fields in the leaf are now optional, with reasonable fallbacks if not set.
As explained in my previous comment, #1353 (comment), I think the answer is no. Let me turn the question around, can you imagine basing your extension on the revised API in this KEP (see below)?
Obviously the original API was too focused on just capacity tracking. Thanks to your KEP additional use cases became clearer and I now tried to come up with a more general API that can be extended to also cover those - see e0e7c43. At this point, it's a 1:1 renaming of what I had before plus some additional flexibility regarding what information must be provided. In this PR I'd prefer to keep it at that level to ensure that it remains small enough to make progress.
If you think that this goes in the right direction, then I could try to come up with a revision of your KEP that is based on this one.
Only capacity depends on the storage class, accessibility and potentially other future "per pool" attributes shouldn't. Therefore it makes sense to have CSIStoragePool as child of CSIDriver.Status and put the list of per-class information into CSIStoragePool. This is better illustrated with some actual examples.
This was called out as redundant and unnecessary (https://github.com/kubernetes/enhancements/pull/1353/files#r358034638). While a potentially useful optimization, it's not really necessary.
…lumeSize: This makes the API less ambiguous. Based on review feedback (https://github.com/kubernetes/enhancements/pull/1353/files#r369732310).
Michelle pointed out that late binding works a bit differently (https://github.com/kubernetes/enhancements/pull/1353/files#r369707568). The non-goal about prioritization is meant to explain that while this would be possible, it's not planned to be implemented (yet).
This simplifies the implementation. Proposed in https://github.com/kubernetes/enhancements/pull/1353/files#r369745671.
How to determine topology and how to determine parameters are orthogonal, and thus separate flags for "local" vs. "central" and for parameters ("storageclasses", "ephemeral", "fallback") make more sense.
As pointed out during review (https://github.com/kubernetes/enhancements/pull/1353/files#r369834608), deleting CSIStoragePools when switching the leader in the central provisioning case would cause additional downtime. However, for the local case the pod as owner makes more sense than the alternatives (daemon set and node).
No attempt is made to model how different volumes affect each other. It's worthwhile to call this out explicitly (https://github.com/kubernetes/enhancements/pull/1353/files#r370301994).
This is an optimization that is not strictly needed (https://github.com/kubernetes/enhancements/pull/1353/files#r370359124).
There are other usages of the API and potential extensions that need to be explored further. But to collaborate on that, it's still better to get the API into Kubernetes as an alpha feature and then continue the discussion.
It turned out that the intended solution can't work. The API still makes sense, just generating the information isn't possible.
The content itself was already in a logical order, but not necessarily in a suitable section. Now it's organized as follows:
- Motivation: goals and user stories
- Proposal: high level summary and flow between components (maybe a diagram would be nice)
- Design details: API, details for each of the steps in the flow, test plan
- History
- Drawbacks
- Alternatives
saad-ali left a comment
One new CSI call could map storage class parameters to a unique ID for the corresponding storage pool. Then a sidecar can determine which storage pools it needs to track for all currently defined storage classes in Kubernetes. Another new CSI call can then query information about that storage pool (in particular topology) and combine that with the storage class parameters to call GetCapacity.
Yes, that's the risk. But making a decision based on stale data is still better than making a decision with no data, which is the current status.
Failure rates will only go up when completely filling up the storage available to one node. When keeping enough buffer (which could be done if information about remaining capacity was available), it might be possible to keep the failure rate acceptable.
"considerable amount of effort" - is that effort for the implementation or runtime effort? Implementation isn't that hard. I'm willing to offer a bet that I can finish it in a week. But yes, runtime effort may be an issue. We won't know for sure unless we build and benchmark it.
This had been proposed before (retrieve total capacity, model pending operations, etc.) and at least I agree that this is not feasible.
Yes, that may be a way. But even for reservations, Kubernetes has to make some upfront decisions about where to place reservations. For node-local storage that means picking a node, and how will Kubernetes do that efficiently if it has no idea which node might have storage? Try them all, one-by-one or in parallel? One will be slow in particular in the scenario that was brought up here (high load and storage exhausted), the other wastes storage. If the reservation is left entirely to the storage system, then it might pick a node which is unsuitable.
A compromise with central provisioning of node-local storage is where Kubernetes provides a list of suitable nodes to the storage system and the storage system then picks something. But there isn't necessarily one storage system, so that approach fails when there are two independent storage systems involved (= two CSI drivers) for two different volumes: the result might be two volumes where there is no node that can reach both.
It would also rule out a design like the one proposed by @msau42 and @jsanda in kubernetes-csi/external-provisioner#367 where a node-local CSI driver gets deployed entirely without a central component. It's also very complicated to implement. In PMEM-CSI we do it by letting a central component connect to all nodes (not ideal in many ways). TopoLVM does it by communicating capacity information through the API server (fairly similar to this KEP...).
Let me propose a different path towards such a reservation system:
This obviously will require further thought, but the first step towards it would be this KEP. In case it hasn't become obvious yet: I would prefer to make progress in Kubernetes by adding a partial solution as alpha and then collaborate further on it once an initial implementation is there. Keeping it out of Kubernetes just ensures that it will get ignored. I have little motivation to pursue this further if that is the recommendation because it's already difficult enough to find reviewers for things that are on track for Kubernetes (i.e. have a merged KEP). I'll probably also have to focus on finding a vendor-specific solution for this problem in PMEM-CSI first before I can come back to Kubernetes. Just my two cents.
At least using a scheduler extender here is difficult, since setting up a scheduler extender requires changing the kube-scheduler configuration, which is not possible in every cluster.
For example, please see TopoLVM's scheduler extender, topolvm-scheduler.
As we can see, it's very complicated to set up. In addition, TopoLVM has currently succeeded with this approach only in clusters where the scheduler configuration can be modified.
If the scheduler-related logic of this feature relies on a scheduler extender, the same problem will apply.
And the alternative, using the scheduler framework to build a custom scheduler, will be even worse. Then the cluster admin has to replace the scheduler when installing a CSI driver. This may be acceptable once, but what if two different CSI drivers both need "their own" scheduler to work reliably?
But having said that, I think Tim's proposal was about the Kubernetes scheduler itself. We just don't know exactly yet what that solution could be. But as I said above, I don't think that this uncertainty should stop us from accepting this KEP and the alpha feature, because I think it is technically on the right track and procedurally an alpha feature is still about experimenting.
One more thought about this:
I've been told that this is exactly what the scheduler does also for extended resources (like accelerator cards): the "available" count only gets updated later when pods actually start running. One difference might be that it counts in-flight pods and their resources; I've not been able to get a conclusive answer on that.
But for CPU and memory the scheduling is also using "stale" data, isn't it? So overall the approach suggested in this KEP seems to be aligned with how other resources are managed.
The difference is that for storage we currently lack graceful recovery mechanisms. We need to add those, but they won't solve the problem well when the scheduler keeps on making uninformed decisions and thus triggers the recovery mechanisms (once they exist) too often.
One of those recovery mechanisms is to re-trigger scheduling for volumes with late binding. In #1353 (comment), @msau42 said that the scheduler should retry and I updated the problem statement in this KEP accordingly. But after having looked at this a bit more, my conclusion is that this currently doesn't work for CSI drivers (to be confirmed, see https://kubernetes.slack.com/archives/C8EJ01Z46/p1580032105038000).
Right, it isn't really feasible or desirable for K8s to own block placement. Storage systems (for many good reasons) will want to own that question. In fact, many systems will implement thin provisioning of some sort, so the relationship between a placement and its associated space consumption, as well as space consumption outside of placement (due to writing to a thin disk), is complex and unpredictable for a generic solution. But that doesn't mean what Patrick proposes is a bad idea. On the contrary, it adds a first set of considerations of storage constraints to the K8s pod scheduler, which is a step forward and in a good direction.
Space reservations without a volume aren't a very commonly supported storage system operation. So this would mean not just a CSI change, but something that many CSI drivers can't implement. Further, it doesn't solve the fundamental issue, as Patrick says, because it just moves the point at which the placement fails and can't prevent it. Finally, as I said, thin provisioning is common practice, and it basically means the relationship of placement to space consumption isn't as trivial as the reservation/lease idea suggests.
I think this is exactly what this KEP does. It goes for a very minimal integration of the Pod scheduler and storage placement. Enough to be useful, small enough to be cheap to implement, and small enough to not cause harm and be extensible in the future.
Broadly speaking, I see two types of comments on this KEP: one set of questions asks whether this KEP solves the problem space fully and whether a partial solution is useful. The other set takes the opposite approach and asks if this KEP could be smaller and solve fewer problems. I think this KEP is now pretty small, and it moves us in a good direction. It doesn't solve every problem; in fact it doesn't solve the problems our KEP hopes to solve. But I can see how our KEP can be built on top, and I think that is true for most of what people ask in the comments. Isn't the whole idea in the KEP process to start small and gradually build up?
Yes, there will always be a small time window when the capacity information known by the scheduler is out of date. However this does not mean the scheduler should not have any knowledge of the capacity at all, even if it is delayed information.
If the storage pool is already out of capacity but the scheduler still schedules volume placement based on slightly out-dated information, volume creation will fail. If the storage pool is already out of capacity but the scheduler still schedules volume placement because it does not have any information on the capacity, volume creation will fail. So even though this KEP does not solve this problem, it also has not introduced this problem.
Thinking of the positive side. If the storage pool is already out of capacity and the scheduler learns that it is out of capacity based on the implementation of this KEP, it can schedule volume placement on another storage pool that has sufficient capacity.
In general, this KEP will reduce the possibility of volume placement failure due to the lack of knowledge of storage pool capacity information by the scheduler (either the in-tree scheduler or a scheduler extension). So I believe it is definitely a step into the right direction. With an alpha implementation, we can evaluate and improve the solution and make it more robust.
Representing storage capacity in a generic manner is a really hard problem. As noted, what is actually available to end users is influenced by many different factors such as: thick/thin provisioning, partitioning, replication factor, striping/mirroring, and more. And in Kubernetes environments, it's even more complicated by the fact that storage may be outside of the cluster and shared with other consumers out of Kubernetes' control.
This is different from how the Kubernetes scheduler handles cpu/memory allocations today, where the resources are assumed to be completely dedicated to the cluster and contiguous. This allows the scheduler to cache current cpu/memory resource allocations based on the Pods scheduled.
Storage capacity is a lot more complicated, and we will not be able to take the same approach. Any generic representation is not going to be perfect, and there will inherently be more failure cases. That being said, I think we should strive towards a solution that works well in most scenarios, and also have a strong idea of how we can recover from those failures. This could potentially involve some big changes in our core components, so it needs to be carefully thought through and reviewed. I also think that the current CSI spec is not sufficient to solve our needs, so we shouldn't limit ourselves to what's currently in the CSI spec.
It would also be good to get feedback from @kubernetes/sig-scheduling-proposals on how similar situations such as GPUs were handled, and how the error cases were mitigated and resolved. I would also like sig-scheduling thoughts on the current state of the new scheduling framework, and if that could potentially help our use cases. The scheduling framework is supposed to solve many limitations of the scheduler extender.