
Storage Capacity Constraints for Pod Scheduling #1353

Open: wants to merge 31 commits into master from pohly:storage-capacity-constraints-for-pod-scheduling

Conversation

pohly (Contributor) commented Nov 4, 2019

This KEP explains how CSI drivers can expose how much capacity a storage system has available via the API server. This information can then be used by the Kubernetes Pod scheduler to make more intelligent decisions when placing:

  • pods with ephemeral inline volumes
  • pods with persistent volumes that haven't been created yet (WaitForFirstConsumer binding mode)

Enhancement issue: #1472
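For illustration, a minimal Go sketch of a storage class that opts into late binding, using the existing k8s.io/api/storage/v1 types; the class name and provisioner are placeholders, not from the KEP:

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// With WaitForFirstConsumer, volume provisioning is delayed until a pod
// using the PVC is scheduled - the point at which the capacity
// information proposed here would inform the node choice.
var waitForFirstConsumer = storagev1.VolumeBindingWaitForFirstConsumer

var lateBindingClass = storagev1.StorageClass{
	ObjectMeta:        metav1.ObjectMeta{Name: "example-late-binding"}, // placeholder
	Provisioner:       "example.csi.vendor.io",                         // placeholder driver
	VolumeBindingMode: &waitForFirstConsumer,
}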

k8s-ci-robot (Contributor) commented Nov 4, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
To complete the pull request process, please assign saad-ali
You can assign the PR to them by writing /assign @saad-ali in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pohly force-pushed the pohly:storage-capacity-constraints-for-pod-scheduling branch from fb6c15e to 89ec735 Nov 4, 2019
pohly (Contributor, Author) commented Nov 4, 2019

@cdickmann just created #1347, which is a similar proposal but with a different API and different goals. We learned about each other's work last week and agreed that the best approach for reconciling the two proposals would be to create KEP PRs and then use the KEP process to discuss them.

The goal has to be to define one API extension which works for both purposes (Kubernetes scheduler here, more intelligent operators in #1347).

pohly force-pushed the pohly:storage-capacity-constraints-for-pod-scheduling branch from 89ec735 to 2560793 Nov 4, 2019
cdickmann commented Nov 4, 2019

If I compare this KEP to ours (#1347, hopefully my CLA issue is resolved today, waiting for tech support) then I see some key differences:

  • We assume that storage can have more diverse topology, e.g. tracking individual disks or disk groups in a host, or tracking multiple independent storage arrays attached to the same node. In all these cases, the capacity is really on a StoragePool and a node has (or doesn't have) access to it. IMHO that is more powerful and a superset of what is proposed here.
  • Beyond capacity, IMHO we may need more fields in the future, e.g. health and other properties of a StoragePool.
  • This KEP proposes K8s scheduler changes to take advantage of the additional available information.

I feel it is useful to separate two topics:

  • Surfacing information about the storage. @pohly Do you see anything that you would like K8s to surface that isn't covered by our KEP?
  • Having K8s leverage them for scheduling. We decided to scope this out of our KEP. @pohly Could you imagine building your scheduler logic on top of the StoragePool exposed by our KEP?
pohly (Contributor, Author) commented Nov 4, 2019

IMHO that is more powerful and a superset of what is proposed here.

True. The downside is that it is also more complicated and can't be implemented without extensions to the CSI standard. The proposal here was meant to work with just the existing mechanisms (CSI topology, v1.NodeSelector to represent that in Kubernetes).
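As a hedged illustration of that last point (the label key and value are invented for this sketch): a CSI topology segment can already be expressed as a v1.NodeSelector, the same construct used for PersistentVolume node affinity:

import v1 "k8s.io/api/core/v1"

// A selector matching all nodes in the hypothetical topology segment
// "example.com/rack" = "rack1"; this is the existing mechanism the
// proposal builds on, without extending the CSI standard.
var accessibleNodes = v1.NodeSelector{
	NodeSelectorTerms: []v1.NodeSelectorTerm{{
		MatchExpressions: []v1.NodeSelectorRequirement{{
			Key:      "example.com/rack",
			Operator: v1.NodeSelectorOpIn,
			Values:   []string{"rack1"},
		}},
	}},
}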

Beyond capacity, IMHO we may need more fields in the future, e.g. health and other properties of a StoragePool.

Agreed.

Do you see anything that you would like K8s to surface that isn't covered by our KEP?
Could you imagine building your scheduler logic on top of the StoragePool exposed by our KEP?

I still need to think about your proposal and how the Kubernetes scheduler itself could make use of it with a CSI driver that hasn't been modified.

pohly (Contributor, Author) commented Nov 4, 2019

IMHO that is more powerful and a superset of what is proposed here.

True.

After going through your proposal once more, I am not so sure about that anymore. It seems to be focused exclusively on a flow where a PVC is somehow tied to a storage pool and then gets provisioned with CreateVolume first, before any pod uses that volume.

The generic late-binding case (Pod uses PVC which refers to a storage class and WaitForFirstConsumer is used) cannot be handled because that generic PVC doesn't reference a storage pool, so Kubernetes doesn't have the information about storage capacity that it needs for choosing a node for the pod. Ephemeral inline volumes also cannot be handled because those don't involve CreateVolume.

pohly added 2 commits Nov 5, 2019
As pointed out in
#1347, there may be
other attributes than just capacity that need to be tracked (like
failures). Also, there may be different storage pools within a single
node.

Avoiding "capacity" in the name of the data structures and treating it
as just one field in the leaves of the data structure allows future
extensions for those use cases without a major API change. For the
same reason all fields in the leaf are now optional, with reasonable
fallbacks if not set.
pohly (Contributor, Author) commented Nov 5, 2019

@pohly Could you imagine building your scheduler logic on top of the StoragePool exposed by our KEP?

As explained in my previous comment, #1353 (comment), I think the answer is no. Let me turn the question around, can you imagine basing your extension on the revised API in this KEP (see below)? 😄

Obviously the original API was too focused on just capacity tracking. Thanks to your KEP additional use cases became clearer and I now tried to come up with a more general API that can be extended to also cover those - see e0e7c43. At this point, it's a 1:1 renaming of what I had before plus some additional flexibility regarding what information must be provided. In this PR I'd prefer to keep it at that level to ensure that it remains small enough to make progress.

If you think that this goes in the right direction, then I could try to come up with a revision of your KEP that is based on this one.

He was the main author behind
https://docs.google.com/document/d/1WtX2lRJjZ03RBdzQIZY3IOvmoYiF5JxDX35-SsCIAfg,
the predecessor of this proposal.
This is a problem for deployments that do support it, but haven’t been
able to provide the necessary information yet. TODO: extend
CSIDriverInfo with a field that tells the scheduler to retry later
when information is missing? Avoid the situation by (Somehow? How?!)

pohly (Author, Contributor) commented Nov 8, 2019

I might have found a solution for this:

  • the driver deployment creates the top-level storage info object
  • external-provisioner just updates it while the object exists
  • removing the driver also removes the object

This way the scheduler knows that if there is an object, it has to schedule based on the information contained in it. The information might still be stale or incomplete, though.

pohly (Author, Contributor) commented Nov 8, 2019

Or even simpler: add a CSIDriver.Status field with all the dynamic information.

pohly (Author, Contributor) commented Nov 12, 2019

I switched over to that. Conceptually it makes sense to me. But other APIs would also work.

`CSIDriver.Status` feels like a more natural place to put the
information. There's no need to have a separate cleanup of stale
information, the driver can handle all of that.
pohly force-pushed the pohly:storage-capacity-constraints-for-pod-scheduling branch from 16a69b7 to fc24335 Nov 12, 2019
Only capacity depends on the storage class, accessibility and
potentially other future "per pool" attributes shouldn't. Therefore it
makes sense to have CSIStoragePool as child of CSIDriver.Status and
put the list of per-class information into CSIStoragePool.

This is better illustrated with some actual examples.
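To make the commit message above concrete, here is a rough sketch of the API shape at this point in the review. The field names follow the snippets quoted later in this thread; everything else, including the pool Name field, is an assumption and subject to change:

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Sketch only - not the final API.
type CSIDriverStatus struct {
	// Pools describes the storage that the driver has access to.
	Pools []CSIStoragePool
}

type CSIStoragePool struct {
	Name string // assumed identifier for the pool

	// NodeTopology describes a storage pool that is available only
	// for nodes matching certain criteria.
	NodeTopology *v1.NodeSelector

	// Classes holds per-storage-class information; only capacity
	// depends on the storage class, accessibility does not.
	Classes []CSIStorageByClass
}

type CSIStorageByClass struct {
	StorageClassName string

	// Capacity is the remaining space for volumes of this class.
	Capacity *resource.Quantity

	// MaximumVolumeSize is the largest single volume that can
	// currently be created, if the driver reports it.
	MaximumVolumeSize *resource.Quantity
}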
k8s-ci-robot added size/XL and removed size/L labels Nov 13, 2019
pohly added 6 commits Jan 23, 2020
This was called out as redundant and unnecessary
(https://github.com/kubernetes/enhancements/pull/1353/files#r358034638). While
a potentially useful optimization, it's not really necessary.
…lumeSize

This makes the API less ambiguous. Based on review feedback
(https://github.com/kubernetes/enhancements/pull/1353/files#r369732310).
Michelle pointed out that late binding works a bit differently
(https://github.com/kubernetes/enhancements/pull/1353/files#r369707568).

The non-goal about prioritization is meant to explain that while this
would be possible, it's not planned to be implemented (yet).
This simplifies the implementation. Proposed in
https://github.com/kubernetes/enhancements/pull/1353/files#r369745671.
How to determine topology and parameters is orthogonal, and thus
separate flags for "local" vs. "central" and for parameters
("storageclasses", "ephemeral", "fallback") make more sense.
As pointed out during review
(https://github.com/kubernetes/enhancements/pull/1353/files#r369834608),
deleting CSIStoragePools when switching the leader in the central
provisioning case would cause additional downtime.

However, for the local case the pod as owner makes more sense than the
alternatives (daemon set and node).
k8s-ci-robot added size/XXL and removed size/XL labels Jan 23, 2020
pohly added 4 commits Jan 23, 2020
This is the second option for custom policies.
At the time that the advantage is listed, the alternative hasn't been
introduced yet, so we need to spell it out explicitly.
[Topology](https://github.com/container-storage-interface/spec/blob/4731db0e0bc53238b93850f43ab05d9355df0fd9/lib/go/csi/csi.pb.go#L1662-L1691)
and driver.

### Size of ephemeral inline volumes

msau42 (Member) commented Jan 23, 2020

I'm not sure if the feature will count capacity used by ephemeral CSI drivers. I believe it only tracked emptydir volumes. cc @jingxu97. If not, then I think rootfs capacity used by those CSI drivers will not be covered by either feature.

from somewhere else. In that case, Kubernetes currently has no
information about the size of an ephemeral inline volume and (as for
persistent volumes) how much storage is still available.
A new `CSIVolumeSource.fsSize` field needs to be added
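A hedged sketch of what that addition might look like on the existing v1.CSIVolumeSource type; the field name comes from the KEP text above, while its type is an assumption:

import "k8s.io/apimachinery/pkg/api/resource"

// Sketch only: existing fields abbreviated.
type CSIVolumeSource struct {
	Driver           string
	FSType           *string
	VolumeAttributes map[string]string
	// ... other existing fields omitted ...

	// FsSize is the proposed addition: the nominal size of the
	// ephemeral inline volume, so that the scheduler can compare it
	// against available capacity (assumed to be a resource.Quantity,
	// mirroring PVC requests).
	FsSize *resource.Quantity
}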

msau42 (Member) commented Jan 23, 2020

Also add a brief one-line summary of why it's needed here.

pohly (Author, Contributor) commented Jan 24, 2020

Done, see 30925d4.


### Using capacity information

The Kubernetes scheduler already has a component, the [volume

msau42 (Member) commented Jan 23, 2020

Can we move this part under the scheduling section to keep it all together?

pohly (Author, Contributor) commented Jan 24, 2020

The "Pod scheduling" section was meant as the high-level introduction of the proposal while this is more about the implementation. Let's keep those separate and move this section here into the "Design details" part, as you suggested elsewhere.

(driver, accessible by node) and sufficient capacity for the
volume attributes (storage class vs. ephemeral)

The specified volume size is compared against `MaximumVolumeSize` if

msau42 (Member) commented Jan 23, 2020

Can you add more details here about how the following scenarios will be handled?

  • A pod requesting multiple PVCs from the same storage pool
  • Multiple pods being scheduled in parallel with PVCs from the same storage pool

pohly (Author, Contributor) commented Jan 24, 2020

You know how to put the finger where it hurts 😅

There's no special handling of these scenarios, which means that they remain problematic. The proposal is to check volumes separately, without trying to anticipate what the effect of creating one volume may have for creating other volumes, even if they are used by the same pod. There's simply not enough information to do this correctly.

For example, "maximum volume size 10GiB" could mean "one volume of that size" or "hundreds of them". "Available capacity 10GiB" also isn't a guarantee that two volumes of 5GiB each can be created.

What may help here is to prioritize nodes based on how much capacity they have left and/or bite the bullet and figure out how to deal with the inevitable cases where pod scheduling did go wrong (roll back, try again).

Let me put that into the KEP itself, too: d103caa
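For illustration, a minimal sketch of the per-volume check as described in this reply, using the field names from the API sketch earlier in this thread; the helper itself is hypothetical:

import "k8s.io/apimachinery/pkg/api/resource"

// nodeHasCapacity checks one volume against one per-class entry.
// Deliberately, no attempt is made to model how multiple volumes
// affect each other - there is not enough information for that.
func nodeHasCapacity(requested resource.Quantity, c CSIStorageByClass) bool {
	if c.MaximumVolumeSize != nil {
		// Compare against the reported upper bound for a single volume.
		return requested.Cmp(*c.MaximumVolumeSize) <= 0
	}
	if c.Capacity != nil {
		// Fallback: remaining capacity. As noted above, this is no
		// guarantee that several such volumes can be created.
		return requested.Cmp(*c.Capacity) <= 0
	}
	// No information reported: fall back to scheduling without
	// capacity constraints, as today.
	return true
}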


## Proposal

### User Stories

msau42 (Member) commented Jan 23, 2020

I think actually most of the sections under "proposal" would be better under "design details".

What do you think of a structure, more like:

  • Motivation: goals and user stories
  • Proposal: high level summary and flow between components (maybe a diagram would be nice)
  • Design details: API, and details for each of the steps in the flow, test plan
// +listType=map
// +listMapKey=storageClassName
// +optional
Classes []CSIStorageByClass `patchStrategy:"merge" patchMergeKey:"storageClassName" json:"classes,omitempty" protobuf:"bytes,4,opt,name=classes"`

msau42 (Member) commented Jan 23, 2020

What would storage pool health be used for if not associated with a storage class? Since this use case seems like it is fundamentally influencing this design, we should explicitly call it out in the use cases. I'm concerned we are expanding the problem scope of storage pool too broadly if we say an associated storageclass is not required. Especially since at the CSI layer there is no first class notion of a storage pool. The closest concept that it has is explicitly tied to storage class.

In the LVM csi driver case, data doesn't need to always be stored the same. You can have a vg that is comprised of SSDs, and another vg that is comprised of HDDs, and those would be represented by different storageclasses and different storage pools.

You're right about the PATCH, I think I was confusing it with some other issue.

// FallbackStorageClassName is used for a CSIStorage element which
// applies when there isn't a more specific element for the
// current storage class or ephemeral volume.
FallbackStorageClassName = "<fallback>"

msau42 (Member) commented Jan 23, 2020

I'm not sure we should try optimizing for special cases here yet, if the general case still works. It makes the API and design harder to understand with various special cases.

// NodeTopology can be used to describe a storage pool that is available
// only for nodes matching certain criteria.
// +optional
NodeTopology *v1.NodeSelector

msau42 (Member) commented Jan 23, 2020

If the topology of a storagepool can change, how do we detect that a topology that used to exist no longer exists and needs to be cleaned up?

and external-provisioner is given that prefix with the
`--topology-prefix=example.com` parameter.

With that parameter, external-provisioner can reconstruct topology segments as follows:

msau42 (Member) commented Jan 23, 2020

Can you give some examples of this?

What I'm not sure about, is how to generate the topology of a storage system that may span multiple segments, ie a storage pool that spans 2 nodes or 2 racks.

pohly (Author, Contributor) commented Jan 24, 2020

Bummer ☹️

While doing that I went back to the CSI spec and noticed that I had a slightly wrong interpretation of NodeGetInfoResponse.Topology in mind: I thought the driver reported where storage is accessible, not where the node is accessible.

In retrospect that makes sense of course, in particular when considering that the same driver might make storage available that has different constraints.

The effect is that the plan outlined here doesn't work. The alternatives are either a CSI extension or additional parameters. This will require further thought and discussion, so for now let me propose that we take it out of the KEP, focus on local storage in the initial implementation, and then consider how to extend this for a revised alpha in 1.19: 0b200ea.

That's still a step forward. Node-local storage is where the lack of storage capacity tracking is most painful, too, so it makes sense to start there.
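For reference, the CSI response in question, as defined in the spec; the topology segments describe where the node sits, not which storage it can reach (node ID and segment values invented for this sketch):

import "github.com/container-storage-interface/spec/lib/go/csi"

// NodeGetInfo reports the topology of the node itself. A driver
// cannot use this to express that some storage pool spans other
// segments - which is why the plan outlined above does not work.
func nodeGetInfo() *csi.NodeGetInfoResponse {
	return &csi.NodeGetInfoResponse{
		NodeId: "node-1", // hypothetical
		AccessibleTopology: &csi.Topology{
			Segments: map[string]string{
				"example.com/rack": "rack1", // hypothetical segment
			},
		},
	}
}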

pohly added 6 commits Jan 24, 2020
No attempt is made to model how different volumes affect each
other. It's worthwhile to call this out explicitly
(https://github.com/kubernetes/enhancements/pull/1353/files#r370301994).
There are other usages of the API and potential extensions that need
to be explored further. But to collaborate on that, it's still better to
get the API into Kubernetes as an alpha feature and then continue the
discussion.
It turned out that the intended solution can't work. The API still
makes sense, just generating the information isn't possible.
The content itself was already in a logical order, but not necessarily
in a suitable section. Now it's organized as follows:

- Motivation: goals and user stories
- Proposal: high level summary and flow between components (maybe a diagram would be nice)
- Design details: API, and details for each of the steps in the flow, test plan
- History
- Drawbacks
- Alternatives
saad-ali (Member) left a comment

@thockin feedback

  • 1 StoragePool to many StorageClass -- how will CSI express this?
  • How will MaximumVolumeSize be updated in time to prevent races? The scheduler will be out of date immediately after a decision and its next decision will be based on stale data. If we consider the current approach running in production at scale, failures will still be highly likely and it will require a considerable amount of effort just to get to that point. To really do this correctly we would have to make the k8s scheduler model down to every single allocatable chunk within the storage system, which is leaking a lot of storage details into Kubernetes.
  • Let's step back and consider some other options. Let's think about what the Kubernetes scheduler really needs? Perhaps, thinking out loud, a way for the scheduler to do storage reservations/leases may be more generally applicable and less prone to error.
  • In general, let's leave the storage logic to the storage system as much as possible. Let's think carefully about the bubbles of logic that we pull in to k8s and why.
pohly (Contributor, Author) commented Jan 25, 2020

1 StoragePool to many StorageClass -- how will CSI express this?

One new CSI call could map storage class parameters to a unique ID for the corresponding storage pool. Then a sidecar can determine which storage pools it needs to track for all currently defined storage classes in Kubernetes. Another new CSI call can then query information about that storage pool (in particular topology) and combine that with the storage class parameters to call GetCapacity.
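A sketch of the last step in that flow; GetCapacity is an existing CSI RPC, whereas the two new calls for mapping a class to a pool and querying pool topology are hypothetical:

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	storagev1 "k8s.io/api/storage/v1"
)

// poolCapacity asks the driver how much space is left for volumes of
// one storage class within one topology segment. Determining which
// pools and topologies to query would rely on the new, hypothetical
// CSI calls described above.
func poolCapacity(ctx context.Context, c csi.ControllerClient,
	sc *storagev1.StorageClass, topo *csi.Topology) (int64, error) {
	resp, err := c.GetCapacity(ctx, &csi.GetCapacityRequest{
		Parameters:         sc.Parameters,
		AccessibleTopology: topo,
	})
	if err != nil {
		return 0, err
	}
	return resp.AvailableCapacity, nil
}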

The scheduler will be out of date immediately after a decision and its next decision will be based on stale data.

Yes, that's the risk. But making a decision based on stale data is still better than making a decision with no data, which is the current status.

If we consider the current approach running in production at scale, failures will still be highly likely and it will require a considerable amount of effort just to get to that point.

Failure rates will only go up when completely filling up the storage available to one node. When keeping enough buffer (which could be done if information about remaining capacity was available), it might be possible to keep the failure rate acceptable.

"considerable amount of effort" - is that effort for the implementation or runtime effort? Implementation isn't that hard. I'm willing to offer a bet that I can finish it in a week. But yes, runtime effort may be an issue. We won't know for sure unless we build and benchmark it.

To really do this correctly we would have to make the k8s scheduler model down to every single allocatable chunk within the storage system, which is leaking a lot of storage details into Kubernetes.

This had been proposed before (retrieve total capacity, model pending operations, etc.) and at least I agree that this is not feasible.

Perhaps, thinking out loud, a way for the scheduler to do storage reservations/leases may be more generally applicable and less prone to error.

Yes, that may be a way. But even for reservations, Kubernetes has to make some upfront decisions about where to place them. For node-local storage that means picking a node, and how will Kubernetes do that efficiently if it has no idea which node might have storage? Try them all, one-by-one or in parallel? One will be slow, in particular in the scenario that was brought up here (high load and storage exhausted); the other wastes storage. If the reservation is left entirely to the storage system, then it might pick a node which is unsuitable.

A compromise with central provisioning of node-local storage is where Kubernetes provides a list of suitable nodes to the storage system and the storage system then picks something. But there isn't necessarily one storage system, so that approach fails when there are two independent storage systems involved (= two CSI drivers) for two different volumes: the result might be two volumes where there is no node that can reach both.

It would also rule out a design like the one proposed by @msau42 and @jsanda in kubernetes-csi/external-provisioner#367 where a node-local CSI driver gets deployed entirely without a central component. It's also very complicated to implement. In PMEM-CSI we do it by letting a central component connect to all nodes (not ideal in many ways). TopoLVM does it by communicating capacity information through the API server (fairly similar to this KEP...).

Let me propose a different path towards such a reservation system:

  • expose capacity information as proposed in this KEP
  • let Kubernetes tentatively create volumes using the existing mechanisms in CSI for persistent volumes and some revised approach for ephemeral inline volumes (which may make sense anyway, because the current approach is kind of hacky - see @lpabon's question about it on Slack yesterday and my later proposal in that same thread)
  • then turn those tentatively allocated volumes into real ones once the pod really starts to run.

This obviously will require further thought, but the first step towards it would be this KEP. In case it hasn't become obvious yet: I would prefer to make progress in Kubernetes by adding a partial solution as alpha and then collaborate further on it once an initial implementation is there. Keeping it out of Kubernetes just ensures that it will get ignored. I have little motivation to pursue this further if that is the recommendation because it's already difficult enough to find reviewers for things that are on track for Kubernetes (i.e. have a merged KEP). I'll probably also have to focus on finding a vendor-specific solution for this problem in PMEM-CSI first before I can come back to Kubernetes. Just my two cents.

satoru-takeuchi commented Jan 27, 2020

Perhaps, thinking out loud, a way for the scheduler to do storage reservations/leases
may be more generally applicable and less prone to error.

At least using a scheduler extender here is difficult: setting one up is inherently complicated, and the procedure depends heavily on the Kubernetes management tools in use.

For example, please see TopoLVM's scheduler extender, topolvm-scheduler.

https://github.com/cybozu-go/topolvm/tree/master/deploy#topolvm-scheduler

As we can see, it's very complicated to set up. In addition, TopoLVM has so far been installed successfully in CKE (*1) and Rancher, and the installation processes differ greatly from each other. There is also an open issue about how to install this extender with kubeadm (*2).

If the scheduler-related logic of this feature relies on a scheduler extender, the same problem happens for each CSI driver.

*1) https://github.com/cybozu-go/cke
*2) cybozu-go/topolvm#82

pohly (Contributor, Author) commented Jan 27, 2020

If the scheduler-related logic of this feature relies on a scheduler extender, the same problem happens for each CSI driver.

And the alternative, using the scheduler framework to build a custom scheduler, will be even worse. Then the cluster admin has to replace the scheduler when installing a CSI driver. This may be acceptable once, but what if two different CSI drivers both need "their own" scheduler to work reliably?

But having said that, I think Tim's proposal was about the Kubernetes scheduler itself. We just don't know exactly yet what that solution could be. But as I said above, I don't think that this uncertainty should stop us from accepting this KEP and the alpha feature, because I think it is technically on the right track and procedurally an alpha feature is still about experimenting.

pohly (Contributor, Author) commented Jan 27, 2020

One more thought about this:

The scheduler will be out of date immediately after a decision and its next decision will be based on stale data.

I've been told that this is exactly what the scheduler does also for extended resources (like accelerator cards): the "available" count only gets updated later when pods actually start running. One difference might be that it counts in-flight pods and their resources; I've not been able to get a conclusive answer on that.

But for CPU and memory the scheduling is also using "stale" data, isn't it? So overall the approach suggested in this KEP seems to be aligned with how other resources are managed.

The difference is that for storage we currently lack graceful recovery mechanisms. We need to add those, but they won't solve the problem well when the scheduler keeps on making uninformed decisions and thus triggers the recovery mechanisms (once they exist) too often.

One of those recovery mechanisms is to re-trigger scheduling for volumes with late binding. In #1353 (comment), @msau42 said that the scheduler should retry and I updated the problem statement in this KEP accordingly. But after having looked at this a bit more, my conclusion is that this currently doesn't work for CSI drivers (to be confirmed, see https://kubernetes.slack.com/archives/C8EJ01Z46/p1580032105038000).

cdickmann commented Jan 27, 2020

@thockin feedback

  • How will MaximumVolumeSize be updated in time to prevent races? The scheduler will be out of date immediately after a decision and its next decision will be based on stale data. If we consider the current approach running in production at scale, failures will still be highly likely and it will require a considerable amount of effort just to get to that point. To really do this correctly we would have to make the k8s scheduler model down to every single allocatable chunk within the storage system, which is leaking a lot of storage details into Kubernetes.

Right, it isn't really feasible or desirable for K8s to own block placement. Storage systems (for many good reasons) will want to own that question. In fact, many systems will implement thin provisioning of some sort, and so the relationship between a placement and associated space consumption, as well as space consumption outside of placement (due to writing to a thin disk) are complex and unpredictable for a generic solution. But that doesn't mean what Patrick proposes is a bad idea. On the contrary, it adds a first set of consideration of storage constraints to the K8s pod scheduler, which is a step forward and in a good direction.

  • Let's step back and consider some other options. Let's think about what the Kubernetes scheduler really needs? Perhaps, thinking out loud, a way for the scheduler to do storage reservations/leases may be more generally applicable and less prone to error.

Space reservations without a volume aren't commonly supported storage system operations. So this would mean not just a CSI change, but something that many CSI drivers can't implement. Further, it doesn't solve the fundamental issue, as Patrick says, because it just moves the point at which placement fails and can't prevent it. Finally, as I said, thin provisioning is common practice, and basically means the relationship of placement to space consumption isn't as trivial as the reservation/lease idea suggests.

  • In general, let's leave the storage logic to the storage system as much as possible. Let's think carefully about the bubbles of logic that we pull in to k8s and why.

I think this is exactly what this KEP does. It goes for a very minimal integration of the Pod scheduler and storage placement. Enough to be useful, small enough to be cheap to implement, and small enough to not cause harm and be extensible in the future.

Broadly speaking, I see two types of comments on this KEP: one set of questions asks if this KEP solves the problem space fully and questions whether a partial solution is useful. The other set takes the opposite approach and asks if this KEP could be smaller and solve fewer problems. I think this KEP is now pretty small, and moves us in a good direction. It doesn't solve every problem; in fact, it doesn't solve the problems our KEP hopes to solve. But I can see how our KEP can be built on top, and I think that is true for most of what people ask in the comments. Isn't the whole idea in the KEP process to start small and gradually build up?

xing-yang (Contributor) commented Jan 27, 2020

How will MaximumVolumeSize be updated in time to prevent races? The scheduler will be out of date immediately after a decision and its next decision will be based on stale data. If we consider the current approach running in production at scale, failures will still be highly likely and it will require a considerable amount of effort just to get to that point. To really do this correctly we would have to make the k8s scheduler model down to every single allocatable chunk within the storage system, which is leaking a lot of storage details into Kubernetes.

Yes, there will always be a small time window when the capacity information known by the scheduler is out of date. However this does not mean the scheduler should not have any knowledge of the capacity at all, even if it is delayed information.

If the storage pool is already out of capacity but the scheduler still schedules volume placement based on slightly out-dated information, volume creation will fail. If the storage pool is already out of capacity but the scheduler still schedules volume placement because it does not have any information on the capacity, volume creation will fail. So even though this KEP does not solve this problem, it also has not introduced this problem.

On the positive side: if the storage pool is already out of capacity and the scheduler learns that it is out of capacity based on the implementation of this KEP, it can schedule volume placement on another storage pool that has sufficient capacity.

In general, this KEP will reduce the possibility of volume placement failure due to the scheduler's lack of knowledge of storage pool capacity (whether the in-tree scheduler or a scheduler extension). So I believe it is definitely a step in the right direction. With an alpha implementation, we can evaluate and improve the solution and make it more robust.

msau42 (Member) commented Jan 28, 2020

Representing storage capacity in a generic manner is a really hard problem. As noted, what is actually available to end users is influenced by many different factors such as: thick/thin provisioning, partitioning, replication factor, striping/mirroring, and more. And in Kubernetes environments, it's even more complicated by the fact that storage may be outside of the cluster and shared with other consumers out of Kubernetes' control.

This is different from how the Kubernetes scheduler handles cpu/memory allocations today, which is assumed to be completely dedicated to the cluster and contiguous. This allows the scheduler to cache current cpu/memory resource allocations based on Pods scheduled.

Storage capacity is a lot more complicated, and we will not be able to take the same approach. Any generic representation is not going to be perfect, and there will inherently be more failure cases. That being said, I think we should strive towards a solution that works well in most scenarios, and also have a strong idea on how we can recover from those failures. This could potentially involve some big changes in our core components, so it needs to be carefully thought through and reviewed. I also think that the current CSI spec is not sufficient to solve our needs, so we shouldn't limit ourselves to what's currently in the CSI spec.

It would also be good to get feedback from @kubernetes/sig-scheduling-proposals on how similar situations such as GPUs were handled, and how the error cases were mitigated and resolved. I would also like sig-scheduling thoughts on the current state of the new scheduling framework, and if that could potentially help our use cases. The scheduling framework is supposed to solve many limitations of the scheduler extender.

pohly (Contributor, Author) commented Jan 28, 2020

/hold

The feedback has been to first address recovery from a failed schedule attempt, then come back to this KEP and enhance the scheduler so that failed scheduling happens less often.
