
KEP-1472: storage capacity: update drawbacks #3233

Merged
merged 10 commits into from
Mar 17, 2022

Conversation

pohly
Contributor

@pohly pohly commented Mar 4, 2022

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Mar 4, 2022
@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Mar 4, 2022
@pohly
Contributor Author

pohly commented Mar 4, 2022

/cc @alculquicondor @xing-yang @x13n @MaciekPytel

@mwielgus: just for completeness, can you confirm here that the conclusion from the 2022-02-21 SIG Autoscaling meeting was that this feature can go to GA without kubernetes/autoscaler#3887 being merged?

The approach above preserves the separation between the different
components. Simpler solutions may be possible by adding support for specific
CSI drivers into custom autoscaler binaries or into operators that control the
cluster setup.

Member

What about the other "Drawbacks" listed, like modeling of capacity usage, fragmentation and prioritization? Was there any progress on them? At the very least, were you able to quantify how bad the situation is in a busy cluster?

Contributor Author

For PMEM-CSI I ran tests that completely filled up a cluster with Pods that have volumes. That worked well in my experience.

Member

Can you summarize those results in the KEP?

Contributor Author

The test with csi-driver-host-path is more suitable for the KEP because it's easier to reproduce and scale up. I can summarize the results from that in the KEP.

Contributor Author

I think the summary of kubernetes-csi/csi-driver-host-path#350 in the revised KEP is sufficient. I just need to update the link once that PR is merged. Whoever wants specific numbers can then look up the report.

@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
Member

I don't think network-attached storage should be much of a problem.

When we are scaling up a node pool or a node group, we know which topology it will end up in, so we can query the CSIStorageCapacity objects that match.
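
For illustration only (not autoscaler code, and assuming the storage.k8s.io/v1 version of the API with a matching client-go release), a minimal sketch of what "query the CSIStorageCapacity objects that match" could look like when the topology labels of the future nodes in a pool are known; capacityForNodePool and nodePoolLabels are hypothetical names:

```go
package capacitysketch

import (
	"context"

	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

// capacityForNodePool returns the CSIStorageCapacity objects whose NodeTopology
// selector matches the labels that nodes in the pool are known to get,
// e.g. {"topology.kubernetes.io/zone": "us-east1-b"}.
func capacityForNodePool(ctx context.Context, client kubernetes.Interface, nodePoolLabels map[string]string) ([]storagev1.CSIStorageCapacity, error) {
	list, err := client.StorageV1().CSIStorageCapacities(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var matching []storagev1.CSIStorageCapacity
	for _, c := range list.Items {
		sel, err := metav1.LabelSelectorAsSelector(c.NodeTopology)
		if err != nil {
			continue // skip malformed selectors
		}
		if sel.Matches(labels.Set(nodePoolLabels)) {
			matching = append(matching, c)
		}
	}
	return matching, nil
}

// volumeFitsPool checks a single volume request against the matching objects.
func volumeFitsPool(request resource.Quantity, caps []storagev1.CSIStorageCapacity) bool {
	for _, c := range caps {
		if c.Capacity != nil && c.Capacity.Cmp(request) >= 0 {
			return true
		}
	}
	return false
}
```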

Contributor Author

For scale up from zero, network-attached storage has a similar problem: because the CSI driver hasn't provided topology information yet for the non-existent nodes, the cluster admin must set labels manually in the configuration of the node pool.

Member

I guess this is for the case where the nodepool is a new topology domain.

I think it should be the responsibility of the CSI driver to set that up even if no nodes exist. It's network-attached after all. It should come from somewhere and the driver should know about it. No?

Contributor Author

But how does the CSI driver know that the information is needed? Suppose the storage backend has many different topology segments, of which none is currently used by a specific Kubernetes cluster. Should the CSI driver report about all of them?

There's also a practical problem here: the CSI spec doesn't have support for listing topology segments for nodes where no CSI driver is running. If we want something like this, we need a new API.

Member

Suppose the storage backend has many different topology segments, of which none is currently used by a specific Kubernetes cluster. Should the CSI driver report about all of them?

Sure, why not? It may not work in every case but it seems like it could work in some?

Contributor Author

Sure, why not?

Scalability and performance issues? It'll depend on the number of those additional segments, of course.

It may not work in every case but it seems like it could work in some?

Yes. I just wanted to point out some of the details (conceptual and practical) that'll have to be addressed for this to work. I agree that it can work.

available to the autoscaler. This can be done with the existing
CSIStorageCapacity API as follows:

- When creating a fictional Node object from an existing Node in
Member

Is this talking about local storage? Can you clarify?

Contributor Author

I was mostly thinking of local storage here. You are right, that needs to be called out here because for network-attached storage the topology labels don't need to be modified. I'll revise.

- For each storage class, the cluster administrator can then create
CSIStorageCapacity objects that provide the capacity information for these
fictional topology segments.
- When the volume binder plugin for the scheduler runs inside the autoscaler,
Member

If there were 2 pending pods, given the semantics of available capacity, the autoscaler might think that they both fit on the same node, when in reality they don't. One of the pods would succeed, and the second one would likely fail once it reaches kubelet. This would really hurt scale-up E2E scheduling time.

Contributor Author

This "will fail" seems to be a misunderstanding, because the second Pod will not reach kubelet. I agree that autoscaler will not make the right decision immediately (create two new nodes instead of one), but it also will not make a wrong decision and eventually the cluster grows sufficiently.


Agreed that it will reach the correct size eventually and any 'failure' will be in the scheduler, not kubelet. That being said, the problem with this argument is that "eventually" may take a really long time (adding tens or hundreds of nodes one at a time), and for the vast majority of use cases autoscaling is only useful if you can get the nodes when you need them. In my experience, scale-up/down latency is the single most important requirement for autoscaling, and a solution that completely ignores latency isn't really a solution for large-scale systems.

to wait for it before considering the node ready. If it doesn't do that, it
might incorrectly scale up further because storage capacity checks will fail
for a new, unused node until the CSI driver provides CSIStorageCapacity
objects for it. This can be implemented in a generic way for all CSI drivers
Member

I guess we could have a timeout similar to the one we have for readiness when there are ignored taints, right @MaciekPytel?


Yeah, and also for GPUs. As long as we can predict the CSIStorageCapacity objects the node will eventually have (which we need to do anyway), our existing solution can be easily adapted for this.

Contributor Author

The approach suggested here is indeed based on the solution for GPUs.
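
As a rough, hypothetical illustration of that GPU-style readiness check (the helper names are made up; this is not Cluster Autoscaler code), the comparison could be as simple as:

```go
package readinesssketch

// storageReady reports whether a freshly created node should be considered
// ready from the storage perspective: every CSIStorageCapacity object that we
// expect the CSI driver to publish for the node (one per storage class and
// topology segment) must already exist. Until then the autoscaler keeps
// treating the node as "upcoming", similar to the GPU readiness handling.
func storageReady(expectedKeys []string, existing map[string]bool) bool {
	for _, key := range expectedKeys {
		if !existing[key] {
			return false
		}
	}
	return true
}
```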

- When creating a fictional Node object from an existing Node in
a node group, autoscaler must modify the topology labels of the CSI
driver(s) in the cluster so that they define a new topology segment.
For example, topology.hostpath.csi/node=aks-workerpool.* has to
Member

What if there was a CSIStorageCapacityClass object that defines which topologies would get a new CSIStorageCapacity and how much capacity they have?

Contributor Author

I had considered whether additional information in CSIDriver could be used to avoid this manual configuration step. In the end I decided against it because it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler.

I'm not against doing something like this if there is sufficient demand. In that case we should create a new KEP about "auto-configuration of capacity handling for autoscaling" (or whatever we want to call it) and in that KEP introduce new alpha-level functionality.

This is similar to how other features get improved over time: first the base functionality gets introduced, then later improvements are added. I see this KEP as the base functionality that is sufficient for several use cases and therefore worth graduating to GA.

Member

I agree: this could be a resource specific for CA (it could even be a CRD) and, yes, that would be a candidate for a follow up KEP.

I guess we can focus on the main problem: how to predict how many pods can fit in a node :)


I decided against it because it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler.

I think that's better than having every user figure out independently how each CSIDriver behaves and feed that information to CA. Hopefully Kubernetes is big enough to create an incentive for CSI driver developers.

In that case we should create a new KEP about "auto-configuration of capacity handling for autoscaling" (or whatever we want to call it) and in that KEP introduce new alpha-level functionality.

+1

Member

it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler

IMO, this is starting to feel like a good thing. Pushing this onto users really doesn't feel good.

Contributor Author

I don't disagree. But as I said elsewhere, we then need some buy-in from a CSI driver vendor who wants to support this. Otherwise such a new KEP will be stuck at the "get feedback from the field" promotion criteria.


- When creating a fictional Node object from an existing Node in
a node group, autoscaler must modify the topology labels of the CSI
driver(s) in the cluster so that they define a new topology segment.
Member

Would we copy the available capacity from another node? What if that node has the capacity partially used?

Contributor Author

That's why we cannot copy existing CSIStorageCapacity objects.
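
A small, made-up example of why copying does not work: an existing node that already hosts volumes reports only its remaining capacity, so cloning its object for a fictional node would understate what a pristine node can offer. The names and values below are illustrative, not from the KEP.

```go
package capacityexample

import (
	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func quantity(s string) *resource.Quantity {
	q := resource.MustParse(s)
	return &q
}

// existingNodeCapacity is what the CSI driver reports for a node that has
// (hypothetically) already provisioned 60Gi out of 100Gi of local storage:
// only the remaining 40Gi shows up, the original total is not recorded
// anywhere, so copying this object for a fictional pristine node would be
// wrong.
var existingNodeCapacity = storagev1.CSIStorageCapacity{
	ObjectMeta:       metav1.ObjectMeta{Name: "csisc-node1-fast", Namespace: "kube-system"},
	StorageClassName: "fast",
	NodeTopology: &metav1.LabelSelector{
		MatchLabels: map[string]string{"topology.hostpath.csi/node": "node1"},
	},
	Capacity:          quantity("40Gi"),
	MaximumVolumeSize: quantity("40Gi"),
}
```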

@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
simulation may decide otherwise.


A separate issue is that Cluster Autoscaler simulates scheduling of all the pods (binpacking) before making any autoscaling decisions. The way that capacity is defined here (and the way kubernetes/autoscaler#3887 uses the same CSIStorageCapacity to represent the local storage of potentially hundreds of nodes) means, I think, that CA will not be able to calculate how many nodes are needed for any use case where available storage is the limiting factor for scheduling.

For example, assume I have 1000 pending pods that have relatively low cpu/memory requests, but require all the local storage of a node. CA would think all those pods can fit on just a few nodes (on a single node if cpu/memory requests are low enough) and only add that many nodes. Admittedly, once the nodes are created some pods will be scheduled, CA will see that the remaining pods are unschedulable again, and it will trigger another (equally small) scale-up. So theoretically the situation will eventually recover. In practice the latency can be arbitrarily long, making the feature completely unusable in sufficiently large clusters.


I understand that's how the CSI specification works and we may not be able to change it, but it seems that the capacity specification isn't strictly defined. We could say that for the purposes of autoscaling we treat it as the sum of all volumes that can be created. This is not a perfect prediction, as it doesn't take fragmentation etc. into account, but it seems much more realistic than assuming we can create an infinite number of volumes and the only limitation is on the size of an individual volume.

We could further explicitly say that any driver where this assumption is too far off is incompatible with autoscaling. Given how popular autoscaling is in cloud environments, this can hopefully generate some incentive for driver authors to interpret capacity that way.

Member

My concern is that the API is even problematic for kube-scheduler itself. On a sudden scale up, multiple pods will end up failing. Then we need to wait for them to fail before creating replacement pods. If you are using the Job API, you could hit the backoffLimit and the Job would fail. Similarly, other third-party APIs (Argo, Airflow) are less resistant to failures.

We could say that for purposes of autoscaling we treat it as sum of all volumes that can be created. This is not a perfect prediction as it doesn't take fragmentation, etc into account, but it seems much more realistic than assuming we can create infinite number of volumes and the only limitation is on the size of individual volume.

I agree that such an interpretation would be better. Even if it's not a perfect prediction, it would at least generate fewer failed pods.

As the API stands today, I don't think I'm confident graduating it to GA, especially if changing the interpretation of a field is under consideration. Maybe we instead need to add a new field for the total capacity that the scheduler needs to respect in combination with the existing .status.availableCapacity. If the field is not set, it means that the capacity is unlimited, and we can assume that drivers who choose to leave it empty understand the consequences.

Do you think adding that field is possible @pohly? Do you think that's better for scheduling/autoscaling @MaciekPytel?

If you agree, we should go with a v1beta2.
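
To make the proposal above concrete, here is a purely hypothetical sketch of the field split under discussion; none of this exists in the Kubernetes API, and the existing object only has the optional Capacity and MaximumVolumeSize fields:

```go
package apisketch

import "k8s.io/apimachinery/pkg/api/resource"

// CSIStorageCapacitySketch is NOT a Kubernetes type; it only illustrates the
// field split being discussed.
type CSIStorageCapacitySketch struct {
	// Capacity mirrors what CSI GetCapacity currently reports: the capacity
	// that is available right now for new volumes (optional today).
	Capacity *resource.Quantity
	// MaximumVolumeSize is the largest single volume that can currently be
	// created (optional today).
	MaximumVolumeSize *resource.Quantity
	// TotalCapacity is the hypothetical addition: the capacity of the segment
	// as if no volumes had been provisioned, which an autoscaler could use to
	// estimate how many pending volumes fit on a brand-new node. Unset would
	// mean "unknown/unlimited".
	TotalCapacity *resource.Quantity
}
```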

Member

FYI @liggitt @roycaihw (API reviewers)

Contributor Author

I understand that's how CSI specification works and we may not be able to change it, but it seems that capacity specification isn't strictly defined.

That's why I added a more strictly defined "maximum volume size" to the spec. That is what Kubernetes uses if available, instead of the more loosely defined "capacity".

We could say that for purposes of autoscaling we treat it as sum of all volumes that can be created. This is not a perfect prediction as it doesn't take fragmentation, etc into account, but it seems much more realistic than assuming we can create infinite number of volumes and the only limitation is on the size of individual volume.

I don't think that making assumptions about how CSI drivers handle storage capacity is going to help. We've discussed that at length in SIG Storage and in CSI meetings and the consensus there was that there are always some vendor specific exceptions that cannot be handled in a generic way.

On a sudden scale up, multiple pods will end up failing. then we need to wait for them to fail to create replacement pods.

Failing how? They won't get scheduled unless their volumes were successfully created. Until then they remain in the queue and will eventually land on some node.
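
For reference, a simplified sketch (not the actual volume binding plugin code) of the check described above: prefer the strictly defined maximum volume size when the driver reports it, and only fall back to the looser capacity value otherwise.

```go
package schedulersketch

import (
	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// volumeFits decides whether a volume of the requested size can be provisioned
// in the topology segment described by one CSIStorageCapacity object.
func volumeFits(request resource.Quantity, c storagev1.CSIStorageCapacity) bool {
	if c.MaximumVolumeSize != nil {
		// Strictly defined: a single volume of at most this size can be created.
		return c.MaximumVolumeSize.Cmp(request) >= 0
	}
	if c.Capacity != nil {
		// Looser fallback: compare against the reported available capacity.
		return c.Capacity.Cmp(request) >= 0
	}
	// No usable information: treat the segment as not fitting the volume.
	return false
}
```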

Contributor Author

I don't think we need to go that far. Both existing fields are intentionally marked as optional. We can always add new ones and stop populating the old ones, if we find that this will work better. Nothing in the current discussion indicates that there'll be a need for that, though, therefore I am confident that this API is good for GA.

Contributor Author

I had an idea for how we can support simulated volume provisioning in the autoscaler without (at least initially) extending this API or the CSI spec:

  • the CSI driver must report "total capacity" in GetCapacityResponse.available_capacity and support maximum_volume_size
  • the CSI driver must be local to nodes and have a single topology key with the Kubernetes node name as value
  • the CSIDriver object must have a special annotation for the autoscaler ("linear, local storage")
  • nothing changes for kube-scheduler
  • autoscaler simulates scheduling like this:
    • when cloning an existing node to create a node group, it replaces the node name in the CSI driver's topology label
    • it clones the CSIStorageCapacity objects for each storage class, changes their topology and removes the MaximumVolumeSize field: what remains is the total capacity without any volumes allocated
    • when the "selected node" label gets set for a PVC while scheduling a pod, the "AvailableCapacity" gets reduced by the size of that PVC: this simulates volume creation on that node and ensures that eventually pods no longer fit the node

There are some implementation challenges; for example, the volume binder code that sets "selected node" might not currently get called by the autoscaler. The in-tree volume binder code must be made a bit more flexible to accept stores instead of creating informers itself, but other than that it doesn't need any special logic. It's also not a generic solution, so we must have a good understanding of which CSI drivers are going to use this and check in advance that they fit the assumptions.
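
A rough sketch of the two simulation steps described above; the function names are illustrative, this is not Cluster Autoscaler code, and it assumes a node-local driver with a single topology key whose value is the node name:

```go
package autoscalersketch

import (
	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cloneCapacityForFictionalNode creates the CSIStorageCapacity object for a
// fictional node cloned from an existing one: the topology label is rewritten
// to the new node name and MaximumVolumeSize is dropped, leaving Capacity
// (what the comment above calls "AvailableCapacity") as the total capacity of
// a pristine node under the "linear storage" assumption.
func cloneCapacityForFictionalNode(src *storagev1.CSIStorageCapacity, topologyKey, newNodeName string) *storagev1.CSIStorageCapacity {
	clone := src.DeepCopy()
	clone.ObjectMeta = metav1.ObjectMeta{Name: src.Name + "-" + newNodeName, Namespace: src.Namespace}
	clone.NodeTopology = &metav1.LabelSelector{
		MatchLabels: map[string]string{topologyKey: newNodeName},
	}
	clone.MaximumVolumeSize = nil
	return clone
}

// simulateProvision mimics what happens when the "selected node" gets set for
// a PVC during simulated scheduling: the available capacity of that node's
// segment shrinks by the PVC size, so at some point further pods with volumes
// no longer fit on the fictional node.
func simulateProvision(c *storagev1.CSIStorageCapacity, pvcSize resource.Quantity) {
	if c.Capacity != nil {
		c.Capacity.Sub(pvcSize)
	}
}
```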

Member

If MaximumVolumeSize is simply removed, Cluster Autoscaler may decide to provision a node even though a workload can't get scheduled there due to its request exceeding the max size (but being less than total capacity). Proper CA support needs to somehow get this information.

Also, this doesn't really help in the scale-from-0 scenario. CA needs to somehow know what new nodes will look like, without an existing sample node to clone.

Contributor Author

I probably shouldn't have posted that idea because we are now back to discussing a solution that doesn't belong in this KEP 😅 Let's continue anyway, but let's keep in mind that this isn't blocking the merging of this PR.

The entire approach of modeling volume provisioning in the autoscaler has to make assumptions about the storage system. The key assumption here is "storage is linear" in which case "maximum volume size" becomes irrelevant because any volume whose size is smaller than the remaining capacity can get created. If that isn't close enough to reality, a more complex model will be needed which then might include some information about "volumes have a minimum size, a maximum size, size must be aligned with X GB, etc.".

Also, this doesn't really help in the scale-from-0 scenario. CA needs to somehow know what new nodes will look like, without an existing sample node to clone.

The CSI driver for local storage cannot help with scale-from-0 because it isn't running. The cloud provider or some other central component would have to be extended to provide information for such a scenario. That's a different approach that gives up flexibility (cannot use arbitrary CSI drivers) in favor of scale-from-0 support.

Member

Sorry for starting a solution discussion here, happy to take it elsewhere 🙂

The entire approach of modeling volume provisioning in the autoscaler has to make assumptions about the storage system. The key assumption here is "storage is linear" in which case "maximum volume size" becomes irrelevant because any volume whose size is smaller than the remaining capacity can get created. If that isn't close enough to reality, a more complex model will be needed which then might include some information about "volumes have a minimum size, a maximum size, size must be aligned with X GB, etc.".

Yes, if we want proper autoscaling support, CA has to use a more complex model, which will make sense for most/all storage implementations out there.

The CSI driver for local storage cannot help with scale-from-0 because it isn't running. The cloud provider or some other central component would have to be extended to provide information for such a scenario. That's a different approach that gives up flexibility (cannot use arbitrary CSI drivers) in favor of scale-from-0 support.

I think this could work through a separate API: different driver extensions would create different objects for CA to use for resource predictions.

about hardware that hasn't been provisioned yet and doesn't know about
autoscaling.

This problem can be solved by the cluster administrator. They can find out how


Kubernetes is already notoriously hard to use, and by pushing our implementation challenges onto the cluster admin we're only making it that much worse. We should instead think about how we can change our implementation to make it as easy as we can.

We have already solved the exact same issue for the Node object itself: the autoscaler doesn't require the user to create a 'template node' object manually; we automatically infer it from existing nodes. Can we do the same for CSIStorageCapacity? We can identify all CSIStorageCapacity objects for an existing node in a node group and copy them.
Of course, the issue is passing those in-memory objects to the plugin. But that can be solved, for example by refactoring the scheduler code to include CSIStorageCapacity in the scheduler snapshot rather than fetching it from an informer in the plugin. We already do exactly that for Node objects (an additional benefit of this solution is that the entire scheduling cycle would use a consistent snapshot of the cluster, eliminating any potential races caused by a CSIStorageCapacity object being created mid-scheduling).
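
As a minimal sketch of that snapshot idea (ClusterSnapshotSketch is a hypothetical type, not the existing autoscaler or scheduler snapshot):

```go
package snapshotsketch

import (
	corev1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
)

// ClusterSnapshotSketch carries CSIStorageCapacity alongside the nodes so that
// a whole (simulated) scheduling cycle sees one consistent view of the cluster
// instead of each plugin reading from its own informer.
type ClusterSnapshotSketch struct {
	Nodes             []*corev1.Node
	StorageCapacities []*storagev1.CSIStorageCapacity
}

// addFictionalNode clones both the template Node and its matching
// CSIStorageCapacity objects into the snapshot when the autoscaler simulates
// a scale-up of a node group.
func (s *ClusterSnapshotSketch) addFictionalNode(node *corev1.Node, caps []*storagev1.CSIStorageCapacity) {
	s.Nodes = append(s.Nodes, node)
	s.StorageCapacities = append(s.StorageCapacities, caps...)
}
```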

Contributor Author

We automatically infer it from existing nodes. Can we do the same for CSIStorageCapacity?

The information reported by the CSI driver for existing nodes is about remaining capacity. We would have to extend the CSI spec to report capacity as if no volumes had been created.

That may be doable, but then what about scale up from zero? That can't work that way.


We would have to extend the CSI spec to report capacity as if no volumes had been created

What if we had an initialCapacity field in CSIStorageCapacity, set it to be the same as the capacity when the object is first created, and never modified it later on? That'd require a bit of extra logic to set the field, but I don't think it requires changes in the CSI spec?

Agreed that scale-from-zero is trickier. I don't see how we can solve it for the general case without some extra information coming from somewhere. It does feel very similar to custom resources and GPUs, though, and for those we defer to cloud providers to handle the problem. I think a solution that gives more opportunity to either automate this or otherwise expose an API consistent with the given provider would be preferred (e.g. custom resources are specified in ASG tags).

Contributor Author

What if we had an initialCapacity field in CSIStorageCapacity, set it to be the same as the capacity when the object is first created, and never modified it later on?

That assumes that the driver gets to see the node in a "pristine" state and that nothing changes later on. But it isn't unusual for disks to fail or new ones to get attached at runtime. It also doesn't help when the driver gets uninstalled completely (which, depending on the deployment approach, also clears the CSIStorageCapacity objects) and then gets reinstalled while there are still some volumes. When the driver then comes up, the sidecar doesn't get the full capacity.

by adding a readiness check to the autoscaler that compares the existing
CSIStorageCapacity objects against the expected ones for the fictional node.

A proof-of-concept of this approach is available in


My issue is that it is very much a PoC. I can see how it can work for a simple test scenario, but I don't think it's a suitable final solution:

  • The UX is extremely complicated, requiring multiple complex steps from cluster admin. Debugging this will be super hard for anyone not intimately familiar with CA internals.
  • I don't think this approach scales to large clusters (see my comment about capacity above).

@alculquicondor
Member

At a high level, it sounds like some of the questions have answers or there were clear conclusions from SIG Storage and CSI meetings. But the KEP is not up to date, so it's hard to make a decision on whether the API is ready for GA.

As discussed during PR review, the section was focused on local storage without
explicitly saying that. Network-attached storage needs to be dealt with
differently and has a different problem (scale from zero and lacking
capacity information).
API review during the implementation led to some changes compared to what had
been proposed in the KEP. The separation into spec and status was removed. Later,
"maximum volume size" was added once it became available in the CSI spec.
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 4, 2022
@pohly
Contributor Author

pohly commented Mar 4, 2022

But the KEP is not up to date, so it's hard to make a decision on whether the API is ready for GA.

I've updated the autoscaler section and copied the actual API description back into the KEP.

I appreciate that you are taking the time to review the autoscaler proposal. Your comments about node-local vs. network-attached storage made me aware that scale from zero is still problematic for network-attached storage. On the other hand, lack of storage capacity tracking isn't such a big deal for network-attached storage, and if there is interest, solutions also seem possible for scale up from zero.

I agree that usability of that proposal is poor. But I also think that we will need feedback from specific storage providers and users who are facing that problem before we can come up with something simpler and that this shouldn't hold back this KEP. 😓

@alculquicondor
Member

I agree that usability of that proposal is poor. But I also think that we will need feedback from specific storage providers and users who are facing that problem before we can come up with something simpler and that this shouldn't hold back this KEP.

That sounds like the API needs more time in beta stage

@pohly
Contributor Author

pohly commented Mar 4, 2022

Only if there is an expectation that future changes cannot be done as additions to a GA API. I don't see that, quite the opposite. Simplifying the autoscaler configuration is something that can be done with additional information, perhaps even with an out-of-tree CRD. In that case no changes to the in-tree API will be needed at all.

Holding the API in beta penalizes those users who don't need autoscaling. I don't think that this is the right choice.

@alculquicondor
Member

In my ideal world, we would have "TotalCapacity" and "MaxVolumeSize", and then I wouldn't see the point of having "AvailableCapacity", so I wouldn't like having that field in a GA API.

But maybe this is not realistic. Let me take another look at the updated PR on Monday.

@alculquicondor
Member

Can you update all the "Drawbacks" subsections with whatever context was gathered from SIG storage and CSI meetings?

@pohly
Contributor Author

pohly commented Mar 4, 2022

In my ideal world, we would have "TotalCapacity" and "MaxVolumeSize" and then I wouldn't see a point of having "AvailableCapacity", so I wouldn't like having that field in a GA API.

"Total capacity" is pretty much off the table. The storage community seems to dislike the idea of pretending that storage is linear, and such a "total capacity" only makes sense under that assumption.

One of the design decisions during API review was that the CSIStorageCapacity should mirror what CSI provides. The logic then is in the consumer of that information. From that perspective it makes sense to have both values, even if the scheduler only needs one.

@pohly
Contributor Author

pohly commented Mar 7, 2022

Can you update all the "Drawbacks" subsections with whatever context was gathered from SIG storage and CSI meetings?

Done. That whole section was partially outdated, thanks for bringing it up.

I also reran the scale tests because I hadn't properly written up the results from the first time and because I wanted to make sure that the recent external-provisioner still works. See https://github.com/kubernetes-csi/csi-driver-host-path/blob/d6d9639077691986d676984827ea4dd7ee0c5cce/docs/storage-capacity-tracking.md

Key results:

  • at 10 nodes and 1000 pods+volumes, enabling storage capacity tracking makes pod scheduling over 2 times faster; without it, it's pure luck that scheduling succeeds at all
  • at 100 nodes and 10000 pods+volumes, pod scheduling gets stuck without storage capacity tracking (finding the right node(s) by chance no longer works)
  • at the same scale with storage capacity tracking enabled, a small tweak was necessary in external-provisioner to keep the apiserver responsive - with that tweak the test passed all three times that I ran it

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 8, 2022
@alculquicondor
Member

/retitle KEP-1472: storage capacity: update drawbacks

@k8s-ci-robot k8s-ci-robot changed the title KEP-1472: storage capacity: document autoscaler support KEP-1472: storage capacity: update drawbacks Mar 8, 2022
@pohly
Contributor Author

pohly commented Mar 9, 2022

/assign @msau42

For approval.

@MaciekPytel

I'm not saying we should block GA on this, but I'd like to request that we amend this PR to describe the problems with autoscaler integration (primarily the fact that CA can't predict the result of scheduling multiple pods) and mention that future work will be required to address those issues.

@pohly
Contributor Author

pohly commented Mar 9, 2022

"Lack of storage capacity modeling will cause the autoscaler to scale up clusters more slowly because it cannot determine in advance that multiple new nodes are needed." is already called out as a drawback under "lack of storage capacity modeling".

That something else would need to be done to address that is implied. We could of course mention the need for future work explicitly, but that's a slippery slope towards outlining such a technical solution in a KEP that isn't about that solution. I'd prefer to keep the KEP as it is now, but don't feel strongly about it.

I'll let @msau42 decide that.

@alculquicondor
Member

Maybe clarify that the autoscaler support PoC needs a lot of work to be easily configurable (we talked about a CA specific CRD that I don't see included in the KEP) and it doesn't address the problem of "lack of storage capacity modeling". And then you can explicitly say: "the CSI API needs work to improve this situation".

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2022
@pohly
Contributor Author

pohly commented Mar 9, 2022

I've added some more text about enhancements for CA.

@alculquicondor can you re-add LGTM if you agree with the change?

@alculquicondor
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2022
@MaciekPytel

To reiterate my point - the current wording reads to me like the problem of integration with the autoscaler is mostly solved, with maybe a few gotchas. I disagree with that. The main part of the problem (the autoscaler's ability to predict the result of scheduling all the pending pods) is not addressed.

I don't see how we can solve it without large changes, likely involving both extending the CSI spec and, possibly, changes in the scheduler. I'm fine with the fact that those solutions are out of scope. But since you're adding a section about integration with Cluster Autoscaler, I think it should clearly list all the problems.

"Lack of storage capacity modeling will cause the autoscaler to scale up clusters more slowly because it cannot determine in advance that multiple new nodes are needed." is already called out as a drawback under "lack of storage capacity modeling".

I agree. It's just that the way it's mentioned suggests it's a limitation of the PoC implementation and downplays its significance. I'd like to shift the emphasis to make it clearer that the problem is not solved, but there is a workaround implementation that may work for some cases (small scale, or the user doesn't care about latency).

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2022
@pohly
Contributor Author

pohly commented Mar 10, 2022

@MaciekPytel: I'm not trying to downplay the limitations. I've pushed another commit with "the current understanding is that further work is needed" as the conclusion for the prototype. I'd also be happy to get suggestions for the right wording.

@xing-yang
Contributor

/retest

@alculquicondor
Member

still LGTM, but I'll leave the official one to @MaciekPytel

Member

@thockin thockin left a comment


Opinion: the intersection of autoscaling and storage capacity will require more than just tweaks to this API. I'm not convinced that blocking this will produce any fundamentally different API, but I also admit I don't have all the autoscaler context.

I think this decision belongs to @msau42



@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
simulation may decide otherwise.
Member

In terms of API stability, we cannot change the interpretation of existing fields. Any new semantic we want in the future would have to be a new field.

Not saying it's the way to go, but a new apiversion could repurpose a field name as long as the round-trip logic converts, right?

that logic is implemented inside the CSI driver. The CSI driver doesn't know
about hardware that hasn't been provisioned yet and doesn't know about
autoscaling.

Member

I think the concern here is signing off on this as an agreed-upon solution. Instead, let's word this to indicate that further design and investigation are still needed to solve these problems:

  • How a CSI driver can report capacity on a pristine, simulated node
  • How the autoscaler can provide a CSI driver with the storage information it needs to be able to model capacity based on the environment it's modeling.
  • How to improve the batch scheduling use case

And then park the existing discussion under "Alternatives" (don't want to lose the great discussion that's happening here).

Contributor Author

How the autoscaler can provide a CSI driver with the storage information it needs to be able to model capacity based on the environment it's modeling.

Isn't it meant to be the other way around: the CSI driver provides information to the autoscaler and the autoscaler does the modeling?

And then park the existing discussion under "Alternatives" (don't want to lose the great discussion that's happening here).

I've pushed an update that under "drawbacks" describes the problem and that further work is needed, then under "alternatives" the description of the current prototype.

Member

Isn't it meant to be the other way around: the CSI driver provides information to the autoscaler and the autoscaler does the modeling?

I think there may need to be information passed both ways. In order for a CSI driver to estimate a capacity without being able to actually inspect a node, it's going to need information about the future node, such as how many disks are going to be there and their capacities.

This makes it clear that the prototype is not the suggested solution.
@msau42
Member

msau42 commented Mar 17, 2022

/lgtm
/approve

The existing autoscaler ideas are parked under "alternatives" for future discussion. For now, the KEP notes that further investigation is needed, and autoscaler integration will be tracked in a separate future KEP.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 17, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2022
@k8s-ci-robot k8s-ci-robot merged commit 0961bb8 into kubernetes:master Mar 17, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Mar 17, 2022