
KEP-1472: storage capacity: update drawbacks #3233

Merged
merged 10 commits into from
Mar 17, 2022

Conversation

pohly
Contributor

@pohly pohly commented Mar 4, 2022

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Mar 4, 2022
@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Mar 4, 2022
@pohly
Contributor Author

pohly commented Mar 4, 2022

/cc @alculquicondor @xing-yang @x13n @MaciekPytel

@mwielgus: just for completeness, can you confirm here that the conclusion from the 2022-02-21 SIG Autoscaling meeting was that this feature can go to GA without kubernetes/autoscaler#3887 being merged?

The approach above preserves the separation between the different
components. Simpler solutions may be possible by adding support for specific
CSI drivers into custom autoscaler binaries or into operators that control the
cluster setup.

Member

What about the other "Drawbacks" listed, like modeling of capacity usage, fragmentation and prioritization? Was there any progress on them? At the very least, were you able to quantify how bad the situation is in a busy cluster?

Contributor Author

For PMEM-CSI I ran tests that completely filled up a cluster with Pods that have volumes. That worked well in my experience.

Member

Can you summarize those results in the KEP?

Contributor Author

The test with csi-driver-host-path is more suitable for the KEP because it's easier to reproduce and scale up. I can summarize the results from that in the KEP.

Contributor Author

I think the summary of kubernetes-csi/csi-driver-host-path#350 in the revised KEP is sufficient. I just need to update the link once that PR is merged. Whoever wants specific numbers can then look up the report.

@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
Member

I don't think network-attached storage should be much of a problem.

When we are scaling up a node pool or a node group, we know which topology it will end up in, so we can query the CSIStorageCapacity objects that match.
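
For illustration only (not autoscaler code, and assuming the storage.k8s.io/v1 version of the API with a matching client-go release), a minimal sketch of what "query the CSIStorageCapacity objects that match" could look like when the topology labels of the future nodes in a pool are known; capacityForNodePool and nodePoolLabels are hypothetical names:

```go
package capacitysketch

import (
	"context"

	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

// capacityForNodePool returns the CSIStorageCapacity objects whose NodeTopology
// selector matches the labels that nodes in the pool are known to get,
// e.g. {"topology.kubernetes.io/zone": "us-east1-b"}.
func capacityForNodePool(ctx context.Context, client kubernetes.Interface, nodePoolLabels map[string]string) ([]storagev1.CSIStorageCapacity, error) {
	list, err := client.StorageV1().CSIStorageCapacities(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var matching []storagev1.CSIStorageCapacity
	for _, c := range list.Items {
		sel, err := metav1.LabelSelectorAsSelector(c.NodeTopology)
		if err != nil {
			continue // skip malformed selectors
		}
		if sel.Matches(labels.Set(nodePoolLabels)) {
			matching = append(matching, c)
		}
	}
	return matching, nil
}

// volumeFitsPool checks a single volume request against the matching objects.
func volumeFitsPool(request resource.Quantity, caps []storagev1.CSIStorageCapacity) bool {
	for _, c := range caps {
		if c.Capacity != nil && c.Capacity.Cmp(request) >= 0 {
			return true
		}
	}
	return false
}
```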

Contributor Author

For scale up from zero, network-attached storage has a similar problem: because the CSI driver hasn't provided topology information yet for the non-existent nodes, the cluster admin must set labels manually in the configuration of the node pool.

Member

I guess this is for the case where the nodepool is a new topology domain.

I think it should be the responsibility of the CSI driver to set that up even if no nodes exist. It's network-attached after all. It should come from somewhere and the driver should know about it. No?

Contributor Author

But how does the CSI driver know that the information is needed? Suppose the storage backend has many different topology segments, of which none is currently used by a specific Kubernetes cluster. Should the CSI driver report about all of them?

There's also a practical problem here: the CSI spec doesn't have support for listing topology segments for nodes where no CSI driver is running. If we want something like this, we need a new API.

Member

Suppose the storage backend has many different topology segments, of which none is currently used by a specific Kubernetes cluster. Should the CSI driver report about all of them?

Sure, why not? It may not work in every case but it seems like it could work in some?

Contributor Author

Sure, why not?

Scalability and performance issues? It'll depend on the number of those additional segments, of course.

It may not work in every case but it seems like it could work in some?

Yes. I just wanted to point out some of the details (conceptual and practical) that'll have to be addressed for this to work. I agree that it can work.

available to the autoscaler. This can be done with the existing
CSIStorageCapacity API as follows:

- When creating a fictional Node object from an existing Node in
Member

Is this talking about local storage? Can you clarify?

Contributor Author

I was mostly thinking of local storage here. You are right, that needs to be called out here because for network-attached storage the topology labels don't need to be modified. I'll revise.

- For each storage class, the cluster administrator can then create
CSIStorageCapacity objects that provide the capacity information for these
fictional topology segments.
- When the volume binder plugin for the scheduler runs inside the autoscaler,
Member

If there were 2 pending pods, given the semantics of available capacity, the autoscaler might think that they both fit on the same node, when in reality they don't. One of the pods would succeed, and the second one would likely fail once it reaches kubelet. This would really hurt scale-up E2E scheduling time.

Contributor Author

This "will fail" seems to be a misunderstanding, because the second Pod will not reach kubelet. I agree that autoscaler will not make the right decision immediately (create two new nodes instead of one), but it also will not make a wrong decision and eventually the cluster grows sufficiently.


Agreed that it will reach the correct size eventually and any 'failure' will be in the scheduler, not kubelet. That being said, the problem with this argument is that "eventually" may take a really long time (adding tens or hundreds of nodes one at a time), and for the vast majority of use cases autoscaling is only useful if you can get the nodes when you need them. In my experience, scale-up/down latency is the single most important requirement for autoscaling, and a solution that completely ignores latency isn't really a solution for large-scale systems.

to wait for it before considering the node ready. If it doesn't do that, it
might incorrectly scale up further because storage capacity checks will fail
for a new, unused node until the CSI driver provides CSIStorageCapacity
objects for it. This can be implemented in a generic way for all CSI drivers
Member

I guess we could have a timeout similar to the one we have for readiness when there are ignored taints, right @MaciekPytel?


Yeah, and also for GPUs. As long as we can predict the CSIStorageCapacity objects the node will eventually have (which we need to do anyway), our existing solution can be easily adapted for this.

Contributor Author

The approach suggested here is indeed based on the solution for GPUs.
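
As a rough, hypothetical illustration of that GPU-style readiness check (the helper names are made up; this is not Cluster Autoscaler code), the comparison could be as simple as:

```go
package readinesssketch

// storageReady reports whether a freshly created node should be considered
// ready from the storage perspective: every CSIStorageCapacity object that we
// expect the CSI driver to publish for the node (one per storage class and
// topology segment) must already exist. Until then the autoscaler keeps
// treating the node as "upcoming", similar to the GPU readiness handling.
func storageReady(expectedKeys []string, existing map[string]bool) bool {
	for _, key := range expectedKeys {
		if !existing[key] {
			return false
		}
	}
	return true
}
```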

- When creating a fictional Node object from an existing Node in
a node group, autoscaler must modify the topology labels of the CSI
driver(s) in the cluster so that they define a new topology segment.
For example, topology.hostpath.csi/node=aks-workerpool.* has to
Member

What if there was a CSIStorageCapacityClass object that defines which topologies would get a new CSIStorageCapacity and how much capacity they have?

Contributor Author

I had considered whether additional information in CSIDriver could be used to avoid this manual configuration step. In the end I decided against it because it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler.

I'm not against doing something like this if there is sufficient demand. In that case we should create a new KEP about "auto-configuration of capacity handling for autoscaling" (or whatever we want to call it) and in that KEP introduce new alpha-level functionality.

This is similar to how other features get improved over time: first the base functionality gets introduced, then later improvements are added. I see this KEP as the base functionality that is sufficient for several use cases and therefore worth graduating to GA.

Member

I agree: this could be a resource specific for CA (it could even be a CRD) and, yes, that would be a candidate for a follow up KEP.

I guess we can focus on the main problem: how to predict how many pods can fit in a node :)


I decided against it because it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler.

I think that's better than having every user figure out independently how each CSIDriver behaves and feed that information to CA. Hopefully Kubernetes is big enough to create an incentive for CSI driver developers.

In that case we should create a new KEP about "auto-configuration of capacity handling for autoscaling" (or whatever we want to call it) and in that KEP introduce new alpha-level functionality.

+1

Member

it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler

IMO, this is starting to feel like a good thing. Pushing this onto users really doesn't feel good.

Contributor Author

I don't disagree. But as I said elsewhere, we then need some buy-in from a CSI driver vendor who wants to support this. Otherwise such a new KEP will be stuck at the "get feedback from the field" promotion criteria.


- When creating a fictional Node object from an existing Node in
a node group, autoscaler must modify the topology labels of the CSI
driver(s) in the cluster so that they define a new topology segment.
Member

Would we copy the available capacity from another node? What if that node has the capacity partially used?

Contributor Author

That's why we cannot copy existing CSIStorageCapacity objects.
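
A small, made-up example of why copying does not work: an existing node that already hosts volumes reports only its remaining capacity, so cloning its object for a fictional node would understate what a pristine node can offer. The names and values below are illustrative, not from the KEP.

```go
package capacityexample

import (
	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func quantity(s string) *resource.Quantity {
	q := resource.MustParse(s)
	return &q
}

// existingNodeCapacity is what the CSI driver reports for a node that has
// (hypothetically) already provisioned 60Gi out of 100Gi of local storage:
// only the remaining 40Gi shows up, the original total is not recorded
// anywhere, so copying this object for a fictional pristine node would be
// wrong.
var existingNodeCapacity = storagev1.CSIStorageCapacity{
	ObjectMeta:       metav1.ObjectMeta{Name: "csisc-node1-fast", Namespace: "kube-system"},
	StorageClassName: "fast",
	NodeTopology: &metav1.LabelSelector{
		MatchLabels: map[string]string{"topology.hostpath.csi/node": "node1"},
	},
	Capacity:          quantity("40Gi"),
	MaximumVolumeSize: quantity("40Gi"),
}
```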

@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
simulation may decide otherwise.


A separate issue is that Cluster Autoscaler simulates scheduling of all the pods (binpacking) before making any autoscaling decisions. The way that capacity is defined here (and the way kubernetes/autoscaler#3887 uses the same CSIStorageCapacity to represent the local storage of potentially hundreds of nodes) means, I think, that CA will not be able to calculate how many nodes are needed for any use case where available storage is the limiting factor for scheduling.

For example, assume I have 1000 pending pods that have relatively low cpu/memory requests, but require all the local storage of a node. CA would think all those pods can fit on just a few nodes (on a single node if cpu/memory requests are low enough) and only add that many nodes. Admittedly, once the nodes are created some pods will be scheduled, CA will see that the remaining pods are unschedulable again, and it will trigger another (equally small) scale-up. So theoretically the situation will eventually recover. In practice the latency can be arbitrarily long, making the feature completely unusable in sufficiently large clusters.


I understand that's how the CSI specification works and we may not be able to change it, but it seems that the capacity specification isn't strictly defined. We could say that for the purposes of autoscaling we treat it as the sum of all volumes that can be created. This is not a perfect prediction, as it doesn't take fragmentation etc. into account, but it seems much more realistic than assuming we can create an infinite number of volumes and the only limitation is on the size of an individual volume.

We could further explicitly say that any driver where this assumption is too far off is incompatible with autoscaling. Given how popular autoscaling is in cloud environments, this can hopefully generate some incentive for driver authors to interpret capacity that way.

Member

My concern is that the API is even problematic for kube-scheduler itself. On a sudden scale up, multiple pods will end up failing. Then we need to wait for them to fail before creating replacement pods. If you are using the Job API, you could hit the backoffLimit and the Job would fail. Similarly, other third-party APIs (Argo, Airflow) are less resistant to failures.

We could say that for purposes of autoscaling we treat it as sum of all volumes that can be created. This is not a perfect prediction as it doesn't take fragmentation, etc into account, but it seems much more realistic than assuming we can create infinite number of volumes and the only limitation is on the size of individual volume.

I agree that such an interpretation would be better. Even if it's not a perfect prediction, it would at least generate fewer failed pods.

As the API stands today, I don't think I'm confident graduating it to GA, especially if changing the interpretation of a field is under consideration. Maybe we instead need to add a new field for the total capacity that the scheduler needs to respect in combination with the existing .status.availableCapacity. If the field is not set, it means that the capacity is unlimited, and we can assume that drivers who choose to leave it empty understand the consequences.

Do you think adding that field is possible @pohly? Do you think that's better for scheduling/autoscaling @MaciekPytel?

If you agree, we should go with a v1beta2.
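
To make the proposal above concrete, here is a purely hypothetical sketch of the field split under discussion; none of this exists in the Kubernetes API, and the existing object only has the optional Capacity and MaximumVolumeSize fields:

```go
package apisketch

import "k8s.io/apimachinery/pkg/api/resource"

// CSIStorageCapacitySketch is NOT a Kubernetes type; it only illustrates the
// field split being discussed.
type CSIStorageCapacitySketch struct {
	// Capacity mirrors what CSI GetCapacity currently reports: the capacity
	// that is available right now for new volumes (optional today).
	Capacity *resource.Quantity
	// MaximumVolumeSize is the largest single volume that can currently be
	// created (optional today).
	MaximumVolumeSize *resource.Quantity
	// TotalCapacity is the hypothetical addition: the capacity of the segment
	// as if no volumes had been provisioned, which an autoscaler could use to
	// estimate how many pending volumes fit on a brand-new node. Unset would
	// mean "unknown/unlimited".
	TotalCapacity *resource.Quantity
}
```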

Member

FYI @liggitt @roycaihw (API reviewers)

Contributor Author

I understand that's how CSI specification works and we may not be able to change it, but it seems that capacity specification isn't strictly defined.

That's why I added a more strictly defined "maximum volume size" to the spec. That is what Kubernetes uses if available, instead of the more loosely defined "capacity".

We could say that for purposes of autoscaling we treat it as sum of all volumes that can be created. This is not a perfect prediction as it doesn't take fragmentation, etc into account, but it seems much more realistic than assuming we can create infinite number of volumes and the only limitation is on the size of individual volume.

I don't think that making assumptions about how CSI drivers handle storage capacity is going to help. We've discussed that at length in SIG Storage and in CSI meetings and the consensus there was that there are always some vendor specific exceptions that cannot be handled in a generic way.

On a sudden scale up, multiple pods will end up failing. then we need to wait for them to fail to create replacement pods.

Failing how? They won't get scheduled unless their volumes were successfully created. Until then they remain in the queue and will eventually land on some node.
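
For reference, a simplified sketch (not the actual volume binding plugin code) of the check described above: prefer the strictly defined maximum volume size when the driver reports it, and only fall back to the looser capacity value otherwise.

```go
package schedulersketch

import (
	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// volumeFits decides whether a volume of the requested size can be provisioned
// in the topology segment described by one CSIStorageCapacity object.
func volumeFits(request resource.Quantity, c storagev1.CSIStorageCapacity) bool {
	if c.MaximumVolumeSize != nil {
		// Strictly defined: a single volume of at most this size can be created.
		return c.MaximumVolumeSize.Cmp(request) >= 0
	}
	if c.Capacity != nil {
		// Looser fallback: compare against the reported available capacity.
		return c.Capacity.Cmp(request) >= 0
	}
	// No usable information: treat the segment as not fitting the volume.
	return false
}
```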

Contributor Author

I don't think we need to go that far. Both existing fields are intentionally marked as optional. We can always add new ones and stop populating the old ones, if we find that this will work better. Nothing in the current discussion indicates that there'll be a need for that, though, therefore I am confident that this API is good for GA.

Contributor Author

I had an idea for how we can support simulated volume provisioning in the autoscaler without (at least initially) extending this API or the CSI spec:

  • the CSI driver must report "total capacity" in GetCapacityResponse.available_capacity and support maximum_volume_size
  • the CSI driver must be local to nodes and have a single topology key with the Kubernetes node name as value
  • the CSIDriver object must have a special annotation for the autoscaler ("linear, local storage")
  • nothing changes for kube-scheduler
  • autoscaler simulates scheduling like this:
    • when cloning an existing node to create a node group, it replaces the node name in the CSI driver's topology label
    • it clones the CSIStorageCapacity objects for each storage class, changes their topology and removes the MaximumVolumeSize field: what remains is the total capacity without any volumes allocated
    • when the "selected node" label gets set for a PVC while scheduling a pod, the "AvailableCapacity" gets reduced by the size of that PVC: this simulates volume creation on that node and ensures that eventually pods no longer fit the node

There are some implementation challenges; for example, the volume binder code that sets "selected node" might not currently get called by the autoscaler. The in-tree volume binder code must be made a bit more flexible to accept stores instead of creating informers itself, but other than that it doesn't need any special logic. It's also not a generic solution, so we must have a good understanding of which CSI drivers are going to use this and check in advance that they fit the assumptions.
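
A rough sketch of the two simulation steps described above; the function names are illustrative, this is not Cluster Autoscaler code, and it assumes a node-local driver with a single topology key whose value is the node name:

```go
package autoscalersketch

import (
	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cloneCapacityForFictionalNode creates the CSIStorageCapacity object for a
// fictional node cloned from an existing one: the topology label is rewritten
// to the new node name and MaximumVolumeSize is dropped, leaving Capacity
// (what the comment above calls "AvailableCapacity") as the total capacity of
// a pristine node under the "linear storage" assumption.
func cloneCapacityForFictionalNode(src *storagev1.CSIStorageCapacity, topologyKey, newNodeName string) *storagev1.CSIStorageCapacity {
	clone := src.DeepCopy()
	clone.ObjectMeta = metav1.ObjectMeta{Name: src.Name + "-" + newNodeName, Namespace: src.Namespace}
	clone.NodeTopology = &metav1.LabelSelector{
		MatchLabels: map[string]string{topologyKey: newNodeName},
	}
	clone.MaximumVolumeSize = nil
	return clone
}

// simulateProvision mimics what happens when the "selected node" gets set for
// a PVC during simulated scheduling: the available capacity of that node's
// segment shrinks by the PVC size, so at some point further pods with volumes
// no longer fit on the fictional node.
func simulateProvision(c *storagev1.CSIStorageCapacity, pvcSize resource.Quantity) {
	if c.Capacity != nil {
		c.Capacity.Sub(pvcSize)
	}
}
```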

Member

If MaximumVolumeSize is simply removed, Cluster Autoscaler may decide to provision a node even though a workload can't get scheduled there due to its request exceeding the max size (but being less than total capacity). Proper CA support needs to somehow get this information.

Also, this doesn't really help in the scale-from-0 scenario. CA needs to somehow know what new nodes will look like, without an existing sample node to clone.

Contributor Author

I probably shouldn't have posted that idea because we are now back to discussing a solution that doesn't belong in this KEP 😅 Let's continue anyway, but let's keep in mind that this isn't blocking the merging of this PR.

The entire approach of modeling volume provisioning in the autoscaler has to make assumptions about the storage system. The key assumption here is "storage is linear" in which case "maximum volume size" becomes irrelevant because any volume whose size is smaller than the remaining capacity can get created. If that isn't close enough to reality, a more complex model will be needed which then might include some information about "volumes have a minimum size, a maximum size, size must be aligned with X GB, etc.".

Also, this doesn't really help in the scale-from-0 scenario. CA needs to somehow know what new nodes will look like, without an existing sample node to clone.

The CSI driver for local storage cannot help with scale-from-0 because it isn't running. The cloud provider or some other central component would have to be extended to provide information for such a scenario. That's a different approach that gives up flexibility (cannot use arbitrary CSI drivers) in favor of scale-from-0 support.

Member

Sorry for starting a solution discussion here, happy to take it elsewhere 🙂

The entire approach of modeling volume provisioning in the autoscaler has to make assumptions about the storage system. The key assumption here is "storage is linear" in which case "maximum volume size" becomes irrelevant because any volume whose size is smaller than the remaining capacity can get created. If that isn't close enough to reality, a more complex model will be needed which then might include some information about "volumes have a minimum size, a maximum size, size must be aligned with X GB, etc.".

Yes, if we want proper autoscaling support, CA has to use a more complex model, which will make sense for most/all storage implementations out there.

The CSI driver for local storage cannot help with scale-from-0 because it isn't running. The cloud provider or some other central component would have to be extended to provide information for such a scenario. That's a different approach that gives up flexibility (cannot use arbitrary CSI drivers) in favor of scale-from-0 support.

I think this could work through a separate API: different driver extensions would create different objects for CA to use for resource predictions.

about hardware that hasn't been provisioned yet and doesn't know about
autoscaling.

This problem can be solved by the cluster administrator. They can find out how


Kubernetes is already notoriously hard to use, and by pushing our implementation challenges onto the cluster admin we're only making it that much worse. We should instead think about how we can change our implementation to make it as easy as we can.

We have already solved the exact same issue for the Node object itself: the autoscaler doesn't require the user to create a 'template node' object manually; we automatically infer it from existing nodes. Can we do the same for CSIStorageCapacity? We can identify all CSIStorageCapacity objects for an existing node in a node group and copy them.
Of course, the issue is passing those in-memory objects to the plugin. But that can be solved, for example by refactoring the scheduler code to include CSIStorageCapacity in the scheduler snapshot rather than fetching it from an informer in the plugin. We already do exactly that for Node objects (an additional benefit of this solution is that the entire scheduling cycle would use a consistent snapshot of the cluster, eliminating any potential races caused by a CSIStorageCapacity object being created mid-scheduling).
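
As a minimal sketch of that snapshot idea (ClusterSnapshotSketch is a hypothetical type, not the existing autoscaler or scheduler snapshot):

```go
package snapshotsketch

import (
	corev1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
)

// ClusterSnapshotSketch carries CSIStorageCapacity alongside the nodes so that
// a whole (simulated) scheduling cycle sees one consistent view of the cluster
// instead of each plugin reading from its own informer.
type ClusterSnapshotSketch struct {
	Nodes             []*corev1.Node
	StorageCapacities []*storagev1.CSIStorageCapacity
}

// addFictionalNode clones both the template Node and its matching
// CSIStorageCapacity objects into the snapshot when the autoscaler simulates
// a scale-up of a node group.
func (s *ClusterSnapshotSketch) addFictionalNode(node *corev1.Node, caps []*storagev1.CSIStorageCapacity) {
	s.Nodes = append(s.Nodes, node)
	s.StorageCapacities = append(s.StorageCapacities, caps...)
}
```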

Contributor Author

We automatically infer it from existing nodes. Can we do the same for CSIStorageCapacity?

The information reported by the CSI driver for existing nodes is about remaining capacity. We would have to extend the CSI spec to report capacity as if no volumes had been created.

That may be doable, but then what about scale up from zero? That can't work that way.


We would have to extend the CSI spec to report capacity as if no volumes had been created

What if we had an initialCapacity field in CSIStorageCapacity, set it to be the same as the capacity when the object is first created, and never modified it later on? That'd require a bit of extra logic to set the field, but I don't think it requires changes in the CSI spec?

Agreed that scale-from-zero is trickier. I don't see how we can solve it for the general case without some extra information coming from somewhere. It does feel very similar to custom resources and GPUs, though, and for those we defer to cloud providers to handle the problem. I think a solution that gives more opportunity to either automate this or otherwise expose an API consistent with the given provider would be preferred (e.g. custom resources are specified in ASG tags).

Contributor Author

What if we had an initialCapacity field in CSIStorageCapacity, set it to be the same as the capacity when the object is first created, and never modified it later on?

That assumes that the driver gets to see the node in a "pristine" state and that nothing changes later on. But it isn't unusual for disks to fail or new ones to get attached at runtime. It also doesn't help when the driver gets uninstalled completely (which, depending on the deployment approach, also clears the CSIStorageCapacity objects) and then gets reinstalled while there are still some volumes. When the driver then comes up, the sidecar doesn't get the full capacity.

by adding a readiness check to the autoscaler that compares the existing
CSIStorageCapacity objects against the expected ones for the fictional node.

A proof-of-concept of this approach is available in


My issue is that it is very much a PoC. I can see how it can work for a simple test scenario, but I don't think it's a suitable final solution:

  • The UX is extremely complicated, requiring multiple complex steps from cluster admin. Debugging this will be super hard for anyone not intimately familiar with CA internals.
  • I don't think this approach scales to large clusters (see my comment about capacity above).

@alculquicondor
Member

At a high level, it sounds like some of the questions have answers or there were clear conclusions from SIG Storage and CSI meetings. But the KEP is not up to date, so it's hard to make a decision on whether the API is ready for GA.

As discussed during PR review, the section was focused on local storage without
explicitly saying that. Network-attached storage needs to be dealt with
differently and has a different problem (scale from zero and lacking
capacity information).
API review during the implementation led to some changes compared to what had
been proposed in the KEP. The separation into spec and status was removed. Later,
"maximum volume size" was added once it became available in the CSI spec.
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 4, 2022
@pohly
Contributor Author

pohly commented Mar 4, 2022

But the KEP is not up to date, so it's hard to make a decision on whether the API is ready for GA.

I've updated the autoscaler section and copied the actual API description back into the KEP.

I appreciate that you are taking the time to review the autoscaler proposal. Your comments about node-local vs. network-attached storage made me aware that scale from zero is still problematic for network-attached storage. On the other hand, lack of storage capacity tracking isn't such a big deal for network-attached storage, and if there is interest, solutions also seem possible for scale up from zero.

I agree that usability of that proposal is poor. But I also think that we will need feedback from specific storage providers and users who are facing that problem before we can come up with something simpler and that this shouldn't hold back this KEP. 😓

@alculquicondor
Member

I agree that usability of that proposal is poor. But I also think that we will need feedback from specific storage providers and users who are facing that problem before we can come up with something simpler and that this shouldn't hold back this KEP.

That sounds like the API needs more time in beta stage

@pohly
Contributor Author

pohly commented Mar 4, 2022

Only if there is an expectation that future changes cannot be done as additions to a GA API. I don't see that, quite the opposite. Simplifying the autoscaler configuration is something that can be done with additional information, perhaps even with an out-of-tree CRD. In that case no changes to the in-tree API will be needed at all.

Holding the API in beta penalizes those users who don't need autoscaling. I don't think that this is the right choice.

@alculquicondor
Member

In my ideal world, we would have "TotalCapacity" and "MaxVolumeSize", and then I wouldn't see the point of having "AvailableCapacity", so I wouldn't like having that field in a GA API.

But maybe this is not realistic. Let me take another look at the updated PR on Monday.

@alculquicondor
Member

Can you update all the "Drawbacks" subsections with whatever context was gathered from SIG storage and CSI meetings?

@pohly
Contributor Author

pohly commented Mar 4, 2022

In my ideal world, we would have "TotalCapacity" and "MaxVolumeSize" and then I wouldn't see a point of having "AvailableCapacity", so I wouldn't like having that field in a GA API.

"Total capacity" is pretty much off the table. The storage community seems to dislike the idea of pretending that storage is linear, and such a "total capacity" only makes sense under that assumption.

One of the design decisions during API review was that the CSIStorageCapacity should mirror what CSI provides. The logic then is in the consumer of that information. From that perspective it makes sense to have both values, even if the scheduler only needs one.

@pohly
Contributor Author

pohly commented Mar 7, 2022

Can you update all the "Drawbacks" subsections with whatever context was gathered from SIG storage and CSI meetings?

Done. That whole section was partially outdated, thanks for bringing it up.

I also reran the scale tests because I hadn't properly written up the results from the first time and because I wanted to make sure that the recent external-provisioner still works. See https://github.com/kubernetes-csi/csi-driver-host-path/blob/d6d9639077691986d676984827ea4dd7ee0c5cce/docs/storage-capacity-tracking.md

Key results:

  • at 10 nodes and 1000 pods+volumes, enabling storage capacity tracking makes pod scheduling over 2 times faster; without it, it's pure luck that scheduling succeeds at all
  • at 100 nodes and 10000 pods+volumes, pod scheduling gets stuck without storage capacity tracking (finding the right node(s) by chance no longer works)
  • at the same scale with storage capacity tracking enabled, a small tweak was necessary in external-provisioner to keep the apiserver responsive - with that tweak the test passed all three times that I ran it

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 8, 2022
@alculquicondor
Member

/retitle KEP-1472: storage capacity: update drawbacks

@k8s-ci-robot k8s-ci-robot changed the title KEP-1472: storage capacity: document autoscaler support KEP-1472: storage capacity: update drawbacks Mar 8, 2022
@pohly
Contributor Author

pohly commented Mar 9, 2022

/assign @msau42

For approval.

@MaciekPytel

I'm not saying we should block GA on this, but I'd like to request that we amend this PR to describe the problems with autoscaler integration (primarily the fact that CA can't predict the result of scheduling multiple pods) and mention that future work will be required to address those issues.

@pohly
Contributor Author

pohly commented Mar 9, 2022

"Lack of storage capacity modeling will cause the autoscaler to scale up clusters more slowly because it cannot determine in advance that multiple new nodes are needed." is already called out as a drawback under "lack of storage capacity modeling".

That something else would need to be done to address that is implied. We could of course mention the need for future work explicitly, but that's a slippery slope towards outlining such a technical solution in a KEP that isn't about that solution. I'd prefer to keep the KEP as it is now, but don't feel strongly about it.

I'll let @msau42 decide that.

@alculquicondor
Member

Maybe clarify that the autoscaler support PoC needs a lot of work to be easily configurable (we talked about a CA specific CRD that I don't see included in the KEP) and it doesn't address the problem of "lack of storage capacity modeling". And then you can explicitly say: "the CSI API needs work to improve this situation".

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2022
@pohly
Contributor Author

pohly commented Mar 9, 2022

I've added some more text about enhancements for CA.

@alculquicondor can you re-add LGTM if you agree with the change?

@alculquicondor
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2022
@MaciekPytel

To reiterate my point - the current wording reads to me like the problem of integration with the autoscaler is mostly solved, with maybe a few gotchas. I disagree with that. The main part of the problem (the autoscaler's ability to predict the result of scheduling all the pending pods) is not addressed.

I don't see how we can solve it without large changes, likely involving both extending the CSI spec and, possibly, changes in the scheduler. I'm fine with the fact that those solutions are out of scope. But since you're adding a section about integration with Cluster Autoscaler, I think it should clearly list all the problems.

"Lack of storage capacity modeling will cause the autoscaler to scale up clusters more slowly because it cannot determine in advance that multiple new nodes are needed." is already called out as a drawback under "lack of storage capacity modeling".

I agree. It's just that the way it's mentioned suggests it's a limitation of the PoC implementation and downplays its significance. I'd like to shift the emphasis to make it clearer that the problem is not solved, but there is a workaround implementation that may work for some cases (small scale, or the user doesn't care about latency).

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2022
@pohly
Contributor Author

pohly commented Mar 10, 2022

@MaciekPytel: I'm not trying to downplay the limitations. I've pushed another commit with "the current understanding is that further work is needed" as the conclusion for the prototype. I'd also be happy to get suggestions for the right wording.

@xing-yang
Contributor

/retest

@alculquicondor
Member

still LGTM, but I'll leave the official one to @MaciekPytel

Member

@thockin thockin left a comment


Opinion: the intersection of autoscaling and storage capacity will require more than just tweaks to this API. I'm not convinced that blocking this will produce any fundamentally different API, but I also admit I don't have all the autoscaler context.

I think this decision belongs to @msau42



@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
simulation may decide otherwise.
Member

In terms of API stability, we cannot change the interpretation of existing fields. Any new semantic we want in the future would have to be a new field.

Not saying it's the way to go, but a new apiversion could repurpose a field name as long as the round-trip logic converts, right?

that logic is implemented inside the CSI driver. The CSI driver doesn't know
about hardware that hasn't been provisioned yet and doesn't know about
autoscaling.

Member

I think the concern here is signing off on this as an agreed-upon solution. Instead, let's word this to indicate that further design and investigation are still needed to solve these problems:

  • How a CSI driver can report capacity on a pristine, simulated node
  • How the autoscaler can provide a CSI driver with the storage information it needs to be able to model capacity based on the environment it's modeling.
  • How to improve the batch scheduling use case

And then park the existing discussion under "Alternatives" (don't want to lose the great discussion that's happening here).

Contributor Author

How the autoscaler can provide a CSI driver with the storage information it needs to be able to model capacity based on the environment it's modeling.

Isn't it meant to be the other way around: the CSI driver provides information to the autoscaler and the autoscaler does the modeling?

And then park the existing discussion under "Alternatives" (don't want to lose the great discussion that's happening here).

I've pushed an update that under "drawbacks" describes the problem and that further work is needed, then under "alternatives" the description of the current prototype.

Member

Isn't it meant to be the other way around: the CSI driver provides information to the autoscaler and the autoscaler does the modeling?

I think there may need to be information passed both ways. In order for a CSI driver to estimate a capacity without being able to actually inspect a node, it's going to need information about the future node, such as how many disks are going to be there and their capacities.

This makes it clear that the prototype is not the suggested solution.
@msau42
Member

msau42 commented Mar 17, 2022

/lgtm
/approve

The existing autoscaler ideas are parked under "alternatives" for future discussion. For now, the KEP notes that further investigation is needed, and autoscaler integration will be tracked in a separate future KEP.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 17, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2022
@k8s-ci-robot k8s-ci-robot merged commit 0961bb8 into kubernetes:master Mar 17, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Mar 17, 2022