
scale up: tests and support for pods with volumes #3887

Closed
pohly wants to merge 3 commits

Conversation

pohly
Contributor

@pohly pohly commented Feb 15, 2021

Whether a pod has unbound volumes influences scheduling decisions and
thus the scale up decisions in cluster autoscaler.

This PR:

  • documents the problem in the FAQ
  • adds support for autoscaling in combination with storage capacity tracking and a suitable cluster configuration
  • adds tests

These three new test cases cover:

  • a pod with an unbound pvc using late binding -> can scale up
  • the same with storage capacity feature enabled -> cannot scale up
    without CSIStorageCapacity
  • the same with manually configured CSIStorageCapacity -> can scale up

Using this in practice depends on having different CSI topology labels for template nodes and on additional configuration for each CSI driver and its storage classes. This is something that cluster administrators must take care of.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 15, 2021
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 15, 2021
// unbound PVC with late binding can be scheduled only if the
// node candidates are known to have capacity available.
// This can be configured for node candidates by giving them
// special labels (TODO: how?) and then manually creating
Contributor Author

This is the key conceptual question that I have: when cluster autoscaler creates node candidates, is it configurable which node labels those have?

This is not just relevant for the CSIStorageCapacity object below, but also for topology-aware pod scheduling. Suppose a PV has been created with topology constraints as provided by a CSI driver. How can the scheduler know that the node candidate is in the same topology segment as the existing volume? That's based on node labels, so those must get set somehow.

Contributor

CA will copy an existing node to use as a node template. Any topology labels should be copied as well (CA assumes a single NodeGroup is made of nodes that are identical for scheduling purposes, i.e. they're in one zone).
For scale from 0 it's up to the cloud provider's TemplateNodeInfo to provide a template node that has all the relevant labels.
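For reference, a cloud provider's TemplateNodeInfo would supply those labels roughly like this (a simplified sketch; exampleNodeGroup, its fields, and the label values are made up):

// Sketch only: how a cloudprovider.NodeGroup implementation could build a
// template node for scale from 0. Names and label values are examples.
func (ng *exampleNodeGroup) TemplateNodeInfo() (*schedulerframework.NodeInfo, error) {
	node := &apiv1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name: fmt.Sprintf("template-node-for-%s", ng.Id()),
			Labels: map[string]string{
				// All labels a freshly created node in this group would
				// register with, including any CSI topology labels.
				"topology.kubernetes.io/region":    "region-1",
				"topology.kubernetes.io/zone":      "zone-a",
				"node.kubernetes.io/instance-type": "example-type",
			},
		},
	}
	// Capacity, allocatable and ready conditions are omitted in this sketch.
	nodeInfo := schedulerframework.NewNodeInfo()
	nodeInfo.SetNode(node)
	return nodeInfo, nil
}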

Contributor Author

Looks like I need to read up on node groups. Do you perhaps have a link to documentation about how to configure those?

CA will copy an existing node to use as a node template.

How does it pick that node?

Any topology labels should be copied as well (CA assumes a single NodeGroup is made of nodes that are identical for scheduling purposes, ie. they're in one zone).

Hmm, I can imagine that it might be more complex. For example, the same zone (= data center) might be able to provide machines with different hardware features (for example, optional CPU instructions like AVX) and those features might be exposed as node labels.

When considering node-local storage, each node is in its own topology zone. Cloning the labels of an existing node would be wrong for the labels set for a CSI driver on that node.

Contributor

Looks like I need to read up on node groups. Do you perhaps have a link to documentation about how to configure those?

A NodeGroup is an internal cluster autoscaler concept for a resizable set of identical nodes. How it maps to underlying infrastructure will be different for each cloud provider (e.g. on AWS a NodeGroup is an ASG, on GCP it's a ManagedInstanceGroup) and the way to set it up will also differ by cloud provider. The instructions for most providers can be found in cluster-autoscaler/cloudprovider/<provider_name>/README.md.

How does it pick that node?

Randomly. The assumption is that a NodeGroup is a set of identical nodes, so it doesn't matter which one CA uses as a template for representing future nodes.

Hmm, I can imagine that it might be more complex. For example, the same zone (= data center) might be able to provide machines with different hardware features (for example, optional CPU instructions like AVX) and those features might be exposed as node labels.

Two nodes using different hardware should be part of different NodeGroups. Cluster Autoscaler can autoscale a cluster with multiple NodeGroups.

A very high-level description of how CA works (a toy sketch of the loop follows the list):

  • A cluster is made of different NodeGroups, which represent some sort of resizable collection of VMs in the underlying cloud. All CA ever does to actuate its decisions is resize a NodeGroup. The expectation is that the underlying cloud will then provide more identical VMs and those VMs will self-register in the cluster.
  • A cluster can be made of multiple NodeGroups. Some or all of those may be autoscaled.
  • If multiple NodeGroups are autoscaled, CA will run a separate binpacking simulation for each of those.
    • To perform a simulation it will create an in-memory Node object that represents a new node that can be added to the cluster. It will then use embedded scheduler code to "schedule" pending pods on this node in memory. CA will keep adding more in-memory nodes (copies of the original template node) until all the pending pods can be "scheduled". The outcome of this simulation is how many nodes would need to be added to a particular NodeGroup to make all the pending pods schedulable.
    • Once the simulation is done, CA will reset it by deleting all in-memory nodes and erasing any "scheduling" decisions it took. It will then run the simulation for the next NodeGroup. This process is repeated for each NodeGroup in the cluster.
    • A configurable heuristic ("expander") will look at the results of the simulations and pick one of them. CA will increase the target size of the selected NodeGroup (i.e. ask the underlying cloud for more VMs).
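A toy model of the loop above, just to make the flow concrete (everything here is illustrative; the real code runs a full scheduler simulation instead of the podsPerNode shortcut):

package main

import "fmt"

// Toy model of the simulation loop described above.
type nodeGroup struct {
	name        string
	maxNodes    int
	podsPerNode int // stand-in for what the embedded scheduler would decide
}

// simulate returns how many nodes this group would need to add so that all
// pending pods become schedulable, capped at the group's maximum.
func simulate(group nodeGroup, pendingPods int) (nodesNeeded int, allScheduled bool) {
	unscheduled := pendingPods
	for unscheduled > 0 && nodesNeeded < group.maxNodes {
		nodesNeeded++                    // add one in-memory copy of the template node
		unscheduled -= group.podsPerNode // "schedule" pods onto it
	}
	return nodesNeeded, unscheduled <= 0
}

func main() {
	groups := []nodeGroup{
		{name: "pool-a", maxNodes: 5, podsPerNode: 2},
		{name: "pool-b", maxNodes: 3, podsPerNode: 4},
	}
	pending := 6

	// One independent binpacking simulation per NodeGroup; state is "reset"
	// between simulations simply because nothing is shared across iterations.
	bestName, bestNodes := "", 0
	for _, g := range groups {
		nodes, ok := simulate(g, pending)
		fmt.Printf("%s: %d node(s) needed, all pods schedulable: %v\n", g.name, nodes, ok)
		// Trivial "expander": pick the option that needs the fewest nodes.
		if ok && (bestName == "" || nodes < bestNodes) {
			bestName, bestNodes = g.name, nodes
		}
	}
	fmt.Printf("scale up %s by %d node(s)\n", bestName, bestNodes)
}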

When considering node-local storage, each node is in its own topology zone. Cloning the labels of an existing node would be wrong for the labels set for a CSI driver on that node.

I think I'm missing some context here. Is the CSIStorageCapacity object for a node or for some class of identical nodes?

Contributor

Note how the algorithm described above doesn't work with features like local PVs (since the scheduler code takes PVs from the informer and not the snapshot, there is no way for CA to inject an in-memory PV object that represents the local volume on an in-memory node).

Contributor Author

I think I'm missing some context here. Is the CSIStorageCapacity object for a node or for some class of identical nodes?

That depends on the storage system. For network-attached storage, it may be for multiple nodes in a certain zone because all those nodes share the same underlying storage. For local storage, each node has its own storage and, once the node is up, it will have CSIStorageCapacity objects that represent storage on that node.

If all nodes in a NodeGroup are alike and have the same amount of storage capacity, then one CSIStorageCapacity object can describe those nodes. We just have to take care that objects for nodes that don't exist yet and nodes that do exist are not in conflict with each other. That can be achieved by labeling the node candidates differently - not something that CA currently supports, but it doesn't sound hard to add.

This is then using the existing APIs of the volume binder (informers for CSIStorageCapacity, PVC, Node, CSIDriver, etc.). If we change those APIs, then other solutions are also possible.

Contributor Author

Note how the algorithm described above doesn't work with features like local PVs (since the scheduler code takes PVs from the informer and not the snapshot, there is no way for CA to inject an in-memory PV object that represents the local volume on an in-memory node).

This is where we need different solutions for different kinds of storage systems. For local storage managed by a CSI driver with dynamic provisioning, there won't be any such local PVs and there won't be a need to inject fake PVs. Instead, we need to inform the scheduler about capacity on the non-existent nodes. I was focusing on that approach.

I don't know which approach will be more common in clusters where CA will be used. 🤷

// objects that were added when creating the fake client. Without
// this wait, those objects won't be visible via the informers when
// the test runs.
synced := informerFactory.WaitForCacheSync(stop)
Contributor

I'm worried about this without a timeout. I've seen cases of WaitForCacheSync() hanging practically forever.
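A bounded variant could look roughly like this (a sketch for the test; the helper name is made up, and informerFactory/stop are the variables from the snippet above):

import (
	"reflect"
	"testing"
	"time"

	"k8s.io/client-go/informers"
)

// waitForCacheSyncWithTimeout runs WaitForCacheSync in a goroutine and fails
// the test if the caches have not synced within the given timeout, so a
// stuck informer cannot hang the test forever.
func waitForCacheSyncWithTimeout(t *testing.T, factory informers.SharedInformerFactory, stop <-chan struct{}, timeout time.Duration) {
	done := make(chan map[reflect.Type]bool, 1)
	go func() { done <- factory.WaitForCacheSync(stop) }()

	select {
	case synced := <-done:
		for informerType, ok := range synced {
			if !ok {
				t.Fatalf("informer %v failed to sync", informerType)
			}
		}
	case <-time.After(timeout):
		t.Fatalf("informers did not sync within %v", timeout)
	}
}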

@MaciekPytel
Contributor

MaciekPytel commented Feb 15, 2021

I don't think unittests prove much in this case. I think CA integration with PVs is fundamentally broken:

  • VolumeBinder uses internal caches that CA has no way of resetting between running simulations (so one simulation can impact another one).
  • Everything relies on using informers. CA can't inject fake objects (such as local PVs or CSINodes) for template nodes, since those objects come from informers and not the scheduler snapshot.
  • If there are multiple pending pods, I think we have no way to "schedule" one pod in binpacking and see what the impact will be on CSIStorageCapacity. I haven't really dug into this, but I suspect that if one pod can be scheduled, CA is likely going to behave as if all pending pods can be scheduled.

None of this will be a problem for a unittest, given that we can easily control what's in the informer during the test, and since we only run once the problems with isolation won't have a chance to manifest.

I think the ultimate solution involves refactoring the relevant code in k/k to expose interfaces that allow control over internal state (similarly to how the snapshot gave CA a way to control the list of pods and nodes perceived by the scheduler, finally giving us a chance to fix the problems with InterPodAffinity) and then making CA understand volumes well enough to account for them in simulations. Unfortunately this is unlikely to happen anytime soon, so in the meantime coming up with a list of cases that are and aren't supported by CA would be great, but I'd like a more comprehensive proof than a unittest for those cases.

@pohly
Contributor Author

pohly commented Feb 15, 2021

I don't think unittests prove much in this case. I think CA integration with PVs is fundamentally broken

I'm starting to agree. I was asked to show how CSIStorageCapacity will affect CA, but the problem seems to be much bigger than just that.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 5, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2021
@pohly
Contributor Author

pohly commented Aug 2, 2021

/remove-lifecycle stale

I intend to come back to this by setting up some real environment with cluster autoscaler and some CSI driver where I can try out this idea.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 2, 2021
@pohly
Contributor Author

pohly commented Aug 31, 2021

I ran an experiment with a real cluster on Azure. However, this turned out to be a dead end for now because storage capacity tracking is not supported yet by AKS. EDIT: it wasn't in the default Kubernetes version, but is available when asking to install Kubernetes 1.21.1.

The cluster with 1.21.1 was created with:

az aks create -g cloud-native --name pmem --generate-ssh-keys --node-count 1 --kubernetes-version 1.21.1
az aks nodepool add --cluster-name pmem --node-count 1 --name workerpool --resource-group cloud-native --node-vm-size Standard_DS1_v2 --tags min=1 max=5 cluster-autoscaler-enabled=true cluster-autoscaler-name=pmem
az aks get-credentials --resource-group cloud-native --name pmem

When it runs, it has two nodes from different pools:

$ kubectl get nodes
NAME                                 STATUS   ROLES   AGE    VERSION
aks-nodepool1-15818640-vmss000000    Ready    agent   16m    v1.20.7
aks-workerpool-15818640-vmss000000   Ready    agent   4m4s   v1.20.7

The worker pool is set up for autoscaling (the tags above). I ran the cluster autoscaler under a debugger like this:

dlv debug --check-go-version=false ./ -- --cloud-provider=azure -v=5 --logtostderr --kubeconfig /home/pohly/.kube/config --cloud-config `pwd`/azure-config.yaml --leader-elect=false --node-group-auto-discovery=label:cluster-autoscaler-enabled=true,cluster-autoscaler-name=pmem

The config has:


  "clusterName": "pmem",
  "resourceGroup": "mc_cloud-native_pmem_northeurope",
  "nodeResourceGroup": "mc_cloud-native_pmem_northeurope",
  "tenantID": "46c98d88-e344-4ed4-8496-4ed7712e255d",
  "subscriptionID": "33bde97e-b22f-4a54-9c24-7bb643064a58",
  "vmType": "vmss",

  "useCLI": true
}

Authorization was based on the az login credentials.

@pohly
Contributor Author

pohly commented Aug 31, 2021

With the Azure cluster above, I can reproduce the problem that the autoscaler does not scale up when pods need additional storage.

I used the CSI hostpath driver with fake 100Gi of "fast" storage per node, deployed from https://github.com/pohly/csi-driver-host-path/commits/cluster-autoscaler such that each worker pool node has one instance of the driver.

deploy/kubernetes-distributed/deploy.sh

Then I created this example stateful set:

apiVersion: v1
kind: Service
metadata:
  name: hostpath-service
  labels:
    app.kubernetes.io/instance: hostpath.csi.k8s.io
    app.kubernetes.io/part-of: csi-driver-host-path
    app.kubernetes.io/name: csi-hostpath-example
    app.kubernetes.io/component: socat
spec:
  type: NodePort
  selector:
    app.kubernetes.io/instance: hostpath.csi.k8s.io
    app.kubernetes.io/part-of: csi-driver-host-path
    app.kubernetes.io/name: csi-hostpath-example
    app.kubernetes.io/component: example
  ports:
  - port: 10000 # fixed port inside the pod, dynamically allocated port outside
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-app
  labels:
    app.kubernetes.io/instance: hostpath.csi.k8s.io
    app.kubernetes.io/part-of: csi-driver-host-path
    app.kubernetes.io/name: csi-hostpath-example
    app.kubernetes.io/component: example
spec:
  serviceName: "unused"
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: hostpath.csi.k8s.io
      app.kubernetes.io/part-of: csi-driver-host-path
      app.kubernetes.io/name: csi-hostpath-example
      app.kubernetes.io/component: example
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: hostpath.csi.k8s.io
        app.kubernetes.io/part-of: csi-driver-host-path
        app.kubernetes.io/name: csi-hostpath-example
        app.kubernetes.io/component: example
    spec:
      containers:
        - name: socat
          image: alpine/socat:1.0.3
          args:
            - 'tcp-listen:10000,fork,reuseaddr,crlf'
            - 'SYSTEM:echo hello world'
          volumeMounts:
          - mountPath: /data
            name: data
      volumes:
      - name: data
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
              - ReadWriteOnce
              resources:
                requests:
                  # All storage of a single node.
                  storage: 100Gi
              storageClassName: csi-hostpath-fast

When I created that, one pod ran. When scaling up (kubectl scale --replicas=2 statefulset/example-app), the second pod failed to start (as intended):

$ kubectl describe pods/example-app-1
...
Events:
  Type     Reason             Age                  From                Message
  ----     ------             ----                 ----                -------
  Warning  FailedScheduling   9m19s                default-scheduler   0/2 nodes are available: 2 persistentvolumeclaim "example-app-1-data" not found.
  Normal   NotTriggerScaleUp  8m22s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) did not have enough free storage
  Warning  FailedScheduling   3s (x10 over 9m18s)  default-scheduler   0/2 nodes are available: 2 node(s) did not have enough free storage.

The cluster-autoscaler event came from manually invoking it.

@pohly
Contributor Author

pohly commented Aug 31, 2021

The key question was what the simulated node looks like. Here it is:

(dlv) p node.Name
"template-node-for-aks-workerpool-15818640-vmss-27035017268218663...+2 more"
(dlv) p node.Labels
map[string]string [
	"topology.kubernetes.io/region": "northeurope", 
	"failure-domain.beta.kubernetes.io/zone": "0", 
	"topology.hostpath.csi/node": "aks-workerpool-15818640-vmss000000", 
	"kubernetes.io/role": "agent", 
	"node-role.kubernetes.io/agent": "", 
	"kubernetes.io/arch": "amd64", 
	"beta.kubernetes.io/instance-type": "Standard_DS1_v2", 
	"storageprofile": "managed", 
	"topology.disk.csi.azure.com/zone": "", 
	"failure-domain.beta.kubernetes.io/region": "northeurope", 
	"kubernetes.io/hostname": "template-node-for-aks-workerpool-15818640-vmss-27035017268218663...+2 more", 
	"beta.kubernetes.io/arch": "amd64", 
	"kubernetes.azure.com/cluster": "MC_cloud-native_pmem_northeurope", 
	"kubernetes.azure.com/mode": "user", 
	"agentpool": "workerpool", 
	"kubernetes.azure.com/node-image-version": "AKSUbuntu-1804gen2containerd-2021.08.14", 
	"kubernetes.azure.com/role": "agent", 
	"kubernetes.io/os": "linux", 
	"storagetier": "Premium_LRS", 
	"topology.kubernetes.io/zone": "0", 
	"kubernetes.azure.com/os-sku": "Ubuntu", 
	"node.kubernetes.io/instance-type": "Standard_DS1_v2", 
	"beta.kubernetes.io/os": "linux", 
]

topology.hostpath.csi/node: aks-workerpool-15818640-vmss000000 was copied from the existing node aks-workerpool-15818640-vmss000000. This is not correct in this case because a new node would have a different value. But this depends on the CSI driver. For "known" CSI drivers, a new command line parameter for the autoscaler could inform it that it needs to mangle this particular label. For "unknown" drivers there's not enough information available to predict whether the label can be copied verbatim.

The incorrect value has the effect that the CSIStorageCapacity objects for the existing node aks-workerpool-15818640-vmss000000 are used when checking the fictional template-node-for-aks-workerpool-15818640-vmss-27035017268218663, because those objects use this node selector:

  nodeTopology:
    matchLabels:
      topology.hostpath.csi/node: aks-workerpool-15818640-vmss000000

@pohly
Contributor Author

pohly commented Sep 1, 2021

I added a command line option that allows replacing labels in template nodes:

--replace-labels ';^topology.hostpath.csi/node=aks-workerpool.*;topology.hostpath.csi/node=aks-workerpool-template;'

Now an administrator can tell the autoscaler that new nodes in that node group will have storage available once they get created:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  name: aks-workerpool-fast-storage
  namespace: kube-system
capacity: 100Gi
maximumVolumeSize: 100Gi
nodeTopology:
  matchLabels:
    # This never matches a real node, only the node templates
    # inside cluster-autoscaler.
    topology.hostpath.csi/node: aks-workerpool-template
storageClassName: csi-hostpath-fast
EOF

And with that, autoscaler scales up:

I0901 10:03:27.094774  215139 scheduler_binder.go:920] Provisioning for 1 claims of pod "default/example-app-1" that has no matching volumes on node "template-node-for-aks-workerpool-15818640-vmss-2703501726821866378" ...
I0901 10:03:27.094951  215139 scheduler_binder.go:742] PVC "default/example-app-1-data" is not bound
I0901 10:03:27.095441  215139 scale_up.go:469] Best option to resize: aks-workerpool-15818640-vmss
I0901 10:03:27.095468  215139 scale_up.go:473] Estimated 1 nodes needed in aks-workerpool-15818640-vmss
I0901 10:03:27.095511  215139 azure_scale_set.go:145] Get scale set size for "aks-workerpool-15818640-vmss"
I0901 10:03:27.095532  215139 azure_scale_set.go:161] Getting scale set ("aks-workerpool-15818640-vmss") capacity: 1
I0901 10:03:27.095557  215139 scale_up.go:587] Final scale-up plan: [{aks-workerpool-15818640-vmss 1->2 (max: 5)}]
I0901 10:03:27.095599  215139 scale_up.go:676] Scale-up: setting group aks-workerpool-15818640-vmss size to 2

The CSI driver pod starts first, creates the new volume, and then the example app also starts on the new node.

It's not all perfect yet, though:

  • While bringing up that one new node, autoscaler decides that it needs yet another one and ends up adding two in total although one would have been sufficient.
  • It doesn't scale down.

I'll debug both further, but in the meantime let me share the modifications that I have right now.

@MaciekPytel: is this going in the right direction?

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 1, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
To complete the pull request process, please assign towca after the PR has been reviewed.
You can assign the PR to them by writing /assign @towca in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 1, 2021
@@ -322,7 +323,7 @@ func computeExpansionOption(context *context.AutoscalingContext, podEquivalenceG
// false if it didn't and error if an error occurred. Assumes that all nodes in the cluster are
// ready and in sync with instance groups.
func ScaleUp(context *context.AutoscalingContext, processors *ca_processors.AutoscalingProcessors, clusterStateRegistry *clusterstate.ClusterStateRegistry, unschedulablePods []*apiv1.Pod,
nodes []*apiv1.Node, daemonSets []*appsv1.DaemonSet, nodeInfos map[string]*schedulerframework.NodeInfo, ignoredTaints taints.TaintKeySet) (*status.ScaleUpStatus, errors.AutoscalerError) {
nodes []*apiv1.Node, daemonSets []*appsv1.DaemonSet, nodeInfos map[string]*schedulerframework.NodeInfo, ignoredTaints taints.TaintKeySet, labelReplacements replace.Replacements) (*status.ScaleUpStatus, errors.AutoscalerError) {
Contributor Author

ignoredTaints and labelReplacements always get passed around together. I was wondering whether it would be worth creating a struct that holds both. Then if something else needs to be added in the future, the change will be more limited.
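A minimal sketch of that idea (the struct name is made up; the field types are the ones from the signature above):

// NodeTemplateOptions would bundle everything that influences how template
// nodes are derived from real nodes, so that future additions don't touch
// every function signature again.
type NodeTemplateOptions struct {
	IgnoredTaints     taints.TaintKeySet
	LabelReplacements replace.Replacements
}

// ScaleUp and friends would then take a single *NodeTemplateOptions parameter
// instead of a growing list of individual arguments.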

Contributor Author

This new parameter promptly led to several merge conflicts. While cleaning those up I'll implement this idea.

labelReplacements = func() *replace.Replacements {
	repl := &replace.Replacements{}
	flag.Var(repl, "replace-labels", "Specifies one or more regular expression replacements of the form ;<regexp>;<replacement>; (any other character as separator also allowed) which get applied one after the other to labels of a node to form a template node. Labels are represented as a single string with <key>=<value>. If the key is empty after replacement, the label gets removed.")
	return repl
}()
Contributor Author

Naming and the format of the parameter are up for debate. Once we have an agreement, I'll add documentation and tests in utils/replace.

This follows how a regex replace is specified for sed.
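For illustration, applying one such ;<regexp>;<replacement>; rule to a label set could look roughly like this (a sketch, not the actual utils/replace code):

import (
	"fmt"
	"regexp"
	"strings"
)

// applyReplacement rewrites each label, rendered as "<key>=<value>", with one
// sed-style rule. An empty key after the rewrite drops the label. Sketch only;
// error handling and multi-rule support are simplified.
func applyReplacement(labels map[string]string, rule string) (map[string]string, error) {
	if len(rule) < 4 {
		return nil, fmt.Errorf("invalid replacement rule %q", rule)
	}
	sep := string(rule[0])
	parts := strings.Split(strings.Trim(rule, sep), sep)
	if len(parts) != 2 {
		return nil, fmt.Errorf("expected %sregexp%sreplacement%s, got %q", sep, sep, sep, rule)
	}
	re, err := regexp.Compile(parts[0])
	if err != nil {
		return nil, err
	}
	result := map[string]string{}
	for key, value := range labels {
		rewritten := re.ReplaceAllString(key+"="+value, parts[1])
		newKey, newValue, _ := strings.Cut(rewritten, "=")
		if newKey == "" {
			continue // label removed by the replacement
		}
		result[newKey] = newValue
	}
	return result, nil
}

With the --replace-labels value shown earlier, this turns topology.hostpath.csi/node=aks-workerpool-15818640-vmss000000 into topology.hostpath.csi/node=aks-workerpool-template.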

Contributor Author

I added documentation to the FAQ.md.

@pohly pohly changed the title RFC: scale up: tests for pods with volumes RFC: scale up: tests and support for pods with volumes Sep 1, 2021
@pohly
Contributor Author

pohly commented Sep 1, 2021

While bringing up that one new node, autoscaler decides that it needs yet another one and ends up adding two in total although one would have been sufficient.

That happens after the first new node comes online. At that time, the CSI driver is not running on it yet and thus the node is assumed to have no storage. When the autoscaler runs a scale-up check during that window, it concludes that there is still no suitable node for the pod that needs a new volume, and therefore it asks for another one. That stops once the CSI driver is up and the pending pod gets scheduled.

Perhaps we need a "node start up delay", i.e. some additional delay after nodes appear before they are considered online?

It doesn't scale down.

I just wasn't waiting long enough... it gets removed again after the 10m scale-down-delay-after-add.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2021
@pohly pohly force-pushed the scale-up-storage-capacity branch 2 times, most recently from 48cc845 to 7f840da on September 22, 2021
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 22, 2021
@MaciekPytel
Contributor

Regarding CustomResourcesProcessor - that works if we can tell apart a node that is waiting for its CSI driver from one that is not (i.e. we need to be able to tell whether a given node is fully initialized or not).

For example, in the case of GPUs we assume that any node that will eventually have a GPU is started with a provider-specific "gpu label" (for example, in GCE cloud.google.com/gke-accelerator). CA knows that any node that has this label but doesn't have nvidia.com/gpu allocatable is not yet fully initialized. The question is whether we can come up with a similar if statement for CSI.

@pohly
Contributor Author

pohly commented Sep 27, 2021

The criteria could be

  • there is CSIStorageCapacity configured for a template node derived from the node, i.e. the admin expects to get storage capacity
  • there is no CSIStorageCapacity object for the actual node

@pohly
Contributor Author

pohly commented Sep 29, 2021

One concern that was brought up is that an admin needs to configure this. To recap, setting this up depends on knowledge about:

  • labels used by a CSI driver and whether that driver manages local storage
  • what the expected capacity is for the different storage classes and node pools
  • how the autoscaler creates node pools, in particular regarding labels for the template nodes

An extension of the CSIDriver object might help with the first point, but I am not sure whether the resulting API can be generic enough to be useful for anything besides autoscaler.

The CSI driver cannot help with the second point in "scale from zero" scenarios where the driver isn't running on any node of a node pool. If it is already running on some node, then perhaps it could publish information for that node as if it were unused, in a separate CSIStorageCapacity object.

How the autoscaler really works in a cluster depends on its version and on who built it; there's not exactly "one" autoscaler, right? So the last point would depend on standardizing on one way of configuring it.

Contributor Author

@pohly pohly left a comment

The criteria could be

  • there is CSIStorageCapacity configured for a template node derived from the node, i.e. the admin expects to get storage capacity
  • there is no CSIStorageCapacity object for the actual node

I implemented that in a new commit, with a slightly more generalized check: there might be more than one expected CSIStorageCapacity object for a node when there are several CSI drivers and/or storage classes.
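Roughly, such a check could look like this (a sketch with simplified signatures; the actual commit's code may differ):

import (
	apiv1 "k8s.io/api/core/v1"
	storagev1beta1 "k8s.io/api/storage/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// nodeStorageReady sketches the criteria above: a node still counts as
// initializing while the template node derived from it matches some
// CSIStorageCapacity objects (storage is expected) but the real node matches
// fewer of them (the CSI driver has not published its capacity yet).
func nodeStorageReady(node, templateNode *apiv1.Node, capacities []*storagev1beta1.CSIStorageCapacity) bool {
	expected, actual := 0, 0
	for _, capacity := range capacities {
		selector, err := metav1.LabelSelectorAsSelector(capacity.NodeTopology)
		if err != nil {
			continue
		}
		if selector.Matches(labels.Set(templateNode.Labels)) {
			expected++
		}
		if selector.Matches(labels.Set(node.Labels)) {
			actual++
		}
	}
	// Nothing expected for this node group -> nothing to wait for; otherwise
	// wait until the driver has published at least as many objects for the
	// real node as exist for the template node.
	return expected == 0 || actual >= expected
}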


// This isn't particularly useful at the moment. Perhaps we can accept
// a context and then our caller can decide about a suitable deadline.
stopChannel := make(chan struct{})
Contributor Author

This is the same behavior as before in NewSchedulerBasedPredicateChecker, i.e. it's not worse, but also not better.
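For illustration, a context-based variant could derive the stop channel from the caller's context (a sketch; the function name and signature are made up):

import (
	"context"
	"fmt"

	"k8s.io/client-go/informers"
)

// startInformersWithContext lets the caller decide about the deadline: the
// context's Done() channel replaces the locally created stopChannel.
func startInformersWithContext(ctx context.Context, informerFactory informers.SharedInformerFactory) error {
	stopChannel := ctx.Done() // closed on cancellation or deadline expiry
	informerFactory.Start(stopChannel)
	for informerType, ok := range informerFactory.WaitForCacheSync(stopChannel) {
		if !ok {
			return fmt.Errorf("informer %v failed to sync before the context was done", informerType)
		}
	}
	return nil
}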


// CleanUp frees resources.
CleanUp()
}
Contributor Author

In contrast to CustomResourcesProcessor, there's nothing specific to CSI in this interface. A better solution might be to introduce a generic FilterReadyNodes interface and then support more than one of those in AutoscalingProcessors.

CustomResourcesProcessor then becomes an extension of that new interface which just adds GetNodeResourceTargets.
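A rough sketch of that idea (names and signatures are illustrative only; apiv1 and cloudprovider are the packages already used in CA):

// FilterReadyNodes is the generic part: it lets a processor move nodes that
// are not fully initialized yet (e.g. no CSI capacity published) out of the
// ready set.
type FilterReadyNodes interface {
	// FilterOutUnreadyNodes returns updated allNodes/readyNodes slices.
	FilterOutUnreadyNodes(allNodes, readyNodes []*apiv1.Node) ([]*apiv1.Node, []*apiv1.Node)
	// CleanUp frees resources.
	CleanUp()
}

// CustomResourcesProcessor would then just extend the generic interface with
// the method that is actually about custom resources.
type CustomResourcesProcessor interface {
	FilterReadyNodes
	GetNodeResourceTargets(node *apiv1.Node, nodeGroup cloudprovider.NodeGroup) []CustomResourceTarget
}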

@@ -309,15 +318,21 @@ func buildAutoscaler() (core.Autoscaler, error) {
autoscalingOptions := createAutoscalingOptions()
kubeClient := createKubeClient(getKubeConfig())
eventsKubeClient := createKubeClient(getKubeConfig())
informerFactory := informers.NewSharedInformerFactory(kubeClient, 0)
Contributor Author

@pohly pohly Oct 6, 2021

As described in the commit message, creating and starting the informer factory needs to be moved up the call chain because the scheduler predicate checker and the processor both need a CSIStorageCapacity informer and ideally should share the same one.
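Schematically (constructor names are placeholders; the informer accessor assumes client-go's generated storage.k8s.io/v1beta1 informers):

// One shared factory, created up the call chain and passed to both consumers.
informerFactory := informers.NewSharedInformerFactory(kubeClient, 0)
capacityInformer := informerFactory.Storage().V1beta1().CSIStorageCapacities()

predicateChecker := newPredicateChecker(capacityInformer) // placeholder constructors
csiProcessor := newCSIProcessor(capacityInformer)

// Started once, after everything has registered its informers.
informerFactory.Start(stopChannel)
informerFactory.WaitForCacheSync(stopChannel)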

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 13, 2021
Contributor

@elmiko elmiko left a comment

I read through this again and it generally makes sense to me. I'm not an expert on the CSI system, but I think what is documented here would work with the clusterapi provider.

The one thing I'm not sure about is whether we would need more information to be available about the individual instances behind the nodes in the node group. I don't think we would, as it seems the labels and taints are doing most of the work here.

@jingxu97

jingxu97 commented Dec 9, 2021

cc @jingxu97

Whether a pod has unbound volumes influences scheduling decisions and
thus the scale up decisions in cluster autoscaler.

These three new test cases cover:
- a pod with an unbound pvc using late binding -> can scale up
- the same with storage capacity feature enabled -> cannot scale up
  without CSIStorageCapacity
- the same with manually configured CSIStorageCapacity -> can scale up
The assumption that all node labels except for the hostname label can be copied
verbatim does not hold for CSI drivers which manage local storage: those
drivers have a topology label where the value also depends on the hostname. It
might be the same as the Kubernetes hostname, but that cannot be assumed.

To solve this, search/replace with regular expressions can be defined to modify
those labels. This then can be used to inform the autoscaler about available
capacity on new nodes:

   --replace-labels ';^topology.hostpath.csi/node=aks-workerpool.*;topology.hostpath.csi/node=aks-workerpool-template;'

   kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  name: aks-workerpool-fast-storage
  namespace: kube-system
capacity: 100Gi
maximumVolumeSize: 100Gi
nodeTopology:
  matchLabels:
    # This never matches a real node, only the node templates
    # inside cluster-autoscaler.
    topology.hostpath.csi/node: aks-workerpool-template
storageClassName: csi-hostpath-fast
EOF
When a new node becomes ready, a CSI driver is not going to be running on it
immediately. This can cause the cluster autoscaler to scale up once more
because of pending pods that can run on that new node once the driver is ready.

The actual check is about CSIStorageCapacity. By comparing the published
information about the new node against the information for a template node, we
can determine whether the CSI driver is done with starting up on the node.

The new CSI processor needs information about existing CSIStorageCapacity
objects in the cluster, just like the scheduler predicate. Both can share the
same informer. For that to work, managing the informer factory must be moved up
the call chain so that the setup code for both can use the same factory.
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 15, 2022
@pohly pohly changed the title RFC: scale up: tests and support for pods with volumes scale up: tests and support for pods with volumes Feb 15, 2022
@pohly
Contributor Author

pohly commented Feb 15, 2022

@MaciekPytel: I would like to propose this PR for merging. I can't think of any simpler way of solving this because of the inherent complexity outlined above, and no one else has come up with any suggestions either.

The alternative to this PR is to explain to users that autoscaling doesn't work together with storage capacity tracking. But I think the proposal here is better than giving up on the combination completely. If there is anything that I can do to make this PR acceptable, then please let me know.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 16, 2022
@k8s-ci-robot
Contributor

@pohly: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MaciekPytel
Contributor

I replied on kubernetes/enhancements#3233 with the high-level issues I see with this approach. It can technically work for some use cases, but it relies on the cluster admin having in-depth knowledge of CA and I don't think it scales to large clusters.

I really think we should be asking "how can we change the CSIStorageCapacity API / implementation in k8s to facilitate a good autoscaling story", rather than just "how do we make CA work with the current implementation".

@pohly
Contributor Author

pohly commented Mar 24, 2022

The conclusion in kubernetes/enhancements#3233 was that this PR here is not sufficient for autoscaling and therefore should not be merged.

/close

To recap:

  • A solution must be easier to configure.
  • A solution must support scaling up by more than one node at a time.
  • Different changes are going to be needed for node-local storage and network-attached storage.

This can only be achieved by making some assumptions about how a CSI driver works and by a CSI driver providing more information than currently required. Before someone picks this up, we need:

  • a CSI storage vendor who wants to support this
  • a cloud provider or user who will try out the modified autoscaler and CSI driver and provide feedback

@k8s-ci-robot
Contributor

@pohly: Closed this PR.

In response to this:

The conclusion in kubernetes/enhancements#3233 was that this PR here is not sufficient for autoscaling and therefore should not be merged.

/close

To recap:

  • A solution must be easier to configure.
  • A solution must support scaling up by more than one node at a time.
  • Different changes are going to be needed for node-local storage and network-attached storage.

This can only be achieved by making some assumptions about how a CSI driver works and by a CSI driver providing more information than currently required. Before someone picks this up, we need:

  • a CSI storage vendor who wants to support this
  • a cloud provider or user who will try out the modified autoscaler and CSI driver and provide feedback

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
