CA scale-up delays on clusters with heavy scaling activity #5769

Open
benfain opened this issue May 18, 2023 · 11 comments
Labels
area/cluster-autoscaler, area/core-autoscaler (Denotes an issue that is related to the core autoscaler and is not specific to any provider), kind/bug (Categorizes issue or PR as related to a bug)

Comments

@benfain

benfain commented May 18, 2023

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
Component version: v1.26.1, though we've seen the same behavior in versions 1.24.1 and 1.27.1

What k8s version are you using (kubectl version)?:
1.24

What environment is this in?:
AWS, using kops

What did you expect to happen?:
On k8s clusters with heavy scaling activity, we would expect CA to scale up in a timely manner to clear unschedulable pending pods.

What happened instead?:
At times we need CA to process 3k+ pending (unschedulable) pods, and we have seen significant processing delays, sometimes up to 15 minutes before CA gets through the list and scales up nodes. Several deployments in the cluster frequently scale up and down by hundreds of pods.
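For anyone trying to reproduce the backlog we are describing, a rough way to watch the pending-pod count is something like the following (a sketch only; it counts all Pending pods, not just those the scheduler has marked unschedulable):

  kubectl get pods --all-namespaces --field-selector=status.phase=Pending --no-headers | wc -l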

During this time frame, looking at CA metrics, we noticed significantly increased latency overall, and especially in the scale-up function, as seen below (in seconds):
[screenshot: CA function latency in seconds, 2023-05-15]

Below is a screenshot showing the delay in scale-up time. As mentioned above, you can see we peaked above 3k unschedulable pods, with little scaling activity during these periods. We suspect CA is struggling to churn through the list.
[screenshot: unschedulable pod count and scale-up activity, 2023-05-16]

Anything else we need to know?:

We do not seem to be hitting pod- or node-level resource limits; nothing is OOMing, and pods/nodes are not approaching their limits in general. We use node selectors for pod assignment. From what we can tell, we are also not being rate-limited on the cloud provider side; there is simply a delay before CA attempts to update the ASGs. Per the defined SLO, we expect CA to scale up within 60s even on large clusters like ours.
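For anyone looking for the same signal, the latency panels above come from CA's function duration histogram. A quick way to eyeball the raw metric (a sketch; it assumes the default metrics port 8085 and a cluster-autoscaler Deployment in kube-system, which may differ in your setup):

  # forward the CA metrics port locally (deployment name/namespace are assumptions)
  kubectl -n kube-system port-forward deploy/cluster-autoscaler 8085:8085 &
  # dump the per-function duration histogram (the scale-up latency shows up under the function label)
  curl -s localhost:8085/metrics | grep cluster_autoscaler_function_duration_seconds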

We looked into running multiple replicas to help churn through the list, but by default there can only be one leader. I can't find any documentation about how well running multiple replicas in parallel works; we are under the impression that it is not recommended.

Alternatively, we looked into running multiple instances of CA in a single cluster, each focused on separate workloads/resources based on pod labels. I don't believe this is supported in any version of CA at this point?

@benfain benfain added the kind/bug Categorizes issue or PR as related to a bug. label May 18, 2023
@vadasambar
Member

vadasambar commented May 22, 2023

It looks like we performed scalability tests on CA 6 years ago: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/scalability_tests.md

The tests above, however, don't seem to measure the time needed to scale up nodes. Maybe it's time we ran a scalability test to check how long CA takes to scale up nodes. I guess the first thing to do here is to identify the bottleneck (or whether there is one at all).

@vadasambar
Member

vadasambar commented Jun 14, 2023

I performed a scale test using #5820

Here's what I did:

  1. Start running CA with kwok provider
  2. Schedule 5k pods using a Deployment with no taint tolerations, no node or pod affinity, and no pod topology spread constraints (in short, no scheduling constraints)

Things you should know:

  1. Every scaled up node had 12 cores of CPU and around 31.4Gi of memory, and no taints
  2. CA has no requests and limits defined (it can consume as much CPU and memory as it wants)
  3. The underlying kwok controller which fakes the nodes has no requests and limits defined (it can consume as much CPU and memory as it wants)
  4. All of this was done on a one-node local kind cluster
  5. When kwok provider is used, CA creates the nodes by itself (since there is no actual cloud provider)

(if you want to know more details, please let me know)
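For reference, the test Deployment looked roughly like this (a minimal sketch under the setup above; the name, image and resource requests are illustrative assumptions, not the exact manifest):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: test-pods                      # hypothetical name
  spec:
    replicas: 0                          # scaled to 5000 to start the test
    selector:
      matchLabels:
        app: test-pods
    template:
      metadata:
        labels:
          app: test-pods
      spec:
        # intentionally no nodeSelector, affinity, tolerations or topologySpreadConstraints
        containers:
          - name: pause
            image: registry.k8s.io/pause:3.9
            resources:
              requests:                  # illustrative requests (assumption)
                cpu: 500m
                memory: 1Gi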

To start the test, I just did:

kubectl scale deployment <test-pods-deployment> --replicas=5000

Here are the results:

In the middle of the test (screenshots of CA metrics at each stage; images omitted here):

  • 25 nodes
  • 129 nodes
  • 256 nodes
  • 288 nodes (some nodes stuck in NotReady, then became Ready)
  • 383 nodes
  • 385 nodes (all pods are scheduled)

After the test:

  • 385 nodes (all pods are scheduled)

The scaleUp 99th percentile shoots up once the number of nodes crosses 300. Still, the happy-path scenario seems to work as expected, i.e., the 99th-percentile scale-up latency stays under 21.8 seconds. The results might change if we add more scheduling constraints (affinity, selectors, or taints/tolerations). Also note that unschedulable pods are very similar to each other because they come from the same Deployment. CA is good at grouping similar pods (especially pods sharing the same owner) and scaling them up.

Please note that the above performance can be achieved only if NO pod affinity and anti-affinity is used on any of the pods. Unfortunately, the current implementation of the affinity predicate in scheduler is about 3 orders of magnitude slower than for all other predicates combined, and it makes CA hardly usable on big clusters.

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-service-level-objectives-for-cluster-autoscaler

Note that the actual number of unschedulable pods is the 5k test pods plus the extra DaemonSet pods for the new nodes. So the actual number is:
5k pods + 385 * 3 (1 kindnet pod, 1 kube-proxy pod and 1 prometheus exporter pod per node) = 6155 pods

@vadasambar
Member

It might help to know what kind of workload you usually schedule (scheduling constraints like pod affinity, node affinity, node selectors, taint tolerations, typical memory and CPU requests, etc.) and what kind of taints you use on your nodes, so that we can mimic that in the test workload and run scale tests based on it. Maybe we can reproduce the issue that way.

@tzatwork

Hey @vadasambar thanks for running this test! It looks like the images are not displaying for most folks (404ing). We can definitely add more color on the workloads we are running, as they are not just simple Deployments. In the meantime, we would love to see the results in the images.

@philnielsen

Adding the color Terry referenced: I think we are hitting several issues at once, and the pure number of pods isn't our problem (as your test, and tests we have run ourselves, have shown). I think the reason is related to this:

Also note that unschedulable pods are very similar to each other because they come from the same Deployment. CA is good at grouping similar pods (especially pods sharing the same owner) and scaling them up.

Here is what we have found so far in our research:

  1. "When comparing PodSpec semantically, drop projected volumes for init containers" #5852, which I was actually going to put a PR together for until I saw one was opened today! This was causing our equivalence group sizes to be 1 because of the service account token getting mounted, as described in "fix pod equivalency checks for pods with projected volumes" #4441
  2. We are scaling ~350 Deployments at once, and we have ~200 nodegroups, so CA has a lot of autoscaling groups to go through on each loop
  3. We don't have any podAffinity, but we do have nodeSelectors that select the custom nodegroups, plus nodeAffinity per zone:
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
              - us-east-1d
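
Put together, the scheduling constraints on one of these Deployments look roughly like this (a sketch; the kops.k8s.io/instancegroup label key and the <ISOLATED_APP_NODEGROUP> value are assumptions standing in for our real selectors):

  # Sketch only: the nodeSelector key/value below are assumptions, not our exact manifests
  nodeSelector:
    kops.k8s.io/instancegroup: <ISOLATED_APP_NODEGROUP>
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1d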

@vadasambar
Member

vadasambar commented Jun 14, 2023

Hey @vadasambar thanks for running this test! It looks like the images are not displaying for most folks (404ing).

@tzatwork my bad 🙇. I have re-uploaded the images. You should be able to see them now (confirmed by checking this issue in a private window). Thanks for bringing this to my notice! Let me know if you still can't see them.

@vadasambar
Member

@philnielsen I will check the links for 1. Point 2 is very interesting: in my test above there were 7 CA nodegroups, and each of them had a max limit of 200 nodes.
Thanks for 3. I wonder if CA becomes slow at scaling up if we throw in a few more variables like 2 (increase the number of Deployments from 1 to a big number like 350 and increase the number of CA nodegroups) and 3 (add node selectors).

Is the node selector present on all the Deployments, or is it a mixed bag? I would assume it is a mixed bag.

@philnielsen

Thank you for taking a look; we really appreciate your insights!

Here is a multiple-nodegroup example: we have probably ~30 apps/instance types with isolated nodegroups (in addition to several general node groups in the cluster that handle other Deployments):

Name                                Type   InstanceType   Min  Max  Zone
c5.12xlarge.<ISOLATED_APP_NAME>-a   Node   c5.12xlarge    0    200  us-east-1a
c5.12xlarge.<ISOLATED_APP_NAME>-c   Node   c5.12xlarge    0    200  us-east-1c
c5.12xlarge.<ISOLATED_APP_NAME>-d   Node   c5.12xlarge    0    200  us-east-1d

The node selectors that put pods onto these isolated nodegroups exist on most of the Deployments, though not all; but those pods won't be grouped together as "similar" by CA because they are in different app groups.

@qianlei90
Contributor

qianlei90 commented Jun 15, 2023

We observed a similar result in our test. To find out where the time was being spent, we added two metrics: #5813. Eventually we found that the time increases rapidly if we add scheduling constraints such as node affinity to the pods. #4970 solved the problem and saved us.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@philnielsen
Copy link

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 5, 2024
@towca towca added the area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. label Mar 21, 2024