CA scale-up delays on clusters with heavy scaling activity #5769

Open
benfain opened this issue May 18, 2023 · 11 comments
Labels
area/cluster-autoscaler, area/core-autoscaler (Denotes an issue that is related to the core autoscaler and is not specific to any provider), kind/bug (Categorizes issue or PR as related to a bug)

Comments

@benfain

benfain commented May 18, 2023

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
Component version: v1.26.1, though we've seen the same behavior in versions 1.24.1 and 1.27.1

What k8s version are you using (kubectl version)?:
1.24

What environment is this in?:
AWS, using kops

What did you expect to happen?:
On k8s clusters with heavy scaling activity, we would expect CA to scale up in a timely manner to clear unschedulable pending pods.

What happened instead?:
At times we need CA to process 3k+ pending (unschedulable) pods, and we have seen significant processing delays, sometimes up to 15 minutes before CA gets through the list and scales up nodes. Several deployments in the cluster frequently scale up and down by hundreds of pods.
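For anyone trying to reproduce the backlog we are describing, a rough way to watch the pending-pod count is something like the following (a sketch only; it counts all Pending pods, not just those the scheduler has marked unschedulable):

  kubectl get pods --all-namespaces --field-selector=status.phase=Pending --no-headers | wc -l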

During this time frame, looking at CA metrics, we noticed significantly increased latency overall, and especially in the scale-up function, as seen below (in seconds):
[screenshot: CA function latency in seconds, 2023-05-15]

Below is a screenshot showing the delay in scale-up time. As mentioned above, you can see we peaked above 3k unschedulable pods, with little scaling activity during these periods. We suspect CA is struggling to churn through the list.
[screenshot: unschedulable pod count and scale-up activity, 2023-05-16]

Anything else we need to know?:

We do not seem to be hitting pod- or node-level resource limits; nothing is OOMing, and pods/nodes are not approaching their limits in general. We use node selectors for pod assignment. From what we can tell, we are also not being rate-limited on the cloud provider side; there is simply a delay before CA attempts to update the ASGs. Per the defined SLO, we expect CA to scale up within 60s even on large clusters like ours.
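For anyone looking for the same signal, the latency panels above come from CA's function duration histogram. A quick way to eyeball the raw metric (a sketch; it assumes the default metrics port 8085 and a cluster-autoscaler Deployment in kube-system, which may differ in your setup):

  # forward the CA metrics port locally (deployment name/namespace are assumptions)
  kubectl -n kube-system port-forward deploy/cluster-autoscaler 8085:8085 &
  # dump the per-function duration histogram (the scale-up latency shows up under the function label)
  curl -s localhost:8085/metrics | grep cluster_autoscaler_function_duration_seconds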

We looked into running multiple replicas to help churn through the list, but by default there can only be one leader. I can't find any documentation about how well running multiple replicas in parallel works; we are under the impression that it is not recommended.

Alternatively, we looked into running multiple instances of CA in a single cluster, each focused on separate workloads/resources based on pod labels. I don't believe this is supported in any version of CA at this point?

@benfain benfain added the kind/bug Categorizes issue or PR as related to a bug. label May 18, 2023
@vadasambar
Member

vadasambar commented May 22, 2023

It looks like we performed scalability tests on CA 6 years ago: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/scalability_tests.md

The tests above, however, don't seem to measure the time needed to scale up nodes. Maybe it's time we ran a scalability test to check how long CA takes to scale up nodes. I guess the first thing to do here is to identify the bottleneck (or whether there is one at all).

@vadasambar
Member

vadasambar commented Jun 14, 2023

I performed a scale test using #5820

Here's what I did:

  1. Start running CA with kwok provider
  2. Schedule 5k pods using a Deployment with no taint tolerations, no node or pod affinity, and no pod topology spread constraints (in short, no scheduling constraints)

Things you should know:

  1. Every scaled up node had 12 cores of CPU and around 31.4Gi of memory, and no taints
  2. CA has no requests and limits defined (it can consume as much CPU and memory as it wants)
  3. The underlying kwok controller which fakes the nodes has no requests and limits defined (it can consume as much CPU and memory as it wants)
  4. All of this was done on a one-node local kind cluster
  5. When kwok provider is used, CA creates the nodes by itself (since there is no actual cloud provider)

(if you want to know more details, please let me know)
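For reference, the test Deployment looked roughly like this (a minimal sketch under the setup above; the name, image and resource requests are illustrative assumptions, not the exact manifest):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: test-pods                      # hypothetical name
  spec:
    replicas: 0                          # scaled to 5000 to start the test
    selector:
      matchLabels:
        app: test-pods
    template:
      metadata:
        labels:
          app: test-pods
      spec:
        # intentionally no nodeSelector, affinity, tolerations or topologySpreadConstraints
        containers:
          - name: pause
            image: registry.k8s.io/pause:3.9
            resources:
              requests:                  # illustrative requests (assumption)
                cpu: 500m
                memory: 1Gi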

To start the test, I just did:

kubectl scale deployment <test-pods-deployment> --replicas=5000

Here are the results:

In the middle of the test (screenshots of CA metrics at each stage; images omitted here):

  • 25 nodes
  • 129 nodes
  • 256 nodes
  • 288 nodes (some nodes stuck in NotReady, then became Ready)
  • 383 nodes
  • 385 nodes (all pods are scheduled)

After the test:

  • 385 nodes (all pods are scheduled)

The scaleUp 99th percentile shoots up once the number of nodes crosses 300. Still, the happy-path scenario seems to work as expected, i.e., the 99th-percentile scale-up latency stays under 21.8 seconds. The results might change if we add more scheduling constraints (affinity, selectors, or taints/tolerations). Also note that unschedulable pods are very similar to each other because they come from the same Deployment. CA is good at grouping similar pods (especially pods sharing the same owner) and scaling them up.

Please note that the above performance can be achieved only if NO pod affinity and anti-affinity is used on any of the pods. Unfortunately, the current implementation of the affinity predicate in scheduler is about 3 orders of magnitude slower than for all other predicates combined, and it makes CA hardly usable on big clusters.

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-service-level-objectives-for-cluster-autoscaler

Note that the actual number of unschedulable pods is the 5k test pods plus the extra DaemonSet pods for the new nodes. So the actual number is:
5k pods + 385 * 3 (1 kindnet pod, 1 kube-proxy pod and 1 prometheus exporter pod per node) = 6155 pods

@vadasambar
Member

It might help to know what kind of workload you usually schedule (scheduling constraints like pod affinity, node affinity, node selectors, taint tolerations, typical memory and CPU requests, etc.) and what kind of taints you use on your nodes, so that we can mimic that in the test workload and run scale tests based on it. Maybe we can reproduce the issue that way.

@tzatwork

Hey @vadasambar thanks for running this test! It looks like the images are not displaying for most folks (404ing). We can definitely add more color on the workloads we are running, as they are not just simple Deployments. In the meantime, we would love to see the results in the images.

@philnielsen

Adding the color Terry referenced: I think we are hitting several issues at once, and the pure number of pods isn't our problem (as your test, and tests we have run ourselves, have shown). I think the reason is related to this:

Also note that unschedulable pods are very similar to each other because they come from the same Deployment. CA is good at grouping similar pods (especially pods sharing the same owner) and scaling them up.

Here is what we have found so far in our research:

  1. "When comparing PodSpec semantically, drop projected volumes for init containers" #5852, which I was actually going to put a PR together for until I saw one was opened today! This was causing our equivalence group sizes to be 1 because of the service account token getting mounted, as described in "fix pod equivalency checks for pods with projected volumes" #4441
  2. We are scaling ~350 Deployments at once, and we have ~200 nodegroups, so CA has a lot of autoscaling groups to go through on each loop
  3. We don't have any podAffinity, but we do have nodeSelectors that select the custom nodegroups, plus nodeAffinity per zone:
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
              - us-east-1d
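
Put together, the scheduling constraints on one of these Deployments look roughly like this (a sketch; the kops.k8s.io/instancegroup label key and the <ISOLATED_APP_NODEGROUP> value are assumptions standing in for our real selectors):

  # Sketch only: the nodeSelector key/value below are assumptions, not our exact manifests
  nodeSelector:
    kops.k8s.io/instancegroup: <ISOLATED_APP_NODEGROUP>
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1d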

@vadasambar
Member

vadasambar commented Jun 14, 2023

Hey @vadasambar thanks for running this test! It looks like the images are not displaying for most folks (404ing).

@tzatwork my bad 🙇. I have re-uploaded the images. You should be able to see them now (confirmed by checking this issue in a private window). Thanks for bringing this to my notice! Let me know if you still can't see them.

@vadasambar
Member

@philnielsen I will check the links for 1. Point 2 is very interesting: in my test above there were 7 CA nodegroups, and each of them had a max limit of 200 nodes.
Thanks for 3. I wonder if CA becomes slow at scaling up if we throw in a few more variables like 2 (increase the number of Deployments from 1 to a big number like 350 and increase the number of CA nodegroups) and 3 (add node selectors).

Is the node selector present on all the Deployments, or is it a mixed bag? I would assume it is a mixed bag.

@philnielsen

Thank you for taking a look; we really appreciate your insights!

Here is a multiple-nodegroup example: we have probably ~30 apps/instance types with isolated nodegroups (in addition to several general node groups in the cluster that handle other Deployments):

Name                                Type   InstanceType   Min  Max  Zone
c5.12xlarge.<ISOLATED_APP_NAME>-a   Node   c5.12xlarge    0    200  us-east-1a
c5.12xlarge.<ISOLATED_APP_NAME>-c   Node   c5.12xlarge    0    200  us-east-1c
c5.12xlarge.<ISOLATED_APP_NAME>-d   Node   c5.12xlarge    0    200  us-east-1d

The node selectors that put pods onto these isolated nodegroups exist on most of the Deployments, though not all; but those pods won't be grouped together as "similar" by CA because they are in different app groups.

@qianlei90
Contributor

qianlei90 commented Jun 15, 2023

We observed a similar result in our test. To find out where the time was being spent, we added two metrics: #5813. Eventually we found that the time increases rapidly if we add scheduling constraints such as node affinity to the pods. #4970 solved the problem and saved us.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@philnielsen
Copy link

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 5, 2024
@towca towca added the area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. label Mar 21, 2024