
Cluster Autoscaler (0.4.0) - Excessive calls to describeautoscalinggroup #2541

Closed
SleepyBrett opened this issue Apr 11, 2017 · 7 comments
Labels
area/autoscaler area/autoscaling lifecycle/rotten

Comments

@SleepyBrett

We're still on Kubernetes 1.5.6 here and thus stuck on cluster-autoscaler 0.4.0.

We are also in a shared corporate AWS account for just a few more weeks, but we've been responsible for hammering the DescribeAutoScalingGroups endpoint. At one count we were making about 6,200 calls an hour while watching just 2 autoscaling groups with a scan interval of 15s.
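
(For rough scale, taking those numbers at face value: a 15s scan interval is about 240 loops per hour, so ~6,200 calls/hour works out to roughly 26 DescribeAutoScalingGroups calls per loop, far more than one call per managed group per loop.)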

After reviewing the logs at verbosity 4, I'm seeing what looks like excessive cache regeneration; a log excerpt is included below.

I0411 00:18:03.743910       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:03.767672       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:03.814842       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:03.847489       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:03.878911       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:03.909796       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:03.948900       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:03.971444       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.014617       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.062654       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.112767       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.135767       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.179638       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.208361       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.255125       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.285420       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.315456       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.338964       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.391533       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.414932       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.439831       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.475670       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.532452       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.565761       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.596577       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.628497       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.689757       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.711811       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.741555       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.767574       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.809850       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.849322       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.878316       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196

I'm digging through the code trying to track down what's invalidating the cache so rapidly.

@SleepyBrett
Author

SleepyBrett commented Apr 11, 2017

It looks like the cause here is that we have 7 node pools but only 4 under management. So every time the autoscaler tries to evaluate a node that is outside of the four managed ASGs, it regenerates the entire cache.
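
A simplified sketch of the pattern this suggests (illustrative only, not the actual aws_manager.go code; names and numbers are made up): a cache miss triggers a full regeneration, regeneration costs one DescribeAutoScalingGroups call per managed ASG, and because instances from unmanaged ASGs never land in the cache, every scan loop pays that cost again for each of them.

```go
package main

import "fmt"

type manager struct {
	managedASGs   []string
	instanceToASG map[string]string // instance ID -> ASG name
	apiCalls      int
}

// regenerateCache rebuilds the instance->ASG map, one (pretend) API call per managed ASG.
func (m *manager) regenerateCache() {
	m.instanceToASG = map[string]string{}
	for _, asg := range m.managedASGs {
		m.apiCalls++ // stands in for a DescribeAutoScalingGroups call
		fmt.Printf("Regenerating ASG information for %s\n", asg)
		// In the real manager, the instances of each ASG would be added to the map here.
	}
}

// getASGForInstance mirrors the suspected behaviour: on a miss, regenerate the
// whole cache and try once more.
func (m *manager) getASGForInstance(instanceID string) (string, bool) {
	if asg, ok := m.instanceToASG[instanceID]; ok {
		return asg, true
	}
	m.regenerateCache()
	asg, ok := m.instanceToASG[instanceID]
	return asg, ok
}

func main() {
	m := &manager{managedASGs: []string{"asg-a", "asg-b", "asg-c", "asg-d"}}
	// Three nodes from unmanaged ASGs: each lookup misses and forces a full
	// regeneration, i.e. 3 * 4 = 12 API calls in a single scan loop.
	for _, node := range []string{"i-unmanaged-1", "i-unmanaged-2", "i-unmanaged-3"} {
		if _, ok := m.getASGForInstance(node); !ok {
			fmt.Printf("%s is not in any managed ASG\n", node)
		}
	}
	fmt.Printf("API calls this loop: %d\n", m.apiCalls)
}
```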

I might suggest keeping a separate set of instances that aren't found in the cache even after a regeneration, and handling those as an 'early-out' case; a rough sketch follows.
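
A minimal sketch of that early-out, building on the illustrative code above (again, not the real aws_manager.go; the notManaged set and its checks are the only additions):

```go
package main

import "fmt"

type manager struct {
	managedASGs   []string
	instanceToASG map[string]string
	notManaged    map[string]bool // instances known not to belong to any managed ASG
	apiCalls      int
}

func (m *manager) regenerateCache() {
	m.instanceToASG = map[string]string{}
	for range m.managedASGs {
		m.apiCalls++ // stands in for one DescribeAutoScalingGroups call per ASG
	}
}

func (m *manager) getASGForInstance(instanceID string) (string, bool) {
	if asg, ok := m.instanceToASG[instanceID]; ok {
		return asg, true
	}
	if m.notManaged[instanceID] {
		return "", false // early-out: we already know this node is unmanaged
	}
	m.regenerateCache()
	if asg, ok := m.instanceToASG[instanceID]; ok {
		return asg, true
	}
	m.notManaged[instanceID] = true // remember the miss instead of retrying every loop
	return "", false
}

func main() {
	m := &manager{
		managedASGs: []string{"asg-a", "asg-b", "asg-c", "asg-d"},
		notManaged:  map[string]bool{},
	}
	for loop := 0; loop < 3; loop++ {
		for _, node := range []string{"i-unmanaged-1", "i-unmanaged-2"} {
			m.getASGForInstance(node)
		}
	}
	// Only the first loop regenerates (once per not-yet-remembered unmanaged
	// node): 2 * 4 = 8 calls total, instead of 2 * 4 * 3 = 24 without the set.
	fmt.Printf("API calls over 3 loops: %d\n", m.apiCalls)
}
```

A real implementation would presumably need to expire these negative entries (or refresh them on a timer) so that a node that later joins a managed ASG isn't ignored forever.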

For now I think I'll pull all ASGs under management, but keep the other 3 at a fixed size (min equal to max). Though this makes me nervous, as one of those ASGs is our etcd pool.
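
For reference, that setup would look something like the following with static ASG registration, assuming the --nodes=<min>:<max>:<asg-name> flag form the AWS cloud provider documents (the sizes and the etcd ASG name here are illustrative, not our real values):

```
./cluster-autoscaler \
  --cloud-provider=aws \
  --scan-interval=15s \
  --nodes=1:10:a0098-p18-workers-asg \
  --nodes=1:10:a0098-p18-subnet_workers-asg \
  --nodes=1:10:a0098-p18-flink_workers-asg \
  --nodes=3:3:a0098-p18-etcd-asg   # fixed size: min == max
```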

@mwielgus
Contributor

I guess we might have a similar problem with GCE/GKE. Need to take a look.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 24, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 23, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
