
Cluster Autoscaler (0.4.0) - Excessive calls to describeautoscalinggroup #2541

Closed
SleepyBrett opened this issue Apr 11, 2017 · 7 comments
Labels
area/autoscaler area/autoscaling lifecycle/rotten

Comments

@SleepyBrett

We're still on Kubernetes 1.5.6 here and thus stuck on cluster-autoscaler 0.4.0.

We are also in a shared corporate AWS account for just a few more weeks, but we've been responsible for hammering the DescribeAutoScalingGroups endpoint. At one count we were making about 6,200 calls an hour while watching just 2 autoscaling groups with a scan interval of 15s.
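
(For rough scale, taking those numbers at face value: a 15s scan interval is about 240 loops per hour, so ~6,200 calls/hour works out to roughly 26 DescribeAutoScalingGroups calls per loop, far more than one call per managed group per loop.)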

After reviewing the logs at verbosity 4, I'm seeing what looks like excessive cache regeneration; a log excerpt is included below.

I0411 00:18:03.743910       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:03.767672       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:03.814842       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:03.847489       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:03.878911       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:03.909796       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:03.948900       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:03.971444       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.014617       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.062654       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.112767       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.135767       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.179638       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.208361       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.255125       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.285420       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.315456       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.338964       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.391533       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.414932       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.439831       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.475670       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.532452       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.565761       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.596577       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.628497       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.689757       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.711811       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.741555       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196
I0411 00:18:04.767574       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-workers-asg
I0411 00:18:04.809850       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-subnet_workers-asg
I0411 00:18:04.849322       1 aws_manager.go:186] Regenerating ASG information for a0098-p18-flink_workers-asg
I0411 00:18:04.878316       1 aws_manager.go:186] Regenerating ASG information for tf-asg-00451965f3d6dfa3c713f1f196

I'm digging through the code trying to track down what's invalidating the cache so rapidly.

@SleepyBrett
Author

SleepyBrett commented Apr 11, 2017

It looks like the cause here is that we have 7 node pools but only 4 under management. So every time the autoscaler tries to evaluate a node that is outside of the four managed ASGs, it regenerates the entire cache.
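
A simplified sketch of the pattern this suggests (illustrative only, not the actual aws_manager.go code; names and numbers are made up): a cache miss triggers a full regeneration, regeneration costs one DescribeAutoScalingGroups call per managed ASG, and because instances from unmanaged ASGs never land in the cache, every scan loop pays that cost again for each of them.

```go
package main

import "fmt"

type manager struct {
	managedASGs   []string
	instanceToASG map[string]string // instance ID -> ASG name
	apiCalls      int
}

// regenerateCache rebuilds the instance->ASG map, one (pretend) API call per managed ASG.
func (m *manager) regenerateCache() {
	m.instanceToASG = map[string]string{}
	for _, asg := range m.managedASGs {
		m.apiCalls++ // stands in for a DescribeAutoScalingGroups call
		fmt.Printf("Regenerating ASG information for %s\n", asg)
		// In the real manager, the instances of each ASG would be added to the map here.
	}
}

// getASGForInstance mirrors the suspected behaviour: on a miss, regenerate the
// whole cache and try once more.
func (m *manager) getASGForInstance(instanceID string) (string, bool) {
	if asg, ok := m.instanceToASG[instanceID]; ok {
		return asg, true
	}
	m.regenerateCache()
	asg, ok := m.instanceToASG[instanceID]
	return asg, ok
}

func main() {
	m := &manager{managedASGs: []string{"asg-a", "asg-b", "asg-c", "asg-d"}}
	// Three nodes from unmanaged ASGs: each lookup misses and forces a full
	// regeneration, i.e. 3 * 4 = 12 API calls in a single scan loop.
	for _, node := range []string{"i-unmanaged-1", "i-unmanaged-2", "i-unmanaged-3"} {
		if _, ok := m.getASGForInstance(node); !ok {
			fmt.Printf("%s is not in any managed ASG\n", node)
		}
	}
	fmt.Printf("API calls this loop: %d\n", m.apiCalls)
}
```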

I might suggest keeping a separate set of instances that aren't found in the cache even after a regeneration, and handling those as an 'early-out' case; a rough sketch follows.
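
A minimal sketch of that early-out, building on the illustrative code above (again, not the real aws_manager.go; the notManaged set and its checks are the only additions):

```go
package main

import "fmt"

type manager struct {
	managedASGs   []string
	instanceToASG map[string]string
	notManaged    map[string]bool // instances known not to belong to any managed ASG
	apiCalls      int
}

func (m *manager) regenerateCache() {
	m.instanceToASG = map[string]string{}
	for range m.managedASGs {
		m.apiCalls++ // stands in for one DescribeAutoScalingGroups call per ASG
	}
}

func (m *manager) getASGForInstance(instanceID string) (string, bool) {
	if asg, ok := m.instanceToASG[instanceID]; ok {
		return asg, true
	}
	if m.notManaged[instanceID] {
		return "", false // early-out: we already know this node is unmanaged
	}
	m.regenerateCache()
	if asg, ok := m.instanceToASG[instanceID]; ok {
		return asg, true
	}
	m.notManaged[instanceID] = true // remember the miss instead of retrying every loop
	return "", false
}

func main() {
	m := &manager{
		managedASGs: []string{"asg-a", "asg-b", "asg-c", "asg-d"},
		notManaged:  map[string]bool{},
	}
	for loop := 0; loop < 3; loop++ {
		for _, node := range []string{"i-unmanaged-1", "i-unmanaged-2"} {
			m.getASGForInstance(node)
		}
	}
	// Only the first loop regenerates (once per not-yet-remembered unmanaged
	// node): 2 * 4 = 8 calls total, instead of 2 * 4 * 3 = 24 without the set.
	fmt.Printf("API calls over 3 loops: %d\n", m.apiCalls)
}
```

A real implementation would presumably need to expire these negative entries (or refresh them on a timer) so that a node that later joins a managed ASG isn't ignored forever.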

For now I think I'll pull all ASGs under management, but keep the other 3 at a fixed size (min equal to max). Though this makes me nervous, as one of those ASGs is our etcd pool.
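
For reference, that setup would look something like the following with static ASG registration, assuming the --nodes=<min>:<max>:<asg-name> flag form the AWS cloud provider documents (the sizes and the etcd ASG name here are illustrative, not our real values):

```
./cluster-autoscaler \
  --cloud-provider=aws \
  --scan-interval=15s \
  --nodes=1:10:a0098-p18-workers-asg \
  --nodes=1:10:a0098-p18-subnet_workers-asg \
  --nodes=1:10:a0098-p18-flink_workers-asg \
  --nodes=3:3:a0098-p18-etcd-asg   # fixed size: min == max
```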

@mwielgus
Contributor

I guess we might have a similar problem with GCE/GKE. Need to take a look.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 24, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 23, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
