
cluster-autoscaler: Fix excessive calls to DescribeAutoScalingGroup #46

Merged

Conversation

@mumoshu (Contributor) commented May 9, 2017

By caching AWS refs for nodes/EC2 instances already known not to be in any of the ASGs managed by cluster-autoscaler (CA).

Please beware of the edge case: this method is safe as long as users don't attach nodes by calling the AttachInstances API after CA has cached them. Even if we needed to support that case, I believe a warning in the documentation about the edge case is enough for now. If we really do need to support it, I will submit another PR to invalidate the cache periodically so that CA can detect that formerly cached nodes have been attached to ASG(s).

The Docker image built from this branch is available for testing at mumoshu/fix-excessive-describe-asg-calls.

You can see that CA detects and remembers nodes in unmanaged ASGs so that it can prevent the nodes from triggering unnecessary regenerateCache invocations afterwards:

I0509 05:37:28.165739       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
I0509 05:37:28.824668       1 aws_manager.go:197] Regenerating ASG information for kube4-Asg1-5GX1OEW9YLPA-Workers-1WBDQUOBGXA6A
I0509 05:37:28.882084       1 aws_manager.go:188] Instance {Name:i-06079265fc8999d4c} is not in any ASG managed by CA. CA is now memorizing the fact not to unnecessarily call AWS API afterwards trying to find the unexistent managed ASG for the instance
I0509 05:37:28.882149       1 aws_manager.go:197] Regenerating ASG information for kube4-Asg1-5GX1OEW9YLPA-Workers-1WBDQUOBGXA6A
I0509 05:37:28.918931       1 aws_manager.go:188] Instance {Name:i-0c6c700cbcc62ed1d} is not in any ASG managed by CA. CA is now memorizing the fact not to unnecessarily call AWS API afterwards trying to find the unexistent managed ASG for the instance
I0509 05:37:28.918984       1 aws_manager.go:197] Regenerating ASG information for kube4-Asg1-5GX1OEW9YLPA-Workers-1WBDQUOBGXA6A
I0509 05:37:28.946945       1 aws_manager.go:188] Instance {Name:i-0e48a422f75ae6500} is not in any ASG managed by CA. CA is now memorizing the fact not to unnecessarily call AWS API afterwards trying to find the unexistent managed ASG for the instance
I0509 05:37:28.947044       1 aws_manager.go:197] Regenerating ASG information for kube4-Asg1-5GX1OEW9YLPA-Workers-1WBDQUOBGXA6A
I0509 05:37:28.984235       1 aws_manager.go:188] Instance {Name:i-0a8cb4932d32e1296} is not in any ASG managed by CA. CA is now memorizing the fact not to unnecessarily call AWS API afterwards trying to find the unexistent managed ASG for the instance
I0509 05:37:28.984272       1 static_autoscaler.go:197] Filtering out schedulables
I0509 05:37:28.984338       1 static_autoscaler.go:205] No schedulable pods
I0509 05:37:28.984356       1 static_autoscaler.go:211] No unschedulable pods

Resolves #45
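
For reference, the whole change condenses to a small negative cache in GetAsgForInstance. A minimal, self-contained Go sketch assembled from the diff excerpts quoted in the review below (the type stubs and the nil, nil return are simplifications for illustration, not the exact merged code):

package aws

import "fmt"

// Type stubs for illustration; the real definitions live in aws_manager.go.
type AwsRef struct{ Name string }
type Asg struct{}

type AwsManager struct {
	asgCache                 map[AwsRef]*Asg
	instancesNotInManagedAsg map[AwsRef]struct{}
}

func (m *AwsManager) regenerateCache() error { return nil } // stub

// GetAsgForInstance returns the managed ASG for an instance, memorizing
// instances known not to belong to any managed ASG so that later calls
// skip DescribeAutoScalingGroups entirely.
func (m *AwsManager) GetAsgForInstance(instance *AwsRef) (*Asg, error) {
	if _, found := m.instancesNotInManagedAsg[*instance]; found {
		// Known not to be in any managed ASG: don't touch the AWS API.
		return nil, nil
	}
	if config, found := m.asgCache[*instance]; found {
		return config, nil
	}
	if err := m.regenerateCache(); err != nil {
		return nil, fmt.Errorf("Error while looking for ASG for instance %+v, error: %v", *instance, err)
	}
	if config, found := m.asgCache[*instance]; found {
		return config, nil
	}
	// Not found even after refreshing: memorize that fact.
	m.instancesNotInManagedAsg[*instance] = struct{}{}
	return nil, nil
}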

@k8s-ci-robot added the `cncf-cla: yes` label (indicates the PR's author has signed the CNCF CLA) on May 9, 2017
@mwielgus (Contributor)

cc: @andrewsykim

@mwielgus added the `area/provider/aws` label (issues or PRs related to the AWS provider) on May 10, 2017
@mumoshu (Contributor, Author) commented May 12, 2017

Hi @SleepyBrett!
This is the fix for kubernetes-retired/contrib#2541, which you reported last month.
Would you mind confirming whether it works for you?
If you have a k8s 1.6 cluster, you can run cluster-autoscaler from the Docker image mumoshu/fix-excessive-describe-asg-calls.

@mwielgus (Contributor)

Does it work with autodiscovery and dynamic config updates for AWS?

@mumoshu (Contributor, Author) commented May 12, 2017

@mwielgus Yes, but please share your concerns if you have any!

@andrewsykim (Member)

I'm on vacation, so won't be able to review til next week, sorry!

@mumoshu (Contributor, Author) commented May 23, 2017

Hi @andrewsykim, sorry to rush you, but I hope you can review this 🌷

if _, found := m.instancesNotInManagedAsg[*instance]; found {
	// The instance is already known to not belong to any configured ASG
	// Skip regenerateCache so that we won't unnecessarily call DescribeAutoScalingGroups
	// See https://github.com/kubernetes/contrib/issues/2541
@andrewsykim (Member):

Let's link to the issue in the new repo instead: #45

if err := m.regenerateCache(); err != nil {
	return nil, fmt.Errorf("Error while looking for ASG for instance %+v, error: %v", *instance, err)
}
if config, found := m.asgCache[*instance]; found {
	return config, nil
}
// instance does not belong to any configured ASG
glog.V(6).Infof("Instance %+v is not in any ASG managed by CA. CA is now memorizing the fact not to unnecessarily call AWS API afterwards trying to find the unexistent managed ASG for the instance", *instance)
m.instancesNotInManagedAsg[*instance] = struct{}{}
@andrewsykim (Member):

Are there possible race conditions here? I.e., could a recently launched instance be added to this list because it was missing tags for a short period of time?

@mumoshu (Contributor, Author):

@andrewsykim AFAIK, this func is called only after the K8S cluster notices the existence of the node.
Given that the node exists, shouldn't it already have the appropriate tags added, kubelet started, the node registered, etc.?

@mumoshu (Contributor, Author):

@andrewsykim Would you mind telling me which tag would be missing?

@andrewsykim (Member):

Ahh, I didn't know the instance would only be evaluated once it's registered with the master; this shouldn't be a problem then :)

asgs:                     make([]*asgInformation, 0),
service:                  service,
asgCache:                 make(map[AwsRef]*Asg),
instancesNotInManagedAsg: make(map[AwsRef]struct{}),
@andrewsykim (Member):

I feel like we are adding a lot of complexity here by adding this field (instancesNotInManagedAsg). I was thinking of a solution where we can put regenerateCache in a separate loop and have it run every X seconds. Would you happen to know if that's a feasible solution?

@mumoshu (Contributor, Author):

Excuse me if I'm not following you correctly, but I'd rather suggest extracting a new object dedicated to efficiently fetching ASG data (probably named something like asgRegistry, holding the actual implementation of GetAsgForInstance) if we want less complexity here.

After the change, AwsManager would just delegate GetAsgForInstance to asgRegistry, and the relevant struct members like asgs, asgCache, and instancesNotInManagedAsg would move to asgRegistry.

The reasoning behind my suggestion is that we won't want to introduce another goroutine just for calling regenerateCache - the less concurrent programming, the more deterministic CA's behavior is, and the happier our lives are? 😃

And if you'd just like to run regenerateCache every X seconds, we could, in theory, do it by checking at the very beginning of regenerateCache whether the elapsed time since the last regeneration is greater than X?
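
A minimal Go sketch of that elapsed-time check, assuming hypothetical names lastRefresh and refreshInterval (neither appears in this PR):

package aws

import "time"

// Illustrative only: lastRefresh and refreshInterval are assumed names,
// not fields from this PR.
type AwsManager struct {
	lastRefresh     time.Time
	refreshInterval time.Duration
}

// regenerateCache is throttled: if it ran less than refreshInterval ago,
// it returns immediately instead of calling DescribeAutoScalingGroups.
func (m *AwsManager) regenerateCache() error {
	if time.Since(m.lastRefresh) < m.refreshInterval {
		return nil // regenerated recently enough; skip the AWS API calls
	}
	// ... existing logic that rebuilds asgCache via DescribeAutoScalingGroups ...
	m.lastRefresh = time.Now()
	return nil
}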

@andrewsykim (Member):

And if you'd just like to run regenerateCache every X seconds, we could, in theory, do it by checking at the very beginning of regenerateCache whether the elapsed time since the last regeneration is greater than X?

sounds good to me

@mumoshu (Contributor, Author):

@andrewsykim Thanks for the confirmation 👍

I'm going to include the asgRegistry extraction described above in this PR.

However, regarding the periodic regenerateCache you've suggested, I'm not yet sure why it is needed.
AFAIK, cache regeneration is needed and done only when CA sees a k8s node for the first time. If we agree to assume that a k8s node does not move among different ASGs, which seems reasonable to me, we won't need to periodically invalidate the ASG cache.
What do you think?

@andrewsykim (Member):

I can't say at the moment; the cluster-autoscaler code has changed a lot since I last worked on it, so I'm probably not the best person to make any calls here. What you're saying seems reasonable, though I'll need to read the code again to be sure.

@mumoshu (Contributor, Author):

@andrewsykim Thanks!
Then, would you mind if we proceed to get this merged without the periodic regenerateCache for now? Even without it, merging this doesn't make the situation worse. If we really need the regenerateCache improvement, we can file a new GitHub issue, right?

cc @mwielgus

@mumoshu (Contributor, Author):

I'm probably not the best person to make any calls here

Neither am I!
I guess we are the only contributors concerned with the AWS part of CA? 😄
As this is an OSS project, I suppose one of the maintainers, or an active contributor appointed as responsible by a maintainer, would be eligible to make the decision.

@mumoshu (Contributor, Author):

@andrewsykim Oh, hey, I just realized that we already had the periodic regenerateCache in our codebase 😃

@mumoshu force-pushed the fix-excessive-describe-asg-calls branch from a55e868 to 24a7581 on June 8, 2017 08:17
mumoshu added a commit to mumoshu/autoscaler that referenced this pull request Jun 8, 2017
@mumoshu (Contributor, Author) commented Jun 8, 2017

@andrewsykim Following our discussion, I've added 37b8225 to avoid adding the instancesNotInManagedAsg field directly to AwsManager while maintaining consistency throughout the codebase.
It may seem like a big change in terms of LOC; however, basically all I did was split AwsManager into 3 types to keep each type simple.
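
A rough Go sketch of the kind of split described here; only asgRegistry and the moved fields are named in the thread, everything else is assumed for illustration:

package aws

// Only asgRegistry and the moved fields come from the discussion; the other
// identifiers are stubs assumed for illustration.
type AwsRef struct{ Name string }
type Asg struct{}
type asgInformation struct{}

// asgRegistry owns ASG lookup and the caches that back it.
type asgRegistry struct {
	asgs                     []*asgInformation
	asgCache                 map[AwsRef]*Asg
	instancesNotInManagedAsg map[AwsRef]struct{}
}

func (r *asgRegistry) getAsgForInstance(instance *AwsRef) (*Asg, error) {
	// ... negative-cache lookup plus regenerateCache, as sketched earlier ...
	return nil, nil
}

// AwsManager delegates lookups instead of holding the caches itself.
type AwsManager struct {
	registry *asgRegistry
}

func (m *AwsManager) GetAsgForInstance(instance *AwsRef) (*Asg, error) {
	return m.registry.getAsgForInstance(instance)
}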

@mumoshu force-pushed the fix-excessive-describe-asg-calls branch from 24a7581 to 37b8225 on June 8, 2017 08:43
mumoshu added a commit to mumoshu/autoscaler that referenced this pull request Jun 8, 2017
@mumoshu (Contributor, Author) commented Jun 8, 2017

To ensure I'm not breaking anything, I did some manual testing with 37b8225 and verified that it scales the cluster up/down successfully, as before.

@mumoshu (Contributor, Author) commented Jun 11, 2017

Hi @MaciekPytel @mwielgus, thanks as always for maintaining CA 👍
Would you mind sharing your decision on this - LGTM, further changes needed, questions, or other issues to be tracked in the future (more tests, more e2e, more README), etc.?
Also, not very important, but as a contributor I personally prefer PRs getting merged relatively quickly when they look good enough, even if not perfect, so that we can keep moving forward.

@mumoshu (Contributor, Author) commented Jun 11, 2017

Anyway, being a heavy user of K8S on AWS myself, I do want this issue fixed before CA 0.6.

@MaciekPytel (Contributor)

@mumoshu We definitely want this for 0.6.
It looks ok to me at a glance, but I don't have an AWS env to test this and I don't know the AWS cloudprovider too well. @andrewsykim reviewed this, and if it looks good to him and you confirm you tested it (running with a non-managed ASG no longer causes constant cache regeneration; scale-up and scale-down work ok), then I'm happy to merge this.

@mumoshu (Contributor, Author) commented Jun 19, 2017

I'll take some time to test this again shortly.

… feature is enabled

By fixing CA to not reset `StaticAutoscaler` state before each iteration, so that it remembers the last scale-up/down time used to throttle scale-down; resetting that state was causing the issue.
@mumoshu (Contributor, Author) commented Jun 22, 2017

@MaciekPytel @mwielgus I believe this is going to conflict with #107.
Would you mind merging #107 before this?

By caching AWS refs for nodes/EC2 instances already known not to be in any of the ASGs managed by cluster-autoscaler (CA).

Please beware of the edge case: this method is safe as long as users don't attach nodes by calling the AttachInstances API after CA has cached them. Even if we needed to support that case, I believe a warning in the documentation about the edge case is enough for now. If we really do need to support it, I will submit another PR to invalidate the cache periodically so that CA can detect that formerly cached nodes have been attached to ASG(s).

Also refactor AwsManager for less complexity by extracting types, following the discussion [here](kubernetes#46 (comment))
@mumoshu force-pushed the fix-excessive-describe-asg-calls branch from 37b8225 to dfb481b on June 22, 2017 08:39
@mumoshu (Contributor, Author) commented Jun 22, 2017

Rebased this onto #137 so that we can merge this after #107 and #137 without any conflict.
Please look at the last commit if you need to see only the changes made by this PR.

@MaciekPytel (Contributor)

/lgtm

@k8s-ci-robot added the `lgtm` label ("Looks good to me"; indicates that a PR is ready to be merged) on Jun 23, 2017
@MaciekPytel merged commit 28caf01 into kubernetes:master on Jun 23, 2017
yaroslava-serdiuk pushed a commit to yaroslava-serdiuk/autoscaler that referenced this pull request Feb 22, 2024
Add support for tainted flavors