
cluster-autoscaler: Fix excessive calls to DescribeAutoScalingGroup #46

Merged

Conversation

@mumoshu (Contributor) commented May 9, 2017

By caching AWS refs for nodes/EC2 instances already known not to be in any of the ASGs managed by cluster-autoscaler (CA).

Please beware of the edge case: this method is safe as long as users don't attach nodes by calling the AttachInstances API after CA has cached them. Even if we needed to support that case, I believe a warning in the documentation about the edge case is enough for now. If we really do need to support it, I will submit another PR to invalidate the cache periodically so that CA can detect that formerly cached nodes have been attached to ASG(s).

The Docker image built from this branch is available for testing at mumoshu/fix-excessive-describe-asg-calls.

You can see that CA detects and remembers nodes in unmanaged ASGs so that it can prevent the nodes from triggering unnecessary regenerateCache invocations afterwards:

I0509 05:37:28.165739       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
I0509 05:37:28.824668       1 aws_manager.go:197] Regenerating ASG information for kube4-Asg1-5GX1OEW9YLPA-Workers-1WBDQUOBGXA6A
I0509 05:37:28.882084       1 aws_manager.go:188] Instance {Name:i-06079265fc8999d4c} is not in any ASG managed by CA. CA is now memorizing the fact not to unnecessarily call AWS API afterwards trying to find the unexistent managed ASG for the instance
I0509 05:37:28.882149       1 aws_manager.go:197] Regenerating ASG information for kube4-Asg1-5GX1OEW9YLPA-Workers-1WBDQUOBGXA6A
I0509 05:37:28.918931       1 aws_manager.go:188] Instance {Name:i-0c6c700cbcc62ed1d} is not in any ASG managed by CA. CA is now memorizing the fact not to unnecessarily call AWS API afterwards trying to find the unexistent managed ASG for the instance
I0509 05:37:28.918984       1 aws_manager.go:197] Regenerating ASG information for kube4-Asg1-5GX1OEW9YLPA-Workers-1WBDQUOBGXA6A
I0509 05:37:28.946945       1 aws_manager.go:188] Instance {Name:i-0e48a422f75ae6500} is not in any ASG managed by CA. CA is now memorizing the fact not to unnecessarily call AWS API afterwards trying to find the unexistent managed ASG for the instance
I0509 05:37:28.947044       1 aws_manager.go:197] Regenerating ASG information for kube4-Asg1-5GX1OEW9YLPA-Workers-1WBDQUOBGXA6A
I0509 05:37:28.984235       1 aws_manager.go:188] Instance {Name:i-0a8cb4932d32e1296} is not in any ASG managed by CA. CA is now memorizing the fact not to unnecessarily call AWS API afterwards trying to find the unexistent managed ASG for the instance
I0509 05:37:28.984272       1 static_autoscaler.go:197] Filtering out schedulables
I0509 05:37:28.984338       1 static_autoscaler.go:205] No schedulable pods
I0509 05:37:28.984356       1 static_autoscaler.go:211] No unschedulable pods

Resolves #45
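
For reference, the whole change condenses to a small negative cache in GetAsgForInstance. A minimal, self-contained Go sketch assembled from the diff excerpts quoted in the review below (the type stubs and the nil, nil return are simplifications for illustration, not the exact merged code):

package aws

import "fmt"

// Type stubs for illustration; the real definitions live in aws_manager.go.
type AwsRef struct{ Name string }
type Asg struct{}

type AwsManager struct {
	asgCache                 map[AwsRef]*Asg
	instancesNotInManagedAsg map[AwsRef]struct{}
}

func (m *AwsManager) regenerateCache() error { return nil } // stub

// GetAsgForInstance returns the managed ASG for an instance, memorizing
// instances known not to belong to any managed ASG so that later calls
// skip DescribeAutoScalingGroups entirely.
func (m *AwsManager) GetAsgForInstance(instance *AwsRef) (*Asg, error) {
	if _, found := m.instancesNotInManagedAsg[*instance]; found {
		// Known not to be in any managed ASG: don't touch the AWS API.
		return nil, nil
	}
	if config, found := m.asgCache[*instance]; found {
		return config, nil
	}
	if err := m.regenerateCache(); err != nil {
		return nil, fmt.Errorf("Error while looking for ASG for instance %+v, error: %v", *instance, err)
	}
	if config, found := m.asgCache[*instance]; found {
		return config, nil
	}
	// Not found even after refreshing: memorize that fact.
	m.instancesNotInManagedAsg[*instance] = struct{}{}
	return nil, nil
}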

@k8s-ci-robot added the `cncf-cla: yes` label (indicates the PR's author has signed the CNCF CLA) on May 9, 2017
@mwielgus (Contributor)

cc: @andrewsykim

@mwielgus added the `area/provider/aws` label (issues or PRs related to the AWS provider) on May 10, 2017
@mumoshu (Contributor, Author) commented May 12, 2017

Hi @SleepyBrett!
This is the fix for kubernetes-retired/contrib#2541, which you reported last month.
Would you mind confirming whether it works for you?
If you have a k8s 1.6 cluster, you can run cluster-autoscaler from the Docker image mumoshu/fix-excessive-describe-asg-calls.

@mwielgus (Contributor)

Does it work with autodiscovery and dynamic config updates for AWS?

@mumoshu (Contributor, Author) commented May 12, 2017

@mwielgus Yes, but please share your concerns if you have any!

@andrewsykim (Member)

I'm on vacation, so won't be able to review til next week, sorry!

@mumoshu (Contributor, Author) commented May 23, 2017

Hi @andrewsykim, sorry to rush you, but I hope you can review this 🌷

if _, found := m.instancesNotInManagedAsg[*instance]; found {
	// The instance is already known to not belong to any configured ASG
	// Skip regenerateCache so that we won't unnecessarily call DescribeAutoScalingGroups
	// See https://github.com/kubernetes/contrib/issues/2541
@andrewsykim (Member):

Let's link to the issue in the new repo instead: #45

if err := m.regenerateCache(); err != nil {
	return nil, fmt.Errorf("Error while looking for ASG for instance %+v, error: %v", *instance, err)
}
if config, found := m.asgCache[*instance]; found {
	return config, nil
}
// instance does not belong to any configured ASG
glog.V(6).Infof("Instance %+v is not in any ASG managed by CA. CA is now memorizing the fact not to unnecessarily call AWS API afterwards trying to find the unexistent managed ASG for the instance", *instance)
m.instancesNotInManagedAsg[*instance] = struct{}{}
@andrewsykim (Member):

Are there possible race conditions here? I.e., could a recently launched instance be added to this list because it was missing tags for a short period of time?

@mumoshu (Contributor, Author):

@andrewsykim AFAIK, this func is called only after the K8S cluster notices the existence of the node.
Given that the node exists, shouldn't it already have the appropriate tags added, kubelet started, the node registered, etc.?

@mumoshu (Contributor, Author):

@andrewsykim Would you mind telling me which tag would be missing?

@andrewsykim (Member):

Ahh, I didn't know the instance would only be evaluated once it's registered with the master; this shouldn't be a problem then :)

asgs:                     make([]*asgInformation, 0),
service:                  service,
asgCache:                 make(map[AwsRef]*Asg),
instancesNotInManagedAsg: make(map[AwsRef]struct{}),
@andrewsykim (Member):

I feel like we are adding a lot of complexity here by adding this field (instancesNotInManagedAsg). I was thinking of a solution where we can put regenerateCache in a separate loop and have it run every X seconds. Would you happen to know if that's a feasible solution?

@mumoshu (Contributor, Author):

Excuse me if I'm not following you correctly, but I'd rather suggest extracting a new object dedicated to efficiently fetching ASG data (probably named something like asgRegistry, holding the actual implementation of GetAsgForInstance) if we want less complexity here.

After the change, AwsManager would just delegate GetAsgForInstance to asgRegistry, and the relevant struct members like asgs, asgCache, and instancesNotInManagedAsg would move to asgRegistry.

The reasoning behind my suggestion is that we won't want to introduce another goroutine just for calling regenerateCache - the less concurrent programming, the more deterministic CA's behavior is, and the happier our lives are? 😃

And if you'd just like to run regenerateCache every X seconds, we could, in theory, do it by checking at the very beginning of regenerateCache whether the elapsed time since the last regeneration is greater than X?
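
A minimal Go sketch of that elapsed-time check, assuming hypothetical names lastRefresh and refreshInterval (neither appears in this PR):

package aws

import "time"

// Illustrative only: lastRefresh and refreshInterval are assumed names,
// not fields from this PR.
type AwsManager struct {
	lastRefresh     time.Time
	refreshInterval time.Duration
}

// regenerateCache is throttled: if it ran less than refreshInterval ago,
// it returns immediately instead of calling DescribeAutoScalingGroups.
func (m *AwsManager) regenerateCache() error {
	if time.Since(m.lastRefresh) < m.refreshInterval {
		return nil // regenerated recently enough; skip the AWS API calls
	}
	// ... existing logic that rebuilds asgCache via DescribeAutoScalingGroups ...
	m.lastRefresh = time.Now()
	return nil
}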

@andrewsykim (Member):

And if you'd just like to run regenerateCache every X seconds, we could, in theory, do it by checking at the very beginning of regenerateCache whether the elapsed time since the last regeneration is greater than X?

sounds good to me

@mumoshu (Contributor, Author):

@andrewsykim Thanks for the confirmation 👍

I'm going to include the asgRegistry extraction described above in this PR.

However, regarding the periodic regenerateCache you've suggested, I'm not yet sure why it is needed.
AFAIK, cache regeneration is needed and done only when CA sees a k8s node for the first time. If we agree to assume that a k8s node does not move among different ASGs, which seems reasonable to me, we won't need to periodically invalidate the ASG cache.
What do you think?

@andrewsykim (Member):

I can't say at the moment; the cluster-autoscaler code has changed a lot since I last worked on it, so I'm probably not the best person to make any calls here. What you're saying seems reasonable, though I'll need to read the code again to be sure.

@mumoshu (Contributor, Author):

@andrewsykim Thanks!
Then, would you mind if we proceed to get this merged without the periodic regenerateCache for now? Even without it, merging this doesn't make the situation worse. If we really need the regenerateCache improvement, we can file a new GitHub issue, right?

cc @mwielgus

@mumoshu (Contributor, Author):

I'm probably not the best person to make any calls here

Neither am I!
I guess we are the only contributors concerned with the AWS part of CA? 😄
As this is an OSS project, I suppose one of the maintainers, or an active contributor appointed as responsible by a maintainer, would be eligible to make the decision.

@mumoshu (Contributor, Author):

@andrewsykim Oh, hey, I just realized that we already had the periodic regenerateCache in our codebase 😃

@mumoshu force-pushed the fix-excessive-describe-asg-calls branch from a55e868 to 24a7581 on June 8, 2017 08:17
mumoshu added a commit to mumoshu/autoscaler that referenced this pull request Jun 8, 2017
@mumoshu (Contributor, Author) commented Jun 8, 2017

@andrewsykim Following our discussion, I've added 37b8225 to avoid adding the instancesNotInManagedAsg field directly to AwsManager while maintaining consistency throughout the codebase.
It may seem like a big change in terms of LOC; however, basically all I did was split AwsManager into 3 types to keep each type simple.
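
A rough Go sketch of the kind of split described here; only asgRegistry and the moved fields are named in the thread, everything else is assumed for illustration:

package aws

// Only asgRegistry and the moved fields come from the discussion; the other
// identifiers are stubs assumed for illustration.
type AwsRef struct{ Name string }
type Asg struct{}
type asgInformation struct{}

// asgRegistry owns ASG lookup and the caches that back it.
type asgRegistry struct {
	asgs                     []*asgInformation
	asgCache                 map[AwsRef]*Asg
	instancesNotInManagedAsg map[AwsRef]struct{}
}

func (r *asgRegistry) getAsgForInstance(instance *AwsRef) (*Asg, error) {
	// ... negative-cache lookup plus regenerateCache, as sketched earlier ...
	return nil, nil
}

// AwsManager delegates lookups instead of holding the caches itself.
type AwsManager struct {
	registry *asgRegistry
}

func (m *AwsManager) GetAsgForInstance(instance *AwsRef) (*Asg, error) {
	return m.registry.getAsgForInstance(instance)
}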

@mumoshu force-pushed the fix-excessive-describe-asg-calls branch from 24a7581 to 37b8225 on June 8, 2017 08:43
mumoshu added a commit to mumoshu/autoscaler that referenced this pull request Jun 8, 2017
@mumoshu (Contributor, Author) commented Jun 8, 2017

To ensure I'm not breaking anything, I did some manual testing with 37b8225 and verified that it scales the cluster up/down successfully, as before.

@mumoshu (Contributor, Author) commented Jun 11, 2017

Hi @MaciekPytel @mwielgus, thanks as always for maintaining CA 👍
Would you mind sharing your decision on this - LGTM, further changes needed, questions, or other issues to be tracked in the future (more tests, more e2e, more README), etc.?
Also, not very important, but as a contributor I personally prefer PRs getting merged relatively quickly when they look good enough, even if not perfect, so that we can keep moving forward.

@mumoshu (Contributor, Author) commented Jun 11, 2017

Anyway, being a heavy user of K8S on AWS myself, I do want this issue fixed before CA 0.6.

@MaciekPytel (Contributor)

@mumoshu We definitely want this for 0.6.
It looks ok to me at a glance, but I don't have an AWS env to test this and I don't know the AWS cloudprovider too well. @andrewsykim reviewed this, and if it looks good to him and you confirm you tested it (running with a non-managed ASG no longer causes constant cache regeneration; scale-up and scale-down work ok), then I'm happy to merge this.

@mumoshu (Contributor, Author) commented Jun 19, 2017

I'll take some time to test this again shortly.

… feature is enabled

By fixing CA to not reset `StaticAutoscaler` state before each iteration, so that it remembers the last scale-up/down time used to throttle scale-down; resetting that state was causing the issue.
@mumoshu (Contributor, Author) commented Jun 22, 2017

@MaciekPytel @mwielgus I believe this is going to conflict with #107.
Would you mind merging #107 before this?

By caching AWS refs for nodes/EC2 instances already known not to be in any of the ASGs managed by cluster-autoscaler (CA).

Please beware of the edge case: this method is safe as long as users don't attach nodes by calling the AttachInstances API after CA has cached them. Even if we needed to support that case, I believe a warning in the documentation about the edge case is enough for now. If we really do need to support it, I will submit another PR to invalidate the cache periodically so that CA can detect that formerly cached nodes have been attached to ASG(s).

Also refactor AwsManager for less complexity by extracting types, following the discussion [here](kubernetes#46 (comment))
@mumoshu force-pushed the fix-excessive-describe-asg-calls branch from 37b8225 to dfb481b on June 22, 2017 08:39
@mumoshu (Contributor, Author) commented Jun 22, 2017

Rebased this onto #137 so that we can merge this after #107 and #137 without any conflict.
Please look at the last commit if you need to see only the changes made by this PR.

@MaciekPytel (Contributor)

/lgtm

@k8s-ci-robot added the `lgtm` label ("Looks good to me"; indicates that a PR is ready to be merged) on Jun 23, 2017
@MaciekPytel merged commit 28caf01 into kubernetes:master on Jun 23, 2017
yaroslava-serdiuk pushed a commit to yaroslava-serdiuk/autoscaler that referenced this pull request Feb 22, 2024
Add support for tainted flavors