cluster-autoscaler: Fix excessive calls to DescribeAutoScalingGroup #46
Conversation
cc: @andrewsykim
Hi @SleepyBrett!
Does it work with autodiscovery and dynamic config updates for AWS?
@mwielgus Yes, but please share your concerns if you have any!
I'm on vacation, so I won't be able to review till next week, sorry!
Hi @andrewsykim, sorry for rushing, but I hope you could review this 🌷
```go
if _, found := m.instancesNotInManagedAsg[*instance]; found {
	// The instance is already known to not belong to any configured ASG
	// Skip regenerateCache so that we won't unnecessarily call DescribeAutoScalingGroups
	// See https://github.com/kubernetes/contrib/issues/2541
```
Let's link to the issue in the new repo instead: #45
```go
	if err := m.regenerateCache(); err != nil {
		return nil, fmt.Errorf("Error while looking for ASG for instance %+v, error: %v", *instance, err)
	}
	if config, found := m.asgCache[*instance]; found {
		return config, nil
	}
	// instance does not belong to any configured ASG
	glog.V(6).Infof("Instance %+v is not in any ASG managed by CA. CA is now memorizing the fact not to unnecessarily call AWS API afterwards trying to find the unexistent managed ASG for the instance", *instance)
	m.instancesNotInManagedAsg[*instance] = struct{}{}
```
Are there possible race conditions here? i.e. could a recently launched instance be added to this list because it was missing tags for a short period of time?
@andrewsykim AFAIK, this func is called only after the K8S cluster notices the existence of the node.
Supposing the node exists, shouldn't it already have the appropriate tags added, kubelet started, the node registered, etc.?
@andrewsykim Would you mind telling me which tag would be missing?
Ahh, I didn't know the instance would only be evaluated once it's registered with the master. This shouldn't be a problem then :)
```go
	asgs:                     make([]*asgInformation, 0),
	service:                  service,
	asgCache:                 make(map[AwsRef]*Asg),
	instancesNotInManagedAsg: make(map[AwsRef]struct{}),
```
I feel like we are adding a lot of complexity here by adding this field (`instancesNotInManagedAsg`). I was thinking of a solution where we put `regenerateCache` in a separate loop and have it run every X seconds. Would you happen to know if that's a feasible solution?
Excuse me if I'm not following you correctly, but if we want less complexity here, I'd rather suggest extracting a new object dedicated to efficiently fetching ASG data (probably named something like `asgRegistry`), which would hold the actual implementation of `GetAsgForInstance`.
After the change, `AwsManager` would just delegate `GetAsgForInstance` to `asgRegistry`, and the relevant struct members like `asgs`, `asgCache`, and `instancesNotInManagedAsg` would move to `asgRegistry`.
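A rough sketch of that extraction, under the assumption that the types below are simplified stand-ins for the real cluster-autoscaler types (the field and method shapes here are hypothetical, not the merged code):

```go
package main

import "fmt"

// Simplified stand-ins for the real types.
type AwsRef struct{ Name string }
type Asg struct{ Name string }

// asgRegistry owns the ASG lookup state that would move out of AwsManager:
// asgCache and instancesNotInManagedAsg.
type asgRegistry struct {
	asgCache                 map[AwsRef]*Asg
	instancesNotInManagedAsg map[AwsRef]struct{}
	regenerate               func(r *asgRegistry) error // stands in for a DescribeAutoScalingGroups-backed rebuild
}

func (r *asgRegistry) GetAsgForInstance(instance AwsRef) (*Asg, error) {
	if _, found := r.instancesNotInManagedAsg[instance]; found {
		return nil, nil // already known to be unmanaged; skip the AWS call
	}
	if asg, found := r.asgCache[instance]; found {
		return asg, nil
	}
	if err := r.regenerate(r); err != nil {
		return nil, fmt.Errorf("error while looking for ASG for instance %+v: %v", instance, err)
	}
	if asg, found := r.asgCache[instance]; found {
		return asg, nil
	}
	// Memorize that this instance is not in any managed ASG.
	r.instancesNotInManagedAsg[instance] = struct{}{}
	return nil, nil
}

// AwsManager then just delegates.
type AwsManager struct{ registry *asgRegistry }

func (m *AwsManager) GetAsgForInstance(instance AwsRef) (*Asg, error) {
	return m.registry.GetAsgForInstance(instance)
}

func main() {
	registry := &asgRegistry{
		asgCache:                 map[AwsRef]*Asg{},
		instancesNotInManagedAsg: map[AwsRef]struct{}{},
		regenerate: func(r *asgRegistry) error {
			r.asgCache[AwsRef{Name: "i-managed"}] = &Asg{Name: "managed-asg"}
			return nil
		},
	}
	m := &AwsManager{registry: registry}
	asg, _ := m.GetAsgForInstance(AwsRef{Name: "i-managed"})
	fmt.Println(asg.Name)
	unmanaged, _ := m.GetAsgForInstance(AwsRef{Name: "i-other"})
	fmt.Println(unmanaged == nil)
}
```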
The reasoning behind my suggestion is that we won't want to introduce another goroutine just for calling `regenerateCache`: the less concurrent programming there is, the more deterministic CA's behavior is, and the happier our lives are? 😃
And if you'd just like to run `regenerateCache` every X seconds, theoretically we can do that by checking, at the very beginning of `regenerateCache`, whether the elapsed time since the last regeneration is greater than X.
> And if you'd just like to run `regenerateCache` every X seconds, theoretically we can do that by checking, at the very beginning of `regenerateCache`, whether the elapsed time since the last regeneration is greater than X.
sounds good to me
@andrewsykim Thanks for the confirmation 👍
I'm going to include the work of extracting `asgRegistry` as I've described above in this PR.
However, regarding the periodic `regenerateCache` you've suggested, I'm not yet sure why it is needed.
AFAIK, cache regeneration is needed and done only when CA sees a k8s node for the first time. If we agree to assume a k8s node to not move among different ASGs, which seems reasonable to me, we won't need to periodically invalidate the asg cache.
What do you think?
I can't say at the moment; the cluster-autoscaler code has changed a lot since the last time I developed anything, so I'm probably not the best person to make any calls here. What you're saying seems reasonable, though I'll need to read the code again to be sure.
@andrewsykim Thanks!
Then, would you mind if we proceed to get this merged without the periodic `regenerateCache` thing for now? Even without it, merging this doesn't make the situation worse. If we really need the `regenerateCache` improvement, we can submit a new GitHub issue, right?
cc @mwielgus
> I'm probably not the best person to make any calls here

Neither am I!
I guess we are the only contributors concerned with the AWS part of CA? 😄
As this is an OSS project, I suppose one of the maintainers, or an active contributor appointed as responsible by a maintainer, would be eligible to make the decision.
@andrewsykim Oh, hey, I just realized that we already had the periodic `regenerateCache` in our code-base 😃
@andrewsykim According to our discussion, I've added 37b8225 to avoid adding the …
To ensure I'm not breaking anything, I did some manual testing with 37b8225 and verified that it scales the cluster up/down successfully as before.
Hi @MaciekPytel @mwielgus, thanks as always for maintaining CA 👍
Anyway, as a heavy user of K8S on AWS myself, I do want this issue to be fixed before CA 0.6.
@mumoshu We definitely want this for 0.6.
I'll take some time to test this again shortly.
@MaciekPytel @mwielgus I believe this is going to conflict with #107.
/lgtm
By caching AWS refs for nodes/EC2 instances already known not to be in any of the ASGs managed by cluster-autoscaler (CA). Also refactor AwsManager for less complexity by extracting types, in accordance with the discussion made [here](kubernetes#46 (comment)).
Please be aware of the edge case: this method is safe as long as users don't attach nodes by calling the AttachInstances API after CA has cached them. I believe, even if it were necessary, a warning in the documentation about this edge case would be enough for now. If we really need to support that case, I will submit another PR to invalidate the cache periodically so that CA can detect when formerly cached nodes are attached to ASG(s).
The docker image built from this branch is available for testing at `mumoshu/fix-excessive-describe-asg-calls`. You can see that CA detects and remembers nodes in unmanaged ASGs so that it can prevent those nodes from triggering unnecessary `regenerateCache` invocations afterwards.

Resolves #45