
AWS: We should add a rate limiter to AWS API calls #12121

Closed
justinsb opened this issue Aug 1, 2015 · 33 comments
Labels: area/cloudprovider, priority/important-soon

@justinsb (Member) commented Aug 1, 2015:

We have had a few bugs where we would get into fast-retry loops, and often we would exhaust the AWS API quota. Sometimes this would even cause AWS to send a warning email, e.g. #11979.

@brendandburns suggested in #11979 (comment) that:

We should probably add a rate limiter into the AWS client, like we did for GCE:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/util/throttle.go

That seems like something we should have: we should obviously fix any bugs that cause fast-retry loops, but we should also have a second layer of defense. Also, if someone launches a thousand services that each require an ELB, we want to respect the AWS API rate limits even if we don't have a bug per se.
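As a sketch of the idea (not what was eventually merged - the type name, Do wrapper, and QPS/burst values here are all illustrative), a shared token bucket in front of the AWS client could look roughly like this, using golang.org/x/time/rate:

```go
package awsthrottle

import (
	"context"

	"golang.org/x/time/rate"
)

// Limiter caps the rate of outgoing AWS API calls so that a retry loop
// elsewhere in the code cannot exhaust the account-wide request quota.
type Limiter struct {
	rl *rate.Limiter
}

// NewLimiter allows roughly qps calls per second with the given burst.
func NewLimiter(qps float64, burst int) *Limiter {
	return &Limiter{rl: rate.NewLimiter(rate.Limit(qps), burst)}
}

// Do blocks until a token is available, then runs the AWS call.
func (l *Limiter) Do(ctx context.Context, call func() error) error {
	if err := l.rl.Wait(ctx); err != nil {
		return err // context cancelled or deadline exceeded
	}
	return call()
}
```

Every call site would then go through something like `limiter.Do(ctx, func() error { _, err := ec2Client.DescribeInstances(input); return err })`, so even a buggy fast-retry loop is slowed down before it reaches AWS.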

@brendandburns self-assigned this Aug 2, 2015
@mbforbes added the priority/backlog label Aug 16, 2015
@aronchick (Contributor):

I propose moving this up to P1 - I've seen this happen regularly with a default install.

@danielschonfeld (Contributor):

We're hit by the lack of this in kube-controller-manager.

30 nodes. Default install utilizing the script. About 6 minutes after launching kube-controller-manager we get the following:

E0105 01:18:24.270387       5 routecontroller.go:52] Couldn't reconcile node routes: error listing routes: error listing AWS instances: RequestLimitExceeded: Request limit exceeded.

with nothing of great interest before it, utilizing --v=4 (using Kubernetes 1.1.3).

@schonfeld:

@aronchick this is a pretty big deal breaker on our end... Do you have any suggestions for workarounds in the meantime?

@justinsb (Member, Author) commented Jan 6, 2016:

The 30 node report above seems like something I should be able to reproduce. I'll give it a go, either tonight or tomorrow morning. (Of course if anyone else wants to try they should feel free!)

At v=4 we should see a lot of "AWS request:" log messages, as all AWS calls should be logged. My guess is somewhere we are in a loop and making calls sufficiently rapidly to exhaust our API call quota, but this should be visible in the logs. @danielschonfeld there wasn't anything of this nature?

@danielschonfeld (Contributor):

@justinsb unfortunately we're trying to set up a new production cluster on AWS, so we're using v1.1.3, which doesn't include the "AWS request:" log statements... I think the earliest release that merge appeared in was 1.2.0-alpha1.

@justinsb (Member, Author) commented Jan 6, 2016:

@danielschonfeld that makes sense. I'll probably try with head first but if I can't reproduce I'll try with 1.1.3 with the logging PR cherry-picked - thanks for the tip!

@danielschonfeld (Contributor):

@justinsb let me know what you come up with. For now we've raised the node-sync-interval on kube-controller-manager to 1 minute. This seems to "alleviate" the problem to the point where KCM can do its magic properly, as opposed to not being able to at all.

But we still see similar problems: sometimes when mounting an EBS volume is unsuccessful, it gets rate limited asking for volume info.

I'm also planning on submitting a PR for some optimization on the ListRoutes function in aws_routes.go which should help this a bit.

@justinsb (Member, Author) commented Jan 6, 2016:

I was able to cherry-pick the logging patch back to 1.1 (it's a simple cherry-pick but then you have to change the package name to aws_cloud; I then changed the logging to V(2) so I didn't have to mess around with changing log levels).

With N instances, it does appear that the ListRoutes function is a big culprit, because it makes 1+N calls every 10 seconds.

From kube-controller-manager (with 5 nodes):

I0106 05:16:05.073872       8 log_handler.go:33] AWS request: ec2 DescribeRouteTables
I0106 05:16:05.129766       8 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 05:16:05.243437       8 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 05:16:05.325364       8 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 05:16:05.427777       8 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 05:16:05.537403       8 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 05:16:15.625953       8 log_handler.go:33] AWS request: ec2 DescribeRouteTables
I0106 05:16:15.669651       8 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 05:16:15.761153       8 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 05:16:15.846841       8 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 05:16:16.016643       8 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 05:16:16.100331       8 log_handler.go:33] AWS request: ec2 DescribeInstances

Kubelet on each node is also making a call to ec2 DescribeInstances every 10 seconds:

Jan 06 05:16:39 ip-172-20-0-124 kubelet[17038]: I0106 05:16:39.993942   17038 log_handler.go:33] AWS request: ec2 DescribeInstances
Jan 06 05:16:50 ip-172-20-0-124 kubelet[17038]: I0106 05:16:50.436616   17038 log_handler.go:33] AWS request: ec2 DescribeInstances

I don't see the AWS rate limits actually documented anywhere, but this page suggests that they vary: https://forums.aws.amazon.com/message.jspa?messageID=454580 "The rate limit that you will see with EC2 can vary depending on system load.".

There are definitely issues with the EBS volume code, particularly around retries - it was originally a copy and paste from GCE, and it could use a resync to pick up all the bug fixes that have been made in GCE. I started that effort in #15938 but need to resume that work. I'm not sure whether the volume retry loop is actually the root cause here and we just notice the controller-manager because it's much easier to observe.

I agree that we should optimize the route list to avoid the 1+N calls. I don't know if you had a particular fix in mind, @danielschonfeld. We could cache the instance-id to private-dns-name mapping; we could collect the instance IDs and look them up in bulk; or we could pre-fetch the instance-id to private-dns-name mapping, on the grounds that every node should be in the routing table anyway.

I did raise the number of nodes from 5 to 30 (admittedly after already having launched a cluster, vs. the all-at-once launch scenario), and I think we also make N DescribeInstances calls when updating ELBs, for example. So that might also be one we should look at, because it could become N^2 when a group of nodes launches near-simultaneously.

I wasn't personally able to hit my request quota with 30 nodes, but given that the limit might not be fixed, I don't think this means a lot. I do think that if we optimize ListRoutes we will dramatically reduce the number of API calls we are making.

Finally, I am not sure that we really need to call ListRoutes every 10 seconds; I wonder if we should change the default on AWS to 60 seconds, for example. Or maybe even slower, but make sure that we trigger a refresh immediately whenever a node changes (or whatever the correct trigger events should be).

@schonfeld:

@justinsb we were thinking the same thing: collect the instance IDs and then make a single call to the DescribeInstances API (which supports a MaxResults of 1,000). @danielschonfeld mentioned that k8s has a much lower node limit than that, so other than a simple sanity check there we should be able to get away with it. Curious that someone had made a "plural" getInstancesByIds method but opted to still do one GET at a time..? Either way, that method isn't being used anywhere.

/re kubelet -- it looks like whoever wrote the code that makes the API request wanted to get rid of it:

// TODO: We can't assume that the node has credentials to talk to the
// cloudprovider from arbitrary nodes. At most, we should talk to a
// local metadata server here.
node.Spec.ProviderID, err = cloudprovider.GetInstanceProviderID(kl.cloud, kl.nodeName)

It seems that cloud.GetInstanceProviderID essentially boils down to aws.InstanceID. Since an instance's AvailabilityZone and InstanceId are highly unlikely to change, we can probably cache these easily and avoid an extra N calls every 10s.
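A minimal sketch of that caching, assuming the aws-sdk-go ec2metadata package; the selfInfo type and lookup helper are made-up names, not the real kubelet/cloud-provider wiring:

```go
package awsnode

import (
	"sync"

	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
)

// selfInfo caches the instance ID and availability zone of the local
// node after the first lookup; neither value changes for the lifetime
// of an EC2 instance, so one metadata call is enough.
type selfInfo struct {
	once       sync.Once
	instanceID string
	az         string
	err        error
}

func (s *selfInfo) lookup() (instanceID, az string, err error) {
	s.once.Do(func() {
		// The instance metadata service is local to the node and is
		// not subject to the EC2 API request quota.
		md := ec2metadata.New(session.Must(session.NewSession()))
		if s.instanceID, s.err = md.GetMetadata("instance-id"); s.err != nil {
			return
		}
		s.az, s.err = md.GetMetadata("placement/availability-zone")
	})
	return s.instanceID, s.az, s.err
}
```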

/re the EBS logic -- FWIW, the aws-sdk for go comes with a default exponential backoff retryer...
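For reference, the per-request retry budget is just a client config option (a sketch, not the cloud provider's actual construction code); note this only backs off the individual failing request, which is part of why a shared limiter is still wanted:

```go
package awsclient

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// newEC2Client builds an EC2 client whose built-in retryer (exponential
// backoff) is allowed up to 8 attempts per request.
func newEC2Client() *ec2.EC2 {
	sess := session.Must(session.NewSession())
	return ec2.New(sess, aws.NewConfig().WithMaxRetries(8))
}
```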

Thoughts?

@danielschonfeld (Contributor):

cc @schonfeld @justinsb

  1. I agree with both of you that we should grab all instance information in a single call in ListRoutes. In fact I have most of this coded already; I'm just trying to fix the compilation bugs, as I'm a total golang newbie. Hopefully it will be ready today or tomorrow.

  2. Re kubelet - I agree with @schonfeld that after the first call the data is 100% cacheable, as it's not going to change for the lifetime of the node. Even more so, there's absolutely no reason to fetch that data via a DescribeInstances call at all; we can change the code to grab it from the metadata server, which has no rate limits. Either way it's a chance for both optimization and cacheability.

  3. The GetInstancesByIds method was later removed from aws.go, but I'll be bringing it back for the purposes described in (1), except rewritten to take an array, make a single call, and return a map[string]*ec2.Instance keyed by instance ID (roughly as sketched below).
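A sketch of that bulk lookup against the aws-sdk-go ec2iface interface (not the actual branch/PR code); pagination and chunking for very large ID lists are omitted for brevity:

```go
package awscloud

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// getInstancesByIDs fetches all requested instances with a single
// DescribeInstances call instead of one call per node, returning them
// keyed by instance ID. Pagination (NextToken) is omitted here.
func getInstancesByIDs(client ec2iface.EC2API, instanceIDs []string) (map[string]*ec2.Instance, error) {
	instances := make(map[string]*ec2.Instance)
	if len(instanceIDs) == 0 {
		return instances, nil
	}
	out, err := client.DescribeInstances(&ec2.DescribeInstancesInput{
		InstanceIds: aws.StringSlice(instanceIDs),
	})
	if err != nil {
		return nil, err
	}
	for _, reservation := range out.Reservations {
		for _, instance := range reservation.Instances {
			instances[aws.StringValue(instance.InstanceId)] = instance
		}
	}
	return instances, nil
}
```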

@justinsb (Member, Author) commented Jan 6, 2016:

@danielschonfeld if you want, I would be happy to give some pointers on the code to help you get up to speed with Go; push it to your fork and I'd be happy to have a look.

getInstancesByIds is there, despite being a dumb loop, so that when we were ready to replace it with something smarter (i.e. now) it would be easy to do, rather than having to refactor those loops out of all the calling code. At least that was my idea at the time ;-)

I coded up (but have not yet tested) a process-wide back-off when we see RequestLimitExceeded in #19335. It is the "last line of defense", and we should also fix the individual problems like the 1+N in ListRoutes. The goal of #19335 is effectively to sacrifice/delay k8s to protect the AWS account; but whenever we hit that throttle we should fix the cause.
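To illustrate the difference from the SDK's built-in per-request retryer, a rough sketch of a process-wide back-off follows; this is not the code in #19335, and the type and method names are invented:

```go
package awsthrottle

import (
	"sync"
	"time"
)

// CrossRequestBackoff delays *all* AWS calls in the process after a
// RequestLimitExceeded error is observed, unlike the SDK's retryer,
// which only backs off the individual request that failed.
type CrossRequestBackoff struct {
	mu        sync.Mutex
	notBefore time.Time
	delay     time.Duration
	initial   time.Duration
	max       time.Duration
}

func NewCrossRequestBackoff(initial, max time.Duration) *CrossRequestBackoff {
	return &CrossRequestBackoff{initial: initial, max: max}
}

// BeforeCall blocks until the shared "not before" deadline has passed.
func (b *CrossRequestBackoff) BeforeCall() {
	b.mu.Lock()
	wait := time.Until(b.notBefore)
	b.mu.Unlock()
	if wait > 0 {
		time.Sleep(wait)
	}
}

// OnThrottle records a RequestLimitExceeded response and doubles the
// shared delay, up to the configured maximum.
func (b *CrossRequestBackoff) OnThrottle() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.delay == 0 {
		b.delay = b.initial
	} else if b.delay < b.max {
		b.delay *= 2
	}
	b.notBefore = time.Now().Add(b.delay)
}

// OnSuccess resets the delay once calls start succeeding again.
func (b *CrossRequestBackoff) OnSuccess() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.delay = 0
}
```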

The two specific problems I think we should fix are ListRoutes being 1+N and the kubelet polling the API.

For ListRoutes, I think getting all the instances at once is more obviously correct than caching, so we should go with that. I'm thinking we can just list all the instances (like getInstancesByRegex does) and return them all in a map of instance-id -> privateDnsName.

For the kubelet, I feel that special-casing the current instance is probably the way to go. PrivateDnsName and InstanceID cannot change once the instance has been launched. We already have getSelfAWSInstance, so I think we could just special-case the path where we are asking for the 'self-node' InstanceID.

@danielschonfeld (Contributor):

@justinsb here's my code https://github.com/danielschonfeld/kubernetes/tree/optimize-list-routes
aws.go needs some love... as it won't compile in its current state (I am truly sorry about how ugly it is)

As for getSelfAWSInstance - assuming the call to s.metadata hits the metadata server rather than the EC2 API, I say that's a sound choice! And definitely special-case it, as those two values will not change after instance launch. I think this is definitely our big ticket item, with the ListRoutes problem coming in a close second!

danielschonfeld added a commit to danielschonfeld/kubernetes that referenced this issue Jan 7, 2016
@davidopp modified the milestones: v1.2, v1.2-candidate Feb 19, 2016
@davidopp added the priority/important-soon label and removed the priority/backlog label Feb 19, 2016
@SpencerBrown (Contributor):

We are getting hit by this big-time with a relatively small cluster of 10 EC2 instances. It's about to be a deal breaker for us. We are having a lot of trouble trying to deploy a pod with an attached EBS volume because of the rate-limit-exceeded issues. This is causing production downtime.

Please prioritize this into 1.2.

The only alternative I can think of (besides moving off AWS, which we can't do at this point) is to simply turn off the cloud_provider options and handle all the config ourselves.

@therc (Member) commented Feb 19, 2016:

@SpencerBrown: do you know what class of APIs is causing you to hit limits?

@SpencerBrown (Contributor):

Figuring out the answer to that question looks like my weekend project :(.

I plan to use the AWS CloudWatch API event stream to debug, if I can get that to work.

@justinsb (Member, Author):

@SpencerBrown which version are you running? I think 1.1 logged all AWS API calls with v=4, but I'm not entirely sure.

I'm sorry for the problem - I'm sure we can get it fixed. Your case is the argument for this safety belt, in that we obviously have some sort of busy retry loop around volume mounting or something similar; we should fix the specific problem & then also have the safety belt to catch the next unknown problem.

I know it's not great comfort, but we've made a lot of progress on this in 1.2.

Let me know what version you're running and we can work together (this weekend if need be) to get you back up and running. I suspect it'll be a matter of enabling more verbose debugging, capturing the log and verifying the error, and then either merging this and cherry-picking back to 1.1, or merging a specific fix and cherry-picking back to 1.1.

@therc (Member) commented Feb 20, 2016:

I've had good luck correlating with CloudTrail entries...

@SpencerBrown (Contributor):

Currently running 1.1.7. I have enabled CloudTrail on my test cluster and am waiting to see the results. Thanks @justinsb.

@justinsb (Member, Author):

@SpencerBrown I checked and we don't have a lot of the logging options I was hoping for in the 1.1 branch. So CloudTrail is a great approach. Let me know what it shows and I'll see if I can figure out the problem and provide a backported fix.

(My money's on the AttachVolume call, but it's always where you least expect it...)

@SpencerBrown (Contributor):

I have been running CloudTrail on my test cluster for a couple of hours, and wrote a quick Ruby script to download and analyze the API call frequency. (I'll put this on GitHub later). So far I'm seeing:

  • Each worker node does about 1 DescribeInstances call per second.
  • Launching a pod with an EBS volume attachment causes 6 DescribeVolumes calls in quick succession and 1 AttachVolume call.
  • Deleting that pod causes 8 DescribeVolumes calls and 1 DetachVolume call.

The test cluster only has 2 worker nodes and none of our actual services deployed. Going to try our staging cluster next.

@SpencerBrown (Contributor):

Yikes, all kinds of odd things happening on our staging cluster. Here's the output from my Ruby script:


DescribeInstances:
  ip-172-20-1-30.ec2.internal:
   -- 594 DescribeInstances calls in 655.0 seconds = 0.907 calls/sec
  ip-172-20-1-32.ec2.internal:
   -- 589 DescribeInstances calls in 642.0 seconds = 0.917 calls/sec
  ip-172-20-1-34.ec2.internal:
   -- 598 DescribeInstances calls in 653.0 seconds = 0.916 calls/sec
  ip-172-20-1-31.ec2.internal:
   -- 599 DescribeInstances calls in 655.0 seconds = 0.915 calls/sec
  ip-172-20-1-33.ec2.internal:
   -- 591 DescribeInstances calls in 647.0 seconds = 0.913 calls/sec
DescribeSecurityGroups:
  :
   -- 1067 DescribeSecurityGroups calls in 646.0 seconds = 1.652 calls/sec
DescribeSubnets:
  :
   -- 526 DescribeSubnets calls in 631.0 seconds = 0.834 calls/sec
DescribeVpcs:
  :
   -- 533 DescribeVpcs calls in 651.0 seconds = 0.819 calls/sec
DescribeLoadBalancers:
  :
   -- 540 DescribeLoadBalancers calls in 650.0 seconds = 0.831 calls/sec
CreateLoadBalancer:
  :
   -- 122 CreateLoadBalancer calls in 625.0 seconds = 0.195 calls/sec

@justinsb (Member, Author):

@SpencerBrown That's a great list - thank you. If you can collect the kube-controller-manager.log from the master (/var/log/kube-controller-manager.log) we should be able to see why CreateLoadBalancer is going crazy...

Did you launch this with kube-up, BTW? We'll get it fixed either way, but if you launched with kube-up I can assume a certain configuration...

I'm going to go through the rest of your list now, but I wanted to see about getting the kube-controller-manager log if possible. I'm justinsb on kubernetes slack or email is justin at fathomdb.com if you want to send it privately.

@justinsb (Member, Author):

For the test cluster...

> each worker node does about 1 DescribeInstances call per second

Each worker node does make a DescribeInstances call in 1.1, but it should be every 10 seconds. This is NodeStatusUpdateFrequency in cmd/kubelet/app/server.go. You could override this with --node-status-update-frequency on the kubelet - are you setting it to 1s?

We've fixed this entirely in 1.2, BTW.

> Launching a pod with an EBS volume attachment causes 6 DescribeVolumes calls in quick succession and 1 AttachVolume call.

I'll investigate & open an issue for the repeated DescribeVolumes calls!

> Deleting that pod causes 8 DescribeVolumes calls and 1 DetachVolume call

I'll investigate & open an issue for the repeated DescribeVolumes calls!


For the staging cluster: the 1 DescribeInstances/second is, I think, the same thing we saw on the test cluster.

I take it ip-172-20-1-33.ec2.internal is the master (which would suggest this is not kube-up). If I can get the kube-controller-manager logs, I think it will become clear what is happening. It looks like something with an LB repeatedly trying and failing to be created.

@justinsb (Member, Author):

Oh ... the repeated DescribeVolumes calls are us polling to know when the volume is attached/detached. I don't think there's anything we can do about that - the attachment state isn't available through the EC2 API except by polling. I'll open an issue anyway, though I sure hope it isn't a problem, because it is tough to solve. Maybe we reduce the polling frequency or similar for 1.2.
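For context, the attach/detach polling amounts to something like the sketch below (not the actual volume code); the interval and attempt count are the knobs that could be widened to cut the call volume:

```go
package awsvolumes

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// waitForAttachmentState polls DescribeVolumes until the volume reports the
// desired attachment state ("attached" or "detached"), using a fixed interval
// so the number of calls per attach is bounded and tunable.
func waitForAttachmentState(client ec2iface.EC2API, volumeID, desired string, interval time.Duration, attempts int) error {
	for i := 0; i < attempts; i++ {
		out, err := client.DescribeVolumes(&ec2.DescribeVolumesInput{
			VolumeIds: []*string{aws.String(volumeID)},
		})
		if err != nil {
			return err
		}
		if len(out.Volumes) == 1 {
			vol := out.Volumes[0]
			if desired == "detached" && len(vol.Attachments) == 0 {
				return nil
			}
			for _, att := range vol.Attachments {
				if aws.StringValue(att.State) == desired {
					return nil
				}
			}
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("volume %s did not reach state %q after %d attempts", volumeID, desired, attempts)
}
```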

@SpencerBrown (Contributor):

Been working this offline with @justinsb (big thanks)! Net conclusions:

  1. CreateLoadBalancer is a known issue, triggered by an incorrect network config that caused controller-manager to loop rapidly trying to create the ELB.

  2. The DescribeInstances rate of 1/sec/node instead of 0.1/sec/node was cleared up, for an unknown reason, by restarting all the master node components.

@therc (Member) commented Feb 20, 2016:

Is (1) fixed by my IPPermission PR? 1.2 is much better behaved than 1.1, and I tested my change on a setup that I don't think was ever contemplated (inside the 10.0.0.0/16 default VPC and with a Direct Connect link to outside AWS).

@justinsb (Member, Author):

@therc The problem with (1) was the multiple-subnets problem, which I think we've fixed in 1.2 as well. Happy to hear that 1.2 is more robust, though. How did you get the 10.0.0.0/16 setup running? Did you use --non-masquerade-cidr or a different trick?

@therc (Member) commented Feb 20, 2016:

I can't check right now, but I used --non-masquerade-cidr=10.64.0.0/10 because pods and services for the various clusters run in 10.64..10.127 networks. The VMs live in 10.0.16.0/20. Should this be documented?

justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 21, 2016
This applies a cross-request time delay when we observe
RequestLimitExceeded errors, unlike the default library behaviour which
only applies a *per-request* backoff.

Issue kubernetes#12121
@justinsb (Member, Author):

We merged #19335. @SpencerBrown's issue above was caused by an incorrect setup while creating a LoadBalancer (multiple subnets, not tagged), which we are more tolerant of in 1.2, and I've opened issues around the frequency of retries in the servicecontroller loop.

Closing :-)
