AWS: We should add a rate limiter to AWS API calls #12121
Propose moving this up to P1 - I've seen this happen regularly with a default install. |
We're hit by the lack of this in kube-controller-manager with 30 nodes. Default install using the script. About 6 minutes after launching kube-controller-manager we get the following:
with nothing of great interest before it, utilizing |
@aronchick this is a pretty big dealbreaker on our end... Do you have any suggestions for workarounds in the meantime? |
The 30 node report above seems like something I should be able to reproduce. I'll give it a go, either tonight or tomorrow morning. (Of course if anyone else wants to try they should feel free!) At v=4 we should see a lot of "AWS request:" log messages, as all AWS calls should be logged. My guess is somewhere we are in a loop and making calls sufficiently rapidly to exhaust our API call quota, but this should be visible in the logs. @danielschonfeld there wasn't anything of this nature? |
@justinsb unfortunately we're trying to set up a new production cluster on AWS so we're using v1.1.3. 1.1.3 doesn't include the "AWS request:" log statements... I think the earliest that merge appeared in was 1.2.0-alpha1. |
@danielschonfeld that makes sense. I'll probably try with head first but if I can't reproduce I'll try with 1.1.3 with the logging PR cherry-picked - thanks for the tip! |
@justinsb let me know what you come up with. For now we've raised the node-sync-interval on kube-controller-manager to 1 minute. This seems to "alleviate" the problem to the point where KCM can do its magic properly, as opposed to not being able to at all. But we still see similar problems when mounting an EBS volume: if the mount is unsuccessful, it gets rate limited while asking for volume info. I'm also planning on submitting a PR for some optimization of the ListRoutes function in aws_routes.go, which should help this a bit. |
I was able to cherry-pick the logging patch back to 1.1 (it's a simple cherry-pick but then you have to change the package name to aws_cloud; I then changed the logging to V(2) so I didn't have to mess around with changing log levels). With N instances, it does appear that the ListRoutes function is a big culprit, because it makes 1+N calls every 10 seconds. From kube-controller-manager (with 5 nodes):
Kubelet on each node is also making a call to ec2 DescribeInstances every 10 seconds:
I don't see the AWS rate limits actually documented anywhere, but this page suggests that they vary: https://forums.aws.amazon.com/message.jspa?messageID=454580 "The rate limit that you will see with EC2 can vary depending on system load."

There are definitely issues with the EBS volume code, particularly around retries - it was originally a copy-and-paste from GCE, and it could use a resync to pick up all the bugfixes that have been made in GCE. I started that effort in #15938 but I need to resume that work. I'm not sure whether the volume retry loop is actually the root cause here; we may just notice the controller-manager because it's much easier to observe.

I agree that we should optimize the route list to avoid the 1+N calls. I don't know if you had a particular fix in mind @danielschonfeld. We could cache the instance-id to private-dns-name mapping; we could collect the instance ids and look them up in bulk; or we could pre-fetch the instance-id to private-dns-name mapping on the grounds that every node should be in the routing table anyway.

I did raise the number of nodes from 5 to 30 (admittedly after already having launched a cluster, vs the all-at-once-launch scenario), and I think we also make N DescribeInstances calls, for example when updating ELBs. So that might also be one we should look at, because that could be N^2 when a group of nodes is launched near-simultaneously.

I wasn't personally able to hit my request quota with 30 nodes, but given that the limit might not be fixed I don't think this means a lot. I do think that if we optimize ListRoutes we will dramatically reduce the number of API calls we are making.

Finally, I am not sure that we really need to call ListRoutes every 10 seconds; I wonder if we should change the default on AWS to 60 seconds, for example. Or maybe even slower, but make sure that we trigger a refresh immediately whenever a node changes (or whatever the correct trigger events should be). |
@justinsb we were thinking the same thing: collect the instance IDs, then make a single call to the DescribeInstances API (which supports a MaxResults of 1,000). @danielschonfeld mentioned that k8s has a much lower node limit than that, so other than a simple sanity check there, we should be able to get away with that. Curious that someone had made a "plural getInstancesByIds" method but opted to still do one GET at a time... Either way, that method isn't being used anywhere. /re kubelet -- it looks like whoever wrote the code that makes the API request wanted to get rid of it:
/re the EBS logic -- FWIW, the aws-sdk for Go comes with a default exponential backoff retryer... Thoughts?
|
@danielschonfeld if you want I would be happy to give some pointers on any code to help you get up to speed with go; push it to your fork and I'd be happy to have a look.
I coded up (but have not yet tested) a process-wide back-off when we see RequestLimitExceeded in #19335. It is the "last line of defense", and we should also fix the individual problems like the 1+N in ListRoutes. The goal of #19335 is effectively to sacrifice/delay k8s to protect the AWS account; but whenever we hit that throttle we should fix the cause.

The two specific problems I think we should fix are ListRoutes being 1+N and the kubelet polling the API.

For ListRoutes, I think getting all the instances at once is more obviously correct than caching, so we should go with that. I'm thinking we can just list all the instances (like getInstancesByRegex does), but return them all in a map of instanceid -> privateDnsName.

For the kubelet, I feel that special-casing the current instance is probably the way to go. PrivateDnsName and InstanceID cannot change once the instance has been launched. We already have |
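The batched lookup proposed above (collect instance IDs, then a single DescribeInstances call instead of 1+N) can be sketched with a mock client that counts API calls. This is an illustrative sketch, not the real cloud-provider code: `mockEC2`, `naiveLookup`, and `batchedLookup` are hypothetical names standing in for the aws-sdk types.

```go
package main

import "fmt"

// instance is a stand-in for the EC2 instance record; field names are
// illustrative, not the real aws-sdk-go types.
type instance struct {
	ID             string
	PrivateDNSName string
}

// mockEC2 stands in for the EC2 API so we can count DescribeInstances calls.
type mockEC2 struct {
	calls     int
	instances map[string]string // instance id -> private DNS name
}

// describeInstances accepts a batch of IDs, mirroring the real API's
// ability to take many InstanceIds per request.
func (m *mockEC2) describeInstances(ids []string) []instance {
	m.calls++
	out := make([]instance, 0, len(ids))
	for _, id := range ids {
		if dns, ok := m.instances[id]; ok {
			out = append(out, instance{ID: id, PrivateDNSName: dns})
		}
	}
	return out
}

// naiveLookup is the 1+N pattern: one API call per instance.
func naiveLookup(c *mockEC2, ids []string) map[string]string {
	res := map[string]string{}
	for _, id := range ids {
		for _, inst := range c.describeInstances([]string{id}) {
			res[inst.ID] = inst.PrivateDNSName
		}
	}
	return res
}

// batchedLookup resolves all IDs in a single DescribeInstances call and
// returns the instanceid -> privateDnsName map discussed above.
func batchedLookup(c *mockEC2, ids []string) map[string]string {
	res := map[string]string{}
	for _, inst := range c.describeInstances(ids) {
		res[inst.ID] = inst.PrivateDNSName
	}
	return res
}

func main() {
	data := map[string]string{
		"i-1": "ip-10-0-0-1.ec2.internal",
		"i-2": "ip-10-0-0-2.ec2.internal",
		"i-3": "ip-10-0-0-3.ec2.internal",
	}
	ids := []string{"i-1", "i-2", "i-3"}

	a := &mockEC2{instances: data}
	naiveLookup(a, ids)
	fmt.Println("naive calls:", a.calls) // one call per instance

	b := &mockEC2{instances: data}
	batchedLookup(b, ids)
	fmt.Println("batched calls:", b.calls) // a single call
}
```

With N nodes this drops the per-sync cost from N calls to one, which is the bulk of the 1+N in ListRoutes.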
@justinsb here's my code https://github.com/danielschonfeld/kubernetes/tree/optimize-list-routes as for |
Issue kubernetes#12121 - fixes courtesy of @justinsb - thank you
We are getting hit by this big-time with a relatively small cluster of 10 EC2 instances. It's about to be a deal breaker for us. We are having big trouble trying to deploy a pod with an attached EBS volume, because of the rate-limit-exceeded issues. This is causing production downtime. Please prioritize this into 1.2. The only alternative I can think of (besides moving off AWS, which we can't do at this point) is to simply turn off the cloud_provider options and handle all the config ourselves. |
@SpencerBrown: do you know what class of APIs is causing you to hit limits? |
Figuring out the answer to that question looks like my weekend project :(. I plan to use the AWS Cloudwatch API event stream to debug, if I can get that to work. |
@SpencerBrown which version are you running? I think 1.1 logged all AWS API calls with v=4, but I'm not entirely sure.

I'm sorry for the problem - I'm sure we can get it fixed. Your case is the argument for this safety belt, in that we obviously have some sort of busy retry loop around volume mounting or something similar; we should fix the specific problem & then also have the safety belt to catch the next unknown problem. I know it's not great comfort, but we've made a lot of progress on this in 1.2.

Let me know what version you're running and we can work together (this weekend if need be) to get you back up and running. I suspect it'll be a matter of enabling more verbose debugging, capturing the log and verifying the error, and then either merging this and cherry-picking back to 1.1, or merging a specific fix and cherry-picking back to 1.1. |
I've had good luck correlating with CloudTrail entries... |
Currently running 1.1.7. I have enabled CloudTrail on my test cluster and am waiting to see the results. Thanks @justinsb. |
@SpencerBrown I checked and we don't have a lot of the logging options I was hoping for in the 1.1 branch. So CloudTrail is a great approach. Let me know what it shows and I'll see if I can figure out the problem and provide a backported fix. (My money's on the AttachVolume call, but it's always where you least expect it...) |
I have been running CloudTrail on my test cluster for a couple of hours, and wrote a quick Ruby script to download and analyze the API call frequency. (I'll put this on GitHub later). So far I'm seeing:
The test cluster only has 2 worker nodes and none of our actual services deployed. Going to try our staging cluster next. |
Yikes, all kinds of odd things happening on our staging cluster. Here's the output from my Ruby script:
|
@SpencerBrown That's a great list - thank you. If you can collect the kube-controller-manager.log from the master (/var/log/kube-controller-manager.log) we should be able to see why CreateLoadBalancer is going crazy... Did you launch this with kube-up, BTW? We'll get it fixed either way, but if you launched with kube-up I can assume a certain configuration... I'm going to go through the rest of your list now, but I wanted to see about getting the kube-controller-manager log if possible. I'm justinsb on kubernetes slack or email is justin at fathomdb.com if you want to send it privately. |
For the test cluster...
Each worker node does make a DescribeInstances call in 1.1, but it should be every 10 seconds. This is NodeStatusUpdateFrequency in cmd/kubelet/app/server.go. You could override this with the corresponding kubelet flag. We've fixed this entirely in 1.2, BTW.
I'll investigate & open an issue for the repeated DescribeVolumes calls!
For the staging cluster: the 1 DescribeInstances/second I think we've seen from the test cluster. I take it ip-172-20-1-33.ec2.internal is the master (which would suggest that this is not kube-up). If I can get the kube-controller-manager logs, I think it will become clear what is happening. It looks like something with an LB trying & failing to be created. |
Oh ... the repeated DescribeVolumes calls are us polling to know when the volume is attached/detached. I don't think there's anything we can do about that - a notification isn't available through the EC2 API. I'll open an issue anyway, though I sure hope it isn't a problem, because it is tough to solve. Maybe we reduce the polling frequency or similar for 1.2. |
Been working on this offline with @justinsb (big thanks)! Net conclusions:
|
Is 1) fixed by my IPPermission PR? 1.2 is much better behaved than 1.1, and I tested my change on a setup that I don't even think was ever contemplated (inside the 10.0.0.0/16 default VPC and with a Direct Connect link to outside AWS) |
@therc The problem with 1 was the multiple subnets problem, which I think we've fixed in 1.2 as well. Happy to hear that 1.2 is more robust though. How did you get the 10.0.0.0/16 running? Did you use |
I can't check right now, but I used --non-masquerade-cidr=10.64.0.0/10 because pods and services for the various clusters run in 10.64..10.127 networks. The VMs live in 10.0.16.0/20. Should this be documented? |
This applies a cross-request time delay when we observe RequestLimitExceeded errors, unlike the default library behaviour which only applies a *per-request* backoff. Issue kubernetes#12121
We merged #19335. @SpencerBrown's issue above was caused by an incorrect setup while creating a LoadBalancer (multiple subnets, not tagged), which we are more tolerant of in 1.2 and I've opened issues around the frequency of retries in the servicecontroller loop. Closing :-) |
We had a few bugs where we would get into fast-retry loops, and often we would exhaust the AWS API quota. Sometimes this would even cause AWS to send a warning email. e.g. #11979
@brendandburns suggested in #11979 (comment) that:
That seems like something we should have; we should obviously fix any bugs which cause fast-retry loops, but we should have a second layer of defense as well. Also, if someone launches a thousand services which each require an ELB, we want to respect the AWS API rate limits, even if we don't have a bug per se.