On large cluster, routecontroller takes forever to program routes due to rate limit errors / reconcile #26119

Closed
zmerlynn opened this issue May 23, 2016 · 26 comments
Assignees
Labels
area/controller-manager priority/important-soon

Comments

@zmerlynn
Member

On a large cluster, the routecontroller basically makes O(n) Routes.Insert requests all at once, and most of them error out with "Rate Limit Exceeded". We then wait for operations to complete on the ones that did stick, and damn the rest, they'll be caught next reconcile (which is a while).

In fact, in some cases it appears we're spamming hard enough to get rate limited by the upper-level API DOS protections and seeing something akin to golang/go#14627, e.g.:

E0523 22:00:21.772924      22 routecontroller.go:101] Could not create route 5a40e42e-212a-11e6-9ac7-42010af00002 10.9.170.0/24 for node gke-jenkins-e2e-default-pool-1-b4d62c45-58a8 after 238.128126ms: Get https://www.googleapis.com/compute/v1/projects/kubernetes-scale/zones/us-east1-a/instances/gke-jenkins-e2e-default-pool-1-b4d62c45-58a8?alt=json: http2: no cached connection was available

We can do better than this!
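For illustration only, a minimal sketch of the pattern described above (names like createRoute are hypothetical, not the actual routecontroller code): every reconcile fires one insert per missing route with no pacing, so a 2000-node cluster produces roughly 2000 concurrent cloud API calls.

```go
// Hypothetical sketch of the reconcile pattern described above: one
// concurrent route-insert call per node with no pacing at all.
package main

import (
	"fmt"
	"sync"
)

type route struct {
	name, targetNode, cidr string
}

// createRoute stands in for the cloud provider's route-insert call.
func createRoute(r route) error {
	fmt.Printf("creating route %s -> %s\n", r.cidr, r.targetNode)
	return nil
}

func reconcile(missing []route) {
	var wg sync.WaitGroup
	for _, r := range missing {
		wg.Add(1)
		go func(r route) { // every insert starts immediately: O(n) calls in flight
			defer wg.Done()
			if err := createRoute(r); err != nil {
				fmt.Printf("could not create route %s: %v\n", r.name, err)
			}
		}(r)
	}
	wg.Wait()
}

func main() {
	reconcile([]route{
		{name: "r1", targetNode: "node-1", cidr: "10.9.170.0/24"},
		{name: "r2", targetNode: "node-2", cidr: "10.9.171.0/24"},
	})
}
```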

@zmerlynn zmerlynn added priority/important-soon area/controller-manager team/control-plane labels May 23, 2016
@zmerlynn
Member Author

cf. #23327, which is the other leg of this problem

@zmerlynn zmerlynn self-assigned this May 23, 2016
@wojtek-t
Member

I took a look at the code and it seems that we are doing a full reconciliation every 10s. If we need to create 2000 routes at the beginning, that is too frequent.

I think that we need to:

  • first compute how many routes we have to create
  • based on that, spread them out over some time to reduce the chance of hitting 'rate limit exceeded' (a rough sketch follows below)

I will try to put some small PR together as a starting point.

@gmarek
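A rough sketch of how that spreading could look, assuming a hypothetical createRoute helper and a client-side limiter from golang.org/x/time/rate (the ~10/s figure is only a placeholder, not the real quota):

```go
// Hypothetical sketch: count the missing routes first, then pace the
// creations with a client-side limiter instead of one big burst.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// createRoute is a stand-in for the real cloud provider call.
func createRoute(name string) error {
	fmt.Println("creating route", name)
	return nil
}

func main() {
	missing := []string{"route-a", "route-b", "route-c"}
	fmt.Printf("need to create %d routes\n", len(missing))

	// ~10 creations per second with a small burst; the real numbers would
	// have to be derived from the cloud provider's actual quota.
	limiter := rate.NewLimiter(rate.Every(100*time.Millisecond), 5)
	for _, name := range missing {
		if err := limiter.Wait(context.Background()); err != nil {
			break
		}
		if err := createRoute(name); err != nil {
			fmt.Println("create failed:", err)
		}
	}
}
```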

zmerlynn added a commit to zmerlynn/kubernetes that referenced this issue May 24, 2016
Instead of just rate limits to operation polling, send all API calls
through a rate limited RoundTripper.

This isn't a perfect solution, since the QPS is obviously getting
split between different controllers, etc., but it's also spread across
different APIs, which, in practice, rate limit differently.

Fixes kubernetes#26119 (hopefully)
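Purely as an illustration of the idea in this commit (not the actual patch), a RoundTripper can be wrapped so that every outgoing request waits on a shared limiter; the names and QPS numbers below are assumptions.

```go
// Illustrative sketch of a rate-limited RoundTripper; not the actual
// GCE provider change.
package throttle

import (
	"net/http"

	"golang.org/x/time/rate"
)

type rateLimitedRoundTripper struct {
	delegate http.RoundTripper
	limiter  *rate.Limiter
}

// RoundTrip blocks until the shared limiter allows another call, so the
// budget is spread across every request made through this client, not just
// operation polling.
func (rt *rateLimitedRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	if err := rt.limiter.Wait(req.Context()); err != nil {
		return nil, err
	}
	return rt.delegate.RoundTrip(req)
}

// NewThrottledClient wraps the default transport with the limiter.
func NewThrottledClient(qps float64, burst int) *http.Client {
	return &http.Client{
		Transport: &rateLimitedRoundTripper{
			delegate: http.DefaultTransport,
			limiter:  rate.NewLimiter(rate.Limit(qps), burst),
		},
	}
}
```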
@wojtek-t
Member

I'm reopening this one, since this doesn't seem to be solved.

So basically something strange is happening here. With @zmerlynn's PR we have a 10/s limit on API calls sent to GCE, and we are still rate-limited.

However, according to the documentation we are allowed 20/s:
https://cloud.google.com/compute/docs/api-rate-limits

So something either doesn't work or is weird.

@wojtek-t wojtek-t reopened this May 24, 2016
@wojtek-t
Member

What is super interesting from the logs of recent run:

It starts creating routes:

I0524 12:48:54.353757      29 routecontroller.go:98] Creating route for node gke-jenkins-e2e-default-pool-1-55ac4188-xogp 10.10.218.0/24 with hint 97a4808f-21a4-11e6-87b7-42010af00002
....

Two minutes after that, we hit "Rate Limit Exceeded" for the first time:

E0524 12:50:53.996602      29 routecontroller.go:101] Could not create route 9b689fe5-21a4-11e6-87b7-42010af00002 10.11.91.0/24 for node gke-jenkins-e2e-default-pool-2-4fd82afd-rgru after 1m59.64009569s: googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded
...
E0524 12:52:22.350821      29 routecontroller.go:101] Could not create route 90216fd2-21a4-11e6-87b7-42010af00002 10.10.10.0/24 for node gke-jenkins-e2e-default-pool-4-308af981-rlw9 after 3m27.985218479s: googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded

The first route is only created after 9 minutes.
The second one is created after 15 minutes, and from that point on we create a number of them:

I0524 12:58:26.514569      29 routecontroller.go:103] Created route for node gke-jenkins-e2e-default-pool-2-4fd82afd-ocnk 10.13.193.0/24 with hint d9e56242-21a4-11e6-87b7-42010af00002 after 9m32.158520459s
I0524 13:04:05.640382      29 routecontroller.go:103] Created route for node gke-jenkins-e2e-default-pool-2-4fd82afd-p2kd 10.14.51.0/24 with hint dbaf5e86-21a4-11e6-87b7-42010af00002 after 15m11.284277905s
I0524 13:04:06.553900      29 routecontroller.go:103] Created route for node gke-jenkins-e2e-default-pool-2-4fd82afd-9rb5 10.12.176.0/24 with hint c0b05886-21a4-11e6-87b7-42010af00002 after 15m12.199269778s
I0524 13:04:06.779552      29 routecontroller.go:103] Created route for node gke-jenkins-e2e-default-pool-2-4fd82afd-bkke 10.9.143.0/24 with hint 8d636777-21a4-11e6-87b7-42010af00002 after 15m12.424801178s
...

So it seems that nothing useful happens for the first 2 minutes, and then again between 3m30 and 9m30.

@wojtek-t
Member

OK - so I think that what is happening is that all controllers share the same GCE interface (and thus have common throttling).

So if one controller is generating a lot of API calls, then other controllers may be throttled.

@gmarek has a hypothesis that it may be caused by nodecontroller.

@zmerlynn
Member Author

zmerlynn commented May 24, 2016

One thing my PR didn't address (because it was late): we now have some obnoxious ambiguity in the logging when you go to make a request, because the gap between "Creating route" and "Created route" could actually be absorbing a fair amount of rate limiter time. It could before, too, because of the operation rate limiter and operation polling, but now it could be worse.

@wojtek-t
Member

@zmerlynn - yes I'm aware of that. And probably this is exactly the case here.

@gmarek
Contributor

gmarek commented May 24, 2016

I got fooled by that - I guess we should fix it somehow.

@zmerlynn
Member Author

One approach to fixing it is similar to what the operation polling code does, but in the RoundTripper: warn if we were rate limited for more than N seconds (and dump the request or something? it's not clear how best to show what we were sending).
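One hedged sketch of that suggestion, with the threshold and log message as placeholders: time the limiter wait inside RoundTrip and warn when it exceeds the threshold.

```go
// Sketch of warning when a request sat in the client-side limiter for too
// long; the type mirrors the RoundTripper sketch above, and the threshold
// and log text are placeholders.
package throttle

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

type loggingThrottledRoundTripper struct {
	delegate  http.RoundTripper
	limiter   *rate.Limiter
	warnAfter time.Duration // e.g. 10 * time.Second
}

func (rt *loggingThrottledRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	if err := rt.limiter.Wait(req.Context()); err != nil {
		return nil, err
	}
	// Surface long waits so throttling is visible in the logs instead of
	// silently inflating the "Creating route" -> "Created route" gap.
	if waited := time.Since(start); waited > rt.warnAfter {
		log.Printf("%s %s throttled client-side for %v", req.Method, req.URL.Path, waited)
	}
	return rt.delegate.RoundTrip(req)
}
```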

@wojtek-t
Member

I found something else - the Kubelet also builds a GCE interface:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/server.go#L337

We have 2000 kubelets, so if all of them send multiple requests to GCE, we may hit limits.

@gmarek
Contributor

gmarek commented May 24, 2016

Any idea under what circumstances the Kubelet will contact the cloud provider? @dchen1107

@wojtek-t
Member

Hmm - it seems it contacts GCE exactly once at the beginning. So it doesn't explain much.

@zmerlynn
Member Author

I think it's only here:

instances, ok := kcfg.Cloud.Instances()

I've rarely seen GCE rate limit read-only calls in practice, and kubelets come up pretty well staggered.

@wojtek-t
Member

OK - so with the logs that I added, it is pretty clear that we are very heavily throttled on GCE API calls.

In particular, when I was running a 1000-node cluster, I saw lines where we were throttled for 1m30. I'm pretty sure it's significantly higher in a 2000-node cluster.

I0525 08:07:41.684522      24 gce.go:124] GCE api call: POST https://www.googleapis.com/compute/v1/projects/kubernetes-scale/global/routes?alt=json (throttled for 1m30.328621564s)
I0525 08:07:41.784558      24 gce.go:124] GCE api call: POST https://www.googleapis.com/compute/v1/projects/kubernetes-scale/global/routes?alt=json (throttled for 1m30.310976327s)

I'm going to investigate a bit more where all those requests come from.

@wojtek-t
Member

Actually - it seems that we have a pretty big problem here.
Basically, Kubelet seems to send a lot of requests to GCE.

So to clarify: we should think about how to solve this issue, but I think the options are either increasing the QPS quota or calling GCE only once per X status updates.
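The second option could look roughly like the sketch below; everything here (the cache type, the refresh interval of 5) is hypothetical, just to show the "call the cloud only every Nth status update" shape.

```go
// Hypothetical sketch of refreshing cloud-provided data only every Nth
// status update instead of on every tick.
package main

import "fmt"

type addressCache struct {
	fetch        func() ([]string, error) // stand-in for the GCE call
	refreshEvery int
	ticks        int
	cached       []string
}

func (c *addressCache) get() ([]string, error) {
	if c.cached == nil || c.ticks%c.refreshEvery == 0 {
		addrs, err := c.fetch()
		if err != nil {
			return nil, err
		}
		c.cached = addrs
	}
	c.ticks++
	return c.cached, nil
}

func main() {
	c := &addressCache{
		refreshEvery: 5,
		fetch: func() ([]string, error) {
			fmt.Println("calling the cloud API")
			return []string{"10.0.0.1"}, nil
		},
	}
	for i := 0; i < 10; i++ {
		c.get() // the cloud API is only hit on ticks 0 and 5
	}
}
```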

@wojtek-t
Member

The second issue is that we are throttled at the controller level too, and this one I'm trying to understand right now.

@wojtek-t
Member

@zmerlynn - that said, any rate limiting on our side will not solve the kubelet-related problem.

@wojtek-t
Member

Hmm - actually, looking into the implementation, it seems that NodeAddresses only contacts the metadata server, so I'm no longer that convinced...

@wojtek-t
Member

Also, I think the important thing is that a single call to the GCE interface can translate into multiple API calls to GCE. In particular, CreateRoute translates to:

  • get instance
  • insert route
  • a bunch of getOperation calls (until the operation finishes)

@wojtek-t
Member

OK - so I think that what is happening here is that since CreateRoute translates to:

  • get instance
  • insert route
  • a bunch of getOperations

That means that if we call, say, 2000 CreateRoute() at the same time, all of them will try to issue getInstance at the beginning, so we will quickly accumulate 2000 GET requests in the queue.
Once those are processed, we generate the "POST route" for each of them.

So what is happening here is kind of expected.

So getting back to Zach's PR - I think that throttling POST, GET, and OPERATION calls separately is a good idea in general.
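A small sketch of that separate-throttling idea, with made-up rates: keep an independent limiter per call class so a flood of operation polls cannot starve ordinary GETs and POSTs.

```go
// Sketch of per-call-class throttling; the classes and rates here are
// assumptions, not the actual provider code.
package throttle

import (
	"context"

	"golang.org/x/time/rate"
)

type callClass int

const (
	classGet callClass = iota
	classMutate
	classPollOperation
)

type classLimiter struct {
	limiters map[callClass]*rate.Limiter
}

func newClassLimiter() *classLimiter {
	return &classLimiter{limiters: map[callClass]*rate.Limiter{
		classGet:           rate.NewLimiter(rate.Limit(10), 10),
		classMutate:        rate.NewLimiter(rate.Limit(5), 5),
		classPollOperation: rate.NewLimiter(rate.Limit(20), 20),
	}}
}

// wait blocks on the limiter for the given class, so each class of call is
// throttled independently of the others.
func (c *classLimiter) wait(ctx context.Context, class callClass) error {
	return c.limiters[class].Wait(ctx)
}
```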

@wojtek-t
Member

But I still don't really understand why we get "RateLimitExceeded". Is that because of the Kubelets?

@wojtek-t
Member

I think that one problem we have at the RouteController level is that if CreateRoute fails, we simply don't react to it, and we will wait at least 5 minutes before even retrying.
We should add some retries at the route controller level too.
I will send out a short PR for it.
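As a sketch of what such retries might look like (createRoute and the backoff numbers are placeholders, not the actual PR):

```go
// Hypothetical sketch of retrying a failed route creation with backoff in
// the controller instead of waiting for the next full reconciliation.
package main

import (
	"errors"
	"fmt"
	"time"
)

// createRoute stands in for the cloud provider call; here it always fails.
func createRoute(name string) error {
	return errors.New("googleapi: Error 403: Rate Limit Exceeded")
}

func createRouteWithRetries(name string, attempts int) error {
	backoff := time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = createRoute(name); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff between attempts
	}
	return fmt.Errorf("giving up on route %s after %d attempts: %v", name, attempts, err)
}

func main() {
	if err := createRouteWithRetries("route-a", 3); err != nil {
		fmt.Println(err)
	}
}
```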

@wojtek-t
Member

But I still don't understand why we get those "RateLimitExceeded" errors.

@wojtek-t
Member

OK - so as an update: in GCE, we have a separate rate limit on the number of in-flight CreateRoute calls per project.
#26263 is supposed to address this issue.

k8s-github-robot pushed a commit that referenced this issue May 26, 2016
Automatic merge from submit-queue

GCE provider: Revert rate limits

This reverts #26140 and #26170. After testing with #26263, #26140 is unnecessary, and we need to be able to prioritize normal GET / POST requests over operation polling requests, which is what the pre-#26140 requests do.

c.f. #26119
k8s-github-robot pushed a commit that referenced this issue May 26, 2016
Automatic merge from submit-queue

Limit concurrent route creations

Ref #26119 

This is supposed to improve 2 things:
- retry creating routes in the routecontroller in case of failure
- limit the number of concurrent CreateRoute calls in flight (see the sketch below)

We need something like that, because GCE limits the number of concurrent in-flight CreateRoute requests.

@gmarek @cjcullen
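For illustration, one common way to cap in-flight calls is a buffered-channel semaphore; the limit and the createRoute stub below are assumptions, not the actual PR.

```go
// Illustrative sketch of capping in-flight CreateRoute calls with a
// buffered-channel semaphore.
package main

import (
	"fmt"
	"sync"
)

func createRoute(name string) error {
	fmt.Println("creating route", name)
	return nil
}

func createRoutes(names []string, maxInFlight int) {
	sem := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup
	for _, name := range names {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxInFlight calls are outstanding
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := createRoute(name); err != nil {
				fmt.Println("create failed:", err)
			}
		}(name)
	}
	wg.Wait()
}

func main() {
	createRoutes([]string{"a", "b", "c", "d"}, 2)
}
```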
@wojtek-t
Member

OK - so this one is actually fixed. I mean it still takes a long time, but currently it's purely blocked on GCE.

@wojtek-t
Member

As a comment - we have an internal bug for it already.
