On large cluster, routecontroller takes forever to program routes due to rate limit errors / reconcile #26119

Closed
zmerlynn opened this issue May 23, 2016 · 26 comments
Assignees
Labels
area/controller-manager priority/important-soon

Comments

@zmerlynn
Member

On a large cluster, the routecontroller basically makes O(n) Routes.Insert requests all at once, and most of them error out with "Rate Limit Exceeded". We then wait for operations to complete on the ones that did stick, and damn the rest, they'll be caught next reconcile (which is a while).

In fact, in some cases it appears we're spamming hard enough to get rate limited by the upper-level API DOS protections and seeing something akin to golang/go#14627, e.g.:

E0523 22:00:21.772924      22 routecontroller.go:101] Could not create route 5a40e42e-212a-11e6-9ac7-42010af00002 10.9.170.0/24 for node gke-jenkins-e2e-default-pool-1-b4d62c45-58a8 after 238.128126ms: Get https://www.googleapis.com/compute/v1/projects/kubernetes-scale/zones/us-east1-a/instances/gke-jenkins-e2e-default-pool-1-b4d62c45-58a8?alt=json: http2: no cached connection was available

We can do better than this!
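For illustration only, a minimal sketch of the pattern described above (names like createRoute are hypothetical, not the actual routecontroller code): every reconcile fires one insert per missing route with no pacing, so a 2000-node cluster produces roughly 2000 concurrent cloud API calls.

```go
// Hypothetical sketch of the reconcile pattern described above: one
// concurrent route-insert call per node with no pacing at all.
package main

import (
	"fmt"
	"sync"
)

type route struct {
	name, targetNode, cidr string
}

// createRoute stands in for the cloud provider's route-insert call.
func createRoute(r route) error {
	fmt.Printf("creating route %s -> %s\n", r.cidr, r.targetNode)
	return nil
}

func reconcile(missing []route) {
	var wg sync.WaitGroup
	for _, r := range missing {
		wg.Add(1)
		go func(r route) { // every insert starts immediately: O(n) calls in flight
			defer wg.Done()
			if err := createRoute(r); err != nil {
				fmt.Printf("could not create route %s: %v\n", r.name, err)
			}
		}(r)
	}
	wg.Wait()
}

func main() {
	reconcile([]route{
		{name: "r1", targetNode: "node-1", cidr: "10.9.170.0/24"},
		{name: "r2", targetNode: "node-2", cidr: "10.9.171.0/24"},
	})
}
```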

@zmerlynn zmerlynn added priority/important-soon area/controller-manager team/control-plane labels May 23, 2016
@zmerlynn
Member Author

cf. #23327, which is the other leg of this problem

@zmerlynn zmerlynn self-assigned this May 23, 2016
@wojtek-t
Member

I took a look at the code and it seems that we are doing a full reconciliation every 10s. If we need to create 2000 routes at the beginning, that is too frequent.

I think that we need to:

  • first compute how many routes we have to create
  • based on that, spread them out over some time to reduce the chance of hitting 'rate limit exceeded' (a rough sketch follows below)

I will try to put some small PR together as a starting point.

@gmarek
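A rough sketch of how that spreading could look, assuming a hypothetical createRoute helper and a client-side limiter from golang.org/x/time/rate (the ~10/s figure is only a placeholder, not the real quota):

```go
// Hypothetical sketch: count the missing routes first, then pace the
// creations with a client-side limiter instead of one big burst.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// createRoute is a stand-in for the real cloud provider call.
func createRoute(name string) error {
	fmt.Println("creating route", name)
	return nil
}

func main() {
	missing := []string{"route-a", "route-b", "route-c"}
	fmt.Printf("need to create %d routes\n", len(missing))

	// ~10 creations per second with a small burst; the real numbers would
	// have to be derived from the cloud provider's actual quota.
	limiter := rate.NewLimiter(rate.Every(100*time.Millisecond), 5)
	for _, name := range missing {
		if err := limiter.Wait(context.Background()); err != nil {
			break
		}
		if err := createRoute(name); err != nil {
			fmt.Println("create failed:", err)
		}
	}
}
```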

zmerlynn added a commit to zmerlynn/kubernetes that referenced this issue May 24, 2016
Instead of just rate limits to operation polling, send all API calls
through a rate limited RoundTripper.

This isn't a perfect solution, since the QPS is obviously getting
split between different controllers, etc., but it's also spread across
different APIs, which, in practice, rate limit differently.

Fixes kubernetes#26119 (hopefully)
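Purely as an illustration of the idea in this commit (not the actual patch), a RoundTripper can be wrapped so that every outgoing request waits on a shared limiter; the names and QPS numbers below are assumptions.

```go
// Illustrative sketch of a rate-limited RoundTripper; not the actual
// GCE provider change.
package throttle

import (
	"net/http"

	"golang.org/x/time/rate"
)

type rateLimitedRoundTripper struct {
	delegate http.RoundTripper
	limiter  *rate.Limiter
}

// RoundTrip blocks until the shared limiter allows another call, so the
// budget is spread across every request made through this client, not just
// operation polling.
func (rt *rateLimitedRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	if err := rt.limiter.Wait(req.Context()); err != nil {
		return nil, err
	}
	return rt.delegate.RoundTrip(req)
}

// NewThrottledClient wraps the default transport with the limiter.
func NewThrottledClient(qps float64, burst int) *http.Client {
	return &http.Client{
		Transport: &rateLimitedRoundTripper{
			delegate: http.DefaultTransport,
			limiter:  rate.NewLimiter(rate.Limit(qps), burst),
		},
	}
}
```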
@wojtek-t
Member

I'm reopening this one, since this doesn't seem to be solved.

So basically something strange is happening here. With @zmerlynn's PR we have a 10/s limit on API calls sent to GCE, and we are still rate-limited.

However, according to the documentation we are allowed 20/s:
https://cloud.google.com/compute/docs/api-rate-limits

So something either doesn't work or is weird.

@wojtek-t wojtek-t reopened this May 24, 2016
@wojtek-t
Member

What is super interesting from the logs of recent run:

It starts creating routes:

I0524 12:48:54.353757      29 routecontroller.go:98] Creating route for node gke-jenkins-e2e-default-pool-1-55ac4188-xogp 10.10.218.0/24 with hint 97a4808f-21a4-11e6-87b7-42010af00002
....

Two minutes after that, we hit "Rate Limit Exceeded" for the first time:

E0524 12:50:53.996602      29 routecontroller.go:101] Could not create route 9b689fe5-21a4-11e6-87b7-42010af00002 10.11.91.0/24 for node gke-jenkins-e2e-default-pool-2-4fd82afd-rgru after 1m59.64009569s: googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded
...
E0524 12:52:22.350821      29 routecontroller.go:101] Could not create route 90216fd2-21a4-11e6-87b7-42010af00002 10.10.10.0/24 for node gke-jenkins-e2e-default-pool-4-308af981-rlw9 after 3m27.985218479s: googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded

The first route is only created after 9 minutes.
The second one is created after 15 minutes, and from that point on we create a number of them:

I0524 12:58:26.514569      29 routecontroller.go:103] Created route for node gke-jenkins-e2e-default-pool-2-4fd82afd-ocnk 10.13.193.0/24 with hint d9e56242-21a4-11e6-87b7-42010af00002 after 9m32.158520459s
I0524 13:04:05.640382      29 routecontroller.go:103] Created route for node gke-jenkins-e2e-default-pool-2-4fd82afd-p2kd 10.14.51.0/24 with hint dbaf5e86-21a4-11e6-87b7-42010af00002 after 15m11.284277905s
I0524 13:04:06.553900      29 routecontroller.go:103] Created route for node gke-jenkins-e2e-default-pool-2-4fd82afd-9rb5 10.12.176.0/24 with hint c0b05886-21a4-11e6-87b7-42010af00002 after 15m12.199269778s
I0524 13:04:06.779552      29 routecontroller.go:103] Created route for node gke-jenkins-e2e-default-pool-2-4fd82afd-bkke 10.9.143.0/24 with hint 8d636777-21a4-11e6-87b7-42010af00002 after 15m12.424801178s
...

So it seems that nothing useful happens for the first 2 minutes, and then again between 3m30 and 9m30.

@wojtek-t
Member

OK - so I think that what is happening is that all controllers share the same GCE interface (and thus have common throttling).

So if one controller is generating a lot of API calls, then other controllers may be throttled.

@gmarek has a hypothesis that it may be caused by nodecontroller.

@zmerlynn
Member Author

zmerlynn commented May 24, 2016

One thing my PR didn't address (because it was late): we now have some obnoxious ambiguity in the logging when you go to make a request, because the gap between "Creating route" and "Created route" could actually be absorbing a fair amount of rate limiter time. It could before, too, because of the operation rate limiter and operation polling, but now it could be worse.

@wojtek-t
Member

@zmerlynn - yes I'm aware of that. And probably this is exactly the case here.

@gmarek
Contributor

gmarek commented May 24, 2016

I got fooled by that - I guess we should fix it somehow.

@zmerlynn
Member Author

One approach to fixing it is similar to what the operation polling code does, but in the RoundTripper: warn if we were rate limited for more than N seconds (and dump the request or something? it's not clear how best to show what we were sending).
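One hedged sketch of that suggestion, with the threshold and log message as placeholders: time the limiter wait inside RoundTrip and warn when it exceeds the threshold.

```go
// Sketch of warning when a request sat in the client-side limiter for too
// long; the type mirrors the RoundTripper sketch above, and the threshold
// and log text are placeholders.
package throttle

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

type loggingThrottledRoundTripper struct {
	delegate  http.RoundTripper
	limiter   *rate.Limiter
	warnAfter time.Duration // e.g. 10 * time.Second
}

func (rt *loggingThrottledRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	if err := rt.limiter.Wait(req.Context()); err != nil {
		return nil, err
	}
	// Surface long waits so throttling is visible in the logs instead of
	// silently inflating the "Creating route" -> "Created route" gap.
	if waited := time.Since(start); waited > rt.warnAfter {
		log.Printf("%s %s throttled client-side for %v", req.Method, req.URL.Path, waited)
	}
	return rt.delegate.RoundTrip(req)
}
```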

@wojtek-t
Member

I found something else - the Kubelet also builds a GCE interface:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/server.go#L337

We have 2000 kubelets, so if all of them send multiple requests to GCE, we may hit limits.

@gmarek
Contributor

gmarek commented May 24, 2016

Any idea under what circumstances the Kubelet will contact the cloud provider? @dchen1107

@wojtek-t
Member

Hmm - it seems it contacts GCE exactly once at the beginning. So it doesn't explain much.

@zmerlynn
Member Author

I think it's only here:

instances, ok := kcfg.Cloud.Instances()

I've rarely seen GCE rate limit read-only calls in practice, and kubelets come up pretty well staggered.

@wojtek-t
Member

OK - so with the logs that I added, it is pretty clear that we are very heavily throttled on GCE API calls.

In particular, when I was running a 1000-node cluster, I saw lines where we were throttled for 1m30. I'm pretty sure it's significantly higher in a 2000-node cluster.

I0525 08:07:41.684522      24 gce.go:124] GCE api call: POST https://www.googleapis.com/compute/v1/projects/kubernetes-scale/global/routes?alt=json (throttled for 1m30.328621564s)
I0525 08:07:41.784558      24 gce.go:124] GCE api call: POST https://www.googleapis.com/compute/v1/projects/kubernetes-scale/global/routes?alt=json (throttled for 1m30.310976327s)

I'm going to investigate a bit more where all those requests come from.

@wojtek-t
Member

Actually - it seems that we have a pretty big problem here.
Basically, Kubelet seems to send a lot of requests to GCE.

So to clarify: we should think about how to solve this issue, but I think the options are either increasing the QPS quota or calling GCE only once per X status updates.
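The second option could look roughly like the sketch below; everything here (the cache type, the refresh interval of 5) is hypothetical, just to show the "call the cloud only every Nth status update" shape.

```go
// Hypothetical sketch of refreshing cloud-provided data only every Nth
// status update instead of on every tick.
package main

import "fmt"

type addressCache struct {
	fetch        func() ([]string, error) // stand-in for the GCE call
	refreshEvery int
	ticks        int
	cached       []string
}

func (c *addressCache) get() ([]string, error) {
	if c.cached == nil || c.ticks%c.refreshEvery == 0 {
		addrs, err := c.fetch()
		if err != nil {
			return nil, err
		}
		c.cached = addrs
	}
	c.ticks++
	return c.cached, nil
}

func main() {
	c := &addressCache{
		refreshEvery: 5,
		fetch: func() ([]string, error) {
			fmt.Println("calling the cloud API")
			return []string{"10.0.0.1"}, nil
		},
	}
	for i := 0; i < 10; i++ {
		c.get() // the cloud API is only hit on ticks 0 and 5
	}
}
```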

@wojtek-t
Member

The second issue is that we are throttled at the controller level too, and this one I'm trying to understand right now.

@wojtek-t
Member

@zmerlynn - that said, any rate limiting on our side will not solve the kubelet-related problem.

@wojtek-t
Member

Hmm - actually, looking into the implementation, it seems that NodeAddresses only contacts the metadata server, so I'm no longer that convinced...

@wojtek-t
Member

Also, I think the important thing is that a single call to the GCE interface can translate into multiple API calls to GCE. In particular, CreateRoute translates to:

  • get instance
  • insert route
  • a bunch of getOperation calls (until the operation finishes)

@wojtek-t
Member

OK - so I think that what is happening here is that since CreateRoute translates to:

  • get instance
  • insert route
  • a bunch of getOperations

That means that if we call, say, 2000 CreateRoute() at the same time, all of them will try to issue getInstance at the beginning, so we will quickly accumulate 2000 GET requests in the queue.
Once those are processed, we generate the "POST route" for each of them.

So what is happening here is kind of expected.

So getting back to Zach's PR - I think that throttling POST, GET, and OPERATION calls separately is a good idea in general.
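A small sketch of that separate-throttling idea, with made-up rates: keep an independent limiter per call class so a flood of operation polls cannot starve ordinary GETs and POSTs.

```go
// Sketch of per-call-class throttling; the classes and rates here are
// assumptions, not the actual provider code.
package throttle

import (
	"context"

	"golang.org/x/time/rate"
)

type callClass int

const (
	classGet callClass = iota
	classMutate
	classPollOperation
)

type classLimiter struct {
	limiters map[callClass]*rate.Limiter
}

func newClassLimiter() *classLimiter {
	return &classLimiter{limiters: map[callClass]*rate.Limiter{
		classGet:           rate.NewLimiter(rate.Limit(10), 10),
		classMutate:        rate.NewLimiter(rate.Limit(5), 5),
		classPollOperation: rate.NewLimiter(rate.Limit(20), 20),
	}}
}

// wait blocks on the limiter for the given class, so each class of call is
// throttled independently of the others.
func (c *classLimiter) wait(ctx context.Context, class callClass) error {
	return c.limiters[class].Wait(ctx)
}
```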

@wojtek-t
Member

But I still don't really understand why we get "RateLimitExceeded". Is that because of the Kubelets?

@wojtek-t
Member

I think that one problem we have at the RouteController level is that if CreateRoute fails, we simply don't react to it, and we will wait at least 5 minutes before even retrying.
We should add some retries at the route controller level too.
I will send out a short PR for it.
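As a sketch of what such retries might look like (createRoute and the backoff numbers are placeholders, not the actual PR):

```go
// Hypothetical sketch of retrying a failed route creation with backoff in
// the controller instead of waiting for the next full reconciliation.
package main

import (
	"errors"
	"fmt"
	"time"
)

// createRoute stands in for the cloud provider call; here it always fails.
func createRoute(name string) error {
	return errors.New("googleapi: Error 403: Rate Limit Exceeded")
}

func createRouteWithRetries(name string, attempts int) error {
	backoff := time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = createRoute(name); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff between attempts
	}
	return fmt.Errorf("giving up on route %s after %d attempts: %v", name, attempts, err)
}

func main() {
	if err := createRouteWithRetries("route-a", 3); err != nil {
		fmt.Println(err)
	}
}
```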

@wojtek-t
Member

But I still don't understand why we get those "RateLimitExceeded" errors.

@wojtek-t
Member

OK - so as an update: in GCE, we have a separate rate limit on the number of in-flight CreateRoute calls per project.
#26263 is supposed to address this issue.

k8s-github-robot pushed a commit that referenced this issue May 26, 2016
Automatic merge from submit-queue

GCE provider: Revert rate limits

This reverts #26140 and #26170. After testing with #26263, #26140 is unnecessary, and we need to be able to prioritize normal GET / POST requests over operation polling requests, which is what the pre-#26140 requests do.

c.f. #26119
k8s-github-robot pushed a commit that referenced this issue May 26, 2016
Automatic merge from submit-queue

Limit concurrent route creations

Ref #26119 

This is supposed to improve 2 things:
- retry creating routes in the routecontroller in case of failure
- limit the number of concurrent CreateRoute calls in flight (see the sketch below)

We need something like that, because GCE limits the number of concurrent in-flight CreateRoute requests.

@gmarek @cjcullen
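For illustration, one common way to cap in-flight calls is a buffered-channel semaphore; the limit and the createRoute stub below are assumptions, not the actual PR.

```go
// Illustrative sketch of capping in-flight CreateRoute calls with a
// buffered-channel semaphore.
package main

import (
	"fmt"
	"sync"
)

func createRoute(name string) error {
	fmt.Println("creating route", name)
	return nil
}

func createRoutes(names []string, maxInFlight int) {
	sem := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup
	for _, name := range names {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxInFlight calls are outstanding
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := createRoute(name); err != nil {
				fmt.Println("create failed:", err)
			}
		}(name)
	}
	wg.Wait()
}

func main() {
	createRoutes([]string{"a", "b", "c", "d"}, 2)
}
```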
@wojtek-t
Member

OK - so this one is actually fixed. I mean it still takes a long time, but currently it's purely blocked on GCE.

@wojtek-t
Member

As a comment - we have an internal bug for it already.
