
Limit concurrent route creations #26263

Merged

merged 1 commit into kubernetes:master from wojtek-t:fix_route_controller on May 26, 2016

Conversation

wojtek-t
Member

@wojtek-t wojtek-t commented May 25, 2016

Ref #26119

This is supposed to improve two things:

  • retry creating a route in the routecontroller in case of failure
  • limit the number of concurrent CreateRoute calls in flight

We need this because GCE limits the number of concurrent in-flight CreateRoute requests.

@gmarek @cjcullen

@wojtek-t wojtek-t added the release-note-none Denotes a PR that doesn't merit a release note. label May 25, 2016
@wojtek-t wojtek-t added this to the v1.3 milestone May 25, 2016
@k8s-github-robot k8s-github-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 25, 2016
@wojtek-t wojtek-t changed the title Spread creating routes over time and retry on failures Limit concurrent route creations May 25, 2016
@wojtek-t wojtek-t force-pushed the fix_route_controller branch 3 times, most recently from ff27949 to 80a29ba Compare May 25, 2016 14:11
@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 25, 2016
// that means that we may wait up to 5 minutes before even starting
// to create a route for it. This is bad.
// We should have a watch on nodes and, if we observe a new node (with a CIDR?),
// trigger reconciliation for that node.
Member

This is an extant issue, but #26267 will at least cause it to be NotReady until the route is programmed, right?

Member Author

Yes - this is a long-standing issue (it's nothing new). I will send a PR for it, but it won't be a trivial change, so I'm not sure it will land in 1.3.
But yeah - #26267 would make nodes not ready in such a case.

Member

I would file an issue and we can triage it. If it's long-standing, there's nothing huge to worry about. One thing is that a "new node" in the large-cluster sense is kind of gradual, though - basically everything in the first half hour looks like a new node, so that resync period could be a little relevant. One question I had is whether the reconcile loop works like { reconcile(); sleep(constant); } or like { reconcile(); sleep(to-meet-five-minutes); }. (A question I keep meaning to answer.)

Member

Huh. Unless I'm misreading JitterUntil badly, it's the former and not the latter.

Member Author

The former. I can file an issue, but it's hard with a kid on my hands :)

Member Author

Actually - he is eating now.
#26274 filed for this.

@zmerlynn
Member

Some medium-level nits, otherwise LGTM. If you're back online tonight, we can iterate on this in realtime and get it done. I'm testing it now, but I think it might actually be best accompanied by a revert of #26140 (which you could do by just pushing a git revert 55fdc1c036df7fa2b22cd475aa597990c6e29491 && git revert 9b5cdfb705a9ed53f5bb376133a17e6b5c051311 onto this PR). I think we're unnecessarily throttling the non-operation GETs, and it may be slowing down other things - my PR was headed toward fixing that, but just taking the limiters off could be faster and could work. I'm about to test that now.

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 25, 2016
@wojtek-t
Member Author

PTAL

@wojtek-t
Member Author

@zmerlynn - to be honest, I wouldn't revert that PR, because there is also another QPS limit on all API calls, so I think both are necessary. Though we may want to bump the limit from 10 to something like 19 or 20. WDYT?

@zmerlynn
Member

zmerlynn commented May 25, 2016

With this and the two reverts suggested, 1k nodes seems pretty happy, though route programming is still taking quite a long time.

@wojtek-t: My concern about the rate limiting is that we're now de-prioritizing GETs (like instance gets/lists) behind things like operation polls, or even the bulk insertion of routes when the 200 kicked in. Just looking at the logs, we were sometimes delayed by upwards of 20s, even with the QPS in master (i.e. I wasn't running with my PR). I suppose bumping it to 19 could work, but the "more right" answer is something like a hierarchical token bucket where the operation polls can get pushed out almost completely.

@wojtek-t
Member Author

> @wojtek-t: My concern on the rate limiting is that we're now de-prioritizing GETs (like instance gets/lists) behind things like operations polls, or even the bulk insertion of routes when the 200 kicked. Just looking at the logs, we were delayed by upwards of 20s sometimes, even with the QPS in master (i.e. I wasn't running with my PR). I suppose bumping it to 19 could work, but the "more right" answer is something like a hierarchical token bucket where the operation-polls can get pushed out almost completely

I thought that the 20 QPS limit includes operation GETs. Isn't that the case?

startTime := time.Now()
// Ensure that we don't have more than maxConcurrentRouteCreations
// CreateRoute calls in flight.
rateLimiter <- struct{}{}
Member

I can add the additional logging here after we get it in, it's not a big deal. This looks fine.

@zmerlynn zmerlynn added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 25, 2016
@zmerlynn
Member

@k8s-oncall: Manual merge please, if we ever get a passing result.

@k8s-github-robot

@zmerlynn
You must link to the test flake issue which caused you to request this manual re-test.
Re-test requests should be in the form of: k8s-bot test this issue: #<number>
Here is the list of open test flakes.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 25, 2016
@zmerlynn zmerlynn added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 25, 2016
@zmerlynn
Member

@k8s-bot test this, issue #IGNORE (fighting PR builder)

@zmerlynn zmerlynn added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels May 25, 2016
k8s-github-robot pushed a commit that referenced this pull request May 26, 2016
Automatic merge from submit-queue

GCE provider: Revert rate limits

This reverts #26140 and #26170. After testing with #26263, #26140 is unnecessary, and we need to be able to prioritize normal GET / POST requests over operation-polling requests, which is what the pre-#26140 limiters do.

c.f. #26119
@wojtek-t wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels May 26, 2016
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 26, 2016
@wojtek-t
Member Author

Trivial rebase - reapplying lgtm.

@wojtek-t wojtek-t added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 26, 2016
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 26, 2016
@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented May 26, 2016

GCE e2e build/test passed for commit aa65a79.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit d3d6185 into kubernetes:master May 26, 2016
@wojtek-t wojtek-t deleted the fix_route_controller branch June 15, 2016 14:33