Limit concurrent route creations #26263
Conversation
// that means that we may wait up to 5 minutes before even starting
// creating a route for it. This is bad.
// We should have a watch on node and if we observe a new node (with CIDR?)
// trigger reconciliation for that node.
This is an extant issue, but #26267 will at least cause it to be NotReady until the route is programmed, right?
Yes - this is a long-standing issue (it's nothing new). I will send a PR for it, but that won't be a trivial change, so I'm not sure it will land in 1.3.
But yeah - #26267 would make nodes not ready in such a case.
I would file an issue and we can triage it. If it's long-standing, nothing huge to worry about. One thing is that "new node" in the large-cluster sense is kind of gradual, though - basically everything in the first half hour looks like a new node, so that resync period could be a little relevant. One question I had is whether the reconcile loop works like `{ reconcile(); sleep(constant); }` or like `{ reconcile(); sleep(to-meet-five-minutes); }`. (A question I keep meaning to answer.)
Huh. Unless I'm misreading `JitterUntil` badly, it's the former and not the latter.
The former. I can file an issue but it's hard with a kid on my hands :)
Actually - he is eating now.
#26274 filed for this.
Some medium-level nits, otherwise LGTM. If you're back online tonight, we can iterate on this in realtime and get it done. I'm testing it now, but I think it might actually be best accompanied by a revert of #26140 (which you could do by just pushing a
PTAL
@zmerlynn - to be honest, I wouldn't revert that PR, because there is also another QPS limit on all API calls, so I think both are necessary. Though we may want to bump the limit from 10 to something like 19 or 20. WDYT?
With this and the two reverts suggested, 1k nodes seems pretty happy, though route programming is still taking quite a long time. @wojtek-t: My concern on the rate limiting is that we're now de-prioritizing GETs (like instance gets/lists) behind things like operation polls, or even the bulk insertion of routes when the 200 kicked. Just looking at the logs, we were delayed by upwards of 20s sometimes, even with the QPS in
I thought that the 20 QPS limit includes operation GETs. Isn't that the case?
startTime := time.Now()
// Ensure that we don't have more than maxConcurrentRouteCreations
// CreateRoute calls in flight.
rateLimiter <- struct{}{}
I can add the additional logging here after we get it in, it's not a big deal. This looks fine.
@k8s-oncall: Manual merge please, if we ever get a passing result.
@zmerlynn
@k8s-bot test this, issue #IGNORE (fighting PR builder)
Automatic merge from submit-queue

GCE provider: Revert rate limits

This reverts #26140 and #26170. After testing with #26263, #26140 is unnecessary, and we need to be able to prioritize normal GET / POST requests over operation polling requests, which is what the pre-#26140 requests do. c.f. #26119
Trivial rebase - reapplying lgtm.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit aa65a79.
Automatic merge from submit-queue
Ref #26119
This is supposed to improve 2 things:
We need something like that, because we have a limit of concurrent in-flight CreateRoute requests in GCE.
@gmarek @cjcullen