Overloaded API servers (sending 429) never cause clients to rebalance #48610
Comments
FYI we have issue kubernetes/client-go#222 for this.
I think we should do both. 222 won't fix non-kubernetes clients.
Added a fix in #48616 for the server side.
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/reopen I'm not sure if this has been worked upon.
@shyamjvs: you can't re-open an issue/PR unless you authored it or you are assigned to it.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Golang 1.11 and Kube 1.14 have the fixes to the Go http2 stack that allow us to do this now. Resurrecting.
I am also considering an algorithmic change: rebalance either on all errors or on all requests.
Are you talking about a server-side change to close the entire http2 connection (not just individual requests)? Or are you talking about a client-side change to do the same on a 429? I think it'd be best if the LB--ah, it's an L4 LB, of course. OK, consider this: if you run apiserver behind an LB, clients have no way of knowing how many replicas there are, and therefore how many connections they should keep open to load the replicas evenly. If we keep the model where a single client sends all of its traffic to a single apiserver, that kind of violates an assumption we're making for (at least the first take of) rate limiting. (We've talked about, in the future, adding a layer on top that helps clients find an apiserver replica with unused throughput.) I don't think single clients today can reasonably be big enough for this to matter, but it's something to think about. Maybe a policy where clients always open N http2 connections and distribute requests among them would be OK for now.
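A rough sketch of the "always open and distribute requests among N http2 connections" idea, purely illustrative (this is not how client-go behaves today; the URL, the value of N, and the round-robin policy are all assumptions):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// nConnClient spreads requests over n independent transports. Each
// http.Transport keeps its own connection pool, so requests end up on up
// to n distinct TCP/http2 connections, which an L4 load balancer can place
// on different apiserver replicas.
type nConnClient struct {
	clients []*http.Client
	next    uint64
}

func newNConnClient(n int) *nConnClient {
	c := &nConnClient{clients: make([]*http.Client, n)}
	for i := range c.clients {
		c.clients[i] = &http.Client{
			Transport: &http.Transport{ForceAttemptHTTP2: true},
		}
	}
	return c
}

// Do round-robins requests across the underlying clients.
func (c *nConnClient) Do(req *http.Request) (*http.Response, error) {
	i := atomic.AddUint64(&c.next, 1) % uint64(len(c.clients))
	return c.clients[i].Do(req)
}

func main() {
	client := newNConnClient(4) // N = 4 is an arbitrary example value
	// Placeholder URL for the sketch.
	req, err := http.NewRequest("GET", "https://example.com/healthz", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```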
So every single large HA cluster I've seen has this problem (in practice). If you have a 1k node cluster you can get horribly unbalanced and never recover.
I think there is no difference between an LB closing connections after some point and a client closing connections, except this is generally forcing a graceful transition in certain circumstances. Note this is on 10-30m timescales, not in minutes. If rate limiting only works above 10m it's not very useful.
The rate limiting design will work on the ~second-by-second timeframe.
Copying a couple of my thoughts from the Slack discussion: I didn't have much chance to think about that (and I trust your judgement here), but a couple of things that I would like to understand are:
Those comments are a bit random, but they're something we may want to think about.
Yeah, good point. So here's an even simpler suggestion: limit the number of concurrent streams over a single http2 connection to something in the 10-100 range.
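A minimal server-side sketch of that suggestion using the golang.org/x/net/http2 package; the limit of 50, the listen address, and the certificate paths are illustrative placeholders (kube-apiserver exposes a similar knob as --http2-max-streams-per-connection):

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})

	srv := &http.Server{Addr: ":8443", Handler: mux}

	// Cap how many concurrent streams one http2 connection may carry.
	// A client that needs more parallelism than this must queue requests
	// or open additional connections, which gives an L4 LB more chances
	// to spread load. 50 is an illustrative value from the 10-100 range
	// suggested above.
	if err := http2.ConfigureServer(srv, &http2.Server{MaxConcurrentStreams: 50}); err != nil {
		log.Fatal(err)
	}

	// Cert and key paths are placeholders; http2 is negotiated over TLS.
	log.Fatal(srv.ListenAndServeTLS("server.crt", "server.key"))
}
```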
I wonder what our largest watch connection is after the controller manager. I haven't seen many that are that dense, except for people who are trying to watch many individual namespaces. I can look to see what info we can forcibly extract stats-wise from the underlying runtime. When we rebalance, I'd like to focus on low-yield users more than high-yield users. Rebalancing the controller manager connection isn't going to fix the problem. If we can distribute 10k small users (nodes, workers, etc.), that's the group that has to be balanced from a connection/churn-rate perspective.
Kube-proxies watching all endpoints are very dense in some cases also.
+1
Hello! Code Freeze is just about upon us. We'll be entering the freeze tomorrow, May 31st. Is this still planned for 1.15?
/milestone v1.16
Hello! This issue is tagged for milestone 1.16. Code freeze is starting on August 29th, in 9 days, which means that there should be a PR ready and merged before then. Is this still planned for 1.16? |
/milestone v1.17
Bug triage for 1.17 here. This issue has been open for a significant amount of time, and since it is tagged for milestone 1.17, we want to let you know that the 1.17 code freeze is coming in less than one month, on Nov. 14th. Will this issue be resolved before then?
I don't think anyone is working on this. @caesarxuchao is fixing the Go std lib not detecting dead connections; this might make a good follow-up, but that change is actually in golang (!), not k8s.
/milestone clear
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale |
Is this fixed by #88567? The mechanism probabilistically triggers at all times, not just during overload, to cause steady rebalancing. I suspect anything more advanced would require load balancer coordination.
Yes, I think that's close enough.
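For intuition on why a small per-request close probability rebalances connections reasonably fast, here is a back-of-the-envelope sketch; the probability and request rate are made-up numbers, not values taken from #88567:

```go
package main

import "fmt"

// If the server closes (or sends GOAWAY on) a connection after any given
// request with probability p, a client issuing r requests/second
// reconnects, on average, every 1/(p*r) seconds.
func main() {
	p := 0.001 // per-request close probability (assumed)
	r := 50.0  // requests per second from one client (assumed)
	fmt.Printf("expected requests until close: %.0f\n", 1/p)       // 1000
	fmt.Printf("expected seconds between reconnects: %.0f\n", 1/(p*r)) // 20
}
```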
When running API servers behind proxies, most of our clients (anything depending on rest client) are very good at not dropping connections. However, this means that if one of the backing API servers becomes rate limited, it stays rate limited because clients never rebalance.
The graph below shows 3 API servers with rate limits set to 400. About 250 nodes are connecting to the API servers, and the controllers are picking one in particular and sending all traffic there (to ensure coherent caches). Before the rolling restart of the masters at 21:20, most of the traffic in the cluster is going to a single apiserver, and it never rebalances until that "hot" server is restarted at 21:20, at which point clients get randomly distributed by the ELB. This is not an ELB problem, because the clients themselves establish and hold TCP connections.
The general problem is that clients are too good at staying connected, even if they're getting a lot of 429s. I'd like to change the server to close connections with some fixed (dynamic?) probability if there is a 429, on the principle that overloaded servers will then shed load. This would only make sense in a multi-apiserver setup.
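A hedged sketch of that idea as generic Go middleware; the wrapper, the probability, and the use of a Connection: close header are illustrative assumptions rather than the actual apiserver change (an http2 server would need to send a GOAWAY frame instead):

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
)

// closeOn429 wraps a handler and, when the response is a 429, asks the
// client to drop its connection with probability p so its traffic can
// re-spread across apiservers. The probability is illustrative.
func closeOn429(next http.Handler, p float64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(&closer{ResponseWriter: w, p: p}, r)
	})
}

type closer struct {
	http.ResponseWriter
	p float64
}

func (c *closer) WriteHeader(code int) {
	if code == http.StatusTooManyRequests && rand.Float64() < c.p {
		// For HTTP/1.1 the server closes the connection after this
		// response; an http2 server would send a GOAWAY instead.
		c.Header().Set("Connection", "close")
	}
	c.ResponseWriter.WriteHeader(code)
}

func main() {
	limited := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Pretend every request is over the rate limit for the demo.
		http.Error(w, "slow down", http.StatusTooManyRequests)
	})
	log.Fatal(http.ListenAndServe(":8080", closeOn429(limited, 0.1)))
}
```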
@kubernetes/sig-scalability-bugs @jeremyeder