Overloaded API servers (sending 429) never cause clients to rebalance #48610

Closed
smarterclayton opened this issue Jul 7, 2017 · 32 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@smarterclayton
Contributor

smarterclayton commented Jul 7, 2017

When running API servers behind proxies, most of our clients (anything depending on rest client) are very good at not dropping connections. However, this means that if one of the backing API servers becomes rate limited, it stays rate limited, because clients never rebalance.

[graph]

The graph shows 3 API servers with rate limits set to 400. About 250 nodes are connecting to the API servers, and the controllers are picking one in particular and sending all traffic there (to ensure coherent caches). Before the rolling restart of the masters at 21:20, most of the traffic in the cluster is going to a single apiserver, and it never rebalances until that "hot" server is restarted at 21:20, at which point clients get randomly redistributed by the ELB. This is not an ELB problem, because the clients themselves establish and hold TCP connections.

The general problem is that clients are too good at staying connected, even when they're getting a lot of 429s. I'd like to change the server to close connections with some fixed (or dynamic?) probability when it returns a 429, on the principle that overloaded servers will then shed load. This would only make sense in a multi-apiserver setup.

@kubernetes/sig-scalability-bugs @jeremyeder
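
A minimal sketch of what this proposal could look like as a server-side wrapper (illustrative only, not the change that eventually merged; the handler, port, and the 10% probability are made up). It relies on Go's HTTP servers treating a `Connection: close` response header as a request to tear the connection down, which for HTTP/2 means a graceful GOAWAY:

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
)

// shedOn429 wraps a handler; whenever the wrapped handler responds with 429,
// it additionally asks the client (with probability p) to tear down the
// connection and reconnect, so an overloaded apiserver sheds clients back to
// the load balancer. A real filter would also preserve the Flusher/Hijacker
// interfaces of the wrapped ResponseWriter.
func shedOn429(h http.Handler, p float64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h.ServeHTTP(&shedWriter{ResponseWriter: w, p: p}, r)
	})
}

type shedWriter struct {
	http.ResponseWriter
	p float64
}

func (s *shedWriter) WriteHeader(code int) {
	if code == http.StatusTooManyRequests && rand.Float64() < s.p {
		// Must be set before headers are written: Go's HTTP/2 server turns
		// this into a graceful GOAWAY; HTTP/1.1 closes after the response.
		s.Header().Set("Connection", "close")
	}
	s.ResponseWriter.WriteHeader(code)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Stand-in for the real rate limiter: always reject.
		http.Error(w, "slow down", http.StatusTooManyRequests)
	})
	// 10% of 429 responses also ask the client to reconnect.
	log.Fatal(http.ListenAndServe(":8080", shedOn429(mux, 0.10)))
}
```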

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. kind/bug Categorizes issue or PR as related to a bug. labels Jul 7, 2017
@shyamjvs
Member

shyamjvs commented Jul 7, 2017

fyi we have issue kubernetes/client-go#222 for this.
Observed it during 5k-node scale tests.

@smarterclayton
Contributor Author

I think we should do both; #222 won't fix non-Kubernetes clients.

@smarterclayton
Contributor Author

Added a server-side fix in #48616.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 30, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@shyamjvs
Member

shyamjvs commented Mar 1, 2018

/reopen
/remove-lifecycle rotten

I'm not sure if this has been worked on.

@k8s-ci-robot
Contributor

@shyamjvs: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen
/remove-lifecycle rotten

I'm not sure if this has been worked on.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 1, 2018
@shyamjvs shyamjvs reopened this Mar 1, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 30, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 29, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@smarterclayton
Contributor Author

Go 1.11 and Kube 1.14 have the fixes to the Go HTTP/2 stack that allow us to do this now. Resurrecting.

@smarterclayton
Contributor Author

I am also considering an algorithmic change: rebalance either on all errors or on all requests.

@lavalamp
Member

You are talking about a serverside change to close the entire http2 connection (not just individual requests)?

Or are you talking about a clientside change to do the same on a 429?

I think it'd be best if the LB--ah, it's an L4 LB, of course.

OK, consider this: if you run apiserver behind an LB, clients have no way of knowing how many replicas there are and therefore how many connections they should keep open to load the replicas evenly. If we keep the model where a single client sends all of their traffic to a single apiserver, that kind of violates an assumption we're making for (at least the first take of) rate limiting. (We've talked about, in the future, adding a layer on top that helps clients find an apiserver replica with unused throughput.)

I don't think single clients today can reasonably be big enough for this to matter? But it's something to think about.

Maybe a policy where clients just always open N HTTP/2 connections and distribute requests among them would be OK for now.
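
A rough client-side sketch of that idea, assuming a plain net/http client rather than client-go (the type and function names here are made up): spread requests over N independent transports, so each keeps its own connection pool and an L4 load balancer can place the connections on different backends.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// fanoutTransport round-robins requests over several independent transports.
// Each *http.Transport keeps its own connection pool, so requests end up on
// up to N distinct TCP (and therefore HTTP/2) connections instead of one.
type fanoutTransport struct {
	transports []*http.Transport
	next       uint64
}

func newFanoutTransport(n int) *fanoutTransport {
	ts := make([]*http.Transport, n)
	for i := range ts {
		ts[i] = &http.Transport{ForceAttemptHTTP2: true}
	}
	return &fanoutTransport{transports: ts}
}

func (f *fanoutTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	i := atomic.AddUint64(&f.next, 1) % uint64(len(f.transports))
	return f.transports[i].RoundTrip(req)
}

func main() {
	// Distribute this client's requests over 4 connections.
	client := &http.Client{Transport: newFanoutTransport(4)}
	_ = client // e.g. client.Get("https://apiserver.example/healthz")
}
```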

@smarterclayton
Contributor Author

So every single large HA cluster I've seen has this problem (in practice). If you have a 1k node cluster you can get horribly unbalanced and never recover.

@smarterclayton
Contributor Author

I think there is no difference between an LB closing connections after some point and a client closing connections, except that this generally forces a graceful transition in certain circumstances.

Note this plays out on 10-30 minute timescales, not a couple of minutes. If rate limiting only works at timescales above 10 minutes, it's not very useful.

@lavalamp
Member

The rate limiting design will work on a roughly second-by-second timeframe.

@wojtek-t
Member

Copying a couple of thoughts from the Slack discussion:

I didn't have much chance to think about this (and I trust your judgement here), but a few things I would like to understand are:

  • Given that this is done at the connection level, should it be weighted somehow (if that's even possible) by the number of streams within a connection? In the end, an even spread of streams across connections is probably what we want.
  • Given that there are multiple streams in a single connection, we could close many watches in one shot; maybe we can try to minimize that (i.e. the number of things that will almost certainly be retried immediately at the same time).
  • In the same context, given that we're working on rate limiting of API calls (I know we're explicitly excluding watch now and plan to ignore it in the first version), this is something we may want to think about a bit.
  • One more thing on top of that: we may want to consider different probabilities depending on how the apiserver is behaving, e.g. if it's returning 429s or 5xxs, we may want an increased probability compared to when it's returning only 2xxs.

These are somewhat scattered comments, but something we may want to think about.
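
A tiny sketch of the last point, status-dependent probabilities, assuming it would plug into whatever connection-closing filter is chosen; the package, function name, and the numbers are purely illustrative and would need tuning:

```go
package loadshed

import "net/http"

// closeProbability returns the chance of asking a client to reconnect,
// weighted by how the apiserver is responding. The base values are made up.
func closeProbability(statusCode int) float64 {
	switch {
	case statusCode == http.StatusTooManyRequests: // 429: actively shedding load
		return 0.10
	case statusCode >= 500: // 5xx: server in trouble, rebalance more aggressively
		return 0.05
	default: // healthy responses: only a small background rate of rebalancing
		return 0.001
	}
}
```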

@lavalamp
Member

we have a potential to close many watches in one shot

Yeah, good point. So here's an even simpler suggestion: limit the number of concurrent streams over a single http2 connection to something in the 10-100 range.
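
For illustration, capping streams per connection with Go's golang.org/x/net/http2 package might look like the sketch below (the port, cert/key paths, and the value 100 are placeholders; the actual apiserver wiring is different):

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	srv := &http.Server{Addr: ":8443", Handler: http.DefaultServeMux}

	// Cap the number of concurrent streams (requests/watches) per HTTP/2
	// connection. Once a client hits the cap it typically has to open
	// another connection, which the L4 load balancer can place elsewhere.
	if err := http2.ConfigureServer(srv, &http2.Server{MaxConcurrentStreams: 100}); err != nil {
		log.Fatal(err)
	}

	// HTTP/2 here is negotiated over TLS; cert/key paths are placeholders.
	log.Fatal(srv.ListenAndServeTLS("server.crt", "server.key"))
}
```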

@wojtek-t wojtek-t removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 29, 2019
@wojtek-t wojtek-t added this to the v1.15 milestone Mar 29, 2019
@smarterclayton
Contributor Author

I wonder what our largest watch connection is after controller manager. I haven't seen many that are that dense, except for people who are trying to watch many individual namespaces.

I can look to see what stats we can forcibly extract from the underlying runtime. When we rebalance, I'd like to focus on low yield users more than high yield users. Rebalancing the controller manager connection isn't going to fix the problem. If we can distribute 10k small users (nodes, workers, etc.), that's the group that has to be balanced in terms of connections and churn rate.

@wojtek-t
Member

wojtek-t commented Apr 4, 2019

I wonder what our largest watch connection is after controller manager. I haven't seen many that are that dense, except for people who are trying to watch many individual namespaces.

Kube-proxies watching all endpoints are also very dense in some cases.

When we rebalance, I'd like to focus on low yield users more than high yield users.

+1

@soggiest

Hello! Code Freeze is almost upon us. We'll be entering Freeze tomorrow, May 31st. Is this still planned for 1.15?

@soggiest

soggiest commented Jun 7, 2019

/milestone v1.16

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.15, v1.16 Jun 7, 2019
@josiahbjorgaard
Contributor

Hello! This issue is tagged for milestone 1.16. Code freeze is starting on August 29th, in 9 days, which means that there should be a PR ready and merged before then. Is this still planned for 1.16?

@liggitt
Member

liggitt commented Aug 31, 2019

/milestone v1.17

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.16, v1.17 Aug 31, 2019
@josiahbjorgaard
Contributor

josiahbjorgaard commented Oct 20, 2019

Bug triage for 1.17 here. This issue has been open for a significant amount of time and since it is tagged for milestone 1.17, we want to let you know that the 1.17 code freeze is coming in less than one month on Nov. 14th. Will this issue be resolved before then?

@lavalamp
Member

I don't think anyone is working on this.

@caesarxuchao is fixing the Go standard library's failure to detect dead connections; this might make a good follow-up, but that change is actually in Go itself (!), not in k8s.

@josiahbjorgaard
Contributor

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.17 milestone Nov 3, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2020
@wojtek-t
Member

wojtek-t commented Feb 2, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 2, 2020
@riking

riking commented Mar 10, 2020

Is this fixed by #88567? That mechanism triggers probabilistically at all times, not just during overload, to cause steady rebalancing.

I suspect anything more advanced would require load balancer coordination.
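
For reference, a minimal sketch of that always-on probabilistic mechanism (the package, function name, and wiring are illustrative, not the actual filter merged in #88567): every request, regardless of status, gets a small chance of carrying a `Connection: close`, which Go's HTTP/2 server turns into a graceful GOAWAY.

```go
package loadshed

import (
	"math/rand"
	"net/http"
)

// withProbabilisticGoaway gives every request a small chance of asking the
// client to re-establish its connection, so an L4 load balancer can steadily
// re-place connections across backends even when nothing is overloaded.
func withProbabilisticGoaway(h http.Handler, chance float64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if rand.Float64() < chance {
			// Graceful GOAWAY on HTTP/2; connection close on HTTP/1.1.
			w.Header().Set("Connection", "close")
		}
		h.ServeHTTP(w, r)
	})
}
```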

@lavalamp
Member

Yes, I think that's close enough.
