Overloaded API servers (sending 429) never cause clients to rebalance #48610

Closed
smarterclayton opened this issue Jul 7, 2017 · 32 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@smarterclayton
Contributor

smarterclayton commented Jul 7, 2017

When running API servers behind proxies, most of our clients (anything depending on rest client) are very good at not dropping connections. However, this means that if one of the backing API servers becomes rate limited, it stays rate limited, because clients never rebalance.

[graph]

The graph shows 3 API servers with rate limits set to 400. About 250 nodes are connecting to the API servers, and the controllers are picking one in particular and sending all traffic there (to ensure coherent caches). Before the rolling restart of the masters at 21:20, most of the traffic in the cluster is going to a single apiserver, and it never rebalances until that "hot" server is restarted at 21:20, at which point clients get randomly redistributed by the ELB. This is not an ELB problem, because the clients themselves establish and hold TCP connections.

The general problem is that clients are too good at staying connected, even when they're getting a lot of 429s. I'd like to change the server to close connections with some fixed (or dynamic?) probability when it returns a 429, on the principle that overloaded servers will then shed load. This would only make sense in a multi-apiserver setup.

@kubernetes/sig-scalability-bugs @jeremyeder
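
A minimal sketch of what this proposal could look like as a server-side wrapper (illustrative only, not the change that eventually merged; the handler, port, and the 10% probability are made up). It relies on Go's HTTP servers treating a `Connection: close` response header as a request to tear the connection down, which for HTTP/2 means a graceful GOAWAY:

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
)

// shedOn429 wraps a handler; whenever the wrapped handler responds with 429,
// it additionally asks the client (with probability p) to tear down the
// connection and reconnect, so an overloaded apiserver sheds clients back to
// the load balancer. A real filter would also preserve the Flusher/Hijacker
// interfaces of the wrapped ResponseWriter.
func shedOn429(h http.Handler, p float64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h.ServeHTTP(&shedWriter{ResponseWriter: w, p: p}, r)
	})
}

type shedWriter struct {
	http.ResponseWriter
	p float64
}

func (s *shedWriter) WriteHeader(code int) {
	if code == http.StatusTooManyRequests && rand.Float64() < s.p {
		// Must be set before headers are written: Go's HTTP/2 server turns
		// this into a graceful GOAWAY; HTTP/1.1 closes after the response.
		s.Header().Set("Connection", "close")
	}
	s.ResponseWriter.WriteHeader(code)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Stand-in for the real rate limiter: always reject.
		http.Error(w, "slow down", http.StatusTooManyRequests)
	})
	// 10% of 429 responses also ask the client to reconnect.
	log.Fatal(http.ListenAndServe(":8080", shedOn429(mux, 0.10)))
}
```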

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. kind/bug Categorizes issue or PR as related to a bug. labels Jul 7, 2017
@shyamjvs
Member

shyamjvs commented Jul 7, 2017

fyi we have issue kubernetes/client-go#222 for this.
Observed it during 5k-node scale tests.

@smarterclayton
Contributor Author

I think we should do both; #222 won't fix non-Kubernetes clients.

@smarterclayton
Contributor Author

Added a server-side fix in #48616.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 30, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@shyamjvs
Member

shyamjvs commented Mar 1, 2018

/reopen
/remove-lifecycle rotten

I'm not sure if this has been worked on.

@k8s-ci-robot
Contributor

@shyamjvs: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen
/remove-lifecycle rotten

I'm not sure if this has been worked on.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 1, 2018
@shyamjvs shyamjvs reopened this Mar 1, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 30, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 29, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@smarterclayton
Contributor Author

Go 1.11 and Kube 1.14 have the fixes to the Go HTTP/2 stack that allow us to do this now. Resurrecting.

@smarterclayton
Contributor Author

I am also considering an algorithmic change: rebalance either on all errors or on all requests.

@lavalamp
Member

You are talking about a serverside change to close the entire http2 connection (not just individual requests)?

Or are you talking about a clientside change to do the same on a 429?

I think it'd be best if the LB--ah, it's an L4 LB, of course.

OK, consider this: if you run apiserver behind an LB, clients have no way of knowing how many replicas there are and therefore how many connections they should keep open to load the replicas evenly. If we keep the model where a single client sends all of their traffic to a single apiserver, that kind of violates an assumption we're making for (at least the first take of) rate limiting. (We've talked about, in the future, adding a layer on top that helps clients find an apiserver replica with unused throughput.)

I don't think single clients today can reasonably be big enough for this to matter? But it's something to think about.

Maybe a policy where clients just always open N HTTP/2 connections and distribute requests among them would be OK for now.
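
A rough client-side sketch of that idea, assuming a plain net/http client rather than client-go (the type and function names here are made up): spread requests over N independent transports, so each keeps its own connection pool and an L4 load balancer can place the connections on different backends.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// fanoutTransport round-robins requests over several independent transports.
// Each *http.Transport keeps its own connection pool, so requests end up on
// up to N distinct TCP (and therefore HTTP/2) connections instead of one.
type fanoutTransport struct {
	transports []*http.Transport
	next       uint64
}

func newFanoutTransport(n int) *fanoutTransport {
	ts := make([]*http.Transport, n)
	for i := range ts {
		ts[i] = &http.Transport{ForceAttemptHTTP2: true}
	}
	return &fanoutTransport{transports: ts}
}

func (f *fanoutTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	i := atomic.AddUint64(&f.next, 1) % uint64(len(f.transports))
	return f.transports[i].RoundTrip(req)
}

func main() {
	// Distribute this client's requests over 4 connections.
	client := &http.Client{Transport: newFanoutTransport(4)}
	_ = client // e.g. client.Get("https://apiserver.example/healthz")
}
```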

@smarterclayton
Contributor Author

So every single large HA cluster I've seen has this problem (in practice). If you have a 1k node cluster you can get horribly unbalanced and never recover.

@smarterclayton
Contributor Author

I think there is no difference between an LB closing connections after some point and a client closing connections, except that this generally forces a graceful transition in certain circumstances.

Note this plays out on 10-30 minute timescales, not a couple of minutes. If rate limiting only works at timescales above 10 minutes, it's not very useful.

@lavalamp
Member

The rate limiting design will work on a roughly second-by-second timeframe.

@wojtek-t
Member

Copying a couple of thoughts from the Slack discussion:

I didn't have much chance to think about this (and I trust your judgement here), but a few things I would like to understand are:

  • Given that this is done at the connection level, should it be weighted somehow (if that's even possible) by the number of streams within a connection? In the end, an even spread of streams across connections is probably what we want.
  • Given that there are multiple streams in a single connection, we could close many watches in one shot; maybe we can try to minimize that (i.e. the number of things that will almost certainly be retried immediately at the same time).
  • In the same context, given that we're working on rate limiting of API calls (I know we're explicitly excluding watch now and plan to ignore it in the first version), this is something we may want to think about a bit.
  • One more thing on top of that: we may want to consider different probabilities depending on how the apiserver is behaving, e.g. if it's returning 429s or 5xxs, we may want an increased probability compared to when it's returning only 2xxs.

These are somewhat scattered comments, but something we may want to think about.
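
A tiny sketch of the last point, status-dependent probabilities, assuming it would plug into whatever connection-closing filter is chosen; the package, function name, and the numbers are purely illustrative and would need tuning:

```go
package loadshed

import "net/http"

// closeProbability returns the chance of asking a client to reconnect,
// weighted by how the apiserver is responding. The base values are made up.
func closeProbability(statusCode int) float64 {
	switch {
	case statusCode == http.StatusTooManyRequests: // 429: actively shedding load
		return 0.10
	case statusCode >= 500: // 5xx: server in trouble, rebalance more aggressively
		return 0.05
	default: // healthy responses: only a small background rate of rebalancing
		return 0.001
	}
}
```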

@lavalamp
Member

we have a potential to close many watches in one shot

Yeah, good point. So here's an even simpler suggestion: limit the number of concurrent streams over a single http2 connection to something in the 10-100 range.
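
For illustration, capping streams per connection with Go's golang.org/x/net/http2 package might look like the sketch below (the port, cert/key paths, and the value 100 are placeholders; the actual apiserver wiring is different):

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	srv := &http.Server{Addr: ":8443", Handler: http.DefaultServeMux}

	// Cap the number of concurrent streams (requests/watches) per HTTP/2
	// connection. Once a client hits the cap it typically has to open
	// another connection, which the L4 load balancer can place elsewhere.
	if err := http2.ConfigureServer(srv, &http2.Server{MaxConcurrentStreams: 100}); err != nil {
		log.Fatal(err)
	}

	// HTTP/2 here is negotiated over TLS; cert/key paths are placeholders.
	log.Fatal(srv.ListenAndServeTLS("server.crt", "server.key"))
}
```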

@wojtek-t wojtek-t removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 29, 2019
@wojtek-t wojtek-t added this to the v1.15 milestone Mar 29, 2019
@smarterclayton
Contributor Author

I wonder what our largest watch connection is after controller manager. I haven't seen many that are that dense, except for people who are trying to watch many individual namespaces.

I can look to see what stats we can forcibly extract from the underlying runtime. When we rebalance, I'd like to focus on low yield users more than high yield users. Rebalancing the controller manager connection isn't going to fix the problem. If we can distribute 10k small users (nodes, workers, etc.), that's the group that has to be balanced in terms of connections and churn rate.

@wojtek-t
Member

wojtek-t commented Apr 4, 2019

I wonder what our largest watch connection is after controller manager. I haven't seen many that are that dense, except for people who are trying to watch many individual namespaces.

Kube-proxies watching all endpoints are also very dense in some cases.

When we rebalance, I'd like to focus on low yield users more than high yield users.

+1

@soggiest

Hello! Code Freeze is almost upon us. We'll be entering Freeze tomorrow, May 31st. Is this still planned for 1.15?

@soggiest

soggiest commented Jun 7, 2019

/milestone v1.16

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.15, v1.16 Jun 7, 2019
@josiahbjorgaard
Contributor

Hello! This issue is tagged for milestone 1.16. Code freeze is starting on August 29th, in 9 days, which means that there should be a PR ready and merged before then. Is this still planned for 1.16?

@liggitt
Member

liggitt commented Aug 31, 2019

/milestone v1.17

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.16, v1.17 Aug 31, 2019
@josiahbjorgaard
Contributor

josiahbjorgaard commented Oct 20, 2019

Bug triage for 1.17 here. This issue has been open for a significant amount of time and since it is tagged for milestone 1.17, we want to let you know that the 1.17 code freeze is coming in less than one month on Nov. 14th. Will this issue be resolved before then?

@lavalamp
Member

I don't think anyone is working on this.

@caesarxuchao is fixing the Go standard library's failure to detect dead connections; this might make a good follow-up, but that change is actually in Go itself (!), not in k8s.

@josiahbjorgaard
Contributor

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.17 milestone Nov 3, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2020
@wojtek-t
Member

wojtek-t commented Feb 2, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 2, 2020
@riking

riking commented Mar 10, 2020

Is this fixed by #88567? That mechanism triggers probabilistically at all times, not just during overload, to cause steady rebalancing.

I suspect anything more advanced would require load balancer coordination.
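
For reference, a minimal sketch of that always-on probabilistic mechanism (the package, function name, and wiring are illustrative, not the actual filter merged in #88567): every request, regardless of status, gets a small chance of carrying a `Connection: close`, which Go's HTTP/2 server turns into a graceful GOAWAY.

```go
package loadshed

import (
	"math/rand"
	"net/http"
)

// withProbabilisticGoaway gives every request a small chance of asking the
// client to re-establish its connection, so an L4 load balancer can steadily
// re-place connections across backends even when nothing is overloaded.
func withProbabilisticGoaway(h http.Handler, chance float64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if rand.Float64() < chance {
			// Graceful GOAWAY on HTTP/2; connection close on HTTP/1.1.
			w.Header().Set("Connection", "close")
		}
		h.ServeHTTP(w, r)
	})
}
```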

@lavalamp
Member

Yes, I think that's close enough.
