Add new apiserver listener which throttles incoming connections #7470
Conversation
wallyworld
added some commits
Jun 7, 2017
jameinel
reviewed
Jun 8, 2017
If this is all we have, I'd live with it, and make it work. But I think we can do a little bit better, to make it easier to understand what you're tweaking.
I'd love it if we could update config live, but if that's too hard, we can live with restart. It just makes it much harder to play with, because you have to wait so long after a restart.
@@ -332,7 +332,7 @@ func (s *loginSuite) TestLoginRateLimited(c *gc.C) {
 	select {
 	case err := <-errResults:
 		c.Check(err, jc.Satisfies, params.IsCodeTryAgain)
-	case <-time.After(coretesting.LongWait):
+	case <-time.After(apiserver.LoginRetyPause + coretesting.LongWait):
+	DefaultLoginMinPause   = 100 * time.Millisecond
+	DefaultLoginMaxPause   = 1 * time.Second
+	DefaultLoginRetryPause = 5 * time.Second
+	DefaultConnMinPause    = 10 * time.Millisecond
jameinel
Jun 8, 2017
Owner
I feel like we're missing the "how much to back off based on number of connections".
So with that 5ms increment, do we need to back off connections more aggressively?
+}
+
+// Validate validates the rate limit configuration.
+func (c RateLimitConfig) Validate() error {
jameinel
Jun 8, 2017
Owner
I feel like we could probably use some sort of sanity checking around maximums as well.
Accidentally typo the ENV var (mistaking seconds and milliseconds), and suddenly your ConnMinPause jumps to 1000s instead of 1s.
There are also checks we could do to ensure Max is > Min.
wallyworld
Jun 8, 2017
Owner
This configurability isn't for the end user. I kept it as hidden as possible. But I could declare a MaxConnMinPause const.
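For illustration, a minimal sketch of the kind of bounds checking being discussed; the MaxConnMinPause constant and the field set here are assumptions for the example, not necessarily what the patch ended up with:

```go
package main

import (
	"fmt"
	"time"
)

// MaxConnMinPause is an assumed upper bound guarding against a typo'd
// ENV value (e.g. seconds mistaken for milliseconds).
const MaxConnMinPause = 5 * time.Second

// RateLimitConfig is a cut-down stand-in for the config in this PR.
type RateLimitConfig struct {
	ConnMinPause time.Duration
	ConnMaxPause time.Duration
}

// Validate checks lower and upper bounds, and that max >= min.
func (c RateLimitConfig) Validate() error {
	if c.ConnMinPause < 0 {
		return fmt.Errorf("connection min pause %v cannot be negative", c.ConnMinPause)
	}
	if c.ConnMinPause > MaxConnMinPause {
		return fmt.Errorf("connection min pause %v exceeds maximum %v", c.ConnMinPause, MaxConnMinPause)
	}
	if c.ConnMaxPause < c.ConnMinPause {
		return fmt.Errorf("connection max pause %v is less than min pause %v", c.ConnMaxPause, c.ConnMinPause)
	}
	return nil
}

func main() {
	// Catches the "1000s instead of 1s" typo case mentioned above.
	cfg := RateLimitConfig{ConnMinPause: 1000 * time.Second, ConnMaxPause: 5 * time.Second}
	fmt.Println(cfg.Validate())
}
```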
+
+// connRateMetric returns an int value based on the rate of new connections.
+func (l *throttlingListener) connRateMetric() int {
+	l.Lock()
jameinel
Jun 8, 2017
Owner
Some thoughts on this func:
- connRateMetric is eventually going to be walking all 500 entries every time we call it; we could just have a 'lastSlot' pointer in addition to the nextSlot pointer, and only update lastSlot when nextSlot bumps into it.
- This essentially always assumes averaging over 500 connections. So if you get a spike of 100 people all trying to connect right now, but haven't gotten any requests before that for a day, the average is still quite low. Yes, we have other things to specifically deal with spikes (max concurrent logins = 10), but I think ideally throttling would be the best place for a good heuristic, as that's where we've incurred the least overhead.
- This also doesn't pay attention to disconnects.
The algorithm used for network data passing seems to be the token bucket (https://en.wikipedia.org/wiki/Token_bucket), which doesn't need to track history.
You keep track of "when did you last count this" and "what is the current allowance". You take TIME*rate to increase allowance (with some max threshold), and you take connections to decrease it.
It would seem that Hierarchical Token Bucket is used if you want to go into classes of behavior.
It seems like we could actually just use Token Bucket and go into "above allowance X allow no delay, from X to Y, start slowing down connections, above Y drop connections (or just cap the max delay)".
That should give 4 bits of configuration (how quickly do you go from 0 to X, how high are you at X, how quickly do you get to Y, how high are you at Y)
Units-wise we could say "connections per second" and "connection delay in ms", or something along those lines, which feels understandable. (At X connections/second [minute?] delay by Y ms, at Z delay by A, etc.)
Thoughts?
I'm concerned about something that walks a list of 500 items on every connection, and doesn't get us quite what we want anyway.
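For reference, a rough sketch of the token-bucket idea described above; this is not code from the PR, and all names and numbers are illustrative:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenBucket tracks an allowance of "free" connections that refills over time.
type tokenBucket struct {
	mu        sync.Mutex
	allowance float64   // current number of free connections
	max       float64   // cap on the stored allowance
	rate      float64   // allowance replenished per second
	last      time.Time // when the allowance was last updated
}

// takeToken tops up the allowance based on elapsed time, then spends one
// token for the incoming connection. The return value is how far over
// budget we are, which a listener could map onto a pause (or a drop).
func (b *tokenBucket) takeToken(now time.Time) (overBudget float64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.allowance += now.Sub(b.last).Seconds() * b.rate
	b.last = now
	if b.allowance > b.max {
		b.allowance = b.max
	}
	b.allowance--
	if b.allowance < 0 {
		return -b.allowance
	}
	return 0
}

func main() {
	b := &tokenBucket{allowance: 10, max: 10, rate: 20, last: time.Now()}
	// A burst of 15 immediate connections: roughly the first 10 are free, the
	// rest report an increasing over-budget value that can become a delay.
	for i := 0; i < 15; i++ {
		fmt.Printf("conn %d over budget by %.1f\n", i, b.takeToken(time.Now()))
	}
}
```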
wallyworld
Jun 8, 2017
Owner
The averaging over possibly stale connections is a fair point - I could put a bound on how far back in time to look. That would make having a lastSlot pointer irrelevant, as we'd walk back until we wrapped or exceeded the max lookback time.
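A sketch of what that bounded lookback could look like, using a cut-down stand-in for the listener type in the diff; the helper and its name are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// throttlingListener here is a stripped-down stand-in with only the fields
// the lookback loop needs; the real type in the PR carries more state.
type throttlingListener struct {
	sync.Mutex
	connAcceptTimes []*time.Time
	nextSlot        int
}

// connCountWithin walks backwards from the most recently written slot,
// stopping once we wrap around or step past maxLookback.
func (l *throttlingListener) connCountWithin(now time.Time, maxLookback time.Duration) int {
	l.Lock()
	defer l.Unlock()
	cutoff := now.Add(-maxLookback)
	count := 0
	slot := l.nextSlot
	for i := 0; i < len(l.connAcceptTimes); i++ {
		slot--
		if slot < 0 {
			slot = len(l.connAcceptTimes) - 1
		}
		t := l.connAcceptTimes[slot]
		if t == nil || t.Before(cutoff) {
			break
		}
		count++
	}
	return count
}

func main() {
	now := time.Now()
	l := &throttlingListener{connAcceptTimes: make([]*time.Time, 500)}
	// Record three accepts in chronological order: 2s ago, 1s ago, now.
	for i := 0; i < 3; i++ {
		t := now.Add(-time.Duration(2-i) * time.Second)
		l.connAcceptTimes[l.nextSlot] = &t
		l.nextSlot = (l.nextSlot + 1) % len(l.connAcceptTimes)
	}
	fmt.Println(l.connCountWithin(now, 10*time.Second)) // 3
}
```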
wallyworld
Jun 8, 2017
Owner
The 500 was chosen so as not to be too big, while still giving a decent bucket size to average over and smooth things out a bit. But it's not based on hard data; I could reduce the default and make it configurable.
wallyworld
Jun 8, 2017
Owner
I've changed the algorithm to:
- rate < min threshold -> min pause
- rate > max threshold -> max pause
- in between, interpolate
- no randomness
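Roughly, the new scheme looks like this; a sketch only, where the threshold and pause names are placeholders rather than the actual config keys:

```go
package main

import (
	"fmt"
	"time"
)

// pauseFor maps a connection-rate metric onto a pause: below minThreshold
// use minPause, above maxThreshold use maxPause, and linearly interpolate
// in between.
func pauseFor(rate, minThreshold, maxThreshold float64, minPause, maxPause time.Duration) time.Duration {
	switch {
	case rate <= minThreshold:
		return minPause
	case rate >= maxThreshold:
		return maxPause
	}
	fraction := (rate - minThreshold) / (maxThreshold - minThreshold)
	return minPause + time.Duration(fraction*float64(maxPause-minPause))
}

func main() {
	// A rate of 30 between thresholds of 10 and 50 lands halfway between the pauses.
	fmt.Println(pauseFor(30, 10, 50, 10*time.Millisecond, 5*time.Second))
}
```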
+	defer l.Unlock()
+	now := l.clk.Now()
+	l.connAcceptTimes[l.nextSlot] = &now
+	l.nextSlot++
jameinel
Jun 8, 2017
Owner
I'm also somewhat concerned about go test -race complaining about Accept vs connRateMetric (we might always be doing it in a single goroutine, but it does feel a bit interlocked.)
wallyworld
Jun 8, 2017
Owner
I ran the tests with --race and saw no such failures. Occasionally the pause test failed because the wait time in the test was too low, so I increased it to LongWait.
+func (l *throttlingListener) pauseTime() time.Duration {
+	// The pause time is minPause plus 5ms for each unit increase
+	// in connection rate, up to a maximum of maxPause.
+	wantedMaxPause := l.minPause + time.Duration(l.connRateMetric()*5)*time.Millisecond
jameinel
Jun 8, 2017
Owner
That 5 is something we just can't tweak right now, which is another reason to shift the configuration a bit.
@@ -1254,6 +1258,72 @@ func (a *MachineAgent) newAPIserverWorker(
 	return server, nil
 }
+func getRateLimitConfig(cfg agent.Config) (apiserver.RateLimitConfig, error) {
jameinel
Jun 8, 2017
Owner
It's a bit of a shame that you have to restart the agent to update these values, given that restarting a Controller means killing and restarting 1000 active connections.
"juju model-config -m controller" has the super nice property that the changes are detected live.
(it often takes 30min for a controller to come anywhere close to 'settled' after bouncing.)
wallyworld
changed the title from
Add new apiserver listener which throttle incoming connections
to
Add new apiserver listener which throttles incoming connections
Jun 8, 2017
+		return 0
+	}
+	// We use as a metric how many connections per 10ms.
+	connRate := connCount * float64(10*time.Millisecond) / (1.0 + float64(latestConnTime.Sub(*earliestConnTime)))
jameinel
Jun 8, 2017
Owner
Offhand I would think 'connections per 1s' would make it easier to have a feel for what a valid value is (vs conns/10ms), but otherwise it's OK.
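For comparison, a per-second variant of the expression in the diff might look like this; illustrative only, keeping the original 1.0 guard against a zero-length window:

```go
package main

import (
	"fmt"
	"time"
)

// connsPerSecond mirrors the rate expression from the diff but scales the
// metric to connections per second instead of per 10ms.
func connsPerSecond(connCount float64, earliest, latest time.Time) float64 {
	return connCount * float64(time.Second) / (1.0 + float64(latest.Sub(earliest)))
}

func main() {
	now := time.Now()
	// 100 connections accepted over 250ms is roughly 400 conns/second.
	fmt.Println(connsPerSecond(100, now.Add(-250*time.Millisecond), now))
}
```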
$$merge$$

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Build failed: Generating tarball failed

$$merge$$

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Build failed: Tests failed

$$merge$$

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju
wallyworld commented Jun 8, 2017
Description of change
For models with many agents, it's possible to DDoS the controller (the thundering herd problem) if the controller agent bounces, e.g. after an upgrade, and everything attempts to reconnect at the same time.
Rate limiting is introduced for logins and general API connections. In addition, when a login is rejected with a retry error, we first pause so that the agent doesn't get to try again too soon.
The connection rate limit is achieved by pausing when accepting API connections. The pause time is based on the connection rate - the higher the rate of incoming connections, the longer the pause. The pause time is increased by 5ms for each new connection received per 10ms, up to a maximum pause of 5s.
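As a rough sketch of that calculation (illustrative values; the real defaults and code live in the apiserver rate-limit config added by this PR):

```go
package main

import (
	"fmt"
	"time"
)

// Assumed values for illustration only.
const (
	connMinPause = 10 * time.Millisecond
	connMaxPause = 5 * time.Second
	pausePerUnit = 5 * time.Millisecond // 5ms per connection-per-10ms of rate
)

// pauseTime scales the accept pause with the connection-rate metric and
// caps it at the maximum.
func pauseTime(connRateMetric int) time.Duration {
	pause := connMinPause + time.Duration(connRateMetric)*pausePerUnit
	if pause > connMaxPause {
		pause = connMaxPause
	}
	return pause
}

func main() {
	// A rate of 100 connections per 10ms gives 10ms + 100*5ms = 510ms of pause.
	fmt.Println(pauseTime(100))
}
```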
QA steps
- bootstrap and deploy peer-xplod charm
- add units
- add new models, rinse and repeat
- system should reach steady state without errors
- on my lxd system, the units all got to the "reached maximum count 1000" state without noticeable impact on system interactions
Bug
Partially addresses https://bugs.launchpad.net/juju/+bug/1696113