
Optional rate limiting of new connections and help prevent churn #79

Merged

merged 3 commits on May 19, 2018

Conversation

@jameshartig commented Apr 4, 2018

We've been having an issue in production where thousands of connections are created and closed repeatedly within a few seconds; the service runs out of file descriptors (because the old connections are left in TIME_WAIT) and ends up in an unusable state. This PR adds a few things:

First, a "reserve pool" was added to Pool with a capacity of 10*size. If the main pool is full, this reserve pool is used to temporarily hold onto a client in case it is needed again soon. This should help prevent connection churn when more than size connections are needed temporarily. Over a period of 100 seconds, the reserve pool is drained, re-using the ping timer.

Second, instead of potentially creating a new connection just to ping, we only look for a client in the pool.

Third, an optional "limited" pool can be created with a token bucket system that only allows roughly size new clients to be created every 10 seconds; the bucket is refilled at a rate of one token every 10/size seconds. This also re-uses the ping timer. I left this off by default since it's technically backwards-incompatible, but we plan on turning it on in almost all cases. A PoolLimited bool was added to the cluster.Opts struct to enable this functionality.

@jameshartig jameshartig force-pushed the master branch 3 times, most recently from 309426e to 475b492 Compare April 4, 2018 16:15
@jameshartig (Author)

Updated to remove the signature change to pool.NewCustom and instead implemented another goroutine in a pool.NewCustomLimited method.

@jameshartig (Author) commented Apr 4, 2018

I wrote the following code to test the behavior with and without my changes and with and without a limited pool: https://gist.github.com/fastest963/1976bf549c91998f9f37435663eb66cf

before:

$ ./testpool --parallel 250 --size 10
2018/04/04 13:26:23 num connections 248
2018/04/04 13:26:23 num new connections 652
2018/04/04 13:26:23 num pings 23191
2018/04/04 13:26:23 mean ping duration 9ms
2018/04/04 13:26:24 num connections 250
2018/04/04 13:26:24 num new connections 555
2018/04/04 13:26:24 num pings 23421
2018/04/04 13:26:24 mean ping duration 9ms
2018/04/04 13:26:25 num connections 250
2018/04/04 13:26:25 num new connections 443
2018/04/04 13:26:25 num pings 22872
2018/04/04 13:26:25 mean ping duration 9ms
2018/04/04 13:26:26 num connections 250
2018/04/04 13:26:26 num new connections 309
2018/04/04 13:26:26 num pings 21653
2018/04/04 13:26:26 mean ping duration 10ms
2018/04/04 13:26:27 num connections 251
2018/04/04 13:26:27 num new connections 359
2018/04/04 13:26:27 num pings 23481
2018/04/04 13:26:27 mean ping duration 9ms
2018/04/04 13:26:28 num connections 250
2018/04/04 13:26:28 num new connections 406
2018/04/04 13:26:28 num pings 19325
2018/04/04 13:26:28 mean ping duration 11ms

after (not limited):

$ ./testpool --parallel 250 --size 10
2018/04/04 13:38:44 num connections 250
2018/04/04 13:38:44 num new connections 290
2018/04/04 13:38:44 num pings 34920
2018/04/04 13:38:44 mean ping duration 6ms
2018/04/04 13:38:45 num connections 250
2018/04/04 13:38:45 num new connections 29
2018/04/04 13:38:45 num pings 35537
2018/04/04 13:38:45 mean ping duration 6ms
2018/04/04 13:38:46 num connections 250
2018/04/04 13:38:46 num new connections 18
2018/04/04 13:38:46 num pings 35723
2018/04/04 13:38:46 mean ping duration 6ms
2018/04/04 13:38:47 num connections 251
2018/04/04 13:38:47 num new connections 1
2018/04/04 13:38:47 num pings 35136
2018/04/04 13:38:47 mean ping duration 6ms
2018/04/04 13:38:48 num connections 251
2018/04/04 13:38:48 num new connections 0
2018/04/04 13:38:48 num pings 37500
2018/04/04 13:38:48 mean ping duration 6ms
2018/04/04 13:38:49 num connections 250
2018/04/04 13:38:49 num new connections 17
2018/04/04 13:38:49 num pings 38082
2018/04/04 13:38:49 mean ping duration 5ms
2018/04/04 13:38:50 num connections 250
2018/04/04 13:38:50 num new connections 1
2018/04/04 13:38:50 num pings 37271
2018/04/04 13:38:50 mean ping duration 6ms

after (limited):

$ ./testpool --limited --parallel 250 --size 10
2018/04/04 13:23:29 num connections 20
2018/04/04 13:23:29 num new connections 20
2018/04/04 13:23:29 num pings 38896
2018/04/04 13:23:29 mean ping duration 5ms
2018/04/04 13:23:30 num connections 21
2018/04/04 13:23:30 num new connections 1
2018/04/04 13:23:30 num pings 38749
2018/04/04 13:23:30 mean ping duration 5ms
2018/04/04 13:23:31 num connections 22
2018/04/04 13:23:31 num new connections 1
2018/04/04 13:23:31 num pings 37484
2018/04/04 13:23:31 mean ping duration 6ms
2018/04/04 13:23:32 num connections 23
2018/04/04 13:23:32 num new connections 1
2018/04/04 13:23:32 num pings 36209
2018/04/04 13:23:32 mean ping duration 6ms
2018/04/04 13:23:33 num connections 24
2018/04/04 13:23:33 num new connections 1
2018/04/04 13:23:33 num pings 36280
2018/04/04 13:23:33 mean ping duration 6ms
2018/04/04 13:23:34 num connections 25
2018/04/04 13:23:34 num new connections 1
2018/04/04 13:23:34 num pings 39232
2018/04/04 13:23:34 mean ping duration 5ms

To summarize:

  • ~60% better throughput on the given test above (10 pooled connections with 250 goroutines pinging every 100 microseconds) and little churn.
  • With rate limiting: ~6% slower than without, but the number of connections stays under 25, with a new connection allowed every second.

@jameshartig jameshartig force-pushed the master branch 2 times, most recently from 4609d14 to f2acfb0 Compare April 10, 2018 14:30
@mediocregopher (Owner)

I implemented some of your changes in the pool in radix.v3:
mediocregopher/radix@93d26b0

I think the option pattern is a good fit here, especially since radix.v2 is split up into separate packages, so it's even a little cleaner. By making NewCustom take a variadic options argument we could avoid a backwards-incompatible change. And ClusterOpts could maybe take a []pool.Opt as a field? Not sure there...

@jameshartig jameshartig force-pushed the master branch 4 times, most recently from 2cdb82e to 79b3e87 Compare May 4, 2018 16:33
@jameshartig (Author)

I implemented the Opt pattern much like you did in v3. I also added headroom to the limit, because you'd probably want a buffer so you can burst above the pool size and only start limiting once it gets out of hand. The headroom can be 0 if you don't want that, but we definitely do.

@jameshartig (Author)

Here are some new numbers from the test:

after (buffer size: 10*poolSize) (no creation limit)

2018/05/04 12:54:28 num connections 250
2018/05/04 12:54:28 num new connections 250
2018/05/04 12:54:28 num available 2
2018/05/04 12:54:28 num pings 28707
2018/05/04 12:54:28 mean ping duration 7ms
2018/05/04 12:54:29 num connections 250
2018/05/04 12:54:29 num new connections 0
2018/05/04 12:54:29 num available 0
2018/05/04 12:54:29 num pings 30229
2018/05/04 12:54:29 mean ping duration 7ms
2018/05/04 12:54:30 num connections 250
2018/05/04 12:54:30 num new connections 0
2018/05/04 12:54:30 num available 0
2018/05/04 12:54:30 num pings 29302
2018/05/04 12:54:30 mean ping duration 7ms
2018/05/04 12:54:31 num connections 250
2018/05/04 12:54:31 num new connections 28
2018/05/04 12:54:31 num available 0
2018/05/04 12:54:31 num pings 29613
2018/05/04 12:54:31 mean ping duration 7ms
2018/05/04 12:54:32 num connections 250
2018/05/04 12:54:32 num new connections 60
2018/05/04 12:54:32 num available 0
2018/05/04 12:54:32 num pings 26975
2018/05/04 12:54:32 mean ping duration 8ms
2018/05/04 12:54:33 num connections 250
2018/05/04 12:54:33 num new connections 1
2018/05/04 12:54:33 num available 0
2018/05/04 12:54:33 num pings 28690
2018/05/04 12:54:33 mean ping duration 8ms

after (buffer size: 10*poolSize) (limit headroom: 10, interval: 100ms)

2018/05/04 13:01:29 num connections 29
2018/05/04 13:01:29 num new connections 29
2018/05/04 13:01:29 num available 0
2018/05/04 13:01:29 num pings 35413
2018/05/04 13:01:29 mean ping duration 6ms
2018/05/04 13:01:30 num connections 39
2018/05/04 13:01:30 num new connections 10
2018/05/04 13:01:30 num available 0
2018/05/04 13:01:30 num pings 36249
2018/05/04 13:01:30 mean ping duration 6ms
2018/05/04 13:01:31 num connections 49
2018/05/04 13:01:31 num new connections 10
2018/05/04 13:01:31 num available 0
2018/05/04 13:01:31 num pings 32829
2018/05/04 13:01:31 mean ping duration 7ms
2018/05/04 13:01:32 num connections 59
2018/05/04 13:01:32 num new connections 10
2018/05/04 13:01:32 num available 0
2018/05/04 13:01:32 num pings 31136
2018/05/04 13:01:32 mean ping duration 7ms
2018/05/04 13:01:33 num connections 69
2018/05/04 13:01:33 num new connections 10
2018/05/04 13:01:33 num available 0
2018/05/04 13:01:33 num pings 30998
2018/05/04 13:01:33 mean ping duration 7ms
2018/05/04 13:01:34 num connections 79
2018/05/04 13:01:34 num new connections 10
2018/05/04 13:01:34 num available 0
2018/05/04 13:01:34 num pings 31285
2018/05/04 13:01:34 mean ping duration 7ms

@mediocregopher (Owner) left a comment


Some small changes in the main commit. Did you mean to include those other two commits that you'd already submitted PRs for?

pool/doc.go Outdated
// Rate limited pool
//
// A standard pool creates new connections on the fly when necessary. You can
// create a "limited" pool using NewLimited if you want to rate limit those new
@mediocregopher (Owner):

NewLimited is no longer a thing

pool/pool.go Outdated
func NewCustom(network, addr string, size int, df DialFunc) (*Pool, error) {
func NewCustom(network, addr string, size int, df DialFunc, os ...Opt) (*Pool, error) {
defaultPoolOpts := []Opt{
PingInterval(10 * time.Second / time.Duration(size)),
@mediocregopher (Owner):

a size of 0 must be supported

pool/pool.go Outdated
}

// additionally, if there are any connections in the reserve pool, they're closed
// periodically so that within 100 seconds a full reserve will be emptied
@mediocregopher (Owner):

this comment isn't quite right, no longer always 100 seconds

pool/pool.go Outdated
// remove one from the reservePool if there is any
select {
case conn := <-p.reservePool:
conn.Close()
@mediocregopher (Owner):

might as well try and put it in the regular pool if there's room there

@jameshartig (Author) commented May 9, 2018

Some small changes in the main commit. Did you mean to include those other two commits that you'd already submitted PRs for?

Yes, unfortunately I made this PR from master, so I needed to keep it up-to-date with those changes so that we could keep deploying my fork instead of this one for A/B testing.

When you merge those PRs, I'll rebase this one.

@jameshartig (Author)

I updated the docs file, added a test for the 0-sized pool (with a buffer), and added a test to ensure that reserve-evicted connections are added back to the main pool.

@mieczkowski

@mediocregopher Is there any chance of merging this soon? It would be helpful in our project and I don't want to change the library that we're using :)

@mediocregopher mediocregopher merged commit 364a4ea into mediocregopher:master May 19, 2018
@mediocregopher (Owner)

Sorry that took so long, @fastest963. I'm gonna just close those other PRs; it was easier to merge them all here.

@mieczkowski

@fastest963 @mediocregopher It would be nice to have a timeout in CreateLimit(headroom int, createInterval time.Duration, waitTimeout time.Duration). If the request timeout is, for example, 2 seconds, wait at most 2 seconds for a connection and otherwise return a timeout error. So in:

func (p *Pool) Get() (*redis.Client, error)

should be something like:

case <-p.limited:
    return p.df(p.Network, p.Addr)
case <-time.After(p.po.createWaitTimeout):
    return nil, errors.New("pool timeout")

What do you think about that? Or maybe context.Context support with a deadline/timeout? But that would require more changes to the code...

@jameshartig (Author)

@mediocregopher did something similar in v3, and I can add that as a separate option function, but it's probably better to instead add a GetContext method that accepts a context.

@mediocregopher (Owner)

Might as well add that third argument. I'd rather break the API now, when it's only been a few days since merging and probably not many people are using it, than add an ugly extra method.

I don't think messing around with contexts is a good idea, as nothing else in this project uses them.

@jameshartig (Author)

@mieczkowski @mediocregopher added a new option here: #88
