Added rate limit to handle cache throttling #8226
Conversation
@banks @i0rek Would this mechanism to throttle the cache suit you?
Thank you for implementing this!
I had a look over most of it. I left a few comments and questions.
I think it might be easier to review this as a smaller PR. What do you think about stashing most of the UX changes (config, and docs) in another branch and initially making this PR just the rate limiter and integration with the cache? That way we can focus the discussion on the core implementation, and once that is worked out we can figure out the UX.
* Renamed CacheRateLimitPerEntry into EntryFetchRateLimit
* Renamed CacheConfiguration into Cache
* Use tabs instead of spaces in runtime_test.go
Moved implementation away from cache as well
Hello @dnephin,
Smaller PRs have drawbacks: we sometimes have to wait a long time before getting a review, and having 3 PRs instead of 1 is harder to follow on both sides (ours and yours). Furthermore, we also backport those changes onto your release versions, and having to rebase several separate PRs makes that too hard for us. Finally, it is sometimes hard to deliver a feature that is only partially implemented (what kind of split do you want? A first PR with a hardcoded limit?). In any case, I try to provide PRs with a clear separation into commits that are atomic, work, compile, and pass tests.
Hello @dnephin & @i0rek, using Golang's built-in rate limiter (x/time/rate) reduces the size of the PR quite a lot, so I simplified everything: there are fewer data structures and I am using the same semantics as the current RPC rate limiter. I also added an end-to-end test that validates the behavior in both directions (i.e. a minimum and a maximum number of updates). The test was tricky to implement because of timing issues, but I added enough tolerance for it to work reliably on my machine with concurrency.
This unit test validates that the rate limit works effectively with different values.
Thanks! This is looking good with x/time/rate
Left some more suggestions for testing, and naming.
agent/agent_test.go (Outdated)
test.Run(fmt.Sprintf("rate_limit_at_%v", currentTest.rateLimit), func(t *testing.T) {
    tt := currentTest
    t.Parallel()
    a := NewTestAgent(t, fmt.Sprintf("cache = { rate_limit_per_entry = %v, rate_limit_burst = 1 }", tt.rateLimit))
Do we need a full agent to test this change? It seems like the change is fairly localized to the agent/cache package, so I would expect the test to be in that same package. If we use a fake cache.Type implementation, the test should read better as there will be fewer things that are not related to the behaviour we are trying to test. A fake implementation could report on the number of times it was called, which should make it easy to test that refresh is blocked on a limit.
That should also make it easier to show that different keys do not share a rate limit.
I think it is the best proof that it works end-to-end. What would you suggest exactly?
I am not very familiar with mocks (and to be fair, not always a big fan), but if you point me to an example of what you have in mind...
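For illustration only, here is a rough sketch of the kind of counting fake described above. The names (`countingFetcher`, its `Fetch` signature) are hypothetical and do not match Consul's actual `cache.Type` interface; the point is just that a call counter makes it easy to assert a limiter blocked refreshes:

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"

	"golang.org/x/time/rate"
)

// countingFetcher is a hypothetical stand-in for a cache backend: it only
// counts how many times Fetch was invoked, so a test can assert that the
// rate limiter actually blocked refreshes.
type countingFetcher struct {
	calls int64
}

func (f *countingFetcher) Fetch(ctx context.Context, key string) error {
	atomic.AddInt64(&f.calls, 1)
	return nil
}

func main() {
	fetcher := &countingFetcher{}
	// 2 fetches/s with a bucket of 1 token: roughly what the PR's
	// entry_fetch_rate / entry_fetch_max_burst settings control per entry.
	limiter := rate.NewLimiter(rate.Limit(2), 1)

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Try to refresh the same entry as fast as possible for ~1 second.
	for {
		if err := limiter.Wait(ctx); err != nil {
			break // deadline reached: no more tokens will be granted in time
		}
		_ = fetcher.Fetch(ctx, "service/web")
	}

	// With a 2/s limit we expect on the order of 2-3 calls, not hundreds.
	fmt.Println("fetch calls:", atomic.LoadInt64(&fetcher.calls))
}
```

A real test in the agent/cache package would instead register such a fake as a cache type and assert on the counter per key, but the mechanics are the same.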
// EntryFetchMaxBurst max burst size of RateLimit for a single cache entry
EntryFetchMaxBurst int
According to https://godoc.org/golang.org/x/time/rate#Limiter.Burst it's the number of tokens per Wait call, which I believe is also going to be 1. Do we need to make this configurable? Or can we remove it and always use 1?
Yes, the burst size is important: it allows the rate limit to be exceeded for up to burst-size fetches (defaults to 2), while globally respecting the rate limit. In our case, we will probably configure it with a value around 4 or 5 and 0.1 qps => you can have quick updates on a stable service, but when too many updates are firing on a service, you get rate-limited to 1 query every 10s. I use the same configuration as the rate limit you already implemented for RPC, which includes a burst size as well.
I think either I have misunderstood the docs I linked, or the docs are misleading, or it doesn't work quite that way. The docs suggest this only impacts Wait(), and we always use Wait() with a value of 1, because we only ever fetch an entry one at a time. I'll need to read the code to better understand how burst works.
@dnephin This is exactly how it works for the RPC max rate limit as well: https://github.com/hashicorp/consul/blob/master/agent/agent.go#L1404
Wait(10) is equivalent to calling Wait(1) 10 times atomically. Thus, it represents the burst that can be applied when the RPC calls start being rate-limited.
Imagine the following scenario with cache.entry_fetch_rate_limit = 0.2 (aka 1 RPC every 5s), so every 5s a token gets added to the bucket:
- imagine you watch a service without any change for 1 min, so the bucket gets full (aka you have n tokens available)
- for some reason, the service becomes unstable at t1 and gets updated every second
Time | X-Consul-Index on server | idx with burst_size = 1 | idx with burst_size = 5 |
---|---|---|---|
0 | 42 | tokens=1, 42 | tokens=5, 42 |
1 | 43 | tokens=0, 43 | tokens=4, 43 |
2 | 44 | tokens=0, 43 | tokens=3, 44 |
3 | 45 | tokens=0, 43 | tokens=2, 45 |
4 | 46 | tokens=0, 43 | tokens=1, 46 |
5 | 47 | tokens=1, 47 | tokens=1, 47 |
6 | 48 | tokens=0, 47 | tokens=0, 47 |
7 | 49 | tokens=0, 47 | tokens=0, 47 |
8 | 50 | tokens=0, 47 | tokens=0, 47 |
9 | 51 | tokens=1, 51 | tokens=1, 51 |
10 | 52 | tokens=0, 51 | tokens=0, 51 |
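For reference, a small standalone sketch (not part of the PR) that reproduces this kind of timeline with golang.org/x/time/rate; it uses AllowN with simulated timestamps, so the effect of the two bucket sizes can be compared without waiting in real time:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// simulate prints, second by second, whether a fetch would be allowed for a
// limiter refilling at 0.2 tokens/s (one token every 5s) with the given bucket size.
func simulate(burst int) {
	lim := rate.NewLimiter(rate.Limit(0.2), burst)
	start := time.Now()
	fmt.Printf("bucket size %d:\n", burst)
	for sec := 0; sec <= 10; sec++ {
		now := start.Add(time.Duration(sec) * time.Second)
		// AllowN with a simulated timestamp: true means a fetch goes through at that instant.
		fmt.Printf("  t=%2ds allowed=%v\n", sec, lim.AllowN(now, 1))
	}
}

func main() {
	simulate(1) // one fetch at t=0, then roughly one every 5s
	simulate(5) // the first few updates go through back to back, then roughly one every 5s
}
```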
I looked into this as well and was confused at first, because it seemed to me that burst only determines how many tokens can be taken by WaitN (which we don't use), but there is more. From https://godoc.org/golang.org/x/time/rate#Limiter:
> It implements a "token bucket" of size b, [...] with a maximum burst size of b events.
Being able to configure a bigger token bucket is important in order to allow multiple Wait calls in quick succession (what @pierresouchay means by burst, not what the library refers to as burst when using WaitN or other ...N functions).
Thank you both for the clarification!
It sounds like this setting is actually the bucket size. I wonder if that would have been a better name for the config field. I guess since we already call this burst elsewhere in the config, we probably have to keep it burst here as well.
To me, burst makes sense from a functional point of view: the operator wants to allow a burst of data until the rate limit is reached. Bucket is more about the technical implementation. But as you highlighted, it is already used that way for RPC anyway, so I think it does not make sense to change it.
There are always trade-offs! Larger PRs can take longer to review and require more back and forth, but it is true that if you have to wait a while for an initial review, maybe it is easier to do more at once. Hopefully the ci automation makes it easier to split things up a bit more and get them all cherry-picked into the release branches.
That is very true. It's hard to say exactly where the right line is. Sometimes what I try to do is open one small PR with the core part of the change, and then a second (or sometimes more) draft PRs that help demonstrate how that core feature might be exposed to the user, or integrated in other places. The draft PR doesn't need to be totally ready for review, so it can save some time up front. But maybe that approach won't always work.
Thank you! We really appreciate your PRs!
* Renamed rate_limit_per_entry into entry_fetch_rate_limit
* Renamed rate_limit_burst into entry_fetch_max_burst
* Multi-lines for RateLimiter: rate.NewLimiter(...) to avoid too long lines
… a structure in RuntimeConfig
added comment for panic()
Left a couple of comments and clarifications.
Found a last little thing - the docs.
- `cache` Cache configuration of agent. The configurable values are the following:
  - `entry_fetch_max_burst`: The size of the token bucket used to recharge the rate-limit per
    cache entry. The default value is 2 and means that when cache has not been updated
The default is now rate.Inf.
@i0rek The cache.entry_fetch_max_burst is actually 2 by default (even if it has no real meaning until cache.entry_fetch_rate is set to something other than +Inf...).
I propose to add: "This value has no real meaning unless entry_fetch_rate is also set."
I am sorry, I looked at the wrong thing!
LGTM!
🍒✅ Cherry pick of commit 505de6d onto
This implements a solution for #7863. It does:
- Add a new config `cache.entry_fetch_rate` to limit the number of calls/s for a given cache entry, default value = rate.Inf
- Add `cache.entry_fetch_max_burst`: size of rate limit (default value = 2)

The new configuration now supports the following syntax, for instance to allow 1 query every 3s:
- command line HCL: `-hcl 'cache = { entry_fetch_rate = 0.333}'`
- in JSON: `{ "cache": { "entry_fetch_rate": 0.333 } }`
@i0rek @dnephin it would be great to have a 1.7.X release with this change as well. Thanks
This implements a solution for #7863

It does:
- Add a new config `cache.entry_fetch_rate` to limit the number of calls/s for a given cache entry, default value = 1

The new configuration now supports the following syntax, for instance to allow 1 query every 3s:
- command line HCL: `-hcl 'cache = { entry_fetch_rate = 0.333}'`

So, by default, the cache does not get updates faster than 1/s, even if the service is modified more frequently than that. Having such a value ensures it will not modify the default behavior too much (in our case, we plan to use a value around 0.2, aka 1 modification every 5s). Having a default burst of 2 ensures that on a stable service you still get updates quickly (for instance if you have a probe): if you register a new service to measure the latency at registration, you will see the service appear/disappear almost immediately with default settings.
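For completeness, the kind of agent configuration described above (roughly one fetch every 5s per entry, with a small burst) might look like this in HCL; the values below are illustrative, only the key names come from the PR:

```hcl
cache = {
  # ~1 fetch every 5s per cache entry once the bucket is drained (illustrative value)
  entry_fetch_rate = 0.2
  # allow a few quick consecutive updates before the limit kicks in (illustrative value)
  entry_fetch_max_burst = 4
}
```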