
Added ratelimit to handle throttling cache #8226

Merged

Conversation


@pierresouchay pierresouchay commented Jul 2, 2020

This implements a solution for #7863

It does:

  • Add a new config cache.entry_fetch_rate to limit the number of calls per second for a given cache entry (default value = 1)
  • Add cache.entry_fetch_max_burst, the burst size of the rate limit (default value = 2)

The new configuration supports the following syntax, for instance to allow 1 query every 3s:

  • command line HCL: -hcl 'cache = { entry_fetch_rate = 0.333}'
  • in JSON
{
  "cache": {
    "entry_fetch_rate": 0.333
  }
}

So, by default, the cache does not fetch updates faster than once per second, even if the service is modified more frequently than that. This default keeps the existing behavior mostly unchanged (in our case, we plan to use a value around 0.2, i.e. 1 update every 5s). The default burst of 2 ensures that on a stable service you still get updates quickly (for instance if you have a probe): if you register a new service to measure registration latency, you will see the service appear/disappear almost immediately with the default settings.
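[Editor's illustration] The following self-contained Go sketch shows the token-bucket behavior these two settings configure, using golang.org/x/time/rate directly with the defaults described above (rate 1/s, burst 2); the fetch loop is a hypothetical stand-in, not Consul's actual cache code:

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// One limiter per cache entry: 1 token refilled per second, bucket
	// size 2. This mirrors entry_fetch_rate = 1, entry_fetch_max_burst = 2.
	limiter := rate.NewLimiter(rate.Limit(1), 2)

	start := time.Now()
	for i := 0; i < 5; i++ {
		// Wait blocks until a token is available, throttling the fetch.
		if err := limiter.Wait(context.Background()); err != nil {
			panic(err)
		}
		fmt.Printf("fetch %d at +%v\n", i, time.Since(start).Round(100*time.Millisecond))
	}
	// The first two fetches run immediately (full bucket of 2), then
	// subsequent fetches are spaced roughly 1s apart.
}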

@pierresouchay pierresouchay force-pushed the rate_limit_for_cached_entries branch from e14ab2d to 87aa38f Compare July 3, 2020 13:40
@pierresouchay

@banks @i0rek Would this mechanism to throttle the cache suit you?

@dnephin dnephin left a comment


Thank you for implementing this!

I had a look over most of it. I left a few comments and questions.

I think it might be easier to review this as a smaller PR. What do you think about stashing most of the UX changes (config, and docs) in another branch and initially making this PR just the rate limiter and integration with the cache? That way we can focus the discussion on the core implementation, and once that is worked out we can figure out the UX.

Review comments (now resolved) on: agent/cache/cache.go, agent/config/runtime.go, lib/ratelimitspec.go, lib/ratelimitspec_test.go, agent/config/runtime_test.go, agent/cache/ratelimit.go
 * Renamed CacheRateLimitPerEntry into EntryFetchRateLimit
 * Renamed CacheConfiguration into Cache
 * Use tabs instead of spaces in runtime_test.go
Moved implementation away from cache as well

pierresouchay commented Jul 20, 2020

Hello @dnephin ,

I think it might be easier to review this as a smaller PR. What do you think about stashing most of the UX changes (config, and docs) in another branch and initially making this PR just the rate limiter and integration with the cache? That way we can focus the discussion on the core implementation, and once that is worked out we can figure out the UX.

Smaller PRs have drawbacks: we sometimes have to wait a long time before getting a review, and having 3 PRs instead of 1 is hard to follow on both sides (ours and yours). Furthermore, we also backport those changes to our own versions, and having separate PRs that we sometimes have to rebase makes it too hard for us. And finally, it is sometimes hard to deliver a feature partially implemented (what kind of split do you want? A first PR with a hardcoded limit?)

In any case, I am trying to provide PRs with clear separation in commits that are atomic, do work, compile, and pass tests.

@pierresouchay pierresouchay force-pushed the rate_limit_for_cached_entries branch 4 times, most recently from 6f3e168 to daf5fe0 Compare July 22, 2020 12:18
@pierresouchay

Hello @dnephin & @i0rek,

Using Golang's built-in rate limiter (x/time/rate) reduces the size of the PR quite a lot, so I simplified everything: there are fewer data structures, and I am using the same semantics as the current rate limiter for RPC.

I also added an end-to-end test that validates the behavior in both directions (i.e. that there is both a minimum and a maximum number of updates). The test was tricky to implement because of timing issues, but I added enough tolerance for it to work reliably on my machine with concurrency.

@pierresouchay pierresouchay force-pushed the rate_limit_for_cached_entries branch 2 times, most recently from 39723a5 to 75855a0 Compare July 22, 2020 15:31
This unit test validates that the rate limit works effectively with different values.

@dnephin dnephin left a comment


Thanks! This is looking good with x/time/rate.

Left some more suggestions for testing, and naming.

test.Run(fmt.Sprintf("rate_limit_at_%v", currentTest.rateLimit), func(t *testing.T) {
	tt := currentTest
	t.Parallel()
	a := NewTestAgent(t, fmt.Sprintf("cache = { rate_limit_per_entry = %v, rate_limit_burst = 1 }", tt.rateLimit))
Contributor


Do we need a full agent to test this change?

It seems like the change is fairly localized to the agent/cache package, so I would expect the test to be in that same package. If we use a fake cache.Type implementation, the test should read better, as there will be fewer things that are not related to the behaviour we are trying to test. A fake implementation could report the number of times it was called, which should make it easy to test that refresh is blocked by the limit.

That should also make it easier to show that different keys do not share a rate limit.
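[Editor's illustration] A hedged sketch of the kind of test being suggested here; it does not use Consul's real cache.Type interface — countingFetcher and the limiter wiring are hypothetical stand-ins that merely show how a call-counting fake can prove fetches are throttled:

package cachetest

import (
	"context"
	"sync/atomic"
	"testing"
	"time"

	"golang.org/x/time/rate"
)

// countingFetcher is a hypothetical fake backend: instead of doing a
// real fetch, it just records how many times it was called.
type countingFetcher struct {
	calls int64
}

func (f *countingFetcher) Fetch() {
	atomic.AddInt64(&f.calls, 1)
}

func TestFetchIsRateLimited(t *testing.T) {
	fetcher := &countingFetcher{}
	// 10 tokens/s with a burst of 1: roughly 1 + 10*elapsed fetches allowed.
	limiter := rate.NewLimiter(rate.Limit(10), 1)

	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	for {
		// Wait returns an error once the context deadline makes further
		// waiting impossible, which ends the loop.
		if err := limiter.Wait(ctx); err != nil {
			break
		}
		fetcher.Fetch()
	}

	// About 6 fetches are expected in 500ms; allow some slack for timing.
	if got := atomic.LoadInt64(&fetcher.calls); got > 8 {
		t.Fatalf("expected fetches to be rate limited, got %d calls", got)
	}
}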

Contributor Author


I think it is the best proof that it works end-to-end. What would you suggest exactly?

I am not very familiar with mocks (and to be fair, not always a big fan), but if you point me to an example of what you have in mind...

Review comments (now resolved) on: agent/agent_test.go, agent/cache/cache.go
Comment on lines +127 to +128
// EntryFetchMaxBurst max burst size of RateLimit for a single cache entry
EntryFetchMaxBurst int
Contributor


According to https://godoc.org/golang.org/x/time/rate#Limiter.Burst it's the number of tokens per Wait call, which I believe is also going to be 1. Do we need to make this configurable? Or can we remove it and always use 1?

Contributor Author


Yes, the burst size is important: it allows the rate limit to be exceeded for up to burst-size fetches (default 2) while still respecting the rate limit globally. In our case, we will probably configure it with a value around 4 or 5 and a rate of 0.1 qps: you get quick updates on a stable service, but when too many updates are firing on a service, you get rate-limited to 1 query every 10s. I use the same configuration as the rate limit you already implemented for RPC, which includes a burst size as well.

Contributor


I think either I have misunderstood the docs I linked, or the docs are misleading, or it doesn't work quite that way. The docs suggest this only impacts Wait(), and we always use Wait() with a value of 1, because we only ever fetch an entry one at a time. I'll need to read the code to better understand how burst works.

Contributor Author

@pierresouchay pierresouchay Jul 22, 2020


@dnephin This is exactly how it works for RPC Max Rate limit as well: https://github.com/hashicorp/consul/blob/master/agent/agent.go#L1404

Wait(10) is equivalent to calling Wait(1) 10 times atomically. Thus, it represents the burst that can be applied when the RPC calls start being rate-limited.
Imagine the following scenario with cache.entry_fetch_rate_limit = 0.2 (i.e. 1 RPC every 5s), so every 5s a token gets added to the bucket:

  1. you watch a service without any change for 1 min, so the bucket fills up (i.e. you have n tokens available)
  2. for some reason, the service becomes unstable at t1 and gets updated every 1 second
Time | X-Consul-Index on server | idx with burst_size = 1 | idx with burst_size = 5
   0 | 42                       | tokens=1, 42            | tokens=5, 42
   1 | 43                       | tokens=0, 43            | tokens=4, 43
   2 | 44                       | tokens=0, 43            | tokens=3, 44
   3 | 45                       | tokens=0, 43            | tokens=2, 45
   4 | 46                       | tokens=0, 43            | tokens=1, 46
   5 | 47                       | tokens=1, 47            | tokens=1, 47
   6 | 48                       | tokens=0, 47            | tokens=0, 47
   7 | 49                       | tokens=0, 47            | tokens=0, 47
   8 | 50                       | tokens=0, 47            | tokens=0, 47
   9 | 51                       | tokens=1, 51            | tokens=1, 51
  10 | 52                       | tokens=0, 51            | tokens=0, 51
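[Editor's illustration] This hand-worked table can be checked deterministically with x/time/rate, because AllowN accepts an explicit timestamp. The sketch below is illustrative only: the 0.2 rate and the two burst sizes come from the scenario above, and the exact fetch timestamps may differ by a step from the rounding in the table:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	base := time.Now()
	for _, burst := range []int{1, 5} {
		// 0.2 tokens/s = one token every 5s; bucket size = burst.
		lim := rate.NewLimiter(rate.Limit(0.2), burst)
		fmt.Printf("burst_size=%d:", burst)
		for sec := 0; sec <= 10; sec++ {
			// The service updates every second; AllowN at a simulated
			// timestamp tells us whether that update would be fetched.
			ts := base.Add(time.Duration(sec) * time.Second)
			if lim.AllowN(ts, 1) {
				fmt.Printf(" t=%d:fetch", sec)
			} else {
				fmt.Printf(" t=%d:skip", sec)
			}
		}
		fmt.Println()
	}
}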

Member


I looked into this as well and was confused at first, because it seemed to me that burst only determines how many tokens can be taken by WaitN (which we don't use), but there is more:

https://godoc.org/golang.org/x/time/rate#Limiter:

It implements a "token bucket" of size b, [...] with a maximum burst size of b events.

Being able to configure a bigger token bucket is important in order to allow multiple Wait calls in quick succession (what @pierresouchay means by burst, not what the library refers to as burst when using WaitN or the other ...N functions).

Contributor


Thank you both for the clarification!

It sounds like this setting is actually the bucket size. I wonder if that would have been a better name for the config field. I guess since we already call this burst elsewhere in the config, we probably have to keep it burst here as well.

Contributor Author

@pierresouchay pierresouchay Jul 23, 2020


To me, burst makes sense from a functional point of view: the operator wants to allow a burst of requests until the rate limit is reached. Bucket is more about the technical implementation. But as you highlighted, it is already used that way for RPC anyway, so I think it does not make sense to change it.

Review comments (now resolved) on: agent/cache/cache.go, agent/config/config.go, agent/config/runtime.go, website/pages/docs/agent/options.mdx, agent/config/builder.go

dnephin commented Jul 22, 2020

Smaller PRs have drawbacks

There are always trade-offs! Larger PRs can take longer to review and require more back and forth, but it is true that if you have to wait a while for an initial review, maybe it is easier to do more at once. Hopefully the CI automation makes it easier to split things up a bit more and get them all cherry-picked into the release branches.

it is sometimes hard to provide a feature partially implemented

That is very true. It's hard to say exactly where the right line is. Sometimes what I try to do is open one small PR with the core part of the change, and then a second (or sometimes more) draft PRs that help demonstrate how that core feature might be exposed to the user, or integrated in other places. The draft PRs don't need to be totally ready for review, so that can save some time up front. But maybe that approach won't always work.

In any case, I am trying to provide PRs with clear separation in commits that are atomic, do work, compile, and pass tests.

Thank you! We really appreciate your PRs!

 * Renamed rate_limit_per_entry into entry_fetch_rate_limit
 * Renamed rate_limit_burst into entry_fetch_max_burst
 * Multi-lines for RateLimiter: rate.NewLimiter(...) to avoid too long lines

@hanshasselberg hanshasselberg left a comment


Left a couple of comments and clarifications.


Review comments on: agent/cache/cache.go, agent/config/builder.go, agent/config/config.go, agent/agent_test.go

@hanshasselberg hanshasselberg left a comment


Found one last little thing: the docs.

- `cache` Cache configuration of agent. The configurable values are the following:

- `entry_fetch_max_burst`: The size of the token bucket used to recharge the rate-limit per
cache entry. The default value is 2 and means that when cache has not been updated
Member


The default is now rate.Inf.

Contributor Author

@pierresouchay pierresouchay Jul 27, 2020


@i0rek The cache.entry_fetch_max_burst is actually 2 by default (even if it has no real meaning until cache.entry_fetch_rate is set to something other than +Inf).
I propose to add: This value has no real meaning unless entry_fetch_rate is also set.
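[Editor's illustration] For instance, an agent config combining both settings so the burst actually has an effect (the values are illustrative, taken from the numbers discussed earlier in this thread) might look like:

{
  "cache": {
    "entry_fetch_rate": 0.2,
    "entry_fetch_max_burst": 5
  }
}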

Member


I am sorry, I looked at the wrong thing!


@hanshasselberg hanshasselberg left a comment


LGTM!


@hanshasselberg hanshasselberg merged commit 505de6d into hashicorp:master Jul 27, 2020
@hashicorp-ci

🍒✅ Cherry pick of commit 505de6d onto release/1.8.x succeeded!

hashicorp-ci pushed a commit that referenced this pull request Jul 27, 2020
This implements a solution for #7863

It does:

    Add a new config cache.entry_fetch_rate to limit the number of calls/s for a given cache entry, default value = rate.Inf
    Add cache.entry_fetch_max_burst size of rate limit (default value = 2)

The new configuration supports the following syntax, for instance to allow 1 query every 3s:

    command line HCL: -hcl 'cache = { entry_fetch_rate = 0.333}'
    in JSON

{
  "cache": {
    "entry_fetch_rate": 0.333
  }
}
@ltagliamonte-dd

@i0rek @dnephin It would be great to have a 1.7.x release with this change as well. Thanks!
