Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Rate limiting sampler #819

Merged
merged 16 commits into from
Dec 11, 2018
Merged

Adds Rate limiting sampler #819

merged 16 commits into from
Dec 11, 2018

Conversation

devinsba
Copy link
Member

@devinsba devinsba commented Oct 31, 2018

The rate-limited sampler allows you to choose an amount of traces to accept on a per-second interval. The minimum number is 0 and the max is 2,147,483,647 (max int).

For example, to allow 10 traces per second, you'd initialize the following:

tracingBuilder.sampler(RateLimitingSampler.create(10));

Appropriate Usage

If the rate is 10 or more traces per second, an attempt is made to distribute the accept decisions equally across the second. For example, if the rate is 100, 10 will pass every decisecond as opposed to bunching all pass decisions at the beginning of the second.

This sampler is efficient, but not as efficient as the BoundarySampler. However, this sampler is insensitive to the trace ID and will operate correctly even if they are not perfectly random.

Implementation

The implementation uses System.nanoTime() and tracks how many yes decisions occur across a second window. When the rate is at least 10/s, the yes decisions are equally split over 10 deciseconds, allowing a roll-over of unused yes decisions up until the end of the second.

Prior art

RateLimitingSampler was made to allow Amazon X-Ray rules to be expressed in Brave. We considered their Reservoir design. Our implementation differs as it removes a race condition and attempts to be more fair by distributing accept decisions every decisecond. If aws/aws-xray-sdk-java#47 is merged, we'll have the same implementation!

Thanks!

Thanks @devinsba for spiking the implementation and @anuraaga for suggesting how to be more fair when sampling large numbers of requests. Also appreciate review from @huydx and @zeagord

@codefromthecrypt
Copy link
Member

@anuraaga @trustin fun change to look at. We are trying to include a java 6 compat dependency free non-blocking rate limited sampler. For simplicity this would allow the bucket to empty immediately (ex if 1000 requests/second we don't partition it into 10millis buckets, rather one big bucket and so if all requests deplete in the first millis, so be it).

That said, what thoughts have you on the impl and where we go from here. This is one of the most requested features and is a gap as folks try to bridge from the AWS sdk which has a similar impl but using clock time instead.

@codefromthecrypt
Copy link
Member

@devinsba question around license headers.. yeah this can wait. we do package the LICENSE file since a long time.

Copy link
Contributor

@anuraaga anuraaga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd advocate for buckets, it does add some complexity but don't know if it's a huge amount. It feels weird to me that a periodic process that always runs towards the end of a second has basically no chance of ever being sampled.

brave/src/main/java/brave/sampler/RateLimitingSampler.java Outdated Show resolved Hide resolved
@codefromthecrypt
Copy link
Member

codefromthecrypt commented Nov 11, 2018 via email

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Nov 12, 2018

gonna pull this local and try it out

@codefromthecrypt
Copy link
Member

added a cleanup commit

@codefromthecrypt
Copy link
Member

here are bench results. It does take time to issue nanoTime, so some overhead is normal. Keep in mind this only applies to first span in a trace and we are talking about operations per microsecond

Benchmark                                         (traceId)   Mode  Cnt    Score    Error   Units
SamplerBenchmarks.sampler_boundary     -9223372036854775808  thrpt   15  512.167 ± 20.785  ops/us
SamplerBenchmarks.sampler_boundary      1234567890987654321  thrpt   15  515.468 ± 26.426  ops/us
SamplerBenchmarks.sampler_counting     -9223372036854775808  thrpt   15   25.422 ±  1.467  ops/us
SamplerBenchmarks.sampler_counting      1234567890987654321  thrpt   15   26.166 ±  1.278  ops/us
SamplerBenchmarks.sampler_rateLimited  -9223372036854775808  thrpt   15   14.433 ±  0.650  ops/us
SamplerBenchmarks.sampler_rateLimited   1234567890987654321  thrpt   15   14.650 ±  0.336  ops/us

@codefromthecrypt
Copy link
Member

I take back the advice for "I would say add decrementBy to brave.internal.Platform" it isn't needed.

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Nov 19, 2018

was thinking about the main problem with the "all samples at the beginning of the second" problem @devinsba @anuraaga. What if we setup some basic rules with a "rollover plan"

Ex.
if < 10/s allow all to happen at any time in the second
if <= 100/s bucket <=10/decisecond with a rollover.

ex rate is 100 and if no requests until the last decisecond, there are 100 available still, and 10 again after the next second turns. OTOH, if 13 requests happen in the first 2 deciseconds, the next decisecond has 17 max available.

then some function over that as the rate increases, either a higher magnitude per decisecond or smaller bucket than decisecond. Note: fixed buckets, especially few fixed buckets can allow for some interesting impls. Ex our counting sampler is cute due to known bounds https://github.com/openzipkin/brave/blob/master/brave/src/main/java/brave/sampler/CountingSampler.java

Thoughts?

@codefromthecrypt
Copy link
Member

one reminder on the bucket thing is that the minimum rate in my suggestion is 1/s. it is easier code and matches Amazon's ReservoirSize https://docs.aws.amazon.com/xray/latest/api/API_SamplingRule.html slow rates like per minute or hour is complicated and I don't think we want to handle that until/unless explicitly requested

@anuraaga
Copy link
Contributor

anuraaga commented Nov 19, 2018

That SGTM - not thinking about per minute / hour seems fine since they would rarely sample less than 100%.

@devinsba
Copy link
Member Author

If someone else has the time, energy, and know how to get this over the finish line I'd appreciate it. I've been sick and time has been tight.

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Nov 29, 2018 via email

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Dec 7, 2018 via email

@devinsba
Copy link
Member Author

devinsba commented Dec 7, 2018

To Note: I had intended to do the "replace the sampler" approach for the XRay wrapper of this

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Dec 7, 2018 via email

@huydx
Copy link
Member

huydx commented Dec 7, 2018

@adriancole replace the sampler itself sounds better, totally forgot about that. Anw the implementation looks good for me.

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Dec 7, 2018 via email

@codefromthecrypt
Copy link
Member

I think all this needs is javadoc polishing and readme. @anuraaga are you happy with impl?

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Dec 10, 2018

Here are benchmark results:

Benchmark                                                                                           (traceId)    Mode     Cnt     Score   Error   Units
SamplerBenchmarks.sampler_rateLimited_1                                                  -9223372036854775808  sample  779286     0.153 ± 0.001   us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.00                      -9223372036854775808  sample             0.031           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.50                      -9223372036854775808  sample             0.140           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.90                      -9223372036854775808  sample             0.151           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.95                      -9223372036854775808  sample             0.154           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.99                      -9223372036854775808  sample             0.197           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.999                     -9223372036854775808  sample             7.321           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.9999                    -9223372036854775808  sample            14.022           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p1.00                      -9223372036854775808  sample           101.120           us/op
SamplerBenchmarks.sampler_rateLimited_1gc.alloc.rate                                   -9223372036854775808  sample      15     0.224 ± 0.031  MB/sec
SamplerBenchmarks.sampler_rateLimited_1gc.alloc.rate.norm                              -9223372036854775808  sample      15     0.013 ± 0.002    B/op
SamplerBenchmarks.sampler_rateLimited_1gc.count                                        -9223372036854775808  sample      150          counts
SamplerBenchmarks.sampler_rateLimited_1                                                   1234567890987654321  sample  783515     0.149 ± 0.001   us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.00                       1234567890987654321  sample             0.032           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.50                       1234567890987654321  sample             0.140           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.90                       1234567890987654321  sample             0.150           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.95                       1234567890987654321  sample             0.154           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.99                       1234567890987654321  sample             0.186           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.999                      1234567890987654321  sample             3.119           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.9999                     1234567890987654321  sample            13.150           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p1.00                       1234567890987654321  sample           130.176           us/op
SamplerBenchmarks.sampler_rateLimited_1gc.alloc.rate                                    1234567890987654321  sample      15     0.220 ± 0.023  MB/sec
SamplerBenchmarks.sampler_rateLimited_1gc.alloc.rate.norm                               1234567890987654321  sample      15     0.013 ± 0.001    B/op
SamplerBenchmarks.sampler_rateLimited_1gc.count                                         1234567890987654321  sample      150          counts
SamplerBenchmarks.sampler_rateLimited_100                                                -9223372036854775808  sample  727518     0.153 ± 0.002   us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.00                  -9223372036854775808  sample             0.030           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.50                  -9223372036854775808  sample             0.140           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.90                  -9223372036854775808  sample             0.155           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.95                  -9223372036854775808  sample             0.165           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.99                  -9223372036854775808  sample             0.205           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.999                 -9223372036854775808  sample             8.560           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.9999                -9223372036854775808  sample            15.964           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p1.00                  -9223372036854775808  sample           110.848           us/op
SamplerBenchmarks.sampler_rateLimited_100gc.alloc.rate                                 -9223372036854775808  sample      15     0.240 ± 0.021  MB/sec
SamplerBenchmarks.sampler_rateLimited_100gc.alloc.rate.norm                            -9223372036854775808  sample      15     0.015 ± 0.001    B/op
SamplerBenchmarks.sampler_rateLimited_100gc.count                                      -9223372036854775808  sample      150          counts
SamplerBenchmarks.sampler_rateLimited_100                                                 1234567890987654321  sample  725738     0.147 ± 0.001   us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.00                   1234567890987654321  sample             0.029           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.50                   1234567890987654321  sample             0.139           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.90                   1234567890987654321  sample             0.153           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.95                   1234567890987654321  sample             0.166           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.99                   1234567890987654321  sample             0.196           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.999                  1234567890987654321  sample             5.924           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.9999                 1234567890987654321  sample            13.140           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p1.00                   1234567890987654321  sample            77.824           us/op
SamplerBenchmarks.sampler_rateLimited_100gc.alloc.rate                                  1234567890987654321  sample      15     0.229 ± 0.035  MB/sec
SamplerBenchmarks.sampler_rateLimited_100gc.alloc.rate.norm                             1234567890987654321  sample      15     0.015 ± 0.002    B/op
SamplerBenchmarks.sampler_rateLimited_100gc.count                                       1234567890987654321  sample      150          counts

@codefromthecrypt
Copy link
Member

I will repost benchmarks as I forgot to include gc data

@codefromthecrypt
Copy link
Member

ran benchmarks against another tracer which won't be mentioned.. our results are an order of magnitude more efficient in terms of allocation and runtime.

@codefromthecrypt
Copy link
Member

Polished up and ran against x-ray. Note their perf is expectedly a little quicker than ours at higher rates as we try to be fair. Also, they have a race condition resetting which we don't. I will raise a pull request to correct theirs.

SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.999                  1234567890987654321  sample             8.128           us/op
SamplerBenchmarks.sampler_rateLimited_1:·gc.alloc.rate.norm                           1234567890987654321  sample      15     0.014 ± 0.002    B/op

SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.999              1234567890987654321  sample             8.368           us/op
SamplerBenchmarks.sampler_rateLimited_100:·gc.alloc.rate.norm                         1234567890987654321  sample      15     0.015 ± 0.002    B/op

SamplerBenchmarks.sampler_rateLimited_1_xray:sampler_rateLimited_1_xray·p0.999       -9223372036854775808  sample             8.240           us/op
SamplerBenchmarks.sampler_rateLimited_1_xray:·gc.alloc.rate.norm                     -9223372036854775808  sample      15     0.008 ± 0.002    B/op

SamplerBenchmarks.sampler_rateLimited_100_xray:sampler_rateLimited_100_xray·p0.999   -9223372036854775808  sample             5.742           us/op
SamplerBenchmarks.sampler_rateLimited_100_xray:·gc.alloc.rate.norm                   -9223372036854775808  sample      15     0.007 ± 0.001    B/op

@codefromthecrypt codefromthecrypt changed the title WIP - Rate limiting sampler Adds Rate limiting sampler Dec 10, 2018
@codefromthecrypt
Copy link
Member

added aws/aws-xray-sdk-java#47 to give back to amazon

@devinsba
Copy link
Member Author

👍 Thanks for picking this up @adriancole, I'll see if I can find time to fix that XRay test today so your PR can get merged

@codefromthecrypt
Copy link
Member

I will look into the test failure in aws/aws-xray-sdk-java#47 before merge. thanks!

@codefromthecrypt
Copy link
Member

Amazon's test case found a legit glitch. Fixed in both places

@codefromthecrypt codefromthecrypt merged commit 3be55b5 into openzipkin:master Dec 11, 2018
haotianw465 pushed a commit to aws/aws-xray-sdk-java that referenced this pull request Dec 12, 2018
* Improves reservoir to avoid a race condition and to be more fair

This re-uses the implementation we just made in Brave based on feedback
on yours. Enjoy!

See openzipkin/brave#819

* Makes rate setting lazy
@codefromthecrypt
Copy link
Member

whoot xray now has this impl aws/aws-xray-sdk-java#47

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants