Adds Rate limiting sampler #819

Merged
merged 16 commits on Dec 11, 2018

6 participants
@devinsba
Member

commented Oct 31, 2018

The rate-limited sampler lets you choose how many traces to accept per second. The minimum is 0 and the maximum is 2,147,483,647 (the largest Java int).

For example, to allow 10 traces per second, you'd initialize the following:

tracingBuilder.sampler(RateLimitingSampler.create(10));

Appropriate Usage

If the rate is 10 or more traces per second, an attempt is made to distribute the accept decisions equally across the second. For example, if the rate is 100, 10 will pass every decisecond as opposed to bunching all pass decisions at the beginning of the second.

This sampler is efficient, but not as efficient as the BoundarySampler. However, this sampler is insensitive to the trace ID and will operate correctly even if they are not perfectly random.

Implementation

The implementation uses System.nanoTime() and tracks how many yes decisions occur within a one-second window. When the rate is at least 10/s, the yes decisions are split equally over 10 deciseconds, and unused yes decisions roll over until the end of the second.
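To make the description above concrete, here is a minimal sketch of a limiter that paces yes decisions per decisecond and rolls unused ones over. It is not the class merged in this pull request: the name `DecisecondLimiterSketch`, the injected `LongSupplier` clock (which needs Java 8), and the choice to place the remainder of `rate % 10` in the first decisecond are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a decisecond-paced limiter; not the class merged here. */
final class DecisecondLimiterSketch {
  static final long NANOS_PER_SECOND = 1_000_000_000L;
  static final long NANOS_PER_DECISECOND = NANOS_PER_SECOND / 10;

  final int tracesPerSecond;
  final LongSupplier nanoClock; // assumption: injected so a fake clock can drive it
  final AtomicInteger usage = new AtomicInteger();
  final AtomicLong nextReset;

  DecisecondLimiterSketch(int tracesPerSecond, LongSupplier nanoClock) {
    if (tracesPerSecond < 0) throw new IllegalArgumentException("tracesPerSecond < 0");
    this.tracesPerSecond = tracesPerSecond;
    this.nanoClock = nanoClock;
    this.nextReset = new AtomicLong(nanoClock.getAsLong() + NANOS_PER_SECOND);
  }

  /** Cumulative yes decisions allowed by the end of the given decisecond (0..9). */
  int maxAt(int decisecond) {
    if (tracesPerSecond < 10) return tracesPerSecond; // too few to split evenly
    int perDecisecond = tracesPerSecond / 10, remainder = tracesPerSecond % 10;
    return remainder + perDecisecond * (decisecond + 1); // unused budget rolls over
  }

  boolean isSampled() {
    while (true) {
      long now = nanoClock.getAsLong(), resetAt = nextReset.get();
      long nanosUntilReset = resetAt - now; // compare by difference: nanoTime may be negative
      if (nanosUntilReset <= 0) {
        // The window ended: whichever thread wins the CAS starts the next one.
        if (nextReset.compareAndSet(resetAt, now + NANOS_PER_SECOND)) usage.set(0);
        continue; // re-read the clock and the new window
      }
      long elapsed = NANOS_PER_SECOND - nanosUntilReset;
      int decisecond = (int) Math.max(0, Math.min(9, elapsed / NANOS_PER_DECISECOND));
      int max = maxAt(decisecond);
      int prev, next;
      do { // lock-free increment that never exceeds max
        prev = usage.get();
        next = prev + 1;
        if (next > max) return false;
      } while (!usage.compareAndSet(prev, next));
      return true;
    }
  }

  public static void main(String[] args) {
    AtomicLong fakeNanos = new AtomicLong(); // fake clock, starts at 0
    DecisecondLimiterSketch limiter = new DecisecondLimiterSketch(100, fakeNanos::get);
    int accepted = 0;
    for (int i = 0; i < 25; i++) if (limiter.isSampled()) accepted++;
    System.out.println("accepted in first decisecond: " + accepted);
    fakeNanos.set(950_000_000L); // jump to the last decisecond of the same window
    for (int i = 0; i < 200; i++) if (limiter.isSampled()) accepted++;
    System.out.println("accepted in the whole second: " + accepted);
  }
}
```

With the fake clock, a burst of 25 calls at the start of the window yields 10 yes decisions, and the remaining 90 become available by the last decisecond, for 100 in total. In real use the clock would be `System::nanoTime`.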

Prior art

RateLimitingSampler was made to allow Amazon X-Ray rules to be expressed in Brave. We considered their Reservoir design. Our implementation differs as it removes a race condition and attempts to be more fair by distributing accept decisions every decisecond. If aws/aws-xray-sdk-java#47 is merged, we'll have the same implementation!

Thanks!

Thanks @devinsba for spiking the implementation and @anuraaga for suggesting how to be more fair when sampling large numbers of requests. Also appreciate review from @huydx and @zeagord

@adriancole

Contributor

commented Nov 4, 2018

@anuraaga @trustin fun change to look at. We are trying to include a Java 6 compatible, dependency-free, non-blocking rate-limited sampler. For simplicity, this would allow the bucket to empty immediately (e.g. at 1000 requests/second we don't partition it into 10 ms buckets, but use one big bucket, so if all requests deplete it in the first milliseconds, so be it).

That said, what are your thoughts on the impl and where we go from here? This is one of the most requested features, and it is a gap as folks try to bridge from the AWS SDK, which has a similar impl but uses clock time instead.
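For readers following along, the "one big bucket" idea could look roughly like the sketch below. It is an illustration under assumptions, not code from this pull request: `SingleBucketLimiterSketch` is a hypothetical name, and the injected `LongSupplier` clock is there only to make the behavior testable (it requires Java 8, so the sketch itself is not Java 6 compatible).

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the "one big bucket" idea; not code from this pull request. */
final class SingleBucketLimiterSketch {
  static final long NANOS_PER_SECOND = 1_000_000_000L;

  final int tracesPerSecond;
  final LongSupplier nanoClock; // assumption: injectable clock, e.g. System::nanoTime
  final AtomicInteger usage = new AtomicInteger();
  final AtomicLong nextReset;

  SingleBucketLimiterSketch(int tracesPerSecond, LongSupplier nanoClock) {
    this.tracesPerSecond = tracesPerSecond;
    this.nanoClock = nanoClock;
    this.nextReset = new AtomicLong(nanoClock.getAsLong() + NANOS_PER_SECOND);
  }

  boolean isSampled() {
    long now = nanoClock.getAsLong(), resetAt = nextReset.get();
    // Once the second is over, the thread that wins the CAS refills the bucket.
    if (resetAt - now <= 0 && nextReset.compareAndSet(resetAt, now + NANOS_PER_SECOND)) {
      usage.set(0);
    }
    int prev;
    do { // take one token without blocking; fail once the bucket is empty
      prev = usage.get();
      if (prev >= tracesPerSecond) return false;
    } while (!usage.compareAndSet(prev, prev + 1));
    return true;
  }

  public static void main(String[] args) {
    SingleBucketLimiterSketch limiter = new SingleBucketLimiterSketch(3, System::nanoTime);
    for (int i = 0; i < 5; i++) {
      System.out.println("request " + i + " sampled: " + limiter.isSampled());
    }
  }
}
```

The trade-off is visible in the structure: nothing stops an early burst from taking every token, so a request arriving late in the same second is always rejected.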

@adriancole

Contributor

commented Nov 4, 2018

@devinsba question around license headers.. yeah this can wait. We have packaged the LICENSE file for a long time.

@anuraaga
Contributor

left a comment

I'd advocate for buckets. It does add some complexity, but I don't know if it's a huge amount. It feels weird to me that a periodic process that always runs towards the end of a second has basically no chance of ever being sampled.

@adriancole

Contributor

commented Nov 11, 2018

@adriancole

Contributor

commented Nov 12, 2018

gonna pull this local and try it out

@adriancole adriancole referenced this pull request Nov 12, 2018

Open

non-blocking example #1

@adriancole

Contributor

commented Nov 12, 2018

added a cleanup commit

@adriancole

Contributor

commented Nov 12, 2018

here are bench results. It does take time to issue nanoTime, so some overhead is normal. Keep in mind this only applies to the first span in a trace, and we are talking about operations per microsecond

Benchmark                                         (traceId)   Mode  Cnt    Score    Error   Units
SamplerBenchmarks.sampler_boundary     -9223372036854775808  thrpt   15  512.167 ± 20.785  ops/us
SamplerBenchmarks.sampler_boundary      1234567890987654321  thrpt   15  515.468 ± 26.426  ops/us
SamplerBenchmarks.sampler_counting     -9223372036854775808  thrpt   15   25.422 ±  1.467  ops/us
SamplerBenchmarks.sampler_counting      1234567890987654321  thrpt   15   26.166 ±  1.278  ops/us
SamplerBenchmarks.sampler_rateLimited  -9223372036854775808  thrpt   15   14.433 ±  0.650  ops/us
SamplerBenchmarks.sampler_rateLimited   1234567890987654321  thrpt   15   14.650 ±  0.336  ops/us
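As a reading aid, throughput in ops/us can be turned into an average cost per call by taking the reciprocal. The small helper below (the class name `ThroughputToLatency` is made up for this note) does that arithmetic for the first row of each benchmark above; it yields roughly 2 ns for boundary, 39 ns for counting, and 69 ns for rateLimited.

```java
/** Hypothetical helper: converts JMH throughput scores into average time per call. */
final class ThroughputToLatency {
  /** Converts a score in operations per microsecond into nanoseconds per operation. */
  static double nanosPerOp(double opsPerMicrosecond) {
    return 1_000.0 / opsPerMicrosecond; // 1 microsecond = 1000 nanoseconds
  }

  public static void main(String[] args) {
    // Scores copied from the first row of each benchmark in the table above.
    System.out.printf("boundary:    %.1f ns/op%n", nanosPerOp(512.167));
    System.out.printf("counting:    %.1f ns/op%n", nanosPerOp(25.422));
    System.out.printf("rateLimited: %.1f ns/op%n", nanosPerOp(14.433));
  }
}
```

These are averages derived from throughput, so they say nothing about tail latency.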
@adriancole

Contributor

commented Nov 12, 2018

I take back the advice "I would say add decrementBy to brave.internal.Platform"; it isn't needed.

@adriancole

Contributor

commented Nov 19, 2018

I was thinking about the "all samples at the beginning of the second" problem @devinsba @anuraaga. What if we set up some basic rules with a "rollover plan"?

Ex.
if < 10/s, allow all to happen at any time in the second
if <= 100/s, bucket <= 10/decisecond with a rollover.

Ex. the rate is 100: if there are no requests until the last decisecond, 100 are still available, and 10 again after the next second turns. OTOH, if 13 requests happen in the first 2 deciseconds, the next decisecond has at most 17 available.

then some function over that as the rate increases, either a higher magnitude per decisecond or smaller bucket than decisecond. Note: fixed buckets, especially few fixed buckets can allow for some interesting impls. Ex our counting sampler is cute due to known bounds https://github.com/openzipkin/brave/blob/master/brave/src/main/java/brave/sampler/CountingSampler.java

Thoughts?
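The rollover arithmetic in the examples above can be checked with a few lines of Java. This is only a sketch of the proposed rules: `RolloverMath` and `available` are hypothetical names, and putting the remainder of `rate % 10` into the first decisecond is an assumption the comment does not spell out.

```java
/** Hypothetical helper that works through the rollover examples in the comment above. */
final class RolloverMath {
  /** Yes decisions still available in the given decisecond (0..9) of the current second. */
  static int available(int rate, int decisecond, int usedSoFar) {
    int cumulative = rate < 10
        ? rate // under 10/s: everything may happen at any time in the second
        : rate % 10 + (rate / 10) * (decisecond + 1); // budget accumulates per decisecond
    return Math.max(0, cumulative - usedSoFar);
  }

  public static void main(String[] args) {
    // No requests until the last decisecond: the whole budget is still there.
    System.out.println(available(100, 9, 0));
    // 13 requests used in the first two deciseconds, now in the third one.
    System.out.println(available(100, 2, 13));
  }
}
```

This prints 100 and then 17, matching the two cases described: three deciseconds allow 30 in total, and 13 of those are already spent.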

@adriancole

Contributor

commented Nov 19, 2018

One reminder on the bucket thing: the minimum rate in my suggestion is 1/s. It is easier code and matches Amazon's ReservoirSize https://docs.aws.amazon.com/xray/latest/api/API_SamplingRule.html. Slow rates like per minute or per hour are complicated, and I don't think we want to handle them until/unless explicitly requested.

@anuraaga

Contributor

commented Nov 19, 2018

That SGTM - not thinking about per minute / hour seems fine since they would rarely sample less than 100%.

@devinsba

Member Author

commented Nov 29, 2018

If someone else has the time, energy, and know-how to get this over the finish line, I'd appreciate it. I've been sick and time has been tight.

@adriancole

Contributor

commented Nov 29, 2018

@devinsba

Member Author

commented Dec 7, 2018

To Note: I had intended to do the "replace the sampler" approach for the XRay wrapper of this

@adriancole

Contributor

commented Dec 7, 2018

@huydx

Member

commented Dec 7, 2018

@adriancole replacing the sampler itself sounds better, I totally forgot about that. Anyway, the implementation looks good to me.

@adriancole

Contributor

commented Dec 7, 2018

@adriancole

Contributor

commented Dec 10, 2018

I think all this needs is javadoc polishing and readme. @anuraaga are you happy with impl?

@adriancole

Contributor

commented Dec 10, 2018

Here are benchmark results:

Benchmark                                                                                           (traceId)    Mode     Cnt     Score   Error   Units
SamplerBenchmarks.sampler_rateLimited_1                                                  -9223372036854775808  sample  779286     0.153 ± 0.001   us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.00                      -9223372036854775808  sample             0.031           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.50                      -9223372036854775808  sample             0.140           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.90                      -9223372036854775808  sample             0.151           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.95                      -9223372036854775808  sample             0.154           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.99                      -9223372036854775808  sample             0.197           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.999                     -9223372036854775808  sample             7.321           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.9999                    -9223372036854775808  sample            14.022           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p1.00                      -9223372036854775808  sample           101.120           us/op
SamplerBenchmarks.sampler_rateLimited_1:·gc.alloc.rate                                   -9223372036854775808  sample      15     0.224 ± 0.031  MB/sec
SamplerBenchmarks.sampler_rateLimited_1:·gc.alloc.rate.norm                              -9223372036854775808  sample      15     0.013 ± 0.002    B/op
SamplerBenchmarks.sampler_rateLimited_1:·gc.count                                        -9223372036854775808  sample      150          counts
SamplerBenchmarks.sampler_rateLimited_1                                                   1234567890987654321  sample  783515     0.149 ± 0.001   us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.00                       1234567890987654321  sample             0.032           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.50                       1234567890987654321  sample             0.140           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.90                       1234567890987654321  sample             0.150           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.95                       1234567890987654321  sample             0.154           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.99                       1234567890987654321  sample             0.186           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.999                      1234567890987654321  sample             3.119           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.9999                     1234567890987654321  sample            13.150           us/op
SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p1.00                       1234567890987654321  sample           130.176           us/op
SamplerBenchmarks.sampler_rateLimited_1:·gc.alloc.rate                                    1234567890987654321  sample      15     0.220 ± 0.023  MB/sec
SamplerBenchmarks.sampler_rateLimited_1:·gc.alloc.rate.norm                               1234567890987654321  sample      15     0.013 ± 0.001    B/op
SamplerBenchmarks.sampler_rateLimited_1:·gc.count                                         1234567890987654321  sample      150          counts
SamplerBenchmarks.sampler_rateLimited_100                                                -9223372036854775808  sample  727518     0.153 ± 0.002   us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.00                  -9223372036854775808  sample             0.030           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.50                  -9223372036854775808  sample             0.140           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.90                  -9223372036854775808  sample             0.155           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.95                  -9223372036854775808  sample             0.165           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.99                  -9223372036854775808  sample             0.205           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.999                 -9223372036854775808  sample             8.560           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.9999                -9223372036854775808  sample            15.964           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p1.00                  -9223372036854775808  sample           110.848           us/op
SamplerBenchmarks.sampler_rateLimited_100:·gc.alloc.rate                                 -9223372036854775808  sample      15     0.240 ± 0.021  MB/sec
SamplerBenchmarks.sampler_rateLimited_100:·gc.alloc.rate.norm                            -9223372036854775808  sample      15     0.015 ± 0.001    B/op
SamplerBenchmarks.sampler_rateLimited_100:·gc.count                                      -9223372036854775808  sample      150          counts
SamplerBenchmarks.sampler_rateLimited_100                                                 1234567890987654321  sample  725738     0.147 ± 0.001   us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.00                   1234567890987654321  sample             0.029           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.50                   1234567890987654321  sample             0.139           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.90                   1234567890987654321  sample             0.153           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.95                   1234567890987654321  sample             0.166           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.99                   1234567890987654321  sample             0.196           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.999                  1234567890987654321  sample             5.924           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.9999                 1234567890987654321  sample            13.140           us/op
SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p1.00                   1234567890987654321  sample            77.824           us/op
SamplerBenchmarks.sampler_rateLimited_100:·gc.alloc.rate                                  1234567890987654321  sample      15     0.229 ± 0.035  MB/sec
SamplerBenchmarks.sampler_rateLimited_100:·gc.alloc.rate.norm                             1234567890987654321  sample      15     0.015 ± 0.002    B/op
SamplerBenchmarks.sampler_rateLimited_100:·gc.count                                       1234567890987654321  sample      150          counts
@adriancole

Contributor

commented Dec 10, 2018

I will repost benchmarks as I forgot to include gc data

@adriancole

Contributor

commented Dec 10, 2018

ran benchmarks against another tracer which won't be mentioned.. our results are an order of magnitude more efficient in terms of allocation and runtime.

@adriancole

Contributor

commented Dec 10, 2018

Polished up and ran against X-Ray. Note their perf is, as expected, a little quicker than ours at higher rates, as we try to be fair. Also, they have a race condition when resetting, which we don't. I will raise a pull request to correct theirs.

SamplerBenchmarks.sampler_rateLimited_1:sampler_rateLimited_1·p0.999                  1234567890987654321  sample             8.128           us/op
SamplerBenchmarks.sampler_rateLimited_1:·gc.alloc.rate.norm                           1234567890987654321  sample      15     0.014 ± 0.002    B/op

SamplerBenchmarks.sampler_rateLimited_100:sampler_rateLimited_100·p0.999              1234567890987654321  sample             8.368           us/op
SamplerBenchmarks.sampler_rateLimited_100:·gc.alloc.rate.norm                         1234567890987654321  sample      15     0.015 ± 0.002    B/op

SamplerBenchmarks.sampler_rateLimited_1_xray:sampler_rateLimited_1_xray·p0.999       -9223372036854775808  sample             8.240           us/op
SamplerBenchmarks.sampler_rateLimited_1_xray:·gc.alloc.rate.norm                     -9223372036854775808  sample      15     0.008 ± 0.002    B/op

SamplerBenchmarks.sampler_rateLimited_100_xray:sampler_rateLimited_100_xray·p0.999   -9223372036854775808  sample             5.742           us/op
SamplerBenchmarks.sampler_rateLimited_100_xray:·gc.alloc.rate.norm                   -9223372036854775808  sample      15     0.007 ± 0.001    B/op

@adriancole adriancole changed the title from "WIP - Rate limiting sampler" to "Adds Rate limiting sampler" Dec 10, 2018

adriancole added a commit to adriancole/aws-xray-sdk-java that referenced this pull request Dec 10, 2018

Improves reservoir to avoid a race condition and to be more fair
This re-uses the implementation we just made in Brave based on feedback
on yours. Enjoy!

See openzipkin/brave#819
@adriancole

Contributor

commented Dec 10, 2018

added aws/aws-xray-sdk-java#47 to give back to amazon

adriancole added some commits Dec 10, 2018

@devinsba

Member Author

commented Dec 10, 2018

👍 Thanks for picking this up @adriancole, I'll see if I can find time to fix that XRay test today so your PR can get merged

@adriancole

Contributor

commented Dec 11, 2018

I will look into the test failure in aws/aws-xray-sdk-java#47 before merge. thanks!

@adriancole

Contributor

commented Dec 11, 2018

Amazon's test case found a legit glitch. Fixed in both places

@adriancole adriancole merged commit 3be55b5 into openzipkin:master Dec 11, 2018

2 checks passed

ci/circleci Your tests passed on CircleCI!
continuous-integration/travis-ci/pr The Travis CI build passed

haotianw465 added a commit to aws/aws-xray-sdk-java that referenced this pull request Dec 12, 2018

Improves reservoir to avoid a race condition and to be more fair (#47)
* Improves reservoir to avoid a race condition and to be more fair

This re-uses the implementation we just made in Brave based on feedback
on yours. Enjoy!

See openzipkin/brave#819

* Makes rate setting lazy
@adriancole

Contributor

commented Dec 13, 2018

whoot xray now has this impl aws/aws-xray-sdk-java#47
