Commit
Showing 91 changed files with 5,078 additions and 8,918 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
FROM openjdk:14-alpine | ||
FROM openjdk:15-alpine | ||
RUN apk --no-cache add curl | ||
|
||
COPY nb/target/nb.jar nb.jar | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
- 526d09cd (HEAD -> nb4-rc1) auto create dirs for grafana_apikey | ||
- b4ec4c9a (origin/nb4-rc1) trigger build | ||
- af87ef9c relaxed requirement for finicky test | ||
- 3436ec61 trigger build | ||
- 17ed4c1e annotator and dashboard fixes | ||
- 4dab9b89 move annotations enums to package | ||
- 6d514cb6 bump middle version number to required java version '15' | ||
- fa78e27f set NB4 to Java 15 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,244 @@ | ||
## RateLimiter Design | ||
|
||
The nosqlbench rate limiter is a hybrid design, combining ideas from | ||
well-known algorithms with a heavy dose of mechanical sympathy. The | ||
resulting implementation provides the following: | ||
|
||
1. A basic design that can be explained in one page (this page!) | ||
2. High throughput, compared to other rate limiters tested. | ||
3. Graceful degradation with increasing concurrency. | ||
4. Clearly defined behavioral semantics. | ||
5. Efficient burst capability, for tunable catch-up rates. | ||
6. Efficient calculation of wait time. | ||
|
||
## Parameters | ||
|
||
**rate** - In the simplest case, users only need to configure the *rate*. | ||
For example, `rate=12000` specifies an op rate of 12000 ops/second. | ||
|
||
**burst rate** - Additionally, users may specify a burst rate which can be | ||
used to recover unused time when a client is able to go faster than the | ||
strict limit. The burst rate is multiplied by the _op rate_ to arrive at | ||
the maximum rate when wait time is available to recover. For | ||
example, `rate=12000,1.1` | ||
specifies that a client may operate at 12000 ops/s _when it is caught up_, | ||
while allowing it to go at a rate of up to 13200 ops/s _when it is behind | ||
schedule_. | ||
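As a minimal sketch of how these two parameters relate, the following Java snippet parses a spec such as `12000,1.1` and derives the maximum rate. The class name `RateSpec`, its fields, and the default burst value of 1.0 when none is given are assumptions made for illustration, not the nosqlbench API.

```java
// Hypothetical helper for illustration; not part of nosqlbench.
final class RateSpec {
    final double opsPerSec;   // strict target rate, e.g. 12000
    final double burstRatio;  // multiplier applied while behind schedule, e.g. 1.1

    RateSpec(double opsPerSec, double burstRatio) {
        this.opsPerSec = opsPerSec;
        this.burstRatio = burstRatio;
    }

    // Parses "12000" or "12000,1.1".
    // ASSUMPTION: a missing burst value is treated as 1.0 (no bursting).
    static RateSpec parse(String spec) {
        String[] parts = spec.split(",");
        double rate = Double.parseDouble(parts[0].trim());
        double burst = parts.length > 1 ? Double.parseDouble(parts[1].trim()) : 1.0;
        return new RateSpec(rate, burst);
    }

    // Maximum rate allowed while there is wait time to recover.
    double maxOpsPerSec() {
        return opsPerSec * burstRatio;
    }
}
```

With `RateSpec.parse("12000,1.1")`, `maxOpsPerSec()` evaluates to approximately 13200, matching the example above.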
|
||
## Design Principles | ||
|
||
The core design of the rate limiter is based on | ||
the [token bucket](https://en.wikipedia.org/wiki/Token_bucket) algorithm | ||
as established in the telecom industry for rate metering. Additional | ||
refinements have been added to allow for flexible and reliable use on | ||
non-realtime systems. | ||
|
||
The unit of scheduling used in this design is the token, corresponding | ||
directly to a nanosecond of time. The scheduling time that is made | ||
available to callers is stored in a pool of tokens which is set to a | ||
configured size. The size of the token pool determines how many grants are | ||
allowed to be dispatched before the next one is forced to wait for | ||
available tokens. | ||
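To make the token-to-nanosecond mapping concrete, here is a small sketch that converts a rate into a per-op token cost and counts how many grants a pool of a given size can hold. The class and method names are hypothetical, and truncating to whole nanoseconds is an assumption of this sketch.

```java
// Hypothetical arithmetic helper for illustration; not part of nosqlbench.
final class TokenMath {
    static final long NANOS_PER_SECOND = 1_000_000_000L;

    // Tokens (nanoseconds) that a single op costs at the given rate.
    // Truncates to whole nanoseconds and never returns less than 1.
    static long nanosPerOp(double opsPerSec) {
        return Math.max(1L, (long) (NANOS_PER_SECOND / opsPerSec));
    }

    // Whole grants a pool of this size can dispatch before the next caller must wait.
    static long grantsInPool(long poolSizeNanos, double opsPerSec) {
        return poolSizeNanos / nanosPerOp(opsPerSec);
    }
}
```

For example, at 1000 ops/s each op costs 1,000,000 tokens, so a pool of 1E9 tokens holds 1000 grants.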
|
||
At some regular frequency, a filler thread adds tokens (nanoseconds of | ||
time to be distributed to waiting ops) to the pool. The callers which are | ||
waiting for these tokens consume a number of tokens serially. If the pool | ||
does not contain the requested number of tokens, then the caller is | ||
blocked using basic synchronization primitives. When the pool is filled, | ||
any blocked callers are unblocked. | ||
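The filler and caller interaction described above can be sketched with plain `synchronized`, `wait()` and `notifyAll()`. This is a simplified single-pool illustration under stated assumptions: the class `SimpleTokenPool` is hypothetical, overflow beyond capacity is simply dropped here (the actual design keeps it in a separate pool), and a request larger than the capacity would block indefinitely.

```java
// Hypothetical single-pool sketch for illustration; not the nosqlbench implementation.
final class SimpleTokenPool {
    private final long capacity;
    private long available;

    SimpleTokenPool(long capacity) {
        this.capacity = capacity;
    }

    // Called by the filler thread: never waits for callers, and wakes any that are blocked.
    // ASSUMPTION: tokens beyond capacity are discarded in this simplified sketch.
    synchronized void refill(long tokens) {
        available = Math.min(capacity, available + tokens);
        notifyAll();
    }

    // Called by an op thread: blocks until the requested tokens are present.
    synchronized void blockingTake(long tokens) throws InterruptedException {
        while (available < tokens) {
            wait(); // releases the monitor until refill() calls notifyAll()
        }
        available -= tokens;
    }

    synchronized long available() {
        return available;
    }
}
```

The `while` loop around `wait()` guards against spurious wakeups and against another caller taking the tokens first.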
|
||
The hybrid rate limiter tracks and accumulates both the passage of system | ||
time and the usage rate of this time as a measurement of progress. The | ||
delta between these two reference points in time captures a very simple | ||
and empirical value of imposed wait time. | ||
|
||
That is, the time which was allocated but which was not used always | ||
represents a slow down which is imposed by external factors. This | ||
manifests as slower response when considering the target rate to be | ||
equivalent to user load. | ||
|
||
## Design Details | ||
|
||
In fact, there are three pools: the _active_ pool, the _bursting_ pool, | ||
and the _waiting_ pool. The active pool has a limited size based on the | ||
number of operations that are allowed to be granted concurrently. | ||
|
||
The bursting pool is sized according to the relative burst rate and the | ||
size of the active pool. For example, with an op rate of 1000 ops/s and a | ||
burst rate of 1.1, the active pool can be sized to 1E9 nanos (one second | ||
of nanos), and the burst pool can be sized to 1E8 (1/10 of that), thus | ||
yielding a combined pool size of 1E9 + 1E8, or 1100000000 ns. | ||
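The sizing arithmetic in this example can be written out as a short sketch. The helper name `burstPoolSize` is hypothetical; the formula (active size multiplied by the amount the burst rate exceeds 1.0) is inferred from the worked example above rather than taken from the implementation.

```java
// Hypothetical sizing helper for illustration; formula inferred from the example in the text.
final class PoolSizing {
    // Burst pool size: the active pool scaled by how far the burst rate exceeds 1.0.
    static long burstPoolSize(long activePoolNanos, double burstRatio) {
        return Math.round(activePoolNanos * (burstRatio - 1.0));
    }

    // Combined capacity available to callers from the active and burst pools.
    static long combinedPoolSize(long activePoolNanos, double burstRatio) {
        return activePoolNanos + burstPoolSize(activePoolNanos, burstRatio);
    }
}
```

With an active pool of 1E9 ns and a burst rate of 1.1, this gives a burst pool of 1E8 ns and a combined size of 1100000000 ns.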
|
||
The waiting pool is where all extra tokens are held in reserve. It is | ||
unlimited except by the size of a long value. The size of the waiting pool | ||
is a direct measure of wait time in nanoseconds. | ||
|
||
Within the pools, tokens (time) are neither created nor destroyed. They | ||
are added by the filler based on the passage of time, and consumed by | ||
callers when they become available. In between these operations, the net | ||
sum of tokens is preserved. In short, when time deltas are observed in the | ||
system clock, this time is accumulated into the available scheduling time | ||
of the token pools. In this way, the token pool acts as a metered | ||
dispenser of scheduling time to waiting (or not) consumers. | ||
|
||
The filler thread adds tokens to the pool according to the system | ||
real-time clock, at some estimated but unreliable interval. The frequency | ||
of filling is set high enough to give a reliable perception of time | ||
passing smoothly, but low enough to avoid wasting too much thread time in | ||
calling overhead (1K fills/s by default). Each time filling occurs, the | ||
real-time clock is check-pointed, and the time delta is fed into the pool | ||
filling logic as explained below. | ||
|
||
## Visual Explanation | ||
|
||
The diagram below explains the moving parts of the hybrid rate limiter. | ||
The arrows represent the flow of tokens (ns) as a form of scheduling | ||
currency. | ||
|
||
The top box shows an active token filler thread which polls the system | ||
clock and accumulates new time into the token pool. | ||
|
||
The bottom boxes represent concurrent readers of the token pool. These are | ||
typically independent threads which do a blocking read for tokens once | ||
they are ready to execute the rate-limited task. | ||
|
||
![Hybrid Ratelimiter Schematic](hybrid_ratelimiter.png) | ||
|
||
In the middle, the passive component in this diagram is the token pool | ||
itself. When the token filler adds tokens, it never blocks. However, the | ||
token filler can cause any readers of the token pool to unblock so that | ||
they can acquire newly available tokens. | ||
|
||
When time is added to the token pool, the following steps are taken: | ||
|
||
1) New tokens (based on measured time elapsed since the last fill) are | ||
added to the active pool until it is full. | ||
2) Any extra tokens are added to the waiting pool. | ||
3) If the waiting pool has any tokens, and there is room in the bursting | ||
pool, some tokens are moved from the waiting pool to the bursting pool | ||
according to how many will fit. | ||
|
||
When a caller asks for a number of tokens, the combined total from the | ||
active and burst pools is available to that caller. If the number of | ||
tokens needed is not yet available, then the caller will block until | ||
tokens are added. | ||
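The fill steps and the caller's view of the pools can be sketched as follows. This is a single-threaded, non-blocking model meant only to show the token bookkeeping; the class `ThreePools` is hypothetical, and draining the active pool before the burst pool is an assumption of this sketch. The jitter normalization described later is left out here.

```java
// Hypothetical bookkeeping model for illustration; not thread-safe and not the nosqlbench code.
final class ThreePools {
    final long maxActive;
    final long maxBurst;
    long active;   // tokens available at the strict rate
    long burst;    // tokens available for catching up
    long waiting;  // unclaimed time, i.e. accumulated wait time in nanoseconds

    ThreePools(long maxActive, long maxBurst) {
        this.maxActive = maxActive;
        this.maxBurst = maxBurst;
    }

    // Apply one fill of newly elapsed nanoseconds.
    void fill(long newTokens) {
        long toActive = Math.min(newTokens, maxActive - active); // step 1: fill the active pool
        active += toActive;
        waiting += newTokens - toActive;                          // step 2: extra goes to waiting
        long toBurst = Math.min(waiting, maxBurst - burst);       // step 3: move what fits to burst
        waiting -= toBurst;
        burst += toBurst;
    }

    // Returns true and deducts tokens if active + burst can cover the request.
    // ASSUMPTION: the active pool is drained before the burst pool.
    boolean tryTake(long tokens) {
        if (tokens > active + burst) {
            return false; // a real caller would block here until the next fill
        }
        long fromActive = Math.min(tokens, active);
        active -= fromActive;
        burst -= tokens - fromActive;
        return true;
    }
}
```

Note that `fill` only moves tokens between pools, so the sum of the three pools always equals the time added minus the time taken.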
|
||
## Bursting Logic | ||
|
||
Tokens in the waiting pool represent time that has not been claimed by a | ||
caller. Tokens accumulate in the waiting pool as a side-effect of | ||
continuous filling outpacing continuous draining, thus creating a backlog | ||
of operations. | ||
|
||
The pool sizes determine both the maximum instantaneously available | ||
operations as well as the rate at which unclaimed time can be back-filled | ||
into the active or burst pools. | ||
|
||
### Normalizing for Jitter | ||
|
||
Since it is not possible to schedule the filler thread to trigger on a | ||
strict and reliable schedule (as in a real-time system), the method of | ||
moving tokens from the waiting pool to the bursting pool must account for | ||
differences in timing. Thus, tokens which are activated for bursting are | ||
scaled according to the amount of time added in the last fill, relative to | ||
the maximum active pool. This means that a full pool fill will allow a | ||
full burst pool fill, presuming wait time is positive by that amount. It | ||
also means that the same effect can be achieved by ten consecutive fills | ||
of a tenth the time each. In effect, bursting is normalized to the passage | ||
of time along with the burst rate, with a maximum cap imposed when | ||
operations are unclaimed by callers. | ||
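One way to read this normalization is sketched below: the number of tokens moved into the burst pool on a fill is the burst pool size scaled by the fraction of the active pool that the fill represents, capped by the room left in the burst pool and by the wait time actually available. The function name and its parameters are hypothetical, and the exact formula is an interpretation of the paragraph above, not the implementation.

```java
// Hypothetical sketch of jitter-normalized burst transfer; an interpretation, not the nosqlbench code.
final class BurstNormalizer {
    // fillNanos: time added in the last fill
    // maxActive: capacity of the active pool
    // maxBurst:  capacity of the burst pool
    // burstRoom: free space currently left in the burst pool
    // waiting:   tokens currently held in the waiting pool
    static long burstTransfer(long fillNanos, long maxActive, long maxBurst,
                              long burstRoom, long waiting) {
        // Fraction of a full active pool that this fill represents, capped at 1.0.
        double fillFraction = Math.min(1.0, (double) fillNanos / maxActive);
        long scaled = Math.round(fillFraction * maxBurst);
        // Never move more than fits in the burst pool or more than is waiting.
        return Math.min(scaled, Math.min(burstRoom, waiting));
    }
}
```

Under this reading, one fill of 1E9 ns into a 1E9 ns active pool releases up to a full 1E8 ns burst pool, and ten fills of 1E8 ns each release 1E7 ns apiece for the same total.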
|
||
## Mechanical Trade-offs | ||
|
||
In this implementation, it is relatively easy to explain how accuracy and | ||
performance trade off. They are competing concerns. Consider these two | ||
extremes of an isochronous configuration: | ||
|
||
### Slow Isochronous | ||
|
||
For example, the rate limiter could be configured for strict isochronous | ||
behavior by setting the active pool size to *one* op of nanos and the | ||
burst rate to 1.0, thus disabling bursting. If the op rate requested is 1 | ||
op/s, this configuration will work relatively well, although *any* caller | ||
which doesn't show up (or isn't already waiting) when the tokens become | ||
available will incur a wait time penalty. The odds of this are relatively | ||
low for a high-velocity client. | ||
|
||
### Fast Isochronous | ||
|
||
However, if the op rate for this type of configuration is set to 1E8 | ||
operations per second, then the filler thread will be adding 1E5 ops worth | ||
of time per fill (1E8 ops/s divided by 1K fills/s) when there is only | ||
*one* op worth of active pool space. This is due to the fact that filling | ||
can only occur at a maximal frequency which has been set to 1K fills/s on | ||
average. That will create artificial wait | ||
time, since the token consumers and producers would not have enough pool | ||
space to hold the tokens needed during fill. It is not possible on most | ||
systems to fill the pool at arbitrarily high fill frequencies. Thus, it is | ||
important for users to understand the limits of the machinery when using | ||
high rates. In most scenarios, these limits will not be onerous. | ||
|
||
### Boundary Rules | ||
|
||
Taking these effects into account, the default configuration makes some | ||
reasonable trade-offs according to the rules below. These rules should | ||
work well for most rates below 50M ops/s. The net effect of these rules is | ||
to increase work bulking within the token pools as rates go higher. | ||
|
||
Trying to go above 50M ops/s while also forcing isochronous behavior will | ||
result in artificial wait-time. For this reason, the pool size itself is | ||
not user-configurable at this time. | ||
|
||
- The pool size will always be at least as big as two ops. This rule | ||
ensures that there is adequate buffer space for tokens when callers are | ||
accessing the token pools near the rate of the filler thread. If this | ||
were not ensured, then artificial wait time would be injected due to | ||
overflow error. | ||
- The pool size will always be at least as big as 1E6 nanos, or 1/1000 of | ||
a second. This rule ensures that the filler thread has a reasonably | ||
attainable update frequency which will prevent underflow in the active | ||
or burst pools. | ||
- The number of ops that can fit in the pool will determine how many ops | ||
can be dispatched between fills. For example, an op rate of 1E6 will | ||
mean that up to 1000 ops worth of tokens may be present between fills, | ||
and up to 1000 ops may be allowed to start at any time before the next | ||
fill. | ||
|
||
- .1 ops/s : .2 seconds worth | ||
- 1 ops/s : 2 seconds worth | ||
- 100 ops/s : 2 seconds worth | ||
|
||
In practical terms, this means that rates slower than 1K ops/s will have | ||
their strictness controlled by the burst rate in general, and rates faster | ||
than 1K ops/s will automatically include some op bulking between fills. | ||
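The first two boundary rules can be sketched as a simple maximum: the pool holds at least two ops worth of nanoseconds and at least 1E6 ns. The class and method names are hypothetical, and the default rules may include adjustments not described here.

```java
// Hypothetical sketch of the two minimum-size rules; not the nosqlbench implementation.
final class PoolBounds {
    static final long NANOS_PER_SECOND = 1_000_000_000L;
    static final long MIN_POOL_NANOS = 1_000_000L; // 1E6 ns, i.e. 1/1000 of a second

    // Tokens (nanoseconds) per op, truncated, never less than 1.
    static long nanosPerOp(double opsPerSec) {
        return Math.max(1L, (long) (NANOS_PER_SECOND / opsPerSec));
    }

    // Pool size: at least two ops worth of time, and at least 1E6 ns.
    static long poolNanos(double opsPerSec) {
        return Math.max(2 * nanosPerOp(opsPerSec), MIN_POOL_NANOS);
    }

    // How many ops worth of tokens fit in the pool, i.e. ops that may start between fills.
    static long opsPerPool(double opsPerSec) {
        return poolNanos(opsPerSec) / nanosPerOp(opsPerSec);
    }
}
```

At 1 op/s the two-op rule dominates and the pool holds 2 seconds of time; at 1E6 ops/s the 1E6 ns floor dominates and the pool holds 1000 ops worth of tokens.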
|
||
## History | ||
|
||
A CAS-oriented method which compensated for RTC calling overhead was used | ||
previously. This method afforded very high performance, but it was | ||
difficult to reason about. | ||
|
||
This implementation replaces that previous version. Basic synchronization | ||
primitives (implicit locking via synchronized methods) performed | ||
surprisingly well -- well enough to discard the complexity of the previous | ||
implementation. | ||
|
||
Further, this version is much easier to study and reason about. | ||
|
||
## New Challenges | ||
|
||
While the current implementation works well for most basic cases, it can | ||
become an artificial bottleneck under high CPU contention. Based on | ||
observations on higher end systems with many cores running many threads | ||
and high target rates, it appears that the rate limiter becomes a resource | ||
blocker or forces too much thread management. | ||
|
||
Strategies for handling this should be considered: | ||
|
||
1) Make callers able to pseudo-randomly (or not randomly) act as a token | ||
filler, such that active consumers can do some work stealing from the | ||
original token filler thread. | ||
2) Analyze the timing and history of a high-contention scenario for | ||
weaknesses in the parameter adjustment rules above. | ||
3) Add internal micro-batching at the consumer interface, such that | ||
contention cost is lower in general. | ||
4) Partition the rate limiter into multiple slices. |