Skip to content

Commit

Permalink
README updated.
Browse files Browse the repository at this point in the history
  • Loading branch information
Maxim Egorushkin committed Apr 30, 2023
1 parent 996a85c commit fbfdcfc
Show file tree
Hide file tree
Showing 2 changed files with 11 additions and 7 deletions.
14 changes: 9 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,9 +14,12 @@ C++14 multiple-producer-multiple-consumer *lockless* queues based on circular bu

It has been developed, tested and benchmarked on Linux, but should support any C++14 platforms which implement `std::atomic`.

These queues have been designed to be as low-latency as it can possibly get to let the CPUs perform at their peak capacities unimpeded. When minimizing latency a good design is not when there is nothing left to add, rather when there is nothing left to remove, as these queues exemplify. The main design principle these queues follow is _minimalism_: the bare minimum of atomic operations, fixed size buffer, value semantics.
These queues have been designed with a goal to minimize the latency between one thread pushing an element into a queue and another thread popping it from the queue.

These qualities are also limitations:
## Design Principles
When minimizing latency a good design is not when there is nothing left to add, but rather when there is nothing left to remove, as these queues exemplify. The main design principle these queues follow is _minimalism_, which results in such design choices as the bare minimum of atomic operations, contention/false-sharing minimization, fixed size buffer, value semantics. The impact of each of these small design choices on their own can be barely measurable, but their total impact is much greater then a simple sum of the constituents' impacts, aka synergy. The synergy emerging from combining multiple of these small design choices together is what allows CPUs to perform at their peak capacities least unimpeded.

These design choices are also limitations:

* The maximum queue size must be set at compile time or construction time. The circular buffer side-steps the memory reclamation problem inherent in linked-list based queues for the price of fixed buffer size. See [Effective memory reclamation for lock-free data structures in C++][4] for more details. Fixed buffer size may not be that much of a limitation, since once the queue gets larger than the maximum expected size that indicates a problem that elements aren't consumed fast enough, and if the queue keeps growing it may eventually consume all available memory which may affect the entire system, rather than the problematic process only. The only apparent inconvenience is that one has to do an upfront calculation on what would be the largest expected/acceptable number of unconsumed elements in the queue.
* There are no OS-blocking push/pop functions. This queue is designed for ultra-low-latency scenarios and using an OS blocking primitive would be sacrificing push-to-pop latency. For lowest possible latency one cannot afford blocking in the OS kernel because the wake-up latency of a blocked thread is about 1-3 microseconds, whereas this queue's round-trip time can be as low as 150 nanoseconds.
Expand All @@ -25,9 +28,9 @@ Ultra-low-latency applications need just that and nothing more. The minimalism p

Available containers are:
* `AtomicQueue` - a fixed size ring-buffer for atomic elements.
* `OptimistAtomicQueue` - a faster fixed size ring-buffer for atomic elements which busy-waits when empty or full.
* `OptimistAtomicQueue` - a faster fixed size ring-buffer for atomic elements which busy-waits when empty or full. It is `AtomicQueue` used with `push`/`pop` instead of `try_push`/`try_pop`.
* `AtomicQueue2` - a fixed size ring-buffer for non-atomic elements.
* `OptimistAtomicQueue2` - a faster fixed size ring-buffer for non-atomic elements which busy-waits when empty or full.
* `OptimistAtomicQueue2` - a faster fixed size ring-buffer for non-atomic elements which busy-waits when empty or full. It is `AtomicQueue2` used with `push`/`pop` instead of `try_push`/`try_pop`.

These containers have corresponding `AtomicQueueB`, `OptimistAtomicQueueB`, `AtomicQueueB2`, `OptimistAtomicQueueB2` versions where the buffer size is specified as an argument to the constructor.

Expand All @@ -37,7 +40,8 @@ Single-producer-single-consumer mode is supported. In this mode, no expensive at

Move-only queue element types are fully supported. For example, a queue of `std::unique_ptr<T>` elements would be `AtomicQueue2B<std::unique_ptr<T>>` or `AtomicQueue2<std::unique_ptr<T>, CAPACITY>`.

A few other thread-safe containers are used for reference in the benchmarks:
## Role Models
Several other well established and popular thread-safe containers are used for reference in the [benchmarks][1]:
* `std::mutex` - a fixed size ring-buffer with `std::mutex`.
* `pthread_spinlock` - a fixed size ring-buffer with `pthread_spinlock_t`.
* `boost::lockfree::spsc_queue` - a wait-free single-producer-single-consumer queue from Boost library.
Expand Down
4 changes: 2 additions & 2 deletions html/benchmarks.html
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@
<body>
<h1 class="view-toggle">Scalability Benchmark</h1>
<div>
<p>N producer threads push a 4-byte integer into one same queue, N consumer threads pop the integers from the queue. All producers posts 1,000,000 messages in total. Total time to send and receive all the messages is measured. The benchmark is run for from 1 producer and 1 consumer up to (total-number-of-cpus / 2) producers/consumers to measure the scalabilty of different queues.</p>
<p>N producer threads push a 4-byte integer into one same queue, N consumer threads pop the integers from the queue. All producers posts 1,000,000 messages in total. Total time to send and receive all the messages is measured. The benchmark is run for from 1 producer and 1 consumer up to (total-number-of-cpus / 2) producers/consumers to measure the scalabilty of different queues. The minimum, maximum, mean and standard deviation of at least 33 runs are reported in the tooltip.</p>
<h3 class="view-toggle">Scalability on Intel i9-9900KS</h3><div class="chart" id="scalability-9900KS-5GHz"></div>
<h3 class="view-toggle">Scalability on AMD Ryzen 7 5825U</h3><div class="chart" id="scalability-ryzen-5825u"></div>
<h3 class="view-toggle">Scalability on Intel Xeon Gold 6132</h3><div class="chart" id="scalability-xeon-gold-6132"></div>
Expand All @@ -34,7 +34,7 @@ <h3 class="view-toggle">Scalability on AMD Ryzen 9 5950X</h3><div class="chart"

<h1 class="view-toggle">Latency Benchmark</h1>
<div>
<p>One thread posts a 4-byte integer to another thread through one queue and waits for a reply from another queue (2 queues in total). The benchmark measures the total time of 100,000 ping-pongs, best of 10 runs. Contention is minimal here (1-producer-1-consumer, 1 element in the queue) to be able to achieve and measure the lowest latency. Reports the average round-trip time.</p>
<p>One thread posts a 4-byte integer to another thread through one queue and waits for a reply from another queue (2 queues in total). The benchmark measures the total time of 100,000 ping-pongs, best of 10 runs. Contention is minimal here (1-producer-1-consumer, 1 element in the queue) to be able to achieve and measure the lowest latency. Reports the average round-trip time, i.e. the time it takes to post a message to another thread and receive a reply. The minimum, maximum, mean and standard deviation of at least 33 runs are reported in the tooltip.</p>
<h3 class="view-toggle">Latency on Intel i9-9900KS</h3><div class="chart" id="latency-9900KS-5GHz"></div>
<h3 class="view-toggle">Latency on AMD Ryzen 7 5825U</h3><div class="chart" id="latency-ryzen-5825u"></div>
<h3 class="view-toggle">Latency on Intel Xeon Gold 6132</h3><div class="chart" id="latency-xeon-gold-6132"></div>
Expand Down

0 comments on commit fbfdcfc

Please sign in to comment.