Optimize SPSC queue #121

Closed

polytypic wants to merge 10 commits from the optimize-spsc-queue branch
Conversation

@polytypic (Contributor) commented on Jan 19, 2024

This PR optimizes the SPSC queue.

See the individual commits to better understand the motivation behind the optimizations.

Here is a run of the benchmarks on my M3 Max before the optimizations:

➜  saturn git:(rewrite-bench) ✗ dune exec --release -- ./bench/main.exe -budget 1 Single | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[                                    
  {
    "name": "messages over time/2 workers, capacity 1",
    "value": 6.500299359098922
  },
  {
    "name": "messages over time/2 workers, capacity 8",
    "value": 12.632201914378621
  },
  {
    "name": "messages over time/2 workers, capacity 64",
    "value": 14.90261050891111
  },
  {
    "name": "messages over time/2 workers, capacity 512",
    "value": 14.960086955505428
  },
  {
    "name": "messages over time/2 workers, capacity 4096",
    "value": 20.06888141824778
  },
  {
    "name": "messages over time/2 workers, capacity 32768",
    "value": 22.94386410808166
  }
]

Here is a run of the benchmarks after the optimizations:

➜  saturn git:(optimize-spsc-queue) ✗ dune exec --release -- ./bench/main.exe -budget 1 Single | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[                                    
  {
    "name": "messages over time/2 workers, capacity 1",
    "value": 6.868795322207784
  },
  {
    "name": "messages over time/2 workers, capacity 8",
    "value": 18.420713908583064
  },
  {
    "name": "messages over time/2 workers, capacity 64",
    "value": 30.26195314050038
  },
  {
    "name": "messages over time/2 workers, capacity 512",
    "value": 38.7165046104334
  },
  {
    "name": "messages over time/2 workers, capacity 4096",
    "value": 123.22902197665648
  },
  {
    "name": "messages over time/2 workers, capacity 32768",
    "value": 270.82312457121435
  }
]

TODO:

  • Add comments on the use/lack of fences in the code.

@polytypic changed the title from "WIP: Optimize SPSC queue" to "Optimize SPSC queue" on Jan 19, 2024
@polytypic force-pushed the optimize-spsc-queue branch 20 times, most recently from 859b2dd to 5cc638e, on January 20, 2024
@polytypic marked this pull request as ready for review on January 21, 2024
@polytypic requested a review from a team on January 21, 2024
@polytypic force-pushed the optimize-spsc-queue branch 6 times, most recently from 181b99b to 204f2ce, on January 27, 2024
@lyrm (Collaborator) left a comment

I am unsure whether the commit "Avoid float array pessimization", and a bit of the previous one (removing the indirection of the array and thus doing initialization with Obj.magic), are a good idea. They introduce a lot of Obj.magic, which I feel we should be careful with, especially because it could cause issues with Flambda.

Also, what happens if you actually put floats in the queue? I am guessing the float array special case will not be triggered?

I have no immediate solution for the float array pessimization, but for initialization, we could have a make : size_exponent:int -> 'a -> 'a t function, which requires the user to provide an inhabitant of type 'a, instead of the current create function.

It may just move the issue to the user level, though: users may then want to either use the same trick with Obj.magic or use an option type...
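A rough sketch of what that alternative constructor could look like (field names and the layout here are assumptions for illustration, not the PR's actual code):

```ocaml
(* Hypothetical sketch of the suggested [make]: the caller supplies a dummy
   inhabitant of ['a], so the backing array can be initialized without
   [Obj.magic].  In the real structure, the indices would be atomics. *)
type 'a t = {
  array : 'a array;   (* elements stored directly, no option boxing *)
  dummy : 'a;         (* used to overwrite consumed slots *)
  mutable head : int;
  mutable tail : int;
}

let make ~size_exponent dummy =
  let size = 1 lsl size_exponent in
  { array = Array.make size dummy; dummy; head = 0; tail = 0 }
```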

@lyrm (Collaborator) commented on Feb 7, 2024

Thanks for the PR! All the optimizations (except maybe the ones mentioned in my previous message :p) are great, and thanks again for making the git history so readable. And obviously, the gain in performance is amazing!

@lyrm (Collaborator) commented on Feb 8, 2024

I should have added that the optimization with the caches is very cool!

@polytypic force-pushed the optimize-spsc-queue branch 7 times, most recently from 82cdf25 to 8b0c40d, on February 17, 2024
@polytypic force-pushed the optimize-spsc-queue branch 2 times, most recently from c86dcca to 034c23b, on March 1, 2024
Commit messages:

This should stabilize performance results to some degree, but, as it is, there is still a massive amount of contention, so results will likely still have a lot of randomness.

This is also one of two crucial optimizations. Without this optimization, best performance simply cannot be obtained.

Transparent atomic provides a workaround for the OCaml 5.(0|1) CSE bugs around `Atomic.get`s. In this case, an invalid CSE optimization on `Atomic.get`s can actually break the queue semantics.

Fences are not really needed in this specific queue, with the exception of the updates of the head and tail indices, and even there the fences are only needed to ensure that the other side sees the index update only after the updates to the array.
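A sketch of the producer side under an assumed layout; the point is only the ordering of the two writes:

```ocaml
(* Hypothetical layout for illustration. *)
type 'a t = { array : 'a option array; tail : int Atomic.t }

(* The element is written to the array before the new tail index is
   published.  [Atomic.set] provides the fence, so the consumer can only
   observe the new index after the array update.  (Capacity check elided.) *)
let push t x =
  let tail = Atomic.get t.tail in
  Array.unsafe_set t.array (tail land (Array.length t.array - 1)) (Some x);
  Atomic.set t.tail (tail + 1)
```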

`Atomic.incr` is potentially a little bit faster, because it needs no write
barrier.
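Continuing the sketch above: the producer is the only writer of the tail index, so the publish can equally be an increment of an immediate `int`:

```ocaml
(* [Atomic.incr] is an atomic add of an immediate int, so no write barrier
   is involved. *)
let advance_tail (tail : int Atomic.t) = Atomic.incr tail
```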
Wide pattern matches, even with immutable fields, tend to generate inefficient
code in OCaml at the moment.  Specifically, the compiler tends to generate a
sequence of loads for the matched fields at the start of the function.  The
problem with this is that it increases register pressure, which may lead to
spilling (i.e. extra reads and writes to stack), and may also lead to fetching
some fields unnecessarily (i.e. wasted work) as those fields are only accessed
conditionally.

It is typically better to access immutable fields using dot notation.  The
compiler performs CSE on such loads so repeated accesses of immutable fields do
not necessarily generate repeated loads of those fields.
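An illustrative contrast (not code from the PR):

```ocaml
type r = { a : int array; b : int array; len : int }

(* Wide match: the fields tend to be loaded eagerly at entry, even though
   [b] is needed on only one branch. *)
let first_of_match { a; b; len } = if len = 0 then 0 else a.(0) + b.(0)

(* Dot notation: fields are loaded where used, and CSE merges repeated
   loads of the same immutable field. *)
let first_of_dot r = if r.len = 0 then 0 else r.a.(0) + r.b.(0)
```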
This reduces contention as operations can be performed on either side in
parallel without interference from the other side.

Without this algorithmic change, most of the other optimizations become
irrelevant, because the contention on index accesses would simply limit the peak
performance by an order of magnitude or more.  Try removing this change and
observe the massive drop in peak performance.
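A sketch of the idea with assumed names: each side keeps a plain, non-atomic snapshot of the other side's index and re-reads the shared atomic only when the snapshot suggests the queue looks full (or, on the consumer side, empty):

```ocaml
type 'a t = {
  array : 'a option array;
  tail : int Atomic.t;        (* written by the producer only *)
  head : int Atomic.t;        (* written by the consumer only *)
  mutable cached_head : int;  (* producer-local snapshot of [head] *)
}

(* The shared [head] is only read when the cached snapshot suggests the
   queue is full, so the common case touches no cache line the consumer is
   writing to. *)
let try_push t x =
  let tail = Atomic.get t.tail in
  let size = Array.length t.array in
  if tail - t.cached_head >= size then t.cached_head <- Atomic.get t.head;
  if tail - t.cached_head >= size then false
  else begin
    Array.unsafe_set t.array (tail land (size - 1)) (Some x);
    Atomic.set t.tail (tail + 1);
    true
  end
```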
The `mask` can be computed very quickly from the length of the array which is
known to be a power of two.
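For example:

```ocaml
(* For a power-of-two length, [length - 1] is an all-ones bit mask: length
   64 gives 63 = 0b111111, so [i land mask] equals [i mod length]. *)
let mask array = Array.length array - 1
let index array i = i land mask array
```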

Avoiding the indirection means that an allocation, some pointer chasing, and potentially a couple of cache misses are avoided per message: an extra object would need to be written by the producer (a potential cache miss) and read by the consumer (a definite cache miss, because the memory was just written by the producer).  This provides a significant performance improvement now that contention has been reduced.
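Schematically, assuming the earlier layout boxed each element:

```ocaml
(* The boxed layout allocates a cell per push that the consumer must chase
   through; the flat layout stores the element inline, at the cost of
   needing a dummy value or [Obj.magic] for initialization. *)
type 'a boxed_queue = { boxed_slots : 'a option array }
type 'a flat_queue = { flat_slots : 'a array }

let push_boxed q i x = q.boxed_slots.(i) <- Some x  (* allocates a [Some] cell *)
let push_flat q i x = q.flat_slots.(i) <- x         (* writes the value inline *)
```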

Using `Obj.magic ()` with an array that may contain `float` values is prone to be problematic, so it makes sense to test that things work.

This tweaks the code for checking whether the queue is full or empty for a minor improvement.

Also report an out-of-range `size_exponent` as an invalid argument.

OCaml currently has a special case for `float array`.  This means that whenever an array is accessed that might be a `float array`, the compiler generates an extra conditional to treat the array properly.  These conditionals slow down array accesses slightly.  Here we hack around that by specifying that the array is of elements of a "pointer" type.  This way the compiler will generate array accesses with potential write barriers and avoid generating code for the `float array` special case.  This improves performance.
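The effect can be seen in an illustrative contrast (not the PR's exact encoding):

```ocaml
(* A polymorphic access must check at run time whether the array is a flat
   [float array]. *)
let get_generic (a : 'a array) i = a.(i)

(* With an element type statically known not to be [float], no such check
   is generated. *)
let get_string (a : string array) i = a.(i)
```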
This reduces code size at the expense of one or two conditionals.  The time
taken by those conditionals is fairly insignificant.  The reduction in code size
will likely improve performance in real applications, because pressure on the
instruction cache is reduced.

This approach is likely to be useful in various other data structure implementations, and is often preferable to inlining, as it avoids code duplication.
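A hypothetical sketch of the sharing pattern: two operations that differ only in one small step share a helper taking a flag, trading a predictable branch for a single copy of the code:

```ocaml
(* [peek] and [pop] differ only in whether the head index advances, so one
   helper with a flag replaces two nearly identical bodies.  (Empty check
   elided for brevity.) *)
type 'a t = { array : 'a option array; head : int Atomic.t }

let peek_or_pop t ~pop =
  let head = Atomic.get t.head in
  match t.array.(head land (Array.length t.array - 1)) with
  | None -> None
  | Some _ as result ->
    if pop then Atomic.incr t.head;
    result

let peek t = peek_or_pop t ~pop:false
let pop t = peek_or_pop t ~pop:true
```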
This prevents Flambda from noticing that we are potentially storing float values into a non-float array.
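One standard way to hide such a store from the optimizer (an assumption about this commit's mechanism, not confirmed by the source) is `Sys.opaque_identity`:

```ocaml
(* [Sys.opaque_identity] returns its argument while preventing the
   optimizer from propagating what it knows about the value, so Flambda
   cannot conclude that a float is being stored into a non-float array. *)
let hide_from_optimizer x = Sys.opaque_identity x
```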
@lyrm (Collaborator) commented on Jun 10, 2024

The changes proposed in this PR have already been merged in two different PRs (#133 and #135) that separate "safe" and "unsafe" optimizations. Thanks @polytypic for all this work!

@lyrm closed this on Jun 10, 2024