Optimize SPSC queue #121

Closed

polytypic wants to merge 10 commits from the optimize-spsc-queue branch
Conversation

@polytypic (Contributor) commented on Jan 19, 2024

This PR optimizes the SPSC queue.

See the individual commits to better understand the motivation behind the optimizations.

Here is a run of the benchmarks on my M3 Max before the optimizations:

➜  saturn git:(rewrite-bench) ✗ dune exec --release -- ./bench/main.exe -budget 1 Single | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[                                    
  {
    "name": "messages over time/2 workers, capacity 1",
    "value": 6.500299359098922
  },
  {
    "name": "messages over time/2 workers, capacity 8",
    "value": 12.632201914378621
  },
  {
    "name": "messages over time/2 workers, capacity 64",
    "value": 14.90261050891111
  },
  {
    "name": "messages over time/2 workers, capacity 512",
    "value": 14.960086955505428
  },
  {
    "name": "messages over time/2 workers, capacity 4096",
    "value": 20.06888141824778
  },
  {
    "name": "messages over time/2 workers, capacity 32768",
    "value": 22.94386410808166
  }
]

Here is a run of the benchmarks after the optimizations:

➜  saturn git:(optimize-spsc-queue) ✗ dune exec --release -- ./bench/main.exe -budget 1 Single | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[                                    
  {
    "name": "messages over time/2 workers, capacity 1",
    "value": 6.868795322207784
  },
  {
    "name": "messages over time/2 workers, capacity 8",
    "value": 18.420713908583064
  },
  {
    "name": "messages over time/2 workers, capacity 64",
    "value": 30.26195314050038
  },
  {
    "name": "messages over time/2 workers, capacity 512",
    "value": 38.7165046104334
  },
  {
    "name": "messages over time/2 workers, capacity 4096",
    "value": 123.22902197665648
  },
  {
    "name": "messages over time/2 workers, capacity 32768",
    "value": 270.82312457121435
  }
]

TODO:

  • Add comments on the use/lack of fences in the code.

@polytypic changed the title from "WIP: Optimize SPSC queue" to "Optimize SPSC queue" on Jan 19, 2024
@polytypic force-pushed the optimize-spsc-queue branch 20 times, most recently from 859b2dd to 5cc638e, on January 20, 2024
@polytypic marked this pull request as ready for review on January 21, 2024
@polytypic requested a review from a team on January 21, 2024
@polytypic force-pushed the optimize-spsc-queue branch 6 times, most recently from 181b99b to 204f2ce, on January 27, 2024
@lyrm (Collaborator) left a comment

I am unsure whether the commit "Avoid float array pessimization", and a bit of the previous one (removing the indirection of the array and thus doing initialization with Obj.magic), are a good idea. They introduce a lot of Obj.magic, which I feel we should be careful with, especially because it could cause issues with Flambda.

Also, what happens if you actually put floats in the queue? I am guessing the float array special case will not be triggered?

I have no immediate solution for the float array pessimization, but for initialization, we could have a make : size_exponent:int -> 'a -> 'a t function, which requires the user to provide an inhabitant of type 'a, instead of the current create function.

It may just move the issue to the user level, though: users may then want to either use the same trick with Obj.magic or use an option type...
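A rough sketch of what that alternative constructor could look like (field names and the layout here are assumptions for illustration, not the PR's actual code):

```ocaml
(* Hypothetical sketch of the suggested [make]: the caller supplies a dummy
   inhabitant of ['a], so the backing array can be initialized without
   [Obj.magic].  In the real structure, the indices would be atomics. *)
type 'a t = {
  array : 'a array;   (* elements stored directly, no option boxing *)
  dummy : 'a;         (* used to overwrite consumed slots *)
  mutable head : int;
  mutable tail : int;
}

let make ~size_exponent dummy =
  let size = 1 lsl size_exponent in
  { array = Array.make size dummy; dummy; head = 0; tail = 0 }
```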

@lyrm (Collaborator) commented on Feb 7, 2024

Thanks for the PR! All the optimizations (except maybe the ones mentioned in my previous message :p) are great, and thanks again for making the git history so readable. And obviously, the gain in performance is amazing!

@lyrm (Collaborator) commented on Feb 8, 2024

I should have added that the optimization with the caches is very cool!

@polytypic force-pushed the optimize-spsc-queue branch 7 times, most recently from 82cdf25 to 8b0c40d, on February 17, 2024
@polytypic force-pushed the optimize-spsc-queue branch 2 times, most recently from c86dcca to 034c23b, on March 1, 2024
Commit messages:

This should stabilize performance results to some degree, but, as it is, there is still a massive amount of contention, so results will likely still have a lot of randomness.

This is also one of two crucial optimizations. Without this optimization, best performance simply cannot be obtained.

Transparent atomic provides a workaround for the OCaml 5.(0|1) CSE bugs around `Atomic.get`s. In this case, an invalid CSE optimization on `Atomic.get`s can actually break the queue semantics.

Fences are not really needed in this specific queue, with the exception of the updates of the head and tail indices, and even there the fences are only needed to ensure that the other side sees the index update only after the updates to the array.
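A sketch of the producer side under an assumed layout; the point is only the ordering of the two writes:

```ocaml
(* Hypothetical layout for illustration. *)
type 'a t = { array : 'a option array; tail : int Atomic.t }

(* The element is written to the array before the new tail index is
   published.  [Atomic.set] provides the fence, so the consumer can only
   observe the new index after the array update.  (Capacity check elided.) *)
let push t x =
  let tail = Atomic.get t.tail in
  Array.unsafe_set t.array (tail land (Array.length t.array - 1)) (Some x);
  Atomic.set t.tail (tail + 1)
```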

`Atomic.incr` is potentially a little bit faster, because it needs no write
barrier.
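Continuing the sketch above: the producer is the only writer of the tail index, so the publish can equally be an increment of an immediate `int`:

```ocaml
(* [Atomic.incr] is an atomic add of an immediate int, so no write barrier
   is involved. *)
let advance_tail (tail : int Atomic.t) = Atomic.incr tail
```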
Wide pattern matches, even with immutable fields, tend to generate inefficient
code in OCaml at the moment.  Specifically, the compiler tends to generate a
sequence of loads for the matched fields at the start of the function.  The
problem with this is that it increases register pressure, which may lead to
spilling (i.e. extra reads and writes to stack), and may also lead to fetching
some fields unnecessarily (i.e. wasted work) as those fields are only accessed
conditionally.

It is typically better to access immutable fields using dot notation.  The
compiler performs CSE on such loads so repeated accesses of immutable fields do
not necessarily generate repeated loads of those fields.
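An illustrative contrast (not code from the PR):

```ocaml
type r = { a : int array; b : int array; len : int }

(* Wide match: the fields tend to be loaded eagerly at entry, even though
   [b] is needed on only one branch. *)
let first_of_match { a; b; len } = if len = 0 then 0 else a.(0) + b.(0)

(* Dot notation: fields are loaded where used, and CSE merges repeated
   loads of the same immutable field. *)
let first_of_dot r = if r.len = 0 then 0 else r.a.(0) + r.b.(0)
```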
This reduces contention as operations can be performed on either side in
parallel without interference from the other side.

Without this algorithmic change, most of the other optimizations become
irrelevant, because the contention on index accesses would simply limit the peak
performance by an order of magnitude or more.  Try removing this change and
observe the massive drop in peak performance.
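A sketch of the idea with assumed names: each side keeps a plain, non-atomic snapshot of the other side's index and re-reads the shared atomic only when the snapshot suggests the queue looks full (or, on the consumer side, empty):

```ocaml
type 'a t = {
  array : 'a option array;
  tail : int Atomic.t;        (* written by the producer only *)
  head : int Atomic.t;        (* written by the consumer only *)
  mutable cached_head : int;  (* producer-local snapshot of [head] *)
}

(* The shared [head] is only read when the cached snapshot suggests the
   queue is full, so the common case touches no cache line the consumer is
   writing to. *)
let try_push t x =
  let tail = Atomic.get t.tail in
  let size = Array.length t.array in
  if tail - t.cached_head >= size then t.cached_head <- Atomic.get t.head;
  if tail - t.cached_head >= size then false
  else begin
    Array.unsafe_set t.array (tail land (size - 1)) (Some x);
    Atomic.set t.tail (tail + 1);
    true
  end
```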
The `mask` can be computed very quickly from the length of the array which is
known to be a power of two.
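For example:

```ocaml
(* For a power-of-two length, [length - 1] is an all-ones bit mask: length
   64 gives 63 = 0b111111, so [i land mask] equals [i mod length]. *)
let mask array = Array.length array - 1
let index array i = i land mask array
```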

Avoiding the indirection means that an allocation, some pointer chasing, and potentially a couple of cache misses are avoided per message: an extra object would need to be written by the producer (a potential cache miss) and read by the consumer (a definite cache miss, because the memory was just written by the producer).  This provides a significant performance improvement now that contention has been reduced.
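Schematically, assuming the earlier layout boxed each element:

```ocaml
(* The boxed layout allocates a cell per push that the consumer must chase
   through; the flat layout stores the element inline, at the cost of
   needing a dummy value or [Obj.magic] for initialization. *)
type 'a boxed_queue = { boxed_slots : 'a option array }
type 'a flat_queue = { flat_slots : 'a array }

let push_boxed q i x = q.boxed_slots.(i) <- Some x  (* allocates a [Some] cell *)
let push_flat q i x = q.flat_slots.(i) <- x         (* writes the value inline *)
```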

Using `Obj.magic ()` with an array that may contain `float` values is prone to be problematic, so it makes sense to test that things work.

This tweaks the code for checking whether the queue is full or empty for a minor improvement.

Also report an out-of-range `size_exponent` as an invalid argument.

OCaml currently has a special case for `float array`.  This means that whenever an array is accessed that might be a `float array`, the compiler generates an extra conditional to treat the array properly.  These conditionals slow down array accesses slightly.  Here we hack around that by specifying that the array is of elements of a "pointer" type.  This way the compiler will generate array accesses with potential write barriers and avoid generating code for the `float array` special case.  This improves performance.
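The effect can be seen in an illustrative contrast (not the PR's exact encoding):

```ocaml
(* A polymorphic access must check at run time whether the array is a flat
   [float array]. *)
let get_generic (a : 'a array) i = a.(i)

(* With an element type statically known not to be [float], no such check
   is generated. *)
let get_string (a : string array) i = a.(i)
```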
This reduces code size at the expense of one or two conditionals.  The time
taken by those conditionals is fairly insignificant.  The reduction in code size
will likely improve performance in real applications, because pressure on the
instruction cache is reduced.

This approach is likely to be useful in various other data structure implementations, and is often preferable to inlining, as it avoids code duplication.
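A hypothetical sketch of the sharing pattern: two operations that differ only in one small step share a helper taking a flag, trading a predictable branch for a single copy of the code:

```ocaml
(* [peek] and [pop] differ only in whether the head index advances, so one
   helper with a flag replaces two nearly identical bodies.  (Empty check
   elided for brevity.) *)
type 'a t = { array : 'a option array; head : int Atomic.t }

let peek_or_pop t ~pop =
  let head = Atomic.get t.head in
  match t.array.(head land (Array.length t.array - 1)) with
  | None -> None
  | Some _ as result ->
    if pop then Atomic.incr t.head;
    result

let peek t = peek_or_pop t ~pop:false
let pop t = peek_or_pop t ~pop:true
```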
This prevents Flambda from noticing that we are potentially storing float values into a non-float array.
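One standard way to hide such a store from the optimizer (an assumption about this commit's mechanism, not confirmed by the source) is `Sys.opaque_identity`:

```ocaml
(* [Sys.opaque_identity] returns its argument while preventing the
   optimizer from propagating what it knows about the value, so Flambda
   cannot conclude that a float is being stored into a non-float array. *)
let hide_from_optimizer x = Sys.opaque_identity x
```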
@lyrm (Collaborator) commented on Jun 10, 2024

The changes proposed in this PR have already been merged in two different PRs (#133 and #135) that separate "safe" and "unsafe" optimizations. Thanks @polytypic for all this work!

@lyrm closed this on Jun 10, 2024