Optimize work-stealing deque #124

Open · polytypic wants to merge 6 commits into main from optimize-ws-deque
Conversation

@polytypic (Contributor) commented Jan 26, 2024

This PR optimizes the work-stealing deque.

The graphs on current-bench show the progress of optimizing the work-stealing deque quite nicely:

[current-bench graph showing the benchmark progression]

Here is a run of the benchmarks on my M3 Max before the optimizations:

➜  saturn git:(rewrite-bench) ✗ dune exec --release -- ./bench/main.exe -budget 1 'Work' | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[                                    
  {
    "name": "spawns over time/1 worker",
    "value": 49.86378542204304
  },
  {
    "name": "spawns over time/2 workers",
    "value": 44.30827252087946
  },
  {
    "name": "spawns over time/4 workers",
    "value": 84.66239655969726
  },
  {
    "name": "spawns over time/8 workers",
    "value": 177.69250680679536
  }
]

Here is a run after the optimizations:

➜  saturn git:(optimize-ws-deque) ✗ dune exec --release -- ./bench/main.exe -budget 1 'Work' | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[                                     
  {
    "name": "spawns over time/1 worker",
    "value": 61.20210013480126
  },
  {
    "name": "spawns over time/2 workers",
    "value": 121.42420051856216
  },
  {
    "name": "spawns over time/4 workers",
    "value": 235.72974008156237
  },
  {
    "name": "spawns over time/8 workers",
    "value": 428.2572470382373
  }
]

General approach:

  1. Add benchmark(s). In this case the benchmark simulates a scheduler running parallel fibonacci.
  2. Avoid false sharing. With false sharing there is too much noise to get any other useful optimizations done. In this case the top and bottom indices and the root record had to be padded to avoid false sharing (see the sketch after this list). This alone gave a big performance improvement.
  3. Avoid other forms of contention. Contention tends to mask any benefits from further optimizations. In this case there wasn't much of that except for avoiding some unnecessary loads and stores. With parallel data structures, any reads and writes of shared locations (atomic or non-atomic) should be viewed with suspicion.
  4. Apply the usual micro-optimizations: avoid float array pessimization, indirections, costly operations, dependency chains, unnecessary work, and unnecessary fences, and make the common case fast (at the expense of the less common case). Unless the main contention issues are addressed first, micro-optimizations are difficult to make, as noise and stalls from contention mask the improvements.
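
As a sketch of what the padding in step 2 looks like in practice (assuming the multicore-magic library; the record type and min_capacity below are illustrative, while the create function mirrors the diff quoted later in this thread):

type 'a t = {
  top : int Atomic.t;          (* stolen-from end, updated by thieves *)
  bottom : int Atomic.t;       (* owner's end, updated only by the owner *)
  mutable tab : 'a ref array;  (* circular array of boxed slots *)
}

let min_capacity = 16

let create () =
  (* copy_as_padded copies the value with extra padding words so that the
     frequently updated top, bottom, and root record are unlikely to share
     a cache line with each other or with unrelated data. *)
  let top = Atomic.make 0 |> Multicore_magic.copy_as_padded in
  let bottom = Atomic.make 0 |> Multicore_magic.copy_as_padded in
  let tab = Array.make min_capacity (ref (Obj.magic ())) in
  { top; bottom; tab } |> Multicore_magic.copy_as_padded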

@polytypic polytypic force-pushed the optimize-ws-deque branch 19 times, most recently from ff22c9f to 1c98c25 Compare January 27, 2024 08:02
@polytypic polytypic marked this pull request as ready for review January 27, 2024 08:02
@polytypic polytypic requested a review from a team January 27, 2024 08:04
@polytypic polytypic force-pushed the optimize-ws-deque branch 4 times, most recently from 0d2e6ea to 1cfc137 Compare February 17, 2024 17:20
@lyrm (Collaborator) left a comment

Thanks for the PR.

I haven't noticed any issues with these changes. I will push a PR soon with more DSCheck tests for this queue, though. The ones we currently have were written for a much slower version of DSCheck, and we should be able to do a lot more now :)

Comment on lines 86 to 88
(*
   WSDT_dom.neg_agree_test_par ~count
     ~name:"STM Saturn_lockfree.Ws_deque test parallel, negative"; *)
@lyrm (Collaborator) commented:
I would say that this should either be removed, or a note should be added to explain why running such a test is a bad idea.

@polytypic (Contributor, Author) replied:
Yes, the test was causing problems at one point, which is why I commented it out. I was planning to take another look to understand the issue better. The work-stealing deque data structure is specifically designed to take advantage of the asymmetry between the (single) owner and the (multiple) thieves. That negative test specifically uses the work-stealing deque in a manner in which it is not intended to be used. So, I'm inclined to simply remove the test.

@polytypic (Contributor, Author) commented Feb 28, 2024

So, with this change to the code in this PR

modified   src_lockfree/ws_deque.ml
@@ -54,7 +54,7 @@ module M : S = struct
   let create () =
     let top = Atomic.make 0 |> Multicore_magic.copy_as_padded in
     let bottom = Atomic.make 0 |> Multicore_magic.copy_as_padded in
-    let tab = Array.make min_capacity (Obj.magic ()) in
+    let tab = Array.make min_capacity (ref (Obj.magic ())) in
     { top; bottom; tab } |> Multicore_magic.copy_as_padded
 
   let realloc a t b sz new_sz =

the negative test passes.

The original code (before this PR) also has the same behavior:

tab = Atomic.make (CArray.create min_size (Obj.magic ()));

In other words, the circular array is initialized with Obj.magic () values rather than ref (Obj.magic ()) values. I haven't analyzed why exactly the original code doesn't crash, but I suspect it is purely out of luck.

Even with ref (Obj.magic ()), things are not guaranteed to work. If I change the test to use floats, i.e.

     QCheck.make ~print:show_cmd
       (Gen.oneof
          [
-           Gen.map (fun i -> Push i) int_gen;
+           Gen.map (fun i -> Push i) Gen.float;
            Gen.return Pop;
            (*Gen.return Steal;*)

(with corresponding changes to use float instead of int) the negative test will cause a segmentation fault even with ref (Obj.magic ()). I would expect the original code to also exhibit a segmentation fault with this change.

The way I see it, the whole point of this particular data structure is to exploit the asymmetry of having one owner and multiple thieves. If we instead chose to make this data structure safe to use with multiple owners, we would be implementing a fundamentally different data structure (i.e. a multi-consumer, single-producer deque). Implementing a data structure that doesn't cause a segmentation fault but instead returns wrong results in case of misuse is not something I would recommend. So, personally, I would embrace the fact that calling pop in parallel from multiple domains is not safe and would remove the negative test.

@lyrm (Collaborator) commented:

I guess we could also have a "safe" implementation that raises an error in case of wrong use (a thief calling push or pop), but I am guessing this could have quite a cost in terms of performance.

@polytypic (Contributor, Author) replied:

Yes, I would expect the cost (of detecting misuse) to be relatively high if the detection needs to be reliable. It would be great if misuse could be detected for free and an error produced reliably. In all likelihood, detecting misuse reliably would be expensive, and detecting it only partially while avoiding segmentation faults would also likely be expensive and could lead to the program silently giving incorrect results, which I consider worse than crashing.
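
For illustration only, here is a minimal sketch of the kind of check being discussed; the wrapper type and function names are hypothetical and not part of this PR, and, as noted above, such a check only catches the straightforward cases while still adding work to the owner's fast path:

(* Hypothetical wrapper around an existing unchecked deque. *)
type 'a checked = { deque : 'a t; owner : Domain.id }

let create_checked () = { deque = create (); owner = Domain.self () }

let pop_checked { deque; owner } =
  (* pop is an owner-only operation: reject calls from thief domains. *)
  if Domain.self () <> owner then
    invalid_arg "Ws_deque.pop: called from a non-owner domain";
  pop deque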

@polytypic polytypic force-pushed the optimize-ws-deque branch 3 times, most recently from e409e1a to 5bd372d Compare March 3, 2024 12:00
@polytypic polytypic force-pushed the optimize-ws-deque branch 20 times, most recently from 204ae86 to 6052cd0 Compare March 3, 2024 19:24
@polytypic (Contributor, Author) commented Mar 3, 2024

I created a PR (#130) adding new benchmarks for the work-stealing deque, and I added a commit to this PR that optimizes the work-stealing deque to avoid performance pitfalls on those benchmarks.

Here is a run from before adding the commit:

➜  saturn git:(optimize-ws-deque) ✗ dune exec --release -- ./bench/main.exe -budget 1 -diff bench-high.json deque
Saturn_lockfree Work_stealing_deque: 
  time per spawn/1 worker:
    16.60 ns = 0.85 x 19.52 ns
  spawns over time/1 worker:
    60.23 M/s = 1.18 x 51.23 M/s
  time per spawn/2 workers:
    16.73 ns = 0.37 x 45.64 ns
  spawns over time/2 workers:
    119.58 M/s = 2.73 x 43.82 M/s
  time per spawn/4 workers:
    17.38 ns = 0.51 x 34.40 ns
  spawns over time/4 workers:
    230.12 M/s = 1.98 x 116.27 M/s
  time per spawn/8 workers:
    18.65 ns = 0.28 x 65.65 ns
  spawns over time/8 workers:
    429.00 M/s = 3.52 x 121.86 M/s
  time per message/1 adder, 1 taker:
    110.91 ns = 1.07 x 103.71 ns
  messages over time/1 adder, 1 taker:
    18.03 M/s = 0.94 x 19.28 M/s
  time per message/1 adder, 2 takers:
    167.84 ns = 0.90 x 187.33 ns
  messages over time/1 adder, 2 takers:
    17.87 M/s = 1.12 x 16.01 M/s
  time per message/1 adder, 4 takers:
    261.86 ns = 0.94 x 278.85 ns
  messages over time/1 adder, 4 takers:
    19.09 M/s = 1.06 x 17.93 M/s
  time per message/one domain (FIFO):
    16.34 ns = 0.91 x 17.86 ns
  messages over time/one domain (FIFO):
    61.18 M/s = 1.09 x 55.98 M/s
  time per message/one domain (LIFO):
    18.95 ns = 1.00 x 18.89 ns
  messages over time/one domain (LIFO):
    52.78 M/s = 1.00 x 52.93 M/s

Here is a run after the commit:

➜  saturn git:(optimize-ws-deque) ✗ dune exec --release -- ./bench/main.exe -budget 1 -diff bench-high.json deque
Saturn_lockfree Work_stealing_deque: 
  time per spawn/1 worker:
    16.60 ns = 0.85 x 19.52 ns
  spawns over time/1 worker:
    60.26 M/s = 1.18 x 51.23 M/s
  time per spawn/2 workers:
    16.66 ns = 0.37 x 45.64 ns
  spawns over time/2 workers:
    120.04 M/s = 2.74 x 43.82 M/s
  time per spawn/4 workers:
    17.29 ns = 0.50 x 34.40 ns
  spawns over time/4 workers:
    231.33 M/s = 1.99 x 116.27 M/s
  time per spawn/8 workers:
    18.60 ns = 0.28 x 65.65 ns
  spawns over time/8 workers:
    430.18 M/s = 3.53 x 121.86 M/s
  time per message/1 adder, 1 taker:
    35.15 ns = 0.34 x 103.71 ns
  messages over time/1 adder, 1 taker:
    56.89 M/s = 2.95 x 19.28 M/s
  time per message/1 adder, 2 takers:
    47.34 ns = 0.25 x 187.33 ns
  messages over time/1 adder, 2 takers:
    63.37 M/s = 3.96 x 16.01 M/s
  time per message/1 adder, 4 takers:
    85.12 ns = 0.31 x 278.85 ns
  messages over time/1 adder, 4 takers:
    58.74 M/s = 3.28 x 17.93 M/s
  time per message/one domain (FIFO):
    16.61 ns = 0.93 x 17.86 ns
  messages over time/one domain (FIFO):
    60.21 M/s = 1.08 x 55.98 M/s
  time per message/one domain (LIFO):
    15.84 ns = 0.84 x 18.89 ns
  messages over time/one domain (LIFO):
    63.12 M/s = 1.19 x 52.93 M/s

Performance on the SPMC style benchmarks improved significantly.

The optimization in the last commit (caching the index used by thieves) is mentioned in the original paper that introduces the work-stealing deque (section 2.3 Avoid top accesses in pushBottom).
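
For reference, here is a hedged sketch of that technique (the field names top_cache and tab, the grow helper, and the power-of-two capacity are assumptions for illustration, not necessarily what this PR does): the owner keeps a plain, non-atomic lower bound of top and only reads the shared atomic top when the cached bound suggests the circular array might be full.

let push q v =
  let b = Atomic.get q.bottom in
  (* q.top_cache is a lower bound on q.top maintained by the owner, so
     b - q.top_cache is an upper bound on the number of elements.  If that
     still fits, the shared top index does not need to be read at all. *)
  if b - q.top_cache >= Array.length q.tab then begin
    let t = Atomic.get q.top in
    q.top_cache <- t;
    if b - t >= Array.length q.tab then grow q t b
  end;
  (* Power-of-two capacity assumed; element boxing details elided. *)
  Array.unsafe_set q.tab (b land (Array.length q.tab - 1)) v;
  Atomic.set q.bottom (b + 1)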

@polytypic polytypic force-pushed the optimize-ws-deque branch 2 times, most recently from d36ac07 to 8eb58ee Compare March 13, 2024 07:46
This should stabilize and improve performance.
This avoids unnecessarily raising and catching an exception when an option
result is requested.  The cost is a single cheap conditional (see the sketch below).
This reduces contention as operations can be performed on either side in
parallel without interference from the other side.
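
A sketch of the pattern described in the second commit message above (pop_internal and the Empty exception are illustrative names): the internal operation returns an option directly, the optional variant just passes it through, and the raising variant pays only a single match instead of raising inside and catching outside.

exception Empty

(* pop_internal stands in for the real owner-side pop returning an option. *)
let pop_opt q = pop_internal q

let pop q =
  match pop_internal q with
  | Some v -> v
  | None -> raise Empty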