
Optimize Michael-Scott queue and remove snapshot mechanism #122

Open
wants to merge 3 commits into base: main

Conversation

@polytypic (Contributor) commented Jan 21, 2024

This PR optimizes the Michael-Scott queue implementation. The main optimizations are avoiding false sharing and optimizing the node representation, which effectively halves the length of the linked list of nodes by "inlining" the atomic into the node record and dramatically improves performance.

This PR also removes the snapshot mechanism. The snapshot mechanism is not used in any tests in this repository. Furthermore, the snapshot mechanism cannot be made to work efficiently with the traditional Michael-Scott queue representation, where the head points to a node whose value has already been taken from the queue. That is because, to avoid space leaks, the value in the head node must be overwritten, and that simply breaks the snapshot mechanism. The previous implementation avoided this problem by using an inefficient node representation that effectively made the linked list of nodes twice as long and operations on the queue twice as slow.

Here is a run of the benchmarks on my M3 Max before the optimizations:

➜  saturn git:(rewrite-bench) ✗ dune exec --release -- ./bench/main.exe -budget 1 'free Queue' | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[                                    
  {
    "name": "messages over time/one domain",
    "value": 25.832570841950652
  },
  {
    "name": "messages over time/1 nb adder, 1 nb taker",
    "value": 30.775546858386015
  },
  {
    "name": "messages over time/1 nb adder, 2 nb takers",
    "value": 28.61298006857959
  },
  {
    "name": "messages over time/2 nb adders, 1 nb taker",
    "value": 26.324158773037244
  },
  {
    "name": "messages over time/2 nb adders, 2 nb takers",
    "value": 26.72249173339718
  }
]

And here after the optimizations:

➜  saturn git:(optimize-michael-scott-queue) ✗ dune exec --release -- ./bench/main.exe -budget 1 'free Queue' | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[                                    
  {
    "name": "messages over time/one domain",
    "value": 49.09980418998089
  },
  {
    "name": "messages over time/1 nb adder, 1 nb taker",
    "value": 98.92424842899835
  },
  {
    "name": "messages over time/1 nb adder, 2 nb takers",
    "value": 73.48287261205158
  },
  {
    "name": "messages over time/2 nb adders, 1 nb taker",
    "value": 79.05834136653924
  },
  {
    "name": "messages over time/2 nb adders, 2 nb takers",
    "value": 70.05359476527885
  }
]

Here is a run of the benchmarks without the inlined node representation:

➜  saturn git:(optimize-michael-scott-queue) ✗ dune exec --release -- ./bench/main.exe -budget 1 'free Queue' | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[                                    
  {
    "name": "messages over time/one domain",
    "value": 27.799662289702503
  },
  {
    "name": "messages over time/1 nb adder, 1 nb taker",
    "value": 46.095344138917085
  },
  {
    "name": "messages over time/1 nb adder, 2 nb takers",
    "value": 42.19755884308922
  },
  {
    "name": "messages over time/2 nb adders, 1 nb taker",
    "value": 63.98353506270374
  },
  {
    "name": "messages over time/2 nb adders, 2 nb takers",
    "value": 58.21147004626938
  }
]

As you can see, on the one domain and 1 nb adder, 1 nb taker configurations the inlined node representation is roughly twice as fast! On the more contended configurations, the additional cores likely help to overcome some of the overhead of the non-inlined representation, while the additional contention brings down the performance of the inlined representation, so the difference is not quite as dramatic.

You can run cp test/michael_scott_queue/michael_scott_queue_node.ml src_lockfree or git reset --hard HEAD~1 to use the non-inlined node representation:

modified   src_lockfree/michael_scott_queue_node.ml
@@ -3,12 +3,10 @@ module Atomic = Transparent_atomic
 type ('a, _) t =
   | Nil : ('a, [> `Nil ]) t
   | Next : {
-      mutable next : ('a, [ `Nil | `Next ]) t;
+      next : ('a, [ `Nil | `Next ]) t Atomic.t;
       mutable value : 'a;
     }
       -> ('a, [> `Next ]) t
 
-let[@inline] make value = Next { next = Nil; value }
-
-external as_atomic : ('a, [ `Next ]) t -> ('a, [ `Nil | `Next ]) t Atomic.t
-  = "%identity"
+let[@inline] make value = Next { next = Atomic.make Nil; value }
+let[@inline] as_atomic (Next r : ('a, [ `Next ]) t) = r.next
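The diff above can be condensed into a runnable toy that shows why inlining halves the list. This is an illustrative sketch, not the PR's code: the constructor names are invented so both versions can sit side by side, and the toy replaces the PR's unsafe identity cast with an ordinary atomic field in the boxed version.

```ocaml
(* In the boxed version every element costs two heap blocks: the [Next]
   record plus a separate [Atomic.t] cell holding the link.  In the
   inlined version the mutable [next] field itself is the atomic location
   (the PR reaches it through an unsafe %identity cast), so each element
   is a single block and the linked list is effectively half as long. *)
type 'a boxed =
  | B_nil
  | B_next of { next : 'a boxed Atomic.t; mutable value : 'a }

type 'a inlined =
  | I_nil
  | I_next of { mutable next : 'a inlined; mutable value : 'a }
```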

@polytypic force-pushed the optimize-michael-scott-queue branch 9 times, most recently from 572bf7e to 4026c70 on January 22, 2024 10:23
@polytypic marked this pull request as ready for review January 22, 2024 10:57
@polytypic requested a review from a team January 22, 2024 11:25
@polytypic changed the title from "Optimize Michael-Scott queue" to "Optimize Michael-Scott queue and remove snapshot mechanism" Jan 22, 2024
@polytypic force-pushed the optimize-michael-scott-queue branch 8 times, most recently from 2e44242 to 311adcf on January 22, 2024 20:05
@polytypic mentioned this pull request Jan 23, 2024
@lyrm (Collaborator) left a comment

Thanks for the PR!
I made a few minor comments (including one related to the use of Obj.magic, which seems safer here than in the Spsc_queue optimization PR).

I really like the refactoring work to factorize the exception/option versions of a function!

Comment on lines +23 to +24
head : ('a, [ `Next ]) Node.t Atomic.t;
tail : ('a, [ `Next ]) Node.t Atomic.t;

I tried to put ('a option, [ `Next ]) Node.t Atomic.t as the type of both the head and the tail and, with minimal changes below, the code works:

  • without Obj.magic
  • with (roughly) a 10% decrease in performance (I did not test it much, so it may fall within the error margin).

Also, here Obj.magic is used for a value that should never be read, so this seems to be an acceptable use for it!
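The alternative the reviewer describes could look roughly like the following sketch. This is hypothetical, not code from the PR: the simplified node type and the take_value helper are invented for illustration, and the real queue clears the head slot during pop rather than through a standalone function.

```ocaml
(* Option-typed values: a consumed head slot is overwritten with [None],
   so no unsafe [Obj.magic] dummy value is ever needed. *)
type 'a node =
  | Nil
  | Next of { mutable next : 'a node; mutable value : 'a option }

let take_value (n : 'a node) : 'a option =
  match n with
  | Nil -> None
  | Next r ->
      let v = r.value in
      r.value <- None; (* free the value to avoid a space leak *)
      v
```

The cost is one extra Some box per element, which is consistent with the roughly 10% slowdown the reviewer measured.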

loop b
in
loop b
type ('a, _) poly = Option : ('a, 'a option) poly | Value : ('a, 'a) poly
Nice way of refactoring to avoid copy/paste code! I will remember it :)
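The trick being praised can be sketched in isolation. This toy is not the PR's code: the container is a plain list ref instead of the lock-free queue, and Empty, pop_opt and pop_exn are illustrative names. The point is that one pop_as implementation produces either an option or a raising result, selected at the call site by the GADT witness.

```ocaml
(* A GADT witness picks the result type [r] of the shared implementation:
   [Option] makes it return ['a option], [Value] makes it return ['a]
   and raise on empty.  This avoids duplicating the whole function. *)
type ('a, _) poly = Option : ('a, 'a option) poly | Value : ('a, 'a) poly

exception Empty

let pop_as : type a r. a list ref -> (a, r) poly -> r =
 fun stack poly ->
  match !stack with
  | [] -> ( match poly with Option -> None | Value -> raise Empty)
  | x :: xs -> (
      stack := xs;
      match poly with Option -> Some x | Value -> x)

let pop_opt stack = pop_as stack Option
let pop_exn stack = pop_as stack Value
```

Because the witness is a constructor rather than a closure, matching on it compiles to a cheap tag test and does not block inlining the way passing a wrapper function would.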


let peek_opt { head; _ } =
let rec pop_as :

Should this function be [@inline]? (As well as the other _as functions below.) I guess you used a GADT instead of a function to avoid the issues inlining has with functions.

let[@inline] as_atomic (Next r : ('a, [ `Next ]) t) = r.next
let[@inline] make value = Next { next = Nil; value }

external as_atomic : ('a, [ `Next ]) t -> ('a, [ `Nil | `Next ]) t Atomic.t

next_as_atomic seems like a better name for readability (and I think it is the name you usually use). Also, I would not be against a comment in the code about this line :)

@polytypic force-pushed the optimize-michael-scott-queue branch 4 times, most recently from 513ec24 to 8b3d673 on February 17, 2024 17:21
@polytypic force-pushed the optimize-michael-scott-queue branch 3 times, most recently from 61bc326 to ceb8e26 on March 1, 2024 13:13
@polytypic force-pushed the optimize-michael-scott-queue branch from ceb8e26 to d495cb1 on March 1, 2024 13:50
@polytypic force-pushed the optimize-michael-scott-queue branch from d495cb1 to f580b8e on March 13, 2024 07:46
This moves the node representation to a separate file so that it is easy
to change only the node representation.

This also changes the queue representation to be closer to the traditional
Michael-Scott queue where the head points to a node whose value is no longer
logically part of the queue.  The "snapshot" mechanism cannot work with this
representation, because the nodes must be mutated as they are logically removed
from the queue to free the values to avoid space leaks.

This commit also implements various micro-optimizations and tweaks to the
implementation that improve performance.  For example, backoffs are added after
failed CAS operations, which improves performance in case of contention.
This effectively halves the length of the linked list of nodes and roughly
doubles the throughput of the queue.
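The backoff-after-failed-CAS tweak mentioned above can be sketched as follows. This is a simplified stand-in, not Saturn's code: the real implementation uses a dedicated backoff module, and the function name and delay constants here are invented; Domain.cpu_relax requires OCaml 5.

```ocaml
(* On a failed [compare_and_set], spin for an exponentially growing
   number of [Domain.cpu_relax] calls before retrying.  Spreading out
   the retries reduces cache-line ping-pong under contention. *)
let rec update_with_backoff ?(delay = 1) (cell : int Atomic.t) (f : int -> int) =
  let seen = Atomic.get cell in
  if not (Atomic.compare_and_set cell seen (f seen)) then begin
    for _ = 1 to delay do
      Domain.cpu_relax ()
    done;
    (* Exponential backoff, capped so waits stay bounded. *)
    update_with_backoff ~delay:(min (delay * 2) 256) cell f
  end
```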
2 participants