MPMC unbounded queue #35

Open · wants to merge 7 commits into main

Conversation

@art-w commented Nov 21, 2022

This is essentially a Michael-Scott queue with (bounded) fetch-and-add (FAD) queues inside, inspired by @bartoszmodelski's version :) (a rough sketch of the overall shape follows the list below)

  • Performance should be reasonable, since most operations only hit the fast FAD queues!
  • The relaxed-FIFO issue of FAD queues is avoided by not using them as circular buffers: cells are never reused, so only one push and one pop ever touch each cell.
  • The MS-queue wrapper provides support for resizing when the last FAD queue is full.
  • There's a bit of tension around allocating the next FAD queue array, because it takes an arbitrary amount of time during which pushing is not possible. As a tiny hack, the next array is preallocated as soon as possible (with tail = -1), and over-allocations are compensated by keeping the spare arrays for the future (in gift_rest).
  • When pop fails, a tombstone is used to indicate the failure to the corresponding push... I didn't have a lot of success trying to roll back the head, but that's mostly due to my unrealistic benchmarks. However, I would like to exploit this design flaw as an alternative implementation for domainslib Channels: by using its Task algebraic effects, I think we could replace the Tombstone by an Awaiting_push of ('a -> unit) such that the popping task would be suspended until a push has completed. Feedback on this random idea would be much appreciated :)
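
As a rough sketch of the overall shape (hypothetical names and heavy simplification; the PR's actual types differ), each node of the Michael-Scott list is a fixed-size array with cursors bumped by fetch-and-add:

type 'a segment = {
  cells : 'a option Atomic.t array;
      (* cells are never reused: exactly one push and one pop touch each *)
  tail : int Atomic.t;  (* next free cell, reserved by pushers with FAD *)
  head : int Atomic.t;  (* next cell to take, reserved by poppers with FAD *)
  next : 'a segment option Atomic.t;  (* Michael-Scott style link *)
}

type 'a t = {
  push_to : 'a segment Atomic.t;   (* last segment, receives pushes *)
  pop_from : 'a segment Atomic.t;  (* first segment, serves pops *)
}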

Some things that still need work:

  • I reused the existing SPSC benchmark for this PR, but a proper "MPMC" benchmark would be nice (... I only have some unstable benchmarks atm to confirm that performance is not absurdly bad)
  • I added a test using qcheck-lin to show that I did my homework, but I'm guessing this will cause some circular dependency issue as multicoretests depends on lockfree :/

@bartoszmodelski (Contributor)

Looks very cool!

You might find some useful MPMC benchmarks in: https://github.com/ocaml-multicore/lockfree/pull/24/files

@art-w (Author) commented Mar 20, 2023

Rebased, integrated with the nice SPSC/MPMC benchmarks, and optimized a bit. It looks rather competitive on my computer (median of 20 iterations to push/pop 2 million elements evenly distributed amongst domains):

{"name":"spsc-queue", "time":0.197151, "throughput":10144499.051894}
{"name":"mpmc-queue", "time":0.04201,  "throughput":47607634.376259}

{"name":"mpmc-relaxed-fad-pushers:1,takers:1", "time":0.337243, "throughput": 6226962.460233}
{"name":"mpmc-relaxed-cas-pushers:1,takers:1", "time":0.253526, "throughput": 8283175.015164}
{"name":  "mpmc-unbounded-pushers:1,takers:1", "time":0.060743, "throughput":34571831.616132}

{"name":"mpmc-relaxed-fad-pushers:4,takers:4", "time":0.714718, "throughput":2938221.363058}
{"name":"mpmc-relaxed-cas-pushers:4,takers:4", "time":0.545062, "throughput":3852772.251762}
{"name":  "mpmc-unbounded-pushers:4,takers:4", "time":0.240097, "throughput":8746463.298379}

{"name":"mpmc-relaxed-fad-pushers:7,takers:1", "time":1.865645, "throughput":1125616.114949}
{"name":"mpmc-relaxed-cas-pushers:7,takers:1", "time":0.612,    "throughput":3431372.613185}
{"name":  "mpmc-unbounded-pushers:7,takers:1", "time":0.548163, "throughput":3830977.721448}

{"name":"mpmc-relaxed-fad-pushers:1,takers:7", "time":1.577434, "throughput":1331275.930521}
{"name":"mpmc-relaxed-cas-pushers:1,takers:7", "time":0.98371,  "throughput":2134775.383006}
{"name":  "mpmc-unbounded-pushers:1,takers:7", "time":0.877693, "throughput":2392635.669022}

(The extreme 1-vs-7 setups can have a lot of noise caused by the scheduler; it would be interesting to run the benchmarks in a sanitized environment..)

The optimization relies on this message-passing section from the memory model, exploiting the fact that exactly one "push" domain and one "pop" domain will ever try to access a specific cell of the queue array... and so we can allocate fewer Atomic.t by packing status bits together (we technically need only one bit per cell, but use two to handle carries induced by the FAD operation; see the sketch below). (cc @lyrm since we briefly talked about this ^^)
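
A minimal sketch of the packing trick (hypothetical names; the actual PR code differs): each cell owns a 2-bit field inside a shared int Atomic.t, and the pusher and the popper of that cell each bump the field once with fetch-and-add. The field never exceeds 2, so a FAD cannot carry into a neighbouring cell's field, which is why two bits are used even though only one bit of information is needed:

let bits_per_cell = 2
let cells_per_word = Sys.int_size / bits_per_cell

let make_statuses n_cells =
  Array.init
    ((n_cells + cells_per_word - 1) / cells_per_word)
    (fun _ -> Atomic.make 0)

(* Called once by the pusher and once by the popper of cell [i]:
   returns [true] iff the other side already marked the cell. *)
let mark statuses i =
  let word = statuses.(i / cells_per_word) in
  let shift = bits_per_cell * (i mod cells_per_word) in
  let before = Atomic.fetch_and_add word (1 lsl shift) in
  (before lsr shift) land 0b11 <> 0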

@art-w changed the title from "(draft) MPMC unbounded queue" to "MPMC unbounded queue" on Mar 20, 2023
@polytypic (Contributor)

Is there a benchmark comparing against the Michael-Scott queue? It would be interesting to see the relative performance.

@art-w (Author) commented Mar 22, 2023

Sure, I also added the Ws_deque to the benchmarks! Indeed the Michael-Scott queue looks pretty good:

| One producer | Michael-Scott | Relaxed FAD | Relaxed CAS | Unbounded | Work-stealing |
|---|---|---|---|---|---|
| pushers:1 takers:1 | 0.131434s | 0.329778s | 0.21065s | 0.059725s | 0.272594s |
| pushers:1 takers:2 | 0.165976s | 0.248677s | 0.397321s | 0.199688s | 0.482286s |
| pushers:1 takers:3 | 0.177053s | 0.316906s | 0.469243s | 0.391438s | 0.410644s |
| pushers:1 takers:4 | 0.200289s | 0.298454s | 0.555292s | 0.509853s | 0.544786s |
| pushers:1 takers:5 | 0.218715s | 0.333596s | 0.376887s | 0.628065s | 0.365196s |
| pushers:1 takers:6 | 0.251839s | 0.355936s | 0.459976s | 0.740695s | 0.527715s |
| pushers:1 takers:7 | 0.899175s | 1.134269s | 1.006196s | 0.880158s | 0.673532s |

And we can see that the unbounded MPMC queue is terrible when consumers spinlock to pop. Producers don't have this issue:

| One consumer | Michael-Scott | Relaxed FAD | Relaxed CAS | Unbounded |
|---|---|---|---|---|
| pushers:2 takers:1 | 0.136905s | 0.480398s | 0.41395s | 0.182891s |
| pushers:3 takers:1 | 0.255042s | 0.475786s | 0.341128s | 0.153577s |
| pushers:4 takers:1 | 0.260206s | 0.508386s | 0.434022s | 0.122285s |
| pushers:5 takers:1 | 0.149528s | 0.388776s | 0.449589s | 0.101665s |
| pushers:6 takers:1 | 0.149537s | 0.447404s | 0.546314s | 0.110501s |
| pushers:7 takers:1 | 1.632871s | 1.94816s | 0.564839s | 0.532245s |

(... I clearly don't have a true 8-core machine ;) )

Finally, the Michael-Scott queue compares less favorably on a balanced setup as contention grows:

| Balanced | Michael-Scott | Relaxed FAD | Relaxed CAS | Unbounded |
|---|---|---|---|---|
| pushers:2 takers:2 | 0.26675s | 0.256043s | 0.371672s | 0.19826s |
| pushers:3 takers:3 | 0.520373s | 0.183438s | 0.397926s | 0.178163s |
| pushers:4 takers:4 | 0.754701s | 0.640411s | 0.539897s | 0.235253s |

@lyrm (Collaborator) commented Mar 22, 2023

I tried this very basic dscheck test with a push and a pop, and it does not finish (at least not in a reasonable amount of time).

let producer_consumer () =
  Atomic.trace (fun () ->
      let queue = Mpmc_queue.make ~dummy:1 () in

      (* producer *)
      Atomic.spawn (fun () -> Mpmc_queue.push queue 0);

      (* consumer *)
      let popped = ref None in
      Atomic.spawn (fun () -> popped := Mpmc_queue.pop queue);

      (* checks *)
      Atomic.final (fun () ->
          Atomic.check (fun () ->
              let remaining = Mpmc_queue.pop queue in
              match (!popped, remaining) with
              | None, Some 0 | Some 0, None -> true
              | _, _ -> false)))

@polytypic (Contributor)

I also wonder how the results might change with the change shown here. In my quick tests it gave a noticeable improvement and made benchmark results significantly more stable (on Apple M1). Similar changes to other queue structures could give improvements as well.

@art-w (Author) commented Mar 22, 2023

Thanks for testing it @lyrm! But which version of dscheck are you using? (I'm unable to reproduce with its main branch or ocaml-multicore/dscheck#3; on the contrary, dscheck terminates after exploring only one trace, which is super wrong of it)

Ha wait I'm dumb.

@lyrm (Collaborator) commented Mar 22, 2023

I am using the latest released version. It does actually finish, but it takes several minutes, which seems like a lot for a single push and a single pop.

@polytypic (Contributor) commented Mar 22, 2023

For benchmarking fun, here is a magically faster variation of the Michael-Scott queue:

module Backoff = Kcas.Backoff

type 'a node = { next : 'a node; mutable value : 'a }

external next_as_atomic : 'a node -> 'a node Atomic.t = "%identity"

type 'a t = {
  head : 'a node Atomic.t Atomic.t;
  tail : 'a node Atomic.t Atomic.t;
}

let create () =
  let next = Atomic.make (Obj.magic ()) in
  Multicore_magic.copy_as_padded
    {
      head = Multicore_magic.copy_as_padded @@ Atomic.make next;
      tail = Multicore_magic.copy_as_padded @@ Atomic.make next;
    }

let rec pop backoff head =
  let old_head = Multicore_magic.fenceless_get head in
  let first = Multicore_magic.fenceless_get old_head in
  if first == Obj.magic () then None
  else if Atomic.compare_and_set head old_head (next_as_atomic first) then (
    let value = first.value in
    first.value <- Obj.magic ();
    Some value)
  else pop (Backoff.once backoff) head

let pop { head; _ } = pop Backoff.default head [@@inline]

let rec fix_tail tail (new_tail : 'a node Atomic.t) =
  let old_tail = Atomic.get tail in
  if
    Atomic.get new_tail == Obj.magic ()
    && not (Atomic.compare_and_set tail old_tail new_tail)
  then fix_tail tail new_tail

let rec push (new_node : 'a node) tail (old_tail : 'a node Atomic.t) =
  if not (Atomic.compare_and_set old_tail (Obj.magic ()) new_node) then
    push new_node tail (next_as_atomic (Atomic.get old_tail))
  else if not (Atomic.compare_and_set tail old_tail (next_as_atomic new_node))
  then fix_tail tail (next_as_atomic new_node)

let push { tail; _ } value =
  let new_node = { next = Obj.magic (); value } in
  push new_node tail (Atomic.get tail)
  [@@inline]

With this variation I got

Optimized Lockfree.MMSQueue: mean = 0.003840, sd = 0.000000 tp=78125617.782193

from the same benchmark as here. This is basically about 3 to 4 times faster than the lockfree library version of the Michael-Scott queue from a couple of weeks ago.

EDIT: I modified the code to use a next_as_atomic cast instead of Obj.magic.

@art-w (Author) commented Mar 22, 2023

@lyrm > My bad, thanks a lot! There's a small window for a livelock when the two domains interleave perfectly:

  1. push starts and increments the tail to reserve an array cell
  2. pop attempts to grab that element, but discovers that push has not completed yet, so it leaves a tombstone behind and prepares to recurse. The recursion should now discover that the queue is empty, but...
  3. push attempts to write its element, sees the tombstone, and goes back to step 1.

In this scenario, pop never sees the queue as empty and push never gets the opportunity to complete (a heavily simplified sketch of the interplay is below). I'll need a bit more time to fix this one hmhm.
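
To make the window concrete, here is a heavily simplified sketch (hypothetical names, ignoring resizing) of the push/pop interplay:

type 'a cell = Empty | Value of 'a | Tombstone

type 'a t = {
  cells : 'a cell Atomic.t array;
  head : int Atomic.t;
  tail : int Atomic.t;
}

let rec pop t =
  if Atomic.get t.head >= Atomic.get t.tail then None (* queue looks empty *)
  else
    let i = Atomic.fetch_and_add t.head 1 in
    match Atomic.exchange t.cells.(i) Tombstone with
    | Value v -> Some v
    | Empty | Tombstone ->
        pop t (* push still in flight: poison the cell and retry *)

let rec push t v =
  let i = Atomic.fetch_and_add t.tail 1 in (* step 1: reserve a cell *)
  if not (Atomic.compare_and_set t.cells.(i) Empty (Value v)) then
    push t v (* step 3: a racing pop tombstoned our cell, back to step 1 *)

Under a perfect interleaving, every retry of push re-increments the tail before pop re-checks for emptiness, so pop keeps tombstoning the cell that push just reserved and neither side makes progress.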

@polytypic > Wow that's fast! I did not benchmark your PR but the code you just posted: (I also added your backoff to the consumers' spinlock, hence the more reasonable numbers with a single producer)

| | Magic Michael-Scott | Unbounded |
|---|---|---|
| pushers:1 takers:1 | 0.06981s | 0.049369s |
| pushers:1 takers:2 | 0.071409s | 0.138905s |
| pushers:1 takers:3 | 0.09788s | 0.136138s |
| pushers:1 takers:4 | 0.09283s | 0.19781s |
| pushers:1 takers:5 | 0.091467s | 0.153733s |
| pushers:1 takers:6 | 0.088874s | 0.19984s |
| pushers:2 takers:1 | 0.068123s | 0.122823s |
| pushers:3 takers:1 | 0.076451s | 0.115788s |
| pushers:4 takers:1 | 0.0887s | 0.089226s |
| pushers:5 takers:1 | 0.08625s | 0.095281s |
| pushers:6 takers:1 | 0.08588s | 0.08793s |
| pushers:2 takers:2 | 0.086222s | 0.136547s |
| pushers:3 takers:3 | 0.093071s | 0.127035s |

My PR is essentially an unbounded Michael-Scott queue with bounded FAD queues in place of your single mutable value : 'a, so perhaps I can steal some optimizations from you? ^^'

Is there anything blocking your implementation from becoming the de facto standard? I think the benchmarks make it clear that the specialized {M,S}p{M,S}c / work-stealing implementations perform a lot worse than your generic solution... This is crazy though:

let rec push ... (old_tail : 'a node Atomic.t) =
  if ... then push ... (Atomic.get old_tail |> Obj.magic)

(for completeness, but I think those numbers are useless on my system:)

| 8 cores | Magic Michael-Scott | Unbounded |
|---|---|---|
| pushers:1 takers:7 | 0.484896s | 0.230395s |
| pushers:7 takers:1 | 0.982056s | 1.20984s |
| pushers:4 takers:4 | 0.394102s | 0.38964s |

@polytypic (Contributor)

> Is there anything blocking your implementation from becoming the de facto standard?

Well... I don't know. It is perhaps a bit too "magical" for most people's taste. I wanted to see how fast things could be with the MS queue. BTW, there is at least one minor optimization that could still be done. A level of indirection could be removed from the queue itself by inlining the head atomic as is done with the nodes.

Ideally OCaml would, at some point, offer the ability to have atomic fields in records, provide some way to allocate cache line aligned blocks, and also allow fenceless operations on atomic fields. Then it would be possible to implement the same (and better) without using magic/unsafe features. I hope experiments like this provide motivation towards having such features in OCaml.

Some of the Obj.magic usages could be replaced e.g. with external ... = "%identity" to make them perhaps a bit more palatable:

external next_as_atomic : 'a node -> 'a node Atomic.t = "%identity"
external head_as_atomic : 'a t -> 'a node Atomic.t = "%identity" (* the additional optimization mentioned above *)

@polytypic (Contributor) commented Mar 24, 2023

Here is a version of the optimized Michael-Scott style queue that avoids most uses of Obj.magic:

module Backoff = Kcas.Backoff
(* The [Backoff] from the kcas library does not perform allocations, so it has
   slightly lower overhead than the [Backoff] in the lockfree library. *)

type 'a node = Nil | Node of { next : 'a node; mutable value : 'a }

external next_as_atomic : 'a node -> 'a node Atomic.t = "%identity"
(* Ideally one should be able to say that the [next] field is [atomic next: 'a
   node], but we can't in OCaml.  As the [next] field is the first field in a
   [Node { ... }] it happens to be at the same location it would be in a ['a
   node Atomic.t], so we use an unsafe cast to access the [next] field as an
   atomic. *)

type 'a t = {
  head : 'a node Atomic.t Atomic.t;
  tail : 'a node Atomic.t Atomic.t;
}

let create () =
  let next = Atomic.make Nil in
  (* We use explicit padding to ensure that the [head] and [tail] atomics do not
     suffer from false sharing.  This improves performance, because [pop] only
     accesses the [head] and [push] only accesses the [tail].  With false
     sharing those accesses would unnecessarily cause contention. *)
  Multicore_magic.copy_as_padded
    {
      head = Multicore_magic.copy_as_padded @@ Atomic.make next;
      tail = Multicore_magic.copy_as_padded @@ Atomic.make next;
    }

let rec pop backoff head =
  (* We can safely use [fenceless_get] operations here, because the accesses are
     dependent (i.e. the order of memory accesses is fixed anyway).  The
     difference that [Atomic.get] makes is that it does not allow accesses that
     are later in the program to be performed before it, but in this case it is
     not possible anyway. *)
  let old_head = Multicore_magic.fenceless_get head in
  match Multicore_magic.fenceless_get old_head with
  | Nil -> None
  | Node node as first ->
      if Atomic.compare_and_set head old_head (next_as_atomic first) then (
        let value = node.value in
        node.value <- Obj.magic ();
        (* At this point we've acquired a value.  The queue will still point to
           the node, so we must make sure that the queue no longer transitively
           points to the value - otherwise we'd have a space leak. *)
        Some value)
      else pop (Backoff.once backoff) head

let pop { head; _ } =
  (* We use the recursive worker / non-recursive wrapper idiom and instruct the
     compiler to always inline the wrapper. *)
  pop Backoff.default head
  [@@inline]

let rec fix_tail tail (new_tail : 'a node Atomic.t) =
  let old_tail = Atomic.get tail in
  if
    Atomic.get new_tail == Nil
    && not (Atomic.compare_and_set tail old_tail new_tail)
  then fix_tail tail new_tail

let rec push (new_node : 'a node) tail (old_tail : 'a node Atomic.t) =
  if not (Atomic.compare_and_set old_tail Nil new_node) then
    push new_node tail (next_as_atomic (Atomic.get old_tail))
  else if not (Atomic.compare_and_set tail old_tail (next_as_atomic new_node))
  then fix_tail tail (next_as_atomic new_node)

let push { tail; _ } value =
  (* We use the recursive worker / non-recursive wrapper idiom and instruct the
     compiler to always inline the wrapper. *)
  let new_node = Node { next = Nil; value } in
  push new_node tail (Atomic.get tail)
  [@@inline]

(* BTW, some of the other [Atomic.get]s could also be [fenceless_get]s, but I
   didn't see improvements from those, so I left them as [Atomic.get]s. *)
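
For reference, a minimal sanity check of the create/push/pop API defined above:

let () =
  let q = create () in
  push q 1;
  push q 2;
  assert (pop q = Some 1);
  assert (pop q = Some 2);
  assert (pop q = None)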

BTW, I don't think the above uses less magic. It just uses different spells.

@art-w (Author) commented Mar 24, 2023

Nice, it looks good!

  • Why do you need to pad the root in create? Can it create contention to access the record field head/next if an Atomic is nearby?
let create () = ...
  Multicore_magic.copy_as_padded { head = ... ; tail = ... }
  • I'm not sure I understand how the node is still reachable after a pop and why we need to clear its value?
        node.value <- Obj.magic ();
        (* At this point we've acquired a value.  The queue will still point to
           the node, so we must make sure that the queue no longer transitively
           points to the value - otherwise we'd have a space leak. *)

(I read through the issue/PR addressing this, but I think the situation is different? In your version, at rest, the head should point to live nodes and the tail should point to Nil?)

Btw I'm happy if we can close this PR in favor of your implementation :) (I played a bit more with the FAD queue but I don't think it can achieve the same performance as yours!)

@polytypic (Contributor)

If you ask me, I think it is good to have different implementations. I think that is also part of the idea of this repository. Also, I hope that OCaml will get more support for Atomics in the future and then it is likely that the relative performance of various implementations changes. So, my idea wasn’t to prevent this work. I noticed the possibility of some of the MS queue optimizations much earlier and just thought it might be interesting to try them now.

@polytypic (Contributor) commented Mar 24, 2023

> Why do you need to pad the root in create? Can it create contention to access the record field head/next if an Atomic is nearby?

Yes, if the root shares cache lines with the atomics (or anything else that is mutated). Specifically, the accesses in the pattern matches, { head; _ } and { tail; _ }, can suffer from contention if the fields happen to be in cache lines whose contents are being mutated. It is all about how shared memory works.

Ideally OCaml would provide means to allocate things in a cache-line-granular manner, so that a specific heap block would be guaranteed not to share cache lines with any other heap block. The padding provided by multicore-magic cannot guarantee that, but if you use it with all "long lived" blocks, then it should prevent almost all harmful false sharing. Cases where false sharing happens between short-lived objects are usually not a problem, and it is usually better to just allocate short-lived objects without extra padding/alignment to reduce heap pressure.

Perhaps future OCaml could have something like this:

type 'a node =
  | Nil
  | Node of {
      atomic next: 'a node;  (* A node is short lived, so prefer to reduce heap pressure *)
      mutable value: 'a;
    }

type 'a t =
  {
    atomic head : 'a node;
    [@@align_to_cache_line] atomic tail : 'a node;
  }
  (* A queue is likely long lived and having head and tail in their own cache lines avoids contention *)

> (I read through the issue/PR addressing this, but I think the situation is different? In your version, at rest, the head should point to live nodes and the tail should point to Nil?)

In the magically optimized implementation the next field is inseparable from the value field (they are allocated together as a Node { ... }). So, even though the tail and head of an empty queue point to a next field that contains Nil, the value field sits in the same block, which remains reachable through the queue; that is why pop must clear it to avoid a space leak.

@art-w (Author) commented Mar 24, 2023

Ha I see it now, thanks!

I don't know if this repo is supposed to be a data-structure zoo or a recommended stdlib for multicore... I feel it's confusing for users if there are many similar implementations to pick from, even though one is clearly better than the others :P
