
MPMC unbounded queue #35

Closed · wants to merge 7 commits

Conversation

@art-w (Author) commented Nov 21, 2022

This is essentially a Michael-Scott queue with (bounded) FAD queues inside, inspired by @bartoszmodelski's version :) (a rough sketch of the layout follows the list below)

  • Performance should be reasonable, since most operations only hit the fast FAD queues!
  • The relaxed FIFO issue of FAD queues is avoided by not using them as circular buffers: cells are never reused, so at most one push and one pop can ever touch a given cell
  • The MS queue wrapper provides support for resizing when the last FAD queue is full
  • There's a bit of tension around allocating the next FAD queue array, because it takes some arbitrary amount of time -- during which pushing is not possible. As a tiny hack, the next array is preallocated as soon as possible (with tail = -1), and over-allocations are compensated by keeping them for the future (in gift_rest)
  • When pop fails, a tombstone is used to indicate the failure to the corresponding push... I didn't have much success trying to roll back the head, but that's mostly due to my unrealistic benchmarks. However, I would like to exploit this design flaw as an alternative implementation for domainslib Channels: by using its Task algebraic effects, I think we could replace the Tombstone by an Awaiting_push of ('a -> unit) such that the popping task would be suspended until a push has completed. Feedback on this random idea would be much appreciated :)
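Roughly, the layout looks like this (the type names and fields here are illustrative, not the PR's actual code):

type 'a cell =
  | Empty          (* reserved by a push that has not written yet *)
  | Value of 'a    (* written by its unique push, awaiting its unique pop *)
  | Tombstone      (* a pop gave up waiting; the matching push must retry *)

(* A bounded FAD queue: cells are never reused, so each one is touched by at
   most one push and one pop.  push reserves a slot with
   [Atomic.fetch_and_add tail 1], pop with [Atomic.fetch_and_add head 1];
   when tail runs past the array, the Michael-Scott style [next] link leads
   to a freshly allocated segment. *)
type 'a segment = {
  cells : 'a cell Atomic.t array;
  head : int Atomic.t;                (* next index to pop *)
  tail : int Atomic.t;                (* next index to push *)
  next : 'a segment option Atomic.t;  (* MS-queue link to the next array *)
}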

Some things that still need work:

  • I reused the existing SPSC benchmark for this PR, but a proper "MPMC" benchmark would be nice (... I only have some unstable benchmarks atm to confirm that performance is not absurdly bad)
  • I added a test using qcheck-lin to show that I did my homework, but I'm guessing this will cause some circular dependency issue as multicoretests depends on lockfree :/

@bartoszmodelski (Contributor)

Looks very cool!

You might find some useful MPMC benchmarks in: https://github.com/ocaml-multicore/lockfree/pull/24/files

@art-w (Author) commented Mar 20, 2023

Rebased, integrated with the nice SPSC/MPMC benchmarks, and optimized a bit. It looks rather competitive on my computer (median of 20 iterations to push/pop 2 million elements evenly distributed amongst domains):

{"name":"spsc-queue", "time":0.197151, "throughput":10144499.051894}
{"name":"mpmc-queue", "time":0.04201,  "throughput":47607634.376259}

{"name":"mpmc-relaxed-fad-pushers:1,takers:1", "time":0.337243, "throughput": 6226962.460233}
{"name":"mpmc-relaxed-cas-pushers:1,takers:1", "time":0.253526, "throughput": 8283175.015164}
{"name":  "mpmc-unbounded-pushers:1,takers:1", "time":0.060743, "throughput":34571831.616132}

{"name":"mpmc-relaxed-fad-pushers:4,takers:4", "time":0.714718, "throughput":2938221.363058}
{"name":"mpmc-relaxed-cas-pushers:4,takers:4", "time":0.545062, "throughput":3852772.251762}
{"name":  "mpmc-unbounded-pushers:4,takers:4", "time":0.240097, "throughput":8746463.298379}

{"name":"mpmc-relaxed-fad-pushers:7,takers:1", "time":1.865645, "throughput":1125616.114949}
{"name":"mpmc-relaxed-cas-pushers:7,takers:1", "time":0.612,    "throughput":3431372.613185}
{"name":  "mpmc-unbounded-pushers:7,takers:1", "time":0.548163, "throughput":3830977.721448}

{"name":"mpmc-relaxed-fad-pushers:1,takers:7", "time":1.577434, "throughput":1331275.930521}
{"name":"mpmc-relaxed-cas-pushers:1,takers:7", "time":0.98371,  "throughput":2134775.383006}
{"name":  "mpmc-unbounded-pushers:1,takers:7", "time":0.877693, "throughput":2392635.669022}

(The extreme 1-vs-7 configurations can have a lot of noise caused by the scheduler; it would be interesting to run the benchmarks in a sanitized environment.)

The optimization relies on this message-passing section from the memory model, exploiting the fact that exactly one "push" domain and one "pop" domain will ever try to access a specific cell of the queue array... and so we can allocate fewer Atomic.t by packing status bits together (we technically need only one bit per cell, but use two to handle carries induced by the FAD operation; a sketch follows). (cc @lyrm since we briefly talked about this ^^)
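A minimal sketch of the packing, assuming 2 status bits per cell in a single Atomic.t word (the helper names and word layout are illustrative, not the exact encoding used in the PR):

let bits_per_cell = 2
let cells_per_word = Sys.int_size / bits_per_cell

(* One status word covers [cells_per_word] consecutive cells of the array. *)
let make_status () = Atomic.make 0

(* Mark cell [i] (with 0 <= i < cells_per_word) of this word.  Since at most
   one push and one pop ever touch a given cell, its 2-bit field is
   incremented at most twice, so the fetch_and_add can never carry into the
   neighbouring cell's bits. *)
let mark status i =
  ignore (Atomic.fetch_and_add status (1 lsl (bits_per_cell * i)))

(* Has cell [i] been marked at least once? *)
let is_marked status i =
  (Atomic.get status lsr (bits_per_cell * i)) land 0b11 <> 0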

@art-w changed the title from "(draft) MPMC unbounded queue" to "MPMC unbounded queue" on Mar 20, 2023
@polytypic (Contributor)

Is there a benchmark comparing against the Michael-Scott queue? It would be interesting to see the relative performance.

@art-w (Author) commented Mar 22, 2023

Sure, I also added the Ws_deque to the benchmarks! Indeed the Michael-Scott queue looks pretty good:

| One producer | Michael-Scott | Relaxed FAD | Relaxed CAS | Unbounded | Work-stealing |
|---|---|---|---|---|---|
| pushers:1 takers:1 | 0.131434s | 0.329778s | 0.21065s | 0.059725s | 0.272594s |
| pushers:1 takers:2 | 0.165976s | 0.248677s | 0.397321s | 0.199688s | 0.482286s |
| pushers:1 takers:3 | 0.177053s | 0.316906s | 0.469243s | 0.391438s | 0.410644s |
| pushers:1 takers:4 | 0.200289s | 0.298454s | 0.555292s | 0.509853s | 0.544786s |
| pushers:1 takers:5 | 0.218715s | 0.333596s | 0.376887s | 0.628065s | 0.365196s |
| pushers:1 takers:6 | 0.251839s | 0.355936s | 0.459976s | 0.740695s | 0.527715s |
| pushers:1 takers:7 | 0.899175s | 1.134269s | 1.006196s | 0.880158s | 0.673532s |

And we can see that the unbounded MPMC queue is terrible when the consumers spinlock to pop. Producers don't have this issue:

| One consumer | Michael-Scott | Relaxed FAD | Relaxed CAS | Unbounded |
|---|---|---|---|---|
| pushers:2 takers:1 | 0.136905s | 0.480398s | 0.41395s | 0.182891s |
| pushers:3 takers:1 | 0.255042s | 0.475786s | 0.341128s | 0.153577s |
| pushers:4 takers:1 | 0.260206s | 0.508386s | 0.434022s | 0.122285s |
| pushers:5 takers:1 | 0.149528s | 0.388776s | 0.449589s | 0.101665s |
| pushers:6 takers:1 | 0.149537s | 0.447404s | 0.546314s | 0.110501s |
| pushers:7 takers:1 | 1.632871s | 1.94816s | 0.564839s | 0.532245s |

(... I clearly don't have a true 8 cores ;) )

Finally, the Michael-Scott queue compares less favorably in a balanced setup as contention grows:

| Balanced | Michael-Scott | Relaxed FAD | Relaxed CAS | Unbounded |
|---|---|---|---|---|
| pushers:2 takers:2 | 0.26675s | 0.256043s | 0.371672s | 0.19826s |
| pushers:3 takers:3 | 0.520373s | 0.183438s | 0.397926s | 0.178163s |
| pushers:4 takers:4 | 0.754701s | 0.640411s | 0.539897s | 0.235253s |

@lyrm (Collaborator) commented Mar 22, 2023

I tried this very basic dscheck test with a push and a pop, and it is not finishing (at least not in a reasonable amount of time).

let producer_consumer () =
  Atomic.trace (fun () ->
      let queue = Mpmc_queue.make ~dummy:1 () in

      (* producer *)
      Atomic.spawn (fun () -> Mpmc_queue.push queue 0);

      (* consumer *)
      let popped = ref None in
      Atomic.spawn (fun () -> popped := Mpmc_queue.pop queue);

      (* checks *)
      Atomic.final (fun () ->
          Atomic.check (fun () ->
              let remaining = Mpmc_queue.pop queue in
              match (!popped, remaining) with
              | None, Some 0 | Some 0, None -> true
              | _, _ -> false)))

@polytypic (Contributor)

I also wonder how the results might change with the change shown here. In my quick tests it gave a noticeable improvement and made benchmark results significantly more stable (on Apple M1). Similar changes to other queue structures could give improvements as well.

@art-w (Author) commented Mar 22, 2023

Thanks for testing it @lyrm! But which version of dscheck are you using? (I'm unable to reproduce with its main branch or ocaml-multicore/dscheck#3; on the contrary, dscheck terminates after exploring only one trace, which is super wrong of it.)

Ha wait I'm dumb.

@lyrm (Collaborator) commented Mar 22, 2023

I am using the latest release version. It is actually finishing, but it takes several minutes, which seems like a lot for a single push and a single pop.

@polytypic (Contributor) commented Mar 22, 2023

For benchmarking fun, here is a magically faster variation of the Michael-Scott queue:

module Backoff = Kcas.Backoff

type 'a node = { next : 'a node; mutable value : 'a }

external next_as_atomic : 'a node -> 'a node Atomic.t = "%identity"

type 'a t = {
  head : 'a node Atomic.t Atomic.t;
  tail : 'a node Atomic.t Atomic.t;
}

let create () =
  let next = Atomic.make (Obj.magic ()) in
  Multicore_magic.copy_as_padded
    {
      head = Multicore_magic.copy_as_padded @@ Atomic.make next;
      tail = Multicore_magic.copy_as_padded @@ Atomic.make next;
    }

let rec pop backoff head =
  let old_head = Multicore_magic.fenceless_get head in
  let first = Multicore_magic.fenceless_get old_head in
  if first == Obj.magic () then None
  else if Atomic.compare_and_set head old_head (next_as_atomic first) then (
    let value = first.value in
    first.value <- Obj.magic ();
    Some value)
  else pop (Backoff.once backoff) head

let pop { head; _ } = pop Backoff.default head [@@inline]

let rec fix_tail tail (new_tail : 'a node Atomic.t) =
  let old_tail = Atomic.get tail in
  if
    Atomic.get new_tail == Obj.magic ()
    && not (Atomic.compare_and_set tail old_tail new_tail)
  then fix_tail tail new_tail

let rec push (new_node : 'a node) tail (old_tail : 'a node Atomic.t) =
  if not (Atomic.compare_and_set old_tail (Obj.magic ()) new_node) then
    push new_node tail (next_as_atomic (Atomic.get old_tail))
  else if not (Atomic.compare_and_set tail old_tail (next_as_atomic new_node))
  then fix_tail tail (next_as_atomic new_node)

let push { tail; _ } value =
  let new_node = { next = Obj.magic (); value } in
  push new_node tail (Atomic.get tail)
  [@@inline]

With this variation I got

Optimized Lockfree.MMSQueue: mean = 0.003840, sd = 0.000000 tp=78125617.782193

from the same benchmark as here. This is basically about 3 to 4 times faster than the lockfree library version of the Michael-Scott queue from a couple of weeks ago.

EDIT: I modified the code to use a next_as_atomic cast instead of Obj.magic.

@art-w (Author) commented Mar 22, 2023

@lyrm > My bad, thanks a lot! There's a small window for a spinlock when the two domains interleave perfectly:

  1. push starts and increments the tail to reserve an array cell
  2. pop attempts to grab that element, but discovers that push has not completed yet, so it leaves a tombstone behind and prepares to recurse. The recursion should now discover that the queue is empty, but...
  3. push attempts to write its element, sees the tombstone, and goes back to step 1.

In this scenario (traced below), pop never sees the queue as empty and push never has the opportunity to complete. I'll need a bit more time to fix this one hmhm.
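Spelled out as a trace, using the illustrative names from the sketch in the PR description (not the actual code):

(* push: i <- Atomic.fetch_and_add tail 1    reserves cells.(i)
   pop:  j <- Atomic.fetch_and_add head 1    j = i, finds cells.(j) = Empty
   pop:  cells.(j) <- Tombstone              gives up and recurses
   push: CAS on cells.(i) fails on the Tombstone, so push restarts

   Every push retry reserves a fresh cell, which the recursing pop reaches
   before the push manages to write it: pop never observes an empty queue
   and push never completes. *)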

@polytypic > Wow, that's fast! I did not benchmark your PR but the code you just posted (I also added your backoff to the consumers' spinlock, hence the more reasonable numbers with a single producer):

| | Magic Michael-Scott | Unbounded |
|---|---|---|
| pushers:1 takers:1 | 0.06981s | 0.049369s |
| pushers:1 takers:2 | 0.071409s | 0.138905s |
| pushers:1 takers:3 | 0.09788s | 0.136138s |
| pushers:1 takers:4 | 0.09283s | 0.19781s |
| pushers:1 takers:5 | 0.091467s | 0.153733s |
| pushers:1 takers:6 | 0.088874s | 0.19984s |
| pushers:2 takers:1 | 0.068123s | 0.122823s |
| pushers:3 takers:1 | 0.076451s | 0.115788s |
| pushers:4 takers:1 | 0.0887s | 0.089226s |
| pushers:5 takers:1 | 0.08625s | 0.095281s |
| pushers:6 takers:1 | 0.08588s | 0.08793s |
| pushers:2 takers:2 | 0.086222s | 0.136547s |
| pushers:3 takers:3 | 0.093071s | 0.127035s |

My PR is essentially an unbounded Michael-Scott queue with bounded FAD queues in place of your single mutable value : 'a, so perhaps I can steal some optimizations from you? ^^'

Is there anything blocking your implementation from becoming the de facto standard? I think the benchmarks make it clear that the specialized {M,S}p{M,S}c / work-stealing implementations perform a lot worse than your generic solution... This is crazy though:

let rec push ... (old_tail : 'a node Atomic.t) =
  if ... then push ... (Atomic.get old_tail |> Obj.magic)

(for completeness, but I think those numbers are useless on my system:)

| 8 cores | Magic Michael-Scott | Unbounded |
|---|---|---|
| pushers:1 takers:7 | 0.484896s | 0.230395s |
| pushers:7 takers:1 | 0.982056s | 1.20984s |
| pushers:4 takers:4 | 0.394102s | 0.38964s |

@polytypic (Contributor)

> Is there anything blocking your implementation from becoming the de facto standard?

Well... I don't know. It is perhaps a bit too "magical" for most people's taste. I wanted to see how fast things could be with the MS queue. BTW, there is at least one minor optimization that could still be done: a level of indirection could be removed from the queue itself by inlining the head atomic, as is done with the nodes.

Ideally OCaml would, at some point, offer the ability to have atomic fields in records, provide some way to allocate cache line aligned blocks, and also allow fenceless operations on atomic fields. Then it would be possible to implement the same (and better) without using magic/unsafe features. I hope experiments like this provide motivation towards having such features in OCaml.

Some of the Obj.magic usages could be replaced, e.g., with external ... = "%identity" to make them perhaps a bit more palatable:

external next_as_atomic : 'a node -> 'a node Atomic.t = "%identity"
external head_as_atomic : 'a t -> 'a node Atomic.t = "%identity" (* the additional optimization mentioned above *)

@polytypic (Contributor) commented Mar 24, 2023

Here is a version of the optimized Michael-Scott style queue that avoids most uses of Obj.magic:

module Backoff = Kcas.Backoff
(* The [Backoff] from the kcas library does not perform allocations, so it has
   slightly lower overhead than the [Backoff] in the lockfree library. *)

type 'a node = Nil | Node of { next : 'a node; mutable value : 'a }

external next_as_atomic : 'a node -> 'a node Atomic.t = "%identity"
(* Ideally one should be able to say that the [next] field is [atomic next: 'a
   node], but we can't in OCaml.  As the [next] field is the first field in a
   [Node { ... }] it happens to be at the same location it would be in a ['a
   node Atomic.t], so we use an unsafe cast to access the [next] field as an
   atomic. *)

type 'a t = {
  head : 'a node Atomic.t Atomic.t;
  tail : 'a node Atomic.t Atomic.t;
}

let create () =
  let next = Atomic.make Nil in
  (* We use explicit padding to ensure that the [head] and [tail] atomics do not
     suffer from false sharing.  This improves performance, because [pop] only
     accesses the [head] and [push] only accesses the [tail].  With false
     sharing those accesses would unnecessarily cause contention. *)
  Multicore_magic.copy_as_padded
    {
      head = Multicore_magic.copy_as_padded @@ Atomic.make next;
      tail = Multicore_magic.copy_as_padded @@ Atomic.make next;
    }

let rec pop backoff head =
  (* We can safely use [fenceless_get] operations here, because the accesses are
     dependent (i.e. the order of memory accesses is fixed anyway).  The
     difference that [Atomic.get] makes is that it does not allow accesses that
     are later in the program to be performed before it, but in this case it is
     not possible anyway. *)
  let old_head = Multicore_magic.fenceless_get head in
  match Multicore_magic.fenceless_get old_head with
  | Nil -> None
  | Node node as first ->
      if Atomic.compare_and_set head old_head (next_as_atomic first) then (
        let value = node.value in
        node.value <- Obj.magic ();
        (* At this point we've acquired a value.  The queue will still point to
           the node, so we must make sure that the queue no longer transitively
           points to the value - otherwise we'd have a space leak. *)
        Some value)
      else pop (Backoff.once backoff) head

let pop { head; _ } =
  (* We use the recursive-worker / non-recursive-wrapper idiom and instruct
     the compiler to always inline the wrapper. *)
  pop Backoff.default head
  [@@inline]

let rec fix_tail tail (new_tail : 'a node Atomic.t) =
  let old_tail = Atomic.get tail in
  if
    Atomic.get new_tail == Nil
    && not (Atomic.compare_and_set tail old_tail new_tail)
  then fix_tail tail new_tail

let rec push (new_node : 'a node) tail (old_tail : 'a node Atomic.t) =
  if not (Atomic.compare_and_set old_tail Nil new_node) then
    push new_node tail (next_as_atomic (Atomic.get old_tail))
  else if not (Atomic.compare_and_set tail old_tail (next_as_atomic new_node))
  then fix_tail tail (next_as_atomic new_node)

let push { tail; _ } value =
  (* We use the recursive-worker / non-recursive-wrapper idiom and instruct
     the compiler to always inline the wrapper. *)
  let new_node = Node { next = Nil; value } in
  push new_node tail (Atomic.get tail)
  [@@inline]

(* BTW, some of the other [Atomic.get]s could also be [fenceless_get]s, but I
   didn't see improvements from those, so I left them as [Atomic.get]s. *)

BTW, I don't think the above uses less magic. It just uses different spells.

@art-w (Author) commented Mar 24, 2023

Nice, it looks good!

  • Why do you need to pad the root in create? Can it create contention to access the record field head/next if an Atomic is nearby?
let create () = ...
  Multicore_magic.copy_as_padded { head = ... ; tail = ... }
  • I'm not sure I understand how the node is still reachable after a pop and why we need to clear its value?
        node.value <- Obj.magic ();
        (* At this point we've acquired a value.  The queue will still point to
           the node, so we must make sure that the queue no longer transitively
           points to the value - otherwise we'd have a space leak. *)

(I read through the issue/PR addressing this, but I think the situation is different? In your version, at rest, the head should point to live nodes and the tail should point to Nil?)

Btw I'm happy if we can close this PR in favor of your implementation :) (I played a bit more with the FAD queue, but I don't think it can achieve the same performance as yours!)

@polytypic (Contributor)

If you ask me, I think it is good to have different implementations. I think that is also part of the idea of this repository. Also, I hope that OCaml will get more support for Atomics in the future and then it is likely that the relative performance of various implementations changes. So, my idea wasn’t to prevent this work. I noticed the possibility of some of the MS queue optimizations much earlier and just thought it might be interesting to try them now.

@polytypic (Contributor) commented Mar 24, 2023

> Why do you need to pad the root in create? Can it create contention to access the record field head/next if an Atomic is nearby?

Yes, if the root shares cache lines with the atomics (or anything else that is mutated). Specifically, the accesses in the pattern matches, { head; _ } and { tail; _ }, can suffer from contention if those fields happen to be in cache lines whose contents are being mutated. It is all about how shared memory works.

Ideally OCaml would provide means to allocate things in a cache-line-granular manner, so that a specific heap block would be guaranteed not to share cache lines with any other heap block. The padding provided by multicore-magic cannot guarantee that, but if you use it with all "long-lived" blocks, it should prevent almost all harmful false sharing. False sharing between short-lived objects is usually not a problem, and it is usually better to allocate short-lived objects without extra padding/alignment to reduce heap pressure.

Perhaps future OCaml could have something like this:

type 'a node =
  | Nil
  | Node of {
      atomic next: 'a node;  (* A node is short lived, so prefer to reduce heap pressure *)
      mutable value: 'a;
    }

type 'a t =
  {
    atomic head : 'a node;
    [@@align_to_cache_line] atomic tail : 'a node;
  }
  (* A queue is likely long lived and having head and tail in their own cache lines avoids contention *)

> (I read through the issue/PR addressing this, but I think the situation is different? In your version, at rest, the head should point to live nodes and the tail should point to Nil?)

In the magically optimized implementation the next field is inseparable from the value field (they are allocated together as a Node { ... }). So, even though the tail and head of an empty queue will point to a next field that contains Nil, the value field is also there (sketched below).
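A sketch of the reachability after a successful pop, assuming the simplified shapes from the code above:

(* t.head now points at the popped node itself, which acts as the new dummy:

     t.head --> Node { next  = rest of the queue;
                       value = <was the popped element> }

   Without the [node.value <- Obj.magic ()] in [pop], the queue would keep
   the popped element alive until the next pop replaces the dummy -- hence
   the explicit clear to avoid the space leak. *)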

@art-w (Author) commented Mar 24, 2023

Ha I see it now, thanks!

I don't know if this repo is supposed to be a data-structure zoo or a recommended stdlib for multicore... I feel it's confusing for users if there are many similar implementations to pick from, even when one is clearly better than the others :P

@art-w (Author) commented Nov 7, 2024

Closing; the Michael-Scott queue is just better in all respects :)

@art-w closed this Nov 7, 2024