MPMC unbounded queue #35
Conversation
Looks very cool! You might find some useful MPMC benchmarks in: https://github.com/ocaml-multicore/lockfree/pull/24/files |
Rebased, integrated with the nice SPSC/MPMC benchmarks, and optimized a bit. It looks rather competitive on my computer (median of 20 iterations to push/pop 2 million elements, evenly distributed amongst domains):
{"name":"spsc-queue", "time":0.197151, "throughput":10144499.051894}
{"name":"mpmc-queue", "time":0.04201, "throughput":47607634.376259}
{"name":"mpmc-relaxed-fad-pushers:1,takers:1", "time":0.337243, "throughput": 6226962.460233}
{"name":"mpmc-relaxed-cas-pushers:1,takers:1", "time":0.253526, "throughput": 8283175.015164}
{"name": "mpmc-unbounded-pushers:1,takers:1", "time":0.060743, "throughput":34571831.616132}
{"name":"mpmc-relaxed-fad-pushers:4,takers:4", "time":0.714718, "throughput":2938221.363058}
{"name":"mpmc-relaxed-cas-pushers:4,takers:4", "time":0.545062, "throughput":3852772.251762}
{"name": "mpmc-unbounded-pushers:4,takers:4", "time":0.240097, "throughput":8746463.298379}
{"name":"mpmc-relaxed-fad-pushers:7,takers:1", "time":1.865645, "throughput":1125616.114949}
{"name":"mpmc-relaxed-cas-pushers:7,takers:1", "time":0.612, "throughput":3431372.613185}
{"name": "mpmc-unbounded-pushers:7,takers:1", "time":0.548163, "throughput":3830977.721448}
{"name":"mpmc-relaxed-fad-pushers:1,takers:7", "time":1.577434, "throughput":1331275.930521}
{"name":"mpmc-relaxed-cas-pushers:1,takers:7", "time":0.98371, "throughput":2134775.383006}
{"name": "mpmc-unbounded-pushers:1,takers:7", "time":0.877693, "throughput":2392635.669022} (The extremes 1vs7 can have a lot of noise caused by the scheduler, it would be interesting to run the benchmarks on a sanitized environment..) The optimization relies on this message-passing section from the memory model, by exploiting the fact that exactly one "push" domain and one "pop" domain will ever try to access a specific cell from the queue array... and so we can allocate less |
Is there a benchmark comparing against the Michael-Scott queue? It would be interesting to see the relative performance. |
Sure, I also added the ...
And we can see that the unbounded MPMC queue performs terribly when the consumers spinlock to pop; producers don't have this issue:
(... I clearly don't have true 8 cores ;)) Finally, the Michael-Scott queue compares less favorably in a balanced setup as contention grows:
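(As an aside: such consumer spin loops are usually tamed with exponential backoff, which indeed appears later in this thread. A hedged sketch, where try_pop stands in for a hypothetical non-blocking pop:)
module Backoff = Kcas.Backoff

(* Retry a non-blocking [try_pop] with exponential backoff instead of
   hammering the contended atomics in a tight loop. *)
let rec retry_with_backoff try_pop queue backoff =
  match try_pop queue with
  | Some _ as result -> result
  | None -> retry_with_backoff try_pop queue (Backoff.once backoff)

let pop_blocking try_pop queue = retry_with_backoff try_pop queue Backoff.default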
|
I tried this very basic dscheck test:
let producer_consumer () =
Atomic.trace (fun () ->
let queue = Mpmc_queue.make ~dummy:1 () in
(* producer *)
Atomic.spawn (fun () -> Mpmc_queue.push queue 0);
(* consumer *)
let popped = ref None in
Atomic.spawn (fun () -> popped := Mpmc_queue.pop queue);
(* checks *)
Atomic.final (fun () ->
Atomic.check (fun () ->
let remaining = Mpmc_queue.pop queue in
match (!popped, remaining) with
| None, Some 0 | Some 0, None -> true
| _, _ -> false)))
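(To actually run the exploration, assuming Atomic above is dscheck's Dscheck.TracedAtomic, it suffices to call the function from a test executable:)
let () = producer_consumer ()
|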
I also wonder how the results might change with the change shown here. In my quick tests it gave noticeable improvement and made benchmark results significantly more stable (on Apple M1). Similar changes to other queue structures could give improvements as well. |
Ha wait I'm dumb. |
I am using the latest release version. It does actually finish, but it takes several minutes, which seems like a lot for a single push and a single pop. |
For benchmarking fun, here is a magically faster variation of the Michael-Scott queue:
module Backoff = Kcas.Backoff
type 'a node = { next : 'a node; mutable value : 'a }
external next_as_atomic : 'a node -> 'a node Atomic.t = "%identity"
type 'a t = {
head : 'a node Atomic.t Atomic.t;
tail : 'a node Atomic.t Atomic.t;
}
let create () =
let next = Atomic.make (Obj.magic ()) in
Multicore_magic.copy_as_padded
{
head = Multicore_magic.copy_as_padded @@ Atomic.make next;
tail = Multicore_magic.copy_as_padded @@ Atomic.make next;
}
let rec pop backoff head =
let old_head = Multicore_magic.fenceless_get head in
let first = Multicore_magic.fenceless_get old_head in
if first == Obj.magic () then None
else if Atomic.compare_and_set head old_head (next_as_atomic first) then (
let value = first.value in
first.value <- Obj.magic ();
Some value)
else pop (Backoff.once backoff) head
let pop { head; _ } = pop Backoff.default head [@@inline]
let rec fix_tail tail (new_tail : 'a node Atomic.t) =
let old_tail = Atomic.get tail in
if
Atomic.get new_tail == Obj.magic ()
&& not (Atomic.compare_and_set tail old_tail new_tail)
then fix_tail tail new_tail
let rec push (new_node : 'a node) tail (old_tail : 'a node Atomic.t) =
if not (Atomic.compare_and_set old_tail (Obj.magic ()) new_node) then
push new_node tail (next_as_atomic (Atomic.get old_tail))
else if not (Atomic.compare_and_set tail old_tail (next_as_atomic new_node))
then fix_tail tail (next_as_atomic new_node)
let push { tail; _ } value =
let new_node = { next = Obj.magic (); value } in
push new_node tail (Atomic.get tail)
[@@inline]
With this variation I got ... from the same benchmark as here. This is basically about 3 to 4 times faster than the lockfree library version of the Michael-Scott queue from a couple of weeks ago. EDIT: I modified the code to use a ...
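(For completeness, a minimal usage sketch of the queue above; purely illustrative:)
let () =
  let q = create () in
  push q 1;
  push q 2;
  assert (pop q = Some 1);
  assert (pop q = Some 2);
  assert (pop q = None)
|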
@lyrm > My bad, thanks a lot! There's a small window for a spinlock when the two domains interleave perfectly:
In this scenario, ...
@polytypic > Wow that's fast! I did not benchmark your PR, but the code you just posted (I also added your backoff to the consumers' spinlock, hence the more reasonable numbers with a single producer):
My PR is essentially an unbounded Michael-Scott queue with bounded FAD queues in place of your single ... Is there anything blocking your implementation from becoming the de facto standard? I think the benchmarks make it clear that the specialized ...
let rec push ... (old_tail : 'a node Atomic.t) =
  if ... then push ... (Atomic.get old_tail |> Obj.magic)
(for completeness, but I think those numbers are useless on my system:)
|
Well... I don't know. It is perhaps a bit too "magical" for most people's taste. I wanted to see how fast things could be with the MS queue. BTW, there is at least one minor optimization that could still be done: a level of indirection could be removed from the queue itself by inlining the atomics into the queue record. Ideally OCaml would, at some point, offer the ability to have atomic record fields. Some of the ...
external next_as_atomic : 'a node -> 'a node Atomic.t = "%identity"
external head_as_atomic : 'a t -> 'a node Atomic.t = "%identity" (* the additional optimization mentioned above *)
|
Here is a version of the optimized Michael-Scott style queue that avoids most uses of Obj.magic:
module Backoff = Kcas.Backoff
(* The [Backoff] from the kcas library does not perform allocations, so it has
slightly lower overhead than the [Backoff] in the lockfree library. *)
type 'a node = Nil | Node of { next : 'a node; mutable value : 'a }
external next_as_atomic : 'a node -> 'a node Atomic.t = "%identity"
(* Ideally one should be able to say that the [next] field is [atomic next: 'a
node], but we can't in OCaml. As the [next] field is the first field in a
[Node { ... }] it happens to be at the same location it would be in a ['a
node Atomic.t], so we use an unsafe cast to access the [next] field as an
atomic. *)
type 'a t = {
head : 'a node Atomic.t Atomic.t;
tail : 'a node Atomic.t Atomic.t;
}
let create () =
let next = Atomic.make Nil in
(* We use explicit padding to ensure that the [head] and [tail] atomics do not
suffer from false sharing. This improves performance, because [pop] only
accesses the [head] and [push] only accesses the [tail]. With false
sharing those accesses would unnecessarily cause contention. *)
Multicore_magic.copy_as_padded
{
head = Multicore_magic.copy_as_padded @@ Atomic.make next;
tail = Multicore_magic.copy_as_padded @@ Atomic.make next;
}
let rec pop backoff head =
(* We can safely use [fenceless_get] operations here, because the accesses are
dependent (i.e. the order of memory accesses is fixed anyway). The
difference that [Atomic.get] makes is that it does not allow accesses that
are later in the program to be performed before it, but in this case it is
not possible anyway. *)
let old_head = Multicore_magic.fenceless_get head in
match Multicore_magic.fenceless_get old_head with
| Nil -> None
| Node node as first ->
if Atomic.compare_and_set head old_head (next_as_atomic first) then (
let value = node.value in
node.value <- Obj.magic ();
(* At this point we've acquired a value. The queue will still point to
the node, so we must make sure that the queue no longer transitively
points to the value - otherwise we'd have a space leak. *)
Some value)
else pop (Backoff.once backoff) head
let pop { head; _ } =
(* We use the recursive worker / non-recursive wrapper idiom and instruct the
compiler to always inline the wrapper. *)
pop Backoff.default head
[@@inline]
let rec fix_tail tail (new_tail : 'a node Atomic.t) =
let old_tail = Atomic.get tail in
if
Atomic.get new_tail == Nil
&& not (Atomic.compare_and_set tail old_tail new_tail)
then fix_tail tail new_tail
let rec push (new_node : 'a node) tail (old_tail : 'a node Atomic.t) =
if not (Atomic.compare_and_set old_tail Nil new_node) then
push new_node tail (next_as_atomic (Atomic.get old_tail))
else if not (Atomic.compare_and_set tail old_tail (next_as_atomic new_node))
then fix_tail tail (next_as_atomic new_node)
let push { tail; _ } value =
(* We use the recursive worker / non-recursive wrapper idiom and instruct the
compiler to always inline the wrapper. *)
let new_node = Node { next = Nil; value } in
push new_node tail (Atomic.get tail)
[@@inline]
(* BTW, some of the other [Atomic.get]s could also be [fenceless_get]s, but I
didn't see improvements from those, so I left them as [Atomic.get]s. *)
BTW, I don't think the above uses less magic. It just uses different spells. |
Nice, it looks good!
let create () = ...
Multicore_magic.copy_as_padded { head = ... ; tail = ... }
node.value <- Obj.magic ();
(* At this point we've acquired a value. The queue will still point to
the node, so we must make sure that the queue no longer transitively
points to the value - otherwise we'd have a space leak. *)
(I read through the issue/PR addressing this, but I think the situation is different? In your version, at rest, the ...)
Btw I'm happy if we can close this PR in favor of your implementation :) (I played a bit more with the FAD queue, but I don't think it can achieve the same performance as yours!) |
If you ask me, I think it is good to have different implementations. I think that is also part of the idea of this repository. Also, I hope that OCaml will get more support for Atomics in the future and then it is likely that the relative performance of various implementations changes. So, my idea wasn’t to prevent this work. I noticed the possibility of some of the MS queue optimizations much earlier and just thought it might be interesting to try them now. |
Yes. If the root shares cache lines with the atomics (or anything else that is mutated). Specifically, the accesses in the pattern matches, ...
Ideally OCaml would provide means to allocate things in a cache-line-granular manner, so that a specific heap block would be guaranteed not to share cache lines with any other heap block. The padding provided by multicore-magic cannot guarantee that, but if you use it with all "long lived" blocks, then it should prevent almost all harmful false sharing. Cases where false sharing happens between short-lived objects are usually not a problem, and it is usually better to just allocate short-lived objects without extra padding/alignment to reduce heap pressure. Perhaps future OCaml could have something like this:
type 'a node =
| Nil
| Node of {
atomic next: 'a node; (* A node is short lived, so prefer to reduce heap pressure *)
mutable value: 'a;
}
type 'a t =
{
atomic head : 'a node;
[@@align_to_cache_line] atomic tail : 'a node;
}
(* A queue is likely long lived and having head and tail in their own cache lines avoids contention *)
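(Until then, a hedged sketch of today's approximation with multicore-magic, exactly as the create functions above already do: pad the long-lived blocks and leave the short-lived nodes unpadded:)
(* [copy_as_padded] copies a block with enough padding that two padded
   blocks are unlikely (though not guaranteed) to share a cache line. *)
let padded_atomic x = Multicore_magic.copy_as_padded (Atomic.make x)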
In the magically optimized implementation the ... |
Ha I see it now, thanks! I don't know if this repo is supposed to be a data structure zoo or a recommended stdlib for multicore... I feel it's confusing for users if there are many similar implementations to pick from, even though one is clearly better than the others :P |
Closing; the Michael-Scott queue is just better in all respects :)
This is essentially a Michael Scott queue with (bounded) FAD queues inside, inspired by @bartoszmodelski's version :)
... (tail = -1), and overallocation are compensated by keeping them for the future (in gift_rest) ... head, but that's mostly due to my unrealistic benchmarks. However, I would like to exploit this design flaw as an alternative implementation for domainslib Channels: by using its Task algebraic effects, I think we could replace the Tombstone by an Awaiting_push of ('a -> unit) such that the popping task would be suspended until a push has completed. Feedback on this random idea would be much appreciated :) (see the sketch after the list below)
Some stuff that still needs work:
- qcheck-lin to show that I did my homework, but I'm guessing this will cause some circular dependency issue as multicoretests depends on lockfree :/
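(A hedged sketch of that suspend-on-pop idea; the cell states and push_cell below are illustrative assumptions, not this PR's actual code:)
(* Illustrative cell states: instead of a popper marking an empty cell with
   [Tombstone] and giving up, it could install a resumer that the next push
   completes, suspending the popping task in the meantime. *)
type 'a cell =
  | Empty
  | Value of 'a
  | Tombstone
  | Awaiting_push of ('a -> unit)

(* The pusher first checks whether a popper is already waiting on the cell;
   if so, it hands the value directly to the suspended task. *)
let rec push_cell (cell : 'a cell Atomic.t) v =
  match Atomic.get cell with
  | Awaiting_push resume as seen ->
      if Atomic.compare_and_set cell seen Tombstone then resume v
      else push_cell cell v (* raced with another domain; retry *)
  | Empty as seen ->
      if not (Atomic.compare_and_set cell seen (Value v)) then push_cell cell v
  | Value _ | Tombstone ->
      (* in the real queue the pusher would move on to another cell *)
      ()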