
Rendezvous and demonstration on Spsc_queue API #68

Open · wants to merge 1 commit into base: main

Conversation

@art-w commented Mar 22, 2023

This is a small proposal to avoid spinlocks in this repository by introducing some means of synchronizing domains. I think it would help users if the APIs provided facilities for this common pattern, while working well with fibers and custom schedulers (so not just system locks in place of spinlocks).

I ran some experiments on the Spsc_queue benchmark, where a spinlock is used by the consumer/producer when the queue is empty/full. Currently this is absurdly slow if the two domains happen to be scheduled on the same CPU core: it takes up to 3s to push/pop a thousand elements, rather than milliseconds on two cores! (This is why the CI benchmarks don't terminate in any reasonable time at the moment; you can try it with taskset --cpu-list 1 ./_build/default/bench/main.exe, but I recommend lowering the item counts first.)

Adding a couple of external semaphores to this benchmark solves the single-core worst case for spinlocks, but at the price of being 4x slower in general. By integrating the locking logic into the data structure, we can achieve nearly no observable performance regression while also providing an API that doesn't encourage users to spinlock. It's debatable whether the resulting data structure is still "lockfree", but at least the locking is opt-in :-°

This PR builds on a previous experiment in domainslib where the goal was to synchronize fibers with an Mpmc_queue, by using a "rendezvous" lock between matching pop and push operations. Essentially we want some means for a pushing domain to unlock a stuck popping domain (which happens when the queue is empty, but the user has signified that it really wants to wait). Dually, if the queue was full, we want the next pop to unblock a waiting pushing domain. By exploiting the custom scheduler, locking is cheap since the domain can keep working on other tasks.
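The callback-pair shape discussed in this thread can be sketched as follows. This is an illustrative guess at the intended API, not the PR's actual code; system_rendezvous is a made-up implementation on top of a mutex and condition variable, whereas a custom scheduler would suspend a fiber instead:

```ocaml
(* A minimal sketch of the rendezvous callback pair: [rdv ()] returns a
   [release] handed to the unblocking side and a [wait] that blocks until
   [release] is called.  [system_rendezvous] is a hypothetical system-lock
   implementation, for illustration only. *)
type 'a rendezvous = unit -> ('a -> unit) * (unit -> 'a)

let system_rendezvous : unit -> ('a -> unit) * (unit -> 'a) = fun () ->
  let m = Mutex.create () and c = Condition.create () in
  let slot = ref None in
  let release v =
    Mutex.lock m;
    slot := Some v;
    Condition.signal c;
    Mutex.unlock m
  and wait () =
    Mutex.lock m;
    while !slot = None do Condition.wait c m done;
    let v = Option.get !slot in
    Mutex.unlock m;
    v
  in
  (release, wait)
```

With this shape, a blocked pop would hand release to the queue and call wait, and the next push would call release to wake it.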

... Now this is just a quick demo to gather feedback on the proposed Rendezvous abstraction, there's still a bunch of spinlocks to remove from this repo if there's interest. Some questions:

  • This PR's example only uses a unit Rendezvous.t, so perhaps the rendezvous polymorphism is not required and could be added on the side on a case-by-case basis? I remember that it was nice for the Mpmc_queue in domainslib, and the type is a bit easier to read than with unit everywhere (?)
  • I tried a rendezvous type (('a -> unit) -> unit) -> 'a, since @polytypic used something similar in kcas, but it was less convenient for handling the edge case "I thought I needed to lock but don't anymore"... The callback pair is less fancy but easier to understand, maybe (?)
  • Should the Rendezvous type definition be public? I did so in the hope that other data structure libraries could adopt the pattern and custom schedulers would be able to provide implementations, without requiring an explicit dependency on lockfree just for this type, but perhaps that won't be the case... We could expose standard synchronization methods like mutexes instead, then.

@polytypic (Contributor) commented Mar 23, 2023

A quick note: I currently strongly feel this sort of thing should be put into a minimal library of its own. IOW, I would not put it into lockfree (nor kcas, for that matter). The ability to suspend and later resume seems to be useful/needed for a lot of low-level facilities and would likely be implemented by any and all libraries providing schedulers of some sort that want to cooperate.

else
let (_ : bool) = Atomic.compare_and_set t.lock some_release None in
()
else unlock t.lock
Contributor:

Hmm... Did I understand correctly that if two calls to wait are made, e.g. two pops when the queue is empty, then what happens is that the second call to wait actually wakes up the first one? So, in that case, you get a very expensive busy wait?

Contributor:

Ah.. SPSC! Right.

Author:

Yeah :P This was a quick example that only works for SPSC, but in general multiple waiting pops would actually push their "release" functions onto a queue. Then the next push would see the value queue as empty and pop/wake up the first available pop.

if Atomic.compare_and_set t.lock None some_release then
if size t = expected_size then wait ()
else
let (_ : bool) = Atomic.compare_and_set t.lock some_release None in
Contributor:

Hmm... I wonder whether this can cause problems? It is possible that another party either manages to change the size of the queue (before the first CAS) or call release (between the two CASes).

Related to this, to avoid capturing the stack unnecessarily (before the waiters are added), the kcas blocking draft uses three states: Init | Resumed | Waiting of .... Then both suspend and resume take the race condition (where resume happens before suspend is fully ready) into account.
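The three-state scheme mentioned above might look roughly like this. This is a sketch of the idea only, not the kcas draft's actual code; the release/wait pair stands in for whatever suspend mechanism is used:

```ocaml
(* Rough sketch of the three-state scheme: the stack is only captured
   (via [wait]) after winning the CAS into [Waiting], and [resume] handles
   the race where it runs before the waiter is fully installed. *)
type state = Init | Resumed | Waiting of (unit -> unit)

let resume (t : state Atomic.t) =
  match Atomic.exchange t Resumed with
  | Waiting release -> release ()  (* waiter already installed: wake it *)
  | Init | Resumed -> ()           (* we won the race: wait will be skipped *)

let suspend (t : state Atomic.t) ~release ~wait =
  if Atomic.compare_and_set t Init (Waiting release)
  then wait ()  (* we installed first; a later [resume] calls [release] *)
  (* else [resume] already ran: no need to block at all *)
```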

Contributor:

I guess SPSC makes things a bit simpler here as well.

Author:

Yes, it's possible for another party to call release even though wait () will not be called... I think it's fine? (However, we do need to be careful when calling wait (), otherwise it's possible for the queue to be unstuck just before we install our lock, followed by no more operations to wake us up.)

(I believe the size t = expected_size is doing something similar to your three states to determine the Resumed state implicitly for the queue internals)

Contributor:

One thing I had in mind is that, for the case of a release without a matching wait, one has to make sure that resources are properly released/consumed, but it might not be too much of a problem. Also, a release without a matching wait could potentially be a broken rendezvous. I guess in the SPSC case that is not an issue.

Author:

(To be clear, I totally agree that calling release when wait might not be called feels dirty compared to your two-phase commit... but I don't have an intuition for why this might break custom schedulers' assumptions.)

@polytypic (Contributor):

perhaps the rendezvous polymorphism is not required and could be added on the side on a case-by-case basis?

It is difficult to say, but generally speaking, there are likely to be a lot of cases where the ability to pass a value is not needed (only the suspend/resume is), and it usually requires a bit more work (an allocation or two) to make it possible to pass a value.

wait ~rdv t (t.mask + 1);
push ~rdv t element)

let rec pop ~rdv t =
@polytypic (Contributor) commented Mar 23, 2023:

I also used a(n optional) named parameter to pass in the suspend/resume mechanism in my kcas blocking draft.

However, what I was thinking is that the suspend/resume (or rendezvous) mechanism could be something that can be obtained from the DLS, for example, at the point when it is needed.

A hypothetical rendezvous framework would provide a way to install a rendezvous mechanism:

type t = (* ((unit -> unit) -> unit) -> unit   OR   whatever  *)

val set : t -> unit
val get : unit -> t (* Raises or returns default mechanism if nothing has been explicitly set *)

Schedulers, like Eio, would call set to install the mechanism when they run their event loops. (Perhaps something like push and pop rather than set might be a better API.)

Facilities that need the ability then call get when they need it.

This would allow things like kcas to work independently of the scheduler(s). You could have domainslib running on some domains and Eio on others. And you could have communication between the domains — without having to know which mechanism to pass from which domain.
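Such a hypothetical framework could be sketched over Domain.DLS like this (all names invented for illustration; the concrete shape of t is exactly the open question):

```ocaml
(* Sketch: schedulers install their blocking mechanism into domain-local
   state before entering their event loop; facilities fetch it on demand.
   The type [t] here is a placeholder shape, not a settled design. *)
type t = ((unit -> unit) -> unit) -> unit

let key : t Domain.DLS.key =
  Domain.DLS.new_key (fun () ->
    (* Default mechanism if no scheduler has installed one. *)
    fun _ -> failwith "no blocking mechanism installed on this domain")

let set (mechanism : t) = Domain.DLS.set key mechanism
let get () : t = Domain.DLS.get key
```

A scheduler like Eio would call set when starting its event loop on a domain; a library like kcas would call get at the point where it needs to block.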

@polytypic (Contributor) commented Mar 23, 2023:

An earlier idea / proposal by, I believe, @deepali2806 and @kayceesrk, is to use a Suspend effect.

That would also work. I think the proposals are equally expressive (both can implement the other).

The reason I used the continuation-passing-style approach is that A) it does not require the suspend mechanism to capture a stack or perform an effect and B) it also doesn't prescribe a specific signature for such an effect. Regarding A, in my previous experiments I've gotten the impression that effects do have higher costs than function calls. OTOH, it is not clear whether the ability to avoid capturing stacks is really useful — most of the time one would likely be using an effects-based scheduler anyway. Regarding B, I think the various libraries currently have similar, but slightly different, Suspend-like effects — I'm not sure whether or not they could all use one and the same effect.

Collaborator:

@Sudha247 is also involved in this effort with @deepali2806. We're using the Suspend effect now to safely use lazy from different threads. Think blackholing for thunk evaluation, as in GHC.

Author:

Hm, yeah, I liked the idea of being explicit about blocking operations in the API (hence the parameter)... I don't think it's so bad to write ~rdv:dls if desired:

let dls : unit Rendezvous.t = fun () -> DLS.get dls_rendezvous_key ()

(But, for example, in domainslib we need the pool argument to wait, so I'm not sure we can expect schedulers to preconfigure the dls_rendezvous_key for the user in general.)

Regarding effects, I think it's asking a lot more from schedulers to handle our own effect rather than define how to block in their library (it's also a pain for the user to set up if they wanted a real system lock?)

Contributor:

Hmm... Where can I find the work on safe lazy?

@polytypic (Contributor) commented Mar 23, 2023:

Speaking of the different forms of Suspend. In my par-ml experiment I used the following:

type 'a continuation = ('a, unit) Effect.Deep.continuation
type _ Effect.t += Suspend : ('a continuation -> unit) -> 'a Effect.t

The reason I used those definitions is that the handler then becomes trivial:

let effc (type a) : a Effect.t -> _ = function
  | Suspend ef -> Some ef
  | _ -> None
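For illustration, the definitions above can be exercised with an immediate-resume demo. The run wrapper and the immediate continue are my additions; a real scheduler would stash the continuation somewhere rather than resuming on the spot:

```ocaml
open Effect
open Effect.Deep

(* The par-ml shape: the computation's answer type is fixed to [unit], and
   the function carried by [Suspend] receives the captured continuation. *)
type 'a continuation = ('a, unit) Effect.Deep.continuation
type _ Effect.t += Suspend : ('a continuation -> unit) -> 'a Effect.t

(* The trivial handler: just hand the continuation to the suspender. *)
let effc (type a) : a Effect.t -> (a continuation -> unit) option = function
  | Suspend ef -> Some ef
  | _ -> None

let run (main : unit -> unit) : unit = try_with main () { effc }

let () =
  run (fun () ->
    (* Resume immediately with 21: [perform] returns that value. *)
    let v = perform (Suspend (fun k -> continue k 21)) in
    assert (v + v = 42))
```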

I guess this

type resume_result = Resume_success | Resume_failure

type 'a resumer = 'a -> resume_result
type _ Effect.t += Suspend : ('a resumer -> bool) -> 'a Effect.t

is the current proposal?

I unfortunately haven't had time to think about this thoroughly, but I'm not immediately convinced that features like resume_result and returning a bool are the best way to go. Are they necessary for best results, or could the same be communicated through side channels efficiently enough? How often are the features needed? (In the unified_interface repo, the Resume_failure constructor does not seem to be used at all.) They might be necessary, but I need to think about this. Also, are they sufficient? Is there some special case that would benefit from other special features? I assume these questions have been considered and I'd love to hear some more rationale behind the choices.

Also, there is a kind of arbitrary decision here to insert a function, an abstraction, at one point or another. In the approach I used in kcas and the rendezvous proposal here, the interface is very abstract — only simple function types (or pairs of functions) are being used. Why expose the concrete effect rather than abstract over it?

Usually designs that are more abstract have more lasting power. A very concrete design with specific affordances for all the special cases tends to make things more cumbersome.

But like I said, I unfortunately haven't had time to think about this thoroughly enough. And when I say think, I mean that one/I should actually try to implement realistic things with the various proposals and see how they compare (in usability and performance).

@polytypic (Contributor) commented Mar 23, 2023:

in domainslib, we need the pool argument to wait so I'm not sure we can expect schedulers to preconfigure the dls_rendezvous_key for the user in general

These sorts of schedulers, including domainslib and Eio (and every other similar effects-based library I know of for OCaml), basically run a loop of some kind on the domains on which they run. I don't see why it would be problematic to install the rendezvous mechanism into DLS just before entering the loop.

It would also be fairly easy to provide a function in domainslib, like get_pool : unit -> pool (or get_pool : unit -> pool option if you prefer), that obtains the pool with which the current domain is associated.

One possibility is also to allow both: passing an optional parameter explicitly to tell how to block and, in the absence of such parameter, obtain the mechanism from DLS.

Author:

Yeah, I don't know why domainslib provides the ability to ping-pong between different domain pools... Yet another generic alternative might be to release the blocking domain into a shared pool like idle-domains, and to expect custom schedulers to integrate with that?

I'm not immediately convinced that the features like resume_result and returning a bool are the best way to go. [...] I assume these questions have been considered and I'd love to hear some more rationale behind the choices.

The return type of resumer captures the notion of task cancellation: if a particular scheduler cancels a task while it is still in the suspended state, we should avoid resuming it. The resumer is defined so that, if the task is live, calling it actually resumes the task and returns Resume_success. In the alternate case, when the task has been cancelled, it returns Resume_failure instead of resuming the task. In the MVar implementation here, we check the resumer's return value: when it is Resume_failure, we can skip it and retry the operation to get the next resumer.

Also, in the type signature of the Suspend effect, we have a function that takes a resumer and returns a bool. This signifies thread safety while suspending the task. Thread safety, in this case, is achieved through a lock-free implementation using atomic operations like compare-and-swap (CAS). When such a CAS operation succeeds, we return true, indicating that the push onto the suspended queue succeeded. In case of CAS failure, we have to retry the operation to get the most recent state of the queue or MVar.
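A wake-up loop following that description might look roughly like this (a sketch only; the actual MVar implementation uses a lock-free queue, and Stdlib's Queue is shown here purely for brevity):

```ocaml
type resume_result = Resume_success | Resume_failure
type 'a resumer = 'a -> resume_result

(* Hand [v] to the first live waiter, skipping resumers whose task was
   cancelled while suspended.  Returns [false] if no live waiter remains. *)
let rec wake_one (waiters : 'a resumer Queue.t) (v : 'a) : bool =
  match Queue.take_opt waiters with
  | None -> false
  | Some resume ->
    (match resume v with
     | Resume_success -> true
     | Resume_failure -> wake_one waiters v)
```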

Collaborator:

Hmm... Where can I find the work on safe lazy?

@Sudha247 has a development branch. This is very much "research" atm. Not intended to be upstreamed as is.

@art-w commented Mar 23, 2023

I currently strongly feel this sort of thing should be put into a minimal library of its own.

Yeah, I don't have an opinion on this... It really depends on whether custom schedulers and data structure libraries need a lot more than an alias for the type unit -> ('a -> unit) * (unit -> 'a) (well, data structures will need some default implementations of this type when testing, but schedulers will not).

it usually requires a bit more work (an allocation or two) to make it possible to pass a value.

Indeed, that's why I added some specialized _unit variants to show that it shouldn't be too much of an issue (but either way, I'm not seeing a performance difference in my tests, as we only allocate extra stuff when we have to block).


To quote your example from kcas on continuations vs products for the rendezvous type:

scheduler (fun resume ->
    if not (Atomic.compare_and_set self `Init (`Waiting resume))
    then resume ());

vs

let resume, wait = scheduler () in
if Atomic.compare_and_set self `Init (`Waiting resume)
then wait ()

A small issue I had with the continuation style is that you need to call resume () in the happy case when you don't need to lock (so it's slightly awkward for the scheduler, which now needs to check whether blocking is necessary after the callback completes).

@polytypic (Contributor):

I currently strongly feel this sort of thing should be put into a minimal library of its own.

Yeah I don't have an opinion on this..

I'm thinking of this more in terms of having the capability available without explicit parameterization. One approach would be to use the DLS as I suggested previously. This way one could just use libraries like kcas with blocking — no matter how many different schedulers are being used in the program, things would just work (as long as the schedulers install the mechanism to DLS when running their event loops). Passing things through parameters or functorizing everything easily becomes a mess — although it definitely has the advantage of not using ambient state. Also, if every library had slightly different definitions (one using rendezvous, another the CPS suspend, a third a Suspend effect), then it would become much less practical to e.g. use DLS.

A small issue I had with the continuation is that you need to call resume () in the happy case when you don't need to lock (so it's slightly awkward for the scheduler which now need to check if blocking is necessary or not after the callback completes)

Hmm... Yes, this could be an advantage of breaking the suspend/resume protocol in two like this. So, IIUC, basically, in an effects-based implementation it would be the wait implementation that actually performs the effect to capture the stack. By making wait a simple first-order function, things potentially become simpler.

@art-w commented Mar 23, 2023

(Ha, I almost didn't notice that the ocaml-benchmarks went through without a timeout for the first time! I thought the relaxed mpmc queue would also need fixing, it still takes seconds, but I guess the spsc benchmark was the main issue... cc @ElectreAAS as the graphs are missing :/ )

@ElectreAAS

@art-w you scared me, but the benchmarks ran properly and the graphs are present! Link here. The problem is that these benchmarks have the name "Lockfree" and hence aren't shown when just clicking on the PR; you also need to click on Lockfree here.
However, the graphs aren't particularly useful as there is nothing to compare them to, but they're present!

@polytypic (Contributor):

Also mentioning here that the domain-local-await (DLA) library is now available as an opam package and provides a blocking mechanism inspired by the "rendezvous" primitive described in this PR. I plan to soon release a version of kcas using DLA, and there are PRs to add DLA support to Eio and Domainslib. As described in the DLA library README, it may be replaced later by an official blocking mechanism, but I hope that we can use it now to experiment with providing blocking abstractions that, from the POV of casual users, just work™ with OCaml 5 today.
