
Rendezvous and demonstration on Spsc_queue API #68

Open · wants to merge 1 commit into base: main

Conversation

@art-w commented Mar 22, 2023

This is a small proposal to avoid spinlocks in this repository by introducing some means of synchronizing domains. I think it would help users if the APIs provided facilities for this common pattern, while working well with fibers and custom schedulers (so not just system locks in place of spinlocks).

I ran some experiments on the Spsc_queue benchmark, where a spinlock is used by the consumer/producer when the queue is empty/full. Currently this is absurdly slow if the two domains happen to be scheduled on the same CPU core: it takes up to 3s to push/pop a thousand elements, rather than milliseconds on two cores! (This is why the CI benchmarks don't terminate in any reasonable time at the moment; you can try it with taskset --cpu-list 1 ./_build/default/bench/main.exe, but I recommend lowering the item counts first.)

Adding a couple of external semaphores to this benchmark solves the single-core worst case for spinlocks, but at the price of being 4x slower in general. By integrating the locking logic into the data structure, we can achieve nearly no observable performance regression while also providing an API that doesn't encourage users to spinlock. It's debatable whether the resulting data structure is still "lockfree", but at least the locking is opt-in :-°

This PR builds on a previous experiment in domainslib where the goal was to synchronize fibers with an Mpmc_queue, by using a "rendezvous" lock between matching pop and push operations. Essentially we want some means for a pushing domain to unlock a stuck popping domain (which happens when the queue is empty, but the user has signified that it really wants to wait). Dually, if the queue was full, we want the next pop to unblock a waiting pushing domain. By exploiting the custom scheduler, locking is cheap since the domain can keep working on other tasks.
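The callback-pair shape discussed in this thread can be sketched as follows. This is an illustrative guess at the intended API, not the PR's actual code; system_rendezvous is a made-up implementation on top of a mutex and condition variable, whereas a custom scheduler would suspend a fiber instead:

```ocaml
(* A minimal sketch of the rendezvous callback pair: [rdv ()] returns a
   [release] handed to the unblocking side and a [wait] that blocks until
   [release] is called.  [system_rendezvous] is a hypothetical system-lock
   implementation, for illustration only. *)
type 'a rendezvous = unit -> ('a -> unit) * (unit -> 'a)

let system_rendezvous : unit -> ('a -> unit) * (unit -> 'a) = fun () ->
  let m = Mutex.create () and c = Condition.create () in
  let slot = ref None in
  let release v =
    Mutex.lock m;
    slot := Some v;
    Condition.signal c;
    Mutex.unlock m
  and wait () =
    Mutex.lock m;
    while !slot = None do Condition.wait c m done;
    let v = Option.get !slot in
    Mutex.unlock m;
    v
  in
  (release, wait)
```

With this shape, a blocked pop would hand release to the queue and call wait, and the next push would call release to wake it.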

... Now this is just a quick demo to gather feedback on the proposed Rendezvous abstraction, there's still a bunch of spinlocks to remove from this repo if there's interest. Some questions:

  • This PR's example only uses a unit Rendezvous.t, so perhaps the rendezvous polymorphism is not required and could be added on the side on a case-by-case basis? I remember that it was nice for the Mpmc_queue in domainslib, and the type is a bit easier to read than with unit everywhere (?)
  • I tried a rendezvous type (('a -> unit) -> unit) -> 'a, since @polytypic used something similar in kcas, but it was less convenient for handling the edge case "I thought I needed to lock but don't anymore"... The callback pair is less fancy but easier to understand, maybe (?)
  • Should the Rendezvous type definition be public? I did so in the hope that other data structure libraries could adopt the pattern and custom schedulers would be able to provide implementations, without requiring an explicit dependency on lockfree just for this type, but perhaps that won't be the case... We could expose standard synchronization methods like mutexes instead, then.

@polytypic (Contributor) commented Mar 23, 2023

A quick note: I currently strongly feel this sort of thing should be put into a minimal library of its own. IOW, I would not put it into lockfree (nor kcas, for that matter). The ability to suspend and later resume seems to be useful/needed for a lot of low-level facilities and would likely be implemented by any and all libraries providing schedulers of some sort that want to cooperate.

else
let (_ : bool) = Atomic.compare_and_set t.lock some_release None in
()
else unlock t.lock
Contributor:

Hmm... Did I understand correctly that if two calls to wait are made, e.g. two pops when the queue is empty, then what happens is that the second call to wait actually wakes up the first one? So, in that case, you get a very expensive busy wait?

Contributor:

Ah.. SPSC! Right.

Author:

Yeah :P This was a quick example that only works for SPSC, but in general multiple waiting pops would actually push their "release" functions onto a queue. Then the next push would see the value queue as empty and pop/wake up the first available pop.

if Atomic.compare_and_set t.lock None some_release then
if size t = expected_size then wait ()
else
let (_ : bool) = Atomic.compare_and_set t.lock some_release None in
Contributor:

Hmm... I wonder whether this can cause problems? It is possible that another party either manages to change the size of the queue (before the first CAS) or call release (between the two CASes).

Related to this, to avoid capturing the stack unnecessarily (before the waiters are added), the kcas blocking draft uses three states: Init | Resumed | Waiting of .... Then both suspend and resume take the race condition (where resume happens before suspend is fully ready) into account.
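The three-state scheme mentioned above might look roughly like this. This is a sketch of the idea only, not the kcas draft's actual code; the release/wait pair stands in for whatever suspend mechanism is used:

```ocaml
(* Rough sketch of the three-state scheme: the stack is only captured
   (via [wait]) after winning the CAS into [Waiting], and [resume] handles
   the race where it runs before the waiter is fully installed. *)
type state = Init | Resumed | Waiting of (unit -> unit)

let resume (t : state Atomic.t) =
  match Atomic.exchange t Resumed with
  | Waiting release -> release ()  (* waiter already installed: wake it *)
  | Init | Resumed -> ()           (* we won the race: wait will be skipped *)

let suspend (t : state Atomic.t) ~release ~wait =
  if Atomic.compare_and_set t Init (Waiting release)
  then wait ()  (* we installed first; a later [resume] calls [release] *)
  (* else [resume] already ran: no need to block at all *)
```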

Contributor:

I guess SPSC makes things a bit simpler here as well.

Author:

Yes, it's possible for another party to call release even though wait () will not be called... I think it's fine? (However, we do need to be careful when calling wait (), otherwise it's possible for the queue to be unstuck just before we install our lock, followed by no more operations to wake us up.)

(I believe the size t = expected_size is doing something similar to your three states to determine the Resumed state implicitly for the queue internals)

Contributor:

One thing I had in mind is that, for the case of a release without a matching wait, one has to make sure that resources are properly released/consumed, but it might not be too much of a problem. Also, a release without a matching wait could potentially be a broken rendezvous. I guess in the SPSC case that is not an issue.

Author:

(To be clear, I totally agree that calling release when wait might not be called feels dirty compared to your two-phase commit... but I don't have an intuition for why this might break custom schedulers' assumptions.)

@polytypic (Contributor):

perhaps the rendezvous polymorphism is not required and could be added on the side on a case-by-case basis?

It is difficult to say, but generally speaking, there are likely to be a lot of cases where the ability to pass a value is not needed (only the suspend/resume is), and it usually requires a bit more work (an allocation or two) to make it possible to pass a value.

wait ~rdv t (t.mask + 1);
push ~rdv t element)

let rec pop ~rdv t =
@polytypic (Contributor) commented Mar 23, 2023:

I also used a(n optional) named parameter to pass in the suspend/resume mechanism in my kcas blocking draft.

However, what I was thinking is that the suspend/resume (or rendezvous) mechanism could be something that can be obtained from the DLS, for example, at the point when it is needed.

A hypothetical rendezvous framework would provide a way to install a rendezvous mechanism:

type t = (* ((unit -> unit) -> unit) -> unit   OR   whatever  *)

val set : t -> unit
val get : unit -> t (* Raises or returns default mechanism if nothing has been explicitly set *)

Schedulers, like Eio, would call set to install the mechanism when they run their event loops. (Perhaps something like push and pop rather than set might be a better API.)

Facilities that need the ability then call get when they need it.

This would allow things like kcas to work independently of the scheduler(s). You could have domainslib running on some domains and Eio on others. And you could have communication between the domains — without having to know which mechanism to pass from which domain.
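Such a hypothetical framework could be sketched over Domain.DLS like this (all names invented for illustration; the concrete shape of t is exactly the open question):

```ocaml
(* Sketch: schedulers install their blocking mechanism into domain-local
   state before entering their event loop; facilities fetch it on demand.
   The type [t] here is a placeholder shape, not a settled design. *)
type t = ((unit -> unit) -> unit) -> unit

let key : t Domain.DLS.key =
  Domain.DLS.new_key (fun () ->
    (* Default mechanism if no scheduler has installed one. *)
    fun _ -> failwith "no blocking mechanism installed on this domain")

let set (mechanism : t) = Domain.DLS.set key mechanism
let get () : t = Domain.DLS.get key
```

A scheduler like Eio would call set when starting its event loop on a domain; a library like kcas would call get at the point where it needs to block.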

@polytypic (Contributor) commented Mar 23, 2023:

An earlier idea / proposal by, I believe, @deepali2806 and @kayceesrk, is to use a Suspend effect.

That would also work. I think the proposals are equally expressive (both can implement the other).

The reason I used the continuation-passing-style approach is that A) it does not require the suspend mechanism to capture a stack or perform an effect and B) it also doesn't prescribe a specific signature for such an effect. Regarding A, in my previous experiments I've gotten the impression that effects do have higher costs than function calls. OTOH, it is not clear whether the ability to avoid capturing stacks is really useful — most of the time one would likely be using an effects-based scheduler anyway. Regarding B, I think the various libraries currently have similar, but slightly different, Suspend-like effects — I'm not sure whether or not they could all use one and the same effect.

Collaborator:

@Sudha247 is also involved in this effort with @deepali2806. We're using the Suspend effect now to safely use lazy from different threads. Think blackholing for thunk evaluation, as in GHC.

Author:

Hm, yeah, I liked the idea of being explicit about blocking operations in the API (hence the parameter)... I don't think it's so bad to write ~rdv:dls if desired:

let dls : unit Rendezvous.t = fun () -> DLS.get dls_rendezvous_key ()

(But, for example, in domainslib we need the pool argument to wait, so I'm not sure we can expect schedulers to preconfigure the dls_rendezvous_key for the user in general.)

Regarding effects, I think it's asking a lot more from schedulers to handle our own effect rather than define how to block in their library (it's also a pain for the user to set up if they wanted a real system lock?)

Contributor:

Hmm... Where can I find the work on safe lazy?

@polytypic (Contributor) commented Mar 23, 2023:

Speaking of the different forms of Suspend. In my par-ml experiment I used the following:

type 'a continuation = ('a, unit) Effect.Deep.continuation
type _ Effect.t += Suspend : ('a continuation -> unit) -> 'a Effect.t

The reason I used those definitions is that the handler then becomes trivial:

let effc (type a) : a Effect.t -> _ = function
  | Suspend ef -> Some ef
  | _ -> None
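For illustration, the definitions above can be exercised with an immediate-resume demo. The run wrapper and the immediate continue are my additions; a real scheduler would stash the continuation somewhere rather than resuming on the spot:

```ocaml
open Effect
open Effect.Deep

(* The par-ml shape: the computation's answer type is fixed to [unit], and
   the function carried by [Suspend] receives the captured continuation. *)
type 'a continuation = ('a, unit) Effect.Deep.continuation
type _ Effect.t += Suspend : ('a continuation -> unit) -> 'a Effect.t

(* The trivial handler: just hand the continuation to the suspender. *)
let effc (type a) : a Effect.t -> (a continuation -> unit) option = function
  | Suspend ef -> Some ef
  | _ -> None

let run (main : unit -> unit) : unit = try_with main () { effc }

let () =
  run (fun () ->
    (* Resume immediately with 21: [perform] returns that value. *)
    let v = perform (Suspend (fun k -> continue k 21)) in
    assert (v + v = 42))
```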

I guess this

type resume_result = Resume_success | Resume_failure

type 'a resumer = 'a -> resume_result
type _ Effect.t += Suspend : ('a resumer -> bool) -> 'a Effect.t

is the current proposal?

I unfortunately haven't had time to think about this thoroughly, but I'm not immediately convinced that features like resume_result and returning a bool are the best way to go. Are they necessary for best results, or could the same be communicated through side channels efficiently enough? How often are the features needed? (In the unified_interface repo, the Resume_failure constructor does not seem to be used at all.) They might be necessary, but I need to think about this. Also, are they sufficient? Is there some special case that would benefit from other special features? I assume these questions have been considered and I'd love to hear some more rationale behind the choices.

Also, there is a kind of arbitrary decision here to insert a function, an abstraction, at one point or another. In the approach I used in kcas and the rendezvous proposal here, the interface is very abstract — only simple function types (or pairs of functions) are being used. Why expose the concrete effect rather than abstract over it?

Usually designs that are more abstract have more lasting power. A very concrete design with specific affordances for all the special cases tends to make things more cumbersome.

But like I said, I unfortunately haven't had time to think about this thoroughly enough. And when I say think, I mean that one/I should actually try to implement realistic things with the various proposals and see how they compare (in usability and performance).

@polytypic (Contributor) commented Mar 23, 2023:

in domainslib, we need the pool argument to wait so I'm not sure we can expect schedulers to preconfigure the dls_rendezvous_key for the user in general

These sorts of schedulers, including domainslib and Eio (and every other similar effects-based library I know of for OCaml), basically run a loop of some kind on the domains on which they run. I don't see why it would be problematic to install the rendezvous mechanism into DLS just before entering the loop.

It would also be fairly easy to provide a function in domainslib, like get_pool : unit -> pool (or get_pool : unit -> pool option if you prefer), that obtains the pool with which the current domain is associated.

One possibility is also to allow both: passing an optional parameter explicitly to tell how to block and, in the absence of such parameter, obtain the mechanism from DLS.

Author:

Yeah, I don't know why domainslib provides the ability to ping-pong between different domain pools... Yet another generic alternative might be to release the blocking domain into a shared pool like idle-domains, and to expect custom schedulers to integrate with that?

I'm not immediately convinced that the features like resume_result and returning a bool are the best way to go. [...] I assume these questions have been considered and I'd love to hear some more rationale behind the choices.

The return type of resumer captures the notion of task cancellation: if a particular scheduler cancels a task while it is still in the suspended state, we should avoid resuming it. The resumer is defined so that, if the task is live, calling it actually resumes the task and returns Resume_success. In the alternate case, when the task has been cancelled, it returns Resume_failure instead of resuming the task. In the MVar implementation here, we check the resumer's return value: when it is Resume_failure, we can skip it and retry the operation to get the next resumer.

Also, in the type signature of the Suspend effect, we have a function that takes a resumer and returns a bool. This signifies thread safety while suspending the task. Thread safety, in this case, is achieved through a lock-free implementation using atomic operations like compare-and-swap (CAS). When such a CAS operation succeeds, we return true, indicating that the push onto the suspended queue succeeded. In case of CAS failure, we have to retry the operation to get the most recent state of the queue or MVar.
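A wake-up loop following that description might look roughly like this (a sketch only; the actual MVar implementation uses a lock-free queue, and Stdlib's Queue is shown here purely for brevity):

```ocaml
type resume_result = Resume_success | Resume_failure
type 'a resumer = 'a -> resume_result

(* Hand [v] to the first live waiter, skipping resumers whose task was
   cancelled while suspended.  Returns [false] if no live waiter remains. *)
let rec wake_one (waiters : 'a resumer Queue.t) (v : 'a) : bool =
  match Queue.take_opt waiters with
  | None -> false
  | Some resume ->
    (match resume v with
     | Resume_success -> true
     | Resume_failure -> wake_one waiters v)
```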

Collaborator:

Hmm... Where can I find the work on safe lazy?

@Sudha247 has a development branch. This is very much "research" atm. Not intended to be upstreamed as is.

@art-w commented Mar 23, 2023

I currently strongly feel this sort of thing should be put into a minimal library of its own.

Yeah, I don't have an opinion on this... It really depends on whether custom schedulers and data structure libraries need a lot more than an alias for the type unit -> ('a -> unit) * (unit -> 'a) (well, data structures will need some default implementations of this type when testing, but schedulers will not).

it usually requires a bit more work (an allocation or two) to make it possible to pass a value.

Indeed, that's why I added some specialized _unit variants to show that it shouldn't be too much of an issue (but either way, I'm not seeing a performance difference in my tests, as we only allocate extra stuff when we have to block).


To quote your example from kcas on continuations vs products for the rendezvous type:

scheduler (fun resume ->
    if not (Atomic.compare_and_set self `Init (`Waiting resume))
    then resume ());

vs

let resume, wait = scheduler () in
if Atomic.compare_and_set self `Init (`Waiting resume)
then wait ()

A small issue I had with the continuation style is that you need to call resume () in the happy case when you don't need to lock (so it's slightly awkward for the scheduler, which now needs to check whether blocking is necessary after the callback completes).

@polytypic (Contributor):

I currently strongly feel this sort of thing should be put into a minimal library of its own.

Yeah I don't have an opinion on this..

I'm thinking of this more in terms of having the capability available without explicit parameterization. One approach would be to use the DLS as I suggested previously. This way one could just use libraries like kcas with blocking — no matter how many different schedulers are being used in the program, things would just work (as long as the schedulers install the mechanism to DLS when running their event loops). Passing things through parameters or functorizing everything easily becomes a mess — although it definitely has the advantage of not using ambient state. Also, if every library had slightly different definitions (one using rendezvous, another the CPS suspend, a third a Suspend effect), then it would become much less practical to e.g. use DLS.

A small issue I had with the continuation is that you need to call resume () in the happy case when you don't need to lock (so it's slightly awkward for the scheduler which now need to check if blocking is necessary or not after the callback completes)

Hmm... Yes, this could be an advantage of breaking the suspend/resume protocol in two like this. So, IIUC, basically, in an effects-based implementation it would be the wait implementation that actually performs the effect to capture the stack. By making wait a simple first-order function, things potentially become simpler.

@art-w commented Mar 23, 2023

(Ha, I almost didn't notice that the ocaml-benchmarks went through without a timeout for the first time! I thought the relaxed mpmc queue would also need fixing, it still takes seconds, but I guess the spsc benchmark was the main issue... cc @ElectreAAS as the graphs are missing :/ )

@ElectreAAS

@art-w you scared me, but the benchmarks ran properly and the graphs are present! Link here. The problem is that these benchmarks have the name "Lockfree" and hence aren't shown when just clicking on the PR; you also need to click on Lockfree here.
However, the graphs aren't particularly useful as there is nothing to compare them to, but they're present!

@polytypic (Contributor):

Also mentioning here that the domain-local-await (DLA) library is now available as an opam package and provides a blocking mechanism inspired by the "rendezvous" primitive described in this PR. I plan to soon release a version of kcas using DLA, and there are PRs to add DLA support to Eio and Domainslib. As described in the DLA library README, it may be replaced later by an official blocking mechanism, but I hope that we can use it now to experiment with providing blocking abstractions that, from the POV of casual users, just work™ with OCaml 5 today.
