
Attempt at a thread-safe implementation of DLS #12889


Merged
merged 7 commits into ocaml:trunk from the thread-safe-DLS branch
Mar 20, 2024

Conversation

gasche
Member

@gasche gasche commented Jan 8, 2024

This is a proposal to fix #12677, making Domain.DLS thread-safe -- or rather, understanding, improving and specifying its behavior in the presence of multiple threads on the same domain.

This is marked as a Draft because I have not tested the code, only checked that it compiles. Writing tests for thread-safety issues is less tempting to me than writing thread-safe code, so I am hoping that some other people (perhaps @polytypic or @jmid) would be willing to lend a hand to test this / write a test.

Seeing the code now (before it is ready for upstreaming) can still be informative, to judge whether we can hope for a low-invasiveness fix for 5.2, independently of the complete-design-change proposal in #12724.

There are three changes in the PR:

  1. The array-resizing logic is made thread-safe by checking whether another thread already resized the array, and giving up / retrying in that case. This relies on a new compare_and_set-like primitive for the dls root. (A self-contained sketch of this retry pattern follows the list below.)

  2. We accept that the init : unit -> 'a function provided by the user to initialize the key on first get may be called several times if several threads call get at the same time. We ensure and specify that only the first such value gets used; redundant initialization computations that finish later are discarded.
    This is worth discussing. In particular, we could decide to fail in this case instead. But I don't see how users would write programs if we failed by default in this case. The recommendation in the PR is for users to protect their initialization function if it does not support redundant initializations.

  3. I refactored the handling of unique_value to be hidden in an internal module (this is inspired by my work on unboxed dynarrays: Dynarrays, unboxed (with local dummies) #12885). I did this in the context of a first iteration of (2) that stored lazy thunks in the st arrays. I gave up on this approach (it fails on concurrent initialization, which I think is the wrong behavior), but I kept the refactoring because I think it clarifies the code; at least it helps me write and think about this code.
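
A minimal, self-contained sketch of the retry pattern described in item 1 (this is an illustration, not the PR's actual code; dls_root stands in for the per-domain DLS root, compare_and_set for the new C primitive, and correctness relies on no poll point falling between the read and the write):

type slot = Obj.t option

(* Stand-in for the per-domain DLS root. *)
let dls_root : slot array ref = ref [||]

(* Single-threaded compare-and-set: OCaml code on one domain never runs in
   parallel, so this is atomic as long as no poll point separates the read
   from the write. *)
let compare_and_set r oldv newv =
  if !r == oldv then (r := newv; true) else false

(* Grow the DLS array so that index [idx] is valid, retrying if another
   thread installed a resized array first. *)
let rec maybe_grow idx =
  let st = !dls_root in
  let sz = Array.length st in
  if idx < sz then st
  else begin
    let new_sz = max (idx + 1) (2 * max sz 1) in
    let new_st = Array.make new_sz (None : slot) in
    Array.blit st 0 new_st 0 sz;
    if compare_and_set dls_root st new_st
    then new_st
    else (* another thread resized first: discard our copy and retry *)
      maybe_grow idx
  end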

@SGrondin

SGrondin commented Jan 8, 2024

I already have a test for this in another project, I think. However it's not a test we'd be able to put into the testsuite/ directory. I'll run it on this PR next weekend and report back.

@gasche
Member Author

gasche commented Jan 8, 2024

However it's not a test we'd be able to put into the testsuite/ directory.

This is fine with me, but it would be nice if there is a version that I can run myself if I need to iterate on the PR. Thanks!

@gasche
Member Author

gasche commented Jan 27, 2024

I think that we need someone to contribute soon, as a reviewer, if we want to get this in 5.2.

@gasche gasche force-pushed the thread-safe-DLS branch 2 times, most recently from de5fd12 to 70eb15d on January 28, 2024 at 14:37
@gasche gasche marked this pull request as ready for review January 28, 2024 14:38
@gasche
Member Author

gasche commented Jan 28, 2024

I wrote a test and marked the PR as "ready for review".

The test was useful as it caught a bug with the previous version of the PR, which accounted for resize/resize races and init/init races, but not resize/init races. Now the test passes and the code is more robust.

@OlivierNicole
Contributor

I am reviewing this.

Contributor

@OlivierNicole OlivierNicole left a comment

This change is a net improvement and I believe it does make DLS thread-safe. I like the Obj_opt module, and it fixes a potential bug (missing opaque_identity in DLS.set_initial_key).

That being said, overall the choice not to use atomic values forces the code to rely on a large set of fragile assumptions: about safe points, about allowed compiler reorderings, about C primitives not releasing the domain lock… I think it would be more future-proof to use atomics and implement locking algorithms that scale well even in contended scenarios (I am not experienced enough to confidently propose some myself, unfortunately, but I believe they exist). The current code can work as an intermediary state but I would feel better if we planned to eventually move toward something else.

More minor, but it is a shame IMO to shift the responsibility of protecting the init : unit -> 'a function to the DLS user, instead of enforcing it ourselves with an atomic. It can avoid some contention when init requires no protection, but that must be put into balance with the fact that: 1. user-written protection will still cause contention; 2. it will likely be less efficient than a protection written by us, or even incorrect.

However, this can be also done later as a backward-compatible change.

Comment on lines +1999 to +2019
CAMLprim value caml_domain_dls_compare_and_set(value old, value new)
{
  CAMLnoalloc;
  value current = Caml_state->dls_root;
  if (current == old) {
    caml_modify_generational_global_root(&Caml_state->dls_root, new);
    return Val_true;
  } else {
    return Val_false;
  }
}

Contributor

While OCaml code cannot run in parallel on the same domain, C code has no such restriction. Therefore extra care must be taken here to ensure that this function can never run concurrently with itself; otherwise we have a C-level data race. That makes me a bit uneasy in terms of robustness to changes.

Member Author

Caml_state can only be used when the runtime lock is held, so I don't think that we have to worry about races occurring due to this function; it has the same only-one-thread-at-a-time restriction as OCaml mutator code.

Contributor

I didn’t know that! Thanks.

Comment on lines +149 to +156
let[@inline never] array_compare_and_set a i oldval newval =
  (* Note: we cannot use [@poll error] due to the
     allocations on a.(i) in the Double_array case. *)
  let curval = a.(i) in
  if curval == oldval then (
    Array.unsafe_set a i newval;
    true
  ) else false
Contributor

Minor: all lines except the first could be put into a function aux marked with [@poll error].

Member Author

What I would rather do is to have a version of "uniform arrays" with a simpler implementation, to make [@poll error] more precise. The problem with your proposal is that you are introducing intermediate function calls, which could in turn introduce poll points. (I think that function calls are poll points in bytecode, for example.)

Contributor

Right, I hadn’t thought of that. Never mind my comment then.

Comment on lines 113 to 114
should protect it by using a boolean flag to detect this and
fail, or a mutex, etc.
Contributor

Suggested change
- should protect it by using a boolean flag to detect this and
- fail, or a mutex, etc.
+ should protect it by using an atomic boolean flag to detect this and
+ fail, or a mutex, etc.

Minor: maybe we shouldn’t let beginners think they can get away with a non-atomic flag.

Member Author

I'm confused, they can absolutely get away with a non-atomic flag.

let once f =
  let already_called = ref false in
  fun () ->
    if !already_called then invalid_arg "this function may only be called once";
    already_called := true;
    f ()

Member Author

Thinking about this more, once does not quite work as it prevents calling the function from several domains. We could use a DLS cell to remember if the initialization function was called, but it looks a bit weird.

let once_per_domain init =
  let already_called = Domain.DLS.new_key (fun () -> false) in
  fun () ->
    let called = Domain.DLS.get already_called in
    if called then failwith "...";
    Domain.DLS.set already_called true;
    init ()

Contributor

My bad, it had slipped my mind that the two executions of f do not happen truly in parallel. In that case, would it make sense to add a sentence like “Note that the calls to f may be concurrent (although only one thread will run OCaml code at a time)”, or something along these lines?

@OlivierNicole
Contributor

It looks like the test creates too many threads on i386. Maybe the number could be a bit lower?

@gasche
Member Author

gasche commented Jan 29, 2024

That being said, overall the choice not to use atomic values forces the code to rely on a large set of fragile assumptions: about safe points, about allowed compiler reorderings, about C primitives not releasing the domain lock… I think it would be more future-proof to use atomics and implement locking algorithms that scale well even in contended scenarios (I am not experienced enough to confidently propose some myself, unfortunately, but I believe they exist)

What we are implementing here is Domain.DLS, which is supposed to provide domain-local state, that is, basically, a reference cell per domain, and which is designed to avoid synchronization across domains. I am wary of trying to reason about the performance overhead of introducing an atomic read in the fast path of DLS.get; this is not my area of expertise, but I would worry about creating scalability bottlenecks on relaxed architectures where barriers are more costly. The PR as proposed does not change the performance profile of the fast path at all (when the value is already initialized), so it is very easy to tell that it should not create performance regressions.

If you have an idea in mind that use atomics only in the slow path (where initialization is required), that would probably be fine. But I don't know what it is that you would suggest doing with atomics there.

More minor, but it is a shame IMO to shift the responsibility of protecting the init : unit -> 'a function to the DLS user, instead of enforcing it ourselves with an atomic.

What sort of protection do you actually have in mind?

  • If you mean "failing when concurrent initialization is detected", I believe that this is the wrong default -- how can users write programs that use both Thread and Domain.DLS in this context, if it is invalid to ever have two threads call DLS.get at the same time?
  • If you mean "take a mutex to block other threads from initializing the value", this might be a better default, but it risks blowing up in the user's face in the case of reentrancy scenarios (if DLS.get is called from a signal handler or a memprof callback, from the same thread that is already holding the lock to call the init () function). Being able to call DLS.get from a memprof callback in particular sounds rather convenient.

My intuition is that most key-initialization functions can in fact tolerate redundant calls just fine and do not need any sort of protection -- this is at least the case of all the uses of DLS.new_key in the standard library.
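
For instance (an illustrative example, not code from the PR or from the standard library), an initialization function that merely allocates a fresh per-domain structure tolerates redundant calls naturally; the value computed by the losing thread is simply discarded:

(* Illustrative: a per-domain scratch buffer. A redundant call to the
   initialization function only allocates an extra buffer, which is then
   discarded, so no protection is needed. *)
let scratch : Buffer.t Domain.DLS.key =
  Domain.DLS.new_key (fun () -> Buffer.create 256)

let with_scratch f =
  let buf = Domain.DLS.get scratch in
  Buffer.clear buf;
  f buf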

It looks like the test creates too many threads on i386. Maybe the number could be a bit lower?

Thanks, I will lower the thread count -- I will look for which constants still reliably trigger the bug on my machine.

@gasche
Member Author

gasche commented Jan 29, 2024

I toned down the test considerably: instead of 10 threads per key for 100 DLS keys, it now runs 3 threads per key for 10 DLS keys. (We can still observe resize/init races on my machine, but they seem to not happen for every run, maybe 1/5 runs. This means that if we were to introduce a regression only in that case, the test would become flaky.)

@OlivierNicole
Contributor

What sort of protection do you actually have in mind?

  • If you mean "failing when concurrent initialization is detected", I believe that this is the wrong default -- how can users write programs that use both Thread and Domain.DLS in this context, if it is invalid to ever have two threads call DLS.get at the same time?

  • If you mean "take a mutex to block other threads from initializing the value", this might be a better default, but it risks blowing up in the user's face in the case of reentrancy scenarios (if DLS.get is called from a signal handler or a memprof callback, from the same thread that is already holding the lock to call the init () function). Being able to call DLS.get from a memprof callback in particular sounds rather convenient.

My intuition is that most key-initialization functions can in fact tolerate redundant calls just fine and do not need any sort of protection -- this is at least the case of all the uses of DLS.new_key in the standard library.

I was thinking about the second type of protection, but your arguments convinced me that it might not be such a good idea.

Comment on lines +182 to +184
if Obj_opt.is_some updated_obj
then (Obj_opt.unsafe_get updated_obj : a)
else assert false
Contributor

Nit: saves one line.

Suggested change
- if Obj_opt.is_some updated_obj
- then (Obj_opt.unsafe_get updated_obj : a)
- else assert false
+ assert (Obj_opt.is_some updated_obj);
+ (Obj_opt.unsafe_get updated_obj : a)

Member Author

I used the current version to remain statically safe under the -noassert flag, which disables all assert expressions except assert false.

Contributor

Oh, I didn’t know about this.

Member

You can still save one line:

Suggested change
- if Obj_opt.is_some updated_obj
- then (Obj_opt.unsafe_get updated_obj : a)
- else assert false
+ if not (Obj_opt.is_some updated_obj) then assert false;
+ (Obj_opt.unsafe_get updated_obj : a)

Don't you love double negations?

@gasche
Member Author

gasche commented Feb 1, 2024

@OlivierNicole we have had detailed discussions of the code and your review comments since you made your review. Does your initial "approval" still stand?

This PR needs an approval from a maintainer, on their own or on behalf of @OlivierNicole.

@gasche
Member Author

gasche commented Feb 1, 2024

Note: this PR is a subtle, non-trivial change to the DLS implementation, so it would be great if we could manage to get this merged before the wider 5.2 testing starts, to maximize our chances of catching potential regressions. This means that the target for merging really is "in a couple weeks".

@Octachron Octachron added the bug label Feb 1, 2024
@OlivierNicole
Contributor

@OlivierNicole we have had detailed discussions of the code and your review comments since you made your review. Does your initial "approval" still stand?

I was about to answer yes, but then I came across a Discuss post by @gadmm where he says:

[@poll error] is also inoperant in bytecode (which is trickier because it has more polling locations).

So the polling locations in bytecode should be checked (I’ll try to do that soon…). Then I’ll approve the bugfix, while still saying that to me this code should be improved in the future to not rely on many subtle invariants. In a discussion with @sadiqj he said that an Atomic.get instead of a plain read on the fast path would probably make no noticeable difference on all reasonable architectures. @polytypic also noted that this is a case where TLS would make the implementation trivially easy and performant.

@gasche
Member Author

gasche commented Feb 2, 2024

@polytypic also noted that this is a case where TLS would make the implementation trivially easy and performant.

Of course, the present PR is meant to fix a bug (reported by @polytypic) in 5.2, before a behavior change from DLS to TLS that we hope will happen in 5.3 -- and which is clearly out of scope for 5.2.

@gasche
Member Author

gasche commented Feb 2, 2024

to me this code should be improved in the future to not rely on many subtle invariants

Do we have a practical way to do this in mind? The only ones I can think of are as follows:

  1. Implement all this in C instead of OCaml, to easily control where poll points occur. (At the cost of a bleh programming language.)
  2. Substantially improve the language-level support for poll-safe programming in OCaml, for example with a language construct for masking asynchronous callbacks. (Spec: delay all callbacks that can be delayed, for example on allocations; fail if the code inside may have non-delayable callbacks, for example function calls.)

Neither of those approaches seems preferable to the current one, to me, in the desired time target of 5.2.

@gasche
Member Author

gasche commented Feb 2, 2024

Note: I also discussed this problem (how the heck can we write thread-safe programs?) with @gadmm this week. One suggestion he made is to implement a "single-threaded atomics" module in stdlib/camlinternal* or utils/, similar to the implementation of Atomic that we had in 4.x: https://github.com/ocaml/ocaml/blob/4.14/stdlib/camlinternalAtomic.ml (but: for DLS we need atomic operations on Array, in addition to references).
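
To give an idea of the shape, here is a rough sketch in the spirit of 4.14's CamlinternalAtomic (not code proposed in this PR); as elsewhere in this discussion, each operation relies on no poll point occurring between its read and its write:

module St_atomic : sig
  type 'a t
  val make : 'a -> 'a t
  val get : 'a t -> 'a
  val set : 'a t -> 'a -> unit
  val compare_and_set : 'a t -> 'a -> 'a -> bool
  (* the array operation needed by DLS, in addition to references *)
  val array_compare_and_set : 'a array -> int -> 'a -> 'a -> bool
end = struct
  type 'a t = { mutable v : 'a }
  let make v = { v }
  let get r = r.v
  let set r v = r.v <- v
  let[@inline never] compare_and_set r seen v =
    if r.v == seen then (r.v <- v; true) else false
  let[@inline never] array_compare_and_set a i seen v =
    (* as noted in the PR, a.(i) may allocate for float arrays, so
       [@poll error] cannot be used here *)
    if a.(i) == seen then (Array.unsafe_set a i v; true) else false
end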

I think that this is a good suggestion, and it aligns well with the current implementation which basically includes such a helper function in the middle of Domain.DLS instead of a dedicated module. However I am not suggesting to do this in the general way (a new module and all) in the context of the present PR, because I am tired of working on it and I would like the bugfix to be merged in a reasonable time frame.

@OlivierNicole
Contributor

However I am not suggesting to do this in the general way (a new module and all) in the context of the present PR, because I am tired of working on it and I would like the bugfix to be merged in a reasonable time frame.

I don’t have a problem with that. However, I wouldn’t be taking my role as a reviewer seriously if I approved this PR without checking that the code is correct in bytecode, by examining the set of possible poll points (the big interpreter loop takes a bit of time to audit, as well as the emission of the CHECK_SIGNALS opcode). If you do it and say that it makes no difference, I will take your word for it. Otherwise, I will check it myself. I am doing my best to do this quickly! 🙂 but I am sure that you will understand that I have to juggle it with other time constraints.

@gasche
Member Author

gasche commented Feb 2, 2024

No problem with that. I think that the code is correct in bytecode as well, but I'm happy to let you double-check.

Contributor

@OlivierNicole OlivierNicole left a comment

It turns out that checking bytecode poll points was quicker than I thought.

From what I can see, in addition to polling everywhere the native program would also poll, the bytecode interpreter polls right before uninstalling an exception handler, and after switching fibers (i.e., after installing an effect handler, or performing an effect, or resuming a continuation). Nothing there threatens the correctness of this PR, so I reiterate my approval.

@OlivierNicole
Contributor

OlivierNicole commented Feb 5, 2024

Ah, yes. In native code only recursive functions contain poll points, right?

@OlivierNicole
Contributor

Note: I also discussed this problem (how the heck can we write thread-safe programs?) with @gadmm this week. One suggestion he made is to implement a "single-threaded atomics" module in stdlib/camlinternal* or utils/, similar to the implementation of Atomic that we had in 4.x: https://github.com/ocaml/ocaml/blob/4.14/stdlib/camlinternalAtomic.ml (but: for DLS we need atomic operations on Array, in addition to references).

That could be an idea. I would still be interested to know if there is a way to use atomics there in an almost-free way (on the fast path) to have the equivalent at a negligible cost, in a way that is more robust to compiler optimizations: the discussion in #12900 has shown that reasoning about this is non-trivial and requires deep knowledge of the compiler (and relies on invariants that may not be true forever). I personally lack knowledge regarding multicore performance.

Alternatively, your proposal would ideally integrate guarantees against some reorderings.

@damiendoligez damiendoligez self-assigned this Feb 7, 2024
Member

@damiendoligez damiendoligez left a comment

Looks almost good. I think the doc comment for new_key needs work.

Comment on lines +150 to +152
  (* Note: we cannot use [@poll error] due to the
     allocations on a.(i) in the Double_array case. *)
  let curval = a.(i) in
Member

You could get around this problem by using Obj.(obj (field (repr a) i)) instead of a.(i).
This amounts to using Obj.t as your "uniform array" type.

Member Author

Actually this trick does not work: Obj.field gets compiled in the same way as Array.unsafe_get; it generates a conditional on the block tag and an allocation in the float path. I could suppress this code path with a well-placed let a = (Obj.magic a : int array), but this feels too evil to me -- I don't know that it is safe.

Comment on lines 108 to 110
{b Warning.} [f] may be called several times if several
threads on the same domain try to access the key
concurrently. Only the 'first' value computed will be used,
Member

Isn't that overly specific? What about signal handlers, finalizers, etc? What if f itself is using DLS?

Member Author

New wording, what do you think?

    {b Warning.} [f] may be called several times if another call
    to [get] occurs during initialization on the same domain. Only
    the 'first' value computed will be used, the other now-useless
    values will be discarded. Your initialization function should
    support this situation, or contain logic to detect this case
    and fail.

@gasche gasche force-pushed the thread-safe-DLS branch 2 times, most recently from 6a32bbe to bef720c on February 29, 2024 at 20:16
@gasche
Member Author

gasche commented Feb 29, 2024

I took @damiendoligez's comments into account and rebased the PR.

@gasche
Member Author

gasche commented Mar 1, 2024

(This still needs the approval of a maintainer to be merged.)

@Octachron
Member

@gasche , there is now a maintainer approval.

@gasche
Member Author

gasche commented Mar 20, 2024

I don't see any maintainer approval through the GitHub UI, but I am willing to trust your slightly cryptic message as implying that you or some other maintainer approves of this PR. I will rebase right now to re-run the CI, and I would be happy to merge if that passes.

@Octachron
Member

Sorry for the implicitness; I was pointing to the fact that @OlivierNicole's approval is now a maintainer's approval.

@gasche
Member Author

gasche commented Mar 20, 2024

Duh, right, thanks!

@gasche
Member Author

gasche commented Mar 20, 2024

(So this is what I need to do to get my PRs merged :-)

@gasche
Member Author

gasche commented Mar 20, 2024

@OlivierNicole would you do the honors of clicking the "Merge pull request" button?

@OlivierNicole OlivierNicole merged commit a6875b7 into ocaml:trunk Mar 20, 2024
Octachron pushed a commit that referenced this pull request Mar 22, 2024
Attempt at a thread-safe implementation of DLS

(cherry picked from commit a6875b7)

the 'first' value computed will be used, the other now-useless
values will be discarded. Your initialization function should
support this situation, or contain logic to detect this case
and fail.
Contributor

A late caveat to the guarantee of using the first value: in case an asynchronous exception occurs between computation and assignment, the first computed value will be discarded. Similarly, if an asynchronous exception occurs during computation, then there is no guarantee that the first execution of the initialization function will give the value used eventually, so the logic mentioned can be difficult to implement.

Member Author

Maybe a reasonable view would be to say that the "first thread to compute the value" is defined as the first thread that manages to set the value? (This is what we would do in a concurrent setting anyway, where the synchronization on this final write and further reads would define the precedence order.)


Successfully merging this pull request may close these issues.

Domain.DLS is not thread-safe
6 participants