Attempt at a thread-safe implementation of DLS #12889
Conversation
I already have a test for this in another project, I think. However, it's not a test we'd be able to put into the
This is fine with me, but it would be nice if there is a version that I can run myself if I need to iterate on the PR. Thanks!
I think we need someone to contribute soon, as a reviewer, if we want to get this into 5.2.
(force-pushed from de5fd12 to 70eb15d)
I wrote a test and marked the PR as "ready for review". The test was useful as it caught a bug with the previous version of the PR, which accounted for resize/resize races and init/init races, but not resize/init races. Now the test passes and the code is more robust.
I am reviewing this.
(force-pushed from fba8dc3 to b592dd4)
This change is a net improvement and I believe it does make DLS thread-safe. I like the Obj_opt module, and it fixes a potential bug (missing opaque_identity in DLS.set_initial_key).
That being said, overall the choice not to use atomic values forces the code to rely on a large set of fragile assumptions: about safe points, about allowed compiler reorderings, about C primitives not releasing the domain lock… I think it would be more future-proof to use atomics and implement locking algorithms that scale well even in contended scenarios (I am not experienced enough to confidently propose some myself, unfortunately, but I believe they exist). The current code can work as an intermediary state but I would feel better if we planned to eventually move toward something else.
More minor, but it is a shame IMO to shift the responsibility of protecting the init : unit -> 'a function to the DLS user, instead of enforcing it ourselves with an atomic. It can avoid some contention when init requires no protection, but that must be weighed against the fact that: 1. user-written protection will still cause contention; 2. it will likely be less efficient than a protection written by us, or even incorrect.
However, this can also be done later as a backward-compatible change.
CAMLprim value caml_domain_dls_compare_and_set(value old, value new)
{
  CAMLnoalloc;
  value current = Caml_state->dls_root;
  if (current == old) {
    caml_modify_generational_global_root(&Caml_state->dls_root, new);
    return Val_true;
  } else {
    return Val_false;
  }
}
While OCaml code cannot run in parallel on the same domain, C code has no such restriction. Therefore extra care must be taken here to ensure that this function can never run concurrently with itself, otherwise we have a C-level data race. That makes me a bit uneasy in terms of robustness to changes.
Caml_state can only be used when the runtime lock is held, so I don't think we have to worry about races occurring due to this function: it has the same only-one-thread-at-a-time restriction as OCaml mutator code.
I didn’t know that! Thanks.
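For context, a minimal sketch of how the OCaml side might bind and use such a primitive in the resize path. The binding names and the loop follow the general shape of the stdlib internals but are illustrative, not necessarily the exact code of this PR:

external get_dls_state : unit -> Obj.t = "%dls_get"
external compare_and_set_dls_state : Obj.t -> Obj.t -> bool =
  "caml_domain_dls_compare_and_set"

(* Grow the per-domain key array (assuming the state already holds an array);
   if another thread published a new array in the meantime, the compare-and-set
   fails and we retry against the freshly read state. *)
let rec grow_dls_array new_size (dummy : Obj.t) : Obj.t array =
  let old = get_dls_state () in
  let old_arr : Obj.t array = Obj.obj old in
  if Array.length old_arr >= new_size then old_arr
  else begin
    let grown = Array.make new_size dummy in
    Array.blit old_arr 0 grown 0 (Array.length old_arr);
    if compare_and_set_dls_state old (Obj.repr grown) then grown
    else grow_dls_array new_size dummy (* lost the race: retry *)
  end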
let[@inline never] array_compare_and_set a i oldval newval =
  (* Note: we cannot use [@poll error] due to the
     allocations on a.(i) in the Double_array case. *)
  let curval = a.(i) in
  if curval == oldval then (
    Array.unsafe_set a i newval;
    true
  ) else false
Minor: all lines except the first could be put into a function aux marked with [@poll error].
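A minimal sketch of that suggestion (the local name aux is hypothetical, and it assumes [@poll error] is accepted on a local binding); the reply below explains why the extra function call is itself a concern:

let[@inline never] array_compare_and_set a i oldval newval =
  let curval = a.(i) in  (* may allocate in the Double_array (float) case *)
  let[@poll error] aux () =
    if curval == oldval then (
      Array.unsafe_set a i newval;
      true
    ) else false
  in
  aux ()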
What I would rather do is to have a version of "uniform arrays" with a simpler implementation, to make @poll error more precise. The problem with your proposal is that you are introducing intermediary function calls, which could in turn introduce poll points. (I think that function calls are poll points in bytecode, for example.)
Right, I hadn’t thought of that. Never mind my comment then.
stdlib/domain.mli (outdated)
should protect it by using a boolean flag to detect this and
fail, or a mutex, etc.
Suggested change:

should protect it by using an atomic boolean flag to detect this and
fail, or a mutex, etc.
Minor: maybe we shouldn’t let beginners think they can get away with a non-atomic flag.
I'm confused, they can absolutely get away with a non-atomic flag.
let once f =
  let already_called = ref false in
  fun () ->
    if !already_called then invalid_arg "this function may only be called once";
    already_called := true;
    f ()
Thinking about this more, once does not quite work as it prevents calling the function from several domains. We could use a DLS cell to remember if the initialization function was called, but it looks a bit weird.

let once_per_domain init =
  let already_called = Domain.DLS.new_key (fun () -> false) in
  fun () ->
    let called = Domain.DLS.get already_called in
    if called then failwith "...";
    Domain.DLS.set already_called true;
    init ()
My bad, it had slipped my mind that the two executions of f do not happen truly in parallel. In that case, would it make sense to add a sentence like “Note that the calls to f may be concurrent (although only one thread will run OCaml code at a time)”, or something along these lines?
It looks like the test creates too many threads on i386. Maybe the number could be a bit lower?
What we are implementing here is Domain.DLS, which is supposed to provide domain-local state, that is, basically, a reference cell per domain, and which is designed to avoid synchronization across domains. I am wary of trying to reason about the performance overhead of introducing an atomic read in the fast path.

If you have an idea in mind that uses atomics only in the slow path (where initialization is required), that would probably be fine. But I don't know what it is that you would suggest doing with atomics there. What sort of protection do you actually have in mind?

My intuition is that most key-initialization functions can in fact tolerate redundant calls just fine and do not need any sort of protection -- this is at least the case of all the uses of

Thanks, I will lower the thread count -- I will look for which constants still reliably trigger the bug on my machine.
(force-pushed from b592dd4 to f360fad)
I toned down the test considerably: instead of 10 threads per key for 100 DLS keys, it now runs 3 threads per key for 10 DLS keys. (We can still observe resize/init races on my machine, but they seem to not happen for every run, maybe 1 in 5 runs. This means that if we were to introduce a regression only in that case, the test would become flaky.)
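For reference, the rough shape of such a test (a sketch, not the actual testsuite code; it assumes the threads library is linked), with 10 keys and 3 threads per key:

let () =
  let nkeys = 10 and threads_per_key = 3 in
  let keys = Array.init nkeys (fun _ -> Domain.DLS.new_key (fun () -> ref 0)) in
  let threads =
    List.init (nkeys * threads_per_key) (fun i ->
      (* each thread races with the others on the first [get] of its key,
         which can also race against resizes of the per-domain key array *)
      Thread.create (fun () -> incr (Domain.DLS.get keys.(i mod nkeys))) ())
  in
  List.iter Thread.join threads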
I was thinking about the second type of protection, but your arguments convinced me that it might not be such a good idea.
if Obj_opt.is_some updated_obj
then (Obj_opt.unsafe_get updated_obj : a)
else assert false
Nit: saves one line. Suggested change:

assert (Obj_opt.is_some updated_obj);
(Obj_opt.unsafe_get updated_obj : a)
I used the current version to remain statically safe under the -noassert flag, which disables all assert expressions except assert false.
Oh, I didn’t know about this.
You can still save one line:

if not (Obj_opt.is_some updated_obj) then assert false;
(Obj_opt.unsafe_get updated_obj : a)
Don't you love double negations?
@OlivierNicole we have had detailed discussions of the code and your review comments since you made your review. Does your initial "approval" still stand? This PR needs an approval from a maintainer, on their own or on behalf of @OlivierNicole.
Note: this PR is a subtle, non-trivial change to the DLS implementation, so it would be great if we could manage to get this merged before the wider 5.2 testing starts, to maximize our chances of catching potential regressions. This means that the target for merging really is "in a couple weeks".
I was about to answer yes, but then I came across a Discuss post by @gadmm where he says:
So the polling locations in bytecode should be checked (I’ll try to do that soon…). Then I’ll approve the bugfix, while still saying that to me this code should be improved in the future to not rely on many subtle invariants. In a discussion with @sadiqj he said that an
Of course, the present PR is meant to fix a bug (reported by @polytypic) in 5.2, before a behavior change from DLS to TLS that we hope will happen in 5.3 -- and which is clearly out of scope for 5.2.
Do we have a practical way to do this in mind? The only ones I can think of are as follows:
Neither of those approaches seems preferable to the current one, to me, within the desired time target of 5.2.
Note: I also discussed this problem (how the heck can we write thread-safe programs?) with @gadmm this week. One suggestion he made is to implement a "single-threaded atomics" module in stdlib/camlinternal* or utils/, similar to the implementation of Atomic that we had in 4.x: https://github.com/ocaml/ocaml/blob/4.14/stdlib/camlinternalAtomic.ml (but: for DLS we need atomic operations on Array, in addition to references). I think that this is a good suggestion, and it aligns well with the current implementation, which basically includes such a helper function in the middle of Domain.DLS instead of a dedicated module. However I am not suggesting to do this in the general way (a new module and all) in the context of the present PR, because I am tired of working on it and I would like the bugfix to be merged in a reasonable time frame.
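A sketch of what such a helper could look like, in the spirit of the linked 4.x CamlinternalAtomic plus an array operation. The module name is made up here, and whether [@poll error] can be used as shown is exactly the kind of detail that would need checking:

module Single_domain_atomic = struct
  type 'a t = { mutable v : 'a }

  let make v = { v }
  let get r = r.v

  (* "Atomic" only with respect to other threads of the same domain:
     the read-compare-write sequence must not contain a poll point. *)
  let[@poll error] compare_and_set r seen w =
    let cur = r.v in
    if cur == seen then (r.v <- w; true) else false

  (* For arrays, [@poll error] cannot be used directly because a.(i) may
     allocate in the float-array case (as discussed in this PR), so this
     falls back to the weaker [@inline never] discipline. *)
  let[@inline never] array_compare_and_set a i seen w =
    let cur = a.(i) in
    if cur == seen then (Array.unsafe_set a i w; true) else false
end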
I don’t have a problem with that. However, I wouldn’t take my role as reviewer seriously if I approved this PR without checking that the code is correct in bytecode, by auditing the set of possible poll points (the big interpreter loop takes a bit of time to audit, as well as the emission of the
No problem with that. I think that the code is correct in bytecode as well, but I'm happy to let you check.
It turns out that checking bytecode poll points was quicker than I thought.
From what I can see, in addition to polling everywhere the native program would also poll, the bytecode interpreter polls right before uninstalling an exception handler, and after switching fibers (i.e., after installing an effect handler, or performing an effect, or resuming a continuation). Nothing that threatens the correctness of this PR, so I re-iterate my approval.
Ah, yes. In native code only recursive functions contain poll points, right?
That could be an idea. I would still be interested to know if there is a way to use atomics there in an almost-free way (on the fast path) to have the equivalent at a negligible cost, in a way that is more robust to compiler optimizations: the discussion in #12900 has shown that reasoning about this is non-trivial and requires deep knowledge of the compiler (and relies on invariants that may not be true forever). I personally lack knowledge regarding multicore performance. Alternatively, ideally your proposal would integrate guarantees against some reorderings.
Looks almost good. I think the doc comment for new_key needs work.
(* Note: we cannot use [@poll error] due to the
   allocations on a.(i) in the Double_array case. *)
let curval = a.(i) in
You could get around this problem by using Obj.(obj (field (repr a) i)) instead of a.(i). This amounts to using Obj.t as your "uniform array" type.
Actually this trick does not work: Obj.field gets compiled in the same way as Array.unsafe_get, so it generates a conditional on the block tag and an allocation in the float path. I could suppress this code path with a well-placed let a = (Obj.magic a : int array), but this feels too evil to me -- I don't know that it is safe.
stdlib/domain.mli (outdated)
{b Warning.} [f] may be called several times if several
threads on the same domain try to access the key
concurrently. Only the 'first' value computed will be used,
Isn't that overly specific? What about signal handlers, finalizers, etc.? What if f itself is using DLS?
New wording, what do you think?
{b Warning.} [f] may be called several times if another call
to [get] occurs during initialization on the same domain. Only
the 'first' value computed will be used, the other now-useless
values will be discarded. Your initialization function should
support this situation, or contain logic to detect this case
and fail.
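To illustrate the kind of user-side protection this warning alludes to, here is a hedged sketch that serializes the initializer with a mutex; create_expensive_resource is a placeholder for real setup work, not an API from this PR:

let create_expensive_resource () = Hashtbl.create 16  (* placeholder setup work *)

let init_mutex = Mutex.create ()

let table_key =
  Domain.DLS.new_key (fun () ->
    (* Redundant calls may still happen, but they never run the setup code
       concurrently, and only the first computed value is kept by DLS. *)
    Mutex.lock init_mutex;
    Fun.protect ~finally:(fun () -> Mutex.unlock init_mutex)
      create_expensive_resource)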
(force-pushed from 6a32bbe to bef720c)
I took @damiendoligez's comments into account and rebased the PR.
(This still needs the approval of a maintainer to be merged.)
@gasche, there is now a maintainer approval.
I don't see any maintainer approval through the GitHub UI, but I am willing to trust your slightly cryptic message as implying that you or some other maintainer approves of this PR. I will rebase right now to re-run the CI, and I would be happy to merge if that passes.
Sorry for the implicitness, I was pointing to the fact that @OlivierNicole's approval is now a maintainer's approval.
Duh, right, thanks!
(So this is what I need to do to get my PRs merged :-)
@OlivierNicole would you do the honors of clicking the "Merge pull request" button?
Attempt at a thread-safe implementation of DLS (cherry picked from commit a6875b7)
the 'first' value computed will be used, the other now-useless
values will be discarded. Your initialization function should
support this situation, or contain logic to detect this case
and fail.
A late caveat to the guarantee of using the first value: in case an asynchronous exception occurs between computation and assignment, the first computed value will be discarded. Similarly, if an asynchronous exception occurs during computation, then there is no guarantee that the first execution of the initialization function will give the value used eventually, so the logic mentioned can be difficult to implement.
Maybe a reasonable view would be to say that the "first thread to compute the value" is defined by the first thread that manages to set the value? (This is what we would do in a concurrent setting anyway, where the synchronization on this final write and further reads would define the precedence order.)
This is a proposal to fix #12677, making Domain.DLS thread-safe -- or rather, understanding, improving and specifying its behavior in the presence of multiple threads on the same domain.
This is marked as a Draft because I have not tested the code, only that it compiles. Writing tests for thread-safety issues is less tempting to me than writing thread-safe code, so I am hoping that some other people (perhaps @polytypic or @jmid) would be willing to lend a hand to test this / write a test. Seeing the code now (before it is ready for upstreaming) can still be informative, to see whether we can hope for a low-invasiveness fix for 5.2, independently of the complete-design-change proposal in #12724.
There are three changes in the PR:
1. The array-resizing logic is made thread-safe by checking whether another thread already resized the array, and giving up / retrying in that case. This relies on a new compare_and_set-like primitive for the dls root.
2. We accept that the init : unit -> 'a function provided by the user to initialize the key on first get may be called several times if several threads call get at the same time. We ensure and specify that only the first such value gets used; redundant initialization computations that finish later are discarded. This is worth discussing. In particular, we could decide to instead fail in this case, but I don't see how users would write programs if we failed by default in this case. The recommendation in the PR is for users to protect their initialization function if they do not support redundant initializations.
3. I refactored the handling of unique_value to be hidden in an internal module (this is inspired by my work on unboxed dynarrays: Dynarrays, unboxed (with local dummies) #12885). I did this in the context of a first iteration of (2) that stored lazy thunks in the st arrays. I gave up on this approach (it fails on concurrent initialization, which I think is the wrong behavior), but I kept the refactoring because I think it clarifies the code; at least it helps me write and think about this code. (A rough sketch of what such an internal module might look like is given below.)