Fix a data race in caml_darken_cont
#12969
Conversation
If I understand correctly, there is a "bug" here in `fiber.c` (lines 608 to 613 in 22d220a):
The comment is wrong.
Context: `caml_darken_cont` is performing a computation similar to forcing a thunk. If the continuation has not been marked yet (its color is `UNMARKED`), then exactly one domain must do it. Domains that observe an unmarked continuation compete to win a compare-and-set setting the header to the temporary `NOT_MARKABLE` color. The winning domain marks the continuation and then sets it to `MARKED`. All other competing domains spin-wait on their compare-and-set until they see `MARKED` from the winning domain. (This sounds so-so for scalability, but contention here probably never happens in practice.)
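For illustration, here is a minimal, self-contained C11 sketch of this "compete on a compare-and-set" protocol. The names (`UNMARKED`, `NOT_MARKABLE`, `MARKED`, `darken_cont`, `mark_stack`) and types are made up for the example; this is not the actual runtime code.

```c
#include <stdatomic.h>

/* Illustrative colors; the real runtime encodes them in the object header. */
enum color { UNMARKED, NOT_MARKABLE, MARKED };

/* Stand-in for marking the continuation's stack (caml_scan_stack's role). */
static void mark_stack(void) { }

static void darken_cont(_Atomic(enum color) *hd)
{
  for (;;) {
    /* The memory order of this load is exactly what the rest of the
       discussion is about; acquire is the conservative choice here. */
    enum color c = atomic_load_explicit(hd, memory_order_acquire);
    if (c == MARKED)
      break;                        /* another domain already marked it */
    if (c == UNMARKED &&
        atomic_compare_exchange_strong_explicit(
            hd, &c, NOT_MARKABLE,
            memory_order_acquire, memory_order_relaxed)) {
      mark_stack();                 /* only the CAS winner gets here */
      atomic_store_explicit(hd, MARKED, memory_order_release);
      break;
    }
    /* Lost the CAS or saw NOT_MARKABLE: spin until MARKED appears. */
  }
}
```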
I don't understand your proposed alternative fix 1 -- the proposed placement of the barrier annotation in that case. You place the barrier after the relaxed load, as if (if I understand correctly) any reader of the header synchronizes with the last release store. For me the correct reasoning is that if you read `MARKED`, you know that you must have seen it from the `store_release` below, so only in that case (right before we break) should the annotation be placed:

```c
  if (Has_status_hd(hd, caml_global_heap_state.MARKED)) {
    /* MARKED can only come from the store_release below. */
    CAML_TSAN_ANNOTATE_HAPPENS_AFTER(Hp_atomic_val(cont));
    break;
  }
```

(This could be made more robust by adding a `CAML_TSAN_ANNOTATE_HAPPENS_BEFORE` around the `store_release` below, so that you don't depend on TSan under the hood using the same mechanism to implement both forms of synchronization.) This being said, I'm not confident that this reasoning is correct -- that indeed we can tell when reading `MARKED` that we synchronized with the code below; I suppose that it could come from a shared allocation or from deserialisation.
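For what it's worth, a sketch of that more robust variant could look like the following, assuming `CAML_TSAN_ANNOTATE_HAPPENS_BEFORE` takes the same address argument as the `HAPPENS_AFTER` form above; `marked_hd` is a stand-in for the header value carrying the `MARKED` status, and this is an illustration rather than code from the PR:

```c
  /* Marking domain, just before publishing MARKED: annotate the release
     store explicitly, so the load-side HAPPENS_AFTER pairs with this
     annotation instead of relying on TSan's internal handling of the
     atomic store. */
  CAML_TSAN_ANNOTATE_HAPPENS_BEFORE(Hp_atomic_val(cont));
  atomic_store_release(Hp_atomic_val(cont), marked_hd);
```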
I agree that your proposed fix, to use an acquire load to read the header in this function, sounds very reasonable. In theory this could result in many more acquire-load reads in the case where the loop spins on a `NOT_MARKABLE` header. I don't know if this is a good or a bad thing. If this was considered a bad thing, we could write it this way instead:

```c
    header_t hd = atomic_load_relaxed(Hp_atomic_val(cont));
    CAMLassert(!Has_status_hd(hd, caml_global_heap_state.GARBAGE));
    if (Has_status_hd(hd, caml_global_heap_state.MARKED)) {
      /* perform an acquire load to synchronize
         with the marking domain. */
      hd = atomic_load_acquire(Hp_atomic_val(cont));
      if (Has_status_hd(hd, caml_global_heap_state.MARKED))
        break;
    }
```
Thank you for your interest in this PR, @gasche!
The comment is correct in the fact that only one domain can mark the continuation. Note that, according to the LKMM, if I understand it correctly, this is enough to form a happens-before relationship.
Correct!
I agree it's a better alternative: it would work identically and the comment is a great addition to explain the annotation.
@OlivierNicole also suggested this idea, for the same reason. I dropped it from this PR so as not to have too many alternative implementations, and because IMO it's reasonable to rely on the fact that this explicit annotation would match the implicit one performed by TSan's atomic operation.
As far as I know, this is the only correct way to mark a continuation, and it is only done by the GC or when "consuming" a continuation (in `caml_continuation_use`).
```c
    header_t hd = atomic_load_relaxed(Hp_atomic_val(cont));
    CAMLassert(!Has_status_hd(hd, caml_global_heap_state.GARBAGE));
    if (Has_status_hd(hd, caml_global_heap_state.MARKED)) {
      /* perform an acquire load to synchronize
         with the marking domain. */
      hd = atomic_load_acquire(Hp_atomic_val(cont));
      if (Has_status_hd(hd, caml_global_heap_state.MARKED))
        break;
    }
```

This is an excellent idea; I will update the PR to use the code you suggested.
I had also noticed the spurious race in this test when running with the debug runtime, thanks for working on that. However I see a different trace when it occurs:
I am wondering if, in this case, there isn't also a risk that a third thread could reuse the stack being released by `caml_free_stack`.
I am confused, because you are describing exactly the two stack traces in @fabbing's report.
I am confused too, I had not seen the little arrow allowing the expansion of his traces. Yes, these are the same traces I am getting. However I am not sure that both threads have executed `caml_darken_cont`.
I found this confusing as well, but I proposed my own explanation for what is going on in #12969 (comment) (this is code in `caml_darken_cont`).
Yes, we are indeed talking about the same data race! TSan reports the race coming from `caml_free_stack`.
This fixes a data race between `caml_free_stack`, when a continuation is `resume`d, and `scan_stack_frames` from a (previous) GC marking. Co-authored-by: Olivier Nicole <olivier.github.com@chnik.fr>
Rebased on trunk.
I was thinking of waiting for a memory-model expert to come around and comment, but those are few and far between, so maybe the fact that the four of us understand what is going on and agree on a fix is good enough. Merging.
We can definitely wait for a memory model expert opinion! I just wanted to have a tidy PR in case this is the chosen solution.
Nah, waiting is boring.
Fix a data race in `caml_darken_cont` (cherry picked from commit ef62c57)
This PR proposes to fix a data race detected by the TSan CI when using the debug runtime.
The data race occurs in the `parallel/test_issue_11094.ml` test, involving `caml_scan_stack` and `caml_free_stack`. Using the debug runtime causes the stack to be poisoned (runtime/fiber.c#L573), making the data race visible to TSan.

TSan data-race report
The data race
Both domains have executed `caml_darken_cont` at some point, one coming from a GC major scan, and the other one from `caml_continuation_use` while `resume`ing the continuation.

- One domain, during a `major_collection_slice`, reached `caml_darken_cont`, marked the fiber's stack with `caml_scan_stack` and updated the continuation header to `MARKED` with an `atomic_store_release`.
- The other domain, which `resume`d the continuation, called `caml_continuation_use_noexc`, also reached `caml_darken_cont`, read the continuation header with an `atomic_load_relaxed`, found the `MARKED` tag, and so skipped the stack marking.

Alas, in the C11 memory model, there is no happens-before relationship between this `release` store and the `relaxed` load, and so TSan reported the race.
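As a standalone illustration of this last point (plain C11, not runtime code): in a message-passing pattern, a relaxed load of the flag does not synchronize with the release store, so the reader's access to the payload is a data race; an acquire load of the flag is what creates the happens-before edge. A minimal sketch:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;              /* plays the role of the fiber's stack */
static _Atomic int flag;         /* plays the role of the MARKED header */

static void *writer(void *arg)   /* the domain that marks the stack */
{
  (void)arg;
  payload = 42;                                          /* "mark" it */
  atomic_store_explicit(&flag, 1, memory_order_release);
  return NULL;
}

static void *reader(void *arg)   /* the domain that resumes the continuation */
{
  (void)arg;
  /* With memory_order_relaxed here, seeing flag == 1 would give no
     happens-before with the writer and the read of payload would race
     (which is what TSan reports). Acquire creates the needed edge. */
  if (atomic_load_explicit(&flag, memory_order_acquire) == 1)
    printf("%d\n", payload);
  return NULL;
}

int main(void)
{
  pthread_t w, r;
  pthread_create(&w, NULL, writer, NULL);
  pthread_create(&r, NULL, reader, NULL);
  pthread_join(w, NULL);
  pthread_join(r, NULL);
  return 0;
}
```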
Proposed fix

The `atomic_load_relaxed` of `caml_darken_cont` is replaced by an `atomic_load_acquire` (runtime/major_gc.c#L1185). `atomic_load_acquire` is free for amd64, but not for arm64 (and other weak architectures?) and will have a performance cost.
Alternative fix 1

We keep the `atomic_load_relaxed` and use the recently added TSan annotation from #12894, to indicate that this relaxed load synchronizes with the release store. Indeed, this may have been the code's intention, as according to the Linux-Kernel Memory Consistency Model:
But I can't convince myself that a simple `ldr` from the `atomic_load_relaxed` makes no difference compared to the `ldar` (implicit acquire fence) instruction of an `atomic_load_acquire`. A (real) memory model expert opinion on this would be welcome.

We could also annotate the `atomic_store_release` with the same mechanism as with `CAML_TSAN_ANNOTATE_HAPPENS_AFTER`.
.Alternative fix 2
We silence TSan reports for `caml_free_stack`.
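For illustration only (an assumption about the mechanism, not necessarily what this PR would do), one way to silence TSan for a single function is Clang's `no_sanitize` attribute, which skips instrumentation of that function's memory accesses; the signature below is schematic:

```c
struct stack_info;

/* Hypothetical sketch: TSan does not instrument the accesses made inside
   this function (GCC spells the attribute no_sanitize_thread). */
__attribute__((no_sanitize("thread")))
void caml_free_stack(struct stack_info *stack)
{
  (void)stack;
  /* ... poison and free the stack as before ... */
}
```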
Acknowledgment
A lot of work went into this one-line PR, with @OlivierNicole.