Fix datarace between oldify and allocation #12894

Merged
merged 3 commits into from Feb 7, 2024

Johan511
Contributor

@Johan511 Johan511 commented Jan 13, 2024

Two data races were reported in parallel/multicore_systhreads.ml and test_issue_11094.ml, between oldify and the allocation of the object being oldified.

The reported TSAN logs are:
parallel/multicore_systhreads.ml
test_issue_11094.ml

Code can be fetched using

git fetch https://github.com/Johan511/ocaml e09d63cca9375fce8d2c91d0188591dbe0db142e
git checkout FETCH_HEAD

To replicate the datarace

git fetch https://github.com/Johan511/ocaml e09d63cca9375fce8d2c91d0188591dbe0db142e
git checkout FETCH_HEAD
./configure --enable-tsan
make -j6
cd testsuite
KEEP_TEST_DIR_ON_SUCCESS=On make one TEST=tests/parallel/multicore_systhreads.ml

(datarace is much harder to trigger in test_issue_11094.ml)

The data race is reported because TSAN cannot see the happens-before (hb) relationship between the initialising write to a header and the read from it.

The fix uses TSAN's AnnotateHappensBefore and AnnotateHappensAfter functions to annotate the hb relationship.

@gasche
Member

gasche commented Jan 13, 2024

Noob question: for me happens-before is a binary relation (a memory event happens before/after another). What does AnnotateHappensBefore((void*) addr) mean? (Could the PR either include a small comment that explains what this means, or a reference to pre-existing documentation?)

@Johan511
Contributor Author

Johan511 commented Jan 14, 2024

TSAN simulates a release barrier on encountering ANNOTATE_HAPPENS_BEFORE and an acquire barrier on encountering ANNOTATE_HAPPENS_AFTER.
Ref
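To make the pairing concrete: the address passed to each annotation acts as the key that links the two sides, so a "before" annotation is matched with any later "after" annotation that names the same address. Below is a minimal sketch, assuming the `AnnotateHappensBefore` / `AnnotateHappensAfter` prototypes as they are declared in LLVM's `llvm/Support/Compiler.h`. The `SKETCH_*` macros and `sketch_roundtrip` are hypothetical names for illustration and are not the macros added by this PR.

```c
#include <assert.h>
#include <stddef.h>

/* Detect ThreadSanitizer: GCC defines __SANITIZE_THREAD__, Clang exposes
   __has_feature(thread_sanitizer). */
#if defined(__SANITIZE_THREAD__)
#  define SKETCH_WITH_TSAN 1
#elif defined(__has_feature)
#  if __has_feature(thread_sanitizer)
#    define SKETCH_WITH_TSAN 1
#  endif
#endif

#ifdef SKETCH_WITH_TSAN
/* Provided by the TSan runtime; prototypes assumed from LLVM's Compiler.h. */
void AnnotateHappensBefore(const char *file, int line,
                           const volatile void *addr);
void AnnotateHappensAfter(const char *file, int line,
                          const volatile void *addr);
#  define SKETCH_HAPPENS_BEFORE(addr) \
     AnnotateHappensBefore(__FILE__, __LINE__, (addr))
#  define SKETCH_HAPPENS_AFTER(addr) \
     AnnotateHappensAfter(__FILE__, __LINE__, (addr))
#else
/* Without TSan the annotations compile to nothing. */
#  define SKETCH_HAPPENS_BEFORE(addr) ((void)(addr))
#  define SKETCH_HAPPENS_AFTER(addr)  ((void)(addr))
#endif

/* Single-threaded illustration of where the two annotations would sit:
   the writer side annotates after its plain write, the reader side
   annotates before its read, and both name the same address. */
int sketch_roundtrip(void) {
  static int payload;
  payload = 42;                     /* plain (initializing) write */
  SKETCH_HAPPENS_BEFORE(&payload);  /* behaves like a release on &payload */
  SKETCH_HAPPENS_AFTER(&payload);   /* behaves like an acquire on &payload */
  return payload;                   /* read ordered after the write */
}
```

The annotations only change what the race detector records; they emit no hardware barrier, so they are appropriate only where the ordering is already guaranteed by other means.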

@gasche
Member

gasche commented Jan 14, 2024

You are pointing at the TSAN runtime source code. Are those operations documented anywhere (else)?

@Johan511
Contributor Author

There is a paper on TSAN which mentions the annotations.
Paper
Please check section 5 (Dynamic Annotations)

Definition of annotations in the LLVM Library

Test / Usage for annotations

@OlivierNicole
Contributor

For context, these two race reports are from #11040 (in fact, they are the last remaining ones) and @Johan511 has worked with Tarides to help us understand their origin and fix them.

This is one of the cases where there is a race according to the C11 memory model, but not according to a more practical one like, e.g., the Linux kernel memory model (LKMM). The situation is similar to publication safety of OCaml values, which is justified by arguments outside of the scope of C11.

You are pointing at the TSAN runtime source code. Are those operations documented anywhere (else)?

TSan’s API has close to no documentation at all, unfortunately. But, having practiced reading its source code to understand what it does, I can confirm what @Johan511 says: these two annotations behave like a release store – acquire load pair.

@gasche
Member

gasche commented Jan 15, 2024

I trust @Johan511's expertise and his description of the situation, but I would have liked to have some documentation to reference in the PR source code itself for these obscure operations.

(Also, from just the source we have no information about whether these operations will remain available in the future. There is a possibility that the TSan mode of past OCaml releases would become unusable with more recent clang versions due to a change to this undocumented stuff.)

@OlivierNicole
Contributor

That is true, but since nothing is documented, the same can be said of all the functions in the TSan API; including the ones we use at the assembly level to instrument memory accesses.

@OlivierNicole
Contributor

That being said, I share your concern. On the other hand I am reassured by the fact that AnnotateHappensBefore / AnnotateHappensAfter have been around since 2009 and have survived a complete change of ThreadSanitizer’s detection algorithm. If the TSan maintainers removed them, it would likely break a number of projects that use them.

Contributor

@OlivierNicole OlivierNicole left a comment

The change looks good to me, however I think this PR should also remove the “no tsan” attribute on pool_initialize and caml_shared_try_alloc, since they are no longer needed:

diff --git a/runtime/shared_heap.c b/runtime/shared_heap.c
index 7d13dcd287..6031a98a01 100644
--- a/runtime/shared_heap.c
+++ b/runtime/shared_heap.c
@@ -253,7 +253,6 @@ static void calc_pool_stats(pool* a, sizeclass sz, struct heap_stats* s)
 }
 
 /* Initialize a pool and its object freelist */
-CAMLno_tsan  /* Disable TSan reports from this function (see #11040) */
 Caml_inline void pool_initialize(pool* r,
                                  sizeclass sz,
                                  caml_domain_state* owner)
@@ -426,7 +425,6 @@ static void* large_allocate(struct caml_heap_state* local, mlsize_t sz) {
   return (char*)a + LARGE_ALLOC_HEADER_SZ;
 }
 
-CAMLno_tsan /* Disable TSan reports from this function (see #11040) */
 value* caml_shared_try_alloc(struct caml_heap_state* local, mlsize_t wosize,
                              tag_t tag, reserved_t reserved)
 {

@gasche
Member

gasche commented Jan 16, 2024

@Johan511 could you clean up the history a bit? In principle I don't mind the "change to easily trigger the bug then revert that" trick, but in practice your change is rather invasive (it's a full revert of a PR, not just tweaking the default size constant). I think you could either drop these commits or simplify them to make them smaller.

@OlivierNicole
Contributor

@Johan511 I happen to have done this work of cleaning the history in order to test your PR together with some CI changes I am working on. The cleaned history is on this branch https://github.com/OlivierNicole/ocaml/commits/pr12894-rebased/ if that saves you time.

Contributor

@dustanddreams dustanddreams left a comment

The AnnotateHappensXXX names are currently in use by llvm's own sources, so one can reasonably expect that they won't disappear anytime soon (see llvm/include/llvm/Support/{Compiler,Threading}.h for their use)

@gasche
Member

gasche commented Jan 19, 2024

With apologies for being dense, now that I sort of understand what those annotations are doing, I would like to better understand why we are using them in this PR.

The explanation of @Johan511 is that the code in caml_shared_try_alloc is a bunch of initializing writes (we create a valid OCaml object from free space), and the code in do_some_marking and mark_slice_dark is a read from that initializing write.

As a non-memory model expert, it is not clear why we want to teach TSAN that there is a barrier / happens-before relationship here without inserting such a barrier. Is this a case of address dependency, on what? Could this reasoning be explained in documentation comments around the place where the annotations are inserted?

For the code of OCaml mutators (the code that corresponds to source code, not to the concurrent functioning of the GC), our reasoning is that not having a barrier on initializing writes is fine because we have a barrier on non-initializing writes (caml_modify), and a block populated by initializing writes can only be "published" to other domains by non-initializing writes.
It is not clear that the same reasoning applies here: shared_try_alloc adds something to the shared heap, and another thread that is concurrently doing some marking could observe this value in the shared heap without any synchronization?

Remark: how can it happen that two hardware threads allocate and mark the same shared heap? I thought that shared heaps are domain-local. The reports suggest a race between the domain's main thread and its backup thread, isn't there enough synchronization between those to rule this out?

Remark: the explanation of @Johan511 mentions a race between oldify and allocation, but the traces that I see in his post correspond to a race between marking the shared heap and initializing the first header word of a new shared heap pool. Are these considered the same thing?

@gasche
Member

gasche commented Jan 19, 2024

(cc @stedolan who shed some light on a different but similar issue, #12916)

@stedolan
Contributor

For the code of OCaml mutators (the code that corresponds to source code, not to the concurrent functioning of the GC), our reasoning is that not having a barrier on initializing writes is fine because we have a barrier on non-initializing writes (caml_modify), and a block populated by initializing writes can only be "published" to other domains by non-initializing writes.
It is not clear that the same reasoning applies here: shared_try_alloc adds something to the shared heap, and another thread that is concurrently doing some marking could observe this value in the shared heap without any synchronization?

The same reasoning applies here:

  • Marking does not pluck objects at random out of the heap. It follows pointers from the roots (stack, etc.) along fields, the same way that the program does. In other words, any object marked has a chain of address dependencies starting from its allocation, so the barrier on publication is sufficient by the same reasoning as for program accesses.

  • Sweeping does walk over arbitrary objects, irrespective of any dependencies. However, for this reason, sweeping of an object is only performed by the same domain that allocated it. (This is not quite true due to handoff after domain termination, but domain termination has enough fences to handle this)
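The first point can be illustrated with a toy marker that only ever reaches an object by loading a pointer from a root or from a field of an already-marked object. This is a single-threaded sketch with hypothetical types (`sketch_obj`, `sketch_mark_from`), not the runtime's `do_some_marking`; it only shows why every marked object is reached through a chain of pointer loads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_FIELDS 2
#define SKETCH_STACK_SIZE 16

/* Hypothetical heap object: a mark bit plus pointer fields. */
typedef struct sketch_obj {
  bool marked;
  struct sketch_obj *fields[SKETCH_MAX_FIELDS];
} sketch_obj;

/* Mark everything reachable from `root`. Objects are discovered only by
   loading pointers out of objects that were themselves discovered. */
void sketch_mark_from(sketch_obj *root) {
  sketch_obj *stack[SKETCH_STACK_SIZE];
  size_t top = 0;
  if (root != NULL && !root->marked) {
    root->marked = true;
    stack[top++] = root;
  }
  while (top > 0) {
    sketch_obj *o = stack[--top];
    for (size_t i = 0; i < SKETCH_MAX_FIELDS; i++) {
      sketch_obj *child = o->fields[i];   /* address comes from a field */
      /* Marking on push means each object is pushed at most once; the
         bound check keeps this toy safe for graphs of up to 16 objects. */
      if (child != NULL && !child->marked && top < SKETCH_STACK_SIZE) {
        child->marked = true;             /* dependent access via child */
        stack[top++] = child;
      }
    }
  }
}

/* a -> b -> c is reachable; `unreachable` points at c but nothing points
   at it, so the marker never touches it. Returns 1 if that holds. */
int sketch_mark_demo(void) {
  sketch_obj c = { false, { NULL, NULL } };
  sketch_obj b = { false, { &c, NULL } };
  sketch_obj a = { false, { &b, NULL } };
  sketch_obj unreachable = { false, { &c, NULL } };
  sketch_mark_from(&a);
  return a.marked && b.marked && c.marked && !unreachable.marked;
}
```

A sweeper, by contrast, would iterate over the storage directly without loading any pointer to get there, which is why the second point needs a separate argument.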

@gasche
Member

gasche commented Jan 19, 2024

Followup naive question: if the barrier on non-initializing mutations is enough to avoid races here, why does TSan not notice this? Is it missing an annotation or TSan function call on the write that publishes the value in this example?

(I feel that we are adding annotations here that have a delicate justification, to compensate for writes that use barriers in a way that TSan does not recognize. It would be much easier to justify adding an annotation on those writes, right?)

@OlivierNicole
Contributor

@gasche Your questions are perfectly legitimate, and it took @Johan511 and me a while to arrive at a satisfactory answer. Sorry if this post is a bit lengthy.

Allow me to first answer:

Remark: how can it happen that two hardware threads allocate and mark the same shared heap? I thought that shared heaps are domain-local. The reports suggest a race between the domain's main thread and its backup thread, isn't there enough synchronization between those to rule this out?

Short version: the major heap pool is owned by a domain but can be marked by other domains. The backup thread is a red herring. There are two distinct causes for the lack of synchronization seen by TSan, as I explain below.

To be more concrete, here is what happened during that execution:

  • We are in the middle of a minor collection. The value being promoted is the random generator state of domain 1, which is a Bigarray; it is promoted by domain 1, thus the new value is in a pool belonging to domain 1.
  • Concurrently, domain 0 collects its own minor heap (via its backup thread, presumably its main thread is blocked on a C call). For some reason, it finishes earlier than domain 1, and starts an opportunistic major slice. It starts marking stuff, and reaches the PRNG state of domain 1 (now living in a major pool belonging to domain 1). It reaches it presumably because it started marking from the DLS root of domain 1 (perhaps counter-intuitively, DLS are registered as global roots and will be marked by the first domain that has the opportunity). The first step of marking is to read the header (relaxed atomic load).
  • Here TSan reports a data race, between the initializing write (plain write) into the major heap by domain 1 and the (later) load of the header by domain 0.

I agree with you that the title of the PR should be changed: the “race” does not happen between oldify and allocation but rather between allocation and marking.

As @stedolan says, the race is not a race in practice thanks to address dependencies. I would add another ingredient, namely the CAS / release store (pick either) that happens between the allocation of the promoted value and its publication.

Here is an idealized view of what happens. To simplify things, I’m pretending that the promoted value is itself a global root. In reality, it is only reachable from the global roots, but I believe that the reasoning still holds (just replace “address dependency” by “chain of address dependencies”).

/* These values are initially published to all threads */
value v_in_minor0 = Val_op(caml_alloc_small(6, Custom_tag));
value *global_root = Op_val(v_in_minor0);

P0() { /* Domain 0 */
	/* Beginning of do_some_marking */

	/* Follow the global root... */
	value block =
		Val_op(atomic_load_explicit(&global_root, memory_order_relaxed));
	header_t hd = Hd_val(block); /* <-- race? */

	/* ... rest of do_some_marking */
}

P1() { /* Domain 1 */
	/* Beginning of oldify_one */

	/* Beginning of caml_shared_try_alloc */
	header_t *new_v_hp = &pool_start[count_allocated];
	Hd_hp(new_v_hp) = 0; /* <-- race? */
	/* and then later... */
	Hd_hp(new_v_hp) = Make_header(...); /* <-- race? */
	value *new_v = Val_hp(new_v_hp);
	/* End of caml_shared_try_alloc */

	for(int i = 0; i < 6; i++) {
		Field(new_v, i) = Field(v_in_minor0, i);
	}

	/* Beginning of try_update_object_header */
	header_t hd = atomic_load(Hp_atomic_val(v_in_minor0));
	header_t desired_hd = In_progress_update_val;
	if (atomic_compare_exchange_strong(Hp_atomic_val(v_in_minor0), &hd, desired_hd)) {
		/* Success. Now we can write the forwarding pointer. */
		atomic_store_relaxed(Op_atomic_val(v_in_minor0), new_v);
		/* And update header ('release' ensures after update of fwd pointer) */
		atomic_store_release(Hp_atomic_val(v_in_minor0), 0);
	} else {
		/* Updated by another domain. Spin for that update to complete and
		   then throw away the result and use the one from the other domain. */
		spin_on_header(v_in_minor0);
		new_v = Field(v_in_minor0, 0);
	}

	atomic_store_explicit(&global_root, new_v, memory_order_relaxed);
	/* End of try_update_object_header */

	/* End of oldify_one */
}

TSan reports a race between the two initializing stores in P1 and the header read in P0. There is no race in practice, by the following reasoning:

  • The final store to global_root in P1 happens-after the two “racy” stores to the header of new_v, because of the strong compare-and-swap in try_update_object_header (or the release store; both act as a release fence).
  • If P0 reads the old value from global_root, there is no data race on new_v_hp. If it reads new_v from global_root, then it happens-after the store to global_root in P1 (the store has been propagated to P0, “rfe implies hb” in terms of the Linux kernel memory model).
  • The two loads in P0 cannot be reordered because they are in an address dependency.

Of those three happens-before edges, TSan does not see nos. 2 and 3.

  • C11 considers that only an acquire load that reads from a release store can establish a hb relation. The LKMM relaxes this by allowing the same reasoning on "relaxed" loads that read from a release store (I’m using quotes because the LKMM does not have a notion of relaxed atomics, they use READ_ONCE and WRITE_ONCE which are similar), and the LKMM can be used correctly to model all architectures supported by OCaml.
  • TSan does not have the notion of address/data dependencies, and C11 doesn’t really have it either. Well, formally it does with memory_order_consume, but that was never correctly implemented in compilers, preventing us from using it and forcing us to reason outside of C11. In practice, all reasonable architectures behave as if address-dependent loads were executed in program order, except Alpha which OCaml doesn’t support.
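For comparison, here is the same publication pattern written so that it stays inside C11 and is visible to TSan: plain initializing writes, a release store that publishes the pointer, and an acquire load on the reader side. This is a sketch with hypothetical names (`sketch_block`, `sketch_root`, and so on), assuming POSIX threads are available. The idealized code above differs in one place only: its reader uses a relaxed load and relies on the address dependency, which is exactly the edge that neither C11 nor TSan recognizes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a freshly allocated major-heap block. */
typedef struct { unsigned long header; long field; } sketch_block;

static sketch_block sketch_storage;
static _Atomic(sketch_block *) sketch_root = NULL;

static void *sketch_writer(void *arg) {
  (void)arg;
  /* Initializing writes: plain, non-atomic stores. */
  sketch_storage.header = 0x400UL;
  sketch_storage.field = 42;
  /* Publication: the release store orders the writes above before it. */
  atomic_store_explicit(&sketch_root, &sketch_storage, memory_order_release);
  return NULL;
}

static void *sketch_reader(void *arg) {
  long *out = arg;
  sketch_block *b;
  /* Acquire keeps this sketch within C11. With memory_order_relaxed here,
     the dereference below would only be ordered by the address
     dependency, as in the idealized P0 above. */
  while ((b = atomic_load_explicit(&sketch_root,
                                   memory_order_acquire)) == NULL) {
    /* spin until the block is published */
  }
  *out = b->field;   /* address-dependent read of the published block */
  return NULL;
}

/* Runs one writer and one reader; returns the field value the reader
   observed, or -1 if a thread could not be created. */
int sketch_publish_demo(void) {
  pthread_t w, r;
  long seen = 0;
  atomic_store_explicit(&sketch_root, NULL, memory_order_relaxed);
  if (pthread_create(&w, NULL, sketch_writer, NULL) != 0) return -1;
  if (pthread_create(&r, NULL, sketch_reader, &seen) != 0) {
    pthread_join(w, NULL);
    return -1;
  }
  pthread_join(w, NULL);
  pthread_join(r, NULL);
  return (int)seen;
}
```

Using acquire on every header read during marking would make the annotations unnecessary, but it is a different trade-off from the one discussed here, since the annotations cost nothing in non-TSan builds.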

Could this reasoning be explained in documentation comments around the place where the annotations are inserted?

I also think that a comment should be added near the annotations with a justification.

Followup naive question: if the barrier on non-initializing mutations is enough to avoid races here, why does TSan not notice this? Is it missing an annotation or TSan function call on the write that publishes the value in this example?

Because we purposefully did not instrument initializing writes as we anticipated precisely this kind of spurious reports. Only, we missed this spot.

(I feel that we are adding annotations here that have a delicate justification, to compensate for writes that use barriers in a way that TSan does not recognize. It would be much easier to justify adding an annotation on those writes, right?)

We could de-instrument the two initializing writes involved. AnnotateHappensBefore / AnnotateHappensAfter seemed to be a more precise fix since, unlike de-instrumenting, it does not risk hiding other, unrelated races with the de-instrumented access. It is also a simple macro in the code, whereas de-instrumenting is syntactically heavier: it requires putting the memory access in a dedicated function with CAMLno_tsan and a Caml_noinline conditioned by #if defined(WITH_THREAD_SANITIZER).

But it’s not a strong opinion. And I admit that de-instrumenting all initializing writes would be more coherent.
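For readers unfamiliar with the de-instrumenting alternative, here is roughly what that pattern looks like. This is a sketch under assumptions: `SKETCH_NO_TSAN` and `SKETCH_NOINLINE` are hypothetical stand-ins for `CAMLno_tsan` and `Caml_noinline` (whose real definitions are not shown in this thread), built on the GCC/Clang `no_sanitize("thread")` and `noinline` attributes.

```c
#include <assert.h>

/* Hypothetical stand-ins for CAMLno_tsan / Caml_noinline. */
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_NO_TSAN  __attribute__((no_sanitize("thread")))
#  define SKETCH_NOINLINE __attribute__((noinline))
#else
#  define SKETCH_NO_TSAN
#  define SKETCH_NOINLINE
#endif

/* The initializing write lives in its own function so that only this one
   access escapes instrumentation. The noinline attribute is presumably
   there so the access is not merged back into an instrumented caller. */
SKETCH_NO_TSAN SKETCH_NOINLINE
void sketch_init_header(unsigned long *hp, unsigned long hd) {
  *hp = hd;
}

/* Caller stays fully instrumented; only the header write is hidden. */
unsigned long sketch_alloc_demo(void) {
  unsigned long header = 0;
  sketch_init_header(&header, 0x400UL);
  return header;
}
```

The drawback mentioned above is visible here: TSan no longer sees the write at all, so a genuine race on that header would go unreported, whereas the annotation approach keeps the access instrumented.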

@gasche
Member

gasche commented Jan 20, 2024

Sorry for being unclear, I was not suggesting to de-instrument the initializing writes, but rather to ensure that TSan sees the barrier in the modifying writes. (Possibly by using one of those Happens{Before,After} annotation there.)

@OlivierNicole
Contributor

If you are suggesting that we use annotations to make TSan take into account happens-before relation no. 2 in my description, that’s possible, but with that alone, no. 3 (the hb stemming from address dependencies) would still be missing, and I fear the ordering between the accesses would not be established.

@gasche
Member

gasche commented Jan 21, 2024

At this point I am convinced that this PR is a good approach to fix the false report by TSan, but I would enjoy seeing a bit of documentation (as source comments), at some or all of the places where an annotation is added, on why the annotation is helpful/necessary and correct there.

Surely the very detailed explanations of @OlivierNicole (thanks!) can be condensed in a few sentences that would help code readers have not all the details, but at least a sense of what they are looking at. (You should also feel free to include a link to the present PR for the full details.)

Changes Outdated
Comment on lines 63 to 65
- #11040: Silences false data race observed between caml_shared_try_alloc
and oldify. Introduces Macros to call tsan annotations which help annotate
a happens before relatioship
Contributor

Suggested change
- #11040: Silences false data race observed between caml_shared_try_alloc
and oldify. Introduces Macros to call tsan annotations which help annotate
a happens before relatioship
- #11040: Silences false data race observed between caml_shared_try_alloc
and oldify. Introduces macros to call tsan annotations which help annotate
a ``happens before'' relationship.

Contributor

@OlivierNicole OlivierNicole left a comment

Thank you for the update. I am overall happy with what the comments say, but below are suggestions regarding typos, and other small things.

Comment on lines 43 to 44
/* TSAN annotates a release operation on encountering ANNOTATE_HAPPENS_BEFORE
* and similarly an acquire operation on encountering ANNOTATE_HAPPENS_AFTER */
Contributor

Suggested change
/* TSAN annotates a release operation on encountering ANNOTATE_HAPPENS_BEFORE
* and similarly an acquire operation on encountering ANNOTATE_HAPPENS_AFTER */
/* TSan records a release operation on encountering ANNOTATE_HAPPENS_BEFORE
* and similarly an acquire operation on encountering ANNOTATE_HAPPENS_AFTER.
These annotations are used to eliminate false positives. */

“TSan records an operation” seems more accurate to me than “TSan annotates an operation”. Also minor: TSAN -> TSan.

Comment on lines 948 to 952
* Changes here should probably be reflected in do_some_marking. */
/* Annotating a acquire barrier on `p` because TSAN does not realise
* a happens-before relatioship established due to address dependencies
* with the read in shared_heap.c allocation (#12894) */
CAML_TSAN_ANNOTATE_HAPPENS_AFTER(Hp_val(child));
Contributor

Suggested change
* Changes here should probably be reflected in do_some_marking. */
/* Annotating a acquire barrier on `p` because TSAN does not realise
* a happens-before relatioship established due to address dependencies
* with the read in shared_heap.c allocation (#12894) */
CAML_TSAN_ANNOTATE_HAPPENS_AFTER(Hp_val(child));
* Changes here should probably be reflected in do_some_marking. */
/* Annotating an acquire barrier on the header because TSan does not see the
* happens-before relationship established by address dependencies with
* initializing writes in shared_heap.c allocation (#12894) */
CAML_TSAN_ANNOTATE_HAPPENS_AFTER(Hp_val(child));

The comment refers to p which is probably an unintentional leftover from copy-pasting.

I suggest adding a blank line to separate it from the previous, unrelated comment. The above also suggests a few minor word tweaks that make the sentence clearer IMO. (And it’s not a read in shared_heap.c, it’s two writes.)

Comment on lines 1014 to 1017
/* Annotating a acquire barrier on `p` because TSAN does not realise
* a happens-before relatioship established due to address dependencies
* with the read in shared_heap.c allocation (#12894) */
CAML_TSAN_ANNOTATE_HAPPENS_AFTER(Hp_val(block));
Contributor

Same remarks here.

Comment on lines 456 to 460
/* Annotating a release barrier on `p` because TSAN does not realise
* a happens-before relatioship established due to address dependencies
* with the read in major_gc.c marking (#12894) */
CAML_TSAN_ANNOTATE_HAPPENS_BEFORE(p);
Contributor

Suggested change
/* Annotating a release barrier on `p` because TSAN does not realise
* a happens-before relatioship established due to address dependencies
* with the read in major_gc.c marking (#12894) */
CAML_TSAN_ANNOTATE_HAPPENS_BEFORE(p);
/* Annotating a release barrier on `p` because TSan does not see the
* happens-before relationship established by address dependencies
* between the initializing writes here and the read in major_gc.c
* marking (#12894) */
CAML_TSAN_ANNOTATE_HAPPENS_BEFORE(p);

@OlivierNicole
Contributor

@Johan511 By the way @dustanddreams’s suggested changes were applied only partially and not quite faithfully—the quotes around ``happens-before`` are not quite right, and the two typos that he suggested to fix are still there.

@gasche gasche added the merge-me label Feb 7, 2024
@gasche
Member

gasche commented Feb 7, 2024

I am planning to merge this (if the CI agrees) and also include it in 5.2, which we want to have as TSan-clean as reasonably possible.

@OlivierNicole
Contributor

@gasche Could you add the run-thread-sanitizer label?

@gasche gasche added the run-thread-sanitizer This label makes the CI run the testsuite with TSAN enabled label Feb 7, 2024
Changes Outdated
- #11040: Silences false data race observed between caml_shared_try_alloc
and oldify. Introduces macros to call tsan annotations which help annotate
a ``happens before'' relationship.
(Olivier Nicole, Hari Hara Naveen S)
Member

(If you are still amending the PR, I think you could list me as a reviewer in the Changes entry.)

@Johan511
Contributor Author

Johan511 commented Feb 7, 2024

@gasche I fixed the code hygiene issues that were reported. Could you please approve the workflows once more?

@gasche gasche merged commit 6f5fe0e into ocaml:trunk Feb 7, 2024
12 checks passed
gasche added a commit that referenced this pull request Feb 7, 2024
Fix datarace between oldify and allocation

(cherry picked from commit 6f5fe0e)
@OlivierNicole OlivierNicole mentioned this pull request Feb 9, 2024
20 tasks
@dustanddreams
Contributor

For the record, you might remember me mentioning that TSan on riscv64 has become noticeably slower recently. I have tracked this slowdown to this PR, which is not a surprise since it causes pool_initialize and caml_shared_try_alloc to be instrumented now.

I am nevertheless surprised by the particular slowdown on riscv64. Measuring the time used to compile OCaml itself (thus a bit of C compilation and a lot of OCaml compilation), it causes an increase of the total time of about 10% on arm64, but 65% (!) on riscv64. It would be interesting to figure out why there is such a large difference.

@gasche
Member

gasche commented Apr 5, 2024

pool_initialize should be cold, the performance difference would come from caml_shared_try_alloc which allocates in the major heap. I wonder if the performance difference could come from worse TSAN code instrumentations for riscv in the C compiler. (Are you sure that you didn't use, say, a debug runtime under riscv? I would expect debug builds to suffer from larger overhead as they perform a lot more writes.) It may be possible to conditionally un-instrument the function under some architectures, to avoid large performance hits while preserving race-detection capabilities on other architectures.

@OlivierNicole
Contributor

Intriguing. @dustanddreams a simple way to test your hypothesis would be to un-instrument these functions again and see if the overhead goes away.

For the record, we now have CAMLno_tsan_for_perf which will make a function non-instrumented, except when we are debugging the runtime itself and define the CPP variable TSAN_INSTRUMENT_ALL—the CI does that, for instance. So that’s another option if instrumenting those functions proves costly.
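As an illustration of that idea (not the actual definition, which is not shown in this thread), such a macro could be structured as below. `SKETCH_NO_TSAN_FOR_PERF` and `sketch_hot_sum` are hypothetical names; only `TSAN_INSTRUMENT_ALL` comes from the description above, and the attribute spelling assumes GCC or Clang.

```c
#include <assert.h>

/* When TSAN_INSTRUMENT_ALL is defined (e.g. when debugging the runtime),
   the macro expands to nothing and the function stays instrumented.
   Otherwise the function is excluded from TSan instrumentation. */
#if defined(TSAN_INSTRUMENT_ALL)
#  define SKETCH_NO_TSAN_FOR_PERF
#elif defined(__GNUC__) || defined(__clang__)
#  define SKETCH_NO_TSAN_FOR_PERF __attribute__((no_sanitize("thread")))
#else
#  define SKETCH_NO_TSAN_FOR_PERF
#endif

/* A stand-in for a hot function where instrumentation overhead matters. */
SKETCH_NO_TSAN_FOR_PERF
long sketch_hot_sum(const long *xs, int n) {
  long acc = 0;
  for (int i = 0; i < n; i++) acc += xs[i];
  return acc;
}

long sketch_hot_demo(void) {
  const long xs[4] = { 1, 2, 3, 4 };
  return sketch_hot_sum(xs, 4);
}
```

Building once with `-DTSAN_INSTRUMENT_ALL` and once without gives a direct way to measure how much of a slowdown comes from instrumenting a given function.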

@dustanddreams
Contributor

Interestingly, the overhead does not come from the instrumentation of these two functions, but from the CAML_TSAN_ANNOTATE_HAPPENS_{BEFORE,AFTER} additions. I need to investigate in TSan why these are so costly on riscv.

Labels
merge-me run-thread-sanitizer This label makes the CI run the testsuite with TSAN enabled

5 participants