Fix a data race in caml_darken_cont #12969


Merged
merged 2 commits into ocaml:trunk on Feb 12, 2024

Conversation

fabbing
Contributor

@fabbing fabbing commented Feb 9, 2024

This PR proposes to fix a data race detected by the TSan CI when using the debug runtime.
The data race occurs in the parallel/test_issue_11094.ml test, involving caml_scan_stack and caml_free_stack. Using the debug runtime causes the stack to be poisoned (runtime/fiber.c#L573), making the data race visible to TSan.

TSan data-race report
Write of size 8 at 0x7b5403f88318 by thread T4 (mutexes: write M0):
  #0 __tsan_memset /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_interceptors_posix.cpp:3089 (libtsan.so.2+0x79601) (BuildId: c39405aada755398a542cde01e23149afb1204d1)
  #1 caml_free_stack runtime/fiber.c:573 (a.out+0xbe4f8) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #2 caml_runstack <null> (a.out+0xf49ac) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #3 camlStdlib__Domain.body_710 [...]/ocaml/stdlib/domain.ml:207 (a.out+0x8574f) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #4 caml_start_program <null> (a.out+0xf41a7) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #5 caml_callback_exn runtime/callback.c:201 (a.out+0xafe7b) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #6 domain_thread_func runtime/domain.c:1205 (a.out+0xb5253) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)

Previous read of size 8 at 0x7b5403f88318 by thread T1 (mutexes: write M1):
  #0 scan_stack_frames runtime/fiber.c:250 (a.out+0xbdff0) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #1 caml_scan_stack runtime/fiber.c:285 (a.out+0xbdff0)
  #2 caml_darken_cont runtime/major_gc.c:1195 (a.out+0xd4aa8) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #3 do_some_marking runtime/major_gc.c:1033 (a.out+0xd5bd7) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #4 mark runtime/major_gc.c:1148 (a.out+0xd5d9b) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #5 major_collection_slice runtime/major_gc.c:1721 (a.out+0xd66b7) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #6 caml_major_collection_slice runtime/major_gc.c:1901 (a.out+0xd7754) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #7 caml_poll_gc_work runtime/domain.c:1739 (a.out+0xb5d3d) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #8 stw_handler runtime/domain.c:1391 (a.out+0xb6225) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #9 handle_incoming runtime/domain.c:330 (a.out+0xb6225)
  #10 caml_handle_incoming_interrupts runtime/domain.c:343 (a.out+0xb745b) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #11 caml_handle_gc_interrupt runtime/domain.c:1765 (a.out+0xb745b)
  #12 caml_enter_blocking_section runtime/signals.c:174 (a.out+0xe6cc4) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #13 caml_ml_mutex_lock runtime/sync.c:93 (a.out+0xea4e4) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #14 caml_c_call <null> (a.out+0xf402d) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #15 camlTest_issue_11094.with_mutex_296 [...]/ocaml/testsuite/tests/parallel/test_issue_11094.ml:19 (a.out+0x4f6b7) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #16 camlTest_issue_11094.fun_633 [...]/ocaml/testsuite/tests/parallel/test_issue_11094.ml:48 (a.out+0x4fdc0) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #17 camlStdlib__Domain.body_710 [...]/ocaml/stdlib/domain.ml:207 (a.out+0x8574f) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #18 caml_start_program <null> (a.out+0xf41a7) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #19 caml_callback_exn runtime/callback.c:201 (a.out+0xafe7b) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)
  #20 domain_thread_func runtime/domain.c:1205 (a.out+0xb5253) (BuildId: 5b510af01dbe393db8bb15cea51b9ae5ccd81175)

The data race

Both domains have executed caml_darken_cont at some point: one coming from a GC major scan, and the other from caml_continuation_use while resuming the continuation.

void caml_darken_cont(value cont)
{
  CAMLassert(Is_block(cont) && !Is_young(cont) && Tag_val(cont) == Cont_tag);
  {
    SPIN_WAIT {
      header_t hd = atomic_load_relaxed(Hp_atomic_val(cont));
      CAMLassert(!Has_status_hd(hd, caml_global_heap_state.GARBAGE));
      if (Has_status_hd(hd, caml_global_heap_state.MARKED))
        break;
      if (Has_status_hd(hd, caml_global_heap_state.UNMARKED) &&
          atomic_compare_exchange_strong(
              Hp_atomic_val(cont), &hd,
              With_status_hd(hd, NOT_MARKABLE))) {
        value stk = Field(cont, 0);
        if (Ptr_val(stk) != NULL)
          caml_scan_stack(&caml_darken, darken_scanning_flags, Caml_state,
                          Ptr_val(stk), 0);
        atomic_store_release(Hp_atomic_val(cont),
                             With_status_hd(hd, caml_global_heap_state.MARKED));
      }
    }
  }
}
  • The GC domain, performing a major_collection_slice, reached caml_darken_cont, marked the fiber's stack with caml_scan_stack, and updated the continuation header to MARKED with an atomic_store_release.
  • Later, the other domain, which resumed the continuation by calling caml_continuation_use_noexc, also reached caml_darken_cont, read the continuation header with an atomic_load_relaxed, found the MARKED status, and so skipped the stack marking.

Alas, in the C11 memory model, there is no happens-before relationship between this release store and the relaxed load, and so TSan reported the race.

Proposed fix

The atomic_load_relaxed of caml_darken_cont is replaced by an atomic_load_acquire (runtime/major_gc.c#L1185).

  • Pro: This is a well-defined synchronization according to the C11 memory model, and so TSan is happy.
  • Con: atomic_load_acquire is free on amd64, but not on arm64 (and other weakly ordered architectures?), where it will have a performance cost.

Alternative fix 1

We keep the atomic_load_relaxed and use the TSan annotation recently added in #12894 to tell TSan that this relaxed load synchronizes with the release store.

diff --git a/runtime/major_gc.c b/runtime/major_gc.c
index 78ccf358d3..353351519e 100644
--- a/runtime/major_gc.c
+++ b/runtime/major_gc.c
@@ -1183,6 +1183,7 @@ void caml_darken_cont(value cont)
   {
     SPIN_WAIT {
       header_t hd = atomic_load_relaxed(Hp_atomic_val(cont));
+      CAML_TSAN_ANNOTATE_HAPPENS_AFTER(Hp_atomic_val(cont));
       CAMLassert(!Has_status_hd(hd, caml_global_heap_state.GARBAGE));
       if (Has_status_hd(hd, caml_global_heap_state.MARKED))
         break;

Indeed, this may have been the code's intention, as according to the Linux-Kernel Memory Consistency Model:

W ->rfe R implies that W and R are on different CPUs.  It also means
that W's store must have propagated to R's CPU before R executed;
otherwise R could not have read the value stored by W.  Therefore W
must have executed before R, and so we have W ->hb R.

But I can't convince myself that a plain ldr (as emitted for the relaxed load) makes no difference compared to the ldar instruction (with its implicit acquire fence) emitted for an atomic_load_acquire. A (real) memory-model expert opinion on this would be welcome.

  • Pro: No performance cost, TSan is happy
  • Con: Still a data race under the C11 memory model (but fine for LKMM?); brittle, as it relies on TSan instrumenting atomic_store_release with the same mechanism as CAML_TSAN_ANNOTATE_HAPPENS_AFTER.

Alternative fix 2

We silence TSan reports for caml_free_stack.

  • Pro: CI is happy
  • Con: Would also hide some other race

Acknowledgment

A lot of work went into this one-line PR, with @OlivierNicole.

@gasche
Member

gasche commented Feb 10, 2024

If I understand correctly, there is a "bug" here in fiber.c:

ocaml/runtime/fiber.c

Lines 608 to 613 in 22d220a

/* this forms a barrier between execution and any other domains
that might be marking this continuation */
if (!Is_young(cont) ) caml_darken_cont(cont);
/* at this stage the stack is assured to be marked */
v = Field(cont, 0);

The comment is wrong: caml_darken_cont does not in fact form a barrier, because it can exit after only a relaxed load on the header, whereas an acquire load would have been required to form a barrier.

@gasche
Member

gasche commented Feb 10, 2024

Context: caml_darken_cont is performing a computation similar to forcing a thunk. If the continuation has not been marked yet (its color is UNMARKED), then exactly one domain must do it. Domains that observe an unmarked continuation compete to win a compare_and_set setting the header to the temporary NOT_MARKABLE color. The winning domain marks the continuation and then sets it to MARKED. All other competing domains spin-wait on their compare-and-set until they see MARKED from the winning domain. (This sounds so-so for scalability, but contention here probably never happens in practice.)

@gasche
Member

gasche commented Feb 10, 2024

I don't understand your proposed alternative fix 1 -- the proposed placement of the barrier annotation in that case. You place the barrier after the relaxed load, as if (if I understand correctly) any reader of the header synchronizes with the last release store. For me the correct reasoning is that if you read MARKED, you know that you must have seen it from the store_release below, so only in that case (right before we break) may we want to add a barrier annotation. This would suggest the following alternative alternative fix:

      if (Has_status_hd(hd, caml_global_heap_state.MARKED)) {
        /* MARKED can only come from the store_release below. */
        CAML_TSAN_ANNOTATE_HAPPENS_AFTER(Hp_atomic_val(cont));
        break;
      }

(This could be made more robust by adding a CAML_TSAN_ANNOTATE_HAPPENS_BEFORE around the store_release below, so that you don't depend on TSan under the hood using the same mechanism to handle both forms of synchronization.)

This being said, I'm not confident that this reasoning is correct -- that indeed we can tell, when reading MARKED, that we synchronized with the code below; I suppose that the status could also come from a shared allocation or from deserialisation.

@gasche
Member

gasche commented Feb 10, 2024

I agree that your proposed fix, to use an acquire-load to read the header in this function, sounds very reasonable.

In theory this could result in many more acquire loads in cases where the loop spins on a NOT_MARKABLE header. I don't know if this is a good or a bad thing. If this was considered a bad thing, we could write it this way instead:

      header_t hd = atomic_load_relaxed(Hp_atomic_val(cont));
      CAMLassert(!Has_status_hd(hd, caml_global_heap_state.GARBAGE));
      if (Has_status_hd(hd, caml_global_heap_state.MARKED)) {
        /* perform an acquire load to synchronize
           with the marking domain. */
        hd = atomic_load_acquire(Hp_atomic_val(cont));
        if (Has_status_hd(hd, caml_global_heap_state.MARKED))
          break;
      }

@fabbing
Contributor Author

fabbing commented Feb 10, 2024

Thank you for your interest in this PR, @gasche!

The comment is wrong, caml_darken_cont does not in fact form a barrier due to exiting after only a relaxed load on the header, while an acquire load would have been required to form a barrier.

The comment is correct in that only one domain is able to mark (darken) the continuation, and one and only one domain will do it (as you said in your context comment).
The problem is that, according to the C11 memory model, using a relaxed load doesn't provide the necessary happens-before relationship to form a barrier with the "rest" of the memory, and here in particular the fiber stack.

Note that according to the LKMM, if I understand it correctly, this is enough to form a hb relationship. I get it for amd64 (strong architecture) where acquire loads come for free, but I'm puzzled for arm64 (relaxed architecture). Also, the ppo of LKMM seems less strict than the po of C11.
This is where the opinion of a more knowledgeable person could guide us in choosing one of the proposed fixes.

I don't understand your proposed alternative fix 1 -- the proposed placement of the barrier annotation in that case. You place the barrier after the relaxed load, as if (if I understand correctly) any reader of the header synchronizes with the last release store.

Correct!

For me the correct reasoning is that if you read MARKED, you know that you must have seen it from the store_release below, so only on that case (right before we break) we may want to add a barrier annotation. So this would suggest the following alternative alternative fix:

I agree it's a better alternative: it would work identically and the comment is a great addition to explain the annotation.

(This could be made more robust by adding a CAML_TSAN_ANNOTATE_HAPPENS_BEFORE around the store_release below, so that you don't depend on TSan under the hood using the same mechanism to stare both forms of synchronization.)

@OlivierNicole also suggested this idea, for the same reason. I dropped it from this PR so as not to have too many alternative implementations, and because IMO it's reasonable to rely on the fact that this explicit annotation would match the implicit one performed by TSan's atomic operations.
If we happen to choose it, I could add a test in the testsuite to validate that this expectation is not broken by a TSan code update.

This being said, I'm not confident that this reasoning is correct -- that indeed we can tell when reading MARKED that we synchronized with the code below; I suppose that they could come from a shared allocation or from deserialisation.

As far as I know, this is the only correct way to mark a continuation, and it is only done by the GC or when "consuming" a continuation (in caml_continuation_use_noexc).

In theory this could result in many more acquire-load reads in case where the loop spins on a NOT_MARKABLE header. I don't know if this is a good or a bad thing. If this was considered a bad thing, we could write it this way instead:

      header_t hd = atomic_load_relaxed(Hp_atomic_val(cont));
      CAMLassert(!Has_status_hd(hd, caml_global_heap_state.GARBAGE));
      if (Has_status_hd(hd, caml_global_heap_state.MARKED)) {
        /* perform an acquire load to synchronize
           with the marking domain. */
        hd = atomic_load_acquire(Hp_atomic_val(cont));
        if (Has_status_hd(hd, caml_global_heap_state.MARKED))
          break;
      }

This is an excellent idea: we

  • have proper C11 synchronisation semantics
  • don't rely on knowledge of TSan internals
  • have as little performance penalty as possible: I expect none on amd64, as the second load will probably come for "free", and only one acquire load on arm64.
    I wondered whether we could lose events between these two reads (i.e. the header going from MARKED to UNMARKED and back to MARKED), but this shouldn't be possible according to the GC logic (it would require completing a full GC cycle).

I will update the PR to use the code you suggested.

fabbing added a commit to fabbing/ocaml that referenced this pull request Feb 10, 2024
@ghost

ghost commented Feb 12, 2024

I had also noticed the spurious race in this test when running with the debug runtime, thanks for working on that.

However I see a different trace when it occurs:

  • one thread is indeed in scan_stack_frames via caml_scan_stack and caml_darken_cont
  • but the other thread is in caml_free_stack from caml_runstack.

I am wondering if, in this case, there isn't also a risk that a third thread could reuse the stack being released by caml_free_stack and placed in the stack cache, before the first thread completes its caml_scan_stack, and that memory corruption could occur in this scenario. But I am not sure it can occur.

@OlivierNicole
Contributor

I had also noticed the spurious race in this test when running with the debug runtime, thanks for working on that.

However I see a different trace when it occurs:

* one thread is indeed in `scan_stack_frames` via `caml_scan_stack` and `caml_darken_cont`

* but the other thread is in `caml_free_stack` from `caml_runstack`.

I am confused, because you are describing exactly the two stack traces in @fabbing's report.

@ghost

ghost commented Feb 12, 2024

I am confused too; I had not seen the little arrow allowing the expansion of his traces. Yes, these are the same traces I am getting. However I am not sure that both threads have executed caml_darken_cont as he wrote.

@gasche
Member

gasche commented Feb 12, 2024

I found this confusing as well, but I proposed my own explanation for what is going on in #12969 (comment) (this is code in caml_continuation_use_noexc; my understanding is that the race occurs after that function called caml_darken_cont, which thus does not show up in the stack trace).

@fabbing
Contributor Author

fabbing commented Feb 12, 2024

Yes, we are indeed talking about the same data race!

TSan reports the race coming from caml_free_stack in caml_runstack because the fiber was suspended when the Fork effect was performed (in fork https://github.com/fabbing/ocaml/blob/dddc8707c99d092c57a850a83f19d83bb5324c85/testsuite/tests/parallel/test_issue_11094.ml#L16). This created a continuation for the fiber, and the execution flow jumped back to the main fiber, storing the continuation in the queue (https://github.com/fabbing/ocaml/blob/dddc8707c99d092c57a850a83f19d83bb5324c85/testsuite/tests/parallel/test_issue_11094.ml#L48). Finally, the continuation was resumed with a continue (https://github.com/fabbing/ocaml/blob/dddc8707c99d092c57a850a83f19d83bb5324c85/testsuite/tests/parallel/test_issue_11094.ml#L30), which is where the continuation is "consumed" with caml_continuation_use_noexc (marking it and preventing multiple resumes), and the execution flow jumped back into the computation (caml_runstack). Note that the synchronisation was supposed to happen in caml_darken_cont, executed from caml_continuation_use_noexc when continuing the fiber (https://github.com/fabbing/ocaml/blob/dddc8707c99d092c57a850a83f19d83bb5324c85/stdlib/effect.ml#L61), but it's no longer visible in the backtrace because of the jump in the execution flow.

fabbing added a commit to fabbing/ocaml that referenced this pull request Feb 12, 2024
@fabbing fabbing force-pushed the race_gc_tracing_free_stack branch from dddc870 to a4ff4d4 on February 12, 2024 13:36
fabbing added a commit to fabbing/ocaml that referenced this pull request Feb 12, 2024
@fabbing fabbing force-pushed the race_gc_tracing_free_stack branch from a4ff4d4 to 2fe1a2a on February 12, 2024 13:37
fabbing and others added 2 commits February 12, 2024 14:39
This fixes a data race between `caml_free_stack`, when a continuation is `resume`d, and `scan_stack_frames` from a (previous) GC marking.

Co-authored-by: Olivier Nicole <olivier.github.com@chnik.fr>
@fabbing fabbing force-pushed the race_gc_tracing_free_stack branch from 2fe1a2a to 7435700 on February 12, 2024 13:40
@fabbing
Contributor Author

fabbing commented Feb 12, 2024

Rebased on trunk, with a proper commit description, a Co-author mention for @OlivierNicole, and the Changes entry incorporating @dustanddreams's review.

@gasche
Member

gasche commented Feb 12, 2024

I was thinking of waiting for a memory-model expert to come around and comment, but those are few and far between, so maybe the fact that the four of us understand what is going on and agree on a fix is good enough. Merging.

@fabbing
Contributor Author

fabbing commented Feb 12, 2024

We can definitely wait for a memory model expert opinion! I just wanted to have a tidy PR in case this is the chosen solution.

@gasche gasche merged commit ef62c57 into ocaml:trunk Feb 12, 2024
@gasche
Member

gasche commented Feb 12, 2024

Nah, waiting is boring.

gasche added a commit that referenced this pull request Feb 12, 2024
Fix a data race in `caml_darken_cont`

(cherry picked from commit ef62c57)