Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data races on fields of gc_stats structure #12590

Closed
OlivierNicole opened this issue Sep 20, 2023 · 6 comments
Closed

Data races on fields of gc_stats structure #12590

OlivierNicole opened this issue Sep 20, 2023 · 6 comments
Labels

Comments

@OlivierNicole
Copy link
Contributor

As reported by @eutro in #12579:

as a side-note, writes between barrier arrival and barrier departure may race with code outside the STW, caml_collect_gc_stats_sample has unsynchronised writes that could race with caml_compute_gc_stats, which is precisely what has been reported by TSan

The fix suggested by @eutro is to protect caml_compute_gc_stats by STW barriers, which makes sense to me.

@gasche
Copy link
Member

gasche commented Sep 20, 2023

The code says:

/* The "sampled stats" of a domain are a recent copy of its
   domain-local stats, accessed without synchronization and only
   updated ("sampled") during stop-the-world events -- each minor
   collection, and on domain termination. */
static struct gc_stats sampled_gc_stats[Max_domains];

/* Update the sampled stats for the given domain. */
void caml_collect_gc_stats_sample(caml_domain_state* domain);

The intent is that caml_collect_gc_stats_sample is only called within a STW section, at a time where it cannot race with mutators, while caml_compute_gc_stats can be called from mutators without synchronization. (In particular it is used in caml_gc_quick_stat that is supposed to be quick.)

I guess that @eutro is pointing at a bug where caml_collect_gc_stats_sample is called "too late" in the STW cycle, at a point where some mutators have already started running. Can we fix this issue, but preserve the intent of the code, by collecting the sample earlier? (We don't want to add extra barriers if we can avoid it.)

@eutro
Copy link
Contributor

eutro commented Sep 21, 2023

If, say, a thread gets pre-empted just after arriving at the barrier, other threads are free to leave the STW, and then the GC stats can easily be subject to a data race. The bug may also be easier to think of in terms of memory ordering rather than time. There is no release-ordered store after GC stat collection that other threads are guaranteed to acquire before they start executing mutator code, hence there is a data race. (Though there are release stores that might be witnessed (like the STW domains_still_processing decrement), many architectures have strong ordering anyway, etc.)


The existing barrier is arrived at here:

ocaml/runtime/minor_gc.c

Lines 644 to 650 in 8e30385

if( participating_count > 1 ) {
atomic_fetch_add_explicit
(&domains_finished_minor_gc, 1, memory_order_release);
}
domain->stat_minor_words += Wsize_bsize (minor_allocated_bytes);
domain->stat_promoted_words += domain->allocated_words - prev_alloc_words;

It should be enough to move the stat_*_words increments and caml_collect_gc_stats_sample before the barrier arrival, which ensures the correct memory ordering.

It might also, for clarity, make more sense to have the barrier departure within the same procedure, rather than its caller.

@gasche
Copy link
Member

gasche commented Sep 21, 2023

Let me rephrase this in terms that are more familiar to me (a non-expert):

  • the "barrier" at the end of the minor GC is in two halves, a domain first "enters" the barrier (the atomic_fetch_add_explicit call in caml_empty_minor_heap_promote) then "leaves" it (the SPIN_WAIT with atomic_load_acquire in its caller)
  • when we leave the barrier, we have learned for sure that all other domains have entered it
  • conversely, no other domain can leave the barrier before we enter it, but they can leave before we leave

The error with the current gc stats sampling is that it is done before leaving the barrier, while it should happen before entering the barrier.
(As a non-expert I find this error fairly natural, because it is easy to confuse "leaving the barrier" and "leaving the critical STW section that is in-between the two barriers".)

It should be enough to move the stat_*_words increments and caml_collect_gc_stats_sample before the barrier arrival, which ensures the correct memory ordering.

Indeed. (The stat_*_words increment need to be moved because they are part of the data that collect_*_sample collects.) Could you send a PR to do this?

It might also, for clarity, make more sense to have the barrier departure within the same procedure, rather than its caller.

Indeed, it would be clearer if the whole barrier code was in caml_empty_minor_heap_promote. Ideally there could be code comments on the entering and leaving code that explain to non-experts the invariants I mentioned:

  • before we enter the barrier, other domains may be running their STW sections, but no mutator code is running
  • after we enter the barrier, some domains may still be finishing their STW sections, and some mutators may start running (but never both at the same time)
  • after we leave the barrier, no domains are running their STW sections anymore

(would you also like to submit a PR?)

@gasche gasche mentioned this issue Sep 21, 2023
20 tasks
@eutro
Copy link
Contributor

eutro commented Sep 21, 2023

I'll happily submit a PR, yes.

@gasche
Copy link
Member

gasche commented Sep 23, 2023

To our current knowledge there are two races involving gc stats, one on caml_collect_gc_stats_sample that was fixed by merging #12595 and one on caml_clear_gc_stats_sample for which a fix is proposed in #12597. I believe that this issue can be closed once #12597 is merged. (There might remain other races in the gc_stats code.)

@gasche gasche added the bug label Oct 3, 2023
@gasche
Copy link
Member

gasche commented Oct 3, 2023

#12597 is merged and, to my knowledge, all gc_stats related data races are now fixed. Closing this.

#12597 removed the CAMLno_tsan attributes on stats functions that were silencing reports for the existing races. This means that new reports may show up in the future if some data races in fact remain in the runtime. Please report them :-)

@gasche gasche closed this as completed Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants