
runtime: fix a race on the orphaning structure in major_gc.c #12591

Merged · 2 commits · Sep 23, 2023

Conversation

gasche (Member) commented Sep 21, 2023

I went back to #12345 today as we saw a new CI failure, and this somehow led me to #11040, which lists a few data-race bugs in the runtime that have been found by TSan and silenced -- some of the errors there were detected by the finaliser_handover test. I started looking at those data races, and this PR is the result.

major_gc.c uses an orphaned_lock mutex to protect a global orph_structs data structure that stores information about ephemerons or finalizers that have been "orphaned" by a terminating domain and not yet "adopted" by one of the still-running domains. Except that one function, no_orphaned_work() (which checks whether anything is currently orphaned), does not take the lock and is thus racy.

The present commit:

  • adds locking to no_orphaned_work()
  • reduces the critical orphaned_lock sections to a minimum (only the accesses to orph_structs) to minimize the risk of performance impact

Before this change, orphaned_lock would only be taken by programs that use ephemerons and finalisers. After this PR, everyone takes the lock, a handful of times per major GC slice. I don't anticipate a performance impact, except possibly in pathological programs that perform a lot of slices with almost no work each. (One might think of opportunistic slices as a candidate for this pathological behavior, but they have a shortcut to quit early when they find no work to do, before taking this lock.) It might still be worth running benchmarks just to be sure.
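For illustration, here is a minimal sketch of the direction described above, not the exact patch: take orphaned_lock around the two reads in no_orphaned_work(), keeping the critical section as short as possible. The field and lock names come from the discussion; the caml_plat_lock/caml_plat_unlock helpers are assumed to match the runtime's platform.h mutex API.

```c
/* Sketch only: locking added to no_orphaned_work(), with the critical
   section reduced to the two reads of orph_structs. */
static int no_orphaned_work (void)
{
  int no_work;
  caml_plat_lock (&orphaned_lock);
  no_work =
    orph_structs.ephe_list_live == 0 &&
    orph_structs.final_info == NULL;
  caml_plat_unlock (&orphaned_lock);
  return no_work;
}
```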

gadmm (Contributor) commented Sep 21, 2023

There seem to be two issues:

  1. The current implementation is racy (this race may or may not be innocuous);
  2. The race is via plain non-atomic accesses (UB), instead of relaxed atomic accesses.

What TSan complains about is 2. rather than 1.

The idiom in the original code seems to be the mixing of mutex-protected accesses with relaxed accesses, to avoid locking in some read-only places, at the cost of perhaps introducing races that are considered innocuous. A solution may be to replace the plain accesses with relaxed atomic ones (and this is the solution for other bugs found by TSan).

Is the race in 1. innocuous or harmful? Does removing the race (rather than the UB) play a role in fixing the flaky test?

gasche (Member Author) commented Sep 21, 2023

I don't really consider that racy non-atomic accesses are supposed to be innocuous races. There is no indication of intent, no comment about this, so for all I know this is just a programming mistake, a bug -- which I am fixing in my commit. If we wanted to use relaxed atomics, we could, but then we would not bother with a lock in the first place. (Or we could use reader-writer locks; I regret that they are not exposed as easy-to-use functions in platform.h, because we could use them in many places.)

I propose to only consider changing the code to use atomics if we get good reasons to believe that using the locks correctly is a performance regression. Let's explore the simpler, less invasive change (fixing the use of locks) for now.

gasche (Member Author) commented Sep 21, 2023

(The code changes here should not affect the flaky test at all, which is a different gaping bug.)

gadmm (Contributor) commented Sep 21, 2023

I think there is a misunderstanding. Of course the plain accesses are UB, but mixing atomic accesses & mutexes is a common trick (cf. double-checked locking). (And "race" then is not meant in the sense of C data race implying UB.)

It seems that in order to make the code correct, one simply needs to add an _Atomic specifier to the fields of orph_structs (giving SC atomic accesses, which can be further refined into relaxed accesses in this case).
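A self-contained sketch of that suggestion follows (not the actual runtime code: the real fields hold an OCaml `value` and a `struct caml_final_info *`, represented here by uintptr_t and void * so the example compiles on its own):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the runtime's orph_structs: the fields are marked _Atomic so
   that unlocked reads are no longer a C data race, while writers keep using
   orphaned_lock as before. */
static struct {
  _Atomic uintptr_t ephe_list_live;   /* orphaned live ephemerons */
  _Atomic(void *)   final_info;       /* orphaned finalisation info */
} orph_structs;

/* Lock-free check: plain accesses to _Atomic objects are atomic (seq_cst by
   default, and could be weakened to relaxed loads here). */
static int no_orphaned_work (void)
{
  return
    orph_structs.ephe_list_live == 0 &&
    orph_structs.final_info == NULL;
}
```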

gasche (Member Author) commented Sep 21, 2023

I think I understood you on the first try. You are guessing that the absence of a lock in no_orphaned_work was an intentional design choice and not a mistake, and that the code can be fixed while still respecting this intention by using both a lock and atomic accesses. My point is that this is just a vague guess, and that in the absence of any mark of intent in the code (if you want to mix locks with non-locked accesses, please leave a comment!) I assume that it is just a bug. More importantly, the final code is simpler if we just use the locking discipline, and there is no demonstrated or suspected performance need for the more complex approach that you are suggesting.

I'm not opposed to moving to atomics (I would rather use atomics directly and perform an atomic exchange when emptying the two lists) if we see on some benchmarks that it makes a difference, or if someone who understands the major GC dynamics better than I do makes dire predictions about the performance of taking a handful of locks per GC slice. Until then, I think we should stick to the simpler implementation.

(We could just ask @kayceesrk who wrote the racy code in 2242a22, and could probably tell us if it is important in his mind to avoid taking a lock in no_orphaned_work().)

kayceesrk (Contributor) commented

@kayceesrk who wrote the racy code

Ouch. Good way to be brought into a conversation.

I believe it is important to avoid taking a lock for no_orphaned_work() since it is used in the checks for GC phase changes, which don't involve locks now.

ocaml/runtime/major_gc.c

Lines 1436 to 1475 in f89f4cc

static int is_complete_phase_sweep_and_mark_main (void)
{
  return
    caml_gc_phase == Phase_sweep_and_mark_main &&
    atomic_load_acquire (&num_domains_to_sweep) == 0 &&
    atomic_load_acquire (&num_domains_to_mark) == 0 &&
    /* Marking is done */
    atomic_load_acquire(&ephe_cycle_info.num_domains_todo) ==
    atomic_load_acquire(&ephe_cycle_info.num_domains_done) &&
    /* Ephemeron marking is done */
    no_orphaned_work();
    /* All orphaned ephemerons have been adopted */
}

static int is_complete_phase_mark_final (void)
{
  return
    caml_gc_phase == Phase_mark_final &&
    atomic_load_acquire (&num_domains_to_final_update_first) == 0 &&
    /* updated finalise first values */
    atomic_load_acquire (&num_domains_to_mark) == 0 &&
    /* Marking is done */
    atomic_load_acquire(&ephe_cycle_info.num_domains_todo) ==
    atomic_load_acquire(&ephe_cycle_info.num_domains_done) &&
    /* Ephemeron marking is done */
    no_orphaned_work();
    /* All orphaned ephemerons have been adopted */
}

static int is_complete_phase_sweep_ephe (void)
{
  return
    caml_gc_phase == Phase_sweep_ephe &&
    atomic_load_acquire (&num_domains_to_ephe_sweep) == 0 &&
    /* All domains have swept their ephemerons */
    atomic_load_acquire (&num_domains_to_final_update_last) == 0 &&
    /* All domains have updated finalise last values */
    no_orphaned_work();
    /* All orphaned structures have been adopted */
}

The idea is that each of these conditions can first be tested without locks, and then confirmed under global coordination, such as here:

ocaml/runtime/major_gc.c

Lines 1698 to 1708 in f89f4cc

  /* Complete GC phase */
  if (is_complete_phase_sweep_and_mark_main() ||
      is_complete_phase_mark_final ()) {
    CAMLassert (caml_gc_phase != Phase_sweep_ephe);
    if (barrier_participants) {
      try_complete_gc_phase (domain_state,
                             (void*)0,
                             participant_count,
                             barrier_participants);
    } else {
      caml_try_run_on_all_domains (&try_complete_gc_phase, 0, 0);

I don't know whether adding locks for no_orphaned_work will slow down the code. But I'd prefer to keep the spirit of testing pending work without locks. So my preference would be to make these relaxed atomic operations.
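For reference, a sketch of the relaxed-atomics variant being suggested here, assuming the orph_structs fields have been made _Atomic (field names from the discussion; this is not the code that was eventually merged):

```c
/* Lock-free phase-change check with relaxed loads: no ordering constraints,
   just race-free atomic reads of the two orphan lists. */
static int no_orphaned_work (void)
{
  return
    atomic_load_explicit (&orph_structs.ephe_list_live,
                          memory_order_relaxed) == 0 &&
    atomic_load_explicit (&orph_structs.final_info,
                          memory_order_relaxed) == NULL;
}
```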

gasche (Member Author) commented Sep 22, 2023

Hm, sorry for the unceremonious mention, and thanks for the extra details about your original intent!

kayceesrk (Contributor) commented

Hm, sorry for the unceremonious mention,

No worries :-). FWIW, it is a data race and a bug. It should be fixed. Thanks for working on a fix!

@gasche gasche force-pushed the runtime-orph-structs-race branch 2 times, most recently from 7fef131 to 946bc0f on September 22, 2023 at 11:43
gasche (Member Author) commented Sep 22, 2023

I updated the PR to use atomic fields in addition to the lock as suggested by @gadmm. Now the lock is used to protect the push/pop operations from each other, and read-only usage is performed with just atomic reads.

I went for two acquire reads in the no_orphaned_work() function instead of the relaxed reads suggested by @kayceesrk. I still believe that this function is not performance-sensitive (the time spent in each major GC slice should be dominated by the major GC work rather than by the constant-per-slice book-keeping, otherwise our GC pacing logic is wrong); my intent is that acquire loads synchronize with earlier pushes into those lists, instead of allowing an arbitrary delay between pushes and eventual adoption.

(It may be that synchronization happens often enough anyway, but that is a bit tricky to reason about given that production of orphans happens on domain termination, and terminated domains do not synchronize much with other domains.)
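A self-contained model of the scheme described above, with simplified names and a single pointer standing in for the runtime's orphan lists (so the push below overwrites rather than appends). It is meant to show the division of labour, not the actual runtime API: the mutex serialises the writers (orphaning on domain termination, adoption by a running domain), while the check only performs acquire loads and never takes the lock.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <pthread.h>

static _Atomic(void *) orphaned_finalisers;   /* stand-in for orph_structs */
static pthread_mutex_t orphaned_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lock-free check used when testing for GC phase changes. */
static int no_orphaned_work (void)
{
  return atomic_load_explicit (&orphaned_finalisers,
                               memory_order_acquire) == NULL;
}

/* Called by a terminating domain; the lock orders it against adoption. */
static void orphan_finalisers (void *info)
{
  pthread_mutex_lock (&orphaned_lock);
  atomic_store_explicit (&orphaned_finalisers, info, memory_order_release);
  pthread_mutex_unlock (&orphaned_lock);
}

/* Called by a running domain to adopt the orphaned work, if any. */
static void *adopt_finalisers (void)
{
  pthread_mutex_lock (&orphaned_lock);
  void *info = atomic_exchange_explicit (&orphaned_finalisers, NULL,
                                         memory_order_acq_rel);
  pthread_mutex_unlock (&orphaned_lock);
  return info;
}
```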

kayceesrk (Contributor) left a comment

The change looks good to me except for a clarification question.

 static int no_orphaned_work (void)
 {
   return
-    orph_structs.ephe_list_live == 0 &&
-    orph_structs.final_info == NULL;
+    atomic_load_acquire(&orph_structs.ephe_list_live) == 0 &&
kayceesrk (Contributor):

Curious to learn how and why acquire loads are better than relaxed loads here?

gasche (Member Author):

I'm not a memory-model person so this may be wrong. The intent is that we reliably detect all pushes to orphaning structures that happened before the current point -- instead of observing an arbitrary past value of those lists, which may introduce an arbitrary delay between orphaning and adoption.

kayceesrk (Contributor):

Does this reasoning work even when the stores are not release stores but instead are protected by an unrelated mutex?

gasche (Member Author):

On @gadmm's advice I marked the struct field _Atomic. I assumed that by default (if we don't specify an ordering) reads and writes are sequentially consistent, so in particular that writes are release stores. (But maybe the default is relaxed? I find it surprisingly difficult to find this information by googling.) The reads in no_orphaned_work() are the only ones that are performance-sensitive, so I did not bother specifying an ordering for the others. If we did specify an ordering I would use release stores and acquire reads.

kayceesrk (Contributor):

I marked the struct field _Atomic. I assumed that by default (if we don't specify an ordering) reads and writes are sequentially consistent

I missed this detail, and it makes sense now!

https://en.cppreference.com/w/c/language/atomic

Built-in increment and decrement operators and compound assignment are read-modify-write atomic operations with total sequentially consistent ordering (as if using memory_order_seq_cst).

gasche (Member Author):

Thanks. This quote is about read-modify-write operations, which both read and write at the same time, not single reads or single writes. But these also default to sequential consistency, as documented on the other page https://en.cppreference.com/w/c/atomic/memory_order :

The default behavior of all atomic operations in the language and the library provides for sequentially consistent ordering (see discussion below). That default can hurt performance, but the library's atomic operations can be given an additional memory_order argument [..].

(I wish this was said clearly on the atomic page, but oh well, we have enough work improving the OCaml manual already.)
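A tiny standalone C11 example of the point being made about defaults (nothing OCaml-specific): unadorned accesses to an _Atomic object are sequentially consistent atomic operations, and the *_explicit functions are how a weaker ordering is chosen.

```c
#include <stdatomic.h>

_Atomic int ready;

void demo (void)
{
  ready = 1;        /* same as atomic_store (&ready, 1): seq_cst store */
  int a = ready;    /* same as atomic_load (&ready): seq_cst load */

  /* Explicit variants choose a weaker ordering. */
  atomic_store_explicit (&ready, 1, memory_order_release);
  int b = atomic_load_explicit (&ready, memory_order_acquire);

  (void) a; (void) b;
}
```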

kayceesrk (Contributor):

Yes. You are right. I was reading it too quickly.

There is currently a race on the `orph_structs` global, which is read by
`no_orphaned_work()` without being protected by the `orphaned_lock`
mutex. We will add locking inside `no_orphaned_work`, but to reduce
the performance impact we first ensure that the `orphaned_lock`
critical sections are as short as possible.

To avoid taking a lock in `no_orphaned_work()`, we make the
orph_structs fields atomic. The lock is now only used to guarantee
that the `add_*` and `adopt` operations do not race with each other.

Suggested-by: Guillaume Munch-Maccagnoni <Guillaume.Munch-Maccagnoni@inria.fr>
@gasche gasche merged commit c287b9d into ocaml:trunk Sep 23, 2023
9 checks passed
gadmm (Contributor) commented Sep 25, 2023

The discussion is a bit perplexing to me, and I find the self-merging of this PR, over the weekend and only a couple of hours after first approval, hasty.

The moving and removing of these locking operations is needless, and it made the PR harder to review and the codebase harder to comprehend. In particular, you introduce new lock-less reads that require reasoning about ordering (they look at the contents instead of just comparing to a sentinel value). I fail to reconcile this choice with your insistence on aiming for less-invasive changes and simpler resulting code.

The original pattern is indeed one where we use atomics and still "bother with locks", but one point is that it does not need subtle reasoning about ordering. We can indeed sometimes avoid locks in certain places without making the code more complicated.

Just to clarify, as I hope this will help future discussions: we have already seen this pattern used elsewhere in the runtime, where we have pointed out that it should be corrected (or it has already been corrected) in this way. This is really a common pattern, and it is a red herring to propose that if we use atomics, then we should go all the way to some kind of lock-free implementation.

Before rejecting my analysis as a vague guess, it is important to understand why this has very likely been working until now despite the UB. Omitting to specify these loads and stores as atomic makes it possible for things to go very wrong in terms of code generation, but it is also likely that the generated code ends up correct (e.g. no tearing nor inopportunely-fused loads) depending on the context. It would be an extraordinary claim to insinuate that locks were somehow forgotten absent-mindedly wherever this pattern was used and the generated code still ended up correct by accident.

Lastly, the expectation that intent would be documented in the code is surprising, when it is known that the OCaml sources lack documentation; I have already tried to ask for better documentation of uses of atomics in the runtime in particular. This might not be a good way to approach the OCaml source code. And bringing up supposed code intent was a diversion, when what should have mattered is that the fix I advocated is correct, less invasive, simpler, and straightforward to estimate to be as performant as the original code.

Given these observations, I believe that a more forthright approach to these discussions, and in a spirit of mutual learning, would benefit the effectiveness of our collective work on OCaml.

As regards acquire vs. relaxed, remember that synchronisation concerns writes preceding the release and reads following the acquire. But here all other reads and writes are already synchronised with the mutex. The acquires are needless here, as synchronisation does not have the effect you seem to claim. The synchronisation in play is about ordering for the purpose of correctness. The ideas of arbitrary delays, and of synchronisation needing to happen often enough, seem to stem from a misconception. As a mental model, it is good to keep in mind that the cache is coherent, and that reordering concerns small buffers and queues internal to each CPU core; this operates at time scales that are not relevant here.

(Regarding performance, relaxed ordering seems unimportant—as said, a simple two-line change to add _Atomic qualifiers would have been enough. But again using the weakest operations possible leads to better self-documenting code.)

gasche (Member Author) commented Oct 23, 2023

I was not originally planning to comment more on this thread, but @gadmm asked for a reply.

Shortly:

  • As KC's comments seem to suggest, I was wrong to guess that the lack of locking was unintentional, and you were right.
  • I agree that you could have written a better PR than I did, but you evidently decided not to. The opportunity to do it as a follow-up change is still open if you wish. Maybe I should have asked -- I agree that it would be a learning experience to see the change "done right" -- but I did not want to impose extra work on you.
  • I did not do this myself (send a follow-up PR trying to do things exactly as you would have done them) because at this point I was fed up with the change. As far as I can tell, the new code (after merging the PR) is good enough.
