runtime: fix a race on the orphaning structure in major_gc.c #12591
Conversation
There seem to be two issues:

What TSan complains about is 2. rather than 1. The idiom in the original code seems to be the mixing of mutex-protected accesses with relaxed accesses, to avoid locking in some read-only places, at the cost of perhaps introducing races that are considered innocuous. A solution might be to replace the plain accesses with relaxed atomic ones (and this is the solution for other bugs found by TSan). Is the race in 1. innocuous or harmful? Does removing the race (rather than the UB) play a role in fixing the flaky test?
I don't consider racy non-atomic accesses to be innocuous races. There is no indication of intent, no comment about this, so for all I know this is just a programming mistake, a bug -- which I am fixing in my commit. If we wanted to use relaxed atomics, we could, but then we would not bother with a lock in the first place. (Or we could use reader-writer locks; I regret not having them bound to easy functions in platform.h, because we could use them in many places.) I propose to only consider changing the code to use atomics if we get good reasons to believe that using the locks correctly is a performance regression. Let's explore the simpler, less invasive change (fixing the use of locks) for now.
(The code changes here should not affect the flaky test at all, which is a different gaping bug.)
I think there is a misunderstanding. Of course the plain accesses are UB, but mixing atomic accesses and mutexes is a common trick (cf. double-checked locking). (And "race" is then not meant in the sense of a C data race implying UB.) It seems that in order to make the code correct, one simply needs to add an
I think that I understood you on the first try. You are making the guess that the absence of a lock in

I'm not opposed to moving to using atomics (I would rather use atomics directly and perform an atomic exchange when emptying the two lists) if we see on some benchmarks that it makes a difference, or if someone who understands the major GC dynamics better than I do makes dire predictions on the performance of taking a handful of locks per GC slice. Until then, I think we should stick to the simpler implementation. (We could just ask @kayceesrk, who wrote the racy code in 2242a22 and could probably tell us whether it is important in his mind to avoid taking a lock in
Ouch. Good way to be brought into a conversation. I believe it is important to avoid taking a lock for the code at lines 1436 to 1475 in f89f4cc.
The idea is that each of these conditions can be tested without locks and then with a global coordination, such as here (lines 1698 to 1708 in f89f4cc).
I don't know whether adding locks for
Hm, sorry for the unceremonious mention, and thanks for the extra details about your original intent!
No worries :-). FWIW, it is a data race and a bug. It should be fixed. Thanks for working on a fix!
(force-pushed 7fef131 to 946bc0f)
I updated the PR to use atomic fields in addition to the lock, as suggested by @gadmm. Now the lock is used to protect the push/pop operations from each other, and read-only usage is performed with just atomic reads. I went for two

(It may be that synchronization happens often enough anyway, but that is a bit tricky to reason about given that production of orphans happens on domain termination, and terminated domains do not synchronize much with other domains.)
The change looks good to me except for a clarification question.
```diff
 static int no_orphaned_work (void)
 {
   return
-    orph_structs.ephe_list_live == 0 &&
-    orph_structs.final_info == NULL;
+    atomic_load_acquire(&orph_structs.ephe_list_live) == 0 &&
```
Curious to learn how and why acquire loads are better than relaxed loads here?
I'm not a memory-model person so this may be wrong. The intent is that we reliably detect all pushes to orphaning structures that happened before the current point -- instead of observing an arbitrary past value of those lists, which may introduce an arbitrary delay between orphaning and adoption.
Does this reasoning work even when the stores are not release stores but instead are protected by an unrelated mutex?
On @gadmm's advice I marked the struct fields `_Atomic`. I assumed that by default (if we don't specify an ordering) reads and writes are sequentially consistent, so in particular that writes are release stores. (But maybe the default is relaxed? I find it surprisingly difficult to find this information by googling.) The reads in `no_orphaned_work()` are the only ones that are performance-sensitive, so I did not bother specifying an ordering for the others. If we did specify an ordering I would use release stores and acquire reads.
> I marked the struct field `_Atomic`. I assumed that by default (if we don't specify an ordering) reads and writes are sequentially consistent
I missed this detail, and it makes sense now!
https://en.cppreference.com/w/c/language/atomic
> Built-in increment and decrement operators and compound assignment are read-modify-write atomic operations with total sequentially consistent ordering (as if using memory_order_seq_cst).
Thanks. This quote is about compound operations that both read and write at the same time, not single reads or single writes. But these do also default to sequential consistency, as documented on the other page https://en.cppreference.com/w/c/atomic/memory_order :

> The default behavior of all atomic operations in the language and the library provides for sequentially consistent ordering (see discussion below). That default can hurt performance, but the library's atomic operations can be given an additional memory_order argument [..].

(I wish this were said clearly on the `atomic` page, but oh well, we have enough work improving the OCaml manual already.)
Yes. You are right. I was reading it too quickly.
There is currently a race on the `orph_structs` global, which is read by `no_orphaned_work()` without being protected by the `orphaned_lock` mutex. We will add locking inside `no_orphaned_work`, but to reduce the performance impact we first ensure that the `orphaned_lock` critical sections are as short as possible.

To avoid taking a lock in `no_orphaned_work()`, we make the `orph_structs` fields atomic. The lock is now only used to guarantee that the `add_*` and `adopt` operations do not race with each other.

Suggested-by: Guillaume Munch-Maccagnoni <Guillaume.Munch-Maccagnoni@inria.fr>
(force-pushed 8298cb0 to f144f99)
The discussion is a bit perplexing to me, and I find the self-merging of this PR, during the weekend and only a couple of hours after first approval, hasty. The moving and removal of these locking operations is needless, and made the PR harder to review and the codebase harder to comprehend. In particular, you introduce new lock-less reads that require reasoning about ordering (they look at the contents instead of just comparing to a sentinel value). I fail to reconcile this choice with your insistence on aiming for less-invasive changes and simpler resulting code. The original pattern is indeed one where we use atomics and still "bother with locks", but one point is that it does not need subtle reasoning about ordering. We can indeed sometimes avoid locks in certain places without making the code more complicated.

Just to clarify, as I hope this will help future discussions: we have already seen this pattern used elsewhere in the runtime, where we have pointed out that it should be corrected (or has already been corrected) in this way. It is really a common pattern, and it is a red herring to propose that if we use atomics, then we should go all the way to some kind of lock-free implementation.

Before rejecting my analysis as a vague guess, it is important to understand why this has very likely been working until now despite the UB. Omitting to mark these loads and stores as atomic makes it possible for things to go very wrong in terms of code generation, but it is also likely that the generated code ends up correct (e.g. no tearing nor inopportunely-fused loads) depending on the context. It would be an extraordinary claim to insinuate that locks were somehow forgotten absent-mindedly wherever this pattern was used and that the generated code still ended up correct by accident.

Lastly, the expectation that intent would be documented in the code is surprising, when it is known that the OCaml source lacks documentation; I have already tried to ask for better documentation of uses of atomics in the runtime in particular. This might not be a good way to go about the OCaml source code. And bringing up supposed code intent was a diversion when what should have mattered is that the fix I advocated is correct, less invasive, simpler, and straightforward to estimate to be as performant as the original code. Given these observations, I believe that a more forthright approach to these discussions, in a spirit of mutual learning, would benefit the effectiveness of our collective work on OCaml.

As regards acquire vs. relaxed, remember that synchronisation concerns writes preceding the release and reads following the acquire. But here all other reads and writes are already synchronised with the mutex. The acquires are needless here, as synchronisation does not have the effect you seem to claim: the synchronisation in play is about ordering for the purpose of correctness. The ideas of arbitrary delays, and of synchronisation needing to happen often enough, seem to stem from a misconception. As a mental model, it is good to keep in mind that the cache is coherent, and that reordering concerns small buffers and queues internal to each CPU core; this operates at time scales that are not relevant here. (Regarding performance, relaxed ordering seems unimportant; as said, a simple two-line change to add
I was not originally planning to comment more on this thread, but @gadmm asked for a reply. Briefly:
I went back to #12345 today as we saw a new CI failure, and this somehow led me to #11040, which lists a few data-race bugs in the runtime that have been found by TSan and silenced -- some of the errors there were detected by the finaliser_handover test. I started looking at those data races, and this PR is the result.
`major_gc.c` uses an `orphaned_lock` mutex to protect a global `orph_structs` data structure that stores information about ephemerons or finalizers that have been "orphaned" by a terminating domain and not yet "adopted" by one of the still-running domains. Except that one function, `no_orphaned_work()` (which checks whether anything is currently orphaned), does not take the lock and is thus racy.

The present commit:
- makes `no_orphaned_work()` take the lock
- reduces the `orphaned_lock` critical sections to a minimum (only the accesses to `orph_structs`) to minimize the risk of performance impact

Before, the `orphaned_lock` would only be taken by programs that use ephemerons and finalisers. After this PR, everyone takes the lock, a handful of times per major GC slice. I don't anticipate a performance impact, except possibly in pathological programs that would perform a lot of slices performing almost no work. (One might think of opportunistic slices as a candidate for this pathological behavior, but they have a shortcut to quit early when they find no work to do, before taking this lock.) It might still be worth running benchmarks just to be sure.