-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix data races in minor_gc.c and caml_natdynlink_open #12737
Conversation
I don't mind reviewing such changes, but i don't have the energy to review a medium-sized PR right now. I would prefer to receive smaller PRs for the changes that are independent from each other. Of course anyone else is welcome to review the PR as is. |
I'm a little confused by this one. Is |
Sure, I will extract the annotations changes in a separate PR.
In fact, even if the value is an integer, the GC does perform a write on it. It writes… the same value. See the line in Line 252 in 9b059b1
This line is executed when the value is not a minor block, which includes the case when it is an integer. I don’t know what the reason is for this, but well, it does result in a data race in this case. |
5815551
to
a321a38
Compare
I removed the commits that don’t remove data races. |
Interesting, I took a look. It looks like Edit: Not doing the write when it is unnecessary seems to be difficult, because of the many spots that |
runtime/minor_gc.c
Outdated
@@ -546,7 +545,7 @@ void caml_empty_minor_heap_promote(caml_domain_state* domain, | |||
|
|||
for( r = ref_start ; r < foreign_major_ref->ptr && r < ref_end ; r++ ) | |||
{ | |||
oldify_one (&st, **r, *r); | |||
oldify_one (&st, *((volatile value *) *r), *r); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I understanding correctly that the problem is the read of the third argument and second argument could end up being different due to the race?
If that's right, then maybe it's clearer to use an intermediate value, .e.g.
value * pr = *r;
oldify_one (&st, *pr,pr);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is not that the reads can end up being different: if *pr
is a pointer and another thread has promoted it, reading the old value here is not an issue as it now points to a forwarding block, which will be detected by oldify_one
. The problem is that the race is between a volatile
write and a plain load, namely, *pr
in your proposal. To avoid undefined behaviour, in principle, we would need to use relaxed atomic accesses. But we have decided that volatile
is to be used as an equivalent of relaxed accesses in some cases in the runtime; hence my proposal of making the read volatile
.
I agree that using an intermediate value is more legible. How about:
volatile value * pr = *r;
oldify_one (&st, *pr, pr);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thanks for the explanation.
|
||
/* TODO: dlclose in case of error... */ | ||
|
||
p = caml_stat_strdup_to_os(String_val(filename)); | ||
global_dup = Int_val(global); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this instead? It has the advantage of not introducing a dup
variable, which we aren't doing elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It goes against Rule 1, but if it’s done elsewhere, or if people are fine with not strictly applying the rules in compiler-internal libraries, I won’t fight it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds fine indeed as the runtime follows relaxed rules. This could be strengthened with a CAMLassert(Is_long(global))
for readability in place of the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 42bcd53
I haven't had the time to look at this yet (and given the nice feedback, there is a chance that I don't have to!) but a meta-comment: whenever you are doing something non-obvious that could be easily be undone by someone meaning well, please have a comment explaining that you are doing it that way for a good reason -- and giving the reason, obviously. I suspect that both change sites meet this requires-an-explanation criterion. |
I had a similar train of thought, in fact the documentation itself might not warn explicitly-enough against accessing GC roots during blocking sections even if the value is known to be an immediate. This looks innocuous and the danger is counterintuitive. Here is the relevant part of the documentation that could perhaps be made clearer:
Similarly, Not saying that it is the role of anyone in particular to do this improvement; just thinking out loud. |
@@ -546,7 +545,8 @@ void caml_empty_minor_heap_promote(caml_domain_state* domain, | |||
|
|||
for( r = ref_start ; r < foreign_major_ref->ptr && r < ref_end ; r++ ) | |||
{ | |||
oldify_one (&st, **r, *r); | |||
volatile value* pr = *r; | |||
oldify_one (&st, *pr, pr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is value_ptr
which could be used here to make the replacement more obvious.
In fact, I wonder if it is not a case where value_ptr
should replace value *
further in the code base.
e.g. could it be that in minor_gc.h
, the following:
struct caml_ref_table CAML_TABLE_STRUCT(value *);
should be:
struct caml_ref_table CAML_TABLE_STRUCT(value_ptr);
or something like that (with all the changes throughout the codebase that this would require)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(And, in this case, would it be a sign that we should inspect all the uses of value*
in the runtime to see if they should be replaced with value_ptr
? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Often the elements of a caml_table_ref
are manipulated without any risk of conflicting access, e.g., outside of minor collection remembered sets are domain-local. Making all these elements volatile
would remove opportunities for compiler optimization for no good reason.
(And, in this case, would it be a sign that we should inspect all the uses of
value*
in the runtime to see if they should be replaced withvalue_ptr
? :)
I haven’t had a response on #12512, except from @gasche kindly trying to unstall the PR, so this auditing process is paused as far as I’m concerned.
The proposed fix for I think that this is what @gadmm was saying above, but I would say it more explicitly: I think that the mental model of people before was that non-immediate roots could be moved around during blocking sections and were thus unsafe, and that this PR is in the process of strengthening the specification (not just clarifying it). I would argue that "not access any OCaml data" in the previous paragraph was previously understood to mean any OCaml block, not all values of If we must change the specification, we must. The new, stronger specification can be justified. But do we really must? |
We can modify |
Registering (as a root) a value that is known to be a tagged integer is a no-op: the tagged integer will not change anything to the set of blocks traversed by the GC, and the GC will not change the value of the tagged integer. So, it's not wrong to do it, but it's not wrong to not do it. At any rate, the only recommended idiom around blocking sections is to extract C values from OCaml values before entering the blocking section. (That's what the manual tries to say.) Whether these values are tagged integers or not is better ignored. (Why make confusing special cases?) |
Notes:
|
Well, technically there is already one Line 250 in 8ec2b3d
|
AFAIU this was already a data race in OCaml 4 with systhreads, which is why I was being less assertive than @gasche. But if such code exists in the wild, some could fall on the theoretically-UB-but-accidentally-correct side of the spectrum. The code does not look like it has been micro-optimised enough for performance to be a factor (e.g. with my limited experience, it is wasteful to optimise for branch prediction if it has not even been optimised for data accesses). Is it clear though that the assignment |
I checked, and there are no such cases. |
I was wrong, actually: I hadn’t considered the recursive tail calls in I see two solutions:
To make the discussion more concrete, I implemented the first solution in the last two commits 2238a6f and 38f8ee7. |
One would also have to adapt in the same way all other scanning actions (beyond oldify_one) to make the immediate special-case sound, and document the constraint for maintainers and users of the scanning hooks… Or do I miss something? Unless it is clear that some programs will be fixed in this way, the second option sounds simpler, doesn't it? |
Well, other scanning need to be checked, but they may be sound in this regard. Regarding scanning hooks, I’m missing some context. In the compiler code, they only seem to be used by systhreads. |
Sorry, there is nothing to do for users of scanning hooks. |
Well, I do agree that they can be used in racy ways. I was just wondering whether they were exposed and/or ever used outside of systhreads. |
Yes, for instance by coq and boxroot. But I think here we only need to check the various scanning actions; 3rd party code do not define new scanning actions. |
Could you point me to where these hooks are defined? |
38f8ee7
to
8ca0ea3
Compare
Thanks! I thus reverted to the initial fix for the Dynlink “race”. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this decision on the design, I looked at the implementation again and it is still fine. The extra implementation comments are warmly welcome.
@OlivierNicole could you squash together the two commits that are only about the intermediate variable? (Otherwise the history is fine.)
(Are you intending to write a Changes entry?) |
8ca0ea3
to
fe48ef4
Compare
Thanks, updated Changes and squashed. |
(The PR complains about conflicts.) |
Remove a data race between a `volatile` write in `oldify_one` and a plain read in `caml_empty_minor_heap_promote`. (The racing read was the second dereference that is performed when `**r` is evaluated.) This is a real race but it suffices to make the read `volatile` in this case.
This race was due to the C primitive `caml_natdynlink_open` accessing a local OCaml `value` after calling `caml_enter_blocking_section()`.
fe48ef4
to
8f4af7f
Compare
Not that it changes anything, but for the record, @jmid just encountered the “code smell” of using an immediate in a “blocking section” without a copy outside of the compiler repo. And it’s in Dune: (TSan was triggered) |
Fixes ocaml#10553 Quoting @jmid, using a local variable without the runtime lock in place, is against the rules. For integer values, sometimes the rules are bent, but this is not a good idea. See ocaml/ocaml#12737. Signed-off-by: Etienne Millon <me@emillon.org>
Fixes ocaml#10553 Quoting @jmid, using a local variable without the runtime lock in place, is against the rules. For integer values, sometimes the rules are bent, but this is not a good idea. See ocaml/ocaml#12737. Signed-off-by: Etienne Millon <me@emillon.org>
As discussed in ocaml#12737, using `Int_val` inside blocking sections can cause data races and is now seen as a bad idea. (this causes a TSAN warning when using Dune, see ocaml/dune#10554)
As discussed in ocaml#12737, using `Int_val` inside blocking sections can cause data races and is now seen as a bad idea. (this causes a TSAN warning when using Dune, see ocaml/dune#10554)
Fixes #10553 Quoting @jmid, using a local variable without the runtime lock in place, is against the rules. For integer values, sometimes the rules are bent, but this is not a good idea. See ocaml/ocaml#12737. Signed-off-by: Etienne Millon <me@emillon.org>
Fixes ocaml#10553 Quoting @jmid, using a local variable without the runtime lock in place, is against the rules. For integer values, sometimes the rules are bent, but this is not a good idea. See ocaml/ocaml#12737. Signed-off-by: Etienne Millon <me@emillon.org>
Fixes ocaml#10553 Quoting @jmid, using a local variable without the runtime lock in place, is against the rules. For integer values, sometimes the rules are bent, but this is not a good idea. See ocaml/ocaml#12737. Signed-off-by: Etienne Millon <me@emillon.org>
Fixes #10553 Quoting @jmid, using a local variable without the runtime lock in place, is against the rules. For integer values, sometimes the rules are bent, but this is not a good idea. See ocaml/ocaml#12737. Signed-off-by: Etienne Millon <me@emillon.org>
This proposes to fix two data races reported in #11040.
The first race happens during minor GC, when promoting the values that are in the remembered set. This is a task done in parallel by all existing domains, with a static work sharing policy, to avoid situations where a domain has much more work to do, e.g. if it continually updates some globals with newly allocated data. In the section of
caml_empty_minor_heap_promote
that performs this task, a thread can read a valueocaml/runtime/minor_gc.c
Line 549 in 9b059b1
(the problematic operation is the second plain read resulting from
**r
)… that is being promoted by another thread (through a
volatile
write inoldify_one
).ocaml/runtime/minor_gc.c
Line 252 in 9b059b1
It suffices to make the read
volatile
as well: promotion to the major heap can only happen once, so even if the first thread reads an old value, when attempting to promote itoldify_one
will detect that it has already been done and will simply update the value.The second race is in
caml_natdynlink_open
: this primitive is accessing one of itsCAMLlocal
variables after callingcaml_enter_blocking_section
(it’s against the rules!) allowing this to happen even during a stop-the-world section. As a result, the GC can concurrently promote the value (by a direct write on the stack). To get rid of the race, one just needs to copy the value (which is cast usingInt_val
) to a localint
.This PR is best reviewed commit by commit.
(Edit: the rest of this description no longer applies and was moved to a separate PR: #12746)
It also tidies up a few things regarding TSan annotations:Initially, we introducedCAMLreally_no_tsan
as a complement toCAMLno_tsan
. The idea was thatCAMLno_tsan
should be used to de-instrument functions that we don’t want instrumented with--enable-tsan
, whileCAMLreally_no_tsan
de-instruments them in all cases, including when, e.g.,-fsanitize=thread
is passed through theCFLAGS
. However, experience shows that it is vastly more convenient to chase data races in the runtime using--enable-tsan
than by modifying CFLAGS. As a consequence,CAMLreally_no_tsan
is not really relevant anymore. I replace the pairCAMLno_tsan
/CAMLreally_no_tsan
withCAMLno_tsan
/CAMLno_tsan_for_perf
. Functions markedCAMLno_tsan
are never instrumented, whereas functions markedCAMLno_tsan_for_perf
are not instrumented to save performance, except whenTSAN_INSTRUMENT_ALL
is defined. DefiningTSAN_INSTRUMENT_ALL
ensure maximum coverage when looking for data races in the runtime.A number of TSan false positives related tovolatile
writes were removed by the merge of Fix TSan false positives due to volatile write handling #12681, and we no longer need to silence the corresponding reports.Outdated TSan silencing annotations are removed, and 2 added, to match the state of the todo-list in ThreadSanitizer issues #11040.