-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add manual chapter about ThreadSanitizer support #12802
Conversation
Co-authored-by: Fabrice Buoro <fabrice@tarides.com>
Co-authored-by: Miod Vallat <118974489+dustanddreams@users.noreply.github.com>
I find the work really great, thanks!
This sentence may be wrong:
Suppressing data races is useful for intentional races or libraries that
cannot be fixed
Shouldn't it say something like suppressing reports, rather than
suppressing data rraces?
|
When you say
C code interacting with OCaml should always be built through the
\texttt{ocamlc} or \texttt{ocamlopt} commands,
I'd suggest to refer only to ocamlopt because since only native programs
can be instrumented, it makes no sense to me to mention ocamlc here.
Especially givne that, hopefully, only ocamlopt will use the TSan
related C flags.
|
Finally, IIUC the history size is related to what is mentionned earlier
in the chapter about TSan remembering only the 4 last accesses, but hte
documentation does not make that link clear. Perhaps it could?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Reviewed synchronously with @OlivierNicole)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no discussion of the performance overhead at all in this chapter, I think there should be one. What I expect as a reader is to get a ballpark of performance overhead: is this thing designed to make my program 20% slower, or 5x, or 40x? You could tell me about an actual overhead observed on a real program, also maybe an interval of slowdown observed in extremal (but still representative of real program) cases, and say a word on what sort of workloads get a smaller or larger overhead.
What I vaguely remember from our discussions is "typically 3-5x slower, but there is no overhead if you don't use mutable state and it can be up to 40x slower if you use it a lot". Something like that, but with numbers that I didn't just make up, would be useful to guide people decisions on when and how to use this feature.
Relatedly: what is the recommended development workflow for users? Is the idea that they typically do all their multicore hacking in a TSan-enabled switch, or that they typically program in a non-TSan switch and then occasionally move to a TSan switch for debugging?
manual/src/cmds/tsan.etex
Outdated
\end{verbatim} | ||
|
||
Increasing the history size can sometimes be necessary to obtain the second | ||
stack trace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Is there a performance cost to doing that?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I added a mention to that effect in cbe7759.
Other questions:
|
Thanks to all the reviewers for their reviews. I pushed cbe7759 to take your comments into account.
In fact, these are two different things. I tried to clarify this in the paragraph discussing I added a section about performance implications.
From what I observed, it depends on how much performance matters during development, but a typical workflow is to develop on a regular switch and periodically run the test suite in a TSan-enabled switch.
I try to explain this in the new version.
People should use relaxed atomics unless they are comfortable with UB. |
5848113
to
515913e
Compare
manual/src/cmds/tsan.etex
Outdated
OCaml, since version 5.0, allows shared-memory parallelism and thus mutation of | ||
shared data. This creates the possibility of data races, i.e., unordered | ||
accesses to the same data (one of them being a write). Data races can be | ||
dangerous as they are easy to miss and capable of causing unexpected results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the use of the word "dangerous". Data races are "dangerous" in C/C++ because the semantics is undefined. In OCaml, data races only lead to more non-determinism than found under SC. Without data races, the program will have SC semantics but still have non-determinism due to thread interleaving. Would you call this non-determinism "dangerous"?
How about, "In OCaml, data races are easy to introduce, and the behaviour of programs with data races can be unintuitive -- the observed behaviours cannot be explained by simply interleaving operations from different concurrent threads".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a different take on this.
Currently most non-persistent data structures in OCaml are not thread-safe or domain-safe, so using them in a multithreaded context is a bug that is very likely to (silent) incorrect program behavior. Embedding protections against this in the data-structure themselves -- so that it fails loudly on such incorrect usage -- is not easy and not standard practice. TSan is currently our best chance of detecting when we mistakenly do this.
So I would treat a data race as "dangerous" because it very likely indicates that a part of the program is used outside its specification, in the "unspecified behavior" or even possibly "a segfault because people used unsafe programming patterns that were fine for OCaml 4 and not correct for Multicore" (as we have seen in Buffer, Dynarray, etc.).
If I understand correctly, your mental model is that benign races on non-atomic locations are fine, because they are not UB. I have a stricter mental model in mind where they signal a programming bug (despite preserving safety) and should be avoided. A counterpoint is that we don't have relaxed atomics to express benign races -- I think that we could and probably should have something like that, some way for programmers to express their intent to perform benign races. But benign races are very rare anyway, probably almost non-existent in idiomatic high-level code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree in part with both of you. Data races produce unintuitive behaviour, and that unintuitive behaviour can definitely be dangerous if it is unexpected. That being said,
"In OCaml, data races are easy to introduce, and the behaviour of programs with data races can be unintuitive -- the observed behaviours cannot be explained by simply interleaving operations from different concurrent threads".
looks fine to me.
A counterpoint is that we don't have relaxed atomics to express benign races -- I think that we could and probably should have something like that, some way for programmers to express their intent to perform benign races.
I disagree on this point, as to me non-atomic accesses in OCaml already fill that role, as I have expressed in a previous discussion. A non-atomic access, properly commented and marked as [@no_tsan]
—if one day we introduce that attribute—is what I’d want to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said,
"In OCaml, data races are easy to introduce, and the behaviour of programs with data races can be unintuitive -- the observed behaviours cannot be explained by simply interleaving operations from different concurrent threads".
looks fine to me.
I use this phrasing in the new version, let me know if this works for everyone!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the first 2 paragraphs, but that does not mean "data races are dangerous". This phrase has a particular connotation for C/C++ (and Rust) programmers, where having a data race is a catastrophic bug by itself -- your program has undefined semantics. This is not the case with OCaml. However, as you suggest, the presence of a data race indicates that a part of the program is likely used outside of its specification. Here data race itself is not at fault, but is an indication of issue elsewhere.
The question is whether we need to highlight this distinction here. I feel that we should. TSan is a dynamic data race detector. I don't want the reader to take away that "Data races are dangerous and OCaml TSan does not rule out all data races. So parallel OCaml programs in practice are likely to have data races. So most parallel OCaml program have undefined semantics due to data races."
We've worked hard to establish a result where we provide well-defined semantics for programs with data races. Calling "data races are dangerous" seems to throw this away entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a number of comments. Ok to merge when they're addressed.
Thank you very much for your thorough review, I believe I have addressed it in my last commits. |
This adds documentation of the compiler support for ThreadSanitizer, its usage, what can be expected in terms of correctness, the guidelines to follow when linking with external code, and a few caveats to be aware of.
It attempts to be complete (documenting some facts that the sparse documentation available about TSan does not specify) and to give a few examples.