Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add manual chapter about ThreadSanitizer support #12802

Merged
merged 8 commits into from
Dec 6, 2023

Conversation

OlivierNicole
Copy link
Contributor

@OlivierNicole OlivierNicole commented Dec 1, 2023

This adds documentation of the compiler support for ThreadSanitizer, its usage, what can be expected in terms of correctness, the guidelines to follow when linking with external code, and a few caveats to be aware of.

It attempts to be complete (documenting some facts that the sparse documentation available about TSan does not specify) and to give a few examples.

Co-authored-by: Fabrice Buoro <fabrice@tarides.com>
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
Co-authored-by: Miod Vallat <118974489+dustanddreams@users.noreply.github.com>
@OlivierNicole OlivierNicole changed the title Add manual page about ThreadSanitizer support Add manual chapter about ThreadSanitizer support Dec 1, 2023
@shindere
Copy link
Contributor

shindere commented Dec 1, 2023 via email

@shindere
Copy link
Contributor

shindere commented Dec 1, 2023 via email

@shindere
Copy link
Contributor

shindere commented Dec 1, 2023 via email

@kayceesrk kayceesrk added the thread-sanitizer Related to data races, thread sanitizer label Dec 2, 2023
Copy link
Contributor

@fabbing fabbing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Reviewed synchronously with @OlivierNicole)

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no discussion of the performance overhead at all in this chapter, I think there should be one. What I expect as a reader is to get a ballpark of performance overhead: is this thing designed to make my program 20% slower, or 5x, or 40x? You could tell me about an actual overhead observed on a real program, also maybe an interval of slowdown observed in extremal (but still representative of real program) cases, and say a word on what sort of workloads get a smaller or larger overhead.
What I vaguely remember from our discussions is "typically 3-5x slower, but there is no overhead if you don't use mutable state and it can be up to 40x slower if you use it a lot". Something like that, but with numbers that I didn't just make up, would be useful to guide people decisions on when and how to use this feature.

Relatedly: what is the recommended development workflow for users? Is the idea that they typically do all their multicore hacking in a TSan-enabled switch, or that they typically program in a non-TSan switch and then occasionally move to a TSan switch for debugging?

\end{verbatim}

Increasing the history size can sometimes be necessary to obtain the second
stack trace.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Is there a performance cost to doing that?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I added a mention to that effect in cbe7759.

@gasche
Copy link
Member

gasche commented Dec 2, 2023

Other questions:

  • If I build a program with a TSan-enabled compiler, is it possible to run it without any alarms at all? Is there a way to do this that reduces the performance overhead to something very small?
  • We found that it was useful in the runtime codebase to systematically add a comment on each CAMLno_tsan function to explain why instrumentation is disabled. Maybe you could suggest this as well? I would be interested also in a short explanation of why one may want to disable instrumentation for certain functions. (I guess: "to reduce the performance overhead in certain corner cases" and "to silence alarms about real but already-known issues to avoid hiding new alarms in the noise"; is "benine races" ever a valid reason or should people use relaxed atomics instead?)
  • Is there a way to disable instrumentation in the OCaml code? (An attribute?) If not, I think it would be useful to say so explicitly.

@OlivierNicole
Copy link
Contributor Author

Thanks to all the reviewers for their reviews. I pushed cbe7759 to take your comments into account.

Finally, IIUC the history size is related to what is mentionned earlier
in the chapter about TSan remembering only the 4 last accesses, but hte
documentation does not make that link clear. Perhaps it could?

In fact, these are two different things. I tried to clarify this in the paragraph discussing history_size.

I added a section about performance implications.

Relatedly: what is the recommended development workflow for users? Is the idea that they typically do all their multicore hacking in a TSan-enabled switch, or that they typically program in a non-TSan switch and then occasionally move to a TSan switch for debugging?

From what I observed, it depends on how much performance matters during development, but a typical workflow is to develop on a regular switch and periodically run the test suite in a TSan-enabled switch.

If I build a program with a TSan-enabled compiler, is it possible to run it without any alarms at all? Is there a way to do this that reduces the performance overhead to something very small?

TSAN_OPTIONS=report_bugs=0 turns off alarms entirely, but there is no way to reduce the performance overhead.

We found that it was useful in the runtime codebase to systematically add a comment on each CAMLno_tsan function to explain why instrumentation is disabled. Maybe you could suggest this as well? I would be interested also in a short explanation of why one may want to disable instrumentation for certain functions. (I guess: "to reduce the performance overhead in certain corner cases" and "to silence alarms about real but already-known issues to avoid hiding new alarms in the noise";

I try to explain this in the new version.

is "benine races" ever a valid reason or should people use relaxed atomics instead?)

People should use relaxed atomics unless they are comfortable with UB.

OlivierNicole and others added 2 commits December 4, 2023 16:11
Co-authored-by: Miod Vallat <118974489+dustanddreams@users.noreply.github.com>
Co-authored-by: Antonin Décimo <antonin@tarides.com>
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
OCaml, since version 5.0, allows shared-memory parallelism and thus mutation of
shared data. This creates the possibility of data races, i.e., unordered
accesses to the same data (one of them being a write). Data races can be
dangerous as they are easy to miss and capable of causing unexpected results.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the use of the word "dangerous". Data races are "dangerous" in C/C++ because the semantics is undefined. In OCaml, data races only lead to more non-determinism than found under SC. Without data races, the program will have SC semantics but still have non-determinism due to thread interleaving. Would you call this non-determinism "dangerous"?

How about, "In OCaml, data races are easy to introduce, and the behaviour of programs with data races can be unintuitive -- the observed behaviours cannot be explained by simply interleaving operations from different concurrent threads".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a different take on this.

Currently most non-persistent data structures in OCaml are not thread-safe or domain-safe, so using them in a multithreaded context is a bug that is very likely to (silent) incorrect program behavior. Embedding protections against this in the data-structure themselves -- so that it fails loudly on such incorrect usage -- is not easy and not standard practice. TSan is currently our best chance of detecting when we mistakenly do this.

So I would treat a data race as "dangerous" because it very likely indicates that a part of the program is used outside its specification, in the "unspecified behavior" or even possibly "a segfault because people used unsafe programming patterns that were fine for OCaml 4 and not correct for Multicore" (as we have seen in Buffer, Dynarray, etc.).

If I understand correctly, your mental model is that benign races on non-atomic locations are fine, because they are not UB. I have a stricter mental model in mind where they signal a programming bug (despite preserving safety) and should be avoided. A counterpoint is that we don't have relaxed atomics to express benign races -- I think that we could and probably should have something like that, some way for programmers to express their intent to perform benign races. But benign races are very rare anyway, probably almost non-existent in idiomatic high-level code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree in part with both of you. Data races produce unintuitive behaviour, and that unintuitive behaviour can definitely be dangerous if it is unexpected. That being said,

"In OCaml, data races are easy to introduce, and the behaviour of programs with data races can be unintuitive -- the observed behaviours cannot be explained by simply interleaving operations from different concurrent threads".

looks fine to me.

A counterpoint is that we don't have relaxed atomics to express benign races -- I think that we could and probably should have something like that, some way for programmers to express their intent to perform benign races.

I disagree on this point, as to me non-atomic accesses in OCaml already fill that role, as I have expressed in a previous discussion. A non-atomic access, properly commented and marked as [@no_tsan]—if one day we introduce that attribute—is what I’d want to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said,

"In OCaml, data races are easy to introduce, and the behaviour of programs with data races can be unintuitive -- the observed behaviours cannot be explained by simply interleaving operations from different concurrent threads".

looks fine to me.

I use this phrasing in the new version, let me know if this works for everyone!

Copy link
Contributor

@kayceesrk kayceesrk Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the first 2 paragraphs, but that does not mean "data races are dangerous". This phrase has a particular connotation for C/C++ (and Rust) programmers, where having a data race is a catastrophic bug by itself -- your program has undefined semantics. This is not the case with OCaml. However, as you suggest, the presence of a data race indicates that a part of the program is likely used outside of its specification. Here data race itself is not at fault, but is an indication of issue elsewhere.

The question is whether we need to highlight this distinction here. I feel that we should. TSan is a dynamic data race detector. I don't want the reader to take away that "Data races are dangerous and OCaml TSan does not rule out all data races. So parallel OCaml programs in practice are likely to have data races. So most parallel OCaml program have undefined semantics due to data races."

We've worked hard to establish a result where we provide well-defined semantics for programs with data races. Calling "data races are dangerous" seems to throw this away entirely.

manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
manual/src/cmds/tsan.etex Outdated Show resolved Hide resolved
Copy link
Contributor

@kayceesrk kayceesrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a number of comments. Ok to merge when they're addressed.

@OlivierNicole
Copy link
Contributor Author

Thank you very much for your thorough review, I believe I have addressed it in my last commits.

@gasche gasche merged commit f8d4eb2 into ocaml:trunk Dec 6, 2023
9 checks passed
@OlivierNicole OlivierNicole deleted the tsan_manual branch December 6, 2023 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
thread-sanitizer Related to data races, thread sanitizer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants