OS-based Synchronisation for Stop-the-World Sections #12579
gasche merged 11 commits into ocaml:trunk
Conversation
Meta-comment: thanks a lot for the very useful description of the patch, I learned several things I didn't know from it!
A quick comment that the URLs in sandmark are permalinks. For example, here is this PR against trunk on sequential and parallel benchmarks.
I'm taking the opportunity to squash some of these commits, given that I need to fix conflicts anyway.
d057cc6 to
6d84ba4
| caml_stat_space_overhead.index == BUFFER_SIZE) { | ||
| struct buf_list_t *l = | ||
| (struct buf_list_t*) | ||
| caml_stat_alloc_noexc(sizeof(struct buf_list_t)); |
Shouldn't the result of caml_stat_alloc_noexc be checked against NULL here? (I know this is not new code but code moved from cycle_all_domains_callback)
I think so. We should probably create an independent issue.
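A minimal sketch of the kind of check being suggested. It assumes a simplified buffer-list node and uses `malloc` as a stand-in for `caml_stat_alloc_noexc`, which reports failure by returning NULL instead of raising. The `buf_list_push` helper, the field layout and the `BUFFER_SIZE` value are hypothetical, not the runtime's code:

```c
#include <assert.h>
#include <stdlib.h>

#define BUFFER_SIZE 64  /* hypothetical size, for illustration only */

struct buf_list {
  double buffer[BUFFER_SIZE];
  size_t index;             /* number of slots in use */
  struct buf_list *next;
};

/* Allocate a fresh node in front of `head`.
   Returns the new head, or NULL if allocation failed, in which case the
   existing list is left untouched so the caller can recover. */
static struct buf_list *buf_list_push(struct buf_list *head)
{
  assert(head == NULL || head->index <= BUFFER_SIZE);
  struct buf_list *l = malloc(sizeof *l);  /* stand-in for caml_stat_alloc_noexc */
  if (l == NULL)
    return NULL;                           /* the check under discussion */
  l->index = 0;
  l->next = head;
  return l;
}
```

How the caller should react to a NULL result (skip the sample, abort, and so on) is a separate design question that this sketch does not settle.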
Who would be available to do a full review of this PR? ( @dustanddreams, do I understand that you started to review this, and are planning to eventually do a full review? )
You understand correctly.
|
Here is my review, sorry for taking too long. This is a commit-by-commit review.
Another comment which does not need to be addressed in this PR, but can be discussed here: there are many constructs similar to |
|
Thank you very much for the review.
The availability of the header has nothing to do with libc, and is just in a separate package on various Linux distros (Alpine, Debian, Arch, etc.). Given that the headers simply may or may not be available, the check was added to prevent it from being a hard error, which it was on the Alpine Linux machine in Inria CI. I am not aware if system headers are ever unavailable on BSDs, but I did briefly check whether the pre-futex BSD versions are still supported, which I believe they are not.
It would probably be reasonable to remove it.
This was deliberate, the futex is meant to (continue to) be usable as a boolean indicating whether it can/should be waited on, but it may be reasonable to change the zero checks to explicitly compare against
For the Both increments I made non-atomic did show up in
That is why I moved it, yes. I'll move the prototype too then.
I don't believe that's true. Note that the original code checks
while (foo()) {
  bar();
}
is equivalent to
if (foo()) do {
  bar();
} while (foo());
which is an optimisation the compiler did anyway, to reduce the number of jumps; I merely eliminated the
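The loop equivalence stated above can be checked mechanically. The sketch below is illustrative only: `foo`, `bar` and the `probe` struct are hypothetical stand-ins that count how often the condition and the body run, not runtime code.

```c
#include <assert.h>

/* Records how many times the condition and the body were executed. */
struct probe { int evaluated; int limit; int body_runs; };

/* Condition: true for the first `limit` evaluations, false afterwards. */
static int foo(struct probe *p) { return p->evaluated++ < p->limit; }
static void bar(struct probe *p) { p->body_runs++; }

/* Form 1: plain while loop. */
static void run_while(struct probe *p)
{
  while (foo(p)) {
    bar(p);
  }
}

/* Form 2: the guarded do-while form. */
static void run_inverted(struct probe *p)
{
  if (foo(p)) do {
    bar(p);
  } while (foo(p));
}

/* Runs both forms with the same limit, asserts that they evaluate the
   condition and the body the same number of times, and returns the
   number of body executions. */
static int check_equivalent(int limit)
{
  struct probe a = { 0, limit, 0 };
  struct probe b = { 0, limit, 0 };
  run_while(&a);
  run_inverted(&b);
  assert(a.evaluated == b.evaluated);
  assert(a.body_runs == b.body_runs);
  return a.body_runs;
}
```

Both forms evaluate the condition once more than they run the body, including when the condition is false on entry.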
The loop gets very hot under certain workloads; I'd vehemently argue that it's worth sacrificing a tiny bit of readability (perhaps with a kind, soothing comment in the code) to hoist an unnecessary increment from a hot loop; indeed that's why I included the commit at all. It may be worth adding a
Thank you ❤️
while (atomic_load_acquire(&stw_leader));
It is not a tight loop:
/* Wait for the current STW to end */
do caml_plat_wait(&all_domains_cond);
while (atomic_load_acquire(&stw_leader));
Perhaps the braceless
That's reasonable. At a quick look, I can't find any architecture where the |
All the BSD systems have had futexes for more than 5 years already, so this should not be an issue. ...
Yes, please. ...
Ok, so it's worth keeping. Maybe replace ...
You're right, I was misled when I wrote that comment, sorry, please ignore it. ...
Adding an assert would be a good tradeoff if you keep these changes, indeed. ...
Doh! Missed the |
NickBarnes left a comment
This is a first raft of comments, almost all minor stylistic things on the first big futex/barrier/spinlock commit. I'm going to move on to the other commits. Overall this is a big improvement and I'm looking forward to getting it merged. Let me know if I can help move it along.
| very short critical section we are waiting on */ | ||
| #define SPIN_WAIT SPIN_WAIT_BACK_OFF(Max_spins_long) | ||
|
|
||
| struct caml_plat_srcloc { |
I think this belongs in misc.h, maybe somewhere close to CAMLassert.
I don't know if it makes much sense in this PR, unless we also migrate, say, caml_failed_assert to use it. I added the type to reduce the code size of the SPIN_WAIT loop, but it could be beneficial elsewhere too.
runtime/platform.c
Outdated
| #ifndef CAML_PLAT_FUTEX_FALLBACK | ||
| # if defined(_WIN32) | ||
| # include <synchapi.h> | ||
| # define CAML_PLAT_FUTEX_WAIT(ftx, undesired) \ |
Naming conventions aren't wholly clear to me, but I think these macros, being local to this file, probably don't need CAML_PLAT_.
I've realised that FUTEX_WAIT and FUTEX_WAKE are defined as macro constants by the futex headers themselves, so it's probably best to keep the CAML_PLAT_.
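For readers unfamiliar with what these wait/wake macros wrap: the PR description characterises the futex as a 32-bit word with `wait` and `wake` operations, with a mutex + condition variable fallback where no native futex is available. A minimal sketch of such a fallback, assuming POSIX threads and C11 atomics; every `sketch_futex*` name is hypothetical and this is not the code in `platform.c`:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* A 32-bit word plus the mutex and condition variable used to block on it. */
typedef struct {
  _Atomic uint32_t value;
  pthread_mutex_t mutex;
  pthread_cond_t cond;
} sketch_futex;

static void sketch_futex_init(sketch_futex *f, uint32_t v)
{
  int rc;
  atomic_init(&f->value, v);
  rc = pthread_mutex_init(&f->mutex, NULL);
  assert(rc == 0);
  rc = pthread_cond_init(&f->cond, NULL);
  assert(rc == 0);
  (void)rc;
}

/* Block for as long as the word still holds `undesired`. */
static void sketch_futex_wait(sketch_futex *f, uint32_t undesired)
{
  pthread_mutex_lock(&f->mutex);
  while (atomic_load_explicit(&f->value, memory_order_acquire) == undesired)
    pthread_cond_wait(&f->cond, &f->mutex);
  pthread_mutex_unlock(&f->mutex);
}

/* Wake every waiter. The caller is expected to have stored a new value
   first; taking the mutex here means a waiter that has already seen the
   old value is guaranteed to be inside pthread_cond_wait by the time the
   broadcast happens, so the wake-up cannot be lost. */
static void sketch_futex_wake_all(sketch_futex *f)
{
  pthread_mutex_lock(&f->mutex);
  pthread_cond_broadcast(&f->cond);
  pthread_mutex_unlock(&f->mutex);
}

static void sketch_futex_free(sketch_futex *f)
{
  pthread_cond_destroy(&f->cond);
  pthread_mutex_destroy(&f->mutex);
}
```

A native implementation would replace the mutex and condition variable with the platform's own wait and wake calls on the word itself.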
| @@ -225,15 +403,14 @@ void caml_mem_unmap(void* mem, uintnat size) | |||
| #define Max_sleep_ns 1000000000 // 1 s | |||
I dislike the punning between spin counts (spins, etc.) and nanoseconds (Max_sleep_ns, etc.)
I'm not sure what there is to do, it is an unfortunate part of the original spin wait code, and not one I particularly want to touch. Though perhaps it could be commented, I suppose.
runtime/platform.c
Outdated
| @@ -225,15 +403,14 @@ void caml_mem_unmap(void* mem, uintnat size) | |||
| #define Max_sleep_ns 1000000000 // 1 s | |||
|
|
|||
| unsigned caml_plat_spin_wait(unsigned spins, | |||
This function interface is potentially confusing. It doesn't spin, it just waits, and it does so for a number of nanoseconds, not a number of spins. Maybe it should be called caml_wait_for_ns or something. If the argument is called spins then maybe it should divide by a constant (which could be 1) called something like SPINS_PER_NS (or multiply by NS_PER_SPIN, I guess), with a comment explaining what's going on there. All just for clarity, of course - the compiled code will stay the same.
NickBarnes left a comment
Another great commit. Almost all of my comments are stylistic. Some of the header macros could be tidied up, some of the names for things could be improved, more comments are always good, and if we have names for values (Barrier_*) then we should use them (rather than relying on Barrier_released == 0).
| void caml_do_opportunistic_major_slice | ||
| (caml_domain_state* domain_unused, void* unused) | ||
|
|
||
| int caml_do_opportunistic_major_slice |
I think this might be better returning a bool. Certainly it should have a comment documenting the returned value (which I think is "Did we do some work?").
Unfortunately, the runtime code is not consistent in its use of bool versus int. This should be the topic for some discussion among maintainers, then another PR.
I do agree it would be better to add a comment.
NickBarnes left a comment
This is the last of my reviews on this PR for now, having got to the end of the commits. My main comment on this one is that line-length is important; please stick to 80 columns.
Overall, this PR is excellent: a significant improvement in the runtime (in the GC especially), and a step forwards in both clarity and performance. My raft of small nitpicks is meant as polish: bouquets rather than brickbats.
| if (!stw_request.enter_spin_callback | ||
| (domain, stw_request.enter_spin_data)) { |
If I understand correctly, the point of this callback now returning a boolean is to stop calling it if there is no more work to do. So it essentially avoids unnecessary function calls during the bounded spinning. Is that correct?
Yes, practically. This allows for breaking to a "fast path" that is only concerned with synchronising as efficiently as possible (and is made better use of by the minor GC barrier).
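The idea in this exchange, a spin callback whose boolean result says whether it is still worth calling, can be sketched as follows. This is an assumption-laden illustration, not the PR's code: `bounded_spin`, `spin_callback` and `sample_work` are hypothetical names, and the release flag is a plain C11 atomic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if it did some work and may have more, false once it has
   nothing left to do. */
typedef bool (*spin_callback)(void *data);

/* Spin at most `max_spins` times waiting for `*released` to become
   nonzero. Returns true if released while spinning, false if the budget
   ran out and the caller should fall back to blocking. */
static bool bounded_spin(atomic_int *released, spin_callback cb, void *data,
                         unsigned max_spins)
{
  unsigned i = 0;
  assert(released != NULL);
  /* Slow path: interleave useful work with checks while work remains. */
  for (; i < max_spins; i++) {
    if (atomic_load_explicit(released, memory_order_acquire)) return true;
    if (cb == NULL || !cb(data)) break;  /* out of work: stop calling it */
  }
  /* Fast path: only poll the flag, with no further callback calls. */
  for (; i < max_spins; i++) {
    if (atomic_load_explicit(released, memory_order_acquire)) return true;
  }
  return false;
}

/* Example callback: consumes one unit of a work counter per call. */
static bool sample_work(void *data)
{
  int *remaining = data;
  if (*remaining == 0) return false;
  (*remaining)--;
  return true;
}
```

Once the callback returns false, the remaining spin budget is spent purely on synchronising, which is the "fast path" described in the reply.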
| /* create/teardown STWs | ||
|
|
||
| The STW API does have an enter barrier before the handler code is | ||
| run, however, the enter barrier itself calls the runtime events API | ||
| after arrival. Thus, the barrier in the STWs below is needed both | ||
| to ensure that all domains have actually reached the handler before | ||
| we start/stop (and are not calling the runtime events API from the | ||
| STW code), and of course to ensure that the setup/teardown is | ||
| observed by all domains returning from the STW. */ | ||
|
|
I think it’s great to have such explanatory comments, but, maybe because I’m not familiar with runtime events, I’m having a hard time understanding this one. By “runtime events API”, do we mean e.g. CAML_EV_BEGIN? If so, this function is called before arriving at the barrier, e.g. at 8d2ffa5#diff-67115925103982a8ebeb085cfab5ef31a182c9a442bc51e053934364d3750dafR1400, and I don’t understand how the barrier in the function below helps with that.
If I’m misunderstanding, please, let me know.
The point is to avoid a race there. Calling the STW API before the enter barrier is fine because it's either properly initialised or not, but calling it after can race with the actual setup. I'll try and clarify.
I understand better now and find the comment clearer, thanks.
| stw_teardown_runtime_events(caml_domain_state *domain_state, | ||
| void *remove_file_data, int num_participating, | ||
| caml_domain_state **participating_domains) { | ||
| Caml_global_barrier_if_final(num_participating) { |
I have the same difficulty understanding this barrier as the one in stw_create_runtime_events above.
Resolved (see related comment thread).
I have reviewed the changes. I agree with a number of comments made by @dustanddreams and @NickBarnes so I did not repeat them. Overall, I find the code very well-written and do not have major concerns. There is simply one thing that I do not understand about the protection of runtime events buffer creation and teardown, but to be fair it predates this PR (however, the PR has the nicety of adding an explanatory comment, so I’m asking). I left it as a code comment, along with a few other minor comments and questions.
runtime/caml/platform.h
Outdated
 * The lifecycle is as follows:
 *
 *    Reset (1 thread)       (other threads)
 *          |                      |
 *          +----------+-----------+
 *                     |
 *            Arrive (all threads)
 *                     |
 *                     | check arrival number
 *          +----------+-----------+
 *          |                      |
 *    Check or Block            Release
 *   (non-final threads)     (final thread)
 *          |                      |
 */
I think we should make extremely clear here that Check-ing or Block-ing on a barrier without previously calling Arrive is a programming error and will not check or block anything. And repeat the warning in a comment preceding each of the Check and Block functions.
I really don't think that's necessary. It would theoretically be perfectly fine to call Check or even Block by another thread that was merely monitoring and wasn't expected to arrive at all, and it should be evident that Block-ing without Arrive-ing first, where expected, would cause a guaranteed deadlock.
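The Reset / Arrive / Check-or-Block / Release lifecycle discussed in this thread can be sketched with C11 atomics. This is a simplified single-use latch under stated assumptions: all `sketch_barrier*` names are hypothetical, the party count is passed around by the caller, and Block spins where an implementation like the one in this PR would wait on a futex instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
  atomic_uint arrived;   /* number of parties that have arrived so far */
  atomic_bool released;  /* set once by the final party */
} sketch_barrier;

/* Reset: done by one thread before any party arrives. */
static void sketch_barrier_reset(sketch_barrier *b)
{
  atomic_store(&b->arrived, 0);
  atomic_store(&b->released, false);
}

/* Arrive: returns this party's arrival number, counting from 1. The
   caller compares it with the number of parties to learn whether it is
   the final one. */
static unsigned sketch_barrier_arrive(sketch_barrier *b)
{
  return atomic_fetch_add(&b->arrived, 1) + 1;
}

/* Release: called by the final party to let everyone else through. */
static void sketch_barrier_release(sketch_barrier *b)
{
  assert(atomic_load(&b->arrived) > 0);
  atomic_store_explicit(&b->released, true, memory_order_release);
}

/* Check: non-blocking test for release. */
static bool sketch_barrier_check(sketch_barrier *b)
{
  return atomic_load_explicit(&b->released, memory_order_acquire);
}

/* Block: wait for release. Spins here purely for brevity. */
static void sketch_barrier_block(sketch_barrier *b)
{
  while (!sketch_barrier_check(b)) { /* spin */ }
}
```

In this sketch a thread that never arrives can still call Check or Block, as the reply notes; the barrier is simply never released if an expected party fails to Arrive.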
|
Thanks @dustanddreams, @NickBarnes and @OlivierNicole; I am happy with the review comment stream so far, and I intend to approve on your behalf once your comments have been taken into account by @eutro |
|
What I'm going to do is push a commit to address all comments first, and then I will rebase to fix conflicts |
|
I believe I've marked as resolved all those review comments I've affirmatively addressed or that aren't relevant with new changes. See the latest commit for details, but notably:
|
|
This looks very good now. Can you rebase again, dropping the first two commits which are not really part of this STW work, and add a |
|
The last commit is new, doing away with |
c13ff7a to
cdc5cc2
damiendoligez left a comment
I've carefully read the code and I have only a few remarks. The quality of the code is very good, and I have no doubts we will be able to maintain it in the future.
Do we want to add this much code just to get rid of semi-busy waiting in the runtime? My answer is definitely yes, and this is why: we could go on with spin loops and cpurelax and sleep, and decide that it works well enough (although it doesn't on Windows) but that puts us at the mercy of the scheduling decisions made by the C library, the OS, and even the hardware, which can and will change in the future (and there are surprises, see #13128 for an example).
So in the end it's a question of quality of implementation, and I think we should do the right thing.
|
Rebasing to fix conflicts, mainly with #13063 (all Mutex/CV uses in this PR are within STW sections, or are used as a fallback implementation for the otherwise fully blocking futex, so they became |
|
This is a bit orthogonal to this PR, but the macros can be written more elegantly using the |
|
@MisterDA Thanks for the suggestion, but indeed I think it should be a separate PR. |
damiendoligez left a comment
Approving again, now that my suggestions have been taken into account. This PR is good to merge as soon as it gets rebased and CI-checked.
- Comment on the return value of `enter_spin_callback` parameter
- Reduce the public API of the `global_barrier` to only those required for the `Caml_maybe_global_barrier` and `Caml_global_barrier_if_final`
- Add more comments for `global_barrier` API including an example for `Caml_global_barrier_if_final`
- Add `do {`/`} while(0)` around `Caml_maybe_global_barrier`
- Clean up `Caml_global_barrier_if_final` to be only a single macro
- Move `GENSYM` from `platform.h` to `misc.h` as `CAML_GENSYM`, make it prefix with `caml__` too
- Add further comments for futexes
- Replace all `#if defined(CAML_PLAT_FUTEX_FALLBACK)` and similar with `#ifdef` for consistency
- Make `caml_plat_futex_(init|free)` not be inline, so they can be declared together regardless of fallback usage
- Introduce binary latches with `caml_plat_binary_latch` and add associated functions, to replace `caml_plat_barrier_raw_(wait|release)`
- Expand on barrier comments
- Add overview comment for `SPIN_WAIT_*` macros, clarify uses for `Max_spins_*` constants
- Rename `caml_plat_spin_wait` function to `caml_plat_spin_back_off` instead, and rename its `spins` parameter to `sleep_ns`
- Replace uses of the old `caml_plat_barrier_raw_(wait|release)` with `caml_plat_latch_*` functions, removing (implicit and explicit) uses of the `Barrier_*` constants
- Add braces for `do ; while(...);` loop that was mistaken for just a `while(...);` loop
- Replace `check_for_stw_leader` label with a `while (1)` loop
- Rename `domains_finished_minor_gc` to `minor_gc_end_barrier`
- Move `stw_teardown_runtime_events` declaration
- Clarify that the runtime events STW sections' barrier avoids a race specifically
- Add a clarifying comment and debug assertion for the `work = end - p` hoisting in `pool_sweep`
- Increase some spins, to closer match existing Linux behaviour
- Replace leader-released "STW API barrier" with one released by the last arriving domain
- also abstract `interrupt_pending` checks
- Add `Latch_released` and `CAML_PLAT_BARRIER_INITIALIZER` comments
- This silences some MSVC warnings
- Replace `Caml_maybe_global_barrier` macro with `caml_global_barrier` inline function, the old `caml_global_barrier` function defined in `domain.c` now renamed to `caml_enter_global_barrier`.
- Change name of `caml_global_barrier_wait_unless_final` to `caml_global_barrier_and_check_final`, also clarify the comment.
- Clarify in `Caml_global_barrier_if_final` comment that other threads will block until the block finishes executing, and how the block should exit normally.
- Clarify `caml_plat_barrier` comment on races.
- Typo fixes, `indiciates` -> `indicates`, `condition` -> `condition`
- Justify `int` return (instead of `bool`) in comment on `caml_do_opportunistic_major_slice`.
- Add `TODO` and clarify untested BSD (NetBSD/DragonFly) comments in `platform.h` and `platform.c`
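The `work = end - p` hoisting with a debug assertion, mentioned in the commit notes above, follows a general pattern that can be sketched as below. This is not `pool_sweep` itself: the element type, the `stride` parameter and both function names are hypothetical, chosen only to show a per-iteration increment being replaced by one subtraction before the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the work counter is incremented on every iteration. */
static size_t sweep_incremental(const int *p, const int *end, size_t stride,
                                long *sum)
{
  assert(stride > 0 && (size_t)(end - p) % stride == 0);
  size_t work = 0;
  while (p < end) {
    *sum += *p;        /* stands in for the real per-block processing */
    work += stride;
    p += stride;
  }
  return work;
}

/* After: the total is computed once, outside the hot loop. This is only
   valid because the loop always walks the whole range in whole strides,
   which the assertion checks in debug builds. */
static size_t sweep_hoisted(const int *p, const int *end, size_t stride,
                            long *sum)
{
  assert(stride > 0 && (size_t)(end - p) % stride == 0);
  size_t work = (size_t)(end - p);
  while (p < end) {
    *sum += *p;
    p += stride;
  }
  return work;
}
```

The two functions return the same value whenever the precondition holds, which is the property the debug assertion is there to protect.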
Merged! Thanks @eutro for the impressive work, and @dustanddreams, @NickBarnes, @OlivierNicole and @damiendoligez for your reviews.
This PR augments the existing busy-wait based synchronisation of stop-the-world (STW) sections using proper OS-based synchronisation primitives (barriers and futexes).
The branch also currently includes a couple (the first two) unrelated commits touching ocamltest and the testsuite to make extracting executables from the latter easier. Please ignore these.
Busy Waits in the Runtime
See also #11707 where non-runtime spins are discussed too.
The runtime currently uses busy-waiting/spinning in several places for synchronisation. In almost all places this is done with the `SPIN_WAIT` macro in `platform.h`, which expands to an endless loop with eventual exponential backoff (using `usleep`) in `caml_plat_spin_wait`:
`SPIN_WAIT` implementation
ocaml/runtime/caml/platform.h
Lines 79 to 86 in 2ee5c06
ocaml/runtime/platform.c
Lines 223 to 240 in 2ee5c06
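Since the referenced source lines are not reproduced here, the following is a rough sketch of what a sleep-based exponential backoff step can look like. It is not the runtime's implementation: the `sketch_*` names, the minimum sleep and the doubling growth factor are assumptions, `nanosleep` is used in place of `usleep`, and only the 1 s ceiling mirrors the `Max_sleep_ns` value visible in the diff excerpts.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define SKETCH_MIN_SLEEP_NS 10000u       /* 10 us: assumed value */
#define SKETCH_MAX_SLEEP_NS 1000000000u  /* 1 s ceiling */

/* Clamp the sleep time into [min, max]. */
static unsigned sketch_clamp_ns(unsigned sleep_ns)
{
  if (sleep_ns < SKETCH_MIN_SLEEP_NS) return SKETCH_MIN_SLEEP_NS;
  if (sleep_ns > SKETCH_MAX_SLEEP_NS) return SKETCH_MAX_SLEEP_NS;
  return sleep_ns;
}

/* Compute the sleep time to use on the next iteration: double the
   current one, saturating at the ceiling. */
static unsigned sketch_next_sleep_ns(unsigned sleep_ns)
{
  sleep_ns = sketch_clamp_ns(sleep_ns);
  if (sleep_ns >= SKETCH_MAX_SLEEP_NS / 2) return SKETCH_MAX_SLEEP_NS;
  return sleep_ns * 2;
}

/* Sleep for the (clamped) time and return the next sleep time. */
static unsigned sketch_back_off(unsigned sleep_ns)
{
  sleep_ns = sketch_clamp_ns(sleep_ns);
  struct timespec ts;
  ts.tv_sec = sleep_ns / 1000000000u;
  ts.tv_nsec = sleep_ns % 1000000000u;
  assert(ts.tv_nsec < 1000000000L);
  nanosleep(&ts, NULL);
  return sketch_next_sleep_ns(sleep_ns);
}
```

A waiting loop would call `sketch_back_off` only after a bounded number of pure spins has failed, feeding each returned value into the next call.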
Spinning is used in a handful of places, quite reasonably, for ironing out contention over object headers (in obj.c, major_gc.c, minor_gc.c), which this PR does not touch (other than minor adjustments to the `SPIN_WAIT` macro itself).
The more interesting uses, which this PR does affect, are those for STW sections. These are:
Existing STW Spins
For starting an STW section, all domains are issued interrupts, waited on, then released by a barrier:
ocaml/runtime/domain.c
Lines 1513 to 1544 in 2ee5c06
The two spins here are in `caml_wait_interrupt_serviced` (by the leader, waiting for each domain to service its interrupt) and in `stw_handler` (by each participant, waiting for the leader to release the barrier - in most cases, "async" STW requests don't).
ocaml/runtime/domain.c
Lines 346 to 364 in 2ee5c06
ocaml/runtime/domain.c
Lines 1340 to 1353 in 2ee5c06
There are also two barrier implementations with spinning, one used as a barrier for minor collections:
ocaml/runtime/minor_gc.c
Lines 667 to 701 in c287b9d
(as a side-note, writes between barrier arrival and barrier departure may race with code outside the STW: `caml_collect_gc_stats_sample` has unsynchronised writes that could race with `caml_compute_gc_stats`, which is precisely what has been reported by TSan; this has since been fixed in #12595, and the barrier arrival/departure moved around)
The other is the global barrier used by several STW sections, notably in major collections, but implemented in one place:
ocaml/runtime/domain.c
Lines 1282 to 1315 in 2ee5c06
Spinning Performance
These spins tend to work reasonably well in most cases, particularly on Linux. Spinning has some advantages:
Spinning also has notable disadvantages:
`usleep` for short times `Sleep` with `0` and so don't sleep at all, at best merely yielding instead
I took measurements for how long the STW-synchronising `SPIN_WAIT`s tend to spin. With `Max_spins` at both `1000` and `10000`, I logged the `caml__spins` variable after each `SPIN_WAIT` (with an ad-hoc C program over shared memory). The test load was the testsuite with `make -C testsuite/ parallel` (the serial run shows a similar distribution). I found that:
Spin Counter Plots
In the plots below, each line is the distribution of spin counts (number of iterations) for different numbers of threads. The X axis shows the number of iterations (value of `caml__spins` after the loop), and on the Y axis is the empirical cumulative distribution function, i.e. the Y axis shows the proportion of `SPIN_WAIT`s which finished within the number of iterations on the X axis. The graphs are labelled with the source of the spin: `1ae2` is the `interrupt_serviced` spin, `5ae1` is the `stw_handler` spin, `ba01` is the global barrier, and `ba02` is the minor GC barrier.
Non-waiting Spins
These are plots of the distribution of `SPIN_WAIT`s limited to those that didn't call `caml_plat_spin_wait`, the step at the end indicating spins which do end up calling `caml_plat_spin_wait`.
Max_spins = 10_000
Max_spins = 1_000
Waiting Spins
These are plots of the distribution of `SPIN_WAIT`s that do actually call `caml_plat_spin_wait`. Here the spin count on the X axis is actually the nanosecond sleep used and returned by `caml_plat_spin_wait` on the last iteration.
Max_spins = 10_000
Max_spins = 1_000
This PR
The aim of this PR is to improve the situation of busy-waits in the runtime, initially motivated by poor performance on certain Windows machines, where tests like `memory-model/forbidden` run for three minutes. It does two things: replaces pure busy-spinning in STW synchronisation with OS-based synchronisation, and cleans up some of the issues that are exacerbated by this. No existing semantic bugs are addressed.
This PR introduces two new synchronisation objects in `platform.h`: `caml_plat_futex`, and `caml_plat_barrier` implemented using it. Hand-rolling synchronisation objects may be alarming and controversial, but should hopefully be easy enough to review.
Descriptions
`caml_plat_futex` is fundamentally a 32-bit word with `wait` and `wake(_all)` operations, implemented using syscalls on Linux (if the `linux/futex.h` header is available) and some BSDs, `WaitOnAddress` on Windows, and a mutex + condition variable as a fallback, on macOS and elsewhere.
`caml_plat_barrier` can be used as either a "single-sense" count-down latch (which needs explicit resetting), or a "sense-reversing" conventional barrier (which doesn't need explicit resetting).
Other pthreads synchronisation primitives were considered: semaphores and pthreads' own barrier. The latter was deemed unsuitable - it must be created for a known number of parties, and doesn't support split `arrive` and `wait` that the runtime currently uses. A semaphore may work for the `wait_interrupt_serviced` spin, but wasn't used with `caml_plat_futex` already available and more suitable.
The STW-synchronising spin waits mentioned in the previous section were changed in this PR to use these. The barriers in `minor_gc.c` and `caml_global_barrier_*` now use `caml_plat_barrier` in the same mode (sense-reversing or single-sense) as the original code did. The `interrupt_serviced` and `stw_handler` spins were replaced by `caml_plat_futex`es used in a similar way to a binary semaphore. Some bounded spinning was also kept for all of these, particularly in the two-domain case (where yielding to the OS is often unnecessary), and where useful work can be done (this is only in the minor GC STW, which does opportunistic major slices while spinning), as guided by the spin plots above.
Finally, this PR also includes the following optimisations (guided by `perf` on `artemis`) in impacted code. Some of these can apply to trunk directly without the busy-wait changes, but they seem to be much more impactful with them.
- `Caml_state` reads removed, where these were hot
- `global_barrier` API was streamlined for the common running-something-as-the-final-party use-case
- `pool_sweep`, now slightly hotter, was optimised to skip the first bounds check and to hoist the in-loop `work` increment
- `alloc_counter` and `work_counter` in `major_gc.c` were unified into one `outstanding_work_counter`
- `update_major_slice_work` which had been updating two atomics that were probably on the same cache line
- `domain_spawn` disallows new STW sections, which it cannot run concurrently with, if it's already been forced to wait for a few (currently 2) STW sections to end
- `domain_spawn` (and its parent thread) from making any progress (see all the `spawn_burn` tests), which was exacerbated in some cases (particularly on macOS in CI)
Benchmarks
A lot of benchmarking was done running (individual tests of) the testsuite, which, while not necessarily made for benchmarking, does have a wide variety of program behaviours. Some benchmarking (for `perf`) was also done on the Sandmark benchmarks, running them manually.
Testsuite Benchmarks
The plots below include:
`trunk` or `trunk+backports` (bars which go down are good, bars which go up are bad).
Tests which didn't run for long enough to be useful (the majority) were excluded from benchmarking. Here is the full list of tests run, though not all platforms ran all tests. The source code of the programs I used to compile, run, time, and plot all the tests is currently not public, but I will work on publishing these after opening this PR.
In each plot, some tests are excluded for not meeting a threshold of statistical significance, which is noted on the first page.
Notes:
- `OCAML_TEST_SIZE` was 2 for `summer`, and 3 for all the others
- `trunk` for these benchmarks was at 4a458b9, `barriers` was at d057cc6
- `trunk+backports` is `trunk` with the `pool_sweep`, `outstanding_work_counter` and `domain_spawn` optimisation patches applied
Sandmark Nightly
See the latest `5.2.0+trunk+eutro+barriers` on https://sandmark.tarides.com/.