
OS-based Synchronisation for Stop-the-World Sections#12579

Merged
gasche merged 11 commits into ocaml:trunk from eutro:tsdnr-barriers
Jun 4, 2024
Conversation

@eutro
Contributor

@eutro eutro commented Sep 18, 2023

This PR augments the existing busy-wait based synchronisation of stop-the-world (STW) sections with proper OS-based synchronisation primitives (barriers and futexes).

The branch also currently includes a couple of unrelated commits (the first two) touching ocamltest and the testsuite, to make extracting executables from the latter easier. Please ignore these.

Busy Waits in the Runtime

See also #11707 where non-runtime spins are discussed too.

The runtime currently uses busy-waiting/spinning in several places for synchronisation. In almost all places this is done with the SPIN_WAIT macro in platform.h, which expands to an endless loop with eventual exponential backoff (using usleep) in caml_plat_spin_wait:

SPIN_WAIT implementation

#define SPIN_WAIT \
  unsigned GENSYM(caml__spins) = 0; \
  for (; 1; cpu_relax(), \
         GENSYM(caml__spins) = \
           CAMLlikely(GENSYM(caml__spins) < Max_spins) ? \
             GENSYM(caml__spins) + 1 : \
             caml_plat_spin_wait(GENSYM(caml__spins), \
                                 __FILE__, __LINE__, __func__))

ocaml/runtime/platform.c

Lines 223 to 240 in 2ee5c06

#define Min_sleep_ns       10000 // 10 us
#define Slow_sleep_ns    1000000 // 1 ms
#define Max_sleep_ns  1000000000 // 1 s

unsigned caml_plat_spin_wait(unsigned spins,
                             const char* file, int line,
                             const char* function)
{
  unsigned next_spins;
  if (spins < Min_sleep_ns) spins = Min_sleep_ns;
  if (spins > Max_sleep_ns) spins = Max_sleep_ns;
  next_spins = spins + spins / 4;
  if (spins < Slow_sleep_ns && Slow_sleep_ns <= next_spins) {
    caml_gc_log("Slow spin-wait loop in %s at %s:%d", function, file, line);
  }
  usleep(spins/1000);
  return next_spins;
}

Spinning is used in a handful of places, quite reasonably, for ironing out contention over object headers (in obj.c, major_gc.c, minor_gc.c), which this PR does not touch (other than minor adjustments to the SPIN_WAIT macro itself).

The more interesting uses, which this PR does affect, are those for STW sections. These are:

Existing STW Spins

For starting an STW section, all domains are issued interrupts, waited on, then released by a barrier:

ocaml/runtime/domain.c

Lines 1513 to 1544 in 2ee5c06

  /* Next, interrupt all domains */
  for(i = 0; i < stw_domains.participating_domains; i++) {
    dom_internal * d = stw_domains.domains[i];
    stw_request.participating[i] = d->state;
    CAMLassert(!d->interruptor.interrupt_pending);
    if (d->state != domain_state) caml_send_interrupt(&d->interruptor);
  }

  /* Domains now know they are part of the STW.

     Note: releasing the lock will not allow new domain to be created
     in parallel with the rest of the STW section, as new domains
     follow the protocol of waiting on [all_domains_cond] which is
     only broadcast at the end of the STW section.

     The reason we use a condition variable [all_domains_cond] instead
     of just holding the lock until the end of the STW section is that
     the last domain to exit the section (and broadcast the condition)
     is not necessarily the same as the domain starting the section
     (and taking the lock) -- whereas POSIX mutexes must be unlocked
     by the same thread that locked them.
  */
  caml_plat_unlock(&all_domains_lock);

  for(i = 0; i < stw_request.num_domains; i++) {
    int id = stw_request.participating[i]->id;
    caml_wait_interrupt_serviced(&all_domains[id].interruptor);
  }

  /* release from the enter barrier */
  atomic_store_release(&stw_request.domains_still_running, 0);

The two spins here are in caml_wait_interrupt_serviced (by the leader, waiting for each domain to service its interrupt) and in stw_handler (by each participant, waiting for the leader to release the barrier - in most cases, "async" STW requests don't).

ocaml/runtime/domain.c

Lines 346 to 364 in 2ee5c06

static void caml_wait_interrupt_serviced(struct interruptor* target)
{
  int i;
  /* Often, interrupt handlers are fast, so spin for a bit before waiting */
  for (i=0; i<1000; i++) {
    if (!atomic_load_acquire(&target->interrupt_pending)) {
      return;
    }
    cpu_relax();
  }
  {
    SPIN_WAIT {
      if (!atomic_load_acquire(&target->interrupt_pending))
        return;
    }
  }
}

ocaml/runtime/domain.c

Lines 1340 to 1353 in 2ee5c06

static void stw_handler(caml_domain_state* domain)
{
  CAML_EV_BEGIN(EV_STW_HANDLER);
  CAML_EV_BEGIN(EV_STW_API_BARRIER);
  {
    SPIN_WAIT {
      if (atomic_load_acquire(&stw_request.domains_still_running) == 0)
        break;
      if (stw_request.enter_spin_callback)
        stw_request.enter_spin_callback(domain, stw_request.enter_spin_data);
    }
  }
  CAML_EV_END(EV_STW_API_BARRIER);

There are also two barrier implementations with spinning, one used as a barrier for minor collections:

ocaml/runtime/minor_gc.c

Lines 667 to 701 in c287b9d

  /* arrive at the barrier */
  if( participating_count > 1 ) {
    atomic_fetch_add_explicit
      (&domains_finished_minor_gc, 1, memory_order_release);
  }
  /* other domains may be executing mutator code from this point, but
     not before */
  call_timing_hook(&caml_minor_gc_end_hook);
  CAML_EV_COUNTER(EV_C_MINOR_PROMOTED,
                  Bsize_wsize(domain->allocated_words - prev_alloc_words));
  CAML_EV_COUNTER(EV_C_MINOR_ALLOCATED, minor_allocated_bytes);
  CAML_EV_END(EV_MINOR);
  caml_gc_log ("Minor collection of domain %d completed: %2.0f%% of %u KB live",
               domain->id,
               100.0 * (double)st.live_bytes / (double)minor_allocated_bytes,
               (unsigned)(minor_allocated_bytes + 512)/1024);
  /* leave the barrier */
  if( participating_count > 1 ) {
    CAML_EV_BEGIN(EV_MINOR_LEAVE_BARRIER);
    {
      SPIN_WAIT {
        if (atomic_load_acquire(&domains_finished_minor_gc) ==
            participating_count) {
          break;
        }
        caml_do_opportunistic_major_slice(domain, 0);
      }
    }
    CAML_EV_END(EV_MINOR_LEAVE_BARRIER);
  }

(As a side-note, writes between barrier arrival and barrier departure may race with code outside the STW: caml_collect_gc_stats_sample has unsynchronised writes that could race with caml_compute_gc_stats, which is precisely what TSan has reported. This has since been fixed in #12595, and the barrier arrival/departure moved around.)

The other is the global barrier used by several STW sections, notably in major collections, but implemented in one place:

ocaml/runtime/domain.c

Lines 1282 to 1315 in 2ee5c06

/* sense-reversing barrier */
#define BARRIER_SENSE_BIT 0x100000

barrier_status caml_global_barrier_begin(void)
{
  uintnat b = 1 + atomic_fetch_add(&stw_request.barrier, 1);
  return b;
}

int caml_global_barrier_is_final(barrier_status b)
{
  return ((b & ~BARRIER_SENSE_BIT) == stw_request.num_domains);
}

void caml_global_barrier_end(barrier_status b)
{
  uintnat sense = b & BARRIER_SENSE_BIT;
  if (caml_global_barrier_is_final(b)) {
    /* last domain into the barrier, flip sense */
    atomic_store_release(&stw_request.barrier, sense ^ BARRIER_SENSE_BIT);
  } else {
    /* wait until another domain flips the sense */
    SPIN_WAIT {
      uintnat barrier = atomic_load_acquire(&stw_request.barrier);
      if ((barrier & BARRIER_SENSE_BIT) != sense) break;
    }
  }
}

void caml_global_barrier(void)
{
  barrier_status b = caml_global_barrier_begin();
  caml_global_barrier_end(b);
}

Spinning Performance

These spins tend to work reasonably well in most cases, particularly on Linux. Spinning has some advantages:

  • It is easy to write and to verify
  • It is cheap to release, no explicit wake-ups needed
  • If we are lucky, we can avoid a context switch

Spinning also has notable disadvantages:

  • We may waste CPU and power
  • We may waste our time slice and get scheduled out anyway, or end up sleeping and paying for a context switch regardless
  • The OS doesn't know what we're waiting on, though this is less of an issue than with fair locks
  • The effectiveness of the exponential backoff depends heavily on how well the OS can usleep for short times
  • It is fragile: bad spinning behaviour can lead to awful synchronisation latency if we back off for a long time
    • Some spins have no guarantee of ending in a timely manner (though they tend to)
      • A domain can run for a while without handling interrupts, which in certain cases may or may not be considered a bug
      • Barrier-delimited major and minor GC sections can also take varying lengths of time, so arrivals can be staggered
        • This is offset in the minor GC barrier with opportunistic major slices if there is work available

I took measurements for how long the STW-synchronising SPIN_WAITs tend to spin. With Max_spins at both 1000 and 10000, I logged the caml__spins variable after each SPIN_WAIT (with an ad-hoc C program over shared memory). The test load was the testsuite with make -C testsuite/ parallel (the serial run shows a similar distribution). I found that:

  • With only two domains, the barriers would often release before the thread had to back off, but with more domains this gets rarer
  • In most cases, threads only end up sleeping a few times
Spin Counter Plots

In the plots below, each line is the distribution of spin counts (number of iterations) for different numbers of threads. The X axis shows the number of iterations (value of caml__spins after the loop), and on the Y axis is the empirical cumulative distribution function, i.e. the Y axis shows the proportion of SPIN_WAITs which finished within the number of iterations on the X axis. The graphs are labelled with the source of the spin: 1ae2 is the interrupt_serviced spin, 5ae1 is the stw_handler spin, ba01 is the global barrier, and ba02 is the minor GC barrier.
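Concretely, the ECDF value plotted on the Y axis is just the fraction of samples at or below a given X. A minimal sketch of the computation (not the plotting code actually used):

```c
#include <stdlib.h>

/* Empirical CDF at x: the fraction of the n samples that are <= x. */
static double ecdf_at(const unsigned *samples, size_t n, unsigned x) {
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (samples[i] <= x) count++;
  return (double)count / (double)n;
}
```

Reading a point off the curve: if ecdf_at(spin_counts, n, 1000) is 0.9, then 90% of the recorded SPIN_WAITs finished within 1000 iterations.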

Non-waiting Spins

These are plots of the distribution of SPIN_WAITs, limited to those that didn't call caml_plat_spin_wait; the step at the end indicates the spins which do end up calling caml_plat_spin_wait.

  • Max_spins = 10_000
    Non-waiting spins max 10k

  • Max_spins = 1_000
    Non-waiting spins max 1k

Waiting Spins

These are plots of the distribution of SPIN_WAITs that do actually call caml_plat_spin_wait. Here the spin count on the X axis is actually the nanosecond sleep used and returned by caml_plat_spin_wait on the last iteration.

  • Max_spins = 10_000
    Waiting spins max 10k

  • Max_spins = 1_000
    Waiting spins max 1k

This PR

The aim of this PR is to improve the situation of busy-waits in the runtime, initially motivated by poor performance on certain Windows machines, where tests like memory-model/forbidden run for three minutes. It does two things: replaces pure busy-spinning in STW synchronisation with OS-based synchronisation, and cleans up some of the issues that are exacerbated by this. No existing semantic bugs are addressed.

This PR introduces two new synchronisation objects in platform.h: caml_plat_futex, and caml_plat_barrier implemented using it. Hand-rolling synchronisation objects may be alarming and controversial, but should hopefully be easy enough to review.

Descriptions

caml_plat_futex is fundamentally a 32-bit word with wait and wake(_all) operations, implemented using syscalls on Linux (if the linux/futex.h header is available) and some BSDs, WaitOnAddress on Windows, and a mutex + condition variable as a fallback, on macOS and elsewhere.
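As an illustration, the portable mutex + condition variable fallback can be sketched like this. The names (futex_t, futex_wait, futex_wake_all) are hypothetical, not the PR's exact API, and the real caml_plat_futex also has the futex-syscall and WaitOnAddress backends:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative fallback futex: a 32-bit word plus a mutex and a
   condition variable. Hypothetical names, not the PR's exact code. */
typedef struct {
  atomic_uint_fast32_t value;
  pthread_mutex_t mutex;
  pthread_cond_t cond;
} futex_t;

/* Block while the futex still holds [expected]. The check happens under
   the mutex, so a wake between the check and the wait cannot be lost. */
static void futex_wait(futex_t *ftx, uint32_t expected) {
  pthread_mutex_lock(&ftx->mutex);
  while (atomic_load(&ftx->value) == expected)
    pthread_cond_wait(&ftx->cond, &ftx->mutex);
  pthread_mutex_unlock(&ftx->mutex);
}

/* Publish a new value and wake every waiter. */
static void futex_wake_all(futex_t *ftx, uint32_t new_value) {
  pthread_mutex_lock(&ftx->mutex);
  atomic_store(&ftx->value, new_value);
  pthread_mutex_unlock(&ftx->mutex);
  pthread_cond_broadcast(&ftx->cond);
}
```

The key property shared with real futexes is that the wait is conditional on the current value, which is what makes a lock-free fast path on the waker side possible.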

caml_plat_barrier can be used as either a "single-sense" count-down latch (which needs explicit resetting), or a "sense-reversing" conventional barrier (which doesn't need explicit resetting).
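The single-sense mode is essentially a count-down latch. A simplified model with C11 atomics (hypothetical names; the PR's caml_plat_barrier blocks on a futex rather than yielding in a loop):

```c
#include <stdatomic.h>
#include <sched.h>

/* Simplified single-sense latch: arrivals are counted up to [parties];
   waiters wait until everyone has arrived, and the latch must be reset
   explicitly before reuse. Hypothetical, not the PR's API. */
typedef struct { atomic_uint arrived; } latch_t;

static void latch_arrive(latch_t *l) {
  atomic_fetch_add_explicit(&l->arrived, 1, memory_order_release);
}

static void latch_wait(latch_t *l, unsigned parties) {
  while (atomic_load_explicit(&l->arrived, memory_order_acquire) < parties)
    sched_yield(); /* the real barrier would futex-wait here */
}

static void latch_reset(latch_t *l) {
  atomic_store_explicit(&l->arrived, 0, memory_order_release);
}
```

The sense-reversing variant avoids the explicit reset by having the last arriver flip a sense bit, as in the caml_global_barrier_end code quoted above.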

Other pthreads synchronisation primitives were considered: semaphores and pthreads' own barrier. The latter was deemed unsuitable: it must be created for a known number of parties, and it doesn't support the split arrive-and-wait pattern that the runtime currently uses. A semaphore might work for the wait_interrupt_serviced spin, but wasn't used, with caml_plat_futex already available and more suitable.

The STW-synchronising spin waits mentioned in the previous section were changed in this PR to use these. The barriers in minor_gc.c and caml_global_barrier_* now use caml_plat_barrier in the same mode (sense-reversing or single-sense) as the original code did. The interrupt_serviced and stw_handler spins were replaced by caml_plat_futexes used in a similar way to a binary semaphore. Some bounded spinning was also kept for all of these, particularly in the two-domain case (where yielding to the OS is often unnecessary), and where useful work can be done (this is only in the minor GC STW, which does opportunistic major slices while spinning), as guided by the spin plots above.
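The resulting "spin briefly, then block" shape can be sketched as follows. Here wait_for_flag and fake_os_wait are hypothetical stand-ins for illustration, and the real code tunes the spin budget per call site:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch: poll a flag a bounded number of times before falling back to
   an OS-level wait (represented by a callback). Returns true if the
   flag was observed set during the spin phase, false if we had to use
   the blocking path. Hypothetical helper, not the PR's code. */
static bool wait_for_flag(atomic_int *flag, int max_spins,
                          void (*os_wait)(atomic_int *)) {
  for (int i = 0; i < max_spins; i++) {
    if (atomic_load_explicit(flag, memory_order_acquire)) return true;
    /* cpu_relax() would go here */
  }
  while (!atomic_load_explicit(flag, memory_order_acquire))
    os_wait(flag); /* e.g. a futex wait; must re-check on wakeup */
  return false;
}

/* Example OS-wait stand-in for testing the sketch: satisfies the wait
   immediately by setting the flag itself. */
static void fake_os_wait(atomic_int *flag) {
  atomic_store_explicit(flag, 1, memory_order_release);
}
```

Re-checking the flag after each wakeup matters because futex-style waits can return spuriously or race with a concurrent wake.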

Finally, this PR also includes the following optimisations (guided by perf on artemis) in impacted code. Some of these can apply to trunk directly without the busy-wait changes, but they seem to be much more impactful with them.

  • Some atomic operations were relaxed, and Caml_state reads removed, where these were hot
  • The global_barrier API was streamlined for the common running-something-as-the-final-party use-case
    • The primary aim being to elide the now slightly more expensive barrier arrival entirely if there is only one domain in the STW
  • pool_sweep, now slightly hotter, was optimised to skip the first bounds check and to hoist the in-loop work increment
  • alloc_counter and work_counter in major_gc.c were unified into one outstanding_work_counter
    • This reduces the now-exacerbated contention in update_major_slice_work which had been updating two atomics that were probably on the same cache line
    • This also changes the gc log output
  • domain_spawn disallows new STW sections, which it cannot run concurrently with, if it's already been forced to wait for a few (currently 2) STW sections to end
    • Some cases where this triggers were already substantially improved by the busy-wait changes
    • Previously, relentless STW requests could prevent domain_spawn (and its parent thread) from making any progress (see all the spawn_burn tests), which was exacerbated in some cases (particularly on macOS in CI)
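The counter unification can be pictured as keeping a single signed balance instead of two adjacent atomics. A hypothetical sketch with one plausible sign convention (not the PR's exact code or semantics):

```c
#include <stdatomic.h>

/* Instead of separate work and alloc counters (which likely shared a
   cache line and were both updated on the hot path), track one signed
   outstanding-work balance. Hypothetical names and convention. */
static atomic_long outstanding_work;

/* Allocation creates GC work to do. */
static void note_allocation(long words) {
  atomic_fetch_add(&outstanding_work, words);
}

/* Completed GC work pays the balance down. */
static void note_work_done(long words) {
  atomic_fetch_sub(&outstanding_work, words);
}

static long work_outstanding(void) {
  return atomic_load(&outstanding_work);
}
```

With one atomic, each update on the hot path touches a single cache line instead of two, which is the contention reduction the bullet above describes.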

Benchmarks

A lot of benchmarking was done running (individual tests of) the testsuite, which, while not necessarily made for benchmarking, does have a wide variety of program behaviours. Some benchmarking (for perf) was also done on the Sandmark benchmarks, running them manually.

Testsuite Benchmarks

The plots below include:

  • a summary plot of the absolute and relative changes in duration compared to trunk or trunk+backports (bars which go down are good, bars which go up are bad).
  • a summary plot showing a histogram of the relative changes in duration (left is good, right is bad)
  • detailed plots of the distribution of each individual test, with:
    • a box plot showing the median and quartiles
    • a violin plot showing the distribution (kernel density estimation)
    • a scatter plot showing individual results

Tests which didn't run for long enough to be useful (the majority) were excluded from benchmarking. Here is the full list of tests run, though not all platforms ran all tests. The source code of the programs I used to compile, run, time, and plot all the tests is currently not public, but I will work on publishing these after opening this PR.

In each plot, some tests are excluded for not meeting a threshold of statistical significance, which is noted on the first page.

Device   Operating System  Processor                            Plots
-------  ----------------  -----------------------------------  -------------------------------
artemis  Arch Linux        AMD Ryzen™ 5 2600 @ 3.40GHz          v. trunk, v. backports, unified
herse    Windows 11        Intel® Core™ i7-8665U @ 1.90GHz      v. trunk, v. backports, unified
apollo   Windows 10        Intel® Core™ i7-6700HQ @ 2.60GHz     v. trunk, v. backports, unified
summer   FreeBSD 13.2      Intel® Xeon® Silver 4108 @ 1.80GHz   v. trunk, v. backports, unified

Notes:

  • OCAML_TEST_SIZE was 2 for summer, and 3 for all the others
  • trunk for these benchmarks was at 4a458b9, barriers was at d057cc6
  • trunk+backports is trunk with patches for the pool_sweep, outstanding_work_counter and domain_spawn optimisation patches applied
  • All these devices are on x64
  • Artemis is the goddess of the hunt, and the namesake of my Linux PC. Her twin, Apollo, is the god of a myriad of things, and the namesake of my old laptop. Herse, or Ersa, is the personification of dew, and the namesake of my work laptop. Summer is a season and a given name. It is also the hostname of a CI server.

Sandmark Nightly

See the latest 5.2.0+trunk+eutro+barriers on https://sandmark.tarides.com/.

@nojb
Contributor

nojb commented Sep 18, 2023

Meta-comment: thanks a lot for the very useful description of the patch, I learned several things I didn't know from it!

@kayceesrk
Contributor

A quick comment that the URLs in sandmark are permalinks. For example, here is this PR against trunk on sequential and parallel benchmarks.

@eutro
Contributor Author

eutro commented Sep 22, 2023

I'm taking the opportunity to squash some of these commits, given that I need to fix conflicts anyway.

caml_stat_space_overhead.index == BUFFER_SIZE) {
struct buf_list_t *l =
(struct buf_list_t*)
caml_stat_alloc_noexc(sizeof(struct buf_list_t));

Shouldn't the result of caml_stat_alloc_noexc be checked against NULL here? (I know this is not new code but code moved from cycle_all_domains_callback)

Contributor

I think so. We should probably create an independent issue.

Member

Done in #13212

@damiendoligez damiendoligez self-assigned this Oct 4, 2023
@gasche
Member

gasche commented Oct 4, 2023

Who would be available to do a full review of this PR?

( @dustanddreams, do I understand that you started to review this, and are planning to eventually do a full review? )

@ghost

ghost commented Oct 4, 2023

( @dustanddreams, do I understand that you started to review this, and are planning to eventually do a full review? )

You understand correctly.

@ghost

ghost commented Oct 5, 2023

Here is my review, sorry for taking so long.

This is a commit-by-commit review.

  • The first two commits, "Add -do-not-run option to ocamltest" and "Execute individual tests with make one LIST=", are orthogonal to the intent of this PR, and should get removed (this PR still builds and performs well without them), and possibly defended in another PR.

  • Implement caml_plat_futex and caml_plat_barrier

1. This commit looks good. I am however wondering why you are adding an explicit check for the existence of <linux/futex.h>, while other platforms will use the futex syscall flavour without a particular check. Is this because this header is not available with musl libc? If so, this should probably be documented somewhere so that someone™ eventually works on overcoming this limitation.

    2. In runtime/domain.c, the BARRIER_SENSE_BIT #define has been moved to runtime/caml/platform.h, but without its preceding comment, which is now orphaned on line 1297. Consider moving it to runtime/caml/platform.h or removing it.

  • Use caml_plat_(futex|barrier) for STW synchronisation

    1. All over this commit, there are many atomic_load_acquire calls whose return values are implicitly used as booleans, i.e. if (atomic_load_acquire(..)) or if (!atomic_load_acquire(...)), yet these values are now one of the Barrier_xxx defines in runtime/caml/platform.h.
      All these constructs should be modified to use explicit comparisons against Barrier_released rather than implicit comparisons against zero.

    2. The same applies to the CAMLassert(!...interrupt_pending.value) in domain_create. And the caml_plat_futex_init call in this function should also use Barrier_released instead of 0 as its argument.

  • Optimise global barrier at call sites

    1. There are two parts in this commit: one part builds upon the previous commits to reduce contention, and another part replaces atomic operations with non-atomic ones where atomicity is not required, with the introduction of nonatomic_increment_counter and the use of atomic_load_relaxed instead of atomic_load in caml_empty_minor_heaps_once.
      I am not convinced that the changes to caml_empty_minor_heaps_once are correct here, and I am also not convinced the introduction of nonatomic_increment_counter is worth the decreased readability. Have you been able to notice a real performance improvement with these changes?

    2. In runtime/runtime_events.c, did you move the body of stw_teardown_runtime_events down because you wanted it to be placed after the create/teardown STWs comment block? If there is no particular reason for this move, I'd suggest moving it back to its initial location (to help reviewing). If there is, then the prototype should be moved closer to the top of the file with all the other static function prototypes, before caml_runtime_events_init.

  • Make first iteration of pool_sweep unconditional

    1. The change of the while() ... do { } loop into do { ... } while() is wrong. I mean, the intent is correct, but you need to update the condition. It needs to be do { ... } while (p <= end) now to match the original behaviour.

    2. Also, in the same pool_sweep routine, I do not trust the changes to the local work variable. Even if the result might end up being the same, the original computation, adding each loop's chunk, is in my opinion easier to follow.

  • Unify work_counter and alloc_counter

    Looking good, nothing bad to say about it (-:

  • Have domain_create prevent new STW sections

    1. In domain_create, you are adding a tight loop on line 578:
        while (atomic_load_acquire(&stw_leader));
    

    I understand that this loop is supposed to exit quickly, but shouldn't there be some form of breathing/relaxing code in case things got really bad?

    1. The addition of the check_for_stw_leader label could be replaced with a proper infinite loop. Instead of:
     /* see if there is a stw_leader already */
    check_for_stw_leader:
     if (atomic_load_acquire(&stw_leader)) {
       caml_plat_unlock(&all_domains_lock);
       caml_handle_incoming_interrupts();
       return 0;
     }
    
     /* STW requests may be suspended by [domain_create], in which case,
        instead of taking the stw_leader, we should release the lock and
       wait for requests to be unsuspended before trying again */
     if (CAMLunlikely(stw_requests_suspended)) {
       caml_plat_wait(&requests_suspended_cond);
       goto check_for_stw_leader;
     }
    

    you could write:

     /* see if there is a stw_leader already */
     while (1) {
       if (atomic_load_acquire(&stw_leader)) {
         caml_plat_unlock(&all_domains_lock);
         caml_handle_incoming_interrupts();
         return 0;
       }
    
       /* STW requests may be suspended by [domain_create], in which case,
          instead of taking the stw_leader, we should release the lock and
          wait for requests to be unsuspended before trying again */
       if (!CAMLunlikely(stw_requests_suspended))
         break;
    
       caml_plat_wait(&requests_suspended_cond);
     }
    

Another comment which does not need to be addressed in this PR, but can be discussed here: there are many constructs similar to foo + atomic_fetch_add(&bar, foo) to perform an atomic add and use the new value. With gcc and clang pre-C11 atomic routines, this is available as __atomic_add_and_fetch(&bar, foo) and this might be worth adding a macro to camlatomic.h to either use the gcc/clang routine when available, or declare an inline routine with the current expression when not. But if there is a strong will to keep camlatomic.h short and bloat-free, then feel free to discard this comment.
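Such a wrapper could be sketched as follows (hypothetical name, not proposed verbatim in the PR; note that v is evaluated twice, so it should not have side effects):

```c
#include <stdatomic.h>

/* Portable add-and-fetch on top of C11 atomics: returns the value
   *after* the addition. With GCC/Clang, __atomic_add_fetch could be
   used instead when available. [v] is evaluated twice. */
#define atomic_add_fetch(p, v) (atomic_fetch_add((p), (v)) + (v))
```

This expands to exactly the foo + atomic_fetch_add(&bar, foo) pattern the comment describes, just behind a readable name.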

@eutro
Contributor Author

eutro commented Oct 5, 2023

Thank you very much for the review.

  • Implement caml_plat_futex and caml_plat_barrier

    1. This commit looks good. I am however wondering why you are adding an explicit check for the existence of <linux/futex.h>, while other platforms will use the futex syscall flavour without a particular check. Is this because this header is not available with musl libc? If so, this should probably be documented somewhere so that someone™ eventually works on overcoming this limitation.

The availability of the header has nothing to do with libc; it is just in a separate package on various Linux distros (Alpine, Debian, Arch, etc.). Given that the headers simply may or may not be available, the check was added to prevent their absence from being a hard error, which it was on the Alpine Linux machine in Inria CI. I am not aware of system headers ever being unavailable on BSDs, but I did briefly check whether the pre-futex BSD versions are still supported, which I believe they are not.

  • ...
    1. ...
    2. In runtime/domain.c, the BARRIER_SENSE_BIT #define has been moved to runtime/caml/platform.h, but without its preceding comment, which is now orphaned on line 1297. Consider moving it to runtime/caml/platform.h or removing it.

It would probably be reasonable to remove it.

  • Use caml_plat_(futex|barrier) for STW synchronisation
    1. All over this commit, there are many atomic_load_acquire calls whose return values are implicitly used as booleans, i.e. if (atomic_load_acquire(..)) or if (!atomic_load_acquire(...)), yet these values are now one of the Barrier_xxx defines in runtime/caml/platform.h.
      All these constructs should be modified to use explicit comparisons against Barrier_released rather than implicit comparisons against zero.
    2. The same applies to the CAMLassert(!...interrupt_pending.value) in domain_create. And the caml_plat_futex_init call in this function should also use Barrier_released instead of 0 as its argument.

This was deliberate, the futex is meant to (continue to) be usable as a boolean indicating whether it can/should be waited on, but it may be reasonable to change the zero checks to explicitly compare against Barrier_released.

  • Optimise global barrier at call sites
    1. There are two parts in this commit: one part builds upon the previous commits to reduce contention, and another part replaces atomic operations with non-atomic ones where atomicity is not required, with the introduction of nonatomic_increment_counter and the use of atomic_load_relaxed instead of atomic_load in caml_empty_minor_heaps_once.
      I am not convinced that the changes to caml_empty_minor_heaps_once are correct here, and I am also not convinced the introduction of nonatomic_increment_counter is worth the decreased readability. Have you been able to notice a real performance improvement with these changes?

For the caml_empty_minor_heaps_once change, I'm fairly confident that it is correct, perhaps I can convince you? The variable caml_minor_cycles_started is only written within the STW section, by one thread, and is protected and fenced by the STW and minor GC barrier (in fact, I believe it doesn't even need to be atomic at all).

Both increments I made non-atomic did show up in perf, and the change did yield a substantial improvement in benchmarks.

  • ...
    1. ...
    2. In runtime/runtime_events.c, did you move the body of stw_teardown_runtime_events down because you wanted it to be placed after the create/teardown STWs comment block? If there is no particular reason for this move, I'd suggest moving it back to its initial location (to help reviewing). If there is, then the prototype should be moved closer to the top of the file with all the other static function prototypes, before caml_runtime_events_init.

That is why I moved it, yes. I'll move the prototype too then.

  • Make first iteration of pool_sweep unconditional
    1. The change of the while() ... do { } loop into do { ... } while() is wrong. I mean, the intent is correct, but you need to update the condition. It needs to be do { ... } while (p <= end) now to match the original behaviour.

I don't believe that's true. Note that the original code checks p + wh after incrementing it as well, and that p == end (which according to CAMLasserts in other functions is supposed to be when the function breaks) would be out of bounds.

while (foo()) {
 bar();
}

is equivalent to

if (foo()) do {
  bar();
} while (foo());

which is an optimisation the compiler did anyway, to reduce the number of jumps; I merely eliminated the if that the compiler didn't manage to.

  • ...
    1. ...
    2. Also, in the same pool_sweep routine, I do not trust the changes to the local work variable. Even if the result might end up being the same, the original computation, adding each loop's chunk, is in my opinion easier to follow.

The loop gets very hot under certain workloads; I'd vehemently argue that it's worth sacrificing a tiny bit of readability (perhaps with a kind, soothing comment in the code) to hoist an unnecessary increment from a hot loop; indeed that's why I included the commit at all. It may be worth adding a CAMLassert(p == end); to this function as well to make the invariant this relies on for correctness explicit.

  • Unify work_counter and alloc_counter
    Looking good, nothing bad to say about it (-:

Thank you ❤️

  • Have domain_create prevent new STW sections
    1. In domain_create, you are adding a tight loop on line 578:
while (atomic_load_acquire(&stw_leader));

I understand that this loop is supposed to exit quickly, but shouldn't there be some form of breathing/relaxing code in case things got really bad?

It is not a tight loop:

/* Wait for the current STW to end */
do caml_plat_wait(&all_domains_cond);
while (atomic_load_acquire(&stw_leader));

Perhaps the braceless do ; while(); is too evil? I'll happily add braces.

  • ...
    1. ...
    2. The addition of the check_for_stw_leader label could be replaced with a proper infinite loop. Instead of: ... you could write: ...

That's reasonable.


At a quick look, I can't find any architecture where the add_fetch intrinsic generates better code than fetch_add on Compiler Explorer (https://godbolt.org/z/c1oj9jTTv); GCC at least is smart enough (on arm32 for example) not to redo the add anyway if the value is at hand. The advantage would be readability, but using the intrinsics provides no real benefit.

@ghost

ghost commented Oct 6, 2023

I am not aware if system headers are ever unavailable on BSDs, but I did briefly check whether the pre-futex BSD versions are still supported, which I believe they are not.

All the BSD systems have had futexes for more than 5 years already, so this should not be an issue.

...

This was deliberate, the futex is meant to (continue to) be usable as a boolean indicating whether it can/should be waited on, but it may be reasonable to change the zero checks to explicitly compare against Barrier_released.

Yes, please.

...

 I am not convinced that the changes to `caml_empty_minor_heaps_once` are correct here, and I am also not convinced the introduction of `nonatomic_increment_counter` is worth the decreased readability. Have you been able to notice a real performance improvement with these changes?

For the caml_empty_minor_heaps_once change, I'm fairly confident that it is correct, perhaps I can convince you? The variable caml_minor_cycles_started is only written within the STW section, by one thread, and is protected and fenced by the STW and minor GC barrier (in fact, I believe it doesn't even need to be atomic at all).

Both increments I made non-atomic did show up in perf, and the change did yield a substantial improvement in benchmarks.

Ok, so it's worth keeping. Maybe replace nonatomic_increment_counter with nonatomic_fetch_add? That would make its operation much more obvious.

...

  • Make first iteration of pool_sweep unconditional

    1. The change of the while() ... do { } loop into do { ... } while() is wrong. I mean, the intent is correct, but you need to update the condition. It needs to be do { ... } while (p <= end) now to match the original behaviour.

I don't believe that's true. Note that the original code checks p + wh after incrementing it as well, and that p == end (which according to CAMLasserts in other functions is supposed to be when the function breaks) would be out of bounds.

You're right, I was misled when I wrote that comment, sorry, please ignore it.

...

  1. Also, in the same pool_sweep routine, I do not trust the changes to the local work variable. Even if the result might end up being the same, the original computation, adding each loop's chunk, is in my opinion easier to follow.

The loop gets very hot under certain workloads; I'd vehemently argue that it's worth sacrificing a tiny bit of readability (perhaps with a kind, soothing comment in the code) to hoist an unnecessary increment from a hot loop; indeed that's why I included the commit at all. It may be worth adding a CAMLassert(p == end); to this function as well to make the invariant this relies on for correctness explicit.

Adding an assert would be a good tradeoff if you keep these changes, indeed.

...

It is not a tight loop:

/* Wait for the current STW to end */
do caml_plat_wait(&all_domains_cond);
while (atomic_load_acquire(&stw_leader));

Perhaps the braceless do ; while(); is too evil? I'll happily add braces.

Doh! Missed the do. Yes, braces would help here.

Contributor

@NickBarnes left a comment

This is a first raft of comments, almost all minor stylistic things on the first big futex/barrier/spinlock commit. I'm going to move on to the other commits. Overall this is a big improvement and I'm looking forward to getting it merged. Let me know if I can help move it along.

very short critical section we are waiting on */
#define SPIN_WAIT SPIN_WAIT_BACK_OFF(Max_spins_long)

struct caml_plat_srcloc {
Contributor

I think this belongs in misc.h, maybe somewhere close to CAMLassert.

Contributor Author

I don't know if it makes much sense in this PR, unless we also migrate, say, caml_failed_assert to use it. I added the type to reduce the code size of the SPIN_WAIT loop, but it could be beneficial elsewhere too.

#ifndef CAML_PLAT_FUTEX_FALLBACK
# if defined(_WIN32)
# include <synchapi.h>
# define CAML_PLAT_FUTEX_WAIT(ftx, undesired) \
Contributor

Naming conventions aren't wholly clear to me, but I think these macros, being local to this file, probably don't need CAML_PLAT_.

Contributor Author

I've realised that FUTEX_WAIT and FUTEX_WAKE are defined as macro constants by the futex headers themselves, so it's probably best to keep the CAML_PLAT_.

@@ -225,15 +403,14 @@ void caml_mem_unmap(void* mem, uintnat size)
#define Max_sleep_ns 1000000000 // 1 s
Contributor

I dislike the punning between spin counts (spins, etc.) and nanoseconds (Max_sleep_ns, etc.)

Contributor Author

I'm not sure what there is to do; it's an unfortunate part of the original spin-wait code, and not one I particularly want to touch. Perhaps it could be commented, though.

@@ -225,15 +403,14 @@ void caml_mem_unmap(void* mem, uintnat size)
#define Max_sleep_ns 1000000000 // 1 s

unsigned caml_plat_spin_wait(unsigned spins,
Contributor

This function interface is potentially confusing. It doesn't spin, it just waits, and it does so for a number of nanoseconds, not a number of spins. Maybe it should be called caml_wait_for_ns or something. If the argument is called spins then maybe it should divide by a constant (which could be 1) called something like SPINS_PER_NS (or multiply by NS_PER_SPIN, I guess), with a comment explaining what's going on there. All just for clarity, of course - the compiled code will stay the same.

Contributor

@NickBarnes left a comment

Another great commit. Almost all of my comments are stylistic. Some of the header macros could be tidied up, some of the names for things could be improved, more comments are always good, and if we have names for values (Barrier_*) then we should use them (rather than relying on Barrier_released == 0).

void caml_do_opportunistic_major_slice
(caml_domain_state* domain_unused, void* unused)

int caml_do_opportunistic_major_slice
Contributor

I think this might be better returning a bool. Certainly it should have a comment documenting the returned value (which I think is "Did we do some work?").

Member

Unfortunately, the runtime code is not consistent in its use of bool versus int. This should be the topic for some discussion among maintainers, then another PR.

I do agree it would be better to add a comment.

Contributor

@NickBarnes left a comment

This is the last of my reviews on this PR for now, having got to the end of the commits. My main comment on this one is that line-length is important; please stick to 80 columns.

Overall, this PR is excellent: a significant improvement in the runtime (in the GC especially), and a step forwards in both clarity and performance. My raft of small nitpicks are meant as polish: bouquets rather than brickbats.

Comment on lines +1380 to +1434
if (!stw_request.enter_spin_callback
(domain, stw_request.enter_spin_data)) {
Contributor

If I understand correctly, the point of this callback now returning a boolean is to stop calling it if there is no more work to do. So it essentially avoids unnecessary function calls during the bounded spinning. Is that correct?

Contributor Author

Yes, practically. This allows for breaking to a "fast path" that is only concerned with synchronising as efficiently as possible (and is made better use of by the minor GC barrier).

Contributor

Thanks.

Comment on lines 399 to 405
/* create/teardown STWs

The STW API does have an enter barrier before the handler code is
run, however, the enter barrier itself calls the runtime events API
after arrival. Thus, the barrier in the STWs below is needed both
to ensure that all domains have actually reached the handler before
we start/stop (and are not calling the runtime events API from the
STW code), and of course to ensure that the setup/teardown is
observed by all domains returning from the STW. */

Contributor

I think it’s great to have such explanation comments, but, maybe because I’m not familiar with runtime events, I’m having a hard time understanding this one. By “runtime events API”, do we mean e.g. CAML_EV_BEGIN? If so, this function is called before arriving to the barrier, e.g. at 8d2ffa5#diff-67115925103982a8ebeb085cfab5ef31a182c9a442bc51e053934364d3750dafR1400, and I don’t understand how the barrier in the function below helps with that.

If I’m misunderstanding, please, let me know.

Contributor Author

The point is to avoid a race there. Calling the STW API before the enter barrier is fine because it's either properly initialised or not, but calling it after can race with the actual setup. I'll try and clarify.

Contributor

I understand better now and find the comment clearer, thanks.

stw_teardown_runtime_events(caml_domain_state *domain_state,
void *remove_file_data, int num_participating,
caml_domain_state **participating_domains) {
Caml_global_barrier_if_final(num_participating) {
Contributor

I have the same difficulty to understand this barrier as the one in stw_create_runtime_events above.

Contributor

Resolved (see related comment thread).

@OlivierNicole
Contributor

I have reviewed the changes. I agree with a number of comments made by @dustanddreams and @NickBarnes so I did not repeat them. Overall, I find the code very well-written and do not have major concerns. There is simply one thing that I do not understand about the protection of runtime events buffer creation and teardown, but to be fair it predates this PR (however, the PR has the nicety of adding an explanatory comment, so I’m asking). I left it as a code comment, along with a few other minor comments and questions.

Comment on lines 166 to 199
* The lifecycle is as follows:
*
* Reset (1 thread) (other threads)
* | |
* +----------+------------+
* |
* Arrive (all threads)
* |
* | check arrival number
* +----------+------------+
* | |
* Check or Block Release
* (non-final threads) (final thread)
* | |
*/
Contributor

I think we should make extremely clear here that Check-ing or Block-ing on a barrier without prior calling arrive is a programming error and will not check or block anything. And repeat the warning in a comment preceding each of the Check and Block functions.

Contributor Author

I really don't think that's necessary. It would theoretically be perfectly fine to call Check or even Block by another thread that was merely monitoring and wasn't expected to arrive at all, and it should be evident that Block-ing without Arrive-ing first, where expected, would cause a guaranteed deadlock.

@gasche
Member

gasche commented Oct 23, 2023

Thanks @dustanddreams, @NickBarnes and @OlivierNicole; I am happy with the review comment stream so far, and I intend to approve on your behalf once your comments have been taken into account by @eutro

@eutro
Contributor Author

eutro commented Nov 1, 2023

What I'm going to do is push a commit addressing all the comments first; then I will rebase to fix conflicts and to fold the changes into the previous commits. I may squash the commits further at some later point.

@eutro
Contributor Author

eutro commented Nov 1, 2023

I believe I've marked as resolved all the review comments that I've affirmatively addressed or that are no longer relevant given the new changes. See the latest commit for details, but notably:

  • I've introduced caml_plat_binary_latch and associated set/release/wait/is_set/is_released functions for use instead of caml_plat_futex with just barrier_raw_(wait/release) and manual reads/writes. All comparisons that were (or should have been) against Barrier_released are now done with caml_plat_latch_is_(set/released).
  • I've made all preprocessor uses of CAML_PLAT_FUTEX_FALLBACK use #ifdef for consistency, rather than making them all #if defined as was directly suggested

@ghost

ghost commented Nov 2, 2023

This looks very good now. Can you rebase again, dropping the first two commits which are not really part of this STW work, and add a Changes entry?

@eutro
Contributor Author

eutro commented Jan 12, 2024

The last commit is new: it does away with caml_wait_interrupt_serviced, and the leader now arrives into the barrier the same way as the other domains do: diff

@eutro eutro force-pushed the tsdnr-barriers branch 3 times, most recently from c13ff7a to cdc5cc2 Compare April 16, 2024 13:22
Member

@damiendoligez left a comment

I've carefully read the code and I have only a few remarks. The quality of the code is very good, and I have no doubts we will be able to maintain it in the future.

Do we want to add this much code just to get rid of semi-busy waiting in the runtime? My answer is definitely yes, and this is why: we could go on with spin loops, cpu_relax, and sleep, and decide that it works well enough (although it doesn't on Windows), but that puts us at the mercy of the scheduling decisions made by the C library, the OS, and even the hardware, which can and will change in the future (and there are surprises; see #13128 for an example).

So in the end it's a question of quality of implementation, and I think we should do the right thing.


@eutro
Contributor Author

eutro commented May 12, 2024

Rebasing to fix conflicts, mainly with #13063 (all Mutex/CV uses in this PR are within STW sections, or are used as a fallback implementation for the otherwise fully blocking futex, so they became lock_blocking).

@eutro eutro force-pushed the tsdnr-barriers branch from 763a031 to afc6e58 Compare May 12, 2024 14:46
@MisterDA
Contributor

MisterDA commented May 28, 2024

This is a bit orthogonal to this PR, but the macros can be written more elegantly using the __COUNTER__ macro, which expands to sequential integral values starting from 0; and delayed macro expansion. Taking the definition of __UNIQUE_ID from the Linux kernel, this gives MisterDA@a74df39.
If including this patch in this current form is too much of a burden for this PR, I'll submit it separately once the PR is merged.

@damiendoligez
Member

@MisterDA Thanks for the suggestion, but indeed I think it should be a separate PR.

Member

@damiendoligez left a comment

Approving again, now that my suggestions have been taken into account. This PR is good to merge as soon as it gets rebased and CI-checked.

eutro added 11 commits June 4, 2024 20:26
- Comment on the return value of `enter_spin_callback` parameter
- Reduce the public API of the `global_barrier` to only those required for the `Caml_maybe_global_barrier` and `Caml_global_barrier_if_final`
- Add more comments for `global_barrier` API including an example for `Caml_global_barrier_if_final`
- Add `do {`/`} while(0)` around `Caml_maybe_global_barrier`
- Clean up `Caml_global_barrier_if_final` to be only a single macro
- Move `GENSYM` from `platform.h` to `misc.h` as `CAML_GENSYM`, make it prefix with `caml__` too
- Add further comments for futexes
- Replace all `#if defined(CAML_PLAT_FUTEX_FALLBACK)` and similar with `#ifdef` for consistency
- Make `caml_plat_futex_(init|free)` not be inline, so they can be declared together regardless of fallback usage
- Introduce binary latches with `caml_plat_binary_latch` and add associated functions, to replace `caml_plat_barrier_raw_(wait|release)`
- Expand on barrier comments
- Add overview comment for `SPIN_WAIT_*` macros, clarify uses for `Max_spins_*` constants
- Rename `caml_plat_spin_wait` function to `caml_plat_spin_back_off` instead, and rename its `spins` parameter to `sleep_ns`
- Replace uses of the old `caml_plat_barrier_raw_(wait|release)` with `caml_plat_latch_*` functions, removing (implicit and explicit) uses of the `Barrier_*` constants
- Add braces for `do ; while(...);` loop that was mistaken for just a `while(...);` loop
- Replace `check_for_stw_leader` label with a `while (1)` loop
- Rename `domains_finished_minor_gc` to `minor_gc_end_barrier`
- Move `stw_teardown_runtime_events` declaration
- Clarify that the runtime events STW sections' barrier avoids a race specifically
- Add a clarifying comment and debug assertion for the `work = end - p` hoisting in `pool_sweep`
- Increase some spins, to closer match existing Linux behaviour
- Replace leader-released "STW API barrier" with one released by the
  last arriving domain - also abstract `interrupt_pending` checks
- Add `Latch_released` and `CAML_PLAT_BARRIER_INITIALIZER` comments
- Replace `Caml_maybe_global_barrier` macro with `caml_global_barrier`
  inline function, the old `caml_global_barrier` function defined in
  `domain.c` now renamed to `caml_enter_global_barrier`.

- Change name of `caml_global_barrier_wait_unless_final` to
  `caml_global_barrier_and_check_final`, also clarify the comment.

- Clarify in `Caml_global_barrier_if_final` comment that other threads
  will block until the block finishes executing, and how the block
  should exit normally.

- Clarify `caml_plat_barrier` comment on races.

- Typo fixes, `indiciates` -> `indicates`, `conditon` -> `condition`

- Justify `int` return (instead of `bool`) in comment on
  `caml_do_opportunistic_major_slice`.

- Add `TODO` and clarify untested BSD (NetBSD/DragonFly) comments in
  `platform.h` and `platform.c`
@eutro eutro force-pushed the tsdnr-barriers branch from 8f6778d to ed5c811 Compare June 4, 2024 19:31
@gasche gasche added the merge-me label Jun 4, 2024
@gasche gasche merged commit 51491dd into ocaml:trunk Jun 4, 2024
@gasche
Member

gasche commented Jun 4, 2024

Merged! Thanks @eutro for the impressive work, and @dustanddreams, @NickBarnes, @OlivierNicole and @damiendoligez for your reviews.

@kayceesrk kayceesrk added the Performance PR or issues affecting runtime performance of the compiled programs label Oct 16, 2024

Labels

merge-me Performance PR or issues affecting runtime performance of the compiled programs runtime-system

8 participants