Fix spurious major GC slices. #13086
This could/should probably be a new flag for I convinced myself that the code says what it does, but how do we know that what it does is reasonable? (In particular, are there regressions on other workloads?)
I disagree. This is a change to GC policy:
IIRC, this change narrows an earlier fix that made too wide a change: the
2f8b342 to
a2b5834
a2b5834 to
9bf1ec5
When tracking a Coq performance regression in OCaml 5 (ocaml#13300), we realized that the GC work-computation heuristics are bad for ramp-up phases of programs that do a lot of unmarshalling, such as the loading of Coq .vo files implied by its `Require Import` directives.

The problem is that the GC assumes a steady state, where it should work to release approximately the same amount of memory that was allocated since the beginning of the cycle. When allocating a large amount of long-lived memory, this assumption results in excessive marking work (traversing the heap to find free memory).

The fix introduces a sub-category of `major_allocated_words` (the number of words allocated into the major heap), called `major_allocated_words_longlived`, which counts the major-heap memory that we expect to live until the end of time, or until the next request for a full major GC, and which is not taken into account when deciding marking work.

The only place where this new "longlived" category is used in this commit is the unmarshalling code, and only in the case where the post-unmarshalling data is larger than `Max_young_wosize` (it cannot be allocated as a single block in the minor heap), which is currently 2Mio on a 64-bit system.

This heuristic could be a problem for OCaml programs that allocate a lot of short-lived memory through unmarshalling, in packets of more than 2Mio each. Those programs could suffer from vastly increased memory consumption.
Performance numbers for this change, with the Coq default GC settings:

```
Summary
  coqc.5.2+backport+change ran
    1.22 ± 0.02 times faster than coqc.5.2+backport
    1.29 ± 0.10 times faster than coqc.5.2
```

with the OCaml default GC settings:

```
Summary
  coqc.5.2+backport+change ran
    1.30 ± 0.03 times faster than coqc.5.2+backport
    1.41 ± 0.11 times faster than coqc.5.2
```

In these numbers:

- coqc.5.2 is with the stock 5.2 runtime
- coqc.5.2+backport has a backport of ocaml#13086, which changed the GC pacing slightly
- coqc.5.2+backport+change is the described change, on top of the backport
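The counter split described in the commit message above can be pictured with a small toy model. This is only a sketch, not the runtime's code: the field names `major_allocated_words` and `major_allocated_words_longlived` come from the commit message, while the struct, the `toy_*` helpers, and the idea of passing the size threshold as a parameter are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: this struct and the toy_* helpers are hypothetical,
   not part of the OCaml runtime. */
typedef struct {
  size_t major_allocated_words;           /* all words allocated in the major heap */
  size_t major_allocated_words_longlived; /* subset expected to stay alive */
} toy_alloc_stats;

/* Record a major-heap allocation; long-lived words are counted in both fields,
   so the long-lived counter is always a sub-category of the total. */
static void toy_record_major_alloc(toy_alloc_stats *s, size_t words,
                                   bool longlived) {
  assert(s != NULL);
  s->major_allocated_words += words;
  if (longlived) s->major_allocated_words_longlived += words;
}

/* Words that would drive marking work in this model:
   everything except the long-lived part. */
static size_t toy_words_driving_marking(const toy_alloc_stats *s) {
  assert(s != NULL);
  return s->major_allocated_words - s->major_allocated_words_longlived;
}

/* Unmarshalled data counts as long-lived only when it is too large
   for the minor heap (threshold supplied by the caller). */
static bool toy_unmarshal_is_longlived(size_t wosize, size_t max_young_wosize) {
  return wosize > max_young_wosize;
}
```

In this model, a large unmarshalled value still increases the total allocation count but leaves the quantity that drives marking unchanged, which is the behavior the commit message aims for during ramp-up phases.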
Reported to me by @stedolan.
Because of commit e6370d5 (memory.c) and PR #11750 we have some spurious major GC slices when a minor GC promotes more than 20% of the minor heap and a large allocation comes along before the next scheduled major slice (the one that happens when the minor heap is half full).
This is because the `allocated_words` counter will keep the amount of memory promoted by the minor GC until the next major slice takes it into account.

The consequence is that the large allocation will trigger a major slice on the spot, then the next scheduled major slice has very little work to do.
The solution is to have a separate count for direct major allocations, and use it (instead of all major allocations) to trigger unscheduled major slices.
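To make the difference between the two triggers concrete, here is a minimal toy model in C. It is a sketch under stated assumptions, not the patch itself: `allocated_words` is the counter named above, but `allocated_words_direct`, the `toy_*` functions, and the exact form of the 20% check are hypothetical and only mirror the description in this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: names other than allocated_words are hypothetical. */
typedef struct {
  size_t allocated_words;        /* promoted words + direct major allocations */
  size_t allocated_words_direct; /* direct major allocations only */
  size_t minor_heap_wsz;         /* minor heap size, in words */
} toy_gc;

/* A minor GC promotes survivors: only the total counter grows. */
static void toy_promote(toy_gc *gc, size_t words) {
  assert(gc != NULL);
  gc->allocated_words += words;
}

/* A direct allocation in the major heap grows both counters. */
static void toy_alloc_major(toy_gc *gc, size_t words) {
  assert(gc != NULL);
  gc->allocated_words += words;
  gc->allocated_words_direct += words;
}

/* Old trigger: promoted words count towards the 20% threshold. */
static bool toy_slice_needed_old(const toy_gc *gc) {
  return gc->allocated_words > gc->minor_heap_wsz / 5;
}

/* New trigger: only direct major allocations count. */
static bool toy_slice_needed_new(const toy_gc *gc) {
  return gc->allocated_words_direct > gc->minor_heap_wsz / 5;
}

/* A major slice takes all pending allocation into account and resets both. */
static void toy_major_slice(toy_gc *gc) {
  assert(gc != NULL);
  gc->allocated_words = 0;
  gc->allocated_words_direct = 0;
}
```

With a 1000-word minor heap, promoting 300 words and then allocating 50 words directly makes the old trigger fire while the new one stays quiet, so the promoted words simply wait for the next scheduled slice.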
The problem is illustrated by the program found here: https://gist.github.com/damiendoligez/4d65d0ade50e6d0b2726e812a0eb7a14
Number of major slices (displayed by `OCAMLRUNPARAM=v=0x40 ./a.out 2>&1 | grep '^allocated_words =' | wc`):

Note that the amount of major GC work is not affected by this problem, but we still incur some overhead for starting the extra slices, and the latency profile is changed (the major slice pauses are closer to the minor GC pauses) so it's still worth fixing.