Prefetching optimisations for sweeping #9934
Conversation
#ifdef CAML_INTERNALS
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define caml_prefetch(p) __builtin_prefetch((p), 1, 3)
I think - but I have not experimented - that the MSVC equivalent is #include <winnt.h> and PreFetchCacheLine((p), PF_NON_TEMPORAL_LEVEL_ALL) (I'm not sure about the constant)
Just curious: why only on x86? __builtin_prefetch exists on all GCC-supported platforms, even though it can be a no-op. And I'm sure ARM and others would benefit too.
Also, I would document a bit:
#define caml_prefetch(p) __builtin_prefetch((p), 1, 3)
/* 1 = intent to write; 3 = all cache levels */
Just curious: why only on x86? __builtin_prefetch exists on all GCC-supported platforms, even though it can be a no-op. And I'm sure ARM and others would benefit too.
I left others out because I don't really know anything about non-x86 memory hierarchies. We can turn it on for ARM if you like, but I can't judge how much / whether it'll help, and I don't have the expertise / time to do any serious benchmarking.
(I'll add the comments)
Incidentally, when writing this I noticed another possible place for optimisation / cleanup, but haven't attempted it: the caml_fl_merge_block function does a certain amount of work to determine whether the most recently found free block is mergeable with the current one (i.e. is it still free, and is it adjacent to the current block?). This information is known to the sweeper, as it processes blocks in order. We could possibly shave some more time off sweeping by changing the interface, and having the sweeper pass the previous block or NULL to caml_fl_merge_block, rather than the latter redetecting it. (This interface is somewhat subtle and difficult to debug, though, so this could be a delicate change)
Sounds interesting. Thanks for looking into this.
The two graphs show the normalized running time of Sandmark benchmarks when #9934 is run with trunk as the baseline. The first one is on an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz; the second is on an AMD EPYC 7702 64-Core Processor.
Thank you for the benchmarking work and the nice graphics! But I don't know what to conclude from these. A 10% speedup on some tests is very nice indeed, but a 6% slowdown on some other tests is a concern. Also, I'm surprised that the effect on performance can be that strong: typically, GC takes 30% of execution time, and sweeping takes less time than marking, so the whole sweeping phase should be 10% or so of total execution time, and improving the sweeping phase can hardly improve the total running time by 10%.
This all looks good to me.
runtime/major_gc.c (outdated)

-      if (caml_gc_sweep_hp < sweep_limit){
-        hp = caml_gc_sweep_hp;
+      if (sweep_hp < limit){
+        caml_prefetch(sweep_hp + 4096);
IIUC this will prefetch the cache line at sweep_hp + 4k. Isn't this likely to conflict with the cache line that contains sweep_hp itself? I know caches are associative, but the associativity is rather low, so what is the probability of evicting sweep_hp itself when we still need it? Would it be hard to benchmark some variation of this number (for example 4032)?
Excellent point, I'll try that.
You have to be careful because

Yeah, that's the tricky subtlety I was referring to. An observation that might help is that allocation cannot occur during sweep_slice, so the synchronisation with the allocator only needs to happen at the start and end of that function, not in the inner loop of sweeping.

What's the status of this PR? The CI failure is a missing Changes entry (and indeed this probably deserves one). Could we move ahead and eventually merge?

Performance evaluation is inconclusive, to me at least. This is supposed to make programs run faster, and it's unclear it does.
Ah, indeed, Damien's approval should probably be interpreted as "approval for correctness". I looked at the graphs again. Several of the speedup results are found (in varying proportions) on both machines, but the most striking slowdown, bdd, is only found on the Intel machine. To me this suggests that the numbers may be partially noisy due to processor-specific code-cache effects. (Here the changes are in the runtime, so we probably cannot use the random-nop-padding approach used in another PR to avoid those.) Looking at the change, it is of course possible that the addition of one prefetching instruction would result in wide variation, but there is a more invasive refactoring in
@stedolan says it's a prelude to another PR - is it also a prerequisite?

I would like to raise a concern I have with using the Sandmark benchmark suite for assessing the performance of changes to the GC: why do we think this is a good and representative set of benchmarks for these kinds of changes? I've seen lots of great work from OCamlLabs on how to get reliable numbers out of these benchmarks, but I have yet to see any assessment of the quality of this particular set of benchmarks. For example, most of these benchmarks seem to do very few major cycles and allocate very few major words. That does not resemble most of the real programs that I deal with. It would also be good to see some analysis of the noise in these benchmarks, both between runs and between changes to code layout.

Parroting the usual answer from Sandmark maintainers, and I think they have a point: if you think that a particular workflow is missing from their benchmark suite, you should probably contribute a benchmark to the suite.

That doesn't help with assessing the quality of the benchmarks that are already in there. If the Sandmark maintainers are going to add benchmark results to other people's PRs then they need to provide some context as to why they think the numbers are relevant.

The Sandmark numbers are better than nothing. (Many performance-oriented PRs came with no benchmarking whatsoever until recently.) But the numbers need to be interpreted! As I wrote earlier, the sweep phase is at most 10% of the total running time of the program, so variations of -10% / +6% in total running time don't just come from the sweep phase.
I agree that it may not have been the best idea to run Sandmark on a PR not submitted by Multicore OCaml folks, especially when it remains difficult for the wider community to easily run the benchmarks on their end. But I am hoping that the process gets easier. We will refrain from running Sandmark on PRs not related to multicore.

That said, I wanted to bring the performance questions to a conclusion. I suspect that the original numbers were run on this PR and trunk at that point in time, which may have included unrelated commits. So I reran Sandmark on this PR (commit b419956) and the commit that this PR is based on (7d9e60d). The commit history is here: https://github.com/stedolan/ocaml/commits/sweep-optimisation.

The normalized running time graph is here: The baseline is 7d9e60d. The graph shows the performance impact of this PR against the baseline. Lower is better. The numbers in parentheses are the running time in seconds for the baseline version. Overall, there is a positive improvement. I analysed the outliers in detail.

bdd
$ perf stat ./_build/4.12.0+7d9e60d_1/benchmarks/bdd/bdd.exe 26 # BASELINE
Performance counter stats for './_build/4.12.0+7d9e60d_1/benchmarks/bdd/bdd.exe 26':
5169.389420 task-clock (msec) # 1.000 CPUs utilized
9 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
4,728 page-faults # 0.915 K/sec
11,345,902,093 cycles # 2.195 GHz
22,491,229,020 instructions # 1.98 insn per cycle
4,952,482,879 branches # 958.040 M/sec
58,793,810 branch-misses # 1.19% of all branches
5.169854280 seconds time elapsed
$ perf stat ./_build/4.12.0+b419956_1/benchmarks/bdd/bdd.exe 26 # THIS PR
Performance counter stats for './_build/4.12.0+b419956_1/benchmarks/bdd/bdd.exe 26':
5470.182940 task-clock (msec) # 1.000 CPUs utilized
10 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
4,727 page-faults # 0.864 K/sec
12,006,098,103 cycles # 2.195 GHz
22,487,642,055 instructions # 1.87 insn per cycle
4,952,307,655 branches # 905.328 M/sec
78,000,231 branch-misses # 1.58% of all branches
5.470739910 seconds time elapsed

There are more branch misses in the PR. However, the slowdown cannot be explained just by the changes introduced.

revcomp2

On the other side, we see a 6.8% improvement on revcomp2. With trunk, sweeping takes 10% of the total time:

47.64% revcomp2.exe revcomp2.exe [.] camlDune__exe__Revcomp2__wr_238
10.05% revcomp2.exe revcomp2.exe [.] sweep_slice
7.33% revcomp2.exe revcomp2.exe [.] mark_slice
6.71% revcomp2.exe revcomp2.exe [.] caml_input_scan_line
3.08% revcomp2.exe revcomp2.exe [.] caml_page_table_lookup
2.79% revcomp2.exe revcomp2.exe [.] caml_oldify_one
1.73% revcomp2.exe revcomp2.exe [.] caml_alloc_string
1.55% revcomp2.exe revcomp2.exe [.] caml_alloc_shr_for_minor_gc

With this PR, sweeping takes 5.3% of the total time:

52.14% revcomp2.exe revcomp2.exe [.] camlDune__exe__Revcomp2__wr_238
8.13% revcomp2.exe revcomp2.exe [.] mark_slice
6.85% revcomp2.exe revcomp2.exe [.] caml_input_scan_line
5.30% revcomp2.exe revcomp2.exe [.] sweep_slice
3.28% revcomp2.exe revcomp2.exe [.] caml_page_table_lookup
2.47% revcomp2.exe revcomp2.exe [.] caml_oldify_one
1.93% revcomp2.exe revcomp2.exe [.] caml_alloc_shr_for_minor_gc

Conclusions

I'm doing an experiment to quantify the noise in Sandmark. This is especially fiddly to quantify accurately due to the microarchitectural optimisations on modern processors. See the work in #10039. Given the overall improvement, I am for accepting this PR.
Thank you very much for the analysis KC. That makes things much clearer.
I don't want to dissuade you too much from adding benchmark results to PRs. I think my concern is mostly that dropping a benchmark results graph in a comment, without the much more involved work needed to investigate the results and provide context for people who are not familiar with the nature of the particular benchmarks, is as likely to do harm as it is to do good. When that additional work is done -- as you have very helpfully done in your previous comment -- then the results start to become very useful and greatly appreciated.

Thanks Leo. We'll make sure we provide an interpretation of the numbers and not just the raw results.
(force-pushed from b419956 to fc7b0b8)
I am unable to reproduce the slowdown. I suspect it may be caused by Intel's workaround for their JCC bug. Many recent Intel processors have a serious bug in their decoded instruction cache. Intel's workaround, distributed in a microcode update, is to disable the decoded instruction cache around jump instructions crossing a 32-byte boundary. This has a performance cost, which is usually low but has been observed to cause +/- 20% performance swings, particularly in microbenchmarks. By configuring OCaml as follows, padding bytes can be inserted to ensure that no jumps cross 32-byte boundaries and the workaround never triggers:
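The configure invocation itself was elided above; based on the GNU assembler's mitigation option for this erratum, it presumably looked something like the following (the exact flag spelling and the way it is passed are assumptions, not taken from the PR):

```shell
# Assumed sketch: pass the GAS mitigation flag through the C compiler so
# the assembler pads jumps away from 32-byte boundaries.
./configure CFLAGS="-Wa,-mbranches-within-32B-boundaries"
```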
It might be worth building OCaml like this in future Sandmark runs on Intel processors.

@damiendoligez I played around with the offset number a bit and didn't notice any of the cache aliasing you mentioned. I've left it at 4000 just in case. The performance is not strongly affected by this parameter: it needs to be big enough that the prefetch has time to complete before the data is needed, and small enough that the data hasn't already fallen out of cache by the time sweeping gets there. I saw good results anywhere from 1k to 100k, and I left the parameter close to the bottom end of the range (very large values cause additional cache pollution, by prefetching many kilobytes beyond the end of the region being swept).
@damiendoligez Does your "approved" still stand? (There was discussion since, but as far as I'm concerned this is ready to merge)
Sure. Merging now. I'll make a note to do some benchmarking on other architectures. |
(cherry picked from commit 8a90546)
This PR contains two patches that optimise sweep_slice: a small refactoring that moves some globals to locals, and a use of prefetching. The goal is to reduce cache misses during GC.

Sweeping is a linear traversal of memory, which should already be fast. However, it is not a normal linear traversal: the next pointer is known only once you've loaded the length from the current one, making the algorithm more like a linked-list traversal. This defeats some hardware prefetching mechanisms: the address dependencies mean that the next load is not exposed until the current one returns data (meaning out-of-order execution doesn't help), and the stride is irregular since not all objects are the same size. Stream prefetching does help somewhat by noticing sequential accesses, but (on Intel) doesn't cross 4k page boundaries and doesn't always prefetch data all the way to L1. See the Intel optimisation manual for more details on hardware prefetching. (Currently, this code hasn't been benchmarked on AMD processors, and is a no-op on non-x86 architectures.)
The prefetching in this patch is very straightforward: it prefetches 4k ahead of the sweep pointer.
On a small benchmark, this speeds up sweeping by around 25%. (Sweeping is about a quarter of the runtime of this benchmark, leading to a more modest overall improvement of a few percent).
This is a prelude to a more complicated patch that adds prefetching to marking, where it causes a more dramatic improvement.
(joint work with Will Hasenplaugh)