Prefetching optimisations for sweeping #9934
Conversation
#ifdef CAML_INTERNALS
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define caml_prefetch(p) __builtin_prefetch((p), 1, 3)
I think - but I have not experimented - that the MSVC equivalent is #include <winnt.h> and PreFetchCacheLine((p), PF_NON_TEMPORAL_LEVEL_ALL) (I'm not sure about the constant)
Just curious: why only on x86? __builtin_prefetch exists on all GCC-supported platforms, even though it can be a no-op. And I'm sure ARM and others would benefit too.
Also, I would document a bit:
#define caml_prefetch(p) __builtin_prefetch((p), 1, 3)
/* 1 = intent to write; 3 = all cache levels */
Just curious: why only on x86? __builtin_prefetch exists on all GCC-supported platforms, even though it can be a no-op. And I'm sure ARM and others would benefit too.
I left others out because I don't really know anything about non-x86 memory hierarchies. We can turn it on for ARM if you like, but I can't judge how much / whether it'll help, and I don't have the expertise / time to do any serious benchmarking.
(I'll add the comments)
Incidentally, when writing this I noticed another possible place for optimisation / cleanup, but haven't attempted it: the caml_fl_merge_block function does a certain amount of work to determine whether the most recently found free block is mergeable with the current one (i.e. is it still free, and is it adjacent to the current block?). This information is known to the sweeper, as it processes blocks in order. We could possibly shave some more time off sweeping by changing the interface, and having the sweeper pass the previous block or NULL to caml_fl_merge_block, rather than the latter redetecting it. (This interface is somewhat subtle and difficult to debug, though, so this could be a delicate change)
Sounds interesting. Thanks for looking into this.
The two graphs show the normalized running time of Sandmark benchmarks when #9934 is run with trunk as the baseline. The first one is on an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz; the second is on an AMD EPYC 7702 64-Core Processor.
Thank you for the benchmarking work and the nice graphics! But I don't know what to conclude from these. A 10% speedup on some tests is very nice indeed, but a 6% slowdown on some other tests is a concern. Also, I'm surprised that the effect on performance can be that strong: typically, GC takes 30% of execution time, and sweeping takes less time than marking, so the whole sweeping phase should be 10% or so of total execution time, and improving the sweeping phase can hardly improve the total running time by 10%.
This all looks good to me.
runtime/major_gc.c (outdated)

-      if (caml_gc_sweep_hp < sweep_limit){
-        hp = caml_gc_sweep_hp;
+      if (sweep_hp < limit){
+        caml_prefetch(sweep_hp + 4096);
IIUC this will prefetch the cache line at sweep_hp + 4k. Isn't this likely to conflict with the cache line that contains sweep_hp itself? I know caches are associative, but the associativity is rather low, so what is the probability of evicting sweep_hp itself when we still need it? Would it be hard to benchmark some variation of this number (for example 4032)?
Excellent point, I'll try that.
You have to be careful because

Yeah, that's the tricky subtlety I was referring to. An observation that might help is that allocation cannot occur during sweep_slice, so the synchronisation with the allocator only needs to happen at the start and end of that function, not in the inner loop of sweeping.

What's the status of this PR? The CI failure is a missing Changes entry (and indeed this probably deserves one). Could we move ahead and eventually merge?

Performance evaluation is inconclusive, to me at least. This is supposed to make programs run faster, and it's unclear it does.
Ah, indeed, Damien's approval should probably be interpreted as "approval for correctness". I looked at the graphs again. Several of the speedup results are found (in varying proportions) on both machines, but the most striking slowdown, bdd, is only found on the Intel machine. To me this suggests that the numbers may be partially noisy due to processor-specific code-cache effects. (Here the changes are in the runtime, so we probably cannot use the random-nop-padding approach used in another PR to avoid those.) Looking at the change, it is of course possible that the addition of one prefetching instruction would result in wide variation, but there is a more invasive refactoring in
@stedolan says it's a prelude to another PR - is it also a prerequisite?

I would like to raise a concern I have with using the Sandmark benchmark suite for assessing the performance of changes to the GC: why do we think this is a good and representative set of benchmarks for these kinds of changes? I've seen lots of great work from OCamlLabs on how to get reliable numbers out of these benchmarks, but I have yet to see any assessment of the quality of this particular set of benchmarks. For example, most of these benchmarks seem to do very few major cycles and allocate very few major words. That does not resemble most of the real programs that I deal with. It would also be good to see some analysis of the noise in these benchmarks, both between runs and between changes to code layout.

Parroting the usual answer from Sandmark maintainers, and I think they have a point: if you think that a particular workflow is missing from their benchmark suite, you should probably contribute a benchmark to the suite.

That doesn't help with assessing the quality of the benchmarks that are already in there. If the Sandmark maintainers are going to add benchmark results to other people's PRs then they need to provide some context as to why they think the numbers are relevant.

The Sandmark numbers are better than nothing. (Many performance-oriented PRs came with no benchmarking whatsoever until recently.) But the numbers need to be interpreted! As I wrote earlier, the sweep phase is at most 10% of the total running time of the program, so variations of -10% / +6% in total running time don't just come from the sweep phase.
I agree that it may not have been the best idea to run Sandmark on a PR not submitted by Multicore OCaml folks, especially when it remains difficult for the wider community to easily run the benchmarks on their end. But I am hoping that the process gets easier. We will refrain from running Sandmark on PRs not related to multicore.

That said, I wanted to bring the performance questions to a conclusion. I suspect that the original numbers were run on this PR and trunk at that point in time, which may have included unrelated commits. So I reran Sandmark on this PR (commit b419956) and the commit that this PR is based on (7d9e60d). The commit history is here: https://github.com/stedolan/ocaml/commits/sweep-optimisation.

The normalized running time graph is here: The baseline is 7d9e60d. The graph shows the performance impact of this PR against the baseline. Lower is better. The numbers in parentheses are the running time in seconds for the baseline version. Overall, there is a positive improvement. I analysed the outliers in detail.

bdd
$ perf stat ./_build/4.12.0+7d9e60d_1/benchmarks/bdd/bdd.exe 26 # BASELINE
Performance counter stats for './_build/4.12.0+7d9e60d_1/benchmarks/bdd/bdd.exe 26':
5169.389420 task-clock (msec) # 1.000 CPUs utilized
9 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
4,728 page-faults # 0.915 K/sec
11,345,902,093 cycles # 2.195 GHz
22,491,229,020 instructions # 1.98 insn per cycle
4,952,482,879 branches # 958.040 M/sec
58,793,810 branch-misses # 1.19% of all branches
5.169854280 seconds time elapsed
$ perf stat ./_build/4.12.0+b419956_1/benchmarks/bdd/bdd.exe 26 # THIS PR
Performance counter stats for './_build/4.12.0+b419956_1/benchmarks/bdd/bdd.exe 26':
5470.182940 task-clock (msec) # 1.000 CPUs utilized
10 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
4,727 page-faults # 0.864 K/sec
12,006,098,103 cycles # 2.195 GHz
22,487,642,055 instructions # 1.87 insn per cycle
4,952,307,655 branches # 905.328 M/sec
78,000,231 branch-misses # 1.58% of all branches
5.470739910 seconds time elapsed

There are more branch misses in the PR. However, the slowdown cannot be explained just by the changes introduced.

revcomp2

On the other side, we see a 6.8% improvement on revcomp2. With trunk, sweeping takes 10% of the total time:

47.64% revcomp2.exe revcomp2.exe [.] camlDune__exe__Revcomp2__wr_238
10.05% revcomp2.exe revcomp2.exe [.] sweep_slice
7.33% revcomp2.exe revcomp2.exe [.] mark_slice
6.71% revcomp2.exe revcomp2.exe [.] caml_input_scan_line
3.08% revcomp2.exe revcomp2.exe [.] caml_page_table_lookup
2.79% revcomp2.exe revcomp2.exe [.] caml_oldify_one
1.73% revcomp2.exe revcomp2.exe [.] caml_alloc_string
1.55% revcomp2.exe revcomp2.exe [.] caml_alloc_shr_for_minor_gc

With this PR, sweeping takes 5.3% of the total time:

52.14% revcomp2.exe revcomp2.exe [.] camlDune__exe__Revcomp2__wr_238
8.13% revcomp2.exe revcomp2.exe [.] mark_slice
6.85% revcomp2.exe revcomp2.exe [.] caml_input_scan_line
5.30% revcomp2.exe revcomp2.exe [.] sweep_slice
3.28% revcomp2.exe revcomp2.exe [.] caml_page_table_lookup
2.47% revcomp2.exe revcomp2.exe [.] caml_oldify_one
1.93% revcomp2.exe revcomp2.exe [.] caml_alloc_shr_for_minor_gc

Conclusions

I'm doing an experiment to quantify the noise in Sandmark. This is especially fiddly to quantify accurately due to the microarchitectural optimisations on modern processors. See the work in #10039. Given the overall improvement, I am for accepting this PR.
Thank you very much for the analysis KC. That makes things much clearer.
I don't want to dissuade you too much from adding benchmark results to PRs. I think my concern is mostly that dropping a benchmark results graph in a comment, without the much more involved work needed to investigate the results and provide context for people who are not familiar with the nature of the particular benchmarks, is as likely to do harm as it is to do good. When that additional work is done -- as you have very helpfully done in your previous comment -- then the results start to become very useful and greatly appreciated.

Thanks Leo. We'll make sure we provide an interpretation of the numbers and not just the raw results.
(force-pushed from b419956 to fc7b0b8)
I am unable to reproduce the slowdown. I suspect it may be caused by Intel's workaround for their JCC bug. Many recent Intel processors have a serious bug in their decoded instruction cache. Intel's workaround, distributed in a microcode update, is to disable the decoded instruction cache around jump instructions crossing a 32-byte boundary. This has a performance cost, which is usually low but has been observed to cause +/- 20% performance swings, particularly in microbenchmarks. By configuring OCaml as follows, padding bytes can be inserted to ensure that no jumps cross 32-byte boundaries and the workaround never triggers:
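The configure invocation itself was elided above; based on the GNU assembler's mitigation option for this erratum, it presumably looked something like the following (the exact flag spelling and the way it is passed are assumptions, not taken from the PR):

```shell
# Assumed sketch: pass the GAS mitigation flag through the C compiler so
# the assembler pads jumps away from 32-byte boundaries.
./configure CFLAGS="-Wa,-mbranches-within-32B-boundaries"
```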
It might be worth building OCaml like this in future Sandmark runs on Intel processors.

@damiendoligez I played around with the offset number a bit and didn't notice any of the cache aliasing you mentioned. I've left it at 4000 just in case. The performance is not strongly affected by this parameter: it needs to be big enough that the prefetch has time to complete before the data is needed, and small enough that the data hasn't already fallen out of cache by the time sweeping gets there. I saw good results anywhere from 1k to 100k, and I left the parameter close to the bottom end of the range (very large values cause additional cache pollution, by prefetching many kilobytes beyond the end of the region being swept).
@damiendoligez Does your "approved" still stand? (There was discussion since, but as far as I'm concerned this is ready to merge)
Sure. Merging now. I'll make a note to do some benchmarking on other architectures. |
(cherry picked from commit 8a90546)
This PR contains two patches that optimise sweep_slice: a small refactoring that moves some globals to locals, and a use of prefetching. The goal is to reduce cache misses during GC.

Sweeping is a linear traversal of memory, which should already be fast. However, it is not a normal linear traversal: the next pointer is known only once you've loaded the length from the current one, making the algorithm more like a linked-list traversal. This defeats some hardware prefetching mechanisms: the address dependencies mean that the next load is not exposed until the current one returns data (meaning out-of-order execution doesn't help), and the stride is irregular since not all objects are the same size. Stream prefetching does help somewhat by noticing sequential accesses, but (on Intel) doesn't cross 4k page boundaries and doesn't always prefetch data all the way to L1. See the Intel optimisation manual for more details on hardware prefetching. (Currently, this code hasn't been benchmarked on AMD processors, and is a no-op on non-x86 architectures.)
The prefetching in this patch is very straightforward: it prefetches 4k ahead of the sweep pointer.
On a small benchmark, this speeds up sweeping by around 25%. (Sweeping is about a quarter of the runtime of this benchmark, leading to a more modest overall improvement of a few percent).
This is a prelude to a more complicated patch that adds prefetching to marking, where it causes a more dramatic improvement.
(joint work with Will Hasenplaugh)