
Performance regression with multiple domains between 5.0 and 5.1 #12460

Open · kayceesrk opened this issue Aug 3, 2023 · 30 comments

@kayceesrk (Contributor) commented Aug 3, 2023

@ghennequin is using OCaml 5 for numerical computing with owl. He observed a performance regression between 5.0 and 5.1. The benchmarks are available at https://github.com/ghennequin/par-matmul-profiling. I was able to reproduce the regression on one of my machines (x86-64, 16 cores, 32 hw threads, 2 sockets with 8 cores per socket, 2 NUMA nodes).

[Figure: benchmark results showing the regression between 5.0 and 5.1]

I have been investigating this regression (and other similar ones). CC @damiendoligez, who may be interested in this.
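For readers who do not want to open the repository, here is a minimal sketch of the general shape of such a workload (illustrative only; the matrix size, iteration count and timing code are arbitrary assumptions, the real benchmark lives in the repository above):

```ocaml
(* Illustrative sketch: each domain repeatedly multiplies two Owl matrices
   and the whole run is timed.  Requires the owl and unix libraries and
   OCaml >= 5.0; sizes and iteration counts are arbitrary. *)
let work_per_domain n iters =
  let a = Owl.Mat.uniform n n and b = Owl.Mat.uniform n n in
  for _ = 1 to iters do
    ignore (Owl.Mat.dot a b)
  done

let () =
  let num_domains = try int_of_string Sys.argv.(1) with _ -> 1 in
  let t0 = Unix.gettimeofday () in
  List.init num_domains (fun _ -> Domain.spawn (fun () -> work_per_domain 256 100))
  |> List.iter Domain.join;
  Printf.printf "%d domain(s): %.3fs\n" num_domains (Unix.gettimeofday () -. t0)
```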

@kayceesrk added the Performance label on Aug 3, 2023
@gasche (Member) commented Aug 9, 2023

I wonder if you could get any closer to the source of the issue (by bisection maybe?) and how this affects the release calendar: is it something that we could hope to get de-regressed before a 5.1 release, or should we aim for a 5.1.1 with a bugfix? (cc @Octachron)

@Octachron (Member):

This is a complex performance regression whose source, as far as I know, neither @kayceesrk nor @sadiqj could identify after a few weeks. Fixing such a regression is in scope for 5.2, not 5.1.

@kayceesrk (Contributor, Author):

Bisecting is a good idea (which I haven’t tried yet). I haven’t spent too much time on this tbh. I plan to get back to this in early September.

@gasche (Member) commented Oct 4, 2023

I am a bit nervous about the fact that no one investigated this issue yet. Any volunteers?

(This might be in the skillset of @fabbing, @OlivierNicole, @dustanddreams and/or @NickBarnes.)

@OlivierNicole (Contributor) commented Oct 30, 2023

Looking into this with help from @fabbing: there are many more minor words allocated and many more major GC cycles on 5.1. We will look into it during our time allocated to work on the compiler.

@fabbing (Contributor) commented Nov 17, 2023

While #12500 solves the performance issue caused by the high number of major collections, it is still unclear to me what happened between 5.0.0 and 5.1.0 that caused this change in behavior (no relevant change to bigarray.c). So I am still investigating the matter.

@gasche (Member) commented Nov 17, 2023

I would bet on #11903.

@OlivierNicole (Contributor):

The thing is, we ran the benchmark before and after the merge of #11903, and it doesn’t seem to be the case: the issue of too many major GCs was present in both cases.

Note: my previous message wrongly said "many more minor words allocated", but the issue is "many more major GCs"; I will update it.

@gasche (Member) commented Nov 17, 2023

I would not necessarily trust the GC statistics, which have been partly buggy in 5.0. It may be that the PR introduced the time difference (the actual bug) without showing a difference in GC counts (due to buggy stats).

@gasche (Member) commented Nov 17, 2023

(I was thinking of dbd36ae which only affects minor collections, not major collections.)

@kayceesrk (Contributor, Author):

@OlivierNicole @fabbing Now that #12754 is merged, would you be able to plot the difference again between 5.0, 5.1 and trunk?

@fabbing (Contributor) commented Nov 21, 2023

> @OlivierNicole @fabbing Now that #12754 is merged, would you be able to plot the difference again between 5.0, 5.1 and trunk?

The numbers in #12500 (comment) include pace-1, which is mostly what was merged with #12754. But I'll run the tests again with trunk.

@fabbing (Contributor) commented Nov 21, 2023

Here are the numbers and a graph for the speedup on 5.0.0, 5.1.0 and trunk at 6125c29.
The machine used has only 4 cores and 2 threads/core, so speedup values above 4 domains are not really significant.

[Figure: speedup vs. number of domains for 5.0.0, 5.1.0 and trunk]
(The diagram uses an incorrect label for trunk; it should be 6125c29.)

Raw speedup data per domain:

| Domains | 5.0.0 | 5.1.0 | 6125c29 |
|---|---|---|---|
| 1 | 0.998539 | 0.980273 | 1.01852 |
| 2 | 1.1493 | 0.925795 | 1.17663 |
| 4 | 1.96153 | 1.1798 | 1.67212 |
| 8 | 1.4196 | 1.42262 | 1.62505 |
GC stats:

| | 5.0.0 | 5.1.0 | 6125c29 |
|---|---|---|---|
| minor_collections | 554 | 331 | 660 |
| major_collections | 39 | 307 | 40 |
| compactions | 0 | 0 | 0 |
| forced_major_collections | 1 | 1 | 1 |
| minor_words | 113908377 | 49636567 | 118703032 |
| promoted_words | 95219 | 107011 | 96063 |
| major_words | 102629 | 115090 | 103737 |
| top_heap_words | 1982767 | 1163567 | 1823023 |
| heap_words | 200604 | 200604 | 204700 |
| live_words | 18901 | 18972 | 18902 |
| free_words | 180792 | 180730 | 184877 |
| largest_free | 0 | 0 | 0 |
| fragments | 911 | 902 | 921 |
| live_blocks | 1707 | 1717 | 1707 |
| free_blocks | 0 | 0 | 0 |
| heap_chunks | 0 | 0 | 0 |
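These counters are the fields of the standard library's Gc.stat record. As a minimal sketch, such a dump can be produced at the end of a run like this (the benchmark may well collect the numbers differently):

```ocaml
(* Sketch: dump the GC counters shown in the table above when the program
   exits.  Gc.print_stat prints every field of Gc.stat; a few counters are
   also printed individually to make them easy to grep. *)
let () =
  at_exit (fun () ->
    let s = Gc.stat () in
    Printf.printf "minor_collections: %d\nmajor_collections: %d\nminor_words: %.0f\n"
      s.Gc.minor_collections s.Gc.major_collections s.Gc.minor_words;
    Gc.print_stat stdout)
```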

@gasche (Member) commented Nov 21, 2023

Is it correct to interpret your numbers as showing a remaining slowdown for 4 domains, or do you believe that this is measurement noise? (If you have made several runs of each test, you have a sense of the confidence interval around each point, and it would be useful to report on that somehow.)

@jberdine (Contributor):

Apologies for the superficial comment, but I can't help but notice that the ratio by which the number of minor collections increases from 5.0 to trunk is very close to the speedup of 5.0 over trunk.

@fabbing (Contributor) commented Nov 22, 2023

Well, after a particularly painful git bisect, I can confirm that #11903 causes the increase in major collections and the associated performance regression. I don't know what I screwed up in my initial tests to erroneously rule it out...
I will investigate further from there.

@Octachron (Member):

Potentially related: another example of a regression has been reported to me, for numerical code using owl, where the performance regression appears even in single-domain code.
Measuring the time and the number of minor and major collections shows a pretty wild change of behavior between 5.0.0 and 5.1.0:

| compiler version | time | minor | major |
|---|---|---|---|
| 4.14.1 | 0.803s | 6626 | 1656 |
| 5.0.0 | 1.448s | 8697 | 8679 |
| 5.1.0 | 5.507s | 56162 | 56156 |
| 5.2.0+trunk | 1.701s | 10945 | 10939 |
| 5.1+gc_fixes | 1.740s | 10563 | 10557 |
| 5.1+12754 | 2.343s | 20089 | 20083 |

where the 5.1+gc_fixes branch contains backports of #12318, #12439, #12754 and #12595.

@fabbing (Contributor) commented Nov 29, 2023

Thanks @Octachron, your example is helpful to investigate this issue without having to consider multiple domains!

@kayceesrk (Contributor, Author):

> Potentially related: another example of a regression has been reported to me, for numerical code using owl, where the performance regression appears even in single-domain code.

Since others are also looking at this regression, moving some conversations from private exchanges onto this thread.

This benchmark is heavily influenced by custom_major_ratio, which is controlled by the OCAMLRUNPARAM parameter M. From https://v2.ocaml.org/manual/runtime.html#s:ocamlrun-options:

> M (custom_major_ratio): Target ratio of floating garbage to major heap size for out-of-heap memory held by custom values (e.g. bigarrays) located in the major heap. The GC speed is adjusted to try to use this much memory for dead values that are not yet collected. Expressed as a percentage of major heap size. Default: 44. Note: this only applies to values allocated with caml_alloc_custom_mem.

By increasing the value of M from the default of 44, one can see that the running time drops dramatically, along with the number of major and minor GCs, whereas the max RSS (memory usage) only increases marginally.

| M | Time | # major | # minor | max RSS (kbytes) |
|---|---|---|---|---|
| 44 | 2.310s | 10945 | 10939 | 27364 |
| 88 | 1.658s | 5475 | 5469 | 28144 |
| 176 | 1.277s | 2752 | 2736 | 29444 |
| 352 | 1.133s | 1390 | 1370 | 30756 |

For comparison, on this machine, for 4.14.1, the observations are:

| M | Time | # major | # minor | max RSS (kbytes) |
|---|---|---|---|---|
| 44 | 1.253s | 6626 | 1656 | 26088 |

This potentially indicates that the M parameter is too small for 5.x compared to 4.x. @damiendoligez is investigating this.
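For completeness, the ratio can be changed either from the environment (for example, OCAMLRUNPARAM='M=176' before running the benchmark) or programmatically. A minimal sketch of the latter, assuming the runtime honors the documented custom_major_ratio field of Gc.control (176 is just one of the values from the table above):

```ocaml
(* Sketch: raise custom_major_ratio programmatically instead of via
   OCAMLRUNPARAM=M=...; 176 is one of the settings tried above. *)
let () =
  let c = Gc.get () in
  Gc.set { c with Gc.custom_major_ratio = 176 }
```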

@fabbing (Contributor) commented Nov 29, 2023

Together with @Engil, we ran @Octachron's gw_original.ml example under olly with d07e6e9 (the commit just before #11903) and 613f96d (#11903). (We had to make a quick change to olly to increase the polling speed.)

| compiler version | time | minor | major |
|---|---|---|---|
| d07e6e9 | 1.076s | 8699 | 8680 |
| 613f96d | 3.542s | 56160 | 56154 |

We can see an unexpected increase in minor and major collections, even though the program uses only one domain.
Perfetto also reveals an interesting (and still unexplained) GC behaviour on 613f96d, where 4 GC cycles are always executed back-to-back.

[Figure: Perfetto trace for d07e6e9]

[Figure: Perfetto trace for 613f96d]

@gasche (Member) commented Nov 29, 2023

I discussed this general issue with @damiendoligez today. One idea that we floated is to change the way we count major GC 'work' for custom blocks, to count the external heap memory as part of the work, not just the in-heap block size -- this would match the GC speed adjustment when promoting the block. One way to think about this is that ideally Bigarrays would be handled by the GC pacing logic just like an in-heap string (or other non-scanned block) of the same total size.
(We could conceivably extend this idea to custom blocks that allocate another resource than memory, by using the user-provided mem ratio, converted into a virtual out-of-heap size.)

One obvious difficulty with this approach is that it requires remembering the out-of-heap size of custom blocks, which is not currently stored with the value. It could be stored as an extra field/argument (though modifying the Custom layout would be fairly invasive), or possibly exposed as a new custom operation (assuming users know how to recompute it from their own data, which sounds plausible), but I wonder whether performing an indirect call for each custom block traversed by the GC is reasonable.

@Octachron added the bug label on Jan 10, 2024
@gasche (Member) commented Jan 10, 2024

@Octachron has a smaller repro case taken from Owl, which shows a regression in 5.1 even without domains. This repro case should be added to Sandmark.

@Octachron could you post your example here?

We then need someone to add it to Sandmark. Once @Octachron has posted his example here, I will try to look for another volunteer to do it -- Octachron has enough on his plate already.

@tmcgilchrist (Contributor):

@gasche I can add the test case to Sandmark. Please share the details here and I can look at it next week.

@tmcgilchrist (Contributor):

Here are the results I can reproduce on macOS with a 12-core M3 Pro and 38 GB of RAM. There is a large increase between 5.1.0 and 5.1.1, with 5.1.1 faster than 5.0.0 on this platform.

[Figure: speedup per OCaml version]

Fixes to the original test posted by @kc are in ghennequin/par-matmul-profiling#1, to make Owl work on ARM64.

@gasche (Member) commented Jan 15, 2024

Excellent news, thanks!

@tmcgilchrist Octachron's example is a sequential Owl program, which is available at

https://gitlab.inria.fr/fangelet/ocaml-gc-regression-examples
https://gitlab.inria.fr/fangelet/ocaml-gc-regression-examples/-/blob/main/analyses/gw_original.ml?ref_type=heads

@damiendoligez self-assigned this on Jan 31, 2024
@gasche (Member) commented Feb 1, 2024

@tmcgilchrist did you manage to add such examples to Sandmark? (I would be curious to see a pointer to performance results for those programs that I could revisit in the future to see if things have changed.)

@tmcgilchrist (Contributor):

ocaml-bench/sandmark#464 added the Owl program to Sandmark, @gasche.
Waiting on it to run in the nightly tests.

@gasche (Member) commented Feb 1, 2024

Great, thanks!

@tmcgilchrist (Contributor):

@gasche Owl benchmarks are available for 5.0.1, 5.2 and 5.3+trunk at https://sandmark.tarides.com

@gasche (Member) commented Feb 15, 2024

@tmcgilchrist Thanks! I looked at the new sandmark webpage, and here is some random feedback. (I don't know if you are the right person to send this feedback to, but you probably know the right person.)

  • I would have found it natural to compare 4.14, 5.0 and the latest 5 release. However, the only available 4.14 version on Sandmark is called 4.14.0+domains, and I have no idea what this is. Is it the upstream 4.14 runtime? Can I trust it to faithfully represent 4.14 performance on non-multicore benchmarks? (Where is it even defined/described?) There is a 4.12.0+stock, which sounds reassuring. Can we have a 4.14.0+stock as well?

  • Most of the graphs are very hard to read because one or two inputs return a huge number, and the other numbers are much smaller and therefore unreadable. Look at the "Major collections" or "Normalized major collections" graphs, for example: except for a few spikes, I cannot read most of the data points, and they could hide almost-arbitrary performance differences. The non-normalized graph, for example, is unreadable because a few benchmarks have thousands of major collections, while the vast majority of them have dozens of major collections and show up basically as 0 in the result; for a given benchmark, the results for the two OCaml versions could be 40 and 20, which both show up as 0, hiding an important difference. Would it be possible to change the way this data is visualized, so that we can reasonably estimate the data by looking at the graph? For example, would using a logarithmic plot solve this problem?
