
Performance regression with multiple domains between 5.0 and 5.1 #12460

Open · kayceesrk opened this issue Aug 3, 2023 · 30 comments

@kayceesrk (Contributor) commented Aug 3, 2023

@ghennequin is using OCaml 5 for numerical computing with owl. He observed a performance regression between 5.0 and 5.1. The benchmarks are available at https://github.com/ghennequin/par-matmul-profiling. I was able to reproduce the regression on one of my machines (x86-64, 16 cores, 32 hw threads, 2 sockets with 8 cores per socket, 2 NUMA nodes).

[Figure: benchmark results showing the regression between 5.0 and 5.1]

I have been investigating this regression (and other similar ones). CC @damiendoligez, who may be interested in this.
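For readers who do not want to open the repository, here is a minimal sketch of the general shape of such a workload (illustrative only; the matrix size, iteration count and timing code are arbitrary assumptions, the real benchmark lives in the repository above):

```ocaml
(* Illustrative sketch: each domain repeatedly multiplies two Owl matrices
   and the whole run is timed.  Requires the owl and unix libraries and
   OCaml >= 5.0; sizes and iteration counts are arbitrary. *)
let work_per_domain n iters =
  let a = Owl.Mat.uniform n n and b = Owl.Mat.uniform n n in
  for _ = 1 to iters do
    ignore (Owl.Mat.dot a b)
  done

let () =
  let num_domains = try int_of_string Sys.argv.(1) with _ -> 1 in
  let t0 = Unix.gettimeofday () in
  List.init num_domains (fun _ -> Domain.spawn (fun () -> work_per_domain 256 100))
  |> List.iter Domain.join;
  Printf.printf "%d domain(s): %.3fs\n" num_domains (Unix.gettimeofday () -. t0)
```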

@kayceesrk added the Performance label on Aug 3, 2023
@gasche (Member) commented Aug 9, 2023

I wonder if you could get any closer to the source of the issue (by bisection maybe?) and how this affects the release calendar: is it something that we could hope to get de-regressed before a 5.1 release, or should we aim for a 5.1.1 with a bugfix? (cc @Octachron)

@Octachron (Member):

This is a complex performance regression whose source, as far as I know, neither @kayceesrk nor @sadiqj could identify after a few weeks. Fixing such a regression is in scope for 5.2, not 5.1.

@kayceesrk (Contributor, Author):

Bisecting is a good idea (which I haven’t tried yet). I haven’t spent too much time on this tbh. I plan to get back to this in early September.

@gasche (Member) commented Oct 4, 2023

I am a bit nervous about the fact that no one investigated this issue yet. Any volunteers?

(This might be in the skillset of @fabbing, @OlivierNicole, @dustanddreams and/or @NickBarnes.)

@OlivierNicole (Contributor) commented Oct 30, 2023

Looking into this with help from @fabbing: there are many more minor words allocated and many more major GC cycles on 5.1. We will look into it during our time allocated to work on the compiler.

@fabbing (Contributor) commented Nov 17, 2023

While #12500 solves the performance issue caused by the high number of major collections, it is still unclear to me what happened between 5.0.0 and 5.1.0 that caused this change in behavior (no relevant change to bigarray.c). So I am still investigating the matter.

@gasche (Member) commented Nov 17, 2023

I would bet on #11903.

@OlivierNicole (Contributor):

The thing is, we ran the benchmark before and after the merge of #11903, and it doesn’t seem to be the case: the issue of too many major GCs was present in both cases.

Note: my previous message wrongly said "many more minor words allocated", but the issue is "many more major GCs"; I will update it.

@gasche (Member) commented Nov 17, 2023

I would not necessarily trust the GC statistics, which have been partly buggy in 5.0. It may be that the PR introduced the time difference (the actual bug) without showing a difference in GC counts (due to buggy stats).

@gasche (Member) commented Nov 17, 2023

(I was thinking of dbd36ae which only affects minor collections, not major collections.)

@kayceesrk (Contributor, Author):

@OlivierNicole @fabbing Now that #12754 is merged, would you be able to plot the difference again between 5.0, 5.1 and trunk?

@fabbing (Contributor) commented Nov 21, 2023

> @OlivierNicole @fabbing Now that #12754 is merged, would you be able to plot the difference again between 5.0, 5.1 and trunk?

The numbers in #12500 (comment) include pace-1, which is mostly what was merged with #12754. But I'll run the tests again with trunk.

@fabbing (Contributor) commented Nov 21, 2023

Here are the numbers and a graph for the speedup on 5.0.0, 5.1.0 and trunk at 6125c29.
The machine used has only 4 cores and 2 threads/core, so speedup values above 4 domains are not really significant.

[Figure: speedup vs. number of domains for 5.0.0, 5.1.0 and trunk]
(The diagram uses an incorrect label for trunk; it should be 6125c29.)

Raw speedup data per domain:

| Domains | 5.0.0 | 5.1.0 | 6125c29 |
|---|---|---|---|
| 1 | 0.998539 | 0.980273 | 1.01852 |
| 2 | 1.1493 | 0.925795 | 1.17663 |
| 4 | 1.96153 | 1.1798 | 1.67212 |
| 8 | 1.4196 | 1.42262 | 1.62505 |
GC stats:

| | 5.0.0 | 5.1.0 | 6125c29 |
|---|---|---|---|
| minor_collections | 554 | 331 | 660 |
| major_collections | 39 | 307 | 40 |
| compactions | 0 | 0 | 0 |
| forced_major_collections | 1 | 1 | 1 |
| minor_words | 113908377 | 49636567 | 118703032 |
| promoted_words | 95219 | 107011 | 96063 |
| major_words | 102629 | 115090 | 103737 |
| top_heap_words | 1982767 | 1163567 | 1823023 |
| heap_words | 200604 | 200604 | 204700 |
| live_words | 18901 | 18972 | 18902 |
| free_words | 180792 | 180730 | 184877 |
| largest_free | 0 | 0 | 0 |
| fragments | 911 | 902 | 921 |
| live_blocks | 1707 | 1717 | 1707 |
| free_blocks | 0 | 0 | 0 |
| heap_chunks | 0 | 0 | 0 |
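These counters are the fields of the standard library's Gc.stat record. As a minimal sketch, such a dump can be produced at the end of a run like this (the benchmark may well collect the numbers differently):

```ocaml
(* Sketch: dump the GC counters shown in the table above when the program
   exits.  Gc.print_stat prints every field of Gc.stat; a few counters are
   also printed individually to make them easy to grep. *)
let () =
  at_exit (fun () ->
    let s = Gc.stat () in
    Printf.printf "minor_collections: %d\nmajor_collections: %d\nminor_words: %.0f\n"
      s.Gc.minor_collections s.Gc.major_collections s.Gc.minor_words;
    Gc.print_stat stdout)
```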

@gasche (Member) commented Nov 21, 2023

Is it correct to interpret your numbers as showing a remaining slowdown for 4 domains, or do you believe that this is measurement noise? (If you have made several runs of each test, you have a sense of the confidence interval around each point, and it would be useful to report on that somehow.)

@jberdine (Contributor):

Apologies for the superficial comment, but I can't help but notice that the ratio by which the number of minor collections increases from 5.0 to trunk is very close to the speedup of 5.0 over trunk.

@fabbing (Contributor) commented Nov 22, 2023

Well, after a particularly painful git bisect, I can confirm that #11903 causes the increase in major collections and the associated performance regression. I don't know what I screwed up in my initial tests to erroneously rule it out...
I will investigate further from there.

@Octachron (Member):

Potentially related: another example of a regression has been reported to me, for numerical code using owl, where the performance regression appears even in single-domain code.
Measuring the time and the number of minor and major collections shows a pretty wild change of behavior between 5.0.0 and 5.1.0:

| compiler version | time | minor | major |
|---|---|---|---|
| 4.14.1 | 0.803s | 6626 | 1656 |
| 5.0.0 | 1.448s | 8697 | 8679 |
| 5.1.0 | 5.507s | 56162 | 56156 |
| 5.2.0+trunk | 1.701s | 10945 | 10939 |
| 5.1+gc_fixes | 1.740s | 10563 | 10557 |
| 5.1+12754 | 2.343s | 20089 | 20083 |

where the 5.1+gc_fixes branch contains backports of #12318, #12439, #12754 and #12595.

@fabbing (Contributor) commented Nov 29, 2023

Thanks @Octachron, your example is helpful to investigate this issue without having to consider multiple domains!

@kayceesrk (Contributor, Author):

> Potentially related: another example of a regression has been reported to me, for numerical code using owl, where the performance regression appears even in single-domain code.

Since others are also looking at this regression, moving some conversations from private exchanges onto this thread.

This benchmark is heavily influenced by custom_major_ratio, which is controlled by the OCAMLRUNPARAM parameter M. From https://v2.ocaml.org/manual/runtime.html#s:ocamlrun-options:

> M (custom_major_ratio): Target ratio of floating garbage to major heap size for out-of-heap memory held by custom values (e.g. bigarrays) located in the major heap. The GC speed is adjusted to try to use this much memory for dead values that are not yet collected. Expressed as a percentage of major heap size. Default: 44. Note: this only applies to values allocated with caml_alloc_custom_mem.

By increasing the value of M from the default of 44, one can see that the running time drops dramatically, along with the number of major and minor GCs, whereas the max RSS (memory usage) only increases marginally.

| M | Time | # major | # minor | max RSS (kbytes) |
|---|---|---|---|---|
| 44 | 2.310s | 10945 | 10939 | 27364 |
| 88 | 1.658s | 5475 | 5469 | 28144 |
| 176 | 1.277s | 2752 | 2736 | 29444 |
| 352 | 1.133s | 1390 | 1370 | 30756 |

For comparison, on this machine, for 4.14.1, the observations are:

| M | Time | # major | # minor | max RSS (kbytes) |
|---|---|---|---|---|
| 44 | 1.253s | 6626 | 1656 | 26088 |

This potentially indicates that the M parameter is too small for 5.x compared to 4.x. @damiendoligez is investigating this.
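For completeness, the ratio can be changed either from the environment (for example, OCAMLRUNPARAM='M=176' before running the benchmark) or programmatically. A minimal sketch of the latter, assuming the runtime honors the documented custom_major_ratio field of Gc.control (176 is just one of the values from the table above):

```ocaml
(* Sketch: raise custom_major_ratio programmatically instead of via
   OCAMLRUNPARAM=M=...; 176 is one of the settings tried above. *)
let () =
  let c = Gc.get () in
  Gc.set { c with Gc.custom_major_ratio = 176 }
```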

@fabbing (Contributor) commented Nov 29, 2023

Together with @Engil, we ran @Octachron's gw_original.ml example under olly with d07e6e9 (the commit just before #11903) and 613f96d (#11903). (We had to make a quick change to olly to increase the polling speed.)

| compiler version | time | minor | major |
|---|---|---|---|
| d07e6e9 | 1.076s | 8699 | 8680 |
| 613f96d | 3.542s | 56160 | 56154 |

We can see an unexpected increase in minor and major collections, even though the program uses only one domain.
Perfetto also reveals an interesting (and still unexplained) GC behaviour on 613f96d, where 4 GC cycles are always executed back-to-back.

[Figure: Perfetto trace for d07e6e9]

[Figure: Perfetto trace for 613f96d]

@gasche (Member) commented Nov 29, 2023

I discussed this general issue with @damiendoligez today. One idea that we floated is to change the way we count major GC 'work' for custom blocks, to count the external heap memory as part of the work, not just the in-heap block size -- this would match the GC speed adjustment when promoting the block. One way to think about this is that ideally Bigarrays would be handled by the GC pacing logic just like an in-heap string (or other non-scanned block) of the same total size.
(We could conceivably extend this idea to custom blocks that allocate another resource than memory, by using the user-provided mem ratio, converted into a virtual out-of-heap size.)

One obvious difficulty with this approach is that it requires remembering the out-of-heap size of custom blocks, which is not currently stored with the value. It could be stored as an extra field/argument (though modifying the Custom layout would be fairly invasive), or possibly exposed as a new custom operation (assuming users know how to recompute it from their own data, which sounds plausible), but I wonder whether performing an indirect call for each custom block traversed by the GC is reasonable.

@Octachron added the bug label on Jan 10, 2024
@gasche (Member) commented Jan 10, 2024

@Octachron has a smaller repro case taken from Owl, which shows a regression in 5.1 even without domains. This repro case should be added to Sandmark.

@Octachron could you post your example here?

We then need someone to add it to Sandmark. Once @Octachron has posted his example here, I will try to look for another volunteer to do it -- Octachron has enough on his plate already.

@tmcgilchrist (Contributor):

@gasche I can add the test case to Sandmark. Please share the details here and I can look at it next week.

@tmcgilchrist (Contributor):

Here are the results I can reproduce on macOS with a 12-core M3 Pro and 38 GB of RAM. There is a large increase between 5.1.0 and 5.1.1, with 5.1.1 faster than 5.0.0 on this platform.

[Figure: speedup per OCaml version]

Fixes to the original test posted by @kc are in ghennequin/par-matmul-profiling#1, to make Owl work on ARM64.

@gasche (Member) commented Jan 15, 2024

Excellent news, thanks!

@tmcgilchrist Octachron's example is a sequential Owl program, which is available at

https://gitlab.inria.fr/fangelet/ocaml-gc-regression-examples
https://gitlab.inria.fr/fangelet/ocaml-gc-regression-examples/-/blob/main/analyses/gw_original.ml?ref_type=heads

@damiendoligez self-assigned this on Jan 31, 2024
@gasche (Member) commented Feb 1, 2024

@tmcgilchrist did you manage to add such examples to Sandmark? (I would be curious to see a pointer to performance results for those programs that I could revisit in the future to see if things have changed.)

@tmcgilchrist (Contributor):

ocaml-bench/sandmark#464 added the Owl program to Sandmark, @gasche.
Waiting on it to run in the nightly tests.

@gasche (Member) commented Feb 1, 2024

Great, thanks!

@tmcgilchrist (Contributor):

@gasche Owl benchmarks are available for 5.0.1, 5.2 and 5.3+trunk at https://sandmark.tarides.com

@gasche (Member) commented Feb 15, 2024

@tmcgilchrist Thanks! I looked at the new sandmark webpage, and here is some random feedback. (I don't know if you are the right person to send this feedback to, but you probably know the right person.)

  • I would have found it natural to compare 4.14, 5.0 and the latest 5 release. However, the only available 4.14 version on Sandmark is called 4.14.0+domains, and I have no idea what this is. Is it the upstream 4.14 runtime? Can I trust it to faithfully represent 4.14 performance on non-multicore benchmarks? (Where is it even defined/described?) There is a 4.12.0+stock, which sounds reassuring. Can we have a 4.14.0+stock as well?

  • Most of the graphs are very hard to read because one or two inputs return a huge number, and the other numbers are much smaller and therefore unreadable. Look at the "Major collections" or "Normalized major collections" graphs, for example: except for a few spikes, I cannot read most of the data points, and they could hide almost-arbitrary performance differences. The non-normalized graph, for example, is unreadable because a few benchmarks have thousands of major collections, while the vast majority of them have dozens of major collections and show up basically as 0 in the result; for a given benchmark, the results for the two OCaml versions could be 40 and 20, which both show up as 0, hiding an important difference. Would it be possible to change the way this data is visualized, so that we can reasonably estimate the data by looking at the graph? For example, would using a logarithmic plot solve this problem?
