Performance regression with multiple domains between 5.0 and 5.1 #12460
Comments
I wonder whether you could get any closer to the source of the issue (by bisection, maybe?), and how this affects the release calendar: is it something that we could hope to get de-regressed before the 5.1 release, or should we aim for a 5.1.1 with a bugfix? (cc @Octachron) |
This is a complex performance regression whose source, as far as I know, neither @kayceesrk nor @sadiqj could identify after a few weeks. Fixing such a regression is in scope for 5.2, not 5.1. |
Bisecting is a good idea (which I haven’t tried yet). I haven’t spent too much time on this tbh. I plan to get back to this in early September. |
I am a bit nervous about the fact that no one has investigated this issue yet. Any volunteers? (This might be in the skillset of @fabbing, @OlivierNicole, @dustanddreams and/or @NickBarnes.) |
Looking into this with help from @fabbing, there are |
While #12500 solves the performance issue caused by the high number of major collections, it is still unclear to me what happened between 5.0.0 and 5.1.0 that caused this change in behavior (no relevant change to |
I would bet on #11903. |
The thing is, we ran the benchmark before and after the merge of #11903, and that doesn't seem to be the case: the issue of too many major GCs was present in both cases. Note: my previous message wrongly said "many more minor words allocated", but the issue is "many more major GCs"; I will update it. |
I would not necessarily trust the GC statistics, which were partly buggy in 5.0. It may be that the PR introduced the time difference (the actual bug) without showing a difference in GC counts (due to the buggy stats). |
(I was thinking of dbd36ae which only affects minor collections, not major collections.) |
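(For anyone re-running these measurements, here is a minimal sketch of cross-checking the counters directly; `workload` stands in for the benchmark body and is an assumption of this example, not code from the thread.)

```ocaml
(* Sketch: report how many minor and major collections a workload triggers,
   using the standard-library Gc.quick_stat counters. *)
let count_gcs workload =
  let s0 = Gc.quick_stat () in
  workload ();
  let s1 = Gc.quick_stat () in
  Printf.printf "minor: %d, major: %d\n"
    (s1.Gc.minor_collections - s0.Gc.minor_collections)
    (s1.Gc.major_collections - s0.Gc.major_collections)
```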
@OlivierNicole @fabbing Now that #12754 is merged, would you be able to plot the difference again between 5.0, 5.1 and trunk? |
The numbers in #12500 (comment) include |
Here are the numbers and a graph for the speedup on
[graph: speedup per domain; raw speedup data per domain; GC stats] |
Is it correct to interpret your numbers as showing a remaining slowdown for 4 domains, or do you believe that this is measurement noise? (If you have made several runs of each test, you have a sense of the confidence interval around each point, and it would be useful to report on that somehow.) |
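(A minimal sketch of the kind of reporting meant here, assuming per-run wall-clock times collected in an array; the 1.96 factor is the usual normal-approximation 95% interval.)

```ocaml
(* Sketch: mean and a rough 95% confidence half-width over repeated runs.
   Assumes at least two measurements. *)
let mean_and_ci times =
  let n = float_of_int (Array.length times) in
  let mean = Array.fold_left (+.) 0. times /. n in
  let var =
    Array.fold_left (fun acc t -> acc +. (t -. mean) ** 2.) 0. times
    /. (n -. 1.)
  in
  (mean, 1.96 *. sqrt (var /. n))
```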
Apologies for the superficial comment, but I can't help but notice that the increase in the number of minor collections on trunk vs. 5.0 is very close to the speedup ratio of 5.0 over trunk. |
Well, after a particularly painful |
Potentially related: I was sent another example of a regression for numerical code using
where the 5.1+gc_fixes branch contains backports of #12318, #12439, #12754 and #12595. |
Thanks @Octachron, your example is helpful to investigate this issue without having to consider multiple domains! |
Since others are also looking at this regression, moving some conversations from private exchanges onto this thread. This benchmark is heavily influenced by the
By increasing
For comparison, on this machine, for 4.14.1, the observations are:
This potentially indicates that the |
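(The comments above are truncated, so it is not certain which GC parameter was being increased. Assuming it is one of the custom-block pacing ratios that this thread discusses below, here is a minimal sketch of inspecting and raising them from OCaml; the value 100 is an arbitrary illustration, not a recommendation from this thread.)

```ocaml
(* Sketch: read the current GC controls, print the custom-block ratios,
   and raise custom_major_ratio (illustrative value only). These fields
   have existed since OCaml 4.08. *)
let () =
  let c = Gc.get () in
  Printf.printf "custom_major_ratio=%d custom_minor_ratio=%d\n"
    c.Gc.custom_major_ratio c.Gc.custom_minor_ratio;
  Gc.set { c with Gc.custom_major_ratio = 100 }
```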
Together with @Engil, we ran @Octachron's
We can see an unexpected increase in minor and major GC collections, even though it uses only one domain. [Perfetto graph for d07e6e9] [Perfetto graph for 613f96d] |
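(For readers wanting to reproduce such traces: a minimal sketch, assuming the runtime_events library shipped with OCaml 5.0 and later; the trace itself is recorded by an external consumer such as olly, which can emit Perfetto-compatible output. The pause call and its placement are illustrative.)

```ocaml
(* Sketch: turn on the runtime's event ring so an external consumer can
   observe GC events while the workload runs. Build against the
   runtime_events library distributed with the compiler. *)
let () =
  Runtime_events.start ();
  (* ... run the workload being profiled ... *)
  Runtime_events.pause ()
```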
I discussed this general issue with @damiendoligez today. One idea that we floated is to change the way we count major GC 'work' for custom blocks, to count the external heap memory as part of the work, not just the in-heap block size -- this would match the GC speed adjustment when promoting the block. One way to think about this is that ideally Bigarrays would be handled by the GC pacing logic just like an in-heap string (or other non-scanned block) of the same total size. One obvious difficulty with this approach is that it requires remembering the out-of-heap size of custom blocks, which is not currently stored with the value. This could be stored as an extra field/argument, but modifying the Custom layout would be fairly invasive, or possibly as a new custom operation (assuming users know how to recompute it from their own data, which sounds plausible), but I wonder whether performing an indirect call for each custom block traversed by the GC is reasonable. |
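(To make the pacing question concrete, here is a hypothetical micro-benchmark, not taken from the thread: each Bigarray is a small in-heap custom block whose large payload lives outside the OCaml heap, which is exactly the memory the proposed work accounting would start counting.)

```ocaml
(* Hypothetical illustration: many short-lived Bigarrays. The in-heap custom
   block is tiny, so if major-GC work counts only the in-heap size, scanning
   these blocks looks much cheaper than the memory they actually retain. *)
let () =
  let before = Gc.quick_stat () in
  for _ = 1 to 10_000 do
    ignore (Sys.opaque_identity
              (Bigarray.Array1.create Bigarray.float64 Bigarray.c_layout 4096))
  done;
  let after = Gc.quick_stat () in
  Printf.printf "minor GCs: %d, major GCs: %d\n"
    (after.Gc.minor_collections - before.Gc.minor_collections)
    (after.Gc.major_collections - before.Gc.major_collections)
```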
@Octachron has a smaller repro case taken from Owl, that shows a regression in 5.1 even without domains. This repro case should be added to Sandmark. @Octachron could you post your example here? We then need someone to add it to Sandmark. Once @Octachron has posted his example here, I will try to look for another volunteer to do it -- Octachron has enough on his plate already. |
@gasche I can add the test case to Sandmark. Please share the details here and I can look at it next week. |
Here are the results I can reproduce on macOS with a 12-core M3 Pro and 38 GB RAM. There is a large increase between 5.1.0 and 5.1.1, with 5.1.1 faster than 5.0.0 on this platform. Fixes to the original test posted by @kc (ghennequin/par-matmul-profiling#1) to make Owl work on ARM64, |
Excellent news, thanks! @tmcgilchrist Octachron's example is a sequential Owl program, available at https://gitlab.inria.fr/fangelet/ocaml-gc-regression-examples |
@tmcgilchrist did you manage to add such examples to Sandmark? (I would be curious to see a pointer with performance results for those programs, that I could revisit in the future to see if things have changed.) |
ocaml-bench/sandmark#464 added the Owl program to Sandmark, @gasche. |
Great, thanks! |
@gasche Owl benchmarks are available for 5.0.1, 5.2 and 5.3+trunk at https://sandmark.tarides.com |
@tmcgilchrist Thanks! I looked at the new Sandmark webpage, and here is some random feedback. (I don't know if you are the right person to send this feedback to, but you probably know the right person.) |
@ghennequin is using OCaml 5 for numerical computing using owl. He observed a performance regression between 5.0 and 5.1. The benchmarks are available at https://github.com/ghennequin/par-matmul-profiling. I was able to reproduce the regression on one of my machines (x86-64, 16 cores, 32 hw threads, 2 sockets with 8 cores per socket, 2 NUMA nodes). I have been investigating this regression (and other similar ones). CC @damiendoligez, who may be interested in this.