Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression in the compiler between 4.02.3 and 4.01.0 #7067

Closed
vicuna opened this Issue Nov 30, 2015 · 31 comments

Comments

Projects
None yet
2 participants
@vicuna
Copy link
Collaborator

vicuna commented Nov 30, 2015

Original bug ID: 7067
Reporter: @dbuenzli
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2017-02-16T14:18:24Z)
Resolution: fixed
Priority: normal
Severity: major
Version: 4.02.3
Target version: 4.03.0+dev / +beta1
Fixed in version: 4.03.0+dev / +beta1
Category: ~DO NOT USE (was: OCaml general)
Tags: compiler-time-or-space-regression
Related to: #6435
Child of: #7630
Monitored by: @gasche @diml jmeber @hcarty

Bug description

As I said on the mailing list, the compilation time of tsdl using 4.02.3 is ~30x the time of 4.01.0. See the steps below to reproduce.

The bottleneck is in the compilation of the module https://github.com/dbuenzli/tsdl/blob/master/src/tsdl.ml

Steps to reproduce

opam install ocamlfind ctypes ctypes-foreign result
cd /tmp
git clone https://github.com/dbuenzli/tsdl.git
cd /tmp/tsdl
opam switch 4.01.0; eval opam config env
time ./build tests.otarget
[...]
real 0m4.089s
user 0m3.227s
sys 0m0.694s
./build clean
opam switch 4.02.3; eval opam config env
time ./build tests.otarget
real 1m31.203s
user 1m29.394s
sys 0m1.596s

File attachments

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @gasche

I just uploaded a self-contained archive to avoid having to install external packages to test the regression.

This seems to be ocamlopt-related: ocamlc is fast on all versions.

On my machine, ocamlopt -c tsdl.cmx terminates in under three seconds on 4.01.0, but it is very slow (I didn't wait for termination) on either 4.00.1 or 4.02.3.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @dbuenzli

I forgot to add that I once monitored the process via top and you could see repeated spikes of memory going from ~300Mo to 700Mo and back.

Also there is a file tsdl_consts.ml autogenerated by the build system which gets included in the module here:

https://github.com/dbuenzli/tsdl/blob/master/src/tsdl.ml#L16

This file is solely made of (mostly integer) constant definitions but is quite large:

./build tsdl_consts.ml
wc -l _build/src/tsdl_consts.ml
814 _build/src/tsdl_consts.ml

A gut feeling is that there is some kind of non-linear behaviour w.r.t. the number of toplevel definitions since those are also quite plentiful in the tsdl module implementation:

grep "^let" tsdl.ml | wc -l
738

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @mshinwell

Daniel, we should check this with flambda, since the module compilation strategy has changed somewhat (and we've already worked to fix various problems that have occurred as a result of things like huge modules full of constants).

It's compiler hacking tonight at Cambridge, so maybe someone could look then.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @alainfrisch

Removing the outermost wrapper "module Sdl = struct ... end" allows to compile in a few seconds again.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @alainfrisch

since the module compilation strategy has changed somewhat

Can you say a few words about the new strategy? The special compilation mode transl_store_structure was specially designed to reduce pressure on register allocation (and something apparently went wrong with it) compared to normal compilation of structures. Does flambda do something special to avoid big nested "let-in" with long lived identifier?

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @alainfrisch

Removing the outermost wrapper "module Sdl = struct ... end" allows to compile in a few seconds again

What happens is that the content of Sdl in the typedtree is a "Tmod_constraint(_, _, Tmodtype_implicit, _)", which breaks the "transl_store_structure" compilation scheme. The coercion is caused by duplicated value names.

My guess is that this comes from:

da96fcb

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @mshinwell

The bytecode compilation strategy is used, which yields a big set of [let]s and a record creation at the end, together with arbitrary interspersed side effects.

The constant subexpressions (which always have variable names bound to them in the Flambda terms, by construction) are identified and lifted out to special "Let_symbol" constructions. (See the "program" type in <https://github.com/chambart/ocaml-1/blob/flambda_trunk/middle_end/flambda.mli >)
These are treated internally as normal scoped let-bindings even though they actually bind global symbols in the object file.

The remaining, possibly effectful parts of the module expression are then split into "Initialize_symbol" expressions; these bind symbols assigned to preallocated blocks of size 1. When the initialization expression has been evaluated, the result is put inside the first field of the preallocated block.

The result of this is:
(a) a nice scoped, composable language for describing constructs bound to object file symbols that is easy to analyze;
(b) compilation overhead reasonably reduced since the individual defining expressions of the symbols may be separately compiled in the backend;
(c) toplevel values are referenced directly by symbols rather than occurring in closures, as required;
(d) fairly aggressive lifting of constant expressions.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @alainfrisch

since the individual defining expressions of the symbols may be separately compiled in the backend

Thanks. So indeed this might very well fix this problem.

Gabriel's self-contained archive (thanks!) makes it straightforward to check.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @alainfrisch

I'm concerned that this might somehow complicate the analysis of the impact on compile-time for flambda. Shouldn't we try to fix this one pre-flambda to get a good comparison point?

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @mshinwell

Possibly yes, if it affects more than just this package.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @alainfrisch

Well, any unit defining multiple values with the same name (and this is quite common) suffers from the regression (disabling the compilation scheme that worked well with the register allocator). This might not always be so dramatic because compilation units are usually smaller, but I guess the problem is rather widespread.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @gasche

Maybe we should start trying to track regressions in compilation time or compiler memory usage. This is the second issue in the past releases -- the previous one (#6350) was arguably sensibly more dramatic -- plus a memory-usage problem when compiling Camlp4 (#6001).

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: runhang

The slowdown occurred between 4.01.0 and 4.02.0+rc1. During bisecting in this range, compiler fails to build for some commits for different reasons...

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @gasche

Runhang: Thanks for looking at this! You can "git bisect skip" some commits if they don't build, to get a more precise range. Note that at these time "make -j5" was unreliable, you have to ask for non-parallel builds.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @alainfrisch

Checking the commit I cited as the probable source for the regression ( da96fcb ) might be faster than bisecting.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @alainfrisch

I'm attaching a quick, certainly incomplete and perhaps buggy fix. At least the testsuite passes and it allows to compile Daniel's code in a few seconds again.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: runhang

da96fcb
real 1m32.745s
user 1m29.840s
sys 0m2.281s

51b3afd
real 0m3.642s
user 0m2.850s
sys 0m0.653s

can confirm da96fcb is the reason

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 30, 2015

Comment author: @alainfrisch

Maybe we should start trying to track regressions in compilation time or compiler memory usage.

Definitely! Monitoring total memory allocation is more stable than runtime across runs and platforms, so it might serve as a good proxy for performance.

Compiling the compiler itself with OCAMLRUNPARAM=v=0x400, extracting the allocation trace, and comparing (with some tolerance) to reference results could be a good first step.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 1, 2015

Comment author: runhang

confirm the latest flambda solves this slowdown

real 0m0.217s
user 0m0.140s
sys 0m0.064s

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 1, 2015

Comment author: @alainfrisch

This is impressive, but isn't a 17x speedup compared to the version before the regression too good to be true, esp. that flambda is usually slower at compile-time?

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 1, 2015

Comment author: @garrigue

Well, I tried to understand that code at some point, but it is tricky.
The only thing I can say is test a lot, because the original bug took a long time before being detected.
Also, this alone would do nothing for other cases of nested definitions, such as
let x1 = ... in
let x2 = ... in
....
let x1000 = ... in [|x1; ...; x1000|]
I am correct that flambda being more general, it would handle such cases also?

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 1, 2015

Comment author: @mshinwell

Even if what I wrote above doesn't apply because those lets aren't toplevel (I wasn't sure if they were in your example), they would still be lifted out (using this [Let_symbol] construction) if their defining expressions were found to be constant.

The worst case is probably when they are not toplevel and this does not happen. In this case, you end up with a big sequence of normal lets going through the compiler. The Flambda passes should all be tail-recursive and behave reasonably in the case of a long sequence of lets, but subsequent passes may not...

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 1, 2015

Comment author: @alainfrisch

Of course, the real culprit is the register allocator. At some point, we should perhaps reconsider the decision to reject the addition of a linear scan allocator...

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 1, 2015

Comment author: @alainfrisch

Is the flambda branch to check indeed https://github.com/chambart/ocaml-1/tree/flambda_trunk ?

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 1, 2015

Comment author: @gasche

Relevant link:

linearscan is a linear scan register allocator, written in Coq, and extracted to Haskell
https://sympa.inria.fr/sympa/arc/coq-club/2015-11/msg00131.html
John Wiegley, Nov 2015

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 1, 2015

Comment author: @mshinwell

frish: Yep

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 1, 2015

Comment author: @alainfrisch

I've tested the patch with success on LexiFi's code base.

There is no noticeable change when compiling e.g. ocamldoc/*.ml with ocamlopt compared to trunk. (In a previous note, I reported some different results, but this was caused by a problem between the screen and the chair, namely comparing betweeen two working copies configured resp. for mingw and msvc....)

Moreover, here is a program to generate simple reproduction cases:

let () =
  print_endline "module X = struct";
  print_endline " let x = 1";
  print_endline " let x = 2";
  for i = 1 to int_of_string Sys.argv.(1) do
    Printf.printf "  let f%i x = x + %d\n" i i
  done;
  print_endline "end"

Compiling with ocamlopt from the current trunk, trunk+patch and current flambda, I get:

          trunk   trunk+patch  flambda
 N = 200:  0.7s    0.12s        0.45s 
 N = 400:  2.9s    0.25s        0.9s
 N = 800: 11.5s    0.45s        1.84s

which shows clearly that one avoids the quadratic behavior, either with the patch or with flambda (which is about than 4 times slower than trunk+patch).

Daniel's example shows that the problem does not occur only in theory. Here are the timings for "make native" on Gabriel's self-contained repro case, using ocamlopt.opt:

          trunk   trunk+patch  flambda
          90.9s     1.6s        3.2s 

I'd like to push the change to trunk so that we get at least a good comparison point for flambda.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 1, 2015

Comment author: @gasche

Are you sure that the patch is correct? I don't know this part of the compiler si I cannot relevantly review it.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 1, 2015

Comment author: @alainfrisch

I also somehow lost track of coercions (in particular the second argument of Tcoerce_structure) after the addition of module aliases. So I definitely appreciate review from someone else (Jacques?). But I'm reasonably confident that it doesn't break code, or that this would be quickly found.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 3, 2015

Comment author: @garrigue

I applied the patch and looked at the code, and it seems ok (it mimicks transl_structure, which is of course the reference).
Honestly, I don't understand the invariants of transl_store_structure, so I cannot say more.
Well, to be sure that it doesn't break anything with aliases, checking that every package in Core compiles (with the tests) should be a reasonable start...

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 7, 2015

Comment author: @alainfrisch

Ok, I've pushed that to trunk (commit 9d4b3a4).

This is of course only a partial solution to the general problem of "the register allocator does not behave well with long pieces of code", but at least we avoid bad regressions in common cases with previous versions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.