
FLambda particularly slow with a large file consisting of trivial top-level aliases. #7630

Closed
vicuna opened this Issue Sep 18, 2017 · 37 comments

vicuna commented Sep 18, 2017

Original bug ID: 7630
Reporter: @ejgallego
Assigned to: @gasche
Status: resolved (set by @gasche on 2018-09-10T14:51:33Z)
Resolution: fixed
Priority: normal
Severity: minor
Platform: AMD64
OS: Linux
OS Version: Ubuntu 17.04
Version: 4.05.0
Fixed in version: 4.07.0
Category: back end (clambda to assembly)
Tags: flambda
Related to: #4074
Parent of: #5546 #6001 #7067 #7631

Bug description

Dear OCaml developers,

we have observed very slow compilation times with -flambda in the
context of the Coq project.

In particular, ocaml+flambda seems to struggle with a file that has a
few thousand trivial top-level definitions of the form:

let a = b

where b is defined in a different module.

The file takes around 7 minutes and 10GiB of RAM to compile, which is a problem.
Compilation time and memory use seem significantly increased compared
with the non-flambda OCaml compiler. The problem is present in both 4.03.0
and 4.05.0+trunk.

No command-line option we have tried improves this,
including -Oclassic and -rounds=0.
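
A synthetic file of this shape is easy to generate; the following is a hedged sketch (module and file names are hypothetical, not the actual Coq output), useful for reproducing the pattern without building Coq:

```shell
# Hypothetical repro: module B with one definition, module A made of a
# few thousand trivial top-level aliases into B (names are illustrative,
# not the actual Coq-generated code).
echo 'let x = [| 0 |]' > b.ml
for i in $(seq 0 1999); do echo "let a$i = B.x"; done > a.ml
wc -l a.ml
# On an flambda switch one would then compile it with, e.g.:
#   ocamlopt -c b.ml && ocamlopt -dtimings -c a.ml
```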

Steps to reproduce

I think the best way to reproduce the bug is to compile Coq
master with a flambda-enabled compiler. You will reach the problematic
file, OrdersEx.v, within a few minutes. The concrete bad '.ml' file that
Coq generates is available here: https://ufile.io/kfcqi

Additional information

Pierre-Marie Pédrot profiled the compiler and thinks that
ocamlopt is stuck in register allocation, specifically in
Coloring.walk, but we lack more concrete data.

Thank you!

vicuna commented Sep 18, 2017

Comment author: @gasche

Good news: the upcoming 4.06 release will ship an optional linear-scan register allocator (enabled with -linscan) that may help avoid register-allocation bottlenecks. We have hit a couple of such bottlenecks in the past even with the non-flambda backend, for example with (non-machine-generated) code in Camlp4.

vicuna commented Sep 18, 2017

Comment author: @ejgallego

Thanks Gabriel; note though that we are not sure the register allocator is causing the problem. Given the characteristics of the file, I found Pierre-Marie's finding surprising.

Has linscan already been merged into 4.06?

vicuna commented Sep 18, 2017

Comment author: @ejgallego

Never mind, I found #375. Also, Paul Steckler has reported the following in Coq's gitter:

OK, the profile does have Coloring.walk pretty high, but also ....
Set.mem and Set.bal are even higher
and Interf.compare
the highest item is the garbage-collector, not much to do about that
and the caller of Set.mem is Flambda_invariants__declare_variable

vicuna commented Sep 18, 2017

Comment author: @gasche

That indeed looks like the graph-coloring register allocator having trouble; I have seen similar memory profiles in the past. You should definitely try -linscan on the just-branched 4.06 release branch if you can (there should be an alpha/beta release soon-ish).

vicuna commented Sep 18, 2017

Comment author: @gasche

Another tool that can help, by the way, is the -dtimings option, which displays pass-by-pass time (and, in recent versions, memory) usage for the compiler. It keeps improving with each OCaml version.
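
As an illustration, one might compare -dtimings profiles across switches like this (a sketch only: the switch and file names are made up, and the compiler is not actually invoked; the loop just writes out the command lines one would run):

```shell
# Write out the invocations to compare -dtimings profiles between a plain
# and an flambda switch (switch/file names illustrative; nothing is compiled).
for sw in 4.05.0 4.05.0+flambda; do
  echo "opam exec --switch=$sw -- ocamlopt -dtimings -c orders_ex.ml"
done > timing_cmds.txt
cat timing_cmds.txt
```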

vicuna commented Sep 18, 2017

Comment author: @psteckler

I had created a smaller example to demonstrate the slowdown introduced by flambda. I wanted to see if the linear-scan code would help, and installed the opam switch 4.06.0+trunk+fp+flambda. I used the -linscan flag and got:

Fatal error: exception Invalid_argument("index out of bounds")
Raised by primitive operation at file "asmcomp/linscan.ml", line 111, characters 19-35
Called from file "list.ml", line 100, characters 12-15
Called from file "asmcomp/linscan.ml", line 115, characters 10-56
Called from file "asmcomp/linscan.ml", line 169, characters 4-28
Called from file "list.ml", line 100, characters 12-15
Called from file "asmcomp/asmgen.ml", line 87, characters 4-32
Called from file "utils/misc.ml", line 28, characters 20-27
Re-raised at file "utils/misc.ml", line 28, characters 50-57

I can file a separate bug report, if desired.

vicuna commented Sep 18, 2017

Comment author: @ejgallego

Maybe the combination of flambda + linscan is not well tested. Anyway, Paul used -dtimings to get more data, and indeed register allocation takes a lot more time in the flambda case.

vicuna commented Sep 19, 2017

Comment author: @gasche

Well, there goes my hope of -linscan fixing your issue :p At least we get a lot of testing by convincing Coq people to try our new options.

vicuna commented Sep 19, 2017

Comment author: @xclerc

As noted in #7631, the linscan bug seems to be related to fp;
could you try to build Coq with -linscan on a non-fp switch?

vicuna commented Sep 19, 2017

Comment author: @ejgallego

A -dtimings run for OrdersEx.v would also be very useful; unfortunately I am away from my desktop, so it will take me a while.

vicuna commented Sep 19, 2017

Comment author: @psteckler

I assume it's not necessary to try a non-fp switch, since the bug with linscan has been fixed.

vicuna commented Sep 25, 2017

Comment author: @ejgallego

OK, so I am back at my desktop, and for the example in #7631 I see:

4.05.0+trunk+flambda -Oclassic:
regalloc(sourcefile(b.ml)): 5.408s

4.05.0:
regalloc(sourcefile(b.ml)): 0.084s

thus this is IMHO a bug.

vicuna commented Sep 25, 2017

Comment author: @gasche

Again, you cannot conclude from these numbers that there is a bug: how much work the register allocator has to do depends a lot on the size and shape of your basic blocks, which in turn depends a lot on how aggressively the previous passes inline.

The compiler backend has changed its compilation strategy for structures (long series of "let" at the toplevel or between "struct .. end") several times in the past to avoid similar performance issues. However, I believe these changes were made at the bytecomp/translmod level, that is, before the split between the flambda and non-flambda (or bytecode) backends, so they should apply to all backends. It may be that some flambda optimizations undo this work, and a not-too-invasive change might be able to un-undo it.

See related issues: MPR #4074, MPR #5546, MPR #6001, MPR #7067

vicuna commented Sep 25, 2017

Comment author: @alainfrisch

flambda goes through Translmod.transl_implementation_flambda, which uses the "normal" compilation strategy for structures, not the "allocate then update slots" one. My understanding is that later passes of flambda extract back a structured representation of the toplevel definitions and do something specific with them, but I don't know whether that is supposed to avoid sending "monsters" to the register allocator.

vicuna commented Sep 25, 2017

Comment author: @ejgallego

My reasoning was based on the (likely wrong) assumption that flambda -Oclassic inlining should be the same as non-flambda. Quoting the manual:

"-Oclassic: Make inlining decisions at the point of definition of a function rather than at the call site(s). This mirrors the behaviour of OCaml compilers not using Flambda"

so I thought some reason other than inlining might be causing the slow behavior, but of course I am likely wrong :)

vicuna commented Sep 25, 2017

Comment author: @ejgallego

More data: I tried -Oclassic -inline 0, with no change; -inlining-report produces a zero-byte file. Let me know if I can gather more helpful information.

vicuna commented Sep 30, 2017

Comment author: @xavierleroy

flambda goes through Translmod.transl_implementation_flambda, which uses the "normal" compilation strategy for structures, not the "allocate then update slots" one.

The compile-time issues mentioned in this PR are the reasons why the "allocate then update slots" translation was introduced circa 1995.

vicuna commented Oct 2, 2017

Comment author: @chambart

Yes, flambda is supposed to try to do this. I tracked it a bit (on the example from #7631), and it seems that the nicely crafted "allocate then update slots" pattern is broken by CSE, which shares symbol loads between all the updates, adding lots of conflicts for the register allocator to die on.

This is confirmed by the fact that making Iconst_symbol a "cheap operation" in CSEgen avoids the problem.

I don't know the right solution. One potential hack would be to split the module initialization function into many small functions. Or to make symbols cheap for this particular function (though I would feel quite bad writing something like that without a better criterion for triggering it).

vicuna commented Oct 2, 2017

Comment author: @chambart

Also, some context as to why there is any difference between with and without flambda: in flambda, every top-level value gets its own symbol.

Something like:

let bar0 = [| 0 |]
let bar1 = [| 1 |]
let bar2 = [| 2 |]
let bar3 = [| 3 |]

ends up as something like

symbol bar0.(0) <- [| 0 |]
symbol bar1.(0) <- [| 1 |]
symbol bar2.(0) <- [| 2 |]
symbol bar3.(0) <- [| 3 |]
symbol toplevel_module.(0) <- bar0.(0)
symbol toplevel_module.(1) <- bar1.(0)
symbol toplevel_module.(2) <- bar2.(0)
symbol toplevel_module.(3) <- bar3.(0)

and loading the address of the symbol bar0 is shared between the assignments to bar0.(0) and toplevel_module.(0). Hence the giant conflict.

The Closure-generated version does not have this specific problem, because toplevel values are all fields of the same symbol, reducing the number of conflicts. The specific initialisation pattern of clambda (without going through an intermediate symbol) is not representable in flambda in most situations. (And I would be quite reluctant to make it representable.)

Here, it could be represented, since allocating an array is effect-free and the values are not used anywhere other than in the toplevel module block. In the Coq-generated example, it does not seem that it could be.

vicuna commented Oct 5, 2017

Comment author: @gasche

Pierre proposes a PR that partially alleviates some of these problems:

#1401

vicuna commented Oct 11, 2017

Comment author: @alainfrisch

It would still be useful to know the timings with -linscan (now that it has been fixed to work with "fp").

vicuna commented Oct 12, 2017

Comment author: @ejgallego

On the problematic Coq file, -linscan reduces compilation from 10 minutes and 10GiB of RAM to 20 seconds and 400MiB.

That's on top of PR #1401.
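
For comparison runs like this, both allocator invocations can be scripted; a sketch (file name illustrative; ocamlopt is not run here, the loop only emits the command lines):

```shell
# Emit both command lines: default graph-coloring allocator vs -linscan.
for alloc in "" "-linscan"; do
  echo "ocamlopt -O3 $alloc -dtimings -c orders_ex.ml" | tr -s ' '
done > alloc_cmds.txt
cat alloc_cmds.txt
```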

vicuna commented Oct 12, 2017

Comment author: @alainfrisch

Thanks, that's good to know. Do you have any way to evaluate the impact on the performance of the resulting program? (I assume that if you use flambda, runtime performance somehow matters!)

vicuna commented Oct 12, 2017

Comment author: @gasche

For Coq, you want to compile Coq itself (the type-checker, kernel, etc.) with flambda for performance reasons. Then native-compute (which reduces Coq terms by compiling them into OCaml) requires compiling all the Coq libraries you depend on, and natdynlinking (portions of) them into the coqc process; so you need to use flambda there as well, but what you want there is rather for it not to blow up (on term/program shapes possibly different from OCaml's).

vicuna commented Oct 12, 2017

Comment author: @alainfrisch

OK, thanks for the explanation. So I take it that compiling those dynlinked add-ons with -Oclassic + -linscan is not a crazy idea. (I'd also be interested to see whether applying -linscan to the Coq core itself gives any noticeable slowdown at runtime and/or speedup at compilation time.)

vicuna commented Oct 12, 2017

Comment author: @ejgallego

The problematic file is a very degenerate case due to several thousand trivial top-level definitions; nothing similar appears in the rest of the contributions tested by Coq's CI.

Thus, it seems to me that enabling -linscan by default is a bit too much.

vicuna commented Oct 12, 2017

Comment author: @ejgallego

Alain, soon (TM) we will have a way to systematically test Coq's performance with different combinations of compiler flags; we'll make an announcement once the setup is ready.

vicuna commented Oct 12, 2017

Comment author: @alainfrisch

we will have a way to systematically test performance of Coq with different combinations of compiler flags

Looking forward to it!

it seems to me that enabling -linscan by default is a bit too much.

It depends on the numbers. If it makes compilation significantly faster without degrading runtime performance in any observable way, it's just "better", not "too much".

vicuna commented Oct 12, 2017

Comment author: @gasche

Alain, I've already told Emilio that -linscan should be fine (it should not actually degrade performance in observable ways), but he is determined to get the absolute best baseline performance. He is also, among the active Coq developers, the only one currently testing and benchmarking cutting-edge OCaml-side changes, so be nice to him :-)

(Remark: there is no problem compiling some files with -linscan but not others, so Coq could special-case some of its standard library to use it.)

vicuna commented Oct 12, 2017

Comment author: @alainfrisch

Gabriel, I'm always nice to everyone :-)

I'm just honestly interested in more real-world results about -linscan.

vicuna commented Oct 12, 2017

Comment author: @ejgallego

I think my view here is more "things should work fine without any special options" than "let's get the best baseline performance".

I am also very interested in the results for the different flags!

vicuna commented Oct 13, 2017

Comment author: @xclerc

Would it make sense to have a heuristic choose
the algorithm used for register allocation?

(And then probably a -graphcoloring switch to
force the use of graph coloring.)

vicuna commented Oct 27, 2017

Comment author: @chambart

I made another PR, #1455, that disables CSE for the toplevel function.
This allows that file to compile in ~13 seconds on my laptop.
-Oclassic isn't really faster on that file, by the way. You can
further reduce the compilation time by ~4 seconds using -dflambda-no-invariants.
If you do, please keep the invariant checks enabled in your CI builds, as this
helps uncover compiler bugs. Also, there is no guarantee that this option will
stay.
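
One possible way to follow that advice is to keep two flag sets, one for CI and one for local builds; this is only a sketch (the file names and the flag split are assumptions, and nothing is compiled here):

```shell
# Two hypothetical flag sets: CI keeps flambda invariant checking (it
# helps uncover compiler bugs); local builds drop it for faster compiles.
{
  printf 'CI:    ocamlopt -O3 -c file.ml\n'
  printf 'local: ocamlopt -O3 -dflambda-no-invariants -c file.ml\n'
} > flag_sets.txt
cat flag_sets.txt
```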

vicuna commented Oct 27, 2017

Comment author: @ejgallego

Thank you very much Pierre, we will test the new PR ASAP.

vicuna commented Sep 10, 2018

Comment author: @ejgallego

So far Coq has been working pretty well with flambda and 4.07.0 for a little more than a month; as far as we are concerned, this issue can be closed.

vicuna commented Sep 10, 2018

Comment author: @gasche

Great! Marking this as resolved.

(Can you give a representative (by which I mean "any you like among those you have seen in practice") number on the runtime speedup and compilation-time slowdown related to your particular preferred flambda settings?)

vicuna closed this Sep 10, 2018

vicuna commented Sep 10, 2018

Comment author: @ejgallego

(Can you give a representative (by which I mean "any you like among those you have seen in practice") number on the runtime speedup and compilation-time slowdown related to your particular preferred flambda settings?)

We are still in the process of gathering more data (*). I think that
typical speedups for Coq projects, with Coq compiled with -O3
-unbox-closures, are on the order of 0-15%. Relevant projects such as
Iris / LambdaRust and mathcomp are closer to 15%, IIRC.

The speedup comes for free, as there is no OCaml compilation involved.

Compilation time in GitLab CI typically doubles, but a large part of
that is due to the "native compute" overhead, which could be saved for
most users: https://gitlab.com/coq/coq/pipelines/29809722/builds

Incidentally, a "bench" should appear here soon:
https://ci.inria.fr/coq/job/benchmark-part-of-the-branch/512 ;
however, it will run on 4.06.1, as batteries is not yet available
for 4.07.

(*) I'd like to have coq_dune ready before updating my own little
bench scripts, so that's the bench timeline. I'll post the data as
soon as I get it, cheers.
