Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

moving a function into an internal module slows down its use #5546

Closed
vicuna opened this issue Mar 20, 2012 · 14 comments
Closed

moving a function into an internal module slows down its use #5546

vicuna opened this issue Mar 20, 2012 · 14 comments
Assignees
Milestone

Comments

@vicuna
Copy link

@vicuna vicuna commented Mar 20, 2012

Original bug ID: 5546
Reporter: @lefessan
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2015-12-11T18:08:28Z)
Resolution: fixed
Priority: normal
Severity: feature
Version: 4.00.0+dev
Target version: 4.00.2+dev
Fixed in version: 4.01.0+dev
Category: back end (clambda to assembly)
Related to: #4074 #5537 #5573
Child of: #7630
Monitored by: @bobzhang @gasche ronan.lehy@gmail.com mehdi "Pascal Cuoq" dario @ygrek @glondu @jmeber @hcarty @alainfrisch

Bug description

Toplevel functions and functions defined in submodules are compiled differently. The first ones can be accessed directly as globals (when their closure is static), whereas the later ones are never accessed directly (their closure always contains the environment, so is not static). As a consequence, moving a function from toplevel to a submodule can decrease performances dramatically.

This can be easily fixed by using the same compilation mode for toplevel functions and submodules values (see patch for trunk attached).

File attachments

@vicuna
Copy link
Author

@vicuna vicuna commented Mar 20, 2012

Comment author: @alainfrisch

My guess is that this patch (using "transl_store" compilation scheme also for sub-modules) has the potential of avoiding some bad cases with the register allocator (because it can reduce the dramatically the size of the interference graph). Did you observe such compile-time improvements?

@vicuna
Copy link
Author

@vicuna vicuna commented Mar 26, 2012

Comment author: @alainfrisch

Ok, let's consider the following:

let f1 x = x
let f2 x = x
...
let fN x = x

Compilation time (ocamlopt on trunk, with or without the patch; in seconds):

N=
 125           0.07
 250           0.10
 500           0.20
 1000          0.32
 2000          0.66
 4000          1.26
 8000          2.62

Now, let's consider the following variant:

module X = struct
let f1 x = x
let f2 x = x
...
let fN x = x
end

Compilation time:

N=      without patch  with patch
 125           0.25           0.07
 250           0.89           0.10
 500          14.14           0.30
 1000         63.90           0.60
 2000        312.50           1.17
 4000           ...           2.45
 8000           ...           4.75

The patch avoids a quadratric behavior of ocamlopt. There is still a factor of 2 between the nested and non-nested version, which could be worth investigating.

@vicuna
Copy link
Author

@vicuna vicuna commented Mar 27, 2012

Comment author: @alainfrisch

What about always using transl_store for compiling structures (and get rid of transl_module)? Even for inline functor arguments, local modules, etc. And also in bytecode. This would probably not improve runtime performances, but it simplifies the code (only one compilation scheme) and the life of the register allocator.

Fabrice: do you have benchmarks showing the runtime performance benefits of your patch?

@vicuna
Copy link
Author

@vicuna vicuna commented Mar 29, 2012

Comment author: @alainfrisch

What about always using transl_store for compiling structures

I knew I did such a patch, but I completely forgot I actually submitted it: #4074. Note that even hand-written code benefits from the patch (cf compilation time for some Camlp4 module).

@vicuna
Copy link
Author

@vicuna vicuna commented Jul 4, 2012

Comment author: @alainfrisch

This is an intrusive, potentially risky patch, which does not address a bug. I don't see any reason to delay 4.00 for it.

@vicuna
Copy link
Author

@vicuna vicuna commented Oct 22, 2012

Comment author: @yakobowski

Does anyone have a version of this patch that applies cleanly to a recent OCaml? (My own merge efforts failed spectacularly...) I would like to see if the slowdown is visible on Frama-C's codebase.

@vicuna
Copy link
Author

@vicuna vicuna commented Nov 21, 2012

Comment author: Tom Hvitved

I have experienced this problem in practice, where compile time has increased from 1 minute to 5 minutes as the result of adding extra values to a module (the module contains approx. 2000 values, and it will continue to increase). This patch is therefore highly anticipated by me, and my colleagues.

Tom Hvitved
SimCorp A/S

@vicuna
Copy link
Author

@vicuna vicuna commented Nov 21, 2012

Comment author: @ygrek

IIUC this issue concerns the runtime penalty, not compile time..

@vicuna
Copy link
Author

@vicuna vicuna commented Nov 21, 2012

Comment author: @alainfrisch

this issue concerns the runtime penalty, not compile time..

It addresses both, as the timings I posted on the second note shows.

@vicuna
Copy link
Author

@vicuna vicuna commented Nov 29, 2012

Comment author: @alainfrisch

I've attached an updated version of the patch, compatible with the current trunk. My initial experiments confirm that the patch improves accesses to deeply nested functions (probably solving #5537), but not to nested variables (#5573). To support this, we probably need to do something similar to the patch I proposed in #4074. Moreover, this patch does not apply to internal structured (local module, functor body), so the example given in #4074 is not fixed.

@vicuna
Copy link
Author

@vicuna vicuna commented Nov 29, 2012

Comment author: @alainfrisch

I've tested the impact of this patch on our global unit test suite. I cannot observe any noticeable impact on runtime. That said, memory consumption is reduced for many tests (typically a 3% gain on the number of allocated bytes for many tests). I don't immediately understand the reason. Since each test is relatively short-lived (and runs in a stand-alone process), this reduced pressure on the GC does not translate into observable runtime improvements. Maybe with a long running application, this would be more visible.

@vicuna
Copy link
Author

@vicuna vicuna commented Nov 30, 2012

Comment author: @alainfrisch

I also confirm that the patch fixes the performance issue for SimCorp reported by Tom.

@vicuna
Copy link
Author

@vicuna vicuna commented Dec 13, 2012

Comment author: @alainfrisch

The patch as been pushed to the trunk (rev 13130).

@vicuna
Copy link
Author

@vicuna vicuna commented Dec 14, 2012

Comment author: @bobzhang

I confirm that I also have a noticable speedup, previously comping a native version of camlp4 takes 30s, not it goes down to 20s. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants