Whole program dead code elimination #608
Rather than making cmxa files largely redundant with cmx files, couldn't we get rid of cmxa files (e.g. by allowing link options in cmx files)? |
@let-def I agree and have a patch that does this as part of a larger "namespaces" proposal. |
Concerning @let-def's suggestion, it's exactly what Caml Light did back in the day. Then, someone objected on the basis of the following scenario: you have a library "foo" containing two modules, "foo_aux" and "foo". With the proposed approach, you first compile foo.ml, obtaining foo.cmx, then build the library foo.cmx, overwriting the previous foo.cmx file... Also, .cmx files describe .o files while .cmxa files describe .a files. What are you proposing? .cmx files that describe .a files? Get rid of .a files? |
Yes, offer a workflow without .a files. Libraries add one more level of names, I would like to be able to do without if possible. |
I had assumed @let-def was suggesting allowing to use a directory filled with |
@chambart I'm curious to see the gain that this could have on some MirageOS binaries. Is there a way to turn |
@let-def why not, but this probably does not fit in this PR (which is already large enough). Currently, besides link information, there is also no way to tell from cmx files alone that a file should be linked only if used. This matters if there is some initialization code. We could of course also add a flag to the cmx files to say something like that. @samoht you can use |
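To illustrate why "linked only if used" interacts with initialization code, here is a minimal sketch. The module names `Registry` and `Plugin` are hypothetical and not taken from the PR; they are written as submodules only so that the example fits in one file. Nothing refers to `Plugin` by name, yet its toplevel code has an observable effect, so dropping it from the link would change the program's behaviour.

```ocaml
(* Hypothetical modules, for illustration only. *)
module Registry = struct
  let handlers : (string, unit -> string) Hashtbl.t = Hashtbl.create 8

  let register name f = Hashtbl.replace handlers name f

  (* Sorted list of all registered handler names. *)
  let names () =
    Hashtbl.fold (fun name _ acc -> name :: acc) handlers []
    |> List.sort compare
end

module Plugin = struct
  (* Initialization code: it runs when the module is initialized,
     even though no other code mentions Plugin. *)
  let () = Registry.register "plugin" (fun () -> "hello from plugin")
end

let () = List.iter print_endline (Registry.names ())
```

If `Plugin` were a separate compilation unit linked "only if used", the registration would silently disappear, which is the situation a per-file flag would have to describe.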
The last patch adds a minimal optimization round when linking, which removes some more references to toplevel modules in situations containing something like:
module A
module B
When concatenating A and B, B still maintains a reference to camlA, but it could have been redirected to A.a. This usually does not matter, since reaching the value in a field of one symbol or another has the same performance cost; hence there is no information propagated in the cmx file to know that A.camlA.(0) is an alias of A.a.(0). It now matters. This alias is something that inline_and_simplify knows about, so running it after concatenation makes it possible to remove all references to camlA. In practice this can appear, for instance, if we use stdin, which forces the whole Pervasives module to stay alive. A quite extreme example, with this pass and a sufficiently aggressive inlining option (
Is in clambda (this is after un-anf, hence constants are not shown)
and the constants are
Notice that the majority of the code is related to do_at_exit |
@chambart any chance that you could fix the 4.03 description and add your new compiler to opam-repository? :-) (if not, I'll try to do this next week) |
Did you consider an actual "whole program optimizer" based on flambda? I can imagine such a compiler loading e.g. |
@alainfrisch I think we were all thinking about this and hoping for it. Whole-program compilation is the 'holy grail' in terms of optimization potential. It would be nice to introduce optimizations here (like type-specialization of functions and optimized type representation, which would provide serious performance boosts) and then slowly let some of them leak out to the open-universe case. |
@alainfrisch We haven't really thought about this much, but we're intending to spend some amount of time later this year trying to significantly improve compilation speed at Jane Street, and one thing we're considering is stopping the compiler earlier for a "type-check only" mode (to give fast feedback) with delayed "background" output of object files after that. I think this would fit in with what you propose. I will undertake to review this patch. |
(Also, I'm not sure "-lto" is badly named. It describes what is going on and is nearly the same as the corresponding GCC option for the same thing.) |
Some of the optimizations I would like will require changes to the OCaml GC – specifically, the ability to have blocks in the heap that contain a mixture of pointers and non-pointers. This would allow for floats, Even more far-out, if type-specialization makes most functions operate on unboxed, specialized data (such as machine ints, doubles, and pointers), OCaml might benefit from an LLVM backend. But that is a ways off. |
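As background for the point about mixed blocks, the sketch below uses the standard `Obj` module to inspect how the stock runtime lays out two records. The record types are hypothetical, and `Obj` is used here for inspection only. An all-float record is stored flat, whereas a float sitting next to an int is boxed behind a pointer; that indirection is what blocks mixing pointers and non-pointers would avoid.

```ocaml
(* Hypothetical record types, for illustration only. *)
type all_floats = { x : float; y : float }
type mixed = { count : int; weight : float }

let p = Obj.repr { x = 1.0; y = 2.0 }
let m = Obj.repr { count = 3; weight = 4.0 }

(* A record made only of floats is a flat block of unboxed doubles. *)
let float_record_is_flat = Obj.tag p = Obj.double_array_tag

(* In a mixed record, the int is stored as an immediate value... *)
let count_is_immediate = Obj.is_int (Obj.field m 0)

(* ...but the float is a pointer to a separately allocated boxed double. *)
let weight_is_boxed =
  let w = Obj.field m 1 in
  Obj.is_block w && Obj.tag w = Obj.double_tag

let () =
  Printf.printf "flat float record: %b, immediate int: %b, boxed float: %b\n"
    float_record_is_flat count_is_immediate weight_is_boxed
```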
@alainfrisch without changing anything else in the toolchain, it is probably sufficient for you to build with This would benefit from a reasonably fast build for each file and have the same ability as bytecode not to rebuild the whole world when you change a file. The performance won't be marvelous of course. Currently this patch does not run the optimization passes when linking, but if you consider that workflow useful, I can change that. This requires some changes, as the passes do not expect symbols declared from different compilation units and will complain about that. |
@chambart If I understand correctly, you claim that most of the overhead of -Oclassic compared to the legacy pipeline is due to cross-module optimizations. Is this only your intuition, or has this been empirically confirmed? |
@chambart I'm getting
when I am trying your PR (but using |
@alainfrisch no, I claim that what you gain from not recompiling everything when you change a given file is bigger than the overhead of flambda. This of course requires that your build system is able to see that the cmx file didn't change. |
@samoht no. I'll look into this. |
Latest command shown in the logs is:
It's on OSX 10.11 if that makes a difference... |
Ok understood. (But in fast dev mode, I already compile with -opaque.) |
@samoht if you want to test again, I fixed a few problems |
@chambart @samoht per discussion with @samoht just now, I gave the
FWIW, possibly related,
If there's an easy fix / something I can try to get the build going / you can point me to what I should try pinning and building locally, I'll try again later today :) |
@mor1, I did not seriously try packs, I wouldn't be surprised that I missed something in the handling of packs. The reason something is failing in ocamlbuild is probably due to it being the deepest package using pack in the package dependency tree. I'll try to add some tests for packs and lto in the testsuite. |
@mor1 I fixed a few things for pack, this should be better now. By the way, I wanted to try |
Thanks-- I'll give it a try (may not be for a few days as travelling). I haven't tried Is there an easy way to try out your updates? Do I just use the same switch as before? |
Yes, `opam switch reinstall 4.04.0+forced_lto` should do the work (no need to `opam update` before) |
@samoht I just rebased and updated for 4.05. I think otherwise the status is still the same: we need some real-world tests to validate. |
So I tried this in the context of
let () =
  Printf.printf "%b" (Uucp.White.is_white_space (Uchar.of_int 0x0020));
  ()
The results are as follows:
|
Here's a hello world
|
I tried this on a large executable which measured 220Mb in size when compiled with 4.03+flambda. Using 4.04+flambda with LTO enabled, the executable reduced to 105Mb in size. It took more than ten minutes to link, which seems excessive. The ocamlopt.opt memory consumption, which I think was around 4Gb, is probably more reasonable as there is a lot of code involved here.

I have not yet investigated whether there are further opportunities for dead code elimination. I was hoping for a larger reduction in size, so there may be something about the code that prevents it. I will try to build it in bytecode so I can see what reduction is obtained by

Examination of the 220Mb executable has also revealed two problems with ELF string tables. These totalled 80Mb (!). Firstly, especially when built without dynlink support, we shouldn't be having all of the symbols in the dynamic symbol table as well as the normal one. I think the ELF "hidden" visibility support may fix this; I will investigate and submit a pull request. Secondly, we shouldn't be generating such verbose symbol names; some of them may also point at duplicate copies of code. We will look at this in due course. |
Any chance to update these patches to 4.06 and/or 4.07? |
You mean 4.08, right? |
Or 4.08 indeed. But just having an opam switch for 4.06+lto and/or 4.07+lto would already be great :-) |
So just to report some numbers for that PR, when compiling the hello world unikernel (and using For a Unix application:
For a xen (self-contained) virtual machine image:
But there is an issue with the solo5 backends, as we are cross-compiling the runtime (using ocaml-freestanding) and we are just installing the new
So the numbers are really great, but I am not sure how to fix the last error :-) |
And the port of the PR to 4.06.1 is done here: https://github.com/well-typed-lightbulbs/ocaml-esp32/tree/4.06.1+lto (thanks to @TheLortex). 4.06.1+lto is now in opam |
A game server with containers, lwt and a few more dependencies on 4.06.1+lto:
Native with use-lto: 11M
Looks like more optimisations are possible, but the reduction in size is already quite significant. Would love to see this being moved forward. Compilation time increased from less than a second to 8 seconds, while On another binary that usually weighs 41M I got a stack overflow. |
@copy you can increase the native stack size. Does that prevent the stack overflow? |
@DemiMarie Indeed that fixed the problem, here's another binary built with async and core: Native with use-lto: 23M |
The notion of whole-program optimization and dead-code elimination seems relevant to compiling to WebAssembly. Right after I posted that, in order to use unboxed native 32-bit and 64-bit integers on the WebAssembly GC, we might be able to make a linker that does dead-code elimination and code-specialization (to emit precise types for the WebAssembly GC heap), I found this. So, I take this as a data point that suggests that it is possible to do dead-code elimination, and that it is probably feasible to emit WebAssembly modules with additional information to monomorphize (to some extent) and do dead-code elimination at link time. Edit: looks like monomorphization is actually not possible, but the rest still applies. 😄 |
@chambart was there anything blocking this PR? I rebased it to 4.10 and got some nice reductions on Mirage with flambda, from 3.8mb to 1.3mb. There are some small performance regressions on mine, mostly because I'm actually running the entire flambda pipeline on the whole program instead of just the cleaning pass. |
* Replace Ladjust_trap_depth with Ladjust_stack_offset. Express stack offset directly in bytes instead of number of traps in this pseudo-instruction, to handle differences between consecutive blocks due to Istackoffset and block reordering.
* Rename trap_depth field of Cfg block and instruction to stack_offset. Initialize it at 0 not 1. Remove some comments that are no longer relevant. Fix printing of the modified field.
* Update stack_offset after Istackoffset
* Fix cfgize. Remove exceptional_successor, propagate traps to handler blocks eagerly.
* Format
* Compute can_raise_interproc from block.exn instead of trap_depth
* Simplify cfgize
* Address review comments
I propose in this PR to add another link mode to the `ocamlopt` compiler with flambda enabled.

There is a new `-lto` option to the compiler to mark when a file should export sufficient information for the link. When all the `.cmx` and `.cmxa` files of a project are built with this option, it is possible to link with it too. When linking, all the flambda information is concatenated and goes through a dead code elimination pass that removes all unreferenced symbols (but keeps effects) and builds a new big object file containing the whole program.

Of course, this prevents using dynlink in this program, as the modules referenced by the loaded module might have been eliminated. It may be possible to provide a mode where some interfaces are requested, and only those would be available for a dynlinked module, but this is not implemented yet.
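The elimination step described above can be pictured with a toy model. This is only a sketch under assumptions: the `item` record, the `has_effect` flag, and the `eliminate` function are hypothetical and do not correspond to flambda's actual data structures. A definition is kept if it is reachable from a root or from a definition whose evaluation has an effect; everything else is dropped.

```ocaml
module S = Set.Make (String)

(* Hypothetical stand-in for a toplevel definition in the linked program. *)
type item = {
  name : string;
  refs : string list;   (* symbols this definition mentions *)
  has_effect : bool;    (* true if evaluating it has an observable effect *)
}

(* Keep every item reachable from [roots] or from an effectful item. *)
let eliminate ~roots (program : item list) : item list =
  let find name = List.find_opt (fun it -> it.name = name) program in
  let rec visit live name =
    if S.mem name live then live
    else
      match find name with
      | None -> live (* symbol defined outside the program: nothing to keep *)
      | Some it -> List.fold_left visit (S.add name live) it.refs
  in
  let effectful =
    List.filter_map
      (fun it -> if it.has_effect then Some it.name else None)
      program
  in
  let live = List.fold_left visit S.empty (roots @ effectful) in
  List.filter (fun it -> S.mem it.name live) program

let program = [
  { name = "print_banner"; refs = ["banner"]; has_effect = true };
  { name = "banner"; refs = []; has_effect = false };
  { name = "unused_table"; refs = ["helper"]; has_effect = false };
  { name = "helper"; refs = []; has_effect = false };
  { name = "main"; refs = ["helper2"]; has_effect = false };
  { name = "helper2"; refs = []; has_effect = false };
]

let kept = List.map (fun it -> it.name) (eliminate ~roots:["main"] program)

let () = print_endline (String.concat ", " kept)
```

In this model `unused_table` and `helper` are dropped because nothing live refers to them, while `print_banner` survives without being referenced because removing it would change what the program does.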
No optimizations specific to whole-program compilation are applied yet.
Outside of the undynlinkability, the other drawbacks are:
for this mode, they must contain the whole code of the included modules since the compiler
might not have access to the cmx files while linking.
Overall compilation time does not change significantly, but link time of course increases a lot when comparing builds without `-lto` and with `-lto`.
The effect is not wonderful on the compiler itself: the size of ocamlc.opt decreases by ~10% (there is not much dead code there), but on some extreme examples we can get quite a lot. There are still some cases where this does not eliminate as much as expected.
This patch is based on #602; only the commits after 88c2c8c (Also remove linking hack for bytecode) are relevant. That other PR is needed to allow removing unneeded toplevel modules.
Note that the name `-lto`, or 'link time optimization', is quite badly chosen. Please suggest a better option.