-
Notifications
You must be signed in to change notification settings - Fork 116
IRC, iterated register coalescing (draft implementation) #678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the code of this PR, alongside the Tiger book, it looks good to me. I think this PR should be merged as is, after it is rebased and CI tests for cfg reenabled (they should work now, since PR#656 turned off equivalence checking by default). Further comments should be addressed in separate PRs.
I have a few minor questions in the diff of the PR, but they shouldn't block this from being merged. Additionaly, suggestions / questions for later:
- Create a separate directory for regalloc (either under cfg or next to it, not sure)
- For testing, can we run a version of reload pass once to guarantee that the constrains that Emit relies on are satisfied after IRC. This requires reimplementing the logic of reload on cfg.
- Separate the stats from Emit. Add a separate pass.
- Stats: spills in upstream are either Ispill or Imove with res on
stack. Similarly for reloads. It may be useful to separate Imove to
spill and reload. - What will ocaml-ir do if a pass is missing. e.g, Compiler_hooks.Mach_spill is never called in the IRC pipeline?
- Is it possible to refactor Asmgen.compile_fundecl to move stats out of it into a helper function, or somehow tidy up the code, it's a bit hard to follow (maybe it'll go away).
- In cfg_liveness: has_an_exceptional_successor why is it needed?
- document magic number "16" in cfg_dataflow
- cfg_equivalence, check_live: can you please add a comment why live is
not checked for poptrap/pushtrap/prologue. These instructions don't exists in Mach. Do they get wrong made-up values somehow in Linear_to_cfg or Cfgize? - if we have mutable fields in Cfg.instruction we should make live mutable too again and see if it helps.
- remove_prologue_if_not_required : why does it need to be added on always, instead of adding it on after regalloc and only if required? (there was something about it being after debug related instructions).
- cfg_regalloc_utils.ml: maybe rename Fetch -> Load
- Stats separate module
- Cfg_regalloc_utils Instruction: something is wrong in our Cfg interface if we need this.
- Cfg_state: common interface to the various worklists?
- Is there a check that there is no edge between two Reg.t that are precolored in different colors?
- Assert the item is in the expected worklist before removing it (e.g., remove_spill_work_list).
Nothing apart from outputting an error message. I don't think many people rely on OCaml-IR supporting Mach_spill so I'm fine with IRC skipping Mach_spill. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that the CI tests with -ocamlcfg still fail after the rebase.
Is it possible that this line went missing in the rebase:
https://github.com/ocaml-flambda/flambda-backend/blob/d1ec75ab7892b2c814ee340303d94fc6f6e0eb40/backend/asmgen.ml#L349
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving the changes to dune
only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI tests for ocamlcfg pass.
Overview
The code is mainly a naive/direct implementation/translation of the
paper (https://dl.acm.org/doi/abs/10.1145/229542.229546), except
the
freeze_moves
function which is actually taken from the book(https://www.cs.princeton.edu/~appel/modern/ml/).
Small/trivial additions to the paper include:
k
(an early version was adding interference between registers of
different classes, which obviously results in a denser graph)
Proc.destroyed_at_xyz
spill/reload moves linear in the number of destruction points, rather
than linear in the number of read/writes when a temporary is spilled
rewrite
is calledbefore
main
(indeed, otherwise the first round of the loop willbasically discover that the values live at the destruction points
needs to be spilled, then call
rewrite
before launching a secondround -- we hence essentially save one round by calling
rewrite
beforehand when we know temporaries will be spilled)
Code
New modules:
Cfg_regalloc_utils
contains generic utility functions;Cfg_irc_utils
contains more utility functions, used specifically bythe IRC allocator;
Cfg_irc_state
defines the state of the IRC allocator; it is abstractso that it is (relatively) easy to change the representation (e.g.
of the working sets/lists) with no impact on the core of the allocator;
Cfg_irc_split
implements a pre-processing phase that performsa kind of "naive" or "pseudo" splitting at destruction points;
essentially, if a register is live at a destruction point (currently only
calls), we rewrite:
Cfg_irc
contains the core of the algorithm.Modified modules:
Asmgen
is tweaked to allow the selection of the register allocator,and collect metrics;
Cfg
is augmented with utility functions;Cfg_intf
is tweaked so that two fields ofinstruction
, namelyarg
andres
, are now mutable, and a new field is added to storethe work list the instruction belongs to;
Cfg_dataflow
andCfg_liveness
are tweaked so that the transferfunction take an additional parameter indicating whether the instruction
can raise;
Cfg_equivalence
is tweaked to disable the comparison of live set oncertain instructions;
Emit
is augmented to collect metrics (number of instructions, moves,spills and reloads);
Reg
is tweaked so that a register now holds its class, its work list,its color, and its alias.
Use
At this point, the register allocators are enabled and controlled by
environment variables.
REGISTER_ALLOCATOR
selects the registerallocator:
irc
, then the IRC allocator will be used(note that case is ignored).
All allocators can use the
REGALLOC_STATS
variable to enable the productionof metrics. When the variable is set, its value is interpreted as the path to
a directory where quasi-CSV files will be produced. One file will be produced
by compiler invocation, and its base name will be the MD5 digest of the
command-line parameter. The first line of the file will contain the command-line
arguments, while the other lines are CSV contents. Each CSV line contains the
various metrics for a given function.
The IRC allocator can be further controlled through the following variables:
IRC_VERBOSE
to output a lot of debug statements;IRC_INVARIANTS
to check invariants at each step of the process;IRC_SPLIT
to select the kind of splitting to apply before registerallocation proper (
off
for no splitting andnaive
for the pseudo-or naive splitting described above);
IRC_SPILLING_HEURISTICS
to select the heuristics to use whenchoosing a register to spill (
set-choose
to randomly choose aregister and
flat-uses
to choose the least-used register).Generated code
The current version of IRC is worse than the current register allocator
in at least two areas: spilling code and memory operands.
The original article will produce a spill (resp. reload) for each write (resp. read)
use of a register that is being spilled (this is the behaviour with
IRC_SPLIT
setto
off
). The naive splitting will produce a spill/reload for each destructionpoint of a register that is being spilled (this is the behaviour with
IRC_SPLIT
set to
naive
). Both are worse than the split pass implemented by the existingregister allocator.
The logic from the
Reload
module is not reused / reimplemented, which meansthat all operands are in hardware register and that we do not take advantage of
the fact that some operations can accept one operand on the stack. Put
differently, the current implementation always reloads the spilled value into a
register before the operation while the existing register allocator can use the
value directly from the stack.
Tests and benchmarks
The IRC allocator has been successfully tested (building the distribution,
and running the test suite from upsteam) with closure, flambda and flambda2.
More numbers will be collected in the upcoming weeks, but here are some
rough numbers collected while building the standard library:
(the values above are the irc_time / upstream_time ratios)
("one round" means that the value is about the functions for which
the register allocator needs only one round, while "all" is about all
functions)
Still on the standard library, if we break down per function and look
at the statistics, we get:
The total build times for the compiler distribution (thus not only
measuring register allocation) are:
TO DO
Reg.clas
field is really usefulmake a material difference
for the IRC state and CFG values)
Reload
so thatwe do not always reload a spilled value into a register before an
operation that can accept an operand from the stack