Skip to content

Conversation

xclerc
Copy link
Contributor

@xclerc xclerc commented Jun 2, 2022

Overview

The code is mainly a naive/direct implementation/translation of the
paper (https://dl.acm.org/doi/abs/10.1145/229542.229546), except
the freeze_moves function which is actually taken from the book
(https://www.cs.princeton.edu/~appel/modern/ml/).
Small/trivial additions to the paper include:

  • handling of register classes, by using per-class values for k
    (an early version was adding interference between registers of
    different classes, which obviously results in a denser graph)
  • use of Proc.destroyed_at_xyz
  • handling of written-but-never-read temporaries
  • a "naive" splitting, whose objective is to introduce a number of
    spill/reload moves linear in the number of destruction points, rather
    than linear in the number of read/writes when a temporary is spilled
  • if there are destruction points in the function, rewrite is called
    before main (indeed, otherwise the first round of the loop will
    basically discover that the values live at the destruction points
    needs to be spilled, then call rewrite before launching a second
    round -- we hence essentially save one round by calling rewrite
    beforehand when we know temporaries will be spilled)
  • lots of log statements

Code

New modules:

  • Cfg_regalloc_utils contains generic utility functions;
  • Cfg_irc_utils contains more utility functions, used specifically by
    the IRC allocator;
  • Cfg_irc_state defines the state of the IRC allocator; it is abstract
    so that it is (relatively) easy to change the representation (e.g.
    of the working sets/lists) with no impact on the core of the allocator;
  • Cfg_irc_split implements a pre-processing phase that performs
    a kind of "naive" or "pseudo" splitting at destruction points;
    essentially, if a register is live at a destruction point (currently only
    calls), we rewrite:
    destruction_point
    
    to:
    t' <- t
    destruction_point[t/t']
    t <- t'
    
  • Cfg_irc contains the core of the algorithm.

Modified modules:

  • Asmgen is tweaked to allow the selection of the register allocator,
    and collect metrics;
  • Cfg is augmented with utility functions;
  • Cfg_intf is tweaked so that two fields of instruction, namely
    arg and res, are now mutable, and a new field is added to store
    the work list the instruction belongs to;
  • Cfg_dataflow and Cfg_liveness are tweaked so that the transfer
    function take an additional parameter indicating whether the instruction
    can raise;
  • Cfg_equivalence is tweaked to disable the comparison of live set on
    certain instructions;
  • Emit is augmented to collect metrics (number of instructions, moves,
    spills and reloads);
  • Reg is tweaked so that a register now holds its class, its work list,
    its color, and its alias.

Use

At this point, the register allocators are enabled and controlled by
environment variables. REGISTER_ALLOCATOR selects the register
allocator:

  • if the variable is unset, the existing register allocator will be used;
  • if the variable is set to irc, then the IRC allocator will be used
    (note that case is ignored).

All allocators can use the REGALLOC_STATS variable to enable the production
of metrics. When the variable is set, its value is interpreted as the path to
a directory where quasi-CSV files will be produced. One file will be produced
by compiler invocation, and its base name will be the MD5 digest of the
command-line parameter. The first line of the file will contain the command-line
arguments, while the other lines are CSV contents. Each CSV line contains the
various metrics for a given function.

The IRC allocator can be further controlled through the following variables:

  • IRC_VERBOSE to output a lot of debug statements;
  • IRC_INVARIANTS to check invariants at each step of the process;
  • IRC_SPLIT to select the kind of splitting to apply before register
    allocation proper (off for no splitting and naive for the pseudo-
    or naive splitting described above);
  • IRC_SPILLING_HEURISTICS to select the heuristics to use when
    choosing a register to spill (set-choose to randomly choose a
    register and flat-uses to choose the least-used register).

Generated code

The current version of IRC is worse than the current register allocator
in at least two areas: spilling code and memory operands.

The original article will produce a spill (resp. reload) for each write (resp. read)
use of a register that is being spilled (this is the behaviour with IRC_SPLIT set
to off). The naive splitting will produce a spill/reload for each destruction
point of a register that is being spilled (this is the behaviour with IRC_SPLIT
set to naive). Both are worse than the split pass implemented by the existing
register allocator.

The logic from the Reload module is not reused / reimplemented, which means
that all operands are in hardware register and that we do not take advantage of
the fact that some operations can accept one operand on the stack. Put
differently, the current implementation always reloads the spilled value into a
register before the operation while the existing register allocator can use the
value directly from the stack.

Tests and benchmarks

The IRC allocator has been successfully tested (building the distribution,
and running the test suite from upsteam) with closure, flambda and flambda2.

More numbers will be collected in the upcoming weeks, but here are some
rough numbers collected while building the standard library:

splitting one round all
off 0.81-0.86 1.10-1.11
naive 1.18-1.20 3.65-3.70

(the values above are the irc_time / upstream_time ratios)
("one round" means that the value is about the functions for which
the register allocator needs only one round, while "all" is about all
functions)

Still on the standard library, if we break down per function and look
at the statistics, we get:

splitting stat one round all
off min 0.28-0.32 0.20-0.38
off mean 0.86-0.89 0.97-0.99
off max 11.60-15.49 11.60-15.49
naive min 0.20-0.26 0.18-0.22
naive mean 0.99-1.00 1.09-1.10
naive max 19.11-29.7 31.53-31.99

The total build times for the compiler distribution (thus not only
measuring register allocation) are:

  • with existing allocator: 1
  • with IRC and no splitting: 1.04
  • with IRC and naive split: 1.05

TO DO

  • check whether the Reg.clas field is really useful
  • check whether storing the liveness in the instructions would
    make a material difference
  • work out what could be gained from other representation (both
    for the IRC state and CFG values)
  • implement the "hierarchical" heuristics for spilling
  • implement proper spilling
  • implement the equivalent of (or simply reuse) Reload so that
    we do not always reload a spilled value into a register before an
    operation that can accept an operand from the stack

@xclerc
Copy link
Contributor Author

xclerc commented Jun 2, 2022

(Caveat: the elements in the "Tests and benchmarks"
section were collected before rebasing from a1d36c7
to 9072c5d.)

@mshinwell mshinwell changed the title IRC (draft implementation) IRC, iterated register coalescing (draft implementation) Jun 3, 2022
@xclerc xclerc marked this pull request as ready for review June 28, 2022 13:39
Copy link
Contributor

@gretay-js gretay-js left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the code of this PR, alongside the Tiger book, it looks good to me. I think this PR should be merged as is, after it is rebased and CI tests for cfg reenabled (they should work now, since PR#656 turned off equivalence checking by default). Further comments should be addressed in separate PRs.

I have a few minor questions in the diff of the PR, but they shouldn't block this from being merged. Additionaly, suggestions / questions for later:

  • Create a separate directory for regalloc (either under cfg or next to it, not sure)
  • For testing, can we run a version of reload pass once to guarantee that the constrains that Emit relies on are satisfied after IRC. This requires reimplementing the logic of reload on cfg.
  • Separate the stats from Emit. Add a separate pass.
  • Stats: spills in upstream are either Ispill or Imove with res on
    stack. Similarly for reloads. It may be useful to separate Imove to
    spill and reload.
  • What will ocaml-ir do if a pass is missing. e.g, Compiler_hooks.Mach_spill is never called in the IRC pipeline?
  • Is it possible to refactor Asmgen.compile_fundecl to move stats out of it into a helper function, or somehow tidy up the code, it's a bit hard to follow (maybe it'll go away).
  • In cfg_liveness: has_an_exceptional_successor why is it needed?
  • document magic number "16" in cfg_dataflow
  • cfg_equivalence, check_live: can you please add a comment why live is
    not checked for poptrap/pushtrap/prologue. These instructions don't exists in Mach. Do they get wrong made-up values somehow in Linear_to_cfg or Cfgize?
  • if we have mutable fields in Cfg.instruction we should make live mutable too again and see if it helps.
  • remove_prologue_if_not_required : why does it need to be added on always, instead of adding it on after regalloc and only if required? (there was something about it being after debug related instructions).
  • cfg_regalloc_utils.ml: maybe rename Fetch -> Load
  • Stats separate module
  • Cfg_regalloc_utils Instruction: something is wrong in our Cfg interface if we need this.
  • Cfg_state: common interface to the various worklists?
  • Is there a check that there is no edge between two Reg.t that are precolored in different colors?
  • Assert the item is in the expected worklist before removing it (e.g., remove_spill_work_list).

@poechsel
Copy link
Contributor

poechsel commented Jul 7, 2022

What will ocaml-ir do if a pass is missing. e.g, Compiler_hooks.Mach_spill is never called in the IRC pipeline?

Nothing apart from outputting an error message. I don't think many people rely on OCaml-IR supporting Mach_spill so I'm fine with IRC skipping Mach_spill.

Copy link
Contributor

@gretay-js gretay-js left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the CI tests with -ocamlcfg still fail after the rebase.
Is it possible that this line went missing in the rebase:
https://github.com/ocaml-flambda/flambda-backend/blob/d1ec75ab7892b2c814ee340303d94fc6f6e0eb40/backend/asmgen.ml#L349

Copy link
Collaborator

@mshinwell mshinwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving the changes to dune only.

Copy link
Contributor

@gretay-js gretay-js left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI tests for ocamlcfg pass.

@gretay-js gretay-js mentioned this pull request Aug 4, 2022
@xclerc xclerc merged commit 6d40006 into main Aug 4, 2022
@xclerc xclerc mentioned this pull request Aug 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants