Mono JIT porting guide.
Paolo Molaro (lupus@ximian.com)
* Introduction
This document describes the process of porting the mono JIT
to a new CPU architecture. The new mono JIT has been designed
to make porting easier while still allowing a port to take
full advantage of the new architecture's features and
instructions. Knowledge of the mini architecture (described in
the mini-doc.txt file) is a requirement for understanding this
guide, as well as an earlier document about porting the mono
interpreter (available on the web site).
There are six main areas that a port needs to implement to
have a fully-functional JIT for a given architecture:
1) instruction selection
2) native code emission
3) call conventions and register allocation
4) method trampolines
5) exception handling
6) minor helper methods
To take advantage of some not-so-common processor features
(for example conditional execution of instructions as may be
found on ARM or ia64), it may be necessary to develop a
high-level optimization, but doing so is not a requirement for
getting the JIT to work.
We'll look at each of the required steps in more detail. Note,
though, that a new port may just as well start from a
cut&paste of an existing port to a similar architecture (for
example from x86 to amd64, or from powerpc to sparc).
The architecture specific code is split from the rest of the
JIT, for example the x86 specific code and data is all
included in the following files in the distribution:
	mini-x86.h mini-x86.c
	inssel-x86.brg
	cpu-pentium.md
	tramp-x86.c
	exceptions-x86.c
I suggest a similar split for other architectures as well.
Note that this document is still incomplete: some sections are
only sketched and some are missing, but the important info to
get a port going is already described.
* Architecture-specific instructions and instruction selection.
The JIT already provides a set of instructions that can be
easily mapped to a great variety of different processor
instructions. Sometimes it may be necessary or advisable to
add a new instruction that more closely represents an
instruction in the architecture. A mini instruction can also
be used to represent a short sequence of low-level CPU
instructions, but each mini instruction represents the minimum
amount of code the instruction scheduler will handle (i.e.,
the scheduler won't schedule the instructions that compose the
low-level sequence individually, but only the whole sequence,
as an indivisible block).
New instructions are created by adding a line in the
mini-ops.h file, assigning an opcode and a name. To specify
the input and output for the instruction, there are two
different places, depending on the context in which the
instruction gets used.
If the instruction is used in the tree representation, the
input and output types are defined by the BURG rules in the
*.brg files (the usual non-terminals are 'reg' to represent a
normal register, 'lreg' to represent a register or two that
hold a 64 bit value, and 'freg' for a floating point register).
If an instruction is used as a low-level CPU instruction, the
info is specified in a machine description file. The
description file is processed by the genmdesc program to
provide a data structure that can be easily used from C code
to query the needed info about the instruction.
As an example, let's consider the add instruction for both x86
and ppc:
x86 version:
	add: dest:i src1:i src2:i len:2 clob:1
ppc version:
	add: dest:i src1:i src2:i len:4
Note that the instruction takes two input integer registers on
both CPUs, but on x86 the first source register is clobbered
(clob:1) and the length in bytes of the instruction differs.
Note that integer adds and floating point adds use different
opcodes, unlike the IL language (a 64 bit add is done with two
instructions on 32 bit architectures, using an add that sets
the carry and an add with carry).
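The two-instruction 64 bit add mentioned above can be modelled in
plain C. This is only a sketch of the arithmetic: the function name
and the split into low/high words are illustrative assumptions, not
code from the JIT.

```c
#include <stdint.h>

/*
 * Adds two 64 bit values held as pairs of 32 bit words.
 * The first add produces the low word and the carry, the second
 * add consumes the carry (like add/adc on x86).
 */
static void
sketch_add64 (uint32_t a_lo, uint32_t a_hi, uint32_t b_lo, uint32_t b_hi,
	      uint32_t *res_lo, uint32_t *res_hi)
{
	uint32_t lo = a_lo + b_lo;          /* add that sets the carry */
	uint32_t carry = lo < a_lo ? 1 : 0; /* unsigned wrap-around means carry out */

	*res_lo = lo;
	*res_hi = a_hi + b_hi + carry;      /* add with carry */
}
```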
A specific CPU port may assign any meaning to the clob field
for an instruction since the value will be processed in an
arch-specific file anyway.
See the top of the existing cpu-pentium.md file for more info
on the other fields. The info may or may not be applicable to
a different CPU; if it is not, it can be ignored.
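To give an idea of how C code might query such a description, here
is a minimal sketch that stores the two add entries above in
tables. The struct layout, the enum and the helper are hypothetical
names made up for this example; the data structure that genmdesc
really generates is different.

```c
/* Hypothetical opcode numbering; the real opcodes live in mini-ops.h. */
enum { SK_OP_ADD, SK_OP_NUM };

/* One record per instruction, mirroring the fields of a .md line. */
typedef struct {
	char dest;          /* 'i' = integer register, 0 = none */
	char src1;
	char src2;
	char clob;          /* arch-specific; here '1' = src1 is clobbered */
	unsigned char len;  /* length in bytes of the emitted code */
} SketchInsDesc;

/* add: dest:i src1:i src2:i len:2 clob:1 */
static const SketchInsDesc sketch_x86_desc [SK_OP_NUM] = {
	[SK_OP_ADD] = { 'i', 'i', 'i', '1', 2 },
};

/* add: dest:i src1:i src2:i len:4 */
static const SketchInsDesc sketch_ppc_desc [SK_OP_NUM] = {
	[SK_OP_ADD] = { 'i', 'i', 'i', 0, 4 },
};

/* Returns nonzero if the first source register is overwritten. */
static int
sketch_clobbers_src1 (const SketchInsDesc *table, int opcode)
{
	return table [opcode].clob == '1';
}
```

A register allocator could call sketch_clobbers_src1 () to decide
whether the first source must be copied before the instruction.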
The code in mini.c together with the BURG rules in inssel.brg,
inssel-float.brg and inssel-long32.brg provides general
purpose mappings from the tree representation to a set of
instructions that should be easily implemented in any
architecture. To allow for additional arch-specific
functionality, an arch-specific BURG file can be used: in this
file arch-specific instructions can be selected that provide
better performance than the general instructions or that
provide functionality that is needed by the JIT but that
cannot be expressed in a general enough way.
As an example, x86 has the special instruction "push" to make
it easier to implement the default call convention (passing
arguments on the stack): almost all the other architectures
don't have such an instruction (and don't need it anyway), so
we added a special rule in the inssel-x86.brg file for it.
So, one of the first things needed in a port is to write a
cpu-$(arch).md machine description file and fill it with the
needed info. As a start, only a few instructions can be
specified, like the ones required to do simple integer
operations. The default rules of the instruction selector will
emit the common instructions and so we're ready to go for the
next step in porting the JIT.
* Native code emission
Since the first step in porting mono to a new CPU is to port
the interpreter, there should be already a file that allows
the emission of binary native code in a buffer for the
architecture. This file should be placed in the
mono/arch/$(arch)/
directory.
The bulk of the code emission happens in the mini-$(arch).c
file, in a function called mono_arch_output_basic_block ().
This function takes a basic block, walks the list of
instructions in the block and emits the binary code for each.
Optionally a peephole optimization pass is done on the basic
block, but this can be left for later, when the port actually
works.
This function is very simple: there is just a big switch on
the instruction opcode, and each case uses the functions or
macros that emit the binary native code. Note that in this
function the lengths of the instructions are used to determine
if the buffer for the code needs enlarging.
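The shape of such a function can be sketched as follows. Everything
here (the three toy opcodes, the instruction and buffer structs, the
length table) is an assumption for illustration; only the x86 byte
encodings of nop, int3 and ret are real.

```c
#include <stdlib.h>

/* Hypothetical opcodes and instruction node. */
enum { SK_NOP, SK_BREAK, SK_RET, SK_NUM_OPS };

typedef struct SketchIns {
	int opcode;
	struct SketchIns *next;
} SketchIns;

typedef struct {
	unsigned char *code;
	size_t len;   /* bytes emitted so far */
	size_t size;  /* bytes allocated */
} SketchBuf;

/* Maximum encoded length per opcode, as a machine description would give. */
static const unsigned char sketch_max_len [SK_NUM_OPS] = { 1, 1, 1 };

/* Walks the instruction list and emits code; returns 0 on success. */
static int
sketch_output_basic_block (SketchBuf *buf, const SketchIns *ins)
{
	for (; ins; ins = ins->next) {
		size_t max_len, new_size;

		if (ins->opcode < 0 || ins->opcode >= SK_NUM_OPS)
			return -1;
		max_len = sketch_max_len [ins->opcode];

		/* Use the instruction length to grow the buffer before emitting. */
		new_size = buf->size ? buf->size : 16;
		while (buf->len + max_len > new_size)
			new_size *= 2;
		if (new_size != buf->size) {
			unsigned char *p = realloc (buf->code, new_size);
			if (!p)
				return -1;
			buf->code = p;
			buf->size = new_size;
		}

		switch (ins->opcode) {
		case SK_NOP:   buf->code [buf->len++] = 0x90; break; /* nop */
		case SK_BREAK: buf->code [buf->len++] = 0xcc; break; /* int3 */
		case SK_RET:   buf->code [buf->len++] = 0xc3; break; /* ret */
		}
	}
	return 0;
}
```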
To complete the code emission for a method, a few other
functions need implementing as well:
	mono_arch_emit_prolog ()
	mono_arch_emit_epilog ()
	mono_arch_patch_code ()
mono_arch_emit_prolog () will emit the code to set up the stack
frame for a method, optionally call the callbacks used in
profiling and tracing, and move the arguments to their home
location (in a callee-saved register if the variable was
allocated to one, or in a stack location if the argument was
passed in a volatile register and wasn't allocated a
non-volatile one). Callee-saved registers used by the function
are saved in the prolog as well.
mono_arch_emit_epilog () will emit the code needed to return
from the function, optionally calling the profiling or tracing
callbacks. At this point the basic blocks or the code that was
moved out of the normal flow for the function can be emitted
as well (this is usually done to provide better info for the
static branch predictor). In the epilog, callee-saved
registers are restored if they were used.
Note that, to help exception handling and stack unwinding,
when there is a transition from managed to unmanaged code,
some special processing needs to be done (basically, saving
all the registers and setting up the links in the Last Managed
Frame structure).
When the epilog has been emitted, the upper level code
arranges for the buffer of memory that contains the native
code to be copied in an area of executable memory and at this
point, instructions that use relative addressing need to be
patched to have the right offsets: this work is done by
mono_arch_patch_code ().
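As an illustration of the kind of fixup involved, the sketch below
patches the 32 bit displacement of an x86-style relative call or
jump. The function name and the use of offsets inside a single
buffer are assumptions made for the example; this is not the code of
mono_arch_patch_code ().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * code:          final copy of the method's native code
 * disp_offset:   offset of the 4 displacement bytes inside code
 * target_offset: offset of the branch target inside code
 */
static void
sketch_patch_rel32 (unsigned char *code, size_t disp_offset, size_t target_offset)
{
	/* The displacement is relative to the end of the instruction,
	 * that is, to the byte following the 4 displacement bytes. */
	int32_t disp = (int32_t)((ptrdiff_t)target_offset - (ptrdiff_t)(disp_offset + 4));
	uint32_t bits = (uint32_t)disp;
	int i;

	/* x86 stores the displacement in little-endian byte order. */
	for (i = 0; i < 4; i++)
		code [disp_offset + i] = (unsigned char)(bits >> (8 * i));
}
```

This is why the patching can only happen once the code sits at its
final address: the displacement depends on where both the
instruction and its target end up.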
* Call conventions and register allocation
To account for the differences in the call conventions, a few functions need to
be implemented.
mono_arch_allocate_vars () assigns to both arguments and local
variables the offset relative to the frame register where they
are stored; dead variables are simply discarded. The total
amount of stack needed is calculated as well.
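A minimal sketch of this kind of offset assignment, assuming a
stack that grows downward from a suitably aligned frame register.
The variable record and the function are hypothetical; the real
function works on the JIT's own variable descriptors.

```c
/* Hypothetical variable record. */
typedef struct {
	int size;    /* in bytes */
	int align;   /* required alignment, a power of two */
	int dead;    /* nonzero if the variable is never used */
	int offset;  /* out: offset relative to the frame register */
} SketchVar;

/* Assigns negative frame offsets; returns the stack space needed. */
static int
sketch_allocate_vars (SketchVar *vars, int count)
{
	int used = 0;
	int i;

	for (i = 0; i < count; i++) {
		if (vars [i].dead)
			continue; /* dead variables get no stack slot */
		used += vars [i].size;
		/* Round up so that the slot starts at an aligned address. */
		used = (used + vars [i].align - 1) & ~(vars [i].align - 1);
		vars [i].offset = -used;
	}
	return used;
}
```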
mono_arch_call_opcode () is the function that deals most
closely with the call convention on a given system. For each
argument to a function call, an instruction is created that
actually puts the argument where needed, be it the stack or a
specific register. If needed, this function can also rearrange
the order of evaluation when multiple arguments are involved
(for example, on x86 arguments are pushed on the stack in
reverse order). The function needs to carefully take into
account platform specific issues, like how structures are
returned as well as the differences in size and/or alignment
of managed and corresponding unmanaged structures.
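To make the "stack or a specific register" decision concrete, here
is a sketch for an invented convention: the first four word-sized
arguments go in registers 0..3 and the rest on the stack, 4 bytes
each. The convention, the struct and the function name are all
assumptions; a real port must follow the platform ABI.

```c
#define SK_NUM_ARG_REGS 4  /* hypothetical number of argument registers */
#define SK_WORD_SIZE    4  /* hypothetical word size in bytes */

typedef struct {
	int in_reg;        /* nonzero if passed in a register */
	int reg;           /* register number, or -1 */
	int stack_offset;  /* offset in the outgoing area, or -1 */
} SketchArgLoc;

/* Fills locs [0..count-1]; returns the bytes of stack needed. */
static int
sketch_assign_arg_locations (int count, SketchArgLoc *locs)
{
	int stack_bytes = 0;
	int i;

	for (i = 0; i < count; i++) {
		if (i < SK_NUM_ARG_REGS) {
			locs [i].in_reg = 1;
			locs [i].reg = i;
			locs [i].stack_offset = -1;
		} else {
			locs [i].in_reg = 0;
			locs [i].reg = -1;
			locs [i].stack_offset = stack_bytes;
			stack_bytes += SK_WORD_SIZE;
		}
	}
	return stack_bytes;
}
```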
The other chunk of code that needs to deal with the call
convention and other specifics of a CPU, is the local register
allocator, implemented in a function named
mono_arch_local_regalloc (). The local allocator deals with a
basic block at a time and basically just allocates registers
for temporary values during expression evaluation, spilling
and unspilling as necessary.
The local allocator needs to take into account clobbering
information, both during simple instructions and during
function calls, and it needs to deal with other
architecture-specific weirdnesses, like instructions that take
inputs only in specific registers or produce output only in some.
The register allocator does a first pass on the instructions
in a block, collecting liveness information, and in a backward
pass on the same list performs the actual register allocation,
inserting the instructions needed to spill values, if
necessary.
A cross-platform local register allocator, which lets similar,
risc-like CPUs share most of this code in a common file, is
now implemented and is documented in the jit-regalloc file.
When this part of code is implemented, some testing can be
done with the generated code for the new architecture. Most
helpful is the use of the --regression command line switch to
run the regression tests (basic.cs, for example).
Note that the JIT will try to initialize the runtime, but it
may not be able yet to compile and execute complex code:
commenting most of the code in the mini_init() function in
mini.c is needed to let the JIT just compile the regression
tests. Also, using multiple -v switches on the command line
makes the JIT dump an increasing amount of information during
compilation.
Values loaded into registers need to be extended as needed by
the ECMA specs:
*) integers smaller than 4 bytes are extended to int32 values
*) 32 bit floats are extended to double precision (in particular
this means that currently all the floating point operations operate
on doubles)
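In C terms, the two rules above correspond to the usual widening
conversions. The helper names below are made up for this sketch;
they only model what a load into a register has to do.

```c
#include <stdint.h>

/* Signed small integers are sign-extended to int32. */
static int32_t sketch_load_i1 (const int8_t *p)  { return *p; }
static int32_t sketch_load_i2 (const int16_t *p) { return *p; }

/* Unsigned small integers are zero-extended to int32. */
static int32_t sketch_load_u1 (const uint8_t *p)  { return *p; }
static int32_t sketch_load_u2 (const uint16_t *p) { return *p; }

/* 32 bit floats are widened to double precision. */
static double sketch_load_r4 (const float *p) { return *p; }
```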
* Method trampolines
To get better startup performance, the JIT actually compiles a
method only when needed. To achieve this, when a call to a
method is compiled, we actually emit a call to a magic
trampoline. The magic trampoline is a function written in
assembly that invokes the compiler to compile the given method
and jumps to the newly compiled code, ensuring the arguments
it received are passed correctly to the actual method.
Before jumping to the new code, though, the magic trampoline
takes care of patching the call site so that next time the
call will go directly to the method instead of the
trampoline. How does this all work?
mono_arch_create_jit_trampoline () creates a small function
that just preserves the arguments passed to it and adds an
additional argument (the method to compile) before calling the
generic trampoline. This small function is called the specific
trampoline, because it is method-specific (the method to
compile is hard-coded in the instruction stream).
The generic trampoline saves all the arguments that could get
clobbered and calls a C function that will do two things:
*) actually call the JIT to compile the method
*) identify the calling code so that it can be patched to call directly
the actual method
If the 'this' argument to a method is a boxed valuetype that
is passed to a method that expects just a pointer to the data,
an additional unboxing trampoline will need to be inserted as
well.
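The compile-on-first-call and patch-the-call-site idea can be
modelled in portable C with a function pointer standing in for the
call instruction. This is only a model under that assumption: the
structs and function names are invented, and a real trampoline is
generated machine code, not C.

```c
typedef struct SketchCallSite SketchCallSite;
typedef int (*SketchEntry) (SketchCallSite *site, int arg);

typedef struct {
	int (*body) (int);  /* stands in for the code the JIT would produce */
	int compiled;       /* how many times the "JIT" was invoked */
} SketchMethod;

struct SketchCallSite {
	SketchEntry target;    /* where the call currently goes */
	SketchMethod *method;  /* the method this site calls */
};

/* What the call site does once it has been patched. */
static int
sketch_direct (SketchCallSite *site, int arg)
{
	return site->method->body (arg);
}

/* What the call site does the first time. */
static int
sketch_trampoline (SketchCallSite *site, int arg)
{
	site->method->compiled++;         /* "compile" the method */
	site->target = sketch_direct;     /* patch the call site */
	return site->target (site, arg);  /* pass the original argument on */
}

/* Sample method body used to exercise the sketch. */
static int
sketch_twice (int x)
{
	return 2 * x;
}
```

After the first call through site.target the trampoline is no
longer involved: later calls reach the method directly.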
* Exception handling
Exception handling is likely the most difficult part of the
port, as it needs to deal with unwinding (both managed and
unmanaged code) and calling catch and filter blocks. It also
needs to deal with signals, because mono takes advantage of
the MMU in the CPU and of the operating system to handle
dereferences of the NULL pointer. Some of the functions needed
to implement the mechanisms are:
mono_arch_get_throw_exception () returns a function that takes
an exception object and invokes an arch-specific function that
will enter the exception processing. To do so, all the
relevant registers need to be saved and passed on.
mono_arch_handle_exception () takes the exception thrown and a
context that describes the state of the CPU at the time the
exception was thrown. The function needs to implement the
exception handling mechanism, so it searches for a handler for
the exception and, if none is found, it follows the unhandled
exception path (that can print a trace and exit or just abort
the current thread). The difficulty here is to unwind the
stack correctly, by restoring the register state at each call
site in the call chain, calling finally, filter and handler
blocks while doing so.
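The search part of this process can be sketched as a walk over
frames, from the innermost one outward, looking for a clause whose
protected range covers the point where the frame was executing. The
structs and the function are hypothetical, and the sketch leaves out
everything that makes the real job hard: exception type matching,
filters, finally blocks and restoring registers.

```c
typedef struct {
	int try_start;  /* first protected code offset */
	int try_end;    /* one past the last protected offset */
	int handler;    /* offset of the catch handler */
} SketchClause;

typedef struct {
	int ip;                       /* where this frame was executing */
	const SketchClause *clauses;  /* clauses of the frame's method */
	int num_clauses;
} SketchFrame;

/*
 * frames [0] is the frame that threw. Returns the index of the
 * frame that handles the exception and stores the handler offset,
 * or returns -1 if the exception is unhandled.
 */
static int
sketch_find_handler (const SketchFrame *frames, int num_frames, int *handler)
{
	int f, c;

	for (f = 0; f < num_frames; f++) {
		for (c = 0; c < frames [f].num_clauses; c++) {
			const SketchClause *cl = &frames [f].clauses [c];

			if (frames [f].ip >= cl->try_start && frames [f].ip < cl->try_end) {
				*handler = cl->handler;
				return f;
			}
		}
	}
	return -1;
}
```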
As part of exception handling a couple of internal calls need
to be implemented as well.
ves_icall_get_frame_info () returns info about a specific
frame.
mono_jit_walk_stack () walks the stack and calls a callback with info for
each frame found.
ves_icall_get_trace () returns an array of StackFrame objects.
** Code generation for filter/finally handlers
Filter and finally handlers are called from 2 different locations:
1.) from within the method containing the exception clauses
2.) from the stack unwinding code
To make this possible we implement them like subroutines,
ending with a "return" statement. The subroutine does not save
the base pointer, because we need access to the local
variables of the enclosing method. It is possible that
instructions inside those handlers modify the stack pointer,
thus we save the stack pointer at the start of the handler,
and restore it at the end. We have to use a "call" instruction
to execute such finally handlers.
The MIR code for filter and finally handlers looks like:
	OP_START_HANDLER
	...
	OP_END_FINALLY | OP_ENDFILTER(reg)
OP_START_HANDLER: should save the stack pointer somewhere
OP_END_FINALLY: restores the stack pointer and returns.
OP_ENDFILTER (reg): restores the stack pointer and returns the value in "reg".
** Calling finally/filter handlers
There is a special opcode to call those handlers, called
OP_CALL_HANDLER. It simply emits a call instruction.
It's a bit more complex to call a handler from outside (in the
stack unwinding code), because we have to restore the whole
context of the method first. After that we simply emit a call
instruction to invoke the handler. It's usually possible to use
the same code to call filter and finally handlers (see
arch_get_call_filter).
** Calling catch handlers
Catch handlers are always called from the stack unwinding
code. Unlike finally clauses or filters, catch handlers never
return. Instead we simply restore the whole context, and
restart execution at the catch handler.
** Passing Exception objects to catch handlers and filters.
We use a local variable to store exception objects. The stack
unwinding code must store the exception object into this
variable before calling a catch handler or filter.
* Minor helper methods
A few minor helper methods are referenced from the arch-independent code.
Some of them are:
*) mono_arch_cpu_optimizations ()
This function returns a mask of optimizations that
should be enabled for the current CPU and a mask of
optimizations that should be excluded, instead.
*) mono_arch_regname ()
Returns the name for a numeric register.
*) mono_arch_get_allocatable_int_vars ()
Returns a list of variables that can be allocated to
the integer registers in the current architecture.
*) mono_arch_get_global_int_regs ()
Returns a list of callee-saved registers that can be
used to allocate variables in the current method.
*) mono_arch_instrument_mem_needs ()
*) mono_arch_instrument_prolog ()
*) mono_arch_instrument_epilog ()
Functions needed to implement the profiling interface.
* Writing regression tests
Regression tests for the JIT should be written for any bug
found in the JIT in one of the *.cs files in the mini
directory. Eventually all the operations of the JIT should be
tested (including the ones that get selected only when some
specific optimization is enabled).
* Platform specific optimizations
An example of a platform-specific optimization is the peephole
optimization: we look at a small window of code at a time and
we replace one or more instructions with others that perform
better for the given architecture or CPU.
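As a small example of the technique, the sketch below removes two
redundant move patterns from a straight-line instruction array:
a move of a register onto itself, and a move that just undoes the
copy made by the previous instruction. The toy instruction format
and the function are assumptions for illustration, and the array is
assumed to be a single basic block (no branch targets inside it).

```c
/* Hypothetical opcodes and instruction format. */
enum { SK_MOVE, SK_OTHER };

typedef struct {
	int opcode;
	int dreg;  /* destination register */
	int sreg;  /* source register */
} SketchPIns;

/* Compacts the array in place; returns the new instruction count. */
static int
sketch_peephole (SketchPIns *ins, int count)
{
	int out = 0;
	int i;

	for (i = 0; i < count; i++) {
		SketchPIns cur = ins [i];

		if (cur.opcode == SK_MOVE) {
			/* move r <- r does nothing */
			if (cur.dreg == cur.sreg)
				continue;
			/* move b <- a; move a <- b: the second move is redundant */
			if (out > 0 && ins [out - 1].opcode == SK_MOVE &&
			    ins [out - 1].dreg == cur.sreg &&
			    ins [out - 1].sreg == cur.dreg)
				continue;
		}
		ins [out++] = cur;
	}
	return out;
}
```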
* 64 bit support tips, by Zoltan Varga (vargaz@gmail.com)
For a 64-bit port of the Mono runtime, you will typically need
to do the following:
*) use inssel-long.brg instead of inssel-long32.brg.
*) implement lots of new opcodes:
OP_I<OP> is 32 bit op
OP_L<OP> and CEE_<OP> are 64 bit ops
The 64 bit version of an existing port might share the code
with the 32 bit port (for example SPARC/SPARCV9), or it might
be separate (x86/AMD64).
That will depend on the similarities of the two instruction
sets/ABIs etc.
The runtime and most parts of the JIT are 64 bit clean
at this point, so the only parts which require changing are
the arch dependent files.