docs/mini-porting.txt

		       Mono JIT porting guide.
		   Paolo Molaro (lupus@ximian.com)

* Introduction

	This document describes the process of porting the mono JIT
	to a new CPU architecture. The new mono JIT has been designed
	to make porting easier while still allowing a port to take
	full advantage of the features and instructions of the new
	architecture. Knowledge of the mini architecture (described in
	the mini-doc.txt file) is a requirement for understanding this
	guide, as is an earlier document about porting the mono
	interpreter (available on the web site).
	
	There are six main areas that a port needs to implement to
	have a fully-functional JIT for a given architecture:
	
		1) instruction selection
		2) native code emission
		3) call conventions and register allocation
		4) method trampolines
		5) exception handling
		6) minor helper methods
	
	To take advantage of some not-so-common processor features
	(for example conditional execution of instructions, as may be
	found on ARM or ia64), it may be necessary to develop a
	high-level optimization, but doing so is not a requirement for
	getting the JIT to work.
	
	We'll look at each of the required steps in more detail. Note,
	though, that a new port may just as well start from a
	cut&paste of an existing port to a similar architecture (for
	example from x86 to amd64, or from powerpc to sparc).
	
	The architecture specific code is split from the rest of the
	JIT, for example the x86 specific code and data is all
	included in the following files in the distribution:
	
		mini-x86.h mini-x86.c
		inssel-x86.brg
		cpu-pentium.md
		tramp-x86.c 
		exceptions-x86.c 
	
	I suggest a similar split for other architectures as well.
	
	Note that this document is still incomplete: some sections are
	only sketched and some are missing, but the important info to
	get a port going is already described.


* Architecture-specific instructions and instruction selection.

	The JIT already provides a set of instructions that can be
	easily mapped to a great variety of different processor
	instructions.  Sometimes it may be necessary or advisable to
	add a new instruction that represents more closely an
	instruction in the architecture.  A mini instruction can also
	be used to represent a short sequence of low-level CPU
	instructions, but each mini instruction is the minimum amount
	of code the instruction scheduler will handle (i.e., the
	scheduler won't schedule the instructions that compose the
	low-level sequence individually, but only the whole sequence,
	as an indivisible block).

	New instructions are created by adding a line in the
	mini-ops.h file, assigning an opcode and a name. To specify
	the input and output for the instruction, there are two
	different places, depending on the context in which the
	instruction gets used.

	If the instruction is used in the tree representation, the
	input and output types are defined by the BURG rules in the
	*.brg files (the usual non-terminals are 'reg' to represent a
	normal register, 'lreg' to represent a register or two that
	hold a 64 bit value, and 'freg' for a floating point register).

	If an instruction is used as a low-level CPU instruction, the
	info is specified in a machine description file. The
	description file is processed by the genmdesc program to
	provide a data structure that can be easily used from C code
	to query the needed info about the instruction.

	As an example, let's consider the add instruction for both x86
	and ppc:
	
	x86 version:
		add: dest:i src1:i src2:i len:2 clob:1
	ppc version:
		add: dest:i src1:i src2:i len:4
	
	Note that the instruction takes two input integer registers on
	both CPUs, but on x86 the first source register is clobbered
	(clob:1) and the length in bytes of the instruction differs.
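
	As a rough illustration of the kind of information carried by
	such a line, the following C sketch parses one description
	line into a struct. The InsDesc type and parse_desc_line () are
	hypothetical names made up for this example; the sketch does
	not reproduce what genmdesc actually generates.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one machine description entry. */
typedef struct {
    char name[16];
    char dest;   /* register class of the result, 0 if none */
    char src1;   /* register class of the first source, 0 if none */
    char src2;   /* register class of the second source, 0 if none */
    int  len;    /* encoded length in bytes */
    char clob;   /* port-defined clobber marker, 0 if none */
} InsDesc;

/* Parse a line such as "add: dest:i src1:i src2:i len:2 clob:1".
   Returns 0 on success, -1 if the line is too long or lacks a "name:" prefix. */
static int parse_desc_line(const char *line, InsDesc *out)
{
    char buf[128];
    char *tok;
    size_t n;

    assert(line != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    if (strlen(line) >= sizeof buf)
        return -1;
    strcpy(buf, line);

    tok = strtok(buf, " \t");
    if (tok == NULL)
        return -1;
    n = strlen(tok);
    if (n < 2 || tok[n - 1] != ':' || n - 1 >= sizeof out->name)
        return -1;
    memcpy(out->name, tok, n - 1);   /* out was zeroed, so this stays NUL-terminated */

    while ((tok = strtok(NULL, " \t")) != NULL) {
        int v;
        char c;
        if (sscanf(tok, "dest:%c", &c) == 1) out->dest = c;
        else if (sscanf(tok, "src1:%c", &c) == 1) out->src1 = c;
        else if (sscanf(tok, "src2:%c", &c) == 1) out->src2 = c;
        else if (sscanf(tok, "len:%d", &v) == 1) out->len = v;
        else if (sscanf(tok, "clob:%c", &c) == 1) out->clob = c;
        /* unknown fields are ignored, as a port may ignore fields it does not need */
    }
    return 0;
}
```

	Parsing the x86 line above would give len 2 and clob '1',
	while the ppc line would give len 4 and no clobber marker.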

	Note that integer adds and floating point adds use different
	opcodes, unlike in the IL language. On 32 bit architectures a
	64 bit add is done with two instructions: an add that sets
	the carry and an add with carry.
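
	The two-instruction 64 bit add can be modelled in portable C
	as follows. This is only a sketch of the arithmetic; Pair32
	and toy_add64 () are hypothetical names and are not part of
	the JIT.

```c
#include <assert.h>
#include <stdint.h>

/* A 64 bit value held in two 32 bit registers. */
typedef struct { uint32_t lo, hi; } Pair32;

static Pair32 toy_add64(Pair32 a, Pair32 b)
{
    Pair32 r;
    uint32_t carry;

    r.lo = a.lo + b.lo;           /* add that sets the carry */
    carry = r.lo < a.lo;          /* unsigned wraparound means a carry out */
    r.hi = a.hi + b.hi + carry;   /* add with carry */
    return r;
}

/* Compare the two-step add against the native 64 bit add. */
static void toy_add64_selfcheck(uint64_t x, uint64_t y)
{
    Pair32 a = { (uint32_t)x, (uint32_t)(x >> 32) };
    Pair32 b = { (uint32_t)y, (uint32_t)(y >> 32) };
    Pair32 r = toy_add64(a, b);

    assert((((uint64_t)r.hi << 32) | r.lo) == x + y);
}
```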

	A specific CPU port may assign any meaning to the clob field
	for an instruction since the value will be processed in an
	arch-specific file anyway.

	See the top of the existing cpu-pentium.md file for more info
	on other fields: the info may or may not be applicable to a
	different CPU; in the latter case the info can be ignored.

	The code in mini.c together with the BURG rules in inssel.brg,
	inssel-float.brg and inssel-long32.brg provides general
	purpose mappings from the tree representation to a set of
	instructions that should be easily implemented in any
	architecture.  To allow for additional arch-specific
	functionality, an arch-specific BURG file can be used: in this
	file arch-specific instructions can be selected that provide
	better performance than the general instructions or that
	provide functionality that is needed by the JIT but that
	cannot be expressed in a general enough way.
	
	As an example, x86 has the special instruction "push" to make
	it easier to implement the default call convention (passing
	arguments on the stack): almost all the other architectures
	don't have such an instruction (and don't need it anyway), so
	we added a special rule in the inssel-x86.brg file for it.
	
	So, one of the first things needed in a port is to write a
	cpu-$(arch).md machine description file and fill it with the
	needed info. As a start, only a few instructions can be
	specified, like the ones required to do simple integer
	operations. The default rules of the instruction selector will
	emit the common instructions and so we're ready to go for the
	next step in porting the JIT.
	

* Native code emission

	Since the first step in porting mono to a new CPU is to port
	the interpreter, there should already be a file that allows
	the emission of binary native code in a buffer for the
	architecture. This file should be placed in the

		mono/arch/$(arch)/

	directory.

	The bulk of the code emission happens in the mini-$(arch).c
	file, in a function called mono_arch_output_basic_block ().
	This function takes a basic block, walks the list of
	instructions in the block and emits the binary code for each.
	Optionally a peephole optimization pass is done on the basic
	block, but this can be left for later, when the port actually
	works.

	This function is very simple: there is just a big switch on
	the instruction opcode, and in the corresponding case the
	functions or macros that emit the binary native code are
	used. Note that in this function the lengths of the
	instructions are used to determine if the buffer for the code
	needs enlarging.
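
	The shape of such a function can be sketched as below. All
	the names (ToyIns, CodeBuf, toy_output_basic_block () and the
	TOY_* opcodes) are hypothetical, and the x86 encodings are
	written out by hand only to keep the example self-contained;
	a real port would use the emission macros from
	mono/arch/$(arch)/.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical opcodes and instruction record, for illustration only. */
enum { TOY_NOP, TOY_ICONST, TOY_ADD };

typedef struct ToyIns {
    int opcode;
    int dreg, sreg;          /* x86 register numbers, 0..7 */
    int32_t imm;
    struct ToyIns *next;
} ToyIns;

typedef struct {
    uint8_t *code;
    size_t   len;            /* bytes emitted so far */
    size_t   size;           /* bytes allocated */
} CodeBuf;

/* Worst-case encoded length per opcode: the role played by len: in the .md file. */
static const size_t toy_max_len[] = { 1, 5, 2 };

/* Grow the buffer until `need` more bytes fit. */
static void ensure_room(CodeBuf *buf, size_t need)
{
    if (buf->len + need <= buf->size)
        return;
    while (buf->len + need > buf->size)
        buf->size = buf->size ? buf->size * 2 : 16;
    buf->code = realloc(buf->code, buf->size);
    assert(buf->code != NULL);
}

static void toy_output_basic_block(CodeBuf *buf, const ToyIns *first)
{
    for (const ToyIns *ins = first; ins != NULL; ins = ins->next) {
        assert(ins->opcode >= TOY_NOP && ins->opcode <= TOY_ADD);
        assert(ins->dreg >= 0 && ins->dreg < 8 && ins->sreg >= 0 && ins->sreg < 8);
        ensure_room(buf, toy_max_len[ins->opcode]);

        switch (ins->opcode) {
        case TOY_NOP:                          /* nop */
            buf->code[buf->len++] = 0x90;
            break;
        case TOY_ICONST:                       /* mov r32, imm32 */
            buf->code[buf->len++] = (uint8_t)(0xB8 + ins->dreg);
            for (int i = 0; i < 4; i++)
                buf->code[buf->len++] = (uint8_t)((uint32_t)ins->imm >> (8 * i));
            break;
        case TOY_ADD:                          /* add r/m32, r32 */
            buf->code[buf->len++] = 0x01;
            buf->code[buf->len++] = (uint8_t)(0xC0 | (ins->sreg << 3) | ins->dreg);
            break;
        }
    }
}
```

	The length table is consulted before the switch, so each case
	can write its bytes without further bounds checks.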
	
	To complete the code emission for a method, a few other
	functions need implementing as well:
	
		mono_arch_emit_prolog ()
		mono_arch_emit_epilog ()
		mono_arch_patch_code ()
	
	mono_arch_emit_prolog () will emit the code to set up the stack
	frame for a method, optionally call the callbacks used in
	profiling and tracing, and move the arguments to their home
	location (in a callee-save register if the variable was
	allocated to one, or in a stack location if the argument was
	passed in a volatile register and wasn't allocated a
	non-volatile one). Callee-save registers used by the function
	are saved in the prolog as well.
	
	mono_arch_emit_epilog () will emit the code needed to return
	from the function, optionally calling the profiling or tracing
	callbacks. At this point the basic blocks or the code that was
	moved out of the normal flow for the function can be emitted
	as well (this is usually done to provide better info for the
	static branch predictor).  In the epilog, callee-save
	registers are restored if they were used.

	Note that, to help exception handling and stack unwinding,
	when there is a transition from managed to unmanaged code,
	some special processing needs to be done (basically, saving
	all the registers and setting up the links in the Last Managed
	Frame structure).
	
	When the epilog has been emitted, the upper level code
	arranges for the buffer of memory that contains the native
	code to be copied into an area of executable memory. At this
	point, instructions that use relative addressing need to be
	patched to have the right offsets: this work is done by
	mono_arch_patch_code ().
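
	To make the patching step concrete, here is a minimal sketch
	that rewrites x86-style 32 bit relative displacements once the
	final address of the code is known. ToyPatch and
	toy_patch_code () are hypothetical; the real patch information
	kept by the JIT is richer than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical patch record: where a rel32 field sits and what it must reach. */
typedef struct {
    size_t    field_offset;   /* offset of the 4-byte displacement in the method */
    uintptr_t target;         /* absolute address of the destination */
} ToyPatch;

/* Rewrite each displacement for code that now lives at `code`.
   An x86 rel32 displacement is relative to the end of its 4-byte field. */
static void toy_patch_code(uint8_t *code, const ToyPatch *patches, size_t n)
{
    assert(code != NULL);
    for (size_t i = 0; i < n; i++) {
        uintptr_t next_ip = (uintptr_t)(code + patches[i].field_offset + 4);
        /* Unsigned truncation keeps the low 32 bits, which is the
           two's complement encoding of the signed displacement. */
        uint32_t disp = (uint32_t)(patches[i].target - next_ip);

        for (int k = 0; k < 4; k++)   /* store little-endian, as x86 expects */
            code[patches[i].field_offset + k] = (uint8_t)(disp >> (8 * k));
    }
}
```

	The sketch assumes the target is within 2 GB of the patched
	instruction, which a real port would have to verify.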


* Call conventions and register allocation

	To account for the differences in the call conventions, a few
	functions need to be implemented.
	
	mono_arch_allocate_vars () assigns to both arguments and local
	variables the offset relative to the frame register where they
	are stored; dead variables are simply discarded. The total
	amount of stack needed is calculated as well.
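
	A minimal sketch of this kind of offset assignment follows.
	It assumes a frame that grows downward from an aligned frame
	register; ToyVar and toy_allocate_vars () are hypothetical
	names, and a real port also has to place incoming arguments
	according to the platform ABI.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t size;     /* size in bytes */
    size_t align;    /* required alignment, a power of two */
    int    dead;     /* nonzero if the variable is never used */
    long   offset;   /* result: offset from the frame register, 0 if discarded */
} ToyVar;

/* Assign frame offsets below the frame register and return the total
   stack space needed, rounded up to frame_align (a power of two). */
static size_t toy_allocate_vars(ToyVar *vars, size_t n, size_t frame_align)
{
    size_t used = 0;

    assert(frame_align != 0 && (frame_align & (frame_align - 1)) == 0);
    for (size_t i = 0; i < n; i++) {
        if (vars[i].dead) {          /* dead variables get no stack slot */
            vars[i].offset = 0;
            continue;
        }
        assert(vars[i].align != 0 && (vars[i].align & (vars[i].align - 1)) == 0);
        used += vars[i].size;
        used = (used + vars[i].align - 1) & ~(vars[i].align - 1);
        vars[i].offset = -(long)used;
    }
    return (used + frame_align - 1) & ~(frame_align - 1);
}
```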
	
	mono_arch_call_opcode () is the function that most closely
	deals with the call convention on a given system. For each
	argument to a function call, an instruction is created that
	actually puts the argument where needed, be it the stack or a
	specific register. This function can also rearrange the order
	of evaluation when multiple arguments are involved, if needed
	(for example, on x86 arguments are pushed on the stack in
	reverse order). The function needs to carefully take into
	account platform-specific issues, like how structures are
	returned as well as the differences in size and/or alignment
	of managed and corresponding unmanaged structures.
	
	The other chunk of code that needs to deal with the call
	convention and other specifics of a CPU, is the local register
	allocator, implemented in a function named
	mono_arch_local_regalloc (). The local allocator deals with a
	basic block at a time and basically just allocates registers
	for temporary values during expression evaluation, spilling
	and unspilling as necessary.

	The local allocator needs to take into account clobbering
	information, both during simple instructions and during
	function calls, and it needs to deal with other
	architecture-specific weirdnesses, like instructions that take
	inputs only in specific registers or output only in some.

	The register allocator does a first pass on the instructions
	in a block, collecting liveness information, and in a backward
	pass on the same list performs the actual register allocation,
	inserting the instructions needed to spill values, if
	necessary.

	The plan to move most of the local register allocator to a
	common file, so that the code can be shared by similar,
	risc-like CPUs, has been carried out: the cross-platform local
	register allocator is now implemented and is documented in the
	jit-regalloc file.
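
	The backward allocation pass can be illustrated with a much
	simplified sketch. It assumes that every virtual register is
	defined once in the block before it is used, it discovers live
	ranges during the backward walk itself instead of in a
	separate forward pass, and it reports failure where a real
	allocator would spill. ToyRIns and toy_local_regalloc () are
	hypothetical names.

```c
#include <assert.h>

#define TOY_NUM_HREGS 3    /* hypothetical number of allocatable hard registers */
#define TOY_MAX_VREGS 16

typedef struct {
    int dreg;              /* virtual register defined, or -1 */
    int sreg1, sreg2;      /* virtual registers used, or -1 */
} ToyRIns;

/* Grab the lowest-numbered free hard register, or return -1. */
static int take_free(int *in_use)
{
    for (int r = 0; r < TOY_NUM_HREGS; r++)
        if (!in_use[r]) { in_use[r] = 1; return r; }
    return -1;
}

/* Walking backward, the first time a vreg is seen as a source is its last
   use, and its definition ends its live range, so the hard register is
   released there. assign[v] receives the hard register of vreg v.
   Returns 0 on success, -1 where a real allocator would have to spill. */
static int toy_local_regalloc(const ToyRIns *ins, int n, int *assign)
{
    int in_use[TOY_NUM_HREGS] = { 0 };
    int live[TOY_MAX_VREGS] = { 0 };

    for (int v = 0; v < TOY_MAX_VREGS; v++)
        assign[v] = -1;

    for (int i = n - 1; i >= 0; i--) {
        int d = ins[i].dreg;
        int srcs[2] = { ins[i].sreg1, ins[i].sreg2 };

        if (d >= 0) {
            assert(d < TOY_MAX_VREGS);
            if (!live[d]) {                 /* result is never used later */
                assign[d] = take_free(in_use);
                if (assign[d] < 0)
                    return -1;
            }
            in_use[assign[d]] = 0;          /* live range ends at the definition */
            live[d] = 0;
        }
        for (int k = 0; k < 2; k++) {
            int s = srcs[k];
            if (s < 0)
                continue;
            assert(s < TOY_MAX_VREGS);
            if (live[s])
                continue;                   /* already has a register */
            assign[s] = take_free(in_use);
            if (assign[s] < 0)
                return -1;
            live[s] = 1;
        }
    }
    return 0;
}
```

	Because the destination is released before the sources are
	allocated, a source whose last use is the current instruction
	can share the hard register of the result.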
	
	When this part of code is implemented, some testing can be
	done with the generated code for the new architecture. Most
	helpful is the use of the --regression command line switch to
	run the regression tests (basic.cs, for example).

	Note that the JIT will try to initialize the runtime, but it
	may not yet be able to compile and execute complex code:
	commenting out most of the code in the mini_init() function in
	mini.c is needed to let the JIT just compile the regression
	tests.  Also, using multiple -v switches on the command line
	makes the JIT dump an increasing amount of information during
	compilation.

	Values loaded into registers need to be extended as needed by
	the ECMA specs:

	*) integers smaller than 4 bytes are extended to int32 values
	*) 32 bit floats are extended to double precision (in particular
	this means that currently all the floating point operations operate
	on doubles)
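
	In C terms, the extension rules above amount to the following
	loads. The toy_load_* () helpers are hypothetical names used
	only to show the widening; in the JIT the same effect comes
	from the load instruction chosen for each type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Signed 1-byte load, sign-extended to int32. */
static int32_t toy_load_i1(const void *p)
{
    int8_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Unsigned 1-byte load, zero-extended to int32. */
static int32_t toy_load_u1(const void *p)
{
    uint8_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Signed 2-byte load, sign-extended to int32. */
static int32_t toy_load_i2(const void *p)
{
    int16_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Unsigned 2-byte load, zero-extended to int32. */
static int32_t toy_load_u2(const void *p)
{
    uint16_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* 32 bit float load, widened to double precision. */
static double toy_load_r4(const void *p)
{
    float v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```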
	
* Method trampolines

	To get better startup performance, the JIT actually compiles a
	method only when needed. To achieve this, when a call to a
	method is compiled, we actually emit a call to a magic
	trampoline. The magic trampoline is a function written in
	assembly that invokes the compiler to compile the given method
	and jumps to the newly compiled code, ensuring the arguments
	it received are passed correctly to the actual method.

	Before jumping to the new code, though, the magic trampoline
	takes care of patching the call site so that next time the
	call will go directly to the method instead of the
	trampoline. How does this all work?

	mono_arch_create_jit_trampoline () creates a small function
	that just preserves the arguments passed to it and adds an
	additional argument (the method to compile) before calling the
	generic trampoline. This small function is called the specific
	trampoline, because it is method-specific (the method to
	compile is hard-coded in the instruction stream).

	The generic trampoline saves all the arguments that could get
	clobbered and calls a C function that will do two things:
	
	*) actually call the JIT to compile the method
	*) identify the calling code so that it can be patched to call directly
	the actual method
	
	If the 'this' argument to a method is a boxed valuetype that
	is passed to a method that expects just a pointer to the data,
	an additional unboxing trampoline will need to be inserted as
	well.
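
	The compile-on-first-call and patching behaviour can be
	modelled in plain C with function pointers. This is only an
	analogy: the real trampolines are generated machine code, and
	ToyMethod, ToyCallSite, toy_trampoline () and toy_call () are
	hypothetical names. A NULL target stands for a call
	instruction that still points at the trampoline.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*ToyCode)(int);

typedef struct {
    ToyCode body;            /* stands in for the IL the JIT would compile */
    ToyCode compiled;        /* NULL until the method has been "compiled" */
    int     compile_count;
} ToyMethod;

typedef struct {
    ToyMethod *method;       /* the method hard-coded in the specific trampoline */
    ToyCode    target;       /* NULL: the call still goes to the trampoline */
    int        trampoline_hits;
} ToyCallSite;

/* Stand-in for the JIT: "compiles" a method at most once. */
static ToyCode toy_compile(ToyMethod *m)
{
    if (m->compiled == NULL) {
        m->compiled = m->body;
        m->compile_count++;
    }
    return m->compiled;
}

/* Plays the role of the generic trampoline: compile the method,
   patch the calling site, then forward the original argument. */
static int toy_trampoline(ToyCallSite *site, int arg)
{
    ToyCode code;

    assert(site != NULL && site->method != NULL);
    code = toy_compile(site->method);
    site->target = code;     /* the next call from this site goes direct */
    site->trampoline_hits++;
    return code(arg);
}

/* What executing the call instruction at `site` amounts to. */
static int toy_call(ToyCallSite *site, int arg)
{
    if (site->target != NULL)
        return site->target(arg);
    return toy_trampoline(site, arg);
}

/* A sample "method". */
static int toy_double(int x) { return 2 * x; }
```

	Two call sites for the same method each go through the
	trampoline once, but the method itself is compiled only once.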
	

* Exception handling

	Exception handling is likely the most difficult part of the
	port, as it needs to deal with unwinding (both managed and
	unmanaged code) and calling catch and filter blocks. It also
	needs to deal with signals, because mono takes advantage of
	the MMU in the CPU and of the operating system to handle
	dereferences of the NULL pointer. Some of the functions needed
	to implement the mechanisms are:
	
	mono_arch_get_throw_exception () returns a function that takes
	an exception object and invokes an arch-specific function that
	will enter the exception processing.  To do so, all the
	relevant registers need to be saved and passed on.
	
	mono_arch_handle_exception () takes the exception thrown and
	a context that describes the state of the CPU at the time the
	exception was thrown. The function needs to implement the
	exception handling mechanism, so it searches for a handler
	for the exception and, if none is found, follows the
	unhandled exception path (which can print a trace and exit or
	just abort the current thread). The difficulty here is to
	unwind the stack correctly, by restoring the register state
	at each call site in the call chain, calling finally, filter
	and handler blocks while doing so.
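
	The search for a handler can be sketched as a walk over the
	frames of the call chain, looking in each for a clause that
	covers the frame's instruction pointer. The sketch below
	leaves out register restoration and the calls to finally and
	filter blocks; ToyClause, ToyFrame and toy_find_handler () are
	hypothetical names and exception types are reduced to plain
	integers.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned try_start, try_end;   /* protected range [try_start, try_end) */
    unsigned handler;              /* offset of the handler code */
    int      catch_type;           /* hypothetical id of the exception type caught */
} ToyClause;

typedef struct {
    const ToyClause *clauses;      /* exception clauses of the frame's method */
    size_t           num_clauses;
    unsigned         ip;           /* offset the frame was executing at */
} ToyFrame;

/* Search frames[0] (innermost) outward for a clause that covers the
   frame's ip and catches exc_type. Returns the index of the handling
   frame and stores the handler offset, or returns -1, which stands for
   the unhandled exception path. */
static int toy_find_handler(const ToyFrame *frames, size_t n,
                            int exc_type, unsigned *handler)
{
    assert(handler != NULL);
    for (size_t f = 0; f < n; f++) {
        for (size_t c = 0; c < frames[f].num_clauses; c++) {
            const ToyClause *cl = &frames[f].clauses[c];
            if (frames[f].ip >= cl->try_start && frames[f].ip < cl->try_end &&
                cl->catch_type == exc_type) {
                *handler = cl->handler;
                return (int)f;
            }
        }
    }
    return -1;
}
```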
	
	As part of exception handling a couple of internal calls need
	to be implemented as well.

	ves_icall_get_frame_info () returns info about a specific
	frame.

	mono_jit_walk_stack () walks the stack and calls a callback with info for
	each frame found.

	ves_icall_get_trace () returns an array of StackFrame objects.
	
** Code generation for filter/finally handlers

	Filter and finally handlers are called from 2 different locations:
	
	       1.) from within the method containing the exception clauses
	       2.) from the stack unwinding code
	
	To make this possible we implement them like subroutines,
	ending with a "return" statement. The subroutine does not save
	the base pointer, because we need access to the local
	variables of the enclosing method. It is possible that
	instructions inside those handlers modify the stack pointer,
	thus we save the stack pointer at the start of the handler,
	and restore it at the end. We have to use a "call" instruction
	to execute such finally handlers.
	
	The MIR code for filter and finally handlers looks like:
	
	    OP_START_HANDLER
	    ...
	    OP_END_FINALLY | OP_ENDFILTER(reg)
	
	OP_START_HANDLER: should save the stack pointer somewhere.
	OP_END_FINALLY: restores the stack pointer and returns.
	OP_ENDFILTER (reg): restores the stack pointer and returns the value in "reg".
	
** Calling finally/filter handlers 

	There is a special opcode to call those handlers, called
	OP_CALL_HANDLER. It simply emits a call instruction.
	
	It is a bit more complex to call a handler from outside (in
	the stack unwinding code), because we have to restore the
	whole context of the method first. After that we simply emit a
	call instruction to invoke the handler. It is usually possible
	to use the same code to call filter and finally handlers (see
	arch_get_call_filter).
	
** Calling catch handlers

	Catch handlers are always called from the stack unwinding
	code. Unlike finally clauses or filters, catch handlers never
	return. Instead we simply restore the whole context, and
	restart execution at the catch handler.
	
** Passing Exception objects to catch handlers and filters.

	We use a local variable to store exception objects. The stack
	unwinding code must store the exception object into this
	variable before calling a catch handler or filter.
	
* Minor helper methods

	A few minor helper methods are referenced from the arch-independent code.
	Some of them are:
	
	*) mono_arch_cpu_optimizations ()
		This function returns a mask of optimizations that
		should be enabled for the current CPU, together with
		a mask of optimizations that should be excluded.
	
	*) mono_arch_regname ()
		Returns the name for a numeric register.
	
	*) mono_arch_get_allocatable_int_vars ()
		Returns a list of variables that can be allocated to
		the integer registers in the current architecture.
	
	*) mono_arch_get_global_int_regs ()
		Returns a list of callee-save registers that can be
		used to allocate variables in the current method.
	
	*) mono_arch_instrument_mem_needs ()
	*) mono_arch_instrument_prolog ()
	*) mono_arch_instrument_epilog ()
		Functions needed to implement the profiling interface.
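
	As an illustration of the first helper in the list, a
	function that reports an enable mask and an exclude mask could
	look like the sketch below. The TOY_OPT_* flags, the
	ToyCpuFeatures struct and both functions are hypothetical; the
	real flags and the way the JIT combines the masks are defined
	in the arch-independent code and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical optimization flags; the real JIT defines its own set. */
enum {
    TOY_OPT_PEEPHOLE = 1 << 0,
    TOY_OPT_CMOV     = 1 << 1,
    TOY_OPT_FCMOV    = 1 << 2
};

/* Hypothetical result of probing the CPU. */
typedef struct {
    int has_cmov;
    int has_fcmov;
} ToyCpuFeatures;

/* Return the optimizations worth enabling on this CPU and store in
   *exclude_mask the ones that must not be used on it. */
static uint32_t toy_cpu_optimizations(const ToyCpuFeatures *cpu,
                                      uint32_t *exclude_mask)
{
    uint32_t opts = TOY_OPT_PEEPHOLE;

    assert(cpu != NULL && exclude_mask != NULL);
    *exclude_mask = 0;

    if (cpu->has_cmov)
        opts |= TOY_OPT_CMOV;
    else
        *exclude_mask |= TOY_OPT_CMOV;

    if (cpu->has_fcmov)
        opts |= TOY_OPT_FCMOV;
    else
        *exclude_mask |= TOY_OPT_FCMOV;

    return opts;
}

/* One plausible way for a caller to combine the masks with the set of
   optimizations requested by the user (an assumption of this sketch). */
static uint32_t toy_effective_opts(uint32_t requested, uint32_t cpu_opts,
                                   uint32_t exclude)
{
    return (requested | cpu_opts) & ~exclude;
}
```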
	
* Testing the port

	The JIT has a set of regression tests in *.cs files inside the
	mini directory. The usual method of testing a port is by
	compiling these tests on another machine with a working
	runtime by typing 'make rcheck', then copying TestDriver.dll
	and *.exe to the mini directory. The tests can be run by
	typing:

		./mono --regression <exe file name>

	The suggested order for working through these tests is the
	following:
	- basic.exe
	- basic-long.exe
	- basic-float.exe
	- basic-calls.exe
	- objects.exe
	- arrays.exe
	- exceptions.exe
	- iltests.exe
	- generics.exe	
	
* Writing regression tests

	Regression tests for the JIT should be written for any bug
	found in the JIT in one of the *.cs files in the mini
	directory. Eventually all the operations of the JIT should be
	tested (including the ones that get selected only when some
	specific optimization is enabled).
	

* Platform specific optimizations

	An example of a platform-specific optimization is the peephole
	optimization: we look at a small window of code at a time and
	we replace one or more instructions with others that perform
	better for the given architecture or CPU.
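
	A toy peephole pass over a hypothetical instruction array is
	sketched below. It applies three x86-flavoured rules: a move
	of a register onto itself is dropped, a move that merely
	copies back the value just moved is dropped, and loading the
	constant 0 is turned into xor reg, reg. The last rule is only
	valid when the condition flags are not live at that point,
	which this sketch does not check. ToyPIns, the P_* opcodes and
	toy_peephole () are made-up names.

```c
#include <assert.h>
#include <stddef.h>

enum { P_MOVE, P_ICONST, P_XOR };

typedef struct {
    int opcode;
    int dreg;      /* destination register */
    int sreg;      /* source register (P_MOVE, P_XOR) */
    int imm;       /* constant (P_ICONST) */
} ToyPIns;

/* Rewrite the instructions in place and return the new count. */
static size_t toy_peephole(ToyPIns *ins, size_t n)
{
    size_t out = 0;

    assert(ins != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        ToyPIns cur = ins[i];

        /* mov r, r does nothing */
        if (cur.opcode == P_MOVE && cur.dreg == cur.sreg)
            continue;

        /* mov a, b followed by mov b, a: the second move is redundant */
        if (cur.opcode == P_MOVE && out > 0 &&
            ins[out - 1].opcode == P_MOVE &&
            ins[out - 1].dreg == cur.sreg &&
            ins[out - 1].sreg == cur.dreg)
            continue;

        /* mov r, 0 becomes xor r, r, which has a shorter encoding on x86 */
        if (cur.opcode == P_ICONST && cur.imm == 0) {
            cur.opcode = P_XOR;
            cur.sreg = cur.dreg;
        }

        ins[out++] = cur;
    }
    return out;
}
```

	The window here is at most two instructions wide: the current
	one and the last one kept.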
	
* 64 bit support tips, by Zoltan Varga (vargaz@gmail.com)

	For a 64-bit port of the Mono runtime, you will typically
	need to do the following:

		* use inssel-long.brg instead of
		  inssel-long32.brg.

		* implement lots of new opcodes:
		       OP_I<OP> is a 32 bit op
		       OP_L<OP> and CEE_<OP> are 64 bit ops


	The 64 bit version of an existing port might share the code
	with the 32 bit port (for example SPARC/SPARCV9), or it might
	be separate (x86/AMD64).  

	That will depend on the similarities of the two instruction
	sets/ABIs etc.

	The runtime and most parts of the JIT are 64 bit clean
	at this point, so the only parts which require changing are
	the arch dependent files.