New compilation engine for Mono

svn path=/trunk/mono/; revision=13205
commit 5a710fa941e8d9c64a6228fe173cf50976eed05e (parent 64655bb)
Miguel de Icaza (migueldeicaza) authored

Showing 68 changed files with 35,187 additions and 0 deletions.

  1. +44 0 docs/aot-compiler.txt
  2. +208 0 docs/local-regalloc.txt
  3. +628 0 docs/mini-doc.txt
  4. +113 0 docs/opcode-decomp.txt
  5. +5 0 mono/mini/.cvsignore
  6. +1 0  mono/mini/README
  7. +13 0 mono/mini/TODO
  8. +79 0 mono/mini/TestDriver.cs
  9. +44 0 mono/mini/aot-compiler.txt
  10. +555 0 mono/mini/aot.c
  11. +131 0 mono/mini/arrays.cs
  12. +426 0 mono/mini/basic-float.cs
  13. +595 0 mono/mini/basic-long.cs
  14. +622 0 mono/mini/basic.cs
  15. +244 0 mono/mini/bench.cs
  16. +216 0 mono/mini/cfold.c
  17. +130 0 mono/mini/cprop.c
  18. +611 0 mono/mini/cpu-g4.md
  19. +619 0 mono/mini/cpu-pentium.md
  20. +1,400 0 mono/mini/debug-dwarf2.c
  21. +164 0 mono/mini/debug-mini.c
  22. +111 0 mono/mini/debug-private.h
  23. +231 0 mono/mini/debug-stabs.c
  24. +1,138 0 mono/mini/debug.c
  25. +163 0 mono/mini/debug.h
  26. +28 0 mono/mini/debugger-main.c
  27. +479 0 mono/mini/dominators.c
  28. +733 0 mono/mini/driver.c
  29. +86 0 mono/mini/emullong.brg
  30. +953 0 mono/mini/exceptions-ppc.c
  31. +972 0 mono/mini/exceptions-x86.c
  32. +1,317 0 mono/mini/exceptions.cs
  33. +236 0 mono/mini/genmdesc.c
  34. +323 0 mono/mini/graph.c
  35. +80 0 mono/mini/helpers.c
  36. +94 0 mono/mini/iltests.il
  37. +233 0 mono/mini/inssel-float.brg
  38. +47 0 mono/mini/inssel-long.brg
  39. +606 0 mono/mini/inssel-long32.brg
  40. +497 0 mono/mini/inssel-x86.brg
  41. +1,589 0 mono/mini/inssel.brg
  42. +401 0 mono/mini/jit-icalls.c
  43. +189 0 mono/mini/linear-scan.c
  44. +202 0 mono/mini/liveness.c
  45. +208 0 mono/mini/local-regalloc.txt
  46. +8 0 mono/mini/main.c
  47. +99 0 mono/mini/makefile
  48. +12 0 mono/mini/mini-arch.h
  49. +628 0 mono/mini/mini-doc.txt
  50. +314 0 mono/mini/mini-ops.h
  51. +2,854 0 mono/mini/mini-ppc.c
  52. +26 0 mono/mini/mini-ppc.h
  53. +3,141 0 mono/mini/mini-x86.c
  54. +37 0 mono/mini/mini-x86.h
  55. +6,182 0 mono/mini/mini.c
  56. +662 0 mono/mini/mini.h
  57. +28 0 mono/mini/mini.prj
  58. +415 0 mono/mini/objects.cs
  59. +113 0 mono/mini/opcode-decomp.txt
  60. +72 0 mono/mini/regalloc.c
  61. +53 0 mono/mini/regalloc.h
  62. +114 0 mono/mini/regset.c
  63. +38 0 mono/mini/regset.h
  64. +1,099 0 mono/mini/ssa.c
  65. +348 0 mono/mini/test.cs
  66. +748 0 mono/mini/tramp-ppc.c
  67. +345 0 mono/mini/tramp-x86.c
  68. +87 0 mono/mini/viewstat.pl
44 docs/aot-compiler.txt
... ... @@ -0,0 +1,44 @@
  1 +Mono Ahead Of Time Compiler
  2 +===========================
  3 +
  4 +The new mono JIT has sophisticated optimization features. It uses SSA and has a
  5 +pluggable architecture for further optimizations. This makes it possible and
  6 +efficient to use the JIT also for AOT compilation.
  7 +
  8 +
  9 +* file format: We use the native object format of the platform. That way it is
  10 + possible to reuse existing tools like objdump and the dynamic loader. All we
  11 + need is a working assembler, i.e. we write out a text file which is then
  12 + passed to gas (the gnu assembler) to generate the object file.
  13 +
  14 +* file names: we simply add ".so" to the generated file. For example:
  15 + basic.exe -> basic.exe.so
  16 + corlib.dll -> corlib.dll.so
  17 +
  18 +* starting the AOT compiler: mini --aot assembly_name
  19 +
  20 +The following things are saved in the object file:
  21 +
  22 +* version infos:
  23 +
  24 +* native code: this is labeled with method_XXXXXXXX: where XXXXXXXX is the
  25 + hexadecimal token number of the method.
  26 +
  27 +* additional information needed by the runtime: For example we need to store
  28 + the code length and the exception tables. We also need a way to patch
  29 + constants only available at runtime (for example vtable and class
  30 + addresses). This is stored in a binary blob labeled method_info_XXXXXXXX:
  31 +
  32 +PROBLEMS:
  33 +
  34 +- all precompiled methods must be domain independent, or we must add patch infos to
  35 + patch the target domain.
  36 +
  37 +- the main problem is how to patch runtime related addresses, for example:
  38 +
  39 + - current application domain
  40 + - string objects loaded with LDSTR
  41 + - address of MonoClass data
  42 + - static field offsets
  43 + - method addresses
  44 + - virtual function and interface slots
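
As a concrete illustration, here is a minimal C sketch of the scheme described
above (this is not the actual aot.c code; the helper name and the exact
assembler/linker invocation are assumptions):

	#include <stdio.h>
	#include <stdlib.h>

	/* basic.exe -> basic.exe.so: assemble the text file emitted by the
	 * AOT compiler with gas, then link it into a loadable module. */
	static void
	assemble_aot_module (const char *assembly_name, const char *asm_file)
	{
		char obj [1024], out [1024], cmd [4096];

		snprintf (obj, sizeof (obj), "%s.o", assembly_name);
		snprintf (out, sizeof (out), "%s.so", assembly_name);

		snprintf (cmd, sizeof (cmd), "as %s -o %s", asm_file, obj);
		system (cmd);

		snprintf (cmd, sizeof (cmd), "ld -shared %s -o %s", obj, out);
		system (cmd);
	}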
208 docs/local-regalloc.txt
... ... @@ -0,0 +1,208 @@
  1 +
  2 +* Proposal for the local register allocator
  3 +
  4 + The local register allocator deals with allocating registers
  5 + for temporaries inside a single basic block, while the global
  6 + register allocator is concerned with method-wide allocation of
  7 + variables.
   8 + The global register allocator uses callee-saved registers for its
   9 + purpose, so that there is no need to save and restore these registers
  10 + at call sites.
  11 +
  12 + There are a number of issues the local allocator needs to deal with:
  13 + *) some instructions expect operands in specific registers (for example
  14 + the shl instruction on x86, or the call instruction with thiscall
  15 + convention, or the equivalent call instructions on other architectures,
  16 + such as the need to put output registers in %oX on sparc)
  17 + *) some instructions deliver results only in specific registers (for example
  18 + the div instruction on x86, or the call instructions on almost all
  19 + the architectures).
  20 + *) it needs to know what registers may be clobbered by an instruction
  21 + (such as in a method call)
  22 + *) it should avoid excessive reloads or stores to improve performance
  23 +
  24 + While which specific instructions have limitations is architecture-dependent,
  25 + the problem should be solved in an arch-independent way to reduce code duplication.
  26 + The register allocator will be 'driven' by the arch-dependent code, but its
  27 + implementation should be arch-independent.
  28 +
  29 + To improve the current local register allocator, we need to
  30 + keep more state in it than the current setup, which only keeps busy/free info.
  31 +
  32 + Possible state information is:
  33 +
  34 + free: the register is free to use and doesn't contain useful info
  35 + freeable: the register contains data loaded from a local (there is
  36 + also info about _which_ local it contains) as a result from previous
  37 + instructions (like, there was a store from the register to the local)
  38 + moveable: it contains live data that is needed in a following instruction, but
  39 + the contents may be moved to a different register
  40 + busy: the register contains live data and it is placed there because
  41 + the following instructions need it exactly in that register
  42 + allocated: the register is used by the global allocator
  43 +
  44 + The local register allocator will have the following interfaces:
  45 +
  46 + int get_register ();
  47 + Searches for a register in the free state. If it doesn't find it,
  48 + searches for a freeable register. Sets the status to moveable.
  49 + Looking for a 'free' register before a freeable one should allow for
  50 + removing a few redundant loads (though I'm still unsure if such
  51 + things should be delegated entirely to the peephole pass).
  52 +
  53 + int get_register_force (int reg);
  54 + Returns 'reg' if it is free or freeable. If it is moveable, it moves it
  55 + to another free or freeable register.
  56 + Sets the status of 'reg' to busy.
  57 +
  58 + void set_register_freeable (int reg);
  59 + Sets the status of 'reg' to freeable.
  60 +
  61 + void set_register_free (int reg);
  62 + Sets the status of 'reg' to free.
  63 +
  64 + void will_clobber (int reg);
  65 + Spills the register to the stack. Sets the status to freeable.
  66 + After the clobbering has occurred, set the status to free.
  67 +
  68 + void register_unspill (int reg);
  69 + Un-spills register reg and sets the status to moveable.
  70 +
  71 + FIXME: how is the 'local' information represented? Maybe a MonoInst* pointer.
  72 +
  73 + Note: the register allocator will insert instructions in the basic block
  74 + during its operation.
  75 +
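 A minimal C sketch of this interface follows (the enum names are assumptions,
 and how the 'local' information is represented is still open, as the FIXME
 above says):

	/* per-register state tracked by the local allocator */
	enum {
		MONO_REG_FREE,      /* no useful content */
		MONO_REG_FREEABLE,  /* mirrors a local, can be reloaded cheaply */
		MONO_REG_MOVEABLE,  /* live data that may be moved elsewhere */
		MONO_REG_BUSY,      /* live data needed in exactly this register */
		MONO_REG_ALLOCATED  /* owned by the global allocator */
	};

	int  get_register          (void);
	int  get_register_force    (int reg);
	void set_register_freeable (int reg);
	void set_register_free     (int reg);
	void will_clobber          (int reg);
	void register_unspill      (int reg);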
  76 +* Examples
  77 +
  78 + Given the tree (on x86 the right argument to shl needs to be in ecx):
  79 +
  80 + store (local1, shl (local1, call (some_arg)))
  81 +
  82 + At the start of the basic block, the registers are set to the free state.
  83 + The sequence of instructions may be:
  84 + instruction register status -> [%eax %ecx %edx]
  85 + start free free free
  86 + eax = load local1 mov free free
  87 + /* call clobbers eax, ecx, edx */
  88 + spill eax free free free
  89 + call mov free free
  90 + /* now eax contains the right operand of the shl */
  91 + mov %eax -> %ecx free busy free
  92 + un-spill mov busy free
  93 + shl %cl, %eax mov free free
  94 +
  95 + The resulting x86 code is:
  96 + mov $fffc(%ebp), %eax
  97 + mov %eax, $fff0(%ebp)
  98 + push some_arg
  99 + call func
  100 + mov %eax, %ecx
  101 + mov $fff0(%ebp), %eax
  102 + shl %cl, %eax
  103 +
  104 + Note that since shl could operate directly on memory, we could have:
  105 +
  106 + push some_arg
  107 + call func
  108 + mov %eax, %ecx
  109 + shl %cl, $fffc(%ebp)
  110 +
  111 + The above example with loading the operand in a register is just to complicate
  112 + the example and show that the algorithm should be able to handle it.
  113 +
  114 + Let's take another example with the this-call call convention (the first argument
  115 + is passed in %ecx).
  116 + In this case, will_clobber() will be called only on %eax and %edx, while %ecx
  117 + will be allocated with get_register_force ().
  118 + Note: when a register is allocated with get_register_force(), it should be set
  119 + to a different state as soon as possible.
  120 +
  121 + store (local1, shl (local1, this-call (local1)))
  122 +
  123 + instruction register status -> [%eax %ecx %edx]
  124 + start free free free
  125 + eax = load local1 mov free free
  126 + /* force load in %ecx */
  127 + ecx = load local1 mov busy free
  128 + spill eax free busy free
  129 + call mov free free
  130 + /* now eax contains the right operand of the shl */
  131 + mov %eax -> %ecx free busy free
  132 + un-spill mov busy free
  133 + shl %cl, %eax mov free free
  134 +
  135 + What happens when a register that we need to allocate with get_register_force ()
  136 + contains an operand for the next instruction?
  137 +
  138 + instruction register status -> [%eax %ecx %edx]
  139 + eax = load local0 mov free free
  140 + ecx = load local1 mov mov free
  141 + get_register_force (ecx) here.
  142 + We have two options:
  143 + mov %ecx, %edx
  144 + or:
  145 + spill %ecx
  146 + The first option is way better (and allows the peephole pass to
  147 + just load the value in %edx directly, instead of loading first to %ecx).
  148 + This doesn't work, though, if the instruction clobbers the %edx register
  149 + (like in a this-call). So, we first need to clobber the registers
  150 + (so the state of %ecx changes to freeable and there is no issue
  151 + with get_register_force ()).
  152 + What if an instruction both clobbers a register and requires it as
  153 + an operand? Let's take the x86 idiv instruction as an example: it
  154 + requires the dividend in edx:eax and returns the result in eax,
  155 + with the modulus in edx.
  156 +
  157 + store (local1, div (local1, local2))
  158 +
  159 + instruction register status -> [%eax %ecx %edx]
  160 + eax = load local1 mov free free
  161 + will_clobber eax, edx free mov free
  162 + force mov %ecx, %eax busy free free
  163 + set %edx busy free busy
  164 + idiv mov free free
  165 +
  166 + Note: edx is set to free after idiv, because the modulus is not needed
  167 + (if it was a rem, eax would have been freed).
  168 + If we load the divisor before will_clobber(), we'll have to spill
  169 + eax and reload it later. If we load it just after the idiv, there is no issue.
  170 + In any case, the algorithm should give the correct results and allow the operation.
  171 +
  172 + Working recursively on the instructions there shouldn't be huge issues
  173 + with this algorithm (though, of course, it's not optimal and it may
  174 + introduce excessive spills or register moves). The advantage over the current
  175 + local reg allocator is that:
  176 + 1) the number of spills/moves would be smaller anyway
  177 + 2) a separate peephole pass could be able to eliminate reg moves
  178 + 3) we'll be able to remove the 'forced' spills we currently do with
  179 + the return value of method calls
  180 +
  181 +* Issues
  182 +
  183 + How to best integrate such a reg allocator with the burg stuff.
  184 +
  185 + Think about a call on sparc with two arguments: they go into %o0 and %o1
  186 + and each of them sets the register as busy. But what if the values to put there
  187 + are themselves the result of a call? %o0 is no problem, but for each
  188 + following argument n the above algorithm would spill all the 0...n-1 registers...
  189 +
  190 +* Papers
  191 +
  192 + More complex solutions to the local register allocator problem:
  193 + http://dimacs.rutgers.edu/TechnicalReports/abstracts/1997/97-33.html
  194 +
  195 + Combining register allocation and instruction scheduling:
  196 + http://citeseer.nj.nec.com/motwani95combining.html
  197 +
  198 + More on LRA heuristics:
  199 + http://citeseer.nj.nec.com/liberatore97hardness.html
  200 +
  201 + Linear-time optimal code scheduling for delayed-load architectures
  202 + http://www.cs.wisc.edu/~fischer/cs701.f01/inst.sched.ps.gz
  203 +
  204 + Precise Register Allocation for Irregular Architectures
  205 + http://citeseer.nj.nec.com/kong98precise.html
  206 +
  207 + Allocate registers first to subtrees that need more of them.
  208 + http://www.upb.de/cs/ag-kastens/compii/folien/comment401-409.2.pdf
628 docs/mini-doc.txt
... ... @@ -0,0 +1,628 @@
  1 +
  2 + A new JIT compiler for the Mono Project
  3 +
  4 + Miguel de Icaza (miguel@{ximian.com,gnome.org}),
  5 +
  6 +
  7 +* Abstract
  8 +
  9 + Mini is a new compilation engine for the Mono runtime. The
  10 + new engine is designed to bring new code generation
  11 + optimizations, portability and precompilation.
  12 +
  13 + In this document we describe the design decisions and the
  14 + architecture of the new compilation engine.
  15 +
  16 +* Introduction
  17 +
  18 + First we discuss the overall architecture of the Mono runtime,
  19 + and how code generation fits into it; then we discuss the
  20 + development and basic architecture of our first JIT compiler
  21 + for the ECMA CIL framework. The next section covers the
  22 + objectives for the work on the new JIT compiler, then we
  23 + discuss the new features available in the new JIT compiler,
  24 + and finally we give a technical description of the new code
  25 + generation engine.
  26 +
  27 +* Architecture of the Mono Runtime
  28 +
  29 + The Mono runtime is an implementation of the ECMA Common
  30 + Language Infrastructure (CLI), whose aim is to be a common
  31 + platform for executing code in multiple languages.
  32 +
  33 + Languages that target the CLI generate images that contain
  34 + code in a high-level intermediate representation called the
  35 + "Common Intermediate Language". This intermediate language is
  36 + rich enough to allow for programs and pre-compiled libraries
  37 + to be reflected. The execution model is object oriented, with
  38 + single inheritance and multiple interface implementations.
  40 +
  41 + This runtime provides a number of services for programs that
  42 + are targeted to it: Just-in-Time compilation of CIL code into
  43 + native code, garbage collection, thread management, I/O
  44 + routines, single, double and decimal floating point,
  45 + asynchronous method invocation, application domains, and a
  46 + framework for building arbitrary RPC systems (remoting) and
  47 + integration with system libraries through the Platform Invoke
  48 + functionality.
  49 +
  50 + The focus of this document is on the services provided by the
  51 + Mono runtime to transform CIL bytecodes into code that is
  52 + native to the underlying architecture.
  53 +
  54 + The code generation interface is a set of macros, found in the
  55 + mono/jit/arch/ directory, that allow a C programmer to
  56 + generate code on the fly.
  57 + These macros are used by the JIT compiler to generate native
  58 + code.
  59 +
  60 + The platform invocation code is interesting, as it generates
  61 + CIL code on the fly to marshal parameters, and then this
  62 + code is in turn processed by the JIT engine.
  63 +
  64 +* Previous Experiences
  65 +
  66 + Mono has built a JIT engine, which has been used to bootstrap
  67 + Mono since January, 2002. This JIT engine has reasonable
  68 + performance, and uses a tree pattern matching instruction
  69 + selector based on the BURS technology. This JIT compiler was
  70 + designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.
  71 +
  72 + The existing JIT compiler has three phases:
  73 +
  74 + * Re-creation of the semantic tree from CIL
  75 + byte-codes.
  76 +
  77 + * Instruction selection, with a cost-driven
  78 + engine.
  79 +
  80 + * Code generation and register allocation.
  81 +
  82 + It is also hooked into the rest of the runtime to provide
  83 + services like marshaling, just-in-time compilation and
  84 + invocation of "internal calls".
  85 +
  86 + This engine constructed a collection of trees, which we
  87 + referred to as the "forest of trees"; this forest is created by
  88 + "hydrating" the CIL instruction stream.
  89 +
  90 + The first step was to identify the basic blocks in the method,
  91 + and to compute the control flow graph (cfg) for it. Once this
  92 + information was computed, a stack analysis on each basic block
  93 + was performed to create a forest of trees for each one of
  94 + them.
  95 +
  96 + So for example, the following statement:
  97 +
  98 + int a, b;
  99 + ...
  100 + b = a + 1;
  101 +
  102 + which would be represented in CIL as:
  103 +
  104 + ldloc.0
  105 + ldc.i4.1
  106 + add
  107 + stloc.1
  108 +
  109 + The stack analysis would then create the following tree:
  110 +
  111 + (STIND_I4 ADDR_L[EBX|2] (
  112 + ADD (LDIND_I4 ADDR_L[ESI|1])
  113 + CONST_I4[1]))
  114 +
  115 + This tree contains information from the stack analysis: for
  116 + instance, notice that the operations explicitly encode the
  117 + data types they are operating on; there is no longer any
  118 + ambiguity about the types, because this information has been
  119 + inferred.
  120 +
  121 + At this point the JIT would pass the constructed forest of
  122 + trees to the architecture-dependent JIT compiler.
  123 +
  124 + The architecture dependent code then performed register
  125 + allocation (optionally using linear scan allocation for
  126 + variables, based on life analysis).
  127 +
  128 + Once variables had been assigned, tree pattern matching with
  129 + dynamic programming was used (the tree pattern matcher is
  130 + custom built for each architecture, using a code
  131 + generator: monoburg). The instruction selector used cost
  132 + functions to select the best instruction patterns.
  133 +
  134 + The instruction selector is able to produce instructions that
  135 + take advantage of the x86 indexed addressing modes, for
  136 + example.
  137 +
  138 + One problem though is that the code emitter and the register
  139 + allocator did not have any visibility outside the current
  140 + tree, which meant that some redundant instructions were
  141 + generated. A peephole optimizer with this architecture was
  142 + hard to write, given the tree-based representation that is
  143 + used.
  144 +
  145 + This JIT was functional, but it did not provide a good
  146 + architecture to base future optimizations on. Also the
  147 + line between architecture neutral and architecture
  148 + specific code and optimizations was hard to draw.
  149 +
  150 + The JIT engine supported two code generation modes for
  151 + applications that host multiple
  152 + application domains: generate code that will be shared across
  153 + application domains, or generate code that will not be shared
  154 + across application domains.
  155 +
  156 +* Objectives of the new JIT engine.
  157 +
  158 + We wanted to support a number of features that were missing:
  159 +
  160 + * Ahead-of-time compilation.
  161 +
  162 + The idea is to allow developers to pre-compile their code
  163 + to native code to reduce startup time, and the working
  164 + set that is used at runtime in the just-in-time compiler.
  165 +
  166 + Although in Mono this has not been a visible problem, we
  167 + wanted to address it proactively.
  168 +
  169 + When an assembly (a Mono/.NET executable) is installed in
  170 + the system, it would then be possible to pre-compile the
  171 + code, and have the JIT compiler tune the generated code
  172 + to the particular CPU on which the software is
  173 + installed.
  174 +
  175 + This is done in the Microsoft.NET world with a tool
  176 + called ngen.exe.
  177 +
  178 + * Have a good platform for doing code optimizations.
  179 +
  180 + The design called for a good architecture that would
  181 + enable various levels of optimizations: some
  182 + optimizations are better performed on high-level
  183 + intermediate representations, some on medium-level and
  184 + some at low-level representations.
  185 +
  186 + Also it should be possible to conditionally turn these on
  187 + or off. Some optimizations are too expensive to be used
  188 + in just-in-time compilation scenarios, but these
  189 + expensive optimizations can be turned on for
  190 + ahead-of-time compilations or when using profile-guided
  191 + optimizations on a subset of the executed methods.
  192 +
  193 + * Reduce the effort required to port the Mono code
  194 + generator to new architectures.
  195 +
  196 + For Mono to gain wide adoption in the Unix world, it is
  197 + necessary that the JIT engine works on most of today's
  198 + commercial hardware platforms.
  199 +
  200 +* Features of the new JIT engine.
  201 +
  202 + The new JIT engine was architected by Dietmar Maurer and Paolo
  203 + Molaro, based on the new objectives.
  204 +
  205 + Mono provides a number of services to applications running
  206 + with the new JIT compiler:
  207 +
  208 + * Just-in-Time compilation of CLI code into native code.
  209 +
  210 + * Ahead-of-Time compilation of CLI code, to reduce
  211 + startup time of applications.
  212 +
  213 + A number of software development features are also available:
  214 +
  215 + * Execution time profiling (--profile)
  216 +
  217 + Generates a report of the time consumed by routines,
  218 + as well as the invocation counts and the
  219 + callers.
  220 +
  221 + * Memory usage profiling (--profile)
  222 +
  223 + Generates a report of the memory usage by a program
  224 + that is run under the Mono JIT.
  225 +
  226 + * Code coverage (--coverage)
  227 +
  228 + * Execution tracing.
  229 +
  230 + People who are interested in developing and improving the Mini
  231 + JIT compiler will also find a few useful routines:
  232 +
  233 + * Compilation times
  234 +
  235 + This is used to measure the time spent by the JIT
  236 + when compiling a routine.
  237 +
  238 + * Control Flow Graph and Dominator Tree drawing.
  239 +
  240 + These are visual aids for the JIT developer: they
  241 + render representations of the Control Flow graph, and
  242 + for the more advanced optimizations, they draw the
  243 + dominator tree graph.
  244 +
  245 + This requires Dot (from the graphviz package) and Ghostview.
  246 +
  247 + * Code generator regression tests.
  248 +
  249 + The engine contains support for running regression
  250 + tests on the virtual machine, which is very helpful to
  251 + developers interested in improving the engine.
  252 +
  253 + * Optimization benchmark framework.
  254 +
  255 + The JIT engine will generate graphs that compare
  256 + various benchmarks embedded in an assembly, and run the
  257 + various tests with different optimization flags.
  258 +
  259 + This requires Perl and GD::Graph.
  260 +
  261 +* Flexibility
  262 +
  263 + This is probably the most important component of the new code
  264 + generation engine. The internals are relatively easy to
  265 + replace and update; even large passes can be replaced and
  266 + implemented differently.
  267 +
  268 +* New code generator
  269 +
  270 + Compiling a method begins with the `mini_method_to_ir' routine
  271 + that converts the CIL representation into a medium
  272 + intermediate representation.
  273 +
  274 + The mini_method_to_ir routine performs a number of operations:
  275 +
  276 + * Flow analysis and control flow graph computation.
  277 +
  278 + Unlike the previous version, stack analysis and control
  279 + flow graphs are computed in a single pass in the
  280 + mini_method_to_ir function; this is done for performance
  281 + reasons: although the complexity increases, the benefit
  282 + for a JIT compiler is that there is more time available
  283 + for performing other optimizations.
  284 +
  285 + * Basic block computation.
  286 +
  287 + mini_method_to_ir populates the MonoCompile structure
  288 + with an array of basic blocks, each of which contains a
  289 + forest of trees made up of MonoInst structures.
  290 +
  291 + * Inlining
  292 +
  293 + Inlining is no longer restricted to methods containing
  294 + a single basic block; instead it is possible to inline
  295 + arbitrarily complex methods.
  296 +
  297 + The heuristics to choose what to inline are likely going
  298 + to be tuned in the future.
  299 +
  300 + * Method to opcode conversion.
  301 +
  302 + Some method call invocations like `call Math.Sin' are
  303 + transformed into an opcode: this transforms the call
  304 + into a semantically rich node, which is later inlined
  305 + into an FPU instruction.
  306 +
  307 + Various Array method invocations are turned into
  308 + opcodes as well (the Get, Set and Address methods).
  309 +
  310 + * Tail recursion elimination
  311 +
  312 +** Basic blocks
  313 +
  314 + The MonoInst structure holds the actual decoded instruction,
  315 + with the semantic information from the stack analysis.
  316 + MonoInst is interesting because initially it is part of a tree
  317 + structure; here is a sample of the same tree in the new JIT
  318 + engine:
  319 +
  320 + (stind.i4 regoffset[0xffffffd4(%ebp)]
  321 + (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
  322 + iconst[1]))
  323 +
  324 + This is a medium-level intermediate representation (MIR).
  325 +
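 An illustrative sketch of how these pieces relate (the field and type names
 here are simplified assumptions, not the actual mini.h definitions):

	/* a node in the expression tree/DAG for one operation */
	typedef struct _MonoInstSketch {
		int opcode;                    /* e.g. add, ldind.i4, iconst */
		struct _MonoInstSketch *left;  /* operand subtrees */
		struct _MonoInstSketch *right;
	} MonoInstSketch;

	/* a basic block holds the roots of its forest of trees */
	typedef struct {
		MonoInstSketch **roots;
		int num_roots;
	} MonoBasicBlockSketch;

	/* the per-method compilation state */
	typedef struct {
		MonoBasicBlockSketch *bblocks;  /* array of basic blocks */
		int num_bblocks;
	} MonoCompileSketch;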
  326 + Some complex opcodes are decomposed at this stage into a
  327 + collection of simpler opcodes. Not every complex opcode is
  328 + decomposed at this stage, as we need to preserve the semantic
  329 + information during various optimization phases.
  330 +
  331 + For example a NEWARR opcode carries the length and the type of
  332 + the array that could be used later to avoid type checking or
  333 + array bounds check.
  334 +
  335 + There are a number of operations supported on this
  336 + representation:
  337 +
  338 + * Branch optimizations.
  339 +
  340 + * Variable liveness.
  341 +
  342 + * Loop optimizations: the dominator trees are
  343 + computed, loops are detected, and their nesting
  344 + level computed.
  345 +
  346 + * Conversion of the method into static single assignment
  347 + form (SSA form).
  348 +
  349 + * Dead code elimination.
  350 +
  351 + * Constant propagation.
  352 +
  353 + * Copy propagation.
  354 +
  355 + * Constant folding.
  356 +
  357 + Once the above optimizations are optionally performed, a
  358 + decomposition phase is used to turn some complex opcodes into
  359 + internal method calls. In the initial version of the JIT
  360 + engine, various operations on longs are emulated instead of
  361 + being inlined. Also the newarr invocation is turned into a
  362 + call to the runtime.
  363 +
  364 + At this point, after computing variable liveness, it is
  365 + possible to use the linear scan algorithm for allocating
  366 + variables to registers. The linear scan pass uses the
  367 + information that was previously gathered by the loop nesting
  368 + and loop structure computation to favor variables in inner
  369 + loops.
  370 +
  371 + Stack space is then reserved for the local variables and any
  372 + temporary variables generated during the various
  373 + optimizations.
  374 +
  375 +** Instruction selection
  376 +
  377 + At this point, the BURS instruction selector is invoked to
  378 + transform the tree-based representation into a list of
  379 + instructions. This is done using a tree pattern matcher that
  380 + is generated for the architecture using the `monoburg' tool.
  381 +
  382 + Monoburg takes as input a file that describes tree patterns,
  383 + which are matched against the trees that were produced by the
  384 + engine in the previous stages.
  385 +
  386 + The pattern matching might have more than one match for a
  387 + particular tree. In this case, the match selected is the one
  388 + whose cost is the smallest. A cost can be attached to each
  389 + rule, and if no cost is provided, the implicit cost is one.
  390 + Smaller costs are selected over higher costs.
  391 +
  392 + The cost function can be used to select particular blocks of
  393 + code for a given architecture, or, by using a prohibitively
  394 + high number, to avoid having the rule match.
  395 +
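 For illustration, a rule might look like the following (the syntax mirrors
 the rule sketch in opcode-decomp.txt; treat the details as an assumption,
 since the real rules live in the inssel*.brg files added by this commit):

	reg: ADD (reg, CONST_I4) {
		/* hypothetical emitter body; a cost annotation
		 * (implicitly 1 here) would steer selection */
	}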
  396 + The various rules that our JIT engine uses transform a tree of
  397 + MonoInsts into a list of MonoInsts:
  398 +
  399 + +-----------------------------------------------------------+
  400 + | Tree List |
  401 + | of ===> Instruction selection ===> of |
  402 + | MonoInst MonoInst. |
  403 + +-----------------------------------------------------------+
  404 +
  405 + During this process various kinds of MonoInst
  406 + disappear and are turned into lower-level representations. The
  407 + JIT compiler just happens to reuse the same structure (this is
  408 + done to reduce memory usage and improve memory locality).
  409 +
  410 + The instruction selection rules are split in a number of
  411 + files, each one with a particular purpose:
  412 +
  413 + inssel.brg
  414 + Contains the generic instruction selection
  415 + patterns.
  416 +
  417 + inssel-x86.brg
  418 + Contains x86 specific rules.
  419 +
  420 + inssel-ppc.brg
  421 + Contains PowerPC specific rules.
  422 +
  423 + inssel-long32.brg
  424 + burg file for 64bit instructions on 32bit architectures.
  425 +
  426 + inssel-long.brg
  427 + burg file for 64bit architectures.
  428 +
  429 + inssel-float.brg
  430 + burg file for floating point instructions.
  431 +
  432 + For a given build, a set of those files would be included.
  433 + For example, for the build of Mono on the x86, the following
  434 + set is used:
  435 +
  436 + inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg
  437 +
  438 +** Native method generation
  439 +
  440 + The native method generation has a number of steps:
  441 +
  442 + * Architecture specific register allocation.
  443 +
  444 + The information about loop nesting that was
  445 + previously gathered is used here to hint the
  446 + register allocator.
  447 +
  448 + * Generating the method prolog/epilog.
  449 +
  450 + * Optionally generate code to introduce tracing facilities.
  451 +
  452 + * Hooking into the debugger.
  453 +
  454 + * Performing any pending fixups.
  455 +
  456 + * Code generation.
  457 +
  458 +*** Code Generation
  459 +
  460 + The actual code generation is contained in the architecture
  461 + specific portion of the compiler. The input to the code
  462 + generator is each one of the basic blocks with its list of
  463 + instructions that were produced in the instruction selection
  464 + phase.
  465 +
  466 + During the instruction selection phase, virtual registers are
  467 + assigned. Just before the peephole optimization is performed,
  468 + physical registers are assigned.
  469 +
  470 + A simple peephole and algebraic optimizer is run at this
  471 + stage.
  472 +
  473 + The peephole optimizer removes some redundant operations at
  474 + this point. This is possible because the code generation at
  475 + this point has visibility into the basic block that spans the
  476 + original trees.
  477 +
  478 + The algebraic optimizer performs some simple algebraic
  479 + optimizations that replace expensive operations with cheaper
  480 + operations if possible.
  481 +
  482 + The rest of the code generation is fairly simple: a switch
  483 + statement is used to generate code for each of the MonoInsts.
  484 +
  485 + We always try to allocate code in sequence, instead of just using
  486 + malloc. This way we increase spatial locality, which gives a
  487 + massive speedup on most architectures.
  488 +
  489 +*** Ahead of Time compilation
  490 +
  491 + Ahead-of-Time compilation is a new feature of our new
  492 + compilation engine. The compilation engine is shared by the
  493 + Just-in-Time (JIT) compiler and the Ahead-of-Time compiler
  494 + (AOT).
  495 +
  496 + The difference is on the set of optimizations that are turned
  497 + on for each mode: Just-in-Time compilation should be as fast
  498 + as possible, while Ahead-of-Time compilation can take as long
  499 + as required, because it is not performed at a
  500 + time-critical moment.
  501 +
  502 + With AOT compilation, we can afford to turn all of the
  503 + computationally expensive optimizations on.
  504 +
  505 + After the code generation phase is done, the code and any
  506 + required fixup information is saved into a file that is
  507 + readable by "as" (the native assembler available on all
  508 + systems). This assembly file is then passed to the native
  509 + assembler, which generates a loadable module.
  510 +
  511 + At execution time, when an assembly is loaded from the disk,
  512 + the runtime engine will probe for the existence of a
  513 + pre-compiled image. If the pre-compiled image exists, then it
  514 + is loaded, and the method invocations are resolved to the code
  515 + contained in the loaded module.
  516 +
  517 + The code generated under the AOT scenario is slightly
  518 + different from the JIT scenario. It generates code that is
  519 + application-domain relative and that can be shared among
  520 + multiple threads.
  521 +
  522 + This is the same code generation that is used when the runtime
  523 + is instructed to maximize code sharing on a multi-application
  524 + domain scenario.
  525 +
  526 +* SSA-based optimizations
  527 +
  528 + SSA form simplifies many optimizations because each variable has exactly
  529 + one definition site. All uses of a variable are "dominated" by its
  530 + definition, which enables us to implement algorithms like:
  531 +
  532 + * conditional constant propagation
  533 +
  534 + * array bound check removal
  535 +
  536 + * dead code elimination
  537 +
  538 + And we can implement those algorithms in an efficient way using SSA.
  539 +
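 For example, a minimal sketch of what SSA construction does to a local
 (C-like pseudocode):

	/* before SSA */           /* after SSA */
	x = 1;                     x1 = 1;
	if (c)                     if (c)
	        x = 2;                     x2 = 2;
	                           x3 = phi (x1, x2);
	return x;                  return x3;

 Each xN has exactly one definition site, so every use is dominated by its
 definition.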
  540 +
  541 +* Register allocation.
  542 +
  543 + Global register allocation is performed on the medium
  544 + intermediate representation just before instruction selection
  545 + is performed on the method. Local register allocation is
  546 + later performed at the basic-block level on the symbolic registers assigned during instruction selection.
  547 +
  548 + Global register allocation uses the following input:
  549 +
  550 + 1) set of register-sized variables that can be allocated to a
  551 + register (this is an architecture specific setting, for x86
  552 + these registers are the callee-saved registers ESI, EDI and
  553 + EBX).
  554 +
  555 + 2) liveness information for the variables
  556 +
  557 + 3) (optionally) loop info to favour variables that are used in
  558 + inner loops.
  559 +
  560 + During the instruction selection phase, symbolic registers are
  561 + assigned to temporary values in expressions.
  562 +
  563 + Local register allocation assigns hard registers to the
  564 + symbolic registers, and it is performed just before the code
  565 + is actually emitted and is performed at the basic block level.
  566 + A CPU description file describes the input registers, output
  567 + registers, fixed registers and clobbered registers by each
  568 + operation.
  569 +
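 For illustration, an entry in such a description file might look like this
 (modeled on the cpu-pentium.md file added by this commit; the attribute
 names here are an assumption):

	add: dest:i src1:i src2:i len:2 clob:1

 where dest/src1/src2 name the register classes, len is the maximum length
 in bytes of the generated code, and clob marks what the instruction
 clobbers.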
  570 +
  571 +----------
  572 +* Bootstrap
  573 +
  574 + The Mini bootstrap parses the arguments passed on the command
  575 + line, and initializes the JIT runtime. Each time the
  576 + mini_init() routine is invoked, a new Application Domain will
  577 + be returned.
  578 +
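 A hypothetical embedding sketch of that sequence (the actual mini_init
 signature is defined in mini.h and may differ):

	/* initialize the JIT runtime and get the root application domain */
	MonoDomain *domain = mini_init (argv [0]);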
  579 +* Signal handlers
  580 +
  581 + mono_runtime_install_handlers
  582 +
  583 +* BURG Code Generator Generator
  584 +
  585 + monoburg was written by Dietmar Maurer. It is based on the
  586 + papers from Christopher W. Fraser, Robert R. Henry and Todd
  587 + A. Proebsting: "BURG - Fast Optimal Instruction Selection and
  588 + Tree Parsing" and "Engineering a Simple, Efficient Code
  589 + Generator Generator".
  590 +
  591 + The original BURG implementation is unable to work on DAGs; only
  592 + trees are allowed. Our monoburg implementation is able to generate a tree
  593 + matcher which works on DAGs, and we use this feature in the new
  594 + JIT. This simplifies the code because we can directly pass DAGs and
  595 + don't need to convert them to trees.
  596 +
  597 +* Future
  598 +
  599 + Profile-based optimization is something that we are very
  600 + interested in supporting. There are two possible usage
  601 + scenarios:
  602 +
  603 + * Based on the profile information gathered during
  604 + the execution of a program, hot methods can be compiled
  605 + with the highest level of optimizations, while bootstrap
  606 + code and cold methods can be compiled with the least set
  607 + of optimizations and placed in a discardable list.
  608 +
  609 + * Code reordering: this profile-based optimization would
  610 + only make sense for pre-compiled code. The profile
  611 + information is used to re-order the assembly code on disk
  612 + so that the code is placed on the disk in a way that
  613 + improves locality.
  614 +
  615 + This is the same principle under which SGI's cord program
  616 + works.
  617 +
  618 + The nature of the CIL allows the above optimizations to be
  619 + easy to implement and deploy. Since we define the whole
  620 + environment in which this code lives, no interaction with
  621 + system tools is required, nor are upgrades to the underlying
  622 + infrastructure.
  623 +
  624 + Instruction scheduling is important for certain kinds of
  625 + processors, and some of the framework exists today in our
  626 + register allocator and the instruction selector to cope with
  627 + this, but it has not been finished. The instruction selection
  628 + would happen at the same time as local register allocation.
113 docs/opcode-decomp.txt
... ... @@ -0,0 +1,113 @@
  1 +
  2 +* How to handle complex IL opcodes in an arch-independent way
  3 +
  4 + Many IL opcodes are very simple: add, ldind etc.
  5 + Such opcodes can be implemented with a single cpu instruction
  6 + in most architectures (on some, a group of IL instructions
  7 + can be converted to a single cpu op).
  8 + There are many IL opcodes, though, that are more complex, but
  9 + can be expressed as a series of trees or a single tree of
  10 + simple operations. Such simple operations are architecture-independent.
  11 + It makes sense to decompose such complex IL instructions into their
  12 + simpler equivalents so that we gain in several ways:
  13 + *) porting effort is easier, because only the simple instructions
  14 + need to be implemented in arch-specific code
  15 + *) we could apply BURG rules to the trees and do pattern matching
  16 + on them to optimize the expressions according to the host cpu
  17 +
  18 + The issue is: where do we do such conversion from coarse opcodes to
  19 + simple expressions?
  20 +
  21 +* Doing the conversion in method_to_ir ()
  22 +
  23 + Some of these conversions can certainly be done in method_to_ir (),
  24 + but it's not always easy to decide which are better done there and
  25 + which in a different pass.
  26 + For example, let's take ldlen: in the mono implementation, ldlen
  27 + can be simply implemented with a load from a fixed position in the
  28 + array object:
  29 +
  30 + len = [reg + maxlen_offset]
  31 +
  32 + However, ldlen carries also semantics information: the result is the
  33 + length of the array, and since in the CLR arrays are of fixed size,
  34 + this information can be useful to later do bounds check removal.
  35 + If we convert this opcode in method_to_ir () we lose some useful
  36 + information for further optimizations.
  37 +
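 For instance, using the tree notation from mini-doc.txt (the decomposed
 opcode names here are illustrative), the ldlen conversion sketched above
 would turn:

	LDLEN (reg)

 into something like:

	LDIND_I4 (ADD (reg, CONST_I4[maxlen_offset]))

 losing the "this is an array length" semantics in the process.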
  38 + In other cases, decomposing an opcode in method_to_ir() may
  39 + allow for better optimizations later on (need to come up with an
  40 + example here ...).
  41 +
  42 +* Doing the conversion in inssel.brg
  43 +
  44 + Some conversion may be done inside the burg rules: this has the
  45 + disadvantage that the instruction selector is not run again on
  46 + the resulting expression tree and we could miss some optimization
  47 + (this is what effectively happens with the coarse opcodes in the old
  48 + jit). This may also interfere with an efficient local register allocator.
  49 + It may be possible to add an extension in monoburg that allows a rule
  50 + such as:
  51 +
  52 + recheck: LDLEN (reg) {
  53 + create an expression tree representing LDLEN
  54 + and return it
  55 + }
  56 +
  57 + When the monoburg label process gets back a recheck, it will run
  58 + the labeling again on the resulting expression tree.
  59 + Whether this is possible at all (and in an efficient way) is a
  60 + question for dietmar :-)
  61 + It should be noted, though, that this may not always work, since
  62 + some complex IL opcodes may require a series of expression trees
  63 + and handling such cases in monoburg could become quite hairy.
  64 + For example, think of an opcode that needs to do multiple actions on the
  65 + same object: this basically means a DUP...
  66 + On the other hand, if a complex opcode needs a DUP, monoburg doesn't
  67 + actually need to create trees if it emits the instructions in
  68 + the correct sequence and maintains the right values in the registers
  69 + (usually the values that need a DUP are not changed...). How
  70 + this integrates with the current register allocator is not clear, since
  71 + that assigns registers based on the rule, but the instructions emitted
  72 + by the rules may be different (this already happens with the current JIT
  73 + where a MULT is replaced with lea etc...).
  74 +
  75 +* Doing it in a separate pass.
  76 +
  77 + Doing the conversion in a separate pass over the instructions
  78 + is another alternative. This can be done right after method_to_ir ()
  79 + or after the SSA pass (since the IR after the SSA pass should look
  80 + almost like the IR we get back from method_to_ir ()).
  81 +
  82 + This has the following advantages:
  83 + *) monoburg will handle only the simple opcodes (makes porting easier)
  84 + *) the instruction selection will be run on all the additional trees
  85 + *) it's easier to support coarse opcodes that produce multiple expression
  86 + trees (and apply the monoburg selector on all of them)
  87 + *) the SSA optimizer will see the original opcodes and will be able to use
  88 + the semantic info associated with them
  89 +
  90 + The disadvantage is that this is a separate pass on the code and
  91 + it takes time (how much has not been measured yet, though).
  92 +
  93 + With this approach, we may also be able to have C implementations
  94 + of some of the opcodes: this pass would insert a function call to
  95 + the C implementation (for example in the cases when first porting
  96 + to a new arch and implementing some stuff may be too hard in asm).
  97 +
  98 +* Extended basic blocks
  99 +
  100 + IL code needs a lot of checks, bounds checks, overflow checks,
  101 + type checks and so on. This can greatly increase
  102 + the number of basic blocks in a control flow graph. However,
  103 + all such blocks end up with a throw opcode that gives control to the
  104 + exception handling mechanism.
  105 + After method_to_ir () a MonoBasicBlock can be considered a sort
  106 + of extended basic block where the additional exits don't point
  107 + to basic blocks in the same procedure (at least when the method
  108 + doesn't have exception tables).
  109 + We need to make sure the passes following method_to_ir () can cope
  110 + with such kinds of extended basic blocks (especially the passes
  111 + that we need to apply to all the methods: as a start, we could
  112 + skip SSA optimizations for methods with exception clauses...)
  113 +
5 mono/mini/.cvsignore
... ... @@ -0,0 +1,5 @@
  1 +mini
  2 +mono-debugger-mini-wrapper
  3 +.libs
  4 +*.o
  5 +test.exe
1  mono/mini/README
... ... @@ -0,0 +1 @@
  1 +Mini is the new JIT compiler for Mono.
13 mono/mini/TODO
... ... @@ -0,0 +1,13 @@
  1 +* use a pool of MBState structures to speed up monoburg instead of using a
  2 + mempool.
  3 +* the decode tables in the burg-generated code could use short instead of int
  4 + (this should save about 1 KB)
  5 +* track the use of ESP, so that we can avoid the x86_lea in the epilog
  6 +
  7 +
  8 +Other Ideas:
  9 +
  10 +* the ORP people avoid optimizations inside catch handlers - just to save
  11 + memory (for example allocation of strings - instead they allocate strings when
  12 + the code is executed (like the --shared option)). But there are only a few
  13 + functions using catch handlers, so I consider this a minor issue.
79 mono/mini/TestDriver.cs
... ... @@ -0,0 +1,79 @@
  1 +using System;
  2 +using System.Reflection;
  3 +
  4 +
  5 +public class TestDriver {
  6 +
  7 + static public int RunTests (Type type, string[] args) {
  8 + int failed = 0, ran = 0;
  9 + int result, expected, elen;
  10 + int i, j;
  11 + string name;
  12 + MethodInfo[] methods;
  13 + bool do_timings = false;
  14 + int tms = 0;
  15 + DateTime start, end = DateTime.Now;
  16 +
  17 + if (args != null && args.Length > 0) {
  18 + for (j = 0; j < args.Length; j++) {
  19 + if (args [j] == "--time") {
  20 + do_timings = true;
  21 + string[] new_args = new string [args.Length - 1];
  22 + for (i = 0; i < j; ++i)
  23 + new_args [i] = args [i];
  24 + j++;
  25 + for (; j < args.Length; ++i, ++j)
  26 + new_args [i] = args [j];
  27 + args = new_args;
  28 + break;
  29 + }
  30 + }
  31 + }
  32 + methods = type.GetMethods (BindingFlags.Public|BindingFlags.NonPublic|BindingFlags.Static);
  33 + for (i = 0; i < methods.Length; ++i) {
  34 + name = methods [i].Name;
  35 + if (!name.StartsWith ("test_"))
  36 + continue;
  37 + if (args != null && args.Length > 0) {
  38 + bool found = false;
  39 + for (j = 0; j < args.Length; j++) {
  40 + if (name.EndsWith (args [j])) {
  41 + found = true;
  42 + break;
  43 + }
  44 + }
  45 + if (!found)
  46 + continue;
  47 + }
  48 + for (j = 5; j < name.Length; ++j)
  49 + if (!Char.IsDigit (name [j]))
  50 + break;
  51 + expected = Int32.Parse (name.Substring (5, j - 5)); // tests are named test_<expected>_<name>
  52 + start = DateTime.Now;
  53 + result = (int)methods [i].Invoke (null, null);
  54 + if (do_timings) {
  55 + end = DateTime.Now;
  56 + long tdiff = end.Ticks - start.Ticks;
  57 + int mdiff = (int)(tdiff / 10000); // 100ns ticks -> milliseconds
  58 + tms += mdiff;
  59 + Console.WriteLine ("{0} took {1} ms", name, mdiff);
  60 + }
  61 + ran++;
  62 + if (result != expected) {
  63 + failed++;
  64 + Console.WriteLine ("{0} failed: got {1}, expected {2}", name, result, expected);
  65 + }
  66 + }
  67 +
  68 + if (do_timings) {
  69 + Console.WriteLine ("Total ms: {0}", tms);
  70 + }
  71 + Console.WriteLine ("Regression tests: {0} ran, {1} failed in {2}", ran, failed, type);
  72 + //Console.WriteLine ("Regression tests: {0} ran, {1} failed in [{2}]{3}", ran, failed, type.Assembly.GetName().Name, type);
  73 + return failed;
  74 + }
  75 + static public int RunTests (Type type) {
  76 + return RunTests (type, null);
  77 + }
  78 +}
  79 +
44 mono/mini/aot-compiler.txt
... ... @@ -0,0 +1,44 @@
  1 +Mono Ahead Of Time Compiler
  2 +===========================
  3 +
  4 +The new mono JIT has sophisticated optimization features. It uses SSA and has a
  5 +pluggable architecture for further optimizations. This makes it possible and
  6 +efficient to use the JIT also for AOT compilation.
  7 +
  8 +
  9 +* file format: We use the native object format of the platform. That way it is
  10 + possible to reuse existing tools like objdump and the dynamic loader. All we
  11 + need is a working assembler, i.e. we write out a text file which is then
  12 + passed to gas (the gnu assembler) to generate the object file.
  13 +
  14 +* file names: we simply add ".so" to the generated file. For example:
  15 + basic.exe -> basic.exe.so
  16 + corlib.dll -> corlib.dll.so
  17 +
  18 +* starting the AOT compiler: mini --aot assembly_name
  19 +
  20 +The following things are saved in the object file:
  21 +
  22 +* version infos:
  23 +
  24 +* native code: this is labeled with method_XXXXXXXX: where XXXXXXXX is the
  25 + hexadecimal token number of the method.
  26 +
  27 +* additional information needed by the runtime: For example we need to store
  28 + the code length and the exception tables. We also need a way to patch
  29 + constants only available at runtime (for example vtable and class
  30 + addresses). This is stored in a binary blob labeled method_info_XXXXXXXX:
  31 +
  32 +PROBLEMS:
  33 +
  34 +- all precompiled methods must be domain independent, or we must add patch infos to
  35 + patch the target domain.
  36 +
  37 +- the main problem is how to patch runtime related addresses, for example:
  38 +
  39 + - current application domain
  40 + - string objects loaded with LDSTR
  41 + - address of MonoClass data
  42 + - static field offsets
  43 + - method addresses
  44 + - virtual function and interface slots
555 mono/mini/aot.c
... ... @@ -0,0 +1,555 @@
  1 +/*
  2 + * aot.c: mono Ahead of Time compiler
  3 + *
  4 + * Author:
  5 + * Dietmar Maurer (dietmar@ximian.com)
  6 + *
  7 + * (C) 2002 Ximian, Inc.
  8 + */
  9 +
  10 +#include <sys/types.h>
  11 +#include <unistd.h>
  12 +#include <fcntl.h>
  13 +#include <sys/mman.h>
  14 +
  15 +#include <limits.h> /* for PAGESIZE */
  16 +#ifndef PAGESIZE
  17 +#define PAGESIZE 4096
  18 +#endif
  19 +
  20 +#include <mono/metadata/tabledefs.h>
  21 +#include <mono/metadata/class.h>
  22 +#include <mono/metadata/object.h>
  23 +#include <mono/metadata/tokentype.h>
  24 +#include <mono/metadata/appdomain.h>
  25 +#include <mono/metadata/debug-helpers.h>
  26 +
  27 +#include "mini.h"
  28 +
  29 +#define ENCODE_TYPE_POS(t,l) (((t) << 24) | (l)) /* type tag in the top byte, position/length in the low 24 bits */