Author: Dietmar Maurer (dietmar@ximian.com)
(C) 2001 Ximian, Inc.
(C) 2007 Novell, Inc.
[ 2007 extensions based on posts from Paolo Molaro ]
How to trigger JIT compilation
==============================
The JIT translates CIL code to native code on a per-method basis. For example,
if you have this simple program:
	public class Test {
		public static void Main () {
			System.Console.WriteLine ("Hello");
		}
	}
the JIT first compiles the Main function. Unfortunately, Main() contains a
reference to System.Console.WriteLine(), so the JIT also needs the address of
WriteLine() to generate a call instruction.
The simplest solution would be to JIT compile System.Console.WriteLine()
to generate that address. But that would mean that we JIT compile half of our
class library at once, since WriteLine() uses many other classes and functions,
and we have to call the JIT for each of them. Even worse, there is the
possibility of cyclic references, and we would end up in an endless loop.
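
To make the cost of the eager strategy concrete, here is a small Python model.
It is not Mono code: the call graph and its method names are made up for
illustration. It compiles a method together with everything it references, and
needs an explicit "already seen" set to survive the cycle.

```python
# Hypothetical call graph: each method maps to the methods it references.
CALL_GRAPH = {
    "Main": ["WriteLine"],
    "WriteLine": ["Format", "Write"],
    "Format": ["ToString", "WriteLine"],  # cycle: Format -> WriteLine
    "Write": ["Flush"],
    "ToString": [],
    "Flush": [],
}


def eager_compile(method, graph, compiled=None):
    """Compile `method` and, recursively, every method it references.

    The `compiled` set is the only thing that stops the recursion on the
    Format -> WriteLine cycle; without it this would never terminate.
    """
    if compiled is None:
        compiled = set()
    if method in compiled:
        return compiled
    compiled.add(method)
    for callee in graph[method]:
        eager_compile(callee, graph, compiled)
    return compiled


compiled = eager_compile("Main", CALL_GRAPH)
```

Even with the cycle guarded, compiling Main alone drags in all six methods of
this toy graph, whether or not they are ever executed.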
Thus we need some kind of trampoline function for JIT compilation. Such a
trampoline first calls the JIT compiler to create native code, and then jumps
directly into that code. Whenever the JIT needs the address of a function (to
emit a call instruction) it uses the address of those trampoline functions.
One drawback of this approach is that it requires an additional indirection: we
always call the trampoline. Inside the trampoline we need to check whether the
method is already compiled, and start JIT compilation if it is not. Only then
do we call the code. Paying this cost on every call is time consuming and
performs very badly.
The solution is to add some logic to the trampoline function to detect from
where it is called. It is then possible for the JIT to patch the call
instruction in the caller, so that it directly calls the JIT compiled code
next time.
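
The patching idea can be modelled in a few lines. The following is a toy
Python sketch, not Mono code: `Jit`, `CallSite` and `return_site` are
hypothetical names, "compiling" simply hands back a Python callable, and the
return address that a real trampoline reads from the stack is modelled by
recording the calling site just before the call.

```python
class Jit:
    """Toy JIT: 'native code' for a method is just a Python callable."""

    def __init__(self, methods):
        self.methods = methods      # name -> callable standing in for CIL
        self.compiled = {}          # name -> "native code"
        self.trampoline_hits = 0
        self.return_site = None     # models the return address on the stack

    def compile(self, name):
        if name not in self.compiled:
            self.compiled[name] = self.methods[name]
        return self.compiled[name]

    def trampoline_for(self, name):
        def trampoline(*args):
            self.trampoline_hits += 1
            site = self.return_site     # detect where we were called from
            code = self.compile(name)
            site.target = code          # patch the caller's call instruction
            return code(*args)          # then jump into the compiled code
        return trampoline


class CallSite:
    """Stands in for one call instruction inside already compiled code."""

    def __init__(self, jit, target):
        self.jit = jit
        self.target = target

    def call(self, *args):
        self.jit.return_site = self     # a real call pushes a return address
        return self.target(*args)


jit = Jit({"WriteLine": lambda text: "out: " + text})
site = CallSite(jit, jit.trampoline_for("WriteLine"))
first = site.call("Hello")      # goes through the trampoline and compiles
second = site.call("Hello")     # goes straight to the compiled code
```

After the first call the site points at the compiled code, so the trampoline
runs exactly once for this call site.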
Implementation Details
======================
Mono 1.2.6 has quite a few improvements in this area compared to Mono
1.2.5, which was released just a few weeks ago. I'll try to detail the
major changes below.
The first change is related to how the memory for the specific
trampolines is allocated: this is executable memory so it is not
allocated with malloc, but with a custom allocator, called Mono Code
Manager. Since the code manager is used primarily for methods, it
allocates chunks of memory that are aligned to multiples of 8 or 16
bytes depending on the architecture: this allows the CPU to fetch the
instructions faster. But the specific trampolines are not performance
critical (we'll spend lots of time JITting the method anyway), so they
can tolerate a smaller alignment. Considering the fact that most
trampolines are allocated one after the other and that in most
architectures they are 10 or 12 bytes, this change alone saved about
25% of the memory used (they used to be aligned up to 16 bytes).
To give a rough idea of how many trampolines are generated I'll give a
few examples:
* MonoDevelop startup creates about 21 thousand trampolines
* IronPython 2.0 running a benchmark creates about 17 thousand trampolines
* a "hello, world" style program about 800
This change in the first case saved more than 80 KB of memory (plus
about the same because reviewing the code allowed me to also fix a
related overallocation issue).
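
The figures above can be checked with a little arithmetic. The sketch below
assumes 12-byte trampolines that used to be rounded up to 16 bytes and are now
packed back to back; the text only says "10 or 12 bytes" and "a smaller
alignment", so both the size and the new alignment of 1 are assumptions.

```python
def align_up(size, alignment):
    """Round `size` up to the next multiple of `alignment`."""
    return (size + alignment - 1) // alignment * alignment


def bytes_saved(count, tramp_size, old_align=16, new_align=1):
    """Memory saved when `count` trampolines move to a smaller alignment."""
    old_total = count * align_up(tramp_size, old_align)
    new_total = count * align_up(tramp_size, new_align)
    return old_total - new_total


# MonoDevelop startup: about 21 thousand trampolines (figure from the text).
saved = bytes_saved(21000, 12)
saved_kib = saved / 1024

# Fraction of each 16-byte slot that was padding for a 12-byte trampoline.
fraction = 1 - 12 / align_up(12, 16)
```

Under these assumptions each trampoline gives back 4 of its 16 bytes (25%),
and 21000 of them give back 84000 bytes, which is in line with "more than
80 KB".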
So reducing the size of the trampolines is great, but it's really not
possible to reduce them much further in size, if at all. The next step
is trying just not to create them.
There are two primary ways a trampoline is generated: a direct call to
the method is made or a virtual table slot is filled with a trampoline
for the case when the method is invoked using a virtual call. I'll
note here that in both cases, after compiling the method, the magic
trampoline will do the needed changes so that the trampoline is not
executed again, but execution goes directly to the newly compiled
code. In one case the callsite is changed so that the branch or call
instruction will transfer control to the new address. In the virtual
call case the magic trampoline will change the virtual table slot
directly.
The sequence of instructions used by the JIT to implement a virtual
call is well known and the magic trampoline (inspecting the registers
and the code sequence) can easily get the virtual table slot that was
used for the invocation. The idea here then is: if we know the virtual
table slot we know also the method that is supposed to be compiled and
executed, since each vtable slot is assigned a unique method by the
class loader. This simple fact allows us to use a completely generic
trampoline in the virtual table slots, avoiding the creation of many
method-specific trampolines.
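
Here is a toy Python model of that idea, not Mono code. `Runtime`, `VTable`
and `last_call` are hypothetical names; in particular `last_call` stands in
for what the real magic trampoline recovers from the registers and the call
instruction bytes.

```python
class Runtime:
    def __init__(self):
        self.compile_count = 0
        self.last_call = None   # (vtable, slot) of the virtual call in flight

    def generic_trampoline(self, *args):
        # One trampoline for every slot: the slot tells us the method.
        vtable, slot = self.last_call
        method = vtable.slot_methods[slot]  # fixed by the "class loader"
        self.compile_count += 1             # pretend to JIT compile it
        vtable.slots[slot] = method         # patch the vtable slot directly
        return method(*args)


class VTable:
    def __init__(self, runtime, slot_methods):
        self.runtime = runtime
        self.slot_methods = list(slot_methods)
        # Every slot starts out holding the same shared trampoline.
        self.slots = [runtime.generic_trampoline] * len(self.slot_methods)

    def call_virtual(self, slot, *args):
        self.runtime.last_call = (self, slot)
        return self.slots[slot](*args)


rt = Runtime()
vt = VTable(rt, [lambda: "ToString", lambda: "GetHashCode"])
first = vt.call_virtual(1)    # trampoline compiles slot 1 and patches it
second = vt.call_virtual(1)   # goes straight to the method
```

No per-method trampoline object is ever created here: slot 0 still holds the
shared trampoline, and slot 1 was patched after a single compilation.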
In the cases above, the number of generated trampolines goes from
21000 to 7700 for MonoDevelop (saving 160 KB of memory), from 17000 to
5400 for the IronPython case and from 800 to 150 for the hello world
case.
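
As a rough cross-check of these numbers, the sketch below multiplies the
number of avoided trampolines by an assumed size of 12 bytes each (the text
only says "10 or 12 bytes", so the size is an assumption).

```python
TRAMPOLINE_SIZE = 12  # bytes per trampoline; assumption, see above


def savings(before, after, size=TRAMPOLINE_SIZE):
    """Return (trampolines avoided, bytes saved) for one workload."""
    avoided = before - after
    return avoided, avoided * size


# (before, after) trampoline counts taken from the text.
monodevelop = savings(21000, 7700)
ironpython = savings(17000, 5400)
hello_world = savings(800, 150)
```

For MonoDevelop this gives 13300 avoided trampolines and 159600 bytes, which
agrees with the quoted saving of about 160 KB.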
Kinds of Trampolines and Thunks
===============================
This is a list of the trampolines and thunks used in Mono:
- create_fnptr
- load_aot_method
- imt thunk
	Interface Method Table; this is used to dispatch calls to
	interface methods.
- jump table
- debugger code
- exception call filter
- trampoline (various types)
- throw corlib exception
- restore context
- throw exception by name
- handle stack overflow
- throw exception
- delegate invoke implementation
- cpuid code
Implementation for x86/x86-64
=============================
Usually code looks like this:
	mov <some address>, %r11
	call *0xfffffffc(%rax)
First, the call instruction can go either directly to the compiled
address or to a trampoline.
If it goes to a trampoline, on amd64 it looks like the one above (on x86
it is different). Currently the trampoline is not modified, but it will
be in the future. On x86 the trampoline looks like:
	push constant
	jmp generic_trampoline
Note that constant can be a MonoMethod*, but it's not necessarily so
(these are the recent changes: this constant can be -1 or -2, the first
for the case of interface calls, the second for virtual calls).
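
To show what such an x86 trampoline amounts to in bytes, here is a Python
sketch that encodes the two instructions with the standard IA-32 opcodes
(0x68 for push imm32, 0xE9 for jmp rel32). Whether Mono emits exactly this
encoding is an assumption, and the function name and addresses are made up
for illustration.

```python
import struct

PUSH_IMM32 = 0x68   # push imm32
JMP_REL32 = 0xE9    # jmp rel32


def encode_specific_trampoline(constant, tramp_addr, generic_addr):
    """Encode `push constant; jmp generic_addr` for code placed at tramp_addr."""
    push = struct.pack("<BI", PUSH_IMM32, constant & 0xFFFFFFFF)
    # The rel32 operand is measured from the end of the jmp instruction.
    jmp_end = tramp_addr + len(push) + 5
    jmp = struct.pack("<Bi", JMP_REL32, generic_addr - jmp_end)
    return push + jmp


# -1 is the constant the text gives for the interface call case.
code = encode_specific_trampoline(-1, 0x1000, 0x2000)
```

The result is 10 bytes long, which is consistent with the "10 or 12 bytes"
quoted earlier for specific trampolines.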
Other architectures are similar in the semantics, but different in the
details.
The above is what happens for virtual calls.
For interface calls, three things can happen:
	1) the call goes directly to the method address.
	2) it goes to a trampoline as described above.
	3) it goes into an IMT collision resolution stub: this is a
	   chunk of code that, based on the constant put inside the
	   imt_register above, will perform a jump to the correct
	   vtable slot for the interface method.
Note that the vtable slot itself could then contain a trampoline.
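
A collision resolution stub can be pictured as a chain of comparisons. The
Python sketch below is a model only: the linear search, the `make_imt_stub`
name and the string keys are assumptions for illustration, and the real stub
is generated machine code.

```python
def make_imt_stub(entries, vtable):
    """Build a stub for interface methods that share one IMT slot.

    entries: list of (key, vtable_slot) pairs; `key` is the constant the
    caller loads into the IMT register.
    vtable: list of callables, read at call time so a patched slot is seen.
    """
    def stub(imt_register, *args):
        for key, slot in entries:
            if imt_register == key:
                return vtable[slot](*args)   # jump through the vtable slot
        raise LookupError("no entry for %r" % (imt_register,))
    return stub


vtable = [lambda: "IList.Clear", lambda: "IDisposable.Dispose"]
stub = make_imt_stub([("Clear", 0), ("Dispose", 1)], vtable)
result = stub("Dispose")
```

Because the stub jumps through the vtable slot instead of caching the target,
it keeps working when a slot that held a trampoline is later patched.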
Some functions that are used here:
	emit-x86.c (arch_create_jit_trampoline): returns the JIT trampoline function
	emit-x86.c (x86_magic_trampoline): contains the code to detect the caller and
	patch the call instruction.
	emit-x86.c (arch_compile_method): JIT compiles a method
Call Sites
==========
There are three basic kinds of call sites:
	1) normal calls:
		call relative_displacement
	2) virtual calls:
		call *positive_offset(%register)
	3) interface calls:
		mov constant, %imt_register
		call *negative_offset(%register)
The above is what happens on x86 and amd64, with different values of
%imt_register for each arch (this register is a constant, but it could
change with different mono builds, so it should probably be one of the
constants the runtime communicates to the debugger). %register can
change from one callsite to another, depending on the register
allocator's choices. Note that the constant for the interface calls won't
necessarily be a MonoMethod address: this could change in the future
to a simple number.
In all three cases the JIT trampolines need to inspect the call
site, but only in the first case is the call site itself changed.
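
As an illustration of what inspecting the call site can involve, here is a
Python sketch that looks at the bytes just before a return address and sorts
the call into one of the three kinds. It uses the standard x86 encodings
(E8 rel32 for a direct call, FF /2 with an 8-bit or 32-bit displacement for
an indirect one) and is an assumption-laden model rather than Mono's decoder:
it ignores REX prefixes and SIB bytes, and decoding backwards from a return
address can be ambiguous for arbitrary byte sequences.

```python
import struct


def _is_indirect_call(modrm, mod):
    """True if modrm encodes `call *disp(%reg)`: /2 extension, no SIB byte."""
    return (modrm >> 6) == mod and ((modrm >> 3) & 7) == 2 and (modrm & 7) != 4


def _describe_indirect(modrm, disp):
    # Per the text: negative offsets are interface calls, positive virtual.
    kind = "interface" if disp < 0 else "virtual"
    return {"kind": kind, "register": modrm & 7, "offset": disp}


def classify_call_site(code, ret):
    """Classify the call instruction that ends at offset `ret` in `code`."""
    # call *disp32(%reg): FF, modrm (mod=10), 4 displacement bytes
    if ret >= 6 and code[ret - 6] == 0xFF and _is_indirect_call(code[ret - 5], 2):
        (disp,) = struct.unpack_from("<i", code, ret - 4)
        return _describe_indirect(code[ret - 5], disp)
    # call rel32: E8, 4 bytes relative to the return address
    if ret >= 5 and code[ret - 5] == 0xE8:
        (rel,) = struct.unpack_from("<i", code, ret - 4)
        return {"kind": "normal", "target": ret + rel}
    # call *disp8(%reg): FF, modrm (mod=01), 1 displacement byte
    if ret >= 3 and code[ret - 3] == 0xFF and _is_indirect_call(code[ret - 2], 1):
        (disp,) = struct.unpack_from("<b", code, ret - 1)
        return _describe_indirect(code[ret - 2], disp)
    return {"kind": "unknown"}


normal = classify_call_site(b"\x90\x90\xe8\x10\x00\x00\x00", 7)
virtual = classify_call_site(b"\x90\x90\x90\xff\x50\x10", 6)
interface = classify_call_site(b"\xff\x90\xfc\xff\xff\xff", 6)
```

The last example is the `call *0xfffffffc(%rax)` instruction shown earlier,
whose displacement of -4 marks it as an interface call.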