Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Newer
Older
100644 925 lines (759 sloc) 34.418 kB
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
1 <!-- ((! set title Performance and Profiling !)) ((! set learn !)) -->
2
95035ca @Chris00 Slight template change for "centered" pages
Chris00 authored
3 *Table of contents*
4
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
5 # Performance and Profiling
6
7 ## ObQuote...
8 *"One serious obstacle to the adoption of good programming languages is
9 the notion that everything has to be sacrificed for speed. In computer
10 languages as in life, speed kills." — Mike Vanier*
11
12 ## Speed
13 Why is OCaml fast? Indeed, step back and ask *is OCaml fast?* How can we
14 make programs faster? In this chapter we'll look at what actually
15 happens when you compile your OCaml programs down to machine code. This
16 will help in understanding why OCaml is not just a great language for
17 programming, but is also a very fast language indeed. And it'll help you
18 to help the compiler write better machine code for you. It's also (I
19 believe anyway) a good thing for programmers to have some idea of what
20 happens between you typing `ocamlopt` and getting a binary you can run.
21
22 But you will need to know some assembler to get the most out of this
23 section. Don't be afraid! I'll help you out by translating the assembler
24 into a C-like pseudocode (after all C is just a portable assembly
25 language).
26
27 ### Basics of assembly language
28 The examples I give in this chapter are all compiled on an x86 Linux
29 machine. The x86 is, of course, a 32 bit machine, so an x86 "word" is 4
30 bytes (= 32 bits) long. At this level OCaml deals mostly with word-sized
31 objects, so you'll need to remember to multiply by four to get the size
32 in bytes.
33
34 To refresh your memory, the x86 has only a small number of general
35 purpose registers, each of which can store one word. The Linux assembler
36 puts `%` in front of register names. The registers are: `%eax`, `%ebx`,
37 `%ecx`, `%edx`, `%esi`, `%edi`, `%ebp` (special register used for stack
38 frames), and `%esp` (the stack pointer).
39
40 The Linux assembler (in common with other Unix assemblers, but opposite
41 to MS-derived assemblers) writes moves to and from registers/memory as:
42
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
43 ```assembly
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
44 movl from, to
45 ```
46 So `movl %ebx, %eax` means "copy the contents of register `%ebx` into
47 register `%eax`" (not the other way round).
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
48
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
49 Almost all of the assembly language that we will look at is going to be
50 dominated not by machine code instructions like `movl` but by what are
51 known as \<dfn\>assembler directives\</dfn\>. These directives begin
52 with a . (period) and they literally *direct* the assembler to do
53 something. Here are the common ones for the Linux assembler:
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
54
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
55 #### .text
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
56
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
57 **Text** is the Unix way of saying "program code". The **text segment**
58 simply means the part of the executable where program code is stored.
59 The `.text` directive switches the assembler so it starts writing into
60 the text segment.
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
61
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
62 #### .data
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
63
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
64 Similarly, the `.data` directive switches the assembler so it starts
65 writing into the data segment (part) of the executable.
66
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
67 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
68 .globl foo
69 foo:
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
70 ```
71 This declares a global symbol called `foo`. It means the address of the
72 next thing to come can be named `foo`. Writing just `foo:` without the
73 preceeding `.globl` directive declares a local symbol (local to just the
74 current file).
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
75
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
76 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
77 .long 12345
78 .byte 9
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
79 .ascii "hello"
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
80 .space 4
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
81 ```
82 `.long` writes a word (4 bytes) to the current segment. `.byte` writes a
83 single byte. `.ascii` writes a string of bytes (NOT nul-terminated).
84 `.space` writes the given number of zero bytes. Normally you use these
85 in the data segment.
86
87 ### The "hello, world" program
88 Enough assembler. Put the following program into a file called
89 `smallest.ml`:
90
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
91 ```ocaml
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
92 print_string "hello, world\n"
93 ```
94 And compile it to a native code executable using:
95
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
96 ```shell
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
97 ocamlopt -S smallest.ml -o smallest
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
98 ```
99 The `-S` (capital S) tells the compiler to leave the assembly language
100 file (called `smallest.s` - lowercase s) instead of deleting it.
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
101
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
102 Here are the edited highlights of the `smallest.s` file with my comments
103 added. First of all the data segment:
104
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
105 ```assembly
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
106 .data
107 .long 4348 ; header for string
108 .globl Smallest__1
109 lest__1:
110 .ascii "hello, world\12" ; string
111 .space 2 ; padding ..
112 .byte 2 ; .. after string
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
113 ```
114 Next up the text (program code) segment:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
115
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
116 ```assembly
117 .text
118 .globl Smallest__entry ; entry point to the program
119 lest__entry:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
120
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
121 ; this is equivalent to the C pseudo-code:
122 ; Pervasives.output_string (stdout, &Smallest__1)
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
123
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
124 movl $Smallest__1, %ebx
125 movl Pervasives + 92, %eax ; Pervasives + 92 == stdout
126 call Pervasives__output_string_212
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
127
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
128 ; return 1
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
129
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
130 movl $1, %eax
131 ret
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
132 ```
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
133
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
134 In C everything has to be inside a function. Think about how you can't
135 just write `printf ("hello, world\n");` in C, but instead you have to
136 put it inside `main () { ... }`. In OCaml you are allowed to have
137 commands at the top level, not inside a function. But when we translate
138 this into assembly language, where do we put those commands? There needs
139 to be some way to call those commands from outside, so they need to be
140 labelled in some way. As you can see from the code snippet, OCaml solves
141 this by taking the filename (`smallest.ml`), capitalizing it and adding
142 `__entry`, thus making up a symbol called `Smallest__entry` to refer to
143 the top level commands in this file.
144
145 Now look at the code that OCaml has generated. The original code said
146 `print_string "hello, world\n"`, but OCaml has instead compiled the
147 equivalent of `Pervasives.output_string stdout "hello, world\n"`. Why?
148 If you look into `pervasives.ml` you'll see why:
149
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
150 ```ocaml
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
151 let print_string s = output_string stdout s
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
152 ```
153 OCaml has *inlined* this function. **Inlining** - taking a function and
154 expanding it from its definition - is sometimes a performance win,
155 because it avoids the overhead of an extra function call, and it can
156 lead to more opportunities for the optimizer to do its thing. Sometimes
157 inlining is not good, because it can lead to code bloating, and thus
158 destroys the good work done by the processor cache (and besides function
159 calls are actually not very expensive at all on modern processors).
160 OCaml will inline simple calls like this, because they are essentially
161 risk free, almost always leading to a small performance gain.
162
163 What else can we notice about this? The calling convention seems to be
164 that the first two arguments are passed in the `%eax` and `%ebx`
165 registers respectively. Other arguments are probably passed on the
166 stack, but we'll see about that later.
167
168 C programs have a simple convention for storing strings, known as
169 **ASCIIZ**. This just means that the string is stored in ASCII, followed
170 by a trailing NUL (`\0`) character. OCaml stores strings in a different
171 way, as we can see from the data segment above. This string is stored
172 like this:
173
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
174 ```
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
175 4 byte header: 4348
176 the string: h e l l o , SP w o r l d \n
177 padding: \0 \0 \002
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
178 ```
179 The format is unusual. It's documented in [this posting on the OCaml
180 mailing
c77d3ae @Chris00 Fix links relatex to "caml-list".
Chris00 authored
181 list](http://caml.inria.fr/pub/ml-archives/caml-list/2002/08/e109df224ff0150b302033e2002dbf87.en.html).
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
182 \<!-- (old: http://caml.inria.fr/archives/200208/msg00463.html) --\>
183
184 Firstly the padding brings the total length of the string up to a whole
185 number of words (4 words, 16 bytes in this example). The padding is
186 carefully designed so that you can work out the actual length of the
187 string in bytes, provided that you know the total number of *words*
188 allocated to the string. The encoding for this is unambiguous (which you
189 can prove to yourself).
190
191 One nice feature of having strings with an explicit length is that you
192 can represent strings containing ASCII NUL (`\0`) characters in them,
193 something which is difficult to do in C. However, the flip side is that
194 you need to be aware of this if you pass an OCaml string to some C
195 native code: if it contains ASCII NUL, then the C `str*` functions will
196 fail on it.
197
198 Secondly we have the header. Every boxed (allocated) object in OCaml has
199 a header which tells the garbage collector about how large the object is
200 in words, and something about what the object contains. Writing the
201 number 4348 in binary:
202
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
203 ```
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
204 length of the object in words: 0000 0000 0000 0000 0001 00 (4 words)
205 color (used by GC): 00
206 tag: 1111 1100 (String_tag)
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
207 ```
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
208 See `/usr/include/caml/mlvalues.h` for more information about
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
209 the format of heap allocated objects in OCaml.
210
211 One unusual thing is that the code passes a pointer to the start of the
212 string (ie. the word immediately after the header) to
213 `Pervasives.output_string`. This means that `output_string` must
214 subtract 4 from the pointer to get at the header to determine the length
215 of the string.
216
217 What have I missed out from this simple example? Well, the text segment
218 above is not the whole story. It would be really nice if OCaml
219 translated that simple hello world program into just the five lines of
220 assembler shown above. But there is the question of what actually calls
221 `Smallest__entry` in the real program. For this OCaml includes a whole
222 load of bootstrapping code which does things like starting up the
223 garbage collector, allocating and initializing memory, calling
224 initializers in libraries and so on. OCaml links all of this code
225 statically to the final executable, which is why the program I end up
226 with (on Linux) weighs in at a portly 95,442 bytes. Nevertheless the
227 start-up time for the program is still unmeasurably small (under a
228 millisecond), compared to several seconds for starting up a reasonable
229 Java program and a second or so for a Perl script.
230
231 ### Tail recursion
232 We mentioned in chapter 6 that OCaml can turn tail-recursive function
233 calls into simple loops. Is this actually true? Let's look at what
234 simple tail recursion compiles to:
235
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
236 <!-- do not execute this code!! -->
237 ```ocaml
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
238 let rec loop () =
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
239 print_string "I go on forever ...";
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
240 loop ()
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
241
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
242 let () = loop ()
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
243 ```
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
244
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
245 The file is called `tail.ml`, so following OCaml's usual procedure for
246 naming functions, our function will be called `Tail__loop_nnn` (where
247 `nnn` is some unique number which OCaml appends to distinguish
248 identically named functions from one another).
249
250 Here is the assembler for just the `loop` function defined above:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
251
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
252 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
253 .text
254 .globl Tail__loop_56
255 Tail__loop_56:
256 .L100:
257 ; Print the string
258 movl $Tail__2, %ebx
259 movl Pervasives + 92, %eax
260 call Pervasives__output_string_212
261 .L101:
262 ; The following movl is in fact obsolete:
263 movl $1, %eax
264 ; Jump back to the .L100 label above (ie. loop forever)
265 jmp .L100
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
266 ```
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
267
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
268 So that's pretty conclusive. Calling `Tail__loop_56` will first print
269 the string, and then jump back to the top, then print the string, and
270 jump back, and so on forever. It's a simple loop, *not* a recursive
271 function call, so it doesn't use any stack space.
272
273 ### Digression: Where are the types?
274 OCaml is statically typed as we've said before on many occasions, so at
275 compile time, OCaml knows that the type of `loop` is `unit -> unit`. It
276 knows that the type of `"hello, world\n"` is `string`. It doesn't make
277 any attempt to communicate this fact to the `output_string` function.
278 `output_string` is expecting a `channel` and a `string` as arguments,
279 and indeed that's what it gets. What would happen if we passed, say, an
280 `int` instead of a `string`?
281
282 This is essentially an impossible condition. Because OCaml knows the
283 types at compile time, it doesn't need to deal with types or check types
284 at run time. There is no way, in pure OCaml, to "trick" the compiler
285 into generating a call to `Pervasives.output_string stdout 1`. Such an
286 error would be flagged at compile time, by type inference, and so could
287 never be compiled.
288
289 The upshot is that by the time we have compiled OCaml code to assembler
290 type information mostly isn't required, certainly in the cases we've
291 looked at above where the type is fully known at compile time, and there
292 is no polymorphism going on.
293
294 Fully knowing all your types at compile time is a major performance win
295 because it totally avoids any dynamic type checking at run time. Compare
296 this to a Java method invocation for example: `obj.method ()`. This is
297 an expensive operation because you need to find the concrete class that
298 `obj` is an instance of, and then look up the method, and you need to do
299 all of this potentially *every* time you call any method. Casting
300 objects is another case where you need to do a considerable amount of
301 work at run time in Java. None of this is allowed with OCaml's static
302 types.
303
304 ### Polymorphic types
305 As you might have guessed from the discussion above, polymorphism, which
306 is where the compiler *doesn't* have a fully known type for a function
307 at compile time, might have an impact on performance. Suppose we require
308 a function to work out the maximum of two integers. Our first attempt
309 is:
310
7abfd3f @Chris00 Remove all ```tryocaml
Chris00 authored
311 ```ocamltop
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
312 let max a b =
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
313 if a > b then a else b
314 ```
315 Simple enough, but recall that the \> (greater than) operator in OCaml
316 is polymorphic. It has type `'a -> 'a -> bool`, and this means that the
317 `max` function we defined above is going to be polymorphic:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
318
7abfd3f @Chris00 Remove all ```tryocaml
Chris00 authored
319 ```ocamltop
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
320 let max a b =
321 if a > b then a else b
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
322 ```
323 This is indeed reflected in the code that OCaml generates for this
324 function, which is pretty complex:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
325
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
326 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
327 .text
328 .globl Max__max_56
329 Max__max_56:
330
331 ; Reserve two words of stack space.
332
333 subl $8, %esp
334
335 ; Save the first and second arguments (a and b) on the stack.
336
337 movl %eax, 4(%esp)
338 movl %ebx, 0(%esp)
339
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
340 ; Call the C "greaterthan" function (in the OCaml library).
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
341
342 pushl %ebx
343 pushl %eax
344 movl $greaterthan, %eax
345 call caml_c_call
346 .L102:
347 addl $8, %esp
348
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
349 ; If the C "greaterthan" function returned 1, jump to .L100
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
350
351 cmpl $1, %eax
352 je .L100
353
354 ; Returned 0, so get argument a which we previously saved on
355 ; the stack and return it.
356
357 movl 4(%esp), %eax
358 addl $8, %esp
359 ret
360
361 ; Returned 1, so get argument b which we previously saved on
362 ; the stack and return it.
363
364 .L100:
365 movl 0(%esp), %eax
366 addl $8, %esp
367 ret
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
368 ```
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
369
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
370 Basically the \> operation is done by calling a C function from the
371 OCaml library. This is obviously not going to be very efficient, nothing
372 like as efficient as if we could generate some quick inline assembly
373 language for doing the \>.
374
375 This is not a complete dead loss by any means. All we need to do is to
376 hint to the OCaml compiler that the arguments are in fact integers. Then
377 OCaml will generate a specialised version of `max` which only works on
378 `int` arguments:
379
7abfd3f @Chris00 Remove all ```tryocaml
Chris00 authored
380 ```ocamltop
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
381 let max (a : int) (b : int) =
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
382 if a > b then a else b
383 ```
384 Here is the assembly code generated for this function:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
385
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
386 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
387 .text
388 .globl Max_int__max_56
389 Max_int__max_56:
390
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
391 ; Single assembly instruction "cmpl" for performing the > op.
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
392 cmpl %ebx, %eax
393
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
394 ; If %ebx > %eax, jump to .L100
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
395 jle .L100
396 ; Just return argument a.
397 ret
398 ; Return argument b.
399
400 .L100:
401 movl %ebx, %eax
402 ret
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
403 ```
404 That's just 5 lines of assembler, and is about as simple as you can make
405 it.
406
407 What about this code:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
408
7abfd3f @Chris00 Remove all ```tryocaml
Chris00 authored
409 ```ocamltop
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
410 let max a b =
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
411 if a > b then a else b
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
412
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
413 let () = print_int (max 2 3)
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
414 ```
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
415
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
416 Is OCaml going to be smart enough to inline the `max` function and
417 specialise it to work on integers? Disappointingly the answer is no.
418 OCaml still has to generate the external `Max.max` symbol (because this
419 is a module, and so that function might be called from outside the
420 module), and it doesn't inline the function.
421
422 Here's another variation:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
423
7abfd3f @Chris00 Remove all ```tryocaml
Chris00 authored
424 ```ocamltop
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
425 let max a b =
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
426 if a > b then a else b in
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
427 print_int (max 2 3)
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
428 ```
429 Disappointingly although the definition of `max` in this code is local
430 (it can't be called from outside the module), OCaml still doesn't
431 specialise the function.
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
432
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
433 Lesson: if you have a function which is unintentionally polymorphic then
434 you can help the compiler by specifying types for one or more of the
435 arguments.
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
436
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
437 ### The representation of integers, tag bits, heap-allocated values
438 There are a number of peculiarities about integers in OCaml. One of
439 these is that integers are 31 bit entities, not 32 bit entities. What
440 happens to the "missing" bit?
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
441
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
442 Write this to `int.ml`:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
443
7abfd3f @Chris00 Remove all ```tryocaml
Chris00 authored
444 ```ocamltop
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
445 print_int 3;;
446 ```
447 and compile with `ocamlopt -S int.ml -o int` to generate assembly
448 language in `int.s`. Recall from the discussion above that we are
449 expecting OCaml to inline the `print_int` function as
450 `output_string (string_of_int 3)`, and examining the assembly language
451 output we can see that this is exactly what OCaml does:
452
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
453 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
454 .text
455 .globl Int__entry
456 Int__entry:
457
458 ; Call Pervasives.string_of_int (3)
459
460 movl $7, %eax
461 call Pervasives__string_of_int_152
462
463 ; Call Pervasives.output_string (stdout, result_of_previous)
464
465 movl %eax, %ebx
466 movl Pervasives + 92, %eax
467 call Pervasives__output_string_212
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
468 ```
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
469
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
470 The important code is shown in red. It shows two things: Firstly the
471 integer is unboxed (not allocated on the heap), but is instead passed
472 directly to the function in the register `%eax`. This is fast. But
473 secondly we see that the number being passed is 7, not 3.
474
475 This is a consequence of the representation of integers in OCaml. The
476 bottom bit of the integer is used as a tag - we'll see what for next.
477 The top 31 bits are the actual integer. The binary representation of 7
478 is 111, so the bottom tag bit is 1 and the top 31 bits form the number
479 11 in binary = 3. To get from the OCaml representation to the integer,
480 divide by two and round down.
481
482 Why the tag bit at all? This bit is used to distinguish between integers
483 and pointers to structures on the heap, and the distinction is only
484 necessary if we are calling a polymorphic function. In the case above,
485 where we are calling `string_of_int`, the argument can only ever be an
486 `int` and so the tag bit would never be consulted. Nevertheless, to
487 avoid having two internal representations for integers, all integers in
488 OCaml carry around the tag bit.
489
490 A bit of background about pointers is required to understand why the tag
491 bit is really necessary, and why it is where it is.
492
493 In the world of RISC chips like the Sparc, MIPS and Alpha, pointers have
494 to be word-aligned. So on the older 32 bit Sparc, for example, it's not
495 possible to create and use a pointer which isn't aligned to a multiple
496 of 4 (bytes). Trying to use one generates a processor exception, which
497 means basically your program segfaults. The reason for this is just to
498 simplify memory access. It's just a lot simpler to design the memory
499 subsystem of a CPU if you only need to worry about word-aligned access.
500
501 For historical reasons (because the x86 is derived from an 8 bit chip),
502 the x86 has supported unaligned memory access, although if you align all
503 memory accesses to multiples of 4, then things go faster.
504
505 Nevertheless, all pointers in OCaml are aligned - ie. multiples of 4 for
506 32 bit processors, and multiples of 8 for 64 bit processors. This means
507 that the bottom bit of any pointer in OCaml will always be zero.
508
509 So you can see that by looking at the bottom bit of a register, you can
510 immediately tell if it stores a pointer ("tag" bit is zero), or an
511 integer (tag bit set to one).
512
513 Remember our polymorphic \> function which caused us so much trouble in
514 the previous section? We looked at the assembler and found out that
515 OCaml compiles a call to a C function called `greaterthan` whenever it
516 sees the polymorphic form of \>. This function takes two arguments, in
517 registers `%eax` and `%ebx`. But `greaterthan` can be called with
518 integers, floats, strings, opaque objects ... How does it know what
519 `%eax` and `%ebx` point to?
520
521 It uses the following decision tree:
522
523 * **Tag bit is one:** compare the two integers and return.
524 * **Tag bit is zero:** `%eax` and `%ebx` must point at two
525 heap-allocated memory blocks. Look at the header word of the memory
526 blocks, specifically the bottom 8 bits of the header word, which tag
527 the content of the block.
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
528 * **String_tag:** Compare two strings.
529 * **Double_tag:** Compare two floats.
530 * etc.
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
531
532 Note that because \> has type `'a -> 'a -> bool`, both arguments must
533 have the same type. The compiler should enforce this at compile time. I
534 would assume that `greaterthan` probably contains code to sanity-check
535 this at run time however.
536
537 ### Floats
538 Floats are, by default, boxed (allocated on the heap). Save this as
539 `float.ml` and compile it with `ocamlopt -S float.ml -o float`:
540
7abfd3f @Chris00 Remove all ```tryocaml
Chris00 authored
541 ```ocamltop
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
542 print_float 3.0
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
543 ```
544 The number is not passed directly to `string_of_float` in the `%eax`
545 register as happened above with ints. Instead, it is created statically
546 in the data segment:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
547
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
548 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
549 .data
550 .long 2301
551 .globl Float__1
552 Float__1:
553 .double 3.0
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
554 ```
555 and a pointer to the float is passed in `%eax` instead:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
556
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
557 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
558 movl $Float__1, %eax
559 call Pervasives__string_of_float_157
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
560 ```
561 Note the structure of the floating point number: it has a header (2301),
562 followed by the 8 byte (2 word) representation of the number itself. The
563 header can be decoded by writing it as binary:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
564
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
565 ```
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
566 Length of the object in words: 0000 0000 0000 0000 0000 10 (8 bytes)
567 Color: 00
568 Tag: 1111 1101 (Double_tag)
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
569 ```
570 `string_of_float` isn't polymorphic, but suppose we have a polymorphic
571 function `foo : 'a -> unit` taking one polymorphic argument. If we call
572 `foo` with `%eax` containing 7, then this is equivalent to `foo 3`,
573 whereas if we call `foo` with `%eax` containing a pointer to `Float__1`
574 above, then this is equivalent to `foo 3.0`.
575
576 ### Arrays
577 I mentioned earlier that one of OCaml's targets was numerical computing.
578 Numerical computing does a lot of work on vectors and matrices, which
579 are essentially arrays of floats. As a special hack to make this go
580 faster, OCaml implements \<dfn\>arrays of unboxed floats\</dfn\>. This
581 means that in the special case where we have an object of type
582 `float array` (array of floats), OCaml stores them the same way as in C:
583
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
584 ```C
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
585 double array[10];
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
586 ```
587 ... instead of having an array of pointers to ten separately allocated
588 floats on the heap.
589
590 Let's see this in practice:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
591
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
592 ```ocaml
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
593 let a = Array.create 10 0.0;;
594 for i = 0 to 9 do
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
595 a.(i) <- float_of_int i
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
596 done
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
597 ```
598 I'm going to compile this code with the `-unsafe` option to remove
599 bounds checking (simplifying the code for our exposition here). The
600 first line, which creates the array, is compiled to a simple C call:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
601
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
602 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
603 pushl $Arrayfloats__1 ; Boxed float 0.0
604 pushl $21 ; The integer 10
605 movl $make_vect, %eax ; Address of the C function to call
606 call caml_c_call
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
607 ; ...
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
608 movl %eax, Arrayfloats ; Store the resulting pointer to the
609 ; array at this place on the heap.
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
610 ```
611 The loop is compiled to this relatively simple assembly language:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
612
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
613 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
614 movl $1, %eax ; Let %eax = 0. %eax is going to store i.
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
615 cmpl $19, %eax ; If %eax > 9, then jump out of the
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
616 jg .L100 ; loop (to label .L100 at the end).
617
618 .L101: ; This is the start of the loop body.
619 movl Arrayfloats, %ecx ; Address of the array to %ecx.
620
621 movl %eax, %ebx ; Copy i to %ebx.
622 sarl $1, %ebx ; Remove the tag bit from %ebx by
623 ; shifting it right 1 place. So %ebx
624 ; now contains the real integer i.
625
626 pushl %ebx ; Convert %ebx to a float.
627 fildl (%esp)
628 addl $4, %esp
629
630 fstpl -4(%ecx, %eax, 4) ; Store the float in the array at the ith
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
631 ; position.
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
632
633 addl $2, %eax ; i := i + 1
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
634 cmpl $19, %eax ; If i <= 9, loop around again.
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
635 jle .L101
636 .L100:
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
637 ```
638 The important statement is the one which stores the float into the
639 array. It is:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
640
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
641 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
642 fstpl -4(%ecx, %eax, 4)
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
643 ```
644 The assembler syntax is rather complex, but the bracketed expression
645 `-4(%ecx, %eax, 4)` means "at the address `%ecx + 4*%eax - 4`". Recall
646 that `%eax` is the OCaml representation of i, complete with tag bit, so
647 it is essentially equal to `i*2+1`, so let's substitute that and
648 multiply it out:
649
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
650 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
651 %ecx + 4*%eax - 4
652 = %ecx + 4*(i*2+1) - 4
653 = %ecx + 4*i*2 + 4 - 4
654 = %ecx + 8*i
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
655 ```
656 (Each float in the array is 8 bytes long.)
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
657
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
658 So arrays of floats are unboxed, as expected.
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
659
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
660 ### Partially applied functions and closures
661 How does OCaml compile functions which are only partially applied? Let's
662 compile this code:
663
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
664 ```ocaml
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
665 Array.map ((+) 2) [| 1; 2; 3; 4; 5 |]
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
666 ```
667 If you recall the syntax, `[| ... |]` declares an array (in this case an
668 `int array`), and `((+) 2)` is a closure - "the function which adds 2 to
669 things".
670
671 Compiling this code reveals some interesting new features. Firstly the
672 code which allocates the array:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
673
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
674 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
675 movl $24, %eax ; Allocate 5 * 4 + 4 = 24 bytes of memory.
676 call caml_alloc
677
678 leal 4(%eax), %eax ; Let %eax point to 4 bytes into the
679 ; allocated memory.
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
680 ```
681 All heap allocations have the same format: 4 byte header + data. In this
682 case the data is 5 integers, so we allocate 4 bytes for the header plus
683 5 * 4 bytes for the data. We update the pointer to point at the first
684 data word, ie. 4 bytes into the allocated memory block.
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
685
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
686 Next OCaml generates code to initialize the array:
687
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
688 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
689 movl $5120, -4(%eax)
690 movl $3, (%eax)
691 movl $5, 4(%eax)
692 movl $7, 8(%eax)
693 movl $9, 12(%eax)
694 movl $11, 16(%eax)
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
695 ```
696 The header word is 5120, which if you write it in binary means a block
697 containing 5 words, with tag zero. The tag of zero means it's a
698 "structured block" a.k.a. an array. We also copy the numbers 1, 2, 3, 4
699 and 5 to the appropriate places in the array. Notice the OCaml
700 representation of integers is used. Because this is a structured block,
701 the garbage collector will scan each word in this block, and the GC
d16560f @bluddy Typo correction
bluddy authored
702 needs to be able to distinguish between integers and pointers to other
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
703 heap-allocated blocks (the GC does not have access to type information
704 about this array).
705
706 Next the closure `((+) 2)` is created. The closure is represented by
707 this block allocated in the data segment:
708
7abfd3f @Chris00 Remove all ```tryocaml
Chris00 authored
709 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
710 .data
711 .long 3319
712 .globl Closure__1
713 Closure__1:
714 .long caml_curry2
715 .long 5
716 .long Closure__fun_58
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
717 ```
718 The header is 3319, indicating a `Closure_tag` with length 3 words. The
719 3 words in the block are the address of the function `caml_curry2`, the
720 integer number 2 and the address of this function:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
721
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
722 ```assembly
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
723 .text
724 .globl Closure__fun_58
725 Closure__fun_58:
726
727 ; The function takes two arguments, %eax and %ebx.
728 ; This line causes the function to return %eax + %ebx - 1.
729
730 lea -1(%eax, %ebx), %eax
731 ret
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
732 ```
733 What does this function do? On the face of it, it adds the two
734 arguments, and subtracts one. But remember that `%eax` and `%ebx` are in
735 the OCaml representation for integers. Let's represent them as:
736
737 * `%eax = 2 * a + 1`
738 * `%ebx = 2 * b + 1`
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
739
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
740 where `a` and `b` are the actual integer arguments. So this function
741 returns:
742
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
743 ```
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
744 %eax + %ebx - 1
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
745 = 2 * a + 1 + 2 * b + 1 - 1
746 = 2 * a + 2 * b + 1
747 = 2 * (a + b) + 1
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
748 ```
749 In other words, this function returns the OCaml integer representation
750 of the sum `a + b`. This function is `(+)`!
751
752 (It's actually more subtle than this - to perform the mathematics
753 quickly, OCaml uses the x86 addressing hardware in a way that probably
754 wasn't intended by the designers of the x86.)
755
756 So back to our closure - we won't go into the details of the
757 `caml_curry2` function, but just say that this closure is the argument
758 `2` applied to the function `(+)`, waiting for a second argument. Just
759 as expected.
760
761 The actual call to the `Array.map` function is quite difficult to
762 understand, but the main points for our examination of OCaml is that the
763 code:
764
765 * Does call `Array.map` with an explicit closure.
766 * Does not attempt to inline the call and turn it into a loop.
767
768 Calling `Array.map` in this way is undoubtedly slower than writing a
769 loop over the array by hand. The overhead is mainly in the fact that the
770 closure must be evaluated for each element of the array, and that isn't
771 as fast as inlining the closure as a function (if this optimization were
772 even possible). However, if you had a more substantial closure than just
773 `((+) 2)`, the overhead would be reduced. The FP version also saves
774 expensive *programmer* time versus writing the imperative loop.
775
776 ## Profiling
777 There are two types of profiling that you can do on OCaml programs:
778
779 1. Get execution counts for bytecode.
780 1. Get real profiling for native code.
781
782 The `ocamlcp` and `ocamlprof` programs perform profiling on bytecode.
783 Here is an example:
784
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
785 ```ocaml
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
786 (* $ ocamlcp -p a graphics.cma graphtest.ml -o graphtest
787 $ ./graphtest
788 $ ocamlprof graphtest.ml *)
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
789
790 let () =
791 Random.self_init ();
792 Graphics.open_graph " 640x480"
793
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
794 let rec iterate r x_init i =
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
795 (* 12820000 *) if i == 1 then (* 25640 *) x_init
796 else
797 (* 12794360 *) let x = iterate r x_init (i-1) in
798 r *. x *. (1.0 -. x);;
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
799
800 let () =
801 for x = 0 to 640 do
802 (* 641 *) let r = 4.0 *. (float_of_int x) /. 640.0 in
803 for i = 0 to 39 do
804 (* 25640 *) let x_init = Random.float 1.0 in
805 let x_final = iterate r x_init 500 in
806 let y = int_of_float (x_final *. 480.) in
807 Graphics.plot x y
808 done
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
809 done
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
810 ```
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
811
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
812 The comments `(* nnn *)` are added by `ocamlprof`, showing how many
813 times each part of the code was called.
814
815 Profiling native code is done using your operating system's native
816 support for profiling. In the case of Linux, we use `gprof`.
817
818 To demonstrate native code profiling, I'm going to calculate the first
8ba0f91 @Chris00 Remove dead references to Doug Bagley site
Chris00 authored
819 3000 primes using the Sieve of Eratosthenes.
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
820 This program uses streams and camlp4, techniques which we haven't
821 covered in this tutorial.
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
822
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
823 ```ocaml
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
824 let rec filter p = parser
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
825 [< 'n; s >] -> if p n then [< 'n; filter p s >] else [< filter p s >]
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
826
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
827 let naturals =
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
828 let rec gen n = [< 'n; gen (succ n) >] in gen 2
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
829
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
830 let primes =
831 let rec sieve = parser
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
832 [< 'n; s >] -> [< 'n; sieve (filter (fun m -> m mod n <> 0) s) >] in
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
833 sieve naturals
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
834
d6d539d @Chris00 Tutorials: syntax highlight "Performance and Profiling".
Chris00 authored
835 let () =
836 for i = 1 to 3000 do
837 ignore (Stream.next primes)
838 done
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
839 ```
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
840
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
841 We compile it using the `-p` option to `ocamlopt` which tells the
842 compiler to include profiling information for `gprof`:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
843
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
844 ```shell
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
845 ocamlopt -p -pp "camlp4o pa_extend.cmo" -I +camlp4 sieve.ml -o sieve
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
846 ```
847 After running the program as normal, the profiling code dumps out a file
848 `gmon.out` which we can interpret with `gprof`:
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
849
d918b2f @Chris00 Do not eval code that hangs the process.
Chris00 authored
850 ```
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
851 $ gprof ./sieve
852 Flat profile:
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
853
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
854 Each sample counts as 0.01 seconds.
855 % cumulative self self total
856 time seconds seconds calls s/call s/call name
857 10.86 0.57 0.57 2109 0.00 0.00 sweep_slice
858 9.71 1.08 0.51 1113 0.00 0.00 mark_slice
859 7.24 1.46 0.38 4569034 0.00 0.00 Sieve__code_begin
860 6.86 1.82 0.36 9171515 0.00 0.00 Stream__set_data_140
861 6.57 2.17 0.34 12741964 0.00 0.00 fl_merge_block
862 6.29 2.50 0.33 4575034 0.00 0.00 Stream__peek_154
863 5.81 2.80 0.30 12561656 0.00 0.00 alloc_shr
864 5.71 3.10 0.30 3222 0.00 0.00 oldify_mopup
865 4.57 3.34 0.24 12561656 0.00 0.00 allocate_block
866 4.57 3.58 0.24 9171515 0.00 0.00 modify
867 4.38 3.81 0.23 8387342 0.00 0.00 oldify_one
868 3.81 4.01 0.20 12561658 0.00 0.00 fl_allocate
869 3.81 4.21 0.20 4569034 0.00 0.00 Sieve__filter_56
870 3.62 4.40 0.19 6444 0.00 0.00 empty_minor_heap
871 3.24 4.57 0.17 3222 0.00 0.00 oldify_local_roots
872 2.29 4.69 0.12 4599482 0.00 0.00 Stream__slazy_221
873 2.10 4.80 0.11 4597215 0.00 0.00 darken
874 1.90 4.90 0.10 4596481 0.00 0.00 Stream__fun_345
875 1.52 4.98 0.08 4575034 0.00 0.00 Stream__icons_207
876 1.52 5.06 0.08 4575034 0.00 0.00 Stream__junk_165
877 1.14 5.12 0.06 1112 0.00 0.00 do_local_roots
2d2f6e7 @Chris00 Import the Japanese translation of "Performance and Profiling"
Chris00 authored
878
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
879 [ etc. ]
384ee97 @pw374 (redesign) html->md: actual conversion
pw374 authored
880 ```
881 In fact this program spends much of its time in the garbage collector
882 (not surprisingly, since although the solution is elegant, it is far
883 from optimal - a solution using arrays and loops would have been much
884 faster).
885
886 ## Summary
887 In summary here are some tips for getting the best performance out of
888 your programs:
889
890 1. Write your program as simply as possible. If it takes too long to
891 run, profile it to find out where it's spending its time and
892 concentrate optimizations on just those areas.
893 1. Check for unintentional polymorphism, and add type hints for the
894 compiler.
895 1. Closures are slower than simple function calls, but add to
896 maintainability and readability.
897 1. As a last resort, rewrite hotspots in your program in C (but first
898 check the assembly language produced by the OCaml compiler to see if
899 you can do better than it).
900 1. Performance might depend on external factors (speed of your database
901 queries? speed of the network?). If so then no amount of
902 optimization will help you.
903
904 ### Further reading
905 You can find out more about how OCaml represents different types by
906 reading chapter 18 of the manual ("Interfacing C with OCaml") and also
907 looking at the `mlvalues.h` header file.
908
909 ## Discussion On This Page
910 ### Java dynamic dispatch
911 **There are some serious mistakes in the last paragraph:**
912
913 * Dynamic method dispatch itself is seldom a performance problem. In
914 languages without multiple inheritance (e.g. Java) this is usually
915 done via one step of pointer indirection. Objects in OCaml are also
916 dynamically dispatched. Since this is the point with polymorphism in
917 an OO setting.
918
919 * Dynamic method dispatch often hinders a compiler to inline function
920 and this hits the performance.
921
922 * In Java is a dynamic type check (aka cast) much more expensive than
923 a dynamic method dispatch.
c15a69e @agarwal Added tutorials/performance_and_profiling.html. Copied by permission …
agarwal authored
924
Something went wrong with that request. Please try again.