-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unboxing parameters and more kinds of local bindings #5204
Comments
Comment author: @alainfrisch I've created a branch more_unboxing in the OCaml SVN. It changes the unboxing strategy. Instead of deciding whether an identifier can be unboxed by looking at the right-hand side of its definition, it considers any identifier which is really used in a unboxed float context as a candidate for unboxing. Function parameters can thus be unboxed as well. This strategy is a little bit fragile, and it might break in presence of GADTs. It would be better to propagate type information further down in the compiler to allow a cleaner strategy. |
Comment author: @alainfrisch The code below runs in 1.75 seconds on the branch vs. 2 seconds on the trunk: let f x y = let () = |
Comment author: @Chris00 On my machine, it runs in 2s (more_unboxing) v.s. 2.32s (3.12 branch), so a similar kind of speedup. |
Comment author: @Chris00 It is even more impressive on a program such as the one below — which shows that, with such a patch, one can factorize computations while loosing very little performance. It indeed runs in 2.01s (more_unboxing) v.s. 3.45 (3.12 branch). (The use of -inline does not really make any difference.) let g x y = let f x y = let () = |
Comment author: @alainfrisch Christophe, thanks for your feedback on the branch! I'm not so surprised that -inline does not make any difference. It would probably make a bigger difference if you swap the number of iterations in g and f (so that there are many function calls and boxing of floats). What I don't really understand is why this last example is so much slower than the previous one with 3.12. |
Comment author: @Chris00
Strangely enough, it is not the case. |
Comment author: @alainfrisch The current branch follows the following strategy (always unbox,
In general, this reduces allocation as much as possible: floats are I've uploaded a series of micro-benchmarks to illustrate different cases: bench1: 4.80 -> 2.60 A typical case where a local variable should really be unboxed, but bench2: 1.44 -> 1.20 The boxed version is indeed used within the loop, but quite rarely. bench3: 1.16 -> 1.20 The boxed version is always needed, and always different from its bench4: 0.65 -> 0.64 The boxed version is always needed, but most of the time, it hasn't bench5: 1.21 -> 0.67 The boxed version is always needed, but most of the time, it's value bench6: 1.05 -> 0.65 Same semantics as bench4, but forcing the update of the variable Of course, micro-benchmarks don't give much information, but it's good
Moreover, the new strategy would allow us to benefit even more from |
Comment author: atavener I extracted some samples from my active codebase and turned them into self-contained files...
The code is in a very functional style (rather than imperative), and has not been optimized. So I haven't tried to cater to the compiler at all yet. I didn't include any visualization, but if desired I can add something simple using Graphics module. My time measurements between OCaml 3.12.0 and "more_unboxing" were: noiz.ml: 1.5s -> 1.7s The only real float component of noiz.ml is linear interpolation and a simple 5th-order polynomial. The particle system is much more float heavy, mostly as 3-element vectors (in a record), and the core being the Runge-Kutta4 integrator, which is implemented in a polymorphic way, relying on a provided "multiply-accumulate" function for the type to be integrated. |
Comment author: @alainfrisch I've synchronized the branch with the trunk. atavener: thanks for posting your code. On my machine (Intel Core i7, 2.80GHz, running in 64-bit mode), I get: noiz.ml (switched from 1024x1024 to 4096x4096): 2.57 -> 2.50 As you mention, your code is written in a quite high-level functional style. Changing the final summation in noiz.ml from: let sum = Array.(fold_left (fun s arr -> fold_left (+.) s arr) 0. tex) in to: let r = ref 0. in we get a more interesting speedup with the new strategy: 2.55 -> 2.42 |
Comment author: @Chris00 On my machine (x86_64 GNU/Linux, Intel(R) Core(TM) i7 CPU Q 820 @ 1.73GHz): |
Comment author: @Chris00 nbody: 14.44 -> 13.99 (but variability is greater than the gain) Many floating point computations but not so many references. (For comparison; gcc -O3 with a similar code runs in 11.66.) |
Comment author: atavener Haha, yeah. Might want to bump up those loop-counts, sorry. It's encouraging to see that the results are generally positive! Even 10% on chunks of "normal" code which makes use of floats. |
Comment author: @alainfrisch Christophe: I can observe a comparable and quite stable speedup (about -3%) for nbody. (This speedup is probably explained by the fact that the dt argument in "advance" becomes unboxed.) A more noticeable gain is obtained for its "energy" function. Reducing the example to only calling this function 10000000 times, I get: 1.97 -> 1.83. If one abstracts the dist2 function: let dist2 dx dy dz = and calls it in the advance function, it doesn't change performance (2.55 -> 2.50) because of inlining. If we compile with "-inline 0", one gets a more noticeable speedup: 3.29 -> 3.07. |
Comment author: @Chris00 square: 5.58 -> 5.67 Real world case consisting in integrating a functional over a triangular mesh. Requires http://forge.ocamlcore.org/projects/mesh/ For some reason, the program compiled with this branch is consistently slower — but not by much, so this is no big deal. |
Comment author: @alainfrisch Christophe: do you think it would be easy to create a standalone version of square that can be compiled with a stock OCaml distribution? (As far as I can tell, you only use mesh to produce a static data structure. Maybe just output-valuing some bigarrays statically might be enough; or generating an OCaml source to reproduce this value.) |
Comment author: @Chris00 I have uploaded square2.ml. It reads the file square_mesh.bin (created by square_mesh.ml which is also added for completeness). There is a problem however: when compiled with 3.12, square2 runs fine but with this branch it raises Invalid_argument("index out of bounds"). It seems that the content of the bigarrays is not restored properly. |
Comment author: @Chris00 square3.ml does not use marshalling. The running times of the program compiled with 3.12 and this branch are about the same. When I see differences, the 3.12 version is slightly faster. Here is such a run: $ time ./square-3.12 && time ./square-dev |
Comment author: @alainfrisch Thanks Christophe! Here are some timings for square3 on my machine, comparing between trunk (before) and more_unboxing (after). In square brackets, I also report the allocated memory (returned by Gc.allocated_byte): 4.96 -> 5.03 [8 Gb -> 7.4 Gb] Moving the definition of f inside nonlinearity: 4.86 -> 4.95 [8 Gb -> 7.4 Gb] Same, with -inline 10000: 4.46 -> 4.40 [615 Mb -> 0.457 Mb] (better!) After looking at the generated cmm and assembly code, my guess is that the slight performance degradation comes from the compilation of the function "f". With the new compilation scheme, the parameters are unboxed at the entry of the function, which forces them to be saved on the stack (because under Unix, calling cos/sin can destroy all the xmm registers). One could probably improve this by pushing unboxing down the ulambda-code to the innermost node where the parameter is indeed used (or, as an easier and less general fix: disable unboxing when the parameter is used only once, not under a loop). But I can imagine cases where unboxing early is actually better for the pipeline, so I'm not sure we'd always win. |
Comment author: @alainfrisch I've committed a patch to push unboxing down the tree (without duplicating it). The new unboxing strategy does not change much the performance of square3. What this example really shows is that a more aggressive inliner (and/or a dedicated calling convention for functions on floats) could allow more modular numerical code without degrading performance. |
Comment author: @Chris00
That would be really nice indeed! |
Comment author: @Chris00 To tackle the problem with unmarshalling, I uploaded square5. With 3.12 everything is fine. With this branch, I get the message “tr incorrect when i = 4” which means that the unmarshalling is done right but that somehow [tr] gets modified along the computations... This is rather strange (maybe a separate bug report needs to be opened?). |
Comment author: @alainfrisch Christophe: I can reproduce the bug with square5. I'll look at it. Smaller repro case: ========== let () =
|
Comment author: @alainfrisch Mmmh, the problem disappeared after a full "make clean world opt opt.opt install". I assume we used a compilation command which refers to a different version of bigarray.a. Christophe: can you confirm? |
Comment author: @Chris00 I confirm but the bug disappeared only after I updated to todays version of the branch. So this was not a mere recompilation problem. |
Comment author: @alainfrisch I will need to confirm that with a more serious methodology, but it seems the patch produce an interesting speedup over our entire unit test suite (around 6%, with some tests allocating twice fewer than before). |
Comment author: @alainfrisch One should rerun the various benchmarks provided here to check that their boxing behavior is as expected. But globally, the new scheme will never unbox a function parameter. So in let unbox x y = let r = ref 0. in for i = 0 to n do r := !r +. x *. y done accessing the value for x or y in the loop will result in a memory read every time. The new strategy is more about removing costly "boxing" than removing repeated "unboxing". Btw, people interested in this topic can read http://www.lexifi.com/blog/unboxed-floats-ocaml which summarizes the recent evolutions of that strategy. |
This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc. |
I wrote a summary of the work on parameter unboxing at #5894 (comment) |
This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc. |
Original bug ID: 5204
Reporter: @alainfrisch
Status: acknowledged (set by @damiendoligez on 2011-05-17T14:47:01Z)
Resolution: open
Priority: normal
Severity: feature
Category: back end (clambda to assembly)
Parent of: #5133
Monitored by: ronan.lehy@gmail.com mehdi dario @ygrek till @jmeber "Julien Signoles" @hcarty @Chris00
Bug description
Unboxing for floats and boxed integers is currently tried when an identifier is explicitly bound by a let/match construction but not by a lambda.
For instance, x and y are unboxed in the loop below:
let unbox x y =
let x = x +. 0. in
let y = y +. 0. in
let r = ref 0. in
for i = 0 to n do
r := !r +. x *. y
done
but not if the "+ 0." are removed (because the let are then simplied).
A related problem is that the decision of unboxing a local binding "let x = E1 in E2" is done only by looking at E1. For instance, if E1 is an access to a float field of a record, unboxing is tried, but not if the record field is not statically a float.
So x and y are unboxed in:
type r = {x: float; y:float}
let unbox r =
let x = r.x in
let y = r.y in
...
but not if we define the type as:
type 'a r = {x: 'a; y:'a}
It would be better to try unboxing as soon as the bound expression is a float (or a boxed integer). Type information is not available, but one could find out quite easily if unboxing should be tried by looking at how the identifier is used in the body of the local binding / function.
Or do people see cases where allowing unboxing in more cases is problematic?
File attachments
The text was updated successfully, but these errors were encountered: