
native compilation optimization of tuple assignment #4800

Closed
vicuna opened this issue May 20, 2009 · 14 comments

@vicuna commented May 20, 2009

Original bug ID: 4800
Reporter: jerhoud
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2017-02-16T14:14:50Z)
Resolution: fixed
Priority: normal
Severity: feature
Version: 3.10.2
Fixed in version: 4.03.0+dev / +beta1
Category: ~DO NOT USE (was: OCaml general)
Tags: patch
Has duplicate: #6670
Related to: #5879 #7017
Monitored by: meyer @gasche abdallah @protz dario @ygrek till @jberdine "Julien Signoles" @hcarty @Chris00 @mjambon @yakobowski @mmottl

Bug description

Very often I need to define several variables according to a condition (or a pattern matching), so I write something like:

let (a, b) = if cond then (a1, b1) else (a2, b2) in ...

or

let (a, b) =
  match value with
  | case1 -> (a1, b1)
  | case2 -> (a2, b2)
  | ...
in ...

In these situations, the compiler allocates memory and creates a tuple. This is quite inefficient. I would like the compiler to optimize this allocation away.
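A minimal sketch of what is being asked for (my own illustration, not compiler output): the hand-written equivalent that avoids the tuple allocation simply splits the binding in two.

```ocaml
(* Allocates a pair and immediately destructures it: *)
let with_alloc cond =
  let (a, b) = if cond then (1, 2) else (3, 4) in
  a + b

(* The requested optimization would effectively produce code
   equivalent to two separate bindings, with no intermediate tuple: *)
let without_alloc cond =
  let a = if cond then 1 else 3 in
  let b = if cond then 2 else 4 in
  a + b
```

Both functions compute the same result; only the allocation behavior differs.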


@vicuna commented May 28, 2010

Comment author: @alainfrisch

The experimental branch inplace_let, which you can find in the OCaml Subversion repository, implements the suggested optimization. Feel free to try it and report any performance difference you could observe.

@vicuna commented Dec 16, 2011

Comment author: @alainfrisch

I've done some micro benchmarks.

Consider the code below:

let () =
  let r = ref 0 in
  for k = 1 to 10000 do
    for i = 1 to 100000 do
      let (x, y) =
        if i mod 2 = 0 then (1, i * 2)
        else (2, i * 3)
      in
      r := !r * x + y
    done;
  done;
  Printf.printf "-> %i\n%!" !r

and the alternative version:

...
let b = i mod 2 = 0 in
let x = if b then 1 else 2 in
let y = if b then i * 2 else i * 3 in
...

The timings are:
3.80s without the patch
3.25s for the alternative version (with and without the patch)
2.60s with the patch

Another test:

let f x y z =
  let a, b =
    match z with
    | Some (u, v) -> u, v * 2
    | None -> 10, 20
  in
  a * x + b * y

let () =
  let r = ref 0 in
  for k = 1 to 2000 do
    for i = 1 to 100000 do
      r := !r + f k i (Some (k, i));
      r := !r + f k i None
    done;
  done;
  Printf.printf "-> %i\n%!" !r

The timings are:
1.80s without the patch
1.60s with the patch

I've also found a case where the patch degrades performance:

let () =
  let x = ref 0. in
  let y = ref 0. in
  for k = 1 to 2000 do
    for i = 1 to 100000 do
      let x, y =
        if i mod 2 = 0 then x, y else y, x
      in
      x := !x +. 1.;
      y := !y -. 1.;
    done;
  done;
  Printf.printf "-> %f, %f\n%!" !x !y

The timings are:
4.15s without the patch
4.60s with the patch

If we replace floats with ints, the patch improves performance:
0.77s without the patch
0.51s with the patch

The following variant:

let () =
  let x = ref 0. in
  let y = ref 0. in
  for k = 1 to 2000 do
    for i = 1 to 100000 do
      let x, y =
        if i mod 2 = 0 then x, y else y, x
      in
      x := !x +. 1.;
      y := !y -. 1.;
      x := !x +. 1.;
      y := !y -. 1.;
    done;
  done;
  Printf.printf "-> %f, %f\n%!" !x !y

is again faster with the patch:

8.32s without the patch
7.52s with the patch

and so is this one:

let () =
  let x = ref 1. in
  let y = ref 2. in
  for k = 1 to 2000 do
    for i = 1 to 100000 do
      let x, y, z =
        if i mod 2 = 0 then
          x, y, 1.
        else
          y, x, 2.
      in
      x := !x +. z;
      y := !y +. 1.;
    done;
  done;
  Printf.printf "-> %f, %f\n%!" !x !y

4.89s without the patch
3.81s with the patch

It seems that most of the time, we get significant speedups. Concerning the slowdown: I could not tell from the generated assembly code where it comes from; probably some cache or alignment issue.

@vicuna commented Dec 20, 2011

Comment author: @gasche

I did a quick review.

High-level points:

  1. I'm not sure why you decided to support tuples only, and not records (float boxing issues? see below). Similarly, why doesn't transform_return support Lswitch? Ltrywith?
  2. In the optimization case, you initialize your variables with Const_pointer 0. This relies on the non-local invariant that the pattern optimized only match one-word data. You should probably add a comment justifying the safety of this initialization.
  3. You did not add tests to the testsuite. It's always better when patches come with test coverage. Adding tests is boring stuff, and doing it bit by bit when you write a patch on a specific aspect of the compiler is a good routine to have. Plus you sometimes find bugs.

Nitpicking:

  • I'm not fond of the silent shadowing of for_let
  • It's not clear why one seq call needs List.rev and not the other. If there is some evaluation order to respect, you should comment about it. Maybe this would be simpler if assign_pat just returned a list of side-effects to apply? Plus that would potentially result in slightly better results (less seq [] cases).
  • in assign_pat you handle Tpat_var specifically but not Tpat_any; I'm under the impression that both would "do the right thing" in the default case; why did you choose to single out one and not the other?
  • The control flow of assign_pat is convoluted and non-optimal; in particular, the default case will compute for_let loc lam pat' (copy_ids ids) even when the optimization doesn't apply and this is therefore discarded. I would rather suggest the following flow: make the reference local to the function (this simplifies the interface) and, in the wildcard case, raise Exit if !opt if false.

PS: it's easier if you give an URL to the diff(s).
http://caml.inria.fr/cgi-bin/viewvc.cgi/ocaml/branches/inplace_let/bytecomp/matching.ml?r1=10475&r2=10474&pathrev=10475

@vicuna commented Dec 20, 2011

Comment author: @alainfrisch

Hi Gabriel,

Thanks for the detailed feedback.

  1. Tuple only: the typical idiom is to write

    let (x, y) = if ... then (e1, e2) else (e3, e4)

    The intention is not really to create tuples, but to return several bindings.
    I don't think people explicitly create records to deconstruct them immediately.

    Lswitch, Ltrywith would be easy to handle, indeed.

  2. You're right that the initialization is not very nice. I'm not 100% sure this is safe, actually. I propose to wait to see if there is some interest in the patch (i.e. if it is considered useful) before investigating the actual implementation.

I don't understand your last proposal about assign_pat; the wildcard case in assign_pat does not, by itself, disallow the optimization (opt can be set to true afterwards).

@vicuna commented Dec 20, 2011

Comment author: @gasche

Here is the sort of thing I had in mind. I didn't even try to compile the code; it's just a draft. I may also have misunderstood your code (in which case I'd welcome clarifying in-code comments), and then this wouldn't be equivalent.

let assign_pat loc pat lam =
  let rec assign opt pat lam =
    match pat.pat_desc, lam with
    | Tpat_tuple patl, Lprim (Pmakeblock _, lams) ->
        List.concat (List.map2 (assign true) patl lams)
    | Tpat_tuple patl, Lconst (Const_block (_, scl)) ->
        List.concat
          (List.map2 (fun p sc -> assign true p (Lconst sc)) patl scl)
    | _ ->
        if not opt then raise Exit;
        match pat.pat_desc with
        | Tpat_any -> []
        | Tpat_var id -> [Lassign (id, lam)]
        | _ ->
            let ids = pat_bound_idents pat in
            let ids = List.map (fun id -> (id, Ident.rename id)) ids in
            let pat' = alpha_pat ids pat in
            [for_let loc lam pat' (copy_ids ids)]
  in
  assign false pat lam

@vicuna commented Dec 20, 2011

Comment author: @alainfrisch

The behavior is different.

The original code "pushes" the bindings down the bound expression, duplicating them at branching nodes. If one of the bindings "cancels out" a literal tuple, the optimization is triggered. This is achieved by sharing the reference across all the calls to assign_pat. For instance, consider:

let x, y =
  if ... then z else (1, 2)

Here, transform_return will push the binding (x, y) into the two branches; in the second branch, the corresponding call to assign_pat detects that the optimization applies. But for the first call to assign_pat (on "z"), we need the fallback case.
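A minimal runnable model of what this transformation produces for the example (my own illustration, not the compiler's actual lambda output), with Lassign mimicked by references:

```ocaml
(* Models: let x, y = if cond then z else (1, 2) in (x, y)
   under the assign-based optimization. *)
let demo cond (z : int * int) =
  let x = ref 0 and y = ref 0 in      (* dummy initialization *)
  (if cond then begin
     (* fallback case: z is not a literal tuple, so destructure it *)
     let (zx, zy) = z in
     x := zx; y := zy
   end
   else begin
     (* optimized case: the literal tuple (1, 2) is cancelled out,
        and its components are assigned directly *)
     x := 1; y := 2
   end);
  (!x, !y)
```

The point of the shared flag is exactly that the second branch turns the optimization on for the whole expression, while the first branch still needs the destructuring fallback.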

@vicuna commented Dec 20, 2011

Comment author: @gasche

Indeed, that's... wicked.

I hope you'll publish a testsuite with all those interesting corner cases covered and explained. Maybe an in-code comment is also good.

I'm wondering however, could this example result in a performance degradation? Imagine that "z" is the usual case and (1,2) is an exceptional case (that's how I expect this pattern to be used). The result of your optimization, if I'm not mistaken, will allocate two dummy-initialized variables x and y, get the z couple, set x to z.fst and y to z.snd, and go on. The current compiler would just get z, bind x to z.fst and y to z.snd, and move on.

The difference is tiny (in particular, you of course don't allocate more), but it would be interesting to know whether it is noticeable. Maybe it would be safer to optimize only when all branches are tuples or do not return (assert false, raise...).

A good thing with either choice is that the performance-minded compiler-guru user can tune their code to get the semantics they want: by turning z into fst z, snd z in your example, they can ensure the restricted optimization fires, and by writing let (x, y) as _c = ... they can disable the optimistic one.
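The two user-side knobs described above could look like this (function names are mine, and whether each knob actually fires depends on which patch variant is applied):

```ocaml
(* Projecting explicitly makes both branches end in literal tuples,
   so even a restricted optimization could fire. *)
let forced z cond =
  let (x, y) = if cond then (fst z, snd z) else (1, 2) in
  x + y

(* The `as _c` alias names the whole tuple, forcing it to be
   materialized and thereby opting out of the optimistic
   optimization. *)
let disabled z cond =
  let ((x, y) as _c) = if cond then z else (1, 2) in
  x + y
```

Both versions are semantically equivalent; they differ only in which code the optimizer is allowed to generate.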

PS: I understand that you don't want to invest more time on this until the usefulness of the optimization is demonstrated and accepted. My technical comments on the code should not be taken as an opinion.

@vicuna commented Jan 19, 2012

Comment author: @gasche

While looking at Alain's implementation again, I had an idea of a different implementation of the optimization. I implemented it as the attached patch.

The idea is to use Lstaticcatch/Lstaticraise instead of Lassign.

Note: I have added comments liberally, but apart from that the two patches are of exactly the same size.

Alain's patch transforms

 let (x,y) =
    let foo = ... in
    if foo then (1, 2) else (3,4)
 in bar

into (approximately; in this special case there is no name freshening)

 let x = dummy in let y = dummy in
 begin
  let foo = ... in
  if foo then
    (let x1 = 1 in let y1 = 2 in x <- x1; y <- y1)
  else
    (let x2 = 3 in let y2 = 4 in x <- x2; y <- y2)
 end;
 bar

While my implementation produces

 catch
   let foo = ... in
   if foo then
     (let x1 = 1 in let y1 = 2 in exit x1 y1)
   else
    (let x2 = 3 in let y2 = 4 in exit x2 y2)
 with x y ->
   bar

I liked the use of control flow instead of mutation here -- and the removal of those unsatisfying dummy initializations.

The performance profile is different from Alain's, but related. This patch tends to do slightly better in some cases, and slightly worse in others. See at the end of this message the compared timings on the micro-benchmarks posted by Alain above.

I believe the implementation is more efficient in general, but maybe it does not benefit from the float or reference unboxing optimization that Lassign gets, and is therefore slower in this case -- I'm not familiar with unboxing optimizations, but that's the most plausible explanation I have; I'd be glad to be confirmed or contradicted.
If this is the case, then I'm wondering whether float unboxing could be performed on catch/exit as well.

Performance numbers on Alain's tests, in the order corresponding to his message.

test1.ml (integers)
3.792s no-opt
2.694s assign-opt
2.422s exit-opt

test2.ml (integers)
1.979s no-opt
1.644s assign-opt
1.551s exit-opt

test3.ml (float references)
4.648s no-opt
4.783s assign-opt
4.691s exit-opt

test4.ml (float references)
8.871s no-opt
7.779s assign-opt
8.843s exit-opt

test5.ml (float references)
5.334s no-opt
4.026s assign-opt
5.272s exit-opt

@vicuna commented Jan 20, 2012

Comment author: @alainfrisch

Gabriel, I really like your proposal. Indeed, it's much cleaner to use control-flow rather than mutation.

@vicuna commented Jan 20, 2012

Comment author: @gasche

I have explored the boxing issue a bit. I just uploaded an experimental patch (very lightly tested) that fixes the issue I could identify by reading the cmm code.

Unfortunately, in the process of testing this patch on my examples, I discovered that my measurements are much more variable than I thought: on the test machine, I get up to 1 second of variation with the same optimizer, which makes 4-second measurements unusable...

I will try to find more robust measurements, but in the meantime, here are the relevant regions of the cmm code for test5.ml.

No optimization at all ('match/1037' is the identifier of the tuple in (let (x,y) = ...); 'x' and 'y' are inlined):

(data global "camlTest5__3" int 1277 "camlTest5__3": double 1.)
(data global "camlTest5__4" int 1277 "camlTest5__4": double 2.)

(let
(match/1037
(if (== (+ (<< (mod (>>s i/1033 1) 2) 1) 1) 1)
(alloc 3072 x/1030 y/1031 "camlTest5__3")
(alloc 3072 y/1031 x/1030 "camlTest5__4"))
y/1035 (load (+a match/1037 8))
x/1034 (load match/1037))
(...))

With Alain's assign patch:

(data global "camlTest5__3" int 1277 "camlTest5__3": double 1.)
(data global "camlTest5__4" int 1277 "camlTest5__4": double 2.)

(let (x/1034 1a y/1035 1a z/1036 1a)
(if (== (+ (<< (mod (>>s i/1033 1) 2) 1) 1) 1)
(seq (assign z/1036 "camlTest5__3")
(assign y/1035 y/1031) (assign x/1034 x/1030))
(seq (assign z/1036 "camlTest5__4")
(assign y/1035 x/1030) (assign x/1034 y/1031)))
(...))

With the catch/exit patch:

(catch
(if (== (+ (<< (mod (>>s i/1033 1) 2) 1) 1) 1)
(let (z/1048 1. z/1042 (alloc 1277 z/1048))
(exit 1 z/1042 y/1031 x/1030))
(let (z/1047 2. z/1039 (alloc 1277 z/1047))
(exit 1 z/1039 x/1030 y/1031)))
with(1 z/1036 y/1035 x/1034)
(...))

Notice the degradation: 'z' is explicitly allocated instead of being a global symbol. That's the defect I tried to fix with the new (independent) patch.

With my 'avoid undeeded unboxing' patch:

(data global "camlTest5__3" int 1277 "camlTest5__3": double 1.)
(data global "camlTest5__4" int 1277 "camlTest5__4": double 2.)

(catch
(if (== (+ (<< (mod (>>s i/1033 1) 2) 1) 1) 1)
(let z/1042 "camlTest5__3"
(exit 1 z/1042 y/1031 x/1030))
(let z/1039 "camlTest5__4"
(exit 1 z/1039 x/1030 y/1031)))
with(1 z/1036 y/1035 x/1034)
(...))

@vicuna commented Dec 4, 2012

Comment author: @alainfrisch

I believe the second patch (0001-avoid-float-unboxing-when-uneeded.patch) is wrong. If a mutable variable is needed in both boxed and unboxed form, one should not use the unboxed compilation scheme (otherwise assignments to the variable are only visible to contexts that use the unboxed form).

@vicuna commented Dec 4, 2012

Comment author: @alainfrisch

FWIW, I could not observe any significant performance impact from applying Gabriel's first patch (but not the second one) across our entire test suite (maybe because we avoid the optimized construction in performance-critical code). There is a small reduction in the total number of allocated bytes (e.g., around 3% fewer allocated bytes for some tests).

@vicuna commented Jan 15, 2013

Comment author: @alainfrisch

With static exceptions (#5879), it is possible to hand-code Gabriel's solution. For instance:

let () =
  let r = ref 0 in
  for k = 1 to 10000 do
    for i = 1 to 100000 do
      try
        if i mod 2 = 0 then raise (`Cont (1, i * 2))
        else raise (`Cont (2, i * 3))
      with `Cont (x, y) ->
        r := !r * x + y
    done;
  done;
  Printf.printf "-> %i\n%!" !r

and this achieves identical performance. Of course, it is not very nice to write code like that, but it's good to know that even if no optimization for tuple assignment is ever included, static exceptions give a way to speed up performance-critical code by hand.

Another observation is that the changes required to support escaping try...with blocks with a static exception would benefit Gabriel's approach as well, making it possible to optimize the case where the right-hand side builds tuples in terminal position of a try...with body.

@vicuna commented Oct 15, 2015

Comment author: @alainfrisch

I've committed this on trunk, rev 16501. This is a version close to Gabriel's, with some improvements in the map_return function so that it traverses trywith/staticcatch expressions (because the staticraise we use is allowed to cross try...with blocks!) and ignores staticraise/raise expressions (because they don't return). This improves the effectiveness of the patch by avoiding tuple allocations in cases such as:

let (x, y) =
  try
    if i mod 2 = 0 then (1, i * 2)
    else if i mod 5 = 0 then raise Exit
    else (-1, i * 3)
  with Exit ->
    (1, -1)
in

A test has been added to check that no allocation happens in this example.

The optimization is currently done quite early in the compilation process, thus benefiting both bytecode and native compilation. This means that (i) js_of_ocaml also benefits from it; (ii) the optimization is not reapplied after inlining.
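A small probe (my own sketch, not the testsuite check mentioned above) for observing allocation around the pattern. Note that `Gc.allocated_bytes` itself returns a boxed float, so the measured delta is never exactly zero; the interesting comparison is between compilers with and without the optimization.

```ocaml
(* The pattern from the benchmarks above; with the optimization,
   native code should not allocate a tuple here. *)
let pick i =
  let (x, y) = if i mod 2 = 0 then (1, i * 2) else (2, i * 3) in
  x + y

let () =
  let before = Gc.allocated_bytes () in
  let r = pick 4 in
  let after = Gc.allocated_bytes () in
  (* The printed delta includes the boxing of the floats returned by
     Gc.allocated_bytes itself, so treat it as a relative measure. *)
  Printf.printf "result=%d, bytes allocated=%.0f\n" r (after -. before)
```

Running this under bytecode and native code, before and after 4.03, gives a rough view of whether the tuple allocation was eliminated.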
