RFC: tail recursion modulo constructors #181

Open · wants to merge 16 commits into base: trunk

Conversation
@let-def (Contributor) commented May 6, 2015

This pull request introduces tail-recursion modulo constructor (TRMC), which makes it possible to write
a version of List.map that is both tail-recursive and natural.

Using TRMC

Getting the benefits of TRMC is opt-in. Since it turns stack allocation into mutation and heap allocation, it might have an impact on the program's performance profile.

However, because of the nature of TRMC mutations, they almost never cross heaps and stay on the fast path (I don't have recent benchmarks yet, but the performance cost, if any, is negligible).
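For intuition, here is a minimal hand-written sketch of the destination-passing style that the transformation produces for map, using Obj to imitate the mutation of "immutable" fields. This is not the compiler's actual output (the rewriting happens on the Lambda IR, not via Obj) and the name map_dps is mine; it only illustrates the mechanism.

let map_dps f = function
  | [] -> []
  | x :: xs ->
    (* Allocate a cell with a dummy tail ("the hole"), then loop,
       plugging the tail field of the previously allocated cell. *)
    let rec fill dst f = function
      | [] -> Obj.set_field dst 1 (Obj.repr [])
      | x :: xs ->
        let cell = Obj.repr (f x :: []) in
        Obj.set_field dst 1 cell;  (* mutate the "immutable" tail *)
        fill cell f xs             (* genuine tail call: O(1) stack *)
    in
    let first = Obj.repr (f x :: []) in
    fill first f xs;
    (Obj.obj first : _ list)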

Attributes

A few syntaxes have been tried; the two most satisfying ones are:

  • annotating bindings
let rec map f = function
 | [] -> []
 | x :: xs ->
   let y = f x in
   y :: map f xs
 [@@trmc]
  • annotating patterns
let rec map[@trmc] f = function
 | [] -> []
 | x :: xs ->
   let y = f x in
   y :: map f xs

The first one applies only to global bindings (item attributes are not yet allowed on local let-bindings).
The second one required updating the parser, as attributes were not allowed on the identifier of a function binding, and proved quite tricky. It is the one implemented in the current revision.

Finally, the -force-trmc flag enables automatic rewriting of calls to TRMC style.

TRMC cases

TRMC applies to all tail positions consisting of value constructors, possibly nested, applied to the result of a recursive call to a "trmc candidate", that is, a recursive function being defined that has the [@trmc] attribute specified above.

Examples:

  • y :: map f xs from above,
  • y :: z :: map f xs would also qualify,
  • in case of ambiguity, for instance when mapping both subtrees of a tree,
    e.g. Node (l,r) -> Node (tree_map f l, tree_map f r), one of the calls will
    be rewritten in TRMC style, but which one is not specified.

In the last case, use an explicit let-binding to enforce the order. TRMC rewriting won't cross let-bindings, so

let l = tree_map f l in
let r = tree_map f r in
Node (l, r)

will disable TRMC.

let l = tree_map f l in
Node (l, tree_map f r)

will force TRMC on the right subtree.

Also, you shouldn't rely on the evaluation order of the expression in tail position, as TRMC rewriting might change the ordering (the trmc call is evaluated last). But since, AFAIK, the evaluation order of arguments is not strictly specified in OCaml, that should already be the case.
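To make the reordering concrete, here is a hypothetical demonstration ([@trmc] is the syntax proposed in this PR, and the right-to-left order of the plain version is an implementation detail of today's compilers, not a guarantee):

let rec show [@trmc] = function
  | [] -> []
  | x :: xs -> (print_int x; x) :: show xs

let () = ignore (show [1; 2; 3])

Without the rewriting, constructor arguments happen to be evaluated right-to-left, so this prints 321; with TRMC the recursive call is forced to be evaluated last, so it prints 123.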

Warnings

Three new warnings are introduced:

  • 51, [@trmc] attribute on a function which doesn't satisfy trmc requirements
  • 52, [@trmc] attribute on a function which is never applied in trmc position
  • 53, a call has been detected to be in trmc position; the function could benefit from the [@trmc] attribute (useful for detecting trmc opportunities in an existing code base).

However, in the last case, making this a warning is not satisfying, as plenty of code bases with -warn-error A will break.

Cons

There are a few cases where observable behavior will change or break.

When using delim-cc or hansei: since the continuation captures the stack and not the heap, and TRMC moves values to the heap and out of the stack very early, one will be able to observe values getting mutated. TRMC shouldn't be used in such cases.

It is unclear how TRMC will behave with multicore GCs and alternative runtimes (like js_of_ocaml, although that one is more likely to be fine). The mutations introduced will touch values which are not expected to be mutated.
As far as OCaml semantics and the classic OCaml runtime are concerned, this is fine, since the partially-built values are not observable. But different runtimes might have different expectations.

let rec l = 1 :: l in map succ l might prove to be a bit memory hungry instead of failing quickly.

@gasche (Member) commented May 6, 2015

For the record, I have discussed this feature extensively with @def-lkb, and I think it is a nice thing to have opt-in.

I am tired of having to explain to beginners that the nice elegant way to write map is not the actual way you have to implement it in a reasonable library. Whenever possible, the elegant code should also be the best code, and TRMC brings us closer to this ideal. (I realize this sounds like a philosophical point, but I think it is a strong argument and it took me some years to realize it.)

On the other hand TRMC can be problematic (I do think Hansei is an interesting use of the language, and thanks Oleg for pointing out that destination-passing-style List.map breaks it), so it is important that it is only opt-in.

Finally, there are also some important use-cases where TRMC does not suffice to obtain good stack usage (it is not clear how to implement mapM in a strict language without blowing the stack on large lists). It is not the be-all end-all of "OS stack vs. heap for callstack allocation" debates.

@alainfrisch (Contributor) commented May 6, 2015

This is a great proposal.

I'm not sure I buy @gasche's argument about beginners, though. If TRMC is not automatic, having to tell beginners to enable it explicitly is not natural and doesn't achieve the goal of least-surprise.

Also, it would be quite useful to benchmark various versions of List.map. Last time I checked, a manual encoding of TRMC for this case leads to significantly worse performance than the naive non-tailrec version. It would be interesting to compare the TRMC version with the "defunctionalized CPS" version (i.e. using an accumulator and a List.rev), because that one could also be mechanized. (There are several ways to achieve good performance for common sizes and still have a version that works on long lists (do a non-tailrec version for the first N elements; unroll the loop) -- but that's another story.)
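For reference, a sketch of the "defunctionalized CPS" version mentioned above, assuming it means the usual accumulator-plus-List.rev encoding (the name map_acc is mine):

let map_acc f l =
  (* Fully tail-recursive and mechanizable, at the cost of building
     and then reversing an intermediate list. *)
  let rec go acc = function
    | [] -> List.rev acc
    | x :: xs -> go (f x :: acc) xs
  in
  go [] l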

@dbuenzli (Contributor) commented May 6, 2015

On Wednesday, May 6, 2015 at 17:13, gasche wrote:

I am tired of having to explain to beginners that the nice elegant way to write map is not the actual way you have to implement it in a reasonable library. Whenever possible, the elegant code should also be the best code, and TRMC brings us closer to this ideal. (I realize this sounds like a philosophical point, but I think it is a strong argument and it took me some years to realize it.)

It's always funny how people pull out the beginners argument when they find it handy to supposedly make a point. You brought nothing to beginners, since you now have to explain to them both what tail-recursion and tail-recursion modulo constructors are and that you need to add an attribute to get the latter; I have seen more beginner-friendly introductions…

(And since thinking about beginners when you design is usually a bad idea, always design for the working programmer. API and PL designers should really read more HCI books.)

Best,

Daniel

@lpw25 (Contributor) commented May 6, 2015

I do like this idea.

However, I am concerned about the addition of more mutations of immutable things since these are more costly in multicore (if the mutated block is on the major heap then you must allocate the pointed-to value on the major heap as well).

I am also concerned about the interaction with delimited continuations (which -- spoiler alert -- may also be relevant to multicore). It is possible that this interaction can be fixed. Essentially copying a stack must also cause any TRMC values being built to be copied (I believe Gabriel discussed such a scheme on the caml-list). I think that this should probably be worked out properly before considering TRMC for inclusion.

@yallop (Member) commented May 6, 2015

It sounds like this only applies to self calls. Can it be generalised to support arbitrary tail calls?

@let-def (Contributor) commented May 6, 2015

@yallop this applies to all recursive functions defined in the let-bindings enclosing the call site.

let rec f = …
and g =
  let rec g' = … x :: f () in
  g' …
in

Here, the call to f from g' is rewritable. Note that the drawback is that this might generate an amount of code exponential in the number of nested lets.
This could maybe be worked around by doing trmc rewriting after lambda lifting, but that would be much more involved.

@bluddy commented May 6, 2015

@lpw25 will multicore be purely a runtime change, or will it also require compiling programs differently? If the latter, then this optimization can be turned off. Perhaps ideally, both code paths can be included, and the runtime gets to choose which one it wants by setting a global, or via a preprocessing phase much like dynamic linking?

@dbuenzli (Contributor) commented May 6, 2015

On Wednesday, May 6, 2015 at 18:16, Yotam Barnoy wrote:

@lpw25 (https://github.com/lpw25) will multicore be purely a runtime change or will it also require compiling programs differently? If the latter, then this optimization can be turned off.

Well no, it's an all-or-nothing thing. Otherwise you may blow the stack depending on the runtime, and you really don't want that.

Daniel

@bluddy commented May 6, 2015

@dbuenzli Right, well, I think it's better to view this optimization not as a surefire way to fix stack overflows, but as a hint to the compiler. Even without stack overflow protection, this optimization can give you a performance boost, so it's still worthwhile.

Otherwise, this really does get messy for multicore, which is coming up... any day now?

@dbuenzli (Contributor) commented May 6, 2015

@dbuenzli (https://github.com/dbuenzli) Right, well, I think it's better to view this optimization not as a surefire way to fix stack overflows, but as a hint to the compiler.

This proposal is not an optimization (reread Frédéric's very first sentence) and it seems it may even impact performance negatively in certain scenarios.

Even without stack overflow protection, this optimization can give you a performance boost, so it's still worthwhile.

Personally I'm first interested in writing robust and correct programs and part of this task is making sure that you do not blow the stack whenever your data becomes arbitrarily large.

@alainfrisch (Contributor) commented May 6, 2015

Even without stack overflow protection, this optimization can give you a performance boost, so it's still worthwhile.

I think this is wrong. Most tricks to avoid blowing the stack (CPS, extra accumulator, mutation) tend to produce code much slower than the non-tailrec version.

@bluddy commented May 6, 2015

My bad. I guess it's only a performance optimization as compared to reversing the reverse-mapped list, which in itself is done to avoid stack overflow.

@let-def (Contributor) commented May 6, 2015

I think this is wrong. Most tricks to avoid blowing the stack (CPS, extra accumulator, mutation) tend to produce code much slower than the non-tailrec version.

This doesn't match my observations, but my benchmarks were not made with the latest version, so take what I am about to say with a pinch of salt.

Generally, code with and without trmc performed similarly.
On a quick test with operf-micro, there was no difference (around 1%, generally in favor of trmc, but likely due to external factors), although operf-micro probably does not stress trmc cases much.
On a variety of tests involving List.map:

  • for small lists and next to no work in the applied function (e.g. List.map succ [1;2]), the naive version was 10% faster,
  • as lists got larger, performance shifted progressively in favor of the trmc version, with some pathological cases where the trmc version was twice as fast as the naive one (when the list size got close to the minor heap size; my guesstimate: the only work was building the list, which the naive version does twice, once on the stack and once in the heap, while the trmc version builds it directly in the heap),
  • unrolling List.map a few times (directly matching lists with fewer than 5 elements) had much more impact than the trmc rewriting (a sketch follows below).

The unrolled version was faster than the Core_list.map one, quite significantly on large lists (when Core switches to the array representation).
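For illustration, a hypothetical unrolled variant of the kind described above (the name and the unrolling depth are mine, not from the PR); the explicit lets preserve left-to-right evaluation:

let rec map_unrolled f = function
  | [] -> []
  | [x1] -> [f x1]
  | [x1; x2] -> let y1 = f x1 in [y1; f x2]
  | x1 :: x2 :: x3 :: xs ->
    (* handle three elements per recursive step, so the
       non-tail-recursion depth shrinks by a constant factor *)
    let y1 = f x1 in
    let y2 = f x2 in
    let y3 = f x3 in
    y1 :: y2 :: y3 :: map_unrolled f xs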

@alainfrisch (Contributor) commented May 6, 2015

as the list got larger, performance shifted progressively in favor of trmc version, with some pathological cases where trmc version was twice as fast as naive one

I would be interested to see such cases, and how they would behave when the partially built list does not fit in the minor heap during the entire List.map (either because the list is very long, or just because the mapped function allocates).

@let-def (Contributor) commented May 12, 2015

Ok, I fixed a bug I introduced while rebasing; now with operf-micro the worst case is 30% slower with TRMC, and other cases are 1 to 3% faster. I haven't had time to investigate the cases where TRMC was slower.

@let-def (Contributor) commented May 12, 2015

After discussion with @lpw25, in order to make TRMC behave well with the multicore GC and delim-cc, we decided to restrict TRMC rewriting to recursive calls occurring in the last position of the block being built.

That is, for (::) : 'a * 'a list -> 'a list, the last position is the list value.
This applies to nested constructors too, so that in x :: (y :: f xs) the call f xs can be rewritten, but in x :: (f xs :: ys) it cannot; an illustration follows.
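To illustrate the restriction, a hedged sketch (the tree type and the function mirror are my own examples, not from the PR): a recursive call is now eligible only if it fills the last field of the block being built.

let rec map [@trmc] f = function
  | [] -> []
  | x :: xs -> f x :: map f xs  (* still eligible: the call fills the
                                   tail, the last field of the cons *)

type 'a tree = Leaf | Node of 'a tree * 'a tree

let rec mirror [@trmc] = function
  | Leaf -> Leaf
  | Node (l, r) -> Node (mirror r, mirror l)
  (* only the second call (the last field) stays eligible; the first
     would no longer be rewritten under this restriction *)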

@alainfrisch (Contributor) commented Jun 17, 2015

  1. Do we agree that the two versions below behave the same:
let rec map1[@trmc] f = function [] -> [] | hd :: tl -> f hd :: map1 f tl;;
let rec map2[@trmc] f = function [] -> [] | hd :: tl -> let r = f hd in r :: map2 f tl;;

?

(i.e. that the trmc attribute can change the evaluation order)

  2. In terms of user interface, I'm wondering whether it would be preferable to put the attribute on the call:
let rec map f = function [] -> [] | hd :: tl -> f hd :: (map f tl [@trmc])

or:

let rec map f = function [] -> [] | hd :: tl -> f hd :: (map[@trmc] f tl)

This would be more in line with the [@tailcall] attribute, and makes it more explicit where users expect a call to be turned into a tail one.

@alainfrisch (Contributor) commented Jun 17, 2015

as the list got larger, performance shifted progressively in favor of trmc version, with some pathological cases where trmc version was twice as fast as naive one

I'm still interested to see concrete cases where the trmc map is faster than the non-trmc version, for the same evaluation order.

I ran the following benchmark:

let rec map_trmc [@trmc] f = function
  | [] -> []
  | hd :: tl -> let r = f hd in r :: map_trmc f tl

let rec map f = function
  | [] -> []
  | hd :: tl -> let r = f hd in r :: map f tl

let n = 10000000

let test len =
  let l = ref [] in
  for i = 1 to len do l := i :: !l done;
  let l = !l in

  Gc.compact ();
  let t0 = Unix.gettimeofday () in
  for j = 1 to n / len do ignore (map succ l) done;
  let x1 = Unix.gettimeofday () -. t0 in

  Gc.compact ();
  let t0 = Unix.gettimeofday () in
  for j = 1 to n / len do ignore (map_trmc succ l) done;
  let x2 = Unix.gettimeofday () -. t0 in

  Printf.printf "len = %08i:   ratio = % 6.02f%%\n%!" len (100. *. x2 /. x1)


let () =
  test 10;
  test 100;
  test 1000;
  test 10000;
  test 100000

and I get:

len = 00000010:   ratio =  104.93%
len = 00000100:   ratio =  104.29%
len = 00001000:   ratio =  99.68%
len = 00010000:   ratio =  116.31%
len = 00100000:   ratio =  144.07%

changing the implementation to:

let rec map_trmc [@trmc] f = function
  | [] -> []
  | hd :: tl -> f hd :: map_trmc f tl

let rec map f = function
  | [] -> []
  | hd :: tl -> f hd :: map f tl

gives very different results:

len = 00000010:   ratio =  106.70%
len = 00000100:   ratio =  55.58%
len = 00001000:   ratio =  50.80%
len = 00010000:   ratio =  73.18%
len = 00100000:   ratio =  128.37%

but here the semantics is different (the trmc version always evaluates left-to-right).

Now, using:

let rec map f l = List.rev (List.rev_map f l)

I get:

len = 00000010:   ratio =  73.83%
len = 00000100:   ratio =  79.02%
len = 00001000:   ratio =  77.45%
len = 00010000:   ratio =  71.45%
len = 00100000:   ratio =  82.75%
@let-def (Contributor) commented Jun 17, 2015

(i.e. that the trmc attribute can change the evaluation order)

Yes, trmc forces the trmc call (the rightmost one) to be evaluated last.

let rec map f = function [] -> [] | hd :: tl -> f hd :: (map f tl [@TRMC])
or:
let rec map f = function [] -> [] | hd :: tl -> f hd :: (map[@TRMC] f tl)

Nice, I also find that better and more predictable. Unless people express other opinions, I think I will apply your suggestion.

@let-def (Contributor) commented Jun 17, 2015

About your benchmark: I likely made the mistake of using the wrong evaluation order in my tests.
Your results seem to sum up what I observed.

In a second run, I unrolled the map function a few times, and the difference was much larger than the one between the TRMC and non-TRMC versions. This time I took care to use the appropriate evaluation order. (Which makes me think that this rewriting might conflict with flambda unrolling.)

@gasche (Member) commented Jun 17, 2015

In terms of user interface, I'm wondering whether it would be preferable to put the attribute on the call [...]. This would be more in line with the [@tailcall] attribute, and makes it more explicit where users expect a call to be turned into a tail one.

Is TRMC a global or local transform? My previous mental model was of TRMC as a whole-function transform, and I was just writing a message explaining that I dislike your proposal for this reason. Thinking about it more, it looks like (although TRMC indeed requires a whole-function change) the possibility for each recursive call to be TRMC-ed is independent of the rest of the recursive function body. If that is really the case, then maybe the attribute should indeed be at the level of the call.

This would also have the advantage of easily allowing the user to specify which call should be trmc-ed if there are two recursive calls under the constructors, which can be helpful when the recursion depth is uneven. (But as noted by @def-lkb, this can already be done by using let explicitly.)

(I think in this case it would be both fairly reasonable and intuitive to allow both positions.)

@bobzhang (Member) commented Aug 14, 2015

I like this idea. Can we pass all attributes on function declarations down to the lambda layer in a modular way, instead of the current ad-hoc passing? Right now it seems that if we want to add one optimization, we need to change the lambda interface; this is already the case with [@tailcall], and it will quickly make the lambda layer bloat.

@bobzhang referenced this pull request Aug 14, 2015: At tailcall #111 (closed)

@let-def (Contributor) commented Nov 15, 2015

@def-lkb you should add tests.

@gasche (Member) commented Nov 19, 2015

We discussed it at the developer meeting today, and the consensus was that there were too many changes going in for 4.03, so this could be postponed.

Also, the idea of only ever mutating the last field (as a way to leave open a workaround to support multi-shot continuations) proved unpopular. I don't see it suggested in the PR (and I don't understand its full details: if the last field is not a self-call but some other field is, is no TRMC performed at this use-site?). On the other hand, I do think that supporting multi-shot continuations was a valid move (for HANSEI for example); we may need to think harder about (1) less ad-hoc designs for the multi-shot case or (2) whether a clear specification exists that encompasses both choices of behaviour, to at least be able to give a stable specification that allows changing our minds about this specific design detail.

@damiendoligez added this to the 4.04-or-later milestone Jan 25, 2016

@mshinwell removed this from the 4.04 milestone Sep 7, 2016

@mshinwell (Contributor) commented Dec 27, 2016

@let-def What is the plan for this pull request?

@let-def (Contributor) commented Dec 27, 2016

@mshinwell Unless there is a consensus that this is better implemented in flambda than in lambda (potentially optimizing better, but no longer benefiting the bytecode interpreter), I will port it to the latest version.
Then I will rerun the tests and provide up-to-date figures.

@mshinwell (Contributor) commented Dec 27, 2016

Can you outline in what ways you think this will perform better if done as an Flambda pass? (It would be the natural place for this kind of thing.)

@let-def (Contributor) commented Dec 27, 2016

The fact that TRMC introduces effects in otherwise pure code can prevent some flambda optimizations.
However, now that it is possible to express initialization vs assignment in the IR, flambda can also be made aware that this is a simple effect and still perform optimizations (AFAIR from a discussion with @chambart).

@mshinwell (Contributor) commented Dec 27, 2016

In which case I think you should decide whether you want to port it to Flambda or not. The case for doing that sounds less compelling given what you say.

@let-def (Contributor) commented Dec 27, 2016

I don't want to port it to flambda; I think it is better placed higher in the pipeline. I will just port it to trunk. But it would be nice to ensure this doesn't prevent flambda from working either.

In my opinion:

  • this rewriting incurs deep changes in runtime behavior (O(1) vs O(n) stack usage), so I think it is important to be able to control when it applies;
  • it is driven by syntax (if only to make it easy to predict), so higher in the pipeline is better.

Furthermore, in the current design it is opt-in: it applies only in the presence of a [@trmc] attribute. My use case is to benefit from this optimization, without having to resort to dirty low-level hacks (even harder to get correct with flambda), when the programmer knows it should apply.

@gasche (Member) commented Dec 27, 2016

For the record, I support (in principle) having this optimization opt-in, especially if we advertise it as an experimental feature. I also think it would be nice to have a compiler command-line parameter to disable it, to avoid breakage for users (HANSEI/delimcc, potentially multicore-ocaml) who break the uniqueness assumption -- the fact that trmc would then not be guaranteed by the marker justifies the experimental status.

@bobzhang (Member) commented Dec 28, 2016

If it is an experimental feature, how about trying it with the new OCaml plugin system?

@stedolan (Contributor) commented Mar 1, 2017

For the record, this optimisation is compatible with multicore, if implemented carefully.

Multicore distinguishes "mutable" and "immutable" operations. Oddly enough, "immutable" doesn't mean that it is never modified (!), since there are a few places in OCaml where "immutable" fields change (building recursive values, initialising recursive modules, functional update of large records, etc.)

So, in multicore, there are "mutable" and "immutable" reads, but also "mutable" and "immutable" writes. The invariant is that a field must never be subject to both mutable writes and immutable reads: it is always safe to use an immutable write or a mutable read.

Tail recursion modulo cons modifies immutable fields after they have been allocated. This is compatible with multicore, as long as the modifications are done using "immutable" writes.

@gasche (Member) commented Mar 1, 2017

But it is not only a question of whether it would break the runtime's assumptions; my point was rather that the correctness of the optimization relies on sequentiality. If non-linear usage of fiber-delimited continuations is allowed by the runtime, a not-yet-completed value may be aliased between several different writers, so you could have a situation where one of them mutates the result location while another has already filled it and shared it with its readers, which makes the optimization incorrect. This is exactly the same issue as with HANSEI.

(You can detect the issue at runtime by refusing to re-initialize an already-initialized field, assuming there is enough synchronization around those immutable writes that they do not race. But it is unclear that there is a better strategy than failing there: you could think of doing a copy in that case, along the path from the result's root to the sentinel value to reinitialize, but in general you may not know what this path is, and strategies such as keeping the pointers inverted along the path add more overhead to the optimization.)

@stedolan (Contributor) commented Mar 1, 2017

Ah, I see. However, since the continuations in multicore are single-shot (so that every function returns at most once) I think this still works.

Nonlinear continuations break more than just TRMC. For instance, the optimisation that turns a local ref into a mutable stack slot breaks in the presence of stack-copying continuations.

@damiendoligez added this to the 4.07-or-later milestone Sep 27, 2017

@damiendoligez removed this from the consider-for-4.07 milestone Jun 1, 2018
