
Optimize some local functions #2143

Merged
merged 10 commits into ocaml:trunk from alainfrisch:optim_local_tailcalls on Nov 27, 2018

Conversation
@alainfrisch (Contributor) commented Nov 9, 2018

This is a revised version of MPR#6242.

[EDIT: modified description to match the latest version of the PR]

A new optimization pass is added in Simplif to avoid the overhead of allocating and calling local closures. Functions bound to an identifier are simplified away if all references to the identifier are full applications of the function (i.e. with the same number of arguments) and if all these calls are in the same "tail scope". In that case, a "staticcatch" handler is inserted around that scope and each call is turned into a "staticraise" jumping to that handler. A special case is when a function is used only once (for a full application), which obviously falls into the case above; in that case, an existing pass further simplifies the static exception by inlining the handler (i.e. the function's body) at the call site.

The case of local functions used in tail position corresponds to a very frequent use case where multiple branches (of pattern matching or conditionals) need to finish with the same computation. There is currently no very good way to express this pattern: (1) a local function allocates a closure (even when unused), calling the function has a cost, and it breaks optimizations restricted to a single function body (unboxing of boxed numbers, register allocation); (2) a try...with is costly (even with raise_notrace), breaks the tail context (i.e. can turn tail calls into non-tail ones) and is syntactically not very pleasant.
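
To make the comparison concrete, here is a small made-up example of the two encodings (illustrative only, not taken from the PR):

(* Encoding 1: a local function.  The closure for [finish] may be
   allocated even on paths that never use it, and every use pays the
   cost of a call. *)
let f1 x =
  let finish n = n * 2 + 1 in
  match x with
  | 0 -> finish 10
  | 1 -> finish 20
  | _ -> 0

(* Encoding 2: a local exception.  Even with [raise_notrace] this is
   costly, the branches are no longer in tail position, and it is
   syntactically unpleasant. *)
let f2 x =
  let exception Ret of int in
  try
    begin match x with
    | 0 -> raise_notrace (Ret 10)
    | 1 -> raise_notrace (Ret 20)
    | _ -> 0
    end
  with Ret n -> n * 2 + 1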

It could be that some of these cases are or will be optimized with flambda (I did not check), but (1) this would not benefit the JavaScript backends (js_of_ocaml & BuckleScript) and (2) flambda is still not the default (and not very widely used, I think).

Re-running the micro-benchmark from MPR#6242, I get:

Default settings for ocamlopt (no flambda):
before: 4.0s
after: 3.3s (18% faster)

With -inline 100:
before: 4.0s
after: 1.5s (62% faster)

Another micro-benchmark:

let foo x =
  let r = ref 0. in
  if x < 10 then r := !r +. 1.;
  let ret n = !r *. n in
  if x mod 3 = 0 then ret 2.
  else if x mod 3 = 1 then ret 1.
  else 10.

let () =
  let r = ref 0. in
  for i = 1 to 1000000000 do
    r := !r +. foo i
  done;
  Printf.printf "%f\n%!" !r

Default settings for ocamlopt:
before: 5.9s
after: 4.2s (29% faster)

With -inline 100:
before: 5.9s
after: 2.4s (59% faster)

@gasche (Member) left a comment

I am pleasantly surprised by how simple the changes are. It looks like a worthy optimization if we can get it within this complexity budget.

One thing that is a bit too subtle for my taste in the current implementation is the fact that while general occurrence-counting is implemented in an ugly and efficient way (pushing and popping integer references from an environment hashtable), tailcall-occurrence counting is implemented using an elegant and inefficient iterator (if I understand correctly, let ... in let ... in let ... in ... risks a quadratic worst-case).

Using a general traversal for occurrence counting would be shorter, easier to read, and also even more quadratic. On the other hand, using an optimized implementation of tailrec-counting is actually very tricky (I first thought it would be easy), because "being in tail position" is not an absolute notion, it is relative to one parent in the AST.

I think the following algorithm would work:

  • During the one-pass iteration, keep a hashtable to count arbitrary occurrences of each variable, but also a hashtable to count tail-position calls to each variable.
  • Use an iterator on all subterms, which passes the tail-rec status: (in_tail_position:bool -> Lambda.t -> Lambda.t) -> Lambda.t -> Lambda.t.
  • When you recurse on a subterm which is in tail-position, nothing special happens (you continue iterating, using the logic in the current PR).
  • When you recurse on a subterm which is not in tail-position, you temporarily forget about the current hashtable counting tail calls (for example by pushing it onto a stack of disabled hashtables, to be restored later) and start with a fresh empty table. As a result, only the uses of the variables that will be bound deeper in the subterm will be counted as tail-recursive -- as we need. When you are done traversing that part, you restore the old tail-counts hashtable (see the sketch after this list).
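
The following is a self-contained sketch of this algorithm on a toy AST (my illustration: the term type and all names are hypothetical, the real pass works on Lambda terms, and arity checks and shadowing are ignored):

type term =
  | Var of string
  | Int of int
  | App of string * term list                      (* full application *)
  | If of term * term * term
  | Letfun of string * string list * term * term   (* let f ps = body in cont *)

(* For each let-bound function, decide whether every occurrence is a
   full application in tail position relative to its own tail scope. *)
let analyze (t : term) : (string, bool) Hashtbl.t =
  let occ = Hashtbl.create 16 in             (* counts all occurrences *)
  let tail_tbl = ref (Hashtbl.create 16) in  (* counts tail calls only *)
  let bump tbl id =
    Hashtbl.replace tbl id
      (1 + Option.value (Hashtbl.find_opt tbl id) ~default:0)
  in
  let count tbl id = Option.value (Hashtbl.find_opt tbl id) ~default:0 in
  let verdict = Hashtbl.create 16 in
  let rec go ~tail t =
    match t with
    | Var x -> bump occ x
    | Int _ -> ()
    | App (f, args) ->
        bump occ f;
        if tail then bump !tail_tbl f;
        List.iter non_tail args              (* arguments are never tail *)
    | If (c, e1, e2) -> non_tail c; go ~tail e1; go ~tail e2
    | Letfun (f, _ps, body, cont) ->
        non_tail body;                       (* the body is its own tail scope *)
        go ~tail cont;
        (* eligible iff every occurrence of f was a tail call in this scope *)
        Hashtbl.replace verdict f (count !tail_tbl f = count occ f)
  and non_tail t =
    (* disable the current tail-call table while inside a non-tail
       subterm, so uses of outer binders cannot count as tail calls *)
    let saved = !tail_tbl in
    tail_tbl := Hashtbl.create 16;
    go ~tail:true t;                         (* the subterm starts a fresh tail scope *)
    tail_tbl := saved
  in
  go ~tail:true t;
  verdict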
Llet
(
str, kind, id,
Lfunction {lf with kind=Curried; params; body},

@gasche (Member) commented Nov 9, 2018

Why are you repeating kind=Curried and params which haven't changed, if I understand correctly?

let rec g lam = f (shallow_map g lam) in
g

let iter_tail f = function

@gasche (Member) commented Nov 9, 2018

This function needs a careful second review to ensure that it is not overly optimistic about which calls are tail calls (if we get this wrong, the static-raise optimisation is unsound). For example, the Lstaticcatch case was a bit subtle to me, although I believe this one is indeed correct.

@alainfrisch (Contributor, Author) commented Nov 9, 2018

You're right that one should be careful with quadratic behavior (even though the constant factor should be very small here). One possibility, which might be equivalent to your proposal, would be to traverse the term, keeping track during the descent of a "non-tailness level" (incremented each time we navigate to a subterm in non-tail position). The traversal would store the maximum level for each identifier, which allows us, on the way up, to know if a given identifier is only used in tail-call position.
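
A sketch of this level-based variant, reusing the toy term type from the earlier sketch (same caveats: my illustration, with arity checks and shadowing ignored):

let max_use_level (t : term) : (string, int) Hashtbl.t =
  let max_level = Hashtbl.create 16 in
  let record id level =
    match Hashtbl.find_opt max_level id with
    | Some l when l >= level -> ()
    | _ -> Hashtbl.replace max_level id level
  in
  let rec go level ~tail t =
    (* crossing a non-tail frontier increments the level *)
    let level = if tail then level else level + 1 in
    match t with
    | Var x -> record x level
    | Int _ -> ()
    | App (f, args) ->
        record f level;
        List.iter (go level ~tail:false) args
    | If (c, e1, e2) ->
        go level ~tail:false c;
        go level ~tail:true e1;
        go level ~tail:true e2
    | Letfun (_f, _ps, body, cont) ->
        go level ~tail:false body;
        go level ~tail:true cont
        (* after traversing [cont]: [_f] was used only in tail position
           iff [Hashtbl.find_opt max_level _f] is [None] or [Some level] *)
  in
  go 0 ~tail:true t;
  max_level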

@gasche (Member) commented Nov 10, 2018

Yes, I thought of this level idea as well at first, but I thought it was a bit more complex than my next proposal (by cutting the previous binders at each non-tail point, I make sure that only the identifiers of non-tailness level below the current one remain in the active table). Thinking about the two again, maybe the "level" idea is easier to explain because it relies less on the dynamics of the traversal. Either implementation would be fine. In both cases, I would welcome efforts to keep the code readable by moving this counting logic to a named auxiliary function, and/or having a comment to explain what is going on.

@gasche (Member) left a comment

The new implementation makes sense to me, although I think that a comment indicating what tbl, depth and static are for could help newcomers understand what the code is doing. Why did you decide to remove the "inline single-use functions" case? Is it subsumed by another simplification?

On bytecomp/lambda.mli (outdated):
expression.
*)

val shallow_iter: tail:(lambda -> unit) -> non_tail:(lambda -> unit) -> lambda -> unit

@gasche (Member) commented Nov 12, 2018

This function is fine, but personally I think that (tail:bool -> lambda -> unit) -> lambda -> unit would be a bit easier to use than the current proposal. It makes it easy to ignore the information (you can completely replace the main iter function with this generalization, instead of having to provide a separate function because the API is a bit inconvenient), and I think in most cases tail and non_tail are going to share most of their logic, so marking the differences with if tail ... inserts makes a lot of sense.

@alainfrisch (Contributor, Author) commented Nov 12, 2018

you can completely replace the main iter function with this generalization, instead of having to provide a separate function because the API is a bit inconvenient

The current approach allows defining the previous iter_head_constructor (which does not distinguish between the two kinds of sub-nodes) without having to allocate a closure at each node; and for the only use of the more fine-grained version (in Simplif), it works quite well to have two different callbacks.
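
For instance, the coarser iterator can be recovered with no per-node closure allocation (a sketch assuming the signature quoted above; iter_head_constructor is the pre-existing function mentioned in this discussion):

(* Passing the same callback for both labels recovers an iterator that
   does not distinguish tail from non-tail sub-nodes. *)
let iter_head_constructor (f : Lambda.lambda -> unit) (lam : Lambda.lambda) =
  Lambda.shallow_iter ~tail:f ~non_tail:f lam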

@gasche (Member) commented Nov 12, 2018

(The commit message says "pseudi linear-time" instead of "pseudo", I think.)

@alainfrisch (Contributor, Author) commented Nov 12, 2018

@gasche : thanks for the live review 😄

The latest version is a generalization of the two cases initially described: it turns a function into a static handler as soon as all uses of the function are full applications, all of them in the same "tail scope". This covers in particular the case of a function used only once (including module-level function bindings, i.e. a function introduced in a module but not exported and used only once in the module).

Nested functions are supported as well; for instance, all local function bindings are eliminated in:

let foo x y =
  let f x = x * x * x in
  let ret n = f (x + n * y) in
  let a =
    if x mod 2 = 0 then ret 3
    else if y mod 2 = 0 then ret 4
    else f 10
  in
  a + 1
@alainfrisch (Contributor, Author) commented Nov 12, 2018

(Ah, there is something wrong in the current version, I'm investigating...)

@alainfrisch (Contributor, Author) commented Nov 12, 2018

Misc.protect_refs [R(current_scope, scope)] (fun () -> tail lam)

True, but this might be a bit performance sensitive (called for each non-tail lambda), so I felt that the extra 3 lines of code were worth it.
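
The trade-off, sketched (my illustration, not the PR's code): protect_refs allocates a closure and a list cell at each call site, while a manual save/restore allocates nothing, at the price of not being exception-safe:

let enter_scope current_scope scope f lam =
  let saved = !current_scope in   (* save the previous scope *)
  current_scope := scope;
  f lam;                          (* not exception-safe, unlike protect_refs *)
  current_scope := saved          (* restore it *)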

@alainfrisch (Contributor, Author) commented Nov 12, 2018

(The commit message says "pseudi linear-time" instead of "pseudo", I think.)

Yes, of course. If the PR is accepted, it should be squashed I think, so I won't bother rewriting the history to fix this typo.

@mshinwell (Contributor) commented Nov 12, 2018

@lpw25 and I discussed this proposal; here are some thoughts. (This was a comment on your first version.)

As background: one of the main design principles behind the next version of Flambda is to try to ensure that, even in the presence of high optimisation, users are able to predict and understand what the compiler is doing. The current compiler is good in this regard and we would like not to compromise it. The principle extends, in particular, to avoiding "cliff edges" where optimisations suddenly do (or do not) kick in and cause sudden performance changes.

One place where "cliff edges" can occur is when using optimisations that depend on the number of uses of some particular thing, such as the first optimisation proposed here: adding a second use of a local function could suddenly cause a non-negligible change in performance. As such we think we should be a bit careful here both in terms of the specification of the optimisation, and also in its implementation. In terms of the specification, it should be syntactically obvious when the optimisation is going to take effect, so we can give a clear criterion in the manual that is easily checked by the programmer. In terms of the implementation, this probably means doing the analysis prior to Lambda, marking the sub-terms appropriately based on syntactic criteria isomorphic to the equivalents on the source text of the program, then doing the optimisation later based on such markers.

Writing high-performance code sometimes requires manual guidance to be given to the compiler.
In this case, we should provide attributes (perhaps re-using, or at least respecting the existing inlining ones) to disable the optimisation, on both the function declaration and its use sites.

It's worth noting that these optimisations, perhaps surprisingly, are not always a win. For example, for the tail optimisation, if the function is small it might be better just to inline it at each of the use sites (especially if the inlined versions are then variously simplified using information about the function's arguments). Another problem is interaction with inlining of the outer function: if the outer function gets larger by virtue of containing the code of the inner local function as a result of these optimisations, it might prevent the outer function from getting inlined, which could be worse. These kinds of interactions, along with the ability, if we do these optimisations in Flambda instead of at the Lambda level, to integrate their results into the inlining reports, tend to support an approach where the relevant sub-terms are marked for later analysis. We could just disable the transformation on Lambda and do it later, somehow integrating these optimisations into the usual inlining benefit calculations, or perhaps having them as a separate phase after inlining. (We haven't spent enough time to really get to grips with how this would work yet, but these are some preliminary ideas.)

That said, we also suspect that the application of the tail optimisation prior to Flambda (but not the used-only-once optimisation) might not compromise optimisation in Flambda 2.0, because its whole modus operandi is that of static catch handlers (continuations).

Another issue relates to closed local functions, which will be lifted out, so there won't be any closure allocation. There will still be register shuffling to match the calling convention---but then, there will probably be some under these optimisations too, where the shared continuation has its own calling convention assigned by the register allocator. As such, we think it would be better to phrase both optimisations in terms of guaranteeing the absence of closure allocations: "a local function syntactically used only once, or syntactically used only in tail position, is guaranteed not to allocate a closure". Then the compiler can choose whether to lift the function out or apply one of these optimisations.

Given that these optimisations are effectively stand-alone, I also recommend they should be in a separate module, rather than lumping them together into Simplif.

@alainfrisch (Contributor, Author) commented Nov 12, 2018

The principle extends, in particular, to avoiding "cliff edges" where optimisations suddenly do (or do not) kick in and cause sudden performance changes.

I wonder how this could be achieved with non-trivial heuristics to decide whether to inline or not. Inlining can have a huge impact in terms of further simplification (and thus avoiding allocations, or even some pure calculations), but it would be crazy to inline blindly as soon as at least one such "possibly important" optimization is made possible. So there need to be heuristics and probably thresholds, and I don't see how you can achieve the smoothness and predictability you are aiming at.

The current compiler already has plenty of "cliff edges". For instance, turning references into mutable variables will be disabled as soon as there is one occurrence of the reference being passed to another function; or, we used to have the property that any use of a local float variable would disable its unboxing if the context expected a boxed version (luckily, this is now more "continuous").

In terms of the implementation, this probably means doing the analysis prior to Lambda, marking the sub-terms appropriately based on syntactic criteria isomorphic to the equivalents on the source text of the program, then doing the optimisation later based on such markers.

I understand all that you are saying, but do you agree that this is really not how things are done today?
And do you expect to formalize flambda2 optimizations in terms that can be specified and understood syntactically by users?

The current proposal seems to be aligned with optimizations as currently implemented at the lambda level in Simplif. Also, I'm rather sensitive to compiler performance and to the simplicity of the compiler code base, and I'd prefer not to complexify the current implementation nor introduce more intermediate languages. This would probably exceed the "complexity budget" @gasche mentioned. More advanced approaches should be left to flambda(2), I think.

It's worth noting that these optimisations, perhaps surprisingly, are not always a win.

The examples you give are useful, but I think they apply equally to inlining itself (inlining an inner function can make the outer function less inlinable). I expect this to be handled by flambda(2) (which can disable or not the new optimization). flambda could also decide to re-introduce functions for all "static handlers" (or even all sub-expressions) and base its inlining decisions independently of actual function boundaries in the source code.

I'd really appreciate it if the current PR could be evaluated independently of flambda(2). Considering how simple its implementation is and how straightforward it is to enable/disable it, we can always revisit its relevance later.

There still will be register shuffling to match the calling convention---but on the other hand, so will there probably be under these optimisations, where the shared continuation has its own calling convention assigned by the register allocator.

I'm not sure I understand your point here. Yes, the benefit of the optimization would be less important with closed functions, but would it be harmful to still apply it? One still avoids crossing function boundaries, without expanding the total size of the code. There could of course be bad cache effects in some cases, but this is the case for almost any optimization.

@alainfrisch (Contributor, Author) commented Nov 12, 2018

FWIW, I tried compiling typing/*.ml with and without the new optimization enabled (with ocamlc.opt) and could not observe a difference in compilation time.

@alainfrisch (Contributor, Author) commented Nov 12, 2018

Why did you decide to remove the "inline single-use functions" case, is it subsumed by another simplification?

It's subsumed by the latest version, which will wrap the single call in a staticcatch placed just around the call (the call itself being turned into a staticraise), and the simplify_exits pass will do the actual inlining.

For instance, consider:

let foo i =
  let f x = x + 1 in
  let r = f (i * 2) in
  r * 2

We get the following versions of the lambda code:

rawlambda:

** before simplifications:
(setglobal Foo!
  (let
    (foo/3 =
       (function i/4
         (let
           (f/5 = (function x/6 (+ x/6 1)) r/7 =[int] (apply f/5 (* i/4 2)))
           (* r/7 2))))
    (makeblock 0 foo/3)))
** after simplify_local_functions:
 (setglobal Foo!
   (let
     (foo/3 =
        (function i/4
          (let (r/7 =[int] (catch (exit 1 (* i/4 2)) with (1 x/6) (+ x/6 1)))
            (* r/7 2))))
     (makeblock 0 foo/3)))
** after simplify_exits:
 (setglobal Foo!
   (let
     (foo/3 =
        (function i/4
          (let (r/7 =[int] (let (x/9 =a (* i/4 2)) (+ x/9 1))) (* r/7 2))))
     (makeblock 0 foo/3)))
** after simplify_lets:
 (setglobal Foo!
   (let (foo/3 = (function i/4 (let (r/7 =[int] (+ (* i/4 2) 1)) (* r/7 2))))
     (makeblock 0 foo/3)))
@gasche (Member) commented Nov 12, 2018

With regard to what Mark says, one argument in favor of the proposed optimization is that, if I understand correctly, it is program-size-preserving: some call f args are replaced by raise f_id args, and one let f = fun args -> e in body is replaced by catch body with f_id args -> e, but no duplication occurs: we are just lowering local functions, not changing the program structure¹. In particular, this should preserve or decrease a reasonable inlining size heuristic.

¹: this is modulo Alain's last trick, which lifts a function declaration into the deepest possible subterm in the hope of increasing localization opportunities.

Personally I think that this optimization is rather well-behaved and it is the kind I don't find shocking to have early in the pipeline (as early as "static raises and catch" are available). Another similar optimization would be lowering try/raise into staticcatch/staticraise. This is all very similar in spirit to lowering references to mutable variables, which is already done in Simplif.

Of course, all this is assuming that the PR can be implemented within a reasonably low complexity budget. My guess from preliminary reviewing is that it is doable.

@alainfrisch (Contributor, Author) commented Nov 12, 2018

In particular, this should preserve or decrease a reasonable inlining size heuristic.

The trouble is when the inlining heuristic is based on the size of the function's body (excluding inner functions), which is increased by the current proposal. But with such a heuristic, it's also the case that inlining itself could increase the size of some functions, making them ineligible for inlining. So I think this is a topic for the inliner to deal with, and hopefully flambda(2) has something clever to do here.

Another similar optimization would be lowering try/raise into staticcatch/staticraise.

(FTR, a version of this was proposed in #638.)

@lpw25 (Contributor) commented Nov 12, 2018

The problem with optimisations based on the number/nature of uses is not only that they produce sudden changes, but that these changes are often "spooky action at a distance". You add a call to function A at position B in your code, and an unrelated call to A far away at point C gets dramatically slower. The classic example of this is when adding some test code for a function dramatically slows down its use in production.

The main way to mitigate this is to limit the "distance" at which these changes can occur. For this reason I strongly suggest that you change this PR back to only working on local functions.

A related issue is that the number of uses can be affected by earlier optimisations. For example, you
might have:

let foo ... =
  let bar x = ... in
  bar x;
  ...
  if false then bar x

which allows the optimisation to trigger, but if you make a simple change to the second use:

let foo ... =
  let bar x = ... in
  bar x;
  ...
  let debug = false in
  if debug then bar x

and suddenly your program slows down.

This is why I think it would be much better to base the analysis on the number of uses in the original source program rather than the number of uses at the point in the compilation pipeline that you happen to put the analysis.

It seems like it wouldn't be too hard to do the analysis on the typedtree as it is converted into lambda code, which would then ensure that the optimisation was based on a simple, predictable syntactic criterion.

The current compiler already has plenty of "cliff edges". For instance, turning references into mutable variables will be disabled as soon as there is one occurrence of the reference being passed to another function.

I'm not sure about "plenty". To my knowledge the reference optimisation is the only other usage-based optimisation we do. I'd also love to base that on a reliable syntactic condition as well. Either way, having some unpredictable optimisations already is not a great argument for adding more of them.

I think they apply equally to inlining itself (inlining an inner function can make the outer function less inlinable)

They do, and flambda has strategies to counter them. This is an argument in favour of delaying the actual transformation until later in the compilation, so that flambda can manage this aspect of things. If you were to do the analysis earlier than Simplif and just do the transformation in Simplif then flambda could disable the transformation and do its own version whilst still sharing the analysis code -- to make sure that all versions of the compiler make the associated guarantee uniformly.

@lthls (Contributor) commented Nov 12, 2018

I have investigated the test failure with flambda (staticalloc.ml fails with flambda turned on).

The failure comes from the first assertion, assert (g () == ((1, 2), (1, 2))), where g gets inlined by the optimization in this pull request.

The difference between the two sides is that the right-hand side gets recognized as constant during earlier transformations (Translcore, I believe), while the left-hand side is still a dynamic allocation at this point.
The specifics of the Flambda implementation mean that, with the default settings, the left-hand side is still compiled to a statically allocated structure whose contents are initialized at the start of the program (Initialize_symbol, in flambda terms) while the right-hand side is, as expected, fully static (Let_symbol).
Compiling with -O2 allows the relevant passes to be rerun, translating the left-hand Initialize_symbol into a Let_symbol, sharing the constants, and allowing the test to succeed.

Note that the same issue appears if g is manually inlined in the source code, so this could be considered as a problem with flambda rather than this specific optimization.

@alainfrisch (Contributor, Author) commented Nov 12, 2018

For this reason I strongly suggest that you change this PR back to only working on local functions.

At the level where the optimization operates, there is no obvious distinction between global and local functions. The first version already dealt with global functions.

Adding complexity to a relatively straightforward optimization in order to reduce the number of cases where it applies seems a bit weird. In my opinion, what's important is not so much to ensure that the optimization does not apply "by chance", but to provide decent guarantees that some well-specified forms will be optimized. For instance, we don't need to advertise that global functions will be impacted, and if they are, the gain for them will usually be quite small (so the loss if the optimization is broken by a second use would be small as well).

Honestly, I don't believe that flambda(2) will ever be able to achieve a state where the user can reasonably predict exactly which inlining will happen, purely following simple syntactic criteria. What is possible is that some specific code patterns (including with explicit attributes) can be documented to be handled efficiently, and this is already quite good.

Concretely, there are two important cases where we want to guarantee that the current optimization applies (both illustrated in the sketch after this list):

  • Lifting an arbitrary sub-expression into a local let-bound function in order to reduce the nesting level.

  • Sharing identical code in multiple possible "continuations" of a sub-expression (e.g. a function that finishes with a pattern matching where multiple clauses want to return with a similar computation).
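
A hypothetical example covering both cases: k names the shared continuation of the match, un-nesting it, and with this optimization it compiles to a static handler rather than a closure:

let describe x =
  let k n = if n > 100 then "big" else "small" in
  match x with
  | Some y when y >= 0 -> k y     (* tail call, same tail scope *)
  | Some y -> k (-y)              (* tail call, same tail scope *)
  | None -> "empty"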

Unfortunately, there is currently no "Writing Efficient Code with OCaml" manual that I could extend with such descriptions.

To my knowledge the reference optimisation is the only other usage-based optimisation we do.

We used to have the unboxing strategy for boxed numbers (which was broken by any use in a non-unboxing context). The new heuristic is more robust in a sense, but adding one use of a variable can inform the compiler that the variable is, say, a float, triggering its unboxing and resulting in a re-boxing at all accesses that require a boxed float, thus dramatically slowing down a tight loop.

There are also optimizations in Simplif where (1) "static handlers" are simplified away depending on the fact that they are used only once and (2) some local pure "alias" let-bindings are inlined when used only once. If we add the case of turning references into mutable variables, this pretty much covers all existing optimizations in Simplif -- i.e. they are all more or less based on tracking use sites. I think the current PR is pretty much in line with the existing approach; I understand it does not align with the goals of flambda(2), though, and I'm really not shocked if those roughly 100 lines of code are disabled when flambda(2) is in use.

@alainfrisch (Contributor, Author) commented Nov 12, 2018

@lthls Thanks for the analysis!

@lpw25 (Contributor) commented Nov 12, 2018

Adding complexity to a relatively straightforward optimization in order to reduce the number of cases where it applies seems a bit weird.

It's a pretty standard approach to improving predictability of a system -- we do it with type inference all the time.

Honestly, I don't believe that flambda(2) will ever be able to achieve a state where the user can reasonably predict exactly which inlining will happen, purely following simple syntactic criteria.

Flambda doesn't rely on syntactic criteria, but all its optimisations are based on analysing the definitions of values, not their uses. This means that when your code slows down, it is because you changed something about the code that is now slower, not some unrelated code. That property does not hold for this PR, which I think justifies making its behaviour as predictable as possible, especially if it can be made completely predictable with relatively little effort.

For instance, we don't need to advertise that global functions will be impacted, and if they are, the gain for them will usually be quite small

From this I gather that the optimisation does not apply to calls within other functions? If that's the case then this is indeed not too bad. Your description of it working on module-level functions sounded like any global function used only once -- even if within another module-level function -- would be inlined.

There are also optimizations in Simplif where (1) "static handlers" are simplified away depending on the fact that they are used only once and (2) some local pure "alias" let-bindings are inlined when used only once.

To be fair, those optimisations do not affect the amount of allocation in the program. They are not likely to have much effect on performance. They are also optimising constructs that were introduced by the compiler, not altering the behaviour of something specified by the user.

@alainfrisch (Contributor, Author) commented Nov 13, 2018

especially if it can be made completely predictable with relatively little effort.

This remains to be proven, and I don't believe this is the case. To be clear, I don't intend to spend effort trying to do it.

To elaborate a bit on my position: the current implementation does two traversals of the Lambda structure, one to detect the optimization opportunities, another to apply them. IIUC, you are suggesting to do the first pass on the Typedtree, or perhaps during the Typedtree->Lambda translation, and keep the second one as a Lambda->Lambda rewriting. But (1) analyzing the Typedtree is much more difficult, and (2) I don't see how to cleanly pass the information on "tail scope" from Typedtree to Lambda without changing the Lambda definition (e.g. by adding explicit scope identifiers). Anyway, splitting the optimization into two passes implemented on different ASTs is bound to make it harder to maintain and understand, and to combine with possible other simplification passes applied at the Lambda level. For instance, one could very well imagine some other simple inlining heuristics at the lambda level, which could be applied before the currently discussed optimization and interact with it (by exposing more opportunities). In your view, this is bad since it reduces the "predictability" of the optimization, but I don't buy it. It seems weird to me to apply to the proposed optimization some "syntactic predictability" criteria which don't hold for most other current and planned optimizations.

Moreover, your real concern seems to be specifically about the "usage" analysis, but implementing the analysis at the Typedtree level will not really change the situation. It could perhaps be more predictable, but as easily broken by a "remote use". The alternative is either to completely disable the optimization, or to allow it to duplicate some code.

Now, if one wants to make the optimization more "predictable", one could add a dedicated attribute that would check that the optimization applies to a given function. Would it be enough to address parts of your concern?

@gasche (Member) commented Nov 13, 2018

Yes, I think a let[@static] f x y z = foo attribute that generates an error if a binding cannot be optimized would be nice to have. Later we could think of generalizing it to let[@static] exception Foo in ... and let[@static] x = ref foo in ....

@alainfrisch (Contributor, Author) commented Nov 13, 2018

I've piggy-backed on the existing @inline attribute (now allowing [@inline static]), since the compilation scheme can be seen as a form of inlining (and requesting it is incompatible with other forms of inlining). Moreover, [@inline never] disables the new optimization.

The new [@inline static] attribute also makes it easier to write non-regression tests. I'll do so when I get a green light on the overall approach.
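
For illustration, usage would look like this (my sketch; the attribute spelling was still under discussion at this point). The compiler warns if ret cannot be turned into a static handler:

let classify x =
  let[@inline static] ret s = "mod3=" ^ s in
  match x mod 3 with
  | 0 -> ret "zero"
  | 1 -> ret "one"
  | _ -> "other"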

@gasche (Member) commented Nov 13, 2018

Having @inline {always,static,never} suggests that "static" is a compromise on the spectrum between always and never: inline when "the call is static" or something. I think this naming/interface is not quite right.

Any other ideas? Maybe always-static would get the information through, but I still have a preference for a proper @static attribute. On the other hand, it's fair to ask how to explicitly disable the transformation. Maybe one could support @static {always,never} (but this is not as clear as I would like), or consider that "static" is a comment on the form of inlining that is possible, and enables inlining by default, but that @inline never disables the transformation? In particular, let[@static][@inline never] expresses: "prevent me from using this in non-tail positions, but don't optimize me anyway".

@lpw25 (Contributor) commented Nov 13, 2018

To be clear, I don't intend to spend effort trying to do it.

I still suspect this is easier than you think, but I think you can leave this part for this PR. If this problem comes up too much in practice then we can make it more predictable -- possibly alongside making the ref optimisation similarly predictable.

As a side-note, one way to think about what worries me is to think in terms of code review. When reviewing changes to code whose performance is important you want to have a reasonable idea of what changes might affect that performance. In the general case usage-based analyses make this much harder. In this specific case I think the usages are local enough to make this not too bad. However, I would still rather have a clear specification that I can tell users so that they know exactly what they are looking for during review.

Would it be enough to address parts of your concern?

This at least allows you to prevent the problem from happening more than once to the same code, so it's definitely worth having.

I think it should be a separate annotation from [@inline] because the transformation is of a fundamentally different nature. Inlining means that each usage is specialised to its arguments; that is not necessarily the case here. Also, with flambda 2 continuations can also be inlined, so we'll probably want to support annotating a function with both the annotation for this optimisation and an annotation describing how the resulting continuation should be inlined.

As for the particular choice, I'm not sure that static is very clear. What about something like local? That seems to reasonably accurately describe the condition that must hold for the optimisation to take place. I believe that this kind of optimisation is traditionally called "contification", but that's probably not the best name for the attribute, although perhaps continuation would work.

The final issue I have with this PR is how it affects back-traces and other debug information. I think that you should do to the body of the continuation something similar to what is done to the bodies of inlined functions. This means going to each Debuginfo.t inside the body and appending information about the function call that is now missing from the call stack. Since there may be multiple calls, you may need to use the location of the following continuation rather than the locations of the calls -- in which case you may want to mark the frame as something like "(continuation)" rather than "(inlined)". Obviously, for calls that are in proper tail position you can skip doing this, since their calls wouldn't be in the backtrace anyway.

@alainfrisch (Contributor, Author) commented Nov 13, 2018

When reviewing changes to code whose performance is important you want to have a reasonable idea of what changes might affect that performance. In the general case usage-based analyses make this much harder.

How does that work for flambda? Do you expect users to predict which calls will be inlined? Even if the heuristic is not usage-based, a tiny change to a function definition, or its call site, can change the inlining decision, and any change to a performance-sensitive piece of code could thus have bad consequences. I don't know if this applies to flambda, but with Closure, even removing code can disable some inlining (by first allowing more inlining, which then makes some function's body too large to be inlined itself).

I think it should be a separate annotation from [@inline] because the transformation is of a fundamentally different nature.

Ok.

About the name: what really happens is that one tries to replace all calls to the function by gotos/jumps to its body (this works if all call sites share the same continuation). So perhaps an attribute name like [@jump]?

The final issue I have with this PR is how it affects back-traces and other debug information.

Ok, thanks, I will look at that.

@alainfrisch (Contributor, Author) commented Nov 13, 2018

This means going to each Debuginfo.t inside the body

Debuginfo is introduced later and is not available at the lambda level. I don't think it's currently possible to keep track of the rewriting and maintain stack traces without rather large changes to the Lambda definition. Anyway, since the continuation is shared, the stack trace would not allow distinguishing between multiple call sites, and I'm not sure that it would bring much information to keep the location of the following continuation.

@lpw25 (Contributor) commented Nov 13, 2018

How does that work for flambda? Do you expect users to predict which calls will be inlined?

I expect people to know that changing a function they call in their hot loop might affect their application's performance. Currently, at that point they then need to re-benchmark things to check that it's all OK, but we have some prototype tooling that gives detailed output about the inlining decisions in the editor, which will help with this part too.

Debuginfo is introduced later and is not available at the lambda level.

Ok. It's probably worth waiting for @mshinwell's upcoming changes before looking at this properly. Even ignoring the backtrace we're still going to need to do some processing here to make the scopes of variables in DWARF correct -- but obviously that requires us to actually have the code that handles scopes merged.

I'm not sure that it would bring much information to keep the location of the following continuation

It's not so much about the additional information as about not confusing people when they look at the backtrace. Missing backtrace frames do surprise people and can mislead beginners.

@gasche (Member) commented Nov 18, 2018

I would like this PR to converge rather than be closed because @alainfrisch wants to cut his losses in terms of time invested. The current implementation passed my review and I believe it is reasonable. I'm proposing to push this through sooner rather than later; my goal is to have something merged that seems reasonable enough to everyone.

Here are the main points of discussion. We need to either reach a consensus or make a decision.

  • Flambda. I decide that it is ok to work independently/upstream of flambda for this one, provided the change doesn't break anything there. The failing test (see @lthls's excellent analysis above) definitely needs to be fixed or disabled, though. I consider this point solved.

  • Predictability. As far as I understand, everyone agrees that "all usages are tail calls from the function-declaration point" is robust enough, but the automatic lifting to the smallest scope makes people nervous (the smallest scope is invisible, and adding a call outside it can suddenly degrade performance). @lpw25 has proposals for improvement but agrees that they could be done in a later PR (by someone else). We have yet to converge on this point. It seems to me that the reasonable short-term choices are (1) accept the current status, (2) disable the lifting optimization or (3) ask that the lifting be performed only when the attribute is used (so the intent is explicit), like [@@unboxed] is done on-demand-only today.

    (Technically we can distinguish the lifting from the staticalization/lowering. Option (3) above proposes to do the lowering by default (when it is possible), but only lift when the attribute expresses intent.)

  • Debugging information. I understand that we don't have the infrastructure to do things correctly here. I think it would be reasonable to keep the current status, because the optimization is very local. (Having this compromise may influence people's opinion on the predictability decision above.) I consider this point solved.

  • User interface / choice of attribute. Both @lpw25 and myself consider that piggy-backing on @inline is not the approach we want. @alainfrisch, would you be willing to try with a separate attribute? I think that this is not one of the aspects that we should leave to a further PR, because it shows up in the user interface and I would rather avoid interface-breaking changes in another release.

@lthls (Contributor) commented Nov 23, 2018

To clarify a bit my point about the difference between closure and flambda:
Closure does not, at the moment, remove closure allocations when inlining all the uses of a function.
So inlining functions before closure allocation (i.e. in lambda) improves performance in two ways compared to inlining in closure: it removes allocations, and it opens more inlining possibilities than just bigger -inline parameters and [@@inline] annotations, since functions containing a closure allocation cannot themselves be inlined.
Flambda can remove unused closure allocations, so there wouldn't be any significant difference between inlining in lambda and adding [@@inline always] attributes to let flambda do the inlining.

So there's no doubt that this PR is beneficial for closure (and the bytecode backends), but we'll need the benchmarks to know whether it should be enabled or disabled for flambda.

@kayceesrk (Contributor) commented Nov 23, 2018

Benchmarking results are now available: http://ocamllabs.io/multicore/

@lthls (Contributor) commented Nov 23, 2018

Thanks! I made a pull request to the bench repo to add flambda-enabled variants, to see if the picture looks different with flambda.

alainfrisch added some commits Nov 9, 2018

Optimize some local functions at the lambda level
A new optimization pass is added in Simplif to avoid the overhead of
allocating and calling local closures.  Two cases are supported:

 1.  Local function used only once:

     let f x y = ... in
     ...
     .... f e1 e2 ...

     In this case, the function body is inlined on the call site.

  2. Local function only used in tail position:

     let f x y = ... in
     match ... with
     | ... -> .... in f e1 e2
     | ... -> if .... then f e3 e4 else ...
     | ... -> ...

     In this case, the function is turned into a "static exception":

     staticcatch
       match ... with
       | ... -> .... in staticraise E(e1, e2)
       | ... -> if .... then staticraise E(e1, e2) else ...
       | ... -> ...
     with E(x, y) ->
       ...
Guarantee pseudi linear-time complexity
Only the "tail position" optimization is kept.
Support for [@inline static]
A warning is raised when a function is marked with [@inline static]
but the optimization does not apply to it.  (The warning is
automatically disabled when the optimization pass itself is disabled,
namely in bytecode + debug mode.)

Additionally, [@inline never] disables the new optimization.

@alainfrisch force-pushed the alainfrisch:optim_local_tailcalls branch from d6bec1b to 98512cd on Nov 26, 2018

@alainfrisch (Contributor, Author) commented Nov 26, 2018

The benchmark is not very conclusive, so what shall we do? Merge in the current state, or disable it when flambda is enabled?

@lthls (Contributor) commented Nov 27, 2018

Given the benchmarks, I'm ok with merging as is.
The benchmarks have shown some differences between flambda and closure that need investigating, but they don't look related to this PR, so we'll take care of it on our side.
We'll also think about allowing flambda to inline static catches, but it doesn't need to block this PR.

@alainfrisch alainfrisch merged commit 9b27a9c into ocaml:trunk Nov 27, 2018

2 checks passed:
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@alainfrisch (Contributor, Author) commented Nov 27, 2018

Thanks!

@xclerc (Contributor) commented Nov 27, 2018

Shouldn't the new attribute be documented in the manual?

@alainfrisch (Contributor, Author) commented Nov 27, 2018

Ah right sorry, I will submit a further PR.

@alainfrisch (Contributor, Author) commented Nov 27, 2018

-> #2169
