cmd/compile: improve inlining cost model #17566

Open
josharian opened this Issue Oct 24, 2016 · 19 comments

Contributor

josharian commented Oct 24, 2016

The current inlining cost model is simplistic. Every gc.Node in a function has a cost of one. However, the actual impact of each node varies widely. Some nodes (OKEY) are placeholders that never generate any code. Others (OAPPEND) generate lots of code.

In addition to leading to bad inlining decisions, this design means that any refactoring that changes the AST structure can have unexpected and significant impact on compiled code. See CL 31674 for an example.

Inlining occurs near the beginning of compilation, which makes good predictions hard. For example, new or make or & might allocate (large runtime call, much code generated) or not (near zero code generated). As another example, code guarded by if false still gets counted. As another example, we don't know whether bounds checks (which generate lots of code) will be eliminated or not.
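The "guarded by if false" case can be seen in a small sketch (the debug constant and add function below are made up for illustration): after constant folding the guarded body is dead code, but an early-phase inliner still counts every node in it against the budget.

```go
package main

import "fmt"

const debug = false

// add carries a debug-only block. The Printf call never runs and
// generates no code in the final binary, yet a cost model that walks
// the AST before dead-code elimination still charges for it.
func add(a, b int) int {
	if debug {
		fmt.Printf("add(%d, %d)\n", a, b)
	}
	return a + b
}

func main() {
	fmt.Println(add(2, 3))
}
```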

One approach is to hand-write a better cost model: append is very expensive, things that might end up in a runtime call are moderately expensive, pure structure and variable/const declarations are cheap or free.
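A hand-written model of this shape might look like the following sketch. The Op type, the op names, and the specific costs are illustrative stand-ins, not the compiler's real tables; the point is only the structure: a per-op switch replacing the flat "every node costs 1" rule.

```go
package main

import "fmt"

// Op is a stand-in for cmd/compile's internal node op codes.
type Op int

const (
	OKEY      Op = iota // struct/map literal key: no code generated
	ODCL                // variable declaration: usually free
	OADD                // arithmetic: roughly one instruction
	OCALLFUNC           // function call: moderately expensive
	OAPPEND             // append: may call into the runtime, lots of code
)

// cost returns a hand-tuned per-node cost.
func cost(op Op) int {
	switch op {
	case OKEY, ODCL:
		return 0
	case OADD:
		return 1
	case OCALLFUNC:
		return 3
	case OAPPEND:
		return 10
	default:
		return 1
	}
}

func main() {
	body := []Op{ODCL, OADD, OAPPEND, OKEY}
	total := 0
	for _, op := range body {
		total += cost(op)
	}
	// 11 under this model vs. 4 under the flat one-per-node model.
	fmt.Println(total)
}
```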

Another approach is to compile lots of code and generate a simple machine-built model (e.g. linear regression) from it.

I have tried both of these approaches, and believe both of them to be improvements, but did not mail either of them, for two reasons:

  • Inlining significantly impacts binary size, runtime performance, and compilation speed. Minor changes to the cost model or to the budget can have big impacts on all three. I hoped to find a model and budget that was clearly Pareto optimal, but I did not. In order to make forward progress, we need to decide what metric(s) we want to optimize for, and which code to measure those metrics on. This is to my mind the single biggest blocker for improving inlining.
  • Changing inlining decisions impacts the runtime as well as other code, and minor inlining changes to the runtime can have outsized performance impacts. I see several possible ways to address this. (1) We could add a //go:inline annotation for use in the runtime only, to allow runtime authors to force the compiler to inline performance-critical functions. If a non-inlinable function was marked //go:inline, compilation would fail. (2) We could add a //go:mustinline annotation for use in the runtime only (see CL 22785), to allow runtime authors to protect currently-inlined functions against becoming non-inlined. (3) We could tune runtime inlining (cost model, budget, etc.) independently.

Three other related ideas:

  • We might want to take into account the number and size of parameters and return values of a function when deciding whether to inline it, since those determine the cost in binary size of setting up the function call.
  • We might want to have separate budgets for expressions and control flow, since branches end up being more expensive on most metrics.
  • We could treat intra-package inlining differently from inter-package inlining. When inlining across packages, we don't actually need to decide early on whether to allow inlining, since the actual inlining will occur in an entirely different compiler invocation. We could thus wait until function compilation is complete, when we know exactly how large the fully optimized code is, and then decide whether other packages should inline that function. Downsides to this otherwise appealing idea are: (1) unexpected performance impact of moving a function from one package to another, and (2) it introduces a significant dependency between full compilation and writing export data, which e.g. would prevent #15734.

cc @dr2chase @randall77 @ianlancetaylor @mdempsky

@josharian josharian added this to the Go1.9 milestone Oct 24, 2016

Contributor

ianlancetaylor commented Oct 24, 2016

I'm going to toss in a few more ideas to consider.

An unexported function that is only called once is often cheaper to inline.

Functions that include tests of parameter values can be cheaper to inline for specific calls that pass constant arguments for those parameters. That is, the cost of inlining is not solely determined by the function itself, it is also determined by the nature of the call.
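As a sketch of that point (align is a hypothetical helper, not from the source): when a parameter is constant at the call site, an inliner that propagates constants can collapse the inlined body to almost nothing.

```go
package main

import "fmt"

// align rounds n up to a multiple of pow2. In the general case this is
// an add, a subtract, and a mask; when pow2 is a constant at the call
// site, the inlined body folds down to roughly (n + 7) &^ 7.
func align(n, pow2 int) int {
	return (n + pow2 - 1) &^ (pow2 - 1)
}

func main() {
	// pow2 == 8 is known at compile time here, so inlining this call
	// is much cheaper than the function body alone suggests.
	fmt.Println(align(13, 8))
}
```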

Functions that only make function calls in error cases, which is fairly common, can be cheaper to handle as a mix of inlining and outlining: you inline the main control flow but leave the error handling in a separate function. This may be particularly worth considering when inlining across packages, as the export data only needs to include the main control flow. (Error cases are detectable as the control flow blocks that return a non-nil value for a single result parameter of type error.)
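The shape being described might look like this sketch (get and missing are hypothetical names): the common path returns inline, and only the rare failure path calls an outlined helper, so the helper's size needn't count against inlining the wrapper.

```go
package main

import (
	"errors"
	"fmt"
)

// get is the small, inlinable main control flow: the common case
// returns directly, and only the failure case calls out.
func get(m map[string]int, k string) (int, error) {
	if v, ok := m[k]; ok {
		return v, nil
	}
	return 0, missing(k)
}

// missing is the outlined error path; it never runs in the common
// case, and an inliner could exclude it from get's cost.
func missing(k string) error {
	return errors.New("missing key: " + k)
}

func main() {
	m := map[string]int{"a": 1}
	v, err := get(m, "a")
	fmt.Println(v, err)
	_, err = get(m, "b")
	fmt.Println(err)
}
```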

One of the most important optimizations for large programs is feedback directed optimization aka profiled guided optimization. One of the most important lessons to learn from feedback/profiling is which functions are worth inlining, both on a per-call basis and on a "most calls pass X as argument N" basis. Therefore, while we have no FDO/PGO framework at present, any work on inlining should consider how to incorporate information gleaned from such a framework when it exists.

Pareto optimal is a nice goal but I suspect it is somewhat unrealistic. It's almost always possible to find a horrible decision made by any specific algorithm, but the algorithm can still be better on realistic benchmarks.

Contributor

rasky commented Oct 24, 2016

Functions that include tests of parameter values can be cheaper to inline for specific calls that pass constant arguments for those parameters. That is, the cost of inlining is not solely determined by the function itself, it is also determined by the nature of the call.

A common case where this would apply is when calling marshallers/unmarshallers that use reflect to inspect interface{} parameters. Many of them tend to have a "fast path" in case reflect.Type is some basic type or has some basic property. Inlining that fast path would make it even faster. E.g. see binary.Read and consider how much of its code could be pruned when the value bound to the interface{} argument is known at compile time.
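The pattern can be sketched like this (readUint32 is a made-up wrapper mirroring the shape of binary.Read's fast path, not real library code): a type switch picks a cheap concrete-type path and falls back to reflect-based decoding. If the call were inlined with the argument's concrete type known at compile time, only the matching case would survive.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// readUint32 decodes big-endian bytes into data. The *uint32 case is
// the fast path; everything else falls back to the reflect-based
// slow path inside binary.Read.
func readUint32(buf []byte, data interface{}) error {
	switch d := data.(type) {
	case *uint32:
		*d = binary.BigEndian.Uint32(buf)
		return nil
	default:
		return binary.Read(bytes.NewReader(buf), binary.BigEndian, data)
	}
}

func main() {
	var v uint32
	if err := readUint32([]byte{0, 0, 1, 0}, &v); err != nil {
		panic(err)
	}
	fmt.Println(v)
}
```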

Member

thanm commented Oct 24, 2016

Along the lines of what iant@ said, it's common for C++ compilers to take into account whether a callsite appears in a loop (and thus might be "hotter"). This can help for toolchains that don't support FDO/PGO or for applications in which FDO/PGO are not being used.

Member

minux commented Oct 25, 2016

Contributor

dr2chase commented Oct 25, 2016

Couldn't we obtain a minor improvement in the cost model by measuring the size of generated assembly language? It would require preserving a copy of the tree till after compilation, and doing compilation bottom-up (same way as inlining is scheduled) but that would give you a more accurate measure. There's a moderate chance of being able to determine goodness of constant parameters at the SSA-level, too.

Note that this would require rearranging all of these transformations (inlining, escape analysis, closure conversion, compilation) to run them function/recursive-function-nest at-a-time, so that the results from compiling bottom-most functions all the way to assembly language would be available to inform inlining at the next level up.

Contributor

josharian commented Oct 25, 2016

doing compilation bottom-up

I have also considered this. There'd be a lot of high risk work rearranging the rest of the compiler to work this way. It could also hurt our chances to get a big boost out of concurrent compilation; you want to start on the biggest, slowest functions ASAP, but those are the most likely to depend on many other functions.

Contributor

dr2chase commented Oct 25, 2016

It doesn't look that high risk to me; it's just another iteration order. SSA also gives us a slightly more tractable place to compute things like "constant parameter values that shrink code size", even if it is only so crude as looking for blocks directly conditional on comparisons with parameter values.

Contributor

josharian commented Oct 25, 2016

I think we could test the inlining benefits of the bottom-up compilation pretty easily. One way is to do it just for inter-package compilation (as suggested above); another is to hack cmd/compile to dump the function asm size somewhere and then hack cmd/go to compile all packages twice, using the dumped sizes for the second round.

Contributor

CAFxX commented Nov 22, 2016

An unexported function that is only called once is often cheaper to inline.

Out of curiosity, why "often"? Off the top of my head, I can't think of a case in which the contrary is true.

Also, just to understand: in -buildmode=exe, basically every single function besides the entry function is going to be considered unexported?

Contributor

ianlancetaylor commented Nov 22, 2016

An unexported function that is only called once is often cheaper to inline.

Out of curiosity, why "often"? Off the top of my head, I can't think of a case in which the contrary is true.

It is not true when the code looks like

func internal() {
    // Large complex function with loops.
}

func External() {
    if veryRareCase {
        internal()
    }
}

Because in the normal case where you don't need to call internal, you don't have to set up a stack frame.

Also, just to understand: in -buildmode=exe, basically every single function besides the entry function is going to be considered unexported?

In package main, yes.

Contributor

CAFxX commented Nov 22, 2016

Because in the normal case where you don't need to call internal, you don't have to set up a stack frame.

Oh I see, makes sense. It would be nice (also in other cases) if setting up the stack frame could be sunk into the if, but it likely wouldn't be worth the extra effort.

Also, just to understand: in -buildmode=exe, basically every single function besides the entry function is going to be considered unexported?

In package main, yes.

The tyranny of unit-at-a-time :D

RalphCorderoy commented

Functions that start with a run of if param1 == nil { return nil } tests, where the tests and return values are simple parameters or constants, would avoid the call overhead if just this part were inlined. The size of setting up the call/return could be weighed against the size of the simple tests.
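That shape can be sketched as follows (lookup and lookupSlow are hypothetical names): a run of cheap parameter guards up front, with the rest outlined. Inlining only the guards would let most call sites skip the call entirely.

```go
package main

import "fmt"

// lookup starts with simple guard tests on its parameters. If an
// inliner copied only these guards to the call site, the common
// early-return cases would pay no call overhead at all.
func lookup(m map[string]string, key string) string {
	if m == nil {
		return ""
	}
	if key == "" {
		return ""
	}
	return lookupSlow(m, key)
}

// lookupSlow is the outlined remainder, called only when the
// guards pass.
func lookupSlow(m map[string]string, key string) string {
	return m[key]
}

func main() {
	fmt.Println(lookup(nil, "x"))
	fmt.Println(lookup(map[string]string{"x": "y"}, "x"))
}
```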

Member

mvdan commented Jan 18, 2017

@RalphCorderoy I've been thinking about the same kind of function body "chunking" for early returns. Especially interesting for quick paths, where the slow path is too big to inline.

Unless the compiler chunks, it's up to the developer to split the function in two I presume.

RalphCorderoy commented

Hi @mvdan, split the function in two with the intention that the compiler then inlines the non-leaf first one?
Another possibility when stripping out some of the simple quick paths: the remaining slow, non-inlined function may no longer need some of the parameters.

Member

mvdan commented Jan 19, 2017

Yes, for example, here tryFastPath is likely small enough to be inlined (if forward thunk funcs were inlineable, see #8421):

func tryFastPath() {
    if someCond {
        // fast path
        return ...
    }
    return slowPath()
}
Contributor

josharian commented May 18, 2017

Too late for 1.9.

@josharian josharian modified the milestones: Go1.10, Go1.9 May 18, 2017

Change https://golang.org/cl/57410 mentions this issue: runtime: add TestIntendedInlining

gopherbot pushed a commit that referenced this issue Aug 22, 2017

runtime: add TestIntendedInlining
The intent is to allow more aggressive refactoring
in the runtime without silent performance changes.

The test would be useful for many functions.
I've seeded it with the runtime functions tophash and add;
it will grow organically (or wither!) from here.

Updates #21536 and #17566

Change-Id: Ib26d9cfd395e7a8844150224da0856add7bedc42
Reviewed-on: https://go-review.googlesource.com/57410
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Contributor

TocarIP commented Sep 25, 2017

Another example of the current inlining heuristic punishing more readable code:

if !foo {
	return
}
if !bar {
	return
}
if !baz {
	return
}
// do stuff

is more "expensive" than

if foo && bar && baz {
	// do stuff
}
return

This is based on real code from the regexp package (see https://go-review.googlesource.com/c/go/+/65491 for details).

@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017
