cmd/compile: improve inlining cost model #17566
Comments
josharian
added
Performance
ToolSpeed
labels
Oct 24, 2016
josharian
added this to the Go1.9 milestone
Oct 24, 2016
josharian
referenced this issue
Oct 24, 2016
Closed
cmd/go: add test to ensure upx can compress our binaries #16706
ianlancetaylor
Oct 24, 2016
Contributor
I'm going to toss in a few more ideas to consider.
An unexported function that is only called once is often cheaper to inline.
Functions that include tests of parameter values can be cheaper to inline for specific calls that pass constant arguments for those parameters. That is, the cost of inlining is not solely determined by the function itself, it is also determined by the nature of the call.
Functions that only make function calls in error cases, which is fairly common, can be cheaper to handle as a mix of inlining and outlining: you inline the main control flow but leave the error handling in a separate function. This may be particularly worth considering when inlining across packages, as the export data only needs to include the main control flow. (Error cases are detectable as the control flow blocks that return a non-nil value for a single result parameter of type error.)
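The mix of inlining and outlining described above can already be sketched by hand today. This is a toy illustration (all names hypothetical), splitting a function so that the main control flow stays under the inlining budget while the error construction lives in a separate, non-inlined function:

```go
package main

import (
	"errors"
	"fmt"
)

// parseDigit is the hot path: small enough for the compiler to inline.
// The error construction is outlined into parseDigitErr, so call sites
// that never hit the error case pay no call overhead at all.
func parseDigit(b byte) (int, error) {
	if b >= '0' && b <= '9' { // main control flow
		return int(b - '0'), nil
	}
	return parseDigitErr(b) // rare case: kept out of line
}

//go:noinline
func parseDigitErr(b byte) (int, error) {
	return 0, errors.New("not a digit: " + string(b))
}

func main() {
	n, err := parseDigit('7')
	fmt.Println(n, err)
}
```

An automatic version of this transformation would do the split without the programmer having to write two functions.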
One of the most important optimizations for large programs is feedback-directed optimization, aka profile-guided optimization. One of the most important lessons to learn from feedback/profiling is which functions are worth inlining, both on a per-call basis and on a "most calls pass X as argument N" basis. Therefore, while we have no FDO/PGO framework at present, any work on inlining should consider how to incorporate information gleaned from such a framework when it exists.
Pareto optimal is a nice goal but I suspect it is somewhat unrealistic. It's almost always possible to find a horrible decision made by any specific algorithm, but the algorithm can still be better on realistic benchmarks.
rasky
Oct 24, 2016
Member
Functions that include tests of parameter values can be cheaper to inline for specific calls that pass constant arguments for those parameters. That is, the cost of inlining is not solely determined by the function itself, it is also determined by the nature of the call.
A common case where this would apply is when calling marshallers/unmarshallers that use reflect to inspect interface{} parameters. Many of them tend to have a "fast path" in case reflect.Type is some basic type or has some basic property. Inlining that fast path would make it even faster. E.g., see binary.Read and think about how much of its code could be stripped away when the value bound to the interface{} argument is known at compile time.
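A minimal sketch of the fast-path pattern described above (this is not the real binary.Read, just a toy function with the same shape): a type switch handles a known concrete type without reflection, and everything else falls through to a reflect-based slow path. If the type bound to the interface{} argument is known at the call site and the wrapper is inlined, the compiler could keep only the matching branch:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"reflect"
)

// readUint32 mimics the shape of a reflect-using decoder: a fast path
// for one known concrete type, and a reflective slow path otherwise.
func readUint32(buf []byte, data interface{}) error {
	switch v := data.(type) {
	case *uint32: // fast path: no reflection at all
		*v = binary.LittleEndian.Uint32(buf)
		return nil
	}
	// Slow path: stands in for the general reflect-based decoder.
	return fmt.Errorf("unsupported type %s", reflect.TypeOf(data))
}

func main() {
	var x uint32
	_ = readUint32([]byte{1, 0, 0, 0}, &x)
	fmt.Println(x)
}
```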
thanm
Oct 24, 2016
Member
Along the lines of what iant@ said, it's common for C++ compilers to take into account whether a call site appears in a loop (and thus might be "hotter"). This can help for toolchains that don't support FDO/PGO, or for applications in which FDO/PGO is not being used.
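The loop-hotness heuristic can be pictured with a minimal example (hypothetical names). A loop-aware cost model would be more willing to spend inlining budget on the call inside the loop than on an identical call that executes once:

```go
package main

import "fmt"

// square is a cheap leaf function, trivially inlinable.
func square(x int) int { return x * x }

func sumSquares(xs []int) int {
	total := 0
	for _, x := range xs {
		// Call site inside a loop: statically likely to be hot, so a
		// loop-aware heuristic would favor inlining square here even
		// without profile data.
		total += square(x)
	}
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3})) // 14
}
```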
minux
Oct 25, 2016
Member
No pragmas that mandate inlining, please.
I already expressed dislike for //go:noinline, and I will firmly object to any proposal for //go:mustinline or something like that, even if it's limited to the runtime.
If we can't find a good heuristic for the runtime package, I don't think it will handle real-world cases well.
Also, we need to somehow fix the traceback for inlined non-leaf functions first.
Another idea for the inlining decision is how much simpler the function body could be if inlined. Especially for reflect-using functions that have fast paths: if the input type matches the fast path, even though the function might be very complicated, the inlined version might be really simple.
dr2chase
Oct 25, 2016
Contributor
Couldn't we obtain a minor improvement in the cost model by measuring the size of generated assembly language? It would require preserving a copy of the tree till after compilation, and doing compilation bottom-up (same way as inlining is scheduled) but that would give you a more accurate measure. There's a moderate chance of being able to determine goodness of constant parameters at the SSA-level, too.
Note that this would require rearranging all of these transformations (inlining, escape analysis, closure conversion, compilation) to run them function/recursive-function-nest at-a-time, so that the results from compiling bottom-most functions all the way to assembly language would be available to inform inlining at the next level up.
josharian
Oct 25, 2016
Contributor
doing compilation bottom-up
I have also considered this. There'd be a lot of high-risk work in rearranging the rest of the compiler to work this way. It could also hurt our chances of getting a big boost out of concurrent compilation; you want to start on the biggest, slowest functions ASAP, but those are the most likely to depend on many other functions.
dr2chase
Oct 25, 2016
Contributor
It doesn't look that high risk to me; it's just another iteration order. SSA also gives us a slightly more tractable place to compute things like "constant parameter values that shrink code size", even if it is only so crude as looking for blocks directly conditional on comparisons with parameter values.
josharian
Oct 25, 2016
Contributor
I think we could test the inlining benefits of the bottom-up compilation pretty easily. One way is to do it just for inter-package compilation (as suggested above); another is to hack cmd/compile to dump the function asm size somewhere and then hack cmd/go to compile all packages twice, using the dumped sizes for the second round.
josharian
referenced this issue
Oct 31, 2016
Open
cmd/compile: slightly unclear heap escape message (with -m) #16300
CAFxX
Nov 22, 2016
Contributor
An unexported function that is only called once is often cheaper to inline.
Out of curiosity, why "often"? I can't think, off the top of my head, of a case in which the contrary is true.
Also, just to understand: with -buildmode=exe, is basically every single function besides the entry function going to be considered unexported?
ianlancetaylor
Nov 22, 2016
Contributor
An unexported function that is only called once is often cheaper to inline.
Out of curiosity, why "often"? I can't think off the top of my head a case in which the contrary is true.
It is not true when the code looks like
func internal() {
	// Large complex function with loops.
}

func External() {
	if veryRareCase {
		internal()
	}
}
Because in the normal case, where you don't need to call internal, you don't have to set up a stack frame.
Also, just to understand, in -buildmode=exe basically every single function beside the entry function is going to be considered unexported?
In package main, yes.
CAFxX
Nov 22, 2016
Contributor
Because in the normal case, where you don't need to call internal, you don't have to set up a stack frame.
Oh I see, makes sense. It would be nice (also in other cases) if setting up the stack frame could be sunk into the if, but it likely wouldn't be worth the extra effort.
Also, just to understand, in -buildmode=exe basically every single function beside the entry function is going to be considered unexported?
In package main, yes.
The tyranny of unit-at-a-time :D
RalphCorderoy
Jan 18, 2017
Functions that start with a run of if param1 == nil { return nil }, where the tests and return values are simple parameters or constants, would avoid the call overhead if just this part were inlined. The size of setting up the call/return could be weighed against the size of the simple tests.
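The guard-run shape described above can be made concrete with a small sketch (hypothetical names). Inlining only the leading parameter tests would let the common nil/empty cases skip the call entirely, leaving just the real body to pay call overhead:

```go
package main

import "fmt"

// lookup starts with a run of simple guard tests on its parameters.
// Inlining just these guards would resolve the common trivial cases
// at the call site; only the map access below needs an actual call.
func lookup(m map[string]int, key string) (int, bool) {
	if m == nil { // simple test on a parameter: cheap to inline
		return 0, false
	}
	if key == "" { // likewise
		return 0, false
	}
	v, ok := m[key] // the "real" body, left out of line
	return v, ok
}

func main() {
	fmt.Println(lookup(nil, "a"))
}
```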
mvdan
Jan 18, 2017
Member
@RalphCorderoy I've been thinking about the same kind of function body "chunking" for early returns. Especially interesting for quick paths, where the slow path is too big to inline.
Unless the compiler chunks, it's up to the developer to split the function in two I presume.
RalphCorderoy
Jan 19, 2017
Hi @mvdan. Split the function in two with the intention that the compiler then inlines the non-leaf first one?
Another possibility, once some of the simple quick paths are stripped out, is that the remaining slow non-inlined function may no longer need some of the parameters.
mvdan
Jan 19, 2017
Member
Yes, for example, here tryFastPath is likely small enough to be inlined (if forward thunk funcs were inlineable, see #8421):
func tryFastPath() {
	if someCond {
		// fast path
		return ...
	}
	return slowPath()
}
Too late for 1.9.
josharian
modified the milestones:
Go1.10,
Go1.9
May 18, 2017
OneOfOne
referenced this issue
Aug 19, 2017
Closed
Proposal: cmd/compile: add a go:inline directive #21536
gopherbot
Aug 20, 2017
Change https://golang.org/cl/57410 mentions this issue: runtime: add TestIntendedInlining
pushed a commit
that referenced
this issue
Aug 22, 2017
mvdan
referenced this issue
Sep 12, 2017
Closed
cmd/compile: expand TestIntendedInlining to more packages and funcs #21851
TocarIP
Sep 25, 2017
Contributor
Another example of the current inlining heuristic punishing more readable code:

if !foo {
	return
}
if !bar {
	return
}
if !baz {
	return
}
// do stuff

is more "expensive" than:

if foo && bar && baz {
	// do stuff
}
return

This is based on real code from the regexp package (see https://go-review.googlesource.com/c/go/+/65491 for details).
bradfitz
modified the milestones:
Go1.10,
Go1.11
Nov 28, 2017
Quasilyte
referenced this issue
Jan 23, 2018
Closed
cmd/compile: improve const expr if/else DCE #23521
TocarIP
Feb 9, 2018
Contributor
Another issue with the inliner is that a function containing an inlined call to some other function has a higher cost than if that call had been manually inlined. Consider the following example:
package main
var a int
var b int
func foo() { a = b }
func f00() { foo() }
func f01() { f00() }
func f02() { f01() }
func f03() { f02() }
func f04() { f03() }
func f05() { f04() }
func f06() { f05() }
func f07() { f06() }
func f08() { f07() }
func f09() { f08() }
func f10() { f09() }
func f11() { f10() }
func f12() { f11() }
func f13() { f12() }
func f14() { f13() }
func f15() { f14() }
func f16() { f15() }
func f17() { f16() }
func f18() { f17() }
func f19() { f18() }
func f20() { f19() }
func f21() { f20() }
func f22() { f21() }
func f23() { f22() }
func f24() { f23() }
func f25() { f24() }
func f26() { f25() }
func f27() { f26() }
func f28() { f27() }
func f29() { f28() }
func f30() { f29() }
func f31() { f30() }
func f32() { f31() }
func f33() { f32() }
func f34() { f33() }
func f35() { f34() }
func f36() { f35() }
func f37() { f36() }
func f38() { f37() }
func f39() { f38() }
func main() {
f39()
}
It should be optimized into:
package main
var a int
var b int
func main() {
a = b
}
However, after each inlining the cost of the new function is increased by 2, causing (with -m -m):
./foo.go:47:6: cannot inline f38: function too complex: cost 81 exceeds budget 80
This is caused by the inliner adding a new goto to/from the inlined body, plus new assignments for each argument/return value. For the simple case (single return, no assignments to arguments) we should probably adjust the cost of inlining outer functions by something like 2 + number_of_arguments + number_of_return_values. I've tried simply raising the limit by 20, and e.g. image/png's paeth gets significantly faster:
Paeth-6 6.83ns ± 0% 4.39ns ± 0% -35.73% (p=0.000 n=9+9)
Because now paeth with inlined abs is itself inlinable.
aclements
Apr 28, 2018
Member
Another inlining heuristic idea: if a small function is called with a func literal argument, inline the small function, and "inline" (or is it a constant fold?) the call to the now-constant func argument. That would enable zero-cost control flow abstraction. This may depend on how many times the argument is called: if it's just once then there's no reason not to, but if it's more than once then it may depend on the complexity of the argument.
For example, in CL 109716 it would have been nice to abstract out the high-performance bitmap iteration pattern, but there's no way to do that right now without adding a fair amount of overhead. With this heuristic, it would have been possible to lift that pattern into a function at no cost, where the loop body was supplied as a func literal argument.
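The "zero-cost control flow abstraction" idea can be sketched as follows (hypothetical names; this is the kind of code the heuristic would make free, not an existing compiler feature). The iteration pattern is abstracted behind a func-valued parameter; inlining forEachSetBit at a call site with a func literal argument would let the closure call be folded away:

```go
package main

import "fmt"

// forEachSetBit hides a bit-iteration pattern behind a func parameter.
// Under the proposed heuristic, a call with a func literal argument
// could be inlined and the now-constant closure call folded away,
// making this abstraction as cheap as the hand-written loop.
func forEachSetBit(bits uint, f func(i int)) {
	for i := 0; bits != 0; i++ {
		if bits&1 != 0 {
			f(i)
		}
		bits >>= 1
	}
}

func main() {
	sum := 0
	forEachSetBit(0b1011, func(i int) { sum += i }) // bits 0, 1, 3
	fmt.Println(sum) // 4
}
```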
bradfitz
modified the milestones:
Go1.11,
Unplanned
May 18, 2018
josharian
referenced this issue
Jun 21, 2018
Open
cmd/compile: record and use per-function optimization data #25999
gopherbot
Jul 23, 2018
Change https://golang.org/cl/125516 mentions this issue: cmd/compile: set stricter inlining threshold in large functions
gopherbot
commented
Jul 23, 2018
pushed a commit
that referenced
this issue
Jul 24, 2018
josharian
Aug 14, 2018
Contributor
In the commit message of CL 125516, @randall77 commented:
At some point it might be nice to have a better heuristic for "inlined body is smaller than the call", a non-cliff way to scale down the cost as the function gets bigger, doing cheaper inlined calls first, etc.
(Copied here so that it is easier to find by folks interested in improving inlining heuristics.)
josharian commented Oct 24, 2016 (edited)
The current inlining cost model is simplistic. Every gc.Node in a function has a cost of one. However, the actual impact of each node varies. Some nodes (OKEY) are placeholders that never generate any code. Some nodes (OAPPEND) generate lots of code.
In addition to leading to bad inlining decisions, this design means that any refactoring that changes the AST structure can have unexpected and significant impact on compiled code. See CL 31674 for an example.
Inlining occurs near the beginning of compilation, which makes good predictions hard. For example, new or make or & might allocate (large runtime call, much code generated) or not (near zero code generated). As another example, code guarded by if false still gets counted. As another example, we don't know whether bounds checks (which generate lots of code) will be eliminated or not.
One approach is to hand-write a better cost model: append is very expensive, things that might end up in a runtime call are moderately expensive, pure structure and variable/const declarations are cheap or free.
Another approach is to compile lots of code and generate a simple machine-built model (e.g. linear regression) from it.
I have tried both of these approaches, and believe both of them to be improvements, but did not mail either of them, for two reasons:
Three other related ideas:
cc @dr2chase @randall77 @ianlancetaylor @mdempsky