cmd/compile: improve inlining cost model #17566

Open
josharian opened this Issue Oct 24, 2016 · 35 comments

@josharian
Contributor

josharian commented Oct 24, 2016

The current inlining cost model is simplistic. Every gc.Node in a function has a cost of one. However, the actual impact of each node varies widely. Some nodes (OKEY) are placeholders that never generate any code; other nodes (OAPPEND) generate lots of code.

In addition to leading to bad inlining decisions, this design means that any refactoring that changes the AST structure can have unexpected and significant impact on compiled code. See CL 31674 for an example.

Inlining occurs near the beginning of compilation, which makes good predictions hard. For example, new or make or & might allocate (large runtime call, much code generated) or not (near zero code generated). As another example, code guarded by if false still gets counted. As another example, we don't know whether bounds checks (which generate lots of code) will be eliminated or not.
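A tiny illustration of the "code guarded by if false still gets counted" point (hypothetical functions, not from the compiler):

```go
package main

import "fmt"

const debug = false

// f1 and f2 are semantically identical and compile to the same code, but
// the early cost model also charges f2 for the statically dead debug
// branch, which can push it over the inlining budget.
func f1(x int) int { return x * 2 }

func f2(x int) int {
	if debug { // never executed, yet counted when inlining is decided
		fmt.Println("f2:", x)
	}
	return x * 2
}

func main() {
	fmt.Println(f1(21), f2(21))
}
```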

One approach is to hand-write a better cost model: append is very expensive, things that might end up in a runtime call are moderately expensive, pure structure and variable/const declarations are cheap or free.

Another approach is to compile lots of code and generate a simple machine-built model (e.g. linear regression) from it.

I have tried both of these approaches, and believe both of them to be improvements, but did not mail either of them, for two reasons:

  • Inlining significantly impacts binary size, runtime performance, and compilation speed. Minor changes to the cost model or to the budget can have big impacts on all three. I hoped to find a model and budget that was clearly Pareto optimal, but I did not. In order to make forward progress, we need to make a decision about what metric(s) we want to optimize for, and which code to measure those metrics on. This is to my mind the single biggest blocker for improving inlining.
  • Changing inlining decisions impacts the runtime as well as other code, and minor inlining changes to the runtime can have outsized performance impacts. I see several possible ways to address this. (1) We could add a //go:inline annotation for use in the runtime only, to allow runtime authors to force the compiler to inline performance-critical functions. If a non-inlinable function were marked //go:inline, compilation would fail. (2) We could add a //go:mustinline annotation for use in the runtime only (see CL 22785), to allow runtime authors to protect currently-inlined functions against becoming non-inlined. (3) We could tune runtime inlining (cost model, budget, etc.) independently.

Three other related ideas:

  • We might want to take into account the number and size of parameters and return values of a function when deciding whether to inline it, since those determine the cost in binary size of setting up the function call.
  • We might want to have separate budgets for expressions and control flow, since branches end up being more expensive on most metrics.
  • We could treat intra-package inlining differently from inter-package inlining. When inlining across packages, we don't actually need to decide early on whether to allow inlining, since the actual inlining will occur in an entirely different compiler invocation. We could thus wait until function compilation is complete, when we know exactly how large the fully optimized code is, and then decide whether other packages should inline that function. Downsides to this otherwise appealing idea are: (1) unexpected performance impact of moving a function from one package to another, and (2) it introduces a significant dependency between full compilation and writing export data, which e.g. would prevent #15734.

cc @dr2chase @randall77 @ianlancetaylor @mdempsky

@ianlancetaylor

Contributor

ianlancetaylor commented Oct 24, 2016

I'm going to toss in a few more ideas to consider.

An unexported function that is only called once is often cheaper to inline.

Functions that include tests of parameter values can be cheaper to inline for specific calls that pass constant arguments for those parameters. That is, the cost of inlining is not solely determined by the function itself, it is also determined by the nature of the call.
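For instance (a made-up helper, not from the standard library), a call site that passes a constant for align lets the parameter test fold away after inlining, so the effective cost at that call site is far below the function's nominal cost:

```go
package main

import "fmt"

// roundUp rounds n up to a multiple of align (align must be a power of two).
// When align is a compile-time constant at the call site, the align <= 1
// guard becomes decidable after inlining and can be folded away.
func roundUp(n, align int) int {
	if align <= 1 {
		return n
	}
	return (n + align - 1) &^ (align - 1)
}

func main() {
	fmt.Println(roundUp(13, 8)) // the constant 8 makes the guard dead code
}
```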

Functions that only make function calls in error cases, which is fairly common, can be cheaper to handle as a mix of inlining and outlining: you inline the main control flow but leave the error handling in a separate function. This may be particularly worth considering when inlining across packages, as the export data only needs to include the main control flow. (Error cases are detectable as the control flow blocks that return a non-nil value for a single result parameter of type error.)
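A hand-written version of that inline/outline split might look like this (illustrative names; today the programmer has to do the split manually):

```go
package main

import "fmt"

// parseDigit keeps the main control flow small and inlinable; the rare
// error construction is outlined into a separate function, so only the
// cheap path needs to be copied into callers.
func parseDigit(b byte) (int, error) {
	if b >= '0' && b <= '9' { // main control flow
		return int(b - '0'), nil
	}
	return 0, digitError(b) // error case stays behind a call
}

func digitError(b byte) error {
	return fmt.Errorf("not a digit: %q", b)
}

func main() {
	n, err := parseDigit('7')
	fmt.Println(n, err)
}
```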

One of the most important optimizations for large programs is feedback directed optimization aka profiled guided optimization. One of the most important lessons to learn from feedback/profiling is which functions are worth inlining, both on a per-call basis and on a "most calls pass X as argument N" basis. Therefore, while we have no FDO/PGO framework at present, any work on inlining should consider how to incorporate information gleaned from such a framework when it exists.

Pareto optimal is a nice goal but I suspect it is somewhat unrealistic. It's almost always possible to find a horrible decision made by any specific algorithm, but the algorithm can still be better on realistic benchmarks.

@rasky

Member

rasky commented Oct 24, 2016

Functions that include tests of parameter values can be cheaper to inline for specific calls that pass constant arguments for those parameters. That is, the cost of inlining is not solely determined by the function itself, it is also determined by the nature of the call.

A common case where this would apply is calling marshallers/unmarshallers that use reflect to inspect interface{} parameters. Many of them have a "fast path" in case reflect.Type is some basic type or has some basic property. Inlining that fast path would make it even faster. E.g., see binary.Read and consider how much of its code could be pruned away when the value bound to the interface{} argument is known at compile time.
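The shape being described, reduced to a sketch (this is not binary.Read itself): a type switch covers the common concrete types, and only unknown types fall through to reflection. If the fast path were inlined at a call site where the argument's dynamic type is statically known, the switch would collapse to a constant.

```go
package main

import (
	"fmt"
	"reflect"
)

// sizeOf has a fast path for common fixed-size types and a slow,
// reflection-based fallback for everything else.
func sizeOf(v interface{}) int {
	switch v.(type) { // fast path: statically enumerable basic types
	case int32, uint32, float32:
		return 4
	case int64, uint64, float64:
		return 8
	}
	return int(reflect.TypeOf(v).Size()) // slow path via reflect
}

func main() {
	fmt.Println(sizeOf(int32(0)), sizeOf(struct{ a, b int64 }{}))
}
```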

@thanm

Member

thanm commented Oct 24, 2016

Along the lines of what iant@ said, it's common for C++ compilers to take into account whether a call site appears in a loop (and thus might be "hotter"). This can help for toolchains that don't support FDO/PGO, or for applications in which FDO/PGO is not being used.

@dr2chase

Contributor

dr2chase commented Oct 25, 2016

Couldn't we obtain a minor improvement in the cost model by measuring the size of the generated assembly language? It would require preserving a copy of the tree until after compilation, and doing compilation bottom-up (the same way inlining is scheduled), but that would give a more accurate measure. There's a moderate chance of being able to determine the goodness of constant parameters at the SSA level, too.

Note that this would require rearranging all of these transformations (inlining, escape analysis, closure conversion, compilation) to run them function/recursive-function-nest at-a-time, so that the results from compiling bottom-most functions all the way to assembly language would be available to inform inlining at the next level up.

@josharian

Contributor

josharian commented Oct 25, 2016

doing compilation bottom-up

I have also considered this. There'd be a lot of high-risk work in rearranging the rest of the compiler to work this way. It could also hurt our chances of getting a big boost out of concurrent compilation; you want to start on the biggest, slowest functions ASAP, but those are the most likely to depend on many other functions.

@dr2chase

Contributor

dr2chase commented Oct 25, 2016

It doesn't look that high risk to me; it's just another iteration order. SSA also gives us a slightly more tractable place to compute things like "constant parameter values that shrink code size", even if it is only so crude as looking for blocks directly conditional on comparisons with parameter values.

@josharian

Contributor

josharian commented Oct 25, 2016

I think we could test the inlining benefits of the bottom-up compilation pretty easily. One way is to do it just for inter-package compilation (as suggested above); another is to hack cmd/compile to dump the function asm size somewhere and then hack cmd/go to compile all packages twice, using the dumped sizes for the second round.

@CAFxX

Contributor

CAFxX commented Nov 22, 2016

An unexported function that is only called once is often cheaper to inline.

Out of curiosity, why "often"? Off the top of my head, I can't think of a case in which the contrary is true.

Also, just to understand, in -buildmode=exe basically every single function beside the entry function is going to be considered unexported?

@ianlancetaylor

Contributor

ianlancetaylor commented Nov 22, 2016

An unexported function that is only called once is often cheaper to inline.

Out of curiosity, why "often"? Off the top of my head, I can't think of a case in which the contrary is true.

It is not true when the code looks like

func internal() {
    // Large complex function with loops.
}

func External() {
    if veryRareCase {
        internal()
    }
}

Because in the normal case, where you don't need to call internal, you don't have to set up a stack frame.

Also, just to understand, in -buildmode=exe basically every single function beside the entry function is going to be considered unexported?

In package main, yes.

@CAFxX

Contributor

CAFxX commented Nov 22, 2016

Because in the normal case where you don't need to call internal you don't have set up a stack frame.

Oh I see, makes sense. Would be nice (also in other cases) if setting up the stack frame could be sunk in the if, but likely it wouldn't be worth the extra effort.

Also, just to understand, in -buildmode=exe basically every single function beside the entry function is going to be considered unexported?

In package main, yes.

The tyranny of unit-at-a-time :D

@RalphCorderoy

RalphCorderoy commented Jan 18, 2017

Functions that start with a run of if param1 == nil { return nil } where the tests and return values are simple parameters or constants would avoid the call overhead if just this part was inlined. The size of setting up the call/return could be weighed against the size of the simple tests.

@mvdan

Member

mvdan commented Jan 18, 2017

@RalphCorderoy I've been thinking about the same kind of function body "chunking" for early returns. Especially interesting for quick paths, where the slow path is too big to inline.

Unless the compiler chunks, it's up to the developer to split the function in two I presume.

@RalphCorderoy

RalphCorderoy commented Jan 19, 2017

Hi @mvdan, split the function in two with the intention that the compiler then inlines the non-leaf first one?
Another possibility when stripping out some of the simple quick paths: the remaining slow, non-inlined function may no longer need some of the parameters.

@mvdan

Member

mvdan commented Jan 19, 2017

Yes, for example, here tryFastPath is likely small enough to be inlined (if forward thunk funcs were inlineable, see #8421):

func tryFastPath() {
    if someCond {
        // fast path
        return ...
    }
    return slowPath()
}

@josharian

Contributor

josharian commented May 18, 2017

Too late for 1.9.

@gopherbot

gopherbot commented Aug 20, 2017

Change https://golang.org/cl/57410 mentions this issue: runtime: add TestIntendedInlining

gopherbot pushed a commit that referenced this issue Aug 22, 2017

runtime: add TestIntendedInlining
The intent is to allow more aggressive refactoring
in the runtime without silent performance changes.

The test would be useful for many functions.
I've seeded it with the runtime functions tophash and add;
it will grow organically (or wither!) from here.

Updates #21536 and #17566

Change-Id: Ib26d9cfd395e7a8844150224da0856add7bedc42
Reviewed-on: https://go-review.googlesource.com/57410
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@TocarIP

Contributor

TocarIP commented Sep 25, 2017

Another example of the current inlining heuristic punishing more readable code:

if !foo {
  return
}
if !bar {
  return
}
if !baz {
  return
}
//do stuff

is more "expensive" than

if foo && bar && baz {
// do stuff
}
return

This is based on real code from the regexp package (see https://go-review.googlesource.com/c/go/+/65491 for details).

@TocarIP

Contributor

TocarIP commented Feb 9, 2018

Another issue with the inliner is that a function containing an inlined call to some other function has a higher cost than its manually inlined equivalent. Consider the following example:

package main

var a int
var b int

func foo() { a = b }

func f00() { foo() }

func f01() { f00() }
func f02() { f01() }
func f03() { f02() }
func f04() { f03() }
func f05() { f04() }
func f06() { f05() }
func f07() { f06() }
func f08() { f07() }
func f09() { f08() }
func f10() { f09() }
func f11() { f10() }
func f12() { f11() }
func f13() { f12() }
func f14() { f13() }
func f15() { f14() }
func f16() { f15() }
func f17() { f16() }
func f18() { f17() }
func f19() { f18() }
func f20() { f19() }
func f21() { f20() }
func f22() { f21() }
func f23() { f22() }
func f24() { f23() }
func f25() { f24() }
func f26() { f25() }
func f27() { f26() }
func f28() { f27() }
func f29() { f28() }
func f30() { f29() }
func f31() { f30() }
func f32() { f31() }
func f33() { f32() }
func f34() { f33() }
func f35() { f34() }
func f36() { f35() }
func f37() { f36() }
func f38() { f37() }
func f39() { f38() }

func main() {
        f39()
}

It should be optimized into:

package main

var a int
var b int

func main() {
        a = b
}

However, after each inlining the cost of the new function is increased by 2, causing (with -m -m):

./foo.go:47:6: cannot inline f38: function too complex: cost 81 exceeds budget 80

This is caused by the inliner adding a new goto to/from the inlined body and new assignments for each argument/return value. For the simple case (single return, no assignments to arguments) we should probably adjust the cost of inlining outer functions by something like 2 + number_of_arguments + number_of_return_values. I've tried simply raising the limit by 20, and e.g. image/png's paeth gets significantly faster:

Paeth-6  6.83ns ± 0%  4.39ns ± 0%  -35.73%  (p=0.000 n=9+9)

Because now paeth with inlined abs is itself inlinable.
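For reference, the pair in question looks roughly like this (simplified from image/png): paeth is only a handful of operations itself, but the three inlined abs bodies are what push it past the budget.

```go
package main

import "fmt"

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// paeth implements the PNG Paeth predictor. Its own cost is modest, but
// once the three calls to abs are inlined, the combined cost exceeds the
// inlining budget, so paeth itself stops being inlinable.
func paeth(a, b, c uint8) uint8 {
	pc := int(c)
	pa := int(b) - pc
	pb := int(a) - pc
	pc = abs(pa + pb)
	pa = abs(pa)
	pb = abs(pb)
	if pa <= pb && pa <= pc {
		return a
	}
	if pb <= pc {
		return b
	}
	return c
}

func main() {
	fmt.Println(paeth(1, 2, 3), paeth(10, 20, 15))
}
```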

@aclements

Member

aclements commented Apr 28, 2018

Another inlining heuristic idea: if a small function is called with a func literal argument, inline the small function, and "inline" (or is it a constant fold?) the call to the now-constant func argument. That would enable zero-cost control flow abstraction. This may depend on how many times the argument is called: if it's just once then there's no reason not to, but if it's more than once then it may depend on the complexity of the argument.

For example, in CL 109716 it would have been nice to abstract out the high-performance bitmap iteration pattern, but there's no way to do that right now without adding a fair amount of overhead. With this heuristic, it would have been possible to lift that pattern into a function at no cost, where the loop body was supplied as a func literal argument.
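A sketch of what that zero-cost abstraction could look like (a made-up bitmap walker, not the CL's actual code): the loop body is passed as a func literal, and with the proposed heuristic both forEachSetBit and the now-constant f could be flattened into the caller.

```go
package main

import (
	"fmt"
	"math/bits"
)

// forEachSetBit calls f with the index of each set bit in mask, lowest first.
func forEachSetBit(mask uint64, f func(i int)) {
	for mask != 0 {
		f(bits.TrailingZeros64(mask)) // index of lowest set bit
		mask &= mask - 1              // clear lowest set bit
	}
}

func main() {
	sum := 0
	forEachSetBit(0x16, func(i int) { sum += i }) // bits 1, 2, 4 are set
	fmt.Println(sum)
}
```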

@gopherbot

gopherbot commented Jul 23, 2018

Change https://golang.org/cl/125516 mentions this issue: cmd/compile: set stricter inlining threshold in large functions

gopherbot pushed a commit that referenced this issue Jul 24, 2018

cmd/compile: set stricter inlining threshold in large functions
If we're compiling a large function, be more picky about how big
the function we're inlining is.  If the function is >5000 nodes,
we lower the inlining threshold from a cost of 80 to 20.

Turns out reflect.Value's cost is exactly 80.  That's the function
at issue in #26546.

20 was chosen as a proxy for "inlined body is smaller than the call would be".
Simple functions still get inlined, like this one at cost 7:

func ifaceIndir(t *rtype) bool {
	return t.kind&kindDirectIface == 0
}

5000 nodes was chosen as the big function size.  Here are all the
5000+ node (~~1000+ lines) functions in the stdlib:

5187 cmd/internal/obj/arm (*ctxt5).asmout
6879 cmd/internal/obj/s390x (*ctxtz).asmout
6567 cmd/internal/obj/ppc64 (*ctxt9).asmout
9643 cmd/internal/obj/arm64 (*ctxt7).asmout
5042 cmd/internal/obj/x86 (*AsmBuf).doasm
8768 cmd/compile/internal/ssa rewriteBlockAMD64
8878 cmd/compile/internal/ssa rewriteBlockARM
8344 cmd/compile/internal/ssa rewriteValueARM64_OpARM64OR_20
7916 cmd/compile/internal/ssa rewriteValueARM64_OpARM64OR_30
5427 cmd/compile/internal/ssa rewriteBlockARM64
5126 cmd/compile/internal/ssa rewriteValuePPC64_OpPPC64OR_50
6152 cmd/compile/internal/ssa rewriteValuePPC64_OpPPC64OR_60
6412 cmd/compile/internal/ssa rewriteValuePPC64_OpPPC64OR_70
6486 cmd/compile/internal/ssa rewriteValuePPC64_OpPPC64OR_80
6534 cmd/compile/internal/ssa rewriteValuePPC64_OpPPC64OR_90
6534 cmd/compile/internal/ssa rewriteValuePPC64_OpPPC64OR_100
6534 cmd/compile/internal/ssa rewriteValuePPC64_OpPPC64OR_110
6675 cmd/compile/internal/gc typecheck1
5433 cmd/compile/internal/gc walkexpr
14070 cmd/vendor/golang.org/x/arch/arm64/arm64asm decodeArg

There are a lot more smaller (~1000 node) functions in the stdlib.
The function in #26546 has 12477 nodes.

At some point it might be nice to have a better heuristic for "inlined
body is smaller than the call", a non-cliff way to scale down the cost
as the function gets bigger, doing cheaper inlined calls first, etc.
All that can wait for another release. I'd like to do this CL for
1.11.

Fixes #26546
Update #17566

Change-Id: Idda13020e46ec2b28d79a17217f44b189f8139ac
Reviewed-on: https://go-review.googlesource.com/125516
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
@josharian

Contributor

josharian commented Aug 14, 2018

In the commit message of CL 125516, @randall77 commented:

At some point it might be nice to have a better heuristic for "inlined body is smaller than the call", a non-cliff way to scale down the cost as the function gets bigger, doing cheaper inlined calls first, etc.

(Copied here so that it is easier to find by folks interested in improving inlining heuristics.)

@bradfitz

Member

bradfitz commented Aug 29, 2018

In a Gophercon talk right now, the presenter demoed removing the use of math/bits intrinsics to reclaim the inlining budget consumed by those "function calls". Unfortunate.

@randall77

Contributor

randall77 commented Aug 29, 2018

We give calls to intrinsics a cost of 1. They don't get a full call's cost.
What am I missing?

@ianlancetaylor

Contributor

ianlancetaylor commented Aug 29, 2018

Maybe nothing. The call was to math/bits.RotateLeft32. I don't see that in the list of intrinsics in cmd/compile/internal/gc/ssa.go. (But maybe there is someplace else to look that I don't know about.)

@randall77

This comment has been minimized.

Contributor

randall77 commented Aug 29, 2018

Ah yes, that's not an intrinsic. We detect the rotate directly from the operations in the body of that function.

The body looks like it has ~7 operations in it, but it reduces to just one instruction in the end.
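Concretely, the body under discussion is essentially this shift/or expression (sketch): the compiler's rewrite rules recognize the pattern and emit a single rotate instruction, but the inliner still counts every node of the expression.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rotl rotates x left by k bits. The expression is several AST nodes for
// inlining-cost purposes, even though the backend pattern-matches it down
// to one rotate instruction.
func rotl(x uint32, k uint) uint32 {
	return x<<k | x>>(32-k)
}

func main() {
	fmt.Println(rotl(0x80000001, 1) == bits.RotateLeft32(0x80000001, 1))
}
```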

@dr2chase

Contributor

dr2chase commented Aug 29, 2018

Should we start pretending we have feedback information stored in a file, and use that to guide inlining?

hotsites=["blake_foo.go:17(math/bits.RotateLeft32)","blake_f..go:19(math/bits.RotateLeft32)"]
hotfuncs=["whatever.blake/g"]

Make that something the compilation depends on: if we change the file, we force a recompilation.

@josharian

Contributor

josharian commented Aug 29, 2018

Ah yes, that's not an intrinsic.

Short term, perhaps we should just make it an intrinsic (for AMD64). Then it behaves appropriately end-to-end, including inlining. That's pretty easy and safe. Maybe I can help someone get this in tomorrow at community day.

There are many long term fixes, of course.

Here's a test case we can add:

// +build amd64
// errorcheck -0 -m

// Copyright 2018 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// Test that inlining of math/bits.RotateLeft* treats those calls as intrinsics.

package p

import "math/bits"

var (
	x8  uint8
	x16 uint16
	x32 uint32
	x64 uint64
	x   uint
)

func f() { // ERROR "can inline f"
	x8 = bits.RotateLeft8(x8, 1)    // ERROR "inlining call"
	x16 = bits.RotateLeft16(x16, 1) // ERROR "inlining call"
	x32 = bits.RotateLeft32(x32, 1) // ERROR "inlining call"
	x64 = bits.RotateLeft64(x64, 1) // ERROR "inlining call"
	x = bits.RotateLeft(x, 1)       // ERROR "inlining call"
}

This fails with 1.11 and should pass.

@josharian

Contributor

josharian commented Aug 29, 2018

Unless I hear otherwise, @andybons and I will make this an intrinsic tomorrow.

@gopherbot

gopherbot commented Aug 30, 2018

Change https://golang.org/cl/132435 mentions this issue: cmd/compile: make math/bits.RotateLeft* an intrinsic on amd64

gopherbot pushed a commit that referenced this issue Aug 30, 2018

cmd/compile: make math/bits.RotateLeft* an intrinsic on amd64
Previously, pattern matching was good enough to achieve good performance
for the RotateLeft* functions, but the inlining cost for them was much
too high. Make RotateLeft* intrinsic on amd64 as a stop-gap for now to
reduce inlining costs.

This should be done (or at least looked at) for other architectures
as well.

Updates #17566

Change-Id: I6a106ff00b6c4e3f490650af3e083ed2be00c819
Reviewed-on: https://go-review.googlesource.com/132435
Run-TryBot: Andrew Bonventre <andybons@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
@ugorji

Contributor

ugorji commented Sep 25, 2018

Any chance this gets in for go 1.12?

@randall77

Contributor

randall77 commented Sep 25, 2018

Someone needs to work on it. And have an idea of how to do it.
Other than that, it's ready to go :)

@ugorji

Contributor

ugorji commented Sep 25, 2018

@randall77 ;) I wish I knew enough about it to jump in - I would be all over it. This ... just ... feels like low-hanging fruit toward closing the performance gap with C/Rust/etc., and it seems to have been on the radar for a few years now.

@gopherbot

gopherbot commented Oct 10, 2018

Change https://golang.org/cl/141117 mentions this issue: cmd/compile: remove some inl budget hacks

gopherbot pushed a commit that referenced this issue Oct 10, 2018

cmd/compile: remove some inl budget hacks
Prior to stack tracing, inlining could cause
dead pointers to be kept alive in some loops.
See #18336 and CL 31674.

The adjustment removed by this change preserved the inlining status quo
in the face of Node structure changes, to avoid creating new problems.
Now that stack tracing provides precision, these hacks can be removed.

Of course, our inlining code model is already hacky (#17566),
but at least now there will be fewer epicyclical hacks.

Newly inline-able functions in std cmd as a result of this change:

hash/adler32/adler32.go:65:6: can inline (*digest).UnmarshalBinary
hash/fnv/fnv.go:281:6: can inline (*sum32).UnmarshalBinary
hash/fnv/fnv.go:292:6: can inline (*sum32a).UnmarshalBinary
reflect/value.go:1298:6: can inline Value.OverflowComplex
compress/bzip2/bit_reader.go:25:6: can inline newBitReader
encoding/xml/xml.go:365:6: can inline (*Decoder).switchToReader
vendor/golang_org/x/crypto/cryptobyte/builder.go:77:6: can inline (*Builder).AddUint16
crypto/x509/x509.go:1851:58: can inline buildExtensions.func2.1.1
crypto/x509/x509.go:1871:58: can inline buildExtensions.func2.3.1
crypto/x509/x509.go:1883:58: can inline buildExtensions.func2.4.1
cmd/vet/internal/cfg/builder.go:463:6: can inline (*builder).labeledBlock
crypto/tls/handshake_messages.go:1450:6: can inline (*newSessionTicketMsg).marshal
crypto/tls/handshake_server.go:769:6: can inline (*serverHandshakeState).clientHelloInfo
crypto/tls/handshake_messages.go:1171:6: can inline (*nextProtoMsg).unmarshal
cmd/link/internal/amd64/obj.go:40:6: can inline Init
cmd/link/internal/ppc64/obj.go:40:6: can inline Init
net/http/httputil/persist.go:54:6: can inline NewServerConn
net/http/fcgi/child.go:83:6: can inline newResponse
cmd/compile/internal/ssa/poset.go:245:6: can inline (*poset).newnode

Change-Id: I19e8e383a6273849673d35189a9358870665f82f
Reviewed-on: https://go-review.googlesource.com/c/141117
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ilya Tocar <ilya.tocar@intel.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>