
cmd/compile: improve inlining cost model #17566

Open
josharian opened this issue Oct 24, 2016 · 81 comments

Comments


@josharian josharian commented Oct 24, 2016

The current inlining cost model is simplistic. Every gc.Node in a function has a cost of one. However, the actual impact of each node varies widely. Some nodes (OKEY) are placeholders that never generate any code. Some nodes (OAPPEND) generate lots of code.

In addition to leading to bad inlining decisions, this design means that any refactoring that changes the AST structure can have unexpected and significant impact on compiled code. See CL 31674 for an example.

Inlining occurs near the beginning of compilation, which makes good predictions hard. For example, new or make or & might allocate (large runtime call, much code generated) or not (near zero code generated). As another example, code guarded by if false still gets counted. As another example, we don't know whether bounds checks (which generate lots of code) will be eliminated or not.
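A minimal sketch of that mismatch (illustrative code, not from the compiler): both functions below are small as ASTs, but append can expand into a runtime growslice call plus copying code, while the if-false body generates no machine code at all. Building with go build -gcflags='-m -m' shows the inlining decisions (and, in newer compilers, the computed costs).

```go
package main

import "fmt"

// grow's body is only a handful of AST nodes, but append may expand
// into a call to the runtime's growslice plus copying code.
func grow(s []int, v int) []int {
	return append(s, v)
}

// deadGuard's constant-false branch emits no machine code, yet every
// node inside it still counts toward the inlining budget.
func deadGuard() int {
	const debug = false
	if debug {
		fmt.Println("never emitted, still counted")
	}
	return 42
}

func main() {
	fmt.Println(len(grow(nil, 1)), deadGuard())
}
```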

One approach is to hand-write a better cost model: append is very expensive, things that might end up in a runtime call are moderately expensive, pure structure and variable/const declarations are cheap or free.

Another approach is to compile lots of code and generate a simple machine-built model (e.g. linear regression) from it.

I have tried both of these approaches, and believe both of them to be improvements, but did not mail either of them, for two reasons:

  • Inlining significantly impacts binary size, runtime performance, and compilation speed. Minor changes to the cost model or to the budget can have big impacts on all three. I hoped to find a model and budget that was clearly Pareto optimal, but I did not. In order to make forward progress, we need to decide which metric(s) we want to optimize for, and which code to measure those metrics on. This is, to my mind, the single biggest blocker for improving inlining.
  • Changing inlining decisions impacts the runtime as well as other code, and minor inlining changes to the runtime can have outsized performance impacts. I see several possible ways to address this. (1) We could add a //go:inline annotation for use in the runtime only, to allow runtime authors to force the compiler to inline performance-critical functions. If a non-inlinable function were marked //go:inline, compilation would fail. (2) We could add a //go:mustinline annotation for use in the runtime only (see CL 22785), to allow runtime authors to protect currently-inlined functions against becoming non-inlined. (3) We could tune runtime inlining (cost model, budget, etc.) independently.

Three other related ideas:

  • We might want to take into account the number and size of parameters and return values of a function when deciding whether to inline it, since those determine the cost in binary size of setting up the function call.
  • We might want to have separate budgets for expressions and control flow, since branches end up being more expensive on most metrics.
  • We could treat intra-package inlining differently from inter-package inlining. When inlining across packages, we don't actually need to decide early on whether to allow inlining, since the actual inlining will occur in an entirely different compiler invocation. We could thus wait until function compilation is complete, when we know exactly how large the fully optimized code is, and then decide whether other packages should inline that function. Downsides to this otherwise appealing idea are: (1) the unexpected performance impact of moving a function from one package to another, and (2) it introduces a significant dependency between full compilation and writing export data, which e.g. would prevent #15734.

cc @dr2chase @randall77 @ianlancetaylor @mdempsky


@ianlancetaylor ianlancetaylor commented Oct 24, 2016

I'm going to toss in a few more ideas to consider.

An unexported function that is only called once is often cheaper to inline.

Functions that include tests of parameter values can be cheaper to inline for specific calls that pass constant arguments for those parameters. That is, the cost of inlining is not solely determined by the function itself, it is also determined by the nature of the call.
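A sketch of that call-site effect (hypothetical function): the body below is branchy in general, but at a call site with a constant second argument an inlined copy lets later passes fold the test and delete the dead branch.

```go
package main

import "fmt"

// shift moves x left for non-negative n and right otherwise.
func shift(x uint64, n int) uint64 {
	if n >= 0 {
		return x << uint(n)
	}
	return x >> uint(-n)
}

func main() {
	// At a call site like shift(v, 3), inlining would let the compiler
	// fold the n >= 0 test away and reduce the body to a single shift,
	// so the cost of inlining here is lower than the function's size
	// alone suggests.
	fmt.Println(shift(1, 3), shift(16, -2))
}
```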

Functions that only make function calls in error cases, which is fairly common, can be cheaper to handle as a mix of inlining and outlining: you inline the main control flow but leave the error handling in a separate function. This may be particularly worth considering when inlining across packages, as the export data only needs to include the main control flow. (Error cases are detectable as the control flow blocks that return a non-nil value for a single result parameter of type error.)
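A sketch of that shape (names made up): the main control flow is tiny and call-free, so it could be inlined, while the error construction stays behind as an outlined function call on the cold path.

```go
package main

import (
	"errors"
	"fmt"
)

// firstByte's hot path is a bounds test and a load; the only function
// call is on the error path, which could remain outlined.
func firstByte(b []byte) (byte, error) {
	if len(b) == 0 {
		return 0, emptyErr() // cold: left as a real call
	}
	return b[0], nil // hot: candidate for inlining
}

// emptyErr is the outlined error constructor for the rare case.
func emptyErr() error { return errors.New("empty input") }

func main() {
	v, err := firstByte([]byte{7})
	fmt.Println(v, err)
}
```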

One of the most important optimizations for large programs is feedback directed optimization aka profiled guided optimization. One of the most important lessons to learn from feedback/profiling is which functions are worth inlining, both on a per-call basis and on a "most calls pass X as argument N" basis. Therefore, while we have no FDO/PGO framework at present, any work on inlining should consider how to incorporate information gleaned from such a framework when it exists.

Pareto optimal is a nice goal but I suspect it is somewhat unrealistic. It's almost always possible to find a horrible decision made by any specific algorithm, but the algorithm can still be better on realistic benchmarks.


@rasky rasky commented Oct 24, 2016

Functions that include tests of parameter values can be cheaper to inline for specific calls that pass constant arguments for those parameters. That is, the cost of inlining is not solely determined by the function itself, it is also determined by the nature of the call.

A common case where this would apply is when calling marshallers/unmarshallers that use reflect to inspect interface{} parameters. Many of them have a fast path for when the reflect.Type is some basic type or has some basic property. Inlining that fast path would make it even faster. E.g., see binary.Read and consider how much of its code could be stripped away when the value bound to the interface{} argument is known at compile time.


@thanm thanm commented Oct 24, 2016

Along the lines of what iant@ said, it's common for C++ compilers to take into account whether a callsite appears in a loop (and thus might be "hotter"). This can help for toolchains that don't support FDO/PGO or for applications in which FDO/PGO are not being used.



@dr2chase dr2chase commented Oct 25, 2016

Couldn't we obtain a minor improvement in the cost model by measuring the size of generated assembly language? It would require preserving a copy of the tree till after compilation, and doing compilation bottom-up (same way as inlining is scheduled) but that would give you a more accurate measure. There's a moderate chance of being able to determine goodness of constant parameters at the SSA-level, too.

Note that this would require rearranging all of these transformations (inlining, escape analysis, closure conversion, compilation) to run them function/recursive-function-nest at-a-time, so that the results from compiling bottom-most functions all the way to assembly language would be available to inform inlining at the next level up.


@josharian josharian commented Oct 25, 2016

doing compilation bottom-up

I have also considered this. There'd be a lot of high risk work rearranging the rest of the compiler to work this way. It could also hurt our chances to get a big boost out of concurrent compilation; you want to start on the biggest, slowest functions ASAP, but those are the most likely to depend on many other functions.


@dr2chase dr2chase commented Oct 25, 2016

It doesn't look that high risk to me; it's just another iteration order. SSA also gives us a slightly more tractable place to compute things like "constant parameter values that shrink code size", even if it is only so crude as looking for blocks directly conditional on comparisons with parameter values.


@josharian josharian commented Oct 25, 2016

I think we could test the inlining benefits of the bottom-up compilation pretty easily. One way is to do it just for inter-package compilation (as suggested above); another is to hack cmd/compile to dump the function asm size somewhere and then hack cmd/go to compile all packages twice, using the dumped sizes for the second round.


@CAFxX CAFxX commented Nov 22, 2016

An unexported function that is only called once is often cheaper to inline.

Out of curiosity, why "often"? Off the top of my head, I can't think of a case in which the contrary is true.

Also, just to understand: in -buildmode=exe, basically every single function besides the entry function is going to be considered unexported?


@ianlancetaylor ianlancetaylor commented Nov 22, 2016

An unexported function that is only called once is often cheaper to inline.

Out of curiosity, why "often"? Off the top of my head, I can't think of a case in which the contrary is true.

It is not true when the code looks like

func internal() {
    // Large complex function with loops.
}

func External() {
    if veryRareCase {
        internal()
    }
}

Because in the normal case, where you don't need to call internal, you don't have to set up a stack frame.

Also, just to understand: in -buildmode=exe, basically every single function besides the entry function is going to be considered unexported?

In package main, yes.


@CAFxX CAFxX commented Nov 22, 2016

Because in the normal case where you don't need to call internal you don't have set up a stack frame.

Oh I see, makes sense. It would be nice (in other cases too) if setting up the stack frame could be sunk into the if, but it likely wouldn't be worth the extra effort.

Also, just to understand: in -buildmode=exe, basically every single function besides the entry function is going to be considered unexported?

In package main, yes.

The tyranny of unit-at-a-time :D


@RalphCorderoy RalphCorderoy commented Jan 18, 2017

Functions that start with a run of if param1 == nil { return nil } where the tests and return values are simple parameters or constants would avoid the call overhead if just this part was inlined. The size of setting up the call/return could be weighed against the size of the simple tests.


@mvdan mvdan commented Jan 18, 2017

@RalphCorderoy I've been thinking about the same kind of function body "chunking" for early returns. Especially interesting for quick paths, where the slow path is too big to inline.

Unless the compiler chunks, it's up to the developer to split the function in two I presume.


@RalphCorderoy RalphCorderoy commented Jan 19, 2017

Hi @mvdan, split the function in two with the intention that the compiler then inlines the non-leaf first one?
Another benefit of stripping out some of the simple quick paths is that the remaining slow, non-inlined function may no longer need some of the parameters.


@mvdan mvdan commented Jan 19, 2017

Yes, for example, here tryFastPath is likely small enough to be inlined (if forward thunk funcs were inlineable, see #8421):

func tryFastPath() {
    if someCond {
        // fast path
        return ...
    }
    return slowPath()
}

@josharian josharian commented May 18, 2017

Too late for 1.9.


@gopherbot gopherbot commented Aug 20, 2017

Change https://golang.org/cl/57410 mentions this issue: runtime: add TestIntendedInlining

gopherbot pushed a commit that referenced this issue Aug 22, 2017
The intent is to allow more aggressive refactoring
in the runtime without silent performance changes.

The test would be useful for many functions.
I've seeded it with the runtime functions tophash and add;
it will grow organically (or wither!) from here.

Updates #21536 and #17566

Change-Id: Ib26d9cfd395e7a8844150224da0856add7bedc42
Reviewed-on: https://go-review.googlesource.com/57410
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

@TocarIP TocarIP commented Sep 25, 2017

Another example of the current inlining heuristic punishing more readable code:

if !foo {
  return
}
if !bar {
  return
}
if !baz {
  return
}
//do stuff

is more "expensive" than

if foo && bar && baz {
// do stuff
}
return

This is based on real code from the regexp package (see https://go-review.googlesource.com/c/go/+/65491 for details).


@cheng-chang cheng-chang commented Apr 18, 2020

A related benchmark:

// search.go
package search

import "sort"

func Search(a []int, x int) int {
        return sort.Search(len(a), func(i int) bool { return a[i] > x }) - 1
}

func InlinedSearch(a []int, x int) int {
        i, j := 0, len(a)
        for i < j {
                h := i + (j-i)/2 // avoid overflow when computing h
                // i ≤ h < j
                if a[h] <= x {
                        i = h + 1
                } else {
                        j = h
                }
        }
        return i - 1
}

// search_test.go
package search

import (
        "log"
        "testing"
)

func bench(b *testing.B, search func([]int, int) int) {
        v := []int{1, 10, 100, 1000, 10000, 100000}
        for i := 0; i < b.N; i++ {
                r := search(v, 999)
                if r != 2 {
                        log.Fatal(r)
                }
        }
}

func BenchmarkSearch(b *testing.B) {
        bench(b, Search)
}

func BenchmarkInlinedSearch(b *testing.B) {
        bench(b, InlinedSearch)
}
$ go test -bench .
goos: darwin
goarch: amd64
BenchmarkSearch-16           	100600551	        11.9 ns/op
BenchmarkInlinedSearch-16    	221246642	         5.32 ns/op
PASS

The Go source tree has a manually inlined copy of sort.Search at https://github.com/golang/go/blob/master/src/go/token/position.go#L518; the comment in that file is from 2011 and, unfortunately, it still applies today.


@klauspost klauspost commented May 22, 2020

Not to beat on an injured horse, but I wanted to add a real example, since this seems to be dominated by quite artificial benchmarks.

In my zstandard decompression, inlining a single function gave an overall end-to-end speed improvement of 1.19x in a real-world decompression scenario. The pull request with the change is here: klauspost/compress#259

The only thing changed is the manual inlining. CPU profiles showed the called function taking on average 11% of the execution time. To me this means that setting up the function call was about as expensive as executing the called function. The called function already had another function manually inlined into it for a similar speedup. I assume this is because of the number of args and returned values.

To me this shows that in some cases the Go compiler is way too conservative in what it decides to inline, or the metric for determining the complexity is off.

I would like to write Go code, where I don't have to manually inline the code, since it degrades readability. I am also not a big fan of inline pragmas, since it usually makes junior devs slap it on everything because some random microbenchmark 'was faster'.

I don't know if the inlining metric takes the number of params/return values into account when making its decision, but if not, I would propose something like multiplying the threshold by the number of params/return values, since that would at least take those into account.


@josharian josharian commented May 24, 2020

@klauspost thanks. Real world, concrete examples continue to be very helpful.

By way of explanation (not justification), one of the major challenges to working on inlining is measuring the quality of the results. Ideas I have aplenty, and implementation is easy enough. But any given change usually has mild widespread results (impacts to binary size, compilation time) and severe local results (impacts to performance on individual functions). These are hard to know how to balance. Having target functions that are known to be important is useful.

I don't know if the in-lining metric takes the number of params/return values into account when making its decision

It does not. I have implemented this but could not convince myself at the time that it was clearly better. (And "clearly better" is the bar. Otherwise the churn and resultant bug reports aren't worth it.)

This also allows people to "manually" inline by adding unused throwaway parameters to a function. I no longer know whether this is a bug or a feature: In my heart I know it is bad, yet I have wanted it so very many times myself...

A related change would be to increase the budget on a call-site basis for call sites that are passing constants as arguments, as that increases the probability of significant optimization gains after inlining. The code is not well set up for per-call-site budget modifications. It would be useful to change that, if anyone feels inspired to go tackle it.
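A small illustration of that constant-argument payoff (hypothetical function): the generic body is several operations, but once inlined at a call site with a constant alignment, the masks become constants and the whole thing reduces to an add and an and.

```go
package main

import "fmt"

// alignUp rounds p up to a multiple of a, where a must be a power of two.
func alignUp(p, a uintptr) uintptr {
	return (p + a - 1) &^ (a - 1)
}

func main() {
	// At a call site like alignUp(p, 8), an inlined copy folds to
	// (p+7)&^7 — the optimization payoff is only visible at the call
	// site, which is why a per-call-site budget would help.
	fmt.Println(alignUp(13, 8))
}
```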


@mvdan mvdan commented May 28, 2020

I have also been wondering if @dr2chase's research with his corpus of open source Go code with benchmarks might be biased. If performance-sensitive Go code out there already performs manual function call inlining where it's really needed, such as what @klauspost mentioned above, then the benchmark comparison results from trying to improve the inlining heuristic might be significantly skewed.


@rasky rasky commented May 29, 2020

That’s a brilliant point. I know for sure that I do inline all my performance sensitive code.


@klauspost klauspost commented May 29, 2020

In the stdlib I have #38324 pending which also includes a manual inlined function call.

In my benchmarks this change alone gave an overall improvement to 1.13x in gzip decompression speed end-to-end.


@mvdan mvdan commented May 29, 2020

That’s a brilliant point. I know for sure that I do inline all my performance sensitive code.

I do too, and there are dozens of such cases in the standard library. Unfortunately, once manual inlining happens, that information is lost. It's not like we just switch func Foo(...) to inline func Foo(...).

In an ideal world, the corpus to drive heuristic improvements would be how people should write Go code in an idiomatic and maintainable way. But of course, reality is very different, especially for projects which are performance-sensitive enough to have benchmarks.

Edit: can anyone think of a way we could possibly not lose the information? It's technically possible to retrieve it from the git history, but I doubt a machine could automate that. Annotating the manually inlined code with comments could work, even if just as a reminder to de-inline the piece of code in the future once the compiler gets better. But what could those annotations look like? Comments with start/end sections?
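One possible strawman for such annotations (purely hypothetical comment markers — nothing in the toolchain understands them), reusing the manually inlined sort.Search mentioned earlier in this thread:

```go
package main

import "fmt"

// searchGT returns the index of the first element of a greater than x.
func searchGT(a []int, x int) int {
	// manual-inline:start sort.Search (#17566)
	// Remove this block and call sort.Search directly once the compiler
	// inlines closure-taking functions well enough.
	i, j := 0, len(a)
	for i < j {
		h := i + (j-i)/2 // avoid overflow when computing h
		if a[h] > x {
			j = h
		} else {
			i = h + 1
		}
	}
	// manual-inline:end
	return i
}

func main() {
	fmt.Println(searchGT([]int{1, 3, 5}, 3))
}
```

The start/end markers record what was inlined and why, so a tool (or a grep) could find candidates to de-inline when the compiler improves.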


@CAFxX CAFxX commented May 30, 2020

Edit: can anyone think of a way we could possibly not lose the information? It's technically possible to retrieve it from the git history, but I doubt a machine could automate that. Annotating the manually inlined code with comments could work, even if just as a reminder to de-inline the piece of code in the future once the compiler gets better. But what could those annotations look like? Comments with start/end sections?

That really feels like trying to work around the workaround -- tiptoeing around the pragmatic solution, which would be being able to tell the compiler "I want inlining to happen at this specific callsite, or an error if that's not possible".

We can do, and are doing, this manually today anyway, with the resulting increase in maintenance burden. It is extremely unlikely that we'll get anytime soon to a point where the issue of the inliner being too conservative, and thereby forcing people to duplicate code, is resolved, given that we (correctly, IMO) want to prioritize compilation speed in the common case. Crucially, even with significant improvements to the compiler, there are always going to be cases in which the inliner won't do the thing we want.

This is why I think the one described above is the most pragmatic solution: we give people that would otherwise resort to duplicate code a way to avoid doing it, without punishing compilation in the common case.


@mvdan mvdan commented May 30, 2020

@CAFxX you might find the last group of comments in #21536 useful. I agree that, in the short term, some sort of directive or hint would be useful. The problem is that it goes against Go's design, and in the long run we wouldn't want to have one once the compiler gets better.

Of course, in practice people force the compiler's hand anyway by manually inlining, I completely agree with you there. Which is why I'm wondering if there's something we could do as a temporary measure. Perhaps a pragma that's not guaranteed to work forever, such as the println builtin?

Println is useful for bootstrapping and debugging; it is not guaranteed to stay in the language.


@rasky rasky commented May 30, 2020

I’m not sure why people are so convinced that eventually an annotation would not be required. GCC and LLVM have had man-years of tuning put into their inlining heuristics/engines, and there’s still no way you can be sure of writing performant code without using compiler hints. In fact, the heuristics use the hints; they don’t replace them.

The problem is unsolved nowadays. Even if we had people working on inline heuristics full time (spoiler: we don’t), it’s a really hard bet to make that they will be able to make it work without hints. Sure, it can get better than now (almost anything is better than now), but it’s unlikely to solve all performance problems.

It’s too bad that there’s no agreement on this.


@randall77 randall77 commented May 30, 2020

Hints can help in some cases, but they by no means solve the problem.

  • Ideally, inlining decisions should take into account the call site (constant argument, in a loop, etc.). If we're annotating the function declaration, then that doesn't happen. If we somehow annotate the call site, then library functions won't get inlined unless users of that library add annotations. So we need both...
  • Annotations get stale over time. Just because a function should be inlined today doesn't mean it should after its code has been modified 10 times. Or maybe it helps to inline in Go 1.18, but not Go 1.19.
  • Annotations get over-used. If we annotate declarations, then every library author is going to mark their functions inlineable. Because their library's performance is very important to them, it shows up in their benchmarks, etc. That's fine if you're using that library in a performance-sensitive way, but maybe you aren't. If you're not, then that over-inlining is just causing code bloat.
  • Inlining decisions can depend on the architecture. Want a hint that lists architectures?

@CAFxX CAFxX commented May 30, 2020

Hints can help in some cases, but they by no means solve the problem.

Not claiming it does solve all problems, but unless we find perfect inlining heuristics, neither does improving the inliner. It really just solves the problem I described above.

  • Ideally, inlining decisions should take into account the call site (constant argument, in a loop, etc.). If we're annotating the function declaration, then that doesn't happen. If we somehow annotate the call site, then library functions won't get inlined unless users of that library add annotations. So we need both...

Not sure I see the argument here. I'm not arguing for function annotations; I explicitly argued for a callsite annotation. Why wouldn't callsite annotations be enough for library functions? Again, it's just a way to ask the compiler to do what would otherwise be done manually by copy-pasting the callee.

Interestingly, even though it's pretty much orthogonal to the proposal itself, when the inliner gets better and starts to actually make decisions based on callsites instead of functions, it could even propagate the annotation transitively, as a hint that the specific call path is expected to be executed frequently.

  • Annotations get stale over time. Just because a function should be inlined today doesn't mean it should after its code has been modified 10 times. Or maybe it helps to inline in Go 1.18, but not Go 1.19.

Same argument applies to implicit assumptions about the inliner behavior ("1.18 inlined this function automatically so it didn't need manual inlining, but 1.19 does not inline it automatically").

  • Annotations get over-used. If we annotate declarations, then every library author is going to mark their functions inlineable. Because their library's performance is very important to them, it shows up in their benchmarks, etc. That's fine if you're using that library in a performance-sensitive way, but maybe you aren't. If you're not, then that over-inlining is just causing code bloat.

Hence why I specifically argue for an annotation on the callsite and not on the function. I agree library authors sometimes feel the pressure you're mentioning, but, as I clearly stated, manual inlining is happening today for that very reason, and therefore I find it hard to consider this a very compelling argument against the callsite annotation.

  • Inlining decisions can depend on the architecture. Want a hint that lists architectures?

First, this applies also to manual inlining, so again I don't see the argument.

Second, we already have it. It may not be extremely ergonomic, but it's still more ergonomic, and arguably easier to read, than manual inlining (note that the mustinline syntax is a strawman):

if runtime.GOARCH == "amd64" {
  //go:mustinline
  foo()
} else {
  foo()
}

@seebs seebs commented May 30, 2020

oh that's not nearly confusing enough. consider:

if runtime.GOARCH == "amd64" {
  //go:mustinline
}
foo()

I think there's uses for both call-site and non-call-site annotations.

I think it matters a lot exactly what's being done. Imagine a function which is basically just a switch, called with a constant or non-constant parameter. It might make sense to inline when the parameter is constant, because 90% of the code will go away, but not when the parameter is non-constant. So I think any one simple answer will be poor for other circumstances.

So when thinking about this, it may make sense to assume that the correct answer is "yes, we need function annotations, call-site annotations, and some amount of looking ahead at what optimizations are possible for a given function when inlined". Eventually. And we don't need them all at once, but if we start with the understanding that they will each provide value and solve problems that the others don't, that makes it easier to pick one even though it doesn't solve every problem.


@CAFxX CAFxX commented May 30, 2020

The problem is that it goes against Go's design,

This also applies in both cases.

  • in one case, because we add a knob and we all dislike knobs
  • in the other, because it makes readability and maintainability worse

and in the long run we wouldn't want to have one once the compiler gets better

My argument, that I made above, is that this is unlikely to be the case unless we want to give up compilation speed. We can make small incremental improvements, but I think we can agree it will not cover all cases in which people resort to manual inlining. The problem to me seems to be that we are thinking of the two approaches as alternatives (as in, if we do one we don't need the other) whereas they're really not.


@josharian josharian commented May 30, 2020

Here’s a compromise idea that I imagine will please no one. :)

Allow call site annotations like /*go1.16:inline*/. (Maybe also function-site annotations?) /* annotations are necessary because there may be multiple calls on one line. For versions other than go 1.16, this is a no-op. And the annotation is a no-op unless you pass the compiler flag -l=manual. That compiler flag disables inlining except exactly where annotated. (Or maybe the existence of a single relevant annotation disables all inlining except exactly that requested.) If a requested inlining at a call site is impossible, compilation fails.

This grants complete control to the user who really needs it. It is enough of a pain to discourage casual use. It is scoped to exactly one Go version to avoid accidental rot. It provides raw data about optimal inlining decisions to compiler authors.
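Under that proposal, usage might look like the following (the /*go1.16:inline*/ annotation and the -l=manual flag are the strawman from the comment above, not implemented features; as a plain comment, the code compiles unchanged today):

```go
// Hypothetically built with: go build -gcflags=-l=manual
// Inlining would happen only where annotated; elsewhere it is disabled.
// If the annotated call could not be inlined, compilation would fail.
package main

import "fmt"

func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	a, b := []float64{1, 2}, []float64{3, 4}
	// The comment form allows annotating one of several calls on a line.
	fmt.Println(dot(a, b) /*go1.16:inline*/)
}
```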


@randall77 randall77 commented May 30, 2020

Not sure I see the argument here. Im not arguing for function annotation, I explicitly argued for a callsite annotation.

I'm not specifically talking about your suggestion. Just the general idea of inline annotations, both callsite and declaration site.

Why wouldn't be callsite annotations be enough for library functions?

func (t *T) Foo() int {
    return t.x + t.y
}

With callsite annotations, every place t.Foo is used needs an annotation, as opposed to just once at the definition. This will lead to library authors adding "This function will perform better if you add an @inline annotation at each call site" to their docs.

Hence why I specifically argue about an annotation on the callsite and not on the function. I agree library authors sometime feel the pressure you're mentioning, but as I clearly stated manual inlining is happening today for that very reason, and therefore I find it hard to consider this as a very compelling argument against the callsite annotation.

This will happen with both callsite or declaration annotations, just at a slightly different level.

func Foo() {
   internalFoo()
}

If the library containing Foo isn't a hot path in my application, I'd rather not have Foo inlined into its call sites, or have internalFoo inlined into Foo. So an inline annotation at the declaration of Foo and/or one at the callsite of internalFoo are both at least unnecessary and likely code-size harmful (assuming that the bodies were big enough to not be covered by the inlining heuristic we have today).


@mvdan mvdan commented May 30, 2020

Quick observation: right now, the manual inlining generally only happens within a module or even a single package, not crossing module boundaries. Perhaps some form of directive could have a similar restriction, to prevent public APIs from being littered with dubious performance directives. Or perhaps we could even restrict it to unexported functions.


@zeebo zeebo commented May 30, 2020

I was seconds away from posting a similar suggestion about only allowing inlining annotations on unexported functions.

That would cover every case I've ever run into and avoid all of the terrible manual inlining I have done for substantial gains over the literal years.


edit: The main pain with not having control comes from exploratory programming. It is a ton of work to manually inline, find out it's not a great idea, and then undo. Then do that a ton over many refactorings that could change the assumptions. I would personally be happy manually inlining still if it wasn't so painful to discover when it was a good idea for a specific case. So even if the annotations were ignored unless compiled with a special flag, I'd still be happy.

@CAFxX
Copy link
Contributor

@CAFxX CAFxX commented May 31, 2020

I'm not specifically talking about your suggestion. Just the general idea of inline annotations, both callsite and declaration site.

I was just pointing out that the arguments you made don't really apply to callsite annotations as much as they apply to function annotations, though. Conflating the two doesn't seem to help the discussion.

With callsite annotations every place t.Foo is used needs an annotation.

Each place where the caller cares enough and the inliner doesn't do it automatically, yes. That is pretty much exactly what we want, to avoid the copy-paste scenario. Why would that be an argument against callsite annotations? We all agree it would be ideal if the inliner was better so that we didn't have to; the assumption is that it likely won't become so much better that there will be no need for manual control.

As opposed to just once at the definition. Which will lead to library authors adding "This function will perform better if you add a @inline annotation at each call site" to their docs.

That is pretty much what is already happening today, with the difference that right now the inlining has to be done manually. Callsite annotations wouldn't make this worse; on the contrary, not only would it avoid having to duplicate code, it would potentially even avoid having to expose additional APIs (like the one linked above).

This will happen with either callsite or declaration annotations, just at a slightly different level.

My point was that it is already happening today, with the difference that right now people have to manually inline code. That point still applies.

Furthermore, I think we agree it would happen less in the case of callsite annotations.

func Foo() {
   internalFoo()
}

If the library containing Foo isn't a hot path in my application, I'd rather not have Foo inlined into its call sites, or have internalFoo inlined into Foo. So an inline annotation at the declaration of Foo and/or one at the callsite of internalFoo are both at least unnecessary and likely code-size harmful (assuming that the bodies were big enough to not be covered by the inlining heuristic we have today).

This is definitely one valid argument. I kinda already offered a solution for that though ("even though it's pretty much orthogonal to the proposal itself, when the inliner gets better and starts to actually make decisions based on callsites instead of functions, it may actually transitively propagate it as a hint that the specific callpath is expected to be frequently executed").

Regardless, I wouldn't discount a practical solution just because it doesn't cover 100% of the possible use-cases (as by that standard Go itself likely wouldn't exist)

Quick observation: right now, the manual inlining generally only happens within a module or even a single package, not crossing module boundaries. Perhaps some form of directive could have a similar restriction, to prevent public APIs from being littered with dubious performance directives. Or perhaps we could even restrict it to unexported functions.

One note though: I still think there's value in allowing callsite annotations to request cross-module inlining (I gave one example just above, and can probably dig a few others up). Restricting function annotations to unexported symbols OTOH sounds like a pretty reasonable compromise.

And the annotation is a no-op unless you pass the compiler flag -l=manual.

Or maybe, only callsite annotations in the root module (the module that contains the main package) are enabled by default?

@mvdan
Copy link
Member

@mvdan mvdan commented Jun 20, 2020

One note though: I still think there's value in allowing callsite annotations to request cross-module inlining (I gave one example just above, and can probably dig a few others up).

I think the danger there is that people could add notes to their function definition godoc like "for maximum performance, annotate calls with...". We can strongly warn against that kind of thing, but it will happen anyway. That's why I think starting with unexported funcs would give a large part of the benefit at little risk, so it could be the first step for an annotation.

@renthraysk
Copy link

@renthraysk renthraysk commented Jul 12, 2020

Just hit an annoyance with the inliner in the past few days, reading and writing individual bits.

The Writer.WriteBit() function is inlineable, but only by making two concessions/hacks: making the x argument a uint64, and reusing x for the carry result.

However, ReadBit(), which uses a similar strategy, is "function too complex: cost 88 exceeds budget 80".

@elichai
Copy link

@elichai elichai commented Aug 11, 2020

I'm having a bunch of problems with bad inlining when implementing hash functions.
I had to manually inline functions, move things from one function to another to please the inliner, and more. In the end I increased the perf by 200% (which is huge for hashes) just by manually inlining and similar tweaks.

I also can't use the useful:

func memset(s []byte, c byte) {
	for i := range s {
		s[i] = c
	}
}

Because it doesn't inline range loops, calling memset(s, 0) will call memset and then loop, instead of just calling runtime.memclrNoHeapPointers, which the compiler could have done if it knew that c=0.

@martisch
Copy link
Contributor

@martisch martisch commented Aug 12, 2020

Because it doesn't inline range loops, calling memset(s, 0) will call memset and then loop, instead of just calling runtime.memclrNoHeapPointers, which the compiler could have done if it knew that c=0.

I don't think the existing range-clear optimization will trigger even if the inlining is changed.
That would require not only inlining but also moving the range-clear pattern detection into the SSA pass, which can see that c is a constant in the loop.

Better to write that directly for now:

func memclear(s []byte) {
	for i := range s {
		s[i] = 0
	}
}

If inlining is improved it can take care of the call overhead here.

@yuzefovich
Copy link

@yuzefovich yuzefovich commented Dec 23, 2020

My colleague wrote a tool that folks following this thread might find useful (it allows for adding comments that are "assertions" on the compiled code for checking whether a function (or a call-site) is inlined). We at cockroachdb began introducing these "assertions" into the codebase and are verifying them during the linter test runs.

@FiloSottile
Copy link
Contributor

@FiloSottile FiloSottile commented Jun 5, 2021

I just came out of a fight with the inliner and figured I would report back. I was trying to outline an allocation for a function with multiple return values, and the inliner had a lot of non-obvious opinions.

Here's the version that finally worked, with cost 72.

func (v *Point) ExtendedCoordinates() (X, Y, Z, T *field.Element) {
	var e [4]field.Element
	X, Y, Z, T = v.extendedCoordinates(&e)
	return
}

Both of the more idiomatic ways to write this function have costs above the inlining budget of 80.

Avoiding the named returns costs 84.

func (v *Point) ExtendedCoordinates() (X, Y, Z, T *field.Element) {
	var e [4]field.Element
	return v.extendedCoordinates(&e)
}

Making separate allocations costs 90.

func (v *Point) ExtendedCoordinates() (X, Y, Z, T *field.Element) {
	var X1, Y1, Z1, T1 field.Element
	X, Y, Z, T = v.extendedCoordinates(&X1, &Y1, &Z1, &T1)
	return
}