
cmd/compile: enable mid-stack inlining #19348

Open
davidlazar opened this Issue Mar 1, 2017 · 43 comments

davidlazar commented Mar 1, 2017

Design doc: https://golang.org/design/19348-midstack-inlining

@davidlazar davidlazar added the Proposal label Mar 1, 2017

@davidlazar davidlazar self-assigned this Mar 1, 2017

gopherbot commented Mar 1, 2017

CL https://golang.org/cl/37231 mentions this issue.

gopherbot commented Mar 3, 2017

CL https://golang.org/cl/37233 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 3, 2017

cmd/compile,link: generate PC-value tables with inlining information
In order to generate accurate tracebacks, the runtime needs to know the
inlined call stack for a given PC. This creates two tables per function
for this purpose. The first table is the inlining tree (stored in the
function's funcdata), which has a node containing the file, line, and
function name for every inlined call. The second table is a PC-value
table that maps each PC to a node in the inlining tree (or -1 if the PC
is not the result of inlining).

To give the appearance that inlining hasn't happened, the runtime also
needs the original source position information of inlined AST nodes.
Previously the compiler plastered over the line numbers of inlined AST
nodes with the line number of the call. This meant that the PC-line
table mapped each PC to line number of the outermost call in its inlined
call stack, with no way to access the innermost line number.

Now the compiler retains line numbers of inlined AST nodes and writes
the innermost source position information to the PC-line and PC-file
tables. Some tools and tests expect to see outermost line numbers, so we
provide the OutermostLine function for displaying line info.

To keep track of the inlined call stack for an AST node, we extend the
src.PosBase type with an index into a global inlining tree. Every time
the compiler inlines a call, it creates a node in the global inlining
tree for the call, and writes its index to the PosBase of every inlined
AST node. The parent of this node is the inlining tree index of the
call. -1 signifies no parent.

For each function, the compiler creates a local inlining tree and a
PC-value table mapping each PC to an index in the local tree.  These are
written to an object file, which is read by the linker.  The linker
re-encodes these tables compactly by deduplicating function names and
file names.

This change increases the size of binaries by 4-5%. For example, this is
how the go1 benchmark binary is impacted by this change:

section             old bytes   new bytes   delta
.text               3.49M ± 0%  3.49M ± 0%   +0.06%
.rodata             1.12M ± 0%  1.21M ± 0%   +8.21%
.gopclntab          1.50M ± 0%  1.68M ± 0%  +11.89%
.debug_line          338k ± 0%   435k ± 0%  +28.78%
Total               9.21M ± 0%  9.58M ± 0%   +4.01%

Updates #19348.

Change-Id: Ic4f180c3b516018138236b0c35e0218270d957d3
Reviewed-on: https://go-review.googlesource.com/37231
Run-TryBot: David Lazar <lazard@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>

gopherbot pushed a commit that referenced this issue Mar 3, 2017

runtime: use inlining tables to generate accurate tracebacks
The code in https://play.golang.org/p/aYQPrTtzoK now produces the
following stack trace:

goroutine 1 [running]:
main.(*point).negate(...)
	/tmp/go/main.go:8
main.main()
	/tmp/go/main.go:14 +0x23

Previously the stack trace missed the inlined call:

goroutine 1 [running]:
main.main()
	/tmp/go/main.go:14 +0x23

Fixes #10152.
Updates #19348.

Change-Id: Ib43c67012f53da0ef1a1e69bcafb65b57d9cecb2
Reviewed-on: https://go-review.googlesource.com/37233
Run-TryBot: David Lazar <lazard@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>

dhananjay92 commented Mar 4, 2017

This is awesome.

I probably missed some discussion, but is there a design doc or proposal doc I can look at?

Out of curiosity, is there a plan to emit this information as part of DWARF? It would be a nice feature if debuggers can access the InlTree info (right now they can't print correct backtraces for inlined calls; I confirmed with 781fd39).

davidlazar commented Mar 6, 2017

There is an outdated proposal doc. I'll update and publish it this week. In the meantime, these slides give an overview of the approach: https://golang.org/s/go19inliningtalk

I haven't looked at the DWARF yet, but the plan is to add inlining info to the DWARF tables before we turn on mid-stack inlining for 1.9.

gopherbot commented Mar 6, 2017

CL https://golang.org/cl/37854 mentions this issue.

gopherbot commented Mar 10, 2017

CL https://golang.org/cl/38090 mentions this issue.

gopherbot pushed a commit to golang/proposal that referenced this issue Mar 11, 2017

design: add mid-stack inlining design doc
For golang/go#19348.

Change-Id: Ibf3e3817b35226a33d961e76fedb924e15e37069
Reviewed-on: https://go-review.googlesource.com/38090
Run-TryBot: David Lazar <lazard@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>

@gopherbot gopherbot added this to the Proposal milestone Mar 20, 2017

rsc commented Mar 27, 2017

It seems clear we're going to do this, assuming the right tuning (not yet done!). The tuning itself doesn't have to go through the proposal process. Accepting proposal.

@rsc rsc changed the title from proposal: mid-stack inlining in the Go compiler to cmd/compile: enable mid-stack inlining in the Go compiler Mar 27, 2017

@rsc rsc changed the title from cmd/compile: enable mid-stack inlining in the Go compiler to cmd/compile: enable mid-stack inlining Mar 27, 2017

@rsc rsc modified the milestones: Go1.9Maybe, Proposal Mar 27, 2017

gopherbot pushed a commit that referenced this issue Mar 29, 2017

runtime: handle inlined calls in runtime.Callers
The `skip` argument passed to runtime.Caller and runtime.Callers should
be interpreted as the number of logical calls to skip (rather than the
number of physical stack frames to skip). This changes runtime.Callers
to skip inlined calls in addition to physical stack frames.

The result value of runtime.Callers is a slice of program counters
([]uintptr) representing physical stack frames. If the `skip` parameter
to runtime.Callers skips part-way into a physical frame, there is no
convenient way to encode that in the resulting slice. To avoid changing
the API in an incompatible way, our solution is to store the number of
skipped logical calls of the first frame in the _second_ uintptr
returned by runtime.Callers. Since this number is a small integer, we
encode it as a valid PC value into a small symbol called:

    runtime.skipPleaseUseCallersFrames

For example, if f() calls g(), g() calls `runtime.Callers(2, pcs)`, and
g() is inlined into f, then the frame for f will be partially skipped,
resulting in the following slice:

    pcs = []uintptr{pc_in_f, runtime.skipPleaseUseCallersFrames+1, ...}

We store the skip PC in pcs[1] instead of pcs[0] so that `pcs[i:]` will
truncate the captured stack trace rather than grow it for all i.

Updates #19348.

Change-Id: I1c56f89ac48c29e6f52a5d085567c6d77d499cf1
Reviewed-on: https://go-review.googlesource.com/37854
Run-TryBot: David Lazar <lazard@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
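The caveat in the commit message matters to consumers: a PC slice returned by runtime.Callers should be expanded with runtime.CallersFrames, which decodes the inlining information (including the skip sentinel) into logical frames, rather than resolving each PC individually with runtime.FuncForPC. A minimal sketch of the intended usage:

```go
package main

import (
	"fmt"
	"runtime"
)

// callerLocation reports the file:line of its caller. The PC slice is
// expanded with runtime.CallersFrames, which understands inlined frames,
// instead of resolving each PC one at a time with runtime.FuncForPC.
func callerLocation() string {
	pcs := make([]uintptr, 8)
	n := runtime.Callers(2, pcs) // skip runtime.Callers and callerLocation
	frames := runtime.CallersFrames(pcs[:n])
	frame, _ := frames.Next() // first logical frame, inlined or not
	return fmt.Sprintf("%s:%d", frame.File, frame.Line)
}

func main() {
	fmt.Println(callerLocation())
}
```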
bilokurov commented Apr 7, 2017

It seems that func Caller(skip int) in runtime/extern.go also needs to be updated for this change, as it currently calls findfunc(pc), similarly to FuncForPC.

davidlazar commented Apr 7, 2017

Indeed. I have a CL that updates runtime.Caller but haven't mailed it out yet.

gopherbot commented Apr 10, 2017

CL https://golang.org/cl/40270 mentions this issue.

lparth added a commit to lparth/go that referenced this issue Apr 13, 2017

runtime: handle inlined calls in runtime.Callers

gopherbot pushed a commit that referenced this issue Apr 18, 2017

runtime: skip logical frames in runtime.Caller
This rewrites runtime.Caller in terms of stackExpander, which already
handles inlined frames and partially skipped frames. This also has the
effect of making runtime.Caller understand cgo frames if there is a cgo
symbolizer.

Updates #19348.

Change-Id: Icdf4df921aab5aa394d4d92e3becc4dd169c9a6e
Reviewed-on: https://go-review.googlesource.com/40270
Run-TryBot: David Lazar <lazard@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
cespare commented Jun 8, 2017

Is -l=4 going to be the default for Go 1.9?

dr2chase commented Jun 8, 2017

Not yet; it has high compilation costs, largely because we need to be much pickier about how we read export data.

bradfitz commented Jun 8, 2017

> Is -l=4 going to be the default for Go 1.9?

No.

Maybe in Go 1.10.

randall77 commented Feb 26, 2018

We'd like to get this done for 1.11.

I think the needed export format changes are done. I think Matthew has some more changes lined up to make things better (compile-time faster), but at this point they aren't blockers.

The major TODO at this point is to tune the inlining heuristic. Mid-stack inlining helps runtime speed, but it can make binaries bigger. A lot bigger in some cases; cmd/compile's text segment gets ~100% bigger. I don't think that's launchable as-is, so we need to figure out the right way to tweak the heuristics to preserve as much speed as we can while keeping binary size manageable. Ideas welcome; there's no obvious plan of attack here.

Yes, we'd definitely like to get rid of all the situations where people have had to manually inline things.

mvdan commented Feb 26, 2018

Has any thought been given to enabling a conservative version of mid-stack inlining in 1.11? That is, only doing the extra inlining where it means little or no increase in binary size.

dlsniper commented Feb 26, 2018

@randall77 would you consider having a conservative version, as @mvdan suggested, but allowing users to also experiment with this by providing a compiler directive, like //go:inline, which could perform the inline but only up to a max defined complexity?

randall77 commented Feb 26, 2018

@mvdan: That's an option. It's not trivial to do, though, as at inlining decision time we don't know what the final binary size difference is going to end up being. We have to more or less guess based on the info we do have (# and kind of AST nodes).

@dlsniper: I'd like to avoid a //go:inline comment if we can. I don't think it solves the problem well, as the inlining decision should probably depend on characteristics of the call site (e.g. in a loop, constant arguments, etc.), not just the function being called.

mvdan commented Feb 26, 2018

I was thinking conservative in terms of the heuristic. For example, every extra level of inlining could increase the cost of the function by a constant, or by a percent.

I assume that this will come down to lots of testing and gathering of data, though. I'm not sure how useful it is to throw ideas at this issue before then :)
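One way to read that suggestion, sketched with invented names and numbers (nothing here reflects cmd/compile's actual cost model):

```go
package main

import "fmt"

// inlineOK applies a hypothetical depth penalty: every extra level of
// inlining inflates the callee's cost by a constant before comparison
// against a flat budget, so deeply nested inlining has to pay more.
func inlineOK(cost, depth int) bool {
	const budget = 80          // assumed flat inlining budget
	const perLevelPenalty = 20 // assumed constant per extra inline level
	return cost+depth*perLevelPenalty <= budget
}

func main() {
	fmt.Println(inlineOK(80, 0)) // true: leaf call fits exactly
	fmt.Println(inlineOK(80, 1)) // false: one level deep is now over budget
}
```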

CAFxX commented Feb 26, 2018

Maybe a silly idea, but what if, at least for this first version, mid-stack inlining were enabled only for functions that are transitively statically reachable from a benchmark function in the same package? It would be nice to extend this to use actual profiling information in the future.

CAFxX commented Feb 26, 2018

@randall77 what if the //go:inline was used to mark the callsite instead?

randall77 commented Feb 26, 2018

Transitively reachable from a benchmark function sounds problematic. When doing a non-test build, the compiler never sees _test.go files, which is where all the Benchmark functions tend to be. And having the existence of a Benchmark function affect the performance of the function being benchmarked sounds like a recipe for HeisenBugs. On the plus side, though, it would encourage the writing of Benchmark functions.

The compiler has no support for //go: directives at statement or expression scope, only at global scope. Not that it couldn't be added, but it's significant work.
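For contrast, the function-level directives that do exist attach to a declaration, not to a statement or call site; //go:noinline is one example:

```go
package main

import "fmt"

// Function-level directives such as //go:noinline apply to the declaration
// that immediately follows them; there is no statement- or call-site-scoped
// equivalent in the compiler today.
//
//go:noinline
func add(a, b int) int {
	return a + b
}

func main() {
	fmt.Println(add(1, 2)) // prints 3; add is compiled as a real call
}
```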

chewxy commented Feb 26, 2018

Is there a way to track the inlined code such that #14840 would then be useful and eliminate more dead code? The inlining process is going to touch the linker anyway; might as well make it useful?

randall77 commented Feb 26, 2018

@chewxy The inlining process does not involve the linker. The compiler does ~all the work.

The inlining process will detect and remove dead code. If that ends up removing the last reference to a global, that global will be removed by the linker. But I don't think that will help with #14840, which is about globals with init functions.

CAFxX commented Feb 27, 2018

> The major TODO at this point is to tune the inlining heuristic. Mid-stack inlining helps runtime speed, but it can make binaries bigger. A lot bigger in some cases; cmd/compile's text segment gets ~100% bigger. I don't think that's launchable as-is, so we need to figure out the right way to tweak the heuristics to preserve as much speed as we can while keeping binary size manageable. Ideas welcome; there's no obvious plan of attack here.

Silly idea # 2: how about brute forcing this? 💪

  • Grab an intern 🥇
  • Gather a corpus of Go code with (macro?)benchmarks
  • For each benchmark measure speed (+allocations?) and text size with inlining disabled (baseline)
  • For each benchmark measure the same as above, with "random" inlining decisions in the functions that are transitively called by it; have the compiler log those decisions (repeat this step many times to generate many measures)
  • Run some fancy ML method on the corpus of inlining decisions and benchmark results (relative to the baseline) to identify a set of inlining heuristics that yield good improvements at the expense of a reasonable increase in text size. 〰️👋
  • Profit! Implement in the inliner the heuristics identified above 👌

As a bonus point, the intern gets to write a paper about this. 🤣

dlsniper commented Feb 27, 2018

> @dlsniper: I'd like to avoid a //go:inline comment if we can. I don't think it solves the problem well, as the inlining decision should probably depend on characteristics of the call site (e.g. in a loop, constant arguments, etc.), not just the function being called.

> The compiler has no support for //go: directives at statement or expression scope, only at global scope. Not that it couldn't be added, but it's significant work.

@randall77 thank you for replying so quickly on this. I understand that there is a fair amount of work involved, and also concern about how users would use this functionality. I think the approach of having this enabled by default, with conservative defaults, would be a good start.

However, what I have in mind in suggesting a //go:inline directive at the call site is that experienced users will understand how to use it and will be able to verify, via benchmarks, which approach works better for them when the compiler defaults are not enough.

From there, its use in the wild could be collected as feedback, allowing further experimentation with and changes to the heuristics and defaults. Much like what @CAFxX suggested, but without dedicating an intern and a lot of hardware to running benchmarks. For example, in all of my use cases so far, I would gladly trade a few more MB of binary size for better runtime speed. I understand that others may not wish to do the same, which is why I think satisfying all these requirements would be better left to the users.

Another interesting aspect of having this as a compiler directive is that it would allow fine-tuning the standard library by performing analysis on the existing benchmarks.

  • Do I think it could be abused / misused by users who do not understand what this option does? Yes, I do. But then the burden would be entirely on the users rather than on the Go team to figure out "the best" way to move forward with this.
  • Who are the people I am targeting with this option? People who understand what they are doing and want to fine-tune their code at a level they cannot today, which I believe is a good step toward giving users some control while providing solid defaults.
  • Do I like the idea of introducing more magic directives to the compiler? I do not, but I also do not see another way to give these hints to the compiler.

Hope this helps. I'll continue to watch this issue and look forward to how this works out. Thank you.

ugorji commented Feb 27, 2018

@dlsniper @randall77 My only concern with enforcing //go:inline is that it only scales for final executables, not for libraries. Imagine I put //go:inline all over my lib, and a user depends on my lib. It wasn't the user's decision; it was the author of the lib forcing his decision on the users.

If we do //go:inline, let it be a hint to the compiler: if this function doesn't make the cut but is within reason, please inline it. E.g. say only functions up to a cost of 10 are inlined, but my function has a cost of 12, so it will not be inlined by default. Since its cost is within 30% of the threshold (i.e. less than 10 + 30% = 13) and the author says "please inline", it will be inlined; but if the cost is more than 30% over, the hint will be disregarded.

This is similar to how c++ inline keyword works, as a hint.

Now, I personally don't want a //go:inline. I prefer that the "general" (conservative) rules/heuristics for inlining are reasonable and fair and known and published. The compiler can still tweak outside of the general/published/conservative rules, but authors will work within those published ones and be happy.

My 2 cents.

mvdan (Member) commented Feb 27, 2018

As far as I know, it has been core to Go's design (including its compiler) to have as few knobs and flags as possible. This includes flags like -O4 and compiler directives in the code.

There have been no knobs to control inlining until now; why should enabling mid-stack inlining change that?


aclements (Member) commented Mar 7, 2018

The problem with inlining directives is that people are notoriously bad at maintaining them as needs shift and code changes. We've resisted exposing inlining directives even to the runtime (which has several directives not available to user code) because we know they'll get stale and lead to code bloat and less performance. Instead, we have a test that checks that key functions are being inlined by the compiler's heuristics; and even that list gets out of date quickly.

Our long-term (albeit vague) plan is to use profile-guided optimization to make inlining decisions, rather than hand-crafted heuristics or developer annotations. It'll take a while to get there, but it fits very nicely with the Go model of doing the right thing automatically.


ugorji (Contributor) commented Apr 23, 2018

@dlsniper @randall77 @aclements Given that code-freeze is in a week, will this be making it into Go 1.11 in some form? There seems to have been zero movement here.

There are some clear wins here, even with simple heuristics, e.g. inlining leaf functions that panic, short delegate functions that just call other functions, and leaf functions with switch statements. These will make reflect faster, which will impact just about every Go program (using json, fmt, etc.).

Thanks.


randall77 (Contributor) commented Apr 23, 2018

We're not sure yet. We'd like to get something in, but the current heuristics are too aggressive. We've seen code size blowups of 100%. We've also seen net slowdowns.

We're thinking of trying to enable this for 1.11, but with a much stricter heuristic. But we don't know what that heuristic might be yet. Unfortunately, this is no one's top priority at the moment.

If you have particular programs that do get significant speedups from mid-stack inlining, please post them. It will help us guide the choice of heuristic.


@dr2chase dr2chase modified the milestones: Go1.11, Go1.12 May 31, 2018

dr2chase (Contributor) commented May 31, 2018

See also: https://go-review.googlesource.com/c/go/+/109918

Quick summary of where we are:

  • calling panic no longer forbids inlining.
  • -l=4 gets midstack inlining, if you want to play with it. Binaries get bigger, compiles take longer. We're interested in feedback on how this works for people, especially when it doesn't work.
  • the compiler itself is not helped by midstack inlining; a -l=4-built (but not -l=4-compiling) compiler runs slower than normal.
  • some benchmarks speed up nicely

The bigger+slower compiler is worrisome, which is the main reason this is not enabled yet; if this happened to your binary, you'd not be happy. The minimum plan is to understand how to manage inlining so bigger+slower at least doesn't happen to the compiler, and hope that it generalizes. A more ambitious plan is to build some sort of a feedback framework so that it's clear where inlining would actually help, instead of just guessing. Or we could use machine learning....


CAFxX (Contributor) commented Jun 1, 2018

Is -l=4 still unsupported for production use? Or is it now supported for production but with potential performance regressions (like, say, -O3)?


dr2chase (Contributor) commented Jun 1, 2018

I am not sure of the official position, and it's not tested as much as it should be (i.e., I need to see about whether we can have a -l=4 test box), but it's supposed to at least execute correctly, and we'd like to know when it doesn't, which I think is different from "you own the pieces". Debugging is also not well-tested for -l=4 binaries.

I've been rebenchmarking the compiler to check how inlining changes its performance, and the short answer is it isn't worse, but it isn't better, and without a noinline annotation on one method it doubles the size of the binary (with the annotation, it's only 50% larger). I don't think we want the noinline annotations to become part of common practice for using go (we use them in tests, very helpful there) but on the other hand it can also be a good way of figuring out the sort of inlining mistakes the compiler needs to not make in order to turn this on in general.


btracey (Contributor) commented Jun 1, 2018

I ran some of our BLAS benchmarks with no significant effect. Do you know if functions with asm stubs can be inlined (assuming the build tags are such that the asm is not actually used)?


chromium-infra-bot pushed a commit to luci/luci-go that referenced this issue Jun 2, 2018

stringset: switch Set to a concrete type
There had been a plan to make a concurrent safe Set implementation but it's not
happening. Replace the interface Set with the concrete type set, which turns the
method through an interface to direct calls, which can be inlined by the
compiler.

Because of golang/go#19348, inline all the functions
(it doesn't make the code longer!) to ensure that most methods can be inlined.

Change-Id: Ib8dd36a16ab57c1383edd236ad62b78b4a10091c
Reviewed-on: https://chromium-review.googlesource.com/1083531
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
davecheney (Contributor) commented Jun 2, 2018

laboger (Contributor) commented Jun 4, 2018

> I've been rebenchmarking the compiler to check how inlining changes its performance, and the short answer is it isn't worse, but it isn't better, and without a noinline annotation on one method it doubles the size of the binary (with the annotation, it's only 50% larger).

Is this a statement for amd64 or for other GOARCHes too? I would expect to see more improvement on ppc64x because of the high cost of loading and storing the arguments and return values.


ugorji (Contributor) commented Sep 22, 2018

Ping. Any chance this gets in for Go 1.12?


dr2chase (Contributor) commented Sep 23, 2018

Someone needs to look into better heuristics, because the current rules tend to over-bloat the generated binary. Someone is not supposed to be me, though I really want it to happen.

