Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: no closure inlining even in the simple case? #35196

Open
ivoras opened this issue Oct 27, 2019 · 10 comments
Open

cmd/compile: no closure inlining even in the simple case? #35196

ivoras opened this issue Oct 27, 2019 · 10 comments

Comments

@ivoras
Copy link

@ivoras ivoras commented Oct 27, 2019

I hesitate to call this an issue, since I know it's a complex topic, and I've read about half a dozen issue reports touching this topic, which kind-of sort-of fix some inlining cases, but not all of them.

I'm using a simple pattern for avoiding lock leakage:

// WithRWMutex extends the RWMutex type with convenient .With(func) functions
type WithRWMutex struct {
	sync.RWMutex
}

// WithRLock executes the given function with the mutex rlocked
func (m *WithRWMutex) WithRLock(f func()) {
	m.RWMutex.RLock()
	f()
	m.RWMutex.RUnlock()
}

// WithWLock executes the given function with the mutex wlocked
func (m *WithRWMutex) WithWLock(f func()) {
	m.RWMutex.Lock()
	f()
	m.RWMutex.Unlock()
}

in practice it's used like this:

beforeCode()
o.WithWLock(func() {
 doSomething(o)
})
afterCode()

Full example here: https://go.godbolt.org/z/mkvMbZ

I'd expect that when the closure passed to WithWlock() is simple enough (i.e. no need for a new stack), inlining would collapse all the scaffolding and compile it to something like:

beforeCode()
m.RWMutex.Lock()
doSomething(o)
m.RWMutex.Unlock()
afterCode()

But looking at the assembly code (https://go.godbolt.org/z/mkvMbZ), the inner function is compiled once, its address taken, passed to the WithWLock() function, i.e. a full closure call is being done.

This is on go 1.13.

It seems that a full closure call is an expensive choice for this particular pattern?

@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Oct 27, 2019

Our current inline heuristic thinks that (*WithRWMutex).WithRLock is too expensive to inline. Its body contains 3 calls, which is above the cost threshold.

Even if I hack the compiler to inline in this case, it still doesn't inline the anonymous func. But it does do so in simpler cases. Not sure what's going on there, probably a phase ordering issue.

@ivoras

This comment has been minimized.

Copy link
Author

@ivoras ivoras commented Oct 27, 2019

Hmmm... which criteria is used which results in 3 function calls being considered expensive? Wouldn't the assembly for that case be something like 3 CALL instructions with some low-weight scaffolding for the arguments and the stack?

Or is it just the middle call, the one which calls the closure function, the one which drives the cost up, since it's a call via a function pointer?

Re: hacking the compiler - could the "inlining threshold" be a compiler argument?

@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Oct 27, 2019

Hmmm... which criteria is used which results in 3 function calls being considered expensive? Wouldn't the assembly for that case be something like 3 CALL instructions with some low-weight scaffolding for the arguments and the stack?

We have a cost budget of 80, and each call is cost 57. That normally allows only one call in an inlineable body (with a few exceptions, like intrinsics). See #19348 (comment)

Or is it just the middle call, the one which calls the closure function, the one which drives the cost up, since it's a call via a function pointer?

The closureness of the call isn't used as part of the cost, other than it must be able to resolve the target function.

Re: hacking the compiler - could the "inlining threshold" be a compiler argument?

We do have a -l argument to the compiler which adjusts inlining decisions. But that flag is mostly for testing/debugging, it's not intended for performance tuning. In general we try to avoid providing knobs like this, as it leads to all kinds of thorny questions, like "what if two packages want two different inlining thresholds?" "is is the threshold of the calling package, or the called package?".

@ivoras

This comment has been minimized.

Copy link
Author

@ivoras ivoras commented Oct 27, 2019

Thank you, I appreciate your answers and your expertise.

On my system, the -l flag is documented as "disable inlining" so not really usable for experimenting.

But even if inlining worked, is the compiler smart enough to inline the closure as a whole inside the one and only caller (since the closure is an anonymous function used exactly once)?

FWIW, 2 more cents: I'm guessing the issue is because "go build" also builds required packages, so there could be mixed results from multiple invocations. Doesn't that point to flags like inline thresholds being global settings on a build machine, for all packages. Maybe in a config file, an environment variable, etc.?

@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Oct 27, 2019

But even if inlining worked, is the compiler smart enough to inline the closure as a whole inside the one and only caller (since the closure is an anonymous function used exactly once)?

This code optimizes to just a call to doSomething.

	func() {
		doSomething(o)
	}()

FWIW, 2 more cents: I'm guessing the issue is because "go build" also builds required packages, so there could be mixed results from multiple invocations. Doesn't that point to flags like inline thresholds being global settings on a build machine, for all packages. Maybe in a config file, an environment variable, etc.?

Sure, but that doesn't really solve the problem, it just punts it to the user. I import 2 packages, A and B. A says "I am fastest when the inlining threshold is 63". B says "I am fastest when the inlining threshold is 95". What should I set my inlining threshold to?

We've basically decided that tuning to this level is not worth the complication it introduces to the ecosystem.

@FiloSottile FiloSottile changed the title No closure inlining even in the simple case? cmd/compile: no closure inlining even in the simple case? Oct 28, 2019
@FiloSottile FiloSottile added this to the Unplanned milestone Oct 28, 2019
@cespare

This comment has been minimized.

Copy link
Contributor

@cespare cespare commented Feb 12, 2020

I'm wondering about what I think is this issue, but in reading through this thread a few times I'm not exactly sure what the precise problem under discussion is. @randall77 said

Even if I hack the compiler to inline in this case, it still doesn't inline the anonymous func. But it does do so in simpler cases. Not sure what's going on there, probably a phase ordering issue.

Is that the core problem here to be fixed?

In my case, I'm looking at code like this:

// +build ignore

package main

import (
	"fmt"
)

type M struct {
	bs []float64
}

func (m *M) E(ff func(b float64) float64) float64 {
	var sum float64
	for _, b := range m.bs {
		sum += ff(b)
	}
	return sum
}

func f(b float64) float64 {
	return b / 12345
}

func main() {
	m := &M{bs: []float64{1, 2, 3, 4, 5}}
	fmt.Println(m.E(f))
}

The call to ff won't get inlined:

$ go1.14rc1 run -gcflags '-m -m' test.go
# command-line-arguments
./test.go:13:6: cannot inline (*M).E: unhandled op RANGE
./test.go:21:6: can inline f as: func(float64) float64 { return b / 12345 }
./test.go:25:6: cannot inline main: function too complex: cost 150 exceeds budget 80
./test.go:27:13: inlining call to fmt.Println func(...interface {}) (int, error) { var fmt..autotmp_3 int; fmt..autotmp_3 = <N>; var fmt..autotmp_4 error; fmt..autotmp_4 = <N>; fmt..autotmp_3, fmt..autotmp_4 = fmt.Fprintln(io.Writer(os.Stdout), fmt.a...); return fmt..autotmp_3, fmt..autotmp_4 }
./test.go:13:7: m does not escape
./test.go:13:15: ff does not escape
./test.go:27:17: m.E(f) escapes to heap:
./test.go:27:17:   flow: ~arg0 = &{storage for m.E(f)}:
./test.go:27:17:     from m.E(f) (spill) at ./test.go:27:17
./test.go:27:17:     from ~arg0 = <N> (assign-pair) at ./test.go:27:13
./test.go:27:17:   flow: {storage for []interface {} literal} = ~arg0:
./test.go:27:17:     from []interface {} literal (slice-literal-element) at ./test.go:27:13
./test.go:27:17:   flow: fmt.a = &{storage for []interface {} literal}:
./test.go:27:17:     from []interface {} literal (spill) at ./test.go:27:13
./test.go:27:17:     from fmt.a = []interface {} literal (assign) at ./test.go:27:13
./test.go:27:17:   flow: {heap} = *fmt.a:
./test.go:27:17:     from fmt.Fprintln(io.Writer(os.Stdout), fmt.a...) (call parameter) at ./test.go:27:13
./test.go:26:7: &M literal does not escape
./test.go:26:23: []float64 literal does not escape
./test.go:27:17: m.E(f) escapes to heap
./test.go:27:13: []interface {} literal does not escape
<autogenerated>:1: .this does not escape
0.0012150668286755773

However, if I call f directly rather than passing it as a function argument, it is inlined:

// +build ignore

package main

import (
	"fmt"
)

type M struct {
	bs []float64
}

func (m *M) E() float64 {
	var sum float64
	for _, b := range m.bs {
		sum += f(b)
	}
	return sum
}

func f(b float64) float64 {
	return b / 12345
}

func main() {
	m := &M{bs: []float64{1, 2, 3, 4, 5}}
	fmt.Println(m.E())
}
$ go1.14rc1 run -gcflags '-m -m' test.go
# command-line-arguments
./test.go:21:6: can inline f as: func(float64) float64 { return b / 12345 }
./test.go:13:6: cannot inline (*M).E: unhandled op RANGE
./test.go:16:11: inlining call to f func(float64) float64 { return b / 12345 }
./test.go:25:6: cannot inline main: function too complex: cost 149 exceeds budget 80
./test.go:27:13: inlining call to fmt.Println func(...interface {}) (int, error) { var fmt..autotmp_3 int; fmt..autotmp_3 = <N>; var fmt..autotmp_4 error; fmt..autotmp_4 = <N>; fmt..autotmp_3, fmt..autotmp_4 = fmt.Fprintln(io.Writer(os.Stdout), fmt.a...); return fmt..autotmp_3, fmt..autotmp_4 }
./test.go:13:7: m does not escape
./test.go:27:17: m.E() escapes to heap:
./test.go:27:17:   flow: ~arg0 = &{storage for m.E()}:
./test.go:27:17:     from m.E() (spill) at ./test.go:27:17
./test.go:27:17:     from ~arg0 = <N> (assign-pair) at ./test.go:27:13
./test.go:27:17:   flow: {storage for []interface {} literal} = ~arg0:
./test.go:27:17:     from []interface {} literal (slice-literal-element) at ./test.go:27:13
./test.go:27:17:   flow: fmt.a = &{storage for []interface {} literal}:
./test.go:27:17:     from []interface {} literal (spill) at ./test.go:27:13
./test.go:27:17:     from fmt.a = []interface {} literal (assign) at ./test.go:27:13
./test.go:27:17:   flow: {heap} = *fmt.a:
./test.go:27:17:     from fmt.Fprintln(io.Writer(os.Stdout), fmt.a...) (call parameter) at ./test.go:27:13
./test.go:26:7: &M literal does not escape
./test.go:26:23: []float64 literal does not escape
./test.go:27:17: m.E() escapes to heap
./test.go:27:13: []interface {} literal does not escape
<autogenerated>:1: .this does not escape
0.0012150668286755773

Is that the same as this issue? Is it another issue? Should I open a new issue?

@ivoras

This comment has been minimized.

Copy link
Author

@ivoras ivoras commented Feb 12, 2020

Heavy edit: I thought the issue was something else.

How do you know that ff in the first case won't be inlined? Neither outputs mention ff in the context of inlining?

Regarding whether it's similar or not - the original issue is about anonymous function, which I think warrant aggressive inlining since the majority of them will only be called from a single context (judging how code translates to inlining budget, I'd say push it to 500, that should cover many logging / formatting handlers, various wrappers and error handlers).

@thanm

This comment has been minimized.

Copy link
Member

@thanm thanm commented Feb 12, 2020

Setting aside the original example and looking at this one (passing function f to method E):

  • the inliner currently won't inline any function that contains a for+range loop (you can see this in the -m output, "unhandled op RANGE")

  • given that 'E' is not inlined into main, then at that point the body of E has an indirect call. The Go compiler currently only inlines direct calls; there is no special magic to analyze the program and determine the possible set of indirect call targets. In particular, in the general case this requires interprocedural analysis (build a call graph, etc).

In your second example you've replaced the indirect call with a direct call; the Go compiler has no problems with inlining those as long as the callee is small enough (and various other criteria are met).

Hope this helps.

@cespare

This comment has been minimized.

Copy link
Contributor

@cespare cespare commented Feb 12, 2020

@ivoras

How do you know that ff in the first case won't be inlined? Neither outputs mention ff in the context of inlining?

If it were inlined, it would be mentioned in the -gcflags '-m -m' output. In the second case where the direct call is inlined, it prints:

./test.go:16:11: inlining call to f func(float64) float64 { return b / 12345 }

(But I also confirmed by looking at the assembly.)

@cespare

This comment has been minimized.

Copy link
Contributor

@cespare cespare commented Feb 12, 2020

@thanm thanks, that is very helpful. To summarize, it sounds like the key optimization for me would be inlining of indirect calls (in some cases where the target is static), which is difficult because it requires heavyweight analysis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.