Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: mid-stack inline dispatch functions that call a function on each path #30548

Open
josharian opened this issue Mar 3, 2019 · 24 comments

Comments

@josharian
Copy link
Contributor

@josharian josharian commented Mar 3, 2019

package p

func dispatch(x int) int {
	if x < 10 {
		return small(x)
	}
	return big(x)
}

//go:noinline
func small(x int) int { return 0 }

//go:noinline
func big(x int) int { return 1 }

dispatch should be inlined: It is very simple, and on either path through the function we call exactly one other simple function.

It is not:

x.go:3:6: cannot inline dispatch: function too complex: cost 126 exceeds budget 80

This is because we attach a significant cost to each function call.

Note that we are happy to inline this dispatch function:

func dispatch(x int) int {
	fn := big
	if x < 10 {
		fn = small
	}
	return fn(x)
}

Fixing this requires either some special casing to recognize common patterns (like the original dispatch) or a bit of control flow analysis.

cc @randall77 @dr2chase

@gopherbot
Copy link

@gopherbot gopherbot commented Mar 3, 2019

Change https://golang.org/cl/164968 mentions this issue: math/big: add fast path for pure Go addVW for large z

gopherbot pushed a commit that referenced this issue Mar 9, 2019
In the normal case, only a few words have to be updated when adding a word to a vector.
When that happens, we can simply copy the rest of the words, which is much faster.
However, the overhead of that makes it prohibitive for small vectors,
so we check the size at the beginning.

The implementation is a bit weird to allow addVW to continued to be inlined; see #30548.

The AddVW benchmarks are surprising, but fully repeatable.
The SubVW benchmarks are more or less as expected.
I expect that removing the indirect function call will
help both and make them a bit more normal.

name            old time/op    new time/op     delta
AddVW/1-8         4.27ns ± 2%     3.81ns ± 3%   -10.83%  (p=0.000 n=89+90)
AddVW/2-8         4.91ns ± 2%     4.34ns ± 1%   -11.60%  (p=0.000 n=83+90)
AddVW/3-8         5.77ns ± 4%     5.76ns ± 2%      ~     (p=0.365 n=91+87)
AddVW/4-8         6.03ns ± 1%     6.03ns ± 1%      ~     (p=0.392 n=80+76)
AddVW/5-8         6.48ns ± 2%     6.63ns ± 1%    +2.27%  (p=0.000 n=76+74)
AddVW/10-8        9.56ns ± 2%     9.56ns ± 1%    -0.02%  (p=0.002 n=69+76)
AddVW/100-8       90.6ns ± 0%     18.1ns ± 4%   -79.99%  (p=0.000 n=72+94)
AddVW/1000-8       865ns ± 0%       85ns ± 6%   -90.14%  (p=0.000 n=66+96)
AddVW/10000-8     8.57µs ± 2%     1.82µs ± 3%   -78.73%  (p=0.000 n=99+94)
AddVW/100000-8    84.4µs ± 2%     31.8µs ± 4%   -62.29%  (p=0.000 n=93+98)

name            old time/op    new time/op     delta
SubVW/1-8         3.90ns ± 2%     4.13ns ± 4%    +6.02%  (p=0.000 n=92+95)
SubVW/2-8         4.15ns ± 1%     5.20ns ± 1%   +25.22%  (p=0.000 n=83+85)
SubVW/3-8         5.50ns ± 2%     6.22ns ± 6%   +13.21%  (p=0.000 n=91+97)
SubVW/4-8         5.99ns ± 1%     6.63ns ± 1%   +10.63%  (p=0.000 n=79+61)
SubVW/5-8         6.75ns ± 4%     6.88ns ± 2%    +1.82%  (p=0.000 n=98+73)
SubVW/10-8        9.57ns ± 1%     9.56ns ± 1%    -0.13%  (p=0.000 n=77+64)
SubVW/100-8       90.3ns ± 1%     18.1ns ± 2%   -80.00%  (p=0.000 n=75+94)
SubVW/1000-8       860ns ± 4%       85ns ± 7%   -90.14%  (p=0.000 n=97+99)
SubVW/10000-8     8.51µs ± 3%     1.77µs ± 6%   -79.21%  (p=0.000 n=100+97)
SubVW/100000-8    84.4µs ± 3%     31.5µs ± 3%   -62.66%  (p=0.000 n=92+92)

Change-Id: I721d7031d40f245b4a284f5bdd93e7bb85e7e937
Reviewed-on: https://go-review.googlesource.com/c/go/+/164968
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
josharian added a commit to josharian/go that referenced this issue May 7, 2019
In the normal case, only a few words have to be updated when adding a word to a vector.
When that happens, we can simply copy the rest of the words, which is much faster.
However, the overhead of that makes it prohibitive for small vectors,
so we check the size at the beginning.

The implementation is a bit weird to allow addVW to continued to be inlined; see golang#30548.

The AddVW benchmarks are surprising, but fully repeatable.
The SubVW benchmarks are more or less as expected.
I expect that removing the indirect function call will
help both and make them a bit more normal.

name            old time/op    new time/op     delta
AddVW/1-8         4.27ns ± 2%     3.81ns ± 3%   -10.83%  (p=0.000 n=89+90)
AddVW/2-8         4.91ns ± 2%     4.34ns ± 1%   -11.60%  (p=0.000 n=83+90)
AddVW/3-8         5.77ns ± 4%     5.76ns ± 2%      ~     (p=0.365 n=91+87)
AddVW/4-8         6.03ns ± 1%     6.03ns ± 1%      ~     (p=0.392 n=80+76)
AddVW/5-8         6.48ns ± 2%     6.63ns ± 1%    +2.27%  (p=0.000 n=76+74)
AddVW/10-8        9.56ns ± 2%     9.56ns ± 1%    -0.02%  (p=0.002 n=69+76)
AddVW/100-8       90.6ns ± 0%     18.1ns ± 4%   -79.99%  (p=0.000 n=72+94)
AddVW/1000-8       865ns ± 0%       85ns ± 6%   -90.14%  (p=0.000 n=66+96)
AddVW/10000-8     8.57µs ± 2%     1.82µs ± 3%   -78.73%  (p=0.000 n=99+94)
AddVW/100000-8    84.4µs ± 2%     31.8µs ± 4%   -62.29%  (p=0.000 n=93+98)

name            old time/op    new time/op     delta
SubVW/1-8         3.90ns ± 2%     4.13ns ± 4%    +6.02%  (p=0.000 n=92+95)
SubVW/2-8         4.15ns ± 1%     5.20ns ± 1%   +25.22%  (p=0.000 n=83+85)
SubVW/3-8         5.50ns ± 2%     6.22ns ± 6%   +13.21%  (p=0.000 n=91+97)
SubVW/4-8         5.99ns ± 1%     6.63ns ± 1%   +10.63%  (p=0.000 n=79+61)
SubVW/5-8         6.75ns ± 4%     6.88ns ± 2%    +1.82%  (p=0.000 n=98+73)
SubVW/10-8        9.57ns ± 1%     9.56ns ± 1%    -0.13%  (p=0.000 n=77+64)
SubVW/100-8       90.3ns ± 1%     18.1ns ± 2%   -80.00%  (p=0.000 n=75+94)
SubVW/1000-8       860ns ± 4%       85ns ± 7%   -90.14%  (p=0.000 n=97+99)
SubVW/10000-8     8.51µs ± 3%     1.77µs ± 6%   -79.21%  (p=0.000 n=100+97)
SubVW/100000-8    84.4µs ± 3%     31.5µs ± 3%   -62.66%  (p=0.000 n=92+92)

Change-Id: I721d7031d40f245b4a284f5bdd93e7bb85e7e937
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@CAFxX
Copy link
Contributor

@CAFxX CAFxX commented May 20, 2020

A similar-but-not-identical case (I can open a separate issue if needed) is common in logging libraries like zap:

// Debug logs a message at DebugLevel. The message includes any fields passed
// at the log site, as well as any fields accumulated on the logger.
func (log *Logger) Debug(msg string, fields ...Field) {
	if ce := log.check(DebugLevel, msg); ce != nil {
		ce.Write(fields...)
	}
}

So much so that library authors resort to suggesting to users to -basically- perform the inlining manually as it allows to turn something like this:

logger.Debug("something happened",
  zap.String("field", obj.Field()),
  zap.Int64("state", obj.State()),
)

into this, potentially bypassing one vararg call:

if e := logger.Check(zap.DebugLevel, "something happened"); e != nil {
  e.Write(
    zap.String("field", obj.Field()),
    zap.Int64("state", obj.State()),
  )
}

If the compiler inlined Debug, it would be possible to avoid the vararg call in the (likely) case debug-level logging is disabled at runtime, and potentially even avoid calling the two obj methods if we knew they were side-effect free.

@josharian
Copy link
Contributor Author

@josharian josharian commented May 20, 2020

Following our proud ugly tradition of inlining hacks, one relatively obvious option here is to reduce the inlining cost of calls inside conditionals. That would treat these two cases similarly. Any interest in prototyping this?

@CAFxX
Copy link
Contributor

@CAFxX CAFxX commented May 21, 2020

I have a simplistic prototype that halves extraCallCost of calls inside the if.

It works for

if f1() {
  f2()
}

but it obviously falls for

if condition {
  return f1()
}
return f2()

even though it inlines

if condition {
  return f1()
} else {
  return f2()
}

This prototype doesn't seem to affect the time it takes for make.bash, and bin/go increases by 0.5% (+73KB). Haven't run yet more benchmarks (which corpus do we run nowadays for changes to the inliner?).

That made me think that maybe a better way to handle this would be to have the extraCallCost of the second call in each function to be always 1 (even the first one would work, but that would also affect functions with a single call) possibly with the added condition that at least one of the calls is inside an if (assuming we don't want to inline a(), b())

@josharian
Copy link
Contributor Author

@josharian josharian commented May 21, 2020

maybe a better way to handle this would be to have the extraCallCost of the second call in each function to be always 1

Seems worth a shot.

possibly with the added condition that at least one of the calls is inside an if (assuming we don't want to inline a(), b())

I think we want that added condition. I assume that we don't want to inline a(), b().

@dr2chase put together the initial heuristic, but that piece at least seems pretty clearly intentional.

@josharian
Copy link
Contributor Author

@josharian josharian commented May 21, 2020

Btw, for testing the impact, you may find github.com/josharian/compilecmp helpful

@ugorji
Copy link
Contributor

@ugorji ugorji commented Sep 14, 2020

The ideas here will resolves the different use-cases I was targeting in #19348 (comment).

My initial use-cases were around helper methods that do a switch/if-else-if-else or that call 2 functions (see f and g below).

f() {
    if x {
        a() // cost 57
    } else if y {
        b() // cost 1
    } else {
        c() // cost 1
    }
    d() // cost 1 (second function)
    e() // cost 57 // without this call, f should be inlineable
}

g() {
    a() // cost 57
    b() // cost 1
}
@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Sep 14, 2020

Seems like we could rewrite the visitor slightly, make it compute the max of visits down two arms of an if, rather than the sum.

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Sep 14, 2020

CL coming....

@gopherbot
Copy link

@gopherbot gopherbot commented Sep 14, 2020

Change https://golang.org/cl/254796 mentions this issue: cmd/compile: modify inlining heuristic for if; max arms, not sum

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Sep 15, 2020

If y'all feel like playing with the CL to see how it does with your examples, that might be good.
When I benchmark, I don't get ginormous binary growth, but also things don't get much faster, but perhaps people have coded around this already.

amd64: https://perf.golang.org/search?q=upload:20200914.8
arm64: https://perf.golang.org/search?q=upload:20200915.1

Both did poorly on the same benchmark, which is "interesting".

PS there's at least one failure when you run tests, because this changed some of the output for logopt (the JSON optimization logger for LSP use).

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Sep 16, 2020

Mystery regression debugged; if a function f looks like "return new(Thing).initThing()", and

  • initThing does not escape its receiver,
  • initThing is large enough that inlining it into f causes f to become too large to inline,
  • a hot caller of f does not escape f's result

then inlining initThing into f is a Bad Idea, because it results in avoidable heap allocation. The "solution" to this is a hack, whereby small functions that contain allocations have a lower limit on the size of things that they will inline (similar mechanism to not-inlining into giant methods) which will tend to let them inline more often. This bug could be biting us already, I just happened to make it appear with this bit of tinkering.

Better interleaving of escape analysis and inlining would help with this. Like I said, hack. I am benchmarking just to see what the this-time-for-sure heuristic does for binary size and benchmark time.

Interested parties are again invited to tinker with this, to see if it helps their code enough to matter.

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Sep 16, 2020

There, regressions gone. Slight performance increase, slight binary size increase.
https://perf.golang.org/search?q=upload:20200916.1

@josharian
Copy link
Contributor Author

@josharian josharian commented Sep 16, 2020

My Go time is severely restricted at the moment, but I really look forward to trying this out. And reverting my math/big hacks.

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Sep 16, 2020

@josharian We're also discussing better hacks. I'm not thrilled at the existence of additional knobs that are not measuring precisely what we care about, yet definitely have value.

@ugorji
Copy link
Contributor

@ugorji ugorji commented Oct 16, 2020

@dr2chase Gentle ping. Any update? It will be great to get this into go1.16.

For more context, I have many functions that does something to the tune of:

func read1Byte() byte {
    if decodingFromBytes {
          read1ByteFromByteSlice()
    } else if decodingFromBufferedIO {
          read1ByteFromBufIO()
    } else {
          read1ByteFromIO()
    }
}

Currently, code like this can never be inlined, and using an interface prevented inlining from happening.

The common path of read1ByteFromByteSlice() is very cheap, and I wanted that inlined. With your code change, it will be.

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Oct 16, 2020

So, the CL is up for review, and there's an open "bug", and this is low-risk -- so it might go in later, rather than sooner -- i.e., it can go in after the deadline, unless I misunderstand that policy. Right now I'm working on stuff that I'd call much riskier that we need to be sure works. There's also some higher-priority inlining work going on to deal with a performance regression in 1.15, and that needs to happen first. See #41474. It's mostly done.

There's default/automatic concern about the effect this will have on build times and binary size, since it will by default cause more things to become inlineable, hence larger binaries, and more time needed to produce them, so we may want to tweak those knobs a little bit.

@gopherbot
Copy link

@gopherbot gopherbot commented Nov 9, 2020

Change https://golang.org/cl/267419 mentions this issue: cmd/compile: allocation-aware inlining tweaks

@mdempsky
Copy link
Member

@mdempsky mdempsky commented Nov 10, 2020

What's special about

if x { return f() } else { return g() }

that it should be inlined, but that

return f() + g()

should not be?

The inlining heuristics are currently oriented around the resulting code size. This justifies inlining a := g; if x { a = f }; return a() because it needs less code size than if x { return f() } else { return g() }: in particular, it only needs to generate one function call rather than two.

If the inlining heuristic cost assigned to function calls is too high, I think it's reasonable to just lower it. I don't see how control flow details weigh in here, since we're not (at least yet) doing any substantive control flow analysis at time of inlining.

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Nov 11, 2020

Cutting the cost of a call in half increases text size by 3.5%.
Cutting it to 39 (half of inline budget, minus 1) increases text size by 2%.
I could benchmark for a corresponding performance increase, but my recollection is we won't get it.

By-the-way, the best-looking "floats-all-boats" version of this CL doesn't work for a 3-call if tree. It includes a budget for how much the "max accounting" can be lower than the "sum accounting", and the current works-best number there is only 68, which is 11 more than the cost of a call; 100 was also tested, and it was not as good -- the binaries were a bit larger (of course) and the performance improvements were not as good, and there were more losers.

All the benchmarking results used to obtain these parameters appear here.
Some heuristics were dropped, and two were added, in the course of this experiment. Most recent is the right-most "sheet" (of many). Escape-analysis-informed heuristics seemed like they ought to be better, but (1) other well-motivated CLs made it harder to use them and (2) given the parameters being tested at the time, they didn't work that well and (3) the hacky allocation/call/size heuristics worked surprisingly well. Further research (that I don't have time for -- the freeze is getting harder and harder every day, and I need to return to the register ABI work) might revive them, though that would require a discussion of updating the call-graph on the fly (there are algorithms for this, but they're not the sort of thing you'd introduce this late in the release cycle).

@ugorji
Copy link
Contributor

@ugorji ugorji commented Nov 12, 2020

What's special about

if x { return f() } else { return g() }

that it should be inlined, but that

return f() + g()

should not be?

The inlining heuristics are currently oriented around the resulting code size. This justifies inlining a := g; if x { a = f }; return a() because it needs less code size than if x { return f() } else { return g() }: in particular, it only needs to generate one function call rather than two.

My understanding of the inlining heuristics, is that a function takes more budget (~57 out of 80) than a simple instruction, because we decided the inclusion of a call that cannot be inlined suggests that inlining the wrapping function is of less value, so we reduced the leftover budget to about ~20.

However, a call like if x { f1() } else if y { f2() } else { f3() } is really a single function being called. Based on that, the cost should be the max cost of each block.

It should be just as possible to inline:

func a() {
  i++
  f()
} 

as to inline

func a() {
  i++
  if i > 0 {
    f()
  } else {
    g()
  }
}

When we started talking about mid-stack inlining, the goal was to reduce function call cost to 1 (similar to panic, append, and other intrinsic functions). Since that is no longer on the table and a function call costs a large portion of the inlining budget, then we should allow a switch where only one branch is taken to equal roughly the same cost.

This is what the CL achieves.

If the inlining heuristic cost assigned to function calls is too high, I think it's reasonable to just lower it. I don't see how control flow details weigh in here, since we're not (at least yet) doing any substantive control flow analysis at time of inlining.

It would suck, both in performance (creating unnecessary temp closures) and readability, if we unintentionally encourage folks to resort to this.

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 12, 2020

However, a call like if x { f1() } else if y { f2() } else { f3() } is really a single function being called. Based on that, the cost should be the max cost of each block.

One of the costs we're trying to model here is code size. Inlining that 3-branch if statement replaces 1 call (the callsite) with 3 calls. The fact that only one of those 3 calls can execute on a particular invocation doesn't help with code size (unless the conditions can be statically evaluated).

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Nov 13, 2020

Putting this off till 1.17, need to do other work first. I do think the max-of-if-arms is reasonable, but we need to do a better job at not inlining in some cases. The companion CL that was supposed to help with this was too flaky, Keith and I were discussing ideas in comments there. Crude sketch is to put off the inlining decision till we can tell if it will over-fill an otherwise inlineable caller; alternately, look ahead more careful in the max-cost estimator.

One thing I tried that seems like a win is to also impose a quota/budget on the amount that max-if-cost may diverge from sum-if-cost; this reduced binary size increase without doing much at all to hurt performance. Note that the interaction of max-if-cost with look-before-leap inline cost estimation will be a bit of pain.

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Nov 24, 2020

Before I forget, here's a writeup of what I tried and what I learned, and what I might try in the future. https://docs.google.com/document/d/17LAdIdHbcdhvxxQ2nszx4SgM_vbDGz6rDoj7BjL7VqM/edit?usp=sharing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
9 participants