
cmd/compile: mid-stack inline dispatch functions that call a function on each path #30548

Open · josharian opened this issue Mar 3, 2019 · 6 comments

@josharian (Contributor) commented Mar 3, 2019

package p

func dispatch(x int) int {
	if x < 10 {
		return small(x)
	}
	return big(x)
}

//go:noinline
func small(x int) int { return 0 }

//go:noinline
func big(x int) int { return 1 }

dispatch should be inlined: it is very simple, and on either path through the function it calls exactly one other simple function.

It is not, as go build -gcflags='-m -m' reports:

x.go:3:6: cannot inline dispatch: function too complex: cost 126 exceeds budget 80

This is because we attach a significant cost to each function call.
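
For concreteness, here is a rough sketch of the accounting, assuming the constants in cmd/compile/internal/gc/inl.go at the time (the exact values and breakdown are approximations and may differ by release):

// Sketch only, not the actual compiler code.
const (
	inlineMaxBudget     = 80 // functions costing more than this are not inlined
	inlineExtraCallCost = 57 // charged for every call to a non-intrinsic function
)

// dispatch contains two call expressions, and both are charged even though
// only one can execute: 2*57 = 114 for the calls, plus roughly a dozen more
// for the comparison, branch, and returns, matching the reported cost of
// 126 > 80.

This also suggests why the function-value version below stays under budget: it contains only a single call expression, so only one extraCallCost is charged.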

Note that we are happy to inline this dispatch function:

func dispatch(x int) int {
	fn := big
	if x < 10 {
		fn = small
	}
	return fn(x)
}

Fixing this requires either some special casing to recognize common patterns (like the original dispatch) or a bit of control flow analysis.

cc @randall77 @dr2chase

@gopherbot commented Mar 3, 2019

Change https://golang.org/cl/164968 mentions this issue: math/big: add fast path for pure Go addVW for large z

gopherbot pushed a commit that referenced this issue Mar 9, 2019
In the normal case, only a few words have to be updated when adding a word to a vector.
When that happens, we can simply copy the rest of the words, which is much faster.
However, the overhead of that makes it prohibitive for small vectors,
so we check the size at the beginning.

The implementation is a bit weird to allow addVW to continue to be inlined; see #30548.

The AddVW benchmarks are surprising, but fully repeatable.
The SubVW benchmarks are more or less as expected.
I expect that removing the indirect function call will
help both and make them a bit more normal.

name            old time/op    new time/op     delta
AddVW/1-8         4.27ns ± 2%     3.81ns ± 3%   -10.83%  (p=0.000 n=89+90)
AddVW/2-8         4.91ns ± 2%     4.34ns ± 1%   -11.60%  (p=0.000 n=83+90)
AddVW/3-8         5.77ns ± 4%     5.76ns ± 2%      ~     (p=0.365 n=91+87)
AddVW/4-8         6.03ns ± 1%     6.03ns ± 1%      ~     (p=0.392 n=80+76)
AddVW/5-8         6.48ns ± 2%     6.63ns ± 1%    +2.27%  (p=0.000 n=76+74)
AddVW/10-8        9.56ns ± 2%     9.56ns ± 1%    -0.02%  (p=0.002 n=69+76)
AddVW/100-8       90.6ns ± 0%     18.1ns ± 4%   -79.99%  (p=0.000 n=72+94)
AddVW/1000-8       865ns ± 0%       85ns ± 6%   -90.14%  (p=0.000 n=66+96)
AddVW/10000-8     8.57µs ± 2%     1.82µs ± 3%   -78.73%  (p=0.000 n=99+94)
AddVW/100000-8    84.4µs ± 2%     31.8µs ± 4%   -62.29%  (p=0.000 n=93+98)

name            old time/op    new time/op     delta
SubVW/1-8         3.90ns ± 2%     4.13ns ± 4%    +6.02%  (p=0.000 n=92+95)
SubVW/2-8         4.15ns ± 1%     5.20ns ± 1%   +25.22%  (p=0.000 n=83+85)
SubVW/3-8         5.50ns ± 2%     6.22ns ± 6%   +13.21%  (p=0.000 n=91+97)
SubVW/4-8         5.99ns ± 1%     6.63ns ± 1%   +10.63%  (p=0.000 n=79+61)
SubVW/5-8         6.75ns ± 4%     6.88ns ± 2%    +1.82%  (p=0.000 n=98+73)
SubVW/10-8        9.57ns ± 1%     9.56ns ± 1%    -0.13%  (p=0.000 n=77+64)
SubVW/100-8       90.3ns ± 1%     18.1ns ± 2%   -80.00%  (p=0.000 n=75+94)
SubVW/1000-8       860ns ± 4%       85ns ± 7%   -90.14%  (p=0.000 n=97+99)
SubVW/10000-8     8.51µs ± 3%     1.77µs ± 6%   -79.21%  (p=0.000 n=100+97)
SubVW/100000-8    84.4µs ± 3%     31.5µs ± 3%   -62.66%  (p=0.000 n=92+92)

Change-Id: I721d7031d40f245b4a284f5bdd93e7bb85e7e937
Reviewed-on: https://go-review.googlesource.com/c/go/+/164968
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
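
For readers following along, here is a simplified sketch of the fast path the commit message describes; this is illustrative code written for this thread, not a copy of the actual math/big implementation in the CL above:

package p

// Word stands in for math/big's machine-word type.
type Word uintptr

// addVWsketch adds the single word y to the vector x, storing the result in
// z and returning the final carry. Once the carry dies out, the remaining
// words of x are unchanged, so they can simply be copied, which is much
// faster than adding zero to each of them one at a time.
func addVWsketch(z, x []Word, y Word) (carry Word) {
	c := y
	for i, xi := range x {
		if c == 0 {
			// No carry left: copy the rest of the words verbatim.
			copy(z[i:], x[i:])
			return 0
		}
		zi := xi + c
		z[i] = zi
		if zi < xi { // unsigned wraparound means a carry out
			c = 1
		} else {
			c = 0
		}
	}
	return c
}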
josharian added a commit to josharian/go that referenced this issue May 7, 2019
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@CAFxX (Contributor) commented May 20, 2020

A similar-but-not-identical case (I can open a separate issue if needed) is common in logging libraries like zap:

// Debug logs a message at DebugLevel. The message includes any fields passed
// at the log site, as well as any fields accumulated on the logger.
func (log *Logger) Debug(msg string, fields ...Field) {
	if ce := log.check(DebugLevel, msg); ce != nil {
		ce.Write(fields...)
	}
}

So much so that library authors resort to suggesting that users basically perform the inlining manually, as it allows turning something like this:

logger.Debug("something happened",
  zap.String("field", obj.Field()),
  zap.Int64("state", obj.State()),
)

into this, potentially bypassing the vararg call:

if e := logger.Check(zap.DebugLevel, "something happened"); e != nil {
  e.Write(
    zap.String("field", obj.Field()),
    zap.Int64("state", obj.State()),
  )
}

If the compiler inlined Debug, it would be possible to avoid the vararg call in the (likely) case that debug-level logging is disabled at runtime, and potentially even to avoid calling the two obj methods if we knew they were side-effect free.

@josharian (Author) commented May 20, 2020

Following our proud ugly tradition of inlining hacks, one relatively obvious option here is to reduce the inlining cost of calls inside conditionals. That would treat these two cases similarly. Any interest in prototyping this?

@CAFxX (Contributor) commented May 21, 2020

I have a simplistic prototype that halves the extraCallCost of calls inside an if.

It works for

if f1() {
  f2()
}

but it obviously fails for

if condition {
  return f1()
}
return f2()

even though it inlines

if condition {
  return f1()
} else {
  return f2()
}

This prototype doesn't seem to affect how long make.bash takes, and bin/go grows by 0.5% (+73KB). I haven't run more benchmarks yet (which corpus do we run nowadays for changes to the inliner?).

That made me think that maybe a better way to handle this would be to make the extraCallCost of the second call in each function always 1 (even the first one would work, but that would also affect functions with a single call), possibly with the added condition that at least one of the calls is inside an if (assuming we don't want to inline a(), b()).
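
To illustrate (hypothetical types and method names; the real cost walk lives in cmd/compile/internal/gc/inl.go and looks different), the prototype's idea amounts to something like:

// Illustrative sketch only, not the actual inliner.
type costVisitor struct {
	budget        int32 // remaining inlining budget
	extraCallCost int32 // base cost charged per call
	inIf          bool  // true while visiting the body of an if
}

// visitCall charges a call, at half price when it is on a conditional path.
func (v *costVisitor) visitCall() {
	cost := v.extraCallCost
	if v.inIf {
		cost /= 2
	}
	v.budget -= cost
}

// visitIf marks the body as conditional while it is being costed.
func (v *costVisitor) visitIf(visitBody func()) {
	saved := v.inIf
	v.inIf = true
	visitBody()
	v.inIf = saved
}

The second-call-costs-1 variant would instead count call expressions during the walk and charge the full extraCallCost only for the first one.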

@josharian (Author) commented May 21, 2020

maybe a better way to handle this would be to have the extraCallCost of the second call in each function to be always 1

Seems worth a shot.

possibly with the added condition that at least one of the calls is inside an if (assuming we don't want to inline a(), b())

I think we want that added condition. I assume that we don't want to inline a(), b().

@dr2chase put together the initial heuristic, but that piece at least seems pretty clearly intentional.

@josharian (Author) commented May 21, 2020

Btw, for testing the impact, you may find github.com/josharian/compilecmp helpful.
