cmd/compile: tweak branchelim heuristic on amd64 #25298

Open
TocarIP opened this Issue May 8, 2018 · 5 comments

@TocarIP (Contributor) commented May 8, 2018

What version of Go are you using (go version)?

master:
go version devel +cd1976d Tue May 8 19:57:49 2018 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/localdisk/itocar/gocache/"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/localdisk/itocar/gopath"
GORACE=""
GOROOT="/localdisk/itocar/golang"
GOTMPDIR=""
GOTOOLDIR="/localdisk/itocar/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build948974604=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Run strconv/Atof benchmarks

What did you expect to see?

Same or better performance as 1.10

What did you see instead?

Atof64Decimal-8       20.9ns ± 1%  20.3ns ± 2%   -3.11%  (p=0.000 n=10+10)
Atof64Float-8         22.9ns ± 1%  23.1ns ± 2%   +0.83%  (p=0.042 n=10+10)
Atof64FloatExp-8      56.3ns ± 3%  68.8ns ± 2%  +22.11%  (p=0.000 n=10+10)
Atof64Big-8           84.0ns ± 1%  94.0ns ± 1%  +11.91%  (p=0.000 n=10+10)
Atof64RandomBits-8    72.9ns ±15%  80.5ns ±24%  +10.47%  (p=0.022 n=10+9)
Atof64RandomFloats-8  80.2ns ±25%  91.6ns ±19%  +14.27%  (p=0.029 n=10+10)
Atof32Decimal-8       20.1ns ± 2%  20.5ns ± 2%   +1.59%  (p=0.008 n=10+10)
Atof32Float-8         21.5ns ± 3%  21.9ns ± 3%   +2.09%  (p=0.012 n=10+10)
Atof32FloatExp-8      58.6ns ± 3%  68.2ns ± 2%  +16.49%  (p=0.000 n=10+10)
Atof32Random-8        93.8ns ± 0%  86.4ns ± 0%   -7.83%  (p=0.000 n=10+8)

Bisect points to a35ec9a.

Looking at the code, I see that (*extFloat).Normalize got slower, probably because the branches were well predicted and most instructions depend on the result of a branch.

It looks like the heuristic for generating CMOV should be tweaked, but I'm not sure how. The current threshold is already low, and reducing it further will hurt performance in other cases. Maybe we should avoid generating long chains of dependent CMOVs?

@randall77 (Contributor) commented May 8, 2018 (comment minimized)

@Quasilyte (Contributor) commented May 11, 2018

Comparing package strings on Go tip and Go 1.10.2:

SingleMaxSkipping-4     1.36µs ± 0%     3.21µs ± 0%   +135.09%  (p=0.008 n=5+5)

The main regression happened in a35ec9a (1675ns/op -> 3201ns/op).

@rasky (Member) commented May 12, 2018

From a cursory look, it seems that the regressing benchmarks always execute code with fixed inputs. For instance, Atof64FloatExp, which is +22% with low variance, is:

func BenchmarkAtof64FloatExp(b *testing.B) {
	for i := 0; i < b.N; i++ {
		ParseFloat("-5.09e75", 64)
	}
}

It always parses the same number, so the CPU can branch-predict very well what is going to happen. The same applies to SingleMaxSkipping:

func BenchmarkSingleMaxSkipping(b *testing.B) {
	benchmarkSingleString(b, Repeat("b", 25), Repeat("a", 10000))
}

So, yes, if the branches are well predicted, CMOV can be slower. @TocarIP what is the exact CPU model you used for benchmarks?

On the other hand, on random inputs it is indeed faster:

Atof32Random-8        93.8ns ± 0%  86.4ns ± 0%   -7.83%  (p=0.000 n=10+8)

I'm not sure what to do here.

PS: @TocarIP FWIW, (*extFloat).Normalize should use bits.LeadingZeros64 and avoid all branches in the first place:

func (f *extFloat) Normalize() uint {
    shift := bits.LeadingZeros64(f.mant)
    f.mant <<= uint(shift)
    f.exp -= shift
    return uint(shift)
}
@gopherbot commented May 15, 2018

Change https://golang.org/cl/113256 mentions this issue: strconv: simplify (*extFloat).Normalize

@TocarIP (Contributor) commented May 15, 2018

@rasky I've used an i7-6700. Agreed about (*extFloat).Normalize.
FWIW, in SingleMaxSkipping a CMOV is generated only for the inlined

func max(a, b int) int {
        if a > b {
                return a
        }
        return b
}

which is a case where we do want to generate a CMOV, and where gcc/clang also generate one.

@bradfitz bradfitz added this to the Unplanned milestone May 18, 2018

gopherbot pushed a commit that referenced this issue May 30, 2018

strconv: simplify (*extFloat).Normalize
Use math/bits.LeadingZeros64 instead of the local implementation.
This simplifies the code, makes Normalize inlinable, and fixes the performance regression.
The idea was suggested by Giovanni Bajo in #25298.

Performance results below:
Atof64Decimal-6                          46.7ns ± 0%  46.7ns ± 0%     ~     (all equal)
Atof64Float-6                            57.9ns ± 1%  56.9ns ± 0%   -1.72%  (p=0.000 n=10+9)
Atof64FloatExp-6                          163ns ± 0%   123ns ± 0%  -24.54%  (p=0.002 n=8+10)
Atof64Big-6                               222ns ± 1%   185ns ± 1%  -16.65%  (p=0.000 n=9+10)
Atof64RandomBits-6                        155ns ± 2%   154ns ± 3%     ~     (p=0.225 n=10+10)
Atof64RandomFloats-6                      156ns ± 2%   154ns ± 2%     ~     (p=0.124 n=10+9)
Atof32Decimal-6                          47.3ns ± 0%  46.7ns ± 0%   -1.26%  (p=0.000 n=7+9)
Atof32Float-6                            51.5ns ± 1%  51.6ns ± 1%     ~     (p=0.455 n=10+9)
Atof32FloatExp-6                          163ns ± 1%   124ns ± 1%  -24.36%  (p=0.000 n=10+10)
Atof32Random-6                            199ns ± 1%   163ns ± 0%  -17.93%  (p=0.000 n=10+10)
FormatFloat/Decimal-6                     209ns ± 2%   211ns ± 2%     ~     (p=0.402 n=10+10)
FormatFloat/Float-6                       393ns ± 2%   379ns ± 1%   -3.57%  (p=0.000 n=10+10)
FormatFloat/Exp-6                         333ns ± 2%   321ns ± 1%   -3.56%  (p=0.000 n=10+9)
FormatFloat/NegExp-6                      338ns ± 3%   317ns ± 1%   -6.27%  (p=0.000 n=10+9)
FormatFloat/Big-6                         457ns ± 1%   443ns ± 2%   -2.99%  (p=0.000 n=9+10)
FormatFloat/BinaryExp-6                   230ns ± 2%   232ns ± 2%     ~     (p=0.070 n=10+10)
FormatFloat/32Integer-6                   209ns ± 2%   211ns ± 1%     ~     (p=0.203 n=10+8)
FormatFloat/32ExactFraction-6             330ns ± 2%   319ns ± 1%   -3.42%  (p=0.000 n=10+10)
FormatFloat/32Point-6                     393ns ± 2%   377ns ± 1%   -4.15%  (p=0.000 n=10+10)
FormatFloat/32Exp-6                       331ns ± 2%   318ns ± 2%   -4.02%  (p=0.000 n=10+10)
FormatFloat/32NegExp-6                    327ns ± 2%   315ns ± 2%   -3.70%  (p=0.000 n=10+10)
FormatFloat/64Fixed1-6                    265ns ± 2%   253ns ± 2%   -4.38%  (p=0.000 n=10+10)
FormatFloat/64Fixed2-6                    278ns ± 2%   262ns ± 3%   -5.71%  (p=0.000 n=10+10)
FormatFloat/64Fixed3-6                    271ns ± 2%   260ns ± 2%   -4.03%  (p=0.000 n=10+10)
FormatFloat/64Fixed4-6                    277ns ± 3%   267ns ± 1%   -3.55%  (p=0.000 n=10+9)
FormatFloat/Slowpath64-6                 71.0µs ± 0%  71.0µs ± 0%     ~     (p=0.744 n=10+8)
AppendFloat/Decimal-6                     100ns ± 1%   100ns ± 0%     ~     (p=0.294 n=10+8)
AppendFloat/Float-6                       273ns ± 0%   260ns ± 1%   -4.87%  (p=0.000 n=7+10)
AppendFloat/Exp-6                         213ns ± 0%   200ns ± 0%   -6.29%  (p=0.000 n=8+10)
AppendFloat/NegExp-6                      211ns ± 0%   198ns ± 0%   -6.16%  (p=0.000 n=8+8)
AppendFloat/Big-6                         319ns ± 0%   305ns ± 0%   -4.31%  (p=0.000 n=8+7)
AppendFloat/BinaryExp-6                  98.4ns ± 0%  92.9ns ± 0%   -5.63%  (p=0.000 n=9+8)
AppendFloat/32Integer-6                   101ns ± 1%   102ns ± 1%   +0.89%  (p=0.004 n=10+10)
AppendFloat/32ExactFraction-6             222ns ± 1%   210ns ± 0%   -5.28%  (p=0.000 n=10+9)
AppendFloat/32Point-6                     273ns ± 1%   261ns ± 1%   -4.62%  (p=0.000 n=10+9)
AppendFloat/32Exp-6                       209ns ± 1%   197ns ± 0%   -5.56%  (p=0.000 n=10+9)
AppendFloat/32NegExp-6                    207ns ± 1%   194ns ± 1%   -6.18%  (p=0.000 n=10+10)
AppendFloat/64Fixed1-6                    145ns ± 0%   131ns ± 1%   -9.93%  (p=0.000 n=9+10)
AppendFloat/64Fixed2-6                    160ns ± 0%   146ns ± 0%   -8.58%  (p=0.000 n=10+8)
AppendFloat/64Fixed3-6                    147ns ± 1%   132ns ± 1%  -10.25%  (p=0.000 n=10+10)
AppendFloat/64Fixed4-6                    161ns ± 1%   149ns ± 0%   -7.93%  (p=0.000 n=10+10)
AppendFloat/Slowpath64-6                 70.6µs ± 1%  70.9µs ± 0%   +0.37%  (p=0.000 n=10+8)

Change-Id: I63bbc40905abd795fbd24743604c790023d11a43
Reviewed-on: https://go-review.googlesource.com/113256
Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>