-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Description
What version of Go are you using (go version
)?
1.17dev
Does this issue reproduce with the latest release?
This is only observed with the register ABI, so not on 1.16, but yes on 1.17 (which not a release yet).
What operating system and processor architecture are you using (go env
)?
Linux and Darwin OS, amd64 (i.e., where we have activated register ABI).
What did you do?
Using bent:
bent -U -v -N=7 -b=kanzi
with configuration.toml =
[[Configurations]]
Name = "Baseline"
Root = "$HOME/work/go-1.16/"
[[Configurations]]
Name = "Registers"
Root = "$HOME/work/go/"
Or:
GOPATH=`pwd`/gopath go get -d -t -v github.com/flanglet/kanzi-go/benchmark
GOPATH=`pwd`/gopath (cd gopath/src/github.com/flanglet/kanzi-go/benchmark; go test -vet=off -c . )
with different versions of go, saving the respective binaries.
and then observe timings for MTFT
GOPATH=`pwd`/gopath (cd gopath/src/github.com/flanglet/kanzi-go/benchmark; \
./kanzi_mtft_Baseline -test.run=none -test.bench=BenchmarkMTFT )
What did you expect to see?
Same or better run time.
What did you see instead?
10-15% worse run time.
The cause (from differential profiles) appears to be poor compilation of at least one at least one inner loop.
Profiles:
drchase@drchase:~/work/bent$ go tool pprof -flat $HOME/work/bent/testbin/kanzi_mtft_Baseline gopath/src/github.com/flanglet/kanzi-go/benchmark/kanzi_mtft_Baseline_1.prof
File: kanzi_mtft_Baseline
Type: cpu
Time: May 11, 2021 at 4:08pm (UTC)
Duration: 11.13s, Total samples = 10.97s (98.55%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 10.94s, 99.73% of 10.97s total
Dropped 9 nodes (cum <= 0.05s)
flat flat% sum% cum cum%
5.93s 54.06% 54.06% 5.94s 54.15% github.com/flanglet/kanzi-go/transform.(*SBRT).Forward
5.01s 45.67% 99.73% 5.02s 45.76% github.com/flanglet/kanzi-go/transform.(*SBRT).Inverse
0 0% 99.73% 10.97s 100% github.com/flanglet/kanzi-go/benchmark.BenchmarkMTFT
0 0% 99.73% 10.97s 100% github.com/flanglet/kanzi-go/benchmark.testTransformSpeed
0 0% 99.73% 10.97s 100% testing.(*B).launch
0 0% 99.73% 10.97s 100% testing.(*B).runN
(pprof) quit
drchase@drchase:~/work/bent$ go tool pprof -flat $HOME/work/bent/testbin/kanzi_mtft_Registers gopath/src/github.com/flanglet/kanzi-go/benchmark/kanzi_mtft_Registers_1.prof
File: kanzi_mtft_Registers
Type: cpu
Time: May 11, 2021 at 4:08pm (UTC)
Duration: 12.43s, Total samples = 12.25s (98.52%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 12.20s, 99.59% of 12.25s total
Dropped 20 nodes (cum <= 0.06s)
flat flat% sum% cum cum%
6.32s 51.59% 51.59% 6.32s 51.59% github.com/flanglet/kanzi-go/transform.(*SBRT).Forward
5.87s 47.92% 99.51% 5.89s 48.08% github.com/flanglet/kanzi-go/transform.(*SBRT).Inverse
0.01s 0.082% 99.59% 12.25s 100% github.com/flanglet/kanzi-go/benchmark.testTransformSpeed
0 0% 99.59% 12.25s 100% github.com/flanglet/kanzi-go/benchmark.BenchmarkMTFT
0 0% 99.59% 12.25s 100% testing.(*B).launch
0 0% 99.59% 12.25s 100% testing.(*B).runN
(pprof)
Using GOSSAFUNC to compare the generated code, the main difference in Forward
appears to be in compilation of a loop:
before:
v258 00047 (+161) MOVBLZX R11, R15
v260 00048 (161) MOVB R14, "".r2s-4392(SP)(R15*1)
v264 00049 (161) MOVB R11, "".s2r-4648(SP)(R14*1)
v227 00050 (159) MOVL R12, R11
v214 00051 (+159) TESTB R11, R11
b34 00052 (159) JLS 59
v239 00053 (159) LEAL -1(R11), R12
v240 00054 (159) MOVBLZX R12, R14
v242 00055 (159) MOVBLZX "".r2s-4392(SP)(R14*1), R14
v221 00056 (159) MOVQ "".q-4136(SP)(R14*8), R15
v289 00057 (159) CMPQ R15, AX
b38 00058 (159) JLE 47
after:
v243 00143 (+161) MOVBLZX R12, R9
v102 00144 (161) MOVB DX, "".r2s-4392(SP)(R9*1)
v82 00145 (161) MOVB R12, "".s2r-4648(SP)(DX*1)
v46 00146 (154) MOVQ CX, DX
v44 00147 (154) MOVQ DI, R9
v261 00148 (159) MOVL R8, R12
v207 00149 (154) MOVQ AX, R8
v247 00150 (+159) TESTB R12, R12
b34 00151 (159) JLS 134
v128 00152 (145) MOVQ R8, AX
v225 00153 (159) LEAL -1(R12), R8
v147 00154 (144) MOVQ DX, CX
v226 00155 (159) MOVBLZX R8, DX
v228 00156 (159) MOVBLZX "".r2s-4392(SP)(DX*1), DX
v257 00157 (146) MOVQ R9, DI
v33 00158 (159) MOVQ "".q-4136(SP)(DX*8), R9
v51 00159 (159) CMPQ R9, R13
b38 00160 (159) JLE 143
An easy way to get these files with bent is to modify the configuration, like so:
[[Configurations]]
Name = "Baseline"
Root = "$HOME/work/go-1.16/"
GcEnv = [ "GOSSAFUNC=(*SBRT).Forward", "GOSSADIR=/tmp/baseline" ]
[[Configurations]]
Name = "Registers"
Root = "$HOME/work/go/"
GcEnv = [ "GOSSAFUNC=transform.(*SBRT).Forward", "GOSSADIR=/tmp/registers" ]
The source code:
.../gopath/src/github.com/flanglet/kanzi-go/transform/SBRT.go:
func (this *SBRT) Forward(src, dst []byte) (uint, uint, error) {
if len(src) == 0 {
return 0, 0, nil
}
if &src[0] == &dst[0] {
return 0, 0, errors.New("Input and output buffers cannot be equal")
}
count := len(src)
if count > len(dst) {
errMsg := fmt.Sprintf("Block size is %v, output buffer length is %v", count, len(dst))
return 0, 0, errors.New(errMsg)
}
s2r := [256]uint8{}
r2s := [256]uint8{}
for i := range s2r {
s2r[i] = uint8(i)
r2s[i] = uint8(i)
}
m1 := this.mask1
m2 := this.mask2
s := this.shift
p := [256]int{}
q := [256]int{}
for i := 0; i < count; i++ {
c := uint8(src[i])
r := s2r[c]
dst[i] = byte(r)
qc := ((i & m1) + (p[c] & m2)) >> s
p[c] = i
q[c] = qc
// Move up symbol to correct rank -- // THIS IS LINE 158
for r > 0 && q[r2s[r-1]] <= qc {
t := r2s[r-1]
r2s[r], s2r[t] = t, r
r--
}
r2s[r] = c
s2r[c] = r
}
return uint(count), uint(count), nil
}
Fixed count profiles were collected with a RunWrapper = ["cpuprofilex"]
which is a slightly modified version of cpuprofile
, where "10000x" is a good run count for this benchmark:
#!/bin/bash
# Run args as command, but run cpuprofile and then pprof to capture test cpuprofile output
pf="${BENT_BINARY}_${BENT_I}.prof"
"$@" -test.cpuprofile="$pf" -test.benchtime=10000x
echo cpuprofile in `pwd`/"$pf"
if [[ x`which pprof` == x"" ]] ; then
go tool pprof -text -flat -nodecount=20 "$pf"
else
pprof -text -flat -nodecount=20 "$pf"
fi
My current hypothesis is that use of the register ABI increases register pressure in the loop and causes the register allocator to make unwise choices. I don't think that this is an easy fix for 1.17, and overall performance is nicely up for far more benchmarks than those that regressed.
Regarding bent; it's currently at github.com/dr2chase/bent, but is in the process of being migrated to github.com/golang/benchmarks/cmd/bent.
"Forward"
1.16 ssa.html: https://gist.github.com/dr2chase/fb7110e9fafd6bccffd1aeb93d1b9144
1.17 ssa.html: https://gist.github.com/dr2chase/ba0a4f898842d99c4aec9e26c428c5b5
"Inverse" (which looks a whole lot like Forward)
1.16 ssa.html: https://gist.github.com/dr2chase/c9aaeb44d5e1063ab673c93c9a07a0a1
1.17 ssa.html: https://gist.github.com/dr2chase/2683c78471c4aac85c5daecb98392da6