cmd/compile: MTFT benchmark performance regression with register ABI

### What version of Go are you using (`go version`)?

1.17dev

### Does this issue reproduce with the latest release?

This is only observed with the register ABI, so not on 1.16, but yes on 1.17 (which not a release yet).

### What operating system and processor architecture are you using (`go env`)?

Linux and Darwin OS, amd64 (i.e., where we have activated register ABI).

### What did you do?

Using bent:
```
bent -U -v -N=7  -b=kanzi
```
with configuration.toml = 
```
[[Configurations]]
  Name = "Baseline"
  Root = "$HOME/work/go-1.16/"

[[Configurations]]
  Name = "Registers"
  Root = "$HOME/work/go/"
```

Or:
```
GOPATH=`pwd`/gopath go get -d -t -v github.com/flanglet/kanzi-go/benchmark
GOPATH=`pwd`/gopath  (cd gopath/src/github.com/flanglet/kanzi-go/benchmark; go test -vet=off -c . )
```
with different versions of go, saving the respective binaries.

and then observe timings for `MTFT`
```
GOPATH=`pwd`/gopath  (cd gopath/src/github.com/flanglet/kanzi-go/benchmark; \
./kanzi_mtft_Baseline -test.run=none -test.bench=BenchmarkMTFT )
```

### What did you expect to see?
Same or better run time.


### What did you see instead?
10-15% worse run time.

The cause (from differential profiles) appears to be poor compilation of at least one at least one inner loop.
Profiles:
```
drchase@drchase:~/work/bent$ go tool pprof -flat $HOME/work/bent/testbin/kanzi_mtft_Baseline  gopath/src/github.com/flanglet/kanzi-go/benchmark/kanzi_mtft_Baseline_1.prof
File: kanzi_mtft_Baseline
Type: cpu
Time: May 11, 2021 at 4:08pm (UTC)
Duration: 11.13s, Total samples = 10.97s (98.55%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 10.94s, 99.73% of 10.97s total
Dropped 9 nodes (cum <= 0.05s)
      flat  flat%   sum%        cum   cum%
     5.93s 54.06% 54.06%      5.94s 54.15%  github.com/flanglet/kanzi-go/transform.(*SBRT).Forward
     5.01s 45.67% 99.73%      5.02s 45.76%  github.com/flanglet/kanzi-go/transform.(*SBRT).Inverse
         0     0% 99.73%     10.97s   100%  github.com/flanglet/kanzi-go/benchmark.BenchmarkMTFT
         0     0% 99.73%     10.97s   100%  github.com/flanglet/kanzi-go/benchmark.testTransformSpeed
         0     0% 99.73%     10.97s   100%  testing.(*B).launch
         0     0% 99.73%     10.97s   100%  testing.(*B).runN
(pprof) quit

drchase@drchase:~/work/bent$  go tool pprof -flat $HOME/work/bent/testbin/kanzi_mtft_Registers  gopath/src/github.com/flanglet/kanzi-go/benchmark/kanzi_mtft_Registers_1.prof
File: kanzi_mtft_Registers
Type: cpu
Time: May 11, 2021 at 4:08pm (UTC)
Duration: 12.43s, Total samples = 12.25s (98.52%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 12.20s, 99.59% of 12.25s total
Dropped 20 nodes (cum <= 0.06s)
      flat  flat%   sum%        cum   cum%
     6.32s 51.59% 51.59%      6.32s 51.59%  github.com/flanglet/kanzi-go/transform.(*SBRT).Forward
     5.87s 47.92% 99.51%      5.89s 48.08%  github.com/flanglet/kanzi-go/transform.(*SBRT).Inverse
     0.01s 0.082% 99.59%     12.25s   100%  github.com/flanglet/kanzi-go/benchmark.testTransformSpeed
         0     0% 99.59%     12.25s   100%  github.com/flanglet/kanzi-go/benchmark.BenchmarkMTFT
         0     0% 99.59%     12.25s   100%  testing.(*B).launch
         0     0% 99.59%     12.25s   100%  testing.(*B).runN
(pprof)
```

Using GOSSAFUNC to compare the generated code, the main difference in `Forward` appears to be in compilation of a loop:
```
before:
v258 	00047 (+161) MOVBLZX R11, R15
v260 	00048 (161) MOVB R14, "".r2s-4392(SP)(R15*1)
v264 	00049 (161) MOVB R11, "".s2r-4648(SP)(R14*1)
v227 	00050 (159) MOVL R12, R11
v214 	00051 (+159) TESTB R11, R11
b34 	00052 (159) JLS 59
v239 	00053 (159) LEAL -1(R11), R12
v240 	00054 (159) MOVBLZX R12, R14
v242 	00055 (159) MOVBLZX "".r2s-4392(SP)(R14*1), R14
v221 	00056 (159) MOVQ "".q-4136(SP)(R14*8), R15
v289 	00057 (159) CMPQ R15, AX
b38 	00058 (159) JLE 47

after:
v243 	00143 (+161) MOVBLZX R12, R9
v102 	00144 (161) MOVB DX, "".r2s-4392(SP)(R9*1)
v82 	00145 (161) MOVB R12, "".s2r-4648(SP)(DX*1)
v46 	00146 (154) MOVQ CX, DX
v44 	00147 (154) MOVQ DI, R9
v261 	00148 (159) MOVL R8, R12
v207 	00149 (154) MOVQ AX, R8
v247 	00150 (+159) TESTB R12, R12
b34 	00151 (159) JLS 134
v128 	00152 (145) MOVQ R8, AX
v225 	00153 (159) LEAL -1(R12), R8
v147 	00154 (144) MOVQ DX, CX
v226 	00155 (159) MOVBLZX R8, DX
v228 	00156 (159) MOVBLZX "".r2s-4392(SP)(DX*1), DX
v257 	00157 (146) MOVQ R9, DI
v33 	00158 (159) MOVQ "".q-4136(SP)(DX*8), R9
v51 	00159 (159) CMPQ R9, R13
b38 	00160 (159) JLE 143
```

An easy way to get these files with bent is to modify the configuration, like so:
```
[[Configurations]]
  Name = "Baseline"
  Root = "$HOME/work/go-1.16/"
  GcEnv = [ "GOSSAFUNC=(*SBRT).Forward", "GOSSADIR=/tmp/baseline" ]

[[Configurations]]
  Name = "Registers"
  Root = "$HOME/work/go/"
  GcEnv = [ "GOSSAFUNC=transform.(*SBRT).Forward", "GOSSADIR=/tmp/registers" ]
```

The source code:
```
.../gopath/src/github.com/flanglet/kanzi-go/transform/SBRT.go:

func (this *SBRT) Forward(src, dst []byte) (uint, uint, error) {
	if len(src) == 0 {
		return 0, 0, nil
	}
 
	if &src[0] == &dst[0] {
		return 0, 0, errors.New("Input and output buffers cannot be equal")
	}
 
	count := len(src)
 
	if count > len(dst) {
		errMsg := fmt.Sprintf("Block size is %v, output buffer length is %v", count, len(dst))
		return 0, 0, errors.New(errMsg)
	}
 
	s2r := [256]uint8{}
	r2s := [256]uint8{}
 
	for i := range s2r {
		s2r[i] = uint8(i)
		r2s[i] = uint8(i)
	}
 
	m1 := this.mask1
	m2 := this.mask2
	s := this.shift
	p := [256]int{}
	q := [256]int{}
 
	for i := 0; i < count; i++ {
		c := uint8(src[i])
		r := s2r[c]
		dst[i] = byte(r)
		qc := ((i & m1) + (p[c] & m2)) >> s
		p[c] = i
		q[c] = qc
 
		// Move up symbol to correct rank -- // THIS IS LINE 158
		for r > 0 && q[r2s[r-1]] <= qc {
			t := r2s[r-1]
			r2s[r], s2r[t] = t, r
			r--
		}
 
		r2s[r] = c
		s2r[c] = r
	}
 
	return uint(count), uint(count), nil
}
```


Fixed count profiles were collected with a `RunWrapper = ["cpuprofilex"]` which is a slightly modified version of `cpuprofile`, where "10000x" is a good run count for this benchmark:
```
#!/bin/bash
# Run args as command, but run cpuprofile and then pprof to capture test cpuprofile output
pf="${BENT_BINARY}_${BENT_I}.prof"
"$@" -test.cpuprofile="$pf" -test.benchtime=10000x
echo cpuprofile in `pwd`/"$pf"
if [[ x`which pprof` == x"" ]] ; then
    go tool pprof -text -flat -nodecount=20 "$pf"
else
    pprof -text -flat -nodecount=20 "$pf"
fi
```

My current hypothesis is that use of the register ABI increases register pressure in the loop and causes the register allocator to make unwise choices.  I don't think that this is an easy fix for 1.17, and overall performance is nicely up for far more benchmarks than those that regressed.

Regarding bent; it's currently at github.com/dr2chase/bent, but is in the process of being migrated to github.com/golang/benchmarks/cmd/bent.

"Forward"
1.16 ssa.html: https://gist.github.com/dr2chase/fb7110e9fafd6bccffd1aeb93d1b9144
1.17 ssa.html: https://gist.github.com/dr2chase/ba0a4f898842d99c4aec9e26c428c5b5

"Inverse" (which looks a whole lot like Forward)
1.16 ssa.html: https://gist.github.com/dr2chase/c9aaeb44d5e1063ab673c93c9a07a0a1
1.17 ssa.html: https://gist.github.com/dr2chase/2683c78471c4aac85c5daecb98392da6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cmd/compile: MTFT benchmark performance regression with register ABI #46216

What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (`go env`)?

What did you do?

What did you expect to see?

What did you see instead?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

cmd/compile: MTFT benchmark performance regression with register ABI #46216

Description

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?