Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: inline is slower #51028

Closed
go101 opened this issue Feb 5, 2022 · 1 comment
Closed

cmd/compile: inline is slower #51028

go101 opened this issue Feb 5, 2022 · 1 comment

Comments

@go101
Copy link

@go101 go101 commented Feb 5, 2022

What version of Go are you using (go version)?

$ go version
go version go1.18beta2 linux/amd64

Does this issue reproduce with the latest release?

Not for go1.17.6, only for 1.18 beta 2.

What did you do?

package copycost

import "testing"

const N = 1024

func Sum_RangeSliceIdx_Inline(a []int) (r int) {
	for i := range a {
		r += a[i]
	}
	return
}

//go:noinline
func Sum_RangeSliceIdx_NoInline(a []int) (r int) {
	for i := range a {
		r += a[i]
	}
	return
}

func buildArray() [N]int {
	var a [N]int
	for i := 0; i < N; i++ {
		a[i] = (N - i) & i
	}
	return a
}

var r [128]int

func Benchmark_Sum_RangeSliceIdx_Inline(b *testing.B) {
	var a = buildArray()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r[i&127] = Sum_RangeSliceIdx_Inline(a[:])
	}
}

func Benchmark_Sum_RangeSliceIdx_NoInline(b *testing.B) {
	var a = buildArray()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r[i&127] = Sum_RangeSliceIdx_NoInline(a[:])
	}
}

What did you expect to see?

The inlined function is faster.

What did you see instead?

The inlined one is slower.

$ go version
go version go1.18beta2 linux/amd64

$ go test -bench=.
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Benchmark_Sum_RangeSliceIdx_Inline-4     	 1996579	       603.2 ns/op
Benchmark_Sum_RangeSliceIdx_NoInline-4   	 2176748	       545.0 ns/op

$ go version
go version go1.17.6 linux/amd64

$ go test -bench=.
...
Benchmark_Sum_RangeSliceIdx_Inline-4     	 2201970	       534.3 ns/op
Benchmark_Sum_RangeSliceIdx_NoInline-4   	 2250338	       535.2 ns/op
@randall77
Copy link
Contributor

@randall77 randall77 commented Feb 5, 2022

Inner loop in the noninlined case:

	0x000b 00011 (/Users/khr/gowork/issue51028_test.go:16)	LEAQ	1(CX), SI
	0x000f 00015 (/Users/khr/gowork/issue51028_test.go:17)	MOVQ	(AX)(CX*8), DI
	0x0013 00019 (/Users/khr/gowork/issue51028_test.go:17)	ADDQ	DI, DX
	0x0016 00022 (/Users/khr/gowork/issue51028_test.go:16)	MOVQ	SI, CX
	0x0019 00025 (/Users/khr/gowork/issue51028_test.go:16)	CMPQ	BX, CX
	0x001c 00028 (/Users/khr/gowork/issue51028_test.go:16)	JGT	11

In the inlined case:

	0x00cd 00205 (/Users/khr/gowork/issue51028_test.go:8)	LEAQ	1(BX), SI
	0x00d1 00209 (/Users/khr/gowork/issue51028_test.go:9)	MOVQ	"".a+8200(SP)(BX*8), DI
	0x00d9 00217 (/Users/khr/gowork/issue51028_test.go:9)	ADDQ	DI, DX
	0x00dc 00220 (/Users/khr/gowork/issue51028_test.go:8)	MOVQ	SI, BX
	0x00df 00223 (/Users/khr/gowork/issue51028_test.go:8)	NOP
	0x00e0 00224 (/Users/khr/gowork/issue51028_test.go:8)	CMPQ	BX, $1024
	0x00e7 00231 (/Users/khr/gowork/issue51028_test.go:8)	JLT	205

There's a nop pad to align the cmp/jlt to fix #35881. That could conceivably be the reason for the slowdown, in which case the performance difference is just because we got alignment-lucky in the noninline case and alignment-unlucky in the inline case.
It could also conceivably be the larger instruction / additional microop required for the load. The noninline case effectively lifts the computation of &a out of the loop.

The first one seems like there's nothing to fix, except possibly to revert the #35881 fix some day. (If I revert it on my desktop right now, the performance goes down by a factor of 2! So not today.)

The second one would need some sort of partial lifting out of the loop. That's #15808, with the additional complication here that we'd have to break a single instruction into 2 parts and lift just one of those parts out of the loop.

So I don't think there's anything to do for this issue. I'm going to close in favor of #15808.

(This doesn't reproduce for 1.17 because we don't inline functions with loops in 1.17.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants