
cmd/compile: multiplication strength-reduction rules hurting performance #21434

Closed

ALTree opened this issue Aug 14, 2017 · 4 comments

@ALTree (Member) commented Aug 14, 2017

I noticed that certain strength-reduction rules for multiplication seem to harm performance on my machine.

Take the following rule in AMD64.rules (which reduces c * n when c is a constant one less than a power of 2):

(MULQconst [c] x) && isPowerOfTwo(c+1) && c >= 15 -> (SUBQ (SHLQconst <v.Type> [log2(c+1)] x) x)
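For reference, the identity behind the rule, c*x == (x << log2(c+1)) - x for c = 2^k - 1, can be checked directly in Go. This is just an illustrative sketch (mulViaShiftSub is a made-up helper, not compiler code):

```go
package main

import "fmt"

// mulViaShiftSub computes c*x for c = 2^k - 1 as (x << k) - x,
// the same transformation the AMD64 rule emits as SHLQ + SUBQ.
func mulViaShiftSub(x int64, k uint) int64 {
	return x<<k - x
}

func main() {
	// Check the identity for c = 15 (k = 4) and c = 31 (k = 5).
	for _, x := range []int64{-7, 0, 1, 42, 1 << 20} {
		fmt.Println(15*x == mulViaShiftSub(x, 4), 31*x == mulViaShiftSub(x, 5))
	}
}
```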

Two silly benchmarks:

package main

import "testing"

//go:noinline
func F1(n int) int {
	return 15 * n
}

var arr = []int{
	2, 3, 5, 7, 11, 13, 18, 19, 23, 29, 31, 37, 41, 43,
}

func F2() int {
	sum := 0
	for i := range arr {
		sum += 15 * arr[i]
	}

	for i := range arr {
		sum += 31 * arr[i]
	}
	return sum
}

var sink int

func BenchmarkF1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = F1(i)
	}
}

func BenchmarkF2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = F2()
	}
}

The generated code for F1 with and without the strength reduction rule:

"".F1 STEXT nosplit size=21 args=0x10 locals=0x0
        [...]
	0x0000 00000	MOVQ	"".n+8(SP), AX
	0x0005 00005	MOVQ	AX, CX
	0x0008 00008	SHLQ	$4, AX
	0x000c 00012	SUBQ	CX, AX
	0x000f 00015	MOVQ	AX, "".~r1+16(SP)
	0x0014 00020	RET
"".F1 STEXT nosplit size=15 args=0x10 locals=0x0
        [...]
	0x0000 00000	MOVQ	"".n+8(SP), AX
	0x0005 00005	IMULQ	$15, AX
	0x0009 00009	MOVQ	AX, "".~r1+16(SP)
	0x000e 00014	RET

A single IMULQ instruction is replaced by a MOVQ, a SHLQ, and a SUBQ; the reduced code also uses one more register. The reduction seems to harm performance. Benchmark results with tip vs. tip with the rule disabled:

name  old time/op  new time/op  delta
F1-4  2.34ns ± 2%  2.33ns ± 0%     ~     (p=1.000 n=5+5)
F2-4  23.2ns ± 2%  17.2ns ± 3%  -26.08%  (p=0.008 n=5+5)

The rule makes no difference on the first microbenchmark, and significantly hurts performance on the more realistic second benchmark.

This is on a Haswell machine, where the rule of thumb currently used for this kind of reduction (addq, shlq, leaq, and negq all cost 1; imulq costs 3) should be valid. And yet the reduced code is either as fast as or significantly slower than the code using imul.

ALTree added the Performance label Aug 14, 2017

@josharian (Contributor) commented Aug 14, 2017 (comment minimized)

@randall77 (Contributor) commented Aug 14, 2017

I don't think your test is actually measuring the multiply latency. It is only measuring multiply throughput.
If you chain the multiplies, so the next multiply needs the result of the previous one, then you're measuring latency and the rewrite we're using shows improvement.

package main

import "testing"

//go:noinline
func F1(n int) int {
	r := 0
	for i := 0; i < n; i++ {
		r += i * 15 // uses x<<4-x
	}
	return r
}

//go:noinline
func F2(n int) int {
	r := 0
	for i := 0; i < n; i++ {
		r += i * 19 // uses imul
	}
	return r
}

//go:noinline
func F3(n int) int {
	r := 1
	for i := 0; i < n; i++ {
		r *= 15 // uses x<<4-x
	}
	return r
}

//go:noinline
func F4(n int) int {
	r := 1
	for i := 0; i < n; i++ {
		r *= 19 // uses imul
	}
	return r
}

var sink int

func BenchmarkF1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = F1(1000)
	}
}

func BenchmarkF2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = F2(1000)
	}
}

func BenchmarkF3(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = F3(1000)
	}
}

func BenchmarkF4(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = F4(1000)
	}
}

BenchmarkF1-12    	 1000000	      1035 ns/op
BenchmarkF2-12    	 2000000	       809 ns/op
BenchmarkF3-12    	 3000000	       584 ns/op
BenchmarkF4-12    	 2000000	       897 ns/op

So when measuring throughput (F1 vs F2) the rewrite hurts, probably because we're substituting 3 instructions for 1. (Side note - why does this hurt? I would expect a reasonable fetch/retire engine to keep up with this loop.)

When measuring latency (F3 vs F4), however, the rewrite helps: we're now doing the multiply in two latency-1 instructions instead of one latency-3 instruction.
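The latency arithmetic can be made explicit with the rule-of-thumb costs quoted earlier (shifts and subtracts cost 1 cycle, imulq costs 3). This is only a sketch of that cost model under those assumed latencies; chainLatency is a made-up helper, not anything from the compiler:

```go
package main

import "fmt"

// chainLatency models the per-iteration critical-path cost, in cycles,
// of a multiply sitting on a dependency chain: dependent instructions
// cannot overlap, so their latencies simply add up.
func chainLatency(lats ...int) int {
	total := 0
	for _, l := range lats {
		total += l
	}
	return total
}

func main() {
	const imulq, shlq, subq = 3, 1, 1 // assumed Haswell latencies, in cycles

	fmt.Println("imul chain:", chainLatency(imulq))         // 3 cycles per multiply
	fmt.Println("shl+sub chain:", chainLatency(shlq, subq)) // 2 cycles per multiply
}
```

On the throughput side the picture flips: independent multiplies can overlap in the pipeline, so issuing one imul beats issuing three instructions, which is what F1 vs F2 shows.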

I think latency is more important than throughput, so we should keep the rewrite. I'm happy to hear arguments otherwise, though.

My processor is Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz, YMMV.

@ALTree (Member, Author) commented Aug 14, 2017

Ah, the reduction improves latency. Thanks for the explanation.

> I think latency is more important than throughput, so we should keep the rewrite

Well, the other compilers I looked at (GCC, Clang, Intel) seem to optimize for latency too, so I guess that's a point in favour of keeping those rewrites.

@randall77 (Contributor) commented Aug 14, 2017

Ok, I'll close this issue then. Feel free to reopen if you want to continue discussing.

randall77 closed this Aug 14, 2017

golang locked and limited conversation to collaborators Aug 14, 2018
