Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

math: Lgamma got slower from Go 1.6 to Go 1.7 #17121

Open
bulatm opened this issue Sep 15, 2016 · 7 comments

Comments

Projects
None yet
8 participants
@bulatm
Copy link

commented Sep 15, 2016

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go1.6.3, go version devel +8086e7c

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build584510834=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

What did you do?

Investigating slowdown of Lgamma benchmark.
1.6 vs 1.7 gives:
name old time/op new time/op delta
Lgamma-4 11.9ns ± 0% 14.2ns ± 0% +19.33% (p=0.000 n=29+29)

It might be compiler issue(code not changed). Looking into objdump of lgamma function for both versions I see different instruction order and registers used:
In 1.7:

REPNE MOVSD_XMM 0x14a4b5(IP), X3
REPNE MOVSD_XMM 0x14a4bd(IP), X5
REPNE MOVSD_XMM 0x14a4c5(IP), X6
REPNE MOVSD_XMM 0x14a4cd(IP), X7
REPNE MOVSD_XMM 0x14a4d4(IP), X8
REPNE MOVSD_XMM 0x14a4db(IP), X9
REPNE MULSD X4, X9
REPNE ADDSD X9, X8
REPNE MULSD X4, X8
REPNE ADDSD X8, X7
REPNE MULSD X4, X7
REPNE ADDSD X7, X6
REPNE MULSD X4, X6
...

In 1.6:

REPNE MOVSD_XMM 0x1f94c0(IP), X0
REPNE MOVSD_XMM 0x1f94c0(IP), X1
REPNE MULSD X3, X1
REPNE ADDSD X1, X0
REPNE MULSD X3, X0
REPNE MOVSD_XMM 0x1f949c(IP), X1
REPNE ADDSD X1, X0
REPNE MULSD X3, X0
REPNE MOVSD_XMM 0x1f9484(IP), X1
REPNE ADDSD X1, X0
REPNE MULSD X3, X0
...

Tried force scheduler to generate instruction order like in 1.6
(for that expanded all expressions like p1 := _lgamT[0] + w_(_lgamT[3]+w_(_lgamT[6]+w_(_lgamT[9]+w__lgamT[12]))) with tmp variables).
But result was still bad:
name old time/op new time/op delta
Lgamma-4 11.9ns ± 0% 13.1ns ± 0% +10.08% (p=0.000 n=29+19)

Also tried to generate random registers - no much effect:
name old time/op new time/op delta
Lgamma-4 11.9ns ± 0% 13.5ns ± 0% +13.45% (p=0.000 n=29+10)

@randall77

This comment has been minimized.

Copy link
Contributor

commented Sep 15, 2016

@bradfitz bradfitz changed the title math/Lgamma: go1.7 vs go1.6 benchmark slowdown math: Lgamma got slower from Go 1.6 to Go 1.7 Sep 15, 2016

@quentinmit quentinmit added this to the Go1.8 milestone Oct 3, 2016

@quentinmit quentinmit modified the milestones: Go1.8Maybe, Go1.8 Oct 13, 2016

@rsc

This comment has been minimized.

Copy link
Contributor

commented Oct 20, 2016

@josharian

This comment has been minimized.

Copy link
Contributor

commented Oct 31, 2016

Confirmed that there's been no change since 1.7, although the slowdown from 1.6 is less pronounced for me:

$ benchstat 16 17
name      old time/op  new time/op  delta
Lgamma-8  21.1ns ± 5%  22.7ns ±11%  +7.48%  (p=0.000 n=46+50)
$ benchstat 17 tip
name      old time/op  new time/op  delta
Lgamma-8  22.7ns ±11%  22.7ns ±10%   ~     (p=0.684 n=50+50)
$ benchstat 16 tip
name      old time/op  new time/op  delta
Lgamma-8  21.1ns ± 5%  22.7ns ±10%  +7.18%  (p=0.000 n=46+50)
@cherrymui

This comment has been minimized.

Copy link
Contributor

commented Nov 11, 2016

For record, I tried CL https://go-review.googlesource.com/c/31656/ to make loads and computations interleaving, but it doesn't help on speed.

@rsc rsc modified the milestones: Go1.9, Go1.8Maybe Nov 11, 2016

@aregm

This comment has been minimized.

Copy link

commented Mar 2, 2017

I have looked into the internals of the slow case and looks like there is a problem with register spill/fill caused by a call in the hot loop here:
46db60: movsd %xmm2,0x40(%rsp) <---------------
46db66: movsd 0x10368a(%rip),%xmm3 # 5711f8 <$f64.3b90000000000000>
46db6d: 00
46db6e: ucomisd %xmm2,%xmm3
46db72: jbe 46dbc5 <math.Lgamma+0xe5>
...
6db8b: callq 471150 <math.Log> <---------------- and SinPi later
...
46dbc5: ucomisd %xmm0,%xmm1
46dbc9: jbe 46e46f <math.Lgamma+0x98f>
...
46e46f: xor %ecx,%ecx
46e471: xorps %xmm0,%xmm0
46e474: mov $0x1,%rax
46e47b: jmpq 46dc62 <math.Lgamma+0x182>
...
46dc62: mov %cl,0x16(%rsp)
46dc66: movsd %xmm0,0x30(%rsp)
46dc6c: mov %rax,0x38(%rsp)
46dc71: movsd 0x40(%rsp),%xmm2 <---------------

@rsc

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2017

Hi @aregm, thanks very much for looking at this. I tried to reproduce this result and I can't. If I run

go get -u rsc.io/tmp/gammabench
cd $GOPATH/src/rsc.io/tmp/gammabench
go test -bench . -count 5

then I get:

BenchmarkLgamma16-8    	100000000	        13.4 ns/op
BenchmarkLgamma16-8    	100000000	        13.4 ns/op
BenchmarkLgamma16-8    	100000000	        13.4 ns/op
BenchmarkLgamma16-8    	100000000	        13.5 ns/op
BenchmarkLgamma16-8    	100000000	        13.4 ns/op
BenchmarkLgamma17-8    	100000000	        14.8 ns/op
BenchmarkLgamma17-8    	100000000	        14.8 ns/op
BenchmarkLgamma17-8    	100000000	        14.8 ns/op
BenchmarkLgamma17-8    	100000000	        14.8 ns/op
BenchmarkLgamma17-8    	100000000	        14.8 ns/op
BenchmarkLgammaZZZ-8   	100000000	        14.7 ns/op
BenchmarkLgammaZZZ-8   	100000000	        14.6 ns/op
BenchmarkLgammaZZZ-8   	100000000	        14.6 ns/op
BenchmarkLgammaZZZ-8   	100000000	        14.6 ns/op
BenchmarkLgammaZZZ-8   	100000000	        14.8 ns/op

That directory has the assembly equivalents of the Go 1.6, Go 1.7, and Go master (ZZZ) compiler output for the math.Lgamma function (lgamma16.s, lgamma17.s, lgammazzz.s). The idea was to be able to modify the assembly directly, to be able to experiment to test hypotheses about the slowdown without having to coerce the compiler into making the changes you want to try.

In the Go 1.6 assembly, the spill of the temporary holding lgamma happens before math.Log, instead of above the switch, as you identified. That is, Go 1.6 does:

	DIVSD	X4, X1
	ADDSD	X1, X0
	MOVAPD	X0, X5
	MOVSD	X0, lgamma+8(FP)
	MOVSD	$(1.0), X1
	CMPQ	BX, $4
	JGT	_1284
	CMPQ	BX, $3
	JNE	_1252
_1189:
	MOVAPD	X1, X2
	MOVSD	$(2.0), X1
	ADDSD	X3, X1
	MULSD	X2, X1
	MOVSD	X5, autotmp_0310-8(SP)
	MOVSD	X1, (SP)
	PCDATA	$0, $0
	CALL	math·Log(SB)
	MOVSD	8(SP), X2
	MOVSD	autotmp_0310-8(SP), X5
	ADDSD	X2, X5
	MOVSD	X5, lgamma+8(FP)
	NOP

while current master does (Go 1.7 is similar):

	DIVSD	X1, X2
	ADDSD	X3, X2
	MOVSD	X2, autotmp_419-16(SP)
	CMPQ	DX, $4
	JGT	_1695
	CMPQ	DX, $3
	JNE	_1658
	MOVSD	$(1.0), X1
_1604:
	ADDSD	X5, X4
	MULSD	X1, X4
	MOVSD	X4, (SP)
	PCDATA	$0, $0
	CALL	math·Log(SB)
	MOVSD autotmp_419-16(SP), X0
	ADDSD	8(SP), X0

It's true that autotmp_419 is being spilled unnecessarily in the case that the switch doesn't end up in a case that eventually calls Log. But the spll in master to autotmp_419 replaces the (possibly also unnecessary) spill to lgamma+0(FP) in Go 1.6, so there aren't extra spills compared to Go 1.6, at least not in the hot code.

Even so, I tried changing the lgammazzz.s assembly to move the spill:

	DIVSD	X1, X2
	ADDSD	X3, X2
//old	MOVSD	X2, autotmp_419-16(SP)
	CMPQ	DX, $4
	JGT	_1695
	CMPQ	DX, $3
	JNE	_1658
	MOVSD	$(1.0), X1
_1604:
	ADDSD	X5, X4
	MULSD	X1, X4
	MOVSD	X4, (SP)
	PCDATA	$0, $0
	MOVSD	X2, autotmp_419-16(SP) // new
	CALL	math·Log(SB)
	MOVSD autotmp_419-16(SP), X0
	ADDSD	8(SP), X0

That cut BenchmarkLgammaZZZ from 14.6ns to 14.5ns, so it does help by 0.1ns but it doesn't seem to be the major contributor for the 1.2ns gap between Go 1.6 and master.

@aregm

This comment has been minimized.

Copy link

commented Apr 5, 2017

There are two main causes for this degradation: first in bad case the register allocator is not keeping X2 register live enough and does redundant spills in the critical path, which is:
pc101
pc2246
pc123
pc224
pc2231
pc378
pc432
pc448
pc1393

In the good case, it keeps X2 alive throughout the critical path and spills it just before ret.
Second, the bad case has about 5% more code execution which affects the caches and contributes to the rest of degradation.
I have used temp register to get back some of the degradations and it gives more than half back.

lgamma_bad.s.txt
lgamma_good.s.txt

@bradfitz bradfitz modified the milestones: Go1.10, Go1.9 Jun 7, 2017

@bradfitz bradfitz added the help wanted label Jun 7, 2017

@bradfitz bradfitz modified the milestones: Go1.10, Unplanned Nov 15, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.