
cmd/compile: CMOV instruction slower than branch #26306

Closed

PeterRK opened this issue Jul 10, 2018 · 18 comments
@PeterRK commented Jul 10, 2018

What version of Go are you using (go version)?

go1.11beta1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/rk/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/rk/GoSpace/Projects"
GOPROXY=""
GORACE=""
GOROOT="/home/rk/GoSpace/GO"
GOTMPDIR=""
GOTOOLDIR="/home/rk/GoSpace/GO/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build473408654=/tmp/go-build -gno-record-gcc-switches"
VGOMODROOT=""

What did you do?

I built the code below with go 1.10 and go 1.11.
https://play.golang.org/p/22MEbiXFpzo

What did you expect to see?

The binary built by go 1.11 is as fast as that built by go 1.10.

What did you see instead?

The binary built by Go 1.11 is dramatically slower than the one built by Go 1.10.

Go 1.11 compiles the function "down" to assembly like this:

    MOVQ pos+32(SP), AX
    MOVQ list+8(SP), CX
    MOVL (CX)(AX*4), DX
    LEAQ 1(AX)(AX*1), BX
    MOVQ list+16(SP), SI
    DECQ SI
    JMP L42
L28:
    MOVL DI, (CX)(AX*4)
    LEAQ 1(BX)(BX*1), DI
    MOVQ BX, AX
    MOVQ DI, BX
L42:
    CMPQ BX, SI
    JGE L73
    MOVL 4(CX)(BX*4), DI
    LEAQ 1(BX), R8
    MOVL (CX)(BX*4), R9
    CMPL DI, R9
    CMOVQHI R8, BX
//  JHI L100
L66:
    MOVL (CX)(BX*4), DI
    CMPL DX, DI
    JCS L28
L73:    
    CMPQ BX, SI
    JNE L97
    MOVL (CX)(BX*4), SI
    CMPL DX, SI
    JCC L92
    MOVL SI, (CX)(AX*4)
L88:    
    MOVL DX, (CX)(BX*4)
    RET
L92:
    MOVQ AX, BX
    JMP L88
L97:    
    MOVQ AX, BX
    JMP L88
L100:
//  MOVQ R8, BX
//  JMP L66

Replacing the CMOV instruction with a branch makes it about 80% faster.
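The playground link above holds the original program; for reference, here is a minimal heapsort sketch of the same shape (a reconstruction, not the original code; the names `heapSort` and `down` are illustrative). The child-selection `if` inside `down` is where Go 1.11 emits the CMOVQHI shown above:

```go
package main

import "fmt"

// down sifts list[pos] toward the leaves of a max-heap stored in list[0:size].
func down(list []uint32, size, pos int) {
	val := list[pos]
	kid := pos*2 + 1
	last := size - 1
	for kid < last { // both children exist
		// Pick the larger child. Go 1.11 lowers this if to a CMOVQHI,
		// making the next load's address data-dependent on the
		// comparison instead of branch-predicted.
		if list[kid+1] > list[kid] {
			kid++
		}
		if list[kid] <= val {
			break
		}
		list[pos] = list[kid]
		pos = kid
		kid = pos*2 + 1
	}
	if kid == last && list[kid] > val { // single child left
		list[pos] = list[kid]
		pos = kid
	}
	list[pos] = val
}

func heapSort(list []uint32) {
	size := len(list)
	for i := size/2 - 1; i >= 0; i-- { // heapify
		down(list, size, i)
	}
	for i := size - 1; i > 0; i-- { // repeatedly extract the max
		list[0], list[i] = list[i], list[0]
		down(list, i, 0)
	}
}

func main() {
	a := []uint32{5, 3, 8, 1, 9, 2, 7}
	heapSort(a)
	fmt.Println(a) // [1 2 3 5 7 8 9]
}
```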

@PeterRK PeterRK changed the title Go 1.11 genertate very slow CMOV instruction Go 1.11 generates very slow CMOV instruction Jul 10, 2018

@josharian josharian added this to the Go1.11 milestone Jul 10, 2018

@josharian josharian changed the title Go 1.11 generates very slow CMOV instruction cmd/compile: CMOV instruction 5x slower than branch Jul 10, 2018

@josharian (Contributor) commented Jul 10, 2018 (comment minimized)

@ianlancetaylor (Contributor) commented Jul 10, 2018

@PeterRK What processor are you running?

@ulikunitz (Contributor) commented Jul 10, 2018

I've microbenchmarked the code here on my Core i7-4510U between go1.10.3 and go1.11beta1 and benchstat returns the following:

name        old time/op  new time/op  delta
HeapSort-4  52.3µs ± 3%  39.8µs ± 2%  -23.89%  (p=0.000 n=10+10)

This would indicate there is actually a 24% speed improvement between go1.10.3 and go1.11beta1. I can confirm that the difference is the use of the CMOV instruction in the down function.

Here is the Benchmark function:

const size = 1000

func makeSlice(size int) []uint32 {
	a := make([]uint32, size)
	for i := range a {
		a[i] = rand.Uint32()
	}
	return a
}

func BenchmarkHeapSort(b *testing.B) {
	a := makeSlice(size)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		HeapSort(a)
	}
}
@ulikunitz (Contributor) commented Jul 10, 2018

For larger arrays (size = 1000000) go1.11beta1 gets slower.

name        old time/op  new time/op  delta
HeapSort-4  75.9ms ± 2%  86.0ms ± 1%  +13.29%  (p=0.000 n=10+9)
@PeterRK (Author) commented Jul 10, 2018

I have tried an i7-8700 and an i5-7500.
The binary built by Go 1.11 has fewer cache misses and fewer branch misses, but 80% more execution time.
I don't think it can be called 5x slower. @josharian

Test with a 20M list: [perf screenshots for Go 1.10 and Go 1.11 omitted]

Test code:
https://play.golang.org/p/h8mmeue7v1R

@ALTree (Member) commented Jul 10, 2018

Here's a self-contained benchmark that can be copy-pasted directly into a _test.go file for whoever wants to try it: https://play.golang.org/p/4VwOmmmX069

Here's the result with count=5 on my machine for sizes from 100 to 1000000 (old is 1.10, new is tip):

name                old time/op  new time/op  delta
HeapSort/100-4      1.79µs ± 3%  2.45µs ± 2%  +36.43%  (p=0.008 n=5+5)
HeapSort/1000-4     52.1µs ± 3%  39.2µs ± 0%  -24.73%  (p=0.016 n=5+4)
HeapSort/10000-4     607µs ± 7%   540µs ± 0%  -11.08%  (p=0.016 n=5+4)
HeapSort/100000-4   6.55ms ± 1%  7.09ms ± 2%   +8.20%  (p=0.008 n=5+5)
HeapSort/1000000-4  75.0ms ± 0%  85.7ms ± 2%  +14.27%  (p=0.008 n=5+5)

On an i7-4510U CPU.

@agnivade (Member) commented Jul 10, 2018

Thanks @ALTree for the self-contained benchmark. I was very interested in the CMOV instruction so I gave it a shot too. I see different results in my CPU (Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz).

name                old time/op  new time/op  delta
HeapSort/100-4      2.03µs ± 1%  2.43µs ± 1%  +19.92%  (p=0.008 n=5+5)
HeapSort/1000-4     60.0µs ± 0%  38.8µs ± 1%  -35.29%  (p=0.008 n=5+5)
HeapSort/10000-4     681µs ± 1%   529µs ± 0%  -22.26%  (p=0.008 n=5+5)
HeapSort/100000-4   7.61ms ± 1%  6.78ms ± 1%  -10.93%  (p=0.008 n=5+5)
HeapSort/1000000-4  87.8ms ± 2%  85.9ms ± 1%   -2.19%  (p=0.008 n=5+5)

Things look more or less the same, except the last row, where I see a performance improvement rather than a slowdown. For the size-100 case, I also see a slowdown.

Sorry if this just adds noise. I only posted because I saw different results in the last case.

@PeterRK PeterRK changed the title cmd/compile: CMOV instruction 5x slower than branch cmd/compile: CMOV instruction slower than branch Jul 10, 2018

@ulikunitz (Contributor) commented Jul 10, 2018

This might be related to #25298, as pointed out by @randall77 on golang-dev. See the discussion there of branch predictor versus CMOV.

@ulikunitz (Contributor) commented Jul 10, 2018

The benchmark code has the problem that after the first iteration the slice a is already sorted.
Here is a fix: https://play.golang.org/p/b_-IC1OSEWq

The new benchstat looks like this:

name                 old time/op  new time/op  delta
HeapSort/100-4       5.59µs ± 3%  3.74µs ± 4%  -33.14%  (p=0.000 n=10+10)
HeapSort/1000-4      75.0µs ± 0%  50.1µs ± 1%  -33.26%   (p=0.000 n=7+10)
HeapSort/10000-4      879µs ± 2%   610µs ± 0%  -30.61%   (p=0.000 n=10+8)
HeapSort/100000-4    10.8ms ± 0%   8.8ms ± 1%  -18.80%   (p=0.000 n=8+10)
HeapSort/1000000-4    141ms ± 1%   146ms ± 1%   +3.62%  (p=0.000 n=10+10)
HeapSort/10000000-4   2.54s ± 3%   4.23s ± 6%  +66.86%  (p=0.000 n=10+10)

Apparently branch prediction and speculative execution (loads) are faster than CMOV if the slice size exceeds the level 2 cache.

@rasky (Member) commented Jul 10, 2018

Honestly, I'm not sure what to do. @TocarIP?

@PeterRK (Author) commented Jul 11, 2018

Go 1.11 vs Go 1.10 at different scales: [benchmark chart omitted]

Go 1.11 loses its advantage as data size grows; CMOV causes that.

@randall77 (Contributor) commented Jul 14, 2018

I see similar things. I suspect there are two different issues - one with N=100 and one with large N.
I'm going to focus on the large N for now.

With a few more data points (old=tip with cmov generation turned off, new=tip):

benchmark                        old ns/op      new ns/op      delta
BenchmarkHeapSort/100-8          1752           2022           +15.41%
BenchmarkHeapSort/1000-8         53765          35192          -34.54%
BenchmarkHeapSort/10000-8        760440         502731         -33.89%
BenchmarkHeapSort/100000-8       9790058        7496200        -23.43%
BenchmarkHeapSort/1000000-8      131420794      130939404      -0.37%
BenchmarkHeapSort/2000000-8      294149147      333858182      +13.50%
BenchmarkHeapSort/10000000-8     2271577039     3583052169     +57.73%
BenchmarkHeapSort/20000000-8     5213875449     9264420912     +77.69%

As the array gets larger, cmov gets relatively slower. My L3 cache is 10MB, which is N=1.25e6, just about at the inflection point.
The cmov is only being used to select the correct child, like this:

if list[kid+1] > list[kid] {
	kid++
}

The array size dependence seems an awful lot like cache effects. If the array we're sorting fits in L3 cache, it is fast. If the array does not fit, it's slower. My suspicion is that when using branches, if the branch is predicted correctly (probably 50% of the time) we can issue the next read (the grandchild in the heap) in parallel with the current read. Using cmov, there's a data dependence from one load to the next, so we never issue two of them in parallel. That might be enough of an effect to explain the result - the cmov serializes the reads, and the cache miss is so slow that the branch mispredict & restart can all be hidden in the cache miss.

It seems a pretty niche case where this effect comes into play. I wouldn't want to remove cmov generation because of it.

@randall77 (Contributor) commented Jul 15, 2018

I can somewhat verify my claim. I can add a prefetch to the benchmark and that mostly fixes the large N behavior. At the head of the loop in down, do:

if kid*2 < last {
	_ = atomic.LoadUint32(&list[kid*2])
}

Now old=nocmov, new=tip:

benchmark                        old ns/op      new ns/op      delta
BenchmarkHeapSort/100-8          2029           2262           +11.48%
BenchmarkHeapSort/1000-8         56825          37950          -33.22%
BenchmarkHeapSort/10000-8        807093         526252         -34.80%
BenchmarkHeapSort/100000-8       10113466       6808430        -32.68%
BenchmarkHeapSort/1000000-8      130448267      101057677      -22.53%
BenchmarkHeapSort/2000000-8      295489785      243496350      -17.60%
BenchmarkHeapSort/4000000-8      721814370      646561035      -10.43%
BenchmarkHeapSort/6000000-8      1202974592     1161344992     -3.46%
BenchmarkHeapSort/8000000-8      1729255015     1735633353     +0.37%
BenchmarkHeapSort/10000000-8     2274368639     2370006748     +4.21%
BenchmarkHeapSort/20000000-8     5265656303     5989478005     +13.75%

There is still some effect, but it is much reduced.

@randall77 (Contributor) commented Jul 16, 2018

Perhaps we could detect the fact that the result of a CMOV is used as an argument to a load, and suppress the use of a CMOV in those cases. That would allow the chip to continue to speculate the load. Seems like a pretty vague condition, though. And definitely not 1.11 material.
I'm going to punt this issue to 1.12. I think we're at a reasonable state for the release.

@randall77 randall77 modified the milestones: Go1.11, Go1.12 Jul 16, 2018

@bcmills (Member) commented Jul 16, 2018

Are there µ-arches where the CMOV doesn't block the speculative load? (That seems like the sort of thing that register renaming is supposed to take care of.)

@randall77 (Contributor) commented Jul 16, 2018

@bcmills I don't think so. The speculative load needs the address from which to load, and I don't know of any archs which speculate the address.

addr := ...
if cond { addr++ }
_ = *addr

With regular branches, cond does not need to be evaluated for the load to be issued. The processor speculates on the branch and either issues the load from addr or addr+1, depending on the branch direction it predicted.

addr := ...
addr = CMOV(cond, addr+1, addr)
_ = *addr

With the CMOV, the address from which you load is data-dependent on cond, so cond must be evaluated before the load is issued.
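To make the data-dependence argument concrete, the two shapes can be written as Go functions (hypothetical names; whether the compiler actually emits a branch or a CMOV for each form depends on the backend's heuristics):

```go
package main

import "fmt"

// Branch form: cond is consumed by a conditional jump, so the CPU can
// predict the branch and speculatively issue the load from list[addr]
// or list[addr+1] before cond resolves.
func loadBranch(list []uint32, addr int, cond bool) uint32 {
	if cond {
		addr++
	}
	return list[addr]
}

// CMOV-like form: the increment is folded into the address as data (the
// compiler is free to lower this to SETcc/CMOV), so the load has a data
// dependence on cond and cannot issue until cond is evaluated.
func loadCMOVLike(list []uint32, addr int, cond bool) uint32 {
	inc := 0
	if cond {
		inc = 1
	}
	return list[addr+inc]
}

func main() {
	list := []uint32{10, 20, 30}
	fmt.Println(loadBranch(list, 1, true), loadCMOVLike(list, 1, true))   // 30 30
	fmt.Println(loadBranch(list, 1, false), loadCMOVLike(list, 1, false)) // 20 20
}
```

Both functions compute the same value; only the shape of the dependence chain feeding the load differs.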

@gopherbot commented Oct 29, 2018

Change https://golang.org/cl/145717 mentions this issue: cmd/compile: don't use CMOV ops to compute load addresses

@gopherbot commented Mar 18, 2019

Change https://golang.org/cl/168117 mentions this issue: cmd/compile: replace OpSlicemask with branch
