x/mobile: Go running much slower than CPP on ARM32 / ARM64 #42798
Comments
Your Go code uses a 64-bit sqrt while the C code uses a 32-bit sqrt. |
@AlexRouSg ya, I edited the code above. Since it's a large part, I edited it directly in the question. It's my fault; I didn't test further...
|
hmmm for ARM32 try setting the
@gopherbot please add labels Performance, NeedsInvestigation |
Both of the funcs have an extra bounds check that might be part of the reason. Compare https://go.godbolt.org/z/eGne9o.
Rest of the difference probably comes from different optimizations https://c.godbolt.org/z/dTdrfo. |
I edited the code above according to your advice.
The compiler output does show something; I'm still trying to understand it. |
Looking at gcc & clang output https://c.godbolt.org/z/vo7nhz, it seems they are unrolling the loop. (Notice the multiple calls to |
Here are manually unrolled Go versions https://play.golang.org/p/O3gUjWq4aMi. Results from RaspberryPi 4 (using arm32), Go 1.15.5, gcc 8.3.0:
|
I tested your code on
[Go scale] Repeat: [200], Total: 6.441751ms, Mean: 32.208µs
[Go scale_range] Repeat: [200], Total: 5.249708ms, Mean: 26.248µs
[C scale] Repeat: [200], Total: 3.955ms, Mean: 19.775µs
[Go scale_unroll1] Repeat: [200], Total: 10.919709ms, Mean: 54.598µs
[Go scale_unroll2] Repeat: [200], Total: 3.766ms, Mean: 18.83µs
[Go scale_unroll3] Repeat: [200], Total: 3.391792ms, Mean: 16.958µs
[Go scale_unroll4] Repeat: [200], Total: 2.937959ms, Mean: 14.689µs
I also did some assembly tests on
|
I guess the issue can be narrowed down to this: Go on ARM doesn't use indexed loads. https://go.godbolt.org/z/aovTe1 |
Maybe fmovs needs separate rules at https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssa/gen/ARM64.rules#L772 Indexed loads work properly for uint32.
|
Change https://golang.org/cl/273706 mentions this issue: |
@egonelbre I found something else on
$ CGO_ENABLED=1 GOOS=android GOARCH=$ANDROID_ARCH CC=$ANDROID_CC go build -a test_cpp_scale.go
$ adb push test_cpp_scale /data/local/tmp && adb shell '/data/local/tmp/test_cpp_scale'
test_cpp_scale: 1 file pushed, 0 skipped. 3590.1 MB/s (2374192 bytes in 0.001s)
[Go scale] Repeat: [200], Total: 10.604791ms, Mean: 53.023µs
[Go scale64] Repeat: [200], Total: 10.56849ms, Mean: 52.842µs
[C scale] Repeat: [200], Total: 1.54927ms, Mean: 7.746µs
[C scale64] Repeat: [200], Total: 2.954636ms, Mean: 14.773µs
package main

/*
float scale(float *src, int dim, float scale) {
    for(int i = 0; i < dim; i++) {
        src[i] *= scale;
    }
    return src[dim-1];
}
double scale64(double *src, int dim, double scale) {
    for(int i = 0; i < dim; i++) {
        src[i] *= scale;
    }
    return src[dim-1];
}
*/
import "C"

import (
	"fmt"
	"time"
	"unsafe"
)

type TestFunc func()

func main() {
	dim := 10240
	src := make([]float32, dim)
	for jj := range src {
		src[jj] = float32(jj % 10)
	}
	src64 := make([]float64, dim)
	for jj := range src64 {
		src64[jj] = float64(jj % 10)
	}
	repeats := 200
	basicTest(func() { scale(src, 1) }, "Go scale", repeats)
	basicTest(func() { scale64(src64, 1) }, "Go scale64", repeats)
	basicTest(func() { C.scale((*C.float)(unsafe.Pointer(&src[0])), C.int(dim), 1) }, "C scale", repeats)
	basicTest(func() { C.scale64((*C.double)(unsafe.Pointer(&src64[0])), C.int(dim), 1) }, "C scale64", repeats)
}

func basicTest(testFunc TestFunc, name string, repeats int) {
	ss := time.Now()
	for ii := 0; ii < repeats; ii++ {
		testFunc()
	}
	tt := time.Since(ss)
	fmt.Printf("[%s] Repeat: [%d], Total: %s, Mean: %s\n", name, repeats, tt, tt/time.Duration(repeats))
}

func scale(xs []float32, scale float32) {
	for i := range xs {
		xs[i] *= scale
	}
}

func scale64(xs []float64, scale float64) {
	for i := range xs {
		xs[i] *= scale
	}
}
|
I tested the same code on Raspberry Pi 4 (with the fix and using gcc), results:
Using clang:
Looking at the C assembly, it seems clang is automatically vectorizing and interleaving. See https://llvm.org/docs/Vectorizers.html for more information. This explains the 2x difference. After disabling vectorization and interleaving, I got:
|
Ya, I think that explains it. Thanks for the clarification.
$ CGO_ENABLED=1 GOOS=android GOARCH=$ANDROID_ARCH CC="$ANDROID_CC" CGO_CFLAGS="-O3 -fno-vectorize" go build -a test_cpp_scale.go
$ adb push test_cpp_scale /data/local/tmp && adb shell '/data/local/tmp/test_cpp_scale'
test_cpp_scale: 1 file pushed, 0 skipped. 3243.5 MB/s (2364312 bytes in 0.001s)
[Go scale] Repeat: [200], Total: 10.593854ms, Mean: 52.969µs
[Go scale64] Repeat: [200], Total: 10.506198ms, Mean: 52.53µs
[C scale] Repeat: [200], Total: 7.748177ms, Mean: 38.74µs
[C scale64] Repeat: [200], Total: 7.8475ms, Mean: 39.237µs |
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
What did you do?
I met this issue in multiple functions, like image resize / image warpaffine and many others: Go runs much slower than CPP code on ARM32 / ARM64. Here I wrote two test functions in cpp and go to reproduce.
L2Norm into two functions squareSum and multiplyWithNum.
_range functions, and use GOARM=7 building ARM32.
x86_64 by go run. The mean running times are similar.
x86_64 by go build and run. The mean running times vary.
ARM64. squareSum is similar, but multiplyWithNum is ~7 times slower.
ARM32. squareSum is similar, but multiplyWithNum is ~1.5 times slower after adding GOARM=7.
So what did I do wrong here? How can I improve this?
What did you expect to see?
I expected Go to have similar performance on the ARM32 / ARM64 platforms.
What did you see instead?
Go on the X86_64 platform has performance similar to cpp.
The Go multiplyWithNum function on the ARM64 platform is ~7 times slower than cpp.
The Go multiplyWithNum function on the ARM32 platform is ~1.5 times slower than cpp.