x/mobile: Go running much slower than CPP on ARM32 / ARM64 #42798

Open
leondgarse opened this issue Nov 24, 2020 · 14 comments

@leondgarse leondgarse commented Nov 24, 2020

What version of Go are you using (go version)?

$ go version
go version go1.15.5 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/leondgarse/.cache/go-build"
GOENV="/home/leondgarse/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/leondgarse/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/leondgarse/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build417888568=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I hit this issue in several functions, such as image resize and image warp-affine: Go runs much slower than the equivalent CPP code on ARM32 / ARM64. To reproduce it, I wrote two small test functions in both C and Go.

  • Edit 1: Split L2Norm into two functions, squareSum and multiplyWithNum.
  • Edit 2: Added two _range functions, and built the ARM32 binary with GOARM=7.
package main

/*
float square_sum(float *src, int dim) {
    float ss = 0;
    for (int jj = 0; jj < dim; jj++) {
        ss += src[jj] * src[jj];
    }
    return ss;
}

void multiply_with_num(float *src, float num, int dim) {
    for (int jj = 0; jj < dim; jj++) {
        src[jj] *= num;
    }
}
*/
import "C"
import (
    "fmt"
    "runtime"
    "time"
    "unsafe"
)

type TestFunc func()

func main() {
    runtime.GOMAXPROCS(4) // Changing GOMAXPROCS makes no difference
    // Init
    dim := 10240
    src := make([]float32, dim)
    for jj := range src {
        src[jj] = float32(jj % 10)
    }

    repeats := 200
    basicTest(func() { squareSum(src, dim) }, "Go squareSum", repeats)
    basicTest(func() { squareSum_range(src) }, "Go squareSum_range", repeats)
    basicTest(func() { C.square_sum((*C.float)(unsafe.Pointer(&src[0])), C.int(dim)) }, "C square_sum", repeats)
    basicTest(func() { multiplyWithNum(src, 1, dim) }, "Go multiplyWithNum", repeats)
    basicTest(func() { multiplyWithNum_range(src, 1) }, "Go multiplyWithNum_range", repeats)
    basicTest(func() { C.multiply_with_num((*C.float)(unsafe.Pointer(&src[0])), 1, C.int(dim)) }, "C multiply_with_num", repeats)
}

func basicTest(testFunc TestFunc, name string, repeats int) {
    ss := time.Now()
    for ii := 0; ii < repeats; ii++ {
        testFunc()
    }
    tt := time.Since(ss)
    fmt.Printf("[%s] Repeat: [%d], Total: %s, Mean: %s\n", name, repeats, tt, tt/time.Duration(repeats))
}

func squareSum(aa []float32, dim int) float32 {
    var ss float32
    for jj := 0; jj < dim; jj++ {
        ss += aa[jj] * aa[jj]
    }
    return ss
}

func squareSum_range(aa []float32) float32 {
    var ss float32
    for jj := range aa {
        ss += aa[jj] * aa[jj]
    }
    return ss
}

func multiplyWithNum(aa []float32, num float32, dim int) {
    for jj := 0; jj < dim; jj++ {
        aa[jj] *= num
    }
}

func multiplyWithNum_range(aa []float32, num float32) {
    for ii := range aa {
        aa[ii] *= num
    }
}
  • Test on host x86_64 with go run. The mean running times are similar.
    $ go run test_cpp.go
    [Go squareSum] Repeat: [200], Total: 1.904418ms, Mean: 9.522µs
    [Go squareSum_range] Repeat: [200], Total: 1.911713ms, Mean: 9.558µs
    [C square_sum] Repeat: [200], Total: 1.917224ms, Mean: 9.586µs
    [Go multiplyWithNum] Repeat: [200], Total: 978.724µs, Mean: 4.893µs
    [Go multiplyWithNum_range] Repeat: [200], Total: 956.303µs, Mean: 4.781µs
    [C multiply_with_num] Repeat: [200], Total: 964.747µs, Mean: 4.823µs
  • Test on host x86_64 with go build, then run the binary twice. The mean running times vary between runs.
    $ go build -a test_cpp.go
    $ ./test_cpp
    [Go squareSum] Repeat: [200], Total: 2.180981ms, Mean: 10.904µs
    [Go squareSum_range] Repeat: [200], Total: 1.925532ms, Mean: 9.627µs
    [C square_sum] Repeat: [200], Total: 1.947857ms, Mean: 9.739µs
    [Go multiplyWithNum] Repeat: [200], Total: 833.088µs, Mean: 4.165µs
    [Go multiplyWithNum_range] Repeat: [200], Total: 602.318µs, Mean: 3.011µs
    [C multiply_with_num] Repeat: [200], Total: 863.123µs, Mean: 4.315µs
    
    $ ./test_cpp
    [Go squareSum] Repeat: [200], Total: 10.55661ms, Mean: 52.783µs
    [Go squareSum_range] Repeat: [200], Total: 10.226328ms, Mean: 51.131µs
    [C square_sum] Repeat: [200], Total: 6.421496ms, Mean: 32.107µs
    [Go multiplyWithNum] Repeat: [200], Total: 2.327771ms, Mean: 11.638µs
    [Go multiplyWithNum_range] Repeat: [200], Total: 1.451907ms, Mean: 7.259µs
    [C multiply_with_num] Repeat: [200], Total: 1.883815ms, Mean: 9.419µs
  • Test on ARM64. squareSum is similar, but Go multiplyWithNum is ~7 times slower than C.
    $ adb shell 'cat /proc/cpuinfo'
    Processor       : AArch64 Processor rev 4 (aarch64)
    CPU architecture: 8
    Hardware        : Qualcomm Technologies, Inc SDM630
    
    $ SDK_HOME="$HOME/Android/Sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/linux-x86_64"
    $ ANDROID_CC="$SDK_HOME/bin/aarch64-linux-android29-clang -Wl,-rpath-link,$SDK_HOME/sysroot/usr/lib/aarch64-linux-android/29"
    $ ANDROID_ARCH="arm64"
    
    $ CGO_ENABLED=1 GOOS=android GOARCH=$ANDROID_ARCH CC=$ANDROID_CC go build -a test_cpp.go
    $ adb push test_cpp /data/local/tmp
    $ adb shell '/data/local/tmp/test_cpp'
    [Go squareSum] Repeat: [200], Total: 8.596094ms, Mean: 42.98µs
    [Go squareSum_range] Repeat: [200], Total: 7.548593ms, Mean: 37.742µs
    [C square_sum] Repeat: [200], Total: 7.679844ms, Mean: 38.399µs
    [Go multiplyWithNum] Repeat: [200], Total: 11.419532ms, Mean: 57.097µs
    [Go multiplyWithNum_range] Repeat: [200], Total: 10.328698ms, Mean: 51.643µs
    [C multiply_with_num] Repeat: [200], Total: 1.431354ms, Mean: 7.156µs
  • Test on ARM32. squareSum is similar, but Go multiplyWithNum is still ~1.5 times slower than C after adding GOARM=7.
    $ adb shell 'cat /proc/cpuinfo'
    model name      : ARMv7 Processor rev 1 (v7l)
    CPU architecture: 7
    Hardware        : Generic DT based system
    
    $ SDK_HOME="$HOME/Android/Sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/linux-x86_64"
    $ ANDROID_CC="$SDK_HOME/bin/armv7a-linux-androideabi29-clang -Wl,-rpath-link,$SDK_HOME/sysroot/usr/lib/arm-linux-androideabi/29"
    $ ANDROID_ARCH="arm"
    
    $ CGO_ENABLED=1 GOOS=android GOARCH=$ANDROID_ARCH GOARM=7 CC=$ANDROID_CC go build -a test_cpp.go
    $ adb push test_cpp /data/local/tmp
    $ adb shell '/data/local/tmp/test_cpp'
    [Go squareSum] Repeat: [200], Total: 6.548209ms, Mean: 32.741µs
    [Go squareSum_range] Repeat: [200], Total: 6.442917ms, Mean: 32.214µs
    [C square_sum] Repeat: [200], Total: 5.458542ms, Mean: 27.292µs
    [Go multiplyWithNum] Repeat: [200], Total: 5.886126ms, Mean: 29.43µs
    [Go multiplyWithNum_range] Repeat: [200], Total: 5.243293ms, Mean: 26.216µs
    [C multiply_with_num] Repeat: [200], Total: 3.371375ms, Mean: 16.856µs
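
The two ./test_cpp runs on x86_64 above differ by roughly 5x, which suggests that a 200-iteration wall-clock loop is sensitive to noise. A minimal sketch of a steadier measurement follows, assuming the standard library's testing.Benchmark is acceptable outside go test. The sink variable and the name squareSumRange are illustrative and not part of the original program.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the result alive so the compiler cannot discard the loop.
var sink float32

// squareSumRange mirrors squareSum_range from the program above.
func squareSumRange(aa []float32) float32 {
	var ss float32
	for jj := range aa {
		ss += aa[jj] * aa[jj]
	}
	return ss
}

func main() {
	src := make([]float32, 10240)
	for jj := range src {
		src[jj] = float32(jj % 10)
	}

	// testing.Benchmark chooses b.N itself so that the run lasts long
	// enough (about one second by default) to average out short-lived noise.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = squareSumRange(src)
		}
	})
	fmt.Printf("squareSumRange: %d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

The same wrapper can be applied to each of the Go and cgo calls; note that for the C functions the measured time also includes the cgo call overhead.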

So what did I do wrong here? How can I improve this?

What did you expect to see?

I expected Go to perform similarly to CPP on ARM32 / ARM64.

What did you see instead?

  • On x86_64, Go performs similarly to CPP.
  • On ARM64, the Go multiplyWithNum function is ~7 times slower than CPP.
  • On ARM32, the Go multiplyWithNum function is ~1.5 times slower than CPP.
@gopherbot gopherbot added the mobile label Nov 24, 2020
@gopherbot gopherbot added this to the Unreleased milestone Nov 24, 2020
Contributor

@AlexRouSg AlexRouSg commented Nov 24, 2020

Your go code uses 64 bit sqrt while the c code uses 32 bit sqrt.
And the memory layout is different.
That is not exactly a fair comparison.

Author

@leondgarse leondgarse commented Nov 25, 2020

@AlexRouSg Yes, I edited the code above. Since it is a large block, I edited the question directly. It's my fault for not testing further first...

  • Split L2Norm into two functions, squareSum and multiplyWithNum.
  • Both now use a one-dimensional array.
  • Updated the results.
  • It makes some difference: squareSum on ARM64 is now similar to C.
Contributor

@AlexRouSg AlexRouSg commented Nov 25, 2020

hmmm for ARM32 try setting the GOARM=7 env var to make sure it's using the armv7 calls. Might want to do a go build -a to force a rebuild. I'm not too sure whether the other numbers are expected so you'll have to wait for someone else to comment on them.

@gopherbot please add labels Performance, NeedsInvestigation

Contributor

@egonelbre egonelbre commented Nov 25, 2020

Both of the funcs have an extra bounds check that might be part of the reason. Compare https://go.godbolt.org/z/eGne9o.

func multiplyWithNum(xs []float32, num float32) {
    for i := range xs {
        xs[i] *= num
    }
}

Rest of the difference probably comes from different optimizations https://c.godbolt.org/z/dTdrfo.
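
To make the bounds-check point concrete, here is a minimal sketch of two ways to write the loop so that the compiler can usually prove every index is in range. The names squareSumHinted and squareSumRange are illustrative. Whether a check is actually removed depends on the Go version; building with go build -gcflags=-d=ssa/check_bce/debug=1 reports the bounds checks that remain.

```go
package main

import "fmt"

// squareSumHinted keeps the explicit dim parameter but re-slices first.
// After aa = aa[:dim], len(aa) == dim, so jj < dim implies jj < len(aa)
// and the per-iteration check inside the loop is typically dropped.
// The re-slice itself panics once, up front, if dim is out of range.
func squareSumHinted(aa []float32, dim int) float32 {
	aa = aa[:dim]
	var ss float32
	for jj := 0; jj < dim; jj++ {
		ss += aa[jj] * aa[jj]
	}
	return ss
}

// squareSumRange iterates over the values directly, so there is no
// index expression left for the compiler to check.
func squareSumRange(aa []float32) float32 {
	var ss float32
	for _, v := range aa {
		ss += v * v
	}
	return ss
}

func main() {
	src := []float32{1, 2, 3, 4}
	fmt.Println(squareSumHinted(src, len(src)), squareSumRange(src))
}
```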

Author

@leondgarse leondgarse commented Nov 26, 2020

I edited the code above according to your advice.

  • Added two _range functions. They seem to improve performance, but there is still a gap for multiplyWithNum.
  • Used GOARM=7 on ARM32. It does work!
  • Added a basicTest function for the timing, as the code is getting large...
  • Updated the results, mainly the ARM32 ones.

The compiler output does show something; I'm still trying to understand it.

Contributor

@egonelbre egonelbre commented Nov 26, 2020

Looking at gcc & clang output https://c.godbolt.org/z/vo7nhz, it seems they are unrolling the loop. (Notice the multiple fmul instructions in the assembly.)

Contributor

@egonelbre egonelbre commented Nov 26, 2020

Here are manually unrolled Go versions https://play.golang.org/p/O3gUjWq4aMi.

Results from a Raspberry Pi 4 (using arm32), Go 1.15.5, gcc 8.3.0:

go    20982.638ns
go u2 12497.042ns
go u3 10857.608ns
go u4 10101.493ns
c     15200.917ns
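
For readers who cannot open the playground link, a 4-way unrolled loop of the kind being measured might look like the sketch below. This is an illustrative reconstruction, not the linked code: the name scaleUnroll4 and the three-index sub-slice used to help bounds-check elimination are assumptions.

```go
package main

import "fmt"

// scaleUnroll4 multiplies every element of xs by scale, handling four
// elements per loop iteration and finishing any remainder one at a time.
func scaleUnroll4(xs []float32, scale float32) {
	i := 0
	for ; i+4 <= len(xs); i += 4 {
		// A fixed-length window lets the compiler see that indices
		// 0..3 are always in range.
		p := xs[i : i+4 : i+4]
		p[0] *= scale
		p[1] *= scale
		p[2] *= scale
		p[3] *= scale
	}
	// Tail loop for lengths that are not a multiple of four.
	for ; i < len(xs); i++ {
		xs[i] *= scale
	}
}

func main() {
	xs := []float32{0, 1, 2, 3, 4, 5}
	scaleUnroll4(xs, 2)
	fmt.Println(xs)
}
```

Unrolling reduces the loop-counter and branch work per element; it does not by itself make the Go compiler emit vector instructions.
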
Author

@leondgarse leondgarse commented Nov 27, 2020

I tested your code on ARM32, and it also improves performance under my environment!

[Go scale] Repeat: [200], Total: 6.441751ms, Mean: 32.208µs
[Go scale_range] Repeat: [200], Total: 5.249708ms, Mean: 26.248µs
[C scale] Repeat: [200], Total: 3.955ms, Mean: 19.775µs
[Go scale_unroll1] Repeat: [200], Total: 10.919709ms, Mean: 54.598µs
[Go scale_unroll2] Repeat: [200], Total: 3.766ms, Mean: 18.83µs
[Go scale_unroll3] Repeat: [200], Total: 3.391792ms, Mean: 16.958µs
[Go scale_unroll4] Repeat: [200], Total: 2.937959ms, Mean: 14.689µs

I also did some assembly tests on ARM64. Here are my results:

  • First, add a test_cpp.s file so that I can change the asm code directly. It contains two functions:
    • scale_range_asm_1 is the original compiler output.
    • scale_range_asm_2 uses R2*4 indexed addressing, as the amd64 compiler output does, instead of LSL $2, R2, R3.
    #include "textflag.h"
    
    TEXT    ·scale_range_asm_1(SB), NOSPLIT, $0-32
            MOVD    xs+8(FP), R0
            MOVD    xs(FP), R1
            FMOVS   scale+24(FP), F0
            MOVD    ZR, R2
            JMP     scale_range_asm_1_2
    scale_range_asm_1_1:
            LSL     $2, R2, R3
            FMOVS   (R1)(R3), F1
            FMULS   F0, F1, F1
            FMOVS   F1, (R1)(R3)
            ADD     $1, R2, R2
    scale_range_asm_1_2:
            CMP     R0, R2
            BLT     scale_range_asm_1_1
            RET     (R30)
    
    TEXT    ·scale_range_asm_2(SB), NOSPLIT, $0-32
            MOVD    xs+8(FP), R0
            MOVD    xs(FP), R1
            FMOVS   scale+24(FP), F0
            MOVD    ZR, R2
            JMP     scale_range_asm_2_2
    scale_range_asm_2_1:
            FMOVS   (R1)(R2*4), F1
            FMULS   F0, F1, F1
            FMOVS   F1, (R1)(R2*4)
            ADD     $1, R2, R2
    scale_range_asm_2_2:
            CMP     R0, R2
            BLT     scale_range_asm_2_1
            RET     (R30)
  • Delete all the C parts from test_cpp.go.
  • Then add the function declarations and test calls to test_cpp.go:
    ...
    func scale_range_asm_1(xs []float32, scale float32)
    func scale_range_asm_2(xs []float32, scale float32)
    ...
    func main() {
        ...
        basicTest(func() { scale_range_asm_1(src, 1) }, "Go scale_range_asm_1", repeats)
        basicTest(func() { scale_range_asm_2(src, 1) }, "Go scale_range_asm_2", repeats)
        ...
    }
  • Build and run on ARM64:
    $ CGO_ENABLED=1 GOOS=android GOARCH=$ANDROID_ARCH CC=$ANDROID_CC go build -a .
    $ ls
    test_asm_arm64  test_cpp.go  test_cpp.s
    
    $ adb push test_asm_arm64 /data/local/tmp && adb shell '/data/local/tmp/test_asm_arm64'
    [Go scale range] Repeat: [200], Total: 10.336041ms, Mean: 51.68µs
    [Go scale_range_asm_1] Repeat: [200], Total: 10.579948ms, Mean: 52.899µs
    [Go scale_range_asm_2] Repeat: [200], Total: 8.107604ms, Mean: 40.538µs
    [Go scale_unroll2] Repeat: [200], Total: 7.169792ms, Mean: 35.848µs
    [Go scale_unroll3] Repeat: [200], Total: 6.878385ms, Mean: 34.391µs
    [Go scale_unroll4] Repeat: [200], Total: 6.851875ms, Mean: 34.259µs
  • It seems scale_range_asm_2 improves performance in my environment...
Contributor

@egonelbre egonelbre commented Nov 27, 2020

I guess the issue can be narrowed down to: Go on ARM doesn't use indexed loads. https://go.godbolt.org/z/aovTe1

Contributor

@egonelbre egonelbre commented Nov 27, 2020

Maybe fmovs needs separate rules at https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssa/gen/ARM64.rules#L772

Indexed loads work properly for uint32.

package ex

func getUint32(xs []uint32, i int) uint32 {
    return xs[i]
}

func getFloat32(xs []float32, i int) float32 {
    return xs[i]
}
        text    "".getUint32(SB), LEAF|NOFRAME|ABIInternal, $0-40
        funcdata        ZR, gclocals·1a65e721a2ccc325b382662e7ffee780(SB)
        funcdata        $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
        movd    "".i+24(FP), R0
        movd    "".xs(FP), R1
        movwu   (R1)(R0<<2), R0
        movw    R0, "".~r2+32(FP)
        ret     (R30)

        text    "".getFloat32(SB), LEAF|NOFRAME|ABIInternal, $0-40
        funcdata        ZR, gclocals·1a65e721a2ccc325b382662e7ffee780(SB)
        funcdata        $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
        movd    "".i+24(FP), R0
        lsl     $2, R0, R0
        movd    "".xs(FP), R1
        fmovs   (R1)(R0), F0
        fmovs   F0, "".~r2+32(FP)
        ret     (R30)

@gopherbot gopherbot commented Nov 27, 2020

Change https://golang.org/cl/273706 mentions this issue: cmd/compile: ARM64 optimize []float64 and []float32 access

Author

@leondgarse leondgarse commented Dec 2, 2020

@egonelbre I found something else on ARM64: if we use double / float64 in scale, the execution time for CPP doubles, but there is almost no difference for Go.

  • Result
$ CGO_ENABLED=1 GOOS=android GOARCH=$ANDROID_ARCH CC=$ANDROID_CC go build -a test_cpp_scale.go
$ adb push test_cpp_scale /data/local/tmp && adb shell '/data/local/tmp/test_cpp_scale'
test_cpp_scale: 1 file pushed, 0 skipped. 3590.1 MB/s (2374192 bytes in 0.001s)
[Go scale] Repeat: [200], Total: 10.604791ms, Mean: 53.023µs
[Go scale64] Repeat: [200], Total: 10.56849ms, Mean: 52.842µs
[C scale] Repeat: [200], Total: 1.54927ms, Mean: 7.746µs
[C scale64] Repeat: [200], Total: 2.954636ms, Mean: 14.773µs
  • Script
package main

/*
float scale(float *src, int dim, float scale) {
    for(int i = 0; i < dim; i++) {
        src[i] *= scale;
    }
    return src[dim-1];
}

double scale64(double *src, int dim, double scale) {
    for(int i = 0; i < dim; i++) {
        src[i] *= scale;
    }
    return src[dim-1];
}
*/
import "C"
import (
    "fmt"
    "time"
    "unsafe"
)

type TestFunc func()

func main() {
    dim := 10240
    src := make([]float32, dim)
    for jj := range src {
        src[jj] = float32(jj % 10)
    }
    src64 := make([]float64, dim)
    for jj := range src64 {
        src64[jj] = float64(jj % 10)
    }

    repeats := 200
    basicTest(func() { scale(src, 1) }, "Go scale", repeats)
    basicTest(func() { scale64(src64, 1) }, "Go scale64", repeats)
    basicTest(func() { C.scale((*C.float)(unsafe.Pointer(&src[0])), C.int(dim), 1) }, "C scale", repeats)
    basicTest(func() { C.scale64((*C.double)(unsafe.Pointer(&src64[0])), C.int(dim), 1) }, "C scale64", repeats)
}

func basicTest(testFunc TestFunc, name string, repeats int) {
    ss := time.Now()
    for ii := 0; ii < repeats; ii++ {
        testFunc()
    }
    tt := time.Since(ss)
    fmt.Printf("[%s] Repeat: [%d], Total: %s, Mean: %s\n", name, repeats, tt, tt/time.Duration(repeats))
}

func scale(xs []float32, scale float32) {
    for i := range xs {
        xs[i] *= scale
    }
}

func scale64(xs []float64, scale float64) {
    for i := range xs {
        xs[i] *= scale
    }
}
  • From the CPP asm code for scale / scale64, we can see that scale uses s0 / s1, while scale64 uses d0 / d1.
  • CPP scale on ARM64 is insanely fast; I don't know whether I'm missing some parameter here...
Contributor

@egonelbre egonelbre commented Dec 2, 2020

I tested the same code on a Raspberry Pi 4 (with the fix and using gcc). Results:

[Go scale] Repeat: [200], Total: 3.264352ms, Mean: 16.321µs
[Go scale64] Repeat: [200], Total: 3.285981ms, Mean: 16.429µs
[C scale] Repeat: [200], Total: 2.910802ms, Mean: 14.554µs
[C scale64] Repeat: [200], Total: 3.032948ms, Mean: 15.164µs

Using clang:

ubuntu@cloudy:~/code/sandbox/floats$ CC=clang go run main.go
[Go scale] Repeat: [200], Total: 3.297851ms, Mean: 16.489µs
[Go scale64] Repeat: [200], Total: 4.368629ms, Mean: 21.843µs
[C scale] Repeat: [200], Total: 1.156703ms, Mean: 5.783µs
[C scale64] Repeat: [200], Total: 2.338682ms, Mean: 11.693µs

Looking at the C assembly, it seems clang is automatically vectorizing and interleaving. See https://llvm.org/docs/Vectorizers.html for more information. This explains the 2x difference.

After disabling vectorization and interleaving, I got:

ubuntu@cloudy:~/code/sandbox/floats$ CC=clang go run main.go
[Go scale] Repeat: [200], Total: 3.279888ms, Mean: 16.399µs
[Go scale64] Repeat: [200], Total: 3.266185ms, Mean: 16.33µs
[C scale] Repeat: [200], Total: 3.077429ms, Mean: 15.387µs
[C scale64] Repeat: [200], Total: 2.845988ms, Mean: 14.229µs
@leondgarse
Copy link
Author

@leondgarse leondgarse commented Dec 2, 2020

Yes, I think that explains it. Thanks for the clarification.

$ CGO_ENABLED=1 GOOS=android GOARCH=$ANDROID_ARCH CC="$ANDROID_CC" CGO_CFLAGS="-O3 -fno-vectorize" go build -a test_cpp_scale.go
$ adb push test_cpp_scale /data/local/tmp && adb shell '/data/local/tmp/test_cpp_scale'
test_cpp_scale: 1 file pushed, 0 skipped. 3243.5 MB/s (2364312 bytes in 0.001s)
[Go scale] Repeat: [200], Total: 10.593854ms, Mean: 52.969µs
[Go scale64] Repeat: [200], Total: 10.506198ms, Mean: 52.53µs
[C scale] Repeat: [200], Total: 7.748177ms, Mean: 38.74µs
[C scale64] Repeat: [200], Total: 7.8475ms, Mean: 39.237µs