
runtime: memmove performance on arm64 for unaligned copies is poor on some CPUs #40324

AWSjswinney opened this issue Jul 21, 2020 · 1 comment


The following is intended to start a discussion about the performance of memmove on arm64 and the pros and cons of implementing micro-architecture-aware flags to achieve better performance on varying CPUs. A change making the modifications tested in the data below can be found here: https://go-review.googlesource.com/c/go/+/243357

What version of Go are you using (go version)?

tip of master branch (at the time of writing)

$ go version
go version devel +4469f5446a Wed Jul 15 21:52:49 2020 +0000 linux/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

arch: arm64
os: Amazon Linux 2

$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/ec2-user/.cache/go-build"
GOENV="/home/ec2-user/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/ec2-user/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/ec2-user/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/ec2-user/go-compiler-test"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/ec2-user/go-compiler-test/pkg/tool/linux_arm64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build862296855=/tmp/go-build -gno-record-gcc-switches"

What did you do?

cd src/
go test runtime -test.run "TestMemmove" -test.bench="BenchmarkMemmove" -test.count=10 -test.timeout=90m
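The benchmarks above live in the runtime package, but their shape can be sketched with the standard testing package. This is a minimal stand-in (not the runtime's exact benchmark code): it copies n bytes per iteration, and the built-in copy compiles down to runtime.memmove.

```go
package main

import (
	"fmt"
	"testing"
)

// benchMemmove is a simplified stand-in for the runtime's
// BenchmarkMemmove: copy n bytes per iteration and let the testing
// package compute throughput from SetBytes.
func benchMemmove(n int) testing.BenchmarkResult {
	src := make([]byte, n)
	dst := make([]byte, n)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(n))
		for i := 0; i < b.N; i++ {
			copy(dst, src) // lowered to runtime.memmove
		}
	})
}

func main() {
	r := benchMemmove(4096)
	// Bytes is the per-iteration byte count; N is the iteration count.
	fmt.Println(r.Bytes == 4096, r.N > 0)
}
```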

What did you expect to see?

I expected to see good performance across a range of systems on all of the memmove benchmarks.

What did you see instead?

On some CPUs, unaligned copies showed poor performance.

Data

The CPUs I used for testing are:

  • AWS Graviton 2, c6g.xlarge, Neoverse-N1
  • AWS Graviton 1, a1.xlarge, Cortex-A72
  • Raspberry Pi 3, Cortex-A53

The new implementation has several new optimizations, as noted in the commit message:

This implementation makes use of new optimizations:
 - software pipelined loop for large (>128 byte) moves
 - medium size moves (17..128 bytes) have a new implementation
 - address realignment when src or dst is unaligned
 - preference for aligned src (loads) or dst (stores) depending on
   micro-architecture
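The realignment step amounts to peeling off a small head copy until the chosen pointer (src or dst) reaches a 16-byte boundary. A sketch of that computation in Go (a hypothetical helper for illustration; the actual logic lives in the arm64 memmove assembly):

```go
package main

import "fmt"

// headBytes returns how many leading bytes a copy routine would handle
// separately before addr reaches a 16-byte boundary. Hypothetical
// helper illustrating the realignment idea only.
func headBytes(addr uintptr) uintptr {
	return (16 - addr%16) % 16
}

func main() {
	fmt.Println(headBytes(0x1000)) // already aligned: 0 head bytes
	fmt.Println(headBytes(0x1005)) // 5 bytes past a boundary: 11 head bytes
}
```

After the head copy, the bulk loop runs with the preferred side of the transfer aligned, and a short tail copy finishes the remainder.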

Aligned copies

For aligned copies, performance improved by about 5% for the largest copies. The implementation without realignment performs slightly better, since it omits the overhead of checking the flag and performing the realignment.
(benchmark chart: memmove-neoverse-n1-memmove)

name                               old time/op    new time/op     delta
Memmove/2048-4                       62.1ns ± 0%     59.0ns ± 0%   -4.99%  (p=0.000 n=10+10)
Memmove/4096-4                        117ns ± 1%      110ns ± 0%   -5.66%  (p=0.000 n=10+10)
name                               old speed      new speed       delta
Memmove/2048-4                     33.0GB/s ± 0%   34.7GB/s ± 0%   +5.36%  (p=0.000 n=10+10)
Memmove/4096-4                     35.2GB/s ± 0%   37.1GB/s ± 0%   +5.56%  (p=0.000 n=10+9)

And for the Cortex-A53, the improvements were larger. This could be because the A53 executes in order (no out-of-order execution), and the loads and stores in this implementation are manually reordered.
(benchmark chart: memmove-a53-memmove)

name                               old time/op    new time/op    delta
Memmove/512-4                         127ns ± 0%     110ns ± 0%  -13.39%  (p=0.000 n=8+8)
Memmove/1024-4                        222ns ± 0%     205ns ± 1%   -7.66%  (p=0.000 n=7+10)
Memmove/2048-4                        411ns ± 0%     366ns ± 0%  -10.98%  (p=0.000 n=8+9)
Memmove/4096-4                        795ns ± 1%     695ns ± 1%  -12.63%  (p=0.000 n=10+10)
name                               old speed      new speed      delta
Memmove/512-4                      4.03GB/s ± 0%  4.66GB/s ± 0%  +15.47%  (p=0.000 n=8+8)
Memmove/1024-4                     4.62GB/s ± 0%  5.00GB/s ± 1%   +8.17%  (p=0.000 n=8+10)
Memmove/2048-4                     4.98GB/s ± 0%  5.59GB/s ± 0%  +12.22%  (p=0.000 n=8+9)
Memmove/4096-4                     5.15GB/s ± 1%  5.90GB/s ± 1%  +14.51%  (p=0.000 n=10+10)

Unaligned Copies

For unaligned copies, the difference is more apparent. First, consider a Neoverse N1 CPU.
The following compares:

  • the proposed implementation (with the CPUID flag)
  • realignment with the destination pointer (the wrong choice for this CPU)
  • the proposed implementation without the CPUID flag and realignment
  • the existing implementation

(benchmark chart: memmove-neoverse-n1-unaligned-src)

Next, on a Cortex-A72, the opposite alignment choice produces the best results.
The following compares:

  • the proposed implementation (with the CPUID flag)
  • realignment with the source pointer (the wrong choice for this CPU)
  • the proposed implementation without the CPUID flag and realignment
  • the existing implementation

(benchmark chart: memmove-a72-unaligned-dst)

Conclusion

For the proposed implementation, I chose to use CPUID to determine the micro-architecture and select the best-performing move sequence for the target CPU. Since two of the test CPUs have opposite behavior regarding alignment, this flag allows both CPUs (and hopefully others as well) to achieve good performance.
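The flag idea can be sketched as a package-level boolean set once at startup from the detected CPU part number and consulted by the copy routine. The part numbers below are MIDR_EL1 values (0xD0C for Neoverse N1, 0xD08 for Cortex-A72); treat this mapping as illustrative, not the change list's exact table.

```go
package main

import "fmt"

// alignSrcLoads mirrors the proposed one-time flag: detected once at
// startup, then consulted by the copy routine to decide whether to
// realign the source (loads) or the destination (stores).
var alignSrcLoads bool

// setAlignFlag picks the realignment preference from the CPU part
// number (illustrative mapping based on the benchmark results above).
func setAlignFlag(part uint16) {
	switch part {
	case 0xD0C: // Neoverse N1: aligned loads performed best
		alignSrcLoads = true
	case 0xD08: // Cortex-A72: aligned stores performed best
		alignSrcLoads = false
	}
}

func main() {
	setAlignFlag(0xD0C)
	fmt.Println(alignSrcLoads)
	setAlignFlag(0xD08)
	fmt.Println(alignSrcLoads)
}
```

Because the check happens once rather than per call, the per-copy cost is a single flag test, which the aligned-copy numbers above suggest is small.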

agnivade (Contributor) commented Jul 21, 2020

ALTree added the Performance label Jul 21, 2020