proposal: cmd/go: add GOARM=8 for further optimization on armv7/aarch32 #29373

Open
benshi001 opened this issue Dec 21, 2018 · 5 comments

@benshi001
Member

commented Dec 21, 2018

Currently, an ARM program built by Go calls runtime.udiv() for every division. That function detects whether a hardware divider is available and falls back to software division if it is not.

The main reason is that a hardware divider is an optional component on an ARMv7 machine. In practice, though, most ARMv7 SoCs have one, such as the one in the Raspberry Pi 2.
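
A minimal sketch of that dispatch, written as plain Go rather than the runtime's actual ARM assembly. The names `hasHWDiv`, `softUDiv`, and `udiv` are hypothetical stand-ins, and Go's own `/` and `%` operators stand in for the UDIV instruction:

```go
package main

import "fmt"

// hasHWDiv is a hypothetical flag standing in for the result of the
// runtime's hardware-divider detection. In this sketch it is set by hand.
var hasHWDiv = false

// softUDiv computes quotient and remainder with shift-and-subtract
// division, one bit per iteration. d must be nonzero.
func softUDiv(n, d uint32) (q, r uint32) {
	var rem uint64 // wide enough that the shift below cannot overflow
	for i := 31; i >= 0; i-- {
		rem = rem<<1 | uint64(n>>uint(i)&1)
		if rem >= uint64(d) {
			rem -= uint64(d)
			q |= 1 << uint(i)
		}
	}
	return q, uint32(rem)
}

// udiv models the runtime-detection shape: every division goes through
// one function that picks the hardware or the software path.
func udiv(n, d uint32) (q, r uint32) {
	if hasHWDiv {
		return n / d, n % d // stands in for a single UDIV instruction
	}
	return softUDiv(n, d)
}

func main() {
	q, r := udiv(100, 7)
	fmt.Println(q, r)
}
```

The cost under discussion in this issue is the call and the flag test paid on every division, even on machines where the hardware path is always taken.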

GOARM=8 would imply that the program runs in the AArch32 mode (ARMv7 compatible) of an arm64 machine, where a hardware divider is mandatory. So:

  1. the compiler would directly generate an SDIV/UDIV instruction for div and mod operations.
  2. a program built with GOARM=8 could also run on an ARMv7 machine with a hardware divider, such as the Raspberry Pi 2.

The go1 benchmarks show some improvement from generating SDIV/UDIV directly compared with runtime detection.

name                     old time/op    new time/op    delta
BinaryTree17-4              20.6s ± 0%     20.5s ± 1%   -0.30%  (p=0.000 n=40+40)
Fannkuch11-4                9.31s ± 0%     9.27s ± 0%   -0.42%  (p=0.000 n=40+39)
FmtFprintfEmpty-4           297ns ± 0%     298ns ± 0%   +0.34%  (p=0.000 n=38+40)
FmtFprintfString-4          588ns ± 0%     599ns ± 0%   +1.81%  (p=0.000 n=36+40)
FmtFprintfInt-4             633ns ± 0%     637ns ± 0%   +0.56%  (p=0.000 n=40+29)
FmtFprintfIntInt-4          960ns ± 0%     953ns ± 0%   -0.71%  (p=0.000 n=40+40)
FmtFprintfPrefixedInt-4    1.05µs ± 0%    1.05µs ± 0%     ~     (p=0.194 n=35+38)
FmtFprintfFloat-4          1.95µs ± 0%    1.75µs ± 0%  -10.12%  (p=0.000 n=38+40)
FmtManyArgs-4              3.55µs ± 0%    3.42µs ± 0%   -3.68%  (p=0.000 n=40+40)
GobDecode-4                37.4ms ± 1%    37.4ms ± 1%     ~     (p=0.320 n=37+39)
GobEncode-4                34.7ms ± 1%    34.4ms ± 1%   -0.80%  (p=0.000 n=40+40)
Gzip-4                      2.06s ± 1%     2.07s ± 1%   +0.44%  (p=0.000 n=39+38)
Gunzip-4                    254ms ± 0%     254ms ± 0%   +0.16%  (p=0.000 n=40+38)
HTTPClientServer-4          823µs ± 2%     817µs ± 2%   -0.70%  (p=0.008 n=37+37)
JSONEncode-4               79.4ms ± 0%    76.0ms ± 1%   -4.23%  (p=0.000 n=32+40)
JSONDecode-4                308ms ± 0%     304ms ± 0%   -1.06%  (p=0.000 n=40+39)
Mandelbrot200-4            17.6ms ± 0%    17.6ms ± 0%     ~     (p=0.210 n=34+38)
GoParse-4                  18.9ms ± 1%    18.7ms ± 1%   -1.10%  (p=0.000 n=39+40)
RegexpMatchEasy0_32-4       500ns ± 0%     502ns ± 2%   +0.35%  (p=0.014 n=39+40)
RegexpMatchEasy0_1K-4      3.82µs ± 0%    3.82µs ± 0%   +0.15%  (p=0.000 n=40+40)
RegexpMatchEasy1_32-4       546ns ± 0%     548ns ± 1%   +0.44%  (p=0.000 n=40+40)
RegexpMatchEasy1_1K-4      4.78µs ± 0%    4.79µs ± 0%   +0.16%  (p=0.000 n=38+37)
RegexpMatchMedium_32-4      737ns ± 2%     741ns ± 3%   +0.53%  (p=0.026 n=40+40)
RegexpMatchMedium_1K-4      164µs ± 0%     162µs ± 0%   -0.72%  (p=0.000 n=39+35)
RegexpMatchHard_32-4       10.6µs ± 0%    10.5µs ± 0%   -0.86%  (p=0.000 n=40+40)
RegexpMatchHard_1K-4        316µs ± 0%     312µs ± 0%   -1.13%  (p=0.000 n=38+40)
Revcomp-4                  40.5ms ± 3%    40.8ms ± 2%   +0.85%  (p=0.001 n=40+39)
Template-4                  395ms ± 0%     387ms ± 0%   -2.07%  (p=0.000 n=40+39)
TimeParse-4                2.68µs ± 0%    2.65µs ± 0%   -1.12%  (p=0.000 n=40+40)
TimeFormat-4               5.42µs ± 0%    5.29µs ± 0%   -2.30%  (p=0.000 n=38+37)
[Geo mean]                  304µs          302µs        -0.88%

name                     old speed      new speed      delta
GobDecode-4              20.5MB/s ± 1%  20.5MB/s ± 1%     ~     (p=0.284 n=37+39)
GobEncode-4              22.1MB/s ± 1%  22.3MB/s ± 1%   +0.81%  (p=0.000 n=40+38)
Gzip-4                   9.41MB/s ± 1%  9.37MB/s ± 1%   -0.41%  (p=0.000 n=38+38)
Gunzip-4                 76.5MB/s ± 0%  76.4MB/s ± 0%   -0.16%  (p=0.000 n=40+38)
JSONEncode-4             24.4MB/s ± 0%  25.5MB/s ± 1%   +4.42%  (p=0.000 n=32+40)
JSONDecode-4             6.31MB/s ± 0%  6.37MB/s ± 0%   +1.02%  (p=0.000 n=40+35)
GoParse-4                3.06MB/s ± 1%  3.10MB/s ± 1%   +1.13%  (p=0.000 n=39+40)
RegexpMatchEasy0_32-4    63.9MB/s ± 0%  63.7MB/s ± 2%     ~     (p=0.070 n=39+40)
RegexpMatchEasy0_1K-4     268MB/s ± 0%   268MB/s ± 0%   -0.15%  (p=0.000 n=40+40)
RegexpMatchEasy1_32-4    58.6MB/s ± 0%  58.3MB/s ± 1%   -0.44%  (p=0.000 n=40+40)
RegexpMatchEasy1_1K-4     214MB/s ± 0%   214MB/s ± 0%   -0.16%  (p=0.000 n=38+37)
RegexpMatchMedium_32-4   1.36MB/s ± 3%  1.35MB/s ± 3%   -0.70%  (p=0.022 n=40+40)
RegexpMatchMedium_1K-4   6.26MB/s ± 0%  6.31MB/s ± 0%   +0.73%  (p=0.000 n=37+40)
RegexpMatchHard_32-4     3.01MB/s ± 1%  3.04MB/s ± 0%   +1.03%  (p=0.000 n=40+40)
RegexpMatchHard_1K-4     3.25MB/s ± 0%  3.28MB/s ± 0%   +1.05%  (p=0.000 n=40+40)
Revcomp-4                62.8MB/s ± 4%  62.3MB/s ± 2%   -0.86%  (p=0.001 n=40+39)
Template-4               4.91MB/s ± 0%  5.01MB/s ± 0%   +2.11%  (p=0.000 n=40+39)
[Geo mean]               17.0MB/s       17.1MB/s        +0.53%

@benshi001 benshi001 added the Proposal label Dec 21, 2018

@benshi001 benshi001 added this to the Go1.13 milestone Dec 21, 2018

@benshi001 benshi001 changed the title from "Add GOARM=8" to "Add GOARM=8 for further optimization on armv7/aarch32" Dec 21, 2018

@ALTree ALTree changed the title from "Add GOARM=8 for further optimization on armv7/aarch32" to "proposal: cmd/go: add GOARM=8 for further optimization on armv7/aarch32" Dec 21, 2018

@benshi001

Member Author

commented Dec 25, 2018

@cherrymui @bradfitz @randall77 @griesemer
What is your opinion?

@agnivade agnivade modified the milestones: Go1.13, Proposal Dec 25, 2018

@cherrymui

Contributor

commented Dec 26, 2018

I don't think it is a good idea to introduce another GOARM value for only two instructions and a geomean performance gain of less than 1%. I think dynamic feature detection is still better in this case. If the overhead of a runtime call is larger than we thought, we could generate the feature test and conditional branch inline. Compared to a division operation, I think the overhead of a conditional branch is acceptable.
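
A rough sketch of the inline feature test mentioned above, modeled in Go source rather than in generated ARM code. All names (`hwDivAvailable`, `softDiv`, `divViaCall`, `perItemInline`) are hypothetical; the point is only the shape: the flag test and branch sit at the division site, and only the slow path makes a call.

```go
package main

import "fmt"

// hwDivAvailable is a hypothetical flag, assumed to be set once at startup
// by feature detection. In this sketch it is set by hand.
var hwDivAvailable = true

// softDiv stands in for the software division routine (quotient only).
// d must be nonzero.
func softDiv(n, d uint32) uint32 {
	var q uint32
	var rem uint64
	for i := 31; i >= 0; i-- {
		rem = rem<<1 | uint64(n>>uint(i)&1)
		if rem >= uint64(d) {
			rem -= uint64(d)
			q |= 1 << uint(i)
		}
	}
	return q
}

// divViaCall models the call-based shape: the division site always calls
// out, and the callee does the feature test.
//
//go:noinline
func divViaCall(n, d uint32) uint32 {
	if hwDivAvailable {
		return n / d
	}
	return softDiv(n, d)
}

// perItemInline models the inlined shape: the test and branch are emitted
// at the division site, so the fast path involves no call at all.
func perItemInline(total, count uint32) uint32 {
	if hwDivAvailable {
		return total / count // stands in for a single UDIV instruction
	}
	return softDiv(total, count) // slow path still calls out
}

func main() {
	fmt.Println(divViaCall(1000, 8), perItemInline(1000, 8))
}
```

Both functions return the same quotient; what differs is how much work surrounds the division on hardware that has a divider.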

@rsc

Contributor

commented Jan 9, 2019

It sounds like the proposal is GOARM=8 means ARMv7+hwdivide.
And ARMv8 was 64-bit ARM, right?
So it's a little weird to call 32-bit ARMv7 with hw divide GOARM=8.

It looks like this usually doesn't matter, performance-wise. Maybe the compiler should emit code to do the check + branch like we already do for write barriers? That would at least allow a bit more optimization of the hw code path (not doing unnecessary spills/reloads/etc).

And a very division heavy function could just have two copies with the compiler having hoisted the check out of the loop or other body to amortize it, all without a GOARM=.
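
The two-copies idea could look roughly like the following when written by hand in Go. This is only an illustration of the transformation a compiler might perform, under assumed names (`hwDivAvailable`, `softDiv`, `sumQuotients` and its two variants are all hypothetical): the feature check runs once per call instead of once per division.

```go
package main

import "fmt"

// hwDivAvailable is a hypothetical flag, assumed to be set once at startup.
var hwDivAvailable = true

// softDiv stands in for the software division routine (quotient only).
// d must be nonzero.
func softDiv(n, d uint32) uint32 {
	var q uint32
	var rem uint64
	for i := 31; i >= 0; i-- {
		rem = rem<<1 | uint64(n>>uint(i)&1)
		if rem >= uint64(d) {
			rem -= uint64(d)
			q |= 1 << uint(i)
		}
	}
	return q
}

// sumQuotientsHW is the copy specialized for hardware division: the loop
// body contains no feature test at all.
func sumQuotientsHW(xs []uint32, d uint32) uint32 {
	var sum uint32
	for _, x := range xs {
		sum += x / d // stands in for a single UDIV instruction
	}
	return sum
}

// sumQuotientsSoft is the copy specialized for software division.
func sumQuotientsSoft(xs []uint32, d uint32) uint32 {
	var sum uint32
	for _, x := range xs {
		sum += softDiv(x, d)
	}
	return sum
}

// sumQuotients hoists the check out of the loop: one branch per call,
// amortized over every division in the body.
func sumQuotients(xs []uint32, d uint32) uint32 {
	if hwDivAvailable {
		return sumQuotientsHW(xs, d)
	}
	return sumQuotientsSoft(xs, d)
}

func main() {
	fmt.Println(sumQuotients([]uint32{10, 20, 35}, 5))
}
```

The trade-off is code size: each division-heavy function is emitted twice in exchange for a check-free loop body.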

Another option is to follow the GOMIPS and have GOARM=7+divide. But the faster branch check a la write barriers seems better to try first.

@rsc

Contributor

commented Jan 23, 2019

Can someone please try the faster branch check in the previous comment? We should have that data before making any decision to expose this detail to users.

@rsc rsc added the WaitingForInfo label Jan 23, 2019

@gopherbot

commented Feb 23, 2019

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot gopherbot closed this Feb 23, 2019

@ALTree ALTree reopened this Feb 23, 2019
