proposal: cmd/go: add GOARM=8 for further optimization on armv7/aarch32 #29373
Comments
@cherrymui @bradfitz @randall77 @griesemer |
I don't think it is a good idea to introduce another GOARM value for only two instructions and a geomean performance gain of less than 1%. I think dynamic feature detection is still better in this case. If the overhead of a runtime call is larger than we thought, we could generate the feature test and conditional branch inline. Compared to a division operation, I don't think the overhead of a conditional branch is unacceptable. |
It sounds like the proposal is GOARM=8 means ARMv7+hwdivide. It looks like this usually doesn't matter, performance-wise. Maybe the compiler should emit code to do the check + branch like we already do for write barriers? That would at least allow a bit more optimization of the hw code path (not doing unnecessary spills/reloads/etc). And a very division heavy function could just have two copies with the compiler having hoisted the check out of the loop or other body to amortize it, all without a GOARM=. Another option is to follow the GOMIPS and have GOARM=7+divide. But the faster branch check a la write barriers seems better to try first. |
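To make the check + branch idea concrete, here is a minimal Go sketch of both variants: a per-division check, and a loop where the check has been hoisted so each path gets its own copy of the loop. The names `hasHWDiv`, `softDiv`, `divChecked` and `sumQuotients` are hypothetical and are not runtime or compiler symbols; the plain `/` only stands in for a hardware UDIV, and this illustrates the shape of the generated code rather than what the compiler actually emits.

```go
package main

import "fmt"

// hasHWDiv is a hypothetical stand-in for a CPU feature flag that a runtime
// would set once at startup. It is not a real Go runtime symbol.
var hasHWDiv = true

// softDiv is a hypothetical software fallback using shift-subtract
// (restoring) division. It panics on a zero divisor, like Go's own "/".
func softDiv(n, d uint32) uint32 {
	if d == 0 {
		panic("division by zero")
	}
	var q uint32
	var r uint64 // wide remainder so the shift below cannot overflow
	for i := 31; i >= 0; i-- {
		r = r<<1 | uint64(n>>uint(i)&1)
		if r >= uint64(d) {
			r -= uint64(d)
			q |= 1 << uint(i)
		}
	}
	return q
}

// divChecked tests the flag on every division (check + branch inline).
func divChecked(n, d uint32) uint32 {
	if hasHWDiv {
		return n / d // stands in for a single UDIV instruction
	}
	return softDiv(n, d)
}

// sumQuotients hoists the check out of the loop, giving two loop copies,
// so the flag is tested once per call instead of once per element.
func sumQuotients(xs []uint32, d uint32) uint32 {
	var sum uint32
	if hasHWDiv {
		for _, x := range xs {
			sum += x / d
		}
		return sum
	}
	for _, x := range xs {
		sum += softDiv(x, d)
	}
	return sum
}

func main() {
	xs := []uint32{10, 25, 4000000000}
	fmt.Println(divChecked(25, 7), sumQuotients(xs, 7))
}
```

Flipping `hasHWDiv` to `false` should give the same results through the software path. The hoisted form trades code size for one branch per loop instead of one per division.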
Can someone please try the faster branch check in the previous comment? We should have that data before making any decision to expose this detail to users. |
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.) |
Unfortunately, the go1 benchmark shows little improvement from a runtime check for a hardware divider.
|
The reason why "GOARM=8" works well, is due to
With GOARM=8, the compiler can fill more spilled values into the saved registers, so the go1 benchmark improves. I therefore strongly suggest adding GOARM=8, following ARM's official definition:
GOARM=8 not only improves division speed but also has other potential benefits (running an arm32 program on an ARMv8 machine). |
Go doesn't support ARMv8 optimizations yet (see this GitHub issue: golang/go#29373) but can still benefit from ARMv7 optimizations. A comment is left in `go.mk` to mention this and avoid any confusion when reading "ifeq ARMv8 → GOARM = 7". Signed-off-by: Michael Baudino <michael@baudi.no>
When building for an ARMv8 in 32-bit, Go does not yet support ARMv8 optimizations (see issue: golang/go#29373) but can still benefit from ARMv7 optimizations. Signed-off-by: Michael Baudino <michael@baudi.no> [yann.morin.1998@free.fr: - move the comment to its own line, expand and reword it a bit - reword the commit log ] Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr>
When building for an ARMv8 in 32-bit, Go does not yet support ARMv8 optimizations (see issue: golang/go#29373) but can still benefit from ARMv7 optimizations. Signed-off-by: Michael Baudino <michael@baudi.no> [yann.morin.1998@free.fr: - move the comment to its own line, expand and reword it a bit - reword the commit log ] Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr> (cherry picked from commit c59409a) Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
Should the results from the last comment be more impactful after #40724? 🤔 |
I'm not sure if that's applicable here (or even possible with modern memory protection mechanisms), but in the past dynamic feature detection was sometimes implemented by patching the executable in-memory at runtime, one invocation at a time. |
Currently, an ARM program built by Go calls runtime.udiv() for a division. That function detects whether a hardware divider is available and otherwise falls back to software division.
The main reason is that a hardware divider is an optional component on an ARMv7 machine. In the real world, however, most ARMv7 SoCs have one, such as the one in the Raspberry Pi 2.
GOARM=8 implies that the program will run in the aarch32 mode (ARMv7 compatible) of an arm64 machine, on which a hardware divider is mandatory. So
The go1 benchmark does show some improvement from generating SDIV/UDIV directly compared with runtime detection.