[compiler-rt][ARM] Optimized mulsf3 and divsf3 #168394
This commit adds optimized assembly versions of single-precision float multiplication and division. Both functions are implemented in a style that can be assembled as either Arm or Thumb2; for multiplication, a separate implementation is provided for Thumb1. Also, extensive new tests are added for multiplication and division. These implementations can be removed from the build by defining the cmake variable COMPILER_RT_ARM_OPTIMIZED_FP=OFF. Outlying parts of the functionality which are not on the fast path, such as NaN handling and underflow, are handled in helper functions written in C. These can be shared between the Arm/Thumb2 and Thumb1 implementations, and also reused by other optimized assembly functions we hope to add in future.
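As a rough illustration of the split between the fast path and the C helpers, the sketch below classifies an operand by its exponent field. The function names are hypothetical and not taken from the PR, and it is only an assumption that operands with an all-zeros or all-ones exponent are the ones routed to a helper.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its raw IEEE-754 single-precision bit pattern. */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical classifier (not from the PR): returns true when the operand
 * has an all-zeros or all-ones exponent field, i.e. it is a zero, a
 * denormal, an infinity or a NaN. Assumption: these are the inputs that
 * would leave the fast path and go to an out-of-line helper. */
static bool needs_slow_path(float f) {
    uint32_t exponent = (float_bits(f) >> 23) & 0xFFu;
    return exponent == 0x00u || exponent == 0xFFu;
}
```

Underflow is different in kind: it is a property of the result rather than of the inputs, so a real implementation would have to detect it after computing the output exponent, which this sketch does not attempt.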
In the earliest version of cmake supported by LLVM, `try_compile` doesn't understand the convenient `SOURCE_FROM_CONTENT` option, so we must manually write our (empty) assembly source file and pass it to `try_compile` by filename. It also doesn't understand `NO_CACHE`, so in order to avoid making a spurious cache entry called `success` which would confuse the next run, I'm putting the result directly into the user-specified output variable. This leads to a less helpful comment in CMakeCache.txt, but what can you do.
A buildbot reported a failure in a hardfp build, related to ABI: the test was calling __mulsf3 and passing arguments in s0/s1, but the code inside __mulsf3 was reading them out of r0/r1. The ABI using GPRs is correct for __aeabi_fmul, but not for __mulsf3, which takes float arguments in accordance with whatever the normal ABI is. So in hardfp, the two functions behave differently. The obvious question is why anyone is linking this function into a hardfp build in the first place - surely in a hardfp context clients would just use a vmul instruction instead of calling either of these entry points? But there seems to be no provision in builtins/CMakeLists.txt for leaving things out of hardfp builds. The generic __mulsf3.c is still included in a hardfp builtins library. So I've stuck with those basic premises, and just corrected my replacement functions to get the ABIs right.
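The shape of that fix can be pictured in portable C as a float-typed entry point that converts its operands to raw bit patterns before calling a core routine that works on integers. This is only a sketch: both function names are hypothetical, the native multiply stands in for the real arithmetic, and the actual PR does the conversion in assembly by moving values between VFP and general-purpose registers.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a core routine that takes and returns raw IEEE-754 bit
 * patterns, the way a GPR-based entry point receives its operands.
 * The native multiply is a placeholder for the real arithmetic. */
static uint32_t mul_core_bits(uint32_t a_bits, uint32_t b_bits) {
    float a, b, product;
    uint32_t result_bits;
    memcpy(&a, &a_bits, sizeof a);
    memcpy(&b, &b_bits, sizeof b);
    product = a * b;
    memcpy(&result_bits, &product, sizeof result_bits);
    return result_bits;
}

/* Hypothetical float-typed entry point: under a hard-float ABI its
 * arguments arrive in floating-point registers, so it must move them
 * into integer form before reusing the bit-pattern core. */
static float mul_float_entry(float a, float b) {
    uint32_t a_bits, b_bits, result_bits;
    float result;
    memcpy(&a_bits, &a, sizeof a_bits);
    memcpy(&b_bits, &b, sizeof b_bits);
    result_bits = mul_core_bits(a_bits, b_bits);
    memcpy(&result, &result_bits, sizeof result);
    return result;
}
```

Under a soft-float ABI the two entry points would receive identical register contents, so the wrapper would reduce to a plain alias.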
The current functions depend on the MLS instruction, and future ones will depend on CLZ too.
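For readers unfamiliar with these two instructions, the following C reference models describe what each one computes. The function names are made up for illustration, and nothing here reflects how the assembly in this PR actually uses them.

```c
#include <stdint.h>

/* Reference model of MLS Rd, Rn, Rm, Ra: multiply and subtract,
 * Rd = Ra - Rn * Rm, with the result truncated to 32 bits. */
static uint32_t mls_reference(uint32_t rn, uint32_t rm, uint32_t ra) {
    return ra - rn * rm;
}

/* Reference model of CLZ: the number of leading zero bits in a 32-bit
 * value. An input of zero yields 32. */
static unsigned clz_reference(uint32_t x) {
    unsigned count = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1)
        count++;
    return count;
}
```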
This PR already starts out as a sequence of four commits. The first is #161546 unchanged; the next three fix build and test failures, two of which were found by buildbots after the previous attempt, and the third I found myself during re-testing. I know they'll all be squashed together into one commit for landing, but I keep them separate to make review easier.
smithp35
left a comment
The fixups look good to me. Can confirm that the hard to soft-float conversion is per the hard-float AAPCS, and that the new files are excluded from the "base_SOURCES".
(I've only reviewed the new changes, given that the first commit is just the original work)
(Reland of #161546, fixing three build and test issues)