[compiler-rt][ARM] Optimized mulsf3 and divsf3 #161546
This commit adds optimized assembly versions of single-precision float multiplication and division. Both functions are implemented in a style that can be assembled as either Arm or Thumb2; for multiplication, a separate implementation is provided for Thumb1. Extensive new tests are also added for multiplication and division.

These implementations can be removed from the build by defining the cmake variable COMPILER_RT_ARM_OPTIMIZED_FP=OFF.

Outlying parts of the functionality which are not on the fast path, such as NaN handling and underflow, are handled in helper functions written in C. These can be shared between the Arm/Thumb2 and Thumb1 implementations, and also reused by other optimized assembly functions we hope to add in future.
This is the second PR in my planned series to upstream optimized AArch32 FP implementations, as discussed on Discourse in August. (Sorry for the delay.) The first PR is #154093, which is replacing an existing assembly implementation with (we think) a better one. This one is adding new assembly implementations, for functions which don't have them already.

The two PRs conflict, but benignly, in that they both add the same supporting C functions; whichever one lands first, I'll update the other one.

This PR is not quite in a committable state yet, because I'd like advice on what to do about the new tests. At the moment, they're using But other architectures, and the existing C implementations in compiler-rt, can't be expected to pass those tests in their strict form. So those tests will have to be reverted to use
Can't give much of a review here, but superficially this looks fine to me.
)

option(COMPILER_RT_ARM_OPTIMIZED_FP
  "On 32-bit Arm, use optimized assembly implementations of FP arithmetic" ON)
I believe this is a code size vs speed tradeoff, right?
I think it would be a good idea to say that explicitly. (And IMHO if the new assembly routines are both smaller and faster they should just be replaced instead of having two options).
I've done that, with a "likely" in it to cover the fact that until we've gone through all of the available functions we won't know for sure whether all of them trade off size for speed.
(It's also difficult to judge, since when you compare assembly against C, the C is more likely to vary with compile options, so the answer might turn out to be "in this configuration but not that one".)
DEFINE_AEABI_FUNCTION_ALIAS(__aeabi_fmul, __mulsf3)

DEFINE_COMPILERRT_FUNCTION(__mulsf3)
Given the previous `.thumb`, I think that we should use `DEFINE_COMPILERRT_THUMB_FUNCTION` instead.
Done. (I'm not sure it makes any difference when assembling for a Thumb-only architecture, but it keeps things consistent with existing files.)
LSLS r3, r2, #23
ADDS r0, r0, r3 // put on the biased exponent

BL __funder
Should this be `SYMBOL_NAME(__compiler_rt_funder)`?
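For readers less familiar with the bit layout, the two instructions quoted above (shift an exponent left by 23, then add it into the result word) can be pictured with a minimal C sketch. This is an illustration only, not code from the PR: the function name `sketch_pack_float` is hypothetical, and it assumes the word already holds the sign bit and a 23-bit fraction, which may not match exactly how the assembly keeps its intermediate mantissa.

```c
#include <stdint.h>

// Hypothetical illustration, not taken from the PR.
// sign_and_fraction: sign in bit 31, fraction in bits 22..0 (assumed layout).
// biased_exponent:   IEEE 754 single-precision exponent field, 0..255.
static uint32_t sketch_pack_float(uint32_t sign_and_fraction,
                                  uint32_t biased_exponent) {
  // Equivalent of "LSLS r3, r2, #23": move the exponent to bits 30..23.
  uint32_t shifted = biased_exponent << 23;
  // Equivalent of "ADDS r0, r0, r3": add it into the result word.
  return sign_and_fraction + shifted;
}
```

With these assumptions, a zero fraction and a biased exponent of 127 give the bit pattern of 1.0f.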
LOCAL_LABEL(denorm):
  PUSH {r0,r1,r2,r3}
  MOV r0, sp
  BL __fnorm2
Should this be `SYMBOL_NAME(__compiler_rt_fnorm2)`?
How embarrassing. Not only that, but I hadn't actually added the helper functions to the build in the Thumb1 case. Apparently forgot to re-test both architectures before pushing! Fixed now.
// propagates an appropriate NaN to the output, dealing with the special
// cases of signalling/quiet NaNs.
LOCAL_LABEL(nan):
  BL __fnan2
Should this be `SYMBOL_NAME(__compiler_rt_fnan2)`?
*/

.p2align 2 // make sure we start on a 32-bit boundary, even in Thumb
I think that changing this to `4-byte boundary` is better than `32-bit boundary`, as it can be confusing when scanning over the comment and code.
Done.
compiler-rt/lib/builtins/arm/fnan2.c
Outdated
if (aadj < 0x00800000) // a is a quiet NaN?
  return a;            // if so, return it
else                   // expect (badj < 0x00800000)
  return b;            // in that case b must be a quiet NaN
I think that this should be either of the following:
return (aadj < 0x00800000) ? a : b;
or
if (aadj < 0x00800000)
return a;
return b;
Done.
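For context on the `aadj < 0x00800000` test in the snippet under review, here is a minimal, self-contained C sketch of one way a two-operand NaN-propagation helper could be written around that comparison. It is an illustration under stated assumptions, not the contents of fnan2.c: the name `sketch_fnan2` is hypothetical, the computation of `aadj` and `badj` is assumed (shift out the sign bit, then add 1 at the position where the quiet bit lands), and the policy of preferring a signalling NaN and returning it quietened is also an assumption.

```c
#include <stdint.h>

// Hypothetical sketch, not the PR's fnan2.c. Inputs are raw IEEE 754
// single-precision bit patterns, at least one of which is assumed to be a NaN.
static uint32_t sketch_fnan2(uint32_t a, uint32_t b) {
  // Assumed adjustment: drop the sign bit by shifting left, then add
  // 0x00800000 (the shifted position of the quiet bit). Quiet NaNs wrap
  // round to 0..0x007fffff, signalling NaNs land above 0xff800000, and
  // every non-NaN value lands in between.
  uint32_t aadj = (a << 1) + 0x00800000u;
  uint32_t badj = (b << 1) + 0x00800000u;

  if (aadj > 0xff800000u)    // a is a signalling NaN?
    return a | 0x00400000u;  // return it with the quiet bit set
  if (badj > 0xff800000u)    // b is a signalling NaN?
    return b | 0x00400000u;  // return it with the quiet bit set
  if (aadj < 0x00800000u)    // a is a quiet NaN?
    return a;
  return b;                  // otherwise b must be the quiet NaN
}
```

The final two statements use the early-return form suggested in the review, so no `else` branch is needed.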
This was Petr Hosek's comment on llvm#154093, but if we're doing that, we should do it consistently.
Now we should only test the extra NaN faithfulness in cases where it's provided by the library. Also tweaked the cmake setup to make it easier to add more assembly files later. Plus a missing piece of comment in fnan2.c.
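The idea of testing "extra NaN faithfulness" only where the library provides it can be sketched as a comparison helper with a strict and a relaxed mode. This is a hypothetical illustration, not the test code in this PR: the names `sketch_is_nan_bits` and `sketch_result_matches` and the `strict_nan` parameter are assumptions introduced for the example.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: true if the bit pattern encodes any NaN
// (exponent all ones, fraction nonzero), ignoring the sign bit.
static bool sketch_is_nan_bits(uint32_t x) {
  return (x & 0x7fffffffu) > 0x7f800000u;
}

// Hypothetical comparison. In strict mode the result must match the expected
// bit pattern exactly, including the NaN payload and quiet bit. In relaxed
// mode any NaN is accepted wherever a NaN is expected.
static bool sketch_result_matches(uint32_t got, uint32_t expected,
                                  bool strict_nan) {
  if (got == expected)
    return true;
  if (strict_nan)
    return false;
  return sketch_is_nan_bits(got) && sketch_is_nan_bits(expected);
}
```

A test suite built this way could enable the strict mode only for builds that use the optimized routines, and fall back to the relaxed mode for other architectures and for the existing C implementations.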