statham-arm
Collaborator

This commit adds optimized assembly versions of single-precision float multiplication and division. Both functions are written in a style that can be assembled as either Arm or Thumb2; for multiplication, a separate implementation is provided for Thumb1. Extensive new tests are also added for multiplication and division.

These implementations can be removed from the build by setting the CMake variable COMPILER_RT_ARM_OPTIMIZED_FP=OFF.

Outlying parts of the functionality which are not on the fast path, such as NaN handling and underflow, are handled in helper functions written in C. These can be shared between the Arm/Thumb2 and Thumb1 implementations, and also reused by other optimized assembly functions we hope to add in future.

@statham-arm
Collaborator Author

This is the second PR in my planned series to upstream optimized AArch32 FP implementations, as discussed on Discourse in August. (Sorry for the delay.)

The first PR is #154093, which replaces an existing assembly implementation with (we think) a better one. This one adds new assembly implementations for functions which don't have them already. The two PRs conflict, but benignly, in that they both add the same supporting C functions; whichever one lands first, I'll update the other one.

This PR is not quite in a committable state yet, because I'd like advice on what to do about the new tests. At the moment, they're using compareResultF to check the answers, which forgives differences of opinion in NaN handling. Our assembly routines have well specified NaN handling (designed to match the behavior of Arm's hardware FP), and a set of tests to check it. So when they're testing the new versions of the function, I'd like to make them check the output NaNs exactly.

But other architectures, and the existing C implementations in compiler-rt, can't be expected to pass those tests in their strict form. So those tests will have to be reverted to use compareResultF on any other architecture, or when the new config option COMPILER_RT_ARM_OPTIMIZED_FP=OFF is set. Any thoughts on the best thing to do about that?
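One possible shape for this, as a minimal sketch only: keep both a lenient and a strict checker, and pick one at compile time. The names check_lenient, check_strict, CHECK_RESULT and the STRICT_NAN_TESTS macro are hypothetical and not taken from compiler-rt or this PR. The lenient rule shown here (any NaN result matches any expected NaN) is an assumption about what "forgiving" means, not the exact behavior of compareResultF.

```c
#include <stdint.h>

/* True if the 32-bit pattern encodes a NaN: exponent all ones and a
   nonzero mantissa. */
static int is_nan_bits(uint32_t rep) {
  return (rep & 0x7f800000u) == 0x7f800000u && (rep & 0x007fffffu) != 0;
}

/* Hypothetical lenient check: returns 0 on success. Any NaN result is
   accepted when a NaN is expected (ASSUMPTION, see lead-in). */
static int check_lenient(uint32_t result, uint32_t expected) {
  if (result == expected)
    return 0;
  if (is_nan_bits(result) && is_nan_bits(expected))
    return 0;
  return 1;
}

/* Hypothetical strict check: returns 0 only on a bit-exact match, so the
   sign, quiet bit and payload of an output NaN are all verified. */
static int check_strict(uint32_t result, uint32_t expected) {
  return result != expected;
}

/* STRICT_NAN_TESTS is a made-up macro name; a build system would define it
   only where the library guarantees exact NaN outputs. */
#if defined(STRICT_NAN_TESTS)
#define CHECK_RESULT check_strict
#else
#define CHECK_RESULT check_lenient
#endif
```

With this arrangement the test vectors stay identical everywhere, and only the comparison changes per configuration.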

Contributor

@aykevl aykevl left a comment

Can't give much of a review here, but superficially this looks fine to me.

)

option(COMPILER_RT_ARM_OPTIMIZED_FP
"On 32-bit Arm, use optimized assembly implementations of FP arithmetic" ON)
Contributor

I believe this is a code size vs speed tradeoff, right?
I think it would be a good idea to say that explicitly. (And IMHO if the new assembly routines are both smaller and faster, they should simply replace the existing implementations instead of being offered as a second option.)

Collaborator Author

I've done that, with a "likely" in it to cover the fact that until we've gone through all of the available functions we won't know for sure whether all of them trade off size for speed.

(It's also difficult to judge, since when you compare assembly against C, the C is more likely to vary with compile options, so the answer might turn out to be "in this configuration but not that one".)


DEFINE_AEABI_FUNCTION_ALIAS(__aeabi_fmul, __mulsf3)

DEFINE_COMPILERRT_FUNCTION(__mulsf3)
Member

Given the previous .thumb, I think that we should use DEFINE_COMPILERRT_THUMB_FUNCTION instead.

Collaborator Author

Done. (I'm not sure it makes any difference when assembling for a Thumb-only architecture, but it keeps things consistent with existing files.)

LSLS r3, r2, #23
ADDS r0, r0, r3 // put on the biased exponent

BL __funder
Member

Should this be SYMBOL_NAME(__compiler_rt_funder)?

LOCAL_LABEL(denorm):
PUSH {r0,r1,r2,r3}
MOV r0, sp
BL __fnorm2
Member

Should this be SYMBOL_NAME(__compiler_rt_fnorm2)?

Collaborator Author

How embarrassing. Not only that, but I hadn't actually added the helper functions to the build in the Thumb1 case. Apparently forgot to re-test both architectures before pushing! Fixed now.

// propagates an appropriate NaN to the output, dealing with the special
// cases of signalling/quiet NaNs.
LOCAL_LABEL(nan):
BL __fnan2
Member

Should this be SYMBOL_NAME(__compiler_rt_fnan2)?
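For context, a minimal sketch of why wrapping the name matters: some targets prepend a prefix (such as an underscore) to C symbol names, so a bare name in an assembly BL may not match the symbol the C compiler actually emits. The macros below are an illustrative reimplementation, assuming SYMBOL_NAME pastes the compiler's predefined __USER_LABEL_PREFIX__ onto the name. They are not copied from compiler-rt's headers, and every name ending in _SKETCH, along with helper_symbol and ends_with, is hypothetical.

```c
#include <string.h>

/* Fall back to an empty prefix if the compiler does not predefine one. */
#ifndef __USER_LABEL_PREFIX__
#define __USER_LABEL_PREFIX__
#endif

/* Two-level paste so that __USER_LABEL_PREFIX__ is expanded before ##. */
#define GLUE2_SKETCH(a, b) a##b
#define GLUE_SKETCH(a, b) GLUE2_SKETCH(a, b)
#define SYMBOL_NAME_SKETCH(name) GLUE_SKETCH(__USER_LABEL_PREFIX__, name)

/* Two-level stringify so the argument is expanded before #. */
#define STR2_SKETCH(x) #x
#define STR_SKETCH(x) STR2_SKETCH(x)

/* Returns the platform-decorated spelling of the helper's symbol name. */
static const char *helper_symbol(void) {
  return STR_SKETCH(SYMBOL_NAME_SKETCH(__compiler_rt_fnan2));
}

/* True if s ends with suffix. */
static int ends_with(const char *s, const char *suffix) {
  size_t ls = strlen(s), lx = strlen(suffix);
  return ls >= lx && strcmp(s + (ls - lx), suffix) == 0;
}
```

On a target with an empty prefix, helper_symbol() yields the plain name; on a target with an underscore prefix it yields the name with one extra leading underscore, which is the spelling the assembly call would need.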


*/

.p2align 2 // make sure we start on a 32-bit boundary, even in Thumb
Member

I think the comment should say "4-byte boundary" rather than "32-bit boundary": the mismatch between bits in the comment and the directive's operand can be confusing when scanning over the comment and code.

Collaborator Author

Done.

Comment on lines 34 to 37
if (aadj < 0x00800000) // a is a quiet NaN?
return a; // if so, return it
else // expect (badj < 0x00800000)
return b; // in that case b must be a quiet NaN
Member

I think that this should be either of the following:

return (aadj < 0x00800000) ? a : b;

or

if (aadj < 0x00800000)
  return a;
return b;

Collaborator Author

Done.
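To make the threshold test in that snippet easier to follow, here is a self-contained sketch of how such a NaN-selection helper could work. The excerpt does not show how aadj and badj are computed, so the derivation below is an assumption: shift left by one to discard the sign, then add 0x00800000 so that quiet NaNs wrap around to small values. The priority given to signalling NaNs, and the quieting of them, is likewise assumed rather than taken from the PR. The function name select_nan_sketch is hypothetical.

```c
#include <stdint.h>

/* Hypothetical sketch. Precondition: at least one of a, b is a NaN
   (both are raw IEEE 754 single-precision bit patterns). */
static uint32_t select_nan_sketch(uint32_t a, uint32_t b) {
  /* ASSUMPTION: shifting left drops the sign bit, leaving the exponent in
     bits 31..24 and the quiet bit in bit 23. Adding 0x00800000 then makes
     quiet NaNs overflow to [0, 0x007fffff], while signalling NaNs land in
     (0xff800000, 0xffffffff]. Infinity lands exactly on 0xff800000. */
  uint32_t aadj = (a << 1) + 0x00800000u;
  uint32_t badj = (b << 1) + 0x00800000u;

  /* ASSUMPTION: a signalling NaN takes priority and is returned quieted. */
  if (aadj > 0xff800000u)
    return a | 0x00400000u;
  if (badj > 0xff800000u)
    return b | 0x00400000u;

  /* No signalling NaN: return whichever input is a quiet NaN, preferring a.
     If a is not a quiet NaN, the precondition implies that b is. */
  return (aadj < 0x00800000u) ? a : b;
}
```

For example, under these assumptions select_nan_sketch(0x7f800001, 0x3f800000) quiets the signalling NaN in the first operand and returns 0x7fc00001, while select_nan_sketch(0x3f800000, 0xffc00000) returns the quiet NaN in the second operand unchanged.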

statham-arm added a commit to statham-arm/llvm-project that referenced this pull request Oct 2, 2025
Now we should only test the extra NaN faithfulness in cases where it's
provided by the library. Also tweaked the cmake setup to make it
easier to add more assembly files later. Plus a missing piece of
comment in fnan2.c.

4 participants