CodeGen: Improve lowering of NUM_TO_VEC on A64 for constants #1194

Merged
merged 4 commits into from
Mar 13, 2024

@zeux zeux commented Mar 12, 2024

When the input is a constant, we currently use a fairly inefficient sequence of fmov+fcvt+dup or, when the double isn't encodable in fmov, adr+ldr+fcvt+dup.

Instead, we can use the same lowering as X64 when the input is a constant, and load the vector from memory. However, if the constant is encodable via fmov, we can use a vector fmov instead (which is just one instruction and doesn't need constant space).

Fortunately the bit encoding of fmov for 32-bit floating point numbers matches that of 64-bit: the decoding algorithm is a little different because it expands into a larger exponent, but the values are compatible, so if a double can be encoded into a scalar fmov with a given abcdefgh pattern, the same pattern should encode the same float; due to the very limited number of mantissa and exponent bits, all values that are encodable are also exact in both 32-bit and 64-bit floats.
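To illustrate the compatibility claim above, here is a minimal sketch that expands every 8-bit abcdefgh pattern into both a 32-bit and a 64-bit float, following the immediate expansion rule from the Arm architecture manual as I understand it (sign = a, exponent = NOT(b) : b replicated : cd, fraction = efgh followed by zeros). The helper names (`expand_imm8`, `imm8_to_float`, `imm8_to_double`, `find_imm8`) are hypothetical and are not taken from the Luau codebase.

```python
import struct

def expand_imm8(imm8, exp_bits, frac_bits):
    """Expand an 8-bit abcdefgh pattern into raw IEEE bits of the given format."""
    a = (imm8 >> 7) & 1          # sign
    b = (imm8 >> 6) & 1          # selects the exponent range
    cd = (imm8 >> 4) & 3         # low exponent bits
    efgh = imm8 & 15             # top four fraction bits
    # exponent = NOT(b) : b replicated (exp_bits - 3) times : cd
    replicated = ((1 << (exp_bits - 3)) - 1) if b else 0
    exponent = ((b ^ 1) << (exp_bits - 1)) | (replicated << 2) | cd
    fraction = efgh << (frac_bits - 4)
    return (a << (exp_bits + frac_bits)) | (exponent << frac_bits) | fraction

def imm8_to_double(imm8):
    return struct.unpack("<d", struct.pack("<Q", expand_imm8(imm8, 11, 52)))[0]

def imm8_to_float(imm8):
    # Unpacking a 32-bit float yields a Python float (double) with the exact same value.
    return struct.unpack("<f", struct.pack("<I", expand_imm8(imm8, 8, 23)))[0]

def find_imm8(value):
    """Return the imm8 pattern that encodes value, or None if it is not encodable."""
    for imm8 in range(256):
        if imm8_to_double(imm8) == value:
            return imm8
    return None

# Every pattern decodes to the same value in both widths.
same_in_both_widths = all(imm8_to_double(i) == imm8_to_float(i) for i in range(256))
```

Under these assumptions, `find_imm8(0.125)` finds a pattern while `find_imm8(0.123)` returns `None`, which matches the distinction the benchmark note draws between the two constants.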

This strategy is roughly the same as what gcc uses. For complex vectors (constants that fmov can't encode), we previously used 4 instructions and 8 bytes of constant storage (16+8=24 bytes), and now we use 2 instructions and 16 bytes of constant storage (8+16=24 bytes), so the memory footprint is the same; for simple vectors (fmov-encodable constants) we just need 1 instruction (4 bytes).

clang lowers vector constants a little differently, opting to synthesize a 64-bit integer using 4 instructions (mov/movk) and then move it to the vector register - this requires 5 instructions and 20 bytes, vs ours/gcc 2 instructions and 8+16=24 bytes. I tried a simpler version of this that would be more compact - synthesize a 32-bit integer constant with mov+movk, and move it to vector register via dup.4s - but this was a little slower on M2, so for now we prefer the slightly larger version as it's not a regression vs current implementation.
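The size comparison above can be tallied with a small sketch, assuming the fixed 4-byte A64 instruction size. The `footprint` helper and the strategy labels are illustrative only; the 12-byte figure for the mov+movk+dup variant is derived here from its three instructions and is not stated in the description.

```python
INSN_BYTES = 4  # every A64 instruction is 4 bytes

def footprint(instructions, constant_bytes=0):
    """Total bytes of code plus constant storage for one lowering."""
    return instructions * INSN_BYTES + constant_bytes

strategies = {
    "previous (4 insns + 8-byte constant)": footprint(4, 8),
    "this PR / gcc, constant load (2 insns + 16-byte constant)": footprint(2, 16),
    "this PR, vector fmov (1 insn)": footprint(1),
    "clang (5 insns, no constant)": footprint(5),
    "mov+movk+dup experiment (3 insns, no constant)": footprint(3),
}

for name, size in strategies.items():
    print(f"{name}: {size} bytes")
```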

On the vector approximation benchmark we get:

  • Before this PR (flag=false): ~7.85 ns/op
  • After this PR (flag=true): ~7.74 ns/op
  • After this PR, with 0.125 instead of 0.123 in the benchmark code (to use fmov): ~7.52 ns/op
  • Not part of this PR, but the mov/dup strategy described above: ~8.00 ns/op
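For a sense of scale, the timings above can be turned into relative changes against the pre-PR baseline. This is only arithmetic on the approximate figures quoted in the list, so the percentages carry the same uncertainty; the dictionary keys are illustrative labels.

```python
baseline_ns = 7.85  # before this PR (flag=false)

timings_ns = {
    "constant load (flag=true)": 7.74,
    "vector fmov (0.125 constant)": 7.52,
    "mov/dup experiment": 8.00,
}

def improvement_pct(ns):
    """Positive means faster than the baseline, negative means slower."""
    return (baseline_ns - ns) / baseline_ns * 100

for name, ns in timings_ns.items():
    print(f"{name}: {improvement_pct(ns):+.1f}%")
```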

All other tests happen to fit into fmov.
This is probably unnecessary but more coverage = better
1.9375 has all mantissa bits set; we only tested powers of two before.
@zeux zeux requested a review from vegorov-rbx March 12, 2024 21:46
@vegorov-rbx vegorov-rbx left a comment

Thank you

@vegorov-rbx vegorov-rbx merged commit 9aa82c6 into master Mar 13, 2024
8 checks passed
@vegorov-rbx vegorov-rbx deleted the a64-numvec branch March 13, 2024 19:56