CodeGen: Improve lowering of NUM_TO_VEC on A64 for constants #1194

zeux · 2024-03-12T18:19:49Z

When the input is a constant, we use a fairly inefficient sequence of fmov+fcvt+dup or, when the double isn't encodable in fmov, adr+ldr+fcvt+dup.

Instead, we can use the same lowering as X64 when the input is a constant, and load the vector from memory. However, if the constant is encodable via fmov, we can use a vector fmov instead (which is just one instruction and doesn't need constant space).

Fortunately the bit encoding of fmov for 32-bit floating point numbers matches that of 64-bit: the decoding algorithm is a little different because it expands into a larger exponent, but the values are compatible, so if a double can be encoded into a scalar fmov with a given abcdefgh pattern, the same pattern should encode the same float; due to the very limited number of mantissa and exponent bits, all values that are encodable are also exact in both 32-bit and 64-bit floats.
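The compatibility argument above can be checked with a minimal sketch. It assumes the standard A64 8-bit floating-point immediate layout (sign `a`, exponent built from `NOT(b)`, replicated `b`, and `cd`, fraction `efgh`). The helper names `encodeFmovImm8` and `decodeFmovImm8AsFloat` are hypothetical and are not the functions used in the PR.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the abcdefgh pattern (0..255) if `value` is
// representable as an A64 fmov immediate, or -1 otherwise.
int encodeFmovImm8(double value)
{
    uint64_t bits;
    std::memcpy(&bits, &value, sizeof(bits));

    // Only the top 4 mantissa bits (efgh) may be set.
    if (bits & ((1ull << 48) - 1))
        return -1;

    uint32_t sign = uint32_t(bits >> 63);
    uint32_t exp = uint32_t(bits >> 52) & 0x7ff;
    uint32_t frac = uint32_t(bits >> 48) & 0xf;

    // The 11-bit exponent must look like NOT(b) : bbbbbbbb : cd.
    uint32_t top = exp >> 2;
    if (top != 0x100 && top != 0x0ff)
        return -1;

    uint32_t b = (top == 0x0ff) ? 1 : 0;
    uint32_t cd = exp & 3;

    return int((sign << 7) | (b << 6) | (cd << 4) | frac);
}

// Hypothetical helper: expands the same pattern into a 32-bit float, whose
// 8-bit exponent is NOT(b) : bbbbb : cd.
float decodeFmovImm8AsFloat(uint8_t imm8)
{
    uint32_t a = imm8 >> 7;
    uint32_t b = (imm8 >> 6) & 1;
    uint32_t cd = (imm8 >> 4) & 3;
    uint32_t efgh = imm8 & 0xf;

    uint32_t exp = ((b ^ 1) << 7) | ((b ? 0x1fu : 0u) << 2) | cd;
    uint32_t bits = (a << 31) | (exp << 23) | (efgh << 19);

    float result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}
```

Under these assumptions, every one of the 256 patterns decodes to a float that, widened to double, encodes back to the same pattern, while a value such as 0.123 is rejected and would take the load-from-memory path.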

This strategy is roughly the same as what gcc uses. For complex vectors, we previously used 4 instructions and 8 bytes of constant storage, and now we use 2 instructions and 16 bytes of constant storage, so the memory footprint is the same (24 bytes either way); for simple vectors we just need 1 instruction (4 bytes).

clang lowers vector constants a little differently, opting to synthesize a 64-bit integer using 4 instructions (mov/movk) and then move it to the vector register - this requires 5 instructions and 20 bytes, vs. 2 instructions plus 16 bytes of constant storage (8+16=24 bytes) for ours/gcc. I tried a simpler version of this that would be more compact - synthesize a 32-bit integer constant with mov+movk, and move it to the vector register via dup.4s - but this was a little slower on M2, so for now we prefer the slightly larger version as it's not a regression vs. the current implementation.
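The size comparison above can be spelled out as a small sketch, assuming fixed 4-byte A64 instructions. The `footprintBytes` helper is hypothetical and exists only to make the arithmetic explicit.

```cpp
// A64 instructions are a fixed 4 bytes each.
constexpr int kInstructionBytes = 4;

// Hypothetical helper: total bytes of code plus constant storage.
constexpr int footprintBytes(int instructions, int constantBytes)
{
    return instructions * kInstructionBytes + constantBytes;
}

// Old lowering, non-fmov constant: adr+ldr+fcvt+dup with an 8-byte double.
static_assert(footprintBytes(4, 8) == 24, "old complex path");

// New lowering, non-fmov constant: 2 instructions with a 16-byte vector.
static_assert(footprintBytes(2, 16) == 24, "new complex path");

// New lowering, fmov-encodable constant: one vector fmov, no constant data.
static_assert(footprintBytes(1, 0) == 4, "new simple path");

// clang-style: 4x mov/movk plus a move into the vector register.
static_assert(footprintBytes(5, 0) == 20, "clang-style path");

// Experimental variant from the description: mov+movk+dup.4s.
static_assert(footprintBytes(3, 0) == 12, "mov/movk/dup variant");
```

The mov/movk/dup variant is the smallest of the non-fmov options, but per the benchmark it was slower on M2, which is why the description keeps the 24-byte version.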

On the vector approximation benchmark we get:

Before this PR (flag=false): ~7.85 ns/op
After this PR (flag=true): ~7.74 ns/op
After this PR, with 0.125 instead of 0.123 in the benchmark code (to use fmov): ~7.52 ns/op
Not part of this PR, but the mov/dup strategy described above: ~8.00 ns/op

vegorov-rbx

Thank you

zeux added 4 commits March 12, 2024 11:10

Add a test with a larger constant to cover adr+ldr path

5d722d9

All other tests happen to fit into fmov.

tests: Also add -0.125 as an extra fmov test

bf33cb6

This is probably unnecessary but more coverage = better

tests: Add a boundary case test just in case

9561bec

1.9375 has all mantissa bits set; we only tested powers of two before.

zeux requested a review from vegorov-rbx March 12, 2024 21:46

vegorov-rbx approved these changes Mar 13, 2024

vegorov-rbx merged commit 9aa82c6 into master Mar 13, 2024
8 checks passed

vegorov-rbx deleted the a64-numvec branch March 13, 2024 19:56
