Summary
The ldr that materializes const uint8x16_t ZERO_LT_AMP_CR = {0, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 4, 1, 1, 1}; is obviously always loop-invariant. Furthermore, when the whole function has fewer live vector values than there are vector registers, there cannot be much register pressure on the vector registers, so such a constant should be hoisted to the top of the function.
When the function is small, LICM happens properly for such constants.
When the function is large, the ldr for the constant moves to immediately before its use, and the load is from static memory.
For functions of intermediate size, each constant is loaded at the top of the function and immediately spilled to the stack, then reloaded from the stack immediately before use.
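To make the pattern concrete, here is a minimal, hypothetical C++ sketch of the kind of loop involved: a 16-byte constant table applied with tbl to each 16-byte chunk of input. The function name lookup_low_nibbles, the low-nibble indexing, and the scalar fallback for non-AArch64 hosts are assumptions for illustration only; this is not taken from nsHtml5TokenizerSIMD-expanded.cpp. Because the function is tiny, it would most likely fall into the "small function" case where LICM works, so it is not a reproducer of the spill.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: for each byte of `in`, look up its low nibble in the
// 16-entry table from the report and write the result to `out`.
// `n` must be a multiple of 16 so the NEON path needs no tail loop.
void lookup_low_nibbles(const uint8_t* in, uint8_t* out, size_t n) {
  assert(n % 16 == 0);
#if defined(__aarch64__)
  // Loop-invariant constant from the report; ideally the ldr that
  // materializes it stays outside the loop (and is never spilled/reloaded).
  const uint8x16_t ZERO_LT_AMP_CR = {0, 2, 1, 1, 1, 1, 3, 1,
                                     1, 1, 1, 1, 4, 1, 1, 1};
  const uint8x16_t LOW_NIBBLE = vdupq_n_u8(0x0F);
  for (size_t i = 0; i < n; i += 16) {
    uint8x16_t bytes = vld1q_u8(in + i);
    uint8x16_t idx = vandq_u8(bytes, LOW_NIBBLE);
    // vqtbl1q_u8 lowers to a single tbl.16b instruction.
    vst1q_u8(out + i, vqtbl1q_u8(ZERO_LT_AMP_CR, idx));
  }
#else
  // Scalar fallback with the same semantics, so the sketch runs anywhere.
  static const uint8_t ZERO_LT_AMP_CR[16] = {0, 2, 1, 1, 1, 1, 3, 1,
                                             1, 1, 1, 1, 4, 1, 1, 1};
  for (size_t i = 0; i < n; ++i) {
    out[i] = ZERO_LT_AMP_CR[in[i] & 0x0F];
  }
#endif
}
```

Compiling this with -O3 -S on an AArch64 target and looking for the table load relative to the loop body is one way to compare against the behavior of the full file.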
Steps to reproduce
- Download nsHtml5TokenizerSIMD-expanded.cpp.
- On an Apple Silicon Mac using clang trunk, compile it with:
  clang++ -Wno-everything -isysroot "/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.2.sdk" nsHtml5TokenizerSIMD-expanded.cpp -O3 -S
- Inspect nsHtml5TokenizerSIMD-expanded.s.
- Comment out the line following the comment:
  // Commenting out the following line allows vector constant LICM!
- Recompile and inspect the assembly again.
Actual results
On the first compilation, the only tbl.16b instruction is preceded by ldr q3, [sp, #32] ; 16-byte Folded Reload.
On the second compilation, the ldr has moved up and out of the loops.
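When comparing the two compilations, a small helper can flag the reload pattern described above. The following is a hypothetical C++ sketch (the name count_tbl_after_folded_reload is made up for illustration): it counts tbl.16b instructions whose preceding non-blank line carries LLVM's "Folded Reload" assembly comment, which marks a reload from a stack spill slot. It only checks the immediately preceding line, so it will miss reloads that are scheduled further away from the tbl.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: count tbl.16b instructions whose previous non-blank
// line is annotated "Folded Reload" (a reload from a stack spill slot).
int count_tbl_after_folded_reload(const std::string& asm_text) {
  std::istringstream stream(asm_text);
  std::string line;
  std::string prev;
  int count = 0;
  while (std::getline(stream, line)) {
    // Skip blank lines so they do not reset the "previous line".
    if (line.find_first_not_of(" \t\r") == std::string::npos) continue;
    if (line.find("tbl.16b") != std::string::npos &&
        prev.find("Folded Reload") != std::string::npos) {
      ++count;
    }
    prev = line;
  }
  return count;
}
```

Feeding it the contents of nsHtml5TokenizerSIMD-expanded.s from each compilation would be expected to give a nonzero count for the first and zero for the second, if the assembly matches the results described above.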
Expected results
Vector constants should always be hoisted out of loops when there are no more live vector values than there are vector registers, regardless of how much ALU code exists in the overall function.
Additional info
I tried flipping the various boolean return values in MachineLICM.cpp to see if one of them was at fault, but I didn't find the cause.
The target used when compiling on the Mac differs from Compiler Explorer's armv8-a target. LLVM reports using the following on the Mac:
Features:+aes,+altnzcv,+ccdp,+ccidx,+ccpp,+complxnum,+crc,+dit,+dotprod,+flagm,+fp-armv8,+fp16fml,+fptoint,+fullfp16,+jsconv,+lse,+neon,+pauth,+perfmon,+predres,+ras,+rcpc,+rdm,+sb,+sha2,+sha3,+specrestrict,+ssbs,+v8.1a,+v8.2a,+v8.3a,+v8.4a,+v8a
CPU:apple-m1
TuneCPU:apple-m1