Implement NEON vector algorithms #2597
Conversation
Note: according to this article I should be using the NEON versions of these functions under ARM64EC, but …
As noted on Discord, there is the question of whether we want to unroll the inner loop at all.
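For readers unfamiliar with the term: "unrolling the inner loop" here means handling two 16-byte NEON vectors from each end of the range per iteration instead of one. The sketch below is an illustration only, not the PR's actual code; `rev128_u8` is a hypothetical helper that reverses all 16 bytes of a q register, and tail handling is omitted.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Hypothetical helper: reverse all 16 bytes of a q register.
// vrev64q_u8 reverses the bytes within each 64-bit half,
// and vextq_u8(..., 8) then swaps the two halves.
static inline uint8x16_t rev128_u8(uint8x16_t v) {
    const uint8x16_t r = vrev64q_u8(v);
    return vextq_u8(r, r, 8);
}

// Unrolled inner loop: two 16-byte vectors from each end per iteration.
// The tail (fewer than 64 bytes remaining) would be handled separately.
static void reverse_bytes_unrolled(uint8_t* first, uint8_t* last) {
    while (last - first >= 64) {
        last -= 32;
        const uint8x16_t lo0 = vld1q_u8(first);
        const uint8x16_t lo1 = vld1q_u8(first + 16);
        const uint8x16_t hi0 = vld1q_u8(last);
        const uint8x16_t hi1 = vld1q_u8(last + 16);
        vst1q_u8(first,      rev128_u8(hi1));
        vst1q_u8(first + 16, rev128_u8(hi0));
        vst1q_u8(last,       rev128_u8(lo1));
        vst1q_u8(last + 16,  rev128_u8(lo0));
        first += 32;
    }
}
```

The "vanilla" (non-unrolled) shape processes a single 16-byte vector from each end per iteration, as in the sketch further down in the PR description.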
After further refining the unrolled version of the code, we now have equal performance for …
Opened a devcom issue for the poor codegen here: https://developercommunity.visualstudio.com/t/arm64-neon-ld1m2-q32-intrinsic-results-in-extraneo/1690469?from=email
The issue was closed there because some questions were not answered in time.
Guess not all anymore after #2434
@rhuijben I didn't bother answering because I got confirmation from a backend dev that it'll be a won't-fix. They're working on a full overhaul of their register allocator that will fix it.
This code looks correct to me; however, I'm not sure we want to take this approach to adding vectorized algorithms for additional architectures. Trying to bolt on additional code with the preprocessor is probably not too bad for vector reverse, but I have a feeling it might become a nightmare for more complicated algorithms, and it makes things harder to follow.
const __m256i _Left = _mm256_loadu_si256(static_cast<__m256i*>(_First1));
const __m256i _Right = _mm256_loadu_si256(static_cast<__m256i*>(_First2));
_mm256_storeu_si256(static_cast<__m256i*>(_First1), _Right);
_mm256_storeu_si256(static_cast<__m256i*>(_First2), _Left);
#elif defined(_VECTOR_ARM64) // ^^^ _M_IX86 || _VECTOR_X64 ^^^ // vvv _VECTOR_ARM64 vvv
Maybe we should have a separate function for ARM; I'm not a huge fan of trying to weave together all the variants using the preprocessor like this.
Sure, we could do that. I just felt that the logic was so similar that we might as well inline it.
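For context, here is a rough sketch of what the `_VECTOR_ARM64` branch of this swap loop could look like with NEON intrinsics; this is an assumption for illustration, not necessarily the PR's actual code, and the helper name is made up.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Hypothetical sketch of the NEON counterpart of the AVX2 block quoted above:
// swap one 16-byte chunk between the two ranges.
inline void swap_16_bytes(void* _First1, void* _First2) {
    const uint8x16_t _Left  = vld1q_u8(static_cast<const uint8_t*>(_First1));
    const uint8x16_t _Right = vld1q_u8(static_cast<const uint8_t*>(_First2));
    vst1q_u8(static_cast<uint8_t*>(_First1), _Right);
    vst1q_u8(static_cast<uint8_t*>(_First2), _Left);
}
```

Note that a NEON q register is 16 bytes, half the width of the 32-byte AVX2 block above, so the ARM64 path would step through the buffers at a different stride regardless of whether it lives behind the preprocessor or in a separate ARM-only function.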
There's also a merge issue with find/count.
Hi @cbezault! We talked about this PR at the weekly maintainer meeting - sorry for not reviewing it earlier, but the codebase has changed significantly since this was originally opened. @barcharcraz says, based on her earlier review, that in addition to the source-level conflicts that have accumulated, some of the algorithms' approaches have changed, so more substantial rework would be required. Additionally, we would need benchmarks added to the new benchmark suite to re-confirm that the changed code is still an improvement. Finally, she mentioned that a whole bunch of vectorized algorithms have been added in the meantime (@AlexGuteniev has been busy 😹 😻), although that isn't a blocking issue strictly speaking, since we can always make incremental improvements.

When I thought only source-level conflicts were the issue, I was getting ready to pick up this PR, but after talking with @barcharcraz, I believe that it would be as much or more work to get this PR ready as it would be to start from scratch. Therefore we're going to close this PR without merging. We still appreciate the effort you put into it, and I apologize for not being able to land it!
Fixes #813
This PR adds NEON versions of all the vector algorithms currently implemented for SSE/AVX2.
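To give a flavor of what the NEON variants look like, here is a minimal, hypothetical sketch of a byte-wise reverse using NEON intrinsics. It is not the PR's actual code, which differs in structure and covers more element widths; function names here are made up.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Hypothetical helper: reverse all 16 bytes of a q register.
// vrev64q_u8 reverses bytes within each 64-bit half of the register;
// vextq_u8(..., 8) then swaps the halves, giving a full 16-byte reversal.
static inline uint8x16_t rev128_u8(uint8x16_t v) {
    const uint8x16_t r = vrev64q_u8(v);
    return vextq_u8(r, r, 8);
}

static void neon_reverse_bytes(uint8_t* first, uint8_t* last) {
    // Vector loop: load a 16-byte chunk from each end, reverse both,
    // and store them back swapped.
    while (last - first >= 32) {
        last -= 16;
        const uint8x16_t lo = vld1q_u8(first);
        const uint8x16_t hi = vld1q_u8(last);
        vst1q_u8(first, rev128_u8(hi));
        vst1q_u8(last, rev128_u8(lo));
        first += 16;
    }
    // Scalar tail for the remaining (< 32) bytes in the middle.
    while (last - first >= 2) {
        --last;
        const uint8_t tmp = *first;
        *first = *last;
        *last  = tmp;
        ++first;
    }
}
```

The `ld1`/`rev64`/`st1` pattern this kind of loop compiles to is what the codegen discussion below refers to.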
Benchmarks
Benchmarks use Google Benchmark, are borrowed from @BillyONeal, and were all run on an SQ1 Surface Pro X.
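The actual benchmark code is in the collapsed section below; as a rough illustration of the general shape of such a Google Benchmark (the names and size ranges here are made up, not the PR's):

```cpp
#include <benchmark/benchmark.h>

#include <algorithm>
#include <cstddef>
#include <vector>

// Made-up example of the general shape; the PR's real benchmarks differ.
static void BenchReverseUInt(benchmark::State& state) {
    std::vector<unsigned int> data(static_cast<std::size_t>(state.range(0)));
    for (auto _ : state) {
        std::reverse(data.begin(), data.end());
        benchmark::DoNotOptimize(data.data());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BenchReverseUInt)->RangeMultiplier(4)->Range(8, 1 << 16);

BENCHMARK_MAIN();
```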
Summary
Interestingly, it appears that for small sizes and for `unsigned long long` the cost of crossing the ARM64EC thunk boundary potentially outweighs the gains that a native implementation provides. However, the runs of the x64 binary had a lot more variation than the runs of the native ARM64 binaries, and the results of any individual run should be taken with a grain of salt (note the differences between the runs of BenchUnsignedIntSseReverse, which was unchanged between the two test runs).

Benchmarks
Benchmark Code
Old Results (ARM64 native)
New Results (ARM64 Native, std::reverse = NeonUnrolledReverse)
Old Results (x64 on ARM64, std::reverse = SseReverse)
New Results (x64 on ARM64, std::reverse = Native NEON)
Codegen
Summary
The codegen looks quite good and is comparable to gcc/llvm codegen, except for some random `mov`s that get inserted (see this godbolt link). LLVM prefers to emit `ldp`/`stp` instructions with quad registers vs. the `ld1`/`st1` instructions with explicit vector registers that I ended up having to use and that gcc uses. If I force LLVM to use `ld1`, it does not interleave the loads with the `rev64` instruction, and it ends up looking almost identical to the gcc codegen.

Code
Code rewritten for LLVM/GCC
MSVC codegen
LLVM codegen
GCC codegen
llvm-mca
All results were run with `llvm-mca -all-stats --march=arm64 -mcpu=kryo -mattr=+kryo,+a76,+neon -dispatch=8 -lqueue=68 -squeue=72 -register-file-size=128`. This doesn't exactly match reality, as the SQ1 has a 160-entry reorder buffer rather than the 128-entry one specified for A76 processors, but there isn't a super easy way to tune that parameter.

Summary
llvm-mca indicates that the unrolled version of our code will perform slightly worse for all data sizes tested. This is primarily due to the unnecessary `mov`s that are inserted by the compiler. If the assembly is directly modified to remove those `mov`s, we achieve parity and start outstripping the unrolled code once the data is 256 bytes or larger. This aligns more closely with the empirical results presented above.

llvm-mca output
Vanilla Assembly (not unrolled)
Unrolled Assembly
Vanilla results size 16 bytes
Unrolled results size 16 bytes
Vanilla results size 32 bytes
Unrolled results size 32 bytes
Vanilla results size 64 bytes
Unrolled results size 64 bytes
Vanilla results size 128 bytes
Unrolled results size 128 bytes
Vanilla results size 256 bytes
Unrolled results size 256 bytes