-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ARMv8 NEON (AdvSIMD) acceleration #5
Conversation
And use a wrapper function to allow cfg()ed early-return for dispatch
Currently using the "simple" movemask from https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/ but it should be possible to try the "interleaved" one later. Benchmark results: on Cortex-A72 with a 541MB file, the speedup is 2.58x (1.721s to 0.662s).
Hey, you did it! In record time, too! |
// Bulk movemask as described in | ||
// https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/ | ||
let mut matches = { | ||
let bit_mask: uint8x16_t = std::mem::transmute([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to hoist this out of the loop? Hopefully the compiler will detect it as a fixed value but with transmute
it's possible it wouldn't. I haven't checked, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checked, does not make a performance difference at all.
let sum0 = vpaddq_u8(t0, t1); | ||
let sum1 = vpaddq_u8(t2, t3); | ||
let sum0 = vpaddq_u8(sum0, sum1); | ||
let sum0 = vpaddq_u8(sum0, sum0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's annoying that this right here is why it's ARM64-only. vpadd_u8
(no q
) is available on ARM but it's going to take a lot more operations to get the result that way and probably not worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who would ever want to chew through huge log files and whatnot at SIMD speeds on old 32-bit junk anyway? :D
Minor update to fix documentation typo and to use a shorter version of the branchfree.org URL to prevent long lines (tested).
Benchmark on Neoverse-N1 (AWS Graviton2), slightly larger log (564M):
|
That looks great. I think we can merge this now :) |
@unrelentingtech I've added you to the new CONTRIBUTORS.md - please let me know if you want your name/initials/whatever in there alongside your GitHub handle. |
Before:
After:
:)