Implement ARMv8 NEON (AdvSIMD) acceleration #5

Merged
merged 3 commits into from
Sep 1, 2021

valpackett
Contributor

Before:

Benchmark #1: target/release/tac ~/big.log
  Time (mean ± σ):      1.721 s ±  0.012 s    [User: 1.418 s, System: 0.303 s]
  Range (min … max):    1.706 s …  1.738 s    10 runs

After:

Benchmark #1: target/release/tac ~/big.log
  Time (mean ± σ):     662.7 ms ±   7.8 ms    [User: 353.5 ms, System: 308.5 ms]
  Range (min … max):   650.1 ms … 673.0 ms    10 runs

:)

And use a wrapper function to allow cfg()ed early-return for dispatch
Currently using the "simple" movemask from
https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/
but it should be possible to try the "interleaved" one later.
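The wrapper-with-early-return idea mentioned above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the function names (`last_newline`, `last_newline_neon`, `last_newline_scalar`) are hypothetical, and the NEON path here only uses a vector compare to skip 16-byte chunks that contain no newline. The point is the shape of the dispatch: the `cfg`-gated block returns early on AArch64, and every other target falls through to the portable path.

```rust
/// Portable fallback: index of the last `\n` in `haystack`, if any.
/// (Hypothetical helper, not taken from the PR.)
fn last_newline_scalar(haystack: &[u8]) -> Option<usize> {
    haystack.iter().rposition(|&b| b == b'\n')
}

/// NEON path, compiled only on AArch64. Walks backwards 16 bytes at a
/// time and uses a vector compare to skip chunks without a newline.
#[cfg(target_arch = "aarch64")]
fn last_newline_neon(haystack: &[u8]) -> Option<usize> {
    use std::arch::aarch64::{vceqq_u8, vdupq_n_u8, vld1q_u8, vmaxvq_u8};

    let mut end = haystack.len();
    while end >= 16 {
        let start = end - 16;
        // SAFETY: `start..end` is in bounds, so the 16-byte load is valid.
        let any_match = unsafe {
            let newline = vdupq_n_u8(b'\n');
            let chunk = vld1q_u8(haystack.as_ptr().add(start));
            // Lanes are 0xFF on a match, so a non-zero maximum means a hit.
            vmaxvq_u8(vceqq_u8(chunk, newline)) != 0
        };
        if any_match {
            return last_newline_scalar(&haystack[start..end]).map(|i| start + i);
        }
        end = start;
    }
    // Fewer than 16 bytes left at the front: finish with the scalar scan.
    last_newline_scalar(&haystack[..end])
}

/// Wrapper: the cfg-gated block returns early where NEON is usable,
/// and all other targets fall through to the scalar implementation.
pub fn last_newline(haystack: &[u8]) -> Option<usize> {
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return last_newline_neon(haystack);
        }
    }
    last_newline_scalar(haystack)
}
```

Callers only ever see `last_newline`, so the per-architecture code stays behind one entry point and further backends can be added as more gated early returns.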

Benchmark results: on Cortex-A72 with a 541MB file, the speedup is about 2.6x (1.721 s to 0.663 s).
@mqudsi
Member

mqudsi commented Aug 31, 2021

Hey, you did it! In record time, too!

// Bulk movemask as described in
// https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/
let mut matches = {
let bit_mask: uint8x16_t = std::mem::transmute([
Member

Do you want to hoist this out of the loop? Hopefully the compiler will detect it as a fixed value but with transmute it's possible it wouldn't. I haven't checked, though.

Contributor Author

Just checked, does not make a performance difference at all.

Comment on lines +253 to +256
let sum0 = vpaddq_u8(t0, t1);
let sum1 = vpaddq_u8(t2, t3);
let sum0 = vpaddq_u8(sum0, sum1);
let sum0 = vpaddq_u8(sum0, sum0);
Member

It's annoying that this right here is why it's ARM64-only. vpadd_u8 (no q) is available on 32-bit ARM, but it would take a lot more operations to get the same result that way, so it's probably not worth it.

Contributor Author

Who would ever want to chew through huge log files and whatnot at SIMD speeds on old 32-bit junk anyway? :D
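For readers following the `vpaddq_u8` discussion, here is a rough sketch of how the "simple" bulk movemask reduces four 16-byte comparison results to one 64-bit mask. It is an illustration under assumptions rather than the PR's code: `movemask64_model`, `movemask64_neon`, `paddq_u8`, and `masked_eq` are hypothetical names. The scalar model emulates what `vpaddq_u8` does lane by lane so it runs on any target, and the `cfg`-gated function shows the same sequence with the actual AArch64 intrinsics.

```rust
/// Each lane keeps one distinct bit, repeating every 8 lanes.
const BIT_MASK: [u8; 16] = [1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128];

/// Scalar model of `vpaddq_u8`: adjacent pairs of `a` fill the low
/// 8 lanes, adjacent pairs of `b` fill the high 8 lanes.
fn paddq_u8(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..8 {
        out[i] = a[2 * i].wrapping_add(a[2 * i + 1]);
        out[i + 8] = b[2 * i].wrapping_add(b[2 * i + 1]);
    }
    out
}

/// Models `vandq_u8(vceqq_u8(chunk, needle), bit_mask)` for 16 bytes.
fn masked_eq(chunk: &[u8], needle: u8) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        if chunk[i] == needle {
            out[i] = BIT_MASK[i];
        }
    }
    out
}

/// Bit `n` of the result is set when `bytes[n] == needle`.
pub fn movemask64_model(bytes: &[u8; 64], needle: u8) -> u64 {
    let t0 = masked_eq(&bytes[0..16], needle);
    let t1 = masked_eq(&bytes[16..32], needle);
    let t2 = masked_eq(&bytes[32..48], needle);
    let t3 = masked_eq(&bytes[48..64], needle);
    // Three rounds of pairwise adds fold 8 one-bit lanes into each byte.
    let sum0 = paddq_u8(t0, t1);
    let sum1 = paddq_u8(t2, t3);
    let sum0 = paddq_u8(sum0, sum1);
    let sum0 = paddq_u8(sum0, sum0);
    let mut low = [0u8; 8];
    low.copy_from_slice(&sum0[..8]);
    u64::from_le_bytes(low)
}

/// The same sequence with NEON intrinsics (AArch64 only).
#[cfg(target_arch = "aarch64")]
pub fn movemask64_neon(bytes: &[u8; 64], needle: u8) -> u64 {
    use std::arch::aarch64::*;
    // SAFETY: all loads read 16 bytes from within the 64-byte array
    // or from the 16-byte BIT_MASK constant.
    unsafe {
        let bit_mask = vld1q_u8(BIT_MASK.as_ptr());
        let n = vdupq_n_u8(needle);
        let p = bytes.as_ptr();
        let t0 = vandq_u8(vceqq_u8(vld1q_u8(p), n), bit_mask);
        let t1 = vandq_u8(vceqq_u8(vld1q_u8(p.add(16)), n), bit_mask);
        let t2 = vandq_u8(vceqq_u8(vld1q_u8(p.add(32)), n), bit_mask);
        let t3 = vandq_u8(vceqq_u8(vld1q_u8(p.add(48)), n), bit_mask);
        let sum0 = vpaddq_u8(t0, t1);
        let sum1 = vpaddq_u8(t2, t3);
        let sum0 = vpaddq_u8(sum0, sum1);
        let sum0 = vpaddq_u8(sum0, sum0);
        vgetq_lane_u64::<0>(vreinterpretq_u64_u8(sum0))
    }
}
```

Because every lane contributes a different bit within its group of eight, the adds never carry, which is why plain pairwise addition can stand in for the bit gathering that `pmovmskb` does on x86.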

Minor update to fix documentation typo and to use a shorter version of
the branchfree.org URL to prevent long lines (tested).
@valpackett
Contributor Author

Benchmark on Neoverse-N1 (AWS Graviton2) with a slightly larger log (564M), before and then after:

Benchmark 1: target/release/tac ~/big.log
  Time (mean ± σ):      1.082 s ±  0.001 s    [User: 1.041 s, System: 0.041 s]
  Range (min … max):    1.082 s …  1.084 s    10 runs
Benchmark 1: target/release/tac ~/big.log
  Time (mean ± σ):     170.4 ms ±   0.5 ms    [User: 131.0 ms, System: 39.4 ms]
  Range (min … max):   169.4 ms … 171.3 ms    17 runs

@mqudsi
Member

mqudsi commented Sep 1, 2021

That looks great. I think we can merge this now :)

@mqudsi mqudsi merged commit b8619ed into neosmart:master Sep 1, 2021
@mqudsi
Member

mqudsi commented Sep 1, 2021

@unrelentingtech I've added you to the new CONTRIBUTORS.md - please let me know if you want your name/initials/whatever in there alongside your GitHub handle.

@valpackett valpackett deleted the neon branch September 1, 2021 15:38