Implement ARMv8 NEON (AdvSIMD) acceleration #5

valpackett · 2021-08-31T12:21:13Z

Before:

Benchmark #1: target/release/tac ~/big.log
  Time (mean ± σ):      1.721 s ±  0.012 s    [User: 1.418 s, System: 0.303 s]
  Range (min … max):    1.706 s …  1.738 s    10 runs

After:

Benchmark #1: target/release/tac ~/big.log
  Time (mean ± σ):     662.7 ms ±   7.8 ms    [User: 353.5 ms, System: 308.5 ms]
  Range (min … max):   650.1 ms … 673.0 ms    10 runs

:)

Currently using the "simple" movemask from https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/, but it should be possible to try the "interleaved" one later. Benchmark results: on a Cortex-A72 with a 541MB file, the speedup is about 2.6x (1.721 s to 662.7 ms).
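For context, a movemask packs the result of a per-byte comparison into an integer with one bit per lane, which AArch64 NEON has no single instruction for. The following is a minimal single-vector sketch of that idea, not the code from this PR: `movemask_scalar`, `movemask_neon`, and `movemask` are hypothetical names, and the NEON path assumes a little-endian AArch64 target where NEON is part of the baseline.

```rust
/// Scalar reference: bit `i` of the result is set when `chunk[i] == needle`.
pub fn movemask_scalar(chunk: &[u8; 16], needle: u8) -> u16 {
    let mut mask = 0u16;
    for (i, &b) in chunk.iter().enumerate() {
        if b == needle {
            mask |= 1 << i;
        }
    }
    mask
}

/// NEON sketch (hypothetical helper, assumes little-endian AArch64).
#[cfg(target_arch = "aarch64")]
pub fn movemask_neon(chunk: &[u8; 16], needle: u8) -> u16 {
    use std::arch::aarch64::*;
    // Each lane keeps only the bit matching its position within its 8-byte half.
    const BIT_MASK: [u8; 16] = [1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128];
    // SAFETY: both loads read exactly 16 valid bytes, and NEON is assumed
    // to be available on the AArch64 target being compiled for.
    unsafe {
        let data = vld1q_u8(chunk.as_ptr());
        // Matching lanes become 0xFF, all others 0x00.
        let eq = vceqq_u8(data, vdupq_n_u8(needle));
        let bits = vandq_u8(eq, vld1q_u8(BIT_MASK.as_ptr()));
        // Three pairwise adds fold 16 lanes down to 2 bytes in lanes 0 and 1.
        let sum = vpaddq_u8(bits, bits);
        let sum = vpaddq_u8(sum, sum);
        let sum = vpaddq_u8(sum, sum);
        vgetq_lane_u16::<0>(vreinterpretq_u16_u8(sum))
    }
}

/// Entry point selected at compile time.
#[cfg(target_arch = "aarch64")]
pub fn movemask(chunk: &[u8; 16], needle: u8) -> u16 {
    movemask_neon(chunk, needle)
}

/// Fallback for every other architecture.
#[cfg(not(target_arch = "aarch64"))]
pub fn movemask(chunk: &[u8; 16], needle: u8) -> u16 {
    movemask_scalar(chunk, needle)
}
```

Because no lane contributes more than one distinct bit per 8-byte half, the pairwise sums never overflow a byte, so addition behaves like a bitwise OR here.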

mqudsi · 2021-08-31T16:42:13Z

Hey, you did it! In record time, too!

mqudsi · 2021-08-31T15:39:02Z

src/main.rs

+                // Bulk movemask as described in
+                // https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/
+                let mut matches = {
+                    let bit_mask: uint8x16_t = std::mem::transmute([


Do you want to hoist this out of the loop? Hopefully the compiler will detect it as a fixed value but with transmute it's possible it wouldn't. I haven't checked, though.

Just checked, does not make a performance difference at all.

mqudsi · 2021-08-31T16:43:52Z

src/main.rs

+                    let sum0 = vpaddq_u8(t0, t1);
+                    let sum1 = vpaddq_u8(t2, t3);
+                    let sum0 = vpaddq_u8(sum0, sum1);
+                    let sum0 = vpaddq_u8(sum0, sum0);


It's annoying that this right here is why it's ARM64-only. vpadd_u8 (no q) is available on 32-bit ARM, but it would take a lot more operations to get the result that way, so it's probably not worth it.

Who would ever want to chew through huge log files and whatnot at SIMD speeds on old 32-bit junk anyway? :D
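To make the `vpaddq_u8` chain quoted above easier to follow, here is a rough portable model of it in plain Rust. It is a sketch under assumptions, not the PR's code: `vpaddq_u8_model` emulates the intrinsic on arrays, and `masked_eq`, `movemask64_model`, and `movemask64_reference` are hypothetical helpers. `t0` to `t3` stand for the four 16-byte comparison results after the AND with the bit mask.

```rust
/// Per-lane bit weights, repeated for each 8-byte half of a vector.
const BIT_MASK: [u8; 16] = [1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128];

/// Portable model of `vpaddq_u8`: lanes 0..8 hold sums of adjacent pairs
/// from `a`, lanes 8..16 hold sums of adjacent pairs from `b`.
fn vpaddq_u8_model(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..8 {
        out[i] = a[2 * i].wrapping_add(a[2 * i + 1]);
        out[i + 8] = b[2 * i].wrapping_add(b[2 * i + 1]);
    }
    out
}

/// Models the compare and AND steps: a matching lane keeps its bit weight.
fn masked_eq(chunk: &[u8], needle: u8) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        if chunk[i] == needle {
            out[i] = BIT_MASK[i];
        }
    }
    out
}

/// Folds 64 bytes into a 64-bit mask with the same four pairwise adds.
pub fn movemask64_model(block: &[u8; 64], needle: u8) -> u64 {
    let t0 = masked_eq(&block[0..16], needle);
    let t1 = masked_eq(&block[16..32], needle);
    let t2 = masked_eq(&block[32..48], needle);
    let t3 = masked_eq(&block[48..64], needle);

    let sum0 = vpaddq_u8_model(t0, t1); // pairs of t0 | pairs of t1
    let sum1 = vpaddq_u8_model(t2, t3); // pairs of t2 | pairs of t3
    let sum0 = vpaddq_u8_model(sum0, sum1); // groups of four from t0, t1, t2, t3
    let sum0 = vpaddq_u8_model(sum0, sum0); // groups of eight in lanes 0..8

    // The low eight lanes, read as a little-endian u64, are the mask.
    let mut low = [0u8; 8];
    low.copy_from_slice(&sum0[..8]);
    u64::from_le_bytes(low)
}

/// Straightforward per-byte reference for comparison.
pub fn movemask64_reference(block: &[u8; 64], needle: u8) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in block.iter().enumerate() {
        if b == needle {
            mask |= 1u64 << i;
        }
    }
    mask
}
```

The model also hints at why the 64-bit `vpadd_u8` would be costlier: each call handles half as many lanes, so reducing the same 64 bytes takes more instructions.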

valpackett · 2021-08-31T18:16:56Z

Benchmark on Neoverse-N1 (AWS Graviton2) with a slightly larger log (564M), before and after:

Benchmark 1: target/release/tac ~/big.log
  Time (mean ± σ):      1.082 s ±  0.001 s    [User: 1.041 s, System: 0.041 s]
  Range (min … max):    1.082 s …  1.084 s    10 runs

Benchmark 1: target/release/tac ~/big.log
  Time (mean ± σ):     170.4 ms ±   0.5 ms    [User: 131.0 ms, System: 39.4 ms]
  Range (min … max):   169.4 ms … 171.3 ms    17 runs

mqudsi · 2021-09-01T15:19:02Z

That looks great. I think we can merge this now :)

mqudsi · 2021-09-01T15:23:41Z

@unrelentingtech I've added you to the new CONTRIBUTORS.md - please let me know if you want your name/initials/whatever in there alongside your GitHub handle.

valpackett added 2 commits August 31, 2021 13:26

Properly cfg() all x86-specific things to x86

2e2eeea

And use a wrapper function to allow cfg()ed early-return for dispatch
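As an illustration of what a wrapper with cfg()ed early returns might look like, here is a minimal sketch. It is an assumption about the general pattern, not the code from this commit: every function name below is hypothetical, and the SIMD bodies are placeholders that simply call the scalar path.

```rust
/// Hypothetical scalar fallback, available on every architecture.
fn count_newlines_scalar(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| b == b'\n').count()
}

/// Hypothetical x86 path; a real one would use AVX2 intrinsics.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn count_newlines_avx2(bytes: &[u8]) -> usize {
    count_newlines_scalar(bytes) // placeholder body
}

/// Hypothetical AArch64 path; a real one would use NEON intrinsics.
#[cfg(target_arch = "aarch64")]
fn count_newlines_neon(bytes: &[u8]) -> usize {
    count_newlines_scalar(bytes) // placeholder body
}

/// Wrapper: each architecture-specific block is compiled only for its
/// target and returns early; everything else falls through to scalar code.
#[allow(unreachable_code)]
pub fn count_newlines(bytes: &[u8]) -> usize {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at run time.
            return unsafe { count_newlines_avx2(bytes) };
        }
    }

    #[cfg(target_arch = "aarch64")]
    {
        // NEON is assumed to be baseline here, so no run-time check is made.
        return count_newlines_neon(bytes);
    }

    count_newlines_scalar(bytes)
}
```

Keeping the x86-only feature detection inside its own cfg()ed block means the macro is never expanded on other targets, which is one way to let the same wrapper build everywhere.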

mqudsi reviewed Aug 31, 2021

fixup! Add AArch64 NEON/AdvSIMD acceleration

099f836

Minor update to fix documentation typo and to use a shorter version of the branchfree.org URL to prevent long lines (tested).

mqudsi merged commit b8619ed into neosmart:master Sep 1, 2021

valpackett deleted the neon branch September 1, 2021 15:38
