Use lookup table for hashing to boost perf #2

Merged: 1 commit into luizirber:master on Sep 21, 2018
Conversation

@bovee (Collaborator) commented Sep 18, 2018

Inspired by this comment. It's kind of gross and doesn't panic! on a non-ACGTN nucleotide, but it does give a fairly substantial speed-up.

Two notes:

  • I enabled it by default; maybe we don't want to do that without a panic!/non-nucleotide check?
  • It'd be really nice to build the table with a const fn; I'm not sure what the current status of those is, though.
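
For context, the change boils down to something like this minimal sketch (the names and seed values are placeholders, not the crate's actual code): build a 256-entry table once and index it directly with each input byte, so any non-ACGTN byte silently hashes to 0 instead of triggering a panic!.

// Minimal sketch of the lookup-table idea (placeholder names and seed values).
const H_A: u64 = 0x3c8b_fbb3_95c6_0474;
const H_C: u64 = 0x3193_c185_62a0_2b4c;
const H_G: u64 = 0x2032_3ed0_8257_2324;
const H_T: u64 = 0x2955_49f5_4be2_4456;

fn build_table() -> [u64; 256] {
    let mut t = [0u64; 256]; // N and every other byte hash to 0, with no panic!
    t[b'A' as usize] = H_A;
    t[b'C' as usize] = H_C;
    t[b'G' as usize] = H_G;
    t[b'T' as usize] = H_T;
    t
}

#[inline(always)]
fn h(table: &[u64; 256], c: u8) -> u64 {
    table[c as usize] // one indexed load instead of a per-base match
}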

Also, I really love the bench library you're using here:

nthash/nthash_iterator  time:   [47.482 us 47.956 us 48.578 us]
                        change: [-80.065% -79.659% -79.244%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
nthash/nthash_simple    time:   [167.02 us 167.98 us 169.18 us]
                        change: [-78.542% -78.384% -78.224%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

@luizirber (Owner) commented

This is awesome!

  • Build is failing because I'm doing cargo fmt -- --check, but the recommended formatting is silly for this case...

  • How about using lazy_static!? The example seems like what you want.

  • Criterion is super cool =]

Since this is a small project, it's easier to test new crates (like criterion), so I'm pretty happy I spent some time writing the basic implementation (and already got two PRs, yay!)

@bovee (Collaborator, Author) commented Sep 18, 2018

Thanks! This was mostly a not-serious "do lookup tables improve perf?" PR (I was really hoping LLVM would be smart here, but it looks like lookup tables are still best).

I can definitely fix this up to use lazy_static over the next few days (I've been meaning to look at it for a while), and I have a couple of ideas for reintroducing panics to the functions (which I continue to think are a good idea). I think that'll probably make fmt happy too?

@bovee (Collaborator, Author) commented Sep 20, 2018

Okay, I did a rewrite using lazy_static, which is much easier to read. I also set the table entries for all non-ACGTN bases to a sentinel value of 1 and panic if we ever see one, so the behavior should be identical to the existing behavior.
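
Roughly, the new version looks like the following sketch (placeholder seed values again, not the PR's literal code): the table is built on first use via lazy_static!, every non-ACGTN slot holds a sentinel of 1, and the hash function panics when it reads one.

// Sketch of the lazy_static! version with a sentinel for invalid bases.
use lazy_static::lazy_static;

lazy_static! {
    static ref H_LOOKUP: [u64; 256] = {
        let mut t = [1u64; 256]; // 1 marks "not a valid nucleotide"
        t[b'A' as usize] = 0x3c8b_fbb3_95c6_0474;
        t[b'C' as usize] = 0x3193_c185_62a0_2b4c;
        t[b'G' as usize] = 0x2032_3ed0_8257_2324;
        t[b'T' as usize] = 0x2955_49f5_4be2_4456;
        t[b'N' as usize] = 0;
        t
    };
}

#[inline(always)]
fn h(c: u8) -> u64 {
    let v = H_LOOKUP[c as usize]; // the index goes through the lazy_static Deref
    if v == 1 {
        panic!("non-ACGTN nucleotide: {}", c as char);
    }
    v
}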

The one minor issue is that perf regresses slightly compared to the precomputed lookup above:

nthash/nthash_iterator  time:   [68.371 us 68.823 us 69.460 us]
                        change: [-68.236% -67.967% -67.718%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
nthash/nthash_simple    time:   [234.90 us 235.83 us 236.92 us]
                        change: [-70.306% -69.993% -69.664%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  5 (5.00%) high severe

I think this is probably good to merge as-is, but it would be nice to use a const fn at some point (maybe when that feature hits stable) to pregenerate the table at compile time for that extra ~10% boost we see above.
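
For the record, on a later compiler than was stable at the time (one that allows mutable locals and index assignment in const fn), that compile-time version could look roughly like this hypothetical sketch, with no runtime initialization at all:

// Hypothetical const fn version (requires a newer Rust than 2018 stable):
// the table is evaluated at compile time, so the hot path is a plain static load.
const fn build_table() -> [u64; 256] {
    let mut t = [1u64; 256]; // 1 = sentinel for non-ACGTN bytes
    t[b'A' as usize] = 0x3c8b_fbb3_95c6_0474;
    t[b'C' as usize] = 0x3193_c185_62a0_2b4c;
    t[b'G' as usize] = 0x2032_3ed0_8257_2324;
    t[b'T' as usize] = 0x2955_49f5_4be2_4456;
    t[b'N' as usize] = 0;
    t
}

static H_TABLE: [u64; 256] = build_table();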

@bovee changed the title from "Use lookup table for hashing to boost perf." to "Use lookup table for hashing to boost perf" on Sep 20, 2018
@luizirber merged commit 12ca743 into luizirber:master on Sep 21, 2018
@luizirber (Owner) commented

For reference, these are the numbers I see with the C++ ntHash:

./nt_bench 
2018-09-14 19:15:12
Running ./nt_bench
Run on (32 X 2800.06 MHz CPU s)
CPU Caches:
  L1 Data 32K (x32)
  L1 Instruction 32K (x32)
  L2 Unified 256K (x32)
  L3 Unified 25600K (x32)
--------------------------------------------------
Benchmark           Time           CPU Iterations
--------------------------------------------------
BM_nthash       36444 ns      36443 ns      19454

(code in https://github.com/luizirber/nthash_perf)

So, I'm fine with merging this with lazy_static (because the code is clearer), but with the previous array we got pretty close to the original implementation.

@lynaghk commented Jan 23, 2019

@bovee @luizirber just in case y'all are interested: I saw this PR and was also surprised that Rust/LLVM wasn't generating optimal code for such a simple match. I dug into the generated LLVM IR and assembly extensively here: https://kevinlynagh.com/notes/match-vs-lookup/

Summary:

  • Match is slower than lookup because more instructions are generated and the loop doesn't unroll as much

  • The overhead of accessing a lazy_static is actually significant in such a tight loop; you may want to write out the lookup table in full to avoid it. (Or wait until Rust 1.33, where let is allowed in const expressions.)
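
One cheap mitigation for the second point (a sketch of mine, not something from the post, reusing the hypothetical H_LOOKUP from the lazy_static sketch above) is to dereference the lazy_static once per sequence rather than once per base, so the initialization check leaves the hot loop:

// Hypothetical sketch: borrow the table once before the loop; the XOR
// accumulation below is just a stand-in for the real rolling-hash update.
fn hash_all(seq: &[u8]) -> u64 {
    let table: &[u64; 256] = &H_LOOKUP; // single lazy_static Deref here
    let mut acc = 0u64;
    for &c in seq {
        acc ^= table[c as usize]; // plain indexed loads inside the loop
    }
    acc
}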

@bovee (Collaborator, Author) commented Jan 23, 2019

@lynaghk this is really interesting and much stranger than I thought when I threw together this PR; thanks for writing it up!

@luizirber (Owner) commented

@lynaghk thank you for showing up and sharing your thoughtful post! That's an amazing deep dive into the problem!

(quick nitpick: the date in your post is "2018/01/22", and I was thinking "I don't think nthash is that old!" =P)

@lynaghk commented Jan 25, 2019 via email

@luizirber mentioned this pull request on Feb 28, 2019
@kloetzl commented Jan 24, 2020

Match is slower than lookup because more instructions are generated and the loop doesn't unroll as much.

Just for completeness, that is only a half-truth. The bigger problem is that the match statement gets compiled into a series of cmp/jmp instructions whose outcomes the CPU has a hard time predicting; the constant rollbacks from mispredicted branches stall execution.
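
For contrast, the branchy form being described is roughly a per-base match like the sketch below (placeholder seed values): each call becomes a chain of compares and conditional jumps whose outcome depends on the input byte, which is exactly what the branch predictor struggles with on real sequence data.

// Sketch of the match-style per-base hash (placeholder values).
fn h_match(c: u8) -> u64 {
    match c {
        b'A' => 0x3c8b_fbb3_95c6_0474,
        b'C' => 0x3193_c185_62a0_2b4c,
        b'G' => 0x2032_3ed0_8257_2324,
        b'T' => 0x2955_49f5_4be2_4456,
        b'N' => 0,
        _ => panic!("non-ACGTN nucleotide: {}", c as char),
    }
}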
