
XXH128 #29

Open
ronag opened this issue Jan 28, 2022 · 13 comments

Comments

@ronag

ronag commented Jan 28, 2022

WASM supports SIMD. Could we also have an xxh128 variant?

@jungomi
Owner

jungomi commented Feb 1, 2022

xxh128, and also SIMD for xxh64, would certainly be interesting. Unfortunately, Safari still doesn't support WASM's SIMD, and I think all major browsers should be supported without any feature detection (e.g. with https://github.com/GoogleChromeLabs/wasm-feature-detect), because that would require shipping multiple WASM binaries.
I would not expect it in the immediate future, but it's definitely something that should be considered at some point.
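
To illustrate what the feature-detection route would entail, here is a minimal TypeScript sketch using wasm-feature-detect; the two binary names are hypothetical, since the library currently ships a single WASM binary.

import { simd } from "wasm-feature-detect";

// Minimal sketch: choose a WASM binary at runtime based on SIMD support.
// The file names are hypothetical; supporting SIMD this way means building
// and shipping two binaries instead of one.
async function loadBinary(): Promise<ArrayBuffer> {
  const name = (await simd()) ? "xxhash-simd.wasm" : "xxhash-scalar.wasm";
  const response = await fetch(name);
  return response.arrayBuffer();
}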

@fabiospampinato

@jungomi I'm looking forward to this library using SIMD instructions. IMO it might be worth implementing that even if Safari doesn't support it yet; there are use cases where Safari isn't a target engine anyway (Node-only or Electron-only apps, for example).

@jungomi
Owner

jungomi commented Feb 8, 2022

Yes, absolutely. But I personally don't really have the time right now to look into WebAssembly SIMD and to port the xxh3 implementation to WebAssembly. I'm of course open to any contributions, but as long as no one is in dire need of it (more likely the xxh128 than just the improvements from SIMD) or just eager to implement it, I wouldn't expect it to happen immediately.

@marcusdarmstrong
Contributor

marcusdarmstrong commented Mar 11, 2023

I've gone through an initial implementation of one-shot SIMD 64b xxh3 for performance validation purposes, and unfortunately the results aren't particularly promising.

Node 19.7 on macOS/aarch64 (I don't currently own an x64 machine, so I can't validate the performance differences that exist on V8's SSE implementation of WASM SIMD; I could publish my branch if somebody else would like to check):

xxhash-wasm#h32 x 19,159,453 ops/sec ±0.17% (99 runs sampled)
xxhash-wasm#h64 x 18,787,416 ops/sec ±0.10% (102 runs sampled)
xxhash-wasm/xxh3 x 19,101,282 ops/sec ±0.56% (100 runs sampled)
Benchmark 1 bytes - Fastest is xxhash-wasm/xxh3
xxhash-wasm#h32 x 17,385,807 ops/sec ±0.15% (102 runs sampled)
xxhash-wasm#h64 x 17,173,576 ops/sec ±0.34% (101 runs sampled)
xxhash-wasm/xxh3 x 17,289,084 ops/sec ±0.18% (100 runs sampled)
Benchmark 10 bytes - Fastest is xxhash-wasm#h32
xxhash-wasm#h32 x 15,429,409 ops/sec ±0.09% (101 runs sampled)
xxhash-wasm#h64 x 15,365,766 ops/sec ±0.26% (99 runs sampled)
xxhash-wasm/xxh3 x 13,104,484 ops/sec ±0.06% (100 runs sampled)
Benchmark 100 bytes - Fastest is xxhash-wasm#h32,xxhash-wasm#h64
xxhash-wasm#h32 x 4,562,643 ops/sec ±0.22% (100 runs sampled)
xxhash-wasm#h64 x 6,410,266 ops/sec ±0.13% (101 runs sampled)
xxhash-wasm/xxh3 x 5,181,818 ops/sec ±0.29% (100 runs sampled)
Benchmark 1000 bytes - Fastest is xxhash-wasm#h64
xxhash-wasm#h32 x 559,445 ops/sec ±0.16% (98 runs sampled)
xxhash-wasm#h64 x 897,968 ops/sec ±0.06% (99 runs sampled)
xxhash-wasm/xxh3 x 727,428 ops/sec ±0.13% (102 runs sampled)
Benchmark 10000 bytes - Fastest is xxhash-wasm#h64

Larger input sizes demonstrate the same pattern as the 10k benchmark, with linear decreases in speed. It's also worth mentioning that Node 19.7 appears to have a sizable performance improvement over 18.x in BigInt handling, which leads to xxh64 matching the low-input-size performance of xxh32.

I have outstanding work to do to try to improve the performance of the xxh3 implementation on input sizes over 240 bytes by (1) unrolling some loops and (2) moving accumulators from memory to locals. That said, the lagging 100b benchmark would be unaffected by these improvements.

The primary operation in use in that 100b case (and a fairly critical one in general for xxh3) is a u64 multiply yielding a u128, which isn't well supported by the WASM instruction set. Originally I ported the xxHash scalar implementation of that operation directly to WASM, but after seeing these results I attempted a vectorized version myself. Unfortunately, that vectorized implementation didn't appear to provide any performance advantage. I'm very new to SIMD (which is part of why I attempted this implementation), so I'll attach that function directly here in case anybody has a suggestion for improvement.

Algorithm reference: https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L3784

(func $XXH3_mul128_fold64 (param $lhs i64) (param $rhs i64) (result i64)
  (local $r_hi v128)
  (local $r_lo v128)
  (local $left v128)
  (local $r_his v128)
  (local $r_los v128)
  (local $r_los_his v128)
  (local $r_los_los v128)
  (local $cross i64)
  (local $upper i64)
  (local $lower i64)

  (local.set $left (i64x2.splat (local.get $lhs)))
  (local.set $r_hi (i32x4.splat (i32.wrap_i64 (i64.shr_u (local.get $rhs) (i64.const 32)))))
  (local.set $r_lo (i32x4.splat (i32.wrap_i64 (local.get $rhs))))
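  ;; Viewed as i32x4, $left is [lhs_lo, lhs_hi, lhs_lo, lhs_hi], so the
  ;; "high" extmuls below multiply [lhs_lo, lhs_hi] by the splatted rhs half,
  ;; producing two 32x32->64 partial products per instruction.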

  (local.set $r_his 
    (i64x2.extmul_high_i32x4_u (local.get $left) (local.get $r_hi)))
  (local.set $r_los
    (i64x2.extmul_high_i32x4_u (local.get $left) (local.get $r_lo)))

  (local.set $r_los_his (i64x2.shr_u (local.get $r_los) (i32.const 32)))
  (local.set $r_los_los 
    (v128.and 
      (local.get $r_los)
      (v128.const i64x2 0xFFFFFFFF 0xFFFFFFFF)))

  (local.set $cross
    (i64.add 
      ;; r_los_his[0] = lo_lo >> 32
      (i64x2.extract_lane 0 (local.get $r_los_his))
      (i64.add 
        ;; r_los_los[1] = hi_lo & 0xFFFFFFFF
        (i64x2.extract_lane 1 (local.get $r_los_los))
        ;; r_his[0] = lo_hi
        (i64x2.extract_lane 0 (local.get $r_his)))))
  (local.set $upper
    (i64.add 
      ;; r_los_his[1] = hi_lo >> 32
      (i64x2.extract_lane 1 (local.get $r_los_his))
      (i64.add 
        (i64.shr_u (local.get $cross) (i64.const 32))
        ;; r_his[1] = hi_hi
        (i64x2.extract_lane 1 (local.get $r_his)))))
  (local.set $lower 
    (i64.or 
      (i64.shl (local.get $cross) (i64.const 32))
      ;; r_los_los[0] = lo_lo & 0xFFFFFFFF
      (i64x2.extract_lane 0 (local.get $r_los_los))))
  (i64.xor (local.get $lower) (local.get $upper)))
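
For comparison, here's a scalar BigInt sketch of the same fold (mine, not part of the branch): BigInt arithmetic is arbitrary precision, so it yields the full 128-bit product directly and can serve as a correctness oracle while tweaking the vectorized version.

// Scalar BigInt reference for XXH3_mul128_fold64 (a sketch, not from the
// branch). The full 128-bit product is available natively, so the fold is
// just a shift and an XOR.
const MASK64 = (1n << 64n) - 1n;

function mul128Fold64(lhs: bigint, rhs: bigint): bigint {
  const product = (lhs & MASK64) * (rhs & MASK64); // full 128-bit product
  const lower = product & MASK64;                  // low 64 bits
  const upper = product >> 64n;                    // high 64 bits
  return (lower ^ upper) & MASK64;                 // fold: low XOR high
}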

Edit/Update: Manually unrolling loops does not appear to materially alter performance.

@marcusdarmstrong
Contributor

As a follow-on, I've pushed my work-in-progress branch: https://github.com/marcusdarmstrong/xxhash-wasm/tree/xxh3

Beyond the above discussion, the branch contains two versions of 64b xxh3: a more orderly port, and an xxh3-perf version where I've been attempting to optimize the code by way of inlining, unrolling, and additional vectorization.

That additional optimization work drove aarch64 performance to a win or flat on almost every benchmark. However, I was able to run a test on an old x64 system, and the results were disheartening.

Benchmark results are from Node 16.15 (the latest version of Node the x64 machine supports).
aarch64:

xxhash-wasm#h32 x 6,863,147 ops/sec ±0.59% (95 runs sampled)
xxhash-wasm#h64 x 5,409,205 ops/sec ±0.99% (95 runs sampled)
xxhash-wasm/xxh3-perf x 5,451,317 ops/sec ±0.65% (95 runs sampled)
Benchmark 1 bytes - Fastest is xxhash-wasm#h32
xxhash-wasm#h32 x 6,987,866 ops/sec ±0.50% (99 runs sampled)
xxhash-wasm#h64 x 5,305,161 ops/sec ±0.51% (97 runs sampled)
xxhash-wasm/xxh3-perf x 5,549,863 ops/sec ±0.52% (95 runs sampled)
Benchmark 10 bytes - Fastest is xxhash-wasm#h32
xxhash-wasm#h32 x 6,607,805 ops/sec ±0.32% (99 runs sampled)
xxhash-wasm#h64 x 5,110,061 ops/sec ±0.48% (98 runs sampled)
xxhash-wasm/xxh3-perf x 5,202,677 ops/sec ±0.48% (99 runs sampled)
Benchmark 100 bytes - Fastest is xxhash-wasm#h32
xxhash-wasm#h32 x 3,081,462 ops/sec ±0.41% (97 runs sampled)
xxhash-wasm#h64 x 3,425,694 ops/sec ±0.34% (93 runs sampled)
xxhash-wasm/xxh3-perf x 3,737,133 ops/sec ±0.27% (102 runs sampled)
Benchmark 1000 bytes - Fastest is xxhash-wasm/xxh3-perf
xxhash-wasm#h32 x 528,151 ops/sec ±0.16% (99 runs sampled)
xxhash-wasm#h64 x 818,949 ops/sec ±0.13% (97 runs sampled)
xxhash-wasm/xxh3-perf x 823,626 ops/sec ±0.07% (99 runs sampled)
Benchmark 10000 bytes - Fastest is xxhash-wasm/xxh3-perf
xxhash-wasm#h32 x 58,849 ops/sec ±0.02% (101 runs sampled)
xxhash-wasm#h64 x 97,561 ops/sec ±0.09% (102 runs sampled)
xxhash-wasm/xxh3-perf x 98,420 ops/sec ±0.05% (101 runs sampled)
Benchmark 100000 bytes - Fastest is xxhash-wasm/xxh3-perf
xxhash-wasm#h32 x 5,881 ops/sec ±0.15% (101 runs sampled)
xxhash-wasm#h64 x 9,786 ops/sec ±0.11% (99 runs sampled)
xxhash-wasm/xxh3-perf x 9,951 ops/sec ±0.09% (102 runs sampled)
Benchmark 1000000 bytes - Fastest is xxhash-wasm/xxh3-perf
xxhash-wasm#h32 x 577 ops/sec ±0.02% (99 runs sampled)
xxhash-wasm#h64 x 948 ops/sec ±0.08% (100 runs sampled)
xxhash-wasm/xxh3-perf x 957 ops/sec ±0.02% (101 runs sampled)
Benchmark 10000000 bytes - Fastest is xxhash-wasm/xxh3-perf
xxhash-wasm#h32 x 56.32 ops/sec ±0.06% (74 runs sampled)
xxhash-wasm#h64 x 91.13 ops/sec ±0.05% (79 runs sampled)
xxhash-wasm/xxh3-perf x 92.05 ops/sec ±0.08% (80 runs sampled)
Benchmark 100000000 bytes - Fastest is xxhash-wasm/xxh3-perf

x64:

xxhash-wasm#h32 x 3,984,394 ops/sec ±0.58% (94 runs sampled)
xxhash-wasm#h64 x 3,048,905 ops/sec ±0.48% (95 runs sampled)
xxhash-wasm/xxh3-perf x 3,044,536 ops/sec ±0.52% (97 runs sampled)
Benchmark 1 bytes - Fastest is xxhash-wasm#h32
xxhash-wasm#h32 x 3,903,296 ops/sec ±0.15% (96 runs sampled)
xxhash-wasm#h64 x 2,993,762 ops/sec ±0.22% (96 runs sampled)
xxhash-wasm/xxh3-perf x 2,967,903 ops/sec ±0.21% (98 runs sampled)
Benchmark 10 bytes - Fastest is xxhash-wasm#h32
xxhash-wasm#h32 x 3,668,453 ops/sec ±0.28% (97 runs sampled)
xxhash-wasm#h64 x 2,899,470 ops/sec ±0.61% (94 runs sampled)
xxhash-wasm/xxh3-perf x 2,573,493 ops/sec ±0.31% (98 runs sampled)
Benchmark 100 bytes - Fastest is xxhash-wasm#h32
xxhash-wasm#h32 x 2,270,914 ops/sec ±0.29% (99 runs sampled)
xxhash-wasm#h64 x 2,235,033 ops/sec ±0.26% (96 runs sampled)
xxhash-wasm/xxh3-perf x 2,063,224 ops/sec ±0.19% (98 runs sampled)
Benchmark 1000 bytes - Fastest is xxhash-wasm#h32
xxhash-wasm#h32 x 460,768 ops/sec ±0.31% (97 runs sampled)
xxhash-wasm#h64 x 643,998 ops/sec ±0.24% (97 runs sampled)
xxhash-wasm/xxh3-perf x 474,138 ops/sec ±0.16% (97 runs sampled)
Benchmark 10000 bytes - Fastest is xxhash-wasm#h64
xxhash-wasm#h32 x 53,130 ops/sec ±0.28% (96 runs sampled)
xxhash-wasm#h64 x 81,439 ops/sec ±0.22% (99 runs sampled)
xxhash-wasm/xxh3-perf x 55,706 ops/sec ±0.31% (94 runs sampled)
Benchmark 100000 bytes - Fastest is xxhash-wasm#h64
xxhash-wasm#h32 x 4,599 ops/sec ±0.25% (98 runs sampled)
xxhash-wasm#h64 x 6,523 ops/sec ±0.40% (96 runs sampled)
xxhash-wasm/xxh3-perf x 4,881 ops/sec ±0.27% (96 runs sampled)
Benchmark 1000000 bytes - Fastest is xxhash-wasm#h64
xxhash-wasm#h32 x 378 ops/sec ±0.40% (91 runs sampled)
xxhash-wasm#h64 x 502 ops/sec ±0.58% (94 runs sampled)
xxhash-wasm/xxh3-perf x 401 ops/sec ±0.37% (92 runs sampled)
Benchmark 10000000 bytes - Fastest is xxhash-wasm#h64
xxhash-wasm#h32 x 37.05 ops/sec ±0.52% (65 runs sampled)
xxhash-wasm#h64 x 48.07 ops/sec ±1.33% (63 runs sampled)
xxhash-wasm/xxh3-perf x 38.78 ops/sec ±0.88% (67 runs sampled)
Benchmark 100000000 bytes - Fastest is xxhash-wasm#h64

As a result of these numbers, my intent is to not bother trying to get this into a publishable state, as I don't see much of a path to further optimizations. But if anybody else sees this and wants to give it a shot, feel free to reach out and chat.

@ronag
Author

ronag commented Mar 15, 2023

I don't think testing with node 16.15 is going to be very indicative...

@ronag
Author

ronag commented Mar 15, 2023

I could try running it on an EPYC machine. But I don't see any "bench" script in the package.json. How do I run the benchmarks?

@marcusdarmstrong
Contributor

The bench directory is a separate Node application; it needs its own yarn install.

@ronag
Author

ronag commented Mar 15, 2023

Cannot find module '/xxhash-wasm/bench/node_modules/xxhash-wasm/cjs/xxhash-wasm.cjs'

@ronag
Author

ronag commented Mar 15, 2023

wasm-opt was missing from the deps for building.

@ronag
Author

ronag commented Mar 15, 2023

Mixed results:

xxhash-wasm#h32 x 12,567,019 ops/sec ±0.26% (91 runs sampled)
xxhash-wasm#h64 x 12,739,953 ops/sec ±0.33% (96 runs sampled)
xxhash-wasm/xxh3-perf x 13,046,099 ops/sec ±0.23% (95 runs sampled)
Benchmark 1 bytes - Fastest is xxhash-wasm/xxh3-perf
xxhash-wasm#h32 x 11,619,075 ops/sec ±0.48% (97 runs sampled)
xxhash-wasm#h64 x 11,382,328 ops/sec ±0.33% (94 runs sampled)
xxhash-wasm/xxh3-perf x 11,146,227 ops/sec ±0.44% (93 runs sampled)
Benchmark 10 bytes - Fastest is xxhash-wasm#h32
xxhash-wasm#h32 x 9,388,540 ops/sec ±0.33% (97 runs sampled)
xxhash-wasm#h64 x 9,568,718 ops/sec ±0.28% (97 runs sampled)
xxhash-wasm/xxh3-perf x 8,205,738 ops/sec ±0.34% (96 runs sampled)
Benchmark 100 bytes - Fastest is xxhash-wasm#h64
xxhash-wasm#h32 x 3,109,426 ops/sec ±0.29% (93 runs sampled)
xxhash-wasm#h64 x 4,213,040 ops/sec ±0.33% (93 runs sampled)
xxhash-wasm/xxh3-perf x 4,454,175 ops/sec ±0.55% (96 runs sampled)
Benchmark 1000 bytes - Fastest is xxhash-wasm/xxh3-perf
xxhash-wasm#h32 x 403,521 ops/sec ±0.29% (97 runs sampled)
xxhash-wasm#h64 x 597,044 ops/sec ±0.23% (94 runs sampled)
xxhash-wasm/xxh3-perf x 704,610 ops/sec ±0.37% (93 runs sampled)
Benchmark 10000 bytes - Fastest is xxhash-wasm/xxh3-perf
xxhash-wasm#h32 x 40,900 ops/sec ±0.22% (98 runs sampled)
xxhash-wasm#h64 x 62,809 ops/sec ±0.31% (94 runs sampled)
xxhash-wasm/xxh3-perf x 74,341 ops/sec ±0.32% (95 runs sampled)
Benchmark 100000 bytes - Fastest is xxhash-wasm/xxh3-perf
xxhash-wasm#h32 x 3,949 ops/sec ±0.29% (97 runs sampled)
xxhash-wasm#h64 x 6,049 ops/sec ±0.35% (94 runs sampled)
xxhash-wasm/xxh3-perf x 6,954 ops/sec ±0.33% (97 runs sampled)
Benchmark 1000000 bytes - Fastest is xxhash-wasm/xxh3-perf
xxhash-wasm#h32 x 399 ops/sec ±0.28% (92 runs sampled)
xxhash-wasm#h64 x 608 ops/sec ±0.39% (93 runs sampled)
xxhash-wasm/xxh3-perf x 709 ops/sec ±0.42% (94 runs sampled)
Benchmark 10000000 bytes - Fastest is xxhash-wasm/xxh3-perf
xxhash-wasm#h32 x 35.07 ops/sec ±1.22% (61 runs sampled)
xxhash-wasm#h64 x 51.52 ops/sec ±0.71% (67 runs sampled)
xxhash-wasm/xxh3-perf x 58.89 ops/sec ±0.37% (62 runs sampled)
Benchmark 100000000 bytes - Fastest is xxhash-wasm/xxh3-perf

@ronag
Author

ronag commented Mar 15, 2023

I suspect SIMD overhead is too large for inputs smaller than 64 bytes.

@marcusdarmstrong
Contributor

SIMD is only really in use for inputs > 240b. The underlying driver of the performance difficulties at smaller input sizes is that WASM doesn't have a 64b*64b=>128b mul operation, while the C version of xxHash is able to use intrinsics to access that operation on x64 and aarch64.
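
To make the cost concrete, without that instruction each multiply has to be synthesized from four 32x32->64 partial products plus carry propagation, roughly like this (a BigInt sketch mirroring xxHash's XXH_mult64to128 scalar fallback; inputs are assumed to already fit in 64 bits):

const MASK32 = 0xffffffffn;

// Sketch of the portable 64x64->128 multiply (after XXH_mult64to128's
// scalar fallback), written in BigInt for clarity. Inputs must be < 2^64.
function mult64to128(lhs: bigint, rhs: bigint): { low64: bigint; high64: bigint } {
  const loLo = (lhs & MASK32) * (rhs & MASK32); // low  x low
  const hiLo = (lhs >> 32n) * (rhs & MASK32);   // high x low
  const loHi = (lhs & MASK32) * (rhs >> 32n);   // low  x high
  const hiHi = (lhs >> 32n) * (rhs >> 32n);     // high x high

  const cross = (loLo >> 32n) + (hiLo & MASK32) + loHi;
  const high64 = hiHi + (hiLo >> 32n) + (cross >> 32n);
  const low64 = ((cross << 32n) | (loLo & MASK32)) & ((1n << 64n) - 1n);
  return { low64, high64 };
}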
