Add SSE2/AVX2/WASM SIMD support #86

Merged: 13 commits merged into petgraph:master on Mar 19, 2024

Conversation

@james7132 (Collaborator) commented on Jan 9, 2023

Fixes #73. Changes Block from an alias to a platform/target-specific newtype around usize, __m128i, __m256i, or v128.

This supports all SIMD intrinsics that have been stabilized in the standard library.
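
For illustration, here is a minimal sketch (not the exact code in this PR) of how a cfg-gated newtype can pick the widest available register type for the compilation target and fall back to usize elsewhere:

```rust
// Illustrative only: a cfg-gated Block newtype selecting the widest
// stabilized SIMD type available on the compilation target.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Block(core::arch::x86_64::__m256i);

#[cfg(all(target_arch = "x86_64", not(target_feature = "avx2")))]
#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Block(core::arch::x86_64::__m128i); // SSE2 is baseline on x86_64

#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Block(core::arch::wasm32::v128);

#[cfg(not(any(
    target_arch = "x86_64",
    all(target_arch = "wasm32", target_feature = "simd128")
)))]
#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Block(usize);
```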

SSE2 is available on all x86_64 machines, so this should see a 4x speedup relative to the u32-based approach used before #74.

AVX2 is only available on roughly 89% of consumer machines, so it cannot be relied on unconditionally, but it should show another 2x speedup over SSE2. Those using this in a cloud or server environment will likely benefit from compiling with -C target-cpu=native, which enables AVX2 when the host machine supports it.

NOTE: This adds a lot of unsafe code, simply by nature of using SIMD intrinsics. There's also a fair amount of core::mem::transmute, though I've tried to keep it to a minimum.

Performance

group                           avx2                                   sse2                                   usize
-----                           ----                                   ----                                   -----
insert/1m                       1.00   1772.7±8.57µs        ? ?/sec    1.00  1777.2±13.75µs        ? ?/sec    1.00   1773.4±8.48µs        ? ?/sec
insert_range/1m                 1.00      2.7±0.02µs        ? ?/sec    1.98      5.3±0.04µs        ? ?/sec    3.96     10.6±0.09µs        ? ?/sec
iter_ones/all_ones              1.02    424.5±1.07µs        ? ?/sec    1.01    420.2±1.59µs        ? ?/sec    1.00    416.5±2.06µs        ? ?/sec
iter_ones/all_zeros             2.01     17.1±0.08µs        ? ?/sec    1.00      8.5±0.08µs        ? ?/sec    1.00      8.5±0.07µs        ? ?/sec
iter_ones/contains_all_ones     1.00   1629.8±8.58µs        ? ?/sec    1.00   1629.3±5.90µs        ? ?/sec    1.16  1897.4±26.96µs        ? ?/sec
iter_ones/contains_all_zeros    1.00   1629.4±4.83µs        ? ?/sec    1.00   1628.9±4.56µs        ? ?/sec    1.16  1897.1±28.11µs        ? ?/sec

Using the benchmarks ported to Criterion (see #84): on set and batch operations like insert_range, intersection_with, etc., these SIMD-accelerated versions are, as expected, 2-4 times faster than when using usize as the block, which should extend the performance gains of #74 even further.
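
For context on where the batch-operation speedup comes from: a single SSE2 instruction ORs 128 bits at once, i.e. two 64-bit words per step. A rough sketch (not taken from this PR):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128i, _mm_or_si128};

// Illustrative only: one SSE2 instruction combines 128 bits per step,
// versus 64 bits per step with a usize block.
#[cfg(target_arch = "x86_64")]
fn union_block(a: __m128i, b: __m128i) -> __m128i {
    // SAFETY: SSE2 is part of the x86_64 baseline, so the required
    // target feature is always available here.
    unsafe { _mm_or_si128(a, b) }
}
```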

I tested optional runtime feature detection via is_x86_feature_detected! for these operations, and unfortunately it causes serious regressions.
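
For reference, runtime dispatch would look roughly like the following (hypothetical helper names, not code from this PR). The branch on every call, and the inlining barrier it creates for such small, hot functions, is the kind of overhead that plausibly explains those regressions.

```rust
// Hypothetical sketch of runtime feature detection; not the approach
// that was merged.
#[cfg(target_arch = "x86_64")]
fn count_ones_dispatch(blocks: &[u64]) -> u32 {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: the avx2 feature was just verified at runtime.
        unsafe { count_ones_avx2(blocks) }
    } else {
        count_ones_scalar(blocks)
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_ones_avx2(blocks: &[u64]) -> u32 {
    // A real implementation would use AVX2 intrinsics here.
    count_ones_scalar(blocks)
}

fn count_ones_scalar(blocks: &[u64]) -> u32 {
    blocks.iter().map(|b| b.count_ones()).sum()
}
```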

@james7132 (Collaborator, Author):

Note: I think this currently breaks serde support.

@james7132 marked this pull request as draft on January 9, 2023
@james7132 marked this pull request as ready for review on March 19, 2024
@james7132 (Collaborator, Author) commented on Mar 19, 2024

Reran the benchmarks; this seems to be a significant gain on multiple fronts when the right CPU features are enabled at compile time:

group                           master                                 simd
-----                           ------                                 ----
clear/1m                        1.00   1153.7±8.18ns        ? ?/sec    1.00   1155.9±3.84ns        ? ?/sec
count_ones/1m                   1.00      2.5±0.01µs        ? ?/sec    1.91      4.7±0.02µs        ? ?/sec
difference_with/1m              1.28      2.7±0.00µs        ? ?/sec    1.00      2.1±0.00µs        ? ?/sec
grow_and_insert                 1.00      2.4±0.02ms        ? ?/sec    1.09      2.6±0.03ms        ? ?/sec
insert/1m                       1.00    962.1±9.87µs        ? ?/sec    1.00    958.7±5.74µs        ? ?/sec
insert_range/1m                 1.00   1148.3±4.07ns        ? ?/sec    1.18   1358.1±6.24ns        ? ?/sec
intersect_with/1m               1.31      2.8±0.01µs        ? ?/sec    1.00      2.1±0.00µs        ? ?/sec
iter_ones/all_ones              1.00      2.3±0.00ms        ? ?/sec    1.00      2.3±0.02ms        ? ?/sec
iter_ones/all_zeros             1.00      4.3±0.07µs        ? ?/sec    1.01      4.3±0.14µs        ? ?/sec
iter_ones/contains_all_ones     2.51   410.4±19.39µs        ? ?/sec    1.00    163.6±1.68µs        ? ?/sec
iter_ones/contains_all_zeros    2.49    407.3±4.52µs        ? ?/sec    1.00    163.4±0.14µs        ? ?/sec
iter_ones/sparse                1.37    288.4±3.85µs        ? ?/sec    1.00    210.5±0.68µs        ? ?/sec
symmetric_difference_with/1m    1.30      2.8±0.01µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec
union_with/1m                   1.31      2.8±0.02µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec

The Mask changes might need to be reverted though. The regressions in insert_range and count_ones are hard to swallow.

@james7132 merged commit d7ae91f into petgraph:master on Mar 19, 2024 (13 checks passed)
Review comment on lines +547 to +555:

// SAFETY: This is using the exact same allocation pattern, size, and capacity
// making this reconstruction of the Vec safe.
let mut data = unsafe {
    let mut data = ManuallyDrop::new(self.data);
    let ptr = data.as_mut_ptr().cast();
    let len = data.len() * SimdBlock::USIZE_COUNT;
    let capacity = data.capacity() * SimdBlock::USIZE_COUNT;
    Vec::from_raw_parts(ptr, len, capacity)
};

Contributor:

I don't think this is safe. The Vec is initially allocated with SimdBlocks, which can have a different alignment than usize. For example, the AVX2 block has an alignment of 32 and the SSE2 block has an alignment of 16, while usize only has an alignment of 8.
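
To make the concern concrete: Vec::from_raw_parts (and the eventual deallocation) must use the same layout the memory was originally allocated with, and the SIMD block types are more strictly aligned than usize. A small illustration, assuming a 64-bit target (not code from the PR):

```rust
use core::mem::align_of;

#[cfg(target_arch = "x86_64")]
fn alignment_mismatch() {
    use core::arch::x86_64::{__m128i, __m256i};
    // The original allocation was made for SIMD blocks...
    assert_eq!(align_of::<__m256i>(), 32);
    assert_eq!(align_of::<__m128i>(), 16);
    // ...but after the cast, Vec<usize> will deallocate assuming usize's
    // layout, which is undefined behavior if the alignments differ.
    assert_eq!(align_of::<usize>(), 8);
}
```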

@msvbg mentioned this pull request on Mar 19, 2024
@james7132 mentioned this pull request on Mar 21, 2024