Add SSE2/AVX2/WASM SIMD support #86

Merged: 13 commits merged into petgraph:master on Mar 19, 2024

Conversation

@james7132 (Collaborator) commented on Jan 9, 2023

Fixes #73. Changes Block from an alias to a platform/target-specific newtype around usize, __m128i, __m256i, or v128.

This supports all SIMD intrinsics that have been stabilized in the standard library.
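
For illustration, here is a minimal sketch (not the exact code in this PR) of how a cfg-gated newtype can pick the widest available register type for the compilation target and fall back to usize elsewhere:

```rust
// Illustrative only: a cfg-gated Block newtype selecting the widest
// stabilized SIMD type available on the compilation target.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Block(core::arch::x86_64::__m256i);

#[cfg(all(target_arch = "x86_64", not(target_feature = "avx2")))]
#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Block(core::arch::x86_64::__m128i); // SSE2 is baseline on x86_64

#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Block(core::arch::wasm32::v128);

#[cfg(not(any(
    target_arch = "x86_64",
    all(target_arch = "wasm32", target_feature = "simd128")
)))]
#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Block(usize);
```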

SSE2 is available on all x86_64 machines, so this should see a 4x speedup relative to the u32-based approach used before #74.

AVX2 is only available on roughly 89% of consumer machines, so it cannot be relied on unconditionally, but it should show another 2x speedup over SSE2. Those using this in a cloud or server environment will likely benefit from compiling with -C target-cpu=native, which enables AVX2 when the host machine supports it.

NOTE: This adds a lot of unsafe code, simply by nature of using SIMD intrinsics. There's also a fair amount of core::mem::transmute, though I've tried to keep it to a minimum.

Performance

group                           avx2                                   sse2                                   usize
-----                           ----                                   ----                                   -----
insert/1m                       1.00   1772.7±8.57µs        ? ?/sec    1.00  1777.2±13.75µs        ? ?/sec    1.00   1773.4±8.48µs        ? ?/sec
insert_range/1m                 1.00      2.7±0.02µs        ? ?/sec    1.98      5.3±0.04µs        ? ?/sec    3.96     10.6±0.09µs        ? ?/sec
iter_ones/all_ones              1.02    424.5±1.07µs        ? ?/sec    1.01    420.2±1.59µs        ? ?/sec    1.00    416.5±2.06µs        ? ?/sec
iter_ones/all_zeros             2.01     17.1±0.08µs        ? ?/sec    1.00      8.5±0.08µs        ? ?/sec    1.00      8.5±0.07µs        ? ?/sec
iter_ones/contains_all_ones     1.00   1629.8±8.58µs        ? ?/sec    1.00   1629.3±5.90µs        ? ?/sec    1.16  1897.4±26.96µs        ? ?/sec
iter_ones/contains_all_zeros    1.00   1629.4±4.83µs        ? ?/sec    1.00   1628.9±4.56µs        ? ?/sec    1.16  1897.1±28.11µs        ? ?/sec

Using the benchmarks ported to Criterion (see #84): on set and batch operations like insert_range, intersection_with, etc., these SIMD-accelerated versions are, as expected, 2-4 times faster than when using usize as the block, which should extend the performance gains of #74 even further.
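
For context on where the batch-operation speedup comes from: a single SSE2 instruction ORs 128 bits at once, i.e. two 64-bit words per step. A rough sketch (not taken from this PR):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128i, _mm_or_si128};

// Illustrative only: one SSE2 instruction combines 128 bits per step,
// versus 64 bits per step with a usize block.
#[cfg(target_arch = "x86_64")]
fn union_block(a: __m128i, b: __m128i) -> __m128i {
    // SAFETY: SSE2 is part of the x86_64 baseline, so the required
    // target feature is always available here.
    unsafe { _mm_or_si128(a, b) }
}
```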

I tested optional runtime feature detection via is_x86_feature_detected! for these operations, and unfortunately it causes serious regressions.
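
For reference, runtime dispatch would look roughly like the following (hypothetical helper names, not code from this PR). The branch on every call, and the inlining barrier it creates for such small, hot functions, is the kind of overhead that plausibly explains those regressions.

```rust
// Hypothetical sketch of runtime feature detection; not the approach
// that was merged.
#[cfg(target_arch = "x86_64")]
fn count_ones_dispatch(blocks: &[u64]) -> u32 {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: the avx2 feature was just verified at runtime.
        unsafe { count_ones_avx2(blocks) }
    } else {
        count_ones_scalar(blocks)
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_ones_avx2(blocks: &[u64]) -> u32 {
    // A real implementation would use AVX2 intrinsics here.
    count_ones_scalar(blocks)
}

fn count_ones_scalar(blocks: &[u64]) -> u32 {
    blocks.iter().map(|b| b.count_ones()).sum()
}
```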

@james7132 (Collaborator, Author):

Note: I think this currently breaks serde support.

@james7132 marked this pull request as draft on January 9, 2023
@james7132 marked this pull request as ready for review on March 19, 2024
@james7132 (Collaborator, Author) commented on Mar 19, 2024

Reran the benchmarks; this seems to be a significant gain on multiple fronts when the right CPU features are enabled at compile time:

group                           master                                 simd
-----                           ------                                 ----
clear/1m                        1.00   1153.7±8.18ns        ? ?/sec    1.00   1155.9±3.84ns        ? ?/sec
count_ones/1m                   1.00      2.5±0.01µs        ? ?/sec    1.91      4.7±0.02µs        ? ?/sec
difference_with/1m              1.28      2.7±0.00µs        ? ?/sec    1.00      2.1±0.00µs        ? ?/sec
grow_and_insert                 1.00      2.4±0.02ms        ? ?/sec    1.09      2.6±0.03ms        ? ?/sec
insert/1m                       1.00    962.1±9.87µs        ? ?/sec    1.00    958.7±5.74µs        ? ?/sec
insert_range/1m                 1.00   1148.3±4.07ns        ? ?/sec    1.18   1358.1±6.24ns        ? ?/sec
intersect_with/1m               1.31      2.8±0.01µs        ? ?/sec    1.00      2.1±0.00µs        ? ?/sec
iter_ones/all_ones              1.00      2.3±0.00ms        ? ?/sec    1.00      2.3±0.02ms        ? ?/sec
iter_ones/all_zeros             1.00      4.3±0.07µs        ? ?/sec    1.01      4.3±0.14µs        ? ?/sec
iter_ones/contains_all_ones     2.51   410.4±19.39µs        ? ?/sec    1.00    163.6±1.68µs        ? ?/sec
iter_ones/contains_all_zeros    2.49    407.3±4.52µs        ? ?/sec    1.00    163.4±0.14µs        ? ?/sec
iter_ones/sparse                1.37    288.4±3.85µs        ? ?/sec    1.00    210.5±0.68µs        ? ?/sec
symmetric_difference_with/1m    1.30      2.8±0.01µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec
union_with/1m                   1.31      2.8±0.02µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec

The Mask changes might need to be reverted though. The regressions in insert_range and count_ones are hard to swallow.

@james7132 merged commit d7ae91f into petgraph:master on Mar 19, 2024 (13 checks passed)
Review comment on lines +547 to +555:

// SAFETY: This is using the exact same allocation pattern, size, and capacity
// making this reconstruction of the Vec safe.
let mut data = unsafe {
    let mut data = ManuallyDrop::new(self.data);
    let ptr = data.as_mut_ptr().cast();
    let len = data.len() * SimdBlock::USIZE_COUNT;
    let capacity = data.capacity() * SimdBlock::USIZE_COUNT;
    Vec::from_raw_parts(ptr, len, capacity)
};

Contributor:

I don't think this is safe. The Vec is initially allocated with SimdBlocks, which can have a different alignment than usize. For example, the AVX2 block has an alignment of 32 and the SSE2 block has an alignment of 16, while usize only has an alignment of 8.
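
To make the concern concrete: Vec::from_raw_parts (and the eventual deallocation) must use the same layout the memory was originally allocated with, and the SIMD block types are more strictly aligned than usize. A small illustration, assuming a 64-bit target (not code from the PR):

```rust
use core::mem::align_of;

#[cfg(target_arch = "x86_64")]
fn alignment_mismatch() {
    use core::arch::x86_64::{__m128i, __m256i};
    // The original allocation was made for SIMD blocks...
    assert_eq!(align_of::<__m256i>(), 32);
    assert_eq!(align_of::<__m128i>(), 16);
    // ...but after the cast, Vec<usize> will deallocate assuming usize's
    // layout, which is undefined behavior if the alignments differ.
    assert_eq!(align_of::<usize>(), 8);
}
```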

@msvbg mentioned this pull request on Mar 19, 2024
@james7132 mentioned this pull request on Mar 21, 2024