AVX2 vectorization for very large bitsets #4422

AlexGuteniev · 2024-02-24T11:53:28Z

Not sure if it is worth merging, given the added complexity and the fact that the improvement is noticeable only for larger bitsets.

It turned out that the existing vectorization is fine for bitsets below 256 bits (32 bytes), where the AVX2 upgrade is harmful.
For very large bitsets the improvement is noticeable.

Results

Before:
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    211 ns          206 ns      2800000
BM_bitset_to_string<64, char>                   2662 ns         2668 ns       263529
BM_bitset_to_string<512, char>                  3730 ns         3767 ns       186667
BM_bitset_to_string_large_single<char>           230 ns          230 ns      2986667
BM_bitset_to_string<7, wchar_t>                  131 ns          131 ns      5600000
BM_bitset_to_string<64, wchar_t>                2488 ns         2511 ns       298667
BM_bitset_to_string<512, wchar_t>               4360 ns         4349 ns       154483
BM_bitset_to_string_large_single<wchar_t>        306 ns          300 ns      2240000

After:
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    224 ns          223 ns      2800000
BM_bitset_to_string<64, char>                   2696 ns         2668 ns       263529
BM_bitset_to_string<512, char>                  3313 ns         3278 ns       224000
BM_bitset_to_string_large_single<char>           157 ns          157 ns      4072727
BM_bitset_to_string<7, wchar_t>                  128 ns          126 ns      4977778
BM_bitset_to_string<64, wchar_t>                2421 ns         2400 ns       280000
BM_bitset_to_string<512, wchar_t>               3245 ns         3223 ns       203636
BM_bitset_to_string_large_single<wchar_t>        209 ns          209 ns      3446154

AlexGuteniev · 2024-02-26T10:18:53Z

stl/src/vector_algorithms.cpp

+            char _Tmp[32];
+            _mm256_storeu_si256(reinterpret_cast<__m256i*>(_Tmp), _Elems);
+            const char* const _Tmpd = _Tmp + (32 - _Size_bits);
+            memcpy(_Dest, _Tmpd, _Size_bits);


We could take advantage of at least the remainder of the 32-bit unit, which is available because bitset stores its bits in an array of units.
Here we can use the AVX2 masked store _mm256_maskstore_epi32, and a 4-byte memcpy above.
I'm not sure the gain is worth it, as the improvement only applies to the tail part.

Correction: this applies to the memcpy above only. To write more here we would have to over-reserve the string, though that also seems feasible.
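
The tail handling in the quoted diff can be illustrated without intrinsics. What follows is a minimal portable sketch, not the code from this PR: the helper names `write_tail` and `tail_to_string` are hypothetical, and a scalar loop stands in for the `_mm256_storeu_si256` store into `_Tmp`. It expands one 32-bit unit into a 32-character temporary, then copies only the last `size_bits` characters to the destination, as the `memcpy` from `_Tmpd` does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper (not part of the STL): writes the low `size_bits` bits
// of `unit` as '0'/'1' characters into dest, highest of those bits first.
void write_tail(char* dest, std::uint32_t unit, std::size_t size_bits) {
    assert(size_bits <= 32);
    char tmp[32];
    for (std::size_t i = 0; i < 32; ++i) {
        // tmp[0] holds bit 31, tmp[31] holds bit 0.
        tmp[i] = ((unit >> (31 - i)) & 1u) != 0 ? '1' : '0';
    }
    // Skip the leading characters that belong to bits outside the bitset.
    const char* const tmpd = tmp + (32 - size_bits);
    std::memcpy(dest, tmpd, size_bits);
}

// Hypothetical convenience wrapper that sizes the string exactly,
// so nothing is written past the end (no over-reserve needed).
std::string tail_to_string(std::uint32_t unit, std::size_t size_bits) {
    std::string s(size_bits, '\0');
    write_tail(&s[0], unit, size_bits);
    return s;
}
```

Because the full 32 characters go to a temporary first, the destination only ever receives `size_bits` characters. Writing directly to the destination with a wider or masked store would need the over-reserved string mentioned in the correction.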

AlexGuteniev · 2024-02-26T14:33:19Z

Created DevCom-10601346 based on suboptimal AVX2 codegen.
This change might give better results if the issue is fixed.

StephanTLavavej · 2024-02-29T01:33:31Z

I'm speculatively mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-02-29T21:59:24Z

Thanks for keeping those vector units busy! 📈 🎉 😹

AlexGuteniev added 2 commits February 24, 2024 13:44

initial AVX2 bitset::to_string implemetnation

d20f879

Another benchmark case

234abf9

AlexGuteniev requested a review from a team as a code owner February 24, 2024 11:53

github-actions bot added this to Initial Review in Code Reviews Feb 24, 2024

StephanTLavavej added the performance Must go faster label Feb 24, 2024

StephanTLavavej self-assigned this Feb 24, 2024

Code Reviews automation moved this from Initial Review to Work In Progress Feb 25, 2024

StephanTLavavej requested changes Feb 25, 2024

StephanTLavavej removed their assignment Feb 25, 2024

AlexGuteniev requested a review from StephanTLavavej February 25, 2024 08:01

StephanTLavavej self-assigned this Feb 25, 2024

StephanTLavavej moved this from Work In Progress to Initial Review in Code Reviews Feb 25, 2024

AlexGuteniev commented Feb 26, 2024

StephanTLavavej added 4 commits February 27, 2024 14:30

Merge branch 'main' into bitset

4214855

Extract test_randomized_bitset.

fa005b8

Note `bits < 64 && str.size() != N` preparing to handle values of N that aren't evenly divisible by 64. Note `b.template to_string<wchar_t>()` disambiguation.
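
The committed test is not shown on this page, so the following is only a sketch of what the note describes, under the assumption that the test builds a random string of `N` characters and round-trips it through `bitset<N>`. The names `check_randomized_bitset` and `check_range` are hypothetical. It shows the `bits < 64 && str.size() != N` loop condition, which stops mid-word when `N` is not a multiple of 64, and the `b.template to_string<wchar_t>()` disambiguation, which is needed because the type of `b` depends on `N`.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <random>
#include <string>
#include <utility>

// Hypothetical sketch, not the committed test_randomized_bitset.
template <std::size_t N>
bool check_randomized_bitset(std::mt19937_64& gen) {
    std::string str;
    std::wstring wstr;
    str.reserve(N);
    wstr.reserve(N);

    while (str.size() != N) {
        std::uint64_t random_value = gen();
        // Consume up to 64 random bits, but stop as soon as N characters
        // exist, so N does not have to be evenly divisible by 64.
        for (int bits = 0; bits < 64 && str.size() != N; ++bits) {
            const bool bit = (random_value & 1) != 0;
            str.push_back(bit ? '1' : '0');
            wstr.push_back(bit ? L'1' : L'0');
            random_value >>= 1;
        }
    }

    const std::bitset<N> b(str);
    // `template` tells the compiler that `<` starts a template argument list.
    return b.to_string() == str && b.template to_string<wchar_t>() == wstr;
}

// Checks every size in [Base, Base + sizeof...(Offsets)).
template <std::size_t Base, std::size_t... Offsets>
bool check_range(std::mt19937_64& gen, std::index_sequence<Offsets...>) {
    bool ok = true;
    (void) std::initializer_list<int>{
        (ok = check_randomized_bitset<Base + Offsets>(gen) && ok, 0)...};
    return ok;
}
```

With `Base` set to 507 and 42 offsets, `check_range` covers the sizes in [507, 549), the range named in the next commit.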

Test the range [507, 549), revealing failures!

d840d32

Fix damaged memcpy.

7eb4106

StephanTLavavej reviewed Feb 28, 2024

StephanTLavavej approved these changes Feb 28, 2024

StephanTLavavej removed their assignment Feb 28, 2024

StephanTLavavej moved this from Initial Review to Final Review in Code Reviews Feb 28, 2024

StephanTLavavej assigned CaseyCarter and StephanTLavavej Feb 28, 2024

StephanTLavavej unassigned CaseyCarter Feb 29, 2024

StephanTLavavej merged commit 8b081e2 into microsoft:main Feb 29, 2024
35 checks passed

Code Reviews automation moved this from Final Review to Done Feb 29, 2024

AlexGuteniev deleted the bitset branch February 29, 2024 22:04
