Vectorize `find_first_of` for 8 and 16 bit elements with SSE4.2 `pcmpestri` #4466

AlexGuteniev · 2024-03-10T14:38:14Z

This applies when the element set fits in one SSE register, that is, when the "needle" has at most 16 elements of 8 bits or at most 8 elements of 16 bits.
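As a rough illustration of the idea (not the code in this PR), the sketch below searches 8-bit elements with `_mm_cmpestri` in "equal any" mode. The helper name `find_first_of_u8` is hypothetical, and the SSE4.2 path is compiled only when `__SSE4_2__` is defined (for example with `-msse4.2` on GCC or Clang); otherwise the scalar loop is used. The 16-bit case would use `_SIDD_UWORD_OPS` and a limit of 8 elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE4_2__)
#include <nmmintrin.h> // SSE4.2 intrinsics, including _mm_cmpestri
#endif

// Hypothetical helper for illustration; not the STL's internal function.
// Returns the index of the first haystack byte equal to any needle byte,
// or haystack_len if there is none.
inline std::size_t find_first_of_u8(const std::uint8_t* haystack, std::size_t haystack_len,
                                    const std::uint8_t* needle, std::size_t needle_len) {
    assert(haystack != nullptr || haystack_len == 0);
    assert(needle != nullptr || needle_len == 0);
#if defined(__SSE4_2__)
    if (needle_len != 0 && needle_len <= 16) {
        // The whole needle fits in one 16-byte register.
        alignas(16) std::uint8_t set_bytes[16] = {};
        std::memcpy(set_bytes, needle, needle_len);
        const __m128i set = _mm_load_si128(reinterpret_cast<const __m128i*>(set_bytes));

        for (std::size_t pos = 0; pos < haystack_len;) {
            const std::size_t remaining = haystack_len - pos;
            const std::size_t chunk = remaining < 16 ? remaining : 16;
            // Copying each chunk keeps the sketch free of out-of-bounds reads;
            // a tuned implementation would load directly from the haystack.
            alignas(16) std::uint8_t data_bytes[16] = {};
            std::memcpy(data_bytes, haystack + pos, chunk);
            const __m128i data = _mm_load_si128(reinterpret_cast<const __m128i*>(data_bytes));
            // Index of the first valid data byte that equals any valid set byte, or 16 if none.
            const int idx = _mm_cmpestri(set, static_cast<int>(needle_len), data, static_cast<int>(chunk),
                                         _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT);
            if (idx < static_cast<int>(chunk)) {
                return pos + static_cast<std::size_t>(idx);
            }
            pos += chunk;
        }
        return haystack_len;
    }
#endif
    // Scalar fallback: compare every haystack byte against every needle byte.
    for (std::size_t i = 0; i < haystack_len; ++i) {
        for (std::size_t j = 0; j < needle_len; ++j) {
            if (haystack[i] == needle[j]) {
                return i;
            }
        }
    }
    return haystack_len;
}
```

Each 16-byte chunk of the haystack is compared against the whole needle in one instruction, which is why the needle has to fit in a single register.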

Possible future work:

- Use `find` instead for 1-element cases, as @Alcaro suggested.
- `basic_string::find_first_of`. Certainly not the general case with user-provided char traits, but maybe for standard cases.
- Expand to "needles" of more than 16 bytes. This would add an inner loop, making the algorithmic complexity quadratic and the implementation more complex, but it may be useful.

Benchmark result

The first non-type template parameter in the benchmark results is the position where the value is found, "haystack" length is twice that.
The second non-type template parameter in the benchmark results is the "needle" size.
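To make the two parameters concrete, here is a minimal sketch of how such an input could be laid out. The names `bench_input` and `make_input`, the filler value, and the needle values are assumptions for illustration; the actual benchmark source may build its data differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical input layout; filler and needle values are assumptions.
template <class T>
struct bench_input {
    std::vector<T> haystack;
    std::vector<T> needle;
};

template <class T>
bench_input<T> make_input(std::size_t pos, std::size_t needle_size) {
    assert(pos >= 1 && needle_size >= 1);
    bench_input<T> in;
    in.haystack.assign(pos * 2, T{0}); // "haystack" length is twice the match position
    in.needle.resize(needle_size);
    for (std::size_t i = 0; i < needle_size; ++i) {
        in.needle[i] = static_cast<T>(i + 1); // 1, 2, 3, ... never equals the filler 0
    }
    in.haystack[pos] = in.needle.back(); // the only match sits at index pos
    return in;
}
```

With this layout, `make_input<std::uint8_t>(22, 5)` would correspond to the `bm<uint8_t, 22, 5>` row: a 44-element haystack whose only match is at index 22.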

Before:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<uint8_t, 2, 3>            6.33 ns         6.42 ns    112000000
bm<uint16_t, 2, 3>           6.47 ns         6.45 ns     89600000
bm<uint8_t, 7, 4>            22.8 ns         22.9 ns     32000000
bm<uint16_t, 7, 4>           15.1 ns         15.3 ns     44800000
bm<uint8_t, 9, 3>            17.7 ns         17.6 ns     37333333
bm<uint16_t, 9, 3>           16.6 ns         16.7 ns     44800000
bm<uint8_t, 22, 5>           73.7 ns         73.2 ns      8960000
bm<uint16_t, 22, 5>          69.3 ns         69.8 ns     11200000
bm<uint8_t, 3056, 7>         7991 ns         8022 ns        89600
bm<uint16_t, 3056, 7>        7314 ns         7324 ns        89600
bm<uint8_t, 1011, 11>        3709 ns         3767 ns       186667
bm<uint16_t, 1011, 11>       3662 ns         3599 ns       186667

After:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<uint8_t, 2, 3>            6.22 ns         6.14 ns    112000000
bm<uint16_t, 2, 3>           6.95 ns         6.98 ns     89600000
bm<uint8_t, 7, 4>            23.9 ns         24.1 ns     29866667
bm<uint16_t, 7, 4>           23.1 ns         22.9 ns     32000000
bm<uint8_t, 9, 3>            13.1 ns         13.2 ns     49777778
bm<uint16_t, 9, 3>           14.1 ns         14.1 ns     49777778
bm<uint8_t, 22, 5>           14.3 ns         14.3 ns     44800000
bm<uint16_t, 22, 5>          15.0 ns         15.0 ns     44800000
bm<uint8_t, 3056, 7>          314 ns          314 ns      2240000
bm<uint16_t, 3056, 7>         611 ns          614 ns      1120000
bm<uint8_t, 1011, 11>         107 ns          107 ns      6400000
bm<uint16_t, 1011, 11>       3478 ns         3516 ns       213333

Explanation:

- Vectorization is engaged starting with the `bm<uint8_t, 9, 3>` row.
- `bm<uint8_t, 22, 5>` already shows a significant improvement.
- The exact amount of improvement is hard to pin down, because this compares an O(N**2) algorithm with an O(N) one; the longer the "needle", the larger the improvement.
- `bm<uint16_t, 1011, 11>` falls back to the scalar algorithm because the "needle" does not fit in an SSE register (11 elements of 16 bits are 22 bytes).
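The needle-size condition behind the last row can be checked with simple arithmetic. The sketch below uses a hypothetical `needle_fits_sse` helper (not a name from the STL) and covers only the needle size; it does not model any other condition the implementation may apply before choosing the vectorized path.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t sse_register_bytes = 16; // one 128-bit SSE register

// Hypothetical helper: does a needle of needle_count elements of type T fit in one register?
template <class T>
constexpr bool needle_fits_sse(std::size_t needle_count) {
    return needle_count * sizeof(T) <= sse_register_bytes;
}

static_assert(needle_fits_sse<std::uint8_t>(11), "11 bytes fit in 16");
static_assert(!needle_fits_sse<std::uint16_t>(11), "22 bytes do not fit in 16");
static_assert(needle_fits_sse<std::uint16_t>(7), "14 bytes fit in 16");
```

This matches the table: the `uint8_t` row with an 11-element needle speeds up, while the `uint16_t` row with the same element count stays at the scalar timing.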

Alcaro · 2024-03-19T21:55:58Z

As we discovered on Discord a while ago, some people call std::find_first_of with a single-element needle. This should be forwarded to std::find (or memchr or something), even if vectorization is otherwise disabled.

AlexGuteniev · 2024-03-19T22:02:20Z

> As we discovered on Discord a while ago, some people call std::find_first_of with a single-element needle. This should be forwarded to std::find (or memchr or something), even if vectorization is otherwise disabled.

I remember this, but it is unrelated enough that I think it fits better in a separate PR.
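For reference, the forwarding discussed in this exchange could look roughly like the sketch below. The wrapper name `find_first_of_fwd` is hypothetical and this is not part of the PR; it only shows that a one-element needle can be handed to `std::find` without changing the result.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical wrapper for illustration; not the STL's code.
template <class FwdIt1, class FwdIt2>
FwdIt1 find_first_of_fwd(FwdIt1 first1, FwdIt1 last1, FwdIt2 first2, FwdIt2 last2) {
    // A needle with exactly one element is a plain search for that element.
    if (first2 != last2 && std::next(first2) == last2) {
        return std::find(first1, last1, *first2);
    }
    return std::find_first_of(first1, last1, first2, last2);
}

// Usage: a one-element needle takes the std::find path.
inline void single_separator_example() {
    const std::string text = "key=value";
    const std::string sep = "=";
    const auto it = find_first_of_fwd(text.begin(), text.end(), sep.begin(), sep.end());
    assert(it - text.begin() == 3);
}
```

Both branches compare elements with `==`, so the wrapper returns the same iterator as `std::find_first_of` alone would.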

StephanTLavavej · 2024-03-21T07:45:59Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-03-21T21:49:27Z

Thanks for vectorizing more algorithms! 🚀 ⏩ 🎉

AlexGuteniev added 2 commits March 10, 2024 15:57

Vectorize find_first_of

2b28a09

align load

9a3cc9f

AlexGuteniev requested a review from a team as a code owner March 10, 2024 14:38

github-actions bot added this to Initial Review in Code Reviews Mar 10, 2024

AlexGuteniev added 3 commits March 10, 2024 16:44

format

8ed06e3

the fall back is unnecessary

a1241df

arm

2ee66d3

StephanTLavavej added the performance Must go faster label Mar 10, 2024

StephanTLavavej self-assigned this Mar 10, 2024

AlexGuteniev added 2 commits March 11, 2024 11:41

embed predicate check

09531ca

fix test

05f6312

StephanTLavavej requested changes Mar 20, 2024

StephanTLavavej removed their assignment Mar 20, 2024

AlexGuteniev added 8 commits March 20, 2024 09:13

constantness mix +coverage

f283dda

at() -> []

f83cfbb

Missing include

7e33224

Scope and not naming _Bytes

697bf8a

ranges + coverage

dae5a81

_Needle_length

b7d69f6

fix benchmark

f5b8f78

_INLINE_VAR

2c9979b

AlexGuteniev requested a review from StephanTLavavej March 20, 2024 09:36

StephanTLavavej self-assigned this Mar 20, 2024

AlexGuteniev and others added 3 commits March 21, 2024 08:51

must reuse!

1ab536c

Merge branch 'main' into first_of

1d4f78f

Code review feedback.

4785124

StephanTLavavej reviewed Mar 21, 2024

StephanTLavavej approved these changes Mar 21, 2024

StephanTLavavej merged commit 9d761bd into microsoft:main Mar 21, 2024
35 checks passed

AlexGuteniev deleted the first_of branch March 21, 2024 21:50
