Help the compiler vectorize `std::iota` #4627

AlexGuteniev · 2024-04-24T17:58:49Z

Compiler auto-vectorization of the `iota` algorithm exists, but it is very fragile; one of the issues is reported as DevCom-10593477.

I can do `ranges::iota` as well, if this is considered an acceptable approach.

Benchmark results are with the default benchmark options; I think that means SSE2, not AVX2.
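For readers following along, here is a minimal sketch of the general idea of helping an auto-vectorizer with `iota`. It is an illustration under assumptions, not the code from this PR: `iota_indexed` and `iota_indexed_example` are hypothetical names, the range is assumed to be contiguous, and the element count is assumed to be representable in `T`. Each element is computed from the start value and the index, so no iteration depends on the previous one, which tends to be a friendlier shape for vectorizers than incrementing a running value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration only; not the STL's implementation.
// Writes start, start + 1, ... into [first, first + count).
// Assumes `count` is representable in T, so static_cast<T>(i) preserves the value.
template <class T>
void iota_indexed(T* first, std::size_t count, T start) {
    for (std::size_t i = 0; i != count; ++i) {
        // Each element depends only on `start` and `i`, not on the previous iteration.
        first[i] = static_cast<T>(start + static_cast<T>(i));
    }
}

// Small usage example with an unsigned element type.
inline void iota_indexed_example() {
    std::vector<std::uint32_t> v(7);
    iota_indexed(v.data(), v.size(), std::uint32_t{10});
    assert(v.front() == 10u && v.back() == 16u);
}
```

Whether a given compiler actually vectorizes this loop depends on the compiler version and options, so it is worth checking the generated code or a benchmark rather than assuming.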

Before:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<std::uint32_t>/7          3.60 ns         1.36 ns    448000000
bm<std::uint32_t>/18         7.26 ns         3.61 ns    264084211
bm<std::uint32_t>/43         17.4 ns         10.3 ns     74666667
bm<std::uint32_t>/131        50.8 ns         17.1 ns     50176000
bm<std::uint32_t>/315         105 ns         50.0 ns     10000000
bm<std::uint32_t>/1212        393 ns          155 ns      3733333
bm<std::uint64_t>/7          2.88 ns         1.32 ns    640000000
bm<std::uint64_t>/18         4.98 ns         2.59 ns    320000000
bm<std::uint64_t>/43         16.5 ns         8.12 ns    100000000
bm<std::uint64_t>/131        49.0 ns         24.3 ns     37333333
bm<std::uint64_t>/315        91.9 ns         31.5 ns     20363636
bm<std::uint64_t>/1212        308 ns          147 ns     10000000

After:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<std::uint32_t>/7          3.69 ns         2.96 ns    560000000
bm<std::uint32_t>/18         3.65 ns         1.75 ns    454623280
bm<std::uint32_t>/43         7.19 ns         3.40 ns    179200000
bm<std::uint32_t>/131        18.6 ns         10.5 ns     89600000
bm<std::uint32_t>/315        44.2 ns         20.1 ns     28000000
bm<std::uint32_t>/1212        176 ns         82.3 ns     11200000
bm<std::uint64_t>/7          2.60 ns         1.28 ns   1000000000
bm<std::uint64_t>/18         4.53 ns         2.43 ns    560000000
bm<std::uint64_t>/43         8.84 ns         4.81 ns    194782609
bm<std::uint64_t>/131        26.6 ns         11.4 ns     66901333
bm<std::uint64_t>/315        66.3 ns         35.6 ns     26352941
bm<std::uint64_t>/1212        247 ns          107 ns      4977778

AlexGuteniev · 2024-04-25T07:03:03Z

I admit I can still vectorize this better than the compiler.
Specifically, I could handle the 8-bit and 16-bit cases, and handle the 32-bit and 64-bit cases better. I've created DevCom-10646935 for that.

StephanTLavavej · 2024-04-26T01:08:20Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

AlexGuteniev · 2024-04-26T05:53:47Z

I pushed changes because I thought the following case was broken.

_Ty = int
size_t = unsigned int
_Val = -1000
(_Last - _First) = 0x8000'0001

We probably need test coverage for this.
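To make the case above concrete, here is a small sketch under assumptions: `std::uint32_t` stands in for a 32-bit `size_t`, `std::int32_t` stands in for `int`, and `fits_in_int32` is a hypothetical helper, not something from the STL. It shows that a size of `0x8000'0001` is not representable as a 32-bit signed value, so converting the element index to `_Ty` would not be value-preserving for this input.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: can a 32-bit unsigned size be converted to a
// 32-bit signed int without changing its value?
constexpr bool fits_in_int32(std::uint32_t size) noexcept {
    return size <= static_cast<std::uint32_t>(std::numeric_limits<std::int32_t>::max());
}

// Runtime check of the boundary and of the size from the case above.
inline void fits_in_int32_example() {
    assert(fits_in_int32(0x7FFF'FFFFu));   // largest value that fits
    assert(!fits_in_int32(0x8000'0000u));  // one past the signed maximum
    assert(!fits_in_int32(0x8000'0001u));  // the size from the case above
}
```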

StephanTLavavej · 2024-04-26T08:47:03Z

Good catch! After talking on Discord, I've replaced the increasingly complicated helper function with C++20 `in_range`, suitably uglified to `_In_range` for downlevel usage. There have been other places in the STL where I've wanted to use this but it wasn't available, so I think it'll be very worthwhile to extract it. (`in_range` is very tricky to implement, and we managed to do so both correctly and very elegantly in terms of performance.)
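As an illustration of why an `in_range`-style check is subtle, here is a minimal C++17 sketch. It is not the STL's `_In_range`: `in_range_sketch` and `in_range_sketch_example` are hypothetical names, the sketch uses `std::numeric_limits` for brevity, and it does not reject `bool` or character types the way the standard `std::in_range` does. The point is that mixed-signedness comparisons have to be split into cases, because a plain `<=` between a signed and an unsigned operand converts the signed one first.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical sketch: is `value` representable in the integer type R?
template <class R, class T>
constexpr bool in_range_sketch(const T value) noexcept {
    static_assert(std::is_integral_v<R> && std::is_integral_v<T>, "integer types only");
    constexpr R r_max = std::numeric_limits<R>::max();

    if constexpr (std::is_signed_v<T> == std::is_signed_v<R>) {
        // Same signedness: the usual arithmetic conversions preserve both values.
        constexpr R r_min = std::numeric_limits<R>::min();
        return r_min <= value && value <= r_max;
    } else if constexpr (std::is_signed_v<T>) {
        // Signed source, unsigned target: reject negatives, then compare as unsigned.
        return value >= 0 && static_cast<std::make_unsigned_t<T>>(value) <= r_max;
    } else {
        // Unsigned source, signed target: compare against the target maximum as unsigned.
        return value <= static_cast<std::make_unsigned_t<R>>(r_max);
    }
}

inline void in_range_sketch_example() {
    // 0x8000'0001 as a 32-bit unsigned size does not fit in a 32-bit signed int.
    assert(!in_range_sketch<std::int32_t>(std::uint32_t{0x8000'0001u}));
    // It does fit in a 64-bit signed integer.
    assert(in_range_sketch<std::int64_t>(std::uint32_t{0x8000'0001u}));
}
```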

Allocating 2 gigs on a 32-bit system would be problematic, so I think that the current level of test coverage should be fine.

I reran the benchmarks to verify that the optimization is still effective:

| Benchmark (ns) | main | Now |
|---|---|---|
| `bm<std::uint32_t>/7` | 2.75 | 4.45 |
| `bm<std::uint32_t>/18` | 5.35 | 3.39 |
| `bm<std::uint32_t>/43` | 10.9 | 6.70 |
| `bm<std::uint32_t>/131` | 33.0 | 15.0 |
| `bm<std::uint32_t>/315` | 71.5 | 39.1 |
| `bm<std::uint32_t>/1212` | 259 | 142 |
| `bm<std::uint64_t>/7` | 2.75 | 2.75 |
| `bm<std::uint64_t>/18` | 5.41 | 3.84 |
| `bm<std::uint64_t>/43` | 10.9 | 7.22 |
| `bm<std::uint64_t>/131` | 33.1 | 22.6 |
| `bm<std::uint64_t>/315` | 71.7 | 53.6 |
| `bm<std::uint64_t>/1212` | 259 | 203 |

StephanTLavavej · 2024-04-27T00:10:50Z

🔢 🔢 🔢

optimize std::iota

e8bc5ba

AlexGuteniev requested a review from a team as a code owner April 24, 2024 17:58

StephanTLavavej added the performance Must go faster label Apr 24, 2024

StephanTLavavej requested changes Apr 24, 2024

no numeric limits!

d7c3481

AlexGuteniev requested a review from StephanTLavavej April 24, 2024 19:51

StephanTLavavej self-assigned this Apr 24, 2024

StephanTLavavej added 3 commits April 25, 2024 11:40

Include more headers.

5d963b2

Qualify std::size_t, use auto to avoid repeating it.

ece7526

Flip compile-time test to mirror run-time test.

172d0e9

StephanTLavavej reviewed Apr 25, 2024

StephanTLavavej approved these changes Apr 25, 2024

StephanTLavavej assigned StephanTLavavej and unassigned StephanTLavavej Apr 25, 2024

AlexGuteniev added 2 commits April 26, 2024 08:27

ADL safety for consistency

16b0363

test range of signed too

ad54d4f

StephanTLavavej added 4 commits April 26, 2024 00:57

in_range extraction, part 1: Internal _Min_limit/_Max_limit don't need to be C++20 `consteval`.

b399a0a

They're always stored in `constexpr` variables.

in_range extraction, part 2: Unconditionally provide _Ugly functions.

8c315b4

The `_Ugly` functions aren't marked `_EXPORT_STD`.

Cleanup: Simplify _Is_standard_integer.

0b258ce

This matches how we implement `_Is_standard_unsigned_integer` in `<__msvc_bit_utils.hpp>`.

Simplify optimization with _In_range<_Ty>(_Size).

23d696a

This answers the question of whether `static_cast<_Ty>(_Size)` is value-preserving.

StephanTLavavej approved these changes Apr 26, 2024

Use _STL_INTERNAL_STATIC_ASSERT instead of C++17 terse `static_assert` in `_Min_limit`/`_Max_limit`.

12325cc

These are internal helpers, and the "public" `_In_range` validates the user-provided types.

StephanTLavavej approved these changes Apr 26, 2024

StephanTLavavej merged commit 290a95b into microsoft:main Apr 27, 2024
39 checks passed

AlexGuteniev deleted the one_iota branch April 27, 2024 04:40

AlexGuteniev added a commit to AlexGuteniev/STL that referenced this pull request Apr 27, 2024

Avoid including <limits> to improve throughput

5b7d561

Based on the microsoft#4627 fixup that extracts `_In_range`. This also makes some places use `_In_range` directly, but mostly `_Max_limit` is used.

AlexGuteniev mentioned this pull request Apr 27, 2024

Refactor <limits> usage #4634

Merged

AlexGuteniev added a commit to AlexGuteniev/STL that referenced this pull request May 2, 2024

Help the compiler vectorize ranges::iota

e8b62e8

Ranges version of microsoft#4627

This was referenced May 2, 2024

Help the compiler vectorize ranges::iota #4647

Merged

Auto-vectorize count_if, count #4653

Closed
