Implement vectorized min_ / max_element for ints #2447

Merged: 66 commits into microsoft:main on Jun 19, 2022

@AlexGuteniev AlexGuteniev commented Dec 26, 2021

📝 Summary

SSE4.1 implementation of min_element / max_element / minmax_element for signed and unsigned integers of sizes 1, 2, 4, and 8 bytes.

Resolves #2438

The algorithm is more complex than the existing vector algorithms; I'm not sure whether this level of complexity is acceptable.
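
To illustrate where the extra complexity comes from, here is a minimal scalar sketch of the bookkeeping an iterator-returning search needs: each lane tracks a position alongside its running minimum, and the final reduction has to break ties by position so that the first minimum wins. The names `kWidth` and `lane_min_index` are hypothetical, and the inner loop only emulates SIMD lanes; this is not the code in `vector_algorithms.cpp`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical lane count: four 32-bit lanes, as in one 128-bit register.
constexpr std::size_t kWidth = 4;

// Hypothetical helper: index of the first minimum. Precondition: n > 0.
std::size_t lane_min_index(const std::uint32_t* data, std::size_t n) {
    std::array<std::uint32_t, kWidth> best_val;
    std::array<std::size_t, kWidth> best_idx;
    best_val.fill(data[0]); // (data[0], 0) is a valid candidate for every lane
    best_idx.fill(0);

    std::size_t i = 0;
    for (; i + kWidth <= n; i += kWidth) {
        for (std::size_t j = 0; j < kWidth; ++j) {
            // Strict '<' keeps the earliest position seen within a lane.
            if (data[i + j] < best_val[j]) {
                best_val[j] = data[i + j];
                best_idx[j] = i + j;
            }
        }
    }

    // Reduction across lanes: smallest value first, then smallest position.
    std::uint32_t val = best_val[0];
    std::size_t idx = best_idx[0];
    for (std::size_t j = 1; j < kWidth; ++j) {
        if (best_val[j] < val || (best_val[j] == val && best_idx[j] < idx)) {
            val = best_val[j];
            idx = best_idx[j];
        }
    }

    // Scalar tail; these positions come later, so only a strictly smaller value wins.
    for (; i < n; ++i) {
        if (data[i] < val) {
            val = data[i];
            idx = i;
        }
    }
    return idx;
}
```

The tie-break in the reduction is what the value-only algorithms can skip: equal minima in different lanes refer to different positions, and only the smallest position matches what `std::min_element` returns.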

🧭Further directions (suggested next PRs)

  • Add AVX2: twice as fast, twice the bloat.
  • ranges::min, ranges::max, ranges::minmax: since these don't need to return iterators, the algorithm will be simpler and faster, consisting only of a vertical min/max and one final reduction.
  • Add the same for floating-point types.
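
As a rough sketch of the "vertical min plus one reduction" shape mentioned for the value-only algorithms, the following portable code keeps a fixed number of independent running minima and reduces them once at the end. The names `kLanes` and `lane_min` are hypothetical, and the inner loop is a scalar stand-in for a single vector min instruction; it assumes a non-empty input.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical lane count: four 32-bit lanes, as in one 128-bit register.
constexpr std::size_t kLanes = 4;

// Hypothetical helper: value-only minimum. Precondition: n > 0.
std::int32_t lane_min(const std::int32_t* data, std::size_t n) {
    std::array<std::int32_t, kLanes> acc;
    acc.fill(data[0]); // any element of the input is a safe starting value

    std::size_t i = 0;
    // "Vertical" step: lane j only ever sees elements at positions i + j.
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t j = 0; j < kLanes; ++j) {
            acc[j] = std::min(acc[j], data[i + j]);
        }
    }

    // One "horizontal" reduction across the lanes, done after the main loop.
    std::int32_t result = *std::min_element(acc.begin(), acc.end());

    // Scalar tail for the leftover elements that do not fill a whole vector.
    for (; i < n; ++i) {
        result = std::min(result, data[i]);
    }
    return result;
}
```

No positions are tracked anywhere, which is why this variant should be both simpler and faster than the iterator-returning one.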

🏁 Perf benchmark

Benchmark
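#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iomanip>   // std::setw
#include <iostream>
#include <iterator>  // std::begin
#include <random>
#include <ranges>
#include <tuple>     // std::tie
#include <intrin.h>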

enum class Kind {
    Min,
    Max,
    Minmax,
};

void* volatile discard1;
void* volatile discard2;

template<typename T>
void benchmark_find(T* a, std::size_t max, Kind kind, size_t rep) {
    auto t1 = std::chrono::steady_clock::now();

    switch (kind)
    {
    case Kind::Min:
        for (std::size_t s = 0; s < rep; s++) {
            discard1 = std::min_element(a, a + max);
        }
        break;
    case Kind::Max:
        for (std::size_t s = 0; s < rep; s++) {
            discard2 = std::max_element(a, a + max);
        }
        break;
    case Kind::Minmax:
        for (std::size_t s = 0; s < rep; s++) {
            std::tie(discard1, discard2) = std::minmax_element(a, a + max);
        }
        break;
    }

    auto t2 = std::chrono::steady_clock::now();

    const char* op_str = nullptr;
    switch (kind)
    {
    case Kind::Min:
        op_str = "min";
        break;
    case Kind::Max:
        op_str = "max";
        break;
    case Kind::Minmax:
        op_str = "minmax";
        break;
    }
    std::cout << std::setw(10) << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() << "s  -- "
        << "Op " << op_str << " Size " << sizeof(T) << " byte elements, array size " << max << "; " << rep << " repetitions \n";
}


constexpr std::size_t Nmax = 8192 - 1;

alignas(64) std::uint8_t    a8[Nmax];
alignas(64) std::uint16_t   a16[Nmax];
alignas(64) std::uint32_t   a32[Nmax];
alignas(64) std::uint64_t   a64[Nmax];


extern "C" long __isa_enabled;

int main()
{
    std::mt19937 gen(84710);
    std::uniform_int_distribution<int> dis(1, 20);
    std::generate(std::begin(a8), std::end(a8), [&] { return dis(gen); });
    std::generate(std::begin(a16), std::end(a16), [&] { return dis(gen); });
    std::generate(std::begin(a32), std::end(a32), [&] { return dis(gen); });
    std::generate(std::begin(a64), std::end(a64), [&] { return dis(gen); });

    std::cout << "Vector alg used: " << _USE_STD_VECTOR_ALGORITHMS << "\n";

    benchmark_find(a8, Nmax, Kind::Minmax, 100000);
    benchmark_find(a16, Nmax, Kind::Minmax, 100000);
    benchmark_find(a32, Nmax, Kind::Minmax, 100000);
    benchmark_find(a64, Nmax, Kind::Minmax, 100000);

    std::cout << '\n';

    benchmark_find(a8, Nmax, Kind::Min, 100000);
    benchmark_find(a16, Nmax, Kind::Min, 100000);
    benchmark_find(a32, Nmax, Kind::Min, 100000);
    benchmark_find(a64, Nmax, Kind::Min, 100000);

    std::cout << '\n';

    benchmark_find(a8, Nmax, Kind::Max, 100000);
    benchmark_find(a16, Nmax, Kind::Max, 100000);
    benchmark_find(a32, Nmax, Kind::Max, 100000);
    benchmark_find(a64, Nmax, Kind::Max, 100000);

    std::cout << "Done\n";

    return 0;
}
Current benchmark results
**********************************************************************
** Visual Studio 2022 Developer Command Prompt v17.1.0-pre.1.1
** Copyright (c) 2021 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'

C:\Program Files\Microsoft Visual Studio\2022\Preview>cd/d C:\Project\vector_find_benchmark

C:\Project\vector_find_benchmark>set INCLUDE=C:\Project\STL\out\build\x64\out\inc;%INCLUDE%

C:\Project\vector_find_benchmark>set LIB=C:\Project\STL\out\build\x64\out\lib\amd64;%LIB%

C:\Project\vector_find_benchmark>set PATH=C:\Project\STL\out\build\x64\out\bin\amd64;%PATH%

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp

C:\Project\vector_find_benchmark>vector_find_benchmark.exe
Vector alg used: 0
  0.740433s  -- Op minmax Size 1 byte elements, array size 8191; 100000 repetitions
    0.7414s  -- Op minmax Size 2 byte elements, array size 8191; 100000 repetitions
  0.739575s  -- Op minmax Size 4 byte elements, array size 8191; 100000 repetitions
   0.74237s  -- Op minmax Size 8 byte elements, array size 8191; 100000 repetitions

   1.47435s  -- Op min Size 1 byte elements, array size 8191; 100000 repetitions
   1.47671s  -- Op min Size 2 byte elements, array size 8191; 100000 repetitions
   1.47945s  -- Op min Size 4 byte elements, array size 8191; 100000 repetitions
   1.47856s  -- Op min Size 8 byte elements, array size 8191; 100000 repetitions

    1.4785s  -- Op max Size 1 byte elements, array size 8191; 100000 repetitions
   1.47736s  -- Op max Size 2 byte elements, array size 8191; 100000 repetitions
    1.4765s  -- Op max Size 4 byte elements, array size 8191; 100000 repetitions
   1.47861s  -- Op max Size 8 byte elements, array size 8191; 100000 repetitions
Done

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=1 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp

C:\Project\vector_find_benchmark>vector_find_benchmark.exe
Vector alg used: 1
  0.071214s  -- Op minmax Size 1 byte elements, array size 8191; 100000 repetitions
  0.134923s  -- Op minmax Size 2 byte elements, array size 8191; 100000 repetitions
  0.217484s  -- Op minmax Size 4 byte elements, array size 8191; 100000 repetitions
  0.513987s  -- Op minmax Size 8 byte elements, array size 8191; 100000 repetitions

 0.0567489s  -- Op min Size 1 byte elements, array size 8191; 100000 repetitions
   0.13397s  -- Op min Size 2 byte elements, array size 8191; 100000 repetitions
  0.160763s  -- Op min Size 4 byte elements, array size 8191; 100000 repetitions
  0.480221s  -- Op min Size 8 byte elements, array size 8191; 100000 repetitions

 0.0559508s  -- Op max Size 1 byte elements, array size 8191; 100000 repetitions
  0.134017s  -- Op max Size 2 byte elements, array size 8191; 100000 repetitions
  0.161021s  -- Op max Size 4 byte elements, array size 8191; 100000 repetitions
  0.491206s  -- Op max Size 8 byte elements, array size 8191; 100000 repetitions
Done

Results table
| Operation | Element size | Before | After |
|---|---|---|---|
| minmax | 1 byte | 0.74361s | 0.071214s |
| minmax | 2 bytes | 0.746617s | 0.134923s |
| minmax | 4 bytes | 0.742254s | 0.217484s |
| minmax | 8 bytes | 0.750876s | 0.513987s |
| min | 1 byte | 1.47848s | 0.0567489s |
| min | 2 bytes | 1.4882s | 0.13397s |
| min | 4 bytes | 1.47821s | 0.160763s |
| min | 8 bytes | 1.47858s | 0.480221s |
| max | 1 byte | 1.48783s | 0.0559508s |
| max | 2 bytes | 1.48865s | 0.134017s |
| max | 4 bytes | 1.48718s | 0.161021s |
| max | 8 bytes | 1.48245s | 0.491206s |
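
To read the table as speedup factors, a small helper along these lines can be used. The timings are copied from the results table; `Row`, `kRows`, `speedup`, and `print_speedups` are names made up for this sketch.

```cpp
#include <cstdio>

// Hypothetical record type for one row of the results table.
struct Row {
    const char* label;
    double before_s; // seconds with _USE_STD_VECTOR_ALGORITHMS=0
    double after_s;  // seconds with _USE_STD_VECTOR_ALGORITHMS=1
};

constexpr Row kRows[] = {
    {"minmax 1 byte", 0.74361, 0.071214},
    {"minmax 2 bytes", 0.746617, 0.134923},
    {"minmax 4 bytes", 0.742254, 0.217484},
    {"minmax 8 bytes", 0.750876, 0.513987},
    {"min 1 byte", 1.47848, 0.0567489},
    {"min 2 bytes", 1.4882, 0.13397},
    {"min 4 bytes", 1.47821, 0.160763},
    {"min 8 bytes", 1.47858, 0.480221},
    {"max 1 byte", 1.48783, 0.0559508},
    {"max 2 bytes", 1.48865, 0.134017},
    {"max 4 bytes", 1.48718, 0.161021},
    {"max 8 bytes", 1.48245, 0.491206},
};

// Speedup factor: how many times faster the vectorized run is.
constexpr double speedup(const Row& r) {
    return r.before_s / r.after_s;
}

// Prints one line per table row with its speedup factor.
void print_speedups() {
    for (const Row& r : kRows) {
        std::printf("%-16s %6.2fx\n", r.label, speedup(r));
    }
}
```

By these numbers the gain shrinks as the element size grows: about 26x for 1-byte min, but only about 1.5x for 8-byte minmax.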

⚖️ Size impact

The change adds more code.
DLLs and their PDBs are not affected; static libraries and import libraries are.
The impact is negligible for the static libs but noticeable for the import libs.

Table
| File name | Size before (bytes) | Size after (bytes) |
|---|---|---|
| libcpmt.lib | 35,023,488 | 35,271,438 |
| libcpmt1.lib | 35,802,670 | 36,050,304 |
| libcpmtd.lib | 36,120,578 | 36,357,874 |
| libcpmtd0.lib | 34,999,810 | 35,236,962 |
| libcpmtd1.lib | 35,874,386 | 36,111,490 |
| msvcprt.lib | 1,141,058 | 1,388,864 |
| msvcprtd.lib | 1,138,636 | 1,375,852 |
| stl_asan.lib | 3,014 | 3,014 |
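
The "negligible vs. noticeable" distinction can be checked with a little arithmetic on the table. In the sketch below the sizes are copied from the table and assumed to be in bytes; `LibSize`, `kLibs`, `growth_bytes`, and `growth_percent` are hypothetical names.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical record type for one row of the size table (sizes in bytes).
struct LibSize {
    const char* name;
    std::int64_t before;
    std::int64_t after;
};

constexpr LibSize kLibs[] = {
    {"libcpmt.lib", 35023488, 35271438},
    {"libcpmt1.lib", 35802670, 36050304},
    {"libcpmtd.lib", 36120578, 36357874},
    {"libcpmtd0.lib", 34999810, 35236962},
    {"libcpmtd1.lib", 35874386, 36111490},
    {"msvcprt.lib", 1141058, 1388864},
    {"msvcprtd.lib", 1138636, 1375852},
    {"stl_asan.lib", 3014, 3014},
};

// Absolute growth in bytes.
constexpr std::int64_t growth_bytes(const LibSize& l) {
    return l.after - l.before;
}

// Growth relative to the original size, in percent.
constexpr double growth_percent(const LibSize& l) {
    return 100.0 * static_cast<double>(l.after - l.before) / static_cast<double>(l.before);
}

// Prints the absolute and relative growth for every library.
void print_growth() {
    for (const LibSize& l : kLibs) {
        std::printf("%-14s +%lld bytes (%.2f%%)\n", l.name,
                    static_cast<long long>(growth_bytes(l)), growth_percent(l));
    }
}
```

Every affected library grows by roughly the same absolute amount, a bit under 250,000 bytes. That is under 1% of the roughly 35 MB static libraries, but about 21% to 22% of the roughly 1.1 MB import libraries.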

✔️ Test coverage

  • Expand VSO_0000000_vector_algorithms to test newly vectorized cases with various sizes
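
For readers who want to reproduce this kind of check outside the test suite, one way to do it is to compare the library result against a plain loop for every length up to some bound, so that full vectors, partial tails, and empty ranges are all hit. This is only a sketch and not the contents of the VSO_0000000_vector_algorithms test; `naive_min`, `naive_max`, and `matches_naive` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Reference: first minimum, found with a plain loop.
template <class T>
const T* naive_min(const T* first, const T* last) {
    if (first == last) return last;
    const T* best = first;
    for (const T* p = first + 1; p != last; ++p) {
        if (*p < *best) best = p;
    }
    return best;
}

// Reference: first maximum, found with a plain loop.
template <class T>
const T* naive_max(const T* first, const T* last) {
    if (first == last) return last;
    const T* best = first;
    for (const T* p = first + 1; p != last; ++p) {
        if (*best < *p) best = p;
    }
    return best;
}

// Checks std::min_element / std::max_element against the references
// for every length in [0, max_len].
template <class T>
bool matches_naive(std::size_t max_len, unsigned seed) {
    std::mt19937 gen(seed);
    // Narrow range so duplicates are common and "first of equals" matters;
    // negative values wrap to large values for unsigned T.
    std::uniform_int_distribution<int> dis(-10, 10);
    std::vector<T> v;
    for (std::size_t len = 0; len <= max_len; ++len) {
        v.resize(len);
        std::generate(v.begin(), v.end(), [&] { return static_cast<T>(dis(gen)); });
        const T* first = v.data();
        const T* last = first + len;
        if (std::min_element(first, last) != naive_min(first, last)) return false;
        if (std::max_element(first, last) != naive_max(first, last)) return false;
    }
    return true;
}
```

Calling `matches_naive<std::uint8_t>(100, 1u)` and the same for the other signed and unsigned widths covers lengths on both sides of several multiples of the vector width.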

Resolves microsoft#2438

TODO:
 * Test coverage
 * Attach minmax_element
 * Add AVX2 version of the same

----
<detail>
<summary><b>Benchmark</b></summary>

```C++
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdlib>   // abort
#include <iomanip>   // std::setw
#include <iostream>
#include <ranges>
#include <intrin.h>

enum class Kind {
    Min,
    Max,
};

template<typename T>
void benchmark_find(T* a, std::size_t max, size_t start, size_t pos, Kind kind, size_t rep) {
    std::fill_n(a, max, '0');
    if (pos < max && pos >= start) {
        if (kind == Kind::Min) {
            a[pos] = '*';
        }
        else {
            a[pos] = '1';
        }
    }

    auto t1 = std::chrono::steady_clock::now();

    switch (kind)
    {
    case Kind::Min:
        for (std::size_t s = 0; s < rep; s++) {
            if (std::min_element(a + start, a + max) != a + pos) {
                abort();
            }
        }
        break;
    case Kind::Max:
        for (std::size_t s = 0; s < rep; s++) {
            if (std::max_element(a + start, a + max) != a + pos) {
                abort();
            }
        }
        break;
    }

    auto t2 = std::chrono::steady_clock::now();

    const char* op_str = nullptr;
    switch (kind)
    {
    case Kind::Min:
        op_str = "min";
        break;
    case Kind::Max:
        op_str = "max";
        break;
    }
    std::cout << std::setw(10) << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() << "s  -- "
        << "Op " << op_str << " Size " << sizeof(T) << " byte elements, array size " << max
        << " starting at " << start << " found at " << pos << "; " << rep << " repetitions \n";
}

constexpr std::size_t Nmax = 8192;

alignas(64) std::uint8_t    a8[Nmax];
alignas(64) std::uint16_t   a16[Nmax];
alignas(64) std::uint32_t   a32[Nmax];
alignas(64) std::uint64_t   a64[Nmax];

extern "C" long __isa_enabled;

int main()
{
    std::cout << "Vector alg used: " << _USE_STD_VECTOR_ALGORITHMS << "\n";

    benchmark_find(a8, Nmax, 0, 3459, Kind::Min, 100000);
    benchmark_find(a16, Nmax, 0, 3459, Kind::Min, 100000);
    benchmark_find(a32, Nmax, 0, 3459, Kind::Min, 100000);
    benchmark_find(a64, Nmax, 0, 3459, Kind::Min, 100000);

    benchmark_find(a8, Nmax, 0, 3459, Kind::Max, 100000);
    benchmark_find(a16, Nmax, 0, 3459, Kind::Max, 100000);
    benchmark_find(a32, Nmax, 0, 3459, Kind::Max, 100000);
    benchmark_find(a64, Nmax, 0, 3459, Kind::Max, 100000);

    std::cout << "Done\n";

    return 0;
}
```

<detail>
<summary><b>Current benchmark results</b></summary>

```
**********************************************************************
** Visual Studio 2022 Developer Command Prompt v17.1.0-pre.1.1
** Copyright (c) 2021 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'

C:\Program Files\Microsoft Visual Studio\2022\Preview>cd/d C:\Project\vector_find_benchmark

C:\Project\vector_find_benchmark>set INCLUDE=C:\Project\STL\out\build\x64\out\inc;%INCLUDE%

C:\Project\vector_find_benchmark>set LIB=C:\Project\STL\out\build\x64\out\lib\amd64;%LIB%

C:\Project\vector_find_benchmark>set PATH=C:\Project\STL\out\build\x64\out\bin\amd64;%PATH%

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp
vector_find_benchmark.cpp(1): warning C4005: '_USE_STD_VECTOR_ALGORITHMS': macro redefinition
vector_find_benchmark.cpp: note: see previous definition of '_USE_STD_VECTOR_ALGORITHMS'

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp

C:\Project\vector_find_benchmark>vector_find_benchmark.exe
Vector alg used: 0
   1.48497s  -- Op min Size 1 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
   1.48125s  -- Op min Size 2 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
   1.47988s  -- Op min Size 4 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
   1.48431s  -- Op min Size 8 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=1 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp

C:\Project\vector_find_benchmark>vector_find_benchmark.exe
Vector alg used: 1
 0.0559598s  -- Op min Size 1 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
 0.0681002s  -- Op min Size 2 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
  0.159074s  -- Op min Size 4 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
  0.597614s  -- Op min Size 8 byte elements, array size 8192 starting at 0 found at 3459; 100000 repetitions
```
</detail>
@AlexGuteniev AlexGuteniev marked this pull request as ready for review December 27, 2021 12:03
@AlexGuteniev AlexGuteniev requested a review from a team as a code owner December 27, 2021 12:03
@barcharcraz barcharcraz added this to @barcharcraz in Maintainer Priorities Jun 14, 2022
@barcharcraz barcharcraz moved this from Initial Review to Final Review in Code Reviews Jun 15, 2022
@StephanTLavavej StephanTLavavej mentioned this pull request Jun 18, 2022

StephanTLavavej commented Jun 18, 2022

@AlexGuteniev Thanks, this looks good - another amazing speedup! I pushed changes for the issues I noticed (FYI @barcharcraz in case you want to double-check). Edit: Also renamed "cor" to "correction".

I would like to know what the "TRANSITION, 17.3 Preview 2" comments are referring to, but that doesn't have to block this PR from merging. Edit: Found it, pushing changes.

I am also somewhat curious whether CUDA 11.6 has opened up new possibilities, but the PR should be fine as-is. Edit: Removed CUDA workarounds.

@StephanTLavavej StephanTLavavej removed their assignment Jun 18, 2022
@StephanTLavavej StephanTLavavej moved this from Final Review to Ready To Merge in Code Reviews Jun 18, 2022
We believe that CUDA 11.6 supports `__builtin_is_constant_evaluated`.
@StephanTLavavej StephanTLavavej self-assigned this Jun 19, 2022
@StephanTLavavej

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej

We need to restore the warning workarounds because the internal build is still using a slightly older compiler without the fix. I've opted to make them uncommented perma-workarounds.

@StephanTLavavej StephanTLavavej merged commit 78cd436 into microsoft:main Jun 19, 2022
Code Reviews automation moved this from Ready To Merge to Done Jun 19, 2022
Maintainer Priorities automation moved this from @barcharcraz to Done Jun 19, 2022
@StephanTLavavej

Thanks for minimizing the time and maximizing the speed of these algorithms! 🚀 🎉 😻

Labels
performance Must go faster
Development

Successfully merging this pull request may close these issues.

<algorithm>: vectorize min_element / max_element using SSE4.1/AVX2 for integers
5 participants