Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make bitset use popcount with ISA detection out of loop #2201

Merged
merged 12 commits into from
Sep 25, 2021

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Sep 12, 2021

Resolve #667

Moving dispatch out of loop does not rely on compiler optimization.

Rewrite of #2126 PR. The original PR had an issue that it relied on compiler optimization to move the check out of loop, which did not happen always.

To make sure the machinery is kept in <limits>, the callback is used. To make sure the functions are always inlined into the callback, they are wrapped into lambdas.

Additionally, as @statementreply suggested, made 32-bit version of fallback using 32-bit types, and for 64-bit types making work by parts.

Benchmark
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

#include <bit>
#include <chrono>
#include <iostream>
#include <random>
#include <limits>

template<std::size_t Words>
struct bitset {

    void populate() {
        std::mt19937 mt(43223);
        std::uniform_int_distribution<Ty> dis(std::numeric_limits<Ty>::min(), std::numeric_limits<Ty>::max());
        for (std::size_t Wpos = 0; Wpos <= Words; ++Wpos) {
            Array[Wpos] = dis(mt);
        }
    }

    using Ty = std::uint64_t;

    size_t count_old() const noexcept {
        const char* const _Bitsperbyte =
            "\0\1\1\2\1\2\2\3\1\2\2\3\2\3\3\4"
            "\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
            "\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
            "\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
            "\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
            "\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
            "\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
            "\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
            "\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
            "\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
            "\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
            "\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
            "\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
            "\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
            "\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
            "\4\5\5\6\5\6\6\7\5\6\6\7\6\7\7\x8";
        const unsigned char* _Ptr = &reinterpret_cast<const unsigned char&>(Array);
        const unsigned char* const _End = _Ptr + sizeof(Array);
        size_t _Val = 0;
        for (; _Ptr != _End; ++_Ptr) {
            _Val += _Bitsperbyte[*_Ptr];
        }

        return _Val;
    }

    size_t count_new() const noexcept {
        return std::_Select_popcount_impl<Ty>([this](auto _Popcount_impl) {
            constexpr auto _Count = Words;
            size_t _Val = 0;
            for (size_t _Wpos = 0; _Wpos <= _Count; ++_Wpos) {
                _Val += _Popcount_impl(Array[_Wpos]);
            }
            return _Val;
            });
    }

    Ty Array[Words + 1];
};

volatile std::size_t result;

int main()
{
    const int N = 20000;
    bitset<10'000> bs;
    bs.populate();

    std::chrono::steady_clock::duration d1{};
    std::chrono::steady_clock::duration d2{};
    std::chrono::steady_clock::duration d3{};
    std::chrono::steady_clock::time_point t;

    t = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        result = bs.count_old();
    }
    d1 = std::chrono::steady_clock::now() - t;

    t = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        result = bs.count_new();
    }
    d2 = std::chrono::steady_clock::now() - t;

    std::__isa_available = 0;
    t = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        result = bs.count_new();
    }
    d3 = std::chrono::steady_clock::now() - t;

    std::cout << "old\t" << d1 << "\n";
    std::cout << "new\t" << d2 << "\n";
    std::cout << "new fb\t" << d3 << "\n";
    return 0;
}

Results:
(fb stands for fallback implementation when __isa_available is set to zero)
x64

old     538852500ns
new     154080600ns
new fb  311499000ns

x86

old     1245970000ns
new     153716400ns
new fb  480384500ns

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner September 12, 2021 15:14
@CaseyCarter CaseyCarter added the performance Must go faster label Sep 13, 2021
Copy link
Member

@CaseyCarter CaseyCarter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nits - this mostly LGTM.

stl/inc/limits Outdated Show resolved Hide resolved
stl/inc/limits Outdated Show resolved Hide resolved
stl/inc/limits Outdated Show resolved Hide resolved
Copy link
Member

@barcharcraz barcharcraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like the structure of this code, but, after much headscratching I can't think of a better way.

So, approved

stl/inc/limits Show resolved Hide resolved
@StephanTLavavej StephanTLavavej self-assigned this Sep 23, 2021
@StephanTLavavej
Copy link
Member

I'm mirroring this to an MSVC-internal PR. Please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 5ebbfd4 into microsoft:main Sep 25, 2021
@StephanTLavavej
Copy link
Member

Thanks for investigating this and finding a way to significantly improve bitset performance! 🚀 🪐 😻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

<bitset>: count() should use the same approach as std::popcount
4 participants