Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<random>: Implement Lemire's fast integer generation #3012

Merged
merged 12 commits into from
Sep 22, 2022

Conversation

MattStephanson
Copy link
Contributor

@MattStephanson MattStephanson commented Aug 9, 2022

Implements @lemire's "Fast Random Integer Generation in an Interval", https://dl.acm.org/doi/10.1145/3230636 and https://arxiv.org/abs/1805.10941. Fixes #178.

I'm not happy with the x86 or LCG performance, but I've been tinkering with it for weeks and haven't been able to improve it further. I'm using a Surface Pro 8, i5-1135G7. It's plugged in and set to "Best Performance", but I'm otherwise not very knowledgeable about how to run good microbenchmarks. If anyone has any thoughts, I'd love to hear them.

Benchmark code
#include <random>
#include <benchmark/benchmark.h>

/// Test URBGs alone

static void BM_mt19937(benchmark::State& state) {
    std::mt19937 gen;
    for (auto _ : state) {
        benchmark::DoNotOptimize(gen());
    }
}
BENCHMARK(BM_mt19937);

static void BM_mt19937_64(benchmark::State& state) {
    std::mt19937_64 gen;
    for (auto _ : state) {
        benchmark::DoNotOptimize(gen());
    }
}
BENCHMARK(BM_mt19937_64);

static void BM_lcg(benchmark::State& state) {
    std::minstd_rand gen;
    for (auto _ : state) {
        benchmark::DoNotOptimize(gen());
    }
}
BENCHMARK(BM_lcg);

uint32_t GetMax() {
    std::random_device gen;
    std::uniform_int_distribution<uint32_t> dist(10'000'000, 20'000'000);
    return dist(gen);
}

static const uint32_t max = GetMax(); // random divisor to prevent strength reduction

/// Test mt19937

static void BM_raw_mt19937_old(benchmark::State& state) {
    std::mt19937 gen;
    std::_Rng_from_urng<uint32_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_mt19937_old);

static void BM_raw_mt19937_new(benchmark::State& state) {
    std::mt19937 gen;
    std::_Rng_from_urng_v2<uint32_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_mt19937_new);

/// Test mt19937_64

static void BM_raw_mt19937_64_old(benchmark::State& state) {
    std::mt19937_64 gen;
    std::_Rng_from_urng<uint64_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_mt19937_64_old);

static void BM_raw_mt19937_64_new(benchmark::State& state) {
    std::mt19937_64 gen;
    std::_Rng_from_urng_v2<uint64_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_mt19937_64_new);

/// Test minstd_rand

static void BM_raw_lcg_old(benchmark::State& state) {
    std::minstd_rand gen;
    std::_Rng_from_urng<uint32_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_lcg_old);

static void BM_raw_lcg_new(benchmark::State& state) {
    std::minstd_rand gen;
    std::_Rng_from_urng_v2<uint32_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_lcg_new);

BENCHMARK_MAIN();
Benchmark results

x86

2022-08-08T19:53:31-07:00
Running C:\Users\steph\source\repos\sandbox\Release\sandbox.exe
Run on (8 X 2424.25 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 8192 KiB (x1)
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_mt19937                  4.38 ns         4.39 ns    160000000
BM_mt19937_64               9.79 ns         9.77 ns     64000000
BM_lcg                      9.39 ns         8.54 ns     64000000
BM_raw_mt19937_old          7.75 ns         7.67 ns    112000000
BM_raw_mt19937_new          5.18 ns         5.16 ns    100000000
BM_raw_mt19937_64_old       21.2 ns         21.0 ns     32000000
BM_raw_mt19937_64_new       19.0 ns         18.8 ns     37333333
BM_raw_lcg_old              25.9 ns         26.1 ns     26352941
BM_raw_lcg_new              28.2 ns         28.3 ns     24888889

x64

2022-08-08T19:54:41-07:00
Running C:\Users\steph\source\repos\sandbox\x64\Release\sandbox.exe
Run on (8 X 2444.76 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 8192 KiB (x1)
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_mt19937                  3.77 ns         3.75 ns    179200000
BM_mt19937_64               3.87 ns         3.84 ns    179200000
BM_lcg                      3.96 ns         4.01 ns    179200000
BM_raw_mt19937_old          5.70 ns         5.72 ns    112000000
BM_raw_mt19937_new          4.20 ns         4.24 ns    165925926
BM_raw_mt19937_64_old       8.50 ns         8.58 ns     74666667
BM_raw_mt19937_64_new       4.64 ns         4.50 ns    149333333
BM_raw_lcg_old              15.2 ns         15.4 ns     49777778
BM_raw_lcg_new              17.3 ns         17.3 ns     40727273

@MattStephanson MattStephanson changed the title Lemire's fast integer generation <random>: Implement Lemire's fast integer generation Aug 9, 2022
@CaseyCarter CaseyCarter added the performance Must go faster label Aug 9, 2022
@lemire
Copy link

lemire commented Aug 9, 2022

These results are interesting...

BM_raw_mt19937_64_old       8.50 ns         8.58 ns     74666667
BM_raw_mt19937_64_new       4.64 ns         4.50 ns    149333333

Copy link
Contributor

@frederick-vs-ja frederick-vs-ja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that this is a great step towards DevCom-879048.

stl/inc/__msvc_int128.hpp Outdated Show resolved Hide resolved
stl/inc/__msvc_int128.hpp Outdated Show resolved Hide resolved
@statementreply
Copy link
Contributor

I added xoshiro256** and xoshiro128** for comparison, which are fast and have small states. Benchmark results on MattStephanson@0156b8d (reformatted):

AMD Ryzen 7 5700X, x86
2022-08-15T09:43:59+08:00
Running benchmark_x86.exe
Run on (16 X 3394 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 32768 KiB (x1)
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_mt19937                    3.42 ns         3.38 ns    203636364
BM_mt19937_64                 9.46 ns         9.59 ns     89600000
BM_lcg                        5.52 ns         5.58 ns    112000000
BM_xoshiro256xx               5.85 ns         5.86 ns    112000000
BM_xoshiro128xx               1.35 ns         1.35 ns    497777778
BM_raw_mt19937_old            4.74 ns         4.71 ns    149333333
BM_raw_mt19937_new            3.67 ns         3.57 ns    179200000
BM_raw_mt19937_64_old        13.6  ns        13.8  ns     49777778
BM_raw_mt19937_64_new        17.8  ns        18.0  ns     40727273
BM_raw_lcg_old               17.5  ns        17.6  ns     40727273
BM_raw_lcg_new               20.5  ns        20.4  ns     34461538
BM_raw_xoshiro256xx_old      11.4  ns        11.7  ns     64000000
BM_raw_xoshiro256xx_new      13.8  ns        14.0  ns     56000000
BM_raw_xoshiro128xx_old       3.88 ns         3.92 ns    179200000
BM_raw_xoshiro128xx_new       2.29 ns         2.29 ns    320000000
AMD Ryzen 7 5700X, x64
2022-08-15T09:41:08+08:00
Running benchmark_x64
Run on (16 X 3394 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 32768 KiB (x1)
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_mt19937                    2.58 ns         2.62 ns    280000000
BM_mt19937_64                 2.86 ns         2.83 ns    248888889
BM_lcg                        3.23 ns         3.22 ns    213333333
BM_xoshiro256xx               1.29 ns         1.28 ns    560000000
BM_xoshiro128xx               1.08 ns         1.07 ns    640000000
BM_raw_mt19937_old            4.99 ns         5.00 ns    100000000
BM_raw_mt19937_new            3.90 ns         3.92 ns    179200000
BM_raw_mt19937_64_old         5.56 ns         5.58 ns    112000000
BM_raw_mt19937_64_new         4.12 ns         4.17 ns    172307692
BM_raw_lcg_old               11.4  ns        11.2  ns     56000000
BM_raw_lcg_new               11.8  ns        11.7  ns     56000000
BM_raw_xoshiro256xx_old       4.53 ns         4.55 ns    154482759
BM_raw_xoshiro256xx_new       1.73 ns         1.73 ns    407272727
BM_raw_xoshiro128xx_old       3.88 ns         3.84 ns    179200000
BM_raw_xoshiro128xx_new       1.51 ns         1.50 ns    407272727
Intel Core i5-8400, x86
2022-08-13T21:48:46+08:00
Running benchmark_x86
Run on (6 X 2808 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_mt19937                    4.61 ns         4.55 ns    154482759
BM_mt19937_64                11.8  ns        11.7  ns     56000000
BM_lcg                        8.50 ns         8.58 ns     74666667
BM_xoshiro256xx               6.01 ns         6.00 ns    112000000
BM_xoshiro128xx               2.12 ns         2.13 ns    344615385
BM_raw_mt19937_old           10.4  ns        10.3  ns     74666667
BM_raw_mt19937_new            5.13 ns         5.02 ns    112000000
BM_raw_mt19937_64_old        25.8  ns        25.5  ns     26352941
BM_raw_mt19937_64_new        21.6  ns        21.5  ns     32000000
BM_raw_lcg_old               31.8  ns        30.0  ns     21333333
BM_raw_lcg_new               34.0  ns        33.7  ns     21333333
BM_raw_xoshiro256xx_old      21.4  ns        21.5  ns     32000000
BM_raw_xoshiro256xx_new      18.5  ns        18.8  ns     40727273
BM_raw_xoshiro128xx_old       7.99 ns         8.02 ns     89600000
BM_raw_xoshiro128xx_new       3.65 ns         3.69 ns    194782609
Intel Core i5-8400, x64
2022-08-13T21:49:07+08:00
Running benchmark_x64
Run on (6 X 2808 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_mt19937                    3.50 ns         3.53 ns    203636364
BM_mt19937_64                 3.53 ns         3.53 ns    194782609
BM_lcg                        4.03 ns         3.99 ns    172307692
BM_xoshiro256xx               1.60 ns         1.53 ns    407272727
BM_xoshiro128xx               1.46 ns         1.46 ns    448000000
BM_raw_mt19937_old           11.0  ns        11.0  ns     64000000
BM_raw_mt19937_new            4.90 ns         4.87 ns    144516129
BM_raw_mt19937_64_old        24.9  ns        25.1  ns     28000000
BM_raw_mt19937_64_new         5.15 ns         5.16 ns    100000000
BM_raw_lcg_old               20.1  ns        19.9  ns     34461538
BM_raw_lcg_new               15.2  ns        15.0  ns     44800000
BM_raw_xoshiro256xx_old      21.2  ns        21.3  ns     34461538
BM_raw_xoshiro256xx_new       2.54 ns         2.57 ns    280000000
BM_raw_xoshiro128xx_old       7.75 ns         7.67 ns     89600000
BM_raw_xoshiro128xx_new       2.00 ns         2.01 ns    373333333

@frederick-vs-ja

This comment was marked as resolved.

Copy link
Member

@StephanTLavavej StephanTLavavej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this approach looks good to me! (I skipped the int128 changes as I assume we'll want to land @frederick-vs-ja's #3036 first.)

The benchmark results look convincing enough to me, especially @statementreply's cases. 😻

stl/inc/xutility Outdated Show resolved Hide resolved
tests/std/tests/GH_000178_uniform_int/test.cpp Outdated Show resolved Hide resolved
stl/inc/xutility Outdated Show resolved Hide resolved
stl/inc/xutility Outdated Show resolved Hide resolved
tests/std/tests/GH_000178_uniform_int/test.cpp Outdated Show resolved Hide resolved
stl/inc/xutility Outdated Show resolved Hide resolved
tests/std/tests/GH_000178_uniform_int/test.cpp Outdated Show resolved Hide resolved
tests/std/tests/GH_000178_uniform_int/test.cpp Outdated Show resolved Hide resolved
tests/std/tests/GH_000178_uniform_int/test.cpp Outdated Show resolved Hide resolved
tests/std/tests/GH_000178_uniform_int/test.cpp Outdated Show resolved Hide resolved
@MattStephanson
Copy link
Contributor Author

Some of @StephanTLavavej's feedback affects codegen, so here are my updated benchmark results. I think they're pretty similar to what I originally posted.

x86

2022-08-18T20:20:48-07:00
Running C:\Users\steph\source\repos\sandbox\Release\sandbox.exe
Run on (8 X 2433.76 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 8192 KiB (x1)
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_mt19937                  3.80 ns         3.85 ns    186666667
BM_mt19937_64               9.52 ns         9.42 ns     74666667
BM_lcg                      8.44 ns         8.54 ns     89600000
BM_raw_mt19937_old          5.90 ns         5.86 ns    112000000
BM_raw_mt19937_new          4.45 ns         4.52 ns    165925926
BM_raw_mt19937_64_old       17.7 ns         17.3 ns     40727273
BM_raw_mt19937_64_new       17.1 ns         17.3 ns     40727273
BM_raw_lcg_old              26.4 ns         26.7 ns     26352941
BM_raw_lcg_new              27.9 ns         27.9 ns     26352941

x64

2022-08-18T20:22:16-07:00
Running C:\Users\steph\source\repos\sandbox\x64\Release\sandbox.exe
Run on (8 X 2437.08 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 8192 KiB (x1)
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_mt19937                  3.76 ns         3.77 ns    186666667
BM_mt19937_64               3.82 ns         3.85 ns    186666667
BM_lcg                      3.91 ns         3.92 ns    179200000
BM_raw_mt19937_old          5.60 ns         5.58 ns    112000000
BM_raw_mt19937_new          4.18 ns         4.17 ns    172307692
BM_raw_mt19937_64_old       8.35 ns         8.37 ns     89600000
BM_raw_mt19937_64_new       4.55 ns         4.50 ns    149333333
BM_raw_lcg_old              15.0 ns         15.0 ns     44800000
BM_raw_lcg_new              17.0 ns         16.9 ns     40727273

@MattStephanson MattStephanson marked this pull request as ready for review August 19, 2022 05:29
@MattStephanson MattStephanson requested a review from a team as a code owner August 19, 2022 05:29
@strega-nil-ms strega-nil-ms added the blocked Something is preventing work on this label Aug 23, 2022
@StephanTLavavej

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej removed the blocked Something is preventing work on this label Sep 1, 2022
@StephanTLavavej

This comment was marked as resolved.

@MattStephanson

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej self-assigned this Sep 3, 2022
@StephanTLavavej

This comment was marked as outdated.

@StephanTLavavej StephanTLavavej removed their assignment Sep 3, 2022
@StephanTLavavej
Copy link
Member

I see, a damaged commit was amended, but the history wasn't further rewritten.

@StephanTLavavej
Copy link
Member

@strega-nil-ms I pushed minor changes to the benchmark after validating that it still builds and runs properly. Thanks for adding it!

stl/inc/xutility Show resolved Hide resolved
stl/inc/xutility Show resolved Hide resolved
stl/inc/xutility Outdated Show resolved Hide resolved
Copy link
Contributor

@strega-nil-ms strega-nil-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much!!!
akko having a face like 'ahhhh thanks so much'

Copy link
Member

@barcharcraz barcharcraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems correct to me, as far as my understanding of goes. I think this is ready.

@StephanTLavavej
Copy link
Member

StephanTLavavej commented Sep 20, 2022

I've pushed a merge with main to resolve a trivial merge conflict - _Rng_from_urng_v2 was being added immediately above declarations where I added extern "C++" for modules.

✅ I double-checked that this PR should have no impact on modules.

@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 8334fca into microsoft:main Sep 22, 2022
@StephanTLavavej
Copy link
Member

Thank you for improving the performance of <random>'s most popular distribution! Also thanks to @lemire for inventing the algorithm, @chris0x44 for the original PR, and @strega-nil-ms for adding your benchmark to the growing collection! 😻 🎉 🚀

This will be available in either VS 2022 17.5 Preview 1 or Preview 2 (depending on internal merge logistics; the Changelog will record our current expectation).

CaseyCarter pushed a commit to CaseyCarter/STL that referenced this pull request Oct 6, 2022
Co-authored-by: Nicole Mazzuca <mazzucan@outlook.com>
Co-authored-by: Stephan T. Lavavej <stl@nuwen.net>
@MattStephanson MattStephanson deleted the gh_178_uniform_int branch May 10, 2023 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

<random>: Optimize uniform_int_distribution with multiply-shift
9 participants