`<random>`: Implement Lemire's fast integer generation #3012

MattStephanson · 2022-08-09T03:20:13Z

Implements @lemire's "Fast Random Integer Generation in an Interval", https://dl.acm.org/doi/10.1145/3230636 and https://arxiv.org/abs/1805.10941. Fixes #178.

I'm not happy with the x86 or LCG performance, but I've been tinkering with it for weeks and haven't been able to improve it further. I'm using a Surface Pro 8, i5-1135G7. It's plugged in and set to "Best Performance", but I'm otherwise not very knowledgeable about how to run good microbenchmarks. If anyone has any thoughts, I'd love to hear them.
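
For readers unfamiliar with the technique, here is a minimal sketch of the 32-bit case of the algorithm described in the paper. It is not the code in this PR: `lemire_bounded` is a hypothetical name, and the sketch assumes a generator that yields full 32-bit values (`min() == 0`, `max() == 2^32 - 1`), such as `std::mt19937`.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Sketch only: returns a uniformly distributed value in [0, range), range > 0.
// Assumes Urbg produces full 32-bit outputs, as std::mt19937 does.
template <class Urbg>
std::uint32_t lemire_bounded(Urbg& gen, std::uint32_t range) {
    static_assert(Urbg::min() == 0 && Urbg::max() == 0xFFFFFFFFu,
                  "sketch assumes a full 32-bit generator");
    assert(range != 0);

    // Multiply a 32-bit random value by the range; the high 32 bits of the
    // 64-bit product are the candidate result, the low 32 bits decide rejection.
    std::uint64_t product = static_cast<std::uint64_t>(gen()) * range;
    std::uint32_t low     = static_cast<std::uint32_t>(product);

    if (low < range) {
        // Slow path, taken rarely: compute 2^32 mod range with a single division.
        const std::uint32_t threshold = (0u - range) % range;
        while (low < threshold) {
            product = static_cast<std::uint64_t>(gen()) * range;
            low     = static_cast<std::uint32_t>(product);
        }
    }
    return static_cast<std::uint32_t>(product >> 32);
}
```

The common path needs only one multiplication and one comparison; the modulo is reached only when the low half of the product falls below `range`. The 64-bit case needs a 128-bit product instead, which is likely why the `mt19937_64` numbers differ so much between x86 and x64.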

Benchmark code

#include <cstdint>
#include <random>
#include <benchmark/benchmark.h>

/// Test URBGs alone

static void BM_mt19937(benchmark::State& state) {
    std::mt19937 gen;
    for (auto _ : state) {
        benchmark::DoNotOptimize(gen());
    }
}
BENCHMARK(BM_mt19937);

static void BM_mt19937_64(benchmark::State& state) {
    std::mt19937_64 gen;
    for (auto _ : state) {
        benchmark::DoNotOptimize(gen());
    }
}
BENCHMARK(BM_mt19937_64);

static void BM_lcg(benchmark::State& state) {
    std::minstd_rand gen;
    for (auto _ : state) {
        benchmark::DoNotOptimize(gen());
    }
}
BENCHMARK(BM_lcg);

uint32_t GetMax() {
    std::random_device gen;
    std::uniform_int_distribution<uint32_t> dist(10'000'000, 20'000'000);
    return dist(gen);
}

static const uint32_t max = GetMax(); // random divisor to prevent strength reduction

/// Test mt19937

static void BM_raw_mt19937_old(benchmark::State& state) {
    std::mt19937 gen;
    std::_Rng_from_urng<uint32_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_mt19937_old);

static void BM_raw_mt19937_new(benchmark::State& state) {
    std::mt19937 gen;
    std::_Rng_from_urng_v2<uint32_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_mt19937_new);

/// Test mt19937_64

static void BM_raw_mt19937_64_old(benchmark::State& state) {
    std::mt19937_64 gen;
    std::_Rng_from_urng<uint64_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_mt19937_64_old);

static void BM_raw_mt19937_64_new(benchmark::State& state) {
    std::mt19937_64 gen;
    std::_Rng_from_urng_v2<uint64_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_mt19937_64_new);

/// Test minstd_rand

static void BM_raw_lcg_old(benchmark::State& state) {
    std::minstd_rand gen;
    std::_Rng_from_urng<uint32_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_lcg_old);

static void BM_raw_lcg_new(benchmark::State& state) {
    std::minstd_rand gen;
    std::_Rng_from_urng_v2<uint32_t, decltype(gen)> rng(gen);
    for (auto _ : state) {
        benchmark::DoNotOptimize(rng(max));
    }
}
BENCHMARK(BM_raw_lcg_new);

BENCHMARK_MAIN();

Benchmark results

x86

2022-08-08T19:53:31-07:00
Running C:\Users\steph\source\repos\sandbox\Release\sandbox.exe
Run on (8 X 2424.25 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 8192 KiB (x1)
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_mt19937                  4.38 ns         4.39 ns    160000000
BM_mt19937_64               9.79 ns         9.77 ns     64000000
BM_lcg                      9.39 ns         8.54 ns     64000000
BM_raw_mt19937_old          7.75 ns         7.67 ns    112000000
BM_raw_mt19937_new          5.18 ns         5.16 ns    100000000
BM_raw_mt19937_64_old       21.2 ns         21.0 ns     32000000
BM_raw_mt19937_64_new       19.0 ns         18.8 ns     37333333
BM_raw_lcg_old              25.9 ns         26.1 ns     26352941
BM_raw_lcg_new              28.2 ns         28.3 ns     24888889

x64

2022-08-08T19:54:41-07:00
Running C:\Users\steph\source\repos\sandbox\x64\Release\sandbox.exe
Run on (8 X 2444.76 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 8192 KiB (x1)
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_mt19937                  3.77 ns         3.75 ns    179200000
BM_mt19937_64               3.87 ns         3.84 ns    179200000
BM_lcg                      3.96 ns         4.01 ns    179200000
BM_raw_mt19937_old          5.70 ns         5.72 ns    112000000
BM_raw_mt19937_new          4.20 ns         4.24 ns    165925926
BM_raw_mt19937_64_old       8.50 ns         8.58 ns     74666667
BM_raw_mt19937_64_new       4.64 ns         4.50 ns    149333333
BM_raw_lcg_old              15.2 ns         15.4 ns     49777778
BM_raw_lcg_new              17.3 ns         17.3 ns     40727273

lemire · 2022-08-09T03:50:12Z

These results are interesting...

BM_raw_mt19937_64_old       8.50 ns         8.58 ns     74666667
BM_raw_mt19937_64_new       4.64 ns         4.50 ns    149333333

frederick-vs-ja

I suppose that this is a great step towards DevCom-879048.

stl/inc/__msvc_int128.hpp

statementreply · 2022-08-15T02:01:43Z

I added xoshiro256** and xoshiro128** for comparison, which are fast and have small states. Benchmark results on MattStephanson@0156b8d (reformatted):

AMD Ryzen 7 5700X, x86

2022-08-15T09:43:59+08:00
Running benchmark_x86.exe
Run on (16 X 3394 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 32768 KiB (x1)
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_mt19937                    3.42 ns         3.38 ns    203636364
BM_mt19937_64                 9.46 ns         9.59 ns     89600000
BM_lcg                        5.52 ns         5.58 ns    112000000
BM_xoshiro256xx               5.85 ns         5.86 ns    112000000
BM_xoshiro128xx               1.35 ns         1.35 ns    497777778
BM_raw_mt19937_old            4.74 ns         4.71 ns    149333333
BM_raw_mt19937_new            3.67 ns         3.57 ns    179200000
BM_raw_mt19937_64_old        13.6  ns        13.8  ns     49777778
BM_raw_mt19937_64_new        17.8  ns        18.0  ns     40727273
BM_raw_lcg_old               17.5  ns        17.6  ns     40727273
BM_raw_lcg_new               20.5  ns        20.4  ns     34461538
BM_raw_xoshiro256xx_old      11.4  ns        11.7  ns     64000000
BM_raw_xoshiro256xx_new      13.8  ns        14.0  ns     56000000
BM_raw_xoshiro128xx_old       3.88 ns         3.92 ns    179200000
BM_raw_xoshiro128xx_new       2.29 ns         2.29 ns    320000000

AMD Ryzen 7 5700X, x64

2022-08-15T09:41:08+08:00
Running benchmark_x64
Run on (16 X 3394 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 32768 KiB (x1)
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_mt19937                    2.58 ns         2.62 ns    280000000
BM_mt19937_64                 2.86 ns         2.83 ns    248888889
BM_lcg                        3.23 ns         3.22 ns    213333333
BM_xoshiro256xx               1.29 ns         1.28 ns    560000000
BM_xoshiro128xx               1.08 ns         1.07 ns    640000000
BM_raw_mt19937_old            4.99 ns         5.00 ns    100000000
BM_raw_mt19937_new            3.90 ns         3.92 ns    179200000
BM_raw_mt19937_64_old         5.56 ns         5.58 ns    112000000
BM_raw_mt19937_64_new         4.12 ns         4.17 ns    172307692
BM_raw_lcg_old               11.4  ns        11.2  ns     56000000
BM_raw_lcg_new               11.8  ns        11.7  ns     56000000
BM_raw_xoshiro256xx_old       4.53 ns         4.55 ns    154482759
BM_raw_xoshiro256xx_new       1.73 ns         1.73 ns    407272727
BM_raw_xoshiro128xx_old       3.88 ns         3.84 ns    179200000
BM_raw_xoshiro128xx_new       1.51 ns         1.50 ns    407272727

Intel Core i5-8400, x86

2022-08-13T21:48:46+08:00
Running benchmark_x86
Run on (6 X 2808 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_mt19937                    4.61 ns         4.55 ns    154482759
BM_mt19937_64                11.8  ns        11.7  ns     56000000
BM_lcg                        8.50 ns         8.58 ns     74666667
BM_xoshiro256xx               6.01 ns         6.00 ns    112000000
BM_xoshiro128xx               2.12 ns         2.13 ns    344615385
BM_raw_mt19937_old           10.4  ns        10.3  ns     74666667
BM_raw_mt19937_new            5.13 ns         5.02 ns    112000000
BM_raw_mt19937_64_old        25.8  ns        25.5  ns     26352941
BM_raw_mt19937_64_new        21.6  ns        21.5  ns     32000000
BM_raw_lcg_old               31.8  ns        30.0  ns     21333333
BM_raw_lcg_new               34.0  ns        33.7  ns     21333333
BM_raw_xoshiro256xx_old      21.4  ns        21.5  ns     32000000
BM_raw_xoshiro256xx_new      18.5  ns        18.8  ns     40727273
BM_raw_xoshiro128xx_old       7.99 ns         8.02 ns     89600000
BM_raw_xoshiro128xx_new       3.65 ns         3.69 ns    194782609

Intel Core i5-8400, x64

2022-08-13T21:49:07+08:00
Running benchmark_x64
Run on (6 X 2808 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_mt19937                    3.50 ns         3.53 ns    203636364
BM_mt19937_64                 3.53 ns         3.53 ns    194782609
BM_lcg                        4.03 ns         3.99 ns    172307692
BM_xoshiro256xx               1.60 ns         1.53 ns    407272727
BM_xoshiro128xx               1.46 ns         1.46 ns    448000000
BM_raw_mt19937_old           11.0  ns        11.0  ns     64000000
BM_raw_mt19937_new            4.90 ns         4.87 ns    144516129
BM_raw_mt19937_64_old        24.9  ns        25.1  ns     28000000
BM_raw_mt19937_64_new         5.15 ns         5.16 ns    100000000
BM_raw_lcg_old               20.1  ns        19.9  ns     34461538
BM_raw_lcg_new               15.2  ns        15.0  ns     44800000
BM_raw_xoshiro256xx_old      21.2  ns        21.3  ns     34461538
BM_raw_xoshiro256xx_new       2.54 ns         2.57 ns    280000000
BM_raw_xoshiro128xx_old       7.75 ns         7.67 ns     89600000
BM_raw_xoshiro128xx_new       2.00 ns         2.01 ns    373333333
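
For context, here is a minimal sketch of xoshiro128** wrapped as a standard-conforming URBG, following Blackman and Vigna's published reference algorithm. It is not necessarily the wrapper used for the numbers above: the class name `xoshiro128ss` and the `std::seed_seq`-based seeding are assumptions made for illustration.

```cpp
#include <cstdint>
#include <limits>
#include <random>

// Sketch only: xoshiro128** exposed through the URBG interface
// (result_type, min(), max(), operator()).
class xoshiro128ss {
public:
    using result_type = std::uint32_t;

    // Assumption: fill the 128-bit state from std::seed_seq.
    explicit xoshiro128ss(std::uint32_t seed = 1u) {
        std::seed_seq seq{seed};
        seq.generate(state_, state_ + 4);
        if ((state_[0] | state_[1] | state_[2] | state_[3]) == 0) {
            state_[0] = 1u; // the all-zero state is a fixed point and must be avoided
        }
    }

    static constexpr result_type min() { return 0; }
    static constexpr result_type max() { return std::numeric_limits<result_type>::max(); }

    result_type operator()() {
        // Output scrambler: rotl(s1 * 5, 7) * 9.
        const std::uint32_t result = rotl(state_[1] * 5u, 7) * 9u;
        const std::uint32_t t      = state_[1] << 9;

        // Linear state transition (xor, shift, rotate).
        state_[2] ^= state_[0];
        state_[3] ^= state_[1];
        state_[1] ^= state_[2];
        state_[0] ^= state_[3];
        state_[2] ^= t;
        state_[3] = rotl(state_[3], 11);
        return result;
    }

private:
    static constexpr std::uint32_t rotl(std::uint32_t x, int k) {
        return (x << k) | (x >> (32 - k));
    }

    std::uint32_t state_[4];
};
```

Because the state is only 16 bytes and every output is a full 32-bit value, a generator like this makes the per-call cost of the bounded-integer step itself easier to see than with `std::mt19937`.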

StephanTLavavej

Thanks, this approach looks good to me! (I skipped the int128 changes as I assume we'll want to land @frederick-vs-ja's #3036 first.)

The benchmark results look convincing enough to me, especially @statementreply's cases. 😻

stl/inc/xutility

tests/std/tests/GH_000178_uniform_int/test.cpp

MattStephanson · 2022-08-19T05:29:30Z

Some of @StephanTLavavej's feedback affects codegen, so here are my updated benchmark results. I think they're pretty similar to what I originally posted.

x86

2022-08-18T20:20:48-07:00
Running C:\Users\steph\source\repos\sandbox\Release\sandbox.exe
Run on (8 X 2433.76 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 8192 KiB (x1)
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_mt19937                  3.80 ns         3.85 ns    186666667
BM_mt19937_64               9.52 ns         9.42 ns     74666667
BM_lcg                      8.44 ns         8.54 ns     89600000
BM_raw_mt19937_old          5.90 ns         5.86 ns    112000000
BM_raw_mt19937_new          4.45 ns         4.52 ns    165925926
BM_raw_mt19937_64_old       17.7 ns         17.3 ns     40727273
BM_raw_mt19937_64_new       17.1 ns         17.3 ns     40727273
BM_raw_lcg_old              26.4 ns         26.7 ns     26352941
BM_raw_lcg_new              27.9 ns         27.9 ns     26352941

x64

2022-08-18T20:22:16-07:00
Running C:\Users\steph\source\repos\sandbox\x64\Release\sandbox.exe
Run on (8 X 2437.08 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 8192 KiB (x1)
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_mt19937                  3.76 ns         3.77 ns    186666667
BM_mt19937_64               3.82 ns         3.85 ns    186666667
BM_lcg                      3.91 ns         3.92 ns    179200000
BM_raw_mt19937_old          5.60 ns         5.58 ns    112000000
BM_raw_mt19937_new          4.18 ns         4.17 ns    172307692
BM_raw_mt19937_64_old       8.35 ns         8.37 ns     89600000
BM_raw_mt19937_64_new       4.55 ns         4.50 ns    149333333
BM_raw_lcg_old              15.0 ns         15.0 ns     44800000
BM_raw_lcg_new              17.0 ns         16.9 ns     40727273

StephanTLavavej · 2022-09-12T22:49:12Z

I see, a damaged commit was amended, but the history wasn't further rewritten.

benchmarks/src/random_integer_generation.cpp

StephanTLavavej · 2022-09-12T23:14:08Z

@strega-nil-ms I pushed minor changes to the benchmark after validating that it still builds and runs properly. Thanks for adding it!

stl/inc/xutility

strega-nil-ms

Thanks so much!!!

barcharcraz

This seems correct to me, as far as my understanding goes. I think this is ready.

StephanTLavavej · 2022-09-20T21:10:16Z

I've pushed a merge with main to resolve a trivial merge conflict - _Rng_from_urng_v2 was being added immediately above declarations where I added extern "C++" for modules.

✅ I double-checked that this PR should have no impact on modules.

StephanTLavavej · 2022-09-22T07:51:55Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2022-09-22T21:21:32Z

Thank you for improving the performance of <random>'s most popular distribution! Also thanks to @lemire for inventing the algorithm, @chris0x44 for the original PR, and @strega-nil-ms for adding your benchmark to the growing collection! 😻 🎉 🚀

This will be available in either VS 2022 17.5 Preview 1 or Preview 2 (depending on internal merge logistics; the Changelog will record our current expectation).

Co-authored-by: Nicole Mazzuca <mazzucan@outlook.com> Co-authored-by: Stephan T. Lavavej <stl@nuwen.net>

Lemire's fast integer generation

0156b8d

MattStephanson changed the title ~~Lemire's fast integer generation~~ <random>: Implement Lemire's fast integer generation Aug 9, 2022

CaseyCarter added the performance Must go faster label Aug 9, 2022

frederick-vs-ja reviewed Aug 9, 2022

frederick-vs-ja mentioned this pull request Aug 16, 2022

<__msvc_int128.hpp>: Backport 128-bit integer-class types to C++14 mode #3036

Merged

StephanTLavavej requested changes Aug 17, 2022

MattStephanson added 2 commits August 18, 2022 18:47

review feedback

3c8de09

Merge branch 'main' into gh_176_uniform_int

d4cd6f3

MattStephanson marked this pull request as ready for review August 19, 2022 05:29

MattStephanson requested a review from a team as a code owner August 19, 2022 05:29

strega-nil-ms added the blocked Something is preventing work on this label Aug 23, 2022

StephanTLavavej removed the blocked Something is preventing work on this label Sep 1, 2022

Merge branch 'main' into gh_178_uniform_int

3970473

StephanTLavavej approved these changes Sep 2, 2022

StephanTLavavej self-assigned this Sep 3, 2022

StephanTLavavej removed their assignment Sep 3, 2022

StephanTLavavej added 3 commits September 12, 2022 16:05

Add banner.

6c5657b

Include <cstdint>, qualify typedefs.

7e02578

Rename max to maximum.

2f00b93

StephanTLavavej reviewed Sep 12, 2022

StephanTLavavej approved these changes Sep 12, 2022

strega-nil-ms reviewed Sep 12, 2022

review feedback

ff90fe8

strega-nil-ms reviewed Sep 13, 2022

StephanTLavavej assigned strega-nil-ms Sep 14, 2022

MattStephanson added 2 commits September 16, 2022 19:54

Merge branch 'main' into gh_178_uniform_int

2191c90

Episode V: The Review Feedback Strikes Back

cc4dcea

strega-nil-ms approved these changes Sep 19, 2022

barcharcraz approved these changes Sep 20, 2022

Merge branch 'main' into gh_178_uniform_int

7401f4e

StephanTLavavej approved these changes Sep 20, 2022

StephanTLavavej assigned StephanTLavavej and unassigned barcharcraz and strega-nil-ms Sep 20, 2022

StephanTLavavej merged commit 8334fca into microsoft:main Sep 22, 2022

CaseyCarter pushed a commit to CaseyCarter/STL that referenced this pull request Oct 6, 2022

<random>: Implement Lemire's fast integer generation (microsoft#3012)

908f716

Co-authored-by: Nicole Mazzuca <mazzucan@outlook.com> Co-authored-by: Stephan T. Lavavej <stl@nuwen.net>

StephanTLavavej mentioned this pull request Oct 14, 2022

Move _Rng_from_urng and _Rng_from_urng_v2 out of <xutility> #3157

Merged

totalgee mentioned this pull request Mar 6, 2023

uniform_int_distribution output changed from VS 17.4 to 17.5? #3541

Closed

MattStephanson deleted the gh_178_uniform_int branch May 10, 2023 01:33
