-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace Marsaglia polar method with Box-muller to generate a normally distributed random number #6556
Conversation
…rate a normally distributed random number
Can one of the admins verify this patch? |
ok to test |
can you please also include the code you used to generate all the timings? |
format is wrong , you need to run the |
diff --git a/algorithms/src/Kokkos_Random.hpp b/algorithms/src/Kokkos_Random.hpp
index 0057c79cb..0ee96098c 100644
--- a/algorithms/src/Kokkos_Random.hpp
+++ b/algorithms/src/Kokkos_Random.hpp
@@ -855,9 +855,9 @@ class Random_XorShift64 {
double normal() {
constexpr double M_PI2 = 2.0 * Kokkos::numbers::pi_v<double>;
- double u = drand();
- double v = drand();
- double r = Kokkos::sqrt(-2.0 * Kokkos::log(u));
+ double u = drand();
+ double v = drand();
+ double r = Kokkos::sqrt(-2.0 * Kokkos::log(u));
double theta = v * M_PI2;
return r * Kokkos::cos(theta);
}
@@ -1099,9 +1099,9 @@ class Random_XorShift1024 {
double normal() {
constexpr double M_PI2 = 2.0 * Kokkos::numbers::pi_v<double>;
- double u = drand();
- double v = drand();
- double r = Kokkos::sqrt(-2.0 * Kokkos::log(u));
+ double u = drand();
+ double v = drand();
+ double r = Kokkos::sqrt(-2.0 * Kokkos::log(u));
double theta = v * M_PI2;
return r * Kokkos::cos(theta);
} |
@fnrizzi , here is the code I used to benchmark. Please note that the random example isn't automatically built by kokkos even with I just pushed the formatting-related changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for working on this!
Thanks for your feedback and help. |
On GPU, Box-Muller is expected to significantly outperform current Marsaglia polar. The reasoning can be found here:
The polar method (Press et. al. 1992) is simple and relatively efficient, but the probability of looping per thread is 14 percent. This leads to an expected 1.6 iterations per generated sample turning into an expected 3.1 iterations when warp effects are taken into account.
On the CPU side, there doesn't seem to be any difference.
Before
After
(The timings are taken using one of the kokkos examples here- generate a 2D array with N rows and 1000 columns. All values in seconds. CPU is AMD 3700x, GPU is Nvidia 1650 super, 16GB RAM)
I'm not quite sure why there is such a performance gap in Polar method on GPU between the 64 and 1024 bit versions. Maybe this can be tuned out by trying different kernel launch configurations (i.e. #blocks, #threads/block). I wanted to present the results before diving deeper.