Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Marsaglia polar method with Box-muller to generate a normally distributed random number #6556

Merged
merged 4 commits into from
Nov 9, 2023

Conversation

Shihab-Shahriar
Copy link
Contributor

@Shihab-Shahriar Shihab-Shahriar commented Oct 31, 2023

On GPU, Box-Muller is expected to significantly outperform current Marsaglia polar. The reasoning can be found here:

The polar method (Press et. al. 1992) is simple and relatively efficient, but the probability of looping per thread is 14 percent. This leads to an expected 1.6 iterations per generated sample turning into an expected 3.1 iterations when warp effects are taken into account.

On the CPU side, there doesn't seem to be any difference.

Before

N CPU CPU CUDA CUDA
64 1024 64 1024
1e4 0.061071 0.0757334 0.0150474 0.016794
1e5 0.600972 0.761684 0.120827 0.28777
2e5 1.19428 1.50356 0.236278 0.586949

After

N CPU CPU CUDA CUDA
64 1024 64 1024
1e4 0.0610803 0.0734002 0.0119627 0.0119717
1e5 0.60062 0.73712 0.0957707 0.0968437
2e5 1.17503 1.47986 0.195854 0.19791

(The timings are taken using one of the kokkos examples here- generate a 2D array with N rows and 1000 columns. All values in seconds. CPU is AMD 3700x, GPU is Nvidia 1650 super, 16GB RAM)

I'm not quite sure why there is such a performance gap in Polar method on GPU between the 64 and 1024 bit versions. Maybe this can be tuned out by trying different kernel launch configurations (i.e. #blocks, #threads/block). I wanted to present the results before diving deeper.

@dalg24-jenkins
Copy link
Collaborator

Can one of the admins verify this patch?

@fnrizzi
Copy link
Contributor

fnrizzi commented Oct 31, 2023

ok to test

@fnrizzi
Copy link
Contributor

fnrizzi commented Oct 31, 2023

can you please also include the code you used to generate all the timings?

@fnrizzi
Copy link
Contributor

fnrizzi commented Oct 31, 2023

format is wrong , you need to run the /scripts/apply-clang-format before pushing the repo

@dalg24
Copy link
Member

dalg24 commented Oct 31, 2023

format is wrong , you need to run the /scripts/apply-clang-format before pushing the repo

diff --git a/algorithms/src/Kokkos_Random.hpp b/algorithms/src/Kokkos_Random.hpp
index 0057c79cb..0ee96098c 100644
--- a/algorithms/src/Kokkos_Random.hpp
+++ b/algorithms/src/Kokkos_Random.hpp
@@ -855,9 +855,9 @@ class Random_XorShift64 {
   double normal() {
     constexpr double M_PI2 = 2.0 * Kokkos::numbers::pi_v<double>;
 
-    double u = drand();
-    double v = drand();
-    double r = Kokkos::sqrt(-2.0 * Kokkos::log(u));
+    double u     = drand();
+    double v     = drand();
+    double r     = Kokkos::sqrt(-2.0 * Kokkos::log(u));
     double theta = v * M_PI2;
     return r * Kokkos::cos(theta);
   }
@@ -1099,9 +1099,9 @@ class Random_XorShift1024 {
   double normal() {
     constexpr double M_PI2 = 2.0 * Kokkos::numbers::pi_v<double>;
 
-    double u = drand();
-    double v = drand();
-    double r = Kokkos::sqrt(-2.0 * Kokkos::log(u));
+    double u     = drand();
+    double v     = drand();
+    double r     = Kokkos::sqrt(-2.0 * Kokkos::log(u));
     double theta = v * M_PI2;
     return r * Kokkos::cos(theta);
   }

@Shihab-Shahriar
Copy link
Contributor Author

Shihab-Shahriar commented Oct 31, 2023

@fnrizzi , here is the code I used to benchmark. Please note that the random example isn't automatically built by kokkos even with Kokkos_ENABLE_EXAMPLES=ON, so I had to edit some CMakeLists.txt files. The CMake options I used are in the build-job.sh file at the root.

I just pushed the formatting-related changes.

@dalg24 dalg24 added Performance Code showing unusually slow performance for an architecture and/or backend Kokkos-Algorithms labels Oct 31, 2023
Copy link
Contributor

@fnrizzi fnrizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for working on this!

@Shihab-Shahriar
Copy link
Contributor Author

Thanks for your feedback and help.

@dalg24 dalg24 merged commit 0a83695 into kokkos:develop Nov 9, 2023
28 checks passed
@dalg24 dalg24 mentioned this pull request Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Kokkos-Algorithms Performance Code showing unusually slow performance for an architecture and/or backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants