Fix a memory bug in the free_state function of random pools. #6290

Merged: 7 commits into kokkos:develop, Jul 21, 2023

Shihab-Shahriar
Contributor

Fixes #6140.

The free_state function was missing a memory fence between two of its global memory operations.

  KOKKOS_INLINE_FUNCTION
  void free_state(const Random_XorShift64<DeviceType>& state) const {
    state_(state.state_idx_, 0) = state.state_;
    locks_(state.state_idx_, 0) = 0;
  }

Not all devices guarantee the order in which global memory operations become visible, so without a fence the lock could be released before the state_ array has been updated in global memory.
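The ordering requirement can be illustrated on the host with standard C++ atomics. This is only an analogy under stated assumptions, not the Kokkos code: `Slot`, `release_slot`, and `acquire_slot` are hypothetical names, and `std::atomic_thread_fence` stands in for the role a device memory fence plays between the state write and the lock release.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical single-slot "pool": one generator state guarded by one lock.
struct Slot {
  std::uint64_t state = 0;   // plain (non-atomic) generator state
  std::atomic<int> lock{1};  // 1 = held, 0 = free
};

// Analogue of free_state: write the state back, then release the lock.
inline void release_slot(Slot& s, std::uint64_t new_state) {
  assert(s.lock.load(std::memory_order_relaxed) == 1 && "slot must be held");
  s.state = new_state;
  // Without this fence, the relaxed store below may become visible to
  // another thread before the state write does.
  std::atomic_thread_fence(std::memory_order_release);
  s.lock.store(0, std::memory_order_relaxed);
}

// Analogue of acquiring a state: spin until the lock is taken, then read.
inline std::uint64_t acquire_slot(Slot& s) {
  int expected = 0;
  while (!s.lock.compare_exchange_weak(expected, 1,
                                       std::memory_order_relaxed)) {
    expected = 0;  // compare_exchange overwrites expected on failure
  }
  // Pairs with the release fence in release_slot.
  std::atomic_thread_fence(std::memory_order_acquire);
  return s.state;
}
```

With the two fences in place, a thread that wins the lock after `release_slot` is guaranteed to read the updated state. Dropping the release fence removes that guarantee, which is the kind of reordering described above.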

@dalg24-jenkins
Collaborator

Can one of the admins verify this patch?

@masterleinad
Contributor

Is there any chance that we can get a unit test for this?

@Shihab-Shahriar
Contributor Author

I added a unit test. In summary, this is how it works:

  • Since this bug is triggered only when a state gets reused by a later thread with a similar ID, the number of streams (n_streams) should be significantly larger than the number of states (which equals the execution space's concurrency()). I used n_streams = ExecutionSpace{}.concurrency() * 4.
  • Generate 8 samples for each stream on the device, then copy them back to a 2D array of shape (n_streams, 8) on the host.
  • To find duplicates quickly, sort all the streams by their first number, breaking ties by the second number, and so on. (There is some room for optimization here: we don't need to finish the sort once a duplicate is found. But I opted for simplicity.)
  • Check each pair of neighbors in the sorted order for duplicates.
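The sort-then-compare-neighbors step can be sketched on the host in plain C++. This is a simplified illustration, not the test in the PR: `Stream` and `has_duplicate_stream` are hypothetical names, and the streams are sorted directly rather than through an index array.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// One stream = 8 samples, matching the sample count described above.
using Stream = std::array<std::uint64_t, 8>;

// Returns true if any two streams are element-wise identical.
// Takes the vector by value so the caller's ordering is left untouched.
inline bool has_duplicate_stream(std::vector<Stream> streams) {
  // std::array compares lexicographically: first element, then second, ...
  std::sort(streams.begin(), streams.end());
  // After sorting, identical streams must sit next to each other.
  return std::adjacent_find(streams.begin(), streams.end()) != streams.end();
}
```

Sorting costs O(n log n) comparisons, each over at most 8 values, and the neighbor scan is linear, so the whole check stays cheap even for several times `concurrency()` streams.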

I have been able to reliably reproduce the bug for the 64-bit random pool on my device (Nvidia V100), but not for the 1024-bit one. I think this is because the 1024-bit pool has 16 states that get updated in free_state, so the probability of at least one of them reaching global memory before the lock gets released is much higher.
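A rough way to make this intuition concrete, under a simplifying assumption that is not from the PR: suppose each written word independently becomes visible before the lock release with probability p. Then the chance that at least one of n words is visible is 1 - (1 - p)^n. The helper below (a hypothetical name) just evaluates that expression.

```cpp
#include <cassert>
#include <cmath>

// Toy model only: assumes each word's write is independently visible before
// the lock release with probability p. Real hardware need not behave so.
inline double prob_at_least_one_visible(double p, int n_words) {
  assert(p >= 0.0 && p <= 1.0 && n_words >= 0);
  return 1.0 - std::pow(1.0 - p, n_words);
}
```

In this model, a single word (the 64-bit pool) gives exactly p, while 16 words push the value close to 1 even for moderate p, which is consistent with exact duplicates being much harder to observe for the 1024-bit pool.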

I also want to note that the test has a narrow focus: reproducing the issue reported in #6140, i.e. checking for exact duplicates. Statistical tests that check for independence of the streams [1] could potentially identify the issue with the 1024-bit pool without memory fences.

[1] Salmon, John K., Mark A. Moraes, Ron O. Dror, and David E. Shaw. "Parallel random numbers: as easy as 1, 2, 3." In Proceedings of 2011 international conference for high performance computing, networking, storage and analysis, pp. 1-12. 2011.

@masterleinad
Contributor

OK to test.

@masterleinad
Contributor

Please fix the indentation (according to clang-format-8):

+ ./scripts/docker/check_format_cpp.sh
diff --git a/algorithms/src/Kokkos_Random.hpp b/algorithms/src/Kokkos_Random.hpp
index aada5fe27..2d7d236d2 100644
--- a/algorithms/src/Kokkos_Random.hpp
+++ b/algorithms/src/Kokkos_Random.hpp
@@ -1210,7 +1210,7 @@ class Random_XorShift1024_Pool {
   KOKKOS_INLINE_FUNCTION
   void free_state(const Random_XorShift1024<DeviceType>& state) const {
     for (int i = 0; i < 16; i++) state_(state.state_idx_, i) = state.state_[i];
-    p_(state.state_idx_, 0)     = state.p_;
+    p_(state.state_idx_, 0) = state.p_;
     // Release the lock only after the state has been updated in memory
     Kokkos::memory_fence();
     locks_(state.state_idx_, 0) = 0;
diff --git a/algorithms/unit_tests/TestRandom.hpp b/algorithms/unit_tests/TestRandom.hpp
index 7a31411b9..fae2f029d 100644
--- a/algorithms/unit_tests/TestRandom.hpp
+++ b/algorithms/unit_tests/TestRandom.hpp
@@ -480,8 +480,7 @@ struct generate_random_stream {
   GeneratorPool rand_pool;
   int samples;
 
-  generate_random_stream(ViewType vals_, GeneratorPool rand_pool_,
-                  int samples_)
+  generate_random_stream(ViewType vals_, GeneratorPool rand_pool_, int samples_)
       : vals(vals_), rand_pool(rand_pool_), samples(samples_) {}
 
   KOKKOS_INLINE_FUNCTION
@@ -494,50 +493,48 @@ struct generate_random_stream {
   }
 };
 
-
 // NOTE: this doesn't test the statistical independence of multiple streams
-// generated by a Random pool, it only tests for complete duplicates. 
+// generated by a Random pool, it only tests for complete duplicates.
 template <class ExecutionSpace, class Pool>
-void test_duplicate_stream(){
-  using ViewType = Kokkos::View<uint64_t**, ExecutionSpace>; 
+void test_duplicate_stream() {
+  using ViewType = Kokkos::View<uint64_t**, ExecutionSpace>;
 
   // Heuristic to create a "large enough" number of streams.
   int n_streams = ExecutionSpace{}.concurrency() * 4;
-  int samples = 8;
+  int samples   = 8;
 
   Pool rand_pool(42);
   ViewType vals_d("Vals", n_streams, samples);
   typename ViewType::HostMirror vals_h = Kokkos::create_mirror_view(vals_d);
 
-  Kokkos::parallel_for(n_streams,generate_random_stream<ExecutionSpace, Pool>(
-                          vals_d, rand_pool, samples));
+  Kokkos::parallel_for(n_streams, generate_random_stream<ExecutionSpace, Pool>(
+                                      vals_d, rand_pool, samples));
 
   Kokkos::fence();
   Kokkos::deep_copy(vals_h, vals_d);
 
   /*
   To quickly find streams that are identical, we sort them by the first number,
-  if that's equal then the second and so on. We then test each neighbor pair 
+  if that's equal then the second and so on. We then test each neighbor pair
   for duplicates.
-  */ 
+  */
   std::vector<size_t> indices(n_streams);
   std::iota(indices.begin(), indices.end(), 0);
 
-  auto comparator = [&](int i, int j){
-                  for (int k=0; k<samples; k++){
-                    if(vals_h(i, k)!=vals_h(j, k)) 
-                      return vals_h(i, k) < vals_h(j, k);
-                  }
-                  return false;
-              };
+  auto comparator = [&](int i, int j) {
+    for (int k = 0; k < samples; k++) {
+      if (vals_h(i, k) != vals_h(j, k)) return vals_h(i, k) < vals_h(j, k);
+    }
+    return false;
+  };
   std::sort(indices.begin(), indices.end(), comparator);
 
-  for(int i=0; i< n_streams-1; i++){
-    int idx1 = indices[i], idx2 = indices[i+1];
+  for (int i = 0; i < n_streams - 1; i++) {
+    int idx1 = indices[i], idx2 = indices[i + 1];
 
     int k = 0;
-    while(k < samples && vals_h(idx1, k)==vals_h(idx2,k)) k++;
-    ASSERT_LT(k, samples)  << "Duplicate streams found";
+    while (k < samples && vals_h(idx1, k) == vals_h(idx2, k)) k++;
+    ASSERT_LT(k, samples) << "Duplicate streams found";
   }
 }
 
@@ -583,8 +580,8 @@ TEST(TEST_CATEGORY, Random_XorShift1024_0) {
 
 TEST(TEST_CATEGORY, Multi_streams) {
   using ExecutionSpace = TEST_EXECSPACE;
-  using Pool64 = Kokkos::Random_XorShift64_Pool<ExecutionSpace>;
-  using Pool1024 = Kokkos::Random_XorShift1024_Pool<ExecutionSpace>;
+  using Pool64         = Kokkos::Random_XorShift64_Pool<ExecutionSpace>;
+  using Pool1024       = Kokkos::Random_XorShift1024_Pool<ExecutionSpace>;
 
   AlgoRandomImpl::test_duplicate_stream<ExecutionSpace, Pool64>();
   AlgoRandomImpl::test_duplicate_stream<ExecutionSpace, Pool1024>();

@dalg24
Member

dalg24 commented Jul 19, 2023

SYCL failure is unrelated but OpenMPTarget has an issue

29: [ RUN      ] openmptarget.Multi_streams
29: CUDA error: an illegal memory access was encountered 
29: Libomptarget error: Call to targetDataEnd failed, abort target.
29: Libomptarget error: Failed to process data after launching the kernel.
29: Libomptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
29: Kokkos_OpenMPTarget_ParallelFor_Range.hpp:54:1: Libomptarget fatal error 1: failure of target construct while offloading is mandatory

We could disable it if we can't figure it out. @rgayatri23 please advise.

@rgayatri23
Contributor

Yeah, this is very likely a compiler bug. Can we disable it for OpenMPTarget for now? Once it is merged, I will go line by line to see what triggers it and take the necessary action.

@dalg24
Member

dalg24 commented Jul 20, 2023

Uggg of course now SYCL fails

39: [ RUN      ] sycl.Multi_streams
39: /var/jenkins/workspace/Kokkos/algorithms/unit_tests/TestRandom.hpp:538: Failure
39: Expected: (k) < (samples), actual: 8 vs 8
39: Duplicate streams found

@masterleinad please advise

@masterleinad
Contributor

It works on Intel GPUs so we might just disable the test for SYCL with a FIXME if KOKKOS_ARCH_INTEL_GPU is not defined.

@dalg24
Member

dalg24 commented Jul 20, 2023

What about

#if defined(KOKKOS_ENABLE_SYCL) && defined(KOKKOS_IMPL_ARCH_NVIDIA_GPU)
  if constexpr (std::is_same_v<ExecutionSpace,
                               Kokkos::Experimental::SYCL>) {
    GTEST_SKIP() << "Failing on NVIDIA GPUs";  // FIXME_SYCL
  }
#endif
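The compile-time dispatch used in that snippet can be tried in isolation with standard C++17. The sketch below uses hypothetical tag types (`HostBackend`, `DeviceBackend`) instead of Kokkos execution spaces and returns a flag instead of calling GTEST_SKIP(), so it only shows the `if constexpr` plus `std::is_same_v` pattern, not the actual test change.

```cpp
#include <type_traits>

// Hypothetical stand-ins for execution spaces; these are not Kokkos types.
struct HostBackend {};
struct DeviceBackend {};

// Returns true when the duplicate-stream test should be skipped for the
// given backend. The branch is resolved at compile time per instantiation.
template <class ExecutionSpace>
constexpr bool should_skip_duplicate_test() {
  if constexpr (std::is_same_v<ExecutionSpace, DeviceBackend>) {
    return true;  // known-failing combination in this illustration
  } else {
    return false;
  }
}

static_assert(should_skip_duplicate_test<DeviceBackend>());
static_assert(!should_skip_duplicate_test<HostBackend>());
```

Because the check is on the template parameter, only the instantiation for the affected backend takes the skip path, and every other backend still runs the test.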

@masterleinad
Contributor

Sure, we can also just disable for SYCL+CUDA. It boils down to the same restriction anyway for now.

@dalg24
Member

dalg24 commented Jul 21, 2023

Ignoring HIP builds that timed out.

Thank you for fixing this!

@dalg24 dalg24 merged commit 4d1c6c3 into kokkos:develop Jul 21, 2023
27 of 28 checks passed
@dalg24 dalg24 mentioned this pull request Jul 14, 2023