-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a memory bug in the free_state
function of random pools.
#6290
Conversation
…tate update and lock release operations.
Can one of the admins verify this patch? |
Is there any chance that we can get a unit test for this? |
I added an unit test. In summary, this is how it works:
I have been able to reliably reproduce the bug for 64 bit random pool in my device (Nvidia V100), but not for 1024 bit one. I think this is because as the pool has 16 states that get updated in I also want to note that the test has a narrow focus of reproducing the issue reported in #6140 i.e. checking for exact duplicates. Statistical tests that check for independence of the streams [1] can potentially identify the issue with 1024 bit pool without memory fences. [1] Salmon, John K., Mark A. Moraes, Ron O. Dror, and David E. Shaw. "Parallel random numbers: as easy as 1, 2, 3." In Proceedings of 2011 international conference for high performance computing, networking, storage and analysis, pp. 1-12. 2011. |
OK to test. |
PLease fix the indentation (according to + ./scripts/docker/check_format_cpp.sh
diff --git a/algorithms/src/Kokkos_Random.hpp b/algorithms/src/Kokkos_Random.hpp
index aada5fe27..2d7d236d2 100644
--- a/algorithms/src/Kokkos_Random.hpp
+++ b/algorithms/src/Kokkos_Random.hpp
@@ -1210,7 +1210,7 @@ class Random_XorShift1024_Pool {
KOKKOS_INLINE_FUNCTION
void free_state(const Random_XorShift1024<DeviceType>& state) const {
for (int i = 0; i < 16; i++) state_(state.state_idx_, i) = state.state_[i];
- p_(state.state_idx_, 0) = state.p_;
+ p_(state.state_idx_, 0) = state.p_;
// Release the lock only after the state has been updated in memory
Kokkos::memory_fence();
locks_(state.state_idx_, 0) = 0;
diff --git a/algorithms/unit_tests/TestRandom.hpp b/algorithms/unit_tests/TestRandom.hpp
index 7a31411b9..fae2f029d 100644
--- a/algorithms/unit_tests/TestRandom.hpp
+++ b/algorithms/unit_tests/TestRandom.hpp
@@ -480,8 +480,7 @@ struct generate_random_stream {
GeneratorPool rand_pool;
int samples;
- generate_random_stream(ViewType vals_, GeneratorPool rand_pool_,
- int samples_)
+ generate_random_stream(ViewType vals_, GeneratorPool rand_pool_, int samples_)
: vals(vals_), rand_pool(rand_pool_), samples(samples_) {}
KOKKOS_INLINE_FUNCTION
@@ -494,50 +493,48 @@ struct generate_random_stream {
}
};
-
// NOTE: this doesn't test the statistical independence of multiple streams
-// generated by a Random pool, it only tests for complete duplicates.
+// generated by a Random pool, it only tests for complete duplicates.
template <class ExecutionSpace, class Pool>
-void test_duplicate_stream(){
- using ViewType = Kokkos::View<uint64_t**, ExecutionSpace>;
+void test_duplicate_stream() {
+ using ViewType = Kokkos::View<uint64_t**, ExecutionSpace>;
// Heuristic to create a "large enough" number of streams.
int n_streams = ExecutionSpace{}.concurrency() * 4;
- int samples = 8;
+ int samples = 8;
Pool rand_pool(42);
ViewType vals_d("Vals", n_streams, samples);
typename ViewType::HostMirror vals_h = Kokkos::create_mirror_view(vals_d);
- Kokkos::parallel_for(n_streams,generate_random_stream<ExecutionSpace, Pool>(
- vals_d, rand_pool, samples));
+ Kokkos::parallel_for(n_streams, generate_random_stream<ExecutionSpace, Pool>(
+ vals_d, rand_pool, samples));
Kokkos::fence();
Kokkos::deep_copy(vals_h, vals_d);
/*
To quickly find streams that are identical, we sort them by the first number,
- if that's equal then the second and so on. We then test each neighbor pair
+ if that's equal then the second and so on. We then test each neighbor pair
for duplicates.
- */
+ */
std::vector<size_t> indices(n_streams);
std::iota(indices.begin(), indices.end(), 0);
- auto comparator = [&](int i, int j){
- for (int k=0; k<samples; k++){
- if(vals_h(i, k)!=vals_h(j, k))
- return vals_h(i, k) < vals_h(j, k);
- }
- return false;
- };
+ auto comparator = [&](int i, int j) {
+ for (int k = 0; k < samples; k++) {
+ if (vals_h(i, k) != vals_h(j, k)) return vals_h(i, k) < vals_h(j, k);
+ }
+ return false;
+ };
std::sort(indices.begin(), indices.end(), comparator);
- for(int i=0; i< n_streams-1; i++){
- int idx1 = indices[i], idx2 = indices[i+1];
+ for (int i = 0; i < n_streams - 1; i++) {
+ int idx1 = indices[i], idx2 = indices[i + 1];
int k = 0;
- while(k < samples && vals_h(idx1, k)==vals_h(idx2,k)) k++;
- ASSERT_LT(k, samples) << "Duplicate streams found";
+ while (k < samples && vals_h(idx1, k) == vals_h(idx2, k)) k++;
+ ASSERT_LT(k, samples) << "Duplicate streams found";
}
}
@@ -583,8 +580,8 @@ TEST(TEST_CATEGORY, Random_XorShift1024_0) {
TEST(TEST_CATEGORY, Multi_streams) {
using ExecutionSpace = TEST_EXECSPACE;
- using Pool64 = Kokkos::Random_XorShift64_Pool<ExecutionSpace>;
- using Pool1024 = Kokkos::Random_XorShift1024_Pool<ExecutionSpace>;
+ using Pool64 = Kokkos::Random_XorShift64_Pool<ExecutionSpace>;
+ using Pool1024 = Kokkos::Random_XorShift1024_Pool<ExecutionSpace>;
AlgoRandomImpl::test_duplicate_stream<ExecutionSpace, Pool64>();
AlgoRandomImpl::test_duplicate_stream<ExecutionSpace, Pool1024>(); |
SYCL failure is unrelated but OpenMPTarget has an issue
we could disable if we can't figure it out. @rgayatri23 please advise. |
Yeah this is very likely a compiler bug. Can we disable it for OpenMPTarget for now? Once it is merged, I will go line-by-line to see what triggers it and take the necessary action. |
Uggg of course now SYCL fails
@masterleinad please advise |
It works on |
What about #if defined(KOKKOS_ENABLE_SYCL) && defined(KOKKOS_IMPL_ARCH_NVIDIA_GPU)
if constexpr (std::is_same_v<ExecutionSpace,
Kokkos::Experimental::SYCL>) {
GTEST_SKIP() << "Failing on NVIDIA GPUs"; // FIXME_SYCL
}
#endif |
Sure, we can also just disable for SYCL+CUDA. It boils down to the same restriction anyway for now. |
Ignoring HIP builds that timed out. Thank you for fixing this! |
Fixes #6140.
The
free_state
function was missing a memory fence between two of it's global memory operations.Since not all devices guarantee order in global memory operations, without a fence, the lock could get released before the
state_
array has been updated in global memory.