Add warp sync for Cuda parallel reduce to avoid race condition #6630
Conversation
compute-sanitizer --tool=racecheck discovered a potential race condition in Cuda parallel reductions (using range policy): data was being updated on a single thread inside a warp, but the warp was not synchronized before the data was read.
Please comment on the performance implications of that fence.
// Inside cuda_single_inter_block_reduce_scan() above, shared[i] below
// might have been updated by a single thread within a warp without
// synchronization afterwards. Synchronize threads within warp to avoid
// potential racecondition.
Typo. Add the missing white space if you somehow make any change or retrigger CI. Otherwise not worth fixing.
I would at least mention that `final` is also only run by a single thread and that the result is not necessarily available to the whole warp.
We can certainly check, but that thing semantically needs to be there, even though I doubt anyone can trigger a bug right now from it being missing.
@dalg24 I'm not seeing any significant performance impact when launching parallel reduce kernels; the relative difference in timings is under 1% compared to develop.
Is that the only place it is needed? There are many instances of
if (CudaTraits::WarpSize < word_count.value) {
__syncthreads();
}
@dalg24 This is the only place where compute-sanitizer racecheck has complained. I could do a more detailed investigation into other places.
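For context, a minimal sketch of the gap this guard can leave (the function name and layout here are hypothetical, not the actual Kokkos code): the block-wide barrier is skipped whenever the data fits in a single warp, and since Volta's independent thread scheduling, warp lanes are not guaranteed to run in lockstep, so an explicit warp-level sync is still needed in that branch.

```cuda
// Hypothetical illustration of the guard pattern quoted above. When
// word_count fits within one warp, __syncthreads() is skipped; on
// Volta and later an explicit __syncwarp() is then still required to
// make one lane's shared-memory writes visible to the rest of the warp.
__device__ void sync_after_shared_update(int word_count) {
  constexpr int WarpSize = 32;
  if (WarpSize < word_count) {
    __syncthreads();  // data spans multiple warps: full block barrier
  } else {
    __syncwarp();     // single warp: warp-level barrier suffices
  }
}
```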
[4.2.01] Add warp sync for Cuda parallel reduce to avoid race condition #6630
The cherry-pick should not have been merged before that one. That was a mistake.
Specifically thinking of kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp lines 765 to 767 (at dcf93fc) and kokkos/core/src/Cuda/Kokkos_Cuda_Parallel_MDRange.hpp lines 309 to 311 (at dcf93fc).
Yes, those two places also need it.
Potential race condition discovered by `compute-sanitizer --tool=racecheck` on a simple reduce using Cuda: basically, the value of `shared[i]` below `__syncwarp()` was being updated on a single thread within a warp in `Kokkos_Cuda_ReduceScan.hpp:349`, but no synchronization is guaranteed before reading `shared[i]` on (potentially) different threads.

Closes #6217.
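The flagged pattern can be reduced to a toy kernel (hypothetical, not the Kokkos source): a single lane updates shared memory and other lanes in the warp then read it, so a `__syncwarp()` is required between the write and the read.

```cuda
// Toy reproduction of the hazard: lane 0 writes shared memory that all
// lanes of the warp subsequently read.
__global__ void broadcast_from_lane0(const int* in, int* out) {
  __shared__ int shared[1];
  const unsigned lane = threadIdx.x % 32;

  // A single thread within the warp updates shared memory...
  if (lane == 0) shared[0] = in[blockIdx.x] * 2;

  // ...so the warp must synchronize before the other lanes read it.
  // Removing this line is the kind of read/write hazard that
  // compute-sanitizer --tool=racecheck reports.
  __syncwarp();

  out[blockIdx.x * 32 + lane] = shared[0];
}
```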