CUDA Reductions: Fix data races reported by Nvidia compute-sanitizer #4855
Conversation
Before:

$ compute-sanitizer --tool=racecheck ./redtest 1
========= COMPUTE-SANITIZER
Running with n = 1
========= WARNING: Race reported between Read access at 0xf10 in /scratch/pbmille/kokkos/install/cuda10/include/impl/Kokkos_FunctorAdapter.hpp:1571:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int)
=========     and Write access at 0x8a0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:557:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*) [3968 hazards]
=========
========= WARNING: Race reported between Read access at 0x11a0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:548:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int)
=========     and Write access at 0x11d0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:548:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int) [144 hazards]
=========
========= WARNING: Race reported between Write access at 0x1180 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:572:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*)
=========     and Read access at 0x2d10 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_Parallel_Range.hpp:321:_ZNK6Kokkos4Impl14ParallelReduceI3SumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES4_EclEv [4 hazards]
=========
========= RACECHECK SUMMARY: 3 hazards displayed (0 errors, 3 warnings)

After:

$ compute-sanitizer --tool=racecheck --show-backtrace=yes ./redtest 1
========= COMPUTE-SANITIZER
Running with n = 1
========= WARNING: Race reported between Read access at 0x11f0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:551:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double volatile *,bool,int)
=========     and Write access at 0x1220 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:551:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double volatile *,bool,int) [144 hazards]
=========
========= WARNING: Race reported between Write access at 0x1180 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:576:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*)
=========     and Read access at 0x2d10 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_Parallel_Range.hpp:321:_ZNK6Kokkos4Impl14ParallelReduceI3SumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES4_EclEv [4 hazards]
=========
========= RACECHECK SUMMARY: 2 hazards displayed (0 errors, 2 warnings)
Before:

$ compute-sanitizer --tool=racecheck --show-backtrace=yes ./redtest 1
========= COMPUTE-SANITIZER
Running with n = 1
========= WARNING: Race reported between Read access at 0x11f0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:551:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int)
=========     and Write access at 0x1220 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:551:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int) [144 hazards]
=========
========= WARNING: Race reported between Write access at 0x1180 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:575:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*)
=========     and Read access at 0x2d10 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_Parallel_Range.hpp:321:_ZNK6Kokkos4Impl14ParallelReduceI3SumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES4_EclEv [4 hazards]
=========
========= RACECHECK SUMMARY: 2 hazards displayed (0 errors, 2 warnings)

After:

$ compute-sanitizer --tool=racecheck --show-backtrace=yes ./redtest 1
========= COMPUTE-SANITIZER
Running with n = 1
========= WARNING: Race reported between Write access at 0x1180 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:577:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*)
=========     and Read access at 0x2d10 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_Parallel_Range.hpp:321:_ZNK6Kokkos4Impl14ParallelReduceI3SumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES4_EclEv [4 hazards]
=========
========= RACECHECK SUMMARY: 1 hazard displayed (0 errors, 1 warning)
Did you look at performance?
I haven't in the least. For the first error and commit, @crtrott agreed that it's a legitimate defect in the code. We haven't looked closely at the second commit fixing the second error yet. I have a change that covers the third error, but I'm much less confident in it as yet.
@@ -539,13 +539,18 @@ struct CudaReductionsFunctor<FunctorType, ArgTag, false, false> {
            : ((1 << width) - 1)
                  << ((threadIdx.y * blockDim.x + threadIdx.x) / width) * width;
    const int lane_id = (threadIdx.y * blockDim.x + threadIdx.x) % 32;

    __syncwarp(mask);

    for (int delta = skip_vector ? blockDim.x : 1; delta < width; delta *= 2) {
      if (lane_id + delta < 32) {
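The mask arithmetic visible at the top of the hunk can be checked in isolation. The sketch below is a hypothetical Python transcription (the `subwarp_mask` name and plain-integer thread id are mine, not Kokkos code) of the expression `((1 << width) - 1) << (tid / width) * width`: it builds the bitmask selecting the `width` lanes of the sub-warp containing linear thread id `tid`, which is what gets handed to `__syncwarp(mask)`.

```python
def subwarp_mask(tid, width):
    """Bitmask of the width-sized lane group containing linear thread id `tid`.

    Mirrors the hunk's ((1 << width) - 1) << (tid / width) * width:
    a run of `width` ones, shifted to the group's starting lane.
    """
    return ((1 << width) - 1) << (tid // width) * width

# Thread 5 with width 4 belongs to lanes 4-7:
print(f"{subwarp_mask(5, 4):032b}")
# -> 00000000000000000000000011110000
```

Only the lanes named in the mask are synchronized, which is why the mask must cover every lane that participates in the loop that follows.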
I wonder if this needs to be
if ((lane_id + delta < 32) && (lane_id % (delta * 2) == 0)) {
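One way to see what the extra `lane_id % (delta * 2) == 0` clause would do is to enumerate, off-GPU, which lanes perform the accumulation at each stride. This is an illustrative Python sketch (the `writers`/`conflicts` helpers and the 32-lane assumption are mine, not Kokkos code): without the clause, lane `l` can write the very slot that lane `l - delta` is reading; with it, active lanes are spaced `2 * delta` apart, so no active lane's write slot is another active lane's read target.

```python
def writers(delta, guarded, lanes=32):
    """Lanes that would execute the accumulation at stride `delta`.

    guarded=False mirrors `if (lane_id + delta < 32)`;
    guarded=True adds the suggested `lane_id % (delta * 2) == 0` clause.
    """
    return [l for l in range(lanes)
            if l + delta < lanes and (not guarded or l % (delta * 2) == 0)]

def conflicts(active, delta):
    # Lane l writes slot l while lane l - delta reads slot l: a write/read overlap.
    return [l for l in active if l - delta in active]

for delta in (1, 2, 4, 8, 16):
    unguarded = writers(delta, guarded=False)
    guarded = writers(delta, guarded=True)
    print(delta, len(conflicts(unguarded, delta)), len(conflicts(guarded, delta)))
```

For example at `delta = 1` the unguarded version has 30 such overlapping lane pairs, while the guarded version has none at any stride.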
As I noted in chat, this suggestion didn't seem to make any difference to test results, either following the other changes in this PR, or with this PR plus my volatile wrappers. In the former case, everything still passed; in the latter, the half_t reduction unit test still failed.
I'm not clear on what this change in condition is supposed to achieve
basically in the first iteration you have this:

    thread  read_pos  write_pos
    0       1         0
    1       2         1
    2       3         2

so threads read something which gets overwritten by the next guy in the same line.
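That overlap can be reproduced off-GPU. The sketch below is illustrative Python, not Kokkos code: it runs the same `values[i] += values[i + delta]` step once against a snapshot of the inputs (the intended semantics, where every lane sees its neighbour's pre-step value) and once serially in descending lane order (one legal interleaving of concurrently executing lanes). The two disagree, which is exactly the hazard racecheck reports.

```python
def shifted_step(values, delta):
    """One shifted reduction step: lane i does values[i] += values[i + delta]."""
    n = len(values)

    # Intended semantics: every lane reads its neighbour's pre-step value.
    snap = values[:]
    intended = [snap[i] + snap[i + delta] if i + delta < n else snap[i]
                for i in range(n)]

    # One legal interleaving of the racy version: lanes run high-to-low,
    # so a lane may read a neighbour slot that was already overwritten.
    racy = values[:]
    for i in range(n - 1, -1, -1):
        if i + delta < n:
            racy[i] += racy[i + delta]
    return intended, racy

intended, racy = shifted_step([1.0] * 8, delta=1)
print(intended[0], racy[0])  # -> 2.0 8.0: the interleaving changes the answer
```

A `__syncwarp` between the read and the write (or forcing the reads through `volatile`, as the later commits in this PR do) is what pins down which value each lane observes.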
I would vote against cherry-picking it too. It's not been observed to cause errors in existing code otherwise, and it risks performance impacts.
If it turns out to be entirely sound and low-impact, it could be part of a 3.6.1, which would support people making more use of Nvidia's compute-sanitizer |
Detected with Nvidia's compute-sanitizer racecheck tool, with the tiny test case
Before this change:
After this change, the first two reports are eliminated: