Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA Reductions: Fix data races reported by Nvidia compute-sanitizer #4855

Merged
merged 2 commits into from
Mar 10, 2022
Merged

CUDA Reductions: Fix data races reported by Nvidia compute-sanitizer #4855

merged 2 commits into from
Mar 10, 2022

Conversation

PhilMiller
Copy link
Contributor

@PhilMiller PhilMiller commented Mar 9, 2022

Detected with Nvidia's compute-sanitizer racecheck tool, with the tiny test case

#include <Kokkos_Core.hpp>
#include <cstdlib>
#include <iostream>

struct Sum {
  using value_type = double;

  KOKKOS_INLINE_FUNCTION
  void operator()(const int i, double& lsum) const {
    lsum += i;
  }
};

int main(int argc, char** argv)
{
  Kokkos::initialize(argc, argv);

  int n = std::atoi(argv[1]);
  std::cout << "Running with n = " << n << std::endl;
  {
    Kokkos::View<double*> v("foo", n);
    double val;
    Sum sum;
    Kokkos::parallel_reduce(n, sum, val);
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}

Before this change:

$ compute-sanitizer --tool=racecheck ./redtest 1
========= COMPUTE-SANITIZER
Running with n = 1
========= WARNING: Race reported between Read access at 0xf10 in /scratch/pbmille/kokkos/install/cuda10/include/impl/Kokkos_FunctorAdapter.hpp:1571:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int)
=========     and Write access at 0x8a0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:557:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*) [3968 hazards]
========= 
========= WARNING: Race reported between Read access at 0x11a0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:548:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int)
=========     and Write access at 0x11d0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:548:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int) [144 hazards]
========= 
========= WARNING: Race reported between Write access at 0x1180 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:572:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*)
=========     and Read access at 0x2d10 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_Parallel_Range.hpp:321:_ZNK6Kokkos4Impl14ParallelReduceI3SumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES4_EclEv [4 hazards]
========= 
========= RACECHECK SUMMARY: 3 hazards displayed (0 errors, 3 warnings)

After this change, the first two reports are eliminated:

$ compute-sanitizer --tool=racecheck --show-backtrace=yes ./redtest 1
========= COMPUTE-SANITIZER
Running with n = 1
========= WARNING: Race reported between Write access at 0x1180 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:577:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*)
=========     and Read access at 0x2d10 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_Parallel_Range.hpp:321:_ZNK6Kokkos4Impl14ParallelReduceI3SumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES4_EclEv [4 hazards]
========= 
========= RACECHECK SUMMARY: 1 hazard displayed (0 errors, 1 warning)

@PhilMiller PhilMiller requested a review from crtrott March 9, 2022 00:59
@PhilMiller PhilMiller changed the title CUDA Reductions: add missing __syncwarp() call CUDA Reductions: Fix data races reported by Nvidia compute-sanitizer Mar 9, 2022
Before:

$ compute-sanitizer --tool=racecheck ./redtest 1
========= COMPUTE-SANITIZER
Running with n = 1
========= WARNING: Race reported between Read access at 0xf10 in /scratch/pbmille/kokkos/install/cuda10/include/impl/Kokkos_FunctorAdapter.hpp:1571:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int)
=========     and Write access at 0x8a0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:557:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*) [3968 hazards]
=========
========= WARNING: Race reported between Read access at 0x11a0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:548:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int)
=========     and Write access at 0x11d0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:548:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int) [144 hazards]
=========
========= WARNING: Race reported between Write access at 0x1180 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:572:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*)
=========     and Read access at 0x2d10 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_Parallel_Range.hpp:321:_ZNK6Kokkos4Impl14ParallelReduceI3SumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES4_EclEv [4 hazards]
=========
========= RACECHECK SUMMARY: 3 hazards displayed (0 errors, 3 warnings)

After:

$ compute-sanitizer --tool=racecheck --show-backtrace=yes ./redtest 1
========= COMPUTE-SANITIZER
Running with n = 1
========= WARNING: Race reported between Read access at 0x11f0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:551:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double volatile *,bool,int)
=========     and Write access at 0x1220 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:551:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double volatile *,bool,int) [144 hazards]
=========
========= WARNING: Race reported between Write access at 0x1180 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:576:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*)
=========     and Read access at 0x2d10 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_Parallel_Range.hpp:321:_ZNK6Kokkos4Impl14ParallelReduceI3SumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES4_EclEv [4 hazards]
=========
========= RACECHECK SUMMARY: 2 hazards displayed (0 errors, 2 warnings)
Before:

$ compute-sanitizer --tool=racecheck --show-backtrace=yes ./redtest 1
========= COMPUTE-SANITIZER
Running with n = 1
========= WARNING: Race reported between Read access at 0x11f0 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:551:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int)
=========     and Write access at 0x1220 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:551:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_warp_reduction(Sum const &,double*,bool,int) [144 hazards]
=========
========= WARNING: Race reported between Write access at 0x1180 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:575:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*)
=========     and Read access at 0x2d10 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_Parallel_Range.hpp:321:_ZNK6Kokkos4Impl14ParallelReduceI3SumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES4_EclEv [4 hazards]
=========
========= RACECHECK SUMMARY: 2 hazards displayed (0 errors, 2 warnings)

After:

$ compute-sanitizer --tool=racecheck --show-backtrace=yes ./redtest 1
========= COMPUTE-SANITIZER
Running with n = 1
========= WARNING: Race reported between Write access at 0x1180 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:577:Kokkos::Impl::CudaReductionsFunctor<Sum,void,bool=0,bool=0>::scalar_intra_block_reduction(Sum const &,double,bool,double*,int,double*)
=========     and Read access at 0x2d10 in /scratch/pbmille/kokkos/install/cuda10/include/Cuda/Kokkos_Cuda_Parallel_Range.hpp:321:_ZNK6Kokkos4Impl14ParallelReduceI3SumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES4_EclEv [4 hazards]
=========
========= RACECHECK SUMMARY: 1 hazard displayed (0 errors, 1 warning)
@dalg24
Copy link
Member

dalg24 commented Mar 9, 2022

Did you look at performance?

@PhilMiller
Copy link
Contributor Author

Did you look at performance?

I haven't in the least.

For the first error and commit, @crtrott agreed that it's a legitimate defect in the code.

We haven't looked closely at the second commit fixing the second error yet.

I have a change that covers the third error, but I'm much less confident in it as yet

@@ -539,13 +539,18 @@ struct CudaReductionsFunctor<FunctorType, ArgTag, false, false> {
: ((1 << width) - 1)
<< ((threadIdx.y * blockDim.x + threadIdx.x) / width) * width;
const int lane_id = (threadIdx.y * blockDim.x + threadIdx.x) % 32;

__syncwarp(mask);

for (int delta = skip_vector ? blockDim.x : 1; delta < width; delta *= 2) {
if (lane_id + delta < 32) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this needs to be

if ((lane_id + delta < 32) && (lane_id%(delta*2)==0)) {

Copy link
Contributor Author

@PhilMiller PhilMiller Mar 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I noted in chat, this suggestion didn't seem to make any difference to test results, either following the other changes in this PR, or with this PR plus my volatile wrappers. In the former case, everything still passed, in the latter, the half_t reduction unit test still failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear on what this change in condition is supposed to achieve

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basically in the first iteration you have this:

thread read_pos write_pos
0 1 0
1 2 1
2 3 2

so threads read something which gets overwritten by the next guy in the same line.

@dalg24 dalg24 merged commit 1d19a32 into kokkos:develop Mar 10, 2022
@ndellingwood
Copy link
Contributor

@dalg24 @crtrott should this be cherry-picked to release-candidate-3.6.0?

@dalg24
Copy link
Member

dalg24 commented Mar 10, 2022

That would be good on second thought no let's not do that at the last minute.

@PhilMiller
Copy link
Contributor Author

I would vote against cherry-picking it too. It's not been observed to cause errors in existing code otherwise, and it risks performance impacts.

@PhilMiller
Copy link
Contributor Author

PhilMiller commented Mar 10, 2022

If it turns out to be entirely sound and low-impact, it could be part of a 3.6.1, which would support people making more use of Nvidia's compute-sanitizer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Developer: Phil Miller
Done in Release 3.7
Development

Successfully merging this pull request may close these issues.

None yet

5 participants