
Add warp sync for Cuda parallel reduce to avoid race condition #6630

Merged

Conversation

tcclevenger (Contributor)

Potential race condition discovered by compute-sanitizer --tool=racecheck on a simple reduce using Cuda:

Real result;
Kokkos::parallel_reduce("simple_reduce", N, KOKKOS_LAMBDA(int, Real&) {
  // Do nothing;
}, result);

Basically, the value of shared[i] below __syncwarp() was being updated on a single thread within a warp in Kokkos_Cuda_ReduceScan.hpp:349, but no synchronization was guaranteed before shared[i] was read on (potentially) different threads.
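The hazard can be sketched as a minimal CUDA kernel (hypothetical names, not the actual Kokkos internals): one lane of a warp writes a shared-memory slot that the other lanes of the same warp then read, so a __syncwarp() is required between the write and the reads.

```cuda
#include <cuda_runtime.h>

// Hypothetical reproduction of the reported hazard: a single lane
// publishes a value to shared memory and the rest of the warp reads it.
__global__ void warp_publish(int* out) {
  __shared__ int shared[32];
  const int warp = threadIdx.x / 32;
  const int lane = threadIdx.x % 32;

  if (lane == 0) {
    shared[warp] = 42;  // updated by a single thread within the warp
  }

  __syncwarp();  // without this, racecheck flags the read below: there is
                 // no ordering between lane 0's write and the other lanes' reads

  out[threadIdx.x] = shared[warp];  // read by (potentially) different threads
}
```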

Closes #6217.
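For reference, the report came from NVIDIA's Compute Sanitizer race checker. Assuming the reduce snippet above is built into an executable named simple_reduce (name is hypothetical), the invocation would look like:

```shell
# Run the Compute Sanitizer race checker on a Cuda binary.
compute-sanitizer --tool=racecheck ./simple_reduce
```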

compute-sanitizer --tool=racecheck discovered a potential race condition for Cuda parallel reductions (using a range policy) where data was being updated on a single thread inside a warp, but the warp was not synchronized before the data was read.
dalg24 (Member) commented Nov 29, 2023

Please comment about the performance implications of that fence

// Inside cuda_single_inter_block_reduce_scan() above, shared[i] below
// might have been updated by a single thread within a warp without
// synchronization afterwards. Synchronize threads within warp to avoid
// potential racecondition.
A reviewer (Member):

Typo. Add the missing white space if you somehow make any change or retrigger CI. Otherwise not worth fixing.

A reviewer (Contributor):

I would at least mention that final is also only run by a single thread and that the result is not necessarily available to the whole warp.
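The point about final() can be sketched as follows (an illustrative fragment assuming a Kokkos-style reducer and a shared result slot; not the actual library code): final() runs on one thread, so the result must be published to the rest of the warp before other lanes read it.

```cuda
// Illustrative fragment; reducer and shared_result are assumed names.
if (threadIdx.x == 0) {
  reducer.final(shared_result);  // final() is run by a single thread only
}
__syncwarp();  // make the finalized result visible to the whole warp
local_value = *shared_result;  // now safe for every lane to read
```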

crtrott (Member) commented Nov 29, 2023

Please comment about the performance implications of that fence

We can certainly check, but that thing semantically needs to be there - even though I doubt anyone can trigger a bug right now for it missing.

tcclevenger (Contributor, Author)

@dalg24 I'm not seeing any significant performance impact: launching parallel reduce kernels, the relative difference in timings compared to develop is under 1%.

dalg24 (Member) left a comment:

Is that the only place it is needed? There are many instances of

      if (CudaTraits::WarpSize < word_count.value) {
        __syncthreads();
      }
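A hedged reading of the guard quoted above (illustrative only; the real Kokkos control flow is more involved): __syncthreads() is a block-wide barrier that is only issued when the shared data spans more than one warp, so on the single-warp path no barrier runs at all, which is exactly where a warp-level sync would still be needed.

```cuda
// Illustrative sketch, not the actual Kokkos code: when the shared
// region fits within one warp, the block barrier is skipped, so a
// warp-level barrier must cover the single-warp path.
if (CudaTraits::WarpSize < word_count.value) {
  __syncthreads();  // data spans multiple warps: block-wide barrier
} else {
  __syncwarp();     // single-warp case: warp barrier is sufficient
}
```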

tcclevenger (Contributor, Author) commented Nov 29, 2023

@dalg24 This is the only place where I've had compute-sanitizer racecheck complain. I could do a more detailed investigation into other places.

crtrott added a commit that referenced this pull request Dec 7, 2023
[4.2.01] Add warp sync for Cuda parallel reduce to avoid race condition #6630
tcclevenger (Contributor, Author)

@crtrott The cherry-pick PR into 4.2.01 (#6631) got merged, but this PR into develop did not. This PR does include an extra commit (changing the comment wording).

dalg24 (Member) commented Jan 18, 2024

The cherry-pick should not have been merged before that one. That was a mistake.

dalg24 (Member) commented Jan 24, 2024

Specifically thinking of

    if (CudaTraits::WarpSize < word_count.value) {
      __syncthreads();
    }

and

    if (CudaTraits::WarpSize < word_count.value) {
      __syncthreads();
    }

@crtrott crtrott merged commit e1415f8 into kokkos:develop Jan 24, 2024
28 of 30 checks passed
crtrott (Member) commented Jan 24, 2024

Yes those two places also need it.

Successfully merging this pull request may close these issues.

compute-sanitizer --tool=racecheck hazard warning in parallel_reduce()