
Fix race condition in HIP backend #3467

Merged — 2 commits merged into kokkos:develop from fix_shuffle, Oct 8, 2020
Conversation

@Rombur (Member) commented on Oct 8, 2020

This PR adds new __syncthreads() calls after shuffles and moves another synchronization to just after a shuffle. This allows us to enable one more test. This should not be necessary, but there is a race condition in shuffle with HIP. This bug has already been reported to AMD.

@crtrott (Member) left a comment

You can't call __syncthreads inside of vector-level operations. Maybe a threadfence will work instead? Or we could try a local __shared__ variable reduction instead of shfl?
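To illustrate the objection (a hypothetical HIP/CUDA kernel, not Kokkos code): __syncthreads() is a block-wide barrier, so every thread in the block must reach it, but inside a vector-level operation only the lanes of a single thread row execute the code.

```cpp
// Hypothetical kernel sketching why __syncthreads() cannot appear in
// vector-level code: the barrier sits inside a branch that only the
// threads with threadIdx.y == 0 execute, so the remaining rows of the
// block never arrive at it, which is undefined behavior (typically a
// deadlock).
__global__ void barrier_in_divergent_branch() {
  if (threadIdx.y == 0) {
    __syncthreads();  // only part of the block reaches this barrier
  }
}
```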

@@ -297,6 +300,9 @@ class HIPTeamMember {
for (int i = blockDim.x; (i >>= 1);) {
::Kokkos::Experimental::Impl::in_place_shfl_down(tmp2, tmp, i,
blockDim.x);
// FIXME_HIP the synchronization point is necessary because of a race
// condition in shuffle
__syncthreads();
Member

you can't do this. Not in a vector_reduce.

Member

i.e. this will not be called by all threads in the team.

Member

did you try using a memory fence instead?

@@ -308,6 +314,9 @@ class HIPTeamMember {
// and thus different threads could have different results.

::Kokkos::Experimental::Impl::in_place_shfl(tmp2, tmp, 0, blockDim.x);
// FIXME_HIP the synchronization point is necessary because of a race
// condition in shuffle
__syncthreads();
Member

can't do this in vector_reduce.

@@ -969,6 +981,9 @@ KOKKOS_INLINE_FUNCTION void parallel_scan(
for (int j = 1; j < static_cast<int>(blockDim.x); j <<= 1) {
value_type tmp = 0;
::Kokkos::Experimental::Impl::in_place_shfl_up(tmp, sval, j, blockDim.x);
// FIXME_HIP the synchronization point is necessary because of a race
// condition in shuffle
__syncthreads();
Member

can't do this in ThreadVectorRange

@@ -983,6 +998,9 @@ KOKKOS_INLINE_FUNCTION void parallel_scan(
// Accumulate the last value in the inclusive scan:
::Kokkos::Experimental::Impl::in_place_shfl(sval, sval, blockDim.x - 1,
blockDim.x);
// FIXME_HIP the synchronization point is necessary because of a race
// condition in shuffle
__syncthreads();
Member

can't do this in ThreadVectorRange

@@ -1025,6 +1043,9 @@ KOKKOS_INLINE_FUNCTION void single(
#ifdef __HIP_DEVICE_COMPILE__
if (threadIdx.x == 0) lambda(val);
::Kokkos::Experimental::Impl::in_place_shfl(val, val, 0, blockDim.x);
// FIXME_HIP the synchronization point is necessary because of a race
// condition in shuffle
__syncthreads();
Member

can't do this in PerThread single struct

@dalg24 (Member) left a comment

How about

Experimental::Impl::in_place_shfl(result, value, 0, width);

@@ -170,6 +170,9 @@ class HIPTeamMember {
::Kokkos::Experimental::Impl::in_place_shfl(
val, tmp, blockDim.x * thread_id, blockDim.x * blockDim.y);
}
// FIXME_HIP the synchronization point is necessary because of a race
// condition in shuffle
__syncthreads();
Member

Why did you not just put the synchronization point in the shuffle function?

Member

Again, this doesn't work, since these functions are not called block-synchronously ...

@Rombur (Member, Author) commented on Oct 8, 2020

I have reworked the PR. I added __threadfence() after the calls to __shfl and removed the synchronizations that are not needed.
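The merged pattern looks roughly like this (a sketch only, not the exact Kokkos code; the wrapper name `shfl_then_fence` is made up for illustration):

```cpp
// Sketch of the merged workaround: follow each HIP shuffle with a
// __threadfence(). Unlike __syncthreads(), a memory fence does not
// require all threads in the block to participate, so it is safe inside
// vector-level operations such as vector_reduce or ThreadVectorRange.
// FIXME_HIP: this should not be needed; it works around a race
// condition in HIP's shuffle that has been reported to AMD.
template <typename T>
__device__ T shfl_then_fence(T const& val, int src_lane, int width) {
  T result;
  ::Kokkos::Experimental::Impl::in_place_shfl(result, val, src_lane, width);
  __threadfence();  // order the shuffled value before dependent reads
  return result;
}
```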

@dalg24 dalg24 merged commit e83c557 into kokkos:develop Oct 8, 2020
@Rombur Rombur deleted the fix_shuffle branch October 13, 2020 18:29