Add support for shuffle reduction for the HIP backend #3154
Conversation
This looks OK to me, comparing with the corresponding
Correct.
Looks OK.
if ((blockDim.x * blockDim.y) > i) {
  value_type tmp = Kokkos::Experimental::shfl_down(value, i, warp_size);
  if (id + i < gridDim.x) join(value, tmp);
}
active += __ballot(1);
__syncthreads();
Wait, why do you do the syncthreads here? This is within a warp?
@@ -231,6 +232,7 @@ __device__ inline void hip_intra_warp_reduction(
// blockDim.y)
if (threadIdx.y + shift < max_active_thread) reducer.join(result, tmp);
shift *= 2;
__syncthreads();
Again, why syncthreads?
if ((blockDim.x * blockDim.y) > i) {
  value_type tmp = Kokkos::Experimental::shfl_down(value, i, warp_size);
  if (id + i < gridDim.x) reducer.join(value, tmp);
}
active += __ballot(1);
__syncthreads();
Why syncthreads?
There are syncthreads in the warp level reductions. Why?
Because I have a race condition at the warp level :( I plan to bring the problem up with AMD at our next meeting with them. I want to point out that there is a similar problem with CUDA Clang, and there you fixed it by calling a dummy
OK, this is fine. I also checked that the ThreadVector reduce doesn't hit the syncthreads, which would be a deadlock.
I haven't done any performance comparison yet, but I have added a FIXME_HIP_PERFORMANCE comment so we don't forget to tune the algorithm.