[HIP] Optimize parallel_reduce #6229
Conversation
always use lds (Change-Id: I6dd7b347b74c197257160ab8657f57c7c0489cb2)
fix half and int8_t reductions (Change-Id: I1ddb1e65768b4df0f556000a3b9fcf7ee4c00e28)
Use __syncthreads_or, implemented since ROCm 4.5 (Change-Id: Ibf147742743fa03d97c2d77d65855b21a58db1d9)
add heuristic (Change-Id: I9141ddaac84e5d8c590756122d5763407642afcd)
tune heuristic for RHODO (Change-Id: Id6b8a29aaaa5613d6bca1ae3031025996bfb8304)
apply code style patch (Change-Id: I2f1e2f4fcfc563d27f1c6ca31fbd32f81953ac76)
This is great. Thanks.
Do you have any idea why that is? I see similar behavior in #6035 but can't make much sense of it. I mean, the shuffle reduction could always be implemented in terms of local memory, so I would expect better performance if shuffle reductions can be used.
    const bool is_last_block = !__syncthreads_or(
        threadIdx.y
            ? 0
            : (1 + atomicInc(global_flags, block_count - 1) < block_count));
Sounds like a convoluted way to do __syncthreads_and(threadIdx.y ? 1 : (atomicInc(global_flags, block_count - 1) == block_count - 1));
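The two formulations really are equivalent by De Morgan's law: since atomicInc(global_flags, block_count - 1) returns a value in [0, block_count - 1], the predicate 1 + x < block_count is exactly !(x == block_count - 1). Below is a host-side sketch checking this, under the assumption that __syncthreads_or / __syncthreads_and can be modeled as a plain OR / AND over every thread's predicate (this is illustrative, not the Kokkos implementation):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Model the barrier intrinsics as OR / AND over all threads' predicates.
bool or_all(const std::vector<bool>& p) {
  bool r = false;
  for (bool b : p) r = r || b;
  return r;
}

bool and_all(const std::vector<bool>& p) {
  bool r = true;
  for (bool b : p) r = r && b;
  return r;
}

// Each thread is a pair (y, last) with y = (threadIdx.y != 0) and
// last = (atomicInc(...) == block_count - 1).

// Original formulation: is_last_block = !__syncthreads_or(y ? 0 : !last)
bool original_form(const std::vector<std::pair<bool, bool>>& threads) {
  std::vector<bool> p;
  for (auto [y, last] : threads) p.push_back(y ? false : !last);
  return !or_all(p);
}

// Suggested formulation: is_last_block = __syncthreads_and(y ? 1 : last)
bool suggested_form(const std::vector<std::pair<bool, bool>>& threads) {
  std::vector<bool> p;
  for (auto [y, last] : threads) p.push_back(y ? true : last);
  return and_all(p);
}
```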
heh, I blame our comment from years ago. IIRC, this is how CUDA does it as well:

    const bool is_last_block = !__syncthreads_or(
Yes I saw that. Do you agree the version I suggested is more readable though?
yes, agreed.
Mainly, from what I found:

- division by a non-compile-time constant
- division by a non-compile-time constant in a loop

I briefly went down the rabbit hole of "can we make these divisions faster" (the answer is probably yes: making them unsigned, using specialty algorithms for bounded numerators, etc.), but then I stumbled across a key difference w.r.t. the LDS reductions: they only divide by warp_size, which is constexpr (and essentially free, compared to runtime divisions).
So, looking at how the CUDA backend does it, I decided not to look a gift horse in the mouth and just use the existing impl :)
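To illustrate the constexpr-vs-runtime divisor point above (the values here are illustrative assumptions, not the Kokkos implementation): when the divisor is a compile-time constant like warp_size, the compiler lowers / and % to shifts and masks, whereas blockDim.x is only known at runtime, so dividing by it needs a real hardware integer divide (worse still inside a loop):

```cpp
#include <cassert>

constexpr unsigned warp_size = 64;  // assumption: AMD wavefront size

// Divisor known at compile time: the compiler can emit tid >> 6 / tid & 63.
unsigned warp_id(unsigned tid) { return tid / warp_size; }
unsigned lane_id(unsigned tid) { return tid % warp_size; }

// Same math with a runtime divisor (e.g. blockDim.x): the compiler has no
// choice but to emit an actual integer division.
unsigned warp_id_runtime(unsigned tid, unsigned block_dim_x) {
  return tid / block_dim_x;
}
```

Both compute the same values, of course; the difference is only in the generated instructions.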
tweak min block-size (Change-Id: I4b3c016f14621bc23fd2262e5201498d3ec9e8a4)
Talking with @dalg24 and @Rombur on Slack, they reminded me of the Kokkos tutorials case (specifically: https://github.com/kokkos/kokkos-tutorials/blob/6cc33429eafcbf212fa84934c1b74f7f503011a2/Exercises/04/Solution/exercise_4_solution.cpp#L126-L134) where we saw large dips before #6029. I looked at that and tweaked the lower bound of the range here a bit to improve perf there. There's still a big dip (this happens right at the 1024 point), but it's far better than with my existing heuristic. This does reduce DOT perf a bit for smaller array sizes, but puts us roughly back to what Bruno's patch had (i.e., what was in <= 4.0):
Retest this please
What about
I was just thinking about that @masterleinad -- we should probably do the same exercise of flipping
@arghdos can you explain what's the
It's the branch I had in my fork that was the PR for #6029. This is using the same heuristic as "dev" in the DOT plot.
Let's open another issue and assign to me / @IanBogle to follow up on this one.
@arghdos, @IanBogle -- I will create a Kokkos issue for
    // Conditionally set word_size_type to int16_t or int8_t if value_type is
    // smaller than int32_t (Kokkos::HIP::size_type)
    // word_size_type is used to determine the word count, shared memory buffer
    // size, and global memory buffer size before the scan is performed.
    // Within the scan, the word count is recomputed based on word_size_type
    // and when calculating indexes into the shared/global memory buffers for
    // performing the scan, word_size_type is used again.
    // For scalars > 4 bytes in size, indexing into shared/global memory relies
    // on the block and grid dimensions to ensure that we index at the correct
    // offset rather than at every 4 byte word; such that, when the join is
    // performed, we have the correct data that was copied over in chunks of 4
    // bytes.
    using word_size_type = std::conditional_t<
        sizeof(value_type) < sizeof(size_type),
        std::conditional_t<sizeof(value_type) == 2, int16_t, int8_t>, size_type>;
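A standalone sketch of this type selection, assuming size_type stands in for Kokkos::HIP::size_type as a 32-bit unsigned type: scalars smaller than 4 bytes get a word type of matching width so buffer sizing and indexing never read past the value, while everything else keeps the 4-byte word:

```cpp
#include <cstdint>
#include <type_traits>

using size_type = std::uint32_t;  // assumption: stand-in for Kokkos::HIP::size_type

template <class value_type>
using word_size_type = std::conditional_t<
    sizeof(value_type) < sizeof(size_type),
    std::conditional_t<sizeof(value_type) == 2, std::int16_t, std::int8_t>,
    size_type>;

// 1- and 2-byte scalars pick a matching-width word type ...
static_assert(std::is_same_v<word_size_type<std::int8_t>, std::int8_t>);
static_assert(std::is_same_v<word_size_type<std::int16_t>, std::int16_t>);
// ... while 4-byte and larger scalars keep the 4-byte word.
static_assert(std::is_same_v<word_size_type<float>, size_type>);
static_assert(std::is_same_v<word_size_type<double>, size_type>);
```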
This and related changes should be a separate pull request. AFAICT, this is not for performance but for correctness with small types, right?
this is not for performance but for correctness with small types, right?
Correct. I believe what happened here was that when I switched from Shfl->SHMEM reductions, the tests for the small types started failing, so it would be a bit awkward IMO to split this from
I could see splitting that from the heuristic change tho
I think it makes sense to have this part of the PR. The PR needs this and it's a really small change.
If this pull request indeed exposes the issue we were trying to trigger, I'm fine with doing it here.
This mirrors #4156, and associated issues. Maybe preferable to make it a separate PR, or at least in separate commits, so that it could be more easily cherry-picked to a patch release.
I don't see the point of a cherry-pick into a patch release, since you cannot trigger the bug without the other changes in this PR. The reason we didn't need this change until now is that we did not use the buggy code path.
I verified that we indeed need the changes here when forcing reductions to use local memory instead of shuffles.
Update core/src/HIP/Kokkos_HIP_Parallel_Range.hpp (Co-authored-by: Daniel Arndt <arndtd@ornl.gov>)
    __syncthreads();
    bool const is_last_block = (n_done == static_cast<int>(block_count));

    int n_done = 0;
    /var/jenkins/workspace/Kokkos/core/src/HIP/Kokkos_HIP_ReduceScan.hpp:390:7: error: unused variable 'n_done' [-Werror,-Wunused-variable]
          int n_done = 0;
              ^
remove unused variable (Change-Id: I41e6819d04fce43d017eec02920d3f6bdc40b52b)
Retest this please
I think I am OK with this. I would like to ask folks like Stan Moore to test this out and see if we observe regressions, but we should merge this now and get reports in, I guess.
I will dismiss @Rombur's approval; I just want to confirm with him that it's still valid a month later. If he confirms, I will merge.
* always use lds (Change-Id: I6dd7b347b74c197257160ab8657f57c7c0489cb2)
* fix half and int8_t reductions (Change-Id: I1ddb1e65768b4df0f556000a3b9fcf7ee4c00e28)
* Use __syncthreads_or, implemented since ROCm 4.5 (Change-Id: Ibf147742743fa03d97c2d77d65855b21a58db1d9)
* add heuristic (Change-Id: I9141ddaac84e5d8c590756122d5763407642afcd)
* tune heuristic for RHODO (Change-Id: Id6b8a29aaaa5613d6bca1ae3031025996bfb8304)
* apply code style patch (Change-Id: I2f1e2f4fcfc563d27f1c6ca31fbd32f81953ac76)
* tweak min block-size (Change-Id: I4b3c016f14621bc23fd2262e5201498d3ec9e8a4)
* Update core/src/HIP/Kokkos_HIP_Parallel_Range.hpp (Co-authored-by: Daniel Arndt <arndtd@ornl.gov>)
* remove unused variable (Change-Id: I41e6819d04fce43d017eec02920d3f6bdc40b52b)

Co-authored-by: Nicholas Curtis <nicurtis@amd.com>
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Based on #6160, I spent a bit of time understanding the original regression introduced by #6029.

Essentially, the issue was most exacerbated when each thread was doing a reduction over a single element of the array (i.e., each thread loads two values, multiplies them, sums, and passes the result to the rest of the reduction). This was mostly due to the cost of hip_single_inter_block_reduce_scan, namely the runtime integer divisions of (e.g.) blockDim.x in hip_inter_warp_shuffle_reduction and friends.

After spending some time playing with various reduction algorithms, I discovered that the SHMEM reduction was faster in pretty much all cases, so I flipped the default from UseShflReduction=true to false.

In addition, I did a bit of cleanup to use __syncthreads_or (implemented since ROCm 4.5), which bumps perf a little more, and finally, I tweaked @Rombur's heuristic to have three ranges:

I've plotted the performance (runtime, in seconds) for DOT products over various commits. Generally, we see that the large perf regressions are solved, and the "new reduce + new heur" line is fastest except for one point where it's a bit slower: this (as you might guess from the above) happens right at the "4096 blocks" crossover point. Perhaps we could tweak it a bit more, but generally this is a large improvement.

Finally, I compared LAMMPS perf over a number of benchmarks, and this maintains perf (to within the typical noise of ~1%) compared to @Rombur's patch:
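For reference, the shared-memory (LDS) reduction pattern discussed above can be modeled on the host as a tree reduction: each round halves the number of active "threads", with each active thread adding its partner's partial result. This is an illustrative sketch under that assumption, not the actual Kokkos kernel:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of an LDS tree reduction over a power-of-two-sized buffer.
double lds_tree_reduce(std::vector<double> shmem) {
  std::size_t n = shmem.size();  // assumed to be a power of two here
  for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
    // The serial loop stands in for the block's threads running in lockstep.
    for (std::size_t tid = 0; tid < stride; ++tid)
      shmem[tid] += shmem[tid + stride];
    // A real kernel would __syncthreads() between rounds.
  }
  return shmem[0];  // thread 0 holds the block's reduced value
}
```

The final per-block value would then feed the inter-block step guarded by the is_last_block logic discussed earlier in the thread.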