[HIP] Optimize parallel_reduce #6229
Conversation
always use lds (Change-Id: I6dd7b347b74c197257160ab8657f57c7c0489cb2)
fix half and int8_t reductions (Change-Id: I1ddb1e65768b4df0f556000a3b9fcf7ee4c00e28)
Use __syncthreads_or, implemented since ROCm 4.5 (Change-Id: Ibf147742743fa03d97c2d77d65855b21a58db1d9)
add heuristic (Change-Id: I9141ddaac84e5d8c590756122d5763407642afcd)
tune heuristic for RHODO (Change-Id: Id6b8a29aaaa5613d6bca1ae3031025996bfb8304)
apply code style patch (Change-Id: I2f1e2f4fcfc563d27f1c6ca31fbd32f81953ac76)
This is great. Thanks.
Do you have any idea why that is? I see similar behavior in #6035 but can't make much sense of it. I mean, the shuffle reduction could always be implemented in terms of local memory, so I would expect better performance if shuffle reductions can be used.
    const bool is_last_block = !__syncthreads_or(
        threadIdx.y
            ? 0
            : (1 + atomicInc(global_flags, block_count - 1) < block_count));
Sounds like a convoluted way to do __syncthreads_and(threadIdx.y ? 1 : (atomicInc(global_flags, block_count - 1) == block_count - 1));
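The two formulations really are equivalent by De Morgan's law: since atomicInc(global_flags, block_count - 1) returns a value in [0, block_count - 1], the predicate 1 + x < block_count is exactly !(x == block_count - 1). Below is a host-side sketch checking this, under the assumption that __syncthreads_or / __syncthreads_and can be modeled as a plain OR / AND over every thread's predicate (this is illustrative, not the Kokkos implementation):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Model the barrier intrinsics as OR / AND over all threads' predicates.
bool or_all(const std::vector<bool>& p) {
  bool r = false;
  for (bool b : p) r = r || b;
  return r;
}

bool and_all(const std::vector<bool>& p) {
  bool r = true;
  for (bool b : p) r = r && b;
  return r;
}

// Each thread is a pair (y, last) with y = (threadIdx.y != 0) and
// last = (atomicInc(...) == block_count - 1).

// Original formulation: is_last_block = !__syncthreads_or(y ? 0 : !last)
bool original_form(const std::vector<std::pair<bool, bool>>& threads) {
  std::vector<bool> p;
  for (auto [y, last] : threads) p.push_back(y ? false : !last);
  return !or_all(p);
}

// Suggested formulation: is_last_block = __syncthreads_and(y ? 1 : last)
bool suggested_form(const std::vector<std::pair<bool, bool>>& threads) {
  std::vector<bool> p;
  for (auto [y, last] : threads) p.push_back(y ? true : last);
  return and_all(p);
}
```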
heh, I blame our comment from years ago. IIRC, this is how CUDA does it as well:

    const bool is_last_block = !__syncthreads_or(
Yes I saw that. Do you agree the version I suggested is more readable though?
yes, agreed.
Mainly, from what I found:

- division by a non-compile-time constant
- division by a non-compile-time constant in a loop

I briefly went down the rabbit hole of "can we make these divisions faster" (the answer is probably yes: making them unsigned, using specialty algorithms for bounded numerators, etc.), but then I stumbled across a key difference w.r.t. the LDS reductions: they only divide by warp_size, which is constexpr (and essentially free, compared to runtime divisions).
So, looking at how the CUDA backend does it, I decided not to look a gift horse in the mouth and just use the existing impl :)
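To illustrate the constexpr-vs-runtime divisor point above (the values here are illustrative assumptions, not the Kokkos implementation): when the divisor is a compile-time constant like warp_size, the compiler lowers / and % to shifts and masks, whereas blockDim.x is only known at runtime, so dividing by it needs a real hardware integer divide (worse still inside a loop):

```cpp
#include <cassert>

constexpr unsigned warp_size = 64;  // assumption: AMD wavefront size

// Divisor known at compile time: the compiler can emit tid >> 6 / tid & 63.
unsigned warp_id(unsigned tid) { return tid / warp_size; }
unsigned lane_id(unsigned tid) { return tid % warp_size; }

// Same math with a runtime divisor (e.g. blockDim.x): the compiler has no
// choice but to emit an actual integer division.
unsigned warp_id_runtime(unsigned tid, unsigned block_dim_x) {
  return tid / block_dim_x;
}
```

Both compute the same values, of course; the difference is only in the generated instructions.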
tweak min block-size (Change-Id: I4b3c016f14621bc23fd2262e5201498d3ec9e8a4)
Talking with @dalg24 and @Rombur on Slack, they reminded me of the Kokkos tutorials case (specifically: https://github.com/kokkos/kokkos-tutorials/blob/6cc33429eafcbf212fa84934c1b74f7f503011a2/Exercises/04/Solution/exercise_4_solution.cpp#L126-L134) where we saw large dips before #6029. I looked at that and tweaked the lower bound of the range here a bit to improve perf there. There's still a big dip (this happens right at the 1024 point), but it's far better than with my existing heuristic. This does reduce DOT perf a bit for smaller array sizes, but puts us roughly back to what Bruno's patch had (i.e., what was in <= 4.0):
Retest this please
What about
I was just thinking about that @masterleinad -- we should probably do the same exercise of flipping
@arghdos can you explain what's the
It's the branch I had in my fork that was the PR for #6029. This is using the same heuristic as "dev" in the DOT plot.
Let's open another issue and assign to me / @IanBogle to follow up on this one.
@arghdos, @IanBogle -- I will create a Kokkos issue for
    // Conditionally set word_size_type to int16_t or int8_t if value_type is
    // smaller than int32_t (Kokkos::HIP::size_type)
    // word_size_type is used to determine the word count, shared memory buffer
    // size, and global memory buffer size before the scan is performed.
    // Within the scan, the word count is recomputed based on word_size_type
    // and when calculating indexes into the shared/global memory buffers for
    // performing the scan, word_size_type is used again.
    // For scalars > 4 bytes in size, indexing into shared/global memory relies
    // on the block and grid dimensions to ensure that we index at the correct
    // offset rather than at every 4 byte word; such that, when the join is
    // performed, we have the correct data that was copied over in chunks of 4
    // bytes.
    using word_size_type = std::conditional_t<
        sizeof(value_type) < sizeof(size_type),
        std::conditional_t<sizeof(value_type) == 2, int16_t, int8_t>, size_type>;
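A standalone sketch of this type selection, assuming size_type stands in for Kokkos::HIP::size_type as a 32-bit unsigned type: scalars smaller than 4 bytes get a word type of matching width so buffer sizing and indexing never read past the value, while everything else keeps the 4-byte word:

```cpp
#include <cstdint>
#include <type_traits>

using size_type = std::uint32_t;  // assumption: stand-in for Kokkos::HIP::size_type

template <class value_type>
using word_size_type = std::conditional_t<
    sizeof(value_type) < sizeof(size_type),
    std::conditional_t<sizeof(value_type) == 2, std::int16_t, std::int8_t>,
    size_type>;

// 1- and 2-byte scalars pick a matching-width word type ...
static_assert(std::is_same_v<word_size_type<std::int8_t>, std::int8_t>);
static_assert(std::is_same_v<word_size_type<std::int16_t>, std::int16_t>);
// ... while 4-byte and larger scalars keep the 4-byte word.
static_assert(std::is_same_v<word_size_type<float>, size_type>);
static_assert(std::is_same_v<word_size_type<double>, size_type>);
```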
This and related changes should be a separate pull request. AFAICT, this is not for performance but for correctness with small types, right?
this is not for performance but for correctness with small types, right?
Correct. I believe what happened here was that when I switched from Shfl->SHMEM reductions, the tests for the small types started failing, so it would be a bit awkward IMO to split this from
I could see splitting that from the heuristic change tho
I think it makes sense to have this part of the PR. The PR needs this and it's a really small change.
If this pull request indeed exposes the issue we were trying to trigger, I'm fine with doing it here.
This mirrors #4156, and associated issues. Maybe preferable to make it a separate PR, or at least in separate commits, so that it could be more easily cherry-picked to a patch release.
I don't see the point of a cherry-pick into a patch release, since you cannot trigger the bug without the other changes in this PR. The reason we didn't need this change until now is that we did not use the buggy code path.
I verified that we indeed need the changes here when forcing reductions to use local memory instead of shuffles.
Update core/src/HIP/Kokkos_HIP_Parallel_Range.hpp (Co-authored-by: Daniel Arndt <arndtd@ornl.gov>)
    __syncthreads();
    bool const is_last_block = (n_done == static_cast<int>(block_count));

    int n_done = 0;
    /var/jenkins/workspace/Kokkos/core/src/HIP/Kokkos_HIP_ReduceScan.hpp:390:7: error: unused variable 'n_done' [-Werror,-Wunused-variable]
          int n_done = 0;
              ^
remove unused variable (Change-Id: I41e6819d04fce43d017eec02920d3f6bdc40b52b)
Retest this please
I think I am OK with this. I would like to ask folks like Stan Moore to test this out and see if we observe regressions, but we should merge this now and get reports in, I guess.
I will dismiss @Rombur's approval; I just want to confirm with him that it's still valid a month later. If he confirms, I will merge.
* always use lds (Change-Id: I6dd7b347b74c197257160ab8657f57c7c0489cb2)
* fix half and int8_t reductions (Change-Id: I1ddb1e65768b4df0f556000a3b9fcf7ee4c00e28)
* Use __syncthreads_or, implemented since ROCm 4.5 (Change-Id: Ibf147742743fa03d97c2d77d65855b21a58db1d9)
* add heuristic (Change-Id: I9141ddaac84e5d8c590756122d5763407642afcd)
* tune heuristic for RHODO (Change-Id: Id6b8a29aaaa5613d6bca1ae3031025996bfb8304)
* apply code style patch (Change-Id: I2f1e2f4fcfc563d27f1c6ca31fbd32f81953ac76)
* tweak min block-size (Change-Id: I4b3c016f14621bc23fd2262e5201498d3ec9e8a4)
* Update core/src/HIP/Kokkos_HIP_Parallel_Range.hpp (Co-authored-by: Daniel Arndt <arndtd@ornl.gov>)
* remove unused variable (Change-Id: I41e6819d04fce43d017eec02920d3f6bdc40b52b)

Co-authored-by: Nicholas Curtis <nicurtis@amd.com>
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
Based on #6160, I spent a bit of time understanding the original regression introduced by #6029.

Essentially, the issue was most exacerbated when each thread was doing a reduction over a single element of the array (i.e., each thread loads two values, multiplies them, sums, and passes the result to the rest of the reduction). This was mostly due to the cost of hip_single_inter_block_reduce_scan, namely the runtime integer divisions of (e.g.) blockDim.x in hip_inter_warp_shuffle_reduction and friends.

After spending some time playing with various reduction algorithms, I discovered that the SHMEM reduction was faster in pretty much all cases, so I flipped the default from UseShflReduction=true to false.

In addition, I did a bit of cleanup to use __syncthreads_or (implemented since ROCm 4.5), which bumps perf a little more, and finally, I tweaked @Rombur's heuristic to have three ranges:

I've plotted the performance (runtime, in seconds) for DOT products over various commits. Generally, we see that the large perf regressions are solved, and the "new reduce + new heur" line is fastest except for one point where it's a bit slower: this (as you might guess from the above) happens right at the "4096 blocks" crossover point. Perhaps we could tweak it a bit more, but generally this is a large improvement.

Finally, I compared LAMMPS perf over a number of benchmarks, and this maintains perf (to within the typical noise of ~1%) compared to @Rombur's patch:
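For reference, the shared-memory (LDS) reduction pattern discussed above can be modeled on the host as a tree reduction: each round halves the number of active "threads", with each active thread adding its partner's partial result. This is an illustrative sketch under that assumption, not the actual Kokkos kernel:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of an LDS tree reduction over a power-of-two-sized buffer.
double lds_tree_reduce(std::vector<double> shmem) {
  std::size_t n = shmem.size();  // assumed to be a power of two here
  for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
    // The serial loop stands in for the block's threads running in lockstep.
    for (std::size_t tid = 0; tid < stride; ++tid)
      shmem[tid] += shmem[tid + stride];
    // A real kernel would __syncthreads() between rounds.
  }
  return shmem[0];  // thread 0 holds the block's reduced value
}
```

The final per-block value would then feed the inter-block step guarded by the is_last_block logic discussed earlier in the thread.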