
[HIP] Optimize parallel_reduce #6229

Merged

Conversation

@skyreflectedinmirrors (Contributor) commented Jun 20, 2023:

Based on #6160, I spent a bit of time understanding the original regression introduced by #6029.

Essentially, the issue was most exacerbated when each thread was doing a reduction over a single element of the array (i.e., each thread loads two values, multiplies them, sums, and passes the result to the rest of the reduction).
This was mostly due to the cost of hip_single_inter_block_reduce_scan, namely the runtime integer divisions by (e.g.) blockDim.x in hip_inter_warp_shuffle_reduction and friends.
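For concreteness, this per-thread pattern is just a DOT product. A minimal Kokkos sketch of that workload (not the benchmark's actual code; call it between Kokkos::initialize and Kokkos::finalize):

```cpp
#include <Kokkos_Core.hpp>

// Each thread's contribution is two loads, one multiply, and one add, so
// nearly all of the kernel's cost sits in the reduction machinery itself.
double dot(Kokkos::View<const double*> x, Kokkos::View<const double*> y) {
  double result = 0.0;
  Kokkos::parallel_reduce(
      "dot", x.extent(0),
      KOKKOS_LAMBDA(const int i, double& sum) { sum += x(i) * y(i); },
      result);
  return result;
}
```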
After spending some time playing with various reduction algorithms, I discovered that the SHMEM reduction was faster in pretty much all cases, so I flipped the default from UseShflReduction=true to false.
In addition, I did a bit of cleanup to use __syncthreads_or (available since ROCm 4.5), which bumps perf a little more. Finally, I tweaked @Rombur's heuristic to have three ranges (see the sketch after the list):

  • "small" -> not a lot of parallelism, don't touch it.
  • "intermediate" -> this is where we saw the largest regressions, there's not enough parallelism to hide some of the cost in reductions. Here, we target trying to have four reduction items per-thread (~ what 4.0.01 did)
  • "large" -> lots of parallelism (RHODO is here), so we leave it alone

I've plotted the performance (runtime, in seconds) for DOT products over various commits:

[figure: DOT product runtime (s) across commits]

Generally we see the large perf regressions:

[figure: the earlier perf regressions]

are solved, and the "new reduce + new heur" line is the fastest except for one point where it's a bit slower:

[figure: DOT runtime near the crossover point]

This (as you might guess from the above) happens right at the "4096 blocks" crossover point. Perhaps we could tweak it a bit more, but overall this is a large improvement.

Finally, I compared LAMMPS perf over a number of benchmarks; this maintains perf (to within the typical ~1% noise) compared to @Rombur's patch:

| Benchmark | lammps-dev-baseline | lammps-newreduce |
|-----------|---------------------|------------------|
| eam       | 0.00%               | 0.75%            |
| lj        | 0.00%               | -0.16%           |
| reaxff    | 0.00%               | -0.80%           |
| rhodo     | 0.00%               | -0.20%           |
| snap      | 0.00%               | 0.07%            |
| tersoff   | 0.00%               | 0.04%            |

Nicholas Curtis added 5 commits June 19, 2023 11:40

  • always use lds (Change-Id: I6dd7b347b74c197257160ab8657f57c7c0489cb2)
  • fix half and int8_t reductions (Change-Id: I1ddb1e65768b4df0f556000a3b9fcf7ee4c00e28)
  • Use __syncthreads_or, implemented since ROCm 4.5 (Change-Id: Ibf147742743fa03d97c2d77d65855b21a58db1d9)
  • add heuristic (Change-Id: I9141ddaac84e5d8c590756122d5763407642afcd)
  • tune heuristic for RHODO (Change-Id: Id6b8a29aaaa5613d6bca1ae3031025996bfb8304)
@skyreflectedinmirrors (Contributor, Author):

cc: @Rombur @dalg24

@skyreflectedinmirrors skyreflectedinmirrors changed the title Optimize parallel_reduce [HIP] Optimize parallel_reduce Jun 20, 2023
apply code style patch (Change-Id: I2f1e2f4fcfc563d27f1c6ca31fbd32f81953ac76)
@Rombur Rombur added this to the Release 4.1 milestone Jun 20, 2023
@Rombur (Member) previously approved these changes Jun 20, 2023:

This is great. Thanks.

@masterleinad (Contributor):

> After spending some time playing with various reduction algorithms, I discovered that the SHMEM reduction was faster in pretty much all cases, so I flipped the default from UseShflReduction=true to false.

Do you have any idea why that is? I see similar behavior in #6035 but can't make much sense of it. I mean, the shuffle reduction could always be implemented in terms of local memory, so I would expect better performance if shuffle reductions can be used.

Comment on lines 391 to 394
```cpp
const bool is_last_block = !__syncthreads_or(
    threadIdx.y
        ? 0
        : (1 + atomicInc(global_flags, block_count - 1) < block_count));
```
Member:
Sounds like a convoluted way to do `__syncthreads_and(threadIdx.y ? 1 : (atomicInc(global_flags, block_count - 1) == block_count - 1));`

@skyreflectedinmirrors (Contributor, Author):
heh, I blame our comment from years ago. IIRC, this is how CUDA does it as well:

`const bool is_last_block = !__syncthreads_or(`

Member:
Yes I saw that. Do you agree the version I suggested is more readable though?

@skyreflectedinmirrors (Contributor, Author):
yes, agreed.
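For context on why both forms work: atomicInc returns the pre-increment value and wraps to zero once it reaches its second argument, so exactly one block, the last to arrive, observes block_count - 1. A sketch of the suggested form wrapped in a helper (the function name is made up for illustration, assuming __syncthreads_and is available alongside __syncthreads_or):

```cpp
// Illustrative helper (name hypothetical): returns true in exactly one
// block, the last one to finish, and broadcasts that fact to all of the
// block's threads via __syncthreads_and.
__device__ bool is_last_block_to_arrive(unsigned int* global_flags,
                                        unsigned int block_count) {
  // atomicInc returns the old value and wraps to 0 after block_count - 1,
  // so only the final arrival sees block_count - 1. Only thread y == 0
  // performs the atomic; all other threads contribute a neutral 1.
  return __syncthreads_and(
      threadIdx.y ? 1
                  : (atomicInc(global_flags, block_count - 1) ==
                     block_count - 1));
}
```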

@skyreflectedinmirrors (Contributor, Author) commented Jun 20, 2023:

> Do you have any idea why that is? I see similar behavior in #6035 but can't make much sense of it. I mean, the shuffle reduction could always be implemented in terms of local memory, so I would expect better performance if shuffle reductions can be used.

Mainly, from what I found:

```cpp
int const step = warp_size / blockDim.x;
```

division by a non-compile-time constant, and

```cpp
while (shift <= max_active_thread / step) {
```

division by a non-compile-time constant, in a loop.

I briefly went down the rabbit hole of "can we make these divisions faster" (the answer is probably yes: making them unsigned, using specialty algorithms for bounded numerators, etc.), but then I stumbled across a key difference w/r/t the LDS reductions: they only do divisions by warp_size, which is constexpr (and essentially free compared to runtime divs):

```cpp
int const warp_id = (threadIdx.y * blockDim.x) / warp_size;
int const num_warps = blockDim.x * blockDim.y / warp_size;
```

So, looking at how the CUDA backend does it, I decided not to look a gift horse in the mouth and just used the existing impl :)
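A minimal sketch of the cost difference being described (function names are made up for illustration; warp_size mirrors the constexpr wavefront size on these AMD GPUs):

```cpp
#include <hip/hip_runtime.h>

constexpr int warp_size = 64;  // compile-time constant (AMD wavefront size)

// Division by a constexpr power of two lowers to a single shift.
__device__ int warp_id() {
  return (threadIdx.y * blockDim.x) / warp_size;  // effectively >> 6
}

// blockDim.x is only known at kernel launch, so this emits a full integer
// divide sequence; inside the shuffle reduction's loop, that cost adds up.
__device__ int step() {
  return warp_size / blockDim.x;
}
```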

tweak min block-size (Change-Id: I4b3c016f14621bc23fd2262e5201498d3ec9e8a4)
@skyreflectedinmirrors (Contributor, Author):

Talking with @dalg24 and @Rombur on Slack, they reminded me of the Kokkos tutorials case (specifically: https://github.com/kokkos/kokkos-tutorials/blob/6cc33429eafcbf212fa84934c1b74f7f503011a2/Exercises/04/Solution/exercise_4_solution.cpp#L126-L134) where we saw large dips before #6029.

I looked at that and tweaked the lower bound of the range here a bit to improve perf there. There's still a big dip (it happens right at the 1024 point), but it's far better than with my existing heuristic:

[figure: tutorial exercise perf with the tweaked lower bound]

This does reduce DOT perf a bit for smaller array sizes, but puts us roughly back to what Bruno's patch had (i.e., what was in <= 4.0):

[figure: DOT perf at smaller array sizes]

@Rombur (Member) commented Jun 21, 2023:

Retest this please

@Rombur Rombur added the Blocks Promotion Overview issue for release-blocking bugs label Jun 21, 2023
@masterleinad (Contributor):

What about MDRangePolicy and TeamPolicy? Does it make sense to control the number of blocks in a similar way? After all, they should do the same after executing the functor, right?

@skyreflectedinmirrors (Contributor, Author):

I was just thinking about that, @masterleinad -- we should probably do the same exercise of flipping TeamPolicy to SHMEM reductions (right now it defaults to shuffle for any non-zero static-size element) and of tuning both it and MDRange.

@masterleinad masterleinad removed this from the Release 4.1 milestone Jun 21, 2023
@Rombur (Member) commented Jun 22, 2023:

@arghdos, can you explain what the relax_scratch curve on the plot is?

@skyreflectedinmirrors (Contributor, Author) commented Jun 22, 2023:

> @arghdos, can you explain what the relax_scratch curve on the plot is?

It's the branch in my fork that was the PR for #6029. It uses the same heuristic as "dev" in the DOT plot.

@skyreflectedinmirrors (Contributor, Author):

> I was just thinking about that, @masterleinad -- we should probably do the same exercise of flipping TeamPolicy to SHMEM reductions (right now it defaults to shuffle for any non-zero static-size element) and of tuning both it and MDRange.

Let's open another issue and assign it to me / @IanBogle to follow up on this one.

@ajpowelsnl (Contributor):

@arghdos , @IanBogle -- I will create a Kokkos issue for parallel_reduce with shmem, and reference this issue.

Comment on lines +125 to +139
```cpp
// Conditionally set word_size_type to int16_t or int8_t if value_type is
// smaller than int32_t (Kokkos::HIP::size_type)
// word_size_type is used to determine the word count, shared memory buffer
// size, and global memory buffer size before the scan is performed.
// Within the scan, the word count is recomputed based on word_size_type
// and when calculating indexes into the shared/global memory buffers for
// performing the scan, word_size_type is used again.
// For scalars > 4 bytes in size, indexing into shared/global memory relies
// on the block and grid dimensions to ensure that we index at the correct
// offset rather than at every 4 byte word; such that, when the join is
// performed, we have the correct data that was copied over in chunks of 4
// bytes.
using word_size_type = std::conditional_t<
    sizeof(value_type) < sizeof(size_type),
    std::conditional_t<sizeof(value_type) == 2, int16_t, int8_t>, size_type>;
```
Contributor:
This and related changes should be a separate pull request. AFAICT, this is not for performance but for correctness with small types, right?

@skyreflectedinmirrors (Contributor, Author):
> this is not for performance but for correctness with small types, right?

Correct. I believe what happened here was that when I switched from Shfl -> SHMEM reductions, the tests for the small types started failing, so IMO it would be a bit awkward to split this from

https://github.com/kokkos/kokkos/pull/6229/files#diff-4f2a2e882baf2b7a7ae9265652801ba48521779534ee182a14a1edd8cc1fb164R152

I could see splitting that from the heuristic change, though.
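A minimal illustration of the failure mode, assuming the word count is derived as sizeof(value_type) / sizeof(word_size_type) (an assumption based on the code comment above, not code quoted from the PR):

```cpp
#include <cstdint>

// With a fixed 4-byte word, integer division truncates the word count for
// sub-4-byte scalars to zero, so nothing would be staged into the
// shared/global scratch buffers.
static_assert(sizeof(std::int8_t) / sizeof(std::int32_t) == 0,
              "1-byte scalar yields a word count of 0 with 4-byte words");
// Matching the word type to the scalar restores a nonzero count.
static_assert(sizeof(std::int8_t) / sizeof(std::int8_t) == 1,
              "1-byte scalar yields a word count of 1 with 1-byte words");
```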

@Rombur (Member) commented Jul 12, 2023:

I think it makes sense to have this as part of the PR. The PR needs it, and it's a really small change.

Contributor:

If this pull request indeed exposes the issue we were trying to trigger, I'm fine with doing it here.

Contributor:

This mirrors #4156 and associated issues. It might be preferable to make this a separate PR, or at least put it in separate commits, so that it could be more easily cherry-picked into a patch release.

@Rombur (Member) commented Jul 13, 2023:

I don't see the point of a cherry-pick into the patch release, since you cannot trigger the bug without the other changes in this PR. The reason we didn't need this change until now is that we did not use the buggy code path.

Contributor:

That would explain why in #5333/#5641/#5555 some tests failed on CUDA but not HIP. Judging from those, there are probably several other reduction paths that still need the corresponding fix.

@masterleinad (Contributor) left a comment:
I verified that we indeed need the changes here when forcing reductions to use local memory instead of shuffles.

Update core/src/HIP/Kokkos_HIP_Parallel_Range.hpp (Co-authored-by: Daniel Arndt <arndtd@ornl.gov>)
Comment on the declaration left over from the old counter-based last-block check:

```cpp
__syncthreads();
bool const is_last_block = (n_done == static_cast<int>(block_count));

int n_done = 0;
```

Contributor:

```
/var/jenkins/workspace/Kokkos/core/src/HIP/Kokkos_HIP_ReduceScan.hpp:390:7: error: unused variable 'n_done' [-Werror,-Wunused-variable]
  int n_done               = 0;
      ^
```

Suggested change: delete the now-unused `int n_done = 0;`.
remove unused variable (Change-Id: I41e6819d04fce43d017eec02920d3f6bdc40b52b)
@dalg24 (Member) commented Jul 20, 2023:

Retest this please

@crtrott (Member) left a comment:
I think I am OK with this. I would like to ask folks like Stan Moore to test this out and see if we observe regressions, but I guess we should merge this now and get reports in.

@crtrott crtrott dismissed Rombur’s stale review July 26, 2023 22:49

Stale review, just want it reconfirmed.

@crtrott (Member) commented Jul 26, 2023:

I will dismiss @Rombur's approval; I just want to confirm with him that it's still valid a month later. If he confirms, I will merge.

@crtrott crtrott merged commit 6cffdb4 into kokkos:develop Jul 27, 2023
28 checks passed
tcclevenger pushed a commit to tcclevenger/kokkos that referenced this pull request Jul 27, 2023
* always use lds

Change-Id: I6dd7b347b74c197257160ab8657f57c7c0489cb2

* fix half and int8_t reductions

Change-Id: I1ddb1e65768b4df0f556000a3b9fcf7ee4c00e28

* Use __syncthreads_or, implemented since ROCm 4.5

Change-Id: Ibf147742743fa03d97c2d77d65855b21a58db1d9

* add heuristic

Change-Id: I9141ddaac84e5d8c590756122d5763407642afcd

* tune heuristic for RHODO

Change-Id: Id6b8a29aaaa5613d6bca1ae3031025996bfb8304

* apply code style patch

Change-Id: I2f1e2f4fcfc563d27f1c6ca31fbd32f81953ac76

* tweak min block-size

Change-Id: I4b3c016f14621bc23fd2262e5201498d3ec9e8a4

* Update core/src/HIP/Kokkos_HIP_Parallel_Range.hpp

Co-authored-by: Daniel Arndt <arndtd@ornl.gov>

* remove unused variable

Change-Id: I41e6819d04fce43d017eec02920d3f6bdc40b52b

---------

Co-authored-by: Nicholas Curtis <nicurtis@amd.com>
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
@masterleinad masterleinad mentioned this pull request Sep 14, 2023