[HIP] Improve heuristic deciding the number of blocks used in parallel_reduce #6160
Conversation
I'll give it a shot -- I am curious why it hurts PETSc as well, if you point me in that direction.
From what I see with omniperf, this PR improves L1 cache access.
It looks like this drops RHODO back to where it was before my PR, unfortunately, seemingly because it drops the number of work-groups from 4096 to 1024 (it shouldn't be this sensitive). Is the PETSc case captured in an issue anywhere for us to reproduce? I think we (AMD) need to take a deeper look to see if we can find a heuristic that works for both.
It's just a regular dot product.
I am wondering if we should change the heuristic based on a "lightweight kernel" hint, and/or simply based on the size of the functor?
@arghdos I've updated the PR. By default, users get the behavior that you had implemented. If they say the kernel is lightweight, then we use the new heuristic.
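To illustrate the two-regime idea being discussed, here is a hypothetical sketch of such a heuristic in plain C++. The function name, the covering computation, and the cap multiplier are illustrative assumptions, not the values used in the actual Kokkos HIP backend:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of a two-regime block-count heuristic. The names and
// constants here are illustrative only; they are NOT the values used in the
// actual Kokkos HIP backend.
std::int64_t choose_num_blocks(std::int64_t num_items,
                               std::int64_t block_size,
                               std::int64_t num_cus,
                               bool light_weight) {
  // Enough blocks to cover the whole iteration range once.
  const std::int64_t covering = (num_items + block_size - 1) / block_size;
  if (!light_weight) {
    // Default path: favor many work-groups (the behavior that helps heavy
    // functors such as the RHODO case mentioned above).
    return covering;
  }
  // Lightweight path: cap the grid at a small multiple of the number of
  // compute units, so partial results stay resident and L1 reuse improves.
  const std::int64_t cap = 8 * num_cus;  // illustrative multiplier
  return std::min(covering, cap);
}
```

The point of the hint is that the two paths disagree exactly in the regime the PR is about: for a large range the default path returns thousands of work-groups, while the lightweight path clamps to a CU-derived cap.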
@Rombur -- ok I can confirm this restores RHODO perf. I think this is OK in the short term, but it's a bit of a band-aid (heavyweight shouldn't really have anything to do with it as far as I can tell). I was out on vacation last week, but I am still digging into the root cause of the regression here so we can get a true fix.
I would prefer if you applied my suggestion.
Co-authored-by: Christian Trott <crtrott@sandia.gov>
…l_reduce (kokkos#6160)

* Improve heuristic deciding the number of blocks used in parallel_reduce
* Remove commented code
* Use auto
* Simplify constructor
* Improve comment

Co-authored-by: Christian Trott <crtrott@sandia.gov>

* Fix format

---------

Co-authored-by: Christian Trott <crtrott@sandia.gov>
#6029 introduced a regression when using parallel_reduce to perform a vector dot product (which is what PETSc does). I have a new heuristic based on a vector dot product benchmark. @arghdos @stanmoore1 is this change OK for LAMMPS?
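For context, the PETSc-style use case is an ordinary dot product expressed as a parallel_reduce. A minimal Kokkos sketch follows; the HintLightWeight work-item property shown is Kokkos' existing hint mechanism, and its use here as the opt-in for the new heuristic is an assumption based on the discussion above:

```cpp
#include <Kokkos_Core.hpp>

// Sketch of the PETSc-style workload: a vector dot product via
// parallel_reduce. Whether HintLightWeight selects the new block-count
// heuristic is an assumption based on this PR's discussion.
double dot(Kokkos::View<const double*> x, Kokkos::View<const double*> y) {
  double result = 0.0;
  auto policy = Kokkos::Experimental::require(
      Kokkos::RangePolicy<>(0, x.extent(0)),
      Kokkos::Experimental::WorkItemProperty::HintLightWeight);
  Kokkos::parallel_reduce(
      "dot", policy,
      KOKKOS_LAMBDA(const int i, double& sum) { sum += x(i) * y(i); },
      result);
  return result;
}
```

The functor body here is a single fused multiply-add per element, which is what makes the kernel bandwidth-bound and sensitive to how many work-groups the reduction launches.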