
[HIP] Improve heuristic deciding the number of blocks used in parallel_reduce #6160

Merged: 6 commits merged into kokkos:develop on Jun 9, 2023

Conversation

@Rombur (Member) commented on May 24, 2023:

#6029 introduced a regression when using parallel_reduce to perform a vector dot product (which is done by PETSc). I have a new heuristic based on a vector dot product benchmark.
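
For context, a minimal sketch of the pattern that regressed -- a plain dot-product reduction over a 1-D range, the kind of reduction PETSc issues (the `dot` wrapper and view names here are illustrative, not PETSc's actual code):

```cpp
#include <Kokkos_Core.hpp>

// Illustrative dot product: a single parallel_reduce over a 1-D range whose
// performance depends on how many blocks the HIP backend launches.
double dot(Kokkos::View<const double*> x, Kokkos::View<const double*> y) {
  double result = 0.0;
  Kokkos::parallel_reduce(
      "dot", Kokkos::RangePolicy<>(0, x.extent(0)),
      KOKKOS_LAMBDA(const int i, double& sum) { sum += x(i) * y(i); },
      result);
  return result;
}
```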

@arghdos @stanmoore1 is this change OK for LAMMPS?

@dalg24 added the Performance (code showing unusually slow performance for an architecture and/or backend) and Backend - HIP labels on May 25, 2023
@skyreflectedinmirrors (Contributor) commented:

I'll give it a shot -- I am also curious why it hurts PETSc, if you can point me in that direction.

@Rombur (Member, Author) commented on May 25, 2023:

From what I see with omniperf, this PR improves L1 cache access.

@skyreflectedinmirrors (Contributor) commented on May 25, 2023:

It looks like this drops RHODO back to where it was before my PR, unfortunately:

[benchmark plot: RHODO performance with this PR drops back to its pre-#6029 level]

Seemingly because it drops the number of workgroups from 4096 to 1024 (it shouldn't be this sensitive).

Is the PETSc case captured in an issue anywhere for us to reproduce? I think we (AMD) need to take a deeper look to see whether we can find a heuristic that works for both.

@Rombur (Member, Author) commented on May 25, 2023:

> Is the PETSc case captured in an issue anywhere for us to reproduce? I think we (AMD) need to take a deeper look to see whether we can find a heuristic that works for both.

It's just a regular dot product.

@crtrott (Member) commented on May 25, 2023:

I am wondering if we should change the heuristic based on the "lightweight kernel" hint, and/or simply based on the size of the functor?

@Rombur (Member, Author) commented on Jun 7, 2023:

@arghdos I've updated the PR. By default, users get the behavior that you had implemented; if they mark the kernel as lightweight, we use the new heuristic instead.
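
A minimal usage sketch of opting into the new heuristic, assuming the "lightweight" marking is given through the standard `Kokkos::Experimental::WorkItemProperty::HintLightWeight` policy property (the `dot_lightweight` wrapper and view names are illustrative):

```cpp
#include <Kokkos_Core.hpp>

// Sketch: attach the light-weight hint to the execution policy; only hinted
// kernels take the new block-count heuristic, all others keep the old behavior.
double dot_lightweight(Kokkos::View<const double*> x,
                       Kokkos::View<const double*> y) {
  double result = 0.0;
  auto policy = Kokkos::Experimental::require(
      Kokkos::RangePolicy<>(0, x.extent(0)),
      Kokkos::Experimental::WorkItemProperty::HintLightWeight);
  Kokkos::parallel_reduce(
      "dot_lightweight", policy,
      KOKKOS_LAMBDA(const int i, double& sum) { sum += x(i) * y(i); },
      result);
  return result;
}
```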

@skyreflectedinmirrors (Contributor) commented:
@Rombur -- OK, I can confirm this restores RHODO performance. I think this is fine in the short term, but it's a bit of a band-aid (the heavyweight/lightweight distinction shouldn't really have anything to do with it, as far as I can tell).

I was out on vacation last week, but I am still digging into the root cause of the regression here so we can get a true fix.

@dalg24 (Member) left a comment:

I would prefer if you applied my suggestion.

Review suggestion on core/src/HIP/Kokkos_HIP_Parallel_Range.hpp (outdated, resolved)
Co-authored-by: Christian Trott <crtrott@sandia.gov>
@dalg24 merged commit e200ba1 into kokkos:develop on Jun 9, 2023 (28 checks passed).
@Rombur mentioned this pull request on Jun 9, 2023.
nliber pushed a commit to nliber/kokkos referencing this pull request on Jun 22, 2023:

Improve heuristic deciding the number of blocks used in parallel_reduce (kokkos#6160)

* Improve heuristic deciding the number of blocks used in parallel_reduce
* Remove commented code
* Use auto
* Simplify constructor
* Improve comment
* Fix format

Co-authored-by: Christian Trott <crtrott@sandia.gov>
@Rombur deleted the new_reduce branch on October 2, 2023 at 15:34.
Labels: Backend - HIP, Performance (code showing unusually slow performance for an architecture and/or backend)
Projects: none
Linked issues: none
5 participants