[HIP] Improve heuristic deciding the number of blocks used in parallel_reduce #6160
Conversation
I'll give it a shot -- I am curious why it hurts PETSc as well, if you point me in that direction.
From what I see with omniperf, this PR improves L1 cache access.
It looks like this drops RHODO back to where it was before my PR, unfortunately, seemingly because it drops the number of work-groups from 4096 to 1024 (it shouldn't be this sensitive). Is the PETSc case captured in an issue anywhere for us to reproduce? I think we (AMD) need to take a deeper look to see if we can find a heuristic that works for both.
It's just a regular dot product.
I am wondering if we should change the heuristic based on a "lightweight kernel" hint, and/or simply based on the size of the functor?
@arghdos I've updated the PR. By default, users get the behavior that you had implemented. If they say the kernel is lightweight, then we use the new heuristic.
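To illustrate the two-regime idea being discussed, here is a hypothetical sketch of such a heuristic in plain C++. The function name, the covering computation, and the cap multiplier are illustrative assumptions, not the values used in the actual Kokkos HIP backend:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of a two-regime block-count heuristic. The names and
// constants here are illustrative only; they are NOT the values used in the
// actual Kokkos HIP backend.
std::int64_t choose_num_blocks(std::int64_t num_items,
                               std::int64_t block_size,
                               std::int64_t num_cus,
                               bool light_weight) {
  // Enough blocks to cover the whole iteration range once.
  const std::int64_t covering = (num_items + block_size - 1) / block_size;
  if (!light_weight) {
    // Default path: favor many work-groups (the behavior that helps heavy
    // functors such as the RHODO case mentioned above).
    return covering;
  }
  // Lightweight path: cap the grid at a small multiple of the number of
  // compute units, so partial results stay resident and L1 reuse improves.
  const std::int64_t cap = 8 * num_cus;  // illustrative multiplier
  return std::min(covering, cap);
}
```

The point of the hint is that the two paths disagree exactly in the regime the PR is about: for a large range the default path returns thousands of work-groups, while the lightweight path clamps to a CU-derived cap.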
@Rombur -- ok I can confirm this restores RHODO perf. I think this is OK in the short term, but it's a bit of a band-aid (heavyweight shouldn't really have anything to do with it as far as I can tell). I was out on vacation last week, but I am still digging into the root cause of the regression here so we can get a true fix.
I would prefer if you applied my suggestion.
Co-authored-by: Christian Trott <crtrott@sandia.gov>
…l_reduce (kokkos#6160)

* Improve heuristic deciding the number of blocks used in parallel_reduce
* Remove commented code
* Use auto
* Simplify constructor
* Improve comment

Co-authored-by: Christian Trott <crtrott@sandia.gov>

* Fix format

---------

Co-authored-by: Christian Trott <crtrott@sandia.gov>
#6029 introduced a regression when using parallel_reduce to perform a vector dot product (which is what PETSc does). I have a new heuristic based on a vector dot product benchmark. @arghdos @stanmoore1 is this change OK for LAMMPS?
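For context, the PETSc-style use case is an ordinary dot product expressed as a parallel_reduce. A minimal Kokkos sketch follows; the HintLightWeight work-item property shown is Kokkos' existing hint mechanism, and its use here as the opt-in for the new heuristic is an assumption based on the discussion above:

```cpp
#include <Kokkos_Core.hpp>

// Sketch of the PETSc-style workload: a vector dot product via
// parallel_reduce. Whether HintLightWeight selects the new block-count
// heuristic is an assumption based on this PR's discussion.
double dot(Kokkos::View<const double*> x, Kokkos::View<const double*> y) {
  double result = 0.0;
  auto policy = Kokkos::Experimental::require(
      Kokkos::RangePolicy<>(0, x.extent(0)),
      Kokkos::Experimental::WorkItemProperty::HintLightWeight);
  Kokkos::parallel_reduce(
      "dot", policy,
      KOKKOS_LAMBDA(const int i, double& sum) { sum += x(i) * y(i); },
      result);
  return result;
}
```

The functor body here is a single fused multiply-add per element, which is what makes the kernel bandwidth-bound and sensitive to how many work-groups the reduction launches.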