Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement new blocksize deduction method for HIP Backend #3953

Merged

Conversation

skyreflectedinmirrors
Copy link
Contributor

@skyreflectedinmirrors skyreflectedinmirrors commented Apr 21, 2021

The general theory is to switch between a blocksize of 1024 and 256 depending on resource usage.
For places where resource use is light (e.g., STREAM-y) kernels -> 1024
For places with lots of resource use, or spills -> 256

This also fixes our long-standing issues with incorrect resource use calculations, resulting from passing the FunctorTypes into the blocksize deduction versus the Driver/ClosureType's, and unifies the blocksize deduction's that were scattered all over the place (ParallelRange, Teams, MDRange, etc.) into a common set of methods such that we can more easily modify them

As some sample data-points, on an MI-100 this results in:

  • an approximately 50% speedup in the GramSchmidt tester
  • an approximately 30% performance uplift for ReaxFF / Tersoff for upstream LAMMPS / Kokkos, and is neutral for EAM/LJ (within +/- 1%) -- note: I'm using llvm from f9a8c6a0e50540f68e6740a849a7caf5e4d46ca6 as it fixes issues present in the current mainline ROCm compilers.

Some additional test cases we might want to look at:

  • VTK-m had a 0x100f crash that I anticipate should be fixed by this [still testing this]
  • Maybe some Kokkos kernels results to make sure I didn't throw a wrench in over there (cc: @lucbv)

If we have some more test cases to run, I'd be happy to take a look at them as it would be good to tune the heuristic as broadly as possible :)

@skyreflectedinmirrors skyreflectedinmirrors changed the title [WIP] Implement new blocksize deduction method for HIP Backend Implement new blocksize deduction method for HIP Backend Apr 21, 2021
}
block_size >>= 1;
} while (block_size >= HIPTraits::WarpSize);
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use KOKKOS_ASSERT so we don't propagate a wrong block size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we could put an assert in here -- but the block_size=0 case is handled differently by the various layers, e.g. ParallelReduce would error out here:

https://github.com/kokkos/kokkos/blob/develop/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp#L961

But, I think we could figure out a way to make this more uniform between the backends.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss what we want to do on this and #3944 on Monday

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 2145bcd for a potential implementation

core/src/HIP/Kokkos_HIP_BlockSize_Deduction.hpp Outdated Show resolved Hide resolved
core/src/HIP/Kokkos_HIP_Parallel_MDRange.hpp Outdated Show resolved Hide resolved
core/src/HIP/Kokkos_HIP_Parallel_MDRange.hpp Outdated Show resolved Hide resolved
core/src/HIP/Kokkos_HIP_Parallel_Team.hpp Outdated Show resolved Hide resolved
@skyreflectedinmirrors skyreflectedinmirrors force-pushed the hip_new_blocksize_deduction branch 4 times, most recently from 51ad919 to e09090a Compare April 26, 2021 16:08
@skyreflectedinmirrors
Copy link
Contributor Author

skyreflectedinmirrors commented Apr 26, 2021

Added a change to use a scoped enum as per our discussion this morning. edit: resubmitted, missed a few bare bools in the Teams bits.

@masterleinad
Copy link
Contributor

Retest this please.

The general theory is to switch between a blocksize of 1024 and 256
For places where resource use is light (e.g., STREAM-y) kernels -> 1024
For places with lots of resource use, or spills -> 256

This also fixes our long-standing bugs due to resource use miscounting from the FunctorTypes vs Driver/ClosureTypes in the blocksize deduction.

Change-Id: I2730e254a478d4b936d2f80b4d0e5c96614d1142
Change-Id: Ic0221337bd6c43132eb832da52fd3f239be844ce
int internal_team_size_common(const FunctorType& f) const {
// FIXME_HIP: this could be unified with the
// internal_team_size_common_reduce
// once we can turn c++17 constexpr on by default.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentation is weird here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants