-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement new blocksize deduction method for HIP Backend #3953
Implement new blocksize deduction method for HIP Backend #3953
Conversation
} | ||
block_size >>= 1; | ||
} while (block_size >= HIPTraits::WarpSize); | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use KOKKOS_ASSERT
so we don't propagate a wrong block size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we could put an assert in here -- but the block_size=0
case is handled differently by the various layers, e.g. ParallelReduce would error out here:
https://github.com/kokkos/kokkos/blob/develop/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp#L961
But, I think we could figure out a way to make this more uniform between the backends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss what we want to do on this and #3944 on Monday
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See 2145bcd for a potential implementation
51ad919
to
e09090a
Compare
Added a change to use a scoped enum as per our discussion this morning. edit: resubmitted, missed a few bare bools in the Teams bits. |
Retest this please. |
The general theory is to switch between a blocksize of 1024 and 256 For places where resource use is light (e.g., STREAM-y) kernels -> 1024 For places with lots of resource use, or spills -> 256 This also fixes our long-standing bugs due to resource use miscounting from the FunctorTypes vs Driver/ClosureTypes in the blocksize deduction. Change-Id: I2730e254a478d4b936d2f80b4d0e5c96614d1142
e09090a
to
fcccfd2
Compare
Change-Id: Ic0221337bd6c43132eb832da52fd3f239be844ce
int internal_team_size_common(const FunctorType& f) const { | ||
// FIXME_HIP: this could be unified with the | ||
// internal_team_size_common_reduce | ||
// once we can turn c++17 constexpr on by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indentation is weird here
The general theory is to switch between a blocksize of 1024 and 256 depending on resource usage.
For places where resource use is light (e.g., STREAM-y) kernels -> 1024
For places with lots of resource use, or spills -> 256
This also fixes our long-standing issues with incorrect resource use calculations, resulting from passing the FunctorTypes into the blocksize deduction versus the Driver/ClosureType's, and unifies the blocksize deduction's that were scattered all over the place (ParallelRange, Teams, MDRange, etc.) into a common set of methods such that we can more easily modify them
As some sample data-points, on an MI-100 this results in:
Some additional test cases we might want to look at:
If we have some more test cases to run, I'd be happy to take a look at them as it would be good to tune the heuristic as broadly as possible :)