Update to HIP TeamPolicy Block number heuristic #6284
Conversation
Can one of the admins verify this patch?
In #6229 it turned out that using local memory reductions instead of shuffle reductions always improved performance. Have you tried that for TeamPolicy as well?
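For readers following along: "local memory reductions" here means staging partial results in shared memory (LDS) rather than moving them between lanes with wavefront shuffles. A minimal HIP-style sketch of the two strategies follows; this is illustrative only, not Kokkos's actual implementation (function names and the fixed scratch size are assumptions):

```cpp
#include <hip/hip_runtime.h>

// Shuffle-based reduction: partial sums stay in registers and hop between
// lanes of a wavefront; no shared memory traffic.
__device__ double wavefront_sum_shfl(double val) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down(val, offset);  // lane 0 ends up holding the sum
  return val;
}

// Shared ("local") memory reduction: partial sums staged in LDS, then a
// tree reduction. Assumes blockDim.x is a power of two and <= 1024.
__device__ double block_sum_shmem(double val) {
  __shared__ double scratch[1024];
  const unsigned tid = threadIdx.x;
  scratch[tid] = val;
  __syncthreads();
  for (unsigned stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride) scratch[tid] += scratch[tid + stride];
    __syncthreads();
  }
  return scratch[0];  // every thread reads the block sum
}
```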
OK to test.
Do you have the code used for benchmarking in a public place?
No, not as of yet (but we can extract it). The former is a simple dot product; the latter is the teams-based yAx example in the tutorials.
Converting to draft; I need some time to port the word_size_type stuff to allow UseShflReduction=false to build successfully.
You will probably find that you are running into

kokkos/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp, lines 848 to 860 in 1a3ea28.

The other changes required for this are:

```diff
diff --git a/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp b/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp
index d67371527..cad636371 100644
--- a/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp
+++ b/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp
@@ -589,13 +589,28 @@ class ParallelReduce<CombinedFunctorReducerType,
  public:
   using size_type = HIP::size_type;
 
-  static int constexpr UseShflReduction =
-      (ReducerType::static_value_size() != 0);
+  static int constexpr UseShflReduction = false;
 
  private:
   struct ShflReductionTag {};
   struct SHMEMReductionTag {};
 
+  // Conditionally set word_size_type to int16_t or int8_t if value_type is
+  // smaller than int32_t (Kokkos::HIP::size_type).
+  // word_size_type is used to determine the word count, shared memory buffer
+  // size, and global memory buffer size before the scan is performed.
+  // Within the scan, the word count is recomputed based on word_size_type,
+  // and word_size_type is used again when calculating indexes into the
+  // shared/global memory buffers for performing the scan.
+  // For scalars > 4 bytes in size, indexing into shared/global memory relies
+  // on the block and grid dimensions to ensure that we index at the correct
+  // offset rather than at every 4-byte word, so that, when the join is
+  // performed, we have the correct data that was copied over in chunks of 4
+  // bytes.
+  using word_size_type = std::conditional_t<
+      sizeof(value_type) < sizeof(size_type),
+      std::conditional_t<sizeof(value_type) == 2, int16_t, int8_t>, size_type>;
+
   // Algorithmic constraints: blockDim.y is a power of two AND
   // blockDim.y == blockDim.z == 1 shared memory utilization:
   //
@@ -609,7 +624,7 @@ class ParallelReduce<CombinedFunctorReducerType,
   const pointer_type m_result_ptr;
   const bool m_result_ptr_device_accessible;
   const bool m_result_ptr_host_accessible;
-  size_type* m_scratch_space;
+  word_size_type* m_scratch_space;
   size_type* m_scratch_flags;
   size_type m_team_begin;
   size_type m_shmem_begin;
@@ -673,13 +688,13 @@ class ParallelReduce<CombinedFunctorReducerType,
   __device__ inline void run(SHMEMReductionTag, int const threadid) const {
     const ReducerType& reducer = m_functor_reducer.get_reducer();
 
-    integral_nonzero_constant<size_type, ReducerType::static_value_size() /
-                                             sizeof(size_type)> const
-        word_count(reducer.value_size() / sizeof(size_type));
+    integral_nonzero_constant<word_size_type,
+                              ReducerType::static_value_size() /
+                                  sizeof(word_size_type)> const
+        word_count(reducer.value_size() / sizeof(word_size_type));
 
     reference_type value =
-        reducer.init(kokkos_impl_hip_shared_memory<size_type>() +
-                     threadIdx.y * word_count.value);
+        reducer.init(reinterpret_cast<pointer_type>(
+            kokkos_impl_hip_shared_memory<word_size_type>() +
+            threadIdx.y * word_count.value));
 
     // Iterate this block through the league
     iterate_through_league(threadid, value);
@@ -688,18 +703,18 @@ class ParallelReduce<CombinedFunctorReducerType,
     bool do_final_reduce = (m_league_size == 0);
     if (!do_final_reduce)
       do_final_reduce =
-          hip_single_inter_block_reduce_scan<false, FunctorType, work_tag>(
+          hip_single_inter_block_reduce_scan<false>(
               reducer, blockIdx.x, gridDim.x,
-              kokkos_impl_hip_shared_memory<size_type>(), m_scratch_space,
+              kokkos_impl_hip_shared_memory<word_size_type>(), m_scratch_space,
               m_scratch_flags);
     if (do_final_reduce) {
       // This is the final block with the final result at the final threads'
       // location
-      size_type* const shared = kokkos_impl_hip_shared_memory<size_type>() +
-                                (blockDim.y - 1) * word_count.value;
-      size_type* const global = m_result_ptr_device_accessible
-          ? reinterpret_cast<size_type*>(m_result_ptr)
-          : m_scratch_space;
+      word_size_type* const shared =
+          kokkos_impl_hip_shared_memory<word_size_type>() +
+          (blockDim.y - 1) * word_count.value;
+      word_size_type* const global = m_result_ptr_device_accessible
+          ? reinterpret_cast<word_size_type*>(m_result_ptr)
+          : m_scratch_space;
 
       if (threadIdx.y == 0) {
@@ -760,8 +775,8 @@ class ParallelReduce<CombinedFunctorReducerType,
         ? std::min(m_league_size, size_type(1024 * HIPTraits::WarpSize))
         : std::min(static_cast<int>(m_league_size), m_team_size);
 
-    m_scratch_space = hip_internal_scratch_space(
-        m_policy.space(), reducer.value_size() * block_count);
+    m_scratch_space =
+        reinterpret_cast<word_size_type*>(hip_internal_scratch_space(
+            m_policy.space(), reducer.value_size() * block_count));
     m_scratch_flags =
         hip_internal_scratch_flags(m_policy.space(), sizeof(size_type));
```

but we don't have tests that would test small types for
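As a standalone illustration, the type-selection trait from the patch above can be checked at compile time. A minimal sketch, assuming std::uint32_t as a stand-in for Kokkos::HIP::size_type:

```cpp
#include <cstdint>
#include <type_traits>

using size_type = std::uint32_t;  // stand-in for Kokkos::HIP::size_type

// Mirrors the word_size_type alias in the diff: use a narrower word type
// when the reduction's value_type is smaller than a 4-byte word, so that
// reducer.value_size() / sizeof(word_size_type) never rounds down to zero.
template <class ValueType>
using word_size_type = std::conditional_t<
    sizeof(ValueType) < sizeof(size_type),
    std::conditional_t<sizeof(ValueType) == 2, std::int16_t, std::int8_t>,
    size_type>;

static_assert(std::is_same_v<word_size_type<double>, size_type>);
static_assert(std::is_same_v<word_size_type<std::int16_t>, std::int16_t>);
static_assert(std::is_same_v<word_size_type<std::int8_t>, std::int8_t>);
```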
I implemented the changes described by @masterleinad and found that they did improve dot performance: the local memory reductions (teampolicy-noshfl) reduced the severity of the dip in performance, and in general we line up pretty well with the updated range policy's performance on this dot product workload. yAx performance is largely unchanged with the local memory reductions. In our data collection, we found that the team policy version of the yAx kernel has markedly worse performance than the range policy version, which we are going to look into after getting this initial PR merged. In terms of work left for this PR, a few tests are failing due to the local memory reduction changes, so I need to track those down and fix them before this is ready to go.
This is the reason tests are currently failing with this change. Is there any acceptable way of getting around this in the short term?
Don't enable
You need to fix the format. |
We could use the
```cpp
  // keep blocks as is, already low parallelism
  } else if (block_count_tmp >= block_max) {
    block_count_tmp = block_max;
    // block_count_tmp = m_league_size;
```
Remove this line
I've removed the commented-out code for using smaller types, but kept the note about unconditionally setting UseShflReduction=false. I've added a condition that chooses the higher block count for kernels that the user marks as lightweight. I've also run it through the formatting check and applied a formatting patch.
```cpp
} else if (block_count_tmp >= block_max) {
  block_count_tmp = block_max;
} else {
  int nwork = m_league_size * m_team_size;
  int items_per_thread = (nwork + block_count_tmp * m_team_size - 1) /
                         (block_count_tmp * m_team_size);
  if (items_per_thread < 4) {
    int ratio = std::min(
        (block_count_tmp + preferred_block_min - 1) / preferred_block_min,
        (4 + items_per_thread - 1) / items_per_thread);
    block_count_tmp /= ratio;
  }
}
```
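To make the arithmetic concrete, here is a standalone replay of the else-branch with made-up sizes; all numbers are illustrative assumptions, and preferred_block_min is whatever the surrounding code sets it to:

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  // Hypothetical inputs, chosen only to exercise this branch.
  const int m_league_size = 4096, m_team_size = 64;
  const int preferred_block_min = 1024;
  int block_count_tmp = 2048;  // below block_max, so we land in the else

  const int nwork = m_league_size * m_team_size;  // 262144 work items
  const int items_per_thread =
      (nwork + block_count_tmp * m_team_size - 1) /
      (block_count_tmp * m_team_size);  // ceil(262144 / 131072) = 2
  if (items_per_thread < 4) {
    // Shrink the block count toward ~4 items per thread, with the shrink
    // factor capped relative to preferred_block_min.
    const int ratio = std::min(
        (block_count_tmp + preferred_block_min - 1) / preferred_block_min,
        (4 + items_per_thread - 1) / items_per_thread);  // min(2, 2) = 2
    block_count_tmp /= ratio;  // 2048 / 2 = 1024 blocks, 4 items/thread
  }
  std::printf("block_count = %d\n", block_count_tmp);
}
```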
Wouldn't it make sense to apply the first case unconditionally, i.e.,

```diff
-} else if (block_count_tmp >= block_max) {
-  block_count_tmp = block_max;
 } else {
   int nwork = m_league_size * m_team_size;
   int items_per_thread = (nwork + block_count_tmp * m_team_size - 1) /
                          (block_count_tmp * m_team_size);
   if (items_per_thread < 4) {
     int ratio = std::min(
         (block_count_tmp + preferred_block_min - 1) / preferred_block_min,
         (4 + items_per_thread - 1) / items_per_thread);
     block_count_tmp /= ratio;
   }
+  if (block_count_tmp >= block_max)
+    block_count_tmp = block_max;
 }
```
What's the point of doing that? It's harder to read.
The current logic limits the number of blocks to some maximum value if it exceeds it. If the number of blocks is already less than that, it potentially chooses an even smaller value. It seems to me that it makes sense to first potentially increase the number of items per thread, and only if the resulting block count still exceeds block_max set it to block_max.
I am not convinced that you would get better results. It's just a heuristic. We can try, but we would need to rerun the benchmark.
Of course, benchmarks would need to be rerun. It just sticks out that the current version looks discontinuous around block_max whenever the number of items per thread is less than 4.
> It just sticks out that the current version looks discontinuous
That doesn't bother me. Unless we get better results, I would keep the code as is.
```cpp
if ((property & light_weight) == light_weight) {
  block_max = 2097152;
} else {
  block_max = 65536;
}
```
Are these magic numbers tuned for a specific architecture?
These numbers are tuned for MI200; they were tested on MI210s.
```cpp
    UseShflReduction
        ? std::min(m_league_size, size_type(1024 * HIPTraits::WarpSize))
        : std::min(static_cast<int>(m_league_size), m_team_size);
int block_max = 0;
```
```diff
-int block_max = 0;
+int block_max;
```
or (better?)
```cpp
constexpr auto light_weight =
    Kokkos::Experimental::WorkItemProperty::HintLightWeight;
constexpr bool is_light_weight =
    (typename Policy::work_item_property & light_weight) == light_weight;
constexpr int block_max = is_light_weight ? 2097152 : 65536;
```
```cpp
      block_count_tmp /= ratio;
    }
  }
  const int block_count = block_count_tmp;
```
Did you consider writing a function that takes the policy as an argument and returns that value, instead of cramming the computation of block_count in here? As far as I can tell, the only variables here (league_size, team_size, and work_item_properties) are determined by the policy.
It makes sense to me to have a block count heuristic function be a private member of the TeamPolicy class, as this heuristic is specific to TeamPolicy, and the function should only be used internally. Would that work?
I would have defined it within that ParallelReduce class specialization and moved it to some other place later as needed, if it is used elsewhere. I will leave it up to @Rombur.
@Rombur, I can get this started as soon as you weigh in.
Just create a new function in the ParallelReduce class. We can always move it later if we need it somewhere else.
@IanBogle Do you have time to finish this or do you want me to take over?
@Rombur If you've got the time, please take over. I've been swamped lately.
Sounds good.
This definitely needs to go into some kind of function.
This is ready for review. The failing tests are unrelated to this PR.
```diff
@@ -653,6 +656,35 @@ class ParallelReduce<CombinedFunctorReducerType,
     }
   }
 
+  int compute_block_count() {
```
At the very least, make it a const member function so the reader knows at a glance that you do not modify data members. I think that making it a static function and explicitly passing the policy might increase readability further, but I will not block on this part.
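For what it's worth, a hedged sketch of the shape being discussed, combining the earlier block_max suggestion with a const member. This is a fragment, not the PR's actual code: the constants are the MI200-tuned values quoted above, member names follow the diff, and the trailing std::min stands in for the full items-per-thread heuristic:

```cpp
int compute_block_count() const {
  constexpr auto light_weight =
      Kokkos::Experimental::WorkItemProperty::HintLightWeight;
  // Property set chosen by the user on the TeamPolicy.
  constexpr typename Policy::work_item_property property{};
  constexpr int block_max =
      ((property & light_weight) == light_weight) ? 2097152 : 65536;
  // Placeholder for the actual block-count heuristic discussed above.
  int block_count = std::min(static_cast<int>(m_league_size), m_team_size);
  return std::min(block_count, block_max);
}
```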
Retest this please
This work is related to #6255. We have updated the heuristic that sets the block count inside TeamPolicy's parallel_reduce implementation. We have profiled the resulting performance of the new heuristic on a dot product workload and a yAx workload, and we have ensured that performance is at least as good as the previous heuristic. We want to explicitly note that this is the first step in an ongoing effort to improve HIP parallel_reduce.
In the plots shown, our heuristic is labeled "kokkos-change", Kokkos 4.0.00 is labeled "kokkos-4.0.00", and the develop branch "kokkos-dev".
Plot showing bandwidth vs. # array elements for a dot product workload:
There is an obvious dip in performance where we revert to the old heuristic. This is due to effects that can be seen on the yAx workload:
The "mesa" of performance seen for larger values of N would be narrower had we fixed the dip in the dot performance by simply changing the switchover point. This would represent a fairly sizable performance regression in the yAx case, so with the current heuristic we attempt to be at least as good as the previous heuristic.
We have ideas for further optimizations, but they will require more development effort on our end. This PR is meant to serve as a starting point towards a more optimized heuristic for HIP block count selection in TeamPolicies.