
Update to HIP TeamPolicy Block number heuristic #6284

Merged: 10 commits into kokkos:develop from kokkos_hip_teampolicy_block_heuristic on Oct 10, 2023

Conversation

IanBogle
Contributor

This work is related to #6255. We have updated the heuristic that sets the block count inside TeamPolicy's parallel_reduce implementation. We have profiled the resulting performance of the new heuristic on a dot product workload and a yAx workload, and we have ensured that performance is at least as good as the previous heuristic. We want to explicitly note that this is the first step in an ongoing effort to improve HIP parallel_reduce.

In the plots shown, our heuristic is labeled "kokkos-change", Kokkos 4.0.00 is labeled "kokkos-4.0.00", and the develop branch is labeled "kokkos-dev".

Plot showing bandwidth vs. # array elements for a dot product workload:
[plot: dot product bandwidth vs. number of array elements]
There is an obvious dip in performance where we revert to the old heuristic. This is due to effects that can be seen on the yAx workload:
[plot: yAx workload performance]
The "mesa" of performance seen for larger values of N would be narrower had we fixed the dip in the dot performance by simply changing the switchover point. This would represent a fairly sizable performance regression in the yAx case, so with the current heuristic we attempt to be at least as good as the previous heuristic.

We have ideas for further optimizations, but they will require more development effort on our end. This PR is meant to serve as a starting point towards a more optimized heuristic for HIP block count selection in TeamPolicies.

@dalg24-jenkins
Collaborator

Can one of the admins verify this patch?

Contributor

@masterleinad left a comment

In #6229 it turned out that using local memory reductions instead of shuffle reductions always improved performance. Have you tried that for TeamPolicy as well?

@masterleinad
Contributor

OK to test.

@masterleinad
Contributor

Do you have the code used for benchmarking in a public place?

@skyreflectedinmirrors
Contributor

No, not as of yet (but we can extract it). The former is a simple dot product; the latter is the teams-based yAx example in the tutorials.

@IanBogle marked this pull request as draft on July 14, 2023 19:09
@IanBogle
Contributor Author

Converting to draft, need some time to port the word_size_type stuff to allow UseShflReduction=false to successfully build.

@masterleinad
Contributor

Converting to draft, need some time to port the word_size_type stuff to allow UseShflReduction=false to successfully build.

You will probably find that you are running into

// The global parallel_reduce does not support vector_length other than 1 at
// the moment
if ((arg_policy.impl_vector_length() > 1) && !UseShflReduction)
  Impl::throw_runtime_exception(
      "Kokkos::parallel_reduce with a TeamPolicy using a vector length of "
      "greater than 1 is not currently supported for HIP for dynamic "
      "sized reduction types.");
if ((m_team_size < HIPTraits::WarpSize) && !UseShflReduction)
  Impl::throw_runtime_exception(
      "Kokkos::parallel_reduce with a TeamPolicy using a team_size smaller "
      "than 64 is not currently supported with HIP for dynamic sized "
      "reduction types.");
if you enable it unconditionally (and removing these checks will just make the same tests fail).

@masterleinad
Contributor

The other changes required for TeamPolicy should be

diff --git a/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp b/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp
index d67371527..cad636371 100644
--- a/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp
+++ b/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp
@@ -589,13 +589,28 @@ class ParallelReduce<CombinedFunctorReducerType,
  public:
   using size_type = HIP::size_type;
 
-  static int constexpr UseShflReduction =
-      (ReducerType::static_value_size() != 0);
+  static int constexpr UseShflReduction = false;
 
  private:
   struct ShflReductionTag {};
   struct SHMEMReductionTag {};
 
+  // Conditionally set word_size_type to int16_t or int8_t if value_type is
+  // smaller than int32_t (Kokkos::HIP::size_type)
+  // word_size_type is used to determine the word count, shared memory buffer
+  // size, and global memory buffer size before the scan is performed.
+  // Within the scan, the word count is recomputed based on word_size_type
+  // and when calculating indexes into the shared/global memory buffers for
+  // performing the scan, word_size_type is used again.
+  // For scalars > 4 bytes in size, indexing into shared/global memory relies
+  // on the block and grid dimensions to ensure that we index at the correct
+  // offset rather than at every 4 byte word; such that, when the join is
+  // performed, we have the correct data that was copied over in chunks of 4
+  // bytes.
+  using word_size_type = std::conditional_t<
+       sizeof(value_type) < sizeof(size_type),
+       std::conditional_t<sizeof(value_type) == 2, int16_t, int8_t>, size_type>;
+
   // Algorithmic constraints: blockDim.y is a power of two AND
   // blockDim.y == blockDim.z == 1 shared memory utilization:
   //
@@ -609,7 +624,7 @@ class ParallelReduce<CombinedFunctorReducerType,
   const pointer_type m_result_ptr;
   const bool m_result_ptr_device_accessible;
   const bool m_result_ptr_host_accessible;
-  size_type* m_scratch_space;
+  word_size_type* m_scratch_space;
   size_type* m_scratch_flags;
   size_type m_team_begin;
   size_type m_shmem_begin;
@@ -673,13 +688,13 @@ class ParallelReduce<CombinedFunctorReducerType,
   __device__ inline void run(SHMEMReductionTag, int const threadid) const {
     const ReducerType& reducer = m_functor_reducer.get_reducer();
 
-    integral_nonzero_constant<size_type, ReducerType::static_value_size() /
-                                             sizeof(size_type)> const
-        word_count(reducer.value_size() / sizeof(size_type));
+    integral_nonzero_constant<word_size_type, ReducerType::static_value_size() /
+                                             sizeof(word_size_type)> const
+        word_count(reducer.value_size() / sizeof(word_size_type));
 
     reference_type value =
-        reducer.init(kokkos_impl_hip_shared_memory<size_type>() +
-                     threadIdx.y * word_count.value);
+        reducer.init(reinterpret_cast<pointer_type>(kokkos_impl_hip_shared_memory<word_size_type>() +
+                     threadIdx.y * word_count.value));
 
     // Iterate this block through the league
     iterate_through_league(threadid, value);
@@ -688,18 +703,18 @@ class ParallelReduce<CombinedFunctorReducerType,
     bool do_final_reduce = (m_league_size == 0);
     if (!do_final_reduce)
       do_final_reduce =
-          hip_single_inter_block_reduce_scan<false, FunctorType, work_tag>(
+          hip_single_inter_block_reduce_scan<false>(
               reducer, blockIdx.x, gridDim.x,
-              kokkos_impl_hip_shared_memory<size_type>(), m_scratch_space,
+              kokkos_impl_hip_shared_memory<word_size_type>(), m_scratch_space,
               m_scratch_flags);
     if (do_final_reduce) {
       // This is the final block with the final result at the final threads'
       // location
 
-      size_type* const shared = kokkos_impl_hip_shared_memory<size_type>() +
+      word_size_type* const shared = kokkos_impl_hip_shared_memory<word_size_type>() +
                                 (blockDim.y - 1) * word_count.value;
-      size_type* const global = m_result_ptr_device_accessible
-                                    ? reinterpret_cast<size_type*>(m_result_ptr)
+      word_size_type* const global = m_result_ptr_device_accessible
+                                    ? reinterpret_cast<word_size_type*>(m_result_ptr)
                                     : m_scratch_space;
 
       if (threadIdx.y == 0) {
@@ -760,8 +775,8 @@ class ParallelReduce<CombinedFunctorReducerType,
               ? std::min(m_league_size, size_type(1024 * HIPTraits::WarpSize))
               : std::min(static_cast<int>(m_league_size), m_team_size);
 
-      m_scratch_space = hip_internal_scratch_space(
-          m_policy.space(), reducer.value_size() * block_count);
+      m_scratch_space = reinterpret_cast<word_size_type*>(hip_internal_scratch_space(
+          m_policy.space(), reducer.value_size() * block_count));
       m_scratch_flags =
           hip_internal_scratch_flags(m_policy.space(), sizeof(size_type));

but we don't have tests that would test small types for TeamPolicy yet.
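
For reference, here is a small, self-contained sketch of what that word_size_type alias resolves to for a few scalar sizes. The size_type alias below is an assumption standing in for Kokkos::HIP::size_type (a 32-bit integer, per the comment in the diff); the rest mirrors the conditional above.

#include <cstdint>
#include <type_traits>

using size_type = std::uint32_t;  // assumption: stand-in for Kokkos::HIP::size_type

// Same selection logic as the word_size_type alias in the diff above.
template <class value_type>
using word_size_type = std::conditional_t<
    sizeof(value_type) < sizeof(size_type),
    std::conditional_t<sizeof(value_type) == 2, std::int16_t, std::int8_t>,
    size_type>;

static_assert(std::is_same_v<word_size_type<std::int8_t>, std::int8_t>);    // 1-byte scalars
static_assert(std::is_same_v<word_size_type<std::int16_t>, std::int16_t>);  // 2-byte scalars
static_assert(std::is_same_v<word_size_type<float>, size_type>);            // 4-byte scalars
static_assert(std::is_same_v<word_size_type<double>, size_type>);           // 8-byte scalars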

@IanBogle
Contributor Author

IanBogle commented Aug 4, 2023

I implemented the changes described by @masterleinad and found that they did improve dot performance:
[plot: kokkos-dot]

The local memory reductions (teampolicy-noshfl) reduced the severity of the dip in performance, and in general we line up pretty well with the updated range policy's performance on this dot product workload.

yAx performance is largely unchanged with the local memory reductions:
[plot: kokkos-yAx]

In our data collection, we found that the team policy version of the yAx kernel has markedly worse performance than the range policy version, which we are going to look into after getting this initial PR merged. In terms of work left for this PR, there are a few tests that are failing due to the local memory reduction changes, so I need to track those down and fix them before this is ready to go.

@IanBogle
Contributor Author

Converting to draft, need some time to port the word_size_type stuff to allow UseShflReduction=false to successfully build.

You will probably find that you are running into

// The global parallel_reduce does not support vector_length other than 1 at
// the moment
if ((arg_policy.impl_vector_length() > 1) && !UseShflReduction)
  Impl::throw_runtime_exception(
      "Kokkos::parallel_reduce with a TeamPolicy using a vector length of "
      "greater than 1 is not currently supported for HIP for dynamic "
      "sized reduction types.");
if ((m_team_size < HIPTraits::WarpSize) && !UseShflReduction)
  Impl::throw_runtime_exception(
      "Kokkos::parallel_reduce with a TeamPolicy using a team_size smaller "
      "than 64 is not currently supported with HIP for dynamic sized "
      "reduction types.");

if you enable it unconditionally (and removing these checks will just make the same tests fail).

This is the reason tests are currently failing with this change. Is there any acceptable way of getting around this in the short term?

@Rombur
Member

Rombur commented Aug 11, 2023

This is the reason tests are currently failing with this change. Is there any acceptable way of getting around this in the short term?

Don't enable UseShflReduction unconditionally and add a FIXME_HIP that we need to fix the runtime exception and enable UseShflReduction unconditionally.
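
A minimal sketch of the shape being suggested (illustrative only; the exact wording and placement in the merged code may differ):

// Keep the original, conditional definition ...
static int constexpr UseShflReduction =
    (ReducerType::static_value_size() != 0);
// ... and leave the unconditional setting behind a FIXME for later:
// FIXME_HIP Fix the runtime exceptions for vector_length > 1 and small
// team_size, then switch to local-memory reductions unconditionally:
// static int constexpr UseShflReduction = false;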

@IanBogle
Contributor Author

I've reverted the UseShflReduction = false changes, though I left them in comments (with nearby FIXME_HIPs) because they did get us some performance. We also figured out why the yAx benchmark had such poor performance: I had copied the code from the Kokkos tutorial and didn't realize it used LayoutRight by default, so we were striding like crazy. There are still a few things we want to look into later, but we're ready to finish up this PR.

We looked into increasing the maximum block count in the heuristic to see if we could get rid of the dip in dot performance:
[plot: kokkos-dot-perf]
This plot shows that the maximum we currently use, 65536, has an obvious performance dip, and that increasing the maximum number of blocks to 2097152 eliminates it. However, using 2097152 as the maximum block count can require a nontrivial amount of memory for the final reduction (especially for datatypes larger than the double we used), so we opt for 65536 as the default.
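(For scale, assuming one partial value per block and the 8-byte double used in these runs: 2,097,152 blocks need roughly 2,097,152 × 8 B ≈ 16 MiB of scratch space for the final reduction, while 65,536 blocks need only 65,536 × 8 B = 512 KiB.)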
We also see that 65536 is a good choice for the yAx workload (note: performance is vastly different here because we fixed the data layout):
[plot: kokkos-yAx-perf]
We see that 65536 follows our "at least as fast as the previous heuristic" rule of thumb, whereas 2097152 is worse for larger league sizes and only slightly faster than 65536 in certain cases.

Another important note: these plots were all generated with UseShflReduction=false, which is not currently active on this branch due to failing tests. Even so, this branch should have better performance than the develop branch, and the tests pass.

@IanBogle marked this pull request as ready for review on August 18, 2023 20:05
@Rombur
Member

Rombur commented Aug 21, 2023

You need to fix the formatting.

@Rombur
Member

Rombur commented Aug 21, 2023

This plot shows that the maximum we currently use, 65536, has an obvious performance dip, and that increasing the maximum number of blocks to 2097152 eliminates it. However, using 2097152 as the maximum block count can require a nontrivial amount of memory for the final reduction (especially for datatypes larger than the double we used), so we opt for 65536 as the default.

We could use the LightWeight hint to switch the default value: if a user tells us that their kernel is lightweight, we use 2097152; otherwise we use 65536.
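
A minimal caller-side sketch of opting into that hint (the kernel name and lambda body are placeholders; require and WorkItemProperty are the public Kokkos API):

#include <Kokkos_Core.hpp>

void dot_with_lightweight_hint(int league_size, int team_size) {
  using policy_t = Kokkos::TeamPolicy<>;
  // Attach the lightweight hint so the backend may pick the larger block cap.
  auto policy = Kokkos::Experimental::require(
      policy_t(league_size, team_size),
      Kokkos::Experimental::WorkItemProperty::HintLightWeight);

  double result = 0.0;
  Kokkos::parallel_reduce(
      "dot_light", policy,
      KOKKOS_LAMBDA(const policy_t::member_type& team, double& sum) {
        (void)team;  // a real kernel would accumulate a per-team partial dot product into sum
      },
      result);
}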

  // keep blocks as is, already low parallelism
} else if (block_count_tmp >= block_max) {
  block_count_tmp = block_max;
  // block_count_tmp = m_league_size;
Member

Remove this line

@IanBogle
Contributor Author

I've removed the commented-out code for using smaller types, but kept the note about unconditionally setting UseShflReduction=false. I've added a condition that chooses the higher block count for kernels that the user marks as lightweight. I've also run it through the formatting check and applied a formatting patch.

Comment on lines 773 to 786
} else if (block_count_tmp >= block_max) {
  block_count_tmp = block_max;

} else {
  int nwork = m_league_size * m_team_size;
  int items_per_thread = (nwork + block_count_tmp * m_team_size - 1) /
                         (block_count_tmp * m_team_size);
  if (items_per_thread < 4) {
    int ratio = std::min(
        (block_count_tmp + preferred_block_min - 1) / preferred_block_min,
        (4 + items_per_thread - 1) / items_per_thread);
    block_count_tmp /= ratio;
  }
}
Contributor

Wouldn't it make sense to apply the first case unconditionally, i.e.,

Suggested change
} else if (block_count_tmp >= block_max) {
  block_count_tmp = block_max;
} else {
  int nwork = m_league_size * m_team_size;
  int items_per_thread = (nwork + block_count_tmp * m_team_size - 1) /
                         (block_count_tmp * m_team_size);
  if (items_per_thread < 4) {
    int ratio = std::min(
        (block_count_tmp + preferred_block_min - 1) / preferred_block_min,
        (4 + items_per_thread - 1) / items_per_thread);
    block_count_tmp /= ratio;
  }
}
} else {
  int nwork = m_league_size * m_team_size;
  int items_per_thread = (nwork + block_count_tmp * m_team_size - 1) /
                         (block_count_tmp * m_team_size);
  if (items_per_thread < 4) {
    int ratio = std::min(
        (block_count_tmp + preferred_block_min - 1) / preferred_block_min,
        (4 + items_per_thread - 1) / items_per_thread);
    block_count_tmp /= ratio;
  }
  if (block_count_tmp >= block_max)
    block_count_tmp = block_max;
}

Member

What's the point of doing that? It's harder to read.

Contributor

The current logic limits the number of blocks to some maximum value if it exceeds that maximum. If the number of blocks is already less than the maximum, it potentially chooses an even smaller value.
It seems to me that it makes sense to first potentially increase the number of items per thread, and only set the block count to block_max if it still exceeds block_max afterwards.

Member

I am not convinced that you would get better results. It's just a heuristic. We can try, but we need to rerun the benchmarks.

Contributor

Of course, benchmarks would need to be rerun. It just sticks out that the current version looks discontinuous around block_max whenever the number of items per thread is less than 4.

Member

It just sticks out that the current version looks discontinuous

That doesn't bother me. Unless we get better results, I would keep the code as is.

Comment on lines 764 to 768
if ((property & light_weight) == light_weight) {
  block_max = 2097152;
} else {
  block_max = 65536;
}
Member

Are these magic numbers tuned for a specific architecture?

Contributor Author

These numbers are tuned for MI200; they were tested on MI210s.

        UseShflReduction
            ? std::min(m_league_size, size_type(1024 * HIPTraits::WarpSize))
            : std::min(static_cast<int>(m_league_size), m_team_size);
    int block_max = 0;
Member

Suggested change
int block_max = 0;
int block_max;

or (better?)

constexpr auto light_weight =
    Kokkos::Experimental::WorkItemProperty::HintLightWeight;
constexpr bool is_light_weight =
    (typename Policy::work_item_property & light_weight) == light_weight;
constexpr int block_max = is_light_weight ? 2097152 : 65536;

    block_count_tmp /= ratio;
  }
}
const int block_count = block_count_tmp;
Member

@dalg24 commented Aug 22, 2023

Did you consider writing a function that takes the policy as argument and returns that value instead of cramming the computation of block_count here?
As far as I can tell the only variables here (league_size, team_size, and work_item_properties) are determined by the policy.

Contributor Author

It makes sense to me to have a block count heuristic function be a private member of the TeamPolicy class, as this heuristic is specific to TeamPolicy, and the function should only be used internally. Would that work?

Member

I would have defined it within that ParallelReduce class specialization and moved it somewhere else later if it turns out to be needed elsewhere.
I will leave it up to @Rombur.

Contributor Author

@Rombur, I can get this started as soon as you weigh in.

Member

Just create a new function in the ParallelReduce class. We can always move it later if we need it somewhere else.

Member

@IanBogle Do you have time to finish this or do you want me to take over?

Contributor Author

@Rombur If you've got the time, please take over. I've been swamped lately.

Member

Sounds good.

Member

@crtrott left a comment

This definitely needs to go into some kind of function

@Rombur force-pushed the kokkos_hip_teampolicy_block_heuristic branch from ff70042 to b7cc711 on October 6, 2023 18:20
@Rombur force-pushed the kokkos_hip_teampolicy_block_heuristic branch from b7cc711 to e3acaf9 on October 6, 2023 18:23
@Rombur
Member

Rombur commented Oct 9, 2023

This is ready for review. The failing tests are unrelated to this PR.

@@ -653,6 +656,35 @@ class ParallelReduce<CombinedFunctorReducerType,
    }
  }

  int compute_block_count() {
Member

At the very least, make it a const member function so the reader knows at a glance that you do not modify data members.
I think that making it a static function and explicitly passing the policy might increase readability further, but I will not block on this part.
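
For illustration, a hedged sketch of a static variant along those lines (not the merged implementation: the preferred_block_min constant, the initial block count, and the omitted "already low parallelism" branch are simplifications/assumptions here):

#include <algorithm>

// Assumes league_size >= 1 and team_size >= 1.
static int compute_block_count_sketch(int league_size, int team_size,
                                      bool is_light_weight) {
  // Larger cap only when the user hints the kernel is lightweight.
  const int block_max = is_light_weight ? 2097152 : 65536;
  const int preferred_block_min = 1024;  // assumption: tuning constant

  int block_count_tmp = league_size;  // assumption: start from the league size
  if (block_count_tmp >= block_max) {
    block_count_tmp = block_max;
  } else {
    // If each thread would see fewer than 4 work items, shrink the grid so
    // threads get more work, without dropping below preferred_block_min.
    const int nwork = league_size * team_size;
    const int items_per_thread =
        (nwork + block_count_tmp * team_size - 1) /
        (block_count_tmp * team_size);
    if (items_per_thread < 4) {
      const int ratio = std::min(
          (block_count_tmp + preferred_block_min - 1) / preferred_block_min,
          (4 + items_per_thread - 1) / items_per_thread);
      block_count_tmp /= ratio;
    }
  }
  return block_count_tmp;
}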

@Rombur
Member

Rombur commented Oct 10, 2023

Retest this please

@dalg24 dismissed crtrott’s stale review on October 10, 2023 19:58

Function was added

@dalg24 merged commit 8420c2f into kokkos:develop on Oct 10, 2023
28 checks passed
@Rombur mentioned this pull request on Oct 10, 2023