
Update to HIP TeamPolicy Block number heuristic #6284

Merged: 10 commits into kokkos:develop from kokkos_hip_teampolicy_block_heuristic on Oct 10, 2023

Conversation

IanBogle
Contributor

This work is related to #6255. We have updated the heuristic that sets the block count inside TeamPolicy's parallel_reduce implementation. We have profiled the resulting performance of the new heuristic on a dot product workload and a yAx workload, and we have ensured that performance is at least as good as the previous heuristic. We want to explicitly note that this is the first step in an ongoing effort to improve HIP parallel_reduce.

In the plots shown, our heuristic is labeled "kokkos-change", Kokkos 4.0.00 is labeled "kokkos-4.0.00", and the develop branch is labeled "kokkos-dev".

Plot showing bandwidth vs. # array elements for a dot product workload:
[plot: dot product bandwidth vs. number of array elements]
There is an obvious dip in performance where we revert to the old heuristic. This is due to effects that can be seen on the yAx workload:
[plot: yAx workload performance]
The "mesa" of performance seen for larger values of N would be narrower had we fixed the dip in the dot performance by simply changing the switchover point. This would represent a fairly sizable performance regression in the yAx case, so with the current heuristic we attempt to be at least as good as the previous heuristic.

We have ideas for further optimizations, but they will require more development effort on our end. This PR is meant to serve as a starting point towards a more optimized heuristic for HIP block count selection in TeamPolicies.

@dalg24-jenkins
Collaborator

Can one of the admins verify this patch?

Contributor

@masterleinad left a comment

In #6229 it turned out that using local memory reductions instead of shuffle reductions always improved performance. Have you tried that for TeamPolicy as well?

@masterleinad
Contributor

OK to test.

@masterleinad
Contributor

Do you have the code used for benchmarking in a public place?

@skyreflectedinmirrors
Contributor

No, not as of yet (but we can extract it). The former is a simple dot product; the latter is the teams-based yAx example in the tutorials.

@IanBogle marked this pull request as draft on July 14, 2023 19:09
@IanBogle
Contributor Author

Converting to draft, need some time to port the word_size_type stuff to allow UseShflReduction=false to successfully build.

@masterleinad
Contributor

Converting to draft, need some time to port the word_size_type stuff to allow UseShflReduction=false to successfully build.

You will probably find that you are running into

// The global parallel_reduce does not support vector_length other than 1 at
// the moment
if ((arg_policy.impl_vector_length() > 1) && !UseShflReduction)
  Impl::throw_runtime_exception(
      "Kokkos::parallel_reduce with a TeamPolicy using a vector length of "
      "greater than 1 is not currently supported for HIP for dynamic "
      "sized reduction types.");
if ((m_team_size < HIPTraits::WarpSize) && !UseShflReduction)
  Impl::throw_runtime_exception(
      "Kokkos::parallel_reduce with a TeamPolicy using a team_size smaller "
      "than 64 is not currently supported with HIP for dynamic sized "
      "reduction types.");
if you enable it unconditionally (and removing these checks will just make the same tests fail).

@masterleinad
Contributor

The other changes required for TeamPolicy should be

diff --git a/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp b/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp
index d67371527..cad636371 100644
--- a/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp
+++ b/core/src/HIP/Kokkos_HIP_Parallel_Team.hpp
@@ -589,13 +589,28 @@ class ParallelReduce<CombinedFunctorReducerType,
  public:
   using size_type = HIP::size_type;
 
-  static int constexpr UseShflReduction =
-      (ReducerType::static_value_size() != 0);
+  static int constexpr UseShflReduction = false;
 
  private:
   struct ShflReductionTag {};
   struct SHMEMReductionTag {};
 
+  // Conditionally set word_size_type to int16_t or int8_t if value_type is
+  // smaller than int32_t (Kokkos::HIP::size_type)
+  // word_size_type is used to determine the word count, shared memory buffer
+  // size, and global memory buffer size before the scan is performed.
+  // Within the scan, the word count is recomputed based on word_size_type
+  // and when calculating indexes into the shared/global memory buffers for
+  // performing the scan, word_size_type is used again.
+  // For scalars > 4 bytes in size, indexing into shared/global memory relies
+  // on the block and grid dimensions to ensure that we index at the correct
+  // offset rather than at every 4 byte word; such that, when the join is
+  // performed, we have the correct data that was copied over in chunks of 4
+  // bytes.
+  using word_size_type = std::conditional_t<
+       sizeof(value_type) < sizeof(size_type),
+       std::conditional_t<sizeof(value_type) == 2, int16_t, int8_t>, size_type>;
+
   // Algorithmic constraints: blockDim.y is a power of two AND
   // blockDim.y == blockDim.z == 1 shared memory utilization:
   //
@@ -609,7 +624,7 @@ class ParallelReduce<CombinedFunctorReducerType,
   const pointer_type m_result_ptr;
   const bool m_result_ptr_device_accessible;
   const bool m_result_ptr_host_accessible;
-  size_type* m_scratch_space;
+  word_size_type* m_scratch_space;
   size_type* m_scratch_flags;
   size_type m_team_begin;
   size_type m_shmem_begin;
@@ -673,13 +688,13 @@ class ParallelReduce<CombinedFunctorReducerType,
   __device__ inline void run(SHMEMReductionTag, int const threadid) const {
     const ReducerType& reducer = m_functor_reducer.get_reducer();
 
-    integral_nonzero_constant<size_type, ReducerType::static_value_size() /
-                                             sizeof(size_type)> const
-        word_count(reducer.value_size() / sizeof(size_type));
+    integral_nonzero_constant<word_size_type, ReducerType::static_value_size() /
+                                             sizeof(word_size_type)> const
+        word_count(reducer.value_size() / sizeof(word_size_type));
 
     reference_type value =
-        reducer.init(kokkos_impl_hip_shared_memory<size_type>() +
-                     threadIdx.y * word_count.value);
+        reducer.init(reinterpret_cast<pointer_type>(kokkos_impl_hip_shared_memory<word_size_type>() +
+                     threadIdx.y * word_count.value));
 
     // Iterate this block through the league
     iterate_through_league(threadid, value);
@@ -688,18 +703,18 @@ class ParallelReduce<CombinedFunctorReducerType,
     bool do_final_reduce = (m_league_size == 0);
     if (!do_final_reduce)
       do_final_reduce =
-          hip_single_inter_block_reduce_scan<false, FunctorType, work_tag>(
+          hip_single_inter_block_reduce_scan<false>(
               reducer, blockIdx.x, gridDim.x,
-              kokkos_impl_hip_shared_memory<size_type>(), m_scratch_space,
+              kokkos_impl_hip_shared_memory<word_size_type>(), m_scratch_space,
               m_scratch_flags);
     if (do_final_reduce) {
       // This is the final block with the final result at the final threads'
       // location
 
-      size_type* const shared = kokkos_impl_hip_shared_memory<size_type>() +
+      word_size_type* const shared = kokkos_impl_hip_shared_memory<word_size_type>() +
                                 (blockDim.y - 1) * word_count.value;
-      size_type* const global = m_result_ptr_device_accessible
-                                    ? reinterpret_cast<size_type*>(m_result_ptr)
+      word_size_type* const global = m_result_ptr_device_accessible
+                                    ? reinterpret_cast<word_size_type*>(m_result_ptr)
                                     : m_scratch_space;
 
       if (threadIdx.y == 0) {
@@ -760,8 +775,8 @@ class ParallelReduce<CombinedFunctorReducerType,
               ? std::min(m_league_size, size_type(1024 * HIPTraits::WarpSize))
               : std::min(static_cast<int>(m_league_size), m_team_size);
 
-      m_scratch_space = hip_internal_scratch_space(
-          m_policy.space(), reducer.value_size() * block_count);
+      m_scratch_space = reinterpret_cast<word_size_type*>(hip_internal_scratch_space(
+          m_policy.space(), reducer.value_size() * block_count));
       m_scratch_flags =
           hip_internal_scratch_flags(m_policy.space(), sizeof(size_type));

but we don't have tests that would test small types for TeamPolicy yet.
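
For reference, here is a small, self-contained sketch of what that word_size_type alias resolves to for a few scalar sizes. The size_type alias below is an assumption standing in for Kokkos::HIP::size_type (a 32-bit integer, per the comment in the diff); the rest mirrors the conditional above.

#include <cstdint>
#include <type_traits>

using size_type = std::uint32_t;  // assumption: stand-in for Kokkos::HIP::size_type

// Same selection logic as the word_size_type alias in the diff above.
template <class value_type>
using word_size_type = std::conditional_t<
    sizeof(value_type) < sizeof(size_type),
    std::conditional_t<sizeof(value_type) == 2, std::int16_t, std::int8_t>,
    size_type>;

static_assert(std::is_same_v<word_size_type<std::int8_t>, std::int8_t>);    // 1-byte scalars
static_assert(std::is_same_v<word_size_type<std::int16_t>, std::int16_t>);  // 2-byte scalars
static_assert(std::is_same_v<word_size_type<float>, size_type>);            // 4-byte scalars
static_assert(std::is_same_v<word_size_type<double>, size_type>);           // 8-byte scalars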

@IanBogle
Contributor Author

IanBogle commented Aug 4, 2023

I implemented the changes described by @masterleinad and found that they did improve dot performance:
[plot: kokkos-dot]

The local memory reductions (teampolicy-noshfl) reduced the severity of the dip in performance, and in general we line up pretty well with the updated range policy's performance on this dot product workload.

yAx performance is largely unchanged with the local memory reductions:
[plot: kokkos-yAx]

In our data collection, we found that the team policy version of the yAx kernel has markedly worse performance than the range policy version, which we are going to look into after getting this initial PR merged. In terms of work left for this PR, there are a few tests that are failing due to the local memory reduction changes, so I need to track those down and fix them before this is ready to go.

@IanBogle
Contributor Author

Converting to draft, need some time to port the word_size_type stuff to allow UseShflReduction=false to successfully build.

You will probably find that you are running into

// The global parallel_reduce does not support vector_length other than 1 at
// the moment
if ((arg_policy.impl_vector_length() > 1) && !UseShflReduction)
  Impl::throw_runtime_exception(
      "Kokkos::parallel_reduce with a TeamPolicy using a vector length of "
      "greater than 1 is not currently supported for HIP for dynamic "
      "sized reduction types.");
if ((m_team_size < HIPTraits::WarpSize) && !UseShflReduction)
  Impl::throw_runtime_exception(
      "Kokkos::parallel_reduce with a TeamPolicy using a team_size smaller "
      "than 64 is not currently supported with HIP for dynamic sized "
      "reduction types.");

if you enable it unconditionally (and removing these checks will just make the same tests fail).

This is the reason tests are currently failing with this change. Is there any acceptable way of getting around this in the short term?

@Rombur
Member

Rombur commented Aug 11, 2023

This is the reason tests are currently failing with this change. Is there any acceptable way of getting around this in the short term?

Don't enable UseShflReduction unconditionally and add a FIXME_HIP that we need to fix the runtime exception and enable UseShflReduction unconditionally.
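
A minimal sketch of the shape being suggested (illustrative only; the exact wording and placement in the merged code may differ):

// Keep the original, conditional definition ...
static int constexpr UseShflReduction =
    (ReducerType::static_value_size() != 0);
// ... and leave the unconditional setting behind a FIXME for later:
// FIXME_HIP Fix the runtime exceptions for vector_length > 1 and small
// team_size, then switch to local-memory reductions unconditionally:
// static int constexpr UseShflReduction = false;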

@IanBogle
Contributor Author

I've reverted the UseShflReduction = false changes, though I left them in comments (with nearby FIXME_HIPs) because they did get us some performance. We also figured out why the yAx benchmark had such poor performance: I had copied the code from the Kokkos tutorial and didn't realize it used LayoutRight by default, so we were striding like crazy. There are still a few things we want to look into later, but we're ready to finish up this PR.

We looked into increasing the maximum block count in the heuristic to see if we could get rid of the dip in dot performance:
[plot: kokkos-dot-perf]
This plot shows that the maximum we currently use, 65536, has an obvious performance dip, and that increasing the maximum number of blocks to 2097152 eliminates it. However, using 2097152 as the maximum block count can require a nontrivial amount of memory for the final reduction (especially for datatypes larger than the double we used), so we opt for 65536 as the default.
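(For scale, assuming one partial value per block and the 8-byte double used in these runs: 2,097,152 blocks need roughly 2,097,152 × 8 B ≈ 16 MiB of scratch space for the final reduction, while 65,536 blocks need only 65,536 × 8 B = 512 KiB.)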
We also see that 65536 is a good choice for the yAx workload (note: performance is vastly different here because we fixed the data layout):
[plot: kokkos-yAx-perf]
We see that 65536 follows our "at least as fast as the previous heuristic" rule of thumb, whereas 2097152 is worse for larger league sizes and only slightly faster than 65536 in certain cases.

Another important note: these plots were all generated with UseShflReduction=false, which is not currently active on this branch due to failing tests. Even so, this branch should have better performance than the develop branch, and the tests pass.

@IanBogle marked this pull request as ready for review on August 18, 2023 20:05
@Rombur
Member

Rombur commented Aug 21, 2023

You need to fix the formatting.

@Rombur
Member

Rombur commented Aug 21, 2023

This plot shows that the maximum we currently use, 65536, has an obvious performance dip, and that increasing the maximum number of blocks to 2097152 eliminates it. However, using 2097152 as the maximum block count can require a nontrivial amount of memory for the final reduction (especially for datatypes larger than the double we used), so we opt for 65536 as the default.

We could use the LightWeight hint to switch the default value: if a user tells us that their kernel is lightweight, we use 2097152; otherwise we use 65536.
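
A minimal caller-side sketch of opting into that hint (the kernel name and lambda body are placeholders; require and WorkItemProperty are the public Kokkos API):

#include <Kokkos_Core.hpp>

void dot_with_lightweight_hint(int league_size, int team_size) {
  using policy_t = Kokkos::TeamPolicy<>;
  // Attach the lightweight hint so the backend may pick the larger block cap.
  auto policy = Kokkos::Experimental::require(
      policy_t(league_size, team_size),
      Kokkos::Experimental::WorkItemProperty::HintLightWeight);

  double result = 0.0;
  Kokkos::parallel_reduce(
      "dot_light", policy,
      KOKKOS_LAMBDA(const policy_t::member_type& team, double& sum) {
        (void)team;  // a real kernel would accumulate a per-team partial dot product into sum
      },
      result);
}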

  // keep blocks as is, already low parallelism
} else if (block_count_tmp >= block_max) {
  block_count_tmp = block_max;
  // block_count_tmp = m_league_size;
Member

Remove this line

@IanBogle
Contributor Author

I've removed the commented-out code for using smaller types, but kept the note about unconditionally setting UseShflReduction=false. I've added a condition that chooses the higher block count for kernels that the user marks as lightweight. I've also run it through the formatting check and applied a formatting patch.

Comment on lines 773 to 786
} else if (block_count_tmp >= block_max) {
  block_count_tmp = block_max;

} else {
  int nwork = m_league_size * m_team_size;
  int items_per_thread = (nwork + block_count_tmp * m_team_size - 1) /
                         (block_count_tmp * m_team_size);
  if (items_per_thread < 4) {
    int ratio = std::min(
        (block_count_tmp + preferred_block_min - 1) / preferred_block_min,
        (4 + items_per_thread - 1) / items_per_thread);
    block_count_tmp /= ratio;
  }
}
Contributor

Wouldn't it make sense to apply the first case unconditionally, i.e.,

Suggested change
} else if (block_count_tmp >= block_max) {
  block_count_tmp = block_max;
} else {
  int nwork = m_league_size * m_team_size;
  int items_per_thread = (nwork + block_count_tmp * m_team_size - 1) /
                         (block_count_tmp * m_team_size);
  if (items_per_thread < 4) {
    int ratio = std::min(
        (block_count_tmp + preferred_block_min - 1) / preferred_block_min,
        (4 + items_per_thread - 1) / items_per_thread);
    block_count_tmp /= ratio;
  }
}
} else {
  int nwork = m_league_size * m_team_size;
  int items_per_thread = (nwork + block_count_tmp * m_team_size - 1) /
                         (block_count_tmp * m_team_size);
  if (items_per_thread < 4) {
    int ratio = std::min(
        (block_count_tmp + preferred_block_min - 1) / preferred_block_min,
        (4 + items_per_thread - 1) / items_per_thread);
    block_count_tmp /= ratio;
  }
  if (block_count_tmp >= block_max)
    block_count_tmp = block_max;
}

Member

What's the point of doing that? It's harder to read.

Contributor

The current logic limits the number of blocks to some maximum value if it exceeds that maximum. If the number of blocks is already less than the maximum, it potentially chooses an even smaller value.
It seems to me that it makes sense to first potentially increase the number of items per thread, and only set the block count to block_max if it still exceeds block_max afterwards.

Member

I am not convinced that you would get better results. It's just a heuristic. We can try, but we need to rerun the benchmarks.

Contributor

Of course, benchmarks would need to be rerun. It just sticks out that the current version looks discontinuous around block_max whenever the number of items per thread is less than 4.

Member

It just sticks out that the current version looks discontinuous

That doesn't bother me. Unless we get better results, I would keep the code as is.

Comment on lines 764 to 768
if ((property & light_weight) == light_weight) {
  block_max = 2097152;
} else {
  block_max = 65536;
}
Member

Are these magic numbers tuned for a specific architecture?

Contributor Author

These numbers are tuned for MI200; they were tested on MI210s.

        UseShflReduction
            ? std::min(m_league_size, size_type(1024 * HIPTraits::WarpSize))
            : std::min(static_cast<int>(m_league_size), m_team_size);
    int block_max = 0;
Member

Suggested change
int block_max = 0;
int block_max;

or (better?)

constexpr auto light_weight =
    Kokkos::Experimental::WorkItemProperty::HintLightWeight;
constexpr bool is_light_weight =
    (typename Policy::work_item_property & light_weight) == light_weight;
constexpr int block_max = is_light_weight ? 2097152 : 65536;

    block_count_tmp /= ratio;
  }
}
const int block_count = block_count_tmp;
Member

@dalg24 commented Aug 22, 2023

Did you consider writing a function that takes the policy as argument and returns that value instead of cramming the computation of block_count here?
As far as I can tell the only variables here (league_size, team_size, and work_item_properties) are determined by the policy.

Contributor Author

It makes sense to me to have a block count heuristic function be a private member of the TeamPolicy class, as this heuristic is specific to TeamPolicy, and the function should only be used internally. Would that work?

Member

I would have defined it within that ParallelReduce class specialization and moved it somewhere else later if it turns out to be needed elsewhere.
I will leave it up to @Rombur.

Contributor Author

@Rombur, I can get this started as soon as you weigh in.

Member

Just create a new function in the ParallelReduce class. We can always move it later if we need it somewhere else.

Member

@IanBogle Do you have time to finish this or do you want me to take over?

Contributor Author

@Rombur If you've got the time, please take over. I've been swamped lately.

Member

Sounds good.

Member

@crtrott left a comment

This definitely needs to go into some kind of function

@Rombur force-pushed the kokkos_hip_teampolicy_block_heuristic branch from ff70042 to b7cc711 on October 6, 2023 18:20
@Rombur force-pushed the kokkos_hip_teampolicy_block_heuristic branch from b7cc711 to e3acaf9 on October 6, 2023 18:23
@Rombur
Member

Rombur commented Oct 9, 2023

This is ready for review. The failing tests are unrelated to this PR.

@@ -653,6 +656,35 @@ class ParallelReduce<CombinedFunctorReducerType,
    }
  }

  int compute_block_count() {
Member

At the very least, make it a const member function so the reader knows at a glance that you do not modify data members.
I think that making it a static function and explicitly passing the policy might increase readability further, but I will not block on this part.
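
For illustration, a hedged sketch of a static variant along those lines (not the merged implementation: the preferred_block_min constant, the initial block count, and the omitted "already low parallelism" branch are simplifications/assumptions here):

#include <algorithm>

// Assumes league_size >= 1 and team_size >= 1.
static int compute_block_count_sketch(int league_size, int team_size,
                                      bool is_light_weight) {
  // Larger cap only when the user hints the kernel is lightweight.
  const int block_max = is_light_weight ? 2097152 : 65536;
  const int preferred_block_min = 1024;  // assumption: tuning constant

  int block_count_tmp = league_size;  // assumption: start from the league size
  if (block_count_tmp >= block_max) {
    block_count_tmp = block_max;
  } else {
    // If each thread would see fewer than 4 work items, shrink the grid so
    // threads get more work, without dropping below preferred_block_min.
    const int nwork = league_size * team_size;
    const int items_per_thread =
        (nwork + block_count_tmp * team_size - 1) /
        (block_count_tmp * team_size);
    if (items_per_thread < 4) {
      const int ratio = std::min(
          (block_count_tmp + preferred_block_min - 1) / preferred_block_min,
          (4 + items_per_thread - 1) / items_per_thread);
      block_count_tmp /= ratio;
    }
  }
  return block_count_tmp;
}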

@Rombur
Member

Rombur commented Oct 10, 2023

Retest this please

@dalg24 dismissed crtrott’s stale review on October 10, 2023 19:58

Function was added

@dalg24 merged commit 8420c2f into kokkos:develop on Oct 10, 2023
28 checks passed
@Rombur mentioned this pull request on Oct 10, 2023