[HIP] Add multiple LaunchMechanism #3820

Rombur · 2021-03-02T19:42:05Z

Until now, kernels were always launched using local memory. This PR adds two new kernel launch mechanisms: constant memory and global memory. The code is similar to the CUDA code refactored by @dhollman.

dalg24 · 2021-03-02T21:37:47Z

core/src/HIP/Kokkos_HIP_Instance.cpp

+    HIP_SAFE_CALL(hipHostMalloc((void **)&constantMemHostStaging,
+                                HIPTraits::ConstantMemoryUsage));


I know you copied that from Cuda, but I am now wondering whether that memory should be tracked via HIPHostPinnedSpace::allocate. @crtrott, what do you think?

masterleinad · 2021-03-02T22:38:49Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

 }

 template <class DriverType>
 __global__ static void hip_parallel_launch_local_memory(
    const DriverType *driver) {
+  // FIXME_HIP driver() pass by copy


Why is this a TODO?

Because we cannot pass a driver by copy right now. This triggers a bug in the compiler.

masterleinad · 2021-03-02T22:49:18Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

+    (base_t::get_kernel_func())<<<grid, block, shmem, hip_instance->m_stream>>>(
+        driver);


Why don't we have code corresponding to

DriverType* driver_ptr = reinterpret_cast<DriverType*>(
    cuda_instance->scratch_functor(sizeof(DriverType)));
cudaMemcpyAsync(driver_ptr, &driver, sizeof(DriverType), cudaMemcpyDefault,
                cuda_instance->m_stream);
(base_t::get_kernel_func())<<<grid, block, shmem, cuda_instance->m_stream>>>(
    driver_ptr);

?

Same reason as above (the compiler bug).

In that case, this should also be a FIXME, I guess.
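For readers who do not have the Cuda code at hand, the pattern quoted above (copy the driver into scratch space, then hand the kernel a pointer) can be sketched on the host in plain C++. This is only an illustration under stated assumptions: `ScratchArena`, `kernel_taking_pointer`, `launch_via_pointer` and `AddOne` are hypothetical names, no GPU API is involved, and a placement-new copy stands in for the asynchronous memcpy on the stream.

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the instance-owned functor scratch space
// (the role played by scratch_functor() in the Cuda snippet quoted above).
class ScratchArena {
 public:
  void* scratch_functor(std::size_t bytes) {
    if (buffer_.size() < bytes) buffer_.resize(bytes);
    return buffer_.data();
  }

 private:
  std::vector<unsigned char> buffer_;
};

// Stand-in for the kernel: it receives the driver by pointer, not by copy.
template <class DriverType>
void kernel_taking_pointer(const DriverType* driver) {
  (*driver)();
}

// Copies the driver into scratch space, then "launches" with the pointer.
// On a device this copy would be an asynchronous memcpy on the stream.
template <class DriverType>
void launch_via_pointer(ScratchArena& arena, const DriverType& driver) {
  static_assert(std::is_trivially_copyable<DriverType>::value,
                "the sketch only handles trivially copyable drivers");
  static_assert(alignof(DriverType) <= alignof(std::max_align_t),
                "over-aligned drivers would need a different buffer");
  void* raw = arena.scratch_functor(sizeof(DriverType));
  const DriverType* driver_ptr = ::new (raw) DriverType(driver);
  kernel_taking_pointer(driver_ptr);
}

// Example driver used only for illustration.
struct AddOne {
  int* target;
  void operator()() const { *target += 1; }
};
```

Calling `launch_via_pointer(arena, AddOne{&value})` runs the functor through the staged copy, which is the part the HIP path skips for now.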

dalg24 · 2021-03-10T14:03:07Z

Retest this please

dalg24 · 2021-03-10T20:24:49Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

+//-----------------------------//
+// HIPParallelLaunch structure //
+//-----------------------------//
+#if HIP_VERSION < 401


I am tempted to delay until Kokkos 3.5

dalg24

Blocking. Merge onto Kokkos 3.5

dhollman

There's still too much unnecessary, undocumented duplication of code in this pull request. If I get overruled on this, fine, but I wouldn't feel comfortable maintaining this myself as is.

dhollman · 2021-03-17T21:07:47Z

core/src/HIP/Kokkos_HIP_Instance.hpp

+  // FIXME_HIP: these want to be per-device, not per-stream...  use of 'static'
+  // here will break once there are multiple devices though
+  static unsigned long *constantMemHostStaging;
+  static hipEvent_t constantMemReusable;


Can we call this something that indicates it's an event when read in code? Like constantMemAvailableEvent or something?

dhollman · 2021-03-17T21:14:21Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

@@ -87,12 +88,13 @@ __global__ __launch_bounds__(
  const DriverType &driver = *(reinterpret_cast<const DriverType *>(
      kokkos_impl_hip_constant_memory_buffer));

-  driver->operator()();
+  driver();


There's no real reason for this to be separate from the Cuda version, but if we're going to copy and paste things, let's at least add a comment that says something like "should be exactly the same as cuda_parallel_launch_constant_memory()", plus the analogous comment in Kokkos_Cuda_KernelLaunch.hpp, so that anyone who changes either implementation knows to consider changing the other (or removing the comment, if appropriate).

Please add a comment as suggested above

dhollman · 2021-03-17T21:17:46Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

+}
+
+template <typename DriverType, unsigned int maxTperB, unsigned int minBperSM>
+__global__ __launch_bounds__(


Same thing. Please add a comment saying that these are currently identical to the analogous Cuda versions, and add a comment on the Cuda versions saying that they are identical to the HIP versions.

dhollman · 2021-03-17T21:25:57Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

+                        ? HIPLaunchMechanism::ConstantMemory
+                        : HIPLaunchMechanism::GlobalMemory)
+                 : (default_launch_mechanism));
+};


This is a massive amount of very complicated code to be duplicating with Cuda, especially since the only things that are different between the two are the name, HIPTraits and HIPLaunchMechanism. (I've attached a diff image for reference). This could easily be done with a template rather than copy/paste, and I'm pretty strongly opposed to copy/pasting here. If we ever reach the point where these need to evolve separately, it's easy enough to do a partial specialization of a more general template for the case of HIPTraits and HIPLaunchMechanism, so I don't see the advantage of copy/pasting code here instead.
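A minimal sketch of the template approach suggested here, assuming the deduction depends only on the size of the driver: the backend-specific pieces (the traits type and the mechanism enum) become template parameters, and each backend keeps a thin alias. The `Example*Traits` types, their numeric limits, the `KernelArgumentLimit` member, and the simplified size rule are assumptions made for illustration; the real deduction in the PR has more cases than shown.

```cpp
#include <cstddef>

enum class CudaLaunchMechanism { LocalMemory, ConstantMemory, GlobalMemory };
enum class HIPLaunchMechanism { LocalMemory, ConstantMemory, GlobalMemory };

// Placeholder traits: the numeric limits are illustrative, not Kokkos values.
struct ExampleCudaTraits {
  static constexpr std::size_t KernelArgumentLimit = 4096;
  static constexpr std::size_t ConstantMemoryUsage = 32768;
};
struct ExampleHIPTraits {
  static constexpr std::size_t KernelArgumentLimit = 4096;
  static constexpr std::size_t ConstantMemoryUsage = 32768;
};

// One generic deduction shared by both backends (size-based rule assumed).
// A backend that needs to diverge later can partially specialize this
// template for its own Traits and Mechanism types.
template <class DriverType, class Traits, class Mechanism>
struct DeduceLaunchMechanism {
  static constexpr Mechanism launch_mechanism =
      sizeof(DriverType) < Traits::KernelArgumentLimit
          ? Mechanism::LocalMemory
          : (sizeof(DriverType) < Traits::ConstantMemoryUsage
                 ? Mechanism::ConstantMemory
                 : Mechanism::GlobalMemory);
};

// Thin per-backend aliases instead of two copies of the same code.
template <class DriverType>
using DeduceCudaLaunchMechanism =
    DeduceLaunchMechanism<DriverType, ExampleCudaTraits, CudaLaunchMechanism>;
template <class DriverType>
using DeduceHIPLaunchMechanism =
    DeduceLaunchMechanism<DriverType, ExampleHIPTraits, HIPLaunchMechanism>;

// Drivers of different sizes, used only to exercise the three branches.
struct SmallDriver { char payload[64]; };
struct MediumDriver { char payload[8192]; };
struct LargeDriver { char payload[65536]; };
```

With this shape, a diff between the two backends reduces to the two alias declarations and the traits definitions.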

Consider creating an issue to keep track of this

dhollman · 2021-03-17T21:28:07Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

+  static auto get_kernel_func() {
+    return hip_parallel_launch_constant_memory<DriverType>;
+  }
+};


Again, this feels like an unnecessarily large amount of duplication for something that's easy enough to specialize later. (Unlike above, though, I wouldn't consider this a blocking problem for the pull request, since I guess there's already other stuff like this in the file.)

dhollman · 2021-03-17T21:32:30Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

@@ -170,6 +288,67 @@ struct HIPParallelLaunchKernelInvoker<DriverType, LaunchBounds,
  }
 };

+// HIPLaunchMechanism::GlobalMemory specialization
+template <typename DriverType, typename LaunchBounds>


Please comment with the line ranges that are identical in Kokkos_Cuda_KernelLaunch.hpp.

Also, if we're going to copy/paste things like this, please don't make unnecessary stylistic changes like template <class...> to template <typename...>. This just makes it harder for someone to come along later with a diff tool and figure out what the salient differences are. (Again, I would argue that the fact that we're even discussing a reader having to use a diff tool is a major problem, but if I'm going to lose that argument, please at least don't make unnecessary stylistic changes that make it harder for the reader to even use diff).

dhollman · 2021-03-17T21:36:55Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

+#else
+template <typename DriverType, typename LaunchBounds = Kokkos::LaunchBounds<>,
+          HIPLaunchMechanism LaunchMechanism =
+              DeduceHIPLaunchMechanism<DriverType>::launch_mechanism>


This is the only line that really differs between these two; the rest is essentially the same. I see no reason to duplicate this much code; just put the preprocessor #if around this line and the corresponding line in the above #if block.
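One way to read this suggestion is sketched below: the version check wraps only the default template argument, and the struct body is written once. The quoted hunk does not show what the older branch uses as its default, so the `LocalMemory` default in the `< 401` branch is purely an assumption, as are `EXAMPLE_HIP_VERSION` (a stand-in for `HIP_VERSION`), the placeholder deduction, and the `HIPParallelLaunchSketch` name.

```cpp
// Hypothetical stand-in for HIP_VERSION so the sketch compiles anywhere.
#ifndef EXAMPLE_HIP_VERSION
#define EXAMPLE_HIP_VERSION 402
#endif

enum class HIPLaunchMechanism { LocalMemory, ConstantMemory, GlobalMemory };

// Placeholder deduction: always picks ConstantMemory in this sketch.
template <class DriverType>
struct DeduceHIPLaunchMechanism {
  static constexpr HIPLaunchMechanism launch_mechanism =
      HIPLaunchMechanism::ConstantMemory;
};

// Only the default argument sits inside the preprocessor conditional.
template <class DriverType,
#if EXAMPLE_HIP_VERSION < 401
          // Assumed default for older versions (not shown in the hunk).
          HIPLaunchMechanism LaunchMechanism = HIPLaunchMechanism::LocalMemory>
#else
          HIPLaunchMechanism LaunchMechanism =
              DeduceHIPLaunchMechanism<DriverType>::launch_mechanism>
#endif
struct HIPParallelLaunchSketch {
  // Shared body: written once, regardless of the version.
  static constexpr HIPLaunchMechanism mechanism = LaunchMechanism;
};

// Usage: the default follows the version check, an explicit argument wins.
using DefaultLaunch = HIPParallelLaunchSketch<int>;
using ExplicitGlobalLaunch =
    HIPParallelLaunchSketch<int, HIPLaunchMechanism::GlobalMemory>;
```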

Rombur · 2021-06-22T14:29:07Z

ping

dalg24 · 2021-05-26T15:50:34Z

core/src/HIP/Kokkos_HIP_BlockSize_Deduction.hpp

+template <typename DriverType, typename LaunchBounds = Kokkos::LaunchBounds<>,
+          HIPLaunchMechanism LaunchMechanism =
+              DeduceHIPLaunchMechanism<DriverType>::launch_mechanism>
+unsigned get_max_blocksize_impl() {


Move the definition closer to get_preferred_blocksize_impl()

dalg24 · 2021-06-07T21:22:31Z

core/src/HIP/Kokkos_HIP_BlockSize_Deduction.hpp

+    if (static_cast<bool>(
+            HIPParallelLaunch<DriverType, LaunchBounds,
+                              LaunchMechanism>::get_scratch_size())) {


> 0 would be more readable IMO

dalg24 · 2021-06-07T21:26:55Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

@@ -87,12 +88,13 @@ __global__ __launch_bounds__(
  const DriverType &driver = *(reinterpret_cast<const DriverType *>(
      kokkos_impl_hip_constant_memory_buffer));

-  driver->operator()();
+  driver();


Please add a comment as suggested above

dalg24 · 2021-06-07T21:43:37Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

+                        ? HIPLaunchMechanism::ConstantMemory
+                        : HIPLaunchMechanism::GlobalMemory)
+                 : (default_launch_mechanism));
+};


Consider creating an issue to keep track of this

dalg24 · 2021-06-07T21:44:29Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

+template <bool default_launchbound_val>
+struct HIPParallelLaunchKernelFuncData {
+  static constexpr auto default_launchbounds() {
+    return !default_launchbound_val;


I find this confusing.

dalg24 · 2021-06-07T21:46:24Z

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp

+  }
+
+  static auto get_scratch_size() {
+    return HIPParallelLaunchKernelFuncData<true>::get_scratch_size(


This is not very readable

Addressed and not here anymore

Rombur · 2021-07-12T12:33:55Z

ping

crtrott · 2021-07-15T00:22:50Z

core/src/HIP/Kokkos_HIP_BlockSize_Deduction.hpp

+    return LaunchBounds::maxTperB;
+  } else {
+    // we can always fit 1024 threads blocks if we only care about registers
+    // ... and don't mind spilling


Uhm we do mind spilling very much??

crtrott · 2021-07-15T00:23:12Z

core/src/HIP/Kokkos_HIP_BlockSize_Deduction.hpp

+                          LaunchMechanism>::get_scratch_size() > 0) {
+      return HIPTraits::ConservativeThreadsPerBlock;
+    }
+    return HIPTraits::MaxThreadsPerBlock;


uhm doesn't that mean we will spill like crazy?

All of this is just code that has been moved around from #3953. We automatically adapt the block size to reduce spilling.
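The selection logic visible in the quoted hunks can be sketched on the host as follows. This is a rough illustration, not the code from the PR: the scratch size is passed in as a plain argument instead of being queried from the kernel, the check for user-specified launch bounds is assumed to be `maxTperB != 0`, a nonzero scratch size is assumed to indicate spilling, and the value of `ConservativeThreadsPerBlock` is a placeholder. The `Example*` names are hypothetical.

```cpp
#include <cstddef>

// Placeholder traits; only the 1024 figure appears in the quoted comment.
struct ExampleHIPTraits {
  static constexpr unsigned ConservativeThreadsPerBlock = 256;  // placeholder
  static constexpr unsigned MaxThreadsPerBlock = 1024;
};

// Hypothetical launch-bounds type; 0 means "not specified by the user".
template <unsigned MaxT = 0, unsigned MinB = 0>
struct ExampleLaunchBounds {
  static constexpr unsigned maxTperB = MaxT;
  static constexpr unsigned minBperSM = MinB;
};

// Returns the upper bound on the block size for a kernel.
template <class LaunchBoundsType>
unsigned get_max_blocksize_sketch(std::size_t scratch_size_bytes) {
  // User-specified launch bounds take precedence (assumed check).
  if (LaunchBoundsType::maxTperB != 0) {
    return LaunchBoundsType::maxTperB;
  }
  // A kernel that already uses scratch memory gets the smaller bound,
  // written with the "> 0" comparison requested in review.
  if (scratch_size_bytes > 0) {
    return ExampleHIPTraits::ConservativeThreadsPerBlock;
  }
  return ExampleHIPTraits::MaxThreadsPerBlock;
}

// Aliases used in the usage example.
using DefaultBounds = ExampleLaunchBounds<>;
using Bounds128 = ExampleLaunchBounds<128>;
```

For example, `get_max_blocksize_sketch<DefaultBounds>(0)` yields the maximum, while any nonzero scratch size falls back to the conservative value.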

dalg24 reviewed Mar 2, 2021

masterleinad reviewed Mar 2, 2021

Rombur force-pushed the hip_kernellaunch branch from d385cd1 to 60c82db Compare March 9, 2021 18:52

dalg24 approved these changes Mar 9, 2021

masterleinad approved these changes Mar 9, 2021

Rombur force-pushed the hip_kernellaunch branch from 60c82db to 99fb5cb Compare March 9, 2021 19:25

dalg24 reviewed Mar 10, 2021

dalg24 previously requested changes Mar 10, 2021

dalg24 added this to the Tentative 3.5 Release milestone Mar 10, 2021

dhollman self-requested a review March 17, 2021 20:46

dhollman previously requested changes Mar 17, 2021

Rombur force-pushed the hip_kernellaunch branch from 99fb5cb to b8bc6be Compare May 25, 2021 14:51

dalg24 reviewed Jun 22, 2021

Rombur added 4 commits June 23, 2021 08:55

Add multiple LaunchMechanism

10af064

Use enum class instead of enum for BlockType

2cb6fb9

Move function to compute the BlockSize from KernelLaunch to BlockSize_Deduction

5c2aad7

Move BlockType enum from KernelLaunch to BlockSize_Deduction

5ca74df

Rombur force-pushed the hip_kernellaunch branch from dc4b6f9 to 5ca74df Compare June 23, 2021 14:11

Rombur mentioned this pull request Jun 23, 2021

HIP duplicates code from CUDA in KernelLaunch #4122

Open

Rombur added this to Awaiting Feedback in Developer: Bruno Turcksin Jul 14, 2021

crtrott added this to In progress in Kokkos Release 3.5 Jul 14, 2021

crtrott moved this from In progress to Awaiting Feedback in Kokkos Release 3.5 Jul 14, 2021

crtrott reviewed Jul 15, 2021

dalg24 approved these changes Jul 20, 2021

dalg24 merged commit 4e628d4 into kokkos:develop Jul 20, 2021

Kokkos Release 3.5 automation moved this from Awaiting Feedback to Done Jul 20, 2021

Rombur moved this from Awaiting Feedback to Done in Developer: Bruno Turcksin Aug 2, 2021

dalg24 mentioned this pull request Aug 16, 2021

[HIP] Fix issue w/ static initialization of function attributes #4242

Merged

Rombur deleted the hip_kernellaunch branch September 19, 2022 12:34
