-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HIP] Add multiple LaunchMechanism #3820
Conversation
HIP_SAFE_CALL(hipHostMalloc((void **)&constantMemHostStaging, | ||
HIPTraits::ConstantMemoryUsage)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you copied that from Cuda but wondering now whether that memory should be tracked via HIPHostPinnedSpace::allocate
. @crtrott what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably.
} | ||
|
||
template <class DriverType> | ||
__global__ static void hip_parallel_launch_local_memory( | ||
const DriverType *driver) { | ||
// FIXME_HIP driver() pass by copy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a TODO
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we cannot pass a driver by copy right now. This triggers a bug in the compiler.
(base_t::get_kernel_func())<<<grid, block, shmem, hip_instance->m_stream>>>( | ||
driver); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we have code corresponding to
DriverType* driver_ptr = reinterpret_cast<DriverType*>(
cuda_instance->scratch_functor(sizeof(DriverType)));
cudaMemcpyAsync(driver_ptr, &driver, sizeof(DriverType), cudaMemcpyDefault,
cuda_instance->m_stream);
(base_t::
get_kernel_func())<<<grid, block, shmem, cuda_instance->m_stream>>>(
driver_ptr);
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same reason
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, this should also be a FIXME, I guess.
Retest this please |
//-----------------------------// | ||
// HIPParallelLaunch structure // | ||
//-----------------------------// | ||
#if HIP_VERSION < 401 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am tempted to delay until Kokkos 3.5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocking. Merge onto Kokkos 3.5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still too much unnecessary, undocumented duplication of code in this pull request. If I get overruled on this, fine, but I wouldn't feel comfortable maintaining this myself as is.
// FIXME_HIP: these want to be per-device, not per-stream... use of 'static' | ||
// here will break once there are multiple devices though | ||
static unsigned long *constantMemHostStaging; | ||
static hipEvent_t constantMemReusable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call this something that indicates it's an event when read in code? Like constantMemAvailableEvent
or something?
@@ -87,12 +88,13 @@ __global__ __launch_bounds__( | |||
const DriverType &driver = *(reinterpret_cast<const DriverType *>( | |||
kokkos_impl_hip_constant_memory_buffer)); | |||
|
|||
driver->operator()(); | |||
driver(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no real reason for this to be separate from the Cuda version, but if we're going to copy and paste things, let's at least add a comment that says something like "should be exactly the same as cuda_parallel_launch_constant_memory()
and the analogous comment in Kokkos_Cuda_KernelLaunch.hpp
so that anyone who changes either implementation knows to consider changing the other (or removing the comment, if appropriate).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment as suggested above
} | ||
|
||
template <typename DriverType, unsigned int maxTperB, unsigned int minBperSM> | ||
__global__ __launch_bounds__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same thing. Please comment that these are currently identical to the analogous Cuda versions and comment on the Cuda versions that those are identical to the HIP versions
? HIPLaunchMechanism::ConstantMemory | ||
: HIPLaunchMechanism::GlobalMemory) | ||
: (default_launch_mechanism)); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a massive amount of very complicated code to be duplicating with Cuda, especially since the only things that are different between the two are the name, HIPTraits
and HIPLaunchMechanism
. (I've attached a diff image for reference). This could easily be done with a template rather than copy/paste, and I'm pretty strongly opposed to copy/pasting here. If we ever reach the point where these need to evolve separately, it's easy enough to do a partial specialization of a more general template for the case of HIPTraits
and HIPLaunchMechanism
, so I don't see the advantage of copy/pasting code here instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider creating an issue to keep track of this
static auto get_kernel_func() { | ||
return hip_parallel_launch_constant_memory<DriverType>; | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again this feels like an unnecessarily large amount of duplication for something that's easy enough to specialize later. (Unlike above, though, I wouldn't consider this a blocking problem for the pull request since there's already other stuff like this in the file I guess).
@@ -170,6 +288,67 @@ struct HIPParallelLaunchKernelInvoker<DriverType, LaunchBounds, | |||
} | |||
}; | |||
|
|||
// HIPLaunchMechanism::GlobalMemory specialization | |||
template <typename DriverType, typename LaunchBounds> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please comment with the line ranges that are identical in Kokkos_Cuda_KernelLaunch.hpp
.
Also, if we're going to copy/paste things like this, please don't make unnecessary stylistic changes like template <class...>
to template <typename...>
. This just makes it harder for someone to come along later with a diff tool and figure out what the salient differences are. (Again, I would argue that the fact that we're even discussing a reader having to use a diff tool is a major problem, but if I'm going to lose that argument, please at least don't make unnecessary stylistic changes that make it harder for the reader to even use diff).
#else | ||
template <typename DriverType, typename LaunchBounds = Kokkos::LaunchBounds<>, | ||
HIPLaunchMechanism LaunchMechanism = | ||
DeduceHIPLaunchMechanism<DriverType>::launch_mechanism> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only line that really differs between these two; the rest is essentially the same. I see no reason to duplicate this much code; just put the preprocessor #if
around this line and the corresponding line in the above #if
block.
ping |
template <typename DriverType, typename LaunchBounds = Kokkos::LaunchBounds<>, | ||
HIPLaunchMechanism LaunchMechanism = | ||
DeduceHIPLaunchMechanism<DriverType>::launch_mechanism> | ||
unsigned get_max_blocksize_impl() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the definition closer to get_preferred_blocksize_impl()
if (static_cast<bool>( | ||
HIPParallelLaunch<DriverType, LaunchBounds, | ||
LaunchMechanism>::get_scratch_size())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> 0
would be more readable IMO
@@ -87,12 +88,13 @@ __global__ __launch_bounds__( | |||
const DriverType &driver = *(reinterpret_cast<const DriverType *>( | |||
kokkos_impl_hip_constant_memory_buffer)); | |||
|
|||
driver->operator()(); | |||
driver(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment as suggested above
? HIPLaunchMechanism::ConstantMemory | ||
: HIPLaunchMechanism::GlobalMemory) | ||
: (default_launch_mechanism)); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider creating an issue to keep track of this
template <bool default_launchbound_val> | ||
struct HIPParallelLaunchKernelFuncData { | ||
static constexpr auto default_launchbounds() { | ||
return !default_launchbound_val; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this confusing.
} | ||
|
||
static auto get_scratch_size() { | ||
return HIPParallelLaunchKernelFuncData<true>::get_scratch_size( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not very readable
Addressed and not here anymore
ping |
return LaunchBounds::maxTperB; | ||
} else { | ||
// we can always fit 1024 threads blocks if we only care about registers | ||
// ... and don't mind spilling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm we do mind spilling very much??
LaunchMechanism>::get_scratch_size() > 0) { | ||
return HIPTraits::ConservativeThreadsPerBlock; | ||
} | ||
return HIPTraits::MaxThreadsPerBlock; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uhm doesn't that mean we will spill like crazy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this is just code that has been moved around from #3953 We automatically adapt the blocksize to decrease the spilling.
Until now when launching a kernel, we always used local memory. This PR adds the two new types of kernel launch: constant memory and global memory. The code is similar to the CUDA code refactored by @dhollman.