[HIP] Add multiple LaunchMechanism #3820

Merged (4 commits) on Jul 20, 2021

Member

@Rombur Rombur commented Mar 2, 2021

Until now, when launching a kernel, we always used local memory. This PR adds two new types of kernel launch: constant memory and global memory. The code is similar to the CUDA code refactored by @dhollman.
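For readers unfamiliar with the three mechanisms, here is a minimal host-only sketch of how a mechanism might be chosen from the size of the functor. The enum mirrors the HIPLaunchMechanism names that appear later in this review; the two byte limits, the helper name, and the example structs are assumptions made for illustration, not the values or code used by the PR.

```cpp
#include <cstddef>

// Mirrors the mechanism names discussed in the PR.
enum class LaunchMechanism { LocalMemory, ConstantMemory, GlobalMemory };

// Hypothetical limits, chosen only for illustration.
constexpr std::size_t kernel_argument_limit = 4096;   // bytes passable as kernel arguments
constexpr std::size_t constant_memory_limit = 32768;  // bytes of the constant-memory buffer

// Pick the first mechanism whose limit can hold a functor of the given size.
constexpr LaunchMechanism select_launch_mechanism(std::size_t driver_size) {
  return driver_size < kernel_argument_limit
             ? LaunchMechanism::LocalMemory
             : (driver_size < constant_memory_limit
                    ? LaunchMechanism::ConstantMemory
                    : LaunchMechanism::GlobalMemory);
}

// Stand-ins for functors of different sizes.
struct SmallDriver { char data[64]; };
struct MediumDriver { char data[8192]; };
struct LargeDriver { char data[65536]; };

static_assert(select_launch_mechanism(sizeof(SmallDriver)) ==
                  LaunchMechanism::LocalMemory,
              "small functors fit in the kernel arguments");
```

Under these assumed limits, a small functor is passed directly as a kernel argument, a medium one is staged through constant memory, and anything larger is copied to global memory.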

Comment on lines +297 to +298
HIP_SAFE_CALL(hipHostMalloc((void **)&constantMemHostStaging,
HIPTraits::ConstantMemoryUsage));
Member

I know you copied that from Cuda but wondering now whether that memory should be tracked via HIPHostPinnedSpace::allocate. @crtrott what do you think?

Member

probably.

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp (outdated, resolved)
}

template <class DriverType>
__global__ static void hip_parallel_launch_local_memory(
const DriverType *driver) {
// FIXME_HIP driver() pass by copy
Contributor

Why is this a TODO?

Member Author

Because we cannot pass a driver by copy right now. This triggers a bug in the compiler.

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp (resolved)
Comment on lines +305 to +397
(base_t::get_kernel_func())<<<grid, block, shmem, hip_instance->m_stream>>>(
driver);
Contributor

Why don't we have code corresponding to

  DriverType* driver_ptr = reinterpret_cast<DriverType*>(
        cuda_instance->scratch_functor(sizeof(DriverType)));

    cudaMemcpyAsync(driver_ptr, &driver, sizeof(DriverType), cudaMemcpyDefault,
                    cuda_instance->m_stream);
    (base_t::
         get_kernel_func())<<<grid, block, shmem, cuda_instance->m_stream>>>(
        driver_ptr);

?

Member Author

same reason

Contributor

In that case, this should also be a FIXME, I guess.

core/src/HIP/Kokkos_HIP_KernelLaunch.hpp (outdated, resolved)
Member

dalg24 commented Mar 10, 2021

Retest this please

//-----------------------------//
// HIPParallelLaunch structure //
//-----------------------------//
#if HIP_VERSION < 401
Member

I am tempted to delay until Kokkos 3.5

dalg24 previously requested changes Mar 10, 2021
Member

@dalg24 dalg24 left a comment

Blocking. Merge onto Kokkos 3.5

@dalg24 dalg24 added this to the Tentative 3.5 Release milestone Mar 10, 2021
@dhollman dhollman self-requested a review March 17, 2021 20:46
@dhollman dhollman left a comment

There's still too much unnecessary, undocumented duplication of code in this pull request. If I get overruled on this, fine, but I wouldn't feel comfortable maintaining this myself as is.

// FIXME_HIP: these want to be per-device, not per-stream... use of 'static'
// here will break once there are multiple devices though
static unsigned long *constantMemHostStaging;
static hipEvent_t constantMemReusable;

Can we call this something that indicates it's an event when read in code? Like constantMemAvailableEvent or something?

@@ -87,12 +88,13 @@ __global__ __launch_bounds__(
const DriverType &driver = *(reinterpret_cast<const DriverType *>(
kokkos_impl_hip_constant_memory_buffer));

driver->operator()();
driver();

There's no real reason for this to be separate from the Cuda version, but if we're going to copy and paste things, let's at least add a comment that says something like "should be exactly the same as cuda_parallel_launch_constant_memory()" and the analogous comment in Kokkos_Cuda_KernelLaunch.hpp, so that anyone who changes either implementation knows to consider changing the other (or removing the comment, if appropriate).

Member

Please add a comment as suggested above

}

template <typename DriverType, unsigned int maxTperB, unsigned int minBperSM>
__global__ __launch_bounds__(

Same thing. Please comment that these are currently identical to the analogous Cuda versions, and comment on the Cuda versions that those are identical to the HIP versions.

? HIPLaunchMechanism::ConstantMemory
: HIPLaunchMechanism::GlobalMemory)
: (default_launch_mechanism));
};

This is a massive amount of very complicated code to be duplicating with Cuda, especially since the only things that are different between the two are the name, HIPTraits and HIPLaunchMechanism. (I've attached a diff image for reference). This could easily be done with a template rather than copy/paste, and I'm pretty strongly opposed to copy/pasting here. If we ever reach the point where these need to evolve separately, it's easy enough to do a partial specialization of a more general template for the case of HIPTraits and HIPLaunchMechanism, so I don't see the advantage of copy/pasting code here instead.
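The template approach suggested here could look roughly like the following host-only sketch. It is not the code from either backend: the traits structs, their numeric values, and every name prefixed with Example are hypothetical, and the selection logic is simplified to a size check.

```cpp
#include <cstddef>

enum class LaunchMechanism { LocalMemory, ConstantMemory, GlobalMemory };

// Hypothetical per-backend traits; the numbers are placeholders.
struct ExampleCudaTraits {
  static constexpr std::size_t KernelArgumentLimit = 4096;
  static constexpr std::size_t ConstantMemoryUsage = 32768;
};
struct ExampleHIPTraits {
  static constexpr std::size_t KernelArgumentLimit = 4096;
  static constexpr std::size_t ConstantMemoryUsage = 32768;
};

// One shared implementation, written once against the traits interface.
template <class Traits, class DriverType>
struct DeduceLaunchMechanism {
  static constexpr LaunchMechanism launch_mechanism =
      sizeof(DriverType) < Traits::KernelArgumentLimit
          ? LaunchMechanism::LocalMemory
          : (sizeof(DriverType) < Traits::ConstantMemoryUsage
                 ? LaunchMechanism::ConstantMemory
                 : LaunchMechanism::GlobalMemory);
};

// Backend-specific names become thin aliases instead of copies.
template <class DriverType>
using DeduceExampleCudaLaunchMechanism =
    DeduceLaunchMechanism<ExampleCudaTraits, DriverType>;
template <class DriverType>
using DeduceExampleHIPLaunchMechanism =
    DeduceLaunchMechanism<ExampleHIPTraits, DriverType>;

// If a backend ever has to diverge, a partial specialization on its traits
// overrides the shared logic. This hypothetical backend skips constant memory.
struct ExampleNoConstantTraits {
  static constexpr std::size_t KernelArgumentLimit = 4096;
};
template <class DriverType>
struct DeduceLaunchMechanism<ExampleNoConstantTraits, DriverType> {
  static constexpr LaunchMechanism launch_mechanism =
      sizeof(DriverType) < ExampleNoConstantTraits::KernelArgumentLimit
          ? LaunchMechanism::LocalMemory
          : LaunchMechanism::GlobalMemory;
};
template <class DriverType>
using DeduceExampleNoConstantLaunchMechanism =
    DeduceLaunchMechanism<ExampleNoConstantTraits, DriverType>;

// Stand-ins for functors of different sizes.
struct SmallDriver { char data[64]; };
struct MediumDriver { char data[8192]; };
```

The trade-off is the one raised in the comment: the shared template keeps a single copy of the logic, and divergence is paid for only by the backend that needs it.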

Member

Consider creating an issue to keep track of this

static auto get_kernel_func() {
return hip_parallel_launch_constant_memory<DriverType>;
}
};

Again this feels like an unnecessarily large amount of duplication for something that's easy enough to specialize later. (Unlike above, though, I wouldn't consider this a blocking problem for the pull request since there's already other stuff like this in the file I guess).

@@ -170,6 +288,67 @@ struct HIPParallelLaunchKernelInvoker<DriverType, LaunchBounds,
}
};

// HIPLaunchMechanism::GlobalMemory specialization
template <typename DriverType, typename LaunchBounds>

Please comment with the line ranges that are identical in Kokkos_Cuda_KernelLaunch.hpp.

Also, if we're going to copy/paste things like this, please don't make unnecessary stylistic changes like template <class...> to template <typename...>. This just makes it harder for someone to come along later with a diff tool and figure out what the salient differences are. (Again, I would argue that the fact that we're even discussing a reader having to use a diff tool is a major problem, but if I'm going to lose that argument, please at least don't make unnecessary stylistic changes that make it harder for the reader to even use diff).

#else
template <typename DriverType, typename LaunchBounds = Kokkos::LaunchBounds<>,
HIPLaunchMechanism LaunchMechanism =
DeduceHIPLaunchMechanism<DriverType>::launch_mechanism>

This is the only line that really differs between these two; the rest is essentially the same. I see no reason to duplicate this much code; just put the preprocessor #if around this line and the corresponding line in the above #if block.
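A sketch of what this suggestion might look like, assuming the two branches differ only in the default value of the launch-mechanism template parameter (the older-version branch is not shown in this thread, so its default here is a guess). EXAMPLE_HIP_VERSION stands in for HIP_VERSION, and all Example names are hypothetical.

```cpp
#include <cstddef>

// Stand-in for HIP_VERSION so the sketch compiles without the HIP headers.
#ifndef EXAMPLE_HIP_VERSION
#define EXAMPLE_HIP_VERSION 402
#endif

enum class LaunchMechanism { LocalMemory, ConstantMemory, GlobalMemory };

// Simplified, hypothetical deduction: large functors go to global memory.
template <class DriverType>
struct DeduceExampleLaunchMechanism {
  static constexpr LaunchMechanism launch_mechanism =
      sizeof(DriverType) < 4096 ? LaunchMechanism::LocalMemory
                                : LaunchMechanism::GlobalMemory;
};

// Single declaration; only the default argument is guarded by the
// preprocessor, so the body below is not duplicated per version.
template <class DriverType,
          LaunchMechanism Mechanism =
#if EXAMPLE_HIP_VERSION < 401
              LaunchMechanism::LocalMemory
#else
              DeduceExampleLaunchMechanism<DriverType>::launch_mechanism
#endif
          >
struct ExampleParallelLaunch {
  static constexpr LaunchMechanism mechanism = Mechanism;
};

// Stand-ins for functors of different sizes.
struct SmallDriver { char data[64]; };
struct LargeDriver { char data[8192]; };
```

With the stand-in version set to 402, the default comes from the deduction helper; defining EXAMPLE_HIP_VERSION below 401 would make every instantiation default to LocalMemory.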

Member Author

Rombur commented Jun 22, 2021

ping

Comment on lines 119 to 122
template <typename DriverType, typename LaunchBounds = Kokkos::LaunchBounds<>,
HIPLaunchMechanism LaunchMechanism =
DeduceHIPLaunchMechanism<DriverType>::launch_mechanism>
unsigned get_max_blocksize_impl() {
Member

Move the definition closer to get_preferred_blocksize_impl()

Comment on lines 72 to 74
if (static_cast<bool>(
HIPParallelLaunch<DriverType, LaunchBounds,
LaunchMechanism>::get_scratch_size())) {
Member

> 0 would be more readable IMO

template <bool default_launchbound_val>
struct HIPParallelLaunchKernelFuncData {
static constexpr auto default_launchbounds() {
return !default_launchbound_val;
Member

I find this confusing.

}

static auto get_scratch_size() {
return HIPParallelLaunchKernelFuncData<true>::get_scratch_size(
Member

This is not very readable

@crtrott crtrott dismissed stale reviews from dhollman and dalg24 June 30, 2021 18:10

Addressed and not here anymore

Member Author

Rombur commented Jul 12, 2021

ping

@Rombur Rombur added this to Awaiting Feedback in Developer: Bruno Turcksin Jul 14, 2021
@crtrott crtrott added this to In progress in Kokkos Release 3.5 Jul 14, 2021
@crtrott crtrott moved this from In progress to Awaiting Feedback in Kokkos Release 3.5 Jul 14, 2021
return LaunchBounds::maxTperB;
} else {
// we can always fit 1024 threads blocks if we only care about registers
// ... and don't mind spilling
Member

Uhm we do mind spilling very much??

LaunchMechanism>::get_scratch_size() > 0) {
return HIPTraits::ConservativeThreadsPerBlock;
}
return HIPTraits::MaxThreadsPerBlock;
Member

uhm doesn't that mean we will spill like crazy?

Member Author

All of this is just code that has been moved around from #3953. We automatically adapt the block size to decrease the spilling.
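The snippet under discussion can be read as the following host-only sketch: a kernel that needs scratch memory (which, going by this thread, indicates spilling) is capped at a conservative block size, and any other kernel may use the maximum. The function name and the conservative value of 256 are assumptions for illustration; the 1024 figure comes from the code comment quoted earlier, and the real constants live in HIPTraits.

```cpp
#include <cstddef>

// Hypothetical stand-ins for HIPTraits::ConservativeThreadsPerBlock and
// HIPTraits::MaxThreadsPerBlock.
constexpr unsigned conservative_threads_per_block = 256;
constexpr unsigned max_threads_per_block = 1024;

// A nonzero scratch size is treated as a sign of spilling, so the block
// size is reduced; otherwise the maximum is allowed.
constexpr unsigned max_blocksize_for(std::size_t scratch_size_bytes) {
  return scratch_size_bytes > 0 ? conservative_threads_per_block
                                : max_threads_per_block;
}
```

In the actual code the scratch size is queried from the compiled kernel through HIPParallelLaunch::get_scratch_size(), rather than passed in as a plain argument.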

@dalg24 dalg24 merged commit 4e628d4 into kokkos:develop Jul 20, 2021
Kokkos Release 3.5 automation moved this from Awaiting Feedback to Done Jul 20, 2021
@Rombur Rombur moved this from Awaiting Feedback to Done in Developer: Bruno Turcksin Aug 2, 2021
@Rombur Rombur deleted the hip_kernellaunch branch September 19, 2022 12:34
5 participants