Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement non-blocking kernel launches for HIP backend #3697

Merged
merged 1 commit into from
Jan 5, 2021

Conversation

skyreflectedinmirrors
Copy link
Contributor

@skyreflectedinmirrors skyreflectedinmirrors commented Jan 4, 2021

  • HIP instance owns an array of drivers allocated in host-pinned memory, and marked non-coherent to enable caching in L2 when possible
  • Successive kernel calls cycle through available drivers, causing a synchronization only once the limit reached.
  • Calling a fence on the instance will reset the cycle index

core/unit_test/hip/TestHIP_AsyncLauncher.cpp Outdated Show resolved Hide resolved
core/unit_test/hip/TestHIP_AsyncLauncher.cpp Outdated Show resolved Hide resolved
core/unit_test/hip/TestHIP_AsyncLauncher.cpp Outdated Show resolved Hide resolved
fence();
HIP_SAFE_CALL(hipHostFree(d_driverWorkArray));
m_maxDriverTypeSize = driverTypeSize;
if (m_maxDriverTypeSize % 128 != 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does 128 come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm -- Leopold wrote that bit, I assume it was some attempt at padding / alignment, but I'll check w/ him. It's probably safe to remove since it's pretty arbitrary.

	- HIP instance owns an array of drivers allocated in host-pinned memory, and marked non-coherent to enable caching in L2 when possible
        - Successive kernel calls cycle through available drivers, causing a synchronization only once the limit reached.
        - Calling a fence on the instance will reset the cycle index

Change-Id: Ibec81051ac6018d8aef4510b3450428b0e52d822
Copy link
Member

@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to add one more step and copy the thing to the device. I.e. have one copy on the device and issue a async memcpy from the host to device before issuing the kernel. Otherwise every access to a member of the drivertype will go to the host won't it?

@crtrott
Copy link
Member

crtrott commented Jan 5, 2021

Ok hm my own testing doesn't show that this slows stuff down. Does this allocation get cached on the GPU?

@skyreflectedinmirrors
Copy link
Contributor Author

skyreflectedinmirrors commented Jan 5, 2021

@crtrott -- yes, allocating the host pointer as non-coherent here tells the GPU to attempt to cache the Driver in L2, so the cost should be minimal after the first load.

We could make it an async memcopy to a device pointer, but in practice we've found this is intolerably slow for small copies because the runtime is... not so great for that use-case (yet, we continue to push on them on it). It would probably work fairly well if you could actually queue up ~100 kernels w/o interruption, but right now that isn't typically the case (e.g., in LAMMPS) because there are other (sometimes hidden) synchronization points (e.g., reductions, fences, etc).

Another option we had played around with was using LargeBAR support to just directly std::memcpy to a device pointer (using the same cycling pattern as in this PR). LargeBAR is enabled for all the server-grade cards, i.e., the MI-50's / 100's, but not the Radeon 7's of the world, and I actually have a branch that adds detection as a cmake configure time option.

For the moment I think this is a reasonable implementation, but we might consider implementing the async and LargeBAR options, and directly benchmarking them against this cached zero-copy approach on LAMMPS, ArborX, and/or any of the other codes we know are working (more or less) in the HIP backend right now.

@crtrott
Copy link
Member

crtrott commented Jan 5, 2021

That makes sense thanks! The non-coherent part is what I didn't quite recognize. In this case I don't see an issue, and this definitely cuts down launch latency quite a bit.

@crtrott crtrott merged commit fe7e4e4 into kokkos:develop Jan 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants