Implement non-blocking kernel launches for HIP backend #3697

skyreflectedinmirrors · 2021-01-04T20:38:20Z

The HIP instance owns an array of drivers allocated in host-pinned memory and marked non-coherent to enable caching in L2 when possible.
Successive kernel calls cycle through the available drivers, causing a synchronization only once the limit is reached.
Calling a fence on the instance resets the cycle index.
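
As a rough illustration of the cycling scheme described above, here is a minimal host-only sketch. It is not the code from this PR: `DriverSlotCycler`, `acquire_slot`, and `sync_count` are hypothetical names, and the device fence is replaced by a counter so the logic can be followed without a GPU.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the instance-owned driver slots.
// All names here are illustrative assumptions, not Kokkos API.
class DriverSlotCycler {
 public:
  explicit DriverSlotCycler(std::size_t num_slots) : m_num_slots(num_slots) {
    assert(num_slots > 0);
  }

  // Returns the slot to use for the next kernel launch.
  // A synchronization happens only when every slot has been handed out.
  std::size_t acquire_slot() {
    if (m_next == m_num_slots) {
      fence();  // earlier launches may still read their slots: wait, then reuse
    }
    return m_next++;
  }

  // Stand-in for a device fence: records the sync and resets the cycle index.
  void fence() {
    ++m_sync_count;
    m_next = 0;
  }

  std::size_t sync_count() const { return m_sync_count; }

 private:
  std::size_t m_num_slots;
  std::size_t m_next = 0;
  std::size_t m_sync_count = 0;
};
```

With three slots, the first three calls to `acquire_slot()` return 0, 1, 2 without synchronizing; the fourth triggers one fence and returns slot 0 again. An explicit `fence()` at any point also restarts the cycle at slot 0.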

core/src/HIP/Kokkos_HIP_Instance.cpp

core/unit_test/hip/TestHIP_AsyncLauncher.cpp

Rombur · 2021-01-04T22:25:03Z

core/src/HIP/Kokkos_HIP_Instance.cpp

+    fence();
+    HIP_SAFE_CALL(hipHostFree(d_driverWorkArray));
+    m_maxDriverTypeSize = driverTypeSize;
+    if (m_maxDriverTypeSize % 128 != 0)


Where does 128 come from?

Hmm -- Leopold wrote that bit, I assume it was some attempt at padding / alignment, but I'll check w/ him. It's probably safe to remove since it's pretty arbitrary.
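
The diff excerpt above is cut off after the `% 128` check, so what follows it is not shown. If the intent was padding, as the reply guesses, the usual pattern is to round the size up to the next multiple of the alignment. A minimal sketch under that assumption (`round_up_to_multiple` is a hypothetical helper, not a function from the PR):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round `size` up to the next multiple of `alignment`.
// This only illustrates the padding idea; the PR's actual code is not shown.
inline std::size_t round_up_to_multiple(std::size_t size,
                                        std::size_t alignment) {
  assert(alignment > 0);
  const std::size_t remainder = size % alignment;
  // Already aligned sizes are returned unchanged.
  return remainder == 0 ? size : size + (alignment - remainder);
}
```

For example, `round_up_to_multiple(129, 128)` yields 256, while a size of exactly 128 is left as is.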

core/unit_test/hip/TestHIP_AsyncLauncher.cpp

- HIP instance owns an array of drivers allocated in host-pinned memory, and marked non-coherent to enable caching in L2 when possible
- Successive kernel calls cycle through available drivers, causing a synchronization only once the limit reached.
- Calling a fence on the instance will reset the cycle index

Change-Id: Ibec81051ac6018d8aef4510b3450428b0e52d822

crtrott

I think we need to add one more step and copy the thing to the device, i.e., keep one copy on the device and issue an async memcpy from the host to the device before launching the kernel. Otherwise every access to a member of the driver type will go to the host, won't it?

crtrott · 2021-01-05T16:36:40Z

Ok, hm, my own testing doesn't show that this slows things down. Does this allocation get cached on the GPU?

skyreflectedinmirrors · 2021-01-05T17:11:59Z

@crtrott -- yes, allocating the host pointer as non-coherent here tells the GPU to attempt to cache the Driver in L2, so the cost should be minimal after the first load.

We could make it an async memcpy to a device pointer, but in practice we've found this is intolerably slow for small copies because the runtime is... not so great for that use case (yet; we continue to push them on it). It would probably work fairly well if you could actually queue up ~100 kernels w/o interruption, but right now that isn't typically the case (e.g., in LAMMPS) because there are other (sometimes hidden) synchronization points (e.g., reductions, fences, etc.).

Another option we had played around with was using LargeBAR support to just directly std::memcpy to a device pointer (using the same cycling pattern as in this PR). LargeBAR is enabled for all the server-grade cards, i.e., the MI-50's / 100's, but not the Radeon 7's of the world, and I actually have a branch that adds detection as a cmake configure time option.

For the moment I think this is a reasonable implementation, but we might consider implementing the async and LargeBAR options, and directly benchmarking them against this cached zero-copy approach on LAMMPS, ArborX, and/or any of the other codes we know are working (more or less) in the HIP backend right now.

crtrott · 2021-01-05T17:14:08Z

That makes sense, thanks! The non-coherent part is what I didn't quite recognize. In this case I don't see an issue, and this definitely cuts down launch latency quite a bit.