[amdgpu] Implement D2D memcpy as HSA call #69955

Closed

Conversation

JonChesterfield (Collaborator)

hsa_amd_memory_async_copy can handle device to device copies if passed the corresponding parameters.

No functional change - currently the D2D copy goes through a fallback in libomptarget that stages through a host malloc; after this patch it goes directly through HSA.

Works in exactly the situations where HSA works. Verified locally on a performance benchmark. Hoping to attract further testing from internal developers after it lands.
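
For context, a minimal blocking D2D copy through this HSA entry point looks roughly like the sketch below. This is an illustration only, not the plugin code: it assumes both buffers come from HSA memory pools and that each agent has already been granted access to them.

  #include <cstdint>
  #include <hsa/hsa.h>
  #include <hsa/hsa_ext_amd.h>

  // Sketch: blocking device-to-device copy via hsa_amd_memory_async_copy.
  static hsa_status_t copyD2DBlocking(void *DstPtr, hsa_agent_t DstAgent,
                                      const void *SrcPtr, hsa_agent_t SrcAgent,
                                      size_t Size) {
    hsa_signal_t Done;
    hsa_status_t St = hsa_signal_create(/*initial_value=*/1, 0, nullptr, &Done);
    if (St != HSA_STATUS_SUCCESS)
      return St;

    // The runtime decrements the signal to 0 when the copy completes.
    St = hsa_amd_memory_async_copy(DstPtr, DstAgent, SrcPtr, SrcAgent, Size,
                                   /*num_dep_signals=*/0,
                                   /*dep_signals=*/nullptr, Done);
    if (St == HSA_STATUS_SUCCESS)
      hsa_signal_wait_scacquire(Done, HSA_SIGNAL_CONDITION_EQ, 0, UINT64_MAX,
                                HSA_WAIT_STATE_BLOCKED);

    hsa_signal_destroy(Done);
    return St;
  }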

@llvmbot (Collaborator) commented Oct 23, 2023

@llvm/pr-subscribers-backend-amdgpu

Author: Jon Chesterfield (JonChesterfield)


Full diff: https://github.com/llvm/llvm-project/pull/69955.diff

1 file affected:

  • (modified) openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp (+32-7)
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index ab24856f9bc78e4..5d3d7837d918b41 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -2250,14 +2250,37 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
                                           PinnedMemoryManager);
   }
 
-  /// Exchange data between two devices within the plugin. This function is not
-  /// supported in this plugin.
+  /// Exchange data between two devices within the plugin.
   Error dataExchangeImpl(const void *SrcPtr, GenericDeviceTy &DstGenericDevice,
                          void *DstPtr, int64_t Size,
                          AsyncInfoWrapperTy &AsyncInfoWrapper) override {
-    // This function should never be called because the function
-    // AMDGPUPluginTy::isDataExchangable() returns false.
-    return Plugin::error("dataExchangeImpl not supported");
+    AMDGPUDeviceTy &DstDevice = static_cast<AMDGPUDeviceTy &>(DstGenericDevice);
+
+    hsa_agent_t SrcAgent = getAgent();
+    hsa_agent_t DstAgent = DstDevice.getAgent();
+
+    AMDGPUSignalTy Signal;
+    if (auto Err = Signal.init())
+      return Err;
+
+    // The agents need to have access to the corresponding memory
+    // This is presently only true if the pointers were originally
+    // allocated by this runtime or the caller made the appropriate
+    // access calls.
+    hsa_status_t Status = hsa_amd_memory_async_copy(
+        DstPtr, DstAgent, SrcPtr, SrcAgent, (Size > 0) ? (size_t)Size : 0, 0,
+        nullptr, Signal.get());
+    if (auto Err =
+            Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))
+      return Err;
+
+    if (auto Err = Signal.wait(getStreamBusyWaitMicroseconds()))
+      return Err;
+
+    if (auto Err = Signal.deinit())
+      return Err;
+
+    return Plugin::success();
   }
 
   /// Initialize the async info for interoperability purposes.
@@ -2899,7 +2922,7 @@ struct AMDGPUPluginTy final : public GenericPluginTy {
 
   /// This plugin does not support exchanging data between two devices.
   bool isDataExchangable(int32_t SrcDeviceId, int32_t DstDeviceId) override {
-    return false;
+    return true;
   }
 
   /// Get the host device instance.
@@ -3174,9 +3197,11 @@ void *AMDGPUDeviceTy::allocate(size_t Size, void *, TargetAllocTy Kind) {
     return nullptr;
   }
 
-  if (Alloc && (Kind == TARGET_ALLOC_HOST || Kind == TARGET_ALLOC_SHARED)) {
+  if (Alloc) {
     auto &KernelAgents = Plugin::get<AMDGPUPluginTy>().getKernelAgents();
 
+    // Inherently necessary for host or shared allocations
+    // Also enabled for device memory to allow device to device memcpy
     // Enable all kernel agents to access the buffer.
     if (auto Err = MemoryPool->enableAccess(Alloc, Size, KernelAgents)) {
       REPORT("%s\n", toString(std::move(Err)).data());

@jplehr (Contributor) left a comment


Question about the error message. Rest LGTM.

Comment on lines 2273 to 2274
if (auto Err =
Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))
Contributor


Do we want the error to mention that this happened in the device-to-device copy, or is this automagically done given the method surrounding the error?

@jhuber6 (Contributor) left a comment


I'm guessing it's impossible to make a test for this? Could we do a D2D memcpy on the same device and make sure that they are the same via a memcmp?
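
For illustration, a same-device variant of that suggestion could look roughly like the sketch below. This is a sketch only; the checks and structure are assumptions, not part of this PR or the existing test suite.

// Hypothetical test sketch: copy through two device buffers with
// omp_target_memcpy, then verify against the source with memcmp.
#include <cstdio>
#include <cstring>
#include <omp.h>

int main() {
  const size_t N = 1024;
  int Src[N], Out[N];
  for (size_t I = 0; I < N; ++I)
    Src[I] = (int)I;

  int Dev = omp_get_default_device();
  int Host = omp_get_initial_device();

  void *A = omp_target_alloc(N * sizeof(int), Dev);
  void *B = omp_target_alloc(N * sizeof(int), Dev);

  // Host -> device, device -> device (same device here), device -> host.
  omp_target_memcpy(A, Src, N * sizeof(int), 0, 0, Dev, Host);
  omp_target_memcpy(B, A, N * sizeof(int), 0, 0, Dev, Dev);
  omp_target_memcpy(Out, B, N * sizeof(int), 0, 0, Host, Dev);

  printf("%s\n", memcmp(Src, Out, N * sizeof(int)) == 0 ? "PASS" : "FAIL");

  omp_target_free(A, Dev);
  omp_target_free(B, Dev);
  return 0;
}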

@@ -2250,14 +2250,37 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
PinnedMemoryManager);
}

/// Exchange data between two devices within the plugin. This function is not
/// supported in this plugin.
/// Exchange data between two devices within the plugin.
Error dataExchangeImpl(const void *SrcPtr, GenericDeviceTy &DstGenericDevice,
void *DstPtr, int64_t Size,
AsyncInfoWrapperTy &AsyncInfoWrapper) override {
Contributor


We have AsyncInfoWrapper here; my understanding is that these already contain a "stream" which is tied to an HSA signal. Maybe @jdoerfert can expand on this, but my understanding would be that we'd call getStream or something.

JonChesterfield (Collaborator, Author)


Maybe. This summoning of a signal from scratch was copied from another memcpy call. It should be slower but simpler and more robust than splicing it into the stream machinery, so better to land this version, see that it gets through the testing, and then see whether integration with the stream works.

Member


The fact that AsyncInfoWrapper is unused is a strong indicator that something is off. The code below is blocking, it should not be blocking. Instead of creating the signal, use the stream in AsyncInfoWrapper, something like above in the pinned case:

  AMDGPUStreamTy *Stream = nullptr;
  void *PinnedPtr = nullptr;
  // Use one-step asynchronous operation when host memory is already pinned.
  if (void *PinnedPtr =
          PinnedAllocs.getDeviceAccessiblePtrFromPinnedBuffer(HstPtr)) {
    if (auto Err = getStream(AsyncInfoWrapper, Stream))
      return Err;
    return Stream->pushPinnedMemoryCopyAsync(PinnedPtr, TgtPtr, Size);
  }
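
For illustration, a non-blocking variant along those lines might look roughly like the sketch below. It is hypothetical: pushMemoryCopyD2DAsync is an assumed helper on AMDGPUStreamTy, not an existing API in this patch.

// Hypothetical sketch of a stream-based dataExchangeImpl.
Error dataExchangeImpl(const void *SrcPtr, GenericDeviceTy &DstGenericDevice,
                       void *DstPtr, int64_t Size,
                       AsyncInfoWrapperTy &AsyncInfoWrapper) override {
  AMDGPUDeviceTy &DstDevice = static_cast<AMDGPUDeviceTy &>(DstGenericDevice);

  AMDGPUStreamTy *Stream = nullptr;
  if (auto Err = getStream(AsyncInfoWrapper, Stream))
    return Err;

  // Enqueue the copy on the stream's signal chain instead of blocking here;
  // the wait happens later when the async info is synchronized.
  return Stream->pushMemoryCopyD2DAsync(DstPtr, DstDevice.getAgent(), SrcPtr,
                                        getAgent(), (size_t)Size);
}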

JonChesterfield (Collaborator, Author)


The memory copies are handwritten in each direction, so it's not a matter of pulling a signal from a stream and calling an existing function. It's implementable but not trivial.

I'd like to land this blocking version first in order to establish that the underlying HSA calls behave as they're supposed to via the internal testing infrastructure, before adding the additional complication of wiring it into the signals.

Especially so as I am far from convinced that signals work robustly between GPUs, given that they're on different HSA queues.

JonChesterfield (Collaborator, Author)


The streams approach is in #69977; it seems to work equivalently to this one.

@JonChesterfield (Collaborator, Author)

> I'm guessing it's impossible to make a test for this? Could we do a D2D memcpy on the same device and make sure that they are the same via a memcmp?

It's already implemented (and possibly tested) on cuda, but there's no functional change from what we have at the moment that bounces through host memory. It mostly doesn't have a test because I don't think any of the CI machines have multiple GPUs, and I haven't spent the time working out how to write an openmp test which behaves sanely with one or two GPUs in this context.

auto &KernelAgents = Plugin::get<AMDGPUPluginTy>().getKernelAgents();

// Inherently necessary for host or shared allocations
// Also enabled for device memory to allow device to device memcpy
Member


Is this costly if unused?

JonChesterfield (Collaborator, Author)


Unknown. I'd guess it's cheap if there's only one GPU in the system and has non-zero cost if there are multiple GPUs that do not perform D2D copies. However, the alternative is to try to work backwards from the void* to a corresponding memory pool using maps, and that's definitely non-zero cost as well.
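
For illustration, the lazy alternative could be sketched as below. This is hypothetical, not part of the patch: it assumes Ptr is the base of an HSA memory-pool allocation (recovering that base from an arbitrary void* is exactly the map lookup mentioned above), and allowPeerAccess is an invented helper name.

// Hypothetical sketch: grant a single peer agent access to one allocation
// just before a D2D copy, instead of enabling all kernel agents at
// allocation time.
static hsa_status_t allowPeerAccess(hsa_agent_t Peer, const void *Ptr) {
  // The flags argument of hsa_amd_agents_allow_access is reserved (NULL).
  return hsa_amd_agents_allow_access(/*num_agents=*/1, &Peer,
                                     /*flags=*/nullptr, Ptr);
}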


@jhuber6 (Contributor) commented Oct 23, 2023

> I'm guessing it's impossible to make a test for this? Could we do a D2D memcpy on the same device and make sure that they are the same via a memcmp?

> It's already implemented (and possibly tested) on cuda, but there's no functional change from what we have at the moment that bounces through host memory. It mostly doesn't have a test because I don't think any of the CI machines have multiple GPUs, and I haven't spent the time working out how to write an openmp test which behaves sanely with one or two GPUs in this context.

So we already have test coverage, but it's just falling back to a D2H + H2D copy instead, okay. I don't think the test suite is really set up in a way to use multiple devices; even if you're on a machine that supports it, you'd need to use the device clause to check all of them. However, a lot of testing could probably be simplified if we did --offload-arch=native and then just ran the test on each device (unrelated aside).
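
For illustration, a device-count-agnostic check could reuse the earlier sketch and loop over every pair of devices, so it behaves the same with one GPU or several. Again a hypothetical sketch, not an existing test.

// Hypothetical sketch: exercise the exchange path between every pair of
// available devices, including a device with itself.
#include <cstring>
#include <omp.h>

#define N 256

int main() {
  int Src[N], Out[N];
  for (int I = 0; I < N; ++I)
    Src[I] = I;

  int Host = omp_get_initial_device();
  int Failures = 0;
  for (int D0 = 0; D0 < omp_get_num_devices(); ++D0)
    for (int D1 = 0; D1 < omp_get_num_devices(); ++D1) {
      void *A = omp_target_alloc(sizeof(Src), D0);
      void *B = omp_target_alloc(sizeof(Src), D1);
      omp_target_memcpy(A, Src, sizeof(Src), 0, 0, D0, Host);
      omp_target_memcpy(B, A, sizeof(Src), 0, 0, D1, D0); // D2D or same device
      omp_target_memcpy(Out, B, sizeof(Src), 0, 0, Host, D1);
      Failures += memcmp(Src, Out, sizeof(Src)) != 0;
      omp_target_free(A, D0);
      omp_target_free(B, D1);
    }
  return Failures;
}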

@jplehr (Contributor) commented Oct 24, 2023

I'm closing this in favor of #69977

@jplehr jplehr closed this Oct 24, 2023