[amdgpu] Implement D2D memcpy as HSA call #69955

Closed

Conversation

JonChesterfield (Collaborator)

hsa_amd_memory_async_copy can handle device to device copies if passed the corresponding parameters.

No functional change - currently the D2D copy goes through a fallback in libomptarget that stages through a host malloc; after this patch it goes directly through HSA.

Works in exactly the situations where HSA works. Verified locally on a performance benchmark. Hoping to attract further testing from internal developers after it lands.
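
For context, a minimal blocking D2D copy through this HSA entry point looks roughly like the sketch below. This is an illustration only, not the plugin code: it assumes both buffers come from HSA memory pools and that each agent has already been granted access to them.

  #include <cstdint>
  #include <hsa/hsa.h>
  #include <hsa/hsa_ext_amd.h>

  // Sketch: blocking device-to-device copy via hsa_amd_memory_async_copy.
  static hsa_status_t copyD2DBlocking(void *DstPtr, hsa_agent_t DstAgent,
                                      const void *SrcPtr, hsa_agent_t SrcAgent,
                                      size_t Size) {
    hsa_signal_t Done;
    hsa_status_t St = hsa_signal_create(/*initial_value=*/1, 0, nullptr, &Done);
    if (St != HSA_STATUS_SUCCESS)
      return St;

    // The runtime decrements the signal to 0 when the copy completes.
    St = hsa_amd_memory_async_copy(DstPtr, DstAgent, SrcPtr, SrcAgent, Size,
                                   /*num_dep_signals=*/0,
                                   /*dep_signals=*/nullptr, Done);
    if (St == HSA_STATUS_SUCCESS)
      hsa_signal_wait_scacquire(Done, HSA_SIGNAL_CONDITION_EQ, 0, UINT64_MAX,
                                HSA_WAIT_STATE_BLOCKED);

    hsa_signal_destroy(Done);
    return St;
  }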

@llvmbot (Collaborator) commented Oct 23, 2023

@llvm/pr-subscribers-backend-amdgpu

Author: Jon Chesterfield (JonChesterfield)


Full diff: https://github.com/llvm/llvm-project/pull/69955.diff

1 file affected:

  • (modified) openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp (+32-7)
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index ab24856f9bc78e4..5d3d7837d918b41 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -2250,14 +2250,37 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
                                           PinnedMemoryManager);
   }
 
-  /// Exchange data between two devices within the plugin. This function is not
-  /// supported in this plugin.
+  /// Exchange data between two devices within the plugin.
   Error dataExchangeImpl(const void *SrcPtr, GenericDeviceTy &DstGenericDevice,
                          void *DstPtr, int64_t Size,
                          AsyncInfoWrapperTy &AsyncInfoWrapper) override {
-    // This function should never be called because the function
-    // AMDGPUPluginTy::isDataExchangable() returns false.
-    return Plugin::error("dataExchangeImpl not supported");
+    AMDGPUDeviceTy &DstDevice = static_cast<AMDGPUDeviceTy &>(DstGenericDevice);
+
+    hsa_agent_t SrcAgent = getAgent();
+    hsa_agent_t DstAgent = DstDevice.getAgent();
+
+    AMDGPUSignalTy Signal;
+    if (auto Err = Signal.init())
+      return Err;
+
+    // The agents need to have access to the corresponding memory
+    // This is presently only true if the pointers were originally
+    // allocated by this runtime or the caller made the appropriate
+    // access calls.
+    hsa_status_t Status = hsa_amd_memory_async_copy(
+        DstPtr, DstAgent, SrcPtr, SrcAgent, (Size > 0) ? (size_t)Size : 0, 0,
+        nullptr, Signal.get());
+    if (auto Err =
+            Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))
+      return Err;
+
+    if (auto Err = Signal.wait(getStreamBusyWaitMicroseconds()))
+      return Err;
+
+    if (auto Err = Signal.deinit())
+      return Err;
+
+    return Plugin::success();
   }
 
   /// Initialize the async info for interoperability purposes.
@@ -2899,7 +2922,7 @@ struct AMDGPUPluginTy final : public GenericPluginTy {
 
   /// This plugin does not support exchanging data between two devices.
   bool isDataExchangable(int32_t SrcDeviceId, int32_t DstDeviceId) override {
-    return false;
+    return true;
   }
 
   /// Get the host device instance.
@@ -3174,9 +3197,11 @@ void *AMDGPUDeviceTy::allocate(size_t Size, void *, TargetAllocTy Kind) {
     return nullptr;
   }
 
-  if (Alloc && (Kind == TARGET_ALLOC_HOST || Kind == TARGET_ALLOC_SHARED)) {
+  if (Alloc) {
     auto &KernelAgents = Plugin::get<AMDGPUPluginTy>().getKernelAgents();
 
+    // Inherently necessary for host or shared allocations
+    // Also enabled for device memory to allow device to device memcpy
     // Enable all kernel agents to access the buffer.
     if (auto Err = MemoryPool->enableAccess(Alloc, Size, KernelAgents)) {
       REPORT("%s\n", toString(std::move(Err)).data());

@jplehr (Contributor) left a comment


Question about the error message. Rest LGTM.

Comment on lines 2273 to 2274
if (auto Err =
Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))
Contributor


Do we want the error to mention that this happened in the device-to-device copy, or is this automagically done given the method surrounding the error?

@jhuber6 (Contributor) left a comment


I'm guessing it's impossible to make a test for this? Could we do a D2D memcpy on the same device and make sure that they are the same via a memcmp?
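
For illustration, a same-device variant of that suggestion could look roughly like the sketch below. This is a sketch only; the checks and structure are assumptions, not part of this PR or the existing test suite.

// Hypothetical test sketch: copy through two device buffers with
// omp_target_memcpy, then verify against the source with memcmp.
#include <cstdio>
#include <cstring>
#include <omp.h>

int main() {
  const size_t N = 1024;
  int Src[N], Out[N];
  for (size_t I = 0; I < N; ++I)
    Src[I] = (int)I;

  int Dev = omp_get_default_device();
  int Host = omp_get_initial_device();

  void *A = omp_target_alloc(N * sizeof(int), Dev);
  void *B = omp_target_alloc(N * sizeof(int), Dev);

  // Host -> device, device -> device (same device here), device -> host.
  omp_target_memcpy(A, Src, N * sizeof(int), 0, 0, Dev, Host);
  omp_target_memcpy(B, A, N * sizeof(int), 0, 0, Dev, Dev);
  omp_target_memcpy(Out, B, N * sizeof(int), 0, 0, Host, Dev);

  printf("%s\n", memcmp(Src, Out, N * sizeof(int)) == 0 ? "PASS" : "FAIL");

  omp_target_free(A, Dev);
  omp_target_free(B, Dev);
  return 0;
}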

@@ -2250,14 +2250,37 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
PinnedMemoryManager);
}

/// Exchange data between two devices within the plugin. This function is not
/// supported in this plugin.
/// Exchange data between two devices within the plugin.
Error dataExchangeImpl(const void *SrcPtr, GenericDeviceTy &DstGenericDevice,
void *DstPtr, int64_t Size,
AsyncInfoWrapperTy &AsyncInfoWrapper) override {
Contributor


We have AsyncInfoWrapper here; my understanding is that these already contain a "stream" which is tied to an HSA signal. Maybe @jdoerfert can expand on this, but my understanding would be that we'd call getStream or something.

JonChesterfield (Collaborator, Author)


Maybe. This summoning of a signal from scratch was copied from another memcpy call. It should be slower but simpler and more robust than splicing it into the stream machinery, so better to land this version, see that it gets through the testing, and then see whether integration with the stream works.

Member


The fact that AsyncInfoWrapper is unused is a strong indicator that something is off. The code below is blocking, it should not be blocking. Instead of creating the signal, use the stream in AsyncInfoWrapper, something like above in the pinned case:

  AMDGPUStreamTy *Stream = nullptr;
  void *PinnedPtr = nullptr;
  // Use one-step asynchronous operation when host memory is already pinned.
  if (void *PinnedPtr =
          PinnedAllocs.getDeviceAccessiblePtrFromPinnedBuffer(HstPtr)) {
    if (auto Err = getStream(AsyncInfoWrapper, Stream))
      return Err;
    return Stream->pushPinnedMemoryCopyAsync(PinnedPtr, TgtPtr, Size);
  }
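
For illustration, a non-blocking variant along those lines might look roughly like the sketch below. It is hypothetical: pushMemoryCopyD2DAsync is an assumed helper on AMDGPUStreamTy, not an existing API in this patch.

// Hypothetical sketch of a stream-based dataExchangeImpl.
Error dataExchangeImpl(const void *SrcPtr, GenericDeviceTy &DstGenericDevice,
                       void *DstPtr, int64_t Size,
                       AsyncInfoWrapperTy &AsyncInfoWrapper) override {
  AMDGPUDeviceTy &DstDevice = static_cast<AMDGPUDeviceTy &>(DstGenericDevice);

  AMDGPUStreamTy *Stream = nullptr;
  if (auto Err = getStream(AsyncInfoWrapper, Stream))
    return Err;

  // Enqueue the copy on the stream's signal chain instead of blocking here;
  // the wait happens later when the async info is synchronized.
  return Stream->pushMemoryCopyD2DAsync(DstPtr, DstDevice.getAgent(), SrcPtr,
                                        getAgent(), (size_t)Size);
}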

JonChesterfield (Collaborator, Author)


The memory copies are handwritten in each direction, so it's not a matter of pulling a signal from a stream and calling an existing function. It's implementable but not trivial.

I'd like to land this blocking version first in order to establish that the underlying HSA calls behave as they're supposed to via the internal testing infrastructure, before adding the additional complication of wiring it into the signals.

Especially so as I am far from convinced that signals work robustly between GPUs, given that they're on different HSA queues.

JonChesterfield (Collaborator, Author)


The streams approach is in #69977; it seems to work equivalently to this one.

@JonChesterfield (Collaborator, Author)

> I'm guessing it's impossible to make a test for this? Could we do a D2D memcpy on the same device and make sure that they are the same via a memcmp?

It's already implemented (and possibly tested) on cuda, but there's no functional change from what we have at the moment that bounces through host memory. It mostly doesn't have a test because I don't think any of the CI machines have multiple GPUs, and I haven't spent the time working out how to write an openmp test which behaves sanely with one or two GPUs in this context.

auto &KernelAgents = Plugin::get<AMDGPUPluginTy>().getKernelAgents();

// Inherently necessary for host or shared allocations
// Also enabled for device memory to allow device to device memcpy
Member


Is this costly if unused?

JonChesterfield (Collaborator, Author)


Unknown. I'd guess it's cheap if there's only one GPU in the system and has non-zero cost if there are multiple GPUs that do not perform D2D copies. However, the alternative is to try to work backwards from the void* to a corresponding memory pool using maps, and that's definitely non-zero cost as well.
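
For illustration, the lazy alternative could be sketched as below. This is hypothetical, not part of the patch: it assumes Ptr is the base of an HSA memory-pool allocation (recovering that base from an arbitrary void* is exactly the map lookup mentioned above), and allowPeerAccess is an invented helper name.

// Hypothetical sketch: grant a single peer agent access to one allocation
// just before a D2D copy, instead of enabling all kernel agents at
// allocation time.
static hsa_status_t allowPeerAccess(hsa_agent_t Peer, const void *Ptr) {
  // The flags argument of hsa_amd_agents_allow_access is reserved (NULL).
  return hsa_amd_agents_allow_access(/*num_agents=*/1, &Peer,
                                     /*flags=*/nullptr, Ptr);
}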


@jhuber6 (Contributor) commented Oct 23, 2023

> I'm guessing it's impossible to make a test for this? Could we do a D2D memcpy on the same device and make sure that they are the same via a memcmp?

> It's already implemented (and possibly tested) on cuda, but there's no functional change from what we have at the moment that bounces through host memory. It mostly doesn't have a test because I don't think any of the CI machines have multiple GPUs, and I haven't spent the time working out how to write an openmp test which behaves sanely with one or two GPUs in this context.

So we already have test coverage, but it's just falling back to a D2H + H2D copy instead, okay. I don't think the test suite is really set up in a way to use multiple devices; even if you're on a machine that supports it, you'd need to use the device clause to check all of them. However, a lot of testing could probably be simplified if we did --offload-arch=native and then just ran the test on each device (unrelated aside).
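
For illustration, a device-count-agnostic check could reuse the earlier sketch and loop over every pair of devices, so it behaves the same with one GPU or several. Again a hypothetical sketch, not an existing test.

// Hypothetical sketch: exercise the exchange path between every pair of
// available devices, including a device with itself.
#include <cstring>
#include <omp.h>

#define N 256

int main() {
  int Src[N], Out[N];
  for (int I = 0; I < N; ++I)
    Src[I] = I;

  int Host = omp_get_initial_device();
  int Failures = 0;
  for (int D0 = 0; D0 < omp_get_num_devices(); ++D0)
    for (int D1 = 0; D1 < omp_get_num_devices(); ++D1) {
      void *A = omp_target_alloc(sizeof(Src), D0);
      void *B = omp_target_alloc(sizeof(Src), D1);
      omp_target_memcpy(A, Src, sizeof(Src), 0, 0, D0, Host);
      omp_target_memcpy(B, A, sizeof(Src), 0, 0, D1, D0); // D2D or same device
      omp_target_memcpy(Out, B, sizeof(Src), 0, 0, Host, D1);
      Failures += memcmp(Src, Out, sizeof(Src)) != 0;
      omp_target_free(A, D0);
      omp_target_free(B, D1);
    }
  return Failures;
}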

@jplehr (Contributor) commented Oct 24, 2023

I'm closing this in favor of #69977

@jplehr jplehr closed this Oct 24, 2023