Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Libomptarget] Move target table handling out of the plugins #77150

Merged
merged 1 commit into from
Jan 22, 2024

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Jan 5, 2024

Summary:
This patch removes the bulk of the handling of the
__tgt_offload_entries out of the plugins itself. The reason for this
is because the plugins themselves should not be handling this
implementation detail of the OpenMP runtime. Instead, we expose two new
plugin API functions to get the points to a device pointer for a global
as well as a kernel type.

This required introducing a new type to represent a binary image that
has been loaded on a device. We can then use this to load the addresses
as needed. The creation of the mapping table is then handled just in
libomptarget where we simply look up each address individually. This
should allow us to expose these operations more generically when we
provide a separate API.

@llvmbot
Copy link
Collaborator

llvmbot commented Jan 5, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Joseph Huber (jhuber6)

Changes

Summary:
This patch removes the bulk of the handling of the
__tgt_offload_entries out of the plugins itself. The reason for this
is because the plugins themselves should not be handling this
implementation detail of the OpenMP runtime. Instead, we expose two new
plugin API functions to get the points to a device pointer for a global
as well as a kernel type.

This required introducing a new type to represent a binary image that
has been loaded on a device. We can then use this to load the addresses
as needed. The creation of the mapping table is then handled just in
libomptarget where we simply look up each address individually. This
should allow us to expose these operations more generically when we
provide a separate API.

NOTE: Semi-WIP. This removes support for the dumpGlobals() routines
which are used for record & replay. There's no tests that test this
however. I also have not verified the correct behavior when USM is
active beyond all the tests passing as before.


Patch is 25.80 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/77150.diff

13 Files Affected:

  • (modified) openmp/libomptarget/include/Shared/APITypes.h (+5)
  • (modified) openmp/libomptarget/include/Shared/PluginAPI.h (+12-2)
  • (modified) openmp/libomptarget/include/Shared/PluginAPI.inc (+2)
  • (modified) openmp/libomptarget/include/device.h (+1-1)
  • (modified) openmp/libomptarget/include/rtl.h (+5)
  • (modified) openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp (+6-6)
  • (modified) openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h (+13-20)
  • (modified) openmp/libomptarget/plugins-nextgen/common/src/PluginInterface.cpp (+60-108)
  • (modified) openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp (+6-6)
  • (modified) openmp/libomptarget/plugins-nextgen/generic-elf-64bit/src/rtl.cpp (+6-6)
  • (modified) openmp/libomptarget/src/PluginManager.cpp (+2)
  • (modified) openmp/libomptarget/src/device.cpp (+8-2)
  • (modified) openmp/libomptarget/src/omptarget.cpp (+37-8)
diff --git a/openmp/libomptarget/include/Shared/APITypes.h b/openmp/libomptarget/include/Shared/APITypes.h
index 763a22f0a5e863..a900803a0a466f 100644
--- a/openmp/libomptarget/include/Shared/APITypes.h
+++ b/openmp/libomptarget/include/Shared/APITypes.h
@@ -62,6 +62,11 @@ struct __tgt_target_table {
       *EntriesEnd; // End of the table with all the entries (non inclusive)
 };
 
+/// This struct contains a handle to a loaded binary in the plugin device.
+struct __tgt_device_binary {
+  uint64_t handle;
+};
+
 // clang-format on
 
 /// This struct contains information exchanged between different asynchronous
diff --git a/openmp/libomptarget/include/Shared/PluginAPI.h b/openmp/libomptarget/include/Shared/PluginAPI.h
index c6aacf4ce2124b..7172b945180082 100644
--- a/openmp/libomptarget/include/Shared/PluginAPI.h
+++ b/openmp/libomptarget/include/Shared/PluginAPI.h
@@ -57,8 +57,18 @@ int32_t __tgt_rtl_init_device(int32_t ID);
 // return NULL. Otherwise, return a pointer to the built address table.
 // Individual entries in the table may also be NULL, when the corresponding
 // offload region is not supported on the target device.
-__tgt_target_table *__tgt_rtl_load_binary(int32_t ID,
-                                          __tgt_device_image *Image);
+int32_t __tgt_rtl_load_binary(int32_t ID, __tgt_device_image *Image,
+                              __tgt_device_binary *Binary);
+
+// Look up the device address of the named symbol in the given binary. Returns
+// non-zero on failure.
+int32_t __tgt_rtl_get_global(__tgt_device_binary Binary, uint64_t Size,
+                             const char *Name, void **DevicePtr);
+
+// Look up the device address of the named kernel in the given binary. Returns
+// non-zero on failure.
+int32_t __tgt_rtl_get_function(__tgt_device_binary Binary, const char *Name,
+                               void **DevicePtr);
 
 // Allocate data on the particular target device, of the specified size.
 // HostPtr is a address of the host data the allocated target data
diff --git a/openmp/libomptarget/include/Shared/PluginAPI.inc b/openmp/libomptarget/include/Shared/PluginAPI.inc
index 25ebe7d437f9d1..8eecd60c073529 100644
--- a/openmp/libomptarget/include/Shared/PluginAPI.inc
+++ b/openmp/libomptarget/include/Shared/PluginAPI.inc
@@ -19,6 +19,8 @@ PLUGIN_API_HANDLE(is_data_exchangable, false);
 PLUGIN_API_HANDLE(number_of_devices, true);
 PLUGIN_API_HANDLE(init_device, true);
 PLUGIN_API_HANDLE(load_binary, true);
+PLUGIN_API_HANDLE(get_global, true);
+PLUGIN_API_HANDLE(get_function, true);
 PLUGIN_API_HANDLE(data_alloc, true);
 PLUGIN_API_HANDLE(data_submit, true);
 PLUGIN_API_HANDLE(data_submit_async, false);
diff --git a/openmp/libomptarget/include/device.h b/openmp/libomptarget/include/device.h
index d28d3c508faf56..1cab424122e49d 100644
--- a/openmp/libomptarget/include/device.h
+++ b/openmp/libomptarget/include/device.h
@@ -70,7 +70,7 @@ struct DeviceTy {
   /// Provide access to the mapping handler.
   MappingInfoTy &getMappingInfo() { return MappingInfo; }
 
-  __tgt_target_table *loadBinary(__tgt_device_image *Img);
+  llvm::Expected<__tgt_device_binary> loadBinary(__tgt_device_image *Img);
 
   // device memory allocation/deallocation routines
   /// Allocates \p Size bytes on the device, host or shared memory space
diff --git a/openmp/libomptarget/include/rtl.h b/openmp/libomptarget/include/rtl.h
index d110e89de5f14e..5e198bdad43642 100644
--- a/openmp/libomptarget/include/rtl.h
+++ b/openmp/libomptarget/include/rtl.h
@@ -26,11 +26,16 @@
 /// are trying to (re)register an existing lib or really have a new one.
 struct TranslationTable {
   __tgt_target_table HostTable;
+  llvm::SmallVector<__tgt_target_table> DeviceTables;
 
   // Image assigned to a given device.
   llvm::SmallVector<__tgt_device_image *>
       TargetsImages; // One image per device ID.
 
+  // Arrays of entries active on the device.
+  llvm::SmallVector<llvm::SmallVector<__tgt_offload_entry>>
+      TargetsEntries; // One table per device ID.
+
   // Table of entry points or NULL if it was not already computed.
   llvm::SmallVector<__tgt_target_table *>
       TargetsTable; // One table per device ID.
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index 0411c670133422..bbd0f3ef0a97d2 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -403,8 +403,9 @@ struct AMDGPUMemoryManagerTy : public DeviceAllocatorTy {
 /// Class implementing the AMDGPU device images' properties.
 struct AMDGPUDeviceImageTy : public DeviceImageTy {
   /// Create the AMDGPU image with the id and the target image pointer.
-  AMDGPUDeviceImageTy(int32_t ImageId, const __tgt_device_image *TgtImage)
-      : DeviceImageTy(ImageId, TgtImage) {}
+  AMDGPUDeviceImageTy(int32_t ImageId, GenericDeviceTy &Device,
+                      const __tgt_device_image *TgtImage)
+      : DeviceImageTy(ImageId, Device, TgtImage) {}
 
   /// Prepare and load the executable corresponding to the image.
   Error loadExecutable(const AMDGPUDeviceTy &Device);
@@ -2036,14 +2037,13 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
   uint64_t getClockFrequency() const override { return ClockFrequency; }
 
   /// Allocate and construct an AMDGPU kernel.
-  Expected<GenericKernelTy &>
-  constructKernel(const __tgt_offload_entry &KernelEntry) override {
+  Expected<GenericKernelTy &> constructKernel(const char *Name) override {
     // Allocate and construct the AMDGPU kernel.
     AMDGPUKernelTy *AMDGPUKernel = Plugin::get().allocate<AMDGPUKernelTy>();
     if (!AMDGPUKernel)
       return Plugin::error("Failed to allocate memory for AMDGPU kernel");
 
-    new (AMDGPUKernel) AMDGPUKernelTy(KernelEntry.name);
+    new (AMDGPUKernel) AMDGPUKernelTy(Name);
 
     return *AMDGPUKernel;
   }
@@ -2091,7 +2091,7 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
     // Allocate and initialize the image object.
     AMDGPUDeviceImageTy *AMDImage =
         Plugin::get().allocate<AMDGPUDeviceImageTy>();
-    new (AMDImage) AMDGPUDeviceImageTy(ImageId, TgtImage);
+    new (AMDImage) AMDGPUDeviceImageTy(ImageId, *this, TgtImage);
 
     // Load the HSA executable.
     if (Error Err = AMDImage->loadExecutable(*this))
diff --git a/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h b/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h
index b85dc146d86d2f..c7b27240947f1f 100644
--- a/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h
+++ b/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h
@@ -221,15 +221,22 @@ class DeviceImageTy {
   /// Table of offload entries.
   OffloadEntryTableTy OffloadEntryTable;
 
+  /// Reference to the device this image is loaded on.
+  GenericDeviceTy &Device;
+
 public:
-  DeviceImageTy(int32_t Id, const __tgt_device_image *Image)
-      : ImageId(Id), TgtImage(Image), TgtImageBitcode(nullptr) {
+  DeviceImageTy(int32_t Id, GenericDeviceTy &Device,
+                const __tgt_device_image *Image)
+      : ImageId(Id), TgtImage(Image), TgtImageBitcode(nullptr), Device(Device) {
     assert(TgtImage && "Invalid target image");
   }
 
   /// Get the image identifier within the device.
   int32_t getId() const { return ImageId; }
 
+  /// Get the device that this image is loaded onto.
+  GenericDeviceTy &getDevice() const { return Device; }
+
   /// Get the pointer to the raw __tgt_device_image.
   const __tgt_device_image *getTgtImage() const { return TgtImage; }
 
@@ -650,8 +657,8 @@ struct GenericDeviceTy : public DeviceAllocatorTy {
   virtual Error deinitImpl() = 0;
 
   /// Load the binary image into the device and return the target table.
-  Expected<__tgt_target_table *> loadBinary(GenericPluginTy &Plugin,
-                                            const __tgt_device_image *TgtImage);
+  Expected<DeviceImageTy *> loadBinary(GenericPluginTy &Plugin,
+                                       const __tgt_device_image *TgtImage);
   virtual Expected<DeviceImageTy *>
   loadBinaryImpl(const __tgt_device_image *TgtImage, int32_t ImageId) = 0;
 
@@ -669,9 +676,6 @@ struct GenericDeviceTy : public DeviceAllocatorTy {
   // up to the target to override this using the shouldSetupRPCServer function.
   Error setupRPCServer(GenericPluginTy &Plugin, DeviceImageTy &Image);
 
-  /// Register the offload entries for a specific image on the device.
-  Error registerOffloadEntries(DeviceImageTy &Image);
-
   /// Synchronize the current thread with the pending operations on the
   /// __tgt_async_info structure.
   Error synchronize(__tgt_async_info *AsyncInfo);
@@ -872,21 +876,10 @@ struct GenericDeviceTy : public DeviceAllocatorTy {
 
   virtual Error getDeviceStackSize(uint64_t &V) = 0;
 
-private:
-  /// Register offload entry for global variable.
-  Error registerGlobalOffloadEntry(DeviceImageTy &DeviceImage,
-                                   const __tgt_offload_entry &GlobalEntry,
-                                   __tgt_offload_entry &DeviceEntry);
-
-  /// Register offload entry for kernel function.
-  Error registerKernelOffloadEntry(DeviceImageTy &DeviceImage,
-                                   const __tgt_offload_entry &KernelEntry,
-                                   __tgt_offload_entry &DeviceEntry);
-
   /// Allocate and construct a kernel object.
-  virtual Expected<GenericKernelTy &>
-  constructKernel(const __tgt_offload_entry &KernelEntry) = 0;
+  virtual Expected<GenericKernelTy &> constructKernel(const char *Name) = 0;
 
+private:
   /// Get and set the stack size and heap size for the device. If not used, the
   /// plugin can implement the setters as no-op and setting the output
   /// value to zero for the getters.
diff --git a/openmp/libomptarget/plugins-nextgen/common/src/PluginInterface.cpp b/openmp/libomptarget/plugins-nextgen/common/src/PluginInterface.cpp
index be9ace571f54fd..be8fac278b2696 100644
--- a/openmp/libomptarget/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/openmp/libomptarget/plugins-nextgen/common/src/PluginInterface.cpp
@@ -842,7 +842,7 @@ Error GenericDeviceTy::deinit(GenericPluginTy &Plugin) {
 
   return deinitImpl();
 }
-Expected<__tgt_target_table *>
+Expected<DeviceImageTy *>
 GenericDeviceTy::loadBinary(GenericPluginTy &Plugin,
                             const __tgt_device_image *InputTgtImage) {
   assert(InputTgtImage && "Expected non-null target image");
@@ -886,10 +886,6 @@ GenericDeviceTy::loadBinary(GenericPluginTy &Plugin,
       return std::move(Err);
   }
 
-  // Register all offload entries of the image.
-  if (auto Err = registerOffloadEntries(*Image))
-    return std::move(Err);
-
   if (auto Err = setupRPCServer(Plugin, *Image))
     return std::move(Err);
 
@@ -915,7 +911,7 @@ GenericDeviceTy::loadBinary(GenericPluginTy &Plugin,
     return std::move(Err);
 
   // Return the pointer to the table of entries.
-  return Image->getOffloadEntryTable();
+  return Image;
 }
 
 Error GenericDeviceTy::setupDeviceEnvironment(GenericPluginTy &Plugin,
@@ -1024,99 +1020,6 @@ Error GenericDeviceTy::setupRPCServer(GenericPluginTy &Plugin,
   return Plugin::success();
 }
 
-Error GenericDeviceTy::registerOffloadEntries(DeviceImageTy &Image) {
-  const __tgt_offload_entry *Begin = Image.getTgtImage()->EntriesBegin;
-  const __tgt_offload_entry *End = Image.getTgtImage()->EntriesEnd;
-  for (const __tgt_offload_entry *Entry = Begin; Entry != End; ++Entry) {
-    // The host should have always something in the address to uniquely
-    // identify the entry.
-    if (!Entry->addr)
-      return Plugin::error("Failure to register entry without address");
-
-    __tgt_offload_entry DeviceEntry = {0};
-
-    if (Entry->size) {
-      if (auto Err = registerGlobalOffloadEntry(Image, *Entry, DeviceEntry))
-        return Err;
-    } else {
-      if (auto Err = registerKernelOffloadEntry(Image, *Entry, DeviceEntry))
-        return Err;
-    }
-
-    assert(DeviceEntry.addr && "Device addr of offload entry cannot be null");
-
-    DP("Entry point " DPxMOD " maps to%s %s (" DPxMOD ")\n",
-       DPxPTR(Entry - Begin), (Entry->size) ? " global" : "", Entry->name,
-       DPxPTR(DeviceEntry.addr));
-  }
-  return Plugin::success();
-}
-
-Error GenericDeviceTy::registerGlobalOffloadEntry(
-    DeviceImageTy &Image, const __tgt_offload_entry &GlobalEntry,
-    __tgt_offload_entry &DeviceEntry) {
-
-  GenericPluginTy &Plugin = Plugin::get();
-
-  DeviceEntry = GlobalEntry;
-
-  // Create a metadata object for the device global.
-  GlobalTy DeviceGlobal(GlobalEntry.name, GlobalEntry.size);
-
-  // Get the address of the device of the global.
-  GenericGlobalHandlerTy &GHandler = Plugin.getGlobalHandler();
-  if (auto Err =
-          GHandler.getGlobalMetadataFromDevice(*this, Image, DeviceGlobal))
-    return Err;
-
-  // Store the device address on the device entry.
-  DeviceEntry.addr = DeviceGlobal.getPtr();
-  assert(DeviceEntry.addr && "Invalid device global's address");
-
-  // Note: In the current implementation declare target variables
-  // can either be link or to. This means that once unified
-  // memory is activated via the requires directive, the variable
-  // can be used directly from the host in both cases.
-  if (Plugin.getRequiresFlags() & OMP_REQ_UNIFIED_SHARED_MEMORY) {
-    // If unified memory is present any target link or to variables
-    // can access host addresses directly. There is no longer a
-    // need for device copies.
-    GlobalTy HostGlobal(GlobalEntry);
-    if (auto Err =
-            GHandler.writeGlobalToDevice(*this, HostGlobal, DeviceGlobal))
-      return Err;
-  }
-
-  // Add the device entry on the entry table.
-  Image.getOffloadEntryTable().addEntry(DeviceEntry);
-
-  return Plugin::success();
-}
-
-Error GenericDeviceTy::registerKernelOffloadEntry(
-    DeviceImageTy &Image, const __tgt_offload_entry &KernelEntry,
-    __tgt_offload_entry &DeviceEntry) {
-  DeviceEntry = KernelEntry;
-
-  // Create a kernel object.
-  auto KernelOrErr = constructKernel(KernelEntry);
-  if (!KernelOrErr)
-    return KernelOrErr.takeError();
-
-  GenericKernelTy &Kernel = *KernelOrErr;
-
-  // Initialize the kernel.
-  if (auto Err = Kernel.init(*this, Image))
-    return Err;
-
-  // Set the device entry address to the kernel address and store the entry on
-  // the entry table.
-  DeviceEntry.addr = (void *)&Kernel;
-  Image.getOffloadEntryTable().addEntry(DeviceEntry);
-
-  return Plugin::success();
-}
-
 Error PinnedAllocationMapTy::insertEntry(void *HstPtr, void *DevAccessiblePtr,
                                          size_t Size, bool ExternallyLocked) {
   // Insert the new entry into the map.
@@ -1760,23 +1663,25 @@ int32_t __tgt_rtl_initialize_record_replay(int32_t DeviceId, int64_t MemorySize,
   return OFFLOAD_SUCCESS;
 }
 
-__tgt_target_table *__tgt_rtl_load_binary(int32_t DeviceId,
-                                          __tgt_device_image *TgtImage) {
+int32_t __tgt_rtl_load_binary(int32_t DeviceId, __tgt_device_image *TgtImage,
+                              __tgt_device_binary *Binary) {
   GenericPluginTy &Plugin = Plugin::get();
   GenericDeviceTy &Device = Plugin.getDevice(DeviceId);
 
-  auto TableOrErr = Device.loadBinary(Plugin, TgtImage);
-  if (!TableOrErr) {
-    auto Err = TableOrErr.takeError();
+  auto ImageOrErr = Device.loadBinary(Plugin, TgtImage);
+  if (!ImageOrErr) {
+    auto Err = ImageOrErr.takeError();
     REPORT("Failure to load binary image %p on device %d: %s\n", TgtImage,
            DeviceId, toString(std::move(Err)).data());
-    return nullptr;
+    return OFFLOAD_FAIL;
   }
 
-  __tgt_target_table *Table = *TableOrErr;
-  assert(Table != nullptr && "Invalid table");
+  DeviceImageTy *Image = *ImageOrErr;
+  assert(Image != nullptr && "Invalid Image");
+
+  *Binary = __tgt_device_binary{reinterpret_cast<uint64_t>(Image)};
 
-  return Table;
+  return OFFLOAD_SUCCESS;
 }
 
 void *__tgt_rtl_data_alloc(int32_t DeviceId, int64_t Size, void *HostPtr,
@@ -2072,6 +1977,53 @@ int32_t __tgt_rtl_set_device_offset(int32_t DeviceIdOffset) {
   return OFFLOAD_SUCCESS;
 }
 
+int32_t __tgt_rtl_get_global(__tgt_device_binary Binary, uint64_t Size,
+                             const char *Name, void **DevicePtr) {
+  assert(Binary.handle && "Invalid device binary handle");
+  DeviceImageTy &Image = *reinterpret_cast<DeviceImageTy *>(Binary.handle);
+
+  GenericPluginTy &Plugin = Plugin::get();
+  GenericDeviceTy &Device = Image.getDevice();
+
+  GlobalTy DeviceGlobal(Name, Size);
+  GenericGlobalHandlerTy &GHandler = Plugin.getGlobalHandler();
+  if (auto Err =
+          GHandler.getGlobalMetadataFromDevice(Device, Image, DeviceGlobal)) {
+    REPORT("Failure to look up global address: %s\n",
+           toString(std::move(Err)).data());
+    return OFFLOAD_FAIL;
+  }
+
+  *DevicePtr = DeviceGlobal.getPtr();
+  assert(DevicePtr && "Invalid device global's address");
+
+  return OFFLOAD_SUCCESS;
+}
+
+int32_t __tgt_rtl_get_function(__tgt_device_binary Binary, const char *Name,
+                               void **KernelPtr) {
+  assert(Binary.handle && "Invalid device binary handle");
+  DeviceImageTy &Image = *reinterpret_cast<DeviceImageTy *>(Binary.handle);
+
+  GenericDeviceTy &Device = Image.getDevice();
+
+  auto KernelOrErr = Device.constructKernel(Name);
+  if (Error Err = KernelOrErr.takeError()) {
+    REPORT("Failure to look up kernel: %s\n", toString(std::move(Err)).data());
+    return OFFLOAD_FAIL;
+  }
+
+  GenericKernelTy &Kernel = *KernelOrErr;
+  if (auto Err = Kernel.init(Device, Image)) {
+    REPORT("Failure to init kernel: %s\n", toString(std::move(Err)).data());
+    return OFFLOAD_FAIL;
+  }
+
+  // Note that this is not the kernel's device address.
+  *KernelPtr = &Kernel;
+  return OFFLOAD_SUCCESS;
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
index 0005bff7a8035d..9a16c12e15453d 100644
--- a/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
@@ -74,8 +74,9 @@ CUresult cuMemFreeAsync(CUdeviceptr dptr, CUstream hStream) {}
 /// Class implementing the CUDA device images properties.
 struct CUDADeviceImageTy : public DeviceImageTy {
   /// Create the CUDA image with the id and the target image pointer.
-  CUDADeviceImageTy(int32_t ImageId, const __tgt_device_image *TgtImage)
-      : DeviceImageTy(ImageId, TgtImage), Module(nullptr) {}
+  CUDADeviceImageTy(int32_t ImageId, GenericDeviceTy &Device,
+                    const __tgt_device_image *TgtImage)
+      : DeviceImageTy(ImageId, Device, TgtImage), Module(nullptr) {}
 
   /// Load the image as a CUDA module.
   Error loadModule() {
@@ -398,14 +399,13 @@ struct CUDADeviceTy : public GenericDeviceTy {
   }
 
   /// Allocate and construct a CUDA kernel.
-  Expected<GenericKernelTy &>
-  constructKernel(const __tgt_offload_entry &KernelEntry) override {
+  Expected<GenericKernelTy &> constructKernel(const char *Name) override {
     // Allocate and construct the CUDA kernel.
     CUDAKernelTy *CUDAKernel = Plugin::get().allocate<CUDAKernelTy>();
     if (!CUDAKernel)
       return Plugin::error("Failed to allocate memory for CUDA kernel");
 
-    new (CUDAKernel) CUDAKernelTy(KernelEntry.name);
+    new (CUDAKernel) CUDAKernelTy(Name);
 
     return *CUDAKernel;
   }
@@ -460,7 +460,7 @@ struct CUDADeviceTy : public GenericDeviceTy {
 
     // Allocate and initialize the image object.
     CUDADeviceImageTy *CUDAImage = Plugin::get().allocate<CUDADeviceImageTy>();
-    new (CUDAImage) CUDADeviceImageTy(ImageId, TgtImage);
+    new (CUDAImage) CUDADeviceImageTy(ImageId, *this, TgtImage);
 
     // Load the CUDA module.
     if (auto Err = CUDAImage->loadModule())
diff --git a/openmp/libomptarget/plugins-nextgen/generic-elf-64bit/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/generic-elf-64bit/src/rtl.cpp
index 43569f25055594..a03eedc542086a 100644
--- a/openmp/libomptarget/plugins-nextgen/generic-elf-64bit/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/generic-elf-64bit/src/rtl.cpp
@@ -111,8 +111,9 @@ struct GenELF64KernelTy : public GenericKernelTy {
 /// Class implementing the GenELF64 device images properties.
 struct GenELF64DeviceImageTy : public DeviceImageTy {
   /// Create the GenELF64...
[truncated]

@jdoerfert
Copy link
Member

This removes support for the dumpGlobals() routines
which are used for record & replay.

That's a problem.

There's no tests that test this however.

That's a different problem.

Both need to be addressed, the former before we can merge this. The latter asap.

@jdoerfert
Copy link
Member

Couldn't we provide an API for the plugin to pass a reference to a GlobalHandler back to do most of this?

@jhuber6
Copy link
Contributor Author

jhuber6 commented Jan 6, 2024

Couldn't we provide an API for the plugin to pass a reference to a GlobalHandler back to do most of this?

This is possible, but the existence of the C API between libomptarget and the plugins makes it annoying. I like having functions that abstract over the operations the underlying runtime do. But, we could go that route in a follow-up and merge __tgt_rtl_data_(submit|retrieve) and these into a single interface.

This removes support for the dumpGlobals() routines
which are used for record & replay.

That's a problem.

Should hopefully work, but I'll need to check it further. I simply made the recording facilities append globals.

There's no tests that test this however.

That's a different problem.

Why were tests not included in https://reviews.llvm.org/D138931 when this was first added?

@jhuber6 jhuber6 requested a review from tschuett January 6, 2024 19:41
@jhuber6
Copy link
Contributor Author

jhuber6 commented Jan 6, 2024

I looked into the differences between upstream and downstream. It seems there's some differences for APU / USM handling. Would appreciate some help for what needs to be added to this to handle that correctly.

@jhuber6
Copy link
Contributor Author

jhuber6 commented Jan 8, 2024

@carlobertolli PTAL for USM handling w.r.t. implicit-zero-copy as present in downstream.

Copy link
Member

@jdoerfert jdoerfert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, this seems fine. Some comments though

@@ -62,6 +62,11 @@ struct __tgt_target_table {
*EntriesEnd; // End of the table with all the entries (non inclusive)
};

/// This struct contains a handle to a loaded binary in the plugin device.
struct __tgt_device_binary {
uint64_t handle;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why uint64_t, you only use it as pointer, uintptr_t, or void *?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could change it to uintptr_t. My thinking is that we pretty much assert that this thing is 64-bit currently and the type being uint64_t reflects that. I don't use void * because it's supposed to be opaque so the user shouldn't be concerned with the fact that it's a pointer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we assume only pointers, make it void *.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think uintptr_t is fine, it's a common method to just hide opaque handles this way and the user doesn't need to think of it as a pointer.

__tgt_target_table *__tgt_rtl_load_binary(int32_t ID,
__tgt_device_image *Image);
int32_t __tgt_rtl_load_binary(int32_t ID, __tgt_device_image *Image,
__tgt_device_binary *Binary);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ABI breaking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we determined at some point that the plugin ABI was free to be broken because it's not exported anywhere, and we should only ever load the corresponding plugin from the associated install.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right.

} else {
if (Device.RTL->get_function(Binary, Entry.name, &DeviceEntry.addr) !=
OFFLOAD_SUCCESS)
REPORT("Failed to load kernel %s\n", Entry.name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we continue after such a failure before? Same for globals and USM above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. The code pretended to continue by returning nullptr, but it called REPORT AKA abort() on failure. If this fails it's probably sufficiently busted, considering that the OpenMP rules states that there should be a 1-to-1 mapping (there's another check just below here that errors out if hostTable.size != DevTable.size().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think, which I also think, this should not happen, then make it a hard failure please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REPORT is already a hard failure.

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jan 12, 2024
…nction.

This patch is supposed to land before Joseph's upstream patch:
llvm#77150

Change-Id: I71cddcdeb7181caa02361ce19873bbbf5bde9578
Copy link
Contributor

@doru1004 doru1004 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jan 20, 2024
premerge landing of patch from joseph

Change-Id: I1eea45458f96bbb01970d36c58f1d700613b6702
Summary:
This patch removes the bulk of the handling of the
`__tgt_offload_entries` out of the plugins itself. The reason for this
is because the plugins themselves should not be handling this
implementation detail of the OpenMP runtime. Instead, we expose two new
plugin API functions to get the points to a device pointer for a global
as well as a kernel type.

This required introducing a new type to represent a binary image that
has been loaded on a device. We can then use this to load the addresses
as needed. The creation of the mapping table is then handled just in
`libomptarget` where we simply look up each address individually. This
should allow us to expose these operations more generically when we
provide a separate API.

NOTE: Semi-WIP. This removes support for the `dumpGlobals()` routines
which are used for record & replay. There's no tests that test this
however. I also have not verified the correct behavior when USM is
active beyond all the tests passing as before.
@jhuber6 jhuber6 merged commit 621bafd into llvm:main Jan 22, 2024
4 checks passed
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Apr 26, 2024
premerge landing of patch from joseph

Change-Id: I1eea45458f96bbb01970d36c58f1d700613b6702
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants