[OpenMP] Basic BumpAllocator for (AMD)GPUs #69806

jdoerfert · 2023-10-21T02:30:29Z

The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to run more tests. The allocator implements malloc, both internally and externally, while we continue to default to the NVIDIA malloc when we target NVIDIA GPUs. Once we have smarter or customizable allocators we should consider this choice, for now, this allocator is better than none. It traps if it is out of memory, making it easy to debug. Heap size is configured via LIBOMPTARGET_HEAP_SIZE and defaults to 512MB. It allows to track allocation statistics via
LIBOMPTARGET_DEVICE_RTL_DEBUG=8 (together with
-fopenmp-target-debug=8). Two tests were added, and one was enabled.

This is the next step towards fixing
#66708

llvmbot · 2023-10-21T02:31:02Z

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-openmp

Author: Johannes Doerfert (jdoerfert)

Changes

The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to run more tests. The allocator implements malloc, both internally and externally, while we continue to default to the NVIDIA malloc when we target NVIDIA GPUs. Once we have smarter or customizable allocators we should consider this choice, for now, this allocator is better than none. It traps if it is out of memory, making it easy to debug. Heap size is configured via LIBOMPTARGET_HEAP_SIZE and defaults to 512MB. It allows to track allocation statistics via
LIBOMPTARGET_DEVICE_RTL_DEBUG=8 (together with
-fopenmp-target-debug=8). Two tests were added, and one was enabled.

This is the next step towards fixing
#66708

Patch is 26.58 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/69806.diff

17 Files Affected:

(modified) clang/lib/Headers/openmp_wrappers/__clang_openmp_device_functions.h (+16-4)
(modified) openmp/docs/design/Runtimes.rst (+1)
(modified) openmp/libomptarget/DeviceRTL/CMakeLists.txt (+2)
(added) openmp/libomptarget/DeviceRTL/include/Allocator.h (+44)
(added) openmp/libomptarget/DeviceRTL/src/Allocator.cpp (+80)
(modified) openmp/libomptarget/DeviceRTL/src/Kernel.cpp (+2)
(modified) openmp/libomptarget/DeviceRTL/src/State.cpp (+17-24)
(modified) openmp/libomptarget/DeviceRTL/src/exports (+1)
(modified) openmp/libomptarget/include/Environment.h (+21)
(modified) openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp (+11-2)
(modified) openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp (+83-4)
(modified) openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h (+11)
(modified) openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp (+5)
(modified) openmp/libomptarget/plugins-nextgen/generic-elf-64bit/src/rtl.cpp (+2-1)
(modified) openmp/libomptarget/test/mapping/lambda_mapping.cpp (+4-4)
(added) openmp/libomptarget/test/offloading/malloc.c (+37)
(added) openmp/libomptarget/test/offloading/malloc_parallel.c (+42)

diff --git a/clang/lib/Headers/openmp_wrappers/__clang_openmp_device_functions.h b/clang/lib/Headers/openmp_wrappers/__clang_openmp_device_functions.h
index d5b6846b0348859..54497ede9721484 100644
--- a/clang/lib/Headers/openmp_wrappers/__clang_openmp_device_functions.h
+++ b/clang/lib/Headers/openmp_wrappers/__clang_openmp_device_functions.h
@@ -15,9 +15,19 @@
 #endif
 
 #ifdef __cplusplus
+#include <cstddef>
+#include <cstdint>
+#include <cstdlib>
 extern "C" {
+#else
+#include <stddef.h>
+#include <stdint.h>
+#include <stdlib.h>
 #endif
 
+void *llvm_device_malloc(uint64_t Size);
+void llvm_device_free(void *Ptr);
+
 #pragma omp begin declare variant match(                                       \
     device = {arch(nvptx, nvptx64)}, implementation = {extension(match_any)})
 
@@ -36,7 +46,12 @@ extern "C" {
 #pragma omp end declare variant
 
 #ifdef __AMDGCN__
-#pragma omp begin declare variant match(device = {arch(amdgcn)})
+#pragma omp begin declare variant match(                                       \
+        device = {arch(amdgcn)},                                               \
+            implementation = {extension(disable_implicit_base)})
+
+void *malloc(size_t Size) { return llvm_device_malloc(Size); }
+void free(void *Ptr) { llvm_device_free(Ptr); }
 
 // Import types which will be used by __clang_hip_libdevice_declares.h
 #ifndef __cplusplus
@@ -64,9 +79,6 @@ extern "C" {
 // need to `include <new>` in C++ mode.
 #ifdef __cplusplus
 
-// We require malloc/free.
-#include <cstdlib>
-
 #pragma push_macro("OPENMP_NOEXCEPT")
 #if __cplusplus >= 201103L
 #define OPENMP_NOEXCEPT noexcept
diff --git a/openmp/docs/design/Runtimes.rst b/openmp/docs/design/Runtimes.rst
index 4c848ca76fb8d57..62ed75797955e28 100644
--- a/openmp/docs/design/Runtimes.rst
+++ b/openmp/docs/design/Runtimes.rst
@@ -1465,3 +1465,4 @@ debugging features are supported.
 
     * Enable debugging assertions in the device. ``0x01``
     * Enable diagnosing common problems during offloading . ``0x4``
+    * Enable device malloc statistics (amdgpu only). ``0x8``
diff --git a/openmp/libomptarget/DeviceRTL/CMakeLists.txt b/openmp/libomptarget/DeviceRTL/CMakeLists.txt
index fee2414b456a14c..f71bdeae3d7f091 100644
--- a/openmp/libomptarget/DeviceRTL/CMakeLists.txt
+++ b/openmp/libomptarget/DeviceRTL/CMakeLists.txt
@@ -83,6 +83,7 @@ endif()
 list(REMOVE_DUPLICATES LIBOMPTARGET_DEVICE_ARCHITECTURES)
 
 set(include_files
+  ${include_directory}/Allocator.h
   ${include_directory}/Configuration.h
   ${include_directory}/Debug.h
   ${include_directory}/Interface.h
@@ -95,6 +96,7 @@ set(include_files
 )
 
 set(src_files
+  ${source_directory}/Allocator.cpp
   ${source_directory}/Configuration.cpp
   ${source_directory}/Debug.cpp
   ${source_directory}/Kernel.cpp
diff --git a/openmp/libomptarget/DeviceRTL/include/Allocator.h b/openmp/libomptarget/DeviceRTL/include/Allocator.h
new file mode 100644
index 000000000000000..a28eb0fb2977eaa
--- /dev/null
+++ b/openmp/libomptarget/DeviceRTL/include/Allocator.h
@@ -0,0 +1,44 @@
+//===-------- Allocator.h - OpenMP memory allocator interface ---- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef OMPTARGET_ALLOCATOR_H
+#define OMPTARGET_ALLOCATOR_H
+
+#include "Types.h"
+
+// Forward declaration.
+struct KernelEnvironmentTy;
+
+#pragma omp begin declare target device_type(nohost)
+
+namespace ompx {
+
+namespace allocator {
+
+static uint64_t constexpr ALIGNMENT = 16;
+
+/// Initialize the allocator according to \p KernelEnvironment
+void init(bool IsSPMD, KernelEnvironmentTy &KernelEnvironment);
+
+/// Allocate \p Size bytes.
+[[gnu::alloc_size(1), gnu::assume_aligned(ALIGNMENT), gnu::malloc]] void *
+alloc(uint64_t Size);
+
+/// Free the allocation pointed to by \p Ptr.
+void free(void *Ptr);
+
+} // namespace allocator
+
+} // namespace ompx
+
+#pragma omp end declare target
+
+#endif
diff --git a/openmp/libomptarget/DeviceRTL/src/Allocator.cpp b/openmp/libomptarget/DeviceRTL/src/Allocator.cpp
new file mode 100644
index 000000000000000..7a4cbfe60dd026c
--- /dev/null
+++ b/openmp/libomptarget/DeviceRTL/src/Allocator.cpp
@@ -0,0 +1,80 @@
+//===------ State.cpp - OpenMP State & ICV interface ------------- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+//===----------------------------------------------------------------------===//
+
+#include "Allocator.h"
+#include "Configuration.h"
+#include "Environment.h"
+#include "Mapping.h"
+#include "Synchronization.h"
+#include "Types.h"
+#include "Utils.h"
+
+using namespace ompx;
+
+#pragma omp begin declare target device_type(nohost)
+
+[[gnu::used, gnu::retain, gnu::weak,
+  gnu::visibility(
+      "protected")]] DeviceMemoryPoolTy __omp_rtl_device_memory_pool;
+[[gnu::used, gnu::retain, gnu::weak,
+  gnu::visibility("protected")]] DeviceMemoryPoolTrackingTy
+    __omp_rtl_device_memory_pool_tracker;
+
+/// Stateless bump allocator that uses the __omp_rtl_device_memory_pool
+/// directly.
+struct BumpAllocatorTy final {
+
+  void *alloc(uint64_t Size) {
+    Size = utils::roundUp(Size, uint64_t(allocator::ALIGNMENT));
+
+    if (config::isDebugMode(DeviceDebugKind::AllocationTracker)) {
+      atomic::add(&__omp_rtl_device_memory_pool_tracker.NumAllocations, 1,
+                  atomic::seq_cst);
+      atomic::add(&__omp_rtl_device_memory_pool_tracker.AllocationTotal, Size,
+                  atomic::seq_cst);
+      atomic::min(&__omp_rtl_device_memory_pool_tracker.AllocationMin, Size,
+                  atomic::seq_cst);
+      atomic::max(&__omp_rtl_device_memory_pool_tracker.AllocationMax, Size,
+                  atomic::seq_cst);
+    }
+
+    uint64_t *Data =
+        reinterpret_cast<uint64_t *>(&__omp_rtl_device_memory_pool.Ptr);
+    uint64_t End =
+        reinterpret_cast<uint64_t>(Data) + __omp_rtl_device_memory_pool.Size;
+
+    uint64_t OldData = atomic::add(Data, Size, atomic::seq_cst);
+    if (OldData + Size > End)
+      __builtin_trap();
+
+    return reinterpret_cast<void *>(OldData);
+  }
+
+  void free(void *) {}
+};
+
+BumpAllocatorTy BumpAllocator;
+
+/// allocator namespace implementation
+///
+///{
+
+void allocator::init(bool IsSPMD, KernelEnvironmentTy &KernelEnvironment) {
+  // TODO: Check KernelEnvironment for an allocator choice as soon as we have
+  // more than one.
+}
+
+void *allocator::alloc(uint64_t Size) { return BumpAllocator.alloc(Size); }
+
+void allocator::free(void *Ptr) { BumpAllocator.free(Ptr); }
+
+///}
+
+#pragma omp end declare target
diff --git a/openmp/libomptarget/DeviceRTL/src/Kernel.cpp b/openmp/libomptarget/DeviceRTL/src/Kernel.cpp
index 222983577164c66..91e8a00bdef9dfd 100644
--- a/openmp/libomptarget/DeviceRTL/src/Kernel.cpp
+++ b/openmp/libomptarget/DeviceRTL/src/Kernel.cpp
@@ -10,6 +10,7 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include "Allocator.h"
 #include "Debug.h"
 #include "Environment.h"
 #include "Interface.h"
@@ -30,6 +31,7 @@ static void inititializeRuntime(bool IsSPMD,
   synchronize::init(IsSPMD);
   mapping::init(IsSPMD);
   state::init(IsSPMD, KernelEnvironment);
+  allocator::init(IsSPMD, KernelEnvironment);
 }
 
 /// Simple generic state machine for worker threads.
diff --git a/openmp/libomptarget/DeviceRTL/src/State.cpp b/openmp/libomptarget/DeviceRTL/src/State.cpp
index 68fe0b383548e53..5a21f4b470dc9e7 100644
--- a/openmp/libomptarget/DeviceRTL/src/State.cpp
+++ b/openmp/libomptarget/DeviceRTL/src/State.cpp
@@ -9,6 +9,8 @@
 //===----------------------------------------------------------------------===//
 
 #include "State.h"
+#include "Allocator.h"
+#include "Configuration.h"
 #include "Debug.h"
 #include "Environment.h"
 #include "Interface.h"
@@ -25,18 +27,15 @@ using namespace ompx;
 ///
 ///{
 
-/// Add worst-case padding so that future allocations are properly aligned.
-/// FIXME: The stack shouldn't require worst-case padding. Alignment needs to be
-/// passed in as an argument and the stack rewritten to support it.
-constexpr const uint32_t Alignment = 16;
-
 /// External symbol to access dynamic shared memory.
-[[gnu::aligned(Alignment)]] extern unsigned char DynamicSharedBuffer[];
+[[gnu::aligned(allocator::ALIGNMENT)]] extern unsigned char DynamicSharedBuffer[];
 #pragma omp allocate(DynamicSharedBuffer) allocator(omp_pteam_mem_alloc)
 
 /// The kernel environment passed to the init method by the compiler.
 static KernelEnvironmentTy *SHARED(KernelEnvironmentPtr);
 
+///}
+
 namespace {
 
 /// Fallback implementations are missing to trigger a link time error.
@@ -48,25 +47,17 @@ namespace {
 extern "C" {
 [[gnu::weak, gnu::leaf]] void *malloc(uint64_t Size);
 [[gnu::weak, gnu::leaf]] void free(void *Ptr);
-}
 
-///}
+void *llvm_device_malloc(uint64_t Size) { return allocator::alloc(Size); }
+void llvm_device_free(void *Ptr) { allocator::free(Ptr); }
 
-/// AMDGCN implementations of the shuffle sync idiom.
-///
-///{
 #pragma omp begin declare variant match(device = {arch(amdgcn)})
 
-extern "C" {
-void *malloc(uint64_t Size) {
-  // TODO: Use some preallocated space for dynamic malloc.
-  return nullptr;
-}
-
-void free(void *Ptr) {}
-}
+void *malloc(uint64_t Size) { return llvm_device_malloc(Size); }
+void free(void *Ptr) { llvm_device_free(Ptr); }
 
 #pragma omp end declare variant
+}
 ///}
 
 /// A "smart" stack in shared memory.
@@ -95,7 +86,7 @@ struct SharedMemorySmartStackTy {
   uint32_t computeThreadStorageTotal() {
     uint32_t NumLanesInBlock = mapping::getNumberOfThreadsInBlock();
     return utils::align_down((state::SharedScratchpadSize / NumLanesInBlock),
-                             Alignment);
+                             allocator::ALIGNMENT);
   }
 
   /// Return the top address of the warp data stack, that is the first address
@@ -105,8 +96,8 @@ struct SharedMemorySmartStackTy {
   }
 
   /// The actual storage, shared among all warps.
-  [[gnu::aligned(Alignment)]] unsigned char Data[state::SharedScratchpadSize];
-  [[gnu::aligned(Alignment)]] unsigned char Usage[mapping::MaxThreadsPerTeam];
+  [[gnu::aligned(allocator::ALIGNMENT)]] unsigned char Data[state::SharedScratchpadSize];
+  [[gnu::aligned(allocator::ALIGNMENT)]] unsigned char Usage[mapping::MaxThreadsPerTeam];
 };
 
 static_assert(state::SharedScratchpadSize / mapping::MaxThreadsPerTeam <= 256,
@@ -121,7 +112,9 @@ void SharedMemorySmartStackTy::init(bool IsSPMD) {
 
 void *SharedMemorySmartStackTy::push(uint64_t Bytes) {
   // First align the number of requested bytes.
-  uint64_t AlignedBytes = utils::align_up(Bytes, Alignment);
+  /// FIXME: The stack shouldn't require worst-case padding. Alignment needs to
+  /// be passed in as an argument and the stack rewritten to support it.
+  uint64_t AlignedBytes = utils::align_up(Bytes, allocator::ALIGNMENT);
 
   uint32_t StorageTotal = computeThreadStorageTotal();
 
@@ -149,7 +142,7 @@ void *SharedMemorySmartStackTy::push(uint64_t Bytes) {
 }
 
 void SharedMemorySmartStackTy::pop(void *Ptr, uint32_t Bytes) {
-  uint64_t AlignedBytes = utils::align_up(Bytes, Alignment);
+  uint64_t AlignedBytes = utils::align_up(Bytes, allocator::ALIGNMENT);
   if (utils::isSharedMemPtr(Ptr)) {
     int TId = mapping::getThreadIdInBlock();
     Usage[TId] -= AlignedBytes;
diff --git a/openmp/libomptarget/DeviceRTL/src/exports b/openmp/libomptarget/DeviceRTL/src/exports
index 288ddf90b4a9f2d..5567257f6b3cc75 100644
--- a/openmp/libomptarget/DeviceRTL/src/exports
+++ b/openmp/libomptarget/DeviceRTL/src/exports
@@ -16,3 +16,4 @@ free
 memcmp
 printf
 __assert_fail
+malloc
diff --git a/openmp/libomptarget/include/Environment.h b/openmp/libomptarget/include/Environment.h
index 48a0fa933bdd8c0..8194736ae4e0a9d 100644
--- a/openmp/libomptarget/include/Environment.h
+++ b/openmp/libomptarget/include/Environment.h
@@ -43,6 +43,27 @@ struct DeviceEnvironmentTy {
   uint64_t HardwareParallelism;
 };
 
+struct DeviceMemoryPoolTy {
+  void *Ptr;
+  uint64_t Size;
+};
+
+struct DeviceMemoryPoolTrackingTy {
+  uint64_t NumAllocations;
+  uint64_t AllocationTotal;
+  uint64_t AllocationMin;
+  uint64_t AllocationMax;
+
+  void combine(DeviceMemoryPoolTrackingTy &Other) {
+    NumAllocations += Other.NumAllocations;
+    AllocationTotal += Other.AllocationTotal;
+    AllocationMin = AllocationMin > Other.AllocationMin ? Other.AllocationMin
+                                                        : AllocationMin;
+    AllocationMax = AllocationMax < Other.AllocationMax ? Other.AllocationMax
+                                                        : AllocationMax;
+  }
+};
+
 // NOTE: Please don't change the order of those members as their indices are
 // used in the middle end. Always add the new data member at the end.
 // Different from KernelEnvironmentTy below, this structure contains members
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index 148ca09654092cb..ab24856f9bc78e4 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -2529,10 +2529,16 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
     return Plugin::success();
   }
   Error getDeviceHeapSize(uint64_t &Value) override {
-    Value = 0;
+    Value = DeviceMemoryPoolSize;
+    return Plugin::success();
+  }
+  Error setDeviceHeapSize(uint64_t Value) override {
+    for (DeviceImageTy *Image : LoadedImages)
+      if (auto Err = setupDeviceMemoryPool(Plugin::get(), *Image, Value))
+        return Err;
+    DeviceMemoryPoolSize = Value;
     return Plugin::success();
   }
-  Error setDeviceHeapSize(uint64_t Value) override { return Plugin::success(); }
 
   /// AMDGPU-specific function to get device attributes.
   template <typename Ty> Error getDeviceAttr(uint32_t Kind, Ty &Value) {
@@ -2625,6 +2631,9 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
 
   /// Reference to the host device.
   AMDHostDeviceTy &HostDevice;
+
+  /// The current size of the global device memory pool (managed by us).
+  uint64_t DeviceMemoryPoolSize = 1L << 29L /* 512MB */;
 };
 
 Error AMDGPUDeviceImageTy::loadExecutable(const AMDGPUDeviceTy &Device) {
diff --git a/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp b/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
index c8fb8d552f429a0..0243f0205dbf0e5 100644
--- a/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
+++ b/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
@@ -590,6 +590,35 @@ Error GenericDeviceTy::init(GenericPluginTy &Plugin) {
 }
 
 Error GenericDeviceTy::deinit(GenericPluginTy &Plugin) {
+
+  if (OMPX_DebugKind.get() & uint32_t(DeviceDebugKind::AllocationTracker)) {
+    GenericGlobalHandlerTy &GHandler = Plugin.getGlobalHandler();
+    for (auto *Image : LoadedImages) {
+      DeviceMemoryPoolTrackingTy ImageDeviceMemoryPoolTracking = {0, 0, ~0U, 0};
+      GlobalTy TrackerGlobal("__omp_rtl_device_memory_pool_tracker",
+                             sizeof(DeviceMemoryPoolTrackingTy),
+                             &ImageDeviceMemoryPoolTracking);
+      if (auto Err =
+              GHandler.readGlobalFromDevice(*this, *Image, TrackerGlobal))
+        return Err;
+      DeviceMemoryPoolTracking.combine(ImageDeviceMemoryPoolTracking);
+    }
+
+    // TODO: Write this by default into a file.
+    printf("\n\n|-----------------------\n"
+           "| Device memory tracker:\n"
+           "|-----------------------\n"
+           "| #Allocations: %lu\n"
+           "| Byes allocated: %lu\n"
+           "| Minimal allocation: %lu\n"
+           "| Maximal allocation: %lu\n"
+           "|-----------------------\n\n\n",
+           DeviceMemoryPoolTracking.NumAllocations,
+           DeviceMemoryPoolTracking.AllocationTotal,
+           DeviceMemoryPoolTracking.AllocationMin,
+           DeviceMemoryPoolTracking.AllocationMax);
+  }
+
   // Delete the memory manager before deinitializing the device. Otherwise,
   // we may delete device allocations after the device is deinitialized.
   if (MemoryManager)
@@ -648,6 +677,17 @@ GenericDeviceTy::loadBinary(GenericPluginTy &Plugin,
   if (auto Err = setupDeviceEnvironment(Plugin, *Image))
     return std::move(Err);
 
+  // Setup the global device memory pool if needed.
+  if (shouldSetupDeviceMemoryPool()) {
+    uint64_t HeapSize;
+    auto SizeOrErr = getDeviceHeapSize(HeapSize);
+    if (SizeOrErr) {
+      REPORT("No global device memory pool due to error: %s\n",
+             toString(std::move(SizeOrErr)).data());
+    } else if (auto Err = setupDeviceMemoryPool(Plugin, *Image, HeapSize))
+      return std::move(Err);
+  }
+
   // Register all offload entries of the image.
   if (auto Err = registerOffloadEntries(*Image))
     return std::move(Err);
@@ -713,6 +753,45 @@ Error GenericDeviceTy::setupDeviceEnvironment(GenericPluginTy &Plugin,
   return Plugin::success();
 }
 
+Error GenericDeviceTy::setupDeviceMemoryPool(GenericPluginTy &Plugin,
+                                             DeviceImageTy &Image,
+                                             uint64_t PoolSize) {
+  // Free the old pool, if any.
+  if (DeviceMemoryPool.Ptr) {
+    if (auto Err = dataDelete(DeviceMemoryPool.Ptr,
+                              TargetAllocTy::TARGET_ALLOC_DEVICE))
+      return Err;
+  }
+
+  DeviceMemoryPool.Size = PoolSize;
+  auto AllocOrErr = dataAlloc(PoolSize, /*HostPtr=*/nullptr,
+                              TargetAllocTy::TARGET_ALLOC_DEVICE);
+  if (AllocOrErr) {
+    DeviceMemoryPool.Ptr = *AllocOrErr;
+  } else {
+    auto Err = AllocOrErr.takeError();
+    REPORT("Failure to allocate device memory for global memory pool: %s\n",
+           toString(std::move(Err)).data());
+    DeviceMemoryPool.Ptr = nullptr;
+    DeviceMemoryPool.Size = 0;
+  }
+
+  // Create the metainfo of the device environment global.
+  GlobalTy TrackerGlobal("__omp_rtl_device_memory_pool_tracker",
+                         sizeof(DeviceMemoryPoolTrackingTy),
+                         &DeviceMemoryPoolTracking);
+  GenericGlobalHandlerTy &GHandler = Plugin.getGlobalHandler();
+  if (auto Err = GHandler.writeGlobalToDevice(*this, Image, TrackerGlobal))
+    return Err;
+
+  // Create the metainfo of the device environment global.
+  GlobalTy DevEnvGlobal("__omp_rtl_device_memory_pool",
+                        sizeof(DeviceMemoryPoolTy), &DeviceMemoryPool);
+
+  // Write device environment values to the device.
+  return GHandler.writeGlobalToDevice(*this, Image, DevEnvGlobal);
+}
+
 Error GenericDeviceTy::setupRPCServer(GenericPluginTy &Plugin,
                                       DeviceImageTy &Image) {
   // The plugin either does not need an RPC server or it is unavailible.
@@ -1327,10 +1406,6 @@ Error GenericPluginTy::init() {
 }
 
 Error GenericPluginTy::deinit() {
-  // There is no global handler if no device is available.
-  if (GlobalHandler)
-    delete GlobalHandler;
-
   // Deinitialize all active devices.
   for (int32_t DeviceId = 0; DeviceId < NumDevices; ++DeviceId) {
     if (Devices[DeviceId]) {
@@ -1340,6 +1415,10 @@ Error GenericPluginTy::deinit() {
     assert(!Devices[DeviceId] && "Device was not deinitialized");
   }
 
+  // There is no global handler if no device is available.
+  if (GlobalHandler)
+    delete GlobalHandler;
+
   if (RPCServer)
     delete RPCServer;
 
diff --git a/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h b/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
index 4a9c223eefbcf62..ddcf3b3cc9b9537 100644
--- a/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
+++ b/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
@@ -625,6 +625,10 @@ struct GenericDeviceTy : public DeviceAllocatorTy ...
[truncated]

github-actions · 2023-10-21T02:32:47Z

✅ With the latest revision this PR passed the C/C++ code formatter.

clang/lib/Headers/openmp_wrappers/__clang_openmp_device_functions.h

openmp/libomptarget/DeviceRTL/src/exports

openmp/libomptarget/DeviceRTL/src/State.cpp

jhuber6

I figured we would just have an internal implementation of malloc provided by the OpenMP device runtime library. The variants make this more difficult, but we could just use #ifdef __AMDGPU__ instead of an OpenMP variant so we can declare it as extern "C" inside the AMDGPU device RTL. Since we use a two-step compilation and always compile with clang it should be legal.

clang/lib/Headers/openmp_wrappers/__clang_openmp_device_functions.h

jhuber6 · 2023-10-21T02:52:00Z

openmp/libomptarget/DeviceRTL/src/Allocator.cpp

+
+    uint64_t OldData = atomic::add(Data, Size, atomic::seq_cst);
+    if (OldData + Size > End)
+      __builtin_trap();


Why not just return nullptr? Isn't that the expect failure mode for a malloc call?

This allocator is not a production allocator. Right now, we run into too many problems. And honestly, nobody checks for nullptr.

True, but I figured the inevitable dereference of the nullptr would be a good trap that's usually more easily connected with malloc.

Based on my experience it's not always easy to expose a dereference nullptr on a GPU, or to put another way, sometimes even dereference a nullptr will not trigger a trap. Given what we have right now in this allocator, it is better to just trap if we don't have enough memory.

I guess I'm just averse to traps because they currently deadlock on my machine. I still haven't tracked down the cause of that however.

True, but I figured the inevitable dereference of the nullptr would be a good trap that's usually more easily connected with malloc.

That is arguably not true, especially for "implicit"/"hidden" mallocs. People that run on AMD complain often about tests not running, or only running with O1/2/3 but don't connect it to the malloc. The problem with *nullptr is that it's UB. So we don't necessarily do it at all. I have seen many tests "pass" even though the smart stack was full but the fallback to malloc was removed and somehow writing some shared memory worked out. But that also leads to spurious fails. Let's crash loud and fast, not via silent corruption.

The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to run more tests. The allocator implements `malloc`, both internally and externally, while we continue to default to the NVIDIA `malloc` when we target NVIDIA GPUs. Once we have smarter or customizable allocators we should consider this choice, for now, this allocator is better than none. It traps if it is out of memory, making it easy to debug. Heap size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512MB. It allows to track allocation statistics via `LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with `-fopenmp-target-debug=8`). Two tests were added, and one was enabled. This is the next step towards fixing llvm#66708

jdoerfert · 2023-10-21T03:08:58Z

I figured we would just have an internal implementation of malloc provided by the OpenMP device runtime library. The variants make this more difficult, but we could just use #ifdef __AMDGPU__ instead of an OpenMP variant so we can declare it as extern "C" inside the AMDGPU device RTL. Since we use a two-step compilation and always compile with clang it should be legal.

Done.

jhuber6 · 2023-10-21T12:46:25Z

/home/jhuber/Documents/llvm/llvm-project/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp: In member function ‘uint64_t llvm::omp::target::plugin::GenericKernelTy::getNumBlocks(llvm::omp::target::plugin::GenericDeviceTy&, uint32_t*, uint64_t, uint32_t&, bool) const’:
/home/jhuber/Documents/llvm/llvm-project/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp:453:55: warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]
  453 |                TripCountNumBlocks >= DefaultNumBlocks &&
      |                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
  454 |                    "Expected sufficient outer parallelism.");
      |                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/jhuber/Documents/llvm/llvm-project/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp: In member function ‘llvm::Error llvm::omp::target::plugin::GenericDeviceTy::deinit(llvm::omp::target::plugin::GenericPluginTy&)’:
/home/jhuber/Documents/llvm/llvm-project/openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp:593:39: error: ‘DeviceDebugKind’ was not declared in this scope
  593 |   if (OMPX_DebugKind.get() & uint32_t(DeviceDebugKind::AllocationTracker)) {
      |

I get the following compile error

jdoerfert · 2023-10-21T17:40:31Z

I get the following compile error

Did you rebase? I thought I pre-committed that NFC part.

shiltian

The patch looks good to me.

jhuber6 · 2023-10-21T20:37:06Z

Looks good, applied and it bound to the OpenMP malloc without libc and the libc malloc with.

jdoerfert · 2023-10-21T21:48:17Z

Looks good, applied and it bound to the OpenMP malloc without libc and the libc malloc with.

Cool :)

jplehr · 2023-10-22T14:18:22Z

I think this broke the bot initially https://lab.llvm.org/buildbot/#/builders/193/builds/40466 and the subsequent commit did only partially fix it (https://lab.llvm.org/buildbot/#/builders/193/builds/40467) as the tests seemingly fail for the x86-64 plugin.

locally : reverts d3921e4 [OpenMP] Basic BumpAllocator for (AMD)GPUs (llvm#69806) Change-Id: Id512e729870279855744ce65bfc69e2155fb68ee

The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to run more tests. The allocator implements `malloc`, both internally and externally, while we continue to default to the NVIDIA `malloc` when we target NVIDIA GPUs. Once we have smarter or customizable allocators we should consider this choice, for now, this allocator is better than none. It traps if it is out of memory, making it easy to debug. Heap size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512MB. It allows to track allocation statistics via `LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with `-fopenmp-target-debug=8`). Two tests were added, and one was enabled. This is the next step towards fixing llvm#66708 Change-Id: I181cdca714994b285c0cd1d16dd3546809cc5dd2

This reverts commit d841fbf. Change-Id: I91980da09a1af7fb8cbb1c4fc8d28c1c768d5f5f

Local branch amd-gfx 4a300c2 Merged main:241c290ad73f into amd-gfx:77eb6cdceaa5 Remote branch main d3921e4 [OpenMP] Basic BumpAllocator for (AMD)GPUs (llvm#69806)

The commit was discussed in phabricator (https://reviews.llvm.org/D157186). Record replay currently fails on AMD as it conflicts with the heap memory allocator introduced in #69806. The workaround is setting `LIBOMPTARGET_HEAP_SIZE=0` during both record and replay run.

The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to run more tests. The allocator implements `malloc`, both internally and externally, while we continue to default to the NVIDIA `malloc` when we target NVIDIA GPUs. Once we have smarter or customizable allocators we should consider this choice, for now, this allocator is better than none. It traps if it is out of memory, making it easy to debug. Heap size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512MB. It allows to track allocation statistics via `LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with `-fopenmp-target-debug=8`). Two tests were added, and one was enabled. This is the next step towards fixing llvm#66708 Change-Id: I1fdec0f2a24dfff49ccbad5d43a0fd68916ccf16

jdoerfert added openmp openmp:libomptarget OpenMP offload runtime labels Oct 21, 2023

llvmbot added clang Clang issues not falling into any other category backend:AMDGPU backend:X86 clang:headers Headers provided by Clang, e.g. for intrinsics clang:openmp OpenMP related changes to Clang openmp:libomp OpenMP host runtime labels Oct 21, 2023

jdoerfert force-pushed the fix_nested_par branch from d30ff2e to d0df9cb Compare October 21, 2023 02:31

jdoerfert force-pushed the fix_nested_par branch from d0df9cb to 50807ba Compare October 21, 2023 02:34

jhuber6 reviewed Oct 21, 2023

View reviewed changes

clang/lib/Headers/openmp_wrappers/__clang_openmp_device_functions.h Outdated Show resolved Hide resolved

openmp/libomptarget/DeviceRTL/src/exports Outdated Show resolved Hide resolved

openmp/libomptarget/DeviceRTL/src/State.cpp Outdated Show resolved Hide resolved

jdoerfert force-pushed the fix_nested_par branch 2 times, most recently from 7061910 to f3e7e40 Compare October 21, 2023 02:44

jhuber6 reviewed Oct 21, 2023

View reviewed changes

jdoerfert force-pushed the fix_nested_par branch from f3e7e40 to 9819e84 Compare October 21, 2023 03:08

shiltian approved these changes Oct 21, 2023

View reviewed changes

jdoerfert merged commit d3921e4 into llvm:main Oct 21, 2023
3 checks passed

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Oct 22, 2023

merge main into amd-stg-open

827559e

locally : reverts d3921e4 [OpenMP] Basic BumpAllocator for (AMD)GPUs (llvm#69806) Change-Id: Id512e729870279855744ce65bfc69e2155fb68ee

spupyrev mentioned this pull request Oct 23, 2023

[BOLT] Rename cds to cdsort #69966

Merged

banach-space mentioned this pull request Oct 24, 2023

[mlir][vector] Add scalable vectors to tests for vector.contract #70039

Merged

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Oct 26, 2023

Revert "[OpenMP] Basic BumpAllocator for (AMD)GPUs (llvm#69806)"

8ca8a22

This reverts commit d841fbf. Change-Id: I91980da09a1af7fb8cbb1c4fc8d28c1c768d5f5f

koparasy mentioned this pull request Oct 27, 2023

[openmp] Fixed Support for VA for record-replay. #70396

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[OpenMP] Basic BumpAllocator for (AMD)GPUs #69806

[OpenMP] Basic BumpAllocator for (AMD)GPUs #69806

jdoerfert commented Oct 21, 2023

llvmbot commented Oct 21, 2023 •

edited

Loading

github-actions bot commented Oct 21, 2023 •

edited

Loading

jhuber6 left a comment

jhuber6 Oct 21, 2023

jdoerfert Oct 21, 2023

jhuber6 Oct 21, 2023

shiltian Oct 21, 2023 •

edited

Loading

jhuber6 Oct 21, 2023

jdoerfert Oct 21, 2023

jdoerfert commented Oct 21, 2023

jhuber6 commented Oct 21, 2023

jdoerfert commented Oct 21, 2023

shiltian left a comment

jhuber6 commented Oct 21, 2023

jdoerfert commented Oct 21, 2023

jplehr commented Oct 22, 2023

[OpenMP] Basic BumpAllocator for (AMD)GPUs #69806

[OpenMP] Basic BumpAllocator for (AMD)GPUs #69806

Conversation

jdoerfert commented Oct 21, 2023

llvmbot commented Oct 21, 2023 • edited Loading

github-actions bot commented Oct 21, 2023 • edited Loading

jhuber6 left a comment

Choose a reason for hiding this comment

jhuber6 Oct 21, 2023

Choose a reason for hiding this comment

jdoerfert Oct 21, 2023

Choose a reason for hiding this comment

jhuber6 Oct 21, 2023

Choose a reason for hiding this comment

shiltian Oct 21, 2023 • edited Loading

Choose a reason for hiding this comment

jhuber6 Oct 21, 2023

Choose a reason for hiding this comment

jdoerfert Oct 21, 2023

Choose a reason for hiding this comment

jdoerfert commented Oct 21, 2023

jhuber6 commented Oct 21, 2023

jdoerfert commented Oct 21, 2023

shiltian left a comment

Choose a reason for hiding this comment

jhuber6 commented Oct 21, 2023

jdoerfert commented Oct 21, 2023

jplehr commented Oct 22, 2023

llvmbot commented Oct 21, 2023 •

edited

Loading

github-actions bot commented Oct 21, 2023 •

edited

Loading

shiltian Oct 21, 2023 •

edited

Loading