[LLVM] Port 'llvm-gpu-loader' to use LLVMOffload #162739
Summary: This patch rewrites the `llvm-gpu-loader` utility to use the LLVMOffload interface. This heavily simplifies the tool while reusing the already existing support. Another benefit is that the runtimes can now be loaded dynamically, so this utility can always be built without needing to find non-standard packages. One issue, noted in llvm#159636, is that startup will now take extra time if both CUDA and ROCm are installed on the same machine. That is only slightly annoying, and since most people don't have both at once I don't consider it a blocker; I will address it later. The environment variable usage is slightly unfortunate as well; I will expose that better in the future.
@llvm/pr-subscribers-libc @llvm/pr-subscribers-backend-amdgpu
Author: Joseph Huber (jhuber6)
Fixes: #132890
Patch is 71.88 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/162739.diff
10 Files Affected:
diff --git a/libc/cmake/modules/LLVMLibCTestRules.cmake b/libc/cmake/modules/LLVMLibCTestRules.cmake
index 19da0ad29cd84..dfc5ab904fe9d 100644
--- a/libc/cmake/modules/LLVMLibCTestRules.cmake
+++ b/libc/cmake/modules/LLVMLibCTestRules.cmake
@@ -87,7 +87,6 @@ function(_get_hermetic_test_compile_options output_var)
-mcode-object-version=${LIBC_GPU_CODE_OBJECT_VERSION})
elseif(LIBC_TARGET_ARCHITECTURE_IS_NVPTX)
list(APPEND compile_options
- "SHELL:-mllvm -nvptx-emit-init-fini-kernel=false"
-Wno-multi-gpu --cuda-path=${LIBC_CUDA_ROOT}
-nogpulib -march=${LIBC_GPU_TARGET_ARCHITECTURE} -fno-use-cxa-atexit)
endif()
@@ -637,6 +636,7 @@ function(add_integration_test test_name)
# makes `add_custom_target` construct the correct command and execute it.
set(test_cmd
${INTEGRATION_TEST_ENV}
+ $<$<BOOL:${LIBC_TARGET_ARCHITECTURE_IS_NVPTX}>:LIBOMPTARGET_STACK_SIZE=3072>
$<$<BOOL:${LIBC_TARGET_OS_IS_GPU}>:${gpu_loader_exe}>
${CMAKE_CROSSCOMPILING_EMULATOR}
${INTEGRATION_TEST_LOADER_ARGS}
@@ -790,8 +790,7 @@ function(add_libc_hermetic test_name)
if(LIBC_TARGET_ARCHITECTURE_IS_AMDGPU)
target_link_options(${fq_build_target_name} PRIVATE
${LIBC_COMPILE_OPTIONS_DEFAULT} -Wno-multi-gpu
- -mcpu=${LIBC_GPU_TARGET_ARCHITECTURE} -flto
- "-Wl,-mllvm,-amdgpu-lower-global-ctor-dtor=0" -nostdlib -static
+ -mcpu=${LIBC_GPU_TARGET_ARCHITECTURE} -flto -nostdlib -static
"-Wl,-mllvm,-amdhsa-code-object-version=${LIBC_GPU_CODE_OBJECT_VERSION}")
elseif(LIBC_TARGET_ARCHITECTURE_IS_NVPTX)
target_link_options(${fq_build_target_name} PRIVATE
@@ -859,6 +858,7 @@ function(add_libc_hermetic test_name)
string(REPLACE " " ";" test_cmd "${test_cmd_parsed}")
else()
set(test_cmd ${HERMETIC_TEST_ENV}
+ $<$<BOOL:${LIBC_TARGET_ARCHITECTURE_IS_NVPTX}>:LIBOMPTARGET_STACK_SIZE=3072>
$<$<BOOL:${LIBC_TARGET_OS_IS_GPU}>:${gpu_loader_exe}> ${CMAKE_CROSSCOMPILING_EMULATOR} ${HERMETIC_TEST_LOADER_ARGS}
$<TARGET_FILE:${fq_build_target_name}> ${HERMETIC_TEST_ARGS})
endif()
diff --git a/libc/startup/gpu/amdgpu/start.cpp b/libc/startup/gpu/amdgpu/start.cpp
index 48f095d924931..e876629e6d0e1 100644
--- a/libc/startup/gpu/amdgpu/start.cpp
+++ b/libc/startup/gpu/amdgpu/start.cpp
@@ -13,6 +13,9 @@
#include "src/stdlib/atexit.h"
#include "src/stdlib/exit.h"
+// TODO: Merge this and the NVPTX start files once the common `device_kernel`
+// attribute correctly implies `amdgpu_kernel`.
+
extern "C" int main(int argc, char **argv, char **envp);
extern "C" void __cxa_finalize(void *dso);
@@ -21,45 +24,18 @@ namespace LIBC_NAMESPACE_DECL {
// FIXME: Factor this out into common logic so we don't need to stub it here.
void teardown_main_tls() {}
-// FIXME: Touch this symbol to force this to be linked in statically.
-volatile void *dummy = &LIBC_NAMESPACE::rpc::client;
-
DataEnvironment app;
-extern "C" uintptr_t __init_array_start[];
-extern "C" uintptr_t __init_array_end[];
-extern "C" uintptr_t __fini_array_start[];
-extern "C" uintptr_t __fini_array_end[];
-
-using InitCallback = void(int, char **, char **);
-using FiniCallback = void(void);
-
-static void call_init_array_callbacks(int argc, char **argv, char **env) {
- size_t init_array_size = __init_array_end - __init_array_start;
- for (size_t i = 0; i < init_array_size; ++i)
- reinterpret_cast<InitCallback *>(__init_array_start[i])(argc, argv, env);
-}
-
-static void call_fini_array_callbacks() {
- size_t fini_array_size = __fini_array_end - __fini_array_start;
- for (size_t i = fini_array_size; i > 0; --i)
- reinterpret_cast<FiniCallback *>(__fini_array_start[i - 1])();
-}
-
} // namespace LIBC_NAMESPACE_DECL
extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel,
clang::amdgpu_flat_work_group_size(1, 1),
clang::amdgpu_max_num_work_groups(1)]] void
-_begin(int argc, char **argv, char **env) {
+_begin(int, char **, char **env) {
+ // The LLVM offloading runtime will automatically call any present global
+ // constructors and destructors so we defer that handling.
__atomic_store_n(&LIBC_NAMESPACE::app.env_ptr,
reinterpret_cast<uintptr_t *>(env), __ATOMIC_RELAXED);
- // We want the fini array callbacks to be run after other atexit
- // callbacks are run. So, we register them before running the init
- // array callbacks as they can potentially register their own atexit
- // callbacks.
- LIBC_NAMESPACE::atexit(&LIBC_NAMESPACE::call_fini_array_callbacks);
- LIBC_NAMESPACE::call_init_array_callbacks(argc, argv, env);
}
extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel]] void
diff --git a/libc/startup/gpu/nvptx/start.cpp b/libc/startup/gpu/nvptx/start.cpp
index ce8f5bbb6d4eb..822930e22bc26 100644
--- a/libc/startup/gpu/nvptx/start.cpp
+++ b/libc/startup/gpu/nvptx/start.cpp
@@ -23,48 +23,14 @@ DataEnvironment app;
// FIXME: Factor this out into common logic so we don't need to stub it here.
void teardown_main_tls() {}
-// FIXME: Touch this symbol to force this to be linked in statically.
-volatile void *dummy = &LIBC_NAMESPACE::rpc::client;
-
-extern "C" {
-// Nvidia's 'nvlink' linker does not provide these symbols. We instead need
-// to manually create them and update the globals in the loader implememtation.
-uintptr_t *__init_array_start [[gnu::visibility("protected")]];
-uintptr_t *__init_array_end [[gnu::visibility("protected")]];
-uintptr_t *__fini_array_start [[gnu::visibility("protected")]];
-uintptr_t *__fini_array_end [[gnu::visibility("protected")]];
-}
-
-// Nvidia requires that the signature of the function pointers match. This means
-// we cannot support the extended constructor arguments.
-using InitCallback = void(void);
-using FiniCallback = void(void);
-
-static void call_init_array_callbacks(int, char **, char **) {
- size_t init_array_size = __init_array_end - __init_array_start;
- for (size_t i = 0; i < init_array_size; ++i)
- reinterpret_cast<InitCallback *>(__init_array_start[i])();
-}
-
-static void call_fini_array_callbacks() {
- size_t fini_array_size = __fini_array_end - __fini_array_start;
- for (size_t i = fini_array_size; i > 0; --i)
- reinterpret_cast<FiniCallback *>(__fini_array_start[i - 1])();
-}
-
} // namespace LIBC_NAMESPACE_DECL
extern "C" [[gnu::visibility("protected"), clang::nvptx_kernel]] void
-_begin(int argc, char **argv, char **env) {
+_begin(int, char **, char **env) {
+ // The LLVM offloading runtime will automatically call any present global
+ // constructors and destructors so we defer that handling.
__atomic_store_n(&LIBC_NAMESPACE::app.env_ptr,
reinterpret_cast<uintptr_t *>(env), __ATOMIC_RELAXED);
-
- // We want the fini array callbacks to be run after other atexit
- // callbacks are run. So, we register them before running the init
- // array callbacks as they can potentially register their own atexit
- // callbacks.
- LIBC_NAMESPACE::atexit(&LIBC_NAMESPACE::call_fini_array_callbacks);
- LIBC_NAMESPACE::call_init_array_callbacks(argc, argv, env);
}
extern "C" [[gnu::visibility("protected"), clang::nvptx_kernel]] void
diff --git a/llvm/tools/CMakeLists.txt b/llvm/tools/CMakeLists.txt
index 729797aa43f0b..b9c5a79849ec8 100644
--- a/llvm/tools/CMakeLists.txt
+++ b/llvm/tools/CMakeLists.txt
@@ -9,10 +9,6 @@
# traversing each directory.
create_llvm_tool_options()
-if(NOT LLVM_COMPILER_IS_GCC_COMPATIBLE OR NOT LLVM_LIBC_GPU_BUILD)
- set(LLVM_TOOL_LLVM_GPU_LOADER_BUILD OFF)
-endif()
-
if(NOT LLVM_BUILD_LLVM_DYLIB AND NOT LLVM_BUILD_LLVM_C_DYLIB)
set(LLVM_TOOL_LLVM_SHLIB_BUILD Off)
endif()
diff --git a/llvm/tools/llvm-gpu-loader/CMakeLists.txt b/llvm/tools/llvm-gpu-loader/CMakeLists.txt
index b35a702476ada..de276635e2713 100644
--- a/llvm/tools/llvm-gpu-loader/CMakeLists.txt
+++ b/llvm/tools/llvm-gpu-loader/CMakeLists.txt
@@ -1,46 +1,12 @@
set(LLVM_LINK_COMPONENTS
- BinaryFormat
- Object
Option
Support
- FrontendOffloading
TargetParser
)
add_llvm_tool(llvm-gpu-loader
llvm-gpu-loader.cpp
- # TODO: We intentionally split this currently due to statically linking the
- # GPU runtimes. Dynamically load the dependencies, possibly using the
- # LLVM offloading API when it is complete.
- PARTIAL_SOURCES_INTENDED
-
DEPENDS
intrinsics_gen
)
-
-# Locate the RPC server handling interface.
-include(FindLibcCommonUtils)
-target_link_libraries(llvm-gpu-loader PUBLIC llvm-libc-common-utilities)
-
-# Check for HSA support for targeting AMD GPUs.
-find_package(hsa-runtime64 QUIET 1.2.0 HINTS ${CMAKE_INSTALL_PREFIX} PATHS /opt/rocm)
-if(hsa-runtime64_FOUND)
- target_sources(llvm-gpu-loader PRIVATE amdhsa.cpp)
- target_compile_definitions(llvm-gpu-loader PRIVATE AMDHSA_SUPPORT)
- target_link_libraries(llvm-gpu-loader PRIVATE hsa-runtime64::hsa-runtime64)
-
- # Compatibility with the old amdhsa-loader name.
- add_llvm_tool_symlink(amdhsa-loader llvm-gpu-loader)
-endif()
-
-# Check for CUDA support for targeting NVIDIA GPUs.
-find_package(CUDAToolkit 11.2 QUIET)
-if(CUDAToolkit_FOUND)
- target_sources(llvm-gpu-loader PRIVATE nvptx.cpp)
- target_compile_definitions(llvm-gpu-loader PRIVATE NVPTX_SUPPORT)
- target_link_libraries(llvm-gpu-loader PRIVATE CUDA::cuda_driver)
-
- # Compatibility with the old nvptx-loader name.
- add_llvm_tool_symlink(nvptx-loader llvm-gpu-loader)
-endif()
diff --git a/llvm/tools/llvm-gpu-loader/amdhsa.cpp b/llvm/tools/llvm-gpu-loader/amdhsa.cpp
deleted file mode 100644
index 5715058d8cfac..0000000000000
--- a/llvm/tools/llvm-gpu-loader/amdhsa.cpp
+++ /dev/null
@@ -1,594 +0,0 @@
-//===-- Loader Implementation for AMDHSA devices --------------------------===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-//
-// This file impelements a simple loader to run images supporting the AMDHSA
-// architecture. The file launches the '_start' kernel which should be provided
-// by the device application start code and call ultimately call the 'main'
-// function.
-//
-//===----------------------------------------------------------------------===//
-
-#include "llvm-gpu-loader.h"
-#include "server.h"
-
-#include "hsa/hsa.h"
-#include "hsa/hsa_ext_amd.h"
-
-#include "llvm/Frontend/Offloading/Utility.h"
-
-#include <atomic>
-#include <cstdio>
-#include <cstdlib>
-#include <cstring>
-#include <thread>
-#include <tuple>
-#include <utility>
-
-// The implicit arguments of COV5 AMDGPU kernels.
-struct implicit_args_t {
- uint32_t grid_size_x;
- uint32_t grid_size_y;
- uint32_t grid_size_z;
- uint16_t workgroup_size_x;
- uint16_t workgroup_size_y;
- uint16_t workgroup_size_z;
- uint8_t Unused0[46];
- uint16_t grid_dims;
- uint8_t Unused1[190];
-};
-
-/// Print the error code and exit if \p code indicates an error.
-static void handle_error_impl(const char *file, int32_t line,
- hsa_status_t code) {
- if (code == HSA_STATUS_SUCCESS || code == HSA_STATUS_INFO_BREAK)
- return;
-
- const char *desc;
- if (hsa_status_string(code, &desc) != HSA_STATUS_SUCCESS)
- desc = "Unknown error";
- fprintf(stderr, "%s:%d:0: Error: %s\n", file, line, desc);
- exit(EXIT_FAILURE);
-}
-
-/// Generic interface for iterating using the HSA callbacks.
-template <typename elem_ty, typename func_ty, typename callback_ty>
-hsa_status_t iterate(func_ty func, callback_ty cb) {
- auto l = [](elem_ty elem, void *data) -> hsa_status_t {
- callback_ty *unwrapped = static_cast<callback_ty *>(data);
- return (*unwrapped)(elem);
- };
- return func(l, static_cast<void *>(&cb));
-}
-
-/// Generic interface for iterating using the HSA callbacks.
-template <typename elem_ty, typename func_ty, typename func_arg_ty,
- typename callback_ty>
-hsa_status_t iterate(func_ty func, func_arg_ty func_arg, callback_ty cb) {
- auto l = [](elem_ty elem, void *data) -> hsa_status_t {
- callback_ty *unwrapped = static_cast<callback_ty *>(data);
- return (*unwrapped)(elem);
- };
- return func(func_arg, l, static_cast<void *>(&cb));
-}
-
-/// Iterate through all availible agents.
-template <typename callback_ty>
-hsa_status_t iterate_agents(callback_ty callback) {
- return iterate<hsa_agent_t>(hsa_iterate_agents, callback);
-}
-
-/// Iterate through all availible memory pools.
-template <typename callback_ty>
-hsa_status_t iterate_agent_memory_pools(hsa_agent_t agent, callback_ty cb) {
- return iterate<hsa_amd_memory_pool_t>(hsa_amd_agent_iterate_memory_pools,
- agent, cb);
-}
-
-template <hsa_device_type_t flag>
-hsa_status_t get_agent(hsa_agent_t *output_agent) {
- // Find the first agent with a matching device type.
- auto cb = [&](hsa_agent_t hsa_agent) -> hsa_status_t {
- hsa_device_type_t type;
- hsa_status_t status =
- hsa_agent_get_info(hsa_agent, HSA_AGENT_INFO_DEVICE, &type);
- if (status != HSA_STATUS_SUCCESS)
- return status;
-
- if (type == flag) {
- // Ensure that a GPU agent supports kernel dispatch packets.
- if (type == HSA_DEVICE_TYPE_GPU) {
- hsa_agent_feature_t features;
- status =
- hsa_agent_get_info(hsa_agent, HSA_AGENT_INFO_FEATURE, &features);
- if (status != HSA_STATUS_SUCCESS)
- return status;
- if (features & HSA_AGENT_FEATURE_KERNEL_DISPATCH)
- *output_agent = hsa_agent;
- } else {
- *output_agent = hsa_agent;
- }
- return HSA_STATUS_INFO_BREAK;
- }
- return HSA_STATUS_SUCCESS;
- };
-
- return iterate_agents(cb);
-}
-
-void print_kernel_resources(const char *kernel_name) {
- fprintf(stderr, "Kernel resources on AMDGPU is not supported yet.\n");
-}
-
-/// Retrieve a global memory pool with a \p flag from the agent.
-template <hsa_amd_memory_pool_global_flag_t flag>
-hsa_status_t get_agent_memory_pool(hsa_agent_t agent,
- hsa_amd_memory_pool_t *output_pool) {
- auto cb = [&](hsa_amd_memory_pool_t memory_pool) {
- uint32_t flags;
- hsa_amd_segment_t segment;
- if (auto err = hsa_amd_memory_pool_get_info(
- memory_pool, HSA_AMD_MEMORY_POOL_INFO_SEGMENT, &segment))
- return err;
- if (auto err = hsa_amd_memory_pool_get_info(
- memory_pool, HSA_AMD_MEMORY_POOL_INFO_GLOBAL_FLAGS, &flags))
- return err;
-
- if (segment != HSA_AMD_SEGMENT_GLOBAL)
- return HSA_STATUS_SUCCESS;
-
- if (flags & flag)
- *output_pool = memory_pool;
-
- return HSA_STATUS_SUCCESS;
- };
- return iterate_agent_memory_pools(agent, cb);
-}
-
-template <typename args_t>
-hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
- hsa_amd_memory_pool_t kernargs_pool,
- hsa_amd_memory_pool_t coarsegrained_pool,
- hsa_queue_t *queue, rpc::Server &server,
- const LaunchParameters ¶ms,
- const char *kernel_name, args_t kernel_args,
- uint32_t wavefront_size, bool print_resource_usage) {
- // Look up the kernel in the loaded executable.
- hsa_executable_symbol_t symbol;
- if (hsa_status_t err = hsa_executable_get_symbol_by_name(
- executable, kernel_name, &dev_agent, &symbol))
- return err;
-
- // Retrieve different properties of the kernel symbol used for launch.
- uint64_t kernel;
- uint32_t args_size;
- uint32_t group_size;
- uint32_t private_size;
- bool dynamic_stack;
-
- std::pair<hsa_executable_symbol_info_t, void *> symbol_infos[] = {
- {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel},
- {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_KERNARG_SEGMENT_SIZE, &args_size},
- {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_GROUP_SEGMENT_SIZE, &group_size},
- {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_DYNAMIC_CALLSTACK, &dynamic_stack},
- {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_PRIVATE_SEGMENT_SIZE, &private_size}};
-
- for (auto &[info, value] : symbol_infos)
- if (hsa_status_t err = hsa_executable_symbol_get_info(symbol, info, value))
- return err;
-
- // Allocate space for the kernel arguments on the host and allow the GPU agent
- // to access it.
- void *args;
- if (hsa_status_t err = hsa_amd_memory_pool_allocate(kernargs_pool, args_size,
- /*flags=*/0, &args))
- handle_error(err);
- hsa_amd_agents_allow_access(1, &dev_agent, nullptr, args);
-
- // Initialize all the arguments (explicit and implicit) to zero, then set the
- // explicit arguments to the values created above.
- std::memset(args, 0, args_size);
- std::memcpy(args, &kernel_args, std::is_empty_v<args_t> ? 0 : sizeof(args_t));
-
- // Initialize the necessary implicit arguments to the proper values.
- int dims = 1 + (params.num_blocks_y * params.num_threads_y != 1) +
- (params.num_blocks_z * params.num_threads_z != 1);
- implicit_args_t *implicit_args = reinterpret_cast<implicit_args_t *>(
- reinterpret_cast<uint8_t *>(args) + sizeof(args_t));
- implicit_args->grid_dims = dims;
- implicit_args->grid_size_x = params.num_blocks_x;
- implicit_args->grid_size_y = params.num_blocks_y;
- implicit_args->grid_size_z = params.num_blocks_z;
- implicit_args->workgroup_size_x = params.num_threads_x;
- implicit_args->workgroup_size_y = params.num_threads_y;
- implicit_args->workgroup_size_z = params.num_threads_z;
-
- // Obtain a packet from the queue.
- uint64_t packet_id = hsa_queue_add_write_index_relaxed(queue, 1);
- while (packet_id - hsa_queue_load_read_index_scacquire(queue) >= queue->size)
- ;
-
- const uint32_t mask = queue->size - 1;
- hsa_kernel_dispatch_packet_t *packet =
- static_cast<hsa_kernel_dispatch_packet_t *>(queue->base_address) +
- (packet_id & mask);
-
- // Set up the packet for exeuction on the device. We currently only launch
- // with one thread on the device, forcing the rest of the wavefront to be
- // masked off.
- uint16_t setup = (dims) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
- packet->workgroup_size_x = params.num_threads_x;
- packet->workgroup_size_y = params.num_threads_y;
- packet->workgroup_size_z = params.num_threads_z;
- packet->reserved0 = 0;
- packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
- packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
- packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
- packet->private_segment_size =
- dynamic_stack ? 16 * 1024 /* 16 KB */ : private_size;
- packet->group_segment_size = group_size;
- packet->kernel_object = kernel;
- packet->kernarg_address = args;
- packet->reserved2 = 0;
- // Create a signal to indicate when this packet has been completed.
- if (hsa_status_t err =
- hsa_signal_create(1, 0, nullptr, &packet->completion_signal))
- handle_error(err);
-
- if (print_resource_usage)
- print_kernel_resources(kernel_name);
-
- // Initialize the packet header and set the doorbell signal to begin execution
- // by the HSA runtime.
- uint16_t header =
- 1u << HSA_PACKET_HEADER_BARRIER |
- (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
- (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
- (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
- uint32_t header_word = header | (setup << 16u);
- __atomic_store_n((uint32_t *)&packet->header, header_word, __ATOMIC_RELEASE);
- hsa_signal_store_relaxed(queue->doorbell_signal, packet_id);
-
- std::atomic<bool> finished = false;
- std::thread server_thread(
- [](std::atomic<bool> *finished, rpc::Server *server,
- uint32_t wavefront_size, hsa_agent_t dev_agent,
- hsa_amd_memory_pool_t coarsegrained_pool) {
- // Register RPC callbacks for the malloc and free functions on HSA.
- auto malloc_handler = [&](size_t size) -> void * {
- void *dev_ptr = nullptr;
- if (hsa_amd_memory_pool_allocate(coarsegrained_pool, size,
- /*flags=*/0, &dev_ptr))
- dev_ptr = nullptr;
- hsa_amd_agents_allow_access(1, &dev_agent, nullptr, dev_ptr...
[truncated]
@Artem-B This may require you to change the build of the NVIDIA bot for libc. Ideally you just need to add
Are the libraries expected to exist somewhere in the search path on the build host or on the host where I run the tests?
We pick them out at runtime via
Not really qualified to review this, but just some basic comments.
return copyArgumentVector(Envc, Envp, Device);
}

ol_device_handle_t findDevice(MemoryBufferRef Binary)
Maybe we could name this findDeviceForImage or something?
typedef enum ol_alloc_type_t {
  OL_ALLOC_TYPE_HOST = 0,
  OL_ALLOC_TYPE_DEVICE = 1,
  OL_ALLOC_TYPE_FORCE_UINT32 = 0x7fffffff
Can you describe what the FORCE entries represent?
This is technically C++ so I could probably just use : uint32_t, but it's the C way to make sure the enum shows up as u32 instead of u8 or something.
typedef enum ol_platform_backend_t {
  OL_PLATFORM_BACKEND_UNKNOWN = 0,
  OL_PLATFORM_BACKEND_CUDA = 1,
  OL_PLATFORM_BACKEND_AMDGPU = 2,
Should we add one for LEVEL_ZERO even though the plugin isn't merged yet?
No, that should happen when there's actual support for it.
typedef struct ol_symbol_impl_t *ol_symbol_handle_t;
typedef const struct ol_error_struct_t *ol_result_t;

typedef bool (*ol_device_iterate_cb_t)(ol_device_handle_t Device,
Nit: can we use using x = y; instead of typedef? LLVM overall requires C++17, so I think it should be fine.
These are copied from the OffloadAPI header; I didn't think it was necessary to port them to C++17 since we might need to copy-paste them again in the future.