[LLVM] Port 'llvm-gpu-loader' to use LLVMOffload #162739
Summary: This patch rewrites the `llvm-gpu-loader` utility to use the LLVMOffload interface. This heavily simplifies the tool while reusing the already existing support. Another benefit is that the runtimes can now be loaded dynamically, so this utility can always be built without needing to find non-standard packages. One issue, noted in llvm#159636, is that startup will now take extra time if both CUDA and ROCm are installed on the same machine. That is only slightly annoying, and since most people don't have both at once I don't consider it a blocker; I will address it later. The environment variable usage is slightly unfortunate as well; I will expose that better in the future.
@llvm/pr-subscribers-libc @llvm/pr-subscribers-backend-amdgpu
Author: Joseph Huber (jhuber6)
Fixes: #132890
Patch is 71.88 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/162739.diff
10 Files Affected:
diff --git a/libc/cmake/modules/LLVMLibCTestRules.cmake b/libc/cmake/modules/LLVMLibCTestRules.cmake
index 19da0ad29cd84..dfc5ab904fe9d 100644
--- a/libc/cmake/modules/LLVMLibCTestRules.cmake
+++ b/libc/cmake/modules/LLVMLibCTestRules.cmake
@@ -87,7 +87,6 @@ function(_get_hermetic_test_compile_options output_var)
-mcode-object-version=${LIBC_GPU_CODE_OBJECT_VERSION})
elseif(LIBC_TARGET_ARCHITECTURE_IS_NVPTX)
list(APPEND compile_options
- "SHELL:-mllvm -nvptx-emit-init-fini-kernel=false"
-Wno-multi-gpu --cuda-path=${LIBC_CUDA_ROOT}
-nogpulib -march=${LIBC_GPU_TARGET_ARCHITECTURE} -fno-use-cxa-atexit)
endif()
@@ -637,6 +636,7 @@ function(add_integration_test test_name)
# makes `add_custom_target` construct the correct command and execute it.
set(test_cmd
${INTEGRATION_TEST_ENV}
+ $<$<BOOL:${LIBC_TARGET_ARCHITECTURE_IS_NVPTX}>:LIBOMPTARGET_STACK_SIZE=3072>
$<$<BOOL:${LIBC_TARGET_OS_IS_GPU}>:${gpu_loader_exe}>
${CMAKE_CROSSCOMPILING_EMULATOR}
${INTEGRATION_TEST_LOADER_ARGS}
@@ -790,8 +790,7 @@ function(add_libc_hermetic test_name)
if(LIBC_TARGET_ARCHITECTURE_IS_AMDGPU)
target_link_options(${fq_build_target_name} PRIVATE
${LIBC_COMPILE_OPTIONS_DEFAULT} -Wno-multi-gpu
- -mcpu=${LIBC_GPU_TARGET_ARCHITECTURE} -flto
- "-Wl,-mllvm,-amdgpu-lower-global-ctor-dtor=0" -nostdlib -static
+ -mcpu=${LIBC_GPU_TARGET_ARCHITECTURE} -flto -nostdlib -static
"-Wl,-mllvm,-amdhsa-code-object-version=${LIBC_GPU_CODE_OBJECT_VERSION}")
elseif(LIBC_TARGET_ARCHITECTURE_IS_NVPTX)
target_link_options(${fq_build_target_name} PRIVATE
@@ -859,6 +858,7 @@ function(add_libc_hermetic test_name)
string(REPLACE " " ";" test_cmd "${test_cmd_parsed}")
else()
set(test_cmd ${HERMETIC_TEST_ENV}
+ $<$<BOOL:${LIBC_TARGET_ARCHITECTURE_IS_NVPTX}>:LIBOMPTARGET_STACK_SIZE=3072>
$<$<BOOL:${LIBC_TARGET_OS_IS_GPU}>:${gpu_loader_exe}> ${CMAKE_CROSSCOMPILING_EMULATOR} ${HERMETIC_TEST_LOADER_ARGS}
$<TARGET_FILE:${fq_build_target_name}> ${HERMETIC_TEST_ARGS})
endif()
diff --git a/libc/startup/gpu/amdgpu/start.cpp b/libc/startup/gpu/amdgpu/start.cpp
index 48f095d924931..e876629e6d0e1 100644
--- a/libc/startup/gpu/amdgpu/start.cpp
+++ b/libc/startup/gpu/amdgpu/start.cpp
@@ -13,6 +13,9 @@
#include "src/stdlib/atexit.h"
#include "src/stdlib/exit.h"
+// TODO: Merge this and the NVPTX start files once the common `device_kernel`
+// attribute correctly implies `amdgpu_kernel`.
+
extern "C" int main(int argc, char **argv, char **envp);
extern "C" void __cxa_finalize(void *dso);
@@ -21,45 +24,18 @@ namespace LIBC_NAMESPACE_DECL {
// FIXME: Factor this out into common logic so we don't need to stub it here.
void teardown_main_tls() {}
-// FIXME: Touch this symbol to force this to be linked in statically.
-volatile void *dummy = &LIBC_NAMESPACE::rpc::client;
-
DataEnvironment app;
-extern "C" uintptr_t __init_array_start[];
-extern "C" uintptr_t __init_array_end[];
-extern "C" uintptr_t __fini_array_start[];
-extern "C" uintptr_t __fini_array_end[];
-
-using InitCallback = void(int, char **, char **);
-using FiniCallback = void(void);
-
-static void call_init_array_callbacks(int argc, char **argv, char **env) {
- size_t init_array_size = __init_array_end - __init_array_start;
- for (size_t i = 0; i < init_array_size; ++i)
- reinterpret_cast<InitCallback *>(__init_array_start[i])(argc, argv, env);
-}
-
-static void call_fini_array_callbacks() {
- size_t fini_array_size = __fini_array_end - __fini_array_start;
- for (size_t i = fini_array_size; i > 0; --i)
- reinterpret_cast<FiniCallback *>(__fini_array_start[i - 1])();
-}
-
} // namespace LIBC_NAMESPACE_DECL
extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel,
clang::amdgpu_flat_work_group_size(1, 1),
clang::amdgpu_max_num_work_groups(1)]] void
-_begin(int argc, char **argv, char **env) {
+_begin(int, char **, char **env) {
+ // The LLVM offloading runtime will automatically call any present global
+ // constructors and destructors so we defer that handling.
__atomic_store_n(&LIBC_NAMESPACE::app.env_ptr,
reinterpret_cast<uintptr_t *>(env), __ATOMIC_RELAXED);
- // We want the fini array callbacks to be run after other atexit
- // callbacks are run. So, we register them before running the init
- // array callbacks as they can potentially register their own atexit
- // callbacks.
- LIBC_NAMESPACE::atexit(&LIBC_NAMESPACE::call_fini_array_callbacks);
- LIBC_NAMESPACE::call_init_array_callbacks(argc, argv, env);
}
extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel]] void
diff --git a/libc/startup/gpu/nvptx/start.cpp b/libc/startup/gpu/nvptx/start.cpp
index ce8f5bbb6d4eb..822930e22bc26 100644
--- a/libc/startup/gpu/nvptx/start.cpp
+++ b/libc/startup/gpu/nvptx/start.cpp
@@ -23,48 +23,14 @@ DataEnvironment app;
// FIXME: Factor this out into common logic so we don't need to stub it here.
void teardown_main_tls() {}
-// FIXME: Touch this symbol to force this to be linked in statically.
-volatile void *dummy = &LIBC_NAMESPACE::rpc::client;
-
-extern "C" {
-// Nvidia's 'nvlink' linker does not provide these symbols. We instead need
-// to manually create them and update the globals in the loader implememtation.
-uintptr_t *__init_array_start [[gnu::visibility("protected")]];
-uintptr_t *__init_array_end [[gnu::visibility("protected")]];
-uintptr_t *__fini_array_start [[gnu::visibility("protected")]];
-uintptr_t *__fini_array_end [[gnu::visibility("protected")]];
-}
-
-// Nvidia requires that the signature of the function pointers match. This means
-// we cannot support the extended constructor arguments.
-using InitCallback = void(void);
-using FiniCallback = void(void);
-
-static void call_init_array_callbacks(int, char **, char **) {
- size_t init_array_size = __init_array_end - __init_array_start;
- for (size_t i = 0; i < init_array_size; ++i)
- reinterpret_cast<InitCallback *>(__init_array_start[i])();
-}
-
-static void call_fini_array_callbacks() {
- size_t fini_array_size = __fini_array_end - __fini_array_start;
- for (size_t i = fini_array_size; i > 0; --i)
- reinterpret_cast<FiniCallback *>(__fini_array_start[i - 1])();
-}
-
} // namespace LIBC_NAMESPACE_DECL
extern "C" [[gnu::visibility("protected"), clang::nvptx_kernel]] void
-_begin(int argc, char **argv, char **env) {
+_begin(int, char **, char **env) {
+ // The LLVM offloading runtime will automatically call any present global
+ // constructors and destructors so we defer that handling.
__atomic_store_n(&LIBC_NAMESPACE::app.env_ptr,
reinterpret_cast<uintptr_t *>(env), __ATOMIC_RELAXED);
-
- // We want the fini array callbacks to be run after other atexit
- // callbacks are run. So, we register them before running the init
- // array callbacks as they can potentially register their own atexit
- // callbacks.
- LIBC_NAMESPACE::atexit(&LIBC_NAMESPACE::call_fini_array_callbacks);
- LIBC_NAMESPACE::call_init_array_callbacks(argc, argv, env);
}
extern "C" [[gnu::visibility("protected"), clang::nvptx_kernel]] void
diff --git a/llvm/tools/CMakeLists.txt b/llvm/tools/CMakeLists.txt
index 729797aa43f0b..b9c5a79849ec8 100644
--- a/llvm/tools/CMakeLists.txt
+++ b/llvm/tools/CMakeLists.txt
@@ -9,10 +9,6 @@
# traversing each directory.
create_llvm_tool_options()
-if(NOT LLVM_COMPILER_IS_GCC_COMPATIBLE OR NOT LLVM_LIBC_GPU_BUILD)
- set(LLVM_TOOL_LLVM_GPU_LOADER_BUILD OFF)
-endif()
-
if(NOT LLVM_BUILD_LLVM_DYLIB AND NOT LLVM_BUILD_LLVM_C_DYLIB)
set(LLVM_TOOL_LLVM_SHLIB_BUILD Off)
endif()
diff --git a/llvm/tools/llvm-gpu-loader/CMakeLists.txt b/llvm/tools/llvm-gpu-loader/CMakeLists.txt
index b35a702476ada..de276635e2713 100644
--- a/llvm/tools/llvm-gpu-loader/CMakeLists.txt
+++ b/llvm/tools/llvm-gpu-loader/CMakeLists.txt
@@ -1,46 +1,12 @@
set(LLVM_LINK_COMPONENTS
- BinaryFormat
- Object
Option
Support
- FrontendOffloading
TargetParser
)
add_llvm_tool(llvm-gpu-loader
llvm-gpu-loader.cpp
- # TODO: We intentionally split this currently due to statically linking the
- # GPU runtimes. Dynamically load the dependencies, possibly using the
- # LLVM offloading API when it is complete.
- PARTIAL_SOURCES_INTENDED
-
DEPENDS
intrinsics_gen
)
-
-# Locate the RPC server handling interface.
-include(FindLibcCommonUtils)
-target_link_libraries(llvm-gpu-loader PUBLIC llvm-libc-common-utilities)
-
-# Check for HSA support for targeting AMD GPUs.
-find_package(hsa-runtime64 QUIET 1.2.0 HINTS ${CMAKE_INSTALL_PREFIX} PATHS /opt/rocm)
-if(hsa-runtime64_FOUND)
- target_sources(llvm-gpu-loader PRIVATE amdhsa.cpp)
- target_compile_definitions(llvm-gpu-loader PRIVATE AMDHSA_SUPPORT)
- target_link_libraries(llvm-gpu-loader PRIVATE hsa-runtime64::hsa-runtime64)
-
- # Compatibility with the old amdhsa-loader name.
- add_llvm_tool_symlink(amdhsa-loader llvm-gpu-loader)
-endif()
-
-# Check for CUDA support for targeting NVIDIA GPUs.
-find_package(CUDAToolkit 11.2 QUIET)
-if(CUDAToolkit_FOUND)
- target_sources(llvm-gpu-loader PRIVATE nvptx.cpp)
- target_compile_definitions(llvm-gpu-loader PRIVATE NVPTX_SUPPORT)
- target_link_libraries(llvm-gpu-loader PRIVATE CUDA::cuda_driver)
-
- # Compatibility with the old nvptx-loader name.
- add_llvm_tool_symlink(nvptx-loader llvm-gpu-loader)
-endif()
diff --git a/llvm/tools/llvm-gpu-loader/amdhsa.cpp b/llvm/tools/llvm-gpu-loader/amdhsa.cpp
deleted file mode 100644
index 5715058d8cfac..0000000000000
--- a/llvm/tools/llvm-gpu-loader/amdhsa.cpp
+++ /dev/null
@@ -1,594 +0,0 @@
-//===-- Loader Implementation for AMDHSA devices --------------------------===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-//
-// This file impelements a simple loader to run images supporting the AMDHSA
-// architecture. The file launches the '_start' kernel which should be provided
-// by the device application start code and call ultimately call the 'main'
-// function.
-//
-//===----------------------------------------------------------------------===//
-
-#include "llvm-gpu-loader.h"
-#include "server.h"
-
-#include "hsa/hsa.h"
-#include "hsa/hsa_ext_amd.h"
-
-#include "llvm/Frontend/Offloading/Utility.h"
-
-#include <atomic>
-#include <cstdio>
-#include <cstdlib>
-#include <cstring>
-#include <thread>
-#include <tuple>
-#include <utility>
-
-// The implicit arguments of COV5 AMDGPU kernels.
-struct implicit_args_t {
- uint32_t grid_size_x;
- uint32_t grid_size_y;
- uint32_t grid_size_z;
- uint16_t workgroup_size_x;
- uint16_t workgroup_size_y;
- uint16_t workgroup_size_z;
- uint8_t Unused0[46];
- uint16_t grid_dims;
- uint8_t Unused1[190];
-};
-
-/// Print the error code and exit if \p code indicates an error.
-static void handle_error_impl(const char *file, int32_t line,
- hsa_status_t code) {
- if (code == HSA_STATUS_SUCCESS || code == HSA_STATUS_INFO_BREAK)
- return;
-
- const char *desc;
- if (hsa_status_string(code, &desc) != HSA_STATUS_SUCCESS)
- desc = "Unknown error";
- fprintf(stderr, "%s:%d:0: Error: %s\n", file, line, desc);
- exit(EXIT_FAILURE);
-}
-
-/// Generic interface for iterating using the HSA callbacks.
-template <typename elem_ty, typename func_ty, typename callback_ty>
-hsa_status_t iterate(func_ty func, callback_ty cb) {
- auto l = [](elem_ty elem, void *data) -> hsa_status_t {
- callback_ty *unwrapped = static_cast<callback_ty *>(data);
- return (*unwrapped)(elem);
- };
- return func(l, static_cast<void *>(&cb));
-}
-
-/// Generic interface for iterating using the HSA callbacks.
-template <typename elem_ty, typename func_ty, typename func_arg_ty,
- typename callback_ty>
-hsa_status_t iterate(func_ty func, func_arg_ty func_arg, callback_ty cb) {
- auto l = [](elem_ty elem, void *data) -> hsa_status_t {
- callback_ty *unwrapped = static_cast<callback_ty *>(data);
- return (*unwrapped)(elem);
- };
- return func(func_arg, l, static_cast<void *>(&cb));
-}
-
-/// Iterate through all availible agents.
-template <typename callback_ty>
-hsa_status_t iterate_agents(callback_ty callback) {
- return iterate<hsa_agent_t>(hsa_iterate_agents, callback);
-}
-
-/// Iterate through all availible memory pools.
-template <typename callback_ty>
-hsa_status_t iterate_agent_memory_pools(hsa_agent_t agent, callback_ty cb) {
- return iterate<hsa_amd_memory_pool_t>(hsa_amd_agent_iterate_memory_pools,
- agent, cb);
-}
-
-template <hsa_device_type_t flag>
-hsa_status_t get_agent(hsa_agent_t *output_agent) {
- // Find the first agent with a matching device type.
- auto cb = [&](hsa_agent_t hsa_agent) -> hsa_status_t {
- hsa_device_type_t type;
- hsa_status_t status =
- hsa_agent_get_info(hsa_agent, HSA_AGENT_INFO_DEVICE, &type);
- if (status != HSA_STATUS_SUCCESS)
- return status;
-
- if (type == flag) {
- // Ensure that a GPU agent supports kernel dispatch packets.
- if (type == HSA_DEVICE_TYPE_GPU) {
- hsa_agent_feature_t features;
- status =
- hsa_agent_get_info(hsa_agent, HSA_AGENT_INFO_FEATURE, &features);
- if (status != HSA_STATUS_SUCCESS)
- return status;
- if (features & HSA_AGENT_FEATURE_KERNEL_DISPATCH)
- *output_agent = hsa_agent;
- } else {
- *output_agent = hsa_agent;
- }
- return HSA_STATUS_INFO_BREAK;
- }
- return HSA_STATUS_SUCCESS;
- };
-
- return iterate_agents(cb);
-}
-
-void print_kernel_resources(const char *kernel_name) {
- fprintf(stderr, "Kernel resources on AMDGPU is not supported yet.\n");
-}
-
-/// Retrieve a global memory pool with a \p flag from the agent.
-template <hsa_amd_memory_pool_global_flag_t flag>
-hsa_status_t get_agent_memory_pool(hsa_agent_t agent,
- hsa_amd_memory_pool_t *output_pool) {
- auto cb = [&](hsa_amd_memory_pool_t memory_pool) {
- uint32_t flags;
- hsa_amd_segment_t segment;
- if (auto err = hsa_amd_memory_pool_get_info(
- memory_pool, HSA_AMD_MEMORY_POOL_INFO_SEGMENT, &segment))
- return err;
- if (auto err = hsa_amd_memory_pool_get_info(
- memory_pool, HSA_AMD_MEMORY_POOL_INFO_GLOBAL_FLAGS, &flags))
- return err;
-
- if (segment != HSA_AMD_SEGMENT_GLOBAL)
- return HSA_STATUS_SUCCESS;
-
- if (flags & flag)
- *output_pool = memory_pool;
-
- return HSA_STATUS_SUCCESS;
- };
- return iterate_agent_memory_pools(agent, cb);
-}
-
-template <typename args_t>
-hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
- hsa_amd_memory_pool_t kernargs_pool,
- hsa_amd_memory_pool_t coarsegrained_pool,
- hsa_queue_t *queue, rpc::Server &server,
- const LaunchParameters ¶ms,
- const char *kernel_name, args_t kernel_args,
- uint32_t wavefront_size, bool print_resource_usage) {
- // Look up the kernel in the loaded executable.
- hsa_executable_symbol_t symbol;
- if (hsa_status_t err = hsa_executable_get_symbol_by_name(
- executable, kernel_name, &dev_agent, &symbol))
- return err;
-
- // Retrieve different properties of the kernel symbol used for launch.
- uint64_t kernel;
- uint32_t args_size;
- uint32_t group_size;
- uint32_t private_size;
- bool dynamic_stack;
-
- std::pair<hsa_executable_symbol_info_t, void *> symbol_infos[] = {
- {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel},
- {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_KERNARG_SEGMENT_SIZE, &args_size},
- {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_GROUP_SEGMENT_SIZE, &group_size},
- {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_DYNAMIC_CALLSTACK, &dynamic_stack},
- {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_PRIVATE_SEGMENT_SIZE, &private_size}};
-
- for (auto &[info, value] : symbol_infos)
- if (hsa_status_t err = hsa_executable_symbol_get_info(symbol, info, value))
- return err;
-
- // Allocate space for the kernel arguments on the host and allow the GPU agent
- // to access it.
- void *args;
- if (hsa_status_t err = hsa_amd_memory_pool_allocate(kernargs_pool, args_size,
- /*flags=*/0, &args))
- handle_error(err);
- hsa_amd_agents_allow_access(1, &dev_agent, nullptr, args);
-
- // Initialize all the arguments (explicit and implicit) to zero, then set the
- // explicit arguments to the values created above.
- std::memset(args, 0, args_size);
- std::memcpy(args, &kernel_args, std::is_empty_v<args_t> ? 0 : sizeof(args_t));
-
- // Initialize the necessary implicit arguments to the proper values.
- int dims = 1 + (params.num_blocks_y * params.num_threads_y != 1) +
- (params.num_blocks_z * params.num_threads_z != 1);
- implicit_args_t *implicit_args = reinterpret_cast<implicit_args_t *>(
- reinterpret_cast<uint8_t *>(args) + sizeof(args_t));
- implicit_args->grid_dims = dims;
- implicit_args->grid_size_x = params.num_blocks_x;
- implicit_args->grid_size_y = params.num_blocks_y;
- implicit_args->grid_size_z = params.num_blocks_z;
- implicit_args->workgroup_size_x = params.num_threads_x;
- implicit_args->workgroup_size_y = params.num_threads_y;
- implicit_args->workgroup_size_z = params.num_threads_z;
-
- // Obtain a packet from the queue.
- uint64_t packet_id = hsa_queue_add_write_index_relaxed(queue, 1);
- while (packet_id - hsa_queue_load_read_index_scacquire(queue) >= queue->size)
- ;
-
- const uint32_t mask = queue->size - 1;
- hsa_kernel_dispatch_packet_t *packet =
- static_cast<hsa_kernel_dispatch_packet_t *>(queue->base_address) +
- (packet_id & mask);
-
- // Set up the packet for exeuction on the device. We currently only launch
- // with one thread on the device, forcing the rest of the wavefront to be
- // masked off.
- uint16_t setup = (dims) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
- packet->workgroup_size_x = params.num_threads_x;
- packet->workgroup_size_y = params.num_threads_y;
- packet->workgroup_size_z = params.num_threads_z;
- packet->reserved0 = 0;
- packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
- packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
- packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
- packet->private_segment_size =
- dynamic_stack ? 16 * 1024 /* 16 KB */ : private_size;
- packet->group_segment_size = group_size;
- packet->kernel_object = kernel;
- packet->kernarg_address = args;
- packet->reserved2 = 0;
- // Create a signal to indicate when this packet has been completed.
- if (hsa_status_t err =
- hsa_signal_create(1, 0, nullptr, &packet->completion_signal))
- handle_error(err);
-
- if (print_resource_usage)
- print_kernel_resources(kernel_name);
-
- // Initialize the packet header and set the doorbell signal to begin execution
- // by the HSA runtime.
- uint16_t header =
- 1u << HSA_PACKET_HEADER_BARRIER |
- (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
- (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
- (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
- uint32_t header_word = header | (setup << 16u);
- __atomic_store_n((uint32_t *)&packet->header, header_word, __ATOMIC_RELEASE);
- hsa_signal_store_relaxed(queue->doorbell_signal, packet_id);
-
- std::atomic<bool> finished = false;
- std::thread server_thread(
- [](std::atomic<bool> *finished, rpc::Server *server,
- uint32_t wavefront_size, hsa_agent_t dev_agent,
- hsa_amd_memory_pool_t coarsegrained_pool) {
- // Register RPC callbacks for the malloc and free functions on HSA.
- auto malloc_handler = [&](size_t size) -> void * {
- void *dev_ptr = nullptr;
- if (hsa_amd_memory_pool_allocate(coarsegrained_pool, size,
- /*flags=*/0, &dev_ptr))
- dev_ptr = nullptr;
- hsa_amd_agents_allow_access(1, &dev_agent, nullptr, dev_ptr...
[truncated]
@Artem-B This may require you to change the build of the NVIDIA bot for libc. Ideally you just need to add
Are the libraries expected to exist somewhere in the search path on the build host or on the host where I run the tests?
We pick them out at runtime via
Not really qualified to review this, but just some basic comments.
return copyArgumentVector(Envc, Envp, Device);
}

ol_device_handle_t findDevice(MemoryBufferRef Binary)
Maybe we could name this findDeviceForImage or something?
typedef enum ol_alloc_type_t {
  OL_ALLOC_TYPE_HOST = 0,
  OL_ALLOC_TYPE_DEVICE = 1,
  OL_ALLOC_TYPE_FORCE_UINT32 = 0x7fffffff
Can you describe what the FORCE entries represent?
This is technically C++ so I could probably just use : uint32_t, but it's the C way to make sure the enum shows up as u32 instead of u8 or something.
typedef enum ol_platform_backend_t {
  OL_PLATFORM_BACKEND_UNKNOWN = 0,
  OL_PLATFORM_BACKEND_CUDA = 1,
  OL_PLATFORM_BACKEND_AMDGPU = 2,
Should we add one for LEVEL_ZERO even though the plugin isn't merged yet?
No, that should happen when there's actual support for it.
typedef struct ol_symbol_impl_t *ol_symbol_handle_t;
typedef const struct ol_error_struct_t *ol_result_t;

typedef bool (*ol_device_iterate_cb_t)(ol_device_handle_t Device,
Nit: can we use using x = y; instead of typedef? LLVM overall requires C++17, so I think it should be fine.
These are copied from the OffloadAPI header; I didn't think it was necessary to port them to C++17 since we might need to copy-paste them again in the future.