
Conversation

@jhuber6 (Contributor) commented Oct 7, 2025

Summary:
We use this infrastructure to stand up a pretend hosted environment on
the GPU. Part of that is propagating exit codes and handling the `atexit`
callbacks. Exiting from inside a GPU region is problematic as it relies on
a lot of GPU magic behind the scenes. This is at least *correct* now, as
we use `quick_exit` on the CPU when the GPU calls `exit`. However,
calling `quick_exit` interferes with instrumentation or benchmarking
that expects a clean teardown order. For normal execution we should take
the friendly option and let the loader utility clean everything up
manually.

@llvmbot (Member) commented Oct 7, 2025

@llvm/pr-subscribers-libc

@llvm/pr-subscribers-backend-nvptx

Author: Joseph Huber (jhuber6)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/162371.diff

5 Files Affected:

  • (modified) libc/startup/gpu/amdgpu/start.cpp (+5-5)
  • (modified) libc/startup/gpu/nvptx/start.cpp (+5-4)
  • (modified) llvm/tools/llvm-gpu-loader/amdhsa.cpp (+2-2)
  • (modified) llvm/tools/llvm-gpu-loader/llvm-gpu-loader.h (+1-3)
  • (modified) llvm/tools/llvm-gpu-loader/nvptx.cpp (+2-2)
diff --git a/libc/startup/gpu/amdgpu/start.cpp b/libc/startup/gpu/amdgpu/start.cpp
index 8bd0c3a938d02..bf6c96ea7c617 100644
--- a/libc/startup/gpu/amdgpu/start.cpp
+++ b/libc/startup/gpu/amdgpu/start.cpp
@@ -14,6 +14,7 @@
 #include "src/stdlib/exit.h"
 
 extern "C" int main(int argc, char **argv, char **envp);
+extern "C" void __cxa_finalize(void *dso);
 
 namespace LIBC_NAMESPACE_DECL {
 
@@ -68,9 +69,8 @@ _start(int argc, char **argv, char **envp, int *ret) {
 extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel,
              clang::amdgpu_flat_work_group_size(1, 1),
              clang::amdgpu_max_num_work_groups(1)]] void
-_end(int retval) {
-  // Only a single thread should call `exit` here, the rest should gracefully
-  // return from the kernel. This is so only one thread calls the destructors
-  // registred with 'atexit' above.
-  LIBC_NAMESPACE::exit(retval);
+_end() {
+  // Only a single thread should call the destructors registered with 'atexit'.
+  // The loader utility will handle the actual exit and return code cleanly.
+  __cxa_finalize(nullptr);
 }
diff --git a/libc/startup/gpu/nvptx/start.cpp b/libc/startup/gpu/nvptx/start.cpp
index bc529b36f5097..f6e0c849b842c 100644
--- a/libc/startup/gpu/nvptx/start.cpp
+++ b/libc/startup/gpu/nvptx/start.cpp
@@ -14,6 +14,7 @@
 #include "src/stdlib/exit.h"
 
 extern "C" int main(int argc, char **argv, char **envp);
+extern "C" void __cxa_finalize(void *dso);
 
 namespace LIBC_NAMESPACE_DECL {
 
@@ -71,8 +72,8 @@ _start(int argc, char **argv, char **envp, int *ret) {
 }
 
 extern "C" [[gnu::visibility("protected"), clang::nvptx_kernel]] void
-_end(int retval) {
-  // To finis the execution we invoke all the callbacks registered via 'atexit'
-  // and then exit with the appropriate return value.
-  LIBC_NAMESPACE::exit(retval);
+_end() {
+  // Only a single thread should call the destructors registered with 'atexit'.
+  // The loader utility will handle the actual exit and return code cleanly.
+  __cxa_finalize(nullptr);
 }
diff --git a/llvm/tools/llvm-gpu-loader/amdhsa.cpp b/llvm/tools/llvm-gpu-loader/amdhsa.cpp
index be1b6b7993920..5715058d8cfac 100644
--- a/llvm/tools/llvm-gpu-loader/amdhsa.cpp
+++ b/llvm/tools/llvm-gpu-loader/amdhsa.cpp
@@ -192,7 +192,7 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   // Initialize all the arguments (explicit and implicit) to zero, then set the
   // explicit arguments to the values created above.
   std::memset(args, 0, args_size);
-  std::memcpy(args, &kernel_args, sizeof(args_t));
+  std::memcpy(args, &kernel_args, std::is_empty_v<args_t> ? 0 : sizeof(args_t));
 
   // Initialize the necessary implicit arguments to the proper values.
   int dims = 1 + (params.num_blocks_y * params.num_threads_y != 1) +
@@ -563,7 +563,7 @@ int load_amdhsa(int argc, const char **argv, const char **envp, void *image,
   // Save the return value and perform basic clean-up.
   int ret = *static_cast<int *>(host_ret);
 
-  end_args_t fini_args = {ret};
+  end_args_t fini_args = {};
   if (hsa_status_t err = launch_kernel(
           dev_agent, executable, kernargs_pool, coarsegrained_pool, queue,
           server, single_threaded_params, "_end.kd", fini_args,
diff --git a/llvm/tools/llvm-gpu-loader/llvm-gpu-loader.h b/llvm/tools/llvm-gpu-loader/llvm-gpu-loader.h
index ed34d0bace978..08861c29b4fa4 100644
--- a/llvm/tools/llvm-gpu-loader/llvm-gpu-loader.h
+++ b/llvm/tools/llvm-gpu-loader/llvm-gpu-loader.h
@@ -41,9 +41,7 @@ struct start_args_t {
 };
 
 /// The arguments to the '_end' kernel.
-struct end_args_t {
-  int argc;
-};
+struct end_args_t {};
 
 /// Generic interface to load the \p image and launch execution of the _start
 /// kernel on the target device. Copies \p argc and \p argv to the device.
diff --git a/llvm/tools/llvm-gpu-loader/nvptx.cpp b/llvm/tools/llvm-gpu-loader/nvptx.cpp
index 781a045e71a94..82b455249ad24 100644
--- a/llvm/tools/llvm-gpu-loader/nvptx.cpp
+++ b/llvm/tools/llvm-gpu-loader/nvptx.cpp
@@ -177,7 +177,7 @@ CUresult launch_kernel(CUmodule binary, CUstream stream, rpc::Server &server,
     handle_error(err);
 
   // Set up the arguments to the '_start' kernel on the GPU.
-  uint64_t args_size = sizeof(args_t);
+  uint64_t args_size = std::is_empty_v<args_t> ? 0 : sizeof(args_t);
   void *args_config[] = {CU_LAUNCH_PARAM_BUFFER_POINTER, &kernel_args,
                          CU_LAUNCH_PARAM_BUFFER_SIZE, &args_size,
                          CU_LAUNCH_PARAM_END};
@@ -342,7 +342,7 @@ int load_nvptx(int argc, const char **argv, const char **envp, void *image,
   if (CUresult err = cuStreamSynchronize(stream))
     handle_error(err);
 
-  end_args_t fini_args = {host_ret};
+  end_args_t fini_args = {};
   if (CUresult err =
           launch_kernel(binary, stream, server, single_threaded_params, "_end",
                         fini_args, print_resource_usage))


github-actions bot commented Oct 7, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@SchrodingerZhu (Contributor) left a comment:
LGTM

@jhuber6 jhuber6 merged commit 8c9c91f into llvm:main Oct 8, 2025
18 of 19 checks passed
svkeerthy pushed a commit that referenced this pull request Oct 9, 2025