@yxsamliu
Collaborator

Clarify how Clang-generated HIP fat binaries are registered and unregistered with the HIP runtime, and how this interacts with global constructors, destructors, and atexit handlers. Document that there is no strong guarantee on ordering relative to user-defined global ctors/dtors, recommend that HIP application developers avoid using kernels or device variables from global ctors/dtors, and describe the implications for HIP runtime developers (synchronization and guards in __hipRegisterFatBinary/__hipUnregisterFatBinary). This is motivated by questions from HIP application and runtime developers about fat binary registration/unregistration order and its potential interference with their own initialization and teardown code.

@yxsamliu yxsamliu requested review from Artem-B and jhuber6 November 18, 2025 16:52
@llvmbot llvmbot added the clang label (Clang issues not falling into any other category) Nov 18, 2025
@llvmbot
Member

llvmbot commented Nov 18, 2025

@llvm/pr-subscribers-clang

Author: Yaxun (Sam) Liu (yxsamliu)

Changes

Clarify how Clang-generated HIP fat binaries are registered and unregistered with the HIP runtime, and how this interacts with global constructors, destructors, and atexit handlers. Document that there is no strong guarantee on ordering relative to user-defined global ctors/dtors, recommend that HIP application developers avoid using kernels or device variables from global ctors/dtors, and describe the implications for HIP runtime developers (synchronization and guards in __hipRegisterFatBinary/__hipUnregisterFatBinary). This is motivated by questions from HIP application and runtime developers about fat binary registration/unregistration order and its potential interference with their own initialization and teardown code.


Full diff: https://github.com/llvm/llvm-project/pull/168566.diff

1 Files Affected:

  • (modified) clang/docs/HIPSupport.rst (+82)
diff --git a/clang/docs/HIPSupport.rst b/clang/docs/HIPSupport.rst
index ab9ea110e6d54..b33d663f0cfee 100644
--- a/clang/docs/HIPSupport.rst
+++ b/clang/docs/HIPSupport.rst
@@ -210,6 +210,88 @@ Host Code Compilation
 - These relocatable objects are then linked together.
 - Host code within a TU can call host functions and launch kernels from another TU.
 
+HIP Fat Binary Registration and Unregistration
+==============================================
+
+When compiling HIP for AMD GPUs, Clang embeds device code into HIP "fat
+binaries" and generates host-side helper functions that register these
+fat binaries with the HIP runtime at program start and unregister them at
+program exit. In non-RDC mode (``-fno-gpu-rdc``), each compilation unit
+typically produces its own self-contained fat binary per GPU architecture. In
+RDC mode (``-fgpu-rdc``), device bitcode from multiple compilation units may be
+linked together into a single fat binary per GPU architecture.
+
+At the LLVM IR level, Clang/LLVM typically create an internal module
+constructor (for example ``__hip_module_ctor`` or a ``.hip.fatbin_reg``
+function) and add it to ``@llvm.global_ctors``. This constructor is called by
+the C runtime before ``main`` and it:
+
+* calls ``__hipRegisterFatBinary`` with a pointer to an internal wrapper
+  object that describes the HIP fat binary;
+* stores the returned handle in an internal global variable;
+* calls an internal helper such as ``__hip_register_globals`` to register
+  kernels, device variables and other metadata associated with the fat binary;
+* registers a corresponding module destructor with ``atexit`` so it will run
+  during program termination.
+
+The module destructor (for example ``__hip_module_dtor`` or a
+``.hip.fatbin_unreg`` function) loads the stored handle, checks that it is
+non-null, calls ``__hipUnregisterFatBinary`` to unregister the fat binary from
+the HIP runtime, and then clears the handle. This ensures that the HIP runtime
+sees each fat binary registered exactly once and that it is unregistered once
+at exit, even when multiple translation units contribute HIP kernels to the
+same host program.
+
+These registration/unregistration helpers are implementation details of Clang's
+HIP code generation; user code should not call ``__hipRegisterFatBinary`` or
+``__hipUnregisterFatBinary`` directly.
+
+Implications for HIP Application Developers
+-------------------------------------------
+
+The fat binary registration and unregistration helpers participate in the same
+global constructor and termination mechanisms as the rest of the program, and
+there is no strong guarantee about their relative order with user-defined
+global constructors and destructors. In particular:
+
+* Applications should not invoke ``__hipRegisterFatBinary`` or
+  ``__hipUnregisterFatBinary`` explicitly.
+* Because registration happens in a compiler-generated module constructor and
+  unregistration happens via an ``atexit``-registered module destructor, the
+  exact ordering relative to other global ctors/dtors and ``atexit`` handlers
+  is implementation-dependent and may vary across platforms and toolchain
+  options.
+* To avoid subtle ordering issues, applications should not rely on HIP kernels
+  or device variables being usable from user-defined global constructors or
+  destructors. HIP initialization and teardown that touches kernels or device
+  state should instead be performed in ``main`` (or in functions called from
+  ``main``) after process startup.
+* In RDC mode, multiple translation units may contribute device code to a
+  single fat binary; user code should not make assumptions based on a
+  particular registration order between translation units.
+
+Implications for HIP Runtime Developers
+---------------------------------------
+
+HIP runtime implementations that are linked with Clang-generated host code
+must handle registration and unregistration in the presence of uncertain
+global ctor/dtor ordering:
+
+* ``__hipRegisterFatBinary`` must accept a pointer to the compiler-generated
+  wrapper object and return an opaque handle that remains valid for as long as
+  the fat binary may be used.
+* ``__hipUnregisterFatBinary`` must accept the handle previously returned by
+  ``__hipRegisterFatBinary`` and perform any necessary cleanup. It may be
+  called late in process teardown, after other parts of the runtime have
+  started shutting down, so it should be robust in the presence of partially
+  torn-down state.
+* Runtimes should use appropriate synchronization and guards so that fat
+  binary registration does not observe uninitialized resources and
+  unregistration does not release resources that are still required by other
+  runtime components. In particular, registration and unregistration routines
+  should be written to be safe under repeated calls and in the presence of
+  concurrent or overlapping initialization/teardown logic.
+
 Syntax Difference with CUDA
 ===========================
 

Member

@Artem-B Artem-B left a comment

LGTM overall, with a few editorial nits.

Comment on lines +220 to +222
typically produces its own self-contained fat binary per GPU architecture. In
RDC mode (``-fgpu-rdc``), device bitcode from multiple compilation units may be
linked together into a single fat binary per GPU architecture.

"self-contained fat binary" and "device bitcode" are somewhat vague.

Should we describe it in terms of a "fatbinary container" with "fully linked per-GPU executables" or "GPU object files/LLVM IR"?

When compiling HIP for AMD GPUs, Clang embeds device code into HIP "fat
binaries" and generates host-side helper functions that register these
fat binaries with the HIP runtime at program start and unregister them at
program exit. In non-RDC mode (``-fno-gpu-rdc``), each compilation unit

I'd add some details on what that registration is used for, i.e. to associate host-side addresses with GPU-side entities. E.g. when we call a host-side kernel stub, the runtime needs to know which kernel symbol on the GPU side we intend to call.
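
For illustration, the host-address-to-GPU-symbol association mentioned here could be pictured roughly as below. This is only a sketch in plain C++: `KernelRegistry`, `stub_address`, the two stub functions, and the device names are hypothetical stand-ins, not the HIP runtime's actual data structures or entry points.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

namespace stub_sketch {

// Hypothetical registry: maps the address of a host-side kernel stub to the
// name of the device-side symbol it stands for. A real runtime keeps more
// state (per-device modules, argument layout, and so on).
class KernelRegistry {
public:
  void register_function(const void *host_stub, std::string device_name) {
    assert(host_stub != nullptr);
    table_[host_stub] = std::move(device_name);
  }

  // Returns the device symbol name, or nullptr if the stub was never
  // registered (for example because registration has not run yet).
  const std::string *lookup(const void *host_stub) const {
    auto it = table_.find(host_stub);
    return it == table_.end() ? nullptr : &it->second;
  }

private:
  std::unordered_map<const void *, std::string> table_;
};

// Hypothetical host-side stubs; their addresses act as lookup keys.
int vector_add_stub() { return 1; }
int scale_stub() { return 2; }

// Converting a function pointer to an object pointer is only conditionally
// supported by the C++ standard, but works on the usual HIP host platforms.
inline const void *stub_address(int (*fn)()) {
  return reinterpret_cast<const void *>(fn);
}

} // namespace stub_sketch
```

A launch through `vector_add_stub` would then resolve to whatever name was registered for that address, and a lookup on an unregistered stub fails, which is the situation a launch from an early global constructor could run into.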

Comment on lines +234 to +235
* registers a corresponding module destructor with ``atexit`` so it will run
during program termination.

Formatting/structure nit:

I'd add a brief summary of the atexit handler's job. We do have the details described below, but this would help the bullet-point list to provide a sufficiently coherent summary.

Alternatively, if you do want the details of the module destructor to be part of the list, then the following paragraph should probably be attached to this bullet point or become a separate bullet point.
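
As a rough picture of the constructor/destructor pairing under discussion, here is a minimal sketch in plain C++. It assumes mock functions in place of the runtime: `mock_register_fat_binary`, `mock_unregister_fat_binary`, `FatBinaryWrapper`, and the global names are hypothetical and do not reproduce the real `__hipRegisterFatBinary`/`__hipUnregisterFatBinary` signatures or the code Clang actually emits.

```cpp
#include <cassert>
#include <cstdlib>

namespace registration_sketch {

// Hypothetical wrapper object describing an embedded fat binary.
struct FatBinaryWrapper {
  const void *image;
};

int g_register_calls = 0;
int g_unregister_calls = 0;

// Mock runtime entry points (assumptions, not the real HIP runtime API).
void **mock_register_fat_binary(const FatBinaryWrapper *wrapper) {
  static void *slot = nullptr;
  slot = const_cast<void *>(wrapper->image);
  ++g_register_calls;
  return &slot;
}

void mock_unregister_fat_binary(void **handle) {
  assert(handle != nullptr);
  ++g_unregister_calls;
}

const char g_image[] = "device code";
const FatBinaryWrapper g_wrapper = {g_image};
void **g_handle = nullptr; // handle stored by the module constructor

// Module destructor: load the handle, check it, unregister, clear it.
void module_dtor() {
  if (g_handle != nullptr) {
    mock_unregister_fat_binary(g_handle);
    g_handle = nullptr; // a second call becomes a no-op
  }
}

// Module constructor: register, store the handle, schedule the destructor.
void module_ctor() {
  g_handle = mock_register_fat_binary(&g_wrapper);
  // Registration of kernels and device variables would happen here.
  std::atexit(module_dtor);
}

} // namespace registration_sketch
```

The null check plus clearing the handle is what keeps unregistration at one call per registration, even if the destructor is reached more than once.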

Comment on lines +252 to +255
The fat binary registration and unregistration helpers participate in the same
global constructor and termination mechanisms as the rest of the program, and
there is no strong guarantee about their relative order with user-defined
global constructors and destructors. In particular:

We cannot feasibly enumerate all the things that users can do wrong.

I think we should keep it very simple -- outline the guarantees we do provide, and state that there's no promise of anything else. Examples of why/how things are done under the hood may be useful, but as far as the users are concerned, all we promise is that we'll register the kernels by the time main() is called, and we will unregister them via atexit() after main() exits.
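
To make the "registered by the time main() is called" guarantee concrete, here is a small simulation in plain C++. Everything in it is an assumption made for illustration: `g_registered` stands in for "the fat binary is registered", and `EagerUser`/`DeferredUser` are hypothetical user types. No HIP calls are involved.

```cpp
#include <cassert>
#include <cstdlib>

namespace ordering_sketch {

// Stand-in for "the fat binary is currently registered with the runtime".
bool g_registered = false;

void module_dtor() { g_registered = false; }

// Stand-in for the compiler-generated module constructor.
void module_ctor() {
  g_registered = true;
  std::atexit(module_dtor);
}

// Fragile pattern: touches device state from a constructor that may run as
// a global constructor, possibly before the module constructor.
struct EagerUser {
  bool saw_registered;
  EagerUser() : saw_registered(g_registered) {}
};

// Pattern relying only on the stated guarantee: device work is deferred to
// an explicit init() call made from main (or from code that main calls).
struct DeferredUser {
  bool ready = false;
  void init() {
    assert(g_registered && "call init() from main, after startup");
    ready = true;
  }
};

} // namespace ordering_sketch
```

If an `EagerUser` happens to be constructed before `module_ctor` runs, it observes an unregistered state; a `DeferredUser` initialized from `main` does not depend on that ordering.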
