Commit f007972

Improve HIP docs on fat binary registration ordering
Clarify how Clang-generated HIP fat binaries are registered and unregistered with the HIP runtime, and how this interacts with global constructors, destructors, and atexit handlers. Document that there is no strong guarantee on ordering relative to user-defined global ctors/dtors, recommend that HIP application developers avoid using kernels or device variables from global ctors/dtors, and describe the implications for HIP runtime developers (synchronization and guards in __hipRegisterFatBinary/__hipUnregisterFatBinary). This is motivated by questions from HIP application and runtime developers about fat binary registration/unregistration order and its potential interference with their own initialization and teardown code.
1 parent e5b9e80 commit f007972

1 file changed: +82 -0 lines changed

clang/docs/HIPSupport.rst

Lines changed: 82 additions & 0 deletions
@@ -210,6 +210,88 @@ Host Code Compilation
- These relocatable objects are then linked together.
- Host code within a TU can call host functions and launch kernels from another TU.

HIP Fat Binary Registration and Unregistration
==============================================

When compiling HIP for AMD GPUs, Clang embeds device code into HIP "fat
binaries" and generates host-side helper functions that register these
fat binaries with the HIP runtime at program start and unregister them at
program exit. In non-RDC mode (``-fno-gpu-rdc``), each compilation unit
typically produces its own self-contained fat binary per GPU architecture. In
RDC mode (``-fgpu-rdc``), device bitcode from multiple compilation units may be
linked together into a single fat binary per GPU architecture.

At the LLVM IR level, Clang/LLVM typically create an internal module
constructor (for example ``__hip_module_ctor`` or a ``.hip.fatbin_reg``
function) and add it to ``@llvm.global_ctors``. This constructor is called by
the C runtime before ``main`` and it:

* calls ``__hipRegisterFatBinary`` with a pointer to an internal wrapper
  object that describes the HIP fat binary;
* stores the returned handle in an internal global variable;
* calls an internal helper such as ``__hip_register_globals`` to register
  kernels, device variables and other metadata associated with the fat binary;
* registers a corresponding module destructor with ``atexit`` so it will run
  during program termination.

The module destructor (for example ``__hip_module_dtor`` or a
``.hip.fatbin_unreg`` function) loads the stored handle, checks that it is
non-null, calls ``__hipUnregisterFatBinary`` to unregister the fat binary from
the HIP runtime, and then clears the handle. This ensures that the HIP runtime
sees each fat binary registered exactly once and that it is unregistered once
at exit, even when multiple translation units contribute HIP kernels to the
same host program.
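
The constructor/destructor pairing described above can be sketched in plain
C++. This is only an illustration of the control flow, not the code Clang
emits: ``mock_register_fat_binary``, ``mock_unregister_fat_binary``,
``FatBinaryWrapper``, ``module_ctor`` and ``module_dtor`` are hypothetical
stand-ins, and the ``void **`` handle type is an assumption made for the
sketch.

```cpp
#include <cstdlib>

// Hypothetical stand-in for the compiler-generated wrapper object.
struct FatBinaryWrapper {
  const void *image;  // would point at the embedded device code
};

static int g_register_calls = 0;
static int g_unregister_calls = 0;

// Mock of the runtime's registration entry point (assumed signature).
void **mock_register_fat_binary(const FatBinaryWrapper *wrapper) {
  static void *slot = nullptr;  // opaque per-binary runtime state
  slot = const_cast<FatBinaryWrapper *>(wrapper);
  ++g_register_calls;
  return &slot;
}

// Mock of the runtime's unregistration entry point (assumed signature).
void mock_unregister_fat_binary(void **handle) {
  *handle = nullptr;
  ++g_unregister_calls;
}

static const char kDeviceImage[] = "device code placeholder";
static const FatBinaryWrapper kWrapper = {kDeviceImage};
static void **g_gpu_binary_handle = nullptr;  // the "internal global variable"

// Module destructor: load the handle, check it, unregister, clear it.
void module_dtor() {
  if (g_gpu_binary_handle != nullptr) {
    mock_unregister_fat_binary(g_gpu_binary_handle);
    g_gpu_binary_handle = nullptr;
  }
}

// Module constructor: register, store the handle, schedule the destructor.
void module_ctor() {
  g_gpu_binary_handle = mock_register_fat_binary(&kWrapper);
  // A real module constructor would register kernels and device variables here.
  std::atexit(module_dtor);
}

bool is_registered() { return g_gpu_binary_handle != nullptr; }
int register_calls() { return g_register_calls; }
int unregister_calls() { return g_unregister_calls; }
```

Because the destructor clears the handle after unregistering, a second call
(for example the one scheduled through ``atexit``) finds a null handle and
does nothing.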

These registration/unregistration helpers are implementation details of Clang's
HIP code generation; user code should not call ``__hipRegisterFatBinary`` or
``__hipUnregisterFatBinary`` directly.

Implications for HIP Application Developers
-------------------------------------------

The fat binary registration and unregistration helpers participate in the same
global constructor and termination mechanisms as the rest of the program, and
there is no strong guarantee about their order relative to user-defined
global constructors and destructors. In particular:

* Applications should not invoke ``__hipRegisterFatBinary`` or
  ``__hipUnregisterFatBinary`` explicitly.
* Because registration happens in a compiler-generated module constructor and
  unregistration happens via an ``atexit``-registered module destructor, the
  exact ordering relative to other global ctors/dtors and ``atexit`` handlers
  is implementation-dependent and may vary across platforms and toolchain
  options.
* To avoid subtle ordering issues, applications should not rely on HIP kernels
  or device variables being usable from user-defined global constructors or
  destructors. HIP initialization and teardown that touch kernels or device
  state should instead be performed in ``main`` (or in functions called from
  ``main``) after process startup.
* In RDC mode, multiple translation units may contribute device code to a
  single fat binary; user code should not make assumptions based on a
  particular registration order between translation units.
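
One way to follow the recommendation above is to keep device-related state
out of namespace-scope objects and to create and destroy it explicitly from
``main``. The sketch below uses plain C++ so it is self-contained;
``DeviceState``, ``init_device_state`` and ``shutdown_device_state`` are
hypothetical names, and the comments mark where HIP calls would go in a real
application.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical application state that would own device allocations.
class DeviceState {
public:
  // In a real HIP program, device allocation and kernel launches would go here.
  explicit DeviceState(std::size_t n) : staging_(n, 0.0f) { ++live_instances; }
  // In a real HIP program, device memory would be released here.
  ~DeviceState() { --live_instances; }

  std::size_t size() const { return staging_.size(); }

  static int live_instances;

private:
  std::vector<float> staging_;  // host-side placeholder for device data
};

int DeviceState::live_instances = 0;

// Avoid:  static DeviceState g_state(1024);
// A namespace-scope object is built by a global constructor and destroyed by
// a global destructor, whose order relative to fat binary registration and
// unregistration is not guaranteed.

// Prefer: a plain pointer with no constructor or destructor of its own, so
// nothing touches device state outside the lifetime of main.
static DeviceState *g_state = nullptr;

// Call from main (or from a function called by main).
void init_device_state(std::size_t n) {
  if (g_state == nullptr) {
    g_state = new DeviceState(n);
  }
}

// Call before main returns, while the HIP runtime is still fully usable.
void shutdown_device_state() {
  delete g_state;
  g_state = nullptr;
}

bool device_state_ready() { return g_state != nullptr; }
```

A typical ``main`` would call ``init_device_state(...)`` first and
``shutdown_device_state()`` just before returning, so all device work is
bracketed by the lifetime of ``main``.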

Implications for HIP Runtime Developers
---------------------------------------

HIP runtime implementations that are linked with Clang-generated host code
must handle registration and unregistration in the presence of uncertain
global ctor/dtor ordering:

* ``__hipRegisterFatBinary`` must accept a pointer to the compiler-generated
  wrapper object and return an opaque handle that remains valid for as long as
  the fat binary may be used.
* ``__hipUnregisterFatBinary`` must accept the handle previously returned by
  ``__hipRegisterFatBinary`` and perform any necessary cleanup. It may be
  called late in process teardown, after other parts of the runtime have
  started shutting down, so it should be robust in the presence of partially
  torn-down state.
* Runtimes should use appropriate synchronization and guards so that fat
  binary registration does not observe uninitialized resources and
  unregistration does not release resources that are still required by other
  runtime components. In particular, registration and unregistration routines
  should be written to be safe under repeated calls and in the presence of
  concurrent or overlapping initialization/teardown logic.
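
As a rough illustration of the guards described above, the following sketch
shows one possible shape for runtime-side bookkeeping. It is not taken from
any HIP runtime: ``FatBinaryRegistry`` and ``fat_binary_registry`` are
hypothetical, the integer handle is a simplification of the real opaque
handle, and returning the existing handle on repeated registration is an
assumed policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical runtime-side bookkeeping for registered fat binaries.
class FatBinaryRegistry {
public:
  using Handle = std::uint64_t;  // opaque token; 0 is never handed out

  // Registers a wrapper; a repeated call with the same wrapper returns the
  // handle issued the first time (assumed policy for this sketch).
  Handle add(const void *wrapper) {
    std::lock_guard<std::mutex> lock(mu_);
    for (const auto &entry : live_) {
      if (entry.second == wrapper) {
        return entry.first;
      }
    }
    const Handle handle = next_++;
    live_.emplace(handle, wrapper);
    return handle;
  }

  // Unregisters a handle; unknown or already-removed handles are ignored.
  // Returns true only if an entry was actually removed.
  bool remove(Handle handle) {
    std::lock_guard<std::mutex> lock(mu_);
    return live_.erase(handle) == 1;
  }

  std::size_t live_count() const {
    std::lock_guard<std::mutex> lock(mu_);
    return live_.size();
  }

private:
  mutable std::mutex mu_;
  std::unordered_map<Handle, const void *> live_;
  Handle next_ = 1;
};

// Created on first use and intentionally never destroyed, so it is usable
// from early global constructors and still alive during late atexit handlers.
FatBinaryRegistry &fat_binary_registry() {
  static FatBinaryRegistry *instance = new FatBinaryRegistry();
  return *instance;
}
```

Creating the registry on first use avoids depending on the runtime's own
global constructors having run, and leaking it avoids depending on the order
of destructors at exit; the mutex covers concurrent registration and
unregistration.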

Syntax Difference with CUDA
===========================
