@@ -210,6 +210,88 @@ Host Code Compilation
210210- These relocatable objects are then linked together.
211211- Host code within a TU can call host functions and launch kernels from another TU.
212212
213+ HIP Fat Binary Registration and Unregistration
214+ ==============================================
215+
216+ When compiling HIP for AMD GPUs, Clang embeds device code into HIP "fat
217+ binaries" and generates host-side helper functions that register these
218+ fat binaries with the HIP runtime at program start and unregister them at
219+ program exit. In non-RDC mode (``-fno-gpu-rdc``), each translation unit
220+ typically produces its own self-contained fat binary, which bundles a code
221+ object for each target GPU architecture. In RDC mode (``-fgpu-rdc``), device
222+ bitcode from multiple translation units may be linked together per GPU
222+ architecture and bundled into a single fat binary.
223+
224+ At the LLVM IR level, Clang/LLVM typically create an internal module
225+ constructor (for example ``__hip_module_ctor`` or a ``.hip.fatbin_reg``
226+ function) and add it to ``@llvm.global_ctors``. This constructor is called by
227+ the C runtime before ``main`` and it:
228+
229+ * calls ``__hipRegisterFatBinary`` with a pointer to an internal wrapper
230+ object that describes the HIP fat binary;
231+ * stores the returned handle in an internal global variable;
232+ * calls an internal helper such as ``__hip_register_globals`` to register
233+ kernels, device variables and other metadata associated with the fat binary;
234+ * registers a corresponding module destructor with ``atexit`` so it will run
235+ during program termination.
236+
237+ The module destructor (for example ``__hip_module_dtor`` or a
238+ ``.hip.fatbin_unreg`` function) loads the stored handle, checks that it is
239+ non-null, calls ``__hipUnregisterFatBinary`` to unregister the fat binary from
240+ the HIP runtime, and then clears the handle. Because the constructor runs once
241+ and the destructor clears the handle after use, each fat binary is registered
242+ exactly once and unregistered at most once at exit, even when multiple
243+ translation units contribute HIP kernels to the same host program.
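
The constructor and destructor described above behave roughly like the host-side C++ below. This is only an illustrative sketch and not the code Clang emits: the runtime entry points are replaced by mock functions (``mock_hipRegisterFatBinary``, ``mock_hipUnregisterFatBinary``, ``mock_register_globals``), the wrapper layout ``MockFatbinWrapper`` is hypothetical, and the constructor is an ordinary function rather than an entry in ``@llvm.global_ctors``.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-ins for the HIP runtime entry points, so the sketch
// builds without a HIP installation. They only count how often they run.
static int g_register_calls = 0;
static int g_unregister_calls = 0;
static void *g_mock_module = nullptr; // storage the opaque handle points at

void **mock_hipRegisterFatBinary(const void *wrapper) {
  assert(wrapper != nullptr);
  ++g_register_calls;
  return &g_mock_module;
}

void mock_hipUnregisterFatBinary(void **handle) {
  assert(handle != nullptr);
  ++g_unregister_calls;
}

// Hypothetical wrapper object; the real layout is chosen by the compiler.
struct MockFatbinWrapper {
  const void *image;
};
static const char g_mock_image[] = "device code placeholder";
static const MockFatbinWrapper g_wrapper = {g_mock_image};

// Internal global that carries the handle from constructor to destructor.
static void **g_fatbin_handle = nullptr;

// Placeholder for the helper that registers kernels and device variables.
void mock_register_globals(void **handle) { (void)handle; }

// Module destructor: unregister only if a handle is stored, then clear it
// so that a second invocation becomes a no-op.
void module_dtor() {
  if (g_fatbin_handle != nullptr) {
    mock_hipUnregisterFatBinary(g_fatbin_handle);
    g_fatbin_handle = nullptr;
  }
}

// Module constructor: register the fat binary, keep the handle, register
// the associated globals, and schedule the destructor with atexit.
void module_ctor() {
  g_fatbin_handle = mock_hipRegisterFatBinary(&g_wrapper);
  mock_register_globals(g_fatbin_handle);
  std::atexit(module_dtor);
}
```

Calling ``module_ctor()`` once and ``module_dtor()`` twice in this sketch results in one registration and one unregistration, because the second destructor call finds a cleared handle.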
244+
245+ These registration/unregistration helpers are implementation details of Clang's
246+ HIP code generation; user code should not call ``__hipRegisterFatBinary`` or
247+ ``__hipUnregisterFatBinary`` directly.
248+
249+ Implications for HIP Application Developers
250+ -------------------------------------------
251+
252+ The fat binary registration and unregistration helpers participate in the same
253+ global constructor and termination mechanisms as the rest of the program, and
254+ there is no strong guarantee about their relative order with user-defined
255+ global constructors and destructors. In particular:
256+
257+ * Applications should not invoke ``__hipRegisterFatBinary`` or
258+ ``__hipUnregisterFatBinary`` explicitly.
259+ * Because registration happens in a compiler-generated module constructor and
260+ unregistration happens via an ``atexit``-registered module destructor, the
261+ exact ordering relative to other global ctors/dtors and ``atexit`` handlers
262+ is implementation-dependent and may vary across platforms and toolchain
263+ options.
264+ * To avoid subtle ordering issues, applications should not rely on HIP kernels
265+ or device variables being usable from user-defined global constructors or
266+ destructors. HIP initialization and teardown that touches kernels or device
267+ state should instead be performed in ``main`` (or in functions called from
268+ ``main``) after process startup.
269+ * In RDC mode, multiple translation units may contribute device code to a
270+ single fat binary; user code should not make assumptions based on a
271+ particular registration order between translation units.
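
One way to follow the advice above is to keep global objects trivially constructed and to perform all device setup and teardown explicitly from ``main``. The sketch below shows only the control flow. ``DeviceState``, ``init_device_state``, ``launch_kernel_placeholder``, ``shutdown_device_state`` and ``run_application`` are hypothetical names, and the comments mark where real HIP calls would go in an actual application.

```cpp
#include <cassert>

// Plain data with no constructor logic: nothing here touches the HIP
// runtime during static initialization or static destruction.
struct DeviceState {
  bool ready = false;
  int kernels_launched = 0;
};
static DeviceState g_state;

// Explicit setup, called from main after process startup. An actual
// application would allocate device memory and create streams here.
void init_device_state() { g_state.ready = true; }

// Stand-in for a kernel launch; it insists that setup has already run.
void launch_kernel_placeholder() {
  assert(g_state.ready);
  ++g_state.kernels_launched;
}

// Explicit teardown, called before main returns, so no device resources
// are released from global destructors or atexit handlers.
void shutdown_device_state() { g_state.ready = false; }

// Body of the program; main would simply return run_application().
int run_application() {
  init_device_state();
  launch_kernel_placeholder();
  shutdown_device_state();
  return 0;
}
```

With this structure, the order in which the compiler-generated module constructor runs relative to other global constructors no longer matters to the application, since no device work starts until ``main`` is executing.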
272+
273+ Implications for HIP Runtime Developers
274+ ---------------------------------------
275+
276+ HIP runtime implementations that are linked with Clang-generated host code
277+ must handle registration and unregistration in the presence of uncertain
278+ global ctor/dtor ordering:
279+
280+ * ``__hipRegisterFatBinary`` must accept a pointer to the compiler-generated
281+ wrapper object and return an opaque handle that remains valid for as long as
282+ the fat binary may be used.
283+ * ``__hipUnregisterFatBinary`` must accept the handle previously returned by
284+ ``__hipRegisterFatBinary`` and perform any necessary cleanup. It may be
285+ called late in process teardown, after other parts of the runtime have
286+ started shutting down, so it should be robust in the presence of partially
287+ torn-down state.
288+ * Runtimes should use appropriate synchronization and guards so that fat
289+ binary registration does not observe uninitialized resources and
290+ unregistration does not release resources that are still required by other
291+ runtime components. In particular, registration and unregistration routines
292+ should be written to be safe under repeated calls and in the presence of
293+ concurrent or overlapping initialization/teardown logic.
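
The guarding described above could take many forms. The following is a minimal sketch of one possibility and does not describe how any existing HIP runtime is implemented. The class ``FatBinaryRegistry`` and its methods are hypothetical. It serializes access with a mutex, returns the same handle when the same wrapper is registered again, and treats a repeated or unknown unregistration as a no-op. Records are kept until the registry itself is destroyed, so a stale handle can still be compared safely.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>

class FatBinaryRegistry {
public:
  // Returns an opaque handle for the wrapper. Registering the same wrapper
  // again yields the same handle instead of creating a duplicate record.
  void **registerBinary(const void *wrapper) {
    assert(wrapper != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    std::unique_ptr<Module> &entry = modules_[wrapper];
    if (!entry)
      entry = std::make_unique<Module>();
    entry->registered = true;
    return &entry->slot;
  }

  // Marks the module behind the handle as unregistered. Returns false when
  // the handle is unknown, null, or already unregistered.
  bool unregisterBinary(void **handle) {
    std::lock_guard<std::mutex> lock(mutex_);
    for (auto &kv : modules_) {
      Module &m = *kv.second;
      if (&m.slot == handle && m.registered) {
        m.registered = false;
        return true;
      }
    }
    return false;
  }

  // Number of modules that are currently registered.
  std::size_t liveCount() const {
    std::lock_guard<std::mutex> lock(mutex_);
    std::size_t n = 0;
    for (const auto &kv : modules_)
      if (kv.second->registered)
        ++n;
    return n;
  }

private:
  struct Module {
    void *slot = nullptr;    // storage the opaque handle points at
    bool registered = false; // cleared on unregistration
  };
  mutable std::mutex mutex_;
  std::map<const void *, std::unique_ptr<Module>> modules_;
};
```

In this sketch a second ``unregisterBinary`` call with the same handle returns ``false`` and changes nothing, which is the kind of tolerance a late or repeated call during process teardown would need.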
294+
213295Syntax Difference with CUDA
214296===========================
215297