From 792e5376f99d14da5030ad61e96f3fc0364b6870 Mon Sep 17 00:00:00 2001 From: Stanislav Mekhanoshin Date: Wed, 24 Sep 2025 00:52:37 -0700 Subject: [PATCH] [AMDGPU] Update gfx1250 documentation. NFC --- llvm/docs/AMDGPUUsage.rst | 2115 ++++++++++++++++++++++++++++++++++++- 1 file changed, 2110 insertions(+), 5 deletions(-) diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst index edabdc595a1f0..74b7604fda56d 100644 --- a/llvm/docs/AMDGPUUsage.rst +++ b/llvm/docs/AMDGPUUsage.rst @@ -979,11 +979,13 @@ supported for the ``amdgcn`` target. access is not supported except by flat and scratch instructions in GFX9-GFX11. - Code that manipulates the stack values in other lanes of a wavefront, - such as by ``addrspacecast``-ing stack pointers to generic ones and taking offsets - that reach other lanes or by explicitly constructing the scratch buffer descriptor, - triggers undefined behavior when it modifies the scratch values of other lanes. - The compiler may assume that such modifications do not occur. + On targets without "Globally Accessible Scratch" (introduced in GFX125x), code that + manipulates the stack values in other lanes of a wavefront, such as by + ``addrspacecast``-ing stack pointers to generic ones and taking offsets that reach other + lanes or by explicitly constructing the scratch buffer descriptor, triggers undefined + behavior when it modifies the scratch values of other lanes. The compiler may assume + that such modifications do not occur for such targets. + When using code object V5 ``LIBOMPTARGET_STACK_SIZE`` may be used to provide the private segment size in bytes, for cases where a dynamic stack is used. @@ -1515,6 +1517,88 @@ The AMDGPU backend implements the following LLVM IR intrinsics. List AMDGPU intrinsics. +'``llvm.amdgcn.cooperative.atomic``' Intrinsics +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``llvm.amdgcn.cooperative.atomic`` :ref:`family of intrinsics` +provide atomic load and store operations to a naturally-aligned contiguous memory regions. +Memory is accessed cooperatively by a collection of convergent threads, with each thread accessing +a fraction of the contiguous memory region. + + .. TODO:: + + The memory model described here is imprecise; see SWDEV-536264. + +This intrinsic has a memory ordering and may be used to synchronize-with another cooperative atomic. +If the memory ordering is relaxed, it may pair with a fence if that same fence is executed by +all participating threads with the same synchronization scope and set of address spaces. + +In both cases, a synchronize-with relation can only be established between cooperative atomics with the +same total access size. + +Each target may have additional restrictions on how the intrinsic may be used; see +:ref:`the table below`. +Targets not covered in the table do not support these intrinsics. + + .. table:: AMDGPU Cooperative Atomic Intrinsics Availability + :name: amdgpu-llvm-ir-cooperative-atomic-intrinsics-availability + + =============== ============================================================= + GFX Version Target Restrictions + =============== ============================================================= + GFX 12.5 :ref:`amdgpu-amdhsa-memory-model-gfx125x-cooperative-atomics` + =============== ============================================================= + +If the intrinsic is used without meeting all of the above conditions, or the target-specific conditions, +then this intrinsic causes undefined behavior. + + .. 
table:: AMDGPU Cooperative Atomic Intrinsics + :name: amdgpu-cooperative-atomic-intrinsics-table + + ======================================================= =========== ============ ========== + LLVM Intrinsic Number of Access Size Total Size + Threads Per Thread + Used + ======================================================= =========== ============ ========== + ``llvm.amdgcn.cooperative.atomic.store.32x4B`` 32 4B 128B + + ``llvm.amdgcn.cooperative.atomic.load.32x4B`` 32 4B 128B + + ``llvm.amdgcn.cooperative.atomic.store.16x8B`` 16 8B 128B + + ``llvm.amdgcn.cooperative.atomic.load.16x8B`` 16 8B 128B + + ``llvm.amdgcn.cooperative.atomic.store.8x16B`` 8 16B 128B + + ``llvm.amdgcn.cooperative.atomic.load.8x16B`` 8 16B 128B + + ======================================================= =========== ============ ========== + +The intrinsics are available for the global (``.p1`` suffix) and generic (``.p0`` suffix) address spaces. + +The atomic ordering operand (3rd operand for ``.store``, 2nd for ``.load``) is an integer that follows the +C ABI encoding of atomic memory orderings. The supported values are in +:ref:`the table below`. + + .. table:: AMDGPU Cooperative Atomic Intrinsics Atomic Memory Orderings + :name: amdgpu-cooperative-atomic-intrinsics-atomic-memory-orderings-table + + ====== ================ ================================= + Value Atomic Memory Notes + Ordering + ====== ================ ================================= + ``0`` ``relaxed`` The default for unsupported values. + + ``2`` ``acquire`` Only for ``.load`` + + ``3`` ``release`` Only for ``.store`` + + ``5`` ``seq_cst`` + ====== ================ ================================= + +The last argument of the intrinsic is the synchronization scope +as a metadata string, which must be one of the supported :ref:`memory scopes`. + .. _amdgpu_metadata: LLVM IR Metadata @@ -1843,6 +1927,7 @@ The AMDGPU backend supports the following LLVM IR attributes. This is only relevant on targets with cluster support. + ================================================ ========================================================== Calling Conventions @@ -5261,6 +5346,9 @@ The fields used by CP for code objects before V3 also match those specified in GFX10-GFX12 (wavefront size 32) - max_vgpr 1..256 - max(0, ceil(vgprs_used / 8) - 1) + GFX125X (wavefront size 32) + - max_vgpr 1..1024 + - max(0, ceil(vgprs_used / 16) - 1) Where vgprs_used is defined as the highest VGPR number @@ -6491,6 +6579,7 @@ following sections: * :ref:`amdgpu-amdhsa-memory-model-gfx942` * :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11` * :ref:`amdgpu-amdhsa-memory-model-gfx12` +* :ref:`amdgpu-amdhsa-memory-model-gfx125x` .. _amdgpu-fence-as: @@ -16617,6 +16706,2022 @@ the instruction in the code sequence that references the table. - system for OpenCL.* ============ ============ ============== ========== ================================ +.. _amdgpu-amdhsa-memory-model-gfx125x: + +Memory Model GFX125x +++++++++++++++++++++++++ + +For GFX125x: + +**Device Structure:** + +* Each agent has multiple shader engines (SE). +* Each SE has multiple shader arrays (SA). +* Each SA has multiple work-group processors (WGP). +* Each WGP has 4 SIMD32 (2 SIMD32-pairs) that execute wavefronts. +* The wavefronts for a single work-group are executed in the same + WGP. + +**Device Memory:** + +* Each WGP has a single write-through WGP cache (WGP$) shared by the wavefronts of the + work-groups executing on it. The WGP$ is divided between LDS and vector L0 memory. 
+ + * Vector L0 memory holds clean data only. + +* Each WGP$ has two request queues; one per SIMD32-pair. + Each queue can handle both LDS and vector L0 requests. Requests in one queue + are executed serially and in-order, but are not kept in order with the other queue. +* The scalar memory operations access a scalar L0 cache shared by all wavefronts + on a WGP. The scalar and vector L0 caches are not kept coherent by hardware. However, scalar + operations are used in a restricted way so do not impact the memory model. See + :ref:`amdgpu-amdhsa-memory-spaces`. +* The vector and scalar memory L0 caches are both clients of an L1 buffer shared by + all WGPs on the same SE. +* L1 buffers have separate request queues for each WGP$ it serves. Requests in one queue + are executed serially and in-order, but are not kept in order with other queues. +* L1 buffers are clients of the L2 cache. +* There may be multiple L2 caches per agent. Ranges of virtual addresses can be set up as follows: + + * Be non-hardware-coherent; copies of the data are not coherent between multiple L2s. + * Be read-write hardware-coherent with other L2 caches on the same or other agents. + * Bypass L2 entirely to ensure system coherence. + +* L2 caches have multiple memory channels to service disjoint ranges of virtual + addresses. + +**Memory Model:** + +.. note:: + + This section is currently incomplete as work on the compiler is still ongoing. + The following is a non-exhaustive list of unimplemented/undocumented features: + non-volatile bit code sequences, monitor and wait, globally accessing scratch atomics, + multicast loads, barriers (including split barriers) and cooperative atomics. + Scalar operations memory model needs more elaboration as well. + +* Vector memory operations are performed as wavefront wide operations, with the + ``EXEC`` mask predicating which lanes execute. +* Consecutive vector memory operations from the same wavefront are issued in program order. + Vector memory operations are issued (and executed) in no particular order between wavefronts. +* Wave execution of a vector memory operation instruction issues (initiates) the operation, + but completion occurs an unspecified amount of time later. + The ``s_wait_*cnt`` instructions must be used to determine if the operation has completed. +* The types of vector memory operations (and their associated ``s_wait_*cnt`` instructions) are: + + * Load (global, scratch, flat, buffer): ``s_wait_loadcnt`` + * Store (global, scratch, flat, buffer): ``s_wait_storecnt`` + * non-ASYNC LDS: ``s_wait_dscnt`` + * ASYNC LDS: ``s_wait_asynccnt`` + * Tensor: ``s_wait_tensorcnt`` + +* ``s_wait_xcnt`` is a counter that is incremented when a memory operation is issued, and + decremented when memory address translation for that instruction is completed. + Waiting on a memory counter ``s_wait_*cnt N`` also waits on ``s_wait_xcnt N``. + + * ``s_wait_xcnt 0x0`` is required before flat and global atomic stores/read-modify-write + operations to guarantee atomicity during a xnack replay. + +* Within a wavefront, vector memory operation completion (``s_wait_*cnt`` decrement) is + reported in order of issue within a type, but in no particular order between types. +* Within a wavefront, the order in which data is returned to registers by a vector memory + operation can be different from the order in which the vector memory operations were issued. 
+
+  * Thus, a ``s_wait_*cnt`` instruction must be used to prevent multiple vector memory operations
+    that return results to the same register from executing concurrently, as they may not return
+    their results in instruction issue order, even though they will be reported as completed in
+    instruction issue order by the decrementing of the counter.
+
+* Within a wavefront, consecutive loads and stores to the same address will be processed in program
+  order by the memory subsystem. Loads and stores to different addresses may be processed out of
+  order with respect to each other.
+* All non-ASYNC LDS vector memory operations of a WGP are performed as wavefront wide
+  operations in a global order and involve no caching. Completion is reported to a wavefront in
+  execution order.
+* ASYNC LDS and tensor vector memory operations are not covered by the memory model implemented
+  by the AMDGPU backend. Neither ``s_wait_asynccnt`` nor ``s_wait_tensorcnt`` are inserted
+  automatically. They must be emitted using compiler built-in calls.
+* Some vector memory operations contain a ``SCOPE`` field with values
+  corresponding to each cache level. The ``SCOPE`` determines whether a cache
+  can complete an operation locally or whether it needs to forward the operation
+  to the next cache level. The ``SCOPE`` values are:
+
+  * ``SCOPE_CU``: WGP
+  * ``SCOPE_SE``: Shader Engine
+  * ``SCOPE_DEV``: Device/Agent
+  * ``SCOPE_SYS``: System
+
+* Each cache is assigned a ``SCOPE`` by the hardware depending on the agent's
+  configuration.
+
+  * This ensures that ``SCOPE_DEV`` can always be used to implement agent coherence,
+    even in the presence of multiple non-coherent L2 caches on the same agent.
+
+* When a vector memory operation with a given ``SCOPE`` reaches a cache with a smaller
+  ``SCOPE`` value, it is forwarded to the next level of cache.
+* When a vector memory operation with a given ``SCOPE`` reaches a cache with a ``SCOPE``
+  value greater than or equal to its own, the operation can proceed:
+
+  * Reads can hit into the cache.
+  * Writes can happen in this cache and completion (``s_wait`` decrement) can be
+    reported.
+  * RMW operations can be done locally.
+
+* Some memory operations contain an ``nv`` bit, for "non-volatile", which indicates
+  memory that is not expected to change during a kernel's execution.
+  This information is propagated to the cache lines for that address
+  (referred to as ``$nv``).
+
+  * When ``nv=0`` reads hit dirty ``$nv=1`` data in cache, the hardware will
+    write back the data to the next level in the hierarchy and then subsequently read
+    it again, updating the cache line with a clean ``$nv=0`` copy of the data.
+
+* ``global_inv``, ``global_wb`` and ``global_wbinv`` are cache control instructions.
+  The affected cache(s) are controlled by the ``SCOPE`` of the instruction.
+  Only caches whose scope is strictly smaller than the instruction's are affected.
+
+  * ``global_inv`` invalidates the data in affected caches so that subsequent reads
+    will re-read from the next level in the cache hierarchy.
+    The invalidation requests cannot be reordered with pending or upcoming
+    memory operations. Instruction completion is reported using ``s_wait_loadcnt``.
+  * ``global_wb`` flushes the dirty data in affected caches to the next level in
+    the cache hierarchy. This instruction additionally ensures that previous memory
+    operations done at a lower scope level have reached the desired ``SCOPE``.
+    Instruction completion is reported using ``s_wait_storecnt`` once
+    all data has been acknowledged by the next level in the cache hierarchy.
+  * ``global_wbinv`` performs a ``global_wb`` followed by a ``global_inv``.
+    Instruction completion is reported using ``s_wait_storecnt``.
+  * ``global_inv``, ``global_wb`` and ``global_wbinv`` with ``nv=0`` can only
+    affect ``$nv=0`` cache lines, whereas ``nv=1`` can affect all cache lines.
+  * ``global_inv``, ``global_wb`` and ``global_wbinv`` behave like memory operations
+    issued to every address at the same time. They are kept in order with other
+    memory operations from the same wave.
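+
+As an illustration of how the scopes, wait counters and cache control instructions compose,
+the following sketch shows machine code that could implement an agent-scope release store
+paired with an agent-scope acquire load, following the code sequences listed in
+:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-table` below. The register operands
+and the ``global_store_b32``/``global_load_b32`` mnemonics are illustrative only.
+
+.. code-block:: text
+
+  ; wave A: store atomic release, syncscope("agent"), global address space
+  global_wb scope:SCOPE_DEV             ; flush dirty data below device scope
+  s_wait_storecnt 0x0                   ; prior stores and the writeback have completed
+  s_wait_loadcnt 0x0                    ; prior loads have completed
+  s_wait_dscnt 0x0                      ; prior LDS operations have completed
+  s_wait_xcnt 0x0                       ; keep the store atomic across a xnack replay
+  global_store_b32 v[0:1], v2, off scope:SCOPE_DEV
+
+  ; wave B: load atomic acquire, syncscope("agent"), global address space
+  global_load_b32 v3, v[0:1], off scope:SCOPE_DEV
+  s_wait_loadcnt 0x0                    ; the load has completed
+  global_inv scope:SCOPE_DEV            ; discard potentially stale data in lower-scope caches
+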
+Scalar memory operations are only used to access memory that is proven to not
+change during the execution of the kernel dispatch. This includes constant
+address space and global address space for program scope ``const`` variables.
+Therefore, the kernel machine code does not have to maintain the scalar cache to
+ensure it is coherent with the vector caches. The scalar and vector caches are
+invalidated between kernel dispatches by CP since constant address space data
+may change between kernel dispatch executions. See
+:ref:`amdgpu-amdhsa-memory-spaces`.
+
+Atomics in the scratch address space are handled as follows:
+
+* Data types <= 32 bits: The instruction is converted into an atomic in the
+  generic (``flat``) address space. All properties of the atomic
+  (atomic ordering, volatility, alignment, etc.) are preserved.
+  Refer to the generic address space code sequences for further information.
+* Data types > 32 bits: unsupported, and an error is emitted.
+
+The code sequences used to implement the memory model for GFX125x are defined in
+table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-table`.
+
+The mapping of LLVM IR syncscope to GFX125x instruction ``scope`` operands is
+defined in :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+The table applies if and only if it is directly referenced by an entry in
+:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-table`, and it only applies to
+the instruction in the code sequence that references the table.
+
+  .. table:: AMDHSA Memory Model Code Sequences GFX125x - Instruction Scopes
+     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table
+
+     ================================= =======================
+     LLVM syncscope                    ISA
+
+
+     ================================= =======================
+     *none*, one-as                    ``scope:SCOPE_SYS``
+     system, system-one-as             ``scope:SCOPE_SYS``
+     agent, agent-one-as               ``scope:SCOPE_DEV``
+     cluster, cluster-one-as           ``scope:SCOPE_SE``
+     workgroup, workgroup-one-as       ``scope:SCOPE_CU`` [1]_
+     wavefront, wavefront-one-as       ``scope:SCOPE_CU`` [1]_
+     singlethread, singlethread-one-as ``scope:SCOPE_CU`` [1]_
+     ================================= =======================
+
+  .. [1] ``SCOPE_CU`` is the default ``scope:`` emitted by the compiler.
+         It will be omitted when instructions are emitted in textual form by the compiler.
+
+  .. 
table:: AMDHSA Memory Model Code Sequences GFX125x + :name: amdgpu-amdhsa-memory-model-code-sequences-gfx125x-table + + ============ ============ ============== ========== ================================ + LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code + Ordering Sync Scope Address GFX125x + Space + ============ ============ ============== ========== ================================ + **Non-Atomic** + ------------------------------------------------------------------------------------ + load *none* *none* - global - !volatile & !nontemporal + - generic + - private 1. buffer/global/flat_load + - constant + - !volatile & nontemporal + + 1. buffer/global/flat_load + ``th:TH_LOAD_NT`` + + - volatile + + 1. buffer/global/flat_load + ``scope:SCOPE_SYS`` + + 2. ``s_wait_loadcnt 0x0`` + + - Must happen before + any following volatile + global/generic + load/store. + - Ensures that + volatile + operations to + different + addresses will not + be reordered by + hardware. + + load *none* *none* - local 1. ds_load + store *none* *none* - global - !volatile & !nontemporal + - generic + - private 1. buffer/global/flat_store + - constant + - !volatile & nontemporal + + 1. buffer/global/flat_store + ``th:TH_STORE_NT`` + + - volatile + + 1. buffer/global/flat_store + ``scope:SCOPE_SYS`` + + 2. ``s_wait_storecnt 0x0`` + + - Must happen before + any following volatile + global/generic + load/store. + - Ensures that + volatile + operations to + different + addresses will not + be reordered by + hardware. + + store *none* *none* - local 1. ds_store + **Unordered Atomic** + ------------------------------------------------------------------------------------ + load atomic unordered *any* *any* *Same as non-atomic*. + store atomic unordered *any* *any* *Same as non-atomic*. + atomicrmw unordered *any* *any* *Same as monotonic atomic*. + **Monotonic Atomic** + ------------------------------------------------------------------------------------ + load atomic monotonic - singlethread - global 1. buffer/global/flat_load + - wavefront - generic + - workgroup - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - cluster + - agent + - system + load atomic monotonic - singlethread - local 1. ds_load + - wavefront + - workgroup + store atomic monotonic - singlethread - global 1. ``s_wait_xcnt 0x0`` + - wavefront - generic + - workgroup - Ensure operation remains atomic even during a xnack replay. + - cluster - Only needed for ``flat`` and ``global`` operations. + - agent + - system 2. buffer/global/flat_store + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + store atomic monotonic - singlethread - local 1. ds_store + - wavefront + - workgroup + atomicrmw monotonic - singlethread - global 1. ``s_wait_xcnt 0x0`` + - wavefront - generic + - workgroup - Ensure operation remains atomic even during a xnack replay. + - cluster - Only needed for ``flat`` and ``global`` operations. + - agent + - system 2. buffer/global/flat_atomic + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + atomicrmw monotonic - singlethread - local 1. ds_atomic + - wavefront + - workgroup + **Acquire Atomic** + ------------------------------------------------------------------------------------ + load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load + - wavefront - local + - generic + load atomic acquire - workgroup - global 1. 
buffer/global_load + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + 2. ``s_wait_loadcnt 0x0`` + + - Must happen before any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + + + load atomic acquire - workgroup - local 1. ds_load + 2. ``s_wait_dscnt 0x0`` + + - If OpenCL, omit. + - Must happen before any following + global/generic load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the local load + atomic value being + acquired. + + + load atomic acquire - workgroup - generic 1. flat_load + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + 2. | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0`` + - Must happen before any + following global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than a local load + atomic value being + acquired. + + load atomic acquire - cluster - global 1. buffer/global_load + - agent + - system - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + 2. ``s_wait_loadcnt 0x0`` + + - Must happen before + following + ``global_inv``. + - Ensures the load + has completed + before invalidating + the caches. + + 3. ``global_inv`` + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following + loads will not see + stale global data. + + load atomic acquire - cluster - generic 1. flat_load + - agent + - system - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + 2. | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0`` + - Must happen before + following + ``global_inv``. + - Ensures the flat_load + has completed + before invalidating + the caches. + + 3. ``global_inv`` + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acquire - singlethread - global 1. ``s_wait_xcnt 0x0`` + - wavefront - local + - generic - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``flat`` and ``global`` operations. + + 2. buffer/global/ds/flat_atomic + + atomicrmw acquire - workgroup - global 1. ``s_wait_xcnt 0x0`` + + - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``flat`` and ``global`` operations. + + 2. buffer/global_atomic + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - If atomic with return, + use ``th:TH_ATOMIC_RETURN`` + + 3. | **Atomic with return:** + | ``s_wait_loadcnt 0x0`` + | **Atomic without return:** + | ``s_wait_storecnt 0x0`` + + - Must happen before any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + + atomicrmw acquire - workgroup - local 1. ds_atomic + 2. ``s_wait_dscnt 0x0`` + + - If OpenCL, omit. + - Ensures any + following global + data read is no + older than the local + atomicrmw value + being acquired. + + + atomicrmw acquire - workgroup - generic 1. ``s_wait_xcnt 0x0`` + + - Ensure operation remains atomic even during a xnack replay. + + 2. 
flat_atomic + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - If atomic with return, + use ``th:TH_ATOMIC_RETURN`` + + 3. | **Atomic with return:** + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + | **Atomic without return:** + | ``s_wait_storecnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0`` + - Ensures any + following global + data read is no + older than the local + atomicrmw value + being acquired. + + atomicrmw acquire - cluster - global 1. ``s_wait_xcnt 0x0`` + - agent + - system - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``global`` operations. + + 2. buffer/global_atomic + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - If atomic with return, + use ``th:TH_ATOMIC_RETURN`` + + 3. | **Atomic with return:** + | ``s_wait_loadcnt 0x0`` + | **Atomic without return:** + | ``s_wait_storecnt 0x0`` + + - Must happen before + following ``global_inv``. + - Ensures the + atomicrmw has + completed before + invalidating the + caches. + + 4. ``global_inv`` + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acquire - cluster - generic 1. ``s_wait_xcnt 0x0`` + - agent + - system - Ensure operation remains atomic even during a xnack replay. + + 2. flat_atomic + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - If atomic with return, + use ``th:TH_ATOMIC_RETURN`` + + 3. | **Atomic with return:** + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + | **Atomic without return:** + | ``s_wait_storecnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit dscnt + - Must happen before + following + global_inv + - Ensures the + atomicrmw has + completed before + invalidating the + caches. + + 4. ``global_inv`` + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + fence acquire - singlethread *none* *none* + - wavefront + fence acquire - workgroup *none* 1. | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0`` + - If OpenCL and address space is local, + omit all. + - See :ref:`amdgpu-fence-as` for + more details on fencing specific + address spaces. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - ``s_wait_storecnt 0x0`` + must happen after + any preceding + global/generic + atomicrmw-no-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Ensures that the + fence-paired atomic + has completed + before invalidating + the + cache. 
Therefore + any following + locations read must + be no older than + the value read by + the + fence-paired-atomic. + + + fence acquire - cluster *none* 1. | ``s_wait_storecnt 0x0`` + - agent | ``s_wait_loadcnt 0x0`` + - system | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0``. + - If OpenCL and address space is + local, omit all. + - See :ref:`amdgpu-fence-as` for + more details on fencing specific + address spaces. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - ``s_wait_storecnt 0x0`` + must happen after + any preceding + global/generic + atomicrmw-no-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + the following + ``global_inv`` + - Ensures that the + fence-paired atomic + has completed + before invalidating the + caches. Therefore + any following + locations read must + be no older than + the value read by + the + fence-paired-atomic. + + 2. ``global_inv`` + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - Ensures that + following + loads will not see + stale data. + + **Release Atomic** + ------------------------------------------------------------------------------------ + store atomic release - singlethread - global 1. ``s_wait_xcnt 0x0`` + - wavefront - local + - generic - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``flat`` and ``global`` operations. + + 2. buffer/global/ds/flat_store + + store atomic release - workgroup - global 1. | ``s_wait_storecnt 0x0`` + - cluster - generic | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0``. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before the + following store. + - Ensures that all + memory operations + have + completed before + performing the + store that is being + released. + + 2. ``s_wait_xcnt 0x0`` + + - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``flat`` and ``global`` operations. + + 3. buffer/global/flat_store + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + store atomic release - workgroup - local 1. | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. 
+ - ``s_wait_storecnt 0x0`` + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - Must happen before the + following store. + - Ensures that all + global memory + operations have + completed before + performing the + store that is being + released. + + 2. ds_store + store atomic release - agent - global 1. ``global_wb`` + - system - generic + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + 2. | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0``. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + ``global_wb`` or + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before the + following store. + - Ensures that all + memory operations + have + completed before + performing the + store that is being + released. + + 3. ``s_wait_xcnt 0x0`` + + - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``flat`` and ``global`` operations. + + 4. buffer/global/flat_store + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + atomicrmw release - singlethread - global 1. ``s_wait_xcnt 0x0`` + - wavefront - local + - generic - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``flat`` and ``global`` operations. + + 2. buffer/global/ds/flat_atomic + atomicrmw release - workgroup - global 1. | ``s_wait_storecnt 0x0`` + - cluster - generic | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0``. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before the + following atomic. + - Ensures that all + memory operations + have + completed before + performing the + atomicrmw that is + being released. + + 2. ``s_wait_xcnt 0x0`` + + - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``flat`` and ``global`` operations. + + 3. buffer/global/flat_atomic + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + atomicrmw release - workgroup - local 1. | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit all. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - Must happen before the + following atomic. + - Ensures that all + global memory + operations have + completed before + performing the + store that is being + released. 
+ + 2. ds_atomic + atomicrmw release - agent - global 1. ``global_wb`` + - system - generic + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + 2. | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0``. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic + load/load atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + ``global_wb`` or + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before the + following atomic. + - Ensures that all + memory operations + to global and local + have completed + before performing + the atomicrmw that + is being released. + + 3. ``s_wait_xcnt 0x0`` + + - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``flat`` and ``global`` operations. + + 4. buffer/global/flat_atomic + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + fence release - singlethread *none* *none* + - wavefront + fence release - workgroup *none* 1. | ``s_wait_storecnt 0x0`` + - cluster | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0``. + - If OpenCL and + address space is + local, omit all. + - See :ref:`amdgpu-fence-as` for + more details on fencing specific + address spaces. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store atomic/ + atomicrmw. + - Must happen before + any following store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Ensures that all + memory operations + have + completed before + performing the + following + fence-paired-atomic. + + fence release - agent *none* 1. ``global_wb`` + - system + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + 2. | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + | **OpenCL:** + | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + + - If OpenCl, omit ``s_wait_dscnt 0x0``. + - If OpenCL and address space is local, + omit all. + - See :ref:`amdgpu-fence-as` for + more details on fencing specific + address spaces. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic + load/load atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + ``global_wb`` or + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. 
+ - Must happen before + any following store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Ensures that all + memory operations + have + completed before + performing the + following + fence-paired-atomic. + + **Acquire-Release Atomic** + ------------------------------------------------------------------------------------ + atomicrmw acq_rel - singlethread - global 1. ``s_wait_xcnt 0x0`` + - wavefront - local + - generic - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``flat`` and ``global`` operations. + + 2. buffer/global/ds/flat_atomic + atomicrmw acq_rel - workgroup - global 1. | ``s_wait_storecnt 0x0`` + - cluster | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0``. + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + have + completed before + performing the + atomicrmw that is + being released. + + 2. ``s_wait_xcnt 0x0`` + + - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``flat`` and ``global`` operations. + + 3. buffer/global_atomic + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - If atomic with return, use + ``th:TH_ATOMIC_RETURN``. + + 4. | **Atomic with return:** + | ``s_wait_loadcnt 0x0`` + | **Atomic without return:** + | ``s_wait_storecnt 0x0`` + + - Ensures any + following global + data read is no + older than the + atomicrmw value + being acquired. + + atomicrmw acq_rel - workgroup - local 1 | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - Must happen before + the following + store. + - Ensures that all + global memory + operations have + completed before + performing the + store that is being + released. + + 2. ds_atomic + 3. ``s_wait_dscnt 0x0`` + + - If OpenCL, omit. + - Ensures any + following global + data read is no + older than the local load + atomic value being + acquired. + + atomicrmw acq_rel - workgroup - generic 1. | ``s_wait_storecnt 0x0`` + - cluster | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_loadcnt 0x0``. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. 
+ - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + have + completed before + performing the + atomicrmw that is + being released. + + 2. ``s_wait_xcnt 0x0`` + + - Ensure operation remains atomic even during a xnack replay. + + 3. flat_atomic + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - If atomic with return, + use ``th:TH_ATOMIC_RETURN``. + + 4. | **Atomic without return:** + | ``s_wait_dscnt 0x0`` + | ``s_wait_storecnt 0x0`` + | **Atomic with return:** + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit ``s_wait_dscnt 0x0`` + - Ensures any + following global + data read is no + older than the load + atomic value being + acquired. + + + atomicrmw acq_rel - agent - global 1. ``global_wb`` + - system + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + 2. | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit + ``s_wait_dscnt 0x0`` + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic + load/load atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + ``global_wb``. + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to global have + completed before + performing the + atomicrmw that is + being released. + + 2. ``s_wait_xcnt 0x0`` + + - Ensure operation remains atomic even during a xnack replay. + - Only needed for ``global`` operations. + + 3. buffer/global_atomic + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - If atomic with return, use + ``th:TH_ATOMIC_RETURN``. + + 4. | **Atomic with return:** + | ``s_wait_loadcnt 0x0`` + | **Atomic without return:** + | ``s_wait_storecnt 0x0`` + + - Must happen before + following + ``global_inv``. + - Ensures the + atomicrmw has + completed before + invalidating the + caches. + + 5. ``global_inv`` + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acq_rel - agent - generic 1. ``global_wb`` + - system + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + 2. | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit + ``s_wait_dscnt 0x0`` + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic + load/load atomic + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + ``global_wb``. + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + have + completed before + performing the + atomicrmw that is + being released. + + 3. ``s_wait_xcnt 0x0`` + + - Ensure operation remains atomic even during a xnack replay. + + 4. 
flat_atomic + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - If atomic with return, use + ``th:TH_ATOMIC_RETURN``. + + 5. | **Atomic with return:** + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + | **Atomic without return:** + | ``s_wait_storecnt 0x0`` + | ``s_wait_dscnt 0x0`` + + + - If OpenCL, omit + ``s_wait_dscnt 0x0``. + - Must happen before + following + ``global_inv``. + - Ensures the + atomicrmw has + completed before + invalidating the + caches. + + 5. ``global_inv`` + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + fence acq_rel - singlethread *none* *none* + - wavefront + fence acq_rel - workgroup *none* 1. | ``s_wait_storecnt 0x0`` + - cluster | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL and + address space is + not generic, omit + ``s_wait_dscnt 0x0`` + - If OpenCL and + address space is + local, omit + all but ``s_wait_dscnt 0x0``. + - See :ref:`amdgpu-fence-as` for + more details on fencing specific + address spaces. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. + - ``s_wait_storecnt 0x0`` + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store atomic/ + atomicrmw. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures that all + memory operations + have + completed before + performing any + following global + memory operations. + - Ensures that the + preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + acquire-fence-paired-atomic) + has completed + before following + global memory + operations. This + satisfies the + requirements of + acquire. + - Ensures that all + previous memory + operations have + completed before a + following + local/generic store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + release-fence-paired-atomic). + This satisfies the + requirements of + release. + - Ensures that the + acquire-fence-paired + atomic has completed + before invalidating + the + cache. Therefore + any following + locations read must + be no older than + the value read by + the + acquire-fence-paired-atomic. + + fence acq_rel - agent *none* 1. ``global_wb`` + - system + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + + 2. | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL and + address space is + not generic, omit + ``s_wait_dscnt 0x0`` + - If OpenCL and + address space is + local, omit + all but ``s_wait_dscnt 0x0``. + - See :ref:`amdgpu-fence-as` for + more details on fencing specific + address spaces. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. 
+ - ``s_wait_storecnt 0x0`` + must happen after + ``global_wb``. + - ``s_wait_dscnt 0x0`` + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + ``global_inv`` + - Ensures that the + preceding + global/local/generic + load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + acquire-fence-paired-atomic) + has completed + before invalidating + the caches. This + satisfies the + requirements of + acquire. + - Ensures that all + previous memory + operations have + completed before a + following + global/local/generic + store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + release-fence-paired-atomic). + This satisfies the + requirements of + release. + + 3. ``global_inv`` + + - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. This + satisfies the + requirements of + acquire. + + **Sequential Consistent Atomic** + ------------------------------------------------------------------------------------ + load atomic seq_cst - singlethread - global *Same as corresponding + - wavefront - local load atomic acquire, + - generic except must generate + all instructions even + for OpenCL.* + load atomic seq_cst - workgroup - global 1. | ``s_wait_storecnt 0x0`` + - generic | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit + ``s_wait_dscnt 0x0`` + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_dscnt 0x0`` must + happen after + preceding + local/generic load + atomic/store + atomic/atomicrmw + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own ``s_wait_dscnt 0x0`` + and so do not need to be + considered.) + - ``s_wait_loadcnt 0x0`` + must happen after + preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own waits and so do + not need to be + considered.) + - ``s_wait_storecnt 0x0`` + Must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own ``s_wait_storecnt 0x0`` + and so do not need to be + considered.) + - Ensures any + preceding + sequential + consistent global/local + memory instructions + have completed + before executing + this sequentially + consistent + instruction. This + prevents reordering + a seq_cst store + followed by a + seq_cst load. (Note + that seq_cst is + stronger than + acquire/release as + the reordering of + load acquire + followed by a store + release is + prevented by the + ``s_wait``\s of + the release, but + there is nothing + preventing a store + release followed by + load acquire from + completing out of + order. The ``s_wait``\s + could be placed after + seq_store or before + the seq_load. We + choose the load to + make the ``s_wait``\s be + as late as possible + so that the store + may have already + completed.) + + 2. 
*Following + instructions same as + corresponding load + atomic acquire, + except must generate + all instructions even + for OpenCL.* + load atomic seq_cst - workgroup - local 1. | ``s_wait_storecnt 0x0`` + | ``s_wait_loadcnt 0x0`` + | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit all. + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_loadcnt 0x0`` + must happen after + preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own ``s_wait``\s and so do + not need to be + considered.) + - ``s_wait_storecnt 0x0`` + Must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own ``s_wait_storecnt 0x0`` + and so do + not need to be + considered.) + - Ensures any + preceding + sequential + consistent global + memory instructions + have completed + before executing + this sequentially + consistent + instruction. This + prevents reordering + a seq_cst store + followed by a + seq_cst load. (Note + that seq_cst is + stronger than + acquire/release as + the reordering of + load acquire + followed by a store + release is + prevented by the + ``s_wait``\s of + the release, but + there is nothing + preventing a store + release followed by + load acquire from + completing out of + order. The s_waitcnt + could be placed after + seq_store or before + the seq_load. We + choose the load to + make the ``s_wait``\s be + as late as possible + so that the store + may have already + completed.) + + 2. *Following + instructions same as + corresponding load + atomic acquire, + except must generate + all instructions even + for OpenCL.* + + load atomic seq_cst - cluster - global 1. | ``s_wait_storecnt 0x0`` + - agent - generic | ``s_wait_loadcnt 0x0`` + - system | ``s_wait_dscnt 0x0`` + + - If OpenCL, omit + ``s_wait_dscnt 0x0`` + - The waits can be + independently moved + according to the + following rules: + - ``s_wait_dscnt 0x0`` + must happen after + preceding + local load + atomic/store + atomic/atomicrmw + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own ``s_wait_dscnt 0x0`` + and so do + not need to be + considered.) + - ``s_wait_loadcnt 0x0`` + must happen after + preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own ``s_wait``\s and so do + not need to be + considered.) + - ``s_wait_storecnt 0x0`` + Must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own + ``s_wait_storecnt 0x0`` and so do + not need to be + considered.) + - Ensures any + preceding + sequential + consistent global + memory instructions + have completed + before executing + this sequentially + consistent + instruction. This + prevents reordering + a seq_cst store + followed by a + seq_cst load. 
(Note + that seq_cst is + stronger than + acquire/release as + the reordering of + load acquire + followed by a store + release is + prevented by the + ``s_wait``\s of + the release, but + there is nothing + preventing a store + release followed by + load acquire from + completing out of + order. The ``s_wait``\s + could be placed after + seq_store or before + the seq_load. We + choose the load to + make the ``s_wait``\s be + as late as possible + so that the store + may have already + completed.) + + 2. *Following + instructions same as + corresponding load + atomic acquire, + except must generate + all instructions even + for OpenCL.* + store atomic seq_cst - singlethread - global *Same as corresponding + - wavefront - local store atomic release, + - workgroup - generic except must generate + - cluster all instructions even + - agent for OpenCL.* + - system + atomicrmw seq_cst - singlethread - global *Same as corresponding + - wavefront - local atomicrmw acq_rel, + - workgroup - generic except must generate + - cluster all instructions even + - agent for OpenCL.* + - system + fence seq_cst - singlethread *none* *Same as corresponding + - wavefront fence acq_rel, + - workgroup except must generate + - cluster all instructions even + - agent for OpenCL.* + - system + ============ ============ ============== ========== ================================ + +.. _amdgpu-amdhsa-memory-model-gfx125x-cooperative-atomics: + +'``llvm.amdgcn.cooperative.atomic``' Intrinsics +############################################### + +The collection of convergent threads participating in a cooperative atomic must belong +to the same wave32. + +Only naturally-aligned, contiguous groups of lanes may be used; +see :ref:`the table below` for the set of +possible lane groups. +Cooperative atomics may be executed by more than one group per wave. +Using an unsupported lane group, or using more lane groups per wave than the maximum will +cause undefined behavior. + +Using the intrinsic also causes undefined behavior if it loads or stores to addresses that: + +* Are not in the global address space (e.g.: private and local addresses spaces). +* Are only reachable through a bus that does not support 128B/256B requests + (e.g.: host memory over PCIe) +* Any other unsupported addresses (TBD, needs refinement) + +.. TODO:: + + Enumerate all cases where UB is invoked when using this intrinsic instead of hand-waving + "specific global memory locations". + +.. table:: GFX125x Cooperative Atomic Intrinsics + :name: gfx125x-cooperative-atomic-intrinsics-table + + ======================================================= ======================================= + LLVM Intrinsic Lane Groups + ======================================================= ======================================= + ``llvm.amdgcn.cooperative.atomic.store.32x4B`` ``0-31`` + + ``llvm.amdgcn.cooperative.atomic.load.32x4B`` ``0-31`` + + ``llvm.amdgcn.cooperative.atomic.store.16x8B`` ``0-15``, ``16-31`` + + ``llvm.amdgcn.cooperative.atomic.load.16x8B`` ``0-15``, ``16-31`` + + ``llvm.amdgcn.cooperative.atomic.store.8x16B`` ``0-7``, ``8-15``, ``16-23``, ``24-31`` + + ``llvm.amdgcn.cooperative.atomic.load.8x16B`` ``0-7``, ``8-15``, ``16-23``, ``24-31`` + + ======================================================= ======================================= + .. _amdgpu-amdhsa-trap-handler-abi: Trap Handler ABI