diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst index e8dceb836f98a..5ad8f9ab07e40 100644 --- a/llvm/docs/NVPTXUsage.rst +++ b/llvm/docs/NVPTXUsage.rst @@ -322,6 +322,450 @@ aligned '``@llvm.nvvm.barrier.cta.*``' instruction should only be used if it is known that all threads in the CTA evaluate the condition identically, otherwise behavior is undefined. +MBarrier family of Intrinsics +----------------------------- + +Overview: +^^^^^^^^^ + +An ``mbarrier`` is a barrier created in shared memory that supports: + +* Synchronizing any subset of threads within a CTA. +* One-way synchronization of threads across CTAs of a cluster. + Threads can perform only ``arrive`` operations but not ``*_wait`` on an + mbarrier located in shared::cluster space. +* Waiting for completion of asynchronous memory operations initiated by a + thread and making them visible to other threads. + +Unlike ``bar{.cta}/barrier{.cta}`` instructions which can access a limited +number of barriers per CTA, ``mbarrier`` objects are user-defined and are +only limited by the total shared memory size available. + +An mbarrier object is an opaque object in shared memory with an +alignment of 8-bytes. It keeps track of: + +* Current phase of the mbarrier object +* Count of pending arrivals for the current phase of the mbarrier object +* Count of expected arrivals for the next phase of the mbarrier object +* Count of pending asynchronous memory operations (or transactions) + tracked by the current phase of the mbarrier object. This is also + referred to as ``tx-count``. The unit of ``tx-count`` is specified + by the asynchronous memory operation (for example, + ``llvm.nvvm.cp.async.bulk.tensor.g2s.*``). + +The ``phase`` of an mbarrier object is the number of times the mbarrier +object has been used to synchronize threads/track async operations. +In each phase, threads perform: + +* arrive/expect-tx/complete-tx operations to progress the current phase. +* test_wait/try_wait operations to check for completion of the current phase. + +An mbarrier object completes the current phase when: + +* The count of the pending arrivals has reached zero AND +* The tx-count has reached zero. + +When an mbarrier object completes the current phase, below +actions are performed ``atomically``: + +* The mbarrier object transitions to the next phase. +* The pending arrival count is reinitialized to the expected arrival count. + +For more information, refer PTX ISA +``_. + +'``llvm.nvvm.mbarrier.init``' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Syntax: +""""""" + +.. code-block:: llvm + + declare void @llvm.nvvm.mbarrier.init(ptr %addr, i32 %count) + declare void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3) %addr, i32 %count) + +Overview: +""""""""" + +The '``@llvm.nvvm.mbarrier.init.*``' intrinsics are used to initialize +an mbarrier object located at ``addr`` with the value ``count``. +``count`` is a 32-bit unsigned integer value and must be within +the range [1...2^20-1]. During initialization: + +* The tx-count and the current phase of the mbarrier object are set to 0. +* The expected and pending arrival counts are set to ``count``. + +Semantics: +"""""""""" + +The ``.shared`` variant explicitly uses shared memory address space for +the ``addr`` operand. If the ``addr`` does not fall within the +shared::cta space, then the behavior of this intrinsic is undefined. +Performing ``mbarrier.init`` on a valid mbarrier object is undefined; +use ``mbarrier.inval`` before reusing the memory for another mbarrier +or any other purpose. + +'``llvm.nvvm.mbarrier.inval``' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Syntax: +""""""" + +.. code-block:: llvm + + declare void @llvm.nvvm.mbarrier.inval(ptr %addr) + declare void @llvm.nvvm.mbarrier.inval.shared(ptr addrspace(3) %addr) + +Overview: +""""""""" + +The '``@llvm.nvvm.mbarrier.inval.*``' intrinsics invalidate the mbarrier +object at the address specified by ``addr``. + +Semantics: +"""""""""" + +The ``.shared`` variant explicitly uses shared memory address space for +the ``addr`` operand. If the ``addr`` does not fall within the +shared::cta space, then the behavior of this intrinsic is undefined. +It is expected that ``addr`` was previously initialized using +``mbarrier.init``; otherwise, the behavior is undefined. + +'``llvm.nvvm.mbarrier.expect.tx``' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Syntax: +""""""" + +.. code-block:: llvm + + declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + +Overview: +""""""""" + +The '``@llvm.nvvm.mbarrier.expect.tx.*``' intrinsics increase the transaction +count of the mbarrier object at ``%addr`` by ``%tx_count``. The ``%tx_count`` +is a 32-bit unsigned integer value. + +Semantics: +"""""""""" + +The ``.space.{cta/cluster}`` indicates the address space where the mbarrier +object resides. + +The ``.scope.{cta/cluster}`` denotes the set of threads that can directly +observe the synchronizing effect of the mbarrier operation. When scope is +"cta", all threads executing in the same CTA (as the current thread) can +directly observe the effect of the ``expect.tx`` operation. Similarly, +when scope is "cluster", all threads executing in the same Cluster +(as the current thread) can directly observe the effect of the operation. + +If the ``addr`` does not fall within shared::cta or shared::cluster space, +then the behavior of this intrinsic is undefined. This intrinsic has +``relaxed`` semantics and hence does not provide any memory ordering +or visibility guarantees. + +'``llvm.nvvm.mbarrier.complete.tx``' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Syntax: +""""""" + +.. code-block:: llvm + + declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + +Overview: +""""""""" + +The '``@llvm.nvvm.mbarrier.complete.tx.*``' intrinsics decrease the transaction +count of the mbarrier object at ``%addr`` by ``%tx_count``. The ``%tx_count`` +is a 32-bit unsigned integer value. As a result of this decrement, +the mbarrier can potentially complete its current phase and transition +to the next phase. + +Semantics: +"""""""""" + +The semantics of these intrinsics are identical to those of the +``llvm.nvvm.mbarrier.expect.tx.*`` intrinsics described above. + +'``llvm.nvvm.mbarrier.arrive``' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Syntax: +""""""" + +.. code-block:: llvm + + declare i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count) + declare i64 @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count) + declare void @llvm.nvvm.mbarrier.arrive.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count) + declare void @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count) + + declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count) + declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count) + declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count) + declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count) + +Overview: +""""""""" + +The ``@llvm.nvvm.mbarrier.arrive.*`` intrinsics signal the arrival of the +executing thread or completion of an asynchronous instruction associated with +an arrive operation on the mbarrier object at ``%addr``. This operation +decrements the pending arrival count by ``%count``, a 32-bit unsigned integer, +potentially completing the current phase and triggering a transition to the +next phase. + +Semantics: +"""""""""" + +The ``.space.{cta/cluster}`` indicates the address space where the mbarrier +object resides. When the mbarrier is in shared::cta space, the intrinsics +return an opaque 64-bit value capturing the phase of the mbarrier object +_prior_ to this arrive operation. This value can be used with a try_wait +or test_wait operation to check for the completion of the mbarrier. + +The ``.scope.{cta/cluster}`` denotes the set of threads that can directly +observe the synchronizing effect of the mbarrier operation. When scope is +"cta", all threads executing in the same CTA (as the current thread) can +directly observe the effect of the ``arrive`` operation. Similarly, +when scope is "cluster", all threads executing in the same Cluster +(as the current thread) can directly observe the effect of the operation. + +If the ``addr`` does not fall within shared::cta or shared::cluster space, +then the behavior of this intrinsic is undefined. + +These intrinsics have ``release`` semantics by default. The release semantics +ensure ordering of operations that occur in program order _before_ this arrive +instruction, making their effects visible to subsequent operations in other +threads of the CTA (or cluster, depending on scope). Threads performing +corresponding acquire operations (such as mbarrier.test.wait) synchronize +with this release. The ``relaxed`` variants of these intrinsics do not +provide any memory ordering or visibility guarantees. + +'``llvm.nvvm.mbarrier.arrive.expect.tx``' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Syntax: +""""""" + +.. code-block:: llvm + + declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + + declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + +Overview: +""""""""" + +The ``@llvm.nvvm.mbarrier.arrive.expect.tx.*`` intrinsics are similar to +the ``@llvm.nvvm.mbarrier.arrive`` intrinsics except that they also +perform an ``expect-tx`` operation _prior_ to the ``arrive`` operation. +The ``%tx_count`` specifies the transaction count for the ``expect-tx`` +operation and the count for the ``arrive`` operation is assumed to be 1. + +Semantics: +"""""""""" + +The semantics of these intrinsics are identical to those of the +``llvm.nvvm.mbarrier.arrive.*`` intrinsics described above. + +'``llvm.nvvm.mbarrier.arrive.drop``' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Syntax: +""""""" + +.. code-block:: llvm + + declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count) + declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count) + declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count) + declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count) + + declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count) + declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count) + declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count) + declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count) + +Overview: +""""""""" + +The ``@llvm.nvvm.mbarrier.arrive.drop.*`` intrinsics decrement the +expected arrival count of the mbarrier object at ``%addr`` by +``%count`` and then perform an ``arrive`` operation with ``%count``. +The ``%count`` is a 32-bit integer. + +Semantics: +"""""""""" + +The semantics of these intrinsics are identical to those of the +``llvm.nvvm.mbarrier.arrive.*`` intrinsics described above. + +'``llvm.nvvm.mbarrier.arrive.drop.expect.tx``' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Syntax: +""""""" + +.. code-block:: llvm + + declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + + declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count) + +Overview: +""""""""" + +The ``@llvm.nvvm.mbarrier.arrive.drop.expect.tx.*`` intrinsics perform +the below operations on the mbarrier located at ``%addr``. + +* Perform an ``expect-tx`` operation i.e. increase the transaction count + of the mbarrier by ``%tx_count``, a 32-bit unsigned integer value. +* Decrement the expected arrival count of the mbarrier by 1. +* Perform an ``arrive`` operation on the mbarrier with a value of 1. + +Semantics: +"""""""""" + +The semantics of these intrinsics are identical to those of the +``llvm.nvvm.mbarrier.arrive.*`` intrinsics described above. + +'``llvm.nvvm.mbarrier.test.wait``' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Syntax: +""""""" + +.. code-block:: llvm + + declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state) + declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state) + declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase) + declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase) + + declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state) + declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state) + declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase) + declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase) + +Overview: +""""""""" + +The ``@llvm.nvvm.mbarrier.test.wait.*`` intrinsics test for the completion +of the current or the immediately preceding phase of an mbarrier object at +``%addr``. The test for completion can be done with either the ``state`` or +the ``phase-parity`` of the mbarrier object. + +* When done through the ``i64 %state`` operand, the state must be + returned by an ``llvm.nvvm.mbarrier.arrive.*`` on the _same_ + mbarrier object. +* The ``.parity`` variant of these intrinsics test for completion + of the phase indicated by the operand ``i32 %phase``, which is + the integer parity of either the current phase or the immediately + preceding phase of the mbarrier object. An even phase has integer + parity 0 and an odd phase has integer parity of 1. So the valid + values for phase-parity are 0 and 1. + +Semantics: +"""""""""" + +The ``.scope.{cta/cluster}`` denotes the set of threads that the +test_wait operation can directly synchronize with. + +If the ``addr`` does not fall within shared::cta space, then the +the behavior of this intrinsic is undefined. + +These intrinsics have ``acquire`` semantics by default. This acquire +pattern establishes memory ordering for operations occurring in program +order after this ``test_wait`` instruction by making operations from +other threads in the CTA (or cluster, depending on scope) visible to +subsequent operations in the current thread. When this wait completes, +it synchronizes with the corresponding release pattern from the +``mbarrier.arrive`` operation. The ``relaxed`` variants of these intrinsics +do not provide any memory ordering or visibility guarantees. + +This ``test.wait`` intrinsic is non-blocking and immediately returns +the completion status without suspending the executing thread. + +The boolean return value indicates: + +* True: The immediately preceding phase has completed +* False: The current phase is still incomplete + +When this wait returns true, the following ordering guarantees hold: + +* All memory accesses (except async operations) requested prior to + ``mbarrier.arrive`` having release semantics by participating + threads of a CTA (or cluster, depending on scope) are visible to + the executing thread. +* All ``cp.async`` operations requested prior to ``cp.async.mbarrier.arrive`` + by participating threads of a CTA are visible to the executing thread. +* All ``cp.async.bulk`` operations using the same mbarrier object requested + prior to ``mbarrier.arrive`` having release semantics by participating CTA + threads are visible to the executing thread. +* Memory accesses requested after this wait are not visible to memory + accesses performed prior to ``mbarrier.arrive`` by other participating + threads. +* No ordering guarantee exists for memory accesses by the same thread + between an ``mbarrier.arrive`` and this wait. + +'``llvm.nvvm.mbarrier.try.wait``' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Syntax: +""""""" + +.. code-block:: llvm + + declare i1 @llvm.nvvm.mbarrier.try.wait{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state) + declare i1 @llvm.nvvm.mbarrier.try.wait{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state) + + declare i1 @llvm.nvvm.mbarrier.try.wait.parity{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase) + declare i1 @llvm.nvvm.mbarrier.try.wait.parity{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase) + + declare i1 @llvm.nvvm.mbarrier.try.wait.tl{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state, i32 %timelimit) + declare i1 @llvm.nvvm.mbarrier.try.wait.tl{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state, i32 %timelimit) + + declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase, i32 %timelimit) + declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase, i32 %timelimit) + +Overview: +""""""""" + +The ``@llvm.nvvm.mbarrier.try.wait.*`` intrinsics test for the completion of +the current or immediately preceding phase of an mbarrier object at ``%addr``. +Unlike the ``test.wait`` intrinsics, which perform a non-blocking test, these +intrinsics may block the executing thread until the specified phase completes +or a system-dependent time limit expires. Suspended threads resume execution +when the phase completes or the time limit elapses. This time limit is +configurable through the ``.tl`` variants of these intrinsics, where the +``%timelimit`` operand (an unsigned integer) specifies the limit in +nanoseconds. Other semantics are identical to those of the ``test.wait`` +intrinsics described above. + Electing a thread ----------------- diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td index 3af1750ffcf3f..1bd4142d30952 100644 --- a/llvm/include/llvm/IR/IntrinsicsNVVM.td +++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td @@ -1851,6 +1851,73 @@ let IntrProperties = [IntrConvergent, IntrNoCallback] in { def int_nvvm_mbarrier_pending_count : NVVMBuiltin, Intrinsic<[llvm_i32_ty], [llvm_i64_ty], [IntrNoMem, IntrConvergent, IntrNoCallback]>; +// mbarrier.{expect_tx/complete_tx} +foreach op = ["expect_tx", "complete_tx"] in { + foreach scope = ["scope_cta", "scope_cluster"] in { + foreach space = ["space_cta", "space_cluster"] in { + defvar suffix = StrJoin<"_", [op, scope, space]>.ret; + defvar mbar_addr_ty = !if(!eq(space, "space_cta"), + llvm_shared_ptr_ty, llvm_shared_cluster_ptr_ty); + + def int_nvvm_mbarrier_ # suffix : + Intrinsic<[], [mbar_addr_ty, llvm_i32_ty], + [IntrConvergent, IntrArgMemOnly, IntrNoCallback]>; + } // space + } // scope +} // op + +// mbarrier.arrive and mbarrier.arrive.expect_tx +// mbarrier.arrive_drop and mbarrier.arrive_drop.expect_tx +foreach op = ["arrive", "arrive_expect_tx", + "arrive_drop", "arrive_drop_expect_tx"] in { + foreach scope = ["scope_cta", "scope_cluster"] in { + foreach space = ["space_cta", "space_cluster"] in { + defvar suffix = StrJoin<"_", [scope, space]>.ret; + defvar mbar_addr_ty = !if(!eq(space, "space_cta"), + llvm_shared_ptr_ty, llvm_shared_cluster_ptr_ty); + defvar args_ty = [mbar_addr_ty, // mbar_address_ptr + llvm_i32_ty]; // tx-count + + // mbarriers in shared_cluster space cannot return any value. + defvar mbar_ret_ty = !if(!eq(space, "space_cta"), + [llvm_i64_ty], []); + + def int_nvvm_mbarrier_ # op # "_" # suffix: + Intrinsic; + def int_nvvm_mbarrier_ # op # "_relaxed_" # suffix : + Intrinsic; + } // space + } // scope +} // op + +// mbarrier.{test_wait and try_wait} +foreach op = ["test_wait", "try_wait"] in { + foreach scope = ["scope_cta", "scope_cluster"] in { + foreach parity = [true, false] in { + foreach time_limit = !if(!eq(op, "try_wait"), [true, false], [false]) in { + defvar base_args = [llvm_shared_ptr_ty]; // mbar_ptr + defvar parity_args = !if(parity, [llvm_i32_ty], [llvm_i64_ty]); + defvar tl_args = !if(time_limit, [llvm_i32_ty], []); + defvar args = !listconcat(base_args, parity_args, tl_args); + defvar tmp_op = StrJoin<"_", [op, + !if(parity, "parity", ""), + !if(time_limit, "tl", "")]>.ret; + defvar suffix = StrJoin<"_", [scope, "space_cta"]>.ret; + + def int_nvvm_mbarrier_ # tmp_op # "_" # suffix : + Intrinsic<[llvm_i1_ty], args, + [IntrConvergent, NoCapture>, IntrNoCallback]>; + def int_nvvm_mbarrier_ # tmp_op # "_relaxed_" # suffix : + Intrinsic<[llvm_i1_ty], args, + [IntrConvergent, NoCapture>, IntrNoCallback, + IntrArgMemOnly, IntrReadMem]>; + } // tl + } // parity + } // scope +} // op + // Generated within nvvm. Use for ldu on sm_20 or later. Second arg is the // pointer's alignment. let IntrProperties = [IntrReadMem, IntrArgMemOnly, IntrNoCallback, IntrWillReturn, NoCapture>] in { @@ -2984,4 +3051,4 @@ foreach sp = [0, 1] in { } } -} // let TargetPrefix = "nvvm" \ No newline at end of file +} // let TargetPrefix = "nvvm" diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td index 22cf3a7eef2c1..b6a116dded609 100644 --- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td +++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td @@ -1082,6 +1082,161 @@ let Predicates = [hasPTX<70>, hasSM<80>] in { "mbarrier.pending_count.b64", [(set i32:$res, (int_nvvm_mbarrier_pending_count i64:$state))]>; } + +class MBAR_UTIL { + // The mbarrier instructions in PTX ISA are of the general form: + // mbarrier.op.semantics.scope.space.b64 arg1, arg2 ... + // where: + // op -> arrive, expect_tx, complete_tx, arrive.expect_tx etc. + // semantics -> acquire, release, relaxed (default depends on the op) + // scope -> cta or cluster (default is cta-scope) + // space -> shared::cta or shared::cluster (default is shared::cta) + // + // The 'semantics' and 'scope' go together. If one is specified, + // then the other _must_ be specified. For example: + // (A) mbarrier.arrive (valid, release and cta are default) + // (B) mbarrier.arrive.release.cta (valid, sem/scope mentioned explicitly) + // (C) mbarrier.arrive.release (invalid, needs scope) + // (D) mbarrier.arrive.cta (invalid, needs order) + // + // Wherever possible, we prefer form (A) to (B) since it is available + // from early PTX versions. In most cases, explicitly specifying the + // scope requires a later version of PTX. + string _scope_asm = !cond( + !eq(scope, "scope_cluster") : "cluster", + !eq(scope, "scope_cta") : !if(!empty(sem), "", "cta"), + true : scope); + string _space_asm = !cond( + !eq(space, "space_cta") : "shared", + !eq(space, "space_cluster") : "shared::cluster", + true : space); + + string _parity = !if(parity, "parity", ""); + string asm_str = StrJoin<".", ["mbarrier", op, _parity, + sem, _scope_asm, _space_asm, "b64"]>.ret; + + string _intr_suffix = StrJoin<"_", [!subst(".", "_", op), _parity, + !if(tl, "tl", ""), + sem, scope, space]>.ret; + string intr_name = "int_nvvm_mbarrier_" # _intr_suffix; + + // Predicate checks: + // These are used only for the "test_wait/try_wait" variants as they + // have evolved since sm80 and are complex. The predicates for the + // remaining instructions are straightforward and have already been + // applied directly. + Predicate _sm_pred = !cond(!or( + !eq(op, "try_wait"), + !eq(scope, "scope_cluster"), + !eq(sem, "relaxed")) : hasSM<90>, + true : hasSM<80>); + Predicate _ptx_pred = !cond( + !eq(sem, "relaxed") : hasPTX<86>, + !ne(_scope_asm, "") : hasPTX<80>, + !eq(op, "try_wait") : hasPTX<78>, + parity : hasPTX<71>, + true : hasPTX<70>); + list preds = [_ptx_pred, _sm_pred]; +} + +foreach op = ["expect_tx", "complete_tx"] in { + foreach scope = ["scope_cta", "scope_cluster"] in { + foreach space = ["space_cta", "space_cluster"] in { + defvar intr = !cast(MBAR_UTIL.intr_name); + defvar suffix = StrJoin<"_", [op, scope, space]>.ret; + def mbar_ # suffix : BasicNVPTXInst<(outs), (ins ADDR:$addr, B32:$tx_count), + MBAR_UTIL.asm_str, + [(intr addr:$addr, i32:$tx_count)]>, + Requires<[hasPTX<80>, hasSM<90>]>; + } // space + } // scope +} // op + +multiclass MBAR_ARR_INTR pred = []> { + // When either of sem or scope is non-default, both have to + // be explicitly specified. So, explicitly state that + // sem is `release` when scope is `cluster`. + defvar asm_sem = !if(!and(!empty(sem), !eq(scope, "scope_cluster")), + "release", sem); + + defvar asm_cta = MBAR_UTIL.asm_str; + defvar intr_cta = !cast(MBAR_UTIL.intr_name); + + defvar asm_cluster = MBAR_UTIL.asm_str; + defvar intr_cluster = !cast(MBAR_UTIL.intr_name); + + def _CTA : NVPTXInst<(outs B64:$state), + (ins ADDR:$addr, B32:$tx_count), + asm_cta # " $state, [$addr], $tx_count;", + [(set i64:$state, (intr_cta addr:$addr, i32:$tx_count))]>, + Requires; + def _CLUSTER : NVPTXInst<(outs), + (ins ADDR:$addr, B32:$tx_count), + asm_cluster # " _, [$addr], $tx_count;", + [(intr_cluster addr:$addr, i32:$tx_count)]>, + Requires; +} +foreach op = ["arrive", "arrive.expect_tx", + "arrive_drop", "arrive_drop.expect_tx"] in { + foreach scope = ["scope_cta", "scope_cluster"] in { + defvar suffix = !subst(".", "_", op) # scope; + defm mbar_ # suffix # _release : MBAR_ARR_INTR, hasSM<90>]>; + defm mbar_ # suffix # _relaxed : MBAR_ARR_INTR, hasSM<90>]>; + } // scope +} // op + +multiclass MBAR_WAIT_INTR { + // When either of sem or scope is non-default, both have to + // be explicitly specified. So, explicitly state that the + // semantics is `acquire` when the scope is `cluster`. + defvar asm_sem = !if(!and(!empty(sem), !eq(scope, "scope_cluster")), + "acquire", sem); + + defvar asm_parity = MBAR_UTIL.asm_str; + defvar pred_parity = MBAR_UTIL.preds; + defvar intr_parity = !cast(MBAR_UTIL.intr_name); + + defvar asm_state = MBAR_UTIL.asm_str; + defvar pred_state = MBAR_UTIL.preds; + defvar intr_state = !cast(MBAR_UTIL.intr_name); + + defvar ins_tl_dag = !if(time_limit, (ins B32:$tl), (ins)); + defvar tl_suffix = !if(time_limit, ", $tl;", ";"); + defvar intr_state_dag = !con((intr_state addr:$addr, i64:$state), + !if(time_limit, (intr_state i32:$tl), (intr_state))); + defvar intr_parity_dag = !con((intr_parity addr:$addr, i32:$phase), + !if(time_limit, (intr_parity i32:$tl), (intr_parity))); + + def _STATE : NVPTXInst<(outs B1:$res), !con((ins ADDR:$addr, B64:$state), ins_tl_dag), + asm_state # " $res, [$addr], $state" # tl_suffix, + [(set i1:$res, intr_state_dag)]>, + Requires; + def _PARITY : NVPTXInst<(outs B1:$res), !con((ins ADDR:$addr, B32:$phase), ins_tl_dag), + asm_parity # " $res, [$addr], $phase" # tl_suffix, + [(set i1:$res, intr_parity_dag)]>, + Requires; +} +foreach op = ["test_wait", "try_wait"] in { + foreach scope = ["scope_cta", "scope_cluster"] in { + foreach time_limit = !if(!eq(op, "try_wait"), [true, false], [false]) in { + defvar suffix = StrJoin<"_", [op, scope, !if(time_limit, "tl", "")]>.ret; + defm mbar_ # suffix # "_acquire" : MBAR_WAIT_INTR; + defm mbar_ # suffix # "_relaxed" : MBAR_WAIT_INTR; + } // time_limit + } // scope +} // op + //----------------------------------- // Math Functions //----------------------------------- diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_arr.ll b/llvm/test/CodeGen/NVPTX/mbarrier_arr.ll new file mode 100644 index 0000000000000..c440caaf98aba --- /dev/null +++ b/llvm/test/CodeGen/NVPTX/mbarrier_arr.ll @@ -0,0 +1,165 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6 +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s +; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %} +; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %} + +; === space_cta (addrspace 3) === +define void @test_mbarrier_arrive_scope_cta_space_cta(ptr addrspace(3) %mbar, i32 %tx) { +; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cta_space_cta( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<6>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cta_space_cta_param_0]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cta_param_1]; +; CHECK-PTX64-NEXT: mbarrier.arrive.shared.b64 %rd2, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.shared.b64 %rd3, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.shared.b64 %rd4, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.shared.b64 %rd5, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cta_space_cta( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<5>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cta_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_scope_cta_space_cta_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.shared.b64 %rd1, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.shared.b64 %rd2, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.shared.b64 %rd3, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.shared.b64 %rd4, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %r1 = call i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx) + ret void +} + +define void @test_mbarrier_arrive_scope_cluster_space_cta(ptr addrspace(3) %mbar, i32 %tx) { +; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cluster_space_cta( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<6>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cluster_space_cta_param_0]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cta_param_1]; +; CHECK-PTX64-NEXT: mbarrier.arrive.release.cluster.shared.b64 %rd2, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.release.cluster.shared.b64 %rd3, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.release.cluster.shared.b64 %rd4, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.release.cluster.shared.b64 %rd5, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cluster_space_cta( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<5>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cta_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_scope_cluster_space_cta_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.release.cluster.shared.b64 %rd1, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.release.cluster.shared.b64 %rd2, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.release.cluster.shared.b64 %rd3, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.release.cluster.shared.b64 %rd4, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %r1 = call i64 @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx) + ret void +} + +; === space_cluster (addrspace 7) === +define void @test_mbarrier_arrive_scope_cta_space_cluster(ptr addrspace(7) %mbar, i32 %tx) { +; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cta_space_cluster( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cta_space_cluster_param_0]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cluster_param_1]; +; CHECK-PTX64-NEXT: mbarrier.arrive.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cta_space_cluster( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cluster_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_scope_cta_space_cluster_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + call void @llvm.nvvm.mbarrier.arrive.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + ret void +} + +define void @test_mbarrier_arrive_scope_cluster_space_cluster(ptr addrspace(7) %mbar, i32 %tx) { +; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cluster_space_cluster( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cluster_space_cluster_param_0]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cluster_param_1]; +; CHECK-PTX64-NEXT: mbarrier.arrive.release.cluster.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.release.cluster.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.release.cluster.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.release.cluster.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cluster_space_cluster( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cluster_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_scope_cluster_space_cluster_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.release.cluster.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.release.cluster.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.release.cluster.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.release.cluster.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + call void @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + ret void +} + +declare i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3), i32) + +declare i64 @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3), i32) + +declare void @llvm.nvvm.mbarrier.arrive.scope.cta.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7), i32) + +declare void @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cluster(ptr addrspace(7), i32) diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_arr_relaxed.ll b/llvm/test/CodeGen/NVPTX/mbarrier_arr_relaxed.ll new file mode 100644 index 0000000000000..e4d2aa21f7def --- /dev/null +++ b/llvm/test/CodeGen/NVPTX/mbarrier_arr_relaxed.ll @@ -0,0 +1,165 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6 +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| FileCheck --check-prefixes=CHECK-PTX64 %s +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s +; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| %ptxas-verify -arch=sm_90 %} +; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %} + +; === space_cta (addrspace 3) === +define void @test_mbarrier_arrive_relaxed_scope_cta_space_cta(ptr addrspace(3) %mbar, i32 %tx) { +; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cta( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<6>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_0]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_1]; +; CHECK-PTX64-NEXT: mbarrier.arrive.relaxed.cta.shared.b64 %rd2, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.relaxed.cta.shared.b64 %rd3, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.relaxed.cta.shared.b64 %rd4, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cta.shared.b64 %rd5, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cta( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<5>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.relaxed.cta.shared.b64 %rd1, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.relaxed.cta.shared.b64 %rd2, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.relaxed.cta.shared.b64 %rd3, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cta.shared.b64 %rd4, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %r1 = call i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx) + ret void +} + +define void @test_mbarrier_arrive_relaxed_scope_cluster_space_cta(ptr addrspace(3) %mbar, i32 %tx) { +; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cta( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<6>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_0]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_1]; +; CHECK-PTX64-NEXT: mbarrier.arrive.relaxed.cluster.shared.b64 %rd2, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.relaxed.cluster.shared.b64 %rd3, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.relaxed.cluster.shared.b64 %rd4, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared.b64 %rd5, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cta( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<5>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.relaxed.cluster.shared.b64 %rd1, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.relaxed.cluster.shared.b64 %rd2, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.relaxed.cluster.shared.b64 %rd3, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared.b64 %rd4, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %r1 = call i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx) + %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx) + ret void +} + +; === space_cluster (addrspace 7) === +define void @test_mbarrier_arrive_relaxed_scope_cta_space_cluster(ptr addrspace(7) %mbar, i32 %tx) { +; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cluster( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_0]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_1]; +; CHECK-PTX64-NEXT: mbarrier.arrive.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cluster( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.relaxed.cta.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.relaxed.cta.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.relaxed.cta.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cta.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + call void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + ret void +} + +define void @test_mbarrier_arrive_relaxed_scope_cluster_space_cluster(ptr addrspace(7) %mbar, i32 %tx) { +; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cluster( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_0]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_1]; +; CHECK-PTX64-NEXT: mbarrier.arrive.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cluster( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + call void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx) + ret void +} + +declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3), i32) + +declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32) +declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32) + +declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32) + +declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32) diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_tx.ll b/llvm/test/CodeGen/NVPTX/mbarrier_tx.ll new file mode 100644 index 0000000000000..441ade3351206 --- /dev/null +++ b/llvm/test/CodeGen/NVPTX/mbarrier_tx.ll @@ -0,0 +1,87 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6 +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s +; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %} +; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %} + +declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3), i32) +declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cta(ptr addrspace(3), i32) +declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cta(ptr addrspace(3), i32) +declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cta(ptr addrspace(3), i32) + +declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cluster(ptr addrspace(7), i32) +declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cluster(ptr addrspace(7), i32) + +define void @test_mbarrier_tx_space_cta(ptr addrspace(3) %mbar, i32 %tx_count) { +; CHECK-PTX64-LABEL: test_mbarrier_tx_space_cta( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_tx_space_cta_param_0]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_tx_space_cta_param_1]; +; CHECK-PTX64-NEXT: mbarrier.expect_tx.relaxed.cta.shared.b64 [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.expect_tx.relaxed.cluster.shared.b64 [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.complete_tx.relaxed.cta.shared.b64 [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.complete_tx.relaxed.cluster.shared.b64 [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: test_mbarrier_tx_space_cta( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_tx_space_cta_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_tx_space_cta_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.expect_tx.relaxed.cta.shared.b64 [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.expect_tx.relaxed.cluster.shared.b64 [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.complete_tx.relaxed.cta.shared.b64 [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.complete_tx.relaxed.cluster.shared.b64 [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx_count) + tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx_count) + + tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx_count) + tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx_count) + + ret void +} + +define void @test_mbarrier_tx_space_cluster(ptr addrspace(7) %mbar, i32 %tx_count) { +; CHECK-PTX64-LABEL: test_mbarrier_tx_space_cluster( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_tx_space_cluster_param_0]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_tx_space_cluster_param_1]; +; CHECK-PTX64-NEXT: mbarrier.expect_tx.relaxed.cta.shared::cluster.b64 [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.expect_tx.relaxed.cluster.shared::cluster.b64 [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.complete_tx.relaxed.cta.shared::cluster.b64 [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.complete_tx.relaxed.cluster.shared::cluster.b64 [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: test_mbarrier_tx_space_cluster( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_tx_space_cluster_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_tx_space_cluster_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.expect_tx.relaxed.cta.shared::cluster.b64 [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.expect_tx.relaxed.cluster.shared::cluster.b64 [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.complete_tx.relaxed.cta.shared::cluster.b64 [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.complete_tx.relaxed.cluster.shared::cluster.b64 [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count) + tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count) + + tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count) + tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count) + + ret void +} diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx70.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx70.ll new file mode 100644 index 0000000000000..5130ae2bfea67 --- /dev/null +++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx70.ll @@ -0,0 +1,35 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6 +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70| FileCheck --check-prefixes=CHECK-PTX64 %s +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s +; RUN: %if ptxas-sm_80 && ptxas-isa-7.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70| %ptxas-verify -arch=sm_80 %} +; RUN: %if ptxas-sm_80 && ptxas-isa-7.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70 --nvptx-short-ptr| %ptxas-verify -arch=sm_80 %} + +declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3), i64) + +define void @mbar_test_wait(ptr addrspace(3) %mbar, i64 %state) { +; CHECK-PTX64-LABEL: mbar_test_wait( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .pred %p<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<3>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_0]; +; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_test_wait_param_1]; +; CHECK-PTX64-NEXT: mbarrier.test_wait.shared.b64 %p1, [%rd1], %rd2; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: mbar_test_wait( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<2>; +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<2>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.shared.b64 %p1, [%r1], %rd1; +; CHECK-PTX-SHARED32-NEXT: ret; + %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state) + + ret void +} diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx71.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx71.ll new file mode 100644 index 0000000000000..9327e7908cabd --- /dev/null +++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx71.ll @@ -0,0 +1,36 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6 +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71| FileCheck --check-prefixes=CHECK-PTX64 %s +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s +; RUN: %if ptxas-sm_80 && ptxas-isa-7.1 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71| %ptxas-verify -arch=sm_80 %} +; RUN: %if ptxas-sm_80 && ptxas-isa-7.1 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71 --nvptx-short-ptr| %ptxas-verify -arch=sm_80 %} + +; --- test.wait.parity --- +declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3), i32) + +define void @mbar_test_wait(ptr addrspace(3) %mbar, i32 %parity) { +; CHECK-PTX64-LABEL: mbar_test_wait( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .pred %p<2>; +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_0]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_1]; +; CHECK-PTX64-NEXT: mbarrier.test_wait.parity.shared.b64 %p1, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: mbar_test_wait( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<2>; +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_test_wait_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.parity.shared.b64 %p1, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity) + + ret void +} diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx78.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx78.ll new file mode 100644 index 0000000000000..9b19ad5f26026 --- /dev/null +++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx78.ll @@ -0,0 +1,83 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6 +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78| FileCheck --check-prefixes=CHECK-PTX64 %s +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s +; RUN: %if ptxas-sm_90 && ptxas-isa-7.8 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78| %ptxas-verify -arch=sm_90 %} +; RUN: %if ptxas-sm_90 && ptxas-isa-7.8 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %} + +; --- try.wait without timelimit --- +declare i1 @llvm.nvvm.mbarrier.try.wait.scope.cta.space.cta(ptr addrspace(3), i64) +declare i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cta.space.cta(ptr addrspace(3), i32) + +; --- try.wait with timelimit --- +declare i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cta.space.cta(ptr addrspace(3), i64, i32) +declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cta.space.cta(ptr addrspace(3), i32, i32) + +define void @mbar_try_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) { +; CHECK-PTX64-LABEL: mbar_try_wait( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .pred %p<3>; +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<3>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_0]; +; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_param_1]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.shared.b64 %p1, [%rd1], %rd2; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_2]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.shared.b64 %p2, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: mbar_try_wait( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.shared.b64 %p1, [%r1], %rd1; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_param_2]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.shared.b64 %p2, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state) + %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity) + + ret void +} + +define void @mbar_try_wait_tl(ptr addrspace(3) %mbar, i64 %state, i32 %parity, i32 %tl) { +; CHECK-PTX64-LABEL: mbar_try_wait_tl( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .pred %p<3>; +; CHECK-PTX64-NEXT: .reg .b32 %r<3>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<3>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_0]; +; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_tl_param_1]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_3]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.shared.b64 %p1, [%rd1], %rd2, %r1; +; CHECK-PTX64-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_2]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.shared.b64 %p2, [%rd1], %r2, %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: mbar_try_wait_tl( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<4>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_1]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_3]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.shared.b64 %p1, [%r1], %rd1, %r2; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r3, [mbar_try_wait_tl_param_2]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.shared.b64 %p2, [%r1], %r3, %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl) + %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl) + + ret void +} diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx80.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx80.ll new file mode 100644 index 0000000000000..034953ddb3072 --- /dev/null +++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx80.ll @@ -0,0 +1,123 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6 +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s +; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %} +; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %} + +; with sm-90 and ptx-80, we have support for cluster-scope + +; --- test.wait --- +declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3), i32) +declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3), i64) + +; --- try.wait without timelimit --- +declare i1 @llvm.nvvm.mbarrier.try.wait.scope.cluster.space.cta(ptr addrspace(3), i64) +declare i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cluster.space.cta(ptr addrspace(3), i32) + +; --- try.wait with timelimit --- +declare i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cluster.space.cta(ptr addrspace(3), i64, i32) +declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cluster.space.cta(ptr addrspace(3), i32, i32) + +define void @mbar_test_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) { +; CHECK-PTX64-LABEL: mbar_test_wait( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .pred %p<3>; +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<3>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_0]; +; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_test_wait_param_1]; +; CHECK-PTX64-NEXT: mbarrier.test_wait.acquire.cluster.shared.b64 %p1, [%rd1], %rd2; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_2]; +; CHECK-PTX64-NEXT: mbarrier.test_wait.parity.acquire.cluster.shared.b64 %p2, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: mbar_test_wait( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.acquire.cluster.shared.b64 %p1, [%r1], %rd1; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_test_wait_param_2]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.parity.acquire.cluster.shared.b64 %p2, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state) + %ret1 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity) + + ret void +} + +define void @mbar_try_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) { +; CHECK-PTX64-LABEL: mbar_try_wait( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .pred %p<3>; +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<3>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_0]; +; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_param_1]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%rd1], %rd2; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_2]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: mbar_try_wait( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%r1], %rd1; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_param_2]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state) + %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity) + + ret void +} + +define void @mbar_try_wait_tl(ptr addrspace(3) %mbar, i64 %state, i32 %parity, i32 %tl) { +; CHECK-PTX64-LABEL: mbar_try_wait_tl( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .pred %p<3>; +; CHECK-PTX64-NEXT: .reg .b32 %r<3>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<3>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_0]; +; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_tl_param_1]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_3]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%rd1], %rd2, %r1; +; CHECK-PTX64-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_2]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%rd1], %r2, %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: mbar_try_wait_tl( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<4>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_1]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_3]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%r1], %rd1, %r2; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r3, [mbar_try_wait_tl_param_2]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%r1], %r3, %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl) + %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl) + + ret void +} diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx86.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx86.ll new file mode 100644 index 0000000000000..652634b67da98 --- /dev/null +++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx86.ll @@ -0,0 +1,148 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6 +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| FileCheck --check-prefixes=CHECK-PTX64 %s +; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s +; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| %ptxas-verify -arch=sm_90 %} +; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %} + +; --- test.wait --- +declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3), i32) +declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32) +declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3), i64) +declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3), i64) + +; --- try.wait without timelimit --- +declare i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cta.space.cta(ptr addrspace(3), i64) +declare i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3), i64) +declare i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3), i32) +declare i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32) + +; --- try.wait with timelimit --- +declare i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cta.space.cta(ptr addrspace(3), i64, i32) +declare i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3), i64, i32) +declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cta.space.cta(ptr addrspace(3), i32, i32) +declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32, i32) + +define void @mbar_test_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) { +; CHECK-PTX64-LABEL: mbar_test_wait( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .pred %p<5>; +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<3>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_0]; +; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_test_wait_param_1]; +; CHECK-PTX64-NEXT: mbarrier.test_wait.relaxed.cta.shared.b64 %p1, [%rd1], %rd2; +; CHECK-PTX64-NEXT: mbarrier.test_wait.relaxed.cluster.shared.b64 %p2, [%rd1], %rd2; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_2]; +; CHECK-PTX64-NEXT: mbarrier.test_wait.parity.relaxed.cta.shared.b64 %p3, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.test_wait.parity.relaxed.cluster.shared.b64 %p4, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: mbar_test_wait( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<5>; +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.relaxed.cta.shared.b64 %p1, [%r1], %rd1; +; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.relaxed.cluster.shared.b64 %p2, [%r1], %rd1; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_test_wait_param_2]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.parity.relaxed.cta.shared.b64 %p3, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.parity.relaxed.cluster.shared.b64 %p4, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state) + %ret1 = call i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state) + + %ret2 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity) + %ret3 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity) + + ret void +} + +define void @mbar_try_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) { +; CHECK-PTX64-LABEL: mbar_try_wait( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .pred %p<5>; +; CHECK-PTX64-NEXT: .reg .b32 %r<2>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<3>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_0]; +; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_param_1]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%rd1], %rd2; +; CHECK-PTX64-NEXT: mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%rd1], %rd2; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_2]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%rd1], %r1; +; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%rd1], %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: mbar_try_wait( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<5>; +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_1]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%r1], %rd1; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%r1], %rd1; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_param_2]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%r1], %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state) + %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state) + + %ret2 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity) + %ret3 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity) + + ret void +} + +define void @mbar_try_wait_tl(ptr addrspace(3) %mbar, i64 %state, i32 %parity, i32 %tl) { +; CHECK-PTX64-LABEL: mbar_try_wait_tl( +; CHECK-PTX64: { +; CHECK-PTX64-NEXT: .reg .pred %p<5>; +; CHECK-PTX64-NEXT: .reg .b32 %r<3>; +; CHECK-PTX64-NEXT: .reg .b64 %rd<3>; +; CHECK-PTX64-EMPTY: +; CHECK-PTX64-NEXT: // %bb.0: +; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_0]; +; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_tl_param_1]; +; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_3]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%rd1], %rd2, %r1; +; CHECK-PTX64-NEXT: mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%rd1], %rd2, %r1; +; CHECK-PTX64-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_2]; +; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%rd1], %r2, %r1; +; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%rd1], %r2, %r1; +; CHECK-PTX64-NEXT: ret; +; +; CHECK-PTX-SHARED32-LABEL: mbar_try_wait_tl( +; CHECK-PTX-SHARED32: { +; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<5>; +; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<4>; +; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>; +; CHECK-PTX-SHARED32-EMPTY: +; CHECK-PTX-SHARED32-NEXT: // %bb.0: +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_0]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_1]; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_3]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%r1], %rd1, %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%r1], %rd1, %r2; +; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r3, [mbar_try_wait_tl_param_2]; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%r1], %r3, %r2; +; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%r1], %r3, %r2; +; CHECK-PTX-SHARED32-NEXT: ret; + %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl) + %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl) + + %ret2 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl) + %ret3 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl) + + ret void +}