[AMDGPU] Add amdgpu-as MMRA for fences #78572

Open

wants to merge 9 commits into main

Conversation

Pierre-vh
Contributor

@Pierre-vh Pierre-vh commented Jan 18, 2024

Using MMRAs, allow __builtin_amdgcn_fence to emit fences that target only one or more address spaces, instead of fencing all address spaces at once.

This is done through an amdgpu-as MMRA. The current implementation focuses on OpenCL fences, but it can easily be extended to support more address-space names and to apply to more than just fences.
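
As a usage sketch (not part of the patch): the calls below show how the builtin added in the diff could be invoked. The mask macros are illustrative, mirroring the bit layout decoded in CGBuiltin.cpp, and the IR in the comments is taken from the patch's tests.

```c++
// Illustrative mask macros, mirroring the bit layout decoded in
// CGBuiltin.cpp: bit 0 = local, bit 1 = global, bit 2 = image.
#define LOCAL_MASK  (1 << 0)
#define GLOBAL_MASK (1 << 1)
#define IMAGE_MASK  (1 << 2)

void release_local_only() {
  // Per the tests in this patch, this emits a fence annotated with a
  // single MMRA tag:
  //   fence syncscope("workgroup") release, !mmra !0
  //   !0 = !{!"opencl-fence-mem", !"local"}
  __builtin_amdgcn_fence_opencl(LOCAL_MASK, __ATOMIC_RELEASE, "workgroup");
}

void fence_global_and_image() {
  // Setting several mask bits produces one fence carrying a list of MMRA
  // tags, one per selected address space.
  __builtin_amdgcn_fence_opencl(GLOBAL_MASK | IMAGE_MASK, __ATOMIC_SEQ_CST,
                                "agent");
}
```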

@llvmbot llvmbot added the clang (Clang issues not falling into any other category), backend:AMDGPU, clang:frontend (language frontend issues, e.g. anything involving "Sema"), clang:codegen, llvm:globalisel, llvm:SelectionDAG (SelectionDAGISel as well), llvm:ir, llvm:analysis, and llvm:transforms labels on Jan 18, 2024
@llvmbot
Collaborator

llvmbot commented Jan 18, 2024

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-clang
@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-llvm-selectiondag
@llvm/pr-subscribers-llvm-globalisel
@llvm/pr-subscribers-llvm-transforms

Author: Pierre van Houtryve (Pierre-vh)

Changes

Note: please only review the last commit!

Using MMRAs, implement __builtin_amdgcn_fence_opencl to allow device libs to emit fences that target only one or more address spaces, instead of fencing all address spaces at once.


Patch is 231.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/78572.diff

44 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+1)
  • (modified) clang/lib/CodeGen/CGBuiltin.cpp (+27)
  • (modified) clang/lib/CodeGen/CodeGenFunction.h (+2)
  • (modified) clang/lib/Sema/SemaChecking.cpp (+6-1)
  • (added) clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp (+108)
  • (added) llvm/docs/MemoryModelRelaxationAnnotations.rst (+476)
  • (modified) llvm/docs/Reference.rst (+4)
  • (modified) llvm/include/llvm/Analysis/VectorUtils.h (+1-1)
  • (modified) llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h (+9)
  • (modified) llvm/include/llvm/CodeGen/MachineFunction.h (+2-1)
  • (modified) llvm/include/llvm/CodeGen/MachineInstr.h (+37-10)
  • (modified) llvm/include/llvm/CodeGen/MachineInstrBuilder.h (+32-14)
  • (modified) llvm/include/llvm/CodeGen/SelectionDAG.h (+12)
  • (modified) llvm/include/llvm/IR/FixedMetadataKinds.def (+1)
  • (added) llvm/include/llvm/IR/MemoryModelRelaxationAnnotations.h (+101)
  • (modified) llvm/lib/Analysis/VectorUtils.cpp (+10-1)
  • (modified) llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp (+1)
  • (modified) llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp (+3-1)
  • (modified) llvm/lib/CodeGen/MIRPrinter.cpp (+7)
  • (modified) llvm/lib/CodeGen/MachineFunction.cpp (+2-2)
  • (modified) llvm/lib/CodeGen/MachineInstr.cpp (+37-12)
  • (modified) llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp (+9)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+11-6)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp (+7)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp (+2)
  • (modified) llvm/lib/IR/CMakeLists.txt (+1)
  • (modified) llvm/lib/IR/Instruction.cpp (+13-1)
  • (added) llvm/lib/IR/MemoryModelRelaxationAnnotations.cpp (+201)
  • (modified) llvm/lib/IR/Verifier.cpp (+27)
  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+23-8)
  • (modified) llvm/lib/Transforms/Scalar/GVNHoist.cpp (+2-1)
  • (modified) llvm/lib/Transforms/Utils/FunctionComparator.cpp (+5)
  • (modified) llvm/lib/Transforms/Utils/Local.cpp (+14-1)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/mmra.ll (+49)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll (+1428)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll (+1188)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-private.ll (+1188)
  • (added) llvm/test/CodeGen/AMDGPU/mmra.ll (+254)
  • (added) llvm/test/Verifier/mmra-allowed.ll (+33)
  • (added) llvm/test/Verifier/mmra.ll (+37)
  • (modified) llvm/unittests/CodeGen/MachineInstrTest.cpp (+51)
  • (modified) llvm/unittests/IR/CMakeLists.txt (+1)
  • (added) llvm/unittests/IR/MemoryModelRelaxationAnnotationsTest.cpp (+34)
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index e562ef04a30194..b9cd21bbfe514c 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -68,6 +68,7 @@ BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n")
 BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
 BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
 BUILTIN(__builtin_amdgcn_fence, "vUicC*", "n")
+BUILTIN(__builtin_amdgcn_fence_opencl, "vUiUicC*", "n")
 BUILTIN(__builtin_amdgcn_groupstaticsize, "Ui", "n")
 
 BUILTIN(__builtin_amdgcn_atomic_inc32, "UZiUZiD*UZiUicC*", "n")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index de48b15645b1ab..050539b3816e54 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -55,6 +55,7 @@
 #include "llvm/IR/IntrinsicsX86.h"
 #include "llvm/IR/MDBuilder.h"
 #include "llvm/IR/MatrixBuilder.h"
+#include "llvm/IR/MemoryModelRelaxationAnnotations.h"
 #include "llvm/Support/ConvertUTF.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/ScopedPrinter.h"
@@ -17859,6 +17860,25 @@ llvm::Value *CodeGenFunction::EmitScalarOrConstFoldImmArg(unsigned ICEArguments,
   return Arg;
 }
 
+void CodeGenFunction::AddAMDGCNAddressSpaceMMRA(llvm::Instruction *Inst,
+                                                llvm::Value *ASMask) {
+  constexpr const char *Tag = "opencl-fence-mem";
+
+  uint64_t Mask = cast<llvm::ConstantInt>(ASMask)->getZExtValue();
+  if (Mask == 0)
+    return;
+
+  // 3 bits can be set: local, global, image in that order.
+  MMRAMetadata MMRAs;
+  if (Mask & (1 << 0))
+    MMRAs.addTag(Tag, "local");
+  if (Mask & (1 << 1))
+    MMRAs.addTag(Tag, "global");
+  if (Mask & (1 << 2))
+    MMRAs.addTag(Tag, "image");
+  Inst->setMetadata(LLVMContext::MD_MMRA, MMRAs.getAsMD(getLLVMContext()));
+}
+
 Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                                               const CallExpr *E) {
   llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
@@ -18348,6 +18368,13 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                             EmitScalarExpr(E->getArg(1)), AO, SSID);
     return Builder.CreateFence(AO, SSID);
   }
+  case AMDGPU::BI__builtin_amdgcn_fence_opencl: {
+    ProcessOrderScopeAMDGCN(EmitScalarExpr(E->getArg(1)),
+                            EmitScalarExpr(E->getArg(2)), AO, SSID);
+    FenceInst *Fence = Builder.CreateFence(AO, SSID);
+    AddAMDGCNAddressSpaceMMRA(Fence, EmitScalarExpr(E->getArg(0)));
+    return Fence;
+  }
   case AMDGPU::BI__builtin_amdgcn_atomic_inc32:
   case AMDGPU::BI__builtin_amdgcn_atomic_inc64:
   case AMDGPU::BI__builtin_amdgcn_atomic_dec32:
diff --git a/clang/lib/CodeGen/CodeGenFunction.h b/clang/lib/CodeGen/CodeGenFunction.h
index 143ad64e8816b1..93a11c379e82c9 100644
--- a/clang/lib/CodeGen/CodeGenFunction.h
+++ b/clang/lib/CodeGen/CodeGenFunction.h
@@ -4401,6 +4401,8 @@ class CodeGenFunction : public CodeGenTypeCache {
   llvm::Value *EmitHexagonBuiltinExpr(unsigned BuiltinID, const CallExpr *E);
   llvm::Value *EmitRISCVBuiltinExpr(unsigned BuiltinID, const CallExpr *E,
                                     ReturnValueSlot ReturnValue);
+
+  void AddAMDGCNAddressSpaceMMRA(llvm::Instruction *Inst, llvm::Value *ASMask);
   void ProcessOrderScopeAMDGCN(llvm::Value *Order, llvm::Value *Scope,
                                llvm::AtomicOrdering &AO,
                                llvm::SyncScope::ID &SSID);
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index ace3e386988f00..0a98f5504ff4f6 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -5098,6 +5098,10 @@ bool Sema::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
     OrderIndex = 0;
     ScopeIndex = 1;
     break;
+  case AMDGPU::BI__builtin_amdgcn_fence_opencl:
+    OrderIndex = 1;
+    ScopeIndex = 2;
+    break;
   default:
     return false;
   }
@@ -5120,7 +5124,8 @@ bool Sema::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
   switch (static_cast<llvm::AtomicOrderingCABI>(Ord)) {
   case llvm::AtomicOrderingCABI::relaxed:
   case llvm::AtomicOrderingCABI::consume:
-    if (BuiltinID == AMDGPU::BI__builtin_amdgcn_fence)
+    if (BuiltinID == AMDGPU::BI__builtin_amdgcn_fence ||
+        BuiltinID == AMDGPU::BI__builtin_amdgcn_fence_opencl)
       return Diag(ArgExpr->getBeginLoc(),
                   diag::warn_atomic_op_has_invalid_memory_order)
              << 0 << ArgExpr->getSourceRange();
diff --git a/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp b/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp
new file mode 100644
index 00000000000000..b4ac7ca8c178d9
--- /dev/null
+++ b/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp
@@ -0,0 +1,108 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals --version 3
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 %s -emit-llvm -O0 -o - \
+// RUN:   -triple=amdgcn-amd-amdhsa -Qn -mcode-object-version=none | FileCheck %s
+
+#define LOCAL_MASK (1 << 0)
+#define GLOBAL_MASK (1 << 1)
+#define IMAGE_MASK (1 << 2)
+
+//.
+// CHECK: @.str = private unnamed_addr addrspace(4) constant [10 x i8] c"workgroup\00", align 1
+// CHECK: @.str.1 = private unnamed_addr addrspace(4) constant [6 x i8] c"agent\00", align 1
+// CHECK: @.str.2 = private unnamed_addr addrspace(4) constant [1 x i8] zeroinitializer, align 1
+//.
+// CHECK-LABEL: define dso_local void @_Z10test_localv(
+// CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META1:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META1]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META1]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META1]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META1]]
+// CHECK-NEXT:    ret void
+//
+void test_local() {
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z11test_globalv(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META2:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META2]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META2]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META2]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META2]]
+// CHECK-NEXT:    ret void
+//
+void test_global() {
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z10test_imagev(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META3:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META3]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META2]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META3]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META3]]
+// CHECK-NEXT:    ret void
+//
+void test_image() {
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z10test_mixedv(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META4:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META5:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META5]]
+// CHECK-NEXT:    ret void
+//
+void test_mixed() {
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK | GLOBAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK | GLOBAL_MASK | LOCAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(0xFF,__ATOMIC_SEQ_CST, "workgroup");
+}
+//.
+// CHECK: attributes #[[ATTR0]] = { mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+//.
+// CHECK: [[META0:![0-9]+]] = !{i32 1, !"wchar_size", i32 4}
+// CHECK: [[META1]] = !{!"opencl-fence-mem", !"local"}
+// CHECK: [[META2]] = !{!"opencl-fence-mem", !"global"}
+// CHECK: [[META3]] = !{!"opencl-fence-mem", !"image"}
+// CHECK: [[META4]] = !{[[META3]], [[META2]]}
+// CHECK: [[META5]] = !{[[META3]], [[META2]], [[META1]]}
+//.
diff --git a/llvm/docs/MemoryModelRelaxationAnnotations.rst b/llvm/docs/MemoryModelRelaxationAnnotations.rst
new file mode 100644
index 00000000000000..e83e365a3a57de
--- /dev/null
+++ b/llvm/docs/MemoryModelRelaxationAnnotations.rst
@@ -0,0 +1,476 @@
+===================================
+Memory Model Relaxation Annotations
+===================================
+
+.. contents::
+   :local:
+
+Introduction
+============
+
+Memory Model Relaxation Annotations (MMRAs) are target-defined properties
+on instructions that can be used to selectively relax constraints placed
+by the memory model. For example:
+
+* The use of ``VulkanMemoryModel`` in a SPIRV program allows certain
+  memory operations to be reordered across ``acquire`` or ``release``
+  operations.
+* OpenCL APIs expose primitives to only fence a specific set of address
+  spaces, carrying that information to the backend can enable the
+  use of faster synchronization instructions, rather than fencing all
+  address spaces.
+
+MMRAs offer an opt-in system for targets to relax the default LLVM
+memory model.
+As such, they are attached to an operation using LLVM metadata which
+can always be dropped without affecting correctness.
+
+Definitions
+===========
+
+memory operation
+    A load, a store, an atomic, or a function call that is marked as
+    accessing memory.
+
+synchronizing operation
+    An instruction that synchronizes memory with other threads (e.g.
+    an atomic or a fence).
+
+tag
+    Metadata attached to a memory or synchronizing operation
+    that represents some target-defined property regarding memory
+    synchronization.
+
+    An operation may have multiple tags that each represent a different
+    property.
+
+    A tag is composed of a pair of metadata nodes:
+
+    * a *prefix* string.
+    * a *suffix* integer or string.
+
+    In LLVM IR, the pair is represented using a metadata tuple.
+    In other cases (comments, documentation, etc.), we may use the
+    ``prefix:suffix`` notation.
+    For example:
+
+    .. code-block::
+      :caption: Example: Tags in Metadata
+
+      !0 = !{!"scope", !"workgroup"}  # scope:workgroup
+      !1 = !{!"scope", !"device"}     # scope:device
+      !2 = !{!"scope", !"system"}     # scope:system
+      !3 = !{!"sync-as", i32 2}  # sync-as:2
+      !4 = !{!"sync-as", i32 1}  # sync-as:1
+      !5 = !{!"sync-as", i32 0}  # sync-as:0
+
+    .. note::
+
+      The only semantics relevant to the optimizer is the
+      "compatibility" relation defined below. All other
+      semantics are target defined.
+
+    Tags can also be organised in lists to allow operations
+    to specify all of the tags they belong to. Such a list
+    is referred to as a "set of tags".
+
+    .. code-block::
+      :caption: Example: Set of Tags in Metadata
+
+      !0 = !{!"scope", !"workgroup"}
+      !1 = !{!"sync-as", !"private"}
+      !2 = !{!0, !1}
+
+    .. note::
+
+      If an operation does not have MMRA metadata, it's treated as if
+      it has an empty list (``!{}``) of tags.
+
+    Note that it is not an error if a tag is not recognized by the
+    instruction it is applied to, or by the current target.
+    Such tags are simply ignored.
+
+    Both synchronizing operations and memory operations can have
+    zero or more tags attached to them using the ``!mmra`` syntax.
+
+    For the sake of readability in examples below,
+    we use a (non-functional) short syntax to represent MMRA metadata:
+
+    .. code-block::
+      :caption: Short Syntax Example
+
+      store %ptr1 # foo:bar
+      store %ptr1 !mmra !{!"foo", !"bar"}
+
+    These two notations are used interchangeably in this document and are
+    strictly equivalent. However, only the second version is functional.
+
+compatibility
+    Two sets of tags are said to be *compatible* iff, for every unique
+    tag prefix P present in at least one set:
+
+    - the other set contains no tag with prefix P, or
+    - at least one tag with prefix P is common to both sets.
+
+    The above definition implies that an empty set is always compatible
+    with any other set. This is an important property as it ensures that
+    if a transform drops the metadata on an operation, it can never affect
+    correctness. In other words, the memory model cannot be relaxed further
+    by deleting metadata from instructions.
+
+.. _HappensBefore:
+
+The *happens-before* Relation
+==============================
+
+Compatibility checks can be used to opt out of the *happens-before* relation
+established between two instructions.
+
+Ordering
+    When two instructions' metadata are not compatible, any program order
+    between them is not in *happens-before*.
+
+    For example, consider two tags ``foo:bar`` and
+    ``foo:baz`` exposed by a target:
+
+    .. code-block::
+
+       A: store %ptr1                 # foo:bar
+       B: store %ptr2                 # foo:baz
+       X: store atomic release %ptr3  # foo:bar
+
+    In the above example, ``A`` is compatible with ``X``, and hence ``A``
+    happens-before ``X``. But ``B`` is not compatible with
+    ``X``, and hence it is not happens-before ``X``.
+
+Synchronization
+    If a synchronizing operation has one or more tags, then whether it
+    participates in the ``seq_cst`` order with other operations is target
+    dependent.
+
+    .. code-block::
+
+       ; Depending on the semantics of foo:bar & foo:bux, this may not
+       ; synchronize with another sequence.
+       fence release               # foo:bar
+       store atomic %ptr1          # foo:bux
+
+Examples
+--------
+
+.. code-block:: text
+  :caption: Example 1
+
+   A: store ptr addrspace(1) %ptr2                  # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3   # sync-as:0 vulkan:nonprivate
+
+A and B are not ordered relative to each other
+(no *happens-before*) because their sets of tags are not compatible.
+
+Note that the ``sync-as`` value does not have to match the ``addrspace`` value.
+For example, in Example 1 the store-release ``B`` is performed on a location in
+``addrspace(1)`` but only wants to synchronize with operations on ``addrspace(0)``.
+
+.. code-block:: text
+  :caption: Example 2
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3  # sync-as:1 vulkan:nonprivate
+
+The ordering of A and B is unaffected because their sets of tags are
+compatible.
+
+Note that A and B may or may not be in *happens-before* due to other reasons.
+
+.. code-block:: text
+  :caption: Example 3
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3  # vulkan:nonprivate
+
+The ordering of A and B is unaffected because their sets of tags are
+compatible.
+
+.. code-block:: text
+  :caption: Example 4
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1
+   B: store atomic release ptr addrspace(1) %ptr3  # sync-as:2
+
+A and B do not have to be ordered relative to each other
+(no *happens-before*) because their sets of tags are not compatible.
+
+Use-cases
+=========
+
+SPIRV ``NonPrivatePointer``
+---------------------------
+
+MMRAs can support the SPIRV capability
+``VulkanMemoryModel``, where synchronizing operations only affect
+memory operations that specify ``NonPrivatePointer`` semantics.
+
+The example below is generated from a SPIRV program using the
+following recipe:
+
+- Add ``vulkan:nonprivate`` to every synchronizing operation.
+- Add ``vulkan:nonprivate`` to every non-atomic memory operation
+  that is marked ``NonPrivatePointer``.
+- Add ``vulkan:private`` to every non-atomic memory operation
+  that is not marked ``NonPrivatePointer``.
+
+.. code-block::
+
+   Thread T1:
+    A: store %ptr1                 # vulkan:nonprivate
+    B: store %ptr2                 # vulkan:private
+    X: store atomic release %ptr3  # vulkan:nonprivate
+
+   Thread T2:
+    Y: load atomic acquire %ptr3   # vulkan:nonprivate
+    C: load %ptr2                  # vulkan:private
+    D: load %ptr1                  # vulkan:nonprivate
+
+Compatibility ensures that operation ``A`` is ordered
+relative to ``X`` while operation ``D`` is ordered relative to ``Y``.
+If ``X`` synchronizes with ``Y``, then ``A`` happens-before ``D``.
+No such relation can be inferred about operations ``B`` and ``C``.
+
+.. note::
+   The `Vulkan Memory Model <https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#memory-model-non-private>`_
+   considers all atomic operations non-private.
+
+   Whether ``vulkan:nonprivate`` would be specified on atomic operations is
+   an implementation detail, as an atomic operation is always ``nonprivate``.
+   The implementation may choose to be explicit and emit IR with
+   ``vulkan:nonprivate`` on every atomic operation, or it could choose to
+   only emit ``vulkan:private`` and assume ``vulkan:nonprivate``
+   by default.
+
+Operations marked with ``vulkan:private`` effectively opt out of the
+happens-before order in a SPIRV program since they are incompatible
+with every synchronizing operation. Note that SPIRV operations that
+are not marked ``NonPrivatePointer`` are not entirely private to the
+thread --- they are implicitly synchronized at the start or end of a
+thread by the Vulkan *system-synchronizes-with* relationship. This
+example assumes that the target-defined semantics of
+``vulkan:private`` correctly implements this property.
+
+This scheme is general enough to express the interoperability of SPIRV
+programs with other environments.
+
+.. code-block::
+
+   Thread T1:
+   A: store %ptr1                 # vulkan:nonprivate
+   X: store atomic release %ptr2  # vulkan:nonprivate
+
+   Thread T2:
+   Y: load atomic acquire %ptr2   # foo:bar
+   B: load %ptr1
+
+In the above example, thread ``T1`` originates from a SPIRV program
+while thread ``T2`` originates from a non-SPIRV program. Whether ``X``
+can synchronize with ``Y`` is target defined.  If ``X`` synchronizes
+with ``Y``, then ``A`` happens before ``B`` (because A/X and
+Y/B are compatible).
+
+Implementation Example
+~~~~~~~~~~~~~~~~~~~~~~
+
+Consider the implementation of SPIRV ``NonPrivatePointer`` on a target
+where all memory operations are cached, and the entire cache is
+flushed or invalidated at a ``release`` or ``acquire`` respectively. A
+possible scheme is that when translating a SPIRV program, memory
+operations marked ``NonPrivatePointer`` should not be cached, and the
+cache contents should not be touched during an ``acquire`` or
+``release`` operation.
+
+This could be implemented using the tags that share the ``vulkan:`` prefix,
+as follows:
+
+- For memory operations:
+
+  - Operations with ``vulkan:nonprivate`` should bypass the cache.
+  - Operations with ``vulkan:private`` should be cached.
+  - Operations that specify neither or both should conservatively
+    bypass the cache to ensure correctness.
+
+- For synchronizing operations:
+
+  - Operations with ``vulkan:nonprivate`` should not flush or
+    invalidate the cache.
+  - Operations with ``vulkan:private`` should flush or invalidate the...
[truncated]
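
One point worth making concrete from the document above: the compatibility relation is the only MMRA semantics the optimizer may rely on. The sketch below is a minimal standalone C++ illustration of that relation; it is not the MMRAMetadata implementation added by the patch, and all names in it are hypothetical.

```c++
#include <iostream>
#include <set>
#include <string>
#include <utility>

// A tag is a (prefix, suffix) pair; a TagSet models the MMRA list attached
// to one operation. An operation without MMRA metadata gets an empty set.
using Tag = std::pair<std::string, std::string>;
using TagSet = std::set<Tag>;

static bool hasPrefix(const TagSet &S, const std::string &P) {
  for (const Tag &T : S)
    if (T.first == P)
      return true;
  return false;
}

// Two sets are compatible iff, for every prefix present in at least one set,
// either the other set has no tag with that prefix, or both sets share at
// least one tag with that prefix. Scanning the prefixes of A alone suffices:
// a prefix that appears only in B trivially satisfies the first clause.
bool isCompatible(const TagSet &A, const TagSet &B) {
  for (const Tag &T : A) {
    const std::string &Prefix = T.first;
    if (!hasPrefix(B, Prefix))
      continue; // B has no tag with this prefix: nothing to check.
    bool SharedTag = false;
    for (const Tag &U : A)
      if (U.first == Prefix && B.count(U))
        SharedTag = true;
    if (!SharedTag)
      return false; // Both sets use the prefix but share no tag under it.
  }
  return true;
}

int main() {
  TagSet Workgroup{{"scope", "workgroup"}};
  TagSet Device{{"scope", "device"}};
  TagSet Empty;
  std::cout << isCompatible(Workgroup, Device) << '\n';    // 0: same prefix, no shared tag
  std::cout << isCompatible(Workgroup, Workgroup) << '\n'; // 1: shared tag
  std::cout << isCompatible(Empty, Device) << '\n';        // 1: empty set is compatible
}
```

This also shows why dropping the metadata is always safe, as the document states: an empty tag set is compatible with every other set, so removing tags can only restore happens-before edges, never remove them.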

@llvmbot
Copy link
Collaborator

llvmbot commented Jan 18, 2024

@llvm/pr-subscribers-clang-codegen

Author: Pierre van Houtryve (Pierre-vh)

Changes

Note: please only review the last commit!

Using MMRAs, implement builtin_amdgcn_fence_opencl to allow device libs to emit fences that only target one or more address spaces, instead of fencing all address spaces at once.


Patch is 231.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/78572.diff

44 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+1)
  • (modified) clang/lib/CodeGen/CGBuiltin.cpp (+27)
  • (modified) clang/lib/CodeGen/CodeGenFunction.h (+2)
  • (modified) clang/lib/Sema/SemaChecking.cpp (+6-1)
  • (added) clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp (+108)
  • (added) llvm/docs/MemoryModelRelaxationAnnotations.rst (+476)
  • (modified) llvm/docs/Reference.rst (+4)
  • (modified) llvm/include/llvm/Analysis/VectorUtils.h (+1-1)
  • (modified) llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h (+9)
  • (modified) llvm/include/llvm/CodeGen/MachineFunction.h (+2-1)
  • (modified) llvm/include/llvm/CodeGen/MachineInstr.h (+37-10)
  • (modified) llvm/include/llvm/CodeGen/MachineInstrBuilder.h (+32-14)
  • (modified) llvm/include/llvm/CodeGen/SelectionDAG.h (+12)
  • (modified) llvm/include/llvm/IR/FixedMetadataKinds.def (+1)
  • (added) llvm/include/llvm/IR/MemoryModelRelaxationAnnotations.h (+101)
  • (modified) llvm/lib/Analysis/VectorUtils.cpp (+10-1)
  • (modified) llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp (+1)
  • (modified) llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp (+3-1)
  • (modified) llvm/lib/CodeGen/MIRPrinter.cpp (+7)
  • (modified) llvm/lib/CodeGen/MachineFunction.cpp (+2-2)
  • (modified) llvm/lib/CodeGen/MachineInstr.cpp (+37-12)
  • (modified) llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp (+9)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+11-6)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp (+7)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp (+2)
  • (modified) llvm/lib/IR/CMakeLists.txt (+1)
  • (modified) llvm/lib/IR/Instruction.cpp (+13-1)
  • (added) llvm/lib/IR/MemoryModelRelaxationAnnotations.cpp (+201)
  • (modified) llvm/lib/IR/Verifier.cpp (+27)
  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+23-8)
  • (modified) llvm/lib/Transforms/Scalar/GVNHoist.cpp (+2-1)
  • (modified) llvm/lib/Transforms/Utils/FunctionComparator.cpp (+5)
  • (modified) llvm/lib/Transforms/Utils/Local.cpp (+14-1)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/mmra.ll (+49)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll (+1428)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll (+1188)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-private.ll (+1188)
  • (added) llvm/test/CodeGen/AMDGPU/mmra.ll (+254)
  • (added) llvm/test/Verifier/mmra-allowed.ll (+33)
  • (added) llvm/test/Verifier/mmra.ll (+37)
  • (modified) llvm/unittests/CodeGen/MachineInstrTest.cpp (+51)
  • (modified) llvm/unittests/IR/CMakeLists.txt (+1)
  • (added) llvm/unittests/IR/MemoryModelRelaxationAnnotationsTest.cpp (+34)
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index e562ef04a30194..b9cd21bbfe514c 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -68,6 +68,7 @@ BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n")
 BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
 BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
 BUILTIN(__builtin_amdgcn_fence, "vUicC*", "n")
+BUILTIN(__builtin_amdgcn_fence_opencl, "vUiUicC*", "n")
 BUILTIN(__builtin_amdgcn_groupstaticsize, "Ui", "n")
 
 BUILTIN(__builtin_amdgcn_atomic_inc32, "UZiUZiD*UZiUicC*", "n")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index de48b15645b1ab..050539b3816e54 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -55,6 +55,7 @@
 #include "llvm/IR/IntrinsicsX86.h"
 #include "llvm/IR/MDBuilder.h"
 #include "llvm/IR/MatrixBuilder.h"
+#include "llvm/IR/MemoryModelRelaxationAnnotations.h"
 #include "llvm/Support/ConvertUTF.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/ScopedPrinter.h"
@@ -17859,6 +17860,25 @@ llvm::Value *CodeGenFunction::EmitScalarOrConstFoldImmArg(unsigned ICEArguments,
   return Arg;
 }
 
+void CodeGenFunction::AddAMDGCNAddressSpaceMMRA(llvm::Instruction *Inst,
+                                                llvm::Value *ASMask) {
+  constexpr const char *Tag = "opencl-fence-mem";
+
+  uint64_t Mask = cast<llvm::ConstantInt>(ASMask)->getZExtValue();
+  if (Mask == 0)
+    return;
+
+  // 3 bits can be set: local, global, image in that order.
+  MMRAMetadata MMRAs;
+  if (Mask & (1 << 0))
+    MMRAs.addTag(Tag, "local");
+  if (Mask & (1 << 1))
+    MMRAs.addTag(Tag, "global");
+  if (Mask & (1 << 2))
+    MMRAs.addTag(Tag, "image");
+  Inst->setMetadata(LLVMContext::MD_MMRA, MMRAs.getAsMD(getLLVMContext()));
+}
+
 Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                                               const CallExpr *E) {
   llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
@@ -18348,6 +18368,13 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                             EmitScalarExpr(E->getArg(1)), AO, SSID);
     return Builder.CreateFence(AO, SSID);
   }
+  case AMDGPU::BI__builtin_amdgcn_fence_opencl: {
+    ProcessOrderScopeAMDGCN(EmitScalarExpr(E->getArg(1)),
+                            EmitScalarExpr(E->getArg(2)), AO, SSID);
+    FenceInst *Fence = Builder.CreateFence(AO, SSID);
+    AddAMDGCNAddressSpaceMMRA(Fence, EmitScalarExpr(E->getArg(0)));
+    return Fence;
+  }
   case AMDGPU::BI__builtin_amdgcn_atomic_inc32:
   case AMDGPU::BI__builtin_amdgcn_atomic_inc64:
   case AMDGPU::BI__builtin_amdgcn_atomic_dec32:
diff --git a/clang/lib/CodeGen/CodeGenFunction.h b/clang/lib/CodeGen/CodeGenFunction.h
index 143ad64e8816b1..93a11c379e82c9 100644
--- a/clang/lib/CodeGen/CodeGenFunction.h
+++ b/clang/lib/CodeGen/CodeGenFunction.h
@@ -4401,6 +4401,8 @@ class CodeGenFunction : public CodeGenTypeCache {
   llvm::Value *EmitHexagonBuiltinExpr(unsigned BuiltinID, const CallExpr *E);
   llvm::Value *EmitRISCVBuiltinExpr(unsigned BuiltinID, const CallExpr *E,
                                     ReturnValueSlot ReturnValue);
+
+  void AddAMDGCNAddressSpaceMMRA(llvm::Instruction *Inst, llvm::Value *ASMask);
   void ProcessOrderScopeAMDGCN(llvm::Value *Order, llvm::Value *Scope,
                                llvm::AtomicOrdering &AO,
                                llvm::SyncScope::ID &SSID);
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index ace3e386988f00..0a98f5504ff4f6 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -5098,6 +5098,10 @@ bool Sema::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
     OrderIndex = 0;
     ScopeIndex = 1;
     break;
+  case AMDGPU::BI__builtin_amdgcn_fence_opencl:
+    OrderIndex = 1;
+    ScopeIndex = 2;
+    break;
   default:
     return false;
   }
@@ -5120,7 +5124,8 @@ bool Sema::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
   switch (static_cast<llvm::AtomicOrderingCABI>(Ord)) {
   case llvm::AtomicOrderingCABI::relaxed:
   case llvm::AtomicOrderingCABI::consume:
-    if (BuiltinID == AMDGPU::BI__builtin_amdgcn_fence)
+    if (BuiltinID == AMDGPU::BI__builtin_amdgcn_fence ||
+        BuiltinID == AMDGPU::BI__builtin_amdgcn_fence_opencl)
       return Diag(ArgExpr->getBeginLoc(),
                   diag::warn_atomic_op_has_invalid_memory_order)
              << 0 << ArgExpr->getSourceRange();
diff --git a/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp b/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp
new file mode 100644
index 00000000000000..b4ac7ca8c178d9
--- /dev/null
+++ b/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp
@@ -0,0 +1,108 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals --version 3
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 %s -emit-llvm -O0 -o - \
+// RUN:   -triple=amdgcn-amd-amdhsa -Qn -mcode-object-version=none | FileCheck %s
+
+#define LOCAL_MASK (1 << 0)
+#define GLOBAL_MASK (1 << 1)
+#define IMAGE_MASK (1 << 2)
+
+//.
+// CHECK: @.str = private unnamed_addr addrspace(4) constant [10 x i8] c"workgroup\00", align 1
+// CHECK: @.str.1 = private unnamed_addr addrspace(4) constant [6 x i8] c"agent\00", align 1
+// CHECK: @.str.2 = private unnamed_addr addrspace(4) constant [1 x i8] zeroinitializer, align 1
+//.
+// CHECK-LABEL: define dso_local void @_Z10test_localv(
+// CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META1:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META1]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META1]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META1]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META1]]
+// CHECK-NEXT:    ret void
+//
+void test_local() {
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z11test_globalv(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META2:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META2]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META2]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META2]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META2]]
+// CHECK-NEXT:    ret void
+//
+void test_global() {
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z10test_imagev(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META3:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META3]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META2]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META3]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META3]]
+// CHECK-NEXT:    ret void
+//
+void test_image() {
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z10test_mixedv(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META4:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META5:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META5]]
+// CHECK-NEXT:    ret void
+//
+void test_mixed() {
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK | GLOBAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK | GLOBAL_MASK | LOCAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(0xFF,__ATOMIC_SEQ_CST, "workgroup");
+}
+//.
+// CHECK: attributes #[[ATTR0]] = { mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+//.
+// CHECK: [[META0:![0-9]+]] = !{i32 1, !"wchar_size", i32 4}
+// CHECK: [[META1]] = !{!"opencl-fence-mem", !"local"}
+// CHECK: [[META2]] = !{!"opencl-fence-mem", !"global"}
+// CHECK: [[META3]] = !{!"opencl-fence-mem", !"image"}
+// CHECK: [[META4]] = !{[[META3]], [[META2]]}
+// CHECK: [[META5]] = !{[[META3]], [[META2]], [[META1]]}
+//.
diff --git a/llvm/docs/MemoryModelRelaxationAnnotations.rst b/llvm/docs/MemoryModelRelaxationAnnotations.rst
new file mode 100644
index 00000000000000..e83e365a3a57de
--- /dev/null
+++ b/llvm/docs/MemoryModelRelaxationAnnotations.rst
@@ -0,0 +1,476 @@
+===================================
+Memory Model Relaxation Annotations
+===================================
+
+.. contents::
+   :local:
+
+Introduction
+============
+
+Memory Model Relaxation Annotations (MMRAs) are target-defined properties
+on instructions that can be used to selectively relax constraints placed
+by the memory model. For example:
+
+* The use of ``VulkanMemoryModel`` in a SPIRV program allows certain
+  memory operations to be reordered across ``acquire`` or ``release``
+  operations.
+* OpenCL APIs expose primitives to only fence a specific set of address
+  spaces, carrying that information to the backend can enable the
+  use of faster synchronization instructions, rather than fencing all
+  address spaces.
+
+MMRAs offer an opt-in system for targets to relax the default LLVM
+memory model.
+As such, they are attached to an operation using LLVM metadata which
+can always be dropped without affecting correctness.
+
+Definitions
+===========
+
+memory operation
+    A load, a store, an atomic, or a function call that is marked as
+    accessing memory.
+
+synchronizing operation
+    An instruction that synchronizes memory with other threads (e.g.
+    an atomic or a fence).
+
+tag
+    Metadata attached to a memory or synchronizing operation
+    that represents some target-defined property regarding memory
+    synchronization.
+
+    An operation may have multiple tags that each represent a different
+    property.
+
+    A tag is composed of a pair of metadata nodes:
+
+    * a *prefix* string.
+    * a *suffix* integer or string.
+
+    In LLVM IR, the pair is represented using a metadata tuple.
+    In other cases (comments, documentation, etc.), we may use the
+    ``prefix:suffix`` notation.
+    For example:
+
+    .. code-block::
+      :caption: Example: Tags in Metadata
+
+      !0 = !{!"scope", !"workgroup"}  # scope:workgroup
+      !1 = !{!"scope", !"device"}     # scope:device
+      !2 = !{!"scope", !"system"}     # scope:system
+      !3 = !{!"sync-as", i32 2}  # sync-as:2
+      !4 = !{!"sync-as", i32 1}  # sync-as:1
+      !5 = !{!"sync-as", i32 0}  # sync-as:0
+
+    .. note::
+
+      The only semantics relevant to the optimizer is the
+      "compatibility" relation defined below. All other
+      semantics are target defined.
+
+    Tags can also be organised in lists to allow operations
+    to specify all of the tags they belong to. Such a list
+    is referred to as a "set of tags".
+
+    .. code-block::
+      :caption: Example: Set of Tags in Metadata
+
+      !0 = !{!"scope", !"workgroup"}
+      !1 = !{!"sync-as", !"private"}
+      !2 = !{!0, !2}
+
+    .. note::
+
+      If an operation does not have MMRA metadata, it's treated as if
+      it has an empty list (``!{}``) of tags.
+
+    Note that it is not an error if a tag is not recognized by the
+    instruction it is applied to, or by the current target.
+    Such tags are simply ignored.
+
+    Both synchronizing operations and memory operations can have
+    zero or more tags attached to them using the ``!mmra`` syntax.
+
+    For the sake of readability in examples below,
+    we use a (non-functional) short syntax to represent MMMRA metadata:
+
+    .. code-block::
+      :caption: Short Syntax Example
+
+      store %ptr1 # foo:bar
+      store %ptr1 !mmra !{!"foo", !"bar"}
+
+    These two notations can be used in this document and are strictly
+    equivalent. However, only the second version is functional.
+
+compatibility
+    Two sets of tags are said to be *compatible* iff, for every unique
+    tag prefix P present in at least one set:
+
+    - the other set contains no tag with prefix P, or
+    - at least one tag with prefix P is common to both sets.
+
+    The above definition implies that an empty set is always compatible
+    with any other set. This is an important property as it ensures that
+    if a transform drops the metadata on an operation, it can never affect
+    correctness. In other words, the memory model cannot be relaxed further
+    by deleting metadata from instructions.
+
+.. _HappensBefore:
+
+The *happens-before* Relation
+==============================
+
+Compatibility checks can be used to opt out of the *happens-before* relation
+established between two instructions.
+
+Ordering
+    When two instructions' metadata are not compatible, any program order
+    between them are not in *happens-before*.
+
+    For example, consider two tags ``foo:bar`` and
+    ``foo:baz`` exposed by a target:
+
+    .. code-block::
+
+       A: store %ptr1                 # foo:bar
+       B: store %ptr2                 # foo:baz
+       X: store atomic release %ptr3  # foo:bar
+
+    In the above figure, ``A`` is compatible with ``X``, and hence ``A``
+    happens-before ``X``. But ``B`` is not compatible with
+    ``X``, and hence it is not happens-before ``X``.
+
+Synchronization
+    If an synchronizing operation has one or more tags, then whether it
+    participate in the  ``seq_cst`` order with other operations is target
+    dependent.
+
+    .. code-block::
+
+       ; Depending on the semantics of foo:bar & foo:bux, this may not
+       ; synchronize with another sequence.
+       fence release               # foo:bar
+       store atomic %ptr1          # foo:bux
+
+Examples
+--------
+
+.. code-block:: text
+  :caption: Example 1
+
+   A: store ptr addrspace(1) %ptr2                  # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3   # sync-as:0 vulkan:nonprivate
+
+A and B are not ordered relative to each other
+(no *happens-before*) because their sets of tags are not compatible.
+
+Note that the ``sync-as`` value does not have to match the ``addrspace`` value.
+e.g. In Example 1, a store-release to a location in ``addrspace(1)`` wants to
+only synchronize with operations happening in ``addrspace(0)``.
+
+.. code-block:: text
+  :caption: Example 2
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3  # sync-as:1 vulkan:nonprivate
+
+The ordering of A and B is unaffected because their set of tags are
+compatible.
+
+Note that A and B may or may not be in *happens-before* due to other reasons.
+
+.. code-block:: text
+  :caption: Example 3
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3  # vulkan:nonprivate
+
+The ordering of A and B is unaffected because their set of tags are
+compatible.
+
+.. code-block:: text
+  :caption: Example 3
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1
+   B: store atomic release ptr addrspace(1) %ptr3  # sync-as:2
+
+A and B do not have to be ordered relative to each other
+(no *happens-before*) because their sets of tags are not compatible.
+
+Use-cases
+=========
+
+SPIRV ``NonPrivatePointer``
+---------------------------
+
+MMRAs can support the SPIRV capability
+``VulkanMemoryModel``, where synchronizing operations only affect
+memory operations that specify ``NonPrivatePointer`` semantics.
+
+The example below is generated from a SPIRV program using the
+following recipe:
+
+- Add ``vulkan:nonprivate`` to every synchronizing operation.
+- Add ``vulkan:nonprivate`` to every non-atomic memory operation
+  that is marked ``NonPrivatePointer``.
+- Add ``vulkan:private`` to tags of every non-atomic memory operation
+  that is not marked ``NonPrivatePointer``.
+
+.. code-block::
+
+   Thread T1:
+    A: store %ptr1                 # vulkan:nonprivate
+    B: store %ptr2                 # vulkan:private
+    X: store atomic release %ptr3  # vulkan:nonprivate
+
+   Thread T2:
+    Y: load atomic acquire %ptr3   # vulkan:nonprivate
+    C: load %ptr2                  # vulkan:private
+    D: load %ptr1                  # vulkan:nonprivate
+
+Compatibility ensures that operation ``A`` is ordered
+relative to ``X`` while operation ``D`` is ordered relative to ``Y``.
+If ``X`` synchronizes with ``Y``, then ``A`` happens-before ``D``.
+No such relation can be inferred about operations ``B`` and ``C``.
+
+.. note::
+   The `Vulkan Memory Model <https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#memory-model-non-private>`_
+   considers all atomic operation non-private.
+
+   Whether ``vulkan:nonprivate`` would be specified on atomic operations is
+   an implementation detail, as an atomic operation is always ``nonprivate``.
+   The implementation may choose to be explicit and emit IR with
+   ``vulkan:nonprivate`` on every atomic operation, or it could choose to
+   only emit ``vulkan::private`` and assume ``vulkan:nonprivate``
+   by default.
+
+Operations marked with ``vulkan:private`` effectively opt out of the
+happens-before order in a SPIRV program since they are incompatible
+with every synchronizing operation. Note that SPIRV operations that
+are not marked ``NonPrivatePointer`` are not entirely private to the
+thread --- they are implicitly synchronized at the start or end of a
+thread by the Vulkan *system-synchronizes-with* relationship. This
+example assumes that the target-defined semantics of
+``vulkan:private`` correctly implements this property.
+
+This scheme is general enough to express the interoperability of SPIRV
+programs with other environments.
+
+.. code-block::
+
+   Thread T1:
+   A: store %ptr1                 # vulkan:nonprivate
+   X: store atomic release %ptr2  # vulkan:nonprivate
+
+   Thread T2:
+   Y: load atomic acquire %ptr2   # foo:bar
+   B: load %ptr1
+
+In the above example, thread ``T1`` originates from a SPIRV program
+while thread ``T2`` originates from a non-SPIRV program. Whether ``X``
+can synchronize with ``Y`` is target defined.  If ``X`` synchronizes
+with ``Y``, then ``A`` happens before ``B`` (because A/X and
+Y/B are compatible).
+
+Implementation Example
+~~~~~~~~~~~~~~~~~~~~~~
+
+Consider the implementation of SPIRV ``NonPrivatePointer`` on a target
+where all memory operations are cached, and the entire cache is
+flushed or invalidated at a ``release`` or ``acquire`` respectively. A
+possible scheme is that when translating a SPIRV program, memory
+operations marked ``NonPrivatePointer`` should not be cached, and the
+cache contents should not be touched during an ``acquire`` and
+``release`` operation.
+
+This could be implemented using the tags that share the ``vulkan:`` prefix,
+as follows:
+
+- For memory operations:
+
+  - Operations with ``vulkan:nonprivate`` should bypass the cache.
+  - Operations with ``vulkan:private`` should be cached.
+  - Operations that specify neither or both should conservatively
+    bypass the cache to ensure correctness.
+
+- For synchronizing operations:
+
+  - Operations with ``vulkan:nonprivate`` should not flush or
+    invalidate the cache.
+  - Operations with ``vulkan:private`` should flush or invalidate the...
[truncated]

@llvmbot
Copy link
Collaborator

llvmbot commented Jan 18, 2024

@llvm/pr-subscribers-llvm-globalisel

Author: Pierre van Houtryve (Pierre-vh)

Changes

Note: please only review the last commit!

Using MMRAs, implement builtin_amdgcn_fence_opencl to allow device libs to emit fences that only target one or more address spaces, instead of fencing all address spaces at once.


Patch is 231.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/78572.diff

44 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+1)
  • (modified) clang/lib/CodeGen/CGBuiltin.cpp (+27)
  • (modified) clang/lib/CodeGen/CodeGenFunction.h (+2)
  • (modified) clang/lib/Sema/SemaChecking.cpp (+6-1)
  • (added) clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp (+108)
  • (added) llvm/docs/MemoryModelRelaxationAnnotations.rst (+476)
  • (modified) llvm/docs/Reference.rst (+4)
  • (modified) llvm/include/llvm/Analysis/VectorUtils.h (+1-1)
  • (modified) llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h (+9)
  • (modified) llvm/include/llvm/CodeGen/MachineFunction.h (+2-1)
  • (modified) llvm/include/llvm/CodeGen/MachineInstr.h (+37-10)
  • (modified) llvm/include/llvm/CodeGen/MachineInstrBuilder.h (+32-14)
  • (modified) llvm/include/llvm/CodeGen/SelectionDAG.h (+12)
  • (modified) llvm/include/llvm/IR/FixedMetadataKinds.def (+1)
  • (added) llvm/include/llvm/IR/MemoryModelRelaxationAnnotations.h (+101)
  • (modified) llvm/lib/Analysis/VectorUtils.cpp (+10-1)
  • (modified) llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp (+1)
  • (modified) llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp (+3-1)
  • (modified) llvm/lib/CodeGen/MIRPrinter.cpp (+7)
  • (modified) llvm/lib/CodeGen/MachineFunction.cpp (+2-2)
  • (modified) llvm/lib/CodeGen/MachineInstr.cpp (+37-12)
  • (modified) llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp (+9)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+11-6)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp (+7)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp (+2)
  • (modified) llvm/lib/IR/CMakeLists.txt (+1)
  • (modified) llvm/lib/IR/Instruction.cpp (+13-1)
  • (added) llvm/lib/IR/MemoryModelRelaxationAnnotations.cpp (+201)
  • (modified) llvm/lib/IR/Verifier.cpp (+27)
  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+23-8)
  • (modified) llvm/lib/Transforms/Scalar/GVNHoist.cpp (+2-1)
  • (modified) llvm/lib/Transforms/Utils/FunctionComparator.cpp (+5)
  • (modified) llvm/lib/Transforms/Utils/Local.cpp (+14-1)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/mmra.ll (+49)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll (+1428)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll (+1188)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-private.ll (+1188)
  • (added) llvm/test/CodeGen/AMDGPU/mmra.ll (+254)
  • (added) llvm/test/Verifier/mmra-allowed.ll (+33)
  • (added) llvm/test/Verifier/mmra.ll (+37)
  • (modified) llvm/unittests/CodeGen/MachineInstrTest.cpp (+51)
  • (modified) llvm/unittests/IR/CMakeLists.txt (+1)
  • (added) llvm/unittests/IR/MemoryModelRelaxationAnnotationsTest.cpp (+34)
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index e562ef04a30194..b9cd21bbfe514c 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -68,6 +68,7 @@ BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n")
 BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
 BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
 BUILTIN(__builtin_amdgcn_fence, "vUicC*", "n")
+BUILTIN(__builtin_amdgcn_fence_opencl, "vUiUicC*", "n")
 BUILTIN(__builtin_amdgcn_groupstaticsize, "Ui", "n")
 
 BUILTIN(__builtin_amdgcn_atomic_inc32, "UZiUZiD*UZiUicC*", "n")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index de48b15645b1ab..050539b3816e54 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -55,6 +55,7 @@
 #include "llvm/IR/IntrinsicsX86.h"
 #include "llvm/IR/MDBuilder.h"
 #include "llvm/IR/MatrixBuilder.h"
+#include "llvm/IR/MemoryModelRelaxationAnnotations.h"
 #include "llvm/Support/ConvertUTF.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/ScopedPrinter.h"
@@ -17859,6 +17860,25 @@ llvm::Value *CodeGenFunction::EmitScalarOrConstFoldImmArg(unsigned ICEArguments,
   return Arg;
 }
 
+void CodeGenFunction::AddAMDGCNAddressSpaceMMRA(llvm::Instruction *Inst,
+                                                llvm::Value *ASMask) {
+  constexpr const char *Tag = "opencl-fence-mem";
+
+  uint64_t Mask = cast<llvm::ConstantInt>(ASMask)->getZExtValue();
+  if (Mask == 0)
+    return;
+
+  // 3 bits can be set: local, global, image in that order.
+  MMRAMetadata MMRAs;
+  if (Mask & (1 << 0))
+    MMRAs.addTag(Tag, "local");
+  if (Mask & (1 << 1))
+    MMRAs.addTag(Tag, "global");
+  if (Mask & (1 << 2))
+    MMRAs.addTag(Tag, "image");
+  Inst->setMetadata(LLVMContext::MD_MMRA, MMRAs.getAsMD(getLLVMContext()));
+}
+
 Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                                               const CallExpr *E) {
   llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
@@ -18348,6 +18368,13 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                             EmitScalarExpr(E->getArg(1)), AO, SSID);
     return Builder.CreateFence(AO, SSID);
   }
+  case AMDGPU::BI__builtin_amdgcn_fence_opencl: {
+    ProcessOrderScopeAMDGCN(EmitScalarExpr(E->getArg(1)),
+                            EmitScalarExpr(E->getArg(2)), AO, SSID);
+    FenceInst *Fence = Builder.CreateFence(AO, SSID);
+    AddAMDGCNAddressSpaceMMRA(Fence, EmitScalarExpr(E->getArg(0)));
+    return Fence;
+  }
   case AMDGPU::BI__builtin_amdgcn_atomic_inc32:
   case AMDGPU::BI__builtin_amdgcn_atomic_inc64:
   case AMDGPU::BI__builtin_amdgcn_atomic_dec32:
diff --git a/clang/lib/CodeGen/CodeGenFunction.h b/clang/lib/CodeGen/CodeGenFunction.h
index 143ad64e8816b1..93a11c379e82c9 100644
--- a/clang/lib/CodeGen/CodeGenFunction.h
+++ b/clang/lib/CodeGen/CodeGenFunction.h
@@ -4401,6 +4401,8 @@ class CodeGenFunction : public CodeGenTypeCache {
   llvm::Value *EmitHexagonBuiltinExpr(unsigned BuiltinID, const CallExpr *E);
   llvm::Value *EmitRISCVBuiltinExpr(unsigned BuiltinID, const CallExpr *E,
                                     ReturnValueSlot ReturnValue);
+
+  void AddAMDGCNAddressSpaceMMRA(llvm::Instruction *Inst, llvm::Value *ASMask);
   void ProcessOrderScopeAMDGCN(llvm::Value *Order, llvm::Value *Scope,
                                llvm::AtomicOrdering &AO,
                                llvm::SyncScope::ID &SSID);
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index ace3e386988f00..0a98f5504ff4f6 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -5098,6 +5098,10 @@ bool Sema::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
     OrderIndex = 0;
     ScopeIndex = 1;
     break;
+  case AMDGPU::BI__builtin_amdgcn_fence_opencl:
+    OrderIndex = 1;
+    ScopeIndex = 2;
+    break;
   default:
     return false;
   }
@@ -5120,7 +5124,8 @@ bool Sema::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
   switch (static_cast<llvm::AtomicOrderingCABI>(Ord)) {
   case llvm::AtomicOrderingCABI::relaxed:
   case llvm::AtomicOrderingCABI::consume:
-    if (BuiltinID == AMDGPU::BI__builtin_amdgcn_fence)
+    if (BuiltinID == AMDGPU::BI__builtin_amdgcn_fence ||
+        BuiltinID == AMDGPU::BI__builtin_amdgcn_fence_opencl)
       return Diag(ArgExpr->getBeginLoc(),
                   diag::warn_atomic_op_has_invalid_memory_order)
              << 0 << ArgExpr->getSourceRange();
diff --git a/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp b/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp
new file mode 100644
index 00000000000000..b4ac7ca8c178d9
--- /dev/null
+++ b/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp
@@ -0,0 +1,108 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals --version 3
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 %s -emit-llvm -O0 -o - \
+// RUN:   -triple=amdgcn-amd-amdhsa -Qn -mcode-object-version=none | FileCheck %s
+
+#define LOCAL_MASK (1 << 0)
+#define GLOBAL_MASK (1 << 1)
+#define IMAGE_MASK (1 << 2)
+
+//.
+// CHECK: @.str = private unnamed_addr addrspace(4) constant [10 x i8] c"workgroup\00", align 1
+// CHECK: @.str.1 = private unnamed_addr addrspace(4) constant [6 x i8] c"agent\00", align 1
+// CHECK: @.str.2 = private unnamed_addr addrspace(4) constant [1 x i8] zeroinitializer, align 1
+//.
+// CHECK-LABEL: define dso_local void @_Z10test_localv(
+// CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META1:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META1]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META1]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META1]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META1]]
+// CHECK-NEXT:    ret void
+//
+void test_local() {
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z11test_globalv(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META2:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META2]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META2]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META2]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META2]]
+// CHECK-NEXT:    ret void
+//
+void test_global() {
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z10test_imagev(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META3:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META3]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META3]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META3]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META3]]
+// CHECK-NEXT:    ret void
+//
+void test_image() {
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z10test_mixedv(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META4:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META5:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META5]]
+// CHECK-NEXT:    ret void
+//
+void test_mixed() {
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK | GLOBAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK | GLOBAL_MASK | LOCAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
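+  // Bits beyond the three known address spaces are ignored, so 0xFF is
+  // equivalent to LOCAL_MASK | GLOBAL_MASK | IMAGE_MASK.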
+  __builtin_amdgcn_fence_opencl(0xFF,__ATOMIC_SEQ_CST, "workgroup");
+}
+//.
+// CHECK: attributes #[[ATTR0]] = { mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+//.
+// CHECK: [[META0:![0-9]+]] = !{i32 1, !"wchar_size", i32 4}
+// CHECK: [[META1]] = !{!"opencl-fence-mem", !"local"}
+// CHECK: [[META2]] = !{!"opencl-fence-mem", !"global"}
+// CHECK: [[META3]] = !{!"opencl-fence-mem", !"image"}
+// CHECK: [[META4]] = !{[[META3]], [[META2]]}
+// CHECK: [[META5]] = !{[[META3]], [[META2]], [[META1]]}
+//.
diff --git a/llvm/docs/MemoryModelRelaxationAnnotations.rst b/llvm/docs/MemoryModelRelaxationAnnotations.rst
new file mode 100644
index 00000000000000..e83e365a3a57de
--- /dev/null
+++ b/llvm/docs/MemoryModelRelaxationAnnotations.rst
@@ -0,0 +1,476 @@
+===================================
+Memory Model Relaxation Annotations
+===================================
+
+.. contents::
+   :local:
+
+Introduction
+============
+
+Memory Model Relaxation Annotations (MMRAs) are target-defined properties
+on instructions that can be used to selectively relax constraints placed
+by the memory model. For example:
+
+* The use of ``VulkanMemoryModel`` in a SPIRV program allows certain
+  memory operations to be reordered across ``acquire`` or ``release``
+  operations.
+* OpenCL APIs expose primitives to fence only a specific set of address
+  spaces; carrying that information to the backend can enable the
+  use of faster synchronization instructions, rather than fencing all
+  address spaces.
+
+MMRAs offer an opt-in system for targets to relax the default LLVM
+memory model.
+As such, they are attached to an operation using LLVM metadata which
+can always be dropped without affecting correctness.
+
+Definitions
+===========
+
+memory operation
+    A load, a store, an atomic, or a function call that is marked as
+    accessing memory.
+
+synchronizing operation
+    An instruction that synchronizes memory with other threads (e.g.
+    an atomic or a fence).
+
+tag
+    Metadata attached to a memory or synchronizing operation
+    that represents some target-defined property regarding memory
+    synchronization.
+
+    An operation may have multiple tags that each represent a different
+    property.
+
+    A tag is composed of a pair of metadata nodes:
+
+    * a *prefix* string.
+    * a *suffix* integer or string.
+
+    In LLVM IR, the pair is represented using a metadata tuple.
+    In other cases (comments, documentation, etc.), we may use the
+    ``prefix:suffix`` notation.
+    For example:
+
+    .. code-block::
+      :caption: Example: Tags in Metadata
+
+      !0 = !{!"scope", !"workgroup"}  # scope:workgroup
+      !1 = !{!"scope", !"device"}     # scope:device
+      !2 = !{!"scope", !"system"}     # scope:system
+      !3 = !{!"sync-as", i32 2}  # sync-as:2
+      !4 = !{!"sync-as", i32 1}  # sync-as:1
+      !5 = !{!"sync-as", i32 0}  # sync-as:0
+
+    .. note::
+
+      The only semantics relevant to the optimizer is the
+      "compatibility" relation defined below. All other
+      semantics are target defined.
+
+    Tags can also be organized in lists to allow operations
+    to specify all of the tags that apply to them. Such a list
+    is referred to as a "set of tags".
+
+    .. code-block::
+      :caption: Example: Set of Tags in Metadata
+
+      !0 = !{!"scope", !"workgroup"}
+      !1 = !{!"sync-as", !"private"}
+      !2 = !{!0, !1}
+
+    .. note::
+
+      If an operation does not have MMRA metadata, it's treated as if
+      it has an empty list (``!{}``) of tags.
+
+    Note that it is not an error if a tag is not recognized by the
+    instruction it is applied to, or by the current target.
+    Such tags are simply ignored.
+
+    Both synchronizing operations and memory operations can have
+    zero or more tags attached to them using the ``!mmra`` syntax.
+
+    For the sake of readability in examples below,
+    we use a (non-functional) short syntax to represent MMRA metadata:
+
+    .. code-block::
+      :caption: Short Syntax Example
+
+      store %ptr1 # foo:bar
+      store %ptr1 !mmra !{!"foo", !"bar"}
+
+    These two notations can be used in this document and are strictly
+    equivalent. However, only the second version is functional.
+
+compatibility
+    Two sets of tags are said to be *compatible* iff, for every unique
+    tag prefix P present in at least one set:
+
+    - the other set contains no tag with prefix P, or
+    - at least one tag with prefix P is common to both sets.
+
+    The above definition implies that an empty set is always compatible
+    with any other set. This is an important property as it ensures that
+    if a transform drops the metadata on an operation, it can never affect
+    correctness. In other words, the memory model cannot be relaxed further
+    by deleting metadata from instructions.
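+
+    As an illustration, the compatibility check can be sketched in C++.
+    This is a minimal sketch assuming tags are stored as
+    ``(prefix, suffix)`` string pairs; the names below are illustrative
+    and do not correspond to the in-tree API.
+
+    .. code-block:: c++
+      :caption: Example: Compatibility Check Sketch
+
+      #include <set>
+      #include <string>
+      #include <utility>
+
+      using TagSet = std::set<std::pair<std::string, std::string>>;
+
+      // Two tag sets are compatible iff every prefix used by *both* sets
+      // has at least one full (prefix, suffix) tag in common.
+      bool isCompatible(const TagSet &A, const TagSet &B) {
+        for (const auto &TagA : A) {
+          bool BUsesPrefix = false, SharesTag = false;
+          for (const auto &TagB : B) {
+            if (TagB.first != TagA.first)
+              continue;
+            BUsesPrefix = true;
+            if (A.count(TagB))
+              SharesTag = true;
+          }
+          // A prefix present in only one of the two sets never conflicts.
+          if (BUsesPrefix && !SharesTag)
+            return false;
+        }
+        return true;
+      }
+
+    With this definition an empty set is compatible with any set, which
+    matches the property described above: dropping the metadata can only
+    make an operation *more* conservatively ordered, never less.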
+
+.. _HappensBefore:
+
+The *happens-before* Relation
+==============================
+
+Compatibility checks can be used to opt out of the *happens-before* relation
+established between two instructions.
+
+Ordering
+    When two instructions' metadata are not compatible, any program order
+    between them is not in *happens-before*.
+
+    For example, consider two tags ``foo:bar`` and
+    ``foo:baz`` exposed by a target:
+
+    .. code-block::
+
+       A: store %ptr1                 # foo:bar
+       B: store %ptr2                 # foo:baz
+       X: store atomic release %ptr3  # foo:bar
+
+    In the above example, ``A`` is compatible with ``X``, and hence ``A``
+    happens-before ``X``. But ``B`` is not compatible with ``X``, and
+    hence it does not happen-before ``X``.
+
+Synchronization
+    If a synchronizing operation has one or more tags, then whether it
+    participates in the ``seq_cst`` order with other operations is target
+    dependent.
+
+    .. code-block::
+
+       ; Depending on the semantics of foo:bar & foo:bux, this may not
+       ; synchronize with another sequence.
+       fence release               # foo:bar
+       store atomic %ptr1          # foo:bux
+
+Examples
+--------
+
+.. code-block:: text
+  :caption: Example 1
+
+   A: store ptr addrspace(1) %ptr2                  # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3   # sync-as:0 vulkan:nonprivate
+
+A and B are not ordered relative to each other
+(no *happens-before*) because their sets of tags are not compatible.
+
+Note that the ``sync-as`` value does not have to match the ``addrspace`` value.
+For example, in Example 1 the store-release to a location in ``addrspace(1)``
+only wants to synchronize with operations happening in ``addrspace(0)``.
+
+.. code-block:: text
+  :caption: Example 2
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3  # sync-as:1 vulkan:nonprivate
+
+The ordering of A and B is unaffected because their sets of tags are
+compatible.
+
+Note that A and B may or may not be in *happens-before* due to other reasons.
+
+.. code-block:: text
+  :caption: Example 3
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3  # vulkan:nonprivate
+
+The ordering of A and B is unaffected because their sets of tags are
+compatible.
+
+.. code-block:: text
+  :caption: Example 4
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1
+   B: store atomic release ptr addrspace(1) %ptr3  # sync-as:2
+
+A and B do not have to be ordered relative to each other
+(no *happens-before*) because their sets of tags are not compatible.
+
+Use-cases
+=========
+
+SPIRV ``NonPrivatePointer``
+---------------------------
+
+MMRAs can support the SPIRV capability
+``VulkanMemoryModel``, where synchronizing operations only affect
+memory operations that specify ``NonPrivatePointer`` semantics.
+
+The example below is generated from a SPIRV program using the
+following recipe:
+
+- Add ``vulkan:nonprivate`` to every synchronizing operation.
+- Add ``vulkan:nonprivate`` to every non-atomic memory operation
+  that is marked ``NonPrivatePointer``.
+- Add ``vulkan:private`` to tags of every non-atomic memory operation
+  that is not marked ``NonPrivatePointer``.
+
+.. code-block::
+
+   Thread T1:
+    A: store %ptr1                 # vulkan:nonprivate
+    B: store %ptr2                 # vulkan:private
+    X: store atomic release %ptr3  # vulkan:nonprivate
+
+   Thread T2:
+    Y: load atomic acquire %ptr3   # vulkan:nonprivate
+    C: load %ptr2                  # vulkan:private
+    D: load %ptr1                  # vulkan:nonprivate
+
+Compatibility ensures that operation ``A`` is ordered
+relative to ``X`` while operation ``D`` is ordered relative to ``Y``.
+If ``X`` synchronizes with ``Y``, then ``A`` happens-before ``D``.
+No such relation can be inferred about operations ``B`` and ``C``.
+
+.. note::
+   The `Vulkan Memory Model <https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#memory-model-non-private>`_
+   considers all atomic operations non-private.
+
+   Whether ``vulkan:nonprivate`` would be specified on atomic operations is
+   an implementation detail, as an atomic operation is always ``nonprivate``.
+   The implementation may choose to be explicit and emit IR with
+   ``vulkan:nonprivate`` on every atomic operation, or it could choose to
+   only emit ``vulkan:private`` and assume ``vulkan:nonprivate``
+   by default.
+
+Operations marked with ``vulkan:private`` effectively opt out of the
+happens-before order in a SPIRV program since they are incompatible
+with every synchronizing operation. Note that SPIRV operations that
+are not marked ``NonPrivatePointer`` are not entirely private to the
+thread --- they are implicitly synchronized at the start or end of a
+thread by the Vulkan *system-synchronizes-with* relationship. This
+example assumes that the target-defined semantics of
+``vulkan:private`` correctly implements this property.
+
+This scheme is general enough to express the interoperability of SPIRV
+programs with other environments.
+
+.. code-block::
+
+   Thread T1:
+   A: store %ptr1                 # vulkan:nonprivate
+   X: store atomic release %ptr2  # vulkan:nonprivate
+
+   Thread T2:
+   Y: load atomic acquire %ptr2   # foo:bar
+   B: load %ptr1
+
+In the above example, thread ``T1`` originates from a SPIRV program
+while thread ``T2`` originates from a non-SPIRV program. Whether ``X``
+can synchronize with ``Y`` is target defined. If ``X`` synchronizes
+with ``Y``, then ``A`` happens-before ``B`` (because A/X and
+Y/B are compatible).
+
+Implementation Example
+~~~~~~~~~~~~~~~~~~~~~~
+
+Consider the implementation of SPIRV ``NonPrivatePointer`` on a target
+where all memory operations are cached, and the entire cache is
+flushed or invalidated at a ``release`` or ``acquire``, respectively. A
+possible scheme is that, when translating a SPIRV program, memory
+operations marked ``NonPrivatePointer`` should not be cached, and the
+cache contents should not be touched during ``acquire`` and
+``release`` operations.
+
+This could be implemented using the tags that share the ``vulkan:`` prefix,
+as follows:
+
+- For memory operations:
+
+  - Operations with ``vulkan:nonprivate`` should bypass the cache.
+  - Operations with ``vulkan:private`` should be cached.
+  - Operations that specify neither or both should conservatively
+    bypass the cache to ensure correctness.
+
+- For synchronizing operations:
+
+  - Operations with ``vulkan:nonprivate`` should not flush or
+    invalidate the cache.
+  - Operations with ``vulkan:private`` should flush or invalidate the...
[truncated]

@llvmbot
Collaborator

llvmbot commented Jan 18, 2024

@llvm/pr-subscribers-backend-amdgpu


@llvmbot
Collaborator

llvmbot commented Jan 18, 2024

@llvm/pr-subscribers-llvm-analysis

Author: Pierre van Houtryve (Pierre-vh)

Changes

Note: please only review the last commit!

Using MMRAs, implement builtin_amdgcn_fence_opencl to allow device libs to emit fences that only target one or more address spaces, instead of fencing all address spaces at once.


Patch is 231.85 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/78572.diff

44 Files Affected:

  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+1)
  • (modified) clang/lib/CodeGen/CGBuiltin.cpp (+27)
  • (modified) clang/lib/CodeGen/CodeGenFunction.h (+2)
  • (modified) clang/lib/Sema/SemaChecking.cpp (+6-1)
  • (added) clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp (+108)
  • (added) llvm/docs/MemoryModelRelaxationAnnotations.rst (+476)
  • (modified) llvm/docs/Reference.rst (+4)
  • (modified) llvm/include/llvm/Analysis/VectorUtils.h (+1-1)
  • (modified) llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h (+9)
  • (modified) llvm/include/llvm/CodeGen/MachineFunction.h (+2-1)
  • (modified) llvm/include/llvm/CodeGen/MachineInstr.h (+37-10)
  • (modified) llvm/include/llvm/CodeGen/MachineInstrBuilder.h (+32-14)
  • (modified) llvm/include/llvm/CodeGen/SelectionDAG.h (+12)
  • (modified) llvm/include/llvm/IR/FixedMetadataKinds.def (+1)
  • (added) llvm/include/llvm/IR/MemoryModelRelaxationAnnotations.h (+101)
  • (modified) llvm/lib/Analysis/VectorUtils.cpp (+10-1)
  • (modified) llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp (+1)
  • (modified) llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp (+3-1)
  • (modified) llvm/lib/CodeGen/MIRPrinter.cpp (+7)
  • (modified) llvm/lib/CodeGen/MachineFunction.cpp (+2-2)
  • (modified) llvm/lib/CodeGen/MachineInstr.cpp (+37-12)
  • (modified) llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp (+9)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+11-6)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp (+7)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp (+2)
  • (modified) llvm/lib/IR/CMakeLists.txt (+1)
  • (modified) llvm/lib/IR/Instruction.cpp (+13-1)
  • (added) llvm/lib/IR/MemoryModelRelaxationAnnotations.cpp (+201)
  • (modified) llvm/lib/IR/Verifier.cpp (+27)
  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+23-8)
  • (modified) llvm/lib/Transforms/Scalar/GVNHoist.cpp (+2-1)
  • (modified) llvm/lib/Transforms/Utils/FunctionComparator.cpp (+5)
  • (modified) llvm/lib/Transforms/Utils/Local.cpp (+14-1)
  • (added) llvm/test/CodeGen/AMDGPU/GlobalISel/mmra.ll (+49)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll (+1428)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-local.ll (+1188)
  • (added) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-private.ll (+1188)
  • (added) llvm/test/CodeGen/AMDGPU/mmra.ll (+254)
  • (added) llvm/test/Verifier/mmra-allowed.ll (+33)
  • (added) llvm/test/Verifier/mmra.ll (+37)
  • (modified) llvm/unittests/CodeGen/MachineInstrTest.cpp (+51)
  • (modified) llvm/unittests/IR/CMakeLists.txt (+1)
  • (added) llvm/unittests/IR/MemoryModelRelaxationAnnotationsTest.cpp (+34)
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index e562ef04a30194e..b9cd21bbfe514c6 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -68,6 +68,7 @@ BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n")
 BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
 BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
 BUILTIN(__builtin_amdgcn_fence, "vUicC*", "n")
+BUILTIN(__builtin_amdgcn_fence_opencl, "vUiUicC*", "n")
 BUILTIN(__builtin_amdgcn_groupstaticsize, "Ui", "n")
 
 BUILTIN(__builtin_amdgcn_atomic_inc32, "UZiUZiD*UZiUicC*", "n")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index de48b15645b1abe..050539b3816e544 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -55,6 +55,7 @@
 #include "llvm/IR/IntrinsicsX86.h"
 #include "llvm/IR/MDBuilder.h"
 #include "llvm/IR/MatrixBuilder.h"
+#include "llvm/IR/MemoryModelRelaxationAnnotations.h"
 #include "llvm/Support/ConvertUTF.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/ScopedPrinter.h"
@@ -17859,6 +17860,25 @@ llvm::Value *CodeGenFunction::EmitScalarOrConstFoldImmArg(unsigned ICEArguments,
   return Arg;
 }
 
+void CodeGenFunction::AddAMDGCNAddressSpaceMMRA(llvm::Instruction *Inst,
+                                                llvm::Value *ASMask) {
+  constexpr const char *Tag = "opencl-fence-mem";
+
+  uint64_t Mask = cast<llvm::ConstantInt>(ASMask)->getZExtValue();
+  if (Mask == 0)
+    return;
+
+  // 3 bits can be set: local, global, image in that order.
+  MMRAMetadata MMRAs;
+  if (Mask & (1 << 0))
+    MMRAs.addTag(Tag, "local");
+  if (Mask & (1 << 1))
+    MMRAs.addTag(Tag, "global");
+  if (Mask & (1 << 2))
+    MMRAs.addTag(Tag, "image");
+  Inst->setMetadata(LLVMContext::MD_MMRA, MMRAs.getAsMD(getLLVMContext()));
+}
+
 Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                                               const CallExpr *E) {
   llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
@@ -18348,6 +18368,13 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                             EmitScalarExpr(E->getArg(1)), AO, SSID);
     return Builder.CreateFence(AO, SSID);
   }
+  case AMDGPU::BI__builtin_amdgcn_fence_opencl: {
+    ProcessOrderScopeAMDGCN(EmitScalarExpr(E->getArg(1)),
+                            EmitScalarExpr(E->getArg(2)), AO, SSID);
+    FenceInst *Fence = Builder.CreateFence(AO, SSID);
+    AddAMDGCNAddressSpaceMMRA(Fence, EmitScalarExpr(E->getArg(0)));
+    return Fence;
+  }
   case AMDGPU::BI__builtin_amdgcn_atomic_inc32:
   case AMDGPU::BI__builtin_amdgcn_atomic_inc64:
   case AMDGPU::BI__builtin_amdgcn_atomic_dec32:
diff --git a/clang/lib/CodeGen/CodeGenFunction.h b/clang/lib/CodeGen/CodeGenFunction.h
index 143ad64e8816b12..93a11c379e82c94 100644
--- a/clang/lib/CodeGen/CodeGenFunction.h
+++ b/clang/lib/CodeGen/CodeGenFunction.h
@@ -4401,6 +4401,8 @@ class CodeGenFunction : public CodeGenTypeCache {
   llvm::Value *EmitHexagonBuiltinExpr(unsigned BuiltinID, const CallExpr *E);
   llvm::Value *EmitRISCVBuiltinExpr(unsigned BuiltinID, const CallExpr *E,
                                     ReturnValueSlot ReturnValue);
+
+  void AddAMDGCNAddressSpaceMMRA(llvm::Instruction *Inst, llvm::Value *ASMask);
   void ProcessOrderScopeAMDGCN(llvm::Value *Order, llvm::Value *Scope,
                                llvm::AtomicOrdering &AO,
                                llvm::SyncScope::ID &SSID);
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index ace3e386988f005..0a98f5504ff4f6c 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -5098,6 +5098,10 @@ bool Sema::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
     OrderIndex = 0;
     ScopeIndex = 1;
     break;
+  case AMDGPU::BI__builtin_amdgcn_fence_opencl:
+    OrderIndex = 1;
+    ScopeIndex = 2;
+    break;
   default:
     return false;
   }
@@ -5120,7 +5124,8 @@ bool Sema::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
   switch (static_cast<llvm::AtomicOrderingCABI>(Ord)) {
   case llvm::AtomicOrderingCABI::relaxed:
   case llvm::AtomicOrderingCABI::consume:
-    if (BuiltinID == AMDGPU::BI__builtin_amdgcn_fence)
+    if (BuiltinID == AMDGPU::BI__builtin_amdgcn_fence ||
+        BuiltinID == AMDGPU::BI__builtin_amdgcn_fence_opencl)
       return Diag(ArgExpr->getBeginLoc(),
                   diag::warn_atomic_op_has_invalid_memory_order)
              << 0 << ArgExpr->getSourceRange();
diff --git a/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp b/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp
new file mode 100644
index 000000000000000..b4ac7ca8c178d91
--- /dev/null
+++ b/clang/test/CodeGenCXX/builtin-amdgcn-fence-opencl.cpp
@@ -0,0 +1,108 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals --version 3
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 %s -emit-llvm -O0 -o - \
+// RUN:   -triple=amdgcn-amd-amdhsa -Qn -mcode-object-version=none | FileCheck %s
+
+#define LOCAL_MASK (1 << 0)
+#define GLOBAL_MASK (1 << 1)
+#define IMAGE_MASK (1 << 2)
+
+//.
+// CHECK: @.str = private unnamed_addr addrspace(4) constant [10 x i8] c"workgroup\00", align 1
+// CHECK: @.str.1 = private unnamed_addr addrspace(4) constant [6 x i8] c"agent\00", align 1
+// CHECK: @.str.2 = private unnamed_addr addrspace(4) constant [1 x i8] zeroinitializer, align 1
+//.
+// CHECK-LABEL: define dso_local void @_Z10test_localv(
+// CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META1:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META1]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META1]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META1]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META1]]
+// CHECK-NEXT:    ret void
+//
+void test_local() {
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(LOCAL_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z11test_globalv(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META2:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META2]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META2]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META2]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META2]]
+// CHECK-NEXT:    ret void
+//
+void test_global() {
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z10test_imagev(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META3:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("agent") acquire, !mmra [[META3]]
+// CHECK-NEXT:    fence seq_cst, !mmra [[META2]]
+// CHECK-NEXT:    fence syncscope("agent") acq_rel, !mmra [[META3]]
+// CHECK-NEXT:    fence syncscope("workgroup") release, !mmra [[META3]]
+// CHECK-NEXT:    ret void
+//
+void test_image() {
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK,__ATOMIC_ACQUIRE, "agent");
+
+  __builtin_amdgcn_fence_opencl(GLOBAL_MASK,__ATOMIC_SEQ_CST, "");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, 4, "agent");
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK, 3, "workgroup");
+}
+
+// CHECK-LABEL: define dso_local void @_Z10test_mixedv(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META4:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META5:![0-9]+]]
+// CHECK-NEXT:    fence syncscope("workgroup") seq_cst, !mmra [[META5]]
+// CHECK-NEXT:    ret void
+//
+void test_mixed() {
+
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK | GLOBAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+  __builtin_amdgcn_fence_opencl(IMAGE_MASK | GLOBAL_MASK | LOCAL_MASK, __ATOMIC_SEQ_CST, "workgroup");
+
+  __builtin_amdgcn_fence_opencl(0xFF,__ATOMIC_SEQ_CST, "workgroup");
+}
+//.
+// CHECK: attributes #[[ATTR0]] = { mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+//.
+// CHECK: [[META0:![0-9]+]] = !{i32 1, !"wchar_size", i32 4}
+// CHECK: [[META1]] = !{!"opencl-fence-mem", !"local"}
+// CHECK: [[META2]] = !{!"opencl-fence-mem", !"global"}
+// CHECK: [[META3]] = !{!"opencl-fence-mem", !"image"}
+// CHECK: [[META4]] = !{[[META3]], [[META2]]}
+// CHECK: [[META5]] = !{[[META3]], [[META2]], [[META1]]}
+//.
diff --git a/llvm/docs/MemoryModelRelaxationAnnotations.rst b/llvm/docs/MemoryModelRelaxationAnnotations.rst
new file mode 100644
index 000000000000000..e83e365a3a57de1
--- /dev/null
+++ b/llvm/docs/MemoryModelRelaxationAnnotations.rst
@@ -0,0 +1,476 @@
+===================================
+Memory Model Relaxation Annotations
+===================================
+
+.. contents::
+   :local:
+
+Introduction
+============
+
+Memory Model Relaxation Annotations (MMRAs) are target-defined properties
+on instructions that can be used to selectively relax constraints placed
+by the memory model. For example:
+
+* The use of ``VulkanMemoryModel`` in a SPIRV program allows certain
+  memory operations to be reordered across ``acquire`` or ``release``
+  operations.
+* OpenCL APIs expose primitives to only fence a specific set of address
+  spaces, carrying that information to the backend can enable the
+  use of faster synchronization instructions, rather than fencing all
+  address spaces.
+
+MMRAs offer an opt-in system for targets to relax the default LLVM
+memory model.
+As such, they are attached to an operation using LLVM metadata which
+can always be dropped without affecting correctness.
+
+Definitions
+===========
+
+memory operation
+    A load, a store, an atomic, or a function call that is marked as
+    accessing memory.
+
+synchronizing operation
+    An instruction that synchronizes memory with other threads (e.g.
+    an atomic or a fence).
+
+tag
+    Metadata attached to a memory or synchronizing operation
+    that represents some target-defined property regarding memory
+    synchronization.
+
+    An operation may have multiple tags that each represent a different
+    property.
+
+    A tag is composed of a pair of metadata nodes:
+
+    * a *prefix* string.
+    * a *suffix* integer or string.
+
+    In LLVM IR, the pair is represented using a metadata tuple.
+    In other cases (comments, documentation, etc.), we may use the
+    ``prefix:suffix`` notation.
+    For example:
+
+    .. code-block::
+      :caption: Example: Tags in Metadata
+
+      !0 = !{!"scope", !"workgroup"}  # scope:workgroup
+      !1 = !{!"scope", !"device"}     # scope:device
+      !2 = !{!"scope", !"system"}     # scope:system
+      !3 = !{!"sync-as", i32 2}  # sync-as:2
+      !4 = !{!"sync-as", i32 1}  # sync-as:1
+      !5 = !{!"sync-as", i32 0}  # sync-as:0
+
+    .. note::
+
+      The only semantics relevant to the optimizer is the
+      "compatibility" relation defined below. All other
+      semantics are target defined.
+
+    Tags can also be organised in lists to allow operations
+    to specify all of the tags they belong to. Such a list
+    is referred to as a "set of tags".
+
+    .. code-block::
+      :caption: Example: Set of Tags in Metadata
+
+      !0 = !{!"scope", !"workgroup"}
+      !1 = !{!"sync-as", !"private"}
+      !2 = !{!0, !2}
+
+    .. note::
+
+      If an operation does not have MMRA metadata, it's treated as if
+      it has an empty list (``!{}``) of tags.
+
+    Note that it is not an error if a tag is not recognized by the
+    instruction it is applied to, or by the current target.
+    Such tags are simply ignored.
+
+    Both synchronizing operations and memory operations can have
+    zero or more tags attached to them using the ``!mmra`` syntax.
+
+    For the sake of readability in examples below,
+    we use a (non-functional) short syntax to represent MMMRA metadata:
+
+    .. code-block::
+      :caption: Short Syntax Example
+
+      store %ptr1 # foo:bar
+      store %ptr1 !mmra !{!"foo", !"bar"}
+
+    These two notations can be used in this document and are strictly
+    equivalent. However, only the second version is functional.
+
+compatibility
+    Two sets of tags are said to be *compatible* iff, for every unique
+    tag prefix P present in at least one set:
+
+    - the other set contains no tag with prefix P, or
+    - at least one tag with prefix P is common to both sets.
+
+    The above definition implies that an empty set is always compatible
+    with any other set. This is an important property as it ensures that
+    if a transform drops the metadata on an operation, it can never affect
+    correctness. In other words, the memory model cannot be relaxed further
+    by deleting metadata from instructions.
+
+.. _HappensBefore:
+
+The *happens-before* Relation
+==============================
+
+Compatibility checks can be used to opt out of the *happens-before* relation
+established between two instructions.
+
+Ordering
+    When two instructions' metadata are not compatible, any program order
+    between them are not in *happens-before*.
+
+    For example, consider two tags ``foo:bar`` and
+    ``foo:baz`` exposed by a target:
+
+    .. code-block::
+
+       A: store %ptr1                 # foo:bar
+       B: store %ptr2                 # foo:baz
+       X: store atomic release %ptr3  # foo:bar
+
+    In the above example, ``A`` is compatible with ``X``, and hence ``A``
+    happens-before ``X``. But ``B`` is not compatible with ``X``, and
+    hence it is not in *happens-before* with ``X``.
+
+Synchronization
+    If a synchronizing operation has one or more tags, then whether it
+    participates in the ``seq_cst`` order with other operations is
+    target dependent.
+
+    .. code-block::
+
+       ; Depending on the semantics of foo:bar & foo:bux, this may not
+       ; synchronize with another sequence.
+       fence release               # foo:bar
+       store atomic %ptr1          # foo:bux
+
+Examples
+--------
+
+.. code-block:: text
+  :caption: Example 1
+
+   A: store ptr addrspace(1) %ptr2                  # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3   # sync-as:0 vulkan:nonprivate
+
+A and B are not ordered relative to each other
+(no *happens-before*) because their sets of tags are not compatible.
+
+Note that the ``sync-as`` value does not have to match the ``addrspace`` value.
+For example, in Example 1 the store-release to a location in ``addrspace(1)``
+only wants to synchronize with operations happening on ``addrspace(0)``.
+
+.. code-block:: text
+  :caption: Example 2
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3  # sync-as:1 vulkan:nonprivate
+
+The ordering of A and B is unaffected because their set of tags are
+compatible.
+
+Note that A and B may or may not be in *happens-before* due to other reasons.
+
+.. code-block:: text
+  :caption: Example 3
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate
+   B: store atomic release ptr addrspace(1) %ptr3  # vulkan:nonprivate
+
+The ordering of A and B is unaffected because their set of tags are
+compatible.
+
+.. code-block:: text
+  :caption: Example 4
+
+   A: store ptr addrspace(1) %ptr2                 # sync-as:1
+   B: store atomic release ptr addrspace(1) %ptr3  # sync-as:2
+
+A and B do not have to be ordered relative to each other
+(no *happens-before*) because their sets of tags are not compatible.
+
+Use-cases
+=========
+
+SPIRV ``NonPrivatePointer``
+---------------------------
+
+MMRAs can support the SPIRV capability
+``VulkanMemoryModel``, where synchronizing operations only affect
+memory operations that specify ``NonPrivatePointer`` semantics.
+
+The example below is generated from a SPIRV program using the
+following recipe:
+
+- Add ``vulkan:nonprivate`` to every synchronizing operation.
+- Add ``vulkan:nonprivate`` to every non-atomic memory operation
+  that is marked ``NonPrivatePointer``.
+- Add ``vulkan:private`` to every non-atomic memory operation
+  that is not marked ``NonPrivatePointer``.
+
+.. code-block::
+
+   Thread T1:
+    A: store %ptr1                 # vulkan:nonprivate
+    B: store %ptr2                 # vulkan:private
+    X: store atomic release %ptr3  # vulkan:nonprivate
+
+   Thread T2:
+    Y: load atomic acquire %ptr3   # vulkan:nonprivate
+    C: load %ptr2                  # vulkan:private
+    D: load %ptr1                  # vulkan:nonprivate
+
+Compatibility ensures that operation ``A`` is ordered
+relative to ``X`` while operation ``D`` is ordered relative to ``Y``.
+If ``X`` synchronizes with ``Y``, then ``A`` happens-before ``D``.
+No such relation can be inferred about operations ``B`` and ``C``.
+
+.. note::
+   The `Vulkan Memory Model <https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#memory-model-non-private>`_
+   considers all atomic operations non-private.
+
+   Whether ``vulkan:nonprivate`` would be specified on atomic operations is
+   an implementation detail, as an atomic operation is always ``nonprivate``.
+   The implementation may choose to be explicit and emit IR with
+   ``vulkan:nonprivate`` on every atomic operation, or it could choose to
+   only emit ``vulkan:private`` and assume ``vulkan:nonprivate``
+   by default.
+
+Operations marked with ``vulkan:private`` effectively opt out of the
+happens-before order in a SPIRV program since they are incompatible
+with every synchronizing operation. Note that SPIRV operations that
+are not marked ``NonPrivatePointer`` are not entirely private to the
+thread --- they are implicitly synchronized at the start or end of a
+thread by the Vulkan *system-synchronizes-with* relationship. This
+example assumes that the target-defined semantics of
+``vulkan:private`` correctly implements this property.
+
+This scheme is general enough to express the interoperability of SPIRV
+programs with other environments.
+
+.. code-block::
+
+   Thread T1:
+   A: store %ptr1                 # vulkan:nonprivate
+   X: store atomic release %ptr2  # vulkan:nonprivate
+
+   Thread T2:
+   Y: load atomic acquire %ptr2   # foo:bar
+   B: load %ptr1
+
+In the above example, thread ``T1`` originates from a SPIRV program
+while thread ``T2`` originates from a non-SPIRV program. Whether ``X``
+can synchronize with ``Y`` is target defined.  If ``X`` synchronizes
+with ``Y``, then ``A`` happens before ``B`` (because A/X and
+Y/B are compatible).
+
+Implementation Example
+~~~~~~~~~~~~~~~~~~~~~~
+
+Consider the implementation of SPIRV ``NonPrivatePointer`` on a target
+where all memory operations are cached, and the entire cache is
+flushed or invalidated at a ``release`` or ``acquire`` respectively. A
+possible scheme is that when translating a SPIRV program, memory
+operations marked ``NonPrivatePointer`` should not be cached, and the
+cache contents should not be touched during an ``acquire`` and
+``release`` operation.
+
+This could be implemented using the tags that share the ``vulkan:`` prefix,
+as follows:
+
+- For memory operations:
+
+  - Operations with ``vulkan:nonprivate`` should bypass the cache.
+  - Operations with ``vulkan:private`` should be cached.
+  - Operations that specify neither or both should conservatively
+    bypass the cache to ensure correctness.
+
+- For synchronizing operations:
+
+  - Operations with ``vulkan:nonprivate`` should not flush or
+    invalidate the cache.
+  - Operations with ``vulkan:private`` should flush or in...
[truncated]
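
To make the *compatibility* relation defined in the new document concrete, here is a minimal standalone C++ sketch; the function and container names are ours for illustration, not an API from this PR:

#include <set>
#include <string>
#include <utility>

// A tag is a (prefix, suffix) pair; an operation carries a set of tags.
using Tag = std::pair<std::string, std::string>;
using TagSet = std::set<Tag>;

// Two sets are compatible iff, for every prefix present in at least one
// set, the other set either has no tag with that prefix or shares at
// least one identical tag with that prefix.
static bool isCompatibleWith(const TagSet &A, const TagSet &B) {
  for (const TagSet *Self : {&A, &B}) {
    const TagSet &Other = (Self == &A) ? B : A;
    for (const Tag &T : *Self) {
      bool OtherHasPrefix = false, SharedTag = false;
      for (const Tag &O : Other) {
        if (O.first != T.first)
          continue;
        OtherHasPrefix = true;
        if (Self->count(O))
          SharedTag = true; // a common tag with this prefix exists
      }
      if (OtherHasPrefix && !SharedTag)
        return false;
    }
  }
  // An empty set is trivially compatible with everything, which is why
  // dropping MMRA metadata can never affect correctness.
  return true;
}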

github-actions bot commented Apr 2, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -68,6 +68,7 @@ BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n")
BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
BUILTIN(__builtin_amdgcn_fence, "vUicC*", "n")
BUILTIN(__builtin_amdgcn_fence_opencl, "vUiUicC*", "n")
Contributor

We shouldn't have language specific target builtins. Why isn't the existing fence adequate?

Collaborator

I think the motivation here is that we now have an integer "address space mask" argument to the fence. The default should be 0, to represent "all address spaces", but it can be a smaller subset. This argument gets translated to the MMRAs. But this shouldn't be OpenCL-specific. We need a scheme that works with all input languages that we care about.

// 3 bits can be set: local, global, image in that order.
LLVMContext &Ctx = Inst->getContext();
SmallVector<MMRAMetadata::TagT, 3> MMRAs;
if (Mask & (1 << 0))
Collaborator

These magic numbers for the shift amount should be declared as an Enum somewhere.

Contributor Author

Agreed, but not sure where they should go to be meaningful. Any ideas?
Maybe TargetParser could have a list of MMRA-related values (tag prefixes/suffixes, mask values for builtins, etc.)?
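
For illustration, one possible shape for such an enum (hypothetical names, not part of this PR; note the patch later moved to string arguments instead of a mask):

// Hypothetical bit positions for the fence address-space mask discussed
// above; the names are ours.
enum AMDGPUFenceASBits : unsigned {
  FENCE_AS_LOCAL = 1u << 0,  // local (LDS) memory
  FENCE_AS_GLOBAL = 1u << 1, // global memory
  FENCE_AS_IMAGE = 1u << 2,  // image memory
};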

Contributor

Alternatively could behave more like system scopes where the builtin takes a variadic list of string literal arguments passed directly through

Contributor Author

@arsenm Do you mean just having a space-separated list of address spaces such as "image local"?
If that's acceptable then it definitely works for me too

Contributor

Space separated is weird. I meant more __builtin_amdgcn_something_fence("somesyncscope", ordering, "addrspace0", "addrspace1", ...)
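
A rough sketch of how lowering that variadic string form could look in Clang's CodeGen, assuming the MMRAMetadata helpers this PR adds (this mirrors the AddAMDGCNFenceAddressSpaceMMRA helper quoted later in this review, but is not the PR's exact code):

// Sketch: collect each constant-string address-space argument and attach
// one "amdgpu-as" MMRA tag per name to the emitted fence instruction.
SmallVector<MMRAMetadata::TagT, 3> MMRAs;
for (unsigned K = FirstASNameIdx; K < E->getNumArgs(); ++K) {
  StringRef AS;
  if (llvm::getConstantStringInfo(EmitScalarExpr(E->getArg(K)), AS))
    MMRAs.push_back({"amdgpu-as", AS.str()});
  // else: diagnose a non-constant or unknown address-space name
}
if (!MMRAs.empty())
  Inst->setMetadata(LLVMContext::MD_mmra,
                    MMRAMetadata::getMD(Inst->getContext(), MMRAs));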

@Pierre-vh Pierre-vh changed the title [RFC][AMDGPU] Add OpenCL-specific fence address space masks [AMDGPU] Add OpenCL-specific fence address space masks Apr 24, 2024
@@ -69,6 +69,7 @@ BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n")
BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
BUILTIN(__builtin_amdgcn_fence, "vUicC*", "n")
BUILTIN(__builtin_amdgcn_masked_fence, "vUiUicC*", "n")
Contributor

Should put some documentation in LanguageExtensions for the arguments

Contributor Author

Added docs with tests. I also documented the other fence intrinsic

@Pierre-vh Pierre-vh requested a review from arsenm April 26, 2024 08:40
Comment on lines 4437 to 4439
* ``0x1`` (first bit) is for local memory.
* ``0x2`` (second bit) is for global memory.
* ``0x4`` (third bit) is for image memory.
Contributor

Should either get named enums or string-like treatment like the sync scope


@arsenm
Contributor

arsenm commented May 2, 2024

I'm now wondering if adding a new builtin is needed at all, or if it should just be part of the original builtin? It's an additive change.

Maybe?

Should we also rename the MMRA to amdgpu-fence-as (remove OpenCL from the name) ?

I definitely do not want to maintain any language names in anything

* ``"global"``
* ``"image"``

If there are no address spaces specified, this fence behaves like
Collaborator

To answer comments elsewhere, this documentation makes it really obvious that the two fences are really just one.

llvm::Value *V = EmitScalarExpr(E->getArg(K));
StringRef AS;
if (llvm::getConstantStringInfo(V, AS) &&
(AS == "local" || AS == "global" || AS == "image")) {
Collaborator

Can these magic strings be declared somewhere? Ideally some sort of target-specific mechanism that Clang will use to find such sets of valid strings?

void CodeGenFunction::AddAMDGCNAddressSpaceMMRA(llvm::Instruction *Inst,
const CallExpr *E,
unsigned FirstASNameIdx) {
constexpr const char *Tag = "opencl-fence-mem";
Collaborator

Definitely drop the "opencl" prefix.

@ssahasra
Collaborator

ssahasra commented May 3, 2024

Should we also rename the MMRA to amdgpu-fence-as (remove OpenCL from the name) ?

Even the "fence" prefix is not entirely correct. The same tags also make sense on a load-acquire or store-release, which are "fence like" instructions, or "operations with implicit fences". Why not just "as:global", "as:local", etc? The fact that they are used as !mmra on a fence-like instruction makes it clear that they represent the address spaces that are caused to be synchronized by that instruction, and not incidentally the address space that the load-acquire or store-release itself wants to access.

@Pierre-vh Pierre-vh changed the title [AMDGPU] Add OpenCL-specific fence address space masks [AMDGPU] Add amdgpu-as MMRA for fences May 3, 2024
@Pierre-vh Pierre-vh requested a review from ssahasra May 3, 2024 08:20
Collaborator

@ssahasra ssahasra left a comment

The frontend changes and the MMRA emitted in LLVM IR look good to me. The backend changes also look okay, but please see if anyone else has comments about that.

clang/docs/LanguageExtensions.rst Outdated
clang/lib/CodeGen/CGBuiltin.cpp Outdated
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp Outdated
@@ -18365,6 +18366,28 @@ Value *CodeGenFunction::EmitHLSLBuiltinExpr(unsigned BuiltinID,
return nullptr;
}

void CodeGenFunction::AddAMDGCNFenceAddressSpaceMMRA(llvm::Instruction *Inst,
const CallExpr *E) {
constexpr const char *Tag = "amdgpu-as";
Collaborator

Just bikeshedding a bit, but do we really need the "amdgpu" prefix on the tag? Clang will only generate these for AMDGPU anyway. It's not a blocker, but feels like we are being cautious for no reason.

Contributor Author

I think it's just more readable this way, but I have no strong preference either.
I'd say let's keep it this way, it can very easily be renamed later if we want to.

__builtin_amdgcn_fence(__ATOMIC_ACQUIRE, "agent");

// Fence only requested address spaces.
__builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup", "local");
Contributor

We randomly change between HSA and OpenCL terminology. Maybe we should call "local" "groupsegment"? I guess the ISA manuals call it "local data share"

Contributor Author

I was using the OpenCL terminology: https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/atomic_work_item_fence.html

flags must be set to CLK_GLOBAL_MEM_FENCE, CLK_LOCAL_MEM_FENCE, CLK_IMAGE_MEM_FENCE

Thinking about it, maybe it's better to use names consistent with our AMDGPUAS definitions, so image would just be "private" but local/global stay. What do you think?

Contributor

Not sure we can get away without image

Contributor Author

This is just for internal use right now, so we can use anything we want. Device libs can then map the IMAGE flag to a "private" operand here.

Using MMRAs, implement `builtin_amdgcn_fence_opencl` to allow
device libs to emit fences that only target one or more address spaces, instead of fencing all address spaces at once.
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp Outdated
if (Prefix != FenceASPrefix)
continue;

auto It = find_if(ASNames, [Suffix = Suffix](auto &Pair) {
Collaborator

I didn't understand the default assignment of Suffix. Shouldn't the lambda capture it simply because the outer symbol is used inside? But anyway, this use of find_if shouldn't be necessary with StringMap.

Contributor Author

Lambdas can't capture structured bindings (auto [a, b] = pair) in C++17; I think that comes in C++20 or 23?
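
For reference, a small standalone example of the C++17 restriction being described (nothing PR-specific):

#include <utility>

int demo() {
  std::pair<int, int> P{1, 2};
  auto [A, B] = P;
  // auto Bad = [B] { return B; };   // ill-formed in C++17: structured
  //                                 // bindings cannot be captured directly
  auto Good = [B = B] { return B; }; // OK: init-capture copies the binding
  return Good();
}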

Collaborator

@ssahasra ssahasra left a comment

Looks good to me. But I have no opinion about the discussion of whether "image" should be available for explicit use!

if (Prefix != FenceASPrefix)
continue;

if (auto It = ASNames.find(Suffix); It != ASNames.end())
Collaborator

Wow, I have never considered using a semicolon like this before. But it makes so much sense! :)
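For anyone else new to it, this is C++17's "if with init-statement"; a tiny standalone example:

#include <map>
#include <string>

int lookup(const std::map<std::string, int> &M, const std::string &Key) {
  // The declaration before the ';' is scoped to the if/else only.
  if (auto It = M.find(Key); It != M.end())
    return It->second;
  return -1;
}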

@Pierre-vh
Contributor Author

@arsenm Should we use image or private?
We could allow both in the frontend, and only use private as the canonical MMRA.

@arsenm
Contributor

arsenm commented May 20, 2024

@arsenm Should we use image or private? We could allow both in the frontend, and only use private as the canonical MMRA.

I don't understand why image would imply private. I would just keep it as private throughout

@Pierre-vh
Contributor Author

@arsenm Should we use image or private? We could allow both in the frontend, and only use private as the canonical MMRA.

I don't understand why image would imply private. I would just keep it as private throughout

I thought image memory = private. It's unclear to me what AS OpenCL IMAGE memory maps to in our backend.
(But otherwise, yes, MMRA should just have the backend names; the mapping of the OpenCL IMAGE flag to a backend AS should be in the device-lib)

@arsenm
Contributor

arsenm commented May 21, 2024

I thought image memory = private. It's unclear to me what AS OpenCL IMAGE memory maps to in our backend. (But otherwise, yes, MMRA should just have the backend names; the mapping of the OpenCL IMAGE flag to a backend AS should be in the device-lib)

Images are global memory with magical addressing and value interpretation on load/store. There's nothing private about them

@Pierre-vh
Contributor Author

I thought image memory = private. It's unclear to me what AS OpenCL IMAGE memory maps to in our backend. (But otherwise, yes, MMRA should just have the backend names; the mapping of the OpenCL IMAGE flag to a backend AS should be in the device-lib)

Images are global memory with magical addressing and value interpretation on load/store. There's nothing private about them

So in our case:

  • OpenCL Image = Global
  • OpenCL Local = LDS
  • OpenCL Global = Global

Is that correct?

Then I guess the MMRA should just have "global" and "local" for now, we can always add more later if needed. What do you think?

@arsenm
Contributor

arsenm commented May 22, 2024

Then I guess the MMRA should just have "global" and "local" for now, we can always add more later if needed. What do you think?

Yes, we don't have specific image counters. They are just vmcnt

@Pierre-vh
Contributor Author

Then I guess the MMRA should just have "global" and "local" for now, we can always add more later if needed. What do you think?

Yes, we don't have specific image counters. They are just vmcnt

Diff has been updated with those changes

Comment on lines +691 to +692
std::string Str;
raw_string_ostream OS(Str);
Contributor

SmallString + raw_svector_ostream?
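
Spelling out the reviewer's suggestion (standard LLVM ADT usage; SmallString keeps the storage on the stack for small strings, and raw_svector_ostream writes straight into it):

#include "llvm/ADT/SmallString.h"
#include "llvm/Support/raw_ostream.h"

SmallString<64> Str;                 // stack buffer; heap only if it grows
llvm::raw_svector_ostream OS(Str);   // streams directly into Str
OS << "tag: " << 42;
// Str now holds the formatted text; no separate flush step is needed.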
