
[X86] Use plain load/store instead of cmpxchg16b for atomics with AVX #74275

Merged

jyknight merged 8 commits into llvm:main from atomic-128b-avx on May 16, 2024

Conversation

jyknight
Member

@jyknight jyknight commented Dec 4, 2023

In late 2021, both Intel and AMD finally documented that every AVX-capable CPU has always been guaranteed to execute aligned 16-byte loads/stores atomically, and further, guaranteed that all future CPUs with AVX will do so as well.

Therefore, we may use normal SSE 128-bit load/store instructions to implement atomics, if AVX is enabled.

Per AMD64 Architecture Programmer's manual, 7.3.2 Access Atomicity:

Processors that report [AVX] extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.

Per Intel's SDM:

Processors that enumerate support for Intel(R) AVX guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:

  • MOVAPD, MOVAPS, and MOVDQA.
  • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
  • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).

This was also confirmed to be true for Zhaoxin CPUs with AVX, in https://gcc.gnu.org/PR104688

In late 2021, both Intel and AMD finally documented that every
AVX-capable CPU has always been guaranteed to execute aligned 16-byte
loads/stores atomically, and further, guaranteed that all future CPUs
with AVX will do so as well.

Therefore, we may use normal SSE 128-bit load/store instructions to
implement atomics, if AVX is enabled.

Also adjust handling of unordered atomic load/store in LegalizeIntegerTypes.cpp;
currently, it hardcodes a fallback to ATOMIC_CMP_SWAP_WITH_SUCCESS,
but we should instead fallback to ATOMIC_LOAD/ATOMIC_STORE.

Per AMD64 Architecture Programmer's manual, 7.3.2 Access Atomicity:
"""
Processors that report [AVX] extend the atomicity for cacheable,
naturally-aligned single loads or stores from a quadword to a double
quadword.
"""

Per Intel's SDM:
"""
Processors that enumerate support for Intel(R) AVX guarantee that the
16-byte memory operations performed by the following instructions will
always be carried out atomically:
- MOVAPD, MOVAPS, and MOVDQA.
- VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
- VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with
  EVEX.128 and k0 (masking disabled).
"""

This was also confirmed to be true for Zhaoxin CPUs with AVX, in
https://gcc.gnu.org/PR104688
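
For illustration, the new lowering can be exercised with a reduced .ll file like the sketch below, adapted from the load_i128/store_i128 checks this patch updates in llvm/test/CodeGen/X86/atomic-unordered.ll (the llc invocation is only illustrative, e.g. llc -mtriple=x86_64-- -mattr=+avx):

define i128 @load_i128(ptr %ptr) {
  ; Aligned 16-byte atomic load: with AVX this becomes a single vmovdqa,
  ; plus vmovq/vpextrq to return the halves in %rax:%rdx.
  %v = load atomic i128, ptr %ptr unordered, align 16
  ret i128 %v
}

define void @store_i128(ptr %ptr, i128 %v) {
  ; Aligned 16-byte atomic store: with AVX this becomes vmovq/vpunpcklqdq
  ; to build the vector, then a single vmovdqa, instead of a cmpxchg16b loop.
  store atomic i128 %v, ptr %ptr unordered, align 16
  ret void
}

Without AVX, the existing cmpxchg16b expansion is retained, as the unchanged no-AVX check lines in the updated tests show.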
@jyknight jyknight requested a review from RKSimon December 4, 2023 04:37
@llvmbot llvmbot added the backend:X86 and llvm:SelectionDAG labels Dec 4, 2023
@llvmbot
Collaborator

llvmbot commented Dec 4, 2023

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-selectiondag

Author: James Y Knight (jyknight)

Patch is 27.24 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/74275.diff

6 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp (+12-16)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+68-26)
  • (modified) llvm/test/CodeGen/X86/atomic-non-integer.ll (+4-20)
  • (modified) llvm/test/CodeGen/X86/atomic-unordered.ll (+14-69)
  • (modified) llvm/test/CodeGen/X86/atomic128.ll (+155-92)
  • (modified) llvm/test/CodeGen/X86/cmpxchg-i128-i1.ll (+3-5)
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
index 54698edce7d6f..5b496feee7a8f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
@@ -3831,17 +3831,14 @@ void DAGTypeLegalizer::ExpandIntRes_XROUND_XRINT(SDNode *N, SDValue &Lo,
 void DAGTypeLegalizer::ExpandIntRes_LOAD(LoadSDNode *N,
                                          SDValue &Lo, SDValue &Hi) {
   if (N->isAtomic()) {
-    // It's typical to have larger CAS than atomic load instructions.
     SDLoc dl(N);
     EVT VT = N->getMemoryVT();
-    SDVTList VTs = DAG.getVTList(VT, MVT::i1, MVT::Other);
-    SDValue Zero = DAG.getConstant(0, dl, VT);
-    SDValue Swap = DAG.getAtomicCmpSwap(
-        ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS, dl,
-        VT, VTs, N->getOperand(0),
-        N->getOperand(1), Zero, Zero, N->getMemOperand());
-    ReplaceValueWith(SDValue(N, 0), Swap.getValue(0));
-    ReplaceValueWith(SDValue(N, 1), Swap.getValue(2));
+    // We may support larger values in atomic_load than in a normal load
+    // (without splitting), so switch over if needed.
+    SDValue New = DAG.getAtomic(ISD::ATOMIC_LOAD, dl, VT, VT, N->getOperand(0),
+                                N->getOperand(1), N->getMemOperand());
+    ReplaceValueWith(SDValue(N, 0), New.getValue(0));
+    ReplaceValueWith(SDValue(N, 1), New.getValue(1));
     return;
   }
 
@@ -5399,14 +5396,13 @@ SDValue DAGTypeLegalizer::ExpandIntOp_XINT_TO_FP(SDNode *N) {
 
 SDValue DAGTypeLegalizer::ExpandIntOp_STORE(StoreSDNode *N, unsigned OpNo) {
   if (N->isAtomic()) {
-    // It's typical to have larger CAS than atomic store instructions.
+    // We may support larger values in atomic_store than in a normal store
+    // (without splitting), so switch over if needed.
     SDLoc dl(N);
-    SDValue Swap = DAG.getAtomic(ISD::ATOMIC_SWAP, dl,
-                                 N->getMemoryVT(),
-                                 N->getOperand(0), N->getOperand(2),
-                                 N->getOperand(1),
-                                 N->getMemOperand());
-    return Swap.getValue(1);
+    SDValue New =
+        DAG.getAtomic(ISD::ATOMIC_STORE, dl, N->getMemoryVT(), N->getOperand(0),
+                      N->getOperand(1), N->getOperand(2), N->getMemOperand());
+    return New.getValue(0);
   }
   if (ISD::isNormalStore(N))
     return ExpandOp_NormalStore(N, OpNo);
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 6167be7bdf84e..1880cbc3a5bf3 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -515,6 +515,13 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
   if (!Subtarget.is64Bit())
     setOperationAction(ISD::ATOMIC_LOAD, MVT::i64, Custom);
 
+  if (Subtarget.is64Bit() && Subtarget.hasAVX()) {
+    // All CPUs supporting AVX will atomically load/store aligned 128-bit
+    // values, so we can emit [V]MOVAPS/[V]MOVDQA.
+    setOperationAction(ISD::ATOMIC_LOAD, MVT::i128, Custom);
+    setOperationAction(ISD::ATOMIC_STORE, MVT::i128, Custom);
+  }
+
   if (Subtarget.canUseCMPXCHG16B())
     setOperationAction(ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS, MVT::i128, Custom);
 
@@ -30101,12 +30108,16 @@ TargetLoweringBase::AtomicExpansionKind
 X86TargetLowering::shouldExpandAtomicStoreInIR(StoreInst *SI) const {
   Type *MemType = SI->getValueOperand()->getType();
 
-  bool NoImplicitFloatOps =
-      SI->getFunction()->hasFnAttribute(Attribute::NoImplicitFloat);
-  if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
-      !Subtarget.useSoftFloat() && !NoImplicitFloatOps &&
-      (Subtarget.hasSSE1() || Subtarget.hasX87()))
-    return AtomicExpansionKind::None;
+  if (!SI->getFunction()->hasFnAttribute(Attribute::NoImplicitFloat) &&
+      !Subtarget.useSoftFloat()) {
+    if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
+        (Subtarget.hasSSE1() || Subtarget.hasX87()))
+      return AtomicExpansionKind::None;
+
+    if (MemType->getPrimitiveSizeInBits() == 128 && Subtarget.is64Bit() &&
+        Subtarget.hasAVX())
+      return AtomicExpansionKind::None;
+  }
 
   return needsCmpXchgNb(MemType) ? AtomicExpansionKind::Expand
                                  : AtomicExpansionKind::None;
@@ -30121,12 +30132,16 @@ X86TargetLowering::shouldExpandAtomicLoadInIR(LoadInst *LI) const {
   // If this a 64 bit atomic load on a 32-bit target and SSE2 is enabled, we
   // can use movq to do the load. If we have X87 we can load into an 80-bit
   // X87 register and store it to a stack temporary.
-  bool NoImplicitFloatOps =
-      LI->getFunction()->hasFnAttribute(Attribute::NoImplicitFloat);
-  if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
-      !Subtarget.useSoftFloat() && !NoImplicitFloatOps &&
-      (Subtarget.hasSSE1() || Subtarget.hasX87()))
-    return AtomicExpansionKind::None;
+  if (!LI->getFunction()->hasFnAttribute(Attribute::NoImplicitFloat) &&
+      !Subtarget.useSoftFloat()) {
+    if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
+        (Subtarget.hasSSE1() || Subtarget.hasX87()))
+      return AtomicExpansionKind::None;
+
+    if (MemType->getPrimitiveSizeInBits() == 128 && Subtarget.is64Bit() &&
+        Subtarget.hasAVX())
+      return AtomicExpansionKind::None;
+  }
 
   return needsCmpXchgNb(MemType) ? AtomicExpansionKind::CmpXChg
                                  : AtomicExpansionKind::None;
@@ -31277,14 +31292,23 @@ static SDValue LowerATOMIC_STORE(SDValue Op, SelectionDAG &DAG,
   if (!IsSeqCst && IsTypeLegal)
     return Op;
 
-  if (VT == MVT::i64 && !IsTypeLegal) {
+  if (!IsTypeLegal && !Subtarget.useSoftFloat() &&
+      !DAG.getMachineFunction().getFunction().hasFnAttribute(
+          Attribute::NoImplicitFloat)) {
+    SDValue Chain;
+    // For illegal i128 atomic_store, when AVX is enabled, we can simply emit a
+    // vector store.
+    if (VT == MVT::i128) {
+      if (Subtarget.is64Bit() && Subtarget.hasAVX()) {
+        SDValue VecVal = DAG.getBitcast(MVT::v2i64, Node->getVal());
+        Chain = DAG.getStore(Node->getChain(), dl, VecVal, Node->getBasePtr(),
+                             Node->getMemOperand());
+      }
+    }
+
     // For illegal i64 atomic_stores, we can try to use MOVQ or MOVLPS if SSE
     // is enabled.
-    bool NoImplicitFloatOps =
-        DAG.getMachineFunction().getFunction().hasFnAttribute(
-            Attribute::NoImplicitFloat);
-    if (!Subtarget.useSoftFloat() && !NoImplicitFloatOps) {
-      SDValue Chain;
+    if (VT == MVT::i64) {
       if (Subtarget.hasSSE1()) {
         SDValue SclToVec =
             DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v2i64, Node->getVal());
@@ -31316,15 +31340,15 @@ static SDValue LowerATOMIC_STORE(SDValue Op, SelectionDAG &DAG,
             DAG.getMemIntrinsicNode(X86ISD::FIST, dl, DAG.getVTList(MVT::Other),
                                     StoreOps, MVT::i64, Node->getMemOperand());
       }
+    }
 
-      if (Chain) {
-        // If this is a sequentially consistent store, also emit an appropriate
-        // barrier.
-        if (IsSeqCst)
-          Chain = emitLockedStackOp(DAG, Subtarget, Chain, dl);
+    if (Chain) {
+      // If this is a sequentially consistent store, also emit an appropriate
+      // barrier.
+      if (IsSeqCst)
+        Chain = emitLockedStackOp(DAG, Subtarget, Chain, dl);
 
-        return Chain;
-      }
+      return Chain;
     }
   }
 
@@ -32877,12 +32901,30 @@ void X86TargetLowering::ReplaceNodeResults(SDNode *N,
     return;
   }
   case ISD::ATOMIC_LOAD: {
-    assert(N->getValueType(0) == MVT::i64 && "Unexpected VT!");
+    assert(
+        (N->getValueType(0) == MVT::i64 || N->getValueType(0) == MVT::i128) &&
+        "Unexpected VT!");
     bool NoImplicitFloatOps =
         DAG.getMachineFunction().getFunction().hasFnAttribute(
             Attribute::NoImplicitFloat);
     if (!Subtarget.useSoftFloat() && !NoImplicitFloatOps) {
       auto *Node = cast<AtomicSDNode>(N);
+
+      if (N->getValueType(0) == MVT::i128) {
+        if (Subtarget.is64Bit() && Subtarget.hasAVX()) {
+          SDValue Ld = DAG.getLoad(MVT::v2i64, dl, Node->getChain(),
+                                   Node->getBasePtr(), Node->getMemOperand());
+          SDValue ResL = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i64, Ld,
+                                     DAG.getIntPtrConstant(0, dl));
+          SDValue ResH = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i64, Ld,
+                                     DAG.getIntPtrConstant(1, dl));
+          Results.push_back(DAG.getNode(ISD::BUILD_PAIR, dl, N->getValueType(0),
+                                        {ResL, ResH}));
+          Results.push_back(Ld.getValue(1));
+          return;
+        }
+        break;
+      }
       if (Subtarget.hasSSE1()) {
         // Use a VZEXT_LOAD which will be selected as MOVQ or XORPS+MOVLPS.
         // Then extract the lower 64-bits.
diff --git a/llvm/test/CodeGen/X86/atomic-non-integer.ll b/llvm/test/CodeGen/X86/atomic-non-integer.ll
index 7d2810e57a25b..22b45b13aae22 100644
--- a/llvm/test/CodeGen/X86/atomic-non-integer.ll
+++ b/llvm/test/CodeGen/X86/atomic-non-integer.ll
@@ -207,14 +207,7 @@ define void @store_fp128(ptr %fptr, fp128 %v) {
 ;
 ; X64-AVX-LABEL: store_fp128:
 ; X64-AVX:       # %bb.0:
-; X64-AVX-NEXT:    subq $24, %rsp
-; X64-AVX-NEXT:    .cfi_def_cfa_offset 32
-; X64-AVX-NEXT:    vmovaps %xmm0, (%rsp)
-; X64-AVX-NEXT:    movq (%rsp), %rsi
-; X64-AVX-NEXT:    movq {{[0-9]+}}(%rsp), %rdx
-; X64-AVX-NEXT:    callq __sync_lock_test_and_set_16@PLT
-; X64-AVX-NEXT:    addq $24, %rsp
-; X64-AVX-NEXT:    .cfi_def_cfa_offset 8
+; X64-AVX-NEXT:    vmovaps %xmm0, (%rdi)
 ; X64-AVX-NEXT:    retq
   store atomic fp128 %v, ptr %fptr unordered, align 16
   ret void
@@ -592,18 +585,9 @@ define fp128 @load_fp128(ptr %fptr) {
 ;
 ; X64-AVX-LABEL: load_fp128:
 ; X64-AVX:       # %bb.0:
-; X64-AVX-NEXT:    subq $24, %rsp
-; X64-AVX-NEXT:    .cfi_def_cfa_offset 32
-; X64-AVX-NEXT:    xorl %esi, %esi
-; X64-AVX-NEXT:    xorl %edx, %edx
-; X64-AVX-NEXT:    xorl %ecx, %ecx
-; X64-AVX-NEXT:    xorl %r8d, %r8d
-; X64-AVX-NEXT:    callq __sync_val_compare_and_swap_16@PLT
-; X64-AVX-NEXT:    movq %rdx, {{[0-9]+}}(%rsp)
-; X64-AVX-NEXT:    movq %rax, (%rsp)
-; X64-AVX-NEXT:    vmovaps (%rsp), %xmm0
-; X64-AVX-NEXT:    addq $24, %rsp
-; X64-AVX-NEXT:    .cfi_def_cfa_offset 8
+; X64-AVX-NEXT:    vmovaps (%rdi), %xmm0
+; X64-AVX-NEXT:    vmovaps %xmm0, -{{[0-9]+}}(%rsp)
+; X64-AVX-NEXT:    vmovaps -{{[0-9]+}}(%rsp), %xmm0
 ; X64-AVX-NEXT:    retq
   %v = load atomic fp128, ptr %fptr unordered, align 16
   ret fp128 %v
diff --git a/llvm/test/CodeGen/X86/atomic-unordered.ll b/llvm/test/CodeGen/X86/atomic-unordered.ll
index b66988c8bd24b..91e427189de47 100644
--- a/llvm/test/CodeGen/X86/atomic-unordered.ll
+++ b/llvm/test/CodeGen/X86/atomic-unordered.ll
@@ -230,34 +230,12 @@ define void @widen_broadcast_unaligned(ptr %p0, i32 %v) {
 }
 
 define i128 @load_i128(ptr %ptr) {
-; CHECK-O0-LABEL: load_i128:
-; CHECK-O0:       # %bb.0:
-; CHECK-O0-NEXT:    pushq %rbx
-; CHECK-O0-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-O0-NEXT:    .cfi_offset %rbx, -16
-; CHECK-O0-NEXT:    xorl %eax, %eax
-; CHECK-O0-NEXT:    movl %eax, %ebx
-; CHECK-O0-NEXT:    movq %rbx, %rax
-; CHECK-O0-NEXT:    movq %rbx, %rdx
-; CHECK-O0-NEXT:    movq %rbx, %rcx
-; CHECK-O0-NEXT:    lock cmpxchg16b (%rdi)
-; CHECK-O0-NEXT:    popq %rbx
-; CHECK-O0-NEXT:    .cfi_def_cfa_offset 8
-; CHECK-O0-NEXT:    retq
-;
-; CHECK-O3-LABEL: load_i128:
-; CHECK-O3:       # %bb.0:
-; CHECK-O3-NEXT:    pushq %rbx
-; CHECK-O3-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-O3-NEXT:    .cfi_offset %rbx, -16
-; CHECK-O3-NEXT:    xorl %eax, %eax
-; CHECK-O3-NEXT:    xorl %edx, %edx
-; CHECK-O3-NEXT:    xorl %ecx, %ecx
-; CHECK-O3-NEXT:    xorl %ebx, %ebx
-; CHECK-O3-NEXT:    lock cmpxchg16b (%rdi)
-; CHECK-O3-NEXT:    popq %rbx
-; CHECK-O3-NEXT:    .cfi_def_cfa_offset 8
-; CHECK-O3-NEXT:    retq
+; CHECK-LABEL: load_i128:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vmovdqa (%rdi), %xmm0
+; CHECK-NEXT:    vmovq %xmm0, %rax
+; CHECK-NEXT:    vpextrq $1, %xmm0, %rdx
+; CHECK-NEXT:    retq
   %v = load atomic i128, ptr %ptr unordered, align 16
   ret i128 %v
 }
@@ -265,51 +243,18 @@ define i128 @load_i128(ptr %ptr) {
 define void @store_i128(ptr %ptr, i128 %v) {
 ; CHECK-O0-LABEL: store_i128:
 ; CHECK-O0:       # %bb.0:
-; CHECK-O0-NEXT:    pushq %rbx
-; CHECK-O0-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-O0-NEXT:    .cfi_offset %rbx, -16
-; CHECK-O0-NEXT:    movq %rdi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    movq %rdx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    movq (%rdi), %rax
-; CHECK-O0-NEXT:    movq 8(%rdi), %rdx
-; CHECK-O0-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    movq %rdx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    jmp .LBB16_1
-; CHECK-O0-NEXT:  .LBB16_1: # %atomicrmw.start
-; CHECK-O0-NEXT:    # =>This Inner Loop Header: Depth=1
-; CHECK-O0-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
-; CHECK-O0-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
-; CHECK-O0-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
-; CHECK-O0-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbx # 8-byte Reload
-; CHECK-O0-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
-; CHECK-O0-NEXT:    lock cmpxchg16b (%rsi)
-; CHECK-O0-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    movq %rdx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    jne .LBB16_1
-; CHECK-O0-NEXT:    jmp .LBB16_2
-; CHECK-O0-NEXT:  .LBB16_2: # %atomicrmw.end
-; CHECK-O0-NEXT:    popq %rbx
-; CHECK-O0-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-O0-NEXT:    vmovq %rsi, %xmm0
+; CHECK-O0-NEXT:    vmovq %rdx, %xmm1
+; CHECK-O0-NEXT:    vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; CHECK-O0-NEXT:    vmovdqa %xmm0, (%rdi)
 ; CHECK-O0-NEXT:    retq
 ;
 ; CHECK-O3-LABEL: store_i128:
 ; CHECK-O3:       # %bb.0:
-; CHECK-O3-NEXT:    pushq %rbx
-; CHECK-O3-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-O3-NEXT:    .cfi_offset %rbx, -16
-; CHECK-O3-NEXT:    movq %rdx, %rcx
-; CHECK-O3-NEXT:    movq %rsi, %rbx
-; CHECK-O3-NEXT:    movq (%rdi), %rax
-; CHECK-O3-NEXT:    movq 8(%rdi), %rdx
-; CHECK-O3-NEXT:    .p2align 4, 0x90
-; CHECK-O3-NEXT:  .LBB16_1: # %atomicrmw.start
-; CHECK-O3-NEXT:    # =>This Inner Loop Header: Depth=1
-; CHECK-O3-NEXT:    lock cmpxchg16b (%rdi)
-; CHECK-O3-NEXT:    jne .LBB16_1
-; CHECK-O3-NEXT:  # %bb.2: # %atomicrmw.end
-; CHECK-O3-NEXT:    popq %rbx
-; CHECK-O3-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-O3-NEXT:    vmovq %rdx, %xmm0
+; CHECK-O3-NEXT:    vmovq %rsi, %xmm1
+; CHECK-O3-NEXT:    vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
+; CHECK-O3-NEXT:    vmovdqa %xmm0, (%rdi)
 ; CHECK-O3-NEXT:    retq
   store atomic i128 %v, ptr %ptr unordered, align 16
   ret void
diff --git a/llvm/test/CodeGen/X86/atomic128.ll b/llvm/test/CodeGen/X86/atomic128.ll
index d5600b54a169d..76c3b2c5f1bb1 100644
--- a/llvm/test/CodeGen/X86/atomic128.ll
+++ b/llvm/test/CodeGen/X86/atomic128.ll
@@ -1,5 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=x86_64-apple-macosx10.9 -verify-machineinstrs -mattr=cx16 | FileCheck %s
+; RUN: llc < %s -mtriple=x86_64-apple-macosx10.9 -verify-machineinstrs -mattr=cx16 | FileCheck %s --check-prefixes=CHECK,CHECK-NOAVX
+; RUN: llc < %s -mtriple=x86_64-apple-macosx10.9 -verify-machineinstrs -mattr=cx16,avx | FileCheck %s --check-prefixes=CHECK,CHECK-AVX
 ; RUN: llc < %s -mtriple=i386-linux-gnu -verify-machineinstrs -mattr=cx16 | FileCheck %s -check-prefixes=CHECK32
 ; RUN: llc < %s -mtriple=i386-linux-gnu -verify-machineinstrs -mattr=-cx16 | FileCheck %s -check-prefixes=CHECK32
 
@@ -83,21 +84,32 @@ define i128 @val_compare_and_swap(ptr %p, i128 %oldval, i128 %newval) {
 @cmpxchg16b_global = external dso_local global { i128, i128 }, align 16
 
 ;; Make sure we retain the offset of the global variable.
-define void @cmpxchg16b_global_with_offset() nounwind {
-; CHECK-LABEL: cmpxchg16b_global_with_offset:
-; CHECK:       ## %bb.0: ## %entry
-; CHECK-NEXT:    pushq %rbx
-; CHECK-NEXT:    xorl %eax, %eax
-; CHECK-NEXT:    xorl %edx, %edx
-; CHECK-NEXT:    xorl %ecx, %ecx
-; CHECK-NEXT:    xorl %ebx, %ebx
-; CHECK-NEXT:    lock cmpxchg16b _cmpxchg16b_global+16(%rip)
-; CHECK-NEXT:    popq %rbx
-; CHECK-NEXT:    retq
+define i128 @load_global_with_offset() nounwind {
+; CHECK-NOAVX-LABEL: load_global_with_offset:
+; CHECK-NOAVX:       ## %bb.0: ## %entry
+; CHECK-NOAVX-NEXT:    pushq %rbx
+; CHECK-NOAVX-NEXT:    xorl %eax, %eax
+; CHECK-NOAVX-NEXT:    xorl %edx, %edx
+; CHECK-NOAVX-NEXT:    xorl %ecx, %ecx
+; CHECK-NOAVX-NEXT:    xorl %ebx, %ebx
+; CHECK-NOAVX-NEXT:    lock cmpxchg16b _cmpxchg16b_global+16(%rip)
+; CHECK-NOAVX-NEXT:    popq %rbx
+; CHECK-NOAVX-NEXT:    retq
 ;
-; CHECK32-LABEL: cmpxchg16b_global_with_offset:
+; CHECK-AVX-LABEL: load_global_with_offset:
+; CHECK-AVX:       ## %bb.0: ## %entry
+; CHECK-AVX-NEXT:    vmovdqa _cmpxchg16b_global+16(%rip), %xmm0
+; CHECK-AVX-NEXT:    vmovq %xmm0, %rax
+; CHECK-AVX-NEXT:    vpextrq $1, %xmm0, %rdx
+; CHECK-AVX-NEXT:    retq
+;
+; CHECK32-LABEL: load_global_with_offset:
 ; CHECK32:       # %bb.0: # %entry
-; CHECK32-NEXT:    subl $36, %esp
+; CHECK32-NEXT:    pushl %edi
+; CHECK32-NEXT:    pushl %esi
+; CHECK32-NEXT:    subl $20, %esp
+; CHECK32-NEXT:    movl {{[0-9]+}}(%esp), %esi
+; CHECK32-NEXT:    subl $8, %esp
 ; CHECK32-NEXT:    leal {{[0-9]+}}(%esp), %eax
 ; CHECK32-NEXT:    pushl $0
 ; CHECK32-NEXT:    pushl $0
@@ -110,11 +122,23 @@ define void @cmpxchg16b_global_with_offset() nounwind {
 ; CHECK32-NEXT:    pushl $cmpxchg16b_global+16
 ; CHECK32-NEXT:    pushl %eax
 ; CHECK32-NEXT:    calll __sync_val_compare_and_swap_16
-; CHECK32-NEXT:    addl $72, %esp
-; CHECK32-NEXT:    retl
+; CHECK32-NEXT:    addl $44, %esp
+; CHECK32-NEXT:    movl (%esp), %eax
+; CHECK32-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; CHECK32-NEXT:    movl {{[0-9]+}}(%esp), %edx
+; CHECK32-NEXT:    movl {{[0-9]+}}(%esp), %edi
+; CHECK32-NEXT:    movl %edi, 8(%esi)
+; CHECK32-NEXT:    movl %edx, 12(%esi)
+; CHECK32-NEXT:    movl %eax, (%esi)
+; CHECK32-NEXT:    movl %ecx, 4(%esi)
+; CHECK32-NEXT:    movl %esi, %eax
+; CHECK32-NEXT:    addl $20, %esp
+; CHECK32-NEXT:    popl %esi
+; CHECK32-NEXT:    popl %edi
+; CHECK32-NEXT:    retl $4
 entry:
   %0 = load atomic i128, ptr getelementptr inbounds ({i128, i128}, ptr @cmpxchg16b_global, i64 0, i32 1) acquire, align 16
-  ret void
+  ret i128 %0
 }
 
 define void @fetch_and_nand(ptr %p, i128 %bits) {
@@ -676,18 +700,25 @@ define void @fetch_and_umax(ptr %p, i128 %bits) {
 }
 
 define i128 @atomic_load_seq_cst(ptr %p) {
-; CHECK-LABEL: atomic_load_seq_cst:
-; CHECK:       ## %bb.0:
-; CHECK-NEXT:    pushq %rbx
-; CHECK-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-NEXT:    .cfi_offset %rbx, -16
-; CHECK-NEXT:    xorl %eax, %eax
-; CHECK-NEXT:    xorl %edx, %edx
-; CHECK-NEXT:    xorl %ecx, %ecx
-; CHECK-NEXT:    xorl %ebx, %ebx
-; CHECK-NEXT:    lock cmpxchg16b (%rdi)
-; CHECK-NEXT:    popq %rbx
-; CHECK-NEXT:    retq
+; CHECK-NOAVX-LABEL: atomic_load_seq_cst:
+; CHECK-NOAVX:       ## %bb.0:
+; CHECK-NOAVX-NEXT:    pushq %rbx
+; CHECK-NOAVX-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NOAVX-NEXT:    .cfi_offset %rbx, -16
+; CHECK-NOAVX-NEXT:    xorl %eax, %eax
+; CHECK-NOAVX-NEXT:    xorl %edx, %edx
+; CHECK-NOAVX-NEXT:    xorl %ecx, %ecx
+; CHECK-NOAVX-NEXT:    xorl %ebx, %ebx
+; CHECK-NOAVX-NEXT:    lock cmpxchg16b (%rdi)
+; CHECK-NOAVX-NEXT:    popq %rbx
+; CHECK-NOAVX-NEXT:    retq
+;
+; CHECK-AVX-LABEL: atomic_load_seq_cst:
+; CHECK-AVX:       ## %bb.0:
+; CHECK-AVX-NEXT:    vmovdqa (%rdi), %xmm0
+; CHECK-AVX-NEXT:    vmovq %xmm0, %rax
+; CHECK-AVX-NEXT:    vpextrq $1, %xmm0, %rdx
+; CHECK-AVX-NEXT:    retq
 ;
 ; CHECK32-LABEL: atomic_load_seq_cst:
 ; CHECK32:       # %bb.0:
@@ -748,18 +779,25 @@ define i128 @atomic_load_seq_cst(ptr %p) {
 }
 
 define i128 @atomic_load_relaxed(ptr %p) {
-; CHECK-LABEL: atomic_load_relaxed:
-; CHECK:       ## %bb.0:
-; CHECK-NEXT:    pushq %rbx
-; CHECK-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-NEXT...
[truncated]

@jyknight
Copy link
Member Author

Ping!

@@ -30115,12 +30126,16 @@ X86TargetLowering::shouldExpandAtomicLoadInIR(LoadInst *LI) const {
// If this a 64 bit atomic load on a 32-bit target and SSE2 is enabled, we
// can use movq to do the load. If we have X87 we can load into an 80-bit
// X87 register and store it to a stack temporary.
Collaborator

Can you move the comment down to the case it applies to (inside the outer if)? Currently it binds to both cases, which is misleading.

Member Author

Done.

// For illegal i128 atomic_store, when AVX is enabled, we can simply emit a
// vector store.
if (VT == MVT::i128) {
if (Subtarget.is64Bit() && Subtarget.hasAVX()) {
Collaborator

Looks like you can collapse one level of if clause here.

Member Author

Done.

Attribute::NoImplicitFloat);
if (!Subtarget.useSoftFloat() && !NoImplicitFloatOps) {
SDValue Chain;
if (VT == MVT::i64) {
if (Subtarget.hasSSE1()) {
Collaborator

Same here.

Member Author

This has an "else if" attached, so better not to.

; CHECK-O3-NEXT: popq %rbx
; CHECK-O3-NEXT: .cfi_def_cfa_offset 8
; CHECK-O3-NEXT: retq
; CHECK-O0-CUR-LABEL: load_i128:
Collaborator

Your tests need updating; these check prefixes no longer exist.

Member Author

Done.

if (!SI->getFunction()->hasFnAttribute(Attribute::NoImplicitFloat) &&
!Subtarget.useSoftFloat()) {
if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
(Subtarget.hasSSE1() || Subtarget.hasX87()))
Collaborator

Do we need to check the alignment here?

Member Author

No, misaligned atomic ops are converted to an __atomic_* libcall before this function is ever called.

@jyknight
Member Author

jyknight commented Jan 2, 2024

Ping!

// If this a 64 bit atomic load on a 32-bit target and SSE2 is enabled, we
// can use movq to do the load. If we have X87 we can load into an 80-bit
// X87 register and store it to a stack temporary.
if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
Contributor

getPrimitiveSizeInBits doesn't work correctly for pointer typed atomics; you need to use the DataLayout. Is there appropriate pointer typed atomic load test coverage for this?

Member Author

Yikes, that's a nasty footgun -- obvious once you think about it, but I never would've looked out for that issue! The API of getPrimitiveSizeInBits is just asking for mistakes like this. It ought to either assert when called on a non-primitive, or else return an optional, rather than simply returning 0 on failure.

But, looking through this code and the rest of the related atomics support on X86, I believe there is actually not a correctness impact, because (not coincidentally!) atomics always work "normally" on pointer-sized values. That is, here, getPrimitiveSizeInBits() will return 0 rather than the perhaps-expected is64Bit() ? 64 : 32. Despite being unexpected, though, the logic still works out fine in that case.

So, given that this is a longstanding pre-existing issue -- which appears not to be a correctness issue -- I'd like to leave it alone in this patch.
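
For concreteness, the pointer-typed case being discussed is roughly the following (an illustrative example, not a test from this patch; the function name is made up):

define ptr @load_atomic_ptr(ptr %pp) {
  ; MemType here is a pointer type, so getPrimitiveSizeInBits() returns 0
  ; rather than 64; neither special case fires, and the op still lowers to an
  ; ordinary 64-bit atomic mov on x86-64, as described above.
  %v = load atomic ptr, ptr %pp seq_cst, align 8
  ret ptr %v
}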

Contributor

@arsenm arsenm left a comment

I don't know anything about x86 but this seems plausible

Collaborator

@RKSimon RKSimon left a comment

Please can you confirm we have tests for underaligned pointers?

@jyknight
Member Author

jyknight commented Feb 2, 2024

Underaligned atomic operations are expanded to an appropriate __atomic_* libcall via mostly target-independent code in AtomicExpandPass (see the comment there: "// If the Size/Alignment is not supported, replace with a libcall.") and never hit any of this. I'm not sure we have full test coverage for that case on X86, but I know we do for AArch64.
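
As a hypothetical sketch of that path (not a test in this patch; the function name is invented), an under-aligned 16-byte atomic never reaches the new X86 lowering:

define i128 @load_i128_underaligned(ptr %ptr) {
  ; Only 8 bytes of alignment on a 16-byte atomic: AtomicExpandPass replaces
  ; this with an __atomic_* runtime call before ISel, so the AVX load/store
  ; lowering added in this PR is never consulted.
  %v = load atomic i128, ptr %ptr unordered, align 8
  ret i128 %v
}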

@jyknight
Member Author

jyknight commented Feb 6, 2024

@RKSimon: I'm not sure if you intended your comment to be a blocker; I was about to press the merge button when you commented (and would still like to now).

Collaborator

@RKSimon RKSimon left a comment

LGTM. I'd still like to ensure we have unaligned x86 test coverage.

Collaborator

@RKSimon RKSimon left a comment

LGTM cheers

@jyknight jyknight merged commit d6f9278 into llvm:main May 16, 2024
3 of 4 checks passed
@jyknight jyknight deleted the atomic-128b-avx branch May 16, 2024 22:24