
[X86] Use plain load/store instead of cmpxchg16b for atomics with AVX #74275

Merged

jyknight merged 8 commits into llvm:main from atomic-128b-avx on May 16, 2024

Conversation

jyknight
Member

@jyknight jyknight commented Dec 4, 2023

In late 2021, both Intel and AMD finally documented that every AVX-capable CPU has always been guaranteed to execute aligned 16-byte loads/stores atomically, and further, guaranteed that all future CPUs with AVX will do so as well.

Therefore, we may use normal SSE 128-bit load/store instructions to implement atomics, if AVX is enabled.

Per AMD64 Architecture Programmer's manual, 7.3.2 Access Atomicity:

Processors that report [AVX] extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.

Per Intel's SDM:

Processors that enumerate support for Intel(R) AVX guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:

  • MOVAPD, MOVAPS, and MOVDQA.
  • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
  • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).

This was also confirmed to be true for Zhaoxin CPUs with AVX, in https://gcc.gnu.org/PR104688

In late 2021, both Intel and AMD finally documented that every
AVX-capable CPU has always been guaranteed to execute aligned 16-byte
loads/stores atomically, and further, guaranteed that all future CPUs
with AVX will do so as well.

Therefore, we may use normal SSE 128-bit load/store instructions to
implement atomics, if AVX is enabled.

Also adjust handling of unordered atomic load/store in LegalizeIntegerTypes.cpp;
currently, it hardcodes a fallback to ATOMIC_CMP_SWAP_WITH_SUCCESS,
but we should instead fallback to ATOMIC_LOAD/ATOMIC_STORE.

Per AMD64 Architecture Programmer's manual, 7.3.2 Access Atomicity:
"""
Processors that report [AVX] extend the atomicity for cacheable,
naturally-aligned single loads or stores from a quadword to a double
quadword.
"""

Per Intel's SDM:
"""
Processors that enumerate support for Intel(R) AVX guarantee that the
16-byte memory operations performed by the following instructions will
always be carried out atomically:
- MOVAPD, MOVAPS, and MOVDQA.
- VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
- VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with
  EVEX.128 and k0 (masking disabled).
"""

This was also confirmed to be true for Zhaoxin CPUs with AVX, in
https://gcc.gnu.org/PR104688
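
For illustration, the new lowering can be exercised with a reduced .ll file like the sketch below, adapted from the load_i128/store_i128 checks this patch updates in llvm/test/CodeGen/X86/atomic-unordered.ll (the llc invocation is only illustrative, e.g. llc -mtriple=x86_64-- -mattr=+avx):

define i128 @load_i128(ptr %ptr) {
  ; Aligned 16-byte atomic load: with AVX this becomes a single vmovdqa,
  ; plus vmovq/vpextrq to return the halves in %rax:%rdx.
  %v = load atomic i128, ptr %ptr unordered, align 16
  ret i128 %v
}

define void @store_i128(ptr %ptr, i128 %v) {
  ; Aligned 16-byte atomic store: with AVX this becomes vmovq/vpunpcklqdq
  ; to build the vector, then a single vmovdqa, instead of a cmpxchg16b loop.
  store atomic i128 %v, ptr %ptr unordered, align 16
  ret void
}

Without AVX, the existing cmpxchg16b expansion is retained, as the unchanged no-AVX check lines in the updated tests show.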
@jyknight jyknight requested a review from RKSimon December 4, 2023 04:37
@llvmbot llvmbot added the backend:X86 and llvm:SelectionDAG labels Dec 4, 2023
@llvmbot
Collaborator

llvmbot commented Dec 4, 2023

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-selectiondag

Author: James Y Knight (jyknight)

Patch is 27.24 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/74275.diff

6 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp (+12-16)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+68-26)
  • (modified) llvm/test/CodeGen/X86/atomic-non-integer.ll (+4-20)
  • (modified) llvm/test/CodeGen/X86/atomic-unordered.ll (+14-69)
  • (modified) llvm/test/CodeGen/X86/atomic128.ll (+155-92)
  • (modified) llvm/test/CodeGen/X86/cmpxchg-i128-i1.ll (+3-5)
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
index 54698edce7d6f..5b496feee7a8f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
@@ -3831,17 +3831,14 @@ void DAGTypeLegalizer::ExpandIntRes_XROUND_XRINT(SDNode *N, SDValue &Lo,
 void DAGTypeLegalizer::ExpandIntRes_LOAD(LoadSDNode *N,
                                          SDValue &Lo, SDValue &Hi) {
   if (N->isAtomic()) {
-    // It's typical to have larger CAS than atomic load instructions.
     SDLoc dl(N);
     EVT VT = N->getMemoryVT();
-    SDVTList VTs = DAG.getVTList(VT, MVT::i1, MVT::Other);
-    SDValue Zero = DAG.getConstant(0, dl, VT);
-    SDValue Swap = DAG.getAtomicCmpSwap(
-        ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS, dl,
-        VT, VTs, N->getOperand(0),
-        N->getOperand(1), Zero, Zero, N->getMemOperand());
-    ReplaceValueWith(SDValue(N, 0), Swap.getValue(0));
-    ReplaceValueWith(SDValue(N, 1), Swap.getValue(2));
+    // We may support larger values in atomic_load than in a normal load
+    // (without splitting), so switch over if needed.
+    SDValue New = DAG.getAtomic(ISD::ATOMIC_LOAD, dl, VT, VT, N->getOperand(0),
+                                N->getOperand(1), N->getMemOperand());
+    ReplaceValueWith(SDValue(N, 0), New.getValue(0));
+    ReplaceValueWith(SDValue(N, 1), New.getValue(1));
     return;
   }
 
@@ -5399,14 +5396,13 @@ SDValue DAGTypeLegalizer::ExpandIntOp_XINT_TO_FP(SDNode *N) {
 
 SDValue DAGTypeLegalizer::ExpandIntOp_STORE(StoreSDNode *N, unsigned OpNo) {
   if (N->isAtomic()) {
-    // It's typical to have larger CAS than atomic store instructions.
+    // We may support larger values in atomic_store than in a normal store
+    // (without splitting), so switch over if needed.
     SDLoc dl(N);
-    SDValue Swap = DAG.getAtomic(ISD::ATOMIC_SWAP, dl,
-                                 N->getMemoryVT(),
-                                 N->getOperand(0), N->getOperand(2),
-                                 N->getOperand(1),
-                                 N->getMemOperand());
-    return Swap.getValue(1);
+    SDValue New =
+        DAG.getAtomic(ISD::ATOMIC_STORE, dl, N->getMemoryVT(), N->getOperand(0),
+                      N->getOperand(1), N->getOperand(2), N->getMemOperand());
+    return New.getValue(0);
   }
   if (ISD::isNormalStore(N))
     return ExpandOp_NormalStore(N, OpNo);
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 6167be7bdf84e..1880cbc3a5bf3 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -515,6 +515,13 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
   if (!Subtarget.is64Bit())
     setOperationAction(ISD::ATOMIC_LOAD, MVT::i64, Custom);
 
+  if (Subtarget.is64Bit() && Subtarget.hasAVX()) {
+    // All CPUs supporting AVX will atomically load/store aligned 128-bit
+    // values, so we can emit [V]MOVAPS/[V]MOVDQA.
+    setOperationAction(ISD::ATOMIC_LOAD, MVT::i128, Custom);
+    setOperationAction(ISD::ATOMIC_STORE, MVT::i128, Custom);
+  }
+
   if (Subtarget.canUseCMPXCHG16B())
     setOperationAction(ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS, MVT::i128, Custom);
 
@@ -30101,12 +30108,16 @@ TargetLoweringBase::AtomicExpansionKind
 X86TargetLowering::shouldExpandAtomicStoreInIR(StoreInst *SI) const {
   Type *MemType = SI->getValueOperand()->getType();
 
-  bool NoImplicitFloatOps =
-      SI->getFunction()->hasFnAttribute(Attribute::NoImplicitFloat);
-  if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
-      !Subtarget.useSoftFloat() && !NoImplicitFloatOps &&
-      (Subtarget.hasSSE1() || Subtarget.hasX87()))
-    return AtomicExpansionKind::None;
+  if (!SI->getFunction()->hasFnAttribute(Attribute::NoImplicitFloat) &&
+      !Subtarget.useSoftFloat()) {
+    if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
+        (Subtarget.hasSSE1() || Subtarget.hasX87()))
+      return AtomicExpansionKind::None;
+
+    if (MemType->getPrimitiveSizeInBits() == 128 && Subtarget.is64Bit() &&
+        Subtarget.hasAVX())
+      return AtomicExpansionKind::None;
+  }
 
   return needsCmpXchgNb(MemType) ? AtomicExpansionKind::Expand
                                  : AtomicExpansionKind::None;
@@ -30121,12 +30132,16 @@ X86TargetLowering::shouldExpandAtomicLoadInIR(LoadInst *LI) const {
   // If this a 64 bit atomic load on a 32-bit target and SSE2 is enabled, we
   // can use movq to do the load. If we have X87 we can load into an 80-bit
   // X87 register and store it to a stack temporary.
-  bool NoImplicitFloatOps =
-      LI->getFunction()->hasFnAttribute(Attribute::NoImplicitFloat);
-  if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
-      !Subtarget.useSoftFloat() && !NoImplicitFloatOps &&
-      (Subtarget.hasSSE1() || Subtarget.hasX87()))
-    return AtomicExpansionKind::None;
+  if (!LI->getFunction()->hasFnAttribute(Attribute::NoImplicitFloat) &&
+      !Subtarget.useSoftFloat()) {
+    if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
+        (Subtarget.hasSSE1() || Subtarget.hasX87()))
+      return AtomicExpansionKind::None;
+
+    if (MemType->getPrimitiveSizeInBits() == 128 && Subtarget.is64Bit() &&
+        Subtarget.hasAVX())
+      return AtomicExpansionKind::None;
+  }
 
   return needsCmpXchgNb(MemType) ? AtomicExpansionKind::CmpXChg
                                  : AtomicExpansionKind::None;
@@ -31277,14 +31292,23 @@ static SDValue LowerATOMIC_STORE(SDValue Op, SelectionDAG &DAG,
   if (!IsSeqCst && IsTypeLegal)
     return Op;
 
-  if (VT == MVT::i64 && !IsTypeLegal) {
+  if (!IsTypeLegal && !Subtarget.useSoftFloat() &&
+      !DAG.getMachineFunction().getFunction().hasFnAttribute(
+          Attribute::NoImplicitFloat)) {
+    SDValue Chain;
+    // For illegal i128 atomic_store, when AVX is enabled, we can simply emit a
+    // vector store.
+    if (VT == MVT::i128) {
+      if (Subtarget.is64Bit() && Subtarget.hasAVX()) {
+        SDValue VecVal = DAG.getBitcast(MVT::v2i64, Node->getVal());
+        Chain = DAG.getStore(Node->getChain(), dl, VecVal, Node->getBasePtr(),
+                             Node->getMemOperand());
+      }
+    }
+
     // For illegal i64 atomic_stores, we can try to use MOVQ or MOVLPS if SSE
     // is enabled.
-    bool NoImplicitFloatOps =
-        DAG.getMachineFunction().getFunction().hasFnAttribute(
-            Attribute::NoImplicitFloat);
-    if (!Subtarget.useSoftFloat() && !NoImplicitFloatOps) {
-      SDValue Chain;
+    if (VT == MVT::i64) {
       if (Subtarget.hasSSE1()) {
         SDValue SclToVec =
             DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v2i64, Node->getVal());
@@ -31316,15 +31340,15 @@ static SDValue LowerATOMIC_STORE(SDValue Op, SelectionDAG &DAG,
             DAG.getMemIntrinsicNode(X86ISD::FIST, dl, DAG.getVTList(MVT::Other),
                                     StoreOps, MVT::i64, Node->getMemOperand());
       }
+    }
 
-      if (Chain) {
-        // If this is a sequentially consistent store, also emit an appropriate
-        // barrier.
-        if (IsSeqCst)
-          Chain = emitLockedStackOp(DAG, Subtarget, Chain, dl);
+    if (Chain) {
+      // If this is a sequentially consistent store, also emit an appropriate
+      // barrier.
+      if (IsSeqCst)
+        Chain = emitLockedStackOp(DAG, Subtarget, Chain, dl);
 
-        return Chain;
-      }
+      return Chain;
     }
   }
 
@@ -32877,12 +32901,30 @@ void X86TargetLowering::ReplaceNodeResults(SDNode *N,
     return;
   }
   case ISD::ATOMIC_LOAD: {
-    assert(N->getValueType(0) == MVT::i64 && "Unexpected VT!");
+    assert(
+        (N->getValueType(0) == MVT::i64 || N->getValueType(0) == MVT::i128) &&
+        "Unexpected VT!");
     bool NoImplicitFloatOps =
         DAG.getMachineFunction().getFunction().hasFnAttribute(
             Attribute::NoImplicitFloat);
     if (!Subtarget.useSoftFloat() && !NoImplicitFloatOps) {
       auto *Node = cast<AtomicSDNode>(N);
+
+      if (N->getValueType(0) == MVT::i128) {
+        if (Subtarget.is64Bit() && Subtarget.hasAVX()) {
+          SDValue Ld = DAG.getLoad(MVT::v2i64, dl, Node->getChain(),
+                                   Node->getBasePtr(), Node->getMemOperand());
+          SDValue ResL = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i64, Ld,
+                                     DAG.getIntPtrConstant(0, dl));
+          SDValue ResH = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i64, Ld,
+                                     DAG.getIntPtrConstant(1, dl));
+          Results.push_back(DAG.getNode(ISD::BUILD_PAIR, dl, N->getValueType(0),
+                                        {ResL, ResH}));
+          Results.push_back(Ld.getValue(1));
+          return;
+        }
+        break;
+      }
       if (Subtarget.hasSSE1()) {
         // Use a VZEXT_LOAD which will be selected as MOVQ or XORPS+MOVLPS.
         // Then extract the lower 64-bits.
diff --git a/llvm/test/CodeGen/X86/atomic-non-integer.ll b/llvm/test/CodeGen/X86/atomic-non-integer.ll
index 7d2810e57a25b..22b45b13aae22 100644
--- a/llvm/test/CodeGen/X86/atomic-non-integer.ll
+++ b/llvm/test/CodeGen/X86/atomic-non-integer.ll
@@ -207,14 +207,7 @@ define void @store_fp128(ptr %fptr, fp128 %v) {
 ;
 ; X64-AVX-LABEL: store_fp128:
 ; X64-AVX:       # %bb.0:
-; X64-AVX-NEXT:    subq $24, %rsp
-; X64-AVX-NEXT:    .cfi_def_cfa_offset 32
-; X64-AVX-NEXT:    vmovaps %xmm0, (%rsp)
-; X64-AVX-NEXT:    movq (%rsp), %rsi
-; X64-AVX-NEXT:    movq {{[0-9]+}}(%rsp), %rdx
-; X64-AVX-NEXT:    callq __sync_lock_test_and_set_16@PLT
-; X64-AVX-NEXT:    addq $24, %rsp
-; X64-AVX-NEXT:    .cfi_def_cfa_offset 8
+; X64-AVX-NEXT:    vmovaps %xmm0, (%rdi)
 ; X64-AVX-NEXT:    retq
   store atomic fp128 %v, ptr %fptr unordered, align 16
   ret void
@@ -592,18 +585,9 @@ define fp128 @load_fp128(ptr %fptr) {
 ;
 ; X64-AVX-LABEL: load_fp128:
 ; X64-AVX:       # %bb.0:
-; X64-AVX-NEXT:    subq $24, %rsp
-; X64-AVX-NEXT:    .cfi_def_cfa_offset 32
-; X64-AVX-NEXT:    xorl %esi, %esi
-; X64-AVX-NEXT:    xorl %edx, %edx
-; X64-AVX-NEXT:    xorl %ecx, %ecx
-; X64-AVX-NEXT:    xorl %r8d, %r8d
-; X64-AVX-NEXT:    callq __sync_val_compare_and_swap_16@PLT
-; X64-AVX-NEXT:    movq %rdx, {{[0-9]+}}(%rsp)
-; X64-AVX-NEXT:    movq %rax, (%rsp)
-; X64-AVX-NEXT:    vmovaps (%rsp), %xmm0
-; X64-AVX-NEXT:    addq $24, %rsp
-; X64-AVX-NEXT:    .cfi_def_cfa_offset 8
+; X64-AVX-NEXT:    vmovaps (%rdi), %xmm0
+; X64-AVX-NEXT:    vmovaps %xmm0, -{{[0-9]+}}(%rsp)
+; X64-AVX-NEXT:    vmovaps -{{[0-9]+}}(%rsp), %xmm0
 ; X64-AVX-NEXT:    retq
   %v = load atomic fp128, ptr %fptr unordered, align 16
   ret fp128 %v
diff --git a/llvm/test/CodeGen/X86/atomic-unordered.ll b/llvm/test/CodeGen/X86/atomic-unordered.ll
index b66988c8bd24b..91e427189de47 100644
--- a/llvm/test/CodeGen/X86/atomic-unordered.ll
+++ b/llvm/test/CodeGen/X86/atomic-unordered.ll
@@ -230,34 +230,12 @@ define void @widen_broadcast_unaligned(ptr %p0, i32 %v) {
 }
 
 define i128 @load_i128(ptr %ptr) {
-; CHECK-O0-LABEL: load_i128:
-; CHECK-O0:       # %bb.0:
-; CHECK-O0-NEXT:    pushq %rbx
-; CHECK-O0-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-O0-NEXT:    .cfi_offset %rbx, -16
-; CHECK-O0-NEXT:    xorl %eax, %eax
-; CHECK-O0-NEXT:    movl %eax, %ebx
-; CHECK-O0-NEXT:    movq %rbx, %rax
-; CHECK-O0-NEXT:    movq %rbx, %rdx
-; CHECK-O0-NEXT:    movq %rbx, %rcx
-; CHECK-O0-NEXT:    lock cmpxchg16b (%rdi)
-; CHECK-O0-NEXT:    popq %rbx
-; CHECK-O0-NEXT:    .cfi_def_cfa_offset 8
-; CHECK-O0-NEXT:    retq
-;
-; CHECK-O3-LABEL: load_i128:
-; CHECK-O3:       # %bb.0:
-; CHECK-O3-NEXT:    pushq %rbx
-; CHECK-O3-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-O3-NEXT:    .cfi_offset %rbx, -16
-; CHECK-O3-NEXT:    xorl %eax, %eax
-; CHECK-O3-NEXT:    xorl %edx, %edx
-; CHECK-O3-NEXT:    xorl %ecx, %ecx
-; CHECK-O3-NEXT:    xorl %ebx, %ebx
-; CHECK-O3-NEXT:    lock cmpxchg16b (%rdi)
-; CHECK-O3-NEXT:    popq %rbx
-; CHECK-O3-NEXT:    .cfi_def_cfa_offset 8
-; CHECK-O3-NEXT:    retq
+; CHECK-LABEL: load_i128:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vmovdqa (%rdi), %xmm0
+; CHECK-NEXT:    vmovq %xmm0, %rax
+; CHECK-NEXT:    vpextrq $1, %xmm0, %rdx
+; CHECK-NEXT:    retq
   %v = load atomic i128, ptr %ptr unordered, align 16
   ret i128 %v
 }
@@ -265,51 +243,18 @@ define i128 @load_i128(ptr %ptr) {
 define void @store_i128(ptr %ptr, i128 %v) {
 ; CHECK-O0-LABEL: store_i128:
 ; CHECK-O0:       # %bb.0:
-; CHECK-O0-NEXT:    pushq %rbx
-; CHECK-O0-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-O0-NEXT:    .cfi_offset %rbx, -16
-; CHECK-O0-NEXT:    movq %rdi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    movq %rdx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    movq (%rdi), %rax
-; CHECK-O0-NEXT:    movq 8(%rdi), %rdx
-; CHECK-O0-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    movq %rdx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    jmp .LBB16_1
-; CHECK-O0-NEXT:  .LBB16_1: # %atomicrmw.start
-; CHECK-O0-NEXT:    # =>This Inner Loop Header: Depth=1
-; CHECK-O0-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
-; CHECK-O0-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
-; CHECK-O0-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
-; CHECK-O0-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbx # 8-byte Reload
-; CHECK-O0-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
-; CHECK-O0-NEXT:    lock cmpxchg16b (%rsi)
-; CHECK-O0-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    movq %rdx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
-; CHECK-O0-NEXT:    jne .LBB16_1
-; CHECK-O0-NEXT:    jmp .LBB16_2
-; CHECK-O0-NEXT:  .LBB16_2: # %atomicrmw.end
-; CHECK-O0-NEXT:    popq %rbx
-; CHECK-O0-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-O0-NEXT:    vmovq %rsi, %xmm0
+; CHECK-O0-NEXT:    vmovq %rdx, %xmm1
+; CHECK-O0-NEXT:    vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; CHECK-O0-NEXT:    vmovdqa %xmm0, (%rdi)
 ; CHECK-O0-NEXT:    retq
 ;
 ; CHECK-O3-LABEL: store_i128:
 ; CHECK-O3:       # %bb.0:
-; CHECK-O3-NEXT:    pushq %rbx
-; CHECK-O3-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-O3-NEXT:    .cfi_offset %rbx, -16
-; CHECK-O3-NEXT:    movq %rdx, %rcx
-; CHECK-O3-NEXT:    movq %rsi, %rbx
-; CHECK-O3-NEXT:    movq (%rdi), %rax
-; CHECK-O3-NEXT:    movq 8(%rdi), %rdx
-; CHECK-O3-NEXT:    .p2align 4, 0x90
-; CHECK-O3-NEXT:  .LBB16_1: # %atomicrmw.start
-; CHECK-O3-NEXT:    # =>This Inner Loop Header: Depth=1
-; CHECK-O3-NEXT:    lock cmpxchg16b (%rdi)
-; CHECK-O3-NEXT:    jne .LBB16_1
-; CHECK-O3-NEXT:  # %bb.2: # %atomicrmw.end
-; CHECK-O3-NEXT:    popq %rbx
-; CHECK-O3-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-O3-NEXT:    vmovq %rdx, %xmm0
+; CHECK-O3-NEXT:    vmovq %rsi, %xmm1
+; CHECK-O3-NEXT:    vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
+; CHECK-O3-NEXT:    vmovdqa %xmm0, (%rdi)
 ; CHECK-O3-NEXT:    retq
   store atomic i128 %v, ptr %ptr unordered, align 16
   ret void
diff --git a/llvm/test/CodeGen/X86/atomic128.ll b/llvm/test/CodeGen/X86/atomic128.ll
index d5600b54a169d..76c3b2c5f1bb1 100644
--- a/llvm/test/CodeGen/X86/atomic128.ll
+++ b/llvm/test/CodeGen/X86/atomic128.ll
@@ -1,5 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=x86_64-apple-macosx10.9 -verify-machineinstrs -mattr=cx16 | FileCheck %s
+; RUN: llc < %s -mtriple=x86_64-apple-macosx10.9 -verify-machineinstrs -mattr=cx16 | FileCheck %s --check-prefixes=CHECK,CHECK-NOAVX
+; RUN: llc < %s -mtriple=x86_64-apple-macosx10.9 -verify-machineinstrs -mattr=cx16,avx | FileCheck %s --check-prefixes=CHECK,CHECK-AVX
 ; RUN: llc < %s -mtriple=i386-linux-gnu -verify-machineinstrs -mattr=cx16 | FileCheck %s -check-prefixes=CHECK32
 ; RUN: llc < %s -mtriple=i386-linux-gnu -verify-machineinstrs -mattr=-cx16 | FileCheck %s -check-prefixes=CHECK32
 
@@ -83,21 +84,32 @@ define i128 @val_compare_and_swap(ptr %p, i128 %oldval, i128 %newval) {
 @cmpxchg16b_global = external dso_local global { i128, i128 }, align 16
 
 ;; Make sure we retain the offset of the global variable.
-define void @cmpxchg16b_global_with_offset() nounwind {
-; CHECK-LABEL: cmpxchg16b_global_with_offset:
-; CHECK:       ## %bb.0: ## %entry
-; CHECK-NEXT:    pushq %rbx
-; CHECK-NEXT:    xorl %eax, %eax
-; CHECK-NEXT:    xorl %edx, %edx
-; CHECK-NEXT:    xorl %ecx, %ecx
-; CHECK-NEXT:    xorl %ebx, %ebx
-; CHECK-NEXT:    lock cmpxchg16b _cmpxchg16b_global+16(%rip)
-; CHECK-NEXT:    popq %rbx
-; CHECK-NEXT:    retq
+define i128 @load_global_with_offset() nounwind {
+; CHECK-NOAVX-LABEL: load_global_with_offset:
+; CHECK-NOAVX:       ## %bb.0: ## %entry
+; CHECK-NOAVX-NEXT:    pushq %rbx
+; CHECK-NOAVX-NEXT:    xorl %eax, %eax
+; CHECK-NOAVX-NEXT:    xorl %edx, %edx
+; CHECK-NOAVX-NEXT:    xorl %ecx, %ecx
+; CHECK-NOAVX-NEXT:    xorl %ebx, %ebx
+; CHECK-NOAVX-NEXT:    lock cmpxchg16b _cmpxchg16b_global+16(%rip)
+; CHECK-NOAVX-NEXT:    popq %rbx
+; CHECK-NOAVX-NEXT:    retq
 ;
-; CHECK32-LABEL: cmpxchg16b_global_with_offset:
+; CHECK-AVX-LABEL: load_global_with_offset:
+; CHECK-AVX:       ## %bb.0: ## %entry
+; CHECK-AVX-NEXT:    vmovdqa _cmpxchg16b_global+16(%rip), %xmm0
+; CHECK-AVX-NEXT:    vmovq %xmm0, %rax
+; CHECK-AVX-NEXT:    vpextrq $1, %xmm0, %rdx
+; CHECK-AVX-NEXT:    retq
+;
+; CHECK32-LABEL: load_global_with_offset:
 ; CHECK32:       # %bb.0: # %entry
-; CHECK32-NEXT:    subl $36, %esp
+; CHECK32-NEXT:    pushl %edi
+; CHECK32-NEXT:    pushl %esi
+; CHECK32-NEXT:    subl $20, %esp
+; CHECK32-NEXT:    movl {{[0-9]+}}(%esp), %esi
+; CHECK32-NEXT:    subl $8, %esp
 ; CHECK32-NEXT:    leal {{[0-9]+}}(%esp), %eax
 ; CHECK32-NEXT:    pushl $0
 ; CHECK32-NEXT:    pushl $0
@@ -110,11 +122,23 @@ define void @cmpxchg16b_global_with_offset() nounwind {
 ; CHECK32-NEXT:    pushl $cmpxchg16b_global+16
 ; CHECK32-NEXT:    pushl %eax
 ; CHECK32-NEXT:    calll __sync_val_compare_and_swap_16
-; CHECK32-NEXT:    addl $72, %esp
-; CHECK32-NEXT:    retl
+; CHECK32-NEXT:    addl $44, %esp
+; CHECK32-NEXT:    movl (%esp), %eax
+; CHECK32-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; CHECK32-NEXT:    movl {{[0-9]+}}(%esp), %edx
+; CHECK32-NEXT:    movl {{[0-9]+}}(%esp), %edi
+; CHECK32-NEXT:    movl %edi, 8(%esi)
+; CHECK32-NEXT:    movl %edx, 12(%esi)
+; CHECK32-NEXT:    movl %eax, (%esi)
+; CHECK32-NEXT:    movl %ecx, 4(%esi)
+; CHECK32-NEXT:    movl %esi, %eax
+; CHECK32-NEXT:    addl $20, %esp
+; CHECK32-NEXT:    popl %esi
+; CHECK32-NEXT:    popl %edi
+; CHECK32-NEXT:    retl $4
 entry:
   %0 = load atomic i128, ptr getelementptr inbounds ({i128, i128}, ptr @cmpxchg16b_global, i64 0, i32 1) acquire, align 16
-  ret void
+  ret i128 %0
 }
 
 define void @fetch_and_nand(ptr %p, i128 %bits) {
@@ -676,18 +700,25 @@ define void @fetch_and_umax(ptr %p, i128 %bits) {
 }
 
 define i128 @atomic_load_seq_cst(ptr %p) {
-; CHECK-LABEL: atomic_load_seq_cst:
-; CHECK:       ## %bb.0:
-; CHECK-NEXT:    pushq %rbx
-; CHECK-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-NEXT:    .cfi_offset %rbx, -16
-; CHECK-NEXT:    xorl %eax, %eax
-; CHECK-NEXT:    xorl %edx, %edx
-; CHECK-NEXT:    xorl %ecx, %ecx
-; CHECK-NEXT:    xorl %ebx, %ebx
-; CHECK-NEXT:    lock cmpxchg16b (%rdi)
-; CHECK-NEXT:    popq %rbx
-; CHECK-NEXT:    retq
+; CHECK-NOAVX-LABEL: atomic_load_seq_cst:
+; CHECK-NOAVX:       ## %bb.0:
+; CHECK-NOAVX-NEXT:    pushq %rbx
+; CHECK-NOAVX-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NOAVX-NEXT:    .cfi_offset %rbx, -16
+; CHECK-NOAVX-NEXT:    xorl %eax, %eax
+; CHECK-NOAVX-NEXT:    xorl %edx, %edx
+; CHECK-NOAVX-NEXT:    xorl %ecx, %ecx
+; CHECK-NOAVX-NEXT:    xorl %ebx, %ebx
+; CHECK-NOAVX-NEXT:    lock cmpxchg16b (%rdi)
+; CHECK-NOAVX-NEXT:    popq %rbx
+; CHECK-NOAVX-NEXT:    retq
+;
+; CHECK-AVX-LABEL: atomic_load_seq_cst:
+; CHECK-AVX:       ## %bb.0:
+; CHECK-AVX-NEXT:    vmovdqa (%rdi), %xmm0
+; CHECK-AVX-NEXT:    vmovq %xmm0, %rax
+; CHECK-AVX-NEXT:    vpextrq $1, %xmm0, %rdx
+; CHECK-AVX-NEXT:    retq
 ;
 ; CHECK32-LABEL: atomic_load_seq_cst:
 ; CHECK32:       # %bb.0:
@@ -748,18 +779,25 @@ define i128 @atomic_load_seq_cst(ptr %p) {
 }
 
 define i128 @atomic_load_relaxed(ptr %p) {
-; CHECK-LABEL: atomic_load_relaxed:
-; CHECK:       ## %bb.0:
-; CHECK-NEXT:    pushq %rbx
-; CHECK-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-NEXT...
[truncated]

@jyknight
Copy link
Member Author

Ping!

@@ -30115,12 +30126,16 @@ X86TargetLowering::shouldExpandAtomicLoadInIR(LoadInst *LI) const {
// If this a 64 bit atomic load on a 32-bit target and SSE2 is enabled, we
// can use movq to do the load. If we have X87 we can load into an 80-bit
// X87 register and store it to a stack temporary.
Collaborator

Can you move the comment down to the case it applies to (inside the outer if)? Currently it binds to both cases, which is misleading.

Member Author

Done.

// For illegal i128 atomic_store, when AVX is enabled, we can simply emit a
// vector store.
if (VT == MVT::i128) {
if (Subtarget.is64Bit() && Subtarget.hasAVX()) {
Collaborator

Looks like you can collapse one level of if clause here.

Member Author

Done.

Attribute::NoImplicitFloat);
if (!Subtarget.useSoftFloat() && !NoImplicitFloatOps) {
SDValue Chain;
if (VT == MVT::i64) {
if (Subtarget.hasSSE1()) {
Collaborator

Same here.

Member Author

This has an "else if" attached, so better not to.

; CHECK-O3-NEXT: popq %rbx
; CHECK-O3-NEXT: .cfi_def_cfa_offset 8
; CHECK-O3-NEXT: retq
; CHECK-O0-CUR-LABEL: load_i128:
Collaborator

Your tests need updating; these check prefixes no longer exist.

Member Author

Done.

if (!SI->getFunction()->hasFnAttribute(Attribute::NoImplicitFloat) &&
!Subtarget.useSoftFloat()) {
if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
(Subtarget.hasSSE1() || Subtarget.hasX87()))
Collaborator

Do we need to check the alignment here?

Member Author

No, misaligned atomic ops are converted to an __atomic_* libcall before this function is ever called.

@jyknight
Member Author

jyknight commented Jan 2, 2024

Ping!

// If this a 64 bit atomic load on a 32-bit target and SSE2 is enabled, we
// can use movq to do the load. If we have X87 we can load into an 80-bit
// X87 register and store it to a stack temporary.
if (MemType->getPrimitiveSizeInBits() == 64 && !Subtarget.is64Bit() &&
Contributor

getPrimitiveSizeInBits doesn't work correctly for pointer typed atomics; you need to use the DataLayout. Is there appropriate pointer typed atomic load test coverage for this?

Member Author

Yikes, that's a nasty footgun -- obvious once you think about it, but I never would've looked out for that issue! The API of getPrimitiveSizeInBits is just asking for mistakes like this. It ought to either assert when called on a non-primitive, or else return an optional, rather than simply returning 0 on failure.

But, looking through this code and the rest of the related atomics support on X86, I believe there is actually not a correctness impact, because (not coincidentally!) atomics always work "normally" on pointer-sized values. That is, here, getPrimitiveSizeInBits() will return 0 rather than the perhaps-expected is64Bit() ? 64 : 32. Despite being unexpected, though, the logic still works out fine in that case.

So, given that this is a longstanding pre-existing issue -- which appears not to be a correctness issue -- I'd like to leave it alone in this patch.
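
For concreteness, the pointer-typed case being discussed is roughly the following (an illustrative example, not a test from this patch; the function name is made up):

define ptr @load_atomic_ptr(ptr %pp) {
  ; MemType here is a pointer type, so getPrimitiveSizeInBits() returns 0
  ; rather than 64; neither special case fires, and the op still lowers to an
  ; ordinary 64-bit atomic mov on x86-64, as described above.
  %v = load atomic ptr, ptr %pp seq_cst, align 8
  ret ptr %v
}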

Contributor

@arsenm arsenm left a comment

I don't know anything about x86 but this seems plausible

Collaborator

@RKSimon RKSimon left a comment

Please can you confirm we have tests for underaligned pointers?

@jyknight
Member Author

jyknight commented Feb 2, 2024

Underaligned atomic operations are expanded to an appropriate __atomic_* libcall via mostly target-independent code in AtomicExpandPass (see the comment there: "// If the Size/Alignment is not supported, replace with a libcall.") and never hit any of this. I'm not sure we have full test coverage for that case on X86, but I know we do for AArch64.
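
As a hypothetical sketch of that path (not a test in this patch; the function name is invented), an under-aligned 16-byte atomic never reaches the new X86 lowering:

define i128 @load_i128_underaligned(ptr %ptr) {
  ; Only 8 bytes of alignment on a 16-byte atomic: AtomicExpandPass replaces
  ; this with an __atomic_* runtime call before ISel, so the AVX load/store
  ; lowering added in this PR is never consulted.
  %v = load atomic i128, ptr %ptr unordered, align 8
  ret i128 %v
}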

@jyknight
Member Author

jyknight commented Feb 6, 2024

@RKSimon: I'm not sure if you intended your comment to be a blocker; I was about to press the merge button when you commented (and would still like to now).

Collaborator

@RKSimon RKSimon left a comment

LGTM. I'd still like to ensure we have unaligned x86 test coverage.

Collaborator

@RKSimon RKSimon left a comment

LGTM cheers

@jyknight jyknight merged commit d6f9278 into llvm:main May 16, 2024
3 of 4 checks passed
@jyknight jyknight deleted the atomic-128b-avx branch May 16, 2024 22:24