
[RISCV] Support isel for Zacas for 2*XLen types. #77814

Closed

Conversation

@topperc (Collaborator) commented Jan 11, 2024

This is an alternative to #67918 now that the MC layer amocas.d/amocas.q instructions use the GPRPair register class.

While working on this I noticed that the change to setMaxAtomicSizeInBitsSupported caused atomic load/store/rmw for 2*XLen types to be affected. AtomicExpandPass no longer converts them to libcalls immediately. Instead they get passed through to SelectionDAG and then converted to different libcalls during type legalization.

I didn't see any way to signal AtomicExpandPass to convert them to the same libcalls using any of the shouldExpandAtomic* hooks, so I've forced them to use the CmpXChg expansion. I've disabled the insertion of fences for atomic load/store when we use CmpXChg.

I'm very unsure if this is the right thing to do, or if we should make changes to AtomicExpand to get back the original libcalls.
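For context, a minimal C++-level sketch (illustrative only, not code from the patch) of what the CmpXChg expansion means for an oversized atomic load: the load is emulated by a compare-exchange whose compare and new values are both zero, which yields the old value either way. This matches the RV32 Zacas check lines quoted later in this conversation (li a2, 0 / li a3, 0 / amocas.d).

    // Hedged sketch of the semantics of AtomicExpandPass's CmpXChg expansion
    // of an atomic load; illustration only, not the pass's actual code.
    #include <atomic>

    using u128 = unsigned __int128; // assumes a target that provides __int128

    u128 load_via_cas(std::atomic<u128> &obj) {
      u128 expected = 0;
      // CAS with compare == desired == 0: if the current value is 0 this
      // stores 0 back (a write), otherwise it fails and copies the current
      // value into 'expected'. Either way 'expected' ends up holding the
      // old value, but the operation always needs write access to the object.
      obj.compare_exchange_strong(expected, u128{0},
                                  std::memory_order_relaxed,
                                  std::memory_order_relaxed);
      return expected;
    }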

@llvmbot (Collaborator) commented Jan 11, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Craig Topper (topperc)

Changes


Patch is 958.48 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/77814.diff

6 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+124-2)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.h (+5-3)
  • (modified) llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp (+3-2)
  • (modified) llvm/test/CodeGen/RISCV/atomic-cmpxchg.ll (+1857-159)
  • (modified) llvm/test/CodeGen/RISCV/atomic-load-store.ll (+1310-76)
  • (modified) llvm/test/CodeGen/RISCV/atomic-rmw.ll (+18245-1379)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index cb9ffabc41236e..8ba4d65eacb760 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -629,7 +629,10 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
   }
 
   if (Subtarget.hasStdExtA()) {
-    setMaxAtomicSizeInBitsSupported(Subtarget.getXLen());
+    unsigned MaxAtomicSize = Subtarget.getXLen();
+    if (Subtarget.hasStdExtZacas())
+      MaxAtomicSize = 2 * Subtarget.getXLen();
+    setMaxAtomicSizeInBitsSupported(MaxAtomicSize);
     setMinCmpXchgSizeInBits(32);
   } else if (Subtarget.hasForcedAtomics()) {
     setMaxAtomicSizeInBitsSupported(Subtarget.getXLen());
@@ -1338,6 +1341,9 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::ATOMIC_LOAD_SUB, XLenVT, Expand);
     if (RV64LegalI32 && Subtarget.is64Bit())
       setOperationAction(ISD::ATOMIC_LOAD_SUB, MVT::i32, Expand);
+    if (Subtarget.hasStdExtZacas())
+      setOperationAction(ISD::ATOMIC_CMP_SWAP,
+                         Subtarget.is64Bit() ? MVT::i128 : MVT::i64, Custom);
   }
 
   if (Subtarget.hasForcedAtomics()) {
@@ -11237,6 +11243,76 @@ static SDValue customLegalizeToWOpWithSExt(SDNode *N, SelectionDAG &DAG) {
   return DAG.getNode(ISD::TRUNCATE, DL, MVT::i32, NewRes);
 }
 
+// Create an even/odd pair of X registers holding integer value V.
+static SDValue createGPRPairNode(SelectionDAG &DAG, SDValue V, MVT VT,
+                                 MVT SubRegVT) {
+  SDLoc DL(V.getNode());
+  auto [VLo, VHi] = DAG.SplitScalar(V, DL, SubRegVT, SubRegVT);
+  SDValue RegClass =
+      DAG.getTargetConstant(RISCV::GPRPairRegClassID, DL, MVT::i32);
+  SDValue SubReg0 = DAG.getTargetConstant(RISCV::sub_gpr_even, DL, MVT::i32);
+  SDValue SubReg1 = DAG.getTargetConstant(RISCV::sub_gpr_odd, DL, MVT::i32);
+  const SDValue Ops[] = {RegClass, VLo, SubReg0, VHi, SubReg1};
+  return SDValue(
+      DAG.getMachineNode(TargetOpcode::REG_SEQUENCE, DL, MVT::Untyped, Ops), 0);
+}
+
+static void ReplaceCMP_SWAP_2XLenResults(SDNode *N,
+                                         SmallVectorImpl<SDValue> &Results,
+                                         SelectionDAG &DAG,
+                                         const RISCVSubtarget &Subtarget) {
+  MVT VT = N->getSimpleValueType(0);
+  assert(N->getValueType(0) == (Subtarget.is64Bit() ? MVT::i128 : MVT::i64) &&
+         "AtomicCmpSwap on types less than 2*XLen should be legal");
+  assert(Subtarget.hasStdExtZacas());
+  MVT XLenVT = Subtarget.getXLenVT();
+
+  SDValue Ops[] = {
+      createGPRPairNode(DAG, N->getOperand(2), VT, XLenVT), // Compare value
+      N->getOperand(1),                                     // Ptr
+      createGPRPairNode(DAG, N->getOperand(3), VT, XLenVT), // Store value
+      N->getOperand(0),                                     // Chain in
+  };
+
+  MachineMemOperand *MemOp = cast<MemSDNode>(N)->getMemOperand();
+
+  bool Is64Bit = Subtarget.is64Bit();
+  unsigned Opcode;
+  if (Subtarget.hasStdExtZtso()) {
+    Opcode = Subtarget.is64Bit() ? RISCV::AMOCAS_Q : RISCV::AMOCAS_D_RV32;
+  } else {
+    switch (MemOp->getMergedOrdering()) {
+    default:
+      llvm_unreachable("Unexpected ordering!");
+    case AtomicOrdering::Monotonic:
+      Opcode = Is64Bit ? RISCV::AMOCAS_Q : RISCV::AMOCAS_D_RV32;
+      break;
+    case AtomicOrdering::Acquire:
+      Opcode = Is64Bit ? RISCV::AMOCAS_Q_AQ : RISCV::AMOCAS_D_RV32_AQ;
+      break;
+    case AtomicOrdering::Release:
+      Opcode = Is64Bit ? RISCV::AMOCAS_Q_RL : RISCV::AMOCAS_D_RV32_RL;
+      break;
+    case AtomicOrdering::AcquireRelease:
+    case AtomicOrdering::SequentiallyConsistent:
+      Opcode = Is64Bit ? RISCV::AMOCAS_Q_AQ_RL : RISCV::AMOCAS_D_RV32_AQ_RL;
+      break;
+    }
+  }
+
+  SDLoc DL(N);
+  MachineSDNode *CmpSwap = DAG.getMachineNode(
+      Opcode, DL, DAG.getVTList(MVT::Untyped, MVT::Other), Ops);
+  DAG.setNodeMemRefs(CmpSwap, {MemOp});
+
+  SDValue Lo = DAG.getTargetExtractSubreg(RISCV::sub_gpr_even, DL, XLenVT,
+                                          SDValue(CmpSwap, 0));
+  SDValue Hi = DAG.getTargetExtractSubreg(RISCV::sub_gpr_odd, DL, XLenVT,
+                                          SDValue(CmpSwap, 0));
+  Results.push_back(DAG.getNode(ISD::BUILD_PAIR, DL, VT, Lo, Hi));
+  Results.push_back(SDValue(CmpSwap, 1));
+}
+
 void RISCVTargetLowering::ReplaceNodeResults(SDNode *N,
                                              SmallVectorImpl<SDValue> &Results,
                                              SelectionDAG &DAG) const {
@@ -11244,6 +11320,9 @@ void RISCVTargetLowering::ReplaceNodeResults(SDNode *N,
   switch (N->getOpcode()) {
   default:
     llvm_unreachable("Don't know how to custom type legalize this operation!");
+  case ISD::ATOMIC_CMP_SWAP:
+    ReplaceCMP_SWAP_2XLenResults(N, Results, DAG, Subtarget);
+    break;
   case ISD::STRICT_FP_TO_SINT:
   case ISD::STRICT_FP_TO_UINT:
   case ISD::FP_TO_SINT:
@@ -19003,6 +19082,20 @@ void RISCVTargetLowering::LowerAsmOperandForConstraint(
   TargetLowering::LowerAsmOperandForConstraint(Op, Constraint, Ops, DAG);
 }
 
+bool RISCVTargetLowering::shouldInsertFencesForAtomic(
+    const Instruction *I) const {
+  // We don't need a fence for 2*Xlen. We can use Zacas.
+  if (auto *LI = dyn_cast<LoadInst>(I))
+    return LI->getType()->getPrimitiveSizeInBits() != 2 * Subtarget.getXLen();
+
+  // We don't need a fence for 2*Xlen. We can use Zacas.
+  if (auto *SI = dyn_cast<StoreInst>(I))
+    return SI->getValueOperand()->getType()->getPrimitiveSizeInBits() !=
+           2 * Subtarget.getXLen();
+
+  return false;
+}
+
 Instruction *RISCVTargetLowering::emitLeadingFence(IRBuilderBase &Builder,
                                                    Instruction *Inst,
                                                    AtomicOrdering Ord) const {
@@ -19036,6 +19129,30 @@ Instruction *RISCVTargetLowering::emitTrailingFence(IRBuilderBase &Builder,
   return nullptr;
 }
 
+TargetLowering::AtomicExpansionKind
+RISCVTargetLowering::shouldExpandAtomicLoadInIR(LoadInst *LI) const {
+  unsigned Size = LI->getType()->getPrimitiveSizeInBits();
+
+  if (Size != 2 * Subtarget.getXLen())
+    return AtomicExpansionKind::None;
+
+  // With Zacas we can use amocas for 2*XLen types.
+  assert(Subtarget.hasStdExtZacas() && "Unexpected extension");
+  return AtomicExpansionKind::CmpXChg;
+}
+
+TargetLowering::AtomicExpansionKind
+RISCVTargetLowering::shouldExpandAtomicStoreInIR(StoreInst *SI) const {
+  unsigned Size = SI->getValueOperand()->getType()->getPrimitiveSizeInBits();
+
+  if (Size != 2 * Subtarget.getXLen())
+    return AtomicExpansionKind::None;
+
+  // With Zacas we can use amocas for 2*XLen types.
+  assert(Subtarget.hasStdExtZacas() && "Unexpected extension");
+  return AtomicExpansionKind::Expand;
+}
+
 TargetLowering::AtomicExpansionKind
 RISCVTargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *AI) const {
   // atomicrmw {fadd,fsub} must be expanded to use compare-exchange, as floating
@@ -19053,7 +19170,12 @@ RISCVTargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *AI) const {
   unsigned Size = AI->getType()->getPrimitiveSizeInBits();
   if (Size == 8 || Size == 16)
     return AtomicExpansionKind::MaskedIntrinsic;
-  return AtomicExpansionKind::None;
+  if (Size != 2 * Subtarget.getXLen())
+    return AtomicExpansionKind::None;
+
+  // With Zacas we can use amocas for 2*XLen types.
+  assert(Subtarget.hasStdExtZacas() && "Unexpected extension");
+  return AtomicExpansionKind::CmpXChg;
 }
 
 static Intrinsic::ID
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.h b/llvm/lib/Target/RISCV/RISCVISelLowering.h
index c65953e37b1710..0042c8c5f4e941 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.h
@@ -618,9 +618,7 @@ class RISCVTargetLowering : public TargetLowering {
 
   bool preferZeroCompareBranch() const override { return true; }
 
-  bool shouldInsertFencesForAtomic(const Instruction *I) const override {
-    return isa<LoadInst>(I) || isa<StoreInst>(I);
-  }
+  bool shouldInsertFencesForAtomic(const Instruction *I) const override;
   Instruction *emitLeadingFence(IRBuilderBase &Builder, Instruction *Inst,
                                 AtomicOrdering Ord) const override;
   Instruction *emitTrailingFence(IRBuilderBase &Builder, Instruction *Inst,
@@ -699,6 +697,10 @@ class RISCVTargetLowering : public TargetLowering {
   bool isMulAddWithConstProfitable(SDValue AddNode,
                                    SDValue ConstNode) const override;
 
+  TargetLoweringBase::AtomicExpansionKind
+  shouldExpandAtomicLoadInIR(LoadInst *LI) const override;
+  TargetLoweringBase::AtomicExpansionKind
+  shouldExpandAtomicStoreInIR(StoreInst *SI) const override;
   TargetLowering::AtomicExpansionKind
   shouldExpandAtomicRMWInIR(AtomicRMWInst *AI) const override;
   Value *emitMaskedAtomicRMWIntrinsic(IRBuilderBase &Builder, AtomicRMWInst *AI,
diff --git a/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp b/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
index 24f8d600f1eafc..de28af67050b21 100644
--- a/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
@@ -754,8 +754,9 @@ bool RISCVRegisterInfo::getRegAllocationHints(
                         bool NeedGPRC) -> void {
     Register Reg = MO.getReg();
     Register PhysReg = Reg.isPhysical() ? Reg : Register(VRM->getPhys(Reg));
-    if (PhysReg && (!NeedGPRC || RISCV::GPRCRegClass.contains(PhysReg))) {
-      assert(!MO.getSubReg() && !VRRegMO.getSubReg() && "Unexpected subreg!");
+    // TODO: Add hints when there are GPRPair subregs?
+    if (PhysReg && (!NeedGPRC || RISCV::GPRCRegClass.contains(PhysReg)) &&
+        !MO.getSubReg() && !VRRegMO.getSubReg()) {
       if (!MRI->isReserved(PhysReg) && !is_contained(Hints, PhysReg))
         TwoAddrHints.insert(PhysReg);
     }
diff --git a/llvm/test/CodeGen/RISCV/atomic-cmpxchg.ll b/llvm/test/CodeGen/RISCV/atomic-cmpxchg.ll
index 5b3e5789e8d910..46e249d2327e49 100644
--- a/llvm/test/CodeGen/RISCV/atomic-cmpxchg.ll
+++ b/llvm/test/CodeGen/RISCV/atomic-cmpxchg.ll
@@ -4219,21 +4219,46 @@ define void @cmpxchg_i64_monotonic_monotonic(ptr %ptr, i64 %cmp, i64 %val) nounw
 ; RV32I-NEXT:    addi sp, sp, 16
 ; RV32I-NEXT:    ret
 ;
-; RV32IA-LABEL: cmpxchg_i64_monotonic_monotonic:
-; RV32IA:       # %bb.0:
-; RV32IA-NEXT:    addi sp, sp, -16
-; RV32IA-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
-; RV32IA-NEXT:    sw a2, 4(sp)
-; RV32IA-NEXT:    sw a1, 0(sp)
-; RV32IA-NEXT:    mv a1, sp
-; RV32IA-NEXT:    mv a2, a3
-; RV32IA-NEXT:    mv a3, a4
-; RV32IA-NEXT:    li a4, 0
-; RV32IA-NEXT:    li a5, 0
-; RV32IA-NEXT:    call __atomic_compare_exchange_8
-; RV32IA-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
-; RV32IA-NEXT:    addi sp, sp, 16
-; RV32IA-NEXT:    ret
+; RV32IA-WMO-LABEL: cmpxchg_i64_monotonic_monotonic:
+; RV32IA-WMO:       # %bb.0:
+; RV32IA-WMO-NEXT:    addi sp, sp, -16
+; RV32IA-WMO-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32IA-WMO-NEXT:    sw a2, 4(sp)
+; RV32IA-WMO-NEXT:    sw a1, 0(sp)
+; RV32IA-WMO-NEXT:    mv a1, sp
+; RV32IA-WMO-NEXT:    mv a2, a3
+; RV32IA-WMO-NEXT:    mv a3, a4
+; RV32IA-WMO-NEXT:    li a4, 0
+; RV32IA-WMO-NEXT:    li a5, 0
+; RV32IA-WMO-NEXT:    call __atomic_compare_exchange_8
+; RV32IA-WMO-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32IA-WMO-NEXT:    addi sp, sp, 16
+; RV32IA-WMO-NEXT:    ret
+;
+; RV32IA-ZACAS-LABEL: cmpxchg_i64_monotonic_monotonic:
+; RV32IA-ZACAS:       # %bb.0:
+; RV32IA-ZACAS-NEXT:    mv a5, a4
+; RV32IA-ZACAS-NEXT:    mv a7, a2
+; RV32IA-ZACAS-NEXT:    mv a4, a3
+; RV32IA-ZACAS-NEXT:    mv a6, a1
+; RV32IA-ZACAS-NEXT:    amocas.d a6, a4, (a0)
+; RV32IA-ZACAS-NEXT:    ret
+;
+; RV32IA-TSO-LABEL: cmpxchg_i64_monotonic_monotonic:
+; RV32IA-TSO:       # %bb.0:
+; RV32IA-TSO-NEXT:    addi sp, sp, -16
+; RV32IA-TSO-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32IA-TSO-NEXT:    sw a2, 4(sp)
+; RV32IA-TSO-NEXT:    sw a1, 0(sp)
+; RV32IA-TSO-NEXT:    mv a1, sp
+; RV32IA-TSO-NEXT:    mv a2, a3
+; RV32IA-TSO-NEXT:    mv a3, a4
+; RV32IA-TSO-NEXT:    li a4, 0
+; RV32IA-TSO-NEXT:    li a5, 0
+; RV32IA-TSO-NEXT:    call __atomic_compare_exchange_8
+; RV32IA-TSO-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32IA-TSO-NEXT:    addi sp, sp, 16
+; RV32IA-TSO-NEXT:    ret
 ;
 ; RV64I-LABEL: cmpxchg_i64_monotonic_monotonic:
 ; RV64I:       # %bb.0:
@@ -4296,22 +4321,57 @@ define void @cmpxchg_i64_acquire_monotonic(ptr %ptr, i64 %cmp, i64 %val) nounwin
 ; RV32I-NEXT:    addi sp, sp, 16
 ; RV32I-NEXT:    ret
 ;
-; RV32IA-LABEL: cmpxchg_i64_acquire_monotonic:
-; RV32IA:       # %bb.0:
-; RV32IA-NEXT:    addi sp, sp, -16
-; RV32IA-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
-; RV32IA-NEXT:    mv a5, a4
-; RV32IA-NEXT:    sw a2, 4(sp)
-; RV32IA-NEXT:    sw a1, 0(sp)
-; RV32IA-NEXT:    mv a1, sp
-; RV32IA-NEXT:    li a4, 2
-; RV32IA-NEXT:    mv a2, a3
-; RV32IA-NEXT:    mv a3, a5
-; RV32IA-NEXT:    li a5, 0
-; RV32IA-NEXT:    call __atomic_compare_exchange_8
-; RV32IA-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
-; RV32IA-NEXT:    addi sp, sp, 16
-; RV32IA-NEXT:    ret
+; RV32IA-WMO-LABEL: cmpxchg_i64_acquire_monotonic:
+; RV32IA-WMO:       # %bb.0:
+; RV32IA-WMO-NEXT:    addi sp, sp, -16
+; RV32IA-WMO-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32IA-WMO-NEXT:    mv a5, a4
+; RV32IA-WMO-NEXT:    sw a2, 4(sp)
+; RV32IA-WMO-NEXT:    sw a1, 0(sp)
+; RV32IA-WMO-NEXT:    mv a1, sp
+; RV32IA-WMO-NEXT:    li a4, 2
+; RV32IA-WMO-NEXT:    mv a2, a3
+; RV32IA-WMO-NEXT:    mv a3, a5
+; RV32IA-WMO-NEXT:    li a5, 0
+; RV32IA-WMO-NEXT:    call __atomic_compare_exchange_8
+; RV32IA-WMO-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32IA-WMO-NEXT:    addi sp, sp, 16
+; RV32IA-WMO-NEXT:    ret
+;
+; RV32IA-WMO-ZACAS-LABEL: cmpxchg_i64_acquire_monotonic:
+; RV32IA-WMO-ZACAS:       # %bb.0:
+; RV32IA-WMO-ZACAS-NEXT:    mv a5, a4
+; RV32IA-WMO-ZACAS-NEXT:    mv a7, a2
+; RV32IA-WMO-ZACAS-NEXT:    mv a4, a3
+; RV32IA-WMO-ZACAS-NEXT:    mv a6, a1
+; RV32IA-WMO-ZACAS-NEXT:    amocas.d.aq a6, a4, (a0)
+; RV32IA-WMO-ZACAS-NEXT:    ret
+;
+; RV32IA-TSO-LABEL: cmpxchg_i64_acquire_monotonic:
+; RV32IA-TSO:       # %bb.0:
+; RV32IA-TSO-NEXT:    addi sp, sp, -16
+; RV32IA-TSO-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32IA-TSO-NEXT:    mv a5, a4
+; RV32IA-TSO-NEXT:    sw a2, 4(sp)
+; RV32IA-TSO-NEXT:    sw a1, 0(sp)
+; RV32IA-TSO-NEXT:    mv a1, sp
+; RV32IA-TSO-NEXT:    li a4, 2
+; RV32IA-TSO-NEXT:    mv a2, a3
+; RV32IA-TSO-NEXT:    mv a3, a5
+; RV32IA-TSO-NEXT:    li a5, 0
+; RV32IA-TSO-NEXT:    call __atomic_compare_exchange_8
+; RV32IA-TSO-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32IA-TSO-NEXT:    addi sp, sp, 16
+; RV32IA-TSO-NEXT:    ret
+;
+; RV32IA-TSO-ZACAS-LABEL: cmpxchg_i64_acquire_monotonic:
+; RV32IA-TSO-ZACAS:       # %bb.0:
+; RV32IA-TSO-ZACAS-NEXT:    mv a5, a4
+; RV32IA-TSO-ZACAS-NEXT:    mv a7, a2
+; RV32IA-TSO-ZACAS-NEXT:    mv a4, a3
+; RV32IA-TSO-ZACAS-NEXT:    mv a6, a1
+; RV32IA-TSO-ZACAS-NEXT:    amocas.d a6, a4, (a0)
+; RV32IA-TSO-ZACAS-NEXT:    ret
 ;
 ; RV64I-LABEL: cmpxchg_i64_acquire_monotonic:
 ; RV64I:       # %bb.0:
@@ -4379,22 +4439,57 @@ define void @cmpxchg_i64_acquire_acquire(ptr %ptr, i64 %cmp, i64 %val) nounwind
 ; RV32I-NEXT:    addi sp, sp, 16
 ; RV32I-NEXT:    ret
 ;
-; RV32IA-LABEL: cmpxchg_i64_acquire_acquire:
-; RV32IA:       # %bb.0:
-; RV32IA-NEXT:    addi sp, sp, -16
-; RV32IA-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
-; RV32IA-NEXT:    mv a6, a4
-; RV32IA-NEXT:    sw a2, 4(sp)
-; RV32IA-NEXT:    sw a1, 0(sp)
-; RV32IA-NEXT:    mv a1, sp
-; RV32IA-NEXT:    li a4, 2
-; RV32IA-NEXT:    li a5, 2
-; RV32IA-NEXT:    mv a2, a3
-; RV32IA-NEXT:    mv a3, a6
-; RV32IA-NEXT:    call __atomic_compare_exchange_8
-; RV32IA-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
-; RV32IA-NEXT:    addi sp, sp, 16
-; RV32IA-NEXT:    ret
+; RV32IA-WMO-LABEL: cmpxchg_i64_acquire_acquire:
+; RV32IA-WMO:       # %bb.0:
+; RV32IA-WMO-NEXT:    addi sp, sp, -16
+; RV32IA-WMO-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32IA-WMO-NEXT:    mv a6, a4
+; RV32IA-WMO-NEXT:    sw a2, 4(sp)
+; RV32IA-WMO-NEXT:    sw a1, 0(sp)
+; RV32IA-WMO-NEXT:    mv a1, sp
+; RV32IA-WMO-NEXT:    li a4, 2
+; RV32IA-WMO-NEXT:    li a5, 2
+; RV32IA-WMO-NEXT:    mv a2, a3
+; RV32IA-WMO-NEXT:    mv a3, a6
+; RV32IA-WMO-NEXT:    call __atomic_compare_exchange_8
+; RV32IA-WMO-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32IA-WMO-NEXT:    addi sp, sp, 16
+; RV32IA-WMO-NEXT:    ret
+;
+; RV32IA-WMO-ZACAS-LABEL: cmpxchg_i64_acquire_acquire:
+; RV32IA-WMO-ZACAS:       # %bb.0:
+; RV32IA-WMO-ZACAS-NEXT:    mv a5, a4
+; RV32IA-WMO-ZACAS-NEXT:    mv a7, a2
+; RV32IA-WMO-ZACAS-NEXT:    mv a4, a3
+; RV32IA-WMO-ZACAS-NEXT:    mv a6, a1
+; RV32IA-WMO-ZACAS-NEXT:    amocas.d.aq a6, a4, (a0)
+; RV32IA-WMO-ZACAS-NEXT:    ret
+;
+; RV32IA-TSO-LABEL: cmpxchg_i64_acquire_acquire:
+; RV32IA-TSO:       # %bb.0:
+; RV32IA-TSO-NEXT:    addi sp, sp, -16
+; RV32IA-TSO-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32IA-TSO-NEXT:    mv a6, a4
+; RV32IA-TSO-NEXT:    sw a2, 4(sp)
+; RV32IA-TSO-NEXT:    sw a1, 0(sp)
+; RV32IA-TSO-NEXT:    mv a1, sp
+; RV32IA-TSO-NEXT:    li a4, 2
+; RV32IA-TSO-NEXT:    li a5, 2
+; RV32IA-TSO-NEXT:    mv a2, a3
+; RV32IA-TSO-NEXT:    mv a3, a6
+; RV32IA-TSO-NEXT:    call __atomic_compare_exchange_8
+; RV32IA-TSO-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32IA-TSO-NEXT:    addi sp, sp, 16
+; RV32IA-TSO-NEXT:    ret
+;
+; RV32IA-TSO-ZACAS-LABEL: cmpxchg_i64_acquire_acquire:
+; RV32IA-TSO-ZACAS:       # %bb.0:
+; RV32IA-TSO-ZACAS-NEXT:    mv a5, a4
+; RV32IA-TSO-ZACAS-NEXT:    mv a7, a2
+; RV32IA-TSO-ZACAS-NEXT:    mv a4, a3
+; RV32IA-TSO-ZACAS-NEXT:    mv a6, a1
+; RV32IA-TSO-ZACAS-NEXT:    amocas.d a6, a4, (a0)
+; RV32IA-TSO-ZACAS-NEXT:    ret
 ;
 ; RV64I-LABEL: cmpxchg_i64_acquire_acquire:
 ; RV64I:       # %bb.0:
@@ -4462,22 +4557,57 @@ define void @cmpxchg_i64_release_monotonic(ptr %ptr, i64 %cmp, i64 %val) nounwin
 ; RV32I-NEXT:    addi sp, sp, 16
 ; RV32I-NEXT:    ret
 ;
-; RV32IA-LABEL: cmpxchg_i64_release_monotonic:
-; RV32IA:       # %bb.0:
-; RV32IA-NEXT:    addi sp, sp, -16
-; RV32IA-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
-; RV32IA-NEXT:    mv a5, a4
-; RV32IA-NEXT:    sw a2, 4(sp)
-; RV32IA-NEXT:    sw a1, 0(sp)
-; RV32IA-NEXT:    mv a1, sp
-; RV32IA-NEXT:    li a4, 3
-; RV32IA-NEXT:    mv a2, a3
-; RV32IA-NEXT:    mv a3, a5
-; RV32IA-NEXT:    li a5, 0
-; RV32IA-NEXT:    call __atomic_compare_exchange_8
-; RV32IA-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
-; RV32IA-NEXT:    addi sp, sp, 16
-; RV32IA-NEXT:    ret
+; RV32IA-WMO-LABEL: cmpxchg_i64_release_monotonic:
+; RV32IA-WMO:       # %bb.0:
+; RV32IA-WMO-NEXT:    addi sp, sp, -16
+; RV32IA-WMO-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32IA-WMO-NEXT:    mv a5, a4
+; RV32IA-WMO-NEXT:    sw a2, 4(sp)
+; RV32IA-WMO-NEXT:    sw a1, 0(sp)
+; RV32IA-WMO-NEXT:    mv a1, sp
+; RV32IA-WMO-NEXT:    li a4, 3
+; RV32IA-WMO-NEXT:    mv a2, a3
+; RV32IA-WMO-NEXT:    mv a3, a5
+; RV32IA-WMO-NEXT:    li a5, 0
+; RV32IA-WMO-NEXT:    call __atomic_compare_exchange_8
+; RV32IA-WMO-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32IA-WMO-NEXT:    addi sp, sp, 16
+; RV32IA-WMO-NEXT:    ret
+;
+; RV32IA-WMO-ZACAS-LABEL: cmpxchg_i64_release_monotonic:
+; RV32IA-WMO-ZACAS:       # %bb.0:
+; RV32IA-WMO-ZACAS-NEXT:    mv a5, a4
+; RV32IA-WMO-ZACAS-NEXT:    mv a7, a2
+; RV32IA-WMO-ZACAS-NEXT:    mv a4, a3
+; RV32IA-WMO-ZACAS-NEXT:    mv a6, a1
+; RV32IA-WMO-ZACAS-NEXT:    amocas.d.rl a6, a4, (a0)
+; RV32IA-WMO-ZACAS-NEXT:    ret
+;
+; RV32IA-TSO-LABEL: cmpxchg_i64_release_monotonic:
+; RV32IA-TSO:       # %bb.0:
+; RV32IA-TSO-NEXT:    addi sp, sp, -16
+; RV32IA-TSO-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32IA-TSO-NEXT:    mv a5, a4
+; RV32IA-TSO-NEXT:    sw a2, 4(sp)
+; RV32IA-TSO-NEXT:    sw a1, 0(sp)
+; RV32IA-TSO-NEXT:    mv a1, sp
+; RV32IA-TSO-NEXT:    li a4, 3
+; RV32IA-TSO-NEXT:    mv a2, a3
+; RV32IA-TSO-NEXT:    mv a3, a5
+; RV32IA-TSO-NEXT:    li a5, 0
+; RV32IA-TSO-NEXT:    call __atomic_compare_exchange_8
+; RV32IA-TSO-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32IA-TSO-NEXT:    addi sp, sp, 16
+; RV32IA-TSO-NEXT:    ret
+;
+; RV32IA-TSO-ZACAS-LABEL: cmpxchg_i64_release_monotonic:
+; RV32IA-TSO-ZACAS:       # %bb.0:
+; RV32IA-TSO-ZACAS-NEXT:    mv a5, a4
+; R...
[truncated]

bool Is64Bit = Subtarget.is64Bit();
unsigned Opcode;
if (Subtarget.hasStdExtZtso()) {
Opcode = Subtarget.is64Bit() ? RISCV::AMOCAS_Q : RISCV::AMOCAS_D_RV32;

Suggested change
Opcode = Subtarget.is64Bit() ? RISCV::AMOCAS_Q : RISCV::AMOCAS_D_RV32;
Opcode = Is64Bit ? RISCV::AMOCAS_Q : RISCV::AMOCAS_D_RV32;

assert(!MO.getSubReg() && !VRRegMO.getSubReg() && "Unexpected subreg!");
// TODO: Add hints when there are GPRPair subregs?
if (PhysReg && (!NeedGPRC || RISCV::GPRCRegClass.contains(PhysReg)) &&
!MO.getSubReg() && !VRRegMO.getSubReg()) {

I'd like to see test changes caused by this.

Comment on lines +710 to +714
; RV32IA-ZACAS-NEXT: li a2, 0
; RV32IA-ZACAS-NEXT: li a3, 0
; RV32IA-ZACAS-NEXT: amocas.d a2, a2, (a0)
; RV32IA-ZACAS-NEXT: mv a0, a2
; RV32IA-ZACAS-NEXT: mv a1, a3

Suggested change
; RV32IA-ZACAS-NEXT: li a2, 0
; RV32IA-ZACAS-NEXT: li a3, 0
; RV32IA-ZACAS-NEXT: amocas.d a2, a2, (a0)
; RV32IA-ZACAS-NEXT: mv a0, a2
; RV32IA-ZACAS-NEXT: mv a1, a3
; RV32IA-ZACAS-NEXT: mv a2, a0
; RV32IA-ZACAS-NEXT: amocas.d a0, a0, (a2)

I don't know how to give some hints to RA for better codegen :(

@asb (Contributor) commented Jan 16, 2024

> I didn't see any way to signal AtomicExpandPass to convert them to the same libcalls using any of the shouldExpandAtomic* hooks, so I've forced them to use the CmpXChg expansion. I've disabled the insertion of fences for atomic load/store when we use CmpXChg.
>
> I'm very unsure if this is the right thing to do, or if we should make changes to AtomicExpand to get back the original libcalls.

I believe that the correct thing is to either consistently generate lock-free implementations for all operations of a given bitwidth, or to dispatch to the libcall. So using the Zacas instructions is correct. We may want to update MaxAtomicInlineWidth in clang/lib/Basic/Targets/RISCV.h too.

Looking at other targets that do setMaxAtomicSizeInBitsSupported differently based on a target feature, I'm a bit surprised there's no consideration for potential ABI compatibility. @jyknight, isn't it right that if uses of atomics cross an ABI boundary, then mixing translation units compiled with different maximum atomic widths may generate undesirable results? e.g. one TU might perform operations on a memory location using lock-based libcalls while the other uses native lock-free instructions inline. Given distros are looking at rolling out x86-64-v2 as a default this might become more common. But there seems to be no strategy to mitigate this (e.g. always going through a helper function in libatomic). Perhaps this kind of use of double-width atomics across ABI boundaries just doesn't happen in practice?
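To make that scenario concrete, here is a hedged C++ sketch (the TU split and names are hypothetical, and the libcall name is only what one would expect by analogy with the __atomic_compare_exchange_8 calls in the tests above): two translation units operate on the same 16-byte atomic object, one built with Zacas so the CAS can be inlined as amocas.q, the other built without it so the same CAS becomes a libatomic call (something like __atomic_compare_exchange_16) that may be lock-based. If one side takes a lock and the other does not, they no longer synchronize with each other.

    // Hedged sketch of the ABI-mixing scenario; everything here is hypothetical.
    #include <atomic>
    #include <cstdint>

    struct Pair { std::uint64_t lo, hi; };   // 16 bytes on RV64
    std::atomic<Pair> g_shared{};            // conceptually shared by both TUs

    // "TU A": built with a Zacas-enabled -march, so this CAS could be inlined
    // as amocas.q (lock-free).
    bool bump_with_zacas() {
      Pair expected = g_shared.load();
      Pair desired{expected.lo + 1, expected.hi};
      return g_shared.compare_exchange_strong(expected, desired);
    }

    // "TU B": built without Zacas, so the same CAS lowers to a libatomic
    // call, which today may be implemented with a lock. Mixing this with the
    // inlined path on the same object is only sound if libatomic is also
    // lock-free for this size -- the compatibility rule discussed below.
    bool bump_without_zacas() {
      Pair expected = g_shared.load();
      Pair desired{expected.lo + 1, expected.hi};
      return g_shared.compare_exchange_strong(expected, desired);
    }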

@jyknight (Member)

The intent is that linking together code with differing max atomic widths should not cause ABI incompatibility, under the assumption that libatomic always knows how to use lock-free atomics for a given size if any code linked against it might do so. Which effectively implies that it must know about all microarchitecture levels the compiler knows how to build, and potentially dynamically dispatch to the right implementation.

This ensures that one TU using a libcall, and another using a native operation will work.

However, we don't yet implement a compliant libatomic in LLVM, though @Logikable is going to work on that. GCC libatomic does do this.
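A hedged sketch of that dispatch, assuming a hypothetical runtime feature probe; no real libatomic is written this way, it only illustrates the contract that every caller of a given size takes the same lock-free-or-locked path.

    // Hedged sketch only; cpu_has_zacas() is a made-up probe and the function
    // name is illustrative, not an actual libatomic entry point.
    #include <cstring>
    #include <mutex>

    using u128 = unsigned __int128;

    static bool cpu_has_zacas() { return false; }  // placeholder runtime probe
    static std::mutex g_fallback_lock;             // shared fallback lock

    bool cas16(void *p, u128 *expected, u128 desired) {
      if (cpu_has_zacas()) {
        // A real library would use a hand-written lock-free sequence
        // (e.g. amocas.q) here rather than a compiler builtin.
        return __atomic_compare_exchange_n(static_cast<u128 *>(p), expected,
                                           desired, /*weak=*/false,
                                           __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
      }
      // Lock-based fallback: correct only because *all* code reaching this
      // size goes through the same dispatch, so lock-free and locked accesses
      // are never mixed on the same object.
      std::lock_guard<std::mutex> guard(g_fallback_lock);
      u128 current;
      std::memcpy(&current, p, sizeof current);
      if (current == *expected) {
        std::memcpy(p, &desired, sizeof desired);
        return true;
      }
      *expected = current;
      return false;
    }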

@jyknight (Member)

This change isn't OK - we shouldn't add any new expansions of Load to CmpXChg. Unless there is an atomic 128-bit load operation also available in RISC-V, we cannot use amocas.q for implementing atomics.

We made the mistake of expanding load to cmpxchg on X86 and AArch64, but it's wrong. An atomic load should work on read-only memory, and expanding to cmpxchg does not. Furthermore, turning memory reads into writes is extremely bad for scalability.

Luckily, on newer X86 and AArch64 microarchitectures there is support for 128-bit atomic load/store, so we can DTRT on newer microarchitectures.

GCC fixed this partially on X86: the generated code always defers to libatomic unless the 128-bit atomic load/store is available, but libatomic still has the broken behavior, for compatibility with older generated code. I believe they avoided making the mistake on AArch64. (See #75081).
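A hedged C++ illustration of that objection (not from the patch): a 2*XLen atomic load from an object the implementation is free to place in read-only memory. A cmpxchg-based lowering turns the load into an AMO, which needs the location to be writable and turns every reader into a writer of the cache line.

    // Hedged illustration of why an atomic load must not become a CAS.
    #include <atomic>
    #include <cstdint>

    struct Pair { std::uint64_t lo, hi; };   // 2*XLen on RV64

    // Constant-initialized const object; an implementation may place it in a
    // read-only section.
    const std::atomic<Pair> kConfig{Pair{1, 2}};

    Pair read_config() {
      // Must stay a genuine load. Lowering it to amocas (a store/AMO) would
      // require write access to kConfig's page and would serialize concurrent
      // readers even on writable memory.
      return kConfig.load(std::memory_order_acquire);
    }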

@topperc (Collaborator, Author) commented Jan 16, 2024

> This change isn't OK - we shouldn't add any new expansions of Load to CmpXChg. Unless there is an atomic 128-bit load operation also available in RISC-V, we cannot use amocas.q for implementing atomics.

There is no 128-bit atomic load for RISC-V currently defined

@jyknight (Member)

> There is no 128-bit atomic load for RISC-V currently defined

Perhaps the simplest path forward to address that would be to define an extension which has no instructions, but simply specifies that a 128-bit vector load or store is a single memory operation. E.g.

	vsetivli	zero, 2, e64, m1 # (optionally ta, ma)
	vle64.v	v8, (a0)
	vse64.v	v8, (a1)

That'd be analogous to how it's done on x86/aarch64.

@topperc (Collaborator, Author) commented Jan 17, 2024

> There is no 128-bit atomic load for RISC-V currently defined
>
> Perhaps the simplest path forward to address that would be to define an extension which has no instructions, but simply specifies that a 128-bit vector load or store is a single memory operation. E.g.
>
> 	vsetivli	zero, 2, e64, m1 # (optionally ta, ma)
> 	vle64.v	v8, (a0)
> 	vse64.v	v8, (a1)
>
> That'd be analogous to how it's done on x86/aarch64.

RISC-V doesn't always have vector instructions, and the vector ISA does not require a 128-bit vector load to be atomic. It would depend on the microarchitecture.

@jyknight (Member)

> RISC-V doesn't always have vector instructions, and the vector ISA does not require a 128-bit vector load to be atomic. It would depend on the microarchitecture.

Yes, that's why I suggested an ISA extension that a given CPU could advertise support for, which does provide that guarantee.

@topperc (Collaborator, Author) commented Jan 17, 2024

> RISC-V doesn't always have vector instructions, and the vector ISA does not require a 128-bit vector load to be atomic. It would depend on the microarchitecture.
>
> Yes, that's why I suggested an ISA extension that a given CPU could advertise support for, which does provide that guarantee.

And the lowerings in this patch would require that extension be present?

@jyknight (Member)

> And the lowerings in this patch would require that extension be present?

Yeah.

For RV64 128-bit atomics support: require Zacas, V, and that new extension which promises that 128-bit vector load/store are a single memory operation. Then emit vector instructions for 128-bit atomic load/store.

For RV32 64-bit atomics support: require Zacas, D, and a new extension which promises that 64-bit FLD/FSD are a single memory operation. Emit those for 64-bit atomic load/store.

Alternatively to requiring the V or D extensions, an extension implementing integer load/store instructions of double-register widths (reusing the encodings of LQ/SQ in RV64 and LD/SD in RV32, which aren't currently supported) could be nice. I see the latter is actually already proposed in https://github.com/riscv/riscv-zilsd/blob/main/zilsd.adoc. But even with that, you'd still need the previously-described extension promising a single memory operation for double-word load/stores.
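A minimal sketch of how that requirement could be reflected in the code this patch touches, assuming a hypothetical subtarget predicate for the proposed extension (hasAtomic2XLenLoadStore() below does not exist in the tree); it only adds one more condition to the setMaxAtomicSizeInBitsSupported change from the diff above.

    // Hedged sketch; hasAtomic2XLenLoadStore() is a hypothetical predicate for
    // a "double-width load/store is a single memory operation" extension.
    if (Subtarget.hasStdExtA()) {
      unsigned MaxAtomicSize = Subtarget.getXLen();
      // Only advertise 2*XLen atomics when both amocas.d/.q and a true
      // single-copy-atomic double-width load/store are available.
      if (Subtarget.hasStdExtZacas() && Subtarget.hasAtomic2XLenLoadStore())
        MaxAtomicSize = 2 * Subtarget.getXLen();
      setMaxAtomicSizeInBitsSupported(MaxAtomicSize);
      setMinCmpXchgSizeInBits(32);
    }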

@topperc (Collaborator, Author) commented Mar 25, 2024

I'm going to close this. The libatomic ABI is a bigger blocker to this than support for 128-bit atomic load.

@topperc closed this Mar 25, 2024