[LowerMemIntrinsics] Factor control flow generation out of the memcpy lowering #169039

ritter-x2a · 2025-11-21T13:45:49Z

So far, memcpy with known size, memcpy with unknown size, memmove with known
size, and memmove with unknown size have individual optimized loop lowering
implementations, while memset and memset.pattern use an unoptimized loop
lowering. This patch extracts the parts of the memcpy lowerings (for known and
unknown sizes) that generate the control flow for the loop expansion into an
insertLoopExpansion function. The createMemCpyLoop(Unk|K)nownSize functions
then only collect the necessary arguments for insertLoopExpansion, call it,
and fill the generated loop basic blocks.
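
The split between control flow and loop bodies can be modeled in plain C++. This is a minimal sketch, not the LLVM implementation: `runLoopExpansion`, `copyWithExpansion`, and the two callbacks are hypothetical names, and the callbacks stand in for the instructions that the memcpy lowerings emit into the generated basic blocks. It assumes a 4-byte main access and a 1-byte residual access.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <functional>

// Hypothetical model of the loop skeleton (not LLVM API). The skeleton owns
// the counters and the control flow; the callbacks supply the loop bodies.
inline void runLoopExpansion(uint64_t len, uint64_t mainStep,
                             uint64_t residualStep,
                             const std::function<void(uint64_t)> &mainBody,
                             const std::function<void(uint64_t)> &residualBody) {
  assert(mainStep > 0 && "main step must be non-zero");
  assert((residualStep == 0 || mainStep % residualStep == 0) &&
         "residual step must divide main step if specified");

  uint64_t loopUnits = (len / mainStep) * mainStep;
  uint64_t residualUnits = len - loopUnits;

  // Main loop: guarded do-while, so the body runs only if loopUnits > 0.
  if (loopUnits > 0) {
    uint64_t index = 0;
    do {
      mainBody(index);
      index += mainStep;
    } while (index < loopUnits);
  }

  // No residual loop is generated for a step of 0 or one equal to mainStep.
  if (residualStep == 0 || residualStep == mainStep)
    return;
  for (uint64_t i = 0; i < residualUnits; i += residualStep)
    residualBody(loopUnits + i);
}

// Assumed caller: copies 4 bytes per main iteration and 1 byte per residual
// iteration, analogous to a memcpy lowering with an unknown size.
inline void copyWithExpansion(unsigned char *dst, const unsigned char *src,
                              uint64_t len) {
  runLoopExpansion(
      len, 4, 1,
      [&](uint64_t i) { std::memcpy(dst + i, src + i, 4); },
      [&](uint64_t i) { dst[i] = src[i]; });
}
```

In this model, a memset-style caller would reuse `runLoopExpansion` unchanged and only swap the two callbacks, which is the reuse the follow-up patches aim for.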

The immediate benefit of this is that logic from the two memcpy lowerings is
deduplicated. Moreover, it enables follow-up patches that will use
insertLoopExpansion to optimize the memset and memset.pattern implementations
similarly to memcpy, since they can use the exact same control flow patterns.

The test changes are due to more consistent and useful basic block names in the
loop expansion and an improvement in basic block ordering: previously, the
basic block that determines whether the residual loop is executed was placed at
the end of the function; now it is placed directly before the residual loop
body. Otherwise, the generated code should be equivalent.

This patch doesn't affect memmove. Deduplicating its logic would also be nice,
but extracting all control flow generation from the memmove lowering would
require insertLoopExpansion to also create code that iterates backwards over
the argument buffers. That would make insertLoopExpansion a lot more complex
for a code path that is only used for memmove, so it is probably not worth
refactoring.
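
To illustrate why memmove needs a backward-iterating loop, here is a minimal byte-wise sketch. It is not the LLVM lowering: `moveBytes` is a hypothetical helper, and the real lowering works on wider access types. When the destination starts after the source and the buffers overlap, a forward copy would overwrite source bytes before they are read, so the copy has to run from the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical byte-wise model of an overlap-safe move (not LLVM code).
inline void moveBytes(unsigned char *dst, const unsigned char *src,
                      std::size_t len) {
  assert((len == 0 || (dst && src)) && "non-empty move needs valid buffers");
  // Compare addresses as integers; relational comparison of unrelated
  // pointers is not well defined in C++.
  if (reinterpret_cast<std::uintptr_t>(src) <
      reinterpret_cast<std::uintptr_t>(dst)) {
    // Destination starts after source: iterate backwards so that every
    // source byte is read before it can be overwritten.
    for (std::size_t i = len; i > 0; --i)
      dst[i - 1] = src[i - 1];
  } else {
    // Destination starts at or before source: a forward loop is safe.
    for (std::size_t i = 0; i < len; ++i)
      dst[i] = src[i];
  }
}
```

The forward branch has the same shape as the memcpy loop, while the backward branch counts down from the end of the buffers, which is the extra case a shared helper would have to support.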

For SWDEV-543208.

ritter-x2a · 2025-11-21T13:46:08Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

llvmbot · 2025-11-21T13:47:35Z

@llvm/pr-subscribers-llvm-globalisel
@llvm/pr-subscribers-backend-nvptx
@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Patch is 203.13 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/169039.diff

10 Files Affected:

(modified) llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp (+311-208)
(modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll (+2-2)
(modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll (+16-16)
(modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-mem-transfer.ll (+72-72)
(modified) llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll (+210-210)
(modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+25-29)
(modified) llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll (+30-30)
(modified) llvm/test/CodeGen/AMDGPU/memmove-var-size.ll (+48-72)
(modified) llvm/test/CodeGen/NVPTX/lower-aggr-copies.ll (+14-14)
(modified) llvm/unittests/Transforms/Utils/MemTransferLowering.cpp (+6-3)

diff --git a/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp b/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
index 18b0f617ca232..88e5d038bff82 100644
--- a/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
+++ b/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
@@ -21,6 +21,219 @@
 
 using namespace llvm;
 
+/// \returns \p Len urem \p OpSize, checking for optimization opportunities.
+/// \p OpSizeVal must be the integer value of the \c ConstantInt \p OpSize.
+static Value *getRuntimeLoopRemainder(IRBuilderBase &B, Value *Len,
+                                      Value *OpSize, unsigned OpSizeVal) {
+  // For powers of 2, we can and by (OpSizeVal - 1) instead of using urem.
+  if (isPowerOf2_32(OpSizeVal))
+    return B.CreateAnd(Len, OpSizeVal - 1);
+  return B.CreateURem(Len, OpSize);
+}
+
+/// \returns (\p Len udiv \p OpSize) mul \p OpSize, checking for optimization
+/// opportunities.
+/// If \p RTLoopRemainder is provided, it must be the result of
+/// \c getRuntimeLoopRemainder() with the same arguments.
+static Value *getRuntimeLoopUnits(IRBuilderBase &B, Value *Len, Value *OpSize,
+                                  unsigned OpSizeVal,
+                                  Value *RTLoopRemainder = nullptr) {
+  if (!RTLoopRemainder)
+    RTLoopRemainder = getRuntimeLoopRemainder(B, Len, OpSize, OpSizeVal);
+  return B.CreateSub(Len, RTLoopRemainder);
+}
+
+namespace {
+/// Container for the return values of insertLoopExpansion.
+struct LoopExpansionInfo {
+  /// The instruction at the end of the main loop body.
+  Instruction *MainLoopIP = nullptr;
+
+  /// The unit index in the main loop body.
+  Value *MainLoopIndex = nullptr;
+
+  /// The instruction at the end of the residual loop body. Can be nullptr if no
+  /// residual is required.
+  Instruction *ResidualLoopIP = nullptr;
+
+  /// The unit index in the residual loop body. Can be nullptr if no residual is
+  /// required.
+  Value *ResidualLoopIndex = nullptr;
+};
+} // namespace
+
+/// Insert the control flow and loop counters for a memcpy/memset loop
+/// expansion.
+///
+/// This function inserts IR corresponding to the following C code before
+/// \p InsertBefore:
+/// \code
+/// LoopUnits = (Len / MainLoopStep) * MainLoopStep;
+/// ResidualUnits = Len - LoopUnits;
+/// MainLoopIndex = 0;
+/// if (LoopUnits > 0) {
+///   do {
+///     // MainLoopIP
+///     MainLoopIndex += MainLoopStep;
+///   } while (MainLoopIndex < LoopUnits);
+/// }
+/// for (size_t i = 0; i < ResidualUnits; i += ResidualLoopStep) {
+///   ResidualLoopIndex = LoopUnits + i;
+///   // ResidualLoopIP
+/// }
+/// \endcode
+///
+/// \p MainLoopStep and \p ResidualLoopStep determine by how many "units" the
+/// loop index is increased in each iteration of the main and residual loops,
+/// respectively. In most cases, the "unit" will be bytes, but larger units are
+/// useful for lowering memset.pattern.
+///
+/// The computation of \c LoopUnits and \c ResidualUnits is performed at compile
+/// time if \p Len is a \c ConstantInt.
+/// The second (residual) loop is omitted if \p ResidualLoopStep is 0 or equal
+/// to \p MainLoopStep.
+/// The generated \c MainLoopIP, \c MainLoopIndex, \c ResidualLoopIP, and
+/// \c ResidualLoopIndex are returned in a \c LoopExpansionInfo object.
+static LoopExpansionInfo insertLoopExpansion(Instruction *InsertBefore,
+                                             Value *Len, unsigned MainLoopStep,
+                                             unsigned ResidualLoopStep,
+                                             StringRef BBNamePrefix) {
+  assert((ResidualLoopStep == 0 || MainLoopStep % ResidualLoopStep == 0) &&
+         "ResidualLoopStep must divide MainLoopStep if specified");
+  assert(ResidualLoopStep <= MainLoopStep &&
+         "ResidualLoopStep cannot be larger than MainLoopStep");
+  assert(MainLoopStep > 0 && "MainLoopStep must be non-zero");
+  LoopExpansionInfo LEI;
+  BasicBlock *PreLoopBB = InsertBefore->getParent();
+  BasicBlock *PostLoopBB = PreLoopBB->splitBasicBlock(
+      InsertBefore, BBNamePrefix + "-post-expansion");
+  Function *ParentFunc = PreLoopBB->getParent();
+  LLVMContext &Ctx = PreLoopBB->getContext();
+  IRBuilder<> PreLoopBuilder(PreLoopBB->getTerminator());
+
+  // Calculate the main loop trip count and remaining units to cover after the
+  // loop.
+  Type *LenType = Len->getType();
+  IntegerType *ILenType = dyn_cast<IntegerType>(LenType);
+  assert(ILenType && "expected length to be an integer type!");
+  ConstantInt *CIMainLoopStep = ConstantInt::get(ILenType, MainLoopStep);
+
+  Value *LoopUnits = Len;
+  Value *ResidualUnits = nullptr;
+  // We can make a conditional branch unconditional if we know that the
+  // MainLoop must be executed at least once.
+  bool MustTakeMainLoop = false;
+  if (MainLoopStep != 1) {
+    if (auto *CLen = dyn_cast<ConstantInt>(Len)) {
+      uint64_t TotalUnits = CLen->getZExtValue();
+      uint64_t LoopEndCount = alignDown(TotalUnits, MainLoopStep);
+      uint64_t ResidualCount = TotalUnits - LoopEndCount;
+      LoopUnits = ConstantInt::get(LenType, LoopEndCount);
+      ResidualUnits = ConstantInt::get(LenType, ResidualCount);
+      MustTakeMainLoop = LoopEndCount > 0;
+      // As an optimization, we could skip generating the residual loop if
+      // ResidualCount is known to be 0. However, current uses of this function
+      // don't request a residual loop if the length is constant (they generate
+      // a (potentially empty) sequence of loads and stores instead), so this
+      // optimization would have no effect here.
+    } else {
+      ResidualUnits = getRuntimeLoopRemainder(PreLoopBuilder, Len,
+                                              CIMainLoopStep, MainLoopStep);
+      LoopUnits = getRuntimeLoopUnits(PreLoopBuilder, Len, CIMainLoopStep,
+                                      MainLoopStep, ResidualUnits);
+    }
+  } else if (auto *CLen = dyn_cast<ConstantInt>(Len)) {
+    MustTakeMainLoop = CLen->getZExtValue() > 0;
+  }
+
+  BasicBlock *MainLoopBB = BasicBlock::Create(
+      Ctx, BBNamePrefix + "-expansion-main-body", ParentFunc, PostLoopBB);
+  IRBuilder<> LoopBuilder(MainLoopBB);
+
+  PHINode *LoopIndex = LoopBuilder.CreatePHI(LenType, 2, "loop-index");
+  LEI.MainLoopIndex = LoopIndex;
+  LoopIndex->addIncoming(ConstantInt::get(LenType, 0U), PreLoopBB);
+
+  Value *NewIndex =
+      LoopBuilder.CreateAdd(LoopIndex, ConstantInt::get(LenType, MainLoopStep));
+  LoopIndex->addIncoming(NewIndex, MainLoopBB);
+
+  // One argument of the addition is a loop-variant PHI, so it must be an
+  // Instruction (i.e., it cannot be a Constant).
+  LEI.MainLoopIP = cast<Instruction>(NewIndex);
+
+  if (0 < ResidualLoopStep && ResidualLoopStep < MainLoopStep) {
+    // Loop body for the residual accesses.
+    BasicBlock *ResLoopBB =
+        BasicBlock::Create(Ctx, BBNamePrefix + "-expansion-residual-body",
+                           PreLoopBB->getParent(), PostLoopBB);
+    // BB to check if the residual loop is needed.
+    BasicBlock *ResidualCondBB =
+        BasicBlock::Create(Ctx, BBNamePrefix + "-expansion-residual-cond",
+                           PreLoopBB->getParent(), ResLoopBB);
+
+    // Enter the MainLoop unless no main loop iteration is required.
+    ConstantInt *Zero = ConstantInt::get(ILenType, 0U);
+    if (MustTakeMainLoop)
+      PreLoopBuilder.CreateBr(MainLoopBB);
+    else
+      PreLoopBuilder.CreateCondBr(PreLoopBuilder.CreateICmpNE(LoopUnits, Zero),
+                                  MainLoopBB, ResidualCondBB);
+    PreLoopBB->getTerminator()->eraseFromParent();
+
+    // Stay in the MainLoop until we have handled all the LoopUnits. Then go to
+    // the residual condition BB.
+    LoopBuilder.CreateCondBr(LoopBuilder.CreateICmpULT(NewIndex, LoopUnits),
+                             MainLoopBB, ResidualCondBB);
+
+    // Determine if we need to branch to the residual loop or bypass it.
+    IRBuilder<> RCBuilder(ResidualCondBB);
+    RCBuilder.CreateCondBr(RCBuilder.CreateICmpNE(ResidualUnits, Zero),
+                           ResLoopBB, PostLoopBB);
+
+    IRBuilder<> ResBuilder(ResLoopBB);
+    PHINode *ResidualIndex =
+        ResBuilder.CreatePHI(LenType, 2, "residual-loop-index");
+    ResidualIndex->addIncoming(Zero, ResidualCondBB);
+
+    // Add the offset at the end of the main loop to the loop counter of the
+    // residual loop to get the proper index.
+    Value *FullOffset = ResBuilder.CreateAdd(LoopUnits, ResidualIndex);
+    LEI.ResidualLoopIndex = FullOffset;
+
+    Value *ResNewIndex = ResBuilder.CreateAdd(
+        ResidualIndex, ConstantInt::get(LenType, ResidualLoopStep));
+    ResidualIndex->addIncoming(ResNewIndex, ResLoopBB);
+
+    // One argument of the addition is a loop-variant PHI, so it must be an
+    // Instruction (i.e., it cannot be a Constant).
+    LEI.ResidualLoopIP = cast<Instruction>(ResNewIndex);
+
+    // Stay in the residual loop until all ResidualUnits are handled.
+    ResBuilder.CreateCondBr(
+        ResBuilder.CreateICmpULT(ResNewIndex, ResidualUnits), ResLoopBB,
+        PostLoopBB);
+  } else {
+    // There is no need for a residual loop after the main loop. We do however
+    // need to patch up the control flow by creating the terminators for the
+    // preloop block and the main loop.
+
+    // Enter the MainLoop unless no main loop iteration is required.
+    if (MustTakeMainLoop) {
+      PreLoopBuilder.CreateBr(MainLoopBB);
+    } else {
+      ConstantInt *Zero = ConstantInt::get(ILenType, 0U);
+      PreLoopBuilder.CreateCondBr(PreLoopBuilder.CreateICmpNE(LoopUnits, Zero),
+                                  MainLoopBB, PostLoopBB);
+    }
+    PreLoopBB->getTerminator()->eraseFromParent();
+    // Stay in the MainLoop until we have handled all the LoopUnits.
+    LoopBuilder.CreateCondBr(LoopBuilder.CreateICmpULT(NewIndex, LoopUnits),
+                             MainLoopBB, PostLoopBB);
+  }
+  return LEI;
+}
+
 void llvm::createMemCpyLoopKnownSize(
     Instruction *InsertBefore, Value *SrcAddr, Value *DstAddr,
     ConstantInt *CopyLen, Align SrcAlign, Align DstAlign, bool SrcIsVolatile,
@@ -31,7 +244,6 @@ void llvm::createMemCpyLoopKnownSize(
     return;
 
   BasicBlock *PreLoopBB = InsertBefore->getParent();
-  BasicBlock *PostLoopBB = nullptr;
   Function *ParentFunc = PreLoopBB->getParent();
   LLVMContext &Ctx = PreLoopBB->getContext();
   const DataLayout &DL = ParentFunc->getDataLayout();
@@ -56,37 +268,32 @@ void llvm::createMemCpyLoopKnownSize(
 
   uint64_t LoopEndCount = alignDown(CopyLen->getZExtValue(), LoopOpSize);
 
+  // Skip the loop expansion entirely if the loop would never be taken.
   if (LoopEndCount != 0) {
-    // Split
-    PostLoopBB = PreLoopBB->splitBasicBlock(InsertBefore, "memcpy-split");
-    BasicBlock *LoopBB =
-        BasicBlock::Create(Ctx, "load-store-loop", ParentFunc, PostLoopBB);
-    PreLoopBB->getTerminator()->setSuccessor(0, LoopBB);
-
-    IRBuilder<> PLBuilder(PreLoopBB->getTerminator());
+    LoopExpansionInfo LEI = insertLoopExpansion(InsertBefore, CopyLen,
+                                                LoopOpSize, 0, "static-memcpy");
 
+    // Fill MainLoopBB
+    IRBuilder<> MainLoopBuilder(LEI.MainLoopIP);
     Align PartDstAlign(commonAlignment(DstAlign, LoopOpSize));
     Align PartSrcAlign(commonAlignment(SrcAlign, LoopOpSize));
 
-    IRBuilder<> LoopBuilder(LoopBB);
-    PHINode *LoopIndex = LoopBuilder.CreatePHI(TypeOfCopyLen, 2, "loop-index");
-    LoopIndex->addIncoming(ConstantInt::get(TypeOfCopyLen, 0U), PreLoopBB);
-    // Loop Body
-
     // If we used LoopOpType as GEP element type, we would iterate over the
     // buffers in TypeStoreSize strides while copying TypeAllocSize bytes, i.e.,
     // we would miss bytes if TypeStoreSize != TypeAllocSize. Therefore, use
     // byte offsets computed from the TypeStoreSize.
-    Value *SrcGEP = LoopBuilder.CreateInBoundsGEP(Int8Type, SrcAddr, LoopIndex);
-    LoadInst *Load = LoopBuilder.CreateAlignedLoad(LoopOpType, SrcGEP,
-                                                   PartSrcAlign, SrcIsVolatile);
+    Value *SrcGEP =
+        MainLoopBuilder.CreateInBoundsGEP(Int8Type, SrcAddr, LEI.MainLoopIndex);
+    LoadInst *Load = MainLoopBuilder.CreateAlignedLoad(
+        LoopOpType, SrcGEP, PartSrcAlign, SrcIsVolatile);
     if (!CanOverlap) {
       // Set alias scope for loads.
       Load->setMetadata(LLVMContext::MD_alias_scope,
                         MDNode::get(Ctx, NewScope));
     }
-    Value *DstGEP = LoopBuilder.CreateInBoundsGEP(Int8Type, DstAddr, LoopIndex);
-    StoreInst *Store = LoopBuilder.CreateAlignedStore(
+    Value *DstGEP =
+        MainLoopBuilder.CreateInBoundsGEP(Int8Type, DstAddr, LEI.MainLoopIndex);
+    StoreInst *Store = MainLoopBuilder.CreateAlignedStore(
         Load, DstGEP, PartDstAlign, DstIsVolatile);
     if (!CanOverlap) {
       // Indicate that stores don't overlap loads.
@@ -96,96 +303,63 @@ void llvm::createMemCpyLoopKnownSize(
       Load->setAtomic(AtomicOrdering::Unordered);
       Store->setAtomic(AtomicOrdering::Unordered);
     }
-    Value *NewIndex = LoopBuilder.CreateAdd(
-        LoopIndex, ConstantInt::get(TypeOfCopyLen, LoopOpSize));
-    LoopIndex->addIncoming(NewIndex, LoopBB);
-
-    // Create the loop branch condition.
-    Constant *LoopEndCI = ConstantInt::get(TypeOfCopyLen, LoopEndCount);
-    LoopBuilder.CreateCondBr(LoopBuilder.CreateICmpULT(NewIndex, LoopEndCI),
-                             LoopBB, PostLoopBB);
+    assert(!LEI.ResidualLoopIP && !LEI.ResidualLoopIndex &&
+           "No residual loop was requested");
   }
 
+  // Copy the remaining bytes with straight-line code.
   uint64_t BytesCopied = LoopEndCount;
   uint64_t RemainingBytes = CopyLen->getZExtValue() - BytesCopied;
-  if (RemainingBytes) {
-    BasicBlock::iterator InsertIt = PostLoopBB ? PostLoopBB->getFirstNonPHIIt()
-                                               : InsertBefore->getIterator();
-    IRBuilder<> RBuilder(InsertIt->getParent(), InsertIt);
+  if (RemainingBytes == 0)
+    return;
 
-    SmallVector<Type *, 5> RemainingOps;
-    TTI.getMemcpyLoopResidualLoweringType(RemainingOps, Ctx, RemainingBytes,
-                                          SrcAS, DstAS, SrcAlign, DstAlign,
-                                          AtomicElementSize);
+  IRBuilder<> RBuilder(InsertBefore);
+  SmallVector<Type *, 5> RemainingOps;
+  TTI.getMemcpyLoopResidualLoweringType(RemainingOps, Ctx, RemainingBytes,
+                                        SrcAS, DstAS, SrcAlign, DstAlign,
+                                        AtomicElementSize);
 
-    for (auto *OpTy : RemainingOps) {
-      Align PartSrcAlign(commonAlignment(SrcAlign, BytesCopied));
-      Align PartDstAlign(commonAlignment(DstAlign, BytesCopied));
-
-      unsigned OperandSize = DL.getTypeStoreSize(OpTy);
-      assert(
-          (!AtomicElementSize || OperandSize % *AtomicElementSize == 0) &&
-          "Atomic memcpy lowering is not supported for selected operand size");
-
-      Value *SrcGEP = RBuilder.CreateInBoundsGEP(
-          Int8Type, SrcAddr, ConstantInt::get(TypeOfCopyLen, BytesCopied));
-      LoadInst *Load =
-          RBuilder.CreateAlignedLoad(OpTy, SrcGEP, PartSrcAlign, SrcIsVolatile);
-      if (!CanOverlap) {
-        // Set alias scope for loads.
-        Load->setMetadata(LLVMContext::MD_alias_scope,
-                          MDNode::get(Ctx, NewScope));
-      }
-      Value *DstGEP = RBuilder.CreateInBoundsGEP(
-          Int8Type, DstAddr, ConstantInt::get(TypeOfCopyLen, BytesCopied));
-      StoreInst *Store = RBuilder.CreateAlignedStore(Load, DstGEP, PartDstAlign,
-                                                     DstIsVolatile);
-      if (!CanOverlap) {
-        // Indicate that stores don't overlap loads.
-        Store->setMetadata(LLVMContext::MD_noalias, MDNode::get(Ctx, NewScope));
-      }
-      if (AtomicElementSize) {
-        Load->setAtomic(AtomicOrdering::Unordered);
-        Store->setAtomic(AtomicOrdering::Unordered);
-      }
-      BytesCopied += OperandSize;
+  for (auto *OpTy : RemainingOps) {
+    Align PartSrcAlign(commonAlignment(SrcAlign, BytesCopied));
+    Align PartDstAlign(commonAlignment(DstAlign, BytesCopied));
+
+    unsigned OperandSize = DL.getTypeStoreSize(OpTy);
+    assert((!AtomicElementSize || OperandSize % *AtomicElementSize == 0) &&
+           "Atomic memcpy lowering is not supported for selected operand size");
+
+    Value *SrcGEP = RBuilder.CreateInBoundsGEP(
+        Int8Type, SrcAddr, ConstantInt::get(TypeOfCopyLen, BytesCopied));
+    LoadInst *Load =
+        RBuilder.CreateAlignedLoad(OpTy, SrcGEP, PartSrcAlign, SrcIsVolatile);
+    if (!CanOverlap) {
+      // Set alias scope for loads.
+      Load->setMetadata(LLVMContext::MD_alias_scope,
+                        MDNode::get(Ctx, NewScope));
+    }
+    Value *DstGEP = RBuilder.CreateInBoundsGEP(
+        Int8Type, DstAddr, ConstantInt::get(TypeOfCopyLen, BytesCopied));
+    StoreInst *Store =
+        RBuilder.CreateAlignedStore(Load, DstGEP, PartDstAlign, DstIsVolatile);
+    if (!CanOverlap) {
+      // Indicate that stores don't overlap loads.
+      Store->setMetadata(LLVMContext::MD_noalias, MDNode::get(Ctx, NewScope));
     }
+    if (AtomicElementSize) {
+      Load->setAtomic(AtomicOrdering::Unordered);
+      Store->setAtomic(AtomicOrdering::Unordered);
+    }
+    BytesCopied += OperandSize;
   }
   assert(BytesCopied == CopyLen->getZExtValue() &&
          "Bytes copied should match size in the call!");
 }
 
-// \returns \p Len urem \p OpSize, checking for optimization opportunities.
-static Value *getRuntimeLoopRemainder(const DataLayout &DL, IRBuilderBase &B,
-                                      Value *Len, Value *OpSize,
-                                      unsigned OpSizeVal) {
-  // For powers of 2, we can and by (OpSizeVal - 1) instead of using urem.
-  if (isPowerOf2_32(OpSizeVal))
-    return B.CreateAnd(Len, OpSizeVal - 1);
-  return B.CreateURem(Len, OpSize);
-}
-
-// \returns (\p Len udiv \p OpSize) mul \p OpSize, checking for optimization
-// opportunities.
-// If RTLoopRemainder is provided, it must be the result of
-// getRuntimeLoopRemainder() with the same arguments.
-static Value *getRuntimeLoopBytes(const DataLayout &DL, IRBuilderBase &B,
-                                  Value *Len, Value *OpSize, unsigned OpSizeVal,
-                                  Value *RTLoopRemainder = nullptr) {
-  if (!RTLoopRemainder)
-    RTLoopRemainder = getRuntimeLoopRemainder(DL, B, Len, OpSize, OpSizeVal);
-  return B.CreateSub(Len, RTLoopRemainder);
-}
-
 void llvm::createMemCpyLoopUnknownSize(
     Instruction *InsertBefore, Value *SrcAddr, Value *DstAddr, Value *CopyLen,
     Align SrcAlign, Align DstAlign, bool SrcIsVolatile, bool DstIsVolatile,
     bool CanOverlap, const TargetTransformInfo &TTI,
     std::optional<uint32_t> AtomicElementSize) {
   BasicBlock *PreLoopBB = InsertBefore->getParent();
-  BasicBlock *PostLoopBB =
-      PreLoopBB->splitBasicBlock(InsertBefore, "post-loop-memcpy-expansion");
-
   Function *ParentFunc = PreLoopBB->getParent();
   const DataLayout &DL = ParentFunc->getDataLayout();
   LLVMContext &Ctx = PreLoopBB->getContext();
@@ -205,50 +379,39 @@ void llvm::createMemCpyLoopUnknownSize(
   assert((!AtomicElementSize || LoopOpSize % *AtomicElementSize == 0) &&
          "Atomic memcpy lowering is not supported for selected operand size");
 
-  IRBuilder<> PLBuilder(PreLoopBB->getTerminator());
-
-  // Calculate the loop trip count, and remaining bytes to copy after the loop.
-  Type *CopyLenType = CopyLen->getType();
-  IntegerType *ILengthType = dyn_cast<IntegerType>(CopyLenType);
-  assert(ILengthType &&
-         "expected size argument to memcpy to be an integer type!");
   Type *Int8Type = Type::getInt8Ty(Ctx);
-  bool LoopOpIsInt8 = LoopOpType == Int8Type;
-  ConstantInt *CILoopOpSize = ConstantInt::get(ILengthType, LoopOpSize);
 
-  Value *RuntimeLoopBytes = CopyLen;
-  Value *RuntimeResidualBytes = nullptr;
-  if (!LoopOpIsInt8) {
-    RuntimeResidualBytes = getRuntimeLoopRemainder(DL, PLBuilder, CopyLen,
-                                                   CILoopOpSize, LoopOpSize);
-    RuntimeLoopBytes = getRuntimeLoopBytes(DL, PLBuilder, CopyLen, CILoopOpSize,
-                                           LoopOpSize, RuntimeResidualBytes);
-  }
+  Type *ResidualLoopOpType = AtomicElementSize
+                                 ? T...
[truncated]
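
The arithmetic in the two helpers at the top of the patch, `getRuntimeLoopRemainder` and `getRuntimeLoopUnits`, can be checked with a plain integer model. This is a sketch under stated assumptions, not the LLVM code: `isPowerOfTwo`, `loopRemainder`, and `loopUnits` are hypothetical stand-ins that operate on integers instead of emitting IR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical integer model of the remainder/units helpers (not LLVM API).
inline bool isPowerOfTwo(uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

// len urem opSize, using a mask when opSize is a power of two.
inline uint64_t loopRemainder(uint64_t len, uint32_t opSize) {
  assert(opSize > 0 && "operation size must be non-zero");
  if (isPowerOfTwo(opSize))
    return len & (static_cast<uint64_t>(opSize) - 1);
  return len % opSize;
}

// (len udiv opSize) mul opSize, computed as len minus the remainder so that
// no division or multiplication is needed.
inline uint64_t loopUnits(uint64_t len, uint32_t opSize) {
  return len - loopRemainder(len, opSize);
}
```

For a length of 37 and an operation size of 16, the model gives a remainder of 5 and 32 units for the main loop; for a non-power-of-two size such as 12, it falls back to the modulo and gives 1 and 36.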

github-actions · 2025-11-21T14:29:41Z

🐧 Linux x64 Test Results

186440 tests passed
4868 tests skipped

krzysz00

Looks NFC on the buffer fat pointer side of things

ritter-x2a mentioned this pull request Nov 21, 2025

[LowerMemIntrinsics] Optimize memset lowering #169040

Open

ritter-x2a added the backend:AMDGPU label Nov 21, 2025 — with Graphite App

ritter-x2a requested review from arsenm, dtcxzyw, jayfoad, jmmartinez, krzysz00 and shiltian November 21, 2025 13:54

ritter-x2a marked this pull request as ready for review November 21, 2025 13:55

llvmbot added llvm:globalisel backend:NVPTX llvm:transforms labels Nov 21, 2025

krzysz00 reviewed Nov 26, 2025
