Conversation

@dakersnar
Contributor

@dakersnar dakersnar commented Dec 2, 2025

Reverts #154069. I pointed out a number of issues post-merge, most importantly examples of miscompiles: #154069 (comment).

While the motivation of the change is clear, I think the implementation approach is flawed. It seems like the goal is to allow elements like load <2xi16> and load i32 to be vectorized together despite the current algorithm not grouping them into the same equivalence classes. I personally think that if we want to attempt this, it should be a more holistic approach, maybe even redefining the concept of an equivalence class. The current solution seems like it would be really hard to get bug-free, and even if the bugs were not present, it is only able to merge chains that happen to be adjacent to each other after splitChainByContiguity, which leaves it up to chance whether the optimization kicks in. But we can discuss more in the re-land. Maybe the broader approach I'm proposing is too difficult, and a narrow optimization is worthwhile. Regardless, this should be reverted; it needs more iteration before it is correct.
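For illustration, here is a minimal, hypothetical LLVM IR sketch of the pattern in question (my own example, not taken from the patch's tests): two contiguous loads from the same underlying object, one <2 x i16> and one i32, which the current algorithm places in different equivalence classes and therefore never considers for a single wider access.

; Hypothetical example for discussion only; names and offsets are made up.
define i32 @mixed_type_loads(ptr addrspace(1) %p) {
  ; bytes 0-3: vector load with 16-bit elements
  %v = load <2 x i16>, ptr addrspace(1) %p, align 4
  ; bytes 4-7: adjacent scalar load with a 32-bit element
  %gep = getelementptr inbounds i8, ptr addrspace(1) %p, i64 4
  %s = load i32, ptr addrspace(1) %gep, align 4
  ; use both results so neither load is dead
  %e0 = extractelement <2 x i16> %v, i32 0
  %z = zext i16 %e0 to i32
  %r = add i32 %z, %s
  ret i32 %r
}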

@dakersnar dakersnar requested a review from nikic as a code owner December 2, 2025 22:24
@dakersnar dakersnar requested review from arsenm and gandhi56 and removed request for nikic December 2, 2025 22:24
@llvmbot llvmbot added backend:AMDGPU llvm:globalisel vectorizers llvm:instcombine llvm:transforms labels Dec 2, 2025
@dakersnar dakersnar requested a review from LU-JOHN December 2, 2025 22:25
@llvmbot
Member

llvmbot commented Dec 2, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-globalisel

Author: Drew Kersnar (dakersnar)

Changes

Reverts llvm/llvm-project#154069. I pointed out a number of issues post-merge, most importantly examples of miscompiles: #154069 (comment).

While the motivation of the change is clear, I think the implementation approach is flawed. It seems like the goal is to allow elements like load <2xi16> and load i32 to be vectorized together despite the current algorithm not grouping them into the same equivalence classes. I personally think that if we want to attempt this, it should be a more holistic approach, maybe even redefining the concept of an equivalence class. The current solution seems like it would be really hard to get bug-free, and even if the bugs were not present, it is only able to merge chains that happen to be adjacent to each other after splitChainByContiguity. But we can discuss more in the re-land. Maybe the broader approach I'm proposing is too difficult, and a narrow optimization is worthwhile. Regardless, this should be reverted; it needs more iteration before it is correct.


Patch is 411.28 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/170381.diff

50 Files Affected:

  • (modified) llvm/include/llvm/Transforms/Utils/Local.h (+1-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp (+1-1)
  • (modified) llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp (+2-2)
  • (modified) llvm/lib/Transforms/Scalar/SROA.cpp (+1-1)
  • (modified) llvm/lib/Transforms/Utils/Local.cpp (+22-38)
  • (modified) llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (+63-219)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/amdgpu-irtranslator.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll (+129-133)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+792-825)
  • (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.bf16.ll (+40-18)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.f16.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll (+26-28)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+162-171)
  • (modified) llvm/test/CodeGen/AMDGPU/fnearbyint.ll (+11-12)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.bf16.ll (+39-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.f16.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_sint.ll (+49-57)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_uint.ll (+30-39)
  • (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (+149-141)
  • (modified) llvm/test/CodeGen/AMDGPU/fshr.ll (+206-197)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics.ll (-82)
  • (modified) llvm/test/CodeGen/AMDGPU/half.ll (+109-65)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll (+27-27)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/kernel-args.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll (+120-192)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp.ll (+58-59)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp10.ll (+58-59)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp2.ll (+33-34)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+56-57)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+56-57)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log2.ll (+12-14)
  • (modified) llvm/test/CodeGen/AMDGPU/max.ll (+16-18)
  • (modified) llvm/test/CodeGen/AMDGPU/min.ll (+108-108)
  • (modified) llvm/test/CodeGen/AMDGPU/packed-fp32.ll (+30-58)
  • (modified) llvm/test/CodeGen/AMDGPU/rotl.ll (+36-38)
  • (modified) llvm/test/CodeGen/AMDGPU/rotr.ll (+28-30)
  • (modified) llvm/test/CodeGen/AMDGPU/s_addk_i32.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sminmax.v2i16.ll (+63-64)
  • (modified) llvm/test/CodeGen/AMDGPU/store-to-constant.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/udivrem.ll (+61-65)
  • (modified) llvm/test/CodeGen/AMDGPU/uint_to_fp.f64.ll (+1-1)
  • (removed) llvm/test/Transforms/InstCombine/copy-access-metadata.ll (-215)
  • (removed) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/copy-metadata-load-store.ll (-159)
  • (modified) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors-complex.ll (+66-258)
  • (modified) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll (+3-281)
diff --git a/llvm/include/llvm/Transforms/Utils/Local.h b/llvm/include/llvm/Transforms/Utils/Local.h
index d0af2d3d2e4c2..9acfd872e574b 100644
--- a/llvm/include/llvm/Transforms/Utils/Local.h
+++ b/llvm/include/llvm/Transforms/Utils/Local.h
@@ -431,7 +431,7 @@ LLVM_ABI void combineAAMetadata(Instruction *K, const Instruction *J);
 
 /// Copy the metadata from the source instruction to the destination (the
 /// replacement for the source instruction).
-LLVM_ABI void copyMetadataForAccess(Instruction &Dest, Instruction &Source);
+LLVM_ABI void copyMetadataForLoad(LoadInst &Dest, const LoadInst &Source);
 
 /// Patch the replacement so that it is not more restrictive than the value
 /// being replaced. It assumes that the replacement does not get moved from
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
index 98884c441096e..fdff21b6ef8df 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
@@ -1035,7 +1035,7 @@ bool LegalizeBufferContentTypesVisitor::visitLoadImpl(
       LoadInst *NewLI = IRB.CreateAlignedLoad(
           LoadableType, NewPtr, commonAlignment(OrigLI.getAlign(), ByteOffset),
           Name + ".off." + Twine(ByteOffset));
-      copyMetadataForAccess(*NewLI, OrigLI);
+      copyMetadataForLoad(*NewLI, OrigLI);
       NewLI->setAAMetadata(
           AANodes.adjustForAccess(ByteOffset, LoadableType, DL));
       NewLI->setAtomic(OrigLI.getOrdering(), OrigLI.getSyncScopeID());
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp b/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
index 3e04aeb675d2a..9491610190c10 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
@@ -415,7 +415,7 @@ void PointerReplacer::replace(Instruction *I) {
                               LT->getAlign(), LT->getOrdering(),
                               LT->getSyncScopeID());
     NewI->takeName(LT);
-    copyMetadataForAccess(*NewI, *LT);
+    copyMetadataForLoad(*NewI, *LT);
 
     IC.InsertNewInstWith(NewI, LT->getIterator());
     IC.replaceInstUsesWith(*LT, NewI);
@@ -606,7 +606,7 @@ LoadInst *InstCombinerImpl::combineLoadToNewType(LoadInst &LI, Type *NewTy,
       Builder.CreateAlignedLoad(NewTy, LI.getPointerOperand(), LI.getAlign(),
                                 LI.isVolatile(), LI.getName() + Suffix);
   NewLoad->setAtomic(LI.getOrdering(), LI.getSyncScopeID());
-  copyMetadataForAccess(*NewLoad, LI);
+  copyMetadataForLoad(*NewLoad, LI);
   return NewLoad;
 }
 
diff --git a/llvm/lib/Transforms/Scalar/SROA.cpp b/llvm/lib/Transforms/Scalar/SROA.cpp
index a7c322bfcb981..70afe833c9f47 100644
--- a/llvm/lib/Transforms/Scalar/SROA.cpp
+++ b/llvm/lib/Transforms/Scalar/SROA.cpp
@@ -3272,7 +3272,7 @@ class AllocaSliceRewriter : public InstVisitor<AllocaSliceRewriter, bool> {
       // Copy any metadata that is valid for the new load. This may require
       // conversion to a different kind of metadata, e.g. !nonnull might change
       // to !range or vice versa.
-      copyMetadataForAccess(*NewLI, LI);
+      copyMetadataForLoad(*NewLI, LI);
 
       // Do this after copyMetadataForLoad() to preserve the TBAA shift.
       if (AATags)
diff --git a/llvm/lib/Transforms/Utils/Local.cpp b/llvm/lib/Transforms/Utils/Local.cpp
index dec2e019333b9..a03cf6e953e35 100644
--- a/llvm/lib/Transforms/Utils/Local.cpp
+++ b/llvm/lib/Transforms/Utils/Local.cpp
@@ -3100,70 +3100,54 @@ void llvm::combineAAMetadata(Instruction *K, const Instruction *J) {
   combineMetadata(K, J, /*DoesKMove=*/true, /*AAOnly=*/true);
 }
 
-void llvm::copyMetadataForAccess(Instruction &DestI, Instruction &SourceI) {
+void llvm::copyMetadataForLoad(LoadInst &Dest, const LoadInst &Source) {
   SmallVector<std::pair<unsigned, MDNode *>, 8> MD;
-  SourceI.getAllMetadata(MD);
-  MDBuilder MDB(DestI.getContext());
-  Type *NewType = DestI.getType();
-
-  // Only needed for range metadata on loads.
-  const DataLayout *DL = nullptr;
-  const LoadInst *LSource = dyn_cast<LoadInst>(&SourceI);
-  if (LSource)
-    DL = &LSource->getDataLayout();
-
+  Source.getAllMetadata(MD);
+  MDBuilder MDB(Dest.getContext());
+  Type *NewType = Dest.getType();
+  const DataLayout &DL = Source.getDataLayout();
   for (const auto &MDPair : MD) {
     unsigned ID = MDPair.first;
     MDNode *N = MDPair.second;
-
+    // Note, essentially every kind of metadata should be preserved here! This
+    // routine is supposed to clone a load instruction changing *only its type*.
+    // The only metadata it makes sense to drop is metadata which is invalidated
+    // when the pointer type changes. This should essentially never be the case
+    // in LLVM, but we explicitly switch over only known metadata to be
+    // conservatively correct. If you are adding metadata to LLVM which pertains
+    // to loads, you almost certainly want to add it here.
     switch (ID) {
-    // Applies to both loads and stores as-is.
     case LLVMContext::MD_dbg:
+    case LLVMContext::MD_tbaa:
     case LLVMContext::MD_prof:
+    case LLVMContext::MD_fpmath:
     case LLVMContext::MD_tbaa_struct:
+    case LLVMContext::MD_invariant_load:
     case LLVMContext::MD_alias_scope:
     case LLVMContext::MD_noalias:
     case LLVMContext::MD_nontemporal:
+    case LLVMContext::MD_mem_parallel_loop_access:
     case LLVMContext::MD_access_group:
     case LLVMContext::MD_noundef:
     case LLVMContext::MD_noalias_addrspace:
-    case LLVMContext::MD_mem_parallel_loop_access:
-      DestI.setMetadata(ID, N);
-      break;
-
-    // Load-only metadata.
-    case LLVMContext::MD_fpmath:
-    case LLVMContext::MD_invariant_load:
-      if (isa<LoadInst>(DestI))
-        DestI.setMetadata(ID, N);
+      // All of these directly apply.
+      Dest.setMetadata(ID, N);
       break;
 
     case LLVMContext::MD_nonnull:
-      if (auto *LDest = dyn_cast<LoadInst>(&DestI)) {
-        if (LSource)
-          copyNonnullMetadata(*LSource, N, *LDest);
-      }
+      copyNonnullMetadata(Source, N, Dest);
       break;
 
     case LLVMContext::MD_align:
     case LLVMContext::MD_dereferenceable:
     case LLVMContext::MD_dereferenceable_or_null:
-      // Applies to both loads and stores only if the new type is also a
-      // pointer.
+      // These only directly apply if the new type is also a pointer.
       if (NewType->isPointerTy())
-        DestI.setMetadata(ID, N);
+        Dest.setMetadata(ID, N);
       break;
 
     case LLVMContext::MD_range:
-      if (auto *LDest = dyn_cast<LoadInst>(&DestI)) {
-        if (LSource && DL)
-          copyRangeMetadata(*DL, *LSource, N, *LDest);
-      }
-      break;
-
-    case LLVMContext::MD_tbaa:
-      if (isa<LoadInst>(DestI))
-        DestI.setMetadata(ID, N);
+      copyRangeMetadata(DL, Source, N, Dest);
       break;
     }
   }
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 114df653bad83..c28314f6ab124 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -112,7 +112,6 @@
 #include <optional>
 #include <tuple>
 #include <type_traits>
-#include <unordered_map>
 #include <utility>
 #include <vector>
 
@@ -269,6 +268,11 @@ class Vectorizer {
   /// isGuaranteedToTransferExecutionToSuccessor(I) == true.
   bool runOnPseudoBB(BasicBlock::iterator Begin, BasicBlock::iterator End);
 
+  /// Runs the vectorizer on one equivalence class, i.e. one set of loads/stores
+  /// in the same BB with the same value for getUnderlyingObject() etc.
+  bool runOnEquivalenceClass(const EqClassKey &EqClassKey,
+                             ArrayRef<Instruction *> EqClass);
+
   /// Runs the vectorizer on one chain, i.e. a subset of an equivalence class
   /// where all instructions access a known, constant offset from the first
   /// instruction.
@@ -334,22 +338,12 @@ class Vectorizer {
   EquivalenceClassMap collectEquivalenceClasses(BasicBlock::iterator Begin,
                                                 BasicBlock::iterator End);
 
-  /// Inserts a cast instruction to convert Inst to DstTy.
-  Value *insertCast(Value *Val, Type *DstTy);
-
   /// Partitions Instrs into "chains" where every instruction has a known
   /// constant offset from the first instr in the chain.
   ///
   /// Postcondition: For all i, ret[i][0].second == 0, because the first instr
   /// in the chain is the leader, and an instr touches distance 0 from itself.
   std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
-
-  // Helpers for chain merging.
-  std::optional<APInt> computeLeaderDelta(Instruction *I1, Instruction *I2);
-  bool chainsOverlapAfterRebase(const Chain &A, const Chain &B,
-                                const APInt &Delta) const;
-  static void rebaseChain(Chain &C, const APInt &Delta);
-  void normalizeChainToType(Chain &C, Type *CastTy);
 };
 
 class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -431,20 +425,6 @@ PreservedAnalyses LoadStoreVectorizerPass::run(Function &F,
   return Changed ? PA : PreservedAnalyses::all();
 }
 
-static const Value *getUnderlyingObject(const Value *Ptr) {
-  const Value *ObjPtr = llvm::getUnderlyingObject(Ptr);
-  if (const auto *Sel = dyn_cast<SelectInst>(ObjPtr)) {
-    // The select's themselves are distinct instructions even if they share
-    // the same condition and evaluate to consecutive pointers for true and
-    // false values of the condition. Therefore using the select's themselves
-    // for grouping instructions would put consecutive accesses into different
-    // lists and they won't be even checked for being consecutive, and won't
-    // be vectorized.
-    return Sel->getCondition();
-  }
-  return ObjPtr;
-}
-
 bool Vectorizer::run() {
   bool Changed = false;
   // Break up the BB if there are any instrs which aren't guaranteed to transfer
@@ -488,88 +468,6 @@ bool Vectorizer::run() {
   return Changed;
 }
 
-Value *Vectorizer::insertCast(Value *Val, Type *DstTy) {
-  if (DL.getTypeSizeInBits(Val->getType()) == DL.getTypeSizeInBits(DstTy)) {
-    return Builder.CreateBitOrPointerCast(Val, DstTy, Val->getName() + ".bc");
-  }
-
-  // If the types are of different sizes and both are integers, we can use
-  // zext or sext to cast.
-  if (Val->getType()->isIntegerTy() && DstTy->isIntegerTy()) {
-    if (DL.getTypeSizeInBits(Val->getType()) < DL.getTypeSizeInBits(DstTy)) {
-      return Builder.CreateZExt(Val, DstTy, Val->getName() + ".bc");
-    }
-    return Builder.CreateTrunc(Val, DstTy, Val->getName() + ".bc");
-  }
-
-  return nullptr;
-}
-
-std::optional<APInt> Vectorizer::computeLeaderDelta(Instruction *I1,
-                                                    Instruction *I2) {
-  assert(((isa<LoadInst>(I1) && isa<LoadInst>(I2)) ||
-          (isa<StoreInst>(I1) && isa<StoreInst>(I2))) &&
-         "computeLeaderDelta must be called with two load or two store "
-         "instructions");
-  Instruction *CtxInst = I1->comesBefore(I2) ? I2 : I1;
-  const Value *Ptr1 = getLoadStorePointerOperand(I1);
-  const Value *Ptr2 = getLoadStorePointerOperand(I2);
-  return getConstantOffset(const_cast<Value *>(Ptr1), const_cast<Value *>(Ptr2),
-                           CtxInst);
-}
-
-bool Vectorizer::chainsOverlapAfterRebase(const Chain &A, const Chain &B,
-                                          const APInt &Delta) const {
-  ConstantRange ARange(
-      A.front().OffsetFromLeader,
-      A.back().OffsetFromLeader +
-          DL.getTypeStoreSize(getLoadStoreType(A.back().Inst)));
-  ConstantRange BRange(
-      B.front().OffsetFromLeader + Delta,
-      B.back().OffsetFromLeader + Delta +
-          DL.getTypeStoreSize(getLoadStoreType(B.back().Inst)));
-  return !ARange.intersectWith(BRange).isEmptySet();
-}
-
-void Vectorizer::rebaseChain(Chain &C, const APInt &Delta) {
-  for (ChainElem &E : C)
-    E.OffsetFromLeader += Delta;
-}
-
-void Vectorizer::normalizeChainToType(Chain &C, Type *CastTy) {
-  for (ChainElem &Elem : C) {
-    Instruction *Inst = Elem.Inst;
-    Type *OrigValTy = getLoadStoreType(Inst);
-    if (OrigValTy == CastTy)
-      continue;
-
-    if (auto *LI = dyn_cast<LoadInst>(Inst)) {
-      Builder.SetInsertPoint(LI);
-      LoadInst *NewLoad = Builder.CreateLoad(CastTy, LI->getPointerOperand(),
-                                             LI->getName() + ".mut");
-      copyMetadataForAccess(*NewLoad, *LI);
-      Value *CastBack = insertCast(NewLoad, OrigValTy);
-      if (!CastBack)
-        llvm_unreachable("Failed to insert cast");
-      LI->replaceAllUsesWith(CastBack);
-      ToErase.emplace_back(LI);
-      Elem.Inst = NewLoad;
-    } else if (auto *SI = dyn_cast<StoreInst>(Inst)) {
-      Builder.SetInsertPoint(SI);
-      Value *CastVal = insertCast(SI->getValueOperand(), CastTy);
-      if (!CastVal)
-        llvm_unreachable("Failed to insert cast");
-      StoreInst *NewStore =
-          Builder.CreateStore(CastVal, SI->getPointerOperand());
-      NewStore->setAlignment(SI->getAlign());
-      NewStore->setVolatile(SI->isVolatile());
-      copyMetadataForAccess(*NewStore, *SI);
-      ToErase.emplace_back(SI);
-      Elem.Inst = NewStore;
-    }
-  }
-}
-
 bool Vectorizer::runOnPseudoBB(BasicBlock::iterator Begin,
                                BasicBlock::iterator End) {
   LLVM_DEBUG({
@@ -582,120 +480,49 @@ bool Vectorizer::runOnPseudoBB(BasicBlock::iterator Begin,
   });
 
   bool Changed = false;
-  SmallVector<Chain> ContiguousSubChains;
-
   for (const auto &[EqClassKey, EqClass] :
-       collectEquivalenceClasses(Begin, End)) {
-
-    LLVM_DEBUG({
-      dbgs() << "LSV: Running on equivalence class of size " << EqClass.size()
-             << " keyed on " << EqClassKey << ":\n";
-      for (Instruction *I : EqClass)
-        dbgs() << "  " << *I << "\n";
-    });
-
-    for (Chain &C : gatherChains(EqClass)) {
-
-      // Split up the chain into increasingly smaller chains, until we can
-      // finally vectorize the chains.
-      //
-      // (Don't be scared by the depth of the loop nest here.  These operations
-      // are all at worst O(n lg n) in the number of instructions, and splitting
-      // chains doesn't change the number of instrs.  So the whole loop nest is
-      // O(n lg n).)
-      for (auto &C : splitChainByMayAliasInstrs(C)) {
-        for (auto &C : splitChainByContiguity(C)) {
-          ContiguousSubChains.emplace_back(C);
-        }
-      }
-    }
-  }
-
-  // Merge chains in reverse order, so that the first chain is the largest.
-  for (int I = ContiguousSubChains.size() - 1; I > 0; I--) {
-    Chain &C1 = ContiguousSubChains[I - 1];
-    Chain &C2 = ContiguousSubChains[I];
+       collectEquivalenceClasses(Begin, End))
+    Changed |= runOnEquivalenceClass(EqClassKey, EqClass);
 
-    // If the scalar types of the chains are the same, we can merge them
-    // without inserting any casts.
-    if (getLoadStoreType(C1[0].Inst)->getScalarType() ==
-        getLoadStoreType(C2[0].Inst)->getScalarType())
-      continue;
-
-    const Value *C1Ptr = getLoadStorePointerOperand(C1[0].Inst);
-    const Value *C2Ptr = getLoadStorePointerOperand(C2[0].Inst);
-    unsigned AS1 = C1Ptr->getType()->getPointerAddressSpace();
-    unsigned AS2 = C2Ptr->getType()->getPointerAddressSpace();
-    bool C1IsLoad = isa<LoadInst>(C1[0].Inst);
-    bool C2IsLoad = isa<LoadInst>(C2[0].Inst);
-
-    // If the chains are mapped to different types, have distinct underlying
-    // pointer objects, or include both loads and stores, skip.
-    if (C1IsLoad != C2IsLoad || AS1 != AS2 ||
-        ::getUnderlyingObject(C1Ptr) != ::getUnderlyingObject(C2Ptr))
-      continue;
-
-    // Compute constant offset between chain leaders; if unknown, skip.
-    std::optional<APInt> DeltaOpt = computeLeaderDelta(C1[0].Inst, C2[0].Inst);
-    if (!DeltaOpt)
-      continue;
-
-    // Check that rebasing C2 into C1's coordinate space will not overlap C1.
-    if (chainsOverlapAfterRebase(C1, C2, *DeltaOpt))
-      continue;
-
-    // Determine the common integer cast type for normalization and ensure total
-    // bitwidth matches across all elements of both chains.
-    Type *C1ElemTy = getLoadStoreType(C1[0].Inst);
-    unsigned TotalBits = DL.getTypeSizeInBits(C1ElemTy);
-    auto AllElemsMatchTotalBits = [&](const Chain &C) {
-      return llvm::all_of(C, [&](const ChainElem &E) {
-        return DL.getTypeSizeInBits(getLoadStoreType(E.Inst)) == TotalBits;
-      });
-    };
-    if (!AllElemsMatchTotalBits(C1) || !AllElemsMatchTotalBits(C2))
-      continue;
+  return Changed;
+}
 
-    // Power-of-two span ensures we can form a legal, single vector access
-    // without padding or splitting. Many targets and cost models assume POT
-    // widths, and it guarantees an integral element count for the chosen
-    // VecElemTy.
-    APInt Sz = C2.front().OffsetFromLeader +
-               DL.getTypeStoreSize(getLoadStoreType(C2.front().Inst)) -
-               C1.back().OffsetFromLeader + *DeltaOpt;
-    if (!Sz.isPowerOf2())
-      continue;
+bool Vectorizer::runOnEquivalenceClass(const EqClassKey &EqClassKey,
+                                       ArrayRef<Instruction *> EqClass) {
+  bool Changed = false;
 
-    // Rebase C2's offsets into C1's coordinate space prior to merging and
-    // merge C2 into C1 by appending all elements of C2 to C1, then erase C2
-    // from ContiguousSubChains.
-    rebaseChain(C2, *DeltaOpt);
-    C1.insert(C1.end(), C2.begin(), C2.end());
-    ContiguousSubChains.erase(ContiguousSubChains.begin() + I);
-
-    // Normalize the value operand/result type of each instruction in C1 to
-    // C1CastTy.
-    Type *C1CastTy =
-        Type::getIntNTy(C1ElemTy->getContext(), DL.getTypeSizeInBits(C1ElemTy));
-    normalizeChainToType(C1, C1CastTy);
-  }
+  LLVM_DEBUG({
+    dbgs() << "LSV: Running on equivalence class of size " << EqClass.size()
+           << " keyed on " << EqClassKey << ":\n";
+    for (Instruction *I : EqClass)
+      dbgs() << "  " << *I << "\n";
+  });
 
-  for (auto &C : ContiguousSubChains) {
-    if (C.size() <= 1)
-      continue;
-    for (auto &AlignedSubChain : splitChainByAlignment(C))
-      Changed |= vectorizeChain(AlignedSubChain);
-  }
+  std::vector<Chain> Chains = gatherChains(EqClass);
+  LLVM_DEBUG(dbgs() << "LSV: Got " << Chains.size()
+                    << " nontrivial chains.\n";);
+  for (Chain &C : Chains)
+    Changed |= runOnChain(C);
+  return Changed;
+}
 
-  // Erase all instructions scheduled for deletion in this pseudo-BB.
-  for (Instruction *I : ToErase) {
-    auto *PtrOperand = getLoadStorePointerOperand(I);
-    if (I->use_empty())
-      I->eraseFromParent();
-    RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
-  }
-  ToErase.clear();
+bool Vectorizer::runOnChain(Chain &C) {
+  LLVM_DEBUG({
+    dbgs() << "LSV: Running on chain with " << C.size() << " instructions:\n";
+    dumpChain(C);
+  });
 
+  // Split up the chain into increasingly smaller chains, until we can finally
+  // vectorize the chains.
+  //
+  // (Don't be scared by the depth of the loop nest here.  These operations are
+  // all at worst O(n lg n) in the number of instructions, and splitting chains
+  // doesn't change the number of instrs.  So the whole loop nest is O(n lg n).)
+  bool Changed = false;
+  for (auto &C : splitChainByMayAliasInstrs(C))
+    for (auto &C : splitChainByContiguity(C))
+      for (auto &C : splitChainByAlignment(C))
+        Changed |= vectorizeChain(C);
   return Changed;
 }
 
@@ -756,7 +583,7 @@ std::vector<Chain> Vectorizer::splitChainByMayAliasInstrs(Chain &C) {
         LLVM_DEBUG(
             dbgs() << "LSV: Found intervening may-alias instrs; cannot merge "
                    << *ChainIt->Inst << " into " << *ChainBegin->Inst << "\n");
-        if (!NewChain.empty()) {
+        if (NewChain.size() > 1) {
           LLVM_DEBUG({
             dbgs() << "LSV: got nontrivial chain without aliasing instrs:\n";
             dumpChain(NewChain);
@@ -768,7 +595,7 @@ std::vector<Chain> Vectorizer::splitChainByMayAliasInstrs(Chain &C) {
         NewChain = SmallVector<ChainElem, 1>({*ChainIt});
       }
     }
-    if (!NewChain.empty()) {
+    if (NewChain.size() > 1) {
       LLVM_DEBUG({
         dbgs() << "LSV: got ...
[truncated]

@llvmbot
Member

llvmbot commented Dec 2, 2025

@llvm/pr-subscribers-backend-amdgpu


@llvmbot
Member

llvmbot commented Dec 2, 2025

@llvm/pr-subscribers-vectorizers

   SmallVector<std::pair<unsigned, MDNode *>, 8> MD;
-  SourceI.getAllMetadata(MD);
-  MDBuilder MDB(DestI.getContext());
-  Type *NewType = DestI.getType();
-
-  // Only needed for range metadata on loads.
-  const DataLayout *DL = nullptr;
-  const LoadInst *LSource = dyn_cast<LoadInst>(&SourceI);
-  if (LSource)
-    DL = &LSource->getDataLayout();
-
+  Source.getAllMetadata(MD);
+  MDBuilder MDB(Dest.getContext());
+  Type *NewType = Dest.getType();
+  const DataLayout &DL = Source.getDataLayout();
   for (const auto &MDPair : MD) {
     unsigned ID = MDPair.first;
     MDNode *N = MDPair.second;
-
+    // Note, essentially every kind of metadata should be preserved here! This
+    // routine is supposed to clone a load instruction changing *only its type*.
+    // The only metadata it makes sense to drop is metadata which is invalidated
+    // when the pointer type changes. This should essentially never be the case
+    // in LLVM, but we explicitly switch over only known metadata to be
+    // conservatively correct. If you are adding metadata to LLVM which pertains
+    // to loads, you almost certainly want to add it here.
     switch (ID) {
-    // Applies to both loads and stores as-is.
     case LLVMContext::MD_dbg:
+    case LLVMContext::MD_tbaa:
     case LLVMContext::MD_prof:
+    case LLVMContext::MD_fpmath:
     case LLVMContext::MD_tbaa_struct:
+    case LLVMContext::MD_invariant_load:
     case LLVMContext::MD_alias_scope:
     case LLVMContext::MD_noalias:
     case LLVMContext::MD_nontemporal:
+    case LLVMContext::MD_mem_parallel_loop_access:
     case LLVMContext::MD_access_group:
     case LLVMContext::MD_noundef:
     case LLVMContext::MD_noalias_addrspace:
-    case LLVMContext::MD_mem_parallel_loop_access:
-      DestI.setMetadata(ID, N);
-      break;
-
-    // Load-only metadata.
-    case LLVMContext::MD_fpmath:
-    case LLVMContext::MD_invariant_load:
-      if (isa<LoadInst>(DestI))
-        DestI.setMetadata(ID, N);
+      // All of these directly apply.
+      Dest.setMetadata(ID, N);
       break;
 
     case LLVMContext::MD_nonnull:
-      if (auto *LDest = dyn_cast<LoadInst>(&DestI)) {
-        if (LSource)
-          copyNonnullMetadata(*LSource, N, *LDest);
-      }
+      copyNonnullMetadata(Source, N, Dest);
       break;
 
     case LLVMContext::MD_align:
     case LLVMContext::MD_dereferenceable:
     case LLVMContext::MD_dereferenceable_or_null:
-      // Applies to both loads and stores only if the new type is also a
-      // pointer.
+      // These only directly apply if the new type is also a pointer.
       if (NewType->isPointerTy())
-        DestI.setMetadata(ID, N);
+        Dest.setMetadata(ID, N);
       break;
 
     case LLVMContext::MD_range:
-      if (auto *LDest = dyn_cast<LoadInst>(&DestI)) {
-        if (LSource && DL)
-          copyRangeMetadata(*DL, *LSource, N, *LDest);
-      }
-      break;
-
-    case LLVMContext::MD_tbaa:
-      if (isa<LoadInst>(DestI))
-        DestI.setMetadata(ID, N);
+      copyRangeMetadata(DL, Source, N, Dest);
       break;
     }
   }
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 114df653bad83..c28314f6ab124 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -112,7 +112,6 @@
 #include <optional>
 #include <tuple>
 #include <type_traits>
-#include <unordered_map>
 #include <utility>
 #include <vector>
 
@@ -269,6 +268,11 @@ class Vectorizer {
   /// isGuaranteedToTransferExecutionToSuccessor(I) == true.
   bool runOnPseudoBB(BasicBlock::iterator Begin, BasicBlock::iterator End);
 
+  /// Runs the vectorizer on one equivalence class, i.e. one set of loads/stores
+  /// in the same BB with the same value for getUnderlyingObject() etc.
+  bool runOnEquivalenceClass(const EqClassKey &EqClassKey,
+                             ArrayRef<Instruction *> EqClass);
+
   /// Runs the vectorizer on one chain, i.e. a subset of an equivalence class
   /// where all instructions access a known, constant offset from the first
   /// instruction.
@@ -334,22 +338,12 @@ class Vectorizer {
   EquivalenceClassMap collectEquivalenceClasses(BasicBlock::iterator Begin,
                                                 BasicBlock::iterator End);
 
-  /// Inserts a cast instruction to convert Inst to DstTy.
-  Value *insertCast(Value *Val, Type *DstTy);
-
   /// Partitions Instrs into "chains" where every instruction has a known
   /// constant offset from the first instr in the chain.
   ///
   /// Postcondition: For all i, ret[i][0].second == 0, because the first instr
   /// in the chain is the leader, and an instr touches distance 0 from itself.
   std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
-
-  // Helpers for chain merging.
-  std::optional<APInt> computeLeaderDelta(Instruction *I1, Instruction *I2);
-  bool chainsOverlapAfterRebase(const Chain &A, const Chain &B,
-                                const APInt &Delta) const;
-  static void rebaseChain(Chain &C, const APInt &Delta);
-  void normalizeChainToType(Chain &C, Type *CastTy);
 };
 
 class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -431,20 +425,6 @@ PreservedAnalyses LoadStoreVectorizerPass::run(Function &F,
   return Changed ? PA : PreservedAnalyses::all();
 }
 
-static const Value *getUnderlyingObject(const Value *Ptr) {
-  const Value *ObjPtr = llvm::getUnderlyingObject(Ptr);
-  if (const auto *Sel = dyn_cast<SelectInst>(ObjPtr)) {
-    // The select's themselves are distinct instructions even if they share
-    // the same condition and evaluate to consecutive pointers for true and
-    // false values of the condition. Therefore using the select's themselves
-    // for grouping instructions would put consecutive accesses into different
-    // lists and they won't be even checked for being consecutive, and won't
-    // be vectorized.
-    return Sel->getCondition();
-  }
-  return ObjPtr;
-}
-
 bool Vectorizer::run() {
   bool Changed = false;
   // Break up the BB if there are any instrs which aren't guaranteed to transfer
@@ -488,88 +468,6 @@ bool Vectorizer::run() {
   return Changed;
 }
 
-Value *Vectorizer::insertCast(Value *Val, Type *DstTy) {
-  if (DL.getTypeSizeInBits(Val->getType()) == DL.getTypeSizeInBits(DstTy)) {
-    return Builder.CreateBitOrPointerCast(Val, DstTy, Val->getName() + ".bc");
-  }
-
-  // If the types are of different sizes and both are integers, we can use
-  // zext or sext to cast.
-  if (Val->getType()->isIntegerTy() && DstTy->isIntegerTy()) {
-    if (DL.getTypeSizeInBits(Val->getType()) < DL.getTypeSizeInBits(DstTy)) {
-      return Builder.CreateZExt(Val, DstTy, Val->getName() + ".bc");
-    }
-    return Builder.CreateTrunc(Val, DstTy, Val->getName() + ".bc");
-  }
-
-  return nullptr;
-}
-
-std::optional<APInt> Vectorizer::computeLeaderDelta(Instruction *I1,
-                                                    Instruction *I2) {
-  assert(((isa<LoadInst>(I1) && isa<LoadInst>(I2)) ||
-          (isa<StoreInst>(I1) && isa<StoreInst>(I2))) &&
-         "computeLeaderDelta must be called with two load or two store "
-         "instructions");
-  Instruction *CtxInst = I1->comesBefore(I2) ? I2 : I1;
-  const Value *Ptr1 = getLoadStorePointerOperand(I1);
-  const Value *Ptr2 = getLoadStorePointerOperand(I2);
-  return getConstantOffset(const_cast<Value *>(Ptr1), const_cast<Value *>(Ptr2),
-                           CtxInst);
-}
-
-bool Vectorizer::chainsOverlapAfterRebase(const Chain &A, const Chain &B,
-                                          const APInt &Delta) const {
-  ConstantRange ARange(
-      A.front().OffsetFromLeader,
-      A.back().OffsetFromLeader +
-          DL.getTypeStoreSize(getLoadStoreType(A.back().Inst)));
-  ConstantRange BRange(
-      B.front().OffsetFromLeader + Delta,
-      B.back().OffsetFromLeader + Delta +
-          DL.getTypeStoreSize(getLoadStoreType(B.back().Inst)));
-  return !ARange.intersectWith(BRange).isEmptySet();
-}
-
-void Vectorizer::rebaseChain(Chain &C, const APInt &Delta) {
-  for (ChainElem &E : C)
-    E.OffsetFromLeader += Delta;
-}
-
-void Vectorizer::normalizeChainToType(Chain &C, Type *CastTy) {
-  for (ChainElem &Elem : C) {
-    Instruction *Inst = Elem.Inst;
-    Type *OrigValTy = getLoadStoreType(Inst);
-    if (OrigValTy == CastTy)
-      continue;
-
-    if (auto *LI = dyn_cast<LoadInst>(Inst)) {
-      Builder.SetInsertPoint(LI);
-      LoadInst *NewLoad = Builder.CreateLoad(CastTy, LI->getPointerOperand(),
-                                             LI->getName() + ".mut");
-      copyMetadataForAccess(*NewLoad, *LI);
-      Value *CastBack = insertCast(NewLoad, OrigValTy);
-      if (!CastBack)
-        llvm_unreachable("Failed to insert cast");
-      LI->replaceAllUsesWith(CastBack);
-      ToErase.emplace_back(LI);
-      Elem.Inst = NewLoad;
-    } else if (auto *SI = dyn_cast<StoreInst>(Inst)) {
-      Builder.SetInsertPoint(SI);
-      Value *CastVal = insertCast(SI->getValueOperand(), CastTy);
-      if (!CastVal)
-        llvm_unreachable("Failed to insert cast");
-      StoreInst *NewStore =
-          Builder.CreateStore(CastVal, SI->getPointerOperand());
-      NewStore->setAlignment(SI->getAlign());
-      NewStore->setVolatile(SI->isVolatile());
-      copyMetadataForAccess(*NewStore, *SI);
-      ToErase.emplace_back(SI);
-      Elem.Inst = NewStore;
-    }
-  }
-}
-
 bool Vectorizer::runOnPseudoBB(BasicBlock::iterator Begin,
                                BasicBlock::iterator End) {
   LLVM_DEBUG({
@@ -582,120 +480,49 @@ bool Vectorizer::runOnPseudoBB(BasicBlock::iterator Begin,
   });
 
   bool Changed = false;
-  SmallVector<Chain> ContiguousSubChains;
-
   for (const auto &[EqClassKey, EqClass] :
-       collectEquivalenceClasses(Begin, End)) {
-
-    LLVM_DEBUG({
-      dbgs() << "LSV: Running on equivalence class of size " << EqClass.size()
-             << " keyed on " << EqClassKey << ":\n";
-      for (Instruction *I : EqClass)
-        dbgs() << "  " << *I << "\n";
-    });
-
-    for (Chain &C : gatherChains(EqClass)) {
-
-      // Split up the chain into increasingly smaller chains, until we can
-      // finally vectorize the chains.
-      //
-      // (Don't be scared by the depth of the loop nest here.  These operations
-      // are all at worst O(n lg n) in the number of instructions, and splitting
-      // chains doesn't change the number of instrs.  So the whole loop nest is
-      // O(n lg n).)
-      for (auto &C : splitChainByMayAliasInstrs(C)) {
-        for (auto &C : splitChainByContiguity(C)) {
-          ContiguousSubChains.emplace_back(C);
-        }
-      }
-    }
-  }
-
-  // Merge chains in reverse order, so that the first chain is the largest.
-  for (int I = ContiguousSubChains.size() - 1; I > 0; I--) {
-    Chain &C1 = ContiguousSubChains[I - 1];
-    Chain &C2 = ContiguousSubChains[I];
+       collectEquivalenceClasses(Begin, End))
+    Changed |= runOnEquivalenceClass(EqClassKey, EqClass);
 
-    // If the scalar types of the chains are the same, we can merge them
-    // without inserting any casts.
-    if (getLoadStoreType(C1[0].Inst)->getScalarType() ==
-        getLoadStoreType(C2[0].Inst)->getScalarType())
-      continue;
-
-    const Value *C1Ptr = getLoadStorePointerOperand(C1[0].Inst);
-    const Value *C2Ptr = getLoadStorePointerOperand(C2[0].Inst);
-    unsigned AS1 = C1Ptr->getType()->getPointerAddressSpace();
-    unsigned AS2 = C2Ptr->getType()->getPointerAddressSpace();
-    bool C1IsLoad = isa<LoadInst>(C1[0].Inst);
-    bool C2IsLoad = isa<LoadInst>(C2[0].Inst);
-
-    // If the chains are mapped to different types, have distinct underlying
-    // pointer objects, or include both loads and stores, skip.
-    if (C1IsLoad != C2IsLoad || AS1 != AS2 ||
-        ::getUnderlyingObject(C1Ptr) != ::getUnderlyingObject(C2Ptr))
-      continue;
-
-    // Compute constant offset between chain leaders; if unknown, skip.
-    std::optional<APInt> DeltaOpt = computeLeaderDelta(C1[0].Inst, C2[0].Inst);
-    if (!DeltaOpt)
-      continue;
-
-    // Check that rebasing C2 into C1's coordinate space will not overlap C1.
-    if (chainsOverlapAfterRebase(C1, C2, *DeltaOpt))
-      continue;
-
-    // Determine the common integer cast type for normalization and ensure total
-    // bitwidth matches across all elements of both chains.
-    Type *C1ElemTy = getLoadStoreType(C1[0].Inst);
-    unsigned TotalBits = DL.getTypeSizeInBits(C1ElemTy);
-    auto AllElemsMatchTotalBits = [&](const Chain &C) {
-      return llvm::all_of(C, [&](const ChainElem &E) {
-        return DL.getTypeSizeInBits(getLoadStoreType(E.Inst)) == TotalBits;
-      });
-    };
-    if (!AllElemsMatchTotalBits(C1) || !AllElemsMatchTotalBits(C2))
-      continue;
+  return Changed;
+}
 
-    // Power-of-two span ensures we can form a legal, single vector access
-    // without padding or splitting. Many targets and cost models assume POT
-    // widths, and it guarantees an integral element count for the chosen
-    // VecElemTy.
-    APInt Sz = C2.front().OffsetFromLeader +
-               DL.getTypeStoreSize(getLoadStoreType(C2.front().Inst)) -
-               C1.back().OffsetFromLeader + *DeltaOpt;
-    if (!Sz.isPowerOf2())
-      continue;
+bool Vectorizer::runOnEquivalenceClass(const EqClassKey &EqClassKey,
+                                       ArrayRef<Instruction *> EqClass) {
+  bool Changed = false;
 
-    // Rebase C2's offsets into C1's coordinate space prior to merging and
-    // merge C2 into C1 by appending all elements of C2 to C1, then erase C2
-    // from ContiguousSubChains.
-    rebaseChain(C2, *DeltaOpt);
-    C1.insert(C1.end(), C2.begin(), C2.end());
-    ContiguousSubChains.erase(ContiguousSubChains.begin() + I);
-
-    // Normalize the value operand/result type of each instruction in C1 to
-    // C1CastTy.
-    Type *C1CastTy =
-        Type::getIntNTy(C1ElemTy->getContext(), DL.getTypeSizeInBits(C1ElemTy));
-    normalizeChainToType(C1, C1CastTy);
-  }
+  LLVM_DEBUG({
+    dbgs() << "LSV: Running on equivalence class of size " << EqClass.size()
+           << " keyed on " << EqClassKey << ":\n";
+    for (Instruction *I : EqClass)
+      dbgs() << "  " << *I << "\n";
+  });
 
-  for (auto &C : ContiguousSubChains) {
-    if (C.size() <= 1)
-      continue;
-    for (auto &AlignedSubChain : splitChainByAlignment(C))
-      Changed |= vectorizeChain(AlignedSubChain);
-  }
+  std::vector<Chain> Chains = gatherChains(EqClass);
+  LLVM_DEBUG(dbgs() << "LSV: Got " << Chains.size()
+                    << " nontrivial chains.\n";);
+  for (Chain &C : Chains)
+    Changed |= runOnChain(C);
+  return Changed;
+}
 
-  // Erase all instructions scheduled for deletion in this pseudo-BB.
-  for (Instruction *I : ToErase) {
-    auto *PtrOperand = getLoadStorePointerOperand(I);
-    if (I->use_empty())
-      I->eraseFromParent();
-    RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
-  }
-  ToErase.clear();
+bool Vectorizer::runOnChain(Chain &C) {
+  LLVM_DEBUG({
+    dbgs() << "LSV: Running on chain with " << C.size() << " instructions:\n";
+    dumpChain(C);
+  });
 
+  // Split up the chain into increasingly smaller chains, until we can finally
+  // vectorize the chains.
+  //
+  // (Don't be scared by the depth of the loop nest here.  These operations are
+  // all at worst O(n lg n) in the number of instructions, and splitting chains
+  // doesn't change the number of instrs.  So the whole loop nest is O(n lg n).)
+  bool Changed = false;
+  for (auto &C : splitChainByMayAliasInstrs(C))
+    for (auto &C : splitChainByContiguity(C))
+      for (auto &C : splitChainByAlignment(C))
+        Changed |= vectorizeChain(C);
   return Changed;
 }
 
@@ -756,7 +583,7 @@ std::vector<Chain> Vectorizer::splitChainByMayAliasInstrs(Chain &C) {
         LLVM_DEBUG(
             dbgs() << "LSV: Found intervening may-alias instrs; cannot merge "
                    << *ChainIt->Inst << " into " << *ChainBegin->Inst << "\n");
-        if (!NewChain.empty()) {
+        if (NewChain.size() > 1) {
           LLVM_DEBUG({
             dbgs() << "LSV: got nontrivial chain without aliasing instrs:\n";
             dumpChain(NewChain);
@@ -768,7 +595,7 @@ std::vector<Chain> Vectorizer::splitChainByMayAliasInstrs(Chain &C) {
         NewChain = SmallVector<ChainElem, 1>({*ChainIt});
       }
     }
-    if (!NewChain.empty()) {
+    if (NewChain.size() > 1) {
       LLVM_DEBUG({
         dbgs() << "LSV: got ...
[truncated]

@dakersnar dakersnar requested a review from nikic December 2, 2025 22:25
@gandhi56
Contributor

gandhi56 commented Dec 2, 2025

Hi @dakersnar, thanks for your comments. I am certainly open to improving the solution to this problem. I will approve this revert. Let's discuss in #97715.

@gandhi56 gandhi56 merged commit 9c78bc5 into main Dec 2, 2025
16 checks passed
@gandhi56 gandhi56 deleted the revert-154069-issue-97715/lsv-merge-chains branch December 2, 2025 23:28
kcloudy0717 pushed a commit to kcloudy0717/llvm-project that referenced this pull request Dec 4, 2025
Reverts llvm#154069. I pointed out a number of issues
post-merge, most importantly examples of miscompiles:
llvm#154069 (comment).

While the motivation of the change is clear, I think the implementation
approach is flawed. It seems like the goal is to allow elements like
`load <2xi16>` and `load i32` to be vectorized together despite the
current algorithm not grouping them into the same equivalence classes. I
personally think that if we want to attempt this it should be a more
wholistic approach, maybe even redefining the concept of an equivalence
class. This current solution seems like it would be really hard to do
bug-free, and even if the bugs were not present, it is only able to
merge chains that happen to be adjacent to each other after
`splitChainByContiguity`, which seems like it is leaving things up to
chance whether this optimization kicks in. But we can discuss more in
the re-land. Maybe the broader approach I'm proposing is too difficult,
and a narrow optimization is worthwhile. Regardless, this should be
reverted, it needs more iteration before it is correct.
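For readers unfamiliar with the pattern under discussion, here is a minimal, hypothetical IR sketch of the situation the reverted change targeted: an i32 load next to a <2 x i16> load on the same underlying object, which the current algorithm places in different equivalence classes and therefore never considers for vectorization together. The function name, address space, and offsets below are illustrative only; they are not taken from the patch or its tests.

; Hypothetical example (not from the patch or its tests): two adjacent loads
; with different element types on the same base pointer. Per the description
; above, these land in separate equivalence classes today; the reverted change
; attempted to merge their chains into one wider access after
; splitChainByContiguity.
define i32 @mixed_width_loads(ptr addrspace(1) %p) {
  %a = load i32, ptr addrspace(1) %p, align 8
  %q = getelementptr inbounds i8, ptr addrspace(1) %p, i64 4
  %b = load <2 x i16>, ptr addrspace(1) %q, align 4
  %b0 = extractelement <2 x i16> %b, i64 0
  %b0.ext = zext i16 %b0 to i32
  %sum = add i32 %a, %b0.ext
  ret i32 %sum
}

Under the reverted patch, chains like these could, when the right conditions lined up, be rebased to a common leader offset, normalized to a common integer type, and emitted as a single 8-byte access; with the revert, each equivalence class is again vectorized independently.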