Conversation

@dakersnar
Contributor

@dakersnar dakersnar commented Dec 2, 2025

Reverts #154069. I pointed out a number of issues post-merge, most importantly examples of miscompiles: #154069 (comment).

While the motivation of the change is clear, I think the implementation approach is flawed. It seems like the goal is to allow elements like load <2xi16> and load i32 to be vectorized together despite the current algorithm not grouping them into the same equivalence classes. I personally think that if we want to attempt this, it should be a more holistic approach, maybe even redefining the concept of an equivalence class. The current solution seems like it would be really hard to get bug-free, and even if the bugs were not present, it is only able to merge chains that happen to be adjacent to each other after splitChainByContiguity, which leaves it up to chance whether the optimization kicks in. But we can discuss more in the re-land. Maybe the broader approach I'm proposing is too difficult, and a narrow optimization is worthwhile. Regardless, this should be reverted; it needs more iteration before it is correct.
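For illustration, here is a minimal, hypothetical LLVM IR sketch of the pattern in question (my own example, not taken from the patch's tests): two contiguous loads from the same underlying object, one <2 x i16> and one i32, which the current algorithm places in different equivalence classes and therefore never considers for a single wider access.

; Hypothetical example for discussion only; names and offsets are made up.
define i32 @mixed_type_loads(ptr addrspace(1) %p) {
  ; bytes 0-3: vector load with 16-bit elements
  %v = load <2 x i16>, ptr addrspace(1) %p, align 4
  ; bytes 4-7: adjacent scalar load with a 32-bit element
  %gep = getelementptr inbounds i8, ptr addrspace(1) %p, i64 4
  %s = load i32, ptr addrspace(1) %gep, align 4
  ; use both results so neither load is dead
  %e0 = extractelement <2 x i16> %v, i32 0
  %z = zext i16 %e0 to i32
  %r = add i32 %z, %s
  ret i32 %r
}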

@dakersnar dakersnar requested a review from nikic as a code owner December 2, 2025 22:24
@dakersnar dakersnar requested review from arsenm and gandhi56 and removed request for nikic December 2, 2025 22:24
@llvmbot llvmbot added backend:AMDGPU llvm:globalisel vectorizers llvm:instcombine llvm:transforms labels Dec 2, 2025
@dakersnar dakersnar requested a review from LU-JOHN December 2, 2025 22:25
@llvmbot
Member

llvmbot commented Dec 2, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-globalisel

Author: Drew Kersnar (dakersnar)

Changes

Reverts llvm/llvm-project#154069. I pointed out a number of issues post-merge, most importantly examples of miscompiles: #154069 (comment).

While the motivation of the change is clear, I think the implementation approach is flawed. It seems like the goal is to allow elements like load <2xi16> and load i32 to be vectorized together despite the current algorithm not grouping them into the same equivalence classes. I personally think that if we want to attempt this, it should be a more holistic approach, maybe even redefining the concept of an equivalence class. The current solution seems like it would be really hard to get bug-free, and even if the bugs were not present, it is only able to merge chains that happen to be adjacent to each other after splitChainByContiguity. But we can discuss more in the re-land. Maybe the broader approach I'm proposing is too difficult, and a narrow optimization is worthwhile. Regardless, this should be reverted; it needs more iteration before it is correct.


Patch is 411.28 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/170381.diff

50 Files Affected:

  • (modified) llvm/include/llvm/Transforms/Utils/Local.h (+1-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp (+1-1)
  • (modified) llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp (+2-2)
  • (modified) llvm/lib/Transforms/Scalar/SROA.cpp (+1-1)
  • (modified) llvm/lib/Transforms/Utils/Local.cpp (+22-38)
  • (modified) llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (+63-219)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/amdgpu-irtranslator.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll (+129-133)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+792-825)
  • (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.bf16.ll (+40-18)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.f16.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll (+26-28)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+162-171)
  • (modified) llvm/test/CodeGen/AMDGPU/fnearbyint.ll (+11-12)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.bf16.ll (+39-45)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.f16.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.ll (+15-16)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_sint.ll (+49-57)
  • (modified) llvm/test/CodeGen/AMDGPU/fp_to_uint.ll (+30-39)
  • (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (+149-141)
  • (modified) llvm/test/CodeGen/AMDGPU/fshr.ll (+206-197)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics.ll (-82)
  • (modified) llvm/test/CodeGen/AMDGPU/half.ll (+109-65)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll (+27-27)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/kernel-args.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll (+120-192)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp.ll (+58-59)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp10.ll (+58-59)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp2.ll (+33-34)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+56-57)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+56-57)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log2.ll (+12-14)
  • (modified) llvm/test/CodeGen/AMDGPU/max.ll (+16-18)
  • (modified) llvm/test/CodeGen/AMDGPU/min.ll (+108-108)
  • (modified) llvm/test/CodeGen/AMDGPU/packed-fp32.ll (+30-58)
  • (modified) llvm/test/CodeGen/AMDGPU/rotl.ll (+36-38)
  • (modified) llvm/test/CodeGen/AMDGPU/rotr.ll (+28-30)
  • (modified) llvm/test/CodeGen/AMDGPU/s_addk_i32.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sminmax.v2i16.ll (+63-64)
  • (modified) llvm/test/CodeGen/AMDGPU/store-to-constant.ll (+2-4)
  • (modified) llvm/test/CodeGen/AMDGPU/udivrem.ll (+61-65)
  • (modified) llvm/test/CodeGen/AMDGPU/uint_to_fp.f64.ll (+1-1)
  • (removed) llvm/test/Transforms/InstCombine/copy-access-metadata.ll (-215)
  • (removed) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/copy-metadata-load-store.ll (-159)
  • (modified) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors-complex.ll (+66-258)
  • (modified) llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll (+3-281)
diff --git a/llvm/include/llvm/Transforms/Utils/Local.h b/llvm/include/llvm/Transforms/Utils/Local.h
index d0af2d3d2e4c2..9acfd872e574b 100644
--- a/llvm/include/llvm/Transforms/Utils/Local.h
+++ b/llvm/include/llvm/Transforms/Utils/Local.h
@@ -431,7 +431,7 @@ LLVM_ABI void combineAAMetadata(Instruction *K, const Instruction *J);
 
 /// Copy the metadata from the source instruction to the destination (the
 /// replacement for the source instruction).
-LLVM_ABI void copyMetadataForAccess(Instruction &Dest, Instruction &Source);
+LLVM_ABI void copyMetadataForLoad(LoadInst &Dest, const LoadInst &Source);
 
 /// Patch the replacement so that it is not more restrictive than the value
 /// being replaced. It assumes that the replacement does not get moved from
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
index 98884c441096e..fdff21b6ef8df 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
@@ -1035,7 +1035,7 @@ bool LegalizeBufferContentTypesVisitor::visitLoadImpl(
       LoadInst *NewLI = IRB.CreateAlignedLoad(
           LoadableType, NewPtr, commonAlignment(OrigLI.getAlign(), ByteOffset),
           Name + ".off." + Twine(ByteOffset));
-      copyMetadataForAccess(*NewLI, OrigLI);
+      copyMetadataForLoad(*NewLI, OrigLI);
       NewLI->setAAMetadata(
           AANodes.adjustForAccess(ByteOffset, LoadableType, DL));
       NewLI->setAtomic(OrigLI.getOrdering(), OrigLI.getSyncScopeID());
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp b/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
index 3e04aeb675d2a..9491610190c10 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
@@ -415,7 +415,7 @@ void PointerReplacer::replace(Instruction *I) {
                               LT->getAlign(), LT->getOrdering(),
                               LT->getSyncScopeID());
     NewI->takeName(LT);
-    copyMetadataForAccess(*NewI, *LT);
+    copyMetadataForLoad(*NewI, *LT);
 
     IC.InsertNewInstWith(NewI, LT->getIterator());
     IC.replaceInstUsesWith(*LT, NewI);
@@ -606,7 +606,7 @@ LoadInst *InstCombinerImpl::combineLoadToNewType(LoadInst &LI, Type *NewTy,
       Builder.CreateAlignedLoad(NewTy, LI.getPointerOperand(), LI.getAlign(),
                                 LI.isVolatile(), LI.getName() + Suffix);
   NewLoad->setAtomic(LI.getOrdering(), LI.getSyncScopeID());
-  copyMetadataForAccess(*NewLoad, LI);
+  copyMetadataForLoad(*NewLoad, LI);
   return NewLoad;
 }
 
diff --git a/llvm/lib/Transforms/Scalar/SROA.cpp b/llvm/lib/Transforms/Scalar/SROA.cpp
index a7c322bfcb981..70afe833c9f47 100644
--- a/llvm/lib/Transforms/Scalar/SROA.cpp
+++ b/llvm/lib/Transforms/Scalar/SROA.cpp
@@ -3272,7 +3272,7 @@ class AllocaSliceRewriter : public InstVisitor<AllocaSliceRewriter, bool> {
       // Copy any metadata that is valid for the new load. This may require
       // conversion to a different kind of metadata, e.g. !nonnull might change
       // to !range or vice versa.
-      copyMetadataForAccess(*NewLI, LI);
+      copyMetadataForLoad(*NewLI, LI);
 
       // Do this after copyMetadataForLoad() to preserve the TBAA shift.
       if (AATags)
diff --git a/llvm/lib/Transforms/Utils/Local.cpp b/llvm/lib/Transforms/Utils/Local.cpp
index dec2e019333b9..a03cf6e953e35 100644
--- a/llvm/lib/Transforms/Utils/Local.cpp
+++ b/llvm/lib/Transforms/Utils/Local.cpp
@@ -3100,70 +3100,54 @@ void llvm::combineAAMetadata(Instruction *K, const Instruction *J) {
   combineMetadata(K, J, /*DoesKMove=*/true, /*AAOnly=*/true);
 }
 
-void llvm::copyMetadataForAccess(Instruction &DestI, Instruction &SourceI) {
+void llvm::copyMetadataForLoad(LoadInst &Dest, const LoadInst &Source) {
   SmallVector<std::pair<unsigned, MDNode *>, 8> MD;
-  SourceI.getAllMetadata(MD);
-  MDBuilder MDB(DestI.getContext());
-  Type *NewType = DestI.getType();
-
-  // Only needed for range metadata on loads.
-  const DataLayout *DL = nullptr;
-  const LoadInst *LSource = dyn_cast<LoadInst>(&SourceI);
-  if (LSource)
-    DL = &LSource->getDataLayout();
-
+  Source.getAllMetadata(MD);
+  MDBuilder MDB(Dest.getContext());
+  Type *NewType = Dest.getType();
+  const DataLayout &DL = Source.getDataLayout();
   for (const auto &MDPair : MD) {
     unsigned ID = MDPair.first;
     MDNode *N = MDPair.second;
-
+    // Note, essentially every kind of metadata should be preserved here! This
+    // routine is supposed to clone a load instruction changing *only its type*.
+    // The only metadata it makes sense to drop is metadata which is invalidated
+    // when the pointer type changes. This should essentially never be the case
+    // in LLVM, but we explicitly switch over only known metadata to be
+    // conservatively correct. If you are adding metadata to LLVM which pertains
+    // to loads, you almost certainly want to add it here.
     switch (ID) {
-    // Applies to both loads and stores as-is.
     case LLVMContext::MD_dbg:
+    case LLVMContext::MD_tbaa:
     case LLVMContext::MD_prof:
+    case LLVMContext::MD_fpmath:
     case LLVMContext::MD_tbaa_struct:
+    case LLVMContext::MD_invariant_load:
     case LLVMContext::MD_alias_scope:
     case LLVMContext::MD_noalias:
     case LLVMContext::MD_nontemporal:
+    case LLVMContext::MD_mem_parallel_loop_access:
     case LLVMContext::MD_access_group:
     case LLVMContext::MD_noundef:
     case LLVMContext::MD_noalias_addrspace:
-    case LLVMContext::MD_mem_parallel_loop_access:
-      DestI.setMetadata(ID, N);
-      break;
-
-    // Load-only metadata.
-    case LLVMContext::MD_fpmath:
-    case LLVMContext::MD_invariant_load:
-      if (isa<LoadInst>(DestI))
-        DestI.setMetadata(ID, N);
+      // All of these directly apply.
+      Dest.setMetadata(ID, N);
       break;
 
     case LLVMContext::MD_nonnull:
-      if (auto *LDest = dyn_cast<LoadInst>(&DestI)) {
-        if (LSource)
-          copyNonnullMetadata(*LSource, N, *LDest);
-      }
+      copyNonnullMetadata(Source, N, Dest);
       break;
 
     case LLVMContext::MD_align:
     case LLVMContext::MD_dereferenceable:
     case LLVMContext::MD_dereferenceable_or_null:
-      // Applies to both loads and stores only if the new type is also a
-      // pointer.
+      // These only directly apply if the new type is also a pointer.
       if (NewType->isPointerTy())
-        DestI.setMetadata(ID, N);
+        Dest.setMetadata(ID, N);
       break;
 
     case LLVMContext::MD_range:
-      if (auto *LDest = dyn_cast<LoadInst>(&DestI)) {
-        if (LSource && DL)
-          copyRangeMetadata(*DL, *LSource, N, *LDest);
-      }
-      break;
-
-    case LLVMContext::MD_tbaa:
-      if (isa<LoadInst>(DestI))
-        DestI.setMetadata(ID, N);
+      copyRangeMetadata(DL, Source, N, Dest);
       break;
     }
   }
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 114df653bad83..c28314f6ab124 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -112,7 +112,6 @@
 #include <optional>
 #include <tuple>
 #include <type_traits>
-#include <unordered_map>
 #include <utility>
 #include <vector>
 
@@ -269,6 +268,11 @@ class Vectorizer {
   /// isGuaranteedToTransferExecutionToSuccessor(I) == true.
   bool runOnPseudoBB(BasicBlock::iterator Begin, BasicBlock::iterator End);
 
+  /// Runs the vectorizer on one equivalence class, i.e. one set of loads/stores
+  /// in the same BB with the same value for getUnderlyingObject() etc.
+  bool runOnEquivalenceClass(const EqClassKey &EqClassKey,
+                             ArrayRef<Instruction *> EqClass);
+
   /// Runs the vectorizer on one chain, i.e. a subset of an equivalence class
   /// where all instructions access a known, constant offset from the first
   /// instruction.
@@ -334,22 +338,12 @@ class Vectorizer {
   EquivalenceClassMap collectEquivalenceClasses(BasicBlock::iterator Begin,
                                                 BasicBlock::iterator End);
 
-  /// Inserts a cast instruction to convert Inst to DstTy.
-  Value *insertCast(Value *Val, Type *DstTy);
-
   /// Partitions Instrs into "chains" where every instruction has a known
   /// constant offset from the first instr in the chain.
   ///
   /// Postcondition: For all i, ret[i][0].second == 0, because the first instr
   /// in the chain is the leader, and an instr touches distance 0 from itself.
   std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
-
-  // Helpers for chain merging.
-  std::optional<APInt> computeLeaderDelta(Instruction *I1, Instruction *I2);
-  bool chainsOverlapAfterRebase(const Chain &A, const Chain &B,
-                                const APInt &Delta) const;
-  static void rebaseChain(Chain &C, const APInt &Delta);
-  void normalizeChainToType(Chain &C, Type *CastTy);
 };
 
 class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -431,20 +425,6 @@ PreservedAnalyses LoadStoreVectorizerPass::run(Function &F,
   return Changed ? PA : PreservedAnalyses::all();
 }
 
-static const Value *getUnderlyingObject(const Value *Ptr) {
-  const Value *ObjPtr = llvm::getUnderlyingObject(Ptr);
-  if (const auto *Sel = dyn_cast<SelectInst>(ObjPtr)) {
-    // The select's themselves are distinct instructions even if they share
-    // the same condition and evaluate to consecutive pointers for true and
-    // false values of the condition. Therefore using the select's themselves
-    // for grouping instructions would put consecutive accesses into different
-    // lists and they won't be even checked for being consecutive, and won't
-    // be vectorized.
-    return Sel->getCondition();
-  }
-  return ObjPtr;
-}
-
 bool Vectorizer::run() {
   bool Changed = false;
   // Break up the BB if there are any instrs which aren't guaranteed to transfer
@@ -488,88 +468,6 @@ bool Vectorizer::run() {
   return Changed;
 }
 
-Value *Vectorizer::insertCast(Value *Val, Type *DstTy) {
-  if (DL.getTypeSizeInBits(Val->getType()) == DL.getTypeSizeInBits(DstTy)) {
-    return Builder.CreateBitOrPointerCast(Val, DstTy, Val->getName() + ".bc");
-  }
-
-  // If the types are of different sizes and both are integers, we can use
-  // zext or sext to cast.
-  if (Val->getType()->isIntegerTy() && DstTy->isIntegerTy()) {
-    if (DL.getTypeSizeInBits(Val->getType()) < DL.getTypeSizeInBits(DstTy)) {
-      return Builder.CreateZExt(Val, DstTy, Val->getName() + ".bc");
-    }
-    return Builder.CreateTrunc(Val, DstTy, Val->getName() + ".bc");
-  }
-
-  return nullptr;
-}
-
-std::optional<APInt> Vectorizer::computeLeaderDelta(Instruction *I1,
-                                                    Instruction *I2) {
-  assert(((isa<LoadInst>(I1) && isa<LoadInst>(I2)) ||
-          (isa<StoreInst>(I1) && isa<StoreInst>(I2))) &&
-         "computeLeaderDelta must be called with two load or two store "
-         "instructions");
-  Instruction *CtxInst = I1->comesBefore(I2) ? I2 : I1;
-  const Value *Ptr1 = getLoadStorePointerOperand(I1);
-  const Value *Ptr2 = getLoadStorePointerOperand(I2);
-  return getConstantOffset(const_cast<Value *>(Ptr1), const_cast<Value *>(Ptr2),
-                           CtxInst);
-}
-
-bool Vectorizer::chainsOverlapAfterRebase(const Chain &A, const Chain &B,
-                                          const APInt &Delta) const {
-  ConstantRange ARange(
-      A.front().OffsetFromLeader,
-      A.back().OffsetFromLeader +
-          DL.getTypeStoreSize(getLoadStoreType(A.back().Inst)));
-  ConstantRange BRange(
-      B.front().OffsetFromLeader + Delta,
-      B.back().OffsetFromLeader + Delta +
-          DL.getTypeStoreSize(getLoadStoreType(B.back().Inst)));
-  return !ARange.intersectWith(BRange).isEmptySet();
-}
-
-void Vectorizer::rebaseChain(Chain &C, const APInt &Delta) {
-  for (ChainElem &E : C)
-    E.OffsetFromLeader += Delta;
-}
-
-void Vectorizer::normalizeChainToType(Chain &C, Type *CastTy) {
-  for (ChainElem &Elem : C) {
-    Instruction *Inst = Elem.Inst;
-    Type *OrigValTy = getLoadStoreType(Inst);
-    if (OrigValTy == CastTy)
-      continue;
-
-    if (auto *LI = dyn_cast<LoadInst>(Inst)) {
-      Builder.SetInsertPoint(LI);
-      LoadInst *NewLoad = Builder.CreateLoad(CastTy, LI->getPointerOperand(),
-                                             LI->getName() + ".mut");
-      copyMetadataForAccess(*NewLoad, *LI);
-      Value *CastBack = insertCast(NewLoad, OrigValTy);
-      if (!CastBack)
-        llvm_unreachable("Failed to insert cast");
-      LI->replaceAllUsesWith(CastBack);
-      ToErase.emplace_back(LI);
-      Elem.Inst = NewLoad;
-    } else if (auto *SI = dyn_cast<StoreInst>(Inst)) {
-      Builder.SetInsertPoint(SI);
-      Value *CastVal = insertCast(SI->getValueOperand(), CastTy);
-      if (!CastVal)
-        llvm_unreachable("Failed to insert cast");
-      StoreInst *NewStore =
-          Builder.CreateStore(CastVal, SI->getPointerOperand());
-      NewStore->setAlignment(SI->getAlign());
-      NewStore->setVolatile(SI->isVolatile());
-      copyMetadataForAccess(*NewStore, *SI);
-      ToErase.emplace_back(SI);
-      Elem.Inst = NewStore;
-    }
-  }
-}
-
 bool Vectorizer::runOnPseudoBB(BasicBlock::iterator Begin,
                                BasicBlock::iterator End) {
   LLVM_DEBUG({
@@ -582,120 +480,49 @@ bool Vectorizer::runOnPseudoBB(BasicBlock::iterator Begin,
   });
 
   bool Changed = false;
-  SmallVector<Chain> ContiguousSubChains;
-
   for (const auto &[EqClassKey, EqClass] :
-       collectEquivalenceClasses(Begin, End)) {
-
-    LLVM_DEBUG({
-      dbgs() << "LSV: Running on equivalence class of size " << EqClass.size()
-             << " keyed on " << EqClassKey << ":\n";
-      for (Instruction *I : EqClass)
-        dbgs() << "  " << *I << "\n";
-    });
-
-    for (Chain &C : gatherChains(EqClass)) {
-
-      // Split up the chain into increasingly smaller chains, until we can
-      // finally vectorize the chains.
-      //
-      // (Don't be scared by the depth of the loop nest here.  These operations
-      // are all at worst O(n lg n) in the number of instructions, and splitting
-      // chains doesn't change the number of instrs.  So the whole loop nest is
-      // O(n lg n).)
-      for (auto &C : splitChainByMayAliasInstrs(C)) {
-        for (auto &C : splitChainByContiguity(C)) {
-          ContiguousSubChains.emplace_back(C);
-        }
-      }
-    }
-  }
-
-  // Merge chains in reverse order, so that the first chain is the largest.
-  for (int I = ContiguousSubChains.size() - 1; I > 0; I--) {
-    Chain &C1 = ContiguousSubChains[I - 1];
-    Chain &C2 = ContiguousSubChains[I];
+       collectEquivalenceClasses(Begin, End))
+    Changed |= runOnEquivalenceClass(EqClassKey, EqClass);
 
-    // If the scalar types of the chains are the same, we can merge them
-    // without inserting any casts.
-    if (getLoadStoreType(C1[0].Inst)->getScalarType() ==
-        getLoadStoreType(C2[0].Inst)->getScalarType())
-      continue;
-
-    const Value *C1Ptr = getLoadStorePointerOperand(C1[0].Inst);
-    const Value *C2Ptr = getLoadStorePointerOperand(C2[0].Inst);
-    unsigned AS1 = C1Ptr->getType()->getPointerAddressSpace();
-    unsigned AS2 = C2Ptr->getType()->getPointerAddressSpace();
-    bool C1IsLoad = isa<LoadInst>(C1[0].Inst);
-    bool C2IsLoad = isa<LoadInst>(C2[0].Inst);
-
-    // If the chains are mapped to different types, have distinct underlying
-    // pointer objects, or include both loads and stores, skip.
-    if (C1IsLoad != C2IsLoad || AS1 != AS2 ||
-        ::getUnderlyingObject(C1Ptr) != ::getUnderlyingObject(C2Ptr))
-      continue;
-
-    // Compute constant offset between chain leaders; if unknown, skip.
-    std::optional<APInt> DeltaOpt = computeLeaderDelta(C1[0].Inst, C2[0].Inst);
-    if (!DeltaOpt)
-      continue;
-
-    // Check that rebasing C2 into C1's coordinate space will not overlap C1.
-    if (chainsOverlapAfterRebase(C1, C2, *DeltaOpt))
-      continue;
-
-    // Determine the common integer cast type for normalization and ensure total
-    // bitwidth matches across all elements of both chains.
-    Type *C1ElemTy = getLoadStoreType(C1[0].Inst);
-    unsigned TotalBits = DL.getTypeSizeInBits(C1ElemTy);
-    auto AllElemsMatchTotalBits = [&](const Chain &C) {
-      return llvm::all_of(C, [&](const ChainElem &E) {
-        return DL.getTypeSizeInBits(getLoadStoreType(E.Inst)) == TotalBits;
-      });
-    };
-    if (!AllElemsMatchTotalBits(C1) || !AllElemsMatchTotalBits(C2))
-      continue;
+  return Changed;
+}
 
-    // Power-of-two span ensures we can form a legal, single vector access
-    // without padding or splitting. Many targets and cost models assume POT
-    // widths, and it guarantees an integral element count for the chosen
-    // VecElemTy.
-    APInt Sz = C2.front().OffsetFromLeader +
-               DL.getTypeStoreSize(getLoadStoreType(C2.front().Inst)) -
-               C1.back().OffsetFromLeader + *DeltaOpt;
-    if (!Sz.isPowerOf2())
-      continue;
+bool Vectorizer::runOnEquivalenceClass(const EqClassKey &EqClassKey,
+                                       ArrayRef<Instruction *> EqClass) {
+  bool Changed = false;
 
-    // Rebase C2's offsets into C1's coordinate space prior to merging and
-    // merge C2 into C1 by appending all elements of C2 to C1, then erase C2
-    // from ContiguousSubChains.
-    rebaseChain(C2, *DeltaOpt);
-    C1.insert(C1.end(), C2.begin(), C2.end());
-    ContiguousSubChains.erase(ContiguousSubChains.begin() + I);
-
-    // Normalize the value operand/result type of each instruction in C1 to
-    // C1CastTy.
-    Type *C1CastTy =
-        Type::getIntNTy(C1ElemTy->getContext(), DL.getTypeSizeInBits(C1ElemTy));
-    normalizeChainToType(C1, C1CastTy);
-  }
+  LLVM_DEBUG({
+    dbgs() << "LSV: Running on equivalence class of size " << EqClass.size()
+           << " keyed on " << EqClassKey << ":\n";
+    for (Instruction *I : EqClass)
+      dbgs() << "  " << *I << "\n";
+  });
 
-  for (auto &C : ContiguousSubChains) {
-    if (C.size() <= 1)
-      continue;
-    for (auto &AlignedSubChain : splitChainByAlignment(C))
-      Changed |= vectorizeChain(AlignedSubChain);
-  }
+  std::vector<Chain> Chains = gatherChains(EqClass);
+  LLVM_DEBUG(dbgs() << "LSV: Got " << Chains.size()
+                    << " nontrivial chains.\n";);
+  for (Chain &C : Chains)
+    Changed |= runOnChain(C);
+  return Changed;
+}
 
-  // Erase all instructions scheduled for deletion in this pseudo-BB.
-  for (Instruction *I : ToErase) {
-    auto *PtrOperand = getLoadStorePointerOperand(I);
-    if (I->use_empty())
-      I->eraseFromParent();
-    RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
-  }
-  ToErase.clear();
+bool Vectorizer::runOnChain(Chain &C) {
+  LLVM_DEBUG({
+    dbgs() << "LSV: Running on chain with " << C.size() << " instructions:\n";
+    dumpChain(C);
+  });
 
+  // Split up the chain into increasingly smaller chains, until we can finally
+  // vectorize the chains.
+  //
+  // (Don't be scared by the depth of the loop nest here.  These operations are
+  // all at worst O(n lg n) in the number of instructions, and splitting chains
+  // doesn't change the number of instrs.  So the whole loop nest is O(n lg n).)
+  bool Changed = false;
+  for (auto &C : splitChainByMayAliasInstrs(C))
+    for (auto &C : splitChainByContiguity(C))
+      for (auto &C : splitChainByAlignment(C))
+        Changed |= vectorizeChain(C);
   return Changed;
 }
 
@@ -756,7 +583,7 @@ std::vector<Chain> Vectorizer::splitChainByMayAliasInstrs(Chain &C) {
         LLVM_DEBUG(
             dbgs() << "LSV: Found intervening may-alias instrs; cannot merge "
                    << *ChainIt->Inst << " into " << *ChainBegin->Inst << "\n");
-        if (!NewChain.empty()) {
+        if (NewChain.size() > 1) {
           LLVM_DEBUG({
             dbgs() << "LSV: got nontrivial chain without aliasing instrs:\n";
             dumpChain(NewChain);
@@ -768,7 +595,7 @@ std::vector<Chain> Vectorizer::splitChainByMayAliasInstrs(Chain &C) {
         NewChain = SmallVector<ChainElem, 1>({*ChainIt});
       }
     }
-    if (!NewChain.empty()) {
+    if (NewChain.size() > 1) {
       LLVM_DEBUG({
         dbgs() << "LSV: got ...
[truncated]

@llvmbot
Member

llvmbot commented Dec 2, 2025

@llvm/pr-subscribers-backend-amdgpu


@llvmbot
Member

llvmbot commented Dec 2, 2025

@llvm/pr-subscribers-vectorizers

   SmallVector<std::pair<unsigned, MDNode *>, 8> MD;
-  SourceI.getAllMetadata(MD);
-  MDBuilder MDB(DestI.getContext());
-  Type *NewType = DestI.getType();
-
-  // Only needed for range metadata on loads.
-  const DataLayout *DL = nullptr;
-  const LoadInst *LSource = dyn_cast<LoadInst>(&SourceI);
-  if (LSource)
-    DL = &LSource->getDataLayout();
-
+  Source.getAllMetadata(MD);
+  MDBuilder MDB(Dest.getContext());
+  Type *NewType = Dest.getType();
+  const DataLayout &DL = Source.getDataLayout();
   for (const auto &MDPair : MD) {
     unsigned ID = MDPair.first;
     MDNode *N = MDPair.second;
-
+    // Note, essentially every kind of metadata should be preserved here! This
+    // routine is supposed to clone a load instruction changing *only its type*.
+    // The only metadata it makes sense to drop is metadata which is invalidated
+    // when the pointer type changes. This should essentially never be the case
+    // in LLVM, but we explicitly switch over only known metadata to be
+    // conservatively correct. If you are adding metadata to LLVM which pertains
+    // to loads, you almost certainly want to add it here.
     switch (ID) {
-    // Applies to both loads and stores as-is.
     case LLVMContext::MD_dbg:
+    case LLVMContext::MD_tbaa:
     case LLVMContext::MD_prof:
+    case LLVMContext::MD_fpmath:
     case LLVMContext::MD_tbaa_struct:
+    case LLVMContext::MD_invariant_load:
     case LLVMContext::MD_alias_scope:
     case LLVMContext::MD_noalias:
     case LLVMContext::MD_nontemporal:
+    case LLVMContext::MD_mem_parallel_loop_access:
     case LLVMContext::MD_access_group:
     case LLVMContext::MD_noundef:
     case LLVMContext::MD_noalias_addrspace:
-    case LLVMContext::MD_mem_parallel_loop_access:
-      DestI.setMetadata(ID, N);
-      break;
-
-    // Load-only metadata.
-    case LLVMContext::MD_fpmath:
-    case LLVMContext::MD_invariant_load:
-      if (isa<LoadInst>(DestI))
-        DestI.setMetadata(ID, N);
+      // All of these directly apply.
+      Dest.setMetadata(ID, N);
       break;
 
     case LLVMContext::MD_nonnull:
-      if (auto *LDest = dyn_cast<LoadInst>(&DestI)) {
-        if (LSource)
-          copyNonnullMetadata(*LSource, N, *LDest);
-      }
+      copyNonnullMetadata(Source, N, Dest);
       break;
 
     case LLVMContext::MD_align:
     case LLVMContext::MD_dereferenceable:
     case LLVMContext::MD_dereferenceable_or_null:
-      // Applies to both loads and stores only if the new type is also a
-      // pointer.
+      // These only directly apply if the new type is also a pointer.
       if (NewType->isPointerTy())
-        DestI.setMetadata(ID, N);
+        Dest.setMetadata(ID, N);
       break;
 
     case LLVMContext::MD_range:
-      if (auto *LDest = dyn_cast<LoadInst>(&DestI)) {
-        if (LSource && DL)
-          copyRangeMetadata(*DL, *LSource, N, *LDest);
-      }
-      break;
-
-    case LLVMContext::MD_tbaa:
-      if (isa<LoadInst>(DestI))
-        DestI.setMetadata(ID, N);
+      copyRangeMetadata(DL, Source, N, Dest);
       break;
     }
   }
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 114df653bad83..c28314f6ab124 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -112,7 +112,6 @@
 #include <optional>
 #include <tuple>
 #include <type_traits>
-#include <unordered_map>
 #include <utility>
 #include <vector>
 
@@ -269,6 +268,11 @@ class Vectorizer {
   /// isGuaranteedToTransferExecutionToSuccessor(I) == true.
   bool runOnPseudoBB(BasicBlock::iterator Begin, BasicBlock::iterator End);
 
+  /// Runs the vectorizer on one equivalence class, i.e. one set of loads/stores
+  /// in the same BB with the same value for getUnderlyingObject() etc.
+  bool runOnEquivalenceClass(const EqClassKey &EqClassKey,
+                             ArrayRef<Instruction *> EqClass);
+
   /// Runs the vectorizer on one chain, i.e. a subset of an equivalence class
   /// where all instructions access a known, constant offset from the first
   /// instruction.
@@ -334,22 +338,12 @@ class Vectorizer {
   EquivalenceClassMap collectEquivalenceClasses(BasicBlock::iterator Begin,
                                                 BasicBlock::iterator End);
 
-  /// Inserts a cast instruction to convert Inst to DstTy.
-  Value *insertCast(Value *Val, Type *DstTy);
-
   /// Partitions Instrs into "chains" where every instruction has a known
   /// constant offset from the first instr in the chain.
   ///
   /// Postcondition: For all i, ret[i][0].second == 0, because the first instr
   /// in the chain is the leader, and an instr touches distance 0 from itself.
   std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
-
-  // Helpers for chain merging.
-  std::optional<APInt> computeLeaderDelta(Instruction *I1, Instruction *I2);
-  bool chainsOverlapAfterRebase(const Chain &A, const Chain &B,
-                                const APInt &Delta) const;
-  static void rebaseChain(Chain &C, const APInt &Delta);
-  void normalizeChainToType(Chain &C, Type *CastTy);
 };
 
 class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -431,20 +425,6 @@ PreservedAnalyses LoadStoreVectorizerPass::run(Function &F,
   return Changed ? PA : PreservedAnalyses::all();
 }
 
-static const Value *getUnderlyingObject(const Value *Ptr) {
-  const Value *ObjPtr = llvm::getUnderlyingObject(Ptr);
-  if (const auto *Sel = dyn_cast<SelectInst>(ObjPtr)) {
-    // The select's themselves are distinct instructions even if they share
-    // the same condition and evaluate to consecutive pointers for true and
-    // false values of the condition. Therefore using the select's themselves
-    // for grouping instructions would put consecutive accesses into different
-    // lists and they won't be even checked for being consecutive, and won't
-    // be vectorized.
-    return Sel->getCondition();
-  }
-  return ObjPtr;
-}
-
 bool Vectorizer::run() {
   bool Changed = false;
   // Break up the BB if there are any instrs which aren't guaranteed to transfer
@@ -488,88 +468,6 @@ bool Vectorizer::run() {
   return Changed;
 }
 
-Value *Vectorizer::insertCast(Value *Val, Type *DstTy) {
-  if (DL.getTypeSizeInBits(Val->getType()) == DL.getTypeSizeInBits(DstTy)) {
-    return Builder.CreateBitOrPointerCast(Val, DstTy, Val->getName() + ".bc");
-  }
-
-  // If the types are of different sizes and both are integers, we can use
-  // zext or sext to cast.
-  if (Val->getType()->isIntegerTy() && DstTy->isIntegerTy()) {
-    if (DL.getTypeSizeInBits(Val->getType()) < DL.getTypeSizeInBits(DstTy)) {
-      return Builder.CreateZExt(Val, DstTy, Val->getName() + ".bc");
-    }
-    return Builder.CreateTrunc(Val, DstTy, Val->getName() + ".bc");
-  }
-
-  return nullptr;
-}
-
-std::optional<APInt> Vectorizer::computeLeaderDelta(Instruction *I1,
-                                                    Instruction *I2) {
-  assert(((isa<LoadInst>(I1) && isa<LoadInst>(I2)) ||
-          (isa<StoreInst>(I1) && isa<StoreInst>(I2))) &&
-         "computeLeaderDelta must be called with two load or two store "
-         "instructions");
-  Instruction *CtxInst = I1->comesBefore(I2) ? I2 : I1;
-  const Value *Ptr1 = getLoadStorePointerOperand(I1);
-  const Value *Ptr2 = getLoadStorePointerOperand(I2);
-  return getConstantOffset(const_cast<Value *>(Ptr1), const_cast<Value *>(Ptr2),
-                           CtxInst);
-}
-
-bool Vectorizer::chainsOverlapAfterRebase(const Chain &A, const Chain &B,
-                                          const APInt &Delta) const {
-  ConstantRange ARange(
-      A.front().OffsetFromLeader,
-      A.back().OffsetFromLeader +
-          DL.getTypeStoreSize(getLoadStoreType(A.back().Inst)));
-  ConstantRange BRange(
-      B.front().OffsetFromLeader + Delta,
-      B.back().OffsetFromLeader + Delta +
-          DL.getTypeStoreSize(getLoadStoreType(B.back().Inst)));
-  return !ARange.intersectWith(BRange).isEmptySet();
-}
-
-void Vectorizer::rebaseChain(Chain &C, const APInt &Delta) {
-  for (ChainElem &E : C)
-    E.OffsetFromLeader += Delta;
-}
-
-void Vectorizer::normalizeChainToType(Chain &C, Type *CastTy) {
-  for (ChainElem &Elem : C) {
-    Instruction *Inst = Elem.Inst;
-    Type *OrigValTy = getLoadStoreType(Inst);
-    if (OrigValTy == CastTy)
-      continue;
-
-    if (auto *LI = dyn_cast<LoadInst>(Inst)) {
-      Builder.SetInsertPoint(LI);
-      LoadInst *NewLoad = Builder.CreateLoad(CastTy, LI->getPointerOperand(),
-                                             LI->getName() + ".mut");
-      copyMetadataForAccess(*NewLoad, *LI);
-      Value *CastBack = insertCast(NewLoad, OrigValTy);
-      if (!CastBack)
-        llvm_unreachable("Failed to insert cast");
-      LI->replaceAllUsesWith(CastBack);
-      ToErase.emplace_back(LI);
-      Elem.Inst = NewLoad;
-    } else if (auto *SI = dyn_cast<StoreInst>(Inst)) {
-      Builder.SetInsertPoint(SI);
-      Value *CastVal = insertCast(SI->getValueOperand(), CastTy);
-      if (!CastVal)
-        llvm_unreachable("Failed to insert cast");
-      StoreInst *NewStore =
-          Builder.CreateStore(CastVal, SI->getPointerOperand());
-      NewStore->setAlignment(SI->getAlign());
-      NewStore->setVolatile(SI->isVolatile());
-      copyMetadataForAccess(*NewStore, *SI);
-      ToErase.emplace_back(SI);
-      Elem.Inst = NewStore;
-    }
-  }
-}
-
 bool Vectorizer::runOnPseudoBB(BasicBlock::iterator Begin,
                                BasicBlock::iterator End) {
   LLVM_DEBUG({
@@ -582,120 +480,49 @@ bool Vectorizer::runOnPseudoBB(BasicBlock::iterator Begin,
   });
 
   bool Changed = false;
-  SmallVector<Chain> ContiguousSubChains;
-
   for (const auto &[EqClassKey, EqClass] :
-       collectEquivalenceClasses(Begin, End)) {
-
-    LLVM_DEBUG({
-      dbgs() << "LSV: Running on equivalence class of size " << EqClass.size()
-             << " keyed on " << EqClassKey << ":\n";
-      for (Instruction *I : EqClass)
-        dbgs() << "  " << *I << "\n";
-    });
-
-    for (Chain &C : gatherChains(EqClass)) {
-
-      // Split up the chain into increasingly smaller chains, until we can
-      // finally vectorize the chains.
-      //
-      // (Don't be scared by the depth of the loop nest here.  These operations
-      // are all at worst O(n lg n) in the number of instructions, and splitting
-      // chains doesn't change the number of instrs.  So the whole loop nest is
-      // O(n lg n).)
-      for (auto &C : splitChainByMayAliasInstrs(C)) {
-        for (auto &C : splitChainByContiguity(C)) {
-          ContiguousSubChains.emplace_back(C);
-        }
-      }
-    }
-  }
-
-  // Merge chains in reverse order, so that the first chain is the largest.
-  for (int I = ContiguousSubChains.size() - 1; I > 0; I--) {
-    Chain &C1 = ContiguousSubChains[I - 1];
-    Chain &C2 = ContiguousSubChains[I];
+       collectEquivalenceClasses(Begin, End))
+    Changed |= runOnEquivalenceClass(EqClassKey, EqClass);
 
-    // If the scalar types of the chains are the same, we can merge them
-    // without inserting any casts.
-    if (getLoadStoreType(C1[0].Inst)->getScalarType() ==
-        getLoadStoreType(C2[0].Inst)->getScalarType())
-      continue;
-
-    const Value *C1Ptr = getLoadStorePointerOperand(C1[0].Inst);
-    const Value *C2Ptr = getLoadStorePointerOperand(C2[0].Inst);
-    unsigned AS1 = C1Ptr->getType()->getPointerAddressSpace();
-    unsigned AS2 = C2Ptr->getType()->getPointerAddressSpace();
-    bool C1IsLoad = isa<LoadInst>(C1[0].Inst);
-    bool C2IsLoad = isa<LoadInst>(C2[0].Inst);
-
-    // If the chains are mapped to different types, have distinct underlying
-    // pointer objects, or include both loads and stores, skip.
-    if (C1IsLoad != C2IsLoad || AS1 != AS2 ||
-        ::getUnderlyingObject(C1Ptr) != ::getUnderlyingObject(C2Ptr))
-      continue;
-
-    // Compute constant offset between chain leaders; if unknown, skip.
-    std::optional<APInt> DeltaOpt = computeLeaderDelta(C1[0].Inst, C2[0].Inst);
-    if (!DeltaOpt)
-      continue;
-
-    // Check that rebasing C2 into C1's coordinate space will not overlap C1.
-    if (chainsOverlapAfterRebase(C1, C2, *DeltaOpt))
-      continue;
-
-    // Determine the common integer cast type for normalization and ensure total
-    // bitwidth matches across all elements of both chains.
-    Type *C1ElemTy = getLoadStoreType(C1[0].Inst);
-    unsigned TotalBits = DL.getTypeSizeInBits(C1ElemTy);
-    auto AllElemsMatchTotalBits = [&](const Chain &C) {
-      return llvm::all_of(C, [&](const ChainElem &E) {
-        return DL.getTypeSizeInBits(getLoadStoreType(E.Inst)) == TotalBits;
-      });
-    };
-    if (!AllElemsMatchTotalBits(C1) || !AllElemsMatchTotalBits(C2))
-      continue;
+  return Changed;
+}
 
-    // Power-of-two span ensures we can form a legal, single vector access
-    // without padding or splitting. Many targets and cost models assume POT
-    // widths, and it guarantees an integral element count for the chosen
-    // VecElemTy.
-    APInt Sz = C2.front().OffsetFromLeader +
-               DL.getTypeStoreSize(getLoadStoreType(C2.front().Inst)) -
-               C1.back().OffsetFromLeader + *DeltaOpt;
-    if (!Sz.isPowerOf2())
-      continue;
+bool Vectorizer::runOnEquivalenceClass(const EqClassKey &EqClassKey,
+                                       ArrayRef<Instruction *> EqClass) {
+  bool Changed = false;
 
-    // Rebase C2's offsets into C1's coordinate space prior to merging and
-    // merge C2 into C1 by appending all elements of C2 to C1, then erase C2
-    // from ContiguousSubChains.
-    rebaseChain(C2, *DeltaOpt);
-    C1.insert(C1.end(), C2.begin(), C2.end());
-    ContiguousSubChains.erase(ContiguousSubChains.begin() + I);
-
-    // Normalize the value operand/result type of each instruction in C1 to
-    // C1CastTy.
-    Type *C1CastTy =
-        Type::getIntNTy(C1ElemTy->getContext(), DL.getTypeSizeInBits(C1ElemTy));
-    normalizeChainToType(C1, C1CastTy);
-  }
+  LLVM_DEBUG({
+    dbgs() << "LSV: Running on equivalence class of size " << EqClass.size()
+           << " keyed on " << EqClassKey << ":\n";
+    for (Instruction *I : EqClass)
+      dbgs() << "  " << *I << "\n";
+  });
 
-  for (auto &C : ContiguousSubChains) {
-    if (C.size() <= 1)
-      continue;
-    for (auto &AlignedSubChain : splitChainByAlignment(C))
-      Changed |= vectorizeChain(AlignedSubChain);
-  }
+  std::vector<Chain> Chains = gatherChains(EqClass);
+  LLVM_DEBUG(dbgs() << "LSV: Got " << Chains.size()
+                    << " nontrivial chains.\n";);
+  for (Chain &C : Chains)
+    Changed |= runOnChain(C);
+  return Changed;
+}
 
-  // Erase all instructions scheduled for deletion in this pseudo-BB.
-  for (Instruction *I : ToErase) {
-    auto *PtrOperand = getLoadStorePointerOperand(I);
-    if (I->use_empty())
-      I->eraseFromParent();
-    RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
-  }
-  ToErase.clear();
+bool Vectorizer::runOnChain(Chain &C) {
+  LLVM_DEBUG({
+    dbgs() << "LSV: Running on chain with " << C.size() << " instructions:\n";
+    dumpChain(C);
+  });
 
+  // Split up the chain into increasingly smaller chains, until we can finally
+  // vectorize the chains.
+  //
+  // (Don't be scared by the depth of the loop nest here.  These operations are
+  // all at worst O(n lg n) in the number of instructions, and splitting chains
+  // doesn't change the number of instrs.  So the whole loop nest is O(n lg n).)
+  bool Changed = false;
+  for (auto &C : splitChainByMayAliasInstrs(C))
+    for (auto &C : splitChainByContiguity(C))
+      for (auto &C : splitChainByAlignment(C))
+        Changed |= vectorizeChain(C);
   return Changed;
 }
 
@@ -756,7 +583,7 @@ std::vector<Chain> Vectorizer::splitChainByMayAliasInstrs(Chain &C) {
         LLVM_DEBUG(
             dbgs() << "LSV: Found intervening may-alias instrs; cannot merge "
                    << *ChainIt->Inst << " into " << *ChainBegin->Inst << "\n");
-        if (!NewChain.empty()) {
+        if (NewChain.size() > 1) {
           LLVM_DEBUG({
             dbgs() << "LSV: got nontrivial chain without aliasing instrs:\n";
             dumpChain(NewChain);
@@ -768,7 +595,7 @@ std::vector<Chain> Vectorizer::splitChainByMayAliasInstrs(Chain &C) {
         NewChain = SmallVector<ChainElem, 1>({*ChainIt});
       }
     }
-    if (!NewChain.empty()) {
+    if (NewChain.size() > 1) {
       LLVM_DEBUG({
         dbgs() << "LSV: got ...
[truncated]

@dakersnar dakersnar requested a review from nikic December 2, 2025 22:25
@gandhi56
Contributor

gandhi56 commented Dec 2, 2025

Hi @dakersnar, thanks for your comments. I am certainly open to improving the solution to this problem. I will approve this revert. Let's discuss in #97715.

@gandhi56 gandhi56 merged commit 9c78bc5 into main Dec 2, 2025
16 checks passed
@gandhi56 gandhi56 deleted the revert-154069-issue-97715/lsv-merge-chains branch December 2, 2025 23:28
kcloudy0717 pushed a commit to kcloudy0717/llvm-project that referenced this pull request Dec 4, 2025
Reverts llvm#154069. I pointed out a number of issues
post-merge, most importantly examples of miscompiles:
llvm#154069 (comment).

While the motivation of the change is clear, I think the implementation
approach is flawed. It seems like the goal is to allow elements like
`load <2xi16>` and `load i32` to be vectorized together despite the
current algorithm not grouping them into the same equivalence classes. I
personally think that if we want to attempt this it should be a more
wholistic approach, maybe even redefining the concept of an equivalence
class. This current solution seems like it would be really hard to do
bug-free, and even if the bugs were not present, it is only able to
merge chains that happen to be adjacent to each other after
`splitChainByContiguity`, which seems like it is leaving things up to
chance whether this optimization kicks in. But we can discuss more in
the re-land. Maybe the broader approach I'm proposing is too difficult,
and a narrow optimization is worthwhile. Regardless, this should be
reverted, it needs more iteration before it is correct.
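For readers unfamiliar with the pattern under discussion, here is a minimal, hypothetical IR sketch of the situation the reverted change targeted: an i32 load next to a <2 x i16> load on the same underlying object, which the current algorithm places in different equivalence classes and therefore never considers for vectorization together. The function name, address space, and offsets below are illustrative only; they are not taken from the patch or its tests.

; Hypothetical example (not from the patch or its tests): two adjacent loads
; with different element types on the same base pointer. Per the description
; above, these land in separate equivalence classes today; the reverted change
; attempted to merge their chains into one wider access after
; splitChainByContiguity.
define i32 @mixed_width_loads(ptr addrspace(1) %p) {
  %a = load i32, ptr addrspace(1) %p, align 8
  %q = getelementptr inbounds i8, ptr addrspace(1) %p, i64 4
  %b = load <2 x i16>, ptr addrspace(1) %q, align 4
  %b0 = extractelement <2 x i16> %b, i64 0
  %b0.ext = zext i16 %b0 to i32
  %sum = add i32 %a, %b0.ext
  ret i32 %sum
}

Under the reverted patch, chains like these could, when the right conditions lined up, be rebased to a common leader offset, normalized to a common integer type, and emitted as a single 8-byte access; with the revert, each equivalence class is again vectorized independently.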