
[SLP]Improve minbitwidth analysis. #78976

Merged

Conversation

Member

@alexey-bataev alexey-bataev commented Jan 22, 2024

This improves the overall minbitwidth analysis in SLP. It allows analyzing
trees with store/insertelement root nodes. Also, instead of using a single
minbitwidth detected at the very first analysis stage, it tries to detect the
best one for each trunc/ext subtree in the graph and uses it for that subtree.
This results in better code and less vector register pressure.

Metric: size..text

Program                                                                              results      results0     diff
test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test     92549.00     92609.00     0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test                 663381.00    663493.00    0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test                  663381.00    663493.00    0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test                              307182.00    307214.00    0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test            1394420.00   1394484.00   0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test             1394420.00   1394484.00   0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test               2040257.00   2040273.00   0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test             12396098.00  12395858.00  -0.0%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test                        909944.00    909768.00    -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16 x i16>
instead of <16 x i32>, and the zext/trunc instructions are removed. In other
places the last vector zext/sext is removed and replaced by an
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce.or of <4 x i32> is replaced by
reduce.or of <4 x i8>.
Spec2017/imagick - removed an extra zext from 2 packs of operations.
Spec2017/parest - removed an extra zext, replaced by extractelement + scalar
zext.
Spec2017/blender - a whole bunch of vector zext/sext instructions are replaced
by extractelement + scalar zext/sext, and some extra code is vectorized in
smaller types.
Spec2006/gobmk - fixed cost estimation; some small code remains scalar.
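
To make the per-value width question concrete, here is a minimal, self-contained
sketch (illustrative only; this free function and its name are not part of the
patch) of the sign-bit/known-bits reasoning that collectValuesToDemote in the
diff below relies on: a value can be demoted to BitWidth bits if sign-extending
(or zero-extending, when it is known non-negative) from that width reproduces
the original value.

#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/Analysis/SimplifyQuery.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Value.h"
#include <algorithm>
using namespace llvm;

// Illustrative helper, not the patch's code: the smallest integer width that
// still represents V losslessly, derived from the number of known sign bits.
// If V may be negative, one extra bit is kept so that sign-extending back to
// the original width reproduces the value exactly.
static unsigned minRequiredBitWidth(Value *V, const DataLayout &DL,
                                    AssumptionCache *AC, DominatorTree *DT) {
  unsigned OrigBitWidth = DL.getTypeSizeInBits(V->getType());
  unsigned NumSignBits = ComputeNumSignBits(V, DL, /*Depth=*/0, AC,
                                            /*CxtI=*/nullptr, DT);
  unsigned BitWidth = OrigBitWidth - NumSignBits;
  if (!isKnownNonNegative(V, SimplifyQuery(DL)))
    ++BitWidth; // Keep the sign bit representable.
  return std::max(BitWidth, 1u);
}

The patch asks this kind of question per trunc/ext subtree and keeps the
maximum over that subtree's values, instead of a single graph-wide answer.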

Created using spr 1.3.5
@llvmbot
Collaborator

llvmbot commented Jan 22, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Alexey Bataev (alexey-bataev)

Changes



Patch is 63.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/78976.diff

13 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+291-153)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/ext-trunc.ll (+5-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/getelementptr.ll (+87-29)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/reduce-add-i64.ll (+5-15)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/reductions.ll (+4-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/PR35777.ll (+12-11)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/minbitwidth-multiuse-with-insertelement.ll (+8-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/minimum-sizes.ll (+8-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/phi-undef-input.ll (+12-12)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/resched.ll (+16-16)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reused-reductions-with-minbitwidth.ll (+4-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/store-insertelement-minbitwidth.ll (+12-10)
  • (modified) llvm/test/Transforms/SLPVectorizer/alt-cmp-vectorize.ll (+2-2)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 482970bbf306120..5a493aed0fed7b8 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -2273,9 +2273,11 @@ class BoUpSLP {
   /// constant and to be demoted. Required to correctly identify constant nodes
   /// to be demoted.
   bool collectValuesToDemote(
-      Value *V, SmallVectorImpl<Value *> &ToDemote,
+      Value *V, bool IsProfitableToDemoteRoot, unsigned &BitWidth,
+      SmallVectorImpl<Value *> &ToDemote,
       DenseMap<Instruction *, SmallVector<unsigned>> &DemotedConsts,
-      SmallVectorImpl<Value *> &Roots, DenseSet<Value *> &Visited) const;
+      DenseSet<Value *> &Visited, unsigned &MaxDepthLevel,
+      bool &IsProfitableToDemote) const;
 
   /// Check if the operands on the edges \p Edges of the \p UserTE allows
   /// reordering (i.e. the operands can be reordered because they have only one
@@ -7862,7 +7864,7 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
               unsigned BWSz = DL->getTypeSizeInBits(ScalarTy);
               unsigned SrcBWSz = DL->getTypeSizeInBits(UserScalarTy);
               unsigned VecOpcode;
-              auto *SrcVecTy =
+              auto *UserVecTy =
                   FixedVectorType::get(UserScalarTy, E->getVectorFactor());
               if (BWSz > SrcBWSz)
                 VecOpcode = Instruction::Trunc;
@@ -7870,11 +7872,10 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
                 VecOpcode =
                     It->second.second ? Instruction::SExt : Instruction::ZExt;
               TTI::CastContextHint CCH = GetCastContextHint(VL0);
-              VecCost += TTI->getCastInstrCost(VecOpcode, VecTy, SrcVecTy, CCH,
+              VecCost += TTI->getCastInstrCost(VecOpcode, UserVecTy, VecTy, CCH,
                                                CostKind);
-              ScalarCost +=
-                  Sz * TTI->getCastInstrCost(VecOpcode, ScalarTy, UserScalarTy,
-                                             CCH, CostKind);
+              ScalarCost += Sz * TTI->getCastInstrCost(VecOpcode, UserScalarTy,
+                                                       ScalarTy, CCH, CostKind);
             }
           }
         }
@@ -8955,7 +8956,7 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   SmallVector<std::pair<Value *, const TreeEntry *>> FirstUsers;
   SmallVector<APInt> DemandedElts;
   SmallDenseSet<Value *, 4> UsedInserts;
-  DenseSet<Value *> VectorCasts;
+  DenseSet<std::pair<const TreeEntry *, Type *>> VectorCasts;
   for (ExternalUser &EU : ExternalUses) {
     // We only add extract cost once for the same scalar.
     if (!isa_and_nonnull<InsertElementInst>(EU.User) &&
@@ -9025,7 +9026,10 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
             DemandedElts.push_back(APInt::getZero(FTy->getNumElements()));
             VecId = FirstUsers.size() - 1;
             auto It = MinBWs.find(ScalarTE);
-            if (It != MinBWs.end() && VectorCasts.insert(EU.Scalar).second) {
+            if (It != MinBWs.end() &&
+                VectorCasts
+                    .insert(std::make_pair(ScalarTE, FTy->getElementType()))
+                    .second) {
               unsigned BWSz = It->second.second;
               unsigned SrcBWSz = DL->getTypeSizeInBits(FTy->getElementType());
               unsigned VecOpcode;
@@ -9082,17 +9086,20 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   }
   // Add reduced value cost, if resized.
   if (!VectorizedVals.empty()) {
-    auto BWIt = MinBWs.find(VectorizableTree.front().get());
+    const TreeEntry &Root = *VectorizableTree.front().get();
+    auto BWIt = MinBWs.find(&Root);
     if (BWIt != MinBWs.end()) {
-      Type *DstTy = VectorizableTree.front()->Scalars.front()->getType();
+      Type *DstTy = Root.Scalars.front()->getType();
       unsigned OriginalSz = DL->getTypeSizeInBits(DstTy);
-      unsigned Opcode = Instruction::Trunc;
-      if (OriginalSz < BWIt->second.first)
-        Opcode = BWIt->second.second ? Instruction::SExt : Instruction::ZExt;
-      Type *SrcTy = IntegerType::get(DstTy->getContext(), BWIt->second.first);
-      Cost += TTI->getCastInstrCost(Opcode, DstTy, SrcTy,
-                                    TTI::CastContextHint::None,
-                                    TTI::TCK_RecipThroughput);
+      if (OriginalSz != BWIt->second.first) {
+        unsigned Opcode = Instruction::Trunc;
+        if (OriginalSz < BWIt->second.first)
+          Opcode = BWIt->second.second ? Instruction::SExt : Instruction::ZExt;
+        Type *SrcTy = IntegerType::get(DstTy->getContext(), BWIt->second.first);
+        Cost += TTI->getCastInstrCost(Opcode, DstTy, SrcTy,
+                                      TTI::CastContextHint::None,
+                                      TTI::TCK_RecipThroughput);
+      }
     }
   }
 
@@ -11383,9 +11390,10 @@ Value *BoUpSLP::vectorizeTree(TreeEntry *E, bool PostponedPHIs) {
           VecOpcode = Instruction::BitCast;
         } else if (BWSz < SrcBWSz) {
           VecOpcode = Instruction::Trunc;
-        } else if (It != MinBWs.end()) {
+        } else if (SrcIt != MinBWs.end()) {
           assert(BWSz > SrcBWSz && "Invalid cast!");
-          VecOpcode = It->second.second ? Instruction::SExt : Instruction::ZExt;
+          VecOpcode =
+              SrcIt->second.second ? Instruction::SExt : Instruction::ZExt;
         }
       }
       Value *V = (VecOpcode != ShuffleOrOp && VecOpcode == Instruction::BitCast)
@@ -11893,7 +11901,7 @@ Value *BoUpSLP::vectorizeTree(
   // basic block. Only one extractelement per block should be emitted.
   DenseMap<Value *, DenseMap<BasicBlock *, Instruction *>> ScalarToEEs;
   SmallDenseSet<Value *, 4> UsedInserts;
-  DenseMap<Value *, Value *> VectorCasts;
+  DenseMap<std::pair<Value *, Type *>, Value *> VectorCasts;
   SmallDenseSet<Value *, 4> ScalarsWithNullptrUser;
   // Extract all of the elements with the external uses.
   for (const auto &ExternalUse : ExternalUses) {
@@ -12014,7 +12022,9 @@ Value *BoUpSLP::vectorizeTree(
           // Need to use original vector, if the root is truncated.
           auto BWIt = MinBWs.find(E);
           if (BWIt != MinBWs.end() && Vec->getType() != VU->getType()) {
-            auto VecIt = VectorCasts.find(Scalar);
+            auto *ScalarTy = FTy->getElementType();
+            auto Key = std::make_pair(Vec, ScalarTy);
+            auto VecIt = VectorCasts.find(Key);
             if (VecIt == VectorCasts.end()) {
               IRBuilder<>::InsertPointGuard Guard(Builder);
               if (auto *IVec = dyn_cast<Instruction>(Vec))
@@ -12022,10 +12032,10 @@ Value *BoUpSLP::vectorizeTree(
               Vec = Builder.CreateIntCast(
                   Vec,
                   FixedVectorType::get(
-                      cast<VectorType>(VU->getType())->getElementType(),
+                      ScalarTy,
                       cast<FixedVectorType>(Vec->getType())->getNumElements()),
                   BWIt->second.second);
-              VectorCasts.try_emplace(Scalar, Vec);
+              VectorCasts.try_emplace(Key, Vec);
             } else {
               Vec = VecIt->second;
             }
@@ -13095,16 +13105,21 @@ unsigned BoUpSLP::getVectorElementSize(Value *V) {
 // smaller type with a truncation. We collect the values that will be demoted
 // in ToDemote and additional roots that require investigating in Roots.
 bool BoUpSLP::collectValuesToDemote(
-    Value *V, SmallVectorImpl<Value *> &ToDemote,
+    Value *V, bool IsProfitableToDemoteRoot, unsigned &BitWidth,
+    SmallVectorImpl<Value *> &ToDemote,
     DenseMap<Instruction *, SmallVector<unsigned>> &DemotedConsts,
-    SmallVectorImpl<Value *> &Roots, DenseSet<Value *> &Visited) const {
+    DenseSet<Value *> &Visited, unsigned &MaxDepthLevel,
+    bool &IsProfitableToDemote) const {
   // We can always demote constants.
-  if (isa<Constant>(V))
+  if (isa<Constant>(V)) {
+    MaxDepthLevel = 1;
     return true;
+  }
 
   // If the value is not a vectorized instruction in the expression and not used
   // by the insertelement instruction and not used in multiple vector nodes, it
   // cannot be demoted.
+  // TODO: improve handling of gathered values and others.
   auto *I = dyn_cast<Instruction>(V);
   if (!I || !getTreeEntry(I) || MultiNodeScalars.contains(I) ||
       !Visited.insert(I).second || all_of(I->users(), [&](User *U) {
@@ -13112,6 +13127,21 @@ bool BoUpSLP::collectValuesToDemote(
       }))
     return false;
 
+  auto IsPotentiallyTruncated = [&](Value *V, unsigned &BitWidth) -> bool {
+    if (MultiNodeScalars.contains(V))
+      return false;
+    uint32_t OrigBitWidth = DL->getTypeSizeInBits(V->getType());
+    APInt Mask = APInt::getBitsSetFrom(OrigBitWidth, BitWidth);
+    if (MaskedValueIsZero(V, Mask, SimplifyQuery(*DL)))
+      return true;
+    auto NumSignBits = ComputeNumSignBits(V, *DL, 0, AC, nullptr, DT);
+    unsigned BitWidth1 = OrigBitWidth - NumSignBits;
+    KnownBits Known = computeKnownBits(V, *DL);
+    if (!Known.isNonNegative())
+      ++BitWidth1;
+    BitWidth = std::max(BitWidth, BitWidth1);
+    return BitWidth > 0 && OrigBitWidth / BitWidth > 1;
+  };
   unsigned Start = 0;
   unsigned End = I->getNumOperands();
   switch (I->getOpcode()) {
@@ -13119,12 +13149,16 @@ bool BoUpSLP::collectValuesToDemote(
   // We can always demote truncations and extensions. Since truncations can
   // seed additional demotion, we save the truncated value.
   case Instruction::Trunc:
-    Roots.push_back(I->getOperand(0));
+    MaxDepthLevel = 1;
+    if (IsProfitableToDemoteRoot)
+      IsProfitableToDemote = true;
     break;
   case Instruction::ZExt:
   case Instruction::SExt:
-    if (isa<ExtractElementInst, InsertElementInst>(I->getOperand(0)))
-      return false;
+    MaxDepthLevel = 1;
+    if (isa<InsertElementInst>(I->getOperand(0)))
+      return true;
+    IsProfitableToDemote = true;
     break;
 
   // We can demote certain binary operations if we can demote both of their
@@ -13134,23 +13168,32 @@ bool BoUpSLP::collectValuesToDemote(
   case Instruction::Mul:
   case Instruction::And:
   case Instruction::Or:
-  case Instruction::Xor:
-    if (!collectValuesToDemote(I->getOperand(0), ToDemote, DemotedConsts, Roots,
-                               Visited) ||
-        !collectValuesToDemote(I->getOperand(1), ToDemote, DemotedConsts, Roots,
-                               Visited))
+  case Instruction::Xor: {
+    unsigned Level1, Level2;
+    if (!collectValuesToDemote(I->getOperand(0), IsProfitableToDemoteRoot,
+                               BitWidth, ToDemote, DemotedConsts, Visited,
+                               Level1, IsProfitableToDemote) ||
+        !collectValuesToDemote(I->getOperand(1), IsProfitableToDemoteRoot,
+                               BitWidth, ToDemote, DemotedConsts, Visited,
+                               Level2, IsProfitableToDemote))
       return false;
+    MaxDepthLevel = std::max(Level1, Level2);
     break;
+  }
 
   // We can demote selects if we can demote their true and false values.
   case Instruction::Select: {
     Start = 1;
+    unsigned Level1, Level2;
     SelectInst *SI = cast<SelectInst>(I);
-    if (!collectValuesToDemote(SI->getTrueValue(), ToDemote, DemotedConsts,
-                               Roots, Visited) ||
-        !collectValuesToDemote(SI->getFalseValue(), ToDemote, DemotedConsts,
-                               Roots, Visited))
+    if (!collectValuesToDemote(SI->getTrueValue(), IsProfitableToDemoteRoot,
+                               BitWidth, ToDemote, DemotedConsts, Visited,
+                               Level1, IsProfitableToDemote) ||
+        !collectValuesToDemote(SI->getFalseValue(), IsProfitableToDemoteRoot,
+                               BitWidth, ToDemote, DemotedConsts, Visited,
+                               Level2, IsProfitableToDemote))
       return false;
+    MaxDepthLevel = std::max(Level1, Level2);
     break;
   }
 
@@ -13159,157 +13202,252 @@ bool BoUpSLP::collectValuesToDemote(
   case Instruction::PHI: {
     PHINode *PN = cast<PHINode>(I);
     for (Value *IncValue : PN->incoming_values())
-      if (!collectValuesToDemote(IncValue, ToDemote, DemotedConsts, Roots,
-                                 Visited))
+      if (!collectValuesToDemote(IncValue, IsProfitableToDemoteRoot, BitWidth,
+                                 ToDemote, DemotedConsts, Visited,
+                                 MaxDepthLevel, IsProfitableToDemote))
         return false;
     break;
   }
 
   // Otherwise, conservatively give up.
   default:
-    return false;
+    if (!IsPotentiallyTruncated(I, BitWidth))
+      return false;
+    MaxDepthLevel = 0;
+    Start = End = 0;
+    break;
   }
 
+  ++MaxDepthLevel;
   // Gather demoted constant operands.
   for (unsigned Idx : seq<unsigned>(Start, End))
     if (isa<Constant>(I->getOperand(Idx)))
       DemotedConsts.try_emplace(I).first->getSecond().push_back(Idx);
   // Record the value that we can demote.
   ToDemote.push_back(V);
-  return true;
+  return IsProfitableToDemote;
 }
 
 void BoUpSLP::computeMinimumValueSizes() {
   // We only attempt to truncate integer expressions.
-  auto &TreeRoot = VectorizableTree[0]->Scalars;
-  auto *TreeRootIT = dyn_cast<IntegerType>(TreeRoot[0]->getType());
-  if (!TreeRootIT)
-    return;
+  bool IsStoreOrInsertElt =
+      VectorizableTree.front()->getOpcode() == Instruction::Store ||
+      VectorizableTree.front()->getOpcode() == Instruction::InsertElement;
+  unsigned NodeIdx = 0;
+  if (IsStoreOrInsertElt &&
+      VectorizableTree.front()->State != TreeEntry::NeedToGather)
+    NodeIdx = 1;
 
   // Ensure the roots of the vectorizable tree don't form a cycle.
-  if (!VectorizableTree.front()->UserTreeIndices.empty())
+  if ((NodeIdx == 0 && !VectorizableTree[NodeIdx]->UserTreeIndices.empty()) ||
+      (NodeIdx != 0 && any_of(VectorizableTree[NodeIdx]->UserTreeIndices,
+                              [&](const EdgeInfo &EI) {
+                                return EI.UserTE->Idx >
+                                       static_cast<int>(NodeIdx);
+                              })))
     return;
 
-  // Conservatively determine if we can actually truncate the roots of the
-  // expression. Collect the values that can be demoted in ToDemote and
-  // additional roots that require investigating in Roots.
-  SmallVector<Value *, 32> ToDemote;
-  DenseMap<Instruction *, SmallVector<unsigned>> DemotedConsts;
-  SmallVector<Value *, 4> Roots;
-  for (auto *Root : TreeRoot) {
-    DenseSet<Value *> Visited;
-    if (!collectValuesToDemote(Root, ToDemote, DemotedConsts, Roots, Visited))
-      return;
+  // The first value node for store/insertelement is sext/zext/trunc? Skip it,
+  // resize to the final type.
+  bool IsProfitableToDemoteRoot = !IsStoreOrInsertElt;
+  if (NodeIdx != 0 &&
+      VectorizableTree[NodeIdx]->State == TreeEntry::Vectorize &&
+      (VectorizableTree[NodeIdx]->getOpcode() == Instruction::ZExt ||
+       VectorizableTree[NodeIdx]->getOpcode() == Instruction::SExt ||
+       VectorizableTree[NodeIdx]->getOpcode() == Instruction::Trunc)) {
+    assert(IsStoreOrInsertElt && "Expected store/insertelement seeded graph.");
+    ++NodeIdx;
+    IsProfitableToDemoteRoot = true;
   }
 
-  // The maximum bit width required to represent all the values that can be
-  // demoted without loss of precision. It would be safe to truncate the roots
-  // of the expression to this width.
-  auto MaxBitWidth = 1u;
-
-  // We first check if all the bits of the roots are demanded. If they're not,
-  // we can truncate the roots to this narrower type.
-  for (auto *Root : TreeRoot) {
-    auto Mask = DB->getDemandedBits(cast<Instruction>(Root));
-    MaxBitWidth = std::max<unsigned>(Mask.getBitWidth() - Mask.countl_zero(),
-                                     MaxBitWidth);
-  }
-
-  // True if the roots can be zero-extended back to their original type, rather
-  // than sign-extended. We know that if the leading bits are not demanded, we
-  // can safely zero-extend. So we initialize IsKnownPositive to True.
-  bool IsKnownPositive = true;
-
-  // If all the bits of the roots are demanded, we can try a little harder to
-  // compute a narrower type. This can happen, for example, if the roots are
-  // getelementptr indices. InstCombine promotes these indices to the pointer
-  // width. Thus, all their bits are technically demanded even though the
-  // address computation might be vectorized in a smaller type.
-  //
-  // We start by looking at each entry that can be demoted. We compute the
-  // maximum bit width required to store the scalar by using ValueTracking to
-  // compute the number of high-order bits we can truncate.
-  if (MaxBitWidth == DL->getTypeSizeInBits(TreeRoot[0]->getType()) &&
-      all_of(TreeRoot, [](Value *V) {
-        return all_of(V->users(),
-                      [](User *U) { return isa<GetElementPtrInst>(U); });
-      })) {
-    MaxBitWidth = 8u;
-
+  SmallVector<Value *> ToDemote;
+  DenseMap<Instruction *, SmallVector<unsigned>> DemotedConsts;
+  auto ComputeMaxBitWidth = [&](ArrayRef<Value *> TreeRoot, unsigned VF,
+                                bool IsTopRoot, bool IsProfitableToDemoteRoot,
+                                unsigned Opcode, unsigned Limit) {
+    ToDemote.clear();
+    auto *TreeRootIT = dyn_cast<IntegerType>(TreeRoot[0]->getType());
+    if (!TreeRootIT || !Opcode)
+      return 0u;
+
+    unsigned NumParts = TTI->getNumberOfParts(
+        FixedVectorType::get(TreeRoot.front()->getType(), VF));
+
+    // The maximum bit width required to represent all the values that can be
+    // demoted without loss of precision. It would be safe to truncate the roots
+    // of the expression to this width.
+    auto MaxBitWidth = 1u;
+
+    // True if the roots can be zero-extended back to their original type,
+    // rather than sign-extended. We know that if the leading bits are not
+    // demanded, we can safely zero-extend. So we initialize IsKnownPositive to
+    // True.
     // Determine if the sign bit of all the roots is known to be zero. If not,
     // IsKnownPositive is set to False.
-    IsKnownPositive = llvm::all_of(TreeRoot, [&](Value *R) {
+    bool IsKnownPositive = all_of(TreeRoot, [&](Value *R) {
       KnownBits Known = computeKnownBits(R, *DL);
       return Known.isNonNegative();
     });
 
-    // Determine the maximum number of bits required to store the scalar
-    // values.
-    for (auto *Scalar : ToDemote) {
-      auto NumSignBits = ComputeNumSignBits(Scalar, *DL, 0, AC, nullptr, DT);
-      auto NumTypeBits = DL->getTypeSizeInBits(Scalar->getType());
-      MaxBitWidth = std::max<unsigned>(NumTypeBits - NumSignBits, MaxBitWidth);
-    }
-
-    // If we can't prove that the sign bit is zero, we must add one to the
-    // maximum bit width to account for the unknown sign bit. This preserves
-    // the existing sign bit so we can safely sign-extend the root back to the
-    // original type. Otherwise, if we know the sign bit is zero, we will
-    // zero-extend the root instead.
-    //
-    // FIXME: This is somewhat suboptimal, as there will be cases where adding
-    //        one to the maximum bit width will yield a larger-than-necessary
-    //        type. In general, we need to add an extra bit only if we can't
-    //        prove that the upper bit of the original type is equal to the
-    //        upper bit of the proposed smaller type. If these two bits are the
-    //        same (either zero or one) we know that sign-extending from the
-    //        smaller type will result in the same value. Here, since we can't
-    //        yet prove this, we are just making the proposed smaller type
-    //        larger to ensure correctness.
-    if (!IsKnownPositive)
-      ++MaxBitWidth;
-  }
-
-  // Round MaxBitWidth up to the next power-of-two.
-  MaxBitWidth = llvm::bit_ceil(MaxBitWidth);
-
-  // If the maximum bit width we compute is less than the with of the roots'
-  // type, we can proceed with the narrowing. Otherwise, do nothing.
-  if (MaxBitWidth >= TreeRootIT->getBitWidth())
-    return;
+    // We first check if all the bits of the roots are demanded. If they'r...
[truncated]

Created using spr 1.3.5
Member Author

@alexey-bataev alexey-bataev left a comment

Ping!

Created using spr 1.3.5
return true;
auto NumSignBits = ComputeNumSignBits(V, *DL, 0, AC, nullptr, DT);
unsigned BitWidth1 = OrigBitWidth - NumSignBits;
KnownBits Known = computeKnownBits(V, *DL);
Collaborator

(Optimization) - You're calling computeKnownBits here after MaskedValueIsZero has already done that above

Member Author

Not sure how to reuse the result from MaskedValueIsZero for checking that V is potentially negative here. Anyway, replaced it with the llvm::isKnownNonNegative function.

Collaborator

You should be able to replace MaskedValueIsZero with something like:

KnownBits Known = computeKnownBits(V, *DL);
if (Known.countMinLeadingZeros() >= (OrigBitWidth - NumSignBits))
  return true;

Member Author

You should be able to replace MaskedValueIsZero with something like:

KnownBits Known = computeKnownBits(V, *DL);
if (Known.countMinLeadingZeros() >= (OrigBitWidth - NumSignBits))
  return true;

Not sure this is better. I prefer to use standard functions rather than reinvent them here. This kind of "manual inlining" is very hard to maintain.
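
For reference, a self-contained sketch of the two variants discussed in this
thread (the helper name is illustrative and not from the patch). Both ask the
same question, whether every bit at or above BitWidth is known to be zero, but
the second derives it from a single computeKnownBits result that the
surrounding code needs anyway:

#include "llvm/ADT/APInt.h"
#include "llvm/Analysis/SimplifyQuery.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Value.h"
#include "llvm/Support/KnownBits.h"
using namespace llvm;

// Illustrative only; not the patch's code.
static bool topBitsAreKnownZero(Value *V, const DataLayout &DL,
                                unsigned BitWidth) {
  unsigned OrigBitWidth = DL.getTypeSizeInBits(V->getType());

  // Variant A (as in the patch's IsPotentiallyTruncated lambda): build a mask
  // of bits [BitWidth, OrigBitWidth) and ask whether they are all known zero.
  APInt Mask = APInt::getBitsSetFrom(OrigBitWidth, BitWidth);
  if (MaskedValueIsZero(V, Mask, SimplifyQuery(DL)))
    return true;

  // Variant B (the KnownBits-based reformulation suggested above, adapted to
  // the same check): one computeKnownBits call answers the same question via
  // the minimum number of leading zero bits.
  KnownBits Known = computeKnownBits(V, DL);
  return Known.countMinLeadingZeros() >= OrigBitWidth - BitWidth;
}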


github-actions bot commented Feb 22, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Created using spr 1.3.5
@alexey-bataev
Member Author

Ping!

Created using spr 1.3.5
@alexey-bataev
Member Author

Ping!

Collaborator

@RKSimon RKSimon left a comment

LGTM - as far as I can tell - the AArch64/getelementptr.ll code explosion seems unfortunate though - have you confirmed it results in better codegen?

@alexey-bataev
Member Author

LGTM - as far as I can tell - the AArch64/getelementptr.ll code explosion seems unfortunate though - have you confirmed it results in better codegen?

AArch64/getelementptr.ll is not a regression. It uses -slp-threshold=-7, which with this patch enables vectorization of a previously non-profitable graph. We would need to set -slp-threshold=-6 (or even -5) to get the old output back, but then the first test (unaffected by this patch) would fail. I'll split this test into 2 separate tests with different thresholds.

Created using spr 1.3.5
@alexey-bataev alexey-bataev merged commit a730ed7 into main Mar 5, 2024
3 of 4 checks passed
@alexey-bataev alexey-bataev deleted the users/alexey-bataev/spr/slpimprove-minbitwidth-analysis branch March 5, 2024 17:20
@nikic
Contributor

nikic commented Mar 5, 2024

Seems to break the build, maybe on GCC only?


/var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3617:24: error: declaration of ‘llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI’ changes meaning of ‘TTI’ [-fpermissive]
 3617 |   TargetTransformInfo *TTI;
      |                        ^~~
In file included from /var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:48:
/var/llvm-compile-time-tracker/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:206:29: note: ‘TTI’ declared here as ‘typedef class llvm::TargetTransformInfo llvm::TTI’
  206 | typedef TargetTransformInfo TTI;
      |                             ^~~

@alexey-bataev
Member Author

Seems to break the build, maybe on GCC only?


/var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3617:24: error: declaration of ‘llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI’ changes meaning of ‘TTI’ [-fpermissive]
 3617 |   TargetTransformInfo *TTI;
      |                        ^~~
In file included from /var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:48:
/var/llvm-compile-time-tracker/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:206:29: note: ‘TTI’ declared here as ‘typedef class llvm::TargetTransformInfo llvm::TTI’
  206 | typedef TargetTransformInfo TTI;
      |                             ^~~

Hmm, I did not touch this code. Probably some kind of side effect. Could you provide the full error report?

@dtcxzyw
Member

dtcxzyw commented Mar 5, 2024

Seems to break the build, maybe on GCC only?


/var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3617:24: error: declaration of ‘llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI’ changes meaning of ‘TTI’ [-fpermissive]
 3617 |   TargetTransformInfo *TTI;
      |                        ^~~
In file included from /var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:48:
/var/llvm-compile-time-tracker/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:206:29: note: ‘TTI’ declared here as ‘typedef class llvm::TargetTransformInfo llvm::TTI’
  206 | typedef TargetTransformInfo TTI;
      |                             ^~~

Hmm, I did not touch this code. Probably some kind of side effect. Could you provide the full error report?

https://github.com/dtcxzyw/llvm-ci/actions/runs/8160865465/job/22308394053

[624/1925] Building CXX object lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o
FAILED: lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o 
/usr/bin/c++ -DGTEST_HAS_RTTI=0 -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-build/lib/Transforms/Vectorize -I/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/lib/Transforms/Vectorize -I/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-build/include -I/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -fno-lifetime-dse -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-uninitialized -Wno-nonnull -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wno-misleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG -std=c++17  -fno-exceptions -funwind-tables -fno-rtti -MD -MT lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o -MF lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o.d -o lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o -c /home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3617:24: error: declaration of ‘llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI’ changes meaning of ‘TTI’ [-fpermissive]
 3617 |   TargetTransformInfo *TTI;
      |                        ^~~
In file included from /home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:48:
/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:206:29: note: ‘TTI’ declared here as ‘typedef class llvm::TargetTransformInfo llvm::TTI’
  206 | typedef TargetTransformInfo TTI;
      |                             ^~~
[625/1925] Building CXX object lib/Transforms/ObjCARC/CMakeFiles/LLVMObjCARCOpts.dir/ObjCARCContract.cpp.o

Related commits:

from 1e828f838cc0f15074f3dbbb04929c06ef0c9729 to 041638c4294a9a8375851e0add1ab2c99412c032

041638c4294a9a8375851e0add1ab2c99412c032 [libc][stdbit] implement stdc_bit_width (C23) (#83892)
233f750c3dad14e330f5358e8dbcc6c30e805edb [flang] Catch more bad pointer initialization targets (#83731)
9f67f19614e952ede385a59bb62f7b57771ca4c3 [gn] Remove ScudoBenchmarks
2807ea6b8047780b5e66a122faf09fae786c917b [HLSL] implement the any intrinsic (#83903)
b2ca23aed802abc43ed216ce9bf4c80c056a04c0 [HLSL] implement exp intrinsic (#83832)
06fea93341ae7d0d0faa82c4c8704591963c2d8c [TextAPI] add missing platforms for translating triples to tapi targets
643b31dbe8a515e007a0f1b2e1072c34e461b778 [HLSL] implement `mad` intrinsic (#83826)
a730ed7c1a4a35f5219df720ffb0ba6122d64fe4 [SLP]Improve minbitwidth analysis.
1b1aea79194117d8f1729ef9c8f80454aea381fe AMDGPU: Make s_wait_samplecnt(_bvhcnt) dependent on hasImageInsts, NFC (#83932)
f836048a2b452f5f2a8440c9f5945ee1a7bcdac2 [gn] port 6fd27d5b0321f (no more module.modulemap.in)
f33f66be7dc586a597437d7ce7619d87e8637209 [NFC][RemoveDIs] Always use iterators for inserting PHIs
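
For reference, the GCC diagnostic quoted above comes from the C++ rule that a
name used inside a class must keep the same meaning when re-evaluated in the
completed class scope; GCC enforces it (downgradable with -fpermissive) while
Clang does not, which is why the break only showed up with GCC. A standalone
reduction, with names mirroring the error message but otherwise illustrative,
looks like this:

class TargetTransformInfo {
public:
  enum CastContextHint { None };
};
typedef TargetTransformInfo TTI; // mirrors the typedef in TargetTransformInfo.h

class BoUpSLP {
  // At this point 'TTI' refers to the namespace-level typedef...
  TTI::CastContextHint Hint;
  // ...so the member declaration below changes what 'TTI' means within the
  // class, and GCC reports "declaration of 'TargetTransformInfo* BoUpSLP::TTI'
  // changes meaning of 'TTI'".
  TargetTransformInfo *TTI;
};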

@alexey-bataev
Member Author

Hopefully fixed in 083d8aa.

@nikic
Contributor

nikic commented Mar 5, 2024

It looks like this change causes large compile-time regressions in some cases: http://llvm-compile-time-tracker.com/compare.php?from=1b1aea79194117d8f1729ef9c8f80454aea381fe&to=083d8aa03aca55b88098a91e41e41a8e321a5721&stat=instructions%3Au

For example, mode_decision.c from lencod regresses by 10% with -O3 (without codegen change).

Other examples are partQalignmm.c from mafft by 3.5%, shared_sha256.c from ClamAV by 5%.

@alexey-bataev
Member Author

I'll revert the patch and will try to improve the compile time

@jyknight
Member

jyknight commented Mar 6, 2024

I see it was already reverted.

But prior to that, this change has also triggered a miscompile in xxhash, with llvm/unittests/Support/xxhashTest.cpp failing to compute correct hashes for 6 of its test-cases. (This error showed up in our downstream build, which is building llvm for x86-64, but uses various cpu tuning and other build flags. I don't know if any upstream buildbots observed the bug.)

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:48: Failure
Expected equality of these values:
  uint64_t(0x9013fb74ca603e0c)
    Which is: 10381918045049208332
  xxh3_64bits(ArrayRef(a, size_t(33)))
    Which is: 15799937386795546273

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:49: Failure
Expected equality of these values:
  uint64_t(0xfa5271fcce0db1c3)
    Which is: 18037604788174959043
  xxh3_64bits(ArrayRef(a, size_t(64)))
    Which is: 4052662859111881885

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:50: Failure
Expected equality of these values:
  uint64_t(0x79c42431727f1012)
    Which is: 8774177768817496082
  xxh3_64bits(ArrayRef(a, size_t(65)))
    Which is: 15228745306148528873

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:51: Failure
Expected equality of these values:
  uint64_t(0x591ee0ddf9c9ccd1)
    Which is: 6421817362660052177
  xxh3_64bits(ArrayRef(a, size_t(96)))
    Which is: 9533022609644079181

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:52: Failure
Expected equality of these values:
  uint64_t(0x8ffc6a3111fe19da)
    Which is: 10375284400542587354
  xxh3_64bits(ArrayRef(a, size_t(97)))
    Which is: 16638996938277853399

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:53: Failure
Expected equality of these values:
  uint64_t(0x06a146ee9a2da378)
    Which is: 477741026080826232
  xxh3_64bits(ArrayRef(a, size_t(128)))
    Which is: 5146516666195395268

@alexey-bataev alexey-bataev restored the users/alexey-bataev/spr/slpimprove-minbitwidth-analysis branch March 7, 2024 15:35
@alexey-bataev alexey-bataev deleted the users/alexey-bataev/spr/slpimprove-minbitwidth-analysis branch March 7, 2024 15:36