
[SLP]Improve minbitwidth analysis. #78976

Merged

Conversation

Member

@alexey-bataev alexey-bataev commented Jan 22, 2024

This improves the overall minbitwidth analysis in SLP. It allows analyzing
trees with store/insertelement root nodes. Also, instead of using a single
minbitwidth detected at the very first analysis stage, it tries to detect the
best one for each trunc/ext subtree in the graph and uses it for that subtree.
This results in better code and less vector register pressure.

Metric: size..text

Program                                                                              results      results0     diff
test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test     92549.00     92609.00     0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test                 663381.00    663493.00    0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test                  663381.00    663493.00    0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test                              307182.00    307214.00    0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test            1394420.00   1394484.00   0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test             1394420.00   1394484.00   0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test               2040257.00   2040273.00   0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test             12396098.00  12395858.00  -0.0%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test                        909944.00    909768.00    -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16 x i16>
instead of <16 x i32>, and the zext/trunc instructions are removed. In other
places the last vector zext/sext is removed and replaced by an
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce.or of <4 x i32> is replaced by
reduce.or of <4 x i8>.
Spec2017/imagick - removed an extra zext from 2 packs of operations.
Spec2017/parest - removed an extra zext, replaced by extractelement + scalar
zext.
Spec2017/blender - a whole bunch of vector zext/sext instructions are replaced
by extractelement + scalar zext/sext, and some extra code is vectorized in
smaller types.
Spec2006/gobmk - fixed cost estimation; some small code remains scalar.
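
To make the per-value width question concrete, here is a minimal, self-contained
sketch (illustrative only; this free function and its name are not part of the
patch) of the sign-bit/known-bits reasoning that collectValuesToDemote in the
diff below relies on: a value can be demoted to BitWidth bits if sign-extending
(or zero-extending, when it is known non-negative) from that width reproduces
the original value.

#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/Analysis/SimplifyQuery.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Value.h"
#include <algorithm>
using namespace llvm;

// Illustrative helper, not the patch's code: the smallest integer width that
// still represents V losslessly, derived from the number of known sign bits.
// If V may be negative, one extra bit is kept so that sign-extending back to
// the original width reproduces the value exactly.
static unsigned minRequiredBitWidth(Value *V, const DataLayout &DL,
                                    AssumptionCache *AC, DominatorTree *DT) {
  unsigned OrigBitWidth = DL.getTypeSizeInBits(V->getType());
  unsigned NumSignBits = ComputeNumSignBits(V, DL, /*Depth=*/0, AC,
                                            /*CxtI=*/nullptr, DT);
  unsigned BitWidth = OrigBitWidth - NumSignBits;
  if (!isKnownNonNegative(V, SimplifyQuery(DL)))
    ++BitWidth; // Keep the sign bit representable.
  return std::max(BitWidth, 1u);
}

The patch asks this kind of question per trunc/ext subtree and keeps the
maximum over that subtree's values, instead of a single graph-wide answer.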

Created using spr 1.3.5
@llvmbot
Collaborator

llvmbot commented Jan 22, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Alexey Bataev (alexey-bataev)

Changes



Patch is 63.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/78976.diff

13 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+291-153)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/ext-trunc.ll (+5-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/getelementptr.ll (+87-29)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/reduce-add-i64.ll (+5-15)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/reductions.ll (+4-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/PR35777.ll (+12-11)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/minbitwidth-multiuse-with-insertelement.ll (+8-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/minimum-sizes.ll (+8-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/phi-undef-input.ll (+12-12)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/resched.ll (+16-16)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reused-reductions-with-minbitwidth.ll (+4-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/store-insertelement-minbitwidth.ll (+12-10)
  • (modified) llvm/test/Transforms/SLPVectorizer/alt-cmp-vectorize.ll (+2-2)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 482970bbf306120..5a493aed0fed7b8 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -2273,9 +2273,11 @@ class BoUpSLP {
   /// constant and to be demoted. Required to correctly identify constant nodes
   /// to be demoted.
   bool collectValuesToDemote(
-      Value *V, SmallVectorImpl<Value *> &ToDemote,
+      Value *V, bool IsProfitableToDemoteRoot, unsigned &BitWidth,
+      SmallVectorImpl<Value *> &ToDemote,
       DenseMap<Instruction *, SmallVector<unsigned>> &DemotedConsts,
-      SmallVectorImpl<Value *> &Roots, DenseSet<Value *> &Visited) const;
+      DenseSet<Value *> &Visited, unsigned &MaxDepthLevel,
+      bool &IsProfitableToDemote) const;
 
   /// Check if the operands on the edges \p Edges of the \p UserTE allows
   /// reordering (i.e. the operands can be reordered because they have only one
@@ -7862,7 +7864,7 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
               unsigned BWSz = DL->getTypeSizeInBits(ScalarTy);
               unsigned SrcBWSz = DL->getTypeSizeInBits(UserScalarTy);
               unsigned VecOpcode;
-              auto *SrcVecTy =
+              auto *UserVecTy =
                   FixedVectorType::get(UserScalarTy, E->getVectorFactor());
               if (BWSz > SrcBWSz)
                 VecOpcode = Instruction::Trunc;
@@ -7870,11 +7872,10 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
                 VecOpcode =
                     It->second.second ? Instruction::SExt : Instruction::ZExt;
               TTI::CastContextHint CCH = GetCastContextHint(VL0);
-              VecCost += TTI->getCastInstrCost(VecOpcode, VecTy, SrcVecTy, CCH,
+              VecCost += TTI->getCastInstrCost(VecOpcode, UserVecTy, VecTy, CCH,
                                                CostKind);
-              ScalarCost +=
-                  Sz * TTI->getCastInstrCost(VecOpcode, ScalarTy, UserScalarTy,
-                                             CCH, CostKind);
+              ScalarCost += Sz * TTI->getCastInstrCost(VecOpcode, UserScalarTy,
+                                                       ScalarTy, CCH, CostKind);
             }
           }
         }
@@ -8955,7 +8956,7 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   SmallVector<std::pair<Value *, const TreeEntry *>> FirstUsers;
   SmallVector<APInt> DemandedElts;
   SmallDenseSet<Value *, 4> UsedInserts;
-  DenseSet<Value *> VectorCasts;
+  DenseSet<std::pair<const TreeEntry *, Type *>> VectorCasts;
   for (ExternalUser &EU : ExternalUses) {
     // We only add extract cost once for the same scalar.
     if (!isa_and_nonnull<InsertElementInst>(EU.User) &&
@@ -9025,7 +9026,10 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
             DemandedElts.push_back(APInt::getZero(FTy->getNumElements()));
             VecId = FirstUsers.size() - 1;
             auto It = MinBWs.find(ScalarTE);
-            if (It != MinBWs.end() && VectorCasts.insert(EU.Scalar).second) {
+            if (It != MinBWs.end() &&
+                VectorCasts
+                    .insert(std::make_pair(ScalarTE, FTy->getElementType()))
+                    .second) {
               unsigned BWSz = It->second.second;
               unsigned SrcBWSz = DL->getTypeSizeInBits(FTy->getElementType());
               unsigned VecOpcode;
@@ -9082,17 +9086,20 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   }
   // Add reduced value cost, if resized.
   if (!VectorizedVals.empty()) {
-    auto BWIt = MinBWs.find(VectorizableTree.front().get());
+    const TreeEntry &Root = *VectorizableTree.front().get();
+    auto BWIt = MinBWs.find(&Root);
     if (BWIt != MinBWs.end()) {
-      Type *DstTy = VectorizableTree.front()->Scalars.front()->getType();
+      Type *DstTy = Root.Scalars.front()->getType();
       unsigned OriginalSz = DL->getTypeSizeInBits(DstTy);
-      unsigned Opcode = Instruction::Trunc;
-      if (OriginalSz < BWIt->second.first)
-        Opcode = BWIt->second.second ? Instruction::SExt : Instruction::ZExt;
-      Type *SrcTy = IntegerType::get(DstTy->getContext(), BWIt->second.first);
-      Cost += TTI->getCastInstrCost(Opcode, DstTy, SrcTy,
-                                    TTI::CastContextHint::None,
-                                    TTI::TCK_RecipThroughput);
+      if (OriginalSz != BWIt->second.first) {
+        unsigned Opcode = Instruction::Trunc;
+        if (OriginalSz < BWIt->second.first)
+          Opcode = BWIt->second.second ? Instruction::SExt : Instruction::ZExt;
+        Type *SrcTy = IntegerType::get(DstTy->getContext(), BWIt->second.first);
+        Cost += TTI->getCastInstrCost(Opcode, DstTy, SrcTy,
+                                      TTI::CastContextHint::None,
+                                      TTI::TCK_RecipThroughput);
+      }
     }
   }
 
@@ -11383,9 +11390,10 @@ Value *BoUpSLP::vectorizeTree(TreeEntry *E, bool PostponedPHIs) {
           VecOpcode = Instruction::BitCast;
         } else if (BWSz < SrcBWSz) {
           VecOpcode = Instruction::Trunc;
-        } else if (It != MinBWs.end()) {
+        } else if (SrcIt != MinBWs.end()) {
           assert(BWSz > SrcBWSz && "Invalid cast!");
-          VecOpcode = It->second.second ? Instruction::SExt : Instruction::ZExt;
+          VecOpcode =
+              SrcIt->second.second ? Instruction::SExt : Instruction::ZExt;
         }
       }
       Value *V = (VecOpcode != ShuffleOrOp && VecOpcode == Instruction::BitCast)
@@ -11893,7 +11901,7 @@ Value *BoUpSLP::vectorizeTree(
   // basic block. Only one extractelement per block should be emitted.
   DenseMap<Value *, DenseMap<BasicBlock *, Instruction *>> ScalarToEEs;
   SmallDenseSet<Value *, 4> UsedInserts;
-  DenseMap<Value *, Value *> VectorCasts;
+  DenseMap<std::pair<Value *, Type *>, Value *> VectorCasts;
   SmallDenseSet<Value *, 4> ScalarsWithNullptrUser;
   // Extract all of the elements with the external uses.
   for (const auto &ExternalUse : ExternalUses) {
@@ -12014,7 +12022,9 @@ Value *BoUpSLP::vectorizeTree(
           // Need to use original vector, if the root is truncated.
           auto BWIt = MinBWs.find(E);
           if (BWIt != MinBWs.end() && Vec->getType() != VU->getType()) {
-            auto VecIt = VectorCasts.find(Scalar);
+            auto *ScalarTy = FTy->getElementType();
+            auto Key = std::make_pair(Vec, ScalarTy);
+            auto VecIt = VectorCasts.find(Key);
             if (VecIt == VectorCasts.end()) {
               IRBuilder<>::InsertPointGuard Guard(Builder);
               if (auto *IVec = dyn_cast<Instruction>(Vec))
@@ -12022,10 +12032,10 @@ Value *BoUpSLP::vectorizeTree(
               Vec = Builder.CreateIntCast(
                   Vec,
                   FixedVectorType::get(
-                      cast<VectorType>(VU->getType())->getElementType(),
+                      ScalarTy,
                       cast<FixedVectorType>(Vec->getType())->getNumElements()),
                   BWIt->second.second);
-              VectorCasts.try_emplace(Scalar, Vec);
+              VectorCasts.try_emplace(Key, Vec);
             } else {
               Vec = VecIt->second;
             }
@@ -13095,16 +13105,21 @@ unsigned BoUpSLP::getVectorElementSize(Value *V) {
 // smaller type with a truncation. We collect the values that will be demoted
 // in ToDemote and additional roots that require investigating in Roots.
 bool BoUpSLP::collectValuesToDemote(
-    Value *V, SmallVectorImpl<Value *> &ToDemote,
+    Value *V, bool IsProfitableToDemoteRoot, unsigned &BitWidth,
+    SmallVectorImpl<Value *> &ToDemote,
     DenseMap<Instruction *, SmallVector<unsigned>> &DemotedConsts,
-    SmallVectorImpl<Value *> &Roots, DenseSet<Value *> &Visited) const {
+    DenseSet<Value *> &Visited, unsigned &MaxDepthLevel,
+    bool &IsProfitableToDemote) const {
   // We can always demote constants.
-  if (isa<Constant>(V))
+  if (isa<Constant>(V)) {
+    MaxDepthLevel = 1;
     return true;
+  }
 
   // If the value is not a vectorized instruction in the expression and not used
   // by the insertelement instruction and not used in multiple vector nodes, it
   // cannot be demoted.
+  // TODO: improve handling of gathered values and others.
   auto *I = dyn_cast<Instruction>(V);
   if (!I || !getTreeEntry(I) || MultiNodeScalars.contains(I) ||
       !Visited.insert(I).second || all_of(I->users(), [&](User *U) {
@@ -13112,6 +13127,21 @@ bool BoUpSLP::collectValuesToDemote(
       }))
     return false;
 
+  auto IsPotentiallyTruncated = [&](Value *V, unsigned &BitWidth) -> bool {
+    if (MultiNodeScalars.contains(V))
+      return false;
+    uint32_t OrigBitWidth = DL->getTypeSizeInBits(V->getType());
+    APInt Mask = APInt::getBitsSetFrom(OrigBitWidth, BitWidth);
+    if (MaskedValueIsZero(V, Mask, SimplifyQuery(*DL)))
+      return true;
+    auto NumSignBits = ComputeNumSignBits(V, *DL, 0, AC, nullptr, DT);
+    unsigned BitWidth1 = OrigBitWidth - NumSignBits;
+    KnownBits Known = computeKnownBits(V, *DL);
+    if (!Known.isNonNegative())
+      ++BitWidth1;
+    BitWidth = std::max(BitWidth, BitWidth1);
+    return BitWidth > 0 && OrigBitWidth / BitWidth > 1;
+  };
   unsigned Start = 0;
   unsigned End = I->getNumOperands();
   switch (I->getOpcode()) {
@@ -13119,12 +13149,16 @@ bool BoUpSLP::collectValuesToDemote(
   // We can always demote truncations and extensions. Since truncations can
   // seed additional demotion, we save the truncated value.
   case Instruction::Trunc:
-    Roots.push_back(I->getOperand(0));
+    MaxDepthLevel = 1;
+    if (IsProfitableToDemoteRoot)
+      IsProfitableToDemote = true;
     break;
   case Instruction::ZExt:
   case Instruction::SExt:
-    if (isa<ExtractElementInst, InsertElementInst>(I->getOperand(0)))
-      return false;
+    MaxDepthLevel = 1;
+    if (isa<InsertElementInst>(I->getOperand(0)))
+      return true;
+    IsProfitableToDemote = true;
     break;
 
   // We can demote certain binary operations if we can demote both of their
@@ -13134,23 +13168,32 @@ bool BoUpSLP::collectValuesToDemote(
   case Instruction::Mul:
   case Instruction::And:
   case Instruction::Or:
-  case Instruction::Xor:
-    if (!collectValuesToDemote(I->getOperand(0), ToDemote, DemotedConsts, Roots,
-                               Visited) ||
-        !collectValuesToDemote(I->getOperand(1), ToDemote, DemotedConsts, Roots,
-                               Visited))
+  case Instruction::Xor: {
+    unsigned Level1, Level2;
+    if (!collectValuesToDemote(I->getOperand(0), IsProfitableToDemoteRoot,
+                               BitWidth, ToDemote, DemotedConsts, Visited,
+                               Level1, IsProfitableToDemote) ||
+        !collectValuesToDemote(I->getOperand(1), IsProfitableToDemoteRoot,
+                               BitWidth, ToDemote, DemotedConsts, Visited,
+                               Level2, IsProfitableToDemote))
       return false;
+    MaxDepthLevel = std::max(Level1, Level2);
     break;
+  }
 
   // We can demote selects if we can demote their true and false values.
   case Instruction::Select: {
     Start = 1;
+    unsigned Level1, Level2;
     SelectInst *SI = cast<SelectInst>(I);
-    if (!collectValuesToDemote(SI->getTrueValue(), ToDemote, DemotedConsts,
-                               Roots, Visited) ||
-        !collectValuesToDemote(SI->getFalseValue(), ToDemote, DemotedConsts,
-                               Roots, Visited))
+    if (!collectValuesToDemote(SI->getTrueValue(), IsProfitableToDemoteRoot,
+                               BitWidth, ToDemote, DemotedConsts, Visited,
+                               Level1, IsProfitableToDemote) ||
+        !collectValuesToDemote(SI->getFalseValue(), IsProfitableToDemoteRoot,
+                               BitWidth, ToDemote, DemotedConsts, Visited,
+                               Level2, IsProfitableToDemote))
       return false;
+    MaxDepthLevel = std::max(Level1, Level2);
     break;
   }
 
@@ -13159,157 +13202,252 @@ bool BoUpSLP::collectValuesToDemote(
   case Instruction::PHI: {
     PHINode *PN = cast<PHINode>(I);
     for (Value *IncValue : PN->incoming_values())
-      if (!collectValuesToDemote(IncValue, ToDemote, DemotedConsts, Roots,
-                                 Visited))
+      if (!collectValuesToDemote(IncValue, IsProfitableToDemoteRoot, BitWidth,
+                                 ToDemote, DemotedConsts, Visited,
+                                 MaxDepthLevel, IsProfitableToDemote))
         return false;
     break;
   }
 
   // Otherwise, conservatively give up.
   default:
-    return false;
+    if (!IsPotentiallyTruncated(I, BitWidth))
+      return false;
+    MaxDepthLevel = 0;
+    Start = End = 0;
+    break;
   }
 
+  ++MaxDepthLevel;
   // Gather demoted constant operands.
   for (unsigned Idx : seq<unsigned>(Start, End))
     if (isa<Constant>(I->getOperand(Idx)))
       DemotedConsts.try_emplace(I).first->getSecond().push_back(Idx);
   // Record the value that we can demote.
   ToDemote.push_back(V);
-  return true;
+  return IsProfitableToDemote;
 }
 
 void BoUpSLP::computeMinimumValueSizes() {
   // We only attempt to truncate integer expressions.
-  auto &TreeRoot = VectorizableTree[0]->Scalars;
-  auto *TreeRootIT = dyn_cast<IntegerType>(TreeRoot[0]->getType());
-  if (!TreeRootIT)
-    return;
+  bool IsStoreOrInsertElt =
+      VectorizableTree.front()->getOpcode() == Instruction::Store ||
+      VectorizableTree.front()->getOpcode() == Instruction::InsertElement;
+  unsigned NodeIdx = 0;
+  if (IsStoreOrInsertElt &&
+      VectorizableTree.front()->State != TreeEntry::NeedToGather)
+    NodeIdx = 1;
 
   // Ensure the roots of the vectorizable tree don't form a cycle.
-  if (!VectorizableTree.front()->UserTreeIndices.empty())
+  if ((NodeIdx == 0 && !VectorizableTree[NodeIdx]->UserTreeIndices.empty()) ||
+      (NodeIdx != 0 && any_of(VectorizableTree[NodeIdx]->UserTreeIndices,
+                              [&](const EdgeInfo &EI) {
+                                return EI.UserTE->Idx >
+                                       static_cast<int>(NodeIdx);
+                              })))
     return;
 
-  // Conservatively determine if we can actually truncate the roots of the
-  // expression. Collect the values that can be demoted in ToDemote and
-  // additional roots that require investigating in Roots.
-  SmallVector<Value *, 32> ToDemote;
-  DenseMap<Instruction *, SmallVector<unsigned>> DemotedConsts;
-  SmallVector<Value *, 4> Roots;
-  for (auto *Root : TreeRoot) {
-    DenseSet<Value *> Visited;
-    if (!collectValuesToDemote(Root, ToDemote, DemotedConsts, Roots, Visited))
-      return;
+  // The first value node for store/insertelement is sext/zext/trunc? Skip it,
+  // resize to the final type.
+  bool IsProfitableToDemoteRoot = !IsStoreOrInsertElt;
+  if (NodeIdx != 0 &&
+      VectorizableTree[NodeIdx]->State == TreeEntry::Vectorize &&
+      (VectorizableTree[NodeIdx]->getOpcode() == Instruction::ZExt ||
+       VectorizableTree[NodeIdx]->getOpcode() == Instruction::SExt ||
+       VectorizableTree[NodeIdx]->getOpcode() == Instruction::Trunc)) {
+    assert(IsStoreOrInsertElt && "Expected store/insertelement seeded graph.");
+    ++NodeIdx;
+    IsProfitableToDemoteRoot = true;
   }
 
-  // The maximum bit width required to represent all the values that can be
-  // demoted without loss of precision. It would be safe to truncate the roots
-  // of the expression to this width.
-  auto MaxBitWidth = 1u;
-
-  // We first check if all the bits of the roots are demanded. If they're not,
-  // we can truncate the roots to this narrower type.
-  for (auto *Root : TreeRoot) {
-    auto Mask = DB->getDemandedBits(cast<Instruction>(Root));
-    MaxBitWidth = std::max<unsigned>(Mask.getBitWidth() - Mask.countl_zero(),
-                                     MaxBitWidth);
-  }
-
-  // True if the roots can be zero-extended back to their original type, rather
-  // than sign-extended. We know that if the leading bits are not demanded, we
-  // can safely zero-extend. So we initialize IsKnownPositive to True.
-  bool IsKnownPositive = true;
-
-  // If all the bits of the roots are demanded, we can try a little harder to
-  // compute a narrower type. This can happen, for example, if the roots are
-  // getelementptr indices. InstCombine promotes these indices to the pointer
-  // width. Thus, all their bits are technically demanded even though the
-  // address computation might be vectorized in a smaller type.
-  //
-  // We start by looking at each entry that can be demoted. We compute the
-  // maximum bit width required to store the scalar by using ValueTracking to
-  // compute the number of high-order bits we can truncate.
-  if (MaxBitWidth == DL->getTypeSizeInBits(TreeRoot[0]->getType()) &&
-      all_of(TreeRoot, [](Value *V) {
-        return all_of(V->users(),
-                      [](User *U) { return isa<GetElementPtrInst>(U); });
-      })) {
-    MaxBitWidth = 8u;
-
+  SmallVector<Value *> ToDemote;
+  DenseMap<Instruction *, SmallVector<unsigned>> DemotedConsts;
+  auto ComputeMaxBitWidth = [&](ArrayRef<Value *> TreeRoot, unsigned VF,
+                                bool IsTopRoot, bool IsProfitableToDemoteRoot,
+                                unsigned Opcode, unsigned Limit) {
+    ToDemote.clear();
+    auto *TreeRootIT = dyn_cast<IntegerType>(TreeRoot[0]->getType());
+    if (!TreeRootIT || !Opcode)
+      return 0u;
+
+    unsigned NumParts = TTI->getNumberOfParts(
+        FixedVectorType::get(TreeRoot.front()->getType(), VF));
+
+    // The maximum bit width required to represent all the values that can be
+    // demoted without loss of precision. It would be safe to truncate the roots
+    // of the expression to this width.
+    auto MaxBitWidth = 1u;
+
+    // True if the roots can be zero-extended back to their original type,
+    // rather than sign-extended. We know that if the leading bits are not
+    // demanded, we can safely zero-extend. So we initialize IsKnownPositive to
+    // True.
     // Determine if the sign bit of all the roots is known to be zero. If not,
     // IsKnownPositive is set to False.
-    IsKnownPositive = llvm::all_of(TreeRoot, [&](Value *R) {
+    bool IsKnownPositive = all_of(TreeRoot, [&](Value *R) {
       KnownBits Known = computeKnownBits(R, *DL);
       return Known.isNonNegative();
     });
 
-    // Determine the maximum number of bits required to store the scalar
-    // values.
-    for (auto *Scalar : ToDemote) {
-      auto NumSignBits = ComputeNumSignBits(Scalar, *DL, 0, AC, nullptr, DT);
-      auto NumTypeBits = DL->getTypeSizeInBits(Scalar->getType());
-      MaxBitWidth = std::max<unsigned>(NumTypeBits - NumSignBits, MaxBitWidth);
-    }
-
-    // If we can't prove that the sign bit is zero, we must add one to the
-    // maximum bit width to account for the unknown sign bit. This preserves
-    // the existing sign bit so we can safely sign-extend the root back to the
-    // original type. Otherwise, if we know the sign bit is zero, we will
-    // zero-extend the root instead.
-    //
-    // FIXME: This is somewhat suboptimal, as there will be cases where adding
-    //        one to the maximum bit width will yield a larger-than-necessary
-    //        type. In general, we need to add an extra bit only if we can't
-    //        prove that the upper bit of the original type is equal to the
-    //        upper bit of the proposed smaller type. If these two bits are the
-    //        same (either zero or one) we know that sign-extending from the
-    //        smaller type will result in the same value. Here, since we can't
-    //        yet prove this, we are just making the proposed smaller type
-    //        larger to ensure correctness.
-    if (!IsKnownPositive)
-      ++MaxBitWidth;
-  }
-
-  // Round MaxBitWidth up to the next power-of-two.
-  MaxBitWidth = llvm::bit_ceil(MaxBitWidth);
-
-  // If the maximum bit width we compute is less than the with of the roots'
-  // type, we can proceed with the narrowing. Otherwise, do nothing.
-  if (MaxBitWidth >= TreeRootIT->getBitWidth())
-    return;
+    // We first check if all the bits of the roots are demanded. If they'r...
[truncated]

Created using spr 1.3.5
Member Author

@alexey-bataev alexey-bataev left a comment

Ping!

Created using spr 1.3.5
return true;
auto NumSignBits = ComputeNumSignBits(V, *DL, 0, AC, nullptr, DT);
unsigned BitWidth1 = OrigBitWidth - NumSignBits;
KnownBits Known = computeKnownBits(V, *DL);
Collaborator

(Optimization) - You're calling computeKnownBits here after MaskedValueIsZero has already done that above

Member Author

Not sure how to reuse the result from MaskedValueIsZero for checking that V is potentially negative here. Anyway, replaced it with the llvm::isKnownNonNegative function.

Collaborator

You should be able to replace MaskedValueIsZero with something like:

KnownBits Known = computeKnownBits(V, *DL);
if (Known.countMinLeadingZeros() >= (OrigBitWidth - NumSignBits))
  return true;

Member Author

You should be able to replace MaskedValueIsZero with something like:

KnownBits Known = computeKnownBits(V, *DL);
if (Known.countMinLeadingZeros() >= (OrigBitWidth - NumSignBits))
  return true;

Not sure this is better. I prefer to use standard functions rather than reinvent them here. This kind of "manual inlining" is very hard to maintain.
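
For reference, a self-contained sketch of the two variants discussed in this
thread (the helper name is illustrative and not from the patch). Both ask the
same question, whether every bit at or above BitWidth is known to be zero, but
the second derives it from a single computeKnownBits result that the
surrounding code needs anyway:

#include "llvm/ADT/APInt.h"
#include "llvm/Analysis/SimplifyQuery.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Value.h"
#include "llvm/Support/KnownBits.h"
using namespace llvm;

// Illustrative only; not the patch's code.
static bool topBitsAreKnownZero(Value *V, const DataLayout &DL,
                                unsigned BitWidth) {
  unsigned OrigBitWidth = DL.getTypeSizeInBits(V->getType());

  // Variant A (as in the patch's IsPotentiallyTruncated lambda): build a mask
  // of bits [BitWidth, OrigBitWidth) and ask whether they are all known zero.
  APInt Mask = APInt::getBitsSetFrom(OrigBitWidth, BitWidth);
  if (MaskedValueIsZero(V, Mask, SimplifyQuery(DL)))
    return true;

  // Variant B (the KnownBits-based reformulation suggested above, adapted to
  // the same check): one computeKnownBits call answers the same question via
  // the minimum number of leading zero bits.
  KnownBits Known = computeKnownBits(V, DL);
  return Known.countMinLeadingZeros() >= OrigBitWidth - BitWidth;
}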


github-actions bot commented Feb 22, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Created using spr 1.3.5
@alexey-bataev
Member Author

Ping!

Created using spr 1.3.5
@alexey-bataev
Member Author

Ping!

Collaborator

@RKSimon RKSimon left a comment

LGTM - as far as I can tell - the AArch64/getelementptr.ll code explosion seems unfortunate though - have you confirmed it results in better codegen?

@alexey-bataev
Member Author

LGTM - as far as I can tell - the AArch64/getelementptr.ll code explosion seems unfortunate though - have you confirmed it results in better codegen?

AArch64/getelementptr.ll is not a regression. It uses -slp-threshold=-7, which with this patch enables vectorization of a previously non-profitable graph. We would need to set -slp-threshold=-6 (or even -5) to get the old output back, but then the first test (unaffected by this patch) would fail. I'll split this test into 2 separate tests with different thresholds.

Created using spr 1.3.5
@alexey-bataev alexey-bataev merged commit a730ed7 into main Mar 5, 2024
3 of 4 checks passed
@alexey-bataev alexey-bataev deleted the users/alexey-bataev/spr/slpimprove-minbitwidth-analysis branch March 5, 2024 17:20
@nikic
Contributor

nikic commented Mar 5, 2024

Seems to break the build, maybe on GCC only?


/var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3617:24: error: declaration of ‘llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI’ changes meaning of ‘TTI’ [-fpermissive]
 3617 |   TargetTransformInfo *TTI;
      |                        ^~~
In file included from /var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:48:
/var/llvm-compile-time-tracker/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:206:29: note: ‘TTI’ declared here as ‘typedef class llvm::TargetTransformInfo llvm::TTI’
  206 | typedef TargetTransformInfo TTI;
      |                             ^~~

@alexey-bataev
Member Author

Seems to break the build, maybe on GCC only?


/var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3617:24: error: declaration of ‘llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI’ changes meaning of ‘TTI’ [-fpermissive]
 3617 |   TargetTransformInfo *TTI;
      |                        ^~~
In file included from /var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:48:
/var/llvm-compile-time-tracker/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:206:29: note: ‘TTI’ declared here as ‘typedef class llvm::TargetTransformInfo llvm::TTI’
  206 | typedef TargetTransformInfo TTI;
      |                             ^~~

Hmm, I did not touch this code. Probably some kind of side effect. Could you provide the full error report?

@dtcxzyw
Member

dtcxzyw commented Mar 5, 2024

Seems to break the build, maybe on GCC only?


/var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3617:24: error: declaration of ‘llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI’ changes meaning of ‘TTI’ [-fpermissive]
 3617 |   TargetTransformInfo *TTI;
      |                        ^~~
In file included from /var/llvm-compile-time-tracker/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:48:
/var/llvm-compile-time-tracker/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:206:29: note: ‘TTI’ declared here as ‘typedef class llvm::TargetTransformInfo llvm::TTI’
  206 | typedef TargetTransformInfo TTI;
      |                             ^~~

Hmm, I did not touch this code. Probably some kind of side effect. Could you provide the full error report?

https://github.com/dtcxzyw/llvm-ci/actions/runs/8160865465/job/22308394053

[624/1925] Building CXX object lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o
FAILED: lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o 
/usr/bin/c++ -DGTEST_HAS_RTTI=0 -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-build/lib/Transforms/Vectorize -I/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/lib/Transforms/Vectorize -I/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-build/include -I/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -fno-lifetime-dse -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-uninitialized -Wno-nonnull -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wno-misleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG -std=c++17  -fno-exceptions -funwind-tables -fno-rtti -MD -MT lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o -MF lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o.d -o lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o -c /home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3617:24: error: declaration of ‘llvm::TargetTransformInfo* llvm::slpvectorizer::BoUpSLP::TTI’ changes meaning of ‘TTI’ [-fpermissive]
 3617 |   TargetTransformInfo *TTI;
      |                        ^~~
In file included from /home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:48:
/home/dtcxzyw/llvm-ci/rv64gc-O3-thinlto/llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:206:29: note: ‘TTI’ declared here as ‘typedef class llvm::TargetTransformInfo llvm::TTI’
  206 | typedef TargetTransformInfo TTI;
      |                             ^~~
[625/1925] Building CXX object lib/Transforms/ObjCARC/CMakeFiles/LLVMObjCARCOpts.dir/ObjCARCContract.cpp.o

Related commits:

from 1e828f838cc0f15074f3dbbb04929c06ef0c9729 to 041638c4294a9a8375851e0add1ab2c99412c032

041638c4294a9a8375851e0add1ab2c99412c032 [libc][stdbit] implement stdc_bit_width (C23) (#83892)
233f750c3dad14e330f5358e8dbcc6c30e805edb [flang] Catch more bad pointer initialization targets (#83731)
9f67f19614e952ede385a59bb62f7b57771ca4c3 [gn] Remove ScudoBenchmarks
2807ea6b8047780b5e66a122faf09fae786c917b [HLSL] implement the any intrinsic (#83903)
b2ca23aed802abc43ed216ce9bf4c80c056a04c0 [HLSL] implement exp intrinsic (#83832)
06fea93341ae7d0d0faa82c4c8704591963c2d8c [TextAPI] add missing platforms for translating triples to tapi targets
643b31dbe8a515e007a0f1b2e1072c34e461b778 [HLSL] implement `mad` intrinsic (#83826)
a730ed7c1a4a35f5219df720ffb0ba6122d64fe4 [SLP]Improve minbitwidth analysis.
1b1aea79194117d8f1729ef9c8f80454aea381fe AMDGPU: Make s_wait_samplecnt(_bvhcnt) dependent on hasImageInsts, NFC (#83932)
f836048a2b452f5f2a8440c9f5945ee1a7bcdac2 [gn] port 6fd27d5b0321f (no more module.modulemap.in)
f33f66be7dc586a597437d7ce7619d87e8637209 [NFC][RemoveDIs] Always use iterators for inserting PHIs
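
For reference, the GCC diagnostic quoted above comes from the C++ rule that a
name used inside a class must keep the same meaning when re-evaluated in the
completed class scope; GCC enforces it (downgradable with -fpermissive) while
Clang does not, which is why the break only showed up with GCC. A standalone
reduction, with names mirroring the error message but otherwise illustrative,
looks like this:

class TargetTransformInfo {
public:
  enum CastContextHint { None };
};
typedef TargetTransformInfo TTI; // mirrors the typedef in TargetTransformInfo.h

class BoUpSLP {
  // At this point 'TTI' refers to the namespace-level typedef...
  TTI::CastContextHint Hint;
  // ...so the member declaration below changes what 'TTI' means within the
  // class, and GCC reports "declaration of 'TargetTransformInfo* BoUpSLP::TTI'
  // changes meaning of 'TTI'".
  TargetTransformInfo *TTI;
};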

@alexey-bataev
Member Author

Hopefully fixed in 083d8aa.

@nikic
Contributor

nikic commented Mar 5, 2024

It looks like this change causes large compile-time regressions in some cases: http://llvm-compile-time-tracker.com/compare.php?from=1b1aea79194117d8f1729ef9c8f80454aea381fe&to=083d8aa03aca55b88098a91e41e41a8e321a5721&stat=instructions%3Au

For example, mode_decision.c from lencod regresses by 10% with -O3 (without codegen change).

Other examples are partQalignmm.c from mafft by 3.5%, shared_sha256.c from ClamAV by 5%.

@alexey-bataev
Member Author

I'll revert the patch and will try to improve the compile time

@jyknight
Member

jyknight commented Mar 6, 2024

I see it was already reverted.

But prior to that, this change has also triggered a miscompile in xxhash, with llvm/unittests/Support/xxhashTest.cpp failing to compute correct hashes for 6 of its test-cases. (This error showed up in our downstream build, which is building llvm for x86-64, but uses various cpu tuning and other build flags. I don't know if any upstream buildbots observed the bug.)

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:48: Failure
Expected equality of these values:
  uint64_t(0x9013fb74ca603e0c)
    Which is: 10381918045049208332
  xxh3_64bits(ArrayRef(a, size_t(33)))
    Which is: 15799937386795546273

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:49: Failure
Expected equality of these values:
  uint64_t(0xfa5271fcce0db1c3)
    Which is: 18037604788174959043
  xxh3_64bits(ArrayRef(a, size_t(64)))
    Which is: 4052662859111881885

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:50: Failure
Expected equality of these values:
  uint64_t(0x79c42431727f1012)
    Which is: 8774177768817496082
  xxh3_64bits(ArrayRef(a, size_t(65)))
    Which is: 15228745306148528873

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:51: Failure
Expected equality of these values:
  uint64_t(0x591ee0ddf9c9ccd1)
    Which is: 6421817362660052177
  xxh3_64bits(ArrayRef(a, size_t(96)))
    Which is: 9533022609644079181

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:52: Failure
Expected equality of these values:
  uint64_t(0x8ffc6a3111fe19da)
    Which is: 10375284400542587354
  xxh3_64bits(ArrayRef(a, size_t(97)))
    Which is: 16638996938277853399

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:53: Failure
Expected equality of these values:
  uint64_t(0x06a146ee9a2da378)
    Which is: 477741026080826232
  xxh3_64bits(ArrayRef(a, size_t(128)))
    Which is: 5146516666195395268

@alexey-bataev alexey-bataev restored the users/alexey-bataev/spr/slpimprove-minbitwidth-analysis branch March 7, 2024 15:35
@alexey-bataev alexey-bataev deleted the users/alexey-bataev/spr/slpimprove-minbitwidth-analysis branch March 7, 2024 15:36