Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SLP]Improve minbitwidth analysis. #84334

Conversation

alexey-bataev
Copy link
Member

@alexey-bataev alexey-bataev commented Mar 7, 2024

This improves overall analysis for minbitwidth in SLP. It allows to
analyze the trees with store/insertelement root nodes. Also, instead of
using single minbitwidth, detected from the very first analysis stage,
it tries to detect the best one for each trunc/ext subtree in the graph
and use it for the subtree.
Results in better code and less vector register pressure.

Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test 92549.00 92609.00 0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 663381.00 663493.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 663381.00 663493.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 307182.00 307214.00 0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1394420.00 1394484.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1394420.00 1394484.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2040257.00 2040273.00 0.0%

                                                                          test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12396098.00 12395858.00 -0.0%
                                                                                     test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   909944.00   909768.00 -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16
x i16> instead of <16 x i32>, also zext/trunc are removed. In other
places last vector zext/sext removed and replaced by
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce or <4 x i32> replaced by
reduce or <4 x i8>
Spec2017/imagick - Removed extra zext from 2 packs of the operations.
Spec2017/parest - Removed extra zext, replaced by extractelement+scalar
zext
Spec2017/blender - the whole bunch of vector zext/sext replaced by
extractelement+scalar zext/sext, some extra code vectorized in smaller
types.
Spec2006/gobmk - fixed cost estimation, some small code remains scalar.

Original Pull Request: #78976

Created using spr 1.3.5
@alexey-bataev alexey-bataev merged commit 4ce52e2 into main Mar 7, 2024
4 of 5 checks passed
@alexey-bataev alexey-bataev deleted the users/alexey-bataev/spr/slpimprove-minbitwidth-analysis-1 branch March 7, 2024 15:36
@llvmbot
Copy link
Collaborator

llvmbot commented Mar 7, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Alexey Bataev (alexey-bataev)

Changes

This improves overall analysis for minbitwidth in SLP. It allows to
analyze the trees with store/insertelement root nodes. Also, instead of
using single minbitwidth, detected from the very first analysis stage,
it tries to detect the best one for each trunc/ext subtree in the graph
and use it for the subtree.
Results in better code and less vector register pressure.

Metric: size..text

Program size..text
results results0 diff
test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test 92549.00 92609.00 0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 663381.00 663493.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 663381.00 663493.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 307182.00 307214.00 0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1394420.00 1394484.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1394420.00 1394484.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2040257.00 2040273.00 0.0%

                                                                          test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12396098.00 12395858.00 -0.0%
                                                                                     test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   909944.00   909768.00 -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16
x i16> instead of <16 x i32>, also zext/trunc are removed. In other
places last vector zext/sext removed and replaced by
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce or <4 x i32> replaced by
reduce or <4 x i8>
Spec2017/imagick - Removed extra zext from 2 packs of the operations.
Spec2017/parest - Removed extra zext, replaced by extractelement+scalar
zext
Spec2017/blender - the whole bunch of vector zext/sext replaced by
extractelement+scalar zext/sext, some extra code vectorized in smaller
types.
Spec2006/gobmk - fixed cost estimation, some small code remains scalar.


Patch is 71.15 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/84334.diff

15 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+440-194)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/ext-trunc.ll (+5-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/getelementptr2.ll (+2-2)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/reduce-add-i64.ll (+5-15)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/reductions.ll (+4-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/PR35777.ll (+5-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/int-bitcast-minbitwidth.ll (+1-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/minbitwidth-multiuse-with-insertelement.ll (+8-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/minbitwidth-transformed-operand.ll (+13-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/minimum-sizes.ll (+24-19)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/phi-undef-input.ll (+12-12)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/resched.ll (+16-16)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reused-reductions-with-minbitwidth.ll (+4-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/store-insertelement-minbitwidth.ll (+12-10)
  • (modified) llvm/test/Transforms/SLPVectorizer/alt-cmp-vectorize.ll (+2-2)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 36dc9094538ae9..1889bc09e85028 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -1085,6 +1085,9 @@ class BoUpSLP {
       BS->clear();
     }
     MinBWs.clear();
+    ReductionBitWidth = 0;
+    CastMaxMinBWSizes.reset();
+    TruncNodes.clear();
     InstrElementSize.clear();
     UserIgnoreList = nullptr;
     PostponedGathers.clear();
@@ -2287,6 +2290,7 @@ class BoUpSLP {
   void clearReductionData() {
     AnalyzedReductionsRoots.clear();
     AnalyzedReductionVals.clear();
+    AnalyzedMinBWVals.clear();
   }
   /// Checks if the given value is gathered in one of the nodes.
   bool isAnyGathered(const SmallDenseSet<Value *> &Vals) const {
@@ -2307,9 +2311,11 @@ class BoUpSLP {
   /// constant and to be demoted. Required to correctly identify constant nodes
   /// to be demoted.
   bool collectValuesToDemote(
-      Value *V, SmallVectorImpl<Value *> &ToDemote,
+      Value *V, bool IsProfitableToDemoteRoot, unsigned &BitWidth,
+      SmallVectorImpl<Value *> &ToDemote,
       DenseMap<Instruction *, SmallVector<unsigned>> &DemotedConsts,
-      SmallVectorImpl<Value *> &Roots, DenseSet<Value *> &Visited) const;
+      DenseSet<Value *> &Visited, unsigned &MaxDepthLevel,
+      bool &IsProfitableToDemote) const;
 
   /// Check if the operands on the edges \p Edges of the \p UserTE allows
   /// reordering (i.e. the operands can be reordered because they have only one
@@ -2375,6 +2381,10 @@ class BoUpSLP {
   /// \ returns the graph entry for the \p Idx operand of the \p E entry.
   const TreeEntry *getOperandEntry(const TreeEntry *E, unsigned Idx) const;
 
+  /// \returns Cast context for the given graph node.
+  TargetTransformInfo::CastContextHint
+  getCastContextHint(const TreeEntry &TE) const;
+
   /// \returns the cost of the vectorizable entry.
   InstructionCost getEntryCost(const TreeEntry *E,
                                ArrayRef<Value *> VectorizedVals,
@@ -2925,11 +2935,18 @@ class BoUpSLP {
       }
       assert(!BundleMember && "Bundle and VL out of sync");
     } else {
-      MustGather.insert(VL.begin(), VL.end());
       // Build a map for gathered scalars to the nodes where they are used.
+      bool AllConstsOrCasts = true;
       for (Value *V : VL)
-        if (!isConstant(V))
+        if (!isConstant(V)) {
+          auto *I = dyn_cast<CastInst>(V);
+          AllConstsOrCasts &= I && I->getType()->isIntegerTy();
           ValueToGatherNodes.try_emplace(V).first->getSecond().insert(Last);
+        }
+      if (AllConstsOrCasts)
+        CastMaxMinBWSizes =
+            std::make_pair(std::numeric_limits<unsigned>::max(), 1);
+      MustGather.insert(VL.begin(), VL.end());
     }
 
     if (UserTreeIdx.UserTE)
@@ -3054,6 +3071,10 @@ class BoUpSLP {
   /// Set of hashes for the list of reduction values already being analyzed.
   DenseSet<size_t> AnalyzedReductionVals;
 
+  /// Values, already been analyzed for mininmal bitwidth and found to be
+  /// non-profitable.
+  DenseSet<Value *> AnalyzedMinBWVals;
+
   /// A list of values that need to extracted out of the tree.
   /// This list holds pairs of (Internal Scalar : External User). External User
   /// can be nullptr, it means that this Internal Scalar will be used later,
@@ -3629,6 +3650,18 @@ class BoUpSLP {
   /// value must be signed-extended, rather than zero-extended, back to its
   /// original width.
   DenseMap<const TreeEntry *, std::pair<uint64_t, bool>> MinBWs;
+
+  /// Final size of the reduced vector, if the current graph represents the
+  /// input for the reduction and it was possible to narrow the size of the
+  /// reduction.
+  unsigned ReductionBitWidth = 0;
+
+  /// If the tree contains any zext/sext/trunc nodes, contains max-min pair of
+  /// type sizes, used in the tree.
+  std::optional<std::pair<unsigned, unsigned>> CastMaxMinBWSizes;
+
+  /// Indices of the vectorized trunc nodes.
+  DenseSet<unsigned> TruncNodes;
 };
 
 } // end namespace slpvectorizer
@@ -6539,8 +6572,29 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
     case Instruction::Trunc:
     case Instruction::FPTrunc:
     case Instruction::BitCast: {
+      auto [PrevMaxBW, PrevMinBW] = CastMaxMinBWSizes.value_or(
+          std::make_pair(std::numeric_limits<unsigned>::min(),
+                         std::numeric_limits<unsigned>::max()));
+      if (ShuffleOrOp == Instruction::ZExt ||
+          ShuffleOrOp == Instruction::SExt) {
+        CastMaxMinBWSizes = std::make_pair(
+            std::max<unsigned>(DL->getTypeSizeInBits(VL0->getType()),
+                               PrevMaxBW),
+            std::min<unsigned>(
+                DL->getTypeSizeInBits(VL0->getOperand(0)->getType()),
+                PrevMinBW));
+      } else if (ShuffleOrOp == Instruction::Trunc) {
+        CastMaxMinBWSizes = std::make_pair(
+            std::max<unsigned>(
+                DL->getTypeSizeInBits(VL0->getOperand(0)->getType()),
+                PrevMaxBW),
+            std::min<unsigned>(DL->getTypeSizeInBits(VL0->getType()),
+                               PrevMinBW));
+        TruncNodes.insert(VectorizableTree.size());
+      }
       TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
                                    ReuseShuffleIndicies);
+
       LLVM_DEBUG(dbgs() << "SLP: added a vector of casts.\n");
 
       TE->setOperandsInOrder();
@@ -8362,6 +8416,22 @@ const BoUpSLP::TreeEntry *BoUpSLP::getOperandEntry(const TreeEntry *E,
   return It->get();
 }
 
+TTI::CastContextHint BoUpSLP::getCastContextHint(const TreeEntry &TE) const {
+  if (TE.State == TreeEntry::ScatterVectorize ||
+      TE.State == TreeEntry::StridedVectorize)
+    return TTI::CastContextHint::GatherScatter;
+  if (TE.State == TreeEntry::Vectorize && TE.getOpcode() == Instruction::Load &&
+      !TE.isAltShuffle()) {
+    if (TE.ReorderIndices.empty())
+      return TTI::CastContextHint::Normal;
+    SmallVector<int> Mask;
+    inversePermutation(TE.ReorderIndices, Mask);
+    if (ShuffleVectorInst::isReverseMask(Mask, Mask.size()))
+      return TTI::CastContextHint::Reversed;
+  }
+  return TTI::CastContextHint::None;
+}
+
 InstructionCost
 BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
                       SmallPtrSetImpl<Value *> &CheckedExtracts) {
@@ -8384,6 +8454,7 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
   // If we have computed a smaller type for the expression, update VecTy so
   // that the costs will be accurate.
   auto It = MinBWs.find(E);
+  Type *OrigScalarTy = ScalarTy;
   if (It != MinBWs.end()) {
     ScalarTy = IntegerType::get(F->getContext(), It->second.first);
     VecTy = FixedVectorType::get(ScalarTy, VL.size());
@@ -8441,24 +8512,11 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
     UsedScalars.set(I);
   }
   auto GetCastContextHint = [&](Value *V) {
-    if (const TreeEntry *OpTE = getTreeEntry(V)) {
-      if (OpTE->State == TreeEntry::ScatterVectorize ||
-          OpTE->State == TreeEntry::StridedVectorize)
-        return TTI::CastContextHint::GatherScatter;
-      if (OpTE->State == TreeEntry::Vectorize &&
-          OpTE->getOpcode() == Instruction::Load && !OpTE->isAltShuffle()) {
-        if (OpTE->ReorderIndices.empty())
-          return TTI::CastContextHint::Normal;
-        SmallVector<int> Mask;
-        inversePermutation(OpTE->ReorderIndices, Mask);
-        if (ShuffleVectorInst::isReverseMask(Mask, Mask.size()))
-          return TTI::CastContextHint::Reversed;
-      }
-    } else {
-      InstructionsState SrcState = getSameOpcode(E->getOperand(0), *TLI);
-      if (SrcState.getOpcode() == Instruction::Load && !SrcState.isAltShuffle())
-        return TTI::CastContextHint::GatherScatter;
-    }
+    if (const TreeEntry *OpTE = getTreeEntry(V))
+      return getCastContextHint(*OpTE);
+    InstructionsState SrcState = getSameOpcode(E->getOperand(0), *TLI);
+    if (SrcState.getOpcode() == Instruction::Load && !SrcState.isAltShuffle())
+      return TTI::CastContextHint::GatherScatter;
     return TTI::CastContextHint::None;
   };
   auto GetCostDiff =
@@ -8507,8 +8565,6 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
               TTI::CastContextHint CCH = GetCastContextHint(VL0);
               VecCost += TTI->getCastInstrCost(VecOpcode, UserVecTy, VecTy, CCH,
                                                CostKind);
-              ScalarCost += Sz * TTI->getCastInstrCost(VecOpcode, UserScalarTy,
-                                                       ScalarTy, CCH, CostKind);
             }
           }
         }
@@ -8525,7 +8581,7 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
     InstructionCost ScalarCost = 0;
     InstructionCost VecCost = 0;
     std::tie(ScalarCost, VecCost) = getGEPCosts(
-        *TTI, Ptrs, BasePtr, E->getOpcode(), CostKind, ScalarTy, VecTy);
+        *TTI, Ptrs, BasePtr, E->getOpcode(), CostKind, OrigScalarTy, VecTy);
     LLVM_DEBUG(dumpTreeCosts(E, 0, VecCost, ScalarCost,
                              "Calculated GEPs cost for Tree"));
 
@@ -8572,7 +8628,7 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
           NumElts = ATy->getNumElements();
         else
           NumElts = AggregateTy->getStructNumElements();
-        SrcVecTy = FixedVectorType::get(ScalarTy, NumElts);
+        SrcVecTy = FixedVectorType::get(OrigScalarTy, NumElts);
       }
       if (I->hasOneUse()) {
         Instruction *Ext = I->user_back();
@@ -8740,13 +8796,7 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
       }
     }
     auto GetScalarCost = [&](unsigned Idx) -> InstructionCost {
-      // Do not count cost here if minimum bitwidth is in effect and it is just
-      // a bitcast (here it is just a noop).
-      if (VecOpcode != Opcode && VecOpcode == Instruction::BitCast)
-        return TTI::TCC_Free;
-      auto *VI = VL0->getOpcode() == Opcode
-                     ? cast<Instruction>(UniqueValues[Idx])
-                     : nullptr;
+      auto *VI = cast<Instruction>(UniqueValues[Idx]);
       return TTI->getCastInstrCost(Opcode, VL0->getType(),
                                    VL0->getOperand(0)->getType(),
                                    TTI::getCastContextHint(VI), CostKind, VI);
@@ -8789,7 +8839,7 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
                                        ? CmpInst::BAD_FCMP_PREDICATE
                                        : CmpInst::BAD_ICMP_PREDICATE;
 
-      return TTI->getCmpSelInstrCost(E->getOpcode(), ScalarTy,
+      return TTI->getCmpSelInstrCost(E->getOpcode(), OrigScalarTy,
                                      Builder.getInt1Ty(), CurrentPred, CostKind,
                                      VI);
     };
@@ -8844,7 +8894,7 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
       TTI::OperandValueInfo Op2Info =
           TTI::getOperandInfo(VI->getOperand(OpIdx));
       SmallVector<const Value *> Operands(VI->operand_values());
-      return TTI->getArithmeticInstrCost(ShuffleOrOp, ScalarTy, CostKind,
+      return TTI->getArithmeticInstrCost(ShuffleOrOp, OrigScalarTy, CostKind,
                                          Op1Info, Op2Info, Operands, VI);
     };
     auto GetVectorCost = [=](InstructionCost CommonCost) {
@@ -8863,9 +8913,9 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
   case Instruction::Load: {
     auto GetScalarCost = [&](unsigned Idx) {
       auto *VI = cast<LoadInst>(UniqueValues[Idx]);
-      return TTI->getMemoryOpCost(Instruction::Load, ScalarTy, VI->getAlign(),
-                                  VI->getPointerAddressSpace(), CostKind,
-                                  TTI::OperandValueInfo(), VI);
+      return TTI->getMemoryOpCost(Instruction::Load, OrigScalarTy,
+                                  VI->getAlign(), VI->getPointerAddressSpace(),
+                                  CostKind, TTI::OperandValueInfo(), VI);
     };
     auto *LI0 = cast<LoadInst>(VL0);
     auto GetVectorCost = [&](InstructionCost CommonCost) {
@@ -8908,9 +8958,9 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
     auto GetScalarCost = [=](unsigned Idx) {
       auto *VI = cast<StoreInst>(VL[Idx]);
       TTI::OperandValueInfo OpInfo = TTI::getOperandInfo(VI->getValueOperand());
-      return TTI->getMemoryOpCost(Instruction::Store, ScalarTy, VI->getAlign(),
-                                  VI->getPointerAddressSpace(), CostKind,
-                                  OpInfo, VI);
+      return TTI->getMemoryOpCost(Instruction::Store, OrigScalarTy,
+                                  VI->getAlign(), VI->getPointerAddressSpace(),
+                                  CostKind, OpInfo, VI);
     };
     auto *BaseSI =
         cast<StoreInst>(IsReorder ? VL[E->ReorderIndices.front()] : VL0);
@@ -9772,6 +9822,44 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
     Cost -= InsertCost;
   }
 
+  // Add the cost for reduced value resize (if required).
+  if (ReductionBitWidth != 0) {
+    assert(UserIgnoreList && "Expected reduction tree.");
+    const TreeEntry &E = *VectorizableTree.front().get();
+    auto It = MinBWs.find(&E);
+    if (It != MinBWs.end() && It->second.first != ReductionBitWidth) {
+      unsigned SrcSize = It->second.first;
+      unsigned DstSize = ReductionBitWidth;
+      unsigned Opcode = Instruction::Trunc;
+      if (SrcSize < DstSize)
+        Opcode = It->second.second ? Instruction::SExt : Instruction::ZExt;
+      auto *SrcVecTy =
+          FixedVectorType::get(Builder.getIntNTy(SrcSize), E.getVectorFactor());
+      auto *DstVecTy =
+          FixedVectorType::get(Builder.getIntNTy(DstSize), E.getVectorFactor());
+      TTI::CastContextHint CCH = getCastContextHint(E);
+      InstructionCost CastCost;
+      switch (E.getOpcode()) {
+      case Instruction::SExt:
+      case Instruction::ZExt:
+      case Instruction::Trunc: {
+        const TreeEntry *OpTE = getOperandEntry(&E, 0);
+        CCH = getCastContextHint(*OpTE);
+        break;
+      }
+      default:
+        break;
+      }
+      CastCost += TTI->getCastInstrCost(Opcode, DstVecTy, SrcVecTy, CCH,
+                                        TTI::TCK_RecipThroughput);
+      Cost += CastCost;
+      LLVM_DEBUG(dbgs() << "SLP: Adding cost " << CastCost
+                        << " for final resize for reduction from " << SrcVecTy
+                        << " to " << DstVecTy << "\n";
+                 dbgs() << "SLP: Current total cost = " << Cost << "\n");
+    }
+  }
+
 #ifndef NDEBUG
   SmallString<256> Str;
   {
@@ -9992,6 +10080,30 @@ BoUpSLP::isGatherShuffledSingleRegisterEntry(
   // tree node for each gathered value - we have just a permutation of the
   // single vector. If we have 2 different sets, we're in situation where we
   // have a permutation of 2 input vectors.
+  // Filter out entries with larger bitwidth of elements.
+  Type *ScalarTy = VL.front()->getType();
+  unsigned BitWidth = 0;
+  if (ScalarTy->isIntegerTy()) {
+    // Check if the used TEs supposed to be resized and choose the best
+    // candidates.
+    BitWidth = DL->getTypeStoreSize(ScalarTy);
+    if (TEUseEI.UserTE->getOpcode() != Instruction::Select ||
+        TEUseEI.EdgeIdx != 0) {
+      auto UserIt = MinBWs.find(TEUseEI.UserTE);
+      if (UserIt != MinBWs.end())
+        BitWidth = UserIt->second.second;
+    }
+  }
+  auto CheckBitwidth = [&](const TreeEntry &TE) {
+    Type *ScalarTy = TE.Scalars.front()->getType();
+    if (!ScalarTy->isIntegerTy())
+      return true;
+    unsigned TEBitWidth = DL->getTypeStoreSize(ScalarTy);
+    auto UserIt = MinBWs.find(TEUseEI.UserTE);
+    if (UserIt != MinBWs.end())
+      TEBitWidth = UserIt->second.second;
+    return BitWidth == TEBitWidth;
+  };
   SmallVector<SmallPtrSet<const TreeEntry *, 4>> UsedTEs;
   DenseMap<Value *, int> UsedValuesEntry;
   for (Value *V : VL) {
@@ -10026,6 +10138,8 @@ BoUpSLP::isGatherShuffledSingleRegisterEntry(
           continue;
       }
 
+      if (!CheckBitwidth(*TEPtr))
+        continue;
       // Check if the user node of the TE comes after user node of TEPtr,
       // otherwise TEPtr depends on TE.
       if ((TEInsertBlock != InsertPt->getParent() ||
@@ -10042,8 +10156,8 @@ BoUpSLP::isGatherShuffledSingleRegisterEntry(
             continue;
           VTE = *It->getSecond().begin();
           // Iterate through all vectorized nodes.
-          auto *MIt = find_if(It->getSecond(), [](const TreeEntry *MTE) {
-            return MTE->State == TreeEntry::Vectorize;
+          auto *MIt = find_if(It->getSecond(), [&](const TreeEntry *MTE) {
+            return MTE->State == TreeEntry::Vectorize && CheckBitwidth(*MTE);
           });
           if (MIt == It->getSecond().end())
             continue;
@@ -10053,10 +10167,7 @@ BoUpSLP::isGatherShuffledSingleRegisterEntry(
       Instruction &LastBundleInst = getLastInstructionInBundle(VTE);
       if (&LastBundleInst == TEInsertPt || !CheckOrdering(&LastBundleInst))
         continue;
-      auto It = MinBWs.find(VTE);
-      // If vectorize node is demoted - do not match.
-      if (It != MinBWs.end() &&
-          It->second.first != DL->getTypeSizeInBits(V->getType()))
+      if (!CheckBitwidth(*VTE))
         continue;
       VToTEs.insert(VTE);
     }
@@ -12929,7 +13040,21 @@ Value *BoUpSLP::vectorizeTree(
   Builder.ClearInsertionPoint();
   InstrElementSize.clear();
 
-  return VectorizableTree[0]->VectorizedValue;
+  const TreeEntry &RootTE = *VectorizableTree.front().get();
+  Value *Vec = RootTE.VectorizedValue;
+  if (auto It = MinBWs.find(&RootTE); ReductionBitWidth != 0 &&
+                                      It != MinBWs.end() &&
+                                      ReductionBitWidth != It->second.first) {
+    IRBuilder<>::InsertPointGuard Guard(Builder);
+    Builder.SetInsertPoint(ReductionRoot->getParent(),
+                           ReductionRoot->getIterator());
+    Vec = Builder.CreateIntCast(
+        Vec,
+        VectorType::get(Builder.getIntNTy(ReductionBitWidth),
+                        cast<VectorType>(Vec->getType())->getElementCount()),
+        It->second.second);
+  }
+  return Vec;
 }
 
 void BoUpSLP::optimizeGatherSequence() {
@@ -13749,23 +13874,42 @@ unsigned BoUpSLP::getVectorElementSize(Value *V) {
 // smaller type with a truncation. We collect the values that will be demoted
 // in ToDemote and additional roots that require investigating in Roots.
 bool BoUpSLP::collectValuesToDemote(
-    Value *V, SmallVectorImpl<Value *> &ToDemote,
+    Value *V, bool IsProfitableToDemoteRoot, unsigned &BitWidth,
+    SmallVectorImpl<Value *> &ToDemote,
     DenseMap<Instruction *, SmallVector<unsigned>> &DemotedConsts,
-    SmallVectorImpl<Value *> &Roots, DenseSet<Value *> &Visited) const {
+    DenseSet<Value *> &Visited, unsigned &MaxDepthLevel,
+    bool &IsProfitableToDemote) const {
   // We can always demote constants.
-  if (isa<Constant>(V))
+  if (isa<Constant>(V)) {
+    MaxDepthLevel = 1;
     return true;
+  }
 
   // If the value is not a vectorized instruction in the expression and not used
   // by the insertelement instruction and not used in multiple vector nodes, it
   // cannot be demoted.
+  // TODO: improve handling of gathered values and others.
   auto *I = dyn_cast<Instruction>(V);
-  if (!I || !getTreeEntry(I) || MultiNodeScalars.contains(I) ||
-      !Visited.insert(I).second || all_of(I->users(), [&](User *U) {
+  if (!I || !Visited.insert(I).second || !getTreeEntry(I) ||
+      MultiNodeScalars.contains(I) || all_of(I->users(), [&](User *U) {
         return isa<InsertElementInst>(U) && !getTreeEntry(U);
       }))
     return false;
 
+  auto IsPotentiallyTruncated = [&](Value *V, unsigned &BitWidth) -> bool {
+    if (MultiNodeScalars.contains(V))
+      return false;
+    uint32_t OrigBitWidth = DL->getTypeSizeInBits(V->getType());
+    APInt Mask = APInt::getBitsSetFrom(OrigBitWidth, BitWidth);
+    if (MaskedValueIsZero(V, Mask, SimplifyQuery(*DL)))
+      return true;
+    auto NumSignBits = ComputeNumSignBits(V, *DL, 0, AC, nullptr, DT);
+   ...
[truncated]

Copy link

github-actions bot commented Mar 7, 2024

⚠️ We detected that you are using a GitHub private e-mail address to contribute to the repo.
Please turn off Keep my email addresses private setting in your account.
See LLVM Discourse for more information.

@jyknight
Copy link
Member

jyknight commented Mar 7, 2024

This PR was merged, without

  1. any information on what the changes were since it was reverted, or
  2. being reviewed, or
  3. fixing -- or even acknowledging -- the xxhash miscompile I reported on the previous PR.

So, I'm going to re-revert this. I can work to get a standalone reproduction example for the xxhash issue if you need.

@alexey-bataev
Copy link
Member Author

xxxhash was fixed, I would not commit it without fixing it.

@alexey-bataev
Copy link
Member Author

This patch is almost the same, just contains small fix for xxxhash (one line fix in line 13980) + some changes to improve compile time. Everything else the same

@jyknight
Copy link
Member

jyknight commented Mar 7, 2024

Thanks for the quick response with the info on what was changed, and that there was an intent to fix the xxhash miscompile!

Given that response, I went to double-check my work to see if I'd made a mistake in testing. Unfortunately, not. I've now re-verified that the test still fails with clang built at this revision, and doesn't fail with clang built at the prior revision.

Maybe interesting (or maybe not), now the check with size 129 also fails.

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:48: Failure
Expected equality of these values:
  uint64_t(0x9013fb74ca603e0c)
    Which is: 10381918045049208332
  xxh3_64bits(ArrayRef(a, size_t(33)))
    Which is: 3308700454355487325

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:49: Failure
Expected equality of these values:
  uint64_t(0xfa5271fcce0db1c3)
    Which is: 18037604788174959043
  xxh3_64bits(ArrayRef(a, size_t(64)))
    Which is: 15548882779349196076

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:50: Failure
Expected equality of these values:
  uint64_t(0x79c42431727f1012)
    Which is: 8774177768817496082
  xxh3_64bits(ArrayRef(a, size_t(65)))
    Which is: 9026354738766114493

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:51: Failure
Expected equality of these values:
  uint64_t(0x591ee0ddf9c9ccd1)
    Which is: 6421817362660052177
  xxh3_64bits(ArrayRef(a, size_t(96)))
    Which is: 14257441473594075054

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:52: Failure
Expected equality of these values:
  uint64_t(0x8ffc6a3111fe19da)
    Which is: 10375284400542587354
  xxh3_64bits(ArrayRef(a, size_t(97)))
    Which is: 17112387889451439326

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:53: Failure
Expected equality of these values:
  uint64_t(0x06a146ee9a2da378)
    Which is: 477741026080826232
  xxh3_64bits(ArrayRef(a, size_t(128)))
    Which is: 15939254768337478897

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:54: Failure
Expected equality of these values:
  uint64_t(0xbc7138129bf065da)
    Which is: 13578696004075546074
  xxh3_64bits(ArrayRef(a, size_t(129)))
    Which is: 14895047307860736276

@fmayer
Copy link
Contributor

fmayer commented Mar 7, 2024

This breaks MSAN buildbot: https://lab.llvm.org/buildbot/#/builders/74/builds/26470/steps/12/logs/stdio

I will revert this

/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm_build_msan/bin/clang++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm_build2_msan/lib/Support -I/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Support -I/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm_build2_msan/include -I/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Werror=global-constructors -O3 -DNDEBUG -UNDEBUG -std=c++17  -fno-exceptions -funwind-tables -fno-rtti -MD -MT lib/Support/CMakeFiles/LLVMSupport.dir/xxhash.cpp.o -MF lib/Support/CMakeFiles/LLVMSupport.dir/xxhash.cpp.o.d -o lib/Support/CMakeFiles/LLVMSupport.dir/xxhash.cpp.o -c /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Support/xxhash.cpp
==2146450==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x55c2665958a9 in operator() /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:14122:35
    #1 0x55c2665958a9 in llvm::slpvectorizer::BoUpSLP::computeMinimumValueSizes() /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:14179:28
    #2 0x55c2665aa8a2 in llvm::SLPVectorizerPass::tryToVectorizeList(llvm::ArrayRef<llvm::Value*>, llvm::slpvectorizer::BoUpSLP&, bool) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:14724:9
    #3 0x55c2665ac854 in llvm::SLPVectorizerPass::tryToVectorize(llvm::Instruction*, llvm::slpvectorizer::BoUpSLP&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
    #4 0x55c2665b4c31 in llvm::SLPVectorizerPass::tryToVectorize(llvm::ArrayRef<llvm::WeakTrackingVH>, llvm::slpvectorizer::BoUpSLP&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:16561:14
    #5 0x55c26659f16f in vectorizeRootInstruction /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:16552:10
    #6 0x55c26659f16f in llvm::SLPVectorizerPass::vectorizeChainsInBlock(llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:17091:22
    #7 0x55c26659744e in llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:14313:16
    #8 0x55c266596608 in llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:14244:18
    #9 0x55c265c24e31 in llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:90:17
    #10 0x55c2614e9c2d in llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/IR/PassManager.h:243:40
    #11 0x55c25c3a4a61 in llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:90:17
    #12 0x55c2614f4702 in llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/IR/PassManager.cpp:123:38
    #13 0x55c25c3a43c1 in llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:90:17
    #14 0x55c2614e7036 in llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/IR/PassManager.h:243:40
    #15 0x55c2639695ea in (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline(clang::BackendAction, std::__1::unique_ptr<llvm::raw_pwrite_stream, std::__1::default_delete<llvm::raw_pwrite_stream>>&, std::__1::unique_ptr<llvm::ToolOutputFile, std::__1::default_delete<llvm::ToolOutputFile>>&, clang::BackendConsumer*) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1109:9
    #16 0x55c263952be3 in EmitAssembly /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1174:3
    #17 0x55c263952be3 in clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::__1::unique_ptr<llvm::raw_pwrite_stream, std::__1::default_delete<llvm::raw_pwrite_stream>>, clang::BackendConsumer*) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1336:13
    #18 0x55c2639ac4ff in clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/CodeGen/CodeGenAction.cpp:388:3
    #19 0x55c2688145d6 in clang::ParseAST(clang::Sema&, bool, bool) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/Parse/ParseAST.cpp:176:13
    #20 0x55c26442022d in clang::FrontendAction::Execute() /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/Frontend/FrontendAction.cpp:1069:8
    #21 0x55c26428b983 in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/Frontend/CompilerInstance.cpp:1057:33
    #22 0x55c26467ad02 in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/FrontendTool/ExecuteCompilerInvocation.cpp:272:25
    #23 0x55c25b862cc6 in cc1_main(llvm::ArrayRef<char const*>, char const*, void*) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/tools/driver/cc1_main.cpp:232:15
    #24 0x55c25b85b0f4 in ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&, llvm::ToolContext const&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/tools/driver/driver.cpp:343:12
    #25 0x55c263e7c960 in operator() /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
    #26 0x55c263e7c960 in operator() /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/Driver/Job.cpp:440:34
    #27 0x55c263e7c960 in void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::__1::optional<llvm::StringRef>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, bool*) const::$_0>(long) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12
    #28 0x55c2622a1330 in operator() /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
    #29 0x55c2622a1330 in llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/llvm/lib/Support/CrashRecoveryContext.cpp:426:3
    #30 0x55c263e7a912 in clang::driver::CC1Command::Execute(llvm::ArrayRef<std::__1::optional<llvm::StringRef>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, bool*) const /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/Driver/Job.cpp:440:12
    #31 0x55c263dc8a01 in clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/Driver/Compilation.cpp:199:15
    #32 0x55c263dc96d9 in clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::__1::pair<int, clang::driver::Command const*>>&, bool) const /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/Driver/Compilation.cpp:253:19
    #33 0x55c263e1f538 in clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::__1::pair<int, clang::driver::Command const*>>&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/lib/Driver/Driver.cpp:1920:5
    #34 0x55c25b858f53 in clang_main(int, char**, llvm::ToolContext const&) /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/clang/tools/driver/driver.cpp:518:21
    #35 0x55c25b882233 in main /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm_build_msan/tools/clang/tools/driver/clang-driver.cpp:17:10
    #36 0x7f2741a23a8f  (/lib/x86_64-linux-gnu/libc.so.6+0x23a8f) (BuildId: d320ce4e63925d698610ed423fc4b1f0e8ed51f1)
    #37 0x7f2741a23b48 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23b48) (BuildId: d320ce4e63925d698610ed423fc4b1f0e8ed51f1)
    #38 0x55c25b7c0c64 in _start (/b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm_build_msan/bin/clang-19+0x522ec64)

@alexey-bataev
Copy link
Member Author

Ok, I'll revert the patch and try one more time to fix xxhash/memsan issues.

@alexey-bataev
Copy link
Member Author

Thanks for the quick response with the info on what was changed, and that there was an intent to fix the xxhash miscompile!

Given that response, I went to double-check my work to see if I'd made a mistake in testing. Unfortunately, not. I've now re-verified that the test still fails with clang built at this revision, and doesn't fail with clang built at the prior revision.

Maybe interesting (or maybe not), now the check with size 129 also fails.

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:48: Failure
Expected equality of these values:
  uint64_t(0x9013fb74ca603e0c)
    Which is: 10381918045049208332
  xxh3_64bits(ArrayRef(a, size_t(33)))
    Which is: 3308700454355487325

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:49: Failure
Expected equality of these values:
  uint64_t(0xfa5271fcce0db1c3)
    Which is: 18037604788174959043
  xxh3_64bits(ArrayRef(a, size_t(64)))
    Which is: 15548882779349196076

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:50: Failure
Expected equality of these values:
  uint64_t(0x79c42431727f1012)
    Which is: 8774177768817496082
  xxh3_64bits(ArrayRef(a, size_t(65)))
    Which is: 9026354738766114493

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:51: Failure
Expected equality of these values:
  uint64_t(0x591ee0ddf9c9ccd1)
    Which is: 6421817362660052177
  xxh3_64bits(ArrayRef(a, size_t(96)))
    Which is: 14257441473594075054

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:52: Failure
Expected equality of these values:
  uint64_t(0x8ffc6a3111fe19da)
    Which is: 10375284400542587354
  xxh3_64bits(ArrayRef(a, size_t(97)))
    Which is: 17112387889451439326

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:53: Failure
Expected equality of these values:
  uint64_t(0x06a146ee9a2da378)
    Which is: 477741026080826232
  xxh3_64bits(ArrayRef(a, size_t(128)))
    Which is: 15939254768337478897

third_party/llvm/llvm-project/llvm/unittests/Support/xxhashTest.cpp:54: Failure
Expected equality of these values:
  uint64_t(0xbc7138129bf065da)
    Which is: 13578696004075546074
  xxh3_64bits(ArrayRef(a, size_t(129)))
    Which is: 14895047307860736276

Hmm, this is what I see for both SSE and AVX512
[----------] 2 tests from xxhashTest
[ RUN ] xxhashTest.Basic
[ OK ] xxhashTest.Basic (0 ms)
[ RUN ] xxhashTest.xxh3
[ OK ] xxhashTest.xxh3 (0 ms)
[----------] 2 tests from xxhashTest (0 ms total)

Can you provide your build flags? Also, did you try upstream compiler or downstream?

@alexey-bataev
Copy link
Member Author

Was able to reproduce it in one of the configs, will fix it.

alexey-bataev added a commit that referenced this pull request Mar 8, 2024
This improves overall analysis for minbitwidth in SLP. It allows to
analyze the trees with store/insertelement root nodes. Also, instead of
using single minbitwidth, detected from the very first analysis stage,
it tries to detect the best one for each trunc/ext subtree in the graph
and use it for the subtree.
Results in better code and less vector register pressure.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test    92549.00    92609.00  0.1%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   663381.00   663493.00  0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   663381.00   663493.00  0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   307182.00   307214.00  0.0%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1394420.00  1394484.00  0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1394420.00  1394484.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2040257.00  2040273.00  0.0%

                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12396098.00 12395858.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   909944.00   909768.00 -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16
x i16> instead of <16 x i32>, also zext/trunc are removed. In other
places last vector zext/sext removed and replaced by
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce or <4 x i32> replaced by
reduce or <4 x i8>
Spec2017/imagick - Removed extra zext from 2 packs of the operations.
Spec2017/parest - Removed extra zext, replaced by extractelement+scalar
zext
Spec2017/blender - the whole bunch of vector zext/sext replaced by
extractelement+scalar zext/sext, some extra code vectorized in smaller
types.
Spec2006/gobmk - fixed cost estimation, some small code remains scalar.

Original Pull Request: #84334

The patch has the same functionality (no test changes, no changes in
benchmarks) as the original patch, just has some compile time
improvements + fixes for xxhash unittest, discovered earlier in the
previous version of the patch.

Reviewers: 

Pull Request: #84536
alexey-bataev added a commit that referenced this pull request Mar 14, 2024
This improves overall analysis for minbitwidth in SLP. It allows to
analyze the trees with store/insertelement root nodes. Also, instead of
using single minbitwidth, detected from the very first analysis stage,
it tries to detect the best one for each trunc/ext subtree in the graph
and use it for the subtree.
Results in better code and less vector register pressure.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test    92549.00    92609.00  0.1%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   663381.00   663493.00  0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   663381.00   663493.00  0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   307182.00   307214.00  0.0%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1394420.00  1394484.00  0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1394420.00  1394484.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2040257.00  2040273.00  0.0%

                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12396098.00 12395858.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   909944.00   909768.00 -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16
x i16> instead of <16 x i32>, also zext/trunc are removed. In other
places last vector zext/sext removed and replaced by
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce or <4 x i32> replaced by
reduce or <4 x i8>
Spec2017/imagick - Removed extra zext from 2 packs of the operations.
Spec2017/parest - Removed extra zext, replaced by extractelement+scalar
zext
Spec2017/blender - the whole bunch of vector zext/sext replaced by
extractelement+scalar zext/sext, some extra code vectorized in smaller
types.
Spec2006/gobmk - fixed cost estimation, some small code remains scalar.

Original Pull Request: #84334

The patch has the same functionality (no test changes, no changes in
benchmarks) as the original patch, just has some compile time
improvements + fixes for xxhash unittest, discovered earlier in the
previous version of the patch.

Reviewers:

Pull Request: #84536
alexey-bataev added a commit that referenced this pull request Mar 19, 2024
This improves overall analysis for minbitwidth in SLP. It allows to
analyze the trees with store/insertelement root nodes. Also, instead of
using single minbitwidth, detected from the very first analysis stage,
it tries to detect the best one for each trunc/ext subtree in the graph
and use it for the subtree.
Results in better code and less vector register pressure.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test    92549.00    92609.00  0.1%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   663381.00   663493.00  0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   663381.00   663493.00  0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   307182.00   307214.00  0.0%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1394420.00  1394484.00  0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1394420.00  1394484.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2040257.00  2040273.00  0.0%

                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12396098.00 12395858.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   909944.00   909768.00 -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16
x i16> instead of <16 x i32>, also zext/trunc are removed. In other
places last vector zext/sext removed and replaced by
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce or <4 x i32> replaced by
reduce or <4 x i8>
Spec2017/imagick - Removed extra zext from 2 packs of the operations.
Spec2017/parest - Removed extra zext, replaced by extractelement+scalar
zext
Spec2017/blender - the whole bunch of vector zext/sext replaced by
extractelement+scalar zext/sext, some extra code vectorized in smaller
types.
Spec2006/gobmk - fixed cost estimation, some small code remains scalar.

Original Pull Request: #84334

The patch has the same functionality (no test changes, no changes in
benchmarks) as the original patch, just has some compile time
improvements + fixes for xxhash unittest, discovered earlier in the
previous version of the patch.

Reviewers:

Pull Request: #84536
chencha3 pushed a commit to chencha3/llvm-project that referenced this pull request Mar 23, 2024
This improves overall analysis for minbitwidth in SLP. It allows to
analyze the trees with store/insertelement root nodes. Also, instead of
using single minbitwidth, detected from the very first analysis stage,
it tries to detect the best one for each trunc/ext subtree in the graph
and use it for the subtree.
Results in better code and less vector register pressure.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test    92549.00    92609.00  0.1%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   663381.00   663493.00  0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   663381.00   663493.00  0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   307182.00   307214.00  0.0%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1394420.00  1394484.00  0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1394420.00  1394484.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2040257.00  2040273.00  0.0%

                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12396098.00 12395858.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   909944.00   909768.00 -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16
x i16> instead of <16 x i32>, also zext/trunc are removed. In other
places last vector zext/sext removed and replaced by
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce or <4 x i32> replaced by
reduce or <4 x i8>
Spec2017/imagick - Removed extra zext from 2 packs of the operations.
Spec2017/parest - Removed extra zext, replaced by extractelement+scalar
zext
Spec2017/blender - the whole bunch of vector zext/sext replaced by
extractelement+scalar zext/sext, some extra code vectorized in smaller
types.
Spec2006/gobmk - fixed cost estimation, some small code remains scalar.

Original Pull Request: llvm#84334

The patch has the same functionality (no test changes, no changes in
benchmarks) as the original patch, just has some compile time
improvements + fixes for xxhash unittest, discovered earlier in the
previous version of the patch.

Reviewers:

Pull Request: llvm#84536
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants