
VectorWiden pass to widen already vectorized instructions #67029


Closed
wants to merge 5 commits

Conversation

dtemirbulatov
Contributor

@dtemirbulatov dtemirbulatov commented Sep 21, 2023

This pass allows us to widen already vectorized instructions to wider vector types. We encountered an issue with the current auto-vectorization passes that would not easily allow us to implement the required functionality without a new pass. For example, for the SME2 ADD instruction, the first operand and the resulting register are in multi-vector form with scalable vector types, while the third operand is just a regular scalable vector type:

add { z4.s, z5.s }, { z4.s, z5.s }, z3.s

With the loop-vectorizer pass, it could be difficult to choose a VF that makes one of the operands and the result a wider vector type. With the new pass, we want to consider a group of operations, not a single one as SLP does, to make more profitable transformations, including, for example, LOADs and STOREs, arithmetic operations, etc. For example, on ARM SVE we could combine these independent ADD operations in a single basic block into one:

typedef int v4si __attribute__ ((vector_size (16)));

void add(v4si *ptr, v4si *ptr1) {
  v4si a = *ptr;
  ptr++;
  v4si b = *ptr;
  ptr++;
  v4si c = *ptr;
  ptr++;
  v4si d = *ptr;
  *ptr1 = a+b;
  ptr1++;
  *ptr1 = c+d;
}

On ARM SVE hardware, we could produce:

        ptrue   p0.s, vl8
        mov     x8, #8             // =0x8
        ld1w    { z0.s }, p0/z, [x0]
        ld1w    { z1.s }, p0/z, [x0, x8, lsl #2]
        add     z0.s, z1.s, z0.s
        st1w    { z0.s }, p0, [x1]
        ret

Currently, we produce this output (https://godbolt.org/z/z5n78TWsc):

        ldp     q0, q1, [x0]
        ldp     q2, q3, [x0, #32]
        add     v0.4s, v1.4s, v0.4s
        add     v1.4s, v3.4s, v2.4s
        stp     q0, q1, [x1]
        ret

I noticed similar opportunities with the SLP vectorizer not choosing a wider VF due to its implementation, for example with reductions only able to handle types four or fewer elements wide. Currently, the pass supports widening only ADD and FP_ROUND operations, and only for scalable vector types. This is just a starting point; I plan to build up the functionality incrementally to handle other instructions and make the algorithm more advanced over time.
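
To make this concrete, below is a minimal IR sketch of the ADD case the pass targets (illustrative only; not taken from the patch's tests, and the function and value names are invented). It assumes an SME2-enabled AArch64 target and the shared-operand, <vscale x 4 x i32> condition checked by the AArch64 considerForWidening hook:

; Before: two independent adds of the same type sharing operand %y.
define void @two_adds(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b,
                      <vscale x 4 x i32> %y, ptr %p, ptr %q) {
  %add0 = add <vscale x 4 x i32> %a, %y
  %add1 = add <vscale x 4 x i32> %b, %y
  store <vscale x 4 x i32> %add0, ptr %p
  store <vscale x 4 x i32> %add1, ptr %q
  ret void
}

; After (roughly, following widenAdd in VectorWiden.cpp): the operands are
; packed into <vscale x 8 x i32> values with llvm.vector.insert, one wide add
; is emitted, and the two halves are split back out with llvm.vector.extract.
;   %x1   = call <vscale x 8 x i32> @llvm.vector.insert.nxv8i32.nxv4i32(<vscale x 8 x i32> undef, <vscale x 4 x i32> %a, i64 0)
;   %x2   = call <vscale x 8 x i32> @llvm.vector.insert.nxv8i32.nxv4i32(<vscale x 8 x i32> %x1, <vscale x 4 x i32> %b, i64 4)
;   %y1   = call <vscale x 8 x i32> @llvm.vector.insert.nxv8i32.nxv4i32(<vscale x 8 x i32> undef, <vscale x 4 x i32> %y, i64 0)
;   %y2   = call <vscale x 8 x i32> @llvm.vector.insert.nxv8i32.nxv4i32(<vscale x 8 x i32> %y1, <vscale x 4 x i32> %y, i64 4)
;   %wide = add <vscale x 8 x i32> %x2, %y2
;   %lo   = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv8i32(<vscale x 8 x i32> %wide, i64 0)
;   %hi   = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv8i32(<vscale x 8 x i32> %wide, i64 4)

The pass is registered as "vector-widen" in PassRegistry.def, so a sketch like this could be exercised with something along the lines of opt -passes=vector-widen -mtriple=aarch64 -mattr=+sme2 -S on an .ll file (exact flags may differ); -vw-max-instr-distance controls how far apart the paired instructions may be.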

@llvmbot
Member

llvmbot commented Sep 21, 2023

@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-aarch64

Changes



Patch is 29.94 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67029.diff

14 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+11)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+5)
  • (added) llvm/include/llvm/Transforms/Vectorize/VectorWiden.h (+25)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+5)
  • (modified) llvm/lib/Passes/PassBuilder.cpp (+1)
  • (modified) llvm/lib/Passes/PassRegistry.def (+1)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+9)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+21)
  • (modified) llvm/lib/Transforms/Vectorize/CMakeLists.txt (+1)
  • (added) llvm/lib/Transforms/Vectorize/VectorWiden.cpp (+356)
  • (added) llvm/test/Transforms/VectorWiden/add.ll (+37)
  • (added) llvm/test/Transforms/VectorWiden/fptrunc-bad-dep.ll (+45)
  • (added) llvm/test/Transforms/VectorWiden/fptrunc.ll (+40)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Transforms/Vectorize/BUILD.gn (+1)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 1ae595d2110457d..2a31f6ebe0ff210 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1669,6 +1669,10 @@ class TargetTransformInfo {
   /// \return The maximum number of function arguments the target supports.
   unsigned getMaxNumArgs() const;
 
+  /// \returns Whether the vector operations are good candidates for widening.
+  bool considerForWidening(LLVMContext &Context,
+                           ArrayRef<Instruction *> IL) const;
+
   /// @}
 
 private:
@@ -2035,6 +2039,8 @@ class TargetTransformInfo::Concept {
   getVPLegalizationStrategy(const VPIntrinsic &PI) const = 0;
   virtual bool hasArmWideBranch(bool Thumb) const = 0;
   virtual unsigned getMaxNumArgs() const = 0;
+  virtual bool considerForWidening(LLVMContext &Context,
+                                   ArrayRef<Instruction *> IL) const = 0;
 };
 
 template <typename T>
@@ -2745,6 +2751,11 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   unsigned getMaxNumArgs() const override {
     return Impl.getMaxNumArgs();
   }
+
+  bool considerForWidening(LLVMContext &Context,
+                           ArrayRef<Instruction *> IL) const override {
+    return Impl.considerForWidening(Context, IL);
+  }
 };
 
 template <typename T>
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 326c3130c6cff76..266b8da12c046cf 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -891,6 +891,11 @@ class TargetTransformInfoImplBase {
 
   unsigned getMaxNumArgs() const { return UINT_MAX; }
 
+  bool considerForWidening(LLVMContext &Context,
+                           ArrayRef<Instruction *> IL) const {
+    return false;
+  }
+
 protected:
   // Obtain the minimum required size to hold the value (without the sign)
   // In case of a vector it returns the min required size for one element.
diff --git a/llvm/include/llvm/Transforms/Vectorize/VectorWiden.h b/llvm/include/llvm/Transforms/Vectorize/VectorWiden.h
new file mode 100644
index 000000000000000..6988785a92ce09c
--- /dev/null
+++ b/llvm/include/llvm/Transforms/Vectorize/VectorWiden.h
@@ -0,0 +1,25 @@
+//===--- VectorWiden.h - Combining Vector Operations to wider types ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===---------------------------------------------------------------------===//
+
+#ifndef LLVM_TRANSFORMS_VECTORIZE_VECTORWIDENING_H
+#define LLVM_TRANSFORMS_VECTORIZE_VECTORWIDENING_H
+
+#include "llvm/IR/PassManager.h"
+
+namespace llvm {
+
+class VectorWidenPass : public PassInfoMixin<VectorWidenPass> {
+public:
+  VectorWidenPass() {}
+
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+
+} // namespace llvm
+
+#endif // LLVM_TRANSFORMS_VECTORIZE_VECTORWIDENING_H
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index c751d174a48ab1f..73e87330c8fbbd3 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1237,6 +1237,11 @@ bool TargetTransformInfo::hasActiveVectorLength(unsigned Opcode, Type *DataType,
   return TTIImpl->hasActiveVectorLength(Opcode, DataType, Alignment);
 }
 
+bool TargetTransformInfo::considerForWidening(
+    LLVMContext &Context, ArrayRef<Instruction *> IL) const {
+  return TTIImpl->considerForWidening(Context, IL);
+}
+
 TargetTransformInfo::Concept::~Concept() = default;
 
 TargetIRAnalysis::TargetIRAnalysis() : TTICallback(&getDefaultTTI) {}
diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index 985ff88139323c6..45c2529f92a5881 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -263,6 +263,7 @@
 #include "llvm/Transforms/Vectorize/LoopVectorize.h"
 #include "llvm/Transforms/Vectorize/SLPVectorizer.h"
 #include "llvm/Transforms/Vectorize/VectorCombine.h"
+#include "llvm/Transforms/Vectorize/VectorWiden.h"
 #include <optional>
 
 using namespace llvm;
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index df9f14920f29161..2eef2f0a22d95d7 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -428,6 +428,7 @@ FUNCTION_PASS("tailcallelim", TailCallElimPass())
 FUNCTION_PASS("typepromotion", TypePromotionPass(TM))
 FUNCTION_PASS("unify-loop-exits", UnifyLoopExitsPass())
 FUNCTION_PASS("vector-combine", VectorCombinePass())
+FUNCTION_PASS("vector-widen", VectorWidenPass())
 FUNCTION_PASS("verify", VerifierPass())
 FUNCTION_PASS("verify<domtree>", DominatorTreeVerifierPass())
 FUNCTION_PASS("verify<loops>", LoopVerifierPass())
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index e6209ca12a48c31..d591e727dc1dbe6 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -2426,6 +2426,15 @@ InstructionCost AArch64TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
             CostKind, I));
   }
 
+  static const TypeConversionCostTblEntry SME2Tbl[] = {
+    { ISD::FP_ROUND, MVT::nxv8f16, MVT::nxv8f32, 1 }
+  };
+
+  if (ST->hasSME2())
+    if (const auto *Entry = ConvertCostTableLookup(
+        SME2Tbl, ISD, DstTy.getSimpleVT(), SrcTy.getSimpleVT()))
+      return AdjustCost(Entry->Cost);
+
   if (const auto *Entry = ConvertCostTableLookup(ConversionTbl, ISD,
                                                  DstTy.getSimpleVT(),
                                                  SrcTy.getSimpleVT()))
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index a6baade412c77d2..6f5eaa3fadd4156 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -412,6 +412,27 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
 
     return BaseT::getStoreMinimumVF(VF, ScalarMemTy, ScalarValTy);
   }
+
+  bool considerForWidening(LLVMContext &Context,
+                           ArrayRef<Instruction *> IL) const {
+    unsigned Opcode = IL[0]->getOpcode();
+    Type *Ty = IL[0]->getType();
+    if (!ST->hasSME2())
+      return false;
+    if (llvm::any_of(IL, [Opcode, Ty](Instruction *I) {
+          return (Opcode != I->getOpcode() || Ty != I->getType());
+        }))
+      return false;
+    if (Opcode == Instruction::FPTrunc &&
+        Ty == ScalableVectorType::get(Type::getHalfTy(Context), 4))
+      return true;
+    if (Opcode == Instruction::Add &&
+        Ty == ScalableVectorType::get(Type::getInt32Ty(Context), 4) &&
+        (IL[0]->getOperand(1) == IL[1]->getOperand(1) ||
+         IL[0]->getOperand(0) == IL[1]->getOperand(0)))
+      return true;
+    return false;
+  }
 };
 
 } // end namespace llvm
diff --git a/llvm/lib/Transforms/Vectorize/CMakeLists.txt b/llvm/lib/Transforms/Vectorize/CMakeLists.txt
index 998dfd956575d3c..a1537bb1ffa632e 100644
--- a/llvm/lib/Transforms/Vectorize/CMakeLists.txt
+++ b/llvm/lib/Transforms/Vectorize/CMakeLists.txt
@@ -5,6 +5,7 @@ add_llvm_component_library(LLVMVectorize
   SLPVectorizer.cpp
   Vectorize.cpp
   VectorCombine.cpp
+  VectorWiden.cpp
   VPlan.cpp
   VPlanHCFGBuilder.cpp
   VPlanRecipes.cpp
diff --git a/llvm/lib/Transforms/Vectorize/VectorWiden.cpp b/llvm/lib/Transforms/Vectorize/VectorWiden.cpp
new file mode 100644
index 000000000000000..c7a590ea75430e5
--- /dev/null
+++ b/llvm/lib/Transforms/Vectorize/VectorWiden.cpp
@@ -0,0 +1,356 @@
+//===--- VectorWiden.cpp - Combining Vector Operations to wider types ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass tries to widen vector operations to a wider type. It finds
+// operations of a given vector type that are independent of each other,
+// working bottom-up as SLP does with scalars. It detects consecutive stores
+// that can be combined into a wider vector store. Next, it attempts to
+// construct a vectorizable tree using the use-def chains.
+//
+//==------------------------------------------------------------------------==//
+
+#include "llvm/Transforms/Vectorize/VectorWiden.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallSet.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/Analysis/DependenceAnalysis.h"
+#include "llvm/Analysis/LoopAccessAnalysis.h"
+#include "llvm/Analysis/PostDominators.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/Analysis/VectorUtils.h"
+#include "llvm/CodeGen/ValueTypes.h"
+#include "llvm/IR/ConstantRange.h"
+#include "llvm/IR/Dominators.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/InstrTypes.h"
+#include "llvm/IR/Instruction.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/IR/Type.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Transforms/Utils/CodeMoverUtils.h"
+#include "llvm/Transforms/Utils/Local.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "vector-widen"
+
+// We consider independent operations for widening, with the possibility of
+// merging them into one, and also widen stores if we later find suitable
+// store instructions. We have to limit the distance between those
+// independent operations or we might introduce bad register pressure, etc.
+
+static cl::opt<unsigned>
+    MaxInstDistance("vw-max-instr-distance", cl::init(30), cl::Hidden,
+                    cl::desc("Maximum distance between instructions to "
+                             "consider for widening"));
+
+namespace {
+class VectorWiden {
+public:
+  using InstrList = SmallVector<Instruction *, 2>;
+  using ValueList = SmallVector<Value *, 2>;
+  VectorWiden(Function &F, const TargetTransformInfo &TTI, DominatorTree &DT,
+              DependenceInfo &DI, const PostDominatorTree &PDT)
+      : F(F), Builder(F.getContext()), TTI(TTI), DT(DT), DI(DI), PDT(PDT) {}
+
+  bool run();
+
+private:
+  Function &F;
+  IRBuilder<> Builder;
+  const TargetTransformInfo &TTI;
+  DominatorTree &DT;
+  DependenceInfo &DI;
+  const PostDominatorTree &PDT;
+  TargetLibraryInfo *TLI;
+
+  DenseSet<Instruction *> DeletedInstructions;
+
+  /// Checks if the instruction is marked for deletion.
+  bool isDeleted(Instruction *I) const { return DeletedInstructions.count(I); }
+
+  /// Removes an instruction from its block and eventually deletes it.
+  void eraseInstruction(Instruction *I) { DeletedInstructions.insert(I); }
+
+  bool processBB(BasicBlock &BB, LLVMContext &Context);
+
+  bool widenNode(ArrayRef<Instruction *> IL, LLVMContext &Context);
+
+  void widenFPTrunc(ArrayRef<Instruction *> IL);
+
+  void widenAdd(ArrayRef<Instruction *> IL, bool Reorder);
+
+  InstructionCost getOpCost(unsigned Opcode, Type *To, Type *From,
+                            Instruction *I);
+};
+} // namespace
+
+void VectorWiden::widenFPTrunc(ArrayRef<Instruction *> IL) {
+  Instruction *I = IL[0];
+  Instruction *I1 = IL[1];
+  ScalableVectorType *RetOrigType = cast<ScalableVectorType>(I->getType());
+  ScalableVectorType *OrigType =
+      cast<ScalableVectorType>(I->getOperand(0)->getType());
+  ScalableVectorType *RetType =
+      ScalableVectorType::getDoubleElementsVectorType(RetOrigType);
+  ScalableVectorType *OpType =
+      ScalableVectorType::getDoubleElementsVectorType(OrigType);
+  Value *WideVec = UndefValue::get(OpType);
+  Builder.SetInsertPoint(I);
+  Function *InsertIntr = llvm::Intrinsic::getDeclaration(
+      F.getParent(), Intrinsic::vector_insert, {OpType, OrigType});
+  Value *Insert1 = Builder.CreateCall(
+      InsertIntr, {WideVec, I->getOperand(0), Builder.getInt64(0)});
+  Value *Insert2 = Builder.CreateCall(
+      InsertIntr, {Insert1, I1->getOperand(0), Builder.getInt64(4)});
+  Value *ResFPTrunc =
+      Builder.CreateCast(Instruction::FPTrunc, Insert2, RetType);
+  Function *ExtractIntr = llvm::Intrinsic::getDeclaration(
+      F.getParent(), Intrinsic::vector_extract, {RetOrigType, RetType});
+  if (!I->users().empty()) {
+    Value *Res =
+        Builder.CreateCall(ExtractIntr, {ResFPTrunc, Builder.getInt64(4)});
+    I->replaceAllUsesWith(Res);
+  }
+  if (!I1->users().empty()) {
+    Value *Res2 =
+        Builder.CreateCall(ExtractIntr, {ResFPTrunc, Builder.getInt64(0)});
+    I1->replaceAllUsesWith(Res2);
+  }
+}
+
+void VectorWiden::widenAdd(ArrayRef<Instruction *> IL, bool Reorder) {
+  Instruction *I = IL[0];
+  Instruction *I1 = IL[1];
+
+  Value *XHi = I->getOperand(0);
+  Value *XLo = I1->getOperand(0);
+  Value *YHi = I->getOperand(1);
+  Value *YLo = I1->getOperand(1);
+  if (Reorder) {
+    std::swap(XHi, YHi);
+    std::swap(XLo, YLo);
+  }
+
+  ScalableVectorType *RetOrigType = cast<ScalableVectorType>(I->getType());
+  ScalableVectorType *OrigType =
+      cast<ScalableVectorType>(I->getOperand(0)->getType());
+  ScalableVectorType *RetType =
+      ScalableVectorType::getDoubleElementsVectorType(RetOrigType);
+  ScalableVectorType *OpType =
+      ScalableVectorType::getDoubleElementsVectorType(OrigType);
+  Value *WideVec = UndefValue::get(OpType);
+  Builder.SetInsertPoint(I);
+  Function *InsertIntr = llvm::Intrinsic::getDeclaration(
+      F.getParent(), Intrinsic::vector_insert, {OpType, OrigType});
+  Value *X1 =
+      Builder.CreateCall(InsertIntr, {WideVec, XHi, Builder.getInt64(0)});
+  Value *X2 = Builder.CreateCall(InsertIntr, {X1, XLo, Builder.getInt64(4)});
+  Value *Y1 =
+      Builder.CreateCall(InsertIntr, {WideVec, YHi, Builder.getInt64(0)});
+  Value *Y2 = Builder.CreateCall(InsertIntr, {Y1, YLo, Builder.getInt64(4)});
+  Value *ResAdd = Builder.CreateAdd(X2, Y2);
+  Function *ExtractIntr = llvm::Intrinsic::getDeclaration(
+      F.getParent(), Intrinsic::vector_extract, {RetOrigType, RetType});
+  if (!I->users().empty()) {
+    Value *Res = Builder.CreateCall(ExtractIntr, {ResAdd, Builder.getInt64(0)});
+    I->replaceAllUsesWith(Res);
+  }
+  if (!I1->users().empty()) {
+    Value *Res2 =
+        Builder.CreateCall(ExtractIntr, {ResAdd, Builder.getInt64(4)});
+    I1->replaceAllUsesWith(Res2);
+  }
+}
+
+bool VectorWiden::widenNode(ArrayRef<Instruction *> IL, LLVMContext &Context) {
+  LLVM_DEBUG(dbgs() << "VW: widenNode: " << *IL[0] << " " << *IL[1] << "\n");
+  if (!TTI.considerForWidening(Context, IL))
+    return false;
+  if (IL[0] == IL[1])
+    return false;
+  if (IL[0]->getOpcode() != IL[1]->getOpcode())
+    return false;
+  // Ignore if the instructions live in different basic blocks.
+  if (IL[0]->getParent() != IL[1]->getParent())
+    return false;
+  // Ignore if the two instructions are too far apart.
+  if (abs(std::distance(IL[1]->getIterator(), IL[0]->getIterator())) >
+      MaxInstDistance)
+    return false;
+  // Bail out if either instruction has already been deleted.
+  if (isDeleted(IL[0]) || isDeleted(IL[1]))
+    return false;
+  if (IL[1] == IL[0]->getOperand(0) || IL[0] == IL[1]->getOperand(0))
+    return false;
+  if (IL[0]->getNumOperands() > 1 &&
+      (IL[1] == IL[0]->getOperand(1) || IL[0] == IL[1]->getOperand(1)))
+    return false;
+  if (!isSafeToMoveBefore(*IL[1], *IL[0], DT, &PDT, &DI))
+    return false;
+  switch (IL[0]->getOpcode()) {
+  case Instruction::FPTrunc: {
+    ScalableVectorType *RetOrigType =
+        cast<ScalableVectorType>(IL[0]->getType());
+    ScalableVectorType *OrigType =
+        cast<ScalableVectorType>(IL[0]->getOperand(0)->getType());
+    InstructionCost Cost =
+        getOpCost(Instruction::FPTrunc, RetOrigType, OrigType, IL[0]);
+    ScalableVectorType *RetType =
+        ScalableVectorType::getDoubleElementsVectorType(RetOrigType);
+    ScalableVectorType *OpType =
+        ScalableVectorType::getDoubleElementsVectorType(OrigType);
+    InstructionCost CostNew =
+        getOpCost(Instruction::FPTrunc, RetType, OpType, IL[0]);
+    if (2 * Cost < CostNew)
+      return false;
+    LLVM_DEBUG(dbgs() << "VW: Decided to widen FPTrunc, safe to merge : "
+                      << *IL[0] << " with  " << *IL[1] << "\n");
+    widenFPTrunc(IL);
+    return true;
+  }
+  case Instruction::Add: {
+    ScalableVectorType *OrigType =
+        cast<ScalableVectorType>(IL[0]->getOperand(0)->getType());
+    ScalableVectorType *OpType =
+        ScalableVectorType::getDoubleElementsVectorType(OrigType);
+    InstructionCost Cost =
+        getOpCost(Instruction::Add, OrigType, OrigType, IL[0]);
+    InstructionCost CostNew =
+        getOpCost(Instruction::Add, OpType, OpType, IL[0]);
+    if (2 * Cost < CostNew)
+      return false;
+    LLVM_DEBUG(dbgs() << "VW: Decided to widen Add, safe to merge : " << *IL[0]
+                      << " with  " << *IL[1] << "\n");
+    widenAdd(IL, IL[0]->getOperand(1) != IL[1]->getOperand(1));
+    return true;
+  }
+
+  default:
+    break;
+  }
+  return false;
+}
+
+InstructionCost VectorWiden::getOpCost(unsigned Opcode, Type *To, Type *From,
+                                       Instruction *I) {
+  InstructionCost Cost = 0;
+  TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+  switch (Opcode) {
+  case Instruction::FPTrunc: {
+    Cost = TTI.getCastInstrCost(Opcode, To, From, TTI::getCastContextHint(I),
+                                CostKind, I);
+    break;
+  }
+  case Instruction::Add: {
+    unsigned OpIdx = isa<UnaryOperator>(I) ? 0 : 1;
+    TTI::OperandValueInfo Op1Info = TTI::getOperandInfo(I->getOperand(0));
+    TTI::OperandValueInfo Op2Info = TTI::getOperandInfo(I->getOperand(OpIdx));
+    SmallVector<const Value *> Operands(I->operand_values());
+    Cost = TTI.getArithmeticInstrCost(I->getOpcode(), To, CostKind, Op1Info,
+                                      Op2Info, Operands, I);
+    break;
+  }
+  default:
+    llvm_unreachable("Unknown instruction");
+  }
+  return Cost;
+}
+
+bool VectorWiden::processBB(BasicBlock &BB, LLVMContext &Context) {
+  DenseMap<unsigned, std::pair<InstrList, unsigned>> Operations;
+  unsigned Counter = 0;
+  for (BasicBlock::reverse_iterator IP(BB.rbegin()); IP != BB.rend();
+       *IP++, ++Counter) {
+    Instruction *I = &*IP;
+    unsigned OpFound = 0;
+
+    if (I->isDebugOrPseudoInst() || isDeleted(I))
+      continue;
+
+    switch (I->getOpcode()) {
+    default:
+      continue;
+    case Instruction::Add:
+    case Instruction::FPTrunc: {
+      if (Operations.find(I->getOpcode()) != Operations.end()) {
+        auto *OpRec = &Operations[I->getOpcode()];
+        // If the instructions are too far apart then drop the old instruction
+        // and reset the position to this instruction.
+        if (Counter - Operations[I->getOpcode()].second > MaxInstDistance) {
+          OpRec->second = Counter;
+          OpRec->first.clear();
+          OpRec->first.push_back(I);
+        } else {
+          OpRec->first.push_back(I);
+          OpFound = I->getOpcode();
+        }
+      } else {
+        Operations[I->getOpcode()] = {{I}, Counter};
+      }
+      break;
+    }
+    }
+    if (OpFound && Operations.find(OpFound) != Operations.end()) {
+      auto *OpRec = &Operations[OpFound];
+      for (Instruction *Op : OpRec->first)
+        LLVM_DEBUG(dbgs() << "VW Op to check : " << *Op << "\n");
+      if (!widenNode(OpRec->first, Context)) {
+        LLVM_DEBUG(dbgs() << "VW Unable to construct the tree.\n");
+        OpRec->first.erase(OpRec->first.begin());
+        OpRec->second = Counter;
+      } else {
+        for (Instruction *Instr : OpRec->first)
+          eraseInstruction(Instr);
+        return true;
+      }
+    }
+  }
+  return false;
+}
+
+bool VectorWiden::run() {
+  bool Changed = false;
+  LLVMContext &Context = F.getContext();
+
+  LLVM_DEBUG(dbgs() << "VW Function:" << F.getName() << "\n");
+  for (BasicBlock &BB : F) {
+    LLVM_DEBUG(dbgs() << "VW BB:" << BB.getName() << ...
[truncated]

@dtemirbulatov dtemirbulatov changed the title This pass allows us to widen already vectorized instructions to wider… VectorWiden pass to widen already vectorized instructions Sep 21, 2023
@fhahn
Contributor

fhahn commented Sep 23, 2023

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

@alexey-bataev
Member

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

Scalable vectors are meaningless for SLP; it always operates on a constant number of scalars/a fixed vector size (equal to the number of scalars).

@fhahn
Contributor

fhahn commented Sep 25, 2023

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

Scalable vectors are meaningless for SLP; it always operates on a constant number of scalars/a fixed vector size (equal to the number of scalars).

It's not clear from the description, but the godbolt example passes -msve-vector-bits=256, which IIUC allows LLVM to assume the SVE vector registers are exactly 256 bits wide, allowing SLP to use that assumption. In the example, two 128-bit NEON adds are replaced with a single SVE add, which implies the code is only correct for systems where the registers have at least 256 bits, unless I am missing anything.

@alexey-bataev
Member

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

Scalable vectors are meaningless for SLP; it always operates on a constant number of scalars/a fixed vector size (equal to the number of scalars).

It's not clear from the description, but the godbolt example passes -msve-vector-bits=256, which IIUC allows LLVM to assume the SVE vector registers are exactly 256 bits wide, allowing SLP to use that assumption. In the example, two 128-bit NEON adds are replaced with a single SVE add, which implies the code is only correct for systems where the registers have at least 256 bits, unless I am missing anything.

Yes, but this is different. Here it looks like vector combining rather than vectorization. Can SLP do this? Yes, but only for vectorized values. For scalars it will always produce fixed vectors.

@RKSimon
Collaborator

RKSimon commented Sep 25, 2023

Have you looked at doing this in VectorCombine?

@dtemirbulatov
Contributor Author

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

SLP only considers scalars for vectorization, not already vectorized operations or scalable vectors.

@dtemirbulatov
Contributor Author

dtemirbulatov commented Sep 29, 2023

Have you looked at doing this in VectorCombine?

Yes, VectorCombine deals with a limited number of operations, and usually dependent ones. Here we want to widen a whole chain of operations that are independent of each other, from loads to stores, to build something like an SLP tree.

@nikic
Contributor

nikic commented Sep 29, 2023

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

SLP only considers scalars for vectorization, not already vectorized operations or scalable vectors.

Just to clarify, you are specifically interested in the case where the programmer has already written manually vectorized code, and you want to improve it further? So there are no scalar instructions at some earlier point in the pipeline?

@dtemirbulatov
Contributor Author

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

Scalable vectors are meaningless for SLP; it always operates on a constant number of scalars/a fixed vector size (equal to the number of scalars).

It's not clear from the description, but the godbolt example passes -msve-vector-bits=256, which IIUC allows LLVM to assume the SVE vector registers are exactly 256 bits wide, allowing SLP to use that assumption. In the example, two 128-bit NEON adds are replaced with a single SVE add, which implies the code is only correct for systems where the registers have at least 256 bits, unless I am missing anything.

Yes, that is correct. We are targeting 256-bit register SVE hardware in the example.

@dtemirbulatov
Contributor Author

dtemirbulatov commented Sep 29, 2023

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

SLP only considers scalars for vectorization, not already vectorized operations or scalable vectors.

Just to clarify, you are specifically interested in the case where the programmer has already written manually vectorized code, and you want to improve it further? So there are no scalar instructions at some earlier point in the pipeline?

Yes, correct; there are no plans to consider scalar instructions for now.

@sdesmalen-arm
Collaborator

To add a bit of context, the main motivator here is making use of the multi-vector instructions added in AArch64 SME2/SVE2p1, which can operate on 2 or 4 vectors at a time. For example, loop-vectorized code with UF=2 or UF=4 would be a natural input to this pass. However, there are also other use-cases to consider:

  • Manually written vector code using intrinsics, where the compiler could further optimise by bundling together single-vector instructions into multi-vector instructions.
  • SLP-vectorized code where certain vector operations can be bundled together into a single wider vector operation (this is not specific to SVE or scalable vectors).

The point is that the instructions don't necessarily need to be part of the same expression; for SME2 the multi-vector instructions can simply pair up two independent operations with the same opcode (see the sketch below). A non-SVE use-case that comes to mind is the NEON load/store-pair (ldp/stp) instructions, which are currently paired up only after ISel.
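
For illustration only (not from this PR; the function and value names are invented), a tiny fixed-width example of that "same opcode, independent operations" shape — no shared expression tree, just matching opcode and type:

; Two independent fptruncs with the same opcode and type, from unrelated
; expressions. A widening/bundling pass could turn them into a single
; fptrunc <8 x float> to <8 x half> (or an SME2 multi-vector operation in
; the scalable case), splitting the result back into the two original uses.
define void @pair(<4 x float> %a, <4 x float> %b, ptr %p, ptr %q) {
  %t0 = fptrunc <4 x float> %a to <4 x half>
  %t1 = fptrunc <4 x float> %b to <4 x half>
  store <4 x half> %t0, ptr %p
  store <4 x half> %t1, ptr %q
  ret void
}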

@github-actions

github-actions bot commented Oct 8, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

Collaborator

@sdesmalen-arm sdesmalen-arm left a comment


Left a bunch of comments which will hopefully clarify the code a bit more when addressed, because I'm still trying to make sense of how things work. Other than that, the code in this pass should conceptually distinguish between:

  • Target-agnostic vector widening
  • Target-specific choices about whether to widen

You've added a TargetTransformInfo interface for the latter, which is good, but the code in this pass makes various assumptions based on the implementation of AArch64TTIImpl::considerForWidening. For example, the code in the generic pass assumes that the element count is [vscale x] 4, and the tests only seem to cover whatever AArch64 enables with its TTI hook.

Collaborator

@sdesmalen-arm sdesmalen-arm left a comment


Just a general observation that the last patch in this pull-request had quite a few code changes, but no changes to any of the tests. That suggests the test-coverage of the pass is too limited.

@paulwalker-arm
Collaborator

Arm is no longer pursuing this investigation.

Labels
backend:AArch64 · llvm:analysis · llvm:transforms · vectorizers

9 participants