
VectorWiden pass to widen already vectorized instructions #67029


Closed
wants to merge 5 commits

Conversation

dtemirbulatov
Contributor

@dtemirbulatov dtemirbulatov commented Sep 21, 2023

This pass allows us to widen already vectorized instructions to wider vector types. We encountered an issue with the current auto-vectorization passes that would not easily allow us to implement the required functionality without a new pass. For example, for the SME2 ADD instruction, the first operand and the resulting register are in multi-vector form with scalable vector types, while the third operand is just a regular scalable vector type:

add { z4.s, z5.s }, { z4.s, z5.s }, z3.s

With the loop-vectorizer pass, it could be difficult to choose a VF that makes one of the operands and the result a wider vector type. With the new pass, we want to consider a group of operations, not a single one as SLP does, to make more profitable transformations, including, for example, LOADs and STOREs, arithmetic operations, etc. For example, on ARM SVE we could combine these independent ADD operations in a single basic block into one:

typedef int v4si __attribute__ ((vector_size (16)));

void add(v4si *ptr, v4si *ptr1) {
  v4si a = *ptr;
  ptr++;
  v4si b = *ptr;
  ptr++;
  v4si c = *ptr;
  ptr++;
  v4si d = *ptr;
  *ptr1 = a+b;
  ptr1++;
  *ptr1 = c+d;
}

On ARM SVE hardware, we could produce:

        ptrue   p0.s, vl8
        mov     x8, #8             // =0x8
        ld1w    { z0.s }, p0/z, [x0]
        ld1w    { z1.s }, p0/z, [x0, x8, lsl #2]
        add     z0.s, z1.s, z0.s
        st1w    { z0.s }, p0, [x1]
        ret

Currently, we produce this output (https://godbolt.org/z/z5n78TWsc):

        ldp     q0, q1, [x0]
        ldp     q2, q3, [x0, #32]
        add     v0.4s, v1.4s, v0.4s
        add     v1.4s, v3.4s, v2.4s
        stp     q0, q1, [x1]
        ret

I noticed similar opportunities with the SLP vectorizer not choosing a wider VF due to its implementation, for example with reductions only able to handle types four or fewer elements wide. Currently, the pass supports widening only ADD and FP_ROUND operations, and only for scalable vector types. This is just a starting point; I plan to build up the functionality incrementally to handle other instructions and make the algorithm more advanced over time.
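
To make this concrete, below is a minimal IR sketch of the ADD case the pass targets (illustrative only; not taken from the patch's tests, and the function and value names are invented). It assumes an SME2-enabled AArch64 target and the shared-operand, <vscale x 4 x i32> condition checked by the AArch64 considerForWidening hook:

; Before: two independent adds of the same type sharing operand %y.
define void @two_adds(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b,
                      <vscale x 4 x i32> %y, ptr %p, ptr %q) {
  %add0 = add <vscale x 4 x i32> %a, %y
  %add1 = add <vscale x 4 x i32> %b, %y
  store <vscale x 4 x i32> %add0, ptr %p
  store <vscale x 4 x i32> %add1, ptr %q
  ret void
}

; After (roughly, following widenAdd in VectorWiden.cpp): the operands are
; packed into <vscale x 8 x i32> values with llvm.vector.insert, one wide add
; is emitted, and the two halves are split back out with llvm.vector.extract.
;   %x1   = call <vscale x 8 x i32> @llvm.vector.insert.nxv8i32.nxv4i32(<vscale x 8 x i32> undef, <vscale x 4 x i32> %a, i64 0)
;   %x2   = call <vscale x 8 x i32> @llvm.vector.insert.nxv8i32.nxv4i32(<vscale x 8 x i32> %x1, <vscale x 4 x i32> %b, i64 4)
;   %y1   = call <vscale x 8 x i32> @llvm.vector.insert.nxv8i32.nxv4i32(<vscale x 8 x i32> undef, <vscale x 4 x i32> %y, i64 0)
;   %y2   = call <vscale x 8 x i32> @llvm.vector.insert.nxv8i32.nxv4i32(<vscale x 8 x i32> %y1, <vscale x 4 x i32> %y, i64 4)
;   %wide = add <vscale x 8 x i32> %x2, %y2
;   %lo   = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv8i32(<vscale x 8 x i32> %wide, i64 0)
;   %hi   = call <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv8i32(<vscale x 8 x i32> %wide, i64 4)

The pass is registered as "vector-widen" in PassRegistry.def, so a sketch like this could be exercised with something along the lines of opt -passes=vector-widen -mtriple=aarch64 -mattr=+sme2 -S on an .ll file (exact flags may differ); -vw-max-instr-distance controls how far apart the paired instructions may be.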

@llvmbot
Member

llvmbot commented Sep 21, 2023

@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-aarch64

Changes



Patch is 29.94 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67029.diff

14 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+11)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+5)
  • (added) llvm/include/llvm/Transforms/Vectorize/VectorWiden.h (+25)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+5)
  • (modified) llvm/lib/Passes/PassBuilder.cpp (+1)
  • (modified) llvm/lib/Passes/PassRegistry.def (+1)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+9)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+21)
  • (modified) llvm/lib/Transforms/Vectorize/CMakeLists.txt (+1)
  • (added) llvm/lib/Transforms/Vectorize/VectorWiden.cpp (+356)
  • (added) llvm/test/Transforms/VectorWiden/add.ll (+37)
  • (added) llvm/test/Transforms/VectorWiden/fptrunc-bad-dep.ll (+45)
  • (added) llvm/test/Transforms/VectorWiden/fptrunc.ll (+40)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Transforms/Vectorize/BUILD.gn (+1)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 1ae595d2110457d..2a31f6ebe0ff210 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1669,6 +1669,10 @@ class TargetTransformInfo {
   /// \return The maximum number of function arguments the target supports.
   unsigned getMaxNumArgs() const;
 
+  /// \returns Whether the vector operations are good candidates for widening.
+  bool considerForWidening(LLVMContext &Context,
+                           ArrayRef<Instruction *> IL) const;
+
   /// @}
 
 private:
@@ -2035,6 +2039,8 @@ class TargetTransformInfo::Concept {
   getVPLegalizationStrategy(const VPIntrinsic &PI) const = 0;
   virtual bool hasArmWideBranch(bool Thumb) const = 0;
   virtual unsigned getMaxNumArgs() const = 0;
+  virtual bool considerForWidening(LLVMContext &Context,
+                                   ArrayRef<Instruction *> IL) const = 0;
 };
 
 template <typename T>
@@ -2745,6 +2751,11 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   unsigned getMaxNumArgs() const override {
     return Impl.getMaxNumArgs();
   }
+
+  bool considerForWidening(LLVMContext &Context,
+                           ArrayRef<Instruction *> IL) const override {
+    return Impl.considerForWidening(Context, IL);
+  }
 };
 
 template <typename T>
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 326c3130c6cff76..266b8da12c046cf 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -891,6 +891,11 @@ class TargetTransformInfoImplBase {
 
   unsigned getMaxNumArgs() const { return UINT_MAX; }
 
+  bool considerForWidening(LLVMContext &Context,
+                           ArrayRef<Instruction *> IL) const {
+    return false;
+  }
+
 protected:
   // Obtain the minimum required size to hold the value (without the sign)
   // In case of a vector it returns the min required size for one element.
diff --git a/llvm/include/llvm/Transforms/Vectorize/VectorWiden.h b/llvm/include/llvm/Transforms/Vectorize/VectorWiden.h
new file mode 100644
index 000000000000000..6988785a92ce09c
--- /dev/null
+++ b/llvm/include/llvm/Transforms/Vectorize/VectorWiden.h
@@ -0,0 +1,25 @@
+//===--- VectorWiden.h - Combining Vector Operations to wider types ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===---------------------------------------------------------------------===//
+
+#ifndef LLVM_TRANSFORMS_VECTORIZE_VECTORWIDENING_H
+#define LLVM_TRANSFORMS_VECTORIZE_VECTORWIDENING_H
+
+#include "llvm/IR/PassManager.h"
+
+namespace llvm {
+
+class VectorWidenPass : public PassInfoMixin<VectorWidenPass> {
+public:
+  VectorWidenPass() {}
+
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+
+} // namespace llvm
+
+#endif // LLVM_TRANSFORMS_VECTORIZE_VECTORWIDENING_H
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index c751d174a48ab1f..73e87330c8fbbd3 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1237,6 +1237,11 @@ bool TargetTransformInfo::hasActiveVectorLength(unsigned Opcode, Type *DataType,
   return TTIImpl->hasActiveVectorLength(Opcode, DataType, Alignment);
 }
 
+bool TargetTransformInfo::considerForWidening(
+    LLVMContext &Context, ArrayRef<Instruction *> IL) const {
+  return TTIImpl->considerForWidening(Context, IL);
+}
+
 TargetTransformInfo::Concept::~Concept() = default;
 
 TargetIRAnalysis::TargetIRAnalysis() : TTICallback(&getDefaultTTI) {}
diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index 985ff88139323c6..45c2529f92a5881 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -263,6 +263,7 @@
 #include "llvm/Transforms/Vectorize/LoopVectorize.h"
 #include "llvm/Transforms/Vectorize/SLPVectorizer.h"
 #include "llvm/Transforms/Vectorize/VectorCombine.h"
+#include "llvm/Transforms/Vectorize/VectorWiden.h"
 #include <optional>
 
 using namespace llvm;
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index df9f14920f29161..2eef2f0a22d95d7 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -428,6 +428,7 @@ FUNCTION_PASS("tailcallelim", TailCallElimPass())
 FUNCTION_PASS("typepromotion", TypePromotionPass(TM))
 FUNCTION_PASS("unify-loop-exits", UnifyLoopExitsPass())
 FUNCTION_PASS("vector-combine", VectorCombinePass())
+FUNCTION_PASS("vector-widen", VectorWidenPass())
 FUNCTION_PASS("verify", VerifierPass())
 FUNCTION_PASS("verify<domtree>", DominatorTreeVerifierPass())
 FUNCTION_PASS("verify<loops>", LoopVerifierPass())
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index e6209ca12a48c31..d591e727dc1dbe6 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -2426,6 +2426,15 @@ InstructionCost AArch64TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
             CostKind, I));
   }
 
+  static const TypeConversionCostTblEntry SME2Tbl[] = {
+    { ISD::FP_ROUND, MVT::nxv8f16, MVT::nxv8f32, 1 }
+  };
+
+  if (ST->hasSME2())
+    if (const auto *Entry = ConvertCostTableLookup(
+        SME2Tbl, ISD, DstTy.getSimpleVT(), SrcTy.getSimpleVT()))
+      return AdjustCost(Entry->Cost);
+
   if (const auto *Entry = ConvertCostTableLookup(ConversionTbl, ISD,
                                                  DstTy.getSimpleVT(),
                                                  SrcTy.getSimpleVT()))
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index a6baade412c77d2..6f5eaa3fadd4156 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -412,6 +412,27 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
 
     return BaseT::getStoreMinimumVF(VF, ScalarMemTy, ScalarValTy);
   }
+
+  bool considerForWidening(LLVMContext &Context,
+                           ArrayRef<Instruction *> IL) const {
+    unsigned Opcode = IL[0]->getOpcode();
+    Type *Ty = IL[0]->getType();
+    if (!ST->hasSME2())
+      return false;
+    if (llvm::any_of(IL, [Opcode, Ty](Instruction *I) {
+          return (Opcode != I->getOpcode() || Ty != I->getType());
+        }))
+      return false;
+    if (Opcode == Instruction::FPTrunc &&
+        Ty == ScalableVectorType::get(Type::getHalfTy(Context), 4))
+      return true;
+    if (Opcode == Instruction::Add &&
+        Ty == ScalableVectorType::get(Type::getInt32Ty(Context), 4) &&
+        (IL[0]->getOperand(1) == IL[1]->getOperand(1) ||
+         IL[0]->getOperand(0) == IL[1]->getOperand(0)))
+      return true;
+    return false;
+  }
 };
 
 } // end namespace llvm
diff --git a/llvm/lib/Transforms/Vectorize/CMakeLists.txt b/llvm/lib/Transforms/Vectorize/CMakeLists.txt
index 998dfd956575d3c..a1537bb1ffa632e 100644
--- a/llvm/lib/Transforms/Vectorize/CMakeLists.txt
+++ b/llvm/lib/Transforms/Vectorize/CMakeLists.txt
@@ -5,6 +5,7 @@ add_llvm_component_library(LLVMVectorize
   SLPVectorizer.cpp
   Vectorize.cpp
   VectorCombine.cpp
+  VectorWiden.cpp
   VPlan.cpp
   VPlanHCFGBuilder.cpp
   VPlanRecipes.cpp
diff --git a/llvm/lib/Transforms/Vectorize/VectorWiden.cpp b/llvm/lib/Transforms/Vectorize/VectorWiden.cpp
new file mode 100644
index 000000000000000..c7a590ea75430e5
--- /dev/null
+++ b/llvm/lib/Transforms/Vectorize/VectorWiden.cpp
@@ -0,0 +1,356 @@
+//===--- VectorWiden.cpp - Combining Vector Operations to wider types ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass tries to widen vector operations to a wider type. It finds
+// operations of a given vector type that are independent of each other,
+// working bottom-up as SLP does with scalars. It detects consecutive stores
+// that can be combined into a wider vector store. Next, it attempts to
+// construct a vectorizable tree using the use-def chains.
+//
+//==------------------------------------------------------------------------==//
+
+#include "llvm/Transforms/Vectorize/VectorWiden.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallSet.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/Analysis/DependenceAnalysis.h"
+#include "llvm/Analysis/LoopAccessAnalysis.h"
+#include "llvm/Analysis/PostDominators.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/Analysis/VectorUtils.h"
+#include "llvm/CodeGen/ValueTypes.h"
+#include "llvm/IR/ConstantRange.h"
+#include "llvm/IR/Dominators.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/InstrTypes.h"
+#include "llvm/IR/Instruction.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/IR/Type.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Transforms/Utils/CodeMoverUtils.h"
+#include "llvm/Transforms/Utils/Local.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "vector-widen"
+
+// We consider independent operations for widening, with the possibility of
+// merging them into one, and also widen stores if we later find suitable
+// store instructions. We have to limit the distance between those
+// independent operations or we might introduce bad register pressure, etc.
+
+static cl::opt<unsigned>
+    MaxInstDistance("vw-max-instr-distance", cl::init(30), cl::Hidden,
+                    cl::desc("Maximum distance between instructions to "
+                             "consider for widening"));
+
+namespace {
+class VectorWiden {
+public:
+  using InstrList = SmallVector<Instruction *, 2>;
+  using ValueList = SmallVector<Value *, 2>;
+  VectorWiden(Function &F, const TargetTransformInfo &TTI, DominatorTree &DT,
+              DependenceInfo &DI, const PostDominatorTree &PDT)
+      : F(F), Builder(F.getContext()), TTI(TTI), DT(DT), DI(DI), PDT(PDT) {}
+
+  bool run();
+
+private:
+  Function &F;
+  IRBuilder<> Builder;
+  const TargetTransformInfo &TTI;
+  DominatorTree &DT;
+  DependenceInfo &DI;
+  const PostDominatorTree &PDT;
+  TargetLibraryInfo *TLI;
+
+  DenseSet<Instruction *> DeletedInstructions;
+
+  /// Checks if the instruction is marked for deletion.
+  bool isDeleted(Instruction *I) const { return DeletedInstructions.count(I); }
+
+  /// Removes an instruction from its block and eventually deletes it.
+  void eraseInstruction(Instruction *I) { DeletedInstructions.insert(I); }
+
+  bool processBB(BasicBlock &BB, LLVMContext &Context);
+
+  bool widenNode(ArrayRef<Instruction *> IL, LLVMContext &Context);
+
+  void widenFPTrunc(ArrayRef<Instruction *> IL);
+
+  void widenAdd(ArrayRef<Instruction *> IL, bool Reorder);
+
+  InstructionCost getOpCost(unsigned Opcode, Type *To, Type *From,
+                            Instruction *I);
+};
+} // namespace
+
+void VectorWiden::widenFPTrunc(ArrayRef<Instruction *> IL) {
+  Instruction *I = IL[0];
+  Instruction *I1 = IL[1];
+  ScalableVectorType *RetOrigType = cast<ScalableVectorType>(I->getType());
+  ScalableVectorType *OrigType =
+      cast<ScalableVectorType>(I->getOperand(0)->getType());
+  ScalableVectorType *RetType =
+      ScalableVectorType::getDoubleElementsVectorType(RetOrigType);
+  ScalableVectorType *OpType =
+      ScalableVectorType::getDoubleElementsVectorType(OrigType);
+  Value *WideVec = UndefValue::get(OpType);
+  Builder.SetInsertPoint(I);
+  Function *InsertIntr = llvm::Intrinsic::getDeclaration(
+      F.getParent(), Intrinsic::vector_insert, {OpType, OrigType});
+  Value *Insert1 = Builder.CreateCall(
+      InsertIntr, {WideVec, I->getOperand(0), Builder.getInt64(0)});
+  Value *Insert2 = Builder.CreateCall(
+      InsertIntr, {Insert1, I1->getOperand(0), Builder.getInt64(4)});
+  Value *ResFPTrunc =
+      Builder.CreateCast(Instruction::FPTrunc, Insert2, RetType);
+  Function *ExtractIntr = llvm::Intrinsic::getDeclaration(
+      F.getParent(), Intrinsic::vector_extract, {RetOrigType, RetType});
+  if (!I->users().empty()) {
+    Value *Res =
+        Builder.CreateCall(ExtractIntr, {ResFPTrunc, Builder.getInt64(4)});
+    I->replaceAllUsesWith(Res);
+  }
+  if (!I1->users().empty()) {
+    Value *Res2 =
+        Builder.CreateCall(ExtractIntr, {ResFPTrunc, Builder.getInt64(0)});
+    I1->replaceAllUsesWith(Res2);
+  }
+}
+
+void VectorWiden::widenAdd(ArrayRef<Instruction *> IL, bool Reorder) {
+  Instruction *I = IL[0];
+  Instruction *I1 = IL[1];
+
+  Value *XHi = I->getOperand(0);
+  Value *XLo = I1->getOperand(0);
+  Value *YHi = I->getOperand(1);
+  Value *YLo = I1->getOperand(1);
+  if (Reorder) {
+    std::swap(XHi, YHi);
+    std::swap(XLo, YLo);
+  }
+
+  ScalableVectorType *RetOrigType = cast<ScalableVectorType>(I->getType());
+  ScalableVectorType *OrigType =
+      cast<ScalableVectorType>(I->getOperand(0)->getType());
+  ScalableVectorType *RetType =
+      ScalableVectorType::getDoubleElementsVectorType(RetOrigType);
+  ScalableVectorType *OpType =
+      ScalableVectorType::getDoubleElementsVectorType(OrigType);
+  Value *WideVec = UndefValue::get(OpType);
+  Builder.SetInsertPoint(I);
+  Function *InsertIntr = llvm::Intrinsic::getDeclaration(
+      F.getParent(), Intrinsic::vector_insert, {OpType, OrigType});
+  Value *X1 =
+      Builder.CreateCall(InsertIntr, {WideVec, XHi, Builder.getInt64(0)});
+  Value *X2 = Builder.CreateCall(InsertIntr, {X1, XLo, Builder.getInt64(4)});
+  Value *Y1 =
+      Builder.CreateCall(InsertIntr, {WideVec, YHi, Builder.getInt64(0)});
+  Value *Y2 = Builder.CreateCall(InsertIntr, {Y1, YLo, Builder.getInt64(4)});
+  Value *ResAdd = Builder.CreateAdd(X2, Y2);
+  Function *ExtractIntr = llvm::Intrinsic::getDeclaration(
+      F.getParent(), Intrinsic::vector_extract, {RetOrigType, RetType});
+  if (!I->users().empty()) {
+    Value *Res = Builder.CreateCall(ExtractIntr, {ResAdd, Builder.getInt64(0)});
+    I->replaceAllUsesWith(Res);
+  }
+  if (!I1->users().empty()) {
+    Value *Res2 =
+        Builder.CreateCall(ExtractIntr, {ResAdd, Builder.getInt64(4)});
+    I1->replaceAllUsesWith(Res2);
+  }
+}
+
+bool VectorWiden::widenNode(ArrayRef<Instruction *> IL, LLVMContext &Context) {
+  LLVM_DEBUG(dbgs() << "VW: widenNode: " << *IL[0] << " " << *IL[1] << "\n");
+  if (!TTI.considerForWidening(Context, IL))
+    return false;
+  if (IL[0] == IL[1])
+    return false;
+  if (IL[0]->getOpcode() != IL[1]->getOpcode())
+    return false;
+  // Ignore if the instructions live in different basic blocks.
+  if (IL[0]->getParent() != IL[1]->getParent())
+    return false;
+  // Ignore if the two instructions are too far apart.
+  if (abs(std::distance(IL[1]->getIterator(), IL[0]->getIterator())) >
+      MaxInstDistance)
+    return false;
+  // Bail out if either instruction has already been deleted.
+  if (isDeleted(IL[0]) || isDeleted(IL[1]))
+    return false;
+  if (IL[1] == IL[0]->getOperand(0) || IL[0] == IL[1]->getOperand(0))
+    return false;
+  if (IL[0]->getNumOperands() > 1 &&
+      (IL[1] == IL[0]->getOperand(1) || IL[0] == IL[1]->getOperand(1)))
+    return false;
+  if (!isSafeToMoveBefore(*IL[1], *IL[0], DT, &PDT, &DI))
+    return false;
+  switch (IL[0]->getOpcode()) {
+  case Instruction::FPTrunc: {
+    ScalableVectorType *RetOrigType =
+        cast<ScalableVectorType>(IL[0]->getType());
+    ScalableVectorType *OrigType =
+        cast<ScalableVectorType>(IL[0]->getOperand(0)->getType());
+    InstructionCost Cost =
+        getOpCost(Instruction::FPTrunc, RetOrigType, OrigType, IL[0]);
+    ScalableVectorType *RetType =
+        ScalableVectorType::getDoubleElementsVectorType(RetOrigType);
+    ScalableVectorType *OpType =
+        ScalableVectorType::getDoubleElementsVectorType(OrigType);
+    InstructionCost CostNew =
+        getOpCost(Instruction::FPTrunc, RetType, OpType, IL[0]);
+    if (2 * Cost < CostNew)
+      return false;
+    LLVM_DEBUG(dbgs() << "VW: Decided to widen FPTrunc, safe to merge : "
+                      << *IL[0] << " with  " << *IL[1] << "\n");
+    widenFPTrunc(IL);
+    return true;
+  }
+  case Instruction::Add: {
+    ScalableVectorType *OrigType =
+        cast<ScalableVectorType>(IL[0]->getOperand(0)->getType());
+    ScalableVectorType *OpType =
+        ScalableVectorType::getDoubleElementsVectorType(OrigType);
+    InstructionCost Cost =
+        getOpCost(Instruction::Add, OrigType, OrigType, IL[0]);
+    InstructionCost CostNew =
+        getOpCost(Instruction::Add, OpType, OpType, IL[0]);
+    if (2 * Cost < CostNew)
+      return false;
+    LLVM_DEBUG(dbgs() << "VW: Decided to widen Add, safe to merge : " << *IL[0]
+                      << " with  " << *IL[1] << "\n");
+    widenAdd(IL, IL[0]->getOperand(1) != IL[1]->getOperand(1));
+    return true;
+  }
+
+  default:
+    break;
+  }
+  return false;
+}
+
+InstructionCost VectorWiden::getOpCost(unsigned Opcode, Type *To, Type *From,
+                                       Instruction *I) {
+  InstructionCost Cost = 0;
+  TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+  switch (Opcode) {
+  case Instruction::FPTrunc: {
+    Cost = TTI.getCastInstrCost(Opcode, To, From, TTI::getCastContextHint(I),
+                                CostKind, I);
+    break;
+  }
+  case Instruction::Add: {
+    unsigned OpIdx = isa<UnaryOperator>(I) ? 0 : 1;
+    TTI::OperandValueInfo Op1Info = TTI::getOperandInfo(I->getOperand(0));
+    TTI::OperandValueInfo Op2Info = TTI::getOperandInfo(I->getOperand(OpIdx));
+    SmallVector<const Value *> Operands(I->operand_values());
+    Cost = TTI.getArithmeticInstrCost(I->getOpcode(), To, CostKind, Op1Info,
+                                      Op2Info, Operands, I);
+    break;
+  }
+  default:
+    llvm_unreachable("Unknown instruction");
+  }
+  return Cost;
+}
+
+bool VectorWiden::processBB(BasicBlock &BB, LLVMContext &Context) {
+  DenseMap<unsigned, std::pair<InstrList, unsigned>> Operations;
+  unsigned Counter = 0;
+  for (BasicBlock::reverse_iterator IP(BB.rbegin()); IP != BB.rend();
+       *IP++, ++Counter) {
+    Instruction *I = &*IP;
+    unsigned OpFound = 0;
+
+    if (I->isDebugOrPseudoInst() || isDeleted(I))
+      continue;
+
+    switch (I->getOpcode()) {
+    default:
+      continue;
+    case Instruction::Add:
+    case Instruction::FPTrunc: {
+      if (Operations.find(I->getOpcode()) != Operations.end()) {
+        auto *OpRec = &Operations[I->getOpcode()];
+        // If the instructions are too far apart then drop the old instruction
+        // and reset the position to this instruction.
+        if (Counter - Operations[I->getOpcode()].second > MaxInstDistance) {
+          OpRec->second = Counter;
+          OpRec->first.clear();
+          OpRec->first.push_back(I);
+        } else {
+          OpRec->first.push_back(I);
+          OpFound = I->getOpcode();
+        }
+      } else {
+        Operations[I->getOpcode()] = {{I}, Counter};
+      }
+      break;
+    }
+    }
+    if (OpFound && Operations.find(OpFound) != Operations.end()) {
+      auto *OpRec = &Operations[OpFound];
+      for (Instruction *Op : OpRec->first)
+        LLVM_DEBUG(dbgs() << "VW Op to check : " << *Op << "\n");
+      if (!widenNode(OpRec->first, Context)) {
+        LLVM_DEBUG(dbgs() << "VW Unable to construct the tree.\n");
+        OpRec->first.erase(OpRec->first.begin());
+        OpRec->second = Counter;
+      } else {
+        for (Instruction *Instr : OpRec->first)
+          eraseInstruction(Instr);
+        return true;
+      }
+    }
+  }
+  return false;
+}
+
+bool VectorWiden::run() {
+  bool Changed = false;
+  LLVMContext &Context = F.getContext();
+
+  LLVM_DEBUG(dbgs() << "VW Function:" << F.getName() << "\n");
+  for (BasicBlock &BB : F) {
+    LLVM_DEBUG(dbgs() << "VW BB:" << BB.getName() << ...
[truncated]

@dtemirbulatov dtemirbulatov changed the title This pass allows us to widen already vectorized instructions to wider… VectorWiden pass to widen already vectorized instructions Sep 21, 2023
@fhahn
Contributor

fhahn commented Sep 23, 2023

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

@alexey-bataev
Member

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

Scalable vectors are meaningless for SLP; it always operates on a constant number of scalars/a fixed vector size (equal to the number of scalars).

@fhahn
Contributor

fhahn commented Sep 25, 2023

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

Scalable vectors are meaningless for SLP; it always operates on a constant number of scalars/a fixed vector size (equal to the number of scalars).

It's not clear from the description, but the godbolt example passes -msve-vector-bits=256, which IIUC allows LLVM to assume the SVE vector registers are exactly 256 bits wide, allowing SLP to use that assumption. In the example, two 128-bit NEON adds are replaced with a single SVE add, which implies the code is only correct for systems where the registers have at least 256 bits, unless I am missing anything.

@alexey-bataev
Member

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

Scalable vectors are meaningless for SLP; it always operates on a constant number of scalars/a fixed vector size (equal to the number of scalars).

It's not clear from the description, but the godbolt example passes -msve-vector-bits=256, which IIUC allows LLVM to assume the SVE vector registers are exactly 256 bits wide, allowing SLP to use that assumption. In the example, two 128-bit NEON adds are replaced with a single SVE add, which implies the code is only correct for systems where the registers have at least 256 bits, unless I am missing anything.

Yes, but this is different. Here it looks like vector combining rather than vectorization. Can SLP do this? Yes, but only for vectorized values. For scalars it will always produce fixed vectors.

@RKSimon
Collaborator

RKSimon commented Sep 25, 2023

Have you looked at doing this in VectorCombine?

@dtemirbulatov
Contributor Author

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

SLP only considers scalars for vectorization, not already vectorized operations or scalable vectors.

@dtemirbulatov
Contributor Author

dtemirbulatov commented Sep 29, 2023

Have you looked at doing this in VectorCombine?

Yes, VectorCombine deals with a limited number of operations, and usually dependent ones. Here we want to widen a whole chain of operations that are independent of each other, from loads to stores, to build something like an SLP tree.

@nikic
Contributor

nikic commented Sep 29, 2023

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

SLP only considers scalars for vectorization, not already vectorized operations or scalable vectors.

Just to clarify, you are specifically interested in the case where the programmer has already written manually vectorized code, and you want to improve it further? So there are no scalar instructions at some earlier point in the pipeline?

@dtemirbulatov
Contributor Author

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

Scalable vectors are meaningless for SLP; it always operates on a constant number of scalars/a fixed vector size (equal to the number of scalars).

It's not clear from the description, but the godbolt example passes -msve-vector-bits=256, which IIUC allows LLVM to assume the SVE vector registers are exactly 256 bits wide, allowing SLP to use that assumption. In the example, two 128-bit NEON adds are replaced with a single SVE add, which implies the code is only correct for systems where the registers have at least 256 bits, unless I am missing anything.

Yes, that is correct. We are targeting 256-bit register SVE hardware in the example.

@dtemirbulatov
Contributor Author

dtemirbulatov commented Sep 29, 2023

For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?

SLP only considers scalars for vectorization, not already vectorized operations or scalable vectors.

Just to clarify, you are specifically interested in the case where the programmer has already written manually vectorized code, and you want to improve it further? So there are no scalar instructions at some earlier point in the pipeline?

Yes, correct; there are no plans to consider scalar instructions for now.

@sdesmalen-arm
Collaborator

To add a bit of context, the main motivator here is making use of the multi-vector instructions added in AArch64 SME2/SVE2p1, which can operate on 2 or 4 vectors at a time. For example, loop-vectorized code with UF=2 or UF=4 would be a natural input to this pass. However, there are also other use-cases to consider:

  • Manually written vector code using intrinsics, where the compiler could further optimise by bundling together single-vector instructions into multi-vector instructions.
  • SLP-vectorized code where certain vector operations can be bundled together into a single wider vector operation (this is not specific to SVE or scalable vectors).

The point is that the instructions don't necessarily need to be part of the same expression; for SME2 the multi-vector instructions can simply pair up two independent operations with the same opcode (see the sketch below). A non-SVE use-case that comes to mind is the NEON load/store-pair (ldp/stp) instructions, which are currently paired up only after ISel.
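
For illustration only (not from this PR; the function and value names are invented), a tiny fixed-width example of that "same opcode, independent operations" shape — no shared expression tree, just matching opcode and type:

; Two independent fptruncs with the same opcode and type, from unrelated
; expressions. A widening/bundling pass could turn them into a single
; fptrunc <8 x float> to <8 x half> (or an SME2 multi-vector operation in
; the scalable case), splitting the result back into the two original uses.
define void @pair(<4 x float> %a, <4 x float> %b, ptr %p, ptr %q) {
  %t0 = fptrunc <4 x float> %a to <4 x half>
  %t1 = fptrunc <4 x float> %b to <4 x half>
  store <4 x half> %t0, ptr %p
  store <4 x half> %t1, ptr %q
  ret void
}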

@github-actions

github-actions bot commented Oct 8, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

Collaborator

@sdesmalen-arm sdesmalen-arm left a comment


Left a bunch of comments which will hopefully clarify the code a bit more when addressed, because I'm still trying to make sense of how things work. Other than that, the code in this pass should conceptually distinguish between:

  • Target-agnostic vector widening
  • Target-specific choices about whether to widen

You've added a TargetTransformInfo interface for the latter, which is good, but the code in this pass makes various assumptions based on the implementation of AArch64TTIImpl::considerForWidening. For example, the code in the generic pass assumes that the element count is [vscale x] 4, and the tests only seem to cover whatever AArch64 enables with its TTI hook.

Collaborator

@sdesmalen-arm sdesmalen-arm left a comment


Just a general observation that the last patch in this pull-request had quite a few code changes, but no changes to any of the tests. That suggests the test-coverage of the pass is too limited.

@paulwalker-arm
Collaborator

Arm is no longer pursuing this investigation.

Labels
backend:AArch64 · llvm:analysis · llvm:transforms · vectorizers

9 participants