VectorWiden pass to widen already vectorized instructions #67029
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-backend-aarch64

Changes

This pass allows us to widen already vectorized instructions to wider vector types. We encountered an issue with the current auto-vectorisation passes that would not allow us to implement the required functionality easily without a new pass. For example, the SME2 ADD instruction's first operand and resulting register are in multi-vector form with scalable vector types, while the third operand is just a regular scalable vector type:

```
add { z4.s, z5.s }, { z4.s, z5.s }, z3.s
```

With the loop-vectorizer pass, choosing a VF such that one of the operands and the result end up as a wider vector type could be difficult. With the new pass, we want to consider a group of operations, not a single one as SLP does, to make more profitable transformations, including, for example, loads and stores, arithmetic operations, etc. For example, we could combine these independent ADD operations in a single basic block into one on ARM SVE:

```c
typedef int v4si __attribute__ ((vector_size (16)));

void add(v4si *ptr, v4si *ptr1) {
  v4si a = *ptr; ptr++;
  v4si b = *ptr; ptr++;
  v4si c = *ptr; ptr++;
  v4si d = *ptr;
  *ptr1 = a+b; ptr1++;
  *ptr1 = c+d;
}
```

On ARM SVE hardware, we could produce:

```
ptrue p0.s, vl8
mov x8, #8 // =0x8
ld1w { z0.s }, p0/z, [x0]
ld1w { z1.s }, p0/z, [x0, x8, lsl #2]
add z0.s, z1.s, z0.s
st1w { z0.s }, p0, [x1]
ret
```

Currently, we have this output (https://godbolt.org/z/z5n78TWsc):

```
ldp q0, q1, [x0]
ldp q2, q3, [x0, #32]
add v0.4s, v1.4s, v0.4s
add v1.4s, v3.4s, v2.4s
stp q0, q1, [x1]
ret
```

I noticed similar opportunities where the SLP vectorizer does not choose a wider VF due to its implementation, for example with reductions, which can only handle types four or fewer elements wide. Currently, the pass supports widening only ADD and FP_ROUND operations, and only for scalable vector types.

Patch is 29.94 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67029.diff

14 Files Affected:
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 1ae595d2110457d..2a31f6ebe0ff210 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1669,6 +1669,10 @@ class TargetTransformInfo {
/// \return The maximum number of function arguments the target supports.
unsigned getMaxNumArgs() const;
+ /// \returns Whether the vector operations are a good candidate for widening.
+ bool considerForWidening(LLVMContext &Context,
+ ArrayRef<Instruction *> IL) const;
+
/// @}
private:
@@ -2035,6 +2039,8 @@ class TargetTransformInfo::Concept {
getVPLegalizationStrategy(const VPIntrinsic &PI) const = 0;
virtual bool hasArmWideBranch(bool Thumb) const = 0;
virtual unsigned getMaxNumArgs() const = 0;
+ virtual bool considerForWidening(LLVMContext &Context,
+ ArrayRef<Instruction *> IL) const = 0;
};
template <typename T>
@@ -2745,6 +2751,11 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
unsigned getMaxNumArgs() const override {
return Impl.getMaxNumArgs();
}
+
+ bool considerForWidening(LLVMContext &Context,
+ ArrayRef<Instruction *> IL) const override {
+ return Impl.considerForWidening(Context, IL);
+ }
};
template <typename T>
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 326c3130c6cff76..266b8da12c046cf 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -891,6 +891,11 @@ class TargetTransformInfoImplBase {
unsigned getMaxNumArgs() const { return UINT_MAX; }
+ bool considerForWidening(LLVMContext &Context,
+ ArrayRef<Instruction *> IL) const {
+ return false;
+ }
+
protected:
// Obtain the minimum required size to hold the value (without the sign)
// In case of a vector it returns the min required size for one element.
diff --git a/llvm/include/llvm/Transforms/Vectorize/VectorWiden.h b/llvm/include/llvm/Transforms/Vectorize/VectorWiden.h
new file mode 100644
index 000000000000000..6988785a92ce09c
--- /dev/null
+++ b/llvm/include/llvm/Transforms/Vectorize/VectorWiden.h
@@ -0,0 +1,25 @@
+//===--- VectorWiden.h - Combining Vector Operations to wider types ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===---------------------------------------------------------------------===//
+
+#ifndef LLVM_TRANSFORMS_VECTORIZE_VECTORWIDENING_H
+#define LLVM_TRANSFORMS_VECTORIZE_VECTORWIDENING_H
+
+#include "llvm/IR/PassManager.h"
+
+namespace llvm {
+
+class VectorWidenPass : public PassInfoMixin<VectorWidenPass> {
+public:
+ VectorWidenPass() {}
+
+ PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+
+} // namespace llvm
+
+#endif // LLVM_TRANSFORMS_VECTORIZE_VECTORWIDENING_H
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index c751d174a48ab1f..73e87330c8fbbd3 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1237,6 +1237,11 @@ bool TargetTransformInfo::hasActiveVectorLength(unsigned Opcode, Type *DataType,
return TTIImpl->hasActiveVectorLength(Opcode, DataType, Alignment);
}
+bool TargetTransformInfo::considerForWidening(
+ LLVMContext &Context, ArrayRef<Instruction *> IL) const {
+ return TTIImpl->considerForWidening(Context, IL);
+}
+
TargetTransformInfo::Concept::~Concept() = default;
TargetIRAnalysis::TargetIRAnalysis() : TTICallback(&getDefaultTTI) {}
diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index 985ff88139323c6..45c2529f92a5881 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -263,6 +263,7 @@
#include "llvm/Transforms/Vectorize/LoopVectorize.h"
#include "llvm/Transforms/Vectorize/SLPVectorizer.h"
#include "llvm/Transforms/Vectorize/VectorCombine.h"
+#include "llvm/Transforms/Vectorize/VectorWiden.h"
#include <optional>
using namespace llvm;
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index df9f14920f29161..2eef2f0a22d95d7 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -428,6 +428,7 @@ FUNCTION_PASS("tailcallelim", TailCallElimPass())
FUNCTION_PASS("typepromotion", TypePromotionPass(TM))
FUNCTION_PASS("unify-loop-exits", UnifyLoopExitsPass())
FUNCTION_PASS("vector-combine", VectorCombinePass())
+FUNCTION_PASS("vector-widen", VectorWidenPass())
FUNCTION_PASS("verify", VerifierPass())
FUNCTION_PASS("verify<domtree>", DominatorTreeVerifierPass())
FUNCTION_PASS("verify<loops>", LoopVerifierPass())
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index e6209ca12a48c31..d591e727dc1dbe6 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -2426,6 +2426,15 @@ InstructionCost AArch64TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
CostKind, I));
}
+ static const TypeConversionCostTblEntry SME2Tbl[] = {
+ { ISD::FP_ROUND, MVT::nxv8f16, MVT::nxv8f32, 1 }
+ };
+
+ if (ST->hasSME2())
+ if (const auto *Entry = ConvertCostTableLookup(
+ SME2Tbl, ISD, DstTy.getSimpleVT(), SrcTy.getSimpleVT()))
+ return AdjustCost(Entry->Cost);
+
if (const auto *Entry = ConvertCostTableLookup(ConversionTbl, ISD,
DstTy.getSimpleVT(),
SrcTy.getSimpleVT()))
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index a6baade412c77d2..6f5eaa3fadd4156 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -412,6 +412,27 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
return BaseT::getStoreMinimumVF(VF, ScalarMemTy, ScalarValTy);
}
+
+ bool considerForWidening(LLVMContext &Context,
+ ArrayRef<Instruction *> IL) const {
+ unsigned Opcode = IL[0]->getOpcode();
+ Type *Ty = IL[0]->getType();
+ if (!ST->hasSME2())
+ return false;
+ if (llvm::any_of(IL, [Opcode, Ty](Instruction *I) {
+ return (Opcode != I->getOpcode() || Ty != I->getType());
+ }))
+ return false;
+ if (Opcode == Instruction::FPTrunc &&
+ Ty == ScalableVectorType::get(Type::getHalfTy(Context), 4))
+ return true;
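+ // The two ADDs must share one operand so that they can map onto a single
+ // SME2 multi-vector ADD, where the shared value becomes the regular
+ // (non multi-vector) source register.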
+ if (Opcode == Instruction::Add &&
+ Ty == ScalableVectorType::get(Type::getInt32Ty(Context), 4) &&
+ (IL[0]->getOperand(1) == IL[1]->getOperand(1) ||
+ IL[0]->getOperand(0) == IL[1]->getOperand(0)))
+ return true;
+ return false;
+ }
};
} // end namespace llvm
diff --git a/llvm/lib/Transforms/Vectorize/CMakeLists.txt b/llvm/lib/Transforms/Vectorize/CMakeLists.txt
index 998dfd956575d3c..a1537bb1ffa632e 100644
--- a/llvm/lib/Transforms/Vectorize/CMakeLists.txt
+++ b/llvm/lib/Transforms/Vectorize/CMakeLists.txt
@@ -5,6 +5,7 @@ add_llvm_component_library(LLVMVectorize
SLPVectorizer.cpp
Vectorize.cpp
VectorCombine.cpp
+ VectorWiden.cpp
VPlan.cpp
VPlanHCFGBuilder.cpp
VPlanRecipes.cpp
diff --git a/llvm/lib/Transforms/Vectorize/VectorWiden.cpp b/llvm/lib/Transforms/Vectorize/VectorWiden.cpp
new file mode 100644
index 000000000000000..c7a590ea75430e5
--- /dev/null
+++ b/llvm/lib/Transforms/Vectorize/VectorWiden.cpp
@@ -0,0 +1,356 @@
+//===--- VectorWiden.cpp - Combining Vector Operations to wider types ---===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass tries to widen vector operations to a wider type. It finds
+// operations of a certain vector type that are independent of each other,
+// similar to what SLP does with scalars, working bottom-up. It detects
+// consecutive stores that can be combined into wider vector stores. Next,
+// it attempts to construct a vectorizable tree using the use-def chains.
+//
+//==------------------------------------------------------------------------==//
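+//
+// For example (illustrative), two independent adds on <vscale x 4 x i32> that
+// share an operand:
+//   %a = add <vscale x 4 x i32> %x0, %y
+//   %b = add <vscale x 4 x i32> %x1, %y
+// can be rewritten into a single <vscale x 8 x i32> add: the operands are
+// packed into double-width vectors with llvm.vector.insert, one wide add is
+// emitted, and the two halves are extracted back with llvm.vector.extract
+// for the original users.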
+
+#include "llvm/Transforms/Vectorize/VectorWiden.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallSet.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/Analysis/DependenceAnalysis.h"
+#include "llvm/Analysis/LoopAccessAnalysis.h"
+#include "llvm/Analysis/PostDominators.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
+#include "llvm/Analysis/VectorUtils.h"
+#include "llvm/CodeGen/ValueTypes.h"
+#include "llvm/IR/ConstantRange.h"
+#include "llvm/IR/Dominators.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/InstrTypes.h"
+#include "llvm/IR/Instruction.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/IR/Type.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Transforms/Utils/CodeMoverUtils.h"
+#include "llvm/Transforms/Utils/Local.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "vector-widen"
+
+// We consider independent operations for widening, with the possibility of
+// merging them into one, and also of widening stores if we find store
+// instructions later. We have to limit the distance between those independent
+// operations, or we might introduce excessive register pressure, etc.
+
+static cl::opt<unsigned>
+ MaxInstDistance("vw-max-instr-distance", cl::init(30), cl::Hidden,
+ cl::desc("Maximum distance between instructions to"
+ "consider to widen"));
+
+namespace {
+class VectorWiden {
+public:
+ using InstrList = SmallVector<Instruction *, 2>;
+ using ValueList = SmallVector<Value *, 2>;
+ VectorWiden(Function &F, const TargetTransformInfo &TTI, DominatorTree &DT,
+ DependenceInfo &DI, const PostDominatorTree &PDT)
+ : F(F), Builder(F.getContext()), TTI(TTI), DT(DT), DI(DI), PDT(PDT) {}
+
+ bool run();
+
+private:
+ Function &F;
+ IRBuilder<> Builder;
+ const TargetTransformInfo &TTI;
+ DominatorTree &DT;
+ DependenceInfo &DI;
+ const PostDominatorTree &PDT;
+ TargetLibraryInfo *TLI;
+
+ DenseSet<Instruction *> DeletedInstructions;
+
+ /// Checks if the instruction is marked for deletion.
+ bool isDeleted(Instruction *I) const { return DeletedInstructions.count(I); }
+
+ /// Removes an instruction from its block and eventually deletes it.
+ void eraseInstruction(Instruction *I) { DeletedInstructions.insert(I); }
+
+ bool processBB(BasicBlock &BB, LLVMContext &Context);
+
+ bool widenNode(ArrayRef<Instruction *> IL, LLVMContext &Context);
+
+ void widenFPTrunc(ArrayRef<Instruction *> IL);
+
+ void widenAdd(ArrayRef<Instruction *> IL, bool Reorder);
+
+ InstructionCost getOpCost(unsigned Opcode, Type *To, Type *From,
+ Instruction *I);
+};
+} // namespace
+
+void VectorWiden::widenFPTrunc(ArrayRef<Instruction *> IL) {
+ Instruction *I = IL[0];
+ Instruction *I1 = IL[1];
+ ScalableVectorType *RetOrigType = cast<ScalableVectorType>(I->getType());
+ ScalableVectorType *OrigType =
+ cast<ScalableVectorType>(I->getOperand(0)->getType());
+ ScalableVectorType *RetType =
+ ScalableVectorType::getDoubleElementsVectorType(RetOrigType);
+ ScalableVectorType *OpType =
+ ScalableVectorType::getDoubleElementsVectorType(OrigType);
+ Value *WideVec = UndefValue::get(OpType);
+ Builder.SetInsertPoint(I);
+ Function *InsertIntr = llvm::Intrinsic::getDeclaration(
+ F.getParent(), Intrinsic::vector_insert, {OpType, OrigType});
+ Value *Insert1 = Builder.CreateCall(
+ InsertIntr, {WideVec, I->getOperand(0), Builder.getInt64(0)});
+ Value *Insert2 = Builder.CreateCall(
+ InsertIntr, {Insert1, I1->getOperand(0), Builder.getInt64(4)});
+ Value *ResFPTrunc =
+ Builder.CreateCast(Instruction::FPTrunc, Insert2, RetType);
+ Function *ExtractIntr = llvm::Intrinsic::getDeclaration(
+ F.getParent(), Intrinsic::vector_extract, {RetOrigType, RetType});
+ if (!I->users().empty()) {
+ Value *Res =
+ Builder.CreateCall(ExtractIntr, {ResFPTrunc, Builder.getInt64(4)});
+ I->replaceAllUsesWith(Res);
+ }
+ if (!I1->users().empty()) {
+ Value *Res2 =
+ Builder.CreateCall(ExtractIntr, {ResFPTrunc, Builder.getInt64(0)});
+ I1->replaceAllUsesWith(Res2);
+ }
+}
+
+void VectorWiden::widenAdd(ArrayRef<Instruction *> IL, bool Reorder) {
+ Instruction *I = IL[0];
+ Instruction *I1 = IL[1];
+
+ Value *XHi = I->getOperand(0);
+ Value *XLo = I1->getOperand(0);
+ Value *YHi = I->getOperand(1);
+ Value *YLo = I1->getOperand(1);
+ if (Reorder) {
+ std::swap(XHi, YHi);
+ std::swap(XLo, YLo);
+ }
+
+ ScalableVectorType *RetOrigType = cast<ScalableVectorType>(I->getType());
+ ScalableVectorType *OrigType =
+ cast<ScalableVectorType>(I->getOperand(0)->getType());
+ ScalableVectorType *RetType =
+ ScalableVectorType::getDoubleElementsVectorType(RetOrigType);
+ ScalableVectorType *OpType =
+ ScalableVectorType::getDoubleElementsVectorType(OrigType);
+ Value *WideVec = UndefValue::get(OpType);
+ Builder.SetInsertPoint(I);
+ Function *InsertIntr = llvm::Intrinsic::getDeclaration(
+ F.getParent(), Intrinsic::vector_insert, {OpType, OrigType});
+ Value *X1 =
+ Builder.CreateCall(InsertIntr, {WideVec, XHi, Builder.getInt64(0)});
+ Value *X2 = Builder.CreateCall(InsertIntr, {X1, XLo, Builder.getInt64(4)});
+ Value *Y1 =
+ Builder.CreateCall(InsertIntr, {WideVec, YHi, Builder.getInt64(0)});
+ Value *Y2 = Builder.CreateCall(InsertIntr, {Y1, YLo, Builder.getInt64(4)});
+ Value *ResAdd = Builder.CreateAdd(X2, Y2);
+ Function *ExtractIntr = llvm::Intrinsic::getDeclaration(
+ F.getParent(), Intrinsic::vector_extract, {RetOrigType, RetType});
+ if (!I->users().empty()) {
+ Value *Res = Builder.CreateCall(ExtractIntr, {ResAdd, Builder.getInt64(0)});
+ I->replaceAllUsesWith(Res);
+ }
+ if (!I1->users().empty()) {
+ Value *Res2 =
+ Builder.CreateCall(ExtractIntr, {ResAdd, Builder.getInt64(4)});
+ I1->replaceAllUsesWith(Res2);
+ }
+}
+
+bool VectorWiden::widenNode(ArrayRef<Instruction *> IL, LLVMContext &Context) {
+ LLVM_DEBUG(dbgs() << "VW: widenNode: " << *IL[0] << " " << *IL[1] << "\n");
+ if (!TTI.considerForWidening(Context, IL))
+ return false;
+ if (IL[0] == IL[1])
+ return false;
+ if (IL[0]->getOpcode() != IL[1]->getOpcode())
+ return false;
+ // Ignore if the instructions live in different basic blocks.
+ if (IL[0]->getParent() != IL[1]->getParent())
+ return false;
+ // Ignore if the two instructions are too far apart.
+ if (abs(std::distance(IL[1]->getIterator(), IL[0]->getIterator())) >
+ MaxInstDistance)
+ return false;
+ // Bail out if either instruction has already been deleted.
+ if (isDeleted(IL[0]) || isDeleted(IL[1]))
+ return false;
+ if (IL[1] == IL[0]->getOperand(0) || IL[0] == IL[1]->getOperand(0))
+ return false;
+ if (IL[0]->getNumOperands() > 1 &&
+ (IL[1] == IL[0]->getOperand(1) || IL[0] == IL[1]->getOperand(1)))
+ return false;
+ if (!isSafeToMoveBefore(*IL[1], *IL[0], DT, &PDT, &DI))
+ return false;
+ switch (IL[0]->getOpcode()) {
+ case Instruction::FPTrunc: {
+ ScalableVectorType *RetOrigType =
+ cast<ScalableVectorType>(IL[0]->getType());
+ ScalableVectorType *OrigType =
+ cast<ScalableVectorType>(IL[0]->getOperand(0)->getType());
+ InstructionCost Cost =
+ getOpCost(Instruction::FPTrunc, RetOrigType, OrigType, IL[0]);
+ ScalableVectorType *RetType =
+ ScalableVectorType::getDoubleElementsVectorType(RetOrigType);
+ ScalableVectorType *OpType =
+ ScalableVectorType::getDoubleElementsVectorType(OrigType);
+ InstructionCost CostNew =
+ getOpCost(Instruction::FPTrunc, RetType, OpType, IL[0]);
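+ // Bail out if the single widened FPTrunc would be more expensive than the
+ // two narrow FPTruncs it replaces.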
+ if (2 * Cost < CostNew)
+ return false;
+ LLVM_DEBUG(dbgs() << "VW: Decided to widen FPTrunc, safe to merge : "
+ << *IL[0] << " with " << *IL[1] << "\n");
+ widenFPTrunc(IL);
+ return true;
+ }
+ case Instruction::Add: {
+ ScalableVectorType *OrigType =
+ cast<ScalableVectorType>(IL[0]->getOperand(0)->getType());
+ ScalableVectorType *OpType =
+ ScalableVectorType::getDoubleElementsVectorType(OrigType);
+ InstructionCost Cost =
+ getOpCost(Instruction::Add, OrigType, OrigType, IL[0]);
+ InstructionCost CostNew =
+ getOpCost(Instruction::Add, OpType, OpType, IL[0]);
+ if (2 * Cost < CostNew)
+ return false;
+ LLVM_DEBUG(dbgs() << "VW: Decided to widen Add, safe to merge : " << *IL[0]
+ << " with " << *IL[1] << "\n");
+ widenAdd(IL, IL[0]->getOperand(1) != IL[1]->getOperand(1));
+ return true;
+ }
+
+ default:
+ break;
+ }
+ return false;
+}
+
+InstructionCost VectorWiden::getOpCost(unsigned Opcode, Type *To, Type *From,
+ Instruction *I) {
+ InstructionCost Cost = 0;
+ TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+ switch (Opcode) {
+ case Instruction::FPTrunc: {
+ Cost = TTI.getCastInstrCost(Opcode, To, From, TTI::getCastContextHint(I),
+ CostKind, I);
+ break;
+ }
+ case Instruction::Add: {
+ unsigned OpIdx = isa<UnaryOperator>(I) ? 0 : 1;
+ TTI::OperandValueInfo Op1Info = TTI::getOperandInfo(I->getOperand(0));
+ TTI::OperandValueInfo Op2Info = TTI::getOperandInfo(I->getOperand(OpIdx));
+ SmallVector<const Value *> Operands(I->operand_values());
+ Cost = TTI.getArithmeticInstrCost(I->getOpcode(), To, CostKind, Op1Info,
+ Op2Info, Operands, I);
+ break;
+ }
+ default:
+ llvm_unreachable("Unknown instruction");
+ }
+ return Cost;
+}
+
+bool VectorWiden::processBB(BasicBlock &BB, LLVMContext &Context) {
+ DenseMap<unsigned, std::pair<InstrList, unsigned>> Operations;
+ unsigned Counter = 0;
+ for (BasicBlock::reverse_iterator IP(BB.rbegin()); IP != BB.rend();
+ ++IP, ++Counter) {
+ Instruction *I = &*IP;
+ unsigned OpFound = 0;
+
+ if (I->isDebugOrPseudoInst() || isDeleted(I))
+ continue;
+
+ switch (I->getOpcode()) {
+ default:
+ continue;
+ case Instruction::Add:
+ case Instruction::FPTrunc: {
+ if (Operations.find(I->getOpcode()) != Operations.end()) {
+ auto *OpRec = &Operations[I->getOpcode()];
+ // If the instructions are too far apart, drop the old instruction
+ // and reset the position to this instruction.
+ if (Counter - Operations[I->getOpcode()].second > MaxInstDistance) {
+ OpRec->second = Counter;
+ OpRec->first.clear();
+ OpRec->first.push_back(I);
+ } else {
+ OpRec->first.push_back(I);
+ OpFound = I->getOpcode();
+ }
+ } else {
+ Operations[I->getOpcode()] = {{I}, Counter};
+ }
+ break;
+ }
+ }
+ if (OpFound && Operations.find(OpFound) != Operations.end()) {
+ auto *OpRec = &Operations[OpFound];
+ for (Instruction *Op : OpRec->first)
+ LLVM_DEBUG(dbgs() << "VW Op to check : " << *Op << "\n");
+ if (!widenNode(OpRec->first, Context)) {
+ LLVM_DEBUG(dbgs() << "VW Unable to consturct the tree.\n");
+ OpRec->first.erase(OpRec->first.begin());
+ OpRec->second = Counter;
+ } else {
+ for (Instruction *Instr : OpRec->first)
+ eraseInstruction(Instr);
+ return true;
+ }
+ }
+ }
+ return false;
+}
+
+bool VectorWiden::run() {
+ bool Changed = false;
+ LLVMContext &Context = F.getContext();
+
+ LLVM_DEBUG(dbgs() << "VW Function:" << F.getName() << "\n");
+ for (BasicBlock &BB : F) {
+ LLVM_DEBUG(dbgs() << "VW BB:" << BB.getName() << ...
[truncated]
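The patch registers the pass as `vector-widen` in PassRegistry.def, so it should be runnable in isolation via `opt -passes=vector-widen`. As a minimal sketch (not part of the patch; everything except `VectorWidenPass` is standard new-pass-manager boilerplate), driving it from a custom C++ pipeline would look roughly like this:

```cpp
#include "llvm/IR/Module.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Transforms/Vectorize/VectorWiden.h"

using namespace llvm;

// Build a minimal new-PM pipeline that runs only VectorWiden on every
// function in the module.
static void runVectorWiden(Module &M) {
  LoopAnalysisManager LAM;
  FunctionAnalysisManager FAM;
  CGSCCAnalysisManager CGAM;
  ModuleAnalysisManager MAM;

  PassBuilder PB;
  PB.registerModuleAnalyses(MAM);
  PB.registerCGSCCAnalyses(CGAM);
  PB.registerFunctionAnalyses(FAM);
  PB.registerLoopAnalyses(LAM);
  PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

  FunctionPassManager FPM;
  FPM.addPass(VectorWidenPass());

  ModulePassManager MPM;
  MPM.addPass(createModuleToFunctionPassAdaptor(std::move(FPM)));
  MPM.run(M, MAM);
}
```

Note that the pass queries TargetTransformInfo, so the PassBuilder needs to be constructed with the AArch64 TargetMachine (as opt does) for the `considerForWidening` hook to ever return true; with the default TTI the pass makes no changes.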
For the example in the description, isn't the issue that the SLP vectorizer doesn't support generating code for scalable vectors? Or has this recently been implemented?
Scalable vectors are meaningless for SLP; it always operates on a constant number of scalars / a fixed vector size (equal to the number of scalars).
It's not clear from the description, but the godbolt example passes
Yes, but this is different. Here it looks like vector combining rather than vectorization. Can SLP do this? Yes. But only for vectorized values. For scalars it will always produce fixed vectors.
Have you looked at doing this in VectorCombine? |
SLP only considers scalars to vectorize, not already vectorized operations or scalable vectors.
Yes, VectorCombine deals with a limited number of operations, and usually dependent ones. Still, we want to widen a whole chain of operations that are independent from each other, from loads to stores, to build something like an SLP tree.
Just to clarify, you are specifically interested in the case where the programmer has already written manually vectorized code, and you want to improve it further? So there are no scalar instructions at some earlier point in the pipeline? |
Yes, that is correct. We are targeting SVE hardware with 256-bit registers in the example.
Yes, correct, there are no plans to consider scalar instructions for now.
To add a bit of context, the main motivator here is making use of the multi-vector instructions added in AArch64 SME2/SVE2p1, which can operate on 2 or 4 vectors at a time. For example, loop-vectorized code that has a UF=2 or UF=4 would be a natural input to this pass. However, there are also other use-cases to consider:
The point is that the instructions don't necessarily need to be part of the same expression. For SME2 the multi-vector instructions can just pair up two independent operations with the same opcode. A non-SVE use-case that comes to mind is the NEON load/store-pair (ldp/stp) instructions, which are currently paired up only after ISel.
3fb14e0 to 1191edc
✅ With the latest revision this PR passed the C/C++ code formatter.
1191edc to 2633f95
Left a bunch of comments which will hopefully clarify the code a bit more when addressed, because I'm still trying to make sense of how things work. Other than that, the code in this pass should conceptually distinguish between:
- Target-agnostic vector widening
- Target-specific choices about whether to widen
You've added a TargetTransformInfo interface for the latter, which is good, but the code in this pass makes various assumptions based on the implementation of `AArch64TargetTransformInfo::considerToWiden`. For example, the code in the generic pass assumes that the vector length is a `[vscale x] 4`, and the tests only seem to cover whatever AArch64 enables with its TTI hook.
Just a general observation that the last patch in this pull-request had quite a few code changes, but no changes to any of the tests. That suggests the test-coverage of the pass is too limited.
instruction handling.
added testcase for bad dependence case.
Arm is no longer pursuing this investigation.
This pass allows us to widen already vectorized instructions to wider vector types. We encountered an issue with the current auto-vectorisation passes that would not allow us to implement the required functionality easily without a new pass. For example, the SME2 ADD instruction's first operand and resulting register are in multi-vector form with scalable vector types, while the third operand is just a regular scalable vector type:

```
add { z4.s, z5.s }, { z4.s, z5.s }, z3.s
```

With the loop-vectorizer pass, choosing a VF such that one of the operands and the result end up as a wider vector type could be difficult. With the new pass, we want to consider a group of operations, not a single one as SLP does, to make more profitable transformations, including, for example, loads and stores, arithmetic operations, etc. For example, we could combine these independent ADD operations in a single basic block into one on ARM SVE:

```c
typedef int v4si __attribute__ ((vector_size (16)));

void add(v4si *ptr, v4si *ptr1) {
  v4si a = *ptr; ptr++;
  v4si b = *ptr; ptr++;
  v4si c = *ptr; ptr++;
  v4si d = *ptr;
  *ptr1 = a+b; ptr1++;
  *ptr1 = c+d;
}
```

On ARM SVE hardware, we could produce:

```
ptrue p0.s, vl8
mov x8, #8 // =0x8
ld1w { z0.s }, p0/z, [x0]
ld1w { z1.s }, p0/z, [x0, x8, lsl #2]
add z0.s, z1.s, z0.s
st1w { z0.s }, p0, [x1]
ret
```

Currently, we have this output (https://godbolt.org/z/z5n78TWsc):

```
ldp q0, q1, [x0]
ldp q2, q3, [x0, #32]
add v0.4s, v1.4s, v0.4s
add v1.4s, v3.4s, v2.4s
stp q0, q1, [x1]
ret
```

I noticed similar opportunities where the SLP vectorizer does not choose a wider VF due to its implementation, for example with reductions, which can only handle types four or fewer elements wide. Currently, the pass supports widening only ADD and FP_ROUND operations, and only for scalable vector types. This is just a starting point; I plan to build up the functionality incrementally to handle other instructions and make the algorithm more advanced over time.
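For readers who don't want to dig through the truncated diff, here is a condensed, standalone sketch of the ADD rewrite described above, mirroring `widenAdd` from the patch. It is illustrative only: the helper name `widenTwoAdds` is made up, and it assumes `A` and `B` are two independent `add <vscale x 4 x i32>` instructions in the same basic block.

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Widen two independent adds A and B of type <vscale x 4 x i32> into one
// <vscale x 8 x i32> add, then split the result back for the original users.
static void widenTwoAdds(Module &M, Instruction *A, Instruction *B) {
  auto *NarrowTy = cast<ScalableVectorType>(A->getType());
  auto *WideTy = ScalableVectorType::getDoubleElementsVectorType(NarrowTy);

  IRBuilder<> Builder(A);
  Function *Insert = Intrinsic::getDeclaration(
      &M, Intrinsic::vector_insert, {WideTy, NarrowTy});
  Function *Extract = Intrinsic::getDeclaration(
      &M, Intrinsic::vector_extract, {NarrowTy, WideTy});

  // Placeholder wide vector (the patch uses undef; poison is the modern choice).
  Value *Wide = PoisonValue::get(WideTy);

  // Pack the left-hand operands into the low/high halves of a wide vector.
  Value *X = Builder.CreateCall(
      Insert, {Wide, A->getOperand(0), Builder.getInt64(0)});
  X = Builder.CreateCall(Insert, {X, B->getOperand(0), Builder.getInt64(4)});

  // Do the same for the right-hand operands.
  Value *Y = Builder.CreateCall(
      Insert, {Wide, A->getOperand(1), Builder.getInt64(0)});
  Y = Builder.CreateCall(Insert, {Y, B->getOperand(1), Builder.getInt64(4)});

  // One wide add replaces the two narrow ones.
  Value *WideAdd = Builder.CreateAdd(X, Y);

  // Extract each half for the users of the original instructions; the original
  // adds are now dead and can be erased by the caller.
  A->replaceAllUsesWith(
      Builder.CreateCall(Extract, {WideAdd, Builder.getInt64(0)}));
  B->replaceAllUsesWith(
      Builder.CreateCall(Extract, {WideAdd, Builder.getInt64(4)}));
}
```

The FPTrunc case in the patch follows the same insert/widen/extract shape, with the wide operation being a cast instead of an add.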