-
Notifications
You must be signed in to change notification settings - Fork 10.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMDGPU] Add IR LiveReg type-based optimization #66838
base: main
Are you sure you want to change the base?
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu Changes: NOTE: This commit is part of a stack which spans across phabricator. The PR is meant only for the top of the stack ([AMDGPU] Add IR LiveReg type-based optimization). As suggested in #66134, this adds the IR level logic to coerce the type of illegal vectors which have live ranges that span across basic blocks. The issue is that local ISel will emit CopyToReg / CopyFromReg pairs for live ranges spanning basic blocks. For illegal vector types, the DAGBuilder will legalize by scalarizing the vector, then widening each scalar, and passing each scalar via a separate physical register. See https://godbolt.org/z/Y7MhcjGE8 for a demo of the issue. This feature identifies cases like these, and inserts bitcasts between the def of the illegal vector and the uses in different blocks. This results in avoiding the scalarization process and an ability to pack the bits into fewer registers -- for example, we now use 2 VGPRs for a v8i8 instead of 8. Patch is 637.69 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/66838.diff — 18 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/ByteProvider.h b/llvm/include/llvm/CodeGen/ByteProvider.h
index 3187b4e68c56f3a..99ae8607c0b2071 100644
--- a/llvm/include/llvm/CodeGen/ByteProvider.h
+++ b/llvm/include/llvm/CodeGen/ByteProvider.h
@@ -32,6 +32,11 @@ template <typename ISelOp> class ByteProvider {
ByteProvider(std::optional<ISelOp> Src, int64_t DestOffset, int64_t SrcOffset)
: Src(Src), DestOffset(DestOffset), SrcOffset(SrcOffset) {}
+ ByteProvider(std::optional<ISelOp> Src, int64_t DestOffset, int64_t SrcOffset,
+ std::optional<bool> IsSigned)
+ : Src(Src), DestOffset(DestOffset), SrcOffset(SrcOffset),
+ IsSigned(IsSigned) {}
+
// TODO -- use constraint in c++20
// Does this type correspond with an operation in selection DAG
template <typename T> class is_op {
@@ -61,6 +66,9 @@ template <typename ISelOp> class ByteProvider {
// DestOffset
int64_t SrcOffset = 0;
+ // Whether or not Src be treated as signed
+ std::optional<bool> IsSigned;
+
ByteProvider() = default;
static ByteProvider getSrc(std::optional<ISelOp> Val, int64_t ByteOffset,
@@ -70,6 +78,14 @@ template <typename ISelOp> class ByteProvider {
return ByteProvider(Val, ByteOffset, VectorOffset);
}
+ static ByteProvider getSrc(std::optional<ISelOp> Val, int64_t ByteOffset,
+ int64_t VectorOffset,
+ std::optional<bool> IsSigned) {
+ static_assert(is_op<ISelOp>().value,
+ "ByteProviders must contain an operation in selection DAG.");
+ return ByteProvider(Val, ByteOffset, VectorOffset, IsSigned);
+ }
+
static ByteProvider getConstantZero() {
return ByteProvider<ISelOp>(std::nullopt, 0, 0);
}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
index 4cce34bdeabcf44..b50379e98d0f6b5 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -106,6 +106,7 @@ class AMDGPUCodeGenPrepareImpl
Module *Mod = nullptr;
const DataLayout *DL = nullptr;
bool HasUnsafeFPMath = false;
+ bool UsesGlobalISel = false;
bool HasFP32DenormalFlush = false;
bool FlowChanged = false;
mutable Function *SqrtF32 = nullptr;
@@ -341,6 +342,85 @@ class AMDGPUCodeGenPrepare : public FunctionPass {
StringRef getPassName() const override { return "AMDGPU IR optimizations"; }
};
+class LiveRegConversion {
+private:
+ // The instruction which defined the original virtual register used across
+ // blocks
+ Instruction *LiveRegDef;
+ // The original type
+ Type *OriginalType;
+ // The desired type
+ Type *NewType;
+ // The instruction sequence that converts the virtual register, to be used
+ // instead of the original
+ std::optional<Instruction *> Converted;
+ // The builder used to build the conversion instruction
+ IRBuilder<> ConvertBuilder;
+
+public:
+ // The instruction which defined the original virtual register used across
+ // blocks
+ Instruction *getLiveRegDef() { return LiveRegDef; }
+ // The original type
+ Type *getOriginalType() { return OriginalType; }
+ // The desired type
+ Type *getNewType() { return NewType; }
+ void setNewType(Type *NewType) { this->NewType = NewType; }
+ // The instruction that conerts the virtual register, to be used instead of
+ // the original
+ std::optional<Instruction *> &getConverted() { return Converted; }
+ void setConverted(Instruction *Converted) { this->Converted = Converted; }
+ // The builder used to build the conversion instruction
+ IRBuilder<> &getConverBuilder() { return ConvertBuilder; }
+ // Do we have a instruction sequence which convert the original virtual
+ // register
+ bool hasConverted() { return Converted.has_value(); }
+
+ LiveRegConversion(Instruction *LiveRegDef, BasicBlock *InsertBlock,
+ BasicBlock::iterator InsertPt)
+ : LiveRegDef(LiveRegDef), OriginalType(LiveRegDef->getType()),
+ ConvertBuilder(InsertBlock, InsertPt) {}
+ LiveRegConversion(Instruction *LiveRegDef, Type *NewType,
+ BasicBlock *InsertBlock, BasicBlock::iterator InsertPt)
+ : LiveRegDef(LiveRegDef), OriginalType(LiveRegDef->getType()),
+ NewType(NewType), ConvertBuilder(InsertBlock, InsertPt) {}
+};
+
+class LiveRegOptimizer {
+private:
+ Module *Mod = nullptr;
+ // The scalar type to convert to
+ Type *ConvertToScalar;
+ // Holds the collection of PHIs with their pending new operands
+ SmallVector<std::pair<Instruction *,
+ SmallVector<std::pair<Instruction *, BasicBlock *>, 4>>,
+ 4>
+ PHIUpdater;
+
+public:
+ // Should the def of the instruction be converted if it is live across blocks
+ bool shouldReplaceUses(const Instruction &I);
+ // Convert the virtual register to the compatible vector of legal type
+ void convertToOptType(LiveRegConversion &LR);
+ // Convert the virtual register back to the original type, stripping away
+ // the MSBs in cases where there was an imperfect fit (e.g. v2i32 -> v7i8)
+ void convertFromOptType(LiveRegConversion &LR);
+ // Get a vector of desired scalar type that is compatible with the original
+ // vector. In cases where there is no bitsize equivalent using a legal vector
+ // type, we pad the MSBs (e.g. v7i8 -> v2i32)
+ Type *getCompatibleType(Instruction *InstToConvert);
+ // Find and replace uses of the virtual register in different block with a
+ // newly produced virtual register of legal type
+ bool replaceUses(Instruction &I);
+ // Replace the collected PHIs with newly produced incoming values. Replacement
+ // is only done if we have a replacement for each original incoming value.
+ bool replacePHIs();
+
+ LiveRegOptimizer(Module *Mod) : Mod(Mod) {
+ ConvertToScalar = Type::getInt32Ty(Mod->getContext());
+ }
+};
+
} // end anonymous namespace
bool AMDGPUCodeGenPrepareImpl::run(Function &F) {
@@ -358,6 +438,7 @@ bool AMDGPUCodeGenPrepareImpl::run(Function &F) {
Next = std::next(I);
MadeChange |= visit(*I);
+ I->getType();
if (Next != E) { // Control flow changed
BasicBlock *NextInstBB = Next->getParent();
@@ -369,9 +450,269 @@ bool AMDGPUCodeGenPrepareImpl::run(Function &F) {
}
}
}
+
+ // GlobalISel should directly use the values, and do not need to emit
+ // CopyTo/CopyFrom Regs across blocks
+ if (UsesGlobalISel)
+ return MadeChange;
+
+ // "Optimize" the virtual regs that cross basic block boundaries. In such
+ // cases, vectors of illegal types will be scalarized and widened, with each
+ // scalar living in its own physical register. The optimization converts the
+ // vectors to equivalent vectors of legal type (which are convereted back
+ // before uses in subsequenmt blocks), to pack the bits into fewer physical
+ // registers (used in CopyToReg/CopyFromReg pairs).
+ LiveRegOptimizer LRO(Mod);
+ for (auto &BB : F) {
+ for (auto &I : BB) {
+ if (!LRO.shouldReplaceUses(I))
+ continue;
+ MadeChange |= LRO.replaceUses(I);
+ }
+ }
+
+ MadeChange |= LRO.replacePHIs();
+ return MadeChange;
+}
+
+bool LiveRegOptimizer::replaceUses(Instruction &I) {
+ bool MadeChange = false;
+
+ struct ConvertUseInfo {
+ Instruction *Converted;
+ SmallVector<Instruction *, 4> Users;
+ };
+ DenseMap<BasicBlock *, ConvertUseInfo> UseConvertTracker;
+
+ LiveRegConversion FromLRC(
+ &I, I.getParent(),
+ static_cast<BasicBlock::iterator>(std::next(I.getIterator())));
+ FromLRC.setNewType(getCompatibleType(FromLRC.getLiveRegDef()));
+ for (auto IUser = I.user_begin(); IUser != I.user_end(); IUser++) {
+
+ if (auto UserInst = dyn_cast<Instruction>(*IUser)) {
+ if (UserInst->getParent() != I.getParent()) {
+ LLVM_DEBUG(dbgs() << *UserInst << "\n\tUses "
+ << *FromLRC.getOriginalType()
+ << " from previous block. Needs conversion\n");
+ convertToOptType(FromLRC);
+ if (!FromLRC.hasConverted())
+ continue;
+ // If it is a PHI node, just create and collect the new operand. We can
+ // only replace the PHI node once we have converted all the operands
+ if (auto PhiInst = dyn_cast<PHINode>(UserInst)) {
+ for (unsigned Idx = 0; Idx < PhiInst->getNumIncomingValues(); Idx++) {
+ auto IncVal = PhiInst->getIncomingValue(Idx);
+ if (&I == dyn_cast<Instruction>(IncVal)) {
+ auto IncBlock = PhiInst->getIncomingBlock(Idx);
+ auto PHIOps = find_if(
+ PHIUpdater,
+ [&UserInst](
+ std::pair<Instruction *,
+ SmallVector<
+ std::pair<Instruction *, BasicBlock *>, 4>>
+ &Entry) { return Entry.first == UserInst; });
+
+ if (PHIOps == PHIUpdater.end())
+ PHIUpdater.push_back(
+ {UserInst, {{*FromLRC.getConverted(), IncBlock}}});
+ else
+ PHIOps->second.push_back({*FromLRC.getConverted(), IncBlock});
+
+ break;
+ }
+ }
+ continue;
+ }
+
+ // Do not create multiple conversion sequences if there are multiple
+ // uses in the same block
+ if (UseConvertTracker.contains(UserInst->getParent())) {
+ UseConvertTracker[UserInst->getParent()].Users.push_back(UserInst);
+ LLVM_DEBUG(dbgs() << "\tUser already has access to converted def\n");
+ continue;
+ }
+
+ LiveRegConversion ToLRC(*FromLRC.getConverted(), I.getType(),
+ UserInst->getParent(),
+ static_cast<BasicBlock::iterator>(
+ UserInst->getParent()->getFirstNonPHIIt()));
+ convertFromOptType(ToLRC);
+ assert(ToLRC.hasConverted());
+ UseConvertTracker[UserInst->getParent()] = {*ToLRC.getConverted(),
+ {UserInst}};
+ }
+ }
+ }
+
+ // Replace uses of with in a separate loop that is not dependent upon the
+ // state of the uses
+ for (auto &Entry : UseConvertTracker) {
+ for (auto &UserInst : Entry.second.Users) {
+ LLVM_DEBUG(dbgs() << *UserInst
+ << "\n\tNow uses: " << *Entry.second.Converted << "\n");
+ UserInst->replaceUsesOfWith(&I, Entry.second.Converted);
+ MadeChange = true;
+ }
+ }
+ return MadeChange;
+}
+
+bool LiveRegOptimizer::replacePHIs() {
+ bool MadeChange = false;
+ for (auto Ele : PHIUpdater) {
+ auto ThePHINode = dyn_cast<PHINode>(Ele.first);
+ assert(ThePHINode);
+ auto NewPHINodeOps = Ele.second;
+ LLVM_DEBUG(dbgs() << "Attempting to replace: " << *ThePHINode << "\n");
+ // If we have conveted all the required operands, then do the replacement
+ if (ThePHINode->getNumIncomingValues() == NewPHINodeOps.size()) {
+ IRBuilder<> Builder(Ele.first);
+ auto NPHI = Builder.CreatePHI(NewPHINodeOps[0].first->getType(),
+ NewPHINodeOps.size());
+ for (auto IncVals : NewPHINodeOps) {
+ NPHI->addIncoming(IncVals.first, IncVals.second);
+ LLVM_DEBUG(dbgs() << " Using: " << *IncVals.first
+ << " For: " << IncVals.second->getName() << "\n");
+ }
+ LLVM_DEBUG(dbgs() << "Sucessfully replaced with " << *NPHI << "\n");
+ LiveRegConversion ToLRC(NPHI, ThePHINode->getType(),
+ ThePHINode->getParent(),
+ static_cast<BasicBlock::iterator>(
+ ThePHINode->getParent()->getFirstNonPHIIt()));
+ convertFromOptType(ToLRC);
+ assert(ToLRC.hasConverted());
+ Ele.first->replaceAllUsesWith(*ToLRC.getConverted());
+ // The old PHI is no longer used
+ ThePHINode->eraseFromParent();
+ MadeChange = true;
+ }
+ }
return MadeChange;
}
+Type *LiveRegOptimizer::getCompatibleType(Instruction *InstToConvert) {
+ auto OriginalType = InstToConvert->getType();
+ assert(OriginalType->getScalarSizeInBits() <=
+ ConvertToScalar->getScalarSizeInBits());
+ auto VTy = dyn_cast<VectorType>(OriginalType);
+ if (!VTy)
+ return ConvertToScalar;
+
+ auto OriginalSize =
+ VTy->getScalarSizeInBits() * VTy->getElementCount().getFixedValue();
+ auto ConvertScalarSize = ConvertToScalar->getScalarSizeInBits();
+ auto ConvertEltCount =
+ (OriginalSize + ConvertScalarSize - 1) / ConvertScalarSize;
+
+ return VectorType::get(Type::getIntNTy(Mod->getContext(), ConvertScalarSize),
+ llvm::ElementCount::getFixed(ConvertEltCount));
+}
+
+void LiveRegOptimizer::convertToOptType(LiveRegConversion &LR) {
+ if (LR.hasConverted()) {
+ LLVM_DEBUG(dbgs() << "\tAlready has converted def\n");
+ return;
+ }
+
+ auto VTy = dyn_cast<VectorType>(LR.getOriginalType());
+ assert(VTy);
+ auto NewVTy = dyn_cast<VectorType>(LR.getNewType());
+ assert(NewVTy);
+
+ auto V = static_cast<Value *>(LR.getLiveRegDef());
+ auto OriginalSize =
+ VTy->getScalarSizeInBits() * VTy->getElementCount().getFixedValue();
+ auto NewSize =
+ NewVTy->getScalarSizeInBits() * NewVTy->getElementCount().getFixedValue();
+
+ auto &Builder = LR.getConverBuilder();
+
+ // If there is a bitsize match, we can fit the old vector into a new vector of
+ // desired type
+ if (OriginalSize == NewSize) {
+ LR.setConverted(dyn_cast<Instruction>(Builder.CreateBitCast(V, NewVTy)));
+ LLVM_DEBUG(dbgs() << "\tConverted def to "
+ << *(*LR.getConverted())->getType() << "\n");
+ return;
+ }
+
+ // If there is a bitsize mismatch, we must use a wider vector
+ assert(NewSize > OriginalSize);
+ auto ExpandedVecElementCount =
+ llvm::ElementCount::getFixed(NewSize / VTy->getScalarSizeInBits());
+
+ SmallVector<int, 8> ShuffleMask;
+ for (unsigned I = 0; I < VTy->getElementCount().getFixedValue(); I++)
+ ShuffleMask.push_back(I);
+
+ for (uint64_t I = VTy->getElementCount().getFixedValue();
+ I < ExpandedVecElementCount.getFixedValue(); I++)
+ ShuffleMask.push_back(VTy->getElementCount().getFixedValue());
+
+ auto ExpandedVec =
+ dyn_cast<Instruction>(Builder.CreateShuffleVector(V, ShuffleMask));
+ LR.setConverted(
+ dyn_cast<Instruction>(Builder.CreateBitCast(ExpandedVec, NewVTy)));
+ LLVM_DEBUG(dbgs() << "\tConverted def to " << *(*LR.getConverted())->getType()
+ << "\n");
+ return;
+}
+
+void LiveRegOptimizer::convertFromOptType(LiveRegConversion &LRC) {
+ auto VTy = dyn_cast<VectorType>(LRC.getOriginalType());
+ assert(VTy);
+ auto NewVTy = dyn_cast<VectorType>(LRC.getNewType());
+ assert(NewVTy);
+
+ auto V = static_cast<Value *>(LRC.getLiveRegDef());
+ auto OriginalSize =
+ VTy->getScalarSizeInBits() * VTy->getElementCount().getFixedValue();
+ auto NewSize =
+ NewVTy->getScalarSizeInBits() * NewVTy->getElementCount().getFixedValue();
+
+ auto &Builder = LRC.getConverBuilder();
+
+ // If there is a bitsize match, we simply convert back to the original type
+ if (OriginalSize == NewSize) {
+ LRC.setConverted(dyn_cast<Instruction>(Builder.CreateBitCast(V, NewVTy)));
+ LLVM_DEBUG(dbgs() << "\tProduced for user: " << **LRC.getConverted()
+ << "\n");
+ return;
+ }
+
+ // If there is a bitsize mismatch, we have used a wider vector and must strip
+ // the MSBs to convert back to the original type
+ assert(OriginalSize > NewSize);
+ auto ExpandedVecElementCount = llvm::ElementCount::getFixed(
+ OriginalSize / NewVTy->getScalarSizeInBits());
+ auto ExpandedVT = VectorType::get(
+ Type::getIntNTy(Mod->getContext(), NewVTy->getScalarSizeInBits()),
+ ExpandedVecElementCount);
+ auto Converted = dyn_cast<Instruction>(
+ Builder.CreateBitCast(LRC.getLiveRegDef(), ExpandedVT));
+
+ auto NarrowElementCount = NewVTy->getElementCount().getFixedValue();
+ SmallVector<int, 8> ShuffleMask;
+ for (uint64_t I = 0; I < NarrowElementCount; I++)
+ ShuffleMask.push_back(I);
+
+ auto NarrowVec = dyn_cast<Instruction>(
+ Builder.CreateShuffleVector(Converted, ShuffleMask));
+ LRC.setConverted(dyn_cast<Instruction>(NarrowVec));
+ LLVM_DEBUG(dbgs() << "\tProduced for user: " << **LRC.getConverted() << "\n");
+ return;
+}
+
+bool LiveRegOptimizer::shouldReplaceUses(const Instruction &I) {
+ // Vectors of illegal types are copied across blocks in an efficient manner.
+ // They are scalarized and widened to legal scalars. In such cases, we can do
+ // better by using legal vector types
+ auto IType = I.getType();
+ return IType->isVectorTy() && IType->getScalarSizeInBits() < 16 &&
+ !I.getType()->getScalarType()->isPointerTy();
+}
+
unsigned AMDGPUCodeGenPrepareImpl::getBaseElementBitWidth(const Type *T) const {
assert(needsPromotionToI32(T) && "T does not need promotion to i32");
@@ -2230,6 +2571,7 @@ bool AMDGPUCodeGenPrepare::runOnFunction(Function &F) {
Impl.ST = &TM.getSubtarget<GCNSubtarget>(F);
Impl.AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
Impl.UA = &getAnalysis<UniformityInfoWrapperPass>().getUniformityInfo();
+ Impl.UsesGlobalISel = TM.Options.EnableGlobalISel;
auto *DTWP = getAnalysisIfAvailable<DominatorTreeWrapperPass>();
Impl.DT = DTWP ? &DTWP->getDomTree() : nullptr;
Impl.HasUnsafeFPMath = hasUnsafeFPMath(F);
@@ -2250,6 +2592,7 @@ PreservedAnalyses AMDGPUCodeGenPreparePass::run(Function &F,
Impl.UA = &FAM.getResult<UniformityInfoAnalysis>(F);
Impl.DT = FAM.getCachedResult<DominatorTreeAnalysis>(F);
Impl.HasUnsafeFPMath = hasUnsafeFPMath(F);
+ Impl.UsesGlobalISel = TM.Options.EnableGlobalISel;
SIModeRegisterDefaults Mode(F);
Impl.HasFP32DenormalFlush =
Mode.FP32Denormals == DenormalMode::getPreserveSign();
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 1c85ec3f9f5212f..403d61f1b836fab 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -10652,23 +10652,27 @@ SDValue SITargetLowering::performAndCombine(SDNode *N,
// performed.
static const std::optional<ByteProvider<SDValue>>
calculateSrcByte(const SDValue Op, uint64_t DestByte, uint64_t SrcIndex = 0,
+ std::optional<bool> IsSigned = std::nullopt,
unsigned Depth = 0) {
// We may need to recursively traverse a series of SRLs
if (Depth >= 6)
return std::nullopt;
- auto ValueSize = Op.getValueSizeInBits();
- if (ValueSize != 8 && ValueSize != 16 && ValueSize != 32)
+ if (Op.getValueSizeInBits() < 8)
return std::nullopt;
switch (Op->getOpcode()) {
case ISD::TRUNCATE: {
- return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, Depth + 1);
+ return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, IsSigned,
+ Depth + 1);
}
case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:
case ISD::SIGN_EXTEND_INREG: {
+ IsSigned = IsSigned.value_or(false) ||
+ Op->getOpcode() == ISD::SIGN_EXTEND ||
+ Op->getOpcode() == ISD::SIGN_EXTEND_INREG;
SDValue NarrowOp = Op->getOperand(0);
auto NarrowVT = NarrowOp.getValueType();
if (Op->getOpcode() == ISD::SIGN_EXTEND_INREG) {
@@ -10681,7 +10685,8 @@ calculateSrcByte(const SDValue Op, uint64_t DestByte, uint64_t SrcIndex = 0,
if (SrcIndex >= NarrowByteWidth)
return std::nullopt;
- return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, Depth + 1);
+ return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, IsSigned,
+ Depth + 1);
}
case ISD::SRA:
@@ -10697,12 +10702,38 @@ calculateSrcByte(const SDValue Op, uint64_t DestByte, uint64_t SrcIndex = 0,
SrcIndex += BitShift / 8;
- return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, Depth + 1);
+ return calculateSrcByte(Op->getOperand(0), DestByte, SrcIndex, IsSigned,
+ Depth + 1);
}
- default: {
+ case ISD::EXTRACT_VECTOR_ELT: {
+ auto IdxOp = dyn_cast<ConstantSDNode>(Op->getOperand(1));
+ if (!IdxOp)
+ return std::nullopt;
+ auto VecIdx = IdxOp->getZExtValue();
+ auto ScalarSize = Op.getScalarValueS...
[truncated]
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a rebase to show the already pushed dependencies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs rebase, most of this stuff was already merged?
4ee0c89
to
1c502bc
Compare
Just a rebase to reflect the current outstanding work -- I still need to address the other comments. |
1c502bc
to
71526ba
Compare
Change-Id: Ide8a46cdaf1d2d82cbd5296c998a5c8fd41fce80
71526ba
to
5efb955
Compare
Decoupling from "[AMDGPU]: Accept constant zero bytes in v_perm OrCombine" as that is taking longer than expected, and this has priority. As a result, in the exotic cases (e.g. v3i8), we may produce suboptimal codegen, but, for the normal case, codegen is much improved. |
…eCodeGenPrepare after CodeSinking + Integrate the loops Change-Id: Iac0baf0ab9e523bf303585b545f060293e6fb4f0
Change-Id: I1b461e3194a27e5e3c45500cae0ef5d4d6540d59
Change-Id: Ia56d86e1acf191d19f6fc43ae780de9bb5118ba9
Change-Id: I8eeacb7d4292a215bb0540e8e7dd12ab7547d058
Change-Id: I94504f26819c45de7496b39fee8031bcda0f29fb
Change-Id: I4383004240dc0365de6e67b12dc9ea5b609826d2
Change-Id: I07bf0cf4537bd3b148dc4ee3b785b989f0aac8b0
Change-Id: I83ae012da3118b0a40fb8a80be5029ce5bd2d78a
Change-Id: Idbfbbadfc1c3cee6cbd1a814b3446628dcce4394
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should include a test where a predecessor block has a repeated successor (i.e. it has multiple incoming switch cases)
Change-Id: I244784728ff1b4363ff066f8c5a6fa6d03c2a4d5
How does this compare to the usage of
|
Interesting -- I did not see that, thanks. That feature does have the same structure as the feature in this PR, but it seems to serve a different purpose: to help fold away bitcasts. It will insert bitcasts if it finds that the bitcasts defining the incoming values and the bitcast uses of the PHI are all of the same type. The target hook is invoked in special cases: when there are no bitcast defs/uses, or when the bitcasts are defined by a load (for a bitcast def) / used by a store (for a bitcast use of the PHI). It is mainly due to this special entry condition that the hook will not work for the feature in this PR. Beyond that special condition, the generic CodeGenPrepare optimizePhiType is also disabled for vector types. Of course we could remove the conditions on the target hook, but this feels like misusing the hook since the features are conceptually different, and we would have a phase-ordering problem (code sinking will fold the newly inserted casts). |
The implementation is still a lot simpler, even if the heuristic and placement are different. Does copying the same technique and using the existing ValueToValueMap work? |
IRBuilder<> ConvertBuilder; | ||
|
||
public: | ||
// The instruction which defined the original virtual register used across |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// for doxygen comments
ConversionCandidateInfo FromCCI(&I, I.getParent(), | ||
std::next(I.getIterator())); | ||
FromCCI.setNewType(getCompatibleType(FromCCI.getLiveRegDef())); | ||
for (auto IUser = I.user_begin(); IUser != I.user_end(); IUser++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a range loop
Type *OriginalType = InstToConvert->getType(); | ||
assert(OriginalType->getScalarSizeInBits() <= | ||
ConvertToScalar->getScalarSizeInBits()); | ||
VectorType *VTy = dyn_cast<VectorType>(OriginalType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should just use FixedVectorType, you assumed fixed below
We should check if PHI splitting is still needed in CGP with this |
typedef std::pair<Instruction *, BasicBlock *> IncomingPair; | ||
typedef std::pair<Instruction *, SmallVector<IncomingPair, 4>> PHIUpdateInfo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use using?
return Entry.first == UserInst; | ||
}); | ||
|
||
if (PHIOps == PHIUpdater.end()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Braces
if (auto PHI = dyn_cast<PHINode>(UserInst)) { | ||
for (unsigned Idx = 0; Idx < PHI->getNumIncomingValues(); Idx++) { | ||
Value *IncVal = PHI->getIncomingValue(Idx); | ||
if (&I == dyn_cast<Instruction>(IncVal)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't think you need the dyn_cast
NOTE: This commit is part of a stack which spans across phabricator. The PR is meant only for the top of the stack ([AMDGPU] Add IR LiveReg type-based optimization).
As suggested in #66134, this adds the IR level logic to coerce the type of illegal vectors which have live ranges that span across basic blocks.
The issue is that local ISel will emit CopyToReg / CopyFromReg pairs for live ranges spanning basic blocks. For illegal vector types, the DAGBuilder will legalize by scalarizing the vector, then widening each scalar, and passing each scalar via a separate physical register. See https://godbolt.org/z/Y7MhcjGE8 for a demo of the issue.
This feature identifies cases like these, and inserts bitcasts between the def of the illegal vector and the uses in different blocks. This results in avoiding the scalarization process and an ability to pack the bits into fewer registers -- for example, we now use 2 VGPR for a v8i8 instead of 8.