Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU] New AMDGPUInsertSingleUseVDST pass #72388

Merged
merged 6 commits into from Nov 24, 2023
Merged

[AMDGPU] New AMDGPUInsertSingleUseVDST pass #72388

merged 6 commits into from Nov 24, 2023

Conversation

jayfoad
Copy link
Contributor

@jayfoad jayfoad commented Nov 15, 2023

Add support for emitting GFX11.5 s_singleuse_vdst instructions. This is
a power saving feature whereby the compiler can annotate VALU
instructions whose results are known to have only a single use, so the
hardware can in some cases avoid writing the result back to VGPR RAM.

To begin with the pass is disabled by default because of one missing
feature: we need an exclusion list of opcodes that never qualify as
single-use producers and/or consumers. A future patch will implement
this and enable the pass by default.

Add support for emitting GFX11.5 s_singleuse_vdst instructions. This is
a power saving feature whereby the compiler can annotate VALU
instructions whose results are known to have only a single use, so the
hardware can in some cases avoid writing the result back to VGPR RAM.

To begin with the pass is disabled by default because of one missing
feature: we need an exclusion list of opcodes that never qualify as
single-use producers and/or consumers. A future patch will implement
this and enable the pass by default.
@llvmbot
Copy link
Collaborator

llvmbot commented Nov 15, 2023

@llvm/pr-subscribers-backend-amdgpu

Author: Jay Foad (jayfoad)

Changes

Add support for emitting GFX11.5 s_singleuse_vdst instructions. This is
a power saving feature whereby the compiler can annotate VALU
instructions whose results are known to have only a single use, so the
hardware can in some cases avoid writing the result back to VGPR RAM.

To begin with the pass is disabled by default because of one missing
feature: we need an exclusion list of opcodes that never qualify as
single-use producers and/or consumers. A future patch will implement
this and enable the pass by default.


Patch is 44.05 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/72388.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+3)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUInsertSingleUseVDST.cpp (+120)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+10)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/test/CodeGen/AMDGPU/insert-singleuse_vdst.mir (+1023)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 403014db56171ac..323560a46f31de2 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -335,6 +335,9 @@ extern char &SIModeRegisterID;
 void initializeAMDGPUInsertDelayAluPass(PassRegistry &);
 extern char &AMDGPUInsertDelayAluID;
 
+void initializeAMDGPUInsertSingleUseVDSTPass(PassRegistry &);
+extern char &AMDGPUInsertSingleUseVDSTID;
+
 void initializeSIInsertHardClausesPass(PassRegistry &);
 extern char &SIInsertHardClausesID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInsertSingleUseVDST.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInsertSingleUseVDST.cpp
new file mode 100644
index 000000000000000..88a77b22d989127
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInsertSingleUseVDST.cpp
@@ -0,0 +1,120 @@
+//===- AMDGPUInsertSingleUseVDST.cpp - Insert s_singleuse_vdst instructions ==//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file
+/// Insert s_singleuse_vdst instructions on GFX11.5+ to mark regions of VALU
+/// instructions that produce single-use VGPR values. If the value is forwarded
+/// to the consumer instruction prior to VGPR writeback, the hardware can
+/// then skip (kill) the VGPR write.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "GCNSubtarget.h"
+#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
+#include "SIInstrInfo.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/CodeGen/MachineBasicBlock.h"
+#include "llvm/CodeGen/MachineFunction.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineInstr.h"
+#include "llvm/CodeGen/MachineInstrBuilder.h"
+#include "llvm/CodeGen/MachineOperand.h"
+#include "llvm/CodeGen/Register.h"
+#include "llvm/CodeGen/TargetSubtargetInfo.h"
+#include "llvm/IR/DebugLoc.h"
+#include "llvm/MC/MCRegister.h"
+#include "llvm/Pass.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-insert-single-use-vdst"
+
+namespace {
+class AMDGPUInsertSingleUseVDST : public MachineFunctionPass {
+private:
+  const SIInstrInfo *SII;
+
+public:
+  static char ID;
+
+  AMDGPUInsertSingleUseVDST() : MachineFunctionPass(ID) {}
+
+  void emitSingleUseVDST(MachineInstr &MI) const {
+    BuildMI(*MI.getParent(), MI, DebugLoc(), SII->get(AMDGPU::S_SINGLEUSE_VDST))
+        .addImm(0x1);
+  }
+
+  bool runOnMachineFunction(MachineFunction &MF) override {
+    const auto &ST = MF.getSubtarget<GCNSubtarget>();
+    if (!ST.hasVGPRSingleUseHintInsts())
+      return false;
+
+    SII = ST.getInstrInfo();
+    const auto *TRI = MF.getSubtarget().getRegisterInfo();
+    bool InstructionEmitted = false;
+
+    for (MachineBasicBlock &MBB : MF) {
+      DenseMap<MCPhysReg, unsigned> RegisterUseCount; // TODO: MCRegUnits
+
+      // Handle boundaries at the end of basic block separately to avoid
+      // false positives. If they are live at the end of a basic block then
+      // assume it has more uses later on.
+      for (const auto &Liveouts : MBB.liveouts())
+        RegisterUseCount[Liveouts.PhysReg] = 2;
+
+      for (MachineInstr &MI : reverse(MBB.instrs())) {
+        // All registers in all operands need to be single use for an
+        // instruction to be marked as a single use producer.
+        bool AllProducerOperandsAreSingleUse = true;
+
+        for (const auto &Operand : MI.operands()) {
+          if (!Operand.isReg())
+            continue;
+          const auto Reg = Operand.getReg();
+
+          // Count the number of times each register is read.
+          if (Operand.readsReg())
+            RegisterUseCount[Reg]++;
+
+          // Do not attempt to optimise across exec mask changes.
+          if (MI.modifiesRegister(AMDGPU::EXEC, TRI)) {
+            for (auto &UsedReg : RegisterUseCount)
+              UsedReg.second = 2;
+          }
+
+          // If we are at the point where the register first became live,
+          // check if the operands are single use.
+          if (!MI.modifiesRegister(Reg, TRI))
+            continue;
+          if (RegisterUseCount[Reg] > 1)
+            AllProducerOperandsAreSingleUse = false;
+          // Reset uses count when a register is no longer live.
+          RegisterUseCount.erase(Reg);
+        }
+        if (AllProducerOperandsAreSingleUse && SIInstrInfo::isVALU(MI)) {
+          // TODO: Replace with candidate logging for instruction grouping
+          // later.
+          emitSingleUseVDST(MI);
+          InstructionEmitted = true;
+        }
+      }
+    }
+    return InstructionEmitted;
+  }
+};
+} // namespace
+
+char AMDGPUInsertSingleUseVDST::ID = 0;
+
+char &llvm::AMDGPUInsertSingleUseVDSTID = AMDGPUInsertSingleUseVDST::ID;
+
+INITIALIZE_PASS(AMDGPUInsertSingleUseVDST, DEBUG_TYPE,
+                "AMDGPU Insert SingleUseVDST", false, false)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 951ed9420594b19..0c38fa32c6f33a8 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -286,6 +286,12 @@ static cl::opt<bool> EnableSIModeRegisterPass(
   cl::init(true),
   cl::Hidden);
 
+// Enable GFX11.5+ s_singleuse_vdst insertion
+static cl::opt<bool>
+    EnableInsertSingleUseVDST("amdgpu-enable-single-use-vdst",
+                              cl::desc("Enable s_singleuse_vdst insertion"),
+                              cl::init(false), cl::Hidden);
+
 // Enable GFX11+ s_delay_alu insertion
 static cl::opt<bool>
     EnableInsertDelayAlu("amdgpu-enable-delay-alu",
@@ -404,6 +410,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeAMDGPURewriteUndefForPHILegacyPass(*PR);
   initializeAMDGPUUnifyMetadataPass(*PR);
   initializeSIAnnotateControlFlowPass(*PR);
+  initializeAMDGPUInsertSingleUseVDSTPass(*PR);
   initializeAMDGPUInsertDelayAluPass(*PR);
   initializeSIInsertHardClausesPass(*PR);
   initializeSIInsertWaitcntsPass(*PR);
@@ -1448,6 +1455,9 @@ void GCNPassConfig::addPreEmitPass() {
   // cases.
   addPass(&PostRAHazardRecognizerID);
 
+  if (isPassEnabled(EnableInsertSingleUseVDST, CodeGenOptLevel::Less))
+    addPass(&AMDGPUInsertSingleUseVDSTID);
+
   if (isPassEnabled(EnableInsertDelayAlu, CodeGenOptLevel::Less))
     addPass(&AMDGPUInsertDelayAluID);
 
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 0c0720890794b66..53a33f8210d2a84 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -77,6 +77,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUMacroFusion.cpp
   AMDGPUMCInstLower.cpp
   AMDGPUIGroupLP.cpp
+  AMDGPUInsertSingleUseVDST.cpp
   AMDGPUMIRFormatter.cpp
   AMDGPUOpenCLEnqueuedBlockLowering.cpp
   AMDGPUPerfHintAnalysis.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/insert-singleuse_vdst.mir b/llvm/test/CodeGen/AMDGPU/insert-singleuse_vdst.mir
new file mode 100644
index 000000000000000..8480f8d211ebac1
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/insert-singleuse_vdst.mir
@@ -0,0 +1,1023 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 2
+# RUN: llc -march=amdgcn -mcpu=gfx1150 -mattr=-wavefrontsize32,+wavefrontsize64 -verify-machineinstrs -run-pass=amdgpu-insert-single-use-vdst %s -o - | FileCheck %s
+
+---
+name: valu_dep_1
+tracksRegLiveness: true
+body: |
+  bb.0:
+    liveins: $vgpr0
+    ; CHECK-LABEL: name: valu_dep_1
+    ; CHECK: liveins: $vgpr0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+...
+
+---
+name: valu_dep_2
+tracksRegLiveness: true
+body: |
+  bb.0:
+    liveins: $vgpr0, $vgpr1
+    ; CHECK-LABEL: name: valu_dep_2
+    ; CHECK: liveins: $vgpr0, $vgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr1, $vgpr1, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+...
+
+---
+name: valu_dep_3
+tracksRegLiveness: true
+body: |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2
+    ; CHECK-LABEL: name: valu_dep_3
+    ; CHECK: liveins: $vgpr0, $vgpr1, $vgpr2
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr2 = V_ADD_U32_e32 $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr1, $vgpr1, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr2, $vgpr2, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+...
+
+---
+name: valu_dep_4
+tracksRegLiveness: true
+body: |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+    ; CHECK-LABEL: name: valu_dep_4
+    ; CHECK: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr2 = V_ADD_U32_e32 $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr3 = V_ADD_U32_e32 $vgpr3, $vgpr3, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr1, $vgpr1, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr2, $vgpr2, implicit $exec
+    $vgpr3 = V_ADD_U32_e32 $vgpr3, $vgpr3, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+...
+
+---
+name: valu_dep_5
+tracksRegLiveness: true
+body: |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4
+    ; CHECK-LABEL: name: valu_dep_5
+    ; CHECK: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr2 = V_ADD_U32_e32 $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr3 = V_ADD_U32_e32 $vgpr3, $vgpr3, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr4 = V_ADD_U32_e32 $vgpr4, $vgpr4, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr1, $vgpr1, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr2, $vgpr2, implicit $exec
+    $vgpr3 = V_ADD_U32_e32 $vgpr3, $vgpr3, implicit $exec
+    $vgpr4 = V_ADD_U32_e32 $vgpr4, $vgpr4, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+...
+
+---
+name: multiple_uses_1
+tracksRegLiveness: true
+body: |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2
+    ; CHECK-LABEL: name: multiple_uses_1
+    ; CHECK: liveins: $vgpr0, $vgpr1, $vgpr2
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+...
+
+---
+name: multiple_uses_2
+tracksRegLiveness: true
+body: |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2
+    ; CHECK-LABEL: name: multiple_uses_2
+    ; CHECK: liveins: $vgpr0, $vgpr1, $vgpr2
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+...
+
+---
+name: multiple_uses_3
+tracksRegLiveness: true
+body: |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+    ; CHECK-LABEL: name: multiple_uses_3
+    ; CHECK: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr3 = V_ADD_U32_e32 $vgpr1, $vgpr0, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr3 = V_ADD_U32_e32 $vgpr1, $vgpr0, implicit $exec
+...
+
+---
+name: basic_block_1
+tracksRegLiveness: true
+
+body: |
+  ; CHECK-LABEL: name: basic_block_1
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  ; CHECK-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  ; CHECK-NEXT:   $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  ; CHECK-NEXT:   S_SINGLEUSE_VDST 1
+  ; CHECK-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  ; CHECK-NEXT:   S_SINGLEUSE_VDST 1
+  ; CHECK-NEXT:   $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2
+    successors: %bb.1
+
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  bb.1:
+  liveins: $vgpr0, $vgpr1, $vgpr2
+
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+...
+
+---
+name: basic_block_2
+tracksRegLiveness: true
+body: |
+  ; CHECK-LABEL: name: basic_block_2
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   $vgpr0 = IMPLICIT_DEF
+  ; CHECK-NEXT:   $vgpr1 = IMPLICIT_DEF
+  ; CHECK-NEXT:   $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   S_SINGLEUSE_VDST 1
+  ; CHECK-NEXT:   $vgpr3 = V_ADD_U32_e32 $vgpr2, $vgpr2, implicit $exec
+  ; CHECK-NEXT:   S_SINGLEUSE_VDST 1
+  ; CHECK-NEXT:   $vgpr3 = V_ADD_U32_e32 $vgpr2, $vgpr2, implicit $exec
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+    successors: %bb.1
+
+    $vgpr0 = IMPLICIT_DEF
+    $vgpr1 = IMPLICIT_DEF
+
+    $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+  bb.1:
+  liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+
+    $vgpr3 = V_ADD_U32_e32 $vgpr2, $vgpr2, implicit $exec
+    $vgpr3 = V_ADD_U32_e32 $vgpr2, $vgpr2, implicit $exec
+...
+
+---
+name: basic_block_3
+tracksRegLiveness: true
+body: |
+  ; CHECK-LABEL: name: basic_block_3
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   $vgpr0 = IMPLICIT_DEF
+  ; CHECK-NEXT:   $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  ; CHECK-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  ; CHECK-NEXT:   $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   liveins: $vgpr0, $vgpr1, $vgpr2
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+  ; CHECK-NEXT:   S_SINGLEUSE_VDST 1
+  ; CHECK-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+  ; CHECK-NEXT:   S_SINGLEUSE_VDST 1
+  ; CHECK-NEXT:   $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2
+    successors: %bb.1
+
+    $vgpr0 = IMPLICIT_DEF
+
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+  bb.1:
+  liveins: $vgpr0, $vgpr1, $vgpr2
+
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+    $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+...
+
+---
+name: exec_mask_1
+tracksRegLiveness: true
+body: |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+    ; CHECK-LABEL: name: exec_mask_1
+    ; CHECK: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 1234
+    ; CHECK-NEXT: $exec = COPY $vgpr0
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr3 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+    $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    %9:_(s64) = G_CONSTANT i64 1234
+    $exec = COPY $vgpr0
+    $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    $vgpr3 = V_ADD_U32_e32 $vgpr0, $vgpr1, implicit $exec
+...
+
+---
+name: exec_mask_2
+tracksRegLiveness: true
+body: |
+  bb.0:
+    liveins: $vgpr0, $vgpr1
+    ; CHECK-LABEL: name: exec_mask_2
+    ; CHECK: liveins: $vgpr0, $vgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: $vgpr0 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 1234
+    ; CHECK-NEXT: $exec = COPY $vgpr0
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr2 = V_ADD_U32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; CHECK-NEXT: S_SINGLEUSE_VDST 1
+    ; CHECK-NEXT: $vgpr3 = V...
[truncated]

Copy link
Contributor

@perlfu perlfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure about the test coverage here because there are barely any tests covering wide registers or subregister access.
Based on the TODO comment about MCRegUnits I guess these may not work correctly yet?
Even so it might be good to have tests that show what doesn't work, so that future patches can demonstrate improvements.

I although think many of the liveins for blocks in the test can be shorten or removed.
I have annotate a few as examples.

...

---
name: writelane1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little odd, as conceptually V_WRITELANE only updates a single of the vgpr and does not consume the VGPR per-say, even if it does as the LLVM MIR level.

llvm/test/CodeGen/AMDGPU/insert-singleuse_vdst.mir Outdated Show resolved Hide resolved
llvm/test/CodeGen/AMDGPU/insert-singleuse_vdst.mir Outdated Show resolved Hide resolved
llvm/test/CodeGen/AMDGPU/insert-singleuse_vdst.mir Outdated Show resolved Hide resolved
llvm/test/CodeGen/AMDGPU/insert-singleuse_vdst.mir Outdated Show resolved Hide resolved
llvm/test/CodeGen/AMDGPU/insert-singleuse_vdst.mir Outdated Show resolved Hide resolved
//===----------------------------------------------------------------------===//
//
/// \file
/// Insert s_singleuse_vdst instructions on GFX11.5+ to mark regions of VALU
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this only marks regions of 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. There is more work to be done here.

if (AllProducerOperandsAreSingleUse && SIInstrInfo::isVALU(MI)) {
// TODO: Replace with candidate logging for instruction grouping
// later.
emitSingleUseVDST(MI);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work with bundles -- do that need a test?


void emitSingleUseVDST(MachineInstr &MI) const {
BuildMI(*MI.getParent(), MI, DebugLoc(), SII->get(AMDGPU::S_SINGLEUSE_VDST))
.addImm(0x1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the immediate mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've commented the specific case we're using. In general it can indicate up to three regions of "single-use producer" VALU instructions but I don't want to add support for the encoding of them until we're actually using it. There is also a pretty assembler/disassembler syntax for that which is not yet implemented.

llvm/lib/Target/AMDGPU/AMDGPUInsertSingleUseVDST.cpp Outdated Show resolved Hide resolved
@jayfoad
Copy link
Contributor Author

jayfoad commented Nov 22, 2023

I am unsure about the test coverage here because there are barely any tests covering wide registers or subregister access. Based on the TODO comment about MCRegUnits I guess these may not work correctly yet? Even so it might be good to have tests that show what doesn't work, so that future patches can demonstrate improvements.

I although think many of the liveins for blocks in the test can be shorten or removed. I have annotate a few as examples.

I have completely redone the tests now, simplifying as much as I could, commenting them, and adding some tests for 64-bit and 16-bit registers. Please take another look!

Copy link
Contributor

@perlfu perlfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - but should fix illegal copies from VGPR to SGPR in tests

Thanks for the updated tests.
I think the clarification around f16 behaviour is important.

bb.0:
liveins: $vgpr1_vgpr2
$vgpr0 = V_MOV_B32_e32 0, implicit $exec
$exec = COPY $vgpr1_vgpr2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion these should be SGPRs so this is legal MIR. (A few similar VGPR to exec copies below as well.)

Given all the tests based on EXEC are assuming Wave64 then the relevant flags should be probably also be set in the llc line for the test, i.e. -mattr=+wavefrontsize32,-wavefrontsize64 -- in case it effects the reg units for EXEC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion these should be SGPRs so this is legal MIR.

That was a thinko. Will fix.

I don't think these tests are assuming anything about wave size. I will try running them in both wave32 and wave64. (I suppose in wave32 mode we could ignore a write to exec_hi, but I don't think the pass implements that.)

@jayfoad
Copy link
Contributor Author

jayfoad commented Nov 23, 2023

Please see last two commits.

@jayfoad
Copy link
Contributor Author

jayfoad commented Nov 24, 2023

@perlfu I'm assuming your previous approval still stands!

@jayfoad jayfoad merged commit 28233b1 into llvm:main Nov 24, 2023
3 checks passed
@jayfoad jayfoad deleted the singleuse-vdst branch November 24, 2023 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants