[RISCV] Add a MIR pass to reassociate shXadd, add, and slli to form more shXadd. #87544
Conversation
@llvm/pr-subscribers-backend-risc-v

Author: Craig Topper (topperc)

Changes

This reassociates patterns like (sh3add Z, (add X, (slli Y, 6))) into (sh3add (sh3add Y, Z), X).

This improves a pattern that occurs in 531.deepsjeng_r, reducing the dynamic instruction count by 0.5%.

This may be possible to improve in SelectionDAG, but given the special cases around shXadd formation, it's not obvious it can be done in a robust way without adding multiple special cases.

I've used a GEP with 2 indices because that most closely resembles the motivating case. Most of the test cases are the simplest GEP case. One test has a logical right shift on an index, which is closer to the deepsjeng code. This requires special handling in isel to reverse a DAGCombiner canonicalization that turns a pair of shifts into (srl (and X, C1), C2).

See also #85734 which had a hacky version of a similar optimization.

Full diff: https://github.com/llvm/llvm-project/pull/87544.diff

6 Files Affected:
diff --git a/llvm/lib/Target/RISCV/CMakeLists.txt b/llvm/lib/Target/RISCV/CMakeLists.txt
index 8715403f3839a6..8708d3b1ab70f0 100644
--- a/llvm/lib/Target/RISCV/CMakeLists.txt
+++ b/llvm/lib/Target/RISCV/CMakeLists.txt
@@ -46,6 +46,7 @@ add_llvm_target(RISCVCodeGen
RISCVMachineFunctionInfo.cpp
RISCVMergeBaseOffset.cpp
RISCVOptWInstrs.cpp
+ RISCVOptZba.cpp
RISCVPostRAExpandPseudoInsts.cpp
RISCVRedundantCopyElimination.cpp
RISCVMoveMerger.cpp
diff --git a/llvm/lib/Target/RISCV/RISCV.h b/llvm/lib/Target/RISCV/RISCV.h
index 7af543f018ccbd..a3f9949425fcc4 100644
--- a/llvm/lib/Target/RISCV/RISCV.h
+++ b/llvm/lib/Target/RISCV/RISCV.h
@@ -46,6 +46,9 @@ void initializeRISCVFoldMasksPass(PassRegistry &);
FunctionPass *createRISCVOptWInstrsPass();
void initializeRISCVOptWInstrsPass(PassRegistry &);
+FunctionPass *createRISCVOptZbaPass();
+void initializeRISCVOptZbaPass(PassRegistry &);
+
FunctionPass *createRISCVMergeBaseOffsetOptPass();
void initializeRISCVMergeBaseOffsetOptPass(PassRegistry &);
diff --git a/llvm/lib/Target/RISCV/RISCVOptZba.cpp b/llvm/lib/Target/RISCV/RISCVOptZba.cpp
new file mode 100644
index 00000000000000..dc19f921783c95
--- /dev/null
+++ b/llvm/lib/Target/RISCV/RISCVOptZba.cpp
@@ -0,0 +1,150 @@
+//===- RISCVOptZba.cpp - RISC-V Zba instruction optimizations ------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===---------------------------------------------------------------------===//
+//
+// This pass attempts to reassociate expressions like
+// (sh3add Z, (add X, (slli Y, 6)))
+// into
+// (sh3add (sh3add Y, Z), X)
+//
+// This reduces the number of instructions needed by spreading the shift
+// amount across two sh3adds. This can be generalized to sh1add/sh2add and
+// other shift amounts.
+//
+// TODO: We can also support slli.uw by using shXadd.uw or the inner shXadd.
+
+#include "RISCV.h"
+#include "RISCVSubtarget.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "riscv-opt-zba"
+#define RISCV_OPT_ZBA_NAME "RISC-V Optimize Zba"
+
+namespace {
+
+class RISCVOptZba : public MachineFunctionPass {
+public:
+ static char ID;
+
+ RISCVOptZba() : MachineFunctionPass(ID) {}
+
+ bool runOnMachineFunction(MachineFunction &MF) override;
+
+ void getAnalysisUsage(AnalysisUsage &AU) const override {
+ AU.setPreservesCFG();
+ MachineFunctionPass::getAnalysisUsage(AU);
+ }
+
+ StringRef getPassName() const override { return RISCV_OPT_ZBA_NAME; }
+};
+
+} // end anonymous namespace
+
+char RISCVOptZba::ID = 0;
+INITIALIZE_PASS(RISCVOptZba, DEBUG_TYPE, RISCV_OPT_ZBA_NAME, false, false)
+
+FunctionPass *llvm::createRISCVOptZbaPass() { return new RISCVOptZba(); }
+
+static bool findShift(Register Reg, const MachineBasicBlock &MBB,
+ MachineInstr *&Shift, const MachineRegisterInfo &MRI) {
+ if (!Reg.isVirtual())
+ return false;
+
+ Shift = MRI.getVRegDef(Reg);
+ if (!Shift || Shift->getOpcode() != RISCV::SLLI ||
+ Shift->getParent() != &MBB || !MRI.hasOneNonDBGUse(Reg))
+ return false;
+
+ return true;
+}
+
+bool RISCVOptZba::runOnMachineFunction(MachineFunction &MF) {
+ if (skipFunction(MF.getFunction()))
+ return false;
+
+ MachineRegisterInfo &MRI = MF.getRegInfo();
+ const RISCVSubtarget &ST = MF.getSubtarget<RISCVSubtarget>();
+ const RISCVInstrInfo &TII = *ST.getInstrInfo();
+
+ if (!ST.hasStdExtZba())
+ return false;
+
+  bool MadeChange = false;
+
+ for (MachineBasicBlock &MBB : MF) {
+ for (MachineInstr &MI : llvm::make_early_inc_range(MBB)) {
+ unsigned OuterShiftAmt;
+ switch (MI.getOpcode()) {
+ default:
+ continue;
+ case RISCV::SH1ADD: OuterShiftAmt = 1; break;
+ case RISCV::SH2ADD: OuterShiftAmt = 2; break;
+ case RISCV::SH3ADD: OuterShiftAmt = 3; break;
+ }
+
+ // Second operand must be virtual.
+ Register UnshiftedReg = MI.getOperand(2).getReg();
+ if (!UnshiftedReg.isVirtual())
+ continue;
+
+ MachineInstr *Add = MRI.getVRegDef(UnshiftedReg);
+ if (!Add || Add->getOpcode() != RISCV::ADD || Add->getParent() != &MBB ||
+ !MRI.hasOneNonDBGUse(UnshiftedReg))
+ continue;
+
+ Register AddReg0 = Add->getOperand(1).getReg();
+ Register AddReg1 = Add->getOperand(2).getReg();
+
+ MachineInstr *InnerShift;
+ Register X;
+ if (findShift(AddReg0, MBB, InnerShift, MRI))
+ X = AddReg1;
+ else if (findShift(AddReg1, MBB, InnerShift, MRI))
+ X = AddReg0;
+ else
+ continue;
+
+ unsigned InnerShiftAmt = InnerShift->getOperand(2).getImm();
+
+ // The inner shift amount must be at least as large as the outer shift
+ // amount.
+ if (OuterShiftAmt > InnerShiftAmt)
+ continue;
+
+ unsigned InnerOpc;
+ switch (InnerShiftAmt - OuterShiftAmt) {
+ default:
+ continue;
+ case 0: InnerOpc = RISCV::ADD; break;
+ case 1: InnerOpc = RISCV::SH1ADD; break;
+ case 2: InnerOpc = RISCV::SH2ADD; break;
+ case 3: InnerOpc = RISCV::SH3ADD; break;
+ }
+
+ Register Y = InnerShift->getOperand(1).getReg();
+ Register Z = MI.getOperand(1).getReg();
+
+ Register NewReg = MRI.createVirtualRegister(&RISCV::GPRRegClass);
+ BuildMI(MBB, MI, MI.getDebugLoc(), TII.get(InnerOpc), NewReg)
+ .addReg(Y)
+ .addReg(Z);
+ BuildMI(MBB, MI, MI.getDebugLoc(), TII.get(MI.getOpcode()),
+ MI.getOperand(0).getReg())
+ .addReg(NewReg)
+ .addReg(X);
+
+ MI.eraseFromParent();
+ Add->eraseFromParent();
+ InnerShift->eraseFromParent();
+ MadeChange = true;
+ }
+ }
+
+ return MadeChange;
+}
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index ae1a6f179a49e3..7c6ea2b15be0cb 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -117,6 +117,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeRISCVTarget() {
initializeRISCVPostRAExpandPseudoPass(*PR);
initializeRISCVMergeBaseOffsetOptPass(*PR);
initializeRISCVOptWInstrsPass(*PR);
+ initializeRISCVOptZbaPass(*PR);
initializeRISCVPreRAExpandPseudoPass(*PR);
initializeRISCVExpandPseudoPass(*PR);
initializeRISCVFoldMasksPass(*PR);
@@ -531,6 +532,8 @@ void RISCVPassConfig::addMachineSSAOptimization() {
if (EnableMachineCombiner)
addPass(&MachineCombinerID);
+ addPass(createRISCVOptZbaPass());
+
if (TM->getTargetTriple().isRISCV64()) {
addPass(createRISCVOptWInstrsPass());
}
diff --git a/llvm/test/CodeGen/RISCV/O3-pipeline.ll b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
index 90472f246918f3..8fc610bb07b7f6 100644
--- a/llvm/test/CodeGen/RISCV/O3-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
@@ -112,6 +112,7 @@
; CHECK-NEXT: Machine Trace Metrics
; CHECK-NEXT: Lazy Machine Block Frequency Analysis
; CHECK-NEXT: Machine InstCombiner
+; CHECK-NEXT: RISC-V Optimize Zba
; RV64-NEXT: RISC-V Optimize W Instructions
; CHECK-NEXT: RISC-V Pre-RA pseudo instruction expansion pass
; CHECK-NEXT: RISC-V Merge Base Offset
diff --git a/llvm/test/CodeGen/RISCV/rv64zba.ll b/llvm/test/CodeGen/RISCV/rv64zba.ll
index c81c6aeaab890c..067addc819f7e6 100644
--- a/llvm/test/CodeGen/RISCV/rv64zba.ll
+++ b/llvm/test/CodeGen/RISCV/rv64zba.ll
@@ -1404,9 +1404,8 @@ define i64 @sh6_sh3_add2(i64 noundef %x, i64 noundef %y, i64 noundef %z) {
;
; RV64ZBA-LABEL: sh6_sh3_add2:
; RV64ZBA: # %bb.0: # %entry
-; RV64ZBA-NEXT: slli a1, a1, 6
-; RV64ZBA-NEXT: add a0, a1, a0
-; RV64ZBA-NEXT: sh3add a0, a2, a0
+; RV64ZBA-NEXT: sh3add a1, a1, a2
+; RV64ZBA-NEXT: sh3add a0, a1, a0
; RV64ZBA-NEXT: ret
entry:
%shl = shl i64 %z, 3
@@ -2036,3 +2035,357 @@ define i64 @pack_i64_disjoint_2(i32 signext %a, i64 %b) nounwind {
%or = or disjoint i64 %b, %zexta
ret i64 %or
}
+
+define i8 @array_index_sh1_sh0(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh1_sh0:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 1
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: lbu a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh1_sh0:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh1add a0, a1, a0
+; RV64ZBA-NEXT: add a0, a0, a2
+; RV64ZBA-NEXT: lbu a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [2 x i8], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i8, ptr %a, align 1
+ ret i8 %b
+}
+
+define i16 @array_index_sh1_sh1(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh1_sh1:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 2
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 1
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: lh a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh1_sh1:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh2add a0, a1, a0
+; RV64ZBA-NEXT: sh1add a0, a2, a0
+; RV64ZBA-NEXT: lh a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [2 x i16], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i16, ptr %a, align 2
+ ret i16 %b
+}
+
+define i32 @array_index_sh1_sh2(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh1_sh2:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 3
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 2
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: lw a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh1_sh2:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh3add a0, a1, a0
+; RV64ZBA-NEXT: sh2add a0, a2, a0
+; RV64ZBA-NEXT: lw a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [2 x i32], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i32, ptr %a, align 4
+ ret i32 %b
+}
+
+define i64 @array_index_sh1_sh3(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh1_sh3:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 4
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 3
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: ld a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh1_sh3:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh1add a1, a1, a2
+; RV64ZBA-NEXT: sh3add a0, a1, a0
+; RV64ZBA-NEXT: ld a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [2 x i64], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i64, ptr %a, align 8
+ ret i64 %b
+}
+
+define i8 @array_index_sh2_sh0(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh2_sh0:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 2
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: lbu a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh2_sh0:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh2add a0, a1, a0
+; RV64ZBA-NEXT: add a0, a0, a2
+; RV64ZBA-NEXT: lbu a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [4 x i8], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i8, ptr %a, align 1
+ ret i8 %b
+}
+
+define i16 @array_index_sh2_sh1(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh2_sh1:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 3
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 1
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: lh a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh2_sh1:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh3add a0, a1, a0
+; RV64ZBA-NEXT: sh1add a0, a2, a0
+; RV64ZBA-NEXT: lh a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [4 x i16], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i16, ptr %a, align 2
+ ret i16 %b
+}
+
+define i32 @array_index_sh2_sh2(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh2_sh2:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 4
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 2
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: lw a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh2_sh2:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh2add a1, a1, a2
+; RV64ZBA-NEXT: sh2add a0, a1, a0
+; RV64ZBA-NEXT: lw a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [4 x i32], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i32, ptr %a, align 4
+ ret i32 %b
+}
+
+define i64 @array_index_sh2_sh3(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh2_sh3:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 5
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 3
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: ld a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh2_sh3:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh2add a1, a1, a2
+; RV64ZBA-NEXT: sh3add a0, a1, a0
+; RV64ZBA-NEXT: ld a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [4 x i64], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i64, ptr %a, align 8
+ ret i64 %b
+}
+
+define i8 @array_index_sh3_sh0(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh3_sh0:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 3
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: lbu a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh3_sh0:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh3add a0, a1, a0
+; RV64ZBA-NEXT: add a0, a0, a2
+; RV64ZBA-NEXT: lbu a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [8 x i8], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i8, ptr %a, align 1
+ ret i8 %b
+}
+
+define i16 @array_index_sh3_sh1(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh3_sh1:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 4
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 1
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: lh a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh3_sh1:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh3add a1, a1, a2
+; RV64ZBA-NEXT: sh1add a0, a1, a0
+; RV64ZBA-NEXT: lh a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [8 x i16], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i16, ptr %a, align 2
+ ret i16 %b
+}
+
+define i32 @array_index_sh3_sh2(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh3_sh2:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 5
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 2
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: lw a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh3_sh2:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh3add a1, a1, a2
+; RV64ZBA-NEXT: sh2add a0, a1, a0
+; RV64ZBA-NEXT: lw a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [8 x i32], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i32, ptr %a, align 4
+ ret i32 %b
+}
+
+define i64 @array_index_sh3_sh3(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh3_sh3:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 6
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 3
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: ld a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh3_sh3:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: sh3add a1, a1, a2
+; RV64ZBA-NEXT: sh3add a0, a1, a0
+; RV64ZBA-NEXT: ld a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [8 x i64], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i64, ptr %a, align 8
+ ret i64 %b
+}
+
+; Similar to above, but with a lshr on one of the indices. This requires
+; special handling during isel to form a shift pair.
+define i64 @array_index_lshr_sh3_sh3(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_lshr_sh3_sh3:
+; RV64I: # %bb.0:
+; RV64I-NEXT: srli a1, a1, 58
+; RV64I-NEXT: slli a1, a1, 6
+; RV64I-NEXT: slli a2, a2, 3
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: ld a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_lshr_sh3_sh3:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: srli a1, a1, 58
+; RV64ZBA-NEXT: sh3add a1, a1, a2
+; RV64ZBA-NEXT: sh3add a0, a1, a0
+; RV64ZBA-NEXT: ld a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %shr = lshr i64 %idx1, 58
+ %a = getelementptr inbounds [8 x i64], ptr %p, i64 %shr, i64 %idx2
+ %b = load i64, ptr %a, align 8
+ ret i64 %b
+}
+
+define i8 @array_index_sh4_sh0(ptr %p, i64 %idx1, i64 %idx2) {
+; CHECK-LABEL: array_index_sh4_sh0:
+; CHECK: # %bb.0:
+; CHECK-NEXT: slli a1, a1, 4
+; CHECK-NEXT: add a0, a0, a2
+; CHECK-NEXT: add a0, a0, a1
+; CHECK-NEXT: lbu a0, 0(a0)
+; CHECK-NEXT: ret
+ %a = getelementptr inbounds [16 x i8], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i8, ptr %a, align 1
+ ret i8 %b
+}
+
+define i16 @array_index_sh4_sh1(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh4_sh1:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 5
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 1
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: lh a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh4_sh1:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: slli a1, a1, 5
+; RV64ZBA-NEXT: add a0, a0, a1
+; RV64ZBA-NEXT: sh1add a0, a2, a0
+; RV64ZBA-NEXT: lh a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [16 x i16], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i16, ptr %a, align 2
+ ret i16 %b
+}
+
+define i32 @array_index_sh4_sh2(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh4_sh2:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 6
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 2
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: lw a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh4_sh2:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: slli a1, a1, 6
+; RV64ZBA-NEXT: add a0, a0, a1
+; RV64ZBA-NEXT: sh2add a0, a2, a0
+; RV64ZBA-NEXT: lw a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [16 x i32], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i32, ptr %a, align 4
+ ret i32 %b
+}
+
+define i64 @array_index_sh4_sh3(ptr %p, i64 %idx1, i64 %idx2) {
+; RV64I-LABEL: array_index_sh4_sh3:
+; RV64I: # %bb.0:
+; RV64I-NEXT: slli a1, a1, 7
+; RV64I-NEXT: add a0, a0, a1
+; RV64I-NEXT: slli a2, a2, 3
+; RV64I-NEXT: add a0, a0, a2
+; RV64I-NEXT: ld a0, 0(a0)
+; RV64I-NEXT: ret
+;
+; RV64ZBA-LABEL: array_index_sh4_sh3:
+; RV64ZBA: # %bb.0:
+; RV64ZBA-NEXT: slli a1, a1, 7
+; RV64ZBA-NEXT: add a0, a0, a1
+; RV64ZBA-NEXT: sh3add a0, a2, a0
+; RV64ZBA-NEXT: ld a0, 0(a0)
+; RV64ZBA-NEXT: ret
+ %a = getelementptr inbounds [16 x i64], ptr %p, i64 %idx1, i64 %idx2
+ %b = load i64, ptr %a, align 8
+ ret i64 %b
+}
✅ With the latest revision this PR passed the C/C++ code formatter.
    MachineInstr *InnerShift;
    Register X;
    if (findShift(AddReg0, MBB, InnerShift, MRI))
I guess some optimization opportunities will be missed when both operands are shifts.
An example: (sh3add Z, (add (slli X, 7), (slli Y, 6)))
Is it more suitable to handle it in MachineCombiner?
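For what it's worth, that example does admit a reassociation, assuming an extra Zba instruction is acceptable; a sketch of the algebra, not something this patch implements:

\begin{aligned}
(Z \ll 3) + (X \ll 7) + (Y \ll 6)
  &= (Z \ll 3) + \bigl(((X \ll 1) + Y) \ll 6\bigr) \\
  &= \bigl(\bigl(((X \ll 1) + Y) \ll 3\bigr) + Z\bigr) \ll 3
\end{aligned}

i.e. sh1add X and Y first, sh3add the result with Z, then slli by 3: three instructions instead of the four (two slli, add, sh3add) in the original form.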
A bit of context for anyone reading along.
Craig and I spent a bunch of time chatting about related issues last week. I'm going to continue with a set of patches that try both to generate shNadds more often and (possibly) to restructure how these are handled in DAG combine.
I suggested that Craig land a variant of the original patch (i.e. this one), and that we use that as our baseline for further iteration. We may end up reversing course here fairly quickly, but I don't see any significant downside to having landed this if that turns out to be the case.
    static bool findShift(Register Reg, const MachineBasicBlock &MBB,
Can't this simply return a possibly null pointer?
Yes. I originally had it returning a bunch of output parameters instead of the Shift instruction. Then I realized I needed the shift instruction in order to erase it.
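A minimal sketch of that nullable-pointer shape, using the same checks as the helper in this patch and assuming RISCVOptZba.cpp's existing includes:

static MachineInstr *findShift(Register Reg, const MachineBasicBlock &MBB,
                               const MachineRegisterInfo &MRI) {
  // Only virtual registers have a single queryable definition.
  if (!Reg.isVirtual())
    return nullptr;

  // The definition must be an SLLI in the same block whose result has no
  // other non-debug uses, so it can be erased after the rewrite.
  MachineInstr *Shift = MRI.getVRegDef(Reg);
  if (!Shift || Shift->getOpcode() != RISCV::SLLI ||
      Shift->getParent() != &MBB || !MRI.hasOneNonDBGUse(Reg))
    return nullptr;

  return Shift;
}

Call sites would then test the returned pointer, e.g. if ((InnerShift = findShift(AddReg0, MBB, MRI))) X = AddReg1;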
llvm/test/CodeGen/RISCV/rv64zba.ll (Outdated)
Can you precommit your tests and rebase please?
    // The inner shift amount must be at least as large as the outer shift
    // amount.
    if (OuterShiftAmt > InnerShiftAmt)
I think we can still reassociate this, but that can be a separate patch. :)
That would have to be a different reassociation of some kind. Do you have an example?
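One candidate, offered as a sketch of the algebra rather than a concrete proposal: when the inner shift amount c is smaller than the outer amount n, the difference can move onto Z instead, e.g. for n = 3, c = 2:

(Z \ll 3) + X + (Y \ll 2) = \bigl(((Z \ll 1) + Y) \ll 2\bigr) + X

i.e. sh2add (sh1add Z, Y), X: two instructions instead of the original slli/add/sh3add sequence.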
    // (sh3add (sh3add Y, Z), X)
    //
    // This reduces the number of instructions needed by spreading the shift
    // amount across two sh3adds. This can be generalized to sh1add/sh2add and
The second part of this comment looks stale?
I meant it can be generalized relative to the specific example in the comment, but I admit that was probably poorly worded.
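For reference, the general form behind both the comment and the pass, derived from the definition shNadd(a, b) = (a << N) + b:

\begin{aligned}
\texttt{sh}n\texttt{add}(Z,\; X + (Y \ll c))
  &= (Z \ll n) + X + (Y \ll c) \\
  &= \bigl(((Y \ll (c-n)) + Z) \ll n\bigr) + X \\
  &= \texttt{sh}n\texttt{add}\bigl(\texttt{sh}(c-n)\texttt{add}(Y, Z),\; X\bigr)
    \quad\text{for } c \ge n,
\end{aligned}

where the inner operation degenerates to a plain add when c = n and requires c - n <= 3 otherwise, matching the switch over InnerShiftAmt - OuterShiftAmt in the pass.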
…ore shXadd. This reassociates patterns like (sh3add Z, (add X, (slli Y, 6))) into (sh3add (sh3add Y, Z), X).
This improves a pattern that occurs in 531.deepsjeng_r, reducing the dynamic instruction count by 0.5%.
This may be possible to improve in SelectionDAG, but given the special cases around shXadd formation, it's not obvious it can be done in a robust way without adding multiple special cases.
I've used a GEP with 2 indices because that most closely resembles the motivating case. Most of the test cases are the simplest GEP case. One test has a logical right shift on an index, which is closer to the deepsjeng code. This requires special handling in isel to reverse a DAGCombiner canonicalization that turns a pair of shifts into (srl (and X, C1), C2).
See also llvm#85734 which had a hacky version of a similar optimization.
cb16e3e to 5d24969 (Compare)
I believe I have this working in the MachineCombiner now. I'll post that version shortly.
…-> (sh3add (sh3add Y, Z), X). This is an alternative to the new pass proposed in llvm#87544. This improves a pattern that occurs in 531.deepsjeng_r, reducing the dynamic instruction count by 0.5%. This may be possible to improve in SelectionDAG, but given the special cases around shXadd formation, it's not obvious it can be done in a robust way without adding multiple special cases. I've used a GEP with 2 indices because that most closely resembles the motivating case. Most of the test cases are the simplest GEP case. One test has a logical right shift on an index, which is closer to the deepsjeng code. This requires special handling in isel to reverse a DAGCombiner canonicalization that turns a pair of shifts into (srl (and X, C1), C2). See also llvm#85734 which had a hacky version of a similar optimization.
This reassociates patterns like (sh3add Z, (add X, (slli Y, 6)))
into (sh3add (sh3add Y, Z), X).
This improves a pattern that occurs in 531.deepsjeng_r, reducing
the dynamic instruction count by 0.5%.
This may be possible to improve in SelectionDAG, but given the special
cases around shXadd formation, it's not obvious it can be done in a
robust way without adding multiple special cases.
I've used a GEP with 2 indices because that most closely resembles
the motivating case. Most of the test cases are the simplest GEP case.
One test has a logical right shift on an index, which is closer to
the deepsjeng code. This requires special handling in isel to reverse
a DAGCombiner canonicalization that turns a pair of shifts into
(srl (and X, C1), C2).
See also #85734 which had a hacky version of a similar optimization.
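As a concrete instance of that canonicalization, using the shift amounts from the array_index_lshr_sh3_sh3 test (a sketch of the algebra; the constants C1 and C2 are the ones implied by the test, not quoted from DAGCombiner):

(X \gg_{u} 58) \ll 6 \;=\; \bigl(X \,\&\, (\texttt{0x3f} \ll 58)\bigr) \gg_{u} 52

i.e. C1 = 0x3f << 58 and C2 = 52 in (srl (and X, C1), C2); isel has to undo this to expose the srli by 58 that feeds the sh3add pair in the Zba output.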