Conversation

@preames preames commented Oct 7, 2025

This is a follow-up to the recent infrastructure work to generally support non-trivial rematerialization. It is the first in a small series of patches to enable non-trivial rematerialization more aggressively in the RISC-V backend. It deliberately avoids both vector instructions and loads, as those seem most likely to expose unexpected interactions.

Note that this isn't ready to land just yet. We still need to collect compile-time numbers (in progress) and more performance numbers/stats on at least spec2017 and the llvm-test-suite. I'm posting it mostly as a placeholder, since multiple people were talking about this and I want us to avoid duplicating work.
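
To make the intent concrete, here is a rough sketch of the kind of transformation this enables. It is illustrative only: the registers, immediates, and surrounding code are hypothetical and not taken from this patch. Non-trivial rematerialization lets the register allocator recompute a value whose register operand is still live at the point of use, rather than spilling and reloading it:

Without rematerialization (a3 is spilled across a high-pressure region):

  addi a3, a0, 16
  sw   a3, 8(sp)        # spill a3
  ...                   # high register pressure; a0 stays live
  lw   a3, 8(sp)        # reload a3
  sw   a1, 0(a3)

With the ADDI rematerialized at its use (no stack traffic for a3):

  addi a3, a0, 16
  ...
  addi a3, a0, 16       # recomputed from the still-live a0
  sw   a1, 0(a3)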

llvmbot commented Oct 7, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-risc-v

Author: Philip Reames (preames)

Changes

This is a follow-up to the recent infrastructure work to generally support non-trivial rematerialization. It is the first in a small series of patches to enable non-trivial rematerialization more aggressively in the RISC-V backend. It deliberately avoids both vector instructions and loads, as those seem most likely to expose unexpected interactions.

Note that this isn't ready to land just yet. We still need to collect compile-time numbers (in progress) and more performance numbers/stats on at least spec2017 and the llvm-test-suite. I'm posting it mostly as a placeholder, since multiple people were talking about this and I want us to avoid duplicating work.


Patch is 419.60 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/162311.diff

6 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVInstrInfo.td (+9-11)
  • (modified) llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll (+4512-4413)
  • (modified) llvm/test/CodeGen/RISCV/add-before-shl.ll (+5-5)
  • (modified) llvm/test/CodeGen/RISCV/pr69586.ll (+141-142)
  • (modified) llvm/test/CodeGen/RISCV/rvv/nontemporal-vp-scalable.ll (+205-205)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll (+21-21)
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.td b/llvm/lib/Target/RISCV/RISCVInstrInfo.td
index 9855c47a63392..f1ac3a5b7e9a5 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfo.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.td
@@ -780,21 +780,18 @@ def SB : Store_rri<0b000, "sb">, Sched<[WriteSTB, ReadStoreData, ReadMemBase]>;
 def SH : Store_rri<0b001, "sh">, Sched<[WriteSTH, ReadStoreData, ReadMemBase]>;
 def SW : Store_rri<0b010, "sw">, Sched<[WriteSTW, ReadStoreData, ReadMemBase]>;
 
-// ADDI isn't always rematerializable, but isReMaterializable will be used as
-// a hint which is verified in isReMaterializableImpl.
-let isReMaterializable = 1, isAsCheapAsAMove = 1 in
+let isReMaterializable = 1, isAsCheapAsAMove = 1 in {
 def ADDI  : ALU_ri<0b000, "addi">;
+def XORI  : ALU_ri<0b100, "xori">;
+def ORI   : ALU_ri<0b110, "ori">;
+}
 
-let IsSignExtendingOpW = 1 in {
+let IsSignExtendingOpW = 1, isReMaterializable = 1 in {
 def SLTI  : ALU_ri<0b010, "slti">;
 def SLTIU : ALU_ri<0b011, "sltiu">;
 }
 
-let isReMaterializable = 1, isAsCheapAsAMove = 1 in {
-def XORI  : ALU_ri<0b100, "xori">;
-def ORI   : ALU_ri<0b110, "ori">;
-}
-
+let isReMaterializable = 1 in {
 def ANDI  : ALU_ri<0b111, "andi">;
 
 def SLLI : Shift_ri<0b00000, 0b001, "slli">,
@@ -826,6 +823,7 @@ def OR   : ALU_rr<0b0000000, 0b110, "or", Commutable=1>,
            Sched<[WriteIALU, ReadIALU, ReadIALU]>;
 def AND  : ALU_rr<0b0000000, 0b111, "and", Commutable=1>,
            Sched<[WriteIALU, ReadIALU, ReadIALU]>;
+}
 
 let hasSideEffects = 1, mayLoad = 0, mayStore = 0 in {
 def FENCE : RVInstI<0b000, OPC_MISC_MEM, (outs),
@@ -893,7 +891,7 @@ def LWU   : Load_ri<0b110, "lwu">, Sched<[WriteLDW, ReadMemBase]>;
 def LD    : Load_ri<0b011, "ld">, Sched<[WriteLDD, ReadMemBase]>;
 def SD    : Store_rri<0b011, "sd">, Sched<[WriteSTD, ReadStoreData, ReadMemBase]>;
 
-let IsSignExtendingOpW = 1 in {
+let IsSignExtendingOpW = 1, isReMaterializable = 1 in {
 let hasSideEffects = 0, mayLoad = 0, mayStore = 0 in
 def ADDIW : RVInstI<0b000, OPC_OP_IMM_32, (outs GPR:$rd),
                     (ins GPR:$rs1, simm12_lo:$imm12),
@@ -917,7 +915,7 @@ def SRLW  : ALUW_rr<0b0000000, 0b101, "srlw">,
             Sched<[WriteShiftReg32, ReadShiftReg32, ReadShiftReg32]>;
 def SRAW  : ALUW_rr<0b0100000, 0b101, "sraw">,
             Sched<[WriteShiftReg32, ReadShiftReg32, ReadShiftReg32]>;
-} // IsSignExtendingOpW = 1
+} // IsSignExtendingOpW = 1, isReMaterializable = 1
 } // Predicates = [IsRV64]
 
 //===----------------------------------------------------------------------===//
diff --git a/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll b/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
index ca9f7637388f7..74c31a229dad4 100644
--- a/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
+++ b/llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll
@@ -3000,9 +3000,9 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    sw s9, 20(sp) # 4-byte Folded Spill
 ; RV32I-NEXT:    sw s10, 16(sp) # 4-byte Folded Spill
 ; RV32I-NEXT:    sw s11, 12(sp) # 4-byte Folded Spill
-; RV32I-NEXT:    li a4, 0
+; RV32I-NEXT:    li a5, 0
 ; RV32I-NEXT:    lbu a3, 0(a0)
-; RV32I-NEXT:    lbu a5, 1(a0)
+; RV32I-NEXT:    lbu a4, 1(a0)
 ; RV32I-NEXT:    lbu a6, 2(a0)
 ; RV32I-NEXT:    lbu a7, 3(a0)
 ; RV32I-NEXT:    lbu t0, 4(a0)
@@ -3013,736 +3013,750 @@ define void @lshr_32bytes(ptr %src.ptr, ptr %byteOff.ptr, ptr %dst) nounwind {
 ; RV32I-NEXT:    lbu t5, 9(a0)
 ; RV32I-NEXT:    lbu t6, 10(a0)
 ; RV32I-NEXT:    lbu s0, 11(a0)
-; RV32I-NEXT:    slli a5, a5, 8
+; RV32I-NEXT:    slli a4, a4, 8
 ; RV32I-NEXT:    slli a7, a7, 8
 ; RV32I-NEXT:    slli t1, t1, 8
-; RV32I-NEXT:    or a3, a5, a3
-; RV32I-NEXT:    or a7, a7, a6
-; RV32I-NEXT:    or t1, t1, t0
-; RV32I-NEXT:    lbu a6, 13(a0)
-; RV32I-NEXT:    lbu a5, 14(a0)
-; RV32I-NEXT:    lbu s1, 15(a0)
+; RV32I-NEXT:    or a3, a4, a3
+; RV32I-NEXT:    or a4, a7, a6
+; RV32I-NEXT:    or a7, t1, t0
+; RV32I-NEXT:    lbu t0, 13(a0)
+; RV32I-NEXT:    lbu a6, 14(a0)
+; RV32I-NEXT:    lbu t1, 15(a0)
 ; RV32I-NEXT:    slli t3, t3, 8
 ; RV32I-NEXT:    slli t5, t5, 8
 ; RV32I-NEXT:    slli s0, s0, 8
-; RV32I-NEXT:    or t3, t3, t2
-; RV32I-NEXT:    or t0, t5, t4
-; RV32I-NEXT:    or t5, s0, t6
-; RV32I-NEXT:    lbu t2, 1(a1)
-; RV32I-NEXT:    lbu t4, 0(a1)
+; RV32I-NEXT:    or s1, t3, t2
+; RV32I-NEXT:    or t2, t5, t4
+; RV32I-NEXT:    or t4, s0, t6
+; RV32I-NEXT:    lbu t3, 1(a1)
+; RV32I-NEXT:    lbu t5, 0(a1)
 ; RV32I-NEXT:    lbu t6, 2(a1)
 ; RV32I-NEXT:    lbu a1, 3(a1)
-; RV32I-NEXT:    slli t2, t2, 8
-; RV32I-NEXT:    or s0, t2, t4
-; RV32I-NEXT:    slli t2, s1, 8
+; RV32I-NEXT:    slli t3, t3, 8
+; RV32I-NEXT:    or t5, t3, t5
+; RV32I-NEXT:    slli t3, t1, 8
 ; RV32I-NEXT:    slli a1, a1, 8
 ; RV32I-NEXT:    or a1, a1, t6
-; RV32I-NEXT:    slli t4, a7, 16
-; RV32I-NEXT:    slli a7, t3, 16
-; RV32I-NEXT:    slli t3, t5, 16
-; RV32I-NEXT:    slli t5, a1, 16
-; RV32I-NEXT:    or a1, a7, t1
-; RV32I-NEXT:    or a7, t5, s0
+; RV32I-NEXT:    slli a4, a4, 16
+; RV32I-NEXT:    slli s1, s1, 16
+; RV32I-NEXT:    slli t4, t4, 16
+; RV32I-NEXT:    slli t1, a1, 16
+; RV32I-NEXT:    or s5, s1, a7
+; RV32I-NEXT:    or a7, t1, t5
 ; RV32I-NEXT:    slli a7, a7, 3
 ; RV32I-NEXT:    srli t1, a7, 5
 ; RV32I-NEXT:    andi t5, a7, 31
 ; RV32I-NEXT:    neg s3, t5
 ; RV32I-NEXT:    beqz t5, .LBB12_2
 ; RV32I-NEXT:  # %bb.1:
-; RV32I-NEXT:    sll a4, a1, s3
+; RV32I-NEXT:    sll a5, s5, s3
 ; RV32I-NEXT:  .LBB12_2:
-; RV32I-NEXT:    or s7, t4, a3
-; RV32I-NEXT:    lbu t4, 12(a0)
-; RV32I-NEXT:    lbu t6, 19(a0)
-; RV32I-NEXT:    slli s1, a6, 8
-; RV32I-NEXT:    or a5, t2, a5
-; RV32I-NEXT:    or a3, t3, t0
+; RV32I-NEXT:    or a4, a4, a3
+; RV32I-NEXT:    lbu t6, 12(a0)
+; RV32I-NEXT:    lbu s0, 19(a0)
+; RV32I-NEXT:    slli s1, t0, 8
+; RV32I-NEXT:    or t0, t3, a6
+; RV32I-NEXT:    or a1, t4, t2
 ; RV32I-NEXT:    beqz t1, .LBB12_4
 ; RV32I-NEXT:  # %bb.3:
-; RV32I-NEXT:    li s0, 0
+; RV32I-NEXT:    mv s11, a4
+; RV32I-NEXT:    li a4, 0
 ; RV32I-NEXT:    j .LBB12_5
 ; RV32I-NEXT:  .LBB12_4:
-; RV32I-NEXT:    srl s0, s7, a7
-; RV32I-NEXT:    or s0, s0, a4
+; RV32I-NEXT:    mv s11, a4
+; RV32I-NEXT:    srl a6, a4, a7
+; RV32I-NEXT:    or a4, a6, a5
 ; RV32I-NEXT:  .LBB12_5:
 ; RV32I-NEXT:    li a6, 0
-; RV32I-NEXT:    lbu t0, 17(a0)
-; RV32I-NEXT:    lbu a4, 18(a0)
-; RV32I-NEXT:    slli s4, t6, 8
-; RV32I-NEXT:    or s2, s1, t4
-; RV32I-NEXT:    slli a5, a5, 16
-; RV32I-NEXT:    li s5, 1
-; RV32I-NEXT:    sll t6, a3, s3
+; RV32I-NEXT:    lbu s2, 17(a0)
+; RV32I-NEXT:    lbu a5, 18(a0)
+; RV32I-NEXT:    slli s4, s0, 8
+; RV32I-NEXT:    or s1, s1, t6
+; RV32I-NEXT:    slli t0, t0, 16
+; RV32I-NEXT:    li t3, 1
+; RV32I-NEXT:    sll s6, a1, s3
 ; RV32I-NEXT:    beqz t5, .LBB12_7
 ; RV32I-NEXT:  # %bb.6:
-; RV32I-NEXT:    mv a6, t6
+; RV32I-NEXT:    mv a6, s6
 ; RV32I-NEXT:  .LBB12_7:
 ; RV32I-NEXT:    lbu t2, 16(a0)
-; RV32I-NEXT:    lbu t3, 23(a0)
-; RV32I-NEXT:    slli s1, t0, 8
-; RV32I-NEXT:    or t4, s4, a4
-; RV32I-NEXT:    srl a4, a1, a7
-; RV32I-NEXT:    or a5, a5, s2
-; RV32I-NEXT:    bne t1, s5, .LBB12_9
+; RV32I-NEXT:    lbu t4, 23(a0)
+; RV32I-NEXT:    slli s0, s2, 8
+; RV32I-NEXT:    or t6, s4, a5
+; RV32I-NEXT:    srl a3, s5, a7
+; RV32I-NEXT:    or a5, t0, s1
+; RV32I-NEXT:    sw a3, 0(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    bne t1, t3, .LBB12_9
 ; RV32I-NEXT:  # %bb.8:
-; RV32I-NEXT:    or s0, a4, a6
+; RV32I-NEXT:    or a4, a3, a6
 ; RV32I-NEXT:  .LBB12_9:
 ; RV32I-NEXT:    li t0, 0
-; RV32I-NEXT:    lbu s5, 21(a0)
+; RV32I-NEXT:    lbu s2, 21(a0)
 ; RV32I-NEXT:    lbu a6, 22(a0)
-; RV32I-NEXT:    slli s4, t3, 8
-; RV32I-NEXT:    or t2, s1, t2
-; RV32I-NEXT:    slli s6, t4, 16
-; RV32I-NEXT:    li s8, 2
-; RV32I-NEXT:    sll t3, a5, s3
+; RV32I-NEXT:    slli s1, t4, 8
+; RV32I-NEXT:    or t2, s0, t2
+; RV32I-NEXT:    slli s4, t6, 16
+; RV32I-NEXT:    li a3, 2
+; RV32I-NEXT:    sll s8, a5, s3
 ; RV32I-NEXT:    beqz t5, .LBB12_11
 ; RV32I-NEXT:  # %bb.10:
-; RV32I-NEXT:    mv t0, t3
+; RV32I-NEXT:    mv t0, s8
 ; RV32I-NEXT:  .LBB12_11:
-; RV32I-NEXT:    lbu s1, 20(a0)
-; RV32I-NEXT:    lbu s2, 27(a0)
-; RV32I-NEXT:    slli s5, s5, 8
-; RV32I-NEXT:    or s4, s4, a6
-; RV32I-NEXT:    srl t4, a3, a7
-; RV32I-NEXT:    or a6, s6, t2
-; RV32I-NEXT:    bne t1, s8, .LBB12_13
+; RV32I-NEXT:    lbu t6, 20(a0)
+; RV32I-NEXT:    lbu s0, 27(a0)
+; RV32I-NEXT:    slli s2, s2, 8
+; RV32I-NEXT:    or s1, s1, a6
+; RV32I-NEXT:    srl t3, a1, a7
+; RV32I-NEXT:    or a6, s4, t2
+; RV32I-NEXT:    sw s5, 8(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    bne t1, a3, .LBB12_13
 ; RV32I-NEXT:  # %bb.12:
-; RV32I-NEXT:    or s0, t4, t0
+; RV32I-NEXT:    or a4, t3, t0
 ; RV32I-NEXT:  .LBB12_13:
-; RV32I-NEXT:    sw s7, 4(sp) # 4-byte Folded Spill
 ; RV32I-NEXT:    li t2, 0
-; RV32I-NEXT:    lbu s6, 25(a0)
+; RV32I-NEXT:    lbu s4, 25(a0)
 ; RV32I-NEXT:    lbu t0, 26(a0)
-; RV32I-NEXT:    slli s8, s2, 8
-; RV32I-NEXT:    or s7, s5, s1
-; RV32I-NEXT:    slli s9, s4, 16
-; RV32I-NEXT:    sll s11, a6, s3
+; RV32I-NEXT:    slli s7, s0, 8
+; RV32I-NEXT:    or s5, s2, t6
+; RV32I-NEXT:    slli s9, s1, 16
+; RV32I-NEXT:    li t6, 3
+; RV32I-NEXT:    sll t4, a6, s3
 ; RV32I-NEXT:    beqz t5, .LBB12_15
 ; RV32I-NEXT:  # %bb.14:
-; RV32I-NEXT:    mv t2, s11
+; RV32I-NEXT:    mv t2, t4
 ; RV32I-NEXT:  .LBB12_15:
-; RV32I-NEXT:    lbu s1, 24(a0)
-; RV32I-NEXT:    lbu s2, 31(a0)
-; RV32I-NEXT:    slli s5, s6, 8
-; RV32I-NEXT:    or s4, s8, t0
-; RV32I-NEXT:    srl ra, a5, a7
-; RV32I-NEXT:    or t0, s9, s7
-; RV32I-NEXT:    li s6, 3
-; RV32I-NEXT:    bne t1, s6, .LBB12_17
+; RV32I-NEXT:    lbu s0, 24(a0)
+; RV32I-NEXT:    lbu s1, 31(a0)
+; RV32I-NEXT:    slli s4, s4, 8
+; RV32I-NEXT:    or s2, s7, t0
+; RV32I-NEXT:    srl a3, a5, a7
+; RV32I-NEXT:    or t0, s9, s5
+; RV32I-NEXT:    li s9, 3
+; RV32I-NEXT:    bne t1, t6, .LBB12_17
 ; RV32I-NEXT:  # %bb.16:
-; RV32I-NEXT:    or s0, ra, t2
+; RV32I-NEXT:    or a4, a3, t2
 ; RV32I-NEXT:  .LBB12_17:
+; RV32I-NEXT:    mv t6, t3
 ; RV32I-NEXT:    li t2, 0
 ; RV32I-NEXT:    lbu s7, 29(a0)
-; RV32I-NEXT:    lbu s6, 30(a0)
-; RV32I-NEXT:    slli s8, s2, 8
-; RV32I-NEXT:    or s2, s5, s1
-; RV32I-NEXT:    slli s5, s4, 16
-; RV32I-NEXT:    li s9, 4
-; RV32I-NEXT:    sll s1, t0, s3
-; RV32I-NEXT:    sw s1, 8(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    lbu s5, 30(a0)
+; RV32I-NEXT:    slli s1, s1, 8
+; RV32I-NEXT:    or s10, s4, s0
+; RV32I-NEXT:    slli s2, s2, 16
+; RV32I-NEXT:    li a3, 4
+; RV32I-NEXT:    sll s0, t0, s3
 ; RV32I-NEXT:    beqz t5, .LBB12_19
 ; RV32I-NEXT:  # %bb.18:
-; RV32I-NEXT:    lw t2, 8(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    mv t2, s0
 ; RV32I-NEXT:  .LBB12_19:
-; RV32I-NEXT:    lbu s1, 28(a0)
+; RV32I-NEXT:    lbu t3, 28(a0)
 ; RV32I-NEXT:    slli s7, s7, 8
-; RV32I-NEXT:    or s4, s8, s6
-; RV32I-NEXT:    srl s10, a6, a7
-; RV32I-NEXT:    or a0, s5, s2
-; RV32I-NEXT:    bne t1, s9, .LBB12_21
+; RV32I-NEXT:    or s4, s1, s5
+; RV32I-NEXT:    srl s1, a6, a7
+; RV32I-NEXT:    or a0, s2, s10
+; RV32I-NEXT:    beq t1, a3, .LBB12_21
 ; RV32I-NEXT:  # %bb.20:
-; RV32I-NEXT:    or s0, s10, t2
+; RV32I-NEXT:    mv a3, s1
+; RV32I-NEXT:    j .LBB12_22
 ; RV32I-NEXT:  .LBB12_21:
+; RV32I-NEXT:    mv a3, s1
+; RV32I-NEXT:    or a4, s1, t2
+; RV32I-NEXT:  .LBB12_22:
+; RV32I-NEXT:    li s10, 1
 ; RV32I-NEXT:    li s2, 0
-; RV32I-NEXT:    or t2, s7, s1
+; RV32I-NEXT:    or t2, s7, t3
 ; RV32I-NEXT:    slli s4, s4, 16
-; RV32I-NEXT:    li s9, 5
+; RV32I-NEXT:    li s1, 5
 ; RV32I-NEXT:    sll s7, a0, s3
-; RV32I-NEXT:    beqz t5, .LBB12_23
-; RV32I-NEXT:  # %bb.22:
+; RV32I-NEXT:    beqz t5, .LBB12_24
+; RV32I-NEXT:  # %bb.23:
 ; RV32I-NEXT:    mv s2, s7
-; RV32I-NEXT:  .LBB12_23:
-; RV32I-NEXT:    srl s8, t0, a7
+; RV32I-NEXT:  .LBB12_24:
+; RV32I-NEXT:    sw a1, 4(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    srl t3, t0, a7
 ; RV32I-NEXT:    or t2, s4, t2
-; RV32I-NEXT:    bne t1, s9, .LBB12_25
-; RV32I-NEXT:  # %bb.24:
-; RV32I-NEXT:    or s0, s8, s2
-; RV32I-NEXT:  .LBB12_25:
-; RV32I-NEXT:    li s4, 0
+; RV32I-NEXT:    beq t1, s1, .LBB12_26
+; RV32I-NEXT:  # %bb.25:
+; RV32I-NEXT:    mv a1, t3
+; RV32I-NEXT:    j .LBB12_27
+; RV32I-NEXT:  .LBB12_26:
+; RV32I-NEXT:    mv a1, t3
+; RV32I-NEXT:    or a4, t3, s2
+; RV32I-NEXT:  .LBB12_27:
+; RV32I-NEXT:    li t3, 0
 ; RV32I-NEXT:    li s2, 6
 ; RV32I-NEXT:    sll s5, t2, s3
-; RV32I-NEXT:    beqz t5, .LBB12_27
-; RV32I-NEXT:  # %bb.26:
-; RV32I-NEXT:    mv s4, s5
-; RV32I-NEXT:  .LBB12_27:
-; RV32I-NEXT:    srl s6, a0, a7
-; RV32I-NEXT:    bne t1, s2, .LBB12_29
+; RV32I-NEXT:    beqz t5, .LBB12_29
 ; RV32I-NEXT:  # %bb.28:
-; RV32I-NEXT:    or s0, s6, s4
+; RV32I-NEXT:    mv t3, s5
 ; RV32I-NEXT:  .LBB12_29:
-; RV32I-NEXT:    li s3, 7
-; RV32I-NEXT:    srl s1, t2, a7
-; RV32I-NEXT:    mv s4, s1
-; RV32I-NEXT:    bne t1, s3, .LBB12_34
+; RV32I-NEXT:    srl s3, a0, a7
+; RV32I-NEXT:    beq t1, s2, .LBB12_31
 ; RV32I-NEXT:  # %bb.30:
-; RV32I-NEXT:    bnez a7, .LBB12_35
+; RV32I-NEXT:    mv ra, s3
+; RV32I-NEXT:    j .LBB12_32
 ; RV32I-NEXT:  .LBB12_31:
-; RV32I-NEXT:    li s0, 0
-; RV32I-NEXT:    bnez t5, .LBB12_36
+; RV32I-NEXT:    mv ra, s3
+; RV32I-NEXT:    or a4, s3, t3
 ; RV32I-NEXT:  .LBB12_32:
-; RV32I-NEXT:    li s4, 2
-; RV32I-NEXT:    beqz t1, .LBB12_37
-; RV32I-NEXT:  .LBB12_33:
-; RV32I-NEXT:    li a4, 0
-; RV32I-NEXT:    j .LBB12_38
+; RV32I-NEXT:    li s3, 7
+; RV32I-NEXT:    srl s4, t2, a7
+; RV32I-NEXT:    mv t3, s4
+; RV32I-NEXT:    beq t1, s3, .LBB12_34
+; RV32I-NEXT:  # %bb.33:
+; RV32I-NEXT:    mv t3, a4
 ; RV32I-NEXT:  .LBB12_34:
-; RV32I-NEXT:    mv s4, s0
-; RV32I-NEXT:    beqz a7, .LBB12_31
-; RV32I-NEXT:  .LBB12_35:
-; RV32I-NEXT:    sw s4, 4(sp) # 4-byte Folded Spill
-; RV32I-NEXT:    li s0, 0
-; RV32I-NEXT:    beqz t5, .LBB12_32
+; RV32I-NEXT:    mv a4, s11
+; RV32I-NEXT:    beqz a7, .LBB12_36
+; RV32I-NEXT:  # %bb.35:
+; RV32I-NEXT:    mv a4, t3
 ; RV32I-NEXT:  .LBB12_36:
-; RV32I-NEXT:    mv s0, t6
-; RV32I-NEXT:    li s4, 2
-; RV32I-NEXT:    bnez t1, .LBB12_33
-; RV32I-NEXT:  .LBB12_37:
-; RV32I-NEXT:    or a4, a4, s0
+; RV32I-NEXT:    li t3, 0
+; RV32I-NEXT:    li s11, 2
+; RV32I-NEXT:    beqz t5, .LBB12_38
+; RV32I-NEXT:  # %bb.37:
+; RV32I-NEXT:    mv t3, s6
 ; RV32I-NEXT:  .LBB12_38:
-; RV32I-NEXT:    li s0, 1
-; RV32I-NEXT:    li t6, 0
-; RV32I-NEXT:    bnez t5, .LBB12_57
+; RV32I-NEXT:    beqz t1, .LBB12_40
 ; RV32I-NEXT:  # %bb.39:
-; RV32I-NEXT:    beq t1, s0, .LBB12_58
+; RV32I-NEXT:    li s6, 0
+; RV32I-NEXT:    li t3, 0
+; RV32I-NEXT:    bnez t5, .LBB12_41
+; RV32I-NEXT:    j .LBB12_42
 ; RV32I-NEXT:  .LBB12_40:
-; RV32I-NEXT:    li t6, 0
-; RV32I-NEXT:    bnez t5, .LBB12_59
+; RV32I-NEXT:    lw s6, 0(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    or s6, s6, t3
+; RV32I-NEXT:    li t3, 0
+; RV32I-NEXT:    beqz t5, .LBB12_42
 ; RV32I-NEXT:  .LBB12_41:
-; RV32I-NEXT:    beq t1, s4, .LBB12_60
+; RV32I-NEXT:    mv t3, s8
 ; RV32I-NEXT:  .LBB12_42:
-; RV32I-NEXT:    li t6, 0
-; RV32I-NEXT:    bnez t5, .LBB12_61
-; RV32I-NEXT:  .LBB12_43:
-; RV32I-NEXT:    li s4, 3
-; RV32I-NEXT:    bne t1, s4, .LBB12_45
+; RV32I-NEXT:    beq t1, s10, .LBB12_58
+; RV32I-NEXT:  # %bb.43:
+; RV32I-NEXT:    li t3, 0
+; RV32I-NEXT:    bnez t5, .LBB12_59
 ; RV32I-NEXT:  .LBB12_44:
-; RV32I-NEXT:    or a4, s10, t6
+; RV32I-NEXT:    beq t1, s11, .LBB12_60
 ; RV32I-NEXT:  .LBB12_45:
-; RV32I-NEXT:    li t6, 0
-; RV32I-NEXT:    li s4, 4
-; RV32I-NEXT:    bnez t5, .LBB12_62
-; RV32I-NEXT:  # %bb.46:
-; RV32I-NEXT:    beq t1, s4, .LBB12_63
+; RV32I-NEXT:    li t3, 0
+; RV32I-NEXT:    bnez t5, .LBB12_61
+; RV32I-NEXT:  .LBB12_46:
+; RV32I-NEXT:    bne t1, s9, .LBB12_48
 ; RV32I-NEXT:  .LBB12_47:
-; RV32I-NEXT:    li t6, 0
-; RV32I-NEXT:    bnez t5, .LBB12_64
+; RV32I-NEXT:    or s6, a3, t3
 ; RV32I-NEXT:  .LBB12_48:
-; RV32I-NEXT:    beq t1, s9, .LBB12_65
-; RV32I-NEXT:  .LBB12_49:
-; RV32I-NEXT:    mv t6, s1
-; RV32I-NEXT:    bne t1, s2, .LBB12_66
+; RV32I-NEXT:    li t3, 0
+; RV32I-NEXT:    li s9, 4
+; RV32I-NEXT:    bnez t5, .LBB12_62
+; RV32I-NEXT:  # %bb.49:
+; RV32I-NEXT:    beq t1, s9, .LBB12_63
 ; RV32I-NEXT:  .LBB12_50:
-; RV32I-NEXT:    li a4, 0
-; RV32I-NEXT:    bne t1, s3, .LBB12_67
+; RV32I-NEXT:    li t3, 0
+; RV32I-NEXT:    bnez t5, .LBB12_64
 ; RV32I-NEXT:  .LBB12_51:
-; RV32I-NEXT:    beqz a7, .LBB12_53
+; RV32I-NEXT:    beq t1, s1, .LBB12_65
 ; RV32I-NEXT:  .LBB12_52:
-; RV32I-NEXT:    mv a1, a4
+; RV32I-NEXT:    mv t3, s4
+; RV32I-NEXT:    bne t1, s2, .LBB12_66
 ; RV32I-NEXT:  .LBB12_53:
-; RV32I-NEXT:    li a4, 0
-; RV32I-NEXT:    li t6, 2
-; RV32I-NEXT:    beqz t5, .LBB12_55
-; RV32I-NEXT:  # %bb.54:
-; RV32I-NEXT:    mv a4, t3
+; RV32I-NEXT:    li s6, 0
+; RV32I-NEXT:    bne t1, s3, .LBB12_67
+; RV32I-NEXT:  .LBB12_54:
+; RV32I-NEXT:    bnez a7, .LBB12_68
 ; RV32I-NEXT:  .LBB12_55:
-; RV32I-NEXT:    beqz t1, .LBB12_68
-; RV32I-NEXT:  # %bb.56:
-; RV32I-NEXT:    li a4, 0
-; RV32I-NEXT:    j .LBB12_69
+; RV32I-NEXT:    li t3, 0
+; RV32I-NEXT:    bnez t5, .LBB12_69
+; RV32I-NEXT:  .LBB12_56:
+; RV32I-NEXT:    beqz t1, .LBB12_70
 ; RV32I-NEXT:  .LBB12_57:
-; RV32I-NEXT:    mv t6, t3
-; RV32I-NEXT:    bne t1, s0, .LBB12_40
+; RV32I-NEXT:    li s6, 0
+; RV32I-NEXT:    j .LBB12_71
 ; RV32I-NEXT:  .LBB12_58:
-; RV32I-NEXT:    or a4, t4, t6
-; RV32I-NEXT:    li t6, 0
-; RV32I-NEXT:    beqz t5, .LBB12_41
+; RV32I-NEXT:    or s6, t6, t3
+; RV32I-NEXT:    li t3, 0
+; RV32I-NEXT:    beqz t5, .LBB12_44
 ; RV32I-NEXT:  .LBB12_59:
-; RV32I-NEXT:    mv t6, s11
-; RV32I-NEXT:    bne t1, s4, .LBB12_42
+; RV32I-NEXT:    mv t3, t4
+; RV32I-NEXT:    bne t1, s11, .LBB12_45
 ; RV32I-NEXT:  .LBB12_60:
-; RV32I-NEXT:    or a4, ra, t6
-; RV32I-NEXT:    li t6, 0
-; RV32I-NEXT:    beqz t5, .LBB12_43
+; RV32I-NEXT:    srl s6, a5, a7
+; RV32I-NEXT:    or s6, s6, t3
+; RV32I-NEXT:    li t3, 0
+; RV32I-NEXT:    beqz t5, .LBB12_46
 ; RV32I-NEXT:  .LBB12_61:
-; RV32I-NEXT:    lw t6, 8(sp) # 4-byte Folded Reload
-; RV32I-NEXT:    li s4, 3
-; RV32I-NEXT:    beq t1, s4, .LBB12_44
-; RV32I-NEXT:    j .LBB12_45
+; RV32I-NEXT:    mv t3, s0
+; RV32I-NEXT:    beq t1, s9, .LBB12_47
+; RV32I-NEXT:    j .LBB12_48
 ; RV32I-NEXT:  .LBB12_62:
-; RV32I-NEXT:    mv t6, s7
-; RV32I-NEXT:    bne t1, s4, .LBB12_47
+; RV32I-NEXT:    mv t3, s7
+; RV32I-NEXT:    bne t1, s9, .LBB12_50
 ; RV32I-NEXT:  .LBB12_63:
-; RV32I-NEXT:    or a4, s8, t6
-; RV32I-NEXT:    li t6, 0
-; RV32I-NEXT:    beqz t5, .LBB12_48
+; RV32I-NEXT:    or s6, a1, t3
+; RV32I-NEXT:    li t3, 0
+; RV32I-NEXT:    beqz t5, .LBB12_51
 ; RV32I-NEXT:  .LBB12_64:
-; RV32I-NEXT:    mv t6, s5
-; RV32I-NEXT:    bne t1, s9, .LBB12_49
+; RV32I-NEXT:    mv t3, s5
+; RV32I-NEXT:    bne t1, s1, .LBB12_52
 ; RV32I-NEXT:  .LBB12_65:
-; RV32I-NEXT:    or a4, s6, t6
-; RV32I-NEXT:    mv t6, s1
-; RV32I-NEXT:    beq t1, s2, .LBB12_50
+; RV32I-NEXT:    or s6, ra, t3
+; RV32I-NEXT:    mv t3, s4
+; RV32I-NEXT:    beq t1, s2, .LBB12_53
 ; RV32I-NEXT:  .LBB12_66:
-; RV32I-NEXT:    mv t6, a4
-; RV32I-NEXT:    li a4, 0
-; RV32I-NEXT:    beq t1, s3, .LBB12_51
+; RV32I-NEXT:    mv t3, s6
+; RV32I-NEXT:    li s6, 0
+; RV32I-NEXT:    beq t1, s3, .LBB12_54
 ; RV32I-NEXT:  .LBB12_67:
-; RV32I-NEXT:    mv a4, t6
-; RV32I-NEXT:    bnez a7, .LBB12_52
-; RV32I-NEXT:    j .LBB12_53
+; RV32I-NEXT:    mv s6, t3
+; RV32I-NEXT:    beqz a7, .LBB12_55
 ; RV32I-NEXT:  .LBB12_68:
-; RV32I-NEXT:    or a4, t4, a4
-; RV32I-NEXT:  .LBB12_69:
-; RV32I-NEXT:    li t4, 3
+; RV32I-NEXT:    sw s6, 8(sp) # 4-byte Folded Spill
 ; RV32I-NEXT:    li t3, 0
-; RV32I-NEXT:    bnez t5, .LBB12_84
-; RV32I-NEXT:  # %bb.70:
-; RV32I-NEXT:    beq t1, s0, .LBB12_85
+; RV32I-NEXT:    beqz t5, .LBB12_56
+; RV32I-NEXT:  .LBB12_69:
+; RV32I-NEXT:    mv t3, s8
+; RV32I-NEXT:    bnez t1, .LBB12_57
+; RV32I-NEXT:  .LBB12_70:
+; RV32I-NEXT:    or s6, t6, t3
 ; RV32I-NEXT:  .LBB12_71:
+; RV32I-NEXT:    li t6, 3
 ; RV32I-NEXT:    li t3, 0
 ; RV32I-NEXT:    bnez t5, .LBB12_86
-; RV32I-NEXT:  .LBB12_72:
-; RV32I-NEXT:    beq t1, t6, .LBB12_87
+; RV32I-NEXT:  # %bb.72:
+; RV32I-NEXT:    beq t1, s10, .LBB12_87
 ; RV32I-NEXT:  .LBB12_73:
 ; RV32I-NEXT:    li t3, 0
 ; RV32I-NEXT:    bnez t5, .LBB12_88
 ; RV32I-NEXT:  .LBB12_74:
-; RV32I-NEXT:    beq t1, t4, .LBB12_89
+; RV32I-NEXT:    beq t1, s11, .LBB12_89
 ; RV32I-NEXT:  .LBB12_75:
 ; RV32I-NEXT:    li t3, 0
 ; RV32I-NEXT:    bnez t5, .LBB12_90
 ; RV32I-NEXT:  .LBB12_76:
-; RV32I...
[truncated]

; RV32I-NEXT: .LBB12_206:
; RV32I-NEXT: mv t3, t4
; RV32I-NEXT: bnez a7, .LBB12_189
; RV32I-NEXT: j .LBB12_190

This code got quite a bit longer. Is it better?

; RV32I-NEXT: .LBB13_206:
; RV32I-NEXT: mv t3, t4
; RV32I-NEXT: bnez a7, .LBB13_189
; RV32I-NEXT: j .LBB13_190

Longer

asb commented Oct 14, 2025

Here are dyn instcount diffs for an rva22 build of SPEC 2017:

Benchmark                  Baseline       This PR   Diff (%)
============================================================
500.perlbench_r         179038504795    179015518664     -0.01%
502.gcc_r               221238744000    221171512681     -0.03%
505.mcf_r               134655886612    137150096559      1.85%
508.namd_r              217623220031    217709012429      0.04%
510.parest_r            291729214114    291727407077     -0.00%
511.povray_r             30983012423     30981981254     -0.00%
519.lbm_r                91217999797     90475828912     -0.81%
520.omnetpp_r           137704191763    137702616618     -0.00%
523.xalancbmk_r         284738544130    284738525706     -0.00%
525.x264_r              379871669079    379399368988     -0.12%
526.blender_r           659313110004    659133873329     -0.03%
531.deepsjeng_r         349454510283    349291680562     -0.05%
538.imagick_r           238568576282    238568485759     -0.00%
541.leela_r             405707905587    405700609695     -0.00%
544.nab_r               398215408162    398165080506     -0.01%
557.xz_r                129537393796    129925509975      0.30%

Looking at the static assembly diff, it is large due to lots of very tiny regalloc changes. The obvious outlier is mcf, which I'll need to report back on after having a closer look.

lukel97 commented Oct 16, 2025

Some quick static results of this on llvm-test-suite, -march=rva23u64 -O3:

$ ./utils/compare.py results.rva23u64-O3.pr162311.before.json vs results.rva23u64-O3.pr162311.after.json -m regalloc.NumReloads -m regalloc.NumSpills 
Tests: 320
Metric: regalloc.NumReloads,regalloc.NumSpills

Program                                       regalloc.NumReloads               regalloc.NumSpills              
                                              lhs                 rhs     diff  lhs                rhs     diff 
SingleSour...arks/Adobe-C++/functionobjects     38.00               48.00 26.3%    8.00               8.00  0.0%
MultiSourc...e/Benchmarks/Rodinia/srad/srad     73.00               77.00  5.5%   54.00              54.00  0.0%
MultiSourc...e/Applications/minisat/minisat     33.00               34.00  3.0%   23.00              23.00  0.0%
MultiSourc...e/Applications/ClamAV/clamscan   4067.00             4159.00  2.3% 1881.00            1868.00 -0.7%
MultiSourc...e/Benchmarks/MallocBench/gs/gs    430.00              439.00  2.1%  223.00             220.00 -1.3%
MultiSource/Applications/kimwitu++/kc          627.00              631.00  0.6%  141.00             142.00  0.7%
MultiSource/Benchmarks/PAQ8p/paq8p             344.00              345.00  0.3%  243.00             243.00  0.0%
MultiSource/Benchmarks/sim/sim                 353.00              354.00  0.3%  168.00             166.00 -1.2%
MicroBench...ubsetCLambdaLoops/lcalsCLambda   1343.00             1346.00  0.2% 1017.00            1017.00  0.0%
MicroBench...CALS/SubsetBRawLoops/lcalsBRaw   1229.00             1231.00  0.2%  968.00             967.00 -0.1%
MicroBench...ubsetBLambdaLoops/lcalsBLambda   1229.00             1231.00  0.2%  968.00             967.00 -0.1%
MicroBench...CALS/SubsetCRawLoops/lcalsCRaw   1345.00             1347.00  0.1% 1019.00            1018.00 -0.1%
MicroBench...CALS/SubsetARawLoops/lcalsARaw   1410.00             1411.00  0.1% 1095.00            1093.00 -0.2%
MicroBench...ubsetALambdaLoops/lcalsALambda   1487.00             1488.00  0.1% 1167.00            1165.00 -0.2%
MultiSourc...enchmarks/VersaBench/dbms/dbms     16.00               16.00  0.0%    9.00               9.00  0.0%
                           Geomean difference                             -8.5%                            -8.9%

And SPEC CPU 2017:

$ ./utils/compare.py results.rva23u64-O3.pr162311.spec.before.json vs results.rva23u64-O3.pr162311.spec.after.json -m regalloc.NumReloads -m regalloc.NumSpills -a
Tests: 32
Metric: regalloc.NumReloads,regalloc.NumSpills

Program                                       regalloc.NumReloads                 regalloc.NumSpills               
                                              lhs                 rhs      diff   lhs                rhs      diff 
INT2017speed/605.mcf_s/605.mcf_s                196.00              220.00  12.2%   104.00              98.00 -5.8%
INT2017rate/505.mcf_r/505.mcf_r                 196.00              220.00  12.2%   104.00              98.00 -5.8%
INT2017rat...31.deepsjeng_r/531.deepsjeng_r     515.00              516.00   0.2%   265.00             267.00  0.8%
INT2017spe...31.deepsjeng_s/631.deepsjeng_s     515.00              516.00   0.2%   265.00             267.00  0.8%
FP2017rate/508.namd_r/508.namd_r              15208.00            15231.00   0.2%  6580.00            6585.00  0.1%
FP2017rate/519.lbm_r/519.lbm_r                   47.00               47.00   0.0%    46.00              46.00  0.0%
FP2017rate/511.povray_r/511.povray_r           2841.00             2839.00  -0.1%  1720.00            1707.00 -0.8%
FP2017rate/544.nab_r/544.nab_r                 1073.00             1066.00  -0.7%   714.00             709.00 -0.7%
FP2017speed/644.nab_s/644.nab_s                1073.00             1066.00  -0.7%   714.00             709.00 -0.7%
INT2017rate/502.gcc_r/502.gcc_r               24278.00            24034.00  -1.0% 11046.00           10911.00 -1.2%
INT2017speed/602.gcc_s/602.gcc_s              24278.00            24034.00  -1.0% 11046.00           10911.00 -1.2%
FP2017rate/538.imagick_r/538.imagick_r         8154.00             8071.00  -1.0%  3365.00            3311.00 -1.6%
FP2017speed/638.imagick_s/638.imagick_s        8154.00             8071.00  -1.0%  3365.00            3311.00 -1.6%
INT2017spe...23.xalancbmk_s/623.xalancbmk_s    2243.00             2220.00  -1.0%  1396.00            1384.00 -0.9%
INT2017rat...23.xalancbmk_r/523.xalancbmk_r    2243.00             2220.00  -1.0%  1396.00            1384.00 -0.9%
FP2017rate/510.parest_r/510.parest_r          76535.00            75466.00  -1.4% 43417.00           43114.00 -0.7%
FP2017rate/526.blender_r/526.blender_r        24742.00            24354.00  -1.6% 12452.00           12361.00 -0.7%
INT2017spe...00.perlbench_s/600.perlbench_s    9630.00             9470.00  -1.7%  4360.00            4309.00 -1.2%
INT2017rat...00.perlbench_r/500.perlbench_r    9630.00             9470.00  -1.7%  4360.00            4309.00 -1.2%
INT2017rate/520.omnetpp_r/520.omnetpp_r        1188.00             1155.00  -2.8%   645.00             619.00 -4.0%
INT2017spe...ed/620.omnetpp_s/620.omnetpp_s    1188.00             1155.00  -2.8%   645.00             619.00 -4.0%
INT2017rate/525.x264_r/525.x264_r              4065.00             3898.00  -4.1%  1865.00            1800.00 -3.5%
INT2017speed/625.x264_s/625.x264_s             4065.00             3898.00  -4.1%  1865.00            1800.00 -3.5%
FP2017speed/619.lbm_s/619.lbm_s                  43.00               41.00  -4.7%    42.00              40.00 -4.8%
INT2017rate/541.leela_r/541.leela_r             418.00              397.00  -5.0%   300.00             283.00 -5.7%
INT2017speed/641.leela_s/641.leela_s            418.00              397.00  -5.0%   300.00             283.00 -5.7%
INT2017rate/557.xz_r/557.xz_r                   470.00              420.00 -10.6%   270.00             257.00 -4.8%
INT2017speed/657.xz_s/657.xz_s                  470.00              420.00 -10.6%   270.00             257.00 -4.8%
FP2017rate...97.specrand_fr/997.specrand_fr       0.00                0.00                                         
FP2017spee...96.specrand_fs/996.specrand_fs       0.00                0.00                                         
INT2017rat...99.specrand_ir/999.specrand_ir       0.00                0.00                                         
INT2017spe...98.specrand_is/998.specrand_is       0.00                0.00                                         
                           Geomean difference                               -1.5%                             -2.3%

Overall this seems to be an improvement, but I'm definitely surprised to see that some cases have an increase in the number of reloads. The results for 505.mcf_r match @asb's dynamic results. It would be good to get to the bottom of that.

asb commented Oct 16, 2025

I spent some time having a closer look. There's a very specific hot block in spec_qsort that picks up an extra move and negate, which seems to account for a good chunk of the dynamic instcount diff:

New:

mv s6, s11          
neg a0, s3          
mul s11, a0, s9     
mv a0, s1           
mv a1, s8           
jalr s4             

vs old:

 mul s3, s11, s6 
 mv a0, s1       
 mv a1, s8       
 jalr s4         

I'll get a minimal reproducer so we can decide whether to put this down to bad luck or something we can address in the context of this patch.
