Performance regression due to lost cross-block CSE after recent refactoring #64282

preames · 2023-07-31T20:14:09Z

The following test case demonstrates a performance regression on ToT (and unfortunately, the release branch).

$ cat implicit_def.ll 
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
; RUN: llc < %s -O3 -mtriple=riscv64 -mattr=+v | FileCheck %s

define void @foo(<vscale x 2 x i32> %x, <vscale x 2 x i32> %y, ptr %p1, ptr %p2, i1 zeroext %cond) {
; CHECK-LABEL: foo:
; CHECK:       # %bb.0:
; CHECK-NEXT:    vsetvli a3, zero, e32, m1, ta, ma
; CHECK-NEXT:    vadd.vv v10, v8, v9
; CHECK-NEXT:    vs1r.v v10, (a0)
; CHECK-NEXT:    bnez a2, .LBB0_2
; CHECK-NEXT:  # %bb.1: # %falsebb
; CHECK-NEXT:    vadd.vv v8, v8, v9
; CHECK-NEXT:    vs1r.v v8, (a1)
; CHECK-NEXT:  .LBB0_2: # %mergebb
; CHECK-NEXT:    ret
  %a = add <vscale x 2 x i32> %x, %y
  store <vscale x 2 x i32> %a, ptr %p1
  br i1 %cond, label %mergebb, label %falsebb

falsebb:
  %b = add <vscale x 2 x i32> %x, %y
  store <vscale x 2 x i32> %b, ptr %p2
  br label %mergebb

mergebb:
  ret void
}

The basic problem here is triggered by my recent refactoring series (see https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295). After that series, we're now using IMPLICIT_DEF operands on many vector operations which we didn't used to. (We previously had multiple forms, both with and without passthru.)

The root issue is that we don't perform cross block CSE (in MachineCSE) for IMPLICIT_DEF nodes. This is not a new issue, but is newly exposed on RISCV.

Note that I'm unclear on the practical impact of this regression. Middle level optimization will tend to catch such cases, so the opportunities we're missing are either a) generated during SelectionDAG or b) for some reason not caught in the middle end.

At the moment, I've got a couple approaches to address this.

First, we could revert the series above on the release branch. Normally, this would be my goto option, but given how invasive these changes were, and how much has built on top, I'm leery of this option.

Second, we can simply perform CSE on IMPLICIT_DEF I've locally implemented this, and it appears to functionally work. My original worry was a correctness risk, but I think I've mostly convinced myself this is a non-issue. However, both RegisterCoalescer and ProcessImplicitDefs appear to have sensitivities to cross block live ranges for IMPLCIT_DEFs which look less than obvious on how to fix.

Third, we could perform CSE of the IMPLICIT_DEF users without CSE of the IMPLICIT_DEF itself. This is the most direct fix, but requires some delicate code in the hash map keys in MachineCSE. (In particularly, we need to keep hash and identity in sync.)

Fourth, we could take inspiration from the predication support on ARM (and other targets), and conditionally add the pass thru operand only if needed. I need to investigate this option in more depth.

The text was updated successfully, but these errors were encountered:

llvmbot · 2023-07-31T20:14:26Z

@llvm/issue-subscribers-backend-risc-v

preames · 2023-08-02T16:01:21Z

Candidate patch: https://reviews.llvm.org/D156909

…rands In a recent series of refactorings (described here: https://discourse.llvm.org/t/riscv-transition-in-vector-pseudo-structure-policy-variants/71295), I greatly increased the number of IMPLICIT_DEF operands to our vector instructions. This has turned out to have an unexpected negative impact because MachineCSE does not CSE IMPLICIT_DEFs, and thus does not CSE any instruction with an IMPLICIT_DEF operand. SelectionDAG *does* CSE the same case, but that only covers the same block case, not the cross block case. This lead to the performance regression reported in #64282. This change is a slightly ugly hack to side step the issue. Instead of fixing the root cause (lack of CSE for IMPLICIT_DEF) or undoing the operand changes, we leave the extra operand in place, and use NoReg in place of IMPLICIT_DEF. I then convert back to IMPLICIT_DEF just before register allocation so that ProcessImplicitDefs and TwoAddressInstructions can do the normal transforms to Undef tied registers. We may end up backporting this into the 17.x release branch. Given how late in the release cycle this is landing, that's much less likely now, but still a possibility. Differential Revision: https://reviews.llvm.org/D156909

preames · 2023-09-05T14:59:30Z

Resolved by https://reviews.llvm.org/rGa63bd7e99b00c6c970f38ea596f708e42b8c98e5

preames added backend:RISC-V regression labels Jul 31, 2023

preames self-assigned this Jul 31, 2023

preames closed this as completed Sep 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regression due to lost cross-block CSE after recent refactoring #64282

Performance regression due to lost cross-block CSE after recent refactoring #64282

preames commented Jul 31, 2023

llvmbot commented Jul 31, 2023

preames commented Aug 2, 2023

preames commented Sep 5, 2023

Performance regression due to lost cross-block CSE after recent refactoring #64282

Performance regression due to lost cross-block CSE after recent refactoring #64282

Comments

preames commented Jul 31, 2023

llvmbot commented Jul 31, 2023

preames commented Aug 2, 2023

preames commented Sep 5, 2023