Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MachineSink] Fix missing sinks along critical edges #97618

Merged
merged 2 commits into from
Jul 9, 2024

Conversation

mshockwave
Copy link
Member

4e0bd3f improved early MachineLICM's capabilities to hoist COPY from physical registers out of a loop. However, it accidentally broke one of MachineSink's preconditions on sinking cheap instructions (in this case, COPY) which considered those instructions being profitable to sink only when there are at least two of them in the same def-use chain in the same basic block. So if early MachineLICM hoisted one of them out, MachineSink no longer sink rest of the cheap instructions. This results in redundant load immediate instructions from the motivating example we've seen on RISC-V.

This patch fixes this by teaching MachineSink that if there is more than one demand to sink a register into the same block from different critical edges, it should be considered profitable as it increases the CSE opportunities.
This change also improves two of the AArch64's cases.


I'd put a pre-commit test in this PR for easier comparison.

@llvmbot
Copy link
Collaborator

llvmbot commented Jul 3, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Min-Yih Hsu (mshockwave)

Changes

4e0bd3f improved early MachineLICM's capabilities to hoist COPY from physical registers out of a loop. However, it accidentally broke one of MachineSink's preconditions on sinking cheap instructions (in this case, COPY) which considered those instructions being profitable to sink only when there are at least two of them in the same def-use chain in the same basic block. So if early MachineLICM hoisted one of them out, MachineSink no longer sink rest of the cheap instructions. This results in redundant load immediate instructions from the motivating example we've seen on RISC-V.

This patch fixes this by teaching MachineSink that if there is more than one demand to sink a register into the same block from different critical edges, it should be considered profitable as it increases the CSE opportunities.
This change also improves two of the AArch64's cases.


I'd put a pre-commit test in this PR for easier comparison.


Full diff: https://github.com/llvm/llvm-project/pull/97618.diff

4 Files Affected:

  • (modified) llvm/lib/CodeGen/MachineSink.cpp (+66-15)
  • (modified) llvm/test/CodeGen/AArch64/and-sink.ll (+5-8)
  • (modified) llvm/test/CodeGen/AArch64/fast-isel-branch-cond-split.ll (+8-16)
  • (added) llvm/test/CodeGen/RISCV/machine-sink-load-immediate.ll (+185)
diff --git a/llvm/lib/CodeGen/MachineSink.cpp b/llvm/lib/CodeGen/MachineSink.cpp
index 4dabaabe3659f..6c8b4a6b7338b 100644
--- a/llvm/lib/CodeGen/MachineSink.cpp
+++ b/llvm/lib/CodeGen/MachineSink.cpp
@@ -130,6 +130,14 @@ namespace {
     // Remember which edges have been considered for breaking.
     SmallSet<std::pair<MachineBasicBlock*, MachineBasicBlock*>, 8>
     CEBCandidates;
+    // Memorize the register that also wanted to sink into the same block along
+    // a different critical edge.
+    // {register to sink, sink-to block} -> the first sink-from block.
+    // We're recording the first sink-from block because that (critical) edge
+    // was deferred until we see another register that's going to sink into the
+    // same block.
+    DenseMap<std::pair<Register, MachineBasicBlock *>, MachineBasicBlock *>
+        CEMergeCandidates;
     // Remember which edges we are about to split.
     // This is different from CEBCandidates since those edges
     // will be split.
@@ -197,14 +205,17 @@ namespace {
 
     void releaseMemory() override {
       CEBCandidates.clear();
+      CEMergeCandidates.clear();
     }
 
   private:
     bool ProcessBlock(MachineBasicBlock &MBB);
     void ProcessDbgInst(MachineInstr &MI);
-    bool isWorthBreakingCriticalEdge(MachineInstr &MI,
-                                     MachineBasicBlock *From,
-                                     MachineBasicBlock *To);
+    bool isLegalBreakingCriticalEdge(MachineInstr &MI, MachineBasicBlock *From,
+                                     MachineBasicBlock *To, bool BreakPHIEdge);
+    bool isWorthBreakingCriticalEdge(MachineInstr &MI, MachineBasicBlock *From,
+                                     MachineBasicBlock *To,
+                                     MachineBasicBlock *&DeferredFromBlock);
 
     bool hasStoreBetween(MachineBasicBlock *From, MachineBasicBlock *To,
                          MachineInstr &MI);
@@ -725,6 +736,7 @@ bool MachineSinking::runOnMachineFunction(MachineFunction &MF) {
 
     // Process all basic blocks.
     CEBCandidates.clear();
+    CEMergeCandidates.clear();
     ToSplit.clear();
     for (auto &MBB: MF)
       MadeChange |= ProcessBlock(MBB);
@@ -873,9 +885,9 @@ void MachineSinking::ProcessDbgInst(MachineInstr &MI) {
   SeenDbgVars.insert(Var);
 }
 
-bool MachineSinking::isWorthBreakingCriticalEdge(MachineInstr &MI,
-                                                 MachineBasicBlock *From,
-                                                 MachineBasicBlock *To) {
+bool MachineSinking::isWorthBreakingCriticalEdge(
+    MachineInstr &MI, MachineBasicBlock *From, MachineBasicBlock *To,
+    MachineBasicBlock *&DeferredFromBlock) {
   // FIXME: Need much better heuristics.
 
   // If the pass has already considered breaking this edge (during this pass
@@ -887,6 +899,27 @@ bool MachineSinking::isWorthBreakingCriticalEdge(MachineInstr &MI,
   if (!MI.isCopy() && !TII->isAsCheapAsAMove(MI))
     return true;
 
+  // Check and record the register and the destination block we want to sink
+  // into. Note that we want to do the following before the next check on branch
+  // probability. Because we want to record the initial candidate even if it's
+  // on hot edge, so that other candidates that might not on hot edges can be
+  // sinked as well.
+  for (const auto &MO : MI.all_defs()) {
+    Register Reg = MO.getReg();
+    if (!Reg)
+      continue;
+    Register SrcReg = Reg.isVirtual() ? TRI->lookThruCopyLike(Reg, MRI) : Reg;
+    auto Key = std::make_pair(SrcReg, To);
+    auto Res = CEMergeCandidates.insert(std::make_pair(Key, From));
+    // We wanted to sink the same register into the same block, consider it to
+    // be profitable.
+    if (!Res.second) {
+      // Return the source block that was previously holded off.
+      DeferredFromBlock = Res.first->second;
+      return true;
+    }
+  }
+
   if (From->isSuccessor(To) && MBPI->getEdgeProbability(From, To) <=
       BranchProbability(SplitEdgeProbabilityThreshold, 100))
     return true;
@@ -921,13 +954,10 @@ bool MachineSinking::isWorthBreakingCriticalEdge(MachineInstr &MI,
   return false;
 }
 
-bool MachineSinking::PostponeSplitCriticalEdge(MachineInstr &MI,
-                                               MachineBasicBlock *FromBB,
-                                               MachineBasicBlock *ToBB,
-                                               bool BreakPHIEdge) {
-  if (!isWorthBreakingCriticalEdge(MI, FromBB, ToBB))
-    return false;
-
+bool MachineSinking::isLegalBreakingCriticalEdge(MachineInstr &MI,
+                                                 MachineBasicBlock *FromBB,
+                                                 MachineBasicBlock *ToBB,
+                                                 bool BreakPHIEdge) {
   // Avoid breaking back edge. From == To means backedge for single BB cycle.
   if (!SplitEdges || FromBB == ToBB)
     return false;
@@ -985,11 +1015,32 @@ bool MachineSinking::PostponeSplitCriticalEdge(MachineInstr &MI,
         return false;
   }
 
-  ToSplit.insert(std::make_pair(FromBB, ToBB));
-
   return true;
 }
 
+bool MachineSinking::PostponeSplitCriticalEdge(MachineInstr &MI,
+                                               MachineBasicBlock *FromBB,
+                                               MachineBasicBlock *ToBB,
+                                               bool BreakPHIEdge) {
+  bool Status = false;
+  MachineBasicBlock *DeferredFromBB = nullptr;
+  if (isWorthBreakingCriticalEdge(MI, FromBB, ToBB, DeferredFromBB)) {
+    // If there is a DeferredFromBB, we consider FromBB only if _both_
+    // of them are legal to split.
+    if ((!DeferredFromBB ||
+         ToSplit.count(std::make_pair(DeferredFromBB, ToBB)) ||
+         isLegalBreakingCriticalEdge(MI, DeferredFromBB, ToBB, BreakPHIEdge)) &&
+        isLegalBreakingCriticalEdge(MI, FromBB, ToBB, BreakPHIEdge)) {
+      ToSplit.insert(std::make_pair(FromBB, ToBB));
+      if (DeferredFromBB)
+        ToSplit.insert(std::make_pair(DeferredFromBB, ToBB));
+      Status = true;
+    }
+  }
+
+  return Status;
+}
+
 std::vector<unsigned> &
 MachineSinking::getBBRegisterPressure(const MachineBasicBlock &MBB) {
   // Currently to save compiling time, MBB's register pressure will not change
diff --git a/llvm/test/CodeGen/AArch64/and-sink.ll b/llvm/test/CodeGen/AArch64/and-sink.ll
index f298a55dab721..c84310629e5fd 100644
--- a/llvm/test/CodeGen/AArch64/and-sink.ll
+++ b/llvm/test/CodeGen/AArch64/and-sink.ll
@@ -46,9 +46,8 @@ bb2:
 define dso_local i32 @and_sink2(i32 %a, i1 %c, i1 %c2) {
 ; CHECK-LABEL: and_sink2:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov w8, wzr
-; CHECK-NEXT:    adrp x9, A
-; CHECK-NEXT:    str wzr, [x9, :lo12:A]
+; CHECK-NEXT:    adrp x8, A
+; CHECK-NEXT:    str wzr, [x8, :lo12:A]
 ; CHECK-NEXT:    tbz w1, #0, .LBB1_5
 ; CHECK-NEXT:  // %bb.1: // %bb0.preheader
 ; CHECK-NEXT:    adrp x8, B
@@ -56,17 +55,15 @@ define dso_local i32 @and_sink2(i32 %a, i1 %c, i1 %c2) {
 ; CHECK-NEXT:  .LBB1_2: // %bb0
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    str wzr, [x8, :lo12:B]
-; CHECK-NEXT:    tbz w2, #0, .LBB1_6
+; CHECK-NEXT:    tbz w2, #0, .LBB1_5
 ; CHECK-NEXT:  // %bb.3: // %bb1
 ; CHECK-NEXT:    // in Loop: Header=BB1_2 Depth=1
 ; CHECK-NEXT:    str wzr, [x9, :lo12:C]
 ; CHECK-NEXT:    tbnz w0, #2, .LBB1_2
 ; CHECK-NEXT:  // %bb.4:
-; CHECK-NEXT:    mov w8, #1 // =0x1
-; CHECK-NEXT:  .LBB1_5: // %common.ret
-; CHECK-NEXT:    mov w0, w8
+; CHECK-NEXT:    mov w0, #1 // =0x1
 ; CHECK-NEXT:    ret
-; CHECK-NEXT:  .LBB1_6:
+; CHECK-NEXT:  .LBB1_5:
 ; CHECK-NEXT:    mov w0, wzr
 ; CHECK-NEXT:    ret
 
diff --git a/llvm/test/CodeGen/AArch64/fast-isel-branch-cond-split.ll b/llvm/test/CodeGen/AArch64/fast-isel-branch-cond-split.ll
index d92bbfd7a21d6..49e31447c1c0d 100644
--- a/llvm/test/CodeGen/AArch64/fast-isel-branch-cond-split.ll
+++ b/llvm/test/CodeGen/AArch64/fast-isel-branch-cond-split.ll
@@ -4,13 +4,11 @@
 define i64 @test_or(i32 %a, i32 %b) {
 ; CHECK-LABEL: test_or:
 ; CHECK:       ; %bb.0: ; %bb1
-; CHECK-NEXT:    mov w8, w0
+; CHECK-NEXT:    cbnz w0, LBB0_2
+; CHECK-NEXT:  LBB0_1:
 ; CHECK-NEXT:    mov x0, xzr
-; CHECK-NEXT:    cbnz w8, LBB0_2
-; CHECK-NEXT:  LBB0_1: ; %common.ret
 ; CHECK-NEXT:    ret
 ; CHECK-NEXT:  LBB0_2: ; %bb1.cond.split
-; CHECK-NEXT:    mov x0, xzr
 ; CHECK-NEXT:    cbz w1, LBB0_1
 ; CHECK-NEXT:  ; %bb.3: ; %bb4
 ; CHECK-NEXT:    stp x29, x30, [sp, #-16]! ; 16-byte Folded Spill
@@ -37,13 +35,11 @@ bb4:
 define i64 @test_or_select(i32 %a, i32 %b) {
 ; CHECK-LABEL: test_or_select:
 ; CHECK:       ; %bb.0: ; %bb1
-; CHECK-NEXT:    mov w8, w0
+; CHECK-NEXT:    cbnz w0, LBB1_2
+; CHECK-NEXT:  LBB1_1:
 ; CHECK-NEXT:    mov x0, xzr
-; CHECK-NEXT:    cbnz w8, LBB1_2
-; CHECK-NEXT:  LBB1_1: ; %common.ret
 ; CHECK-NEXT:    ret
 ; CHECK-NEXT:  LBB1_2: ; %bb1.cond.split
-; CHECK-NEXT:    mov x0, xzr
 ; CHECK-NEXT:    cbz w1, LBB1_1
 ; CHECK-NEXT:  ; %bb.3: ; %bb4
 ; CHECK-NEXT:    stp x29, x30, [sp, #-16]! ; 16-byte Folded Spill
@@ -70,13 +66,11 @@ bb4:
 define i64 @test_and(i32 %a, i32 %b) {
 ; CHECK-LABEL: test_and:
 ; CHECK:       ; %bb.0: ; %bb1
-; CHECK-NEXT:    mov w8, w0
+; CHECK-NEXT:    cbnz w0, LBB2_2
+; CHECK-NEXT:  LBB2_1:
 ; CHECK-NEXT:    mov x0, xzr
-; CHECK-NEXT:    cbnz w8, LBB2_2
-; CHECK-NEXT:  LBB2_1: ; %common.ret
 ; CHECK-NEXT:    ret
 ; CHECK-NEXT:  LBB2_2: ; %bb1.cond.split
-; CHECK-NEXT:    mov x0, xzr
 ; CHECK-NEXT:    cbz w1, LBB2_1
 ; CHECK-NEXT:  ; %bb.3: ; %bb4
 ; CHECK-NEXT:    stp x29, x30, [sp, #-16]! ; 16-byte Folded Spill
@@ -103,13 +97,11 @@ bb4:
 define i64 @test_and_select(i32 %a, i32 %b) {
 ; CHECK-LABEL: test_and_select:
 ; CHECK:       ; %bb.0: ; %bb1
-; CHECK-NEXT:    mov w8, w0
+; CHECK-NEXT:    cbnz w0, LBB3_2
+; CHECK-NEXT:  LBB3_1:
 ; CHECK-NEXT:    mov x0, xzr
-; CHECK-NEXT:    cbnz w8, LBB3_2
-; CHECK-NEXT:  LBB3_1: ; %common.ret
 ; CHECK-NEXT:    ret
 ; CHECK-NEXT:  LBB3_2: ; %bb1.cond.split
-; CHECK-NEXT:    mov x0, xzr
 ; CHECK-NEXT:    cbz w1, LBB3_1
 ; CHECK-NEXT:  ; %bb.3: ; %bb4
 ; CHECK-NEXT:    stp x29, x30, [sp, #-16]! ; 16-byte Folded Spill
diff --git a/llvm/test/CodeGen/RISCV/machine-sink-load-immediate.ll b/llvm/test/CodeGen/RISCV/machine-sink-load-immediate.ll
new file mode 100644
index 0000000000000..00d10b9f2bc0c
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/machine-sink-load-immediate.ll
@@ -0,0 +1,185 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=riscv64 < %s | FileCheck %s
+
+define i1 @sink_li(ptr %text, ptr %text.addr.0) {
+; CHECK-LABEL: sink_li:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    addi sp, sp, -32
+; CHECK-NEXT:    .cfi_def_cfa_offset 32
+; CHECK-NEXT:    sd ra, 24(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s0, 16(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s1, 8(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    sd s2, 0(sp) # 8-byte Folded Spill
+; CHECK-NEXT:    .cfi_offset ra, -8
+; CHECK-NEXT:    .cfi_offset s0, -16
+; CHECK-NEXT:    .cfi_offset s1, -24
+; CHECK-NEXT:    .cfi_offset s2, -32
+; CHECK-NEXT:    mv s1, a1
+; CHECK-NEXT:    mv s0, a0
+; CHECK-NEXT:    call toupper
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    beqz s0, .LBB0_26
+; CHECK-NEXT:  # %bb.1: # %while.body.preheader
+; CHECK-NEXT:    li a2, 1
+; CHECK-NEXT:    li a3, 9
+; CHECK-NEXT:    li a4, 32
+; CHECK-NEXT:  .LBB0_2: # %while.body
+; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    bnez a2, .LBB0_4
+; CHECK-NEXT:  # %bb.3: # %while.body
+; CHECK-NEXT:    # in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    bne a2, a3, .LBB0_16
+; CHECK-NEXT:  .LBB0_4: # %while.body.1
+; CHECK-NEXT:    # in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    bnez a2, .LBB0_6
+; CHECK-NEXT:  # %bb.5: # %while.body.1
+; CHECK-NEXT:    # in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    bne a2, a3, .LBB0_17
+; CHECK-NEXT:  .LBB0_6: # %while.body.3
+; CHECK-NEXT:    # in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    bnez a2, .LBB0_8
+; CHECK-NEXT:  # %bb.7: # %while.body.3
+; CHECK-NEXT:    # in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    bne a2, a4, .LBB0_19
+; CHECK-NEXT:  .LBB0_8: # %while.body.4
+; CHECK-NEXT:    # in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    bnez a2, .LBB0_10
+; CHECK-NEXT:  # %bb.9: # %while.body.4
+; CHECK-NEXT:    # in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    bne a2, a4, .LBB0_21
+; CHECK-NEXT:  .LBB0_10: # %while.body.5
+; CHECK-NEXT:    # in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    bnez a2, .LBB0_12
+; CHECK-NEXT:  # %bb.11: # %while.body.5
+; CHECK-NEXT:    # in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    bne a2, a3, .LBB0_23
+; CHECK-NEXT:  .LBB0_12: # %while.body.6
+; CHECK-NEXT:    # in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    bnez a2, .LBB0_2
+; CHECK-NEXT:  # %bb.13: # %while.body.6
+; CHECK-NEXT:    # in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    beq a2, a3, .LBB0_2
+; CHECK-NEXT:  # %bb.14: # %while.body.6
+; CHECK-NEXT:    beqz a2, .LBB0_24
+; CHECK-NEXT:  # %bb.15: # %strdup.exit.split.loop.exit126
+; CHECK-NEXT:    addi s0, s1, 7
+; CHECK-NEXT:    j .LBB0_25
+; CHECK-NEXT:  .LBB0_16: # %while.body
+; CHECK-NEXT:    beqz a2, .LBB0_26
+; CHECK-NEXT:    j .LBB0_18
+; CHECK-NEXT:  .LBB0_17: # %while.body.1
+; CHECK-NEXT:    beqz a2, .LBB0_24
+; CHECK-NEXT:  .LBB0_18: # %strdup.exit.loopexit
+; CHECK-NEXT:    li s0, 0
+; CHECK-NEXT:    j .LBB0_25
+; CHECK-NEXT:  .LBB0_19: # %while.body.3
+; CHECK-NEXT:    beqz a2, .LBB0_24
+; CHECK-NEXT:  # %bb.20: # %strdup.exit.split.loop.exit120
+; CHECK-NEXT:    addi s0, s1, 4
+; CHECK-NEXT:    j .LBB0_25
+; CHECK-NEXT:  .LBB0_21: # %while.body.4
+; CHECK-NEXT:    beqz a2, .LBB0_24
+; CHECK-NEXT:  # %bb.22: # %strdup.exit.split.loop.exit122
+; CHECK-NEXT:    addi s0, s1, 5
+; CHECK-NEXT:    j .LBB0_25
+; CHECK-NEXT:  .LBB0_23: # %while.body.5
+; CHECK-NEXT:    bnez a2, .LBB0_25
+; CHECK-NEXT:  .LBB0_24:
+; CHECK-NEXT:    li a1, 0
+; CHECK-NEXT:    j .LBB0_26
+; CHECK-NEXT:  .LBB0_25: # %strdup.exit
+; CHECK-NEXT:    li s1, 0
+; CHECK-NEXT:    mv s2, a0
+; CHECK-NEXT:    li a0, 0
+; CHECK-NEXT:    mv a1, s0
+; CHECK-NEXT:    jalr s1
+; CHECK-NEXT:    li a0, 0
+; CHECK-NEXT:    mv a1, s2
+; CHECK-NEXT:    li a2, 0
+; CHECK-NEXT:    jalr s1
+; CHECK-NEXT:    li a1, 1
+; CHECK-NEXT:  .LBB0_26: # %return
+; CHECK-NEXT:    mv a0, a1
+; CHECK-NEXT:    ld ra, 24(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s0, 16(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s1, 8(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    ld s2, 0(sp) # 8-byte Folded Reload
+; CHECK-NEXT:    addi sp, sp, 32
+; CHECK-NEXT:    ret
+entry:
+  %call = call i32 @toupper()
+  %tobool.not = icmp eq ptr %text, null
+  br i1 %tobool.not, label %return, label %while.body
+
+while.body:                                       ; preds = %while.body.6, %while.body.6, %entry
+  switch i8 1, label %strdup.exit.split.loop.exit114 [
+    i8 1, label %while.body.1
+    i8 9, label %while.body.1
+    i8 0, label %return
+  ]
+
+while.body.1:                                     ; preds = %while.body, %while.body
+  switch i8 1, label %strdup.exit [
+    i8 1, label %while.body.3
+    i8 9, label %while.body.3
+    i8 0, label %return
+  ]
+
+while.body.3:                                     ; preds = %while.body.1, %while.body.1
+  switch i8 1, label %strdup.exit.split.loop.exit120 [
+    i8 32, label %while.body.4
+    i8 1, label %while.body.4
+    i8 0, label %return
+  ]
+
+while.body.4:                                     ; preds = %while.body.3, %while.body.3
+  switch i8 1, label %strdup.exit.split.loop.exit122 [
+    i8 32, label %while.body.5
+    i8 1, label %while.body.5
+    i8 0, label %return
+  ]
+
+while.body.5:                                     ; preds = %while.body.4, %while.body.4
+  switch i8 1, label %strdup.exit.split.loop.exit124 [
+    i8 1, label %while.body.6
+    i8 9, label %while.body.6
+    i8 0, label %return
+  ]
+
+while.body.6:                                     ; preds = %while.body.5, %while.body.5
+  switch i8 1, label %strdup.exit.split.loop.exit126 [
+    i8 1, label %while.body
+    i8 9, label %while.body
+    i8 0, label %return
+  ]
+
+strdup.exit.split.loop.exit114:        ; preds = %while.body
+  br label %strdup.exit
+
+strdup.exit.split.loop.exit120:        ; preds = %while.body.3
+  %incdec.ptr.3.le = getelementptr i8, ptr %text.addr.0, i64 4
+  br label %strdup.exit
+
+strdup.exit.split.loop.exit122:        ; preds = %while.body.4
+  %incdec.ptr.4.le = getelementptr i8, ptr %text.addr.0, i64 5
+  br label %strdup.exit
+
+strdup.exit.split.loop.exit124:        ; preds = %while.body.5
+  br label %strdup.exit
+
+strdup.exit.split.loop.exit126:        ; preds = %while.body.6
+  %incdec.ptr.6.le = getelementptr i8, ptr %text.addr.0, i64 7
+  br label %strdup.exit
+
+strdup.exit:                           ; preds = %strdup.exit.split.loop.exit126, %strdup.exit.split.loop.exit124, %strdup.exit.split.loop.exit122, %strdup.exit.split.loop.exit120, %strdup.exit.split.loop.exit114, %while.body.1
+  %text.addr.0.lcssa = phi ptr [ null, %strdup.exit.split.loop.exit114 ], [ %incdec.ptr.3.le, %strdup.exit.split.loop.exit120 ], [ %incdec.ptr.4.le, %strdup.exit.split.loop.exit122 ], [ %text, %strdup.exit.split.loop.exit124 ], [ %incdec.ptr.6.le, %strdup.exit.split.loop.exit126 ], [ null, %while.body.1 ]
+  %call5.i = tail call ptr null(ptr null, ptr %text.addr.0.lcssa)
+  %memchr64 = tail call ptr null(ptr null, i32 %call, i64 0)
+  br label %return
+
+return:                                           ; preds = %strdup.exit, %while.body.6, %while.body.5, %while.body.4, %while.body.3, %while.body.1, %while.body, %entry
+  %retval.1 = phi i1 [ false, %entry ], [ true, %strdup.exit ], [ false, %while.body ], [ false, %while.body.1 ], [ false, %while.body.3 ], [ false, %while.body.4 ], [ false, %while.body.5 ], [ false, %while.body.6 ]
+  ret i1 %retval.1
+}
+
+declare i32 @toupper()

// We wanted to sink the same register into the same block, consider it to
// be profitable.
if (!Res.second) {
// Return the source block that was previously holded off.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

holded -> held

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

continue;
Register SrcReg = Reg.isVirtual() ? TRI->lookThruCopyLike(Reg, MRI) : Reg;
auto Key = std::make_pair(SrcReg, To);
auto Res = CEMergeCandidates.insert(std::make_pair(Key, From));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use try_emplace instead of insert to avoid the make_pair

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. It's fixed now.

if (!isWorthBreakingCriticalEdge(MI, FromBB, ToBB))
return false;

bool MachineSinking::isLegalBreakingCriticalEdge(MachineInstr &MI,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isLegalToBreakCriticalEdge?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

; CHECK-NEXT: sd s0, 16(sp) # 8-byte Folded Spill
; CHECK-NEXT: sd s1, 8(sp) # 8-byte Folded Spill
; CHECK-NEXT: sd s2, 0(sp) # 8-byte Folded Spill
; CHECK-NEXT: .cfi_offset ra, -8
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add nounwind to drop cfi directives

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

@wangpc-pp wangpc-pp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

4e0bd3f improved early MachineLICM's capabilities to hoist COPY from
physical registers out of a loop. However, it accidentally broke one of
MachineSink's preconditions on sinking cheap instructions (in this case,
COPY) which considered those instructions being profitable to sink only
when there are at least two of them in the same def-use chain in the
same basic block. So if early MachineLICM hoisted one of them out,
MachineSink no longer sink rest of the cheap instructions. This results in
redundant load immediate instructions from the motivating example we've
seen on RISC-V.

This patch fixes this by teaching MachineSink that if there is more than
one demand to sink a register into the same block from different
critical edges, it should be considered profitable as it increases the
CSE opportunities.
@mshockwave mshockwave force-pushed the patch/fix-machine-sink-load-imm branch from b5683f3 to d93577d Compare July 9, 2024 17:46
@mshockwave mshockwave merged commit 7e2f961 into llvm:main Jul 9, 2024
4 of 6 checks passed
@mshockwave mshockwave deleted the patch/fix-machine-sink-load-imm branch July 9, 2024 17:48
@alexfh
Copy link
Contributor

alexfh commented Jul 11, 2024

This commit causes an assertion failure when compiling some code with hwasan for aarch64:

assertion failed at llvm-project/llvm/lib/CodeGen/MachineBasicBlock.cpp:877 in void llvm::MachineBasicBlock::replaceSuccessor(MachineBasicBlock *, MachineBasicBlock *): OldI != E && "Old is not a successor of this block"
    @     0x55c77a4e3cc4  __assert_fail
    @     0x55c778a22219  llvm::MachineBasicBlock::replaceSuccessor()
    @     0x55c778a239b6  llvm::MachineBasicBlock::SplitCriticalEdge()
    @     0x55c778b558fc  (anonymous namespace)::MachineSinking::runOnMachineFunction()
    @     0x55c778a9c590  llvm::MachineFunctionPass::runOnFunction()
    @     0x55c77a195cb2  llvm::FPPassManager::runOnFunction()
    @     0x55c77a19dbc4  llvm::FPPassManager::runOnModule()
    @     0x55c77a1967a3  llvm::legacy::PassManagerImpl::run()
    @     0x55c774f5f9fc  clang::EmitBackendOutput()
    @     0x55c774bb077f  clang::BackendConsumer::HandleTranslationUnit()
    @     0x55c775b505b9  clang::ParseAST()
    @     0x55c77589d4c3  clang::FrontendAction::Execute()
    @     0x55c775817d8d  clang::CompilerInstance::ExecuteAction()
    @     0x55c77482a6d8  clang::ExecuteCompilerInvocation()
    @     0x55c77481ead6  cc1_main()

I'm currently working on a standalone reproducer.

@alexfh
Copy link
Contributor

alexfh commented Jul 11, 2024

The reduced test case is:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "aarch64-generic-linux-gnu"

define void @_f(ptr %project, i1 %cmp.i.i.not9.i) {
entry:
  %bf.load.i.i.i.i = load i8, ptr %project, align 1
  %tobool.i.i.i.i = icmp slt i8 %bf.load.i.i.i.i, 0
  br i1 %tobool.i.i.i.i, label %cond.true.i9.i.i, label %_ZNKSt3__u12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE16__get_short_sizeEv.exit.i.i.i

cond.true.i9.i.i:                                 ; preds = %entry
  %0 = load i64, ptr %project, align 8
  br label %_ZNKSt3__u12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE4cendEv.exit

_ZNKSt3__u12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE16__get_short_sizeEv.exit.i.i.i: ; preds = %entry
  %conv13.i.i.i.i = zext i8 %bf.load.i.i.i.i to i64
  br label %_ZNKSt3__u12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE4cendEv.exit

_ZNKSt3__u12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE4cendEv.exit: ; preds = %_ZNKSt3__u12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE16__get_short_sizeEv.exit.i.i.i, %cond.true.i9.i.i
  %cond.i8.i.i5 = phi i64 [ %0, %cond.true.i9.i.i ], [ %conv13.i.i.i.i, %_ZNKSt3__u12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE16__get_short_sizeEv.exit.i.i.i ]
  br i1 %cmp.i.i.not9.i, label %common.ret, label %for.body.i

for.cond.i:                                       ; preds = %for.body.i
  %incdec.ptr.i.i = getelementptr i8, ptr %__first.sroa.0.010.i, i64 1
  %cmp.i.i.not.i = icmp eq i64 %cond.i8.i.i5, 0
  br i1 %cmp.i.i.not.i, label %common.ret, label %for.body.i

for.body.i:                                       ; preds = %for.cond.i, %_ZNKSt3__u12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE4cendEv.exit
  %__first.sroa.0.010.i = phi ptr [ %incdec.ptr.i.i, %for.cond.i ], [ %project, %_ZNKSt3__u12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE4cendEv.exit ]
  %1 = load i8, ptr %__first.sroa.0.010.i, align 1
  %isdigit = icmp eq i8 %1, 0
  br i1 %isdigit, label %cond.false, label %for.cond.i

common.ret:                                       ; preds = %cond.end.i.thread.i.i72, %cond.false, %for.cond.i, %_ZNKSt3__u12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE4cendEv.exit
  ret void

cond.false:                                       ; preds = %for.body.i
  br i1 %tobool.i.i.i.i, label %common.ret, label %cond.end.i.thread.i.i72

cond.end.i.thread.i.i72:                          ; preds = %cond.false
  %conv13.i.i.i.i73 = zext i8 %bf.load.i.i.i.i to i64
  store i64 %conv13.i.i.i.i73, ptr %project, align 8
  br label %common.ret
}
./clang-checked-bad -O2 --target=aarch64-generic-linux-gnu -mcpu=neoverse-n1 -mtune=generic -fsanitize=hwaddress -mllvm -hwasan-inline-fast-path-checks=0 -c reduced.ll -o /dev/null
assertion failed at llvm-project/llvm/lib/CodeGen/MachineBasicBlock.cpp:877 in void llvm::MachineBasicBlock::replaceSuccessor(MachineBasicBlock *, MachineBasicBlock *): OldI != E && "Old is not a successor of this block"
    @     0x556998ae3cc4  __assert_fail
    @     0x556997022219  llvm::MachineBasicBlock::replaceSuccessor()
    @     0x5569970239b6  llvm::MachineBasicBlock::SplitCriticalEdge()
    @     0x5569971558fc  (anonymous namespace)::MachineSinking::runOnMachineFunction()
    @     0x55699709c590  llvm::MachineFunctionPass::runOnFunction()
    @     0x556998795cb2  llvm::FPPassManager::runOnFunction()
    @     0x55699879dbc4  llvm::FPPassManager::runOnModule()
    @     0x5569987967a3  llvm::legacy::PassManagerImpl::run()
    @     0x55699355f9fc  clang::EmitBackendOutput()
    @     0x5569931b6a3c  clang::CodeGenAction::ExecuteAction()
    @     0x556993e9d4c3  clang::FrontendAction::Execute()
    @     0x556993e17d8d  clang::CompilerInstance::ExecuteAction()
    @     0x556992e2a6d8  clang::ExecuteCompilerInvocation()
    @     0x556992e1ead6  cc1_main()
...

Please fix or revert. Thanks!

aemerson added a commit that referenced this pull request Jul 11, 2024
@mshockwave
Copy link
Member Author

Hi @alexfh thanks for discovering this bug. I think you just found a really interesting issue caused by an old bug.

Here is the MachineFunction right before the crash:

# Machine code for function _f: IsSSA, TracksLiveness
Function Live Ins: $x0 in %6, $w1 in %7

bb.0.entry:
  successors: %bb.1(0x30000000), %bb.2(0x50000000); %bb.1(37.50%), %bb.2(62.50%)
  liveins: $x0, $w1
  %7:gpr32 = COPY $w1
  %6:gpr64common = COPY $x0
  %8:gpr32 = COPY %7:gpr32
  %9:gpr32 = LDRSBWui %6:gpr64common, 0 :: (load (s8) from %ir.project)
  %11:gpr64all = IMPLICIT_DEF
  %10:gpr64 = INSERT_SUBREG %11:gpr64all(tied-def 0), %9:gpr32, %subreg.sub_32
  %12:gpr64sp = ANDXri killed %10:gpr64, 4103
  %0:gpr64 = COPY %12:gpr64sp
  %1:gpr64all = COPY %12:gpr64sp
  TBZW %9:gpr32, 31, %bb.2
  B %bb.1

bb.1.cond.true.i9.i.i:
; predecessors: %bb.0
  successors: %bb.5(0x40000000), %bb.4(0x40000000); %bb.5(50.00%), %bb.4(50.00%)

  %13:gpr64 = LDRXui %6:gpr64common, 0 :: (load (s64) from %ir.project)
  %2:gpr64 = COPY %13:gpr64
  TBNZW %8:gpr32, 0, %bb.5
  B %bb.4

bb.2._ZNKSt3__u12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE16__get_short_sizeEv.exit.i.i.i:
; predecessors: %bb.0
  successors: %bb.5(0x40000000), %bb.4(0x40000000); %bb.5(50.00%), %bb.4(50.00%)

  %18:gpr64 = COPY %1:gpr64all
  TBNZW %8:gpr32, 0, %bb.5
  B %bb.4

bb.3.for.cond.i:
; predecessors: %bb.4
  successors: %bb.5(0x04000000), %bb.4(0x7c000000); %bb.5(3.12%), %bb.4(96.88%)

  %15:gpr64sp = ADDXri %5:gpr64sp, 1, 0
  %4:gpr64all = COPY %15:gpr64sp
  CBZX %19:gpr64, %bb.5
  B %bb.4

bb.4.for.body.i:
; predecessors: %bb.1, %bb.2, %bb.3
  successors: %bb.6(0x04000000), %bb.3(0x7c000000); %bb.6(3.12%), %bb.3(96.88%)

  %19:gpr64 = PHI %19:gpr64, %bb.3, %2:gpr64, %bb.1, %18:gpr64, %bb.2
  %5:gpr64sp = PHI %6:gpr64common, %bb.1, %4:gpr64all, %bb.3, %6:gpr64common, %bb.2
  %14:gpr32 = LDRBBui %5:gpr64sp, 0 :: (load (s8) from %ir.__first.sroa.0.010.i)
  CBZW killed %14:gpr32, %bb.6
  B %bb.3

bb.5.common.ret:
; predecessors: %bb.1, %bb.2, %bb.3, %bb.6, %bb.7

  RET_ReallyLR

bb.6.cond.false:
; predecessors: %bb.4
  successors: %bb.5(0x30000000), %bb.7(0x50000000); %bb.5(37.50%), %bb.7(62.50%)

  %16:gpr32 = COPY %1.sub_32:gpr64all
  TBNZW killed %16:gpr32, 7, %bb.5
  B %bb.7

bb.7.cond.end.i.thread.i.i72:
; predecessors: %bb.6
  successors: %bb.5(0x80000000); %bb.5(100.00%)

  STRXui %0:gpr64, %6:gpr64common, 0 :: (store (s64) into %ir.project)
  B %bb.5

# End machine code for function _f.

MachineSink thinks it's profitable to sink %0:gpr64 = COPY %12:gpr64sp from bb.0 to bb.4 after the critical edge between those two blocks are broken. But bb.4 is not even a successor of bb.0, hence the assertion failure.

The reason it thought bb.0 - bb.4 is even a critical edge in the first place is because bb.4 is the SuccToSinkTo block w.r.t the current block (i.e. bb.0) returned from FindSuccToSinkTo. MachineSink consider the edge between current block and SuccToSinkTo a critical edge if SuccToSinkTo has more than one predecessor -- which is true for most of the cases. However, FindSuccToSinkTo not only return a (immediate) successor block but also look beyond to all descendant blocks:

// Handle cases where sinking can happen but where the sink point isn't a
. Which is why FindSuccToSinkTo return bb.4 despite the fact that bb.4 is not even a successor of bb.0. (The reason I said this is an old bug because the logics of looking into descendant blocks has been there for a long time)

I think a simple fix on isLegalToBreakCriticalEdge to reject cases like this should fix the issue. I'm cooking a patch right now...

alexfh pushed a commit that referenced this pull request Jul 12, 2024
…sic blocks involved in critical edge splitting (#98540)

Fix an issue in #97618 - if the two basic blocks involved are not
predecessor / successor to each other, treat the candidate as illegal
for critical edge splitting.

Closes #98477 (checked in test copied from its comment).
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
4e0bd3f improved early MachineLICM's capabilities to hoist COPY from
physical registers out of a loop. However, it accidentally broke one of
MachineSink's preconditions on sinking cheap instructions (in this case,
COPY) which considered those instructions being profitable to sink only
when there are at least two of them in the same def-use chain in the
same basic block. So if early MachineLICM hoisted one of them out,
MachineSink no longer sink rest of the cheap instructions. This results
in redundant load immediate instructions from the motivating example
we've seen on RISC-V.

This patch fixes this by teaching MachineSink that if there is more than
one demand to sink a register into the same block from different
critical edges, it should be considered profitable as it increases the
CSE opportunities.
This change also improves two of the AArch64's cases.
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
…sic blocks involved in critical edge splitting (llvm#98540)

Fix an issue in llvm#97618 - if the two basic blocks involved are not
predecessor / successor to each other, treat the candidate as illegal
for critical edge splitting.

Closes llvm#98477 (checked in test copied from its comment).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants