
[GreedyRA] Improve RA for nested loop induction variables #72093

Merged
davemgreen merged 2 commits into llvm:main from gh-regallociv
Nov 18, 2023

Conversation

davemgreen
Collaborator

Imagine a loop of the form:

  preheader:
    %r = def
  header:
    bcc latch, inner
  inner1:
    ..
  inner2:
    b latch
  latch:
    %r = subs %r
    bcc header

Code can spend a decent amount of time in the header<->latch loop without entering the inner part of the loop very often. The greedy register allocator can nevertheless prefer to spill around %r, adding spills around the subs in the loop, which can be very detrimental to performance. (The case I am looking at is actually a very deeply nested set of loops that repeat the header<->latch pattern at multiple levels.)

The greedy RA applies a spill preference to the IV because it is live through the header block. This patch adds a heuristic to prevent that for variables that look like IVs, similar to the extra spill weight that is already given to IV-like variables because they are expensive to spill. Spills are then more likely to be pushed into the inner blocks, where they are less likely to be executed and are not as expensive as spills around the IV.

This gives an 8% speedup in the exchange benchmark from SPEC2017 when compiled with flang-new, while importantly stabilising the scores so they are less chaotic under other changes. Running CTMark showed no difference in compile time. I've run a range of performance benchmarks, most of which were relatively flat and showed no large differences. One matrix-multiply case improved by 21.3% due to removing a cascading chain of spills, and some other knock-on effects cause small differences in the scores.
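
For illustration, source code of roughly the following shape would produce the CFG above (a made-up C++ sketch, not taken from the benchmark): the outer counter is only touched in the header and latch, and the inner work is frequently skipped.

```
// Illustrative only: the outer induction variable is decremented in the
// latch, and the inner body is often skipped, so the hot path is the
// header <-> latch pair rather than the inner blocks.
void kernel(int *data, int n, int m) {
  for (int outer = n; outer > 0; --outer) {   // header/latch carry 'outer'
    if (data[outer] == 0)                     // header branch past the inner part
      continue;                               // falls through to the latch
    for (int inner = 0; inner < m; ++inner)   // less frequently executed work
      data[inner] += outer;
  }
}
```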

@llvmbot
Collaborator

llvmbot commented Nov 13, 2023

@llvm/pr-subscribers-debuginfo

Author: David Green (davemgreen)


Full diff: https://github.com/llvm/llvm-project/pull/72093.diff

5 Files Affected:

  • (modified) llvm/lib/CodeGen/RegAllocGreedy.cpp (+19-3)
  • (modified) llvm/lib/CodeGen/SplitKit.cpp (+12)
  • (modified) llvm/lib/CodeGen/SplitKit.h (+6)
  • (modified) llvm/test/CodeGen/AArch64/nested-iv-regalloc.mir (+21-20)
  • (modified) llvm/test/DebugInfo/MIR/InstrRef/memory-operand-folding-tieddef.mir (+1-1)
diff --git a/llvm/lib/CodeGen/RegAllocGreedy.cpp b/llvm/lib/CodeGen/RegAllocGreedy.cpp
index 2ad534c985b0298..3896385c5f97448 100644
--- a/llvm/lib/CodeGen/RegAllocGreedy.cpp
+++ b/llvm/lib/CodeGen/RegAllocGreedy.cpp
@@ -767,10 +767,26 @@ bool RAGreedy::growRegion(GlobalSplitCandidate &Cand) {
     if (Cand.PhysReg) {
       if (!addThroughConstraints(Cand.Intf, NewBlocks))
         return false;
-    } else
+    } else {
       // Provide a strong negative bias on through blocks to prevent unwanted
-      // liveness on loop backedges.
-      SpillPlacer->addPrefSpill(NewBlocks, /* Strong= */ true);
+      // liveness on loop backedges. Make sure we don't do this for the Loop
+      // Header in variables that look like loop headers though, where the spill
+      // is better pushed into a condition inside the loop.
+      bool PrefSpill = true;
+      if (SA->looksLikeLoopIV() && NewBlocks.size() == 3) {
+        // NewBlocks==3 means we are (possible) adding a Header + start+end of
+        // two loop-internal blocks. If the block is indeed a header, don't make
+        // the newblocks as PrefSpill to allow the variable to be live in
+        // header<->latch.
+        MachineLoop *L = Loops->getLoopFor(MF->getBlockNumbered(NewBlocks[0]));
+        if (L && L->getHeader()->getNumber() == NewBlocks[0] &&
+            L == Loops->getLoopFor(MF->getBlockNumbered(NewBlocks[1])) &&
+            L == Loops->getLoopFor(MF->getBlockNumbered(NewBlocks[2])))
+          PrefSpill = false;
+      }
+      if (PrefSpill)
+        SpillPlacer->addPrefSpill(NewBlocks, /* Strong= */ true);
+    }
     AddedTo = ActiveBlocks.size();
 
     // Perhaps iterating can enable more bundles?
diff --git a/llvm/lib/CodeGen/SplitKit.cpp b/llvm/lib/CodeGen/SplitKit.cpp
index 1664c304f643c3f..cb99939cefd3258 100644
--- a/llvm/lib/CodeGen/SplitKit.cpp
+++ b/llvm/lib/CodeGen/SplitKit.cpp
@@ -45,6 +45,11 @@ using namespace llvm;
 
 #define DEBUG_TYPE "regalloc"
 
+static cl::opt<bool>
+    EnableLoopIVHeuristic("enable-split-loopiv-heuristic",
+                          cl::desc("Enable loop iv regalloc heuristic"),
+                          cl::init(true));
+
 STATISTIC(NumFinished, "Number of splits finished");
 STATISTIC(NumSimple,   "Number of splits that were simple");
 STATISTIC(NumCopies,   "Number of copies inserted for splitting");
@@ -293,6 +298,13 @@ void SplitAnalysis::calcLiveBlockInfo() {
       MFI = LIS.getMBBFromIndex(LVI->start)->getIterator();
   }
 
+  LooksLikeLoopIV = EnableLoopIVHeuristic && UseBlocks.size() == 2 &&
+                    any_of(UseBlocks, [this](BlockInfo &BI) {
+                      MachineLoop *L = Loops.getLoopFor(BI.MBB);
+                      return BI.LiveIn && BI.LiveOut && BI.FirstDef && L &&
+                             L->isLoopLatch(BI.MBB);
+                    });
+
   assert(getNumLiveBlocks() == countLiveBlocks(CurLI) && "Bad block count");
 }
 
diff --git a/llvm/lib/CodeGen/SplitKit.h b/llvm/lib/CodeGen/SplitKit.h
index 1174e392e4e4424..cc277ecc0e882b9 100644
--- a/llvm/lib/CodeGen/SplitKit.h
+++ b/llvm/lib/CodeGen/SplitKit.h
@@ -159,6 +159,10 @@ class LLVM_LIBRARY_VISIBILITY SplitAnalysis {
   /// NumThroughBlocks - Number of live-through blocks.
   unsigned NumThroughBlocks = 0u;
 
+  /// LooksLikeLoopIV - The variable defines what looks like it could be a loop
+  /// IV, where it defs a variable in the latch.
+  bool LooksLikeLoopIV = false;
+
   // Sumarize statistics by counting instructions using CurLI.
   void analyzeUses();
 
@@ -209,6 +213,8 @@ class LLVM_LIBRARY_VISIBILITY SplitAnalysis {
     return getUseBlocks().size() - NumGapBlocks + getNumThroughBlocks();
   }
 
+  bool looksLikeLoopIV() const { return LooksLikeLoopIV; }
+
   /// countLiveBlocks - Return the number of blocks where li is live. This is
   /// guaranteed to return the same number as getNumLiveBlocks() after calling
   /// analyze(li).
diff --git a/llvm/test/CodeGen/AArch64/nested-iv-regalloc.mir b/llvm/test/CodeGen/AArch64/nested-iv-regalloc.mir
index ac406df6fecec39..3bd8f83d27c2dad 100644
--- a/llvm/test/CodeGen/AArch64/nested-iv-regalloc.mir
+++ b/llvm/test/CodeGen/AArch64/nested-iv-regalloc.mir
@@ -1,6 +1,8 @@
 # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
 # RUN: llc -mtriple aarch64 --run-pass=greedy,virtregrewriter -verify-machineinstrs %s -o - | FileCheck %s
 
+# We should ideally not spill around any of the SUBSWri in the loop exit blocks (if.end and if.end27).
+
 --- |
   target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
   target triple = "aarch64"
@@ -206,8 +208,7 @@ body:             |
   ; CHECK-NEXT:   renamable $w8 = LDRWui renamable $x0, 0 :: (load (s32) from %ir.p)
   ; CHECK-NEXT:   renamable $w9 = LDRWui killed renamable $x0, 1 :: (load (s32) from %ir.arrayidx2)
   ; CHECK-NEXT:   dead $wzr = SUBSWri renamable $w8, 1, 0, implicit-def $nzcv
-  ; CHECK-NEXT:   renamable $w8 = CSINCWr killed renamable $w8, $wzr, 12, implicit $nzcv
-  ; CHECK-NEXT:   STRWui killed renamable $w8, %stack.2, 0 :: (store (s32) into %stack.2)
+  ; CHECK-NEXT:   renamable $w11 = CSINCWr killed renamable $w8, $wzr, 12, implicit $nzcv
   ; CHECK-NEXT:   renamable $x8 = COPY killed renamable $x10
   ; CHECK-NEXT:   dead $wzr = SUBSWri renamable $w9, 1, 0, implicit-def $nzcv
   ; CHECK-NEXT:   renamable $w10 = CSINCWr killed renamable $w9, $wzr, 12, implicit $nzcv
@@ -215,7 +216,7 @@ body:             |
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.1.do.body:
   ; CHECK-NEXT:   successors: %bb.3(0x50000000), %bb.2(0x30000000)
-  ; CHECK-NEXT:   liveins: $w10, $x2, $x8
+  ; CHECK-NEXT:   liveins: $w10, $w11, $x2, $x8
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   STRXui renamable $x8, %stack.1, 0 :: (store (s64) into %stack.1)
   ; CHECK-NEXT:   renamable $w9 = MOVi32imm 36, implicit-def $x9
@@ -227,23 +228,24 @@ body:             |
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2:
   ; CHECK-NEXT:   successors: %bb.10(0x80000000)
-  ; CHECK-NEXT:   liveins: $w10, $x2
+  ; CHECK-NEXT:   liveins: $w10, $w11, $x2
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $x8 = LDRXui %stack.1, 0 :: (load (s64) from %stack.1)
   ; CHECK-NEXT:   B %bb.10
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.3.do.body12.preheader:
   ; CHECK-NEXT:   successors: %bb.4(0x80000000)
-  ; CHECK-NEXT:   liveins: $w10, $x2
+  ; CHECK-NEXT:   liveins: $w10, $w11, $x2
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT:   renamable $x11 = COPY $xzr
+  ; CHECK-NEXT:   renamable $x12 = COPY $xzr
+  ; CHECK-NEXT:   STRWui renamable $w11, %stack.2, 0 :: (store (s32) into %stack.2)
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.4.do.body12:
   ; CHECK-NEXT:   successors: %bb.5(0x50000000), %bb.8(0x30000000)
-  ; CHECK-NEXT:   liveins: $w10, $x2, $x11
+  ; CHECK-NEXT:   liveins: $w10, $w11, $x2, $x12
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = MOVi32imm 36, implicit-def $x8
-  ; CHECK-NEXT:   renamable $x8 = MADDXrrr renamable $x11, killed renamable $x8, $xzr
+  ; CHECK-NEXT:   renamable $x8 = MADDXrrr renamable $x12, killed renamable $x8, $xzr
   ; CHECK-NEXT:   renamable $x9 = MOVaddr target-flags(aarch64-page) @g, target-flags(aarch64-pageoff, aarch64-nc) @g
   ; CHECK-NEXT:   renamable $w8 = LDRWroX killed renamable $x9, killed renamable $x8, 0, 0 :: (load (s32) from %ir.arrayidx14)
   ; CHECK-NEXT:   dead $wzr = SUBSWri killed renamable $w8, 1, 0, implicit-def $nzcv
@@ -252,11 +254,11 @@ body:             |
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.5.if.then17:
   ; CHECK-NEXT:   successors: %bb.7(0x80000000)
-  ; CHECK-NEXT:   liveins: $w10, $x2, $x11
+  ; CHECK-NEXT:   liveins: $w10, $x2, $x12
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   STRWui killed renamable $w10, %stack.3, 0 :: (store (s32) into %stack.3)
-  ; CHECK-NEXT:   STRXui renamable $x11, %stack.4, 0 :: (store (s64) into %stack.4)
-  ; CHECK-NEXT:   renamable $w20 = LDRWroX killed renamable $x2, killed renamable $x11, 0, 1 :: (volatile load (s32) from %ir.arrayidx19)
+  ; CHECK-NEXT:   STRXui renamable $x12, %stack.4, 0 :: (store (s64) into %stack.4)
+  ; CHECK-NEXT:   renamable $w20 = LDRWroX killed renamable $x2, killed renamable $x12, 0, 1 :: (volatile load (s32) from %ir.arrayidx19)
   ; CHECK-NEXT:   renamable $w19 = MOVi32imm 100
   ; CHECK-NEXT:   B %bb.7
   ; CHECK-NEXT: {{  $}}
@@ -267,8 +269,9 @@ body:             |
   ; CHECK-NEXT:   renamable $w1 = COPY killed renamable $w20
   ; CHECK-NEXT:   INLINEASM &"nop;nop", 1 /* sideeffect attdialect */, 12 /* clobber */, implicit-def dead early-clobber $x0, 12 /* clobber */, implicit-def dead early-clobber $x2, 12 /* clobber */, implicit-def dead early-clobber $x3, 12 /* clobber */, implicit-def dead early-clobber $x4, 12 /* clobber */, implicit-def dead early-clobber $x5, 12 /* clobber */, implicit-def dead early-clobber $x6, 12 /* clobber */, implicit-def dead early-clobber $x7, 12 /* clobber */, implicit-def dead early-clobber $x8, 12 /* clobber */, implicit-def dead early-clobber $x9, 12 /* clobber */, implicit-def dead early-clobber $x10, 12 /* clobber */, implicit-def dead early-clobber $x11, 12 /* clobber */, implicit-def dead early-clobber $x12, 12 /* clobber */, implicit-def dead early-clobber $x13, 12 /* clobber */, implicit-def dead early-clobber $x14, 12 /* clobber */, implicit-def dead early-clobber $x15, 12 /* clobber */, implicit-def dead early-clobber $x16, 12 /* clobber */, implicit-def dead early-clobber $x17, 12 /* clobber */, implicit-def dead early-clobber $x18, 12 /* clobber */, implicit-def dead early-clobber $x19, 12 /* clobber */, implicit-def dead early-clobber $x20, 12 /* clobber */, implicit-def dead early-clobber $x21, 12 /* clobber */, implicit-def dead early-clobber $x22, 12 /* clobber */, implicit-def dead early-clobber $x23, 12 /* clobber */, implicit-def dead early-clobber $x24, 12 /* clobber */, implicit-def dead early-clobber $x25, 12 /* clobber */, implicit-def dead early-clobber $x26, 12 /* clobber */, implicit-def dead early-clobber $x27, 12 /* clobber */, implicit-def dead early-clobber $x28, 12 /* clobber */, implicit-def dead early-clobber $fp, 12 /* clobber */, implicit-def dead early-clobber $lr
   ; CHECK-NEXT:   renamable $x2 = LDRXui %stack.0, 0 :: (load (s64) from %stack.0)
-  ; CHECK-NEXT:   renamable $x11 = LDRXui %stack.4, 0 :: (load (s64) from %stack.4)
-  ; CHECK-NEXT:   STRWroX killed renamable $w1, renamable $x2, renamable $x11, 0, 1 :: (volatile store (s32) into %ir.sunkaddr1)
+  ; CHECK-NEXT:   renamable $x12 = LDRXui %stack.4, 0 :: (load (s64) from %stack.4)
+  ; CHECK-NEXT:   STRWroX killed renamable $w1, renamable $x2, renamable $x12, 0, 1 :: (volatile store (s32) into %ir.sunkaddr1)
+  ; CHECK-NEXT:   renamable $w11 = LDRWui %stack.2, 0 :: (load (s32) from %stack.2)
   ; CHECK-NEXT:   renamable $w10 = LDRWui %stack.3, 0 :: (load (s32) from %stack.3)
   ; CHECK-NEXT:   B %bb.8
   ; CHECK-NEXT: {{  $}}
@@ -286,16 +289,16 @@ body:             |
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.8.if.end:
   ; CHECK-NEXT:   successors: %bb.9(0x04000000), %bb.4(0x7c000000)
-  ; CHECK-NEXT:   liveins: $w10, $x2, $x11
+  ; CHECK-NEXT:   liveins: $w10, $w11, $x2, $x12
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w10 = nsw SUBSWri killed renamable $w10, 1, 0, implicit-def $nzcv
-  ; CHECK-NEXT:   renamable $x11 = nuw nsw ADDXri killed renamable $x11, 1, 0
+  ; CHECK-NEXT:   renamable $x12 = nuw nsw ADDXri killed renamable $x12, 1, 0
   ; CHECK-NEXT:   Bcc 1, %bb.4, implicit $nzcv
   ; CHECK-NEXT:   B %bb.9
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.9.do.end:
   ; CHECK-NEXT:   successors: %bb.10(0x80000000)
-  ; CHECK-NEXT:   liveins: $x2
+  ; CHECK-NEXT:   liveins: $w11, $x2
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w10 = COPY $wzr
   ; CHECK-NEXT:   renamable $x8 = LDRXui %stack.1, 0 :: (load (s64) from %stack.1)
@@ -303,11 +306,9 @@ body:             |
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.10.if.end27:
   ; CHECK-NEXT:   successors: %bb.11(0x04000000), %bb.1(0x7c000000)
-  ; CHECK-NEXT:   liveins: $w10, $x2, $x8
+  ; CHECK-NEXT:   liveins: $w10, $w11, $x2, $x8
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT:   renamable $w9 = LDRWui %stack.2, 0 :: (load (s32) from %stack.2)
-  ; CHECK-NEXT:   renamable $w9 = nsw SUBSWri killed renamable $w9, 1, 0, implicit-def $nzcv
-  ; CHECK-NEXT:   STRWui killed renamable $w9, %stack.2, 0 :: (store (s32) into %stack.2)
+  ; CHECK-NEXT:   renamable $w11 = nsw SUBSWri killed renamable $w11, 1, 0, implicit-def $nzcv
   ; CHECK-NEXT:   renamable $x8 = nuw nsw ADDXri killed renamable $x8, 1, 0
   ; CHECK-NEXT:   Bcc 1, %bb.1, implicit $nzcv
   ; CHECK-NEXT:   B %bb.11
diff --git a/llvm/test/DebugInfo/MIR/InstrRef/memory-operand-folding-tieddef.mir b/llvm/test/DebugInfo/MIR/InstrRef/memory-operand-folding-tieddef.mir
index 0ef750f94b867ad..7d074b3faefc1ce 100644
--- a/llvm/test/DebugInfo/MIR/InstrRef/memory-operand-folding-tieddef.mir
+++ b/llvm/test/DebugInfo/MIR/InstrRef/memory-operand-folding-tieddef.mir
@@ -1,6 +1,6 @@
 # RUN: llc %s -o - -experimental-debug-variable-locations \
 # RUN:    -start-before=x86-flags-copy-lowering -stop-after=virtregrewriter \
-# RUN:    -mtriple x86_64-unknown-unknown \
+# RUN:    -mtriple x86_64-unknown-unknown -enable-split-loopiv-heuristic=false \
 # RUN: | FileCheck %s
 #
 # This test is for stack spill folding -- the INC32r near the end of the MIR
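
Note: the new cl::opt defaults to true; to rule the heuristic out when triaging a suspected regression it can be disabled on the llc command line, as the updated DebugInfo/MIR test does, e.g. `llc ... -enable-split-loopiv-heuristic=false`.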

@llvmbot
Collaborator

llvmbot commented Nov 13, 2023

@llvm/pr-subscribers-llvm-regalloc

Author: David Green (davemgreen)

@llvmbot
Collaborator

llvmbot commented Nov 13, 2023

@llvm/pr-subscribers-backend-aarch64

Author: David Green (davemgreen)

llvm/lib/CodeGen/RegAllocGreedy.cpp (review comment on the new check):

    MachineLoop *L = Loops->getLoopFor(MF->getBlockNumbered(NewBlocks[0]));
    if (L && L->getHeader()->getNumber() == NewBlocks[0] &&
        L == Loops->getLoopFor(MF->getBlockNumbered(NewBlocks[1])) &&
        L == Loops->getLoopFor(MF->getBlockNumbered(NewBlocks[2])))
Collaborator


I don't quite get the argument about the number 3.

What would happen if we had, say, 5 loop-internal blocks?
Why do we want to do something differently in this case?

Collaborator Author


I was aiming for the three blocks in the bundle being the loop header plus the start and end of two inner loop blocks, if I'm understanding the bundles correctly. It would represent an if condition in the loop. In the original case there are many more "through" blocks in the loop, I think up to 180, but the first bundle is just the header + 2 inner blocks.

There might be more than one block jumping to the latch. I could try to generalise it to any number of blocks so long as they are all in the loop, if you think that sounds better? That should be similar in terms of performance and compile time.
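
For reference, one possible shape of that generalisation (a hypothetical, untested sketch; it reuses the SA, Loops, MF and SpillPlacer variables from growRegion and LLVM's all_of/drop_begin helpers) would be:

```
// Hypothetical generalisation (not part of this patch): instead of requiring
// exactly three new blocks, accept any bundle whose first block is a loop
// header and whose remaining blocks all belong to that same loop.
bool PrefSpill = true;
if (SA->looksLikeLoopIV() && NewBlocks.size() >= 2) {
  MachineLoop *L = Loops->getLoopFor(MF->getBlockNumbered(NewBlocks[0]));
  if (L && L->getHeader()->getNumber() == NewBlocks[0] &&
      all_of(drop_begin(NewBlocks), [&](unsigned BB) {
        return Loops->getLoopFor(MF->getBlockNumbered(BB)) == L;
      }))
    PrefSpill = false;
}
if (PrefSpill)
  SpillPlacer->addPrefSpill(NewBlocks, /* Strong= */ true);
```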

llvm/lib/CodeGen/RegAllocGreedy.cpp (outdated review thread, resolved)
Imagine a loop of the form:
  preheader:
    %r = def
  header:
    bcc latch, inner
  inner1:
    ..
  inner2:
    b latch
  latch:
    %r = subs %r
    bcc header

It can be possible for code to spend a decent amount of time in the
header<->latch loop, not going into the inner part of the loop as much. The
greedy register allocator can prefer to spill around %r though, adding spills
around the subs in the loop, which can be very detrimental for performance.
(The case I am looking at is actually a very deeply nested set of loops that
repeat the header<->latch pattern at multiple different levels).

The greedy RA will apply a preference to spill to the IV, as it is live through
the header block. This patch attempts to add a heuristic to prevent that in
this case for variables that look like IVs, in a similar regard to the extra
spill weight that gets added to variables that look like IVs, that are
expensive to spill. That tries to push the spill into the inner blocks, where
it is less likely to be executed and not as expensive as a spill around the
IV.

This gives an 8% speedup in the exchange benchmark from SPEC2017 when compiled
with flang-new, while importantly stabilising the scores so they are less
chaotic under other changes. Running CTMark showed no difference in the
compile time. I've run a relatively large amount of performance benchmarking,
most of which is relatively flat and shows no large differences in the cases
I tried.
@davemgreen
Collaborator Author

Thanks

@davemgreen davemgreen merged commit 303a783 into llvm:main Nov 18, 2023
2 of 3 checks passed
@davemgreen davemgreen deleted the gh-regallociv branch November 18, 2023 09:55
sr-tream pushed a commit to sr-tream/llvm-project that referenced this pull request Nov 20, 2023
zahiraam pushed a commit to zahiraam/llvm-project that referenced this pull request Nov 20, 2023