
[GVN] Permit load PRE to happen in more cases #76063

Closed
wants to merge 2 commits

Conversation

david-arm
Contributor

@david-arm david-arm commented Dec 20, 2023

There is code in PerformLoadPRE that bails out if the number of
predecessors where we have to insert a load is greater than one:

  // If we need to insert new load in multiple predecessors, reject it.
  // FIXME: If we could restructure the CFG, we could make a common pred with
  // all the preds that don't have an available Load and insert a new load into
  // that one block.
  if (NumInsertPreds > 1)
      return false;

However, this is quite restrictive because NumInsertPreds is the sum

  unsigned NumInsertPreds = PredLoads.size() + CriticalEdgePredSplit.size();

where PredLoads and CriticalEdgePredSplit are the lists of available and
unavailable loads, respectively. Therefore, with just one unavailable
load we refuse to perform load PRE if there are other predecessors with
available loads. I've changed this check above to be less conservative:

  if (CriticalEdgePredSplit.size() > 1)
      return false;

As a result we can perform load PRE in many more cases. On the
AArch64 neoverse-v1 machine we see 1% gains in both SPEC2017 perlbench
and omnetpp benchmarks. On neoverse-n1 we see 1% and 3.5% gains in
perlbench and omnetpp, respectively. I only observed one regression with
exchange2, but this is related to function specialisation and will be
fixed separately.

@llvmbot
Collaborator

llvmbot commented Dec 20, 2023

@llvm/pr-subscribers-llvm-transforms

Author: David Sherwood (david-arm)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/76063.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/GVN.cpp (+1-1)
  • (modified) llvm/test/Transforms/GVN/PRE/load-pre-licm.ll (+97-1)
  • (modified) llvm/test/Transforms/GVN/condprop.ll (+12-8)
diff --git a/llvm/lib/Transforms/Scalar/GVN.cpp b/llvm/lib/Transforms/Scalar/GVN.cpp
index 5e58af0edc1556..02d7315115c4e5 100644
--- a/llvm/lib/Transforms/Scalar/GVN.cpp
+++ b/llvm/lib/Transforms/Scalar/GVN.cpp
@@ -1625,7 +1625,7 @@ bool GVNPass::PerformLoadPRE(LoadInst *Load, AvailValInBlkVect &ValuesPerBlock,
   // FIXME: If we could restructure the CFG, we could make a common pred with
   // all the preds that don't have an available Load and insert a new load into
   // that one block.
-  if (NumInsertPreds > 1)
+  if (CriticalEdgePredSplit.size() > 1)
       return false;
 
   // Now we know where we will insert load. We must ensure that it is safe
diff --git a/llvm/test/Transforms/GVN/PRE/load-pre-licm.ll b/llvm/test/Transforms/GVN/PRE/load-pre-licm.ll
index c52f46b4f63ee9..9ec9568b3a5bce 100644
--- a/llvm/test/Transforms/GVN/PRE/load-pre-licm.ll
+++ b/llvm/test/Transforms/GVN/PRE/load-pre-licm.ll
@@ -130,11 +130,13 @@ define i32 @test3(i1 %cnd, ptr %p) {
 ; CHECK:       header:
 ; CHECK-NEXT:    br i1 [[CND:%.*]], label [[BB1:%.*]], label [[BB2:%.*]]
 ; CHECK:       bb1:
+; CHECK-NEXT:    [[V1_PRE1:%.*]] = load i32, ptr [[P:%.*]], align 4
 ; CHECK-NEXT:    br label [[MERGE:%.*]]
 ; CHECK:       bb2:
+; CHECK-NEXT:    [[V1_PRE:%.*]] = load i32, ptr [[P]], align 4
 ; CHECK-NEXT:    br label [[MERGE]]
 ; CHECK:       merge:
-; CHECK-NEXT:    [[V1:%.*]] = load i32, ptr [[P:%.*]], align 4
+; CHECK-NEXT:    [[V1:%.*]] = phi i32 [ [[V1_PRE]], [[BB2]] ], [ [[V1_PRE1]], [[BB1]] ]
 ; CHECK-NEXT:    call void @hold(i32 [[V1]])
 ; CHECK-NEXT:    br label [[HEADER]]
 ;
@@ -282,3 +284,97 @@ header:
   call void @hold(i32 %v1)
   br label %header
 }
+
+
+; In block Z one load is unavailable (via entry block), whereas via other 2
+; predecessors the loads are available.
+define i8 @test7a(i1 %c1, ptr %p, i8 %i, i8 %j) {
+; CHECK-LABEL: @test7a(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[C1:%.*]], label [[A:%.*]], label [[ENTRY_Z_CRIT_EDGE:%.*]]
+; CHECK:       entry.Z_crit_edge:
+; CHECK-NEXT:    [[PK_PHI_TRANS_INSERT:%.*]] = getelementptr i8, ptr [[P:%.*]], i8 [[I:%.*]]
+; CHECK-NEXT:    [[Z_PRE1:%.*]] = load i8, ptr [[PK_PHI_TRANS_INSERT]], align 1
+; CHECK-NEXT:    br label [[Z:%.*]]
+; CHECK:       A:
+; CHECK-NEXT:    [[PI:%.*]] = getelementptr i8, ptr [[P]], i8 [[I]]
+; CHECK-NEXT:    [[PJ:%.*]] = getelementptr i8, ptr [[P]], i8 [[J:%.*]]
+; CHECK-NEXT:    [[X:%.*]] = load i8, ptr [[PI]], align 1
+; CHECK-NEXT:    [[Y:%.*]] = load i8, ptr [[PJ]], align 1
+; CHECK-NEXT:    [[C2:%.*]] = icmp slt i8 [[X]], [[Y]]
+; CHECK-NEXT:    br i1 [[C2]], label [[Z]], label [[B:%.*]]
+; CHECK:       B:
+; CHECK-NEXT:    br label [[Z]]
+; CHECK:       Z:
+; CHECK-NEXT:    [[Z:%.*]] = phi i8 [ [[Z_PRE1]], [[ENTRY_Z_CRIT_EDGE]] ], [ [[X]], [[A]] ], [ [[Y]], [[B]] ]
+; CHECK-NEXT:    [[K:%.*]] = phi i8 [ [[I]], [[ENTRY_Z_CRIT_EDGE]] ], [ [[I]], [[A]] ], [ [[J]], [[B]] ]
+; CHECK-NEXT:    [[PK:%.*]] = getelementptr i8, ptr [[P]], i8 [[K]]
+; CHECK-NEXT:    ret i8 [[Z]]
+;
+
+entry:
+  br i1 %c1, label %A, label %Z
+
+A:
+  %pi = getelementptr i8, ptr %p, i8 %i
+  %pj = getelementptr i8, ptr %p, i8 %j
+  %x = load i8, ptr %pi
+  %y = load i8, ptr %pj
+  %c2 = icmp slt i8 %x, %y
+  br i1 %c2, label %Z, label %B
+
+B:
+  br label %Z
+
+Z:
+  %k = phi i8 [ %i, %entry ], [%i, %A], [ %j, %B ]
+  %pk = getelementptr i8, ptr %p, i8 %k
+  %z = load i8, ptr %pk
+  ret i8 %z
+}
+
+
+
+; In block Z two loads are unavailable (via entry and B blocks), whereas via
+; other predecessor (A) the load is available.
+define i8 @test7b(i1 %c1, ptr %p, i8 %i, i8 %j) {
+; CHECK-LABEL: @test7b(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[C1:%.*]], label [[A:%.*]], label [[ENTRY_Z_CRIT_EDGE:%.*]]
+; CHECK:       entry.Z_crit_edge:
+; CHECK-NEXT:    [[PK_PHI_TRANS_INSERT1:%.*]] = getelementptr i8, ptr [[P:%.*]], i8 [[I:%.*]]
+; CHECK-NEXT:    [[Z_PRE2:%.*]] = load i8, ptr [[PK_PHI_TRANS_INSERT1]], align 1
+; CHECK-NEXT:    br label [[Z:%.*]]
+; CHECK:       A:
+; CHECK-NEXT:    [[PI:%.*]] = getelementptr i8, ptr [[P]], i8 [[I]]
+; CHECK-NEXT:    [[X:%.*]] = load i8, ptr [[PI]], align 1
+; CHECK-NEXT:    [[C2:%.*]] = icmp slt i8 [[X]], 3
+; CHECK-NEXT:    br i1 [[C2]], label [[Z]], label [[B:%.*]]
+; CHECK:       B:
+; CHECK-NEXT:    [[PK_PHI_TRANS_INSERT:%.*]] = getelementptr i8, ptr [[P]], i8 [[J:%.*]]
+; CHECK-NEXT:    [[Z_PRE:%.*]] = load i8, ptr [[PK_PHI_TRANS_INSERT]], align 1
+; CHECK-NEXT:    br label [[Z]]
+; CHECK:       Z:
+; CHECK-NEXT:    [[Z:%.*]] = phi i8 [ [[Z_PRE2]], [[ENTRY_Z_CRIT_EDGE]] ], [ [[X]], [[A]] ], [ [[Z_PRE]], [[B]] ]
+; CHECK-NEXT:    [[K:%.*]] = phi i8 [ [[I]], [[ENTRY_Z_CRIT_EDGE]] ], [ [[I]], [[A]] ], [ [[J]], [[B]] ]
+; CHECK-NEXT:    [[PK:%.*]] = getelementptr i8, ptr [[P]], i8 [[K]]
+; CHECK-NEXT:    ret i8 [[Z]]
+;
+entry:
+  br i1 %c1, label %A, label %Z
+
+A:
+  %pi = getelementptr i8, ptr %p, i8 %i
+  %x = load i8, ptr %pi
+  %c2 = icmp slt i8 %x, 3
+  br i1 %c2, label %Z, label %B
+
+B:
+  br label %Z
+
+Z:
+  %k = phi i8 [ %i, %entry ], [%i, %A], [ %j, %B ]
+  %pk = getelementptr i8, ptr %p, i8 %k
+  %z = load i8, ptr %pk
+  ret i8 %z
+}
diff --git a/llvm/test/Transforms/GVN/condprop.ll b/llvm/test/Transforms/GVN/condprop.ll
index 6b1e4d10601099..f93c50201942bf 100644
--- a/llvm/test/Transforms/GVN/condprop.ll
+++ b/llvm/test/Transforms/GVN/condprop.ll
@@ -214,11 +214,11 @@ define void @test4(i1 %b, i32 %x) {
 ; CHECK-NEXT:    br i1 [[B:%.*]], label [[SW:%.*]], label [[CASE3:%.*]]
 ; CHECK:       sw:
 ; CHECK-NEXT:    switch i32 [[X:%.*]], label [[DEFAULT:%.*]] [
-; CHECK-NEXT:    i32 0, label [[CASE0:%.*]]
-; CHECK-NEXT:    i32 1, label [[CASE1:%.*]]
-; CHECK-NEXT:    i32 2, label [[CASE0]]
-; CHECK-NEXT:    i32 3, label [[CASE3]]
-; CHECK-NEXT:    i32 4, label [[DEFAULT]]
+; CHECK-NEXT:      i32 0, label [[CASE0:%.*]]
+; CHECK-NEXT:      i32 1, label [[CASE1:%.*]]
+; CHECK-NEXT:      i32 2, label [[CASE0]]
+; CHECK-NEXT:      i32 3, label [[CASE3]]
+; CHECK-NEXT:      i32 4, label [[DEFAULT]]
 ; CHECK-NEXT:    ]
 ; CHECK:       default:
 ; CHECK-NEXT:    call void @bar(i32 [[X]])
@@ -570,12 +570,16 @@ define void @test14(ptr %ptr1, ptr noalias %ptr2) {
 ; CHECK-NEXT:    br label [[THEN]]
 ; CHECK:       then:
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp eq ptr [[GEP2]], [[PTR2:%.*]]
-; CHECK-NEXT:    br i1 [[CMP]], label [[LOOP_END]], label [[IF2:%.*]]
+; CHECK-NEXT:    br i1 [[CMP]], label [[THEN_LOOP_END_CRIT_EDGE:%.*]], label [[IF2:%.*]]
+; CHECK:       then.loop.end_crit_edge:
+; CHECK-NEXT:    [[VAL3_PRE2:%.*]] = load i32, ptr [[PTR2]], align 4
+; CHECK-NEXT:    br label [[LOOP_END]]
 ; CHECK:       if2:
+; CHECK-NEXT:    [[VAL3_PRE:%.*]] = load i32, ptr [[GEP2]], align 4
 ; CHECK-NEXT:    br label [[LOOP_END]]
 ; CHECK:       loop.end:
-; CHECK-NEXT:    [[PHI3:%.*]] = phi ptr [ [[PTR2]], [[THEN]] ], [ [[PTR1]], [[IF2]] ]
-; CHECK-NEXT:    [[VAL3]] = load i32, ptr [[GEP2]], align 4
+; CHECK-NEXT:    [[VAL3]] = phi i32 [ [[VAL3_PRE2]], [[THEN_LOOP_END_CRIT_EDGE]] ], [ [[VAL3_PRE]], [[IF2]] ]
+; CHECK-NEXT:    [[PHI3:%.*]] = phi ptr [ [[PTR2]], [[THEN_LOOP_END_CRIT_EDGE]] ], [ [[PTR1]], [[IF2]] ]
 ; CHECK-NEXT:    store i32 [[VAL3]], ptr [[PHI3]], align 4
 ; CHECK-NEXT:    br i1 undef, label [[LOOP]], label [[IF1]]
 ;


⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 36a073a5f45e0da3b84bd5284219d64586a97d5d 50e96813df15ca4323e8e1017af499fad1662d27 -- llvm/lib/Transforms/Scalar/GVN.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Transforms/Scalar/GVN.cpp b/llvm/lib/Transforms/Scalar/GVN.cpp
index 02d7315115..228a9c2c31 100644
--- a/llvm/lib/Transforms/Scalar/GVN.cpp
+++ b/llvm/lib/Transforms/Scalar/GVN.cpp
@@ -1626,7 +1626,7 @@ bool GVNPass::PerformLoadPRE(LoadInst *Load, AvailValInBlkVect &ValuesPerBlock,
   // all the preds that don't have an available Load and insert a new load into
   // that one block.
   if (CriticalEdgePredSplit.size() > 1)
-      return false;
+    return false;
 
   // Now we know where we will insert load. We must ensure that it is safe
   // to speculatively execute the load at that points.

Contributor

@nikic nikic left a comment


where PredLoads and CriticalEdgePredSplit are the lists of available and
unavailable loads, respectively.

This is incorrect. Both are lists of unavailable loads. The first one is for unavailable loads that don't require critical edge splitting, and the second is for ones that do.

Therefore, with just one unavailable load we refuse to perform load PRE if there are other predecessors with available loads.

And this conclusion is thus also incorrect.

What the current check does is make sure that the number of loads does not increase. We end up removing one load, and adding one load in an unavailable block.

Your new check says that instead we can insert an arbitrary number of loads in unavailable predecessors, as long as there is at least one available load (and we don't have to split more than one critical edge). This is not a reasonable heuristic.

@david-arm
Contributor Author

Your new check says that instead we can insert an arbitrary number of loads in unavailable predecessors, as long as there is at least one available load (and we don't have to split more than one critical edge). This is not a reasonable heuristic.

Fair enough! But I would like to fix the problem highlighted in test7a, which is definitely blocked by this check. Given the potential performance gains, I'd still like to pursue this and I'm happy to do more detailed analysis. What would you consider a reasonable heuristic?



; In block Z one load is unavailable (via entry block), whereas via other 2
; predecessors the loads are available.
Contributor


As far as GVN knows, the load is not available in block B. If it were, the transform would already be performed.

@david-arm
Contributor Author

@nikic So perhaps I've misunderstood what this comment in the code above means:

  // Check to see how many predecessors have the loaded value fully
  // available.
  MapVector<BasicBlock *, Value *> PredLoads;

I took that to mean that PredLoads contains the loads from all predecessors that are fully available.

@nikic
Contributor

nikic commented Dec 20, 2023

Your new check says that instead we can insert an arbitrary number of loads in unavailable predecessors, as long as there is at least one available load (and we don't have to split more than one critical edge). This is not a reasonable heuristic.

Fair enough! But I would like to fix the problem highlighted in test7a, which is definitely blocked by this check. Given the potential performance gains, I'd still like to pursue this and I'm happy to do more detailed analysis. What would you consider a reasonable heuristic?

You should start by trying to understand why GVN does not recognize the available load in that test case. It probably has something to do with the fact that one block contains two available loads for different phi-translated pointers.

@david-arm
Contributor Author

You should start by trying to understand why GVN does not recognize the available load in that test case. It probably has something to do with the fact that one block contains two available loads for different phi-translated pointers.

OK thanks. I think I see now that PredLoads is different to what I thought. If I understand correctly, IsValueFullyAvailableInBlock should return true for my test, which would then prevent PredLoads from being populated.

@david-arm
Contributor Author

I've created #79324 that does the analysis to add more loads to the available list.

@david-arm david-arm closed this Jan 24, 2024
@david-arm david-arm deleted the gvn_load_pre branch May 1, 2024 11:56