-
Notifications
You must be signed in to change notification settings - Fork 15.3k
[LoopFusion] Assert failure in the issue 80301 #167837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[LoopFusion] Assert failure in the issue 80301 #167837
Conversation
Concerning the test 'IR.ll' mentioned in the issue(same has been added
as test case), we have the following scenario before fusion:
GC1 --> { L1 --> L2} --
| |
|<--------------------
V
GC2 --> {L3 --> L4} --
| |
|<--------------------
V
exit
The candidate set before fusion, in the same sequence is {L1, L2, L3, L4}.
L1 and L2 get fused thus giving us the single loop L12 with guard
condition GC1. L3 and L4 are considered not to have a guard condition.
When inserting L12 back into the candidate set again, there was an
assert that if entry block of L12(i.e. GC1) dominates entry block of
{L3-->L4} (i.e. L3's preheader) then entry block of {L3-->L4} must
postdominate entry block of L12. While this is true for fusing
candidates, this is not true for inserting candidates into the candidate
set(or preparing the new candidate set)(or setting the order of the
candidates for the next fusion). This was not happening for the above
case and hence the assert failure.
This patch tries to resolve this issue by removing the assert and using
the assert's condition for deciding the relative order of candidates in
the set.
|
@llvm/pr-subscribers-llvm-transforms Author: Sushant Gokhale (sushgokh) ChangesConcerning the test 'IR.ll' mentioned in the issue(same has been added as test case), we have the following scenario before fusion: The candidate set before fusion, in the same sequence is {L1, L2, L3, L4}. L1 and L2 get fused thus giving us the single loop L12 with guard condition GC1. L3 and L4 are considered not to have a guard condition. When inserting L12 back into the candidate set again, there was an assert that if entry block of L12(i.e. GC1) dominates entry block of {L3-->L4} (i.e. L3's preheader) then entry block of {L3-->L4} must postdominate entry block of L12. While this is true for fusing candidates, this is not true for inserting candidates into the candidate set(or preparing the new candidate set)(or setting the order of the candidates for the next fusion). This was not happening for the above case and hence the assert failure. This patch tries to resolve this issue by removing the assert and using the assert's condition for deciding the relative order of candidates in the set. Full diff: https://github.com/llvm/llvm-project/pull/167837.diff 2 Files Affected:
diff --git a/llvm/lib/Transforms/Scalar/LoopFuse.cpp b/llvm/lib/Transforms/Scalar/LoopFuse.cpp
index 9ffa602416b05..e3a44a4976b7d 100644
--- a/llvm/lib/Transforms/Scalar/LoopFuse.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopFuse.cpp
@@ -406,16 +406,14 @@ struct FusionCandidateCompare {
// Do this compare first so if LHS == RHS, function returns false.
if (DT->dominates(RHSEntryBlock, LHSEntryBlock)) {
- // RHS dominates LHS
- // Verify LHS post-dominates RHS
- assert(LHS.PDT->dominates(LHSEntryBlock, RHSEntryBlock));
- return false;
+ // RHS dominates LHS.
+ // Check if LHS post-dominates RHS.
+ return !LHS.PDT->dominates(LHSEntryBlock, RHSEntryBlock);
}
if (DT->dominates(LHSEntryBlock, RHSEntryBlock)) {
- // Verify RHS Postdominates LHS
- assert(LHS.PDT->dominates(RHSEntryBlock, LHSEntryBlock));
- return true;
+ // Check if RHS postdominates LHS.
+ return LHS.PDT->dominates(RHSEntryBlock, LHSEntryBlock);
}
// If two FusionCandidates are in the same level of dominator tree,
diff --git a/llvm/test/Transforms/LoopFusion/guarded.ll b/llvm/test/Transforms/LoopFusion/guarded.ll
index 863d9b1bb4e86..9d27adfe1fe8f 100644
--- a/llvm/test/Transforms/LoopFusion/guarded.ll
+++ b/llvm/test/Transforms/LoopFusion/guarded.ll
@@ -390,3 +390,142 @@ for.2: ; preds = %for.cond13, %for.body6,
exit: ; preds = %for.cond13
ret void
}
+
+define void @foo(ptr noalias %A, ptr noalias %B, i64 %N) {
+; CHECK-LABEL: @foo(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[CMP4:%.*]] = icmp slt i64 0, [[N:%.*]]
+; CHECK-NEXT: br i1 [[CMP4]], label [[BB3:%.*]], label [[BB14:%.*]]
+; CHECK: bb3:
+; CHECK-NEXT: br label [[BB5:%.*]]
+; CHECK: bb5:
+; CHECK-NEXT: [[I_05:%.*]] = phi i64 [ [[INC:%.*]], [[BB5]] ], [ 0, [[BB3]] ]
+; CHECK-NEXT: [[I_0510:%.*]] = phi i64 [ [[INC10:%.*]], [[BB5]] ], [ 0, [[BB3]] ]
+; CHECK-NEXT: [[SUB:%.*]] = sub nsw i64 [[I_05]], 3
+; CHECK-NEXT: [[ADD:%.*]] = add nsw i64 [[I_05]], 3
+; CHECK-NEXT: [[MUL:%.*]] = mul nsw i64 [[SUB]], [[ADD]]
+; CHECK-NEXT: [[REM:%.*]] = srem i64 [[MUL]], [[I_05]]
+; CHECK-NEXT: [[CONV:%.*]] = trunc i64 [[REM]] to i32
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[I_05]]
+; CHECK-NEXT: store i32 [[CONV]], ptr [[ARRAYIDX]], align 4
+; CHECK-NEXT: [[INC]] = add nsw i64 [[I_05]], 1
+; CHECK-NEXT: [[CMP:%.*]] = icmp slt i64 [[INC]], [[N]]
+; CHECK-NEXT: [[ARRAYIDX10:%.*]] = getelementptr inbounds i32, ptr [[B:%.*]], i64 [[I_0510]]
+; CHECK-NEXT: [[LOAD_B:%.*]] = load i32, ptr [[ARRAYIDX10]], align 4
+; CHECK-NEXT: [[ARRAYIDX11:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[I_0510]]
+; CHECK-NEXT: store i32 0, ptr [[ARRAYIDX11]], align 4
+; CHECK-NEXT: [[INC10]] = add nsw i64 [[I_0510]], 1
+; CHECK-NEXT: [[CMP10:%.*]] = icmp slt i64 [[INC10]], [[N]]
+; CHECK-NEXT: br i1 [[CMP10]], label [[BB5]], label [[BB1010:%.*]]
+; CHECK: bb1010:
+; CHECK-NEXT: br label [[BB14]]
+; CHECK: bb14:
+; CHECK-NEXT: br i1 [[CMP4]], label [[BB8:%.*]], label [[BB12:%.*]]
+; CHECK: bb8:
+; CHECK-NEXT: br label [[BB9:%.*]]
+; CHECK: bb9:
+; CHECK-NEXT: [[I1_02:%.*]] = phi i64 [ [[INC14:%.*]], [[BB9]] ], [ 0, [[BB8]] ]
+; CHECK-NEXT: [[SUB7:%.*]] = sub nsw i64 [[I1_02]], 3
+; CHECK-NEXT: [[ADD8:%.*]] = add nsw i64 [[I1_02]], 3
+; CHECK-NEXT: [[MUL9:%.*]] = mul nsw i64 [[SUB7]], [[ADD8]]
+; CHECK-NEXT: [[REM10:%.*]] = srem i64 [[MUL9]], [[I1_02]]
+; CHECK-NEXT: [[CONV11:%.*]] = trunc i64 [[REM10]] to i32
+; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds [1024 x i32], ptr @B, i64 0, i64 [[I1_02]]
+; CHECK-NEXT: store i32 [[CONV11]], ptr [[ARRAYIDX12]], align 4
+; CHECK-NEXT: [[INC14]] = add nsw i64 [[I1_02]], 1
+; CHECK-NEXT: [[CMP3:%.*]] = icmp slt i64 [[INC14]], [[N]]
+; CHECK-NEXT: br i1 [[CMP3]], label [[BB9]], label [[BB15:%.*]]
+; CHECK: bb15:
+; CHECK-NEXT: br label [[BB80:%.*]]
+; CHECK: bb80:
+; CHECK-NEXT: br label [[BB90:%.*]]
+; CHECK: bb90:
+; CHECK-NEXT: [[I1_0210:%.*]] = phi i64 [ [[INC1410:%.*]], [[BB90]] ], [ 0, [[BB80]] ]
+; CHECK-NEXT: [[CONV12:%.*]] = trunc i64 [[I1_0210]] to i32
+; CHECK-NEXT: [[ARRAYIDX1210:%.*]] = getelementptr inbounds [1024 x i32], ptr @B, i64 0, i64 [[I1_0210]]
+; CHECK-NEXT: store i32 [[CONV12]], ptr [[ARRAYIDX1210]], align 4
+; CHECK-NEXT: [[INC1410]] = add nsw i64 [[I1_0210]], 1
+; CHECK-NEXT: [[CMP310:%.*]] = icmp slt i64 [[INC1410]], [[N]]
+; CHECK-NEXT: br i1 [[CMP310]], label [[BB90]], label [[BB150:%.*]]
+; CHECK: bb150:
+; CHECK-NEXT: br label [[BB12]]
+; CHECK: bb12:
+; CHECK-NEXT: ret void
+;
+entry:
+ %cmp4 = icmp slt i64 0, %N
+ br i1 %cmp4, label %bb3, label %bb14
+
+bb3: ; preds = %entry
+ br label %bb5
+
+bb5: ; preds = %bb3, %bb5
+ %i.05 = phi i64 [ %inc, %bb5 ], [ 0, %bb3 ]
+ %sub = sub nsw i64 %i.05, 3
+ %add = add nsw i64 %i.05, 3
+ %mul = mul nsw i64 %sub, %add
+ %rem = srem i64 %mul, %i.05
+ %conv = trunc i64 %rem to i32
+ %arrayidx = getelementptr inbounds i32, ptr %A, i64 %i.05
+ store i32 %conv, ptr %arrayidx, align 4
+ %inc = add nsw i64 %i.05, 1
+ %cmp = icmp slt i64 %inc, %N
+ br i1 %cmp, label %bb5, label %bb103
+
+bb103: ; preds = %bb103
+ br label %bb105
+
+bb105: ; preds = %bb103, %bb105
+ %i.0510 = phi i64 [ %inc10, %bb105 ], [ 0, %bb103 ]
+ %arrayidx10 = getelementptr inbounds i32, ptr %B, i64 %i.0510
+ %load.b = load i32, ptr %arrayidx10, align 4
+ %arrayidx11 = getelementptr inbounds i32, ptr %A, i64 %i.0510
+ store i32 0, ptr %arrayidx11, align 4
+ %inc10 = add nsw i64 %i.0510, 1
+ %cmp10 = icmp slt i64 %inc10, %N
+ br i1 %cmp10, label %bb105, label %bb1010
+
+bb1010: ; preds = %bb105
+ br label %bb14
+
+bb14: ; preds = %bb1010, %entry
+ br i1 %cmp4, label %bb8, label %bb12
+
+bb8: ; preds = %bb14
+ br label %bb9
+
+bb9: ; preds = %bb8, %bb9
+ %i1.02 = phi i64 [ %inc14, %bb9 ], [ 0, %bb8 ]
+ %sub7 = sub nsw i64 %i1.02, 3
+ %add8 = add nsw i64 %i1.02, 3
+ %mul9 = mul nsw i64 %sub7, %add8
+ %rem10 = srem i64 %mul9, %i1.02
+ %conv11 = trunc i64 %rem10 to i32
+ %arrayidx12 = getelementptr inbounds [1024 x i32], ptr @B, i64 0, i64 %i1.02
+ store i32 %conv11, ptr %arrayidx12, align 4
+ %inc14 = add nsw i64 %i1.02, 1
+ %cmp3 = icmp slt i64 %inc14, %N
+ br i1 %cmp3, label %bb9, label %bb15
+
+bb15: ; preds = %bb9
+ br label %bb80
+
+bb80: ; preds = %bb15
+ br label %bb90
+
+bb90: ; preds = %bb80, %bb90
+ %i1.0210 = phi i64 [ %inc1410, %bb90 ], [ 0, %bb80 ]
+ %conv12 = trunc i64 %i1.0210 to i32
+ %arrayidx1210 = getelementptr inbounds [1024 x i32], ptr @B, i64 0, i64 %i1.0210
+ store i32 %conv12, ptr %arrayidx1210, align 4
+ %inc1410 = add nsw i64 %i1.0210, 1
+ %cmp310 = icmp slt i64 %inc1410, %N
+ br i1 %cmp310, label %bb90, label %bb150
+
+bb150: ; preds = %bb90
+ br label %bb12
+
+
+bb12: ; preds = %bb15, %bb14
+ ret void
+}
|
| return false; | ||
| // RHS dominates LHS. | ||
| // Check if LHS post-dominates RHS. | ||
| return !LHS.PDT->dominates(LHSEntryBlock, RHSEntryBlock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what are the implication of violating control flow equivalence of the loops in the candidate set. There is a comment in the code that says members of FusionCandidateSet are control flow equivalent. but I haven't looked into it enough to know what happens if we violate the condition. @1997alireza @CongzheUalberta Do you know about this?
your comment says , while post-domination is required for fusing candidates, it is not required for adding something to the candidate set. Does this mean there is another check of post-dominance relationship later on?
I was going to post a comment on this issue and the one extracted from it, but have been very busy with DA stuff. Will try to post a comment soon (But I may need to read some parts of fusion code, if DA stuff leaves enough time for me).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your comment says , while post-domination is required for fusing candidates, it is not required for adding something to the candidate set. Does this mean there is another check of post-dominance relationship later on?
'isAdjacent' is doing the same thing I believe ? In fact, 'isAdjacent' is more stricter than saying 2 loops are control flow equivalent.
|
Please correct me if I'm wrong, but my first impression on this patch is that, it allows loops that are not control flow equivalent (e.g., loop1 dominates loop2 but loop2 does not post dominate loop1) to be added into the same fusion candidate set hence those loops possibly could be fused. Would it cause functional problem when they are fused? Have you tested the patch on some benchmarks like SPEC? |
There are 2 scenarios:
Now, coming back to the example in the issue. L1/L2/L3/L4 are in the same candidate set. There is one obvious flaw in the function 'isControlFlowEquivalent' in the sense that L1/L2 and L3/L4 are equivalent pairwise but there is no dominance relation between L2 and L3. So, I think this function is not very useful other than as a legality check for fusion. I havent tested this with SPEC for the reason that this patch is just placing the candidate into the set and doesnt change anything wrt legality checks for fusion. |
Yes, 'isAdjacent' should be the function doing the check. I cant think of scenario where they are not control flow equivalent yet adjacent. |
If our codegen is limited to adjacent loops, why do we put non-adjacent loops in a candidate set? If we just put adjacent loops in the candidate set, do we lose any opportunity? We can always extend fusion in the future, and make it to catch more candidates. Do I miss something here? (@CongzheUalberta what do you think?) |
ok, thanks for the clarification.
I have not done a deep debugging so correct me if I'm wrong, but I think the reason fusion does not happen is not because adjacency check fails, it is because with your ordering, after the 1st fusion the candidate set becomes
I'm not sure if that is correct. LLVM is smart enough to determine when L2 is executed then L3 must gets executed, and vice versa. One more thing: this PR gives non-deterministic to the ordering in the candidate set, e.g., when |
Currently llvm does not have the capability to fuse non-adjacent loops that are control flow equivalent (fusing non-adjacent loops would be something nice to have). However I think it is still good to put non-adjacent loops that are control flow equivalent in the candidate set, which opens the door and could be used as the stepping stone for future extension of fusing non-adjacent loops. If we only put adjacent loops in the set, we would lose the opportunity extending to fusing non-adjacent loops. In that case when we want to extend to non-adjacent loops down the road, we would have to revert the pass to the state as of now and then we essentially would end up with the same issue as what we are facing now. |
If our goal is to turn on fusion by default then we need to simplify the code. Both in terms of maintainability and also compile time cost. Right now I believe we do three different check: isControlFlowEquivalent which is even more relaxed than dominance check, then there is dominance check and then there is adjacency check. This really needs to be simplified. |
Yes, I think we should make this a goal. We chose interchange as the first loop optimisation to enable and run by default, but we are very interested in fusion as well. For default enablement, we would need to give it the "investigate and fix compile time outliers, and fix all correctness issues" treatment as per the updated/clarified developers policy section on this topic. Running testing and fuzzers on loop fusion was a first exercise to judge the quality of the pass, and see how much work it would be. As I mentioned, we are mostly focused on interchange, but we are happy to support you with fusion. For example, we have data where fusion improves workloads if we need to justify enabling this and if it is worth the compile-time. And we can help with point fixes like this, although you seem to suggest the pass may need a bit of an overhaul? I am interested to hear your opinions on this. Maybe we can make default enablement of interchange and fusion recurring topics for the loop optimisation workgroup meetings (CC: @Meinersbur ), so am happy to talk more about this in the meeting, and in the meantime we can also continue here of course. |
Yes, we are interested in loop fusion. This past week we have been busy with DA discussions and internal stuff but I expect we speed up on loop fusion soon. With regard to major changes in fusion. yes, I think at least two major improvements is needed (and both should help with compile time). One is the comment above, the other the changes in how we check dependency that I previously commented on one of the issues. But I think before we finalize the second one, we need to wait for the current investigations in DA to conclude (basically it is a discussion of what are the conditions for correctness of DA) If you are interested in merging this fix as is, it is fine with me. we can merge this and we will address the issue of multiple different control flow checks separately. And sure, we can discuss the status both in the loop opt meeting and also here or elsewhere on github. |
|
I don't think this is a proper solution. Removing assertions and using them for a tie-breaker will eventually get us to completely symmetric instances such as #166560. There is no reason that two abritary loops in the CFG that are not even sequential have to be found a sequential order for. |
|
@sjoerdmeijer |
If we want to limit the scope of fusion to just adjacent loops, then the checks can be completely removed I think. But in the given example(or the issue IR), there is much scope for improvement at the same loop depth level. For instance, post first fusion, we have 3 loops in the same sequence in the candidate set: There is still scope for merging L3/L4 and then L34 can be merged with L12. So, basically, this is a recursive procedure. By relaxing the assertions, we would be allowing such fusions by inserting plausible candidates for fusion. |
I am trying to understand your concern here. IIUC you are saying |
Yes, right. To reiterate what I would like to say, checking for adjacency is just for actual fusion. Dominance checking is for preparing set of plausible candidates for fusion iteratively (i.e. even the fused candidate can be candidate for further fusion as is the case here). There are other ways to prepare plausible candidate set at every step of the pass but for now, lets get rid of this assert. Lets proceed one step at a time. |
give me a couple of days. I will get back to you about this. |
Concerning the test 'IR.ll' mentioned in the issue(same has been added as test case), we have the following scenario before fusion:
The candidate set before fusion, in the same sequence is {L1, L2, L3, L4}.
L1 and L2 get fused thus giving us the single loop L12 with guard condition GC1. L3 and L4 are considered not to have a guard condition.
When inserting L12 back into the candidate set again, there was an assert that if entry block of L12(i.e. GC1) dominates entry block of {L3-->L4} (i.e. L3's preheader) then entry block of {L3-->L4} must postdominate entry block of L12. While this is true for fusing candidates, this is not true for inserting candidates into the candidate set(or preparing the new candidate set)(or setting the order of the candidates for the next fusion). This was not happening for the above case and hence the assert failure.
This patch tries to resolve this issue by removing the assert and using the assert's condition for deciding the relative order of candidates in the set.
Fixes #80301