[InstCombine] Remove some of the complexity-based canonicalization #91185

Draft: wants to merge 1 commit into main

Conversation

nikic (Contributor) commented May 6, 2024

The idea behind the canonicalization is that it allows us to handle fewer patterns, because we know that some will be canonicalized away. This is indeed very useful, e.g. to know that constants are always on the right.

However, the fact that arguments are also canonicalized to the right seems like it may be doing more harm than good: it means that writing tests to cover both commuted forms requires special care ("thwart complexity-based canonicalization").

I think we should consider dropping this canonicalization to make testing simpler.

(Draft because there are some obvious regressions in tests that I need to look at, and because I haven't updated all tests yet.)
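
An editorial sketch of the behavior in question (not part of the PR description; the function and value names below are made up): InstCombine orders operands by "complexity", so an instruction result is placed before a plain function argument, for example:

define i1 @commuted_form(i32 %arg, i32 %x) {
  %inst = mul i32 %x, 42
  %cmp = icmp eq i32 %arg, %inst   ; currently swapped to icmp eq i32 %inst, %arg
  ret i1 %cmp
}

Because of this, a test that wants to exercise the "argument on the left" form has to thwart the canonicalization deliberately, e.g. by making both operands instructions.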

nikic (Contributor Author) commented May 6, 2024

cc @dtcxzyw @goldsteinn to get some early feedback.

dtcxzyw added a commit to dtcxzyw/llvm-opt-benchmark that referenced this pull request May 6, 2024
dtcxzyw (Member) commented May 6, 2024

It is surprising that this patch slightly improves the performance on MultiSource/Benchmarks/McCat/05-eks/eks...

; CHECK: exit:
; CHECK-NEXT: [[SUM_NEXT_LCSSA:%.*]] = phi i64 [ [[SUM_NEXT_PEEL]], [[AT_WITH_INT_CONVERSION_EXIT11_PEEL:%.*]] ], [ [[SUM_NEXT]], [[AT_WITH_INT_CONVERSION_EXIT11]] ]
; CHECK-NEXT: ret i64 [[SUM_NEXT_LCSSA]]
; CHECK-NEXT: ret i64 [[SUM_NEXT]]
Contributor:

Looks like it breaks vectorization here?

Contributor Author:

This seems to be a difference in LICM behavior depending on how you write the comparison :/ https://llvm.godbolt.org/z/b6TWfbMGc

Contributor Author:

The LICM difference is due to the code in static bool CanProveNotTakenFirstIteration(const BasicBlock *ExitBlock, ...).

I have a fix for it, but need to figure out what's up with this PhaseOrdering test before landing it...

Contributor Author:

I think the core problem here is that SCEV cannot compute an exit count for something like this:

define void @test(i64 %M, i64 %N) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %latch ]
  %cmp1 = icmp ule i64 %iv, %M
  br i1 %cmp1, label %latch, label %error

error:
  call void @error()
  unreachable

latch:
  %iv.next = add nuw i64 %iv, 1
  %exitcond.not = icmp eq i64 %iv, %N
  br i1 %exitcond.not, label %exit, label %loop

exit:
  ret void
}

declare void @error()

We know that this is not an infinite loop (thanks to the add nuw even if we only look at the first exit). However, the %loop exit count would be %M + 1, which may overflow. We know that if this exit is taken, then that overflow can't happen. But if another exit is taken, it can. Computing the BECount of the whole loop as umin(%M + 1, %N) would not be correct. I guess we could return the exit count as zext(%M) + 1 though, which is what we sometimes do when converting from BECount to trip count.

@fhahn It seems like the whole "peel to make load dereferenceable" approach from cd0ba9d basically just fixed things by accident. Peeling was never necessary in the first place to make the load dereferenceable, we just failed to see it due to a simple implementation bug. What the peeling actually did is make the exit count of the loop computable...
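
To make the overflow concern concrete (an editorial aside, not part of the comment above; the helper name is made up): with %M equal to -1, i.e. 2^64 - 1, the exit count expression %M + 1 wraps to 0 in i64, while the widened form zext(%M) + 1 stays representable:

define i65 @widened_exit_count(i64 %M) {
  %M.wide = zext i64 %M to i65
  %count = add nuw i65 %M.wide, 1   ; zext(%M) + 1 cannot wrap in i65
  ret i65 %count
}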

Contributor Author:

I've put up #92206 for the SCEV issue.

; CHECK-NEXT: [[TMP1:%.*]] = and i1 [[TMP0]], [[CMP3]]
; CHECK-NEXT: [[TMP2:%.*]] = and i1 [[TMP1]], [[CMP_I]]
; CHECK-NEXT: br i1 [[TMP2]], label [[BB1:%.*]], label [[ENTRY_SPLIT_NONCHR:%.*]], !prof [[PROF15]]
; CHECK-NEXT: [[TMP1:%.*]] = freeze i1 [[CMP3]]
Contributor:

This looks a bit suspect.

; CHECK-NEXT: [[AND2:%.*]] = and i32 [[ARGC]], [[ARGC3:%.*]]
; CHECK-NEXT: [[TOBOOL3:%.*]] = icmp ne i32 [[ARGC3]], [[AND2]]
; CHECK-NEXT: [[AND_COND_NOT:%.*]] = or i1 [[TOBOOL]], [[TOBOOL3]]
; CHECK-NEXT: [[STOREMERGE:%.*]] = zext i1 [[AND_COND_NOT]] to i32
Contributor:

regression + below

; CHECK-NEXT: [[F:%.*]] = fmul fast float [[TMP1]], [[A:%.*]]
; CHECK-NEXT: [[C:%.*]] = fmul fast float [[Z:%.*]], -4.000000e+01
; CHECK-NEXT: [[TMP1:%.*]] = fneg fast float [[A:%.*]]
; CHECK-NEXT: [[F:%.*]] = fmul fast float [[C]], [[TMP1]]
Contributor:

regression + below

Contributor Author:

I think this is related to this weird transform:

Instruction *InstCombinerImpl::hoistFNegAboveFMulFDiv(Value *FNegOp, ...)

It pushes the fneg into one of the operands, so it's fundamentally asymmetric. I believe for integers we perform the transform the other way around, such that it is symmetric.
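
Roughly what the asymmetry looks like (an editorial sketch, not taken from the comment; names are made up): the negation is folded into whichever operand the transform picks, so the result depends on operand order:

define float @fneg_of_fmul(float %a, float %b) {
  %mul = fmul fast float %a, %b
  %neg = fneg fast float %mul   ; may be rewritten as fmul fast (fneg %a), %b
  ret float %neg
}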

; CHECK-NEXT: [[SQRT:%.*]] = fmul fast double [[FABS]], [[SQRT1]]
; CHECK-NEXT: [[MUL:%.*]] = fmul fast double [[X:%.*]], [[X]]
; CHECK-NEXT: [[MUL2:%.*]] = fmul fast double [[Y:%.*]], [[MUL]]
; CHECK-NEXT: [[SQRT:%.*]] = call fast double @llvm.sqrt.f64(double [[MUL2]])
Contributor:

regression

@@ -31,7 +31,7 @@ define i32 @test1(i32 %a0) {
; CHECK-LABEL: @test1(
; CHECK-NEXT: [[TMP1:%.*]] = lshr i32 [[A0:%.*]], 1
; CHECK-NEXT: [[TMP2:%.*]] = and i32 [[TMP1]], 1431655765
; CHECK-NEXT: [[TMP3:%.*]] = sub nsw i32 [[A0]], [[TMP2]]
; CHECK-NEXT: [[TMP3:%.*]] = sub i32 [[A0]], [[TMP2]]
Contributor:

regression + below

goldsteinn (Contributor) commented May 6, 2024

The freeze reassociation and vectorization changes deserve a closer look, I think. Otherwise it just seems like some misc regressions in InstCombine, none of which seems like a huge deal (maybe the B == (B & A) & D == (D & A) case is important).

@dtcxzyw am I reading it right that you had no diff in your benchmarks? That seems surprising...

edit: Also might be nice to regen all the affected tests as a pre-commit to reduce noise in this gigantic diff.

dtcxzyw (Member) commented May 6, 2024

> am I reading it right that you had no diff in your benchmarks? That seems surprising...

No. I just canceled the run as I thought it would produce tons of diff :(

goldsteinn (Contributor):

> No. I just canceled the run as I thought it would produce tons of diff :(

I see (another case for: dtcxzyw/llvm-opt-benchmark#355). Can you run it? I'll try and get -stats to work locally and post them here.

dtcxzyw (Member) commented May 6, 2024

> Can you run it? I'll try and get -stats to work locally and post them here.

I cannot run it, as LLVM doesn't support the cost estimation of struct types :(

goldsteinn (Contributor):

> I cannot run it, as LLVM doesn't support the cost estimation of struct types :(

Err, misunderstanding: I mean can you just generate your normal diffs? I'll do the cost estimation stuff locally and just post the results here.

dtcxzyw added a commit to dtcxzyw/llvm-opt-benchmark that referenced this pull request May 6, 2024
dtcxzyw (Member) commented May 7, 2024

> Err, misunderstanding: I mean can you just generate your normal diffs? I'll do the cost estimation stuff locally and just post the results here.

Done. The IR diff basically looks fine to me.

The main problem is that some icmp pred A, B and icmp swap(pred) B, A pairs are not CSE'd now.
See dtcxzyw/llvm-opt-benchmark#583 (comment).

BTW, I find that SimplifyDemandedBits breaks is_pow2 idioms.
See dtcxzyw/llvm-opt-benchmark#583 (comment).
I will post a patch later.

There are some missed optimizations which should be handled explicitly.
See dtcxzyw/llvm-opt-benchmark#583 (comment)
and dtcxzyw/llvm-opt-benchmark#583 (comment).
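
An editorial sketch of the CSE point above (not from the comment; names are made up): with the complexity-based canonicalization, %c1 below would be rewritten into the same form as %c2 and the pair becomes trivially identical; without it, the swapped compares can survive as two distinct instructions unless a pass handles commuted compares explicitly.

define i1 @swapped_compares(i32 %a, i32 %x) {
  %b = add i32 %x, 1
  %c1 = icmp slt i32 %a, %b   ; previously canonicalized to icmp sgt i32 %b, %a
  %c2 = icmp sgt i32 %b, %a   ; same compare as %c1 with operands swapped
  %both = and i1 %c1, %c2
  ret i1 %both
}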

goldsteinn (Contributor):

> The main problem is that some icmp pred A, B and icmp swap(pred) B, A pairs are not CSE'd now. See dtcxzyw/llvm-opt-benchmark#583 (comment).

Ah, truthfully that and commutative binops imo make a case for keeping this as is.

nikic (Contributor Author) commented May 8, 2024

> Ah, truthfully that and commutative binops imo make a case for keeping this as is.

Our main CSE passes (EarlyCSE and GVN) have support for handling commutative operations. I think in this case the failure is probably from a pass like SimplifyCFG, where we may fail to hoist/sink "non-identical" instructions.

The problem with relying on InstCombine's complexity-based canonicalization for this purpose is that it's very weak. It will handle the case where the icmp has one instruction operand and one argument operand, but if both operands are instructions, they will not be canonicalized and we're back to the same problem. So I think to handle this properly we still need explicit commutative operation support.
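
For illustration (an editorial sketch, not from the comment; names are made up): when both compare operands are instructions they tie on complexity, so neither order was canonical even before this patch, and recognizing the pair below as equal already requires explicit commutative handling:

define i1 @both_operands_are_instructions(i32 %x, i32 %y) {
  %a = add i32 %x, 1
  %b = add i32 %y, 2
  %c1 = icmp ult i32 %a, %b
  %c2 = icmp ugt i32 %b, %a   ; swapped form of %c1; neither order is canonical
  %r = xor i1 %c1, %c2
  ret i1 %r
}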

goldsteinn (Contributor):

> So I think to handle this properly we still need explicit commutative operation support.

What about making the matchers for commutative ops default to the commutative version (things like cmp could be included)?
We do that in the new sd_match API.
Or have the builder do the canonicalization... although that seems tricky.
