Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CVP]: Fold icmp eq X, C to trunc X to i1 if C=2k+1 and X in [2k, 2k+1] #83829

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

YanWQ-monad
Copy link
Contributor

@YanWQ-monad YanWQ-monad commented Mar 4, 2024

For

define i1 @src(i8 %x) {
  %1 = icmp ult i8 %x, 2
  br i1 %1, label %bb1, label %bb2

bb2:
  %2 = tail call i1 @dynamic()
  br label %bb3

bb1:
  %3 = icmp eq i8 %x, 1 ; <===
  br label %bb3

bb3:
  %4 = phi i1 [ %3, %bb1 ], [ %2, %bb2 ]
  ret i1 %4
}

declare i1 @dynamic()

we can fold icmp eq i8 %x, 1 to trunc i8 %x to i1 since x is in [0, 1].
The alive2 proof is https://alive2.llvm.org/ce/z/sGig3-.

Generally, for

  • icmp eq X, C if C = 2k+1 and X is in [2k, 2k+1]
  • or icmp ne X, C if C = 2k and X is in [2k, 2k+1]

we can fold it to trunc X to i1.

With this fold, RISC-V can eliminate two instructions, while ARM can eliminate one instruction on the hot path. Link: https://godbolt.org/z/xMTbP94Yn.
The real-world case: rust-lang/rust#121673

@YanWQ-monad YanWQ-monad requested a review from nikic as a code owner March 4, 2024 11:41
Copy link

github-actions bot commented Mar 4, 2024

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 4, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Monad (YanWQ-monad)

Changes

For

define i1 @<!-- -->src(i8 %x) {
  %1 = icmp ult i8 %x, 2
  br i1 %1, label %bb1, label %bb2

bb2:
  %2 = tail call i1 @<!-- -->dynamic()
  br label %bb3

bb1:
  %3 = icmp eq i8 %x, 1 ; &lt;===
  br label %bb3

bb3:
  %4 = phi i1 [ %3, %bb1 ], [ %2, %bb2 ]
  ret i1 %4
}

we can fold icmp eq i8 %x, 1 to trunc i8 %x to i1 since x is in [0, 1].
The alive2 proof is https://alive2.llvm.org/ce/z/sGig3-.

Generally, for

  • icmp eq X, C if C = 2k+1 and X is in [2k, 2k+1]
  • or icmp ne X, C if C = 2k and X is in [2k, 2k+1]

we can fold it to trunc X to i1.

With this fold, RISC-V can eliminate two instructions, while ARM can eliminate one instruction on the hot path.
The real-world case: rust-lang/rust#121673


Full diff: https://github.com/llvm/llvm-project/pull/83829.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp (+36)
  • (modified) llvm/test/Transforms/CorrelatedValuePropagation/icmp.ll (+62-2)
  • (modified) llvm/test/Transforms/JumpThreading/pr33917.ll (+4-4)
diff --git a/llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp b/llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
index 490cb7e528eb6f..73be5f0b016603 100644
--- a/llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
+++ b/llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
@@ -332,6 +332,39 @@ static bool constantFoldCmp(CmpInst *Cmp, LazyValueInfo *LVI) {
   return true;
 }
 
+/// Given an icmp `icmp eq X, C`,
+/// if we already know that C is 2k+1 and X is in [2k, 2k+1],
+/// then we can fold it to `trunc X to i1`.
+static bool processEqualityICmp(CmpInst *Cmp, LazyValueInfo *LVI) {
+  if (Cmp->getType()->isVectorTy() ||
+      !Cmp->getOperand(0)->getType()->isIntegerTy() || !Cmp->isEquality())
+    return false;
+
+  Value *Op0 = Cmp->getOperand(0);
+  Value *Op1 = Cmp->getOperand(1);
+  ConstantInt *CI = dyn_cast<ConstantInt>(Op1);
+  if (!CI)
+    return false;
+
+  ConstantRange Range =
+      LVI->getConstantRangeAtUse(Cmp->getOperandUse(0), /*UndefAllowed*/ true);
+  APInt RangeSize = Range.getUpper() - Range.getLower();
+  APInt Value = CI->getValue();
+  if (RangeSize != 2 || !Range.contains(Value))
+    return false;
+
+  bool ShouldBeOdd = Cmp->getPredicate() == ICmpInst::Predicate::ICMP_EQ;
+  if ((CI->getValue() & 1) == ShouldBeOdd) {
+    IRBuilder<> B{Cmp};
+    auto *Trunc = B.CreateTruncOrBitCast(Op0, Cmp->getType());
+    Cmp->replaceAllUsesWith(Trunc);
+    Cmp->eraseFromParent();
+    return true;
+  }
+
+  return false;
+}
+
 static bool processCmp(CmpInst *Cmp, LazyValueInfo *LVI) {
   if (constantFoldCmp(Cmp, LVI))
     return true;
@@ -340,6 +373,9 @@ static bool processCmp(CmpInst *Cmp, LazyValueInfo *LVI) {
     if (processICmp(ICmp, LVI))
       return true;
 
+  if (processEqualityICmp(Cmp, LVI))
+      return true;
+
   return false;
 }
 
diff --git a/llvm/test/Transforms/CorrelatedValuePropagation/icmp.ll b/llvm/test/Transforms/CorrelatedValuePropagation/icmp.ll
index 101820a4c65f23..ccfbc274d570ce 100644
--- a/llvm/test/Transforms/CorrelatedValuePropagation/icmp.ll
+++ b/llvm/test/Transforms/CorrelatedValuePropagation/icmp.ll
@@ -594,10 +594,10 @@ define void @test_cmp_phi(i8 %a) {
 ; CHECK-NEXT:    br i1 [[C0]], label [[LOOP:%.*]], label [[EXIT:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[P:%.*]] = phi i8 [ [[A]], [[ENTRY:%.*]] ], [ [[B:%.*]], [[LOOP]] ]
-; CHECK-NEXT:    [[C1:%.*]] = icmp ne i8 [[P]], 0
+; CHECK-NEXT:    [[TMP0:%.*]] = trunc i8 [[P]] to i1
 ; CHECK-NEXT:    [[C4:%.*]] = call i1 @get_bool()
 ; CHECK-NEXT:    [[B]] = zext i1 [[C4]] to i8
-; CHECK-NEXT:    br i1 [[C1]], label [[LOOP]], label [[EXIT]]
+; CHECK-NEXT:    br i1 [[TMP0]], label [[LOOP]], label [[EXIT]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
 ;
@@ -1455,3 +1455,63 @@ entry:
   %select = select i1 %cmp1, i1 %cmp2, i1 false
   ret i1 %select
 }
+
+define i1 @test_icmp_eq_on_valid_bool_range(i8 %x) {
+; CHECK-LABEL: @test_icmp_eq_on_valid_bool_range(
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp ult i8 [[X:%.*]], 2
+; CHECK-NEXT:    br i1 [[TMP1]], label [[BB1:%.*]], label [[BB2:%.*]]
+; CHECK:       bb2:
+; CHECK-NEXT:    [[TMP2:%.*]] = tail call i1 @get_bool()
+; CHECK-NEXT:    br label [[BB3:%.*]]
+; CHECK:       bb1:
+; CHECK-NEXT:    [[TMP3:%.*]] = trunc i8 [[X]] to i1
+; CHECK-NEXT:    br label [[BB3]]
+; CHECK:       bb3:
+; CHECK-NEXT:    [[TMP4:%.*]] = phi i1 [ [[TMP3]], [[BB1]] ], [ [[TMP2]], [[BB2]] ]
+; CHECK-NEXT:    ret i1 [[TMP4]]
+;
+  %1 = icmp ult i8 %x, 2
+  br i1 %1, label %bb1, label %bb2
+
+bb2:
+  %2 = tail call i1 @get_bool()
+  br label %bb3
+
+bb1:
+  %3 = icmp eq i8 %x, 1
+  br label %bb3
+
+bb3:
+  %4 = phi i1 [ %3, %bb1 ], [ %2, %bb2 ]
+  ret i1 %4
+}
+
+define i1 @test_icmp_ne_on_valid_bool_range(i8 %x) {
+; CHECK-LABEL: @test_icmp_ne_on_valid_bool_range(
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp ult i8 [[X:%.*]], 2
+; CHECK-NEXT:    br i1 [[TMP1]], label [[BB1:%.*]], label [[BB2:%.*]]
+; CHECK:       bb2:
+; CHECK-NEXT:    [[TMP2:%.*]] = tail call i1 @get_bool()
+; CHECK-NEXT:    br label [[BB3:%.*]]
+; CHECK:       bb1:
+; CHECK-NEXT:    [[TMP3:%.*]] = trunc i8 [[X]] to i1
+; CHECK-NEXT:    br label [[BB3]]
+; CHECK:       bb3:
+; CHECK-NEXT:    [[TMP4:%.*]] = phi i1 [ [[TMP3]], [[BB1]] ], [ [[TMP2]], [[BB2]] ]
+; CHECK-NEXT:    ret i1 [[TMP4]]
+;
+  %1 = icmp ult i8 %x, 2
+  br i1 %1, label %bb1, label %bb2
+
+bb2:
+  %2 = tail call i1 @get_bool()
+  br label %bb3
+
+bb1:
+  %3 = icmp ne i8 %x, 0
+  br label %bb3
+
+bb3:
+  %4 = phi i1 [ %3, %bb1 ], [ %2, %bb2 ]
+  ret i1 %4
+}
diff --git a/llvm/test/Transforms/JumpThreading/pr33917.ll b/llvm/test/Transforms/JumpThreading/pr33917.ll
index 7d21a4e1781519..20380c769bf173 100644
--- a/llvm/test/Transforms/JumpThreading/pr33917.ll
+++ b/llvm/test/Transforms/JumpThreading/pr33917.ll
@@ -15,16 +15,16 @@ define void @patatino() personality ptr @rust_eh_personality {
 ; CHECK-LABEL: @patatino(
 ; CHECK-NEXT:  bb9:
 ; CHECK-NEXT:    [[T9:%.*]] = invoke ptr @foo()
-; CHECK-NEXT:    to label [[GOOD:%.*]] unwind label [[BAD:%.*]]
+; CHECK-NEXT:            to label [[GOOD:%.*]] unwind label [[BAD:%.*]]
 ; CHECK:       bad:
 ; CHECK-NEXT:    [[T10:%.*]] = landingpad { ptr, i32 }
-; CHECK-NEXT:    cleanup
+; CHECK-NEXT:            cleanup
 ; CHECK-NEXT:    resume { ptr, i32 } [[T10]]
 ; CHECK:       good:
 ; CHECK-NEXT:    [[T11:%.*]] = icmp ne ptr [[T9]], null
 ; CHECK-NEXT:    [[T12:%.*]] = zext i1 [[T11]] to i64
-; CHECK-NEXT:    [[COND:%.*]] = icmp eq i64 [[T12]], 1
-; CHECK-NEXT:    br i1 [[COND]], label [[IF_TRUE:%.*]], label [[DONE:%.*]]
+; CHECK-NEXT:    [[TMP0:%.*]] = trunc i64 [[T12]] to i1
+; CHECK-NEXT:    br i1 [[TMP0]], label [[IF_TRUE:%.*]], label [[DONE:%.*]]
 ; CHECK:       if_true:
 ; CHECK-NEXT:    call void @llvm.assume(i1 [[T11]])
 ; CHECK-NEXT:    br label [[DONE]]

Copy link

github-actions bot commented Mar 4, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@dtcxzyw dtcxzyw requested a review from fhahn March 4, 2024 11:46
@dtcxzyw
Copy link
Member

dtcxzyw commented Mar 4, 2024

With this fold, RISC-V can eliminate two instructions, while ARM can eliminate one instruction on the hot path.

Could you please provide the godbolt link?

@YanWQ-monad
Copy link
Contributor Author

With this fold, RISC-V can eliminate two instructions, while ARM can eliminate one instruction on the hot path.

Could you please provide the godbolt link?

Sure, the link is https://godbolt.org/z/xMTbP94Yn.

dtcxzyw added a commit to dtcxzyw/llvm-opt-benchmark that referenced this pull request Mar 4, 2024
@dtcxzyw
Copy link
Member

dtcxzyw commented Mar 4, 2024

FYI, trunc X to i1 will be canonicalized into (X & 1) != 0.
godbolt: https://godbolt.org/z/cMv6K8qKd

@YanWQ-monad
Copy link
Contributor Author

YanWQ-monad commented Mar 4, 2024

For most of the regression cases, they probably follow this form:
https://github.com/dtcxzyw/llvm-opt-benchmark/pull/286/files#diff-39f660d670ba4b7db20a130f81413c2584fbdbc537f1557976e334ecd34750c7L6491-R6493

- %cmp18 = icmp eq i32 %encoding, 5
- br i1 %cmp18, label %cond.true19, label %cond.false22
+ %17 = and i32 %encoding, 1
+ %.not = icmp eq i32 %17, 0
+ br i1 %.not, label %cond.false22, label %cond.true19

It seems that it's not necessarily regression.

In terms of this IR,

define i1 @src(i32 %x) {
  %y = icmp ne i32 %x, 0
  ret i1 %y
}

define i1 @tgt(i32 %x) {
  %1 = and i32 %x, 1
  %y = icmp ne i32 %1, 0
  ret i1 %y
}

On x86 and RISC-V, it generates same number of instructions. On AArch64, the number of instructions it generates has reduced by one.
Link: x86 and AArch64: https://godbolt.org/z/az9sP1nf7, RISC-V: https://godbolt.org/z/nzdM7515M.


For the other regressions, I'm still trying to resolve them.

@orlp
Copy link

orlp commented Mar 8, 2024

@YanWQ-monad Looking at your godbolt, doesn't that still have the useless and with 1 on ARM?

@YanWQ-monad
Copy link
Contributor Author

YanWQ-monad commented Mar 8, 2024

@YanWQ-monad Looking at your godbolt, doesn't that still have the useless and with 1 on ARM?

Yes, it seems that ARM backend generates an and instruction for trunc, while RISC-V's doesn't.

Putting that aside, the canonicalization of trunc, which dtcxzyw pointed out, makes this optimization actually useless (https://godbolt.org/z/f5drrv556). So I am going to close this PR and re-implement this optimization on backend.

@DianQK
Copy link
Member

DianQK commented Mar 9, 2024

@YanWQ-monad Looking at your godbolt, doesn't that still have the useless and with 1 on ARM?

Yes, it seems that ARM backend generates an and instruction for trunc, while RISC-V's doesn't.

Putting that aside, the canonicalization of trunc, which dtcxzyw pointed out, makes this optimization actually useless (https://godbolt.org/z/f5drrv556). So I am going to close this PR and re-implement this optimization on backend.

Could you create an issue and link back to rust-lang/rust#121673?

@YanWQ-monad
Copy link
Contributor Author

@YanWQ-monad Looking at your godbolt, doesn't that still have the useless and with 1 on ARM?

Yes, it seems that ARM backend generates an and instruction for trunc, while RISC-V's doesn't.
Putting that aside, the canonicalization of trunc, which dtcxzyw pointed out, makes this optimization actually useless (https://godbolt.org/z/f5drrv556). So I am going to close this PR and re-implement this optimization on backend.

Could you create an issue and link back to rust-lang/rust#121673?

Sure, is it done like this: #84605?

@nikic
Copy link
Contributor

nikic commented Mar 9, 2024

FYI, trunc X to i1 will be canonicalized into (X & 1) != 0. godbolt: https://godbolt.org/z/cMv6K8qKd

Relevant fold seems to be:

if (DestTy->isIntegerTy()) {
// Canonicalize trunc x to i1 -> icmp ne (and x, 1), 0 (scalar only).
// TODO: We canonicalize to more instructions here because we are probably
// lacking equivalent analysis for trunc relative to icmp. There may also
// be codegen concerns. If those trunc limitations were removed, we could
// remove this transform.
Value *And = Builder.CreateAnd(Src, ConstantInt::get(SrcTy, 1));
return new ICmpInst(ICmpInst::ICMP_NE, And, Zero);
}

Quite the unusual choice, probably worth trying to remove.

@YanWQ-monad
Copy link
Contributor Author

rebased, and add nuw and nsw flag when possible.

@AtariDreams
Copy link
Contributor

Screenshot 2024-03-29 at 3 24 06 PM While we are at it, should we remove the scalar restriction for the fold here as well in InstCombineCompares?

@YanWQ-monad
Copy link
Contributor Author

ping?

How about only fold those that can be folded to trunc nuw? That is, only fold icmp eq X, 1 to trunc nuw X to i1 if X is in [0, 1]. It's less general, but probably less likely to interfere with other optimizations while still fitting the original motivation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants