
[InstCombine] Fold switch(rol(x, C1)) case C2: to switch(x) case rol(C2, -C1): #86307

Open

wants to merge 3 commits into main

Conversation

YanWQ-monad
Contributor

YanWQ-monad commented Mar 22, 2024

This closes #86161.

It is worth mentioning that, as @dtcxzyw pointed out, there is an inverse fold in SimplifyCFG (ReduceSwitchRange):

/// Try to transform a switch that has "holes" in it to a contiguous sequence
/// of cases.
///
/// A switch such as: switch(i) {case 5: case 9: case 13: case 17:} can be
/// range-reduced to: switch ((i-5) / 4) {case 0: case 1: case 2: case 3:}.
///
/// This converts a sparse switch into a dense switch which allows better
/// lowering and could also allow transforming into a lookup table.
static bool ReduceSwitchRange(SwitchInst *SI, IRBuilder<> &Builder,
                              const DataLayout &DL,
                              const TargetTransformInfo &TTI) {
  auto *CondTy = cast<IntegerType>(SI->getCondition()->getType());
  if (CondTy->getIntegerBitWidth() > 64 ||
      !DL.fitsInLegalInteger(CondTy->getIntegerBitWidth()))
    return false;

  // Only bother with this optimization if there are more than 3 switch cases;
  // SDAG will only bother creating jump tables for 4 or more cases.
  if (SI->getNumCases() < 4)
    return false;

  // This transform is agnostic to the signedness of the input or case values. We
  // can treat the case values as signed or unsigned. We can optimize more common
  // cases such as a sequence crossing zero {-4,0,4,8} if we interpret case values
  // as signed.
  SmallVector<int64_t,4> Values;
  for (const auto &C : SI->cases())
    Values.push_back(C.getCaseValue()->getValue().getSExtValue());
  llvm::sort(Values);

  // If the switch is already dense, there's nothing useful to do here.
  if (isSwitchDense(Values))
    return false;

  // First, transform the values such that they start at zero and ascend.
  int64_t Base = Values[0];
  for (auto &V : Values)
    V -= (uint64_t)(Base);

  // Now we have signed numbers that have been shifted so that, given enough
  // precision, there are no negative values. Since the rest of the transform
  // is bitwise only, we switch now to an unsigned representation.

  // This transform can be done speculatively because it is so cheap - it
  // results in a single rotate operation being inserted.

  // countTrailingZeros(0) returns 64. As Values is guaranteed to have more than
  // one element and LLVM disallows duplicate cases, Shift is guaranteed to be
  // less than 64.
  unsigned Shift = 64;
  for (auto &V : Values)
    Shift = std::min(Shift, (unsigned)llvm::countr_zero((uint64_t)V));
  assert(Shift < 64);
  if (Shift > 0)
    for (auto &V : Values)
      V = (int64_t)((uint64_t)V >> Shift);

  if (!isSwitchDense(Values))
    // Transform didn't create a dense switch.
    return false;

  // The obvious transform is to shift the switch condition right and emit a
  // check that the condition actually cleanly divided by GCD, i.e.
  //   C & (1 << Shift - 1) == 0
  // inserting a new CFG edge to handle the case where it didn't divide cleanly.
  //
  // A cheaper way of doing this is a simple ROTR(C, Shift). This performs the
  // shift and puts the shifted-off bits in the uppermost bits. If any of these
  // are nonzero then the switch condition will be very large and will hit the
  // default case.
  auto *Ty = cast<IntegerType>(SI->getCondition()->getType());
  Builder.SetInsertPoint(SI);
  Value *Sub =
      Builder.CreateSub(SI->getCondition(), ConstantInt::get(Ty, Base));
  Value *Rot = Builder.CreateIntrinsic(
      Ty, Intrinsic::fshl,
      {Sub, Sub, ConstantInt::get(Ty, Ty->getBitWidth() - Shift)});
  SI->replaceUsesOfWith(SI->getCondition(), Rot);

  for (auto Case : SI->cases()) {
    auto *Orig = Case.getCaseValue();
    auto Sub = Orig->getValue() - APInt(Ty->getBitWidth(), Base);
    Case.setValue(cast<ConstantInt>(ConstantInt::get(Ty, Sub.lshr(Shift))));
  }
  return true;
}

Because of that inverse fold, the IR may flicker back and forth in the optimization pipeline.
SimplifyCFG will have the upper hand, so there won't be any major problems, but in some rare cases it might cause a regression (observed in @dtcxzyw's opt-benchmark).
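
For concreteness, here is a hypothetical IR sketch of that interaction (the function name, case values, and rotate amount are made up for illustration; whether ReduceSwitchRange actually fires also depends on its density heuristic):

; Before InstCombine: the switch condition is a rotate-left by 30.
define void @flicker(i32 %x) {
entry:
  %rol = call i32 @llvm.fshl.i32(i32 %x, i32 %x, i32 30)
  switch i32 %rol, label %default [
    i32 0, label %hit
    i32 1, label %hit
    i32 2, label %hit
    i32 3, label %hit
  ]

hit:
  ret void

default:
  ret void
}

After this fold, the switch is on %x directly and each case value is rotated right by 30, giving cases 0, 4, 8, and 12, which is a sparser switch. If ReduceSwitchRange then fires (at least four cases, all values sharing the common factor 4), it shifts the case values back down to 0..3 and reinserts fshl(%x, %x, 30), i.e. exactly the rotate that InstCombine just removed.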

YanWQ-monad requested a review from nikic as a code owner on March 22, 2024 at 16:37

@llvmbot
Collaborator

llvmbot commented Mar 22, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Monad (YanWQ-monad)


Full diff: https://github.com/llvm/llvm-project/pull/86307.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/InstCombine/InstructionCombining.cpp (+10)
  • (added) llvm/test/Transforms/InstCombine/switch-rol.ll (+33)
diff --git a/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
index 7c40fb4fc86082..b6611cfbbfc1f4 100644
--- a/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
@@ -3645,6 +3645,16 @@ Instruction *InstCombinerImpl::visitSwitchInst(SwitchInst &SI) {
     }
   }
 
+  // Fold 'switch(rol(x, C1)) case C2:' to 'switch(x) case rol(C2, -C1):'
+  if (match(Cond,
+            m_FShl(m_Value(Op0), m_Deferred(Op0), m_ConstantInt(ShiftAmt)))) {
+    for (auto &Case : SI.cases()) {
+      const APInt NewCase = Case.getCaseValue()->getValue().rotr(ShiftAmt);
+      Case.setValue(ConstantInt::get(SI.getContext(), NewCase));
+    }
+    return replaceOperand(SI, 0, Op0);
+  }
+
   KnownBits Known = computeKnownBits(Cond, 0, &SI);
   unsigned LeadingKnownZeros = Known.countMinLeadingZeros();
   unsigned LeadingKnownOnes = Known.countMinLeadingOnes();
diff --git a/llvm/test/Transforms/InstCombine/switch-rol.ll b/llvm/test/Transforms/InstCombine/switch-rol.ll
new file mode 100644
index 00000000000000..1cd55ff91c9492
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/switch-rol.ll
@@ -0,0 +1,33 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt < %s -passes=instcombine -S | FileCheck %s
+
+declare void @dummy()
+
+define i32 @switch_rol(i32 %a) #0 {
+; CHECK-LABEL: define i32 @switch_rol(
+; CHECK-SAME: i32 [[A:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    switch i32 [[A]], label [[DEFAULT:%.*]] [
+; CHECK-NEXT:      i32 0, label [[TRAP_EXIT:%.*]]
+; CHECK-NEXT:      i32 20, label [[TRAP_EXIT]]
+; CHECK-NEXT:    ]
+; CHECK:       default:
+; CHECK-NEXT:    call void @dummy()
+; CHECK-NEXT:    br label [[TRAP_EXIT]]
+; CHECK:       trap.exit:
+; CHECK-NEXT:    ret i32 0
+;
+entry:
+  %rol = call i32 @llvm.fshl.i32(i32 %a, i32 %a, i32 30)
+  switch i32 %rol, label %default [
+  i32 0, label %trap.exit
+  i32 5, label %trap.exit
+  ]
+
+default:
+  call void @dummy()
+  br label %trap.exit
+
+trap.exit:
+  ret i32 0
+}

@goldsteinn
Contributor

This change looks fine, although it would be nice if we had some helper generically detecting if an op is reversible so we could generalize all these cases.

@goldsteinn
Contributor

This change looks fine, although it would be nice if we had some helper generically detecting if an op is reversible so we could generalize all these cases.

Would also be useful for things like icmp eq/ne X, C

@dtcxzyw
Member

dtcxzyw commented Mar 22, 2024

This change looks fine, although it would be nice if we had some helper generically detecting if an op is reversible so we could generalize all these cases.

Would also be useful for things like icmp eq/ne X, C

It has been supported.

  case Intrinsic::fshl:
  case Intrinsic::fshr:
    if (II->getArgOperand(0) == II->getArgOperand(1)) {
      const APInt *RotAmtC;
      // ror(X, RotAmtC) == C --> X == rol(C, RotAmtC)
      // rol(X, RotAmtC) == C --> X == ror(C, RotAmtC)
      if (match(II->getArgOperand(2), m_APInt(RotAmtC)))
        return new ICmpInst(Pred, II->getArgOperand(0),
                            II->getIntrinsicID() == Intrinsic::fshl
                                ? ConstantInt::get(Ty, C.rotr(*RotAmtC))
                                : ConstantInt::get(Ty, C.rotl(*RotAmtC)));
    }
    break;
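
A small hypothetical IR example of that existing fold (the function name and constants are made up for illustration):

define i1 @icmp_of_rotate(i32 %x) {
  %rot = call i32 @llvm.fshl.i32(i32 %x, i32 %x, i32 5)
  %cmp = icmp eq i32 %rot, 8
  ret i1 %cmp
}

InstCombine rewrites the compare to use %x directly, replacing the constant with ror(8, 5): icmp eq i32 %x, 1073741824 (that is, 1 << 30).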

@goldsteinn
Contributor

This change looks fine, although it would be nice if we had some helper generically detecting if an op is reversible so we could generalize all these cases.

Would also be useful for things like icmp eq/ne X, C

It has been supported.

Err, I don't mean that we necessarily need new support in icmp eq/ne X, C (or here for that matter).
I mean we could use a common helper between icmp eq/ne X, C and here, which would
definitely add coverage to this function and at the very least reduce code duplication.
I'm working on a patch.

@goldsteinn
Contributor

LGTM.

@goldsteinn
Contributor

Err, I don't mean that we necessarily need new support in icmp eq/ne X, C (or here for that matter). I mean we could use a common helper between icmp eq/ne X, C and here, which would definitely add coverage to this function and at the very least reduce code duplication. I'm working on a patch.

See #86346 for what I mean.

@goldsteinn
Contributor

ping @dtcxzyw @nikic

dtcxzyw added a commit to dtcxzyw/llvm-opt-benchmark that referenced this pull request Apr 4, 2024
@dtcxzyw
Member

dtcxzyw commented Apr 4, 2024

@YanWQ-monad I'd like to move the logic into SimplifyCFG. Then we will run into an infinite loop if the existing code in SimplifyCFG reverts your fold :)

@goldsteinn
Contributor

@YanWQ-monad I'd like to move the logic into SimplifyCFG. Then we will run into an infinite loop if the existing code in SimplifyCFG reverts your fold :)

Shouldn't that also apply for all the other cases we handle here?

@dtcxzyw
Member

dtcxzyw commented Apr 4, 2024

Shouldn't that also apply for all the other cases we handle here?

No. We should leave other optimizations in InstCombine because they don't modify DomTree and some of them need DT information. I suggest we implement this fold in SimplifyCFG so as to make it easier to see the impact.

@goldsteinn
Contributor

Shouldn't that also apply for all the other cases we handle here?

No. We should leave other optimizations in InstCombine because they don't modify DomTree and some of them need DT information. I suggest we implement this fold in SimplifyCFG so as to make it easier to see the impact.

I don't really see your point. How is this fold different from the shl one?

@nikic
Contributor

nikic commented Apr 10, 2024

I don't really like the idea of SimplifyCFG and InstCombine changing the code back and forth. Maybe we should limit the fold to the case where it will not make the switch less dense?

@YanWQ-monad
Contributor Author

YanWQ-monad commented Apr 10, 2024

I don't really like the idea of SimplifyCFG and InstCombine changing the code back and forth. Maybe we should limit the fold to the case where it will not make the switch less dense?

It's rare that the fold will make the switch dense, at least judging from the issue and from dtcxzyw's benchmark. So a more reasonable solution might be to limit the fold so that it won't affect ReduceSwitchRange?

If so, isSwitchDense may be needed to check the precondition, but it is currently only visible inside SimplifyCFG. Should I move isSwitchDense somewhere else so it can be shared between SimplifyCFG and InstCombine?

@XChy
Member

XChy commented May 10, 2024

I think folding it when the number of reachable successors is <= 3 is OK, because the backend doesn't create a lookup table in that case, and the real-world case in the original issue also has only 3 successors.
