[InstCombine] Convert or concat to fshl if opposite or concat exists #68502
Conversation
The current implementation of matchFunnelShift only allows the opposite-shift pattern. Refactor it to allow more patterns.
If there are two 'or' instructions that concatenate variables in opposite order, and the first 'or' dominates the second one, the second 'or' can be optimized to an fshl that rotates the first 'or'. This eliminates a shl and exposes more optimization opportunities for bswap/bitreverse.
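For illustration, a minimal IR sketch of the fold, adapted from the tests added in this PR (the fshl line shows the intended result, not part of the input):

  define i32 @concat_pair(i8 %x, i24 %y) {
    %zext.x = zext i8 %x to i32
    %slx = shl nuw i32 %zext.x, 24
    %zext.y = zext i24 %y to i32
    %xy = or i32 %slx, %zext.y        ; [x[7:0], y[23:0]]
    %sly = shl nuw i32 %zext.y, 8
    %yx = or i32 %sly, %zext.x        ; [y[23:0], x[7:0]]
    ; after the fold, %yx becomes:
    ;   %yx = call i32 @llvm.fshl.i32(i32 %xy, i32 %xy, i32 8)
    ret i32 %yx
  }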
@llvm/pr-subscribers-llvm-transforms
Changes: If there are two 'or' instructions that concatenate variables in opposite order […]
Full diff: https://github.com/llvm/llvm-project/pull/68502.diff
2 Files Affected:
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp b/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
index cbdab3e9c5fb91d..09017e46ee3bd4d 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
@@ -2727,105 +2727,161 @@ Instruction *InstCombinerImpl::matchBSwapOrBitReverse(Instruction &I,
}
/// Match UB-safe variants of the funnel shift intrinsic.
-static Instruction *matchFunnelShift(Instruction &Or, InstCombinerImpl &IC) {
+static Instruction *matchFunnelShift(Instruction &Or, InstCombinerImpl &IC,
+ const DominatorTree &DT) {
// TODO: Can we reduce the code duplication between this and the related
// rotate matching code under visitSelect and visitTrunc?
unsigned Width = Or.getType()->getScalarSizeInBits();
- // First, find an or'd pair of opposite shifts:
- // or (lshr ShVal0, ShAmt0), (shl ShVal1, ShAmt1)
- BinaryOperator *Or0, *Or1;
- if (!match(Or.getOperand(0), m_BinOp(Or0)) ||
- !match(Or.getOperand(1), m_BinOp(Or1)))
+ Instruction *Or0, *Or1;
+ if (!match(Or.getOperand(0), m_Instruction(Or0)) ||
+ !match(Or.getOperand(1), m_Instruction(Or1)))
return nullptr;
- Value *ShVal0, *ShVal1, *ShAmt0, *ShAmt1;
- if (!match(Or0, m_OneUse(m_LogicalShift(m_Value(ShVal0), m_Value(ShAmt0)))) ||
- !match(Or1, m_OneUse(m_LogicalShift(m_Value(ShVal1), m_Value(ShAmt1)))) ||
- Or0->getOpcode() == Or1->getOpcode())
- return nullptr;
+ bool IsFshl = true; // Sub on LSHR.
+ SmallVector<Value *, 3> FShiftArgs;
- // Canonicalize to or(shl(ShVal0, ShAmt0), lshr(ShVal1, ShAmt1)).
- if (Or0->getOpcode() == BinaryOperator::LShr) {
- std::swap(Or0, Or1);
- std::swap(ShVal0, ShVal1);
- std::swap(ShAmt0, ShAmt1);
- }
- assert(Or0->getOpcode() == BinaryOperator::Shl &&
- Or1->getOpcode() == BinaryOperator::LShr &&
- "Illegal or(shift,shift) pair");
-
- // Match the shift amount operands for a funnel shift pattern. This always
- // matches a subtraction on the R operand.
- auto matchShiftAmount = [&](Value *L, Value *R, unsigned Width) -> Value * {
- // Check for constant shift amounts that sum to the bitwidth.
- const APInt *LI, *RI;
- if (match(L, m_APIntAllowUndef(LI)) && match(R, m_APIntAllowUndef(RI)))
- if (LI->ult(Width) && RI->ult(Width) && (*LI + *RI) == Width)
- return ConstantInt::get(L->getType(), *LI);
-
- Constant *LC, *RC;
- if (match(L, m_Constant(LC)) && match(R, m_Constant(RC)) &&
- match(L, m_SpecificInt_ICMP(ICmpInst::ICMP_ULT, APInt(Width, Width))) &&
- match(R, m_SpecificInt_ICMP(ICmpInst::ICMP_ULT, APInt(Width, Width))) &&
- match(ConstantExpr::getAdd(LC, RC), m_SpecificIntAllowUndef(Width)))
- return ConstantExpr::mergeUndefsWith(LC, RC);
-
- // (shl ShVal, X) | (lshr ShVal, (Width - x)) iff X < Width.
- // We limit this to X < Width in case the backend re-expands the intrinsic,
- // and has to reintroduce a shift modulo operation (InstCombine might remove
- // it after this fold). This still doesn't guarantee that the final codegen
- // will match this original pattern.
- if (match(R, m_OneUse(m_Sub(m_SpecificInt(Width), m_Specific(L))))) {
- KnownBits KnownL = IC.computeKnownBits(L, /*Depth*/ 0, &Or);
- return KnownL.getMaxValue().ult(Width) ? L : nullptr;
+ // First, find an or'd pair of opposite shifts:
+ // or (lshr ShVal0, ShAmt0), (shl ShVal1, ShAmt1)
+ if (isa<BinaryOperator>(Or0) && isa<BinaryOperator>(Or1)) {
+ Value *ShVal0, *ShVal1, *ShAmt0, *ShAmt1;
+ if (!match(Or0,
+ m_OneUse(m_LogicalShift(m_Value(ShVal0), m_Value(ShAmt0)))) ||
+ !match(Or1,
+ m_OneUse(m_LogicalShift(m_Value(ShVal1), m_Value(ShAmt1)))) ||
+ Or0->getOpcode() == Or1->getOpcode())
+ return nullptr;
+
+ // Canonicalize to or(shl(ShVal0, ShAmt0), lshr(ShVal1, ShAmt1)).
+ if (Or0->getOpcode() == BinaryOperator::LShr) {
+ std::swap(Or0, Or1);
+ std::swap(ShVal0, ShVal1);
+ std::swap(ShAmt0, ShAmt1);
}
+ assert(Or0->getOpcode() == BinaryOperator::Shl &&
+ Or1->getOpcode() == BinaryOperator::LShr &&
+ "Illegal or(shift,shift) pair");
+
+ // Match the shift amount operands for a funnel shift pattern. This always
+ // matches a subtraction on the R operand.
+ auto matchShiftAmount = [&](Value *L, Value *R, unsigned Width) -> Value * {
+ // Check for constant shift amounts that sum to the bitwidth.
+ const APInt *LI, *RI;
+ if (match(L, m_APIntAllowUndef(LI)) && match(R, m_APIntAllowUndef(RI)))
+ if (LI->ult(Width) && RI->ult(Width) && (*LI + *RI) == Width)
+ return ConstantInt::get(L->getType(), *LI);
+
+ Constant *LC, *RC;
+ if (match(L, m_Constant(LC)) && match(R, m_Constant(RC)) &&
+ match(L,
+ m_SpecificInt_ICMP(ICmpInst::ICMP_ULT, APInt(Width, Width))) &&
+ match(R,
+ m_SpecificInt_ICMP(ICmpInst::ICMP_ULT, APInt(Width, Width))) &&
+ match(ConstantExpr::getAdd(LC, RC), m_SpecificIntAllowUndef(Width)))
+ return ConstantExpr::mergeUndefsWith(LC, RC);
+
+ // (shl ShVal, X) | (lshr ShVal, (Width - x)) iff X < Width.
+ // We limit this to X < Width in case the backend re-expands the
+ // intrinsic, and has to reintroduce a shift modulo operation (InstCombine
+ // might remove it after this fold). This still doesn't guarantee that the
+ // final codegen will match this original pattern.
+ if (match(R, m_OneUse(m_Sub(m_SpecificInt(Width), m_Specific(L))))) {
+ KnownBits KnownL = IC.computeKnownBits(L, /*Depth*/ 0, &Or);
+ return KnownL.getMaxValue().ult(Width) ? L : nullptr;
+ }
+
+ // For non-constant cases, the following patterns currently only work for
+ // rotation patterns.
+ // TODO: Add general funnel-shift compatible patterns.
+ if (ShVal0 != ShVal1)
+ return nullptr;
+
+ // For non-constant cases we don't support non-pow2 shift masks.
+ // TODO: Is it worth matching urem as well?
+ if (!isPowerOf2_32(Width))
+ return nullptr;
+
+ // The shift amount may be masked with negation:
+ // (shl ShVal, (X & (Width - 1))) | (lshr ShVal, ((-X) & (Width - 1)))
+ Value *X;
+ unsigned Mask = Width - 1;
+ if (match(L, m_And(m_Value(X), m_SpecificInt(Mask))) &&
+ match(R, m_And(m_Neg(m_Specific(X)), m_SpecificInt(Mask))))
+ return X;
+
+ // Similar to above, but the shift amount may be extended after masking,
+ // so return the extended value as the parameter for the intrinsic.
+ if (match(L, m_ZExt(m_And(m_Value(X), m_SpecificInt(Mask)))) &&
+ match(R,
+ m_And(m_Neg(m_ZExt(m_And(m_Specific(X), m_SpecificInt(Mask)))),
+ m_SpecificInt(Mask))))
+ return L;
+
+ if (match(L, m_ZExt(m_And(m_Value(X), m_SpecificInt(Mask)))) &&
+ match(R, m_ZExt(m_And(m_Neg(m_Specific(X)), m_SpecificInt(Mask)))))
+ return L;
- // For non-constant cases, the following patterns currently only work for
- // rotation patterns.
- // TODO: Add general funnel-shift compatible patterns.
- if (ShVal0 != ShVal1)
return nullptr;
+ };
- // For non-constant cases we don't support non-pow2 shift masks.
- // TODO: Is it worth matching urem as well?
- if (!isPowerOf2_32(Width))
+ Value *ShAmt = matchShiftAmount(ShAmt0, ShAmt1, Width);
+ if (!ShAmt) {
+ ShAmt = matchShiftAmount(ShAmt1, ShAmt0, Width);
+ IsFshl = false; // Sub on SHL.
+ }
+ if (!ShAmt)
return nullptr;
- // The shift amount may be masked with negation:
- // (shl ShVal, (X & (Width - 1))) | (lshr ShVal, ((-X) & (Width - 1)))
- Value *X;
- unsigned Mask = Width - 1;
- if (match(L, m_And(m_Value(X), m_SpecificInt(Mask))) &&
- match(R, m_And(m_Neg(m_Specific(X)), m_SpecificInt(Mask))))
- return X;
+ FShiftArgs = {ShVal0, ShVal1, ShAmt};
- // Similar to above, but the shift amount may be extended after masking,
- // so return the extended value as the parameter for the intrinsic.
- if (match(L, m_ZExt(m_And(m_Value(X), m_SpecificInt(Mask)))) &&
- match(R, m_And(m_Neg(m_ZExt(m_And(m_Specific(X), m_SpecificInt(Mask)))),
- m_SpecificInt(Mask))))
- return L;
+ } else if (isa<ZExtInst>(Or0) || isa<ZExtInst>(Or1)) {
+ // If there are two 'or' instructions concat variables in opposite order,
+ // the latter one can be safely convert to fshl.
+ //
+ // LowHigh = or (shl (zext Low), Width - ZextHighShlAmt), (zext High)
+ // HighLow = or (shl (zext High), ZextHighShlAmt), (zext Low)
+ // ->
+ // HighLow = fshl LowHigh, LowHigh, ZextHighShlAmt
+ if (!isa<ZExtInst>(Or1))
+ std::swap(Or0, Or1);
+
+ Value *High, *ZextHigh, *Low;
+ const APInt *ZextHighShlAmt;
+ if (!match(Or0,
+ m_OneUse(m_Shl(m_Value(ZextHigh), m_APInt(ZextHighShlAmt)))))
+ return nullptr;
- if (match(L, m_ZExt(m_And(m_Value(X), m_SpecificInt(Mask)))) &&
- match(R, m_ZExt(m_And(m_Neg(m_Specific(X)), m_SpecificInt(Mask)))))
- return L;
+ if (!match(Or1, m_ZExt(m_Value(Low))) ||
+ !match(ZextHigh, m_ZExt(m_Value(High))))
+ return nullptr;
- return nullptr;
- };
+ unsigned HighSize = High->getType()->getScalarSizeInBits();
+ unsigned LowSize = Low->getType()->getScalarSizeInBits();
+ if (*ZextHighShlAmt != LowSize || HighSize + LowSize != Width)
+ return nullptr;
- Value *ShAmt = matchShiftAmount(ShAmt0, ShAmt1, Width);
- bool IsFshl = true; // Sub on LSHR.
- if (!ShAmt) {
- ShAmt = matchShiftAmount(ShAmt1, ShAmt0, Width);
- IsFshl = false; // Sub on SHL.
+ for (User *U : ZextHigh->users()) {
+ Value *X, *Y;
+ if (!match(U, m_Or(m_Value(X), m_Value(Y))))
+ continue;
+
+ if (!isa<ZExtInst>(Y))
+ std::swap(X, Y);
+
+ if (match(X, m_Shl(m_Specific(Or1), m_SpecificInt(HighSize))) &&
+ match(Y, m_Specific(ZextHigh)) && DT.dominates(U, &Or)) {
+ FShiftArgs = {U, U, ConstantInt::get(Or0->getType(), *ZextHighShlAmt)};
+ break;
+ }
+ }
}
- if (!ShAmt)
+
+ if (FShiftArgs.empty())
return nullptr;
Intrinsic::ID IID = IsFshl ? Intrinsic::fshl : Intrinsic::fshr;
Function *F = Intrinsic::getDeclaration(Or.getModule(), IID, Or.getType());
- return CallInst::Create(F, {ShVal0, ShVal1, ShAmt});
+ return CallInst::Create(F, FShiftArgs);
}
/// Attempt to combine or(zext(x),shl(zext(y),bw/2) concat packing patterns.
@@ -3319,7 +3375,7 @@ Instruction *InstCombinerImpl::visitOr(BinaryOperator &I) {
/*MatchBitReversals*/ true))
return BitOp;
- if (Instruction *Funnel = matchFunnelShift(I, *this))
+ if (Instruction *Funnel = matchFunnelShift(I, *this, DT))
return Funnel;
if (Instruction *Concat = matchOrConcat(I, Builder))
diff --git a/llvm/test/Transforms/InstCombine/funnel.ll b/llvm/test/Transforms/InstCombine/funnel.ll
index 60ce49a1635623f..c5bd1aa7b4351bc 100644
--- a/llvm/test/Transforms/InstCombine/funnel.ll
+++ b/llvm/test/Transforms/InstCombine/funnel.ll
@@ -354,6 +354,48 @@ define <2 x i64> @fshl_select_vector(<2 x i64> %x, <2 x i64> %y, <2 x i64> %sham
ret <2 x i64> %r
}
+; Convert 'or concat' to fshl if opposite 'or concat' exists.
+
+define i32 @fshl_concat(i8 %x, i24 %y, ptr %addr) {
+; CHECK-LABEL: @fshl_concat(
+; CHECK-NEXT: [[ZEXT_X:%.*]] = zext i8 [[X:%.*]] to i32
+; CHECK-NEXT: [[SLX:%.*]] = shl nuw i32 [[ZEXT_X]], 24
+; CHECK-NEXT: [[ZEXT_Y:%.*]] = zext i24 [[Y:%.*]] to i32
+; CHECK-NEXT: [[XY:%.*]] = or i32 [[SLX]], [[ZEXT_Y]]
+; CHECK-NEXT: store i32 [[XY]], ptr [[ADDR:%.*]], align 4
+; CHECK-NEXT: [[YX:%.*]] = call i32 @llvm.fshl.i32(i32 [[XY]], i32 [[XY]], i32 8)
+; CHECK-NEXT: ret i32 [[YX]]
+;
+ %zext.x = zext i8 %x to i32
+ %slx = shl nuw i32 %zext.x, 24
+ %zext.y = zext i24 %y to i32
+ %xy = or i32 %zext.y, %slx
+ store i32 %xy, ptr %addr, align 4
+ %sly = shl nuw i32 %zext.y, 8
+ %yx = or i32 %zext.x, %sly
+ ret i32 %yx
+}
+
+define <2 x i32> @fshl_concat_vector(<2 x i8> %x, <2 x i24> %y, ptr %addr) {
+; CHECK-LABEL: @fshl_concat_vector(
+; CHECK-NEXT: [[ZEXT_X:%.*]] = zext <2 x i8> [[X:%.*]] to <2 x i32>
+; CHECK-NEXT: [[SLX:%.*]] = shl nuw <2 x i32> [[ZEXT_X]], <i32 24, i32 24>
+; CHECK-NEXT: [[ZEXT_Y:%.*]] = zext <2 x i24> [[Y:%.*]] to <2 x i32>
+; CHECK-NEXT: [[XY:%.*]] = or <2 x i32> [[SLX]], [[ZEXT_Y]]
+; CHECK-NEXT: store <2 x i32> [[XY]], ptr [[ADDR:%.*]], align 4
+; CHECK-NEXT: [[YX:%.*]] = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> [[XY]], <2 x i32> [[XY]], <2 x i32> <i32 8, i32 8>)
+; CHECK-NEXT: ret <2 x i32> [[YX]]
+;
+ %zext.x = zext <2 x i8> %x to <2 x i32>
+ %slx = shl nuw <2 x i32> %zext.x, <i32 24, i32 24>
+ %zext.y = zext <2 x i24> %y to <2 x i32>
+ %xy = or <2 x i32> %slx, %zext.y
+ store <2 x i32> %xy, ptr %addr, align 4
+ %sly = shl nuw <2 x i32> %zext.y, <i32 8, i32 8>
+ %yx = or <2 x i32> %sly, %zext.x
+ ret <2 x i32> %yx
+}
+
; Negative test - an oversized shift in the narrow type would produce the wrong value.
define i8 @unmasked_shlop_unmasked_shift_amount(i32 %x, i32 %y, i32 %shamt) {
Any comments?
This PR is based on #68474.
I commented on the NFC patch. Get that in, then rebase this.
    return nullptr;

  if (!match(Or1, m_ZExt(m_Value(Low))) ||
      !match(ZextHigh, m_ZExt(m_Value(High))))
Do you actually need the shifted value to be zext? I.e. if you have (shl (zext i24 to i32), 8), that is bitwise equivalent to (shl i32, 8).
If we only match (shl i32, 8), we can't guarantee it's a reverse concat, since we don't know whether the most significant 8 bits of the i32 are zero.

  %zext.x = zext i8 %x to i32
  %slx = shl nuw i32 %zext.x, 24
  %zext.y = zext i24 %y to i32
  %xy = or i32 %zext.y, %slx        ; [x[7:0], y[23:0]]
  %sly = shl nuw i32 %zext.y, 8
  %yx = or i32 %zext.x, %sly        ; [y[23:0], x[7:0]]

If we don't match the zext:

  %zext.x = zext i8 %x to i32
  %slx = shl nuw i32 %zext.x, 24
  %xy = or i32 %y, %slx             ; [unknown, y[23:0]]; y[31:24] may not be zero
  %sly = shl nuw i32 %y, 8          ; [y[23:0], 0,0,0,0,0,0,0,0]
  %yx = or i32 %zext.x, %sly        ; [y[23:0], x[7:0]]
Ahh, I see, although we could also handle a masking 'and'. Really it's just a knownbits check though.
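For concreteness, a rough sketch of what that could look like (an assumption about the alternative, not code from this patch):

  // Derive the low field's width from known bits instead of requiring an
  // explicit zext; any value whose high bits are known zero would qualify.
  KnownBits KnownLow = IC.computeKnownBits(Or1, /*Depth=*/0, &Or);
  unsigned LowSize = Width - KnownLow.countMinLeadingZeros();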
I think we don't need to check for zext. An example without zext: https://alive2.llvm.org/ce/z/MytePb
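A hand-written variant of that idea (hypothetical reconstruction, not copied from the alive2 link), where masks rather than zexts guarantee the zero bits:

  %low = and i32 %y, 16777215       ; y & 0x00FFFFFF, top 8 bits known zero
  %high = and i32 %x, 255           ; x & 0xFF, top 24 bits known zero
  %slx = shl i32 %high, 24
  %xy = or i32 %slx, %low           ; [x[7:0], y[23:0]]
  %sly = shl i32 %low, 8
  %yx = or i32 %sly, %high          ; [y[23:0], x[7:0]] == fshl(%xy, %xy, 8)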
  unsigned LowSize = Low->getType()->getScalarSizeInBits();
  // Make sure High does not overlap with Low and most significant bits of
  // High aren't shifted out.
  if (ZextHighShlAmt->ult(LowSize) || ZextHighShlAmt->ugt(Width - HighSize))
I think you are missing a negative test for this.
Done
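For reference, the shape of such a negative case (illustrative, names hypothetical): with LowSize = 24, a shift amount of 16 makes the fields overlap, so the 'or' is not a clean concat and must be rejected by the ult(LowSize) check above:

  %zext.x = zext i8 %x to i32
  %slx = shl i32 %zext.x, 16        ; 16 < LowSize (24): x overlaps y[23:16]
  %zext.y = zext i24 %y to i32
  %xy = or i32 %slx, %zext.y        ; not concat(x, y); must not form fshl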
  // Make sure Low does not overlap with High and most significant bits of
  // Low aren't shifted out and we can rotate shift LowHigh to HighLow.
  if (ZextLowShlAmt->ult(HighSize) || ZextLowShlAmt->ugt(Width - LowSize) ||
      *ZextLowShlAmt + *ZextHighShlAmt != Width)
I think you are missing a negative test for this as well (also for the non-dominating case).
In fact, the non-dominating case was already covered. We checked:

  yx = fshl xy, xy,

When InstCombine handles the first 'or' (the combine for xy), it meets every check except dominance. If we didn't check dominance, it would be converted to xy = fshl yx, yx, which is absolutely wrong.
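A sketch of that situation (illustrative): in straight-line code %xy is defined first, so while visiting %xy the matcher finds the structurally identical %yx among the users of the zext, but %yx does not dominate %xy:

  %xy = or i32 %slx, %zext.y        ; visited first; matches with U = %yx
  %yx = or i32 %sly, %zext.x        ; dominated by %xy, not vice versa
  ; without the DT.dominates(U, &Or) check we would emit
  ;   %xy = call i32 @llvm.fshl.i32(i32 %yx, i32 %yx, i32 24)
  ; which uses %yx before it is defined, which is invalid.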
See the most recent comments about missing tests. That being said, the code looks functionally correct to me.
I'm also skeptical about accepting this optimization. Looking at the motivating case in #68502 (comment), this seems like a bad approach to the problem: it means that in order to fold the pattern to […] Replacing the […]
In fact, this is a real case.
Any more comments? I'd like to merge it if there are no objections.
I think there are outstanding objections, or at least blocking concerns from nikic.
@nikic, any more comments?
gentle ping @nikic
I'd like to merge it if there are no more comments.
LGTM.
Your original example does not verify: https://alive2.llvm.org/ce/z/Bkd89Z. Can you please provide a correct example of what you're trying to do?
As I said in previous comments: this PR only teaches InstCombine to optimize shl+or into fshl.
Yes, I understand that this transform is only a step towards handling the full pattern. I'm asking for a complete, working example of the original motivating case. The snippets posted in #68502 (comment) do not appear to be correct, or I failed to assemble them correctly. Please provide complete src and tgt functions that verify with alive2.
Oh... sorry, I made a wrong IR example.
Thanks for the updated example! To explain what I meant in my first comment using this example: we would perform the transform https://alive2.llvm.org/ce/z/nllcB_, which does not depend at all on how […]. Is this not a viable approach? Is there a concern here that generating both fshl and bitreverse may be non-profitable for targets without bitreverse? Or maybe supporting this makes the matching too expensive?
It's absolutely a feasible solution.

Solution 1: […]
Advantage: […]

Solution 2: […]
Advantage: […]

Both solutions require scanning def-user chains. I don't think that is an issue.

Ref for bitreverse optimization: […]
Any comments?
I'd like to merge it. Please let me know if you have more concerns.
For my money this was merged prematurely. There are still outstanding concerns about whether this transform is desirable, and there is an outstanding comment about the implementation itself.
Thanks for the kind reminder. I think I replied to the concerns and waited a long time... I thought there should be no problem, so I merged it...