
[InstCombine] Convert or concat to fshl if opposite or concat exists #68502

Merged
merged 7 commits into llvm:main from the concat branch
Nov 20, 2023

Conversation

HaohaiWen
Contributor

If there are two 'or' instructions that concatenate variables in opposite order,
and the first 'or' dominates the second one, the second 'or' can be optimized
to an fshl that rotates the first 'or'. This can eliminate a shl and expose
more optimization opportunities for bswap/bitreverse.

The current implementation of matchFunnelShift only allows the opposite-shift
pattern. Refactor it to allow more patterns.
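To make the fold concrete, here is a minimal IR sketch distilled from the fshl_concat test added in this patch (the store of %xy is omitted):

%zext.x = zext i8 %x to i32
%slx = shl nuw i32 %zext.x, 24
%zext.y = zext i24 %y to i32
%xy = or i32 %slx, %zext.y       ; first concat:  [x, y]
%sly = shl nuw i32 %zext.y, 8
%yx = or i32 %zext.x, %sly       ; second concat: [y, x]
; %xy dominates %yx, so the second 'or' (and its shl) can be rewritten as
; %yx = call i32 @llvm.fshl.i32(i32 %xy, i32 %xy, i32 8)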
@llvmbot
Collaborator

llvmbot commented Oct 8, 2023

@llvm/pr-subscribers-llvm-transforms

Changes

If there are two 'or' instructions that concatenate variables in opposite order,
and the first 'or' dominates the second one, the second 'or' can be optimized
to an fshl that rotates the first 'or'. This can eliminate a shl and expose
more optimization opportunities for bswap/bitreverse.


Full diff: https://github.com/llvm/llvm-project/pull/68502.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp (+134-78)
  • (modified) llvm/test/Transforms/InstCombine/funnel.ll (+42)
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp b/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
index cbdab3e9c5fb91d..09017e46ee3bd4d 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
@@ -2727,105 +2727,161 @@ Instruction *InstCombinerImpl::matchBSwapOrBitReverse(Instruction &I,
 }
 
 /// Match UB-safe variants of the funnel shift intrinsic.
-static Instruction *matchFunnelShift(Instruction &Or, InstCombinerImpl &IC) {
+static Instruction *matchFunnelShift(Instruction &Or, InstCombinerImpl &IC,
+                                     const DominatorTree &DT) {
   // TODO: Can we reduce the code duplication between this and the related
   // rotate matching code under visitSelect and visitTrunc?
   unsigned Width = Or.getType()->getScalarSizeInBits();
 
-  // First, find an or'd pair of opposite shifts:
-  // or (lshr ShVal0, ShAmt0), (shl ShVal1, ShAmt1)
-  BinaryOperator *Or0, *Or1;
-  if (!match(Or.getOperand(0), m_BinOp(Or0)) ||
-      !match(Or.getOperand(1), m_BinOp(Or1)))
+  Instruction *Or0, *Or1;
+  if (!match(Or.getOperand(0), m_Instruction(Or0)) ||
+      !match(Or.getOperand(1), m_Instruction(Or1)))
     return nullptr;
 
-  Value *ShVal0, *ShVal1, *ShAmt0, *ShAmt1;
-  if (!match(Or0, m_OneUse(m_LogicalShift(m_Value(ShVal0), m_Value(ShAmt0)))) ||
-      !match(Or1, m_OneUse(m_LogicalShift(m_Value(ShVal1), m_Value(ShAmt1)))) ||
-      Or0->getOpcode() == Or1->getOpcode())
-    return nullptr;
+  bool IsFshl = true; // Sub on LSHR.
+  SmallVector<Value *, 3> FShiftArgs;
 
-  // Canonicalize to or(shl(ShVal0, ShAmt0), lshr(ShVal1, ShAmt1)).
-  if (Or0->getOpcode() == BinaryOperator::LShr) {
-    std::swap(Or0, Or1);
-    std::swap(ShVal0, ShVal1);
-    std::swap(ShAmt0, ShAmt1);
-  }
-  assert(Or0->getOpcode() == BinaryOperator::Shl &&
-         Or1->getOpcode() == BinaryOperator::LShr &&
-         "Illegal or(shift,shift) pair");
-
-  // Match the shift amount operands for a funnel shift pattern. This always
-  // matches a subtraction on the R operand.
-  auto matchShiftAmount = [&](Value *L, Value *R, unsigned Width) -> Value * {
-    // Check for constant shift amounts that sum to the bitwidth.
-    const APInt *LI, *RI;
-    if (match(L, m_APIntAllowUndef(LI)) && match(R, m_APIntAllowUndef(RI)))
-      if (LI->ult(Width) && RI->ult(Width) && (*LI + *RI) == Width)
-        return ConstantInt::get(L->getType(), *LI);
-
-    Constant *LC, *RC;
-    if (match(L, m_Constant(LC)) && match(R, m_Constant(RC)) &&
-        match(L, m_SpecificInt_ICMP(ICmpInst::ICMP_ULT, APInt(Width, Width))) &&
-        match(R, m_SpecificInt_ICMP(ICmpInst::ICMP_ULT, APInt(Width, Width))) &&
-        match(ConstantExpr::getAdd(LC, RC), m_SpecificIntAllowUndef(Width)))
-      return ConstantExpr::mergeUndefsWith(LC, RC);
-
-    // (shl ShVal, X) | (lshr ShVal, (Width - x)) iff X < Width.
-    // We limit this to X < Width in case the backend re-expands the intrinsic,
-    // and has to reintroduce a shift modulo operation (InstCombine might remove
-    // it after this fold). This still doesn't guarantee that the final codegen
-    // will match this original pattern.
-    if (match(R, m_OneUse(m_Sub(m_SpecificInt(Width), m_Specific(L))))) {
-      KnownBits KnownL = IC.computeKnownBits(L, /*Depth*/ 0, &Or);
-      return KnownL.getMaxValue().ult(Width) ? L : nullptr;
+  // First, find an or'd pair of opposite shifts:
+  // or (lshr ShVal0, ShAmt0), (shl ShVal1, ShAmt1)
+  if (isa<BinaryOperator>(Or0) && isa<BinaryOperator>(Or1)) {
+    Value *ShVal0, *ShVal1, *ShAmt0, *ShAmt1;
+    if (!match(Or0,
+               m_OneUse(m_LogicalShift(m_Value(ShVal0), m_Value(ShAmt0)))) ||
+        !match(Or1,
+               m_OneUse(m_LogicalShift(m_Value(ShVal1), m_Value(ShAmt1)))) ||
+        Or0->getOpcode() == Or1->getOpcode())
+      return nullptr;
+
+    // Canonicalize to or(shl(ShVal0, ShAmt0), lshr(ShVal1, ShAmt1)).
+    if (Or0->getOpcode() == BinaryOperator::LShr) {
+      std::swap(Or0, Or1);
+      std::swap(ShVal0, ShVal1);
+      std::swap(ShAmt0, ShAmt1);
     }
+    assert(Or0->getOpcode() == BinaryOperator::Shl &&
+           Or1->getOpcode() == BinaryOperator::LShr &&
+           "Illegal or(shift,shift) pair");
+
+    // Match the shift amount operands for a funnel shift pattern. This always
+    // matches a subtraction on the R operand.
+    auto matchShiftAmount = [&](Value *L, Value *R, unsigned Width) -> Value * {
+      // Check for constant shift amounts that sum to the bitwidth.
+      const APInt *LI, *RI;
+      if (match(L, m_APIntAllowUndef(LI)) && match(R, m_APIntAllowUndef(RI)))
+        if (LI->ult(Width) && RI->ult(Width) && (*LI + *RI) == Width)
+          return ConstantInt::get(L->getType(), *LI);
+
+      Constant *LC, *RC;
+      if (match(L, m_Constant(LC)) && match(R, m_Constant(RC)) &&
+          match(L,
+                m_SpecificInt_ICMP(ICmpInst::ICMP_ULT, APInt(Width, Width))) &&
+          match(R,
+                m_SpecificInt_ICMP(ICmpInst::ICMP_ULT, APInt(Width, Width))) &&
+          match(ConstantExpr::getAdd(LC, RC), m_SpecificIntAllowUndef(Width)))
+        return ConstantExpr::mergeUndefsWith(LC, RC);
+
+      // (shl ShVal, X) | (lshr ShVal, (Width - x)) iff X < Width.
+      // We limit this to X < Width in case the backend re-expands the
+      // intrinsic, and has to reintroduce a shift modulo operation (InstCombine
+      // might remove it after this fold). This still doesn't guarantee that the
+      // final codegen will match this original pattern.
+      if (match(R, m_OneUse(m_Sub(m_SpecificInt(Width), m_Specific(L))))) {
+        KnownBits KnownL = IC.computeKnownBits(L, /*Depth*/ 0, &Or);
+        return KnownL.getMaxValue().ult(Width) ? L : nullptr;
+      }
+
+      // For non-constant cases, the following patterns currently only work for
+      // rotation patterns.
+      // TODO: Add general funnel-shift compatible patterns.
+      if (ShVal0 != ShVal1)
+        return nullptr;
+
+      // For non-constant cases we don't support non-pow2 shift masks.
+      // TODO: Is it worth matching urem as well?
+      if (!isPowerOf2_32(Width))
+        return nullptr;
+
+      // The shift amount may be masked with negation:
+      // (shl ShVal, (X & (Width - 1))) | (lshr ShVal, ((-X) & (Width - 1)))
+      Value *X;
+      unsigned Mask = Width - 1;
+      if (match(L, m_And(m_Value(X), m_SpecificInt(Mask))) &&
+          match(R, m_And(m_Neg(m_Specific(X)), m_SpecificInt(Mask))))
+        return X;
+
+      // Similar to above, but the shift amount may be extended after masking,
+      // so return the extended value as the parameter for the intrinsic.
+      if (match(L, m_ZExt(m_And(m_Value(X), m_SpecificInt(Mask)))) &&
+          match(R,
+                m_And(m_Neg(m_ZExt(m_And(m_Specific(X), m_SpecificInt(Mask)))),
+                      m_SpecificInt(Mask))))
+        return L;
+
+      if (match(L, m_ZExt(m_And(m_Value(X), m_SpecificInt(Mask)))) &&
+          match(R, m_ZExt(m_And(m_Neg(m_Specific(X)), m_SpecificInt(Mask)))))
+        return L;
 
-    // For non-constant cases, the following patterns currently only work for
-    // rotation patterns.
-    // TODO: Add general funnel-shift compatible patterns.
-    if (ShVal0 != ShVal1)
       return nullptr;
+    };
 
-    // For non-constant cases we don't support non-pow2 shift masks.
-    // TODO: Is it worth matching urem as well?
-    if (!isPowerOf2_32(Width))
+    Value *ShAmt = matchShiftAmount(ShAmt0, ShAmt1, Width);
+    if (!ShAmt) {
+      ShAmt = matchShiftAmount(ShAmt1, ShAmt0, Width);
+      IsFshl = false; // Sub on SHL.
+    }
+    if (!ShAmt)
       return nullptr;
 
-    // The shift amount may be masked with negation:
-    // (shl ShVal, (X & (Width - 1))) | (lshr ShVal, ((-X) & (Width - 1)))
-    Value *X;
-    unsigned Mask = Width - 1;
-    if (match(L, m_And(m_Value(X), m_SpecificInt(Mask))) &&
-        match(R, m_And(m_Neg(m_Specific(X)), m_SpecificInt(Mask))))
-      return X;
+    FShiftArgs = {ShVal0, ShVal1, ShAmt};
 
-    // Similar to above, but the shift amount may be extended after masking,
-    // so return the extended value as the parameter for the intrinsic.
-    if (match(L, m_ZExt(m_And(m_Value(X), m_SpecificInt(Mask)))) &&
-        match(R, m_And(m_Neg(m_ZExt(m_And(m_Specific(X), m_SpecificInt(Mask)))),
-                       m_SpecificInt(Mask))))
-      return L;
+  } else if (isa<ZExtInst>(Or0) || isa<ZExtInst>(Or1)) {
+    // If there are two 'or' instructions concat variables in opposite order,
+    // the latter one can be safely convert to fshl.
+    //
+    // LowHigh = or (shl (zext Low), Width - ZextHighShlAmt), (zext High)
+    // HighLow = or (shl (zext High), ZextHighShlAmt), (zext Low)
+    // ->
+    // HighLow = fshl LowHigh, LowHigh, ZextHighShlAmt
+    if (!isa<ZExtInst>(Or1))
+      std::swap(Or0, Or1);
+
+    Value *High, *ZextHigh, *Low;
+    const APInt *ZextHighShlAmt;
+    if (!match(Or0,
+               m_OneUse(m_Shl(m_Value(ZextHigh), m_APInt(ZextHighShlAmt)))))
+      return nullptr;
 
-    if (match(L, m_ZExt(m_And(m_Value(X), m_SpecificInt(Mask)))) &&
-        match(R, m_ZExt(m_And(m_Neg(m_Specific(X)), m_SpecificInt(Mask)))))
-      return L;
+    if (!match(Or1, m_ZExt(m_Value(Low))) ||
+        !match(ZextHigh, m_ZExt(m_Value(High))))
+      return nullptr;
 
-    return nullptr;
-  };
+    unsigned HighSize = High->getType()->getScalarSizeInBits();
+    unsigned LowSize = Low->getType()->getScalarSizeInBits();
+    if (*ZextHighShlAmt != LowSize || HighSize + LowSize != Width)
+      return nullptr;
 
-  Value *ShAmt = matchShiftAmount(ShAmt0, ShAmt1, Width);
-  bool IsFshl = true; // Sub on LSHR.
-  if (!ShAmt) {
-    ShAmt = matchShiftAmount(ShAmt1, ShAmt0, Width);
-    IsFshl = false; // Sub on SHL.
+    for (User *U : ZextHigh->users()) {
+      Value *X, *Y;
+      if (!match(U, m_Or(m_Value(X), m_Value(Y))))
+        continue;
+
+      if (!isa<ZExtInst>(Y))
+        std::swap(X, Y);
+
+      if (match(X, m_Shl(m_Specific(Or1), m_SpecificInt(HighSize))) &&
+          match(Y, m_Specific(ZextHigh)) && DT.dominates(U, &Or)) {
+        FShiftArgs = {U, U, ConstantInt::get(Or0->getType(), *ZextHighShlAmt)};
+        break;
+      }
+    }
   }
-  if (!ShAmt)
+
+  if (FShiftArgs.empty())
     return nullptr;
 
   Intrinsic::ID IID = IsFshl ? Intrinsic::fshl : Intrinsic::fshr;
   Function *F = Intrinsic::getDeclaration(Or.getModule(), IID, Or.getType());
-  return CallInst::Create(F, {ShVal0, ShVal1, ShAmt});
+  return CallInst::Create(F, FShiftArgs);
 }
 
 /// Attempt to combine or(zext(x),shl(zext(y),bw/2) concat packing patterns.
@@ -3319,7 +3375,7 @@ Instruction *InstCombinerImpl::visitOr(BinaryOperator &I) {
                                                   /*MatchBitReversals*/ true))
     return BitOp;
 
-  if (Instruction *Funnel = matchFunnelShift(I, *this))
+  if (Instruction *Funnel = matchFunnelShift(I, *this, DT))
     return Funnel;
 
   if (Instruction *Concat = matchOrConcat(I, Builder))
diff --git a/llvm/test/Transforms/InstCombine/funnel.ll b/llvm/test/Transforms/InstCombine/funnel.ll
index 60ce49a1635623f..c5bd1aa7b4351bc 100644
--- a/llvm/test/Transforms/InstCombine/funnel.ll
+++ b/llvm/test/Transforms/InstCombine/funnel.ll
@@ -354,6 +354,48 @@ define <2 x i64> @fshl_select_vector(<2 x i64> %x, <2 x i64> %y, <2 x i64> %sham
   ret <2 x i64> %r
 }
 
+; Convert 'or concat' to fshl if opposite 'or concat' exists.
+
+define i32 @fshl_concat(i8 %x, i24 %y, ptr %addr) {
+; CHECK-LABEL: @fshl_concat(
+; CHECK-NEXT:    [[ZEXT_X:%.*]] = zext i8 [[X:%.*]] to i32
+; CHECK-NEXT:    [[SLX:%.*]] = shl nuw i32 [[ZEXT_X]], 24
+; CHECK-NEXT:    [[ZEXT_Y:%.*]] = zext i24 [[Y:%.*]] to i32
+; CHECK-NEXT:    [[XY:%.*]] = or i32 [[SLX]], [[ZEXT_Y]]
+; CHECK-NEXT:    store i32 [[XY]], ptr [[ADDR:%.*]], align 4
+; CHECK-NEXT:    [[YX:%.*]] = call i32 @llvm.fshl.i32(i32 [[XY]], i32 [[XY]], i32 8)
+; CHECK-NEXT:    ret i32 [[YX]]
+;
+  %zext.x = zext i8 %x to i32
+  %slx = shl nuw i32 %zext.x, 24
+  %zext.y = zext i24 %y to i32
+  %xy = or i32 %zext.y, %slx
+  store i32 %xy, ptr %addr, align 4
+  %sly = shl nuw i32 %zext.y, 8
+  %yx = or i32 %zext.x, %sly
+  ret i32 %yx
+}
+
+define <2 x i32> @fshl_concat_vector(<2 x i8> %x, <2 x i24> %y, ptr %addr) {
+; CHECK-LABEL: @fshl_concat_vector(
+; CHECK-NEXT:    [[ZEXT_X:%.*]] = zext <2 x i8> [[X:%.*]] to <2 x i32>
+; CHECK-NEXT:    [[SLX:%.*]] = shl nuw <2 x i32> [[ZEXT_X]], <i32 24, i32 24>
+; CHECK-NEXT:    [[ZEXT_Y:%.*]] = zext <2 x i24> [[Y:%.*]] to <2 x i32>
+; CHECK-NEXT:    [[XY:%.*]] = or <2 x i32> [[SLX]], [[ZEXT_Y]]
+; CHECK-NEXT:    store <2 x i32> [[XY]], ptr [[ADDR:%.*]], align 4
+; CHECK-NEXT:    [[YX:%.*]] = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> [[XY]], <2 x i32> [[XY]], <2 x i32> <i32 8, i32 8>)
+; CHECK-NEXT:    ret <2 x i32> [[YX]]
+;
+  %zext.x = zext <2 x i8> %x to <2 x i32>
+  %slx = shl nuw <2 x i32> %zext.x, <i32 24, i32 24>
+  %zext.y = zext <2 x i24> %y to <2 x i32>
+  %xy = or <2 x i32> %slx, %zext.y
+  store <2 x i32> %xy, ptr %addr, align 4
+  %sly = shl nuw <2 x i32> %zext.y, <i32 8, i32 8>
+  %yx = or <2 x i32> %sly, %zext.x
+  ret <2 x i32> %yx
+}
+
 ; Negative test - an oversized shift in the narrow type would produce the wrong value.
 
 define i8 @unmasked_shlop_unmasked_shift_amount(i32 %x, i32 %y, i32 %shamt) {

@HaohaiWen
Contributor Author

Any comments?

@HaohaiWen
Contributor Author

This PR is based on #68474

@goldsteinn
Contributor

This PR is based on #68474

I commented on the NFC patch. Get that in, then rebase this.

return nullptr;

if (!match(Or1, m_ZExt(m_Value(Low))) ||
!match(ZextHigh, m_ZExt(m_Value(High))))
Contributor

Do you actually need the shifted value to be zext? I.e., if you have (shl (zext i24 to i32), 8), that is bitwise equivalent to (shl i32, 8).

Contributor Author

If we only match (shl i32, 8), we can't guarantee it's a reversed concat, since we don't know whether the most significant 8 bits of the i32 are zero.

%zext.x = zext i8 %x to i32 
%slx = shl nuw i32 %zext.x, 24 
%zext.y = zext i24 %y to i32 
%xy = or i32 %zext.y, %slx        #[x[7:0], y[23:0]]
%sly = shl nuw i32 %zext.y, 8 
%yx = or i32 %zext.x, %sly        #[y[23:0], x[7:0]]

If not match zext:

%zext.x = zext i8 %x to i32 
%slx = shl nuw i32 %zext.x, 24 
%xy = or i32 %y, %slx              #[unknown, y[23:0]]             y[31:24] may not be zero.
%sly = shl nuw i32 %y, 8           #[y[23:0], 0,0,0,0,0,0,0,0]
%yx = or i32 %zext.x, %sly         #[y[23:0], x[7:0]]

Contributor

Ah, I see, although we could handle a masking 'and'. Really it's just a knownbits check though.

Member

I think we don't need to check zext. An example without zext: https://alive2.llvm.org/ce/z/MytePb
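A hedged illustration of the point (an editor's sketch, not necessarily the example behind the alive2 link): if the low part comes from a mask rather than a zext, known-bits analysis can still prove its top bits are zero:

%y.masked = and i32 %y, 16777215        ; 0x00FFFFFF, so bits [31:24] are known zero
%slx = shl nuw i32 %zext.x, 24
%xy = or i32 %slx, %y.masked            ; still a clean [x, y[23:0]] concat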

unsigned LowSize = Low->getType()->getScalarSizeInBits();
// Make sure High does not overlap with Low and most significant bits of
// High aren't shifted out.
if (ZextHighShlAmt->ult(LowSize) || ZextHighShlAmt->ugt(Width - HighSize))
Contributor

Think you are missing a negative test for this.

Contributor Author

Done
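As a worked illustration of these bounds (an editor's example, not taken from the thread): with Width = 32, High an i8 (HighSize = 8) and Low an i24 (LowSize = 24), the shift applied to zext(High) must be at least LowSize = 24 so High does not land on top of Low, and at most Width - HighSize = 24 so none of High's bits are shifted out, leaving 24 as the only accepted amount. With a narrower Low such as i16, any amount from 16 to 24 would pass this check, and the corresponding check on the other concat (quoted below) plus the requirement that the two shift amounts sum to Width pins down the exact pair.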

// Make sure Low does not overlap with High and most significant bits of
// Low aren't shifted out and we can rotate shift LowHigh to HighLow.
if (ZextLowShlAmt->ult(HighSize) || ZextLowShlAmt->ugt(Width - LowSize) ||
*ZextLowShlAmt + *ZextHighShlAmt != Width)
Contributor

Think you are missing a negative test for this (also for non-dominating).

Contributor Author

In fact, the non-dominating case is already tested.
We check for:
yx = fshl xy, xy,
When InstCombine handles the first 'or' (combining for xy), it meets every check except dominance.
If we didn't check dominance, it would be converted to xy = fshl yx, yx, which is absolutely wrong.
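A small IR sketch of why the dominance check matters (an editor's illustration, reusing the names from the fshl_concat test):

%xy = or i32 %slx, %zext.y       ; first concat, defined earlier
%sly = shl nuw i32 %zext.y, 8
%yx = or i32 %zext.x, %sly       ; second concat
; valid:   %yx can become call i32 @llvm.fshl.i32(i32 %xy, i32 %xy, i32 8), because %xy dominates %yx
; invalid: rewriting %xy as an fshl of %yx would reference %yx before it is defined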

@goldsteinn
Contributor

See most recent comments about missing tests. That being said code looks functionally correct to me.
Still not 100% sure this is a desirable change. Will defer to @nikic about that.

@nikic
Contributor

nikic commented Oct 24, 2023

See most recent comments about missing tests. That being said code looks functionally correct to me. Still not 100% sure this is a desirable change. Will defer to @nikic about that.

I'm also skeptical about accepting this optimization. Looking at the motivating case in #68502 (comment), this seems like a bad approach to the problem: It means that in order to fold the pattern to bitreverse(%xy), we must just so happen to have the right %xy lying around in the IR, even though it doesn't have any relation to the main pattern (it's not used inside it, just injected via an extra use). It sounds to me like the better way to handle that case would be to support matching a variant of plain bitreverse in the form of bitreverse(rot(%yx)).

Replacing the rot(%yx) by %xy is then an extra optimization opportunity, but it's no longer a precondition to performing the transform.
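Roughly, the IR emitted under that approach would look something like the following (an editor's sketch; the exact rotate amount and operand order depend on the matched pattern):

%rot = call i32 @llvm.fshl.i32(i32 %yx, i32 %yx, i32 16)
%rev = call i32 @llvm.bitreverse.i32(i32 %rot)
; if an %xy equal to the rotate of %yx already exists, a later fold can replace %rot with it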

@HaohaiWen
Contributor Author

See most recent comments about missing tests. That being said code looks functionally correct to me. Still not 100% sure this is a desirable change. Will defer to @nikic about that.

I'm also skeptical about accepting this optimization. Looking at the motivating case in #68502 (comment), this seems like a bad approach to the problem: It means that in order to fold the pattern to bitreverse(%xy), we must just so happen to have the right %xy lying around in the IR, even though it doesn't have any relation to the main pattern (it's not used inside it, just injected via an extra use). It sounds to me like the better way to handle that case would be to support matching a variant of plain bitreverse in the form of bitreverse(rot(%yx)).

Replacing the rot(%yx) by %xy is then an extra optimization opportunity, but it's no longer a precondition to performing the transform.

In fact, this is a real case.
Currently, matchBSwapOrBitReverse has no chance to match the bitreverse pattern if we don't do this fshl optimization. Even if we teach it to recognize this pattern, that requires injecting a new rot(%yx) and then looking for %xy, and it still requires walking the user chain.
Even if there is no bitreverse pattern in the rest of the IR, it's still beneficial to do this fshl optimization: it removes an extra shl.

@HaohaiWen
Contributor Author

Any more comments? I'd like to merge it if there are no objections.

@goldsteinn
Contributor

Any more comments? I'd like to merge it if there are no objections.

I think there are outstanding objections, or at least blocking concerns from nikic.

@HaohaiWen
Contributor Author

@nikic, any more comments?

@HaohaiWen
Contributor Author

Any more comments? I'd like to merge it if there are no objections.

I think there are outstanding objections, or at least blocking concerns from nikic.

gentle ping @nikic

@HaohaiWen
Contributor Author

I'd like to merge it if there are no more comments.

@williamweixiao
Contributor

LGTM.

@nikic
Contributor

nikic commented Nov 11, 2023

Your original example does not verify: https://alive2.llvm.org/ce/z/Bkd89Z Can you please provide a correct example of what you're trying to do?

@HaohaiWen
Contributor Author

HaohaiWen commented Nov 12, 2023

Your original example does not verify: https://alive2.llvm.org/ce/z/Bkd89Z Can you please provide a correct example of what you're trying to do?

As I said in previous comments:

This requires first optimizing

%sly = shl nuw i32 %zext.y, 8
%yx = or i32 %zext.x, %sly

to

fshl %xy, %xy, 8

and then teaching InstCombine to optimize the rest to bitreverse.

This PR only teaches InstCombine to optimize shl+or to fshl.
Then I'll add another patch to teach InstCombine (matchBSwapOrBitReverse) to convert the example you mentioned to bitreverse (it relies on this PR). The original example will be a test in that patch.

@nikic
Contributor

nikic commented Nov 12, 2023

Yes, I understand that this transform is only a step towards handling the full pattern. I'm asking for a complete, working example of the original motivating case. The snippets posted in #68502 (comment) do not appear to be correct, or I failed to assemble them correctly. Please provide complete src and tgt functions that verify with alive2.

@HaohaiWen
Contributor Author

HaohaiWen commented Nov 12, 2023

Yes, I understand that this transform is only a step towards handling the full pattern. I'm asking for a complete, working example of the original motivating case. The snippets posted in #68502 (comment) do not appear to be correct, or I failed to assemble them correctly. Please provide complete src and tgt functions that verify with alive2.

Oh... sorry, I made a wrong IR example.
The real case looks like this: https://alive2.llvm.org/ce/z/-DXnJc
Both %x and %y are i16. The first or/fshl swaps the two halves of the i32; the rest swaps bytes, 4-bit groups, and 2-bit groups.
I extended the fshl transformation to apply to both symmetric and asymmetric combinations.
If %x and %y are not the same width, the first or/fshl is not a swap of equal halves, so the whole thing can't be converted to bitreverse.
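For orientation, one of those later stages looks roughly like this in IR (an editor's generic sketch of a byte-swap-within-halves stage, not the exact IR behind the alive2 link; the 4-bit and 2-bit stages have the same shape with masks 0x0F0F0F0F/0xF0F0F0F0 and 0x33333333/0xCCCCCCCC):

%hi = lshr i32 %v, 8
%hi.m = and i32 %hi, 16711935     ; 0x00FF00FF
%lo = shl i32 %v, 8
%lo.m = and i32 %lo, -16711936    ; 0xFF00FF00
%bytes = or i32 %hi.m, %lo.m      ; adjacent bytes swapped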

@nikic
Contributor

nikic commented Nov 12, 2023

Thanks for the updated example!

To explain what I meant in first comment using this example: We would perform the transform https://alive2.llvm.org/ce/z/nllcB_, which does not depend at all on how %yx is constructed, and whether there is any way to form the fshl separately. If the %yx is appropriately constructed, the fshl can be removed (https://alive2.llvm.org/ce/z/B_KOwv, another missing transform).

Is this not a viable approach? Is there a concern here that generating both fshl and bitreverse may be non-profitable for targets without bitreverse? Or maybe supporting this makes the matching too expensive?

@nikic nikic requested a review from dtcxzyw November 12, 2023 16:51
@HaohaiWen
Contributor Author

HaohaiWen commented Nov 13, 2023

Thanks for the updated example!

To explain what I meant in first comment using this example: We would perform the transform https://alive2.llvm.org/ce/z/nllcB_, which does not depend at all on how %yx is constructed, and whether there is any way to form the fshl separately. If the %yx is appropriately constructed, the fshl can be removed (https://alive2.llvm.org/ce/z/B_KOwv, another missing transform).

Is this not a viable approach? Is there a concern here that generating both fshl and bitreverse may be non-profitable for targets without bitreverse? Or maybe supporting this makes the matching too expensive?

It's definitely a feasible solution.


Solution 1:
First optimize to bitreverse, then eliminate the redundant fshl: https://alive2.llvm.org/ce/z/g_gWf3
This requires:
a) Teach collectBitParts to not only search until an unknown opcode, but also try to use the instruction itself as the root.
b) Teach recognizeBSwapOrBitReverseIdiom to recognize the bit pattern [n/2-1, ..., 1, 0, n-1, n-2, ..., n/2], then insert bitreverse and fshl.
c) Teach InstCombine to remove the redundant fshl if the opposite concat exists. This requires scanning def-user chains.

Advantage:
1) Even if we can't eliminate the fshl, we can still optimize a bunch of IR to fshl+bitreverse. I don't know whether that is profitable for most targets.


Solution 2:
First optimize the 'or' to fshl, then optimize to bitreverse: https://alive2.llvm.org/ce/z/WbzJVo
This requires:
a) What we did in this PR. This requires scanning def-user chains.
b) Same as step a) in Solution 1.

Advantage:
1) Can optimize more opposite-concat patterns to fshl. This is beneficial for targets with a rotate instruction (e.g. rol on x86).
2) Easier to implement; does not require step b) of Solution 1.


Both solutions require scanning def-user chains; I don't think this is an issue.
Both solutions can handle my cases, and Solution 2 is easier to implement. Any concerns about this PR?
I think b) of Solution 1 can be implemented in the future if we want the advantages of both solutions. InstCombine will always first try to match fshl, then bitreverse, so with Solution 2 plus b) of Solution 1 we don't need to implement c) of Solution 1 at all.

Ref for bitreverse optimization:

Instruction *InstCombinerImpl::matchBSwapOrBitReverse(Instruction &I,

@HaohaiWen
Contributor Author

HaohaiWen commented Nov 16, 2023

Any comments?

@HaohaiWen
Contributor Author

I'd like to merge it. Please let me know if you have more concerns.

@HaohaiWen HaohaiWen merged commit 95d584c into llvm:main Nov 20, 2023
3 checks passed
@HaohaiWen HaohaiWen deleted the concat branch November 20, 2023 05:12
@goldsteinn
Contributor

For my money this was merged prematurely. There are still outstanding concerns about whether this transform is desirable, as well as an outstanding comment about the implementation itself.
I'm fairly agnostic about this code getting in, but I think it should be reverted until some degree of consensus is reached.
It's painful to have comments/a PR languish without replies, but I don't think the answer is to just push.

sr-tream pushed a commit to sr-tream/llvm-project that referenced this pull request Nov 20, 2023
[InstCombine] Convert or concat to fshl if opposite or concat exists (llvm#68502)

If there are two 'or' instructions that concatenate variables in opposite order,
and the first 'or' dominates the second one, the second 'or' can be optimized
to an fshl that rotates the first 'or'. This can eliminate a shl and expose
more optimization opportunities for bswap/bitreverse.
zahiraam pushed a commit to zahiraam/llvm-project that referenced this pull request Nov 20, 2023
[InstCombine] Convert or concat to fshl if opposite or concat exists (llvm#68502)

If there are two 'or' instructions that concatenate variables in opposite order,
and the first 'or' dominates the second one, the second 'or' can be optimized
to an fshl that rotates the first 'or'. This can eliminate a shl and expose
more optimization opportunities for bswap/bitreverse.
@HaohaiWen
Contributor Author

For my money this was merged prematurely. There are still outstanding concerns about whether this transform is desirable, as well as an outstanding comment about the implementation itself. I'm fairly agnostic about this code getting in, but I think it should be reverted until some degree of consensus is reached. It's painful to have comments/a PR languish without replies, but I don't think the answer is to just push.

Thanks for the kind reminder. I think I've replied to the concerns and waited a long time... I thought there should be no problem, so I merged it...
Hi @nikic, any comments on the reply to your concerns? #68502 (comment)
If you have concerns that this patch is bad for some cases, or if you have more suggestions, please let us know and we can improve it.
