Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LoopIdiomRecognize: detect and convert powi idiom #72650

Closed
wants to merge 2 commits into from

Conversation

artagnon
Copy link
Contributor

@artagnon artagnon commented Nov 17, 2023

The following code, when compiled under -ffast-math, produces bad codegen due to LoopVectorize:

  float powi(float base, int exp) {
    float result = 1.0;
      for (int i = 0; i < exp; ++i)
        result *= base;
    return result;
  }

It can easily be replaced with the llvm.powi intrinsic, when the exponent is a C int type. This is the job of LoopIdiomRecognize, and has been marked as a TODO item for years. Fulfill this wish, and replace computations of this form with the llvm.powi intrinsic.

-- 8< --
Based on #72648. Please review only the second patch.

@llvmbot
Copy link
Collaborator

llvmbot commented Nov 17, 2023

@llvm/pr-subscribers-llvm-transforms

Author: Ramkumar Ramachandra (artagnon)

Changes

The following code, when compiled under -ffast-math, produces bad codegen due to LoopVectorize:

  float powi(float base, int exp) {
    float result = 1.0;
      for (int i = 0; i &lt; exp; ++i)
        result *= base;
    return result;
  }

It can easily be replaced with the llvm.powi intrinsic, when the exponent is, at most, a C int type. This is the job of LoopIdiomRecognize, and has been marked as a TODO item for years. Fulfill this wish, and replace computations of this form with the llvm.powi intrinsic.

-- 8< --
Based on #72648. Please review only the second patch.


Patch is 24.84 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/72650.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp (+137-29)
  • (added) llvm/test/Transforms/LoopIdiom/powi.ll (+339)
diff --git a/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp b/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
index 4093a5a51a4d79e..3c9e99ebe6f3cb9 100644
--- a/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
@@ -22,8 +22,6 @@
 //
 // Future loop memory idioms to recognize:
 //   memcmp, strlen, etc.
-// Future floating point idioms to recognize in -ffast-math mode:
-//   fpowi
 //
 // This could recognize common matrix multiplies and dot product idioms and
 // replace them with calls to BLAS (if linked in??).
@@ -94,6 +92,7 @@
 #include <vector>
 
 using namespace llvm;
+using namespace llvm::PatternMatch;
 
 #define DEBUG_TYPE "loop-idiom"
 
@@ -206,6 +205,7 @@ class LoopIdiomRecognize {
       const SCEV *BECount);
   bool processLoopMemCpy(MemCpyInst *MCI, const SCEV *BECount);
   bool processLoopMemSet(MemSetInst *MSI, const SCEV *BECount);
+  bool processLoopPowi(BasicBlock *BB, const SCEV *BECount);
 
   bool processLoopStridedStore(Value *DestPtr, const SCEV *StoreSizeSCEV,
                                MaybeAlign StoreAlignment, Value *StoredVal,
@@ -298,13 +298,8 @@ bool LoopIdiomRecognize::runOnLoop(Loop *L) {
   ApplyCodeSizeHeuristics =
       L->getHeader()->getParent()->hasOptSize() && UseLIRCodeSizeHeurs;
 
-  HasMemset = TLI->has(LibFunc_memset);
-  HasMemsetPattern = TLI->has(LibFunc_memset_pattern16);
-  HasMemcpy = TLI->has(LibFunc_memcpy);
-
-  if (HasMemset || HasMemsetPattern || HasMemcpy)
-    if (SE->hasLoopInvariantBackedgeTakenCount(L))
-      return runOnCountableLoop();
+  if (SE->hasLoopInvariantBackedgeTakenCount(L))
+    return runOnCountableLoop();
 
   return runOnNoncountableLoop();
 }
@@ -549,33 +544,43 @@ bool LoopIdiomRecognize::runOnLoopBlock(
     BasicBlock *BB, const SCEV *BECount,
     SmallVectorImpl<BasicBlock *> &ExitBlocks) {
   // We can only promote stores in this block if they are unconditionally
-  // executed in the loop.  For a block to be unconditionally executed, it has
-  // to dominate all the exit blocks of the loop.  Verify this now.
+  // executed in the loop. The powi idiom also requires the block to be
+  // unconditionally executed. For a block to be unconditionally executed, it
+  // has to dominate all the exit blocks of the loop.
   for (BasicBlock *ExitBlock : ExitBlocks)
     if (!DT->dominates(BB, ExitBlock))
       return false;
 
   bool MadeChange = false;
-  // Look for store instructions, which may be optimized to memset/memcpy.
-  collectStores(BB);
 
-  // Look for a single store or sets of stores with a common base, which can be
-  // optimized into a memset (memset_pattern).  The latter most commonly happens
-  // with structs and handunrolled loops.
-  for (auto &SL : StoreRefsForMemset)
-    MadeChange |= processLoopStores(SL.second, BECount, ForMemset::Yes);
+  HasMemset = TLI->has(LibFunc_memset);
+  HasMemsetPattern = TLI->has(LibFunc_memset_pattern16);
+  HasMemcpy = TLI->has(LibFunc_memcpy);
+
+  if (HasMemset || HasMemsetPattern || HasMemcpy) {
+    // Look for store instructions, which may be optimized to memset/memcpy.
+    collectStores(BB);
+
+    // Look for a single store or sets of stores with a common base, which can
+    // be optimized into a memset (memset_pattern).  The latter most commonly
+    // happens with structs and handunrolled loops.
+    for (auto &SL : StoreRefsForMemset)
+      MadeChange |= processLoopStores(SL.second, BECount, ForMemset::Yes);
 
-  for (auto &SL : StoreRefsForMemsetPattern)
-    MadeChange |= processLoopStores(SL.second, BECount, ForMemset::No);
+    for (auto &SL : StoreRefsForMemsetPattern)
+      MadeChange |= processLoopStores(SL.second, BECount, ForMemset::No);
 
-  // Optimize the store into a memcpy, if it feeds an similarly strided load.
-  for (auto &SI : StoreRefsForMemcpy)
-    MadeChange |= processLoopStoreOfLoopLoad(SI, BECount);
+    // Optimize the store into a memcpy, if it feeds an similarly strided load.
+    for (auto &SI : StoreRefsForMemcpy)
+      MadeChange |= processLoopStoreOfLoopLoad(SI, BECount);
 
-  MadeChange |= processLoopMemIntrinsic<MemCpyInst>(
-      BB, &LoopIdiomRecognize::processLoopMemCpy, BECount);
-  MadeChange |= processLoopMemIntrinsic<MemSetInst>(
-      BB, &LoopIdiomRecognize::processLoopMemSet, BECount);
+    MadeChange |= processLoopMemIntrinsic<MemCpyInst>(
+        BB, &LoopIdiomRecognize::processLoopMemCpy, BECount);
+    MadeChange |= processLoopMemIntrinsic<MemSetInst>(
+        BB, &LoopIdiomRecognize::processLoopMemSet, BECount);
+  }
+
+  MadeChange |= processLoopPowi(BB, BECount);
 
   return MadeChange;
 }
@@ -925,6 +930,111 @@ bool LoopIdiomRecognize::processLoopMemSet(MemSetInst *MSI,
                                  BECount, IsNegStride, /*IsLoopMemset=*/true);
 }
 
+static CallInst *createPowiIntrinsic(IRBuilder<> &IRBuilder, Value *Base,
+                                     Value *Exp, const DebugLoc &DL) {
+  Value *Ops[] = {Base, Exp};
+  Type *Tys[] = {Base->getType(), Exp->getType()};
+
+  Module *M = IRBuilder.GetInsertBlock()->getParent()->getParent();
+  Function *Func = Intrinsic::getDeclaration(M, Intrinsic::powi, Tys);
+  CallInst *CI = IRBuilder.CreateCall(Func, Ops);
+  CI->setDebugLoc(DL);
+  return CI;
+}
+
+// Checks that the Phi is an fmul fast with a loop-invariant operand, and
+// returns the the fmul instruction.
+static Instruction *detectPowiIdiom(PHINode *Phi, BasicBlock *PH,
+                                    BasicBlock *Latch, Loop *CurLoop) {
+  LLVM_DEBUG(dbgs() << DEBUG_TYPE " Performing powi idiom detection\n");
+
+  // The phi must have two incoming values (one from the preheader, and another
+  // from the latch), it must have one use (which we will subsequently check is
+  // an fmul fast instruction), and it must be a floating-point type.
+  if (Phi->getNumIncomingValues() != 2 || !Phi->hasOneUse() ||
+      Phi->getBasicBlockIndex(PH) < 0 || Phi->getBasicBlockIndex(Latch) < 0 ||
+      !Phi->getType()->isFloatingPointTy()) {
+    LLVM_DEBUG(dbgs() << DEBUG_TYPE " Unable to operate on this PHI node\n");
+    return nullptr;
+  }
+
+  // Further, check that the incoming value from the preheader is 1.0.
+  auto *ConstFP = dyn_cast<ConstantFP>(Phi->getIncomingValueForBlock(PH));
+  if (!ConstFP || !ConstFP->isExactlyValue(1.0)) {
+    LLVM_DEBUG(dbgs() << DEBUG_TYPE " Initial value comparison failed\n");
+    return nullptr;
+  }
+
+  auto *I = cast<Instruction>(Phi->use_begin()->getUser());
+  Value *Op1, *Op2;
+  if (!match(I, m_FMul(m_Value(Op1), m_Value(Op2))) || !I->isFast()) {
+    LLVM_DEBUG(dbgs() << DEBUG_TYPE " fmul-fast test failed\n");
+    return nullptr;
+  }
+  Value *Base = Op1 == Phi ? Op2 : Op1;
+  if (CurLoop->isLoopInvariant(Base))
+    return I;
+  else
+    LLVM_DEBUG(dbgs() << DEBUG_TYPE " Base is not loop-invariant\n");
+  return nullptr;
+}
+
+/// Detect the powi idiom, and convert it to an intrinsic.
+bool LoopIdiomRecognize::processLoopPowi(BasicBlock *BB, const SCEV *BECount) {
+  // We only process loops where the IV is, at most, a C int type, since this is
+  // used for the exponent of the powi.
+  PHINode *IV = CurLoop->getInductionVariable(*SE);
+  if (!IV || IV->getType()->getScalarSizeInBits() > 32)
+    return false;
+
+  // If the loop doesn't have a valid preheader and latch, give up now.
+  BasicBlock *PH = CurLoop->getLoopPreheader();
+  BasicBlock *Latch = CurLoop->getLoopLatch();
+  if (!PH || !Latch)
+    return false;
+
+  // Collect all phis that are not the induction phi.
+  SmallVector<PHINode *, 1> Phis;
+  for (Instruction &I : *BB) {
+    if (auto *Phi = dyn_cast<PHINode>(&I))
+      if (Phi != IV)
+        Phis.push_back(Phi);
+  }
+
+  // Find the Phi corresponding to the powi idiom, amongst all phis except the
+  // induction phi.
+  for (PHINode *Phi : Phis) {
+    if (Instruction *FMul = detectPowiIdiom(Phi, PH, Latch, CurLoop)) {
+      // Find the trip count, and expand the SCEV to find the exponent of the
+      // powi.
+      IRBuilder<> Builder(PH->getTerminator());
+      SCEVExpander Expander(*SE, *DL, "loop-idiom");
+      SCEVExpanderCleaner ExpCleaner(Expander);
+      Type *ExpTy = IV->getType();
+      const SCEV *TripCount =
+          SE->getTripCountFromExitCount(BECount, ExpTy, CurLoop);
+      if (!Expander.isSafeToExpand(TripCount)) {
+        LLVM_DEBUG(dbgs() << DEBUG_TYPE " Trip count not safe to expand\n");
+        return false;
+      }
+      Value *Exp =
+          Expander.expandCodeFor(TripCount, ExpTy, PH->getTerminator());
+
+      // Insert the powi intrinsic, and replace its uses outside the block.
+      const DebugLoc &Loc = FMul->getDebugLoc();
+      Value *Base = isa<PHINode>(FMul->getOperand(1)) ? FMul->getOperand(2)
+                                                      : FMul->getOperand(1);
+      CallInst *Powi = createPowiIntrinsic(Builder, Base, Exp, Loc);
+      FMul->replaceUsesOutsideBlock(Powi, BB);
+      ExpCleaner.markResultUsed();
+      LLVM_DEBUG(dbgs() << DEBUG_TYPE " powi idiom optimized!\n");
+      return true;
+    }
+  }
+  LLVM_DEBUG(dbgs() << DEBUG_TYPE " powi idiom detection failed\n");
+  return false;
+}
+
 /// mayLoopAccessLocation - Return true if the specified loop might access the
 /// specified pointer location, which is a loop-strided access.  The 'Access'
 /// argument specifies what the verboten forms of access are (read or write).
@@ -2216,8 +2326,6 @@ static bool detectShiftUntilBitTestIdiom(Loop *CurLoop, Value *&BaseX,
   BasicBlock *LoopPreheaderBB = CurLoop->getLoopPreheader();
   assert(LoopPreheaderBB && "There is always a loop preheader.");
 
-  using namespace PatternMatch;
-
   // Step 1: Check if the loop backedge is in desirable form.
 
   ICmpInst::Predicate Pred;
diff --git a/llvm/test/Transforms/LoopIdiom/powi.ll b/llvm/test/Transforms/LoopIdiom/powi.ll
new file mode 100644
index 000000000000000..5136d5376414c51
--- /dev/null
+++ b/llvm/test/Transforms/LoopIdiom/powi.ll
@@ -0,0 +1,339 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -passes='loop(loop-idiom,loop-deletion,indvars),function(gvn,simplifycfg)' < %s -S | FileCheck %s
+
+define float @powi_f32(float %base, i32 %exp) {
+; CHECK-LABEL: define float @powi_f32(
+; CHECK-SAME: float [[BASE:%.*]], i32 [[EXP:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP_NOT:%.*]] = icmp eq i32 [[EXP]], 0
+; CHECK-NEXT:    [[TMP0:%.*]] = call float @llvm.powi.f32.i32(float [[BASE]], i32 [[EXP]])
+; CHECK-NEXT:    [[SPEC_SELECT:%.*]] = select i1 [[CMP_NOT]], float 1.000000e+00, float [[TMP0]]
+; CHECK-NEXT:    ret float [[SPEC_SELECT]]
+;
+entry:
+  %cmp.not = icmp eq i32 %exp, 0
+  br i1 %cmp.not, label %while.end, label %while.body
+
+while.body:                                       ; preds = %entry, %while.body
+  %result = phi float [ %mul, %while.body ], [ 1.000000e+00, %entry ]
+  %merge.dec = phi i32 [ %dec, %while.body ], [ %exp, %entry ]
+  %mul = fmul fast float %result, %base
+  %dec = add nsw i32 %merge.dec, -1
+  %cmp.eq = icmp eq i32 %dec, 0
+  br i1 %cmp.eq, label %while.end, label %while.body
+
+while.end:                                        ; preds = %while.body, %entry
+  %result.lcssa = phi float [ 1.000000e+00, %entry ], [ %mul, %while.body ]
+  ret float %result.lcssa
+}
+
+define double @powi_f64(double %base, i32 %exp) {
+; CHECK-LABEL: define double @powi_f64(
+; CHECK-SAME: double [[BASE:%.*]], i32 [[EXP:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP_NOT:%.*]] = icmp eq i32 [[EXP]], 0
+; CHECK-NEXT:    [[TMP0:%.*]] = call double @llvm.powi.f64.i32(double [[BASE]], i32 [[EXP]])
+; CHECK-NEXT:    [[SPEC_SELECT:%.*]] = select i1 [[CMP_NOT]], double 1.000000e+00, double [[TMP0]]
+; CHECK-NEXT:    ret double [[SPEC_SELECT]]
+;
+entry:
+  %cmp.not = icmp eq i32 %exp, 0
+  br i1 %cmp.not, label %while.end, label %while.body
+
+while.body:                                       ; preds = %entry, %while.body
+  %result = phi double [ %mul, %while.body ], [ 1.000000e+00, %entry ]
+  %merge.dec = phi i32 [ %dec, %while.body ], [ %exp, %entry ]
+  %mul = fmul fast double %result, %base
+  %dec = add nsw i32 %merge.dec, -1
+  %cmp.eq = icmp eq i32 %dec, 0
+  br i1 %cmp.eq, label %while.end, label %while.body
+
+while.end:                                        ; preds = %while.body, %entry
+  %result.lcssa = phi double [ 1.000000e+00, %entry ], [ %mul, %while.body ]
+  ret double %result.lcssa
+}
+
+define double @powi_i16_iv(double %base, i16 %exp) {
+; CHECK-LABEL: define double @powi_i16_iv(
+; CHECK-SAME: double [[BASE:%.*]], i16 [[EXP:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP_NOT:%.*]] = icmp eq i16 [[EXP]], 0
+; CHECK-NEXT:    [[TMP0:%.*]] = call double @llvm.powi.f64.i16(double [[BASE]], i16 [[EXP]])
+; CHECK-NEXT:    [[SPEC_SELECT:%.*]] = select i1 [[CMP_NOT]], double 1.000000e+00, double [[TMP0]]
+; CHECK-NEXT:    ret double [[SPEC_SELECT]]
+;
+entry:
+  %cmp.not = icmp eq i16 %exp, 0
+  br i1 %cmp.not, label %while.end, label %while.body
+
+while.body:                                       ; preds = %entry, %while.body
+  %result = phi double [ %mul, %while.body ], [ 1.000000e+00, %entry ]
+  %merge.dec = phi i16 [ %dec, %while.body ], [ %exp, %entry ]
+  %mul = fmul fast double %result, %base
+  %dec = add nsw i16 %merge.dec, -1
+  %cmp.eq = icmp eq i16 %dec, 0
+  br i1 %cmp.eq, label %while.end, label %while.body
+
+while.end:                                        ; preds = %while.body, %entry
+  %result.lcssa = phi double [ 1.000000e+00, %entry ], [ %mul, %while.body ]
+  ret double %result.lcssa
+}
+
+define float @powi_canonical_iv_signed(float %base, i32 %exp) {
+; CHECK-LABEL: define float @powi_canonical_iv_signed(
+; CHECK-SAME: float [[BASE:%.*]], i32 [[EXP:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP_SGT:%.*]] = icmp sgt i32 [[EXP]], 0
+; CHECK-NEXT:    [[TMP0:%.*]] = call float @llvm.powi.f32.i32(float [[BASE]], i32 [[EXP]])
+; CHECK-NEXT:    [[SPEC_SELECT:%.*]] = select i1 [[CMP_SGT]], float [[TMP0]], float 1.000000e+00
+; CHECK-NEXT:    ret float [[SPEC_SELECT]]
+;
+entry:
+  %cmp.sgt = icmp sgt i32 %exp, 0
+  br i1 %cmp.sgt, label %for.body, label %exit
+
+for.body:                                         ; preds = %entry, %for.body
+  %iv = phi i32 [ %inc, %for.body ], [ 0, %entry ]
+  %result = phi float [ %mul, %for.body ], [ 1.000000e+00, %entry ]
+  %mul = fmul fast float %result, %base
+  %inc = add nuw nsw i32 %iv, 1
+  %exitcond = icmp eq i32 %inc, %exp
+  br i1 %exitcond, label %exit, label %for.body
+
+exit:                                             ; preds = %for.body, %entry
+  %result.lcssa = phi float [ 1.000000e+00, %entry ], [ %mul, %for.body ]
+  ret float %result.lcssa
+}
+
+define float @powi_canonical_iv_unsigned(float %base, i32 %exp) {
+; CHECK-LABEL: define float @powi_canonical_iv_unsigned(
+; CHECK-SAME: float [[BASE:%.*]], i32 [[EXP:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP_EQ:%.*]] = icmp eq i32 [[EXP]], 0
+; CHECK-NEXT:    [[TMP0:%.*]] = call float @llvm.powi.f32.i32(float [[BASE]], i32 [[EXP]])
+; CHECK-NEXT:    [[SPEC_SELECT:%.*]] = select i1 [[CMP_EQ]], float 1.000000e+00, float [[TMP0]]
+; CHECK-NEXT:    ret float [[SPEC_SELECT]]
+;
+entry:
+  %cmp.eq = icmp eq i32 %exp, 0
+  br i1 %cmp.eq, label %exit, label %for.body
+
+for.body:                                         ; preds = %entry, %for.body
+  %iv = phi i32 [ %inc, %for.body ], [ 0, %entry ]
+  %result = phi float [ %mul, %for.body ], [ 1.000000e+00, %entry ]
+  %mul = fmul fast float %result, %base
+  %inc = add nuw nsw i32 %iv, 1
+  %exitcond = icmp eq i32 %inc, %exp
+  br i1 %exitcond, label %exit, label %for.body
+
+exit:                                             ; preds = %for.body, %entry
+  %result.lcssa = phi float [ 1.000000e+00, %entry ], [ %mul, %for.body ]
+  ret float %result.lcssa
+}
+
+define float @powi_const_i32_exp(float %base) {
+; CHECK-LABEL: define float @powi_const_i32_exp(
+; CHECK-SAME: float [[BASE:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = call float @llvm.powi.f32.i32(float [[BASE]], i32 2147483647)
+; CHECK-NEXT:    ret float [[TMP0]]
+;
+entry:
+  br label %for.body
+
+for.body:                                         ; preds = %entry, %for.body
+  %iv = phi i32 [ 0, %entry ], [ %inc, %for.body ]
+  %result = phi float [ 1.000000e+00, %entry ], [ %mul, %for.body ]
+  %mul = fmul fast float %result, %base
+  %inc = add nuw nsw i32 %iv, 1
+  %cmp = icmp eq i32 %inc, 2147483647
+  br i1 %cmp, label %exit, label %for.body
+
+exit:                                             ; preds = %for.body
+  ret float %mul
+}
+
+define float @powi_unrelated_computation(float %base, i32 %exp) {
+; CHECK-LABEL: define float @powi_unrelated_computation(
+; CHECK-SAME: float [[BASE:%.*]], i32 [[EXP:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP:%.*]] = icmp sgt i32 [[EXP]], 0
+; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY_PREHEADER:%.*]], label [[EXIT:%.*]]
+; CHECK:       for.body.preheader:
+; CHECK-NEXT:    [[TMP0:%.*]] = call float @llvm.powi.f32.i32(float [[BASE]], i32 [[EXP]])
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
+; CHECK-NEXT:    [[UNRELATED:%.*]] = phi i32 [ [[UNRELATED_XOR:%.*]], [[FOR_BODY]] ], [ 5, [[FOR_BODY_PREHEADER]] ]
+; CHECK-NEXT:    [[UNRELATED_XOR]] = xor i32 [[IV]], [[UNRELATED]]
+; CHECK-NEXT:    [[INC]] = add nuw nsw i32 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[EXP]]
+; CHECK-NEXT:    br i1 [[EXITCOND]], label [[FOR_CLEANUP:%.*]], label [[FOR_BODY]]
+; CHECK:       for.cleanup:
+; CHECK-NEXT:    [[TMP1:%.*]] = sitofp i32 [[UNRELATED_XOR]] to float
+; CHECK-NEXT:    [[TMP2:%.*]] = fadd fast float [[TMP0]], [[TMP1]]
+; CHECK-NEXT:    br label [[EXIT]]
+; CHECK:       exit:
+; CHECK-NEXT:    [[ADD:%.*]] = phi float [ 6.000000e+00, [[ENTRY:%.*]] ], [ [[TMP2]], [[FOR_CLEANUP]] ]
+; CHECK-NEXT:    ret float [[ADD]]
+;
+entry:
+  %cmp = icmp sgt i32 %exp, 0
+  br i1 %cmp, label %for.body, label %exit
+
+for.body:                                         ; preds = %entry, %for.body
+  %iv = phi i32 [ %inc, %for.body ], [ 0, %entry ]
+  %unrelated = phi i32 [ %unrelated.xor, %for.body ], [ 5, %entry ]
+  %result = phi float [ %mul, %for.body ], [ 1.000000e+00, %entry ]
+  %mul = fmul fast float %result, %base
+  %unrelated.xor = xor i32 %iv, %unrelated
+  %inc = add nuw nsw i32 %iv, 1
+  %exitcond = icmp eq i32 %inc, %exp
+  br i1 %exitcond, label %for.cleanup, label %for.body
+
+for.cleanup:                                     ; preds = %for.body
+  %0 = sitofp i32 %unrelated.xor to float
+  %1 = fadd fast float %mul, %0
+  br label %exit
+
+exit:                                           ; preds = %for.cleanup, %entry
+  %add = phi float [ 6.000000e+00, %entry ], [ %1, %for.cleanup ]
+  ret float %add
+}
+
+; Negative tests
+
+; The powi idiom is only legal for a base of floating-point type
+define i32 @powi_i32_base(i32 %base, i32 %exp) {
+; CHECK-LABEL: define i32 @powi_i32_base(
+; CHECK-SAME: i32 [[BASE:%.*]], i32 [[EXP:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP_NOT:%.*]] = icmp eq i32 [[EXP]], 0
+; CHECK-NEXT:    br i1 [[CMP_NOT]], label [[WHILE_END:%.*]], label [[WHILE_BODY:%.*]]
+; CHECK:       while.body:
+; CHECK-NEXT:    [[RESULT:%.*]] = phi i32 [ [[MUL:%.*]], [[WHILE_BODY]] ], [ 1, [[ENTRY:%.*]] ]
+; CHECK-NEXT:    [[MERGE_DEC:%.*]] = phi i32 [ [[DEC:%.*]], [[WHILE_BODY]] ], [ [[EXP]], [[ENTRY]] ]
+; CHECK-NEXT:    [[MUL]] = mul nsw i32 [[RESULT]], [[BASE]]
+; CHECK-NEXT:    [[DEC]] = add nsw i32 [[MERGE_DEC]], -1
+; CHECK-NEXT:    [[CMP_EQ:%.*]] = icmp eq i32 [[DEC]], 0
+; CHECK-NEXT:    br i1 [[CMP_EQ]], label [[WHILE_END]], label [[WHILE_BODY]]
+; CHECK:       while.end:
+; CHECK-NEXT:    [[RESULT_LCSSA:%.*]] = phi i32 [ 1, [[ENTRY]] ], [ [[MUL]], [[WHILE_BODY]] ]
+; CHECK-NEXT:    ret i32 [[RESULT_LCSSA]]
+;
+entry:
+  %cmp.not = icmp eq i32 %exp, 0
+  br i1 %cmp.not, label %while.end, label %while.body
+
+while.body:                                       ; preds = %entry, %while.body
+  %result = phi i32 [ %mul, %while.body ], [ 1, %entry ]
+  %merge.dec = phi i32 [ %dec, %while.body ], [ %exp, %entry ]
+  %mul = mul nsw i32 %result, %base
+  %dec = add nsw i32 %merge.dec, -1
+  %cmp.eq = icmp eq i32 %dec, 0
+  br i1 %cmp.e...
[truncated]

Copy link

@joe-img joe-img left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is out of the scope of this change, but all the loops in the unit tests get reduced to a select and the icmp 0. Would it better off eliminating the check all together? Seeing as we're calling the intrinsic and I think it can handle the 0 exponent.

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp Outdated Show resolved Hide resolved
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp Outdated Show resolved Hide resolved
@artagnon
Copy link
Contributor Author

I don't know if this is out of the scope of this change, but all the loops in the unit tests get reduced to a select and the icmp 0. Would it better off eliminating the check all together? Seeing as we're calling the intrinsic and I think it can handle the 0 exponent.

I'd say it's out-of-scope of this change, since there may be unrelated computations in the loop (see test), and it's only loop-deletion that can detect that the loop is dead, and remove it entirely.

@jcranmer-intel
Copy link
Contributor

I'm skeptical of this change for quite a few reasons:

  • powi corresponds to a GCC builtin that is only defined for C int (which is i32 for most targets).
  • powi isn't always lowered by targets correctly for non-i32 parameters. Several targets fail to compile llvm.powi.f32.i16, for example.
  • Unlike memcpy and memset, which tend to be important enough internal intrinsics that you have to handle them correctly even in freestanding modes, powi is... not that. Without having a way to forcibly disable the optimizer from trying to generate powi, I don't think this patch can even be contemplated.
  • Checking for fast rather than appropriate individual fast-math flags is weird.
  • This transform is only beneficial if the result of the multiply is only used outside of the loop. replaceUsesOutsideBlock doesn't guarantee that.
  • I'm somewhat skeptical that this is going to be a useful optimization in practice. I don't see the kind of loops where this transformation would be beneficial just coming up all that frequently, even as something that might be the result of optimization.

@artagnon
Copy link
Contributor Author

  • I'm somewhat skeptical that this is going to be a useful optimization in practice. I don't see the kind of loops where this transformation would be beneficial just coming up all that frequently, even as something that might be the result of optimization.

Addressed all other review comments. I agree that this is quite a minor optimization, and that this idiom doesn't crop up frequently; nevertheless, it is a valid optimization, and benefits some programs.

The following code, when compiled under -ffast-math, produces bad
codegen due to LoopVectorize:

  float powi(float base, int exp) {
    float result = 1.0;
      for (int i = 0; i < exp; ++i)
        result *= base;
    return result;
  }

It can easily be replaced with the llvm.powi intrinsic, when the
exponent is a C int type. This is the job of LoopIdiomRecognize, and has
been marked as a TODO item for years. In preparation to fulfill this
wish, add negative tests corresponding to variations of this program.
The following code, when compiled under -ffast-math, produces bad
codegen due to LoopVectorize:

  float powi(float base, int exp) {
    float result = 1.0;
      for (int i = 0; i < exp; ++i)
        result *= base;
    return result;
  }

It can easily be replaced with the llvm.powi intrinsic, when the
exponent is a C int type. This is the job of LoopIdiomRecognize, and has
been marked as a TODO item for years. Fulfill this wish, and replace
computations of this form with the llvm.powi intrinsic.
@artagnon
Copy link
Contributor Author

artagnon commented Dec 1, 2023

Gentle ping. I think this patch would make a nice addition to LIR, even if there aren't too many idioms of this type in the wild.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants