[LoopUnroll] Consider simplified operands while retrieving TTI instruction cost #70929

Merged
merged 3 commits into from
Feb 6, 2024

skachkov-sc
Contributor

Motivating example: https://godbolt.org/z/WcM6x1YPx
Here clang does not unroll the loop at -Os, even though unrolling would produce smaller and faster code. The problem is that the cost of the GEP is estimated as 1 after unrolling:

Loop Unroll: F[bar] Loop %for.body.i
  Loop Size = 5
Starting LoopUnroll profitability analysis...
 Analyzing iteration 0
Adding cost of instruction (iteration 0):   store i32 %x, ptr %arrayidx.i, align 4, !tbaa !7
Adding cost of instruction (iteration 0):   %arrayidx.i = getelementptr inbounds i32, ptr @array, i64 %indvars.iv.i
 Analyzing iteration 1
Adding cost of instruction (iteration 1):   store i32 %x, ptr %arrayidx.i, align 4, !tbaa !7
Adding cost of instruction (iteration 1):   %arrayidx.i = getelementptr inbounds i32, ptr @array, i64 %indvars.iv.i
 Analyzing iteration 2
Adding cost of instruction (iteration 2):   store i32 %x, ptr %arrayidx.i, align 4, !tbaa !7
Adding cost of instruction (iteration 2):   %arrayidx.i = getelementptr inbounds i32, ptr @array, i64 %indvars.iv.i
 Analyzing iteration 3
Adding cost of instruction (iteration 3):   store i32 %x, ptr %arrayidx.i, align 4, !tbaa !7
Adding cost of instruction (iteration 3):   %arrayidx.i = getelementptr inbounds i32, ptr @array, i64 %indvars.iv.i
  Exceeded threshold.. exiting.
  UnrolledCost: 8, MaxUnrolledLoopSize: 6
  will not try to unroll partially because -unroll-allow-partial not given

However, a more precise cost estimate is zero: after unrolling, the index is no longer the non-constant %indvars.iv.i but a known compile-time constant from {0, 1, 2, 3}, and such addressing can be folded on the target architecture (RISC-V). My suggestion is to explicitly pass the expected operands to TargetTransformInfo::getInstructionCost using the SimplifiedValues map (e.g. for the first iteration the mapping is i64 %indvars.iv.i -> i64 0).
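
To make the arithmetic in the log concrete, here is a minimal standalone sketch of the idea. It does not use LLVM: `Value`, `Instruction`, `toyCost` and `unrolledCost` are hypothetical stand-ins, and the cost rule (a GEP with only constant operands is free, everything else costs 1) is an assumption chosen to mirror the RISC-V case described above. Unlike the real analysis, it does not stop early once the threshold is exceeded.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-ins for IR concepts; these are NOT LLVM types.
struct Value {
  std::string Name;
  bool IsConstant;
};

enum class Opcode { Store, GetElementPtr };

struct Instruction {
  Opcode Op;
  std::vector<const Value *> Operands;
};

using SimplifiedMap = std::unordered_map<const Value *, const Value *>;

// Replace each operand by its simplified value when one is known,
// otherwise keep the original operand.
std::vector<const Value *> simplifiedOperands(const Instruction &I,
                                              const SimplifiedMap &Simplified) {
  std::vector<const Value *> Result;
  Result.reserve(I.Operands.size());
  for (const Value *Op : I.Operands) {
    auto It = Simplified.find(Op);
    Result.push_back(It != Simplified.end() ? It->second : Op);
  }
  return Result;
}

// Assumed cost model: a GEP whose operands are all constants folds into
// the addressing mode (cost 0); any other instruction costs 1.
unsigned toyCost(Opcode Op, const std::vector<const Value *> &Operands) {
  if (Op != Opcode::GetElementPtr)
    return 1;
  for (const Value *V : Operands)
    if (!V->IsConstant)
      return 1;
  return 0;
}

// Sum the body cost over all iterations. When UseSimplified is set, the
// induction variable is mapped to that iteration's constant first.
unsigned unrolledCost(const std::vector<Instruction> &Body, const Value *IndVar,
                      const std::vector<Value> &IterationConstants,
                      bool UseSimplified) {
  unsigned Total = 0;
  for (const Value &C : IterationConstants) {
    SimplifiedMap Simplified;
    if (UseSimplified)
      Simplified[IndVar] = &C;
    for (const Instruction &I : Body)
      Total += toyCost(I.Op, simplifiedOperands(I, Simplified));
  }
  return Total;
}
```

With a body of one store and one GEP over four iterations, this toy model gives 8 without substitution (above the MaxUnrolledLoopSize of 6 in the log) and 4 with it.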

…ction cost

Get more precise cost of instruction after LoopUnroll considering that
some operands of it can be simplified, e.g. induction variable will be
replaced by constant after full unrolling.
@llvmbot
Collaborator

llvmbot commented Nov 1, 2023

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-transforms

Author: Sergey Kachkov (skachkov-sc)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/70929.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (+9-1)
  • (added) llvm/test/Transforms/LoopUnroll/RISCV/unroll-Os.ll (+35)
diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
index 446aa497026d3fb..470bc3038669d83 100644
--- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
@@ -443,7 +443,15 @@ static std::optional<EstimatedUnrollCost> analyzeLoopUnrollCost(
 
         // First accumulate the cost of this instruction.
         if (!Cost.IsFree) {
-          UnrolledCost += TTI.getInstructionCost(I, CostKind);
+          // Consider simplified operands in instruction cost.
+          SmallVector<Value *, 4> Operands;
+          transform(I->operands(), std::back_inserter(Operands),
+                    [&](Value *Op) {
+                      if (auto Res = SimplifiedValues.lookup(Op))
+                        return Res;
+                      return Op;
+                    });
+          UnrolledCost += TTI.getInstructionCost(I, Operands, CostKind);
           LLVM_DEBUG(dbgs() << "Adding cost of instruction (iteration "
                             << Iteration << "): ");
           LLVM_DEBUG(I->dump());
diff --git a/llvm/test/Transforms/LoopUnroll/RISCV/unroll-Os.ll b/llvm/test/Transforms/LoopUnroll/RISCV/unroll-Os.ll
new file mode 100644
index 000000000000000..26de40bf1dc13e4
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/RISCV/unroll-Os.ll
@@ -0,0 +1,35 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 2
+; RUN: opt < %s -S -mtriple=riscv64 -passes=loop-unroll | FileCheck %s
+
+; Function Attrs: optsize
+define void @foo(ptr %array, i32 %x) #0 {
+; CHECK-LABEL: define void @foo
+; CHECK-SAME: (ptr [[ARRAY:%.*]], i32 [[X:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    store i32 [[X]], ptr [[ARRAY]], align 4
+; CHECK-NEXT:    [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 1
+; CHECK-NEXT:    store i32 [[X]], ptr [[ARRAYIDX_1]], align 4
+; CHECK-NEXT:    [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 2
+; CHECK-NEXT:    store i32 [[X]], ptr [[ARRAYIDX_2]], align 4
+; CHECK-NEXT:    [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 3
+; CHECK-NEXT:    store i32 [[X]], ptr [[ARRAYIDX_3]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %for.body
+
+for.body:
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidx = getelementptr inbounds i32, ptr %array, i64 %indvars.iv
+  store i32 %x, ptr %arrayidx, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, 4
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+
+for.cond.cleanup:
+  ret void
+}
+
+attributes #0 = { optsize }

@skachkov-sc
Contributor Author

Sorry, I missed one test that is affected by this change (I didn't have the AMDGPU backend in my local build). @arsenm, could you please review this change? It looks like the cost of GEPs such as:

%arrayidx.out = getelementptr inbounds float, ptr addrspace(3) %out, i32 %indvars.iv

is now considered to be zero, so the loop in test_func_addrspacecast_cost_nonfree gets unrolled.

@skachkov-sc
Contributor Author

Ping

@skachkov-sc
Contributor Author

Ping

Number of completely unrolled loops after this change on test-suite (-Os build)
Program                                                                       loop-unroll.NumCompletelyUnrolled              
                                                                              before                            after  diff  
     test-suite :: MicroBenchmarks/LCALS/SubsetCLambdaLoops/lcalsCLambda.test   0.00                             27.00   inf%
           test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test   0.00                             27.00   inf%
      test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   0.00                              2.00   inf%
    test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   0.00                              2.00   inf%
                      test-suite :: MultiSource/Benchmarks/nbench/nbench.test   0.00                              1.00   inf%
               test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test   0.00                              1.00   inf%
           test-suite :: MicroBenchmarks/LCALS/SubsetCRawLoops/lcalsCRaw.test   0.00                             27.00   inf%
           test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test   0.00                             27.00   inf%
     test-suite :: MicroBenchmarks/LCALS/SubsetBLambdaLoops/lcalsBLambda.test   0.00                             27.00   inf%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   0.00                              2.00   inf%
     test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test   0.00                             27.00   inf%
                        test-suite :: MultiSource/Benchmarks/Olden/bh/bh.test   0.00                              3.00   inf%
                test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test   0.00                              2.00   inf%
               test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test   0.00                              2.00   inf%
                      test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   1.00                              6.00 500.0%
                 test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   1.00                              5.00 400.0%
                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test   2.00                             10.00 400.0%
        test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   1.00                              4.00 300.0%
               test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test  65.00                            252.00 287.7%
            test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   3.00                              9.00 200.0%
                  test-suite :: MultiSource/Benchmarks/Olden/power/power.test   3.00                              7.00 133.3%
              test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  16.00                             33.00 106.2%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   1.00                              2.00 100.0%
                test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test   4.00                              7.00  75.0%
        test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test   7.00                              9.00  28.6%
                    test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  15.00                             17.00  13.3%

@@ -50,7 +50,7 @@ for.end:
}

; CHECK-LABEL: @test_func_addrspacecast_cost_nonfree(
; CHECK: br i1 %exitcond
; CHECK-NOT: br i1 %exitcond
Contributor

I think you should adjust the -unroll-threshold=49 option in this test to preserve the old behavior. If I understood correctly, the extra cost saving applies to all functions in this file equally, so just adjusting the threshold should work.

Contributor Author

Unfortunately, with this change it's now impossible to find a threshold that keeps the old behaviour for all tests in this file. The savings in test_func_addrspacecast_cost_nonfree are higher (I assume that for the AMDGPU target a GEP with a constant offset is cheaper in addrspace(3)), so there is no threshold at which the first 2 tests are unrolled and the last one is not.
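
The argument can be phrased as a small check. This is only a sketch: `separatingThreshold` is a hypothetical helper, and it assumes a loop is unrolled exactly when its estimated cost does not exceed the threshold, which simplifies the real pass logic. A single threshold can separate the two groups only if the most expensive loop that must unroll is still cheaper than the cheapest loop that must not.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Returns a threshold T with cost <= T for every loop that must unroll and
// cost > T for every loop that must stay rolled, or nullopt if none exists.
// Assumption: "unrolled" means "estimated cost <= threshold".
std::optional<unsigned>
separatingThreshold(const std::vector<unsigned> &MustUnroll,
                    const std::vector<unsigned> &MustNotUnroll) {
  if (MustUnroll.empty() || MustNotUnroll.empty())
    return std::nullopt; // nothing to separate in this sketch
  unsigned MaxUnroll = *std::max_element(MustUnroll.begin(), MustUnroll.end());
  unsigned MinKeep =
      *std::min_element(MustNotUnroll.begin(), MustNotUnroll.end());
  if (MaxUnroll < MinKeep)
    return MaxUnroll; // smallest threshold that still unrolls the first group
  return std::nullopt;
}
```

For example, with made-up costs {40, 45} for the loops that should unroll, a third loop costing 60 can be kept rolled with a threshold of 45, but one costing 42 cannot. These numbers are illustrative only and are not taken from the AMDGPU test.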

Contributor

@arsenm Any idea how to preserve the intent of this test?

Contributor

I don't have any great ideas. Either split the test file up, or add some filler operations to raise the costs in the other functions.

Contributor Author

I've come up with the following solution: add an extra level of indirection when computing the GEP indices, so that they do not become constants after loop unrolling. This hides the changes introduced by this patch and preserves the behaviour of the original test (with a slight increase of the unroll threshold). It is done by loading the GEP index from a global array (the additional cost of this load is the same for all tests).
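
A source-level analogue of that workaround, as a rough sketch only: the actual test is written in LLVM IR, and `IndexTable`, `storeDirect` and `storeIndirect` are hypothetical names. In the direct form the store address depends only on the loop counter, so it becomes a constant offset once the loop is fully unrolled. In the indirect form the index is a value loaded at run time, so the address computation stays non-constant.

```cpp
#include <cstddef>

constexpr std::size_t NumElements = 4;

// Hypothetical index table. It is deliberately a non-const global so the
// loaded index is not a compile-time constant from the loop's point of view.
std::size_t IndexTable[NumElements] = {0, 1, 2, 3};

// Direct form: after full unrolling the indices are the constants 0..3.
void storeDirect(float *Out, float X) {
  for (std::size_t I = 0; I < NumElements; ++I)
    Out[I] = X;
}

// Indirect form: the index used for the store comes from a load, so it
// stays unknown to the cost analysis even after unrolling.
void storeIndirect(float *Out, float X) {
  for (std::size_t I = 0; I < NumElements; ++I)
    Out[IndexTable[I]] = X;
}
```

Both functions write the same elements here; only the way the store address is computed differs.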

@skachkov-sc skachkov-sc merged commit ffd79b3 into llvm:main Feb 6, 2024
3 checks passed