[LoopUnroll] Consider simplified operands while retrieving TTI instruction cost #70929

skachkov-sc · 2023-11-01T12:32:25Z

Motivating example: https://godbolt.org/z/WcM6x1YPx
Here clang does not unroll the loop at -Os, even though unrolling would produce smaller and faster code. The issue is that the GEP is still costed as 1 after unrolling:

Loop Unroll: F[bar] Loop %for.body.i
  Loop Size = 5
Starting LoopUnroll profitability analysis...
 Analyzing iteration 0
Adding cost of instruction (iteration 0):   store i32 %x, ptr %arrayidx.i, align 4, !tbaa !7
Adding cost of instruction (iteration 0):   %arrayidx.i = getelementptr inbounds i32, ptr @array, i64 %indvars.iv.i
 Analyzing iteration 1
Adding cost of instruction (iteration 1):   store i32 %x, ptr %arrayidx.i, align 4, !tbaa !7
Adding cost of instruction (iteration 1):   %arrayidx.i = getelementptr inbounds i32, ptr @array, i64 %indvars.iv.i
 Analyzing iteration 2
Adding cost of instruction (iteration 2):   store i32 %x, ptr %arrayidx.i, align 4, !tbaa !7
Adding cost of instruction (iteration 2):   %arrayidx.i = getelementptr inbounds i32, ptr @array, i64 %indvars.iv.i
 Analyzing iteration 3
Adding cost of instruction (iteration 3):   store i32 %x, ptr %arrayidx.i, align 4, !tbaa !7
Adding cost of instruction (iteration 3):   %arrayidx.i = getelementptr inbounds i32, ptr @array, i64 %indvars.iv.i
  Exceeded threshold.. exiting.
  UnrolledCost: 8, MaxUnrolledLoopSize: 6
  will not try to unroll partially because -unroll-allow-partial not given

However, a more precise cost estimate is zero: after unrolling, the index is no longer the non-constant %indvars.iv.i but a known compile-time constant from {0, 1, 2, 3}, and such addressing can be folded on the given target architecture (RISC-V). My suggestion is to explicitly pass the expected operands into TargetTransformInfo::getInstructionCost using the SimplifiedValues map (e.g. for the first iteration the mapping is i64 %indvars.iv.i -> i64 0).
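
To make the arithmetic behind the log concrete, here is a minimal toy model in C++ (not LLVM code). It assumes that every counted instruction costs 1, as the log suggests, and that a GEP with a constant index folds into the store's addressing and costs 0. The names `toyCost` and `unrolledCost` are hypothetical and exist only for this sketch; the real costs come from the target's TTI implementation.

```cpp
// Toy model only: costs mirror the debug log (each counted instruction
// costs 1); they are not taken from the real RISC-V TTI tables.
enum class Kind { Store, Gep };

// Threshold value reported in the debug log.
constexpr int MaxUnrolledLoopSize = 6;

// Hypothetical cost rule: a GEP whose index is a known constant is assumed
// to fold into the addressing of the store and therefore costs nothing.
int toyCost(Kind K, bool IndexIsConstant) {
  if (K == Kind::Gep && IndexIsConstant)
    return 0;
  return 1;
}

// Sum the cost of the store + GEP pair over all unrolled iterations.
// SubstituteInductionVar models passing the simplified (constant) index
// to the cost query instead of the original induction variable.
int unrolledCost(int TripCount, bool SubstituteInductionVar) {
  int Cost = 0;
  for (int Iter = 0; Iter < TripCount; ++Iter) {
    Cost += toyCost(Kind::Store, /*IndexIsConstant=*/false);
    Cost += toyCost(Kind::Gep, SubstituteInductionVar);
  }
  return Cost;
}
```

Under these assumptions, `unrolledCost(4, false)` gives 8, which exceeds the MaxUnrolledLoopSize of 6, while `unrolledCost(4, true)` gives 4, which fits.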

…ction cost

Get a more precise instruction cost after LoopUnroll by taking into account that some operands can be simplified, e.g. the induction variable is replaced by a constant after full unrolling.

llvmbot · 2023-11-01T12:33:32Z

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-transforms

Author: Sergey Kachkov (skachkov-sc)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/70929.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (+9-1)
(added) llvm/test/Transforms/LoopUnroll/RISCV/unroll-Os.ll (+35)

diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
index 446aa497026d3fb..470bc3038669d83 100644
--- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
@@ -443,7 +443,15 @@ static std::optional<EstimatedUnrollCost> analyzeLoopUnrollCost(
 
         // First accumulate the cost of this instruction.
         if (!Cost.IsFree) {
-          UnrolledCost += TTI.getInstructionCost(I, CostKind);
+          // Consider simplified operands in instruction cost.
+          SmallVector<Value *, 4> Operands;
+          transform(I->operands(), std::back_inserter(Operands),
+                    [&](Value *Op) {
+                      if (auto Res = SimplifiedValues.lookup(Op))
+                        return Res;
+                      return Op;
+                    });
+          UnrolledCost += TTI.getInstructionCost(I, Operands, CostKind);
           LLVM_DEBUG(dbgs() << "Adding cost of instruction (iteration "
                             << Iteration << "): ");
           LLVM_DEBUG(I->dump());
diff --git a/llvm/test/Transforms/LoopUnroll/RISCV/unroll-Os.ll b/llvm/test/Transforms/LoopUnroll/RISCV/unroll-Os.ll
new file mode 100644
index 000000000000000..26de40bf1dc13e4
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/RISCV/unroll-Os.ll
@@ -0,0 +1,35 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 2
+; RUN: opt < %s -S -mtriple=riscv64 -passes=loop-unroll | FileCheck %s
+
+; Function Attrs: optsize
+define void @foo(ptr %array, i32 %x) #0 {
+; CHECK-LABEL: define void @foo
+; CHECK-SAME: (ptr [[ARRAY:%.*]], i32 [[X:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       for.body:
+; CHECK-NEXT:    store i32 [[X]], ptr [[ARRAY]], align 4
+; CHECK-NEXT:    [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 1
+; CHECK-NEXT:    store i32 [[X]], ptr [[ARRAYIDX_1]], align 4
+; CHECK-NEXT:    [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 2
+; CHECK-NEXT:    store i32 [[X]], ptr [[ARRAYIDX_2]], align 4
+; CHECK-NEXT:    [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 3
+; CHECK-NEXT:    store i32 [[X]], ptr [[ARRAYIDX_3]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %for.body
+
+for.body:
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidx = getelementptr inbounds i32, ptr %array, i64 %indvars.iv
+  store i32 %x, ptr %arrayidx, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, 4
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+
+for.cond.cleanup:
+  ret void
+}
+
+attributes #0 = { optsize }
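
The operand substitution in the LoopUnrollPass.cpp hunk can be tried outside LLVM with a small stand-alone sketch. `ToyValue`, `ToySimplifiedMap` and `remapOperands` are hypothetical stand-ins for `llvm::Value`, the `SimplifiedValues` map and the lambda in the patch. The patch relies on `DenseMap::lookup` returning a null value for a missing key; the sketch uses `find` on a `std::unordered_map` for the same effect.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for llvm::Value; only the identity of the object matters here.
struct ToyValue {
  std::string Name;
};

// Stand-in for SimplifiedValues: original value -> simplified value.
using ToySimplifiedMap =
    std::unordered_map<const ToyValue *, const ToyValue *>;

// Build the operand list to hand to the cost query: operands with an entry
// in the map are replaced by their simplified value, all other operands are
// passed through unchanged.
std::vector<const ToyValue *>
remapOperands(const std::vector<const ToyValue *> &Operands,
              const ToySimplifiedMap &Simplified) {
  std::vector<const ToyValue *> Result;
  Result.reserve(Operands.size());
  std::transform(Operands.begin(), Operands.end(),
                 std::back_inserter(Result), [&](const ToyValue *Op) {
                   auto It = Simplified.find(Op);
                   return It != Simplified.end() ? It->second : Op;
                 });
  return Result;
}
```

For the first iteration of the motivating loop, the map would hold a single entry from the induction variable to the constant 0, so only the index operand of the GEP changes and the base pointer is left alone.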

skachkov-sc · 2023-11-01T14:29:16Z

Sorry, I missed one test that is affected by this change (I didn't have the AMDGPU backend in my local build). @arsenm, could you please review this change? It looks like the cost of GEPs such as:

%arrayidx.out = getelementptr inbounds float, ptr addrspace(3) %out, i32 %indvars.iv

is now considered zero, so the loop in test_func_addrspacecast_cost_nonfree gets unrolled.

skachkov-sc · 2023-11-08T13:06:30Z

Ping

skachkov-sc · 2023-11-16T10:30:27Z

Ping

Number of completely unrolled loops on test-suite before and after this change (-Os build):

Program                                                                       loop-unroll.NumCompletelyUnrolled              
                                                                              before                            after  diff  
     test-suite :: MicroBenchmarks/LCALS/SubsetCLambdaLoops/lcalsCLambda.test   0.00                             27.00   inf%
           test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test   0.00                             27.00   inf%
      test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   0.00                              2.00   inf%
    test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   0.00                              2.00   inf%
                      test-suite :: MultiSource/Benchmarks/nbench/nbench.test   0.00                              1.00   inf%
               test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test   0.00                              1.00   inf%
           test-suite :: MicroBenchmarks/LCALS/SubsetCRawLoops/lcalsCRaw.test   0.00                             27.00   inf%
           test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test   0.00                             27.00   inf%
     test-suite :: MicroBenchmarks/LCALS/SubsetBLambdaLoops/lcalsBLambda.test   0.00                             27.00   inf%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   0.00                              2.00   inf%
     test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test   0.00                             27.00   inf%
                        test-suite :: MultiSource/Benchmarks/Olden/bh/bh.test   0.00                              3.00   inf%
                test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test   0.00                              2.00   inf%
               test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test   0.00                              2.00   inf%
                      test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   1.00                              6.00 500.0%
                 test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   1.00                              5.00 400.0%
                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test   2.00                             10.00 400.0%
        test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   1.00                              4.00 300.0%
               test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test  65.00                            252.00 287.7%
            test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   3.00                              9.00 200.0%
                  test-suite :: MultiSource/Benchmarks/Olden/power/power.test   3.00                              7.00 133.3%
              test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  16.00                             33.00 106.2%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   1.00                              2.00 100.0%
                test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test   4.00                              7.00  75.0%
        test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test   7.00                              9.00  28.6%
                    test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  15.00                             17.00  13.3%

nikic · 2023-11-16T10:44:31Z

llvm/test/Transforms/LoopUnroll/AMDGPU/unroll-cost-addrspacecast.ll

@@ -50,7 +50,7 @@ for.end:
 }

 ; CHECK-LABEL: @test_func_addrspacecast_cost_nonfree(
-; CHECK: br i1 %exitcond
+; CHECK-NOT: br i1 %exitcond


I think you should adjust the -unroll-threshold=49 option in this test to preserve the old behavior. If I understood correctly, the extra cost saving applies to all functions in this file equally, so just adjusting the threshold should work.

skachkov-sc · 2023-11-16

Unfortunately, with this change it is now impossible to find a threshold that keeps the old behaviour for all tests in this file. The savings in test_func_addrspacecast_cost_nonfree are higher (I assume that on the AMDGPU target a GEP with a constant offset is cheaper in addrspace(3)), so there is no threshold at which the first 2 tests are unrolled and the last one is not.

nikic · 2023-11-16

@arsenm Any idea how to preserve the intent of this test?

arsenm · 2023-12-01

I don't have any great ideas. Either split the test file up, or add some filler operations to raise the costs in the other functions.

skachkov-sc · 2023-12-01

I've come up with the following solution: add an extra level of indirection when computing the GEP indices so that they do not become constant after loop unrolling. This hides the changes introduced by this patch and preserves the behaviour of the original test (with a slight increase of the unroll threshold). It is done by loading the GEP index from a global array; the additional cost of this load is the same for all tests.
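
The effect of the extra indirection can be sketched with a toy cost model in C++ (again not LLVM code). It assumes that only the induction variable is known to be constant in each unrolled iteration and that a value loaded from memory is not. The names `toyGepCost`, `toyUnrolledGepCost` and `%idx.loaded` are hypothetical.

```cpp
#include <string>
#include <unordered_map>

// Toy model: value name -> constant it simplifies to in this iteration.
using ToyConstants = std::unordered_map<std::string, long>;

// Hypothetical cost rule: a GEP costs 0 when its index is a known
// constant and 1 otherwise.
int toyGepCost(const std::string &IndexName, const ToyConstants &Known) {
  return Known.count(IndexName) ? 0 : 1;
}

// Total GEP cost over a fully unrolled loop. In every iteration only the
// induction variable is registered as a known constant (an assumption of
// this sketch), so an index that comes from a load keeps its cost.
int toyUnrolledGepCost(const std::string &IndexName, int TripCount) {
  int Cost = 0;
  for (int Iter = 0; Iter < TripCount; ++Iter) {
    ToyConstants Known{{"%indvars.iv", Iter}};
    Cost += toyGepCost(IndexName, Known);
  }
  return Cost;
}
```

With a trip count of 4, indexing directly by `%indvars.iv` yields a total GEP cost of 0 in this model, while indexing by the loaded `%idx.loaded` yields 4, which is why the indirection keeps the test's cost differences intact.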

skachkov-sc added 2 commits November 1, 2023 13:35

[LoopUnroll][NFC] Add pre-commit test

1b6cdef

[LoopUnroll] Consider simplified operands while retrieving TTI instru…

f891820


skachkov-sc requested review from RKSimon and preames November 1, 2023 12:32

llvmbot added the llvm:transforms label Nov 1, 2023

llvmbot added the backend:AMDGPU label Nov 1, 2023

skachkov-sc requested a review from arsenm November 8, 2023 13:06

skachkov-sc requested a review from nikic November 16, 2023 10:30

nikic reviewed Nov 16, 2023


Fix AMDGPU test

45e8d46

skachkov-sc force-pushed the unroll-tti-cost branch from e719131 to 45e8d46 Compare December 1, 2023 09:31

arsenm approved these changes Feb 6, 2024


skachkov-sc merged commit ffd79b3 into llvm:main Feb 6, 2024
3 checks passed

sbc100 mentioned this pull request Feb 20, 2024

regression: SIMDe test failures with recent ToT emscripten-core/emscripten#21368

Closed


nikic Nov 16, 2023

skachkov-sc Nov 16, 2023

nikic Nov 16, 2023

arsenm Dec 1, 2023

skachkov-sc Dec 1, 2023
