-
Notifications
You must be signed in to change notification settings - Fork 10.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LoopUnroll] Consider simplified operands while retrieving TTI instruction cost #70929
Conversation
…ction cost Get more precise cost of instruction after LoopUnroll considering that some operands of it can be simplified, e.g. induction variable will be replaced by constant after full unrolling.
@llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-llvm-transforms Author: Sergey Kachkov (skachkov-sc) ChangesMotivating example: https://godbolt.org/z/WcM6x1YPx
However, the more precise cost estimation is zero, because after unrolling we will not have non-constant index %indvars.iv.i, but some known compile-time constant: {0, 1, 2, 3}, and such addressing can be folded in given target architecture (RISC-V). My suggestion is to explicitly pass expected operands into TargetTransformInfo::getInstructionCost using SimplifiedValues map (e.g. for first iteration the mapping is i64 %indvars.iv.i -> i64 0). Full diff: https://github.com/llvm/llvm-project/pull/70929.diff 2 Files Affected:
diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
index 446aa497026d3fb..470bc3038669d83 100644
--- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
@@ -443,7 +443,15 @@ static std::optional<EstimatedUnrollCost> analyzeLoopUnrollCost(
// First accumulate the cost of this instruction.
if (!Cost.IsFree) {
- UnrolledCost += TTI.getInstructionCost(I, CostKind);
+ // Consider simplified operands in instruction cost.
+ SmallVector<Value *, 4> Operands;
+ transform(I->operands(), std::back_inserter(Operands),
+ [&](Value *Op) {
+ if (auto Res = SimplifiedValues.lookup(Op))
+ return Res;
+ return Op;
+ });
+ UnrolledCost += TTI.getInstructionCost(I, Operands, CostKind);
LLVM_DEBUG(dbgs() << "Adding cost of instruction (iteration "
<< Iteration << "): ");
LLVM_DEBUG(I->dump());
diff --git a/llvm/test/Transforms/LoopUnroll/RISCV/unroll-Os.ll b/llvm/test/Transforms/LoopUnroll/RISCV/unroll-Os.ll
new file mode 100644
index 000000000000000..26de40bf1dc13e4
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/RISCV/unroll-Os.ll
@@ -0,0 +1,35 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 2
+; RUN: opt < %s -S -mtriple=riscv64 -passes=loop-unroll | FileCheck %s
+
+; Function Attrs: optsize
+define void @foo(ptr %array, i32 %x) #0 {
+; CHECK-LABEL: define void @foo
+; CHECK-SAME: (ptr [[ARRAY:%.*]], i32 [[X:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: entry:
+; CHECK-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK: for.body:
+; CHECK-NEXT: store i32 [[X]], ptr [[ARRAY]], align 4
+; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 1
+; CHECK-NEXT: store i32 [[X]], ptr [[ARRAYIDX_1]], align 4
+; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 2
+; CHECK-NEXT: store i32 [[X]], ptr [[ARRAYIDX_2]], align 4
+; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 3
+; CHECK-NEXT: store i32 [[X]], ptr [[ARRAYIDX_3]], align 4
+; CHECK-NEXT: ret void
+;
+entry:
+ br label %for.body
+
+for.body:
+ %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+ %arrayidx = getelementptr inbounds i32, ptr %array, i64 %indvars.iv
+ store i32 %x, ptr %arrayidx, align 4
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %exitcond.not = icmp eq i64 %indvars.iv.next, 4
+ br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+
+for.cond.cleanup:
+ ret void
+}
+
+attributes #0 = { optsize }
|
Sorry, missed one test that is affected by this change (didn't have AMDGPU backend in local build). @arsenm , could you please review this change? It looks like the cost of such GEPs:
is considered as zero, so it unrolls loop in test_func_addrspacecast_cost_nonfree. |
Ping |
Ping Number of completely unrolled loops after this change on test-suite (-Os build)
|
@@ -50,7 +50,7 @@ for.end: | |||
} | |||
|
|||
; CHECK-LABEL: @test_func_addrspacecast_cost_nonfree( | |||
; CHECK: br i1 %exitcond | |||
; CHECK-NOT: br i1 %exitcond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should adjust the -unroll-threshold=49
option in this test to preserve the old behavior. If I understood correctly, the extra cost saving applies to all functions in this file equally, so just adjusting the threshold should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, with this change it's now impossbile to find threshold that keeps old behavour for all tests in this file. The savings in test_func_addrspacecast_cost_nonfree are higher (I assume that for AMDGPU target GEP with constant offset is cheaper in addrspace(3)), so there is no threshold where first 2 tests are unrolled and the last is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arsenm Any idea how to preserve the intent of this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know have any great ideas. Either split the test file up, or add some filler operations to up the costs in other functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've come up with the following solution: add some extra level of indirection to get GEP indices so they will not become constant after loop unrolling; this hides the changes introduced by this patch and preserves the behaviour of original test (with slight increasing of unroll threshold). This is done by loading the GEP index from some global array (the additional cost of this loading is the same for all tests).
e719131
to
45e8d46
Compare
Motivating example: https://godbolt.org/z/WcM6x1YPx
Here clang doesn't unroll loop with -Os, despite the fact that it will produce smaller and faster code. The issue is that we estimate cost of GEP as 1 after unrolling:
However, the more precise cost estimation is zero, because after unrolling we will not have non-constant index %indvars.iv.i, but some known compile-time constant: {0, 1, 2, 3}, and such addressing can be folded in given target architecture (RISC-V). My suggestion is to explicitly pass expected operands into TargetTransformInfo::getInstructionCost using SimplifiedValues map (e.g. for first iteration the mapping is i64 %indvars.iv.i -> i64 0).