Bump LLVM to llvm/llvm-project@f5145f4dc819 #16073
Conversation
Force-pushed 7c2831c to aa66a02
Abbreviated Benchmark Summary @ commit fe22035d50c47a5f3a1e3f3a41bfcca1733e1591 (vs. base e2e126ce061454ad71bc2f5c1c08b1efc982f4cd):
- Regressed Latencies 🚩 (top 3 of 4 results shown)
- Improved Latencies 🎉 (top 3 of 21 results shown)
- Improved Total Dispatch Sizes 🎉
- Improved Total Artifact Sizes 🎉
[Data-tiling comparison table and detailed benchmark tables elided]
Force-pushed aa66a02 to 0a9330e
I took a look at the total dispatch size regression for GPT2. It looks like it comes from llvm/llvm-project@bb6d5c2 and llvm/llvm-project@eb42868; there are more elementwise add ops at the flow level. The two commits have dependencies, so I could only compare IR dumps with and without reverting both.

IR dump without revert: https://gist.githubusercontent.com/hanhanW/42dda3e9994c3b9454d8d4e5b9b0be01/raw/e7f70daac54df584d4a0fff7e98093cdfb90c443/log10
IR dump with revert: https://gist.githubusercontent.com/hanhanW/1e695dfcc4beb894c2423ae3139d2a54/raw/56941d0cb62ee8aac266ccea85bec482523f8e01/log11

If you vimdiff the two dumps, you will notice that there are more generic ops without the revert and more constants with the revert. It looks like some constant folding is not kicking in after the integrate. (It would be good if someone can check whether my statements are correct.)
The first difference in the diff is interesting.

Before:

builtin.module {
func.func @forward_dispatch_6_generic_2304_f32(%arg0: !flow.dispatch.tensor<readonly:tensor<2304xf32>>, %arg1: !flow.dispatch.tensor<readonly:tensor<2304xf32>>, %arg2: !flow.dispatch.tensor<writeonly:tensor<2304xf32>>) {
%0 = flow.dispatch.tensor.load %arg0, offsets = [0], sizes = [2304], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2304xf32>> -> tensor<2304xf32>
%1 = flow.dispatch.tensor.load %arg1, offsets = [0], sizes = [2304], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2304xf32>> -> tensor<2304xf32>
%2 = tensor.empty() : tensor<2304xf32>
%3 = linalg.generic {indexing_maps = [#map1, #map1, #map1], iterator_types = ["parallel"]} ins(%0, %1 : tensor<2304xf32>, tensor<2304xf32>) outs(%2 : tensor<2304xf32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%4 = arith.addf %in, %in_0 : f32
linalg.yield %4 : f32
} -> tensor<2304xf32>
flow.dispatch.tensor.store %3, %arg2, offsets = [0], sizes = [2304], strides = [1] : tensor<2304xf32> -> !flow.dispatch.tensor<writeonly:tensor<2304xf32>>
return
}
}
}

After:

builtin.module {
func.func @forward_dispatch_6_generic_2304_f32(%arg0: !flow.dispatch.tensor<readonly:tensor<2304xf32>>, %arg1: !flow.dispatch.tensor<writeonly:tensor<2304xf32>>) {
%cst = arith.constant dense_resource<__elided__> : tensor<2304xf32>
%0 = flow.dispatch.tensor.load %arg0, offsets = [0], sizes = [2304], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2304xf32>> -> tensor<2304xf32>
%1 = tensor.empty() : tensor<2304xf32>
%2 = linalg.generic {indexing_maps = [#map1, #map1, #map1], iterator_types = ["parallel"]} ins(%0, %cst : tensor<2304xf32>, tensor<2304xf32>) outs(%1 : tensor<2304xf32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%3 = arith.addf %in, %in_0 : f32
linalg.yield %3 : f32
} -> tensor<2304xf32>
flow.dispatch.tensor.store %2, %arg1, offsets = [0], sizes = [2304], strides = [1] : tensor<2304xf32> -> !flow.dispatch.tensor<writeonly:tensor<2304xf32>>
return
}
}
}
I probably need to dump the IR without eliding large constants. IIUC, constants will be pulled into dispatches if they are small. We will need to re-dump the IRs and see what the actual values are.
The first difference in the dispatches is after
I have a much smaller repro now: Run
What's happening is:
The applyOpPatternsAndFold replaces the
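For reference, here is a minimal hypothetical sketch of the reshape(constant) -> constant rewrite being discussed (the op, shapes, and values below are illustrative, not the actual GPT2 IR):

// Hypothetical input: a reshape whose source is a constant.
%cst = arith.constant dense<5.000000e-01> : tensor<1x2304xf32>
%0 = tensor.collapse_shape %cst [[0, 1]] : tensor<1x2304xf32> into tensor<2304xf32>

// After the rewrite, the reshape result is replaced by a new constant with the
// collapsed type, materialized next to the original reshape op.
%cst_0 = arith.constant dense<5.000000e-01> : tensor<2304xf32>

When the reshape lives inside a dispatch, the new constant ends up inside that region, which matches the dense_resource constant in the "After" dump above.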
Force-pushed 5c0a51e to 75337be
I'm wondering why this was not happening before. Or was it happening, but the constant was hoisted out of the region by the greedy pattern rewrite driver (and now it is no longer)?
Yes, you are right. They were hoisted out of the region (maybe by the greedy pattern rewrite driver).
Actually, I don't really know what happened without the integrate; I need more time to understand it. My take matches what you're saying. I also wonder whether the reshape(constant) -> constant rewrite should be a folder. It looks dangerous to me now: it could put a constant into a region whose op has the IsolatedFromAbove trait (or some trait that disallows constant hoisting; I don't have much context about these), which sounds weird to me.
I don't really see how that could happen. If we have a
Folders never move ops around. The
Do you have any rewrite patterns in IREE that move ops (in particular constant ops) from one region to another region? Before llvm/llvm-project#75897, moving a constant op to a different region brought internal data structures in the greedy pattern rewrite driver into an inconsistent state. This could've caused an op to be hoisted by accident by
Very good point!
Well, I don't know. Maybe @MaheshRavishankar can answer.
Force-pushed 75337be to 159487f
Force-pushed c7471f2 to 82b1659
@MaheshRavishankar 77b777c fixes the regression issue. Can you help review?
Force-pushed 82b1659 to 77b777c
@matthias-springer are you able to help fix test-with-listener.mlir? The
To repro:
That is a problem with the
This should be fixed in the greedy pattern rewrite driver, and we should do something similar to the
A simple fix for now could be to invert the order of the operands in the test case; can you give that a try?
I think this was already "broken" before my CSE change; it was just not triggered because more notifications were sent around constants.
Seems to be working, thanks!
The regression of MobileBertSquad_fp16 on the Pixel GPU looks like noise because the total dispatch size is the same.
I thought the review comments would show up in the PR, but it seems they don't. Mahesh already reviewed it in 77b777c and it looks good to him.
This includes a revert for llvm/llvm-project@11ac97c

Other fixes:
- Update HALDispatchABI::buildScopeAttr to take LLVM::LLVMFuncOp as input, so it can handle whether the DistinctAttr is needed or not for the DISubprogramAttr.
- Add --split-input-file to lower_to_ukernel_ops.mlir.
- Move CHECK-DAG to the beginning to fix lit tests; replace CHECK[-NEXT] with CHECK-DAG (a hypothetical illustration follows below).
- Add --canonicalize to affinemin_canonicalization.mlir, which hoists constants out of the region.
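As a hypothetical illustration of the CHECK-DAG change (not the actual test content), matching a constant with CHECK-DAG instead of CHECK-NEXT keeps the test independent of where canonicalization places the constant:

// CHECK-DAG: %[[CST:.+]] = arith.constant
// CHECK: linalg.generic
// CHECK-SAME: ins(%{{.+}}, %[[CST]]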