Add support for tile interchange #8664
Conversation
- Introduce a tile_interchange attribute in lowering_config.
- Plumb it through the IREE compilation stack, which includes:
  - Enhance the distribution logic to consider tile interchange.
  - Make LinalgFuse get tile_interchange from the lowering config.
- Add an E2E matmul test.

This can only be controlled by a compilation attribute, for experimental purposes. It is not used in the heuristic configuration pipeline. A small illustration of what the interchange does to the tile loops follows.
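Editorial note: a minimal plain-C++ illustration of the interchange semantics, assuming `interchange` is a permutation of the loop indices. The helper name is hypothetical; in the PR this information lives in the lowering_config attribute.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Applies a tile interchange: position i of the result takes the tile
// size of loop interchange[i]. `interchange` must be a permutation of
// [0, tileSizes.size()). Hypothetical standalone helper.
std::vector<int64_t> applyInterchange(const std::vector<int64_t> &tileSizes,
                                      const std::vector<int64_t> &interchange) {
  std::vector<int64_t> result(tileSizes.size());
  for (std::size_t i = 0; i < interchange.size(); ++i)
    result[i] = tileSizes[interchange[i]];
  return result;
}

int main() {
  // A matmul's (i, j, k) loops tiled with [64, 128, 0]; the interchange
  // [1, 0, 2] swaps the i and j tile loops.
  for (int64_t ts : applyInterchange({64, 128, 0}, {1, 0, 2}))
    std::cout << ts << " ";  // prints: 128 64 0
}
```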
MaheshRavishankar
left a comment
Some minor comments. Overall looks good.
```cpp
/// operation.
void eraseCompilationInfo(Operation *op);

struct IREETileConfig {
```
I see why you need this. Wondering if there is a way to avoid a struct. The Attribute itself is a struct. Maybe just return the relevant LoweringConfigAttr and extract the information from there.
For tile_interchange, I think the answer is yes; we can extract it from the attribute. For the tiling sizes, we may have to do something more (like what's done in this method), since we only consider distributable loops. If we return the LoweringConfigAttr here, the caller still needs to extract the distributable loops, and that computation would be duplicated everywhere.
I'm actually wondering whether having more than one lowering config is valid. If not, we can enforce that in a verifier; then we don't need the complicated logic here and could split this into two methods.
If it is valid, splitting into two methods may still be fine, because different lowering config attrs should result in the same tiling sizes and interchange config. The trade-off is either longer compilation time or complicated logic before calling the methods, because we have to make sure they result in the same config.
I just realized that we don't need the tile_interchange attribute when inserting distribution information; the workloadPerWorkgroup is not affected by the attribute. We may return the tiling sizes and tile_interchange separately, but that still requires clarifying whether it is valid to have two lowering configs in one dispatch. The two-query alternative is sketched below.
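Editorial note: for illustration only, a plain-C++ sketch of the "two separate queries" alternative. All names here are hypothetical; the real code would read the information off a LoweringConfigAttr.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the information stored on the attribute.
struct LoweringConfig {
  std::vector<int64_t> tileSizes;
  std::vector<int64_t> tileInterchange;
};

// The tile sizes still need the distributable-loop filtering discussed
// above; only the leading distributable loops carry distribution sizes.
std::vector<int64_t>
getDistributedTileSizes(const LoweringConfig &config,
                        std::size_t numDistributableLoops) {
  std::size_t n = std::min(numDistributableLoops, config.tileSizes.size());
  return std::vector<int64_t>(config.tileSizes.begin(),
                              config.tileSizes.begin() + n);
}

// The interchange can be returned verbatim; it needs no filtering.
std::vector<int64_t> getTileInterchange(const LoweringConfig &config) {
  return config.tileInterchange;
}
```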
We want to be deliberate in the number of attributes we use. I have carefully settled at two: one at the op level, and the other at the translation-unit level (put these two together to get the third). We would rather have a single attribute carry all the information needed here.
I am not sure I understand why the workload per workgroup is not affected by interchange. It should be. Interchange in general can change the axes along which the tiles are distributed. Maybe the particular interchange you are looking at doesn't change the relative ordering of the distributed dimensions.
I was wrong; it is affected by interchange. There is a bug in createRemoveSingleIterationLoopPass which results in a segfault. The loop ranges and the queried workload_per_workgroup look fine to me in the pass, so there may be deeper issues.
MaheshRavishankar
left a comment
I am not sure this works. I think the way to make this work is to keep the same order for the tile sizes and, when generating the region for the hal.executable.entry_point, apply the interchange to the yielded values.
So if your tile sizes are [ts1, ts2, ts3] and your loop extents are [w1, w2, w3] (which is what the workload is set to at the Flow level), then without interchange the number of workgroups is [ceildiv(w1, ts1), ceildiv(w2, ts2), ceildiv(w3, ts3)]. If you apply an interchange of [1, 0, 2], the number of workgroups becomes [ceildiv(w2, ts2), ceildiv(w1, ts1), ceildiv(w3, ts3)].
(Note that the actual values yielded by the region on the entry point op are the reverse of what I wrote above, but I didn't account for that for the sake of clarity.)
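Editorial note: to make the arithmetic above concrete, a small worked example in plain C++. The extents and tile sizes are made up for illustration; this is not code from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int64_t ceilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

int main() {
  // Loop extents [w1, w2, w3] (the workload set at the Flow level) and
  // per-loop tile sizes [ts1, ts2, ts3]. All values are made up.
  std::vector<int64_t> workload = {128, 512, 64};
  std::vector<int64_t> tileSizes = {32, 64, 16};
  std::vector<int64_t> interchange = {1, 0, 2};

  // Without interchange: workgroups[i] = ceildiv(w_i, ts_i) -> [4, 8, 4].
  for (std::size_t i = 0; i < workload.size(); ++i)
    std::cout << ceilDiv(workload[i], tileSizes[i]) << " ";
  std::cout << "\n";

  // With interchange [1, 0, 2]: distributed dimension i walks loop
  // interchange[i], so the counts are permuted the same way -> [8, 4, 4].
  for (int64_t loop : interchange)
    std::cout << ceilDiv(workload[loop], tileSizes[loop]) << " ";
  std::cout << "\n";
}
```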
```cpp
interchangedTileSizes.resize(tileSizes.size());
for (auto en : llvm::enumerate(interchange)) {
  // The check is needed because only the distributable dims are set.
  if (en.value() < tileSizes.size()) {
```
It's not guaranteed that only the innermost dimensions are reductions. You need to look at the partitionable loops interface here to get the partitioned loops.
Something I would like to fix: instead of constantly invoking the partitionable loops interface, make partitionable_loops also part of the lowering config. Then you have all the information you need here.
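Editorial note: a hedged sketch (plain C++, not the PR's code) of using partitioned-loop information instead of the `en.value() < tileSizes.size()` check. Here `partitionedLoops` stands in for whatever the partitionable loops interface, or a partitionable_loops field in the lowering config, would return.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Permute only the distributed tile sizes. Assumes `tileSizes` has one
// entry per loop, with 0 for loops that are not distributed.
std::vector<int64_t>
getInterchangedDistributedTileSizes(const std::vector<int64_t> &tileSizes,
                                    const std::vector<int64_t> &interchange,
                                    const std::set<int64_t> &partitionedLoops) {
  std::vector<int64_t> result(tileSizes.size(), 0);
  std::size_t pos = 0;
  for (int64_t loop : interchange) {
    // Reduction (non-partitioned) loops keep tile size 0 and are skipped,
    // instead of relying on "only the innermost dims are reductions".
    if (!partitionedLoops.count(loop)) continue;
    result[pos++] = tileSizes[loop];
  }
  return result;
}
```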
Thanks for the great input, Mahesh! I now understand the distribution mechanism better. This works without deleting RemoveSingleIterationLoopPass. I found that we decide which loops to partition at the Flow level, so anything related to the first level of tiling has to take that into account. In this context, the partitioned loops have to be the leading loops after interchange. I agree that it would be better to have this be part of the lowering config, but that needs much more cleanup. The PR is already large, so I don't want to add heavy stuff to it. I added a hard check (roughly the shape sketched below), left a TODO, and would like to clean it up later. WDYT?
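Editorial note: a hypothetical version of the "hard check" mentioned above, not the PR's exact code. It assumes the partitioned loops are loops [0, numPartitionedLoops) before interchange, per the Flow-level decision.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// After applying `interchange`, every partitioned loop must land in a
// leading position.
bool partitionedLoopsAreLeading(const std::vector<int64_t> &interchange,
                                int64_t numPartitionedLoops) {
  assert(numPartitionedLoops <= static_cast<int64_t>(interchange.size()));
  // The first numPartitionedLoops entries of the interchange vector must
  // themselves be partitioned loop indices.
  return std::all_of(
      interchange.begin(), interchange.begin() + numPartitionedLoops,
      [&](int64_t loop) { return loop < numPartitionedLoops; });
}
```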
Yeah, that makes sense. I'll take a look again.
MaheshRavishankar
left a comment
This looks good. The way it is set up now, though, you can only interchange the "parallel" loops. Have you tried interchanging the reduction loop with the parallel loops? The getDistributed...TileSizes method might need some changes to work with that.
I did not manage to interchange the reduction loop at the distribution level; I ran into issues. IMO, we're not able to apply distribution to reduction loops at this level.
After exploring tile interchange for a while, I think we'll do tile interchange at the L1 tiling level (which is WIP on my plate).
These methods are used externally by the auto-tuner and were accidentally removed in #8664.
This can only be controlled by a compilation attribute for experimental purposes for now. It is not used in the heuristic configuration pipeline.