
Add support for tile interchange. #8664

Merged
hanhanW merged 11 commits into iree-org:main from hanhanW:tmp
Apr 8, 2022

Conversation

@hanhanW
Contributor

@hanhanW hanhanW commented Mar 29, 2022

  • Introduce tile_interchange attribute in lowering_config.
  • Plumb it through IREE compilation stack, which includes
    • Enhance distribution logic to consider tile interchange
    • Make LinalgFuse get tile_interchange from lowering config
    • Add an E2E matmul test

For now, this can only be controlled by a compilation attribute, for experimental
purposes. It is not used in the heuristic configuration pipeline.
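For illustration, an op annotated with a lowering config carrying the new field might look something like the following. This is a sketch based only on the PR description; the exact attribute syntax and the surrounding IR are assumptions, not copied from the patch:

```mlir
// Hypothetical example: a matmul whose lowering_config carries a
// tile_interchange that swaps the first two tiled loops. Attribute
// names follow the PR description; the concrete syntax is illustrative.
%0 = linalg.matmul {lowering_config = #iree_codegen.lowering_config<
         tile_sizes = [[64, 64, 0], [8, 32, 0], [0, 0, 16]],
         tile_interchange = [[1, 0, 2], [], []]>}
     ins(%lhs, %rhs : tensor<128x256xf32>, tensor<256x512xf32>)
     outs(%init : tensor<128x512xf32>) -> tensor<128x512xf32>
```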

@hanhanW hanhanW marked this pull request as draft March 29, 2022 00:49
@hanhanW
Contributor Author

hanhanW commented Mar 29, 2022

Not ready for review yet. This depends on #8614 and #8663.

@hanhanW hanhanW force-pushed the tmp branch 2 times, most recently from 217ad53 to bc12637 on March 29, 2022 22:42
@hanhanW hanhanW marked this pull request as ready for review March 29, 2022 22:46
Collaborator

@MaheshRavishankar MaheshRavishankar left a comment


Some minor comments. Overall looks good.

/// operation.
void eraseCompilationInfo(Operation *op);

struct IREETileConfig {
Collaborator


I see why you need this. Wondering if there is a way to avoid a struct. The Attribute itself is a struct. Maybe just return the relevant LoweringConfigAttr and extract the information from there.

Contributor Author


For tile_interchange, I think the answer is yes; we can extract it from the attribute. For tilingSizes, we may have to do something more (like what's done in this method). We only consider distributable loops, so if we return the LoweringConfigAttr here, we still need to extract the distributable loops, and that would be recomputed at every call site.

I'm actually wondering whether having more than one lowering config is valid. If not, we can check it in a verifier; then we don't need complicated logic here and can split this into two methods.

If it is valid, having different lowering config attrs may still be fine for splitting this into two methods, because they should result in the same tiling sizes and interchange config. The trade-off is either longer compilation time or more complicated logic before calling the methods, because we have to make sure they result in the same config.

Contributor Author


I just realized that we don't need the tile_interchange attribute when inserting distribution information. The workload per workgroup is not affected by the attribute. We may return the tiling sizes and tile_interchange separately, but that still requires clarifying whether it is valid to have two lowering configs in one dispatch.

Collaborator


We want to be deliberate in the number of attributes we use. I have carefully settled at 2: one at the op level, and the other at the translation unit level (put these two together to get the third). We would rather have a single attribute carry all the information needed here.
I am not sure I understand why the workload per workgroup is not affected by interchange. It should be. Interchange in general can change the axes along which the tiles are distributed. Maybe the particular interchange you are looking at doesn't change the relative ordering of the distributed dimensions.

Contributor Author


I was wrong; it is affected by interchange. There is a bug in createRemoveSingleIterationLoopPass which results in a segfault. The loop ranges and the queried workload_per_workgroup look fine to me in the pass, so there may be deeper issues.

Comment thread iree/compiler/Codegen/Dialect/LoweringConfig.td
Collaborator

@MaheshRavishankar MaheshRavishankar left a comment


I am not sure this works. I think the way to make this work is that you keep the same order for the tile sizes. When generating the region for the hal.executable.entry_point, you apply the interchange to the yielded values.

So if your tile sizes are [ts1, ts2, ts3] and your loop extents are [w1, w2, w3] (which is what the workload is set to at the Flow level), then without interchange the number of workgroups is [ceildiv(w1, ts1), ceildiv(w2, ts2), ceildiv(w3, ts3)]. If you apply an interchange of [1, 0, 2], you get [ceildiv(w2, ts2), ceildiv(w1, ts1), ceildiv(w3, ts3)].

(Note that the actual value yielded by the region on the entry point op is the reverse of what I wrote above, but I didn't account for that here for the sake of clarity.)
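The arithmetic above can be sketched directly. The following is a standalone illustration of the workgroup-count computation under an interchange; the function name and fixed rank-3 shape are assumptions for illustration, not IREE's actual helpers:

```cpp
#include <array>
#include <cstdint>

// For each position i in the interchanged loop order, the number of
// workgroups along that axis is ceildiv(extent, tileSize) of the loop
// that interchange places at position i.
std::array<int64_t, 3> numWorkgroups(std::array<int64_t, 3> extents,
                                     std::array<int64_t, 3> tileSizes,
                                     std::array<int64_t, 3> interchange) {
  std::array<int64_t, 3> counts;
  for (int i = 0; i < 3; ++i) {
    int64_t dim = interchange[i];  // loop that ends up at position i
    counts[i] = (extents[dim] + tileSizes[dim] - 1) / tileSizes[dim];  // ceildiv
  }
  return counts;
}
```

With the identity interchange [0, 1, 2] this reproduces the uninterchanged workgroup counts; with [1, 0, 2] the first two counts swap, matching the example above.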

interchangedTileSizes.resize(tileSizes.size());
for (auto en : llvm::enumerate(interchange)) {
  // The check is needed because only the distributable dims are set.
  if (en.value() < tileSizes.size()) {
Collaborator


It is not guaranteed that only the innermost dimensions are reductions. You need to look at the partitionable loops interface here to get the partitioned loops.

Something I would like to fix: instead of constantly invoking the partitionable loops interface, make partitionable_loops also part of the lowering config. Then you have all the information you need here.
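As a sketch of the pattern the quoted snippet implements — permuting tile sizes by an interchange vector while ignoring entries that fall outside the distributable dims — something like the following (standalone, with assumed names and semantics; not IREE's actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Permute tile sizes by an interchange vector. Interchange entries that
// point past the distributable dims are skipped, since those loops carry
// no tile size (this mirrors the bounds check in the snippet above).
std::vector<int64_t> applyInterchange(const std::vector<int64_t> &tileSizes,
                                      const std::vector<int64_t> &interchange) {
  std::vector<int64_t> result(tileSizes.size(), 0);
  for (size_t pos = 0; pos < interchange.size(); ++pos) {
    int64_t dim = interchange[pos];
    // Only the distributable dims have tile sizes set; ignore the rest.
    if (static_cast<size_t>(dim) < tileSizes.size() && pos < result.size())
      result[pos] = tileSizes[dim];
  }
  return result;
}
```

For example, tile sizes [32, 16] with interchange [1, 0, 2] yield [16, 32]: the third interchange entry refers to a non-distributable loop and is dropped.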

@hanhanW
Contributor Author

hanhanW commented Apr 6, 2022

Thanks for the great input, Mahesh! I now understand the distribution mechanism better. This works without deleting RemoveSingleIterationLoopPass.

I found that we decide which loops to partition at the Flow level, so anything related to the first level of tiling should take that into account.

https://github.com/google/iree/blob/d52f47ee8ce56411a2a2c2adf69071165fec9de7/iree/compiler/Dialect/Flow/Transforms/DispatchLinalgOnTensors.cpp#L755-L770

In this context, the partitioned loops have to be the leading loops after interchange. I agree that it would be better to have this be part of the lowering config, but that would need much more cleanup. The PR is already large, so I don't want to pile heavy changes onto it. I added a hard check and left a TODO, and would like to clean it up later. WDYT?

@MaheshRavishankar
Collaborator

> Thanks for the great input, Mahesh! I now understand the distribution mechanism better. This works without deleting RemoveSingleIterationLoopPass.
>
> I found that we decide which loops to partition at the Flow level, so anything related to the first level of tiling should take that into account.
>
> https://github.com/google/iree/blob/d52f47ee8ce56411a2a2c2adf69071165fec9de7/iree/compiler/Dialect/Flow/Transforms/DispatchLinalgOnTensors.cpp#L755-L770
>
> In this context, the partitioned loops have to be the leading loops after interchange. I agree that it would be better to have this be part of the lowering config, but that would need much more cleanup. The PR is already large, so I don't want to pile heavy changes onto it. I added a hard check and left a TODO, and would like to clean it up later. WDYT?

Yeah that makes sense. Ill take a look again.

Collaborator

@MaheshRavishankar MaheshRavishankar left a comment


This looks good. The way it is set up now, though, you can only interchange the "parallel" loops. Have you tried interchanging the reduction loop with the parallel loops? The getDistributed...TileSizes method might need some changes to work with that.

@hanhanW
Contributor Author

hanhanW commented Apr 8, 2022

> This looks good. The way it is set up now, though, you can only interchange the "parallel" loops. Have you tried interchanging the reduction loop with the parallel loops? The getDistributed...TileSizes method might need some changes to work with that.

I did not succeed in interchanging the reduction loop at the distribution level; I ran into issues. IMO, we are not able to apply distribution in parallel -> reduction -> parallel order, because:

  1. It might introduce races: some threads would read/write the same data in a reduction loop.
  2. We would have to stop fusion after tiling reduction loops.

After exploring tile interchange for a while, I think we'll do tile interchange at the L1 tiling level (which is WIP on my plate).
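Point 1 can be illustrated with a toy model: if the reduction (k) loop of a matmul is distributed, multiple workgroups accumulate into the same output tile without synchronization. The following standalone sketch (names and model are assumptions for illustration, not IREE code) counts the writers per output tile:

```cpp
#include <algorithm>
#include <map>

// Toy model of a matmul's (parallel i, reduction k) tiled loop nest.
// Returns the maximum number of workgroups writing any single output
// tile. Distributing k gives every output tile numTilesK concurrent
// writers -- unsynchronized read-modify-write accumulation, i.e. a race.
int maxWritersPerOutputTile(int numTilesI, int numTilesK, bool distributeK) {
  std::map<int, int> writers;  // output tile index i -> writer count
  int kGroups = distributeK ? numTilesK : 1;  // serial k => one writer
  for (int i = 0; i < numTilesI; ++i)
    for (int k = 0; k < kGroups; ++k)
      writers[i] += 1;
  int maxWriters = 0;
  for (const auto &kv : writers)
    maxWriters = std::max(maxWriters, kv.second);
  return maxWriters;
}
```

With k kept serial every output tile has exactly one writer; distributing k makes every output tile a shared write target.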

@hanhanW hanhanW merged commit 7a3bbd3 into iree-org:main Apr 8, 2022
@hanhanW hanhanW deleted the tmp branch April 8, 2022 19:26
hanhanW added a commit that referenced this pull request Apr 11, 2022
These methods are used externally by the auto-tuner and were accidentally removed in #8664.
