-
High Level Plan

Phase 1

Phase 1 is only about introducing PCF ops and fleshing out any lower-level codegen work needed to make sure we can lower PCF ops all the way through compilation. There won't be anything too complicated here, and most of the TileAndFuse pipeline will behave the same as before.

Key Changes
What this will accomplish
What this doesn't do yet
Phase 2

Phase 2 is a more significant rewrite of the TileAndFuse distribution logic. This phase will fill in the pieces from the "What this doesn't do yet" section above. It will also replace a large portion of the TileAndFuse pipeline before bufferization, and will have a bigger impact on parallel workstreams.

Key Changes
What this will accomplish
Why do everything as one-shot

1. There aren't really any other options: We want to create pcf.generic ops with pcf.allocs for operand promotion. The shape of these allocations depends on the full workgroup tile size and also on the reduction tile size. However, the scope of the allocation is the full thread-distributed pcf.generic, and the reduction loop must live inside of this pcf.generic so that consumer fusion is possible. This means:
These conditions don't leave many clean options other than doing a one-shot pass that performs the full tiling and operand promotion.

2. This one-shot pass is not really more complex than what we have today: The one-shot pass is at least straightforward and direct in the transformations it performs. The old path relies heavily on GPUFuseAndHoistParallelLoops, which is already a complex pass, and IMO it is harder to understand and debug than this new one-shot pass would be. GPUFuseAndHoistParallelLoops relies on the rest of the pipeline producing IR of a certain form, which is not well-defined or documented, and it involves many cooperating patterns, which makes it difficult to reason about what the pass does. There are many places where the pass can go wrong, and it is notoriously hard to debug. Even once you find where the pass failed to produce what you want, you usually need to trace that all the way back to one of the previous passes in the pipeline to find the real root cause. The one-shot pass doesn't have these problems, because its input will basically just be whatever is produced by TileAndDistributeToWorkgroups, and the transformations it performs will be well-defined and sequential.
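To make point 1 a bit more concrete, here is a rough Python sketch of why the promoted allocation shapes can only be computed once both the workgroup tile sizes and the reduction tile size are known. It assumes a plain matmul (LHS is M x K, RHS is K x N) and a 2-byte element type; `TileSizes`, `promoted_operand_shapes`, and `shared_memory_bytes` are hypothetical names made up for illustration and are not part of IREE or PCF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileSizes:
    """Hypothetical tile sizes for a matmul; not an IREE data structure."""
    workgroup_m: int  # workgroup tile size along M (parallel)
    workgroup_n: int  # workgroup tile size along N (parallel)
    reduction_k: int  # reduction tile size along K


def promoted_operand_shapes(tiles: TileSizes) -> dict:
    """Shape of the per-workgroup allocation for each promoted operand.

    Each shape mixes a workgroup tile size with the reduction tile size,
    so neither tiling level alone is enough to size the allocation.
    """
    return {
        "lhs": (tiles.workgroup_m, tiles.reduction_k),
        "rhs": (tiles.reduction_k, tiles.workgroup_n),
    }


def shared_memory_bytes(tiles: TileSizes, element_bytes: int = 2) -> int:
    """Total bytes needed for all promoted operands (assumed 2-byte elements)."""
    shapes = promoted_operand_shapes(tiles)
    return sum(rows * cols * element_bytes for rows, cols in shapes.values())


tiles = TileSizes(workgroup_m=64, workgroup_n=128, reduction_k=32)
shapes = promoted_operand_shapes(tiles)
total_bytes = shared_memory_bytes(tiles)
print(shapes)
print(total_bytes)
```

In this toy model the LHS allocation is 64x32 and the RHS allocation is 32x128, so a pass that only knows the workgroup tiling, or only the reduction tiling, cannot create either allocation by itself. That is the dependency that pushes toward doing tiling and promotion together.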
-
Credit to @qedawkins for most of the implementation here. I am picking up some of the work he has put together on a branch, and trying to land parts of it progressively. This will be the foundation for moving over to PCF, and then we will be able to build more on top of it.
-
So, I'd like to flag a low-level nitpick that'll be important for TDM and such. In current codegen, we've often gone straight from workgroups to threads. Both for TDM reasons and to make a uniformity analysis more straightforward, I think we should keep an eye out for places where we want to explicitly introduce the subgroup as a tiling level, even if it wasn't there before. I think this change makes sense at a high level, though, and I'm for it.
-
I am starting to work on moving the TileAndFuse pipeline to use PCF for thread distribution and operand promotion. This will have a high impact on work currently being done within the TileAndFuse pipeline, so I want to document my plan and process here as I go. I'll begin with the high level plan, and I'll post updates as I make progress to keep anyone who is interested in the loop.