[LLVMGPU] Transitioning LLVMGPUTileAndFuse to use PCF Dialect #24070

Max191 · 2026-04-10T12:51:20Z · Collaborator

I am starting to work on moving the TileAndFuse pipeline to use PCF for thread distribution and operand promotion. This will significantly affect work currently being done within the TileAndFuse pipeline, so I want to document my plan and process here as I go. I'll begin with the high-level plan and post updates as I make progress, to keep anyone who is interested in the loop.

Max191 · 2026-04-10T14:53:01Z · Collaborator Author

High Level Plan

Phase 1

Phase 1 is just about introducing PCF ops and fleshing out the lower-level codegen work needed to lower PCF ops all the way through compilation. There won't be anything too complicated here, and most of the TileAndFuse pipeline will behave the same as before.

Key Changes

We will introduce a new pass shortly after GPUFuseAndHoistParallelLoops that converts all scf.forall ops into pcf.generic ops. This happens after all thread distribution is already complete, so little will change there. The main difference will be in the codegen after distribution (e.g., bufferization and resolving distribution loops).
The new pass will also match barrier regions and bufferization.alloc_tensor ops to convert them into pcf.sref carried by the pcf.generic op. This will mainly affect the lowering for swizzling, since we will see pcf.alloc ops instead of bufferization.alloc_tensor. The plumbing for swizzling is mostly about preserving swizzle-based attribute hints down to the memref level, but with pcf ops, we could resolve them earlier.

What this will accomplish

Fleshes out lower-level codegen support for using pcf.
Enables improved swizzling resolution (we no longer have to wait all the way until bufferization).

What this doesn't do yet

Does not replace any fusion or operand promotion codegen strategies
Does not meaningfully affect the distribution phase of TileAndFuse codegen
Doesn't introduce pcf synchronization + buffer semantics earlier in codegen (e.g., when we do operand promotion)

Phase 2

Phase 2 is a more significant rewrite of the TileAndFuse distribution logic. This phase will fill in the pieces from the "What this doesn't do yet" section above. This phase will also replace a large portion of the TileAndFuse pipeline before bufferization, and will have a bigger impact on parallel workstreams.

Key Changes

Replace the progressive tiling levels, operand promotion anchoring, mma conversion, and loop hoisting/fusion with a single one-shot pass that performs all tiling, anchors promotion, and packs to mma shapes in one step. This is the major change in this phase, as it combines a significant portion of the codegen pipeline into a single pass.
Operand promotion will be anchored by the one-shot pass, and then lowered into the concrete thread-distributed promotion implementation (e.g., a regular copy or dma) in a separate pass. This separates out some of the complexity from the big one-shot pass. The one-shot pass will create a new iree_gpu.promote_operand op, which anchors the promotion, and the later pass will lower that op into the promotion implementation.
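
To make the anchor-then-lower split concrete, here is a toy Python model (not IREE code). Only the name iree_gpu.promote_operand comes from the plan above; the Op class, the toy.* op names, and both helper functions are hypothetical stand-ins, so treat this as a sketch of the two-stage structure rather than of the real passes.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Hypothetical, minimal stand-in for an IR operation."""
    name: str
    operands: list = field(default_factory=list)


def anchor_promotions(op, operand_indices):
    """Stage 1 (one-shot pass in the plan): wrap the chosen operands in a
    placeholder op that only records *that* they should be promoted."""
    for i in operand_indices:
        op.operands[i] = Op("iree_gpu.promote_operand", [op.operands[i]])
    return op


def lower_promotions(op, strategy):
    """Stage 2 (later pass in the plan): replace each placeholder with a
    concrete implementation. The toy.* names are made up for this sketch."""
    impl = {"copy": "toy.copy_to_shared", "dma": "toy.dma_to_shared"}[strategy]
    for i, operand in enumerate(op.operands):
        if isinstance(operand, Op) and operand.name == "iree_gpu.promote_operand":
            op.operands[i] = Op(impl, operand.operands)
    return op


matmul = Op("matmul", ["lhs", "rhs", "acc"])
anchor_promotions(matmul, [0, 1])   # decide what to promote
lower_promotions(matmul, "dma")     # decide how to promote it
print([o.name if isinstance(o, Op) else o for o in matmul.operands])
# prints ['toy.dma_to_shared', 'toy.dma_to_shared', 'acc']
```

The point of the split is that the first stage never needs to know which promotion strategy will be used, so the strategy choice stays out of the big one-shot pass.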

What this will accomplish

Fully moves the TileAndFuse pipeline to use pcf constructs and transforms.
Materializes shared memory (as sref) and synchronization early in codegen
When operand promotion is materialized, we should have explicit synchronization and buffer semantics, so we can directly generate ops like dma without needing to go through special tensor forms of the op.
Bufferization doesn't have to do very much, because most of our memory will already be materialized. This includes dispatch input/output memory, because we have load_from_buffer/store_to_buffer ops.
Pipelining could happen before bufferization and make use of SSA data-dependencies and pcf synchronization.
Probably more benefits I'm not thinking of right now :P

Why do everything as one-shot

1. There aren't really any other options:

We want to create pcf.generic ops with pcf.allocs for operand promotion. The shape of these allocations depends on both the full workgroup tile size and the reduction tile size. However, the scope of the allocation is the full thread-distributed pcf.generic, and the reduction loop must live inside of this pcf.generic so that consumer fusion is possible. This means:

We can't materialize the allocation until we have formed the reduction loop, because it is used inside the reduction loop.
We can't form the reduction loop until we have formed the thread-distributed pcf.generic, because the reduction loop lives inside the thread-distributed pcf.generic.
We also don't want to have to perform operand promotion after we have already distributed to threads, because we would have to reconstruct the undistributed tile (since the allocation is at the scope of the full workgroup tile)

These conditions don't leave many clean options other than doing a one-shot pass that performs the full tiling and operand promotion.
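
One way to see why staged passes don't work here is to write the three conditions above as ordering constraints between hypothetical separate steps and ask for a valid order. The sketch below uses Python's standard graphlib module; the step names are illustrative labels I made up, not real pass names, and the third constraint encodes a preference ("we don't want to") rather than a hard requirement.

```python
from graphlib import CycleError, TopologicalSorter

# Each key must run after every step in its value set.
# Step names are illustrative labels, not real IREE passes.
STAGED_CONSTRAINTS = {
    # The allocation is used inside the reduction loop.
    "materialize_allocation": {"form_reduction_loop"},
    # The reduction loop lives inside the thread-distributed pcf.generic.
    "form_reduction_loop": {"form_thread_distributed_generic"},
    # Preference: promote before distributing to threads, so we never
    # have to reconstruct the undistributed tile.
    "form_thread_distributed_generic": {"materialize_allocation"},
}


def staged_order(constraints):
    """Return a valid step order, or None if the constraints are cyclic."""
    try:
        return list(TopologicalSorter(constraints).static_order())
    except CycleError:
        return None


print(staged_order(STAGED_CONSTRAINTS))  # prints None

# Dropping the preference makes an order possible, but promotion then
# runs after thread distribution, which is the case we want to avoid.
relaxed = {k: v for k, v in STAGED_CONSTRAINTS.items()
           if k != "form_thread_distributed_generic"}
print(staged_order(relaxed))
```

With all three constraints there is no valid sequential order, which is another way of saying that the steps have to happen together in one pass.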

2. This one-shot pass is not really more complex than what we have today:

The one-shot pass is at least straightforward and direct in the transformations it performs. The old path heavily relies on GPUFuseAndHoistParallelLoops, which is already a complex pass, and IMO it is arguably harder to understand/debug than this new one-shot pass would be. The GPUFuseAndHoistParallelLoops pass relies on the rest of the pipeline producing something of a certain form, which is not well-defined/documented, and it involves many cooperative patterns which makes reasoning about what the pass does difficult. There are many places where the GPUFuseAndHoistParallelLoops pass can go wrong, and it is notoriously hard to debug. Even once you find where the pass failed to produce what you want, you usually need to trace that all the way back to one of the previous passes in the pipeline to see where the real root cause is.

The one-shot pass doesn't have these problems, because the input to the pass will basically just be whatever is produced by TileAndDistributeToWorkgroups, and the transformations performed by the pass will be well-defined and sequential.

0 replies

Max191 · 2026-04-10T14:55:02Z · Collaborator Author

Credit to @qedawkins for most of the implementation here. I am picking up some of the work he has put together on a branch, and trying to land parts of it progressively. This will be the foundation for moving over to pcf, and then we will be able to build more on top of it.

0 replies

krzysz00 · 2026-04-10T16:31:14Z · Collaborator

So, I'd like to flag a weird low-level nitpick that'll be important for TDM and such. In current codegen, we've often gone straight from workgroups to threads. Both for TDM reasons and to make uniformity analysis more straightforward, I think we should keep an eye out for places where we want to explicitly introduce the subgroup as a tiling level, even if it wasn't there before.

I think this change makes sense at a high level, though, and I'm for it.

1 reply

Max191 Apr 10, 2026
Collaborator Author

As far as I understand, I think that shouldn't be hard to do if we need/want to later. It would just be another tiling level, so we could split up subgroup and lane scopes as needed.
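
For illustration only, assuming a linearized thread id within the workgroup and a fixed subgroup size (both assumptions, as is the function name), splitting the thread scope into subgroup and lane scopes amounts to this index decomposition:

```python
def split_thread_id(thread_id, subgroup_size):
    """Split a linearized thread id into (subgroup_id, lane_id).

    Hypothetical helper: assumes threads are numbered contiguously
    within a workgroup and that every subgroup has the same size.
    """
    subgroup_id, lane_id = divmod(thread_id, subgroup_size)
    return subgroup_id, lane_id


print(split_thread_id(130, 64))  # prints (2, 2)
```

Anything that depends only on subgroup_id is uniform across the lanes of a subgroup, which is why making that level explicit helps a uniformity analysis.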
