As part of the SDXL sprint (#16854), several different pipelines were added to support GEMMs/convolutions. This issue tracks landing this work in tree, either as separate pipelines or by merging them.
Main Issue
Three different pipelines were used to handle GEMM/convolution computations in the SDXL branch:
Vector distribution based pipelines for supporting convolutions: c5150c3
Vector distribution based pipeline for handling matmuls that need padding to map to specific intrinsics.
Some of these seem like an artificial distinction and could be merged. Potentially we should have two pipelines: one for convolutions and one for matrix multiplies.
One of the issues that was uncovered is that the vector distribution pass is a bit too monolithic and has grown too complex to be easily generalizable for different lowering paths. For example, adding support for copies to shared memory was easier to achieve by avoiding vector distribution for the copy entirely and distributing it after the fact. While the current vector distribution pass works very well (and in effect gives us a path to import some ideas from Triton and rocMLIR), the "one-shot" nature of it makes it hard to adapt.
Current convolution based pipelines don't really support NCHW convolutions; they only support NHWC convolutions. Using tensor.pack to mimic the data movement from global to shared memory could allow us to support both NCHW and NHWC convolutions without requiring a full program transformation that converts NCHW convolutions to NHWC convolutions followed by propagation + fusion to eliminate all transposes. This is actually in line with some of the work happening on the AIE backend to model data movement + transformation across memory hierarchies (and the pack based approach seems to work well there).
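As an illustrative sketch of the idea (the shapes and tile size here are hypothetical, not taken from the branch), tensor.pack can express the retiling of an NCHW input directly, without first rewriting the whole program to NHWC:

```mlir
// Hypothetical sketch: tile the channel dimension of an NCHW input so the
// innermost dimension matches an intrinsic-friendly layout. Shapes and the
// tile size are illustrative only.
%dest = tensor.empty() : tensor<1x8x16x16x8xf32>
%packed = tensor.pack %input inner_dims_pos = [1] inner_tiles = [8]
    into %dest : tensor<1x64x16x16xf32> -> tensor<1x8x16x16x8xf32>
```

The same op with different inner_dims_pos/inner_tiles covers the NHWC layout, which is what would let a single pipeline handle both.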
The commits covered by this issue make up the majority of the commits to the SDXL branch.
Immediate next steps
A lot of the functionality of vector distribution can be achieved by using tile + fuse + distribute on tensors, followed by vectorization. There might be some pieces missing to connect things all the way, but working on tensors allows the use of encodings to potentially model distribution better. It might be worth doing a quick exploration of what this would look like and what the missing pieces are.
On the LLVMGPU backend, comprehensive bufferization is used to get shared memory allocations. This has been a long running pain point. Instead, the shared memory allocation needs to be more prescriptive (using constructs similar to bufferization.alloc_tensor), and any implicit allocation done by bufferization should essentially be treated as an error (similar to what the CPU backends do).
Use tensor.pack + bufferization.alloc_tensor to support NCHW and NHWC convolutions more directly in the lowering pipelines, while still being able to target MFMA instructions. This might require adding support for tensor.pack to vector distribution.
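A rough sketch of how the last two steps could fit together (shapes and tile sizes are hypothetical): the staging buffer that should become a shared memory allocation is made explicit with bufferization.alloc_tensor, and tensor.pack writes the retiled data into it, so bufferization never has to invent an allocation on its own:

```mlir
// Hypothetical sketch: explicitly allocate the staging tensor that is
// expected to bufferize to a shared memory allocation, then pack into it.
// Shapes and tile sizes are illustrative only.
%staging = bufferization.alloc_tensor() : tensor<8x8x16x32xf32>
%packed = tensor.pack %lhs inner_dims_pos = [0, 1] inner_tiles = [16, 32]
    into %staging : tensor<128x256xf32> -> tensor<8x8x16x32xf32>
```

With this structure, any allocation that bufferization introduces implicitly (i.e. not rooted in an alloc_tensor like the one above) could be flagged as an error, as proposed for the CPU backends.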