Changes to Vector distribution pass to allow merging/landing the different convolution/gemm based pipelines used for SDXL. #16895

MaheshRavishankar (Contributor) opened this issue Mar 26, 2024 · 0 comments

As part of the SDXL sprint (#16854), several different pipelines were added to support GEMM/convolution workloads. This issue tracks landing them in tree, either as separate pipelines or merged into one.

Main Issue

  1. Three different pipelines were used to handle GEMM/convolution computations on the SDXL branch:

    • Implicit GEMM pipeline: d306196
    • Vector-distribution-based pipelines for supporting convolutions: c5150c3
    • Vector-distribution-based pipeline for handling matmuls that need padding to map to the specific intrinsics.

    Some of these distinctions seem artificial, and the pipelines can be merged. Potentially we should have two pipelines, one for convolutions and one for matrix multiplies.
  2. One of the issues that was uncovered is that the vector distribution pass is a bit too monolithic and has grown too complex to generalize easily across different lowering paths. For example, adding support for the copy to shared memory was easier to achieve by simply avoiding vector distribution for that copy and distributing it after the fact. While the current vector distribution pass works very well (and in effect gives us a path to import some ideas from Triton and rocMLIR), its "one-shot" nature makes it hard to adapt.

  3. The current convolution-based pipelines don't really support NCHW convolutions; they only support NHWC convolutions. Using tensor.pack to model the data movement from global to shared memory could allow us to support both NCHW and NHWC convolutions without requiring a whole-program transformation that converts NCHW convolutions to NHWC, followed by propagation + fusion to eliminate all the transposes. This is in line with some of the work happening on the AIE backend to model data movement + transformation across memory hierarchies (and the pack-based approach seems to work well there). A sketch of what such a pack could look like follows this list.
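
For illustration, a minimal sketch (with made-up shapes and tile sizes) of packing the channel dimension of an NCHW convolution input; the actual tile size would be chosen to match the MFMA intrinsic layout:

```mlir
// Hypothetical 1x64x56x56 NCHW input; pack the channel dimension into
// 16-wide inner tiles so the innermost dimension lines up with the intrinsic layout.
%empty = tensor.empty() : tensor<1x4x56x56x16xf32>
%packed = tensor.pack %input inner_dims_pos = [1] inner_tiles = [16]
    into %empty : tensor<1x64x56x56xf32> -> tensor<1x4x56x56x16xf32>
```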

The commits covered by this issue make up the majority of the commits on the SDXL branch.

Immediate next steps

  • A lot of the functionality of vector distribution can be achieved by using tile + fuse + distribute on tensors, followed by vectorization. There might be some pieces missing to connect things all the way through, but working on tensors allows the use of encodings to (maybe) better model the distribution. It would be worth a quick exploration of what this would look like and what the missing pieces are; see the first sketch after this list.
  • On the LLVMGPU backend, comprehensive bufferization is used to get shared memory allocations. This has been a long-running pain point. Instead, the shared memory allocation needs to be more prescriptive (using constructs similar to bufferization.alloc_tensor), and any implicit allocation done by bufferization should essentially be treated as an error (similar to what the CPU backends do); see the second sketch after this list.
  • Use tensor.pack + bufferization.alloc_tensor to support NCHW and NHWC convolutions more directly in the lowering pipelines while still being able to target MFMA instructions. This might require adding support for tensor.pack to vector distribution. The second sketch after this list combines both constructs.
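
As a first sketch (for the tile + fuse + distribute point above), here is roughly what distributing a matmul to workgroups while staying on tensors could look like; the shapes, tile sizes, and mapping are made up for illustration:

```mlir
func.func @distributed_matmul(%lhs: tensor<128x256xf32>, %rhs: tensor<256x128xf32>,
                              %init: tensor<128x128xf32>) -> tensor<128x128xf32> {
  // Distribute 64x64 tiles of the result across workgroups on tensors;
  // each tile would later be tiled again to threads and vectorized.
  %0 = scf.forall (%i, %j) = (0, 0) to (128, 128) step (64, 64)
      shared_outs(%out = %init) -> (tensor<128x128xf32>) {
    %lhs_tile = tensor.extract_slice %lhs[%i, 0] [64, 256] [1, 1]
        : tensor<128x256xf32> to tensor<64x256xf32>
    %rhs_tile = tensor.extract_slice %rhs[0, %j] [256, 64] [1, 1]
        : tensor<256x128xf32> to tensor<256x64xf32>
    %acc_tile = tensor.extract_slice %out[%i, %j] [64, 64] [1, 1]
        : tensor<128x128xf32> to tensor<64x64xf32>
    // Per-workgroup matmul on the extracted tiles.
    %mm = linalg.matmul
        ins(%lhs_tile, %rhs_tile : tensor<64x256xf32>, tensor<256x64xf32>)
        outs(%acc_tile : tensor<64x64xf32>) -> tensor<64x64xf32>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %mm into %out[%i, %j] [64, 64] [1, 1]
          : tensor<64x64xf32> into tensor<128x128xf32>
    }
  } {mapping = [#gpu.block<y>, #gpu.block<x>]}
  return %0 : tensor<128x128xf32>
}
```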
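
And a second sketch (for the last two points) of staging a packed NCHW input tile through an explicitly requested allocation instead of one materialized implicitly by comprehensive bufferization. The shapes are made up, and the memory_space attribute spelling is only an assumption about how the shared memory space would be recorded:

```mlir
// Explicitly request the staging tensor that should bufferize to shared memory,
// rather than letting comprehensive bufferization create an implicit allocation.
// NOTE: the memory_space attribute here is an assumed way to record the address space.
%staging = bufferization.alloc_tensor()
    {memory_space = #gpu.address_space<workgroup>} : tensor<1x4x8x8x16xf32>
// Pack a hypothetical 1x64x8x8 NCHW input tile into the staging tensor, tiling the
// channel dimension by 16 so the innermost dimension matches the MFMA layout.
%staged = tensor.pack %input_tile inner_dims_pos = [1] inner_tiles = [16]
    into %staging : tensor<1x64x8x8xf32> -> tensor<1x4x8x8x16xf32>
```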