[Draft] PR to track changes in SDXL. #16854
base: main
Conversation
Insert slices always fold into the `flow.dispatch.tensor.store` ops and can be fused with all producers.
Enables certain transpose fusions. Handles this case by swapping the operands of the contraction and transposing the result. Does not change any default behavior for SDXL because this path is not yet exercised.
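As background, a minimal sketch (plain Python, not IREE code) of the linear-algebra identity such a rewrite relies on: a transpose of a contraction's result can be absorbed by swapping the contraction's operands and transposing them instead.

```python
# Illustrative only: transpose(matmul(A, B)) == matmul(transpose(B), transpose(A)),
# so a trailing transpose can be folded away by swapping the contraction
# operands (with the operand transposes absorbed into the indexing maps).

def matmul(a, b):
    """Naive matmul on nested lists; requires len(a[0]) == len(b)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

A = [[1, 2, 3], [4, 5, 6]]        # 2x3
B = [[7, 8], [9, 10], [11, 12]]   # 3x2

# Original pattern: contraction followed by a transpose of the result.
original = transpose(matmul(A, B))
# Rewritten pattern: swap the operands and transpose them instead.
rewritten = matmul(transpose(B), transpose(A))

assert original == rewritten
```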
…16748) This allows converting convolutions with >1 batch size and 1x1 filter to matmuls.
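A sketch of why this works (plain Python with nested lists; function names are illustrative, not IREE APIs): a 1x1-filter convolution applies the same channel-mixing matrix at every (batch, h, w) position, so N, H, and W can all be collapsed into the matmul's M dimension, which handles batch sizes greater than 1.

```python
# Hypothetical sketch: 1x1 convolution in NHWC layout as a matmul.

def conv2d_1x1(inp, filt):
    """inp: [N][H][W][C], filt: [C][F] (a 1x1xCxF filter squeezed to CxF)."""
    N, H, W, C = len(inp), len(inp[0]), len(inp[0][0]), len(inp[0][0][0])
    F = len(filt[0])
    return [[[[sum(inp[n][h][w][c] * filt[c][f] for c in range(C))
               for f in range(F)]
              for w in range(W)]
             for h in range(H)]
            for n in range(N)]

def conv2d_1x1_as_matmul(inp, filt):
    """Collapse (N, H, W) into one M dimension, matmul, then expand back."""
    N, H, W = len(inp), len(inp[0]), len(inp[0][0])
    flat = [inp[n][h][w] for n in range(N) for h in range(H) for w in range(W)]
    out = [[sum(row[c] * filt[c][f] for c in range(len(row)))
            for f in range(len(filt[0]))]
           for row in flat]
    it = iter(out)
    return [[[next(it) for _ in range(W)] for _ in range(H)] for _ in range(N)]

inp = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]],         # batch 0: HxWxC = 2x2x2
       [[[9, 10], [11, 12]], [[13, 14], [15, 16]]]]  # batch 1
filt = [[1, 0, 2], [0, 1, 3]]                        # C=2 -> F=3

assert conv2d_1x1(inp, filt) == conv2d_1x1_as_matmul(inp, filt)
```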
Repurpose the workgroup swizzling pass to do more general workgroup reordering. Add a filter function to run in MMA pipelines only. Do not use workgroup counts from the runtime, as these don't currently work on ROCm.
Co-authored-by: MaheshRavishankar <mahesh@nod-labs.com> Co-authored-by: MaheshRavishankar <1663364+MaheshRavishankar@users.noreply.github.com>
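A hypothetical sketch of the kind of workgroup reordering such a pass performs (plain Python; the grouping scheme and names are assumptions, not IREE's actual mapping): instead of plain row-major order, visit the grid in narrow bands of rows so that workgroups likely to share data execute close together in time.

```python
# Illustrative only: reorder a 2D workgroup grid so that `group_size`
# rows are swept together, column by column, improving locality between
# consecutively launched workgroups.

def reorder(grid_x, grid_y, group_size):
    """Return the (x, y) visit order with rows processed in bands."""
    order = []
    for y0 in range(0, grid_y, group_size):
        rows = range(y0, min(y0 + group_size, grid_y))
        for x in range(grid_x):     # sweep all columns within the band
            for y in rows:          # visit each row of the band per column
                order.append((x, y))
    return order

order = reorder(2, 4, 2)
# The reordering is a permutation of the full grid.
assert sorted(order) == [(x, y) for x in range(2) for y in range(4)]
```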
This yields the same performance on the full model but has lower overhead on isolated microbenchmarks.
It excludes the filter from the first level of tiling, promotes images, and tiles the filter.
This adds a winograd pipeline for LLVMGPU. The `--iree-codegen-winograd-use-forall` flag is needed to get distribution on input and output transforms. --------- Co-authored-by: harsh <harsh@nod-labs.com>
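As background on what a Winograd pipeline computes, here is a minimal 1D F(2,3) sketch in plain Python (not IREE's implementation): the input and filter are moved into a transformed domain where the convolution becomes an elementwise product, reducing 6 multiplies to 4 for each pair of outputs.

```python
# Illustrative only: Winograd F(2,3) for 1D convolution.

def winograd_f23(d, g):
    """Two outputs of a 1D conv of a 4-element input tile `d` with a
    3-tap filter `g`, using 4 multiplies instead of 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: u = G @ g
    u = (g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2)
    # Input transform: v = B^T @ d
    v = (d0 - d2, d1 + d2, d2 - d1, d1 - d3)
    # Elementwise (Hadamard) product in the transformed domain
    m = [ui * vi for ui, vi in zip(u, v)]
    # Output transform: y = A^T @ m
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]

def direct_conv(d, g):
    """Reference: direct sliding-window convolution (6 multiplies)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 2.0]
assert winograd_f23(d, g) == direct_conv(d, g)
```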
Additionally adds a flag to control promotion of the filter. Co-authored-by: MaheshRavishankar <mahesh@nod-labs.com>
Consider batch size in the heuristic. This is so that we do not create allocas. Co-authored-by: Jakub Kuderski <jkudersk@amd.com>
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
The CLA check is complaining about a commit from @kuhar, FYI: https://github.com/openxla/iree/pull/16854/checks?check_run_id=22897648244 (check which email address you use as your default / to commit changes)
This allows for importing all or some parameters from a parameter file into the compiler. Currently only one import can be specified with the flag, but we could extend that to multiple in the future by following the same flag conventions as the runtime tooling. If a scope is provided, only parameters with that scope will be imported. If a parameter (optionally within a scope) is named explicitly, it will be imported. If a maximum size is specified, all parameters <= that size will be imported. This also renames the export flags, as they are inconsistent. --------- Co-authored-by: Ben Vanik <ben.vanik@gmail.com>
Thanks for the heads up, I have to move across machines frequently and this was a one-off. Feel free to update it before landing on main. My main commit email is jakub@nod-labs.com.
We are observing that it is always better to promote the filter, so turn it on by default.
Not planning to land this. Just a place to see all the commits in the branch w.r.t. main
The smallest bounding box inference could fail even though the op is really bounded by tile sizes. We provide an option to use the smallest bounding values as a fallback: shark-infra/llvm-project@55ff42c To enable the new pipeline, we add `--iree-codegen-llvmgpu-use-vector-distribution` to the `iree-compile` tool. Full IR dump: https://gist.github.com/hanhanW/d9ee3111c5f86b0e7ad7ebdac46fe7c9 --------- Co-authored-by: Kunwar Grover <groverkss@gmail.com>
No description provided.