@LeiWang1999 commented Jul 30, 2024

As we delved deeper into the contiguous-batching optimizations for mixed-precision GEMM, a crucial insight emerged: performing dequantization at the warp tile level conserves memory bandwidth, though it introduces a small amount of computational overhead. To facilitate this, we must improve the lower-warp-memory pass, as TVM struggles to manage warp memory in the presence of decode intrinsics.
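To make the idea concrete, here is a minimal, hypothetical sketch (plain Python, not the TVM pass or the actual LOP3 intrinsic) of the decode step each warp would perform on its own register fragment: unpacking two 4-bit weights per byte and dequantizing them in place, rather than round-tripping full-precision weights through shared memory. The zero-point of 8 and the function name are illustrative assumptions, not taken from this PR.

```python
def dequant_warp_tile(packed_bytes, scale):
    """Unpack two uint4 weights per byte and dequantize to float.

    Conceptual model of warp-level dequantization: each warp decodes only
    its own tile fragment in registers, so the memory system moves the
    packed 4-bit data instead of full-precision values. Assumes symmetric
    uint4 quantization with an illustrative zero-point of 8.
    """
    out = []
    for b in packed_bytes:
        lo = b & 0x0F          # low nibble: first 4-bit weight
        hi = (b >> 4) & 0x0F   # high nibble: second 4-bit weight
        out.append((lo - 8) * scale)
        out.append((hi - 8) * scale)
    return out
```

This is the bandwidth/compute trade-off described above: the packed representation halves (or better) the bytes moved per weight, at the cost of a few extra ALU operations per element at decode time.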

This pull request implements this optimization, and we can now codegen mixed-precision GEMM with warp-level dequantization. There are still some TODO items that should be resolved in future development before this optimization is officially integrated.

## TODO

  • Introduce Transform Propagation Level 3, which also enables weight propagation to eliminate the ldmatrix instruction.
  • Check the correctness of Weight Propagation Stage 3.
  • Optimize the design of the LOP3 tensor intrinsics: we now support not only local scope but also warp scope, and the buffer-slot implementation should be converted into pointers with dynamic offsets instead of Var.

LeiWang1999 and others added 28 commits July 23, 2024 09:23