
First pass in simplifying and improving qmm #1030

Merged
merged 3 commits into main from qmm on Apr 24, 2024
Conversation

@angeloskath (Member)

Extracted the loading and dequantizing logic into a reusable QuantizedBlockLoader à la steel. Also removed the reading of scales/biases into smem and reduced the tile size, which provides a pretty big speedup.

Before, on Mistral 7B 4-bit with group size 64:

1000 token prompt: 880 tps
QLoRA validation: 12.8s
QLoRA training: ~533 tps

After:

1000 token prompt: 1100 tps
QLoRA validation: 9.97s
QLoRA training: ~653 tps

All in all, a pretty consistent 20%-25% speedup.
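
For context, here is a minimal CPU-side sketch of the affine dequantization the loader performs. This is a hypothetical helper, not the kernel code; it assumes MLX's w = scale * q + bias scheme with one scale/bias pair per group_size consecutive weights and 32 / bits values packed per uint32_t:

  #include <cstdint>
  #include <vector>

  // Hypothetical CPU-side sketch of the per-group dequantization that
  // the QuantizedBlockLoader performs in tiles on the GPU. Assumes the
  // affine scheme w = scale * q + bias, with one (scale, bias) pair per
  // `group_size` consecutive weights.
  template <typename T, int bits = 4, int group_size = 64>
  std::vector<T> dequantize_row(
      const std::vector<uint32_t>& packed, // 32 / bits values per element
      const std::vector<T>& scales,        // one scale per group
      const std::vector<T>& biases) {      // one bias per group
    constexpr int per_word = 32 / bits;          // e.g. 8 values for 4-bit
    constexpr uint32_t mask = (1u << bits) - 1;  // e.g. 0xF for 4-bit
    std::vector<T> out(packed.size() * per_word);
    for (size_t i = 0; i < out.size(); ++i) {
      uint32_t q = (packed[i / per_word] >> (bits * (i % per_word))) & mask;
      size_t g = i / group_size; // group this weight belongs to
      out[i] = scales[g] * static_cast<T>(q) + biases[g];
    }
    return out;
  }

For the 4-bit, group size 64 configuration benchmarked above, that is 8 values per uint32_t and one scale/bias pair per 64 weights; the extracted loader does roughly this arithmetic per thread while streaming BK-wide tiles of W into threadgroup memory.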

Comment on lines 647 to 655

// Instantiate the appropriate BlockMMA and Loader
using mma_t = mlx::steel::BlockMMA<T, T, BM, BN, BK, WM, WN, false, true, BK, BK>;
using loader_x_t = mlx::steel::BlockLoader<T, BM, BK, BK, 1, WM * WN * SIMD_SIZE>;
using loader_w_t = QuantizedBlockLoader<T, BN, BK, BK, 1, WM * WN * SIMD_SIZE, group_size, bits>;

threadgroup T Xs[BM * BK];
threadgroup T Ws[BN * BK];

@jagrit06 (Member)
Changing the leading dim of the threadgroup memory should help with bank conflicts. Doing so should be as simple as:

  constexpr int BK_padded = (BK + 16 / sizeof(T));

  // Instantiate the appropriate BlockMMA and Loader
  using mma_t = mlx::steel::BlockMMA<T, T, BM, BN, BK, WM, WN, false, true, BK_padded, BK_padded>;
  using loader_x_t = mlx::steel::BlockLoader<T, BM, BK, BK_padded, 1, WM * WN * SIMD_SIZE>;
  using loader_w_t = QuantizedBlockLoader<T, BN, BK, BK_padded, 1, WM * WN * SIMD_SIZE, group_size, bits>;

  threadgroup T Xs[BM * BK_padded];
  threadgroup T Ws[BN * BK_padded];
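
(General shared-memory reasoning, not MLX-specific: with a power-of-two row stride like BK, the same column of consecutive rows maps to the same memory bank, so column-wise accesses from a SIMD group serialize. Padding each row by 16 bytes, i.e. 16 / sizeof(T) elements (8 halfs or 4 floats), staggers those addresses across banks.)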

@awni (Member) left a comment

Wow 20% bump for prompt and QLoRA! Awesome!!

@angeloskath (Member, Author)

After @jagrit06's improvements to avoid memory bank conflicts and hoist the checks out of the loop (sketched below), we now get:

1000 token prompt: 1230 tps
QLoRA validation: 8.99s
QLoRA training: ~725 tps

for a total improvement of 30%-40%!
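
A rough sketch of the check-hoisting idea (my illustration, not the actual kernel; load_unchecked/load_checked are hypothetical stand-ins for the loaders' fast and guarded paths, in the spirit of steel's load_unsafe/load_safe):

  #include <functional>

  // Hypothetical sketch of hoisting bounds checks out of the hot K-loop:
  // every complete BK-wide tile takes an unchecked fast path, and only
  // the single ragged tail tile pays for guarded loads.
  void k_loop(
      int K,
      int BK,
      const std::function<void(int)>& load_unchecked,      // fast path
      const std::function<void(int, int)>& load_checked) { // guarded path
    int k_full = (K / BK) * BK; // end of the last complete tile
    for (int k = 0; k < k_full; k += BK) {
      load_unchecked(k); // no per-element bounds checks in this loop
    }
    if (k_full < K) {
      load_checked(k_full, K - k_full); // only the tail is checked
    }
  }

The point is that the branch runs once per matrix edge rather than once per iteration, keeping the inner loop branch-free.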

@awni (Member) commented Apr 24, 2024

Incredible!! 🔥

@jagrit06 (Member) left a comment

Let’s gooo

angeloskath merged commit 20a01bb into main on Apr 24, 2024
3 checks passed
angeloskath deleted the qmm branch on April 24, 2024
Rifur13 pushed a commit to Rifur13/mlx that referenced this pull request on Apr 24, 2024
@ivanfioravanti
WOW, just WOW
