More ggml cuda kernels #1977
Conversation
@LaurentMazare, this is a funny coincidence, I was just about to open a PR for this. Thanks for the work! I observed a 1.9x speed decrease when forcing llama.cpp to also use dmmv, and it was slower than mistral.rs then. I think that after switching to mmvq, QMatMul will be much faster.
candle-core/src/quantized/cuda.rs
Outdated
pub struct QCudaStorage {
    data: CudaSlice<u8>,
    dtype: GgmlDType,
    device: CudaDevice,
}

pub const FORCE_DMMV: bool = false;
Could this maybe be an environment variable, or ideally influenced by the compute cap? The __dp4a intrinsic is only supported on CC >= 610. I'm not sure what the current minimum compute cap for Candle is, but it seems like it would be better to avoid increasing it?
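For illustration, a minimal sketch of the environment-variable route (the CANDLE_FORCE_DMMV name and the helper below are hypothetical, not something this PR adds):

```rust
use std::env;
use std::sync::OnceLock;

/// Hypothetical override: fall back to the dmmv kernels when the
/// CANDLE_FORCE_DMMV environment variable is set to "1" or "true".
/// The variable name is illustrative only.
fn force_dmmv() -> bool {
    static FORCE: OnceLock<bool> = OnceLock::new();
    *FORCE.get_or_init(|| {
        env::var("CANDLE_FORCE_DMMV")
            .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
            .unwrap_or(false)
    })
}
```

Reading the variable once and caching the result keeps the check off the hot path.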
Looks like __nv_bfloat16 requires CC >= 800, so this is not a problem.
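Putting the two thresholds together, the kernel choice could in principle be gated like this (a sketch only; the compute capability would come from a device query, here it is just passed in as a parameter):

```rust
/// Thresholds mentioned in this thread, in the usual major*100 + minor*10
/// encoding: __dp4a needs CC >= 610, __nv_bfloat16 needs CC >= 800.
const MIN_CC_DP4A: usize = 610;
const MIN_CC_BF16: usize = 800;

/// Prefer the mmvq (q8_1 / __dp4a) path only when the device supports it
/// and no explicit dmmv override is in effect.
fn prefer_mmvq(compute_cap: usize, force_dmmv: bool) -> bool {
    // Since the backend already relies on bf16 (CC >= 800), the dp4a
    // requirement never raises the effective minimum compute cap.
    debug_assert!(MIN_CC_BF16 >= MIN_CC_DP4A);
    !force_dmmv && compute_cap >= MIN_CC_DP4A
}
```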
Seems pretty much in line with what I'm seeing. In my early testing I get a speedup from 35 token/s to 60.8 token/s, vs 55.0 token/s for llama.cpp; not sure why it actually outperforms it, but that seems pretty promising.
It seems plausible that the fact that
Well, it's pretty clear why it outperforms the dmmv version: q8_1 is much more efficient than using an f32. But I'm more wondering how it could outperform the llama.cpp implementation overall.
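To make the q8_1 vs f32 point concrete, here is a much simplified CPU-side sketch (block layout, scale handling and the actual __dp4a packing are all elided; none of this is the PR's kernel code):

```rust
/// Simplified 32-element block, roughly in the spirit of ggml's q8_1:
/// one f32 scale plus 32 signed 8-bit quants.
struct BlockQ8 {
    scale: f32,
    qs: [i8; 32],
}

/// dmmv-style inner loop: dequantize the weights to f32 and multiply
/// against f32 activations, one float multiply-add per element.
fn dot_dmmv(w: &BlockQ8, x: &[f32; 32]) -> f32 {
    w.qs.iter().zip(x).map(|(&q, &xv)| q as f32 * w.scale * xv).sum()
}

/// mmvq-style inner loop: the activations have also been quantized to a
/// q8_1-like block, so the hot loop is an integer dot product (on the GPU
/// this maps onto __dp4a, four 8-bit MACs per instruction) with a single
/// scale multiplication at the end.
fn dot_mmvq(w: &BlockQ8, x: &BlockQ8) -> f32 {
    let acc: i32 = w.qs.iter().zip(&x.qs).map(|(&a, &b)| a as i32 * b as i32).sum();
    acc as f32 * w.scale * x.scale
}
```

The dmmv path pays a float multiply-add per weight, while the mmvq path does the bulk of the work in 8-bit integer arithmetic and only touches the scales once per block, which is where the speedup comes from.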
Thank you for the excellent work!
These kernels are experimental for now; they can be tried out in the quantized example via the --fast-cuda flag. As a benchmark, this goes from 33 token/s to 63 token/s with the new kernels, whereas llama.cpp is at ~55 token/s using the same model (mistral-7b-v0.1.Q4_K_S.gguf from TheBloke).
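For anyone trying this out, the invocation would look something like cargo run --release --example quantized --features cuda -- --fast-cuda; the --fast-cuda flag comes from this PR, while the exact feature flag for the CUDA build is an assumption here.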