
More ggml cuda kernels #1977

Merged · 12 commits merged into main on Mar 31, 2024
Conversation

LaurentMazare (Collaborator) commented on Mar 31, 2024:

These kernels are experimental for now; they can be tried out in the quantized example via the --fast-cuda flag.

As a benchmark,

cargo run --example quantized --profile=release-with-debug --features cuda -- --prompt "Building a website can be done in 10 simple steps:\nStep 1:" -n 100 --which 7b-mistral --fast-cuda

This goes from 33 token/s to 63 token/s with the new kernels, whereas llama.cpp is at ~55 token/s using the same model (mistral-7b-v0.1.Q4_K_S.gguf from TheBloke).

EricLBuehler (Contributor) commented on Mar 31, 2024:

@LaurentMazare, this is a funny coincidence: I was just about to open a PR for this. Thanks for the work!

I observed a 1.9x slowdown when forcing llama.cpp to also use dmmv, and it was then slower than mistral.rs. I think QMatMul will be much faster once it uses mmvq.


// Quantized tensor storage kept on the GPU as raw bytes, together with its
// ggml quantization dtype and the device it lives on.
pub struct QCudaStorage {
    data: CudaSlice<u8>,
    dtype: GgmlDType,
    device: CudaDevice,
}

// When true, fall back to the dequantize-based (dmmv) matmul kernels instead
// of the quantized (mmvq) path.
pub const FORCE_DMMV: bool = false;
Contributor:

Could this maybe be an environment variable, or ideally be influenced by the compute capability? The __dp4a intrinsic is only supported on compute capability 6.1 and above. I'm not sure what the current minimum compute capability for Candle is, but it seems like it would be better to avoid increasing it.

Contributor:

Looks like __nv_bfloat16 already requires compute capability 8.0 or above, so this is not a problem.
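To illustrate the environment-variable idea raised above, here is a minimal sketch, not code from this PR, of reading the flag at runtime instead of baking it in at compile time; the CANDLE_FORCE_DMMV variable name is hypothetical, and a compute-capability check could be folded into the same helper.

use std::sync::OnceLock;

// Returns true when the user explicitly requests the dmmv kernels,
// e.g. by setting CANDLE_FORCE_DMMV=1 (hypothetical variable name).
pub fn force_dmmv() -> bool {
    static FLAG: OnceLock<bool> = OnceLock::new();
    *FLAG.get_or_init(|| {
        std::env::var("CANDLE_FORCE_DMMV")
            .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
            .unwrap_or(false)
    })
}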

LaurentMazare (Collaborator, Author):

> I observed a 1.9x slowdown when forcing llama.cpp to also use dmmv, and it was then slower than mistral.rs. I think QMatMul will be much faster once it uses mmvq.

Seems pretty much in line with what I'm seeing. In my early testing I get a speedup from 35 token/s to 60.8 token/s vs 55.0 token/s for llama.cpp. Not sure why it actually outperforms it, but that seems pretty promising.

EricLBuehler (Contributor):

> Not sure why it actually outperforms it, but that seems pretty promising.

It seems plausible that this is because dmmv has to call a dequantize method many times per matmul. The use of the __dp4a intrinsic in mmvq may also be playing a role.

LaurentMazare (Collaborator, Author):

Well, it's pretty clear why it outperforms the dmmv version: q8_1 is much more efficient than using f32. What I'm more wondering about is how it could outperform the llama.cpp implementation overall.
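To make the q8_1-versus-f32 point concrete, here is an illustrative Rust sketch of the two dot-product strategies for a single 32-element block; it is not candle's kernel code, and the block layout is simplified relative to the real ggml q8_1 format.

// Illustrative block: 32 signed 8-bit quants plus one f32 scale per block.
struct BlockQ8 {
    scale: f32,
    quants: [i8; 32],
}

// dmmv-style: dequantize every weight to f32, then do an f32 dot product
// against the f32 activations.
fn dot_dequantize_f32(w: &BlockQ8, x: &[f32; 32]) -> f32 {
    let mut acc = 0f32;
    for i in 0..32 {
        acc += (w.quants[i] as f32 * w.scale) * x[i];
    }
    acc
}

// mmvq-style: quantize the activations to 8 bits once (with their own scale),
// accumulate in integers, and apply both scales a single time per block. On
// the GPU, the inner i32 accumulation maps onto __dp4a, four lanes at a time.
fn dot_quantized_i8(w: &BlockQ8, x: &BlockQ8) -> f32 {
    let mut acc = 0i32;
    for i in 0..32 {
        acc += w.quants[i] as i32 * x.quants[i] as i32;
    }
    acc as f32 * w.scale * x.scale
}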

LaurentMazare merged commit cd29c7c into main on Mar 31, 2024 (10 checks passed) and deleted the more-ggml-cuda-kernels branch on Mar 31, 2024 at 22:15.
EricLBuehler (Contributor):

Thank you for the excellent work!
