More ggml cuda kernels #1977
Conversation
@LaurentMazare, this is a funny coincidence, I was just about to open a PR for this. Thanks for the work! I observed a 1.9x speed decrease when forcing llama.cpp to also use dmmv, and it was slower than mistral.rs then. I think that after switching to mmvq, QMatMul will be much faster.
candle-core/src/quantized/cuda.rs
Outdated
pub struct QCudaStorage {
    data: CudaSlice<u8>,
    dtype: GgmlDType,
    device: CudaDevice,
}

pub const FORCE_DMMV: bool = false;
Could this maybe be an environment variable, or ideally influenced by the compute cap? The __dp4a intrinsic is only supported on CC >= 610. I'm not sure what the current minimum compute cap for Candle is, but it seems like it would be better to avoid increasing it?
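For illustration, a minimal sketch of the environment-variable route (the CANDLE_FORCE_DMMV name and the helper below are hypothetical, not something this PR adds):

```rust
use std::env;
use std::sync::OnceLock;

/// Hypothetical override: fall back to the dmmv kernels when the
/// CANDLE_FORCE_DMMV environment variable is set to "1" or "true".
/// The variable name is illustrative only.
fn force_dmmv() -> bool {
    static FORCE: OnceLock<bool> = OnceLock::new();
    *FORCE.get_or_init(|| {
        env::var("CANDLE_FORCE_DMMV")
            .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
            .unwrap_or(false)
    })
}
```

Reading the variable once and caching the result keeps the check off the hot path.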
Looks like __nv_bfloat16 requires CC >= 800, so this is not a problem.
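Putting the two thresholds together, the kernel choice could in principle be gated like this (a sketch only; the compute capability would come from a device query, here it is just passed in as a parameter):

```rust
/// Thresholds mentioned in this thread, in the usual major*100 + minor*10
/// encoding: __dp4a needs CC >= 610, __nv_bfloat16 needs CC >= 800.
const MIN_CC_DP4A: usize = 610;
const MIN_CC_BF16: usize = 800;

/// Prefer the mmvq (q8_1 / __dp4a) path only when the device supports it
/// and no explicit dmmv override is in effect.
fn prefer_mmvq(compute_cap: usize, force_dmmv: bool) -> bool {
    // Since the backend already relies on bf16 (CC >= 800), the dp4a
    // requirement never raises the effective minimum compute cap.
    debug_assert!(MIN_CC_BF16 >= MIN_CC_DP4A);
    !force_dmmv && compute_cap >= MIN_CC_DP4A
}
```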
Seems pretty much in line with what I'm seeing. In my early testing I get a speedup from 35 token/s to 60.8 token/s, vs 55.0 token/s for llama.cpp; not sure why it actually outperforms it, but that seems pretty promising.
It seems plausible that the fact that
Well, it's pretty clear why it outperforms the dmmv version: q8_1 is much more efficient than using an f32. But I'm more wondering how it could outperform the llama.cpp implementation overall.
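To make the q8_1 vs f32 point concrete, here is a much simplified CPU-side sketch (block layout, scale handling and the actual __dp4a packing are all elided; none of this is the PR's kernel code):

```rust
/// Simplified 32-element block, roughly in the spirit of ggml's q8_1:
/// one f32 scale plus 32 signed 8-bit quants.
struct BlockQ8 {
    scale: f32,
    qs: [i8; 32],
}

/// dmmv-style inner loop: dequantize the weights to f32 and multiply
/// against f32 activations, one float multiply-add per element.
fn dot_dmmv(w: &BlockQ8, x: &[f32; 32]) -> f32 {
    w.qs.iter().zip(x).map(|(&q, &xv)| q as f32 * w.scale * xv).sum()
}

/// mmvq-style inner loop: the activations have also been quantized to a
/// q8_1-like block, so the hot loop is an integer dot product (on the GPU
/// this maps onto __dp4a, four 8-bit MACs per instruction) with a single
/// scale multiplication at the end.
fn dot_mmvq(w: &BlockQ8, x: &BlockQ8) -> f32 {
    let acc: i32 = w.qs.iter().zip(&x.qs).map(|(&a, &b)| a as i32 * b as i32).sum();
    acc as f32 * w.scale * x.scale
}
```

The dmmv path pays a float multiply-add per weight, while the mmvq path does the bulk of the work in 8-bit integer arithmetic and only touches the scales once per block, which is where the speedup comes from.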
Thank you for the excellent work!
These kernels are experimental for now; they can be tried out in the quantized example via the --fast-cuda flag. As a benchmark, this goes from 33 token/s to 63 token/s with the new kernels, whereas llama.cpp is at ~55 token/s using the same model (mistral-7b-v0.1.Q4_K_S.gguf from TheBloke).
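For anyone trying this out, the invocation would look something like cargo run --release --example quantized --features cuda -- --fast-cuda; the --fast-cuda flag comes from this PR, while the exact feature flag for the CUDA build is an assumption here.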