1.58 bit implementation #1956
Would it be possible to implement 1.58-bit quantization in candle? It was proposed in the following paper:
https://arxiv.org/pdf/2402.17764.pdf
The main inspiration behind a 1.58-bit implementation is that matrix multiplication can be replaced with addition: with weights restricted to {-1, 0, +1}, every product term reduces to an add, a subtract, or a skip. If that is feasible, then with the SIMD instructions in Apple's Accelerate framework we could expect faster training and inference for large language models.
A couple of llama.cpp discussions here:
ggerganov/llama.cpp#5761
ggerganov/llama.cpp#5999
There is also a training library which was released a couple of days ago:
https://github.com/rafacelente/bllama
Any thoughts?
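For context, "1.58 bit" refers to log2(3) ≈ 1.58: each weight carries one of three values, -1, 0, or +1. The paper quantizes with an absmean rule, dividing each weight by the mean absolute value of the tensor and then rounding and clipping to that ternary set. Below is a minimal sketch of that quantizer in plain Rust (no candle dependency; the function name and the epsilon value are my own choices, not from any library):

```rust
/// Absmean quantization as described in the BitNet b1.58 paper: scale the
/// weight tensor by the mean of its absolute values, then round each entry
/// and clip it to the ternary set {-1, 0, +1}.
/// Returns the ternary weights plus the scale needed to dequantize.
fn absmean_quantize(weights: &[f32]) -> (Vec<i8>, f32) {
    // gamma = mean(|w|); the small epsilon guards against division by zero.
    let gamma =
        weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32 + 1e-6;
    let ternary = weights
        .iter()
        .map(|&w| (w / gamma).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (ternary, gamma)
}

fn main() {
    let w = [0.9_f32, -0.04, 0.33, -1.2, 0.02, 0.7];
    let (q, scale) = absmean_quantize(&w);
    // Each original weight is approximated by q[i] as f32 * scale.
    println!("ternary = {:?}, scale = {}", q, scale);
}
```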
Comments
Are there some reference trained models somewhere? I haven't been able to find any so far.
Apparently this one trains a 54M-parameter model from scratch: https://github.com/pranavjad/tinyllama-bitnet. And this one is a pretty good quantization technique which retains model performance; they have also released the model weights: https://mobiusml.github.io/1bit_blog/. What is more interesting to me is the replacement of matrix multiplication with addition, leading to significant performance gains (see the sketch at the end of this thread).
And the official models are here.
Not sure how close to complete this is, but @tomsanbear has put up bitnet-rs, which seems to be a candle implementation of this architecture.
Thanks @LaurentMazare, this is super helpful.
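Picking up the comment above about replacing matrix multiplication with addition: once the weights are ternary, a matrix-vector product needs no weight multiplies at all, only adds and subtracts. Here is a scalar sketch in plain Rust that pairs with the quantizer earlier in the thread (names are mine; this is an illustration of the idea, not candle's or bitnet-rs's actual kernel):

```rust
/// Matrix-vector product with a ternary weight matrix (row-major,
/// entries in {-1, 0, +1}). No multiplications are needed for the
/// weights: +1 adds the activation, -1 subtracts it, 0 skips it.
fn ternary_matvec(weights: &[i8], x: &[f32], rows: usize, cols: usize, scale: f32) -> Vec<f32> {
    assert_eq!(weights.len(), rows * cols);
    assert_eq!(x.len(), cols);
    let mut y = Vec::with_capacity(rows);
    for row in weights.chunks_exact(cols) {
        let mut acc = 0.0_f32;
        for (&w, &xi) in row.iter().zip(x.iter()) {
            match w {
                1 => acc += xi,  // +1 -> addition
                -1 => acc -= xi, // -1 -> subtraction
                _ => {}          //  0 -> the term vanishes entirely
            }
        }
        // One real multiply per output element, to undo the absmean scale.
        y.push(acc * scale);
    }
    y
}
```

A production kernel would additionally bit-pack the ternary weights (2 bits each) and vectorize the inner loop; that packing plus the absence of weight multiplies is where the memory and throughput gains discussed in the paper come from.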