The results are very different in 4-bit and 16-bit/8-bit modes #81
Comments
I think it's because of 4-bit quantization loss, so they are effectively different models, especially the 4-bit model, considering that it is not RTN based. The current SOTA 4-bit model reaches 5.85 perplexity on WikiText-2, while the fp16 version reaches 5.68, so the full-precision model is still better than the quantized one.
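For context, here is a minimal sketch of what plain round-to-nearest (RTN) 4-bit quantization does to a weight tensor; GPTQ-style methods instead correct the rounding error layer by layer, which is why they are usually more accurate than RTN at the same bit width. All names and sizes below are illustrative, not anyone's actual implementation:

```python
# Minimal sketch of round-to-nearest (RTN) 4-bit quantization with
# per-group scales, to show where quantization loss comes from.
import torch

def rtn_quantize_4bit(w: torch.Tensor, group_size: int = 128):
    """Quantize a flat weight tensor to 4-bit integers per group."""
    w = w.reshape(-1, group_size)
    # Per-group asymmetric range: map [w_min, w_max] onto 16 levels.
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 2**4 - 1 levels
    q = torch.clamp(torch.round((w - w_min) / scale), 0, 15)
    return q, scale, w_min

def dequantize(q, scale, w_min):
    return q * scale + w_min

w = torch.randn(4096 * 128)  # fake weight tensor
q, s, z = rtn_quantize_4bit(w)
err = (dequantize(q, s, z).reshape(-1) - w).abs().mean()
print(f"mean absolute quantization error: {err:.5f}")
```

The rounding error shown here is exactly the discrepancy that shows up as higher perplexity, so some divergence in generations is expected.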
That makes perfect sense, thanks for the reply. I was worried that I might have done something wrong while adapting the web UI to work with the code here. I'll do some final reviewing and then approve and document my own PR. 4-bit + LoRA seems to be the holy grail of consumer-grade LLM inference at the moment, and it's nice to see it working. Thanks for your work.
May I ask how to use 8-bit?
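Not confirmed by this thread, but the usual route is the `load_in_8bit` option in transformers (backed by bitsandbytes), which, if I understand correctly, is also what text-generation-webui's `--load-in-8bit` flag uses under the hood. A minimal sketch, with an example checkpoint name:

```python
# Load a model with 8-bit weights via transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-7b-hf"  # example checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # requires the bitsandbytes package
    device_map="auto",   # place layers on available GPUs automatically
)
```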
Using this prompt, I get these results for the `tloen/alpaca-lora-7b` LoRA applied on top of llama-7b. In all cases, the generation uses `do_sample=False` for greedy sampling. The 4-bit model used is `llama-7b-4bit-128g`. The code that I am using is the one in this PR: oobabooga/text-generation-webui#1200
Is this difference something to worry about? In all my tests, the 4-bit results diverge a lot from the 16-bit/8-bit results.
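For reference, a sketch of the fp16/8-bit baseline described above: llama-7b with the `tloen/alpaca-lora-7b` adapter loaded through PEFT, decoded greedily with `do_sample=False`. The base checkpoint name and the prompt are placeholders, since the original prompt is not shown here:

```python
# fp16 baseline: llama-7b + tloen/alpaca-lora-7b adapter, greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "decapoda-research/llama-7b-hf"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, device_map="auto"
)
# Apply the LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")

prompt = "Write a poem about the sea."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

With greedy sampling the fp16 and 8-bit outputs should match closely, so any remaining divergence is attributable to the 4-bit quantization itself.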