Transformers Loader: 4-bit and 8-bit loading of c4ai-command-r-plus results in nonsense responses. #5838
Comments
With latest transformers and bnb, does it do it too?
@Ph0rk0z Good question! I installed the latest bitsandbytes and got the following install error:
The install did complete however, and when I tried loading the model via 4-bit I got this error:

```
09:50:24-061953 INFO     Starting Text generation web UI
Running on local URL: http://127.0.0.1:7860
ERROR:    Exception in ASGI application
```

Looks like someone else is getting the same error:
When I set `share=True` in the server.py code and have the latest bitsandbytes installed, I get this error:

```
0:07:44-841486 INFO     Starting Text generation web UI
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://6faaae8194bc7c77ae.gradio.live
This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run
Loading checkpoint shards:   2%|▌ | 1/44 [00:01<00:47, 1.10s/it]
```
I haven't been able to run command-r-plus in 4-bit bnb format because I don't have enough memory, but the prequantized c4ai-command-r-v01-4bit works very well for me. Does c4ai-command-r-plus-4bit generate coherent results?
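For context, a minimal sketch of what loading that prequantized checkpoint looks like in plain transformers; the repo id comes from the comment above, and the call pattern is an assumption about what the loader does, not a confirmed excerpt of it:

```python
# Hedged sketch of loading a prequantized bitsandbytes checkpoint.
# The quantization config ships inside the checkpoint's config.json,
# so no BitsAndBytesConfig needs to be passed here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/c4ai-command-r-v01-4bit"  # repo id assumed from the comment
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```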
:3 Hello! I'm currently downloading it and will reply back with the result, but I suspect that if it's working for you it will work for me too. I was reading the code at that link, and it looks like the same thing textgen does when doing 4-bit and 8-bit conversions on the fly with the transformers loader, so maybe the bug isn't in textgen. Either way, I'll let you know about the 4-bit version when it finishes downloading.

Also, I wanted to let you know I've got the new Mixtral and DBRX models running in textgen with exllamav2 quants. Mixtral quantized without issue using the transformers loader on day one, but DBRX has a weird issue where it tries to load onto one GPU regardless of how I use bitsandbytes. I'm forced to run that one with exllamav2 quants only; I also tried the code from the Databricks repo, and the memory loading behavior is the same.

Thank you so much for responding to the new model releases, I really appreciate your work <3
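On the DBRX single-GPU issue, one hedged workaround sketch is pinning per-device limits with `max_memory` so accelerate has to shard the weights across cards; the memory caps and repo id below are placeholders/assumptions:

```python
# Hedged sketch: force sharding across GPUs with explicit per-device limits.
# The 20GiB caps are placeholders; tune them to your cards.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-instruct",  # repo id assumed from the comment
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB", 2: "20GiB", 3: "20GiB"},
    trust_remote_code=True,  # DBRX shipped with custom modeling code at release
)
```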
Hmm, the model finished downloading, I tried loading it via the transformers loader, and I got the exact same nonsense output. I tried loading the model across only 3 GPUs by exporting only 3 to the terminal, and still got the same output. Also, I'm using a version of oobabooga from April 10 (yesterday).
I have figured out the solution: it was transformers. The current release is over two weeks old, and it was definitely the issue for me. I did the following:
https://github.com/huggingface/transformers/tree/b109257f4fb8b1166e7c53cc5418632014ed53a5 This is the commit of transformers that I ended up with by installing from the dev source today. On-the-fly quantization of the fp16 model through the transformers loader works now too!
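For anyone following along, here is one way to install the dev build at that commit and confirm it afterwards; the pip syntax is standard, and the version print is just a sanity check:

```python
# Install the dev build first (run in your textgen environment):
#   pip install git+https://github.com/huggingface/transformers.git@b109257f4fb8b1166e7c53cc5418632014ed53a5
import transformers

# Dev builds carry a ".dev0" suffix (e.g. "4.40.0.dev0"), which confirms
# you are past the last tagged release.
print(transformers.__version__)
```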
The last released version is probably not reading the rope scale metadata correctly, causing the nonsense output. Good to know that the update fixed it; I'll update to the new version as soon as it comes out.
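For anyone who wants to check that metadata themselves, a small sketch that prints the rope-related fields from the model config; the attribute names here are assumptions about the Cohere config class and may differ between releases:

```python
# Hedged sketch: inspect rope-related metadata in the model config.
# Attribute names (rope_theta, rope_scaling) are assumptions and may not
# all exist on every release; getattr keeps this safe either way.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("CohereForAI/c4ai-command-r-plus")
for name in ("rope_theta", "rope_scaling", "max_position_embeddings"):
    print(name, "=", getattr(cfg, name, None))
```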
Describe the bug
When I use the fp16 model of c4ai-command-r-plus (https://huggingface.co/CohereForAI/c4ai-command-r-plus) and load it via the Transformers loader in 4-bit or 8-bit, the resulting model only outputs repeating characters, often producing output like this:
"authorauthorauthorauthorauthorauthorauthorauthorauthorauthorauthorauthorauthorauthorauthorauthorauthorauthorauthorauthorauthorbeginbeginauthor-sectionsectionauthor-sectionbegin-section"
I have quantized the model using exllamav2 and installed the most recent exllamav2 build into textgen, and the model loads successfully and functions well with the exllamav2 loader.
Is there an existing issue for this?
Reproduction
Select the Transformers loader and try loading the original fp16 model in either 4-bit or 8-bit (enabling trust-remote-code is necessary for 8-bit, and it produces the same kind of output, just as consistently, as 4-bit). The model will only output random strings of repeating words.
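For reference, a minimal repro sketch of on-the-fly 4-bit loading outside textgen, under the assumption that the Transformers loader uses BitsAndBytesConfig under the hood; the prompt and generation settings are illustrative:

```python
# Hedged repro sketch: on-the-fly 4-bit loading with transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "CohereForAI/c4ai-command-r-plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
# Affected transformers versions emit repeating tokens here instead of a reply.
print(tokenizer.decode(out[0]))
```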
Screenshot
Logs
System Info