Generalize GPTQ_loader, support any model #615
Conversation
It needs a PR to GPTQ. Technically, GPT-NeoX and GPT-Neo can also be done this way. This repo doesn't want to keep swapping around the GPTQ repo, so the steps are: get this merged upstream, then get LoRA support into upstream and into PEFT as well. Otherwise these never get merged.
This is very impressive.
I have done some basic sanity tests to check whether everything is equivalent to the current code, and the answer is yes. @Ph0rk0z I agree that it would be nice to have this functionality merged upstream in GPTQ-for-LLaMa, but I see no reason not to use Maya's code until that happens. The only caveat is that we will have to watch for eventual changes in the upstream. For reference, these are the VRAM usages that I have seen for pygmalion:
Test results

| Model | Branch | Time (s) | Speed (tokens/s) | Tokens | Context |
|---|---|---|---|---|---|
| Alpaca-30B-Int4 | Maya | 4.48 | 13.39 | 60 | 47 |
| Alpaca-30B-Int4 | Main | 4.50 | 13.33 | 60 | 47 |
| alpaca-native-4bit | Maya | 2.36 | 34.81 | 82 | 47 |
| alpaca-native-4bit | Main | 2.33 | 35.18 | 82 | 47 |
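(As a quick sanity check on the reported numbers: speed is just tokens divided by generation time, e.g. 60 tokens / 4.48 s ≈ 13.39 tokens/s; small discrepancies in the other rows come from rounding the displayed time.)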
Can someone convert the model https://huggingface.co/TehVenom/PPO_Pygway-V8p4_Dev-6b to 4-bit, please? I don't have enough GPU memory to do it myself.
Done, added to the first post.

Thank you very much! I will go test it.
…y/feature/gpt-j-4bit-v2) This includes Pygmalion 4-bit.
Improved version of #521
Generalized version of the quantized loader: it now auto-detects the model type from the model file, which allows loading GPT-J and Pygmalion-6B without juggling repositories.
I'll try to make a generalized offload version as well, but for now only LLaMA supports offloading.
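For readers curious what "auto-detecting the model type" can look like in practice, here is a minimal sketch, not the PR's actual code: it infers the architecture from the `model_type` field of a Hugging Face-style `config.json` next to the checkpoint. The helper name `detect_model_type` and the loader mapping in the comment are hypothetical.

```python
import json
from pathlib import Path

def detect_model_type(model_dir: str) -> str:
    """Hypothetical helper: infer the architecture from the model
    directory's config.json rather than hard-coding one model class."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    model_type = config.get("model_type", "")
    if model_type in ("llama", "gptj", "gpt_neox", "opt"):
        return model_type
    raise ValueError(f"Unsupported or unknown architecture: {model_type!r}")

# Illustrative use: dispatch to the right quantized loader (names made up).
# loader = {"llama": load_quant_llama, "gptj": load_quant_gptj}[
#     detect_model_type("models/pygmalion-6b-4bit")]
```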
You can quantize models using my fork: https://github.com/mayaeary/GPTQ-for-LLaMa/tree/gptj-v2
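As an illustration only, GPTQ-for-LLaMa's usual CLI quantizes a model with a command along these lines. The `gptj.py` script name is a guess based on the fork's branch name, and the `c4` calibration set and output filename are assumptions, so check the fork's README for the exact invocation.

```
# Assumed invocation, mirroring GPTQ-for-LLaMa's llama.py interface:
python gptj.py models/pygmalion-6b c4 --wbits 4 --groupsize 128 --save pygmalion-6b-4bit-128g.pt
```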
Pre-quantized models: load with `--wbits 4 --groupsize 128`.
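For anyone unfamiliar with those flags, a typical text-generation-webui launch with a pre-quantized 4-bit, group size 128 model looks roughly like this; the model folder name here is made up.

```
python server.py --model pygmalion-6b-4bit-128g --wbits 4 --groupsize 128
```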