make_quant() got an unexpected keyword argument 'faster' #667
Comments
(2023-04-02) See: #667 (comment)

old (2023-04-01):

```
cd repositories/GPTQ-for-LLaMA
git checkout cuda
```

(2023-04-01) The CUDA branch might've broken old quantizations again (they're crashing for me anyway), so if you want to keep using e.g. the ones @USBhost shared here or here, then also do:

Then finally:

If that fails, open up
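After switching branches like this, the CUDA extension generally has to be rebuilt so it matches the checked-out sources. A minimal sketch, assuming the webui's `repositories/` layout and a conda env named `textgen` (both assumptions, not part of the original comment):

```
# from the text-generation-webui folder, with the textgen conda env active
cd repositories/GPTQ-for-LLaMa
git checkout cuda

# rebuild the quant_cuda kernel against the now-checked-out code
pip uninstall -y quant-cuda
python setup_cuda.py install
```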
GPTQ-for-LLaMA changed the default branch today; do that to set it back. Evidently it's not backwards-compatible when called externally, e.g. from here, and I think models might need to be requantized again to work with the new branch anyway. Btw, the new branch supports --act-order + --groupsize simultaneously, and I did some LLaMA 30B (--wbits 4 --true-sequential) runs last night; my perplexity scores were:
wikitext2
ptb-new
c4-new
However, the new branch evidently needs triton installed to not run really slowly; without it, inference is around 1/4 as fast as the old cuda branch on my machine. Triton doesn't support Windows natively, and I haven't gotten around to setting up WSL to test it out myself yet.
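For context, a quantization run like the one described would look roughly like this; `llama.py` and the flag names are taken from the GPTQ-for-LLaMa README of that period, and the model/output paths are placeholders:

```
cd repositories/GPTQ-for-LLaMa
# calibrate on c4 and save a 4-bit, group size 128, act-order checkpoint (paths are hypothetical)
python llama.py /path/to/llama-30b c4 \
    --wbits 4 --true-sequential --act-order --groupsize 128 \
    --save llama30b-4bit-128g.pt
```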
That fixed it, thank you very much! I was pretty sure it was a recent change somewhere, but I'm not familiar enough with all these pieces to quickly figure out where.
Eh, so I guess I can't just get away with continuing to use the
FYI, this is the commit that breaks things: qwopqwop200/GPTQ-for-LLaMa@f1af89a
That same change is on the latest CUDA branch too now, btw.
Yup :/ qwopqwop200/GPTQ-for-LLaMa@f1af89a Here's the previous commit: 608f3ba71e40596c75f8864d73506eaf57323c6e
Just got this same error with the latest cuda branch. I think it's time to lock GPTQ to an exact (working) commit in the requirements file, as GPTQ breaking changes seem to happen every day. Edit: Rolling back to the GPTQ commit @deece mentioned works (
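Until such a pin lands in the requirements file, the same effect can be had by hand; a rough sketch, checking out the known-good commit mentioned above (assuming the webui's `repositories/` layout):

```
cd repositories/GPTQ-for-LLaMa
# 608f3ba... is the commit immediately before the breaking f1af89a change
git checkout 608f3ba71e40596c75f8864d73506eaf57323c6e
```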
https://github.com/oobabooga/text-generation-webui/blob/main/modules/GPTQ_loader.py#L36 Remove
Also, as far as I know the old models I made should still work.
I got this after removing
This is getting ridiculous 😓
Yep, same error as above for me on the ungrouped and 128g 4-bit models from USBhost. The quantizations I did a few days ago on the short-lived pytorch branch work, as do some re-runs from last night (on the CUDA branch, this commit)... but performance is abysmal. Getting a whopping 0.11 tokens/s with ~1850 context size, while the older quant on the older commit does closer to 4-5 tokens/s.
I don't even know why there are two branches, triton and cuda? Wouldn't cuda be the fastest one? Why should we go for the slow version? lol
Uninstall transformers and reinstall.
Triton supports more features.
@USBhost like what?
No change after reinstalling transformers.
Groupsize + act-order together. Also, I feel the pytorch branch was broken. That thing was worse than the delay.
@EyeDeck let me see. Does the groupsize have the same issue?
I think it works on the cuda branch now, you can combine all the GPTQ implementations.
Well, that's new. Hard to know when the commits are just called update (filename).
@USBhost I know, right... I knew this from looking at the new readme; he has now removed the "you can't combine act order and groupsize 128 together" note 😅
@USBhost Same error with your ungrouped 30B and then slightly newer 128g 30B:
I think the triton branch without the triton dependency falls back to basically the same code that the pytorch branch had; at least performance is the same. However, the latest CUDA branch is literally slower than the triton fallback by a factor of 10 on my machine (single 3090), and the triton fallback is slower by a factor of 4-5 than the CUDA commit from a few days ago. Also, yes, --act-order + --groupsize works on the latest CUDA branch, for both quantization and inference.
Hi, I'm far below you guys, but I've been trying to get this to work and I haven't slept since March. Any idea when all this will be fixed?
===================BUG REPORT===================
Please use my fork of GPTQ-for-LLaMa. It corresponds to commit

```
# activate the conda environment
conda activate textgen

# remove the existing GPTQ-for-LLaMa
cd text-generation-webui/repositories
rm -rf GPTQ-for-LLaMa
pip uninstall quant-cuda

# reinstall
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
cd GPTQ-for-LLaMa
python setup_cuda.py install
```

I will keep using this until qwopqwop's branch stabilizes. Upstream changes will not be supported. This works with @USBhost's torrents for llama that are linked here.
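A quick way to confirm the rebuild took, assuming the extension installs under the module name `quant_cuda` (the name `setup_cuda.py` builds):

```
python -c "import quant_cuda; print('quant_cuda OK')"
```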
Does having a model quantized with Triton necessarily require cloning the GPTQ-for-LLaMa repository on the Triton branch?
Describe the bug
When trying to run 4bit 128g models, I'm getting the following error:
TypeError: make_quant() got an unexpected keyword argument 'faster'
Apologies if I've just screwed something up on install. I've been through the instructions several times and think I've gotten everything.
Non 4bit-128 models load fine. If it matters, I'm running under WSL2 under Windows 11.
Is there an existing issue for this?
Reproduction
Install by cloning the tip from GitHub, and try to run a 4bit-128 model.
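For reference, a sketch of the kind of launch command that hits this error; the model folder name is hypothetical, and --wbits/--groupsize are the webui flags used for 4-bit 128g models at the time:

```
# model directory name is a placeholder
python server.py --model llama-30b-4bit-128g --wbits 4 --groupsize 128
```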
Screenshot
No response
Logs
System Info