AutoAWQ: initial support #3999
Conversation
Probably good as a proof of concept at this time.
@oobabooga any ideas about the DeepSpeed stuff?
Fused modules will be released in the next version. A significant speedup will come from them. v0.0.2 does not have the fused modules (only basic ones).
@casper-hansen These are from 8788fe106a5e34b80cbaf03fbe4710c2cfb27328. They work perfectly fine after copying awq/modules/fuse over when DeepSpeed is disabled.
@casper-hansen By the way, mind fixing the pip install? Somehow it misses copying awq/modules/fused when installing.
It is not missing the modules. v0.0.2 did not have them implemented 2 weeks ago. I am working on releasing the next version.
I mean I was using …
@cal066 It seems there was an issue, the …
@casper-hansen Awesome, thanks, waiting for the new version.
Very much looking forward to it. Thanks, guys.
You can now try |
@casper-hansen Thanks, but I just noticed I can't actually get the fused models to work; I don't have enough VRAM.
If you set …
@casper-hansen is max_new_tokens equivalent to context length? I'm not sure if I interpreted it correctly.
@casper-hansen I guess max_new_tokens isn't really the context length; is there a variable for setting how long the model context should be? Non-fused layers work perfectly fine otherwise.
Is this quant method better than GPTQ (act-order + groupsize 32) for the same size in terms of perplexity?
It is better than act-order + groupsize 128 according to the paper. I have not conducted testing of GPTQ methods to make this comparison myself, but the loss in perplexity is generally very low, as far as that benchmark can be trusted.
Added a new UI option for AutoAWQ max_new_tokens; it is related to max_new_tokens but needs to be set when the model is loaded. A RoPE UI has been added as well, but it is untested.
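For anyone wondering what that load-time setting corresponds to, here is a minimal sketch of loading an AWQ model with fused layers using the 0.1.x-era AutoAWQ API. The parameter names (fuse_layers, max_new_tokens) and the example model are assumptions based on that version and this thread, not the final loader code:

```
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/vicuna-13B-v1.5-16K-AWQ"  # one of the models tested in this PR

# fuse_layers enables the fused attention/MLP modules; max_new_tokens sizes the
# cache those fused modules pre-allocate, which is why it has to be chosen at
# load time rather than per generation call.
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,
    max_new_tokens=512,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = tokenizer("Hello, my name is", return_tensors="pt").input_ids.cuda()
output = model.generate(prompt, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```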
Not working with xformers... Traceback (most recent call last): …
@hronoas it seems the dimensions are wrong; it should be [1,129,64,128//8]
It is working fine for me with xformers:
@hronoas, what model are you using?
model: TheBloke/Phind-CodeLlama-34B-v2-AWQ
Technical details (run arguments, user-config.yaml for the model, Python 3.10.10 (win x64) pip freeze, console output): …
Also tried TheBloke/Xwin-LM-13B-V0.1-AWQ
I have dropped the scaling options for now, but enabled all the HF generation parameter options, since it uses the same ones.
AutoAWQ has never taken any RoPE parameters as input. It is something that could be implemented in time, though.
OK, I guess it was never really supported and I did not properly test it. Removed for now.
Forgive the noobie interruption while you're hard at work, but how do I get this to show up in textgen webui? I've cloned the GitHub repo via the web UI and restarted, but AutoAWQ does not show up in the model loaders list. Thanks in advance. And keep up the good work...
@s-konnex-engine How did you clone? You need to clone from my fork, it's not merged into oobabooga's yet.
@cal066 ohhhh!!! my bad!!! I basically copied the link to the casper-hansen repo and pasted it into textgen's add-extensions input. You mean I have to clone your textgen repo until it's merged. Thanks a million in any case. My 6GB 1060 appreciates you as much as I do. :D
Tested with:
* https://huggingface.co/TheBloke/vicuna-13B-v1.5-16K-AWQ
* https://huggingface.co/TheBloke/wizard-vicuna-13B-AWQ

Known issues:
* Seems to be incompatible with DeepSpeed, working when disabled:
```
text-generation-webui | File "/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 984, in _post_init_method
text-generation-webui | param.data = param.data.to(self.local_device)
text-generation-webui | NotImplementedError: Cannot copy out of meta tensor; no data!
```

Credits to Ph0rk0z for multi-gpu handling.
@Ph0rk0z Thanks, added your changes with slight tweaks after looking at how get_max_memory_dict() works.
I retested this with multi-GPU using the suggestion from @Ph0rk0z and am now able to get it to split the memory across my two 3090s. I had to set the memory limits much lower on each GPU to get it to work correctly, so as @Ph0rk0z said, it does seem like something is not calculating the limits quite right, but it works.
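To illustrate the kind of per-device limits being discussed, here is a minimal sketch of loading with an explicit max_memory map. The specific values are made up, and passing max_memory straight through from_quantized is an assumption based on how the webui loader builds its dict with get_max_memory_dict():

```
from awq import AutoAWQForCausalLM

# Hypothetical per-device limits: set well below the 24 GB of each 3090 to
# leave headroom for activations, since the automatic split can overestimate.
max_memory = {
    0: "18GiB",      # first GPU
    1: "18GiB",      # second GPU
    "cpu": "64GiB",  # spill-over, ideally unused
}

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/wizard-vicuna-13B-AWQ",
    fuse_layers=False,       # fused layers did not fit in VRAM in the report above
    max_memory=max_memory,   # forwarded to accelerate's device-map machinery
)
```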
@casper-hansen question: I see that the AutoAWQ wheels do not have a …
@oobabooga They are Nvidia-only at the moment, Ampere or later. The next release of AutoAWQ brings Turing support and the 2GB memory saving. I am actively working on a Torch-only module that should enable AMD/Metal/CPU users to use the models, but I suspect it will not be ready for the next release.
Thanks for the confirmation. For reference, these are the results of 2 quick perplexity tests that I ran a couple of days ago:
Test 1 is wikitext with a high stride value, and test 2 is a private dataset (exactly the same one as in the tests here: https://oobabooga.github.io/blog/posts/perplexities/). So it seems like AWQ performs consistently a bit better than 4-bit 128g GPTQ. The speed is also very good. I'll merge this PR and then update AutoAWQ in a future commit when a new version with the VRAM reduction is released. @cal066 thank you for this PR.
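For readers unfamiliar with the stride terminology, a sliding-window wikitext perplexity evaluation generally looks like the sketch below. This is the generic Hugging Face recipe, not the exact script behind the numbers above; the model name, window size, and stride are placeholders:

```
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; substitute the model under test
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
encodings = tokenizer(text, return_tensors="pt")

max_length = 1024  # evaluation window (model context length)
stride = 512       # a higher stride means fewer, less-overlapping windows
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end             # only score tokens new to this window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100      # mask the already-scored prefix

    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)
    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
print(f"perplexity: {ppl.item():.3f}")
```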
Thanks all for helping to test and merge it!
@oobabooga I just released v0.1.3, which brings the VRAM reduction and Turing support. You should update :)
Same message. I guess I'll just give up with AWQ now.
@AG-w Which GPU and CUDA version? Did you follow the instructions for installing CUDA dependencies in a conda environment in AutoAWQ?
Is there a separate installation for CUDA support on AWQ? It's a GTX 1060 6GB with the 537.42 WHQL driver.
The GTX 1060 is a Pascal card, which is not supported in AutoAWQ. You need Turing or Ampere cards to run the AWQ kernels.
Nice. I'll add a text-gen-webui section to my AWQ READMEs.
OK, I didn't notice there's an extra requirement, but I notice your code has …
@TheBloke Will you go for 32g quants now? I want to know how it compares in terms of perplexity to GPTQ 32g + act_order.
This is a fallback for Turing that implements Turing-compatible code. I would welcome any PRs adding similar support for Pascal cards - you would have to look into replacing some of the PTX code with …