Add AWQ quantization #102
Conversation
That's the error message I am struggling with.
Any ideas on first look?
Hey @flozi00, I spent some time playing around with it last night. At least for the first issue, it seems that AWQ changed the format of their weights in this commit: mit-han-lab/llm-awq@1480555#diff-cd7278928f5da471b08f4aedab4f33e560067768adf06ff06beec1972e9e7240 That seems to be causing the shape mismatch error. What I want to do is spend some time figuring out whether AWQ weights saved before this change can be successfully loaded and used with the newer code, as ideally we'd want to be on the newer version of AWQ.
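To make the shape-mismatch failure mode concrete, here is a minimal sketch of how one might detect which packing layout a stored quantized tensor uses. The packing direction and the `PACK_FACTOR` below are assumptions for illustration only, not the actual layouts from the llm-awq commit linked above:

```python
# Hypothetical sketch: distinguishing two AWQ weight-packing layouts by shape.
# Which dimension is packed in each format is an assumption for illustration,
# not the real format change from the referenced llm-awq commit.

PACK_FACTOR = 32 // 4  # 8 int4 values packed into one int32


def expected_qweight_shape(in_features: int, out_features: int, new_format: bool):
    """Return the packed qweight shape a linear layer would have under each layout."""
    if new_format:
        # assumed: newer layout packs along the output dimension
        return (in_features, out_features // PACK_FACTOR)
    # assumed: older layout packs along the input dimension
    return (in_features // PACK_FACTOR, out_features)


def classify_layout(actual_shape, in_features, out_features):
    """Report which (assumed) layout a stored tensor shape matches, if any."""
    if actual_shape == expected_qweight_shape(in_features, out_features, True):
        return "new"
    if actual_shape == expected_qweight_shape(in_features, out_features, False):
        return "old"
    return "mismatch"
```

A check like this run over a checkpoint's tensors would show whether old-format weights can be mapped onto the newer loader or whether the shapes are simply incompatible.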
@flozi00 Docker image has been built and pushed to https://github.com/predibase/lorax/pkgs/container/lorax/154831836?tag=awq-test. Any time you push to this branch, it will rebuild the image with the same tag.
Using the format from before the changes you linked results in the second error I posted above, but I think that one is confusing since it's not a real CUDA error. I read a thread where the PyTorch team said that error also occurs sometimes with mismatched linear layers. As far as I understand, both failures are related to lm_head. I'd definitely prefer the newer AWQ version, since it's faster than the one used in TGI if I remember correctly.
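On why the error is misleading: the matmul kernel is launched with dimensions taken from tensor metadata, so a layout mismatch is only detected deep inside the kernel and surfaces as a device-side error. A cheap host-side check fails fast with a clear message instead. This is an illustrative sketch, not the actual lorax or AWQ code:

```python
# Illustrative sketch (hypothetical names, not real lorax/AWQ code):
# validate lm_head weight dimensions on the host before dispatching to a
# quantized kernel, so a layout mismatch produces a clear error rather
# than an opaque CUDA failure inside the matmul.

def check_linear_compat(hidden_size: int, weight_shape: tuple) -> None:
    """Raise a descriptive error if the weight's input dim doesn't match the model."""
    in_features, _ = weight_shape
    if in_features != hidden_size:
        raise ValueError(
            f"lm_head expects in_features == hidden_size ({hidden_size}), "
            f"got weight shape {weight_shape}; likely an old/new AWQ weight "
            f"layout mismatch rather than a genuine CUDA failure"
        )
```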
Sounds like newer AWQ performance is quite a bit faster, so I agree we should try to get it working with the newer version. |
@tgaddair what do you think about using the kernels from the AutoAWQ project? https://github.com/casper-hansen/AutoAWQ/blob/5a673bf8435e019f50470b1b8878abf4ee63de57/awq/modules/linear.py#L213C7-L213C7
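For context on what those kernels compute: AWQ-style quantization stores int4 weights with one scale and one zero point per group of input channels, and the GEMM kernel dequantizes on the fly. A pure-Python reference of the group dequantization math, with the `(q - zero) * scale` convention and group size treated as assumptions rather than the exact AutoAWQ kernel contract:

```python
# Reference sketch of AWQ-style group dequantization in pure Python.
# The (q - zero) * scale formula and per-group layout follow the common
# AWQ convention; treat the details as assumptions, not the exact
# contract of the AutoAWQ CUDA kernel linked above.

def dequantize_group(qweights, scales, zeros, group_size=128):
    """Dequantize a flat sequence of int4 values, one (scale, zero) per group."""
    out = []
    for i, q in enumerate(qweights):
        g = i // group_size  # index of the quantization group this value belongs to
        out.append((q - zeros[g]) * scales[g])
    return out
```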
@flozi00 sounds good to me! |
It's working now, ready to be merged from my side.
time_per_token="52.262122ms" on an A2000 12GB, similar to an A6000 48GB with fp16.
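Converting that reported per-token latency into throughput makes the comparison easier to read: roughly 19 tokens/second on the A2000.

```python
# Convert the per-token latency reported above into tokens/second.

def tokens_per_second(ms_per_token: float) -> float:
    return 1000.0 / ms_per_token


rate = tokens_per_second(52.262122)  # about 19.1 tokens/second
```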
Amazing! Just tested it myself and verified results look good!
Moved to the predibase repo from #100.