Add AWQ quantization #102
Conversation
That's the error message I am struggling with.
Any ideas on first look?
Hey @flozi00, I spent some time playing around with it last night. At least for the first issue, it seems that AWQ changed the format of their weights in this commit: mit-han-lab/llm-awq@1480555#diff-cd7278928f5da471b08f4aedab4f33e560067768adf06ff06beec1972e9e7240 That seems to be causing the shape mismatch error. What I want to do is spend some time figuring out whether AWQ weights saved before this change can be successfully loaded and used with the newer code, as ideally we'd want to be on the newer version of AWQ.
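To make the shape-mismatch failure mode concrete, here is a minimal sketch of how one might detect which packing layout a stored quantized tensor uses. The packing direction and the `PACK_FACTOR` below are assumptions for illustration only, not the actual layouts from the llm-awq commit linked above:

```python
# Hypothetical sketch: distinguishing two AWQ weight-packing layouts by shape.
# Which dimension is packed in each format is an assumption for illustration,
# not the real format change from the referenced llm-awq commit.

PACK_FACTOR = 32 // 4  # 8 int4 values packed into one int32


def expected_qweight_shape(in_features: int, out_features: int, new_format: bool):
    """Return the packed qweight shape a linear layer would have under each layout."""
    if new_format:
        # assumed: newer layout packs along the output dimension
        return (in_features, out_features // PACK_FACTOR)
    # assumed: older layout packs along the input dimension
    return (in_features // PACK_FACTOR, out_features)


def classify_layout(actual_shape, in_features, out_features):
    """Report which (assumed) layout a stored tensor shape matches, if any."""
    if actual_shape == expected_qweight_shape(in_features, out_features, True):
        return "new"
    if actual_shape == expected_qweight_shape(in_features, out_features, False):
        return "old"
    return "mismatch"
```

A check like this run over a checkpoint's tensors would show whether old-format weights can be mapped onto the newer loader or whether the shapes are simply incompatible.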
@flozi00 Docker image has been built and pushed to https://github.com/predibase/lorax/pkgs/container/lorax/154831836?tag=awq-test. Any time you push to this branch, it will rebuild the image with the same tag.
Using the format from before the changes you linked results in the second error I posted above, but I think that one is confusing since it's not a real CUDA error. I read a thread where the PyTorch team said that error also occurs sometimes with mismatched linear layers. As far as I understand, both failures are related to lm_head. I'd definitely prefer the newer AWQ version, since it's faster than the one used in TGI if I remember correctly.
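On why the error is misleading: the matmul kernel is launched with dimensions taken from tensor metadata, so a layout mismatch is only detected deep inside the kernel and surfaces as a device-side error. A cheap host-side check fails fast with a clear message instead. This is an illustrative sketch, not the actual lorax or AWQ code:

```python
# Illustrative sketch (hypothetical names, not real lorax/AWQ code):
# validate lm_head weight dimensions on the host before dispatching to a
# quantized kernel, so a layout mismatch produces a clear error rather
# than an opaque CUDA failure inside the matmul.

def check_linear_compat(hidden_size: int, weight_shape: tuple) -> None:
    """Raise a descriptive error if the weight's input dim doesn't match the model."""
    in_features, _ = weight_shape
    if in_features != hidden_size:
        raise ValueError(
            f"lm_head expects in_features == hidden_size ({hidden_size}), "
            f"got weight shape {weight_shape}; likely an old/new AWQ weight "
            f"layout mismatch rather than a genuine CUDA failure"
        )
```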
Sounds like newer AWQ performance is quite a bit faster, so I agree we should try to get it working with the newer version. |
@tgaddair what do you think about using the kernels from the AutoAWQ project? https://github.com/casper-hansen/AutoAWQ/blob/5a673bf8435e019f50470b1b8878abf4ee63de57/awq/modules/linear.py#L213C7-L213C7
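For context on what those kernels compute: AWQ-style quantization stores int4 weights with one scale and one zero point per group of input channels, and the GEMM kernel dequantizes on the fly. A pure-Python reference of the group dequantization math, with the `(q - zero) * scale` convention and group size treated as assumptions rather than the exact AutoAWQ kernel contract:

```python
# Reference sketch of AWQ-style group dequantization in pure Python.
# The (q - zero) * scale formula and per-group layout follow the common
# AWQ convention; treat the details as assumptions, not the exact
# contract of the AutoAWQ CUDA kernel linked above.

def dequantize_group(qweights, scales, zeros, group_size=128):
    """Dequantize a flat sequence of int4 values, one (scale, zero) per group."""
    out = []
    for i, q in enumerate(qweights):
        g = i // group_size  # index of the quantization group this value belongs to
        out.append((q - zeros[g]) * scales[g])
    return out
```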
@flozi00 sounds good to me! |
It's working now, ready to be merged from my side.
time_per_token="52.262122ms" on an A2000 12GB, similar to an A6000 48GB with fp16.
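Converting that reported per-token latency into throughput makes the comparison easier to read: roughly 19 tokens/second on the A2000.

```python
# Convert the per-token latency reported above into tokens/second.

def tokens_per_second(ms_per_token: float) -> float:
    return 1000.0 / ms_per_token


rate = tokens_per_second(52.262122)  # about 19.1 tokens/second
```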
Amazing! Just tested it myself and verified results look good!
Moved to the predibase repo from #100.