
AutoAWQ: initial support #3999

Merged
merged 5 commits into oobabooga:main from autoawq1 on Oct 5, 2023

Conversation

cal066
Contributor

@cal066 cal066 commented Sep 19, 2023

Tested with:

  • https://huggingface.co/TheBloke/vicuna-13B-v1.5-16K-AWQ
  • https://huggingface.co/TheBloke/wizard-vicuna-13B-AWQ

Known issues:

  • Seems to be incompatible with DeepSpeed; works when DeepSpeed is disabled:
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 984, in _post_init_method
text-generation-webui  |     param.data = param.data.to(self.local_device)
text-generation-webui  | NotImplementedError: Cannot copy out of meta tensor; no data!

Credits to Ph0rk0z for multi-gpu handling.

Checklist:
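
For anyone who wants to try the backend outside the webui, here is a minimal sketch of loading one of the tested models directly with AutoAWQ. The exact `from_quantized` keyword arguments are assumptions based on the AutoAWQ 0.1.x discussion below and may differ between versions.

```python
# Minimal sketch, not the PR's loader code: load an AWQ-quantized model with
# AutoAWQ and generate from it. Keyword arguments are assumed and may vary by version.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/vicuna-13B-v1.5-16K-AWQ"  # one of the models tested above

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=False,   # fused modules give a speedup but need more VRAM (see discussion below)
    max_new_tokens=512,  # load-time cache size used by the fused path, per the discussion below
)

input_ids = tokenizer("What is the speed of sound?", return_tensors="pt").input_ids.cuda()
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```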

@cal066
Contributor Author

cal066 commented Sep 19, 2023

Probably good as a proof of concept at this time.

@cal066
Contributor Author

cal066 commented Sep 19, 2023

@oobabooga any ideas about the DeepSpeed stuff?

@casper-hansen
Contributor

casper-hansen commented Sep 19, 2023

Fused modules will be released in the next version; a significant speedup will come from them.

v0.0.2 does not have the fused modules (only the basic ones).

@cal066
Contributor Author

cal066 commented Sep 19, 2023

@casper-hansen These are from 8788fe106a5e34b80cbaf03fbe4710c2cfb27328. They work perfectly fine after copying awq/modules/fuse over, as long as DeepSpeed is disabled.

@cal066
Contributor Author

cal066 commented Sep 20, 2023

@casper-hansen By the way, mind fixing the pip install? Somehow it misses copying awq/modules/fused when installing.

@casper-hansen
Contributor

@casper-hansen By the way, mind fixing the pip install? Somehow it misses copying awq/modules/fused when installing.

It is not missing the modules; v0.0.2 simply did not have them implemented two weeks ago. I am working on releasing the next version.

@cal066
Contributor Author

cal066 commented Sep 20, 2023

I mean I was using
`pip install git+https://github.com/casper-hansen/AutoAWQ@a5e8b048abec1b4e378973130f68863805d46eab`
in my Docker container; I'm not sure why they're not copied.

@casper-hansen
Contributor

@cal066 It seems there was an issue: the __init__.py file was missing, which caused this. A subtle mistake by me that only appeared after building with GitHub workflows. Fixed now, and I'm releasing v0.1.0 today.
casper-hansen/AutoAWQ@fbeea40
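
For context, the kind of packaging slip described here (a subpackage without an `__init__.py` being silently dropped from the built wheel) typically looks like this in a setuptools layout. This is a generic illustration, not AutoAWQ's actual setup.py.

```python
# Generic setuptools illustration (not AutoAWQ's actual setup.py):
# find_packages() only collects directories that contain an __init__.py,
# so a subpackage like awq/modules/fused without one is silently omitted
# from the wheel built by CI.
from setuptools import setup, find_packages

setup(
    name="example-package",
    version="0.1.0",
    packages=find_packages(),  # misses any directory lacking __init__.py
)
```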

@cal066
Contributor Author

cal066 commented Sep 21, 2023

@casper-hansen Awesome thanks, waiting for the new version.

@DiamondGlassDrill

Very much looking forward. Thanks guys

@casper-hansen
Contributor

You can now try pip install autoawq==0.1.0

@cal066 cal066 force-pushed the autoawq1 branch 3 times, most recently from 97706d1 to d656071 on September 21, 2023 16:21
@cal066
Contributor Author

cal066 commented Sep 21, 2023

@casper-hansen Thanks, but I just noticed I can't actually get the fused modules to work; I don't have enough VRAM.
@oobabooga This seems to be working now.

@casper-hansen
Contributor

@casper-hansen Thanks, but I just noticed I can't actually get the fused modules to work; I don't have enough VRAM. @oobabooga This seems to be working now.

If you set max_new_tokens, the VRAM usage will vary accordingly (see benchmark.py). At the minimum setting, it will not take much VRAM.

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 231.393 | 153.632 | 4.66 GB (19.68%) |
| 1 | 64 | 64 | 233.909 | 154.475 | 4.66 GB (19.68%) |
| 1 | 128 | 128 | 233.145 | 152.133 | 4.66 GB (19.68%) |
| 1 | 256 | 256 | 228.562 | 147.692 | 4.67 GB (19.72%) |
| 1 | 512 | 512 | 228.914 | 139.179 | 4.80 GB (20.26%) |
| 1 | 1024 | 1024 | 227.393 | 125.058 | 5.56 GB (23.48%) |
| 1 | 2048 | 2048 | 225.736 | 123.228 | 8.08 GB (34.09%) |

@cal066
Contributor Author

cal066 commented Sep 21, 2023

@casper-hansen is max_new_tokens equivalent to context length? I'm not sure if I interpreted it correctly.
Regardless, max_new_tokens=512 batch_size=1 uses around 9.3GB VRAM after enabling fusing, not sure if I'm doing it wrong.

@casper-hansen
Contributor

@casper-hansen is max_new_tokens equivalent to context length? I'm not sure if I interpreted it correctly. Regardless, max_new_tokens=512 batch_size=1 uses around 9.3GB VRAM after enabling fusing, not sure if I'm doing it wrong.

max_new_tokens pre-allocates the correct amount of cache when you use the fused modules. 9.3 GB of VRAM seems a little high compared to my own benchmarks on a 13B. You can always disable the fused layers if you do not have the VRAM for them.
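
To make the trade-off concrete, here is a hedged sketch of the two load paths described above; the keyword names are assumed from the AutoAWQ 0.1.x API and may differ in other versions.

```python
# Sketch only: the fused path pre-allocates its cache based on max_new_tokens,
# so smaller values keep VRAM usage down; fuse_layers=False skips fusion entirely.
from awq import AutoAWQForCausalLM

model_path = "TheBloke/vicuna-13B-v1.5-16K-AWQ"  # illustrative

# Fused modules: faster decoding, but VRAM grows with the pre-allocated cache.
model_fused = AutoAWQForCausalLM.from_quantized(
    model_path, fuse_layers=True, max_new_tokens=512
)

# Non-fused modules: slower, but there is no pre-allocated cache to budget for.
model_plain = AutoAWQForCausalLM.from_quantized(
    model_path, fuse_layers=False
)
```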

@cal066
Contributor Author

cal066 commented Sep 21, 2023

@casper-hansen I guess max_new_tokens isn't really the context length; is there a variable for setting how long the model context should be? Non-fused layers work perfectly fine otherwise.

@casper-hansen
Contributor

@casper-hansen I guess max_new_tokens isn't really the context length; is there a variable for setting how long the model context should be? Non-fused layers work perfectly fine otherwise.

max_new_tokens is the maximum number of new tokens to generate. The context that you input will automatically go into the cache.

@BadisG
Contributor

BadisG commented Sep 21, 2023

Is this quant method better than GPTQ (act-order + groupsize 32) for the same size in terms of perplexity?

@casper-hansen
Contributor

casper-hansen commented Sep 21, 2023

Is this quant method better than GPTQ (act-order + groupsize 32) for the same size in terms of perplexity?

It is better than act-order + groupsize 128 according to the paper. I have not tested GPTQ methods to make this comparison myself, but the loss in perplexity is generally very low, for whatever that benchmark is worth given how hard it is to trust.

[image: perplexity comparison table from the AWQ paper]

@cal066
Contributor Author

cal066 commented Sep 22, 2023

Added a new UI option for AutoAWQ's max_new_tokens; it is related to the generation max_new_tokens but needs to be set when the model is loaded. A RoPE UI has been added as well, but it is untested.

@hronoas
Contributor

hronoas commented Sep 22, 2023

Not working with xformers...

Traceback (most recent call last):
File "E:\neuro\LLM\modules\callbacks.py", line 56, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "E:\neuro\LLM\modules\text_generation.py", line 347, in generate_with_callback
shared.model.generate(**kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\awq\models\base.py", line 36, in generate
return self.model.generate(*args, **kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\transformers\generation\utils.py", line 1648, in generate
return self.sample(
File "E:\neuro\LLM\venv\lib\site-packages\transformers\generation\utils.py", line 2730, in sample
outputs = self(
File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 820, in forward
outputs = self.model(
File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 708, in forward
layer_outputs = decoder_layer(
File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 424, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\neuro\LLM\modules\llama_attn_hijack.py", line 40, in xformers_forward
key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 129, 64, 128]' is invalid for input of size 132096
Output generated in 0.35 seconds (0.00 tokens/s, 0 tokens, context 129, seed 938926578)

@casper-hansen
Contributor

casper-hansen commented Sep 22, 2023

@hronoas it seems the dimensions are wrong; it should be [1, 129, 64, 128//8]

@cal066
Contributor Author

cal066 commented Sep 23, 2023

It is working fine for me with xformers:

Replacing layers...:  95%|█████████▌| 38/40 [00:06<00:00,  5.74it/s]
Replacing layers...:  98%|█████████▊| 39/40 [00:06<00:00,  5.73it/s]
Replacing layers...: 100%|██████████| 40/40 [00:07<00:00,  5.66it/s]
text-generation-webui  | 2023-09-21 16:41:03 INFO:Replaced attention with xformers_attention
text-generation-webui  | 2023-09-21 16:41:03 INFO:Loaded the model in 9.87 seconds.
text-generation-webui  |
text-generation-webui  | This is a conversation with your Assistant. It is a computer program designed to help you with various tasks such as answering questions, providing recommendations, and helping with decision making. You can ask it anything you want and it will do its best to give you accurate and relevant information.
text-generation-webui  | You: What is the speed of sound?
text-generation-webui  | Assistant:
text-generation-webui  | --------------------
text-generation-webui  |
text-generation-webui  | Output generated in 7.24 seconds (7.18 tokens/s, 52 tokens, context 72, seed 1027579411)

@hronoas, what model are you using?

@hronoas
Contributor

hronoas commented Sep 23, 2023

model: TheBloke/Phind-CodeLlama-34B-v2-AWQ

Technical details

run arguments

python server.py --api --listen --gradio-auth user:pass --verbose --xformers --model Phind-CodeLlama-34B-v2-AWQ

user-config.yaml (for model):

  loader: AutoAWQ
  auto_devices: false
  disk: false
  cpu: false
  trust_remote_code: false
  no_inject_fused_attention: true
  n_batch: 512
  max_new_tokens: 512
  truncation_length: 4096
  mode: instruct
  instruction_template: Phind
  preset: MY
  chat_generation_attempts: 3
  max_seq_len: 4096
  compress_pos_emb: 1
  alpha_value: 1
  rope_freq_base: 0

Python 3.10.10 (win x64)

pip freeze

absl-py==1.4.0
accelerate==0.23.0
aiofiles==23.1.0
aiohttp==3.8.5
aiosignal==1.3.1
altair==5.0.1
antlr4-python3-runtime==4.9.3
anyio==3.7.1
appdirs==1.4.4
APScheduler==3.6.3
asttokens==2.2.1
async-timeout==4.0.3
attributedict==0.3.0
attrs==23.1.0
auto-gptq @ https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.4.2/auto_gptq-0.4.2+cu117-cp310-cp310-win_amd64.whl#sha256=7145db94f57db80d1d292880487870686079d1b83ef48d3043b9b01023301fa4
autoawq @ https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.0/autoawq-0.1.0-cp310-cp310-win_amd64.whl#sha256=c4de5ff08833fbeb2457dccab135a1535a49629864ef6f9494fc2a4cc3257877
backcall==0.2.0
backoff==2.2.1
beautifulsoup4==4.12.2
bitsandbytes @ https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl#sha256=92258b5461c51e54fe9c6ab256aa61f42eee82423ecb8b36ba27a7024b743cc3
blessings==1.7
blinker==1.6.2
cachetools==5.3.1
certifi==2022.12.7
cffi==1.15.1
chardet==5.2.0
charset-normalizer==2.1.1
chromadb==0.3.18
click==8.1.6
clickhouse-connect==0.6.8
codecov==2.1.13
colorama==0.4.6
coloredlogs==15.0.1
colour-runner==0.1.1
contourpy==1.1.0
coverage==7.3.1
cramjam==2.7.0
ctransformers @ https://github.com/jllllll/ctransformers-cuBLAS-wheels/releases/download/AVX2/ctransformers-0.2.27+cu117-py3-none-any.whl#sha256=701d93024d09d679f3afd7fb645cf583bd44ba88ce4a48482742367cbeb52b30
cycler==0.11.0
DataProperty==1.0.1
datasets==2.10.1
decorator==5.1.1
deep-translator==1.9.2
deepdiff==6.5.0
dill==0.3.6
diskcache==5.6.1
distlib==0.3.7
docker-pycreds==0.4.0
docopt==0.6.2
duckdb==0.8.1
einops==0.6.1
elevenlabs==0.2.24
exceptiongroup==1.1.3
executing==1.2.0
exllama @ https://github.com/jllllll/exllama/releases/download/0.0.17/exllama-0.0.17+cu117-cp310-cp310-win_amd64.whl#sha256=64eff5fefde42b113c64e346c062e50ace5a648257053e889fd618026928b84f
exllamav2==0.0.2
fastapi==0.95.2
fastparquet==2023.8.0
ffmpeg==1.4
ffmpeg-python==0.2.0
ffmpy==0.3.1
filelock==3.12.4
Flask==2.3.2
flask-cloudflared==0.0.12
fonttools==4.42.0
frozenlist==1.4.0
fsspec==2023.6.0
future==0.18.3
gitdb==4.0.10
GitPython==3.1.32
google-auth==2.22.0
google-auth-oauthlib==1.0.0
gptq-for-llama @ https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.0/gptq_for_llama-0.1.0+cu117-cp310-cp310-win_amd64.whl#sha256=bd7234ee8f49ddad3bada293d826ed727203f1bc4e245479593ffa2a561fc397
gradio==3.33.1
gradio_client==0.2.5
grpcio==1.57.0
h11==0.14.0
hnswlib==0.7.0
httpcore==0.17.3
httptools==0.6.0
httpx==0.24.1
huggingface-hub==0.16.4
humanfriendly==10.0
idna==3.4
importlib-metadata==6.8.0
inspecta==0.1.3
ipython==8.14.0
itsdangerous==2.1.2
jedi==0.19.0
Jinja2==3.1.2
joblib==1.3.2
jsonlines==4.0.0
jsonschema==4.19.0
jsonschema-specifications==2023.7.1
kiwisolver==1.4.4
linkify-it-py==2.0.2
llama-cpp-python-ggml @ https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python_ggml-0.1.78+cpuavx2-cp310-cp310-win_amd64.whl#sha256=6c0cb266a3c22d3a170efb2f19d6c63907efa82288d436e5127daf9ab54c6f9c
llama-cpp-python-ggml-cuda @ https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_ggml_cuda-0.1.78+cu117-cp310-cp310-win_amd64.whl#sha256=04ca481d43a5b28c45959a6edad2126699461f99607417c7421625738901c112
llama_cpp_python @ https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.6/llama_cpp_python-0.2.6-cp310-cp310-win_amd64.whl#sha256=72007119c1fe7647480847da2adc4058d34732ec073f055227bb460faae99707
llama_cpp_python_cuda @ https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.6+cu117-cp310-cp310-win_amd64.whl#sha256=62e0f7bd94817994bc47f9621cc94db1c0e9a809f9442083f81f3ddeb2625e32
llvmlite==0.40.1
lm-eval==0.3.0
lxml==4.9.3
lz4==4.3.2
Markdown==3.4.4
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.7.2
matplotlib-inline==0.1.6
mbstrdecoder==1.1.3
mdit-py-plugins==0.3.3
mdurl==0.1.2
monotonic==1.6
more-itertools==10.1.0
mpmath==1.2.1
multidict==6.0.4
multiprocess==0.70.14
networkx==3.0
ngrok==0.8.1
ninja==1.11.1
nltk==3.8.1
num2words==0.5.12
numba==0.57.1
numexpr==2.8.6
numpy==1.24.0
oauthlib==3.2.2
omegaconf==2.3.0
openai==0.28.0
openai-whisper==20230314
optimum==1.13.1
ordered-set==4.1.0
orjson==3.9.5
packaging==23.1
pandas==2.0.3
parso==0.8.3
pathtools==0.1.2
pathvalidate==3.2.0
peft @ git+https://github.com/huggingface/peft@96c0277a1b9a381b10ab34dbf84917f9b3b992e6
pickleshare==0.7.5
Pillow==10.0.0
platformdirs==3.10.0
pluggy==1.3.0
portalocker==2.8.2
posthog==2.4.2
prompt-toolkit==3.0.39
protobuf==4.24.0
psutil==5.9.5
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==12.0.1
pyasn1==0.5.0
pyasn1-modules==0.3.0
pybind11==2.11.1
pycountry==22.3.5
pycparser==2.21
pydantic==1.10.12
pydub==0.25.1
Pygments==2.16.1
pyparsing==3.0.9
pyproject-api==1.6.1
pyreadline3==3.4.1
pytablewriter==1.0.0
python-dateutil==2.8.2
python-dotenv==1.0.0
python-multipart==0.0.6
python-telegram-bot==13.15
pytz==2023.3
pywin32==306
PyYAML==6.0.1
referencing==0.30.2
regex==2023.8.8
requests==2.28.1
requests-oauthlib==1.3.1
responses==0.18.0
rootpath==0.1.1
rouge==1.0.1
rouge-score==0.1.2
rpds-py==0.9.2
rsa==4.9
sacrebleu==1.5.0
safetensors==0.3.2
scikit-learn==1.3.0
scipy==1.11.1
semantic-version==2.10.0
sentence-transformers==2.2.2
sentencepiece==0.1.99
sentry-sdk==1.29.2
setproctitle==1.3.2
six==1.16.0
smmap==5.0.0
sniffio==1.3.0
soundfile==0.12.1
soupsieve==2.4.1
SpeechRecognition==3.10.0
sqlitedict==2.1.0
stack-data==0.6.2
starlette==0.27.0
sympy==1.11.1
tabledata==1.3.3
tabulate==0.9.0
tcolorpy==0.1.4
tensorboard==2.14.0
tensorboard-data-server==0.7.1
termcolor==2.3.0
texttable==1.6.7
threadpoolctl==3.2.0
tiktoken==0.3.1
tokenizers==0.13.3
toml==0.10.2
tomli==2.0.1
toolz==0.12.0
torch==2.0.1+cu118
torchaudio==2.0.2+cu118
torchvision==0.15.2+cu118
tornado==6.1
tox==4.11.3
tqdm==4.66.1
tqdm-multiprocess==0.0.11
traitlets==5.9.0
transformers==4.33.1
typepy==1.3.1
typing_extensions==4.7.1
tzdata==2023.3
tzlocal==5.0.1
uc-micro-py==1.0.2
urllib3==1.26.13
uvicorn==0.23.2
virtualenv==20.24.5
waitress==2.1.2
wandb==0.15.8
watchfiles==0.19.0
wcwidth==0.2.6
websockets==11.0.2
Werkzeug==2.3.7
xformers==0.0.21
xxhash==3.3.0
yarl==1.9.2
zipp==3.16.2
zstandard==0.21.0

console output:


bin E:\neuro\LLM\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll
2023-09-23 05:19:08 INFO:Loading settings from settings.yaml...
2023-09-23 05:19:08 INFO:Loading Phind-CodeLlama-34B-v2-AWQ...
Replacing layers...: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:03<00:00, 15.67it/s]
2023-09-23 05:19:20 INFO:Replaced attention with xformers_attention
2023-09-23 05:19:20 INFO:Loaded the model in 12.18 seconds.

Starting streaming server at ws://0.0.0.0:5005/api/v1/stream
2023-09-23 05:19:20 INFO:Loading the extension "code_syntax_highlight"...
2023-09-23 05:19:20 INFO:Loading the extension "openai"...
Starting API at http://0.0.0.0:5000/api
2023-09-23 05:19:20 INFO:Loading the extension "history"...
OpenAI compatible API ready at: OPENAI_API_BASE=http://0.0.0.0:5001/v1
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.


### System Prompt
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### User Message
test

### Assistant

--------------------

Traceback (most recent call last):
  File "E:\neuro\LLM\modules\callbacks.py", line 56, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "E:\neuro\LLM\modules\text_generation.py", line 347, in generate_with_callback
    shared.model.generate(**kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\awq\models\base.py", line 36, in generate
    return self.model.generate(*args, **kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\transformers\generation\utils.py", line 1648, in generate
    return self.sample(
  File "E:\neuro\LLM\venv\lib\site-packages\transformers\generation\utils.py", line 2730, in sample
    outputs = self(
  File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 820, in forward
    outputs = self.model(
  File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 708, in forward
    layer_outputs = decoder_layer(
  File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\neuro\LLM\modules\llama_attn_hijack.py", line 40, in xformers_forward
    key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 42, 64, 128]' is invalid for input of size 43008
Output generated in 0.45 seconds (0.00 tokens/s, 0 tokens, context 42, seed 1955952382)

I also tried TheBloke/Xwin-LM-13B-V0.1-AWQ, with the same result.

@cal066
Contributor Author

cal066 commented Oct 3, 2023

I have dropped the scaling options for now, but enabled all the HF generation parameter options, since it uses the same ones.

@casper-hansen
Contributor

@casper-hansen I just noticed rope_scaling failing today:
TypeError: AutoAWQForCausalLM.from_quantized() got an unexpected keyword argument 'rope_scaling'
I don't remember that error before; is it actually supported?

AutoAWQ has never taken any RoPE parameters as input. It is something that could be implemented with time though.

@cal066
Contributor Author

cal066 commented Oct 3, 2023

OK, I guess it was never really supported and I did not properly test it. Removed for now.

@s-konnex-engine

Forgive the noobie interruption while you're hard at work, but how do I get this to show up in the textgen webui?

I've cloned the GitHub repo via the web UI and restarted, but AutoAWQ does not show up in the model loaders list.

Thanks in advance, and keep up the good work...

@cal066
Contributor Author

cal066 commented Oct 4, 2023

@s-konnex-engine How did you clone? You need to clone from my fork; it's not merged into oobabooga's yet.

@s-konnex-engine

@cal066 Ohhhh!!! My bad!!! I basically copied the link to the casper-hansen repo and pasted it into the textgen "add extensions" input. You mean I have to clone your textgen repo until it's merged. Thanks a million in any case; my 6GB 1060 appreciates you as much as I do. :D

@cal066
Contributor Author

cal066 commented Oct 5, 2023

@Ph0rk0z Thanks, added your changes with slight tweaks after looking at how get_max_memory_dict() works.

@CoryG89

CoryG89 commented Oct 5, 2023

I retested this with multi-GPU using the suggestion from @Ph0rk0z and am now able to get it to split the memory across my two 3090s. I had to set the memory limits much lower on each GPU to get it to work correctly, so as @Ph0rk0z said, it does seem like something is not calculating the limits correctly, but it works.
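
For reference, the multi-GPU split discussed here works through an accelerate-style max_memory mapping. The sketch below is a rough illustration; the `max_memory` keyword and the specific limits are assumptions drawn from this thread rather than verified API details.

```python
# Illustrative only: cap per-GPU memory so the weights are split across two cards.
# As noted above, the limits may need to be set well below the physical VRAM.
from awq import AutoAWQForCausalLM

max_memory = {0: "18GiB", 1: "18GiB", "cpu": "64GiB"}  # example values for 2x 3090

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Phind-CodeLlama-34B-v2-AWQ",  # illustrative model
    fuse_layers=False,
    max_memory=max_memory,  # assumed keyword, per the multi-GPU handling in this PR
)
```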

@oobabooga
Owner

@casper-hansen question: I see that the AutoAWQ wheels do not have a +cu in their filenames. Are they nvidia-only, or is there a chance that they could work on AMD/Intel Arc GPUs/Mac Metal out of the box?

@casper-hansen
Contributor

@oobabooga They are Nvidia-only at the moment, Ampere or later. The next release of AutoAWQ brings Turing support and the 2GB memory saving. I am actively working on a Torch-only module that should enable AMD/Metal/CPU users to use the models but I suspect it will not be ready for next release.

@oobabooga
Owner

Thanks for the confirmation. For reference, these are the results of 2 quick perplexity tests that I ran a couple of days ago:

| Test | Model | Perplexity |
|---|---|---|
| Test 1 | TheBloke_Llama-2-7B-AWQ | 5.641422748565674 |
| Test 1 | TheBloke_Llama-2-7B-GPTQ_gptq-4bit-128g-actorder_True | 5.681295394897461 |
| Test 2 | TheBloke_Llama-2-7B-AWQ | 6.007129192352295 |
| Test 2 | TheBloke_Llama-2-7B-GPTQ_gptq-4bit-128g-actorder_True | 6.036174297332764 |

Test 1 is wikitext with a high stride value, and test 2 is a private dataset (exactly the same one as in the tests here https://oobabooga.github.io/blog/posts/perplexities/).

So it seems like AWQ performs consistently a bit better than 4-bit 128g GPTQ. The speed is also very good.
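
For context, stride-based wikitext perplexity is usually computed along these lines. This is a generic sketch, not the exact script behind the numbers above, and the model id is a placeholder.

```python
# Generic sliding-window perplexity sketch. A higher stride means less overlap
# between evaluation windows, which is faster but slightly less exact.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids

max_len, stride = 2048, 1024
nlls, prev_end = [], 0
for begin in range(0, input_ids.size(1), stride):
    end = min(begin + max_len, input_ids.size(1))
    trg_len = end - prev_end            # only score tokens not already scored
    ids = input_ids[:, begin:end].to(model.device)
    labels = ids.clone()
    labels[:, :-trg_len] = -100         # mask out the overlapping context
    with torch.no_grad():
        nlls.append(model(ids, labels=labels).loss * trg_len)
    prev_end = end
    if end == input_ids.size(1):
        break

print("Perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```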

I'll merge this PR and then update AutoAWQ in a future commit once a new version with the VRAM reduction is released. @cal066 thank you for this PR.

@oobabooga oobabooga merged commit cc632c3 into oobabooga:main Oct 5, 2023
@cal066
Contributor Author

cal066 commented Oct 5, 2023

Thanks all for helping to test and merge it!

@casper-hansen
Contributor

@oobabooga I just released v0.1.3 - which brings the VRAM reduction and Turing support. You should update :)

@AG-w
Contributor

AG-w commented Oct 5, 2023

I get the same `RuntimeError: probability tensor contains either inf, nan or element < 0` message.

I guess I'll just give up on AWQ for now.

@casper-hansen
Contributor

@AG-w Which GPU and CUDA version? Did you follow the instructions for installing CUDA dependencies in a conda environment in AutoAWQ?

@AG-w
Contributor

AG-w commented Oct 5, 2023

Is there a separate installation step for CUDA support in AWQ? Because I can use GGUF with CUDA just fine.

It's a GTX 1060 6GB with the 537.42 WHQL driver.

@casper-hansen
Contributor

The GTX 1060 is a Pascal card, which is not supported in AutoAWQ. You need a Turing or Ampere card to run the AWQ kernels.

@TheBloke
Contributor

TheBloke commented Oct 5, 2023

Nice. I'll add a text-gen-webui section to my AWQ READMEs

@AG-w
Contributor

AG-w commented Oct 5, 2023

OK, I didn't notice there was an extra requirement.

But I noticed your code has a `defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800` fallback. Is that planned for the future, or is it just unused code?

@BadisG
Contributor

BadisG commented Oct 5, 2023

@TheBloke Will you go for 32g quants now? I wanna know how it compares in terms of perplexity to GPTQ 32g + act_order

@casper-hansen
Contributor

defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800

This is a fallback that implements Turing-compatible code. I would welcome any PRs adding similar support for Pascal cards - you would have to look into replacing some of the PTX code with m8n8k4 and find another solution for the ldmatrix calls. However, I am by no means a CUDA expert, so I hope the community can help out with this one and with other CUDA-related optimizations.
