
AutoAWQ: initial support #3999

Merged
merged 5 commits into oobabooga:main from autoawq1 on Oct 5, 2023

Conversation

cal066
Contributor

@cal066 cal066 commented Sep 19, 2023

Tested with:

  • https://huggingface.co/TheBloke/vicuna-13B-v1.5-16K-AWQ
  • https://huggingface.co/TheBloke/wizard-vicuna-13B-AWQ

Known issues:

  • Seems to be incompatible with DeepSpeed; works when DeepSpeed is disabled:
text-generation-webui  |   File "/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 984, in _post_init_method
text-generation-webui  |     param.data = param.data.to(self.local_device)
text-generation-webui  | NotImplementedError: Cannot copy out of meta tensor; no data!

Credits to Ph0rk0z for multi-gpu handling.

Checklist:
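
For anyone who wants to try the backend outside the webui, here is a minimal sketch of loading one of the tested models directly with AutoAWQ. The exact `from_quantized` keyword arguments are assumptions based on the AutoAWQ 0.1.x discussion below and may differ between versions.

```python
# Minimal sketch, not the PR's loader code: load an AWQ-quantized model with
# AutoAWQ and generate from it. Keyword arguments are assumed and may vary by version.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/vicuna-13B-v1.5-16K-AWQ"  # one of the models tested above

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=False,   # fused modules give a speedup but need more VRAM (see discussion below)
    max_new_tokens=512,  # load-time cache size used by the fused path, per the discussion below
)

input_ids = tokenizer("What is the speed of sound?", return_tensors="pt").input_ids.cuda()
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```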

@cal066
Contributor Author

cal066 commented Sep 19, 2023

Probably good as a proof of concept at this time.

@cal066
Contributor Author

cal066 commented Sep 19, 2023

@oobabooga any ideas about the DeepSpeed stuff?

@casper-hansen
Contributor

casper-hansen commented Sep 19, 2023

Fused modules will be released in the next version; a significant speedup will come from them.

v0.0.2 does not have the fused modules (only the basic ones).

@cal066
Contributor Author

cal066 commented Sep 19, 2023

@casper-hansen These are from 8788fe106a5e34b80cbaf03fbe4710c2cfb27328. They work perfectly fine after copying awq/modules/fuse over, as long as DeepSpeed is disabled.

@cal066
Contributor Author

cal066 commented Sep 20, 2023

@casper-hansen By the way, mind fixing the pip install? Somehow it misses copying awq/modules/fused when installing.

@casper-hansen
Contributor

@casper-hansen By the way, mind fixing the pip install? Somehow it misses copying awq/modules/fused when installing.

It is not missing the modules; v0.0.2 simply did not have them implemented two weeks ago. I am working on releasing the next version.

@cal066
Contributor Author

cal066 commented Sep 20, 2023

I mean I was using
`pip install git+https://github.com/casper-hansen/AutoAWQ@a5e8b048abec1b4e378973130f68863805d46eab`
in my Docker container; I'm not sure why they're not copied.

@casper-hansen
Contributor

@cal066 It seems there was an issue: the __init__.py file was missing, which caused this. A subtle mistake by me that only appeared after building with GitHub workflows. Fixed now, and I'm releasing v0.1.0 today.
casper-hansen/AutoAWQ@fbeea40
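
For context, the kind of packaging slip described here (a subpackage without an `__init__.py` being silently dropped from the built wheel) typically looks like this in a setuptools layout. This is a generic illustration, not AutoAWQ's actual setup.py.

```python
# Generic setuptools illustration (not AutoAWQ's actual setup.py):
# find_packages() only collects directories that contain an __init__.py,
# so a subpackage like awq/modules/fused without one is silently omitted
# from the wheel built by CI.
from setuptools import setup, find_packages

setup(
    name="example-package",
    version="0.1.0",
    packages=find_packages(),  # misses any directory lacking __init__.py
)
```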

@cal066
Contributor Author

cal066 commented Sep 21, 2023

@casper-hansen Awesome thanks, waiting for the new version.

@DiamondGlassDrill

Very much looking forward. Thanks guys

@casper-hansen
Contributor

You can now try pip install autoawq==0.1.0

@cal066 cal066 force-pushed the autoawq1 branch 3 times, most recently from 97706d1 to d656071 on September 21, 2023 16:21
@cal066
Contributor Author

cal066 commented Sep 21, 2023

@casper-hansen Thanks, but I just noticed I can't actually get the fused modules to work; I don't have enough VRAM.
@oobabooga This seems to be working now.

@casper-hansen
Contributor

@casper-hansen Thanks, but I just noticed I can't actually get the fused modules to work; I don't have enough VRAM. @oobabooga This seems to be working now.

If you set max_new_tokens, the VRAM usage will vary accordingly (see benchmark.py). At the minimum setting, it will not take much VRAM.

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 231.393 | 153.632 | 4.66 GB (19.68%) |
| 1 | 64 | 64 | 233.909 | 154.475 | 4.66 GB (19.68%) |
| 1 | 128 | 128 | 233.145 | 152.133 | 4.66 GB (19.68%) |
| 1 | 256 | 256 | 228.562 | 147.692 | 4.67 GB (19.72%) |
| 1 | 512 | 512 | 228.914 | 139.179 | 4.80 GB (20.26%) |
| 1 | 1024 | 1024 | 227.393 | 125.058 | 5.56 GB (23.48%) |
| 1 | 2048 | 2048 | 225.736 | 123.228 | 8.08 GB (34.09%) |

@cal066
Contributor Author

cal066 commented Sep 21, 2023

@casper-hansen is max_new_tokens equivalent to context length? I'm not sure if I interpreted it correctly.
Regardless, max_new_tokens=512 batch_size=1 uses around 9.3GB VRAM after enabling fusing, not sure if I'm doing it wrong.

@casper-hansen
Contributor

@casper-hansen is max_new_tokens equivalent to context length? I'm not sure if I interpreted it correctly. Regardless, max_new_tokens=512 batch_size=1 uses around 9.3GB VRAM after enabling fusing, not sure if I'm doing it wrong.

max_new_tokens pre-allocates the correct amount of cache when you use the fused modules. 9.3 GB of VRAM seems a little high compared to my own benchmarks on a 13B. You can always disable the fused layers if you do not have the VRAM for them.
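
To make the trade-off concrete, here is a hedged sketch of the two load paths described above; the keyword names are assumed from the AutoAWQ 0.1.x API and may differ in other versions.

```python
# Sketch only: the fused path pre-allocates its cache based on max_new_tokens,
# so smaller values keep VRAM usage down; fuse_layers=False skips fusion entirely.
from awq import AutoAWQForCausalLM

model_path = "TheBloke/vicuna-13B-v1.5-16K-AWQ"  # illustrative

# Fused modules: faster decoding, but VRAM grows with the pre-allocated cache.
model_fused = AutoAWQForCausalLM.from_quantized(
    model_path, fuse_layers=True, max_new_tokens=512
)

# Non-fused modules: slower, but there is no pre-allocated cache to budget for.
model_plain = AutoAWQForCausalLM.from_quantized(
    model_path, fuse_layers=False
)
```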

@cal066
Contributor Author

cal066 commented Sep 21, 2023

@casper-hansen I guess max_new_tokens isn't really the context length; is there a variable for setting how long the model context should be? Non-fused layers work perfectly fine otherwise.

@casper-hansen
Contributor

@casper-hansen I guess max_new_tokens isn't really the context length; is there a variable for setting how long the model context should be? Non-fused layers work perfectly fine otherwise.

max_new_tokens is the maximum number of new tokens to generate. The context that you input will automatically go into the cache.

@BadisG
Contributor

BadisG commented Sep 21, 2023

Is this quant method better than GPTQ (act-order + groupsize 32) for the same size in terms of perplexity?

@casper-hansen
Contributor

casper-hansen commented Sep 21, 2023

Is this quant method better than GPTQ (act-order + groupsize 32) for the same size in terms of perplexity?

It is better than act-order + groupsize 128 according to the paper. I have not tested GPTQ methods to make this comparison myself, but the loss in perplexity is generally very low, for whatever that benchmark is worth given how hard it is to trust.

[image: perplexity comparison table from the AWQ paper]

@cal066
Contributor Author

cal066 commented Sep 22, 2023

Added a new UI option for AutoAWQ's max_new_tokens; it is related to the generation max_new_tokens but needs to be set when the model is loaded. A RoPE UI has been added as well, but it is untested.

@hronoas
Contributor

hronoas commented Sep 22, 2023

Not working with xformers...

Traceback (most recent call last):
File "E:\neuro\LLM\modules\callbacks.py", line 56, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "E:\neuro\LLM\modules\text_generation.py", line 347, in generate_with_callback
shared.model.generate(**kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\awq\models\base.py", line 36, in generate
return self.model.generate(*args, **kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\transformers\generation\utils.py", line 1648, in generate
return self.sample(
File "E:\neuro\LLM\venv\lib\site-packages\transformers\generation\utils.py", line 2730, in sample
outputs = self(
File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 820, in forward
outputs = self.model(
File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 708, in forward
layer_outputs = decoder_layer(
File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 424, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\neuro\LLM\modules\llama_attn_hijack.py", line 40, in xformers_forward
key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 129, 64, 128]' is invalid for input of size 132096
Output generated in 0.35 seconds (0.00 tokens/s, 0 tokens, context 129, seed 938926578)

@casper-hansen
Contributor

casper-hansen commented Sep 22, 2023

@hronoas it seems the dimensions are wrong; it should be [1, 129, 64, 128//8]

@cal066
Contributor Author

cal066 commented Sep 23, 2023

It is working fine for me with xformers:

Replacing layers...:  95%|█████████▌| 38/40 [00:06<00:00,  5.74it/s]
Replacing layers...:  98%|█████████▊| 39/40 [00:06<00:00,  5.73it/s]
Replacing layers...: 100%|██████████| 40/40 [00:07<00:00,  5.66it/s]
text-generation-webui  | 2023-09-21 16:41:03 INFO:Replaced attention with xformers_attention
text-generation-webui  | 2023-09-21 16:41:03 INFO:Loaded the model in 9.87 seconds.
text-generation-webui  |
text-generation-webui  | This is a conversation with your Assistant. It is a computer program designed to help you with various tasks such as answering questions, providing recommendations, and helping with decision making. You can ask it anything you want and it will do its best to give you accurate and relevant information.
text-generation-webui  | You: What is the speed of sound?
text-generation-webui  | Assistant:
text-generation-webui  | --------------------
text-generation-webui  |
text-generation-webui  | Output generated in 7.24 seconds (7.18 tokens/s, 52 tokens, context 72, seed 1027579411)

@hronoas, what model are you using?

@hronoas
Contributor

hronoas commented Sep 23, 2023

model: TheBloke/Phind-CodeLlama-34B-v2-AWQ

Technical details

run arguments

python server.py --api --listen --gradio-auth user:pass --verbose --xformers --model Phind-CodeLlama-34B-v2-AWQ

user-config.yaml (for model):

  loader: AutoAWQ
  auto_devices: false
  disk: false
  cpu: false
  trust_remote_code: false
  no_inject_fused_attention: true
  n_batch: 512
  max_new_tokens: 512
  truncation_length: 4096
  mode: instruct
  instruction_template: Phind
  preset: MY
  chat_generation_attempts: 3
  max_seq_len: 4096
  compress_pos_emb: 1
  alpha_value: 1
  rope_freq_base: 0

Python 3.10.10 (win x64)

pip freeze

absl-py==1.4.0
accelerate==0.23.0
aiofiles==23.1.0
aiohttp==3.8.5
aiosignal==1.3.1
altair==5.0.1
antlr4-python3-runtime==4.9.3
anyio==3.7.1
appdirs==1.4.4
APScheduler==3.6.3
asttokens==2.2.1
async-timeout==4.0.3
attributedict==0.3.0
attrs==23.1.0
auto-gptq @ https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.4.2/auto_gptq-0.4.2+cu117-cp310-cp310-win_amd64.whl#sha256=7145db94f57db80d1d292880487870686079d1b83ef48d3043b9b01023301fa4
autoawq @ https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.0/autoawq-0.1.0-cp310-cp310-win_amd64.whl#sha256=c4de5ff08833fbeb2457dccab135a1535a49629864ef6f9494fc2a4cc3257877
backcall==0.2.0
backoff==2.2.1
beautifulsoup4==4.12.2
bitsandbytes @ https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl#sha256=92258b5461c51e54fe9c6ab256aa61f42eee82423ecb8b36ba27a7024b743cc3
blessings==1.7
blinker==1.6.2
cachetools==5.3.1
certifi==2022.12.7
cffi==1.15.1
chardet==5.2.0
charset-normalizer==2.1.1
chromadb==0.3.18
click==8.1.6
clickhouse-connect==0.6.8
codecov==2.1.13
colorama==0.4.6
coloredlogs==15.0.1
colour-runner==0.1.1
contourpy==1.1.0
coverage==7.3.1
cramjam==2.7.0
ctransformers @ https://github.com/jllllll/ctransformers-cuBLAS-wheels/releases/download/AVX2/ctransformers-0.2.27+cu117-py3-none-any.whl#sha256=701d93024d09d679f3afd7fb645cf583bd44ba88ce4a48482742367cbeb52b30
cycler==0.11.0
DataProperty==1.0.1
datasets==2.10.1
decorator==5.1.1
deep-translator==1.9.2
deepdiff==6.5.0
dill==0.3.6
diskcache==5.6.1
distlib==0.3.7
docker-pycreds==0.4.0
docopt==0.6.2
duckdb==0.8.1
einops==0.6.1
elevenlabs==0.2.24
exceptiongroup==1.1.3
executing==1.2.0
exllama @ https://github.com/jllllll/exllama/releases/download/0.0.17/exllama-0.0.17+cu117-cp310-cp310-win_amd64.whl#sha256=64eff5fefde42b113c64e346c062e50ace5a648257053e889fd618026928b84f
exllamav2==0.0.2
fastapi==0.95.2
fastparquet==2023.8.0
ffmpeg==1.4
ffmpeg-python==0.2.0
ffmpy==0.3.1
filelock==3.12.4
Flask==2.3.2
flask-cloudflared==0.0.12
fonttools==4.42.0
frozenlist==1.4.0
fsspec==2023.6.0
future==0.18.3
gitdb==4.0.10
GitPython==3.1.32
google-auth==2.22.0
google-auth-oauthlib==1.0.0
gptq-for-llama @ https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.0/gptq_for_llama-0.1.0+cu117-cp310-cp310-win_amd64.whl#sha256=bd7234ee8f49ddad3bada293d826ed727203f1bc4e245479593ffa2a561fc397
gradio==3.33.1
gradio_client==0.2.5
grpcio==1.57.0
h11==0.14.0
hnswlib==0.7.0
httpcore==0.17.3
httptools==0.6.0
httpx==0.24.1
huggingface-hub==0.16.4
humanfriendly==10.0
idna==3.4
importlib-metadata==6.8.0
inspecta==0.1.3
ipython==8.14.0
itsdangerous==2.1.2
jedi==0.19.0
Jinja2==3.1.2
joblib==1.3.2
jsonlines==4.0.0
jsonschema==4.19.0
jsonschema-specifications==2023.7.1
kiwisolver==1.4.4
linkify-it-py==2.0.2
llama-cpp-python-ggml @ https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python_ggml-0.1.78+cpuavx2-cp310-cp310-win_amd64.whl#sha256=6c0cb266a3c22d3a170efb2f19d6c63907efa82288d436e5127daf9ab54c6f9c
llama-cpp-python-ggml-cuda @ https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_ggml_cuda-0.1.78+cu117-cp310-cp310-win_amd64.whl#sha256=04ca481d43a5b28c45959a6edad2126699461f99607417c7421625738901c112
llama_cpp_python @ https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.6/llama_cpp_python-0.2.6-cp310-cp310-win_amd64.whl#sha256=72007119c1fe7647480847da2adc4058d34732ec073f055227bb460faae99707
llama_cpp_python_cuda @ https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.6+cu117-cp310-cp310-win_amd64.whl#sha256=62e0f7bd94817994bc47f9621cc94db1c0e9a809f9442083f81f3ddeb2625e32
llvmlite==0.40.1
lm-eval==0.3.0
lxml==4.9.3
lz4==4.3.2
Markdown==3.4.4
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.7.2
matplotlib-inline==0.1.6
mbstrdecoder==1.1.3
mdit-py-plugins==0.3.3
mdurl==0.1.2
monotonic==1.6
more-itertools==10.1.0
mpmath==1.2.1
multidict==6.0.4
multiprocess==0.70.14
networkx==3.0
ngrok==0.8.1
ninja==1.11.1
nltk==3.8.1
num2words==0.5.12
numba==0.57.1
numexpr==2.8.6
numpy==1.24.0
oauthlib==3.2.2
omegaconf==2.3.0
openai==0.28.0
openai-whisper==20230314
optimum==1.13.1
ordered-set==4.1.0
orjson==3.9.5
packaging==23.1
pandas==2.0.3
parso==0.8.3
pathtools==0.1.2
pathvalidate==3.2.0
peft @ git+https://github.com/huggingface/peft@96c0277a1b9a381b10ab34dbf84917f9b3b992e6
pickleshare==0.7.5
Pillow==10.0.0
platformdirs==3.10.0
pluggy==1.3.0
portalocker==2.8.2
posthog==2.4.2
prompt-toolkit==3.0.39
protobuf==4.24.0
psutil==5.9.5
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==12.0.1
pyasn1==0.5.0
pyasn1-modules==0.3.0
pybind11==2.11.1
pycountry==22.3.5
pycparser==2.21
pydantic==1.10.12
pydub==0.25.1
Pygments==2.16.1
pyparsing==3.0.9
pyproject-api==1.6.1
pyreadline3==3.4.1
pytablewriter==1.0.0
python-dateutil==2.8.2
python-dotenv==1.0.0
python-multipart==0.0.6
python-telegram-bot==13.15
pytz==2023.3
pywin32==306
PyYAML==6.0.1
referencing==0.30.2
regex==2023.8.8
requests==2.28.1
requests-oauthlib==1.3.1
responses==0.18.0
rootpath==0.1.1
rouge==1.0.1
rouge-score==0.1.2
rpds-py==0.9.2
rsa==4.9
sacrebleu==1.5.0
safetensors==0.3.2
scikit-learn==1.3.0
scipy==1.11.1
semantic-version==2.10.0
sentence-transformers==2.2.2
sentencepiece==0.1.99
sentry-sdk==1.29.2
setproctitle==1.3.2
six==1.16.0
smmap==5.0.0
sniffio==1.3.0
soundfile==0.12.1
soupsieve==2.4.1
SpeechRecognition==3.10.0
sqlitedict==2.1.0
stack-data==0.6.2
starlette==0.27.0
sympy==1.11.1
tabledata==1.3.3
tabulate==0.9.0
tcolorpy==0.1.4
tensorboard==2.14.0
tensorboard-data-server==0.7.1
termcolor==2.3.0
texttable==1.6.7
threadpoolctl==3.2.0
tiktoken==0.3.1
tokenizers==0.13.3
toml==0.10.2
tomli==2.0.1
toolz==0.12.0
torch==2.0.1+cu118
torchaudio==2.0.2+cu118
torchvision==0.15.2+cu118
tornado==6.1
tox==4.11.3
tqdm==4.66.1
tqdm-multiprocess==0.0.11
traitlets==5.9.0
transformers==4.33.1
typepy==1.3.1
typing_extensions==4.7.1
tzdata==2023.3
tzlocal==5.0.1
uc-micro-py==1.0.2
urllib3==1.26.13
uvicorn==0.23.2
virtualenv==20.24.5
waitress==2.1.2
wandb==0.15.8
watchfiles==0.19.0
wcwidth==0.2.6
websockets==11.0.2
Werkzeug==2.3.7
xformers==0.0.21
xxhash==3.3.0
yarl==1.9.2
zipp==3.16.2
zstandard==0.21.0

console output:


bin E:\neuro\LLM\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll
2023-09-23 05:19:08 INFO:Loading settings from settings.yaml...
2023-09-23 05:19:08 INFO:Loading Phind-CodeLlama-34B-v2-AWQ...
Replacing layers...: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:03<00:00, 15.67it/s]
2023-09-23 05:19:20 INFO:Replaced attention with xformers_attention
2023-09-23 05:19:20 INFO:Loaded the model in 12.18 seconds.

Starting streaming server at ws://0.0.0.0:5005/api/v1/stream
2023-09-23 05:19:20 INFO:Loading the extension "code_syntax_highlight"...
2023-09-23 05:19:20 INFO:Loading the extension "openai"...
Starting API at http://0.0.0.0:5000/api
2023-09-23 05:19:20 INFO:Loading the extension "history"...
OpenAI compatible API ready at: OPENAI_API_BASE=http://0.0.0.0:5001/v1
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.


### System Prompt
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### User Message
test

### Assistant

--------------------

Traceback (most recent call last):
  File "E:\neuro\LLM\modules\callbacks.py", line 56, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "E:\neuro\LLM\modules\text_generation.py", line 347, in generate_with_callback
    shared.model.generate(**kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\awq\models\base.py", line 36, in generate
    return self.model.generate(*args, **kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\transformers\generation\utils.py", line 1648, in generate
    return self.sample(
  File "E:\neuro\LLM\venv\lib\site-packages\transformers\generation\utils.py", line 2730, in sample
    outputs = self(
  File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 820, in forward
    outputs = self.model(
  File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 708, in forward
    layer_outputs = decoder_layer(
  File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\neuro\LLM\venv\lib\site-packages\transformers\models\llama\modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "E:\neuro\LLM\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\neuro\LLM\modules\llama_attn_hijack.py", line 40, in xformers_forward
    key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 42, 64, 128]' is invalid for input of size 43008
Output generated in 0.45 seconds (0.00 tokens/s, 0 tokens, context 42, seed 1955952382)

I also tried TheBloke/Xwin-LM-13B-V0.1-AWQ, with the same result.

@cal066
Contributor Author

cal066 commented Oct 3, 2023

I have dropped the scaling options for now, but enabled all the HF generation parameter options, since it uses the same ones.

@casper-hansen
Contributor

@casper-hansen I just noticed rope_scaling failing today:
TypeError: AutoAWQForCausalLM.from_quantized() got an unexpected keyword argument 'rope_scaling'
I don't remember that error before; is it actually supported?

AutoAWQ has never taken any RoPE parameters as input. It is something that could be implemented with time though.

@cal066
Contributor Author

cal066 commented Oct 3, 2023

OK, I guess it was never really supported and I did not properly test it. Removed for now.

@s-konnex-engine

Forgive the noobie interruption while you're hard at work, but how do I get this to show up in the textgen webui?

I've cloned the GitHub repo via the web UI and restarted, but AutoAWQ does not show up in the model loaders list.

Thanks in advance, and keep up the good work...

@cal066
Contributor Author

cal066 commented Oct 4, 2023

@s-konnex-engine How did you clone? You need to clone from my fork; it's not merged into oobabooga's yet.

@s-konnex-engine

@cal066 Ohhhh!!! My bad!!! I basically copied the link to the casper-hansen repo and pasted it into the textgen "add extensions" input. You mean I have to clone your textgen repo until it's merged. Thanks a million in any case; my 6GB 1060 appreciates you as much as I do. :D

@cal066
Contributor Author

cal066 commented Oct 5, 2023

@Ph0rk0z Thanks, added your changes with slight tweaks after looking at how get_max_memory_dict() works.

@CoryG89

CoryG89 commented Oct 5, 2023

I retested this with multi-GPU using the suggestion from @Ph0rk0z and am now able to get it to split the memory across my two 3090s. I had to set the memory limits much lower on each GPU to get it to work correctly, so as @Ph0rk0z said, it does seem like something is not calculating the limits correctly, but it works.
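
For reference, the multi-GPU split discussed here works through an accelerate-style max_memory mapping. The sketch below is a rough illustration; the `max_memory` keyword and the specific limits are assumptions drawn from this thread rather than verified API details.

```python
# Illustrative only: cap per-GPU memory so the weights are split across two cards.
# As noted above, the limits may need to be set well below the physical VRAM.
from awq import AutoAWQForCausalLM

max_memory = {0: "18GiB", 1: "18GiB", "cpu": "64GiB"}  # example values for 2x 3090

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Phind-CodeLlama-34B-v2-AWQ",  # illustrative model
    fuse_layers=False,
    max_memory=max_memory,  # assumed keyword, per the multi-GPU handling in this PR
)
```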

@oobabooga
Owner

@casper-hansen question: I see that the AutoAWQ wheels do not have a +cu in their filenames. Are they nvidia-only, or is there a chance that they could work on AMD/Intel Arc GPUs/Mac Metal out of the box?

@casper-hansen
Contributor

@oobabooga They are Nvidia-only at the moment, Ampere or later. The next release of AutoAWQ brings Turing support and the 2GB memory saving. I am actively working on a Torch-only module that should enable AMD/Metal/CPU users to use the models but I suspect it will not be ready for next release.

@oobabooga
Owner

Thanks for the confirmation. For reference, these are the results of 2 quick perplexity tests that I ran a couple of days ago:

| Test | Model | Perplexity |
|---|---|---|
| Test 1 | TheBloke_Llama-2-7B-AWQ | 5.641422748565674 |
| Test 1 | TheBloke_Llama-2-7B-GPTQ_gptq-4bit-128g-actorder_True | 5.681295394897461 |
| Test 2 | TheBloke_Llama-2-7B-AWQ | 6.007129192352295 |
| Test 2 | TheBloke_Llama-2-7B-GPTQ_gptq-4bit-128g-actorder_True | 6.036174297332764 |

Test 1 is wikitext with a high stride value, and test 2 is a private dataset (exactly the same one as in the tests here https://oobabooga.github.io/blog/posts/perplexities/).

So it seems like AWQ performs consistently a bit better than 4-bit 128g GPTQ. The speed is also very good.
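
For context, stride-based wikitext perplexity is usually computed along these lines. This is a generic sketch, not the exact script behind the numbers above, and the model id is a placeholder.

```python
# Generic sliding-window perplexity sketch. A higher stride means less overlap
# between evaluation windows, which is faster but slightly less exact.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids

max_len, stride = 2048, 1024
nlls, prev_end = [], 0
for begin in range(0, input_ids.size(1), stride):
    end = min(begin + max_len, input_ids.size(1))
    trg_len = end - prev_end            # only score tokens not already scored
    ids = input_ids[:, begin:end].to(model.device)
    labels = ids.clone()
    labels[:, :-trg_len] = -100         # mask out the overlapping context
    with torch.no_grad():
        nlls.append(model(ids, labels=labels).loss * trg_len)
    prev_end = end
    if end == input_ids.size(1):
        break

print("Perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```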

I'll merge this PR and then update AutoAWQ in a future commit once a new version with the VRAM reduction is released. @cal066 thank you for this PR.

@oobabooga oobabooga merged commit cc632c3 into oobabooga:main Oct 5, 2023
@cal066
Contributor Author

cal066 commented Oct 5, 2023

Thanks all for helping to test and merge it!

@casper-hansen
Contributor

@oobabooga I just released v0.1.3 - which brings the VRAM reduction and Turing support. You should update :)

@AG-w
Contributor

AG-w commented Oct 5, 2023

I get the same `RuntimeError: probability tensor contains either inf, nan or element < 0` message.

I guess I'll just give up on AWQ for now.

@casper-hansen
Contributor

@AG-w Which GPU and CUDA version? Did you follow the instructions for installing CUDA dependencies in a conda environment in AutoAWQ?

@AG-w
Contributor

AG-w commented Oct 5, 2023

Is there a separate installation step for CUDA support in AWQ? Because I can use GGUF with CUDA just fine.

It's a GTX 1060 6GB with the 537.42 WHQL driver.

@casper-hansen
Contributor

The GTX 1060 is a Pascal card, which is not supported in AutoAWQ. You need a Turing or Ampere card to run the AWQ kernels.

@TheBloke
Contributor

TheBloke commented Oct 5, 2023

Nice. I'll add a text-gen-webui section to my AWQ READMEs

@AG-w
Contributor

AG-w commented Oct 5, 2023

OK, I didn't notice there was an extra requirement.

But I noticed your code has a `defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800` fallback. Is that planned for the future, or is it just unused code?

@BadisG
Contributor

BadisG commented Oct 5, 2023

@TheBloke Will you go for 32g quants now? I wanna know how it compares in terms of perplexity to GPTQ 32g + act_order

@casper-hansen
Contributor

defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800

This is a fallback that implements Turing-compatible code. I would welcome any PRs adding similar support for Pascal cards - you would have to look into replacing some of the PTX code with m8n8k4 and find another solution for the ldmatrix calls. However, I am by no means a CUDA expert, so I hope the community can help out with this one and with other CUDA-related optimizations.
