System Info

text-generation-launcher version: 2.0.5-dev0
cargo version: 1.79.0
os version: Ubuntu 22.04.4 LTS
python version: 3.10.14
cuda version: 12.2
hardware used: NVIDIA RTX A6000
used model: https://huggingface.co/LoneStriker/Qwen2-72B-Instruct-4.0bpw-h6-exl2
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
I downloaded LoneStriker/Qwen2-72B-Instruct-4.0bpw-h6-exl2 and saved it to a local path.
I built and installed text-generation-launcher from source; the version is text-generation-launcher 2.0.5-dev0.
Then I ran the following command to launch the model:

CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id /root/huggingface/cache/Qwen2-72B-Instruct-4.0bpw-h6-exl2 --max-total-tokens 4096 --port 8002 --num-shard 1 --quantize exl2
It failed to load the model.

Expected behavior

It seems that TGI tries to use weights.get_multi_weights_col in flash_qwen2_modeling.py, and this method is not supported for exl2 quantization. Is Qwen2 with exl2 quantization not supported yet in TGI? I hope to see it supported soon.
We concatenate the QKV matrices for efficiency. However, this is difficult for exl2-quantized matrices because the row groups of the quantized matrices usually don't align, which is why the exception was raised. I will see if we can handle this differently in a nice manner.
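
To make the alignment problem concrete, here is a minimal, self-contained Python sketch (illustrative only — not TGI's or exllamav2's actual code; QuantGroup and Exl2Weight are hypothetical names). exl2 quantizes each matrix in row groups whose sizes and bit widths can vary per matrix, so a fused QKV matrix would have group boundaries that no longer match any single set of dequantization parameters:

```python
# Hypothetical sketch of why fusing Q/K/V weights is hard for exl2:
# each matrix is quantized in row groups whose sizes (and bit widths)
# can differ per matrix, so the group boundaries of a naively
# concatenated matrix no longer line up.
from dataclasses import dataclass
from typing import List


@dataclass
class QuantGroup:
    rows: int   # number of rows covered by this group
    bits: int   # bit width for this group (exl2 mixes bit widths)


@dataclass
class Exl2Weight:
    name: str
    groups: List[QuantGroup]  # hypothetical per-matrix group layout


def can_fuse(weights: List[Exl2Weight]) -> bool:
    """Concatenation is only safe if every matrix shares the same group
    layout; otherwise dequantization parameters (scales, bit widths)
    would be applied to the wrong rows of the fused matrix."""
    first = weights[0].groups
    return all(w.groups == first for w in weights[1:])


q = Exl2Weight("q_proj", [QuantGroup(128, 4), QuantGroup(128, 5)])
k = Exl2Weight("k_proj", [QuantGroup(64, 4), QuantGroup(192, 4)])
v = Exl2Weight("v_proj", [QuantGroup(256, 4)])

print(can_fuse([q, k, v]))  # False: the row group boundaries don't align
```

One way around this (under the same assumptions) would be to keep q_proj, k_proj, and v_proj as separate matrices and run three projections instead of one fused one, trading some kernel-launch overhead for compatibility with exl2's per-matrix group layouts.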