System Info

text-generation-launcher version: 2.0.5-dev0
cargo version: 1.79.0
os version: Ubuntu 22.04.4 LTS
python version: 3.10.14
cuda version: 12.2
hardware used: NVIDIA RTX A6000
used model: https://huggingface.co/LoneStriker/Qwen2-72B-Instruct-4.0bpw-h6-exl2
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
I downloaded LoneStriker/Qwen2-72B-Instruct-4.0bpw-h6-exl2 and saved it to a local path.
I built and installed text-generation-launcher from source; the version is text-generation-launcher 2.0.5-dev0.
Then I ran the following command to launch the model:

CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id /root/huggingface/cache/Qwen2-72B-Instruct-4.0bpw-h6-exl2 --max-total-tokens 4096 --port 8002 --num-shard 1 --quantize exl2
It failed to load the model.

Expected behavior

It seems that TGI tries to use weights.get_multi_weights_col in flash_qwen2_modeling.py, and this method is not supported for exl2 quantization. Is Qwen2 with exl2 quantization not supported yet in TGI? I hope to see it supported soon.
We concatenate the QKV matrices for efficiency. However, this is difficult for exl2-quantized matrices because the row groups of the quantized matrices usually don't align, which is why the exception was raised. I will see if we can handle this differently in a nice manner.
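
To make the alignment problem concrete, here is a minimal, self-contained Python sketch (illustrative only — not TGI's or exllamav2's actual code; QuantGroup and Exl2Weight are hypothetical names). exl2 quantizes each matrix in row groups whose sizes and bit widths can vary per matrix, so a fused QKV matrix would have group boundaries that no longer match any single set of dequantization parameters:

```python
# Hypothetical sketch of why fusing Q/K/V weights is hard for exl2:
# each matrix is quantized in row groups whose sizes (and bit widths)
# can differ per matrix, so the group boundaries of a naively
# concatenated matrix no longer line up.
from dataclasses import dataclass
from typing import List


@dataclass
class QuantGroup:
    rows: int   # number of rows covered by this group
    bits: int   # bit width for this group (exl2 mixes bit widths)


@dataclass
class Exl2Weight:
    name: str
    groups: List[QuantGroup]  # hypothetical per-matrix group layout


def can_fuse(weights: List[Exl2Weight]) -> bool:
    """Concatenation is only safe if every matrix shares the same group
    layout; otherwise dequantization parameters (scales, bit widths)
    would be applied to the wrong rows of the fused matrix."""
    first = weights[0].groups
    return all(w.groups == first for w in weights[1:])


q = Exl2Weight("q_proj", [QuantGroup(128, 4), QuantGroup(128, 5)])
k = Exl2Weight("k_proj", [QuantGroup(64, 4), QuantGroup(192, 4)])
v = Exl2Weight("v_proj", [QuantGroup(256, 4)])

print(can_fuse([q, k, v]))  # False: the row group boundaries don't align
```

One way around this (under the same assumptions) would be to keep q_proj, k_proj, and v_proj as separate matrices and run three projections instead of one fused one, trading some kernel-launch overhead for compatibility with exl2's per-matrix group layouts.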