Unable to load Qwen2-72B-Instruct-exl2 model #2081

Closed

sunxichen opened this issue Jun 18, 2024 · 2 comments · Fixed by #2085

Comments

@sunxichen
Contributor

System Info

text-generation-launcher version: text-generation-launcher 2.0.5-dev0
cargo version: 1.79.0
os version: Ubuntu 22.04.4 LTS
python version: 3.10.14
cuda version: 12.2
hardware used: NVIDIA RTX A6000
used model: https://huggingface.co/LoneStriker/Qwen2-72B-Instruct-4.0bpw-h6-exl2

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. I downloaded LoneStriker/Qwen2-72B-Instruct-4.0bpw-h6-exl2 and saved it to a local path.
  2. Built and installed text-generation-launcher from source; the version is text-generation-launcher 2.0.5-dev0.
  3. Then I ran the following command to launch the model:
    CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id /root/huggingface/cache/Qwen2-72B-Instruct-4.0bpw-h6-exl2 --max-total-tokens 4096 --port 8002 --num-shard 1 --quantize exl2
  4. It failed to load the model; the error message follows:
/root/software/text-generation-inference/server/text_generation_server/models/custom_modeling/fl │
│ ash_qwen2_modeling.py:43 in _load_gqa                                                            │
│                                                                                                  │
│    40 │   assert config.hidden_size % config.num_attention_heads == 0                            │
│    41 │   assert config.num_attention_heads % weights.process_group.size() == 0                  │
│    42 │                                                                                          │
│ ❱  43 │   weight = weights.get_multi_weights_col(                                                │
│    44 │   │   prefixes=[f"{prefix}.q_proj", f"{prefix}.k_proj", f"{prefix}.v_proj"],             │
│    45 │   │   quantize=config.quantize,                                                          │
│    46 │   │   dim=0,                                                                             │
│                                                                                                  │
│ ╭───────────────────────────────────────── locals ──────────────────────────────────────────╮    │
│ │  config = Qwen2Config {                                                                   │    │
│ │             "_name_or_path": "/root/huggingface/cache/Qwen2-72B-Instruct-4.0bpw-h6-exl2", │    │
│ │             "architectures": [                                                            │    │
│ │           │   "Qwen2ForCausalLM"                                                          │    │
│ │             ],                                                                            │    │
│ │             "attention_dropout": 0.0,                                                     │    │
│ │             "bos_token_id": 151643,                                                       │    │
│ │             "eos_token_id": 151645,                                                       │    │
│ │             "hidden_act": "silu",                                                         │    │
│ │             "hidden_size": 8192,                                                          │    │
│ │             "initializer_range": 0.02,                                                    │    │
│ │             "intermediate_size": 29568,                                                   │    │
│ │             "max_position_embeddings": 32768,                                             │    │
│ │             "max_window_layers": 80,                                                      │    │
│ │             "model_type": "qwen2",                                                        │    │
│ │             "num_attention_heads": 64,                                                    │    │
│ │             "num_hidden_layers": 80,                                                      │    │
│ │             "num_key_value_heads": 8,                                                     │    │
│ │             "quantization_config": {                                                      │    │
│ │           │   "bits": 4.0,                                                                │    │
│ │           │   "calibration": {                                                            │    │
│ │           │     "dataset": "(default)",                                                   │    │
│ │           │     "length": 2048,                                                           │    │
│ │           │     "rows": 100                                                               │    │
│ │           │   },                                                                          │    │
│ │           │   "head_bits": 6,                                                             │    │
│ │           │   "quant_method": "exl2",                                                     │    │
│ │           │   "version": "0.1.4"                                                          │    │
│ │             },                                                                            │    │
│ │             "quantize": "exl2",                                                           │    │
│ │             "rms_norm_eps": 1e-06,                                                        │    │
│ │             "rope_theta": 1000000.0,                                                      │    │
│ │             "sliding_window": 131072,                                                     │    │
│ │             "speculator": null,                                                           │    │
│ │             "tie_word_embeddings": false,                                                 │    │
│ │             "torch_dtype": "bfloat16",                                                    │    │
│ │             "transformers_version": "4.41.1",                                             │    │
│ │             "use_cache": true,                                                            │    │
│ │             "use_sliding_window": false,                                                  │    │
│ │             "vocab_size": 152064                                                          │    │
│ │           }                                                                               │    │
│ │  prefix = 'model.layers.0.self_attn'                                                      │    │
│ │ weights = <text_generation_server.utils.weights.Weights object at 0x7f2d5ee131f0>         │    │
│ ╰───────────────────────────────────────────────────────────────────────────────────────────╯    │
│                                                                                                  │
│ /root/software/text-generation-inference/server/text_generation_server/utils/weights.py:318 in   │
│ get_multi_weights_col                                                                            │
│                                                                                                  │
│   315 │                                                                                          │
│   316 │   def get_multi_weights_col(self, prefixes: List[str], quantize: str, dim: int):         │
│   317 │   │   if quantize == "exl2":                                                             │
│ ❱ 318 │   │   │   raise ValueError("get_multi_weights_col is not supported for exl2")            │
│   319 │   │   elif quantize in ["gptq", "awq"]:                                                  │
│   320 │   │   │   from text_generation_server.layers.gptq import GPTQWeight                      │
│   321                                                                                            │
│                                                                                                  │
│ ╭────────────────────────────────────── locals ──────────────────────────────────────╮           │
│ │      dim = 0                                                                       │           │
│ │ prefixes = [                                                                       │           │
│ │            │   'model.layers.0.self_attn.q_proj',                                  │           │
│ │            │   'model.layers.0.self_attn.k_proj',                                  │           │
│ │            │   'model.layers.0.self_attn.v_proj'                                   │           │
│ │            ]                                                                       │           │
│ │ quantize = 'exl2'                                                                  │           │
│ │     self = <text_generation_server.utils.weights.Weights object at 0x7f2d5ee131f0> │           │
│ ╰────────────────────────────────────────────────────────────────────────────────────╯           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: get_multi_weights_col is not supported for exl2 rank=0
2024-06-18T02:59:53.704724Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-18T02:59:53.704742Z  INFO text_generation_launcher: Shutting down shards

Expected behavior

It seems like TGI is trying to use weights.get_multi_weights_col in flash_qwen2_modeling.py, and this method is not supported for exl2 quantization. Is Qwen2 with exl2 quantization not supported in TGI yet? I hope to see it supported soon.

@danieldk
Member

danieldk commented Jun 18, 2024

We concatenate the Q, K, and V matrices for efficiency. However, this is difficult for exl2-quantized matrices because the row groups of the quantized matrices usually don't align, which is why the exception was raised. I will see if we can handle this in a nicer way.
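
For illustration, here is a minimal, hypothetical sketch (not TGI's actual code) of why fused QKV loading is awkward for exl2 and how keeping the three projections separate sidesteps it. FakeExl2Weight, can_fuse, and load_attention are invented names for this example and do not exist in text-generation-inference.

# Hypothetical sketch: each exl2-quantized tensor carries its own
# quantization group layout, so weights can only be fused into one QKV
# matrix when their layouts happen to be compatible.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeExl2Weight:
    """Stand-in for an exl2 matrix: output size plus per-group metadata."""
    out_features: int
    group_sizes: List[int]  # quantization group layout; may differ per tensor


def can_fuse(weights: List[FakeExl2Weight]) -> bool:
    # Concatenating into a single QKV weight would require matching layouts.
    first = weights[0].group_sizes
    return all(w.group_sizes == first for w in weights)


def load_attention(weights: Dict[str, FakeExl2Weight], prefix: str):
    q = weights[f"{prefix}.q_proj"]
    k = weights[f"{prefix}.k_proj"]
    v = weights[f"{prefix}.v_proj"]
    if can_fuse([q, k, v]):
        # One fused projection, analogous to what get_multi_weights_col
        # builds for gptq/awq layouts.
        return "fused", (q, k, v)
    # Fallback: keep three independent projections and concatenate their
    # outputs at runtime instead of their weights at load time.
    return "separate", (q, k, v)


if __name__ == "__main__":
    fake = {
        "model.layers.0.self_attn.q_proj": FakeExl2Weight(8192, [128, 64]),
        "model.layers.0.self_attn.k_proj": FakeExl2Weight(1024, [64, 64]),
        "model.layers.0.self_attn.v_proj": FakeExl2Weight(1024, [64, 64]),
    }
    mode, _ = load_attention(fake, "model.layers.0.self_attn")
    print(mode)  # -> "separate": group layouts differ, so fusion is skipped

Running the sketch prints "separate", mirroring the situation in this issue: the three projections have incompatible quantization layouts, so they cannot simply be stacked into one tensor at load time.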

danieldk added a commit that referenced this issue Jun 18, 2024
danieldk added a commit that referenced this issue Jun 20, 2024
@danieldk
Member

Should be fixed in main.
