
Problem loading model with GPTNeoX architecture (weight gpt_neox.layers.0.attention.rotary_emb.inv_freq does not exist) #1460

Closed
ZQ-Dev8 opened this issue Jan 19, 2024 · 3 comments · Fixed by #1498

Comments

ZQ-Dev8 commented Jan 19, 2024

System Info

  • System Info: 4xA6000 Workstation
  • Full command line causing issues: docker run --rm --entrypoint /bin/bash -itd --name "traclm-v1-3b-instruct" -v "path/to/folder":/data --gpus '"device=3"' -p 172.20.158.30:8082:80 ghcr.io/huggingface/text-generation-inference:latest
  • OS version: Ubuntu 22.04 LTS
  • Rust version: rustc 1.73.0 (cc66ad468 2023-10-03)
  • Model being used: Local finetune of togethercomputer/RedPajama-INCITE-Base-3B-v1 with teknium/GPT4-LLM-Cleaned, created using the HF Trainer without flash attention (could this be the problem?)
  • Hardware used: see below for nvidia-smi output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:19:00.0 Off |                  Off |
| 30%   38C    P8              34W / 300W |     12MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:1A:00.0 Off |                  Off |
| 30%   49C    P8              24W / 300W |  47642MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               Off | 00000000:67:00.0 Off |                  Off |
| 30%   55C    P8              27W / 300W |    968MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000               Off | 00000000:68:00.0 Off |                  Off |
| 30%   49C    P8              35W / 300W |     25MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

  • Deployment specifics: N/A; using the latest TGI image as of filing this issue.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

N/A. I am attempting to load the model via standard usage of TGI. The problem is related to the model itself, which uses the GPTNeoX architecture (pasted below). Note that the model was not trained with flash attention, though the stack trace seems to imply TGI expects it (e.g., the multiple instances of FlashNeoxAttention).

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50432, 2560)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=2560, out_features=7680, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=2560, out_features=10240, bias=True)
          (dense_4h_to_h): Linear(in_features=10240, out_features=2560, bias=True)
          (act): GELUActivation()
        )
      )
    )
    (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (embed_out): Linear(in_features=2560, out_features=50432, bias=False)
)

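A quick way to confirm the tensor is genuinely absent is to list the checkpoint's serialized tensor names directly. A minimal sketch (not part of the original report; "model.safetensors" stands in for whichever shard file the export actually produced):

```python
# List the serialized tensor names and look for the rotary buffers.
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    inv_freq_keys = [name for name in f.keys() if "inv_freq" in name]

# Checkpoints exported with recent transformers print an empty list here,
# i.e. exactly the tensors TGI's weight loader fails to find.
print(inv_freq_keys)
```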
Expected behavior

I would expect the TGI server to start as with any other supported model. However, for this model I get the following error:

Container started successfully!
Running the text-generation-launcher command from /data directory inside the container...only local files will be used
2024-01-19T21:21:00.841715Z  INFO text_generation_launcher: Args { model_id: "/data/traclm-v1-3b-instruct", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 4096, max_total_tokens: 8192, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "790d41f93e80", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2024-01-19T21:21:00.841808Z  INFO download: text_generation_launcher: Starting download process.
2024-01-19T21:21:03.300505Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-01-19T21:21:03.846038Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-01-19T21:21:03.846486Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-01-19T21:21:06.578815Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 175, in get_model
    return FlashNeoXSharded(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_neox.py", line 58, in __init__
    model = FlashGPTNeoXForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 366, in __init__
    self.gpt_neox = FlashGPTNeoXModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 308, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 309, in <listcomp>
    FlashNeoXLayer(layer_id, config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 223, in __init__
    self.attention = FlashNeoxAttention(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 101, in __init__
    self.rotary_emb = PositionRotaryEmbedding.load(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 625, in load
    inv_freq = weights.get_tensor(f"{prefix}.inv_freq")
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 75, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 63, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight gpt_neox.layers.0.attention.rotary_emb.inv_freq does not exist

2024-01-19T21:21:07.151755Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 175, in get_model
    return FlashNeoXSharded(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_neox.py", line 58, in __init__
    model = FlashGPTNeoXForCausalLM(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 366, in __init__
    self.gpt_neox = FlashGPTNeoXModel(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 308, in __init__
    [

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 309, in <listcomp>
    FlashNeoXLayer(layer_id, config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 223, in __init__
    self.attention = FlashNeoxAttention(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 101, in __init__
    self.rotary_emb = PositionRotaryEmbedding.load(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 625, in load
    inv_freq = weights.get_tensor(f"{prefix}.inv_freq")

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 75, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 63, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight gpt_neox.layers.0.attention.rotary_emb.inv_freq does not exist
 rank=0
2024-01-19T21:21:07.250391Z ERROR text_generation_launcher: Shard 0 failed to start
2024-01-19T21:21:07.250429Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
Failed to run text-generation-launcher.
ZQ-Dev8 (Author) commented Jan 20, 2024

For additional context: I retrained the model, this time with flash attention, and the error remains the same:

RuntimeError: weight gpt_neox.layers.0.attention.rotary_emb.inv_freq does not exist
 rank=0
2024-01-20T14:17:26.514770Z ERROR text_generation_launcher: Shard 0 failed to start
2024-01-20T14:17:26.514815Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
Failed to run text-generation-launcher.

dwyatte (Contributor) commented Jan 24, 2024

@Narsil I think this is probably the same as #790 for Llama, which you fixed in #793 by removing the loading of position embeddings from the weights. That fix seems simple enough, but it looks like there was some concern about it breaking previous versions of the model / integration tests (though maybe everything was fine).

I think this probably broke when position embeddings were removed from the GPTNeoX weights (link to line diff from @ArthurZucker), a change that looks to have shipped in transformers 4.35. If you don't need Flash Attention, you might be able to downgrade transformers and re-export the model to get things working until this is fixed; a sketch of that workaround is below.
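A minimal sketch of the downgrade/re-export, assuming the pre-4.35 release still registers inv_freq as a persistent buffer so re-saving writes it back into the state dict; the paths are placeholders, and transformers should be pinned first, e.g. pip install "transformers<4.35":

```python
# Re-export the checkpoint under transformers < 4.35 so the
# rotary_emb.inv_freq buffers are serialized with the weights again.
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "path/to/traclm-v1-3b-instruct"           # placeholder: original fine-tune
dst = "path/to/traclm-v1-3b-instruct-reexport"  # placeholder: re-exported copy

model = AutoModelForCausalLM.from_pretrained(src)
tokenizer = AutoTokenizer.from_pretrained(src)

model.save_pretrained(dst)
tokenizer.save_pretrained(dst)
```

The re-exported folder can then be mounted into the TGI container in place of the original.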

ZQ-Dev8 (Author) commented Jan 29, 2024

Thank you very much for the response. I can see the PR is held up; I'll attempt the downgrade/re-export you mentioned in the meantime.

Narsil pushed a commit that referenced this issue Feb 1, 2024
# What does this PR do?

`transformers` 4.35 removed rotary embeddings from GPTNeoX's weights
([link to line
diff](huggingface/transformers@253f9a3#diff-0e2a05d86c82e96f516db8c14070ceb36f53ca44c6bc21a9cd92ad2e777b9cf1R298)).
This applies the same fix as
#793, which
generates them on the fly using the appropriate value from the config
file.

Fixes
#1460

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [x] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

@OlivierDehaene OR @Narsil
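Concretely, "generates them on the fly" means deriving the inverse frequencies from config values instead of reading a serialized tensor. A sketch of the idea (not the exact patch), assuming the standard GPTNeoX rotary parameters (rotary_pct and rotary_emb_base from the model config):

```python
import torch

def inv_freq_from_config(head_size: int, rotary_pct: float, base: float = 10000.0) -> torch.Tensor:
    # GPTNeoX applies rotary embeddings to only a fraction (rotary_pct) of each head.
    rotary_dim = int(head_size * rotary_pct)
    # Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2).
    return 1.0 / (base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
```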
cr313 added a commit to cr313/text-generation-inference-load-test that referenced this issue Apr 19, 2024
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this issue Apr 29, 2024