
Phi-3 medium 128k instruct fails to start #1930

Closed
xfalcox opened this issue May 21, 2024 · 8 comments

xfalcox commented May 21, 2024

System Info

docker pull ghcr.io/huggingface/text-generation-inference:latest
latest: Pulling from huggingface/text-generation-inference
Digest: sha256:00d7f1cf3c6fce0a48ff9f2e0451cfa60a06fb48447d60dc6034f4e69443fd3e
Status: Image is up to date for ghcr.io/huggingface/text-generation-inference:latest
ghcr.io/huggingface/text-generation-inference:latest


docker run --rm --name tgi --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=xxx -p 8080:80 -v /opt/tgi-cache:/data ghcr.io/huggingface/text-generation-inference:latest --model-id microsoft/Phi-3-medium-128k-instruct
2024-05-21T19:16:29.208699Z  INFO text_generation_launcher: Args {
    model_id: "microsoft/Phi-3-medium-128k-instruct",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "a9898b15798a",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
}
2024-05-21T19:16:29.208767Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"    
2024-05-21T19:16:31.499304Z  INFO text_generation_launcher: Model supports up to 131072 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=131122 --max-total-tokens=131072 --max-input-tokens=131071`.
2024-05-21T19:16:31.499313Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-05-21T19:16:31.499316Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-05-21T19:16:31.499317Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-05-21T19:16:31.499319Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-05-21T19:16:31.499404Z  INFO download: text_generation_launcher: Starting download process.
2024-05-21T19:16:33.562753Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-05-21T19:16:33.901453Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-05-21T19:16:33.901584Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-05-21T19:16:37.067234Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 423, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
    model = FlashLlamaForCausalLM(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 396, in __init__
    self.model = FlashLlamaModel(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 320, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 321, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 260, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 116, in __init__
    self.query_key_value = load_attention(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 51, in load_attention
    bias = config.attention_bias
  File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 263, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'Phi3Config' object has no attribute 'attention_bias'

2024-05-21T19:16:37.504771Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
  warnings.warn(
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 423, in get_model
    return FlashLlama(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
    model = FlashLlamaForCausalLM(prefix, config, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 396, in __init__
    self.model = FlashLlamaModel(prefix, config, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 320, in __init__
    [

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 321, in <listcomp>
    FlashLlamaLayer(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 260, in __init__
    self.self_attn = FlashLlamaAttention(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 116, in __init__
    self.query_key_value = load_attention(config, prefix, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 51, in load_attention
    bias = config.attention_bias

  File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 263, in __getattribute__
    return super().__getattribute__(key)

AttributeError: 'Phi3Config' object has no attribute 'attention_bias'
 rank=0
2024-05-21T19:16:37.603743Z ERROR text_generation_launcher: Shard 0 failed to start
2024-05-21T19:16:37.603753Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Try to run TGI Docker with latest phi-3
  2. It fails to start

Expected behavior

  1. Try to run TGI Docker with latest phi-3
  2. It works
@OjoDojoJo

Have you tried adding

"attention_bias": false

to the config.json?

I used a local volume to save the model and altered the config as described. It works (tested with image ghcr.io/huggingface/text-generation-inference:2.0.3).
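
For anyone doing the same locally, here is a minimal sketch of that edit (assuming the model was cached via the -v /opt/tgi-cache:/data mount from the original command; the glob follows the standard huggingface_hub cache layout, so adjust the path if your volume or repo differs):

import json
from pathlib import Path

# Assumed cache location from the original docker run (-v /opt/tgi-cache:/data);
# adjust the path/glob if your volume or model repo differs.
cache = Path("/opt/tgi-cache")
pattern = "models--microsoft--Phi-3-medium-128k-instruct/snapshots/*/config.json"
for config_path in cache.glob(pattern):
    config = json.loads(config_path.read_text())
    config.setdefault("attention_bias", False)  # the flag the FlashLlama loading path reads
    config_path.write_text(json.dumps(config, indent=2))
    print(f"patched {config_path}")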

@ulrichkr

I'm encountering this as well. I believe it arises from the recent addition of Granite support, which landed after Phi-3 support in TGI 2.0.3. See here.

@amihalik
Contributor

@OjoDojoJo What's your full command line? I'm running this command on an AWS g6.48xlarge:

docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 2g   \
     -v /models/:/models/ ghcr.io/huggingface/text-generation-inference:2.0.3   \
     --model-id /models/microsoft/Phi-3-medium-128k-instruct/     \
     --hostname 0.0.0.0         --trust-remote-code --num-shard 8     \
     --max-input-length=9000 --max-total-tokens=9500 \
     --max-batch-prefill-tokens=9000

And I'm getting this error:

[rank1]: Traceback (most recent call last):

[rank1]:   File "/opt/conda/bin/text-generation-server", line 8, in <module>
[rank1]:     sys.exit(app())

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
[rank1]:     server.serve(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
[rank1]:     asyncio.run(

[rank1]:   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
[rank1]:     return loop.run_until_complete(main)

[rank1]:   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank1]:     return future.result()

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 222, in serve_inner
[rank1]:     model = get_model(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 420, in get_model
[rank1]:     return FlashLlama(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
[rank1]:     model = FlashLlamaForCausalLM(prefix, config, weights)

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 368, in __init__
[rank1]:     self.model = FlashLlamaModel(prefix, config, weights)

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 292, in __init__
[rank1]:     [

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 293, in <listcomp>
[rank1]:     FlashLlamaLayer(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 232, in __init__
[rank1]:     self.self_attn = FlashLlamaAttention(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 108, in __init__
[rank1]:     self.query_key_value = load_attention(config, prefix, weights)

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 45, in load_attention
[rank1]:     return TensorParallelColumnLinear.load_multi(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 115, in load_multi
[rank1]:     weight = weights.get_multi_weights_col(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in get_multi_weights_col
[rank1]:     w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in <listcomp>
[rank1]:     w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 112, in get_sharded
[rank1]:     filename, tensor_name = self.get_filename(tensor_name)

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 63, in get_filename
[rank1]:     raise RuntimeError(f"weight {tensor_name} does not exist")

[rank1]: RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
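
One way to see why this particular lookup fails is to list the attention tensor names the downloaded shards actually contain (sketch below, assuming the local model path from the command above and that the safetensors package is available). If the checkpoint stores a fused qkv_proj rather than separate q_proj/k_proj/v_proj tensors, as the Phi-3 repos appear to, the plain Llama loading path will raise exactly this error:

from pathlib import Path
from safetensors import safe_open

# Hypothetical local path matching the command above; point this at wherever
# the Phi-3 weights actually live on your host.
model_dir = Path("/models/microsoft/Phi-3-medium-128k-instruct")

for shard in sorted(model_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        attn_keys = [k for k in f.keys() if "layers.0.self_attn" in k]
    if attn_keys:
        print(shard.name, attn_keys)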


dcbark01 commented May 23, 2024

> Have you tried adding
>
> "attention_bias": false
>
> to the config.json?
>
> I used a local volume to save the model and altered the config as described. It works (tested with image ghcr.io/huggingface/text-generation-inference:2.0.3).

Can confirm that this works. There's currently an open PR on HF to fix the issue. In the meantime, you can run the model by directly specifying the revision. Here's my full command:

docker run --gpus all --shm-size 2g -p 8080:80 \
-v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:2.0 \
--model-id microsoft/Phi-3-mini-128k-instruct \
--revision refs/pr/68 \
--trust-remote-code \
-p 8080 \
--hostname 0.0.0.0
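
If the container does come up with that workaround, a quick smoke test against TGI's /generate endpoint looks like this (minimal sketch using only the Python standard library; it assumes the server is reachable at localhost:8080, so adjust the host and port to match your own docker port mapping and the launcher's port setting):

import json
import urllib.request

# Assumes the server is reachable at localhost:8080; adjust to your port mapping.
payload = {"inputs": "What is the capital of France?", "parameters": {"max_new_tokens": 20}}
req = urllib.request.Request(
    "http://localhost:8080/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))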

@stefanobranco

I'm still getting the same issue as @amihalik, even with the attention bias fixed:

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

Not sure what causes it; I'm using pretty much the exact same docker commands.

@xfalcox
Author

xfalcox commented May 24, 2024

Still fails for me with TGI 2.0, --trust-remote-code, and attention_bias set to false.

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

@pranavthombare

pranavthombare commented May 27, 2024

It is the same for us. TGI tells me:
The argument 'trust_remote_code' is to be used with Auto classes. It has no effect here and is ignored.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jun 27, 2024
@github-actions github-actions bot closed this as not planned Jul 2, 2024