Add support for Phi-3 Model #1807

Closed
2 tasks done
ChristophRaab opened this issue Apr 25, 2024 · 4 comments

Comments

@ChristophRaab (Contributor)

Model description

Hi all,

currently, microsoft/Phi-3-mini-128k-instruct is not supported by text-generation-inference, as shown by the following error:

2024-04-25T12:45:45.282234Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 201, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 648, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type phi3

The server is started with the following config:

text_generation_launcher: Args { model_id: "microsoft/Phi-3-mini-128k-instruct", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, 
max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "tgi-phi-deployment-6c75c84cf9-qsbh5", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: 
false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }

Are there any plans to add support for it?

Best wishes
Christoph

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

The link to the model is https://huggingface.co/microsoft/Phi-3-mini-128k-instruct

@ChristophRaab ChristophRaab changed the title Add Phi-3 Model Add support for Phi-3 Model Apr 25, 2024
@amihalik (Contributor) commented Apr 26, 2024

I'm able to get a bit farther if I run with a newer TGI build, e.g.:

docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 1g  \
    ghcr.io/huggingface/text-generation-inference:sha-986b404 \
    --model-id microsoft/Phi-3-mini-128k-instruct/ \
    --trust-remote-code \
    --num-shard $(nvidia-smi -L | wc -l) 

But TGI errors out because factor isn't set. I've tried various combinations of rope-factor and rope-scaling (e.g. --rope-factor=32 --rope-scaling=dynamic), but the model generates garbage.
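For reference, the full invocation with the rope flags looks roughly like this (a sketch based on the command above; the image tag and flag values are just the ones I mentioned, and the output is still garbage):

# Same docker run as above, with rope scaling flags added for the 128k variant
docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 1g  \
    ghcr.io/huggingface/text-generation-inference:sha-986b404 \
    --model-id microsoft/Phi-3-mini-128k-instruct \
    --trust-remote-code \
    --rope-factor=32 \
    --rope-scaling=dynamic \
    --num-shard $(nvidia-smi -L | wc -l)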

Has anyone gotten farther with phi-3-128k? phi-3-4k works fine using the command above.

@RonanKMcGovern commented Apr 27, 2024

(quoting @amihalik's comment above about the newer TGI build and the rope-factor/rope-scaling flags)

Same issue.

@nitronomic

Hi

I get the same error loading phi-3-128k with the latest Docker image:

Status: Downloaded newer image for ghcr.io/huggingface/text-generation-inference:latest
2024-04-30T11:03:01.808284Z  INFO text_generation_launcher: Args {
    model_id: "/home/nitro/models//microsoft_Phi-3-mini-128k-instruct",
    revision: None,
    validation_workers: 15,
    sharded: None,
    num_shard: Some(
        2,
    ),
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: true,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        57344,
    ),
    max_total_tokens: Some(
        65536,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        57344,
    ),
    max_batch_total_tokens: Some(
        65536,
    ),
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: Some(
        [
            1,
            2,
            4,
            8,
            16,
            32,
        ],
    ),
    hostname: "0.0.0.0",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 0.99,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
}
2024-04-30T11:03:01.808396Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/home/nitro/models//microsoft_Phi-3-mini-128k-instruct` do not contain malicious code.
2024-04-30T11:03:01.808403Z  INFO text_generation_launcher: Sharding model on 2 processes
2024-04-30T11:03:01.808519Z  INFO download: text_generation_launcher: Starting download process.
2024-04-30T11:03:05.695523Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-30T11:03:06.315199Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-30T11:03:06.315540Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-30T11:03:06.315622Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-04-30T11:03:12.513292Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 253, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 217, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 333, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
    model = FlashLlamaForCausalLM(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 385, in __init__
    self.model = FlashLlamaModel(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 309, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 310, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 249, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 107, in __init__
    self.rotary_emb = PositionRotaryEmbedding.static(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 1032, in static
    scaling_factor = rope_scaling["factor"]
KeyError: 'factor'

Thank you for all the work on TGI
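For what it's worth, the KeyError seems to come from the 128k variant's rope_scaling config: if I read the model's config.json correctly, it declares long_factor/short_factor lists with a "su" type rather than the single factor key that layers.py expects at the line shown in the traceback. A quick way to check (a sketch; assumes curl and python3 are available, and uses the Hub's raw file path):

# Print the keys of the rope_scaling block in the model's config.json
curl -s https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/raw/main/config.json \
    | python3 -c 'import json, sys; print(list(json.load(sys.stdin)["rope_scaling"].keys()))'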

@ChristophRaab (Contributor, Author) commented May 2, 2024

I am able to run the model with the following command on TGI 2.0.2:

text-generation-launcher --model-id=microsoft/Phi-3-mini-128k-instruct --port=80  --trust-remote-code --rope-factor=32  --rope-scaling=dynamic
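To sanity-check that the server actually generates text, I hit TGI's /generate endpoint with something like this (the prompt and max_new_tokens are just placeholders):

# Query the running TGI server on port 80
curl http://localhost:80/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 32}}'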

However, I receive the following warning:

2024-05-02T10:09:32.001826Z  WARN text_generation_router: router/src/main.rs:266: Could not parse config Error("unknown variant `phi3`, expected one of `llava_next`, `clip_vision_model`, `mistral`, `idefics`, `idefics2`, `ssm`, `gpt_bigcode`, `santacoder`, `bloom`, `mpt`, `gpt_neox`, `phi`, `phi-msft`, `llama`, `baichuan`, `gemma`, `cohere`, `drbx`, `falcon`, `mixtral`, `starcoder2`, `qwen2`, `opt`, `t5`", line: 19, column: 22) 
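The warning seems to come from the router not (yet) recognizing phi3 as a model_type when it parses config.json; a quick way to see what the config declares (a sketch, assuming curl is available and using the Hub's raw file path):

# Show the model_type declared in the model's config.json
curl -s https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/raw/main/config.json \
    | grep '"model_type"'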

@Narsil since you added support for phi3, the above warning may be of interest to you.
