Add support for Phi-3 Model #1807

Closed
2 tasks done
ChristophRaab opened this issue Apr 25, 2024 · 4 comments

Comments

@ChristophRaab (Contributor)

Model description

Hi all,

currently, microsoft/Phi-3-mini-128k-instruct is not supported by text-generation-inference, as shown by the following error:

2024-04-25T12:45:45.282234Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 201, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 648, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type phi3

The server is started with the following config:

text_generation_launcher: Args { model_id: "microsoft/Phi-3-mini-128k-instruct", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, 
max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "tgi-phi-deployment-6c75c84cf9-qsbh5", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: 
false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }

Are there any plans to add support for it?

Best wishes
Christoph

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

The link to the model is https://huggingface.co/microsoft/Phi-3-mini-128k-instruct

@ChristophRaab ChristophRaab changed the title Add Phi-3 Model Add support for Phi-3 Model Apr 25, 2024
@amihalik (Contributor) commented Apr 26, 2024

I'm able to get a bit farther if I run with a newer TGI build, e.g.:

docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 1g  \
    ghcr.io/huggingface/text-generation-inference:sha-986b404 \
    --model-id microsoft/Phi-3-mini-128k-instruct/ \
    --trust-remote-code \
    --num-shard $(nvidia-smi -L | wc -l) 

But TGI errors out because factor isn't set. I've tried various combinations of rope-factor and rope-scaling (e.g. --rope-factor=32 --rope-scaling=dynamic), but the model generates garbage.
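For reference, the full invocation with the rope flags looks roughly like this (a sketch based on the command above; the image tag and flag values are just the ones I mentioned, and the output is still garbage):

# Same docker run as above, with rope scaling flags added for the 128k variant
docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 1g  \
    ghcr.io/huggingface/text-generation-inference:sha-986b404 \
    --model-id microsoft/Phi-3-mini-128k-instruct \
    --trust-remote-code \
    --rope-factor=32 \
    --rope-scaling=dynamic \
    --num-shard $(nvidia-smi -L | wc -l)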

Has anyone gotten farther with phi-3-128k? phi-3-4k works fine using the command above.

@RonanKMcGovern commented Apr 27, 2024

(quoting @amihalik's comment above about the newer TGI build and the rope-factor/rope-scaling flags)

Same issue.

@nitronomic

Hi

I get the same error loading phi-3-128k with the latest Docker image:

Status: Downloaded newer image for ghcr.io/huggingface/text-generation-inference:latest
2024-04-30T11:03:01.808284Z  INFO text_generation_launcher: Args {
    model_id: "/home/nitro/models//microsoft_Phi-3-mini-128k-instruct",
    revision: None,
    validation_workers: 15,
    sharded: None,
    num_shard: Some(
        2,
    ),
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: true,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        57344,
    ),
    max_total_tokens: Some(
        65536,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        57344,
    ),
    max_batch_total_tokens: Some(
        65536,
    ),
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: Some(
        [
            1,
            2,
            4,
            8,
            16,
            32,
        ],
    ),
    hostname: "0.0.0.0",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 0.99,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
}
2024-04-30T11:03:01.808396Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/home/nitro/models//microsoft_Phi-3-mini-128k-instruct` do not contain malicious code.
2024-04-30T11:03:01.808403Z  INFO text_generation_launcher: Sharding model on 2 processes
2024-04-30T11:03:01.808519Z  INFO download: text_generation_launcher: Starting download process.
2024-04-30T11:03:05.695523Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-30T11:03:06.315199Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-30T11:03:06.315540Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-30T11:03:06.315622Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-04-30T11:03:12.513292Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 253, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 217, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 333, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
    model = FlashLlamaForCausalLM(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 385, in __init__
    self.model = FlashLlamaModel(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 309, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 310, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 249, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 107, in __init__
    self.rotary_emb = PositionRotaryEmbedding.static(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 1032, in static
    scaling_factor = rope_scaling["factor"]
KeyError: 'factor'

Thank you for all the work on TGI
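For what it's worth, the KeyError seems to come from the 128k variant's rope_scaling config: if I read the model's config.json correctly, it declares long_factor/short_factor lists with a "su" type rather than the single factor key that layers.py expects at the line shown in the traceback. A quick way to check (a sketch; assumes curl and python3 are available, and uses the Hub's raw file path):

# Print the keys of the rope_scaling block in the model's config.json
curl -s https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/raw/main/config.json \
    | python3 -c 'import json, sys; print(list(json.load(sys.stdin)["rope_scaling"].keys()))'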

@ChristophRaab (Contributor, Author) commented May 2, 2024

I am able to run the model with the following command on TGI 2.0.2:

text-generation-launcher --model-id=microsoft/Phi-3-mini-128k-instruct --port=80  --trust-remote-code --rope-factor=32  --rope-scaling=dynamic
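To sanity-check that the server actually generates text, I hit TGI's /generate endpoint with something like this (the prompt and max_new_tokens are just placeholders):

# Query the running TGI server on port 80
curl http://localhost:80/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 32}}'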

However, I receive the following warning:

2024-05-02T10:09:32.001826Z  WARN text_generation_router: router/src/main.rs:266: Could not parse config Error("unknown variant `phi3`, expected one of `llava_next`, `clip_vision_model`, `mistral`, `idefics`, `idefics2`, `ssm`, `gpt_bigcode`, `santacoder`, `bloom`, `mpt`, `gpt_neox`, `phi`, `phi-msft`, `llama`, `baichuan`, `gemma`, `cohere`, `drbx`, `falcon`, `mixtral`, `starcoder2`, `qwen2`, `opt`, `t5`", line: 19, column: 22) 
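The warning seems to come from the router not (yet) recognizing phi3 as a model_type when it parses config.json; a quick way to see what the config declares (a sketch, assuming curl is available and using the Hub's raw file path):

# Show the model_type declared in the model's config.json
curl -s https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/raw/main/config.json \
    | grep '"model_type"'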

@Narsil since you added support for phi3, the above warning may be of interest to you.
