CohereForAI/c4ai-command-r-plus-4bit deployment fails on Inference Endpoint #1799

Closed
h4gen opened this issue Apr 23, 2024 · 3 comments

h4gen commented Apr 23, 2024

System Info

[Screenshot of the Inference Endpoint configuration]

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Go to the model page for CohereForAI/c4ai-command-r-plus-4bit
  2. Deploy it on an Inference Endpoint (a programmatic equivalent is sketched after this list)
  3. Use standard settings (see picture)
  4. Wait forever
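
For reference, a rough programmatic equivalent of steps 1–3 using the huggingface_hub Inference Endpoints API; the endpoint name, vendor, region, and instance values below are illustrative assumptions, not the exact settings from the screenshot:

```python
# Hedged sketch: deploy the model on an Inference Endpoint via huggingface_hub.
# All concrete values (name, vendor, region, instance size/type) are assumptions
# for illustration; the failure reproduces with the UI defaults as well.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "command-r-plus-4bit-test",                      # hypothetical endpoint name
    repository="CohereForAI/c4ai-command-r-plus-4bit",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                                    # assumed vendor
    region="us-east-1",                              # assumed region
    type="protected",
    instance_size="x1",                              # assumed size
    instance_type="nvidia-a100",                     # assumed GPU instance
)
endpoint.wait()  # never becomes healthy; the shard fails with the error below
```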

See this output:

2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427626Z","level":"INFO","fields":{"message":"Args { model_id: \"/repository\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(1024), max_total_tokens: Some(1512), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(2048), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: \"r-h4g3n-c4ai-command-r-plus-4bit-qts-a1e5dl6j-cfd73-rk5q9\", port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }"},"target":"text_generation_launcher"}
2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427704Z","level":"INFO","fields":{"message":"Model supports up to 8192 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=8242 --max-total-tokens=8192 --max-input-tokens=8191`."},"target":"text_generation_launcher"}
2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427711Z","level":"INFO","fields":{"message":"Bitsandbytes doesn't work with cuda graphs, deactivating them"},"target":"text_generation_launcher"}
2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427793Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/04/23 15:46:14 ~ {"timestamp":"2024-04-23T13:46:14.728238Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
2024/04/23 15:46:15 ~ {"timestamp":"2024-04-23T13:46:15.432829Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/04/23 15:46:15 ~ {"timestamp":"2024-04-23T13:46:15.433034Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/04/23 15:46:22 ~ {"timestamp":"2024-04-23T13:46:22.644366Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 240, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 201, in serve_inner\n model = get_model(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 375, in get_model\n return FlashCohere(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_cohere.py\", line 61, in __init__\n model = FlashCohereForCausalLM(config, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 482, in __init__\n self.model = FlashCohereModel(config, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 420, in __init__\n [\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 421, in <listcomp>\n FlashCohereLayer(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 360, in __init__\n self.self_attn = FlashCohereAttention(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 217, in __init__\n self.query_key_value = load_attention(config, prefix, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 140, in load_attention\n return 
_load_gqa(config, prefix, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 167, in _load_gqa\n assert list(weight.shape) == [\nAssertionError: [88080384, 1] != [14336, 12288]\n"},"target":"text_generation_launcher"}
2024/04/23 15:46:24 ~ {"timestamp":"2024-04-23T13:46:24.041667Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nSpecial tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 240, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 649, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 201, in serve_inner\n model = get_model(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 375, in get_model\n return FlashCohere(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_cohere.py\", line 61, in __init__\n model = FlashCohereForCausalLM(config, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 482, in __init__\n self.model = FlashCohereModel(config, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 420, in __init__\n [\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 421, in <listcomp>\n FlashCohereLayer(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 360, in __init__\n self.self_attn = FlashCohereAttention(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 217, in __init__\n self.query_key_value = load_attention(config, prefix, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 140, in load_attention\n return _load_gqa(config, prefix, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 167, in _load_gqa\n assert list(weight.shape) == [\n\nAssertionError: [88080384, 1] != [14336, 12288]\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/04/23 15:46:24 ~ {"timestamp":"2024-04-23T13:46:24.140326Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
2024/04/23 15:46:24 ~ {"timestamp":"2024-04-23T13:46:24.140367Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
2024/04/23 15:46:24 ~ Error: ShardCannotStart

It looks like the flash attention mechanism is not working properly.
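
For what it's worth, the failing shapes would also be consistent with the loader hitting the pre-quantized bitsandbytes 4-bit weights directly: bitsandbytes packs two 4-bit values per byte into a flat [N, 1] uint8 tensor, and the [88080384, 1] in the assertion is exactly half the element count of the [14336, 12288] fused QKV weight that _load_gqa expects. A quick arithmetic check (a sketch under that assumption):

```python
# Sanity check (assumes the repo ships bitsandbytes 4-bit packed weights:
# two 4-bit values per uint8 byte, flattened to shape [N, 1]).
rows, cols = 14336, 12288        # shape _load_gqa asserts for the fused QKV weight
packed_elems = rows * cols // 2  # elements of the packed 4-bit tensor
print(packed_elems)              # 88080384 -> matches the [88080384, 1] in the error
```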

Expected behavior

The expected behaviour is for the endpoint to start normally.

Note: I have not been able to get any Cohere model running on any hardware configuration.


davhin commented Apr 23, 2024

Fails the same way for me on a GCP VM with:

docker run --gpus all --shm-size 1g -p 8888:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --speculate 3 --num-shard 2

File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
asyncio.run(

File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)

File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 201, in serve_inner
model = get_model(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 375, in get_model
return FlashCohere(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_cohere.py", line 61, in init
model = FlashCohereForCausalLM(config, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 482, in init
self.model = FlashCohereModel(config, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 420, in init
[

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 421, in
FlashCohereLayer(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 360, in init
self.self_attn = FlashCohereAttention(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 217, in init
self.query_key_value = load_attention(config, prefix, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 140, in load_attention
return _load_gqa(config, prefix, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 167, in _load_gqa
assert list(weight.shape) == [

AssertionError: [44040192, 1] != [8192, 12288]
rank=1
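
The 2-shard numbers follow the same pattern, assuming (again only a guess) that the flat packed 4-bit tensor is simply being split in half across the shards:

```python
# Same check for the --num-shard 2 run (assumption: the flat packed 4-bit
# QKV tensor is split evenly across the two shards instead of being dequantized).
full_packed = 14336 * 12288 // 2  # 88080384, the single-shard packed size above
print(full_packed // 2)           # 44040192 -> matches [44040192, 1] on rank=1
```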

@backroom-coder

Same issue:
2024-05-03T17:11:35.945462Z INFO text_generation_launcher: Unknown quantization method bitsandbytes


github-actions bot commented Jun 3, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Jun 3, 2024
github-actions bot closed this as not planned on Jun 12, 2024