CohereForAI/c4ai-command-r-plus-4bit deployment fails on Inference Endpoint #1799

Closed
h4gen opened this issue Apr 23, 2024 · 3 comments

h4gen commented Apr 23, 2024

System Info

[Screenshot of the Inference Endpoint configuration]

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Go to the model page for CohereForAI/c4ai-command-r-plus-4bit
  2. Deploy it on an Inference Endpoint (a programmatic equivalent is sketched after this list)
  3. Use standard settings (see picture)
  4. Wait forever
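
For reference, a rough programmatic equivalent of steps 1–3 using the huggingface_hub Inference Endpoints API; the endpoint name, vendor, region, and instance values below are illustrative assumptions, not the exact settings from the screenshot:

```python
# Hedged sketch: deploy the model on an Inference Endpoint via huggingface_hub.
# All concrete values (name, vendor, region, instance size/type) are assumptions
# for illustration; the failure reproduces with the UI defaults as well.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "command-r-plus-4bit-test",                      # hypothetical endpoint name
    repository="CohereForAI/c4ai-command-r-plus-4bit",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                                    # assumed vendor
    region="us-east-1",                              # assumed region
    type="protected",
    instance_size="x1",                              # assumed size
    instance_type="nvidia-a100",                     # assumed GPU instance
)
endpoint.wait()  # never becomes healthy; the shard fails with the error below
```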

See this output:

2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427626Z","level":"INFO","fields":{"message":"Args { model_id: \"/repository\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(1024), max_total_tokens: Some(1512), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(2048), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: \"r-h4g3n-c4ai-command-r-plus-4bit-qts-a1e5dl6j-cfd73-rk5q9\", port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }"},"target":"text_generation_launcher"}
2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427704Z","level":"INFO","fields":{"message":"Model supports up to 8192 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=8242 --max-total-tokens=8192 --max-input-tokens=8191`."},"target":"text_generation_launcher"}
2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427711Z","level":"INFO","fields":{"message":"Bitsandbytes doesn't work with cuda graphs, deactivating them"},"target":"text_generation_launcher"}
2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427793Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/04/23 15:46:14 ~ {"timestamp":"2024-04-23T13:46:14.728238Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
2024/04/23 15:46:15 ~ {"timestamp":"2024-04-23T13:46:15.432829Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/04/23 15:46:15 ~ {"timestamp":"2024-04-23T13:46:15.433034Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/04/23 15:46:22 ~ {"timestamp":"2024-04-23T13:46:22.644366Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 240, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 201, in serve_inner\n model = get_model(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 375, in get_model\n return FlashCohere(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_cohere.py\", line 61, in __init__\n model = FlashCohereForCausalLM(config, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 482, in __init__\n self.model = FlashCohereModel(config, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 420, in __init__\n [\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 421, in <listcomp>\n FlashCohereLayer(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 360, in __init__\n self.self_attn = FlashCohereAttention(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 217, in __init__\n self.query_key_value = load_attention(config, prefix, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 140, in load_attention\n return 
_load_gqa(config, prefix, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 167, in _load_gqa\n assert list(weight.shape) == [\nAssertionError: [88080384, 1] != [14336, 12288]\n"},"target":"text_generation_launcher"}
2024/04/23 15:46:24 ~ {"timestamp":"2024-04-23T13:46:24.041667Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nSpecial tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 240, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 649, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 201, in serve_inner\n model = get_model(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 375, in get_model\n return FlashCohere(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_cohere.py\", line 61, in __init__\n model = FlashCohereForCausalLM(config, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 482, in __init__\n self.model = FlashCohereModel(config, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 420, in __init__\n [\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 421, in <listcomp>\n FlashCohereLayer(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 360, in __init__\n self.self_attn = FlashCohereAttention(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 217, in __init__\n self.query_key_value = load_attention(config, prefix, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 140, in load_attention\n return _load_gqa(config, prefix, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 167, in _load_gqa\n assert list(weight.shape) == [\n\nAssertionError: [88080384, 1] != [14336, 12288]\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/04/23 15:46:24 ~ {"timestamp":"2024-04-23T13:46:24.140326Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
2024/04/23 15:46:24 ~ {"timestamp":"2024-04-23T13:46:24.140367Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
2024/04/23 15:46:24 ~ Error: ShardCannotStart

It looks like the flash attention mechanism is not working properly.
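
For what it's worth, the failing shapes would also be consistent with the loader hitting the pre-quantized bitsandbytes 4-bit weights directly: bitsandbytes packs two 4-bit values per byte into a flat [N, 1] uint8 tensor, and the [88080384, 1] in the assertion is exactly half the element count of the [14336, 12288] fused QKV weight that _load_gqa expects. A quick arithmetic check (a sketch under that assumption):

```python
# Sanity check (assumes the repo ships bitsandbytes 4-bit packed weights:
# two 4-bit values per uint8 byte, flattened to shape [N, 1]).
rows, cols = 14336, 12288        # shape _load_gqa asserts for the fused QKV weight
packed_elems = rows * cols // 2  # elements of the packed 4-bit tensor
print(packed_elems)              # 88080384 -> matches the [88080384, 1] in the error
```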

Expected behavior

The expected behaviour is for the endpoint to start normally.

Note: I have not been able to get any Cohere model running on any hardware configuration.


davhin commented Apr 23, 2024

Fails the same way for me on a GCP VM with:

docker run --gpus all --shm-size 1g -p 8888:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --speculate 3 --num-shard 2

File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
asyncio.run(

File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)

File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 201, in serve_inner
model = get_model(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 375, in get_model
return FlashCohere(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_cohere.py", line 61, in init
model = FlashCohereForCausalLM(config, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 482, in init
self.model = FlashCohereModel(config, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 420, in init
[

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 421, in
FlashCohereLayer(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 360, in init
self.self_attn = FlashCohereAttention(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 217, in init
self.query_key_value = load_attention(config, prefix, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 140, in load_attention
return _load_gqa(config, prefix, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 167, in _load_gqa
assert list(weight.shape) == [

AssertionError: [44040192, 1] != [8192, 12288]
rank=1
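
The 2-shard numbers follow the same pattern, assuming (again only a guess) that the flat packed 4-bit tensor is simply being split in half across the shards:

```python
# Same check for the --num-shard 2 run (assumption: the flat packed 4-bit
# QKV tensor is split evenly across the two shards instead of being dequantized).
full_packed = 14336 * 12288 // 2  # 88080384, the single-shard packed size above
print(full_packed // 2)           # 44040192 -> matches [44040192, 1] on rank=1
```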

@backroom-coder

Same issue:
2024-05-03T17:11:35.945462Z INFO text_generation_launcher: Unknown quantization method bitsandbytes


github-actions bot commented Jun 3, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Jun 3, 2024
github-actions bot closed this as not planned on Jun 12, 2024