Not able to install locally #1788

Closed · 2 of 4 tasks
shwu-nyunai opened this issue Apr 22, 2024 · 9 comments
@shwu-nyunai commented Apr 22, 2024

System Info

2024-04-22T09:19:51.209245Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Mon Apr 22 09:19:50 2024       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
   | N/A   29C    P0    42W / 400W |      0MiB / 40960MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+
                                                                                  
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
2024-04-22T09:19:51.209446Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4 }
2024-04-22T09:19:51.209835Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-04-22T09:19:51.209844Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-04-22T09:19:51.209847Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-04-22T09:19:51.209850Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-22T09:19:51.210103Z  INFO download: text_generation_launcher: Starting download process.
2024-04-22T09:19:55.920267Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-22T09:19:56.615746Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-22T09:19:56.616115Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-22T09:20:01.251224Z ERROR text_generation_launcher: exllamav2_kernels not installed.

2024-04-22T09:20:01.286558Z  WARN text_generation_launcher: We're not using custom kernels.

2024-04-22T09:20:01.329486Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)

2024-04-22T09:20:01.355485Z  WARN text_generation_launcher: Could not import Mamba: cannot import name 'FastRMSNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)

2024-04-22T09:20:02.122101Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/home/shwu/labs/TGI/venv/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/cli.py", line 71, in serve
    from text_generation_server import server

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/server.py", line 16, in <module>
    from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
    from text_generation_server.models.flash_mistral import (

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/flash_mistral.py", line 18, in <module>
    from text_generation_server.models.custom_modeling.flash_mistral_modeling import (

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
    from text_generation_server.utils.layers import (

ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
 rank=0
2024-04-22T09:20:02.220814Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-22T09:20:02.220836Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I have a local model quantised with AutoAWQ; I even tried TheBloke's AWQ quant of Llama 2 7B from the HF Hub directly.
I use the command:

# ================= with local install =================
method="awq"
model="/home/shwu/labs/TGI/models/meta-llama/Llama-2-7b-hf-$method"
# model=""

text-generation-launcher --model-id "$model" --quantize $method --huggingface-hub-cache $HUGGINGFACE_CACHE 2>&1 | tee "tgi-$method.log"

Expected behavior

The server should start.

I have all the packages installed using the make install commands mentioned in the docs:

(venv) shwu@a100-spot-altzone-1:~/labs/TGI$ python -c "import pip._internal.operations.freeze; print('\n'.join([p for p in pip._internal.operations.freeze.freeze() if 'exllama' in p or 'vllm' in p or 'flash' in p]))" && bash generate.sh 
exllamav2_kernels==0.0.0
flash_attn==2.5.6
vllm==0.4.0.post1+cu122
2024-04-22T09:22:26.077582Z  INFO text_generation_launcher: Args { model_id: "/home/shwu/labs/TGI/models/meta-llama/Llama-2-7b-hf-awq", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Awq), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some(".cache/"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-04-22T09:22:26.077989Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-04-22T09:22:26.077998Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-04-22T09:22:26.078001Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-04-22T09:22:26.078003Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-22T09:22:26.078233Z  INFO download: text_generation_launcher: Starting download process.
2024-04-22T09:22:30.659168Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-22T09:22:31.283684Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-22T09:22:31.284013Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-22T09:22:36.095219Z ERROR text_generation_launcher: exllamav2_kernels not installed.

2024-04-22T09:22:36.131032Z  WARN text_generation_launcher: We're not using custom kernels.

2024-04-22T09:22:36.174655Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)

2024-04-22T09:22:36.201589Z  WARN text_generation_launcher: Could not import Mamba: cannot import name 'FastRMSNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)

2024-04-22T09:22:36.890726Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/home/shwu/labs/TGI/venv/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/cli.py", line 71, in serve
    from text_generation_server import server

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/server.py", line 16, in <module>
    from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
    from text_generation_server.models.flash_mistral import (

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/flash_mistral.py", line 18, in <module>
    from text_generation_server.models.custom_modeling.flash_mistral_modeling import (

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
    from text_generation_server.utils.layers import (

ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
 rank=0
2024-04-22T09:22:36.989104Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-22T09:22:36.989127Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
@shuaills

You need to re-install vllm and flash-attention-v2
cd text-generation-inference/server
rm -rf vllm
make install-vllm-cuda

rm -rf flash-attention-v2
make install-flash-attention-v2-cuda
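
After the rebuild, a quick sanity check before relaunching (a sketch, assuming the reinstalled vllm and flash-attention-v2 kernels provide what layers.py needs) is to retry the imports that were failing in the log above:

# sanity check: these are the names the shard failed to import in the traceback
python -c "from text_generation_server.utils.layers import FastLayerNorm, FastRMSNorm, PositionRotaryEmbedding; print('layers OK')"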

They forgot to add this to the release notes about local installs.
#1738
I tried this and it solved my problem.

@shwu-nyunai
Author

I have been installing all of the extensions via those commands for 2 days now;
I also tried using the release v2.0.1 code zip.
Let me try this once more with a clean installation.

@shuaills

I have been installing all of the extensions via those commands for 2 days now; I also tried using the release v2.0.1 code zip let me try this once more with a clean installation

I feel you, I did exactly the same: installed and deleted about 4 times.

@boxiaowave

I have been installing all of the extensions via those commands for 2 days now; I also tried using the release v2.0.1 code zip let me try this once more with a clean installation

You can follow the steps in the Dockerfile: after compiling flash-attn with the 'make install-flash-attention-v2-cuda' command, the script moves the compiled files into Python's site-packages folder, like so:
cp -r /text-generation-inference/server/flash-attention-v2/build/lib.linux-x86_64-cpython-39/* /usr/local/lib/python3.10/site-packages/
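
If your Python version or install prefix differs, a more portable form of that copy is sketched below (the repo path and build directory name are assumptions; adjust to your checkout):

# find the active environment's site-packages directory
SITE_PACKAGES=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
# copy the compiled flash-attention-v2 artifacts into it (the lib.linux-* name varies by platform and Python version)
cp -r text-generation-inference/server/flash-attention-v2/build/lib.linux-x86_64-cpython-*/* "$SITE_PACKAGES/"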

@shwu-nyunai
Author

I have resolved the issues using the following set of install scripts:
https://github.com/nyunAI/Faster-LLM-Survey/tree/A100TGIv2.0.1/scripts

Usually, if you have the required versions of cmake, libkineto, protobuf & rust installed, you can directly run (sketched below):

  1. scripts/install-tgi.sh , then
  2. scripts/parallel-install-extensions.sh (this installs all extensions in parallel - flash-attn, flash-attn-v2-cuda, vllm-cuda, exllamav2_kernels, etc.)

Use the other scripts in the directory as required.

For other system and driver details, see https://github.com/nyunAI/Faster-LLM-Survey/blob/A100TGIv2.0.1/experiment_details.txt

P.S. A maintainer can close this; leaving it open for anyone facing a similar issue.
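
A minimal sketch of that flow (the clone command and branch name are assumptions based on the links above; adjust to your setup):

# clone the branch with the A100 / TGI v2.0.1 scripts (assumed layout)
git clone -b A100TGIv2.0.1 https://github.com/nyunAI/Faster-LLM-Survey.git
cd Faster-LLM-Survey
# install TGI itself, then build all GPU extensions in parallel
bash scripts/install-tgi.sh
bash scripts/parallel-install-extensions.sh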

@for-just-we

have resolved the issues using the following set of install-scripts; https://github.com/nyunAI/Faster-LLM-Survey/tree/A100TGIv2.0.1/scripts (quoting the comment above)

When installing vllm for TGI-2.0.1, I came across:

error: triton 2.3.0 is installed but triton==2.1.0 is required by {'torch'}
make: *** [Makefile-vllm:12: install-vllm-cuda] Error 1

Is this because I used the wrong vllm version? I didn't modify anything in the Makefile-* scripts.

@shwu-nyunai
Author

Your PyTorch version might be different. I faced this issue for the same reason: my PyTorch version was higher than torch==2.1.0, and hence the default triton that was installed was 2.2.0 (AFAIR).
Nonetheless, use a fresh virtual env (maybe conda) and install torch==2.1.0, or use install-tgi.sh.
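
A minimal sketch of that reset (the env name and Python version are assumptions):

# fresh environment so the pinned torch / triton pair is not clobbered by an existing install
conda create -n tgi-clean python=3.10 -y
conda activate tgi-clean
pip install torch==2.1.0
# then re-run the TGI install, e.g. scripts/install-tgi.sh from the repo linked above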

@Semihal commented Apr 27, 2024

Build and install rotary and layer_norm from https://github.com/Dao-AILab/flash-attention/tree/main/csrc.
This worked for me.
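
A sketch of that build (assumes a CUDA toolchain matching your PyTorch install; these are the extensions that utils/layers.py wraps):

git clone https://github.com/Dao-AILab/flash-attention.git
pip install ./flash-attention/csrc/rotary      # rotary embedding kernel
pip install ./flash-attention/csrc/layer_norm  # fused layer norm / RMS norm kernel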


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 28, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 2, 2024