Not able to install locally #1788

Closed · 2 of 4 tasks
shwu-nyunai opened this issue Apr 22, 2024 · 9 comments
@shwu-nyunai commented Apr 22, 2024

System Info

2024-04-22T09:19:51.209245Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Mon Apr 22 09:19:50 2024       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
   | N/A   29C    P0    42W / 400W |      0MiB / 40960MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+
                                                                                  
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
2024-04-22T09:19:51.209446Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4 }
2024-04-22T09:19:51.209835Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-04-22T09:19:51.209844Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-04-22T09:19:51.209847Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-04-22T09:19:51.209850Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-22T09:19:51.210103Z  INFO download: text_generation_launcher: Starting download process.
2024-04-22T09:19:55.920267Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-22T09:19:56.615746Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-22T09:19:56.616115Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-22T09:20:01.251224Z ERROR text_generation_launcher: exllamav2_kernels not installed.

2024-04-22T09:20:01.286558Z  WARN text_generation_launcher: We're not using custom kernels.

2024-04-22T09:20:01.329486Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)

2024-04-22T09:20:01.355485Z  WARN text_generation_launcher: Could not import Mamba: cannot import name 'FastRMSNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)

2024-04-22T09:20:02.122101Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/home/shwu/labs/TGI/venv/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/cli.py", line 71, in serve
    from text_generation_server import server

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/server.py", line 16, in <module>
    from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
    from text_generation_server.models.flash_mistral import (

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/flash_mistral.py", line 18, in <module>
    from text_generation_server.models.custom_modeling.flash_mistral_modeling import (

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
    from text_generation_server.utils.layers import (

ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
 rank=0
2024-04-22T09:20:02.220814Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-22T09:20:02.220836Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I have a local model quantised with AutoAWQ; I even tried TheBloke's AWQ quant of Llama 2 7B from the HF Hub directly.
I use the command:

# ================= with local install =================
method="awq"
model="/home/shwu/labs/TGI/models/meta-llama/Llama-2-7b-hf-$method"
# model=""

text-generation-launcher --model-id "$model" --quantize $method --huggingface-hub-cache $HUGGINGFACE_CACHE 2>&1 | tee "tgi-$method.log"

Expected behavior

The server should start.

I have all the packages installed using the make install commands mentioned in the docs:

(venv) shwu@a100-spot-altzone-1:~/labs/TGI$ python -c "import pip._internal.operations.freeze; print('\n'.join([p for p in pip._internal.operations.freeze.freeze() if 'exllama' in p or 'vllm' in p or 'flash' in p]))" && bash generate.sh 
exllamav2_kernels==0.0.0
flash_attn==2.5.6
vllm==0.4.0.post1+cu122
2024-04-22T09:22:26.077582Z  INFO text_generation_launcher: Args { model_id: "/home/shwu/labs/TGI/models/meta-llama/Llama-2-7b-hf-awq", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Awq), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some(".cache/"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-04-22T09:22:26.077989Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-04-22T09:22:26.077998Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-04-22T09:22:26.078001Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-04-22T09:22:26.078003Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-22T09:22:26.078233Z  INFO download: text_generation_launcher: Starting download process.
2024-04-22T09:22:30.659168Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-22T09:22:31.283684Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-22T09:22:31.284013Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-22T09:22:36.095219Z ERROR text_generation_launcher: exllamav2_kernels not installed.

2024-04-22T09:22:36.131032Z  WARN text_generation_launcher: We're not using custom kernels.

2024-04-22T09:22:36.174655Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)

2024-04-22T09:22:36.201589Z  WARN text_generation_launcher: Could not import Mamba: cannot import name 'FastRMSNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)

2024-04-22T09:22:36.890726Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/home/shwu/labs/TGI/venv/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/cli.py", line 71, in serve
    from text_generation_server import server

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/server.py", line 16, in <module>
    from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
    from text_generation_server.models.flash_mistral import (

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/flash_mistral.py", line 18, in <module>
    from text_generation_server.models.custom_modeling.flash_mistral_modeling import (

  File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
    from text_generation_server.utils.layers import (

ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
 rank=0
2024-04-22T09:22:36.989104Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-22T09:22:36.989127Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
@shuaills

You need to re-install vllm and flash-attention-v2
cd text-generation-inference/server
rm -rf vllm
make install-vllm-cuda

rm -rf flash-attention-v2
make install-flash-attention-v2-cuda
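
After the rebuild, a quick sanity check before relaunching (a sketch, assuming the reinstalled vllm and flash-attention-v2 kernels provide what layers.py needs) is to retry the imports that were failing in the log above:

# sanity check: these are the names the shard failed to import in the traceback
python -c "from text_generation_server.utils.layers import FastLayerNorm, FastRMSNorm, PositionRotaryEmbedding; print('layers OK')"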

They forgot to add this to the release notes about local installs.
#1738
I tried this and it solved my problem.

@shwu-nyunai
Author

I have been installing all of the extensions via those commands for 2 days now;
I also tried using the release v2.0.1 code zip.
Let me try this once more with a clean installation.

@shuaills

I have been installing all of the extensions via those commands for 2 days now; I also tried using the release v2.0.1 code zip let me try this once more with a clean installation

I feel you, I did exactly the same: installed and deleted about 4 times.

@boxiaowave

I have been installing all of the extensions via those commands for 2 days now; I also tried using the release v2.0.1 code zip let me try this once more with a clean installation

You can follow the steps in the Dockerfile: after compiling flash-attn with the 'make install-flash-attention-v2-cuda' command, the script moves the compiled files into Python's site-packages folder, like so:
cp -r /text-generation-inference/server/flash-attention-v2/build/lib.linux-x86_64-cpython-39/* /usr/local/lib/python3.10/site-packages/
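
If your Python version or install prefix differs, a more portable form of that copy is sketched below (the repo path and build directory name are assumptions; adjust to your checkout):

# find the active environment's site-packages directory
SITE_PACKAGES=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
# copy the compiled flash-attention-v2 artifacts into it (the lib.linux-* name varies by platform and Python version)
cp -r text-generation-inference/server/flash-attention-v2/build/lib.linux-x86_64-cpython-*/* "$SITE_PACKAGES/"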

@shwu-nyunai
Author

I have resolved the issues using the following set of install scripts:
https://github.com/nyunAI/Faster-LLM-Survey/tree/A100TGIv2.0.1/scripts

Usually, if you have the required versions of cmake, libkineto, protobuf & rust installed, you can directly run (sketched below):

  1. scripts/install-tgi.sh , then
  2. scripts/parallel-install-extensions.sh (this installs all extensions in parallel - flash-attn, flash-attn-v2-cuda, vllm-cuda, exllamav2_kernels, etc.)

Use the other scripts in the directory as required.

For other system and driver details, see https://github.com/nyunAI/Faster-LLM-Survey/blob/A100TGIv2.0.1/experiment_details.txt

P.S. A maintainer can close this; leaving it open for anyone facing a similar issue.
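
A minimal sketch of that flow (the clone command and branch name are assumptions based on the links above; adjust to your setup):

# clone the branch with the A100 / TGI v2.0.1 scripts (assumed layout)
git clone -b A100TGIv2.0.1 https://github.com/nyunAI/Faster-LLM-Survey.git
cd Faster-LLM-Survey
# install TGI itself, then build all GPU extensions in parallel
bash scripts/install-tgi.sh
bash scripts/parallel-install-extensions.sh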

@for-just-we

have resolved the issues using the following set of install-scripts; https://github.com/nyunAI/Faster-LLM-Survey/tree/A100TGIv2.0.1/scripts (quoting the comment above)

When installing vllm for TGI-2.0.1, I came across:

error: triton 2.3.0 is installed but triton==2.1.0 is required by {'torch'}
make: *** [Makefile-vllm:12: install-vllm-cuda] Error 1

Is this because I used the wrong vllm version? I didn't modify anything in the Makefile-* scripts.

@shwu-nyunai
Author

Your PyTorch version might be different. I faced this issue for the same reason: my PyTorch version was higher than torch==2.1.0, and hence the default triton that was installed was 2.2.0 (AFAIR).
Nonetheless, use a fresh virtual env (maybe conda) and install torch==2.1.0, or use install-tgi.sh.
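
A minimal sketch of that reset (the env name and Python version are assumptions):

# fresh environment so the pinned torch / triton pair is not clobbered by an existing install
conda create -n tgi-clean python=3.10 -y
conda activate tgi-clean
pip install torch==2.1.0
# then re-run the TGI install, e.g. scripts/install-tgi.sh from the repo linked above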

@Semihal commented Apr 27, 2024

Build and install rotary and layer_norm from https://github.com/Dao-AILab/flash-attention/tree/main/csrc.
This worked for me.
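
A sketch of that build (assumes a CUDA toolchain matching your PyTorch install; these are the extensions that utils/layers.py wraps):

git clone https://github.com/Dao-AILab/flash-attention.git
pip install ./flash-attention/csrc/rotary      # rotary embedding kernel
pip install ./flash-attention/csrc/layer_norm  # fused layer norm / RMS norm kernel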


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 28, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 2, 2024