
Not able to run tgi in Google Colab. Shard Cannot Start #1780

Closed · 3 of 4 tasks
andychoi98 opened this issue Apr 21, 2024 · 1 comment

@andychoi98

System Info

I installed TGI using the script below and tried to run the launcher, but it fails with this error:

2024-04-21T02:10:40.097949Z ERROR text_generation_launcher: Shard 0 failed to start

Below is how I installed Rust, protoc, and TGI locally on Google Colab:

Install Rust

!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
import os
os.environ['PATH'] += ':/root/.cargo/bin'
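
A quick way to confirm the toolchain is on PATH (assuming the default rustup location under /root/.cargo) is a cell like:

!rustc --version
!cargo --version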

Install protoc

!apt install -y protobuf-compiler
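
Likewise, protoc availability can be checked with:

!protoc --version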

Clone the repository for text-generation-inference

!git clone https://github.com/huggingface/text-generation-inference.git tgi
%cd tgi
# Install transformers
!pip install git+https://github.com/huggingface/transformers.git

Compile and install any extensions if needed (Replace with the correct make command if necessary)

!BUILD_EXTENSIONS=True make install
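
After the build, the launcher binary and the Python server package should both be discoverable; a minimal check (assuming make install used the default locations) could be:

!which text-generation-launcher
!python -c "import text_generation_server; print('import OK')"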

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Below is the log I get when running:

!text-generation-launcher --model-id bigcode/starcoder2-3b --sharded false --quantize bitsandbytes-fp4 --disable-custom-kernels

2024-04-21T02:19:11.151696Z INFO text_generation_launcher: Args { model_id: "bigcode/starcoder2-3b", revision: None, validation_workers: 2, sharded: Some(false), num_shard: None, quantize: Some(BitsandbytesFP4), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "05f318f80dae", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-04-21T02:19:11.152123Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-04-21T02:19:11.152493Z INFO text_generation_launcher: Model supports up to 16384 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=16434 --max-total-tokens=16384 --max-input-tokens=16383.
2024-04-21T02:19:11.152557Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-04-21T02:19:11.152563Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-04-21T02:19:11.152565Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-04-21T02:19:11.152568Z INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-04-21T02:19:11.152859Z INFO download: text_generation_launcher: Starting download process.
2024-04-21T02:19:15.578314Z INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-21T02:19:16.458472Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-21T02:19:16.458842Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-21T02:19:20.534018Z ERROR text_generation_launcher: exllamav2_kernels not installed.

2024-04-21T02:19:20.575670Z WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/content/tgi/server/text_generation_server/utils/layers.py)

2024-04-21T02:19:20.576457Z WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'

2024-04-21T02:19:24.668451Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2024-04-21 02:19:20.860344: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-21 02:19:20.860389: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-21 02:19:20.862244: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-21 02:19:22.064777: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /content/tgi/server/text_generation_server/cli.py:71 in serve │
│ │
│ 68 │ ) │
│ 69 │ │
│ 70 │ # Import here after the logger is added to log potential import exceptions │
│ ❱ 71 │ from text_generation_server import server │
│ 72 │ from text_generation_server.tracing import setup_tracing │
│ 73 │ │
│ 74 │ # Setup OpenTelemetry distributed tracing │
│ │
│ ╭──────────────────────────────── locals ─────────────────────────────────╮ │
│ │ dtype = None │ │
│ │ json_output = True │ │
│ │ logger_level = 'INFO' │ │
│ │ model_id = 'bigcode/starcoder2-3b' │ │
│ │ otlp_endpoint = None │ │
│ │ quantize = <Quantization.bitsandbytes_fp4: 'bitsandbytes-fp4'> │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰─────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/server.py:16 in │
│ │
│ 13 from text_generation_server.cache import Cache │
│ 14 from text_generation_server.interceptor import ExceptionInterceptor │
│ 15 from text_generation_server.models import Model, get_model │
│ ❱ 16 from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch │
│ 17 from text_generation_server.pb import generate_pb2_grpc, generate_pb2 │
│ 18 from text_generation_server.tracing import UDSOpenTelemetryAioServerInterceptor │
│ 19 from text_generation_server.models.idefics_causal_lm import IdeficsCausalLMBatch │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ aio = <module 'grpc.aio' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/grpc/aio/__init__.py'> │ │
│ │ asyncio = <module 'asyncio' from '/usr/lib/python3.10/asyncio/__init__.py'> │ │
│ │ Cache = <class 'text_generation_server.cache.Cache'> │ │
│ │ ExceptionInterceptor = <class 'text_generation_server.interceptor.ExceptionInterceptor'> │ │
│ │ get_model = <function get_model at 0x7a805095b010> │ │
│ │ List = typing.List │ │
│ │ logger = <loguru.logger handlers=[(id=1, level=20, sink=)]> │ │
│ │ Model = <class 'text_generation_server.models.model.Model'> │ │
│ │ Optional = typing.Optional │ │
│ │ os = <module 'os' from '/usr/lib/python3.10/os.py'> │ │
│ │ Path = <class 'pathlib.Path'> │ │
│ │ reflection = <module 'grpc_reflection.v1alpha.reflection' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/grpc_reflection/v1alpha/ref… │ │
│ │ time = <module 'time' (built-in)> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/__init__.py'> │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/models/vlm_causal_lm.py:14 in │
│ │
│ 11 from transformers import PreTrainedTokenizerBase │
│ 12 from transformers.image_processing_utils import select_best_resolution │
│ 13 from text_generation_server.pb import generate_pb2 │
│ ❱ 14 from text_generation_server.models.flash_mistral import ( │
│ 15 │ BaseFlashMistral, │
│ 16 │ FlashMistralBatch, │
│ 17 ) │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ base64 = <module 'base64' from '/usr/lib/python3.10/base64.py'> │ │
│ │ BytesIO = <class '_io.BytesIO'> │ │
│ │ Dict = typing.Dict │ │
│ │ generate_pb2 = <module 'text_generation_server.pb.generate_pb2' from │ │
│ │ '/content/tgi/server/text_generation_server/pb/generate_pb2.py'> │ │
│ │ Image = <module 'PIL.Image' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/PIL/Image.py'> │ │
│ │ List = typing.List │ │
│ │ math = <module 'math' (built-in)> │ │
│ │ Optional = typing.Optional │ │
│ │ PreTrainedTokenizerBase = <class │ │
│ │ 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'> │ │
│ │ re = <module 're' from '/usr/lib/python3.10/re.py'> │ │
│ │ select_best_resolution = <function select_best_resolution at 0x7a7f876f2710> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/__init__.py'> │ │
│ │ trace = <module 'opentelemetry.trace' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/opentelemetry/trace/__in… │ │
│ │ Tuple = typing.Tuple │ │
│ │ Type = typing.Type │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/models/flash_mistral.py:18 in │
│ │
│ 15 from text_generation_server.models.cache_manager import ( │
│ 16 │ get_cache_manager, │
│ 17 ) │
│ ❱ 18 from text_generation_server.models.custom_modeling.flash_mistral_modeling import ( │
│ 19 │ FlashMistralForCausalLM, │
│ 20 │ MistralConfig, │
│ 21 ) │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ AutoConfig = <class 'transformers.models.auto.configuration_auto.AutoConfig'> │ │
│ │ AutoTokenizer = <class 'transformers.models.auto.tokenization_auto.AutoTokenizer'> │ │
│ │ BLOCK_SIZE = 16 │ │
│ │ dataclass = <function dataclass at 0x7a8121e38ca0> │ │
│ │ FlashCausalLM = <class │ │
│ │ 'text_generation_server.models.flash_causal_lm.FlashCausalLM'> │ │
│ │ FlashCausalLMBatch = <class │ │
│ │ 'text_generation_server.models.flash_causal_lm.FlashCausalLMBatch… │ │
│ │ generate_pb2 = <module 'text_generation_server.pb.generate_pb2' from │ │
│ │ '/content/tgi/server/text_generation_server/pb/generate_pb2.py'> │ │
│ │ get_cache_manager = <function get_cache_manager at 0x7a801d1ce830> │ │
│ │ math = <module 'math' (built-in)> │ │
│ │ np = <module 'numpy' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/numpy/__init__.py'> │ │
│ │ Optional = typing.Optional │ │
│ │ PreTrainedTokenizerBase = <class │ │
│ │ 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/__init__.py'> │ │
│ │ trace = <module 'opentelemetry.trace' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/opentelemetry/trace/__in… │ │
│ │ Tuple = typing.Tuple │ │
│ │ Type = typing.Type │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py:30 │
│ in │
│ │
│ 27 from typing import Optional, List, Tuple │
│ 28 │
│ 29 from text_generation_server.utils import paged_attention, flash_attn │
│ ❱ 30 from text_generation_server.utils.layers import ( │
│ 31 │ TensorParallelRowLinear, │
│ 32 │ TensorParallelColumnLinear, │
│ 33 │ TensorParallelEmbedding, │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ ACT2FN = ClassInstantier([('gelu', <class │ │
│ │ 'transformers.activations.GELUActivation'>), ('gelu_10', │ │
│ │ (<class 'transformers.activations.ClippedGELUActivation'>, │ │
│ │ {'min': -10, 'max': 10})), ('gelu_fast', <class │ │
│ │ 'transformers.activations.FastGELUActivation'>), ('gelu_new', │ │
│ │ <class 'transformers.activations.NewGELUActivation'>), │ │
│ │ ('gelu_python', (<class │ │
│ │ 'transformers.activations.GELUActivation'>, {'use_gelu_python': │ │
│ │ True})), ('gelu_pytorch_tanh', <class │ │
│ │ 'transformers.activations.PytorchGELUTanh'>), ('gelu_accurate', │ │
│ │ <class 'transformers.activations.AccurateGELUActivation'>), │ │
│ │ ('laplace', <class │ │
│ │ 'transformers.activations.LaplaceActivation'>), ('leaky_relu', │ │
│ │ <class 'torch.nn.modules.activation.LeakyReLU'>), ('linear', │ │
│ │ <class 'transformers.activations.LinearActivation'>), ('mish', │ │
│ │ <class 'transformers.activations.MishActivation'>), │ │
│ │ ('quick_gelu', <class │ │
│ │ 'transformers.activations.QuickGELUActivation'>), ('relu', │ │
│ │ <class 'torch.nn.modules.activation.ReLU'>), ('relu2', <class │ │
│ │ 'transformers.activations.ReLUSquaredActivation'>), ('relu6', │ │
│ │ <class 'torch.nn.modules.activation.ReLU6'>), ('sigmoid', │ │
│ │ <class 'torch.nn.modules.activation.Sigmoid'>), ('silu', <class │ │
│ │ 'torch.nn.modules.activation.SiLU'>), ('swish', <class │ │
│ │ 'torch.nn.modules.activation.SiLU'>), ('tanh', <class │ │
│ │ 'torch.nn.modules.activation.Tanh'>)]) │ │
│ │ flash_attn = <module 'text_generation_server.utils.flash_attn' from │ │
│ │ '/content/tgi/server/text_generation_server/utils/flash_attn.p… │ │
│ │ List = typing.List │ │
│ │ nn = <module 'torch.nn' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/nn/__init__.py'> │ │
│ │ Optional = typing.Optional │ │
│ │ paged_attention = <module 'text_generation_server.utils.paged_attention' from │ │
│ │ '/content/tgi/server/text_generation_server/utils/paged_attent… │ │
│ │ PretrainedConfig = <class 'transformers.configuration_utils.PretrainedConfig'> │ │
│ │ TensorParallelColumnLinear = <class │ │
│ │ 'text_generation_server.utils.layers.TensorParallelColumnLinea… │ │
│ │ TensorParallelEmbedding = <class │ │
│ │ 'text_generation_server.utils.layers.TensorParallelEmbedding'> │ │
│ │ TensorParallelRowLinear = <class │ │
│ │ 'text_generation_server.utils.layers.TensorParallelRowLinear'> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/__init__.py'> │ │
│ │ Tuple = typing.Tuple │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers'
(/content/tgi/server/text_generation_server/utils/layers.py) rank=0
2024-04-21T02:19:24.765988Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-21T02:19:24.766013Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

Expected behavior

The launcher is expected to start and serve the model, but instead it fails with the error above.
Any help is appreciated, thank you.

andychoi98 changed the title from "Not able to run in Google Colab" to "Not able to run tgi in Google Colab. Shard Cannot Start" on Apr 21, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on May 22, 2024
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on May 27, 2024