System Info
I installed TGI using the script below and tried to run the launcher, but it fails with this error:
2024-04-21T02:10:40.097949Z ERROR text_generation_launcher: Shard 0 failed to start
Below is how I installed Rust, protoc, and TGI locally on Google Colab.
Install Rust
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
import os
os.environ['PATH'] += ':/root/.cargo/bin'
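As a side note, the `PATH` append above adds a duplicate entry every time the cell is re-run. A small helper (hypothetical, not part of TGI) keeps it idempotent:

```python
import os

def add_to_path(env: dict, directory: str) -> str:
    """Append `directory` to env['PATH'] only if it is not already there."""
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.append(directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]

# In the notebook this would be called as:
# add_to_path(os.environ, '/root/.cargo/bin')
```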
Install protoc
!apt install -y protobuf-compiler
Clone the repository for text-generation-inference
!git clone https://github.com/huggingface/text-generation-inference.git tgi
%cd tgi
# Install transformers
!pip install git+https://github.com/huggingface/transformers.git
Compile and install any extensions if needed (Replace with the correct make command if necessary)
!BUILD_EXTENSIONS=True make install
Reproduction
And below is the log I get by running:
!text-generation-launcher --model-id bigcode/starcoder2-3b --sharded false --quantize bitsandbytes-fp4 --disable-custom-kernels
2024-04-21T02:19:11.151696Z INFO text_generation_launcher: Args { model_id: "bigcode/starcoder2-3b", revision: None, validation_workers: 2, sharded: Some(false), num_shard: None, quantize: Some(BitsandbytesFP4), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "05f318f80dae", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-04-21T02:19:11.152123Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-04-21T02:19:11.152493Z INFO text_generation_launcher: Model supports up to 16384 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=16434 --max-total-tokens=16384 --max-input-tokens=16383.
2024-04-21T02:19:11.152557Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-04-21T02:19:11.152563Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-04-21T02:19:11.152565Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-04-21T02:19:11.152568Z INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-04-21T02:19:11.152859Z INFO download: text_generation_launcher: Starting download process.
2024-04-21T02:19:15.578314Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-21T02:19:16.458472Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-21T02:19:16.458842Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-21T02:19:20.534018Z ERROR text_generation_launcher: exllamav2_kernels not installed.
2024-04-21T02:19:20.575670Z WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/content/tgi/server/text_generation_server/utils/layers.py)
2024-04-21T02:19:20.576457Z WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-04-21T02:19:24.668451Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-04-21 02:19:20.860344: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-21 02:19:20.860389: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-21 02:19:20.862244: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-21 02:19:22.064777: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /content/tgi/server/text_generation_server/cli.py:71 in serve │
│ │
│ 68 │ ) │
│ 69 │ │
│ 70 │ # Import here after the logger is added to log potential import exceptions │
│ ❱ 71 │ from text_generation_server import server │
│ 72 │ from text_generation_server.tracing import setup_tracing │
│ 73 │ │
│ 74 │ # Setup OpenTelemetry distributed tracing │
│ │
│ ╭──────────────────────────────── locals ─────────────────────────────────╮ │
│ │ dtype = None │ │
│ │ json_output = True │ │
│ │ logger_level = 'INFO' │ │
│ │ model_id = 'bigcode/starcoder2-3b' │ │
│ │ otlp_endpoint = None │ │
│ │ quantize = <Quantization.bitsandbytes_fp4: 'bitsandbytes-fp4'> │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰─────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/server.py:16 in │
│ │
│ 13 from text_generation_server.cache import Cache │
│ 14 from text_generation_server.interceptor import ExceptionInterceptor │
│ 15 from text_generation_server.models import Model, get_model │
│ ❱ 16 from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch │
│ 17 from text_generation_server.pb import generate_pb2_grpc, generate_pb2 │
│ 18 from text_generation_server.tracing import UDSOpenTelemetryAioServerInterceptor │
│ 19 from text_generation_server.models.idefics_causal_lm import IdeficsCausalLMBatch │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ aio = <module 'grpc.aio' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/grpc/aio/init.py'> │ │
│ │ asyncio = <module 'asyncio' from '/usr/lib/python3.10/asyncio/init.py'> │ │
│ │ Cache = <class 'text_generation_server.cache.Cache'> │ │
│ │ ExceptionInterceptor = <class 'text_generation_server.interceptor.ExceptionInterceptor'> │ │
│ │ get_model = <function get_model at 0x7a805095b010> │ │
│ │ List = typing.List │ │
│ │ logger = <loguru.logger handlers=[(id=1, level=20, sink=)]> │ │
│ │ Model = <class 'text_generation_server.models.model.Model'> │ │
│ │ Optional = typing.Optional │ │
│ │ os = <module 'os' from '/usr/lib/python3.10/os.py'> │ │
│ │ Path = <class 'pathlib.Path'> │ │
│ │ reflection = <module 'grpc_reflection.v1alpha.reflection' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/grpc_reflection/v1alpha/ref… │ │
│ │ time = <module 'time' (built-in)> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/init.py'> │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/models/vlm_causal_lm.py:14 in │
│ │
│ 11 from transformers import PreTrainedTokenizerBase │
│ 12 from transformers.image_processing_utils import select_best_resolution │
│ 13 from text_generation_server.pb import generate_pb2 │
│ ❱ 14 from text_generation_server.models.flash_mistral import ( │
│ 15 │ BaseFlashMistral, │
│ 16 │ FlashMistralBatch, │
│ 17 ) │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ base64 = <module 'base64' from '/usr/lib/python3.10/base64.py'> │ │
│ │ BytesIO = <class '_io.BytesIO'> │ │
│ │ Dict = typing.Dict │ │
│ │ generate_pb2 = <module 'text_generation_server.pb.generate_pb2' from │ │
│ │ '/content/tgi/server/text_generation_server/pb/generate_pb2.py'> │ │
│ │ Image = <module 'PIL.Image' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/PIL/Image.py'> │ │
│ │ List = typing.List │ │
│ │ math = <module 'math' (built-in)> │ │
│ │ Optional = typing.Optional │ │
│ │ PreTrainedTokenizerBase = <class │ │
│ │ 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'> │ │
│ │ re = <module 're' from '/usr/lib/python3.10/re.py'> │ │
│ │ select_best_resolution = <function select_best_resolution at 0x7a7f876f2710> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/init.py'> │ │
│ │ trace = <module 'opentelemetry.trace' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/opentelemetry/trace/__in… │ │
│ │ Tuple = typing.Tuple │ │
│ │ Type = typing.Type │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/models/flash_mistral.py:18 in │
│ │
│ 15 from text_generation_server.models.cache_manager import ( │
│ 16 │ get_cache_manager, │
│ 17 ) │
│ ❱ 18 from text_generation_server.models.custom_modeling.flash_mistral_modeling import ( │
│ 19 │ FlashMistralForCausalLM, │
│ 20 │ MistralConfig, │
│ 21 ) │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ AutoConfig = <class 'transformers.models.auto.configuration_auto.AutoConfig'> │ │
│ │ AutoTokenizer = <class 'transformers.models.auto.tokenization_auto.AutoTokenizer'> │ │
│ │ BLOCK_SIZE = 16 │ │
│ │ dataclass = <function dataclass at 0x7a8121e38ca0> │ │
│ │ FlashCausalLM = <class │ │
│ │ 'text_generation_server.models.flash_causal_lm.FlashCausalLM'> │ │
│ │ FlashCausalLMBatch = <class │ │
│ │ 'text_generation_server.models.flash_causal_lm.FlashCausalLMBatch… │ │
│ │ generate_pb2 = <module 'text_generation_server.pb.generate_pb2' from │ │
│ │ '/content/tgi/server/text_generation_server/pb/generate_pb2.py'> │ │
│ │ get_cache_manager = <function get_cache_manager at 0x7a801d1ce830> │ │
│ │ math = <module 'math' (built-in)> │ │
│ │ np = <module 'numpy' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/numpy/init.py'> │ │
│ │ Optional = typing.Optional │ │
│ │ PreTrainedTokenizerBase = <class │ │
│ │ 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/init.py'> │ │
│ │ trace = <module 'opentelemetry.trace' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/opentelemetry/trace/__in… │ │
│ │ Tuple = typing.Tuple │ │
│ │ Type = typing.Type │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py:30 │
│ in │
│ │
│ 27 from typing import Optional, List, Tuple │
│ 28 │
│ 29 from text_generation_server.utils import paged_attention, flash_attn │
│ ❱ 30 from text_generation_server.utils.layers import ( │
│ 31 │ TensorParallelRowLinear, │
│ 32 │ TensorParallelColumnLinear, │
│ 33 │ TensorParallelEmbedding, │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ ACT2FN = ClassInstantier([('gelu', <class │ │
│ │ 'transformers.activations.GELUActivation'>), ('gelu_10', │ │
│ │ (<class 'transformers.activations.ClippedGELUActivation'>, │ │
│ │ {'min': -10, 'max': 10})), ('gelu_fast', <class │ │
│ │ 'transformers.activations.FastGELUActivation'>), ('gelu_new', │ │
│ │ <class 'transformers.activations.NewGELUActivation'>), │ │
│ │ ('gelu_python', (<class │ │
│ │ 'transformers.activations.GELUActivation'>, {'use_gelu_python': │ │
│ │ True})), ('gelu_pytorch_tanh', <class │ │
│ │ 'transformers.activations.PytorchGELUTanh'>), ('gelu_accurate', │ │
│ │ <class 'transformers.activations.AccurateGELUActivation'>), │ │
│ │ ('laplace', <class │ │
│ │ 'transformers.activations.LaplaceActivation'>), ('leaky_relu', │ │
│ │ <class 'torch.nn.modules.activation.LeakyReLU'>), ('linear', │ │
│ │ <class 'transformers.activations.LinearActivation'>), ('mish', │ │
│ │ <class 'transformers.activations.MishActivation'>), │ │
│ │ ('quick_gelu', <class │ │
│ │ 'transformers.activations.QuickGELUActivation'>), ('relu', │ │
│ │ <class 'torch.nn.modules.activation.ReLU'>), ('relu2', <class │ │
│ │ 'transformers.activations.ReLUSquaredActivation'>), ('relu6', │ │
│ │ <class 'torch.nn.modules.activation.ReLU6'>), ('sigmoid', │ │
│ │ <class 'torch.nn.modules.activation.Sigmoid'>), ('silu', <class │ │
│ │ 'torch.nn.modules.activation.SiLU'>), ('swish', <class │ │
│ │ 'torch.nn.modules.activation.SiLU'>), ('tanh', <class │ │
│ │ 'torch.nn.modules.activation.Tanh'>)]) │ │
│ │ flash_attn = <module 'text_generation_server.utils.flash_attn' from │ │
│ │ '/content/tgi/server/text_generation_server/utils/flash_attn.p… │ │
│ │ List = typing.List │ │
│ │ nn = <module 'torch.nn' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/nn/init.py'> │ │
│ │ Optional = typing.Optional │ │
│ │ paged_attention = <module 'text_generation_server.utils.paged_attention' from │ │
│ │ '/content/tgi/server/text_generation_server/utils/paged_attent… │ │
│ │ PretrainedConfig = <class 'transformers.configuration_utils.PretrainedConfig'> │ │
│ │ TensorParallelColumnLinear = <class │ │
│ │ 'text_generation_server.utils.layers.TensorParallelColumnLinea… │ │
│ │ TensorParallelEmbedding = <class │ │
│ │ 'text_generation_server.utils.layers.TensorParallelEmbedding'> │ │
│ │ TensorParallelRowLinear = <class │ │
│ │ 'text_generation_server.utils.layers.TensorParallelRowLinear'> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/init.py'> │ │
│ │ Tuple = typing.Tuple │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers'
(/content/tgi/server/text_generation_server/utils/layers.py) rank=0
2024-04-21T02:19:24.765988Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-21T02:19:24.766013Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
Expected behavior
The launcher should start and serve the model, but instead it exits with the ImportError above.
Any help is appreciated, thank you.
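For what it's worth, the traceback shows that `text_generation_server.utils.layers` itself imports fine but lacks `PositionRotaryEmbedding` (and, per the earlier warning, `FastLayerNorm`), which suggests a partially built server package rather than a missing third-party dependency. A small, hypothetical pre-flight check (not part of TGI) can list which names are actually missing before launching:

```python
import importlib

def missing_symbols(module_name: str, names: list) -> list:
    """Return the names that cannot be resolved on the given module.

    If the module itself fails to import, return a single entry with
    the import error message instead.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return [f"{module_name}: {exc}"]
    return [n for n in names if not hasattr(module, n)]

# For this issue one would check (module path from the traceback):
# missing_symbols("text_generation_server.utils.layers",
#                 ["FastLayerNorm", "PositionRotaryEmbedding"])
```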