System Info
I installed TGI using the script below and tried to run the launcher, but it fails with this error:
2024-04-21T02:10:40.097949Z ERROR text_generation_launcher: Shard 0 failed to start
Below is how I installed Rust, protoc, and TGI locally on Google Colab.
Install Rust
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
import os
os.environ['PATH'] += ':/root/.cargo/bin'
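As a side note, the `PATH` append above adds a duplicate entry every time the cell is re-run. A small helper (hypothetical, not part of TGI) keeps it idempotent:

```python
import os

def add_to_path(env: dict, directory: str) -> str:
    """Append `directory` to env['PATH'] only if it is not already there."""
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.append(directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]

# In the notebook this would be called as:
# add_to_path(os.environ, '/root/.cargo/bin')
```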
Install protoc
!apt install -y protobuf-compiler
Clone the repository for text-generation-inference
!git clone https://github.com/huggingface/text-generation-inference.git tgi
%cd tgi
# Install transformers
!pip install git+https://github.com/huggingface/transformers.git
Compile and install any extensions if needed (Replace with the correct make command if necessary)
!BUILD_EXTENSIONS=True make install
Reproduction
And below is the log I get by running:
!text-generation-launcher --model-id bigcode/starcoder2-3b --sharded false --quantize bitsandbytes-fp4 --disable-custom-kernels
2024-04-21T02:19:11.151696Z INFO text_generation_launcher: Args { model_id: "bigcode/starcoder2-3b", revision: None, validation_workers: 2, sharded: Some(false), num_shard: None, quantize: Some(BitsandbytesFP4), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "05f318f80dae", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-04-21T02:19:11.152123Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-04-21T02:19:11.152493Z INFO text_generation_launcher: Model supports up to 16384 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=16434 --max-total-tokens=16384 --max-input-tokens=16383.
2024-04-21T02:19:11.152557Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-04-21T02:19:11.152563Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-04-21T02:19:11.152565Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-04-21T02:19:11.152568Z INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-04-21T02:19:11.152859Z INFO download: text_generation_launcher: Starting download process.
2024-04-21T02:19:15.578314Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-21T02:19:16.458472Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-21T02:19:16.458842Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-21T02:19:20.534018Z ERROR text_generation_launcher: exllamav2_kernels not installed.
2024-04-21T02:19:20.575670Z WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/content/tgi/server/text_generation_server/utils/layers.py)
2024-04-21T02:19:20.576457Z WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-04-21T02:19:24.668451Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-04-21 02:19:20.860344: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-21 02:19:20.860389: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-21 02:19:20.862244: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-21 02:19:22.064777: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /content/tgi/server/text_generation_server/cli.py:71 in serve │
│ │
│ 68 │ ) │
│ 69 │ │
│ 70 │ # Import here after the logger is added to log potential import exceptions │
│ ❱ 71 │ from text_generation_server import server │
│ 72 │ from text_generation_server.tracing import setup_tracing │
│ 73 │ │
│ 74 │ # Setup OpenTelemetry distributed tracing │
│ │
│ ╭──────────────────────────────── locals ─────────────────────────────────╮ │
│ │ dtype = None │ │
│ │ json_output = True │ │
│ │ logger_level = 'INFO' │ │
│ │ model_id = 'bigcode/starcoder2-3b' │ │
│ │ otlp_endpoint = None │ │
│ │ quantize = <Quantization.bitsandbytes_fp4: 'bitsandbytes-fp4'> │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰─────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/server.py:16 in │
│ │
│ 13 from text_generation_server.cache import Cache │
│ 14 from text_generation_server.interceptor import ExceptionInterceptor │
│ 15 from text_generation_server.models import Model, get_model │
│ ❱ 16 from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch │
│ 17 from text_generation_server.pb import generate_pb2_grpc, generate_pb2 │
│ 18 from text_generation_server.tracing import UDSOpenTelemetryAioServerInterceptor │
│ 19 from text_generation_server.models.idefics_causal_lm import IdeficsCausalLMBatch │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ aio = <module 'grpc.aio' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/grpc/aio/init.py'> │ │
│ │ asyncio = <module 'asyncio' from '/usr/lib/python3.10/asyncio/init.py'> │ │
│ │ Cache = <class 'text_generation_server.cache.Cache'> │ │
│ │ ExceptionInterceptor = <class 'text_generation_server.interceptor.ExceptionInterceptor'> │ │
│ │ get_model = <function get_model at 0x7a805095b010> │ │
│ │ List = typing.List │ │
│ │ logger = <loguru.logger handlers=[(id=1, level=20, sink=)]> │ │
│ │ Model = <class 'text_generation_server.models.model.Model'> │ │
│ │ Optional = typing.Optional │ │
│ │ os = <module 'os' from '/usr/lib/python3.10/os.py'> │ │
│ │ Path = <class 'pathlib.Path'> │ │
│ │ reflection = <module 'grpc_reflection.v1alpha.reflection' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/grpc_reflection/v1alpha/ref… │ │
│ │ time = <module 'time' (built-in)> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/init.py'> │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/models/vlm_causal_lm.py:14 in │
│ │
│ 11 from transformers import PreTrainedTokenizerBase │
│ 12 from transformers.image_processing_utils import select_best_resolution │
│ 13 from text_generation_server.pb import generate_pb2 │
│ ❱ 14 from text_generation_server.models.flash_mistral import ( │
│ 15 │ BaseFlashMistral, │
│ 16 │ FlashMistralBatch, │
│ 17 ) │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ base64 = <module 'base64' from '/usr/lib/python3.10/base64.py'> │ │
│ │ BytesIO = <class '_io.BytesIO'> │ │
│ │ Dict = typing.Dict │ │
│ │ generate_pb2 = <module 'text_generation_server.pb.generate_pb2' from │ │
│ │ '/content/tgi/server/text_generation_server/pb/generate_pb2.py'> │ │
│ │ Image = <module 'PIL.Image' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/PIL/Image.py'> │ │
│ │ List = typing.List │ │
│ │ math = <module 'math' (built-in)> │ │
│ │ Optional = typing.Optional │ │
│ │ PreTrainedTokenizerBase = <class │ │
│ │ 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'> │ │
│ │ re = <module 're' from '/usr/lib/python3.10/re.py'> │ │
│ │ select_best_resolution = <function select_best_resolution at 0x7a7f876f2710> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/init.py'> │ │
│ │ trace = <module 'opentelemetry.trace' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/opentelemetry/trace/__in… │ │
│ │ Tuple = typing.Tuple │ │
│ │ Type = typing.Type │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/models/flash_mistral.py:18 in │
│ │
│ 15 from text_generation_server.models.cache_manager import ( │
│ 16 │ get_cache_manager, │
│ 17 ) │
│ ❱ 18 from text_generation_server.models.custom_modeling.flash_mistral_modeling import ( │
│ 19 │ FlashMistralForCausalLM, │
│ 20 │ MistralConfig, │
│ 21 ) │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ AutoConfig = <class 'transformers.models.auto.configuration_auto.AutoConfig'> │ │
│ │ AutoTokenizer = <class 'transformers.models.auto.tokenization_auto.AutoTokenizer'> │ │
│ │ BLOCK_SIZE = 16 │ │
│ │ dataclass = <function dataclass at 0x7a8121e38ca0> │ │
│ │ FlashCausalLM = <class │ │
│ │ 'text_generation_server.models.flash_causal_lm.FlashCausalLM'> │ │
│ │ FlashCausalLMBatch = <class │ │
│ │ 'text_generation_server.models.flash_causal_lm.FlashCausalLMBatch… │ │
│ │ generate_pb2 = <module 'text_generation_server.pb.generate_pb2' from │ │
│ │ '/content/tgi/server/text_generation_server/pb/generate_pb2.py'> │ │
│ │ get_cache_manager = <function get_cache_manager at 0x7a801d1ce830> │ │
│ │ math = <module 'math' (built-in)> │ │
│ │ np = <module 'numpy' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/numpy/init.py'> │ │
│ │ Optional = typing.Optional │ │
│ │ PreTrainedTokenizerBase = <class │ │
│ │ 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/init.py'> │ │
│ │ trace = <module 'opentelemetry.trace' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/opentelemetry/trace/__in… │ │
│ │ Tuple = typing.Tuple │ │
│ │ Type = typing.Type │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /content/tgi/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py:30 │
│ in │
│ │
│ 27 from typing import Optional, List, Tuple │
│ 28 │
│ 29 from text_generation_server.utils import paged_attention, flash_attn │
│ ❱ 30 from text_generation_server.utils.layers import ( │
│ 31 │ TensorParallelRowLinear, │
│ 32 │ TensorParallelColumnLinear, │
│ 33 │ TensorParallelEmbedding, │
│ │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ ACT2FN = ClassInstantier([('gelu', <class │ │
│ │ 'transformers.activations.GELUActivation'>), ('gelu_10', │ │
│ │ (<class 'transformers.activations.ClippedGELUActivation'>, │ │
│ │ {'min': -10, 'max': 10})), ('gelu_fast', <class │ │
│ │ 'transformers.activations.FastGELUActivation'>), ('gelu_new', │ │
│ │ <class 'transformers.activations.NewGELUActivation'>), │ │
│ │ ('gelu_python', (<class │ │
│ │ 'transformers.activations.GELUActivation'>, {'use_gelu_python': │ │
│ │ True})), ('gelu_pytorch_tanh', <class │ │
│ │ 'transformers.activations.PytorchGELUTanh'>), ('gelu_accurate', │ │
│ │ <class 'transformers.activations.AccurateGELUActivation'>), │ │
│ │ ('laplace', <class │ │
│ │ 'transformers.activations.LaplaceActivation'>), ('leaky_relu', │ │
│ │ <class 'torch.nn.modules.activation.LeakyReLU'>), ('linear', │ │
│ │ <class 'transformers.activations.LinearActivation'>), ('mish', │ │
│ │ <class 'transformers.activations.MishActivation'>), │ │
│ │ ('quick_gelu', <class │ │
│ │ 'transformers.activations.QuickGELUActivation'>), ('relu', │ │
│ │ <class 'torch.nn.modules.activation.ReLU'>), ('relu2', <class │ │
│ │ 'transformers.activations.ReLUSquaredActivation'>), ('relu6', │ │
│ │ <class 'torch.nn.modules.activation.ReLU6'>), ('sigmoid', │ │
│ │ <class 'torch.nn.modules.activation.Sigmoid'>), ('silu', <class │ │
│ │ 'torch.nn.modules.activation.SiLU'>), ('swish', <class │ │
│ │ 'torch.nn.modules.activation.SiLU'>), ('tanh', <class │ │
│ │ 'torch.nn.modules.activation.Tanh'>)]) │ │
│ │ flash_attn = <module 'text_generation_server.utils.flash_attn' from │ │
│ │ '/content/tgi/server/text_generation_server/utils/flash_attn.p… │ │
│ │ List = typing.List │ │
│ │ nn = <module 'torch.nn' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/nn/init.py'> │ │
│ │ Optional = typing.Optional │ │
│ │ paged_attention = <module 'text_generation_server.utils.paged_attention' from │ │
│ │ '/content/tgi/server/text_generation_server/utils/paged_attent… │ │
│ │ PretrainedConfig = <class 'transformers.configuration_utils.PretrainedConfig'> │ │
│ │ TensorParallelColumnLinear = <class │ │
│ │ 'text_generation_server.utils.layers.TensorParallelColumnLinea… │ │
│ │ TensorParallelEmbedding = <class │ │
│ │ 'text_generation_server.utils.layers.TensorParallelEmbedding'> │ │
│ │ TensorParallelRowLinear = <class │ │
│ │ 'text_generation_server.utils.layers.TensorParallelRowLinear'> │ │
│ │ torch = <module 'torch' from │ │
│ │ '/usr/local/lib/python3.10/dist-packages/torch/init.py'> │ │
│ │ Tuple = typing.Tuple │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers'
(/content/tgi/server/text_generation_server/utils/layers.py) rank=0
2024-04-21T02:19:24.765988Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-21T02:19:24.766013Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
Expected behavior
The launcher should start and serve the model, but instead it exits with the ImportError above.
Any help is appreciated, thank you.
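For what it's worth, the traceback shows that `text_generation_server.utils.layers` itself imports fine but lacks `PositionRotaryEmbedding` (and, per the earlier warning, `FastLayerNorm`), which suggests a partially built server package rather than a missing third-party dependency. A small, hypothetical pre-flight check (not part of TGI) can list which names are actually missing before launching:

```python
import importlib

def missing_symbols(module_name: str, names: list) -> list:
    """Return the names that cannot be resolved on the given module.

    If the module itself fails to import, return a single entry with
    the import error message instead.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return [f"{module_name}: {exc}"]
    return [n for n in names if not hasattr(module, n)]

# For this issue one would check (module path from the traceback):
# missing_symbols("text_generation_server.utils.layers",
#                 ["FastLayerNorm", "PositionRotaryEmbedding"])
```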