/data22/text-generation-inference# text-generation-launcher --model-id "/data2/ollama7b"
2024-04-13T07:14:56.303639Z INFO text_generation_launcher: Args { model_id: "/data2/ollama7b", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-04-13T07:14:56.303698Z INFO text_generation_launcher: Model supports up to 16384 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=16434 --max-total-tokens=16384 --max-input-tokens=16383.
2024-04-13T07:14:56.303704Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-04-13T07:14:56.303706Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-04-13T07:14:56.303714Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-04-13T07:14:56.303716Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-13T07:14:56.303784Z INFO download: text_generation_launcher: Starting download process.
2024-04-13T07:15:00.157607Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-13T07:15:00.909166Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-13T07:15:00.909409Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-13T07:15:04.986503Z INFO text_generation_launcher: Discovered apex.normalization.FusedRMSNorm - will use it instead of T5LayerNorm
2024-04-13T07:15:05.333503Z WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-04-13T07:15:09.599782Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-04-13T07:15:09.624815Z INFO shard-manager: text_generation_launcher: Shard ready in 8.714489962s rank=0
2024-04-13T07:15:09.724775Z INFO text_generation_launcher: Starting Webserver
2024-04-13T07:15:09.803663Z INFO text_generation_router: router/src/main.rs:250: Using config Some(Llama)
2024-04-13T07:15:09.803700Z INFO text_generation_router: router/src/main.rs:257: Using local tokenizer config
2024-04-13T07:15:09.803720Z WARN text_generation_router: router/src/main.rs:292: no pipeline tag found for model /data2/ollama7b
2024-04-13T07:15:09.806984Z INFO text_generation_router: router/src/main.rs:311: Warming up model
2024-04-13T07:15:10.724133Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/aml/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/aml/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/aml/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/data22/text-generation-inference/server/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/data22/text-generation-inference/server/text_generation_server/server.py", line 240, in serve
asyncio.run(
File "/aml/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/aml/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/aml/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/aml/conda/lib/python3.10/asyncio/base_events.py", line 1906, in _run_once
handle._run()
File "/aml/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/aml/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/data22/text-generation-inference/server/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/aml/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/aml/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/data22/text-generation-inference/server/text_generation_server/server.py", line 98, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 768, in warmup
_, batch, _ = self.generate_token(batch)
File "/aml/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
raise e
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 942, in generate_token
out, speculative_logits = self.forward(batch)
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 885, in forward
return self.model.forward(
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 443, in forward
hidden_states = self.model(
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in forward
hidden_states, residual = layer(
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 316, in forward
attn_output = self.self_attn(
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 209, in forward
paged_attention.reshape_and_cache(
File "/data22/text-generation-inference/server/text_generation_server/utils/paged_attention.py", line 16, in reshape_and_cache
cache_ops.reshape_and_cache(key, value, key_cache, value_cache, slots, "auto", 1.0)
TypeError: reshape_and_cache(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: str) -> None
You need to install our version of vllm (`cd server && make install-vllm`), as we optimized the kernels for our codebase.
That's why we recommend using the Docker image: it makes the dependencies much easier to manage. (We use the CLI a lot for development; there's just no easy way to guarantee users a clean environment on an arbitrary machine with pre-existing packages.)
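Concretely, the two paths suggested above look like the following. The paths, port, and image tag are illustrative (taken from this issue's log and TGI's standard Docker image name), not verified against this machine:

```shell
# Option 1: from a source checkout, install TGI's patched vllm kernels.
# The stock pip vllm exposes a reshape_and_cache with a different signature,
# which is exactly what triggers the TypeError during warmup.
cd /data22/text-generation-inference/server
make install-vllm

# Option 2 (recommended): run the prebuilt Docker image, which ships kernels
# built to match the server code. Volume mount and port mapping are examples.
docker run --gpus all --shm-size 1g -p 3000:80 \
  -v /data2/ollama7b:/data/ollama7b \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/ollama7b
```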
System Info
text-generation-launcher --model-id "/data2/ollama7b"
The model at this path is a Llama 2 7B model.
torch 2.1.2
flash-attn 2.5.7
TGI: latest version
Information
Tasks
Reproduction
/data22/text-generation-inference# text-generation-launcher --model-id "/data2/ollama7b"
The launcher output and `Warmup` traceback are identical to the log at the top of this issue.
The `TypeError` message additionally lists the arguments the op was invoked with (tensor dump truncated as in the original log):
Invoked with: tensor([[[ 5.5566e-01, -6.1572e-01,  3.7451e-01,  ..., -7.2632e-02,  2.7759e-01, -4.3579e-02],
          ...,
          [-4.4165e-01,  1.0117e+00,  5.8545e-01,  ...,  4.8828e-01, -1.6199e-01, -6.0059e-02]],
The launcher then hangs at this point.
Expected behavior
The server should start normally and serve inference requests.
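For reference, once warmup succeeds the router listens on the configured port (3000 in the log above) and can be queried through TGI's `/generate` endpoint; the prompt and parameters below are only an example:

```shell
# Simple smoke test against a running TGI instance (port from the log above).
curl http://127.0.0.1:3000/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'
```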