/data22/text-generation-inference# text-generation-launcher --model-id "/data2/ollama7b"
2024-04-13T07:14:56.303639Z INFO text_generation_launcher: Args { model_id: "/data2/ollama7b", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-04-13T07:14:56.303698Z INFO text_generation_launcher: Model supports up to 16384 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=16434 --max-total-tokens=16384 --max-input-tokens=16383.
2024-04-13T07:14:56.303704Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-04-13T07:14:56.303706Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-04-13T07:14:56.303714Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-04-13T07:14:56.303716Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-13T07:14:56.303784Z INFO download: text_generation_launcher: Starting download process.
2024-04-13T07:15:00.157607Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-13T07:15:00.909166Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-13T07:15:00.909409Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-13T07:15:04.986503Z INFO text_generation_launcher: Discovered apex.normalization.FusedRMSNorm - will use it instead of T5LayerNorm
2024-04-13T07:15:05.333503Z WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-04-13T07:15:09.599782Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-04-13T07:15:09.624815Z INFO shard-manager: text_generation_launcher: Shard ready in 8.714489962s rank=0
2024-04-13T07:15:09.724775Z INFO text_generation_launcher: Starting Webserver
2024-04-13T07:15:09.803663Z INFO text_generation_router: router/src/main.rs:250: Using config Some(Llama)
2024-04-13T07:15:09.803700Z INFO text_generation_router: router/src/main.rs:257: Using local tokenizer config
2024-04-13T07:15:09.803720Z WARN text_generation_router: router/src/main.rs:292: no pipeline tag found for model /data2/ollama7b
2024-04-13T07:15:09.806984Z INFO text_generation_router: router/src/main.rs:311: Warming up model
2024-04-13T07:15:10.724133Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/aml/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/aml/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/aml/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/data22/text-generation-inference/server/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/data22/text-generation-inference/server/text_generation_server/server.py", line 240, in serve
asyncio.run(
File "/aml/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/aml/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/aml/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/aml/conda/lib/python3.10/asyncio/base_events.py", line 1906, in _run_once
handle._run()
File "/aml/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/aml/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/data22/text-generation-inference/server/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/aml/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/aml/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/data22/text-generation-inference/server/text_generation_server/server.py", line 98, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 768, in warmup
_, batch, _ = self.generate_token(batch)
File "/aml/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
raise e
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 942, in generate_token
out, speculative_logits = self.forward(batch)
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 885, in forward
return self.model.forward(
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 443, in forward
hidden_states = self.model(
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in forward
hidden_states, residual = layer(
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 316, in forward
attn_output = self.self_attn(
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 209, in forward
paged_attention.reshape_and_cache(
File "/data22/text-generation-inference/server/text_generation_server/utils/paged_attention.py", line 16, in reshape_and_cache
cache_ops.reshape_and_cache(key, value, key_cache, value_cache, slots, "auto", 1.0)
TypeError: reshape_and_cache(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: str) -> None
You need to install our version of vllm (`cd server && make install-vllm`), as we optimized the kernels for our codebase.
That's why we recommend using the Docker image: it makes the dependencies much easier to manage. (We use the CLI a lot for development; there's just no easy way to guarantee users a clean environment on an arbitrary machine with pre-existing packages.)
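Concretely, the two paths suggested above look like the following. The paths, port, and image tag are illustrative (taken from this issue's log and TGI's standard Docker image name), not verified against this machine:

```shell
# Option 1: from a source checkout, install TGI's patched vllm kernels.
# The stock pip vllm exposes a reshape_and_cache with a different signature,
# which is exactly what triggers the TypeError during warmup.
cd /data22/text-generation-inference/server
make install-vllm

# Option 2 (recommended): run the prebuilt Docker image, which ships kernels
# built to match the server code. Volume mount and port mapping are examples.
docker run --gpus all --shm-size 1g -p 3000:80 \
  -v /data2/ollama7b:/data/ollama7b \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/ollama7b
```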
System Info
text-generation-launcher --model-id "/data2/ollama7b"
The model at this path is a Llama 2 7B model.
torch 2.1.2
flash-attn 2.5.7
TGI: latest version
Information
Tasks
Reproduction
/data22/text-generation-inference# text-generation-launcher --model-id "/data2/ollama7b"
The launcher output and `Warmup` traceback are identical to the log at the top of this issue.
The `TypeError` message additionally lists the arguments the op was invoked with (tensor dump truncated as in the original log):
Invoked with: tensor([[[ 5.5566e-01, -6.1572e-01,  3.7451e-01,  ..., -7.2632e-02,  2.7759e-01, -4.3579e-02],
          ...,
          [-4.4165e-01,  1.0117e+00,  5.8545e-01,  ...,  4.8828e-01, -1.6199e-01, -6.0059e-02]],
The launcher then hangs at this point.
Expected behavior
The server should start normally and serve inference requests.
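For reference, once warmup succeeds the router listens on the configured port (3000 in the log above) and can be queried through TGI's `/generate` endpoint; the prompt and parameters below are only an example:

```shell
# Simple smoke test against a running TGI instance (port from the log above).
curl http://127.0.0.1:3000/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'
```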