
Error for reshape_and_cache: cache_ops.reshape_and_cache(key, value, key_cache, value_cache, slots, "auto", 1.0) #1738

Closed
2 of 4 tasks
hellangleZ opened this issue Apr 13, 2024 · 3 comments


@hellangleZ

System Info

text-generation-launcher --model-id "/data2/ollama7b"

This is a Llama 2 7B model:

[screenshot]

torch 2.1.2
flash-attn 2.5.7
TGI latest version

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

/data22/text-generation-inference# text-generation-launcher --model-id "/data2/ollama7b"
2024-04-13T07:14:56.303639Z INFO text_generation_launcher: Args { model_id: "/data2/ollama7b", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-04-13T07:14:56.303698Z INFO text_generation_launcher: Model supports up to 16384 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=16434 --max-total-tokens=16384 --max-input-tokens=16383.
2024-04-13T07:14:56.303704Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-04-13T07:14:56.303706Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-04-13T07:14:56.303714Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-04-13T07:14:56.303716Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-13T07:14:56.303784Z INFO download: text_generation_launcher: Starting download process.
2024-04-13T07:15:00.157607Z INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-13T07:15:00.909166Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-13T07:15:00.909409Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-13T07:15:04.986503Z INFO text_generation_launcher: Discovered apex.normalization.FusedRMSNorm - will use it instead of T5LayerNorm

2024-04-13T07:15:05.333503Z WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'

2024-04-13T07:15:09.599782Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2024-04-13T07:15:09.624815Z INFO shard-manager: text_generation_launcher: Shard ready in 8.714489962s rank=0
2024-04-13T07:15:09.724775Z INFO text_generation_launcher: Starting Webserver
2024-04-13T07:15:09.803663Z INFO text_generation_router: router/src/main.rs:250: Using config Some(Llama)
2024-04-13T07:15:09.803700Z INFO text_generation_router: router/src/main.rs:257: Using local tokenizer config
2024-04-13T07:15:09.803720Z WARN text_generation_router: router/src/main.rs:292: no pipeline tag found for model /data2/ollama7b
2024-04-13T07:15:09.806984Z INFO text_generation_router: router/src/main.rs:311: Warming up model
2024-04-13T07:15:10.724133Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/aml/conda/bin/text-generation-server", line 8, in
sys.exit(app())
File "/aml/conda/lib/python3.10/site-packages/typer/main.py", line 311, in call
return get_command(self)(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/aml/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/aml/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/data22/text-generation-inference/server/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/data22/text-generation-inference/server/text_generation_server/server.py", line 240, in serve
asyncio.run(
File "/aml/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/aml/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/aml/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/aml/conda/lib/python3.10/asyncio/base_events.py", line 1906, in _run_once
handle._run()
File "/aml/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/aml/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(

File "/data22/text-generation-inference/server/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/aml/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/aml/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/data22/text-generation-inference/server/text_generation_server/server.py", line 98, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 768, in warmup
_, batch, _ = self.generate_token(batch)
File "/aml/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
raise e
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 942, in generate_token
out, speculative_logits = self.forward(batch)
File "/data22/text-generation-inference/server/text_generation_server/models/flash_causal_lm.py", line 885, in forward
return self.model.forward(
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 443, in forward
hidden_states = self.model(
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in forward
hidden_states, residual = layer(
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 316, in forward
attn_output = self.self_attn(
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aml/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data22/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 209, in forward
paged_attention.reshape_and_cache(
File "/data22/text-generation-inference/server/text_generation_server/utils/paged_attention.py", line 16, in reshape_and_cache
cache_ops.reshape_and_cache(key, value, key_cache, value_cache, slots, "auto", 1.0)
TypeError: reshape_and_cache(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: str) -> None

Invoked with: tensor([[[ 5.5566e-01, -6.1572e-01, 3.7451e-01, ..., -7.2632e-02,
2.7759e-01, -4.3579e-02],
[ 4.7949e-01, 5.9180e-01, 4.1479e-01, ..., 1.1094e+00,
-8.7061e-01, 1.0000e+00],
[ 1.7126e-01, 2.3108e-01, -4.4702e-01, ..., 3.7402e-01,
1.2861e+00, 1.4746e+00],
...,
[ 9.4910e-03, -4.8120e-01, -1.4624e-01, ..., -1.4050e-01,
9.4482e-01, -7.1924e-01],
[-6.9385e-01, 6.6895e-02, -5.3174e-01, ..., -1.0615e+00,
8.1689e-01, 8.3154e-01],
[-4.4165e-01, 1.0117e+00, 5.8545e-01, ..., 4.8828e-01,
-1.6199e-01, -6.0059e-02]],

    [[ 1.3464e-01, -3.6694e-01,  1.2764e-02,  ...,  2.7710e-01,
      -1.1072e-01,  3.4131e-01],
     [-6.1401e-02, -1.0109e-02,  1.2268e-01,  ...,  2.2705e-01,
       3.6041e-02,  1.6736e-01],
     [-2.3181e-01, -3.1079e-01, -2.7612e-01,  ..., -8.2617e-01,
      -8.6426e-01, -8.3154e-01],

And the launcher still hangs at this point:

[screenshot]

Expected behavior

Just to use it for inference.
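
For context, the TypeError is an argument-count mismatch between TGI's wrapper and the compiled kernel it calls: paged_attention.py passes seven arguments (ending with a kv_scale of 1.0), while the installed cache_ops.reshape_and_cache binding only accepts six, ending with the cache dtype string. A minimal sketch of that mismatch, using hypothetical pure-Python stand-ins whose argument lists mirror the log (the real functions are compiled CUDA kernels, so this is only illustrative):

import torch

# Stand-in for the binding the *installed* vllm exposes: six positional
# arguments, the last being the kv cache dtype string (this mirrors the
# "supported" signature printed in the TypeError).
def reshape_and_cache_installed(key, value, key_cache, value_cache, slots, kv_cache_dtype: str) -> None:
    pass  # the real kernel scatters key/value into the paged KV cache

# Stand-in for the signature TGI's pinned vllm build expects: the same six
# arguments plus a trailing kv_scale float.
def reshape_and_cache_expected(key, value, key_cache, value_cache, slots, kv_cache_dtype: str, kv_scale: float) -> None:
    pass

key = value = key_cache = value_cache = torch.zeros(1)
slots = torch.zeros(1, dtype=torch.int64)

# This is the call shape used in text_generation_server/utils/paged_attention.py:
reshape_and_cache_expected(key, value, key_cache, value_cache, slots, "auto", 1.0)  # fine with a matching build

# Against the older six-argument binding, the same call raises a TypeError,
# which is exactly the failure in the warmup traceback above.
try:
    reshape_and_cache_installed(key, value, key_cache, value_cache, slots, "auto", 1.0)
except TypeError as err:
    print(f"TypeError: {err}")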

@OlivierDehaene
Member

You need to re-install vllm and flash-attention-v2

cd text-generation-inference/server
rm -rf vllm
make install-vllm-cuda

rm -rf flash-attention-v2
make install-flash-attention-v2-cuda

Sorry we forgot to add this to the release notes. Since we mainly ship a container we forget about local installs.
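
After rebuilding, a quick sanity check is to print the compiled binding's signature from its pybind11 docstring; it should now list the extra trailing kv_scale float that the TGI wrapper passes. A minimal sketch, assuming the rebuilt kernels are importable as vllm._C.cache_ops (this import path is an assumption; check the import at the top of server/text_generation_server/utils/paged_attention.py if your build exposes them differently):

# Sanity check after `make install-vllm-cuda`. The import path below is an
# assumption; use whatever module paged_attention.py imports cache_ops from.
from vllm._C import cache_ops

# pybind11 embeds the compiled signature in the docstring; after the rebuild
# it should accept seven arguments, ending with a float kv_scale.
print(cache_ops.reshape_and_cache.__doc__)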

@Narsil
Collaborator

Narsil commented Apr 15, 2024

You need to install our version of vllm (cd server && make install-vllm) as we optimized the kernels for our codebase.

That's why we recommend using the Docker image: it makes it much easier to keep the dependencies in sync. (We use the CLI a lot to develop things; there's just no easy way to guarantee users a clean environment on an arbitrary machine or pre-existing setup.)

@hellangleZ
Author

You need to re-install vllm and flash-attention-v2

cd text-generation-inference/server
rm -rf vllm
make install-vllm-cuda

rm -rf flash-attention-v2
make install-flash-attention-v2-cuda

Sorry we forgot to add this to the release notes. Since we mainly ship a container we forget about local installs.

Thank you, it works for me.
