
[Bug] vLLM-service runtime issue for ChatQnA #1119

@louie-tsai

Priority

P3-Medium

OS type

Ubuntu

Hardware type

Gaudi2

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source

Deploy method

  • Docker compose
  • Docker
  • Kubernetes
  • Helm

Running nodes

Single Node

What's the version?

the latest version

Description

Testing the vllm-gaudi-service fails with the error below:
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 394, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 141, in from_engine_args
    return cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/llm_engine.py", line 351, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/llm_engine.py", line 486, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/executor/hpu_executor.py", line 84, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_worker.py", line 180, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 1451, in profile_run
    self.warmup_scenario(max_batch_size, max_seq_len, True, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 1523, in warmup_scenario
    self.execute_model(inputs, kv_caches, warmup_mode=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 2134, in execute_model
    hidden_states = self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 716, in forward
    return wrapped_hpugraph_forward(
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 570, in wrapped_hpugraph_forward
    return orig_fwd(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 387, in forward
    hidden_states = self.model(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1523, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 566, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 352, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 261, in forward
    hidden_states = self.self_attn(positions=positions,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 191, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/attention/layer.py", line 100, in forward
    return self.impl.forward(query,
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/attention/backends/hpu_attn.py", line 208, in forward
    out = ops.prompt_attention(
  File "/usr/local/lib/python3.10/dist-packages/vllm_hpu_extension/ops.py", line 226, in prompt_attention
    attn_weights = FusedSDPA.apply(query, key, value, None, 0.0, True,
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
TypeError: FusedSDPA.forward() takes from 4 to 9 positional arguments but 12 were given

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
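The failing frame is vllm_hpu_extension/ops.py calling FusedSDPA.apply() with 12 positional arguments, while the FusedSDPA.forward() in the installed Gaudi PyTorch stack only accepts 4 to 9, which suggests the vllm-hpu-extension code and the habana_frameworks release in the image are out of sync. A minimal stand-alone sketch of that mechanism is below; the stand-in class and the extra placeholder arguments are illustrative only, not the real FusedSDPA kernel or the actual values vllm_hpu_extension passes.

# Hypothetical reproduction of the signature mismatch seen in the log.
# FakeFusedSDPA mimics an "older" FusedSDPA whose forward() accepts at most
# 9 positional arguments (including ctx); the caller passes 11 user args,
# so apply() hands forward() 12 positionals and Python raises the same
# TypeError as in the traceback.
import torch


class FakeFusedSDPA(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v, attn_mask=None, dropout_p=0.0,
                is_causal=False, scale=None, softmax_mode="None"):
        # 4 required + 5 optional positionals -> "takes from 4 to 9".
        return torch.nn.functional.scaled_dot_product_attention(
            q, k, v, attn_mask=attn_mask, dropout_p=dropout_p,
            is_causal=is_causal)


q = k = v = torch.randn(1, 2, 4, 8)

# A newer caller passing extra positional options the installed kernel does
# not know about (the trailing values here are placeholders):
FakeFusedSDPA.apply(q, k, v, None, 0.0, True, None, "None", "fast", False, None)
# -> TypeError: forward() takes from 4 to 9 positional arguments but 12 were given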

Reproduce steps

Follow the README to validate the vLLM service:
https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/docker_compose/intel/hpu/gaudi/README.md#validate-microservices-and-megaservice
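
For reference, the validation step in that README amounts to sending an OpenAI-compatible chat completion request to the vLLM service, which is what fails while the engine process cannot start. A rough Python equivalent is sketched below; the port 8007, the endpoint path, and the model name are assumptions, so take the real values from your ChatQnA compose file and set_env.sh.

# Hypothetical sketch of the vLLM service check, using Python requests
# instead of the README's curl command. Port and model are assumptions.
import os
import requests

host_ip = os.environ.get("host_ip", "localhost")
url = f"http://{host_ip}:8007/v1/chat/completions"  # vLLM OpenAI-compatible API

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # assumed LLM_MODEL_ID
    "messages": [{"role": "user", "content": "What is Deep Learning?"}],
    "max_tokens": 32,
}

resp = requests.post(url, json=payload, timeout=120)
print(resp.status_code)
print(resp.text)  # no valid response while the engine process fails to start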

Raw log

No response
