
[Bug] vLLM-service runtime issue for ChatQnA #1119

@louie-tsai

Priority

P3-Medium

OS type

Ubuntu

Hardware type

Gaudi2

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source

Deploy method

  • Docker compose
  • Docker
  • Kubernetes
  • Helm

Running nodes

Single Node

What's the version?

the latest version

Description

Testing the vllm-gaudi-service fails with the error below:
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 394, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 141, in from_engine_args
    return cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/llm_engine.py", line 351, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/engine/llm_engine.py", line 486, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/executor/hpu_executor.py", line 84, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_worker.py", line 180, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 1451, in profile_run
    self.warmup_scenario(max_batch_size, max_seq_len, True, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 1523, in warmup_scenario
    self.execute_model(inputs, kv_caches, warmup_mode=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 2134, in execute_model
    hidden_states = self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 716, in forward
    return wrapped_hpugraph_forward(
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 570, in wrapped_hpugraph_forward
    return orig_fwd(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/worker/hpu_model_runner.py", line 387, in forward
    hidden_states = self.model(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1523, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 566, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 352, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 261, in forward
    hidden_states = self.self_attn(positions=positions,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 191, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/attention/layer.py", line 100, in forward
    return self.impl.forward(query,
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/attention/backends/hpu_attn.py", line 208, in forward
    out = ops.prompt_attention(
  File "/usr/local/lib/python3.10/dist-packages/vllm_hpu_extension/ops.py", line 226, in prompt_attention
    attn_weights = FusedSDPA.apply(query, key, value, None, 0.0, True,
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
TypeError: FusedSDPA.forward() takes from 4 to 9 positional arguments but 12 were given

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.3.dev588+g1033c3eb.gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
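The failing frame is vllm_hpu_extension/ops.py calling FusedSDPA.apply() with 12 positional arguments, while the FusedSDPA.forward() in the installed Gaudi PyTorch stack only accepts 4 to 9, which suggests the vllm-hpu-extension code and the habana_frameworks release in the image are out of sync. A minimal stand-alone sketch of that mechanism is below; the stand-in class and the extra placeholder arguments are illustrative only, not the real FusedSDPA kernel or the actual values vllm_hpu_extension passes.

# Hypothetical reproduction of the signature mismatch seen in the log.
# FakeFusedSDPA mimics an "older" FusedSDPA whose forward() accepts at most
# 9 positional arguments (including ctx); the caller passes 11 user args,
# so apply() hands forward() 12 positionals and Python raises the same
# TypeError as in the traceback.
import torch


class FakeFusedSDPA(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v, attn_mask=None, dropout_p=0.0,
                is_causal=False, scale=None, softmax_mode="None"):
        # 4 required + 5 optional positionals -> "takes from 4 to 9".
        return torch.nn.functional.scaled_dot_product_attention(
            q, k, v, attn_mask=attn_mask, dropout_p=dropout_p,
            is_causal=is_causal)


q = k = v = torch.randn(1, 2, 4, 8)

# A newer caller passing extra positional options the installed kernel does
# not know about (the trailing values here are placeholders):
FakeFusedSDPA.apply(q, k, v, None, 0.0, True, None, "None", "fast", False, None)
# -> TypeError: forward() takes from 4 to 9 positional arguments but 12 were given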

Reproduce steps

Follow the README to validate the vLLM service:
https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/docker_compose/intel/hpu/gaudi/README.md#validate-microservices-and-megaservice
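
For reference, the validation step in that README amounts to sending an OpenAI-compatible chat completion request to the vLLM service, which is what fails while the engine process cannot start. A rough Python equivalent is sketched below; the port 8007, the endpoint path, and the model name are assumptions, so take the real values from your ChatQnA compose file and set_env.sh.

# Hypothetical sketch of the vLLM service check, using Python requests
# instead of the README's curl command. Port and model are assumptions.
import os
import requests

host_ip = os.environ.get("host_ip", "localhost")
url = f"http://{host_ip}:8007/v1/chat/completions"  # vLLM OpenAI-compatible API

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # assumed LLM_MODEL_ID
    "messages": [{"role": "user", "content": "What is Deep Learning?"}],
    "max_tokens": 32,
}

resp = requests.post(url, json=payload, timeout=120)
print(resp.status_code)
print(resp.text)  # no valid response while the engine process fails to start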

Raw log

No response
