Description
I used ZE_AFFINITY_MASK=0 with tp=1 to run multiple vLLM instances on a single card of a B60. The instances serve the same model on different ports, and the problem is that the model ends up offloaded to CPU/system memory instead of failing with an OOM error. Here is the launch script:
export ZE_AFFINITY_MASK=0
export TORCH_LLM_ALLREDUCE=1
export VLLM_USE_V1=1
export CCL_ZE_IPC_EXCHANGE=pidfd
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
python3 -m vllm.entrypoints.openai.api_server \
  --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name DeepSeek-R1-Distill-Qwen-7B \
  --dtype=float16 \
  --enforce-eager \
  --port 8000 \
  --host 0.0.0.0 \
  --trust-remote-code \
  --disable-sliding-window \
  --gpu-memory-utilization=0.9 \
  --no-enable-prefix-caching \
  --max-num-batched-tokens=8192 \
  --disable-log-requests \
  --max-model-len=8192 \
  --block-size 64 \
  --tensor-parallel-size 1 \
  --reasoning-parser deepseek_r1
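
For reference, the additional instances on the same card are launched the same way; only the port changes. A minimal sketch of the second instance (port 8001 is illustrative, and the omitted flags match the first instance):

# Second instance: same environment, same model, same card (ZE_AFFINITY_MASK=0).
# Only --port differs; 8001 is illustrative. Remaining flags as above.
python3 -m vllm.entrypoints.openai.api_server \
  --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name DeepSeek-R1-Distill-Qwen-7B \
  --dtype=float16 \
  --gpu-memory-utilization=0.9 \
  --max-model-len=8192 \
  --tensor-parallel-size 1 \
  --port 8001 &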
The trend of system memory usage: [screenshot]
The trend of GPU memory (VRAM) usage: [screenshot]
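
The host-memory trend can be reproduced by sampling `free` while the instances serve requests; a minimal sketch (any periodic sampler works):

# Print used host memory once per second while the instances are serving.
while sleep 1; do
  echo "$(date +%T) $(free -m | awk '/^Mem:/ {print $3}') MiB used"
done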
The model contains about 8 billion parameters, so its weights alone should occupy roughly 16 GB (about 15 GiB) of GPU memory in FP16.
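
As a quick sanity check on that estimate (FP16 stores 2 bytes per parameter):

# Weights-only footprint: 8e9 parameters * 2 bytes (FP16).
awk 'BEGIN { printf "%.1f GiB\n", 8e9 * 2 / 2^30 }'   # prints 14.9 GiB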
Could you help me figure out this problem? It is blocking a customer enablement. Thanks!
Please contact me by email or Teams if any further details are needed. Email: lucas.cai@intel.com