
B60 single card load multi-instance offloading on cpu #859

@Lucas-cai

Description


I used ZE_AFFINITY_MASK=0 and tp=1 to run multiple vLLM instances on a single B60 card. The instances serve the same model on different ports. The problem is that the model gets offloaded to CPU/host memory rather than failing with an OOM error. Here is the launch script:

export ZE_AFFINITY_MASK=0
export TORCH_LLM_ALLREDUCE=1
export VLLM_USE_V1=1
export CCL_ZE_IPC_EXCHANGE=pidfd
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
python3 -m vllm.entrypoints.openai.api_server \
    --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
    --served-model-name DeepSeek-R1-Distill-Qwen-7B \
    --dtype=float16 \
    --enforce-eager \
    --port 8000 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --disable-sliding-window \
    --gpu-memory-util=0.9 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens=8192 \
    --disable-log-requests \
    --max-model-len=8192 \
    --block-size 64 \
    --tensor-parallel-size 1 \
    --reasoning-parser deepseek_r1
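For reference, the second instance is started with the same command and only a different port. A minimal sketch of the multi-instance setup, assuming the instances differ only in --port (port 8001 for the second instance is an example value, not taken from the report):

for PORT in 8000 8001; do          # 8001 is an assumed port for the second instance
    python3 -m vllm.entrypoints.openai.api_server \
        --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
        --served-model-name DeepSeek-R1-Distill-Qwen-7B \
        --dtype=float16 \
        --enforce-eager \
        --port "$PORT" \
        --host 0.0.0.0 \
        --trust-remote-code \
        --gpu-memory-util=0.9 \
        --max-model-len=8192 \
        --block-size 64 \
        --tensor-parallel-size 1 \
        --reasoning-parser deepseek_r1 &
done
wait

Both instances see the same physical card because ZE_AFFINITY_MASK=0 is exported once in the shell that launches them.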

The trend of host memory usage:

[Image: host memory usage over time]

The trend of GPU (video) memory usage:

[Image: GPU memory usage over time]

The model contains about 8 billion parameters, so its weights should occupy roughly 16 GiB of GPU memory in FP16.
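A quick sanity check of that estimate (a sketch; the ~8 billion parameter count is taken from the sentence above, and the 24 GB capacity is my assumption about the B60 card):

python3 - <<'EOF'
params = 8e9                       # parameter count stated above
weights_gib = params * 2 / 2**30   # 2 bytes per FP16 parameter
card_gib = 24                      # assumption: 24 GB B60 variant
print(f"one instance, weights only : {weights_gib:.1f} GiB")
print(f"two instances, weights only: {2 * weights_gib:.1f} GiB (vs ~{card_gib} GiB on the card)")
EOF

With --gpu-memory-util=0.9 each instance also reserves KV-cache space, so two instances cannot both fit in device memory; the open question is why the runtime falls back to CPU/host offloading instead of raising an OOM error.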

Could you help me figure out this problem? It is needed for customer enablement. Thanks!

Please contact me by email or Teams if any further details are needed. Email: lucas.cai@intel.com
