Description
I used ZE_AFFINITY_MASK=0 with tp=1 to run multiple vLLM instances on a single card of a B60. The instances serve the same model on different ports, and the problem is that the model ends up offloaded to CPU/system memory instead of failing with an OOM error. Here is the launch script:
export ZE_AFFINITY_MASK=0
export TORCH_LLM_ALLREDUCE=1
export VLLM_USE_V1=1
export CCL_ZE_IPC_EXCHANGE=pidfd
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
python3 -m vllm.entrypoints.openai.api_server \
  --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name DeepSeek-R1-Distill-Qwen-7B \
  --dtype=float16 \
  --enforce-eager \
  --port 8000 \
  --host 0.0.0.0 \
  --trust-remote-code \
  --disable-sliding-window \
  --gpu-memory-utilization=0.9 \
  --no-enable-prefix-caching \
  --max-num-batched-tokens=8192 \
  --disable-log-requests \
  --max-model-len=8192 \
  --block-size 64 \
  --tensor-parallel-size 1 \
  --reasoning-parser deepseek_r1
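
For reference, the additional instances on the same card are launched the same way; only the port changes. A minimal sketch of the second instance (port 8001 is illustrative, and the omitted flags match the first instance):

# Second instance: same environment, same model, same card (ZE_AFFINITY_MASK=0).
# Only --port differs; 8001 is illustrative. Remaining flags as above.
python3 -m vllm.entrypoints.openai.api_server \
  --model /llm/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name DeepSeek-R1-Distill-Qwen-7B \
  --dtype=float16 \
  --gpu-memory-utilization=0.9 \
  --max-model-len=8192 \
  --tensor-parallel-size 1 \
  --port 8001 &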
The trend of system memory usage: [screenshot]
The trend of GPU memory (VRAM) usage: [screenshot]
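
The host-memory trend can be reproduced by sampling `free` while the instances serve requests; a minimal sketch (any periodic sampler works):

# Print used host memory once per second while the instances are serving.
while sleep 1; do
  echo "$(date +%T) $(free -m | awk '/^Mem:/ {print $3}') MiB used"
done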
The model contains about 8 billion parameters, so its weights alone should occupy roughly 16 GB (about 15 GiB) of GPU memory in FP16.
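
As a quick sanity check on that estimate (FP16 stores 2 bytes per parameter):

# Weights-only footprint: 8e9 parameters * 2 bytes (FP16).
awk 'BEGIN { printf "%.1f GiB\n", 8e9 * 2 / 2^30 }'   # prints 14.9 GiB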
Could you help me figure out this problem? It is blocking a customer enablement. Thanks!
Please contact me by email or Teams if any further details are needed. Email: lucas.cai@intel.com