
fix: VLLM_SKIP_PROFILE_RUN patch for Lunar Lake iGPU profile_run() hang#340

Open
MegaStood wants to merge 1 commit into intel:main from MegaStood:claude/upstream-skip-profile-patch-CB5w6

Conversation

@MegaStood

Bug

vLLM's XPU worker calls profile_run() during startup — a dummy forward pass
to measure peak GPU memory for KV cache sizing. On Lunar Lake Xe2 iGPU (Arc 140V),
this hangs indefinitely for MoE models (gpt-oss-20b, GLM-4.7-flash), blocking
server startup entirely.

Related upstream issue: vllm-project/vllm#30359

Fix

Adds a vllm_xpu_worker_skip_profile.patch for vllm/v1/worker/xpu_worker.py that
introduces VLLM_SKIP_PROFILE_RUN=1 environment variable support:

  • Skips profile_run() entirely when set
  • Estimates peak memory as memory_allocated() × 1.2 (conservative)
  • Prints memory profiling analysis for debugging

Also updates lunar_lake_serve.sh to set the env var automatically.
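The serve-script change amounts to exporting the variable before launching the server. A minimal sketch (the actual model name and launch flags in lunar_lake_serve.sh may differ):

```shell
# Enable the profile_run() skip before starting vLLM, as
# lunar_lake_serve.sh now does automatically.
export VLLM_SKIP_PROFILE_RUN=1

# Illustrative launch command; real flags depend on the script.
# vllm serve gpt-oss-20b --device xpu

echo "VLLM_SKIP_PROFILE_RUN=${VLLM_SKIP_PROFILE_RUN}"
```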

Impact

  • Without patch: Server hangs at startup for MoE models on iGPU — unusable
  • With patch: the KV cache is sized from the conservative 1.2× estimate
    (slightly less cache than optimal), but the server starts and runs correctly
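The skip path and its 1.2× estimate can be sketched as below. This is a hypothetical illustration, not the patch itself: the function name, parameters, and print format are illustrative and do not match vLLM's actual `_determine_available_memory_default()` signature; only the `VLLM_SKIP_PROFILE_RUN` variable and the `memory_allocated() * 1.2` rule come from the PR.

```python
import os


def determine_available_memory(total_gpu_memory: int,
                               allocated_bytes: int,
                               gpu_memory_utilization: float,
                               run_profile) -> int:
    """Return the bytes left for the KV cache (illustrative sketch).

    allocated_bytes stands in for torch.xpu.memory_allocated() after
    model load; run_profile stands in for the dummy forward pass.
    """
    if os.environ.get("VLLM_SKIP_PROFILE_RUN") == "1":
        # Skip the dummy forward pass that hangs on Lunar Lake iGPUs;
        # assume peak usage is 1.2x what the loaded weights occupy.
        peak = int(allocated_bytes * 1.2)
        print(f"[skip-profile] allocated={allocated_bytes} est_peak={peak}")
    else:
        # Normal path: measure the real peak via a dummy forward pass.
        peak = run_profile()
    budget = int(total_gpu_memory * gpu_memory_utilization)
    return max(budget - peak, 0)
```

With the skip enabled, any overestimate in the 1.2× factor comes straight out of the KV cache budget, which is why the resulting cache is slightly smaller than a measured profile would allow.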

Tested On

  • Device: MSI Claw 8 AI+ (Core Ultra 7 258V, Arc 140V, 32GB LPDDR5x)
  • Models verified: gpt-oss-20b (MXFP4), Qwen3.5-4B (INT4), Qwen3-8B (INT4)
  • vLLM version: 0.14.0 with XPU backend
  • Dense models (Qwen3.5-4B, Qwen3-8B) also work with the patch — the 1.2× estimate
    matches actual peak closely

@MegaStood MegaStood closed this Apr 2, 2026
@MegaStood MegaStood force-pushed the claude/upstream-skip-profile-patch-CB5w6 branch from 16aa9cc to e874953 on April 2, 2026 11:54
…Lake iGPU

vLLM's XPU worker runs a dummy forward pass (profile_run()) during startup
to measure peak GPU memory for KV cache sizing. On Lunar Lake's Xe2 iGPU,
this forward pass hangs indefinitely for MoE models (gpt-oss-20b, GLM-4.7).

This patch adds VLLM_SKIP_PROFILE_RUN=1 environment variable support to
_determine_available_memory_default() in xpu_worker.py. When set:
- Skips profile_run() entirely
- Estimates peak memory as memory_allocated() * 1.2
- Prints memory profiling analysis for debugging

Tested on: MSI Claw 8 AI+ (Core Ultra 7 258V, Arc 140V, 32GB LPDDR5x)
Models verified: gpt-oss-20b (MXFP4), Qwen3.5-4B, Qwen3-8B

Related: vllm-project/vllm#30359
@MegaStood
Author

Please check and review.

