fix(arc): use whole /dev/dri pass-through for device-drift resilience by growlf · Pull Request #38 · growlf/ai-stack

growlf · 2026-05-19T08:15:48Z

Summary

Replaces the single-device mount pattern (`${GPU_CARD:-/dev/dri/card1}` plus `GPU_RENDER`) with a whole-directory mount, `/dev/dri:/dev/dri`
Defensive change: guards against per-boot card-number drift and against future installs that lack the `.env` override
Perf impact measured on Phoenix as neutral: 13.75 tok/s (3-run mean) vs. 14.45 tok/s baseline, within the noise band
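For orientation, here is a minimal sketch of what the change looks like in a compose `devices:` block. The service name `ollama` is taken from the `docker logs ollama` probe below; the rest of the ai-stack compose file is not reproduced in this PR, so the layout and the "before" lines are approximations, and the `GPU_RENDER` default is not stated here.

```yaml
# Hypothetical sketch; not the literal ai-stack compose file.
services:
  ollama:
    devices:
      # Before (approximate): a single card node plus a single render node,
      # each chosen via .env. The card default was /dev/dri/card1; the
      # GPU_RENDER default is not shown in this PR.
      # - ${GPU_CARD:-/dev/dri/card1}
      # - ${GPU_RENDER}
      #
      # After: pass the whole directory through, so the container sees every
      # device node under /dev/dri regardless of which cardN the GPU gets.
      - /dev/dri:/dev/dri
```

Passing the directory rather than one named node is what makes the mount independent of card numbering: there is no `cardN` left to get wrong when the `.env` override is missing.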

Context

NetYeti fixed his standalone Ollama on bms-ai-cluster (a separate compose file) by switching to a whole-directory mount and adding Stage-2 SYCL env-var overrides. Pod studied Solution_files (10 docs) end-to-end and ran an empirical apply-test on ai-stack.

Finding: NetYeti's full Stage-2 fix does NOT transfer to ai-stack's current image build. A Loom 8th-class probe (reading the binary's self-report in `docker logs ollama`) shows that ai-stack's `Build with Macros:` line lists only `FORCE_MMQ: no` and `F16: no`; unlike bms-ai-cluster's older image, this build was NOT compiled with `DISABLE_OPT: yes`. The Stage-2 env-var overrides are therefore null-effect on this build, and `OLLAMA_NUM_CTX=8192` actively hurts (-17% measured, attributed to KV-cache bloat).

This PR carries over only the part that is safe and defensive across image builds: the whole-directory mount.

Test plan

Back up the current compose file
Apply the change and force-recreate the container
5-probe verification (mount, sycl-ls, clpeak, load-log offload, perf-stat): all green
3-run tok/s measurement on qwen2.5:7b: 13.68, 13.75, 13.81 (mean 13.75)
Baseline reference: 14.45 tok/s
Delta: -0.7 tok/s (~5%), within the slight-noise band; the defensive change is perf-neutral
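The mean and delta above follow directly from the three runs and the baseline; this is plain arithmetic on the reported numbers, with no additional measurements:

```math
\begin{aligned}
\bar{x} &= \frac{13.68 + 13.75 + 13.81}{3} = \frac{41.24}{3} \approx 13.75\ \text{tok/s} \\
\Delta &= 13.75 - 14.45 = -0.70\ \text{tok/s} \\
\frac{\Delta}{14.45} &\approx -4.8\%
\end{aligned}
```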

What this PR does NOT include

Stage-2 SYCL env-var overrides (GGML_SYCL_DISABLE_OPT=0, DISABLE_GRAPH=0, PRIORITIZE_DMMV=1, NUM_CTX=8192, etc.): empirically verified as null-effect or harmful on ai-stack's current image build (8th-class probe plus apply-test)
NetYeti's bms-ai-cluster fix pattern applies to older ipex-llm builds compiled with DISABLE_OPT=yes; it does not transfer to ai-stack's current ava-agentone:latest build

Co-Authored-By

Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the single-device pattern (${GPU_CARD:-/dev/dri/card1} + GPU_RENDER) with a whole-directory mount, /dev/dri:/dev/dri.

Why:
- The original pattern hardcoded card1 as the default, which was only safe with a .env override setting GPU_CARD=/dev/dri/card0. Future installs missing the .env override would hit the bms-ai-cluster-class device-mapping bug (msg ca3d45b4 + Solution_files validation 2026-05-18).
- The whole-/dev/dri mount exposes all card* and renderD* nodes plus the by-path/ symlinks, is resilient to per-boot card-number drift, and works on any host regardless of which card-N is the GPU.
- Defensive change: it doesn't affect the ai-stack-on-Phoenix deployment (card0 was already mounted via the .env override). Empirical measurement on Phoenix shows a 13.75 tok/s 3-run mean vs. a 14.45 baseline (within the slight-noise band), confirming neutral perf impact plus the defensive benefit.

What this fix does NOT include:
- Stage-2 SYCL env-var overrides (GGML_SYCL_DISABLE_OPT=0, etc.): an 8th-class binary-self-report probe verified that ai-stack's current ava-agentone:latest image already has those opts enabled at build time (no 'DISABLE_OPT: yes' in Build with Macros). Stage-2 overrides are null-effect on this image build; OLLAMA_NUM_CTX=8192 actively hurts (-17% measured, attributed to KV-cache bloat).
- The bms-ai-cluster 'Solution_files' Stage-2 boost applies to OLDER ipex-llm builds with DISABLE_OPT=yes at compile time. It doesn't transfer to ai-stack's current image build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

growlf merged commit aa31d17 into main May 19, 2026
5 checks passed

This was referenced May 19, 2026:

chore(env): remove stale GPU_CARD/GPU_RENDER defaults from .env.example #40 (merged)
feat(intel-igpu): detector + installer routing for older Intel iGPUs #41 (merged)

growlf merged 1 commit into main from fix/arc-compose-whole-dir-mount-2026-05-18
