
fix(arc): use whole /dev/dri pass-through for device-drift resilience #38

Merged
growlf merged 1 commit into main from fix/arc-compose-whole-dir-mount-2026-05-18 on May 19, 2026

@growlf (Owner) commented May 19, 2026

Summary

  • Replaces single-device mount pattern (`${GPU_CARD:-/dev/dri/card1}` + GPU_RENDER) with whole-directory mount `/dev/dri:/dev/dri`
  • Defends against per-boot card-number drift and against future installs that lack the `.env` override
  • Empirically verified neutral perf impact on Phoenix (13.75 tok/s 3-run mean vs 14.45 baseline, within noise band)
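
To see why a hardcoded `card1` default is fragile, it can help to look at what the host actually exposes under `/dev/dri` on a given boot. The following is a minimal sketch, not part of this PR: the helper names `list_dri_nodes` and `by_path_targets` are hypothetical, and it assumes a Linux host where `/dev/dri/by-path/` holds symlinks to the `card*` and `renderD*` nodes.

```python
from pathlib import Path


def list_dri_nodes(dri_dir="/dev/dri"):
    """Return the sorted names of card* and renderD* nodes under dri_dir."""
    root = Path(dri_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir()
        if p.name.startswith(("card", "renderD"))
    )


def by_path_targets(dri_dir="/dev/dri"):
    """Map each by-path symlink name to the node name it resolves to."""
    by_path = Path(dri_dir) / "by-path"
    if not by_path.is_dir():
        return {}
    return {
        link.name: link.resolve().name
        for link in sorted(by_path.iterdir())
        if link.is_symlink()
    }
```

Calling `list_dri_nodes()` and `by_path_targets()` on the host and again inside the container should give the same answer once the whole directory is passed through, whichever card number the GPU received on that boot.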

Context

NetYeti fixed his standalone bms-ai-cluster Ollama (a separate compose file) by switching to the whole-dir mount and adding Stage-2 SYCL env-var overrides. Pod studied Solution_files (10 docs) end to end and ran an empirical apply-test on ai-stack.

Finding: NetYeti's full Stage-2 fix does NOT transfer to ai-stack's current image build. The Loom 8th-class probe (binary-self-report-read on `docker logs ollama`) shows that ai-stack's `Build with Macros:` output reports only `FORCE_MMQ: no` and `F16: no`; it does NOT have `DISABLE_OPT: yes` at compile time (unlike bms-ai-cluster's older image). The Stage-2 env-var overrides are therefore null-effect on this build, and OLLAMA_NUM_CTX=8192 actively hurts (-17%, from KV cache bloat).

This PR captures only the part that's safe + defensive across image builds: the whole-dir mount.
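
The build-flag check described above can be approximated with a plain substring search over the container logs. This is a sketch under assumptions, not the probe that was actually run: the function names are hypothetical, the container is assumed to be named `ollama` as in the `docker logs ollama` command above, and the exact layout of the `Build with Macros:` output is not known here, so the code only looks for the literal text `DISABLE_OPT: yes` on any line.

```python
import subprocess


def reports_disable_opt(log_text: str) -> bool:
    """True if any log line contains the literal text 'DISABLE_OPT: yes'."""
    return any("DISABLE_OPT: yes" in line for line in log_text.splitlines())


def container_logs(name: str = "ollama") -> str:
    """Return a container's combined stdout and stderr via `docker logs`."""
    result = subprocess.run(
        ["docker", "logs", name],
        capture_output=True,
        text=True,
        check=False,  # leave error handling to the caller
    )
    return result.stdout + result.stderr
```

On a host with Docker available, `reports_disable_opt(container_logs())` returning `False` would match the finding for ai-stack's current image.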

Test plan

  • Backup current compose
  • Apply change + force-recreate container
  • 5-probe verification (mount, sycl-ls, clpeak, load-log offload, perf-stat) all green
  • 3-run tok/s measurement on qwen2.5:7b: 13.68, 13.75, 13.81 (mean 13.75)
  • Baseline reference: 14.45 tok/s
  • Delta: -0.70 tok/s (about -5%), within the slight-noise band, so the defensive change is perf-neutral
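
For reference, the mean and delta above can be reproduced from the three runs and the baseline. A small sketch using only the numbers listed in the test plan (variable names are illustrative):

```python
from statistics import mean

# Per-run throughput from the test plan (tok/s on qwen2.5:7b).
runs = [13.68, 13.75, 13.81]
baseline = 14.45  # baseline reference from the test plan

run_mean = mean(runs)
delta = run_mean - baseline
delta_pct = 100 * delta / baseline

print(f"mean={run_mean:.2f} tok/s, delta={delta:+.2f} tok/s ({delta_pct:+.1f}%)")
# prints: mean=13.75 tok/s, delta=-0.70 tok/s (-4.9%)
```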

What this PR does NOT include

  • Stage-2 SYCL env-var overrides (GGML_SYCL_DISABLE_OPT=0, DISABLE_GRAPH=0, PRIORITIZE_DMMV=1, NUM_CTX=8192, etc.) — verified empirically as null-effect or harmful on ai-stack's current image build via 8th-class probe + apply-test
  • NetYeti's bms-ai-cluster fix-pattern applies to older ipex-llm builds with DISABLE_OPT=yes at compile time; it doesn't transfer to ai-stack's current ava-agentone:latest build

Co-Authored-By

Claude Opus 4.7 (1M context) noreply@anthropic.com

Replaces single-device pattern (${GPU_CARD:-/dev/dri/card1} + GPU_RENDER) with
whole-directory mount /dev/dri:/dev/dri.

Why:
- Original pattern hardcoded card1 as default; only safe with .env override
  setting GPU_CARD=/dev/dri/card0. Future installs missing .env override
  would hit the bms-ai-cluster-class device-mapping bug (msg ca3d45b4
  + Solution_files validation 2026-05-18).
- Whole-/dev/dri mount exposes all card*, renderD*, by-path/ symlinks,
  is resilient to per-boot card-number drift, and works on any host
  regardless of which card-N is the GPU.
- Defensive change: doesn't affect ai-stack-on-Phoenix deployment
  (card0 was already mounted via .env override). Empirical measurement
  on Phoenix shows 13.75 tok/s 3-run mean vs 14.45 baseline (within
  slight-noise band), confirming neutral perf impact + defensive benefit.

What this fix does NOT include:
- Stage-2 SYCL env-var overrides (GGML_SYCL_DISABLE_OPT=0, etc.) — verified
  via 8th-class binary-self-report probe that ai-stack's current
  ava-agentone:latest image already has those opts enabled at build time
  (no 'DISABLE_OPT: yes' in Build with Macros). Stage-2 overrides are
  null-effect on this image build; OLLAMA_NUM_CTX=8192 actively hurts
  (-17% measured via KV cache bloat).
- The bms-ai-cluster 'Solution_files' Stage-2 boost applies to OLDER
  ipex-llm builds with DISABLE_OPT=yes at compile-time. Doesn't transfer
  to ai-stack's current image build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@growlf merged commit aa31d17 into main on May 19, 2026
5 checks passed