refactor(memory): introduce MemoryManager façade for VRAM ownership#321
Merged
Conversation
Phase 5 Track A. The gemma4 model-specific knobs no longer live in the global RuntimeConfig — they belong on the model, not on the runtime. Migration: - New ModelConfig::Overrides::Gemma4 struct in src/model/model_config.h - 6 caller files now read from model_->config().overrides.gemma4.* - engine_init_resolver mutates model_->config_.overrides.gemma4.* directly - GemmContext + GemmKernelArgs gained a `force_mmvq` field so the low-level GGUF small-M kernel (which has no Model in scope) can keep honoring the per-model override without a back-channel into the RuntimeConfig singleton - RuntimeConfig::Gemma4 struct and gemma4.* parser entries removed - imp.conf.example [gemma4] section dropped (knob is now per-model only) - test_gemm_kernel_registry tests updated to set args.force_mmvq instead of mutating RuntimeConfig::current() Eliminates the "model-name in a runtime singleton" code smell flagged in the original Phase 5 spec. Phase 5 Track A of docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 5 Track B. imp_generate and imp_generate_streaming are now thin wrappers that delegate to imp_prefill_with_params + a imp_decode_step loop. Both paths share a single generate_via_prefill_decode_loop helper, so the per-token loop, EOS / max_tokens / cancellation handling, sampling-params translation, and request lifecycle now live in exactly one place (the underlying primitives). Public C-API symbol surface is unchanged (ABI-stable). Behaviour is preserved — verified via new tests/test_api_generate.cpp and existing DegenerationTest suite (Qwen3-8B-Q8_0). Note: src/api/imp_api.cpp grew by +11 LOC because the new shared helper carries documentation for the prefill-sampled-first-token bookkeeping (consumed_output cursor reset) and the natural-finish stop signal. The net win is the elimination of two parallel prefill/decode loops, not raw line count — future per-token-behaviour changes no longer need parallel edits in imp_generate and imp_generate_streaming. Phase 5 Track B of docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 5 Track C of the architecture refactor roadmap. Engine carried a
loose `VRAMAllocator vram_alloc_` field directly and depended on the
free functions `compute_vram_budget` and `plan_storage` in
src/runtime/{vram_budget,storage_planner}.cpp. Memory ownership was
spread across the engine and two anonymous free functions.
This PR introduces `MemoryManager` in src/memory/memory_manager.{h,cpp}:
- Owns the VRAMAllocator directly (matches the only allocator the
engine instantiates today).
- Exposes PinnedAllocator + DeviceAllocator via lazy `unique_ptr<>`
accessors so future call sites can reach them through the façade
without changing engine init costs (no 64-MiB pinned pool until
something asks).
- Wraps `compute_vram_budget` as `compute_budget()` and `plan_storage`
as `plan_storage_for()`.
- Engine now has one `MemoryManager memory_manager_` field; the
pre-existing `Engine::vram_allocator()` accessor delegates to it for
source-compatibility, and `Engine::memory_manager()` is the new
primary entry point.
The five underlying modules (vram_allocator, device_allocator,
pinned_allocator, vram_budget, storage_planner) are unchanged — this
is a façade, not an implementation merge. 17 call sites in
src/runtime/engine.cpp migrated from `vram_alloc_` /
`compute_vram_budget(...)` to the new façade. Executor-internal call
sites in src/exec/ keep using their own `vram_alloc_` raw pointer
(set via `set_vram_allocator(&memory_manager_.vram_allocator())`) —
they are not Engine-scope and stay out of this façade pass.
`make verify-fast` passes (test-gpu has a pre-existing unrelated stub
crash in StubModelTest.CreateContextAndInfer, confirmed by running
the same target on the parent commit).
Phase 5 Track C of docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 20, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 5 Track C. Introduce `MemoryManager` class in `src/memory/memory_manager.{h,cpp}` as a façade owning the engine's VRAM allocators + wrapping the budget/planner free functions. Engine field changes from `VRAMAllocator vram_alloc_` to `MemoryManager memory_manager_` — single entry point.
Stacking
#319 → #320 → this.
Design: façade pattern, not implementation-merge
The 5 underlying modules stay untouched:
"Deeper merge" (kill the .cpp files and fold their bodies) would have inflated the diff by ~300 LOC and broken the façade-not-merge claim. Façade now, fold later if a second consumer ever appears.
LOC
Why 18 not ~30
PinnedAllocator/DeviceAllocator were standalone classes that Engine never instantiated. `src/exec/*` has its own `vram_alloc_` pointer (GraphExecutor::set_vram_allocator), not Engine's — unchanged after this PR's redirect.
Pre-existing test failure noted
`StubModelTest.CreateContextAndInfer` segfaults; verified pre-existing by reproducing on parent commit `b746959`. Not introduced by this PR.
Test plan
Phase 5 Track C of `docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md`.
🤖 Generated with Claude Code