refactor(api): dedupe imp_generate via imp_prefill + imp_decode_step#320
Merged
Conversation
Phase 5 Track A. The gemma4 model-specific knobs no longer live in the global RuntimeConfig — they belong on the model, not on the runtime. Migration: - New ModelConfig::Overrides::Gemma4 struct in src/model/model_config.h - 6 caller files now read from model_->config().overrides.gemma4.* - engine_init_resolver mutates model_->config_.overrides.gemma4.* directly - GemmContext + GemmKernelArgs gained a `force_mmvq` field so the low-level GGUF small-M kernel (which has no Model in scope) can keep honoring the per-model override without a back-channel into the RuntimeConfig singleton - RuntimeConfig::Gemma4 struct and gemma4.* parser entries removed - imp.conf.example [gemma4] section dropped (knob is now per-model only) - test_gemm_kernel_registry tests updated to set args.force_mmvq instead of mutating RuntimeConfig::current() Eliminates the "model-name in a runtime singleton" code smell flagged in the original Phase 5 spec. Phase 5 Track A of docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 5 Track B. imp_generate and imp_generate_streaming are now thin wrappers that delegate to imp_prefill_with_params + a imp_decode_step loop. Both paths share a single generate_via_prefill_decode_loop helper, so the per-token loop, EOS / max_tokens / cancellation handling, sampling-params translation, and request lifecycle now live in exactly one place (the underlying primitives). Public C-API symbol surface is unchanged (ABI-stable). Behaviour is preserved — verified via new tests/test_api_generate.cpp and existing DegenerationTest suite (Qwen3-8B-Q8_0). Note: src/api/imp_api.cpp grew by +11 LOC because the new shared helper carries documentation for the prefill-sampled-first-token bookkeeping (consumed_output cursor reset) and the natural-finish stop signal. The net win is the elimination of two parallel prefill/decode loops, not raw line count — future per-token-behaviour changes no longer need parallel edits in imp_generate and imp_generate_streaming. Phase 5 Track B of docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 20, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 5 Track B. `imp_generate` + `imp_generate_streaming` become thin wrappers around `imp_prefill_with_params` + `imp_decode_step` loop via shared template helper `generate_via_prefill_decode_loop`.
Stacking
Stack: #319 → this.
LOC
`src/api/imp_api.cpp` 875 → 886 (+11). Pre-refactor functions were already small (~50 LOC each) — the spec's −150 LOC target was outdated. Real win is structural: shared helper + routing through public primitives.
Diff stats: +175 / −82 in imp_api.cpp. The helper adds ~40 LOC of explanatory comments (cursor reset, natural-finish signal, empty-tokens guard, cancellation contract). Strip comments → ~25 LOC of code replacing two ~50-LOC parallel loops.
Subtle gotcha resolved
Wrapper resets `ctx->consumed_output = 0` after `imp_prefill_with_params` to drain the prefill-sampled first token. Documented inline with 7-line comment block; future readers can't accidentally remove it.
Tests
New `tests/test_api_generate.cpp` (187 LOC, 3 tests):
All 3 PASS against Qwen3-8B Q8_0. Model-gated via `GTEST_SKIP` when model absent.
Strict byte-parity test was attempted and dropped: two `Engine` instances on the same weights produce slightly divergent greedy logits enough to diverge after a few tokens (cross-engine non-determinism, not a refactor bug). Replaced with the 3 weaker but meaningful tests above.
Public-API surface unchanged
`include/imp/imp.h` has empty diff — ABI-stable.
Test plan
Phase 5 Track B of `docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md`.
🤖 Generated with Claude Code