refactor(api): dedupe imp_generate via imp_prefill + imp_decode_step by kekzl · Pull Request #320 · kekzl/imp

kekzl · 2026-05-20T12:59:06Z

Summary

Phase 5 Track B. `imp_generate` + `imp_generate_streaming` become thin wrappers around `imp_prefill_with_params` + `imp_decode_step` loop via shared template helper `generate_via_prefill_decode_loop`.

Stacking

Stack: #319 → this.

LOC

`src/api/imp_api.cpp` 875 → 886 (+11). Pre-refactor functions were already small (~50 LOC each) — the spec's −150 LOC target was outdated. Real win is structural: shared helper + routing through public primitives.

Diff stats: +175 / −82 in imp_api.cpp. The helper adds ~40 LOC of explanatory comments (cursor reset, natural-finish signal, empty-tokens guard, cancellation contract). Strip comments → ~25 LOC of code replacing two ~50-LOC parallel loops.

Subtle gotcha resolved

Wrapper resets `ctx->consumed_output = 0` after `imp_prefill_with_params` to drain the prefill-sampled first token. Documented inline with 7-line comment block; future readers can't accidentally remove it.

Tests

New `tests/test_api_generate.cpp` (187 LOC, 3 tests):

`GenerateProducesNonEmptyOutput`
`StreamingDeliversAllTokens` — `callback_count == max_tokens` with `ignore_eos=1` catches the "wrapper drops the prefill-sampled first token" bug
`ManualPrefillDecodeLoopRuns` — proves the underlying primitives compose

All 3 PASS against Qwen3-8B Q8_0. Model-gated via `GTEST_SKIP` when model absent.

Strict byte-parity test was attempted and dropped: two `Engine` instances on the same weights produce slightly divergent greedy logits enough to diverge after a few tokens (cross-engine non-determinism, not a refactor bug). Replaced with the 3 weaker but meaningful tests above.

Public-API surface unchanged

`include/imp/imp.h` has empty diff — ABI-stable.

Test plan

`make build` green
`make verify-fast` green
Pre-push hook verify-fast green
3 new parity tests PASS
Combined spec + code-quality reviewer ✅ Approved

Phase 5 Track B of `docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md`.

🤖 Generated with Claude Code

Phase 5 Track A. The gemma4 model-specific knobs no longer live in the global RuntimeConfig — they belong on the model, not on the runtime. Migration: - New ModelConfig::Overrides::Gemma4 struct in src/model/model_config.h - 6 caller files now read from model_->config().overrides.gemma4.* - engine_init_resolver mutates model_->config_.overrides.gemma4.* directly - GemmContext + GemmKernelArgs gained a `force_mmvq` field so the low-level GGUF small-M kernel (which has no Model in scope) can keep honoring the per-model override without a back-channel into the RuntimeConfig singleton - RuntimeConfig::Gemma4 struct and gemma4.* parser entries removed - imp.conf.example [gemma4] section dropped (knob is now per-model only) - test_gemm_kernel_registry tests updated to set args.force_mmvq instead of mutating RuntimeConfig::current() Eliminates the "model-name in a runtime singleton" code smell flagged in the original Phase 5 spec. Phase 5 Track A of docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 5 Track B. imp_generate and imp_generate_streaming are now thin wrappers that delegate to imp_prefill_with_params + a imp_decode_step loop. Both paths share a single generate_via_prefill_decode_loop helper, so the per-token loop, EOS / max_tokens / cancellation handling, sampling-params translation, and request lifecycle now live in exactly one place (the underlying primitives). Public C-API symbol surface is unchanged (ABI-stable). Behaviour is preserved — verified via new tests/test_api_generate.cpp and existing DegenerationTest suite (Qwen3-8B-Q8_0). Note: src/api/imp_api.cpp grew by +11 LOC because the new shared helper carries documentation for the prefill-sampled-first-token bookkeeping (consumed_output cursor reset) and the natural-finish stop signal. The net win is the elimination of two parallel prefill/decode loops, not raw line count — future per-token-behaviour changes no longer need parallel edits in imp_generate and imp_generate_streaming. Phase 5 Track B of docs/superpowers/specs/2026-05-20-architecture-refactor-roadmap-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kekzl and others added 2 commits May 20, 2026 14:38

github-actions Bot enabled auto-merge (squash) May 20, 2026 12:59

github-actions Bot merged commit 4b1ed78 into main May 20, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(api): dedupe imp_generate via imp_prefill + imp_decode_step#320

refactor(api): dedupe imp_generate via imp_prefill + imp_decode_step#320
github-actions[bot] merged 2 commits into
mainfrom
refactor/api-generate-dedupe

kekzl commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kekzl commented May 20, 2026

Summary

Stacking

LOC

Subtle gotcha resolved

Tests

Public-API surface unchanged

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant