PUMA v2.5.0 — Hardening (Sprint 8)
PUMA v2.5.0 Release Notes
Release date: 2026-05-16
Previous release: v2.4.0 (2026-05-13)
Branch: develop → main (post-tag)
Summary
This release consolidates Sprint 8 (hardening) onto the v2.4.0 base.
It resolves the six inconsistencies (I5–I10) detected in the
post-v2.4.0 technical analysis and adds the first empirical MAE
canonical baseline for puma validate-baseline. The gemma4 family
remains empirically excluded from gpu-entry per F8 / D18; v2.5.0
documents the exclusion in a new versioned catalog changelog rather
than re-introducing the failure mode.
Highlights
Six inconsistencies resolved (I5–I10)
| ID | Resolution | Primary artefact |
|---|---|---|
| I5 | macOS Docker (CPU) vs native Ollama (Metal) modes clarified; v2.6.0 plan stated | docs/MACOS_NOTES.md (new) |
| I6 | gpu-entry hardware tolerance bands across RTX 2060/3050/3060/4050/4060 Mobile + Apple cross-arch row | docs/HARDWARE.md |
| I7 | Catalog now versioned (catalog_version: 2.5.0); changelog document; new unit test | config/models_catalog.yaml, docs/CATALOG_HISTORY.md (new) |
| I8 | CI job integration-tests-ollama runs the 4 @pytest.mark.ollama tests on every push to main/develop | .github/workflows/lint-and-test.yml |
| I9 | puma validate-baseline --expected-mae extends the command to estimation_tawos; canonical spec + empirical reference established | src/puma/cli.py, specs/runs/baseline_estimation_canonical.yaml (new), docs/baseline_references.md (new) |
| I10 | Coverage breakdown by module group with explicit rationale for sub-40 % modules | docs/TESTING.md (new) |
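To illustrate the kind of check the new catalog unit test (I7) might perform, here is a minimal sketch. It assumes the catalog is already parsed into a dictionary; the helper name `has_valid_catalog_version` and the MAJOR.MINOR.PATCH pattern are illustrative assumptions, not the project's actual test code.

```python
import re

# Assumption: catalog_version follows a MAJOR.MINOR.PATCH pattern such as "2.5.0".
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def has_valid_catalog_version(catalog: dict) -> bool:
    """Return True if the catalog carries a string catalog_version like '2.5.0'."""
    version = catalog.get("catalog_version")
    return isinstance(version, str) and SEMVER.match(version) is not None


# In-memory stand-in for the parsed config/models_catalog.yaml (hypothetical shape).
catalog = {"catalog_version": "2.5.0", "models": []}
print(has_valid_catalog_version(catalog))
```

A check of this shape fails fast when the version field is dropped or malformed, which is the situation a versioned changelog is meant to prevent.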
Estimation canonical baseline established (v2.5.0)
The first empirical MAE reference for puma validate-baseline on
estimation_tawos is now published:
- Spec: specs/runs/baseline_estimation_canonical.yaml
- Configuration: qwen2.5:3b × zero-shot × N=200 × seed=42 × T=0.0
- Reference MAE = 5.7150 SP (tolerance ±0.05 SP)
- Establishing run: baseline_estimation_canonical_v1__26d0e07aaa7949ec__20260516T003317
- Verified bit-exact across 4 consecutive runs (cold + warm)
- Hardware: gpu-entry (RTX 2060 Mobile 6 GB)
Cross-scenario state contamination — documented finding
During empirical establishment of the MAE reference, a
state-contamination effect was characterised: running a
triage_jira baseline between an Ollama restart and the estimation
validation shifts MAE from 5.7150 to ≈6.3150 SP (delta = +0.6 SP) —
well outside the ±0.05 tolerance. The fresh-Ollama-state validation
protocol that prevents the drift is documented in
docs/baseline_references.md. This is a property of Ollama's
inference engine (KV-cache + warm-state behaviour), not a PUMA
code-path regression. Related to D3 (CUDA non-determinism).
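The arithmetic behind this finding can be sketched as a simple tolerance check. This is an illustration only: the function name `within_tolerance` is hypothetical, and the inclusive comparison is an assumption about how the band is applied, not a description of the code in src/puma/cli.py.

```python
def within_tolerance(observed: float, reference: float, tolerance: float) -> bool:
    """True when the absolute deviation from the reference is inside the band."""
    # Assumption: the band is inclusive at its edges.
    return abs(observed - reference) <= tolerance


REFERENCE_MAE = 5.7150  # SP, reference value from the establishing run
TOLERANCE = 0.05        # SP

clean = within_tolerance(5.7150, REFERENCE_MAE, TOLERANCE)
contaminated = within_tolerance(6.3150, REFERENCE_MAE, TOLERANCE)
print(clean, contaminated)  # prints: True False
```

The contaminated value deviates by about 0.6 SP, twelve times the tolerance, so widening the band to absorb it would make the check far less sensitive to real regressions.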
gemma4 family — status preserved
The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
remains catalogued and remains empirically excluded from
gpu-entry. The exclusion is grounded in:
- F8 (closed): gemma4:e2b GGUF measured at 7.2 GB on disk versus the ~2 GB suggested by effective active params.
- D18 (closed): all 5 smoke runs of gemma4:e2b on RTX 2060 6 GB VRAM returned empty raw_response strings.
The regression-guard test
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. Users on gpu-mid (12–24 GB VRAM) and gpu-pro (24+
GB VRAM) hardware can use the gemma4 family normally; on
gpu-entry, select qwen2.5:* or gemma3:* instead. Full
rationale in docs/CATALOG_HISTORY.md.
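A regression guard of this kind could look roughly like the sketch below. The catalog shape (model name mapped to a list of tiers), the tier assignments, and the helper `models_for_tier` are invented for illustration; only the test name and the model and tier names come from these notes.

```python
# Hypothetical in-memory catalog: model name -> hardware tiers it is offered on.
CATALOG = {
    "qwen2.5:3b": ["gpu-entry"],
    "gemma4:e2b": ["gpu-mid", "gpu-pro"],
    "gemma4:e4b": ["gpu-mid", "gpu-pro"],
    "gemma4:26b-a4b": ["gpu-mid", "gpu-pro"],
}


def models_for_tier(catalog: dict, tier: str) -> list:
    """Return the sorted model names offered on the given hardware tier."""
    return sorted(name for name, tiers in catalog.items() if tier in tiers)


def test_gemma4_family_excluded_from_gpu_entry() -> None:
    """Fail if any gemma4 model is offered on gpu-entry."""
    offenders = [
        name
        for name in models_for_tier(CATALOG, "gpu-entry")
        if name.startswith("gemma4:")
    ]
    assert offenders == []


test_gemma4_family_excluded_from_gpu_entry()
```

Keeping the guard as a test rather than a comment means a later catalog edit that re-adds a gemma4 model to gpu-entry fails in CI instead of at inference time.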
Tests
- New: tests/unit/test_cli_validate_baseline.py extended from 3 → 8 tests (5 new for the MAE path, mutual exclusivity, missing metric, and default-spec resolution).
- New: tests/unit/test_catalog_metadata.py::test_catalog_has_version_field.
- Suite total: 348 → 354 passing, 7 deselected (-m 'not ollama').
- pre-commit run --all-files: all hooks green.
- puma validate-baseline (triage, F1 path): PASS f1=0.5831, delta=-0.0036.
- puma validate-baseline --expected-mae 5.7150 (estimation, NEW): PASS mae=5.7150, delta=+0.0000.
Quality
- Coverage: 61 % (essentially flat from v2.4.0; per-module breakdown now in docs/TESTING.md).
- CI: green on both main and develop. New integration-tests-ollama job runs only on push to those branches and is continue-on-error: true so a transient Ollama failure does not gate the merge queue.
- Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira preserved. MAE = 5.7150 ± 0.05 on estimation_tawos newly established.
- src/puma/cli.py LOC essentially unchanged from v2.4.0 (signature extension + a small dispatch block).
Design decisions
- --expected-mae is additive, not a refactor. When neither flag is provided, puma validate-baseline preserves its v2.4.0 behaviour (F1 = 0.5867 against the triage baseline). Existing CI invocations continue to work unchanged. Sprint Operating Principle P5 (don't break working code) governed the choice.
- gemma4 stays excluded. The original Sprint 8 plan asked for re-adding gemma4 to gpu-entry at 1.5 / 3 GB sizes. Inventory caught the conflict with F8 (7.2 GB measured) and D18 (empty responses). The revised plan converted S8.7 to documentation-only in CATALOG_HISTORY.md, preserving the regression guard. Sprint Operating Principle P6 (don't re-introduce previously-rejected models) governed the choice.
- MAE tolerance set to ±0.05 SP. The reference is bit-exact across cold + warm runs (4-run verification); the ±0.05 band absorbs the same kind of FP-ordering drift that the F1 ±0.01 band absorbs on triage. The cross-scenario state-contamination effect (≈0.6 SP) is NOT absorbed by the tolerance; instead, a validation protocol prevents the contamination.
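The additive design can be sketched with a mutually exclusive flag group. This is an assumption-laden illustration: it uses argparse for a standard-library-only example (the notes do not say which CLI framework src/puma/cli.py uses), and the second flag name `--expected-f1` and the helper `resolve_target` are hypothetical.

```python
import argparse

DEFAULT_F1 = 0.5867  # v2.4.0 triage reference quoted in these notes


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="puma validate-baseline")
    group = parser.add_mutually_exclusive_group()
    # Hypothetical flag name; the notes only state that two flags exist.
    group.add_argument("--expected-f1", type=float, default=None)
    group.add_argument("--expected-mae", type=float, default=None)
    return parser


def resolve_target(argv: list) -> tuple:
    """Map CLI flags to (scenario, metric, expected value, tolerance)."""
    args = build_parser().parse_args(argv)
    if args.expected_mae is not None:
        return ("estimation_tawos", "mae", args.expected_mae, 0.05)
    # No flag given: fall back to the v2.4.0 behaviour (triage F1 reference).
    expected = args.expected_f1 if args.expected_f1 is not None else DEFAULT_F1
    return ("triage_jira", "f1", expected, 0.01)


print(resolve_target([]))
print(resolve_target(["--expected-mae", "5.7150"]))
```

Because the new flag only adds a branch and the no-flag path is untouched, invocations written for v2.4.0 resolve to the same scenario, metric, and tolerance as before. Passing both flags makes argparse exit with a usage error.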
Debt tracking
- No new open debt introduced by this release.
- Inconsistencies tracked: I1–I10 documented across the project; v2.5.0 resolves I5–I10 (six items). I1–I4 were resolved in earlier releases.
- Total resolved across v2.0.0 → v2.5.0: 15 of 24 technical debt items (62 %) plus 6 of 10 inconsistencies (60 %); v2.5.0 contributes the inconsistency resolutions.
Known limitations
Unchanged from v2.4.0:
- Single hardware tier evaluated (gpu-entry); models requiring gpu-mid and above catalogued but not yet empirically evaluated.
- AMD ROCm and Apple Metal backends not yet detected. Apple Silicon native mode planned for v2.6.0; AMD ROCm pending hardware availability.
- TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
- input_text not persisted in triage_jira instances (D22, Low).
Upgrade notes
- No breaking changes. Existing CI invocations of puma validate-baseline (no flags, expecting F1 = 0.5867) continue to work unchanged.
- New flag available: puma validate-baseline --expected-mae for estimation_tawos. See docs/baseline_references.md for the recommended invocation including the fresh-Ollama-state protocol.
- New docs to know about: MACOS_NOTES.md, CATALOG_HISTORY.md, baseline_references.md, TESTING.md.
- The lint-and-test.yml CI workflow now contains a second job (integration-tests-ollama) that runs on push to main/develop only. PRs are unaffected.
Acknowledgments
Development assistance provided by generative AI tooling. All
commits are attributed to the project's git identity per
repository convention.