PUMA v2.5.0 Release Notes

Release date: 2026-05-16
Previous release: v2.4.0 (2026-05-13)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 8 (hardening) onto the v2.4.0 base.
It resolves the six inconsistencies (I5–I10) found in the
post-v2.4.0 technical analysis and adds the first canonical empirical
MAE baseline for puma validate-baseline. The gemma4 family
remains excluded from gpu-entry on empirical grounds (F8 / D18); v2.5.0
documents the exclusion in a new versioned catalog changelog rather
than re-introducing the failure mode.

Highlights

Six inconsistencies resolved (I5–I10)

ID	Resolution	Primary artefact
I5	macOS Docker (CPU) vs native Ollama (Metal) modes clarified; v2.6.0 plan stated	`docs/MACOS_NOTES.md` (new)
I6	gpu-entry hardware tolerance bands across RTX 2060/3050/3060/4050/4060 Mobile + Apple cross-arch row	`docs/HARDWARE.md`
I7	Catalog now versioned (`catalog_version: 2.5.0`); changelog document; new unit test	`config/models_catalog.yaml`, `docs/CATALOG_HISTORY.md` (new)
I8	CI job `integration-tests-ollama` runs the 4 `@pytest.mark.ollama` tests on every push to main/develop	`.github/workflows/lint-and-test.yml`
I9	`puma validate-baseline --expected-mae` extends the command to estimation_tawos; canonical spec + empirical reference established	`src/puma/cli.py`, `specs/runs/baseline_estimation_canonical.yaml` (new), `docs/baseline_references.md` (new)
I10	Coverage breakdown by module group with explicit rationale for sub-40 % modules	`docs/TESTING.md` (new)

Estimation canonical baseline established (v2.5.0)

The first empirical MAE reference for puma validate-baseline on
estimation_tawos is now published:

Spec: specs/runs/baseline_estimation_canonical.yaml
Configuration: qwen2.5:3b × zero-shot × N=200 × seed=42 × T=0.0
Reference MAE = 5.7150 SP (tolerance ±0.05 SP)
Establishing run: baseline_estimation_canonical_v1__26d0e07aaa7949ec__20260516T003317
Verified bit-exact across 4 consecutive runs (cold + warm)
Hardware: gpu-entry (RTX 2060 Mobile 6 GB)
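
The acceptance rule implied by the reference and tolerance above can be sketched as follows. This is a minimal illustration under stated assumptions, not PUMA's implementation: the helper name within_tolerance is hypothetical, and the bound is assumed to be inclusive.

```python
from decimal import Decimal

# Reference values quoted in these release notes.
REFERENCE_MAE = Decimal("5.7150")  # story points (SP)
TOLERANCE = Decimal("0.05")        # SP; inclusive bound is an assumption


def within_tolerance(observed: str,
                     reference: Decimal = REFERENCE_MAE,
                     tolerance: Decimal = TOLERANCE) -> bool:
    """Return True when |observed - reference| <= tolerance.

    Hypothetical helper. Decimal is used so that values sitting exactly
    on the edge of the band are not misjudged by binary float rounding.
    """
    return abs(Decimal(observed) - reference) <= tolerance


print(within_tolerance("5.7150"))  # the establishing run
print(within_tolerance("5.7650"))  # exactly on the upper edge of the band
print(within_tolerance("6.3150"))  # the contaminated value cited in these notes
```

Passing the observed value as a string keeps the comparison exact; a float-based check would likely behave the same away from the band edges.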

Cross-scenario state contamination — documented finding

During empirical establishment of the MAE reference, a
state-contamination effect was characterised: running a
triage_jira baseline between an Ollama restart and the estimation
validation shifts MAE from 5.7150 to ≈6.3150 SP (delta ≈ +0.6 SP),
well outside the ±0.05 tolerance. The fresh-Ollama-state validation
protocol that prevents the drift is documented in
docs/baseline_references.md. The effect is a property of Ollama's
inference engine (KV-cache + warm-state behaviour), not a PUMA
code-path regression, and is related to D3 (CUDA non-determinism).
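
To make the size of the effect concrete, the shift can be expressed as a multiple of the tolerance. The snippet below is only an arithmetic check on the figures quoted above; drift_in_tolerance_widths is a hypothetical helper name, and 6.3150 is treated as exact although the notes give it as approximate.

```python
from decimal import Decimal


def drift_in_tolerance_widths(clean: str, contaminated: str, tolerance: str) -> Decimal:
    """Express an MAE shift as a multiple of the tolerance half-width."""
    delta = Decimal(contaminated) - Decimal(clean)
    return delta / Decimal(tolerance)


# Figures from the notes: clean MAE, contaminated MAE, tolerance (all in SP).
ratio = drift_in_tolerance_widths("5.7150", "6.3150", "0.05")
print(ratio)
```

A shift of this size could not reasonably be absorbed by widening the band, which is consistent with the choice to prevent it through the validation protocol instead.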

gemma4 family — status preserved

The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
remains catalogued and remains empirically excluded from
gpu-entry. The exclusion is grounded in:

F8 (closed): gemma4:e2b GGUF measured at 7.2 GB on disk
versus the ~2 GB suggested by effective active params.
D18 (closed): all 5 smoke runs of gemma4:e2b on RTX 2060
6 GB VRAM returned empty raw_response strings.

The regression-guard test
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. Users on gpu-mid (12–24 GB VRAM) and gpu-pro (24+
GB VRAM) hardware can use the gemma4 family normally; on
gpu-entry, select qwen2.5:* or gemma3:* instead. Full
rationale in docs/CATALOG_HISTORY.md.
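
A regression guard of this kind might look roughly like the sketch below. Only the test name comes from these notes; the test body, the helper models_violating_exclusion, the in-memory CATALOG and its "tiers" key are all assumptions made for illustration, since the real schema of config/models_catalog.yaml is not shown here.

```python
# Hypothetical stand-in for the parsed catalog; the real file layout
# and tier assignments may differ.
CATALOG = {
    "catalog_version": "2.5.0",
    "models": {
        "qwen2.5:3b": {"tiers": ["gpu-entry", "gpu-mid", "gpu-pro"]},
        "gemma4:e2b": {"tiers": ["gpu-mid", "gpu-pro"]},
        "gemma4:e4b": {"tiers": ["gpu-mid", "gpu-pro"]},
        "gemma4:26b-a4b": {"tiers": ["gpu-mid", "gpu-pro"]},
    },
}


def models_violating_exclusion(catalog: dict,
                               family: str = "gemma4",
                               tier: str = "gpu-entry") -> list[str]:
    """List models of `family` that are (wrongly) offered on `tier`."""
    return sorted(
        name
        for name, meta in catalog["models"].items()
        if name.startswith(f"{family}:") and tier in meta["tiers"]
    )


def test_gemma4_family_excluded_from_gpu_entry() -> None:
    # The guard fails as soon as any gemma4 entry gains the gpu-entry tier.
    assert models_violating_exclusion(CATALOG) == []
```

Returning the offending model names, rather than a bare boolean, would make a failing assertion message point directly at the catalog entry that needs attention.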

Tests

Extended: tests/unit/test_cli_validate_baseline.py grows from
3 to 8 tests (5 new, covering the MAE path, mutual exclusivity,
missing metric, and default-spec resolution).
New: tests/unit/test_catalog_metadata.py::test_catalog_has_version_field.
Suite total: 348 → 354 passing, 7 deselected (-m 'not ollama').
pre-commit run --all-files: all hooks green.
puma validate-baseline (triage, F1 path): PASS f1=0.5831, delta=-0.0036.
puma validate-baseline --expected-mae 5.7150 (estimation,
NEW): PASS mae=5.7150, delta=+0.0000.
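
For readers who want a feel for what the new catalog metadata test checks, here is a minimal sketch. The test name is taken from these notes, but the body is an assumption: the helper has_valid_catalog_version is hypothetical, the catalog is an inline dict rather than the parsed YAML file, and the three-part version pattern is an assumed rule.

```python
import re

# Assumed rule: the version is a three-part dotted number such as "2.5.0".
VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+")


def has_valid_catalog_version(catalog: dict) -> bool:
    """True when the catalog carries a string catalog_version like '2.5.0'."""
    version = catalog.get("catalog_version")
    return isinstance(version, str) and VERSION_PATTERN.fullmatch(version) is not None


def test_catalog_has_version_field() -> None:
    # Inline stand-in for the parsed config/models_catalog.yaml.
    catalog = {"catalog_version": "2.5.0", "models": {}}
    assert has_valid_catalog_version(catalog)
```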

Quality

Coverage: 61 % (essentially flat from v2.4.0; per-module
breakdown now in docs/TESTING.md).
CI: green on both main and develop. New
integration-tests-ollama job runs only on push to those
branches and is continue-on-error: true so a transient
Ollama failure does not gate the merge queue.
Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
preserved. MAE = 5.7150 ± 0.05 on estimation_tawos newly
established.
src/puma/cli.py LOC essentially unchanged from v2.4.0
(signature extension + a small dispatch block).

Design decisions

--expected-mae is additive, not a refactor. When neither
flag is provided, puma validate-baseline preserves its v2.4.0
behaviour (F1 = 0.5867 against the triage baseline). Existing
CI invocations continue to work unchanged. Sprint Operating
Principle P5 (don't break working code) governed the choice.

gemma4 stays excluded. The original Sprint 8 plan asked for
re-adding gemma4 to gpu-entry at 1.5 / 3 GB sizes. Inventory
caught the conflict with F8 (7.2 GB measured) and D18 (empty
responses). The revised plan converted S8.7 to
documentation-only in CATALOG_HISTORY.md, preserving the
regression guard. Sprint Operating Principle P6
(don't re-introduce previously-rejected models) governed the
choice.

MAE tolerance set to ±0.05 SP. The reference is bit-exact
across cold + warm runs (4-run verification); the ±0.05 band
absorbs the same kind of FP-ordering drift that the F1 ±0.01
band absorbs on triage. The cross-scenario state-contamination
effect (≈0.6 SP) is NOT absorbed by the tolerance; instead, a
validation protocol prevents the contamination.
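
The additive dispatch described in the first decision can be illustrated with a small sketch. It uses argparse purely as a stand-in, since these notes do not say which CLI framework src/puma/cli.py uses. The flag name --expected-f1 and the function resolve_target are hypothetical; only --expected-mae and the default F1 of 0.5867 come from the notes.

```python
import argparse

DEFAULT_F1 = 0.5867  # v2.4.0 triage baseline quoted in these notes


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="puma validate-baseline")
    # The two expectations are mutually exclusive; passing both is an error.
    group = parser.add_mutually_exclusive_group()
    # --expected-f1 is a hypothetical name for the pre-existing flag.
    group.add_argument("--expected-f1", type=float, default=None)
    group.add_argument("--expected-mae", type=float, default=None)
    return parser


def resolve_target(argv: list[str]) -> tuple[str, float]:
    """Pick the metric and expected value, defaulting to the F1 path."""
    args = build_parser().parse_args(argv)
    if args.expected_mae is not None:
        return ("mae", args.expected_mae)
    if args.expected_f1 is not None:
        return ("f1", args.expected_f1)
    # Neither flag given: keep the v2.4.0 behaviour.
    return ("f1", DEFAULT_F1)


print(resolve_target([]))
print(resolve_target(["--expected-mae", "5.7150"]))
```

Because the no-flag branch is checked last and left untouched, existing invocations would keep resolving to the F1 path, which is the property the decision is meant to protect.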

Debt tracking

No new open debt introduced by this release.
Inconsistencies tracked: I1–I10 documented across the project;
v2.5.0 resolves I5–I10 (six items). I1–I4 were resolved in
earlier releases.
Total resolved across v2.0.0 → v2.5.0: 15 of 24 technical debt
items (62 %) plus 6 of 10 inconsistencies (60 %); v2.5.0
contributes the inconsistency resolutions.

Known limitations

Unchanged from v2.4.0:

Single hardware tier evaluated (gpu-entry); models requiring
gpu-mid and above catalogued but not yet empirically
evaluated.
AMD ROCm and Apple Metal backends not yet detected. Apple
Silicon native mode planned for v2.6.0; AMD ROCm pending
hardware availability.
TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion
3).
input_text not persisted in triage_jira instances (D22,
Low).

Upgrade notes

No breaking changes. Existing CI invocations of
puma validate-baseline (no flags, expecting F1 = 0.5867)
continue to work unchanged.
New flag available: puma validate-baseline --expected-mae
for estimation_tawos. See docs/baseline_references.md for
the recommended invocation including the fresh-Ollama-state
protocol.
New docs to know about: MACOS_NOTES.md,
CATALOG_HISTORY.md, baseline_references.md, TESTING.md.
The lint-and-test.yml CI workflow now contains a second job
(integration-tests-ollama) that runs on push to main/develop
only. PRs are unaffected.

Acknowledgments

Development assistance provided by generative AI tooling. All
commits are attributed to the project's git identity per
repository convention.
