PUMA v2.5.0 — Hardening (Sprint 8)

@pumacp pumacp released this 16 May 01:39
· 243 commits to main since this release

PUMA v2.5.0 Release Notes

Release date: 2026-05-16
Previous release: v2.4.0 (2026-05-13)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 8 (hardening) onto the v2.4.0 base.
It resolves the six inconsistencies (I5–I10) found in the
post-v2.4.0 technical analysis and adds the first empirical canonical
MAE baseline for puma validate-baseline. The gemma4 family
remains excluded from gpu-entry on empirical grounds (F8 / D18); v2.5.0
documents the exclusion in a new versioned catalog changelog rather
than re-introducing the failure mode.

Highlights

Six inconsistencies resolved (I5–I10)

ID  | Resolution | Primary artefact
----|------------|-----------------
I5  | macOS Docker (CPU) vs native Ollama (Metal) modes clarified; v2.6.0 plan stated | docs/MACOS_NOTES.md (new)
I6  | gpu-entry hardware tolerance bands across RTX 2060/3050/3060/4050/4060 Mobile + Apple cross-arch row | docs/HARDWARE.md
I7  | Catalog now versioned (catalog_version: 2.5.0); changelog document; new unit test | config/models_catalog.yaml, docs/CATALOG_HISTORY.md (new)
I8  | CI job integration-tests-ollama runs the 4 @pytest.mark.ollama tests on every push to main/develop | .github/workflows/lint-and-test.yml
I9  | puma validate-baseline --expected-mae extends the command to estimation_tawos; canonical spec + empirical reference established | src/puma/cli.py, specs/runs/baseline_estimation_canonical.yaml (new), docs/baseline_references.md (new)
I10 | Coverage breakdown by module group with explicit rationale for sub-40 % modules | docs/TESTING.md (new)

Estimation canonical baseline established (v2.5.0)

The first empirical MAE reference for puma validate-baseline on
estimation_tawos is now published:

  • Spec: specs/runs/baseline_estimation_canonical.yaml
  • Configuration: qwen2.5:3b × zero-shot × N=200 × seed=42 × T=0.0
  • Reference MAE = 5.7150 SP (tolerance ±0.05 SP)
  • Establishing run: baseline_estimation_canonical_v1__26d0e07aaa7949ec__20260516T003317
  • Verified bit-exact across 4 consecutive runs (cold + warm)
  • Hardware: gpu-entry (RTX 2060 Mobile 6 GB)

Cross-scenario state contamination — documented finding

During empirical establishment of the MAE reference, a
state-contamination effect was characterised: running a
triage_jira baseline between an Ollama restart and the estimation
validation shifts MAE from 5.7150 to ≈6.3150 SP (delta = +0.6 SP) —
well outside the ±0.05 tolerance. The fresh-Ollama-state validation
protocol that prevents the drift is documented in
docs/baseline_references.md. This is a property of Ollama's
inference engine (KV-cache + warm-state behaviour), not a PUMA
code-path regression. Related to D3 (CUDA non-determinism).
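
The arithmetic behind this finding can be sketched in a few lines of Python. This is an illustrative check only, not PUMA's implementation: the helper name within_tolerance is hypothetical, and the numbers are the ones quoted above.

```python
# Illustrative only: `within_tolerance` is a hypothetical helper,
# not a function from the PUMA code base.
REFERENCE_MAE = 5.7150  # SP, canonical reference published in this release
TOLERANCE = 0.05        # SP, documented tolerance band


def within_tolerance(observed: float,
                     expected: float = REFERENCE_MAE,
                     tol: float = TOLERANCE) -> bool:
    """Return True when the observed MAE lies inside expected +/- tol."""
    return abs(observed - expected) <= tol


clean_mae = 5.7150         # validation run on a fresh Ollama state
contaminated_mae = 6.3150  # triage_jira baseline executed in between

print(within_tolerance(clean_mae))                 # True
print(within_tolerance(contaminated_mae))          # False
print(round(contaminated_mae - REFERENCE_MAE, 4))  # 0.6
```

The contaminated delta is roughly twelve times the tolerance band, which is why the effect has to be prevented by protocol rather than absorbed by widening the band.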

gemma4 family — status preserved

The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
remains catalogued and remains empirically excluded from
gpu-entry. The exclusion is grounded in:

  • F8 (closed): gemma4:e2b GGUF measured at 7.2 GB on disk
    versus the ~2 GB suggested by effective active params.
  • D18 (closed): all 5 smoke runs of gemma4:e2b on RTX 2060
    6 GB VRAM returned empty raw_response strings.

The regression-guard test
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. Users on gpu-mid (12–24 GB VRAM) and gpu-pro (24+
GB VRAM) hardware can use the gemma4 family normally; on
gpu-entry, select qwen2.5:* or gemma3:* instead. Full
rationale in docs/CATALOG_HISTORY.md.
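
A regression guard of this kind could look roughly like the sketch below. Only the test name comes from this release; the catalog layout (a "models" mapping with a "tiers" list per model), the tier assignments, and the helper models_for_tier are assumptions made for illustration, since the real schema of config/models_catalog.yaml is not shown here.

```python
# Hypothetical catalog shape and placeholder tier lists; the real
# schema of config/models_catalog.yaml is not reproduced in these notes.
CATALOG = {
    "catalog_version": "2.5.0",
    "models": {
        "qwen2.5:3b": {"tiers": ["gpu-entry", "gpu-mid", "gpu-pro"]},
        "gemma4:e2b": {"tiers": ["gpu-mid", "gpu-pro"]},
        "gemma4:e4b": {"tiers": ["gpu-mid", "gpu-pro"]},
        "gemma4:26b-a4b": {"tiers": ["gpu-mid", "gpu-pro"]},
    },
}


def models_for_tier(catalog: dict, tier: str) -> list[str]:
    """Return the sorted names of models offered on the given hardware tier."""
    return sorted(
        name
        for name, meta in catalog["models"].items()
        if tier in meta["tiers"]
    )


def test_gemma4_family_excluded_from_gpu_entry() -> None:
    """Fail if any gemma4 model is ever offered on gpu-entry again."""
    entry_models = models_for_tier(CATALOG, "gpu-entry")
    offenders = [m for m in entry_models if m.startswith("gemma4:")]
    assert offenders == [], f"gemma4 must stay off gpu-entry: {offenders}"


test_gemma4_family_excluded_from_gpu_entry()
```

Checking by name prefix rather than by an explicit list means a newly catalogued gemma4 variant is covered by the guard without touching the test.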

Tests

  • Extended: tests/unit/test_cli_validate_baseline.py grows from
    3 → 8 tests (5 new, covering the MAE path, mutual exclusivity,
    missing metric, and default-spec resolution).
  • New: tests/unit/test_catalog_metadata.py::test_catalog_has_version_field.
  • Suite total: 348 → 354 passing, 7 deselected (-m 'not ollama').
  • pre-commit run --all-files: all hooks green.
  • puma validate-baseline (triage, F1 path): PASS f1=0.5831, delta=-0.0036.
  • puma validate-baseline --expected-mae 5.7150 (estimation,
    NEW): PASS mae=5.7150, delta=+0.0000.
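
As an illustration of what a version-field check like test_catalog_has_version_field might assert, here is a minimal sketch. The body of the real test is not shown in these notes; the sketch assumes the catalog has already been loaded into a dict and that catalog_version is a MAJOR.MINOR.PATCH string. The helper name has_valid_version is hypothetical.

```python
import re

# Assumption: versions follow a plain MAJOR.MINOR.PATCH pattern.
VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+")


def has_valid_version(catalog: dict) -> bool:
    """Return True when the catalog carries a well-formed version string."""
    version = catalog.get("catalog_version")
    # A non-string value (for example a float) is rejected explicitly.
    return isinstance(version, str) and VERSION_PATTERN.fullmatch(version) is not None


print(has_valid_version({"catalog_version": "2.5.0"}))  # True
print(has_valid_version({"models": {}}))                # False
print(has_valid_version({"catalog_version": 2.5}))      # False
```

The type check is deliberate: a two-part version written without quotes in YAML would typically load as a float, and a guard that only tests for the key's presence would not notice.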

Quality

  • Coverage: 61 % (essentially flat from v2.4.0; per-module
    breakdown now in docs/TESTING.md).
  • CI: green on both main and develop. New
    integration-tests-ollama job runs only on push to those
    branches and is continue-on-error: true so a transient
    Ollama failure does not gate the merge queue.
  • Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
    preserved. MAE = 5.7150 ± 0.05 on estimation_tawos newly
    established.
  • src/puma/cli.py LOC essentially unchanged from v2.4.0
    (signature extension + a small dispatch block).

Design decisions

  • --expected-mae is additive, not a refactor. When neither
    flag is provided, puma validate-baseline preserves its v2.4.0
    behaviour (F1 = 0.5867 against the triage baseline). Existing
    CI invocations continue to work unchanged. Sprint Operating
    Principle P5 (don't break working code) governed the choice.
  • gemma4 stays excluded. The original Sprint 8 plan asked for
    re-adding gemma4 to gpu-entry at 1.5 / 3 GB sizes. Inventory
    caught the conflict with F8 (7.2 GB measured) and D18 (empty
    responses). The revised plan converted S8.7 to
    documentation-only in CATALOG_HISTORY.md, preserving the
    regression guard. Sprint Operating Principle P6
    (don't re-introduce previously-rejected models) governed the
    choice.
  • MAE tolerance set to ±0.05 SP. The reference is bit-exact
    across cold + warm runs (4-run verification); the ±0.05 band
    absorbs the same kind of FP-ordering drift that the F1 ±0.01
    band absorbs on triage. The cross-scenario state-contamination
    effect (≈0.6 SP) is NOT absorbed by the tolerance — instead, a
    validation protocol prevents the contamination.
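
The additive design and the two tolerance bands can be sketched as follows. This is not the code in src/puma/cli.py: it uses argparse from the standard library purely to stay self-contained (the CLI framework PUMA actually uses is not stated here), and --expected-f1 is an assumed flag name, since only --expected-mae is named in these notes.

```python
import argparse

DEFAULT_F1 = 0.5867   # triage baseline used when no flag is given
F1_TOLERANCE = 0.01   # band quoted for triage_jira
MAE_TOLERANCE = 0.05  # band quoted for estimation_tawos


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="puma validate-baseline")
    # The two expectations are mutually exclusive: one metric per run.
    group = parser.add_mutually_exclusive_group()
    # `--expected-f1` is an assumed name, used here for illustration only.
    group.add_argument("--expected-f1", type=float, default=None)
    group.add_argument("--expected-mae", type=float, default=None)
    return parser


def select_check(argv: list[str]) -> tuple[str, float, float]:
    """Return (metric, expected value, tolerance) for the given arguments."""
    args = build_parser().parse_args(argv)
    if args.expected_mae is not None:
        return ("mae", args.expected_mae, MAE_TOLERANCE)
    # No MAE flag: fall back to the F1 path, so flag-less calls keep working.
    expected = args.expected_f1 if args.expected_f1 is not None else DEFAULT_F1
    return ("f1", expected, F1_TOLERANCE)


print(select_check([]))                            # ('f1', 0.5867, 0.01)
print(select_check(["--expected-mae", "5.7150"]))  # ('mae', 5.715, 0.05)
```

Passing both flags at once makes argparse exit with a usage error, which mirrors the mutual-exclusivity behaviour covered by the new unit tests.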

Debt tracking

  • No new open debt introduced by this release.
  • Inconsistencies tracked: I1–I10 documented across the project;
    v2.5.0 resolves I5–I10 (six items). I1–I4 were resolved in
    earlier releases.
  • Total resolved across v2.0.0 → v2.5.0: 15 of 24 technical debt
    items (62 %) plus 6 of 10 inconsistencies (60 %); v2.5.0
    contributes the inconsistency resolutions.

Known limitations

Unchanged from v2.4.0:

  • Single hardware tier evaluated (gpu-entry); models requiring
    gpu-mid and above catalogued but not yet empirically
    evaluated.
  • AMD ROCm and Apple Metal backends not yet detected. Apple
    Silicon native mode planned for v2.6.0; AMD ROCm pending
    hardware availability.
  • TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion
    3).
  • input_text not persisted in triage_jira instances (D22,
    Low).

Upgrade notes

  • No breaking changes. Existing CI invocations of
    puma validate-baseline (no flags, expecting F1 = 0.5867)
    continue to work unchanged.
  • New flag available: puma validate-baseline --expected-mae
    for estimation_tawos. See docs/baseline_references.md for
    the recommended invocation including the fresh-Ollama-state
    protocol.
  • New docs to know about: MACOS_NOTES.md,
    CATALOG_HISTORY.md, baseline_references.md, TESTING.md.
  • The lint-and-test.yml CI workflow now contains a second job
    (integration-tests-ollama) that runs on push to main/develop
    only. PRs are unaffected.

Acknowledgments

Development assistance provided by generative AI tooling. All
commits are attributed to the project's git identity per
repository convention.