PUMA v2.4.0 Release Notes

Release date: 2026-05-13
Previous release: v2.3.0 (2026-05-13)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 7 (CLI completeness for Anexo F) onto
the v2.3.0 base. It resolves the long-standing gap between the academic
Anexo F catalog and the actual repository state by adding the six
high-value commands from section A.2 of Anexo F, together with a new
source-of-truth document distinguishing implemented commands from
documented design proposals.

Highlights

Anexo F gap resolved

docs/anexo_F_cli_reference.md is now the canonical CLI reference,
split into two sections:

Section A — Implemented: A.1 lists commands pre-existing in
v2.0.0–v2.3.0 (preflight, models, datasets, cache, run,
validate-baseline, compare, dashboard, report, db). A.2
lists the six commands added in this release (below). Every command
in section A is verifiable via puma <command> --help and covered
by tests under tests/cli/.
Section B — Proposed extensions: 5 Bash auxiliary scripts and
12 further CLI commands (Ollama management, sweep wrappers, DB
tooling, code-quality wrappers) documented as design space.
Explicitly marked as not implemented; the decision rationale is
recorded in the document.

Six new CLI commands (Anexo F § A.2)

Anexo F	Command	Style
A.2.1	`puma prepare-datasets`	Thin subprocess wrapper of `scripts/prepare_datasets.py` (`--dataset`, `--force-redownload`, `--verify`)
A.2.2	`puma wilcoxon`	NEW analysis: paired Wilcoxon signed-rank between two `run_id`s; uses `puma.metrics.statistical_tests.wilcoxon_signed_rank_models`
A.2.3	`puma bias-analysis`	NEW analysis: bias evaluation report; uses `puma.dashboard.data.load_predictions_with_gold` + `puma.metrics.fairness.perturbation_disparity`
A.2.4	`puma generate-plots`	Thin subprocess wrapper of `scripts/generate_phase_b_plots.py` (`--source phase_b` only; `bias_eval`/`multi_seed` exit 2 with deferred-implementation message)
A.2.5	`puma list-runs`	New: SQL pivot of `runs ⋈ metrics` with `--scenario`/`--model`/`--last-n`/`--since` filters and `--json`
A.2.6	`puma list-ollama-models`	New: parses `docker exec puma_ollama ollama list` subprocess output
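
The "SQL pivot of runs ⋈ metrics" behind list-runs can be sketched as below. This is only an illustration under assumptions: the table layout, column names, metric names and sample values are hypothetical and are not taken from the PUMA schema; only the filter names mirror the flags listed in the table.

```python
import json
import sqlite3

# Hypothetical schema: the real PUMA tables and columns may differ.
SCHEMA = """
CREATE TABLE runs (run_id TEXT PRIMARY KEY, scenario TEXT, model TEXT, started_at TEXT);
CREATE TABLE metrics (run_id TEXT, name TEXT, value REAL);
"""

# One output row per run; metric names become columns (the "pivot").
PIVOT_SQL = """
SELECT r.run_id, r.scenario, r.model, r.started_at,
       MAX(CASE WHEN m.name = 'f1_macro' THEN m.value END) AS f1_macro,
       MAX(CASE WHEN m.name = 'accuracy' THEN m.value END) AS accuracy
FROM runs AS r
JOIN metrics AS m ON m.run_id = r.run_id
WHERE (:scenario IS NULL OR r.scenario = :scenario)
  AND (:model IS NULL OR r.model = :model)
  AND (:since IS NULL OR r.started_at >= :since)
GROUP BY r.run_id, r.scenario, r.model, r.started_at
ORDER BY r.started_at DESC
LIMIT :last_n
"""


def list_runs(conn, scenario=None, model=None, since=None, last_n=-1):
    """Return one dict per run. In SQLite a negative LIMIT means no limit."""
    conn.row_factory = sqlite3.Row
    params = {"scenario": scenario, "model": model, "since": since, "last_n": last_n}
    return [dict(row) for row in conn.execute(PIVOT_SQL, params)]


# Dummy data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?)", [
    ("run-a", "triage_jira", "model-x", "2026-05-01T10:00:00"),
    ("run-b", "triage_jira", "model-y", "2026-05-02T10:00:00"),
])
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", [
    ("run-a", "f1_macro", 0.50), ("run-a", "accuracy", 0.60),
    ("run-b", "f1_macro", 0.55), ("run-b", "accuracy", 0.65),
])

rows = list_runs(conn, scenario="triage_jira", last_n=1)
print(json.dumps(rows, indent=2))  # roughly what a --json mode could emit
```

The conditional-aggregation form (MAX over CASE) keeps the pivot in a single query, at the cost of naming each metric column explicitly.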

Why two commands are NEW analyses and not wrappers: Anexo F § A.2.2
and § A.2.3 specify semantics that diverge from the existing scripts
(positional run_id arguments vs. --run-prefix; --models /
--perturbations filters vs. prefix-only). Rather than mutate the
scripts, the new CLI commands call PUMA's own core helpers directly.
The scripts remain unchanged and continue to support their original
workflows (top-K ranking).
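
For readers unfamiliar with the statistic behind puma wilcoxon, the following is a minimal stdlib sketch of a two-sided paired Wilcoxon signed-rank test. It is not the code of puma.metrics.statistical_tests.wilcoxon_signed_rank_models, whose signature is not shown in these notes; the function name, the normal approximation without continuity correction, and the sample scores are all assumptions made for illustration.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (W, p_value) with W = min(W+, W-). Zero differences are dropped
    and tied absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, averaging ranks within each tie group.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48  # tie-corrected
    z = (w - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w, p_value


# Hypothetical per-item scores for the same items under two runs.
run_1 = [12, 15, 9, 20, 17, 11, 14, 18]
run_2 = [10, 14, 10, 15, 13, 11, 9, 12]
w, p = wilcoxon_signed_rank(run_1, run_2)
print(f"W={w}, p={p:.4f}")
```

The normal approximation is unreliable for very small samples; an exact test would be preferable there.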

Tests

New: tests/cli/ package with 27 tests across 6 files, one per
command. Each file tests at minimum: --help exit 0, happy path,
error paths.
Suite total: 318 → 348 passing, 7 deselected (-m 'not ollama').
pre-commit run --all-files: all hooks green.
puma validate-baseline: PASS f1_macro=0.5831, delta=-0.0036.
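
The validate-baseline line above is consistent with the reference F1 of 0.5867 and the ± 0.01 band quoted in the Quality section. The arithmetic can be checked with a short sketch; the pass rule (absolute delta within the band) and the helper name are assumptions, since these notes do not describe how the command decides PASS.

```python
BASELINE_F1 = 0.5867  # reference value quoted in the Quality section
TOLERANCE = 0.01      # the +/- band quoted in the Quality section
observed_f1 = 0.5831  # f1_macro reported by puma validate-baseline


def check_baseline(observed, baseline=BASELINE_F1, tolerance=TOLERANCE):
    """Return (delta, ok). Assumed rule: pass when |delta| <= tolerance."""
    delta = round(observed - baseline, 4)
    return delta, abs(delta) <= tolerance


delta, ok = check_baseline(observed_f1)
print(f"{'PASS' if ok else 'FAIL'} f1_macro={observed_f1:.4f}, delta={delta:+.4f}")
```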

Quality

Coverage: 58 % (no significant change from v2.3.0).
CI: green on both main and develop.
Baseline reproducibility: F1 = 0.5867 ± 0.01 holds.
app.py LOC is stable; the only file that grew meaningfully is
src/puma/cli.py (363 → 777 LOC, from inline command bodies). A
refactor into a src/puma/cli/commands/ package was considered and
deferred; the monolith remains the cleaner option at this size.

Design decisions

--source bias_eval / multi_seed for generate-plots are
accepted by the parser but exit 2 with a deferred-implementation
message because the underlying plotting scripts for those sources
do not yet exist. This matches the Anexo F spec without inventing
data.
--verify for prepare-datasets currently emits SHA-256 hashes
only. A manifest file (docs/datasets_manifest.json) for full hash
comparison is documented in Anexo F but not yet in the repo; the
command is forward-compatible.
No src/puma/cli/commands/ refactor was performed in this
release. With 6 new commands the inline monolith is still readable;
the refactor would be justified if/when Section B extensions land.
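
The "accepted by the parser but exit 2" behaviour for generate-plots can be sketched as follows. This uses argparse purely for illustration; the framework PUMA's CLI actually uses is not stated here, and the message text, the required flag and the function names are assumptions.

```python
import argparse
import sys

IMPLEMENTED_SOURCES = {"phase_b"}
DEFERRED_SOURCES = {"bias_eval", "multi_seed"}


def build_parser():
    parser = argparse.ArgumentParser(prog="puma generate-plots")
    # Deferred sources are valid choices, so they are not rejected as typos.
    parser.add_argument(
        "--source",
        choices=sorted(IMPLEMENTED_SOURCES | DEFERRED_SOURCES),
        required=True,
    )
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.source in DEFERRED_SOURCES:
        # Hypothetical wording for the deferred-implementation message.
        print(f"--source {args.source}: implementation deferred", file=sys.stderr)
        return 2
    # A real command would invoke the plotting script at this point.
    return 0


print(main(["--source", "phase_b"]))
print(main(["--source", "bias_eval"]))
```

Keeping the deferred values in the parser means the help text already documents them, and a later implementation only has to move a name from one set to the other.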

Debt tracking

No new open debt introduced by this release.
Total resolved across v2.0.0 → v2.4.0: 15 of 24 (62 %).
Section B of Anexo F is documented design space, not technical
debt. Implementation is optional and conditional on demand.

Known limitations

Unchanged from v2.3.0:

Single hardware tier evaluated (gpu-entry); models requiring
gpu-mid and above catalogued but not yet empirically evaluated.
AMD ROCm and Apple Metal backends not yet detected.
TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
input_text not persisted in triage_jira instances (D22, Low).

Upgrade notes

No breaking changes to existing commands or YAML run-spec schema.
Six new commands available. See puma <command> --help for usage
or read docs/anexo_F_cli_reference.md § A.2.
New test directory tests/cli/ joins the existing
tests/unit/ and tests/integration/.

Acknowledgments

Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.
