PUMA v2.4.0 — CLI completeness (Anexo F section A.2)
PUMA v2.4.0 Release Notes
Release date: 2026-05-13
Previous release: v2.3.0 (2026-05-13)
Branch: develop → main (post-tag)
Summary
This release consolidates Sprint 7 (CLI completeness for Anexo F) onto
the v2.3.0 base. It resolves the long-standing gap between the academic
Anexo F catalog and the actual repository state by adding the six
high-value commands from section A.2 of Anexo F, together with a new
source-of-truth document distinguishing implemented commands from
documented design proposals.
Highlights
Anexo F gap resolved
docs/anexo_F_cli_reference.md is now the canonical CLI reference,
split into two sections:
- Section A (Implemented): A.1 lists the commands that already existed in
v2.0.0–v2.3.0 (preflight, models, datasets, cache, run,
validate-baseline, compare, dashboard, report, db). A.2
lists the six commands added in this release (below). Every command
in Section A is verifiable via puma <command> --help and covered
by tests under tests/cli/.
- Section B (Proposed extensions): 5 Bash auxiliary scripts and
12 further CLI commands (Ollama management, sweep wrappers, DB
tooling, code-quality wrappers) documented as design space.
Explicitly marked as not implemented; the decision rationale is
recorded in the document.
Six new CLI commands (Anexo F § A.2)
| Anexo F | Command | Style |
|---|---|---|
| A.2.1 | puma prepare-datasets | Thin subprocess wrapper of scripts/prepare_datasets.py (--dataset, --force-redownload, --verify) |
| A.2.2 | puma wilcoxon | NEW analysis: paired Wilcoxon signed-rank between two run_ids; uses puma.metrics.statistical_tests.wilcoxon_signed_rank_models |
| A.2.3 | puma bias-analysis | NEW analysis: bias evaluation report; uses puma.dashboard.data.load_predictions_with_gold + puma.metrics.fairness.perturbation_disparity |
| A.2.4 | puma generate-plots | Thin subprocess wrapper of scripts/generate_phase_b_plots.py (--source phase_b only; bias_eval/multi_seed exit 2 with deferred-implementation message) |
| A.2.5 | puma list-runs | New: SQL pivot of runs ⋈ metrics with --scenario/--model/--last-n/--since filters and --json |
| A.2.6 | puma list-ollama-models | New: parses docker exec puma_ollama ollama list subprocess output |
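To make the "SQL pivot of runs ⋈ metrics" idea for list-runs concrete, here is a minimal sketch using an in-memory SQLite database. The table layout, column names, metric names and the list_runs helper are all assumptions made for illustration; the real PUMA schema and implementation may differ.

```python
import json
import sqlite3

# Hypothetical schema for illustration; PUMA's real tables and columns may differ.
SCHEMA = """
CREATE TABLE runs (run_id TEXT PRIMARY KEY, scenario TEXT, model TEXT, started_at TEXT);
CREATE TABLE metrics (run_id TEXT, name TEXT, value REAL);
"""


def list_runs(conn, scenario=None, model=None, since=None, last_n=None):
    """Return one dict per run, newest first, with two metrics pivoted into columns."""
    where, params = [], []
    for column, value, op in (
        ("r.scenario", scenario, "="),
        ("r.model", model, "="),
        ("r.started_at", since, ">="),  # ISO dates compare correctly as text
    ):
        if value is not None:
            where.append(f"{column} {op} ?")
            params.append(value)

    # Pivot: each metric row becomes a column via conditional aggregation.
    sql = (
        "SELECT r.run_id, r.scenario, r.model, r.started_at, "
        "MAX(CASE WHEN m.name = 'f1_macro' THEN m.value END) AS f1_macro, "
        "MAX(CASE WHEN m.name = 'accuracy' THEN m.value END) AS accuracy "
        "FROM runs r LEFT JOIN metrics m ON m.run_id = r.run_id "
    )
    if where:
        sql += "WHERE " + " AND ".join(where) + " "
    sql += "GROUP BY r.run_id, r.scenario, r.model, r.started_at "
    sql += "ORDER BY r.started_at DESC"
    if last_n is not None:
        sql += " LIMIT ?"
        params.append(last_n)

    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


# Sample data (invented) to exercise the filters.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?)", [
    ("run-a", "demo", "model-x", "2026-05-01"),
    ("run-b", "demo", "model-y", "2026-05-10"),
    ("run-c", "other", "model-x", "2026-05-12"),
])
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", [
    ("run-a", "f1_macro", 0.5), ("run-a", "accuracy", 0.6),
    ("run-b", "f1_macro", 0.55),
])

rows = list_runs(conn, scenario="demo", last_n=5)
print(json.dumps(rows, indent=2))  # roughly what a --json flag might emit
```

The LEFT JOIN keeps runs that have no metrics yet, so they appear with null metric columns rather than disappearing from the listing.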
Why two commands are NEW analyses and not wrappers: Anexo F § A.2.2
and § A.2.3 specify semantics that diverge from the existing scripts
(positional run_id arguments vs. --run-prefix; --models /
--perturbations filters vs. prefix-only). Rather than mutate the
scripts, the new CLI commands call PUMA's own core helpers directly.
The scripts remain unchanged and continue to support their original
workflows (top-K ranking).
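For readers unfamiliar with the test behind puma wilcoxon, the following is a self-contained sketch of a paired, two-sided Wilcoxon signed-rank test using a normal approximation. It is not the PUMA helper wilcoxon_signed_rank_models, whose signature is not shown in these notes; the function name, the sample scores, and the choice of the normal approximation (which is rough for small samples) are assumptions for illustration.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Paired two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, p_value) where W = min(W+, W-). Zero differences are dropped.
    """
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d| in ascending order; tied values share the average of their ranks.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation with the usual tie correction to the variance.
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w - mean) / math.sqrt(variance)
    p_value = min(1.0, math.erfc(abs(z) / math.sqrt(2)))
    return w, p_value


# Invented per-item scores for two runs, paired by item.
scores_run_1 = [12, 15, 9, 20, 17, 14, 11, 18]
scores_run_2 = [10, 14, 9, 15, 18, 10, 8, 12]
w_stat, p_val = wilcoxon_signed_rank(scores_run_1, scores_run_2)
print(f"W={w_stat}, p={p_val:.4f}")
```

With few pairs, an exact null distribution is generally preferable to the normal approximation; the sketch only illustrates the ranking and sign-sum steps.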
Tests
- New: tests/cli/ package with 27 tests across 6 files, one per
command. Each file tests at minimum: --help exit 0, happy path,
error paths.
- Suite total: 318 → 348 passing, 7 deselected (-m 'not ollama').
- pre-commit run --all-files: all hooks green.
- puma validate-baseline: PASS f1_macro=0.5831, delta=-0.0036.
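The per-command test pattern (--help exits 0, happy path, error paths) can be sketched as follows. The parser below is a hypothetical argparse stand-in, not PUMA's cli.py, and these notes do not say which CLI framework PUMA uses; only the generate-plots --source values and the exit code 2 for deferred sources are taken from this document.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for one PUMA subcommand (not the real cli.py)."""
    parser = argparse.ArgumentParser(prog="puma-demo")
    sub = parser.add_subparsers(dest="command", required=True)
    plots = sub.add_parser("generate-plots")
    plots.add_argument(
        "--source",
        choices=["phase_b", "bias_eval", "multi_seed"],
        default="phase_b",
    )
    return parser


def main(argv):
    """Return an exit code instead of exiting, which keeps tests simple."""
    try:
        args = build_parser().parse_args(argv)
    except SystemExit as exc:  # argparse exits 0 on --help and 2 on usage errors
        return exc.code
    if args.source != "phase_b":
        print(f"--source {args.source}: implementation deferred")
        return 2
    return 0


def test_help_exits_zero():
    assert main(["generate-plots", "--help"]) == 0


def test_happy_path():
    assert main(["generate-plots", "--source", "phase_b"]) == 0


def test_deferred_source_exits_two():
    assert main(["generate-plots", "--source", "bias_eval"]) == 2


def test_unknown_flag_is_usage_error():
    assert main(["generate-plots", "--nope"]) == 2
```

Functions named test_* in a file under a tests directory are collected by pytest without further configuration.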
Quality
- Coverage: 58 % (no significant change from v2.3.0).
- CI: green on both main and develop.
- Baseline reproducibility: F1 = 0.5867 ± 0.01 holds.
- app.py and cli.py LOC are stable; the only file that grew
meaningfully is src/puma/cli.py (363 → 777 LOC, from inline
command bodies). A refactor into a src/puma/cli/commands/ package was
considered and deferred; the monolith remains the cleaner option
at this size.
Design decisions
- The --source values bias_eval and multi_seed for generate-plots are
accepted by the parser but exit 2 with a deferred-implementation
message because the underlying plotting scripts for those sources
do not yet exist. This matches the Anexo F spec without inventing
data.
- --verify for prepare-datasets currently emits SHA-256 hashes
only. A manifest file (docs/datasets_manifest.json) for full hash
comparison is documented in Anexo F but not yet in the repo; the
command is forward-compatible.
- No src/puma/cli/commands/ refactor was performed in this
release. With 6 new commands the inline monolith is still readable;
the refactor would be justified if/when Section B extensions land.
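The hash-only behaviour of --verify, and how a manifest could later slot in, might look like the sketch below. The manifest shape (file name mapped to hex digest), the helper names and the status strings are assumptions; the actual docs/datasets_manifest.json format is not defined in these notes.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(paths, manifest=None):
    """Return {file name: (hash, status)}.

    Without a manifest entry the status is 'unchecked' (hash emitted only);
    with one, it is 'ok' or 'mismatch'. The manifest shape is hypothetical.
    """
    manifest = manifest or {}
    report = {}
    for path in paths:
        name = Path(path).name
        actual = sha256_of(path)
        expected = manifest.get(name)
        if expected is None:
            status = "unchecked"
        else:
            status = "ok" if expected == actual else "mismatch"
        report[name] = (actual, status)
    return report


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"abc")
    hash_only = verify([sample])
    checked = verify([sample], manifest={"sample.csv": sha256_of(sample)})
    tampered = verify([sample], manifest={"sample.csv": "0" * 64})

for name, (digest, status) in hash_only.items():
    print(f"{digest}  {name}  [{status}]")
```

Keeping the manifest optional is what would let the same command run today (hashes only) and perform a full comparison once a manifest exists.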
Debt tracking
- No new open debt introduced by this release.
- Total resolved across v2.0.0 → v2.4.0: 15 of 24 (62 %).
- Section B of Anexo F is documented design space, not technical
debt. Implementation is optional and conditional on demand.
Known limitations
Unchanged from v2.3.0:
- Single hardware tier evaluated (gpu-entry); models requiring
gpu-mid and above are catalogued but not yet empirically evaluated.
- AMD ROCm and Apple Metal backends not yet detected.
- TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
- input_text not persisted in triage_jira instances (D22, Low).
Upgrade notes
- No breaking changes to existing commands or YAML run-spec schema.
- Six new commands available. See puma <command> --help for usage
or read docs/anexo_F_cli_reference.md § A.2.
- New test directory tests/cli/ joins the existing
tests/unit/ and tests/integration/.
Acknowledgments
Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.