PUMA v2.4.0 — CLI completeness (Anexo F section A.2)

@pumacp released this on 13 May at 03:26

PUMA v2.4.0 Release Notes

Release date: 2026-05-13
Previous release: v2.3.0 (2026-05-13)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 7 (CLI completeness for Anexo F) onto
the v2.3.0 base. It resolves the long-standing gap between the academic
Anexo F catalog and the actual repository state by adding the six
high-value commands from section A.2 of Anexo F, together with a new
source-of-truth document distinguishing implemented commands from
documented design proposals.

Highlights

Anexo F gap resolved

docs/anexo_F_cli_reference.md is now the canonical CLI reference,
split into two sections:

  • Section A — Implemented: A.1 lists commands pre-existing in
    v2.0.0–v2.3.0 (preflight, models, datasets, cache, run,
    validate-baseline, compare, dashboard, report, db). A.2
    lists the six commands added in this release (below). Every command
    in section A is verifiable via puma <command> --help and covered
    by tests under tests/cli/.
  • Section B — Proposed extensions: 5 Bash auxiliary scripts and
    12 further CLI commands (Ollama management, sweep wrappers, DB
    tooling, code-quality wrappers) documented as design space.
    Explicitly marked as not implemented; the decision rationale is
    recorded in the document.

Six new CLI commands (Anexo F § A.2)

Anexo F  Command                  Style
A.2.1    puma prepare-datasets    Thin subprocess wrapper of scripts/prepare_datasets.py
                                  (--dataset, --force-redownload, --verify)
A.2.2    puma wilcoxon            NEW analysis: paired Wilcoxon signed-rank between two
                                  run_ids; uses
                                  puma.metrics.statistical_tests.wilcoxon_signed_rank_models
A.2.3    puma bias-analysis       NEW analysis: bias evaluation report; uses
                                  puma.dashboard.data.load_predictions_with_gold +
                                  puma.metrics.fairness.perturbation_disparity
A.2.4    puma generate-plots      Thin subprocess wrapper of
                                  scripts/generate_phase_b_plots.py (--source phase_b
                                  only; bias_eval/multi_seed exit 2 with
                                  deferred-implementation message)
A.2.5    puma list-runs           New: SQL pivot of runs ⋈ metrics with
                                  --scenario/--model/--last-n/--since filters and --json
A.2.6    puma list-ollama-models  New: parses docker exec puma_ollama ollama list
                                  subprocess output
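
To make "SQL pivot of runs ⋈ metrics" concrete, here is a minimal sketch that joins two tables and turns metric rows into columns with conditional aggregation. The schema, the column names, the accuracy metric, the sample rows, and the list_runs helper are all assumptions made for illustration; PUMA's actual database layout and query may differ.

```python
import json
import sqlite3

# Hypothetical schema: PUMA's real tables and columns may differ.
SCHEMA = """
CREATE TABLE runs (run_id TEXT PRIMARY KEY, scenario TEXT, model TEXT, started_at TEXT);
CREATE TABLE metrics (run_id TEXT, name TEXT, value REAL);
"""

# One output row per run; each metric name becomes its own column.
PIVOT = """
SELECT r.run_id, r.scenario, r.model, r.started_at,
       MAX(CASE WHEN m.name = 'f1_macro' THEN m.value END) AS f1_macro,
       MAX(CASE WHEN m.name = 'accuracy' THEN m.value END) AS accuracy
FROM runs AS r
LEFT JOIN metrics AS m ON m.run_id = r.run_id
WHERE (:scenario IS NULL OR r.scenario = :scenario)
  AND (:model IS NULL OR r.model = :model)
  AND (:since IS NULL OR r.started_at >= :since)
GROUP BY r.run_id
ORDER BY r.started_at DESC
LIMIT :last_n
"""


def list_runs(conn, scenario=None, model=None, since=None, last_n=None):
    """Return pivoted runs as plain dicts; a filter left as None is not applied."""
    params = {
        "scenario": scenario,
        "model": model,
        "since": since,
        # In SQLite a negative LIMIT means "no upper bound".
        "last_n": -1 if last_n is None else last_n,
    }
    return [dict(row) for row in conn.execute(PIVOT, params)]


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("r1", "triage_jira", "model-a", "2026-05-10"),
        ("r2", "triage_jira", "model-b", "2026-05-12"),
    ],
)
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [("r1", "f1_macro", 0.58), ("r1", "accuracy", 0.61), ("r2", "f1_macro", 0.55)],
)

# Rough equivalent of a --json flag: serialise the pivoted rows.
print(json.dumps(list_runs(conn, last_n=1), indent=2))
```

The LEFT JOIN keeps runs that have no metrics yet; their metric columns come back as NULL rather than the run disappearing from the listing.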

Why two commands are NEW analyses and not wrappers: Anexo F § A.2.2
and § A.2.3 specify semantics that diverge from the existing scripts
(positional run_id arguments vs. --run-prefix; --models /
--perturbations filters vs. prefix-only). Rather than mutate the
scripts, the new CLI commands call PUMA's own core helpers directly.
The scripts remain unchanged and continue to support their original
workflows (top-K ranking).
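
For readers unfamiliar with the test behind puma wilcoxon, the following is a self-contained sketch of a paired Wilcoxon signed-rank test using a normal approximation with tie correction. It is not the body of puma.metrics.statistical_tests.wilcoxon_signed_rank_models, whose signature is not shown in these notes; the function name wilcoxon_signed_rank and the sample scores are hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (W, p) where W = min(W+, W-). Zero differences are dropped.
    The approximation is rough for small samples.
    """
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d| in ascending order, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


# Hypothetical per-item scores for two runs evaluated on the same items.
run_a = [10, 12, 14, 16, 18, 20]
run_b = [9, 10, 11, 12, 13, 14]
w_stat, p_value = wilcoxon_signed_rank(run_a, run_b)
print(f"W={w_stat}, p={p_value:.4f}")
```

The test is paired, so both runs must cover the same instances in the same order. For small samples an exact test, such as the one offered by scipy.stats.wilcoxon, is usually preferable to this approximation.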

Tests

  • New: tests/cli/ package with 27 tests across 6 files, one per
    command. Each file tests at minimum: --help exit 0, happy path,
    error paths.
  • Suite total: 318 → 348 passing, 7 deselected (-m 'not ollama').
  • pre-commit run --all-files: all hooks green.
  • puma validate-baseline: PASS f1_macro=0.5831, delta=-0.0036.
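
The three checks listed per command (--help exits 0, happy path, error paths) can be illustrated with a toy parser. This sketch uses argparse and a stand-in main function; these notes do not say which CLI framework PUMA uses, so build_parser, main, and exit_code are hypothetical names and the real tests under tests/cli/ may look quite different. The exit-2 behaviour mirrors what is described for generate-plots under Design decisions.

```python
import argparse
import sys


def build_parser():
    """Toy stand-in for a CLI with a single generate-plots subcommand."""
    parser = argparse.ArgumentParser(prog="puma")
    sub = parser.add_subparsers(dest="command", required=True)
    plots = sub.add_parser("generate-plots")
    plots.add_argument(
        "--source",
        choices=["phase_b", "bias_eval", "multi_seed"],
        required=True,
    )
    return parser


def main(argv):
    args = build_parser().parse_args(argv)
    if args.command == "generate-plots" and args.source != "phase_b":
        # Accepted by the parser, but no plotting script exists for it yet.
        print(f"--source {args.source}: implementation deferred", file=sys.stderr)
        return 2
    return 0


def exit_code(argv):
    """Run main and translate argparse's SystemExit into a plain exit code."""
    try:
        return main(argv)
    except SystemExit as exc:
        return exc.code


help_code = exit_code(["generate-plots", "--help"])                   # --help
happy_code = exit_code(["generate-plots", "--source", "phase_b"])     # happy path
deferred_code = exit_code(["generate-plots", "--source", "bias_eval"])  # deferred source
invalid_code = exit_code(["generate-plots", "--source", "unknown"])   # parser error
```

argparse itself exits with code 0 after printing help and with code 2 on a usage error, so a deferred source and an invalid one are indistinguishable by exit code alone; only the message on stderr differs.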

Quality

  • Coverage: 58 % (no significant change from v2.3.0).
  • CI: green on both main and develop.
  • Baseline reproducibility: F1 = 0.5867 ± 0.01 holds.
  • app.py LOC is stable; the only file that grew meaningfully is
    src/puma/cli.py (363 → 777 LOC, from the inline command bodies).
    A refactor into a src/puma/cli/commands/ package was considered
    and deferred; the monolith remains the cleaner option at this
    size.

Design decisions

  • generate-plots accepts --source bias_eval and --source multi_seed
    at the parser level, but the command then exits with code 2 and a
    deferred-implementation message, because the underlying plotting
    scripts for those sources do not yet exist. This matches the
    Anexo F spec without inventing data.
  • --verify for prepare-datasets currently emits SHA-256 hashes
    only. A manifest file (docs/datasets_manifest.json) for full hash
    comparison is documented in Anexo F but not yet in the repo; the
    command is forward-compatible.
  • No src/puma/cli/commands/ refactor was performed in this
    release. With 6 new commands the inline monolith is still readable;
    the refactor would be justified if/when Section B extensions land.
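
One way to read "forward-compatible" for --verify is sketched below: always compute SHA-256 hashes, and compare them against a manifest only when one is supplied. This is an assumption-laden illustration, not the code in scripts/prepare_datasets.py. In particular, the manifest format (a flat JSON object mapping file names to hex digests), the status labels, and the helper names sha256_of and verify are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(paths, manifest_path=None):
    """Return {file name: (sha256, status)}.

    Without a manifest every file is 'unchecked' and only its hash is
    reported. With a manifest (assumed format: {"name": "hexdigest"}),
    files listed in it are marked 'ok' or 'mismatch'.
    """
    expected = {}
    if manifest_path is not None and Path(manifest_path).exists():
        expected = json.loads(Path(manifest_path).read_text())
    report = {}
    for path in paths:
        name = Path(path).name
        actual = sha256_of(path)
        if name not in expected:
            status = "unchecked"
        elif expected[name] == actual:
            status = "ok"
        else:
            status = "mismatch"
        report[name] = (actual, status)
    return report


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "sample.csv"
    data.write_bytes(b"id,label\n1,bug\n")

    # Hashes only: no manifest is available.
    no_manifest = verify([data])

    # Hypothetical manifest file, present only in this demo.
    manifest = Path(tmp) / "manifest.json"
    manifest.write_text(json.dumps({"sample.csv": sha256_of(data)}))
    with_manifest = verify([data], manifest)

    manifest.write_text(json.dumps({"sample.csv": "0" * 64}))
    tampered = verify([data], manifest)
```

Because the hash-only path and the manifest path share the same report shape, adding a manifest later would not change how callers consume the output.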

Debt tracking

  • No new open debt introduced by this release.
  • Total resolved across v2.0.0 → v2.4.0: 15 of 24 (62 %).
  • Section B of Anexo F is documented design space, not technical
    debt. Implementation is optional and conditional on demand.

Known limitations

Unchanged from v2.3.0:

  • Single hardware tier evaluated (gpu-entry); models requiring
    gpu-mid and above catalogued but not yet empirically evaluated.
  • AMD ROCm and Apple Metal backends not yet detected.
  • TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
  • input_text not persisted in triage_jira instances (D22, Low).

Upgrade notes

  • No breaking changes to existing commands or YAML run-spec schema.
  • Six new commands available. See puma <command> --help for usage
    or read docs/anexo_F_cli_reference.md § A.2.
  • New test directory tests/cli/ joins the existing
    tests/unit/ and tests/integration/.

Acknowledgments

Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.