04 Jun 09:42

pumacp

6671108

PUMA reproducibility anchor (v2.7.0-baseline-anchor)

Canonical, SHA-pinned reproducibility anchor for the PUMA empirical baseline.

Commit: 6671108

Verified reproducible metrics (qwen2.5:3b, N=200, seed=42, temperature=0.0):

triage_jira F1-macro = 0.5867 +/- 0.01 (contextual-anchoring)
estimation_tawos MAE = 5.7150 +/- 0.05 SP (zero-shot)

Reproduce:
puma validate-baseline --expected-f1 0.5867 --tolerance 0.01
puma validate-baseline --expected-mae 5.7150 --tolerance 0.05

Reproducibility gates F1=0.5867 and MAE=5.7150 are bit-exact stable across
releases v2.4.0, v2.5.0, v2.6.0 and v2.7.0.
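
The two gates above amount to a symmetric tolerance check on a measured metric. A minimal sketch follows, assuming an inclusive band (`abs(delta) <= tolerance`); the function name `within_tolerance` is hypothetical and is not taken from the PUMA code base. The measured values are the ones quoted in the release notes further down (f1=0.5831, mae=5.7150, and the contaminated MAE of about 6.3150).

```python
def within_tolerance(measured: float, expected: float, tolerance: float) -> tuple[bool, float]:
    """Return (passed, delta) for a symmetric tolerance band.

    Assumption: the band is inclusive; the real gate may be strict.
    """
    delta = measured - expected
    return abs(delta) <= tolerance, delta


if __name__ == "__main__":
    # F1 gate: measured 0.5831 against the 0.5867 anchor, tolerance 0.01.
    print(within_tolerance(0.5831, 0.5867, 0.01))
    # MAE gate: measured 5.7150 against the 5.7150 anchor, tolerance 0.05.
    print(within_tolerance(5.7150, 5.7150, 0.05))
    # A state-contaminated MAE run (about 6.3150) falls outside the band.
    print(within_tolerance(6.3150, 5.7150, 0.05))
```

Under this reading, an F1 of 0.5831 passes because its delta of about -0.0036 is inside the 0.01 band, while a drift of about 0.6 SP on MAE fails.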

02 Jun 02:50

pumacp

v4.0.0

3335f14

PUMA v4.0.0 — Sprint 12 closure

PUMA is a local, reproducible benchmarking framework for open LLMs on
project-management tasks (issue triage and effort estimation), run entirely
on your own hardware via Ollama — deterministic, offline by default, and
verifiable end to end.

This release is validated by a real-world milestone. PUMA's federated
community-submission infrastructure is now operational and was proven end to
end by the first official production submission — qwen2.5:3b on
triage_jira / zero_shot, F1-macro 0.3898 — which landed at
pumacp/puma-community#8
(merge SHA 111cee36) and was mirrored to the Hugging Face submissions dataset.
That F1 is the reproducible floor anchor for the zero-shot strategy on
triage_jira; see docs/first-submission.md.

Sprint 12 highlights

#45 — PyPI + Docker (ghcr.io) publishing workflows + Dockerfile.publish; pyproject.toml hardened (distribution name puma-cp)
#46 — Multi-model comparison dashboard view + corporate Streamlit palette
#47 — README channel-directory restructure; acrostic visual flexibility
#48 — mkdocs full content sync (nav 6 → 28 pages); D30 resolved
#49 — Manual IDE contribution workflow docs
#50 — Security audit MVP: pip-audit + bandit + gitleaks + Trivy + SECURITY.md + threat model
#51 — Consolidated technical reference (~5100 words, 17-decision timeline)
#52 — Inaugural production submission documented (docs/first-submission.md)

Installation

pip install puma-cp==4.0.0

The container image (ghcr.io/pumacp/puma:4.0.0) is not yet published for
this release: the Trivy gate in the publish pipeline blocked it on 3 HIGH-severity
base-image CVEs (0 CRITICAL). This is the security gate working as designed; the
image will be re-published once the base image is patched (tracked for S12.19).
Use the PyPI package in the meantime.

Quick start

See the documentation site and the
getting-started overview.

What's in v4.0.0

Added — federated submission infrastructure (S12.15) validated by the
inaugural submission; PyPI + ghcr.io publishing; multi-model dashboard view;
consolidated technical reference; manual IDE contribution workflow; security
audit MVP; corporate monochrome visual identity; acrostic visual flexibility.

Changed — mkdocs nav 6 → 28 public pages; read-only puma models
sub-group (D30 resolved); pyproject.toml hardened; project version → 4.0.0.

Security — pip-audit / bandit / gitleaks on every push; Trivy on every
container publish (blocks HIGH/CRITICAL); SECURITY.md disclosure policy;
9-check submission validation pipeline.

Infrastructure — Dockerfile.publish (multi-stage, non-root, OCI labels);
GitHub Pages live with a 28-page nav; Hugging Face dataset mirror operational.

Full detail in CHANGELOG.md.

Known limitations (deferred to S12.19 / post-Sprint-12)

See docs/known_debt.md.

D38 — validate-submission workflow references a non-existent action version.
D39 — verify-integrity workflow broken by gradio_client API drift; the inaugural submission is therefore self-attested.
D40 — puma share-results CLI hangs after the Review panel.
Container publish blocked by 3 HIGH base-image CVEs (re-publish after patching).
notify-discord lacks the DISCORD_WEBHOOK secret (optional integration).

Acknowledgments

Thanks to everyone who tested the submission pipeline end to end and helped land
the first official community submission — the milestone that validates this release.

25 May 01:37

pumacp

v3.1.0

62c8316

PUMA v3.1.0 — Sprint 11' Post-v3.0.0 Reconciliation

Release date: 2026-05-25
Previous release: v3.0.0 (2026-05-20)
Tag: v3.1.0

Overview

Sprint 11' is a post-v3.0.0 hardening Sprint: it reconciles the v3.0.0
release artifacts (version strings, CHANGELOG, release notes), completes the
puma community CLI subgroup, applies minimal repairs to the wiki-sync
workflows and the Verifier Space, and cleans residual documentation drift
while preserving historical accuracy. No breaking changes.

Highlights

New puma community Typer group with four subcommands — browse,
pull, verify-hash, validate (Anexo F F.16.5–F.16.8). The GitHub
Contents API is read over httpx (mockable in tests); verify-hash
recomputes the predictions hash byte-identically to the client and is
D23-aware on --remote.
Verifier Space hash output now matches schema v1.0.0 — the
pumaproject/puma-verifier Space no longer emits the sha256: prefix, so
its digest conforms to ^[a-f0-9]{64}$.
Wiki sync works again in both repos — the wiki-sync.yml workflows now
declare contents: write, so the GitHub Wiki publishes on push; both
/wiki pages render (HTTP 200).
D23 technical debt documented: the Verifier (2-field JSONL) and client
(4-field DB CSV) hash algorithms differ by construction; reconciliation is
deferred to v4.x with a schema decision.
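
The two hash-related highlights can be illustrated with a short sketch. It assumes SHA-256 over UTF-8 text; the row contents, the field names, and both serialisations are hypothetical stand-ins, not the actual Verifier or client formats. It shows (a) reducing a prefixed digest to the bare form required by `^[a-f0-9]{64}$`, and (b) why hashing two different input shapes of the same predictions cannot produce matching digests.

```python
import hashlib
import json
import re

# Pattern quoted in the release notes for schema v1.0.0 digests.
DIGEST_RE = re.compile(r"^[a-f0-9]{64}$")


def normalise_digest(value: str) -> str:
    """Strip an optional 'sha256:' prefix and require a bare hex digest."""
    bare = value.removeprefix("sha256:")
    if DIGEST_RE.fullmatch(bare) is None:
        raise ValueError(f"not a bare SHA-256 hex digest: {value!r}")
    return bare


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical prediction rows: (id, prediction, plus two extra columns).
ROWS = [("ISSUE-1", "bug", "run-a", "0"), ("ISSUE-2", "task", "run-a", "1")]

# Shape 1: two fields per record, one JSON object per line (illustrative).
jsonl_two_fields = "\n".join(
    json.dumps({"id": r[0], "prediction": r[1]}, sort_keys=True) for r in ROWS
)
# Shape 2: four comma-separated fields per record (illustrative).
csv_four_fields = "\n".join(",".join(r) for r in ROWS)

verifier_style = sha256_hex(jsonl_two_fields)
client_style = sha256_hex(csv_four_fields)

if __name__ == "__main__":
    print(normalise_digest("sha256:" + client_style) == client_style)
    print(verifier_style == client_style)
```

Because the hashed bytes differ, the digests differ regardless of whether the underlying predictions agree, which is the "mismatch by construction" that D23 records.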

Quality at release time

mypy --strict: 0 errors / 81 files
pytest -m "not ollama": 597 passed, 1 skipped, 7 deselected
pytest -m ollama: 7 passed
F1 triage baseline: 0.5831 ± 0.01 (Δ = −0.0036)
MAE estimation baseline: 5.7150 bit-exact (Δ = +0.0000)
Coverage on community CLI: browse 80%, pull 87%, verify 85%, validate 89%
Schema v1.0.0 unchanged (P3); zero federation references in code (P4).

Known limitations

D23: puma community verify-hash --remote returns mismatch by
construction for schema v1.0.0 submissions even when the local hash is
correct, because the Verifier Space hashes a different input shape than the
client. Local verification (the canonical path) is unaffected.
Reconciliation deferred to v4.x with a schema decision. See
docs/known_debt.md.
Kaggle mirror activation (S11'.6): the mirror-kaggle.yml workflow is
fully hardened (companion PR pumacp/puma-community#4: --dir-mode zip,
robust create-vs-version, CC-BY-4.0 license, 50-char-safe title,
post-publish HEAD verification). Publication is pending a Kaggle-internal
soft-delete grace period that reserves the slug
pumacp/puma-community-submissions; it resolves automatically once Kaggle
releases the slug — no further code action is needed.

Upgrade path

No breaking changes since v3.0.0. Refresh with:

git pull && pip install -e '.[dev]'

PUMA v2.7.0 Release Notes

Release date: 2026-05-16
Previous release: v2.6.0 (2026-05-16)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 10 (catalog expansion) onto the
v2.6.0 base. It adds two Alibaba Qwen3 family entries to the
catalog — both verified against the Ollama registry before
inclusion — and formally excludes Kimi K2.6 after a 13-tag
registry probe confirmed it is not distributed via Ollama. The
catalog schema is preserved at the 8 fields used since v2.0.0;
all v2.7.0 metadata lives within the existing notes field per
the project's minimum-complexity discipline.

Highlights

Two Qwen3 entries — registry-verified before cataloguing

| Tag | Type | GGUF (verified) | Profile | Notes |
|---|---|---|---|---|
| `qwen3:30b` | Dense | 17.3 GB | `gpu-high` | Hybrid Gated DeltaNet + self-attention, 262144 context |
| `qwen3:30b-a3b` | MoE | 17.3 GB | `gpu-high` | 30B total / ~3B active per token; F8/D18 MoE caveat preserved in `notes` |

Both entries:

Verified via Ollama registry manifest probe
(registry.ollama.ai/v2/library/qwen3/manifests/* returned HTTP
200 with the GGUF size derived from the sum of layer sizes).
Declare logprobs_supported: false conservatively until
empirical verification on appropriate hardware.
Excluded from gpu-entry AND every apple-silicon-* profile by
the P11 pending-validation invariant. Five new regression-guard
tests in tests/unit/test_catalog_metadata.py pin this contract
via exact-equality on profiles_compatible == ['gpu-high'].
params_b: 30.0 follows the gemma4:26b-a4b precedent (TOTAL
when the tag encodes both numbers). The MoE/F8 caveat in the
notes field documents that active-params count does NOT
predict GGUF size or runtime VRAM consumption.
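
The size derivation mentioned above (GGUF size as the sum of layer sizes in a registry manifest) can be sketched as a pure function. This assumes the manifest is a JSON object with a `layers` list whose items carry a byte count under `size`, and that sizes are reported in decimal gigabytes; both points are assumptions, and the example manifest is synthetic rather than a real registry response. Fetching the manifest over HTTP is deliberately left out.

```python
def gguf_size_gb(manifest: dict) -> float:
    """Sum the layer sizes of a manifest and convert bytes to decimal GB.

    Assumptions: layers live under 'layers', each with an integer 'size'
    in bytes, and 1 GB = 1e9 bytes.
    """
    total_bytes = sum(layer["size"] for layer in manifest.get("layers", []))
    return round(total_bytes / 1e9, 1)


# Synthetic manifest for illustration only; digests and sizes are made up.
EXAMPLE_MANIFEST = {
    "layers": [
        {"digest": "sha256:<model-weights>", "size": 17_280_000_000},
        {"digest": "sha256:<template-and-params>", "size": 20_000_000},
    ]
}

if __name__ == "__main__":
    print(gguf_size_gb(EXAMPLE_MANIFEST))
```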

Kimi K2.6 — formally excluded after 13-tag registry probe

A registry probe on 2026-05-16 returned HTTP 404 on every
plausible Ollama tag naming for Kimi K2.6:

kimi-k2:6              kimi-k2:latest        kimi-k2:1t
kimi-k2:1t-instruct    kimi-k2:0905          kimi-k2:base
kimi-k2:instruct       kimi:latest           kimi-k2.6:latest
moonshot:latest        moonshot:kimi-k2      kimi-k2-base:latest
kimi-k2-instruct:latest

The model is not distributed via the Ollama registry as of the
v2.7.0 cut. Cataloguing a non-existent ollama_tag would violate
the project's empirical-first principle (P10) and produce a
broken puma models pull command for users following the catalog
metadata. The exclusion decision is recorded in
docs/CATALOG_HISTORY.md v2.7.0 § "Considered but not catalogued"
with the full probe table for academic traceability. It may be
reconsidered in a future release if Moonshot AI or a third-party
distributor publishes K2.6 to the Ollama registry.

Deferred — known on Ollama but out of v2.7.0 scope

The registry probe confirmed these tags exist (HTTP 200) but they
are deferred from v2.7.0 for scope discipline:

| Tag | Real GGUF | Reason |
|---|---|---|
| `qwen3:32b` (dense) | 18.8 GB | Marginal upgrade over `qwen3:30b`; defer until empirical validation on gpu-high can distinguish them |
| `qwen3:235b-a22b` (MoE) | 132.4 GB | Requires multi-GPU rigs well beyond gpu-high (24+ GB VRAM); pending hardware tier extension |
| `qwen3-coder:30b`, `qwen3-coder:480b` | n/a | Coder family is task-specific; out of scope for PMO benchmarks |

Schema unchanged — minimum-complexity preserved

The original Sprint 10 plan proposed ~12 new YAML fields (family,
parameters_total_b, parameters_active_b, profile_recommended,
size_gb_disk_estimate, size_gb_vram_estimate, quantization,
license, release_date, capabilities, empirical_validation,
validation_blockers). The user's minimum-complexity decision
kept the catalog at the v2.0.0–v2.6.0 schema (8 fields:
ollama_tag, params_b, gguf_size_gb, context_window,
logprobs_supported, profiles_compatible, timeout_s,
notes). All v2.7.0 metadata (license, release date, MoE caveat,
validation blockers, architecture details) lives within
multi-line notes: text. src/puma/preflight/catalog.py and the
ModelEntry dataclass are byte-identical to v2.6.0.
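
For orientation, the 8-field schema listed above could be modelled roughly as follows. This is a sketch only: the class name `ModelEntrySketch`, the field types, the `timeout_s` value, and the notes text are assumptions for illustration and are not copied from `src/puma/preflight/catalog.py`.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ModelEntrySketch:
    """Illustrative stand-in for the 8-field catalog entry (types assumed)."""

    ollama_tag: str
    params_b: float
    gguf_size_gb: float
    context_window: int
    logprobs_supported: bool
    profiles_compatible: tuple[str, ...]
    timeout_s: int
    notes: str = ""


# Values for qwen3:30b as reported in these notes, except timeout_s and
# the notes text, which are hypothetical placeholders.
EXAMPLE_ENTRY = ModelEntrySketch(
    ollama_tag="qwen3:30b",
    params_b=30.0,
    gguf_size_gb=17.3,
    context_window=262144,
    logprobs_supported=False,
    profiles_compatible=("gpu-high",),
    timeout_s=600,  # hypothetical value
    notes="license: <see catalog>\nvalidation: pending",  # placeholder text
)

if __name__ == "__main__":
    print([f.name for f in fields(ModelEntrySketch)])
```

Keeping licence, release date, and validation blockers inside the multi-line `notes` string is what lets the field count stay at eight.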

Invariants generalised, not relaxed

The pending-validation exclusion from gpu-entry (established in
Sprint 9 for Apple Silicon entries) is reaffirmed for the new
Qwen3 entries and extended to every apple-silicon-* profile via
explicit tests:

| Test | Status | Invariant |
|---|---|---|
| `test_gemma4_family_excluded_from_gpu_entry` | PASSED (preserved) | D18/F8 (Sprint 2) |
| `test_gemma4_family_not_compatible_with_any_apple_silicon` | PASSED (preserved) | P6 extension to Apple Silicon (Sprint 9) |
| `test_qwen3_entries_excluded_from_gpu_entry` | PASSED (new) | P10/P11 (Sprint 10) |
| `test_qwen3_entries_excluded_from_all_apple_silicon` | PASSED (new) | P11 generalisation across profile families |
| `test_qwen3_entries_target_gpu_high_only` | PASSED (new) | Exact-equality anchor against accidental loosening |

The pattern is now: new entries default to the safest profile
only; loosening requires empirical evidence and an explicit
debt-tracker entry referencing the prior exclusion.
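
The difference between an exact-equality anchor and a simple membership check can be shown with a small sketch. The dictionaries below are stand-ins for the catalog, and `loosened_entries` is a hypothetical helper, not one of the tests named above.

```python
def loosened_entries(catalog: dict[str, list[str]], tags: list[str]) -> list[str]:
    """Return the tags whose profile list is not exactly ['gpu-high']."""
    return [tag for tag in tags if catalog.get(tag) != ["gpu-high"]]


QWEN3_TAGS = ["qwen3:30b", "qwen3:30b-a3b"]

# Stand-in catalog matching the pinned contract.
PINNED = {"qwen3:30b": ["gpu-high"], "qwen3:30b-a3b": ["gpu-high"]}

# Stand-in catalog after an accidental edit that adds a second profile.
LOOSENED = {"qwen3:30b": ["gpu-high", "gpu-mid"], "qwen3:30b-a3b": ["gpu-high"]}

if __name__ == "__main__":
    print(loosened_entries(PINNED, QWEN3_TAGS))
    # A membership check would still pass here, which is why equality is used.
    print("gpu-high" in LOOSENED["qwen3:30b"])
    print(loosened_entries(LOOSENED, QWEN3_TAGS))
```

A membership assertion only proves that `gpu-high` is present; the equality form also fails when any other profile is added, so loosening has to be a deliberate change to the test.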

Tests

402 → 407 passing (-m "not ollama"), 7 deselected.
5 new regression-guard tests in
tests/unit/test_catalog_metadata.py (see invariant table
above).
tests/unit/test_preflight_catalog.py::test_load_catalog_returns_all_entries:
entry-count expectation updated 15 → 17 to reflect the two new
Qwen3 additions.
pre-commit run --all-files: all hooks green.
puma validate-baseline (triage, F1 path, fresh Ollama):
PASS f1=0.5831, delta=-0.0036, ±0.01.
puma validate-baseline --expected-mae 5.7150 (estimation,
fresh Ollama): PASS mae=5.7150, delta=+0.0000, ±0.05 —
bit-exact.

Quality

Coverage: 61 % (no significant change from v2.6.0; new entries
are YAML-only, no Python statements added).
CI: green on both main and develop. The
integration-tests-ollama job introduced in v2.5.0 continues to
run on push to those branches.
Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
preserved; MAE = 5.7150 ± 0.05 on estimation_tawos preserved.
Linux + NVIDIA dispatch byte-identical to v2.6.0 (no new
profiles, no new dispatch logic, no new code paths). The Qwen3
entries appear in models_for_profile('gpu-high') only.

Design decisions

Schema unchanged at 8 fields. The original Sprint 10 plan proposed
about 12 new fields; the schema was deliberately kept minimal. All
metadata that would have required new fields now lives in the
notes multi-line text. Governed by P5 (additive over
modification) and the project's minimum-complexity discipline.
Real Ollama tags only. Every catalogued ollama_tag is
verified against registry.ollama.ai/v2/library/<repo>/manifests/<tag>
before inclusion. The originally-planned qwen3:27b and
qwen3:35b-a3b were remapped to the real qwen3:30b and
qwen3:30b-a3b after probe; Kimi K2.6 was removed entirely
after every plausible tag returned 404.
Conservative params_b for MoE. The qwen3:30b-a3b entry
declares params_b: 30.0 following the gemma4:26b-a4b
precedent — TOTAL params when the tag encodes both numbers. The
F8/D18 caveat in notes documents that active-params count
does NOT predict VRAM consumption.
logprobs_supported: false conservatively. The Qwen3 family
announces logprob support upstream, but PUMA has not yet
empirically verified token-level confidence on these specific
tags. Flipping to true is part of the empirical validation
protocol when hardware becomes available.
gpu-high as the only target. 17.3 GB GGUF exceeds gpu-mid's
12–24 GB upper bound once OS + context overhead are accounted
for; gpu-high (24+ GB VRAM) is the only safe default. The
exact-equality anchor test
test_qwen3_entries_target_gpu_high_only pins this so future
loosening requires deliberate intent.

Debt tracking

No new open debt introduced by this release.
No closure of pre-existing debt — Sprint 10 is
forward-looking catalog expansion, not a debt-paydown Sprint.
Empirical validation of qwen3:30b and qwen3:30b-a3b is the
explicit follow-up; tracked via the notes text on each entry
and via the validation roadmap in
docs/CATALOG_HISTORY.md § "Empirical validation roadmap".

Known limitations

Unchanged from v2.6.0:

Single hardware tier empirically evaluated (gpu-entry);
models requiring gpu-mid/gpu-high are catalogued but not
yet validated.
AMD ROCm not yet detected.
All apple-silicon-* profiles declare
empirical_validation: pending (Sprint 9 forward-work).
TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion
3).

New in v2.7.0:

Both qwen3:30b and qwen3:30b-a3b are catalogued with
validation pending. Users on gpu-high hardware can use them
via puma run and report empirical results to close the
validation gap.
Kimi K2.6 is not distributed via Ollama; the catalog
intentionally omits it. If a third-party distributor publishes
K2.6 to Ollama, a future Sprint can revisit cataloguing.

Empirical validation roadmap (when gpu-high hardware available)

The protocol for closing the validation gap is documented in
docs/CATALOG_HISTORY.md v2.7.0 § "Empirical validation roadmap":

Pull the model via ollama pull qwen3:30b and verify the
digest matches the registry manifest probed at cataloguing.
Run the canonical baselines: triage_jira (F1) and
estimation_tawos (MAE).
Measure parse_failure_rate (should be 0 for usable models)
and reproducibility (bit-exact under T=0.0 + seed=42 on the
gpu-high hardware).
If validation succeeds: bump logprobs_supported to true
(after a logprobs-enabled probe), extend profiles_compatible
to vetted Apple Silicon Max/Ultra variants (≥36 GB unified
memory) pending separate Apple-side v...

16 May 02:40

pumacp

v2.6.0

61a6781

PUMA v2.6.0 — Apple Silicon M3/M4/M5 support

PUMA v2.6.0 Release Notes

Release date: 2026-05-16
Previous release: v2.5.0 (2026-05-16)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 9 (Apple Silicon M3/M4/M5 support)
onto the v2.5.0 base. It adds first-class detection of Apple Silicon
hosts (9 new profile identifiers covering M3 base/Pro/Max, M4
base/Pro/Max, M5 base/Pro/Max, and M5 Ultra), a native runtime mode
via ./start_puma.sh --native that boots Ollama with Metal on
macOS (no Docker), macOS-aware CodeCarbon tracking with a
powermetrics-availability probe, and the cross-architecture
reproducibility question documented as a testable hypothesis for a
future empirical close-out.

Empirical validation status: all apple-silicon-* profiles
declare empirical_validation: pending. PUMA's validation hardware
is the RTX 2060 Mobile 6 GB (gpu-entry); Apple Silicon hardware
joins the validation set in a future Sprint when MacBook M-series
hardware becomes available to the project. The dispatch
infrastructure shipped here enables that validation; the testing
protocol is documented in docs/CROSS_ARCH_REPRODUCIBILITY.md
§ "Testing protocol".

Highlights

Apple Silicon catalogued end-to-end

| Layer | Artefact |
|---|---|
| Profiles | 9 new `apple-silicon-*` identifiers in `config/profiles.yaml`; schema extended additively with `apple_silicon_required`, `chip_brand_match`, `min_unified_memory_gb` |
| Catalog | `catalog_version` bumped 2.5.0 → 2.6.0; conservative `profiles_compatible[]` additions per a ≈ 2× GGUF memory-headroom rule |
| Detection | `src/puma/preflight/apple_silicon.py` (NEW): platform-isolated sysctl-based detection; testable on Linux via `unittest.mock` |
| Dispatch | `SystemCapabilities` gains `chip_brand` + `unified_memory_gb`; `Profile` gains optional Apple Silicon fields; `select_profile()` runs a new branch BEFORE the existing GPU/CPU dispatch |
| Runtime | `./start_puma.sh --native` boots native Ollama + Python venv on macOS; `./stop_puma_native.sh` teardown companion |
| Emissions | `get_tracking_mode_and_warnings()` resolver in `puma.sustainability.codecarbon_wrapper`; powermetrics probe; graceful process-mode fallback on macOS |
| Docs | `docs/CROSS_ARCH_REPRODUCIBILITY.md` (NEW); extensions in `MACOS_NOTES.md`, `HARDWARE.md`, `CATALOG_HISTORY.md` |
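
A platform-isolated, sysctl-based probe of the kind described in the Detection row might look like the sketch below. It is not the contents of `apple_silicon.py`: the function name `read_chip_brand`, the use of the `machdep.cpu.brand_string` key, the 2-second timeout, and the injectable `run`/`system`/`machine` parameters are all assumptions chosen so the logic can be exercised on Linux without macOS.

```python
import platform
import subprocess
from typing import Callable, Optional


def read_chip_brand(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    system: Optional[str] = None,
    machine: Optional[str] = None,
) -> Optional[str]:
    """Return the CPU brand string on Darwin/arm64, else None.

    Every failure mode degrades to None so non-macOS callers are unaffected.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if (system, machine) != ("Darwin", "arm64"):
        return None  # platform gate: never spawns a subprocess elsewhere
    try:
        result = run(
            ["sysctl", "-n", "machdep.cpu.brand_string"],  # assumed key
            capture_output=True,
            text=True,
            timeout=2,
            check=True,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    print(read_chip_brand())
```

Injecting `run` plays the same role as patching with `unittest.mock`: the Darwin branch and its three failure modes can be tested on any host.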

Conservative model compatibility

Apple Silicon variants need at least roughly 2× the model's GGUF
size in unified memory plus a small OS overhead. Applied as a rule:

| Model | Compatible Apple Silicon variants |
|---|---|
| `qwen2.5:1.5b`, `gemma3:1b` | All 10 (m3 base → m5-ultra) |
| `qwen2.5:3b`, `gemma3:4b` | All except m3 base (8 GB tight) |
| `qwen2.5:7b`, `mistral:7b`, `llama3.1:8b`, `deepseek-r1:7b` | Pro / Max / Ultra only (≥ 18 GB) |
| `qwen2.5:14b`, `deepseek-r1:14b`, `gemma3:27b` | Max / Ultra only (≥ 36 GB) |
| `gemma3:12b` | m3-max, m4-pro+, m5-pro+ (≥ 24 GB) |
| `gemma4:*` | Excluded from every `apple-silicon-*` profile (P6 enforcement) |
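
The headroom rule behind this table can be written as a one-line predicate. The sketch assumes a fixed OS overhead of 2 GB, which is a placeholder (the notes only say "a small OS overhead"), and the model sizes in the usage lines are illustrative numbers rather than catalog values.

```python
def fits_unified_memory(
    gguf_size_gb: float,
    unified_memory_gb: float,
    os_overhead_gb: float = 2.0,  # assumed placeholder, not a PUMA constant
) -> bool:
    """Apply the 'roughly 2x GGUF size plus OS overhead' headroom rule."""
    return unified_memory_gb >= 2.0 * gguf_size_gb + os_overhead_gb


if __name__ == "__main__":
    # Illustrative sizes only: a ~2 GB model on 8 GB, a ~4.7 GB model on 8 and 18 GB.
    print(fits_unified_memory(2.0, 8))
    print(fits_unified_memory(4.7, 8))
    print(fits_unified_memory(4.7, 18))
```

The actual catalog applies the rule conservatively per entry, so individual rows may be stricter than this predicate alone would suggest.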

gemma4 family — exclusion preserved AND extended

The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
stays excluded from gpu-entry per F8 / D18.
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. v2.6.0 extends the exclusion with a new invariant
test
test_gemma4_family_not_compatible_with_any_apple_silicon
ensuring the same VRAM-pressure failure mode is not re-introduced
on small unified-memory variants by accidental copy-paste during
future catalog edits. Re-enabling any (gemma4, apple-silicon-*)
pair requires new empirical evidence on Mac hardware and an
explicit debt entry referencing the prior exclusion.

CodeCarbon survives on macOS (Mode B)

tracking_mode="machine" is what PUMA's split-container Linux
architecture relies upon (the D15 fix). On macOS Mode B without
passwordless powermetrics, machine-mode silently fails to record
energy. v2.6.0 adopts a graceful fallback in
get_tracking_mode_and_warnings():

Passwordless powermetrics configured → tracking_mode="machine", no warnings.
Default macOS state (sudo required) → tracking_mode="process" + one warning pointing at docs/MACOS_NOTES.md.
puma run --no-emissions → tracking disabled entirely.

Linux + NVIDIA path is byte-identical to v2.5.0:
tracking_mode="machine" with no warnings.
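
The three cases above, plus the unchanged Linux path, reduce to a small decision function. The sketch below is an assumption-laden stand-in for `get_tracking_mode_and_warnings()`: the name `resolve_tracking_mode`, the boolean parameters, the warning wording, and the use of `None` for "tracking disabled" are all hypothetical.

```python
from typing import Optional


def resolve_tracking_mode(
    is_apple_silicon: bool,
    powermetrics_passwordless: bool,
    emissions_enabled: bool = True,
) -> tuple[Optional[str], list[str]]:
    """Return (tracking_mode, warnings) following the rules listed above."""
    if not emissions_enabled:
        return None, []  # e.g. --no-emissions: tracking disabled entirely
    if not is_apple_silicon:
        return "machine", []  # Linux + NVIDIA path, unchanged
    if powermetrics_passwordless:
        return "machine", []
    return "process", [
        "powermetrics requires sudo; falling back to process mode "
        "(see docs/MACOS_NOTES.md)"
    ]


if __name__ == "__main__":
    print(resolve_tracking_mode(is_apple_silicon=False, powermetrics_passwordless=False))
    print(resolve_tracking_mode(is_apple_silicon=True, powermetrics_passwordless=False))
```

Taking the platform and sudoers state as plain booleans keeps the decision logic separate from the probes, so each branch can be tested without invoking a subprocess.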

Cross-architecture reproducibility — open question, testable hypothesis

v2.6.0 frames bit-exact reproducibility between x86_64 Linux and
arm64 macOS as an open empirical question. The Q4_K_M integer
quantisation makes F1 and MAE expected to be bit-exact across
architectures; logprobs (and therefore ECE) are expected to differ
by FP rounding. The document records H0 / H1 / H2 / H3 hypotheses
and a 6-step testing protocol for closing them out when Mac
hardware joins the validation set. See
docs/CROSS_ARCH_REPRODUCIBILITY.md.

Tests

354 → 402 passing (-m "not ollama"), 7 deselected.
New tests/unit/test_apple_silicon.py: 28 tests covering every
public entry point of the new module with mocks for the
Darwin/arm64 gate, sysctl success + 3 failure modes
(FileNotFoundError, TimeoutExpired, CalledProcessError), a
parametrised mapping for all 10 chip brands, forward-compat for
unmapped chips, the get_apple_silicon_info dict shape, and
consistency checks on CHIP_BRAND_TO_PROFILE.
New tests/unit/test_codecarbon_macos.py: 7 tests for the
tracking-mode helper and powermetrics probe — Linux
short-circuit (never invokes subprocess), macOS with/without
sudoers, probe behaviour on FileNotFoundError / non-zero exit /
zero exit.
Extended tests/unit/test_catalog_metadata.py: +5 tests
(VALID_PROFILES inclusion, profiles.yaml definitions for all 9
apple-silicon-*, chip_brand_match uniqueness, gemma4 exclusion
from every apple-silicon-*, qwen2.5:3b anchor on
apple-silicon-m4-pro). Pre-existing
test_model_metadata_is_internally_consistent and
test_gemma4_family_excluded_from_gpu_entry preserved
unchanged.
Extended tests/unit/test_preflight_profile.py: +7 tests for the
auto-dispatch path (M4/M4 Pro/M5 Max), boundary at 8 GB unified
memory, fall-through cases for insufficient unified memory,
unmapped chips, and non-Apple chip-brand values; +1 manual
override test for apple-silicon-m4.
pre-commit run --all-files: all hooks green.
puma validate-baseline (triage, F1 path, fresh Ollama):
PASS f1=0.5831, delta=-0.0036, ±0.01.
puma validate-baseline --expected-mae 5.7150 (estimation,
fresh Ollama): PASS mae=5.7150, delta=+0.0000, ±0.05 —
bit-exact.

Quality

Coverage: 61 % (no significant change from v2.5.0).
CI: green on both main and develop. The integration-tests-ollama
job introduced in v2.5.0 continues to run on push to those
branches only.
Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
preserved. MAE = 5.7150 ± 0.05 on estimation_tawos preserved.
src/puma/cli.py LOC unchanged from v2.5.0; the only
meaningfully-grown file is src/puma/preflight/apple_silicon.py
(NEW, 141 LOC).

Design decisions

Linux path byte-identical to v2.5.0. Every Apple Silicon
code path returns None / no-ops when caps.chip_brand is None
or is_apple_silicon() is False, so no CI invocation on Linux
changes behaviour. Governed by P5 (additive over modification)
and P3 (reproducibility non-negotiable).
apple-silicon-* profiles ship with empirical_validation: pending. Cataloguing without validation is a deliberate
signalling choice — the dispatch infrastructure exists, the
numbers do not. This frames the future Mac-hardware Sprint as
empirical close-out rather than groundwork, and it aligns with the
F8/D18 lesson: nominal specifications do not predict runtime
compatibility on constrained hardware.
gemma4 stays excluded across both gpu-entry AND
apple-silicon-*. P6 generalises from "do not re-introduce
previously-rejected (model, profile) pairs" to "do not
re-introduce the failure mode in a different profile family".
get_tracking_mode_and_warnings() is additive. The Linux
branch returns ("machine", []), byte-identical to the
v2.5.0-hardcoded value. The macOS branch is the new behaviour;
it only runs when is_apple_silicon() returns True, which is
False on every Linux test runner.
Cross-arch reproducibility documented as testable, not as
fact. v2.6.0 ships a hypothesis (H0/H1/H2/H3) and a protocol,
not a claim of bit-exactness. Honest about the empirical gap.

Debt tracking

No new open debt introduced by this release.
No closure of pre-existing debt — Sprint 9 is forward-looking
infrastructure, not a debt-paydown Sprint.
Empirical validation of apple-silicon-* profiles is the
explicit follow-up; it is not tracked as "debt" because the
catalogue declares its own status (empirical_validation: pending).

Known limitations

Unchanged from v2.5.0:

Single hardware tier empirically evaluated (gpu-entry); models
requiring gpu-mid/gpu-high are catalogued but not yet
validated.
AMD ROCm not yet detected.
TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
input_text not persisted in triage_jira instances (D22, Low).

New in v2.6.0:

All apple-silicon-* profiles declare
empirical_validation: pending. Users on Mac hardware should
validate cross-arch reproducibility on their specific chip
before treating F1/MAE results as comparable to the Linux
baselines — see docs/CROSS_ARCH_REPRODUCIBILITY.md § "Testing
protocol".
macOS Mode B (native Ollama) is opt-in via
./start_puma.sh --native; the default ./start_puma.sh Docker
path is unchanged.

Upgrade notes

No breaking changes. Existing CI invocations of
puma validate-baseline (no flags) continue to work unchanged.
Linux + NVIDIA dispatch is byte-identical to v2.5.0.
macOS users can opt into native mode (Metal acceleratio...

16 May 01:39

pumacp

v2.5.0

a59e9c1

PUMA v2.5.0 — Hardening (Sprint 8)

PUMA v2.5.0 Release Notes

Release date: 2026-05-16
Previous release: v2.4.0 (2026-05-13)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 8 (hardening) onto the v2.4.0 base.
It resolves the six inconsistencies (I5–I10) detected in the
post-v2.4.0 technical analysis and adds the first empirical MAE
canonical baseline for puma validate-baseline. The gemma4 family
remains empirically excluded from gpu-entry per F8 / D18; v2.5.0
documents the exclusion in a new versioned catalog changelog rather
than re-introducing the failure mode.

Highlights

Six inconsistencies resolved (I5–I10)

| ID | Resolution | Primary artefact |
|---|---|---|
| I5 | macOS Docker (CPU) vs native Ollama (Metal) modes clarified; v2.6.0 plan stated | `docs/MACOS_NOTES.md` (new) |
| I6 | gpu-entry hardware tolerance bands across RTX 2060/3050/3060/4050/4060 Mobile + Apple cross-arch row | `docs/HARDWARE.md` |
| I7 | Catalog now versioned (`catalog_version: 2.5.0`); changelog document; new unit test | `config/models_catalog.yaml`, `docs/CATALOG_HISTORY.md` (new) |
| I8 | CI job `integration-tests-ollama` runs the 4 `@pytest.mark.ollama` tests on every push to main/develop | `.github/workflows/lint-and-test.yml` |
| I9 | `puma validate-baseline --expected-mae` extends the command to estimation_tawos; canonical spec + empirical reference established | `src/puma/cli.py`, `specs/runs/baseline_estimation_canonical.yaml` (new), `docs/baseline_references.md` (new) |
| I10 | Coverage breakdown by module group with explicit rationale for sub-40 % modules | `docs/TESTING.md` (new) |

Estimation canonical baseline established (v2.5.0)

The first empirical MAE reference for puma validate-baseline on
estimation_tawos is now published:

Spec: specs/runs/baseline_estimation_canonical.yaml
Configuration: qwen2.5:3b × zero-shot × N=200 × seed=42 × T=0.0
Reference MAE = 5.7150 SP (tolerance ±0.05 SP)
Establishing run: baseline_estimation_canonical_v1__26d0e07aaa7949ec__20260516T003317
Verified bit-exact across 4 consecutive runs (cold + warm)
Hardware: gpu-entry (RTX 2060 Mobile 6 GB)

Cross-scenario state contamination — documented finding

During empirical establishment of the MAE reference, a
state-contamination effect was characterised: running a
triage_jira baseline between an Ollama restart and the estimation
validation shifts MAE from 5.7150 to ≈6.3150 SP (delta = +0.6 SP) —
well outside the ±0.05 tolerance. The fresh-Ollama-state validation
protocol that prevents the drift is documented in
docs/baseline_references.md. This is a property of Ollama's
inference engine (KV-cache + warm-state behaviour), not a PUMA
code-path regression. Related to D3 (CUDA non-determinism).

gemma4 family — status preserved

The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
remains catalogued and remains empirically excluded from
gpu-entry. The exclusion is grounded in:

F8 (closed): gemma4:e2b GGUF measured at 7.2 GB on disk
versus the ~2 GB suggested by effective active params.
D18 (closed): all 5 smoke runs of gemma4:e2b on RTX 2060
6 GB VRAM returned empty raw_response strings.

The regression-guard test
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. Users on gpu-mid (12–24 GB VRAM) and gpu-pro (24+
GB VRAM) hardware can use the gemma4 family normally; on
gpu-entry, select qwen2.5:* or gemma3:* instead. Full
rationale in docs/CATALOG_HISTORY.md.

Tests

New: tests/unit/test_cli_validate_baseline.py extended from
3 → 8 tests (5 new for the MAE path, mutual exclusivity, missing
metric, and default-spec resolution).
New: tests/unit/test_catalog_metadata.py::test_catalog_has_version_field.
Suite total: 348 → 354 passing, 7 deselected (-m 'not ollama').
pre-commit run --all-files: all hooks green.
puma validate-baseline (triage, F1 path): PASS f1=0.5831, delta=-0.0036.
puma validate-baseline --expected-mae 5.7150 (estimation,
NEW): PASS mae=5.7150, delta=+0.0000.

Quality

Coverage: 61 % (essentially flat from v2.4.0; per-module
breakdown now in docs/TESTING.md).
CI: green on both main and develop. New
integration-tests-ollama job runs only on push to those
branches and is continue-on-error: true so a transient
Ollama failure does not gate the merge queue.
Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
preserved. MAE = 5.7150 ± 0.05 on estimation_tawos newly
established.
src/puma/cli.py LOC essentially unchanged from v2.4.0
(signature extension + a small dispatch block).

Design decisions

--expected-mae is additive, not a refactor. When neither
flag is provided, puma validate-baseline preserves its v2.4.0
behaviour (F1 = 0.5867 against the triage baseline). Existing
CI invocations continue to work unchanged. Sprint Operating
Principle P5 (don't break working code) governed the choice.
gemma4 stays excluded. The original Sprint 8 plan asked for
re-adding gemma4 to gpu-entry at 1.5 / 3 GB sizes. Inventory
caught the conflict with F8 (7.2 GB measured) and D18 (empty
responses). The revised plan converted S8.7 to
documentation-only in CATALOG_HISTORY.md, preserving the
regression guard. Sprint Operating Principle P6
(don't re-introduce previously-rejected models) governed the
choice.
MAE tolerance set to ±0.05 SP. The reference is bit-exact
across cold + warm runs (4-run verification); the ±0.05 band
absorbs the same kind of FP-ordering drift that the F1 ±0.01
band absorbs on triage. The cross-scenario state-contamination
effect (≈0.6 SP) is NOT absorbed by the tolerance — instead, a
validation protocol prevents the contamination.
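
The additive flag design described in the first decision can be sketched with the standard library. This uses `argparse` purely as a stand-in and is not the project's CLI code; the function names are hypothetical, and the default tolerances (0.01 for F1, 0.05 for MAE) are taken from the bands quoted in these notes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="validate-baseline-sketch")
    # The two expectations are mutually exclusive: one metric per run.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--expected-f1", type=float)
    group.add_argument("--expected-mae", type=float)
    parser.add_argument("--tolerance", type=float)
    return parser


def resolve_gate(argv: list[str]) -> tuple[str, float, float]:
    """Return (metric, expected, tolerance); no flags keeps the F1 default."""
    args = build_parser().parse_args(argv)
    if args.expected_mae is not None:
        tolerance = args.tolerance if args.tolerance is not None else 0.05
        return "mae", args.expected_mae, tolerance
    expected = args.expected_f1 if args.expected_f1 is not None else 0.5867
    tolerance = args.tolerance if args.tolerance is not None else 0.01
    return "f1", expected, tolerance


if __name__ == "__main__":
    print(resolve_gate([]))
    print(resolve_gate(["--expected-mae", "5.7150"]))
```

Because the no-flag path resolves to the same F1 gate as before, existing invocations keep their behaviour while the MAE path is purely opt-in.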

Debt tracking

No new open debt introduced by this release.
Inconsistencies tracked: I1–I10 documented across the project;
v2.5.0 resolves I5–I10 (six items). I1–I4 were resolved in
earlier releases.
Total resolved across v2.0.0 → v2.5.0: 15 of 24 technical debt
items (62 %) plus 6 of 10 inconsistencies (60 %); v2.5.0
contributes the inconsistency resolutions.

Known limitations

Unchanged from v2.4.0:

Single hardware tier evaluated (gpu-entry); models requiring
gpu-mid and above catalogued but not yet empirically
evaluated.
AMD ROCm and Apple Metal backends not yet detected. Apple
Silicon native mode planned for v2.6.0; AMD ROCm pending
hardware availability.
TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion
3).
input_text not persisted in triage_jira instances (D22,
Low).

Upgrade notes

No breaking changes. Existing CI invocations of
puma validate-baseline (no flags, expecting F1 = 0.5867)
continue to work unchanged.
New flag available: puma validate-baseline --expected-mae
for estimation_tawos. See docs/baseline_references.md for
the recommended invocation including the fresh-Ollama-state
protocol.
New docs to know about: MACOS_NOTES.md,
CATALOG_HISTORY.md, baseline_references.md, TESTING.md.
The lint-and-test.yml CI workflow now contains a second job
(integration-tests-ollama) that runs on push to main/develop
only. PRs are unaffected.

Acknowledgments

Development assistance provided by generative AI tooling. All
commits are attributed to the project's git identity per
repository convention.


13 May 03:26

pumacp

v2.4.0

9ad05a4

PUMA v2.4.0 — CLI completeness (Anexo F section A.2)

PUMA v2.4.0 Release Notes

Release date: 2026-05-13
Previous release: v2.3.0 (2026-05-13)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 7 (CLI completeness for Anexo F) onto
the v2.3.0 base. It resolves the long-standing gap between the academic
Anexo F catalog and the actual repository state by adding the six
high-value commands from section A.2 of Anexo F, together with a new
source-of-truth document distinguishing implemented commands from
documented design proposals.

Highlights

Anexo F gap resolved

docs/anexo_F_cli_reference.md is now the canonical CLI reference,
split into two sections:

Section A — Implemented: A.1 lists commands pre-existing in
v2.0.0–v2.3.0 (preflight, models, datasets, cache, run,
validate-baseline, compare, dashboard, report, db). A.2
lists the six commands added in this release (below). Every command
in section A is verifiable via puma <command> --help and covered
by tests under tests/cli/.
Section B — Proposed extensions: 5 Bash auxiliary scripts and
12 further CLI commands (Ollama management, sweep wrappers, DB
tooling, code-quality wrappers) documented as design space.
Explicitly marked as not implemented; the decision rationale is
recorded in the document.

Six new CLI commands (Anexo F § A.2)

Anexo F §	Command	Implementation style
A.2.1	`puma prepare-datasets`	Thin subprocess wrapper of `scripts/prepare_datasets.py` (`--dataset`, `--force-redownload`, `--verify`)
A.2.2	`puma wilcoxon`	NEW analysis: paired Wilcoxon signed-rank between two `run_id`s; uses `puma.metrics.statistical_tests.wilcoxon_signed_rank_models`
A.2.3	`puma bias-analysis`	NEW analysis: bias evaluation report; uses `puma.dashboard.data.load_predictions_with_gold` + `puma.metrics.fairness.perturbation_disparity`
A.2.4	`puma generate-plots`	Thin subprocess wrapper of `scripts/generate_phase_b_plots.py` (`--source phase_b` only; `bias_eval`/`multi_seed` exit 2 with deferred-implementation message)
A.2.5	`puma list-runs`	New: SQL pivot of `runs ⋈ metrics` with `--scenario`/`--model`/`--last-n`/`--since` filters and `--json`
A.2.6	`puma list-ollama-models`	New: parses `docker exec puma_ollama ollama list` subprocess output

Why two commands are NEW analyses and not wrappers: Anexo F § A.2.2
and § A.2.3 specify semantics that diverge from the existing scripts
(positional run_id arguments vs. --run-prefix; --models /
--perturbations filters vs. prefix-only). Rather than mutate the
scripts, the new CLI commands call PUMA's own core helpers directly.
The scripts remain unchanged and continue to support their original
workflows (top-K ranking).
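
For orientation, a paired Wilcoxon signed-rank test on per-instance correctness indicators can be sketched in pure Python. This is an illustration under stated assumptions, not the implementation in puma.metrics.statistical_tests: the function name paired_wilcoxon is hypothetical, and it uses the normal approximation with tie correction and no continuity correction, so an exact or corrected test may give slightly different p-values.

```python
import math
from typing import Sequence, Tuple


def paired_wilcoxon(correct_a: Sequence[int],
                    correct_b: Sequence[int]) -> Tuple[int, float]:
    """Two-sided Wilcoxon signed-rank test on 0/1 correctness indicators.

    Returns (n_pairs, p_value), where n_pairs counts the instances on
    which the two runs disagree (zero differences are dropped).
    """
    diffs = [a - b for a, b in zip(correct_a, correct_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0, 1.0  # the runs never disagree: no evidence of a difference
    # Every non-zero difference has |d| = 1, so all ranks tie at (n + 1) / 2.
    rank = (n + 1) / 2
    w_plus = rank * sum(1 for d in diffs if d > 0)
    mean = n * (n + 1) / 4
    # Variance n(n+1)(2n+1)/24 minus the tie correction (n^3 - n)/48.
    var = n * (n + 1) * (2 * n + 1) / 24 - (n ** 3 - n) / 48
    z = (w_plus - mean) / math.sqrt(var)
    return n, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical data: 50 instances, 19 disagreements split 13 vs 6.
run_a = [1] * 13 + [0] * 6 + [1] * 31
run_b = [0] * 13 + [1] * 6 + [1] * 31
n_pairs, p_value = paired_wilcoxon(run_a, run_b)
print(n_pairs, round(p_value, 3))
```

With binary indicators every disagreement carries the same rank, so the test effectively depends only on how the disagreements split between the two runs. That is why a visible F1 gap can still be non-significant when only a small number of instances differ.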

Tests

New: tests/cli/ package with 27 tests across 6 files, one per
command. Each file tests at minimum: --help exit 0, happy path,
error paths.
Suite total: 318 → 348 passing, 7 deselected (-m 'not ollama').
pre-commit run --all-files: all hooks green.
puma validate-baseline: PASS f1_macro=0.5831, delta=-0.0036.

Quality

Coverage: 58 % (no significant change from v2.3.0).
CI: green on both main and develop.
Baseline reproducibility: F1 = 0.5867 ± 0.01 holds.
app.py LOC is stable; the only file that grew meaningfully is
src/puma/cli.py (363 → 777 LOC, from inline command bodies).
A refactor to a src/puma/cli/commands/ package was considered and
deferred: the monolith remains the cleaner option at this size.

Design decisions

--source bias_eval / multi_seed for generate-plots are
accepted by the parser but exit 2 with a deferred-implementation
message because the underlying plotting scripts for those sources
do not yet exist. This matches the Anexo F spec without inventing
data.
--verify for prepare-datasets currently emits SHA-256 hashes
only. A manifest file (docs/datasets_manifest.json) for full hash
comparison is documented in Anexo F but not yet in the repo; the
command is forward-compatible.
No src/puma/cli/commands/ refactor was performed in this
release. With 6 new commands the inline monolith is still readable;
the refactor would be justified if/when Section B extensions land.
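
The "accepted by the parser but exits 2" behaviour for generate-plots can be sketched as below. This is an assumption-laden illustration: it uses argparse (the notes do not say which CLI library PUMA uses), the function name, the default source, and the message wording are hypothetical, and only the source names and exit code come from the notes.

```python
import argparse
import sys
from typing import List, Optional

ALL_SOURCES = ["phase_b", "bias_eval", "multi_seed"]
IMPLEMENTED_SOURCES = {"phase_b"}


def generate_plots(argv: Optional[List[str]] = None) -> int:
    """Return a process exit code: 0 on success, 2 for a deferred source."""
    parser = argparse.ArgumentParser(prog="puma generate-plots")
    # All three sources are valid choices, so the parser itself never rejects them.
    parser.add_argument("--source", choices=ALL_SOURCES, default="phase_b")
    args = parser.parse_args(argv)
    if args.source not in IMPLEMENTED_SOURCES:
        # Hypothetical wording for the deferred-implementation message.
        print(f"--source {args.source}: implementation deferred", file=sys.stderr)
        return 2
    # The real command would invoke scripts/generate_phase_b_plots.py here.
    return 0


print(generate_plots(["--source", "phase_b"]))
print(generate_plots(["--source", "bias_eval"]))
```

Keeping the deferred sources in the parser's choices means the command-line surface already matches the Anexo F spec, and enabling a source later only requires adding it to the implemented set.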

Debt tracking

No new open debt introduced by this release.
Total resolved across v2.0.0 → v2.4.0: 15 of 24 (62 %).
Section B of Anexo F is documented design space, not technical
debt. Implementation is optional and conditional on demand.

Known limitations

Unchanged from v2.3.0:

Single hardware tier evaluated (gpu-entry); models requiring
gpu-mid and above catalogued but not yet empirically evaluated.
AMD ROCm and Apple Metal backends not yet detected.
TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
input_text not persisted in triage_jira instances (D22, Low).

Upgrade notes

No breaking changes to existing commands or YAML run-spec schema.
Six new commands available. See puma <command> --help for usage
or read docs/anexo_F_cli_reference.md § A.2.
New test directory tests/cli/ joins the existing
tests/unit/ and tests/integration/.

Acknowledgments

Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.


13 May 01:58

pumacp

v2.3.0

d9ca109

PUMA v2.3.0 — dashboard production-quality + docs structure

PUMA v2.3.0 Release Notes

Release date: 2026-05-13
Previous release: v2.2.0 (2026-05-13)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 6 (dashboard polish + structural
refactor) and retrospective documentation work (INDEX.md +
docs/overview.md + README branding) onto the v2.2.0 base. With this
release, Phase C of the master plan is fully complete.

Highlights

Dashboard production-quality (Sprint 6)

Major refactor: app.py reduced from an 803-LOC monolith to a 168-LOC
router (−79 %). View logic is delegated to seven modules in
src/puma/dashboard/views/. Each view is independently importable
and testable; the router publishes filters to st.session_state and
dispatches via a VIEWS dict.
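
The router pattern can be sketched without Streamlit as follows. Everything here is illustrative: a plain dict stands in for st.session_state so the example runs on its own, and the view functions, the model_filter key, and the route helper are hypothetical names rather than the contents of app.py.

```python
from typing import Callable, Dict

# Plain dict standing in for st.session_state (assumption, to avoid the dependency).
session_state: Dict[str, object] = {}


def overview_view(state: Dict[str, object]) -> str:
    return f"Overview for model={state.get('model_filter')}"


def reliability_view(state: Dict[str, object]) -> str:
    return f"Reliability for model={state.get('model_filter')}"


# Hypothetical registry; in the real app each entry would live in its own module.
VIEWS: Dict[str, Callable[[Dict[str, object]], str]] = {
    "Overview": overview_view,
    "Reliability": reliability_view,
}


def route(selected: str, model_filter: str) -> str:
    """Publish sidebar filters to shared state, then dispatch to the chosen view."""
    session_state["model_filter"] = model_filter
    return VIEWS[selected](session_state)


print(route("Reliability", "qwen2.5:3b"))
```

Because each view only reads from the shared state, it can be imported and exercised in a test without running the router.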

Ten polish improvements applied:

#	Improvement	Impact
1	`@st.cache_data(ttl=60)` on 7 loaders	Performance
2	`st.spinner` on slow operations	UX
3	CSV export on 4 tables	Productivity
4	Tooltips on ≈ 12 metric cards	UX
5	Unified empty-filtered-state component	UX
6	Friendly expander titles in Overview	UX
7	Module-level imports (no more inline)	Code quality
8	Emoji prefixes consistent across 7 view titles	UX
9	Dark-mode dataframe text legibility	UX (bug fix)
10	Empty-selectbox guard in Instance Drill-down	Robustness

Plus: first-visit guided tour with view overview and tips
(download CSV, dark mode, tooltips). Persistent dismiss via
st.session_state["tour_dismissed"]; "📖 Show tour" button in the
sidebar to re-open.

Documentation structure (Phase E.bis retrospective + Phase E.ter)

INDEX.md (root, uppercase): project status, phases, releases,
debt tracking, architecture entry points. Created in Phase E.bis;
this release updates it for v2.3.0 status.
docs/overview.md (new location): preserves the 256 LOC of
architectural content from the legacy lowercase index.md.
README.md: branded header with PUMA logo, descriptive
blockquote, and Related-Resources section linking to puma-vault,
the published knowledge garden, releases, INDEX.md, and
docs/overview.md.

Quality

Tests: 318 passing (up from 313 in v2.2.0; +5 dashboard smoke
tests covering view module integrity, polish helpers, cache
decorator presence, and the end-to-end AppTest render with the
live database).
Coverage: 58 % (up from 55 % in v2.2.0).
Pre-commit: 10/10 hooks green.
CI: green on both main and develop.
Baseline reproducibility: F1 = 0.5867 ± 0.01 holds; verified
via puma validate-baseline (PASS at 0.5831, delta −0.0036).

Methodological findings (academic traceability)

Sprint 6 surfaced one additional finding consistent with the
meta-pattern documented in docs/known_debt.md ("symptom in layer
N, root cause in layer M ≠ N"):

Dark-mode dataframe invisibility. The CSS rule applied
light-mode colours globally; under dark mode, table text inherited
light-mode colours against the dark background, rendering tables
nearly unreadable. Symptom (invisible tables) appeared in the
dashboard layer; root cause (CSS scope without theme awareness)
was in the styling layer. Resolved in the same commit as the
refactor by adding a theme-aware CSS override
(color: #E5E7EB + background-color: #16213E when
dark_mode == True).

This brings the meta-pattern catalogue to five instances (D15, D18,
D21, D22, and this CSS scope issue); the fifth is retired in the
same commit that surfaced it.
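
A theme-aware override of the kind described above might look like the sketch below. The two colour values come from these notes; the function name and the CSS selector are assumptions, and the actual rule in the dashboard may target different elements.

```python
def dataframe_css(dark_mode: bool) -> str:
    """Return a CSS override for dataframes, or an empty string in light mode."""
    if not dark_mode:
        return ""  # light mode keeps the default styling
    # The selector is an assumption; only the colour values are from the notes.
    return (
        "<style>"
        '[data-testid="stDataFrame"] '
        "{ color: #E5E7EB; background-color: #16213E; }"
        "</style>"
    )


print(dataframe_css(True))
```

In a Streamlit app the returned string would typically be injected with st.markdown(css, unsafe_allow_html=True), and only when it is non-empty.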

CI workflow hygiene

The .github/workflows/release.yml fix introduced in Phase E.bis
(commit 863c166) is now exercised end-to-end by the v2.3.0 tag
push. After the tag was pushed and gh release create ran, exactly
one release was created (no duplicate draft). The fix is verified
effective for v2.X.0 releases going forward.

Debt tracking

No new open debt introduced by this release.
Total resolved across v2.0.0 → v2.3.0: 15 of 24 items (62 %).
Phase C: ✓ COMPLETE (was the last open phase; all five
Gate-C criteria met).

Full inventory and diagnostic write-ups in
docs/known_debt.md.

Known limitations

Single hardware tier evaluated (gpu-entry); models requiring
gpu-mid and above (qwen2.5:14b, gemma3:27b, deepseek-r1:14b,
the gemma4 family, llama3.1:70b) catalogued but not yet
empirically evaluated.
AMD ROCm and Apple Metal backends not yet detected (development
hardware is NVIDIA-only).
TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
input_text not persisted in triage_jira instances (D22, Low —
future data-pipeline enhancement). The Dashboard Instance
Drill-down handles this gracefully with an informative message.

Master plan status (post-v2.3.0)

Phase	Status
A — Foundations	✓ COMPLETE
B — Multi-model sweep	✓ COMPLETE
C — Professional dashboard	✓ COMPLETE (this release)
D — Technical depth	✓ ~95 % (ROCm/Metal n/a in current hardware)
E — Documentation and releases	✓ COMPLETE (v2.0.0, v2.1.0, v2.2.0, v2.3.0)

All five phases of the original master plan are now complete or
effectively complete (Phase D's remaining items are
hardware-dependent or scope-deferred).

Upgrade notes

No breaking changes to the public CLI or YAML run-spec schema.
Dashboard refactor is internal; user-facing behaviour is preserved.
Existing run-specs and CLI invocations work unchanged.
The dashboard module structure has changed (app.py is now a
router; each view lives in src/puma/dashboard/views/<name>.py).
Any external tooling that imported view code from app.py should
migrate to the new module paths.

Acknowledgments

Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.


13 May 01:02

pumacp

v2.2.0

157ca28

PUMA v2.2.0 — statistical pipeline + dashboard core + bias evaluation

PUMA v2.2.0 Release Notes

Release date: 2026-05-13
Previous release: v2.1.0 (2026-05-10)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprints 3, 4, and 5 onto the v2.1.0 base:
statistical-analysis pipeline (ECE, Wilcoxon signed-rank, multi-seed
reproducibility), dashboard core with PUMA visual identity, and an
empirical bias-evaluation suite adapted to the characteristics of the
technical corpus.

Highlights

Statistical analysis pipeline (Sprint 3)

ECE end-to-end. Expected Calibration Error computed from real
Ollama logprobs and persisted as metrics.metric_name='ece'.
Validated against Guo et al. (2017) canonical cases with tolerance
1e-6. Baseline qwen2.5:3b shows ECE=0.39 — significant
miscalibration typical of out-of-the-box LLMs without post-hoc
calibration.
Multi-seed validation. Seeds {42, 123, 456} on the canonical
baseline yield zero variance in task metrics under T=0.0,
confirming the bit-exact reproducibility guarantee documented in
v2.0.0. Runtime jitter ~4 %.
Wilcoxon pairwise comparison (Demšar 2006 methodology) on
paired-correctness indicators. Demonstrated empirically on a
mini-comparison (qwen2.5:1.5b vs gemma3:1b on triage_jira × N=50)
that a 0.19-point F1 gap is not statistically significant at
α=0.05 (p=0.108, n_pairs=19/50) — the kind of finding the test is
designed to surface.
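
As a reference for the ECE figure above, the metric can be sketched in pure Python. This follows the usual binned definition (weighted gap between accuracy and mean confidence per bin); the function name, the default of 10 equal-width bins, and the bin-boundary convention are assumptions and may differ from PUMA's implementation.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE = sum over bins of (|bin| / n) * |accuracy(bin) - mean confidence(bin)|."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    # Per bin: [count, confidence sum, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)
    ece = 0.0
    for count, conf_sum, hits in bins:
        if count:
            ece += (count / n) * abs(hits / count - conf_sum / count)
    return ece


# Toy case: ten predictions at 0.9 confidence, only five correct.
toy_ece = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(round(toy_ece, 4))
```

In the toy case the model claims 90 % confidence but is right half the time, so the gap of 0.4 is the whole ECE. A perfectly calibrated model would score 0.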

Dashboard core (Sprint 4)

5 fully functional views: Overview (cohort cards + per-run
expanders, sidebar filters applied), Model Comparison (mean±std
aggregation across seeds, run × metric heatmap, Wilcoxon
artefact rendering), Reliability (real ECE + reliability diagram
from logprobs), Sustainability Frontier (F1 vs CO₂ Pareto consuming
the emissions table from Sprint 2 D15), Instance Drill-down
(correct gold_label via the new JOIN, top-K logprobs).
2 informed placeholders pending data: Fairness and Robustness
(made functional by Sprint 5).
PUMA visual identity: emerald palette, sans-serif typography,
logo in sidebar, telemetry disabled.
Dark-mode toggle via runtime CSS override.
6 smoke tests including an end-to-end streamlit.testing.v1.AppTest
render.
Bug fix: the Fairness and Instance Drill-down views in v2.1.0
read gold_label from predictions, where it does not exist
(it lives in instances). The new load_predictions_with_gold
helper LEFT-JOINs the two tables and is now consumed by every
view that needs the gold label.
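
The join behind that fix can be sketched with an in-memory SQLite database. The table names predictions and instances and the gold_label column come from the notes; the other column names, the sample rows, and the function name join_predictions_with_gold are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instances (instance_id TEXT PRIMARY KEY, gold_label TEXT);
    CREATE TABLE predictions (run_id TEXT, instance_id TEXT, predicted_label TEXT);
    INSERT INTO instances VALUES ('i1', 'bug'), ('i2', 'task');
    INSERT INTO predictions VALUES
        ('r1', 'i1', 'bug'), ('r1', 'i2', 'bug'), ('r1', 'i3', 'task');
""")


def join_predictions_with_gold(db: sqlite3.Connection, run_id: str) -> list:
    """Return (instance_id, predicted_label, gold_label) rows for one run."""
    return db.execute(
        """
        SELECT p.instance_id, p.predicted_label, i.gold_label
        FROM predictions AS p
        LEFT JOIN instances AS i ON i.instance_id = p.instance_id
        WHERE p.run_id = ?
        ORDER BY p.instance_id
        """,
        (run_id,),
    ).fetchall()


rows = join_predictions_with_gold(conn, "r1")
for row in rows:
    print(row)
```

A LEFT JOIN keeps predictions whose instance row is missing (here i3) and reports their gold label as NULL, so a view can show an informative message instead of silently dropping the row.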

Bias evaluation (Sprint 5) — empirical findings

Methodological adaptation. The triage_jira corpus is 100 %
technical incident text with 0 % gendered terms (verified by regex
over 23 EN tokens across all 200 instances). A textbook
pronoun-substitution gender_swap on this corpus would yield
flip_rate = 0 and disparity = 0 — a false PASS demonstrating
nothing. Sprint 5 therefore evaluates bias via signal injection
rather than signal substitution:

gender_swap_prefix_{male,female} — prepends a gendered identity
prefix (John Smith reported: … vs Mary Smith reported: …).
Methodology per Caliskan et al. (2017) and Bolukbasi et al. (2016).
register_shift_informal — formal→informal substitution proxy for
the dialect axis on a monolingual technical corpus (Tatman 2017).

Key empirical findings on triage_jira × N=100 per condition:

Model	Flip rate (any prefix vs baseline)	Δ accuracy	M-vs-F directional bias
qwen2.5:1.5b	~25-27 %	−11 to −12 pp	15 %
qwen2.5:3b	~25-27 %	−3 to −4 pp	5 %

Both models flip ~25 % of predictions when any gender signal is
added vs the un-perturbed baseline — strong sensitivity to identity
cues that the technical content does not require.
The 2× larger model (3b vs 1.5b) exhibits ~3× less directional bias
while losing ~3× less accuracy under signal injection.
register_shift_informal shows ~0 % effect on both models:
formal-to-informal substitution does not perturb predictions.
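
Signal injection and the flip-rate measurement can be sketched as below. The prefix format follows the example given in these notes; the function names and the toy labels are hypothetical, and the flip-rate definition (share of instances whose predicted label changes) is the assumed reading of the table header.

```python
from typing import Sequence


def inject_prefix(text: str, reporter: str) -> str:
    """Signal injection: prepend an identity prefix, leave the body untouched."""
    return f"{reporter} reported: {text}"


def flip_rate(baseline: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of instances whose predicted label changes under perturbation."""
    if len(baseline) != len(perturbed):
        raise ValueError("prediction lists must align one-to-one")
    if not baseline:
        return 0.0
    return sum(b != p for b, p in zip(baseline, perturbed)) / len(baseline)


print(inject_prefix("Login page returns HTTP 500", "Mary Smith"))

# Toy predictions for four instances, before and after injection.
toy_rate = flip_rate(["bug", "task", "bug", "story"],
                     ["bug", "bug", "bug", "story"])
print(toy_rate)
```

Because the technical body of each issue is unchanged, any flip counted here is attributable to the injected identity cue rather than to the content being triaged.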

Closes Gate D criterion 4 ("Basic semantic bias implemented and
validated"). Closes debt D19.

Quality

Tests: 313 passing (up from 276 in v2.1.0; +37 TDD across the
three sprints).
Pre-commit: 10/10 hooks green.
CI: green on both main and develop.
Baseline reproducibility: F1=0.5867 ± 0.01 holds; verified via
puma validate-baseline (PASS at 0.5831, delta −0.0036).

Methodological findings (academic traceability)

Sprint 3 confirmed empirically the deterministic-reproducibility
guarantee documented in v2.0.0 (zero variance under T=0.0 with three
different seeds; runtime jitter does not propagate to metrics).
Sprint 4 dashboard integration exposed D22: the synthetic
triage_jira dataset does not persist input_text. This is a
fourth instance of the "symptom in layer N, root cause in layer M"
meta-pattern documented in docs/known_debt.md (joining D15, D18,
D21).
Sprint 5 confirmed D3 empirically: puma validate-baseline
showed bit-exact F1=0.6002 across consecutive runs after the bias
sweep, then returned to F1=0.5831 after
docker compose restart puma_ollama. This documents the
warm-state-drift scope of the reproducibility guarantee, which
matters for any future B.4 sweep protocol.

Debt tracking

Resolved this release: D19 (fairness scaffolding only).
New entry: D22 (Low) — instances.input_text empty on
triage_jira.
Total resolved across v2.0.0 → v2.2.0: 15 of 23 (65 %).
Open: 8 (0 critical, 5 medium, 2 low; 1 marked
DECIDED-NO-ACTION).

Full inventory and diagnostic write-ups in
docs/known_debt.md.

Known limitations

input_text not persisted in triage_jira instances (D22, Low —
future data-pipeline enhancement).
Single hardware tier evaluated (gpu-entry); models requiring
gpu-mid and above (qwen2.5:14b, gemma3:27b, deepseek-r1:14b,
the gemma4 family, llama3.1:70b) catalogued but not yet
empirically evaluated.
Dashboard polish deferred — animations, guided tour, refactor of
the 640-LOC app.py to a views/ module package. Slated for a
future Sprint 6.
AMD ROCm and Apple Metal backends not yet detected (development
hardware is NVIDIA-only).
TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).

Upgrade notes

No breaking changes to the public CLI or YAML run-spec schema.
New perturbation names accepted in perturbations: lists:
gender_swap_prefix_male, gender_swap_prefix_female,
register_shift_informal.
Dashboard views update automatically when perturbed runs are
present; no migration step required.

References

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics
derived automatically from language corpora contain human-like
biases. Science 356(6334), 183-186.
Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. T.
(2016). Man is to computer programmer as woman is to homemaker?
Debiasing word embeddings. NeurIPS.
Tatman, R. (2017). Gender and dialect bias in YouTube's automatic
captions. In Proceedings of the First ACL Workshop on Ethics in
Natural Language Processing.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On
calibration of modern neural networks. ICML.
Demšar, J. (2006). Statistical comparisons of classifiers over
multiple data sets. JMLR 7, 1-30.
Wilcoxon, F. (1945). Individual comparisons by ranking methods.
Biometrics Bulletin 1(6), 80-83.

Acknowledgments

Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.


Releases: pumacp/puma
