Releases: pumacp/puma
PUMA reproducibility anchor (v2.7.0-baseline-anchor)
Canonical, SHA-pinned reproducibility anchor for the PUMA empirical baseline.
Commit: 6671108
Verified reproducible metrics (qwen2.5:3b, N=200, seed=42, temperature=0.0):
- triage_jira F1-macro = 0.5867 +/- 0.01 (contextual-anchoring)
- estimation_tawos MAE = 5.7150 +/- 0.05 SP (zero-shot)
Reproduce:
puma validate-baseline --expected-f1 0.5867 --tolerance 0.01
puma validate-baseline --expected-mae 5.7150 --tolerance 0.05
Reproducibility gates F1=0.5867 and MAE=5.7150 are bit-exact stable across
releases v2.4.0, v2.5.0, v2.6.0 and v2.7.0.
PUMA v4.0.0 — Sprint 12 closure
PUMA v4.0.0 — Sprint 12 closure
PUMA is a local, reproducible benchmarking framework for open LLMs on
project-management tasks (issue triage and effort estimation), run entirely
on your own hardware via Ollama — deterministic, offline by default, and
verifiable end to end.
This release is validated by a real-world milestone. PUMA's federated
community-submission infrastructure is now operational and was proven end to
end by the first official production submission — qwen2.5:3b on
triage_jira / zero_shot, F1-macro 0.3898 — which landed at
pumacp/puma-community#8
(merge SHA 111cee36) and mirrored to the Hugging Face submissions dataset.
That F1 is the reproducible floor anchor for the zero-shot strategy on
triage_jira; see docs/first-submission.md.
Sprint 12 highlights
- #45 — PyPI + Docker (ghcr.io) publishing workflows +
Dockerfile.publish;pyproject.tomlhardened (distribution namepuma-cp) - #46 — Multi-model comparison dashboard view + corporate Streamlit palette
- #47 — README channel-directory restructure; acrostic visual flexibility
- #48 — mkdocs full content sync (nav 6 → 28 pages); D30 resolved
- #49 — Manual IDE contribution workflow docs
- #50 — Security audit MVP: pip-audit + bandit + gitleaks + Trivy +
SECURITY.md+ threat model - #51 — Consolidated technical reference (~5100 words, 17-decision timeline)
- #52 — Inaugural production submission documented (
docs/first-submission.md)
Installation
pip install puma-cp==4.0.0The container image (ghcr.io/pumacp/puma:4.0.0) is not yet published for
this release: the Trivy gate in the publish pipeline blocked it on 3 HIGH-severity
base-image CVEs (0 CRITICAL). This is the security gate working as designed; the
image will be re-published once the base image is patched (tracked for S12.19).
Use the PyPI package in the meantime.
Quick start
See the documentation site and the
getting-started overview.
What's in v4.0.0
Added — federated submission infrastructure (S12.15) validated by the
inaugural submission; PyPI + ghcr.io publishing; multi-model dashboard view;
consolidated technical reference; manual IDE contribution workflow; security
audit MVP; corporate monochrome visual identity; acrostic visual flexibility.
Changed — mkdocs nav 6 → 28 public pages; read-only puma models
sub-group (D30 resolved); pyproject.toml hardened; project version → 4.0.0.
Security — pip-audit / bandit / gitleaks on every push; Trivy on every
container publish (blocks HIGH/CRITICAL); SECURITY.md disclosure policy;
9-check submission validation pipeline.
Infrastructure — Dockerfile.publish (multi-stage, non-root, OCI labels);
GitHub Pages live with a 28-page nav; Hugging Face dataset mirror operational.
Full detail in CHANGELOG.md.
Links
- Documentation: https://pumacp.github.io/puma
- Community repository: https://github.com/pumacp/puma-community
- Hugging Face submissions dataset: https://huggingface.co/datasets/pumaproject/puma-community-submissions
- Hugging Face leaderboard Space: https://huggingface.co/spaces/pumaproject/puma-leaderboard
Known limitations (deferred to S12.19 / post-Sprint-12)
See docs/known_debt.md.
- D38 —
validate-submissionworkflow references a non-existent action version. - D39 —
verify-integrityworkflow broken bygradio_clientAPI drift; the inaugural submission is thereforeself-attested. - D40 —
puma share-resultsCLI hangs after the Review panel. - Container publish blocked by 3 HIGH base-image CVEs (re-publish after patching).
notify-discordlacks theDISCORD_WEBHOOKsecret (optional integration).
Acknowledgments
Thanks to everyone who tested the submission pipeline end to end and helped land
the first official community submission — the milestone that validates this release.
v3.1.0 — Sprint 11' Post-v3.0.0 Reconciliation
PUMA v3.1.0 — Sprint 11' Post-v3.0.0 Reconciliation
- Release date: 2026-05-25
- Previous release: v3.0.0 (2026-05-20)
- Tag: v3.1.0
Overview
Sprint 11' is a post-v3.0.0 hardening Sprint: it reconciles the v3.0.0
release artifacts (version strings, CHANGELOG, release notes), completes the
puma community CLI subgroup, applies minimal repairs to the wiki-sync
workflows and the Verifier Space, and cleans residual documentation drift
while preserving historical accuracy. No breaking changes.
Highlights
- New
puma communityTyper group with four subcommands —browse,
pull,verify-hash,validate(Anexo F F.16.5–F.16.8). The GitHub
Contents API is read overhttpx(mockable in tests);verify-hash
recomputes the predictions hash byte-identically to the client and is
D23-aware on--remote. - Verifier Space hash output now matches schema v1.0.0 — the
pumaproject/puma-verifierSpace no longer emits thesha256:prefix, so
its digest conforms to^[a-f0-9]{64}$. - Wiki sync works again in both repos — the
wiki-sync.ymlworkflows now
declarecontents: write, so the GitHub Wiki publishes on push; both
/wikipages render (HTTP 200). - D23 deuda técnica documented — the Verifier (2-field JSONL) and client
(4-field DB CSV) hash algorithms differ by construction; reconciliation is
deferred to v4.x with a schema decision.
Quality at release time
- mypy --strict: 0 errors / 81 files
- pytest -m "not ollama": 597 passed, 1 skipped, 7 deselected
- pytest -m ollama: 7 passed
- F1 triage baseline: 0.5831 ± 0.01 (Δ = −0.0036)
- MAE estimation baseline: 5.7150 bit-exact (Δ = +0.0000)
- Coverage on community CLI: browse 80%, pull 87%, verify 85%, validate 89%
- Schema v1.0.0 unchanged (P3); zero federation references in code (P4).
Known limitations
- D23:
puma community verify-hash --remotereturnsmismatchby
construction for schema v1.0.0 submissions even when the local hash is
correct, because the Verifier Space hashes a different input shape than the
client. Local verification (the canonical path) is unaffected.
Reconciliation deferred to v4.x with a schema decision. See
docs/known_debt.md. - Kaggle mirror activation (S11'.6): the
mirror-kaggle.ymlworkflow is
fully hardened (companion PRpumacp/puma-community#4:--dir-mode zip,
robust create-vs-version, CC-BY-4.0 license, 50-char-safe title,
post-publish HEAD verification). Publication is pending a Kaggle-internal
soft-delete grace period that reserves the slug
pumacp/puma-community-submissions; it resolves automatically once Kaggle
releases the slug — no further code action is needed.
Upgrade path
No breaking changes since v3.0.0. Refresh with:
git pull && pip install -e '.[dev]'See also
CHANGELOG.md— the[3.1.0]section.- PR #15 on this repo — the full commit list.
- Companion PRs:
pumacp/puma-community#2(HF namespace),#3(wiki-sync),
#4(Kaggle mirror hardening). pumaproject/puma-verifierHF Space @ commitd8a4ffd.- Anexo G of the academic memoria — discovery-before-write episodes recorded
during this Sprint.
PUMA v3.0.0 — Community + mypy remediation + public docs + security hardening
First public release. See CHANGELOG.md for full details.
PUMA v2.7.0 — Catalog expansion: Qwen3 (pending validation)
PUMA v2.7.0 Release Notes
Release date: 2026-05-16
Previous release: v2.6.0 (2026-05-16)
Branch: develop → main (post-tag)
Summary
This release consolidates Sprint 10 (catalog expansion) onto the
v2.6.0 base. It adds two Alibaba Qwen3 family entries to the
catalog — both verified against the Ollama registry before
inclusion — and formally excludes Kimi K2.6 after a 13-tag
registry probe confirmed it is not distributed via Ollama. The
catalog schema is preserved at the 8 fields used since v2.0.0;
all v2.7.0 metadata lives within the existing notes field per
the project's minimum-complexity discipline.
Highlights
Two Qwen3 entries — registry-verified before cataloguing
| Tag | Type | GGUF (verified) | Profile | Notes |
|---|---|---|---|---|
qwen3:30b |
Dense | 17.3 GB | gpu-high |
Hybrid Gated DeltaNet + self-attention, 262144 context |
qwen3:30b-a3b |
MoE | 17.3 GB | gpu-high |
30B total / ~3B active per token; F8/D18 MoE caveat preserved in notes |
Both entries:
- Verified via Ollama registry manifest probe
(registry.ollama.ai/v2/library/qwen3/manifests/*returned HTTP
200 with the GGUF size derived from the sum of layer sizes). - Declare
logprobs_supported: falseconservatively until
empirical verification on appropriate hardware. - Excluded from
gpu-entryAND everyapple-silicon-*profile by
the P11 pending-validation invariant. Five new regression-guard
tests intests/unit/test_catalog_metadata.pypin this contract
via exact-equality onprofiles_compatible == ['gpu-high']. params_b: 30.0follows thegemma4:26b-a4bprecedent (TOTAL
when the tag encodes both numbers). The MoE/F8 caveat in the
notesfield documents that active-params count does NOT
predict GGUF size or runtime VRAM consumption.
Kimi K2.6 — formally excluded after 13-tag registry probe
A registry probe on 2026-05-16 returned HTTP 404 on every
plausible Ollama tag naming for Kimi K2.6:
kimi-k2:6 kimi-k2:latest kimi-k2:1t
kimi-k2:1t-instruct kimi-k2:0905 kimi-k2:base
kimi-k2:instruct kimi:latest kimi-k2.6:latest
moonshot:latest moonshot:kimi-k2 kimi-k2-base:latest
kimi-k2-instruct:latest
The model is not distributed via the Ollama registry as of the
v2.7.0 cut. Cataloguing a non-existent ollama_tag would violate
the project's empirical-first principle (P10) and produce a
broken puma models pull command for users following the catalog
metadata. The exclusion decision is recorded in
docs/CATALOG_HISTORY.md v2.7.0 § "Considered but not catalogued"
with the full probe table for academic traceability. It may be
reconsidered in a future release if Moonshot AI or a third-party
distributor publishes K2.6 to the Ollama registry.
Deferred — known on Ollama but out of v2.7.0 scope
The registry probe confirmed these tags exist (HTTP 200) but they
are deferred from v2.7.0 for scope discipline:
| Tag | Real GGUF | Reason |
|---|---|---|
qwen3:32b (dense) |
18.8 GB | Marginal upgrade over qwen3:30b; defer until empirical validation on gpu-high can distinguish them |
qwen3:235b-a22b (MoE) |
132.4 GB | Requires multi-GPU rigs well beyond gpu-high (24+ GB VRAM); pending hardware tier extension |
qwen3-coder:30b, qwen3-coder:480b |
— | Coder family is task-specific; out of scope for PMO benchmarks |
Schema unchanged — minimum-complexity preserved
The original Sprint 10 plan proposed ~12 new YAML fields (family,
parameters_total_b, parameters_active_b, profile_recommended,
size_gb_disk_estimate, size_gb_vram_estimate, quantization,
license, release_date, capabilities, empirical_validation,
validation_blockers). The user's minimum-complexity decision
kept the catalog at the v2.0.0–v2.6.0 schema (8 fields:
ollama_tag, params_b, gguf_size_gb, context_window,
logprobs_supported, profiles_compatible, timeout_s,
notes). All v2.7.0 metadata (license, release date, MoE caveat,
validation blockers, architecture details) lives within
multi-line notes: text. src/puma/preflight/catalog.py and the
ModelEntry dataclass are byte-identical to v2.6.0.
Invariants generalised, not relaxed
The pending-validation exclusion from gpu-entry (established in
Sprint 9 for Apple Silicon entries) is reaffirmed for the new
Qwen3 entries and extended to every apple-silicon-* profile via
explicit tests:
| Test | Status | Invariant |
|---|---|---|
test_gemma4_family_excluded_from_gpu_entry |
PASSED (preserved) | D18/F8 (Sprint 2) |
test_gemma4_family_not_compatible_with_any_apple_silicon |
PASSED (preserved) | P6 extension to Apple Silicon (Sprint 9) |
test_qwen3_entries_excluded_from_gpu_entry |
PASSED (new) | P10/P11 (Sprint 10) |
test_qwen3_entries_excluded_from_all_apple_silicon |
PASSED (new) | P11 generalisation across profile families |
test_qwen3_entries_target_gpu_high_only |
PASSED (new) | Exact-equality anchor against accidental loosening |
The pattern is now: new entries default to the safest profile
only; loosening requires empirical evidence and an explicit
debt-tracker entry referencing the prior exclusion.
Tests
- 402 → 407 passing (
-m "not ollama"), 7 deselected. - 5 new regression-guard tests in
tests/unit/test_catalog_metadata.py(see invariant table
above). tests/unit/test_preflight_catalog.py::test_load_catalog_returns_all_entries:
entry-count expectation updated 15 → 17 to reflect the two new
Qwen3 additions.pre-commit run --all-files: all hooks green.puma validate-baseline(triage, F1 path, fresh Ollama):
PASS f1=0.5831, delta=-0.0036, ±0.01.puma validate-baseline --expected-mae 5.7150(estimation,
fresh Ollama):PASS mae=5.7150, delta=+0.0000, ±0.05—
bit-exact.
Quality
- Coverage: 61 % (no significant change from v2.6.0; new entries
are YAML-only, no Python statements added). - CI: green on both
mainanddevelop. The
integration-tests-ollamajob introduced in v2.5.0 continues to
run on push to those branches. - Baseline reproducibility: F1 = 0.5867 ± 0.01 on
triage_jira
preserved; MAE = 5.7150 ± 0.05 onestimation_tawospreserved. - Linux + NVIDIA dispatch byte-identical to v2.6.0 (no new
profiles, no new dispatch logic, no new code paths). The Qwen3
entries appear inmodels_for_profile('gpu-high')only.
Design decisions
- Schema unchanged at 8 fields. Documented Sprint-10-original
proposal of 12 new fields; chose to keep schema minimal. All
metadata that would have required new fields now lives in the
notesmulti-line text. Governed by P5 (additive over
modification) and the project's minimum-complexity discipline. - Real Ollama tags only. Every catalogued
ollama_tagis
verified againstregistry.ollama.ai/v2/library/<repo>/manifests/<tag>
before inclusion. The originally-plannedqwen3:27band
qwen3:35b-a3bwere remapped to the realqwen3:30band
qwen3:30b-a3bafter probe; Kimi K2.6 was removed entirely
after every plausible tag returned 404. - Conservative
params_bfor MoE. Theqwen3:30b-a3bentry
declaresparams_b: 30.0following thegemma4:26b-a4b
precedent — TOTAL params when the tag encodes both numbers. The
F8/D18 caveat innotesdocuments that active-params count
does NOT predict VRAM consumption. logprobs_supported: falseconservatively. The Qwen3 family
announces logprob support upstream, but PUMA has not yet
empirically verified token-level confidence on these specific
tags. Flipping to true is part of the empirical validation
protocol when hardware becomes available.- gpu-high as the only target. 17.3 GB GGUF exceeds gpu-mid's
12–24 GB upper bound once OS + context overhead are accounted
for; gpu-high (24+ GB VRAM) is the only safe default. The
exact-equality anchor test
test_qwen3_entries_target_gpu_high_onlypins this so future
loosening requires deliberate intent.
Debt tracking
- No new open debt introduced by this release.
- No closure of pre-existing debt — Sprint 10 is
forward-looking catalog expansion, not a debt-paydown Sprint. - Empirical validation of
qwen3:30bandqwen3:30b-a3bis the
explicit follow-up; tracked via thenotestext on each entry
and via the validation roadmap in
docs/CATALOG_HISTORY.md§ "Empirical validation roadmap".
Known limitations
Unchanged from v2.6.0:
- Single hardware tier empirically evaluated (
gpu-entry);
models requiringgpu-mid/gpu-highare catalogued but not
yet validated. - AMD ROCm not yet detected.
- All
apple-silicon-*profiles declare
empirical_validation: pending(Sprint 9 forward-work). - TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion
3).
New in v2.7.0:
- Both
qwen3:30bandqwen3:30b-a3bare catalogued with
validation pending. Users ongpu-highhardware can use them
viapuma runand report empirical results to close the
validation gap. - Kimi K2.6 is not distributed via Ollama; the catalog
intentionally omits it. If a third-party distributor publishes
K2.6 to Ollama, a future Sprint can revisit cataloguing.
Empirical validation roadmap (when gpu-high hardware available)
The protocol for closing the validation gap is documented in
docs/CATALOG_HISTORY.md v2.7.0 § "Empirical validation roadmap":
- Pull the model via
ollama pull qwen3:30band verify the
digest matches the registry manifest probed at cataloguing. - Run the canonical baselines: triage_jira (F1) and
estimation_tawos (MAE). - Measure
parse_failure_rate(should be 0 for usable models)
and reproducibility (bit-exact under T=0.0 + seed=42 on the
gpu-high hardware). - If validation succeeds: bump
logprobs_supportedto true
(after a logprobs-enabled probe), extendprofiles_compatible
to vetted Apple Silicon Max/Ultra variants (≥36 GB unified
memory) pending separate Apple-side v...
PUMA v2.6.0 — Apple Silicon M3/M4/M5 support
PUMA v2.6.0 Release Notes
Release date: 2026-05-16
Previous release: v2.5.0 (2026-05-16)
Branch: develop → main (post-tag)
Summary
This release consolidates Sprint 9 (Apple Silicon M3/M4/M5 support)
onto the v2.5.0 base. It adds first-class detection of Apple Silicon
hosts (9 new profile identifiers covering M3 base/Pro/Max, M4
base/Pro/Max, M5 base/Pro/Max, and M5 Ultra), a native runtime mode
via ./start_puma.sh --native that boots Ollama with Metal on
macOS (no Docker), macOS-aware CodeCarbon tracking with a
powermetrics-availability probe, and the cross-architecture
reproducibility question documented as a testable hypothesis for a
future empirical close-out.
Empirical validation status: all apple-silicon-* profiles
declare empirical_validation: pending. PUMA's validation hardware
is the RTX 2060 Mobile 6 GB (gpu-entry); Apple Silicon hardware
joins the validation set in a future Sprint when MacBook M-series
hardware becomes available to the project. The dispatch
infrastructure shipped here enables that validation; the testing
protocol is documented in docs/CROSS_ARCH_REPRODUCIBILITY.md
§ "Testing protocol".
Highlights
Apple Silicon catalogued end-to-end
| Layer | Artefact |
|---|---|
| Profiles | 9 new apple-silicon-* identifiers in config/profiles.yaml; schema extended additively with apple_silicon_required, chip_brand_match, min_unified_memory_gb |
| Catalog | catalog_version bumped 2.5.0 → 2.6.0; conservative profiles_compatible[] additions per a ≈ 2× GGUF memory-headroom rule |
| Detection | src/puma/preflight/apple_silicon.py (NEW) — platform-isolated sysctl-based detection; testable on Linux via unittest.mock |
| Dispatch | SystemCapabilities gains chip_brand + unified_memory_gb; Profile gains optional Apple Silicon fields; select_profile() runs a new branch BEFORE the existing GPU/CPU dispatch |
| Runtime | ./start_puma.sh --native boots native Ollama + Python venv on macOS; ./stop_puma_native.sh teardown companion |
| Emissions | get_tracking_mode_and_warnings() resolver in puma.sustainability.codecarbon_wrapper; powermetrics probe; graceful process-mode fallback on macOS |
| Docs | docs/CROSS_ARCH_REPRODUCIBILITY.md (NEW); extensions in MACOS_NOTES.md, HARDWARE.md, CATALOG_HISTORY.md |
Conservative model compatibility
Apple Silicon variants need at least roughly 2× the model's GGUF
size in unified memory plus a small OS overhead. Applied as a rule:
| Model | Compatible Apple Silicon variants |
|---|---|
qwen2.5:1.5b, gemma3:1b |
All 10 (m3 base → m5-ultra) |
qwen2.5:3b, gemma3:4b |
All except m3 base (8 GB tight) |
qwen2.5:7b, mistral:7b, llama3.1:8b, deepseek-r1:7b |
Pro / Max / Ultra only (≥ 18 GB) |
qwen2.5:14b, deepseek-r1:14b, gemma3:27b |
Max / Ultra only (≥ 36 GB) |
gemma3:12b |
m3-max, m4-pro+, m5-pro+ (≥ 24 GB) |
gemma4:* |
Excluded from every apple-silicon-* — P6 enforcement |
gemma4 family — exclusion preserved AND extended
The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
stays excluded from gpu-entry per F8 / D18.
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. v2.6.0 extends the exclusion with a new invariant
test
test_gemma4_family_not_compatible_with_any_apple_silicon
ensuring the same VRAM-pressure failure mode is not re-introduced
on small unified-memory variants by accidental copy-paste during
future catalog edits. Re-enabling any (gemma4, apple-silicon-*)
pair requires new empirical evidence on Mac hardware and an
explicit debt entry referencing the prior exclusion.
CodeCarbon survives on macOS (Mode B)
tracking_mode="machine" is what PUMA's split-container Linux
architecture relies upon (the D15 fix). On macOS Mode B without
passwordless powermetrics, machine-mode silently fails to record
energy. v2.6.0 adopts a graceful fallback in
get_tracking_mode_and_warnings():
- Passwordless powermetrics configured →
tracking_mode="machine", no warnings. - Default macOS state (sudo required) →
tracking_mode="process"+ one warning pointing atdocs/MACOS_NOTES.md. puma run --no-emissions→ tracking disabled entirely.
Linux + NVIDIA path is byte-identical to v2.5.0:
tracking_mode="machine" with no warnings.
Cross-architecture reproducibility — open question, testable hypothesis
v2.6.0 frames bit-exact reproducibility between x86_64 Linux and
arm64 macOS as an open empirical question. The Q4_K_M integer
quantisation makes F1 and MAE expected to be bit-exact across
architectures; logprobs (and therefore ECE) are expected to differ
by FP rounding. The document records H0 / H1 / H2 / H3 hypotheses
and a 6-step testing protocol for closing them out when Mac
hardware joins the validation set. See
docs/CROSS_ARCH_REPRODUCIBILITY.md.
Tests
- 354 → 402 passing (
-m "not ollama"), 7 deselected. - New
tests/unit/test_apple_silicon.py: 28 tests covering every
public entry point of the new module with mocks for the
Darwin/arm64 gate, sysctl success + 3 failure modes
(FileNotFoundError, TimeoutExpired, CalledProcessError), a
parametrised mapping for all 10 chip brands, forward-compat for
unmapped chips, theget_apple_silicon_infodict shape, and
consistency checks onCHIP_BRAND_TO_PROFILE. - New
tests/unit/test_codecarbon_macos.py: 7 tests for the
tracking-mode helper and powermetrics probe — Linux
short-circuit (never invokes subprocess), macOS with/without
sudoers, probe behaviour on FileNotFoundError / non-zero exit /
zero exit. - Extended
tests/unit/test_catalog_metadata.py: +5 tests
(VALID_PROFILES inclusion, profiles.yaml definitions for all 9
apple-silicon-,chip_brand_matchuniqueness, gemma4 exclusion
from every apple-silicon-, qwen2.5:3b anchor on
apple-silicon-m4-pro). Pre-existing
test_model_metadata_is_internally_consistentand
test_gemma4_family_excluded_from_gpu_entrypreserved
unchanged. - Extended
tests/unit/test_preflight_profile.py: +7 tests for the
auto-dispatch path (M4/M4 Pro/M5 Max), boundary at 8 GB unified
memory, fall-through cases for insufficient unified memory,
unmapped chips, and non-Apple chip-brand values; +1 manual
override test for apple-silicon-m4. pre-commit run --all-files: all hooks green.puma validate-baseline(triage, F1 path, fresh Ollama):
PASS f1=0.5831, delta=-0.0036, ±0.01.puma validate-baseline --expected-mae 5.7150(estimation,
fresh Ollama):PASS mae=5.7150, delta=+0.0000, ±0.05—
bit-exact.
Quality
- Coverage: 61 % (no significant change from v2.5.0).
- CI: green on both
mainanddevelop. Theintegration-tests-ollama
job introduced in v2.5.0 continues to run on push to those
branches only. - Baseline reproducibility: F1 = 0.5867 ± 0.01 on
triage_jira
preserved. MAE = 5.7150 ± 0.05 onestimation_tawospreserved. src/puma/cli.pyLOC unchanged from v2.5.0; the only
meaningfully-grown file issrc/puma/preflight/apple_silicon.py
(NEW, 141 LOC).
Design decisions
- Linux path byte-identical to v2.5.0. Every Apple Silicon
code path returnsNone/ no-ops whencaps.chip_brand is None
oris_apple_silicon()is False, so no CI invocation on Linux
changes behaviour. Governed by P5 (additive over modification)
and P3 (reproducibility non-negotiable). apple-silicon-*profiles ship withempirical_validation: pending. Cataloguing without validation is a deliberate
signalling choice — the dispatch infrastructure exists, the
numbers do not. Frames the future Mac-hardware Sprint as
empirical close-out rather than groundwork. Aligns with the
F8/D18 lesson: nominal specifications do not predict runtime
compatibility on constrained hardware.- gemma4 stays excluded across both
gpu-entryAND
apple-silicon-*. P6 generalises from "do not re-introduce
previously-rejected (model, profile) pairs" to "do not
re-introduce the failure mode in a different profile family". get_tracking_mode_and_warnings()is additive. The Linux
branch returns("machine", []), byte-identical to the
v2.5.0-hardcoded value. The macOS branch is the new behaviour;
it only runs whenis_apple_silicon()returns True, which is
False on every Linux test runner.- Cross-arch reproducibility documented as testable, not as
fact. v2.6.0 ships a hypothesis (H0/H1/H2/H3) and a protocol,
not a claim of bit-exactness. Honest about the empirical gap.
Debt tracking
- No new open debt introduced by this release.
- No closure of pre-existing debt — Sprint 9 is forward-looking
infrastructure, not a debt-paydown Sprint. - Empirical validation of
apple-silicon-*profiles is the
explicit follow-up; it is not tracked as "debt" because the
catalogue declares its own status (empirical_validation: pending).
Known limitations
Unchanged from v2.5.0:
- Single hardware tier empirically evaluated (
gpu-entry); models
requiringgpu-mid/gpu-highare catalogued but not yet
validated. - AMD ROCm not yet detected.
- TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
input_textnot persisted intriage_jirainstances (D22, Low).
New in v2.6.0:
- All
apple-silicon-*profiles declare
empirical_validation: pending. Users on Mac hardware should
validate cross-arch reproducibility on their specific chip
before treating F1/MAE results as comparable to the Linux
baselines — seedocs/CROSS_ARCH_REPRODUCIBILITY.md§ "Testing
protocol". - macOS Mode B (native Ollama) is opt-in via
./start_puma.sh --native; the default./start_puma.shDocker
path is unchanged.
Upgrade notes
- No breaking changes. Existing CI invocations of
puma validate-baseline(no flags) continue to work unchanged.
Linux + NVIDIA dispatch is byte-identical to v2.5.0. - macOS users can opt into native mode (Metal acceleratio...
PUMA v2.5.0 — Hardening (Sprint 8)
PUMA v2.5.0 Release Notes
Release date: 2026-05-16
Previous release: v2.4.0 (2026-05-13)
Branch: develop → main (post-tag)
Summary
This release consolidates Sprint 8 (hardening) onto the v2.4.0 base.
It resolves the six inconsistencies (I5–I10) detected in the
post-v2.4.0 technical analysis and adds the first empirical MAE
canonical baseline for puma validate-baseline. The gemma4 family
remains empirically excluded from gpu-entry per F8 / D18; v2.5.0
documents the exclusion in a new versioned catalog changelog rather
than re-introducing the failure mode.
Highlights
Six inconsistencies resolved (I5–I10)
| ID | Resolution | Primary artefact |
|---|---|---|
| I5 | macOS Docker (CPU) vs native Ollama (Metal) modes clarified; v2.6.0 plan stated | docs/MACOS_NOTES.md (new) |
| I6 | gpu-entry hardware tolerance bands across RTX 2060/3050/3060/4050/4060 Mobile + Apple cross-arch row | docs/HARDWARE.md |
| I7 | Catalog now versioned (catalog_version: 2.5.0); changelog document; new unit test |
config/models_catalog.yaml, docs/CATALOG_HISTORY.md (new) |
| I8 | CI job integration-tests-ollama runs the 4 @pytest.mark.ollama tests on every push to main/develop |
.github/workflows/lint-and-test.yml |
| I9 | puma validate-baseline --expected-mae extends the command to estimation_tawos; canonical spec + empirical reference established |
src/puma/cli.py, specs/runs/baseline_estimation_canonical.yaml (new), docs/baseline_references.md (new) |
| I10 | Coverage breakdown by module group with explicit rationale for sub-40 % modules | docs/TESTING.md (new) |
Estimation canonical baseline established (v2.5.0)
The first empirical MAE reference for puma validate-baseline on
estimation_tawos is now published:
- Spec:
specs/runs/baseline_estimation_canonical.yaml - Configuration:
qwen2.5:3b × zero-shot × N=200 × seed=42 × T=0.0 - Reference MAE = 5.7150 SP (tolerance ±0.05 SP)
- Establishing run:
baseline_estimation_canonical_v1__26d0e07aaa7949ec__20260516T003317 - Verified bit-exact across 4 consecutive runs (cold + warm)
- Hardware:
gpu-entry(RTX 2060 Mobile 6 GB)
Cross-scenario state contamination — documented finding
During empirical establishment of the MAE reference, a
state-contamination effect was characterised: running a
triage_jira baseline between an Ollama restart and the estimation
validation shifts MAE from 5.7150 to ≈6.3150 SP (delta = +0.6 SP) —
well outside the ±0.05 tolerance. The fresh-Ollama-state validation
protocol that prevents the drift is documented in
docs/baseline_references.md. This is a property of Ollama's
inference engine (KV-cache + warm-state behaviour), not a PUMA
code-path regression. Related to D3 (CUDA non-determinism).
gemma4 family — status preserved
The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
remains catalogued and remains empirically excluded from
gpu-entry. The exclusion is grounded in:
- F8 (closed):
gemma4:e2bGGUF measured at 7.2 GB on disk
versus the ~2 GB suggested by effective active params. - D18 (closed): all 5 smoke runs of
gemma4:e2bon RTX 2060
6 GB VRAM returned emptyraw_responsestrings.
The regression-guard test
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. Users on gpu-mid (12–24 GB VRAM) and gpu-pro (24+
GB VRAM) hardware can use the gemma4 family normally; on
gpu-entry, select qwen2.5:* or gemma3:* instead. Full
rationale in docs/CATALOG_HISTORY.md.
Tests
- New:
tests/unit/test_cli_validate_baseline.pyextended from
3 → 8 tests (5 new for the MAE path, mutual exclusivity, missing
metric, and default-spec resolution). - New:
tests/unit/test_catalog_metadata.py::test_catalog_has_version_field. - Suite total: 348 → 354 passing, 7 deselected (
-m 'not ollama'). pre-commit run --all-files: all hooks green.puma validate-baseline(triage, F1 path):PASS f1=0.5831, delta=-0.0036.puma validate-baseline --expected-mae 5.7150(estimation,
NEW):PASS mae=5.7150, delta=+0.0000.
Quality
- Coverage: 61 % (essentially flat from v2.4.0; per-module
breakdown now indocs/TESTING.md). - CI: green on both
mainanddevelop. New
integration-tests-ollamajob runs only on push to those
branches and iscontinue-on-error: trueso a transient
Ollama failure does not gate the merge queue. - Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
preserved. MAE = 5.7150 ± 0.05 on estimation_tawos newly
established. src/puma/cli.pyLOC essentially unchanged from v2.4.0
(signature extension + a small dispatch block).
Design decisions
--expected-maeis additive, not a refactor. When neither
flag is provided,puma validate-baselinepreserves its v2.4.0
behaviour (F1 = 0.5867 against the triage baseline). Existing
CI invocations continue to work unchanged. Sprint Operating
Principle P5 (don't break working code) governed the choice.- gemma4 stays excluded. The original Sprint 8 plan asked for
re-adding gemma4 togpu-entryat 1.5 / 3 GB sizes. Inventory
caught the conflict with F8 (7.2 GB measured) and D18 (empty
responses). The revised plan converted S8.7 to
documentation-only inCATALOG_HISTORY.md, preserving the
regression guard. Sprint Operating Principle P6
(don't re-introduce previously-rejected models) governed the
choice. - MAE tolerance set to ±0.05 SP. The reference is bit-exact
across cold + warm runs (4-run verification); the ±0.05 band
absorbs the same kind of FP-ordering drift that the F1 ±0.01
band absorbs on triage. The cross-scenario state-contamination
effect (≈0.6 SP) is NOT absorbed by the tolerance — instead, a
validation protocol prevents the contamination.
Debt tracking
- No new open debt introduced by this release.
- Inconsistencies tracked: I1–I10 documented across the project;
v2.5.0 resolves I5–I10 (six items). I1–I4 were resolved in
earlier releases. - Total resolved across v2.0.0 → v2.5.0: 15 of 24 technical debt
items (62 %) plus 6 of 10 inconsistencies (60 %); v2.5.0
contributes the inconsistency resolutions.
Known limitations
Unchanged from v2.4.0:
- Single hardware tier evaluated (
gpu-entry); models requiring
gpu-midand above catalogued but not yet empirically
evaluated. - AMD ROCm and Apple Metal backends not yet detected. Apple
Silicon native mode planned for v2.6.0; AMD ROCm pending
hardware availability. - TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion
3). input_textnot persisted intriage_jirainstances (D22,
Low).
Upgrade notes
- No breaking changes. Existing CI invocations of
puma validate-baseline(no flags, expecting F1 = 0.5867)
continue to work unchanged. - New flag available:
puma validate-baseline --expected-mae
forestimation_tawos. Seedocs/baseline_references.mdfor
the recommended invocation including the fresh-Ollama-state
protocol. - New docs to know about:
MACOS_NOTES.md,
CATALOG_HISTORY.md,baseline_references.md,TESTING.md. - The
lint-and-test.ymlCI workflow now contains a second job
(integration-tests-ollama) that runs on push to main/develop
only. PRs are unaffected.
Acknowledgments
Development assistance provided by generative AI tooling. All
commits are attributed to the project's git identity per
repository convention.
PUMA v2.4.0 — CLI completeness (Anexo F section A.2)
PUMA v2.4.0 Release Notes
Release date: 2026-05-13
Previous release: v2.3.0 (2026-05-13)
Branch: develop → main (post-tag)
Summary
This release consolidates Sprint 7 (CLI completeness for Anexo F) onto
the v2.3.0 base. It resolves the long-standing gap between the academic
Anexo F catalog and the actual repository state by adding the six
high-value commands from section A.2 of Anexo F, together with a new
source-of-truth document distinguishing implemented commands from
documented design proposals.
Highlights
Anexo F gap resolved
docs/anexo_F_cli_reference.md is now the canonical CLI reference,
split into two sections:
- Section A — Implemented: A.1 lists commands pre-existing in
v2.0.0–v2.3.0 (preflight,models,datasets,cache,run,
validate-baseline,compare,dashboard,report,db). A.2
lists the six commands added in this release (below). Every command
in section A is verifiable viapuma <comando> --helpand covered
by tests undertests/cli/. - Section B — Proposed extensions: 5 Bash auxiliary scripts and
12 further CLI commands (Ollama management, sweep wrappers, DB
tooling, code-quality wrappers) documented as design space.
Explicitly marked as not implemented; the decision rationale is
recorded in the document.
Six new CLI commands (Anexo F § A.2)
| Anexo F | Command | Style |
|---|---|---|
| A.2.1 | puma prepare-datasets |
Thin subprocess wrapper of scripts/prepare_datasets.py (--dataset, --force-redownload, --verify) |
| A.2.2 | puma wilcoxon |
NEW analysis: paired Wilcoxon signed-rank between two run_ids; uses puma.metrics.statistical_tests.wilcoxon_signed_rank_models |
| A.2.3 | puma bias-analysis |
NEW analysis: bias evaluation report; uses puma.dashboard.data.load_predictions_with_gold + puma.metrics.fairness.perturbation_disparity |
| A.2.4 | puma generate-plots |
Thin subprocess wrapper of scripts/generate_phase_b_plots.py (--source phase_b only; bias_eval/multi_seed exit 2 with deferred-implementation message) |
| A.2.5 | puma list-runs |
New: SQL pivot of runs ⋈ metrics with --scenario/--model/--last-n/--since filters and --json |
| A.2.6 | puma list-ollama-models |
New: parses docker exec puma_ollama ollama list subprocess output |
Why two commands are NEW analyses and not wrappers: Anexo F § A.2.2
and § A.2.3 specify semantics that diverge from the existing scripts
(positional run_id arguments vs. --run-prefix; --models /
--perturbations filters vs. prefix-only). Rather than mutate the
scripts, the new CLI commands call PUMA's own core helpers directly.
The scripts remain unchanged and continue to support their original
workflows (top-K ranking).
Tests
- New:
tests/cli/package with 27 tests across 6 files, one per
command. Each file tests at minimum:--helpexit 0, happy path,
error paths. - Suite total: 318 → 348 passing, 7 deselected (
-m 'not ollama'). pre-commit run --all-files: all hooks green.puma validate-baseline: PASSf1_macro=0.5831, delta=-0.0036.
Quality
- Coverage: 58 % (no significant change from v2.3.0).
- CI: green on both
mainanddevelop. - Baseline reproducibility: F1 = 0.5867 ± 0.01 holds.
app.pyandcli.pyLOC are stable; the only file that grew
meaningfully issrc/puma/cli.py(363 → 777 LOC by inline
command bodies). Refactor tosrc/puma/cli/commands/package was
considered and deferred — the monolith remains the cleaner option
at this size.
Design decisions
--source bias_eval/multi_seedforgenerate-plotsare
accepted by the parser but exit 2 with a deferred-implementation
message because the underlying plotting scripts for those sources
do not yet exist. This matches the Anexo F spec without inventing
data.--verifyforprepare-datasetscurrently emits SHA-256 hashes
only. A manifest file (docs/datasets_manifest.json) for full hash
comparison is documented in Anexo F but not yet in the repo; the
command is forward-compatible.- No
src/puma/cli/commands/refactor was performed in this
release. With 6 new commands the inline monolith is still readable;
the refactor would be justified if/when Section B extensions land.
Debt tracking
- No new open debt introduced by this release.
- Total resolved across v2.0.0 → v2.4.0: 15 of 24 (62 %).
- Section B of Anexo F is documented design space, not technical
debt. Implementation is optional and conditional on demand.
Known limitations
Unchanged from v2.3.0:
- Single hardware tier evaluated (
gpu-entry); models requiring
gpu-midand above catalogued but not yet empirically evaluated. - AMD ROCm and Apple Metal backends not yet detected.
- TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
input_textnot persisted intriage_jirainstances (D22, Low).
Upgrade notes
- No breaking changes to existing commands or YAML run-spec schema.
- Six new commands available. See
puma <command> --helpfor usage
or readdocs/anexo_F_cli_reference.md§ A.2. - New test directory
tests/cli/joins the existing
tests/unit/andtests/integration/.
Acknowledgments
Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.
PUMA v2.3.0 — dashboard production-quality + docs structure
PUMA v2.3.0 Release Notes
Release date: 2026-05-13
Previous release: v2.2.0 (2026-05-13)
Branch: develop → main (post-tag)
Summary
This release consolidates Sprint 6 (dashboard polish + structural
refactor) and retrospective documentation work (INDEX.md +
docs/overview.md + README branding) onto the v2.2.0 base. With this
release, Phase C of the master plan is fully complete.
Highlights
Dashboard production-quality (Sprint 6)
Major refactor: app.py reduced from 803 LOC monolithic to 168 LOC
router (-79 %). View logic delegated to seven modules in
src/puma/dashboard/views/. Each view is independently importable
and testable; the router publishes filters to st.session_state and
dispatches via a VIEWS dict.
Ten polish improvements applied:
| # | Improvement | Impact |
|---|---|---|
| 1 | @st.cache_data(ttl=60) on 7 loaders |
Performance |
| 2 | st.spinner on slow operations |
UX |
| 3 | CSV export on 4 tables | Productivity |
| 4 | Tooltips on ≈ 12 metric cards | UX |
| 5 | Unified empty-filtered-state component | UX |
| 6 | Friendly expander titles in Overview | UX |
| 7 | Module-level imports (no more inline) | Code quality |
| 8 | Emoji prefixes consistent across 7 view titles | UX |
| 9 | Dark-mode dataframe text legibility | UX (bug fix) |
| 10 | Empty-selectbox guard in Instance Drill-down | Robustness |
Plus: first-visit guided tour with view overview and tips
(download CSV, dark mode, tooltips). Persistent dismiss via
st.session_state["tour_dismissed"]; "📖 Show tour" button in the
sidebar to re-open.
Documentation structure (Phase E.bis retrospective + Phase E.ter)
INDEX.md(root, uppercase): project status, phases, releases,
debt tracking, architecture entry points. Created in Phase E.bis;
this release updates it for v2.3.0 status.docs/overview.md(new location): preserves the 256 LOC of
architectural content from the legacy lowercaseindex.md.README.md: branded header with PUMA logo, descriptive
blockquote, and Related-Resources section linking to puma-vault,
the published knowledge garden, releases, INDEX.md, and
docs/overview.md.
Quality
- Tests: 318 passing (up from 313 in v2.2.0; +5 dashboard smoke
tests covering view module integrity, polish helpers, cache
decorator presence, and the end-to-end AppTest render with the
live database). - Coverage: 58 % (up from 55 % in v2.2.0).
- Pre-commit: 10/10 hooks green.
- CI: green on both
mainanddevelop. - Baseline reproducibility: F1 = 0.5867 ± 0.01 holds; verified
viapuma validate-baseline(PASS at 0.5831, delta −0.0036).
Methodological findings (academic traceability)
Sprint 6 surfaced one additional finding consistent with the
meta-pattern documented in docs/known_debt.md ("symptom in layer
N, root cause in layer M ≠ N"):
- Dark-mode dataframe invisibility. The CSS rule applied
light-mode colours globally; under dark mode, table text inherited
light-mode colours against the dark background, rendering tables
nearly unreadable. Symptom (invisible tables) appeared in the
dashboard layer; root cause (CSS scope without theme awareness)
was in the styling layer. Resolved in the same commit as the
refactor by adding a theme-aware CSS override
(color: #E5E7EB+background-color: #16213Ewhen
dark_mode == True).
This brings the meta-pattern catalogue to five instances (D15, D18,
D21, D22, and this CSS scope issue); the fifth is retired in the
same commit that surfaced it.
CI workflow hygiene
The .github/workflows/release.yml fix introduced in Phase E.bis
(commit 863c166) is now exercised end-to-end by the v2.3.0 tag
push. After the tag was pushed and gh release create ran, exactly
one release was created (no duplicate draft). The fix is verified
effective for v2.X.0 releases going forward.
Debt tracking
- No new open debt introduced by this release.
- Total resolved across v2.0.0 → v2.3.0: 15 of 24 items (62 %).
- Phase C: ✓ COMPLETE (was the last open phase; all five
Gate-C criteria met).
Full inventory and diagnostic write-ups in
docs/known_debt.md.
Known limitations
- Single hardware tier evaluated (
gpu-entry); models requiring
gpu-midand above (qwen2.5:14b, gemma3:27b, deepseek-r1:14b,
thegemma4family, llama3.1:70b) catalogued but not yet
empirically evaluated. - AMD ROCm and Apple Metal backends not yet detected (development
hardware is NVIDIA-only). - TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
input_textnot persisted intriage_jirainstances (D22, Low —
future data-pipeline enhancement). The Dashboard Instance
Drill-down handles this gracefully with an informative message.
Master plan status (post-v2.3.0)
| Phase | Status |
|---|---|
| A — Foundations | ✓ COMPLETE |
| B — Multi-model sweep | ✓ COMPLETE |
| C — Professional dashboard | ✓ COMPLETE (this release) |
| D — Technical depth | ✓ ~95 % (ROCm/Metal n/a in current hardware) |
| E — Documentation and releases | ✓ COMPLETE (v2.0.0, v2.1.0, v2.2.0, v2.3.0) |
All five phases of the original master plan are now complete or
effectively complete (Phase D's remaining items are
hardware-dependent or scope-deferred).
Upgrade notes
- No breaking changes to the public CLI or YAML run-spec schema.
- Dashboard refactor is internal; user-facing behaviour is preserved.
- Existing run-specs and CLI invocations work unchanged.
- The dashboard module structure has changed (
app.pyis now a
router; each view lives insrc/puma/dashboard/views/<name>.py).
Any external tooling that imported view code fromapp.pyshould
migrate to the new module paths.
Acknowledgments
Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.
PUMA v2.2.0 — statistical pipeline + dashboard core + bias evaluation
PUMA v2.2.0 Release Notes
Release date: 2026-05-13
Previous release: v2.1.0 (2026-05-10)
Branch: develop → main (post-tag)
Summary
This release consolidates Sprints 3, 4, and 5 onto the v2.1.0 base:
statistical-analysis pipeline (ECE, Wilcoxon signed-rank, multi-seed
reproducibility), dashboard core with PUMA visual identity, and an
empirical bias-evaluation suite adapted to the characteristics of the
technical corpus.
Highlights
Statistical analysis pipeline (Sprint 3)
- ECE end-to-end. Expected Calibration Error computed from real
Ollama logprobs and persisted asmetrics.metric_name='ece'.
Validated against Guo et al. (2017) canonical cases with tolerance
1e-6. Baseline qwen2.5:3b shows ECE=0.39 — significant
miscalibration typical of out-of-the-box LLMs without post-hoc
calibration. - Multi-seed validation. Seeds {42, 123, 456} on the canonical
baseline yield zero variance in task metrics under T=0.0,
confirming the bit-exact reproducibility guarantee documented in
v2.0.0. Runtime jitter ~4 %. - Wilcoxon pairwise comparison (Demšar 2006 methodology) on
paired-correctness indicators. Demonstrated empirically on a
mini-comparison (qwen2.5:1.5b vs gemma3:1b on triage_jira × N=50)
that a 0.19-point F1 gap is not statistically significant at
α=0.05 (p=0.108, n_pairs=19/50) — the kind of finding the test is
designed to surface.
Dashboard core (Sprint 4)
- 5 fully functional views: Overview (cohort cards + per-run
expanders, sidebar filters applied), Model Comparison (mean±std
aggregation across seeds, run × metric heatmap, Wilcoxon
artefact rendering), Reliability (real ECE + reliability diagram
from logprobs), Sustainability Frontier (F1 vs CO₂ Pareto consuming
the emissions table from Sprint 2 D15), Instance Drill-down
(correctgold_labelvia the new JOIN, top-K logprobs). - 2 informed placeholders pending data: Fairness and Robustness
(made functional by Sprint 5). - PUMA visual identity: emerald palette, sans-serif typography,
logo in sidebar, telemetry disabled. - Dark-mode toggle via runtime CSS override.
- 6 smoke tests including an end-to-end
streamlit.testing.v1.AppTest
render. - Bug fix: the Fairness and Instance Drill-down views in v2.1.0
readgold_labelfrompredictions, where it does not exist
(it lives ininstances). The newload_predictions_with_gold
helper LEFT-JOINs the two tables and is now consumed by every
view that needs the gold label.
Bias evaluation (Sprint 5) — empirical findings
Methodological adaptation. The triage_jira corpus is 100 %
technical incident text with 0 % gendered terms (verified by regex
over 23 EN tokens across all 200 instances). A textbook
pronoun-substitution gender_swap on this corpus would yield
flip_rate = 0 and disparity = 0 — a false PASS demonstrating
nothing. Sprint 5 therefore evaluates bias via signal injection
rather than signal substitution:
gender_swap_prefix_{male,female}— prepends a gendered identity
prefix (John Smith reported: …vsMary Smith reported: …).
Methodology per Caliskan et al. (2017) and Bolukbasi et al. (2016).register_shift_informal— formal→informal substitution proxy for
the dialect axis on a monolingual technical corpus (Tatman 2017).
Key empirical findings on triage_jira × N=100 per condition:
| Model | Flip rate (any prefix vs baseline) | Δ accuracy | M-vs-F directional bias |
|---|---|---|---|
| qwen2.5:1.5b | ~25-27 % | −11 to −12 pp | 15 % |
| qwen2.5:3b | ~25-27 % | −3 to −4 pp | 5 % |
- Both models flip ~25 % of predictions when any gender signal is
added vs the un-perturbed baseline — strong sensitivity to identity
cues that the technical content does not require. - The 3× larger model exhibits ~3× less directional bias while
losing ~3× less accuracy under signal injection. register_shift_informalshows ~0 % effect on both models:
formal-to-informal substitution does not perturb predictions.
Closes Gate D criterion 4 ("Bias semántico básico implementado y
validado"). Closes debt D19.
Quality
- Tests: 313 passing (up from 276 in v2.1.0; +37 TDD across the
three sprints). - Pre-commit: 10/10 hooks green.
- CI: green on both
mainanddevelop. - Baseline reproducibility: F1=0.5867 ± 0.01 holds; verified via
puma validate-baseline(PASS at 0.5831, delta −0.0036).
Methodological findings (academic traceability)
- Sprint 3 confirmed empirically the deterministic-reproducibility
guarantee documented in v2.0.0 (zero variance under T=0.0 with three
different seeds; runtime jitter does not propagate to metrics). - Sprint 4 dashboard integration exposed D22: the synthetic
triage_jiradataset does not persistinput_text. This is a
fourth instance of the "symptom in layer N, root cause in layer M"
meta-pattern documented indocs/known_debt.md(joining D15, D18,
D21). - Sprint 5 confirmed D3 empirically:
puma validate-baseline
showed bit-exact F1=0.6002 across consecutive runs after the bias
sweep, then returned to F1=0.5831 afterdocker compose restart puma_ollama. Documents the warm-state-drift scope of the
reproducibility guarantee — important for any future B.4 sweep
protocol.
Debt tracking
- Resolved this release: D19 (fairness scaffolding only).
- New entry: D22 (Low) —
instances.input_textempty on
triage_jira. - Total resolved across v2.0.0 → v2.2.0: 15 of 23 (65 %).
- Open: 8 (0 critical, 5 medium, 2 low; 1 marked
DECIDED-NO-ACTION).
Full inventory and diagnostic write-ups in
docs/known_debt.md.
Known limitations
input_textnot persisted intriage_jirainstances (D22, Low —
future data-pipeline enhancement).- Single hardware tier evaluated (
gpu-entry); models requiring
gpu-midand above (qwen2.5:14b, gemma3:27b, deepseek-r1:14b,
thegemma4family, llama3.1:70b) catalogued but not yet
empirically evaluated. - Dashboard polish deferred — animations, guided tour, refactor of
the 640-LOCapp.pyto aviews/module package. Slated for a
future Sprint 6. - AMD ROCm and Apple Metal backends not yet detected (development
hardware is NVIDIA-only). - TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
Upgrade notes
- No breaking changes to the public CLI or YAML run-spec schema.
- New perturbation names accepted in
perturbations:lists:
gender_swap_prefix_male,gender_swap_prefix_female,
register_shift_informal. - Dashboard views update automatically when perturbed runs are
present; no migration step required.
References
- Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics
derived automatically from language corpora contain human-like
biases. Science 356(6334), 183-186. - Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. T.
(2016). Man is to computer programmer as woman is to homemaker?
Debiasing word embeddings. NeurIPS. - Tatman, R. (2017). Gender and dialect bias in YouTube's automatic
captions. In Proceedings of the First ACL Workshop on Ethics in
Natural Language Processing. - Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On
calibration of modern neural networks. ICML. - Demšar, J. (2006). Statistical comparisons of classifiers over
multiple data sets. JMLR 7, 1-30. - Wilcoxon, F. (1945). Individual comparisons by ranking methods.
Biometrics Bulletin 1(6), 80-83.
Acknowledgments
Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.