Releases: pumacp/puma

PUMA reproducibility anchor (v2.7.0-baseline-anchor)

04 Jun 09:42

Canonical, SHA-pinned reproducibility anchor for the PUMA empirical baseline.

Commit: 6671108

Verified reproducible metrics (qwen2.5:3b, N=200, seed=42, temperature=0.0):

  • triage_jira F1-macro = 0.5867 +/- 0.01 (contextual-anchoring)
  • estimation_tawos MAE = 5.7150 +/- 0.05 SP (zero-shot)

Reproduce:
puma validate-baseline --expected-f1 0.5867 --tolerance 0.01
puma validate-baseline --expected-mae 5.7150 --tolerance 0.05

Reproducibility gates F1=0.5867 and MAE=5.7150 are bit-exact stable across
releases v2.4.0, v2.5.0, v2.6.0 and v2.7.0.
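
The tolerance gate behind these two commands can be pictured as a plain absolute-difference check. The sketch below is an illustration only: the helper name `within_tolerance` and the inclusive boundary are assumptions, not PUMA's actual implementation. The observed F1 of 0.5831 is the value the later release notes on this page report for fresh-Ollama runs.

```python
def within_tolerance(observed: float, expected: float, tolerance: float) -> bool:
    """Return True when observed lies within +/- tolerance of expected.

    Hypothetical helper; the inclusive comparison is an assumption.
    """
    return abs(observed - expected) <= tolerance


# Values quoted in these release notes.
f1_ok = within_tolerance(0.5831, 0.5867, 0.01)   # delta = -0.0036
mae_ok = within_tolerance(5.7150, 5.7150, 0.05)  # delta = +0.0000

print(f"F1 gate: {'PASS' if f1_ok else 'FAIL'}")
print(f"MAE gate: {'PASS' if mae_ok else 'FAIL'}")
```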

PUMA v4.0.0 — Sprint 12 closure

02 Jun 02:50

PUMA is a local, reproducible benchmarking framework for open LLMs on
project-management tasks (issue triage and effort estimation), run entirely
on your own hardware via Ollama — deterministic, offline by default, and
verifiable end to end.

This release is validated by a real-world milestone. PUMA's federated
community-submission infrastructure is now operational and was proven end to
end by the first official production submission (qwen2.5:3b on
triage_jira / zero_shot, F1-macro 0.3898), which landed at
pumacp/puma-community#8 (merge SHA 111cee36) and was mirrored to the
Hugging Face submissions dataset.
That F1 is the reproducible floor anchor for the zero-shot strategy on
triage_jira; see docs/first-submission.md.

Sprint 12 highlights

  • #45 — PyPI + Docker (ghcr.io) publishing workflows + Dockerfile.publish; pyproject.toml hardened (distribution name puma-cp)
  • #46 — Multi-model comparison dashboard view + corporate Streamlit palette
  • #47 — README channel-directory restructure; acrostic visual flexibility
  • #48 — mkdocs full content sync (nav 6 → 28 pages); D30 resolved
  • #49 — Manual IDE contribution workflow docs
  • #50 — Security audit MVP: pip-audit + bandit + gitleaks + Trivy + SECURITY.md + threat model
  • #51 — Consolidated technical reference (~5100 words, 17-decision timeline)
  • #52 — Inaugural production submission documented (docs/first-submission.md)

Installation

pip install puma-cp==4.0.0

The container image (ghcr.io/pumacp/puma:4.0.0) is not yet published for
this release: the Trivy gate in the publish pipeline blocked it on 3 HIGH-severity
base-image CVEs (0 CRITICAL). This is the security gate working as designed; the
image will be re-published once the base image is patched (tracked for S12.19).
Use the PyPI package in the meantime.

Quick start

See the documentation site and the
getting-started overview.

What's in v4.0.0

Added — federated submission infrastructure (S12.15) validated by the
inaugural submission; PyPI + ghcr.io publishing; multi-model dashboard view;
consolidated technical reference; manual IDE contribution workflow; security
audit MVP; corporate monochrome visual identity; acrostic visual flexibility.

Changed — mkdocs nav 6 → 28 public pages; read-only puma models
sub-group (D30 resolved); pyproject.toml hardened; project version → 4.0.0.

Security — pip-audit / bandit / gitleaks on every push; Trivy on every
container publish (blocks HIGH/CRITICAL); SECURITY.md disclosure policy;
9-check submission validation pipeline.

Infrastructure — Dockerfile.publish (multi-stage, non-root, OCI labels);
GitHub Pages live with a 28-page nav; Hugging Face dataset mirror operational.

Full detail in CHANGELOG.md.

Known limitations (deferred to S12.19 / post-Sprint-12)

See docs/known_debt.md.

  • D38: validate-submission workflow references a non-existent action version.
  • D39: verify-integrity workflow broken by gradio_client API drift; the inaugural submission is therefore self-attested.
  • D40: puma share-results CLI hangs after the Review panel.
  • Container publish blocked by 3 HIGH base-image CVEs (re-publish after patching).
  • notify-discord lacks the DISCORD_WEBHOOK secret (optional integration).

Acknowledgments

Thanks to everyone who tested the submission pipeline end to end and helped land
the first official community submission — the milestone that validates this release.

v3.1.0 — Sprint 11' Post-v3.0.0 Reconciliation

25 May 01:37

PUMA v3.1.0 — Sprint 11' Post-v3.0.0 Reconciliation

  • Release date: 2026-05-25
  • Previous release: v3.0.0 (2026-05-20)
  • Tag: v3.1.0

Overview

Sprint 11' is a post-v3.0.0 hardening Sprint: it reconciles the v3.0.0
release artifacts (version strings, CHANGELOG, release notes), completes the
puma community CLI subgroup, applies minimal repairs to the wiki-sync
workflows and the Verifier Space, and cleans residual documentation drift
while preserving historical accuracy. No breaking changes.

Highlights

  • New puma community Typer group with four subcommands — browse,
    pull, verify-hash, validate (Annex F, F.16.5–F.16.8). The GitHub
    Contents API is read over httpx (mockable in tests); verify-hash
    recomputes the predictions hash byte-identically to the client and is
    D23-aware on --remote.
  • Verifier Space hash output now matches schema v1.0.0 — the
    pumaproject/puma-verifier Space no longer emits the sha256: prefix, so
    its digest conforms to ^[a-f0-9]{64}$.
  • Wiki sync works again in both repos — the wiki-sync.yml workflows now
    declare contents: write, so the GitHub Wiki publishes on push; both
    /wiki pages render (HTTP 200).
  • D23 technical debt documented — the Verifier (2-field JSONL) and client
    (4-field DB CSV) hash algorithms differ by construction; reconciliation is
    deferred to v4.x with a schema decision.
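
The digest format mentioned above (no sha256: prefix, matching ^[a-f0-9]{64}$) can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: `normalize_digest` is a hypothetical helper name, and stripping the prefix on the consumer side is an illustration, not a description of how the Verifier Space or the client is implemented.

```python
import hashlib
import re

# Pattern quoted in the release notes for schema v1.0.0 digests.
DIGEST_RE = re.compile(r"^[a-f0-9]{64}$")


def normalize_digest(value: str) -> str:
    """Strip an optional 'sha256:' prefix and validate the bare hex digest."""
    digest = value.strip()
    if digest.startswith("sha256:"):
        digest = digest[len("sha256:"):]
    if not DIGEST_RE.fullmatch(digest):
        raise ValueError(f"not a bare lowercase SHA-256 hex digest: {value!r}")
    return digest


# Illustrative payload only; real submissions hash the predictions file.
bare = hashlib.sha256(b"example predictions").hexdigest()
print(normalize_digest("sha256:" + bare) == bare)
```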

Quality at release time

  • mypy --strict: 0 errors / 81 files
  • pytest -m "not ollama": 597 passed, 1 skipped, 7 deselected
  • pytest -m ollama: 7 passed
  • F1 triage baseline: 0.5831 ± 0.01 (Δ = −0.0036)
  • MAE estimation baseline: 5.7150 bit-exact (Δ = +0.0000)
  • Coverage on community CLI: browse 80%, pull 87%, verify 85%, validate 89%
  • Schema v1.0.0 unchanged (P3); zero federation references in code (P4).

Known limitations

  • D23: puma community verify-hash --remote returns mismatch by
    construction for schema v1.0.0 submissions even when the local hash is
    correct, because the Verifier Space hashes a different input shape than the
    client. Local verification (the canonical path) is unaffected.
    Reconciliation deferred to v4.x with a schema decision. See
    docs/known_debt.md.
  • Kaggle mirror activation (S11'.6): the mirror-kaggle.yml workflow is
    fully hardened (companion PR pumacp/puma-community#4: --dir-mode zip,
    robust create-vs-version, CC-BY-4.0 license, 50-char-safe title,
    post-publish HEAD verification). Publication is pending a Kaggle-internal
    soft-delete grace period that reserves the slug
    pumacp/puma-community-submissions; it resolves automatically once Kaggle
    releases the slug — no further code action is needed.
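
To see why D23 produces a mismatch "by construction", it helps to hash the same predictions under two different serializations. The sketch below is purely illustrative: the field names (`instance_id`, `prediction`, `run_id`, `model`) and the exact JSONL/CSV layouts are hypothetical, chosen only to mirror the "2-field JSONL" versus "4-field CSV" distinction described above.

```python
import csv
import hashlib
import io
import json

# Hypothetical prediction rows; the real column names are not given in the notes.
ROWS = [
    {"instance_id": "A-1", "prediction": "High", "run_id": "r1", "model": "qwen2.5:3b"},
    {"instance_id": "A-2", "prediction": "Low", "run_id": "r1", "model": "qwen2.5:3b"},
]


def hash_two_field_jsonl(rows) -> str:
    """Hash a 2-field JSONL view of the rows (Verifier-style input shape)."""
    lines = [
        json.dumps({"instance_id": r["instance_id"], "prediction": r["prediction"]},
                   sort_keys=True)
        for r in rows
    ]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def hash_four_field_csv(rows) -> str:
    """Hash a 4-field CSV view of the rows (client-style input shape)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for r in rows:
        writer.writerow([r["instance_id"], r["prediction"], r["run_id"], r["model"]])
    return hashlib.sha256(buf.getvalue().encode("utf-8")).hexdigest()


jsonl_digest = hash_two_field_jsonl(ROWS)
csv_digest = hash_four_field_csv(ROWS)
print(jsonl_digest != csv_digest)  # same predictions, different bytes hashed
```

Both digests are well-formed SHA-256 values; they differ only because the bytes fed to the hash differ, which is why local verification stays valid while the remote comparison cannot match.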

Upgrade path

No breaking changes since v3.0.0. Refresh with:

git pull && pip install -e '.[dev]'

See also

  • CHANGELOG.md — the [3.1.0] section.
  • PR #15 on this repo — the full commit list.
  • Companion PRs: pumacp/puma-community#2 (HF namespace), #3 (wiki-sync),
    #4 (Kaggle mirror hardening).
  • pumaproject/puma-verifier HF Space @ commit d8a4ffd.
  • Annex G of the academic report (memoria) — discovery-before-write episodes
    recorded during this Sprint.

PUMA v3.0.0 — Community + mypy remediation + public docs + security hardening

20 May 13:30

First public release. See CHANGELOG.md for full details.

PUMA v2.7.0 — Catalog expansion: Qwen3 (pending validation)

16 May 04:54

PUMA v2.7.0 Release Notes

Release date: 2026-05-16
Previous release: v2.6.0 (2026-05-16)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 10 (catalog expansion) onto the
v2.6.0 base. It adds two Alibaba Qwen3 family entries to the
catalog — both verified against the Ollama registry before
inclusion — and formally excludes Kimi K2.6 after a 13-tag
registry probe confirmed it is not distributed via Ollama. The
catalog schema is preserved at the 8 fields used since v2.0.0;
all v2.7.0 metadata lives within the existing notes field per
the project's minimum-complexity discipline.

Highlights

Two Qwen3 entries — registry-verified before cataloguing

| Tag | Type | GGUF (verified) | Profile | Notes |
| --- | --- | --- | --- | --- |
| qwen3:30b | Dense | 17.3 GB | gpu-high | Hybrid Gated DeltaNet + self-attention, 262144 context |
| qwen3:30b-a3b | MoE | 17.3 GB | gpu-high | 30B total / ~3B active per token; F8/D18 MoE caveat preserved in notes |

Both entries:

  • Verified via Ollama registry manifest probe
    (registry.ollama.ai/v2/library/qwen3/manifests/* returned HTTP
    200 with the GGUF size derived from the sum of layer sizes).
  • Declare logprobs_supported: false conservatively until
    empirical verification on appropriate hardware.
  • Excluded from gpu-entry AND every apple-silicon-* profile by
    the P11 pending-validation invariant. Five new regression-guard
    tests in tests/unit/test_catalog_metadata.py pin this contract
    via exact-equality on profiles_compatible == ['gpu-high'].
  • params_b: 30.0 follows the gemma4:26b-a4b precedent (TOTAL
    when the tag encodes both numbers). The MoE/F8 caveat in the
    notes field documents that active-params count does NOT
    predict GGUF size or runtime VRAM consumption.
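
The "sum of layer sizes" derivation mentioned in the first bullet can be sketched offline. The manifest below is a hypothetical stand-in (the byte counts are invented so the total rounds to the 17.3 GB quoted above), and the use of decimal gigabytes (10^9 bytes) is an assumption; the notes do not say which unit convention PUMA uses.

```python
from typing import Any, Dict


def gguf_size_gb(manifest: Dict[str, Any]) -> float:
    """Sum the 'size' of every layer in a registry manifest, in decimal GB."""
    total_bytes = sum(layer["size"] for layer in manifest.get("layers", []))
    return round(total_bytes / 1e9, 1)


# Hypothetical manifest fragment: one large model layer plus small metadata layers.
sample_manifest = {
    "layers": [
        {"size": 17_280_000_000},  # model weights (invented value)
        {"size": 12_000},          # template / params (invented value)
        {"size": 1_500},           # license text (invented value)
    ]
}

print(gguf_size_gb(sample_manifest))
```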

Kimi K2.6 — formally excluded after 13-tag registry probe

A registry probe on 2026-05-16 returned HTTP 404 on every
plausible Ollama tag naming for Kimi K2.6:

kimi-k2:6              kimi-k2:latest        kimi-k2:1t
kimi-k2:1t-instruct    kimi-k2:0905          kimi-k2:base
kimi-k2:instruct       kimi:latest           kimi-k2.6:latest
moonshot:latest        moonshot:kimi-k2      kimi-k2-base:latest
kimi-k2-instruct:latest

The model is not distributed via the Ollama registry as of the
v2.7.0 cut. Cataloguing a non-existent ollama_tag would violate
the project's empirical-first principle (P10) and produce a
broken puma models pull command for users following the catalog
metadata. The exclusion decision is recorded in
docs/CATALOG_HISTORY.md v2.7.0 § "Considered but not catalogued"
with the full probe table for academic traceability. It may be
reconsidered in a future release if Moonshot AI or a third-party
distributor publishes K2.6 to the Ollama registry.
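
A probe of this kind can be sketched as a URL builder plus a status check. In the sketch below, the URL shape follows the registry path quoted in these notes; the https scheme, the function names, and the injectable `fetch_status` callable are assumptions made so the example runs offline with a stub instead of real network calls.

```python
from typing import Callable, Dict, Iterable

REGISTRY_BASE = "https://registry.ollama.ai/v2/library"  # scheme is an assumption


def manifest_url(ollama_tag: str) -> str:
    """Build the manifest URL for a tag such as 'qwen3:30b'."""
    repo, _, version = ollama_tag.partition(":")
    return f"{REGISTRY_BASE}/{repo}/manifests/{version or 'latest'}"


def probe(tags: Iterable[str], fetch_status: Callable[[str], int]) -> Dict[str, bool]:
    """Map each tag to True when its manifest URL answers HTTP 200."""
    return {tag: fetch_status(manifest_url(tag)) == 200 for tag in tags}


def stub_status(url: str) -> int:
    """Offline stand-in for an HTTP call: only qwen3 manifests 'exist'."""
    return 200 if "/qwen3/" in url else 404


results = probe(["qwen3:30b", "kimi-k2:latest", "moonshot:kimi-k2"], stub_status)
for tag, exists in results.items():
    print(f"{tag}: {'catalogue candidate' if exists else 'not in registry'}")
```

Swapping `stub_status` for a real HTTP client call would turn this into a live probe; that step is deliberately left out here.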

Deferred — known on Ollama but out of v2.7.0 scope

The registry probe confirmed these tags exist (HTTP 200) but they
are deferred from v2.7.0 for scope discipline:

| Tag | Real GGUF | Reason |
| --- | --- | --- |
| qwen3:32b (dense) | 18.8 GB | Marginal upgrade over qwen3:30b; defer until empirical validation on gpu-high can distinguish them |
| qwen3:235b-a22b (MoE) | 132.4 GB | Requires multi-GPU rigs well beyond gpu-high (24+ GB VRAM); pending hardware tier extension |
| qwen3-coder:30b, qwen3-coder:480b | | Coder family is task-specific; out of scope for PMO benchmarks |

Schema unchanged — minimum-complexity preserved

The original Sprint 10 plan proposed ~12 new YAML fields (family,
parameters_total_b, parameters_active_b, profile_recommended,
size_gb_disk_estimate, size_gb_vram_estimate, quantization,
license, release_date, capabilities, empirical_validation,
validation_blockers). The user's minimum-complexity decision
kept the catalog at the v2.0.0–v2.6.0 schema (8 fields:
ollama_tag, params_b, gguf_size_gb, context_window,
logprobs_supported, profiles_compatible, timeout_s,
notes). All v2.7.0 metadata (license, release date, MoE caveat,
validation blockers, architecture details) lives within
multi-line notes: text. src/puma/preflight/catalog.py and the
ModelEntry dataclass are byte-identical to v2.6.0.
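
For orientation, an 8-field entry of this shape might look like the dataclass below. This is a sketch, not the contents of src/puma/preflight/catalog.py: the class name `ModelEntrySketch`, the field types, the `timeout_s` value, and the notes text are all assumptions; only the eight field names and the qwen3:30b figures come from these notes.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ModelEntrySketch:
    """Hypothetical mirror of the 8-field catalog schema (types are assumed)."""
    ollama_tag: str
    params_b: float
    gguf_size_gb: float
    context_window: int
    logprobs_supported: bool
    profiles_compatible: List[str]
    timeout_s: int
    notes: str = ""


entry = ModelEntrySketch(
    ollama_tag="qwen3:30b",
    params_b=30.0,
    gguf_size_gb=17.3,
    context_window=262144,
    logprobs_supported=False,          # conservative until empirically verified
    profiles_compatible=["gpu-high"],
    timeout_s=600,                     # invented value for illustration
    notes="license, release date and validation blockers live here as free text",
)

print([f.name for f in fields(ModelEntrySketch)])
```

Keeping extra metadata in `notes` trades machine-readability for schema stability, which is the minimum-complexity choice described above.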

Invariants generalised, not relaxed

The pending-validation exclusion from gpu-entry (established in
Sprint 9 for Apple Silicon entries) is reaffirmed for the new
Qwen3 entries and extended to every apple-silicon-* profile via
explicit tests:

| Test | Status | Invariant |
| --- | --- | --- |
| test_gemma4_family_excluded_from_gpu_entry | PASSED (preserved) | D18/F8 (Sprint 2) |
| test_gemma4_family_not_compatible_with_any_apple_silicon | PASSED (preserved) | P6 extension to Apple Silicon (Sprint 9) |
| test_qwen3_entries_excluded_from_gpu_entry | PASSED (new) | P10/P11 (Sprint 10) |
| test_qwen3_entries_excluded_from_all_apple_silicon | PASSED (new) | P11 generalisation across profile families |
| test_qwen3_entries_target_gpu_high_only | PASSED (new) | Exact-equality anchor against accidental loosening |

The pattern is now: new entries default to the safest profile
only; loosening requires empirical evidence and an explicit
debt-tracker entry referencing the prior exclusion.
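
An exact-equality guard of the kind described here could be written roughly as follows. The sketch reuses two test names from the table, but the bodies are guesses: the in-memory `CATALOG_SKETCH` dictionary stands in for the real YAML catalog, and the qwen2.5:3b profile list is illustrative.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the catalog: tag -> profiles_compatible.
CATALOG_SKETCH: Dict[str, List[str]] = {
    "qwen3:30b": ["gpu-high"],
    "qwen3:30b-a3b": ["gpu-high"],
    "qwen2.5:3b": ["gpu-entry", "apple-silicon-m4-pro"],
}


def qwen3_entries(catalog: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Select only the qwen3 family entries."""
    return {tag: p for tag, p in catalog.items() if tag.startswith("qwen3:")}


def test_qwen3_entries_target_gpu_high_only(catalog=CATALOG_SKETCH) -> None:
    # Exact equality: adding any extra profile fails, not just removing gpu-high.
    for tag, profiles in qwen3_entries(catalog).items():
        assert profiles == ["gpu-high"], f"{tag} loosened to {profiles}"


def test_qwen3_entries_excluded_from_all_apple_silicon(catalog=CATALOG_SKETCH) -> None:
    for tag, profiles in qwen3_entries(catalog).items():
        assert not any(p.startswith("apple-silicon-") for p in profiles), tag


test_qwen3_entries_target_gpu_high_only()
test_qwen3_entries_excluded_from_all_apple_silicon()
print("invariants hold")
```

The exact-equality form is stricter than a membership check: a later edit that appends a second profile fails the test even though gpu-high is still present.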

Tests

  • 402 → 407 passing (-m "not ollama"), 7 deselected.
  • 5 new regression-guard tests in
    tests/unit/test_catalog_metadata.py (see invariant table
    above).
  • tests/unit/test_preflight_catalog.py::test_load_catalog_returns_all_entries:
    entry-count expectation updated 15 → 17 to reflect the two new
    Qwen3 additions.
  • pre-commit run --all-files: all hooks green.
  • puma validate-baseline (triage, F1 path, fresh Ollama):
    PASS f1=0.5831, delta=-0.0036, ±0.01.
  • puma validate-baseline --expected-mae 5.7150 (estimation,
    fresh Ollama): PASS mae=5.7150, delta=+0.0000, ±0.05
    bit-exact.

Quality

  • Coverage: 61 % (no significant change from v2.6.0; new entries
    are YAML-only, no Python statements added).
  • CI: green on both main and develop. The
    integration-tests-ollama job introduced in v2.5.0 continues to
    run on push to those branches.
  • Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
    preserved; MAE = 5.7150 ± 0.05 on estimation_tawos preserved.
  • Linux + NVIDIA dispatch byte-identical to v2.6.0 (no new
    profiles, no new dispatch logic, no new code paths). The Qwen3
    entries appear in models_for_profile('gpu-high') only.

Design decisions

  • Schema unchanged at 8 fields. Documented Sprint-10-original
    proposal of 12 new fields; chose to keep schema minimal. All
    metadata that would have required new fields now lives in the
    notes multi-line text. Governed by P5 (additive over
    modification) and the project's minimum-complexity discipline.
  • Real Ollama tags only. Every catalogued ollama_tag is
    verified against registry.ollama.ai/v2/library/<repo>/manifests/<tag>
    before inclusion. The originally-planned qwen3:27b and
    qwen3:35b-a3b were remapped to the real qwen3:30b and
    qwen3:30b-a3b after probe; Kimi K2.6 was removed entirely
    after every plausible tag returned 404.
  • Conservative params_b for MoE. The qwen3:30b-a3b entry
    declares params_b: 30.0 following the gemma4:26b-a4b
    precedent — TOTAL params when the tag encodes both numbers. The
    F8/D18 caveat in notes documents that active-params count
    does NOT predict VRAM consumption.
  • logprobs_supported: false conservatively. The Qwen3 family
    announces logprob support upstream, but PUMA has not yet
    empirically verified token-level confidence on these specific
    tags. Flipping to true is part of the empirical validation
    protocol when hardware becomes available.
  • gpu-high as the only target. At 17.3 GB, the GGUF leaves too little
    headroom within gpu-mid's 12–24 GB range once OS + context overhead
    are accounted for; gpu-high (24+ GB VRAM) is the only safe default. The
    exact-equality anchor test
    test_qwen3_entries_target_gpu_high_only pins this so future
    loosening requires deliberate intent.

Debt tracking

  • No new open debt introduced by this release.
  • No closure of pre-existing debt — Sprint 10 is
    forward-looking catalog expansion, not a debt-paydown Sprint.
  • Empirical validation of qwen3:30b and qwen3:30b-a3b is the
    explicit follow-up; tracked via the notes text on each entry
    and via the validation roadmap in
    docs/CATALOG_HISTORY.md § "Empirical validation roadmap".

Known limitations

Unchanged from v2.6.0:

  • Single hardware tier empirically evaluated (gpu-entry);
    models requiring gpu-mid/gpu-high are catalogued but not
    yet validated.
  • AMD ROCm not yet detected.
  • All apple-silicon-* profiles declare
    empirical_validation: pending (Sprint 9 forward-work).
  • TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion
    3).

New in v2.7.0:

  • Both qwen3:30b and qwen3:30b-a3b are catalogued with
    validation pending. Users on gpu-high hardware can use them
    via puma run and report empirical results to close the
    validation gap.
  • Kimi K2.6 is not distributed via Ollama; the catalog
    intentionally omits it. If a third-party distributor publishes
    K2.6 to Ollama, a future Sprint can revisit cataloguing.

Empirical validation roadmap (when gpu-high hardware available)

The protocol for closing the validation gap is documented in
docs/CATALOG_HISTORY.md v2.7.0 § "Empirical validation roadmap":

  1. Pull the model via ollama pull qwen3:30b and verify the
    digest matches the registry manifest probed at cataloguing.
  2. Run the canonical baselines: triage_jira (F1) and
    estimation_tawos (MAE).
  3. Measure parse_failure_rate (should be 0 for usable models)
    and reproducibility (bit-exact under T=0.0 + seed=42 on the
    gpu-high hardware).
  4. If validation succeeds: bump logprobs_supported to true
    (after a logprobs-enabled probe), extend profiles_compatible
    to vetted Apple Silicon Max/Ultra variants (≥36 GB unified
    memory) pending separate Apple-side v...

PUMA v2.6.0 — Apple Silicon M3/M4/M5 support

16 May 02:40

PUMA v2.6.0 Release Notes

Release date: 2026-05-16
Previous release: v2.5.0 (2026-05-16)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 9 (Apple Silicon M3/M4/M5 support)
onto the v2.5.0 base. It adds first-class detection of Apple Silicon
hosts (9 new profile identifiers covering M3 base/Pro/Max, M4
base/Pro/Max, M5 base/Pro/Max, and M5 Ultra), a native runtime mode
via ./start_puma.sh --native that boots Ollama with Metal on
macOS (no Docker), macOS-aware CodeCarbon tracking with a
powermetrics-availability probe, and the cross-architecture
reproducibility question documented as a testable hypothesis for a
future empirical close-out.

Empirical validation status: all apple-silicon-* profiles
declare empirical_validation: pending. PUMA's validation hardware
is the RTX 2060 Mobile 6 GB (gpu-entry); Apple Silicon hardware
joins the validation set in a future Sprint when MacBook M-series
hardware becomes available to the project. The dispatch
infrastructure shipped here enables that validation; the testing
protocol is documented in docs/CROSS_ARCH_REPRODUCIBILITY.md
§ "Testing protocol".

Highlights

Apple Silicon catalogued end-to-end

| Layer | Artefact |
| --- | --- |
| Profiles | 9 new apple-silicon-* identifiers in config/profiles.yaml; schema extended additively with apple_silicon_required, chip_brand_match, min_unified_memory_gb |
| Catalog | catalog_version bumped 2.5.0 → 2.6.0; conservative profiles_compatible[] additions per a ≈ 2× GGUF memory-headroom rule |
| Detection | src/puma/preflight/apple_silicon.py (NEW): platform-isolated sysctl-based detection; testable on Linux via unittest.mock |
| Dispatch | SystemCapabilities gains chip_brand + unified_memory_gb; Profile gains optional Apple Silicon fields; select_profile() runs a new branch BEFORE the existing GPU/CPU dispatch |
| Runtime | ./start_puma.sh --native boots native Ollama + Python venv on macOS; ./stop_puma_native.sh teardown companion |
| Emissions | get_tracking_mode_and_warnings() resolver in puma.sustainability.codecarbon_wrapper; powermetrics probe; graceful process-mode fallback on macOS |
| Docs | docs/CROSS_ARCH_REPRODUCIBILITY.md (NEW); extensions in MACOS_NOTES.md, HARDWARE.md, CATALOG_HISTORY.md |

Conservative model compatibility

Apple Silicon variants need at least roughly 2× the model's GGUF
size in unified memory plus a small OS overhead. Applied as a rule:

| Model | Compatible Apple Silicon variants |
| --- | --- |
| qwen2.5:1.5b, gemma3:1b | All 10 (m3 base → m5-ultra) |
| qwen2.5:3b, gemma3:4b | All except m3 base (8 GB tight) |
| qwen2.5:7b, mistral:7b, llama3.1:8b, deepseek-r1:7b | Pro / Max / Ultra only (≥ 18 GB) |
| qwen2.5:14b, deepseek-r1:14b, gemma3:27b | Max / Ultra only (≥ 36 GB) |
| gemma3:12b | m3-max, m4-pro+, m5-pro+ (≥ 24 GB) |
| gemma4:* | Excluded from every apple-silicon-* profile (P6 enforcement) |
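
The headroom rule stated above (roughly 2× the GGUF size plus a small OS overhead) reduces to a one-line inequality. The sketch below is an illustration under assumptions: the function name, the 2.0 GB overhead default, and the 4.7 GB example model size are hypothetical; the notes give only the "≈ 2×" factor and the memory thresholds in the table.

```python
def fits_unified_memory(
    gguf_size_gb: float,
    unified_memory_gb: float,
    headroom_factor: float = 2.0,   # "roughly 2x" from the release notes
    os_overhead_gb: float = 2.0,    # assumed value; the notes say only "small"
) -> bool:
    """Return True when unified memory covers headroom_factor x GGUF + overhead."""
    required_gb = headroom_factor * gguf_size_gb + os_overhead_gb
    return unified_memory_gb >= required_gb


# Hypothetical 4.7 GB GGUF: needs 2 * 4.7 + 2.0 = 11.4 GB under these assumptions.
for memory_gb in (8, 18, 36):
    verdict = "fits" if fits_unified_memory(4.7, memory_gb) else "too tight"
    print(f"{memory_gb} GB unified memory: {verdict}")
```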

gemma4 family — exclusion preserved AND extended

The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
stays excluded from gpu-entry per F8 / D18.
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. v2.6.0 extends the exclusion with a new invariant test,
test_gemma4_family_not_compatible_with_any_apple_silicon,
ensuring the same VRAM-pressure failure mode is not re-introduced
on small unified-memory variants by accidental copy-paste during
future catalog edits. Re-enabling any (gemma4, apple-silicon-*)
pair requires new empirical evidence on Mac hardware and an
explicit debt entry referencing the prior exclusion.

CodeCarbon survives on macOS (Mode B)

tracking_mode="machine" is what PUMA's split-container Linux
architecture relies upon (the D15 fix). On macOS Mode B without
passwordless powermetrics, machine-mode silently fails to record
energy. v2.6.0 adopts a graceful fallback in
get_tracking_mode_and_warnings():

  1. Passwordless powermetrics configured → tracking_mode="machine", no warnings.
  2. Default macOS state (sudo required) → tracking_mode="process" + one warning pointing at docs/MACOS_NOTES.md.
  3. puma run --no-emissions → tracking disabled entirely.

Linux + NVIDIA path is byte-identical to v2.5.0:
tracking_mode="machine" with no warnings.
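
The three-way fallback can be sketched as a small pure function. This is not the body of get_tracking_mode_and_warnings(): the function name, its boolean parameters, the use of None for "tracking disabled", and the warning text are all assumptions made to illustrate the decision order described above.

```python
from typing import List, Optional, Tuple


def resolve_tracking_mode(
    is_macos: bool,
    passwordless_powermetrics: bool,
    emissions_enabled: bool = True,
) -> Tuple[Optional[str], List[str]]:
    """Return (tracking_mode, warnings); None means tracking is disabled."""
    if not emissions_enabled:          # e.g. puma run --no-emissions
        return None, []
    if not is_macos:                   # Linux + NVIDIA path stays on machine mode
        return "machine", []
    if passwordless_powermetrics:      # macOS with sudoers configured
        return "machine", []
    # Default macOS state: powermetrics needs sudo, so fall back to process mode.
    return "process", [
        "powermetrics requires sudo; using process mode (see docs/MACOS_NOTES.md)"
    ]


print(resolve_tracking_mode(is_macos=False, passwordless_powermetrics=False))
print(resolve_tracking_mode(is_macos=True, passwordless_powermetrics=False))
```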

Cross-architecture reproducibility — open question, testable hypothesis

v2.6.0 frames bit-exact reproducibility between x86_64 Linux and
arm64 macOS as an open empirical question. The Q4_K_M integer
quantisation makes F1 and MAE expected to be bit-exact across
architectures; logprobs (and therefore ECE) are expected to differ
by FP rounding. The document records H0 / H1 / H2 / H3 hypotheses
and a 6-step testing protocol for closing them out when Mac
hardware joins the validation set. See
docs/CROSS_ARCH_REPRODUCIBILITY.md.

Tests

  • 354 → 402 passing (-m "not ollama"), 7 deselected.
  • New tests/unit/test_apple_silicon.py: 28 tests covering every
    public entry point of the new module with mocks for the
    Darwin/arm64 gate, sysctl success + 3 failure modes
    (FileNotFoundError, TimeoutExpired, CalledProcessError), a
    parametrised mapping for all 10 chip brands, forward-compat for
    unmapped chips, the get_apple_silicon_info dict shape, and
    consistency checks on CHIP_BRAND_TO_PROFILE.
  • New tests/unit/test_codecarbon_macos.py: 7 tests for the
    tracking-mode helper and powermetrics probe — Linux
    short-circuit (never invokes subprocess), macOS with/without
    sudoers, probe behaviour on FileNotFoundError / non-zero exit /
    zero exit.
  • Extended tests/unit/test_catalog_metadata.py: +5 tests
    (VALID_PROFILES inclusion, profiles.yaml definitions for all 9
    apple-silicon-*, chip_brand_match uniqueness, gemma4 exclusion
    from every apple-silicon-*, qwen2.5:3b anchor on
    apple-silicon-m4-pro). Pre-existing
    test_model_metadata_is_internally_consistent and
    test_gemma4_family_excluded_from_gpu_entry preserved
    unchanged.
  • Extended tests/unit/test_preflight_profile.py: +7 tests for the
    auto-dispatch path (M4/M4 Pro/M5 Max), boundary at 8 GB unified
    memory, fall-through cases for insufficient unified memory,
    unmapped chips, and non-Apple chip-brand values; +1 manual
    override test for apple-silicon-m4.
  • pre-commit run --all-files: all hooks green.
  • puma validate-baseline (triage, F1 path, fresh Ollama):
    PASS f1=0.5831, delta=-0.0036, ±0.01.
  • puma validate-baseline --expected-mae 5.7150 (estimation,
    fresh Ollama): PASS mae=5.7150, delta=+0.0000, ±0.05
    bit-exact.
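
The sysctl-based detection and its three failure modes, as exercised by the tests above, might be structured along these lines. This is a sketch, not src/puma/preflight/apple_silicon.py: the function names, the timeout value, and the two-entry mapping are hypothetical, and the brand strings are assumed to match what `sysctl -n machdep.cpu.brand_string` reports on those chips.

```python
import platform
import subprocess
from typing import Dict, Optional

# Hypothetical excerpt of a chip-brand-to-profile mapping.
BRAND_TO_PROFILE: Dict[str, str] = {
    "Apple M4": "apple-silicon-m4",
    "Apple M4 Pro": "apple-silicon-m4-pro",
}


def read_chip_brand(timeout_s: float = 2.0) -> Optional[str]:
    """Return the CPU brand string on Darwin/arm64, else None."""
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        return None  # platform gate: never shells out on Linux
    try:
        result = subprocess.run(
            ["sysctl", "-n", "machdep.cpu.brand_string"],
            capture_output=True, text=True, check=True, timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired,
            subprocess.CalledProcessError):
        return None  # the three failure modes listed in the test notes
    return result.stdout.strip() or None


def profile_for(brand: Optional[str]) -> Optional[str]:
    """Map a brand string to a profile id; unmapped chips fall through to None."""
    if brand is None:
        return None
    return BRAND_TO_PROFILE.get(brand)


print(profile_for(read_chip_brand()))
```

Returning None for unmapped chips keeps the existing GPU/CPU dispatch in charge whenever detection is inconclusive.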

Quality

  • Coverage: 61 % (no significant change from v2.5.0).
  • CI: green on both main and develop. The integration-tests-ollama
    job introduced in v2.5.0 continues to run on push to those
    branches only.
  • Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
    preserved. MAE = 5.7150 ± 0.05 on estimation_tawos preserved.
  • src/puma/cli.py LOC unchanged from v2.5.0; the only
    meaningfully-grown file is src/puma/preflight/apple_silicon.py
    (NEW, 141 LOC).

Design decisions

  • Linux path byte-identical to v2.5.0. Every Apple Silicon
    code path returns None / no-ops when caps.chip_brand is None
    or is_apple_silicon() is False, so no CI invocation on Linux
    changes behaviour. Governed by P5 (additive over modification)
    and P3 (reproducibility non-negotiable).
  • apple-silicon-* profiles ship with empirical_validation: pending. Cataloguing without validation is a deliberate
    signalling choice — the dispatch infrastructure exists, the
    numbers do not. Frames the future Mac-hardware Sprint as
    empirical close-out rather than groundwork. Aligns with the
    F8/D18 lesson: nominal specifications do not predict runtime
    compatibility on constrained hardware.
  • gemma4 stays excluded across both gpu-entry AND
    apple-silicon-*. P6 generalises from "do not re-introduce
    previously-rejected (model, profile) pairs" to "do not
    re-introduce the failure mode in a different profile family".
  • get_tracking_mode_and_warnings() is additive. The Linux
    branch returns ("machine", []), byte-identical to the
    v2.5.0-hardcoded value. The macOS branch is the new behaviour;
    it only runs when is_apple_silicon() returns True, which is
    False on every Linux test runner.
  • Cross-arch reproducibility documented as testable, not as
    fact. v2.6.0 ships a hypothesis (H0/H1/H2/H3) and a protocol,
    not a claim of bit-exactness. Honest about the empirical gap.

Debt tracking

  • No new open debt introduced by this release.
  • No closure of pre-existing debt — Sprint 9 is forward-looking
    infrastructure, not a debt-paydown Sprint.
  • Empirical validation of apple-silicon-* profiles is the
    explicit follow-up; it is not tracked as "debt" because the
    catalogue declares its own status (empirical_validation: pending).

Known limitations

Unchanged from v2.5.0:

  • Single hardware tier empirically evaluated (gpu-entry); models
    requiring gpu-mid/gpu-high are catalogued but not yet
    validated.
  • AMD ROCm not yet detected.
  • TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
  • input_text not persisted in triage_jira instances (D22, Low).

New in v2.6.0:

  • All apple-silicon-* profiles declare
    empirical_validation: pending. Users on Mac hardware should
    validate cross-arch reproducibility on their specific chip
    before treating F1/MAE results as comparable to the Linux
    baselines — see docs/CROSS_ARCH_REPRODUCIBILITY.md § "Testing
    protocol".
  • macOS Mode B (native Ollama) is opt-in via
    ./start_puma.sh --native; the default ./start_puma.sh Docker
    path is unchanged.

Upgrade notes

  • No breaking changes. Existing CI invocations of
    puma validate-baseline (no flags) continue to work unchanged.
    Linux + NVIDIA dispatch is byte-identical to v2.5.0.
  • macOS users can opt into native mode (Metal acceleratio...

PUMA v2.5.0 — Hardening (Sprint 8)

16 May 01:39

PUMA v2.5.0 Release Notes

Release date: 2026-05-16
Previous release: v2.4.0 (2026-05-13)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 8 (hardening) onto the v2.4.0 base.
It resolves the six inconsistencies (I5–I10) detected in the
post-v2.4.0 technical analysis and adds the first empirical MAE
canonical baseline for puma validate-baseline. The gemma4 family
remains empirically excluded from gpu-entry per F8 / D18; v2.5.0
documents the exclusion in a new versioned catalog changelog rather
than re-introducing the failure mode.

Highlights

Six inconsistencies resolved (I5–I10)

| ID | Resolution | Primary artefact |
| --- | --- | --- |
| I5 | macOS Docker (CPU) vs native Ollama (Metal) modes clarified; v2.6.0 plan stated | docs/MACOS_NOTES.md (new) |
| I6 | gpu-entry hardware tolerance bands across RTX 2060/3050/3060/4050/4060 Mobile + Apple cross-arch row | docs/HARDWARE.md |
| I7 | Catalog now versioned (catalog_version: 2.5.0); changelog document; new unit test | config/models_catalog.yaml, docs/CATALOG_HISTORY.md (new) |
| I8 | CI job integration-tests-ollama runs the 4 @pytest.mark.ollama tests on every push to main/develop | .github/workflows/lint-and-test.yml |
| I9 | puma validate-baseline --expected-mae extends the command to estimation_tawos; canonical spec + empirical reference established | src/puma/cli.py, specs/runs/baseline_estimation_canonical.yaml (new), docs/baseline_references.md (new) |
| I10 | Coverage breakdown by module group with explicit rationale for sub-40 % modules | docs/TESTING.md (new) |

Estimation canonical baseline established (v2.5.0)

The first empirical MAE reference for puma validate-baseline on
estimation_tawos is now published:

  • Spec: specs/runs/baseline_estimation_canonical.yaml
  • Configuration: qwen2.5:3b × zero-shot × N=200 × seed=42 × T=0.0
  • Reference MAE = 5.7150 SP (tolerance ±0.05 SP)
  • Establishing run: baseline_estimation_canonical_v1__26d0e07aaa7949ec__20260516T003317
  • Verified bit-exact across 4 consecutive runs (cold + warm)
  • Hardware: gpu-entry (RTX 2060 Mobile 6 GB)

Cross-scenario state contamination — documented finding

During empirical establishment of the MAE reference, a
state-contamination effect was characterised: running a
triage_jira baseline between an Ollama restart and the estimation
validation shifts MAE from 5.7150 to ≈6.3150 SP (delta = +0.6 SP) —
well outside the ±0.05 tolerance. The fresh-Ollama-state validation
protocol that prevents the drift is documented in
docs/baseline_references.md. This is a property of Ollama's
inference engine (KV-cache + warm-state behaviour), not a PUMA
code-path regression. Related to D3 (CUDA non-determinism).

gemma4 family — status preserved

The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
remains catalogued and remains empirically excluded from
gpu-entry. The exclusion is grounded in:

  • F8 (closed): gemma4:e2b GGUF measured at 7.2 GB on disk
    versus the ~2 GB suggested by effective active params.
  • D18 (closed): all 5 smoke runs of gemma4:e2b on RTX 2060
    6 GB VRAM returned empty raw_response strings.

The regression-guard test
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. Users on gpu-mid (12–24 GB VRAM) and gpu-pro (24+
GB VRAM) hardware can use the gemma4 family normally; on
gpu-entry, select qwen2.5:* or gemma3:* instead. Full
rationale in docs/CATALOG_HISTORY.md.

Tests

  • New: tests/unit/test_cli_validate_baseline.py extended from
    3 → 8 tests (5 new for the MAE path, mutual exclusivity, missing
    metric, and default-spec resolution).
  • New: tests/unit/test_catalog_metadata.py::test_catalog_has_version_field.
  • Suite total: 348 → 354 passing, 7 deselected (-m 'not ollama').
  • pre-commit run --all-files: all hooks green.
  • puma validate-baseline (triage, F1 path): PASS f1=0.5831, delta=-0.0036.
  • puma validate-baseline --expected-mae 5.7150 (estimation,
    NEW): PASS mae=5.7150, delta=+0.0000.

Quality

  • Coverage: 61 % (essentially flat from v2.4.0; per-module
    breakdown now in docs/TESTING.md).
  • CI: green on both main and develop. New
    integration-tests-ollama job runs only on push to those
    branches and is continue-on-error: true so a transient
    Ollama failure does not gate the merge queue.
  • Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
    preserved. MAE = 5.7150 ± 0.05 on estimation_tawos newly
    established.
  • src/puma/cli.py LOC essentially unchanged from v2.4.0
    (signature extension + a small dispatch block).

Design decisions

  • --expected-mae is additive, not a refactor. When neither
    flag is provided, puma validate-baseline preserves its v2.4.0
    behaviour (F1 = 0.5867 against the triage baseline). Existing
    CI invocations continue to work unchanged. Sprint Operating
    Principle P5 (don't break working code) governed the choice.
  • gemma4 stays excluded. The original Sprint 8 plan asked for
    re-adding gemma4 to gpu-entry at 1.5 / 3 GB sizes. Inventory
    caught the conflict with F8 (7.2 GB measured) and D18 (empty
    responses). The revised plan converted S8.7 to
    documentation-only in CATALOG_HISTORY.md, preserving the
    regression guard. Sprint Operating Principle P6
    (don't re-introduce previously-rejected models) governed the
    choice.
  • MAE tolerance set to ±0.05 SP. The reference is bit-exact
    across cold + warm runs (4-run verification); the ±0.05 band
    absorbs the same kind of FP-ordering drift that the F1 ±0.01
    band absorbs on triage. The cross-scenario state-contamination
    effect (≈0.6 SP) is NOT absorbed by the tolerance — instead, a
    validation protocol prevents the contamination.
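
The additive-flag decision can be illustrated with a small parser sketch. It assumes an argparse-style layout, a mutually exclusive pair of flags, and per-metric default tolerances; the actual CLI framework and the helper names `build_parser` and `resolve_gate` are hypothetical, and only the flag names and reference values come from these notes.

```python
import argparse

# Reference values quoted in the release notes; the parser layout is an assumption.
DEFAULT_F1 = 0.5867
DEFAULT_TOLERANCE = {"f1": 0.01, "mae": 0.05}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="puma validate-baseline")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--expected-f1", type=float, default=None)
    group.add_argument("--expected-mae", type=float, default=None)
    parser.add_argument("--tolerance", type=float, default=None)
    return parser


def resolve_gate(argv: list[str]) -> tuple[str, float, float]:
    """Return (metric, expected, tolerance) for the given arguments."""
    args = build_parser().parse_args(argv)
    if args.expected_mae is not None:
        metric, expected = "mae", args.expected_mae
    else:
        # No flag at all falls back to the F1 triage baseline, so
        # pre-existing invocations keep their behaviour.
        metric = "f1"
        expected = args.expected_f1 if args.expected_f1 is not None else DEFAULT_F1
    if args.tolerance is not None:
        tolerance = args.tolerance
    else:
        tolerance = DEFAULT_TOLERANCE[metric]
    return metric, expected, tolerance
```

With this layout, `resolve_gate([])` resolves to the F1 gate, while passing both flags makes argparse exit with status 2.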

Debt tracking

  • No new open debt introduced by this release.
  • Inconsistencies tracked: I1–I10 documented across the project;
    v2.5.0 resolves I5–I10 (six items). I1–I4 were resolved in
    earlier releases.
  • Total resolved across v2.0.0 → v2.5.0: 15 of 24 technical debt
    items (62 %) plus 6 of 10 inconsistencies (60 %); v2.5.0
    contributes the inconsistency resolutions.

Known limitations

Unchanged from v2.4.0:

  • Single hardware tier evaluated (gpu-entry); models requiring
    gpu-mid and above catalogued but not yet empirically
    evaluated.
  • AMD ROCm and Apple Metal backends not yet detected. Apple
    Silicon native mode planned for v2.6.0; AMD ROCm pending
    hardware availability.
  • TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion
    3).
  • input_text not persisted in triage_jira instances (D22,
    Low).

Upgrade notes

  • No breaking changes. Existing CI invocations of
    puma validate-baseline (no flags, expecting F1 = 0.5867)
    continue to work unchanged.
  • New flag available: puma validate-baseline --expected-mae
    for estimation_tawos. See docs/baseline_references.md for
    the recommended invocation including the fresh-Ollama-state
    protocol.
  • New docs to know about: MACOS_NOTES.md,
    CATALOG_HISTORY.md, baseline_references.md, TESTING.md.
  • The lint-and-test.yml CI workflow now contains a second job
    (integration-tests-ollama) that runs on push to main/develop
    only. PRs are unaffected.

Acknowledgments

Development assistance provided by generative AI tooling. All
commits are attributed to the project's git identity per
repository convention.

PUMA v2.4.0 — CLI completeness (Anexo F section A.2)

13 May 03:26

PUMA v2.4.0 Release Notes

Release date: 2026-05-13
Previous release: v2.3.0 (2026-05-13)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 7 (CLI completeness for Anexo F) onto
the v2.3.0 base. It resolves the long-standing gap between the academic
Anexo F catalog and the actual repository state by adding the six
high-value commands from section A.2 of Anexo F, together with a new
source-of-truth document distinguishing implemented commands from
documented design proposals.

Highlights

Anexo F gap resolved

docs/anexo_F_cli_reference.md is now the canonical CLI reference,
split into two sections:

  • Section A — Implemented: A.1 lists commands pre-existing in
    v2.0.0–v2.3.0 (preflight, models, datasets, cache, run,
    validate-baseline, compare, dashboard, report, db). A.2
    lists the six commands added in this release (below). Every command
    in section A is verifiable via puma <command> --help and covered
    by tests under tests/cli/.
  • Section B — Proposed extensions: 5 Bash auxiliary scripts and
    12 further CLI commands (Ollama management, sweep wrappers, DB
    tooling, code-quality wrappers) documented as design space.
    Explicitly marked as not implemented; the decision rationale is
    recorded in the document.

Six new CLI commands (Anexo F § A.2)

| Anexo F | Command | Style |
|---------|---------|-------|
| A.2.1 | puma prepare-datasets | Thin subprocess wrapper of scripts/prepare_datasets.py (--dataset, --force-redownload, --verify) |
| A.2.2 | puma wilcoxon | NEW analysis: paired Wilcoxon signed-rank between two run_ids; uses puma.metrics.statistical_tests.wilcoxon_signed_rank_models |
| A.2.3 | puma bias-analysis | NEW analysis: bias evaluation report; uses puma.dashboard.data.load_predictions_with_gold + puma.metrics.fairness.perturbation_disparity |
| A.2.4 | puma generate-plots | Thin subprocess wrapper of scripts/generate_phase_b_plots.py (--source phase_b only; bias_eval/multi_seed exit 2 with deferred-implementation message) |
| A.2.5 | puma list-runs | New: SQL pivot of runs ⋈ metrics with --scenario/--model/--last-n/--since filters and --json |
| A.2.6 | puma list-ollama-models | New: parses docker exec puma_ollama ollama list subprocess output |

Why two commands are NEW analyses and not wrappers: Anexo F § A.2.2
and § A.2.3 specify semantics that diverge from the existing scripts
(positional run_id arguments vs. --run-prefix; --models /
--perturbations filters vs. prefix-only). Rather than mutate the
scripts, the new CLI commands call PUMA's own core helpers directly.
The scripts remain unchanged and continue to support their original
workflows (top-K ranking).
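
For orientation, the kind of paired test that puma wilcoxon reports can be sketched in plain Python. This is not the code behind puma.metrics.statistical_tests.wilcoxon_signed_rank_models, whose internals are not described here; it is a textbook two-sided Wilcoxon signed-rank test on per-instance correctness indicators (1 = correct, 0 = wrong), using the normal approximation with tie correction and no continuity correction. The function name and the data are illustrative.

```python
import math


def wilcoxon_paired_correctness(
    correct_a: list[int], correct_b: list[int]
) -> tuple[int, float]:
    """Return (number of non-tied pairs, two-sided p-value)."""
    # Pairs on which both models agree carry no information and are dropped.
    diffs = [a - b for a, b in zip(correct_a, correct_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0, 1.0
    # Assign average ranks to tied absolute differences.
    abs_sorted = sorted(abs(d) for d in diffs)
    rank_of: dict[int, float] = {}
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        t = j - i
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j
    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    # Two-sided p-value from the standard normal distribution.
    return n, math.erfc(abs(z) / math.sqrt(2))


# Synthetic example: 50 instances, 19 of which are discordant (13 vs 6).
model_a = [1] * 13 + [0] * 6 + [1] * 20 + [0] * 11
model_b = [0] * 13 + [1] * 6 + [1] * 20 + [0] * 11
n_pairs, p_value = wilcoxon_paired_correctness(model_a, model_b)
print(n_pairs, round(p_value, 3))
```

Because binary indicators make every non-zero difference tie at the same rank, the statistic here depends only on how the discordant pairs split between the two models.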

Tests

  • New: tests/cli/ package with 27 tests across 6 files, one per
    command. Each file tests at minimum: --help exit 0, happy path,
    error paths.
  • Suite total: 318 → 348 passing, 7 deselected (-m 'not ollama').
  • pre-commit run --all-files: all hooks green.
  • puma validate-baseline: PASS f1_macro=0.5831, delta=-0.0036.

Quality

  • Coverage: 58 % (no significant change from v2.3.0).
  • CI: green on both main and develop.
  • Baseline reproducibility: F1 = 0.5867 ± 0.01 holds.
  • app.py LOC is stable; the only file that grew meaningfully is
    src/puma/cli.py (363 → 777 LOC, from inline command bodies).
    A refactor to a src/puma/cli/commands/ package was considered
    and deferred; the monolith remains the cleaner option at this
    size.

Design decisions

  • --source bias_eval / multi_seed for generate-plots are
    accepted by the parser but exit 2 with a deferred-implementation
    message because the underlying plotting scripts for those sources
    do not yet exist. This matches the Anexo F spec without inventing
    data.
  • --verify for prepare-datasets currently emits SHA-256 hashes
    only. A manifest file (docs/datasets_manifest.json) for full hash
    comparison is documented in Anexo F but not yet in the repo; the
    command is forward-compatible.
  • No src/puma/cli/commands/ refactor was performed in this
    release. With 6 new commands the inline monolith is still readable;
    the refactor would be justified if/when Section B extensions land.
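
The --verify behaviour and its planned manifest comparison can be sketched as follows. This is a hedged illustration only: the helper names are hypothetical, and the manifest is assumed to be a flat mapping of file name to hex digest, which may not match the eventual docs/datasets_manifest.json layout.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_manifest(
    paths: list[Path], manifest: dict[str, str]
) -> dict[str, str]:
    """Map each file name to 'ok', 'mismatch', or 'unlisted'.

    Assumption: the manifest maps file names to expected SHA-256 hex digests.
    """
    report: dict[str, str] = {}
    for path in paths:
        expected = manifest.get(path.name)
        if expected is None:
            report[path.name] = "unlisted"
        elif sha256_of(path) == expected:
            report[path.name] = "ok"
        else:
            report[path.name] = "mismatch"
    return report
```

Emitting hashes only, as the command does today, corresponds to calling `sha256_of` on each file; the comparison step becomes possible once a manifest exists.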

Debt tracking

  • No new open debt introduced by this release.
  • Total resolved across v2.0.0 → v2.4.0: 15 of 24 (62 %).
  • Section B of Anexo F is documented design space, not technical
    debt. Implementation is optional and conditional on demand.

Known limitations

Unchanged from v2.3.0:

  • Single hardware tier evaluated (gpu-entry); models requiring
    gpu-mid and above catalogued but not yet empirically evaluated.
  • AMD ROCm and Apple Metal backends not yet detected.
  • TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
  • input_text not persisted in triage_jira instances (D22, Low).

Upgrade notes

  • No breaking changes to existing commands or YAML run-spec schema.
  • Six new commands available. See puma <command> --help for usage
    or read docs/anexo_F_cli_reference.md § A.2.
  • New test directory tests/cli/ joins the existing
    tests/unit/ and tests/integration/.

Acknowledgments

Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.

PUMA v2.3.0 — dashboard production-quality + docs structure

13 May 01:58

PUMA v2.3.0 Release Notes

Release date: 2026-05-13
Previous release: v2.2.0 (2026-05-13)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 6 (dashboard polish + structural
refactor) and retrospective documentation work (INDEX.md +
docs/overview.md + README branding) onto the v2.2.0 base. With this
release, Phase C of the master plan is fully complete.

Highlights

Dashboard production-quality (Sprint 6)

Major refactor: app.py reduced from an 803-LOC monolith to a 168-LOC
router (-79 %). View logic delegated to seven modules in
src/puma/dashboard/views/. Each view is independently importable
and testable; the router publishes filters to st.session_state and
dispatches via a VIEWS dict.
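
The router pattern described above can be sketched without Streamlit. A plain dict stands in for st.session_state, and the view functions and the `route` helper are hypothetical placeholders rather than the actual modules under src/puma/dashboard/views/.

```python
from typing import Callable

# Stand-in for st.session_state, which behaves like a mutable mapping.
session_state: dict[str, object] = {}


def render_overview(state: dict) -> str:
    """Hypothetical view: reads the shared filters instead of owning them."""
    return f"Overview for model={state['filters']['model']}"


def render_reliability(state: dict) -> str:
    return f"Reliability for model={state['filters']['model']}"


# The router only knows view names and callables; each view is
# importable and testable on its own.
VIEWS: dict[str, Callable[[dict], str]] = {
    "Overview": render_overview,
    "Reliability": render_reliability,
}


def route(selected_view: str, filters: dict) -> str:
    """Publish the sidebar filters, then dispatch to the selected view."""
    session_state["filters"] = filters
    return VIEWS[selected_view](session_state)


print(route("Overview", {"model": "qwen2.5:3b"}))
```

Keeping the filters in shared state means adding a view is a one-line change to the dict rather than an edit to the router body.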

Ten polish improvements applied:

| #  | Improvement                                    | Impact       |
|----|------------------------------------------------|--------------|
| 1  | @st.cache_data(ttl=60) on 7 loaders            | Performance  |
| 2  | st.spinner on slow operations                  | UX           |
| 3  | CSV export on 4 tables                         | Productivity |
| 4  | Tooltips on ≈ 12 metric cards                  | UX           |
| 5  | Unified empty-filtered-state component         | UX           |
| 6  | Friendly expander titles in Overview           | UX           |
| 7  | Module-level imports (no more inline)          | Code quality |
| 8  | Emoji prefixes consistent across 7 view titles | UX           |
| 9  | Dark-mode dataframe text legibility            | UX (bug fix) |
| 10 | Empty-selectbox guard in Instance Drill-down   | Robustness   |

Plus: first-visit guided tour with view overview and tips
(download CSV, dark mode, tooltips). Persistent dismiss via
st.session_state["tour_dismissed"]; "📖 Show tour" button in the
sidebar to re-open.

Documentation structure (Phase E.bis retrospective + Phase E.ter)

  • INDEX.md (root, uppercase): project status, phases, releases,
    debt tracking, architecture entry points. Created in Phase E.bis;
    this release updates it for v2.3.0 status.
  • docs/overview.md (new location): preserves the 256 LOC of
    architectural content from the legacy lowercase index.md.
  • README.md: branded header with PUMA logo, descriptive
    blockquote, and Related-Resources section linking to puma-vault,
    the published knowledge garden, releases, INDEX.md, and
    docs/overview.md.

Quality

  • Tests: 318 passing (up from 313 in v2.2.0; +5 dashboard smoke
    tests covering view module integrity, polish helpers, cache
    decorator presence, and the end-to-end AppTest render with the
    live database).
  • Coverage: 58 % (up from 55 % in v2.2.0).
  • Pre-commit: 10/10 hooks green.
  • CI: green on both main and develop.
  • Baseline reproducibility: F1 = 0.5867 ± 0.01 holds; verified
    via puma validate-baseline (PASS at 0.5831, delta −0.0036).

Methodological findings (academic traceability)

Sprint 6 surfaced one additional finding consistent with the
meta-pattern documented in docs/known_debt.md ("symptom in layer
N, root cause in layer M ≠ N"):

  • Dark-mode dataframe invisibility. The CSS rule applied
    light-mode colours globally; under dark mode, table text inherited
    light-mode colours against the dark background, rendering tables
    nearly unreadable. Symptom (invisible tables) appeared in the
    dashboard layer; root cause (CSS scope without theme awareness)
    was in the styling layer. Resolved in the same commit as the
    refactor by adding a theme-aware CSS override
    (color: #E5E7EB + background-color: #16213E when
    dark_mode == True).
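
A theme-aware override of this kind can be sketched as a pure function that returns CSS only in dark mode. The two colours come from the note above; the function name and the `[data-testid="stDataFrame"]` selector are assumptions and may not match the rule actually shipped.

```python
def dataframe_css(dark_mode: bool) -> str:
    """Return a <style> block for dark mode, or an empty string otherwise.

    Scoping the rule to dark mode avoids applying one theme's colours globally.
    """
    if not dark_mode:
        return ""
    return (
        "<style>\n"
        '[data-testid="stDataFrame"] {\n'  # hypothetical selector
        "  color: #E5E7EB;\n"
        "  background-color: #16213E;\n"
        "}\n"
        "</style>"
    )


print(dataframe_css(dark_mode=True))
```

In a Streamlit app the returned string would typically be injected with `st.markdown(css, unsafe_allow_html=True)`.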

This brings the meta-pattern catalogue to five instances (D15, D18,
D21, D22, and this CSS scope issue); the fifth is retired in the
same commit that surfaced it.

CI workflow hygiene

The .github/workflows/release.yml fix introduced in Phase E.bis
(commit 863c166) is now exercised end-to-end by the v2.3.0 tag
push. After the tag was pushed and gh release create ran, exactly
one release was created (no duplicate draft). The fix is verified
effective for v2.X.0 releases going forward.

Debt tracking

  • No new open debt introduced by this release.
  • Total resolved across v2.0.0 → v2.3.0: 15 of 24 items (62 %).
  • Phase C: ✓ COMPLETE (was the last open phase; all five
    Gate-C criteria met).

Full inventory and diagnostic write-ups in
docs/known_debt.md.

Known limitations

  • Single hardware tier evaluated (gpu-entry); models requiring
    gpu-mid and above (qwen2.5:14b, gemma3:27b, deepseek-r1:14b,
    the gemma4 family, llama3.1:70b) catalogued but not yet
    empirically evaluated.
  • AMD ROCm and Apple Metal backends not yet detected (development
    hardware is NVIDIA-only).
  • TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
  • input_text not persisted in triage_jira instances (D22, Low —
    future data-pipeline enhancement). The Dashboard Instance
    Drill-down handles this gracefully with an informative message.

Master plan status (post-v2.3.0)

| Phase                          | Status                                          |
|--------------------------------|-------------------------------------------------|
| A: Foundations                 | ✓ COMPLETE                                      |
| B: Multi-model sweep           | ✓ COMPLETE                                      |
| C: Professional dashboard      | ✓ COMPLETE (this release)                       |
| D: Technical depth             | ✓ ~95 % (ROCm/Metal n/a in current hardware)    |
| E: Documentation and releases  | ✓ COMPLETE (v2.0.0, v2.1.0, v2.2.0, v2.3.0)     |

All five phases of the original master plan are now complete or
effectively complete (Phase D's remaining items are
hardware-dependent or scope-deferred).

Upgrade notes

  • No breaking changes to the public CLI or YAML run-spec schema.
  • Dashboard refactor is internal; user-facing behaviour is preserved.
  • Existing run-specs and CLI invocations work unchanged.
  • The dashboard module structure has changed (app.py is now a
    router; each view lives in src/puma/dashboard/views/<name>.py).
    Any external tooling that imported view code from app.py should
    migrate to the new module paths.

Acknowledgments

Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.

PUMA v2.2.0 — statistical pipeline + dashboard core + bias evaluation

13 May 01:02

PUMA v2.2.0 Release Notes

Release date: 2026-05-13
Previous release: v2.1.0 (2026-05-10)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprints 3, 4, and 5 onto the v2.1.0 base:
statistical-analysis pipeline (ECE, Wilcoxon signed-rank, multi-seed
reproducibility), dashboard core with PUMA visual identity, and an
empirical bias-evaluation suite adapted to the characteristics of the
technical corpus.

Highlights

Statistical analysis pipeline (Sprint 3)

  • ECE end-to-end. Expected Calibration Error computed from real
    Ollama logprobs and persisted as metrics.metric_name='ece'.
    Validated against Guo et al. (2017) canonical cases with tolerance
    1e-6. Baseline qwen2.5:3b shows ECE=0.39 — significant
    miscalibration typical of out-of-the-box LLMs without post-hoc
    calibration.
  • Multi-seed validation. Seeds {42, 123, 456} on the canonical
    baseline yield zero variance in task metrics under T=0.0,
    confirming the bit-exact reproducibility guarantee documented in
    v2.0.0. Runtime jitter ~4 %.
  • Wilcoxon pairwise comparison (Demšar 2006 methodology) on
    paired-correctness indicators. Demonstrated empirically on a
    mini-comparison (qwen2.5:1.5b vs gemma3:1b on triage_jira × N=50)
    that a 0.19-point F1 gap is not statistically significant at
    α=0.05 (p=0.108, n_pairs=19/50) — the kind of finding the test is
    designed to surface.
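
The ECE metric mentioned in the first bullet can be sketched as below. This follows the standard binned definition from Guo et al. (2017), with the assumption of equal-width, left-closed bins; PUMA's own binning and its logprob-to-confidence step are not described in these notes, and the function name is illustrative.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[int], n_bins: int = 10
) -> float:
    """ECE = sum over bins of (bin size / N) * |accuracy(bin) - mean confidence(bin)|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += hit
        count[idx] += 1
    total = len(confidences)
    return sum(
        (count[b] / total) * abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
        for b in range(n_bins)
        if count[b]  # empty bins contribute nothing
    )


# Toy data: overconfident at 0.9 (half wrong), underconfident at 0.6 (all right).
toy_ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 0, 1, 1])
print(round(toy_ece, 4))
```

A perfectly calibrated model scores 0; larger values mean stated confidence and observed accuracy diverge.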

Dashboard core (Sprint 4)

  • 5 fully functional views: Overview (cohort cards + per-run
    expanders, sidebar filters applied), Model Comparison (mean±std
    aggregation across seeds, run × metric heatmap, Wilcoxon
    artefact rendering), Reliability (real ECE + reliability diagram
    from logprobs), Sustainability Frontier (F1 vs CO₂ Pareto consuming
    the emissions table from Sprint 2 D15), Instance Drill-down
    (correct gold_label via the new JOIN, top-K logprobs).
  • 2 informed placeholders pending data: Fairness and Robustness
    (made functional by Sprint 5).
  • PUMA visual identity: emerald palette, sans-serif typography,
    logo in sidebar, telemetry disabled.
  • Dark-mode toggle via runtime CSS override.
  • 6 smoke tests including an end-to-end streamlit.testing.v1.AppTest
    render.
  • Bug fix: the Fairness and Instance Drill-down views in v2.1.0
    read gold_label from predictions, where it does not exist
    (it lives in instances). The new load_predictions_with_gold
    helper LEFT-JOINs the two tables and is now consumed by every
    view that needs the gold label.
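
The gold-label fix in the last bullet comes down to a LEFT JOIN from predictions to instances. A minimal sketch with an in-memory SQLite database follows; apart from the table names and gold_label, every column name here is an assumption, and the function is an illustration rather than the real load_predictions_with_gold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE instances (instance_id TEXT PRIMARY KEY, gold_label TEXT);
    CREATE TABLE predictions (run_id TEXT, instance_id TEXT, predicted_label TEXT);
    INSERT INTO instances VALUES ('i1', 'bug'), ('i2', 'task');
    INSERT INTO predictions VALUES
        ('r1', 'i1', 'bug'), ('r1', 'i2', 'bug'), ('r1', 'i3', 'task');
    """
)


def load_predictions_with_gold_sketch(connection: sqlite3.Connection, run_id: str):
    """Return (instance_id, predicted_label, gold_label) rows for one run.

    LEFT JOIN keeps predictions whose instance row is missing; their
    gold_label comes back as None instead of the row being dropped.
    """
    query = """
        SELECT p.instance_id, p.predicted_label, i.gold_label
        FROM predictions AS p
        LEFT JOIN instances AS i ON i.instance_id = p.instance_id
        WHERE p.run_id = ?
        ORDER BY p.instance_id
    """
    return connection.execute(query, (run_id,)).fetchall()


rows = load_predictions_with_gold_sketch(conn, "r1")
print(rows)
```

Centralising the join in one helper means every view reads the gold label from the same place.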

Bias evaluation (Sprint 5) — empirical findings

Methodological adaptation. The triage_jira corpus is 100 %
technical incident text with 0 % gendered terms (verified by regex
over 23 EN tokens across all 200 instances). A textbook
pronoun-substitution gender_swap on this corpus would yield
flip_rate = 0 and disparity = 0 — a false PASS demonstrating
nothing. Sprint 5 therefore evaluates bias via signal injection
rather than signal substitution:

  • gender_swap_prefix_{male,female} — prepends a gendered identity
    prefix (John Smith reported: … vs Mary Smith reported: …).
    Methodology per Caliskan et al. (2017) and Bolukbasi et al. (2016).
  • register_shift_informal — formal→informal substitution proxy for
    the dialect axis on a monolingual technical corpus (Tatman 2017).
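
Signal injection and the flip-rate measurement can be sketched in a few lines. The prefix format follows the example above; the function names are hypothetical, and the flip-rate definition (the share of instances whose predicted label changes between two conditions) is an assumption about how the metric is computed.

```python
def inject_identity_prefix(text: str, reporter: str) -> str:
    """Prepend a gendered identity cue without touching the technical content."""
    return f"{reporter} reported: {text}"


def flip_rate(reference: list[str], perturbed: list[str]) -> float:
    """Fraction of instances whose predicted label differs between two conditions."""
    if len(reference) != len(perturbed) or not reference:
        raise ValueError("need two non-empty prediction lists of equal length")
    flips = sum(ref != per for ref, per in zip(reference, perturbed))
    return flips / len(reference)


issue = "Login fails after password reset"
print(inject_identity_prefix(issue, "John Smith"))
print(inject_identity_prefix(issue, "Mary Smith"))

# Illustrative labels only, not PUMA results.
baseline = ["bug", "task", "bug", "story"]
with_prefix = ["bug", "bug", "bug", "story"]
print(flip_rate(baseline, with_prefix))
```

Comparing each prefixed condition against the unperturbed baseline gives the flip rate, while comparing the male-prefixed and female-prefixed predictions against each other gives one possible reading of the M-vs-F directional figure.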

Key empirical findings on triage_jira × N=100 per condition:

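| Model        | Flip rate (any prefix vs baseline) | Δ accuracy    | M-vs-F directional bias |
|--------------|------------------------------------|---------------|-------------------------|
| qwen2.5:1.5b | ~25-27 %                           | −11 to −12 pp | 15 %                    |
| qwen2.5:3b   | ~25-27 %                           | −3 to −4 pp   | 5 %                     |
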
  • Both models flip ~25 % of predictions when any gender signal is
    added vs the un-perturbed baseline — strong sensitivity to identity
    cues that the technical content does not require.
  • The larger model (qwen2.5:3b, about 2× the parameters of
    qwen2.5:1.5b) exhibits ~3× less directional bias (5 % vs 15 %)
    while losing ~3× less accuracy under signal injection.
  • register_shift_informal shows ~0 % effect on both models:
    formal-to-informal substitution does not perturb predictions.

Closes Gate D criterion 4 ("Basic semantic bias implemented and
validated"). Closes debt D19.

Quality

  • Tests: 313 passing (up from 276 in v2.1.0; +37 TDD across the
    three sprints).
  • Pre-commit: 10/10 hooks green.
  • CI: green on both main and develop.
  • Baseline reproducibility: F1=0.5867 ± 0.01 holds; verified via
    puma validate-baseline (PASS at 0.5831, delta −0.0036).

Methodological findings (academic traceability)

  • Sprint 3 confirmed empirically the deterministic-reproducibility
    guarantee documented in v2.0.0 (zero variance under T=0.0 with three
    different seeds; runtime jitter does not propagate to metrics).
  • Sprint 4 dashboard integration exposed D22: the synthetic
    triage_jira dataset does not persist input_text. This is the
    fourth instance of the "symptom in layer N, root cause in layer M"
    meta-pattern documented in docs/known_debt.md (joining D15, D18,
    D21).
  • Sprint 5 confirmed D3 empirically: puma validate-baseline
    showed bit-exact F1=0.6002 across consecutive runs after the bias
    sweep, then returned to F1=0.5831 after
    docker compose restart puma_ollama. This documents the
    warm-state-drift scope of the reproducibility guarantee, which
    matters for any future B.4 sweep protocol.

Debt tracking

  • Resolved this release: D19 (fairness scaffolding only).
  • New entry: D22 (Low) — instances.input_text empty on
    triage_jira.
  • Total resolved across v2.0.0 → v2.2.0: 15 of 23 (65 %).
  • Open: 8 (0 critical, 5 medium, 2 low; 1 marked
    DECIDED-NO-ACTION).

Full inventory and diagnostic write-ups in
docs/known_debt.md.

Known limitations

  • input_text not persisted in triage_jira instances (D22, Low —
    future data-pipeline enhancement).
  • Single hardware tier evaluated (gpu-entry); models requiring
    gpu-mid and above (qwen2.5:14b, gemma3:27b, deepseek-r1:14b,
    the gemma4 family, llama3.1:70b) catalogued but not yet
    empirically evaluated.
  • Dashboard polish deferred — animations, guided tour, refactor of
    the 640-LOC app.py to a views/ module package. Slated for a
    future Sprint 6.
  • AMD ROCm and Apple Metal backends not yet detected (development
    hardware is NVIDIA-only).
  • TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).

Upgrade notes

  • No breaking changes to the public CLI or YAML run-spec schema.
  • New perturbation names accepted in perturbations: lists:
    gender_swap_prefix_male, gender_swap_prefix_female,
    register_shift_informal.
  • Dashboard views update automatically when perturbed runs are
    present; no migration step required.

References

  • Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics
    derived automatically from language corpora contain human-like
    biases. Science 356(6334), 183-186.
  • Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. T.
    (2016). Man is to computer programmer as woman is to homemaker?
    Debiasing word embeddings. NeurIPS.
  • Tatman, R. (2017). Gender and dialect bias in YouTube's automatic
    captions. In Proceedings of the First ACL Workshop on Ethics in
    Natural Language Processing.
  • Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On
    calibration of modern neural networks. ICML.
  • Demšar, J. (2006). Statistical comparisons of classifiers over
    multiple data sets. JMLR 7, 1-30.
  • Wilcoxon, F. (1945). Individual comparisons by ranking methods.
    Biometrics Bulletin 1(6), 80-83.

Acknowledgments

Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.