PUMA v2.2.0 — statistical pipeline + dashboard core + bias evaluation
PUMA v2.2.0 Release Notes
Release date: 2026-05-13
Previous release: v2.1.0 (2026-05-10)
Branch: develop → main (post-tag)
Summary
This release consolidates Sprints 3, 4, and 5 onto the v2.1.0 base:
statistical-analysis pipeline (ECE, Wilcoxon signed-rank, multi-seed
reproducibility), dashboard core with PUMA visual identity, and an
empirical bias-evaluation suite adapted to the characteristics of the
technical corpus.
Highlights
Statistical analysis pipeline (Sprint 3)
- ECE end-to-end. Expected Calibration Error computed from real
Ollama logprobs and persisted as `metrics.metric_name='ece'`.
Validated against Guo et al. (2017) canonical cases with tolerance
1e-6. Baseline qwen2.5:3b shows ECE=0.39, a significant
miscalibration typical of out-of-the-box LLMs without post-hoc
calibration.
- Multi-seed validation. Seeds {42, 123, 456} on the canonical
baseline yield zero variance in task metrics under T=0.0,
confirming the bit-exact reproducibility guarantee documented in
v2.0.0. Runtime jitter ~4 %.
- Wilcoxon pairwise comparison (Demšar 2006 methodology) on
paired-correctness indicators. Demonstrated empirically on a
mini-comparison (qwen2.5:1.5b vs gemma3:1b on triage_jira × N=50)
that a 0.19-point F1 gap is not statistically significant at
α=0.05 (p=0.108, n_pairs=19/50), the kind of finding the test is
designed to surface.
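The two statistics named in the list above can be sketched in plain Python. This is a minimal illustration, not the pipeline's code: the function names are hypothetical, the ECE uses 10 equal-width bins as in Guo et al. (2017), and the Wilcoxon test uses the tie-corrected normal approximation. With binary correctness indicators only discordant pairs carry a non-zero difference, which is one plausible reading of why n_pairs is smaller than N.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width bins ((m-1)/M, m/M]; names are illustrative."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to its bin; 0.0 falls into the first bin.
        idx = min(n_bins - 1, max(0, math.ceil(conf * n_bins) - 1))
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def wilcoxon_paired_correctness(correct_a, correct_b):
    """Two-sided Wilcoxon signed-rank test on 0/1 correctness indicators.

    Returns (n_pairs, p_value); n_pairs counts the discordant pairs.
    Uses the normal approximation with tie correction (an assumption here).
    """
    diffs = [a - b for a, b in zip(correct_a, correct_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0, 1.0
    # Every discordant pair has |d| = 1, so all n differences tie and
    # share the average rank (n + 1) / 2.
    avg_rank = (n + 1) / 2
    w_plus = avg_rank * sum(1 for d in diffs if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - (n**3 - n) / 48
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return n, p_value
```

A confidence for the ECE input could be derived as `math.exp(logprob)` of the predicted token, assuming the logged value is a natural-log probability.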
Dashboard core (Sprint 4)
- 5 fully functional views: Overview (cohort cards + per-run
expanders, sidebar filters applied), Model Comparison (mean±std
aggregation across seeds, run × metric heatmap, Wilcoxon
artefact rendering), Reliability (real ECE + reliability diagram
from logprobs), Sustainability Frontier (F1 vs CO₂ Pareto consuming
the emissions table from Sprint 2 D15), Instance Drill-down
(correct `gold_label` via the new JOIN, top-K logprobs).
- 2 informed placeholders pending data: Fairness and Robustness
(made functional by Sprint 5).
- PUMA visual identity: emerald palette, sans-serif typography,
logo in sidebar, telemetry disabled.
- Dark-mode toggle via runtime CSS override.
- 6 smoke tests including an end-to-end
`streamlit.testing.v1.AppTest` render.
- Bug fix: the Fairness and Instance Drill-down views in v2.1.0
read `gold_label` from `predictions`, where it does not exist
(it lives in `instances`). The new `load_predictions_with_gold`
helper LEFT-JOINs the two tables and is now consumed by every
view that needs the gold label.
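The shape of the gold-label fix can be illustrated with a small SQLite example. This is a sketch under stated assumptions, not the project's helper: the function name carries a `_sketch` suffix to mark it as hypothetical, and the storage engine, the `run_id`, `instance_id`, and `predicted_label` columns, and the sample rows are all invented for illustration. Only the `predictions` and `instances` tables and the `gold_label` column come from the notes.

```python
import sqlite3


def load_predictions_with_gold_sketch(conn, run_id):
    """Return (instance_id, predicted_label, gold_label) rows for one run."""
    query = """
        SELECT p.instance_id, p.predicted_label, i.gold_label
        FROM predictions AS p
        LEFT JOIN instances AS i ON i.instance_id = p.instance_id
        WHERE p.run_id = ?
        ORDER BY p.instance_id
    """
    return conn.execute(query, (run_id,)).fetchall()


# In-memory demo with made-up rows; instance 3 has no matching instance row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instances (instance_id INTEGER PRIMARY KEY, gold_label TEXT);
    CREATE TABLE predictions (
        run_id TEXT, instance_id INTEGER, predicted_label TEXT
    );
    INSERT INTO instances VALUES (1, 'bug'), (2, 'feature');
    INSERT INTO predictions VALUES
        ('run-a', 1, 'bug'), ('run-a', 2, 'bug'), ('run-a', 3, 'bug');
""")
rows = load_predictions_with_gold_sketch(conn, "run-a")
```

A LEFT JOIN keeps every prediction even when no instance row matches, so a missing gold label surfaces as `None` in the view rather than silently dropping the prediction.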
Bias evaluation (Sprint 5) — empirical findings
Methodological adaptation. The triage_jira corpus is 100 %
technical incident text with 0 % gendered terms (verified by regex
over 23 EN tokens across all 200 instances). A textbook
pronoun-substitution gender_swap on this corpus would yield
flip_rate = 0 and disparity = 0 — a false PASS demonstrating
nothing. Sprint 5 therefore evaluates bias via signal injection
rather than signal substitution:
- `gender_swap_prefix_{male,female}`: prepends a gendered identity
prefix (`John Smith reported: …` vs `Mary Smith reported: …`).
Methodology per Caliskan et al. (2017) and Bolukbasi et al. (2016).
- `register_shift_informal`: formal→informal substitution proxy for
the dialect axis on a monolingual technical corpus (Tatman 2017).
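The difference between substitution and injection can be shown with a short sketch. Everything below is illustrative: the prefix strings follow the two examples quoted above, the function names are hypothetical, and the regex covers only a handful of gendered tokens, not the 23-token list used for the corpus check.

```python
import re

# Prefixes modelled on the examples quoted in these notes.
PREFIXES = {
    "male": "John Smith reported: ",
    "female": "Mary Smith reported: ",
}

# Illustrative subset of gendered tokens (NOT the project's 23-token list).
GENDERED = re.compile(r"\b(he|she|him|her|his|hers|mr|mrs|ms)\b", re.IGNORECASE)


def has_gendered_term(text):
    """True if the text contains any token from the illustrative list."""
    return GENDERED.search(text) is not None


def gender_swap_prefix(text, gender):
    """Signal injection: prepend a gendered identity prefix to the input."""
    return PREFIXES[gender] + text


ticket = "Login endpoint returns HTTP 500 after deploy"
perturbed_m = gender_swap_prefix(ticket, "male")
perturbed_f = gender_swap_prefix(ticket, "female")
```

Because `has_gendered_term(ticket)` is false for text like this, a substitution-based swap would leave the input unchanged, whereas the prefix variants always differ from the baseline and from each other.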
Key empirical findings on triage_jira × N=100 per condition:
| Model | Flip rate (any prefix vs baseline) | Δ accuracy | M-vs-F directional bias |
|---|---|---|---|
| qwen2.5:1.5b | ~25-27 % | −11 to −12 pp | 15 % |
| qwen2.5:3b | ~25-27 % | −3 to −4 pp | 5 % |
- Both models flip ~25 % of predictions when any gender signal is
added vs the un-perturbed baseline, a strong sensitivity to identity
cues that the technical content does not require.
- The 2× larger model (3B vs 1.5B parameters) exhibits ~3× less
directional bias while losing ~3× less accuracy under signal
injection.
- `register_shift_informal` shows ~0 % effect on both models:
formal-to-informal substitution does not perturb predictions.
Closes Gate D criterion 4 ("Basic semantic bias implemented and
validated"). Closes debt D19.
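The three columns of the findings table can be computed from per-instance predictions roughly as follows. This is a sketch: the function names are hypothetical, and the definition of directional bias used here (share of instances where the male-prefix and female-prefix predictions disagree) is an assumption, since the notes do not spell out the formula.

```python
def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def flip_rate(baseline_preds, perturbed_preds):
    """Fraction of instances whose prediction changes under perturbation."""
    flips = sum(b != p for b, p in zip(baseline_preds, perturbed_preds))
    return flips / len(baseline_preds)


def delta_accuracy_pp(baseline_preds, perturbed_preds, gold):
    """Accuracy change in percentage points (negative means a loss)."""
    return 100 * (accuracy(perturbed_preds, gold) - accuracy(baseline_preds, gold))


def directional_bias(male_preds, female_preds):
    """ASSUMED definition: share of instances where the two prefixes disagree."""
    return flip_rate(male_preds, female_preds)


# Made-up labels for four instances, purely to exercise the functions.
gold = ["a", "b", "a", "b"]
baseline = ["a", "b", "a", "a"]
male = ["a", "a", "a", "a"]
female = ["a", "b", "b", "a"]
```

Flip rate and Δ accuracy are deliberately separate: a flip can move a prediction from wrong to right, so a high flip rate does not by itself imply a large accuracy loss.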
Quality
- Tests: 313 passing (up from 276 in v2.1.0; +37 TDD across the
three sprints).
- Pre-commit: 10/10 hooks green.
- CI: green on both `main` and `develop`.
- Baseline reproducibility: F1=0.5867 ± 0.01 holds; verified via
`puma validate-baseline` (PASS at 0.5831, delta −0.0036).
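The baseline check in the last bullet amounts to a tolerance comparison. The sketch below assumes the PASS condition is `|observed − reference| ≤ tolerance`; the function name is hypothetical and the actual logic of `puma validate-baseline` may differ.

```python
def check_baseline(observed_f1, reference_f1=0.5867, tolerance=0.01):
    """Return (passed, delta) for an observed F1 against the reference band."""
    delta = observed_f1 - reference_f1
    return abs(delta) <= tolerance, delta


passed, delta = check_baseline(0.5831)
print(f"{'PASS' if passed else 'FAIL'} at 0.5831, delta {delta:+.4f}")
```

Under this reading, an observed F1 more than 0.01 away from 0.5867 in either direction would be reported as a failure.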
Methodological findings (academic traceability)
- Sprint 3 confirmed empirically the deterministic-reproducibility
guarantee documented in v2.0.0 (zero variance under T=0.0 with three
different seeds; runtime jitter does not propagate to metrics).
- Sprint 4 dashboard integration exposed D22: the synthetic
`triage_jira` dataset does not persist `input_text`. This is a
fourth instance of the "symptom in layer N, root cause in layer M"
meta-pattern documented in `docs/known_debt.md` (joining D15, D18,
D21).
- Sprint 5 confirmed D3 empirically: `puma validate-baseline`
showed bit-exact F1=0.6002 across consecutive runs after the bias
sweep, then returned to F1=0.5831 after
`docker compose restart puma_ollama`. This documents the
warm-state-drift scope of the reproducibility guarantee, which
matters for any future B.4 sweep protocol.
Debt tracking
- Resolved this release: D19 (fairness scaffolding only).
- New entry: D22 (Low): `instances.input_text` empty on
`triage_jira`.
- Total resolved across v2.0.0 → v2.2.0: 15 of 23 (65 %).
- Open: 8 (0 critical, 5 medium, 2 low; 1 marked
DECIDED-NO-ACTION).
Full inventory and diagnostic write-ups in `docs/known_debt.md`.
Known limitations
- `input_text` not persisted in `triage_jira` instances (D22, Low;
future data-pipeline enhancement).
- Single hardware tier evaluated (`gpu-entry`); models requiring
`gpu-mid` and above (qwen2.5:14b, gemma3:27b, deepseek-r1:14b,
the `gemma4` family, llama3.1:70b) catalogued but not yet
empirically evaluated.
- Dashboard polish deferred: animations, guided tour, refactor of
the 640-LOC `app.py` to a `views/` module package. Slated for a
future Sprint 6.
- AMD ROCm and Apple Metal backends not yet detected (development
hardware is NVIDIA-only).
- TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
Upgrade notes
- No breaking changes to the public CLI or YAML run-spec schema.
- New perturbation names accepted in `perturbations:` lists:
`gender_swap_prefix_male`, `gender_swap_prefix_female`,
`register_shift_informal`.
- Dashboard views update automatically when perturbed runs are
present; no migration step required.
References
- Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics
derived automatically from language corpora contain human-like
biases. Science 356(6334), 183-186.
- Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. T.
(2016). Man is to computer programmer as woman is to homemaker?
Debiasing word embeddings. NeurIPS.
- Tatman, R. (2017). Gender and dialect bias in YouTube's automatic
captions. In Proceedings of the First ACL Workshop on Ethics in
Natural Language Processing.
- Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On
calibration of modern neural networks. ICML.
- Demšar, J. (2006). Statistical comparisons of classifiers over
multiple data sets. JMLR 7, 1-30.
- Wilcoxon, F. (1945). Individual comparisons by ranking methods.
Biometrics Bulletin 1(6), 80-83.
Acknowledgments
Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.