PUMA v2.2.0 Release Notes

Release date: 2026-05-13
Previous release: v2.1.0 (2026-05-10)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprints 3, 4, and 5 onto the v2.1.0 base:
a statistical-analysis pipeline (ECE, Wilcoxon signed-rank, multi-seed
reproducibility), a dashboard core with the PUMA visual identity, and an
empirical bias-evaluation suite adapted to the characteristics of the
technical corpus.

Highlights

Statistical analysis pipeline (Sprint 3)

ECE end-to-end. Expected Calibration Error computed from real
Ollama logprobs and persisted as metrics.metric_name='ece'.
Validated against Guo et al. (2017) canonical cases with tolerance
1e-6. Baseline qwen2.5:3b shows ECE=0.39, a substantial
miscalibration typical of out-of-the-box LLMs without post-hoc
calibration.
Multi-seed validation. Seeds {42, 123, 456} on the canonical
baseline yield zero variance in task metrics under T=0.0,
confirming the bit-exact reproducibility guarantee documented in
v2.0.0. Runtime jitter ~4 %.
Wilcoxon pairwise comparison (Demšar 2006 methodology) on
paired-correctness indicators. Demonstrated empirically on a
mini-comparison (qwen2.5:1.5b vs gemma3:1b on triage_jira × N=50)
that a 0.19-point F1 gap is not statistically significant at
α=0.05 (p=0.108, n_pairs=19/50) — the kind of finding the test is
designed to surface.
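
As an illustration of the two statistics above, here is a minimal standard-library sketch; it is not the PUMA implementation. It assumes equal-width confidence bins for ECE (the binning described by Guo et al., 2017) and a normal-approximation Wilcoxon signed-rank test that drops zero differences and assigns average ranks to ties. The function names and toy inputs are hypothetical.

```python
import math
from collections import Counter


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width bins ((m-1)/M, m/M], weighted by bin frequency."""
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        # Right-closed bin index, clamped so conf=0.0 falls into the first bin.
        idx = min(max(math.ceil(conf * n_bins) - 1, 0), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += ok
        count[idx] += 1
    # (|B|/n) * |accuracy(B) - confidence(B)| simplifies to |hits - conf| / n.
    return sum(abs(hit_sum[b] - conf_sum[b]) / n for b in range(n_bins) if count[b])


def paired_wilcoxon(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation, tie-corrected).

    Returns (W_plus, number_of_non_zero_pairs, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all pairs are tied; the test is undefined")
    magnitudes = sorted(abs(d) for d in diffs)
    avg_rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and magnitudes[j] == magnitudes[i]:
            j += 1
        avg_rank[magnitudes[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    w_plus = sum(avg_rank[abs(d)] for d in diffs if d > 0)
    mean = n * (n + 1) / 4
    ties = Counter(abs(d) for d in diffs).values()
    var = n * (n + 1) * (2 * n + 1) / 24 - sum(t**3 - t for t in ties) / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, n, p_value


# Toy inputs, invented for illustration (not PUMA results).
print(expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 0, 1, 1]))
model_a = [1] * 8 + [0] * 2 + [1] * 40  # 50 paired correctness indicators
model_b = [0] * 8 + [1] * 2 + [1] * 40
print(paired_wilcoxon(model_a, model_b))
```

With binary correctness indicators every non-zero difference has magnitude 1, so all ranks tie and the tie correction in the variance does real work; only the discordant pairs contribute to the statistic.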

Dashboard core (Sprint 4)

5 fully functional views: Overview (cohort cards + per-run
expanders, sidebar filters applied), Model Comparison (mean±std
aggregation across seeds, run × metric heatmap, Wilcoxon
artefact rendering), Reliability (real ECE + reliability diagram
from logprobs), Sustainability Frontier (F1 vs CO₂ Pareto consuming
the emissions table from Sprint 2 D15), Instance Drill-down
(correct gold_label via the new JOIN, top-K logprobs).
2 informed placeholders pending data: Fairness and Robustness
(made functional by Sprint 5).
PUMA visual identity: emerald palette, sans-serif typography,
logo in sidebar, telemetry disabled.
Dark-mode toggle via runtime CSS override.
6 smoke tests including an end-to-end streamlit.testing.v1.AppTest
render.
Bug fix: the Fairness and Instance Drill-down views in v2.1.0
read gold_label from predictions, where it does not exist
(it lives in instances). The new load_predictions_with_gold
helper LEFT-JOINs the two tables and is now consumed by every
view that needs the gold label.
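
The join described in the bug fix can be sketched as follows. This is an assumption-laden illustration, not the project's code: it uses an in-memory SQLite database, and the `instance_id` join key, the `predicted_label` column, and the function signature are hypothetical. Only the table names `predictions` and `instances`, the `gold_label` column, and the helper name come from the notes above.

```python
import sqlite3


def load_predictions_with_gold(conn):
    """Return (instance_id, predicted_label, gold_label) rows.

    The LEFT JOIN keeps every prediction; gold_label is None when no
    matching row exists in instances.
    """
    query = """
        SELECT p.instance_id, p.predicted_label, i.gold_label
        FROM predictions AS p
        LEFT JOIN instances AS i ON i.instance_id = p.instance_id
        ORDER BY p.instance_id
    """
    return conn.execute(query).fetchall()


# Toy schema and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE instances (instance_id INTEGER PRIMARY KEY, gold_label TEXT);
    CREATE TABLE predictions (instance_id INTEGER, predicted_label TEXT);
    INSERT INTO instances VALUES (1, 'bug'), (2, 'feature');
    INSERT INTO predictions VALUES (1, 'bug'), (2, 'bug'), (3, 'feature');
    """
)
rows = load_predictions_with_gold(conn)
print(rows)
```

A LEFT JOIN rather than an INNER JOIN means a prediction with no matching instance still appears, with a missing gold label, instead of silently disappearing from a view.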

Bias evaluation (Sprint 5) — empirical findings

Methodological adaptation. The triage_jira corpus is 100 %
technical incident text with 0 % gendered terms (verified by regex
over 23 EN tokens across all 200 instances). A textbook
pronoun-substitution gender_swap on this corpus would yield
flip_rate = 0 and disparity = 0 — a false PASS demonstrating
nothing. Sprint 5 therefore evaluates bias via signal injection
rather than signal substitution:

gender_swap_prefix_{male,female} — prepends a gendered identity
prefix (John Smith reported: … vs Mary Smith reported: …).
Methodology per Caliskan et al. (2017) and Bolukbasi et al. (2016).
register_shift_informal — formal→informal substitution proxy for
the dialect axis on a monolingual technical corpus (Tatman 2017).
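
The difference between substitution and injection can be sketched as below. The swap table is a small hypothetical subset (the notes mention 23 English tokens but do not list them), the function names are illustrative, and the sample ticket is invented; only the two prefixes are taken from the description above.

```python
import re

# Hypothetical subset of gendered tokens, for illustration only.
SWAP = {"he": "she", "she": "he", "his": "her", "her": "his",
        "man": "woman", "woman": "man"}
PATTERN = re.compile(r"\b(" + "|".join(SWAP) + r")\b", re.IGNORECASE)

PREFIXES = {"male": "John Smith reported: ", "female": "Mary Smith reported: "}


def gender_swap_substitution(text):
    """Textbook perturbation: replace each gendered token with its counterpart."""
    return PATTERN.sub(lambda m: SWAP[m.group(0).lower()], text)


def gender_swap_prefix(text, gender):
    """Signal injection: prepend a gendered identity prefix to the text."""
    return PREFIXES[gender] + text


ticket = "Service crashes with OOM after deploying build 4.2"  # invented example

# Substitution leaves gender-free text untouched, so any flip rate measured
# against it is trivially zero.
print(gender_swap_substitution(ticket) == ticket)

# Injection always changes the input, so the model actually sees a signal.
print(gender_swap_prefix(ticket, "male"))
print(gender_swap_prefix(ticket, "female"))
```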

Key empirical findings on triage_jira × N=100 per condition:

Model	Flip rate (any prefix vs baseline)	Δ accuracy	M-vs-F directional bias
qwen2.5:1.5b	~25-27 %	−11 to −12 pp	15 %
qwen2.5:3b	~25-27 %	−3 to −4 pp	5 %

Both models flip ~25 % of predictions when any gender signal is
added vs the un-perturbed baseline — strong sensitivity to identity
cues that the technical content does not require.
The 2× larger model (nominally 3B vs 1.5B parameters) exhibits ~3×
less directional bias (5 % vs 15 %) while losing ~3× less accuracy
under signal injection.
register_shift_informal shows ~0 % effect on both models:
formal-to-informal substitution does not perturb predictions.
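
The three quantities in the table can be sketched as below. The definitions are assumptions, since these notes do not spell them out: flip rate is taken as the share of predictions that differ from the baseline, Δ accuracy as the perturbed-minus-baseline accuracy in percentage points, and directional bias as the share of instances where the male-prefixed and female-prefixed predictions disagree. All names and labels are hypothetical.

```python
def flip_rate(reference, perturbed):
    """Share of instances whose prediction differs between two conditions."""
    return sum(r != p for r, p in zip(reference, perturbed)) / len(reference)


def accuracy(predictions, gold):
    """Share of predictions that match the gold label."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def delta_accuracy_pp(baseline, perturbed, gold):
    """Accuracy change under perturbation, in percentage points."""
    return 100 * (accuracy(perturbed, gold) - accuracy(baseline, gold))


def directional_bias(male, female):
    """Assumed definition: disagreement rate between the two prefixed runs."""
    return flip_rate(male, female)


# Toy labels, invented for illustration (not PUMA results).
gold = ["a", "b", "a", "b"]
baseline = ["a", "b", "a", "a"]
male = ["a", "b", "b", "a"]
female = ["a", "a", "b", "a"]

print(flip_rate(baseline, male))
print(delta_accuracy_pp(baseline, male, gold))
print(directional_bias(male, female))
```

Flip rate and directional bias answer different questions: the first compares each prefixed run with the unperturbed baseline, the second compares the two prefixed runs with each other.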

Closes Gate D criterion 4 ("Basic semantic bias implemented and
validated"). Closes debt D19.

Quality

Tests: 313 passing (up from 276 in v2.1.0; +37 TDD across the
three sprints).
Pre-commit: 10/10 hooks green.
CI: green on both main and develop.
Baseline reproducibility: F1=0.5867 ± 0.01 holds; verified via
puma validate-baseline (PASS at 0.5831, delta −0.0036).
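
The tolerance check behind the baseline line can be sketched as follows. It assumes an absolute band of ±0.01 around the reference F1; the function name and return shape are hypothetical and do not describe the internals of puma validate-baseline.

```python
def check_baseline(observed_f1, reference_f1=0.5867, tolerance=0.01):
    """Return (status, delta) for an observed F1 against the reference value."""
    delta = observed_f1 - reference_f1
    status = "PASS" if abs(delta) <= tolerance else "FAIL"
    return status, round(delta, 4)


print(check_baseline(0.5831))
```

Under the same assumption, the warm-state value F1=0.6002 reported later in these notes differs from the reference by 0.0135, which falls outside the ±0.01 band.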

Methodological findings (academic traceability)

Sprint 3 confirmed empirically the deterministic-reproducibility
guarantee documented in v2.0.0 (zero variance under T=0.0 with three
different seeds; runtime jitter does not propagate to metrics).
Sprint 4 dashboard integration exposed D22: the synthetic
triage_jira dataset does not persist input_text. This is the
fourth instance of the "symptom in layer N, root cause in layer M"
meta-pattern documented in docs/known_debt.md (joining D15, D18,
D21).
Sprint 5 confirmed D3 empirically: puma validate-baseline
showed bit-exact F1=0.6002 across consecutive runs after the bias
sweep, then returned to F1=0.5831 after
docker compose restart puma_ollama. This documents the
warm-state-drift scope of the reproducibility guarantee, which is
important for any future B.4 sweep protocol.

Debt tracking

Resolved this release: D19 (fairness scaffolding only).
New entry: D22 (Low) — instances.input_text empty on
triage_jira.
Total resolved across v2.0.0 → v2.2.0: 15 of 23 (65 %).
Open: 8 (0 critical, 5 medium, 2 low; 1 marked
DECIDED-NO-ACTION).

Full inventory and diagnostic write-ups in
docs/known_debt.md.

Known limitations

input_text not persisted in triage_jira instances (D22, Low —
future data-pipeline enhancement).
Single hardware tier evaluated (gpu-entry); models requiring
gpu-mid and above (qwen2.5:14b, gemma3:27b, deepseek-r1:14b,
the gemma4 family, llama3.1:70b) catalogued but not yet
empirically evaluated.
Dashboard polish deferred — animations, guided tour, refactor of
the 640-LOC app.py to a views/ module package. Slated for a
future Sprint 6.
AMD ROCm and Apple Metal backends not yet detected (development
hardware is NVIDIA-only).
TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).

Upgrade notes

No breaking changes to the public CLI or YAML run-spec schema.
New perturbation names accepted in perturbations: lists:
gender_swap_prefix_male, gender_swap_prefix_female,
register_shift_informal.
Dashboard views update automatically when perturbed runs are
present; no migration step required.

References

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics
derived automatically from language corpora contain human-like
biases. Science 356(6334), 183-186.
Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. T.
(2016). Man is to computer programmer as woman is to homemaker?
Debiasing word embeddings. NeurIPS.
Tatman, R. (2017). Gender and dialect bias in YouTube's automatic
captions. In Proceedings of the First ACL Workshop on Ethics in
Natural Language Processing.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On
calibration of modern neural networks. ICML.
Demšar, J. (2006). Statistical comparisons of classifiers over
multiple data sets. JMLR 7, 1-30.
Wilcoxon, F. (1945). Individual comparisons by ranking methods.
Biometrics Bulletin 1(6), 80-83.

Acknowledgments

Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.
