PUMA v2.2.0 — statistical pipeline + dashboard core + bias evaluation

Released by pumacp on 13 May at 01:02 · 263 commits to main since this release

PUMA v2.2.0 Release Notes

Release date: 2026-05-13
Previous release: v2.1.0 (2026-05-10)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprints 3, 4, and 5 onto the v2.1.0 base:
statistical-analysis pipeline (ECE, Wilcoxon signed-rank, multi-seed
reproducibility), dashboard core with PUMA visual identity, and an
empirical bias-evaluation suite adapted to the characteristics of the
technical corpus.

Highlights

Statistical analysis pipeline (Sprint 3)

  • ECE end-to-end. Expected Calibration Error computed from real
    Ollama logprobs and persisted as metrics.metric_name='ece'.
    Validated against Guo et al. (2017) canonical cases with tolerance
    1e-6. Baseline qwen2.5:3b shows ECE=0.39, a substantial
    miscalibration typical of out-of-the-box LLMs without post-hoc
    calibration.
  • Multi-seed validation. Seeds {42, 123, 456} on the canonical
    baseline yield zero variance in task metrics under T=0.0,
    confirming the bit-exact reproducibility guarantee documented in
    v2.0.0. Runtime jitter ~4 %.
  • Wilcoxon pairwise comparison (Demšar 2006 methodology) on
    paired-correctness indicators. Demonstrated empirically on a
    mini-comparison (qwen2.5:1.5b vs gemma3:1b on triage_jira × N=50)
    that a 0.19-point F1 gap is not statistically significant at
    α=0.05 (p=0.108, n_pairs=19/50) — the kind of finding the test is
    designed to surface.
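
The ECE and Wilcoxon bullets above can be illustrated with a minimal, standard-library sketch. It is not the PUMA implementation: the function names, the equal-width 10-bin scheme, the use of exp(logprob) as the confidence, and the toy data are all assumptions made for illustration. The Wilcoxon helper uses the normal approximation with tie correction and is specialised to 0/1 correctness indicators.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins (Guo et al., 2017)."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bin m holds confidences in roughly (m / n_bins, (m + 1) / n_bins].
        idx = min(max(math.ceil(conf * n_bins) - 1, 0), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(o for _, o in members) / len(members)
            # Weight each bin's |accuracy - confidence| gap by its share of samples.
            ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


def wilcoxon_paired_correctness(correct_a, correct_b):
    """Two-sided Wilcoxon signed-rank test on paired 0/1 correctness indicators.

    Returns (n_pairs, p_value); n_pairs counts pairs with a non-zero difference.
    Uses the normal approximation with tie correction, no continuity correction.
    """
    diffs = [a - b for a, b in zip(correct_a, correct_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0, 1.0
    # Every non-zero |difference| equals 1, so all n ranks tie at (n + 1) / 2.
    w_plus = (n + 1) / 2 * sum(1 for d in diffs if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - (n**3 - n) / 48
    z = (w_plus - mean) / math.sqrt(variance)
    return n, math.erfc(abs(z) / math.sqrt(2))


# Synthetic data only, not PUMA results.
# Assumption: the confidence of a prediction is exp(logprob of the chosen label).
toy_confidences = [math.exp(lp) for lp in (-0.05, -0.10, -0.60, -0.70)]
print(expected_calibration_error(toy_confidences, [1, 1, 1, 0]))

toy_a = [1] * 4 + [0] * 12 + [1] * 10   # model A correct on 14 of 26
toy_b = [0] * 4 + [1] * 12 + [1] * 10   # model B correct on 22 of 26
print(wilcoxon_paired_correctness(toy_a, toy_b))
```

Because the differences of binary indicators are all ±1, only the discordant pairs contribute, which is one plausible reading of why the release reports n_pairs well below N.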

Dashboard core (Sprint 4)

  • 5 fully functional views: Overview (cohort cards + per-run
    expanders, sidebar filters applied), Model Comparison (mean±std
    aggregation across seeds, run × metric heatmap, Wilcoxon
    artefact rendering), Reliability (real ECE + reliability diagram
    from logprobs), Sustainability Frontier (F1 vs CO₂ Pareto consuming
    the emissions table from Sprint 2 D15), Instance Drill-down
    (correct gold_label via the new JOIN, top-K logprobs).
  • 2 informed placeholders pending data: Fairness and Robustness
    (made functional by Sprint 5).
  • PUMA visual identity: emerald palette, sans-serif typography,
    logo in sidebar, telemetry disabled.
  • Dark-mode toggle via runtime CSS override.
  • 6 smoke tests including an end-to-end streamlit.testing.v1.AppTest
    render.
  • Bug fix: the Fairness and Instance Drill-down views in v2.1.0
    read gold_label from predictions, where it does not exist
    (it lives in instances). The new load_predictions_with_gold
    helper LEFT-JOINs the two tables and is now consumed by every
    view that needs the gold label.
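
The gold-label fix described in the last bullet can be sketched as a LEFT JOIN from predictions to instances. This is a hypothetical illustration on an in-memory SQLite database: only the table names and gold_label come from the notes above, while the join key instance_id, the predicted_label column, and the function name are assumptions, and the real helper may use a different storage layer or signature.

```python
import sqlite3


def load_predictions_with_gold_sketch(conn):
    """Return each prediction row together with its gold label (None if missing)."""
    query = """
        SELECT p.instance_id, p.predicted_label, i.gold_label
        FROM predictions AS p
        LEFT JOIN instances AS i ON i.instance_id = p.instance_id
        ORDER BY p.instance_id
    """
    return conn.execute(query).fetchall()


def build_demo_db():
    """Create a toy schema; column names other than gold_label are assumed."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE instances (instance_id INTEGER PRIMARY KEY, gold_label TEXT);
        CREATE TABLE predictions (instance_id INTEGER, predicted_label TEXT);
        INSERT INTO instances VALUES (1, 'bug'), (2, 'task');
        INSERT INTO predictions VALUES (1, 'bug'), (2, 'bug'), (3, 'task');
        """
    )
    return conn


rows = load_predictions_with_gold_sketch(build_demo_db())
print(rows)  # [(1, 'bug', 'bug'), (2, 'bug', 'task'), (3, 'task', None)]
```

A LEFT JOIN keeps predictions whose instance row is missing (gold label None), so a view can flag them instead of silently dropping them.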

Bias evaluation (Sprint 5) — empirical findings

Methodological adaptation. The triage_jira corpus is 100 %
technical incident text with 0 % gendered terms (verified by regex
over 23 EN tokens across all 200 instances). A textbook
pronoun-substitution gender_swap on this corpus would yield
flip_rate = 0 and disparity = 0 — a false PASS demonstrating
nothing. Sprint 5 therefore evaluates bias via signal injection
rather than signal substitution:

  • gender_swap_prefix_{male,female} — prepends a gendered identity
    prefix (John Smith reported: … vs Mary Smith reported: …).
    Methodology per Caliskan et al. (2017) and Bolukbasi et al. (2016).
  • register_shift_informal — formal→informal substitution proxy for
    the dialect axis on a monolingual technical corpus (Tatman 2017).

Key empirical findings on triage_jira × N=100 per condition:

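| Model        | Flip rate (any prefix vs baseline) | Δ accuracy    | M-vs-F directional bias |
|--------------|------------------------------------|---------------|-------------------------|
| qwen2.5:1.5b | ~25-27 %                           | −11 to −12 pp | 15 %                    |
| qwen2.5:3b   | ~25-27 %                           | −3 to −4 pp   | 5 %                     |
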
  • Both models flip ~25 % of predictions when any gender signal is
    added vs the un-perturbed baseline — strong sensitivity to identity
    cues that the technical content does not require.
  • The 2× larger model (qwen2.5:3b vs qwen2.5:1.5b) exhibits ~3× less
    directional bias while losing ~3× less accuracy under signal
    injection.
  • register_shift_informal shows ~0 % effect on both models:
    formal-to-informal substitution does not perturb predictions.
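
The three quantities reported above could be computed along the following lines. This is a sketch under stated assumptions: the notes do not define "M-vs-F directional bias", so it is read here as the share of instances whose male-prefixed and female-prefixed predictions disagree. The function names and the toy labels are illustrative, not PUMA data.

```python
def flip_rate(reference, perturbed):
    """Share of instances whose prediction changes between two conditions."""
    return sum(r != p for r, p in zip(reference, perturbed)) / len(reference)


def accuracy(predictions, gold):
    """Share of predictions that match the gold label."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def delta_accuracy_pp(baseline, perturbed, gold):
    """Accuracy change in percentage points (negative means the perturbation hurts)."""
    return 100.0 * (accuracy(perturbed, gold) - accuracy(baseline, gold))


# Toy labels for four instances; synthetic, for illustration only.
gold = ["bug", "bug", "task", "task"]
baseline = ["bug", "bug", "task", "bug"]
male = ["bug", "task", "task", "bug"]
female = ["bug", "bug", "task", "task"]

print(flip_rate(baseline, male))                # 0.25
print(delta_accuracy_pp(baseline, male, gold))  # -25.0
# Assumed reading of "M-vs-F directional bias": disagreement between the two prefixes.
print(flip_rate(male, female))                  # 0.5
```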

Closes Gate D criterion 4 ("Basic semantic bias implemented and
validated"). Closes debt D19.

Quality

  • Tests: 313 passing (up from 276 in v2.1.0; +37 TDD across the
    three sprints).
  • Pre-commit: 10/10 hooks green.
  • CI: green on both main and develop.
  • Baseline reproducibility: F1=0.5867 ± 0.01 holds; verified via
    puma validate-baseline (PASS at 0.5831, delta −0.0036).
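
The baseline check in the last bullet amounts to a tolerance comparison, sketched below. The function is hypothetical and is not the puma validate-baseline CLI. Only the reference value 0.5867 and the ±0.01 tolerance come from the notes; the inclusive comparison at the boundary is an assumption.

```python
def check_baseline(observed_f1, reference_f1=0.5867, tolerance=0.01):
    """Return (passed, delta), where delta = observed - reference, rounded to 4 places."""
    delta = observed_f1 - reference_f1
    return abs(delta) <= tolerance, round(delta, 4)


print(check_baseline(0.5831))  # within tolerance
print(check_baseline(0.6002))  # outside tolerance
```

Under this reading, the warm-state value F1=0.6002 mentioned under "Methodological findings" lies 0.0135 above the reference and would fall outside the ±0.01 band, while 0.5831 stays inside it.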

Methodological findings (academic traceability)

  • Sprint 3 confirmed empirically the deterministic-reproducibility
    guarantee documented in v2.0.0 (zero variance under T=0.0 with three
    different seeds; runtime jitter does not propagate to metrics).
  • Sprint 4 dashboard integration exposed D22: the synthetic
    triage_jira dataset does not persist input_text. This is the
    fourth instance of the "symptom in layer N, root cause in layer M"
    meta-pattern documented in docs/known_debt.md (joining D15, D18,
    D21).
  • Sprint 5 confirmed D3 empirically: puma validate-baseline
    showed bit-exact F1=0.6002 across consecutive runs after the bias
    sweep, then returned to F1=0.5831 after
    docker compose restart puma_ollama. This documents the
    warm-state-drift scope of the reproducibility guarantee, which
    matters for any future B.4 sweep protocol.

Debt tracking

  • Resolved this release: D19 (fairness scaffolding only).
  • New entry: D22 (Low) — instances.input_text empty on
    triage_jira.
  • Total resolved across v2.0.0 → v2.2.0: 15 of 23 (65 %).
  • Open: 8 (0 critical, 5 medium, 2 low; 1 marked
    DECIDED-NO-ACTION).

Full inventory and diagnostic write-ups in
docs/known_debt.md.

Known limitations

  • input_text not persisted in triage_jira instances (D22, Low —
    future data-pipeline enhancement).
  • Single hardware tier evaluated (gpu-entry); models requiring
    gpu-mid and above (qwen2.5:14b, gemma3:27b, deepseek-r1:14b,
    the gemma4 family, llama3.1:70b) catalogued but not yet
    empirically evaluated.
  • Dashboard polish deferred — animations, guided tour, refactor of
    the 640-LOC app.py to a views/ module package. Slated for a
    future Sprint 6.
  • AMD ROCm and Apple Metal backends not yet detected (development
    hardware is NVIDIA-only).
  • TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).

Upgrade notes

  • No breaking changes to the public CLI or YAML run-spec schema.
  • New perturbation names accepted in perturbations: lists:
    gender_swap_prefix_male, gender_swap_prefix_female,
    register_shift_informal.
  • Dashboard views update automatically when perturbed runs are
    present; no migration step required.

References

  • Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics
    derived automatically from language corpora contain human-like
    biases. Science 356(6334), 183-186.
  • Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. T.
    (2016). Man is to computer programmer as woman is to homemaker?
    Debiasing word embeddings. NeurIPS.
  • Tatman, R. (2017). Gender and dialect bias in YouTube's automatic
    captions. In Proceedings of the First ACL Workshop on Ethics in
    Natural Language Processing.
  • Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On
    calibration of modern neural networks. ICML.
  • Demšar, J. (2006). Statistical comparisons of classifiers over
    multiple data sets. JMLR 7, 1-30.
  • Wilcoxon, F. (1945). Individual comparisons by ranking methods.
    Biometrics Bulletin 1(6), 80-83.

Acknowledgments

Development assistance provided by generative AI tooling. All commits
are attributed to the project's git identity per repository
convention.