Metrics And Sustainability
Isi Roca edited this page Jun 6, 2026 · 2 revisions
PUMA computes seven metric families for every run, one of which is a sustainability footprint. Reporting quality and footprint together is what makes PUMA useful for research and engineering decisions: quality numbers alone tell only half the story when energy and latency vary by an order of magnitude across configurations.
- Accuracy — for classification scenarios, F1-macro, precision/recall per class, and a full confusion matrix. For regression scenarios, MAE, MdAE, and R². Computed from the predictions table once a run finishes.
- Calibration — Expected Calibration Error (ECE) over the model's confidence values. Only meaningful when the model exposes logprobs (most Ollama models do).
- Efficiency — latency percentiles (p50, p90, p99), tokens generated per second, and the total wall-clock time per run.
- Stability — how much the metrics shift across N repeated runs with different seeds. Reported as the coefficient of variation.
- Robustness — accuracy under controlled input perturbations: typo injection, paraphrasing, and unicode confusable swaps.
- Fairness — disparity across input subgroups (e.g., by project, by issue length). Reported as max-min gap per metric.
- Sustainability — gCO₂eq per run, energy Wh, country-grid emissions intensity, all sourced via CodeCarbon.
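Three of the summary statistics named above (ECE, coefficient of variation, and max-min gap) are simple enough to sketch in a few lines. The following is a minimal illustration, not PUMA's actual implementation: the function names, the choice of ten equal-width confidence bins, the use of the population standard deviation, and the subgroup names in the usage note are all assumptions made for the example.

```python
from statistics import mean, pstdev


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    Assumes equal-width bins over [0, 1]; the binning PUMA uses may differ.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = mean(c for c, _ in members)
        accuracy = mean(1.0 if ok else 0.0 for _, ok in members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


def coefficient_of_variation(values):
    """Standard deviation divided by the mean across repeated runs.

    Uses the population standard deviation (an assumption); raises
    ZeroDivisionError if the mean is zero.
    """
    return pstdev(values) / mean(values)


def max_min_gap(metric_by_group):
    """Largest minus smallest value of one metric across subgroups."""
    scores = metric_by_group.values()
    return max(scores) - min(scores)
```

For instance, `max_min_gap({"short_issues": 0.82, "long_issues": 0.74})` reports the spread of a single metric between two hypothetical subgroups, and `coefficient_of_variation` would be applied to the same metric collected over the N seeded runs.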
The Streamlit dashboard (run docker compose up -d puma_dashboard, then open http://localhost:8501) has nine views:
- Overview — a leaderboard across all logged runs.
- Model Comparison — side-by-side comparison of two or more selected runs.
- Multi-model — aggregate metrics across every model in the catalog.
- Reliability — calibration curves and Expected Calibration Error.
- Robustness — accuracy curves under each perturbation.
- Fairness — subgroup disparity heatmaps.
- Sustainability Frontier — gCO₂eq, energy Wh, and energy-per-correct-prediction.
- Instance Drill-down — per-instance inspection of predictions vs gold labels.
- 🤝 Community — the entry point for publishing your results to PUMA Community.
CodeCarbon tracks energy at the process level and converts to gCO₂eq using the configured country grid. Typical reference values:
- A 10-instance triage_jira run on qwen2.5:3b on cpu-standard: ≈ 0.3 gCO₂eq, ≈ 90 s wall-clock.
- The same 10-instance run on qwen2.5:7b on gpu-entry: ≈ 0.05 gCO₂eq, ≈ 25 s.
- A full 200-instance sweep across six models on gpu-mid: ≈ 15 gCO₂eq.
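The conversion from measured energy to gCO₂eq is plain arithmetic: energy in kWh multiplied by the grid's emissions intensity. The sketch below shows that step and the energy-per-correct-prediction figure from the Sustainability Frontier view. It does not call CodeCarbon; the function names and the decision to return infinity when nothing is correct are assumptions for illustration, and any input values you pass are your own, not PUMA measurements.

```python
def emissions_g(energy_wh, grid_intensity_g_per_kwh):
    """Convert energy in Wh to gCO2eq using a grid intensity in gCO2eq/kWh."""
    energy_kwh = energy_wh / 1000.0
    return energy_kwh * grid_intensity_g_per_kwh


def energy_per_correct_wh(energy_wh, n_correct):
    """Energy spent per correct prediction, in Wh.

    Returns infinity when no prediction is correct (an assumed convention),
    so that such a run sorts last rather than raising an error.
    """
    if n_correct == 0:
        return float("inf")
    return energy_wh / n_correct
```

With this definition, a run that draws 2.0 Wh on a grid at 400 gCO₂eq/kWh (both numbers made up for the example) emits about 0.8 gCO₂eq, and if 8 of its predictions are correct it costs 0.25 Wh per correct prediction.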
Comparing models without comparing footprints overlooks an entire dimension of the engineering decision; PUMA makes both visible side by side.