Metrics And Sustainability

PUMA computes seven metric families for every run and pairs them with a sustainability footprint. The combination is what makes PUMA useful for research and engineering decisions: pure quality numbers tell only half the story when energy and latency vary by an order of magnitude across configurations.

The seven metric families

Accuracy — for classification scenarios, F1-macro, precision/recall per class, and a full confusion matrix. For regression scenarios, MAE, MdAE, and R². Computed from the predictions table once a run finishes.
Calibration — Expected Calibration Error (ECE) over the model's confidence values. This is only meaningful when the model exposes logprobs (most Ollama models do).
Efficiency — latency percentiles (p50, p90, p99), tokens generated per second, and the total wall-clock time per run.
Stability — how much the metrics shift across N repeated runs with different seeds. Reported as the coefficient of variation.
Robustness — accuracy under controlled input perturbations: typo injection, paraphrasing, and unicode confusable swaps.
Fairness — disparity across input subgroups (e.g., by project, by issue length). Reported as max-min gap per metric.
Sustainability — gCO₂eq per run, energy in Wh, and country-grid emissions intensity, all sourced via CodeCarbon.
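Three of the families above reduce to short formulas, sketched here in plain Python. This is an illustrative sketch, not PUMA's implementation: the function names, the ten equal-width bins for ECE, and the use of the population standard deviation for the coefficient of variation are all assumptions.

```python
from statistics import mean, pstdev


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = mean(c for c, _ in members)
        accuracy = mean(1.0 if ok else 0.0 for _, ok in members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


def coefficient_of_variation(values):
    """Stability: standard deviation divided by the mean across repeated runs.

    Assumes the population standard deviation and a non-zero mean.
    """
    return pstdev(values) / mean(values)


def max_min_gap(metric_by_subgroup):
    """Fairness: best subgroup score minus worst subgroup score."""
    scores = list(metric_by_subgroup.values())
    return max(scores) - min(scores)
```

For example, a model that reports 0.75 confidence on four predictions and gets three of them right is perfectly calibrated on that sample, so `expected_calibration_error([0.75] * 4, [True, True, True, False])` evaluates to zero. A lower coefficient of variation means the metric is more stable across seeds, and a smaller max-min gap means less disparity between subgroups.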

Reading the dashboard

The Streamlit dashboard (run docker compose up -d puma_dashboard, then open http://localhost:8501) has nine views:

Overview — a leaderboard across all logged runs.
Model Comparison — side-by-side comparison of two or more selected runs.
Multi-model — aggregate metrics across every model in the catalog.
Reliability — calibration curves and Expected Calibration Error.
Robustness — accuracy curves under each perturbation.
Fairness — subgroup disparity heatmaps.
Sustainability Frontier — gCO₂eq, energy in Wh, and energy-per-correct-prediction.
Instance Drill-down — per-instance inspection of predictions vs gold labels.
🤝 Community — the entry point for publishing your results to PUMA Community.

Carbon footprint

CodeCarbon tracks energy at the process level and converts to gCO₂eq using the configured country grid. Typical reference values:

A 10-instance triage_jira run on qwen2.5:3b on cpu-standard: ≈ 0.3 gCO₂eq, ≈ 90 s wall-clock.
The same 10-instance run on qwen2.5:7b on gpu-entry: ≈ 0.05 gCO₂eq, ≈ 25 s wall-clock.
A full 200-instance sweep across six models on gpu-mid: ≈ 15 gCO₂eq.
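The conversion from measured energy to emissions is a single multiplication, and the energy-per-correct-prediction figure shown on the dashboard is a single division. The sketch below shows that arithmetic under stated assumptions: the function names are hypothetical, and the input values are made up for illustration rather than taken from PUMA or CodeCarbon.

```python
def emissions_g(energy_wh, grid_intensity_g_per_kwh):
    """Convert energy in Wh to gCO2eq using a grid intensity in gCO2eq per kWh."""
    return energy_wh / 1000.0 * grid_intensity_g_per_kwh


def energy_per_correct_wh(energy_wh, n_correct):
    """Energy spent per correct prediction; infinite when nothing is correct."""
    if n_correct == 0:
        return float("inf")
    return energy_wh / n_correct


# Hypothetical run: 2.0 Wh consumed on a grid emitting 400 gCO2eq per kWh,
# with 8 of 10 instances predicted correctly.
run_emissions = emissions_g(2.0, 400.0)        # about 0.8 gCO2eq
run_efficiency = energy_per_correct_wh(2.0, 8)  # 0.25 Wh per correct prediction
```

Because the grid intensity multiplies the result directly, the same run reports a different footprint under a different configured country grid, even when the measured energy is identical.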

Comparing models without comparing footprints overlooks an entire dimension of the engineering decision; PUMA makes both visible side by side.
