A small Python package for comparing machine learning models across benchmark suites. Given a CSV of per-model, per-dataset scores, evaluma can compute three complementary views of the results:
- IQM ranking — interquartile mean with bootstrapped confidence intervals, following Agarwal et al. (2021)
- Bayesian pairwise comparison — posterior probabilities that model A beats model B (or is practically equivalent), via baycomp
- Dolan-Moré performance profiles — cumulative distribution of performance ratios and area-under-profile scores
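To make the first view concrete, here is a minimal standard-library sketch of an interquartile mean with a percentile bootstrap interval. It is not evaluma's implementation: the helper names `iqm` and `bootstrap_ci`, the tail-trimming rule, and the example scores are all assumptions made for illustration, and Agarwal et al. describe a stratified bootstrap that this sketch does not reproduce.

```python
import random
import statistics


def iqm(scores):
    """Interquartile mean: mean of the middle half of the sorted scores.

    Simplification (assumption): drop floor(n / 4) values from each tail.
    """
    ordered = sorted(scores)
    n = len(ordered)
    cut = n // 4
    return statistics.fmean(ordered[cut:n - cut])


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the IQM."""
    rng = random.Random(seed)
    # Resample with replacement, recompute the IQM, and sort the results.
    stats = sorted(
        iqm(rng.choices(scores, k=len(scores))) for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical per-dataset scores for a single model.
scores = [0.91, 0.87, 0.84, 0.79, 0.93, 0.88, 0.62, 0.90]
point = iqm(scores)
low, high = bootstrap_ci(scores)
print(f"IQM = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Because the lowest and highest quarter of scores are discarded, the single outlier (0.62) does not affect the point estimate.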
Full documentation, including tutorials and API reference, is available at evaluma.readthedocs.io.
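For readers unfamiliar with the Dolan-Moré profiles listed above, the following sketch shows the original cost-based definition, where lower values are better: each model's cost on a problem is divided by the best cost on that problem, and the profile at a threshold tau is the fraction of problems whose ratio is at most tau. The function names and the numbers are hypothetical, and how evaluma maps higher-is-better scores onto ratios is not covered here.

```python
def performance_ratios(costs):
    """Map {model: [cost per problem]} to {model: [ratio per problem]}.

    Each ratio is the model's cost divided by the best (lowest) cost
    achieved by any model on that problem, so the best model gets 1.0.
    """
    n_problems = len(next(iter(costs.values())))
    best = [min(c[p] for c in costs.values()) for p in range(n_problems)]
    return {
        name: [c[p] / best[p] for p in range(n_problems)]
        for name, c in costs.items()
    }


def profile(ratios, tau):
    """Fraction of problems on which the performance ratio is at most tau."""
    return sum(r <= tau for r in ratios) / len(ratios)


# Hypothetical costs (for example, error rates) on three problems.
costs = {
    "ModelA": [0.10, 0.20, 0.40],
    "ModelB": [0.20, 0.10, 0.40],
}
ratios = performance_ratios(costs)
for name, r in ratios.items():
    print(name, [round(profile(r, tau), 3) for tau in (1.0, 1.5, 2.5)])
```

The value at tau = 1 is the share of problems on which a model is (tied for) best, and the curve rises to 1 once tau exceeds the model's worst ratio.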
pip install evaluma
For a development install from source:
git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"
import evaluma
bench = evaluma.load_df(
"results.csv",
model="model",
dataset="dataset",
metric="metric",
score="score",
)
# IQM ranking with 95% bootstrap CI
iqm = bench.iqm_ranking()
print(iqm.table)
fig = iqm.plot()
fig.savefig("iqm.png")
# Bayesian pairwise probabilities
bayes = bench.bayesian_comparison()
print(bayes.table)
# Dolan-Moré performance profiles
profiles = bench.performance_profiles()
fig = profiles.plot()
# Run all three analyses and write six output files
evaluma report results.csv \
--model model --dataset dataset --metric metric --score score \
--output results/
# Individual subcommands
evaluma rank results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma compare results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma profiles results.csv --model model --dataset dataset --metric metric --score score --output results/
Each subcommand writes a .csv table and a .png figure to the --output directory.
If your CSV uses different column names, pass them explicitly:
evaluma report results.csv \
--model experiment --dataset task --metric measure --score value \
--output results/
Or put them in a YAML config file:
# config.yaml
model: experiment
dataset: task
metric: measure
score: value
evaluma report results.csv --config config.yaml --output results/
bench = evaluma.load_df(
"results.csv",
model="model", dataset="dataset", metric="metric", score="score",
metric_direction={"rmse": "min"},
)
evaluma report results.csv ... --metric-direction rmse:min
bench_ab = bench.select_models(["ModelA", "ModelB"])
bench_core = bench.select_datasets(["dataset1", "dataset2", "dataset3"])
evaluma expects a long-format CSV with one row per (model, dataset) combination:
model,dataset,metric,score
ModelA,dataset1,acc,0.91
ModelA,dataset2,acc,0.87
ModelB,dataset1,acc,0.84
...
Multiple seeds are supported — pass --seed seed_col and evaluma aggregates by mean before analysis.
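The seed handling described above amounts to a group-by-mean over the seed column. The sketch below shows that step with the standard library only; the rows are made-up values, the column name `seed_col` simply mirrors the flag example, and evaluma's own aggregation code may differ.

```python
import csv
import io
from collections import defaultdict
from statistics import fmean

# Hypothetical long-format input with two seeds per (model, dataset, metric).
raw = """model,dataset,metric,seed_col,score
ModelA,dataset1,acc,0,0.90
ModelA,dataset1,acc,1,0.92
ModelB,dataset1,acc,0,0.83
ModelB,dataset1,acc,1,0.85
"""

# Collect all seed scores that share the same model, dataset, and metric.
groups = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw)):
    key = (row["model"], row["dataset"], row["metric"])
    groups[key].append(float(row["score"]))

# Average over seeds so that each key keeps a single score.
aggregated = {key: fmean(values) for key, values in groups.items()}
for (model, dataset, metric), score in aggregated.items():
    print(model, dataset, metric, round(score, 4))
```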
git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"
# Run tests
pytest --cov=evaluma --cov-report=term-missing
# Lint and format
ruff check .
ruff format .
# Type checking
ty check
Bug reports and pull requests are welcome on GitHub.
Apache License 2.0. See LICENSE for the full text.
If you use evaluma in your research, please cite:
@software{lehmann2026evaluma,
author = {Lehmann, Nils},
title = {evaluma: ML Benchmark Ranking Tools},
year = {2026},
url = {https://github.com/nilsleh/evaluma},
version = {0.1.0},
}
Please also cite the underlying methods and frameworks:
@inproceedings{agarwal2021deep,
title = {Deep Reinforcement Learning at the Edge of the Statistical Precipice},
author = {Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel
and Courville, Aaron and Bellemare, Marc G.},
booktitle = {Advances in Neural Information Processing Systems},
year = {2021},
}
@article{benavoli2017time,
title = {Time for a Change: a Tutorial for Comparing Multiple Classifiers
Through Bayesian Analysis},
author = {Benavoli, Alessio and Corani, Giorgio and Dem{\v{s}}ar, Janez
and Zaffalon, Marco},
journal = {Journal of Machine Learning Research},
volume = {18},
number = {77},
pages = {1--36},
year = {2017},
}
@article{dolan2002benchmarking,
title = {Benchmarking Optimization Software with Performance Profiles},
author = {Dolan, Elizabeth D. and Mor{\'e}, Jorge J.},
journal = {Mathematical Programming},
volume = {91},
pages = {201--213},
year = {2002},
}