evaluma

A small Python package for comparing machine learning models across benchmark suites. Given a CSV of per-model, per-dataset scores, evaluma can compute three complementary views of the results:

IQM ranking — interquartile mean with bootstrapped confidence intervals, following Agarwal et al. (2021)
Bayesian pairwise comparison — posterior probabilities that model A beats model B (or is practically equivalent), via baycomp
Dolan-Moré performance profiles — cumulative distribution of performance ratios and area-under-profile scores
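To make the first view concrete, here is a minimal sketch of an interquartile mean with a percentile bootstrap interval, using only NumPy. It is an illustration of the statistic, not evaluma's implementation: the function names `iqm` and `bootstrap_ci` are hypothetical, and the plain (non-stratified) resampling scheme is an assumption.

```python
import numpy as np


def iqm(scores):
    """Interquartile mean: the mean of the middle 50% of the sorted scores."""
    x = np.sort(np.asarray(scores, dtype=float))
    n = x.size
    cut = int(np.floor(0.25 * n))  # number of observations trimmed from each tail
    return float(x[cut : n - cut].mean())


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the IQM (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(scores, dtype=float)
    # Each row of idx is one resample (with replacement) of the original scores.
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    stats = np.array([iqm(x[row]) for row in idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


scores = [0.10, 0.80, 0.82, 0.85, 0.86, 0.88, 0.90, 0.99]
point = iqm(scores)  # mean of 0.82, 0.85, 0.86, 0.88
low, high = bootstrap_ci(scores)
print(f"IQM = {point:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

Because the lowest and highest quarters are discarded, the outlier 0.10 in this toy example does not affect the point estimate, which is the main reason to prefer the IQM over a plain mean for benchmark aggregation.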

Documentation

Full documentation, including tutorials and API reference, is available at evaluma.readthedocs.io.

Installation

pip install evaluma

For a development install from source:

git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"

Quick start

Python API

import evaluma

bench = evaluma.load_df(
    "results.csv",
    model="model",
    dataset="dataset",
    metric="metric",
    score="score",
)

# IQM ranking with 95% bootstrap CI
iqm = bench.iqm_ranking()
print(iqm.table)
fig = iqm.plot()
fig.savefig("iqm.png")

# Bayesian pairwise probabilities
bayes = bench.bayesian_comparison()
print(bayes.table)

# Dolan-Moré performance profiles
profiles = bench.performance_profiles()
fig = profiles.plot()
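For readers unfamiliar with performance profiles, the following sketch computes them directly from the Dolan-Moré definition with NumPy. It assumes a cost matrix where lower is better and all entries are positive; how evaluma converts higher-is-better scores into costs is not covered here. The function name `performance_profile` is hypothetical and not part of the evaluma API.

```python
import numpy as np


def performance_profile(costs, taus):
    """Fraction of problems each solver handles within a factor tau of the best.

    costs: array of shape (n_problems, n_solvers), lower is better, all > 0.
    taus:  1-D array of performance-ratio thresholds (tau >= 1).
    Returns an array of shape (len(taus), n_solvers).
    """
    costs = np.asarray(costs, dtype=float)
    taus = np.asarray(taus, dtype=float)
    # Performance ratio: each solver's cost relative to the best solver per problem.
    ratios = costs / costs.min(axis=1, keepdims=True)
    # For every tau, average the indicator (ratio <= tau) over problems.
    return (ratios[None, :, :] <= taus[:, None, None]).mean(axis=1)


# Toy data: 4 problems (rows), 2 solvers (columns).
costs = np.array([
    [1.0, 2.0],
    [4.0, 2.0],
    [3.0, 3.0],
    [2.0, 8.0],
])
taus = np.array([1.0, 2.0, 4.0])
profile = performance_profile(costs, taus)
print(profile)
```

The first row of `profile` (tau = 1) gives the fraction of problems on which each solver is the best or tied for best; as tau grows, every column rises toward 1.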

CLI

# Run all three analyses and write six output files
evaluma report results.csv \
    --model model --dataset dataset --metric metric --score score \
    --output results/

# Individual subcommands
evaluma rank    results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma compare results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma profiles results.csv --model model --dataset dataset --metric metric --score score --output results/

Each subcommand writes a .csv table and a .png figure to --output.

Column mapping

If your CSV uses different column names, pass them explicitly:

evaluma report results.csv \
    --model experiment --dataset task --metric measure --score value \
    --output results/

Or put them in a YAML config file:

# config.yaml
model: experiment
dataset: task
metric: measure
score: value

evaluma report results.csv --config config.yaml --output results/

Lower-is-better metrics

bench = evaluma.load_df(
    "results.csv",
    model="model", dataset="dataset", metric="metric", score="score",
    metric_direction={"rmse": "min"},
)

evaluma report results.csv ... --metric-direction rmse:min

Filtering models or datasets

bench_ab = bench.select_models(["ModelA", "ModelB"])
bench_core = bench.select_datasets(["dataset1", "dataset2", "dataset3"])

Input format

evaluma expects a long-format CSV with one row per (model, dataset, metric) combination:

model,dataset,metric,score
ModelA,dataset1,acc,0.91
ModelA,dataset2,acc,0.87
ModelB,dataset1,acc,0.84
...

Multiple seeds are supported: pass --seed seed_col, and evaluma averages the scores across seeds before analysis.
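As a rough picture of what averaging over seeds means for the input table, the pandas sketch below collapses a per-seed CSV to one row per (model, dataset, metric). This is an assumed equivalent written for illustration, not evaluma's internal code; the column name `seed` and the sample values are made up.

```python
import io

import pandas as pd

# Hypothetical per-seed results in the long format described above.
raw = io.StringIO(
    "model,dataset,metric,seed,score\n"
    "ModelA,dataset1,acc,0,0.90\n"
    "ModelA,dataset1,acc,1,0.92\n"
    "ModelB,dataset1,acc,0,0.83\n"
    "ModelB,dataset1,acc,1,0.85\n"
)
df = pd.read_csv(raw)

# Average the score over seeds, keeping one row per (model, dataset, metric).
agg = df.groupby(["model", "dataset", "metric"], as_index=False)["score"].mean()
print(agg)
```

Running the aggregation yourself like this can be handy if you want a different summary across seeds (for example the median) before handing the table to evaluma.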

Contributing

git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"

# Run tests
pytest --cov=evaluma --cov-report=term-missing

# Lint and format
ruff check .
ruff format .

# Type checking
ty check

Bug reports and pull requests are welcome on GitHub.

License

Apache License 2.0. See LICENSE for the full text.

Citation

If you use evaluma in your research, please cite:

@software{lehmann2026evaluma,
  author  = {Lehmann, Nils},
  title   = {evaluma: ML Benchmark Ranking Tools},
  year    = {2026},
  url     = {https://github.com/nilsleh/evaluma},
  version = {0.1.0},
}

Please also cite the works behind the underlying methods and frameworks:

@inproceedings{agarwal2021deep,
  title     = {Deep Reinforcement Learning at the Edge of the Statistical Precipice},
  author    = {Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel
               and Courville, Aaron and Bellemare, Marc G.},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2021},
}

@article{benavoli2017time,
  title   = {Time for a Change: a Tutorial for Comparing Multiple Classifiers
             Through Bayesian Analysis},
  author  = {Benavoli, Alessio and Corani, Giorgio and Dem{\v{s}}ar, Janez
             and Zaffalon, Marco},
  journal = {Journal of Machine Learning Research},
  volume  = {18},
  number  = {77},
  pages   = {1--36},
  year    = {2017},
}

@article{dolan2002benchmarking,
  title   = {Benchmarking Optimization Software with Performance Profiles},
  author  = {Dolan, Elizabeth D. and Mor{\'e}, Jorge J.},
  journal = {Mathematical Programming},
  volume  = {91},
  pages   = {201--213},
  year    = {2002},
}
