evaluma

A small Python package (requires Python 3.11+) for comparing machine learning models across benchmark suites. Given a CSV of per-model, per-dataset scores, evaluma computes three complementary views of the results:

  • IQM ranking — interquartile mean with bootstrapped confidence intervals, following Agarwal et al. (2021)
  • Bayesian pairwise comparison — posterior probabilities that model A beats model B (or is practically equivalent), via baycomp
  • Dolan-Moré performance profiles — cumulative distribution of performance ratios and area-under-profile scores
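
To make the first view concrete, here is a minimal sketch of an interquartile mean with a percentile bootstrap interval. It is written from scratch with NumPy and is not evaluma's implementation; the helper names `iqm` and `bootstrap_ci`, the resampling scheme, and the example scores are all assumptions made for illustration.

```python
import numpy as np


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted values."""
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * x.size)  # number of values trimmed from each tail
    return float(x[cut : x.size - cut].mean())


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the IQM (plain resampling with replacement)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(scores, dtype=float).ravel()
    stats = [iqm(rng.choice(x, size=x.size, replace=True)) for _ in range(n_boot)]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Hypothetical scores of one model on eight datasets, with one outlier on each side.
scores = [0.10, 0.80, 0.82, 0.84, 0.86, 0.88, 0.90, 0.99]
point = iqm(scores)  # trims the two lowest and two highest values
low, high = bootstrap_ci(scores)
print(f"IQM = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Trimming both tails makes the IQM less sensitive to outliers than the plain mean. Agarwal et al. (2021) use a stratified bootstrap over runs; the plain resampling above is a simplification.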

Documentation

Full documentation, including tutorials and API reference, is available at evaluma.readthedocs.io.

Installation

pip install evaluma

For a development install from source:

git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"

Quick start

Python API

import evaluma

bench = evaluma.load_df(
    "results.csv",
    model="model",
    dataset="dataset",
    metric="metric",
    score="score",
)

# IQM ranking with 95% bootstrap CI
iqm = bench.iqm_ranking()
print(iqm.table)
fig = iqm.plot()
fig.savefig("iqm.png")

# Bayesian pairwise probabilities
bayes = bench.bayesian_comparison()
print(bayes.table)

# Dolan-Moré performance profiles
profiles = bench.performance_profiles()
fig = profiles.plot()
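
For intuition about the pairwise probabilities, the following sketch implements a simplified Bayesian sign test with NumPy. It is an illustration only: evaluma delegates this analysis to baycomp, which may use a different test and prior. The function name `bayesian_sign_test`, the symmetric Dirichlet prior, the `rope` width of 0.01, and the scores are assumptions.

```python
import numpy as np


def bayesian_sign_test(a, b, rope=0.01, n_samples=20000, seed=0):
    """Return P(A better), P(practically equivalent), P(B better).

    a, b: paired scores of two models on the same datasets (higher is better).
    rope: differences within +/- rope count as practically equivalent.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    counts = np.array(
        [
            np.sum(diff > rope),           # datasets where A wins
            np.sum(np.abs(diff) <= rope),  # datasets inside the rope
            np.sum(diff < -rope),          # datasets where B wins
        ],
        dtype=float,
    )
    # Dirichlet posterior over the three outcomes, weak symmetric prior (assumption).
    theta = rng.dirichlet(counts + 1.0 / 3.0, size=n_samples)
    # Share of posterior samples in which each outcome is the most probable one.
    winners = theta.argmax(axis=1)
    return np.bincount(winners, minlength=3) / n_samples


# Hypothetical accuracies of two models on six datasets.
a = [0.91, 0.87, 0.795, 0.95, 0.78, 0.88]
b = [0.84, 0.80, 0.790, 0.85, 0.70, 0.81]
probs = bayesian_sign_test(a, b)
print(dict(zip(["A better", "equivalent", "B better"], probs.round(3))))
```

Unlike a p-value, the three numbers sum to one and can be read directly as the posterior support for each outcome.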

CLI

# Run all three analyses and write six output files
evaluma report results.csv \
    --model model --dataset dataset --metric metric --score score \
    --output results/

# Individual subcommands
evaluma rank    results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma compare results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma profiles results.csv --model model --dataset dataset --metric metric --score score --output results/

Each subcommand writes a .csv table and a .png figure to --output.

Column mapping

If your CSV uses different column names, pass them explicitly:

evaluma report results.csv \
    --model experiment --dataset task --metric measure --score value \
    --output results/

Or put them in a YAML config file:

# config.yaml
model: experiment
dataset: task
metric: measure
score: value

Then pass the file with --config:

evaluma report results.csv --config config.yaml --output results/

Lower-is-better metrics

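To mark a metric as lower-is-better, pass metric_direction in Python:

bench = evaluma.load_df(
    "results.csv",
    model="model", dataset="dataset", metric="metric", score="score",
    metric_direction={"rmse": "min"},
)

or --metric-direction on the command line:

evaluma report results.csv ... --metric-direction rmse:min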
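
Metric direction matters most visibly for performance profiles, which Dolan and Moré defined on costs, where lower is better. The sketch below computes such profiles for a lower-is-better metric with NumPy. It is not evaluma's code: the function name `performance_profile`, the error values, and the tau grid are hypothetical, and how evaluma maps higher-is-better scores onto ratios is not covered here.

```python
import numpy as np


def performance_profile(costs, taus):
    """Return rho[s, k]: share of problems where solver s is within taus[k] of the best.

    costs: array of shape (n_problems, n_solvers), lower is better, all > 0.
    """
    costs = np.asarray(costs, dtype=float)
    best = costs.min(axis=1, keepdims=True)  # best cost on each problem
    ratios = costs / best                    # performance ratios, all >= 1
    taus = np.asarray(taus, dtype=float)
    # Compare every ratio with every tau, then average over problems.
    return (ratios[:, :, None] <= taus[None, None, :]).mean(axis=0)


# Hypothetical error values for 4 datasets (rows) and 2 models (columns).
costs = [
    [0.10, 0.20],
    [0.30, 0.30],
    [0.50, 0.25],
    [0.40, 0.44],
]
taus = np.array([1.0, 1.5, 3.0])
rho = performance_profile(costs, taus)

# Area under each profile via the trapezoidal rule on the tau grid.
area = ((rho[:, 1:] + rho[:, :-1]) / 2 * np.diff(taus)).sum(axis=1)
print(rho)
print(area)
```

The value at tau = 1 is the share of datasets on which a model is best or tied for best; a larger area under the profile indicates a model that stays close to the best more often.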

Filtering models or datasets

bench_ab = bench.select_models(["ModelA", "ModelB"])
bench_core = bench.select_datasets(["dataset1", "dataset2", "dataset3"])

Input format

evaluma expects a long-format CSV with one row per (model, dataset, metric) combination:

model,dataset,metric,score
ModelA,dataset1,acc,0.91
ModelA,dataset2,acc,0.87
ModelB,dataset1,acc,0.84
...

Multiple seeds are supported: pass --seed seed_col, and evaluma averages the scores across seeds before analysis.
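
As a rough picture of what averaging over seeds means, the pandas sketch below collapses repeated runs into one row per (model, dataset, metric). It only illustrates the idea and is not evaluma's internal code; the column name `seed` and the scores are hypothetical.

```python
import io

import pandas as pd

# Hypothetical long-format results with two seeds per model.
csv_text = """model,dataset,metric,seed,score
ModelA,dataset1,acc,0,0.90
ModelA,dataset1,acc,1,0.92
ModelB,dataset1,acc,0,0.83
ModelB,dataset1,acc,1,0.85
"""

df = pd.read_csv(io.StringIO(csv_text))

# Average the score over seeds, keeping one row per (model, dataset, metric).
agg = df.groupby(["model", "dataset", "metric"], as_index=False)["score"].mean()
print(agg)
```

After this step each model has a single score per dataset and metric, which is the shape shown in the example CSV above.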

Contributing

git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"

# Run tests
pytest --cov=evaluma --cov-report=term-missing

# Lint and format
ruff check .
ruff format .

# Type checking
ty check

Bug reports and pull requests are welcome on GitHub.

License

Apache License 2.0. See LICENSE for the full text.

Citation

If you use evaluma in your research, please cite:

@software{lehmann2026evaluma,
  author  = {Lehmann, Nils},
  title   = {evaluma: ML Benchmark Ranking Tools},
  year    = {2026},
  url     = {https://github.com/nilsleh/evaluma},
  version = {0.1.0},
}

Please also cite the works behind the underlying methods and frameworks:

@inproceedings{agarwal2021deep,
  title     = {Deep Reinforcement Learning at the Edge of the Statistical Precipice},
  author    = {Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel
               and Courville, Aaron and Bellemare, Marc G.},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2021},
}

@article{benavoli2017time,
  title   = {Time for a Change: a Tutorial for Comparing Multiple Classifiers
             Through Bayesian Analysis},
  author  = {Benavoli, Alessio and Corani, Giorgio and Dem{\v{s}}ar, Janez
             and Zaffalon, Marco},
  journal = {Journal of Machine Learning Research},
  volume  = {18},
  number  = {77},
  pages   = {1--36},
  year    = {2017},
}

@article{dolan2002benchmarking,
  title   = {Benchmarking Optimization Software with Performance Profiles},
  author  = {Dolan, Elizabeth D. and Mor{\'e}, Jorge J.},
  journal = {Mathematical Programming},
  volume  = {91},
  pages   = {201--213},
  year    = {2002},
}
