CalArena

A large-scale benchmark for post-hoc calibration of classification models.

CalArena systematically evaluates 20 binary and 19 multiclass calibration methods across hundreds of (dataset, model) pairs spanning classical tabular models, state-of-the-art tabular foundation models, and computer vision networks. All calibrator implementations are provided by the probmetrics package.

📄 Read the paper here: CalArena: A Large-Scale Post-Hoc Calibration Benchmark

Leaderboards

Binary Calibration Winrates

Multiclass Calibration Winrates

Above, we plot results for binary and multiclass post-hoc calibration benchmarks. Each bar represents the winrate of the calibration method, averaged over all experiments in the benchmark with 95% Confidence Intervals (CIs) constructed by bootstrapping entire datasets (TabRepo and TabArena-binary benchmarks) or experiments directly (TabArena-multiclass and CV benchmarks). Methods are ranked based on the average winrate over the three benchmarks.

Repository structure

CalArena/
├── run_benchmark.py              # Run calibration experiments
├── utils.py                      # Analysis utilities and plotting helpers
├── custom_calibrators.py         # Register your own calibrators
├── paper_figures.ipynb           # Reproduce all paper figures
├── calibration_benchmarks/       # Benchmark data and generation scripts
│   ├── generate_tabrepo_benchmarks.py
│   ├── generate_tabarena_benchmarks.py
│   ├── generate_cv_benchmarks.py
│   └── {benchmark}-experiments.csv  # Benchmark experiments index
├── batch_scripts/                # SLURM job scripts for cluster execution
├── results/                      # Benchmark results (one CSV per calibrator)
│   └── {benchmark}/{calibrator}.csv
└── figures/                      # Paper figures (PDF)

Note: The HDF5 files storing benchmark data (calibration_benchmarks/*.h5) are listed in .gitignore due to their size (~1.7 GB in total). Download them from HuggingFace.

Benchmarks overview

CalArena includes 7 benchmarks across three data modalities:

Benchmark	Problem type	Base models	Datasets	#Experiments
`tabrepo-binary`	Binary	8 classical tabular models	104 tabular datasets	832
`tabarena-binary`	Binary	11 advanced tabular models	30 tabular datasets	314
`cv-binary`	Binary	9 deep CV models	CIFAR-10 (Animal vs Machine), Breast, Pneumonia	13
`tabrepo-multiclass`	Multiclass	8 classical tabular models	65 tabular datasets	520
`tabarena-multiclass`	Multiclass	11 modern tabular models	8 tabular datasets	84
`cv-multiclass`	Multiclass	10 deep CV models	CIFAR-10/100, Birds, SVHN, Derma, OCT	20
`imagenet-multiclass`	Large scale multiclass	8 deep CV models	ImageNet	8

Each benchmark is stored as

a single HDF5 file (calibration_benchmarks/{benchmark}.h5) with the hierarchy {dataset}/{model}/ → {probas_cal, labels_cal, probas_test, labels_test}
a companion {benchmark}-experiments.csv experiments index file listing dataset, model, calibration set size, test set size, number of classes

TabRepo base models: CatBoost, ExtraTrees, LightGBM, LinearModel, NeuralNetFastAI, NeuralNetTorch, RandomForest, XGBoost.

TabArena base models (≥ 1300 ELO on TabArena leaderboard, as of April 1 2026, v0.1.3.1): TabPFN-v2.6, TabICLv2, RealTabPFN-v2.5, TabICL_GPU, LimiX_GPU, TabM_GPU, RealMLP_GPU, BetaTabPFN_GPU, ModernNCA_GPU, Mitra_GPU, TabDPT_GPU.

Calibrators overview

💻 Implementations for every calibrator listed can be found in the probmetrics package.

Binary (20 methods)

Name	Method	Paper
`Base-model`	No calibration (raw model probabilities)
`Hist-uniform`	Histogram binning with fixed sized bins (10 bins)	[1]
`Hist-quantile`	Histogram binning with fixed number of points per bin (10 bins)	[1]
`Scaling-Binning`	Platt scaling + histogram binning	[2]
`BBQ`	Bayesian Binning into Quantiles	[3]
`Isotonic`	Isotonic regression	[4]
`CIR`	Centered Isotonic Regression	[5]
`Venn-Abers`	Venn-Abers predictor	[6]
`TS`	Temperature Scaling	[7]
`ETS`	Ensemble Temperature Scaling	[8]
`Platt-probs`	Platt Scaling on top-class probabilities	[9]
`Platt-logits`	Platt Scaling on top-class logits	[9]
`Quadratic`	Quadratic logistic calibration	[10]
`Beta`	Beta calibration	[11]
`Spline`	Spline based calibration	[12]
`CDF-Spline`	Spline based calibration on the CDF	[13]
`Kernel`	Kernel calibration	[14]
`XGBoost`	Post-hoc calibration using a binary classifier
`LightGBM`	Post-hoc calibration using a binary classifier
`CatBoost`	Post-hoc calibration using a binary classifier

Multiclass (19 methods)

Name	Method	Paper
`Base-model`	No calibration
`Hist-uniform`	Histogram binning with fixed sized bins (OvR)	[1]
`Hist-quantile`	Histogram binning with fixed number of points per bin (OvR)	[1]
`BBQ`	Bayesian Binning into Quantiles (OvR)	[3]
`Isotonic`	Isotonic regression (OvR)	[4]
`CIR`	Centered Isotonic Regression (OvR)	[5]
`Venn-Abers`	Venn-Abers predictor (OvR)	[6]
`Spline`	Spline based calibration (OvR)	[12]
`TS`	Temperature Scaling	[7]
`ETS`	Ensemble Temperature Scaling	[8]
`VS`	Vector Scaling	[7]
`SVS`	Structured Vector Scaling	[10]
`MS`	Matrix Scaling	[7]
`SMS`	Structured Matrix Scaling	[10]
`Dirichlet`	Dirichlet calibration	[15]
`Kernel`	Kernel calibration	[14]
`XGBoost`	Post-hoc calibration using a multiclass classifier
`LightGBM`	Post-hoc calibration using a multiclass classifier
`CatBoost`	Post-hoc calibration using a multiclass classifier

[1] Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers
[2] Verified Uncertainty Calibration
[3] Obtaining Well Calibrated Probabilities Using Bayesian Binning
[4] Transforming Classifier Scores into Accurate Multiclass Probability Estimates
[5] Centered Isotonic Regression: Point and Interval Estimation for Dose–Response Studies
[6] Large-scale probabilistic predictors with and without guarantees of validity
[7] On Calibration of Modern Neural Networks
[8] Mix-n-Match : Ensemble and Compositional Methods for Uncertainty Calibration in Deep Learning
[9] Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods
[10] Structured Matrix Scaling for Multi-Class Calibration
[11] Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers
[12] Spline-Based Probability Calibration
[13] Calibration of Neural Networks using Splines
[14] Consistent and Asymptotically Unbiased Estimation of Proper Calibration Errors
[15] Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration

Installation

To run the benchmark, you need to install probmetrics with extra dependencies:

# "extra" and "dirichletcal" install other calibration package to support all
# calibrators in the library.
pip install 'probmetrics[extra,dirichletcal]'

# To load the benchmark data, you also need h5py
pip install h5py

For results analysis and plotting you need extra packages

pip install scikit-posthocs # Statistical analysis and CD diagrams
pip install arena-rank # To compute Elo scores

Collecting benchmark data

Downloading (recommended)

The benchmark files (HDF5 + experiment CSVs) are available on HuggingFace:

https://huggingface.co/datasets/probkit/CalArena

Download the files and place them under calibration_benchmarks/.

Generating the benchmarks from the original data sources (disk space hungry)

For the sake of completeness, we provide the scripts that we ran to generate the benchmarks from the original data sources. Each script writes its output directly into calibration_benchmarks/, you only need to run these once.

python calibration_benchmarks/generate_tabrepo_benchmarks.py # requires installing tabrepo
python calibration_benchmarks/generate_tabarena_benchmarks.py # requires installing tabarena
python calibration_benchmarks/generate_cv_benchmarks.py # requires downloading the data from original sources

⚠️ We strongly recommend using the files downloaded from HuggingFace as accessing model predictions from TabRepo and TabArena requires downloading hundred gigabytes of raw data locally.

TabRepo data: Model predictions (we use context D244_F3_C1530_200, ~120 GB total) are downloaded automatically on first run. Ensure a fast internet connection.

TabArena data: Model predictions (~60 GB total) are downloaded automatically on first run. Ensure a fast internet connection.

CV data: generate_cv_benchmarks.py reads raw logits from cv_data/ (gitignored). Place the Markus pickle files under cv_data/Markus/ and the Hekler safetensors files under cv_data/Hekler/ before running. The experiment-list CSVs (cv-binary-experiments.csv, etc.) are already committed and serve as the index of which datasets and models to process.

Running the benchmarks

Once the HDF5 and CSV files are downloaded and stored under calibration_benchmarks/, you can run the benchmarks using the following python scripts.

Run all calibrators

python run_benchmark.py --benchmark tabrepo-binary

Results are saved to results/{benchmark}/{calibrator}.csv, one file per calibrator.

Run a single calibrator

python run_benchmark.py --benchmark tabrepo-binary --calibrator TS

Results are saved to results/{benchmark}/{calibrator}.csv.

CLI reference

--benchmark       Name of the benchmark, e.g. tabrepo-binary, cv-multiclass   [required]
--calibrator      Name of a single calibrator to run (default: run all)
--benchmarks_dir  Directory containing benchmark HDF5/CSV files (default: calibration_benchmarks)
--results_dir     Directory to write results (default: results)

Cluster execution (SLURM)

The recommended cluster strategy is to submit one job per calibrator so that runtimes are isolated. Ready-to-use SLURM scripts for each benchmark are in batch_scripts/. Before submitting, uncomment and adjust the conda activate line for your environment:

sbatch batch_scripts/tabrepo-binary.batch

Example script (abbreviated):

#!/bin/bash
#SBATCH --array=0-19
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --time=05:00:00
#SBATCH --output=logs/%x_%A_%a.out

CALIBRATORS=(
  "Base-model"
  "Hist-uniform" "Hist-quantile" "Scaling-Binning" "BBQ"
  "Isotonic" "CIR" "Venn-Abers"
  "TS" "ETS" "Platt" "Affine" "Quadratic" "Beta"
  "Spline" "CDF-Spline" "Kernel"
  "XGBoost" "LightGBM" "CatBoost"
)

source /home/$USER/.bashrc
cd $HOME/CalArena

python run_benchmark.py \
    --benchmark tabrepo-binary \
    --calibrator "${CALIBRATORS[$SLURM_ARRAY_TASK_ID]}"

SLURM output logs are written to logs/ (gitignored).

Adding your own calibrator

Step 1: Implement the calibrator

A calibrator must expose two methods:

def fit(self, p_cal: np.ndarray, y_cal: np.ndarray) -> None: ...
def predict_proba(self, p_test: np.ndarray) -> np.ndarray: ...

For binary tasks, p_cal / p_test are 1-D arrays of positive-class probabilities (or shape (n, 2)). For multiclass tasks they are (n, k) arrays.

The cleanest approach is to subclass the probmetrics base class, which accepts either a NumPy or PyTorch implementation:

from probmetrics.calibrators import Calibrator

class MyCalibrator(Calibrator):
    def _fit_impl(self, X: np.ndarray, y: np.ndarray) -> None:
        '''Implement either this or the following'''
        ...

    def _fit_torch_impl(
        self, y_pred: CategoricalDistribution, y_true_labels: torch.Tensor
    ) -> None:
        '''Implement either this or the previous'''
        ...
    
    def _predict_proba_impl(self, X: np.ndarray) -> np.ndarray:
        '''Implement either this or the following'''
        ...

    def _predict_proba_torch_impl(
        self, y_pred: CategoricalDistribution
    ) -> CategoricalDistribution:
        '''Implement either this or the previous'''
        ...

However, any object that satisfies .fit() / .predict_proba() works, inheriting from Calibrator is optional but recommend for smooth integration into the library.

Step 2: Register it in `custom_calibrators.py`

from my_module import MyCalibrator

CUSTOM_CALIBRATORS = {
    "MyCalibrator": lambda: MyCalibrator(),
}

The dictionary key is the display name used in plot legends and result filenames.

Step 3: Run it

python run_benchmark.py --benchmark tabrepo-binary --calibrator MyCalibrator

Results are saved to results/tabrepo-binary/MyCalibrator.csv, exactly like a built-in method.

Reproducing the paper figures

All figures are generated by paper_figures.ipynb. Pre-generated PDFs are in figures/.

Results are loaded via load_benchmark_results() from utils.py:

from utils import load_benchmark_results

df = load_benchmark_results(
    benchmark_name="tabrepo-binary",
    methods=["TS", "Platt", "Isotonic", "Base-model"],
)

utils.py also provides:

compute_winrates() — mean win rates with 95% bootstrap CIs (dataset-level or experiment-level)
compute_elo_scores() — Bradley-Terry Elo ratings with bootstrap CIs
compute_absolute_improvements() — per-metric improvements over Base-model, LaTeX-formatted
plot_leaderboard(), plot_grouped_leaderboards(), plot_cd_diagram() — reproducing publication figures

Contributing a calibrator to `probmetrics`

All built-in calibrators live in the probmetrics package. To contribute a new method and include it in the official benchmark, implement it there following the Calibrator interface and open a pull request.

Citation

@article{calarena2026,
  title     = {CalArena: A Large-Scale Post-Hoc Calibration Benchmark},
  author    = {Eug{\`e}ne Berta and David Holzm{\"u}ller and Francis Bach and Michael I. Jordan},
  journal   = {arXiv preprint arXiv:2605.30188},
  year      = {2026},
  url       = {https://arxiv.org/abs/2605.30188},
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CalArena

Leaderboards

Repository structure

Benchmarks overview

Calibrators overview

Binary (20 methods)

Multiclass (19 methods)

Installation

Collecting benchmark data

Downloading (recommended)

Generating the benchmarks from the original data sources (disk space hungry)

Running the benchmarks

Run all calibrators

Run a single calibrator

CLI reference

Cluster execution (SLURM)

Adding your own calibrator

Step 1: Implement the calibrator

Step 2: Register it in `custom_calibrators.py`

Step 3: Run it

Reproducing the paper figures

Contributing a calibrator to `probmetrics`

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
batch_scripts		batch_scripts
calibration_benchmarks		calibration_benchmarks
figures		figures
results		results
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
custom_calibrators.py		custom_calibrators.py
paper_figures.ipynb		paper_figures.ipynb
run_benchmark.py		run_benchmark.py
utils.py		utils.py

Folders and files

Latest commit

History

Repository files navigation

CalArena

Leaderboards

Repository structure

Benchmarks overview

Calibrators overview

Binary (20 methods)

Multiclass (19 methods)

Installation

Collecting benchmark data

Downloading (recommended)

Generating the benchmarks from the original data sources (disk space hungry)

Running the benchmarks

Run all calibrators

Run a single calibrator

CLI reference

Cluster execution (SLURM)

Adding your own calibrator

Step 1: Implement the calibrator

Step 2: Register it in custom_calibrators.py

Step 3: Run it

Reproducing the paper figures

Contributing a calibrator to probmetrics

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Step 2: Register it in `custom_calibrators.py`

Contributing a calibrator to `probmetrics`

Packages