Python package implementing multiplicity-adjusted bootstrap tilting (MABT) lower confidence bounds for prediction performance after model selection (currently: accuracy for binary and categorical classification).
This repository is the Python port (R → Python) of code developed for the dissertation:
- Pascal Rink (2025). Confidence Limits for Prediction Performance. Doctoral thesis, University of Bremen. https://doi.org/10.26092/elib/3822
Related paper:
- Pascal Rink & Werner Brannath (2025). Post-selection confidence bounds for prediction performance. Machine Learning. https://link.springer.com/article/10.1007/s10994-024-06632-w
In many applied ML workflows, multiple candidate models are trained (e.g., many LASSO variants). A subset of “promising” candidates may be identified using cross-validation on the training data, and then a final model is selected by maximizing performance on an evaluation dataset.
If you compute a standard confidence interval/bound after selecting the best-performing candidate, the reported performance is typically too optimistic (selection bias / winner’s curse). The selection event (best-of-k) is part of the randomness and must be accounted for in inference.
MABT addresses this by producing a lower (1 − α) confidence bound for the conditional prediction performance of the selected model, while adjusting for selection over multiple candidates.
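The selection bias described above is easy to see in a small simulation (an illustration, not part of the package): if several candidate models all have the same true accuracy, the observed accuracy of the best-performing candidate on the evaluation set is systematically higher than the truth.

```python
import numpy as np

# Winner's-curse illustration: k candidates all have TRUE accuracy 0.7.
# We "select" the best one on an evaluation set of n samples and record
# its observed accuracy, repeated over many simulated evaluation sets.
rng = np.random.default_rng(0)
n, k, true_acc, reps = 200, 10, 0.7, 2000

best_obs = np.empty(reps)
for r in range(reps):
    # correct[i, j] = 1 if model j classifies sample i correctly (i.i.d. here)
    correct = rng.random((n, k)) < true_acc
    best_obs[r] = correct.mean(axis=0).max()  # accuracy of the selected model

print(f"true accuracy:          {true_acc:.3f}")
print(f"mean selected accuracy: {best_obs.mean():.3f}")  # noticeably above 0.7
```

A naive confidence bound centered at the selected model's observed accuracy inherits exactly this upward bias; MABT corrects for it.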
```python
mabt.mabt_ci(true_labels, pred_labels, alpha=0.05, B=10000, seed=None)
```

Returns:
- `bound`: MABT lower confidence bound for the selected model’s accuracy (level `1 - alpha`)
- `tau`: calibrated tilting parameter
- `t0`: naive point estimate for the selected model (optimistic “best observed” accuracy)
From GitHub:
```
pip install git+https://github.com/pascalrink/prediction-performance-ci.git
```

For development:

```
git clone https://github.com/pascalrink/prediction-performance-ci.git
cd prediction-performance-ci
pip install -e .
```

```python
import numpy as np
from mabt import mabt_ci

true_labels = np.array([0, 1, 0, 1])
pred_labels = np.array([
    [0, 1],  # sample 1: predictions from models 1 and 2
    [1, 1],
    [0, 0],
    [1, 0],
])

bound, tau, t0 = mabt_ci(true_labels, pred_labels, alpha=0.05, B=10_000, seed=123)
print("MABT lower bound:", bound)
print("Tilting parameter:", tau)
print("Point estimate (optimistic):", t0)
```

The repository currently includes one end-to-end example:
```
python examples/run_from_csv.py
```

`examples/run_from_csv.py` does the following:
- loads `examples/data/labels.csv` (true labels) and `examples/data/predictions.csv` (predicted labels for multiple models),
- calls `mabt_ci(...)`,
- prints `bound`, `tau`, and `t0`.
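A plausible shape for such CSV files can be sketched as follows; this is a hypothetical illustration (the actual files under `examples/data/` may use a different layout): one true label per row in the labels file, and one row per sample with one column per candidate model in the predictions file.

```python
import numpy as np
import os
import tempfile

# Hypothetical illustration of a CSV layout compatible with mabt_ci's inputs;
# the repo's actual examples/data files may differ in format.
tmp = tempfile.mkdtemp()
labels_path = os.path.join(tmp, "labels.csv")
preds_path = os.path.join(tmp, "predictions.csv")

np.savetxt(labels_path, np.array([0, 1, 0, 1]), fmt="%d")
np.savetxt(preds_path, np.array([[0, 1], [1, 1], [0, 0], [1, 0]]),
           fmt="%d", delimiter=",")

true_labels = np.loadtxt(labels_path, dtype=int)
pred_labels = np.loadtxt(preds_path, dtype=int, delimiter=",")
print(true_labels.shape, pred_labels.shape)  # (4,) (4, 2)

# an example script would then call:
# bound, tau, t0 = mabt_ci(true_labels, pred_labels, alpha=0.05, B=10_000, seed=123)
```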
`true_labels`:
- shape `(n,)`
- binary labels encoded as `{0, 1}`

`pred_labels`:
- shape `(n,)` for a single model or `(n, k)` for `k` candidate models
- hard class predictions in `{0, 1}` (not probabilities)
- each column corresponds to one candidate model
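These requirements can be checked up front with a small helper (an illustrative sketch, not part of the `mabt` API; the package may perform its own validation):

```python
import numpy as np

def check_inputs(true_labels, pred_labels):
    """Validate the shapes/encodings described above; return predictions as (n, k).

    Illustrative helper only, not part of the mabt package.
    """
    y = np.asarray(true_labels)
    p = np.asarray(pred_labels)
    if y.ndim != 1:
        raise ValueError("true_labels must have shape (n,)")
    if p.ndim == 1:
        p = p.reshape(-1, 1)  # single model -> one column
    if p.shape[0] != y.shape[0]:
        raise ValueError("true_labels and pred_labels must share the same n")
    if not (np.isin(y, (0, 1)).all() and np.isin(p, (0, 1)).all()):
        raise ValueError("labels must be hard {0, 1} predictions, not probabilities")
    return y, p

y, p = check_inputs([0, 1, 0, 1], [0, 1, 0, 1])
print(p.shape)  # (4, 1)
```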
- Candidate models are trained on training data (outside this repo).
- “Promising” candidates are identified using (e.g.) cross-validation on training data.
- On a separate evaluation dataset, you collect:
  - true labels `y`
  - predicted labels for each candidate model
- On this evaluation dataset you:
- select the best candidate by accuracy, and
- compute a post-selection lower confidence bound for the selected model’s conditional accuracy.
`t0` is the empirical accuracy of the “best” model (maximum over candidates); by construction it is optimistic because it ignores selection. `bound` is a conservative lower guarantee that remains valid even when the model was selected among multiple candidates. `tau` is the exponential-tilting parameter; in practice it mainly serves as a diagnostic (how strongly reweighting was needed).
The Python implementation follows the MABT approach in the dissertation, in particular:
- Part II: “Multiplicity-Adjusted Bootstrap Tilting” (Chapter 5)
- Appendix C.1: “MABT algorithm” (Algorithm C.1: Accuracy)
https://doi.org/10.26092/elib/3822
- Correctness matrix: convert labels into indicators `ỹ_ij = 1{ŷ_ij == y_i}` for each sample `i` and model `j`. The empirical accuracy per model is `θ̂_j = mean_i(ỹ_ij)`.
- Selection: choose `s = argmax_j θ̂_j` (best observed). The naive estimate is `t0 = θ̂_s`.
- Multivariate bootstrap: resample evaluation observations jointly across all models to approximate the joint distribution of the model-wise statistics.
- Simultaneity / maxT-style adjustment: transform bootstrap statistics using their empirical CDFs (“uniformization”) and aggregate via a maximum construction. This leverages dependence between models without assuming independence.
- Bootstrap tilting (exponential reweighting): use weights of the form `w_i(τ) ∝ exp(τ · U_i)`, where `U_i` is derived from the selected model’s data (for accuracy it is directly tied to the selected model’s correctness indicator).
- Calibrate `τ` and compute the bound: find `τ < 0` such that the multiplicity-adjusted, bootstrap-based p-value equals `α`. The bound is then the weighted accuracy of the selected model: `bound = Σ_i w_i(τ) · ỹ_is`.
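The correctness matrix, the selection step, and the tilted-accuracy formula can be sketched in NumPy (an illustration of the formulas above, not the package’s internal code; the bootstrap calibration of `τ` is omitted):

```python
import numpy as np

def correctness_matrix(y, preds):
    # ỹ_ij = 1{ŷ_ij == y_i}: 1 where model j classifies sample i correctly
    return (preds == y[:, None]).astype(float)

def tilted_accuracy(correct_s, tau):
    # w_i(τ) ∝ exp(τ · ỹ_is), normalized; weighted accuracy Σ_i w_i(τ) ỹ_is
    w = np.exp(tau * correct_s)
    w /= w.sum()
    return float(w @ correct_s)

y = np.array([0, 1, 0, 1, 1])
preds = np.array([
    [0, 1],
    [1, 1],
    [0, 0],
    [1, 0],
    [0, 1],
])

corr = correctness_matrix(y, preds)
theta_hat = corr.mean(axis=0)   # θ̂_j, per-model empirical accuracy
s = int(theta_hat.argmax())     # selection: best observed model
t0 = theta_hat[s]               # naive (optimistic) point estimate

# τ = 0 recovers the empirical accuracy of the selected model; τ < 0
# downweights correct observations, pulling the weighted accuracy below t0.
# MABT's calibration step searches for the τ < 0 at which the
# multiplicity-adjusted p-value equals α.
print(t0, tilted_accuracy(corr[:, s], 0.0), tilted_accuracy(corr[:, s], -1.0))
```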
The construction is designed such that the resulting lower bound achieves nominal coverage asymptotically (see the theory in Chapter 5 of the dissertation).
- currently implemented: accuracy for binary classification with hard labels in `{0, 1}`
- other performance measures (e.g., AUC) are discussed in the dissertation (with applications)
```
pytest
```

If you use this package in academic work, cite the dissertation:
```bibtex
@phdthesis{rink2025confidence,
  title  = {Confidence Limits for Prediction Performance},
  author = {Rink, Pascal},
  school = {University of Bremen},
  year   = {2025},
  doi    = {10.26092/elib/3822}
}
```

MIT (see LICENSE).