Statistically rigorous ML model evaluation and benchmarking.
Every number you report from an ML experiment is a random variable. This library forces you to treat it that way.
Niccolò Forzano --- github.com/nickforce989 · nic.forz@gmail.com
Starting from the mlci root directory:
pip install .For PyTorch integration:
pip install . torchFor Bayesian comparison with full PyMC backend:
pip install ".[bayesian]"- Python 3.9+
- numpy >= 1.23
- scipy >= 1.9
- scikit-learn >= 1.1
- matplotlib >= 3.5
- seaborn >= 0.12
- pandas >= 1.5
- joblib >= 1.2
- tqdm >= 4.64
All dependencies are installed automatically with pip install .. To install manually:
pip install numpy scipy scikit-learn matplotlib seaborn pandas joblib tqdmTo run the examples, no additional dependencies are needed — they use datasets that ship with scikit-learn. The PyTorch example additionally requires pip install torch.
Standard ML practice:
RandomForest accuracy: 0.9631
GradientBoosting accuracy: 0.9631
Difference: 0.0000 — they're equal
But these numbers came from a single run with a single random seed. Run it again and you get different numbers. The variance of accuracy across seeds on a typical benchmark is 0.2–0.5%. A 0.3% difference between two models is often indistinguishable from noise — but almost no paper will tell you that.
mlci makes it trivial to do this correctly.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from mlci import Experiment
from mlci.stats.bootstrap import summary
from mlci.stats.tests import compare
X, y = load_breast_cancer(return_X_y=True)
rf = Experiment(
model_factory=lambda seed: RandomForestClassifier(random_state=seed),
X=X, y=y, metric="accuracy", n_seeds=20, n_splits=5,
model_name="RandomForest",
)
gb = Experiment(
model_factory=lambda seed: GradientBoostingClassifier(random_state=seed),
X=X, y=y, metric="accuracy", n_seeds=20, n_splits=5,
model_name="GradientBoosting",
)
rf_results = rf.run()
gb_results = gb.run()Summary with uncertainty:
────────────────────────────────────────────────────
Model : RandomForest
Metric : accuracy
Seeds : 20 Splits: 5
────────────────────────────────────────────────────
Mean : 0.9593
95% CI : [0.9577, 0.9608] (±0.0015)
Seed σ : 0.0037 (variance from random init)
Split σ : 0.0061 (variance from data split)
────────────────────────────────────────────────────
Statistical comparison:
compare(rf_results, gb_results, method="corrected_ttest")────────────────────────────────────────────────────────────
Comparison : RandomForest vs GradientBoosting
Method : Corrected Resampled t-test (Nadeau & Bengio, 2003)
Effect : +0.0004 [-0.0023, +0.0029]
P(A > B) : 0.430
p-value : 0.6305 (α=0.05)
Conclusion : No statistically significant difference (p=0.6305 >= α=0.05).
Cannot conclude either model is better.
────────────────────────────────────────────────────────────
The standard single-run reported both models at identical accuracy. mlci correctly identifies the difference is noise.
Seven ready-to-run example scripts are included in the examples/ folder. All use datasets that ship with scikit-learn — no downloads needed.
The simplest and most focused example. Demonstrates the single most important claim of the library: that the standard practice of reporting a single accuracy number is misleading, and that two models which look identical under standard evaluation can be properly assessed with mlci.
Runs three models first the standard way (one number), then with mlci (20 seeds × 5 folds, confidence intervals, corrected t-test). Ends with a side-by-side summary showing exactly what information was lost.
python examples/worked_example.py
# Runtime: ~1 minuteRun this first — it's the clearest demonstration of why the library exists.
Covers every major feature across nine parts: four classical models, bootstrap summaries, variance decomposition, all three comparison methods, bring-your-own scores, custom metrics, learning curves, five plots, and JSON save/load.
python examples/full_example.py
# Runtime: ~3 minutesPlots produced (saved to examples/plots/):
score_distributions.png— KDE of raw scores per modelmodel_comparison.png— forest plot with CI barsvariance_decomposition.png— pie chart + score heatmaplearning_curve.png— RF vs LogReg with uncertainty bandsbootstrap_distribution.png— bootstrap distribution of the mean
Implements a fully-connected MLP from scratch using NumPy only — no PyTorch, no TensorFlow. He initialisation, ReLU activations, mini-batch SGD with momentum, L2 regularisation, early stopping. Wrapped in a sklearn-compatible interface so mlci works on it identically to any other model.
python examples/neural_network_example.py
# Runtime: ~2 minutesPlots produced (saved to examples/nn_plots/):
nn_training_curves.png— loss curves (train vs val) for each architecturenn_score_distributions.png— KDE of raw scores for all modelsnn_vs_classical_comparison.png— forest plot: MLP vs classical baselinesnn_architecture_comparison.png— bar chart comparing network sizes with CInn_learning_curve.png— MLP vs RandomForest learning curves with uncertainty bandsnn_variance_decomposition.png— variance sources for the MLPnn_bootstrap_distribution.png— bootstrap distribution of MLP mean accuracynn_summary_panel.png— single-figure summary combining five panels
Key finding: on a small tabular dataset (569 samples), LogisticRegression outperforms the MLP (0.979 vs 0.973, p=0.0001). Neural networks need large datasets to outperform well-tuned linear models. mlci makes this conclusion rigorous rather than anecdotal.
Demonstrates mlci.integrations.sklearn across four parts: defining models with PipelineFactory instead of lambdas, wrapping external scores with wrap_cross_val_scores, running 8 classifiers at once with ModelGrid, and running the same grid across three datasets to produce a cross-dataset comparison heatmap.
python examples/sklearn_example.py
# Runtime: ~2 minutesPlots produced (saved to examples/sklearn_plots/):
sklearn_model_comparison.png— 8 classifiers ranked by accuracy with CI barssklearn_scaler_comparison.png— does scaler choice matter? (yes, dramatically)sklearn_score_distributions.png— KDE for the top 3 modelssklearn_learning_curve.png— LogReg vs RF vs SVM with uncertainty bandssklearn_cross_dataset_heatmap.png— mean accuracy grid: 3 models × 3 datasetssklearn_variance_decomposition.png— variance sources for the best model
Key finding: MinMaxScaler drops LogisticRegression accuracy from 0.977 to 0.937 — a large, statistically significant difference that would be invisible from a single run.
Plugs real PyTorch neural networks into mlci via TorchTrainer — a sklearn-compatible wrapper that handles seeding (torch + numpy + random + CUDA), early stopping, and disk caching (completed runs are saved and skipped on re-run). Compares three architectures (Small, Deep, Wide MLP) against classical baselines with full statistical rigour.
pip install torch # required
python examples/torch_example.py
# Runtime: ~5 minutes first run; near-instant on re-run due to cachingPlots produced (saved to examples/torch_plots/):
torch_model_comparison.png— PyTorch MLPs vs classical baselinestorch_score_distribution.png— score distribution for the best architecturetorch_learning_curve.png— SmallMLP vs RandomForest with uncertainty bandstorch_training_curves.png— loss curves for all three architectures
Key feature — caching: each (seed, split) training run is hashed by architecture + config + data and saved to disk. Kill the process halfway through 50 runs, rerun the script, and it continues from where it left off. Essential when training takes minutes per run.
Answers a question that comes up before every experiment: "How many seeds do I actually need?" Uses a short pilot run to estimate the score variance of your model, then works backwards from a target effect size and power level to give you a concrete seed budget — or, given a fixed budget, tells you the smallest effect you can reliably detect.
python examples/mde_example.py
# Runtime: ~30 secondsWhat it demonstrates:
- Pilot-based σ estimation (
estimate_sigma) — runs 5 seeds to measure how much accuracy varies due to random initialisation alone, without any assumption about the distribution. - Seeds-to-detect query (
mde_n_seeds) — given a target effect (e.g. 0.5% accuracy difference) and a target power (e.g. 80%), returns the minimum number of seeds required. - MDE-from-budget query (
mde_effect) — given a fixed seed budget (e.g. 20 seeds), returns the minimum effect size you can detect at the specified power level. - Power reference table (
quick_summary) — prints a grid of power values across a range of effect sizes and seed counts so you can reason about the tradeoff at a glance. - Power curve and heatmap (
plot_power_curve,plot_power_heatmap) — visualises how power grows with more seeds, for multiple effect sizes simultaneously. - Verification run — actually runs 20 seeds, compares two models with the corrected t-test, and checks whether the observed effect falls above or below the computed MDE. Confirms the power estimate is consistent with real experimental outcomes.
All power calculations are built directly on the Nadeau-Bengio corrected t-test — the same test used by compare() — so the estimates are internally consistent with the rest of the library.
Plots produced (saved to examples/mde_plots/):
power_curves.png— power vs number of seeds for multiple effect sizes, with 80% target linepower_heatmap.png— effect size × seed count grid coloured by power
Answers a different question: "When my model says 80% confidence, does that actually mean 80%?" sklearn's calibration_curve gives you a single reliability diagram from one run. This example uses mlci to run the same calibration analysis across many seeds and folds, producing uncertainty bands on the reliability diagram itself — so you can see whether calibration is genuinely stable or just lucky on a particular random split.
python examples/calibration_example.py
# Runtime: ~30 secondsWhat it demonstrates:
CalibrationExperiment— runs calibration measurement acrossn_seeds × n_splitsand aggregates ECE, MCE, and per-bin reliability statistics with proper uncertainty quantification.- ECE and MCE with confidence intervals — Expected Calibration Error and Maximum Calibration Error reported as distributions across seeds, not single numbers. The
.summary()method prints the mean, 95% CI, and seed-level standard deviation for both. - Reliability diagrams with uncertainty bands (
plot_reliability_diagram) — the standard "confidence vs accuracy" plot, but with a 95% CI band around the mean curve and optional faint overlay of every individual seed run. Shows at a glance how much the calibration curve moves between seeds. - ECE distribution plots (
plot_ece_distribution) — KDE of the ECE values across seeds, one per model. A narrow distribution means calibration is consistent; a wide one means you got lucky (or unlucky) on a specific split. - Multi-model comparison (
plot_calibration_comparison) — plots all models' reliability diagrams side-by-side with CI bands for direct comparison. - Reliability overlay (
plot_reliability_overlay) — overlays all models' mean reliability curves on a single axis, making it easy to see which model is best-calibrated across the confidence range.
Three models are compared on the breast cancer dataset: RandomForest (tends to be overconfident at high probabilities), GradientBoosting (better calibrated but still biased), and LogisticRegression (typically the best-calibrated of the three on tabular data).
Plots produced (saved to examples/calibration_plots/):
reliability_randomforest.png— reliability diagram with 95% CI band for RandomForestreliability_gradientboosting.png— same for GradientBoostingreliability_logisticregression.png— same for LogisticRegressionece_dist_randomforest.png— ECE distribution across seeds for RandomForestece_dist_gradientboosting.png— same for GradientBoostingece_dist_logisticregression.png— same for LogisticRegressioncalibration_comparison.png— all three reliability diagrams side-by-side with CI bandsreliability_overlay.png— all three mean curves on a single axis
from mlci.stats.bootstrap import bootstrap_ci, summary
ci = bootstrap_ci(results, confidence=0.95)
print(ci)
# 0.9593 [95% CI: 0.9577, 0.9608] (width=0.0031)Resamples at the seed level — the correct choice, since seeds are independent but folds within a seed share training data and are not.
from mlci.stats.tests import compare
# Method 1 (default): Corrected Resampled t-test (Nadeau & Bengio, 2003)
compare(a, b, method="corrected_ttest")
# Method 2: Wilcoxon signed-rank test (non-parametric, no distributional assumptions)
compare(a, b, method="wilcoxon")
# Method 3: Bayesian — reports P(A > B) rather than a binary p-value
compare(a, b, method="bayesian")
# Run all three at once
compare(a, b, method="all")The standard paired t-test applied to k-fold CV scores is anti-conservative: it systematically overstates significance because the folds share training data, making the observations non-independent. Nadeau & Bengio (2003) derived a corrected variance estimator that accounts for this. It is well-known in the statistical ML literature but almost never used in practice because no mainstream library implements it cleanly.
Reference: Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52(3), 239-281.
Where does your score variance come from — random initialisation, or the data split?
from mlci.stats.anova import decompose_variance
print(decompose_variance(results))────────────────────────────────────────────────────────
Variance Decomposition: RandomForest (accuracy)
────────────────────────────────────────────────────────
Total variance : 0.000435
Seed variance : 0.000010 (2.3%) <- random init
Split variance : 0.000016 (3.6%) <- which data in test set
Interaction : 0.000409 (94.1%) <- seed×split interaction
────────────────────────────────────────────────────────
Uses a two-way balanced ANOVA decomposition. High split variance means your evaluation is sensitive to which portion of your data ends up in the test set — a signal that you may need more folds or a larger dataset.
The existing method="bayesian" comparison in compare() estimates P(A > B)
analytically from the bootstrap distribution of the mean difference. It is
fast and requires no extra dependencies, but it does not model the structure
of the (n_seeds × n_splits) score matrix. Folds within a seed share training
data and are not independent, which means a flat normal approximation slightly
overstates certainty.
The full Bayesian hierarchical comparison uses PyMC to build a proper two-level random-effects model:
diff[s, k] ~ Normal(μ_s, σ_within) # observed per-(seed, split) diff
μ_s ~ Normal(μ_global, σ_between) # per-seed random effect
μ_global ~ Normal(0, 0.10) # global mean difference (prior)
σ_within ~ HalfNormal(0.05) # within-seed (fold) variance
σ_between ~ HalfNormal(0.05) # between-seed variance
where diff[s, k] = score_A[s, k] − score_B[s, k] and the random effect μ_s
absorbs the correlation among folds that share the same random seed.
Inference on μ_global is therefore correctly calibrated.
pip install ".[bayesian]"
# installs: pymc>=5.0, arviz>=0.16from mlci.stats.bayesian_hierarchical import bayesian_hierarchical_compare
result = bayesian_hierarchical_compare(
rf_results, gb_results,
rope=0.005, # differences < 0.5 pp are practically irrelevant
hdi_prob=0.94, # 94% HDI following Kruschke/McElreath conventions
draws=2000,
tune=1000,
chains=4,
plot=True, # saves posterior + trace plots
plot_path="examples/plots/",
)
print(result)Or via the unified compare() dispatcher:
from mlci.stats.tests import compare
compare(rf_results, gb_results, method="bayesian_hierarchical", rope=0.005)────────────────────────────────────────────────────────────────────
Bayesian Hierarchical Comparison (PyMC)
────────────────────────────────────────────────────────────────────
A : RandomForest
B : GradientBoosting
Metric : accuracy
────────────────────────────────────────────────────────────────────
μ(A−B) : +0.0004 [94% HDI: -0.0022, +0.0031]
P(A > B) : 0.613
ROPE : [-0.0050, +0.0050]
In ROPE : 87.4% of posterior
────────────────────────────────────────────────────────────────────
Decision : Inconclusive — HDI overlaps the ROPE.
Collect more data or relax the ROPE threshold.
────────────────────────────────────────────────────────────────────
| Field | Meaning |
|---|---|
μ(A−B) |
Posterior mean of the expected performance gap, in metric units |
94% HDI |
Highest Density Interval: the narrowest interval containing 94% of posterior probability |
P(A > B) |
Probability that model A has higher true performance than model B |
ROPE |
Region Of Practical Equivalence — differences smaller than this are treated as irrelevant |
In ROPE |
Fraction of the posterior inside the ROPE |
| Condition | Decision |
|---|---|
| 94% HDI lies entirely inside the ROPE | Models are practically equivalent |
| 94% HDI lies entirely above the ROPE | Model A is practically better |
| 94% HDI lies entirely below the ROPE | Model B is practically better |
| HDI overlaps the ROPE | Inconclusive — collect more data |
The ROPE should be set in the same units as your metric. Reasonable defaults:
| Metric | Suggested ROPE |
|---|---|
| accuracy / F1 / ROC-AUC | rope=0.005 (±0.5 pp) |
| RMSE / MAE | dataset-dependent; use ~1% of the target range |
| R² | rope=0.01 |
You can also pass a custom interval directly:
bayesian_hierarchical_compare(a, b, rope=(-0.01, 0.01))When plot=True, two figures are saved / displayed:
bayesian_posterior.png— posterior distribution ofμ_globalwith the HDI interval, a vertical line at 0, and the ROPE shaded in red.bayesian_trace.png— MCMC trace and rank plots forμ_global,σ_within, andσ_between; useful for diagnosing sampler pathologies.
A flat model (e.g. a t-test on the per-split differences) treats all n_seeds × n_splits
observations as independent. They are not — the n_splits fold scores within a
single seed share the same model initialisation and the same train/test split
boundaries. The random effect μ_s captures this within-seed dependency
explicitly, so the posterior width for μ_global reflects the effective
number of independent observations (closer to n_seeds than to
n_seeds × n_splits).
This is especially important when n_splits is large relative to n_seeds:
without the random effect, the model would incorrectly pool all fold scores as
independent evidence, artificially narrowing the posterior and overstating
certainty about the performance gap.
from mlci.sensitivity.learning_curve import learning_curve
from mlci.viz.plots import plot_learning_curve
lc = learning_curve(
model_factory=lambda seed: RandomForestClassifier(random_state=seed),
X=X, y=y,
train_fractions=[0.1, 0.2, 0.4, 0.6, 0.8, 1.0],
n_seeds=10, n_splits=5,
)
fig = plot_learning_curve(lc)Each point on the curve has a confidence band rather than being a single line — so you can see when two models that look different are actually statistically indistinguishable at smaller data scales.
How much does each hyperparameter actually move the needle?
from mlci.sensitivity.hyperparameter import hyperparameter_sensitivity
results = hyperparameter_sensitivity(
model_factory=lambda seed, n_estimators, max_depth:
RandomForestClassifier(n_estimators=n_estimators,
max_depth=max_depth,
random_state=seed),
param_grid={
"n_estimators": [10, 50, 100, 200],
"max_depth": [None, 5, 10],
},
X=X, y=y, n_seeds=10, n_splits=5,
)
results.summary_table()
# Most sensitive parameter: n_estimators σ = 0.0089
# Second most sensitive: max_depth σ = 0.0031Each combination in the grid is evaluated as a full mlci experiment (n_seeds × n_splits), giving you confidence intervals on every point and a sensitivity ranking showing which parameters to prioritise when tuning.
Which samples in your dataset are genuinely hard to classify?
from mlci.sensitivity.dataset import dataset_sensitivity
results = dataset_sensitivity(
model_factory=lambda seed: RandomForestClassifier(random_state=seed),
X=X, y=y, n_seeds=10, n_splits=5,
)
print(results)
# Easy (rate=0.0): 412 samples — always correct
# Medium (0<r≤0.5): 134 samples — sometimes wrong
# Hard (rate>0.5): 23 samples — usually wrong
hard_X = X[results.hard_indices] # the genuinely ambiguous samplesTracks per-sample misclassification rates across all (seed, split) pairs. Samples that are frequently wrong regardless of seed are genuinely ambiguous — useful for dataset curation and identifying potential labelling errors before publishing a benchmark.
Build model + scaler pipelines from a config dict, with automatic seed injection:
from mlci.integrations.sklearn import PipelineFactory
factory = PipelineFactory(
LogisticRegression,
params={"max_iter": 2000, "C": 1.0},
scaler="standard", # or "robust", "minmax", "none"
)
exp = Experiment(model_factory=factory, X=X, y=y, ...)Already have scores? Bring them in without rerunning:
from mlci.integrations.sklearn import wrap_cross_val_scores
import numpy as np
scores = np.vstack([
cross_val_score(RandomForestClassifier(random_state=s), X, y, cv=5)
for s in range(10)
])
results = wrap_cross_val_scores(scores, metric="accuracy", model_name="RF")
summary(results) # full CI, variance decomposition, comparisonsWorks with any source: sklearn, PyTorch training loops, a collaborator's array of numbers — anything that produces scores.
Run many models in one call:
from mlci.integrations.sklearn import ModelGrid
grid = ModelGrid(
models={
"LogisticRegression": PipelineFactory(LogisticRegression, scaler="standard"),
"RandomForest": PipelineFactory(RandomForestClassifier, scaler="none"),
"SVM": PipelineFactory(SVC, scaler="standard"),
},
X=X, y=y, metric="accuracy", n_seeds=15, n_splits=5,
)
all_results = grid.run()
# {model_name: ExperimentResults} — pass directly to plot_comparison(), compare(), etc.Wrap any nn.Module in a sklearn-compatible interface with caching:
import torch.nn as nn
from mlci.integrations.torch import TorchTrainer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
class MyNet(nn.Module):
def __init__(self):
super().__init__()
self.net = nn.Sequential(
nn.Linear(30, 64), nn.ReLU(),
nn.Linear(64, 32), nn.ReLU(),
nn.Linear(32, 1),
)
def forward(self, x):
return self.net(x)
exp = Experiment(
model_factory=lambda seed: Pipeline([
("scaler", StandardScaler()),
("net", TorchTrainer(
model_fn=MyNet,
lr=1e-3,
n_epochs=100,
patience=15,
random_state=seed,
cache_dir="./mlci_cache", # resume interrupted runs
)),
]),
X=X, y=y, metric="accuracy", n_seeds=10, n_splits=5,
model_name="MyNet",
)
results = exp.run()
summary(results)TorchTrainer handles seeding (torch + numpy + random + CUDA), early stopping with validation-based checkpoint restore, and disk caching keyed by architecture + config + data fingerprint. Different architectures always get different cache entries.
Run a model grid across multiple datasets with Benjamini-Hochberg multiple testing correction:
from mlci.benchmarks.runner import run_benchmark, print_benchmark_report
from mlci.benchmarks.datasets import load_all_classification
results = run_benchmark(
models={
"RF": lambda seed: RandomForestClassifier(random_state=seed),
"LR": PipelineFactory(LogisticRegression, scaler="standard"),
"SVM": PipelineFactory(SVC, scaler="standard"),
},
datasets=load_all_classification(),
n_seeds=10, n_splits=5,
)
print_benchmark_report(results)Without multiple testing correction, comparing 3 models on 4 datasets means 12 pairwise tests — at α=0.05 you would expect spurious significant results by chance. Benjamini-Hochberg controls the false discovery rate across all comparisons and is built in automatically.
Before running an experiment, use the MDE calculator to determine how many seeds you need — or what the smallest detectable effect is given a fixed seed budget.
from mlci.stats.power import estimate_sigma, mde_n_seeds, mde_effect, quick_summary
# Step 1: estimate score variance from a small pilot run
pilot = Experiment(
model_factory=lambda seed: RandomForestClassifier(random_state=seed),
X=X, y=y, metric="accuracy", n_seeds=5, n_splits=5,
).run(verbose=False)
sigma = estimate_sigma(pilot)
# Step 2a: how many seeds to detect a 0.5% difference at 80% power?
result = mde_n_seeds(effect_size=0.005, sigma=sigma, n_splits=5, alpha=0.05, target_power=0.80)
print(result)
# → Need 28 seeds to detect a 0.5% effect at 80% power (α=0.05)
# Step 2b: with a budget of 20 seeds, what's the smallest detectable effect?
mde = mde_effect(n_seeds=20, sigma=sigma, n_splits=5, alpha=0.05, target_power=0.80)
print(mde)
# → With 20 seeds: MDE = 0.572% (80% power, α=0.05)Print a power reference table across a range of effects and seed counts:
quick_summary(
sigma=sigma, n_splits=5, alpha=0.05,
effects=[0.001, 0.002, 0.005, 0.010, 0.020],
n_seeds_options=[5, 10, 20, 30, 50],
)Generate power curves and a heatmap:
from mlci.stats.power import power_analysis, plot_power_curve, plot_power_heatmap
import numpy as np
curve = power_analysis(
effects=np.linspace(0.001, 0.015, 8),
n_seeds_range=list(range(5, 60, 5)),
sigma=sigma, n_splits=5, alpha=0.05,
)
fig = plot_power_curve(curve, target_power=0.80, highlight_effects=[0.005, 0.010])
fig = plot_power_heatmap(curve, target_power=0.80)All calculations use the Nadeau-Bengio corrected t-test variance formula — the same test used by compare() — so power estimates are directly consistent with your actual experimental outcomes.
Measure whether your model's predicted probabilities are reliable, and quantify how stable calibration is across seeds — something sklearn's calibration_curve cannot tell you.
from mlci.calibration import CalibrationExperiment
cal = CalibrationExperiment(
model_factory=lambda seed: RandomForestClassifier(n_estimators=100, random_state=seed),
X=X, y=y,
n_seeds=15, n_splits=5,
n_bins=10, strategy="uniform",
model_name="RandomForest",
)
result = cal.run()
print(result.summary())────────────────────────────────────────────────────────────
Calibration: RandomForest (accuracy)
Seeds: 15 Splits: 5 Bins: 10
────────────────────────────────────────────────────────────
ECE mean : 0.0431 95% CI: [0.0389, 0.0472] σ=0.0042
MCE mean : 0.1203 95% CI: [0.1089, 0.1318] σ=0.0113
────────────────────────────────────────────────────────────
Plot a reliability diagram with uncertainty bands across seeds:
from mlci.calibration import (
plot_reliability_diagram,
plot_ece_distribution,
plot_calibration_comparison,
plot_reliability_overlay,
)
# Single model: reliability diagram with 95% CI band + individual seed runs
fig = plot_reliability_diagram(result, confidence=0.95, show_individual_runs=True)
# ECE distribution across seeds
fig = plot_ece_distribution(result)
# Compare multiple models side-by-side
fig = plot_calibration_comparison([rf_cal, gb_cal, lr_cal], confidence=0.95)
# Overlay mean curves for all models on a single axis
fig = plot_reliability_overlay([rf_cal, gb_cal, lr_cal])The key difference from sklearn: calibration_curve gives a single curve from one run. mlci gives you a 95% CI band on that curve across seeds, so you know whether your model is genuinely well-calibrated or whether the result depends heavily on which random split happened to be used.
from mlci.viz.plots import (
plot_score_distribution, # KDE + CI on the mean
plot_comparison, # Forest plot for multiple models
plot_variance_decomposition, # Pie chart + score heatmap
plot_bootstrap_distribution, # Bootstrap distribution of the mean
plot_learning_curve, # Learning curve with uncertainty bands
)Built-in: "accuracy", "f1", "f1_binary", "roc_auc", "rmse", "mae", "r2", "mse"
Or pass any callable (y_true, y_pred) -> float:
Experiment(
model_factory=...,
metric=lambda yt, yp: my_custom_metric(yt, yp),
)results.save("rf_results.json")
loaded = ExperimentResults.load("rf_results.json")Results are stored as plain JSON — no pickle, no version lock-in. Share results with collaborators and they can run the full statistical analysis without retraining anything.
Seeds are the unit of resampling. When we bootstrap, we resample at the seed level, not the individual (seed, split) level. Seeds are independent; folds within a seed share training data and are not independent.
Report distributions, not point estimates. The default output of every function is either a distribution or a statistic with a confidence interval.
Correct by default. The default comparison method is the Nadeau-Bengio corrected t-test, not the standard paired t-test, because the standard test is wrong for this use case.
No hidden state. ExperimentResults is a plain data container. You can inspect, serialise, and reconstruct it without going through any library-specific machinery.
Framework-agnostic. The library works with any model that implements fit(X, y) and predict(X). sklearn, PyTorch, XGBoost, your own custom model — all treated identically.
If you use mlci in published work, please cite it:
@software{forzano2026mlci,
author = {Forzano, Niccolo},
title = {mlci: Statistically Rigorous ML Model Evaluation and Benchmarking},
year = {2026},
url = {https://github.com/nickforce989/mlci},
note = {Python package}
}Or in plain text:
Forzano, N. (2026). mlci: Statistically Rigorous ML Model Evaluation and Benchmarking. https://github.com/nickforce989/mlci
Please also cite the underlying statistical method the library is built on:
@article{nadeau2003inference,
author = {Nadeau, Claude and Bengio, Yoshua},
title = {Inference for the Generalization Error},
journal = {Machine Learning},
year = {2003},
volume = {52},
number = {3},
pages = {239--281}
}- Bootstrap confidence intervals
- Corrected resampled t-test (Nadeau & Bengio, 2003)
- Wilcoxon and Bayesian comparison methods
- Variance decomposition (two-way ANOVA)
- Learning curves with uncertainty bands
- Hyperparameter sensitivity analysis
- Dataset sensitivity (per-sample misclassification rates)
- sklearn integrations (PipelineFactory, ModelGrid, wrap_cross_val_scores)
- PyTorch integration with caching and early stopping
- Multi-dataset benchmarking with Benjamini-Hochberg correction
- PyMC-based full Bayesian hierarchical comparison
- Minimum Detectable Effect (MDE) calculator with power curves and heatmaps
- Calibration analysis with ECE/MCE confidence intervals and reliability diagrams
- Full PyTorch/HuggingFace training loop integration with Slurm support
- HTML report generation
- Integration with MLflow / Weights & Biases
- Support for multi-class and multi-label metrics
GPL