MLci: Machine Learning confidence interval

Statistically rigorous ML model evaluation and benchmarking.

Every number you report from an ML experiment is a random variable. This library forces you to treat it that way.

Author

Niccolò Forzano --- github.com/nickforce989 · nic.forz@gmail.com

Installation

Starting from the mlci root directory:

pip install .

For PyTorch integration:

pip install . torch

For Bayesian comparison with full PyMC backend:

pip install ".[bayesian]"

Requirements

Python 3.9+
numpy >= 1.23
scipy >= 1.9
scikit-learn >= 1.1
matplotlib >= 3.5
seaborn >= 0.12
pandas >= 1.5
joblib >= 1.2
tqdm >= 4.64

All dependencies are installed automatically with pip install .. To install manually:

pip install numpy scipy scikit-learn matplotlib seaborn pandas joblib tqdm

To run the examples, no additional dependencies are needed — they use datasets that ship with scikit-learn. The PyTorch example additionally requires pip install torch.

The Problem

Standard ML practice:

RandomForest     accuracy: 0.9631
GradientBoosting accuracy: 0.9631
Difference: 0.0000 — they're equal

But these numbers came from a single run with a single random seed. Run it again and you get different numbers. The variance of accuracy across seeds on a typical benchmark is 0.2–0.5%. A 0.3% difference between two models is often indistinguishable from noise — but almost no paper will tell you that.

mlci makes it trivial to do this correctly.

Quickstart

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from mlci import Experiment
from mlci.stats.bootstrap import summary
from mlci.stats.tests import compare

X, y = load_breast_cancer(return_X_y=True)

rf = Experiment(
    model_factory=lambda seed: RandomForestClassifier(random_state=seed),
    X=X, y=y, metric="accuracy", n_seeds=20, n_splits=5,
    model_name="RandomForest",
)
gb = Experiment(
    model_factory=lambda seed: GradientBoostingClassifier(random_state=seed),
    X=X, y=y, metric="accuracy", n_seeds=20, n_splits=5,
    model_name="GradientBoosting",
)

rf_results = rf.run()
gb_results = gb.run()

Summary with uncertainty:

────────────────────────────────────────────────────
  Model   : RandomForest
  Metric  : accuracy
  Seeds   : 20   Splits: 5
────────────────────────────────────────────────────
  Mean    : 0.9593
  95% CI  : [0.9577, 0.9608]  (±0.0015)
  Seed σ  : 0.0037  (variance from random init)
  Split σ : 0.0061  (variance from data split)
────────────────────────────────────────────────────

Statistical comparison:

compare(rf_results, gb_results, method="corrected_ttest")

────────────────────────────────────────────────────────────
  Comparison : RandomForest  vs  GradientBoosting
  Method     : Corrected Resampled t-test (Nadeau & Bengio, 2003)
  Effect     : +0.0004  [-0.0023, +0.0029]
  P(A > B)   : 0.430
  p-value    : 0.6305  (α=0.05)
  Conclusion : No statistically significant difference (p=0.6305 >= α=0.05).
               Cannot conclude either model is better.
────────────────────────────────────────────────────────────

The standard single-run reported both models at identical accuracy. mlci correctly identifies the difference is noise.

Running the Examples

Seven ready-to-run example scripts are included in the examples/ folder. All use datasets that ship with scikit-learn — no downloads needed.

`worked_example.py` — start here

The simplest and most focused example. Demonstrates the single most important claim of the library: that the standard practice of reporting a single accuracy number is misleading, and that two models which look identical under standard evaluation can be properly assessed with mlci.

Runs three models first the standard way (one number), then with mlci (20 seeds × 5 folds, confidence intervals, corrected t-test). Ends with a side-by-side summary showing exactly what information was lost.

python examples/worked_example.py
# Runtime: ~1 minute

Run this first — it's the clearest demonstration of why the library exists.

`full_example.py` — full feature walkthrough

Covers every major feature across nine parts: four classical models, bootstrap summaries, variance decomposition, all three comparison methods, bring-your-own scores, custom metrics, learning curves, five plots, and JSON save/load.

python examples/full_example.py
# Runtime: ~3 minutes

Plots produced (saved to examples/plots/):

score_distributions.png — KDE of raw scores per model
model_comparison.png — forest plot with CI bars
variance_decomposition.png — pie chart + score heatmap
learning_curve.png — RF vs LogReg with uncertainty bands
bootstrap_distribution.png — bootstrap distribution of the mean

`neural_network_example.py` — neural network from scratch

Implements a fully-connected MLP from scratch using NumPy only — no PyTorch, no TensorFlow. He initialisation, ReLU activations, mini-batch SGD with momentum, L2 regularisation, early stopping. Wrapped in a sklearn-compatible interface so mlci works on it identically to any other model.

python examples/neural_network_example.py
# Runtime: ~2 minutes

Plots produced (saved to examples/nn_plots/):

nn_training_curves.png — loss curves (train vs val) for each architecture
nn_score_distributions.png — KDE of raw scores for all models
nn_vs_classical_comparison.png — forest plot: MLP vs classical baselines
nn_architecture_comparison.png — bar chart comparing network sizes with CI
nn_learning_curve.png — MLP vs RandomForest learning curves with uncertainty bands
nn_variance_decomposition.png — variance sources for the MLP
nn_bootstrap_distribution.png — bootstrap distribution of MLP mean accuracy
nn_summary_panel.png — single-figure summary combining five panels

Key finding: on a small tabular dataset (569 samples), LogisticRegression outperforms the MLP (0.979 vs 0.973, p=0.0001). Neural networks need large datasets to outperform well-tuned linear models. mlci makes this conclusion rigorous rather than anecdotal.

`sklearn_example.py` — sklearn integration showcase

Demonstrates mlci.integrations.sklearn across four parts: defining models with PipelineFactory instead of lambdas, wrapping external scores with wrap_cross_val_scores, running 8 classifiers at once with ModelGrid, and running the same grid across three datasets to produce a cross-dataset comparison heatmap.

python examples/sklearn_example.py
# Runtime: ~2 minutes

Plots produced (saved to examples/sklearn_plots/):

sklearn_model_comparison.png — 8 classifiers ranked by accuracy with CI bars
sklearn_scaler_comparison.png — does scaler choice matter? (yes, dramatically)
sklearn_score_distributions.png — KDE for the top 3 models
sklearn_learning_curve.png — LogReg vs RF vs SVM with uncertainty bands
sklearn_cross_dataset_heatmap.png — mean accuracy grid: 3 models × 3 datasets
sklearn_variance_decomposition.png — variance sources for the best model

Key finding: MinMaxScaler drops LogisticRegression accuracy from 0.977 to 0.937 — a large, statistically significant difference that would be invisible from a single run.

`torch_example.py` — PyTorch integration

Plugs real PyTorch neural networks into mlci via TorchTrainer — a sklearn-compatible wrapper that handles seeding (torch + numpy + random + CUDA), early stopping, and disk caching (completed runs are saved and skipped on re-run). Compares three architectures (Small, Deep, Wide MLP) against classical baselines with full statistical rigour.

pip install torch   # required
python examples/torch_example.py
# Runtime: ~5 minutes first run; near-instant on re-run due to caching

Plots produced (saved to examples/torch_plots/):

torch_model_comparison.png — PyTorch MLPs vs classical baselines
torch_score_distribution.png — score distribution for the best architecture
torch_learning_curve.png — SmallMLP vs RandomForest with uncertainty bands
torch_training_curves.png — loss curves for all three architectures

Key feature — caching: each (seed, split) training run is hashed by architecture + config + data and saved to disk. Kill the process halfway through 50 runs, rerun the script, and it continues from where it left off. Essential when training takes minutes per run.

`mde_example.py` — minimum detectable effect calculator

Answers a question that comes up before every experiment: "How many seeds do I actually need?" Uses a short pilot run to estimate the score variance of your model, then works backwards from a target effect size and power level to give you a concrete seed budget — or, given a fixed budget, tells you the smallest effect you can reliably detect.

python examples/mde_example.py
# Runtime: ~30 seconds

What it demonstrates:

Pilot-based σ estimation (estimate_sigma) — runs 5 seeds to measure how much accuracy varies due to random initialisation alone, without any assumption about the distribution.
Seeds-to-detect query (mde_n_seeds) — given a target effect (e.g. 0.5% accuracy difference) and a target power (e.g. 80%), returns the minimum number of seeds required.
MDE-from-budget query (mde_effect) — given a fixed seed budget (e.g. 20 seeds), returns the minimum effect size you can detect at the specified power level.
Power reference table (quick_summary) — prints a grid of power values across a range of effect sizes and seed counts so you can reason about the tradeoff at a glance.
Power curve and heatmap (plot_power_curve, plot_power_heatmap) — visualises how power grows with more seeds, for multiple effect sizes simultaneously.
Verification run — actually runs 20 seeds, compares two models with the corrected t-test, and checks whether the observed effect falls above or below the computed MDE. Confirms the power estimate is consistent with real experimental outcomes.

All power calculations are built directly on the Nadeau-Bengio corrected t-test — the same test used by compare() — so the estimates are internally consistent with the rest of the library.

Plots produced (saved to examples/mde_plots/):

power_curves.png — power vs number of seeds for multiple effect sizes, with 80% target line
power_heatmap.png — effect size × seed count grid coloured by power

`calibration_example.py` — calibration analysis across seeds

Answers a different question: "When my model says 80% confidence, does that actually mean 80%?" sklearn's calibration_curve gives you a single reliability diagram from one run. This example uses mlci to run the same calibration analysis across many seeds and folds, producing uncertainty bands on the reliability diagram itself — so you can see whether calibration is genuinely stable or just lucky on a particular random split.

python examples/calibration_example.py
# Runtime: ~30 seconds

What it demonstrates:

CalibrationExperiment — runs calibration measurement across n_seeds × n_splits and aggregates ECE, MCE, and per-bin reliability statistics with proper uncertainty quantification.
ECE and MCE with confidence intervals — Expected Calibration Error and Maximum Calibration Error reported as distributions across seeds, not single numbers. The .summary() method prints the mean, 95% CI, and seed-level standard deviation for both.
Reliability diagrams with uncertainty bands (plot_reliability_diagram) — the standard "confidence vs accuracy" plot, but with a 95% CI band around the mean curve and optional faint overlay of every individual seed run. Shows at a glance how much the calibration curve moves between seeds.
ECE distribution plots (plot_ece_distribution) — KDE of the ECE values across seeds, one per model. A narrow distribution means calibration is consistent; a wide one means you got lucky (or unlucky) on a specific split.
Multi-model comparison (plot_calibration_comparison) — plots all models' reliability diagrams side-by-side with CI bands for direct comparison.
Reliability overlay (plot_reliability_overlay) — overlays all models' mean reliability curves on a single axis, making it easy to see which model is best-calibrated across the confidence range.

Three models are compared on the breast cancer dataset: RandomForest (tends to be overconfident at high probabilities), GradientBoosting (better calibrated but still biased), and LogisticRegression (typically the best-calibrated of the three on tabular data).

Plots produced (saved to examples/calibration_plots/):

reliability_randomforest.png — reliability diagram with 95% CI band for RandomForest
reliability_gradientboosting.png — same for GradientBoosting
reliability_logisticregression.png — same for LogisticRegression
ece_dist_randomforest.png — ECE distribution across seeds for RandomForest
ece_dist_gradientboosting.png — same for GradientBoosting
ece_dist_logisticregression.png — same for LogisticRegression
calibration_comparison.png — all three reliability diagrams side-by-side with CI bands
reliability_overlay.png — all three mean curves on a single axis

Core Features

1. Bootstrap Confidence Intervals

from mlci.stats.bootstrap import bootstrap_ci, summary

ci = bootstrap_ci(results, confidence=0.95)
print(ci)
# 0.9593 [95% CI: 0.9577, 0.9608] (width=0.0031)

Resamples at the seed level — the correct choice, since seeds are independent but folds within a seed share training data and are not.

2. Three Statistical Comparison Methods

from mlci.stats.tests import compare

# Method 1 (default): Corrected Resampled t-test (Nadeau & Bengio, 2003)
compare(a, b, method="corrected_ttest")

# Method 2: Wilcoxon signed-rank test (non-parametric, no distributional assumptions)
compare(a, b, method="wilcoxon")

# Method 3: Bayesian — reports P(A > B) rather than a binary p-value
compare(a, b, method="bayesian")

# Run all three at once
compare(a, b, method="all")

Why the Nadeau-Bengio corrected t-test matters

The standard paired t-test applied to k-fold CV scores is anti-conservative: it systematically overstates significance because the folds share training data, making the observations non-independent. Nadeau & Bengio (2003) derived a corrected variance estimator that accounts for this. It is well-known in the statistical ML literature but almost never used in practice because no mainstream library implements it cleanly.

Reference: Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52(3), 239-281.

3. Variance Decomposition

Where does your score variance come from — random initialisation, or the data split?

from mlci.stats.anova import decompose_variance

print(decompose_variance(results))

────────────────────────────────────────────────────────
  Variance Decomposition: RandomForest (accuracy)
────────────────────────────────────────────────────────
  Total variance  : 0.000435
  Seed variance   : 0.000010  (2.3%)   <- random init
  Split variance  : 0.000016  (3.6%)   <- which data in test set
  Interaction     : 0.000409  (94.1%)  <- seed×split interaction
────────────────────────────────────────────────────────

Uses a two-way balanced ANOVA decomposition. High split variance means your evaluation is sensitive to which portion of your data ends up in the test set — a signal that you may need more folds or a larger dataset.

4. Full Bayesian Hierarchical Comparison (PyMC)

The existing method="bayesian" comparison in compare() estimates P(A > B) analytically from the bootstrap distribution of the mean difference. It is fast and requires no extra dependencies, but it does not model the structure of the (n_seeds × n_splits) score matrix. Folds within a seed share training data and are not independent, which means a flat normal approximation slightly overstates certainty.

The full Bayesian hierarchical comparison uses PyMC to build a proper two-level random-effects model:

diff[s, k]  ~  Normal(μ_s,        σ_within)   # observed per-(seed, split) diff
μ_s         ~  Normal(μ_global,   σ_between)   # per-seed random effect
μ_global    ~  Normal(0, 0.10)                 # global mean difference (prior)
σ_within    ~  HalfNormal(0.05)               # within-seed (fold) variance
σ_between   ~  HalfNormal(0.05)               # between-seed variance

where diff[s, k] = score_A[s, k] − score_B[s, k] and the random effect μ_s absorbs the correlation among folds that share the same random seed. Inference on μ_global is therefore correctly calibrated.

Installation

pip install ".[bayesian]"
# installs: pymc>=5.0, arviz>=0.16

Usage

from mlci.stats.bayesian_hierarchical import bayesian_hierarchical_compare

result = bayesian_hierarchical_compare(
    rf_results, gb_results,
    rope=0.005,        # differences < 0.5 pp are practically irrelevant
    hdi_prob=0.94,     # 94% HDI following Kruschke/McElreath conventions
    draws=2000,
    tune=1000,
    chains=4,
    plot=True,         # saves posterior + trace plots
    plot_path="examples/plots/",
)
print(result)

Or via the unified compare() dispatcher:

from mlci.stats.tests import compare

compare(rf_results, gb_results, method="bayesian_hierarchical", rope=0.005)

Example output

────────────────────────────────────────────────────────────────────
  Bayesian Hierarchical Comparison (PyMC)
────────────────────────────────────────────────────────────────────
  A  :  RandomForest
  B  :  GradientBoosting
  Metric    : accuracy
────────────────────────────────────────────────────────────────────
  μ(A−B)    : +0.0004  [94% HDI: -0.0022, +0.0031]
  P(A > B)  : 0.613
  ROPE      : [-0.0050, +0.0050]
  In ROPE   : 87.4% of posterior
────────────────────────────────────────────────────────────────────
  Decision  : Inconclusive — HDI overlaps the ROPE.
               Collect more data or relax the ROPE threshold.
────────────────────────────────────────────────────────────────────

Interpreting the output

Field	Meaning
`μ(A−B)`	Posterior mean of the expected performance gap, in metric units
`94% HDI`	Highest Density Interval: the narrowest interval containing 94% of posterior probability
`P(A > B)`	Probability that model A has higher true performance than model B
`ROPE`	Region Of Practical Equivalence — differences smaller than this are treated as irrelevant
`In ROPE`	Fraction of the posterior inside the ROPE

Decision rules (Kruschke 2018)

Condition	Decision
94% HDI lies entirely inside the ROPE	Models are practically equivalent
94% HDI lies entirely above the ROPE	Model A is practically better
94% HDI lies entirely below the ROPE	Model B is practically better
HDI overlaps the ROPE	Inconclusive — collect more data

ROPE guidance

The ROPE should be set in the same units as your metric. Reasonable defaults:

Metric	Suggested ROPE
accuracy / F1 / ROC-AUC	`rope=0.005` (±0.5 pp)
RMSE / MAE	dataset-dependent; use ~1% of the target range
R²	`rope=0.01`

You can also pass a custom interval directly:

bayesian_hierarchical_compare(a, b, rope=(-0.01, 0.01))

Plots

When plot=True, two figures are saved / displayed:

bayesian_posterior.png — posterior distribution of μ_global with the HDI interval, a vertical line at 0, and the ROPE shaded in red.
bayesian_trace.png — MCMC trace and rank plots for μ_global, σ_within, and σ_between; useful for diagnosing sampler pathologies.

Why hierarchical?

A flat model (e.g. a t-test on the per-split differences) treats all n_seeds × n_splits observations as independent. They are not — the n_splits fold scores within a single seed share the same model initialisation and the same train/test split boundaries. The random effect μ_s captures this within-seed dependency explicitly, so the posterior width for μ_global reflects the effective number of independent observations (closer to n_seeds than to n_seeds × n_splits).

This is especially important when n_splits is large relative to n_seeds: without the random effect, the model would incorrectly pool all fold scores as independent evidence, artificially narrowing the posterior and overstating certainty about the performance gap.

5. Sensitivity Analysis

Learning curves with uncertainty bands

from mlci.sensitivity.learning_curve import learning_curve
from mlci.viz.plots import plot_learning_curve

lc = learning_curve(
    model_factory=lambda seed: RandomForestClassifier(random_state=seed),
    X=X, y=y,
    train_fractions=[0.1, 0.2, 0.4, 0.6, 0.8, 1.0],
    n_seeds=10, n_splits=5,
)
fig = plot_learning_curve(lc)

Each point on the curve has a confidence band rather than being a single line — so you can see when two models that look different are actually statistically indistinguishable at smaller data scales.

Hyperparameter sensitivity

How much does each hyperparameter actually move the needle?

from mlci.sensitivity.hyperparameter import hyperparameter_sensitivity

results = hyperparameter_sensitivity(
    model_factory=lambda seed, n_estimators, max_depth:
        RandomForestClassifier(n_estimators=n_estimators,
                               max_depth=max_depth,
                               random_state=seed),
    param_grid={
        "n_estimators": [10, 50, 100, 200],
        "max_depth":    [None, 5, 10],
    },
    X=X, y=y, n_seeds=10, n_splits=5,
)
results.summary_table()
# Most sensitive parameter: n_estimators  σ = 0.0089
# Second most sensitive:    max_depth     σ = 0.0031

Each combination in the grid is evaluated as a full mlci experiment (n_seeds × n_splits), giving you confidence intervals on every point and a sensitivity ranking showing which parameters to prioritise when tuning.

Dataset sensitivity

Which samples in your dataset are genuinely hard to classify?

from mlci.sensitivity.dataset import dataset_sensitivity

results = dataset_sensitivity(
    model_factory=lambda seed: RandomForestClassifier(random_state=seed),
    X=X, y=y, n_seeds=10, n_splits=5,
)
print(results)
# Easy  (rate=0.0): 412 samples  — always correct
# Medium (0<r≤0.5): 134 samples  — sometimes wrong
# Hard  (rate>0.5):  23 samples  — usually wrong

hard_X = X[results.hard_indices]   # the genuinely ambiguous samples

Tracks per-sample misclassification rates across all (seed, split) pairs. Samples that are frequently wrong regardless of seed are genuinely ambiguous — useful for dataset curation and identifying potential labelling errors before publishing a benchmark.

6. sklearn Integrations

PipelineFactory

Build model + scaler pipelines from a config dict, with automatic seed injection:

from mlci.integrations.sklearn import PipelineFactory

factory = PipelineFactory(
    LogisticRegression,
    params={"max_iter": 2000, "C": 1.0},
    scaler="standard",   # or "robust", "minmax", "none"
)

exp = Experiment(model_factory=factory, X=X, y=y, ...)

wrap_cross_val_scores

Already have scores? Bring them in without rerunning:

from mlci.integrations.sklearn import wrap_cross_val_scores
import numpy as np

scores = np.vstack([
    cross_val_score(RandomForestClassifier(random_state=s), X, y, cv=5)
    for s in range(10)
])

results = wrap_cross_val_scores(scores, metric="accuracy", model_name="RF")
summary(results)   # full CI, variance decomposition, comparisons

Works with any source: sklearn, PyTorch training loops, a collaborator's array of numbers — anything that produces scores.

ModelGrid

Run many models in one call:

from mlci.integrations.sklearn import ModelGrid

grid = ModelGrid(
    models={
        "LogisticRegression": PipelineFactory(LogisticRegression, scaler="standard"),
        "RandomForest":       PipelineFactory(RandomForestClassifier, scaler="none"),
        "SVM":                PipelineFactory(SVC, scaler="standard"),
    },
    X=X, y=y, metric="accuracy", n_seeds=15, n_splits=5,
)
all_results = grid.run()
# {model_name: ExperimentResults} — pass directly to plot_comparison(), compare(), etc.

7. PyTorch Integration

Wrap any nn.Module in a sklearn-compatible interface with caching:

import torch.nn as nn
from mlci.integrations.torch import TorchTrainer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(30, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )
    def forward(self, x):
        return self.net(x)

exp = Experiment(
    model_factory=lambda seed: Pipeline([
        ("scaler", StandardScaler()),
        ("net", TorchTrainer(
            model_fn=MyNet,
            lr=1e-3,
            n_epochs=100,
            patience=15,
            random_state=seed,
            cache_dir="./mlci_cache",   # resume interrupted runs
        )),
    ]),
    X=X, y=y, metric="accuracy", n_seeds=10, n_splits=5,
    model_name="MyNet",
)
results = exp.run()
summary(results)

TorchTrainer handles seeding (torch + numpy + random + CUDA), early stopping with validation-based checkpoint restore, and disk caching keyed by architecture + config + data fingerprint. Different architectures always get different cache entries.

8. Multi-Dataset Benchmarking

Run a model grid across multiple datasets with Benjamini-Hochberg multiple testing correction:

from mlci.benchmarks.runner import run_benchmark, print_benchmark_report
from mlci.benchmarks.datasets import load_all_classification

results = run_benchmark(
    models={
        "RF":  lambda seed: RandomForestClassifier(random_state=seed),
        "LR":  PipelineFactory(LogisticRegression, scaler="standard"),
        "SVM": PipelineFactory(SVC, scaler="standard"),
    },
    datasets=load_all_classification(),
    n_seeds=10, n_splits=5,
)
print_benchmark_report(results)

Without multiple testing correction, comparing 3 models on 4 datasets means 12 pairwise tests — at α=0.05 you would expect spurious significant results by chance. Benjamini-Hochberg controls the false discovery rate across all comparisons and is built in automatically.

9. Minimum Detectable Effect (MDE) Calculator

Before running an experiment, use the MDE calculator to determine how many seeds you need — or what the smallest detectable effect is given a fixed seed budget.

from mlci.stats.power import estimate_sigma, mde_n_seeds, mde_effect, quick_summary

# Step 1: estimate score variance from a small pilot run
pilot = Experiment(
    model_factory=lambda seed: RandomForestClassifier(random_state=seed),
    X=X, y=y, metric="accuracy", n_seeds=5, n_splits=5,
).run(verbose=False)

sigma = estimate_sigma(pilot)

# Step 2a: how many seeds to detect a 0.5% difference at 80% power?
result = mde_n_seeds(effect_size=0.005, sigma=sigma, n_splits=5, alpha=0.05, target_power=0.80)
print(result)
# → Need 28 seeds to detect a 0.5% effect at 80% power (α=0.05)

# Step 2b: with a budget of 20 seeds, what's the smallest detectable effect?
mde = mde_effect(n_seeds=20, sigma=sigma, n_splits=5, alpha=0.05, target_power=0.80)
print(mde)
# → With 20 seeds: MDE = 0.572%  (80% power, α=0.05)

Print a power reference table across a range of effects and seed counts:

quick_summary(
    sigma=sigma, n_splits=5, alpha=0.05,
    effects=[0.001, 0.002, 0.005, 0.010, 0.020],
    n_seeds_options=[5, 10, 20, 30, 50],
)

Generate power curves and a heatmap:

from mlci.stats.power import power_analysis, plot_power_curve, plot_power_heatmap
import numpy as np

curve = power_analysis(
    effects=np.linspace(0.001, 0.015, 8),
    n_seeds_range=list(range(5, 60, 5)),
    sigma=sigma, n_splits=5, alpha=0.05,
)
fig = plot_power_curve(curve, target_power=0.80, highlight_effects=[0.005, 0.010])
fig = plot_power_heatmap(curve, target_power=0.80)

All calculations use the Nadeau-Bengio corrected t-test variance formula — the same test used by compare() — so power estimates are directly consistent with your actual experimental outcomes.

10. Calibration Analysis

Measure whether your model's predicted probabilities are reliable, and quantify how stable calibration is across seeds — something sklearn's calibration_curve cannot tell you.

from mlci.calibration import CalibrationExperiment

cal = CalibrationExperiment(
    model_factory=lambda seed: RandomForestClassifier(n_estimators=100, random_state=seed),
    X=X, y=y,
    n_seeds=15, n_splits=5,
    n_bins=10, strategy="uniform",
    model_name="RandomForest",
)
result = cal.run()
print(result.summary())

────────────────────────────────────────────────────────────
  Calibration: RandomForest  (accuracy)
  Seeds: 15   Splits: 5   Bins: 10
────────────────────────────────────────────────────────────
  ECE  mean : 0.0431   95% CI: [0.0389, 0.0472]   σ=0.0042
  MCE  mean : 0.1203   95% CI: [0.1089, 0.1318]   σ=0.0113
────────────────────────────────────────────────────────────

Plot a reliability diagram with uncertainty bands across seeds:

from mlci.calibration import (
    plot_reliability_diagram,
    plot_ece_distribution,
    plot_calibration_comparison,
    plot_reliability_overlay,
)

# Single model: reliability diagram with 95% CI band + individual seed runs
fig = plot_reliability_diagram(result, confidence=0.95, show_individual_runs=True)

# ECE distribution across seeds
fig = plot_ece_distribution(result)

# Compare multiple models side-by-side
fig = plot_calibration_comparison([rf_cal, gb_cal, lr_cal], confidence=0.95)

# Overlay mean curves for all models on a single axis
fig = plot_reliability_overlay([rf_cal, gb_cal, lr_cal])

The key difference from sklearn: calibration_curve gives a single curve from one run. mlci gives you a 95% CI band on that curve across seeds, so you know whether your model is genuinely well-calibrated or whether the result depends heavily on which random split happened to be used.

11. Visualisation

from mlci.viz.plots import (
    plot_score_distribution,      # KDE + CI on the mean
    plot_comparison,              # Forest plot for multiple models
    plot_variance_decomposition,  # Pie chart + score heatmap
    plot_bootstrap_distribution,  # Bootstrap distribution of the mean
    plot_learning_curve,          # Learning curve with uncertainty bands
)

Supported Metrics

Built-in: "accuracy", "f1", "f1_binary", "roc_auc", "rmse", "mae", "r2", "mse"

Or pass any callable (y_true, y_pred) -> float:

Experiment(
    model_factory=...,
    metric=lambda yt, yp: my_custom_metric(yt, yp),
)

Saving and Loading Results

results.save("rf_results.json")
loaded = ExperimentResults.load("rf_results.json")

Results are stored as plain JSON — no pickle, no version lock-in. Share results with collaborators and they can run the full statistical analysis without retraining anything.

Design Principles

Seeds are the unit of resampling. When we bootstrap, we resample at the seed level, not the individual (seed, split) level. Seeds are independent; folds within a seed share training data and are not independent.

Report distributions, not point estimates. The default output of every function is either a distribution or a statistic with a confidence interval.

Correct by default. The default comparison method is the Nadeau-Bengio corrected t-test, not the standard paired t-test, because the standard test is wrong for this use case.

No hidden state. ExperimentResults is a plain data container. You can inspect, serialise, and reconstruct it without going through any library-specific machinery.

Framework-agnostic. The library works with any model that implements fit(X, y) and predict(X). sklearn, PyTorch, XGBoost, your own custom model — all treated identically.

Citation

If you use mlci in published work, please cite it:

@software{forzano2026mlci,
  author    = {Forzano, Niccolo},
  title     = {mlci: Statistically Rigorous ML Model Evaluation and Benchmarking},
  year      = {2026},
  url       = {https://github.com/nickforce989/mlci},
  note      = {Python package}
}

Or in plain text:

Forzano, N. (2026). mlci: Statistically Rigorous ML Model Evaluation and Benchmarking. https://github.com/nickforce989/mlci

Please also cite the underlying statistical method the library is built on:

@article{nadeau2003inference,
  author  = {Nadeau, Claude and Bengio, Yoshua},
  title   = {Inference for the Generalization Error},
  journal = {Machine Learning},
  year    = {2003},
  volume  = {52},
  number  = {3},
  pages   = {239--281}
}

Roadmap

License

GPL

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
examples		examples
mlci.egg-info		mlci.egg-info
mlci		mlci
tests		tests
LICENSE.txt		LICENSE.txt
README.md		README.md
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

MLci: Machine Learning confidence interval

Author

Installation

Requirements

The Problem

Quickstart

Running the Examples

worked_example.py — start here

full_example.py — full feature walkthrough

neural_network_example.py — neural network from scratch

sklearn_example.py — sklearn integration showcase

torch_example.py — PyTorch integration

mde_example.py — minimum detectable effect calculator

calibration_example.py — calibration analysis across seeds

Core Features

1. Bootstrap Confidence Intervals

2. Three Statistical Comparison Methods

Why the Nadeau-Bengio corrected t-test matters

3. Variance Decomposition

4. Full Bayesian Hierarchical Comparison (PyMC)

Installation

Usage

Example output

Interpreting the output

Decision rules (Kruschke 2018)

ROPE guidance

Plots

Why hierarchical?

5. Sensitivity Analysis

Learning curves with uncertainty bands

Hyperparameter sensitivity

Dataset sensitivity

6. sklearn Integrations

PipelineFactory

wrap_cross_val_scores

ModelGrid

7. PyTorch Integration

8. Multi-Dataset Benchmarking

9. Minimum Detectable Effect (MDE) Calculator

10. Calibration Analysis

11. Visualisation

Supported Metrics

Saving and Loading Results

Design Principles

Citation

Roadmap

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`worked_example.py` — start here

`full_example.py` — full feature walkthrough

`neural_network_example.py` — neural network from scratch

`sklearn_example.py` — sklearn integration showcase

`torch_example.py` — PyTorch integration

`mde_example.py` — minimum detectable effect calculator

`calibration_example.py` — calibration analysis across seeds

Packages