Corrected test statistics for comparing machine learning models on correlated samples
You can install the stable version of correctipy from GitHub using:

pip install git+https://github.com/hendersontrent/correctipy
Often in machine learning, we want to compare the performance of different models to determine if one statistically outperforms another. However, the methods used to obtain performance metrics (e.g., data resampling, $k$-fold cross-validation) violate the independence assumptions of traditional statistical tests such as the t-test. correctipy is a lightweight package that implements a small number of corrected test statistics for cases when samples of two machine learning model metrics (e.g., classification accuracy) are not independent (and therefore are correlated), such as in the case of resampling and cross-validation. If you are interested in the version for R, please see correctR.

Currently, correctipy offers corrections for the following cases:
- Random subsampling
- $k$-fold cross-validation
- Repeated $k$-fold cross-validation
These corrections were all originally proposed by Nadeau and Bengio (2003) with additional representations in Bouckaert and Frank (2004).
In random subsampling, the standard t-test can be corrected (implemented as resampled_ttest in correctipy) in the form of:

$$t = \frac{\frac{1}{n}\sum_{j=1}^{n}x_{j}}{\sqrt{\left(\frac{1}{n} + \frac{n_{2}}{n_{1}}\right)\sigma^{2}}}$$

where $n$ is the number of resamples, $n_{1}$ is the number of samples in the training set, $n_{2}$ is the number of samples in the test set, $x_{j}$ is the difference in the performance metric between the two models on the $j$-th resample, and $\sigma^{2}$ is the variance of the $x_{j}$.
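As an illustration only, a from-scratch sketch of this correction (not correctipy's resampled_ttest implementation; the function name is hypothetical) divides the mean per-resample difference by a variance term inflated by the $n_{2}/n_{1}$ ratio:

```python
import numpy as np
from scipy import stats


def nb_resampled_ttest(x, y, n1, n2):
    """Sketch of the Nadeau-Bengio corrected resampled t-test.

    x, y : per-resample metric values for the two models
    n1, n2 : train and test set sizes used in each resample
    """
    d = np.asarray(x) - np.asarray(y)   # x_j: per-resample differences
    n = d.size                          # number of resamples
    sigma2 = d.var(ddof=1)              # unbiased variance of the differences
    t_stat = d.mean() / np.sqrt((1.0 / n + n2 / n1) * sigma2)
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 1)  # two-sided p-value
    return t_stat, p_value


rng = np.random.default_rng(123)
t_stat, p_value = nb_resampled_ttest(
    rng.normal(0.6, 0.1, 30), rng.normal(0.4, 0.1, 30), n1=80, n2=20)
```

Note how the correction term $n_{2}/n_{1}$ grows the denominator relative to a naive paired t-test, making the test more conservative.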
There is an alternate formulation of the random subsampling correction, devised in terms of the unbiased estimator of the variance, that is used for $k$-fold cross-validation (implemented as kfold_ttest in correctipy):

$$t = \frac{\frac{1}{k}\sum_{j=1}^{k}x_{j}}{\sqrt{\left(\frac{1}{k} + \frac{n_{2}}{n_{1}}\right)\hat{\sigma}^{2}}}$$

where $x_{j}$ is the difference in the performance metric between the two models on fold $j$, $\hat{\sigma}^{2}$ is the unbiased estimate of the variance of the differences, and the train and test set sizes $n_{1} = n(k-1)/k$ and $n_{2} = n/k$ follow from the total sample size $n$ and the number of folds $k$.
Repeated $k$-fold cross-validation is corrected similarly (implemented as repkfold_ttest in correctipy):

$$t = \frac{\frac{1}{k \cdot r}\sum_{i=1}^{k}\sum_{j=1}^{r}x_{ij}}{\sqrt{\left(\frac{1}{k \cdot r} + \frac{n_{2}}{n_{1}}\right)\sigma^{2}}}$$

where $x_{ij}$ is the difference in the performance metric between the two models on fold $i$ of repeat $j$, $k$ is the number of folds, and $r$ is the number of repeats.
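As a sketch of the repeated $k$-fold case (a hypothetical helper, not the package's repkfold_ttest), one can average the per-fold, per-repeat differences and inflate the variance term by $1/(kr) + n_{2}/n_{1}$:

```python
import numpy as np
import pandas as pd
from scipy import stats


def nb_repkfold_ttest(data, n1, n2, k, r):
    """Sketch of a Nadeau-Bengio-style correction for repeated k-fold CV."""
    # Mean metric per (fold, repeat) cell for each model, then the difference
    wide = data.pivot_table(index=['k', 'r'], columns='model',
                            values='values', aggfunc='mean')
    d = wide[1] - wide[2]               # x_ij: per-(fold, repeat) differences
    sigma2 = d.var(ddof=1)
    t_stat = d.mean() / np.sqrt((1.0 / (k * r) + n2 / n1) * sigma2)
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=k * r - 1)  # two-sided
    return t_stat, p_value


# Simulated 2-fold, 2-repeat accuracy values for two models
rng = np.random.default_rng(123)
results = pd.DataFrame({
    'model': np.repeat([1, 2], 60),
    'values': np.concatenate((rng.normal(0.6, 0.1, 60),
                              rng.normal(0.4, 0.1, 60))),
    'k': [1, 1, 2, 2] * 30,
    'r': [1, 2] * 60,
})
t_stat, p_value = nb_repkfold_ttest(results, n1=80, n2=20, k=2, r=2)
```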
In the real world, we would have proper results obtained by fitting two models according to one or more of the procedures outlined above. For simplicity, here we are just going to simulate three datasets so we can get to the package functionality more easily. We are going to assume we are in a classification context and generate classification accuracy values. These values are purposefully egregious: in the random subsampling case we just fix the train set sample size (n1) to 80 and the test set sample size (n2) to 20, and assume (using the same data) for the $k$-fold cross-validation case a total sample size of n = 100 with k = 30 folds.
In the case of repeated $k$-fold cross-validation, while the DataFrame you pass in to repkfold_ttest can have more than the four columns specified here, it must contain at least these four with the exact corresponding names; the function explicitly searches for them. They are:

- "model" --- a label for each of the two models to compare
- "values" --- the numerical values of the performance metric (e.g., classification accuracy)
- "k" --- which fold the values correspond to
- "r" --- which repeat of the fold the values correspond to
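Since repkfold_ttest looks these columns up by name, it can be worth failing early if one is missing. A minimal, hypothetical pre-flight check (plain pandas, not part of correctipy):

```python
import pandas as pd

REQUIRED_COLS = {"model", "values", "k", "r"}


def check_repkfold_columns(df: pd.DataFrame) -> None:
    # Raise a descriptive error before the statistics function ever runs
    missing = REQUIRED_COLS - set(df.columns)
    if missing:
        raise ValueError(f"DataFrame is missing required columns: {sorted(missing)}")


ok = pd.DataFrame({"model": [1, 2], "values": [0.61, 0.43],
                   "k": [1, 1], "r": [1, 1]})
check_repkfold_columns(ok)  # passes silently
```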
import numpy as np
import pandas as pd

# Simulated classification accuracy values for two models
x = np.random.normal(0.6, 0.1, 30)  # Model 1
y = np.random.normal(0.4, 0.1, 30)  # Model 2

# Simulated repeated k-fold (k = 2, r = 2) accuracy values for two models
tmp = pd.DataFrame({
    'model': np.repeat([1, 2], 60),
    'values': np.concatenate((np.random.normal(0.6, 0.1, 60),
                              np.random.normal(0.4, 0.1, 60))),
    'k': [1, 1, 2, 2] * 30,
    'r': [1, 2] * 60,
})
We can compute all the corrections with one-line function calls:
from correctipy import resampled_ttest
from correctipy import kfold_ttest
from correctipy import repkfold_ttest
rss = resampled_ttest(x, y, 30, 80, 20) # Random subsampling
kcv = kfold_ttest(x, y, 100, 30) # k-fold cross-validation
rkcv = repkfold_ttest(tmp, 80, 20, 2, 2) # Repeated k-fold cross-validation
All the functions return a pandas DataFrame with two named columns: "statistic" (the corrected test statistic) and "p_value" (the associated p-value), such as in the resampled_ttest case:
statistic p_value
0 6.09829 6.083703e-07