# Comparing two classification models using `stambo`


[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/Oulu-IMEDS/stambo/main?labpath=notebooks%2FClassification.ipynb)

V1.1.3: © Aleksei Tiulpin, PhD, 2025

This notebook walks through an end-to-end example: we take a dataset, train two machine learning models, and run a statistical test to assess whether their performance differs. We first use a set of standard metrics (essentially those available in scikit-learn). At the end of the tutorial, we show how to generate a LaTeX report and how to implement a custom metric.

## Importing the necessary libraries

In [1]:
import stambo

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

SEED = 2024

In [2]:
stambo.__version__

'0.1.3'

## Loading the UCI breast cancer dataset and creating train-test split

In [3]:
X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=SEED, stratify=y)

# Fit the scaler on the training split only, so no test-set statistics leak into training
scaler = StandardScaler()
scaler.fit(Xtr)

Xtr = scaler.transform(Xtr)
Xte = scaler.transform(Xte)

## Training the models

We train a kNN classifier and a logistic regression model. As the AUC values printed below show, logistic regression outperforms kNN on the test set.

In [4]:
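# Model 1: kNN; keep the predicted probability of the positive class
model = KNeighborsClassifier(n_neighbors=3)
model.fit(Xtr, ytr)
preds_knn = model.predict_proba(Xte)[:, 1]

# Model 2: strongly regularized logistic regression (small C)
model = LogisticRegression(C=1e-2, random_state=42)
model.fit(Xtr, ytr)
preds_lr = model.predict_proba(Xte)[:, 1]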


auc_knn, auc_lr = roc_auc_score(yte, preds_knn), roc_auc_score(yte, preds_lr)
print(f"kNN AUC: {auc_knn:.4f} / LR AUC: {auc_lr:.4f}")

kNN AUC: 0.9722 / LR AUC: 0.9918


## Statistical testing

As stated in the documentation, the testing routine returns a `dict` with one entry per metric. The keys are the metric tags, and each value holds seven numbers in the following order:

* p-value ($H_0: model_1 = model_2$)
* Empirical value (model 1)
* CI low (model 1)
* CI high (model 1)
* Empirical value (model 2)
* CI low (model 2)
* CI high (model 2)
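
Because the entries are positional, it can help to give the seven fields names before using them. The following is a minimal sketch, not part of `stambo`: `ComparisonResult` and `unpack` are hypothetical helpers, and the numbers are made up purely for illustration.

```python
from typing import NamedTuple, Sequence


class ComparisonResult(NamedTuple):
    """Named view of one entry of the result dict (hypothetical helper)."""
    p_value: float
    m1_value: float
    m1_ci_low: float
    m1_ci_high: float
    m2_value: float
    m2_ci_low: float
    m2_ci_high: float


def unpack(entry: Sequence[float]) -> ComparisonResult:
    """Convert a 7-element entry into a ComparisonResult."""
    if len(entry) != 7:
        raise ValueError(f"expected 7 values, got {len(entry)}")
    return ComparisonResult(*(float(v) for v in entry))


# Illustrative numbers only, not the output of a real run.
example_entry = [0.03, 0.90, 0.85, 0.95, 0.94, 0.90, 0.98]
res = unpack(example_entry)
print(f"p = {res.p_value:.2f}; "
      f"model 1: {res.m1_value:.2f} [{res.m1_ci_low:.2f}-{res.m1_ci_high:.2f}]")
```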

If you launch the code in Binder, decrease the number of bootstrap iterations (`10000` by default).

In [5]:
testing_result = stambo.compare_models(yte, preds_knn, preds_lr, metrics=("ROCAUC", "AP", "QKappa", "BACC", "MCC"), seed=SEED)

Bootstrapping: 100%|██████████| 10000/10000 [00:30<00:00, 324.45it/s]
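
For intuition about what each bootstrap iteration costs, here is a generic percentile-bootstrap sketch in plain Python. It is not `stambo`'s implementation: the function names, the `n_bootstrap` argument, and the synthetic data are assumptions made for this illustration. Each iteration resamples the test set with replacement and re-evaluates the metric, so the runtime grows linearly with the number of iterations.

```python
import random
from typing import Callable, List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float],
    n_bootstrap: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, CI low, CI high) via the percentile bootstrap."""
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    for _ in range(n_bootstrap):
        # Resample test-set indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    # Read the CI bounds off the sorted bootstrap statistics.
    low = stats[int((alpha / 2) * n_bootstrap)]
    high = stats[min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))]
    return metric(y_true, y_pred), low, high


# Synthetic labels and predictions: 200 samples, 20 of them misclassified.
y_true = [i % 2 for i in range(200)]
y_pred = [1 - t if i < 20 else t for i, t in enumerate(y_true)]
point, low, high = bootstrap_ci(y_true, y_pred, accuracy)
print(f"accuracy = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```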


To inspect the testing results directly, we can print the dict, which follows the format described above:

In [6]:
testing_result

{'ROCAUC': array([9.99900010e-05, 9.72172447e-01, 9.48864207e-01, 9.91257029e-01,
        9.91778223e-01, 9.79622319e-01, 9.99108880e-01]),
 'AP': array([9.99900010e-05, 9.69989968e-01, 9.43102214e-01, 9.90846088e-01,
        9.94036066e-01, 9.84350198e-01, 9.99497541e-01]),
 'QKappa': array([0.6940306 , 0.89362837, 0.83599462, 0.94459118, 0.88445634,
        0.82383847, 0.9383927 ]),
 'BACC': array([0.82711729, 0.941657  , 0.91037184, 0.96995029, 0.93116897,
        0.89705871, 0.96276596]),
 'MCC': array([0.56184382, 0.89455841, 0.8380851 , 0.94489108, 0.88892445,
        0.83293842, 0.93994144])}

Most often, though, we want to present the results in a report, paper, or presentation. For that, we can use the `to_latex` function, which returns a cut-and-paste `tabular`. To use it in a LaTeX document, remember to load the `booktabs` package.

In [7]:
print(stambo.to_latex(testing_result, m1_name="kNN", m2_name="LR"))

% \usepackage{booktabs} <-- do not for get to have this imported. 
\begin{tabular}{llllll} \\ 
\toprule 
\textbf{Model} & \textbf{ROCAUC} & \textbf{AP} & \textbf{QKappa} & \textbf{BACC} & \textbf{MCC} \\ 
\midrule 
kNN & $0.97$ [$0.95$-$0.99$] & $0.97$ [$0.94$-$0.99$] & $0.89$ [$0.84$-$0.94$] & $0.94$ [$0.91$-$0.97$] & $0.89$ [$0.84$-$0.94$] \\ 
LR & $0.99$ [$0.98$-$1.00$] & $0.99$ [$0.98$-$1.00$] & $0.88$ [$0.82$-$0.94$] & $0.93$ [$0.90$-$0.96$] & $0.89$ [$0.83$-$0.94$] \\ 
\midrule
$p$-value & $0.00$ & $0.00$ & $0.69$ & $0.83$ & $0.56$ \\ 
\bottomrule
\end{tabular}
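
To check that a generated table compiles, it can be dropped into a minimal document that loads `booktabs`. The sketch below is one possible way to do this in Python; `wrap_in_document` is a hypothetical helper (not part of `stambo`), and the short `tabular` string is a stand-in for the string returned by `stambo.to_latex`.

```python
def wrap_in_document(tabular: str) -> str:
    """Embed a tabular snippet in a minimal LaTeX document that loads booktabs."""
    return "\n".join([
        r"\documentclass{article}",
        r"\usepackage{booktabs}  % provides \toprule, \midrule, \bottomrule",
        r"\begin{document}",
        tabular.strip(),
        r"\end{document}",
        "",
    ])


# A short stand-in for the string returned by `stambo.to_latex`.
tabular = "\n".join([
    r"\begin{tabular}{ll}",
    r"\toprule",
    r"\textbf{Model} & \textbf{ROCAUC} \\",
    r"\midrule",
    r"kNN & $0.97$ [$0.95$-$0.99$] \\",
    r"\bottomrule",
    r"\end{tabular}",
])

document = wrap_in_document(tabular)
print(document)
```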


## Custom metrics

Sometimes the default metrics are not enough, and one may want to add others. Let us define an F2 score, i.e., the F-beta score with $\beta = 2$.
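
For reference, the F-beta score combines precision $P$ and recall $R$ as $F_\beta = (1 + \beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$, so $\beta = 2$ weights recall more heavily than precision. The following standalone sketch computes it from confusion-matrix counts; `fbeta_from_counts` is a hypothetical helper written for this illustration, and the counts are made up.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    if denom == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Precision 0.75, recall 0.60: F2 sits closer to recall than F1 does.
f1 = fbeta_from_counts(tp=6, fp=2, fn=4, beta=1.0)
f2 = fbeta_from_counts(tp=6, fp=2, fn=4, beta=2.0)
print(f"F1 = {f1:.3f}, F2 = {f2:.3f}")
```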

In [8]:
from sklearn.metrics import fbeta_score
from functools import partial
from stambo.metrics import Metric

In [9]:
class F2Score(Metric):
    def __init__(self) -> None:
        Metric.__init__(self, partial(fbeta_score, beta=2), int_input=True)

    def __str__(self) -> str:
        return "F2Score"

In [10]:
testing_result = stambo.compare_models(yte, preds_knn, preds_lr,
                                       metrics=("ROCAUC", "AP", F2Score()), seed=SEED)

Bootstrapping: 100%|██████████| 10000/10000 [00:23<00:00, 421.76it/s]


In [11]:
print(stambo.to_latex(testing_result, m1_name="kNN", m2_name="LR"))

% \usepackage{booktabs} <-- do not for get to have this imported. 
\begin{tabular}{llll} \\ 
\toprule 
\textbf{Model} & \textbf{ROCAUC} & \textbf{AP} & \textbf{F2Score} \\ 
\midrule 
kNN & $0.97$ [$0.95$-$0.99$] & $0.97$ [$0.94$-$0.99$] & $0.97$ [$0.95$-$0.99$] \\ 
LR & $0.99$ [$0.98$-$1.00$] & $0.99$ [$0.98$-$1.00$] & $0.98$ [$0.97$-$0.99$] \\ 
\midrule
$p$-value & $0.00$ & $0.00$ & $0.18$ \\ 
\bottomrule
\end{tabular}


In [12]:
testing_result

{'ROCAUC': array([9.99900010e-05, 9.72172447e-01, 9.48864207e-01, 9.91257029e-01,
        9.91778223e-01, 9.79622319e-01, 9.99108880e-01]),
 'AP': array([9.99900010e-05, 9.69989968e-01, 9.43102214e-01, 9.90846088e-01,
        9.94036066e-01, 9.84350198e-01, 9.99497541e-01]),
 'F2Score': array([0.17658234, 0.97114317, 0.95036943, 0.98758465, 0.98017621,
        0.96656217, 0.99041534])}

