# EMBS-BHI-2025: Robust and Reproducible AI Tutorial

Tutorial instructors: Ernest Namdar and Pascal Tyrrell

The dataset used in this tutorial is adapted from Openradiomics (BraTS 2020). https://pubmed.ncbi.nlm.nih.gov/40760408/

This notebook explores superiority tests for ML-based classification using the tutorial radiomics dataset.

## Overview
We revisit the radiomics classification task and compare two ensemble tree models across a fixed, stratified 10-fold split. Fold-wise AUROC scores let us test whether one model outperforms the other under paired statistical tests.

## Setup
Import dependencies and pin the random seeds for reproducibility.

In [1]:
import os
import random

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelBinarizer
from scipy.stats import shapiro, ttest_rel, ttest_ind, mannwhitneyu, wilcoxon

SEED = 0
np.random.seed(SEED)
random.seed(SEED)
os.environ['PYTHONHASHSEED'] = str(SEED)

In [2]:
# Comment out this cell if you run the code locally
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# --- Set working directory ---
import os

project_path = '/content/drive/MyDrive/EMBS_BHI_2025_Tutorial_Robust_and_Reproducible_AI/Part1-Classification/code'

# Change the working directory
os.chdir(project_path)

# Verify
print("Current working directory:", os.getcwd())

Mounted at /content/drive
Current working directory: /content/drive/MyDrive/EMBS_BHI_2025_Tutorial_Robust_and_Reproducible_AI/Part1-Classification/code


## Load and Prepare the Dataset
The radiomics table ships with this tutorial. We drop metadata columns and retain numeric descriptors for modelling.

In [3]:
DATA_PATH = '../data/Radiomics_NoNormalization_Whole_Tumor_T1CE.csv'

df = pd.read_csv(DATA_PATH)

In [4]:
df

Unnamed: 0,Lesion_ID,Group,Group_label,Patient_ID,Normalization,Subregion,Sequence,diagnostics_Versions_PyRadiomics,diagnostics_Versions_Numpy,diagnostics_Versions_SimpleITK,...,wavelet-LLL_glszm_SmallAreaHighGrayLevelEmphasis,wavelet-LLL_glszm_SmallAreaLowGrayLevelEmphasis,wavelet-LLL_glszm_ZoneEntropy,wavelet-LLL_glszm_ZonePercentage,wavelet-LLL_glszm_ZoneVariance,wavelet-LLL_ngtdm_Busyness,wavelet-LLL_ngtdm_Coarseness,wavelet-LLL_ngtdm_Complexity,wavelet-LLL_ngtdm_Contrast,wavelet-LLL_ngtdm_Strength
0,1,LGG,0,BraTS20_Training_264,NoNormalization,Whole_Tumor,T1CE,v3.0.1,1.20.3,2.1.1,...,257.875481,0.002917,7.129410,0.068842,1.065569e+04,1.857225,0.001056,402.594060,0.057651,0.400334
1,2,LGG,0,BraTS20_Training_333,NoNormalization,Whole_Tumor,T1CE,v3.0.1,1.20.3,2.1.1,...,31400.796681,0.000034,9.194488,0.550131,5.733321e+00,0.037253,0.000198,615308.179535,0.086944,37.474497
2,3,LGG,0,BraTS20_Training_290,NoNormalization,Whole_Tumor,T1CE,v3.0.1,1.20.3,2.1.1,...,687.290391,0.001712,7.892693,0.031033,4.019240e+05,1.891263,0.000142,1546.563118,0.005046,0.767814
3,4,LGG,0,BraTS20_Training_269,NoNormalization,Whole_Tumor,T1CE,v3.0.1,1.20.3,2.1.1,...,14.038038,0.055507,5.427283,0.004712,1.996953e+06,34.988012,0.000462,11.145451,0.005164,0.024456
4,5,LGG,0,BraTS20_Training_263,NoNormalization,Whole_Tumor,T1CE,v3.0.1,1.20.3,2.1.1,...,799.617583,0.000725,7.988775,0.117867,5.486999e+03,0.454818,0.000960,2447.225716,0.047115,3.002776
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
364,365,HGG,1,BraTS20_Training_207,NoNormalization,Whole_Tumor,T1CE,v3.0.1,1.20.3,2.1.1,...,250.731263,0.003331,7.412216,0.055957,3.441931e+04,2.054098,0.000544,586.352427,0.032293,0.458392
365,366,HGG,1,BraTS20_Training_192,NoNormalization,Whole_Tumor,T1CE,v3.0.1,1.20.3,2.1.1,...,994.351189,0.001169,7.464253,0.117230,3.670582e+04,2.321920,0.000186,2915.452880,0.036619,0.935791
366,367,HGG,1,BraTS20_Training_007,NoNormalization,Whole_Tumor,T1CE,v3.0.1,1.20.3,2.1.1,...,1122.955449,0.001012,7.625789,0.249371,2.043887e+03,0.674250,0.000442,8412.504297,0.069797,2.593628
367,368,HGG,1,BraTS20_Training_235,NoNormalization,Whole_Tumor,T1CE,v3.0.1,1.20.3,2.1.1,...,2511.604246,0.000511,8.163777,0.277927,3.428665e+03,1.032603,0.000163,17448.583739,0.088018,1.784128


In [5]:
y = df['Group_label']
X = df.drop(columns=[
    'Lesion_ID', 'Group', 'Group_label', 'Patient_ID',
    'Normalization', 'Subregion', 'Sequence'
] + [col for col in df.columns if col.startswith('diag')])

lb = LabelBinarizer()
y_bin = lb.fit_transform(y).ravel()

print('Feature matrix shape:', X.shape)
print('Label vector shape:', y.shape)

Feature matrix shape: (369, 1688)
Label vector shape: (369,)


## Fixed 10-Fold Evaluation
We apply a stratified 10-fold split with shuffle and a fixed random state so each model sees identical training folds. Random Forest (RF) and Extremely Randomized Trees (ExtraTrees) are trained per fold and AUROC is recorded on the hold-out fold.

In [6]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=SEED)
rf_scores = []
et_scores = []
folds = []

for fold_idx, (train_idx, test_idx) in enumerate(skf.split(X, y_bin), start=1):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y_bin[train_idx], y_bin[test_idx]

    rf = RandomForestClassifier(random_state=SEED)
    rf.fit(X_train, y_train)
    rf_prob = rf.predict_proba(X_test)[:, 1]
    rf_auc = roc_auc_score(y_test, rf_prob)

    et = ExtraTreesClassifier(random_state=SEED)
    et.fit(X_train, y_train)
    et_prob = et.predict_proba(X_test)[:, 1]
    et_auc = roc_auc_score(y_test, et_prob)

    folds.append(fold_idx)
    rf_scores.append(rf_auc)
    et_scores.append(et_auc)

results_df = pd.DataFrame({
    'fold': folds,
    'rf_auc': rf_scores,
    'et_auc': et_scores,
    'diff_rf_minus_et': np.array(rf_scores) - np.array(et_scores),
})
results_df

| | fold | rf_auc | et_auc | diff_rf_minus_et |
|---|---|---|---|---|
| 0 | 1 | 0.954741 | 0.99569 | -0.04094828 |
| 1 | 2 | 0.931034 | 0.931034 | 0.0 |
| 2 | 3 | 0.935345 | 0.9375 | -0.002155172 |
| 3 | 4 | 0.967672 | 0.987069 | -0.01939655 |
| 4 | 5 | 0.99569 | 0.989224 | 0.006465517 |
| 5 | 6 | 0.946121 | 0.948276 | -0.002155172 |
| 6 | 7 | 0.916667 | 0.933333 | -0.01666667 |
| 7 | 8 | 0.892857 | 0.892857 | 1.110223e-16 |
| 8 | 9 | 0.911905 | 0.907143 | 0.004761905 |
| 9 | 10 | 0.972906 | 0.960591 | 0.01231527 |
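
The reproducibility of the split can be checked directly. The sketch below uses hypothetical toy labels (not the tutorial data) and a helper, get_test_folds, introduced only for illustration. It suggests that a fixed random_state yields identical folds on repeated calls and that stratification keeps the class ratio constant across folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 30 negatives and 70 positives (not the tutorial data)
y_toy = np.array([0] * 30 + [1] * 70)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not influence the split


def get_test_folds(seed):
    """Return the list of test-index arrays for a seeded stratified 10-fold split."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    return [test_idx for _, test_idx in skf.split(X_toy, y_toy)]


folds_a = get_test_folds(0)
folds_b = get_test_folds(0)

# Same seed -> same test indices in every fold
same_folds = all(np.array_equal(a, b) for a, b in zip(folds_a, folds_b))

# Stratification -> each test fold holds the same number of positives
pos_per_fold = [int(y_toy[idx].sum()) for idx in folds_a]

print('Identical folds across calls:', same_folds)
print('Positives per test fold:', pos_per_fold)
```

Because both models in the loop above are scored on the same test indices, their fold-wise AUROCs form matched pairs.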


In [7]:
summary = results_df[['rf_auc', 'et_auc']].agg(['mean', 'std']).T
summary.rename(columns={'mean': 'mean_auc', 'std': 'std_auc'}, inplace=True)
summary

| | mean_auc | std_auc |
|---|---|---|
| rf_auc | 0.942494 | 0.031271 |
| et_auc | 0.948272 | 0.03493 |


## Selecting a Statistical Test
We inspect the normality of fold-wise AUROC differences via Shapiro-Wilk.
If the differences are approximately normal we proceed with a paired t-test; otherwise we fall back to the Wilcoxon signed-rank test, the paired rank-based alternative. Mann-Whitney U assumes independent samples and would discard the fold pairing, so it is not used here.

In [8]:
def choose_test(sample_a, sample_b, alpha=0.05):
    """Compare two paired samples by checking Shapiro-Wilk normality on their differences.

    Returns a dictionary containing the Shapiro statistic, suggested test name, and corresponding test results."""
    diffs = np.array(sample_a) - np.array(sample_b)
    shapiro_stat, shapiro_p = shapiro(diffs)

    if shapiro_p > alpha:
        test_name = 'paired_ttest'
        test_stat, test_p = ttest_rel(sample_a, sample_b)
    else:
        # Wilcoxon signed-rank keeps the pairing; zero differences are dropped by default
        test_name = 'wilcoxon_signed_rank'
        test_stat, test_p = wilcoxon(diffs)

    return {
        'shapiro_stat': shapiro_stat,
        'shapiro_p': shapiro_p,
        'selected_test': test_name,
        'test_stat': test_stat,
        'test_p': test_p,
    }

selection = choose_test(rf_scores, et_scores)
selection

{'shapiro_stat': np.float64(0.8715494575811606),
 'shapiro_p': np.float64(0.10420633197558432),
 'selected_test': 'paired_ttest',
 'test_stat': np.float64(-1.162747782102781),
 'test_p': np.float64(0.27483442323067214)}
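
As a sanity check on what the paired t-test computes, the statistic can be reproduced by hand as the mean difference divided by its standard error, t = mean(d) / (sd(d) / sqrt(n)). The sketch below copies the rounded fold-wise AUROCs from the results table above, so the values should agree with the reported statistic only up to rounding. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import ttest_rel

# Fold-wise AUROCs copied (rounded) from the results table above
rf_auc = np.array([0.954741, 0.931034, 0.935345, 0.967672, 0.995690,
                   0.946121, 0.916667, 0.892857, 0.911905, 0.972906])
et_auc = np.array([0.995690, 0.931034, 0.937500, 0.987069, 0.989224,
                   0.948276, 0.933333, 0.892857, 0.907143, 0.960591])

d = rf_auc - et_auc          # paired differences, one per fold
n = d.size

# Paired t statistic: mean difference over its standard error (ddof=1 for sample SD)
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))

# Reference value from SciPy on the same inputs
t_scipy = ttest_rel(rf_auc, et_auc).statistic

print(f'manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}')
```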

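## Bootstrapping
As a distribution-free complement, we resample the fold-wise AUROC differences with replacement and summarise the bootstrap distribution of the mean difference with a percentile confidence interval. The reported p-value is one-sided: it is the fraction of bootstrap means that fall on the opposite side of zero from the observed mean difference. Because the call below does not pass random_state, the interval and p-value vary slightly between runs, and 100 resamples give only a coarse estimate.
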
In [11]:
def bootstrap_compare(sample_a, sample_b, n_bootstrap=100, alpha=0.05, random_state=None):
    """Bootstrap comparison between two paired samples using mean difference."""
    rng = np.random.default_rng(random_state)
    diffs = np.array(sample_a) - np.array(sample_b)
    boot_means = []

    for _ in range(n_bootstrap):
        resample = rng.choice(diffs, size=len(diffs), replace=True)
        boot_means.append(np.mean(resample))

    boot_means = np.array(boot_means)
    ci_lower = np.percentile(boot_means, 100 * alpha / 2)
    ci_upper = np.percentile(boot_means, 100 * (1 - alpha / 2))
    p_boot = np.mean(boot_means <= 0) if np.mean(diffs) > 0 else np.mean(boot_means >= 0)

    return {
        'boot_mean_diff': np.mean(diffs),
        'boot_ci': (ci_lower, ci_upper),
        'boot_p_value': p_boot,
        'n_bootstrap': n_bootstrap
    }

bootstrap_result = bootstrap_compare(rf_scores, et_scores, n_bootstrap=100)
bootstrap_result

{'boot_mean_diff': np.float64(-0.005777914614121538),
 'boot_ci': (np.float64(-0.015809267241379325),
  np.float64(0.0021063218390804284)),
 'boot_p_value': np.float64(0.15),
 'n_bootstrap': 100}
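
A seeded, vectorised variant of the percentile bootstrap is sketched below. It is an illustration under stated assumptions, not the tutorial's method: the function name bootstrap_mean_ci, the default of 10,000 resamples, and the fixed seed are all choices made here, and the input differences are rounded from the results table.

```python
import numpy as np


def bootstrap_mean_ci(diffs, n_bootstrap=10_000, alpha=0.05, random_state=0):
    """Percentile bootstrap CI for the mean of paired differences (illustrative helper)."""
    rng = np.random.default_rng(random_state)
    diffs = np.asarray(diffs, dtype=float)
    # Draw all resample indices at once: one row per bootstrap replicate
    idx = rng.integers(0, diffs.size, size=(n_bootstrap, diffs.size))
    boot_means = diffs[idx].mean(axis=1)
    lower, upper = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)


# Fold-wise RF minus ExtraTrees differences, rounded from the results table
toy_diffs = [-0.041, 0.000, -0.002, -0.019, 0.006,
             -0.002, -0.017, 0.000, 0.005, 0.012]

ci_first = bootstrap_mean_ci(toy_diffs, random_state=0)
ci_second = bootstrap_mean_ci(toy_diffs, random_state=0)

print('95% CI:', ci_first)
print('Reproducible with fixed seed:', ci_first == ci_second)
```

Passing an explicit seed makes the interval repeatable, and a larger number of resamples reduces the Monte Carlo noise in the percentile estimates.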

## (a) Parametric vs Non-Parametric Contrast
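We compute both paired t-test and Mann-Whitney U p-values on the fold AUROCs. Rank-based tests do not assume normality and tend to have less power (larger p-values) than their parametric counterparts when the data are close to normal. Note that Mann-Whitney U is also an unpaired test, so the gap between the two p-values here reflects the loss of pairing as well as the switch to ranks; the Wilcoxon signed-rank test is the paired rank-based counterpart of the paired t-test.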

In [9]:
paired_t = ttest_rel(rf_scores, et_scores)
mann_whitney = mannwhitneyu(rf_scores, et_scores, alternative='two-sided')

print(f'Paired t-test p-value: {paired_t.pvalue:.4f}')
print(f'Mann-Whitney U p-value: {mann_whitney.pvalue:.4f}')

notes_a = (
    'Non-parametric tests operate on ranks and do not assume normality, '
    'often producing larger p-values than parametric counterparts when the data are close to normal. '
    'Mann-Whitney U additionally ignores the fold pairing; Wilcoxon signed-rank is the paired rank-based test.'
)
notes_a

Paired t-test p-value: 0.2748
Mann-Whitney U p-value: 0.7912


'Non-parametric tests operate on ranks and do not assume normality, often producing larger p-values than parametric counterparts when the data are close to normal. Mann-Whitney U additionally ignores the fold pairing; Wilcoxon signed-rank is the paired rank-based test.'
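
To separate the effect of ranks from the effect of pairing, one option is to run two paired tests on the same differences. The sketch below uses hypothetical differences (not tutorial results) with a single large outlier. It illustrates that a rank-based test is not always the more conservative one: an outlier inflates the standard deviation used by the t-test, while ranks limit its influence.

```python
import numpy as np
from scipy.stats import ttest_1samp, wilcoxon

# Hypothetical paired differences: nine small positive values and one large negative outlier
diffs_toy = np.array([0.010, 0.012, 0.008, 0.011, 0.009,
                      0.013, 0.007, 0.014, 0.006, -0.100])

# A paired t-test is a one-sample t-test of the differences against zero
t_res = ttest_1samp(diffs_toy, popmean=0.0)

# Wilcoxon signed-rank: the paired, rank-based counterpart
w_res = wilcoxon(diffs_toy)

print(f'Paired t-test p-value: {t_res.pvalue:.4f}')
print(f'Wilcoxon signed-rank p-value: {w_res.pvalue:.4f}')
```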

## (b) Paired vs Unpaired Testing
Using the same folds enables paired comparisons. We contrast paired and unpaired variants to emphasise that ignoring the shared splits (unpaired) typically inflates p-values, because fold-to-fold variation in difficulty is no longer cancelled out. A fixed cross-validation partition is essential: without identical folds the scores are not matched pairs, and a paired test is not valid.

In [10]:
unpaired_t = ttest_ind(rf_scores, et_scores, equal_var=False)

# Wilcoxon signed-rank acts as the paired analogue to Mann-Whitney U
wilcoxon_test = wilcoxon(np.array(rf_scores) - np.array(et_scores))
mann_whitney_unpaired = mannwhitneyu(rf_scores, et_scores, alternative='two-sided')

comparison = pd.DataFrame({
    'test': ['paired_t', 'unpaired_t', 'wilcoxon (paired ranks)', 'mann_whitney_u (unpaired)'],
    'p_value': [paired_t.pvalue, unpaired_t.pvalue, wilcoxon_test.pvalue, mann_whitney_unpaired.pvalue],
}).sort_values('p_value')
comparison

| | test | p_value |
|---|---|---|
| 0 | paired_t | 0.274834 |
| 2 | wilcoxon (paired ranks) | 0.476562 |
| 1 | unpaired_t | 0.701369 |
| 3 | mann_whitney_u (unpaired) | 0.791183 |


<!-- Highlight takeaway -->
- Paired tests exploit the shared folds and usually deliver smaller p-values (more power) when the pairing is valid.
- Treating the same scores as independent (unpaired) ignores the fold-level correlation and inflates the uncertainty, which makes the test conservative.
- Always align the data splits before applying a paired test; with mismatched folds the differences mix model effects with split-to-split variation.
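
The first two takeaways can be illustrated on synthetic scores. In the sketch below, all numbers are hypothetical: fold difficulty varies widely, while one model is consistently about 0.01 AUROC better on every fold. Pairing removes the shared fold effect, whereas the unpaired test has to treat it as noise.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

# Hypothetical per-fold baseline difficulty, shared by both models
base = np.linspace(0.80, 0.98, 10)

# Small deterministic jitter so the differences are not all identical
jitter = 0.002 * np.array([1, -1, 1, -1, 1, -1, 1, -1, 1, -1])

model_b = base                    # hypothetical model B scores
model_a = base + 0.01 + jitter    # model A is about 0.01 better on every fold

paired = ttest_rel(model_a, model_b)
unpaired = ttest_ind(model_a, model_b, equal_var=False)

print(f'Paired t-test p-value:   {paired.pvalue:.2e}')
print(f'Unpaired t-test p-value: {unpaired.pvalue:.4f}')
```

The same scores lead to very different conclusions depending on whether the fold structure is used, which is why the split must be fixed and shared before a paired test is applied.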