# Examples
The following examples use real-world data to illustrate the differences between the approaches (the `average` and `pooled` columns in the result tables).

## Breast Cancer Prediction
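In this example, we look at the Breast Cancer dataset from scikit-learn. The classification task is to predict whether a tumor is malignant or benign. Note that scikit-learn codes benign as 1, so the "positive" count printed below refers to benign cases.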

In [11]:
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Import our custom functions
from src.metrics import create_metrics
from src.cv import run_cv

In [12]:
X, y = load_breast_cancer(return_X_y=True)

# print dataset shape
print(f"Dataset shape: {X.shape}")
# Check class balance
print(f"Class distribution:\n{sum(y)} positive, {len(y) - sum(y)} negative, ratio: {sum(y) / len(y):.2f}")

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Dataset shape: (569, 30)
Class distribution:
357 positive, 212 negative, ratio: 0.63
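
The cell above fits the scaler on the full dataset before cross-validation, so the test folds contribute to the scaling statistics. A minimal sketch of one way to avoid this is to put the scaler in a scikit-learn pipeline so that it is refit on the training part of every fold. Whether `run_cv` accepts a pipeline is not shown in this notebook, so the sketch uses `cross_val_score` instead; `max_iter=1000` is an illustrative choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is part of the estimator, so it is fit on the training folds only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stand-in for run_cv (assumption): 5 stratified, shuffled folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_accuracy = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print(f"Per-fold accuracy: {fold_accuracy}")
print(f"Mean accuracy: {fold_accuracy.mean():.3f}")
```

On a dataset of this size the effect on the scores is likely small, but the pipeline keeps the evaluation free of information from the test folds.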


In [13]:
model = LogisticRegression()
metrics = create_metrics(["accuracy", "rocauc", "prcauc"])
results = run_cv(model, X_scaled, y, metrics, n_splits=5, stratified=True, random_state=1)

results_df = pd.DataFrame(results)
print(results_df)

           average    pooled
accuracy  0.978910  0.978910
rocauc    0.995428  0.995098
prcauc    0.996996  0.996543
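
The two columns can be read as two ways of summarising the same cross-validation run. The sketch below shows one plausible way to compute both for ROC AUC with plain scikit-learn: the metric is either computed per fold and then averaged, or computed once on the predictions pooled across all test folds. This is an illustration under assumptions, not the implementation of `run_cv`, and the numbers need not match the table above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

fold_scores = []                   # one ROC AUC per test fold
pooled_true, pooled_prob = [], []  # labels and scores collected over all folds

for train_idx, test_idx in cv.split(X, y):
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X[train_idx], y[train_idx])
    prob = model.predict_proba(X[test_idx])[:, 1]

    fold_scores.append(roc_auc_score(y[test_idx], prob))
    pooled_true.append(y[test_idx])
    pooled_prob.append(prob)

# "average": mean of the per-fold metrics
average_auc = float(np.mean(fold_scores))
# "pooled": a single metric on the concatenated out-of-fold predictions
pooled_auc = roc_auc_score(np.concatenate(pooled_true), np.concatenate(pooled_prob))

print(f"average ROC AUC: {average_auc:.6f}")
print(f"pooled ROC AUC:  {pooled_auc:.6f}")
```

Pooling ranks predictions from different fold models against each other, which is why rank-based metrics such as ROC AUC can differ between the two summaries.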


## Cognitive Impairment
In this example, we look at the OASIS-3 data. We have cortical volumes for multiple subjects and sessions, each labelled as cognitively normal or impaired. The classification task is to predict whether a subject is cognitively normal, based on cortical features.

In [14]:
# Load the dataset
data_path = Path("../data/oasis3_fs_mci.tsv")
df = pd.read_csv(data_path, sep="\t")

print(f"Dataset shape: {df.shape}")

# drop rows that have empty cells / NAs
df = df.dropna(axis=0, how="any")

# only keep the first occurrence of each subject (this relies on the row order of the file)
df_baseline = df.drop_duplicates(subset=["subject"], keep="first")
print(f"Shape of baseline data: {df_baseline.shape}")

Dataset shape: (2832, 104)
Shape of baseline data: (1029, 104)
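
`drop_duplicates(..., keep="first")` keeps whichever row of a subject appears first in the file, so it only yields the baseline session if the rows are already ordered by session. A minimal sketch of making that explicit is to sort before de-duplicating. The toy data and the session labels below are hypothetical, and the sketch assumes that session identifiers sort chronologically as strings.

```python
import pandas as pd

# Hypothetical toy data: two subjects, two sessions each, in shuffled order
toy = pd.DataFrame(
    {
        "subject": ["s1", "s2", "s1", "s2"],
        "session": ["d0400", "d0000", "d0000", "d0350"],
        "volume": [1.1, 2.0, 1.0, 2.1],
    }
)

# Sort so that the earliest session of each subject comes first, then keep it
baseline = toy.sort_values(["subject", "session"]).drop_duplicates(
    subset=["subject"], keep="first"
)
print(baseline)
```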


In [15]:
# split into X (features): drop identifiers, age, and the label column
X = df_baseline.drop(columns=["subject", "session", "age", "cognitiveyly_normal"])
X = X.apply(pd.to_numeric, errors="coerce")
X = X.to_numpy()

# standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# and y
y = df_baseline["cognitiveyly_normal"].to_numpy()
y_binary = (y == True).astype(int)

# Check class balance
print(f"Class distribution:\n{sum(y_binary)} positive, {len(y_binary) - sum(y_binary)} negative, ratio: {sum(y_binary) / len(y_binary):.2f}")

Class distribution:
755 positive, 274 negative, ratio: 0.73


In [16]:
# Run cross-validation
model = LogisticRegression(max_iter=1000000)
metrics = create_metrics(["accuracy", "rocauc", "prcauc"])
results = run_cv(model, X_scaled, y_binary, metrics, n_splits=5, stratified=True, random_state=1)
results_df = pd.DataFrame(results)
print(results_df)

           average    pooled
accuracy  0.820214  0.820214
rocauc    0.822567  0.822845
prcauc    0.910967  0.910037


## Depression Remission
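In this example, we look at the NP1 data. The classification task is to predict whether a patient will achieve remission from depression after a certain period of time. In the code below, the target is the `mdd_episode` column, with "Recurrent" coded as the positive class.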


In [17]:
# Load the dataset
data_path = Path("../data/np1_fs_mdd_episode.csv")
df = pd.read_csv(data_path, sep=",")

print(f"Dataset shape: {df.shape}")

# drop rows that have empty cells / NAs
df = df.dropna(axis=0, how="any")

# report the shape after dropping NAs (no per-subject de-duplication is applied here)
print(f"Shape of baseline data: {df.shape}")

Dataset shape: (79, 163)
Shape of baseline data: (79, 163)


In [21]:
X = df.drop(columns=["mdd_episode", "diagnosis"])
X = X.to_numpy()
print(X)
# standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# and y
y = df["mdd_episode"].to_numpy()
y_binary = (y == "Recurrent").astype(int)

# Check class balance
print(f"Class distribution:\n{sum(y_binary)} positive, {len(y_binary) - sum(y_binary)} negative, ratio: {sum(y_binary) / len(y_binary):.2f}")

[[ 2.49        2.945       2.683      ...  1.3486023  29.95756331
  14.        ]
 [ 2.538       2.411       2.664      ...  1.3746835  33.86721424
  13.        ]
 [ 2.593       2.854       2.664      ...  1.3468745  23.93702943
  14.        ]
 ...
 [ 2.742       3.369       2.815      ...  1.4016296  23.95345654
  13.        ]
 [ 2.787       2.755       2.68       ...  1.6235194  28.06844627
  11.        ]
 [ 2.49        3.11        2.629      ...  1.1697149  29.35249829
  12.        ]]
Class distribution:
47 positive, 32 negative, ratio: 0.59
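
With only 79 samples, each test fold of a 5-fold split is small, so per-fold metrics are computed on very few cases. The sketch below reproduces only the class counts printed above to show the resulting fold sizes. It assumes that `run_cv` splits the data in a way comparable to scikit-learn's `StratifiedKFold`, which is not confirmed by this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels with the same class counts as printed above (47 positive, 32 negative).
# The features do not influence the split, so zeros serve as a placeholder.
y_demo = np.array([1] * 47 + [0] * 32)
X_demo = np.zeros((len(y_demo), 1))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_sizes = []
for _, test_idx in cv.split(X_demo, y_demo):
    n_pos = int(y_demo[test_idx].sum())
    n_neg = len(test_idx) - n_pos
    fold_sizes.append(len(test_idx))
    print(f"test fold: {len(test_idx)} samples, {n_pos} positive, {n_neg} negative")
```

With so few negatives per fold, a per-fold ROC AUC can change noticeably when a single case is ranked differently, which is one reason the averaged and pooled summaries may diverge more on small datasets.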


In [None]:
# Run cross-validation
model = LogisticRegression(max_iter=1000000)
metrics = create_metrics(["accuracy", "rocauc", "prcauc"])
results = run_cv(model, X_scaled, y_binary, metrics, n_splits=5, stratified=True, random_state=1)
results_df = pd.DataFrame(results)
print(results_df)