

## Model 4 Summary

### Features
- MaxHR
- Oldpeak
- ExerciseAngina
- ST_Slope

### Objective
- Isolate predictive power of ETT (Exercise Tolerance Test) features
- Compare linear vs nonlinear algorithm performance
- Evaluate using 5-fold Stratified Cross-Validation with ROC-AUC

### Models
- Logistic Regression (scaled numeric features)
- GradientBoostingClassifier (nonlinear model)

### Performance

Logistic Regression:
- Mean ROC-AUC: 0.885
- Std ROC-AUC: 0.035

Gradient Boosting:
- Mean ROC-AUC: 0.886
- Std ROC-AUC: 0.031

### Interpretation
- ETT features alone provide strong discriminatory power (~0.89 ROC-AUC)
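- Performance is reasonably stable across folds (std ≈ 0.03; per-fold ROC-AUC between roughly 0.85 and 0.93)
- Gradient Boosting provides no meaningful improvement over Logistic Regression (mean ROC-AUC 0.886 vs 0.885, a gap far smaller than the fold-to-fold std)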
- The predictive signal within ETT variables appears largely linear
- ETT features capture a substantial portion of heart disease risk in this dataset

## Model 3 vs Model 4 Comparison
#### *(For a full comparison, cross-validation on Model 3 is needed. In progress.)*

Model 4 (ETT-only):
- ROC-AUC ≈ 0.885
- Strong discriminatory power using only exercise test features
- Performance stable across folds
- Linear and nonlinear models perform similarly

Model 3 (Full feature set):
- ROC-AUC ≈ 0.92
- Improved discrimination compared to ETT-only
- Uses complete clinical information

### Interpretation

- ETT features capture a substantial portion of predictive signal.
- Full feature set improves ROC-AUC by ~0.035 (0.92 vs 0.885); this gap is provisional until Model 3 is cross-validated.
- Additional resting and demographic features provide incremental predictive value.
- Model 3 remains the strongest overall diagnostic model.
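
To put Model 3 on the same footing as Model 4, it could be scored with the same 5-fold stratified ROC-AUC protocol. The sketch below is only an illustration of that idea: the function name `cross_validate_full_feature_set` is hypothetical, the column lists are taken from the `df.head()` output in this notebook, and `LogisticRegression` is a stand-in because the actual Model 3 estimator is not shown here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names as they appear in df.head(); FastingBS is a 0/1 flag and is
# treated as numeric here (an assumption, not necessarily Model 3's choice).
FULL_NUM_COLS = ["Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak"]
FULL_CAT_COLS = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]


def cross_validate_full_feature_set(df: pd.DataFrame, target_col: str = "HeartDisease"):
    """Return per-fold ROC-AUC for a full-feature pipeline (hypothetical helper)."""
    X = df[FULL_NUM_COLS + FULL_CAT_COLS]
    y = df[target_col]

    preprocess = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), FULL_NUM_COLS),
            ("cat", OneHotEncoder(handle_unknown="ignore"), FULL_CAT_COLS),
        ]
    )
    pipeline = Pipeline(
        steps=[
            ("preprocess", preprocess),
            # Placeholder estimator: swap in whatever Model 3 actually uses.
            ("model", LogisticRegression(max_iter=2000)),
        ]
    )

    # Same splitter settings as the Model 4 cells, so fold assignments match.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    return cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
```

Calling `cross_validate_full_feature_set(df)` on the loaded dataset would return five fold scores whose mean and std can be compared directly with the Model 4 numbers.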

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [10]:
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [5]:
df = pd.read_csv("../Data/Raw/heart_disease.csv")
df.head()

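| Unnamed: 0 | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |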


In [6]:
TARGET_COL = "HeartDisease"
NUM_COLS = ["MaxHR", "Oldpeak"]
CAT_COLS = ["ExerciseAngina", "ST_Slope"]

X = df[NUM_COLS + CAT_COLS].copy()
y = df[TARGET_COL].copy()

print("X shape:", X.shape)
print("y shape:", y.shape)
print("\nClass distribution:")
print(y.value_counts(normalize=True))

X shape: (918, 4)
y shape: (918,)

Class distribution:
HeartDisease
1    0.553377
0    0.446623
Name: proportion, dtype: float64


### Build Preprocessing Pipelines

Preprocessor 1 (for Logistic Regression)
- StandardScaler → numeric
- OneHotEncoder → categorical

Preprocessor 2 (for Gradient Boosting)
- passthrough → numeric (tree-based models are insensitive to feature scaling)
- OneHotEncoder → categorical

In [7]:
preprocess_lr = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), NUM_COLS),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CAT_COLS),
    ]
)

In [8]:
preprocess_gb = ColumnTransformer(
    transformers=[
        ("num", "passthrough", NUM_COLS),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CAT_COLS),
    ]
)

### Build the Pipelines

In [11]:
lr_pipeline = Pipeline(
    steps=[
        ("preprocess", preprocess_lr),
        ("model", LogisticRegression(max_iter=2000, random_state=42))
    ]
)

In [12]:
gb_pipeline = Pipeline(
    steps=[
        ("preprocess", preprocess_gb),
        ("model", GradientBoostingClassifier(random_state=42))
    ]
)

### 5-Fold Stratified Cross-Validation (ROC-AUC)

In [13]:
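# One shuffled, stratified splitter with a fixed seed: both models are scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Preprocessing is part of each pipeline, so it is re-fit on the training folds only
lr_auc = cross_val_score(lr_pipeline, X, y, cv=cv, scoring="roc_auc")

gb_auc = cross_val_score(gb_pipeline, X, y, cv=cv, scoring="roc_auc")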

print("Logistic Regression ROC-AUC per fold:", np.round(lr_auc, 4))
print(f"Logistic Regression Mean ROC-AUC: {lr_auc.mean():.4f}")
print(f"Logistic Regression Std ROC-AUC:  {lr_auc.std(ddof=1):.4f}\n")

print("Gradient Boosting ROC-AUC per fold:", np.round(gb_auc, 4))
print(f"Gradient Boosting Mean ROC-AUC: {gb_auc.mean():.4f}")
print(f"Gradient Boosting Std ROC-AUC:  {gb_auc.std(ddof=1):.4f}")

Logistic Regression ROC-AUC per fold: [0.9198 0.8911 0.8532 0.9166 0.8459]
Logistic Regression Mean ROC-AUC: 0.8853
Logistic Regression Std ROC-AUC:  0.0346

Gradient Boosting ROC-AUC per fold: [0.9327 0.887  0.8463 0.8912 0.8748]
Gradient Boosting Mean ROC-AUC: 0.8864
Gradient Boosting Std ROC-AUC:  0.0313
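
Because both models were scored on the same folds, the per-fold scores can be compared pairwise. The sketch below copies the fold scores from the output above (rounded to four decimals) and summarizes the Gradient Boosting minus Logistic Regression difference. It is a descriptive check, not a formal significance test, and the variable names `fold_diff`, `mean_diff`, and `sd_diff` are introduced here for illustration.

```python
from statistics import mean, stdev

# Per-fold ROC-AUC values copied from the cross-validation output above.
lr_auc = [0.9198, 0.8911, 0.8532, 0.9166, 0.8459]
gb_auc = [0.9327, 0.8870, 0.8463, 0.8912, 0.8748]

# Paired difference per fold (positive means Gradient Boosting scored higher).
fold_diff = [gb - lr for gb, lr in zip(gb_auc, lr_auc)]

mean_diff = mean(fold_diff)   # average advantage of Gradient Boosting
sd_diff = stdev(fold_diff)    # sample standard deviation (ddof=1) of the differences
gb_wins = sum(d > 0 for d in fold_diff)

print("Per-fold difference (GB - LR):", [round(d, 4) for d in fold_diff])
print(f"Mean difference: {mean_diff:.4f}")
print(f"Std of difference: {sd_diff:.4f}")
print(f"Folds where GB > LR: {gb_wins} of {len(fold_diff)}")
```

The mean difference is about 0.001 while the differences themselves vary by about 0.02 from fold to fold, and Gradient Boosting wins in only 2 of 5 folds, which is consistent with reading the two models as equivalent on these features.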
