## Assignment 4

In this week's assignment you will fit and evaluate multiple classification models based on the Heart.csv dataset from ISL, predicting whether the subjects have heart disease (AHD). There are two main learning goals:
- Gain more practical experience with writing analysis code, this time focusing specifically on reusing code through loops and functions.
- Compare approaches for model performance evaluation.

The tasks are organized such that you should be able to reuse code from the early tasks in the later ones.

Note: In the text below I ask you to write some functions to do specific things. It is not a hard requirement to write exactly these functions. For instance, if you already know of an existing function performing one (or multiple) of the steps described below, feel free to use it. The goal with being so explicit is that you should focus on separation of concerns in your code (as discussed in Lecture 3), writing separate chunks of code with separate responsibilities and chaining them together when this is useful. With this in mind, you are free to do as you want as long as you achieve the overall goals.

### Comment
- I spent two to three hours in total, partly on problems I still don't fully understand. I used the same env as in earlier assignments, but several of the imports ran indefinitely. I solved it by deleting the env and creating a new one on the fifth attempt. So I had to rush through the tasks a bit faster than I prefer, and I also used Claude Code, particularly on task 1.

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

## 0. Prepare the dataset for the subsequent modelling.

- Download the heart disease dataset from https://www.statlearning.com/s/Heart.csv
   - Load the dataset and drop all variables except the predictors Age, Sex, ChestPain, RestBP, Chol, and the target variable AHD.
   - Drop all rows containing a NaN value.
- One-hot encode the variable ChestPain. This means that where you before had a single column with one of four values ['typical', 'asymptomatic', 'nonanginal', 'nontypical'], you will now have four binary columns (their names don't matter), akin to 'ChestPain_typical', 'ChestPain_asymptomatic', 'ChestPain_nonanginal', 'ChestPain_nontypical'. A row that before had ChestPain='typical' will now have ChestPain_typical=1 and the other three columns set to 0, ChestPain='asymptomatic' will have ChestPain_asymptomatic=1 and the other three set to 0, etc.
- Binary encode the target variable AHD such that 'No'=0 and 'Yes'=1.

In [2]:
# Load the dataset
df = pd.read_csv('heart.csv', index_col=0)

# Drop all variables except the predictors Age, Sex, ChestPain, RestBP, Chol, and the target variable AHD
df = df[['Age', 'Sex', 'ChestPain', 'RestBP', 'Chol', 'AHD']]

# Drop all rows containing a NaN value
df = df.dropna()

# Onehot encode the variable ChestPain (convert to binary 1/0)
df = pd.get_dummies(df, columns=['ChestPain'], drop_first=False, dtype=int)

# Binary encode AHD (No=0, Yes=1)
df['AHD'] = df['AHD'].map({'No': 0, 'Yes': 1})

In [3]:
# Display the prepared dataset
print(f"Dataset shape: {df.shape}")
print(f"\nAHD value counts:\n{df['AHD'].value_counts()}")
print(f"\nFirst rows:")
df.head()

Dataset shape: (303, 9)

AHD value counts:
AHD
0    164
1    139
Name: count, dtype: int64

First rows:


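|   | Age | Sex | RestBP | Chol | AHD | ChestPain_asymptomatic | ChestPain_nonanginal | ChestPain_nontypical | ChestPain_typical |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 63 | 1 | 145 | 233 | 0 | 0 | 0 | 0 | 1 |
| 2 | 67 | 1 | 160 | 286 | 1 | 1 | 0 | 0 | 0 |
| 3 | 67 | 1 | 120 | 229 | 1 | 1 | 0 | 0 | 0 |
| 4 | 37 | 1 | 130 | 250 | 0 | 0 | 1 | 0 | 0 |
| 5 | 41 | 0 | 130 | 204 | 0 | 0 | 0 | 1 | 0 |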


## 1. Fit a model using a standard train/validation split through multiple steps.

Through the steps you will practice chaining functions, and you will also create the infrastructure necessary for the remaining tasks.

- Write a function 'stratified_split' that takes three arguments: A dataframe, a number of folds, and a list of variables to stratify by. The function should return a list of dataframes, one for each fold, where the dataframes are stratified by the variables in the list. Test that the function works by splitting the dataset into two folds based on 'AHD', 'Age' and 'RestBP' and print the size of each fold, the counts of 0s and 1s in AHD, and the mean of each of 'Age' and 'RestBP' (all these should be printed individually per fold). Ensure that the function does not modify the original dataframe.

- Write a function 'fit_and_predict' that takes 4 arguments: A training set, a validation set, a list of predictors, and a target variable. The function should fit a logistic regression model to the training set using the predictors and target variable, and return the predictions of the model on the validation set.

- Write a function 'fit_and_predict_standardized' that takes 5 arguments: A training set, a validation set, a list of predictors, a target variable, and a list of variables to standardize. Using a loop (or a scaler), the function should z-score standardize the given variables in both the training set and the validation set based on the mean and standard deviation in the training set. Then, the function should call the 'fit_and_predict' function and return its result. Ensure that the function does not modify the original dataframes. Test the function using the train and validation set from above (e.g. the two folds from the split), while standardizing the 'Age', 'RestBP' and 'Chol' variables (as mentioned above, the target should be AHD, and you should also include the remaining predictors: 'Sex' and the ChestPain-variables)

- Write a function 'fit_and_compute_auc' that takes 5 arguments: A training set, a validation set, a list of predictors, a target variable, and a list of variables to standardize. The function should call the 'fit_and_predict_standardized' function to retrieve out-of-sample predictions for the validation set. Based on these and the ground truth labels in the validation set, it should compute and return the AUC. Test the function using the train and test set from above, while standardizing the 'Age', 'RestBP' and 'Chol' variables (and including the remaining predictors). Print the AUC.

In [4]:
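# Write a function 'stratified_split'

def stratified_split(df, n_folds, stratify_vars):
    # Work on a copy so the original dataframe is never modified
    df_copy = df.copy()
    
    # NOTE: only the first variable in stratify_vars is used for stratification;
    # any further variables in the list are currently ignored
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    folds = []
    
    # The test indices of the splits are disjoint and together cover all rows,
    # so each set of test indices defines one fold
    for _, fold_idx in skf.split(df_copy, df_copy[stratify_vars[0]]):
        folds.append(df_copy.iloc[fold_idx])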
    
    return folds
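
The function above stratifies on the first variable in the list only. Below is a minimal sketch of one way to stratify on every variable, assuming that continuous variables such as 'Age' and 'RestBP' can be binned by quantile and that a simple round-robin assignment within each stratum is acceptable. The name `stratified_split_multi` and the parameters `n_bins` and `seed` are hypothetical and not part of the assignment, and the demo data is synthetic, not the Heart dataset.

```python
import numpy as np
import pandas as pd


def stratified_split_multi(df, n_folds, stratify_vars, n_bins=2, seed=42):
    """Hypothetical variant: split df into n_folds folds, stratified on all stratify_vars."""
    rng = np.random.default_rng(seed)
    # Shuffle a copy so the original dataframe is left untouched
    shuffled = df.iloc[rng.permutation(len(df))].copy()

    # Build one composite key per row; variables with many values are binned by quantile
    parts = []
    for var in stratify_vars:
        col = shuffled[var]
        if col.nunique() > n_bins:
            col = pd.qcut(col, q=n_bins, labels=False, duplicates="drop")
        parts.append(col.astype(str).tolist())
    key = np.array(["_".join(row) for row in zip(*parts)])

    # Sort rows by stratum, then deal them out to the folds in round-robin order,
    # so every stratum is spread as evenly as possible across the folds
    order = np.argsort(key, kind="stable")
    ordered = shuffled.iloc[order]
    fold_id = np.arange(len(ordered)) % n_folds
    return [ordered.iloc[fold_id == k] for k in range(n_folds)]


# Synthetic demo data (assumption: NOT the Heart dataset)
demo_rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AHD": demo_rng.integers(0, 2, size=200),
    "Age": demo_rng.integers(30, 80, size=200),
    "RestBP": demo_rng.integers(90, 180, size=200),
})
demo_folds = stratified_split_multi(demo, 2, ["AHD", "Age", "RestBP"])
for i, fold in enumerate(demo_folds):
    print(f"Fold {i}: size={len(fold)}, AHD=1: {fold['AHD'].sum()}, "
          f"mean Age={fold['Age'].mean():.2f}, mean RestBP={fold['RestBP'].mean():.2f}")
```

Binning is a design choice: stratifying directly on raw 'Age' or 'RestBP' values would create many strata with a single row each, which cannot be distributed evenly across folds.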

In [5]:
# Write a function 'fit_and_predict'

def fit_and_predict(train, validation, predictors, target):
    # Fit logistic regression on training set
    model = LogisticRegression(max_iter=1000)
    model.fit(train[predictors], train[target])
    
    # Return predictions on validation set
    predictions = model.predict_proba(validation[predictors])[:, 1]
    return predictions

In [6]:
# Write a function 'fit_and_predict_standardized'

def fit_and_predict_standardized(train, validation, predictors, target, standardize_vars):
    train_copy = train.copy()
    validation_copy = validation.copy()
    
    # Standardize based on training set statistics
    for var in standardize_vars:
        mean = train_copy[var].mean()
        std = train_copy[var].std()
        train_copy[var] = (train_copy[var] - mean) / std
        validation_copy[var] = (validation_copy[var] - mean) / std
    
    return fit_and_predict(train_copy, validation_copy, predictors, target)
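
The task text also allows a scaler instead of a loop. The following is a small sketch of that alternative using scikit-learn's `StandardScaler`; the helper name `standardize_with_scaler` and the toy data are assumptions made for illustration. One difference to be aware of: `StandardScaler` divides by the population standard deviation (ddof=0), whereas pandas' `.std()` defaults to the sample standard deviation (ddof=1), so the two approaches give slightly different z-scores.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def standardize_with_scaler(train, validation, standardize_vars):
    """Hypothetical helper: z-score the given columns using training-set statistics only."""
    train_copy = train.copy()
    validation_copy = validation.copy()

    scaler = StandardScaler()
    # fit_transform learns the mean and standard deviation from the training data
    train_copy[standardize_vars] = scaler.fit_transform(train_copy[standardize_vars])
    # transform reuses those training statistics on the validation data
    validation_copy[standardize_vars] = scaler.transform(validation_copy[standardize_vars])
    return train_copy, validation_copy


# Toy data (assumption: made-up values, not the Heart dataset)
train_demo = pd.DataFrame({"Age": [40.0, 50.0, 60.0], "Chol": [200.0, 220.0, 240.0]})
val_demo = pd.DataFrame({"Age": [50.0], "Chol": [260.0]})
train_z, val_z = standardize_with_scaler(train_demo, val_demo, ["Age", "Chol"])

print(val_z)
# Compare the two standard deviation conventions on the training column
print("ddof=0:", train_demo["Age"].std(ddof=0), "ddof=1:", train_demo["Age"].std())
```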

In [7]:
# Write a function 'fit_and_compute_auc'

def fit_and_compute_auc(train, validation, predictors, target, standardize_vars):
    # Get predictions from fit_and_predict_standardized
    predictions = fit_and_predict_standardized(train, validation, predictors, target, standardize_vars)
    
    # Compute and return AUC
    auc = roc_auc_score(validation[target], predictions)
    return auc

In [8]:
# Test stratified_split: split into 2 folds and print stats for each fold
folds = stratified_split(df, 2, ['AHD', 'Age', 'RestBP'])

for i, fold in enumerate(folds):
    print(f"Fold {i}:")
    print(f"  Size: {len(fold)}")
    print(f"  AHD=0: {(fold['AHD'] == 0).sum()}")
    print(f"  AHD=1: {(fold['AHD'] == 1).sum()}")
    print(f"  Mean Age: {fold['Age'].mean():.2f}")
    print(f"  Mean RestBP: {fold['RestBP'].mean():.2f}")
    print()


# Test fit_and_predict_standardized: get predictions on validation set
train = folds[0]
validation = folds[1]
predictors = ['Age', 'Sex', 'RestBP', 'Chol', 'ChestPain_asymptomatic', 
              'ChestPain_nonanginal', 'ChestPain_nontypical', 'ChestPain_typical']
target = 'AHD'
standardize_vars = ['Age', 'RestBP', 'Chol']

predictions = fit_and_predict_standardized(train, validation, predictors, target, standardize_vars)
print(f"First 5 predictions: {predictions[:5]}")
print()


# Test fit_and_compute_auc: compute and print AUC
auc = fit_and_compute_auc(train, validation, predictors, target, standardize_vars)
print(f"AUC: {auc:.4f}")

Fold 0:
  Size: 152
  AHD=0: 82
  AHD=1: 70
  Mean Age: 55.41
  Mean RestBP: 131.55

Fold 1:
  Size: 151
  AHD=0: 82
  AHD=1: 69
  Mean Age: 53.46
  Mean RestBP: 131.83

First 5 predictions: [0.25366754 0.16440969 0.03742883 0.53852323 0.72382675]

AUC: 0.8501


## 2. Perform a cross-validation.

Use the 'stratified_split' function to split the dataset into 10 folds, stratified on variables you find reasonable. For each fold, use the 'fit_and_compute_auc' function to compute the AUC of the model on the held-out validation set. Print the mean and standard deviation of the AUCs across the 10 folds.

In [9]:
# Perform 10-fold cross-validation
ten_folds = stratified_split(df, 10, ['AHD', 'Sex'])

# Define predictors and target
predictors = ['Age', 'Sex', 'RestBP', 'Chol', 'ChestPain_asymptomatic', 
              'ChestPain_nonanginal', 'ChestPain_nontypical', 'ChestPain_typical']
target = 'AHD'
standardize_vars = ['Age', 'RestBP', 'Chol']

# Compute AUC for each fold
aucs = []
for i in range(len(ten_folds)):
    # Use fold i as validation, combine others as training
    validation = ten_folds[i]
    train = pd.concat([ten_folds[j] for j in range(len(ten_folds)) if j != i])
    
    # Compute AUC
    auc = fit_and_compute_auc(train, validation, predictors, target, standardize_vars)
    aucs.append(auc)
    print(f"Fold {i}: AUC = {auc:.4f}")

# Print mean and standard deviation
print(f"\nMean AUC: {np.mean(aucs):.4f}")
print(f"Std AUC: {np.std(aucs):.4f}")

Fold 0: AUC = 0.7941
Fold 1: AUC = 0.8613
Fold 2: AUC = 0.7815
Fold 3: AUC = 0.8824
Fold 4: AUC = 0.8125
Fold 5: AUC = 0.8348
Fold 6: AUC = 0.9509
Fold 7: AUC = 0.8616
Fold 8: AUC = 0.9107
Fold 9: AUC = 0.7723

Mean AUC: 0.8462
Std AUC: 0.0552
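
In the spirit of reusing code through functions, the loop above could itself be wrapped in a function. This is a sketch under the assumption that the scoring step is passed in as a callable; `cross_validate` and `score_fn` are hypothetical names, and the toy scoring function only counts training rows so that the example runs on its own.

```python
import numpy as np
import pandas as pd


def cross_validate(folds, score_fn):
    """Hypothetical helper: call score_fn(train, validation) once per fold and collect the scores."""
    scores = []
    for i, validation in enumerate(folds):
        # Every fold except fold i forms the training set
        train = pd.concat([fold for j, fold in enumerate(folds) if j != i])
        scores.append(score_fn(train, validation))
    return np.array(scores)


# Toy folds and a toy scoring function (assumption: for illustration only)
toy_folds = [pd.DataFrame({"x": [k, k + 10]}) for k in range(3)]
toy_scores = cross_validate(toy_folds, lambda train, validation: len(train))
print(toy_scores)

# np.std defaults to the population formula (ddof=0); ddof=1 gives the sample version
example = np.array([0.80, 0.85, 0.90])
print(np.std(example), np.std(example, ddof=1))
```

With the functions from task 1, the call would look like `cross_validate(ten_folds, lambda tr, va: fit_and_compute_auc(tr, va, predictors, target, standardize_vars))`.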


## 3. OPTIONAL: Use the bootstrap to achieve a distribution of out-of-bag AUCs.

For 100 iterations, create a bootstrap sample by sampling with replacement from the full dataset until you have a training set equal in size to 80% of the original data. Use the observations not included in the bootstrap sample as the validation set for that iteration. Fit models and calculate AUCs for each iteration. Print the mean and standard deviation of the AUCs.

In [10]:
# Unfortunately did not have time for this
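
For reference, the procedure described in the task could be sketched roughly as below. This is not a completed answer: it runs on synthetic data from `make_classification` rather than the Heart dataset, skips the standardization step, and the function name `bootstrap_oob_aucs` and its parameters are assumptions, so the printed numbers say nothing about the assignment results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def bootstrap_oob_aucs(X, y, n_iterations=100, train_fraction=0.8, seed=42):
    """Hypothetical helper: return one out-of-bag AUC per bootstrap iteration."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(train_fraction * n)
    oob_scores = []
    for _ in range(n_iterations):
        # Draw row positions with replacement to form the training set
        train_idx = rng.integers(0, n, size=n_train)
        # Rows that were never drawn are "out of bag" and form the validation set
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[train_idx] = False
        # Skip degenerate draws where either set contains a single class
        if len(np.unique(y[train_idx])) < 2 or len(np.unique(y[oob_mask])) < 2:
            continue
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        probs = model.predict_proba(X[oob_mask])[:, 1]
        oob_scores.append(roc_auc_score(y[oob_mask], probs))
    return np.array(oob_scores)


# Synthetic stand-in data (assumption: NOT the Heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
oob_aucs = bootstrap_oob_aucs(X_demo, y_demo)
print(f"Mean OOB AUC: {oob_aucs.mean():.4f}")
print(f"Std OOB AUC: {oob_aucs.std():.4f}")
```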

## 4. Theory

List some benefits of wrapping code in functions rather than copying and pasting it multiple times.
- It saves time, makes the workflow more efficient and reproducible, gives a better overview of what the code does, and leaves less room for errors, since a fix only has to be made in one place.

Explain three classification metrics and their benefits and drawbacks.
- Recall: correctly classified actual positives / all actual positives
  Benefit: critical when missing positives is costly (e.g. disease detection, where sick patients must not be missed).
  Drawback: ignores false positives, so predicting everything as positive gives 100% recall without being useful.

- Accuracy: correct classifications / total classifications
  Benefit: simple, intuitive, and summarizes overall model performance.
  Drawback: can be misleading with imbalanced classes. If e.g. 90% of the observations belong to class 0, always predicting 0 gives 90% accuracy.

- Precision: correctly classified actual positives / everything classified as positive
  Benefit: important when false positives are costly (e.g. spam filtering, where real emails should not be blocked).
  Drawback: ignores false negatives, so a model that predicts very few positives can get high precision while missing most of the actual positives.
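
The drawbacks listed above can be made concrete with a small worked example. The sketch below uses plain Python and made-up labels (90 negatives, 10 positives); the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    # Convention for this example: 0.0 when there are no actual positives
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def precision(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    # Convention for this example: 0.0 when nothing is predicted positive
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


# Imbalanced toy labels: 90 negatives and 10 positives
y_true = [0] * 90 + [1] * 10
always_negative = [0] * 100
always_positive = [1] * 100

# Always predicting 0: high accuracy, but no positive is ever found
print(accuracy(y_true, always_negative), recall(y_true, always_negative))
# Always predicting 1: perfect recall, but poor precision and accuracy
print(recall(y_true, always_positive), precision(y_true, always_positive),
      accuracy(y_true, always_positive))
```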

Write a couple of sentences comparing the three methods (train/validation, cross-validation, bootstrap) above as approaches to quantify model performance. Which one yielded the best results?
- The mean AUC from cross-validation was 0.85, and the AUC from the single train/validation split was also 0.85. The train/validation estimate rests on one split, which may or may not be slightly imbalanced, whereas the cross-validation estimate is the mean over ten different splits. It is therefore difficult to state that one gave better results than the other, but the cross-validation estimate is more reliable, while the single-split estimate is more variable.
     
Which one would you expect to yield the best results?
- The train/validation result depends on a single split, while cross-validation averages over e.g. 10 folds and therefore likely gives a more realistic AUC. Had we run the bootstrap, the out-of-bag AUC would likely have been similar or slightly lower, since each model is trained on fewer unique observations.

Can you mention some theoretical benefits and drawbacks with each? Even if you didn't do the optional bootstrap exercise you should reflect on this as an approach.
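- Train/val: fast and simple, with a clean separation between training and validation data, but the estimate depends heavily on how the single split happens to fall.
- Cross-val: uses every observation for both training and validation, which helps for smaller datasets and gives more reliable estimates, but can be computationally expensive since one model must be fitted per fold.
- Bootstrap: gives a whole distribution of scores from which confidence intervals can be derived, and is helpful for smaller datasets. It can be biased: evaluating on observations that also appear in the bootstrap sample is optimistic, while out-of-bag evaluation tends to be somewhat pessimistic because each model sees fewer unique observations.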
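
To put a number on how much data a bootstrap model actually sees, the expected share of distinct observations in a sample drawn with replacement can be computed directly. The short sketch below uses the standard formula 1 - (1 - 1/n)^m for m draws from n observations; the function name is hypothetical, and n = 303 matches the dataset shape printed in task 0.

```python
def expected_unique_fraction(n, n_draws):
    """Expected share of n observations that appear at least once in n_draws draws with replacement."""
    return 1 - (1 - 1 / n) ** n_draws


n = 303  # number of rows in the prepared dataset

# Classic bootstrap: as many draws as observations (about 0.63)
full_size = expected_unique_fraction(n, n)
# Task 3 setup: draws equal to 80% of the data (roughly 0.55)
reduced_size = expected_unique_fraction(n, int(0.8 * n))

print(f"Unique share with n draws: {full_size:.3f}")
print(f"Unique share with 0.8n draws: {reduced_size:.3f}")
```

For comparison, each model in 10-fold cross-validation is trained on 90% of the observations.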

Why do we stratify the dataset before splitting?
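- So that each fold preserves the class proportions of the full dataset (and the distribution of any other stratification variables), which avoids skewed splits.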

What other use cases can you think of for the bootstrap method?
- Estimating confidence intervals and standard errors for a statistic, particularly on small datasets.