# Homework 4

## Instructions
Once you are finished, complete the following steps.

1.  Restart your kernel and rerun everything.

2.  Fix any errors which result from this.

3.  Repeat the above until your notebook runs without errors.

4.  Submit your completed notebook (.ipynb) to OWL by the deadline.

## Introduction

In this homework, you will use a Heart Disease dataset to predict if a patient has heart disease or not. The features in the dataset are described below.

*   **sex**:	Sex (1 = male; 0 = female)
*   **cp**:	Chest pain type (1: typical angina; 2: atypical angina; 3: non-anginal pain; 4: asymptomatic)
*   **trestbps**:	Resting blood pressure (in mm Hg on admission to the hospital)
*   **chol**:	Serum cholesterol in mg/dl
*   **fbs**:	Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
*   **restecg**:	Resting electrocardiographic results (0: normal; 1: having ST-T wave abnormality; 2: showing probable or definite left ventricular hypertrophy)
*   **thalach**:	Maximum heart rate achieved
*   **exang**:	Exercise induced angina (1 = yes; 0 = no)
*   **oldpeak**:	ST depression induced by exercise relative to rest
*   **Class**: 1 if heart disease, 0 if no heart disease


In [None]:
# Package import
import numpy as np

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_validate, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Data management imports
import pandas as pd
import polars as pl

# Plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Uncomment the line below if you are using Google colab
!gdown https://drive.google.com/uc?id=1fpNHlDypoSHvbH1EqFzkH_SGk7jVj9Vb

## Part 1

1. Read the CSV file using **Polars** and store it using `null_values=['NA']`. Show **summary statistics** for the dataset. What is the **baseline accuracy** for a model?

In [None]:
df_pl = pl.read_csv('Heart Disease Data.csv', null_values=['NA'])
print(df_pl.describe())
df = df_pl.to_pandas()
baseline_accuracy = df['Class'].value_counts(normalize=True).max()
print(f"baseline accuracy: {baseline_accuracy:.2%}")

**Written answer**: The baseline accuracy for the model is `XX.XX%` (fill in from cell above).

2. Assume that we are only interested in studying people aged **70 or less**. Remove anyone with ages larger than that. (Note that this will slightly change your baseline accuracy.)

In [None]:
df = df[df['age'] <= 70]
baseline_accuracy = df['Class'].value_counts(normalize=True).max()
print(f"shape: {df.shape}, baseline: {baseline_accuracy:.2%}")

3. Replace the missing values in the dataset using the **median** of the corresponding predictor.

In [None]:
for col in df.columns:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].median())
print(df.isnull().sum().sum(), "nulls remaining")

Note: It was an **error** to have you fill in the missing values with the median based on the **entire dataset** rather than just the training set created in the next question. This leads to **data leakage** (although it is relatively minor). We did it in this coursework for simplicity only.

## Part 2

4. Create a training and testing dataset. Reserve **30%** of the data for testing and stratify the split based on the outcome. Use a random state equal to your **Student ID**.

In [None]:
X = df.drop('Class', axis=1)
y = df['Class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=251280006, stratify=y
)

print(X_train.shape[0], "train,", X_test.shape[0], "test")

5. Using **all** potential features, train a **logistic regression model** to predict if a patient has the condition. Remember to **standardise** the features. Use the following arguments.

*   `solver = 'lbfgs'`
*   `penalty = None`
*   `max_iter = 10000`
*   `verbose = 1`
*   `random_state = 0`
*   `n_jobs = -1`
*   `class_weight = 'balanced'`

In [None]:
feature_cols = X_train.columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[('scaler', StandardScaler(), feature_cols)],
    remainder='passthrough'
)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(
        solver='lbfgs', penalty=None, max_iter=10000, verbose=1,
        random_state=0, n_jobs=-1, class_weight='balanced'
    ))
])

pipeline.fit(X_train, y_train)

6. Compute the **accuracy** and **AUC** of your model on the test set.

In [None]:
y_pred = pipeline.predict(X_test)
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_proba)

print(f"accuracy: {accuracy:.2%}, auc: {auc:.4f}")

**Written answer:** Accuracy and AUC are in the cell above.

## Part 3

7. Without estimates of the uncertainty of the performance metrics, it can be hard to make definitive conclusions about the model performance. Compute **90% confidence intervals** for the accuracy and AUC using **bootstrapping** with **800** replicates. Interpret your results.

In [None]:
n_bootstrap = 800
rng = np.random.default_rng(0)

boot_accuracy = []
boot_auc = []
n_test = len(y_test)

for _ in range(n_bootstrap):
    idx = rng.integers(0, n_test, size=n_test)
    y_b = y_test.values[idx]
    y_pred_b = pipeline.predict(X_test.iloc[idx])
    y_proba_b = pipeline.predict_proba(X_test.iloc[idx])[:, 1]
    boot_accuracy.append(accuracy_score(y_b, y_pred_b))
    boot_auc.append(roc_auc_score(y_b, y_proba_b))

boot_accuracy = np.array(boot_accuracy)
boot_auc = np.array(boot_auc)

acc_ci_low = np.percentile(boot_accuracy, 5)
acc_ci_high = np.percentile(boot_accuracy, 95)
auc_ci_low = np.percentile(boot_auc, 5)
auc_ci_high = np.percentile(boot_auc, 95)

print(f"90% CI accuracy: [{acc_ci_low:.2%}, {acc_ci_high:.2%}]")
print(f"90% CI auc: [{auc_ci_low:.4f}, {auc_ci_high:.4f}]")

**Written answer:** CIs above. Model is reasonable because intervals are above baseline and AUC > 0.5.

8. Plot the distribution of the accuracy using **histogram**. Provide a title and axes labels for your plots. Add a **purple** vertical line representing the mean of accuracy. Repeat the same for the AUC.

In [None]:
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.hist(boot_accuracy, bins=30, edgecolor='black', alpha=0.7)
plt.axvline(boot_accuracy.mean(), color='purple', linewidth=2, label='mean')
plt.xlabel('accuracy')
plt.ylabel('frequency')
plt.title('distribution of bootstrap accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.hist(boot_auc, bins=30, edgecolor='black', alpha=0.7)
plt.axvline(boot_auc.mean(), color='purple', linewidth=2, label='mean')
plt.xlabel('auc')
plt.ylabel('frequency')
plt.title('distribution of bootstrap auc')
plt.legend()
plt.tight_layout()
plt.show()

## Part 4

9. Compute **90%** confidence intervals for the accuracy and AUC using **repeated cross-validation**. Use **10** splits and **100** repetitions with a random state of **0**. Compare your results to what you obtained using bootstrapping. Which method provides better confidence intervals in this case?

In [None]:
rkf = RepeatedKFold(n_splits=10, n_repeats=100, random_state=0)

cv_results = cross_validate(
    pipeline, X_train, y_train,
    cv=rkf, scoring=['accuracy', 'roc_auc'],
    return_estimator=True, n_jobs=-1
)

cv_accuracy = cv_results['test_accuracy']
cv_auc = cv_results['test_roc_auc']

acc_cv_low = np.percentile(cv_accuracy, 5)
acc_cv_high = np.percentile(cv_accuracy, 95)
auc_cv_low = np.percentile(cv_auc, 5)
auc_cv_high = np.percentile(cv_auc, 95)

print(f"90% CI for accuracy (repeated cv): [{acc_cv_low:.2%}, {acc_cv_high:.2%}]")
print(f"90% CI for auc (repeated cv): [{auc_cv_low:.4f}, {auc_cv_high:.4f}]")

In [None]:
print(f"bootstrap acc CI: [{acc_ci_low:.2%}, {acc_ci_high:.2%}]"); print(f"cv acc CI: [{acc_cv_low:.2%}, {acc_cv_high:.2%}]")
print(f"bootstrap auc CI: [{auc_ci_low:.4f}, {auc_ci_high:.4f}]"); print(f"cv auc CI: [{auc_cv_low:.4f}, {auc_cv_high:.4f}]")

**Written answer:** CIs above. Repeated CV gives better intervals (uses many splits; bootstrap uses one test set).

10. Using your cross-validation results, compute a **95%** confidence interval for each feature in the model. Which features should you remove since they have their 95% confidence intervals **include zero**?

In [None]:
estimators = cv_results['estimator']
n_features = len(feature_cols)
coefs = np.array([[e.named_steps['classifier'].coef_[0][j] for j in range(n_features)] for e in estimators])

coef_ci_low = np.percentile(coefs, 2.5, axis=0)
coef_ci_high = np.percentile(coefs, 97.5, axis=0)

for i, name in enumerate(feature_cols):
    inc = (coef_ci_low[i] <= 0 <= coef_ci_high[i])
    print(f"{name}: [{coef_ci_low[i]:.3f}, {coef_ci_high[i]:.3f}]  zero: {inc}")
features_to_remove = [feature_cols[i] for i in range(n_features) if coef_ci_low[i] <= 0 <= coef_ci_high[i]]
print("remove:", features_to_remove if features_to_remove else "none (use sex)")

**Written answer:** Remove features listed above (CI includes zero). If none, remove sex.

11. Fit your logistic regression model like before but remove the features you indentified in Q10. If you did not identify any features in Q10, remove **sex** instead. Plot the **ROC curve** of the model over the test set and **annotate** it with the AUC of the model. Provide a title and axes labels for your plot.

In [None]:
drop_cols = features_to_remove if features_to_remove else ['sex']
cols_red = [c for c in feature_cols if c not in drop_cols]
pipe_red = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(solver='lbfgs', penalty=None, max_iter=10000, verbose=0, random_state=0, n_jobs=-1, class_weight='balanced'))
])
pipe_red.fit(X_train[cols_red], y_train)
y_proba_red = pipe_red.predict_proba(X_test[cols_red])[:, 1]
auc_red = roc_auc_score(y_test, y_proba_red)
fpr_red, tpr_red, _ = roc_curve(y_test, y_proba_red)
plt.figure(figsize=(6, 5))
plt.plot(fpr_red, tpr_red, 'b-')
plt.plot([0, 1], [0, 1], 'k--')
plt.text(0.6, 0.2, f'auc = {auc_red:.3f}')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title('roc curve (reduced model)')
plt.show()

12. Iterate through your cross-validation created in Q9 and calculate the uncertainty for the prediction of the **second** testing patient. Plot a **histogram** of the different predictions. Provide a title and axes labels for your plot. Add a **purple** vertical line representing the mean of the predictions.

Hint: If you need to stack a list of arrays, you can use [np.hstack(list)](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html).

In [None]:
preds_second = []
for (_, test_idx), est in zip(rkf.split(X_train, y_train), cv_results['estimator']):
    x_second = X_train.iloc[test_idx[1]:test_idx[1]+1]
    preds_second.append(est.predict_proba(x_second)[0, 1])
preds_second = np.array(preds_second)
plt.figure(figsize=(6, 4))
plt.hist(preds_second, bins=25, edgecolor='black', alpha=0.7)
plt.axvline(preds_second.mean(), color='purple', linewidth=2, label='mean')
plt.xlabel('predicted probability')
plt.ylabel('frequency')
plt.title('predictions for second test patient (each fold)')
plt.legend()
plt.show()

## End of Homework 4