
Our dataset consists of clinical data from patients who entered the hospital complaining of chest pain ("angina") during exercise.  The information collected includes:

* `age` : Age of the patient

* `sex` : Sex of the patient

* `cp` : Chest Pain type

    + Value 0: asymptomatic
    + Value 1: typical angina
    + Value 2: atypical angina
    + Value 3: non-anginal pain
   
    
* `trtbps` : resting blood pressure (in mm Hg)

* `chol` : cholesterol in mg/dl fetched via BMI sensor

* `restecg` : resting electrocardiographic results

    + Value 0: normal
    + Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    + Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* `thalach` : maximum heart rate achieved during exercise

* `output` : the doctor's diagnosis of whether the patient is at risk for a heart attack
    + 0 = not at risk of heart attack
    + 1 = at risk of heart attack

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree


In [3]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

## Q1: Natural Multiclass Models

Fit a multiclass KNN, Decision Tree, and LDA for the heart disease data, this time predicting the type of chest pain (categories 0 - 3) that a patient experiences.  For the decision tree, plot the fitted tree, and interpret the first couple of splits.
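Reading the first splits of a fitted tree can be sketched as follows. This is a minimal illustration on synthetic data, not the heart attack data: `X_demo`, `y_demo`, and `tree_demo` are hypothetical names, and the labels are constructed from two thresholds so that the tree has something to find. `export_text` prints the same structure that `plot_tree` draws, which can be easier to quote when interpreting splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
feature_names = ["age", "chol"]

# Synthetic stand-in for two numeric features (assumed scales, not the real data).
X_demo = np.column_stack([rng.normal(55, 9, 200), rng.normal(240, 50, 200)])
# Four classes built from two thresholds, so the top splits are easy to read.
y_demo = (X_demo[:, 0] > 55).astype(int) + 2 * (X_demo[:, 1] > 240).astype(int)

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Text rendering of the fitted tree; the outermost rule is the root split.
rules = export_text(tree_demo, feature_names=feature_names)
print(rules)

# The root split can also be read from the fitted tree structure (node 0).
root_feature = feature_names[tree_demo.tree_.feature[0]]
root_threshold = tree_demo.tree_.threshold[0]
print(f"Root split: {root_feature} <= {root_threshold:.1f}")
```

Each rule reads as "samples with feature <= threshold go left, the rest go right", so the root split names the single feature and cut point that best separates the classes at the top of the tree.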


In [9]:

# Predict chest pain type from two numeric features.
X = ha[['age', 'chol']]
y = ha['cp']

# Hold out 25% for testing, keeping class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# KNN: tune neighbourhood size, weighting, and distance metric with 5-fold CV.
knn = KNeighborsClassifier()
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}
grid_search_knn = GridSearchCV(knn, param_grid_knn, cv=5)
grid_search_knn.fit(X_train, y_train)
best_knn = grid_search_knn.best_estimator_

# Decision tree: tune depth, split/leaf sizes, and impurity criterion with 5-fold CV.
dt = DecisionTreeClassifier(random_state=42)
param_grid_dt = {
    'max_depth': [2, 4, 6, 8],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}
grid_search_dt = GridSearchCV(dt, param_grid_dt, cv=5)
grid_search_dt.fit(X_train, y_train)
best_dt = grid_search_dt.best_estimator_

# LDA: fit with default settings (no tuning).
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Report test-set accuracy for each model.
print("Best KNN Model:", grid_search_knn.best_params_, "with accuracy:", best_knn.score(X_test, y_test))
print("Best Decision Tree Model:", grid_search_dt.best_params_, "with accuracy:", best_dt.score(X_test, y_test))
print("LDA Model accuracy:", lda.score(X_test, y_test))

# plt.figure(figsize=(20, 10))
# plot_tree(best_dt, filled=True, feature_names=['age', 'chol'], class_names=['Type 0', 'Type 1', 'Type 2', 'Type 3'])
# plt.show()


Best KNN Model: {'metric': 'euclidean', 'n_neighbors': 11, 'weights': 'uniform'} with accuracy: 0.4057971014492754
Best Decision Tree Model: {'criterion': 'entropy', 'max_depth': 2, 'min_samples_leaf': 4, 'min_samples_split': 2} with accuracy: 0.391304347826087
LDA Model accuracy: 0.4782608695652174
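The KNN above is fit on raw `age` and `chol`, which are on different scales, so distances are likely dominated by `chol`. One possible variation, sketched here on synthetic data, is to standardize inside a pipeline so the scaler is refit within each cross-validation fold. The names `X_demo`, `y_demo`, and `knn_pipe` are hypothetical, and whether scaling helps on the real data is not something this sketch can show.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for age and chol with a random 4-class target (not the real data).
X_demo = np.column_stack([rng.normal(55, 9, 200), rng.normal(240, 50, 200)])
y_demo = rng.integers(0, 4, 200)

# Scaling lives inside the pipeline, so each CV fold scales using its own training part.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name.
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11]}
search = GridSearchCV(knn_pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("Selected n_neighbors:", best_k)
```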


## Q2:  OvR

Create a new column in the `ha` dataset called `cp_is_3`, which is equal to `1` if the `cp` variable is equal to `3` and `0` otherwise.

Then, fit a Logistic Regression to predict this new target, and report the **F1 Score**.

Repeat for the other three `cp` categories.  Which category was the OvR approach best at distinguishing?

In [10]:

for cp_val in range(4):
    ha[f'cp_is_{cp_val}'] = (ha['cp'] == cp_val).astype(int)

lr = LogisticRegression()
f1_scores = {}

for cp_val in range(4):
    y_train, y_test = train_test_split(ha[f'cp_is_{cp_val}'], test_size=0.25, random_state=42)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    f1_scores[f'cp_is_{cp_val}'] = f1_score(y_test, y_pred)

f1_scores


{'cp_is_0': 0.3272727272727273, 'cp_is_1': 0.0, 'cp_is_2': 0.0, 'cp_is_3': 0.0}
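One thing to check in the loop above: `X_train` and `X_test` come from the Q1 split, which used `stratify=y`, while the targets here are split without stratification. The two calls therefore appear to shuffle rows differently, so features and labels may not refer to the same patients. A minimal sketch of splitting features and target together, on synthetic data with the hypothetical names `demo` and `f1_demo`, might look like this. `max_iter=1000` and `zero_division=0` are choices made for this sketch, not settings from the original.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the relevant columns (not the real data).
demo = pd.DataFrame({
    "age": rng.normal(55, 9, 200),
    "chol": rng.normal(240, 50, 200),
    "cp": rng.integers(0, 4, 200),
})

f1_demo = {}
aligned = []
for cp_val in range(4):
    # One-vs-rest target: 1 for the current category, 0 for all others.
    target = (demo["cp"] == cp_val).astype(int)

    # Splitting X and y in a single call keeps their rows paired.
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[["age", "chol"]], target,
        test_size=0.25, random_state=42, stratify=target,
    )
    aligned.append(X_tr.index.equals(y_tr.index) and X_te.index.equals(y_te.index))

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # zero_division=0 returns 0.0 without a warning when no positives are predicted.
    f1_demo[f"cp_is_{cp_val}"] = f1_score(y_te, model.predict(X_te), zero_division=0)

print(f1_demo)
```

An F1 score of exactly 0.0 usually means the model never predicted the positive class, which is common for a rare category under the default 0.5 threshold.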

## Q3: OvO

Reduce your dataset to only the `0` and `1` types of chest pain.

Then, fit a Logistic Regression to predict between the two groups, and report the **ROC-AUC**.  

Repeat, comparing category `0` with `2` and then with `3`.  Which pair was the OvO approach best at distinguishing?

In [12]:
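def fit_and_evaluate_ovo(cp1, cp2):
    """Fit a logistic regression on patients with cp in {cp1, cp2} and return test ROC-AUC."""
    # Keep only the two chest pain types being compared.
    ha_ovo = ha[ha['cp'].isin([cp1, cp2])]

    X_ovo = ha_ovo[['age', 'chol']]
    y_ovo = ha_ovo['cp']

    X_train_ovo, X_test_ovo, y_train_ovo, y_test_ovo = train_test_split(X_ovo, y_ovo, test_size=0.25, random_state=42)

    log_reg_ovo = LogisticRegression()
    log_reg_ovo.fit(X_train_ovo, y_train_ovo)
    # Column 1 holds the predicted probability of the larger of the two labels.
    y_pred_proba_ovo = log_reg_ovo.predict_proba(X_test_ovo)[:, 1]

    return roc_auc_score(y_test_ovo, y_pred_proba_ovo)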

roc_auc_scores = {
    '0_vs_1': fit_and_evaluate_ovo(0, 1),
    '0_vs_2': fit_and_evaluate_ovo(0, 2),
    '0_vs_3': fit_and_evaluate_ovo(0, 3)
}

roc_auc_scores


{'0_vs_1': 0.5591133004926109,
 '0_vs_2': 0.7333333333333334,
 '0_vs_3': 0.5396825396825397}
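Which pair was best separated can be read off the dictionary above. A minimal sketch, copying the reported values into a literal so it runs on its own:

```python
# ROC-AUC values as reported in the output above.
roc_auc_scores = {
    "0_vs_1": 0.5591133004926109,
    "0_vs_2": 0.7333333333333334,
    "0_vs_3": 0.5396825396825397,
}

# Pick the pair with the highest ROC-AUC.
best_pair = max(roc_auc_scores, key=roc_auc_scores.get)
print(best_pair, round(roc_auc_scores[best_pair], 3))  # 0_vs_2 0.733
```

On these numbers, types `0` and `2` are the pair the model separates best, while the other two pairs sit close to 0.5, which is the value expected from random guessing. These figures come from a single train/test split using only `age` and `chol`, so the ranking may not be stable.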