
Our dataset consists of clinical data from patients who entered the hospital complaining of chest pain ("angina") during exercise.  The information collected includes:

* `age` : Age of the patient

* `sex` : Sex of the patient

* `cp` : Chest Pain type

    + Value 0: asymptomatic
    + Value 1: typical angina
    + Value 2: atypical angina
    + Value 3: non-anginal pain
   
    
* `trtbps` : resting blood pressure (in mm Hg)

* `chol` : cholesterol in mg/dl fetched via BMI sensor

* `restecg` : resting electrocardiographic results

    + Value 0: normal
    + Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    + Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* `thalach` : maximum heart rate achieved during exercise

* `output` : the doctor's diagnosis of whether the patient is at risk for a heart attack

    + Value 0: not at risk of heart attack
    + Value 1: at risk of heart attack

In [59]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from sklearn.pipeline import Pipeline
import plotnine as p9

In [60]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

In [61]:
ha.head()

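|   | age | sex | cp | trtbps | chol | restecg | thalach | output |
|---|-----|-----|----|--------|------|---------|---------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 0 | 150 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 1 | 187 | 1 |
| 2 | 56 | 1 | 1 | 120 | 236 | 1 | 178 | 1 |
| 3 | 57 | 0 | 0 | 120 | 354 | 1 | 163 | 1 |
| 4 | 57 | 1 | 0 | 140 | 192 | 1 | 148 | 1 |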


In [62]:
ha.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   age      273 non-null    int64
 1   sex      273 non-null    int64
 2   cp       273 non-null    int64
 3   trtbps   273 non-null    int64
 4   chol     273 non-null    int64
 5   restecg  273 non-null    int64
 6   thalach  273 non-null    int64
 7   output   273 non-null    int64
dtypes: int64(8)
memory usage: 17.2 KB


In [63]:
X = ha.drop(['cp', 'output'], axis=1)
y = ha['cp']

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

## Q1: Natural Multiclass Models

Fit a multiclass KNN, Decision Tree, and LDA for the heart disease data, this time predicting the type of chest pain (categories 0 - 3) that a patient experiences. For the decision tree, plot the fitted tree, and interpret the first couple of splits.


In [65]:
knn_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])
knn_pipe.fit(X_train, y_train)

In [66]:
knn_pred = knn_pipe.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_pred)
print(f"K-Nearest Neighbors Accuracy: {knn_accuracy}")

K-Nearest Neighbors Accuracy: 0.43636363636363634


In [67]:
dt_model = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_model.fit(X_train, y_train)

In [68]:
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)
print(f"Decision Tree Accuracy: {dt_accuracy}")

Decision Tree Accuracy: 0.4909090909090909
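
Q1 also asks for the fitted tree to be inspected so that its first splits can be interpreted. The sketch below shows one way to read those splits as text with `export_text`. It runs on hypothetical random data (the arrays `X_demo` and `y_demo` and the three feature names are stand-ins, not the heart data), so the splits it prints carry no clinical meaning.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in data: 200 rows, 3 made-up features, 4 classes.
# These values are random and are NOT the heart attack dataset.
rng = np.random.default_rng(0)
feature_names = ["age", "trtbps", "thalach"]
X_demo = rng.normal(size=(200, 3))
y_demo = rng.integers(0, 4, size=200)

# Same settings as the tree fitted above: depth limited to 3.
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_demo.fit(X_demo, y_demo)

# Text rendering of the tree: one line per split or leaf.
tree_text = export_text(tree_demo, feature_names=feature_names)
print(tree_text)

# The root node is node 0; its feature index and threshold describe
# the first split applied to every observation.
root_feature = feature_names[tree_demo.tree_.feature[0]]
root_threshold = tree_demo.tree_.threshold[0]
print(f"Root split: {root_feature} <= {root_threshold:.3f}")
```

On the notebook's own model, the equivalent call would presumably be `export_text(dt_model, feature_names=list(X_train.columns))`, or `plot_tree(dt_model, feature_names=list(X_train.columns), filled=True)` for a graphical version.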


In [69]:
lda_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis())
])
lda_pipe.fit(X_train, y_train)

In [70]:
lda_pred = lda_pipe.predict(X_test)
lda_accuracy = accuracy_score(y_test, lda_pred)
print(f"Linear Discriminant Analysis Accuracy: {lda_accuracy}")

Linear Discriminant Analysis Accuracy: 0.4727272727272727


## Q2: OvR

Create a new column in the `ha` dataset called `cp_is_3`, which is equal to `1` if the `cp` variable is equal to `3` and `0` otherwise.

Then, fit a Logistic Regression to predict this new target, and report the **F1 Score**.

Repeat for the other three `cp` categories.  Which category was the OvR approach best at distinguishing?
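
As a small illustration of the two ingredients in this question, the sketch below builds a one-vs-rest target from a list of class labels and computes the F1 score by hand from true positives, false positives, and false negatives. The labels and predictions are made-up toy values, and `f1_from_labels` is a hypothetical helper, not part of scikit-learn.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# One-vs-rest target: 1 where the class equals 3, 0 otherwise (toy labels).
cp_demo = [0, 3, 2, 3, 1]
cp_is_3_demo = [int(c == 3) for c in cp_demo]
print(cp_is_3_demo)

# Toy binary labels and predictions: 2 TP, 2 FP, 1 FN.
y_true_demo = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 1, 0, 0, 1, 0, 1]
f1_demo = f1_from_labels(y_true_demo, y_pred_demo)
print(f"F1: {f1_demo:.4f}")
```

Because F1 ignores true negatives, it is less flattering than accuracy when the positive class is rare, which is the usual situation for each one-vs-rest target.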

In [71]:
ha = ha.copy()
ha['cp_is_3'] = (ha['cp'] == 3).astype(int)

In [72]:
f1_scores = {}

In [78]:
for cp_category in sorted(ha['cp'].unique()):
    y_ovr = (ha['cp'] == cp_category).astype(int)
    X_ovr = ha.drop(['cp', 'output', 'cp_is_3'], axis=1)

    X_train_ovr, X_test_ovr, y_train_ovr, y_test_ovr = train_test_split(
        X_ovr, y_ovr, test_size=0.25, random_state=42, stratify=y_ovr
    )

    log_reg_pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('log_reg', LogisticRegression(
            solver='liblinear', random_state=42, class_weight='balanced'
        ))
    ])
    log_reg_pipe.fit(X_train_ovr, y_train_ovr)
    y_pred_ovr = log_reg_pipe.predict(X_test_ovr)

    score = f1_score(y_test_ovr, y_pred_ovr)
    f1_scores[cp_category] = score
    print(f"F1 Score (cp={cp_category} vs rest): {score}")

F1 Score (cp=0 vs rest): 0.6197183098591549
F1 Score (cp=1 vs rest): 0.32432432432432434
F1 Score (cp=2 vs rest): 0.45614035087719296
F1 Score (cp=3 vs rest): 0.22857142857142856


In [74]:
best_category_ovr = max(f1_scores, key=f1_scores.get)
print(f"The OvR approach was best at distinguishing category '{best_category_ovr}', with an F1 score of {f1_scores[best_category_ovr]}.")

The OvR approach was best at distinguishing category '0', with an F1 score of 0.6197183098591549.


## Q3: OvO

Reduce your dataset to only the `0` and `1` types of chest pain.

Then, fit a Logistic Regression to predict between the two groups, and report the **ROC-AUC**.  

Repeat comparing category `0` to `2` and `3`.  Which pair was the OvO approach best at distinguishing?
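
One way to read the ROC-AUC used here: it equals the probability that a randomly chosen observation from the positive class receives a higher predicted score than a randomly chosen observation from the other class, with ties counted as one half. The sketch below computes it directly from that definition on made-up toy scores; `roc_auc_pairwise` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def roc_auc_pairwise(y_true, scores, positive):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute ROC-AUC.")
    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: labels 0 vs 1 and predicted probabilities for class 1.
y_true_demo = [0, 0, 1, 1, 0, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
auc_demo = roc_auc_pairwise(y_true_demo, scores_demo, positive=1)
print(f"ROC-AUC: {auc_demo:.4f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so pairs can be compared on how far above 0.5 they sit.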

In [75]:
roc_auc_scores = {}

In [76]:
# Pairs to compare: (0, 1), (0, 2), and (0, 3)
for other_category in [1, 2, 3]:
    # Reduce dataset
    ha_ovo = ha[ha['cp'].isin([0, other_category])].copy()
    
    # Define features and target
    X_ovo = ha_ovo.drop(['cp', 'output', 'cp_is_3'], axis=1, errors='ignore')
    y_ovo = ha_ovo['cp']
    
    # Split data
    X_train_ovo, X_test_ovo, y_train_ovo, y_test_ovo = train_test_split(
        X_ovo, y_ovo, test_size=0.25, random_state=42, stratify=y_ovo
    )
    
    # Create and fit a logistic regression pipeline
    log_reg_pipe_ovo = Pipeline([
        ('scaler', StandardScaler()),
        ('log_reg', LogisticRegression(solver='liblinear', random_state=42))
    ])
    log_reg_pipe_ovo.fit(X_train_ovo, y_train_ovo)
    
    # Predict probabilities
    y_pred_proba_ovo = log_reg_pipe_ovo.predict_proba(X_test_ovo)[:, 1]
    
    # Calculate ROC-AUC score
    score = roc_auc_score(y_test_ovo, y_pred_proba_ovo)
    pair_name = f"0 vs {other_category}"
    roc_auc_scores[pair_name] = score
    
    print(f"ROC-AUC for distinguishing between category '0' and '{other_category}': {score}")

ROC-AUC for distinguishing between category '0' and '1': 0.7017045454545454
ROC-AUC for distinguishing between category '0' and '2': 0.8065476190476191
ROC-AUC for distinguishing between category '0' and '3': 0.59375


In [77]:
best_pair_ovo = max(roc_auc_scores, key=roc_auc_scores.get)
print(f"The OvO approach was best at distinguishing the pair '{best_pair_ovo}', with a ROC-AUC score of {roc_auc_scores[best_pair_ovo]}.")

The OvO approach was best at distinguishing the pair '0 vs 2', with a ROC-AUC score of 0.8065476190476191.
