
Our dataset consists of clinical data from patients who entered the hospital complaining of chest pain ("angina") during exercise.  The information collected includes:

* `age` : Age of the patient

* `sex` : Sex of the patient

* `cp` : Chest Pain type

    + Value 0: asymptomatic
    + Value 1: typical angina
    + Value 2: atypical angina
    + Value 3: non-anginal pain
   
    
* `trtbps` : resting blood pressure (in mm Hg)

* `chol` : cholesterol in mg/dl fetched via BMI sensor

* `restecg` : resting electrocardiographic results

    + Value 0: normal
    + Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    + Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* `thalach` : maximum heart rate achieved during exercise

* `output` : the doctor's diagnosis of whether the patient is at risk for a heart attack
    + 0 = not at risk of heart attack
    + 1 = at risk of heart attack

In [24]:
## library imports here
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn import tree

In [25]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")
ha

Unnamed: 0,age,sex,cp,trtbps,chol,restecg,thalach,output
0,63,1,3,145,233,0,150,1
1,37,1,2,130,250,1,187,1
2,56,1,1,120,236,1,178,1
3,57,0,0,120,354,1,163,1
4,57,1,0,140,192,1,148,1
...,...,...,...,...,...,...,...,...
268,59,1,0,164,176,0,90,0
269,57,0,0,140,241,1,123,0
270,45,1,3,110,264,1,132,0
271,68,1,0,144,193,1,141,0


## Q1: Natural Multiclass Models

Fit a multiclass KNN, Decision Tree, and LDA for the heart disease data; this time predicting the type of chest pain (categories 0 - 3) that a patient experiences.  For the decision tree, plot the fitted tree, and interpret the first couple splits.


In [26]:
X = ha.drop("cp", axis=1)
y = ha["cp"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [27]:
knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)
accuracy_knn = accuracy_score(y_test, y_pred_knn)

dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train_scaled, y_train)
y_pred_lda = lda.predict(X_test_scaled)
accuracy_lda = accuracy_score(y_test, y_pred_lda)

## Q2:  OvR

Create a new column in the `ha` dataset called `cp_is_3`, which is equal to `1` if the `cp` variable is equal to `3` and `0` otherwise.

Then, fit a Logistic Regression to predict this new target, and report the **F1 Score**.

Repeat for the other three `cp` categories.  Which category was the OvR approach best at distinguishing?

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_logistic_regression_for_cp_type(ha, cp_value):
    ha[f"cp_is_{cp_value}"] = (ha["cp"] == cp_value).astype(int)
    X = ha.drop(["cp", f'cp_is_{cp_value}'], axis=1)
    y = ha[f"cp_is_{cp_value}"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    lr = LogisticRegression()
    lr.fit(X_train_scaled, y_train)
    return f1_score(y_test, lr.predict(X_test_scaled))

f1_scores_ovr = {cp: fit_logistic_regression_for_cp_type(ha, cp) for cp in range(4)}
print(f1_scores_ovr)

{0: 0.7532467532467534, 1: 0.0, 2: 0.8260869565217391, 3: 1.0}


## Q3: OvO

Reduce your dataset to only the `0` and `1` types of chest pain.

Then, fit a Logistic Regression to predict between the two groups, and report the **ROC-AUC**.  

Repeat comparing category `0` to `2` and `3`.  Which pair was the OvO approach best at distinguishing?

In [29]:
def fit_logistic_regression_ovo(cp1, cp2):
    filtered_data = ha[ha['cp'].isin([cp1, cp2])]
    X = filtered_data.drop('cp', axis=1)
    y = filtered_data['cp']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    lr = LogisticRegression()
    lr.fit(X_train_scaled, y_train)
    y_pred_proba = lr.predict_proba(X_test_scaled)[:, 1]
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    return roc_auc

roc_auc_scores = {(0, 1): fit_logistic_regression_ovo(0, 1),
    (0, 2): fit_logistic_regression_ovo(0, 2),
    (0, 3): fit_logistic_regression_ovo(0, 3)}

# Display results
print("Q1: Multiclass Model Accuracies - KNN:", accuracy_knn, "Decision Tree:", accuracy_dt, "LDA:", accuracy_lda)
print("Q2: OvR F1 Scores:", f1_scores)
print("Q3: OvO ROC-AUC Scores:", roc_auc_scores)

Q1: Multiclass Model Accuracies - KNN: 0.43902439024390244 Decision Tree: 0.35365853658536583 LDA: 0.524390243902439
Q2: OvR F1 Scores: {0: 0.7532467532467534, 1: 0.0, 2: 0.8260869565217391, 3: 1.0}
Q3: OvO ROC-AUC Scores: {(0, 1): 1.0, (0, 2): 1.0, (0, 3): 1.0}
