
Our dataset consists of clinical data from patients who entered the hospital complaining of chest pain ("angina") during exercise.  The information collected includes:

* `age` : Age of the patient

* `sex` : Sex of the patient

* `cp` : Chest Pain type

    + Value 0: asymptomatic
    + Value 1: typical angina
    + Value 2: atypical angina
    + Value 3: non-anginal pain
   
    
* `trtbps` : resting blood pressure (in mm Hg)

* `chol` : cholesterol in mg/dl fetched via BMI sensor

* `restecg` : resting electrocardiographic results

    + Value 0: normal
    + Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    + Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* `thalach` : maximum heart rate achieved during exercise

* `output` : the doctor's diagnosis of whether the patient is at risk for a heart attack
    + 0 = not at risk of heart attack
    + 1 = at risk of heart attack

In [None]:
## library imports here

import pandas as pd
import numpy as np 


from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet , LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.metrics import r2_score,classification_report

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

ImportError: cannot import name 'cross_val_split' from 'sklearn.linear_model' (C:\Users\ldcal\anaconda3\Lib\site-packages\sklearn\linear_model\__init__.py)

In [2]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")

## Q1: Natural Multiclass Models

Fit a multiclass KNN, Decision Tree, and LDA for the heart disease data; this time predicting the type of chest pain (categories 0 - 3) that a patient experiences.  For the decision tree, plot the fitted tree, and interpret the first couple splits.


In [None]:
X1 = ha.drop("cp", axis = 1) 
y1 = ha["cp"]

X_train, X_test, y_train, y_test = train_test_split(
    X1, y1, test_size=0.3, random_state=42)

In [5]:
ct = ColumnTransformer(
  [
    ("dummify", 
    OneHotEncoder(sparse_output = False, handle_unknown='ignore'),
    make_column_selector(dtype_include=object)),],
  remainder = "passthrough"
)

pipeKNN = Pipeline(
  [("preprocessing", ct),
  ("KNN", KNeighborsClassifier())]
)

In [23]:
param_grid_knn = {
    "KNN__n_neighbors": range(1,100)}

In [34]:
grid = GridSearchCV(pipeKNN, param_grid_knn, cv=5, scoring='r2')
grid.fit(X1, y1)
print(grid.best_params_)

{'KNN__n_neighbors': 21}


In [62]:
pipeKNN = Pipeline(
  [("preprocessing", ct),
  ("KNN", KNeighborsClassifier(n_neighbors=21))]
)

In [63]:
pipeKNN.fit(X1,y1)

preds = pipeKNN.predict(X_test)

print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.55      0.94      0.69        36
           1       0.00      0.00      0.00        17
           2       0.47      0.38      0.42        24
           3       0.00      0.00      0.00         5

    accuracy                           0.52        82
   macro avg       0.26      0.33      0.28        82
weighted avg       0.38      0.52      0.43        82



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [82]:
pipeTree = Pipeline(
  [("preprocessing", ct),
  ("Tree", DecisionTreeClassifier())]
)

param_grid_Tree = {
    "Tree__max_depth": [None, 3, 5, 8, 12],
    "Tree__min_samples_split": [2, 5, 10, 20],
    "Tree__min_samples_leaf": [1, 2, 5, 10],
    "Tree__splitter": ["best", "random"],
    "Tree__ccp_alpha": [0.0, 0.0005, 0.001, 0.005]  
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)

tree_gs = GridSearchCV(pipeTree, param_grid_Tree, cv=cv, scoring = "f1_macro", n_jobs = -1)

tree_gs.fit(X1, y1)
print(tree_gs.best_params_)

{'Tree__ccp_alpha': 0.0005, 'Tree__max_depth': 8, 'Tree__min_samples_leaf': 2, 'Tree__min_samples_split': 2, 'Tree__splitter': 'random'}


In [83]:
pipeTree = Pipeline(
  [("preprocessing", ct),
  ("Tree", DecisionTreeClassifier(ccp_alpha=0.005, max_depth=8, min_samples_leaf=2, min_samples_split=2,
  splitter = "random"))]
)

In [98]:
cross_val_score(pipeTree,X1, y1,cv = cv, scoring = "f1_macro").mean()

np.float64(0.3406254716286955)

In [91]:
pipeTree.fit(X_train,y_train)

Treepreds = pipeTree.predict(X_test)

print(classification_report(y_test, Treepreds))

              precision    recall  f1-score   support

           0       0.58      0.86      0.70        36
           1       0.17      0.06      0.09        17
           2       0.43      0.42      0.43        24
           3       0.00      0.00      0.00         5

    accuracy                           0.51        82
   macro avg       0.30      0.33      0.30        82
weighted avg       0.42      0.51      0.45        82



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## Q2:  OvR

Create a new column in the `ha` dataset called `cp_is_3`, which is equal to `1` if the `cp` variable is equal to `3` and `0` otherwise.

Then, fit a Logistic Regression to predict this new target, and report the **F1 Score**.

Repeat for the other three `cp` categories.  Which category was the OvR approach best at distinguishing?

## Q3: OvO

Reduce your dataset to only the `0` and `1` types of chest pain.

Then, fit a Logistic Regression to predict between the two groups, and report the **ROC-AUC**.  

Repeat comparing category `0` to `2` and `3`.  Which pair was the OvO approach best at distinguishing?