
Our dataset consists of clinical data from patients who entered the hospital complaining of chest pain ("angina") during exercise.  The information collected includes:

* `age` : Age of the patient

* `sex` : Sex of the patient

* `cp` : Chest Pain type

    + Value 0: asymptomatic
    + Value 1: typical angina
    + Value 2: atypical angina
    + Value 3: non-anginal pain
   
    
* `trtbps` : resting blood pressure (in mm Hg)

* `chol` : cholesterol in mg/dl fetched via BMI sensor

* `restecg` : resting electrocardiographic results

    + Value 0: normal
    + Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    + Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* `thalach` : maximum heart rate achieved during exercise

* `output` : the doctor's diagnosis of whether the patient is at risk for a heart attack
    + 0 = not at risk of heart attack
    + 1 = at risk of heart attack
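
The integer codes above are easier to read in tables and plots once they are mapped to labels. A minimal sketch, assuming the codings listed above; the names `CP_LABELS`, `RESTECG_LABELS`, `OUTPUT_LABELS`, and `decode_record` are illustrative and not part of the dataset:

```python
# Label lookups transcribed from the variable list above.
CP_LABELS = {
    0: "asymptomatic",
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
}
RESTECG_LABELS = {
    0: "normal",
    1: "ST-T wave abnormality",
    2: "probable or definite left ventricular hypertrophy",
}
OUTPUT_LABELS = {0: "not at risk", 1: "at risk"}


def decode_record(record):
    """Return a copy of one patient record with coded fields replaced by labels."""
    decoded = dict(record)
    decoded["cp"] = CP_LABELS[record["cp"]]
    decoded["restecg"] = RESTECG_LABELS[record["restecg"]]
    decoded["output"] = OUTPUT_LABELS[record["output"]]
    return decoded


# Example record using the values of the first row of the data.
first_patient = {"age": 63, "sex": 1, "cp": 3, "trtbps": 145, "chol": 233,
                 "restecg": 0, "thalach": 150, "output": 1}
decoded_patient = decode_record(first_patient)
print(decoded_patient["cp"], "|", decoded_patient["restecg"], "|", decoded_patient["output"])
```

On a pandas DataFrame the same lookup can be applied column-wise, for example `ha["cp"].map(CP_LABELS)`.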

In [None]:
## library imports here

from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import (
    FunctionTransformer,
    StandardScaler,
    OneHotEncoder,
    PolynomialFeatures,
    label_binarize,
)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score,
    auc,
    classification_report,
    cohen_kappa_score,
    confusion_matrix,
    f1_score,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)
import pandas as pd
import numpy as np
from plotnine import *
from math import *

In [None]:
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")
ha

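|     | age | sex | cp  | trtbps | chol | restecg | thalach | output |
|-----|-----|-----|-----|--------|------|---------|---------|--------|
| 0   | 63  | 1   | 3   | 145    | 233  | 0       | 150     | 1      |
| 1   | 37  | 1   | 2   | 130    | 250  | 1       | 187     | 1      |
| 2   | 56  | 1   | 1   | 120    | 236  | 1       | 178     | 1      |
| 3   | 57  | 0   | 0   | 120    | 354  | 1       | 163     | 1      |
| 4   | 57  | 1   | 0   | 140    | 192  | 1       | 148     | 1      |
| ... | ... | ... | ... | ...    | ...  | ...     | ...     | ...    |
| 268 | 59  | 1   | 0   | 164    | 176  | 0       | 90      | 0      |
| 269 | 57  | 0   | 0   | 140    | 241  | 1       | 123     | 0      |
| 270 | 45  | 1   | 3   | 110    | 264  | 1       | 132     | 0      |
| 271 | 68  | 1   | 0   | 144    | 193  | 1       | 141     | 0      |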


## Q1: Natural Multiclass Models

Fit a multiclass KNN, Decision Tree, and LDA to the heart disease data, this time predicting the type of chest pain (categories 0 - 3) that a patient experiences.  For the decision tree, plot the fitted tree and interpret the first couple of splits.


In [None]:
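# Predictors: every column except the chest pain type, which is the response
X = ha.drop(["cp"], axis = 1)
y = ha["cp"]

# test_size=0.80 holds out 80% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.80, random_state = 1)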
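
With four chest pain classes of unequal size, a purely random split can leave a rare class under-represented on one side. A hedged sketch of a stratified, reproducible split on synthetic data; `X_demo`, `y_demo`, the class sizes, and the 20% test fraction are assumptions chosen for illustration, not the notebook's settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 rows, 3 numeric features, 4 imbalanced classes.
X_demo = rng.normal(size=(200, 3))
y_demo = np.repeat([0, 1, 2, 3], [100, 40, 40, 20])

# stratify=y_demo keeps each class's share the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train class counts:", train_counts)
print("test class counts: ", test_counts)
```

Applied to the notebook's data, the equivalent call would pass `stratify=y` to `train_test_split`.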

#### Preprocessing Step

In [None]:
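ct = ColumnTransformer(
    [
        # numeric columns pass through unchanged
        ("keep", FunctionTransformer(lambda x: x), ["age", "trtbps", "chol", "thalach"]),
        # one-hot encode the coded categorical columns; ignore categories unseen in training
        ("dummify", OneHotEncoder(sparse_output = False, handle_unknown = "ignore"), ["sex", "restecg", "output"]),
    ],
    remainder = "passthrough"
).set_output(transform = "pandas")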



#### KNN

In [None]:
knn_pipeline_1 = Pipeline(
  [("preprocessing", ct),
  ("knn", KNeighborsClassifier(n_neighbors=20))]
)

In [None]:
knn_pipeline_fitted = knn_pipeline_1.fit(X_train, y_train)

#### Decision Tree

In [None]:
decision_tree_pipeline = Pipeline(
  [("preprocessing", ct),
  # this is a classifier, so the step is named accordingly
  ("decision_tree", DecisionTreeClassifier(min_samples_leaf = 30))]
)

In [None]:
decision_tree_pipeline_fitted = decision_tree_pipeline.fit(X_train, y_train)
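
The question asks for the fitted tree to be plotted and its first splits interpreted. One text-only way to inspect the splits is `sklearn.tree.export_text`; the sketch below runs it on synthetic data, where the features `age` and `thalach` and the rule that generates the labels are assumptions made up for the example:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 300

# Synthetic predictors loosely resembling age and maximum heart rate.
age = rng.integers(30, 78, size=n)
thalach = rng.integers(90, 200, size=n)
X_demo = np.column_stack([age, thalach])

# Invented 4-class label driven by two thresholds, so the tree has something to find.
y_demo = np.where(thalach > 150, 1, 0) + np.where(age > 60, 2, 0)

tree = DecisionTreeClassifier(min_samples_leaf=30, random_state=0).fit(X_demo, y_demo)

# Print only the first two levels of splits as indented if/else rules.
rules = export_text(tree, feature_names=["age", "thalach"], max_depth=2)
print(rules)
```

On the notebook's fitted pipeline, the tree itself is the last step (`decision_tree_pipeline_fitted[-1]`), and the feature names would be the column names produced by the preprocessing step. `sklearn.tree.plot_tree` draws the same structure graphically when matplotlib is available.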

#### LDA

In [None]:
lda_pipeline = Pipeline(
  [("preprocessing", ct),
  ("lda_model", LinearDiscriminantAnalysis())]
)

In [None]:
lda_pipeline_fitted = lda_pipeline.fit(X_train, y_train)
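
The three fitted models can be compared on held-out data with accuracy and a confusion matrix. A minimal sketch on a synthetic four-class problem; the data from `make_classification`, the 25% test fraction, and the names `models` and `results` are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the chest pain problem.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# Same model families and hyperparameters as the pipelines above.
models = {
    "knn": KNeighborsClassifier(n_neighbors=20),
    "decision_tree": DecisionTreeClassifier(min_samples_leaf=30, random_state=0),
    "lda": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, preds),
        # rows are true classes, columns are predicted classes
        "confusion": confusion_matrix(y_te, preds, labels=[0, 1, 2, 3]),
    }
    print(name, round(results[name]["accuracy"], 3))
```

The off-diagonal cells of each confusion matrix show which pairs of classes a model tends to mix up.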

## Q2:  OvR

Create a new column in the `ha` dataset called `cp_is_3`, which is equal to `1` if the `cp` variable is equal to `3` and `0` otherwise.

Then, fit a Logistic Regression to predict this new target, and report the **F1 Score**.

Repeat for the other three `cp` categories.  Which category was the OvR approach best at distinguishing?
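
For reference, the F1 score of a class is the harmonic mean of precision and recall, which simplifies to 2·TP / (2·TP + FP + FN). The cells below pass `average='macro'`, which averages the F1 of the `1` class and the `0` class. A small worked example in plain Python on invented labels (`y_true_demo` and `y_pred_demo` are made up) shows how the two can differ:

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented one-vs-rest labels: 3 positives among 10 patients.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

f1_positive = f1_for_class(y_true_demo, y_pred_demo, positive=1)  # TP=1, FP=1, FN=2
f1_negative = f1_for_class(y_true_demo, y_pred_demo, positive=0)  # TP=6, FP=2, FN=1
f1_macro = (f1_positive + f1_negative) / 2

print(round(f1_positive, 3), round(f1_negative, 3), round(f1_macro, 3))
```

Here the positive-class F1 is 0.4 while the macro average is 0.6, because the large negative class is predicted well. For an imbalanced one-vs-rest target, the macro figure can therefore look better than the model's ability to find the class of interest.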

In [None]:
ha['cp_is_3'] = np.where(ha['cp'] == 3, 1, 0)
ha['cp_is_2'] = np.where(ha['cp'] == 2, 1, 0)
ha['cp_is_1'] = np.where(ha['cp'] == 1, 1, 0)
ha['cp_is_0'] = np.where(ha['cp'] == 0, 1, 0)

ha.head()

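|     | age | sex | cp  | trtbps | chol | restecg | thalach | output | cp_is_3 | cp_is_2 | cp_is_1 | cp_is_0 |
|-----|-----|-----|-----|--------|------|---------|---------|--------|---------|---------|---------|---------|
| 0   | 63  | 1   | 3   | 145    | 233  | 0       | 150     | 1      | 1       | 0       | 0       | 0       |
| 1   | 37  | 1   | 2   | 130    | 250  | 1       | 187     | 1      | 0       | 1       | 0       | 0       |
| 2   | 56  | 1   | 1   | 120    | 236  | 1       | 178     | 1      | 0       | 0       | 1       | 0       |
| 3   | 57  | 0   | 0   | 120    | 354  | 1       | 163     | 1      | 0       | 0       | 0       | 1       |
| 4   | 57  | 1   | 0   | 140    | 192  | 1       | 148     | 1      | 0       | 0       | 0       | 1       |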


#### CP3

In [None]:
X = ha.drop(["cp", "cp_is_3", "cp_is_2", "cp_is_1", "cp_is_0"], axis = 1)
y = ha["cp_is_3"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.80, random_state = 1)

In [None]:
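# One-hot encode any string or categorical columns; all other columns pass through.
# The coded columns (sex, restecg, output) are still integers here, so they pass through as numbers.
ct = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
         make_column_selector(dtype_include=['category', 'object']))
    ],
    remainder="passthrough"
)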

In [None]:
logreg_pipeline = Pipeline(
  [("preprocessing", ct),
  ("log_regression", LogisticRegression())]
)

In [None]:
logreg_pipeline_fitted_3 = logreg_pipeline.fit(X_train, y_train)
y_preds_3 = logreg_pipeline_fitted_3.predict(X_test)
# Macro-averaged F1: the mean of the F1 scores for the 0 and 1 classes
f1_cp3 = f1_score(y_test, y_preds_3, average='macro')
f1_cp3

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.5384031700531152
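
The convergence warning above suggests either raising `max_iter` or scaling the data. A hedged sketch of both remedies in one pipeline, fitted on synthetic features whose scales loosely resemble age, blood pressure, and cholesterol; the data, the label rule, and the name `scaled_logreg` are assumptions, and applying this to the notebook's data would change the reported scores:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 200

# Synthetic predictors on very different scales.
X_demo = np.column_stack([
    rng.normal(55, 9, n),     # age-like
    rng.normal(130, 17, n),   # blood-pressure-like
    rng.normal(245, 50, n),   # cholesterol-like
])
# Invented binary target that depends noisily on the first column.
y_demo = (X_demo[:, 0] + rng.normal(0, 9, n) > 55).astype(int)

scaled_logreg = Pipeline([
    ("scale", StandardScaler()),                            # zero mean, unit variance
    ("log_regression", LogisticRegression(max_iter=1000)),  # generous iteration budget
])
scaled_logreg.fit(X_demo, y_demo)

# Number of solver iterations actually used; well under the limit means it converged.
n_iterations = scaled_logreg.named_steps["log_regression"].n_iter_[0]
print(n_iterations)
```

Standardizing puts all predictors on a comparable scale, which typically lets the default solver converge in far fewer iterations.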

#### CP2

In [None]:
X = ha.drop(["cp", "cp_is_3", "cp_is_2", "cp_is_1", "cp_is_0"], axis = 1)
y = ha["cp_is_2"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.80, random_state = 1)

In [None]:
logreg_pipeline_fitted_2 = logreg_pipeline.fit(X_train, y_train)
y_preds_2 = logreg_pipeline_fitted_2.predict(X_test)
f1_cp2 = f1_score(y_test, y_preds_2, average='macro')
f1_cp2

0.4963839360106649

#### CP1

In [None]:
X = ha.drop(["cp", "cp_is_3", "cp_is_2", "cp_is_1", "cp_is_0"], axis = 1)
y = ha["cp_is_1"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.80, random_state = 1)

In [None]:
logreg_pipeline_fitted_1 = logreg_pipeline.fit(X_train, y_train)
y_preds_1 = logreg_pipeline_fitted_1.predict(X_test)
f1_cp1 = f1_score(y_test, y_preds_1, average='macro')
f1_cp1

0.4525

#### CP0

In [None]:
X = ha.drop(["cp", "cp_is_3", "cp_is_2", "cp_is_1", "cp_is_0"], axis = 1)
y = ha["cp_is_0"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.80, random_state = 1)

In [None]:
logreg_pipeline_fitted_0 = logreg_pipeline.fit(X_train, y_train)
y_preds_0 = logreg_pipeline_fitted_0.predict(X_test)
f1_cp0 = f1_score(y_test, y_preds_0, average='macro')
f1_cp0

0.618970757782203

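Comparing the four macro-averaged F1 scores (0.62 for `cp` = 0, 0.54 for `cp` = 3, 0.50 for `cp` = 2, and 0.45 for `cp` = 1), the OvR logistic regression was best at distinguishing chest pain type 0 (asymptomatic) from the rest.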
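
The four one-vs-rest fits follow the same recipe, so they can be expressed as a loop. A minimal sketch on synthetic data; the helper `ovr_f1_scores`, the added `StandardScaler`, the 25% test fraction, and the `make_classification` data are all assumptions for illustration, not the notebook's exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def ovr_f1_scores(X, y, classes, test_size=0.25, random_state=1):
    """Fit one 'class k vs rest' logistic regression per class; return macro F1 per class."""
    scores = {}
    for k in classes:
        y_binary = (y == k).astype(int)  # 1 if the row belongs to class k, else 0
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y_binary, test_size=test_size, stratify=y_binary, random_state=random_state
        )
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        preds = model.fit(X_tr, y_tr).predict(X_te)
        scores[k] = f1_score(y_te, preds, average="macro")
    return scores


# Synthetic 4-class data standing in for the chest pain types.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
ovr_scores = ovr_f1_scores(X_syn, y_syn, classes=[0, 1, 2, 3])
best_class = max(ovr_scores, key=ovr_scores.get)
print(ovr_scores, "best:", best_class)
```

Building a fresh pipeline inside the loop keeps each class's fitted model independent of the others.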

## Q3: OvO

Reduce your dataset to only the `0` and `1` types of chest pain.

Then, fit a Logistic Regression to predict between the two groups, and report the **ROC-AUC**.  

Repeat, comparing category `0` to categories `2` and `3` in turn.  Which pair was the OvO approach best at distinguishing?

In [None]:
# reset
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")
ha['sex'] = ha['sex'].astype(str)
ha['cp'] = ha['cp'].astype(str)
ha['restecg'] = ha['restecg'].astype(str)
ha['output'] = ha['output'].astype(str)

# 0 vs 1
ha01 = ha[(ha["cp"] == "0") | (ha["cp"] == "1")]

X = ha01.drop(["cp"], axis = 1) # drop the response variable
y = ha01["cp"]

logisticPipeline = Pipeline(
    [("preprocessing", ct),
     ("logistic_regression", LogisticRegression())]
)

logistic_model_fitted = logisticPipeline.fit(X,y)

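# In-sample ROC AUC: scored on the same rows used for fitting.
# Column 1 of predict_proba is the probability of the second class ("1").
y_prob = logisticPipeline.predict_proba(X)[:, 1]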
roc_auc_score(y, y_prob)

0.8442826704545455

In [None]:
# reset
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")
ha['sex'] = ha['sex'].astype(str)
ha['cp'] = ha['cp'].astype(str)
ha['restecg'] = ha['restecg'].astype(str)
ha['output'] = ha['output'].astype(str)

# 0 vs 2
ha02 = ha[(ha["cp"] == "0") | (ha["cp"] == "2")]

X = ha02.drop(["cp"], axis = 1) # drop the response variable
y = ha02["cp"]

logisticPipeline = Pipeline(
    [("preprocessing", ct),
     ("logistic_regression", LogisticRegression())]
)

logistic_model_fitted = logisticPipeline.fit(X,y)

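# In-sample ROC AUC: scored on the same rows used for fitting.
# Column 1 of predict_proba is the probability of the second class ("2").
y_prob = logisticPipeline.predict_proba(X)[:, 1]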
roc_auc_score(y, y_prob)

0.8080632716049382

In [None]:
# reset
ha = pd.read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")
ha['sex'] = ha['sex'].astype(str)
ha['cp'] = ha['cp'].astype(str)
ha['restecg'] = ha['restecg'].astype(str)
ha['output'] = ha['output'].astype(str)

# 0 vs 3
ha03 = ha[(ha["cp"] == "0") | (ha["cp"] == "3")]

X = ha03.drop(["cp"], axis = 1) # drop the response variable
y = ha03["cp"]

logisticPipeline = Pipeline(
  [("preprocessing", ct),
   ("logistic_regression", LogisticRegression())]
)

logistic_model_fitted = logisticPipeline.fit(X,y)

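# In-sample ROC AUC: scored on the same rows used for fitting.
# Column 1 of predict_proba is the probability of the second class ("3").
y_prob = logisticPipeline.predict_proba(X)[:, 1]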
roc_auc_score(y, y_prob)
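
The three pairwise comparisons can also be written as one loop over the categories compared against `0`. The sketch below is a variation rather than a copy of the cells above: it runs on synthetic data and scores ROC-AUC on a held-out split instead of on the fitting rows. The helper `ovo_auc`, the added `StandardScaler`, and the 25% test fraction are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def ovo_auc(X, y, class_a, class_b, test_size=0.25, random_state=1):
    """Keep only rows of class_a or class_b, fit a logistic regression, return held-out ROC-AUC."""
    mask = (y == class_a) | (y == class_b)
    X_pair = X[mask]
    y_pair = (y[mask] == class_b).astype(int)  # class_b is treated as the positive class
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_pair, y_pair, test_size=test_size, stratify=y_pair, random_state=random_state
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    probs = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    return roc_auc_score(y_te, probs)


# Synthetic 4-class data standing in for the chest pain types.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

pair_aucs = {(0, other): ovo_auc(X_syn, y_syn, 0, other) for other in (1, 2, 3)}
best_pair = max(pair_aucs, key=pair_aucs.get)
print(pair_aucs, "best:", best_pair)
```

In-sample AUC values such as those reported above tend to be optimistic, so a held-out or cross-validated estimate is a more conservative basis for choosing the best-separated pair.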