#### PREDICTING HEART DISEASE

**Data Dictionary**:

- age: in years

- sex: (1 = male; 0 = female)

- cp: chest pain type

- restbps: resting blood pressure (in mm Hg on admission to the hospital)

- chol: serum cholesterol in mg/dl

- fbsP: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

- restecg: resting electrocardiographic results

- thalach: maximum heart rate achieved

- exang: exercise induced angina (1 = yes; 0 = no)

- oldpeak: ST depression induced by exercise relative to rest

- slope: the slope of the peak exercise ST segment

- ca: number of major vessels (0-3) colored by fluoroscopy

- thal: 3 = normal; 2 = fixed defect; 1 = reversible defect

- target: (1 = heart disease; 0 = no heart disease)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression


import warnings
warnings.simplefilter(action='ignore')

In [None]:
df= pd.read_csv('datasets/heart_2.csv')

In [None]:
df.head()

In [None]:
# thal == 0 is not a category in the data dictionary; recode it as 2
df['thal'] = df['thal'].replace(0, 2)

In [None]:
dfd = pd.get_dummies(df, columns = ['cp', 'thal'], drop_first=True)

In [None]:
X = dfd[['thal_2','thalach', 'slope', 'cp_2', 'cp_1', 'thal_3','exang', 'oldpeak', 'ca', 'sex']]
y = dfd['target']

In [None]:
y.value_counts()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify = y)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

In [None]:
ss = StandardScaler()

Xs_train  = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)

In [None]:
logreg = LogisticRegression()

In [None]:
cross_val_score(logreg, Xs_train, y_train).mean()

In [None]:
logreg.fit(Xs_train, y_train)

In [None]:
logreg.score(Xs_train, y_train), logreg.score(Xs_test, y_test)

### Let's streamline this with a Pipeline...

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
pipe = Pipeline([
    ('pf', PolynomialFeatures()),
    ('ss', StandardScaler()),
    ('lr', LogisticRegression(solver='liblinear'))])  # liblinear supports both the 'l1' and 'l2' penalties
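As a rough illustration of how a pipeline like this gets fit and scored, here is a minimal sketch. It runs on synthetic data from `make_classification` rather than the heart dataset, and the names `demo_pipe`, `X_demo`, and `y_demo` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): 300 rows, 10 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

demo_pipe = Pipeline([
    ('pf', PolynomialFeatures()),
    ('ss', StandardScaler()),
    ('lr', LogisticRegression(max_iter=1000)),
])

# fit() calls fit_transform on each transformer in order, then fits the final estimator
demo_pipe.fit(Xd_train, yd_train)

# score() pushes the data through the fitted transformers before scoring (accuracy here)
train_acc = demo_pipe.score(Xd_train, yd_train)
test_acc = demo_pipe.score(Xd_test, yd_test)
print(train_acc, test_acc)
```

A train score noticeably higher than the test score is the usual sign of overfitting.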

#### So here we can conclude that our model does pretty well but is a bit overfit... Maybe we can grid search to try to improve it?

### GridSearch with Pipeline Syntax

`GridSearchCV` accepts a `Pipeline` object as its estimator, along with a param grid.

Each key in the param grid is the `string_name` of a step in your pipeline, followed by a dunder `__` and the argument name for that particular step. Each key maps to an iterable of values to search over (generally a list or a range-style object).
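One way to check which keys are valid is to ask the pipeline itself. The sketch below rebuilds a pipeline with the same step names under the hypothetical name `name_demo` and lists its `step__argument` keys.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

name_demo = Pipeline([
    ('pf', PolynomialFeatures()),
    ('ss', StandardScaler()),
    ('lr', LogisticRegression()),
])

# Every tunable argument is exposed as '<step name>__<argument name>'
valid_keys = sorted(k for k in name_demo.get_params() if '__' in k)
print(valid_keys)

# set_params uses the same naming convention that GridSearchCV relies on
name_demo.set_params(pf__degree=3, lr__C=0.1)
print(name_demo.named_steps['pf'].degree, name_demo.named_steps['lr'].C)
```

A key that is not in this list (a typo in the step name, say) makes the search raise an error when it is fit.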

In [None]:
params = {
    'pf__degree': [1, 2, 3],
    'lr__penalty': ['l1', 'l2'],
    'lr__class_weight' : ['balanced', None],
    'lr__C' : [0.001, 0.01, 0.1, 1.0, 2.0]
}

In [None]:
gs = GridSearchCV(pipe, params, cv=3, scoring='recall')
gs.fit(X_train, y_train)  # unscaled X_train: the pipeline scales inside each CV fold
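After a search has been fit, the winning combination and its cross-validated score can be inspected. Here is a small sketch on synthetic data with a deliberately tiny grid. The names `gs_demo`, `gs_pipe`, and `small_grid` are made up for this example, and `solver='liblinear'` is chosen here because it accepts both penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), not the heart dataset
X_gs, y_gs = make_classification(n_samples=200, n_features=6, random_state=0)

gs_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('lr', LogisticRegression(solver='liblinear')),
])

# 2 penalties x 2 values of C = 4 candidate settings
small_grid = {'lr__penalty': ['l1', 'l2'], 'lr__C': [0.1, 1.0]}

gs_demo = GridSearchCV(gs_pipe, small_grid, cv=3, scoring='recall')
gs_demo.fit(X_gs, y_gs)

print(gs_demo.best_params_)   # the combination with the best mean CV recall
print(gs_demo.best_score_)    # that mean CV recall
n_candidates = len(gs_demo.cv_results_['params'])
print(n_candidates)
```

By default the search refits the best pipeline on all the data passed to `fit`, so `gs_demo.predict` and `gs_demo.score` use that refit model.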

#### This is looking decent... Let's look at our other evaluation metrics.

### Classification Metrics: Accuracy Score, Confusion Matrix, Classification Report

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [None]:
# take our predictions from the fitted GridSearchCV (it predicts with the best pipeline it found)

y_hat = gs.predict(X_test)

In [None]:
# let's create a confusion matrix (rows = actual class, columns = predicted class)
cm = confusion_matrix(y_test, y_hat)
cm

In [None]:
# let's make the confusion matrix slightly less confusing by adding labels and making it a dataframe...

confusion = pd.DataFrame(cm,
                         index=['Actual_Negative', 'Actual_Positive'],
                         columns=['Predicted_Negative', 'Predicted_Positive'])
confusion

#### What is the Sensitivity?

In [None]:
# sensitivity = TP / (TP + FN)
# cm is laid out as [[TN, FP], [FN, TP]]

sensitivity = cm[1, 1] / (cm[1, 1] + cm[1, 0])
sensitivity

# sensitivity == True Positive Rate == Recall
# How can we interpret this score?

#### What is the Specificity?

In [None]:
# specificity = TN / (TN + FP)  (True Negative Rate)
specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])
specificity

# How do we interpret this score?

#### What is the Precision?

In [None]:
# precision = TP / (TP + FP)  (Positive Predictive Value)
precision = cm[1, 1] / (cm[1, 1] + cm[0, 1])
precision

# How do we interpret this score?
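The three ratios above can be sanity-checked on a small hand-made example. The sketch below uses made-up labels (`y_true_demo`, `y_pred_demo`) and compares the hand calculations with scikit-learn's `recall_score` and `precision_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 4 actual positives, 6 actual negatives
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# For binary 0/1 labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

sens_demo = tp / (tp + fn)   # 3 / 4 = 0.75
spec_demo = tn / (tn + fp)   # 4 / 6 ≈ 0.667
prec_demo = tp / (tp + fp)   # 3 / 5 = 0.6

print(sens_demo, spec_demo, prec_demo)

# The built-in scorers agree with the hand calculation
print(recall_score(y_true_demo, y_pred_demo), precision_score(y_true_demo, y_pred_demo))
```

Sensitivity and specificity are each computed within one actual class, while precision is computed within the predicted positives, which is why it shifts with how common the positive class is.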

#### What is the Misclassification rate?

In [None]:
accuracy = accuracy_score(y_test, y_hat)

In [None]:
# misclassification rate

misclass = 1 - accuracy
misclass


In [None]:
#let's make a classification report

print(classification_report(y_test, y_hat))

### Brier Score:

In [None]:
from sklearn.metrics import brier_score_loss

Brier Score: A Brier score measures the accuracy of a probability forecast. It is the mean squared difference between the predicted probabilities and the actual 0/1 outcomes.

- The best possible Brier score is 0, for total accuracy.
- The worst possible score is 1, which means the probability forecast (the predicted probabilities) was entirely inaccurate.
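To make the two extremes concrete, here is a minimal sketch with made-up outcomes and probabilities (`y_true_b`, `p_b`). It computes the mean squared difference by hand and compares it with `brier_score_loss`.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Made-up outcomes and predicted probabilities of the positive class
y_true_b = np.array([1, 0, 1, 0])
p_b = np.array([0.9, 0.2, 0.6, 0.5])

# Mean squared difference: (0.01 + 0.04 + 0.16 + 0.25) / 4 = 0.115
manual_brier = np.mean((p_b - y_true_b) ** 2)
sk_brier = brier_score_loss(y_true_b, p_b)
print(manual_brier, sk_brier)

# Perfectly confident and correct -> 0; perfectly confident and wrong -> 1
best_brier = brier_score_loss(y_true_b, np.array([1.0, 0.0, 1.0, 0.0]))
worst_brier = brier_score_loss(y_true_b, np.array([0.0, 1.0, 0.0, 1.0]))
print(best_brier, worst_brier)
```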

#### `brier_score_loss` takes three main arguments: our y_true, our y_probs, and which label is considered the positive class

In [None]:
y_probs = gs.predict_proba(X_test)[:, 1]  # all rows, second column: the predicted probability of class 1

In [None]:
brier_score_loss(y_test, y_probs, pos_label=1)

#### ROC/AUC Score

In [None]:
from sklearn.metrics import roc_auc_score
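As a quick illustration of what `roc_auc_score` expects, here is a sketch on four made-up observations (`y_true_r`, `p_r`). Like the Brier score, it takes the predicted probabilities of the positive class rather than hard 0/1 predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up outcomes and predicted probabilities of the positive class
y_true_r = np.array([0, 0, 1, 1])
p_r = np.array([0.1, 0.4, 0.35, 0.8])

# AUC is the share of (positive, negative) pairs where the positive gets the
# higher probability: 3 of the 4 pairs here, so 0.75
auc_demo = roc_auc_score(y_true_r, p_r)
print(auc_demo)

# roc_curve returns the false/true positive rates at each candidate threshold
fpr, tpr, thresholds = roc_curve(y_true_r, p_r)
print(fpr, tpr)
```

An AUC of 0.5 corresponds to ranking positives and negatives no better than chance, and 1.0 to ranking every positive above every negative.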

