# Evaluating and Tuning a Binary Classification Model

## Goals

After this lesson, you should be able to:

- Build and explain confusion matrices from a model output
- Calculate various binary classification metrics
- Explain the AUC/ROC curve, why it matters, and how to use it
- Understand when and how to optimize a model for various metrics
- Optimize a classification model based on costs

### Category definitions - possible outcomes in binary classification
 
- **TP** = True Positive (class 1 correctly classified as class 1), e.g. a patient with cancer tests positive for cancer
- **TN** = True Negative (class 0 correctly classified as class 0), e.g. a patient without cancer tests negative for cancer
- **FP** = False Positive (class 0 incorrectly classified as class 1), e.g. a patient without cancer tests positive for cancer
- **FN** = False Negative (class 1 incorrectly classified as class 0), e.g. a patient with cancer tests negative for cancer
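
The four outcomes above can be counted directly from a list of actual labels and a list of predicted labels. The sketch below is a minimal plain-Python illustration; the helper name `confusion_counts` and the eight example labels are made up for this example and do not come from the heart dataset.

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # positive correctly flagged
        elif a == 0 and p == 0:
            tn += 1          # negative correctly cleared
        elif a == 0 and p == 1:
            fp += 1          # negative wrongly flagged
        else:
            fn += 1          # positive missed (a == 1, p == 0)
    return tp, tn, fp, fn


# Hypothetical labels for eight patients (1 = has the condition)
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)
```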

### $$ \text{Possible misclassifications} $$

![Type 1 vs. Type 2 Error](images/type-1-type-2.jpg)

## Let's run a model and look at some metrics 

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('./data/heart.csv')

In [None]:
df.head()

In [None]:
df['target'].value_counts(normalize = True)

In [None]:
X = df.drop('target', axis = 1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 77, stratify = y, test_size = .5)

In [None]:
y_train.mean(), y_test.mean()

In [None]:
rf = RandomForestClassifier(n_estimators = 100, random_state = 77)
rf.fit(X_train, y_train)

In [None]:
rf.score(X_test, y_test)

### The Default Measure (in most prebuilt models) - Accuracy

$$ \frac{(TP + TN)}{(TP + FP + TN + FN)} $$

#### We got an accuracy score of .842, but what does that tell us? Only that we are correct 84.2% of the time. It says nothing about which kinds of predictions we get right or which kinds of errors we make
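
To make that point concrete, here is a small sketch with two hypothetical models that score the same accuracy on the same 100 patients while making very different mistakes. The counts are invented for illustration and are not results from the random forest above.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts: both models see 42 sick and 58 healthy patients
model_a = dict(tp=40, tn=44, fp=14, fn=2)   # rarely misses a sick patient
model_b = dict(tp=28, tn=56, fp=2, fn=14)   # misses many sick patients

acc_a = accuracy(**model_a)
acc_b = accuracy(**model_b)
print(acc_a, acc_b)
```

Both models are right on 84 of 100 patients, yet model B sends home seven times as many sick patients as model A. Accuracy alone cannot tell them apart.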

In [None]:
predictions = rf.predict(X_test)
actual = y_test

In [None]:
confusion_matrix(actual, predictions)

#### My eyes!!!

In [None]:
pd.DataFrame(confusion_matrix(actual, predictions), columns = ['predicted 0', 'predicted 1'], 
             index = ['actual 0', 'actual 1'])

#### We got more false negatives than false positives. What would we likely prefer in the case of this dataset?

## Other metrics

### Misclassification Rate
#### $$ 1 - \text{accuracy} $$ 

### $$ {OR} $$

#### $$ \frac{FP + FN}{TP + FP + TN + FN} $$

### Sensitivity (AKA True Positive Rate, Recall, and Probability of Detection)

$$ \frac{TP}{TP + FN} $$

### Specificity (AKA True Negative Rate)

$$ \frac{TN}{TN + FP} $$

### Precision (AKA Positive Predictive Value)

$$ \frac{TP}{TP + FP} $$

### False Positive Rate

$$ \frac{FP}{FP + TN} $$

OR

#### 1 - Specificity

### Negative Predictive Value

$$ \frac{TN}{TN + FN} $$

### F1 Score

### $$ 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

#### Useful with imbalanced classes where the Negative class is the majority class

### Balanced Accuracy

### $$ \frac{\text{Sensitivity} + \text{Specificity}}{2} $$

#### Useful with imbalanced classes where the Positive class is the majority class
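
All of the formulas in this section depend only on the four confusion-matrix counts, so they fit in one small function. This is a sketch under simple assumptions: the name `binary_metrics` and the example counts are hypothetical, and there is no guard against a zero denominator (for instance, a model that never predicts class 1).

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the metrics defined above from confusion-matrix counts."""
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)          # recall / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "false_positive_rate": fp / (fp + tn),
        "negative_predictive_value": tn / (tn + fn),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts: 50 actual positives and 50 actual negatives
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```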

## All the Binary Classification Metrics

![classification metrics](./images/conf_matrix_classification_metrics.png)

### Which of these metrics would we want to optimize for in a heart disease detection algorithm?

False Positives and False Negatives each have some cost associated with them.

### Let's figure out how to optimize!

#### Remember that Random Forest gives probability predictions for each class, in addition to the final classification. By default, scikit-learn averages the class probabilities across the trees and predicts the class with the highest average (a 0.5 cutoff when there are two classes), but we can adjust that threshold
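
To see what moving the threshold does, here is a toy sketch in plain Python. The probabilities, the labels, and the `classify` helper are made up for illustration; they are not output from the random forest.

```python
def classify(prob_positive, threshold=0.5):
    """Predict class 1 when the class-1 probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in prob_positive]


def count_errors(actual, predicted):
    """Return (false_positives, false_negatives)."""
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return fp, fn


# Hypothetical class-1 probabilities and true labels for six patients
probs = [0.05, 0.30, 0.45, 0.55, 0.70, 0.95]
actual = [0, 0, 1, 0, 1, 1]

default_errors = count_errors(actual, classify(probs))                 # threshold 0.5
lenient_errors = count_errors(actual, classify(probs, threshold=0.4))  # flags more patients
print(default_errors, lenient_errors)
```

Lowering the threshold from 0.5 to 0.4 removes the false negative in this toy data and keeps the single false positive. On real data, a lower threshold usually trades fewer false negatives for more false positives.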

In [None]:
# Column 0 of predict_proba is P(class 0); predict class 1 when it is at most .49
predicts = []
for item in rf.predict_proba(X_test):
    if item[0] <= .49:
        predicts.append(1)
    else:
        predicts.append(0)

In [None]:
conf_matrix = pd.DataFrame(confusion_matrix(y_test, predicts), index = ['actual 0', 'actual 1'], 
             columns = ['predicted 0', 'predicted 1'])
conf_matrix

In [None]:
## accuracy

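# Use label-based .loc lookups; positional [0] on a labeled Series is deprecated in recent pandas
(conf_matrix.loc['actual 0', 'predicted 0'] + conf_matrix.loc['actual 1', 'predicted 1']) / len(predicts)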

### The AUC / ROC curve (Area Under Curve of the Receiver Operating Characteristic)

In [None]:
import matplotlib.pyplot as plt

In [None]:
x_list = []
y_list = []

for x in np.linspace(0, 1, 100):
    
    # Same predictions based on predict_proba thresholds
    predicts = []
    
    for item in rf.predict_proba(X_test):
        if item[0] <= x:
            predicts.append(1)
        else:
            predicts.append(0)

    conf_matrix = pd.DataFrame(confusion_matrix(y_test, predicts), index = ['actual 0', 'actual 1'], 
                     columns = ['predicted 0', 'predicted 1'])
    
    
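    # Assign TP, TN, FP, FN (label-based lookups instead of deprecated positional [0] / [1])
    true_positives = conf_matrix.loc['actual 1', 'predicted 1']
    true_negatives = conf_matrix.loc['actual 0', 'predicted 0']
    false_positives = conf_matrix.loc['actual 0', 'predicted 1']
    false_negatives = conf_matrix.loc['actual 1', 'predicted 0']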

    
    # Calculate Sensitivity and Specificity
    sensitivity = true_positives / (true_positives + false_negatives)

    specificity = true_negatives / (true_negatives + false_positives)
    
    
    # Append to lists to graph
    x_list.append(1 - specificity)

    y_list.append(sensitivity)

    
# Plot ROC curve

plt.figure(figsize = (10, 8))
plt.title('ROC Curve', fontsize = 20)
plt.xlabel('1 - specificity (False Positive Rate)', fontsize = 15)
plt.ylabel('sensitivity (True Positive Rate)', fontsize = 15)
plt.xlim(-0.01, 1)
plt.ylim(-0.01, 1)
plt.plot(x_list, y_list);
plt.plot([0, 1], [0, 1]);

# x = 1 - specificity
# y = sensitivity

In [None]:
from sklearn.metrics import auc

In [None]:
auc(x_list, y_list)
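
The area reported by `auc` is computed with the trapezoidal rule over the (false positive rate, true positive rate) points. The sketch below reproduces that rule in plain Python on four hypothetical ROC points; the helper name `trapezoid_auc` and the points are assumptions for illustration only.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a curve through (fpr, tpr) points, using the trapezoidal rule."""
    points = sorted(zip(fpr, tpr))        # order the points by false positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # width times average height
    return area


# Hypothetical ROC points, from the strictest to the most lenient threshold
fpr_pts = [0.0, 0.1, 0.4, 1.0]
tpr_pts = [0.0, 0.6, 0.9, 1.0]

model_area = trapezoid_auc(fpr_pts, tpr_pts)
chance_area = trapezoid_auc([0.0, 1.0], [0.0, 1.0])  # the diagonal "coin flip" line
print(model_area, chance_area)
```

The diagonal line has an area of 0.5, which is why an AUC near 0.5 means the model separates the classes no better than chance, and an AUC near 1 means near-perfect separation.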

### Let's assign a cost to each TP, FP, TN, and FN in our loop and minimize the total cost
This brute-force sweep over thresholds is the naive way to optimize, but it works well. You could also derive a closed-form optimization function
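
Before wiring costs into the model, the idea can be sketched on toy data: for each candidate threshold, count the four outcomes, weight each by its cost, and keep the threshold with the lowest total. Everything in this sketch (the `total_cost` helper, the labels, the probabilities, and the cost values) is hypothetical; real costs come from domain knowledge.

```python
def total_cost(actual, prob_positive, threshold, cost_fp, cost_tn, cost_tp, cost_fn):
    """Sum the per-outcome costs at one threshold."""
    cost = 0
    for a, p in zip(actual, prob_positive):
        pred = 1 if p >= threshold else 0
        if a == 1 and pred == 1:
            cost += cost_tp
        elif a == 0 and pred == 0:
            cost += cost_tn
        elif a == 0 and pred == 1:
            cost += cost_fp
        else:
            cost += cost_fn
    return cost


# Hypothetical labels and class-1 probabilities for six patients
actual = [0, 0, 1, 0, 1, 1]
probs = [0.05, 0.30, 0.45, 0.55, 0.70, 0.95]

# Assumed costs: a missed case (FN) is 2.5 times as costly as a false alarm (FP)
thresholds = [0.2, 0.4, 0.6, 0.8]
costs = [total_cost(actual, probs, t, cost_fp=2, cost_tn=0, cost_tp=0, cost_fn=5)
         for t in thresholds]

best_threshold = thresholds[costs.index(min(costs))]
print(costs, best_threshold)
```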

In [None]:
def cost_function(model, X_test, y_test, num_thres = 100, cost_fp = 3, cost_tn = 0.5, cost_tp = 1, cost_fn = 2):

    _thres, tpr, fpr, cost = [], [], [], []

    # assign model predictions
    prediction = model.predict_proba(X_test)

    ## Different code for same objective to calculate metrics at thresholds
    
    for thres in np.linspace(0.01, 1, num_thres):
        
        _thres.append(thres)
        # 1 where P(class 1) meets the threshold, else 0
        predicts = (prediction[:, 1] >= thres).astype(int)

        conf_matrix = confusion_matrix(y_test, predicts)

        tp = conf_matrix[1, 1]
        tn = conf_matrix[0, 0]
        fp = conf_matrix[0, 1]
        fn = conf_matrix[1, 0]

        sensitivity = tp / (tp + fn)
        tnr = specificity = tn / (tn + fp)
        fnr = 1 - sensitivity

        tpr.append(sensitivity)
    
        fpr.append(1 - specificity)
        
        # add a cost function (this involves domain knowledge)
        
        current_cost = (cost_fp * fp) + (cost_tn * tn) + (cost_tp * tp) + (cost_fn * fn)
            
        cost.append(current_cost)  

    return fpr, tpr, cost, _thres

In [None]:
fpr, tpr, cost, thres = cost_function(model = rf, X_test = X_test, y_test = y_test,
                                      num_thres = 100, cost_fp = 2, cost_tn = 1, cost_tp = 1, cost_fn = 3)

cost_idx = np.argmin(cost)
min_cost_threshold = fpr[cost_idx], tpr[cost_idx], thres[cost_idx]

plt.figure(figsize = (10, 8))
plt.title('ROC Curve', fontsize = 20)
plt.xlabel('1 - specificity', fontsize = 15)
plt.ylabel('sensitivity', fontsize = 15)
plt.xlim(-.01, 1.01)
plt.ylim(-.01, 1.01)
plt.plot(fpr, tpr);
plt.plot([0, 1], [0, 1]);
plt.scatter(min_cost_threshold[0], min_cost_threshold[1], marker ='o', color = 'red', s=250)
# plt.text positions the label in data coordinates, next to the minimum-cost point
plt.text(min_cost_threshold[0] + 0.06, min_cost_threshold[1] - 0.03, 'Threshold: ' + str(round(min_cost_threshold[2], 2)));

## Optimizing costs on multiple models

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

In [None]:
logreg = LogisticRegression(random_state=42)

In [None]:
logreg.fit(X_train, y_train)

In [None]:
nb_class = GaussianNB()

In [None]:
nb_class.fit(X_train, y_train)

In [None]:
def cost_function_multi_model(model_list, X_test, y_test, num_thres = 100, cost_fp = 3, 
                              cost_tn = 0.5, cost_tp = 1, cost_fn = 2):
    '''model_list expects a list of already fit models - You could add the model.fit() code to the function.
    models in model_list MUST have the predict_proba() method - this could be modified in the future'''
    
    best_cost = []
    best_thresh = []
    
    for model in model_list:
        
        _thres = []
        cost = []

        # assign model predictions
        prediction = model.predict_proba(X_test)

        ## Different code for same objective to calculate metrics at thresholds
    
        for thres in np.linspace(0.01, 1, num_thres):

            _thres.append(thres)
            # 1 where P(class 1) meets the threshold, else 0
            predicts = (prediction[:, 1] >= thres).astype(int)

            conf_matrix = confusion_matrix(y_test, predicts)

            tp = conf_matrix[1, 1]
            tn = conf_matrix[0, 0]
            fp = conf_matrix[0, 1]
            fn = conf_matrix[1, 0]

            # add a cost function (this involves domain knowledge)

            current_cost = (cost_fp * fp) + (cost_tn * tn) + (cost_tp * tp) + (cost_fn * fn)

            cost.append(current_cost)
        
        thresh_idx = np.array(cost).argmin()
        best_cost.append(min(cost))
        best_thresh.append(_thres[thresh_idx])

    return pd.DataFrame({'model':model_list, 'best_cost' : best_cost, 'best_thresh' : best_thresh})

In [None]:
cost_function_multi_model([rf, logreg, nb_class], X_test = X_test, y_test = y_test,
                         num_thres = 100, cost_fp = 3, cost_tn = 0.5, cost_tp = 1, cost_fn = 2)

### Visualizing Threshold vs. Population Distribution

[ROC Curve Interactive Visualizer](http://www.navan.name/roc/)

In [None]:
import seaborn as sns

In [None]:
rf.predict_proba(X_test)[:5]

In [None]:
rf.predict(X_test)[:5]

In [None]:
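no_disease_dist = []
disease_dist = []

# Split the class-0 probabilities by predicted class, using the same .49 cutoff as before
for item in rf.predict_proba(X_test):
    if item[0] <= .49:
        disease_dist.append(item[0])
    else:
        no_disease_dist.append(item[0])

In [None]:
plt.figure(figsize = (10, 6))
plt.title('Distributions of Predicted Probabilities for Patients with and without Heart Disease')
plt.xlabel('Predicted probability of class 0', fontsize = 14)

# sns.distplot is deprecated; histplot with a KDE overlay draws the same kind of plot
sns.histplot(no_disease_dist, bins = 15, color = 'red', kde = True, stat = 'density', label = 'no heart disease dist')
sns.histplot(disease_dist, bins = 15, color = 'blue', kde = True, stat = 'density', label = 'heart disease dist')
plt.legend();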

### ROC curve vs Population Separation
![a](images/pop-curve.png)

![d](images/varying_dist_roc.png)

# Imbalanced Data

In [None]:
fraud_df = pd.read_csv('./data/creditcard.csv')

### Quick EDA

In [None]:
fraud_df.head()

In [None]:
fraud_df.info()

In [None]:
fraud_df['Class'].value_counts()

In [None]:
fraud_df['Class'].value_counts(normalize = True)

In [None]:
fraud_df[fraud_df['Amount'] == 0]['Class'].value_counts(normalize = True)

In [None]:
# plt.figure(figsize = (12,8))
# sns.heatmap(fraud_df.corr()[['Class']].sort_values(by = 'Class'), cmap = 'coolwarm', 
#             vmin = -1, vmax = 1, annot = True);

In [None]:
fraud_df.columns

### Assign X and y

In [None]:
X = fraud_df.loc[:, 'V26' : 'V27']

In [None]:
y = fraud_df['Class']

### Switch classes

In [None]:
# y = 1 - y

### Run model

In [None]:
fraud_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

In [None]:
fraud_rf.fit(X_train, y_train)

### Results

In [None]:
fraud_rf.score(X_test, y_test)
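
With classes this imbalanced, a very high accuracy score can be misleading. The sketch below uses hypothetical counts (not the actual credit card data) to show that a "model" that never predicts fraud still scores near-perfect accuracy while catching no fraud at all.

```python
# Hypothetical test set: 9,980 legitimate transactions and 20 fraudulent ones
n_legit, n_fraud = 9980, 20
actual = [0] * n_legit + [1] * n_fraud

# A do-nothing baseline that labels every transaction as legitimate
always_negative = [0] * len(actual)

correct = sum(1 for a, p in zip(actual, always_negative) if a == p)
caught = sum(1 for a, p in zip(actual, always_negative) if a == 1 and p == 1)

baseline_accuracy = correct / len(actual)
baseline_recall = caught / n_fraud
print(baseline_accuracy, baseline_recall)
```

Any accuracy score on imbalanced data should be compared against this kind of majority-class baseline, which is why the confusion matrix, ROC curve, and precision-recall curve are more informative here.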

In [None]:
pd.DataFrame(confusion_matrix(y_test, fraud_rf.predict(X_test)), columns = ['predicted 0', 'predicted 1'], 
             index = ['actual 0', 'actual 1'])

#### ROC Curve

In [None]:
from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve, precision_recall_fscore_support

In [None]:
roc_curve(y_test, fraud_rf.predict_proba(X_test)[:, 1])

In [None]:
fpr, tpr, threshold = roc_curve(y_test, fraud_rf.predict_proba(X_test)[:, 1])

In [None]:
plt.figure(figsize = (10,6))
plt.title('ROC Curve', fontsize = 20)
plt.xlabel('1 - specificity', fontsize = 15)
plt.ylabel('sensitivity', fontsize = 15)
plt.plot([0,1], [0,1])
plt.plot(fpr,tpr);

In [None]:
auc(fpr,tpr)

#### Precision Recall Curve

In [None]:
precision, recall, threshs = precision_recall_curve(y_test, fraud_rf.predict_proba(X_test)[:, 1])

In [None]:
plt.figure(figsize = (10,6))
plt.xlabel('Recall (TP / (TP + FN))', fontsize = 18)
plt.ylabel('Precision (TP / (TP + FP))', fontsize = 18)
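# No-skill baseline: a random classifier's precision equals the positive-class prevalence
plt.plot([0,1], [y_test.mean(), y_test.mean()])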
plt.plot(recall, precision);
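
Each point on a precision-recall curve comes from one threshold. The following sketch computes a few such points by hand on toy data; the labels, probabilities, and the `precision_recall_at` helper are hypothetical, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall_at(actual, prob_positive, threshold):
    """Return (precision, recall) when predicting 1 for probabilities >= threshold."""
    tp = sum(1 for a, p in zip(actual, prob_positive) if p >= threshold and a == 1)
    fp = sum(1 for a, p in zip(actual, prob_positive) if p >= threshold and a == 0)
    fn = sum(1 for a, p in zip(actual, prob_positive) if p < threshold and a == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0  # assumed convention
    recall = tp / (tp + fn)
    return precision, recall


# Hypothetical data: 3 positives among 10 cases
actual = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.80, 0.90]

pr_points = {t: precision_recall_at(actual, probs, t) for t in (0.25, 0.5, 0.85)}
for t, (precision, recall) in pr_points.items():
    print(f"threshold={t}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising the threshold moves along the curve toward higher precision and lower recall. Because neither quantity counts true negatives, the curve is not flattered by a large majority class the way accuracy is.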

### Using all columns

In [None]:
X = fraud_df.drop(columns=['Class', 'Time'])

In [None]:
y = fraud_df['Class']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

In [None]:
rf_fraud = RandomForestClassifier(n_estimators = 100, n_jobs = -1, oob_score = True)

In [None]:
rf_fraud.fit(X_train, y_train)

### Results

In [None]:
rf_fraud.oob_score_

In [None]:
rf_fraud.score(X_test, y_test)

In [None]:
fraud_preds = rf_fraud.predict(X_test)

In [None]:
pd.DataFrame(confusion_matrix(y_test, fraud_preds), columns = ['predicted 0', 'predicted 1'], 
             index = ['actual 0', 'actual 1'])

In [None]:
from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve, precision_recall_fscore_support

In [None]:
rf_fraud.predict_proba(X_test)[:, 1]

In [None]:
# roc_curve(y_test, rf_fraud.predict_proba(X_test)[:, 1])

In [None]:
fpr, tpr, threshold = roc_curve(y_test, rf_fraud.predict_proba(X_test)[:, 1])

In [None]:
plt.figure(figsize = (10,6))
plt.title('ROC Curve', fontsize = 20)
plt.xlabel('1 - specificity', fontsize = 15)
plt.ylabel('sensitivity', fontsize = 15)
plt.plot([0,1], [0,1])
plt.plot(fpr,tpr);

In [None]:
auc(fpr, tpr)

In [None]:
precision, recall, threshs = precision_recall_curve(y_test, rf_fraud.predict_proba(X_test)[:, 1])

In [None]:
plt.figure(figsize = (10,6))
plt.xlabel('Recall (TP / (TP + FN))', fontsize = 18)
plt.ylabel('Precision (TP / (TP + FP))', fontsize = 18)
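# No-skill baseline: a random classifier's precision equals the positive-class prevalence
plt.plot([0,1], [y_test.mean(), y_test.mean()])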
plt.plot(recall, precision);

### Try our cost function

In [None]:
fpr, tpr, cost, thres = cost_function(model = rf_fraud, X_test = X_test, y_test = y_test,
                                      num_thres = 100, cost_fp = 1, cost_tn = 1, cost_tp = 1, cost_fn = 5)

cost_idx = np.argmin(cost)
min_cost_threshold = fpr[cost_idx], tpr[cost_idx], thres[cost_idx]

plt.figure(figsize = (10, 8))
plt.title('ROC Curve', fontsize = 20)
plt.xlabel('1 - specificity', fontsize = 15)
plt.ylabel('sensitivity', fontsize = 15)
plt.xlim(-.01, 1.01)
plt.ylim(-.01, 1.01)
plt.plot(fpr, tpr);
plt.plot([0, 1], [0, 1]);
plt.scatter(min_cost_threshold[0], min_cost_threshold[1], marker ='o', color = 'red', s=250)
# plt.text positions the label in data coordinates, next to the minimum-cost point
plt.text(min_cost_threshold[0] + 0.06, min_cost_threshold[1] - 0.03, 'Threshold: ' + str(round(min_cost_threshold[2], 2)));

### SMOTE!
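
SMOTE (Synthetic Minority Over-sampling Technique) balances the classes by creating new minority-class rows. Each synthetic row lies on the line segment between a real minority sample and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step in plain Python; the function name `smote_like_sample` and the two points are hypothetical, and the nearest-neighbor search that the real algorithm performs is left out.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Create one synthetic sample on the segment between point and neighbor."""
    gap = rng.random()  # uniform in [0, 1): how far to move toward the neighbor
    return [x + gap * (n - x) for x, n in zip(point, neighbor)]


rng = random.Random(0)              # seeded so the sketch is repeatable
minority_point = [1.0, 2.0]         # hypothetical minority-class sample
minority_neighbor = [3.0, 6.0]      # hypothetical nearby minority-class sample

synthetic = smote_like_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

Note that only the training split is resampled in the cells that follow. The test set keeps its original class balance so the evaluation still reflects real-world conditions.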

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
smote = SMOTE(k_neighbors=13)

In [None]:
# fit_sample was removed from imbalanced-learn; fit_resample is the current method name
X_train, y_train = smote.fit_resample(X_train, y_train)

In [None]:
rf_smote = RandomForestClassifier(n_estimators=100, n_jobs=-1)

In [None]:
rf_smote.fit(X_train, y_train)

In [None]:
rf_smote.score(X_test,y_test)

In [None]:
fraud_preds = rf_smote.predict(X_test)

In [None]:
pd.DataFrame(confusion_matrix(y_test, fraud_preds), columns = ['predicted 0', 'predicted 1'], 
             index = ['actual 0', 'actual 1'])

In [None]:
fpr, tpr, threshold = roc_curve(y_test, rf_smote.predict_proba(X_test)[:, 1])

In [None]:
plt.figure(figsize = (10,6))
plt.title('ROC Curve', fontsize = 20)
plt.xlabel('1 - specificity', fontsize = 15)
plt.ylabel('sensitivity', fontsize = 15)
plt.plot([0,1], [0,1])
plt.plot(fpr,tpr);

In [None]:
auc(fpr, tpr)