In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.datasets import load_iris, make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from seaborn import heatmap

# On Classifier Evaluation

## Confusion Matrix

For classification problems, the target is a categorical variable. This means that we can simply count the number of times that our model predicts the correct category and the number of times that it predicts something else.

We can visualize this by means of a **confusion matrix**, which displays counts of correct and incorrect predictions. We'll explore this below. There are [many ways](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks) of evaluating a classification model, but most derive from the confusion matrix.

## Lottery Number Prediction

Suppose I want to train a model to predict whether a string of six numbers (a "ticket") would win the lottery or not. What sort of data might I use?

### Scenario 1

I gather all the winning tickets from the last 10000 days or so. So I have one column for the tickets themselves, and a Boolean column indicating whether the ticket was a winner or not.

Now if all the tickets on which my model trains are *winning* tickets, then it would predict every ticket to win! Suppose I test the model on a set of 1000 tickets, and suppose that there is exactly one winning ticket among those 1000. My model will always predict the ticket to win. Let's think about what the confusion matrix will look like.

In [None]:
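# Setting up the true values:
# create an array of shape (1000, 1) full of zeros ('False' values)
y_true = np.zeros((1000, 1), dtype=int)
# randomly select one value to be 1 ('True': the winning ticket)
winner = np.random.randint(0, 1000)  # upper bound is exclusive
y_true[winner] = 1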


In [None]:
# Setting up the predictions
# Create another array of the same size as y_true with all ones.
y_pred = np.ones((1000, 1))

# Confusion Matrices

Confusion matrices are the basis of many of the metrics for classification.  There are a few exceptions.

![Confusion Matrix Image](../resources/confusion_matrix.png)

Image courtesy of [OpenGenus IQ](https://iq.opengenus.org/performance-metrics-in-classification-regression/)

In [None]:
# Defining the confusion matrix
cm_1 = confusion_matrix(y_true, y_pred)

The confusion matrix should tell us that we have 999 false positives (999 losing tickets predicted to win) and 1 true positive (1 winning ticket predicted to win):

In [None]:
cm_1

In [None]:
# fmt='d' prints whole counts; the default format would show 999 as 1e+03
heatmap(cm_1, annot=True, fmt='d', cmap='Reds',
        xticklabels=['predict loss','predict win'],
        yticklabels=['true loss','true win'])

Notice the way that sklearn displays its confusion matrix: The rows are \['actually false', 'actually true'\]; the columns are \['predicted false', 'predicted true'\].

So it displays:

$\begin{bmatrix}
TN & FP \\
FN & TP
\end{bmatrix}$
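
As a small check of this layout, here is a sketch on six made-up labels (the arrays `toy_true` and `toy_pred` are invented for illustration). Flattening a 2x2 confusion matrix with `.ravel()` should give the counts in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels, for illustration only: 1 = positive, 0 = negative
toy_true = np.array([0, 0, 0, 1, 1, 1])
toy_pred = np.array([0, 1, 0, 1, 1, 0])

toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)

# Rows are actual classes and columns are predicted classes,
# so flattening row by row gives TN, FP, FN, TP
toy_tn, toy_fp, toy_fn, toy_tp = toy_cm.ravel()
print(toy_tn, toy_fp, toy_fn, toy_tp)
```

The names carry a `toy_` prefix so that they do not clash with the `tn`, `fp`, `fn`, and `tp` variables used for the lottery example.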

In [None]:
tn = cm_1[0, 0]
fp = cm_1[0, 1]
fn = cm_1[1, 0]
tp = cm_1[1, 1]

Let's see if we can calculate some of our metrics for this matrix.

## **Accuracy** = $\frac{TP + TN}{TP + TN + FP + FN}$

In words: How often did my model get the right answer?

In [None]:
def accuracy_score():
    # Correct predictions (tp, tn from the cell above) over all predictions
    return (tp + tn) / (tp + tn + fp + fn)

lottery_accuracy = accuracy_score()

## **Recall** = $\frac{TP}{TP + FN}$

In words: Of all the actual positives, how many did my model label as positive?

In [None]:
def recall_score():
    # True positives over all actual positives (tp, fn from the cell above)
    return tp / (tp + fn)

lottery_recall = recall_score()

## **Precision** = $\frac{TP}{TP + FP}$

In words: How many of the predicted positives were true positives?

In [None]:
def precision_score():
    # True positives over all predicted positives (tp, fp from the cell above)
    return tp / (tp + fp)

lottery_precision = precision_score()

## **F-1 Score** = $\frac{2PrRc}{Pr + Rc}$ = $\frac{2TP}{2TP + FP + FN}$

This one is harder to put into words, but think of it as a balance of recall and precision.
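
To make "balance" more concrete, the sketch below compares F1 (the harmonic mean of precision and recall) with their plain arithmetic mean for two hypothetical precision/recall pairs. The helper names `harmonic_f1` and `arithmetic_mean` are made up for this example.

```python
def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Plain average, shown only for comparison."""
    return (precision + recall) / 2


# Lopsided pair: perfect recall but very low precision
print(harmonic_f1(0.001, 1.0), arithmetic_mean(0.001, 1.0))

# Balanced pair
print(harmonic_f1(0.5, 0.5), arithmetic_mean(0.5, 0.5))
```

For the lopsided pair the arithmetic mean is about 0.5, while the harmonic mean stays close to the smaller of the two values. A model therefore needs both precision and recall to be reasonably high to earn a high F1.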

In [None]:
def f1_score():
    # Count form of the F1 formula (tp, fp, fn from the cell above)
    return 2 * tp / (2 * tp + fp + fn)

lottery_f1 = f1_score()

In [None]:
print('accuracy = ', lottery_accuracy)
print('recall = ', lottery_recall)
print('precision = ', lottery_precision)
print('f1 = ', lottery_f1)

### Scenario 2

Well, my recall was good, but everything else I measured was terrible! Can I do better?

This time I'll train my model in a much different way. I'll give it the tickets of 10000 people who played the lottery yesterday. Suppose that there is one winning ticket and 9999 losing tickets. Now I test the model on the same 1000 tickets from before in Scenario 1.

This time my model will almost always predict a ticket to lose. Suppose that, in the 1000 predictions, it makes only one prediction of a winner, and suppose that this prediction is wrong.

In [None]:
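# Set up predictions: every ticket a loser except one predicted winner
y_pred_2 = np.zeros((1000,1))
# Pick the predicted winner from among the true losers so that the
# prediction is wrong (randint(0, 1001) could also index out of bounds)
true_losers = np.flatnonzero(np.ravel(y_true) == 0)
pred_winner = np.random.choice(true_losers)
y_pred_2[pred_winner] = 1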

Instead of coding out the scores by hand, we can use sklearn to speed things up.

In [None]:
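# Import accuracy_score, precision_score, recall_score, and f1_score from sklearn.metrics
# (these replace the hand-written functions of the same names defined above)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score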

In [None]:
# Print out the scores for each metric
accuracy = accuracy_score(y_true, y_pred_2)
precision = precision_score(y_true, y_pred_2)
recall = recall_score(y_true, y_pred_2)
f1 = f1_score(y_true, y_pred_2)

conf = confusion_matrix(y_true, y_pred_2)
heatmap(conf, annot=True, fmt='d')  # fmt='d' shows whole counts

print('Accuracy', accuracy)
print('Precision', precision)
print('Recall', recall)
print('F1', f1)

## Resampling

The last classifier had a really high accuracy, but everything else was terrible.

In both cases, the problem we ran into was **class imbalance**. A popular solution for such a problem is resampling our data. To do this, let's take a look at [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html#imblearn.over_sampling.SMOTE).

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
data = make_classification(n_samples = 10000, weights=[0.1, 0.9])
X = data[0]
y = data[1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)

In [None]:
# Check class balance for the training data
pd.Series(y_train).value_counts(normalize=True)

This is a pretty imbalanced dataset. Only about 10% of the data has a class of `0`.

Let's use resampling to address this:

In [None]:
resample = SMOTE(random_state=2021)
X_train_resampled, y_train_resampled = resample.fit_resample(X_train, y_train)

## Never Resample Your Test Set!

### **Also, when using cross validation, ALWAYS use a pipeline when resampling data, same as any transformer**

In [None]:
pd.Series(y_train_resampled).value_counts(normalize=True)

## Multiple Classes

We can understand these metrics of recall, precision, and the rest even if there are more than two classes in our classification problem.

In [None]:
flowers = load_iris()

In [None]:
print(flowers.DESCR)

In [None]:
dims_train, dims_test, spec_train, spec_test = train_test_split(flowers.data,
                                                                flowers.target,
                                                                test_size=0.5,
                                                               random_state=2021)

In [None]:
spec_train[:5]

In [None]:
ss_f = StandardScaler()

ss_f.fit(dims_train)

dims_train_sc = ss_f.transform(dims_train)
dims_test_sc = ss_f.transform(dims_test)

In [None]:
logreg_f = LogisticRegression(multi_class='multinomial',
                             C=0.01, random_state=42)

logreg_f.fit(dims_train_sc, spec_train)

In [None]:
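# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(estimator=logreg_f,
                                      X=dims_test_sc,
                                      y=spec_test,
                                      display_labels=[
                                          'setosa',
                                          'versicolor',
                                          'virginica'
                                      ]);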

In [None]:
accuracy_score(spec_test,
              logreg_f.predict(dims_test_sc))

In [None]:
precision_score(spec_test,
                logreg_f.predict(dims_test_sc),
               average=None)

In [None]:
recall_score(spec_test,
            logreg_f.predict(dims_test_sc),
            average=None
            )

In [None]:
f1_score(spec_test,
            logreg_f.predict(dims_test_sc),
            average=None)

### Why are we getting three numbers for these metrics instead of one???

*YOUR ANSWER HERE*

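If our data is not binary, we have to change the `average=` argument to something other than `'binary'`, which is the default. With `average=None`, sklearn returns one score per class; `'micro'`, `'macro'`, and `'weighted'` each combine the per-class results into a single number.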

For multi-class precision, the relevant denominator is a **column**; for multi-class recall, the relevant denominator is a **row**.
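
The column and row rule can be checked by hand. The sketch below uses ten invented three-class labels (`demo_true` and `demo_pred` are made up for illustration), divides the diagonal of the confusion matrix by its column sums and row sums, and compares the results with sklearn's per-class scores.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented three-class labels, for illustration only
demo_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
demo_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

demo_cm = confusion_matrix(demo_true, demo_pred)

# Diagonal entries are the correct predictions for each class
correct = np.diag(demo_cm)

# Precision: divide by column sums (everything predicted as that class)
precision_by_class = correct / demo_cm.sum(axis=0)

# Recall: divide by row sums (everything actually in that class)
recall_by_class = correct / demo_cm.sum(axis=1)

print(precision_by_class)
print(precision_score(demo_true, demo_pred, average=None))
print(recall_by_class)
print(recall_score(demo_true, demo_pred, average=None))
```

Taking the plain mean of the per-class values gives the same number as `average='macro'`.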

## Which Metric Should I Care About?

Well, it depends.

### General Lessons

First, let's make some general observations about the metrics we've so far defined.

Accuracy:
- Pro: Takes into account both false positives and false negatives.
- Con: Can be misleadingly high when there is a significant class imbalance. (A lottery-ticket predictor that *always* predicts a loser will be highly accurate.)

Recall:
- Pro: Highly sensitive to false negatives.
- Con: No sensitivity to false positives.

Precision:
- Pro: Highly sensitive to false positives.
- Con: No sensitivity to false negatives.

F-1 Score:
- Harmonic mean of recall and precision.
    - Pro: Balances both recall and precision
    - Con: Harder to interpret/explain
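
As a quick illustration of the accuracy caveat, the sketch below scores a "model" that always predicts a loser on 1000 tickets with a single winner. The arrays and the winner's index are invented for this example, and `zero_division=0` is passed so that sklearn reports 0 rather than warning when no positives are predicted.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1000 tickets with a single winner (index chosen arbitrarily)
tickets_true = np.zeros(1000, dtype=int)
tickets_true[123] = 1

# A "model" that always predicts a loser
always_lose = np.zeros(1000, dtype=int)

acc_demo = accuracy_score(tickets_true, always_lose)
prec_demo = precision_score(tickets_true, always_lose, zero_division=0)
rec_demo = recall_score(tickets_true, always_lose)
f1_demo = f1_score(tickets_true, always_lose, zero_division=0)

print('accuracy :', acc_demo)
print('precision:', prec_demo)
print('recall   :', rec_demo)
print('f1       :', f1_demo)
```

Accuracy comes out at 0.999 even though the model never finds the winning ticket, which is why recall, precision, and F1 are all 0.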

### Using classification metrics with sklearn's cross validation

In [None]:
# Import some classification data
data = make_classification()
X = data[0]
y = data[1]

# Create train test split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=2021)

# Create a model
model = LogisticRegression(solver='lbfgs')

The metrics [found here](https://scikit-learn.org/stable/modules/model_evaluation.html) are built into sklearn's cross validation and can be selected by name through the `scoring` argument.

In [None]:
from sklearn.model_selection import cross_val_score


recall = cross_val_score(model, X_train, y_train, scoring='recall', cv=5)
precision = cross_val_score(model, X_train, y_train, scoring='precision', cv=5)
f1 = cross_val_score(model, X_train, y_train, scoring='f1', cv=5)
accuracy = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=5)

In [None]:
print('Average recall:', recall.mean())
print('Average precision:', precision.mean())
print('Average f1:', f1.mean())
print('Average accuracy:', accuracy.mean())
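
The four separate calls above refit the model once per metric. As a sketch of an alternative, sklearn's `cross_validate` accepts a list of scorer names and returns all of them from a single set of fits. The data and the names `X_demo`, `y_demo`, and `cv_results` are stand-ins invented for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data, standing in for the training set above
X_demo, y_demo = make_classification(random_state=2021)

metric_names = ['recall', 'precision', 'f1', 'accuracy']
cv_results = cross_validate(LogisticRegression(max_iter=1000),
                            X_demo, y_demo,
                            scoring=metric_names,
                            cv=5)

# Each metric is stored under the key 'test_<metric name>'
for name in metric_names:
    print('Average', name + ':', cv_results['test_' + name].mean())
```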

### Exercise

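If you want to use resampling with multiple-fold cross validation, ***you cannot resample before making your cross-validated splits***. When using resampling, you should always evaluate your model on the original, *unresampled* data. Otherwise, synthetic points in a validation fold may have been interpolated from rows in the training folds.

For example, the following would be *incorrect*: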
```
X_train, X_test, y_train, y_test = train_test_split(X,y)
X_train, y_train = resample.fit_resample(X_train, y_train)

scores = cross_val_score(model, X_train, y_train)
```

Because of this, one option is to write out the cross validation yourself using `KFold`, resampling only the training split inside each fold.

**In the cell below, comments have been provided to guide you through every step of this cross validation process.**

In [None]:
# Import KFold from sklearn

# Import precision, recall, 
# accuracy, and f1 from sklearn


# Create a kfolds object

# Create a list for each 
# classification score


# Instantiate a logistic regression model

# Instantiate a standard scaler

# Instantiate a smote object


# Loop over the training and validation indices
# using the kfolds object

    # Split X_train into a train and validation split

    
    # Scale and transform the training 
    # split using the standard scaler

    
    # Resample the training split
    
    # Fit the model to the resampled training split
    
    # Scale the validation split

    # Produce predictions for the validation split

    # Calculate the recall, precision, f1, and accuracy scores

    # Append the calculated scores to their corresponding list

    
# Print the mean of each score list    

# Bonus!  ROC AUC and PR AUC

**Receiver Operating Characteristic Area Under the Curve** (ROC AUC)

The probability threshold is the cutoff on the predicted probability that a classification model uses to assign an observation to the positive class.  By default this threshold is 0.5, but it can be changed.  Changing the threshold changes the predictions, and therefore the metrics computed from them.
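
A minimal sketch of moving the threshold by hand, assuming a logistic regression fitted on synthetic data (all names below are invented for this example): the positive-class probabilities from `predict_proba` are compared against a few cutoffs, and precision and recall are recomputed for each.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X_thr, y_thr = make_classification(n_samples=1000, random_state=2021)
X_thr_train, X_thr_test, y_thr_train, y_thr_test = train_test_split(
    X_thr, y_thr, random_state=2021)

clf = LogisticRegression(max_iter=1000).fit(X_thr_train, y_thr_train)

# Predicted probability of the positive class (column 1)
proba_pos = clf.predict_proba(X_thr_test)[:, 1]

recall_at = {}
for threshold in [0.3, 0.5, 0.7]:
    # Assign the positive class when the probability reaches the threshold
    preds = (proba_pos >= threshold).astype(int)
    recall_at[threshold] = recall_score(y_thr_test, preds)
    print(threshold,
          'precision:', precision_score(y_thr_test, preds, zero_division=0),
          'recall:', recall_at[threshold])
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases as the threshold goes up; precision usually, though not always, moves the other way.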

A ROC curve plots the true positive rate against the false positive rate across different probability thresholds.  The area under this curve summarizes the model's performance over all possible thresholds.  This gives a more complete way to compare different models when we plan to tune the probability threshold used to assign observations to classes.

In [None]:
from sklearn.metrics import roc_auc_score, RocCurveDisplay

model.fit(X_train, y_train)

# Keep only the positive-class column of predict_proba;
# the two columns in each row sum to 1
y_pred_proba = model.predict_proba(X_test)[:,1]
roc_auc_score(y_test, y_pred_proba)

In [None]:
# plot_roc_curve was removed in scikit-learn 1.2;
# RocCurveDisplay.from_estimator (scikit-learn >= 1.0) replaces it
RocCurveDisplay.from_estimator(model, X_test, y_test)

**Precision Recall Area Under the Curve**

Similarly, this metric allows us to examine the trade-offs between precision and recall as we adjust the probability threshold in our model.

In [None]:
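from sklearn.metrics import PrecisionRecallDisplay, average_precision_score

average_precision_score(y_test, y_pred_proba)

In [None]:
# plot_precision_recall_curve was removed in scikit-learn 1.2;
# PrecisionRecallDisplay.from_estimator (scikit-learn >= 1.0) replaces it
PrecisionRecallDisplay.from_estimator(model, X_test, y_test)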

## The above tools are very helpful when working with imbalanced datasets

In the real world most classification problems are imbalanced.  There is generally a large amount of the negative class and far less of the positive class.  Consider customer churn, loan defaults, diseases, etc.  On these kinds of datasets, accuracy, recall, and precision computed at a single threshold can be skewed and misleading, and F1 score is difficult to interpret and better suited to model comparison.  ROC and precision-recall curves also give you information on how best to set your probability threshold.