<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Classification Metrics

_Authors: Matt Brems, Dave Yerrington, Noelle Brown, Jeff Hale_

---

### Learning objectives

After this lesson, students will be able to:

- Understand a confusion matrix
- Calculate sensitivity/recall/TPR, specificity/TNR, and precision
- Understand and calculate F1 score
- Understand and calculate balanced accuracy
- Create a ROC curve in scikit-learn
- Understand how ROC AUC is calculated
- Identify methods for handling imbalanced data

### Prior knowledge required

- Python basics
- Pandas basics
- Machine learning workflow with scikit-learn
- How to calculate accuracy

---

#### Import libraries


In [None]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

from sklearn.datasets import load_breast_cancer

In [None]:
import sklearn
sklearn.__version__

## Create dataset
---

Call the `load_breast_cancer()` function to create our dataset.

In [None]:
data = load_breast_cancer(as_frame=True).frame
data.head(2)

#### Check the target variable values: 

Class Distribution:

212 - Malignant, 357 - Benign

Source: the [scikit-learn docs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html)

## Create `X` and `y`
---

⚠️ The dataset labels benign tumors as 1, and malignant tumors as 0. This is contrary to how you typically label data: the more important class (malignant) should be labeled 1. Let's fix that.
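One possible way to build `X` and `y` with the labels flipped is sketched below. It assumes the frame returned by `load_breast_cancer(as_frame=True)` with its `target` column; subtracting the target from 1 turns malignant (originally 0) into 1.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True).frame

# Original coding: 0 = malignant, 1 = benign.
# Flip it so that malignant, the class we care most about, is the positive class (1).
X = data.drop(columns="target")
y = 1 - data["target"]

# Quick check of the new class distribution.
print(y.value_counts())
```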

## Train/Test Split
---

Create the train and test sets. 

**Note:** we'll want to create a stratified split so the training and test sets keep the same class proportions.
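A minimal sketch of a stratified split follows. The `random_state=42` value is an arbitrary choice for reproducibility, and the data is reloaded here only to keep the example self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
y = 1 - y  # malignant = 1

# stratify=y keeps the share of malignant cases about the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Share of the positive class overall, in train, and in test.
print(round(y.mean(), 3), round(y_train.mean(), 3), round(y_test.mean(), 3))
```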

## Scaling our features
---

Because we're using KNN for our model, we'll want to scale our training and testing sets.
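The scaling pattern can be sketched on a tiny made-up array (the numbers are purely illustrative). The point is that the scaler learns its means and standard deviations from the training set only and then reuses them on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the real training and test features.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)  # learn mean/std from train, then scale
X_test_sc = sc.transform(X_test)        # reuse the train statistics on test

print(X_train_sc)
print(X_test_sc)
```

Without scaling, the second column would dominate the distance calculations KNN relies on simply because its values are larger.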

## Instantiate and fit model
---

Create and fit an instance of `KNeighborsClassifier`. Use the default parameters.

## Predictions
---

Use our newly fitted KNN model to create predictions from the scaled `X_test`.
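The fit-then-predict pattern looks roughly like this. The arrays below are hypothetical one-feature stand-ins for the scaled training and test sets, chosen so the result is easy to verify by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-ins for X_train_sc / y_train and X_test_sc.
X_train_sc = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test_sc = np.array([[1.5], [10.5]])

# Default parameters: n_neighbors=5, uniform weights.
knn = KNeighborsClassifier()
knn.fit(X_train_sc, y_train)

# Each test point takes the majority class among its 5 nearest training points.
preds = knn.predict(X_test_sc)
print(preds)
```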

## Confusion Matrix
---

We'll create a confusion matrix using the `confusion_matrix` function from `sklearn`'s `metrics` module.

#### Want those values as individual integers? The function signature tells us how all four get returned.

Why do we have to flatten/ravel/reshape the array?
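As a sketch with made-up labels: `confusion_matrix` returns a 2x2 array with true classes as rows and predicted classes as columns. A 2x2 array unpacks into two rows, not four numbers, so we flatten it with `.ravel()` first.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Flattening gives the four cells in row order: tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```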

#### `ConfusionMatrixDisplay.from_estimator`

scikit-learn can plot a confusion matrix for you. Pass the fitted estimator and the test set `X` and `y` to `from_estimator`.

Want to make it easier to read? Add `display_labels` values.

Want percentages? Pass `normalize='true'`.

#### How many Type I errors are there?
---

<details>
    <summary>Need a hint?</summary>
    Type I = False positive
</details>

#### How many Type II errors are there?


#### Which error is worse here (Type I vs Type II)?
---

#### Accuracy

#### Calculate the sensitivity
---

<details>
    <summary>Need a hint?</summary>
    It's the same as recall.
</details>

#### Calculate the specificity
---



#### Calculate the precision
---

<details>
    <summary>Need a hint?</summary>
    Precision starts with p, so it's TP / all the predicted positives: TP / (TP + FP).
</details>

#### Calculate recall
---

<details>
    <summary>Need a hint?</summary>
    Recall = Sensitivity
</details>

## Use scikit-learn functions instead
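One way this might look, using made-up labels rather than the model's real predictions. scikit-learn has no dedicated specificity function, so the sketch treats specificity as the recall of the negative class via `pos_label=0`.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels, invented for illustration only (tn=3, fp=2, fn=1, tp=4).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)                # (tp + tn) / total
sensitivity = recall_score(y_true, y_pred)               # tp / (tp + fn)
specificity = recall_score(y_true, y_pred, pos_label=0)  # tn / (tn + fp)
precision = precision_score(y_true, y_pred)              # tp / (tp + fp)

print(accuracy, sensitivity, specificity, precision)
```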

---
## Smoking dataset
Let's look at another dataset to predict mortality from smoking status and age. Data dictionary [here](https://myweb.uiowa.edu/pbreheny/data/whickham.html).

In [None]:
df_smoke = pd.read_csv('./data/Whickham.csv')

#### Inspect

- `outcome`: Whether someone is alive or dead.
- `smoker`: Whether somebody smoked or did not smoke.
- `age`: Age in years.

#### Make X and y

Make `smoker` numeric. It has just two values, both of which will definitely appear in the training and test sets, so we can simply binarize it (0 or 1) instead of using a scikit-learn `OneHotEncoder`.
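A minimal sketch of the binarizing step on a hypothetical four-row frame. The string values `"Yes"`/`"No"` and `"Alive"`/`"Dead"` are assumptions about how the CSV encodes these columns, so check them against the real data with `value_counts()` first.

```python
import pandas as pd

# Hypothetical rows that mimic the assumed layout of the Whickham data.
df_toy = pd.DataFrame({
    "outcome": ["Alive", "Dead", "Alive", "Dead"],
    "smoker": ["Yes", "No", "No", "Yes"],
    "age": [23, 71, 35, 64],
})

# Comparing to a string gives True/False; astype(int) turns that into 1/0.
df_toy["smoker"] = (df_toy["smoker"] == "Yes").astype(int)
df_toy["outcome"] = (df_toy["outcome"] == "Dead").astype(int)  # dead = 1, alive = 0

X = df_toy[["smoker", "age"]]
y = df_toy["outcome"]
print(df_toy)
```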

#### Create holdout/test set and training dataset

### Null model


<details><summary>In this situation, what term would we use to describe someone who is predicted to be dead but actually is alive? (Remember that alive is coded as 0 and dead is coded as 1.)</summary>

- We **falsely** predict someone to be **positive**.
- This would be a **false positive**.
</details>

<details><summary>In this situation, what is a true negative?</summary>

- We **correctly** predict someone to be **negative**.
- Someone who is predicted to be alive (`0`) and actually is alive (`0`).
</details>

## Evaluate the model performance



### Generate confusion matrix

#### Plot it!


## Basic Metrics

#### What's the accuracy?

#### What is the specificity? 

#### What is the sensitivity?

#### What is the precision?

--- 
# Exercise
## Make a better model

#### Make a LogisticRegression model with the default parameters. 


### Plot the confusion matrix

### Evaluate

### Convenient reporting

---
# More advanced metrics 🏆

## F1 score

The F1 score is the harmonic mean of precision and recall. 

If you care about precision and recall roughly the same amount, F1 score is a great metric to use! 

Note that although any of the metrics you've seen can be followed by the word `score`, F1 almost always is. 🤷‍♀️

F1 = 2 * (Precision * Recall) / (Precision + Recall)

F1 score is very popular.
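To make the formula concrete, here is a small sketch on made-up labels that computes F1 by hand and compares it with scikit-learn's `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels, invented for illustration only (tn=3, fp=2, fn=1, tp=4).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # 4 / 6
recall = recall_score(y_true, y_pred)        # 4 / 5

# Harmonic mean of precision and recall.
f1_manual = 2 * (precision * recall) / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print(f1_manual, f1_sklearn)
```

Because it is a harmonic mean, F1 sits closer to the lower of the two values than a plain average would.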

## Balanced Accuracy Score

You can think of it as the average accuracy for each class.

Balanced accuracy = average of TPR and TNR

= ((TP / P) + (TN / N)) / 2

= (Sensitivity + Specificity) / 2
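A small sketch of why this matters on imbalanced data, using invented labels: a model that always predicts the majority class looks good on plain accuracy but not on balanced accuracy.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy imbalanced data: nine negatives, one positive.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A lazy model that predicts the majority class every time.
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)               # 9 of 10 correct
bal_acc = balanced_accuracy_score(y_true, y_pred)  # (TPR + TNR) / 2 = (0 + 1) / 2

print(acc, bal_acc)
```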


### When is balanced accuracy a good metric to use?

#### Specificity = TNR

#### Balanced Accuracy

#### Accuracy Score

## ROC curve

Plot the TPR vs. FPR for the range of possible decision thresholds and you get the ROC curve!

What's the TPR?

What's the FPR?

FPR = FP / (FP + TN)

Add the handy worst-case-scenario line

TPR = Sensitivity = Recall

Note that you pass the estimator as the first argument. ⚠️

#### What happens as you move to the right along the curve?

The more area under the curve, the better separated our distributions are.
- Check out this gif - just an example ([source](https://twitter.com/DrHughHarvey/status/1104435699095404544)):

![](https://media.giphy.com/media/H1SZ5oRLIuZ1t1c4Di/giphy.gif)

We use the **area under the ROC curve** (often abbreviated **ROC AUC** or **AUC**) to quantify the gap between our distributions.

### What's the ROC AUC score?

Recall that the `.predict_proba()` method returns the probabilities of both classes in a NumPy array.

#### Note that you have to pass the probabilities to `roc_auc_score`. ⚠️
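The sketch below shows the difference on a tiny invented dataset. It scores the training data purely to keep the example short; in the lesson you would use the test set. Column 1 of `predict_proba` holds the probability of the positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy one-feature data, invented for illustration only.
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

logreg = LogisticRegression().fit(X, y)

proba = logreg.predict_proba(X)  # shape (n_samples, 2): [P(class 0), P(class 1)]
pos_proba = proba[:, 1]          # keep only the positive-class column

auc_from_proba = roc_auc_score(y, pos_proba)
# Passing hard 0/1 predictions runs without error but throws away the ranking
# information, so the score is generally not the true ROC AUC.
auc_from_labels = roc_auc_score(y, logreg.predict(X))

print(auc_from_proba, auc_from_labels)
```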

## Interpreting ROC AUC
- If you have an ROC AUC of 0.5, your positive and negative populations perfectly overlap and your model does no better than random guessing.
- If you have an ROC AUC of 1, your positive and negative populations are perfectly separated and your model is as good as it can get.
- The closer your ROC AUC is to 1, the better. (1 is the maximum score.)

- If you have an ROC AUC below 0.5, your positive and negative distributions have flipped sides. By flipping your predicted values (i.e. flipping predicted 1s and 0s), your ROC AUC will now be above 0.5.

We generate one ROC curve per model. The ROC curve is generated by varying our threshold from 0 to 1. This doesn't actually change the threshold or our original predictions, but it helps us to visualize the trade-offs.
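The threshold sweep can be sketched by hand with NumPy. The labels and predicted probabilities are made up, and `tpr_fpr` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import numpy as np

# Toy true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def tpr_fpr(y_true, scores, threshold):
    """Return (TPR, FPR) when predicting 1 for scores >= threshold."""
    preds = (scores >= threshold).astype(int)
    tp = np.sum((preds == 1) & (y_true == 1))
    fp = np.sum((preds == 1) & (y_true == 0))
    fn = np.sum((preds == 0) & (y_true == 1))
    tn = np.sum((preds == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Sweep the threshold from 0 to 1; each threshold gives one point on the curve.
thresholds = np.linspace(0, 1, 11)
points = [tpr_fpr(y_true, scores, t) for t in thresholds]

for t, (tpr, fpr) in zip(thresholds, points):
    print(f"threshold={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

At a threshold of 0 everything is predicted positive, so both rates are 1; at a threshold of 1 nothing is, so both are 0. The curve connects the points in between.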

--- 
## Summary

You've seen how to create and use common classification metrics.

## Interview Questions

<details><summary>How is the ROC curve calculated?</summary>
    
The ROC curve is generated by starting our classification threshold at 0, calculating sensitivity (TPR) and 1 - specificity (FPR), and plotting the values.
    
We then increment our threshold by a small number (e.g. 0.1%), and calculate and plot sensitivity and 1-specificity again. 
    
Repeat this process until we reach 1.
    
</details>

<details><summary>Let's say you were building a search engine and wanted to build a classification model that would recommend articles based on the search input. What metric would you want to optimize for and why?</summary>
    
- You could make a case for wanting to minimize false positives (stories that weren't relevant), in which case you'd want to optimize for precision.
- You could make a case for wanting to minimize false negatives (not passing along possibly useful content), in which case you'd want to optimize for recall. 
- Alumni Comment: "The interviewer seemed more interested in seeing if I knew what the metrics were and explaining what priorities would lead me to optimize for one over the other."
</details>

## Check for Understanding

You might want to make flash cards for these terms. We'll review them in the coming weeks.

- What is F1 Score?
- When would you use F1 Score?
- What is balanced accuracy?
- When might you use balanced accuracy?

- What is recall?
- What is precision?
- What is sensitivity?
- What is specificity?