In [8]:
from IPython.display import HTML
css_file = './custom.css'
# Read the stylesheet with a context manager so the file handle is closed
with open(css_file, "r") as f:
    css = f.read()
HTML(css)

# Evaluation Metrics

© 2018 Daniel Voigt Godoy

## 1. Definition

There are ***MANY*** metrics available for classification problems. It may be a bit ***confusing*** at first, so let's look at the ***confusion matrix*** to understand it better (pun intended!).

### 1.1 Confusion Matrix

The ***confusion matrix*** is the contingency table of ***actual*** (rows) vs ***predicted*** (columns) ***classes***.

Some representations put the positive class in the first row and first column. But ***Scikit-Learn*** returns its results with the ***negative class first***. So, we're sticking with its convention to avoid confusion!

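For a binary problem, the matrix has 4 values: true negatives (TN), false positives (FP), false negatives (FN) and true positives (TP), as shown in the picture: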

![](confusion_matrix.png)

The confusion matrix provides the necessary information to build a lot of different metrics.

&nbsp; | &nbsp;
:---:|:---:
![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/264px-Precisionrecall.svg.png) | ![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Sensitivity_and_specificity.svg/264px-Sensitivity_and_specificity.svg.png)
<center>Source: Wikipedia</center> | <center>Source: Wikipedia</center>

Notice that the matrix is built on top of ***predicted classes***, not ***probabilities***. It means you should first decide on a ***threshold*** to convert probabilities into classes and only then compute the matrix.

Changing the ***threshold*** will change the matrix and, consequently, the metrics that depend on its values.

So, it is possible to ***tweak the threshold*** to achieve a better performance on a given metric.
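
To make the role of the threshold concrete, here is a minimal sketch using Scikit-Learn's `confusion_matrix`. The labels and probabilities are made up for illustration, and `matrix_at` is a hypothetical helper, not part of any library.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy data (made up): actual labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.8, 0.9])

def matrix_at(threshold):
    # First convert probabilities into classes, then build the matrix
    y_pred = (y_prob >= threshold).astype(int)
    # With labels=[0, 1], the layout is [[TN, FP], [FN, TP]] (negative first)
    return confusion_matrix(y_true, y_pred, labels=[0, 1])

cm_default = matrix_at(0.5)  # [[3, 1], [1, 3]]
cm_strict = matrix_at(0.7)   # [[4, 0], [2, 2]]

# ravel() unpacks the four values in the order TN, FP, FN, TP
tn, fp, fn, tp = cm_default.ravel()
print(cm_default)
print(cm_strict)
```

Raising the threshold from 0.5 to 0.7 removes the only false positive, but it also turns one true positive into a false negative.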

### 1.2 Accuracy

***How often is my classifier right?***

This is the most straightforward metric of all - how often a classifier is right, generally speaking.

It may be a ***misleading*** metric, though, if the dataset is ***imbalanced***.

$$
Accuracy = \frac{TP + TN}{Total} = \frac{TP + TN}{TP + TN + FP + FN}
$$
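
A quick sketch of why accuracy can mislead on imbalanced data. The dataset is made up (95 negatives, 5 positives) and the "classifier" simply predicts negative for everything.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up imbalanced dataset: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A useless "classifier" that always predicts the negative class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.95, even though not a single positive sample was found
```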

### 1.3 Precision

***My classifier says it's positive - how often is it right?***

If ***False Positives*** are a ***problem***, this is the metric you should pay attention to.

Example: if you want to classify videos as ***appropriate for kids*** (positive) or not (negative), you ***really*** don't want a ***false positive***, that is, an ***inappropriate video*** showing up. You will end up ***rejecting good videos***, but that's a lesser problem.

$$
Precision = \frac{TP}{TP + FP}
$$
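
As a rough illustration, the sketch below computes precision at two thresholds with Scikit-Learn's `precision_score`. The labels and probabilities are made up for this example.

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy data (made up): actual labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.8, 0.9])

# Threshold 0.5: TP = 3, FP = 1, so precision = 3 / 4
precision_mid = precision_score(y_true, (y_prob >= 0.5).astype(int))

# Threshold 0.7: TP = 2, FP = 0, so precision = 2 / 2
precision_high = precision_score(y_true, (y_prob >= 0.7).astype(int))

print(precision_mid, precision_high)
```

The stricter threshold gets rid of the false positive, at the cost of rejecting one more actual positive.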

### 1.4 True Positive Rate (TPR) / Recall / Sensitivity

***It IS a positive sample - how often does my classifier get it right?***

If ***False Negatives*** are a ***problem***, this is the metric you should pay attention to.

Example: if you want to detect if someone has a ***rare and fatal disease*** (positive) or not (negative), you ***really*** don't want a ***false negative***, that is, ***dismissing a sick person***. You will end up ***investigating further healthy people***, but that's a lesser problem.

$$
Recall = \frac{TP}{TP + FN}
$$
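
The sketch below goes in the opposite direction and lowers the threshold to improve recall, using Scikit-Learn's `recall_score`. The data is made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy data (made up): actual labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.8, 0.9])

# Threshold 0.5: TP = 3, FN = 1, so recall = 3 / 4
recall_mid = recall_score(y_true, (y_prob >= 0.5).astype(int))

# Threshold 0.35: every positive sample is caught (TP = 4, FN = 0)
y_pred_low = (y_prob >= 0.35).astype(int)
recall_low = recall_score(y_true, y_pred_low)

# The price to pay: more negatives are flagged as positive
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_low, labels=[0, 1]).ravel()
print(recall_mid, recall_low, fp)
```

No positive sample is dismissed anymore, but the number of false positives goes from 1 to 2.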

### 1.5 False Positive Rate (FPR) / 1 - Specificity

***It IS a negative sample - how often does my classifier get it wrong?***

If ***False Positives*** are a ***problem***, this is the metric you should pay attention to.

$$
FPR = 1 - Specificity = 1 - \frac{TN}{TN + FP} = \frac{FP}{TN + FP}
$$
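
Scikit-Learn has no dedicated function for FPR in its basic metrics, so a simple way to get it is straight from the confusion matrix. A minimal sketch, with made-up labels and predictions:

```python
from sklearn.metrics import confusion_matrix

# Toy data (made up): actual and predicted classes
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1]

# ravel() unpacks the matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

specificity = tn / (tn + fp)  # 3 / 4
fpr = fp / (tn + fp)          # 1 / 4, the same as 1 - specificity
print(specificity, fpr)
```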

### 1.6 F1-Score

It is the ***harmonic mean*** of precision and recall, so it combines both metrics into a single value.

It favors classifiers that deliver similar levels of precision and recall.

$$
F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}}
$$
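
To see why the harmonic mean favors similar levels of precision and recall, the sketch below compares two pairs with the same arithmetic mean (0.5). The helper `f1_from` is hypothetical and assumes both values are greater than zero; the last lines check the formula against Scikit-Learn's `f1_score` on made-up data.

```python
from sklearn.metrics import f1_score

def f1_from(precision, recall):
    # Harmonic mean of precision and recall (both assumed to be > 0)
    return 2 / (1 / precision + 1 / recall)

balanced = f1_from(0.5, 0.5)  # 0.5
lopsided = f1_from(0.9, 0.1)  # roughly 0.18, far below the arithmetic mean

# Toy data (made up): TP = 3, FP = 1, FN = 1, so precision = recall = 0.75
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1]
f1_sklearn = f1_score(y_true, y_pred)

print(balanced, lopsided, f1_sklearn)
```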

### Tweaking the Threshold

The metrics so far were computed for a given threshold. To see how they fare as we ***change the threshold*** over all its possible values, we need to build one of the ***curves*** below.

They are especially useful to evaluate classifiers on ***imbalanced datasets***.

### 1.7 Precision-Recall Curve (Recall x Precision)

The ***PR Curve*** depicts the trade-off between ***Recall*** on the horizontal axis and ***Precision*** on the vertical axis.

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_precision_recall_001.png)
<center>Source: Scikit-Learn</center>

You may have noticed the curve is somewhat ***bumpy***.

If you ***raise the threshold***, you will move to the ***left*** on the curve. 

It means you're trying to ***avoid False Positives*** at the expense of ***trading True Positives for False Negatives***.
1. More FN reduces Recall (TPR): since TP + FN is fixed at the number of actual positives, every TP traded for a FN lowers the numerator only
2. Fewer FP increases precision, but fewer TP reduces precision

So, as you shift the threshold, you may ***lose more TPs than FPs***, and then it will reduce your precision momentarily. That is where the bumps come from.
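
The bumpiness can be checked numerically with Scikit-Learn's `precision_recall_curve`, which returns precision and recall for increasing thresholds. This is only a sketch on made-up data.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy data (made up): actual labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.8, 0.9])

# Values are ordered by increasing threshold; a final point with
# precision = 1 and recall = 0 is appended, which has no threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# Recall never goes up as the threshold is raised...
recall_never_rises = bool(np.all(np.diff(recall) <= 0))
# ...but precision sometimes goes down, hence the bumps
precision_has_dips = bool(np.any(np.diff(precision) < 0))

print(recall_never_rises, precision_has_dips)
```

Plotting `recall` on the horizontal axis against `precision` on the vertical axis gives the PR Curve for this toy classifier.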

### 1.8 ROC Curve (FPR x TPR)

The ***ROC Curve*** depicts the trade-off between ***False Positive Rate*** on the horizontal axis and ***True Positive Rate*** on the vertical axis.

The shape of the curve will depend on ***how separable*** the classes are:
- perfectly separable: the "curve" would actually be a square, going straight up to 1 and staying there
- completely overlapped: the "curve" would actually be a diagonal line, from the origin to the upper right corner
- somewhat separable: a curve like the one in the figure below

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_roc_001.png)
<center>Source: Scikit-Learn</center>

If you ***raise the threshold***, you will move to the ***left*** on the curve. 

It means you're trying to ***avoid False Positives*** at the expense of ***trading True Positives for False Negatives***.
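1. More FN reduces TPR (Recall): since TP + FN is fixed at the number of actual positives, every TP traded for a FN lowers the numerator only
2. Fewer FP reduces FPR

Since TP is not present in the calculation of FPR (and FP is not present in the calculation of TPR), each rate can only go down or stay the same as the threshold is raised. So, we ***do not*** observe the bumpiness as in the PR Curve.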
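
The same kind of check can be done for the ROC Curve using Scikit-Learn's `roc_curve`. Again, a minimal sketch on made-up data.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy data (made up): actual labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.8, 0.9])

# Points are ordered by decreasing threshold, so the curve is traced
# from the origin (left) to the upper right corner
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Both rates only move in one direction: there are no bumps
fpr_monotonic = bool(np.all(np.diff(fpr) >= 0))
tpr_monotonic = bool(np.all(np.diff(tpr) >= 0))

print(fpr_monotonic, tpr_monotonic)
```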

### 1.9 Area Under ROC

The ROC Curve is a very popular method of evaluating a binary classifier. But how does one compare two curves? Unless one of them is strictly better than the other, this would be a difficult task.

To make it easier to compare classifiers, one can use the ***area*** under the ROC Curve. The closer it is to ***one***, the better the classifier, as it achieves a high ***TPR*** with a low ***FPR***.
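
As an illustration, the sketch below compares two made-up sets of scores for the same labels using Scikit-Learn's `roc_auc_score`. The names `scores_a` and `scores_b` are just placeholders for the outputs of two hypothetical classifiers.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels (made up)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Classifier A: scores of positive and negative samples overlap a bit
scores_a = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.8, 0.9])
# Classifier B: every positive sample scores higher than every negative one
scores_b = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

auc_a = roc_auc_score(y_true, scores_a)  # 13 of 16 pairs ranked correctly
auc_b = roc_auc_score(y_true, scores_b)  # perfectly separable

print(auc_a, auc_b)
```

The area can also be read as the fraction of (positive, negative) pairs in which the positive sample gets the higher score, which is 13 / 16 = 0.8125 for classifier A.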

## 2. Experiment

Time to try it yourself!

### 2.1 Balanced Dataset

There are 1,000 data points in a ***balanced dataset*** of ***green*** (positive) and ***red*** (negative) labels.

There is only ***one*** feature, represented on the horizontal axis.

***Positive*** samples are initially ***centered at 1.0***, while ***negative*** samples are ***centered at -1.0***.

A ***Logistic Regression*** was trained on the data and it is shown on top of the distribution of the data (left plot).

The ***probability threshold*** is the ***dotted horizontal line*** and it has a linked ***vertical dashed line*** indicating the corresponding ***feature value*** (horizontal axis) for that threshold.

The plots on the right show the corresponding ***ROC*** and ***PR*** curves for the trained Logistic Regression, and the ***stars*** correspond to the ***chosen threshold***. At the bottom, there is also a ***confusion matrix***.

The controls allow you to:
- change the threshold
- move the center of the distribution for negative and positive samples

Use the controls to play with different configurations and answer the ***questions*** below.

In [9]:
from intuitiveml.supervised.classification.LogisticRegression import *

In [10]:
mylr = plotLogistic(x=(-1, 1), n_samples=1000, positive=.5)
vb = VBox(build_figure_thresh(mylr))
vb.layout.align_items = 'center'





In [11]:
vb


#### Questions

1. Using the ***threshold*** slider:
    - raise the threshold - what happens to the star in PR and ROC curves?
        - go all the way to 1.0 and check the confusion matrix, as well as the position in both curves
    - lower the threshold - what happens to the star in PR and ROC curves?
        - go all the way to 0.0 and check the confusion matrix, as well as the position in both curves


2. Make the threshold 0.5 again and, using the ***negative center*** slider:
    - change the center of the negative samples to -2.00 - what happens to the ***shape*** of PR and ROC curves? Why?
        - could you achieve a similar result using an algorithm other than logistic regression? Why?
    - now change this center to 0.00 - how did the ***shape*** of PR and ROC curves change? Why?
    - which one of the settings is better? Why?
    
    
3. Keep the negative center at 0.0 and, using the ***positive center*** slider:
    - change the center of positive samples to 0.2 - what happens to the ***shape*** of PR and ROC curves? Why?
        - could you achieve a better result using a more advanced algorithm than a logistic regression? Why?

### 2.2 Imbalanced Dataset

There are still 1,000 data points, but now the ***green*** points are about 5% of the total, making it a ***quite imbalanced*** dataset.

In [12]:
mylr2 = plotLogistic(x=(-1, 1), n_samples=1000, positive=.05)
vb2 = VBox(build_figure_thresh(mylr2))
vb2.layout.align_items = 'center'





In [13]:
vb2


#### More Questions

1. Using the ***threshold*** slider:
    - at 0.5, check the ***positive*** samples, what is the ***recall***? What about the ***precision***?
    - how can you improve the ***recall***? What is the side effect of doing it?
    - what if you need to avoid ***false positives***? How can you achieve it?
    
    
2. Make the threshold 0.5 again and, using the ***negative center*** slider:
    - change the center of the negative samples to -2.00 - what happens to the ***shape*** of PR and ROC curves? Why? How is this different from what you observed when the dataset was balanced?
    - now change this center to 0.00 - how did the ***shape*** of PR and ROC curves change? Why?
    
    
3. Keep the negative center at 0.0 and, using the ***positive center*** slider:
    - change the center of positive samples to 0.2 - what happens to the ***shape*** of PR and ROC curves? Why?
    - can you improve the ***recall***? What's the consequence of achieving this?

## 3. Scikit-Learn

[Classification Metrics](https://scikit-learn.org/stable/modules/classes.html#classification-metrics)

Please check Aurélien Géron's "Hands-On Machine Learning with Scikit-Learn and TensorFlow" notebook on Classification [here](https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb).

## 4. More Resources

[Confused by the Confusion Matrix](https://towardsdatascience.com/confused-by-the-confusion-matrix-e26d5e1d74eb)

[ROC and precision-recall with imbalanced datasets](https://classeval.wordpress.com/simulation-analysis/roc-and-precision-recall-with-imbalanced-datasets/)

[Introduction to the precision-recall plot](https://classeval.wordpress.com/introduction/introduction-to-the-precision-recall-plot/)

[PyCM](https://github.com/sepandhaghighi/pycm/blob/master/README.md)

#### This material is copyright Daniel Voigt Godoy and made available under the Creative Commons Attribution (CC-BY) license ([link](https://creativecommons.org/licenses/by/4.0/)). 

#### Code is also made available under the MIT License ([link](https://opensource.org/licenses/MIT)).

In [14]:
from IPython.display import HTML
HTML('''<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')