## Metrics

* Categorical metrics
  * F1
* Regression metrics
  * Mean absolute error
  * Root mean squared error
  * $R^2$ (coefficient of determination)

## Classification

* Precision

$$ \text{precision}=\frac{|\{\text{relevant documents}\}\cap\{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} $$

* Recall

$$ \text{recall}=\frac{|\{\text{relevant documents}\}\cap\{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} $$

* Accuracy
  * Fraction of correct predictions
  $$ \text{accuracy}=\frac{\text{correct predictions}}{\text{all predictions}} $$
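
The three definitions above can be sketched in plain Python by counting true and false positives and negatives. The labels below are hypothetical, made up only for illustration, and `confusion_counts` is a helper name introduced here, not part of any library. In practice `sklearn.metrics` offers `precision_score`, `recall_score` and `accuracy_score` for the same purpose.

```python
def confusion_counts(target, predicted, positive=1):
    """Count true positives, false positives, false negatives and true negatives."""
    pairs = list(zip(target, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


# Hypothetical labels: 1 = relevant document, 0 = not relevant.
target = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(target, predicted)

# Assumes at least one predicted positive and one actual positive,
# otherwise these divisions would fail.
precision = tp / (tp + fp)          # retrieved documents that are relevant
recall = tp / (tp + fn)             # relevant documents that were retrieved
accuracy = (tp + tn) / len(target)  # correct predictions over all predictions

print(precision, recall, accuracy)
```

Note that the three numbers differ on the same predictions, which is why a single metric rarely tells the whole story.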
  

![](images/accuracy.png)

This shows all the data that we have seen. What about the data we *haven't* seen?

How can we explain the fact that a test might not have perfect recall?

* Jacob's airplane example

## F1 score

$$F_1 = \left(\frac{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}}{2}\right)^{-1} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

In [None]:
from sklearn.metrics import f1_score

target = [1]
predicted = [1]
f1_score(target, predicted)

In [None]:
f1_score([1], [0])

In [None]:
# Two samples as flat lists: one true positive and one false positive
f1_score([1, 0], [1, 1])
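
To see where such a score comes from, the harmonic-mean formula can be evaluated by hand. This is a minimal sketch: `f1_from_precision_recall` is a hypothetical helper, and returning 0.0 when both inputs are zero is a convention assumed here to avoid dividing by zero.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Assumed convention: the score is 0.0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Targets [1, 0] and predictions [1, 1]:
# one true positive, one false positive, no false negatives.
precision = 1 / (1 + 1)
recall = 1 / (1 + 0)

f1 = f1_from_precision_recall(precision, recall)

# The reciprocal form of the formula gives the same value.
f1_reciprocal = ((1 / recall + 1 / precision) / 2) ** -1

print(round(f1, 4), round(f1_reciprocal, 4))
```

Because the harmonic mean is pulled toward the smaller of the two values, a perfect recall cannot compensate for a precision of 0.5.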

## Model fitness for linear models

**Actual** observed target: $y$

**Predicted** value from the model: $\hat{y}$

## Mean absolute error

$$\mathrm{MAE} = \frac{\sum_{i=1}^n\left| y_i-\hat{y}_i\right|}{n} =\frac{\sum_{i=1}^n\left| e_i \right|}{n}.$$

* Controls for sample size
  * "Controls" means "takes into account"
* Less sensitive to outliers than MSE or RMSE, because errors are not squared
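
As a rough sketch, the MAE formula translates almost directly into Python. The function name `mae` and the sample values are assumptions for illustration. `sklearn.metrics.mean_absolute_error` computes the same quantity.

```python
def mae(target, predicted):
    """Mean absolute error: average of |y_i - y_hat_i|."""
    errors = [abs(y - y_hat) for y, y_hat in zip(target, predicted)]
    return sum(errors) / len(errors)  # dividing by n controls for sample size


# Hypothetical values, chosen only to keep the arithmetic easy.
target = [3, 5, 2, 7]
predicted = [2, 5, 4, 8]

# Absolute errors are 1, 0, 2, 1, so the mean is 4 / 4.
mae_value = mae(target, predicted)
print(mae_value)
```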

## Mean squared error

$$\mathrm{MSE} = \frac{\sum_{i=1}^n\left| y_i-\hat{y}_i\right|^2}{n} =\frac{\sum_{i=1}^n\left| e_i \right|^2}{n}.$$

* Controls for sample size
* Sensitive to outliers, because errors are squared
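
The MSE formula can be sketched the same way. The function name `mse` and the sample values are assumptions for illustration. `sklearn.metrics.mean_squared_error` computes the same quantity.

```python
def mse(target, predicted):
    """Mean squared error: average of (y_i - y_hat_i)^2."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(target, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical values, chosen only to keep the arithmetic easy.
target = [3, 5, 2, 7]
predicted = [2, 5, 4, 8]

# Squared errors are 1, 0, 4, 1, so the mean is 6 / 4.
mse_value = mse(target, predicted)
print(mse_value)
```

The single error of 2 contributes 4 of the 6 squared-error units, which shows how squaring gives large errors more weight.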

## Root mean squared error

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^n\left| y_i-\hat{y}_i\right|^2}{n}} = \sqrt{\frac{\sum_{i=1}^n\left| e_i \right|^2}{n}}$$

* Controls for sample size
* Sensitive to outliers
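
One way to see the outlier sensitivity is to compare two sets of errors with the same total absolute error. The sketch below uses made-up values and the hypothetical helper names `mae` and `rmse`.

```python
import math


def mae(target, predicted):
    """Mean absolute error."""
    return sum(abs(y - y_hat) for y, y_hat in zip(target, predicted)) / len(target)


def rmse(target, predicted):
    """Root mean squared error: square root of the mean squared error."""
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(target, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: the same targets with two different sets of predictions.
target = [10, 10, 10, 10]
evenly_off = [11, 11, 11, 11]   # every prediction is off by 1
one_outlier = [10, 10, 10, 14]  # three perfect predictions, one off by 4

mae_even, rmse_even = mae(target, evenly_off), rmse(target, evenly_off)
mae_outlier, rmse_outlier = mae(target, one_outlier), rmse(target, one_outlier)

print(mae_even, rmse_even)
print(mae_outlier, rmse_outlier)
```

Both prediction sets have the same MAE, but the RMSE doubles when the error is concentrated in a single point.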

## $R^2$ (coefficient of determination)

Describes the proportion of variance (change) in the data that the model can predict.

Imagine a baseline model without any $x$ values. Its best guess for every point is the mean of the observed targets:

$$\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i $$

The total variation in the data is then the sum of squared deviations from that mean (the total sum of squares, $SS_\text{tot}$):

$$SS_\text{tot}=\sum_i (y_i-\bar{y})^2$$

We compare that to the model's *actual* error, the residual sum of squares:

$$ SS_\text{res}=\sum_i (y_i - \hat{y}_i)^2=\sum_i e_i^2 $$

This tells us how much of the variance the model is able to predict, compared to the baseline variance in the data:

$$R^2 \equiv 1 - {SS_{\rm res}\over SS_{\rm tot}}$$

![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/Coefficient_of_Determination.svg/800px-Coefficient_of_Determination.svg.png)

So, for each point: how much error remains after the model's prediction, compared with the error of simply predicting the mean?
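
The derivation above can be followed step by step in plain Python. This is a minimal sketch with made-up values, and `r_squared` is a hypothetical helper name. It assumes the targets are not all identical, since $SS_\text{tot}$ would then be zero and the ratio undefined.

```python
def r_squared(target, predicted):
    """R^2 = 1 - SS_res / SS_tot, following the formulas above."""
    mean_y = sum(target) / len(target)
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(target, predicted))
    # Assumes ss_tot > 0, i.e. the targets are not all the same value.
    return 1 - ss_res / ss_tot


# Hypothetical values, chosen only to keep the arithmetic easy.
target = [1, 2, 3, 4]
predicted = [1.5, 2, 3, 3.5]

# mean = 2.5, SS_tot = 5.0, SS_res = 0.5
r2_model = r_squared(target, predicted)

# A "model" that always predicts the mean explains none of the variance.
r2_baseline = r_squared(target, [2.5, 2.5, 2.5, 2.5])

print(round(r2_model, 4), r2_baseline)
```

A score near 1 means the residual error is small relative to the spread of the data, while a score of 0 means the model does no better than predicting the mean.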

In [None]:
from sklearn.metrics import r2_score

r2_score([1], [1])

In [None]:
r2_score([1, 2], [1, 1.5])

## Exercise

Compute the $R^2$ score for each row of the following data and explain the *meaning* of each value:

| Target | Predicted | 
| --- | --- |
| `[2]` | `[2]` |
| `[1, 2]` | `[1, 1.999999]` |
| `[1, 5, 3]` | `[1, 5, 2]` |
| `[2, 7]` | `[1.8, 7.2]` |

## Recap 

* F1 score for classification tasks
* $R^2$ score for regression tasks