# Error and Evaluation Metrics

* A key step in machine learning algorithm development and testing is determining a good error and evaluation metric. 

* Evaluation metrics help us estimate how well our model performs, so it is important to pick a metric that matches our overall goal for the system.

* Some common evaluation metrics include precision, recall, receiver operating characteristic (ROC) curves, and confusion matrices.

### Classification Accuracy and Error 

* Classification accuracy is defined as the number of correctly classified samples divided by all samples:

\begin{equation}
\text{accuracy} = \frac{N_{cor}}{N} 
\end{equation}
where $N_{cor}$ is the number of correctly classified samples and $N$ is the total number of samples.

* Classification error is defined as the number of incorrectly classified samples divided by all samples:

\begin{equation}
\text{error} = \frac{N_{mis}}{N}
\end{equation}
where $N_{mis}$ is the number of misclassified samples and $N$ is the total number of samples.

* Suppose we have a 3-class classification problem in which we would like to assign each sample (a fish) to one of three classes (A = salmon, B = sea bass, or C = cod).

* Let's assume there are 150 samples, including 50 salmon, 50 sea bass and 50 cod.  Suppose our model misclassifies 3 salmon, 2 sea bass and 4 cod.

* The prediction accuracy of our 3-class classification model is calculated as:

\begin{equation}
\text{accuracy} = \frac{47+48+46}{50+50+50} = \frac{47}{50}
\end{equation}

* Prediction error is calculated as:

\begin{equation}
\text{error} = \frac{N_{mis}}{N} = \frac{3+2+4}{50+50+50} = \frac{3}{50}
\end{equation}
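
* As a minimal sketch, the same arithmetic can be checked in a few lines of Python. The variable names are illustrative only; the counts are the ones from the fish example.

```python
# Per-class sample counts and misclassification counts from the fish example.
n_per_class = {"salmon": 50, "sea bass": 50, "cod": 50}
n_mis_per_class = {"salmon": 3, "sea bass": 2, "cod": 4}

n_total = sum(n_per_class.values())    # N
n_mis = sum(n_mis_per_class.values())  # N_mis
n_cor = n_total - n_mis                # N_cor

accuracy = n_cor / n_total
error = n_mis / n_total

print(f"accuracy = {n_cor}/{n_total} = {accuracy:.2f}")
print(f"error    = {n_mis}/{n_total} = {error:.2f}")
```

* Since every sample is either correctly classified or misclassified, accuracy and error always sum to 1.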



### Confusion Matrices

* A confusion matrix summarizes the classification accuracy across several classes. It shows the ways in which our classification model is confused when it makes predictions, allowing visualization of the performance of our algorithm. Generally, each row represents the instances of an actual class while each column represents the instances in a predicted class. 

#### Binary classification example

* For a binary classifier, each prediction is one of two classes; for instance, let the two classes be P (positive) and N (negative). The figure below shows the prediction results in a confusion matrix:

<img src="figures/confusion_matrix_example.png"  style="width: 300px;"/>

* True positive (TP): an event correctly predicted as an event; the number of samples with actual class P that the classifier predicts as P.
* False positive (FP): a non-event incorrectly called an event; the number of samples with actual class N that the classifier predicts as P.
* True negative (TN): a non-event correctly predicted as a non-event; the number of samples with actual class N that the classifier predicts as N.
* False negative (FN): an event incorrectly labeled as a non-event; the number of samples with actual class P that the classifier predicts as N.

#### Multi-class classification example

* If our classifier is trained to distinguish between salmon, sea bass and cod, then we can summarize the prediction results in a confusion matrix as follows:


| Actual/Predicted | Salmon | Sea bass | Cod  |
| --- | --- | --- | --- |
| Salmon | 47 | 2 | 1 |
| Sea bass | 2 | 48 | 0 |
| Cod | 0 | 0 | 50 |

* In this confusion matrix, of the 50 actual salmon, the classifier correctly labeled 47 as salmon and incorrectly predicted 2 as sea bass and 1 as cod. All correct predictions lie on the diagonal of the table, so it is easy to inspect the table visually for prediction errors: they are the nonzero values off the diagonal.
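
* The diagonal/off-diagonal reading can be sketched in Python as follows. This is only an illustration using the table values; the variable names are assumptions, not part of any library.

```python
# Confusion matrix from the table above: rows are actual classes,
# columns are predicted classes.
classes = ["salmon", "sea bass", "cod"]
confusion = [
    [47, 2, 1],   # actual salmon
    [2, 48, 0],   # actual sea bass
    [0, 0, 50],   # actual cod
]

# Correct predictions lie on the diagonal.
n_correct = sum(confusion[i][i] for i in range(len(classes)))
n_total = sum(sum(row) for row in confusion)
overall_accuracy = n_correct / n_total

# Nonzero off-diagonal entries are the errors: (actual, predicted, count).
errors = [
    (classes[i], classes[j], confusion[i][j])
    for i in range(len(classes))
    for j in range(len(classes))
    if i != j and confusion[i][j] > 0
]

print(f"accuracy = {n_correct}/{n_total} = {overall_accuracy:.3f}")
for actual, predicted, count in errors:
    print(f"{count} {actual} predicted as {predicted}")
```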


#### Common Performance Measures
* Precision is also called positive predictive value.

\begin{equation}
\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}
\end{equation}

* Recall is also called true positive rate or probability of detection.

\begin{equation}
\text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}
\end{equation}


* Fall-out is also called false positive rate, probability of false alarm.

\begin{equation}
\text{Fall-out} = \frac{\text{FP}}{\text{All negative samples}}= \frac{\text{FP}}{\text{FP}+\text{TN}}
\end{equation}

* *Consider the salmon/non-salmon classification problem below. What are the TP, FP, TN, and FN values?*

| Actual/Predicted | Salmon | Non-Salmon  |
| --- | --- | --- | 
| Salmon | 47 | 3 | 
| Non-Salmon | 2 | 98 | 
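
* One way to check an answer is to read the four counts off the table, taking salmon as the positive class (an assumption about which class is the "event"), and plug them into the formulas above. A minimal sketch:

```python
# Salmon is treated as the positive class.
# Rows of the table are actual classes, columns are predicted classes.
tp, fn = 47, 3    # actual salmon: predicted salmon / predicted non-salmon
fp, tn = 2, 98    # actual non-salmon: predicted salmon / predicted non-salmon

precision = tp / (tp + fp)   # TP / (TP + FP)
recall = tp / (tp + fn)      # TP / (TP + FN)
fall_out = fp / (fp + tn)    # FP / (FP + TN)

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"fall-out  = {fall_out:.3f}")
```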


### ROC curves 

* The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR), with TPR on the $y$-axis and FPR on the $x$-axis.

* $TPR = TP/(TP+FN)$ is the ratio of true positive predictions to all actual positive samples. The definition used for $FPR$ in a ROC curve is often problem dependent.  For example, for detection of targets in an area, FPR may be defined as the number of false alarms per unit area ($FA/m^2$).  In another example, if you have a set number of images and you are looking for targets in this collection of images, FPR may be defined as the number of false alarms per image.  In some cases, it may make the most sense to simply use the fall-out, i.e. the false positive rate defined above.

* Given a binary classifier and a decision threshold, one $(x, y)$ point in ROC space can be calculated from the prediction results.  You trace out a ROC curve by varying the threshold and collecting the resulting points.

* The diagonal between (0,0) and (1,1) separates the ROC space into two regions: the upper left and the lower right. Points above the diagonal represent good classification (better than random guessing), while points below the diagonal represent bad classification (worse than random guessing).
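
* The threshold sweep can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: `roc_points` is a hypothetical helper, the scores and labels are made up, FPR is taken to be the fall-out, and a sample is assumed to be predicted positive when its score is at or above the threshold.

```python
import math


def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, from highest to lowest threshold.

    labels are 1 for positive and 0 for negative; both classes must be present.
    A sample is predicted positive when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Start above every score so that the first point is (0, 0).
    thresholds = [math.inf] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical classifier scores and true labels, made up for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

roc = roc_points(scores, labels)
for fpr, tpr in roc:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

* The sweep starts at (0,0), where nothing is predicted positive, and ends at (1,1), where everything is predicted positive.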

* *What is the perfect prediction point in a ROC curve?*

### Precision-Recall curves

* ROC curves trace out the TPR vs. FPR over many thresholds.  

* Similarly, other metrics can be plotted over many thresholds.  Another common example is the Precision-Recall (PR) curve.

* PR curves are generated the same way as ROC curves; however, instead of plotting TPR vs. FPR, Precision vs. Recall (as defined above) are plotted over many thresholds.

* *What does the perfect PR curve look like?*

* PR curves are often preferred over ROC curves in cases of severely imbalanced data.  

* As with ROC and PR curves, any statistic that can be computed from the confusion matrix can be plotted over a range of thresholds.
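
* A PR curve can be traced with the same kind of threshold sweep. The sketch below is an illustration under assumptions: `pr_points` is a hypothetical helper, the scores and labels are made up, and a sample is predicted positive when its score is at or above the threshold. Thresholds are taken from the observed scores so that at least one sample is always predicted positive and precision is well defined.

```python
def pr_points(scores, labels):
    """Return (recall, precision) pairs, one per threshold, from highest to lowest.

    labels are 1 for positive and 0 for negative; at least one positive is required.
    """
    n_pos = sum(labels)
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        recall = tp / n_pos
        precision = tp / (tp + fp)
        points.append((recall, precision))
    return points


# Hypothetical classifier scores and true labels, made up for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

pr = pr_points(scores, labels)
for recall, precision in pr:
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```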




### MSE and MAE


* *Mean Square Error* (MSE) is the average of the squared error between prediction and actual observation. 

* For each sample $\mathbf{x}_i$, the predicted value is $y_i$ and the actual output is $d_i$. With $n$ samples in total, the MSE is

\begin{equation}
MSE = \frac{1}{n} \sum_{i=1}^n (d_i - y_i)^2
\end{equation}

* *Root Mean Square Error* (RMSE) is simply the square root of the MSE. 

\begin{equation}
RMSE = \sqrt{MSE}
\end{equation}


* *Mean Absolute Error* (MAE) is the average of the absolute error.

\begin{equation}
MAE = \frac{1}{n} \sum_{i=1}^n \lvert d_i - y_i \rvert
\end{equation}
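
* The three regression metrics can be sketched directly from their definitions. The values of `d` and `y` below are made up for illustration.

```python
import math

# Hypothetical actual outputs d_i and predictions y_i, made up for illustration.
d = [3.0, -0.5, 2.0, 7.0]
y = [2.5, 0.0, 2.0, 8.0]
n = len(d)

# Mean of the squared errors.
mse = sum((di - yi) ** 2 for di, yi in zip(d, y)) / n
# Square root of the MSE, which is in the same units as the outputs.
rmse = math.sqrt(mse)
# Mean of the absolute errors.
mae = sum(abs(di - yi) for di, yi in zip(d, y)) / n

print(f"MSE  = {mse:.3f}")
print(f"RMSE = {rmse:.3f}")
print(f"MAE  = {mae:.3f}")
```

* Because the errors are squared, the single error of magnitude 1 contributes two thirds of the MSE here but only half of the MAE, which illustrates why MSE is more sensitive to large errors.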


### Final Thoughts/Summary

Of course, there are many other evaluation metrics; these are just a few of the most commonly used.  In practice, the *best* evaluation metrics are those that give you insight into your problem, particularly when they are not tied closely to your objective function.

<img src="figures/evaluation_meme.png"  style="width: 500px;"/>

-F. Diaz, Source: https://twitter.com/841io/status/1405184102798667777?s=20

It is important to remember Goodhart's Law when using evaluation metrics (https://en.wikipedia.org/wiki/Goodhart%27s_law): 

"When a measure becomes a target, it ceases to be a good measure."