# Evaluating Classifiers

## Overview

So far we have seen a number of classification techniques. For a given problem, more than
one algorithm may be applicable. We therefore need a way to assess how good a selected
algorithm is.

In this section we review methods that we can employ to evaluate the
suitability of a classifier. Whatever conclusion we draw from the analysis, we should
bear in mind that it is conditioned on the given dataset. Thus, we assess
the performance of an algorithm on the specified application and say nothing about the
performance of the algorithm in general.

Finally, note that for any learning algorithm there will be a dataset on which the algorithm
is very accurate and another dataset on which it performs poorly.
This observation is known as the No Free Lunch theorem.

## Evaluating classifiers

As we already know, a model is typically trained on a training set and evaluated on
a test set. We do not want the model to be trained on the test set, as this
would be a form of cheating: the measured performance would no longer reflect how the model behaves on unseen data.
We have also seen that accuracy is a common metric for
evaluating the performance of a classifier. Accuracy is defined as


\begin{equation}
\text{Accuracy} = \frac{\text{Total number of correct classifications}}{\text{Total number of points}}
\end{equation}


If we know the accuracy, we can compute the error rate and vice versa as the two are related according to

\begin{equation}
\text{Error rate} = 1 - \text{Accuracy} \tag{1}
\end{equation}
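
The two definitions above can be checked on a small example. The following is a minimal sketch in plain Python; the function names `accuracy` and `error_rate` and the label lists are assumptions made for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def error_rate(y_true, y_pred):
    """Error rate, obtained from the accuracy via equation (1)."""
    return 1.0 - accuracy(y_true, y_pred)


# Hypothetical labels for eight test points.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)   # 6 of the 8 predictions are correct
err = error_rate(y_true, y_pred)
print(f"accuracy = {acc:.3f}, error rate = {err:.3f}")
```

Here six of the eight predictions agree with the true labels, so the accuracy is 0.75 and the error rate is 0.25.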

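Conceptually, we can view accuracy as an estimate of the probability that an arbitrary
instance $\mathbf{x}\in \mathbf{X}$ is classified correctly, i.e. as an estimate of the probability [1]

\begin{equation}
P(\hat{c}(\mathbf{x}) = c(\mathbf{x}))
\end{equation}

where $c(\mathbf{x})$ is the true class of $\mathbf{x}$ and $\hat{c}(\mathbf{x})$ is the class predicted by the classifier.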

However, when the classes are imbalanced, accuracy alone is not enough and can even be misleading. A <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix</a> tabulates the predicted classes against the actual classes and
lets us derive several additional quality metrics. Some of these metrics are described below.

### Precision and recall

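Precision and recall are two performance measures used to assess the quality
of the results produced by a binary classifier. Let $TP$ denote the number of true positives
(positive points classified as positive), $FP$ the number of false positives (negative points
classified as positive) and $FN$ the number of false negatives (positive points classified as
negative). Precision and recall are then defined as follows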

\begin{equation}
\text{Precision} = \frac{TP}{TP + FP}
\end{equation}

\begin{equation}
\text{Recall} = \frac{TP}{TP + FN}
\end{equation}

Note that recall is also known as the sensitivity, or true positive rate, of the model.
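
As an illustration of the definitions above, the sketch below counts the confusion matrix entries for a binary problem and derives precision and recall from them. The helper names, the label lists and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    # Assumed convention: 0.0 when no point was predicted positive.
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    # Assumed convention: 0.0 when there are no positive points.
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical labels: four positive and six negative points.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
prec = precision(tp, fp)
rec = recall(tp, fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision = {prec:.3f}, recall = {rec:.3f}")
```

With these labels the classifier finds two of the four positives (recall 0.5), and two of its three positive predictions are correct (precision 2/3).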

### F1-score

A good predictive model should have both high precision and high recall. We can combine precision and
recall into a single score.
The f1-measure, or simply f-measure, equals the harmonic mean of a classifier’s precision and recall. Namely,

$$\text{f-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
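
To see how the harmonic mean behaves, the following sketch computes the f-measure for one pair of precision and recall values and compares it with their arithmetic mean. The function name `f_measure`, the chosen values and the convention of returning 0.0 when both inputs are zero are assumptions for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values chosen for illustration.
precision, recall = 0.9, 0.5

f1 = f_measure(precision, recall)
arithmetic = (precision + recall) / 2

print(f"f-measure       = {f1:.3f}")
print(f"arithmetic mean = {arithmetic:.3f}")
```

The harmonic mean is pulled towards the smaller of the two values, so a classifier cannot reach a high f-measure by excelling at precision or recall alone.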


The f-measure provides us with a robust evaluation for an individual class. However, note that there is no official standard for an acceptable f-measure; appropriate values vary from problem to problem.
As a rough rule of thumb, we treat f-measures in the range 0.9 to 1.0 as excellent, i.e. the model performs
exceptionally well. An f-measure of 0.8 to 0.89 is considered very good, although there is room for improvement.
An f-measure of 0.7 to 0.79 is considered good; the model performs adequately but is not very impressive.
An f-measure of 0.6 to 0.69 is usually considered unacceptable, although it may still be better than random guessing.
Finally, f-measure values below 0.6 are usually treated as totally unreliable.

----
**Remark**

The harmonic mean is intended to measure the central tendency of
rates, such as velocities. Precision and recall are both rates, which is why it is used here.

----

Sometimes the f-measure coincides with the model accuracy.
This is not surprising, since both metrics are supposed to measure the model performance.
However, there is no guarantee that the two will be the same. The
difference between the metrics is especially noticeable when the classes are imbalanced.
In that setting the f-measure is generally considered the more informative metric, because it ignores true negatives and is therefore not inflated by a large majority class.
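
The gap between the two metrics can be made concrete with a small, made-up imbalanced test set. In the sketch below the labels and the predictions are hypothetical; the classifier predicts the majority class almost everywhere.

```python
# Hypothetical test set: 95 negatives (0) followed by 5 positives (1).
y_true = [0] * 95 + [1] * 5
# The classifier finds only one of the five positives and raises no false alarms.
y_pred = [0] * 95 + [1] + [0] * 4

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
print(f"f-measure = {f1:.3f}")
```

The accuracy is 0.96 because the 95 negatives dominate the count, while the f-measure is only about 0.33 because four of the five positives are missed.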

### Receiver operating characteristic

## Statistical distribution of errors

Machine learning algorithms are in general probabilistic in nature.
Thus we need to account for the randomness in the training data, in
the weight initialization, etc. To do so, we use the same algorithm to generate
multiple classifiers and test them on multiple test sets. The resulting collection of error rates gives us a distribution rather than a single number.

Bootstrapping is one procedure we can use to create randomly selected training and
test datasets. This approach creates one or more training datasets. Given that the bootstrap is
essentially sampling with replacement, some examples are repeated within a training set. The corresponding test datasets are then constructed from the examples that were not selected for the respective training datasets.
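
The procedure just described can be sketched with the Python standard library as follows. The function name `bootstrap_split`, the dataset size, the number of splits and the seed are assumptions chosen for illustration; the function returns indices into the original dataset rather than the data itself.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap training set and the matching test set.

    Returns two lists of indices into the original dataset.
    """
    # Sample n_samples indices with replacement: duplicates are expected.
    train_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(train_idx)
    # Examples that were never drawn for training form the test set.
    test_idx = [i for i in range(n_samples) if i not in chosen]
    return train_idx, test_idx


rng = random.Random(42)  # fixed seed so the splits are reproducible
n_samples = 20           # hypothetical dataset size
splits = [bootstrap_split(n_samples, rng) for _ in range(5)]

for k, (train_idx, test_idx) in enumerate(splits):
    print(f"split {k}: {len(set(train_idx))} distinct training examples, "
          f"{len(test_idx)} test examples")
```

Training one classifier per training set and evaluating it on the matching test set yields one error rate per split, and the spread of these values indicates how sensitive the algorithm is to the choice of training data.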

## Summary

## References

1. Peter Flach, _Machine Learning: The Art and Science of Algorithms that Make Sense of Data_, Cambridge University Press, 2012.