In [1]:
%run ../../common/import_all.py

from common.setup_notebook import set_css_style, setup_matplotlib, config_ipython
config_ipython()
setup_matplotlib()
set_css_style()

# Performance Metrics in Classification

## The possible outcomes of a binary classification

If we are in a classification problem where the two classes are $1$ (call it positive) and $0$ (call it negative), our model can spit out either of them, but chances are some points will be classified wrongly. This leaves us with 4 possible situations in terms of how the points get classified: 

* $TP$: True Positives, those points which are predicted as $1$ and are actually $1$;
* $TN$: True Negatives, those points which are predicted as $0$ and are actually $0$;
* $FP$: False Positives, those points which are predicted as $1$ but are actually $0$;
* $FN$: False Negatives, those points which are predicted as $0$ but are actually $1$.

<img src="../../imgs/metrics-class.pdf" align="left" width="400" style="margin:20px 50px"/>

The figure depicts these groups visually. In the context of Information Retrieval, let us say that I run a query against my corpus of texts: some documents will be retrieved but not all of them will be a good match to my query. The positive class represents the match to the query. 

Now, the sum $TP + FN + FP + TN$ gives the number of all documents in the corpus. $TP + FP$ (the area of the ellipse) will give me the documents selected by the query; but the elements which are relevant to the query are instead given by the sum of the green areas, $TP + FN$. 
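As a minimal sketch of the four groups, the snippet below counts them from two lists of labels. The function name `count_outcomes` and the toy data are hypothetical, chosen only for illustration.

```python
def count_outcomes(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels in {0, 1}."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        else:  # predicted 0 but actually 1
            fn += 1
    return tp, tn, fp, fn


# Hypothetical toy data: 1 = document matches the query, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = count_outcomes(y_true, y_pred)
retrieved = tp + fp  # documents selected by the query
relevant = tp + fn   # documents that truly match the query
print(tp, tn, fp, fn, retrieved, relevant)
```

The four counts always add up to the size of the corpus, which is a quick sanity check on any such tally.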

% TODO Add this (from scikit docs) Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples.
% TODO see stuff in http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures, particularly for macro/micro

% TODO link to the bias/variance tradeoff explanation and how is it related to underfitting/overfitting

## Confusion Matrix

The confusion matrix, sometimes called a "contingency table" or "error matrix", is meant to *visualize* performance.

Again let us consider the simple case of two classes (binary classification). The confusion matrix then takes the form shown in the table below, where rows are the real classes and columns the predicted ones.

| | **Predicted $1$** | **Predicted $0$** |
|---|---|---|
| **Real $1$** | TP | FN |
| **Real $0$** | FP | TN |

*The Confusion Matrix: the abbreviations represent the number of data points falling in each group.*

In the case of a multi-class classification problem with $n$ classes, the matrix will be an $n \times n$ one, where the diagonal contains the counts of correctly classified items for each class and the off-diagonal entries report the number of wrongly classified items.
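The multi-class case can be sketched in a few lines of plain Python. The helper `confusion_matrix` below is a hypothetical illustration (not a library function), and it assumes classes are encoded as integers $0, \dots, n-1$, with rows indexing the real class and columns the predicted class.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Build an n x n matrix: rows = real class, columns = predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical labels for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
correct = sum(cm[k][k] for k in range(3))  # sum of the diagonal
wrong = len(y_true) - correct              # sum of off-diagonal entries
for row in cm:
    print(row)
```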

## Accuracy, Precision, Recall, Specificity

The *accuracy* is defined as

$$
a = \frac{TP + TN}{TP + TN +FP + FN} \ ,
$$

and measures the number of correct predictions over the total of data points. The weak spot of the accuracy is that it attributes equal cost to both kinds of errors, giving no insight into whether the model errs mostly through false positives or through false negatives.

The *precision* is defined as

$$
p = \frac{TP}{TP + FP} \ ,
$$

and is the fraction of true positives over the total of points classified as positive. With reference to the figure above, the precision gives the number of relevant items found by the model divided by the total number of items retrieved in the positive class. In the Information Retrieval context described above, it would measure "how useful the results are".

$p=1$ says that all items labelled as belonging to class $1$ were actually in class $1$, but says nothing about items of class $1$ that were missed.

The *recall* (also called *sensitivity*) is defined as

$$
r = \frac{TP}{TP + FN} \ ,
$$

and gives the fraction of true positives over the total of points belonging to the positive class. Again referring to the figure above, the recall gives the number of relevant items found by the model divided by the total number of existing documents in the positive class. In the Information Retrieval context described above, it would measure "how complete the results are".

$r=1$ means that all items in real class $1$ were actually classified as class $1$, but says nothing about items wrongly classified in class $1$.

The relation between precision and recall is often inverse: it is usually possible to increase one at the cost of reducing the other, so they have to be investigated together. The F-score described later is a metric taking both of them into account.
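A small sketch can make the trade-off concrete. It assumes, purely for illustration, a model that outputs a score per sample and predicts class $1$ when the score reaches a threshold; the function `precision_recall`, the scores and the thresholds are all hypothetical.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting 1 for scores >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    # Convention assumed here: return 0.0 when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


# Hypothetical true labels and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

strict = precision_recall(y_true, scores, threshold=0.75)
lenient = precision_recall(y_true, scores, threshold=0.35)
print("strict threshold: ", strict)
print("lenient threshold:", lenient)
```

On this toy data the strict threshold yields perfect precision but misses one positive, while the lenient one finds every positive at the price of one false positive.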

*Example*

Let us imagine a surgeon who needs to remove all cancerous cells from a patient to prevent regrowth. If healthy cells are removed in the process as well, this would leave harmful lesions on the organs involved.
Choosing to increase recall at the cost of precision means removing more cells than needed, to make sure that all the bad ones go. Choosing to increase precision at the cost of recall, on the other hand, means being more conservative and making sure that only bad cells are removed, at the risk of not removing them all.

The *specificity* is defined as

$$
s = \frac{TN}{FP + TN} \ ,
$$

and is the counterpart of recall for the negative class: the fraction of true negatives over the total of points belonging to the negative class.
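As a quick numerical check, the sketch below computes the specificity from hypothetical counts, together with the false positive rate $\frac{FP}{FP + TN}$, which is its complement. The function names and the numbers are assumptions made for illustration.

```python
def specificity(tn, fp):
    """Fraction of real negatives that are predicted as negative."""
    return tn / (tn + fp)


def false_positive_rate(tn, fp):
    """Fraction of real negatives that are predicted as positive."""
    return fp / (fp + tn)


# Hypothetical counts for the negative class.
tn, fp = 90, 10

s = specificity(tn, fp)
fpr = false_positive_rate(tn, fp)
print(s, fpr, s + fpr)  # the two rates sum to 1
```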

Sensitivity is also called the *TPR* (True Positive Rate), while the *FPR* (False Positive Rate) is the complement of the specificity, $FPR = \frac{FP}{FP + TN} = 1 - s$. TPR and FPR are the values used in a ROC analysis. %TODO put ref to ROC section

### Lift

The *lift* is defined as

$$
l = \frac{\frac{TP}{TP + FN}}{\frac{TP + FP}{TP + FN + FP + TN}} \ ,
$$

and is the ratio between the recall (the fraction of positive samples which are correctly classified) and the fraction of samples classified as positive in the whole dataset.
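The formula can be evaluated directly from the four counts. The sketch below uses hypothetical counts and a hypothetical helper named `lift`; it also checks that the same number is obtained as the precision divided by the fraction of real positives, which follows from rearranging the definition.

```python
def lift(tp, fn, fp, tn):
    """Recall divided by the fraction of samples predicted as positive."""
    recall = tp / (tp + fn)
    predicted_positive_rate = (tp + fp) / (tp + fn + fp + tn)
    return recall / predicted_positive_rate


# Hypothetical counts.
tp, fn, fp, tn = 80, 20, 5, 195

l = lift(tp, fn, fp, tn)

# Equivalent form: precision over the fraction of real positives.
precision = tp / (tp + fp)
positive_rate = (tp + fn) / (tp + fn + fp + tn)
print(round(l, 4), round(precision / positive_rate, 4))
```

A lift above $1$ means the classifier finds positives more often than picking samples at random would.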

The lift measures the strength of the classifier on the basis of the positive ($1$) samples which are correctly predicted. It is a metric typically used in marketing. % TODO continue from page 2/4, the example and so on

%TODO ROC here?

### F-score

In general, an *F-score* is defined as

$$
F_\beta = (1 + \beta^2)\frac{pr}{\beta^2p + r} \ , \quad \beta > 0, \ \beta \in \mathbb{R}
$$

The most commonly seen metric in this class is the *F1-score*, which is just the harmonic mean of precision and recall:

$$
F_1 = 2 \frac{pr}{p + r}
$$

The F-score provides a way to weigh precision and recall differently: while the F1-score weighs them equally, the F2-score gives more weight to the recall and the F0.5-score gives more weight to the precision, for instance.
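To see the effect of $\beta$, the sketch below evaluates the $F_\beta$ formula for three values of $\beta$ on a hypothetical pair of precision and recall. The helper name `f_beta` and the zero-division convention are assumptions made for this illustration.

```python
def f_beta(p, r, beta=1.0):
    """F-score for precision p and recall r with weight beta > 0."""
    if p == 0 and r == 0:
        return 0.0  # convention assumed here to avoid dividing by zero
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical classifier with low precision and high recall.
p, r = 0.5, 1.0

f1 = f_beta(p, r)
f2 = f_beta(p, r, beta=2.0)
f05 = f_beta(p, r, beta=0.5)
print(round(f05, 4), round(f1, 4), round(f2, 4))
```

Since the recall is the larger of the two here, the score grows with $\beta$: a larger $\beta$ pulls $F_\beta$ towards the recall, a smaller one towards the precision.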

## Per-class Accuracy

The accuracy specified above is an example of a microaverage: all samples are pooled together regardless of their class, which means that the information about the classification error in each class is hidden.

The per-class accuracy, on the other hand, is a macroaverage metric: it is the average, over the classes, of the fraction of correctly classified samples in each class.

This metric is particularly useful when classes are imbalanced, as the accuracy alone may give a quite distorted picture: the class with the highest number of samples will dominate the statistics. It is typically a good idea to look at both metrics anyway.

*Example*

Let us assume that we have a binary classifier, tested on $100$ samples for class $1$ and $200$ samples for class $0$ and whose performance reports these results:

* $TP = 80$;
* $FN = 20$;
* $FP = 5$;
* $TN = 195$.

The accuracy gives $a = \frac{80 + 195}{80 + 20 + 5 + 195} \approx 0.92$, so quite high. But this masks the fact that class $1$ actually has a misclassification rate of $20/100$ (class $0$ has a misclassification rate of $5/200$, so quite small). In terms of accuracies per class (the complements of these misclassification rates), class $1$ has $a_1 = 0.8$ and class $0$ has $a_0 = 0.975$, so that the per-class accuracy is

$$
a_p = \frac{1}{2} \left(0.8 + 0.975 \right) = 0.8875
$$
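The arithmetic of this example can be reproduced with a few lines of Python. The variable names are chosen here for illustration; the counts are the ones given in the example.

```python
# Counts from the example: 100 samples of class 1, 200 samples of class 0.
tp, fn, fp, tn = 80, 20, 5, 195

# Microaverage: pool all samples together.
accuracy = (tp + tn) / (tp + fn + fp + tn)

# Macroaverage: accuracy within each class, then a plain mean.
acc_class_1 = tp / (tp + fn)
acc_class_0 = tn / (tn + fp)
per_class_accuracy = (acc_class_1 + acc_class_0) / 2

print(round(accuracy, 4))            # 0.9167
print(round(per_class_accuracy, 4))  # 0.8875
```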

## Log-loss

This metric can be used when the probability of the classification of each sample is accessible and is defined as

$$
L = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \log p_i + (1 - y_i) \log (1-p_i) \right] \ ,
$$

where $N$ is the number of samples, $y_i \in \{0, 1\}$ is the class of the sample with index $i$ and $p_i$ is the predicted probability that this sample belongs to class $1$.

The log-loss is the cross-entropy between the distribution of the true labels and the predictions. It measures the additional unpredictability (noise) that comes from using the predictor rather than the true labels, so the smaller the log-loss, the better. %TODO cross-entropy reference
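A minimal sketch of the computation follows, written with the conventional leading minus sign so that the loss is non-negative and smaller is better. The function name `log_loss`, the clipping constant `eps` and the toy probabilities are assumptions made for illustration; here `probs` holds the predicted probability of class $1$ for each sample.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Average negative log-likelihood of binary labels under probs."""
    total = 0.0
    for y, p in zip(y_true, probs):
        # Clip probabilities away from 0 and 1 so the logarithm stays finite.
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical predictions for one positive and one negative sample.
y_true = [1, 0]
confident = log_loss(y_true, [0.9, 0.1])  # confident and correct
unsure = log_loss(y_true, [0.5, 0.5])     # no information: equals log(2)
wrong = log_loss(y_true, [0.1, 0.9])      # confident and wrong
print(confident, unsure, wrong)
```

Confident correct predictions give a loss close to zero, while confident wrong ones are penalized heavily.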