# Metrics averaging 

In this notebook I explain in more detail the meaning of the `average` parameter in the scikit-learn scores and how to use the `confusion_matrix` function. 
I discuss only binary classifiers, and for this purpose I use a simple Gaussian classifier.

In [None]:
import numpy as np
from scipy.stats import norm

In [None]:
n_pos = 100
n_neg = 300

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

In [None]:
positives = norm(0,0.9).rvs(size=n_pos)
negatives = norm(1,0.5).rvs(size=n_neg)
features = np.concatenate((positives, negatives))
labels = np.concatenate((np.ones_like(positives), np.zeros_like(negatives))).astype('int32')

train_f, test_f, train_l, test_l = train_test_split(features, labels, test_size=0.2)

gaus = GaussianNB()

gaus.fit(train_f.reshape(-1,1), train_l)
test_pred = gaus.predict(test_f.reshape(-1,1))

## Confusion matrix 

The confusion matrix gives the most detailed information about the performance of a classifier. The $(i,j)$ entry of this matrix counts the number of instances of class $i$ classified as $j$. However, conventions (orientations) differ as to whether the true labels correspond to rows or to columns. For example, 
scikit-learn provides a confusion matrix function

In [None]:
from sklearn.metrics import confusion_matrix, plot_confusion_matrix

In [None]:
confusion_matrix(test_l, test_pred)

but its layout is different from the one on [Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix) or in my classification notebook! The rows of this matrix correspond to true labels and the columns to predicted labels

<table style="text-align:center;font-size:14pt">
    <tr><td/><td/><th colspan=2>Predicted labels</th></tr>
    <tr><td/><td/><th>N</th><th>P</th></tr>
    <tr> <th rowspan=2> True labels</th><th>N</th>   <td> TN</td><td>FP</td> </tr>
    <tr> <th>P</th>                                  <td> FN</td><td>TP</td> </tr>
</table>

We can check this by calculating the entries "by hand"

In [None]:
np.sum(test_l>test_pred) # False negatives

In [None]:
np.sum(test_l<test_pred)# False positives

In [None]:
np.sum((test_l==test_pred) & (test_l==1)) # True positives

In [None]:
np.sum((test_l==test_pred) & (test_l==0)) # True negatives

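The `plot_confusion_matrix` function labels the axes accordingly. Note that this function was deprecated in scikit-learn 1.0 and removed in 1.2, where `ConfusionMatrixDisplay.from_estimator` takes its place.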

In [None]:
plot_confusion_matrix(gaus, test_f.reshape(-1,1), test_l);

We can assign all four quantities at once:

In [None]:
TN, FP, FN, TP = confusion_matrix(test_l, test_pred).ravel()  # ravel "flattens" a multidimensional array row by row
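The layout and the order returned by `ravel` can be checked on a tiny example. This is only a sketch: the arrays `y_true` and `y_pred` are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels (illustrative only): 4 negatives followed by 3 positives
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1])

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
TN, FP, FN, TP = cm.ravel()

# Swapping the arguments transposes the matrix, i.e. gives the other convention
cm_swapped = confusion_matrix(y_pred, y_true)

print(cm)
print(cm_swapped)
```

Swapping the two arguments is an easy mistake to make, and the only visible effect is that FP and FN change places.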

The confusion matrix can be normalized in several ways. The `'true'` option normalizes the rows (true labels) so that each row sums to one

In [None]:
confusion_matrix(test_l, test_pred, normalize='true')

In [None]:
confusion_matrix(test_l, test_pred, normalize='true').sum(axis=1)

This returns the rates:

<table style="text-align:center;font-size:14pt">
    <tr><td/><td/><th colspan=2>Predicted labels</th></tr>
    <tr><td/><td/><th>N</th><th>P</th></tr>
    <tr> <th rowspan=2> True labels</th><th>N</th>   <td> TNR</td><td>FPR</td> </tr>
    <tr> <th>P</th>                                  <td> FNR</td><td>TPR</td> </tr>
</table>
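The row normalization can be reproduced by dividing each row of the raw counts by its sum. A minimal sketch, again with made-up labels `y_true` and `y_pred` rather than the notebook's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred)

# Divide every row by the number of true instances of that class
rates = cm / cm.sum(axis=1, keepdims=True)
TNR, FPR, FNR, TPR = rates.ravel()

# Same result as asking scikit-learn for the 'true' normalization
rates_sklearn = confusion_matrix(y_true, y_pred, normalize='true')
print(rates)
print(rates_sklearn)
```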

The `'pred'` option normalizes the columns (predicted labels) so that each column sums to one

In [None]:
confusion_matrix(test_l, test_pred, normalize='pred')

In [None]:
confusion_matrix(test_l, test_pred, normalize='pred').sum(axis=0)

This leads to

<table style="text-align:center;font-size:14pt">
    <tr><td/><td/><th colspan=2>Predicted labels</th></tr>
    <tr><td/><td/><th>N</th><th>P</th></tr>
    <tr> <th rowspan=2> True labels</th><th>N</th>   <td> $\frac{TN}{TN+FN}$</td><td>$\frac{FP}{FP+TP}$</td> </tr>
    <tr> <th>P</th>                                  <td> $\frac{FN}{TN+FN}$</td><td>$\frac{TP}{FP+TP}$</td> </tr>
</table>

According to the [Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix) entry this corresponds to

<table style="text-align:center;font-size:14pt;">
    <tr><td/><td/><th colspan=2>Predicted labels</th></tr>
    <tr><td/><td/><th>N</th><th>P</th></tr>
    <tr> <th rowspan=2> True labels</th><th>N</th>   <td> NPV</td><td>FDR</td> </tr>
    <tr> <th>P</th>                                  <td> FOR</td><td>PPV</td> </tr>
</table>

<table style="font-size:12pt;"> 
    <tr><td> NPV = negative predictive value </td></tr>
    <tr><td> FOR = false omission rate </td></tr>
    <tr><td> FDR = false discovery rate </td></tr>
    <tr><td> PPV = positive predictive value, precision </td></tr>
</table>    
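As a quick check of the column normalization, the bottom-right entry should agree with the precision returned by `precision_score`. The sketch below uses made-up labels `y_true` and `y_pred`, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Hand-made labels (illustrative only): TN=3, FP=1, FN=1, TP=2
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0, 1])

# Each column (predicted label) sums to one
by_column = confusion_matrix(y_true, y_pred, normalize='pred')
NPV, FDR, FOR, PPV = by_column.ravel()

precision = precision_score(y_true, y_pred)
print(by_column)
print(PPV, precision)
```

In each column the two entries are complementary: NPV + FOR = 1 and FDR + PPV = 1.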

And finally we can normalize across the whole matrix

In [None]:
confusion_matrix(test_l, test_pred, normalize='all')

In [None]:
confusion_matrix(test_l, test_pred, normalize='all').sum()

### Takeaway

Please check the orientation and normalization of the confusion matrix before interpreting its entries. 

## Averaging scores

Most of the scores were designed for binary classifiers with two classes, traditionally denoted as "positives" and "negatives". When there are more than two labels there are no "positives" and "negatives". Instead, each class is in turn treated as the positives and all the others as the negatives, and the scores are calculated as for a binary classifier. That way we obtain as many scores as there are classes. The averaging describes how those "partial" scores are combined into one final score. 
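The "each class in turn" procedure can be sketched for a three-class problem as follows. The labels are made up for illustration; the loop binarizes the labels by hand and the result is compared with `recall_score(..., average=None)`, which returns the same partial scores.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hand-made three-class labels (illustrative only)
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

partial = []
for k in np.unique(y_true):
    # Class k plays the role of "positives", everything else is "negatives"
    true_k = (y_true == k).astype(int)
    pred_k = (y_pred == k).astype(int)
    partial.append(recall_score(true_k, pred_k))
partial = np.array(partial)

# scikit-learn computes the same partial scores in one call
per_class = recall_score(y_true, y_pred, average=None)
print(partial)
print(per_class)
```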

### Recall 

Let's take recall as an example. [Recall](https://en.wikipedia.org/wiki/Precision_and_recall) is by definition the true positive rate:

In [None]:
TP/(TP+FN)

If we use `recall_score` with the default value (`'binary'`) of the `average` parameter we get just that

In [None]:
from sklearn.metrics import recall_score

print(recall_score(test_l, test_pred))
print(recall_score(test_l, test_pred, average='binary'))

For a binary classifier there are two ways of assigning the classes to positives and negatives. What we have done above with `average='binary'` was to treat label "one" as the positives and "zero" as the negatives. If we switch this assignment, that is, swap positives with negatives, the recall becomes 

In [None]:
TN/(TN+FP)

We can do this in the `recall_score` function by explicitly indicating which label should be treated as positive

In [None]:
recall_score(test_l, test_pred, pos_label=0)

Setting `average=None` disables averaging and returns the scores for all possible assignments of the positive label

In [None]:
recall_score(test_l, test_pred, average=None)
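The relation between `pos_label` and `average=None` can be checked on a small example. This is a sketch with made-up labels, not the notebook's data: the array returned for `average=None` is ordered by label, so its first entry is the recall with `pos_label=0` and its second the recall with `pos_label=1`.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hand-made labels (illustrative only): TN=3, FP=1, FN=1, TP=2
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0, 1])

per_label = recall_score(y_true, y_pred, average=None)
recall_neg = recall_score(y_true, y_pred, pos_label=0)  # TN / (TN + FP)
recall_pos = recall_score(y_true, y_pred, pos_label=1)  # TP / (TP + FN)

print(per_label, recall_neg, recall_pos)
```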

### Macro averaging

One way of combining the scores is to take their arithmetic mean

In [None]:
recall_score(test_l, test_pred, average=None).mean()

That corresponds to `'macro'` averaging

In [None]:
recall_score(test_l, test_pred, average='macro')

### Micro averaging

Micro averaging is slightly more complicated: we first calculate the average TP and FN over all assignments and then use those averages to calculate the final score. Let $i$ be the label of the positive class. For each $i$ we can calculate the number of true positives $TP_i$ and false negatives $FN_i$. We define the averages

$$\overline{TP} = \frac{1}{K}\sum_i TP_i,\quad \overline{FN} = \frac{1}{K}\sum_i FN_i$$

where $K$ is the number of classes (two in our case). We then use those averages to calculate the final recall score

$$\frac{\overline{TP}}{\overline{TP}+\overline{FN}}$$

In our example

$$TP_0=TN,\quad FN_0= FP,\quad TP_1=TP,\quad FN_1= FN$$

$$\frac{\overline{TP}}{\overline{TP}+\overline{FN}} = \frac{TN+TP}{TN+FP+TP+FN}$$

In [None]:
(TN+TP)/(TN+TP+FN+FP)

And indeed this is what we get

In [None]:
recall_score(test_l, test_pred, average='micro')
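The same recipe works for any number of classes when read off the confusion matrix: $TP_i$ is the $i$-th diagonal entry and $FN_i$ is the rest of row $i$. The sketch below uses made-up three-class labels. The factor $1/K$ cancels, so the micro-averaged recall is the fraction of correctly classified instances, which for single-label problems equals the accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hand-made three-class labels (illustrative only)
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred)
tp = np.diag(cm)             # TP_i: instances of class i predicted as i
fn = cm.sum(axis=1) - tp     # FN_i: instances of class i predicted as something else

# Averages over the K classes, then the recall formula
micro_by_hand = tp.mean() / (tp.mean() + fn.mean())

micro = recall_score(y_true, y_pred, average='micro')
accuracy = accuracy_score(y_true, y_pred)
print(micro_by_hand, micro, accuracy)
```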

### Weighted averaging

Finally, weighted averaging is like macro averaging, but each partial score is weighted by the support of its class, _i.e._ the number of true instances of that class. For a binary classifier that is the number of negatives and the number of positives

In [None]:
support = np.asarray([TN+FP,TP+FN])
print(support)

The weighted average is

In [None]:
np.sum(recall_score(test_l, test_pred, average=None) * support)/support.sum()

In [None]:
recall_score(test_l, test_pred, average='weighted')

#### Problem

The weighted average gives the same answer as micro averaging. Show that for the recall score this is always the case.

Recall (no pun intended) that 

$$R_0 = \frac{TN}{TN+FP},\quad R_1 = \frac{TP}{TP+FN}$$

The support is the number of true instances of each class

$$S_0=TN+FP,\quad S_1 = TP+FN$$

The weighted average

$$\frac{1}{TN+FP+TP+FN}\left(\frac{TN}{TN+FP}(TN+FP)+\frac{TP}{TP+FN}(TP+FN)\right)$$

gives

$$\frac{TN+TP}{TN+FP+TP+FN}$$

which is exactly what we get for micro averaging.
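The same cancellation happens for any number of classes, because each partial recall is multiplied by exactly its own denominator. A numerical sanity check, assuming randomly generated four-class labels (purely synthetic, with a fixed seed), might look like this:

```python
import numpy as np
from sklearn.metrics import recall_score

# Synthetic labels for illustration: 4 classes, predictions unrelated to the truth
rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=200)
y_pred = rng.integers(0, 4, size=200)

per_class = recall_score(y_true, y_pred, average=None)
support = np.bincount(y_true, minlength=4)   # true instances of each class

# Weighted average of the partial recalls, computed by hand
weighted_by_hand = np.sum(per_class * support) / support.sum()

weighted = recall_score(y_true, y_pred, average='weighted')
micro = recall_score(y_true, y_pred, average='micro')
print(weighted_by_hand, weighted, micro)
```

A numerical check is of course not a proof, and the equality is specific to recall: for precision the weights (supports) do not match the denominators, so weighted and micro averages generally differ.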

### Takeaway

Even for binary classifiers the different averaging methods can give different values

In [None]:
average = ['binary', 'macro', 'micro', 'weighted']
for a in average:
    print(f"recall(average='{a:s}') = {recall_score(test_l, test_pred, average=a):8.7}")

There is no "error" in using any of them, but you should be aware of the differences. 

The `classification_report` function reports several scores, listing the partial scores as well as the averaged ones

In [None]:
from sklearn.metrics import classification_report
print(classification_report(test_l, test_pred))
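When the numbers are needed programmatically rather than as printed text, `classification_report` accepts `output_dict=True`. The sketch below uses made-up labels (not the notebook's data) and checks that the reported averages agree with `recall_score`. Note that the per-class entries are keyed by the label converted to a string.

```python
import numpy as np
from sklearn.metrics import classification_report, recall_score

# Hand-made labels (illustrative only): TN=3, FP=1, FN=1, TP=2
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0, 1])

report = classification_report(y_true, y_pred, output_dict=True)

recall_pos = report['1']['recall']                 # partial score for label 1
macro = report['macro avg']['recall']
weighted = report['weighted avg']['recall']

print(recall_pos, macro, weighted)
print(recall_score(y_true, y_pred, average='macro'),
      recall_score(y_true, y_pred, average='weighted'))
```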