# Exercise: More metrics derived from confusion matrices

In this exercise we will learn about different metrics, using them to explain the results obtained from the *binary classification model* we built in the previous unit.

## Data visualization

We will use the dataset with different classes of objects found on the mountain one more time:



In [1]:
import pandas
import numpy
# !wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
# !wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/snow_objects.csv

#Import the data from the .csv file
dataset = pandas.read_csv('snow_objects.csv', delimiter="\t")

#Let's have a look at the data
dataset

Unnamed: 0,size,roughness,color,motion,label
0,50.959361,1.318226,green,0.054290,tree
1,60.008521,0.554291,brown,0.000000,tree
2,20.530772,1.097752,white,1.380464,tree
3,28.092138,0.966482,grey,0.650528,tree
4,48.344211,0.799093,grey,0.000000,tree
...,...,...,...,...,...
2195,1.918175,1.182234,white,0.000000,animal
2196,1.000694,1.332152,black,4.041097,animal
2197,2.331485,0.734561,brown,0.961486,animal
2198,1.786560,0.707935,black,0.000000,animal


Recall that to use the dataset above for *binary classification*, we need to add another column to the dataset, and set it to `True` where the original label is `hiker`, and `False` where it's not.

Let's then add that label, split the dataset and train the model again:


In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Add a new label with true/false values to our dataset
dataset["is_hiker"] = dataset.label == "hiker"

# Split the dataset in a 70/30 train/test ratio.
train, test = train_test_split(dataset, test_size=0.3, random_state=1, shuffle=True)

# define a random forest model
model = RandomForestClassifier(n_estimators=1, random_state=1, verbose=False)

# Define which features are to be used 
features = ["size", "roughness", "motion"]

# Train the model using the binary label
model.fit(train[features], train.is_hiker)

print("Model trained!")

Model trained!


We can now use this model to predict whether objects in the snow are hikers or not.

Let's plot its *confusion matrix*:

In [3]:
# sklearn has a very convenient utility to build confusion matrices
from sklearn.metrics import confusion_matrix
import plotly.figure_factory as ff

# Get the actual labels and the model's predictions for the TEST set
actual = test.is_hiker
predictions = model.predict(test[features])

# Build our confusion matrix, using the actual values and predictions
# from the test set, calculated above
cm = confusion_matrix(actual, predictions, normalize=None)

# Create the sorted list of unique labels in the test set, to use in our plot
# I.e., [False, True]
unique_targets = sorted(list(test["is_hiker"].unique()))

# Convert the labels to lower-case strings ('false', 'true') for the plot axes
x = y = [str(s).lower() for s in unique_targets]

# Plot the matrix above as a heatmap with annotations (values) in its cells
fig = ff.create_annotated_heatmap(cm, x, y)

# Set titles and ordering
fig.update_layout(  title_text="<b>Confusion matrix</b>", 
                    yaxis = dict(categoryorder = "category descending"))

fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=0.5,
                        y=-0.15,
                        showarrow=False,
                        text="Predicted label",
                        xref="paper",
                        yref="paper"))

fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=-0.15,
                        y=0.5,
                        showarrow=False,
                        text="Actual label",
                        textangle=-90,
                        xref="paper",
                        yref="paper"))

# We need margins so the titles fit
fig.update_layout(margin=dict(t=80, r=20, l=120, b=50))
fig['data'][0]['showscale'] = True
fig.show()
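
To read the heatmap correctly, it helps to know how `confusion_matrix` arranges its cells. The sketch below uses hypothetical toy labels (the `toy_` names are not part of the exercise data) to show that, for boolean labels, rows are actual values, columns are predictions, and both are ordered `False` then `True`:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, not taken from the snow_objects dataset
toy_actual      = [True, True, True, False, False, False, False, False]
toy_predictions = [True, True, False, True, True, False, False, False]

# Rows are actual values and columns are predictions,
# both ordered False then True
toy_cm = confusion_matrix(toy_actual, toy_predictions)
print(toy_cm)

# Flattening the 2x2 matrix row by row gives TN, FP, FN, TP
toy_tn, toy_fp, toy_fn, toy_tp = toy_cm.ravel()
print("TN:", toy_tn, "FP:", toy_fp, "FN:", toy_fn, "TP:", toy_tp)
```

Unpacking `ravel()` in this order is a common way to get the four counts straight from the matrix.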

In [4]:
# Let's also calculate some values that will be used throughout this exercise
# We already have actual values and corresponding predictions, defined above
correct = actual == predictions
tp = numpy.sum(correct & actual)
tn = numpy.sum(correct & numpy.logical_not(actual))
# A false positive is a wrong prediction on a sample that is actually negative
fp = numpy.sum(numpy.logical_not(correct) & numpy.logical_not(actual))
# A false negative is a wrong prediction on a sample that is actually positive
fn = numpy.sum(numpy.logical_not(correct) & actual)

print("TP - True Positives: ", tp)
print("TN - True Negatives: ", tn)
print("FP - False positives: ", fp)
print("FN - False negatives: ", fn)


TP - True Positives:  75
TN - True Negatives:  523
FP - False positives:  33
FN - False negatives:  29


We can use the values and matrix above to help us understand other metrics.


## Calculating metrics

From here on we will take a closer look at each of the following metrics, how they are calculated, and how they can help explain our current model.

* Accuracy
* Sensitivity/Recall
* Specificity
* Precision
* False positive rate

Let's first recall some useful terms:

* TP = True positives: a positive label is correctly predicted
* TN = True negatives: a negative label is correctly predicted
* FP = False positives: a negative label is predicted as a positive
* FN = False negatives: a positive label is predicted as a negative


### Accuracy
Accuracy is the number of correct predictions divided by the total number of predictions:

```
    accuracy = (TP+TN) / number of samples
```

It's possibly the most basic metric used but, as we've seen, it's not the most reliable when *imbalanced datasets* are used.

In code:

In [5]:
# Calculate accuracy
# len(actual) is the number of samples in the set that generated TP and TN
accuracy = (tp+tn) / len(actual)

# print result as a percentage
print(f"Model accuracy is {accuracy:.2%}")

Model accuracy is 90.61%


### Sensitivity/Recall

*Sensitivity* and *Recall* are interchangeable names for the same metric, which expresses the fraction of __positive__ samples that a model predicts __correctly__:


```
    sensitivity = recall = TP / (TP + FN)
```

This is an important metric that tells us, out of all the existing __positive__ samples, how many are __correctly__ predicted.

In code:

In [6]:
# code for sensitivity/recall
sensitivity = recall = tp / (tp + fn)

# print result as a percentage
print(f"Model sensitivity/recall is {sensitivity:.2%}")

Model sensitivity/recall is 72.12%


### Specificity
Specificity expresses the fraction of __negative__ labels correctly predicted over the total number of existing negative samples:

```
    specificity = TN / (TN + FP)
```

It can be calculated using the following code:

In [7]:
# Code for specificity
specificity = tn / (tn + fp)

# print result as a percentage
print(f"Model specificity is {specificity:.2%}")

Model specificity is 94.06%


### Precision
Precision expresses the proportion of __correctly__ predicted positive samples over all positive predictions:

```
    precision = TP / (TP + FP)
```
In other words, it indicates, out of all positive predictions, how many are truly positive labels.

It can be calculated using the following code:

In [8]:
# Code for precision

precision = tp / (tp + fp)

# print result as a percentage
print(f"Model precision is {precision:.2%}")

Model precision is 69.44%


### False positive rate
False positive rate, or FPR, is the number of __incorrect__ positive predictions divided by the total number of negative samples:

```
    false_positive_rate = FP / (FP + TN)
```


In code:

In [9]:
# Code for false positive rate
false_positive_rate = fp / (fp + tn)

# print result as a percentage
print(f"Model false positive rate is {false_positive_rate:.2%}")


Model false positive rate is 5.94%


Notice that the sum of `specificity` and `false positive rate` should always be equal to `1`.
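
The reason is that both fractions share the same denominator, `TN + FP`, and their numerators add up to it. A minimal sketch with hypothetical counts (the `toy_` values are chosen only for illustration and are not results from our model):

```python
# Hypothetical counts for the negative class, chosen only for illustration
toy_tn = 90   # negatives correctly predicted as negative
toy_fp = 10   # negatives wrongly predicted as positive

toy_specificity = toy_tn / (toy_tn + toy_fp)
toy_false_positive_rate = toy_fp / (toy_fp + toy_tn)

# Both fractions are taken over all negative samples (TN + FP),
# so together they account for every negative sample
total = toy_specificity + toy_false_positive_rate
print(f"{toy_specificity:.2f} + {toy_false_positive_rate:.2f} = {total:.2f}")
```

In practice this means the two metrics carry the same information, so reporting one of them is usually enough.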

## Conclusion

There are several different metrics that can help us evaluate the performance of a model, in the context of the quality of its predictions.

The choice of the most adequate metrics, however, is primarily a function of the data and the problem we are trying to solve.
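
To illustrate why the data matters, here is a minimal sketch on a hypothetical imbalanced test set, using scikit-learn's `accuracy_score` and `recall_score`. The labels and the "model" that never predicts a hiker are made up for this example; `1` stands for hiker and `0` for not hiker:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced test set: 5 hikers (1) and 95 non-hikers (0)
toy_actual = [1] * 5 + [0] * 95

# A hypothetical "model" that never predicts a hiker
toy_predictions = [0] * 100

toy_accuracy = accuracy_score(toy_actual, toy_predictions)
toy_recall = recall_score(toy_actual, toy_predictions)
# Specificity is the recall of the negative class
toy_specificity = recall_score(toy_actual, toy_predictions, pos_label=0)

print(f"Accuracy:    {toy_accuracy:.2%}")
print(f"Recall:      {toy_recall:.2%}")
print(f"Specificity: {toy_specificity:.2%}")
```

Accuracy and specificity look excellent here, yet recall shows that not a single hiker is found. If missing a hiker is the costly mistake, recall is the metric to watch.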

## Summary

We covered the following topics in this unit:

* How to calculate the very basic measurements used in the evaluation of classification models: TP, FP, TN, FN.
* How to use the measurements above to calculate more meaningful metrics, such as:
    * Accuracy
    * Sensitivity/Recall
    * Specificity
    * Precision
    * False positive rate
* How the choice of metrics depends on the dataset and the problem we are trying to solve.

