<a href="https://colab.research.google.com/github/munich-ml/MLPy2021/blob/main/32_evaluate_fMNIST_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

## References
Resources used to create this notebook:
- [scikit-learn website](https://scikit-learn.org)
- [Matplotlib website](https://matplotlib.org/)
- [Wikipedia](https://en.wikipedia.org/wiki/Main_Page)
- Hands-on Machine Learning with Scikit-learn, Keras & TensorFlow, Aurelien Geron, [Book on Amazon](https://www.amazon.de/Aur%C3%A9lien-G%C3%A9ron/dp/1492032646/ref=sr_1_3?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=Hands-on+Machine+Learning+with+Scikit-learn%2C+Keras+%26+TensorFlow%2C+Aurelien+Geron%2C&qid=1589875241&sr=8-3)
- Introduction to Machine Learning with Python, Andreas Mueller, [Book on Amazon](https://www.amazon.de/Introduction-Machine-Learning-Python-Scientists/dp/1449369413)


## Setup

First, do the common imports.

TensorFlow must be version 2.x, because its API differs substantially from 1.x.

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Common imports
import os
import numpy as np
import pandas as pd

# to make this notebook's output stable across runs
np.random.seed(42)

# Setup matplotlib
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert int(tf.__version__.split(".")[0]) >= 2   # compare the major version numerically, not as a string

in_colab = 'google.colab' in sys.modules   # check if the notebook is executed within Colab

# Forces tensorflow version (only in colab)
if in_colab:
    %tensorflow_version 2.x           

# Clone the repository if executed in Google Colab
if in_colab:  
    if "MLPy2021" in os.listdir():
        !git -C MLPy2021 pull
    else:
        !git clone https://github.com/munich-ml/MLPy2021/

# lib.helper_funcs.py. The import path depends on Colab or local execution 
if in_colab:
    from MLPy2021.lib.helper_funcs import plot_confusion_matrix, plot_prediction_examples, pickle_in
else: 
    from lib.helper_funcs import plot_confusion_matrix, plot_prediction_examples, pickle_in


# Load a model

## Mount Google Drive

In [None]:
mount_dir = os.path.join(os.getcwd(), "drive")
mount_dir

In [None]:
from google.colab import drive
drive.mount(mount_dir)

## load_model()



In [None]:
save_dir = os.path.join(mount_dir, "My Drive", "Colab Notebooks", "models")
os.listdir(save_dir)

In [None]:
fn = "fMNIST_NN_v1_ageron"
model = keras.models.load_model(os.path.join(save_dir, fn + ".h5"))
model.summary()

## Load the validation and test data

In [None]:
print([var for var in vars() if not var.startswith("_")])

In [None]:
pickle_in(os.path.join(save_dir, fn+'_data.pkl'), locals())

In [None]:
print([var for var in vars() if not var.startswith("_")])

In [None]:
class_names

In [None]:
X_valid.shape

In [None]:
X_test.shape

# Evaluate the model


### model.evaluate()

`model.evaluate()` predicts results for the given dataset (here the validation set) and computes the loss and the metrics with respect to the expected results.

In [None]:
model.evaluate(X_valid, y_valid);
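
The two numbers returned by `model.evaluate()` can be reproduced by hand from predicted probabilities. The sketch below assumes the model was compiled with a sparse categorical cross-entropy loss and an accuracy metric, which this notebook does not show; the probabilities and labels are made-up toy values.

```python
import numpy as np

# Toy predicted probabilities for 4 samples and 3 classes (made-up values)
y_proba = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.4, 0.3],
                    [0.2, 0.2, 0.6]])
y_true = np.array([0, 1, 2, 2])  # integer class labels

# Sparse categorical cross-entropy (assumed loss):
# mean of -log(probability assigned to the true class)
p_true = y_proba[np.arange(len(y_true)), y_true]
loss = -np.log(p_true).mean()

# Accuracy: share of samples whose most probable class equals the label
accuracy = (y_proba.argmax(axis=1) == y_true).mean()

print("loss={:.4f}, accuracy={:.2f}".format(loss, accuracy))
```

The third sample is misclassified, so the accuracy is 3 out of 4, while the loss also penalizes correct predictions that were made with low confidence.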

### model.predict()

In [None]:
pd.options.display.float_format = '{:,.2f}'.format

In [None]:
y_proba = model.predict(X_valid[:5])
pd.DataFrame(y_proba, columns=class_names).T

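### model.predict_classes()

`model.predict_classes()` was removed in TensorFlow 2.6. Taking the `argmax` of the predicted probabilities yields the same class indexes.

In [None]:
y_pred = np.argmax(model.predict(X_valid), axis=-1)
y_pred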

In [None]:
np.array(class_names)[y_pred]
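
The predicted class index is the position of the largest probability in each row, and indexing a NumPy array of names with those indexes maps them back to labels. A minimal sketch with a hypothetical three-class subset and made-up probabilities:

```python
import numpy as np

# Hypothetical subset of class names and made-up probabilities
class_names = ["T-shirt/top", "Trouser", "Pullover"]
y_proba = np.array([[0.1, 0.7, 0.2],
                    [0.6, 0.3, 0.1],
                    [0.2, 0.2, 0.6]])

# Index of the most probable class for each sample (row)
y_pred = np.argmax(y_proba, axis=-1)

# Integer-array indexing turns class indexes into class names
predicted_names = np.array(class_names)[y_pred]

print(y_pred)
print(predicted_names)
```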

In [None]:
some_indexes = [1, 2, 3, 4, 11, 12, 23]
plt.figure(figsize=(13, 2.5))
for col, index in enumerate(some_indexes):
    plt.subplot(1, len(some_indexes), col+1)
    plt.imshow(np.squeeze(X_valid[index]), cmap="binary")
    title = "actl='{}'\n".format(class_names[y_valid[index]])
    title +="pred='{}'\n".format(class_names[y_pred[index]])
    plt.title(title, fontsize=11), plt.axis('off')

## Confusion matrix

A confusion matrix is a two-dimensional histogram of actual (rows) and predicted (columns) classes. 
- entries on the main diagonal are correct predictions
- all other entries are misclassifications
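
To make the row and column convention concrete, a confusion matrix can be counted by hand with NumPy. This is only an illustrative sketch with made-up labels, not how `tf.math.confusion_matrix` is implemented.

```python
import numpy as np

# Made-up actual and predicted class indexes for 7 samples and 3 classes
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 0, 2])
n_classes = 3

# cm[i, j] counts the samples of actual class i that were predicted as class j
cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (y_true, y_hat), 1)  # unbuffered add, handles repeated index pairs

print(cm)
print("correct predictions:", np.trace(cm), "of", cm.sum())
```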

In [None]:
confusion = tf.math.confusion_matrix(y_valid, y_pred)
confusion

### Exercise 
Previously, we computed the **accuracy** using `model.evaluate()`. Accuracy is defined by:

$
\text{accuracy} = \cfrac{\text{correct predictions}}{\text{all predictions}} 
$ 

Verify that result using the confusion matrix and **NumPy**.


Hint: The following line converts the **tensor** `confusion` into a **NumPy array** and computes the sum of all items.
```python
np.array(confusion).sum()
```



### Plotting the confusion matrix

In [None]:
plot_confusion_matrix(confusion, xticks=class_names, yticks=class_names)

Interpretation of the confusion matrix?

One usually focuses on the **false predictions**, so ignoring the main diagonal may improve the visualization:

In [None]:
plot_confusion_matrix(confusion, xticks=class_names, yticks=class_names, ignore_main_diagonal=True)
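
One way to ignore the main diagonal is to zero it in a copy of the matrix before plotting. The sketch below illustrates the idea on a made-up matrix; whether `plot_confusion_matrix` works this way internally is an assumption.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows: actual, columns: predicted)
cm = np.array([[50, 2, 3],
               [4, 40, 1],
               [6, 0, 45]])

# Keep only the misclassifications by zeroing the diagonal of a copy
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Locate the most frequent confusion
row, col = np.unravel_index(errors.argmax(), errors.shape)
print(errors)
print("most frequent error: actual class {} predicted as class {}".format(row, col))
```

Without the large diagonal counts, the color scale of a plot is spent entirely on the errors.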

## Performance measures for classifiers

A **Digit-5 detector** is used as an example to compare different performance metrics:

![](https://github.com/munich-ml/MLPy2021/blob/main/images/digit5-detector.png?raw=1)

**True negative**, for instance, means:
- **True**: The prediction is correct
- **negative**: The prediction is "**not** a 5"
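
The four outcomes of a binary detector can be counted with boolean masks. The digits and predictions below are made up for illustration.

```python
import numpy as np

# Made-up true digits and detector outputs for 11 images
y_digits = np.array([5, 5, 5, 5, 5, 3, 8, 0, 1, 7, 2])
is_five = (y_digits == 5)                      # actual class: is it a 5?
pred_five = np.array([True, True, True, False, False,
                      True, False, False, False, False, False])

tp = np.sum(is_five & pred_five)      # predicted 5, and it is a 5
tn = np.sum(~is_five & ~pred_five)    # predicted not-5, and it is not a 5
fp = np.sum(~is_five & pred_five)     # predicted 5, but it is not a 5
fn = np.sum(is_five & ~pred_five)     # predicted not-5, but it is a 5

print("TP={}, TN={}, FP={}, FN={}".format(tp, tn, fp, fn))
```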



### Accuracy

Accuracy is a good measure for balanced datasets. Its definition again:

$
\text{accuracy} = \cfrac{\text{correct predictions}}{\text{all predictions}} = \cfrac{TP + TN}{TP + TN + FP + FN}
$ 

If the counts of **false negatives** and **false positives** differ greatly, or if their costs differ greatly, alternative performance measures are required.
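
A small worked example shows why accuracy alone can mislead. It assumes a hypothetical dataset of 1000 digits of which 100 are 5s, and a useless detector that always answers "not a 5".

```python
# Hypothetical imbalanced dataset: 100 fives among 1000 digits
n_total = 1000
n_fives = 100

# A detector that never predicts "5" produces no positives at all
tp, fp = 0, 0
fn = n_fives             # every 5 is missed
tn = n_total - n_fives   # every non-5 is trivially correct

accuracy = (tp + tn) / (tp + tn + fp + fn)
detected = tp / (tp + fn)  # share of the 5s that were actually found

print("accuracy={:.0%}, detected 5s={:.0%}".format(accuracy, detected))
```

The detector reaches 90% accuracy although it does not find a single 5.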

### Precision
Precision (ideally 1.0) is decreased by **false positives** (FP). FP means the prediction `True` is wrong.

$
\text{precision} = \cfrac{TP}{\text{all predicted positives}} = \cfrac{TP}{TP + FP}
$

Example application: *Email spam detection*

FP (legitimate mail sorted out) is worse than FN (spam coming through)


### Recall (or sensitivity)
Recall (ideally 1.0) says how good a model is at detecting the positives. 

Recall is decreased by **false negatives** (FN). FN means the prediction `False` is wrong.

$
\text{recall} = \cfrac{TP}{\text{all actual positives}} = \cfrac{TP}{TP + FN}
$

Example application: *Medical diabetes detection*

FN (diabetes not detected) is worse than FP (diabetes detected but patient is healthy)
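
Which of the two errors dominates can often be steered with the decision threshold of a detector. The sketch below uses made-up scores and labels to show that raising the threshold tends to increase precision at the expense of recall.

```python
import numpy as np

# Made-up detector scores and true labels for 10 samples
scores = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
is_positive = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0], dtype=bool)

def precision_recall(threshold):
    """Return (precision, recall) when predicting positive for score >= threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & is_positive)
    fp = np.sum(pred & ~is_positive)
    fn = np.sum(~pred & is_positive)
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (0.25, 0.55, 0.85):
    precision, recall = precision_recall(threshold)
    print("threshold={:.2f}: precision={:.2f}, recall={:.2f}".format(
        threshold, precision, recall))
```

A spam filter would lean towards a high threshold (few false positives), a medical screening towards a low one (few false negatives).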


### Specificity 
Specificity says how good a model is at detecting the negatives (avoiding false alarms).

$
\text{specificity} = \cfrac{TN}{\text{all actual negatives}} = \cfrac{TN}{TN + FP}
$

### F1-score
*Harmonic mean* of precision and recall. 

$
F_1 = \cfrac{2}{\text{precision}^{-1} + \text{recall}^{-1}} = 2 \times \cfrac{\text{precision}\, \times \, \text{recall}}{\text{precision}\, + \, \text{recall}} 
$

Whereas the *arithmetic mean* treats all values equally, the *harmonic mean* gives more weight to low values. The F1-score is therefore only high if both precision and recall are high.
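
The difference between the two means is easy to see numerically. The precision and recall pairs below are made-up values.

```python
def arithmetic_mean(a, b):
    return (a + b) / 2

def harmonic_mean(a, b):
    # Same as 2 / (1/a + 1/b), which is the F1 formula for precision and recall
    return 2 * a * b / (a + b)

# Made-up (precision, recall) pairs, from balanced to very unbalanced
for precision, recall in [(0.9, 0.9), (1.0, 0.5), (1.0, 0.1)]:
    print("precision={:.1f}, recall={:.1f}: arithmetic={:.2f}, harmonic (F1)={:.2f}".format(
        precision, recall,
        arithmetic_mean(precision, recall),
        harmonic_mean(precision, recall)))
```

With perfect precision but a recall of only 0.1, the arithmetic mean still looks acceptable at 0.55, while the harmonic mean drops to about 0.18.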

Evaluating the **Digit-5 detector** for the various metrics:

![](https://github.com/munich-ml/MLPy2021/blob/main/images/precision-recall.png?raw=1)


In [None]:
tn, fp, fn, tp = 5, 1, 2, 3

perf = {}
perf["accuracy"] = (tp+tn) / (tp+tn+fp+fn)
perf["precision"] = tp / (tp+fp)
perf["recall"] = tp / (tp+fn)
perf["specificity"] = tn / (tn+fp)
perf["F1-score"] = 2 * perf["precision"]*perf["recall"] / (perf["precision"]+perf["recall"])

for label, value in perf.items():
    print("{:12s}{:.0%}".format(label, value))

## Classification report of the fMNIST model


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_valid, y_pred, target_names=class_names))

### Exercise 
Check the *precision* and *recall* values returned by `classification_report()` for one class (e.g. 'Coat')


#### Solution

In [None]:
plot_confusion_matrix(confusion, xticks=class_names, yticks=class_names)

In [None]:
CLASS_LABEL = "Coat"
idx = class_names.index(CLASS_LABEL)
idx

In [None]:
type(confusion)

Convert the **Tensor** into a familiar **numpy array** 

In [None]:
cm = np.array(confusion)
cm

In [None]:
tp = cm[idx, idx]
tp

FP are all items predicted as ``CLASS_LABEL`` minus TP

In [None]:
all_positive_predictions = cm[:, idx].sum()
fp = all_positive_predictions - tp
fp

FN are all actual ``CLASS_LABEL`` items minus TP

In [None]:
fn = cm[idx, :].sum() - tp
fn

In [None]:
print("Class '{}': precision={:.2f}, recall={:.2f}".format(CLASS_LABEL, tp/(tp+fp), tp/(tp+fn)))

**Conclusion**: The values computed from the confusion matrix match the ``classification_report`` output.
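
The same check can be done for all classes at once, because the column sums of the confusion matrix are the predicted counts per class and the row sums are the actual counts. A minimal sketch on a made-up 3-class matrix:

```python
import numpy as np

# Made-up confusion matrix (rows: actual, columns: predicted)
cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [0, 1, 9]])

tp = np.diag(cm)                  # correct predictions per class
precision = tp / cm.sum(axis=0)   # divide by column sums: all predicted as that class
recall = tp / cm.sum(axis=1)      # divide by row sums: all actual items of that class

for i, (p, r) in enumerate(zip(precision, recall)):
    print("class {}: precision={:.2f}, recall={:.2f}".format(i, p, r))
```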

## Examples of predictions


Let's look at some examples of right and wrong predictions.

In [None]:
class_names

In [None]:
validation_class = 6
plot_prediction_examples(validation_class, class_names, y_pred, y_valid, X_valid)