# Evaluating model performance on the test dataset

In this notebook, we explore the model's performance on the test dataset. In particular, we focus on accuracy and show how test time augmentation can improve it.


As noted in the prior notebook, our model achieves over 90% (~93%) top-5 accuracy. This notebook demonstrates that, with test time augmentation, the model classifies images from the food-101 test dataset with 85.3% top-1 accuracy.

In [4]:
# import dependencies
from keras import backend as K
from keras.models import Model
from keras.models import load_model
from keras.preprocessing.image import ImageDataGenerator
import numpy as np
from sklearn.metrics import accuracy_score

Using TensorFlow backend.


We will load the model saved as the output of the prior notebook. 

In [None]:
model = load_model('fully_trained_resnet.h5', compile=True)

The cell below creates a data pipeline for the test dataset. At this stage we do not apply image augmentations; we only rescale the images. We turn shuffle off so that the labels stay aligned with the images passed into the model, and we set batch_size to 1.

In [106]:
datagen = ImageDataGenerator(rescale=1./255)
samples = datagen.flow_from_directory('/home/jupyter/data/test',
                                      class_mode='categorical',
                                      batch_size=1,
                                      shuffle=False,
                                      target_size=(512,512))
samples.shuffle = False
samples.reset()
labels = samples.classes

Found 25250 images belonging to 101 classes.
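
Turning shuffle off matters because the integer labels follow the order of the class folders. As a rough sketch of that idea, the snippet below mimics how class indices can be derived from subdirectory names in sorted order. The helper `infer_class_indices` and the folder names are hypothetical and used only for illustration; on the real generator, the actual mapping can be inspected through `samples.class_indices`.

```python
import os
import tempfile


def infer_class_indices(directory):
    """Map each subdirectory name to an integer index, in sorted order."""
    class_names = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    return {name: index for index, name in enumerate(class_names)}


# Hypothetical class folders, created in a temporary directory for illustration.
with tempfile.TemporaryDirectory() as root:
    for name in ["waffles", "apple_pie", "baby_back_ribs"]:
        os.makedirs(os.path.join(root, name))
    class_indices = infer_class_indices(root)

print(class_indices)  # {'apple_pie': 0, 'baby_back_ribs': 1, 'waffles': 2}
```

With shuffling disabled, images are read class by class, which is why the first labels in the array are all 0.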


We generate predictions for each image. As the batch size is one, we need to take a step for each image.

In [3]:
test_preds = model.predict_generator(samples, steps=25250)
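
The step count above follows directly from the batch size. As a small sketch (the helper name `steps_for` is an assumption, not part of Keras), the number of steps needed to see every image once is the sample count divided by the batch size, rounded up:

```python
import math


def steps_for(n_samples, batch_size):
    """Number of generator steps needed to cover every sample exactly once."""
    return math.ceil(n_samples / batch_size)


print(steps_for(25250, 1))   # 25250
print(steps_for(25250, 32))  # 790
```

A larger batch size would reduce the number of steps, at the cost of a final batch that is smaller than the rest.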

To get the predictions in the correct format, we take the index of the largest value in each prediction array. The model outputs an array with a predicted probability for each of the 101 classes; at this stage we just want the class with the highest predicted probability.

We also subtract one from each predicted index, because the labels provided by the generator begin at 0.

In [5]:
test_preds.argmax(axis=1)-1

array([  0,   0,   8, ..., 100, 100, 100])

In [6]:
labels

array([  0,   0,   0, ..., 100, 100, 100], dtype=int32)
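
To make the argmax step concrete, here is a toy example with made-up probabilities for three images and four classes. Note that `argmax` itself returns 0-based indices; whether an extra offset is needed depends on how the model's output units were numbered during training.

```python
import numpy as np

# Toy probabilities for 3 images over 4 classes (illustrative values only).
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.25, 0.25, 0.10, 0.40],
])

# axis=1 picks the most probable class within each row (one row per image).
predicted = probs.argmax(axis=1)
print(predicted)  # [0 2 3]
```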

We use the accuracy_score function from scikit-learn to evaluate the model's accuracy. As shown below, the model achieves 83.3% accuracy on the test dataset. Not bad, but short of our goal of more than 85% accuracy on the test dataset.

In [7]:
accuracy_score(labels, test_preds.argmax(axis=1)-1)

0.8335841584158415
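
The figure above is top-1 accuracy, while the prior notebook reported top-5 accuracy. As a hedged sketch of how the two relate, the function below computes top-k accuracy from an array of class probabilities. The name `top_k_accuracy` and the toy data are assumptions for illustration, not code from the prior notebook.

```python
import numpy as np


def top_k_accuracy(probs, labels, k=5):
    """Fraction of rows whose true label is among the k most probable classes."""
    # argsort is ascending, so the last k columns hold the k most probable classes.
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = (top_k == np.asarray(labels)[:, None]).any(axis=1)
    return hits.mean()


# Toy probabilities for 3 images over 4 classes (illustrative values only).
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.25, 0.30, 0.05, 0.40],
])
labels = [0, 3, 2]

print(top_k_accuracy(probs, labels, k=1))  # 1 of 3 images correct
print(top_k_accuracy(probs, labels, k=2))  # 2 of 3 images correct
```

With k=1 this reduces to ordinary accuracy, which is why top-5 accuracy is always at least as high as top-1.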

# Using test time augmentation to boost performance to >85% top 1 accuracy

Test time augmentation applies the image augmentation techniques traditionally used during training when making predictions. Rather than having the model make a single prediction for an image, we have it make multiple predictions for the same image, with the image augmented slightly differently each time. We then average all the predictions to arrive at a final prediction.

As shown below, this approach can substantially boost performance. In our case it boosts top-1 accuracy from 83.3% to 85.3%.
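
The averaging idea can be sketched without Keras. In the snippet below, `toy_predict` is a hypothetical stand-in for a trained classifier (a fixed random linear layer plus softmax), and `augment` is a simplified substitute for the real augmentations; neither is the pipeline used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained classifier: a fixed random linear layer
# followed by a softmax. It maps a batch of images to class probabilities.
N_CLASSES = 3
WEIGHTS = rng.normal(size=(8 * 8, N_CLASSES))


def toy_predict(batch):
    logits = batch.reshape(len(batch), -1) @ WEIGHTS
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def augment(image, rng):
    """Randomly flip horizontally and scale brightness by a factor in [0.7, 1.3]."""
    if rng.random() < 0.5:
        image = image[:, ::-1]
    return np.clip(image * rng.uniform(0.7, 1.3), 0.0, 1.0)


def tta_predict(image, n_augmentations=11):
    batch = np.stack([augment(image, rng) for _ in range(n_augmentations)])
    per_view = toy_predict(batch)        # shape: (n_augmentations, N_CLASSES)
    mean_probs = per_view.mean(axis=0)   # average over the augmented views
    return int(mean_probs.argmax()), mean_probs


image = rng.random((8, 8))  # toy grayscale "image" with values in [0, 1)
label, mean_probs = tta_predict(image)
print(label, mean_probs.round(3))
```

Because each view's probabilities sum to one, the averaged vector does too, so it can be treated like any other prediction.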

Preparing the data pipeline for TTA is identical to the pipeline shown above (included here for reference).

In [31]:
# Prepare data for TTA
datagen = ImageDataGenerator(rescale=1./255)
samples = datagen.flow_from_directory('/home/jupyter/data/test',
                                      class_mode='categorical',
                                      batch_size=1,
                                      shuffle=False,
                                      target_size=(512,512))
samples.shuffle = False
samples.reset()
labels = samples.classes

Found 25250 images belonging to 101 classes.


The function below accepts a data generator object, a model, an image, and the number of test time augmentations to make. It returns the predicted class together with the class probabilities averaged over the augmentations.

In [3]:
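def test_time_prediction(datagen, model, image, n_examples=7):
    # `image` is an (x, y) batch from the directory iterator; image[0] holds the pixels.
    augmenter = datagen.flow(image[0], batch_size=1)
    augmenter.shuffle = False
    augmenter.reset()
    # One forward pass per randomly augmented copy of the image.
    preds = model.predict_generator(augmenter, steps=n_examples, verbose=0)
    # Average the class probabilities across augmentations, then pick the top class.
    all_preds = np.mean(preds, axis=0)
    pred = np.argmax(all_preds, axis=-1)
    return pred, all_preds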

The function below accepts a model, an array of images, and an array of labels. It takes each image, generates predictions using test time augmentation (11 predictions per image), and then combines them into a single prediction. The function returns the label array, the combined predictions, and the averaged class probabilities.

Note that the y_labels array is not strictly necessary to this function at this time, but could be used in the future.

In [5]:
def test_time_augmentation(model, X_samples, y_labels):
    # Random augmentations applied to each test image at prediction time.
    datagen = ImageDataGenerator(rescale=1./255,
                                 brightness_range=[0.7,1.3],
                                 rotation_range=30,
                                 horizontal_flip=True,
                                 zoom_range=[0.7,1.3],
                                 fill_mode='nearest')
    examples_per_image = 11
    preds = []
    all_preds = []
    for i in range(len(X_samples)):
        if ((i % 500) == 0):
            print(i, 'Predictions Complete')
        pred, mean_probs = test_time_prediction(datagen, model, X_samples[i], examples_per_image)
        preds.append(pred)
        # Keep every image's averaged probabilities rather than overwriting them.
        all_preds.append(mean_probs)
    return y_labels, preds, all_preds

The cell below generates predictions using test time augmentation for every image in the test dataset.

In [6]:
y_labels, preds, all_preds = test_time_augmentation(model, samples, labels)

0 Predictions Complete
500 Predictions Complete
1000 Predictions Complete
1500 Predictions Complete
2000 Predictions Complete
2500 Predictions Complete
3000 Predictions Complete
3500 Predictions Complete
4000 Predictions Complete
4500 Predictions Complete
5000 Predictions Complete
5500 Predictions Complete
6000 Predictions Complete
6500 Predictions Complete
7000 Predictions Complete
7500 Predictions Complete
8000 Predictions Complete
8500 Predictions Complete
9000 Predictions Complete
9500 Predictions Complete
10000 Predictions Complete
10500 Predictions Complete
11000 Predictions Complete
11500 Predictions Complete
12000 Predictions Complete
12500 Predictions Complete
13000 Predictions Complete
13500 Predictions Complete
14000 Predictions Complete
14500 Predictions Complete
15000 Predictions Complete
15500 Predictions Complete
16000 Predictions Complete
16500 Predictions Complete
17000 Predictions Complete
17500 Predictions Complete
18000 Predictions Complete
18500 Predictions Complete

The cells below compare the first 10 predictions with their labels. As expected, since shuffle is turned off, the first ten images all come from the first class. As discussed previously, the labels begin counting at 0, so we add one to each label at this step.

In [7]:
preds[:10]

[1, 1, 1, 1, 1, 1, 22, 1, 1, 1]

In [8]:
y_labels[:10] + 1

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

As noted above, with test time augmentation the performance of the image classifier on the food-101 test set exceeds 85% accuracy, with performance of 85.3% top-1 accuracy.

This notebook demonstrates the power of test time augmentation as a technique to boost performance of predictive models.

In [9]:
acc = accuracy_score(y_labels+1, preds)
print(acc)

0.8533465346534653
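
To put the improvement in concrete terms, the two accuracy scores reported in this notebook can be converted back into counts of correctly classified images. This is simple arithmetic on the printed values, not an additional experiment.

```python
n_images = 25250
baseline_acc = 0.8335841584158415  # accuracy without test time augmentation
tta_acc = 0.8533465346534653       # accuracy with test time augmentation

baseline_correct = round(baseline_acc * n_images)
tta_correct = round(tta_acc * n_images)

print(baseline_correct)                # 21048
print(tta_correct)                     # 21547
print(tta_correct - baseline_correct)  # 499
```

The difference is a net figure: some images may become correct with augmentation while others become incorrect.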


The cell below saves the combined predictions, the labels, and the averaged prediction probabilities as CSV files for future reference.

In [12]:
preds = np.asarray(preds)
np.savetxt('preds.csv', preds, delimiter=',')

y_labels = np.asarray(y_labels)
np.savetxt('y_labels.csv', y_labels, delimiter=',')

np.savetxt('all_preds.csv', all_preds, delimiter=',')
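
If the saved files are to be reloaded later, a round trip could look like the sketch below. It uses small made-up arrays and a temporary directory rather than the files written above; the `fmt="%d"` argument is an optional choice for integer arrays, not something the cell above uses.

```python
import os
import tempfile

import numpy as np

preds = np.array([1, 1, 22, 1])              # illustrative class predictions
probs = np.array([[0.9, 0.1], [0.2, 0.8]])   # illustrative probabilities

with tempfile.TemporaryDirectory() as tmp:
    preds_path = os.path.join(tmp, "preds.csv")
    probs_path = os.path.join(tmp, "probs.csv")

    # fmt="%d" writes integers as "22" rather than "2.200000000000000000e+01".
    np.savetxt(preds_path, preds, delimiter=",", fmt="%d")
    np.savetxt(probs_path, probs, delimiter=",")

    preds_loaded = np.loadtxt(preds_path, delimiter=",", dtype=int)
    probs_loaded = np.loadtxt(probs_path, delimiter=",")

print(preds_loaded)                      # [ 1  1 22  1]
print(np.allclose(probs, probs_loaded))  # True
```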