This tutorial shows how to classify images of cats and dogs. It covers the following concepts:
- Building data input pipelines using the `tf.keras.preprocessing.image.ImageDataGenerator` class to efficiently work with data on disk to use with the model.
- Overfitting - How to identify and prevent it
- Data augmentation and dropout - Key techniques to fight overfitting in computer vision tasks to incorporate into the data pipeline and image classifier model.

This tutorial follows a basic machine learning workflow:
1. Examine and understand data
2. Build an input pipeline
3. Build the model
4. Train the model
5. Test the model
6. Improve the model and repeat the process

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

In [0]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator

import os
import numpy as np
import matplotlib.pyplot as plt

In [0]:
# Load data
# This tutorial uses a filtered version of the Dogs vs Cats dataset from Kaggle.
# Download the archive and extract it into the Keras cache directory
# (~/.keras/datasets by default).

_URL = 'https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip'

path_to_zip = tf.keras.utils.get_file('cats_and_dogs.zip', origin=_URL, extract=True)

PATH = os.path.join(os.path.dirname(path_to_zip), 'cats_and_dogs_filtered') 

After extracting its contents, assign variables with the proper file path for the training and validation set.

In [0]:
train_dir = os.path.join(PATH, 'train')
validation_dir = os.path.join(PATH, 'validation')

In [0]:
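train_cats_dir = os.path.join(train_dir, 'cats')  # directory with our training cat pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')  # directory with our training dog pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')  # directory with our validation cat pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')  # directory with our validation dog pictures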

## Understand the data
Let's look at how many cat and dog images are in the training and validation directories:

In [0]:
num_cats_tr = len(os.listdir(train_cats_dir))
num_dogs_tr = len(os.listdir(train_dogs_dir))

num_cats_val = len(os.listdir(validation_cats_dir))
num_dogs_val = len(os.listdir(validation_dogs_dir))

total_train = num_cats_tr + num_dogs_tr
total_val = num_cats_val + num_dogs_val

In [0]:
print('total training cat images:', num_cats_tr)
print('total training dog images:', num_dogs_tr)

print('total validation cat images:', num_cats_val)
print('total validation dog images:', num_dogs_val)
print("--")
print("Total training images:", total_train)
print("Total validation images:", total_val)

In [0]:
from PIL import Image

In [0]:
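def get_num_pixels(filepath):
    """Return the (width, height) of the image at filepath, in pixels."""
    # Open the file in a context manager so the handle is closed afterwards.
    with Image.open(filepath) as img:
        width, height = img.size
    return width, height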

In [0]:
# Print the dimensions of two sample training images.
width, height = get_num_pixels(os.path.join(train_cats_dir, 'cat.0.jpg'))
print('width: ', width)
print('height: ', height)

width, height = get_num_pixels(os.path.join(train_cats_dir, 'cat.1.jpg'))
print('width: ', width)
print('height: ', height)

In [0]:
# For convenience, set up variables to use while pre-processing the dataset
# and training the network.
batch_size = 128
epochs = 30
IMG_HEIGHT = 150
IMG_WIDTH = 150

## Data preparation
Format the images into appropriately pre-processed floating point tensors before feeding them to the network:
1. Read images from the disk.
2. Decode the contents of these images and convert them into the proper grid format as per their RGB content.
3. Convert them into floating point tensors.
4. Rescale the tensors from values between 0 and 255 to values between 0 and 1, as neural networks prefer to deal with small input values.

The `ImageDataGenerator` class handles all of these tasks.
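The last two steps can be sketched with NumPy alone. This is a minimal illustration, not what `ImageDataGenerator` does internally: the `decoded` array is randomly generated to stand in for an image that has already been read and decoded, and the names `decoded`, `as_float` and `rescaled` are assumptions made for this example.

```python
import numpy as np

# Stand-in for a decoded RGB image: height x width x 3 channels, values 0-255.
# (Hypothetical data; a real pipeline would read and decode a file from disk.)
rng = np.random.default_rng(0)
decoded = rng.integers(0, 256, size=(150, 150, 3), dtype=np.uint8)

# Step 3: convert the integer pixels to a floating point tensor.
as_float = decoded.astype(np.float32)

# Step 4: rescale from [0, 255] to [0, 1], the same effect as rescale=1./255.
rescaled = as_float / 255.0

print(rescaled.dtype, rescaled.shape)
print(rescaled.min() >= 0.0, rescaled.max() <= 1.0)
```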

In [0]:
train_image_generator = ImageDataGenerator(rescale=1./255)  # Generator for our training data
validation_image_generator = ImageDataGenerator(rescale=1./255)  # Generator for our validation data

After defining the generators for training and validation images,
the `flow_from_directory` method loads images from the disk, applies rescaling, and resizes the images to the required dimensions.

In [0]:
train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                           class_mode='binary')

In [0]:
val_data_gen = validation_image_generator.flow_from_directory(batch_size=batch_size,
                                                              directory=validation_dir,
                                                              target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                              class_mode='binary')

## Visualize training images

In [0]:
sample_training_images, _ = next(train_data_gen)

The `next` function returns a batch from the dataset. Its return value has the form `(x_train, y_train)`, where `x_train` holds the training features and `y_train` their labels. Discard the labels to visualize only the training images.
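With `class_mode='binary'`, the labels come from the subdirectory names: `flow_from_directory` sorts the class folders alphanumerically and assigns each an integer index. The sketch below mimics only that mapping with the standard library, on a temporary directory created for the example. The helper name `infer_class_indices` is hypothetical and not part of Keras.

```python
import os
import tempfile

def infer_class_indices(directory):
    """Map each class subdirectory to an index, in sorted name order."""
    classes = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    return {name: index for index, name in enumerate(classes)}

# Build a throwaway layout shaped like the tutorial's train directory.
with tempfile.TemporaryDirectory() as root:
    for class_name in ("dogs", "cats"):
        os.makedirs(os.path.join(root, class_name))
    class_indices = infer_class_indices(root)

print(class_indices)  # {'cats': 0, 'dogs': 1}
```

On the real generator, the same mapping is available as `train_data_gen.class_indices`.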

In [0]:
def plot_images(images_arr):
    """Plot up to five images side by side in a single row."""
    fig, axes = plt.subplots(1, 5, figsize=(10, 10))
    for img, ax in zip(images_arr, axes):
        ax.imshow(img)
        ax.axis('off')
    plt.tight_layout()
    plt.show()

In [0]:
plot_images(sample_training_images[:5])

## Create the model

In [0]:
model = tf.keras.models.Sequential([
    Conv2D(16, (3, 3), padding='same', 
           activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(32, 3, padding='same', activation='relu'),  # kernel_size can be an integer or a tuple/list
    MaxPooling2D(), # Default pool_size=2
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Flatten(),
    Dense(512, activation='relu'),
    Dense(1, activation='sigmoid')
])

## Compile the model
Choose the Adam optimizer and the binary cross-entropy loss function. To view training and validation accuracy for each training epoch, pass the `metrics` argument.
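To make the loss concrete, here is a rough NumPy sketch of binary cross-entropy applied to sigmoid outputs. It illustrates the formula only and is not Keras's implementation. The function name `binary_crossentropy`, the clipping constant `eps`, and the sample values are assumptions chosen for the example.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip predictions away from 0 and 1 so log() stays finite.
    p = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    per_example = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(np.mean(per_example))

labels = [0, 1, 1, 0]                  # hypothetical labels, e.g. 0 = cat, 1 = dog
mostly_right = [0.1, 0.9, 0.8, 0.2]    # predicted probabilities of class 1
mostly_wrong = [0.9, 0.1, 0.2, 0.8]

loss_right = binary_crossentropy(labels, mostly_right)
loss_wrong = binary_crossentropy(labels, mostly_wrong)
print(loss_right, loss_wrong)  # the second loss is much larger
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is why the loss can rise on validation data even while accuracy stays flat.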

In [0]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [0]:
model.summary()

## Train the model
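Use the model's `fit_generator` method to train the network on the batches produced by the `ImageDataGenerator` pipeline. In TensorFlow 2.1 and later, `fit_generator` is deprecated and `Model.fit` accepts generators directly.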

In [0]:
history = model.fit_generator(
    train_data_gen,
    steps_per_epoch=total_train//batch_size,
    epochs=epochs,
    validation_data=val_data_gen,
    validation_steps=total_val//batch_size
)

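Keras' generators are infinite: they keep looping over the data and yielding batches. Because of this, Keras cannot know by itself how many batches the generators should yield to complete one epoch, so you state it explicitly. `steps_per_epoch` is the number of training batches that make up one epoch, and `validation_steps` is the total number of steps (batches of samples) to validate before stopping.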
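The step counts are plain integer arithmetic. The sketch below uses assumed image counts (2000 training and 1000 validation images) purely for illustration; in the notebook the real values come from `total_train`, `total_val` and `batch_size`. The `example_` names are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import math

# Assumed counts for illustration only.
example_total_train = 2000
example_total_val = 1000
example_batch_size = 128

# Floor division counts only full batches.
example_steps_per_epoch = example_total_train // example_batch_size
example_validation_steps = example_total_val // example_batch_size
print(example_steps_per_epoch, example_validation_steps)  # 15 7

# Rounding up instead would also count the final partial batch.
example_steps_rounded_up = math.ceil(example_total_train / example_batch_size)
print(example_steps_rounded_up)  # 16
```

With floor division, the images left over after the last full batch are not counted in that epoch's step total.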

## Visualize training results

In [0]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend()
plt.title('Training and Validation Loss')

plt.show()

Training accuracy and validation accuracy differ by a large margin, and the model has achieved only around 70% accuracy on the validation set.

## Overfitting
In the plots above, the training accuracy increases linearly over time, whereas validation accuracy stalls around 70% during training. The gap between training and validation accuracy is also noticeable, which is a sign of overfitting.

When there are only a small number of training examples, the model sometimes learns from noise or unwanted details in the training examples, to an extent that it negatively impacts the performance of the model on new examples. This phenomenon is known as overfitting. It means that the model will have a difficult time generalizing to a new dataset.
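One rough way to spot overfitting numerically is to track the gap between training and validation accuracy and to find the epoch where validation loss bottoms out. The sketch below uses made-up per-epoch numbers shaped like `history.history`; they are not results from this notebook. The helper names `accuracy_gaps` and `best_epoch` are hypothetical.

```python
# Hypothetical per-epoch values, shaped like `history.history`.
example_history = {
    'accuracy':     [0.55, 0.65, 0.75, 0.85, 0.95],
    'val_accuracy': [0.54, 0.62, 0.68, 0.70, 0.69],
    'loss':         [0.69, 0.62, 0.50, 0.35, 0.15],
    'val_loss':     [0.69, 0.64, 0.59, 0.61, 0.75],
}

def accuracy_gaps(history):
    """Training accuracy minus validation accuracy, per epoch."""
    pairs = zip(history['accuracy'], history['val_accuracy'])
    return [round(train - val, 4) for train, val in pairs]

def best_epoch(history):
    """Zero-based index of the epoch with the lowest validation loss."""
    val_loss = history['val_loss']
    return min(range(len(val_loss)), key=val_loss.__getitem__)

gaps = accuracy_gaps(example_history)
print(gaps)                          # the gap widens every epoch
print(best_epoch(example_history))   # 2
```

A gap that keeps widening while validation loss starts to rise again is the pattern described above.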

There are multiple ways to fight overfitting in the training process. In this tutorial, you'll use data augmentation and add dropout to our model.

## Data augmentation
Overfitting generally occurs when there are only a small number of training examples. One way to fix this problem is to augment the dataset so that it has a sufficient number of training examples. Data augmentation generates more training data from existing training samples by applying random transformations that yield believable-looking images. The goal is that the model never sees the exact same picture twice during training. This exposes the model to more aspects of the data and helps it generalize better.

Implement this in `tf.keras` using the `ImageDataGenerator` class. Pass the desired transformations as arguments, and the generator takes care of applying them during the training process.
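To give a feel for what one such random transformation does, here is a minimal NumPy sketch of a random horizontal flip. It is an illustration of the idea, not the code `ImageDataGenerator` runs. The function name `random_horizontal_flip`, its `probability` parameter, and the tiny test image are assumptions made for this example.

```python
import numpy as np

def random_horizontal_flip(image, rng, probability=0.5):
    """Return the image mirrored left-to-right with the given probability."""
    if rng.random() < probability:
        # Reverse the width axis of a (height, width, channels) array.
        return image[:, ::-1, :]
    return image

rng = np.random.default_rng(42)
# Tiny 2x3 RGB "image" with distinct values so a flip is easy to see.
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)

# Applying the transform repeatedly to one image gives a mix of
# original and mirrored copies, depending on the random draws.
variants = [random_horizontal_flip(image, rng) for _ in range(5)]
always_flipped = random_horizontal_flip(image, rng, probability=1.0)
print(np.array_equal(always_flipped[:, ::-1, :], image))  # True
```

Because the choice is redrawn every time the image is requested, the model can see a different variant of the same picture in each epoch.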

In [0]:
# Apply horizontal flip
image_gen = ImageDataGenerator(rescale=1./255, horizontal_flip=True)

In [0]:
train_data_gen = image_gen.flow_from_directory(batch_size=batch_size,
                                               directory=train_dir,
                                               shuffle=True,
                                               target_size=(IMG_HEIGHT, IMG_WIDTH))

In [0]:
# Take one sample image from the training example and repeat it five times
# so that the augmentation is applied to the same image five times.
augmented_images = [train_data_gen[0][0][0] for i in range(5)]

In [0]:
plot_images(augmented_images)

In [0]:
# Randomly rotate the image by up to 45 degrees
image_gen = ImageDataGenerator(rescale=1./255, rotation_range=45)
train_data_gen = image_gen.flow_from_directory(batch_size=batch_size,
                                               directory=train_dir,
                                               shuffle=True,
                                               target_size=(IMG_HEIGHT, IMG_WIDTH))
augmented_images = [train_data_gen[0][0][0] for i in range(5)]

In [0]:
plot_images(augmented_images)

In [0]:
# Apply zoom augmentation to the dataset to zoom images up to 50% randomly.
image_gen = ImageDataGenerator(rescale=1./255, zoom_range=0.5)

In [0]:
train_data_gen = image_gen.flow_from_directory(batch_size=batch_size,
                                               directory=train_dir,
                                               shuffle=True,
                                               target_size=(IMG_HEIGHT, IMG_WIDTH))
augmented_images = [train_data_gen[0][0][0] for _ in range(5)]

In [0]:
plot_images(augmented_images)

## Put it all together
Apply all the previous augmentations. Here, rescaling, 45 degree rotation, width shift, height shift, horizontal flip and zoom augmentation are applied to the training images.

In [0]:
image_gen_train = ImageDataGenerator(
                    rescale=1./255,
                    rotation_range=45,
                    width_shift_range=.15,
                    height_shift_range=.15,
                    horizontal_flip=True,
                    zoom_range=.5
)

In [0]:
train_data_gen = image_gen_train.flow_from_directory(batch_size=batch_size,
                                                     directory=train_dir,
                                                     shuffle=True,
                                                     target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                     class_mode='binary')

In [0]:
# Visualize how a single image would look five different times when passing these
# augmentations randomly to the dataset.
augmented_images = [train_data_gen[0][0][0] for i in range(5)]
plot_images(augmented_images)

## Dropout
Another technique to reduce overfitting is to introduce *dropout* to the network. It is a form of *regularization*: during training, a random subset of a layer's output units is set to zero, so the network cannot rely too heavily on any single unit, which reduces overfitting on small training sets.

When applying 0.1 dropout to a certain layer, it randomly drops 10% of that layer's output units. A new random mask is drawn at every training step (each batch), not once per epoch, and dropout is inactive at inference time.
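The mechanism can be sketched in a few lines of NumPy. This is a simplified illustration, not the Keras layer itself. The function name `dropout` and the all-ones test input are assumptions for the example. The surviving units are scaled by `1 / (1 - rate)`, which matches how `tf.keras.layers.Dropout` is documented to keep the expected sum of the inputs unchanged.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out roughly `rate` of the units and rescale the survivors."""
    if not training or rate == 0.0:
        return activations  # dropout does nothing at inference time
    keep_prob = 1.0 - rate
    # Each call draws a fresh random mask; True means the unit is kept.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))

dropped = dropout(activations, rate=0.1, rng=rng)
fraction_zeroed = float((dropped == 0).mean())
print(fraction_zeroed)   # close to 0.1
print(dropped.mean())    # close to 1.0, thanks to the rescaling
```

Passing `training=False` returns the input unchanged, which mirrors how dropout layers behave when the model is used for prediction.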

In [0]:
# Create a new network with Dropout layers
model_new = Sequential([
    Conv2D(16, 3, padding='same', activation='relu',
           input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    MaxPooling2D(),
    Dropout(0.2),
    Conv2D(32, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Dropout(0.2),
    Flatten(),
    Dense(512, activation='relu'),
    Dense(1, activation='sigmoid')
])

In [0]:
model_new.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
model_new.summary()

In [0]:
history = model_new.fit_generator(
    train_data_gen,
    steps_per_epoch=total_train//batch_size,
    epochs=epochs,
    validation_data=val_data_gen,
    validation_steps=total_val//batch_size
)

In [0]:
epoch_range = range(epochs)

plt.figure(figsize=(8, 8))
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

plt.subplot(1, 2, 1)
plt.plot(epoch_range, acc, label='Training Accuracy')
plt.plot(epoch_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epoch_range, loss, label='Training Loss')
plt.plot(epoch_range, val_loss, label='Validation Loss')
plt.legend()
plt.title('Training and Validation Loss')

plt.show()