In [None]:
# Copyright 2018 Google LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

## Training Demonstration for Computer Vision / Grayscale - EMNIST

This demonstration uses the EMNIST dataset, an extension of MNIST (handwritten digits). It is derived from NIST Special Database 19, which covers handwritten lowercase and uppercase characters and digits, and thus consists of 62 categories (vs. 10 in MNIST). The dataset consists of 800K images (vs. 70K for MNIST).

The images in this version of EMNIST were prepared with a method similar to the one popularized by Yann LeCun for the MNIST dataset. In that method, the original images were size-normalized to fit a 20x20 box with anti-aliasing and then centered in a 28x28 image. Here, the original images of NIST Special Database 19 were downsampled from 128x128 to 28x28 using OpenCV INTER_AREA interpolation, which gives good results at minimizing artifacts when downsampling.
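
To illustrate why area interpolation suits downsampling, the following is a minimal NumPy sketch of area averaging, in which each output pixel is the coverage-weighted mean of the input pixels it overlaps. It illustrates the idea only and is not OpenCV's implementation; the names `area_weights` and `downsample_area` are hypothetical. With OpenCV itself, the corresponding call would be `cv2.resize(image, (28, 28), interpolation=cv2.INTER_AREA)`.

```python
import numpy as np

def area_weights(n_in, n_out):
    """Row i holds the fraction of output pixel i covered by each input pixel."""
    scale = n_in / n_out  # input pixels per output pixel (e.g. 128 / 28)
    weights = np.zeros((n_out, n_in))
    for i in range(n_out):
        start, end = i * scale, (i + 1) * scale
        for j in range(int(np.floor(start)), min(int(np.ceil(end)), n_in)):
            # Length of the overlap between [start, end] and input pixel [j, j + 1]
            overlap = min(end, j + 1) - max(start, j)
            weights[i, j] = max(overlap, 0.0) / scale
    return weights

def downsample_area(image, size):
    """Downsample a 2-D grayscale image to size=(height, width) by area averaging."""
    h_out, w_out = size
    wy = area_weights(image.shape[0], h_out)
    wx = area_weights(image.shape[1], w_out)
    return wy @ image @ wx.T

# Usage: a synthetic 128x128 image reduced to 28x28
small = downsample_area(np.random.rand(128, 128), (28, 28))
print(small.shape)
```

Because every output pixel averages all of the input area it covers, no input pixel is skipped, which is what keeps aliasing artifacts low compared with nearest-neighbor sampling.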

## Prerequisites

The following packages need to be installed:

        OpenCV       : pip install opencv-python
        numpy        : pip install numpy
        import-ipynb : pip install import-ipynb

In [None]:
import cv2
import numpy as np
import import_ipynb

### Download the Dataset

The EMNIST dataset will need to be downloaded to the same directory (folder) as this notebook.

A zip file (compressed) of the dataset can be obtained at this location:

https://pantheon.corp.google.com/storage/browser/cloud-samples-data/air/emnist

## ML Pipeline Chain

The following ML pipelines will be chained together for this demonstration:

        emnist -> openCV -> hdf5 -> model_keras

### Process Image Files into Machine Learning Data using OpenCV module

In [None]:
# Import the openCV ML pipeline
import openCV

In [None]:
# Process the on-disk set of images into an in-memory set of machine-learning-ready data
dataset = openCV.load_directory('emnist', colorspace=openCV.GRAYSCALE, resize=(28,28), flatten=False, concurrent=4, verbose=True)

The dataset should consist of 62 collections (26 lowercase letters, 26 uppercase letters, 10 digits).  
Each collection should consist of a set of three entries: data, labels, and errors.
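
The expected 62 classes can be checked programmatically. The sketch below assumes each collection is a `(data, label, errors)` triple whose label is a single character such as 'Z'; that layout is inferred from the cells in this notebook, and the helper name `check_collections` is hypothetical.

```python
import string

# The 62 expected classes: 26 lowercase + 26 uppercase + 10 digits.
EXPECTED_CLASSES = set(string.ascii_lowercase + string.ascii_uppercase + string.digits)

def check_collections(dataset):
    """Return (missing, unexpected) label sets for a list of collections.

    Assumes collection[1] is the collection's single-character label.
    """
    found = {str(collection[1]) for collection in dataset}
    return EXPECTED_CLASSES - found, found - EXPECTED_CLASSES

# Usage with a tiny stand-in dataset (not real EMNIST data)
missing, unexpected = check_collections([([], "Z", []), ([], "7", [])])
print(len(missing), "classes missing;", len(unexpected), "unexpected")
```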

In [None]:
print( "Number of collections:", len(dataset) )
print( "Number of sets in a collection:", len(dataset[0]))

The first collection should have the label (letter) 'Z' and consist of 2698 images.

In [None]:
print("Number of images:", len(dataset[0][0]))
print("Label for collection:", dataset[0][1])

In [None]:
print("Shape of Preprocessed Image", dataset[0][0][0].shape)
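
The CNN constructed later in this notebook takes an input of shape (28, 28, 1), while a grayscale image loaded with `flatten=False` is presumably a 2-D (28, 28) array. The following is a minimal NumPy sketch of adding the channel axis and scaling pixel values to the 0..1 range. Whether the `openCV` or `model_keras` modules already perform these steps is not known from this notebook, and `to_cnn_input` is a hypothetical helper.

```python
import numpy as np

def to_cnn_input(images):
    """Stack 2-D grayscale images into a float32 array of shape (n, h, w, 1)."""
    batch = np.asarray(images, dtype=np.float32)
    if batch.max() > 1.0:
        # Assumption: values above 1 mean 0..255 pixel data; scale to 0..1
        batch = batch / 255.0
    return batch[..., np.newaxis]  # append the single grayscale channel

# Usage with synthetic images
batch = to_cnn_input([np.zeros((28, 28), dtype=np.uint8) for _ in range(4)])
print(batch.shape)
```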

### Store Machine Learning Ready (preprocessed images) data into HDF5 storage

In [None]:
# Import the HDF5 storage ML pipeline
import hdf5

In [None]:
# Store the machine learning ready data to HDF5
hdf5.store_dataset('emnist', dataset, verbose=True)

In [None]:
import os
print("HDF5 file size:", int( os.path.getsize('emnist.h5') / (1024 * 1024) ), "MB")
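
As a rough sanity check on the reported file size, the uncompressed size of the pixel data can be estimated by simple arithmetic. The sketch assumes roughly 800K images of 28x28 pixels (the figures quoted in the introduction) and no compression; the data type that the `hdf5` module actually stores is not known from this notebook, so both 1-byte and 4-byte pixels are shown. The helper `raw_size_mb` is hypothetical.

```python
def raw_size_mb(n_images, height, width, bytes_per_pixel):
    """Uncompressed size of the pixel data in MB (1 MB = 1024 * 1024 bytes)."""
    return n_images * height * width * bytes_per_pixel / (1024 * 1024)

print("uint8  :", int(raw_size_mb(800_000, 28, 28, 1)), "MB")
print("float32:", int(raw_size_mb(800_000, 28, 28, 4)), "MB")
```

A file size far from either estimate would suggest compression, a different data type, or a different image count.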

### Construct CNN using Keras

In [None]:
# Import the Keras CNN Model ML pipeline
import model_keras

In [None]:
# Construct a CNN consisting of:
# Convolutional layer of 32 filters with input shape (28, 28, 1)
# Convolutional layer of 64 filters
# Neural network layer of 128 nodes and 50% dropout
# Neural network layer of 64 nodes (no dropout)
# Output layer with 62 nodes (classes)
model = model_keras.construct_cnn( (28, 28, 1), 62, n_filters=(32, 64), n_nodes=(128, 64), dropout=(0.50,0))
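
For orientation, a network with the layer sizes described above might look like the following Keras sketch. This is not the implementation of `model_keras.construct_cnn`, whose internals are not shown in this notebook. The 3x3 kernels, 2x2 max pooling, ReLU activations, Adam optimizer, and loss function are all assumptions, and `sketch_cnn` is a hypothetical name.

```python
from tensorflow import keras
from tensorflow.keras import layers

def sketch_cnn(input_shape=(28, 28, 1), n_classes=62):
    """Two convolutional layers, two dense layers, and a softmax output."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),  # kernel size assumed
        layers.MaxPooling2D((2, 2)),                   # pooling assumed
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                           # 50% dropout on the 128-node layer
        layers.Dense(64, activation="relu"),           # no dropout on the 64-node layer
        layers.Dense(n_classes, activation="softmax"),
    ])
    # Assumes one-hot encoded labels; optimizer and loss are assumptions
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

sketch = sketch_cnn()
sketch.summary()
```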

### Train the Model

In [None]:
# load the dataset back into memory
collections, labels, classes = hdf5.load_dataset('emnist')

In [None]:
print("Images", type(collections), len(collections))
print("Labels", type(labels), len(labels))
print("Classes", classes)

During training (in verbose mode), each epoch outputs the current accuracy on the training data (acc) and the accuracy on the held-out validation data (val_acc).

*Best Practices*
1. Once the value of val_acc levels off (stops improving), you should stop training; otherwise the model may overfit.

2. If there is a high value for acc and low value for val_acc, the model is likely overfitted. Things to try:
        A. Add higher dropout or dropout to more layers.
        B. Reduce the number of nodes.
        
3. If you increase the batch size, the training time per epoch is reduced. Common practice is to set (mini) batch sizes between 32 and 256.
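
The first practice, stopping once val_acc levels off, can be expressed as a simple patience rule. The sketch below is plain Python and independent of `model_keras`; the function name `should_stop` and the default values for `patience` and `min_delta` are illustrative assumptions. Keras offers a comparable built-in mechanism, the `keras.callbacks.EarlyStopping` callback, which takes `monitor` and `patience` arguments.

```python
def should_stop(val_acc_history, patience=3, min_delta=0.001):
    """Return True when val_acc has not improved by at least min_delta
    during the last `patience` epochs."""
    if len(val_acc_history) <= patience:
        return False  # too few epochs to judge
    best_before = max(val_acc_history[:-patience])
    best_recent = max(val_acc_history[-patience:])
    return best_recent < best_before + min_delta

# Usage: made-up val_acc values, one per epoch
print(should_stop([0.80, 0.85, 0.88, 0.90, 0.92]))        # still improving
print(should_stop([0.80, 0.90, 0.9003, 0.9005, 0.9002]))  # leveled off
```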

In [None]:
# Train the model
accuracy = model_keras.train_cnn(model, collections, labels, epochs=10, batch_size=256, verbose=True)

In [None]:
# Display the accuracy
print(accuracy)

### Save the Model

In [None]:
# Save the model
model.save('emnist.model.h5')