# Deep Learning Exercise with MNIST and Keras

## Introduction

This exercise complements the topics learned in the prerequisite tutorial [Deep Learning from Pre-Trained Models with Keras](keras-transfer-learning-tutorial.ipynb).

This exercise aims to reinforce the previous learnings and prepare participants to apply Keras and Image Classification to their own datasets.  Participants will:

* Practice what they've learned from the prerequisite *Deep Learning from Pre-Trained Models with Keras* tutorial,
* create their own CNN based image classifier for the MNIST digits dataset,
* and finally, submit the classification results from their model to Kaggle for evaluation.

Participants are expected to bring their own laptops and sign up for free online cloud services (e.g., Google Colab, Kaggle).  They may also need to download free, open-source software prior to arriving for the workshop.

* Tutorial materials are derived from:
  * [PyTorch Tutorials](https://github.com/kaust-vislab/pytorch-tutorials) by David Pugh.

## Setup

### Setup Colab

In order to run this notebook in [Google Colab](https://colab.research.google.com) you will need a [Google Account](https://accounts.google.com/).  Sign-in to your Google account, if necessary, and then start the notebook.

Change Google Colab runtime to use GPU:

* Click `Runtime` -> `Change runtime type` menu item
* Specify `Hardware accelerator` as `GPU`
* Click **[Save]** button

The session indicator (toolbar / status ribbon under menu) should briefly appear as `Connecting...`.  When the session restarts, continue with the next cell (specifying TensorFlow version v2.x):

In [None]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

### Download MNIST Data

In [None]:
# %load cache_utils.py
import pathlib
import tensorflow.keras.utils as Kutils

def cache_mnist_data():
    for n in ["mnist.npz", "kaggle/train.csv", "kaggle/test.csv"]:
        path = pathlib.Path("../datasets/mnist/%s" % n).absolute()
        if not path.is_file():
            print("missing local dataset file: %s" % n)
        DATA_URL = "file:///" + str(path)
        try:
            data_file_path = Kutils.get_file(n.replace('/','-mnist-'), DATA_URL)
            print("cached file: %s" % n)
        except (FileNotFoundError, ValueError, Exception) as e:
            print("FAILED: First fetch file: %s" % n)

def cache_cifar10_data():
    for n in ["cifar-10.npz", "cifar-10-batches-py.tar.gz"]:
        path = pathlib.Path("../datasets/cifar10/%s" % n).absolute()
        if not path.is_file():
            print("missing local dataset file: %s" % n)
        DATA_URL = "file:///" + str(path)
        try:
            data_file_path = Kutils.get_file(n, DATA_URL)
            print("cached file: %s" % n)
        except (FileNotFoundError, ValueError, Exception) as e:
            print("FAILED: First fetch file: %s" % n)

def cache_models():
    for n in ["vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5"]:
        path = pathlib.Path("../models/%s" % n).absolute()
        if not path.is_file():
            print("missing local dataset file: %s" % n)
        DATA_URL = "file:///" + str(path)
        try: 
            data_file_path = Kutils.get_file(n, DATA_URL, cache_subdir='models')
            print("cached file: %s" % n)
        except (FileNotFoundError, ValueError, Exception) as e:
            print("FAILED: First fetch file: %s" % n)


If you are using Binder to run this notebook, then the data is already downloaded and available.  Skip to the next step.

If you are using Google Colab to run this notebook, then you will need to download the data before proceeding.

#### Download MNIST with Keras

If you are running this notebook using Google Colab, then download the data using the Keras `load_data()` API by running the code in the following cell.

In [None]:
from tensorflow.keras.datasets import mnist
cache_mnist_data();
mnist.load_data();

## Exercise

### Setup

Initialize the Python environment by importing and verifying the modules we will use.

In [None]:
import os
import sys
import pathlib
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
import tensorflow as tf
import tensorflow.keras as keras

`%matplotlib inline` is a magic command that makes *matplotlib* charts and plots appear as outputs in the notebook.

`%matplotlib notebook` enables semi-interactive plots that can be enlarged, zoomed, and cropped while the plot is active.  One issue with this option is that new plots appear in the active plot widget, not in the cell where the data was produced.

In [None]:
%matplotlib inline

In [None]:
# %load verify_runtime.py
# Verify runtime environment

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    IS_COLAB = True
except Exception:
    IS_COLAB = False
print("is_colab:", IS_COLAB)

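# Compare numeric version parts; a plain string comparison would rank "10.0" below "2.0".
assert tuple(int(p) for p in tf.__version__.split(".")[:2]) >= (2, 0), "TensorFlow version >= 2.0 required."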
print("tensorflow_version:", tf.__version__)

assert sys.version_info >= (3, 5), "Python >= 3.5 required."
print("python_version:", "%s.%s.%s-%s" % (sys.version_info.major, 
                                          sys.version_info.minor,
                                          sys.version_info.micro,
                                          sys.version_info.releaselevel
                                         ))

print("executing_eagerly:", tf.executing_eagerly())

try:
    __physical_devices = tf.config.list_physical_devices('GPU')
except AttributeError:
    __physical_devices = tf.config.experimental.list_physical_devices('GPU')
    
if len(__physical_devices) == 0:
    print("No GPUs available. Expect training to be very slow.")
    if IS_COLAB:
        print("Go to `Runtime` > `Change runtime type` and select a GPU hardware accelerator. "
              "Then `Save` to restart session.")
else:
    print("is_built_with_cuda:", tf.test.is_built_with_cuda())
    print("gpus_available:", [d.name for d in __physical_devices])


### Dataset Pre-processing - MNIST

The previously acquired MNIST dataset is the essential input needed to train an image classification model. Before using the dataset, there are several preprocessing steps required to load the data, and create the correctly sized training, validation, and testing arrays used as input to the network.

The following data preparation steps are needed before they can become inputs to the network:

* Cache the downloaded dataset (to assist Keras `load_data()` functionality).
* Load the dataset (MNIST is small, and fits in memory).
    * Convert from textual CSV files into binary arrays.
    * Reshape from (784, 1) to (28, 28, 1) – and maybe to (32, 32, 3)
* Verify the shape and type of the data, and understand it...
* Convert label indices into categorical vectors.
* Convert image data from integer to float values, and normalize.
  * Verify converted input data.
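
The reshape steps in the list above can be sketched with NumPy alone. This is only an illustration on synthetic data: the helper names `to_image_batch` and `to_rgb_32` are hypothetical, and zero-padding by two pixels per side followed by repeating the grayscale channel is just one assumed way to reach (32, 32, 3), presumably for networks that expect three-channel input.

```python
import numpy as np

def to_image_batch(flat_rows, side=28):
    """Reshape (N, side*side) rows into (N, side, side, 1) image tensors."""
    flat_rows = np.asarray(flat_rows)
    return flat_rows.reshape((-1, side, side, 1))

def to_rgb_32(images):
    """Zero-pad (N, 28, 28, 1) images to 32x32 and repeat the channel 3 times."""
    padded = np.pad(images, ((0, 0), (2, 2), (2, 2), (0, 0)), mode="constant")
    return np.repeat(padded, 3, axis=-1)

# Synthetic stand-in for two flattened MNIST rows (the values are arbitrary).
fake_rows = (np.arange(2 * 784) % 256).astype(np.uint8).reshape(2, 784)

images = to_image_batch(fake_rows)
rgb = to_rgb_32(images)
print("flat:", fake_rows.shape, "image:", images.shape, "rgb:", rgb.shape)
```

Padding keeps the digit centred and leaves the pixel values untouched, which is why it is used here rather than interpolation.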

#### Cache Data

Make downloaded data available to Keras.  Provide dataset utility functions.

In [None]:
# Cache MNIST Datasets
cache_mnist_data()

In [None]:
%%bash
find ~/.keras -name "*mnist*" -type f

In [None]:
# Helper functionality to provide human-readable labels
def mnist_index_label(idx):
    return int(idx)

def mnist_category_label(cat):
    return mnist_index_label(cat.argmax())

def mnist_label(v):
    return mnist_index_label(v) if np.isscalar(v) or np.size(v) == 1 else mnist_category_label(v)

In [None]:
cache_models()

#### Load Data

Load data via Keras API.  This loads data into a `numpy` array, and the test examples are labelled.

In [None]:
from tensorflow.keras.datasets import mnist

# The data, split between train and test sets:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

**Note:** Backup plan: Run the following cell if the data didn't load via `mnist.load_data` above.

In [None]:
# Try secondary data source if the first didn't work
try:
    print("data loaded." if type((x_train, y_train, x_test, y_test)) else "load failed...")
except NameError:
    with np.load('../datasets/mnist/mnist.npz') as data:
        x_train = data['x_train']
        y_train = data['y_train']
        x_test = data['x_test']
        y_test = data['y_test']
    print("alternate data load." if type((x_train, y_train, x_test, y_test)) else "failed...")

#### Explore Data

Explore data types, shape, and value ranges.  Ensure they make sense, and you understand the data well.

In [None]:
print('x_train type:', type(x_train), ',', 'y_train type:', type(y_train))
print('x_train dtype:', x_train.dtype, ',', 'y_train dtype:', y_train.dtype)
print('x_train shape:', x_train.shape, ',', 'y_train shape:', y_train.shape)
print('x_test shape:', x_test.shape, ',', 'y_test shape:', y_test.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

In [None]:
print('x_train (min, max, mean): (%s, %s, %s)' % (x_train.min(), x_train.max(), x_train.mean()))
print('y_train (min, max): (%s, %s)' % (y_train.min(), y_train.max()))

In [None]:
def to_image(arr):
    if len(arr.shape) == 3 and arr.shape[-1] == 3:
        return arr
    elif len(arr.shape) == 3 and arr.shape[-1] == 1:
        return arr.reshape(arr.shape[0:-1])
    elif len(arr.shape) == 1 and np.sqrt(arr.shape[0]) == int(np.sqrt(arr.shape[0])):
        return arr.reshape((int(np.sqrt(arr.shape[0])), int(np.sqrt(arr.shape[0]))))
    else:
        return arr # Give up, let `matplotlib` `imshow` complain about malformed image.

def imageset_plot(img_data=None):
    (x_imgs, y_imgs) = img_data if img_data else (x_train, y_train)
    fig = plt.figure(figsize=(16,8))

    for i in range(40):
        plt.subplot(4, 10, i + 1)
        plt.xticks([])
        plt.yticks([])
        idx = int(random.uniform(0, x_imgs.shape[0]))
        plt.title("%s" % (mnist_label(y_imgs[idx])))
        plt.imshow(to_image(x_imgs[idx]), cmap=plt.get_cmap('gray'))
    plt.show()

In [None]:
# Show array of random labelled images with matplotlib (re-run cell to see new examples)
imageset_plot((x_train, y_train))

In [None]:
# %load histogram_utils.py
# Histogram utils

def histogram_plot(img_data=None):
  (x_data, y_data) = img_data if img_data else (x_train, y_train)

  fig = plt.figure(figsize=(12,5))

  plt.subplot(1,2,1)
  plt.hist(y_data, bins = range(int(y_data.min()), int(y_data.max() + 2)))
  plt.xticks(range(int(y_data.min()), int(y_data.max() + 2)))
  plt.title("y histogram")
  plt.subplot(1,2,2)
  plt.hist(x_data.flat, bins = range(int(x_data.min()), int(x_data.max() + 2)))
  plt.title("x histogram")
  plt.tight_layout()
  plt.show()

  hist, bins = np.histogram(y_data, bins = range(int(y_data.min()), int(y_data.max() + 2)))
  print('y histogram counts:', hist)

def histogram_label_plot(train_img_data=None, test_img_data=None):
  (x_train_data, y_train_data) = train_img_data if train_img_data else (x_train, y_train)
  (x_test_data, y_test_data) = test_img_data if test_img_data else (x_test, y_test)

  x_data_min = int(min(x_train_data.min(), x_test_data.min()))
  x_data_max = int(max(x_train_data.max(), x_test_data.max()))
  y_data_min = int(min(y_train_data.min(), y_test_data.min()))
  y_data_max = int(max(y_train_data.max(), y_test_data.max()))
  num_rows = y_data_max - y_data_min + 1
  
  fig = plt.figure(figsize=(12,12))

  plot_num = 1
  for lbl in range(y_data_min, y_data_max + 1):  # include the last label
    plt.subplot(num_rows, 2 , plot_num)
    plt.hist(x_train_data[y_train_data.squeeze() == lbl].flat, bins = range(x_data_min, x_data_max + 2))
    plt.title("x train histogram - label %s" % lbl)
    plt.subplot(num_rows, 2 , plot_num + 1)
    plt.hist(x_test_data[y_test_data.squeeze() == lbl].flat, bins = range(x_data_min, x_data_max + 2))
    plt.title("x test histogram - label %s" % lbl)
    plot_num += 2

  plt.tight_layout(pad=0)
  plt.show()


In [None]:
histogram_plot((x_train, y_train))

In [None]:
histogram_plot((x_test, y_test))

The data looks reasonable: there are sufficient examples for each category (y_train) and the histogram showing mostly black (0) and near-white grayscale (>250) agrees with the examples shown previously.

Let's do one more sanity check to ensure that the data distributions are also similar per-category.

In [None]:
histogram_label_plot((x_train, y_train), (x_test, y_test))

The per-category histograms also look similar (again, the background dominates the histogram).

##### Visualizing training samples using PCA

[Principal Components Analysis (PCA)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) can be used as a visualization tool to see if there are any obvious patterns in the training samples.
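
To make the idea concrete, here is a minimal NumPy sketch of what a PCA projection computes, using a singular value decomposition of the centred data. It runs on synthetic data, and `pca_project` is a hypothetical helper written for illustration; scikit-learn's implementation has more options and may differ in details such as the sign of each component.

```python
import numpy as np

def pca_project(x, n_components):
    """Project rows of x onto their top principal components using SVD."""
    x_centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    components = vt[:n_components]
    return x_centered @ components.T, components

# Synthetic data: 200 samples, 5 features with very different spreads.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 5)) * np.array([10.0, 3.0, 1.0, 0.5, 0.1])

features, components = pca_project(toy, n_components=2)
print("features:", features.shape, "components:", components.shape)
```

Each row of `components` plays the same role as a row of `pca.components_`: a direction in the original feature space, which for images can be reshaped and displayed as a picture.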

In [None]:
import sklearn
import sklearn.decomposition

_prng = np.random.RandomState(42)

pca = sklearn.decomposition.PCA(n_components=40, random_state=_prng)

x_train_flat = x_train.reshape(*x_train.shape[:1], -1)
y_train_flat = y_train.reshape(y_train.shape[0])
print("x_train:", x_train.shape, "y_train", y_train.shape)
print("x_train_flat:", x_train_flat.shape, "y_train_flat", y_train_flat.shape)
pca_train_features = pca.fit_transform(x_train_flat, y_train_flat)
print("pca_train_features:", pca_train_features.shape)

# Sample 10% of the PCA results
_idxs = _prng.randint(y_train_flat.shape[0], size=y_train_flat.shape[0] // 10)
pca_features = pca_train_features[_idxs]
pca_category = y_train_flat[_idxs]
print("pca_features:", pca_features.shape, 
      "pca_category", pca_category.shape, 
      "min,max category:", pca_category.min(), pca_category.max())

In [None]:
def pca_components_plot(components_, shape_=(32, 32, 3)):
    fig = plt.figure(figsize=(16,8))

    for i in range(min(40, components_.shape[0])):
        plt.subplot(4, 10, i + 1)
        plt.xticks([])
        plt.yticks([])
        eigen_vect = (components_[i] - np.min(components_[i])) / np.ptp(components_[i])
        plt.title('component: %s' % i)
        plt.imshow(eigen_vect.reshape(shape_), cmap=plt.get_cmap('gray'))
    plt.show()

In [None]:
pca_components_plot(pca.components_, (28, 28))

In [None]:
def category_scatter_plot(features, category, title='MNIST'):
    num_category = 1 + category.max() - category.min()

    fig, ax = plt.subplots(1, 1, figsize=(12, 10))
    cm = plt.get_cmap('tab10', num_category)  # plt.cm.get_cmap was removed in newer Matplotlib
    sc = ax.scatter(features[:,0], features[:,1], c=category, alpha=0.4, cmap=cm)
    ax.set_xlabel("Component 1")
    ax.set_ylabel("Component 2")
    ax.set_title(title)
    plt.colorbar(sc)
    plt.show()

In [None]:
from mpl_toolkits.mplot3d import Axes3D

def category_scatter3d_plot(features, category, title='MNIST'):
    num_category = 1 + category.max() - category.min()
    mean_feat = np.mean(features, axis=0)
    std_feat = np.std(features, axis=0)
    min_range = mean_feat - std_feat
    max_range = mean_feat + std_feat
    
    fig = plt.figure(figsize=(12, 10))
    cm = plt.get_cmap('tab10', num_category)  # plt.cm.get_cmap was removed in newer Matplotlib
    ax = fig.add_subplot(111, projection='3d')
    sc = ax.scatter(features[:,0], features[:,1], features[:,2],
                    c=category, alpha=0.85, cmap=cm)
    ax.set_xlabel("Component 1")
    ax.set_ylabel("Component 2")
    ax.set_zlabel("Component 3")
    ax.set_title(title)
    ax.set_xlim(2.0 * min_range[0], 2.0 * max_range[0])
    ax.set_ylim(2.0 * min_range[1], 2.0 * max_range[1])
    ax.set_zlim(2.0 * min_range[2], 2.0 * max_range[2])
    plt.colorbar(sc)
    plt.show()

In [None]:
category_scatter_plot(pca_features, pca_category, title='MNIST - PCA')

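**Note:** The 3D PCA plot works best with `%matplotlib notebook`, which enables interactive rotation.  This notebook selects `%matplotlib inline` at the start of the session, so switch modes first if you want to rotate the plot.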

In [None]:
category_scatter3d_plot(pca_features, pca_category, title='MNIST - PCA')

The data in its original image space appears to *almost* clump into categories, but is not cleanly separable.

##### Visualizing training samples using t-SNE

[t-distributed Stochastic Neighbor Embedding (t-SNE)](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE) is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. For more details on t-SNE, including other use cases, see this excellent *Towards Data Science* [blog post](https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1).
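
The objective described above can be illustrated with a small NumPy sketch. It is deliberately simplified and is not scikit-learn's algorithm: it uses one shared Gaussian bandwidth `sigma` instead of per-point bandwidths chosen from a perplexity, and it only evaluates the Kullback-Leibler divergence for two fixed candidate embeddings rather than optimizing an embedding. All function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(points):
    """Squared Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)

def joint_probs_gaussian(points, sigma=1.0):
    """High-dimensional similarities (shared sigma is a simplification)."""
    p = np.exp(-pairwise_sq_dists(points) / (2.0 * sigma ** 2))
    np.fill_diagonal(p, 0.0)
    return p / p.sum()

def joint_probs_student_t(embedding):
    """Low-dimensional similarities from a heavy-tailed Student-t kernel."""
    q = 1.0 / (1.0 + pairwise_sq_dists(embedding))
    np.fill_diagonal(q, 0.0)
    return q / q.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q), summed over pairs where P is non-zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

# Two well-separated synthetic clusters in 10 dimensions.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, size=(20, 10)),
                    rng.normal(5.0, 0.1, size=(20, 10))])

# A 2-D embedding that keeps the clusters, and one that mixes them up.
good = points[:, :2]
order = np.concatenate([np.arange(0, 40, 2), np.arange(1, 40, 2)])
bad = good[order]

p = joint_probs_gaussian(points)
kl_good = kl_divergence(p, joint_probs_student_t(good))
kl_bad = kl_divergence(p, joint_probs_student_t(bad))
print("KL (clusters kept):", kl_good, "KL (clusters mixed):", kl_bad)
```

The embedding that keeps neighbours together scores a lower divergence, which is the quantity t-SNE drives down by moving the low-dimensional points.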

It is highly recommended to use another dimensionality reduction method (e.g. PCA) to reduce the number of dimensions to a reasonable amount if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples.
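
When the number of components is given as a fraction, the reduction step keeps just enough components to explain that share of the variance. A rough NumPy sketch of that selection rule follows; `components_for_variance` is a hypothetical helper, the data is a tiny hand-built example, and scikit-learn's own tie-breaking at the threshold may differ slightly.

```python
import numpy as np

def components_for_variance(x, fraction=0.95):
    """Smallest number of principal components explaining `fraction` of variance."""
    x_centered = x - x.mean(axis=0)
    singular_values = np.linalg.svd(x_centered, compute_uv=False)
    explained = singular_values ** 2
    ratio = np.cumsum(explained) / explained.sum()
    # First position whose cumulative ratio reaches the requested fraction.
    return int(np.searchsorted(ratio, fraction) + 1)

# Three orthogonal directions with spreads 3, 1 and 0.5 (variance shares
# of roughly 88%, 10% and 2%).
scales = np.array([3.0, 1.0, 0.5])
toy = np.vstack([np.diag(scales), -np.diag(scales)])

print(components_for_variance(toy, fraction=0.95))  # → 2
```

For MNIST the same rule typically keeps far fewer than the 784 raw pixel dimensions, which is what makes the subsequent pairwise-distance computation cheaper.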

In [None]:
import sklearn
import sklearn.decomposition
import sklearn.pipeline
import sklearn.manifold

_prng = np.random.RandomState(42)

embedding2_pipeline = sklearn.pipeline.make_pipeline(
    sklearn.decomposition.PCA(n_components=0.95, random_state=_prng),
    sklearn.manifold.TSNE(n_components=2, random_state=_prng))

embedding3_pipeline = sklearn.pipeline.make_pipeline(
    sklearn.decomposition.PCA(n_components=0.95, random_state=_prng),
    sklearn.manifold.TSNE(n_components=3, random_state=_prng))

In [None]:
# Sample 10% of the data

_prng = np.random.RandomState(42)

_idxs = _prng.randint(y_train_flat.shape[0], size=y_train_flat.shape[0] // 10)
tsne_features = x_train_flat[_idxs]
tsne_category = y_train_flat[_idxs]
print("tsne_features:", tsne_features.shape, 
      "tsne_category", tsne_category.shape, 
      "min,max category:", tsne_category.min(), tsne_category.max())

In [None]:
# t-SNE is SLOW (but can be GPU accelerated!); 
#       lengthy operation, be prepared to wait...

transform2_tsne_features = embedding2_pipeline.fit_transform(tsne_features)

print("transform2_tsne_features:", transform2_tsne_features.shape)
for i in range(2):
    print("min,max features[%s]:" % i, 
          transform2_tsne_features[:,i].min(), 
          transform2_tsne_features[:,i].max())

In [None]:
category_scatter_plot(transform2_tsne_features, tsne_category, title='MNIST - t-SNE')

t-SNE relates the data points (images) according to their closest neighbours.  The *MNIST - t-SNE* plot above shows that the manifold of the original image space can *almost* be well separated into distinct clusters.  Notice the closely related clusters of 9-(blue) & 4-(purple), and 3-(red) & 5-(brown).  When written, those pairs of digit shapes have recognizable similarities to each other too.

In [None]:
# t-SNE is SLOW (but can be GPU accelerated!); 
#       lengthy operation, be prepared to wait...

transform3_tsne_features = embedding3_pipeline.fit_transform(tsne_features)

print("transform3_tsne_features:", transform3_tsne_features.shape)
for i in range(3):
    print("min,max features[%s]:" % i, 
          transform3_tsne_features[:,i].min(), 
          transform3_tsne_features[:,i].max())

In [None]:
category_scatter3d_plot(transform3_tsne_features, tsne_category, title='MNIST - t-SNE')

#### Data Conversion

The training data has type `uint8`, while the network expects `float32` input, so the data must be converted.  The data should also be normalized, and the labels need to be categorical.  That is, instead of each label being one of 10 different values in a 1-D space, it becomes a vector of Boolean values in a 10-D space: one dimension for each category, with a 1 in the dimension of the category the example belongs to and a 0 everywhere else.

* https://keras.io/examples/vision/mnist_convnet/
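
As a small illustration of the conversion described above, the following NumPy sketch scales `uint8` pixels into the range [0, 1] and builds one-hot label vectors. The helper names are hypothetical, and `one_hot` is only meant to mirror what `keras.utils.to_categorical` is expected to produce for integer labels.

```python
import numpy as np

def normalize_images(images_uint8):
    """Convert uint8 pixels (0..255) to float32 values in [0, 1]."""
    return images_uint8.astype("float32") / 255.0

def one_hot(labels, num_classes):
    """Turn integer class indices into rows with a single 1 per row."""
    labels = np.asarray(labels).reshape(-1)
    return np.eye(num_classes, dtype="float32")[labels]

pixels = np.array([[0, 128, 255]], dtype=np.uint8)
labels = np.array([0, 3, 9], dtype=np.uint8)

scaled = normalize_images(pixels)
encoded = one_hot(labels, num_classes=10)
print(scaled.dtype, scaled.min(), scaled.max())
print(encoded[1])  # the row for label 3 has its 1 at index 3
```

Taking `argmax` along the last axis recovers the original integer labels, which is how the plotting helpers in this notebook display categorical labels.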

In [None]:
num_classes = (y_train.max() - y_train.min()) + 1
print('num_classes =', num_classes)

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

In [None]:
print("shape:", x_train.shape, x_test.shape)
x_train = x_train.reshape(x_train.shape + (1,))
x_test = x_test.reshape(x_test.shape + (1,))
print("reshape:", x_train.shape, x_test.shape)

In [None]:
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

train_data = (x_train, y_train)
test_data = (x_test, y_test)

In [None]:
print('x_train type:', type(x_train))
print('x_train dtype:', x_train.dtype)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('y_train type:', type(y_train))
print('y_train dtype:', y_train.dtype)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

### Evaluate Model

Visualize accuracy and loss for training and validation.

* https://keras.io/visualization/
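
Besides plotting, the per-epoch values can be inspected directly, since `history.history` is a dictionary that maps each metric name to a list with one entry per epoch. The sketch below uses a made-up dictionary (the numbers are not real training results) and a hypothetical helper `best_epoch` to find where the validation loss bottoms out, which is a common hint that later epochs are starting to overfit.

```python
def best_epoch(history_dict, monitor="val_loss"):
    """Return (zero-based epoch index, value) where `monitor` is lowest."""
    values = history_dict[monitor]
    best = min(range(len(values)), key=values.__getitem__)
    return best, values[best]

# Made-up values shaped like `history.history` after four epochs.
fake_history = {
    "loss": [0.9, 0.5, 0.3, 0.2],
    "val_loss": [0.8, 0.55, 0.5, 0.6],
}

epoch, value = best_epoch(fake_history)
print("lowest val_loss %.2f at epoch index %d" % (value, epoch))
```

Here the training loss keeps falling while the validation loss rises after epoch index 2, the pattern to watch for in the plots.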

In [None]:
def history_plot(history):
    fig = plt.figure(figsize=(12,5))

    # Plot training & validation loss values (left axis)
    ax1 = fig.add_subplot()
    ax1.set_title('Model accuracy & loss')
    #ax1.set_ylim(0, 1.1 * max(history.history['loss']+history.history['val_loss']))
    ax1.set_prop_cycle(color=['green', 'red'])
    p1 = ax1.plot(history.history['loss'], label='Train Loss')
    p2 = ax1.plot(history.history['val_loss'], label='Test Loss')

    # Plot training & validation accuracy values (right axis)
    ax2 = ax1.twinx()
    ax2.set_ylim(0, 1.1 * max(history.history['accuracy']+history.history['val_accuracy']))
    ax2.set_prop_cycle(color=['blue', 'orange'])
    p3 = ax2.plot(history.history['accuracy'], label='Train Acc')
    p4 = ax2.plot(history.history['val_accuracy'], label='Test Acc')

    ax1.set_ylabel('Loss')
    ax1.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy')

    pz = p3 + p4 + p1 + p2
    plt.legend(pz, [l.get_label() for l in pz], loc='center right')
    plt.show()

In [None]:
def prediction_plot(model, test_data):
    (x_test, y_test) = test_data
    fig = plt.figure(figsize=(16,8))
    correct = 0
    total = 0
    rSym = ''
    
    for i in range(40):
        plt.subplot(4, 10, i + 1)
        plt.xticks([])
        plt.yticks([])
        idx = int(random.uniform(0, x_test.shape[0]))
        result = model.predict(x_test[idx:idx+1])[0]
        if y_test is not None:
            rCorrect = True if np.argmax(y_test[idx]) == np.argmax(result) else False
            rSym = '✔' if rCorrect else '✘'
            correct += 1 if rCorrect else 0
        total += 1
        plt.title("%s %s" % (rSym, np.argmax(result)))
        plt.imshow(x_test[idx][:,:,0], cmap=plt.get_cmap('gray'))
    plt.show()
    
    if y_test is not None:
        print("% 3.2f%% correct (%s/%s)" % (100.0 * float(correct) / float(total), correct, total))

In [None]:
def prediction_classes_plot(model, test_data):
    (x_test, y_test) = test_data
    fig = plt.figure(figsize=(16,8))
    correct = 0
    total = 0
    rSym = ''
    
    for i in range(40):
        plt.subplot(4, 10, i + 1)
        plt.xticks([])
        plt.yticks([])
        idx = int(random.uniform(0, x_test.shape[0]))
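        # `predict_classes` was removed in TensorFlow 2.6; take the argmax of `predict` instead.
        result = np.argmax(model.predict(x_test[idx:idx+1]), axis=-1)[0]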
        if y_test is not None:
            rCorrect = True if np.argmax(y_test[idx]) == result else False
            rSym = '✔' if rCorrect else '✘'
            correct += 1 if rCorrect else 0
        total += 1
        plt.title("%s %s" % (rSym, result))
        plt.imshow(x_test[idx][:,:,0], cmap=plt.get_cmap('gray'))
    plt.show()
    
    if y_test is not None:
        print("% 3.2f%% correct (%s/%s)" % (100.0 * float(correct) / float(total), correct, total))

In [None]:
def prediction_proba_plot(model, test_data):
    (x_test, y_test) = test_data
    fig = plt.figure(figsize=(15,15))
    
    for i in range(10):
        plt.subplot(10, 2, (2*i) + 1)
        plt.xticks([])
        plt.yticks([])
        idx = int(random.uniform(0, x_test.shape[0]))
        # `predict_proba` was removed in TensorFlow 2.6; a softmax model's `predict` returns probabilities.
        result = model.predict(x_test[idx:idx+1])[0] * 100 # prob -> percent
        if y_test is not None:
            plt.title("%s" % np.argmax(y_test[idx]))
        plt.xlabel("#%s" % idx)
        plt.imshow(x_test[idx][:,:,0], cmap=plt.get_cmap('gray'))
        
        ax = plt.subplot(10, 2, (2*i) + 2)
        plt.bar(np.arange(len(result)), result, label='%')
        plt.xticks(range(len(result)))  # one tick per class, labelled 0..9
        plt.title("classifier probabilities")

        plt.tight_layout()
    plt.show()

### Create Your Own CNN Classifier Model

Create a basic CNN (Convolutional Neural Network) based classifier from scratch.

Try to create your own deep learning model to classify the MNIST data. Refer to the prerequisite tutorial and use the [CNN Classifier Model](keras-transfer-learning-tutorial.ipynb#CNN-Classifier-Model) code as a template.  Here are a few ideas to try.

1. Add more convolutional layers.
2. Add more filters to each convolutional layer.
3. Try different activation layers.
4. Try using a different optimizer.
5. Try tuning the hyper-parameters of your chosen optimizer.
6. Train the model for more epochs (but don't overfit!)

* https://keras.io/examples/vision/mnist_convnet/
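
If you need a starting point, the sketch below follows the general shape of the linked Keras MNIST example: two convolution and pooling stages, dropout, and a softmax output. It is one possible baseline, not the intended solution; the function names are hypothetical, the layer sizes are assumptions to experiment with, and it uses Adam without the `decay` argument from the template below.

```python
from tensorflow import keras
from tensorflow.keras import layers

def create_example_model(input_shape=(28, 28, 1), num_classes=10):
    """Small CNN: two conv/pool stages followed by a softmax classifier."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

def create_example_compiled_model(learning_rate=1e-3):
    """Compile with categorical cross-entropy to match one-hot labels."""
    model = create_example_model()
    model.compile(loss="categorical_crossentropy",
                  optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  metrics=["accuracy"])
    return model

example_model = create_example_compiled_model()
example_model.summary()
```

Tracking the `accuracy` metric matters here because the plotting helper above reads the `accuracy` and `val_accuracy` entries from the training history.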

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Activation, Dropout, Conv2D, MaxPooling2D
from tensorflow.keras.optimizers import RMSprop, Adam

# TODO: Tune the following hyper-parameters as needed

batch_size = 128
epochs = 12
learning_rate = 1e-3
decay = 1e-6


def create_my_model():
    model = Sequential()

    # TODO: Add your model creation code here...
    
    
    return model

def create_my_compiled_model(learning_rate=1e-4, decay=1e-6):
    model = create_my_model()

    # TODO: Add your `model.compile` code here...
   

    return model

In [None]:
model = create_my_compiled_model(learning_rate=learning_rate, decay=decay)
model.summary()

In [None]:
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_data=(x_test, y_test),
                    shuffle=True)

In [None]:
history_plot(history)

In [None]:
# Score trained model.
scores = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])

In [None]:
prediction_plot(model, (x_test, y_test))

In [None]:
prediction_proba_plot(model, (x_test, y_test))

## Kaggle Digit Recognizer Competition

Kaggle hosts an MNIST-based [Digit Recognizer competition](https://www.kaggle.com/c/digit-recognizer), complete with a [rolling leaderboard](https://www.kaggle.com/c/digit-recognizer/leaderboard). We will enter this competition using our MNIST image classification network above, and compare its performance.

### Create a Kaggle Account

#### 1. Register for an account

In order to download Kaggle competition data you will first need to create a [Kaggle](https://www.kaggle.com/) account.

#### 2. Create an API key

Once you have registered for a Kaggle account you will need to create [API credentials](https://github.com/Kaggle/kaggle-api#api-credentials) in order to be able to use the `kaggle` CLI to download data.

* Go to the `Account` tab of your user profile, 
* and click `Create New API Token` from the API section.  

This generates a `kaggle.json` file (with 'username' and 'key' values) to download.
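
Rather than copying the values by hand, the two fields can be read from the downloaded file and exported as the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The helper below is a hypothetical sketch; the file location is an assumption, so adjust the path to wherever you saved or uploaded `kaggle.json`.

```python
import json
import os
import pathlib

def load_kaggle_credentials(json_path):
    """Read 'username' and 'key' from kaggle.json and export them as env vars."""
    creds = json.loads(pathlib.Path(json_path).read_text())
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]

# Example (assumed path, uncomment after placing the file there):
# load_kaggle_credentials("kaggle.json")
```

Shell cells started from the notebook should inherit these variables from the kernel process, so the key does not need to be pasted into a cell.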

#### Download MNIST Data

If you are using Binder to run this notebook, then the data is already downloaded and available.  Skip to the next step.

If you are using Google Colab to run this notebook, then you will need to download the data before proceeding.

##### Download MNIST from Kaggle

**Note:** Before attempting to download the competition data you will need to login to your [Kaggle](https://www.kaggle.com) account and accept the rules for this competition.

Set your Kaggle username and API key (from the `kaggle.json` file) into the cell below, and execute the code to download the Kaggle [Digit Recognizer: Learn computer vision with the famous MNIST data](https://www.kaggle.com/c/digit-recognizer) competition data. 

In [None]:
%%bash
# NOTE: Replace YOUR_USERNAME and YOUR_API_KEY with actual credentials 
export KAGGLE_USERNAME="YOUR_USERNAME"
export KAGGLE_KEY="YOUR_API_KEY"
kaggle competitions download -c digit-recognizer -p ../datasets/mnist/kaggle

In [None]:
%%bash
for archive in ../datasets/mnist/kaggle/*.zip ; do
  unzip -n "${archive}" -d ../datasets/mnist/kaggle
done
find ../datasets/mnist/kaggle -name "*.csv"

#### (Alternative) Download MNIST from GitHub

If you are running this notebook using Google Colab, but did *not* create a Kaggle account and API key, then download the data from our GitHub repository by running the code in the following cells.

In [None]:
import pathlib
import requests

def fetch_mnist_data():
    RAW_URL = "https://github.com/holstgr-kaust/keras-tutorials/raw/master/datasets/mnist"
    DEST_DIR = pathlib.Path('../datasets/mnist')

    DEST_DIR.mkdir(parents=True, exist_ok=True)
    for n in ["mnist.npz", "kaggle/train.csv", "kaggle/test.csv", "kaggle/sample_submission.csv"]:
        path = DEST_DIR / n
        path.parent.mkdir(exist_ok=True)
        if not path.is_file():  # Don't download if file exists
            response = requests.get(RAW_URL + "/" + n)
            response.raise_for_status()  # Fail before writing, so a bad download is not cached
            with path.open(mode = 'wb') as f:
                f.write(response.content)

In [None]:
fetch_mnist_data()
cache_mnist_data()

### Submitting to Kaggle

Submit your model's predictions to Kaggle, using your previously created Kaggle account, and then see how well your results compare to those of your peers.

The Kaggle MNIST dataset is provided in CSV (Comma Separated Values) format; a common textual format for data exchange.  Each line in a CSV file corresponds to a row in the table.  The first line may represent the column labels.

Because the files are text, we can view their content with simple shell commands like `cat`, `less`, `head`, or `tail`.  The `head` command shows the first `-n` lines of the file.  We will view the first two lines of each file in the following cells:

In [None]:
!head -n 2 ../datasets/mnist/kaggle/train.csv

In [None]:
!head -n 2 ../datasets/mnist/kaggle/test.csv

Note that the first line contains the column labels, and that the first column label of the `train.csv` dataset is `label`.  Notice that the `test.csv` file does not have a `label` column.  Finally, note that the second line, which holds the data for a single image, consists mostly of zeros and other small non-negative integer values.

[pandas](https://pandas.pydata.org/) is a popular Python library for data analysis.  It has many useful features, including efficiently loading CSV data, converting them to NumPy arrays, and pretty-printing large tables in Jupyter notebooks.  Let's take a quick look at our data with `pandas`.

In [None]:
pd.read_csv("../datasets/mnist/kaggle/train.csv", delimiter=',')

In [None]:
pd.read_csv("../datasets/mnist/kaggle/test.csv", delimiter=',')

With `pandas` it is easy to determine the dimensions of the two files (shown below each table), which will be useful information when specifying which columns are part of the image.
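
The same idea can be tried on a tiny in-memory CSV before touching the real files. The sketch below uses made-up rows with only four pixel columns (a 2x2 "image") instead of 784, laid out like `train.csv` with the `label` column first, and shows how the table dimensions tell you which columns hold pixel data.

```python
import io
import pandas as pd

# Made-up CSV text: a header line, then one row per image.
csv_text = "label,pixel0,pixel1,pixel2,pixel3\n7,0,128,255,0\n2,0,0,64,0\n"

frame = pd.read_csv(io.StringIO(csv_text))
labels = frame["label"].to_numpy()
pixels = frame.drop(columns=["label"]).to_numpy(dtype="uint8")

print("table:", frame.shape, "labels:", labels.shape, "pixels:", pixels.shape)
```

With the real training file, the number of columns minus one gives the number of pixels per image.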

#### Load Data

Load the Kaggle version of MNIST CSV format data via pandas API and convert into a `numpy` array.  

**Note:** There are no 'test labels' for this dataset, because the test accuracy will be evaluated by Kaggle, not by us.  This will have implications for how we validate training.
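
One common way to deal with unlabelled test data is to hold back part of the labelled training set for validation. The following is a minimal sketch of such a split on synthetic arrays; `holdout_split`, the 10% fraction, and the seed are illustrative assumptions, not the approach this notebook necessarily takes.

```python
import numpy as np

def holdout_split(features, labels, validation_fraction=0.1, seed=42):
    """Shuffle once, then split into (train, validation) pairs."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(features.shape[0])
    n_val = int(round(features.shape[0] * validation_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return ((features[train_idx], labels[train_idx]),
            (features[val_idx], labels[val_idx]))

# Synthetic stand-ins: 100 "images" of one value each, with matching labels.
toy_features = np.arange(100).reshape(100, 1)
toy_labels = np.arange(100)

(train_x, train_y), (val_x, val_y) = holdout_split(toy_features, toy_labels)
print("train:", train_x.shape, "validation:", val_x.shape)
```

Keras offers a similar shortcut through the `validation_split` argument of `model.fit`, which sets aside the last fraction of the provided arrays before any shuffling.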

In [None]:
_train_features = pd.read_csv("../datasets/mnist/kaggle/train.csv", 
                              delimiter=',', skiprows=0, usecols=range(1,785), dtype=np.uint8).to_numpy()
_train_labels = pd.read_csv("../datasets/mnist/kaggle/train.csv",
                            delimiter = ',', skiprows=0, usecols=[0], dtype=np.uint8).to_numpy()
_test_features = pd.read_csv("../datasets/mnist/kaggle/test.csv",
                             delimiter=',', skiprows=0, dtype=np.uint8).to_numpy()

#### Explore Data

Explore data types, shape, and value ranges.  Ensure they make sense, and you understand the data well.

Is the Kaggle version of MNIST the same as the Keras version, and are the test examples labelled?

In [None]:
print('_train_features type:', type(_train_features), '_test_features type:', type(_test_features))
print('_train_features dtype:', _train_features.dtype, '_test_features dtype:', _test_features.dtype)
print('_train_features shape:', _train_features.shape, '_test_features shape:', _test_features.shape)
print(_train_features.shape[0], '_train_features samples')
print(_test_features.shape[0], '_test_features samples')

In [None]:
print('_train_features (min, max, mean): (%s, %s, %s)' % (_train_features.min(),
                                                          _train_features.max(),
                                                          _train_features.mean()))
print('_test_features (min, max, mean): (%s, %s, %s)' % (_test_features.min(),
                                                         _test_features.max(),
                                                         _test_features.mean()))
print('_train_labels (min, max): (%s, %s)' % (_train_labels.min(),
                                              _train_labels.max()))

In [None]:
# Show array of random labelled images with matplotlib (re-run cell to see new examples)
imageset_plot((_train_features, _train_labels))

In [None]:
histogram_plot((_train_features, _train_labels))

The Kaggle MNIST dataset looks very similar to the original dataset we used previously – similar min, max, mean, and appearance – however, there are fewer samples in the dataset (and the test dataset has no labels), and the image data is flattened into a 1-D array of size 784 (we will need to reshape that into a 2-D array to use convolution).
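
To illustrate the reshape mentioned above, here is a minimal sketch on a hypothetical stand-in array (not the real dataset).  NumPy reshapes in row-major order, so each run of 28 consecutive values becomes one image row.

```python
import numpy as np

# Hypothetical stand-in for the Kaggle features: 3 flattened 28x28 images
flat = (np.arange(3 * 784) % 256).astype(np.uint8).reshape(3, 784)

# Reshape to (samples, height, width, channels); -1 infers the sample count
images = flat.reshape((-1, 28, 28, 1))
print(flat.shape, '->', images.shape)

# Flat index 28 is the first pixel of the second image row
print(flat[0, 28] == images[0, 1, 0, 0])
```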

#### Data Conversion

The data type for the training data is `uint8`, while the input type for the network will be `float32`, so the data must be converted.  Also, the pixel values should be normalized, and the labels need to be categorical (one-hot encoded).
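
As a rough illustration of what these conversions do, the sketch below applies them to small hypothetical arrays using plain NumPy.  It is not the code the exercise uses; the arrays `labels` and `pixels` are made up for the example.

```python
import numpy as np

# Hypothetical labels with shape (N, 1), like the label column loaded from CSV
labels = np.array([[3], [0], [9]], dtype=np.uint8)
num_classes = 10

# One-hot encoding: row i holds a single 1.0 at column labels[i]
one_hot = np.eye(num_classes, dtype=np.float32)[labels.ravel()]

# Normalization: convert uint8 pixels in [0, 255] to float32 values in [0, 1]
pixels = np.array([0, 128, 255], dtype=np.uint8)
scaled = pixels.astype(np.float32) / 255.0

print(one_hot.shape, scaled.dtype, scaled.min(), scaled.max())
```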

In [None]:
num_classes = (_train_labels.max() - _train_labels.min()) + 1
print('num_classes =', num_classes)

_train_labels = keras.utils.to_categorical(_train_labels, num_classes)

In [None]:
print("shape:", _train_features.shape, _test_features.shape)
_train_features = _train_features.reshape((_train_features.shape[0], 28, 28, 1))
_test_features = _test_features.reshape((_test_features.shape[0], 28, 28, 1))
print("reshape:", _train_features.shape, _test_features.shape)

In [None]:
_train_features = _train_features.astype('float32')
_test_features = _test_features.astype('float32')
_train_features /= 255
_test_features /= 255

train_data = (_train_features, _train_labels)
test_data = (_test_features, None )

In [None]:
print('_train_features type:', type(_train_features))
print('_train_features dtype:', _train_features.dtype)
print('_train_features shape:', _train_features.shape)
print('_test_features shape:', _test_features.shape)

print('_train_labels type:', type(_train_labels))
print('_train_labels dtype:', _train_labels.dtype)
print('_train_labels shape:', _train_labels.shape)
print('_test_labels:', None)

### Train Model

We already have an MNIST classifier `model` trained on a larger MNIST dataset.  Could we use that model for the Kaggle MNIST dataset?

#### Evaluate previous MNIST model

In [None]:
# NOTE: Trained model score is evaluated on training data, not a withheld testing dataset.
#       The testing dataset for Kaggle MNIST is hidden, so only the training set is labelled;
#       as a result, these scores may overestimate accuracy on the hidden test set.

scores = model.evaluate(_train_features, _train_labels, verbose=0)
print('Train loss:', scores[0])
print('Train accuracy:', scores[1])

In [None]:
prediction_classes_plot(model, train_data)

In [None]:
prediction_classes_plot(model, test_data)

#### Train and Evaluate Kaggle MNIST model

The previous model should perform well.  We can either skip ahead to the submission step, or try to train our model on the Kaggle MNIST version of the dataset.

In [None]:
# TODO: Modify the following hyper-parameters if you want to change them from previous values

#batch_size = 128
#epochs = 12
#learning_rate = 1e-3
#decay = 1e-6

# Create a new compiled model from the function you completed previously
model_kaggle = create_my_compiled_model(learning_rate=learning_rate, decay=decay)

# Train this model on the Kaggle MNIST training data; 
# there is no test data; use `validation_split` to withhold part of the training data for validation
history = model_kaggle.fit(_train_features, _train_labels,
                           batch_size=batch_size,
                           epochs=epochs,
                           validation_split=0.1,
                           shuffle=True)
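
The Keras documentation states that `validation_split` holds out the last fraction of the samples, selected before any shuffling.  The sketch below mimics that behaviour with plain NumPy on hypothetical arrays; `holdout_split` is an illustrative helper, not part of Keras.

```python
import numpy as np

def holdout_split(features, labels, validation_split=0.1):
    """Split off the trailing fraction of the samples for validation."""
    split_at = int(len(features) * (1.0 - validation_split))
    train = (features[:split_at], labels[:split_at])
    val = (features[split_at:], labels[split_at:])
    return train, val

# Hypothetical data: 10 samples with 2 features each
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

(train_x, train_y), (val_x, val_y) = holdout_split(x, y, validation_split=0.2)
print(train_x.shape, val_x.shape, val_y)
```

Because the held-out samples come from the end of the arrays, data sorted by class would need shuffling before the split.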

In [None]:
history_plot(history)

In [None]:
# NOTE: Trained model score is evaluated on training data, not a withheld testing dataset.
#       The testing dataset for Kaggle MNIST is hidden, so only the training set is labelled;
#       as a result, these scores likely overestimate accuracy on the hidden test set.

scores = model_kaggle.evaluate(_train_features, _train_labels, verbose=0)
print('Train loss:', scores[0])
print('Train accuracy:', scores[1])

In [None]:
prediction_classes_plot(model_kaggle, train_data)

In [None]:
prediction_classes_plot(model_kaggle, test_data)

### Generate Predictions

In [None]:
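# `predict_classes` is deprecated and absent from recent TensorFlow releases;
# taking the argmax of the predicted class scores yields the same labels.
predictions = np.argmax(model.predict(_test_features), axis=-1)
predictions_kaggle = np.argmax(model_kaggle.predict(_test_features), axis=-1)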
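
The class label for each test image is the index of the largest value in the corresponding row of the model's output.  A minimal sketch with made-up probabilities (the `probs` array is hypothetical, not real model output):

```python
import numpy as np

# Hypothetical softmax outputs for 3 images over 10 digit classes
probs = np.full((3, 10), 0.01, dtype=np.float32)
probs[0, 7] = 0.91
probs[1, 2] = 0.91
probs[2, 0] = 0.91

# The index of the highest score along the class axis is the predicted digit
classes = np.argmax(probs, axis=-1)
print(classes)
```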

In [None]:
# Indices of the test images where the two models predict different classes
differs_idxs = np.flatnonzero(predictions != predictions_kaggle)
print('differences found:', differs_idxs.size, 'showing random: 10')
title_format = "test index:%s model predicts:%s kaggle model predicts:%s"
# Skip plotting when the models agree everywhere (sampling from an empty set would fail)
if differs_idxs.size > 0:
    # Sample 10 disagreements at random (with replacement)
    for i in differs_idxs[np.random.randint(differs_idxs.size, size=10)]:
        plt.xticks([])
        plt.yticks([])
        plt.title(title_format % (i, predictions[i], predictions_kaggle[i]))
        plt.imshow(to_image(_test_features[i]), cmap=plt.get_cmap('gray'))
        plt.show()

### Export Predictions

In [None]:
pd.read_csv("../datasets/mnist/kaggle/sample_submission.csv", delimiter=',')

In [None]:
import time
import pathlib

SUBMISSION_DIR = pathlib.Path('../results/kaggle-submissions')
SUBMISSION_DIR.mkdir(parents=True, exist_ok=True)

timestamp = time.strftime("%Y%m%d-%H%M%S")

SUBMISSION_FILE = SUBMISSION_DIR / f"submission-{timestamp}.csv"

number_predictions = predictions_kaggle.shape[0]

df = pd.DataFrame({"ImageId": range(1, number_predictions + 1), "Label": predictions_kaggle})
df.to_csv(str(SUBMISSION_FILE), index=False)

pd.read_csv(str(SUBMISSION_FILE), delimiter=',')
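
Before uploading, it can help to sanity-check the submission table.  The sketch below assumes the `ImageId`/`Label` layout built in the cell above and digit labels in the range 0 to 9.  `check_submission` is a hypothetical helper, and the predictions are made up.

```python
import numpy as np
import pandas as pd

def check_submission(df, expected_rows, num_classes=10):
    """Return True if `df` has the expected ImageId/Label layout."""
    return (
        list(df.columns) == ["ImageId", "Label"]
        and len(df) == expected_rows
        and df["ImageId"].tolist() == list(range(1, expected_rows + 1))
        and bool(df["Label"].between(0, num_classes - 1).all())
    )

# Hypothetical predictions standing in for the real model output
preds = np.array([2, 0, 9, 4])
submission = pd.DataFrame({"ImageId": range(1, len(preds) + 1), "Label": preds})
print(check_submission(submission, expected_rows=len(preds)))
```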

### Submit to Kaggle

Once you have successfully submitted your predictions, you can check the [Digit-Recognizer competition](https://www.kaggle.com/c/digit-recognizer) website to see how your best model compares with those of your peers.

In [None]:
%%bash
export KAGGLE_USERNAME="YOUR_USERNAME"
export KAGGLE_KEY="YOUR_API_KEY"
# Submits the latest `submission-*.csv` file
kaggle competitions submit -c digit-recognizer \
  -f "$(ls ../results/kaggle-submissions/submission-*.csv | tail -n 1)" \
  -m "My digit recognizer submission!"

#### Check Results

Check out your results on the Kaggle competition leaderboards for `digit-recognizer`:

* [Competitions](https://www.kaggle.com/competitions).
* [Leaderboard](https://www.kaggle.com/c/digit-recognizer/leaderboard).

Your entry should be highlighted if you're logged in.

### Congratulations!

You are now ready to explore deeper into Keras Image Classification.  Many happy trainings!

## Mentor Bios

Glendon Holst is a Staff Scientist in the Visualization Core Lab at KAUST (King Abdullah University of Science and Technology) specializing in HPC workflow solutions for deep learning, image processing, and scientific visualization.

David R. Pugh is a Staff Scientist in the Visualization Core Lab at KAUST (King Abdullah University of Science and Technology) specializing in Data Science and Machine Learning. David is also a certified Software and Data Carpentry Instructor and Instructor Trainer and is the lead instructor of the Introduction to Data Science Workshop series at KAUST.

## References

* https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/
* http://yann.lecun.com/exdb/mnist/index.html
* https://www.kaggle.com/c/digit-recognizer
* https://jupyter-notebook.readthedocs.io/en/stable/
* https://github.com/kaust-vislab/handson-ml2