# Analyzing Fashion MNIST Data

## About the Data
Fashion-MNIST (https://github.com/zalandoresearch/fashion-mnist) is a relatively new dataset of Zalando's article images. There are 60,000 training samples and 10,000 test samples in the dataset. Each sample is a 28x28 grayscale image associated with one of ten clothing labels. The images look like this:

![FashionMNIST](https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png)

The ten clothing labels are:
0. T-Shirt/top
1. Trouser
2. Pullover
3. Dress
4. Coat
5. Sandal
6. Shirt
7. Sneaker
8. Bag
9. Ankle Boot

(The labels are zero-indexed in the dataset, so T-Shirt/top is labeled '0' and Ankle Boot is labeled '9'.)


#### Why Fashion MNIST instead of the MNIST dataset?
This might sound a bit strange because most data scientists tend to use the original MNIST dataset, which contains handwritten digit samples. However, I wanted to go down this route for one key reason: MNIST is too simple. The original MNIST dataset is relatively easy to predict nowadays, given the advancement of many machine learning and neural network models.

The Fashion-MNIST team ran a study in which they created many machine learning models, fed both the MNIST dataset and the Fashion-MNIST dataset into those models, and compared accuracy (http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/). Even though this study was created by the team that created Fashion-MNIST, I don't have a reason to doubt their experiment. Generally speaking, almost all of the Fashion-MNIST scores were lower than the MNIST scores, which supports the theory that MNIST has become too easy to predict.

To build on that rationale, there's also a script on GitHub Gist where someone was able to tell MNIST digits apart based on just one pixel, which (I'd hope) many machine learning models would quickly pick up on (https://gist.github.com/dgrtwo/aaef94ecc6a60cd50322c0054cc04478).

Ultimately, I didn't want a dataset which would tell me 'stick to a machine learning model' again. I wanted a dataset that was just complex enough so that it's at least conceivable that I'd need a neural network model. I also wanted a dataset that was large enough so that I could definitely conclude whether I was overtraining my model or not.


----
# The Goal
I wanted to determine which model could best model the Fashion-MNIST dataset. I have assistance here because the Zalando Research team has already tested many machine learning models on this dataset, so now all I'd have to do was try various neural network approaches. 

----
# Machine Learning Models
As specified earlier, the Zalando Research Team already tested many machine learning models on this dataset and the top 8 models were:

![MachineLearning](Jupyter/MachineLearning.jpg)

My takeaways from this were:
- SVC is generally the best performer, but it takes a long time to train, at 1 hour minimum.
- GradientBoost isn't worth the runtime pains.
- RandomForest has a lot of promise with a small training time and reasonable accuracy, but it still peaks out at 0.879 accuracy. I suspect I can do better.

----
# Technical Setup

For this project, I used
- Anaconda 5.0.0 which uses Python 3.6.3
  - TensorFlow 1.1.0
  - Keras 2.0.8
  - Theano 0.9.0 (Do not assume I'm using Theano unless otherwise specified)
- iMac running macOS High Sierra with a:
  - 3.8GHz quad‑core Intel Core i5
  - [When a GPU was required] EVGA GeForce GTX 1050 2 GBs

`Keras` is a high-level Python package which lets me build neural networks and use either `TensorFlow`, `Theano`, or `Microsoft's CNTK` as the computation engine. It gives me the opportunity to test my model against all those computation packages without rewriting my model for each backend.

I began this project using Theano 0.9.0, and unfortunately, midway through, it was announced that Theano would be deprecated. At that point, I switched to TensorFlow. I've redone most of my studies with TensorFlow, but certain tests will remain on Theano. I will note this when it happens.

----
# Non-Model Specific Code
I first created a function to handle the **input arguments** I might pass into my script.

In [None]:
from argparse import ArgumentParser
import os

import numpy as np

import models
import params
import plot

np.set_printoptions(precision=2)

def parse_args(inargs=None):
    """ Parses input arguments """
    parser = ArgumentParser("./loader.py")
    standard_path = os.path.dirname(os.path.realpath(__file__))

    iargs = parser.add_argument_group('Input Files/Data')
    iargs.add_argument('--csv_file',
                       default=os.path.join(standard_path, 'data.csv'),
                       help='Path to CSV File')
    iargs.add_argument('--model', default='cnn',
                       help='Select: cnn (default), rnn, neural')

    oargs = parser.add_argument_group('Output Files/Data')
    oargs.add_argument('--out',
                       default=os.path.join(standard_path, 'Run'),
                       help='Path to save output files')

    if not inargs:
        args = parser.parse_args()
    else:
        args = parser.parse_args(inargs)
    return args



And I had to create a function to **re-shape my data** to the appropriate shape. I'm using the `channels_first` setting in Keras, which means the color-channel dimension comes right after the sample dimension and before the two pixel dimensions. I therefore reshape the array to four dimensions: (Quantity of Pictures, Quantity of Colors (just 1 since this is grayscale), Pixels Width, Pixels Height).

In [None]:
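def flatten_data(args, x_train, x_test, y_train, y_test):
    """ Scales pixel values, reshapes the images for Keras, and one-hot encodes the labels
    """
    # Scales the 8-bit pixel values to the range [0, 1]
    x_train = x_train.astype('float32') / 255
    x_test = x_test.astype('float32') / 255

    # Adds the channel dimension: (samples, channels, 28, 28)
    if args.model != 'rnn':
        x_train = x_train.reshape(x_train.shape[0], 1, 28, 28)
        x_test = x_test.reshape(x_test.shape[0], 1, 28, 28)

    # One-hot encodes the labels. Fashion-MNIST has only 10 classes; the 28
    # columns used here match the Dense(28) output layer of the models
    y_train = np_utils.to_categorical(y_train, 28)
    y_test = np_utils.to_categorical(y_test, 28)
    return x_train, y_train, x_test, y_test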
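
To make the two preprocessing steps concrete, here is a minimal sketch in plain Python (so it runs without Keras or NumPy). The helper names `scale_pixels` and `to_one_hot` are my own for illustration and are not part of the project; `to_one_hot` only mimics what `np_utils.to_categorical` produces for integer labels.

```python
def scale_pixels(pixels):
    """Maps 8-bit pixel values (0-255) to floats in [0, 1]."""
    return [p / 255 for p in pixels]


def to_one_hot(labels, num_classes):
    """Turns integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # only the column of the true class is set
        rows.append(row)
    return rows


# Hypothetical mini example: three pixel values and two labels
print(scale_pixels([0, 51, 255]))
print(to_one_hot([0, 9], num_classes=10))
```

Each label becomes a row with a single 1.0, which is the format the `categorical_crossentropy` loss expects.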

I intend to run my script so that it can loop between a variety of options for a specific parameter. Because of that, I'd like it to save the:
* Confusion Matrix for each option it tests
* Some plot to compare each option it tests

I created two **plotting functions** to do that.

_Note: When code is run in the Jupyter Notebook, it will NOT use the function below; rather, it'll use plot.py in the original directory where this Notebook resides. That file is replicated below for completeness._

In [None]:
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

def conf_matrix(y_test, y_test_predict, classes, title='Confusion Matrix',
                out=None):
    # Converts both output arrays into just one column based on the class
    y_test_predict_class = y_test_predict.argmax(1)
    y_test_class = y_test.argmax(1)

    # Creates confusion matrix
    cm_data = confusion_matrix(y_test_class, y_test_predict_class)
    np.set_printoptions(precision=2)

    # Plots Confusion Matrix
    plt.figure()
    plt.imshow(cm_data, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.xlabel('Predicted Label')
    plt.yticks(tick_marks, classes)
    plt.ylabel('True Label')

    # Plots data on chart
    thresh = cm_data.max() / 2.
    for i, j in itertools.product(range(cm_data.shape[0]), range(cm_data.shape[1])):
        plt.text(j, i, format(cm_data[i, j], 'd'),
                 horizontalalignment="center",
                 color="white" if cm_data[i, j] > thresh else "black")

    plt.tight_layout()

    # Saves or Shows Plot
    if out:
        plt.savefig(out)
    else:
        plt.show()


def dict_trends(data, xlabel='Variable', out=None):
    """ Plots a dictionary's worth of trends """
    data_df = pd.DataFrame.from_dict(data, orient='index')
    ax = data_df.plot()

    # Sets Axes
    ax.set_xlabel(xlabel)
    ax.set_ylabel('Score')
    ax.set_title('Modifying {}'.format(xlabel))

    # Saves or Shows Plot
    if out:
        plt.savefig(out)
    else:
        plt.show()  
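
To show what the confusion matrix actually counts, here is a small sketch in plain Python. It is not the scikit-learn implementation; the helper names `argmax` and `confusion_counts` and the 3-class data are made up for illustration. It follows the same convention as above: rows are true classes, columns are predicted classes.

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=lambda i: row[i])


def confusion_counts(y_true, y_pred, num_classes):
    """Builds a num_classes x num_classes matrix of counts.

    Rows are true classes and columns are predicted classes.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_row, pred_row in zip(y_true, y_pred):
        matrix[argmax(true_row)][argmax(pred_row)] += 1
    return matrix


# Hypothetical 3-class example: one-hot truths and predicted probabilities
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_pred = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.6, 0.3], [0.1, 0.8, 0.1]]
print(confusion_counts(y_true, y_pred, num_classes=3))
```

Correct predictions land on the diagonal, so the third sample (true class 2, predicted class 1) shows up off the diagonal.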

As my script loops between different parameters, I want it to always have a default parameter value so that if I don't specify anything, it'll use the proper default setting. To do that, I created a **Parameters Configuration File** as params.py, which is effectively a dictionary.

_The parameters file shown below is the final version once all models have completed. I modified this file step-by-step as I went through the various experiments._

In [None]:
def standard():
    params = {}

    # Build Parameters
    params['conv_filters'] = 20  # Number of filters in the convolutional layer
    params['kernel_size'] = 4  # Size of the kernel sliding across the image
    params['kernel_stride'] = 1  # How many pixels the kernel moves per step
    params['optimizer'] = 'adam'  # Optimizer used to update the weights during training
    params['loss'] = 'categorical_crossentropy'  # Loss Function method

    # Fit Parameters
    params['epoch'] = 8  # How many passes are made over the training data
    params['dropout'] = 0.1  # The fraction of inputs randomly deactivated at each training step
    params['batch_size'] = 128  # The number of images fed into each training step

    # Dense Activation
    params['dense_1'] = 120  # Number of units in the first Dense layer
    params['activate_1'] = 'relu'  # Activation Function used in the Neural Dense Layers
    return params

As my script tests different models, I need to run the test data through each model to see whether the predictions are accurate. Because fitting works the same way across all models, I chose to make a unified fit function, `fit_model`, in models.py. The cell below shows the basic model builder, `basic_neural`, which lives in the same file.

_Note: Like all other things, as I use this function, it'll use models.py rather than the version seen below_

In [None]:
def basic_neural(model_params, shape):
    """ Builds basic neural network model """
    from keras.layers import Dense, Flatten, InputLayer
    from keras.layers.normalization import BatchNormalization
    from keras.models import Sequential

    model = Sequential()

    model.add(InputLayer(input_shape=(shape[1], shape[2], shape[3])))
    model.add(BatchNormalization())

    model.add(Flatten())

    model.add(Dense(model_params['dense_1'], activation=model_params['activate_1']))
    model.add(Dense(28, activation='softmax'))

    model.compile(loss=model_params['loss'],
                  optimizer=model_params['optimizer'],
                  metrics=['accuracy'])

    print(model.summary())
    return model
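
The unified fit function itself is not reproduced above. As a rough idea of what such a function could look like, here is a hypothetical sketch that assumes only the call signature and return values used by the main function (`y_pred` plus a dictionary with `'loss'` and `'acc'`). It relies on the standard Keras model methods `fit`, `evaluate`, and `predict`; the `StubModel` class is a made-up stand-in so the sketch runs without Keras, and its numbers are placeholders, not results.

```python
def fit_model(model, model_params, x_train, y_train, x_test, y_test):
    """Hypothetical sketch: trains a compiled Keras-style model and scores it.

    Returns the test-set predictions and a dict holding 'loss' and 'acc'.
    """
    model.fit(x_train, y_train,
              batch_size=model_params['batch_size'],
              epochs=model_params['epoch'],
              verbose=0)
    # With metrics=['accuracy'], evaluate() returns [loss, accuracy]
    loss, acc = model.evaluate(x_test, y_test, verbose=0)
    y_pred = model.predict(x_test)
    return y_pred, {'loss': loss, 'acc': acc}


class StubModel:
    """Stand-in with the same three methods, so the sketch runs without Keras."""

    def fit(self, x, y, batch_size=32, epochs=1, verbose=0):
        self.epochs_run = epochs

    def evaluate(self, x, y, verbose=0):
        return [0.5, 0.8]  # made-up placeholder numbers, not real results

    def predict(self, x):
        return [[0.1, 0.9] for _ in x]


y_pred, metrics = fit_model(StubModel(), {'batch_size': 128, 'epoch': 8},
                            [[0.0]], [[0, 1]], [[0.0]], [[0, 1]])
print(metrics)
```

Keeping this logic in one place means every model is trained and scored in exactly the same way, which is what makes the later comparisons fair.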

Finally, I created a **Main Function** which connects all of the aforementioned functions as well as the model functions still to come. It runs each model three times with each variable permutation option, averages the results from those runs, and stores them in a dictionary for comparisons later.

_Note: This function shows the final state of the Main Function after all expansion. I'll talk about specific additions to this function in the Neural Network sections below._

In [None]:
from keras.datasets import fashion_mnist
from keras.utils import np_utils
import pandas as pd
import os

def main(args):
    # Loads the Fashion-MNIST dataset through Keras
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    x_train, y_train, x_test, y_test = flatten_data(args, x_train, x_test, y_train, y_test)

    # Creates output directory
    if not os.path.isdir(args.out):
        os.makedirs(args.out)

    # Put code here to loop between various permutations.
    # The code here would loop between activation equations
    change = 'activation'
    range = ['softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'selu',
             'elu', 'linear']
    history_dict = {x: {'loss': 0.0, 'acc': 0.0} for x in range}

    # Runs Model
    for new in range:
        history_dict[new] = {'loss': [], 'acc': []}
        for loop in [1, 2, 3]:
            print('Creating Model with the {} {}'.format(new, change))
            model_params = params.standard()

            if args.model == 'rnn':
                model = models.basic_rnn(model_params, x_train.shape)
            elif args.model == 'neural':
                model = models.basic_neural(model_params, x_train.shape)
            else:
                model = models.double_cnn(model_params, x_train.shape)

            y_pred, metrics = models.fit_model(model, model_params, 
                                               x_train, y_train, x_test, y_test)

            # Adds Data to Trends
            history_dict[new]['loss'].append(metrics['loss'])
            history_dict[new]['acc'].append(metrics['acc'])

        # Calculates Average
        history_dict[new]['loss'] = np.mean(history_dict[new]['loss'])
        history_dict[new]['acc'] = np.mean(history_dict[new]['acc'])

        # Plots Confusion Matrix
        classes = {0: 'T-Shirt/top',
                   1: 'Trouser',
                   2: 'Pullover',
                   3: 'Dress',
                   4: 'Coat',
                   5: 'Sandal',
                   6: 'Shirt',
                   7: 'Sneaker',
                   8: 'Bag',
                   9: 'Ankle boot'}
        class_values = list(classes.values())
        title = "{} (Loss {} & Acc {})".format(new, metrics['loss'], metrics['acc'])
        conf_png = '{}/{}_{}.png'.format(args.out, new, change)
        plot.conf_matrix(y_test, y_pred, class_values, out=conf_png, title=title)

    # Plots Accuracy & Loss Trends
    trends_png = '{}/{}.png'.format(args.out, change)
    plot.dict_trends(history_dict, xlabel=change, out=trends_png)

    return x_train, y_train, x_test, y_test, y_pred


if __name__ == "__main__":
    ARGS = parse_args()
    x_train, y_train, x_test, y_test, y_pred = main(ARGS)

----
# About Neural Networks

### Neural Network Principles
The primary Neural Network Layer in Keras is the **Dense** layer. In the simplest sense, this layer takes an input, performs a calculation on it (typically a matrix-vector multiplication plus a bias), and outputs the data in a different dimensionality. The result of that calculation is then passed through an **Activation** Function.

It's worth noting that when I first get my data, it's technically in four dimensions as: (Quantity of Image Samples, Colorscale, Width Pixels, and Height Pixels). In this case:
- Colorscale is always 1 because this is a grayscale image.
- Each image is 28x28 pixels.

I cannot immediately feed these images into the `Dense` Layer with that dimensionality. So I first need to send it through a **Flatten** layer which flattens it into a two dimension array as: (Quantity of Image Samples, Width Pixels * Height Pixels).
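As a minimal sketch of these two steps, the plain-Python example below flattens a tiny image and pushes it through one Dense layer (weighted sum plus bias per unit, then an activation). The 2x2 image, the weights, and the helper names are all made up for illustration; Keras does the same thing with matrices and learned weights.

```python
def flatten(image):
    """Turns a 2-D grid of pixels into one flat list."""
    return [pixel for row in image for pixel in row]


def relu(value):
    return max(0.0, value)


def dense(inputs, weights, biases, activation):
    """One Dense layer: weighted sum plus bias per unit, then the activation."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs


# Hypothetical 2x2 "image" and a Dense layer with two units
image = [[0.0, 0.5],
         [1.0, 0.25]]
weights = [[1.0, 1.0, 1.0, 1.0],     # unit 0 sums every pixel
           [-1.0, 0.0, 0.0, 0.0]]    # unit 1 looks only at the first pixel
biases = [0.0, -0.5]

flat = flatten(image)
print(flat)
print(dense(flat, weights, biases, relu))
```

For a real Fashion-MNIST image, the flattened list would hold 28 * 28 = 784 values, and each unit of the Dense layer would carry 784 weights.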


----
# Building a Neural Network
### Layers I'll Use
I wanted to begin by creating a basic neural network with only two Neural Layers. 
- The first layer will be a `Dense` layer and I will cycle between various activation functions to find the ideal one. (This, of course, will happen after a `Flatten` layer.) I will use the nadam optimizer initially, although I'll probably test this in a second experiment.
- The second layer will be a `Dense` layer and this will stay under the `Softmax` Activation Function.

Softmax is a normalized exponential function which assigns a probability to each possible option so that all options add to 1. Because of this, Softmax is regarded as one of the best 'final' functions to classify results.
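
Here is a short sketch of Softmax in plain Python, with made-up scores, to show how raw outputs become probabilities. Subtracting the maximum score is a common numerical-stability trick and is my own addition, not something taken from this project.

```python
import math


def softmax(scores):
    """Turns raw scores into probabilities that add up to 1."""
    # Subtracting the maximum keeps exp() from overflowing
    # and does not change the result
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score always receives the largest probability, so taking the `argmax` of the Softmax output gives the predicted class.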

Speaking of classification, there has been some research into forgoing this final fit function and instead sending this data to another machine learning model and having that do the classification. Given the success of Random Forests earlier, it's plausible that Random Forests would do a better job at classifying the data produced by the Neural Network than the Neural Network itself.



### First Experiment: Testing Activation Functions in a Neural Network
This is the code for my first Neural Network model. It only has two Dense Layers and we will loop between these **activation functions**:
* `softplus`
* `softsign`
* `relu`
* `tanh`
* `sigmoid`
* `hard_sigmoid`
* `selu`
* `elu`
* `linear`

![Activation](Jupyter/Activation.tiff)

For now, I'm using the `nadam` optimizer & `categorical_crossentropy` loss function.

**Activation functions** are needed to introduce non-linear classification techniques to our neural network. Without these, the neural network would stick to linear classification, which would then imply we should have used simpler machine learning or statistical algorithms instead, such as a linear regression/classification model.
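
For reference, here is a sketch of the standard textbook formulas behind several of the activation functions listed above, written in plain Python. These are the usual mathematical definitions, not code copied from Keras.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    return math.log(1.0 + math.exp(x))


def softsign(x):
    return x / (1.0 + abs(x))


def linear(x):
    return x


# Evaluates each function at a negative, zero, and positive input
for name, fn in [('relu', relu), ('sigmoid', sigmoid), ('tanh', math.tanh),
                 ('softplus', softplus), ('softsign', softsign),
                 ('linear', linear)]:
    print(name, [round(fn(x), 3) for x in (-2.0, 0.0, 2.0)])
```

Every function except `linear` bends its input in some way, which is what lets stacked layers model non-linear relationships.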

**Loss functions** are also important to classify the quality of the model and we'll talk more about these next.

In [None]:
def basic_neural(model_params, shape):
    """ Builds basic neural network model """
    from keras.layers import Dense, Dropout, Flatten, InputLayer
    from keras.layers.normalization import BatchNormalization
    from keras.models import Sequential

    model = Sequential()

    model.add(InputLayer(input_shape=(shape[1], shape[2], shape[3])))
    model.add(BatchNormalization())

    model.add(Dropout(model_params['dropout']))
    model.add(Flatten())

    model.add(Dense(model_params['dense_1'], activation=model_params['activate_1']))
    model.add(Dense(28, activation='softmax'))

    model.compile(loss=model_params['loss'],
                  optimizer=model_params['optimizer'],
                  metrics=['accuracy'])

    print(model.summary())
    return model

In [None]:
def main(args):
    # Loads the Fashion-MNIST dataset through Keras
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    x_train, y_train, x_test, y_test = flatten_data(args, x_train, x_test, y_train, y_test)

    # Creates output directory
    if not os.path.isdir(args.out):
        os.makedirs(args.out)

    # Creates range to loop filter between
    change = 'activation'
    range = ['softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'selu',
             'elu', 'linear']
    history_dict = {x: {'loss': 0.0, 'acc': 0.0} for x in range}

    # Runs Model
    for new in range:
        history_dict[new] = {'loss': [], 'acc': []}
        for loop in [1, 2, 3]:
            print('Creating Model with the {} {}'.format(new, change))
            model_params = params.standard()
            model_params['activate_1'] = new

            model = models.basic_neural(model_params, x_train.shape)

            y_pred, metrics = models.fit_model(model, model_params, 
                                               x_train, y_train, x_test, y_test)

            # Adds Data to Trends
            history_dict[new]['loss'].append(metrics['loss'])
            history_dict[new]['acc'].append(metrics['acc'])

        # Calculates Average
        history_dict[new]['loss'] = np.mean(history_dict[new]['loss'])
        history_dict[new]['acc'] = np.mean(history_dict[new]['acc'])

        # Plots Confusion Matrix
        classes = {0: 'T-Shirt/top',
                   1: 'Trouser',
                   2: 'Pullover',
                   3: 'Dress',
                   4: 'Coat',
                   5: 'Sandal',
                   6: 'Shirt',
                   7: 'Sneaker',
                   8: 'Bag',
                   9: 'Ankle boot'}
        class_values = list(classes.values())
        title = "{} (Loss {} & Acc {})".format(new, metrics['loss'], metrics['acc'])
        conf_png = '{}/{}_{}.png'.format(args.out, new, change)
        plot.conf_matrix(y_test, y_pred, class_values, out=conf_png, title=title)

    # Plots Accuracy & Loss Trends
    trends_png = '{}/{}.png'.format(args.out, change)
    plot.dict_trends(history_dict, xlabel=change, out=trends_png)

    return x_train, y_train, x_test, y_test, y_pred

main(ARGS)

Below are the results from the test. We are considering two metrics here: **Accuracy and Loss**.

* `Accuracy`... speaks for itself. The higher the accuracy, the better.
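* `Loss`, from a high-level point of view, measures how far the predicted probabilities are from the true labels. The lower the loss, the better. A test loss that rises as training continues is a common sign that the model is over-training to the data.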

I'm using the `categorical_crossentropy` method to compute loss. This method, unlike many of the other options, works for categorical classification problems where multiple classes are possible, such as this problem.
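
To illustrate why loss and accuracy can disagree, here is a plain-Python sketch of the categorical cross-entropy formula. The `eps` clipping value and the example predictions are assumptions made for illustration; both sets of predictions pick the right class every time, yet the hesitant one has a much higher loss.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy over samples of one-hot truths vs. probabilities."""
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        # Only the true class contributes, since the other entries are 0
        loss = -sum(t * math.log(max(p, eps))
                    for t, p in zip(true_row, pred_row))
        losses.append(loss)
    return sum(losses) / len(losses)


# Hypothetical predictions: both are 100% accurate, but differ in confidence
y_true = [[1, 0, 0], [0, 1, 0]]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

print(categorical_crossentropy(y_true, confident))
print(categorical_crossentropy(y_true, hesitant))
```

This is why two activation functions can have nearly identical accuracy while one has a notably lower loss.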

![Neural_Activation](TensorData/Neural_Activation/Activation.png)

All of the activation functions, besides Linear, appeared to perform well. While Sigmoid had the lowest loss, Relu & Softplus had the highest accuracy.

I chose **Sigmoid** because its accuracy was comparable to the other activation functions, but it had notably lower loss. The confusion matrix for this is attached below.

![Neural_Sigmoid](TensorData/Neural_Activation/sigmoid_activation.png)

For reference, the Sigmoid Function typically looks like:

![Wikipedia_Sigmoid](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/320px-Logistic-curve.svg.png)

## Second Experiment: Testing Optimizers
I now wanted to test the **optimizers** used to train the network across these options:

- `RMSprop`
- `Adagrad`
- `Adadelta`
- `Adam`
- `Adamax`
- `Nadam`

To save on space, I won't replicate the code I used to run it, but it basically just involved me swapping out the 'range' and 'change' variables from main(). Below are the results _(disregard how the plot says 'epoch' rather than 'optimizer'. That was a formatting bug which did not affect the results)_:

![Neural_Optimizer](TensorData/Neural_Optimizer/optimizer.png)

The differences between each optimizer are marginal, but **Adam** optimizer had both the lowest loss & highest accuracy. Its Confusion Matrix is attached below:

![Neural_Adagrad](TensorData/Neural_Optimizer/Adam_optimizer.png)

## Experiment Three: Epochs
An **Epoch** is a single pass of the full training data through the neural network model during the fitting process. So far, I have been using a default value of '8'. More Epochs usually increase accuracy, but they run the risk of increasing loss & runtime (both are bad).

Here are the runtime results for the epochs I tested for each pass of the model _(technically I run each model through an epoch setting three times, so I divide the elapsed run time by three for these results)_:
* 1 Epoch: 1 Second
* 4 Epochs: 4 Seconds
* 8 Epochs: 8 Seconds
* 12 Epochs: 12 Seconds
* 16 Epochs: 17 Seconds
* 20 Epochs: 21 Seconds
* 24 Epochs: 25 Seconds

And below are the actual metrics:
![Neural_Epoch](TensorData/Neural_Epoch/epoch.png)

We can see that at around 12-16 Epochs, we hit the highest accuracy before the loss begins increasing. The confusion matrix for **16 Epochs** is attached below, although any value between 12 and 16 seems to be optimal.
![Neural_16Epoch](TensorData/Neural_Epoch/16_epoch.png)

## Experiment Four (Last for Neural Networks): Dropout Layer
The last experiment I had for the Neural Network was investigating the usefulness of a **Dropout** layer. Such a layer randomly deactivates x% of neurons. This decreases the likelihood of overtraining the network, especially if one neuron has greater impact on the final weights than the other neurons. On the flip side, this could decrease accuracy, as noted in the paper http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf.
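
As a rough illustration of what a Dropout layer does during training, here is a plain-Python sketch of "inverted" dropout, which zeroes values at random and scales the survivors so the expected total stays the same. The function name, the fixed seed, and the example values are my own for illustration; the real Keras layer works on tensors and is only active during training.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zeroes each value with probability `rate` and
    scales the survivors by 1 / (1 - rate) so the expected sum is unchanged."""
    if rate <= 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed, only to make the demo repeatable
activations = [1.0] * 10
print(dropout(activations, 0.2, rng))
```

Because a different random subset is dropped at each training step, no single neuron can be relied on too heavily.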

I chose to loop between randomly dropping out [10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%] of values. I skipped 0%, which is the same as having no Dropout layer, and 100%, which would drop every value. The results are attached below:

![Neural_Dropout](TensorData/Neural_Dropout/dropout.png)

The **best Dropout rate appears to be 20%** and its results are attached below:
![Neural_Dropout](TensorData/Neural_Dropout/0.2_dropout.png)

## Neural Network Conclusions
I ultimately got **0.323 Loss & 0.884 Accuracy** using a two layer neural network, where
* The first layer used the Adam Optimizer and Sigmoid Activation Function
* The second layer used the Softmax Activation Function.
* Using both a Dropout Layer (at 20% drop rate) and Flatten Layer
* 12 Epochs = 12 Seconds Runtime
* 16 Epochs = 17 Seconds Runtime

Recall that for the Machine Learning Models:
* SVC had the highest accuracy at 0.897 but needed 1:12 hours to run.
* Random Forests had a decent accuracy at 0.879 but needed 8 minutes to run.

Our model fell right in the middle with 0.882 accuracy, but only needed 12-17 seconds to run. This gives us the opportunity to add more Neural Network layers which would most likely increase accuracy.

Another way we could increase accuracy is by investigating other types of Neural Networks (or rather, Neural Network Layers) and involving them in our mix.

Cue in the next section. But note in the next section, I will begin creating a model _without the `Dropout Layer`_ we implemented in Experiment Four. So the model I'll begin to implement should be compared to the model from Experiment Three, which had a 0.328 Loss & 0.882 Accuracy.

----
# About Convolutional Neural Networks

### Convolutional Neural Network Principles
_(All GIFs in this section are obtained from https://hackernoon.com/visualizing-parts-of-convolutional-neural-networks-using-keras-and-cats-5cc01b214e59)_

A convolutional neural network does not eliminate the layers from the neural networks, but rather, augments it with its own layers. The CNN is primarily driven by the **Convolutional Layer,** which effectively is another way to simplify the data.

![ConvolutionalLayer](https://cdn-images-1.medium.com/max/1600/1*ZCjPUFrB6eHPRi4eyP6aaA.gif)

As seen in the image above,
* The sliding yellow window is the _Kernel_
* The _Stride_ of the kernel refers to how many 'pixels' it moves in each move
* Each kernel applies a _Filter_. A filter is a combination of weights (denoted in red text), and the weights change to accommodate what the CNN is learning. We multiply each weight by whatever value was originally in that square and sum the products.
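
The sliding-window calculation described above can be sketched in a few lines of plain Python. This is a simplified illustration that assumes a square image, a square kernel, and no padding; the 4x4 image and the kernel weights are made up, and a real `Conv2D` layer learns its weights and handles many filters at once.

```python
def convolve2d(image, kernel, stride=1):
    """Slides `kernel` over `image` without padding and returns the
    convolved feature (a smaller 2-D grid)."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    feature = []
    for top in range(0, out_size * stride, stride):
        row = []
        for left in range(0, out_size * stride, stride):
            # Multiply each weight by the pixel under it, then sum
            total = sum(kernel[i][j] * image[top + i][left + j]
                        for i in range(k) for j in range(k))
            row.append(total)
        feature.append(row)
    return feature


# Hypothetical 4x4 image and 2x2 kernel of weights
image = [[1, 0, 1, 0],
         [0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 1, 0, 1]]
kernel = [[1, 0],
          [0, 1]]

print(convolve2d(image, kernel, stride=1))
print(convolve2d(image, kernel, stride=2))
```

A stride of 1 yields a 3x3 convolved feature here, while a stride of 2 skips positions and yields a 2x2 one.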

This produces a _convolved feature_. There are further types of layers we can apply at this point to reduce the size of this convolved feature. Two of those types are **Max Pooling** and **Average Pooling**, in which we create a kernel on this convolved feature and move it across separate regions, selecting either the single highest value or the average of all the values within that kernel.

![Pooling](https://cdn-images-1.medium.com/max/800/1*Feiexqhmvh9xMGVVJweXhg.gif)
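
Here is a small plain-Python sketch of both pooling variants, assuming non-overlapping windows. The helper names and the 4x4 convolved feature are hypothetical; passing `max` or `average` as the reducer switches between Max Pooling and Average Pooling.

```python
def pool2d(feature, size, reducer):
    """Applies `reducer` to each non-overlapping size x size window."""
    pooled = []
    for top in range(0, len(feature) - size + 1, size):
        row = []
        for left in range(0, len(feature[0]) - size + 1, size):
            window = [feature[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row.append(reducer(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


# Hypothetical 4x4 convolved feature
feature = [[1, 3, 2, 0],
           [5, 7, 4, 6],
           [0, 2, 8, 8],
           [4, 6, 8, 8]]

print(pool2d(feature, 2, max))
print(pool2d(feature, 2, average))
```

Either way, a 2x2 pooling window cuts the width and height in half, so the layers that follow have a quarter as many values to process.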

We can also create **Dropout** Layers which will temporarily turn off certain outputs while training the model, to help reduce the risk that we're overfitting the model.

----
# Building a CNN

## Four Parts
To reiterate, the four permutations of layers I could initially use are:

1. `Convolutional Layer`
2. `Convolutional Layer` + `Max Pooling`
3. `Convolutional Layer` + `Average Pooling`
4. `Convolutional Layer` + (whichever pooling wins) + `Dropout`

Technically I could also attempt to 'stack' convolutional layers on each other or try other university CNN backed models too, but that's beyond the scope of this current section.

Recall that in the Neural Network test, we determined that 12 epochs with a Dense Layer set to these values performed the best.
* `Adam` Optimizer
* `Sigmoid` Activation Function

Also recall that the Neural Network contained these layers:
* A `Dropout` Layer to reduce overtraining risk at a 20% drop rate 
* A `Flatten` Layer to reduce dimensionality

Besides the Dropout rate, which I'd rather redetermine at the end since it could greatly change based on the convolutional components, I'll use these values while building my CNN.

## Components for the Convolutional Layer
The convolutional layer has these parameters:
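* Number of Filters: the number of feature maps (output channels) the layer produces
* The size of the kernel and the stride value, each given as an integer or a tuple
* And something I'll keep static is the _padding_, which controls how the kernel is handled at the edges of the image. With `padding='same'`, the image is padded so that the output keeps the same width and height as the input when the stride is 1.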
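
To see how these parameters interact, the sketch below computes the output width (or height) of a convolution along one axis, using the usual formulas for `'same'` and `'valid'` padding. The helper name `conv_output_size` is hypothetical and only meant as a quick sanity check, not as part of the model code.

```python
import math


def conv_output_size(input_size, kernel_size, stride, padding):
    """Output width (or height) of a convolution along one axis."""
    if padding == 'same':
        # The input is zero-padded so only the stride shrinks the output
        return math.ceil(input_size / stride)
    if padding == 'valid':
        # No padding: the kernel must fit entirely inside the input
        return (input_size - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


# 28-pixel images with the default kernel size of 4
for padding in ('same', 'valid'):
    for stride in (1, 2):
        print(padding, stride, conv_output_size(28, 4, stride, padding))
```

With `padding='same'` and a stride of 1, a 28x28 image stays 28x28 per filter, so the number of filters alone decides how many values reach the `Flatten` layer.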

In [None]:
def basic_cnn(model_params, shape):
    """ Builds basic Convolutional neural network model """
    from keras.layers import Dense, Flatten, InputLayer
    from keras.layers.normalization import BatchNormalization
    from keras.layers.convolutional import Conv2D
    from keras.models import Sequential

    model = Sequential()

    model.add(InputLayer(input_shape=(shape[1], shape[2], shape[3])))
    model.add(BatchNormalization())

    model.add(Conv2D(model_params['conv_filters'],
                     model_params['kernel_size'],
                     strides=model_params['kernel_stride'],
                     padding='same'))
    model.add(Flatten())

    model.add(Dense(model_params['dense_1'], activation=model_params['activate_1']))
    model.add(Dense(28, activation='softmax'))

    model.compile(loss=model_params['loss'],
                  optimizer=model_params['optimizer'],
                  metrics=['accuracy'])

    print(model.summary())
    return model

def main(args):
    # Loads the Fashion-MNIST dataset through Keras
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    x_train, y_train, x_test, y_test = flatten_data(args, x_train, x_test, y_train, y_test)

    # Creates output directory
    if not os.path.isdir(args.out):
        os.makedirs(args.out)

    # Creates range to loop filter between
    change = 'conv_filters'
    range = [4, 14, 24, 32]
    history_dict = {x: {'loss': 0.0, 'acc': 0.0} for x in range}

    # Runs Model
    for new in range:
        history_dict[new] = {'loss': [], 'acc': []}
        for loop in [1, 2, 3]:
            print('Creating Model with the {} {}'.format(new, change))
            model_params = params.standard()
            model_params['conv_filters'] = new

            model = models.basic_cnn(model_params, x_train.shape)

            y_pred, metrics = models.fit_model(model, model_params, 
                                               x_train, y_train, x_test, y_test)

            # Adds Data to Trends
            history_dict[new]['loss'].append(metrics['loss'])
            history_dict[new]['acc'].append(metrics['acc'])

        # Calculates Average
        history_dict[new]['loss'] = np.mean(history_dict[new]['loss'])
        history_dict[new]['acc'] = np.mean(history_dict[new]['acc'])

        # Plots Confusion Matrix
        classes = {0: 'T-Shirt/top',
                   1: 'Trouser',
                   2: 'Pullover',
                   3: 'Dress',
                   4: 'Coat',
                   5: 'Sandal',
                   6: 'Shirt',
                   7: 'Sneaker',
                   8: 'Bag',
                   9: 'Ankle boot'}
        class_values = list(classes.values())
        title = "{} (Loss {} & Acc {})".format(new, metrics['loss'], metrics['acc'])
        conf_png = '{}/{}_{}.png'.format(args.out, new, change)
        plot.conf_matrix(y_test, y_pred, class_values, out=conf_png, title=title)

    # Plots Accuracy & Loss Trends
    trends_png = '{}/{}.png'.format(args.out, change)
    plot.dict_trends(history_dict, xlabel=change, out=trends_png)

    return x_train, y_train, x_test, y_test, y_pred

main(ARGS)

Also, unless otherwise specified or determined in a prior CNN experiment, the default parameters are as follows:

In [None]:
def standard():
    params = {}

    # Build Parameters
    params['conv_filters'] = 20  # Number of filters in the convolutional network
    params['kernel_size'] = 4  # Size of the kernel looping through the image
    params['kernel_stride'] = 1  # 'Speed' of how many pixels the Kernel moves by
    params['optimizer'] = 'adam'  # Optimizer Formula used in optimizing the building of the model
    params['cnn_activation'] = 'relu'  # Activation Function used in the Convolutional Layers, if applicable
    params['loss'] = 'categorical_crossentropy'  # Loss Function method

    # Fit Parameters
    params['epoch'] = 12  # How many epochs (full passes over the training data) are run
    params['dropout'] = 0.1  # The fraction of neurons randomly deactivated at each training step
    params['batch_size'] = 128  # The size of the image batch being fed into the fitting process

    # Dense Layer
    params['dense_1'] = 120  # Number of neurons in the dense layer
    params['activate_1'] = 'sigmoid'  # Activation Function used in the Neural Dense Layers

    return params

## Experiment 1: Testing Number of Filters (Output Dimensionality)
I wanted to modify the dimensionality of my output filters first, just to get a general idea of where in the world my output filter quantity should be.

I chose to initially loop between these values of filters: [8, 12, 14, 16, 20, 24, 28]. I chose values around 14 filters because 28 Pixels / 2 Pixel Kernels = 14 Filters. The results are attached:

![CNN_Filters](TensorData/CNN_Filter/conv_filters.png)

Using **20 filters** gave me the best results and its confusion matrix is attached below:
![CNN_20Filters](TensorData/CNN_Filter/20_conv_filters.png)

Something interesting was that when I tested 4 filters, I had runtimes of 48 seconds, but anything greater than that gave me a 50% runtime reduction. In hindsight, this isn't that surprising since there is less data reduction needed at the higher values, but I didn't expect the runtime to drop this dramatically and then plateau even as I added more filters.

Recall that the basic neural network had an Accuracy of 0.8821 & Loss of 0.3279. By adding the convolutional layer, albeit in a very basic state, we were able to improve those two values slightly.

## Experiment 2: Testing Kernel Sizes & Strides
The next parameters to modify were the kernel size and/or stride. The kernel is the window which slides across the image, computing one output value at each position, and the stride is how many pixels it moves per step.

This experiment has two variables because multiple effects can happen:
* _If Kernel Size increases alone_, the computation time would increase because there's more area to process per snapshot for the same number of iterations. While this could help with accuracy, it could increase the risk of overtraining too.
* _If Kernel Stride increases alone_, the computation time would decrease since we would have fewer iterations, but we could be undertraining our model as a result.
* _If both Kernel Size and Kernel Stride increase_, a combination of the above two conditions would happen.
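
To make the trade-off above a bit more concrete, here is a small sketch that counts how many positions a kernel visits along one axis of a 28-pixel image. The helper `conv_output_size` is a hypothetical name of my own (it is not part of the project code) and simply applies the usual output-size arithmetic for 'valid' and 'same' padding.

```python
import math


def conv_output_size(image_size, kernel_size, stride, padding="valid"):
    """Number of kernel positions along one axis of the image."""
    if padding == "same":
        # Zero-padding is added so that every stride step yields one output.
        return math.ceil(image_size / stride)
    # 'valid': the kernel must fit entirely inside the image.
    return (image_size - kernel_size) // stride + 1


for kernel_size, stride in [(4, 1), (4, 2), (4, 4), (7, 1), (7, 7)]:
    positions = conv_output_size(28, kernel_size, stride)
    print(f"kernel={kernel_size} stride={stride} -> {positions}x{positions} positions")
```

A larger stride cuts the number of positions quickly (25 per axis at stride 1 versus 7 per axis at stride 4 for a 4-pixel kernel), which is where the runtime savings would come from.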

A Research Paper (https://arxiv.org/pdf/1409.1556.pdf) suggests that a Stride of 1 is best, at least for the first convolutional layer. Because the first layer has no idea where the most important features are, it's important to make sure we cover all areas of the image in equal proportions. This is also the default setting in Keras & TensorFlow.

#### Experiment 2a: Testing Kernel Size with a Stride of 1
Keeping a stride of 1 for now, I'll loop over kernel sizes first. Because this is a 28x28 image, I wanted to pick sizes which would make mathematical sense for this image's dimensions. As a result, I tested with [1, 2, 4, 6, 7, 8] as reflected below.

![CNN_Kernel](TensorData/CNN_KernelSize/kernel_size.png)


Increasing the kernel size had a direct runtime impact: the seconds of runtime needed per epoch were roughly equivalent to the kernel size. This means that a 12 epoch neural network with a Kernel Size of 2 needed 24 seconds & a Kernel Size of 4 needed 48 seconds. This makes it more advantageous for me to pick the model with the smallest kernel size that still has reasonable accuracy and loss.

The Kernel Size of 7 technically had the best results, but it took 84 seconds.
![CNN_7Kernel](TensorData/CNN_KernelSize/7_kernel_size.png)

The Kernel Size of 4 had the next best results, took only 48 seconds, and performed extremely close to the Kernel Size of 7.
![CNN_4Kernel](TensorData/CNN_KernelSize/4_kernel_size.png)

So tentatively speaking, **Kernel Size 4 & 7** appear to perform the best. I'll use these sizes when testing which Kernel Stride is the best value.

#### Experiment 2b: Testing Stride with Kernel Size 7
If the Kernel Size is 7, I don't have much room to modify the stride. Because the image is 28x28, if the stride isn't 1 or 7, we will have regions of the kernel which either extend past the edge of the image, or have a reduced surface area of pixels to process, which could hurt the neural network's accuracy. Testing with KernelSize=7 & Stride=[1,7] gave me:
* Stride 1 = 84 Seconds, 0.323 Loss, 0.884 Accuracy
* Stride 7 = 48 Seconds, 0.343 Loss, 0.874 Accuracy

#### Experiment 2c: Testing Stride with Kernel Size 4
With a Kernel Size of 4, we have more room to work with because it's an even number on an even number of pixels. I can select strides of [1, 2, 4] and ensure the entire image will get equally represented. This testing gave me:
* Stride 1 = 48 Seconds, 0.323 Loss, 0.882 Accuracy
* Stride 2 = 36 Seconds, 0.327 Loss, 0.884 Accuracy
* Stride 4 = 24 Seconds, 0.330 Loss, 0.882 Accuracy

#### Conclusions
A **Stride of 1** always appears to be the best, and with very little variation between Kernel Sizes at that level, I chose to stick to a **Kernel Size of 4**. I know there are other data reduction layers in a Convolutional Neural Network and I suspect those are better for this purpose.



## Experiment 3: Pooling Techniques
There are two other layers I can add after the convolutional layer.

* `Max Pooling`: Where the feature with the highest value within the `kernel` is selected to represent that kernel
* `Average Pooling`: Where the average of all features within the `kernel` is selected to represent that kernel.

Each of these technically has parameters to set its kernel (this is a different kernel than the previous convolutional layer). For now, I'm using the default in Keras, which is a 2x2 kernel with a stride of 2. I'll enable the settings so that this kernel will always stay within the boundaries of my image.

This is what the CNN model currently looks like:
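
As a rough illustration of the difference between the two pooling types, here is a minimal NumPy sketch of 2x2 pooling with a stride of 2 on a made-up 4x4 feature map. The `pool2d` helper and the sample values are my own assumptions for illustration (Keras handles this internally), and the sketch assumes the map's height and width divide evenly by the pool size.

```python
import numpy as np


def pool2d(feature_map, size=2, reducer=np.max):
    """Apply non-overlapping size x size pooling (stride == size)."""
    height, width = feature_map.shape
    # Split the map into (size x size) tiles, then reduce each tile to one value.
    tiles = feature_map.reshape(height // size, size, width // size, size)
    return reducer(tiles, axis=(1, 3))


# Hypothetical 4x4 feature map, values chosen only for illustration
feature_map = np.array([[1, 3, 2, 0],
                        [5, 7, 1, 1],
                        [0, 2, 9, 4],
                        [4, 2, 3, 8]], dtype=float)

max_pooled = pool2d(feature_map, reducer=np.max)
avg_pooled = pool2d(feature_map, reducer=np.mean)
print(max_pooled)  # keeps the strongest response in each tile
print(avg_pooled)  # smooths each tile into its mean
```

Max pooling keeps the extreme value of each tile (7, 2, 4, 9 here) while average pooling blends it with its neighbors (4, 1, 2, 6).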

In [None]:
def basic_cnn(model_params, shape):
    """ Builds basic Convolutional neural network model """
    from keras.layers import (AveragePooling2D, BatchNormalization, Conv2D, Dense,
                              Flatten, InputLayer, MaxPooling2D)
    from keras.models import Sequential

    model = Sequential()

    model.add(InputLayer(input_shape=(shape[1], shape[2], shape[3])))
    model.add(BatchNormalization())

    model.add(Conv2D(model_params['conv_filters'],
                     model_params['kernel_size'],
                     strides=model_params['kernel_stride'],
                     padding='same'))
    # model.add(MaxPooling2D(padding='same'))  # Uncomment to enable Max Pooling
    # model.add(AveragePooling2D(padding='same'))  # Uncomment to enable Average Pooling
    model.add(Flatten())

    model.add(Dense(model_params['dense_1'], activation=model_params['activate_1']))
    model.add(Dense(28, activation='softmax'))  # Output layer; Fashion-MNIST only has 10 classes, so 10 units would suffice

    model.compile(loss=model_params['loss'],
                  optimizer=model_params['optimizer'],
                  metrics=['accuracy'])

    model.summary()  # summary() prints the table itself and returns None
    return model

Using the settings from above, this is the confusion matrix from Max Pooling:
![CNN_MaxPool](TensorData/CNN_Pool/Maxpool.png)

And this is the confusion matrix from Average Pooling:
![CNN_AvgPool](TensorData/CNN_Pool/Avgpool.png)

Both took 4 seconds per epoch, which puts the total runtime at 48 seconds.

Recall that the model without pooling had a loss of 0.323 & accuracy of 0.882. With that in mind, **Max Pooling performed slightly better** than No Pooling, which performed better than Average Pooling in this case.

This doesn't surprise me too much. The variation between clothing items, especially something like a Shirt vs. T-Shirt, can't be very large. The computer probably needs extreme data points to make those determinations rather than the average values.

I didn't try increasing the `kernel` size to anything larger than 2x2, only because our `convolved feature` right now is fairly simple. Increasing the `kernel` size would only increase the risk of overtraining it.

## Experiment 4: Dropout Layer
The `Dropout Layer` is the same layer we referred to in Experiment Four in the Neural Network section. When building the Convolutional Neural Network, I initially did not use this because I suspected the drop rate would greatly change as we added CNN components.

Using what I know now, I'll only test Dropout Rates of [10%, 20%, 30%, 40%]. Anything greater than that seemed like it wouldn't work right out of the gate.

![CNN_Dropout](TensorData/CNN_Dropout/dropout.png)

The results were split between 5% and 10% dropout:
* 05% Dropout = 0.310 Loss & 0.885 Accuracy
* 10% Dropout = 0.316 Loss & 0.887 Accuracy
* 15% Dropout = 0.322 Loss & 0.879 Accuracy

But I ultimately settled on the **10% Dropout** value because it gave me the most meaningful difference for both loss and accuracy versus the model that had no dropout value.
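
For reference, here is a minimal NumPy sketch of what a dropout layer does to a batch of activations during training. It uses the common "inverted dropout" formulation, which is my understanding of how Keras' `Dropout` behaves; the `dropout` function below is a hypothetical stand-in of my own, not the Keras implementation.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    keep_prob = 1.0 - rate
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones((1000, 120))  # 120 matches params['dense_1']
dropped = dropout(activations, rate=0.1, rng=rng)
print("fraction zeroed:", np.mean(dropped == 0))
print("mean activation:", dropped.mean())
```

With a 10% rate, roughly one in ten units is silenced on each pass while the average signal stays about the same, so the network can't lean too heavily on any single neuron.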

## Convolutional Neural Network Conclusions
For a basic convolutional neural network, I ultimately obtained a **0.316 Loss & 0.887 Accuracy** result with:
* One Convolutional Layer with 20 Filters, a 4x4 Kernel with a stride of 1
* One Max Pooling Layer with a 2x2 Kernel and stride of 2
* And the same Dense & Flatten layers as the Neural Network Model.

Versus the Neural Network, which had 0.323 Loss & 0.884 Accuracy, we had a very slight refinement by adding the convolutional layers. However, the neural network took about 12 seconds to run whereas the convolutional neural network took about 72 seconds.

This is still significantly faster & more accurate than the fastest running machine learning model, which was the 8:39 minute random forest model scoring 0.879 accuracy. However, it's still less accurate than the best model, which was an SVC model that took 1:12:39 hours & had an accuracy of 0.897.

----
# Using a GPU to improve the existing CNN

## Runtime Improvements
Something we could do to improve runtime is to use a GPU. For this experiment, I'm utilizing a:

* EVGA NVIDIA GeForce GTX 1050, 2 GB of RAM (CUDA Compute Capability 6.1)
  * CUDA Toolkit 9.0.176
  * CUDA Driver 9.0.222
  * cuDNN 7

When I re-ran the CNN model from the last experiments, I got:
* CNN with CPU: 72 Seconds
* CNN with GPU: 18 Seconds

Using a GPU provided a 4x improvement, which gives us great flexibility to test further model configurations. As such, **all further variants from this point forward will utilize the GPU, rather than CPU**.

## Additional Epochs
In the past, I was limited in how many Epochs I could run based on my runtime. Because I'm now using a GPU, I have the option to run significantly more epochs. In this test, I chose to run [50, 100, 150, 200, 250, 300] epochs.

![CNN_GPU_Epoch1](TensorData/CNN_GPUEpoch/gpu_cnn_epoch_300.png)

At 50 Epochs, the loss begins increasing dramatically. So I chose to scale it back to [10, 20, 30, 40, 50] epochs.

![CNN_GPU_Epoch2](TensorData/CNN_GPUEpoch/gpu_cnn_epoch.png)

* Recall that our prior CNN had 12 Epochs, which got us 0.316 Loss & 0.887 Accuracy at 18 Seconds on the GPU.
* We have the best loss at 20 Epochs: 0.298 Loss & 0.890 Accuracy at 30 Seconds.
* We have the best accuracy at 30 Epochs: 0.307 Loss & 0.895 Accuracy at 45 Seconds.

We will stick to **30 Epochs with the GPU**. It might be around 3x the runtime of the original 12 Epoch CNN on the GPU, but it delivers stronger results. Also, recall that the SVCs were the best machine learning models, scoring an accuracy of 0.896-0.897, but needed 1:12 hours to complete. In less than a minute of runtime, we practically tied those results.

----
# Using a GPU to create more advanced models

## Variant One: Multiple CNN Layers
A possible way to improve the quality of the convolutional neural network would be by adding additional convolutional layers & dropout layers to extract key features.

In the model shown below, two primary changes happened:
1. There are two more convolutional layers and one more dense neural network layer.
2. The convolutional layers now apply their own activation function, just using the generic 'relu' function for now.

In [1]:
def triple_cnn(model_params, shape):
    """ Builds a triple Convolutional neural network model """
    from keras.layers import (BatchNormalization, Conv2D, Dense, Dropout,
                              Flatten, InputLayer, MaxPooling2D)
    from keras.models import Sequential

    model = Sequential()

    model.add(InputLayer(input_shape=(shape[1], shape[2], shape[3])))
    model.add(BatchNormalization())

    model.add(Conv2D(model_params['conv_filters'],
                     model_params['kernel_size'],
                     strides=model_params['kernel_stride'],
                     activation=model_params['cnn_activation'],
                     padding='same'))

    model.add(Conv2D(model_params['conv_filters'],
                     model_params['kernel_size'],
                     strides=model_params['kernel_stride'],
                     activation=model_params['cnn_activation'],
                     padding='same'))

    model.add(Conv2D(model_params['conv_filters'],
                     model_params['kernel_size'],
                     strides=model_params['kernel_stride'],
                     activation=model_params['cnn_activation'],
                     padding='same'))

    model.add(MaxPooling2D(padding='same'))
    #model.add(Dropout(model_params['dropout']))  # Will be enabled in a second test in this section

    model.add(Flatten())

    model.add(Dense(model_params['dense_1'], activation=model_params['activate_1']))
    model.add(Dense(model_params['dense_1'], activation=model_params['activate_1']))
    model.add(Dense(28, activation='softmax'))

    model.compile(loss=model_params['loss'],
                  optimizer=model_params['optimizer'],
                  metrics=['accuracy'])

    model.summary()  # summary() prints the table itself and returns None
    return model

We are also using nearly the same parameters as in the previous listing, which were influenced by what we previously determined. Only the epoch count changes, to 30. These are:

In [None]:
def standard():
    params = {}

    # Build Parameters
    params['conv_filters'] = 20  # Number of filters in the convolutional network
    params['kernel_size'] = 4  # Size of the kernel looping through the image
    params['kernel_stride'] = 1  # 'Speed' of how many pixels the Kernel moves by
    params['optimizer'] = 'adam'  # Optimizer Formula used in optimizing the building of the model
    params['cnn_activation'] = 'relu'  # Activation Function used in the Convolutional Layers, if applicable
    params['loss'] = 'categorical_crossentropy'  # Loss Function method

    # Fit Parameters
    params['epoch'] = 30  # How many epochs (full passes over the training data) are run
    params['dropout'] = 0.1  # The fraction of neurons randomly deactivated at each training step
    params['batch_size'] = 128  # The size of the image batch being fed into the fitting process

    # Dense Layer
    params['dense_1'] = 120  # Number of neurons in the dense layer
    params['activate_1'] = 'sigmoid'  # Activation Function used in the Neural Dense Layers

    return params

##### Variant 1a: 3 Convolutional Layers without Dropout
The first pass with 3 Convolutional Layers & 3 Neural Layers proved favorable:

![CNN3_GPU](TensorData/CNN3_GPUEpoch/gpu_doublecnn.png)

Each epoch took 14 seconds to run which means the 6 Epoch Model took about 84 seconds. Not only is it faster than the machine learning models, it's also more accurate than them at 0.231 Loss & 0.919 Accuracy (the 30 Epoch CNN had 0.307 Loss & 0.895 Accuracy at 45 Seconds on a GPU).

![CNN3_GPU_Triple](TensorData/CNN3_GPUEpoch/6_gpu_doublecnn.png)

##### Variant 1b: 3 Convolutional Layers with Dropout
I next applied a quick Dropout Layer just to help reduce the risk of over-training the model. I set the dropout percentage higher this time at 50% because I figured the multiple convolutional layers are increasing the likelihood we are nitpicking on details, which means we should reduce the quantity of neurons more to compensate.

![CNN3_GPU_Dropout](TensorData/CNN3_GPUEpochDropout/gpu_doublecnn.png)
 
![CNN3_GPU_Triple](TensorData/CNN3_GPUEpochDropout/8_gpu_doublecnn.png)

##### Summary from 3 Convolutional Layers
The **Triple Convolutional Layer Network with 8 Epochs & Dropout Layer** performed better than the standard CNN with a 0.217 Loss & 0.922 Accuracy at 112 seconds (1:52 Minutes). This performs very favorably against our 30 Epoch CNN which had a 0.307 Loss & 0.895 Accuracy at 45 Seconds on the same GPU.

## Variant Two: VGG-like CNN Model
The VGG model was pioneered at the University of Oxford in 2015 as one of the most notable CNN models to successfully classify image data. Their paper is at https://arxiv.org/pdf/1409.1556.pdf

Their claim to fame is that they used very small kernel sizes (the region which slides across each image's pixels), but in exchange, used significantly more convolutional layers. This level of detail gave them the opportunity to extract really fine details and entrust the convolutional neurons to figure out the differences between those details.

I adapted their VGG model for this problem set. I couldn't apply it directly because they started with images of different input sizes and thus, the mathematics behind their layer design would need to differ for me.

##### Variant 2a: VGG-16 Model
Attached below is the model architecture. In short, it's
* 2 Convolutional Layers with a trailing MaxPooling layer
* 2 Convolutional Layers, with double the filters & half the image pixel space each, and the same trailing MaxPooling layer at the end.
* 3 Convolutional Layers, with the same architecture as the above
* 3 Convolutional Layers, with the same architecture as the above
* 3 Dense Neural Layers, similar to the original CNN Dense Neural Layers
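
To see how the image shrinks as it passes through those four blocks, here is a short sketch that traces the feature-map size. It assumes what the model in this section uses: 'same'-padded convolutions with a stride of 1 (which keep the spatial size), default 2x2 max pooling with 'same' padding (which halves it, rounding up), and the filter counts 8, 16, 32 and 64 from this section's VGG parameter listing.

```python
import math

size = 28  # Fashion-MNIST images are 28x28
blocks = [(2, 8), (2, 16), (3, 32), (3, 64)]  # (conv layers, filters) per block

for conv_layers, filters in blocks:
    # The 'same' convolutions leave the spatial size untouched;
    # the trailing 2x2, stride-2 max pool with 'same' padding rounds up.
    size = math.ceil(size / 2)
    print(f"{conv_layers} conv layers, {filters} filters -> {size}x{size}x{filters}")

flattened = size * size * blocks[-1][1]
print("units entering the dense layers:", flattened)
```

Under those assumptions the image goes 28 -> 14 -> 7 -> 4 -> 2, so only a 2x2 map with 64 filters is left by the time it reaches the Flatten layer. That is one reason the original VGG layout can't be copied directly onto images this small.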

In [None]:
def vgg16(model_params, shape):
    """ Builds a VGG-16 like model """
    from keras.layers import (BatchNormalization, Conv2D, Dense, Dropout,
                              Flatten, Input, MaxPooling2D)
    from keras.models import Model

    print(shape)
    input = Input(shape=(shape[1], shape[2], shape[3]))

    tensor = Conv2D(model_params['vgg_filters_1'], model_params['vgg_kernel'],
                    activation=model_params['vgg_activation'], padding='same')(input)
    tensor = Conv2D(model_params['vgg_filters_1'], model_params['vgg_kernel'],
                    activation=model_params['vgg_activation'], padding='same')(tensor)
    tensor = MaxPooling2D(padding='same')(tensor)

    tensor = Conv2D(model_params['vgg_filters_2'], model_params['vgg_kernel'],
                    activation=model_params['vgg_activation'], padding='same')(tensor)
    tensor = Conv2D(model_params['vgg_filters_2'], model_params['vgg_kernel'],
                    activation=model_params['vgg_activation'], padding='same')(tensor)
    tensor = MaxPooling2D(padding='same')(tensor)

    tensor = Conv2D(model_params['vgg_filters_3'], model_params['vgg_kernel'],
                    activation=model_params['vgg_activation'], padding='same')(tensor)
    tensor = Conv2D(model_params['vgg_filters_3'], model_params['vgg_kernel'],
                    activation=model_params['vgg_activation'], padding='same')(tensor)
    tensor = Conv2D(model_params['vgg_filters_3'], model_params['vgg_kernel'],
                    activation=model_params['vgg_activation'], padding='same')(tensor)
    tensor = MaxPooling2D(padding='same')(tensor)
   
    tensor = Conv2D(model_params['vgg_filters_4'], model_params['vgg_kernel'],
                    activation=model_params['vgg_activation'], padding='same')(tensor)
    tensor = Conv2D(model_params['vgg_filters_4'], model_params['vgg_kernel'],
                    activation=model_params['vgg_activation'], padding='same')(tensor)
    tensor = Conv2D(model_params['vgg_filters_4'], model_params['vgg_kernel'],
                    activation=model_params['vgg_activation'], padding='same')(tensor)
    tensor = MaxPooling2D(padding='same')(tensor)

    tensor = Flatten()(tensor)

    tensor = Dense(model_params['dense_1'], activation=model_params['vgg_activation'])(tensor)
    tensor = Dense(model_params['dense_1'], activation=model_params['vgg_activation'])(tensor)
    tensor = Dense(28, activation='softmax')(tensor)

    model_compile = Model(input, tensor)
    model_compile.compile(loss=model_params['loss'],
                          optimizer=model_params['optimizer'],
                          metrics=['accuracy'])

    model_compile.summary()  # summary() prints the table itself and returns None
    return model_compile

And attached are the model parameters used:

In [None]:
def standard():
    params = {}  # The Build, Fit & Dense parameters listed earlier also go here

    # VGG Parameters
    params['vgg_filters_1'] = 8  # Number of filters in each convolutional layer
    params['vgg_filters_2'] = 16
    params['vgg_filters_3'] = 32
    params['vgg_filters_4'] = 64
    params['vgg_kernel'] = 4  # Size of the kernel to loop through
    params['vgg_dense'] = 504  # Number of neurons in the first two dense layers
    params['vgg_activation'] = 'relu'  # Activation function for each convolutional layer

    return params

I chose to test this at [5, 10, 15, 20, 25, 30, 35, 40] Epochs, and the model performed the best at 10 epochs. While the VGG model had slightly better accuracy at higher epochs, it also had slightly worse losses, and it really didn't appear to be a justifiable difference, especially given the runtime increases.

![VGG_10](TensorData/VGG16_GPU/10_gpu_vgg16.png)

The loss is 0.257 and accuracy is 0.909. Each epoch took about 13 seconds which meant this took 130 seconds or 2:10 minutes.

This is slightly worse than our triple convolutional neural network model, which had a 0.217 Loss & 0.922 Accuracy at 1:52 minutes.

Without additional tweaking, the **VGG16 model does not appear to bring enough gains**. However, it is of course possible that with more tweaking of the hyperparameters, it would perform better.

##### Variant 2b: VGG-19 Model
The VGG 19 model (or at least, something like it) features more convolutional layers. The difference it has from VGG 16 is that instead of 3 Convolutional Layers in the third & fourth convolutional block, we now have 4 convolutional layers. I created a new function called vgg19() to handle this functionality.

I tested this against [5, 10, 15, 20, 25, 30] epochs and the best results for the VGG 19 model were with 15 epochs.

![VGG19_15](TensorData/VGG19_GPU/15_gpu_vgg19.png)

The loss for this model is 0.258 and the accuracy is 0.911. With a runtime of 15 seconds per epoch, this took 225 seconds which is 3:45 minutes. While not as good as the triple CNN model, there is another thought...


##### Variant 2c: VGG-19 Model with Dropout Layer
Since the VGG19 model had reasonable accuracy but not as good a loss score, I thought maybe adding a Dropout Layer set at 50% before the last dense layer would help. Unfortunately, this did not appear to work, and actually made everything worse at 15 Epochs (and the same could be said for all of the other epochs).

![VGG19_15_Dropout](TensorData/VGG19_GPU_Dropout/15_gpu_vgg19.png)

##### Summary from VGG Models
The VGG16 and VGG19 models were really close to each other, and if we ran more epochs, we probably would still reach the same conclusion. Neither VGG model performed as well as the Triple Convolutional Network, but it stands to reason that with more development, they might have been able to perform just as well.

## Variant Three: ResNets
A Residual Network (ResNet) is one of the most famous types of Convolutional Neural Networks. It was introduced in a 2015 paper from Microsoft Research (https://arxiv.org/pdf/1512.03385.pdf).

A ResNet adopts some of the model advancements we've discussed so far, such as smaller kernel sizes and more convolutional layers. It does this by creating a block of layers with many `convolutional layers`, each one followed by some `activation function`. These layers help the model pick up on details far more effectively with greater non-linearity, but they make it harder to do backpropagation effectively.

To assist, a ResNet introduces a new concept called a `Shortcut`. The shortcut effectively produces a second path which bypasses a given block, so that a specific tensor can jump across this single block of convolutional & activation layers via one convolutional & activation layer. This shortcut has about 4x the amount of neurons as the standard approach, but helps simplify the network. The data still runs through both the standard 'blocks' and the 'shortcut', but because there's some redundancy here, removing a layer or two would not break the network completely. This differs from a typical Convolutional Neural Network or Neural Network, where removing a layer would most likely break the network.

This brings a new concept to contend with. Not only do I have to chain Convolutional Layers together, but I also need it arranged so I can pass the image tensor from several layers ago into a shortcut, effectively creating a second track. I then need to merge this second track back with the first track.

Below is the function which takes care of this. By default, the model passes through three Convolutional Layers, Batch Normalizations, and Activation Functions. But if I set shortcut to the original input tensor, it triggers the shortcut which creates this second track and merge process.

In [None]:
def standard_conv(model, filter, kernel_size, activation, shortcut=None):
    """ Creates a stack of three standard Convolutional Layers

        Inputs:
        model: The Keras tensor to add onto
        filter: The number of filters to use in the Convolutional Layers
        kernel_size: The kernel size to use in the Convolutional Layers
        activation: The Activation Function to use
        shortcut: The tensor to feed into the Shortcut Layer (e.g. the block's input).
                        [Default = None => Don't use the shortcut layer]
    """
    from keras.layers import Activation, Add, BatchNormalization, Conv2D
    eps = 1.1e-5

    model = Conv2D(filter, kernel_size, padding='same')(model)
    model = BatchNormalization()(model)
    model = Activation(activation)(model)

    model = Conv2D(filter, kernel_size, padding='same')(model)
    model = BatchNormalization()(model)
    model = Activation(activation)(model)

    model = Conv2D(filter*4, kernel_size, padding='same')(model)
    model = BatchNormalization()(model)

    if shortcut is not None:
        shortcut_model = Conv2D(filter*4, kernel_size, padding='same')(shortcut)
        shortcut_model = BatchNormalization()(shortcut_model)
        model = Add()([model, shortcut_model])

    model = Activation(activation)(model)
    return model
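
To show the merge step in isolation, here is a toy NumPy sketch of a block with a shortcut. Plain matrix multiplications stand in for the convolutional layers, and every name in it (`residual_block`, `block_weights`, `shortcut_weights`) is a hypothetical one of my own, so treat it as an illustration of the idea rather than what Keras does internally.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, block_weights, shortcut_weights):
    """Toy fully-connected analogue of a block with a projection shortcut."""
    # Main track: two transforms with an activation in between.
    main = relu(x @ block_weights[0]) @ block_weights[1]
    # Shortcut track: a single projection so the shapes match for the add.
    shortcut = x @ shortcut_weights
    # Merge both tracks, then apply the final activation.
    return relu(main + shortcut)


rng = np.random.default_rng(0)
x = rng.random((1, 4))
shortcut_weights = rng.random((4, 16))  # 4x wider, like filter*4 above

# With the main track zeroed out, the shortcut still carries the signal.
dead_block = [np.zeros((4, 4)), np.zeros((4, 16))]
output = residual_block(x, dead_block, shortcut_weights)
print(np.allclose(output, relu(x @ shortcut_weights)))
```

Even when the main track contributes nothing, the output is still the shortcut's projection of the input, which is the redundancy that lets a ResNet tolerate a weak or missing layer.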

And below is the function which calls those convolutional blocks. In a ResNet 18, the architecture is:

* One series of a
 * ZeroPadding2D call (to ensure the image is padded correctly)
 * A Convolutional Layer
 * Some Batch Normalization
 * A Relu Activation Function
 * MaxPool 2D.
* One series with three standard_conv() blocks, with the same kernel size as before
* One series with three standard_conv() blocks, with double the number of filters as before
* One series with three standard_conv() blocks, with double the number of filters as before
* One series with three standard_conv() blocks, with double the number of filters as before

If I uncomment the currently commented-out blocks of code, I'd get a ResNet 34 model which I might test later.

In [None]:
def main(model_params, shape):
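    """ Builds a ResNet 18 Convolutional Neural Network
        If I uncomment the block comments, I'd get ResNet 34.
    """
    from keras.layers import (Activation, AveragePooling2D, BatchNormalization,
                              Conv2D, Dense, Flatten, Input, MaxPool2D, ZeroPadding2D)
    from keras.models import Model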
    eps = 1.1e-5

    input = Input(shape=(shape[1], shape[2], shape[3]))

    # First Series (Orange)
    model = ZeroPadding2D((3, 3))(input)
    model = Conv2D(model_params['res_filters_1'],
                   model_params['res_kernel_size'],
                   padding='same')(model)
    model = BatchNormalization(epsilon=eps)(model)
    model = Activation(model_params['cnn_activation'])(model)
    model = MaxPool2D((3, 3), strides=2)(model)
    
    # Second Series (Purple)
    model = standard_conv(model,
                          model_params['res_filters_1'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'],
                          shortcut=model)
    model = standard_conv(model,
                          model_params['res_filters_1'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])
    model = standard_conv(model,
                          model_params['res_filters_1'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])

    
    # Third Series (Green)
    model = standard_conv(model,
                          model_params['res_filters_3'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'],
                          shortcut=model)
    model = standard_conv(model,
                          model_params['res_filters_3'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])
    model = standard_conv(model,
                          model_params['res_filters_3'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])
    """
    model = standard_conv(model,
                          model_params['res_filters_3'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])
    """

    # Fourth Series (Red)
    model = standard_conv(model,
                          model_params['res_filters_4'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'],
                          shortcut=model)
    model = standard_conv(model,
                          model_params['res_filters_4'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])
    model = standard_conv(model,
                          model_params['res_filters_4'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])
    """
    model = standard_conv(model,
                          model_params['res_filters_4'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])
    model = standard_conv(model,
                          model_params['res_filters_4'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])
    model = standard_conv(model,
                          model_params['res_filters_4'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])
    """

    # Fifth Series (Purple)
    model = standard_conv(model,
                          model_params['res_filters_5'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'],
                          shortcut=model)
    model = standard_conv(model,
                          model_params['res_filters_5'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])    
    model = standard_conv(model,
                          model_params['res_filters_5'],
                          model_params['res_kernel_size'],
                          model_params['cnn_activation'])

    # Final Neural Layer
    model_neural = AveragePooling2D()(model)
    model_neural = Flatten()(model_neural)
    #model_neural = Dense(model_params['res_dense'],
    #                    activation=model_params['res_activate'])(model_neural)
    model_neural = Dense(28, activation='softmax')(model_neural)

    # Compiles Model
    model_compile = Model(input, model_neural)
    model_compile.compile(loss=model_params['loss'],
                          optimizer=model_params['optimizer'],
                          metrics=['accuracy'])

    model_compile.summary()  # summary() prints the table itself and returns None
    return model_compile

##### Variant 3a: ResNet-18 with no Activation Blocks in standard_conv()
For my first test, I chose to disable all the Activation Blocks in standard_conv() except for the very last one, just to keep it more similar to the convolutional network layers I tested so far.

![Resnet18_20Epoch](TensorData/Res18_LessActivation/20_gpu_res18.png)

This model had a 0.256 Loss & 0.908 Accuracy. As with the VGG Model, the Triple CNN model had better performance with 0.217 Loss & 0.922 Accuracy. On top of that, the ResNet took far longer to run, averaging about 2 minutes per epoch, which meant this 20 epoch model took 40 minutes on the GPU. This is significantly longer than every other neural network thus far, which have been less than two minutes on average.

##### Variant 3c: ResNet-34


