# Artificial Neural Network Example/Exercise

This notebook is an exercise in using Artificial Neural Networks to **CLASSIFY** the MNIST handwritten digit data set (the "Hello, World" of ML datasets).

## Table of contents:

* [Imported Libraries](#imports)
* [Loading MNIST Dataset](#load)
  - [Viewing the data](#view)
* [Preprocess the data](#preprocess)
  - [Reshape and normalize image data](#reshape)
  - [Process the labels](#labels)
* [Flat neural network](#simple)
  - [Define model architecture](#model)
  - [Evaluate model](#simple-evaluate)
* [Flat: Adam](#simple-adam)
* [Flat: Dense](#flat-dense)
* [Simple convolutional](#conv-simple)
* [Multi-layer convolutional network](#conv-simple)

# Imported Libraries <a class="anchor" id="imports"></a>

Start by importing the following libraries.

**NOTE:**

- The versions of keras, tensorflow, and scikit-learn used here are printed in the output of the cell below
    * You may install a specific version in anaconda *via*, e.g., `$: conda install keras=2.2.2`
- If you have an NVIDIA GPU, you can install GPU-enabled versions of tensorflow/keras for improved performance:
    * `$: conda install -c defaults tensorflow-gpu keras-gpu`
    * **NOTE:** As of 2018-10-02 the conda-forge versions of the above were not working with my GPU (Quadro P5000)

In [2]:
%matplotlib notebook
import numpy as np
np.random.seed(123)
from matplotlib import pyplot as plt

import keras
print(f"keras version: {keras.__version__}")
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils
from keras.datasets import mnist
from keras import backend as K
from keras.callbacks import EarlyStopping
import tensorflow as tf
print(f"tensorflow version: {tf.__version__}")

###################################
# TensorFlow wizardry
config = tf.ConfigProto()
 
# Don't pre-allocate memory; allocate as-needed
config.gpu_options.allow_growth = True
 
# Only allow a total of half the GPU memory to be allocated
# config.gpu_options.per_process_gpu_memory_fraction = 0.5
 
# Create a session with the above options specified.
K.tensorflow_backend.set_session(tf.Session(config=config))
###################################

import os

import sklearn
print(f"scikit-learn version: {sklearn.__version__}")
from sklearn import metrics

keras version: 2.2.4
tensorflow version: 1.12.0
scikit-learn version: 0.20.0


# Load MNIST Dataset <a class="anchor" id="load"></a>

In this example we will use the MNIST dataset from keras. The images have shape Nx28x28 (N images of 28x28 pixels), and the dataset is already nicely split into training and testing data.

In [3]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print("Training data shape")
print(X_train.shape)
print("Testing data shape")
print(X_test.shape)

Training data shape
(60000, 28, 28)
Testing data shape
(10000, 28, 28)


## Visualize the training data <a class="anchor" id="view"></a>

Let's view a few examples of the handwritten images

In [4]:
fig = plt.figure()
for i in range(8):
    ax = fig.add_subplot(2,4,i+1)
    ax.axis('off')
    ax.imshow(X_train[i], cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title("Training: {}".format(y_train[i]))
plt.show()


# Preprocess the data <a class="anchor" id="preprocess"></a>

Before we can start to build and train our neural network, we first need to preprocess the data so that the keras framework can understand and interpret it

## Reshape and normalize image data <a class="anchor" id="reshape"></a>

"Typical" image data is RGB and would have a shape Nx28x28x3. This data is greyscale, so the shape needs to be Nx28x28x1. Our data is only Nx28x28, so we need to add an additional dimension for Keras to handle it appropriately.

We will also convert to float32 and normalize the inputs to [0,1]. While this isn't strictly necessary, normalizing input data tends to improve performance.

In [5]:
# reshape the data
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)
# cast to float32
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
# normalize inputs
X_train /= 255
X_test /= 255

## Process the labels <a class="anchor" id="labels"></a>

As provided, the labels are directly associated with the data, i.e.

`y_train[9] = 4 # the 10th data point is an image of a "4"`

Because the output of our neural network will be 10 neurons, each associated with a digit, we need the labels to be one-hot encoded: rather than `y_train[9] = 4`, we need `y_train[9] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]`

We use `np_utils.to_categorical` to achieve this

In [6]:
y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)
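
To make the conversion concrete, here is a minimal NumPy sketch of what one-hot encoding does. It only illustrates the idea and is not the Keras implementation; the helper name `one_hot` and the sample labels are made up for this example.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Put a 1 in the column given by each label.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

sample_labels = [4, 0, 9]            # hypothetical labels
encoded = one_hot(sample_labels, 10)
print(encoded[0])                    # the row for the label 4

# argmax undoes the encoding; the evaluation cells rely on this
decoded = encoded.argmax(axis=1)
```

The `argmax(axis=1)` step is the same trick used later in the notebook to turn one-hot labels and network outputs back into digits.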

# Flat Neural Network <a class="anchor" id="simple"></a>

Now we will define the architecture of our model. 

## Define model <a class="anchor" id="model"></a>

Let's start with an extremely simple Neural Net. Similar to the SVM example, we will operate on a flat array, so that

In [7]:
fig = plt.figure(figsize=(3,3))
ax = fig.add_subplot(111)
# ax.axis('off')
ax.imshow(np.flipud(X_train[0].reshape(28, 28)), cmap=plt.cm.gray_r,
          interpolation='nearest')
ax.set_xlim((0, 28))
ax.set_xticks([i for i in range(0, 28, 4)])
ax.set_ylim((0, 28))
ax.set_yticks([i for i in range(0, 28, 4)])
plt.show()


becomes

In [8]:
fig = plt.figure(figsize=(6,3))
ax = fig.add_subplot(111)
# ax.axis('off')
# ax.plot(X_train[0].reshape(-1))
ax.scatter(np.arange(X_train[0].shape[0]*X_train[0].shape[1]),
           X_train[0].reshape(-1), c=np.abs(X_train[0].reshape(-1)),
           cmap='Greys', edgecolor='none', marker='s',
           vmin=0, vmax=1)
ax.set_xlim((0, 28*28))
ax.set_ylim((0, 1))
plt.show()


### First Layer: Flat

We start with a flat input layer:

In [8]:
model = Sequential()
model.add(Flatten(input_shape=(28, 28, 1)))

### Second Layer: Dense

The second layer is a dense (fully connected) layer. We will only use 16 nodes. We will also use one of the simplest nonlinear activation functions, sigmoid.

**NOTE:** We are taking an input of $28*28=784$ nodes and piping those into only $16$ nodes. This is done on purpose to limit the effectiveness of the neural network for demonstration purposes.

In [9]:
model.add(Dense(16, activation='sigmoid'))
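
As a rough illustration of what this layer computes, the sketch below pushes one flattened image through 16 sigmoid nodes using plain NumPy. The random input, weights, and biases are placeholders invented for the example; Keras initializes and trains its own weights.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
x = rng.rand(784)                             # stand-in for one flattened 28x28 image
W = rng.normal(scale=0.05, size=(784, 16))    # hypothetical weights, one column per node
b = np.zeros(16)                              # hypothetical biases, one per node

# Each hidden node takes a weighted sum of all 784 inputs, then applies the sigmoid.
hidden = sigmoid(x @ W + b)
print(hidden.shape)
```

Every one of the 784 inputs is connected to every one of the 16 nodes, which is what "dense" (fully connected) means.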

### Output Layer:

The final layer consists of 10 nodes, corresponding to the 10 different digits. We use the `softmax` activation function, which turns the 10 outputs into probabilities that sum to 1, so the network is pushed towards a single strongly active node for each image (this makes sense because each image can only be a single number).

In [10]:
model.add(Dense(10, activation='softmax'))
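
The following NumPy sketch shows what `softmax` does to a vector of 10 raw scores. The scores are made up, and the function is a simplified stand-in rather than the Keras code.

```python
import numpy as np

def softmax(z):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = z - np.max(z)      # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

# made-up raw outputs of the 10 nodes, one per digit 0..9
scores = np.array([1.0, 2.0, 0.5, 4.0, 0.0, 1.5, 0.2, 0.1, 3.0, 0.3])
probs = softmax(scores)

# The predicted digit is the node with the largest probability.
predicted_digit = int(np.argmax(probs))
print(predicted_digit, probs.sum())
```

Note that every node keeps a non-zero probability; softmax only makes the largest score stand out, it does not switch the others off.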

### Compile model:

We need to set a few more parameters:

1. Loss function: this defines the function that will measure the performance of the neural network. Many loss functions are available. Because we are classifying (categorizing), we will use the `categorical_crossentropy` loss function
2. Optimizer: Defines the method the neural network will use to optimize its performance. We will use the simple `Stochastic Gradient Descent`
3. Metrics: `accuracy` is the basic metric, and it is all we will use here

In [11]:
model.compile(loss='categorical_crossentropy', optimizer="sgd", metrics=['accuracy'])
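
To get a feeling for the loss function, here is a small NumPy sketch of categorical cross-entropy on made-up predictions for a 3-class problem. It follows the textbook definition (the mean over samples of $-\sum_k y_k \log p_k$); the clipping constant `eps` is an assumption of this sketch.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)    # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# two samples with one-hot labels (classes 2 and 0)
y_true = np.array([[0, 0, 1], [1, 0, 0]], dtype=float)

# made-up predictions: one confident and correct, one hesitant
confident = np.array([[0.05, 0.05, 0.90], [0.80, 0.10, 0.10]])
unsure = np.array([[0.30, 0.30, 0.40], [0.40, 0.30, 0.30]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

Both sets of predictions have 100% accuracy, yet the hesitant one has a much larger loss. This is why the loss curve can keep changing after the accuracy curve has flattened.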

### Model Summary:

In [12]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                12560     
_________________________________________________________________
dense_2 (Dense)              (None, 10)                170       
Total params: 12,730
Trainable params: 12,730
Non-trainable params: 0
_________________________________________________________________
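
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input/node pair plus one bias per node. A small sketch (the helper name `dense_params` is made up for this example):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(28 * 28, 16)   # flattened image -> 16 nodes
output = dense_params(16, 10)        # 16 nodes -> 10 digits
total = hidden + output
print(hidden, output, total)         # 12560 170 12730
```

The `Flatten` layer only rearranges the data, so it contributes no parameters.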


### Fit the model

Now, we will actually train the model! We specify the number of samples to be collected before updating the network weights (`batch_size`) and the number of times we iterate over the entire dataset (`epochs`).

In [13]:
history = model.fit(X_train, y_train, batch_size=32, epochs=100, verbose=1,
                    validation_split=0.2, shuffle=True)

Train on 48000 samples, validate on 12000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/1
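
The numbers in the first line of the log follow directly from the arguments to `fit`. The sketch below redoes that arithmetic and also works out how many weight updates happen per epoch; it is plain bookkeeping, not a call into Keras.

```python
import math

n_samples = 60000          # size of X_train
validation_split = 0.2     # fraction held out for validation
batch_size = 32

n_val = round(n_samples * validation_split)
n_train = n_samples - n_val

# One weight update per batch, so this many updates per epoch.
updates_per_epoch = math.ceil(n_train / batch_size)
print(n_train, n_val, updates_per_epoch)   # 48000 12000 1500
```

So 100 epochs at this batch size correspond to 150,000 weight updates.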

## Evaluate model <a class="anchor" id="simple-evaluate"></a>

Let's consider the performance of the model

### Plot the training progress

Note that I continued to increase the number of epochs until I hit a plateau

In [14]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['acc'])
ax.plot(history.history['val_acc'])
ax.set_title('model accuracy')
ax.set_ylabel('accuracy')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()
# "Loss"
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['loss'])
ax.plot(history.history['val_loss'])
ax.set_title('model loss')
ax.set_ylabel('loss')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()


### Question:

Does the crossover of training and validation indicate overfitting?

In [15]:
score = model.evaluate(X_test, y_test)
print(f"loss = {score[0]}, accuracy={score[1]}")

loss = 0.20429219291806222, accuracy=0.9422


### Analyze the confusion matrix

A nice way to understand and analyze the performance of a classification ML tool is a confusion matrix. Such a matrix shows how often each actual digit is predicted as each possible digit: correct predictions lie on the diagonal, and misclassifications lie off the diagonal.

In [30]:
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm_norm = cm_norm.T

fig = plt.figure()
ax = fig.add_subplot(111)
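# imshow draws rows along the y-axis, so transpose to put the actual digit on y
# and the predicted digit on x, matching the axis labels and the annotations
cax = ax.imshow(cm_norm.T, interpolation='nearest')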

ax.set_xticks(np.arange(0, 10, 1))
ax.set_yticks(np.arange(0, 10, 1))
ax.set_xticklabels(np.arange(0, 10, 1))
ax.set_yticklabels(np.arange(0, 10, 1))
ax.set_xticks(np.arange(-0.5, 10, 1), minor=True)
ax.set_yticks(np.arange(-0.5, 10, 1), minor=True)
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

# label each cell with its value (black text on the diagonal, white elsewhere)
for i in range(cm_norm.shape[1]):
    for j in range(cm_norm.shape[0]):
        l_color = "black" if i == j else "white"
        ax.annotate(str(np.round(cm_norm[i, j], decimals=2)), xy=(i-0.4, j+0.2), color=l_color)

cbar = fig.colorbar(cax, ticks=[i for i in np.arange(0, 1, 0.1)])

ax.set_xlabel("Predicted Digit")
ax.set_ylabel("Actual Digit")
ax.set_title(r"Confusion Matrix for ANN Classifier")

plt.show()
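
If the construction of the matrix is unclear, the toy example below builds one by hand for a made-up 3-class problem and normalizes each row, which is the same normalization as `cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]` in the cell above. The helper `toy_confusion_matrix` is written for this illustration only; in the notebook itself `metrics.confusion_matrix` does this job.

```python
import numpy as np

def toy_confusion_matrix(y_true, y_pred, num_classes):
    """Entry [i, j] counts samples of actual class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]   # made-up actual classes
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0]   # made-up predictions
cm = toy_confusion_matrix(y_true, y_pred, 3)

# Divide each row by its total so that every row sums to 1.
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(cm)
print(cm_norm)
```

After the normalization, each diagonal entry is the fraction of that class which was classified correctly.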


## Summary <a class="anchor" id="simple-summary"></a>

Our very simple neural net has an overall accuracy of around 94% after 100 epochs, detailed here:

In [31]:
print("Classification report for classifier %s:\n%s\n"
      % (model, metrics.classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
print("Confusion matrix:\n%s" % cm)

Classification report for classifier <keras.engine.sequential.Sequential object at 0x7fa6130fcf60>:
              precision    recall  f1-score   support

           0       0.96      0.98      0.97       980
           1       0.98      0.98      0.98      1135
           2       0.94      0.93      0.94      1032
           3       0.92      0.93      0.93      1010
           4       0.94      0.94      0.94       982
           5       0.94      0.92      0.93       892
           6       0.94      0.95      0.95       958
           7       0.94      0.94      0.94      1028
           8       0.93      0.92      0.92       974
           9       0.93      0.92      0.92      1009

   micro avg       0.94      0.94      0.94     10000
   macro avg       0.94      0.94      0.94     10000
weighted avg       0.94      0.94      0.94     10000


Confusion matrix:
[[ 964    0    2    1    0    2    6    1    4    0]
 [   0 1115    3    4    0    2    4    2    5    0]
 [  10    4  961

Let's try to dig in a bit (see [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support) for more information)

#### Precision

Precision is $t_p \; / \; \left(t_p + f_p\right)$, the ratio of true positives to all predicted positives. This is the ability to avoid a false positive

#### Recall

Recall is $t_p \; / \; \left(t_p + f_n\right)$, the ratio of true positives to all values that should be positive. This is the ability to find all of the positive samples.

#### F-score

The f1-score is the harmonic mean of precision and recall. A perfect score is 1, and its worst is 0. The more general F-beta score is a weighted harmonic mean whose weighting is adjusted with the `beta` parameter. `beta` defaults to 1, meaning recall and precision are equally important; with `beta > 1` the score weights recall more than precision by a factor of `beta`.
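
These definitions can be applied directly to a confusion matrix. The sketch below does so for a made-up 3-class matrix, with rows as actual classes and columns as predicted classes. The function name and the numbers are invented for the example; in the notebook the values come from `metrics.classification_report`.

```python
import numpy as np

def precision_recall_fbeta(cm, beta=1.0):
    """Per-class precision, recall and F-beta from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp    # predicted as the class, but actually another
    fn = cm.sum(axis=1) - tp    # actually the class, but predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, fbeta

toy_cm = [[9, 1, 0],
          [2, 6, 2],
          [0, 0, 10]]   # made-up counts
precision, recall, f1 = precision_recall_fbeta(toy_cm)
print(precision, recall, f1)
```

Class 1 in this toy matrix has higher precision than recall: when the classifier says "1" it is usually right, but it misses many of the actual 1s.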

While we are alright overall, and better than the out-of-the-box SVM, we can do better

In the convolutional models later in this notebook, we will also specify 32 filters to train, i.e. 32 different "patterns" that will be learned.

## Flat improvement: ADAM optimizer <a class="anchor" id="simple-adam"></a>

### First, an example of overfitting

Notice how the validation performance starts to decline after approx. 20 epochs.

In [33]:
model = Sequential()
model.add(Flatten(input_shape=(28, 28, 1)))
# model.add(Dense(28*28, activation='relu'))
model.add(Dense(16, activation='sigmoid'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
history = model.fit(X_train, y_train, batch_size=32, epochs=100, verbose=1,
                    validation_split=0.2, shuffle=True)
model.summary()

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['acc'])
ax.plot(history.history['val_acc'])
ax.set_title('model accuracy')
ax.set_ylabel('accuracy')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()
# "Loss"
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['loss'])
ax.plot(history.history['val_loss'])
ax.set_title('model loss')
ax.set_ylabel('loss')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()

score = model.evaluate(X_test, y_test)
print(score)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm_norm = cm_norm.T

fig = plt.figure()
ax = fig.add_subplot(111)
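# imshow draws rows along the y-axis, so transpose to put the actual digit on y
# and the predicted digit on x, matching the axis labels and the annotations
cax = ax.imshow(cm_norm.T, interpolation='nearest')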

ax.set_xticks(np.arange(0, 10, 1))
ax.set_yticks(np.arange(0, 10, 1))
ax.set_xticklabels(np.arange(0, 10, 1))
ax.set_yticklabels(np.arange(0, 10, 1))
ax.set_xticks(np.arange(-0.5, 10, 1), minor=True)
ax.set_yticks(np.arange(-0.5, 10, 1), minor=True)
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

# label each cell with its value (black text on the diagonal, white elsewhere)
for i in range(cm_norm.shape[1]):
    for j in range(cm_norm.shape[0]):
        l_color = "black" if i == j else "white"
        ax.annotate(str(np.round(cm_norm[i, j], decimals=2)), xy=(i-0.4, j+0.2), color=l_color)

cbar = fig.colorbar(cax, ticks=[i for i in np.arange(0, 1, 0.1)])

ax.set_xlabel("Predicted Digit")
ax.set_ylabel("Actual Digit")
ax.set_title(r"Confusion Matrix for ANN Classifier")

plt.show()

print("Classification report for classifier %s:\n%s\n"
      % (model, metrics.classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
print("Confusion matrix:\n%s" % cm)

Train on 48000 samples, validate on 12000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/1


[0.23977845682073384, 0.9412]



Classification report for classifier <keras.engine.sequential.Sequential object at 0x7fa604113cc0>:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97       980
           1       0.97      0.98      0.98      1135
           2       0.94      0.91      0.93      1032
           3       0.91      0.92      0.92      1010
           4       0.93      0.94      0.94       982
           5       0.92      0.91      0.92       892
           6       0.94      0.97      0.96       958
           7       0.95      0.94      0.95      1028
           8       0.91      0.94      0.93       974
           9       0.95      0.91      0.93      1009

   micro avg       0.94      0.94      0.94     10000
   macro avg       0.94      0.94      0.94     10000
weighted avg       0.94      0.94      0.94     10000


Confusion matrix:
[[ 952    0    5    3    3    0   13    1    3    0]
 [   0 1113    7    1    0    0    3    1   10    0]
 [   9    3  944

### Avoid overfitting

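Use early stopping: the `EarlyStopping` callback monitors `val_loss` and halts training once it has failed to improve by at least `min_delta` for `patience` consecutive epochs.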
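
The logic behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration of the patience rule, not the Keras source; the function name and the list of validation losses are made up, while the default values mirror the `min_delta` and `patience` used in the next cell.

```python
def early_stopping_epoch(val_losses, min_delta=0.0001, patience=3):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # A real improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once this has happened `patience` times in a row.
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.705, 0.6]   # hypothetical val_loss per epoch
stop_epoch = early_stopping_epoch(losses)
never_stops = early_stopping_epoch([0.5, 0.4, 0.3, 0.2])
print(stop_epoch, never_stops)
```

In the made-up sequence the best loss is reached at epoch 2, and three epochs without improvement follow, so training stops at epoch 5 and never sees the later improvement. That is the trade-off controlled by `patience`.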

In [34]:

early_stop = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=3, verbose=1)
model = Sequential()
model.add(Flatten(input_shape=(28, 28, 1)))
# model.add(Dense(28*28, activation='relu'))
model.add(Dense(16, activation='sigmoid'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
history = model.fit(X_train, y_train, batch_size=32, epochs=100, verbose=1,
                    validation_split=0.2, shuffle=True, callbacks=[early_stop])
model.summary()

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['acc'])
ax.plot(history.history['val_acc'])
ax.set_title('model accuracy')
ax.set_ylabel('accuracy')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()
# "Loss"
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['loss'])
ax.plot(history.history['val_loss'])
ax.set_title('model loss')
ax.set_ylabel('loss')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()

score = model.evaluate(X_test, y_test)
print(score)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm_norm = cm_norm.T

fig = plt.figure()
ax = fig.add_subplot(111)
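# imshow draws rows along the y-axis, so transpose to put the actual digit on y
# and the predicted digit on x, matching the axis labels
cax = ax.imshow(cm_norm.T, interpolation='nearest')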

ax.set_xticks(np.arange(0, 10, 1))
ax.set_yticks(np.arange(0, 10, 1))
ax.set_xticklabels(np.arange(0, 10, 1))
ax.set_yticklabels(np.arange(0, 10, 1))
ax.set_xticks(np.arange(-0.5, 10, 1), minor=True)
ax.set_yticks(np.arange(-0.5, 10, 1), minor=True)
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

cbar = fig.colorbar(cax, ticks=[i for i in np.arange(0, 1, 0.1)])

ax.set_xlabel("Predicted Digit")
ax.set_ylabel("Actual Digit")
ax.set_title(r"Confusion Matrix for ANN Classifier")

plt.show()

print("Classification report for classifier %s:\n%s\n"
      % (model, metrics.classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
print("Confusion matrix:\n%s" % cm)

Train on 48000 samples, validate on 12000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 00021: early stopping
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_4 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 16)                12560     
_________________________________________________________________
dense_8 (Dense)              (None, 10)                170       
Total params: 12,730
Trainable params: 12,730
Non-trainable params: 0
_________________________________________________________________



[0.19261997017264365, 0.9445]



Classification report for classifier <keras.engine.sequential.Sequential object at 0x7fa6040d9a20>:
              precision    recall  f1-score   support

           0       0.96      0.97      0.97       980
           1       0.97      0.99      0.98      1135
           2       0.94      0.94      0.94      1032
           3       0.95      0.92      0.93      1010
           4       0.95      0.94      0.94       982
           5       0.91      0.92      0.91       892
           6       0.94      0.96      0.95       958
           7       0.95      0.95      0.95      1028
           8       0.93      0.93      0.93       974
           9       0.93      0.93      0.93      1009

   micro avg       0.94      0.94      0.94     10000
   macro avg       0.94      0.94      0.94     10000
weighted avg       0.94      0.94      0.94     10000


Confusion matrix:
[[ 953    0    3    1    2    4   12    0    3    2]
 [   0 1121    5    0    0    2    2    1    4    0]
 [   8    1  971

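While the overall test accuracy of the model improves only a little bit ($0.942 \to 0.945$ in the runs shown above), we see quite an improvement in training time by using the Adam optimizer with early stopping: training halted after 21 epochs instead of running the full 100 (about 1/5 of the epochs, so roughly 5x faster).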

Can we do better using more nodes?

## More flat improvements: Increase number of second layer nodes <a class="anchor" id="flat-dense"></a>

Let's use 64 nodes.

In [35]:
model = Sequential()
model.add(Flatten(input_shape=(28, 28, 1)))
model.add(Dense(64, activation='sigmoid'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=3, verbose=1)

history = model.fit(X_train, y_train, batch_size=32, epochs=100, verbose=1,
                    validation_split=0.2, shuffle=True, callbacks=[early_stop])
model.summary()

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['acc'])
ax.plot(history.history['val_acc'])
ax.set_title('model accuracy')
ax.set_ylabel('accuracy')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()
# "Loss"
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['loss'])
ax.plot(history.history['val_loss'])
ax.set_title('model loss')
ax.set_ylabel('loss')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()

score = model.evaluate(X_test, y_test)
print(score)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm_norm = cm_norm.T

fig = plt.figure()
ax = fig.add_subplot(111)
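# imshow draws rows along the y-axis, so transpose to put the actual digit on y
# and the predicted digit on x, matching the axis labels
cax = ax.imshow(cm_norm.T, interpolation='nearest')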

ax.set_xticks(np.arange(0, 10, 1))
ax.set_yticks(np.arange(0, 10, 1))
ax.set_xticklabels(np.arange(0, 10, 1))
ax.set_yticklabels(np.arange(0, 10, 1))
ax.set_xticks(np.arange(-0.5, 10, 1), minor=True)
ax.set_yticks(np.arange(-0.5, 10, 1), minor=True)
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

cbar = fig.colorbar(cax, ticks=[i for i in np.arange(0, 1, 0.1)])

ax.set_xlabel("Predicted Digit")
ax.set_ylabel("Actual Digit")
ax.set_title(r"Confusion Matrix for ANN Classifier")

plt.show()

print("Classification report for classifier %s:\n%s\n"
      % (model, metrics.classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
print("Confusion matrix:\n%s" % cm)

Train on 48000 samples, validate on 12000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 00019: early stopping
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_5 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 64)                50240     
_________________________________________________________________
dense_10 (Dense)             (None, 10)                650       
Total params: 50,890
Trainable params: 50,890
Non-trainable params: 0
_________________________________________________________________



[0.09160591517440043, 0.9721]



Classification report for classifier <keras.engine.sequential.Sequential object at 0x7fa5ac645b00>:
              precision    recall  f1-score   support

           0       0.98      0.99      0.98       980
           1       0.98      0.99      0.99      1135
           2       0.96      0.97      0.96      1032
           3       0.97      0.97      0.97      1010
           4       0.98      0.97      0.98       982
           5       0.97      0.96      0.96       892
           6       0.97      0.97      0.97       958
           7       0.97      0.97      0.97      1028
           8       0.96      0.97      0.97       974
           9       0.97      0.96      0.97      1009

   micro avg       0.97      0.97      0.97     10000
   macro avg       0.97      0.97      0.97     10000
weighted avg       0.97      0.97      0.97     10000


Confusion matrix:
[[ 966    0    4    2    0    1    3    2    2    0]
 [   0 1124    3    1    0    0    3    0    4    0]
 [   5    3  999

We are now up to 97% accuracy in 19 epochs!

That's very impressive...but we can do much better with a convolutional model

# Simple Convolutional Model <a class="anchor" id="conv-simple"></a>

Let's start with a very simple convolutional model:

1. Convolutional Layer to handle the image
2. Output Layer

## Convolution Animation <a class="anchor" id="conv-ani"></a>

We will be using a multi-layer neural net to learn handwritten digits. Because the input images are 2D, and because convolutional layers are relatively strong at learning from images, we will use convolutional layers in our model.

By passing a 3x3 kernel over the 28x28 image, we end up with an output image of 26x26, because the kernel fits in only 26 distinct positions along each axis (the animation below shows how this comes about).

In [39]:
import matplotlib.animation as animation

fig = plt.figure()
ax = fig.add_subplot(111)
im_arr = X_train[0].reshape(28, 28)
ax.imshow(im_arr, extent=(0, 28, 0, 28), cmap=plt.cm.gray_r, vmin=0, vmax=1)
ax.set_title("Convolutional Kernel Animation", fontsize=24)
kernel_arr = np.zeros((28, 28, 4))
kernel_arr[:3, :3, :] = [0.3, 0.5, 0.8, 0.5]
im = ax.imshow(kernel_arr, extent=(0, 28, 0, 28), vmin=0, vmax=1)

ani_i = 0
ani_j = 0

def updatefig(*args):
    # only the kernel position needs to persist between frames
    global ani_i, ani_j
    # increment j
    ani_j += 1
    # increment i once j resets
    ani_i = ani_i + 1 if (ani_j%26 == 0) else ani_i
    # handle periodic boundaries
    ani_j = ani_j % 26
    ani_i = ani_i % 26
    kernel_arr = np.zeros((28, 28, 4))
    kernel_arr[ani_i:ani_i+3, ani_j:ani_j+3, :] = [0.3, 0.5, 0.8, 0.5]
    im.set_array(kernel_arr)
    return im,

# display animation
ani = animation.FuncAnimation(fig, updatefig, blit=True, interval=20, repeat=True)
ax.set_xlim((0, 28))
ax.set_ylim((0, 28))
plt.show()
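
The 26x26 output size can also be checked numerically. The sketch below slides a 3x3 kernel over a random 28x28 array at every position where the kernel fits entirely inside the image (no padding, stride 1), which is the situation shown in the animation. The helper names, the random image, and the averaging kernel are all made up for illustration; a real convolutional layer learns its kernel values.

```python
import numpy as np

def conv_output_size(n, kernel, stride=1, padding=0):
    """Number of positions a kernel can take along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def valid_conv2d(image, kernel):
    """Apply the kernel wherever it fits entirely inside the image."""
    kh, kw = kernel.shape
    out_h = conv_output_size(image.shape[0], kh)
    out_w = conv_output_size(image.shape[1], kw)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # elementwise product of the kernel with the patch under it, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.RandomState(0).rand(28, 28)   # stand-in for one digit
kernel = np.ones((3, 3)) / 9.0                  # hypothetical averaging kernel
feature_map = valid_conv2d(image, kernel)
print(feature_map.shape)                        # (26, 26)
```

Each of the 32 filters in the models below produces one such 26x26 map, which is where the `(None, 26, 26, 32)` output shape in the model summary comes from.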


### Sigmoid Activation

In [25]:
model = Sequential()
model.add(Convolution2D(32, (3, 3), activation='sigmoid', input_shape=(28, 28, 1)))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=3, verbose=1)

history = model.fit(X_train, y_train, batch_size=32, epochs=100, verbose=1,
                    validation_split=0.2, shuffle=True, callbacks=[early_stop])
model.summary()

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['acc'])
ax.plot(history.history['val_acc'])
ax.set_title('model accuracy')
ax.set_ylabel('accuracy')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()
# "Loss"
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['loss'])
ax.plot(history.history['val_loss'])
ax.set_title('model loss')
ax.set_ylabel('loss')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()

score = model.evaluate(X_test, y_test)
print(score)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm_norm = cm_norm.T

fig = plt.figure()
ax = fig.add_subplot(111)
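# imshow draws rows along the y-axis, so transpose to put the actual digit on y
# and the predicted digit on x, matching the axis labels
cax = ax.imshow(cm_norm.T, interpolation='nearest')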

ax.set_xticks(np.arange(0, 10, 1))
ax.set_yticks(np.arange(0, 10, 1))
ax.set_xticklabels(np.arange(0, 10, 1))
ax.set_yticklabels(np.arange(0, 10, 1))
ax.set_xticks(np.arange(-0.5, 10, 1), minor=True)
ax.set_yticks(np.arange(-0.5, 10, 1), minor=True)
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

cbar = fig.colorbar(cax, ticks=[i for i in np.arange(0, 1, 0.1)])

ax.set_xlabel("Predicted Digit")
ax.set_ylabel("Actual Digit")
ax.set_title(r"Confusion Matrix for ANN Classifier")

plt.show()

print("Classification report for classifier %s:\n%s\n"
      % (model, metrics.classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
print("Confusion matrix:\n%s" % cm)

Train on 48000 samples, validate on 12000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 00020: early stopping
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
flatten_8 (Flatten)          (None, 21632)             0         
_________________________________________________________________
dense_12 (Dense)             (None, 10)                216330    
Total params: 216,650
Trainable params: 216,650
Non-trainable params: 0
_________________________________________________________________



[0.07410266082277521, 0.9773]



Classification report for classifier <keras.engine.sequential.Sequential object at 0x7f2d5839ef98>:
              precision    recall  f1-score   support

           0       0.98      0.99      0.98       980
           1       0.99      0.99      0.99      1135
           2       0.99      0.97      0.98      1032
           3       0.99      0.95      0.97      1010
           4       0.99      0.98      0.98       982
           5       0.94      0.99      0.96       892
           6       0.97      0.98      0.98       958
           7       0.99      0.97      0.98      1028
           8       0.97      0.97      0.97       974
           9       0.97      0.98      0.98      1009

   micro avg       0.98      0.98      0.98     10000
   macro avg       0.98      0.98      0.98     10000
weighted avg       0.98      0.98      0.98     10000


Confusion matrix:
[[ 968    0    2    0    0    3    5    1    1    0]
 [   0 1126    0    1    1    2    3    0    2    0]
 [   3    4 1000

### tanh activation

In [26]:
model = Sequential()
model.add(Convolution2D(32, (3, 3), activation='tanh', input_shape=(28, 28, 1)))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', min_delta=0.00001, patience=5, verbose=1)

history = model.fit(X_train, y_train, batch_size=32, epochs=100, verbose=1,
                    validation_split=0.2, shuffle=True, callbacks=[early_stop])
model.summary()

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['acc'])
ax.plot(history.history['val_acc'])
ax.set_title('model accuracy')
ax.set_ylabel('accuracy')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()
# "Loss"
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['loss'])
ax.plot(history.history['val_loss'])
ax.set_title('model loss')
ax.set_ylabel('loss')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()

score = model.evaluate(X_test, y_test)
print(score)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
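# normalize each row so entry [i, j] is the fraction of actual digit i predicted as j
# (no transpose: imshow puts rows on the y-axis, matching the "Actual Digit" label)
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]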

fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.imshow(cm_norm, interpolation='nearest')

ax.set_xticks(np.arange(0, 10, 1))
ax.set_yticks(np.arange(0, 10, 1))
ax.set_xticklabels(np.arange(0, 10, 1))
ax.set_yticklabels(np.arange(0, 10, 1))
ax.set_xticks(np.arange(-0.5, 10, 1), minor=True)
ax.set_yticks(np.arange(-0.5, 10, 1), minor=True)
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

cbar = fig.colorbar(cax, ticks=[i for i in np.arange(0, 1, 0.1)])

ax.set_xlabel("Predicted Digit")
ax.set_ylabel("Actual Digit")
ax.set_title(r"Confusion Matrix for ANN Classifier")

plt.show()

print("Classification report for classifier %s:\n%s\n"
      % (model, metrics.classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
print("Confusion matrix:\n%s" % cm)

Train on 48000 samples, validate on 12000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 00028: early stopping
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
flatten_9 (Flatten)          (None, 21632)             0         
_________________________________________________________________
dense_13 (Dense)             (None, 10)                216330    
Total params: 216,650
Trainable params: 216,650
Non-trainable params: 0
_________________________________


[0.15869283981304616, 0.9588]



Classification report for classifier <keras.engine.sequential.Sequential object at 0x7f2dc81b5cf8>:
              precision    recall  f1-score   support

           0       0.96      0.98      0.97       980
           1       0.98      0.99      0.98      1135
           2       0.91      0.97      0.94      1032
           3       0.95      0.96      0.96      1010
           4       0.97      0.97      0.97       982
           5       0.98      0.95      0.96       892
           6       0.97      0.96      0.97       958
           7       0.97      0.94      0.96      1028
           8       0.96      0.91      0.93       974
           9       0.94      0.95      0.95      1009

   micro avg       0.96      0.96      0.96     10000
   macro avg       0.96      0.96      0.96     10000
weighted avg       0.96      0.96      0.96     10000


Confusion matrix:
[[ 961    0    6    0    2    2    5    1    3    0]
 [   0 1118    5    1    0    0    1    1    8    1]
 [   1    7  999

### relu activation

In [27]:
model = Sequential()
model.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', min_delta=0.00001, patience=5, verbose=1)

history = model.fit(X_train, y_train, batch_size=32, epochs=100, verbose=1,
                    validation_split=0.2, shuffle=True, callbacks=[early_stop])
model.summary()

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['acc'])
ax.plot(history.history['val_acc'])
ax.set_title('model accuracy')
ax.set_ylabel('accuracy')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()
# "Loss"
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['loss'])
ax.plot(history.history['val_loss'])
ax.set_title('model loss')
ax.set_ylabel('loss')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()

score = model.evaluate(X_test, y_test)
print(score)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
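# normalize each row so entry [i, j] is the fraction of actual digit i predicted as j
# (no transpose: imshow puts rows on the y-axis, matching the "Actual Digit" label)
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]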

fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.imshow(cm_norm, interpolation='nearest')

ax.set_xticks(np.arange(0, 10, 1))
ax.set_yticks(np.arange(0, 10, 1))
ax.set_xticklabels(np.arange(0, 10, 1))
ax.set_yticklabels(np.arange(0, 10, 1))
ax.set_xticks(np.arange(-0.5, 10, 1), minor=True)
ax.set_yticks(np.arange(-0.5, 10, 1), minor=True)
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

cbar = fig.colorbar(cax, ticks=[i for i in np.arange(0, 1, 0.1)])

ax.set_xlabel("Predicted Digit")
ax.set_ylabel("Actual Digit")
ax.set_title(r"Confusion Matrix for ANN Classifier")

plt.show()

print("Classification report for classifier %s:\n%s\n"
      % (model, metrics.classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
print("Confusion matrix:\n%s" % cm)

Train on 48000 samples, validate on 12000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 00010: early stopping
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_6 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
flatten_10 (Flatten)         (None, 21632)             0         
_________________________________________________________________
dense_14 (Dense)             (None, 10)                216330    
Total params: 216,650
Trainable params: 216,650
Non-trainable params: 0
_________________________________________________________________



[0.085833394788031, 0.9814]



Classification report for classifier <keras.engine.sequential.Sequential object at 0x7f2ba6212f60>:
              precision    recall  f1-score   support

           0       0.98      0.99      0.98       980
           1       0.99      1.00      0.99      1135
           2       0.97      0.98      0.98      1032
           3       0.98      0.98      0.98      1010
           4       0.98      0.99      0.98       982
           5       0.98      0.98      0.98       892
           6       0.99      0.98      0.98       958
           7       0.98      0.98      0.98      1028
           8       0.97      0.98      0.98       974
           9       0.99      0.96      0.98      1009

   micro avg       0.98      0.98      0.98     10000
   macro avg       0.98      0.98      0.98     10000
weighted avg       0.98      0.98      0.98     10000


Confusion matrix:
[[ 967    0    3    1    1    1    4    0    3    0]
 [   0 1130    4    0    0    0    0    1    0    0]
 [   1    2 1013

### leaky relu

In [28]:
from keras.layers import LeakyReLU
model = Sequential()
model.add(Convolution2D(32, (3, 3), input_shape=(28, 28, 1)))
# note: LeakyReLU must be added as its own layer, not passed as the `activation` argument of a layer
model.add(LeakyReLU())
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', min_delta=0.00001, patience=5, verbose=1)

history = model.fit(X_train, y_train, batch_size=32, epochs=100, verbose=1,
                    validation_split=0.2, shuffle=True, callbacks=[early_stop])
model.summary()

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['acc'])
ax.plot(history.history['val_acc'])
ax.set_title('model accuracy')
ax.set_ylabel('accuracy')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()
# "Loss"
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['loss'])
ax.plot(history.history['val_loss'])
ax.set_title('model loss')
ax.set_ylabel('loss')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()

score = model.evaluate(X_test, y_test)
print(score)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
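# normalize each row so entry [i, j] is the fraction of actual digit i predicted as j
# (no transpose: imshow puts rows on the y-axis, matching the "Actual Digit" label)
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]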

fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.imshow(cm_norm, interpolation='nearest')

ax.set_xticks(np.arange(0, 10, 1))
ax.set_yticks(np.arange(0, 10, 1))
ax.set_xticklabels(np.arange(0, 10, 1))
ax.set_yticklabels(np.arange(0, 10, 1))
ax.set_xticks(np.arange(-0.5, 10, 1), minor=True)
ax.set_yticks(np.arange(-0.5, 10, 1), minor=True)
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

cbar = fig.colorbar(cax, ticks=[i for i in np.arange(0, 1, 0.1)])

ax.set_xlabel("Predicted Digit")
ax.set_ylabel("Actual Digit")
ax.set_title(r"Confusion Matrix for ANN Classifier")

plt.show()

print("Classification report for classifier %s:\n%s\n"
      % (model, metrics.classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
print("Confusion matrix:\n%s" % cm)

Train on 48000 samples, validate on 12000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 00011: early stopping
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_7 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 26, 26, 32)        0         
_________________________________________________________________
flatten_11 (Flatten)         (None, 21632)             0         
_________________________________________________________________
dense_15 (Dense)             (None, 10)                216330    
Total params: 216,650
Trainable params: 216,650
Non-trainable params: 0
_________________________________________________________________



[0.09936573145005387, 0.9732]



Classification report for classifier <keras.engine.sequential.Sequential object at 0x7f2ba5a4c208>:
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       980
           1       0.98      0.99      0.99      1135
           2       0.97      0.97      0.97      1032
           3       0.98      0.97      0.98      1010
           4       0.98      0.98      0.98       982
           5       0.98      0.96      0.97       892
           6       0.98      0.98      0.98       958
           7       0.97      0.96      0.96      1028
           8       0.96      0.96      0.96       974
           9       0.97      0.96      0.96      1009

   micro avg       0.97      0.97      0.97     10000
   macro avg       0.97      0.97      0.97     10000
weighted avg       0.97      0.97      0.97     10000


Confusion matrix:
[[ 971    1    2    0    1    0    2    1    2    0]
 [   0 1129    2    0    1    0    1    1    1    0]
 [   3    7 1000

### Comparison

| Activation |  Score | N_Epochs |
|:-----------|--------|----------|
|   Sigmoid  | 0.9791 |    20    |
|    TanH    | 0.9388 |    18    |
|    ReLU    | 0.9805 |     7    |
| Leaky ReLU | 0.9732 |     8    |
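
For reference, the four activation functions compared above can be written out directly. This is a minimal plain-Python sketch, not the Keras implementation; the leaky slope `alpha=0.3` is an assumption chosen to match what is documented as the default of the Keras `LeakyReLU` layer.

```python
import math

def sigmoid(x):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes any real input into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.3):
    """Like ReLU, but negative inputs keep a small slope `alpha` (assumed value)."""
    return x if x > 0 else alpha * x

# Evaluate each activation at a negative, zero, and positive input
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```

The only difference between ReLU and leaky ReLU is the treatment of negative inputs, which is why the two networks above have identical parameter counts.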

# Multi-layer Convolutional Neural Network <a class="anchor" id="conv-multi"></a>

Now let's improve the network by adding more layers (not that it really needs improving).

In [40]:
model = Sequential()
model.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))

## Add more layers

We'll add another convolutional layer. Then a max-pooling layer passes a 2x2 window over the previous layer's output and keeps the maximum of each group of 4 values. Finally, a dropout layer is added to help prevent overfitting. Very briefly, dropout randomly deactivates a fraction of the neurons (here 25%) at each training step, which prevents neurons from co-adapting, that is, from blindly relying on the values passed along by particular other neurons.

In [41]:
model.add(Convolution2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
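
To make the two operations concrete, here is a small plain-Python sketch of 2x2 max pooling and of dropout on toy data. It is an illustration only, not how Keras implements these layers; the helper names `max_pool_2x2` and `dropout` are hypothetical, and the rescaling of surviving values by `1 / (1 - rate)` is an assumption based on the convention described in the Keras `Dropout` documentation.

```python
import random

def max_pool_2x2(image):
    """Take the max over non-overlapping 2x2 windows of a 2-D list."""
    rows, cols = len(image), len(image[0])
    return [
        [max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)
    so the expected sum is unchanged (assumed 'inverted dropout' convention)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

image = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2x2(image)
print(pooled)  # [[4, 2], [2, 8]]

# Training-time behaviour only: at prediction time dropout is switched off
rng = random.Random(0)
print(dropout([1.0] * 8, rate=0.25, rng=rng))
```

Pooling halves each spatial dimension, which is why the 24x24 feature maps become 12x12 in the model summary.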

## Complete the network

We now flatten the network's output into a single dimension, add a fully connected layer, and then our output layer. Because we are classifying, the final layer has as many nodes as we have classes (10). We use the `softmax` activation function so that the outputs form a probability distribution over the classes (non-negative values that sum to 1); the predicted digit is the node with the highest probability.

In [42]:
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
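
As a quick illustration of what the final `softmax` layer computes, the sketch below applies the softmax formula to three made-up scores in plain Python. The input values are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something taken from this notebook.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Shift by the max so exp() never overflows; the result is unchanged
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
print(probs)
print(sum(probs))

# The predicted class is the index with the highest probability,
# which is what y_pred.argmax(axis=1) does for each row of predictions
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted)
```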

# Compile the model

Declare the loss function, optimizer, etc.

In [43]:
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
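
The `categorical_crossentropy` loss compares the one-hot label with the predicted probabilities. The sketch below computes it for a single sample in plain Python; the probability vectors are made up for illustration, and the `eps` clipping value is an assumption to avoid `log(0)`.

```python
import math

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """-sum_k y_true[k] * log(y_prob[k]) for one sample."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))

y_true = [0.0, 0.0, 1.0]          # one-hot label: the true class is 2
confident = [0.05, 0.05, 0.90]    # hypothetical confident, correct prediction
unsure = [0.30, 0.30, 0.40]       # hypothetical hesitant prediction

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident)
print(loss_unsure)
```

Because the label is one-hot, only the probability assigned to the true class contributes, so the loss shrinks toward 0 as that probability approaches 1.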

## Train the network/Fit the data

Train the network and save; training takes O(10 min) on a CPU, and O(2 min) on a GPU.

In [44]:
# Set load = True to reuse a previously saved model instead of retraining
import os

load = False
if not (os.path.exists("keras_mnist_ann.h5") and os.path.exists("keras_mnist_ann.json") and load):
    early_stop = EarlyStopping(monitor='val_loss', min_delta=0.00001, patience=5, verbose=1)

    history = model.fit(X_train, y_train, batch_size=32, epochs=100, verbose=1,
                        validation_split=0.2, shuffle=True, callbacks=[early_stop])
    # save the architecture (JSON) and the weights (HDF5)
    model_json = model.to_json()
    with open("keras_mnist_ann.json", "w") as json_file:
        json_file.write(model_json)
    model.save_weights("keras_mnist_ann.h5")
else:
    # NOTE: `history` is only defined when the model is trained in this session,
    # so the training-curve plots below require the training branch
    with open("keras_mnist_ann.json", "r") as json_file:
        loaded_model_json = json_file.read()
    model = keras.models.model_from_json(loaded_model_json)
    model.load_weights("keras_mnist_ann.h5")
    model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

Train on 48000 samples, validate on 12000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 00013: early stopping
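
The early-stopping rule used above can be sketched in a few lines of plain Python. This is an approximation of the `EarlyStopping` logic written for illustration, not the Keras source; the function name and the validation-loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, min_delta=0.00001, patience=5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement of more than min_delta: remember it, reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value occurs at epoch 3
losses = [0.30, 0.20, 0.15, 0.16, 0.17, 0.15, 0.18, 0.19, 0.20, 0.21]
print(early_stopping_epoch(losses))  # 8
```

Under this rule the best validation loss is seen `patience` epochs before the reported stopping epoch, so a run that stops at epoch 13 with `patience=5` last improved at epoch 8.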


# Evaluate model on test data

In [45]:
model.summary()
score = model.evaluate(X_test, y_test)
print(score)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['acc'])
ax.plot(history.history['val_acc'])
ax.set_title('model accuracy')
ax.set_ylabel('accuracy')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()
# "Loss"
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(history.history['loss'])
ax.plot(history.history['val_loss'])
ax.set_title('model loss')
ax.set_ylabel('loss')
ax.set_xlabel('epoch')
ax.legend(['train', 'validation'], loc='upper left')
plt.show()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 24, 24, 32)        9248      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 12, 12, 32)        0         
_________________________________________________________________
flatten_6 (Flatten)          (None, 4608)              0         
_________________________________________________________________
dense_11 (Dense)             (None, 128)               589952    
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
__________
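
The parameter counts in the summary above can be reproduced by hand: each filter or unit has one weight per input value plus one bias. The helper names below are hypothetical and the sketch is plain arithmetic, not a Keras call.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights per filter (kernel_h * kernel_w * in_channels) plus one bias each."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(n_inputs, n_units):
    """One weight per input plus one bias, for every unit."""
    return (n_inputs + 1) * n_units

print(conv2d_params(3, 3, 1, 32))       # 320    (conv2d_1)
print(conv2d_params(3, 3, 32, 32))      # 9248   (conv2d_2)
print(dense_params(12 * 12 * 32, 128))  # 589952 (dense_11)
```

Almost all of the parameters sit in the first fully connected layer, because it connects every one of the 4608 flattened values to each of its 128 units.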


# View Confusion Matrix

In [46]:
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
# no transpose: rows (y-axis) are actual digits, columns (x-axis) are predicted digits

fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.imshow(cm_norm, interpolation='nearest')

ax.set_xticks(np.arange(0, 10, 1))
ax.set_yticks(np.arange(0, 10, 1))
ax.set_xticklabels(np.arange(0, 10, 1))
ax.set_yticklabels(np.arange(0, 10, 1))
ax.set_xticks(np.arange(-0.5, 10, 1), minor=True)
ax.set_yticks(np.arange(-0.5, 10, 1), minor=True)
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

# annotate each cell; i is the column (predicted digit, x) and j is the row (actual digit, y)
for i, j in ((x, y) for x in range(cm_norm.shape[1]) for y in range(cm_norm.shape[0])):
    if i != j:
        l_color = "white"
    else:
        l_color = "black"
    ax.annotate(str(np.round(cm_norm[j, i], decimals=2)), xy=(i-0.4,j+0.2), color=l_color)

cbar = fig.colorbar(cax, ticks=[i for i in np.arange(0, 1, 0.1)])

ax.set_xlabel("Predicted Digit")
ax.set_ylabel("Actual Digit")
ax.set_title(r"Confusion Matrix for ANN Classifier")

plt.show()

print("Classification report for classifier %s:\n%s\n"
      % (model, metrics.classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
print("Confusion matrix:\n%s" % cm)


Classification report for classifier <keras.engine.sequential.Sequential object at 0x7fa5a5e00128>:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       980
           1       1.00      1.00      1.00      1135
           2       1.00      0.99      0.99      1032
           3       0.99      1.00      0.99      1010
           4       0.99      0.99      0.99       982
           5       0.99      0.99      0.99       892
           6       1.00      0.99      0.99       958
           7       0.98      0.99      0.99      1028
           8       0.99      0.99      0.99       974
           9       0.99      0.98      0.99      1009

   micro avg       0.99      0.99      0.99     10000
   macro avg       0.99      0.99      0.99     10000
weighted avg       0.99      0.99      0.99     10000


Confusion matrix:
[[ 979    0    0    0    0    0    0    1    0    0]
 [   0 1135    0    0    0    0    0    0    0    0]
 [   1    0 1024
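
The precision and recall columns of the classification report follow directly from the confusion matrix, whose rows are actual classes and whose columns are predicted classes. The sketch below works this through on a small made-up 3-class matrix; the numbers and helper names are hypothetical and are not taken from the MNIST results.

```python
def row_normalize(cm):
    """Divide each row by its sum: entry [i][j] = fraction of actual i predicted as j."""
    return [[count / sum(row) for count in row] for row in cm]

def recall(cm, k):
    """Of all samples that are actually class k, the fraction predicted as k."""
    return cm[k][k] / sum(cm[k])

def precision(cm, k):
    """Of all samples predicted as class k, the fraction that are actually k."""
    return cm[k][k] / sum(row[k] for row in cm)

# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted)
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [3, 1, 6],
]
print(row_normalize(cm))
for k in range(3):
    print(k, precision(cm, k), recall(cm, k))
```

The diagonal of the row-normalized matrix is the per-class recall, which is what the confusion-matrix plots in this notebook display.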

# Summary <a class="anchor" id="summary"></a>

|    Network   | Accuracy | n_epochs |
|--------------|----------|---------:|
|   Flat-SGD   |  0.9392  |    100   |
|   Flat-ADAM  |  0.9485  |    25    |
|  Flat-Dense  |  0.9724  |    15    |
| Conv-Sigmoid |  0.9791  |    20    |
|   Conv-tanh  |  0.9388  |    18    |
|   Conv-ReLU  |  0.9805  |     7    |
|  Conv-LReLU  |  0.9732  |     8    |
|  Conv-Multi  |  0.9919  |    15    |


Compare to SVM/SVC performance:

| SVM | Accuracy |
| ------------- |-----:|
| SVC | 0.8883 |

# Visualizing the Neural Network layers

Artificial neural networks are often referred to as "black boxes" because the weights and outputs of the layers are not easy to view (and, as we will see, are not necessarily easy to intuit or interpret even when we can view them).

Here we create another neural net, this time with 3 convolutional layers before a maxpooling layer. We will then view the output of each layer after training when supplied with an image.

In [7]:
model = Sequential()

model.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
convout1 = Activation('relu')
model.add(convout1)

model.add(Convolution2D(32, (3, 3), activation='relu'))
convout2 = Activation('relu')
model.add(convout2)

model.add(Convolution2D(32, (3, 3), activation='relu'))
convout3 = Activation('relu')
model.add(convout3)

model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
maxout = Activation('relu')
model.add(maxout)

model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# `epochs` replaces the deprecated `nb_epoch` argument
model.fit(X_train, y_train, batch_size=128, epochs=10, validation_data=(X_test, y_test))

model.summary()
score = model.evaluate(X_test, y_test)
print(score)

y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))

print("Classification report for classifier %s:\n%s\n"
      % (model, metrics.classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
print("Confusion matrix:\n%s" % cm)



Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
activation_1 (Activation)    (None, 26, 26, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 24, 24, 32)        9248      
_________________________________________________________________
activation_2 (Activation)    (None, 24, 24, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 22, 22, 32)        9248      
_________________________________________________________________
activation_3 (Activation)    (None, 22, 22, 32)

In [8]:
# choose any image you want by specifying the index
img_to_visualize = X_train[13]
# Keras requires the image to be in 4D
# So we add an extra dimension to it.
img_to_visualize = np.expand_dims(img_to_visualize, axis=0)

def layer_to_visualize(layer, my_title):
    """
    compute the output of the convolutional and maxpool layers
    
    requires the use of particular backend flags to turn off
    training behavior
    """
    inputs = [K.learning_phase()] + model.inputs

    _convout1_f = K.function(inputs, [layer.output])
    def convout1_f(X):
        # The [0] is to disable the training phase flag
        return _convout1_f([0] + [X])

    # compute the convolutional output of the layer given
    # the input image
    # e.g. The first layer will take a 28x28 image and will use
    # a Convolution2D layer to pass 32 different 3x3 kernels (filters) over it,
    # meaning 1 28x28x1 image becomes 32 26x26 feature maps
    convolutions = convout1_f(img_to_visualize)
    # K.function returns a list holding one array of shape (1, h, w, filters);
    # squeeze drops the length-1 list and batch dimensions, leaving (h, w, filters)
    convolutions = np.squeeze(convolutions)

    print('Shape of conv:', convolutions.shape)
    
    # technically this is "backward" from the original example
    # this is because we are using a slightly different API to
    # pass our images into the network
    n = convolutions.shape[2]
    n = int(np.ceil(np.sqrt(n)))
    
    # Visualization of each filter of the layer
    fig = plt.figure(figsize=(12,8))
    for i in range(convolutions.shape[2]):
        ax = fig.add_subplot(n,n,i+1)
        ax.axis('off')
        ax.imshow(convolutions[:,:,i], cmap='gray')
    fig.suptitle(my_title, fontsize=24)

In [9]:
# Specify the layer you want to visualize
layer_to_visualize(convout1, "First Layer")

layer_to_visualize(convout2, "Second Layer")

layer_to_visualize(convout3, "Third Layer")

# maxout is the output after the MaxPooling2D layer, so the
# image looks coarser because the resolution has been halved
layer_to_visualize(maxout, "Max Pool Layer")



Shape of conv: (26, 26, 32)



Shape of conv: (24, 24, 32)



Shape of conv: (22, 22, 32)



Shape of conv: (11, 11, 32)
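
The printed shapes (26, 24, 22, then 11) follow from simple arithmetic on the layer settings. The sketch below reproduces them in plain Python using the standard output-size formulas for unpadded ('valid') and 'same' convolutions; the helper names are hypothetical.

```python
import math

def conv_output_size(size, kernel, padding="valid", stride=1):
    """Spatial size after a convolution along one dimension."""
    if padding == "same":
        # 'same' padding keeps the size (for stride 1) by zero-padding the border
        return math.ceil(size / stride)
    # 'valid' (no padding): the kernel must fit entirely inside the image
    return (size - kernel) // stride + 1

def pool_output_size(size, pool=2):
    """Spatial size after non-overlapping pooling."""
    return size // pool

size = 28
for _ in range(3):
    size = conv_output_size(size, kernel=3)
    print(size)  # 26, then 24, then 22
print(pool_output_size(size))  # 11
```

Each unpadded 3x3 convolution trims one pixel from every border, so three of them shrink a 28x28 image to 22x22 before pooling halves it.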



## Deeper example

[Adapted from kaggle notebook](https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6/notebook)

In [12]:
model = Sequential()

# note the padding argument here keeps the image size the same...and prevents the shrinking
# also note the larger kernel
model.add(Convolution2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (28,28,1)))
convout1 = Activation('relu')
model.add(convout1)
model.add(Convolution2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
convout2 = Activation('relu')
model.add(convout2)
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Convolution2D(filters = 64, kernel_size = (3,3),padding = 'Same', 
                 activation ='relu'))
convout3 = Activation('relu')
model.add(convout3)
model.add(Convolution2D(filters = 64, kernel_size = (3,3),padding = 'Same', 
                 activation ='relu'))
convout4 = Activation('relu')
model.add(convout4)
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.25))


model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation = "softmax"))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=3, verbose=1)

model.fit(X_train, y_train, batch_size=512, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stop])

model.summary()
score = model.evaluate(X_test, y_test)
print(score)

y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))

print("Classification report for classifier %s:\n%s\n"
      % (model, metrics.classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
print("Confusion matrix:\n%s" % cm)

Train on 60000 samples, validate on 10000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 00012: early stopping
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 28, 28, 32)        832       
_________________________________________________________________
activation_5 (Activation)    (None, 28, 28, 32)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 32)        25632     
_________________________________________________________________
activation_6 (Activation)    (None, 28, 28, 32)        0         
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 32)        0         
_________________________________________________

In [10]:
import gc
del model
K.clear_session()
gc.collect()

0

In [25]:
print(X_train[58].shape)

(28, 28, 1)


In [29]:
plt.figure()
plt.imshow(X_train[58].reshape((28,28)), cmap="gray")
plt.show()


In [21]:
# choose any image you want by specifying the index
img_to_visualize = X_train[58]
# Keras requires the image to be in 4D
# So we add an extra dimension to it.
img_to_visualize = np.expand_dims(img_to_visualize, axis=0)

def layer_to_visualize(layer, my_title):
    """
    compute the output of the convolutional and maxpool layers
    
    requires the use of particular backend flags to turn off
    training behavior
    """
    inputs = [K.learning_phase()] + model.inputs

    _convout1_f = K.function(inputs, [layer.output])
    def convout1_f(X):
        # The [0] is to disable the training phase flag
        return _convout1_f([0] + [X])

    # compute the convolutional output of the layer given
    # the input image
    # e.g. The first layer here passes 32 different 5x5 kernels (filters) over a
    # 28x28 image with 'same' padding, so 1 28x28x1 image becomes
    # 32 28x28 feature maps
    convolutions = convout1_f(img_to_visualize)
    # K.function returns a list holding one array of shape (1, h, w, filters);
    # squeeze drops the length-1 list and batch dimensions, leaving (h, w, filters)
    convolutions = np.squeeze(convolutions)

    print('Shape of conv:', convolutions.shape)
    
    # technically this is "backward" from the original example
    # this is because we are using a slightly different API to
    # pass our images into the network
    n = convolutions.shape[2]
    n = int(np.ceil(np.sqrt(n)))
    
    # Visualization of the first 32 filters of the layer on a 4x8 grid
    # (the previous version referenced an undefined `kept_filters` inside a
    # bare try/except, which silently skipped every subplot)
    fig = plt.figure(figsize=(12,8))
    n_shown = min(convolutions.shape[2], 32)
    for i in range(n_shown):
        ax = fig.add_subplot(4, 8, i + 1)
        ax.axis('off')
        ax.imshow(convolutions[:,:,i], cmap='gray')
    fig.suptitle(my_title, fontsize=24)

In [43]:
# Specify the layer you want to visualize
layer_to_visualize(convout1, "First Layer")

layer_to_visualize(convout2, "Second Layer")

# convout3 and convout4 come after the first MaxPooling2D layer,
# so their feature maps are 14x14 instead of 28x28
layer_to_visualize(convout3, "Third Layer")

layer_to_visualize(convout4, "Fourth Layer")



Shape of conv: (28, 28, 32)



Shape of conv: (28, 28, 32)



Shape of conv: (14, 14, 64)



Shape of conv: (14, 14, 64)



In [44]:
print(y_train[58])
print(model.predict(np.expand_dims(X_train[58], axis=0)))
print(model.predict(np.expand_dims(X_train[58], axis=0)).argmax(axis=1))

[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[[5.5456014e-15 4.0124486e-09 1.1152201e-09 7.3804997e-13 1.0000000e+00
  2.5844828e-12 8.9363569e-12 2.5656901e-09 2.3506169e-10 5.4295981e-09]]
[4]
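
The three printed lines are the one-hot label, the softmax output, and the predicted class. A small NumPy sketch of that decoding step, using made-up probabilities rather than the model's actual output:

```python
import numpy as np

# one-hot label for the digit 4 (same layout as a row of y_train)
label = np.zeros(10)
label[4] = 1.0

# illustrative softmax-like output for a batch containing one image
probs = np.full((1, 10), 0.01)
probs[0, 4] = 0.91

true_class = int(label.argmax())
pred_class = probs.argmax(axis=1)  # one prediction per row of the batch

print(true_class, pred_class)
```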


# Kernel Visualization

https://blog.keras.io/category/demo.html

In [20]:
model = Sequential()

# note the padding argument here keeps the image size the same...and prevents the shrinking
# also note the larger kernel
# each separate Activation('relu') repeats the ReLU already applied by the
# convolution (a no-op) and only serves as a handle for visualizing the output
model.add(Convolution2D(filters=32, kernel_size=(5,5), padding='same',
                        activation='relu', input_shape=(28,28,1)))
convout1 = Activation('relu')
model.add(convout1)
model.add(Convolution2D(filters=32, kernel_size=(5,5), padding='same',
                        activation='relu'))
convout2 = Activation('relu')
model.add(convout2)
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))


model.add(Convolution2D(filters=64, kernel_size=(3,3), padding='same',
                        activation='relu'))
convout3 = Activation('relu')
model.add(convout3)
model.add(Convolution2D(filters=64, kernel_size=(3,3), padding='same',
                        activation='relu'))
convout4 = Activation('relu')
model.add(convout4)
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.25))


model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation = "softmax"))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# `epochs` replaces the deprecated Keras 1 argument `nb_epoch`
model.fit(X_train, y_train, batch_size=128, epochs=10, validation_data=(X_test, y_test))

model.summary()



Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_8 (Conv2D)            (None, 28, 28, 32)        832       
_________________________________________________________________
activation_9 (Activation)    (None, 28, 28, 32)        0         
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 28, 28, 32)        25632     
_________________________________________________________________
activation_10 (Activation)   (None, 28, 28, 32)        0         
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 14, 14, 32)        0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 14, 14, 32)
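
The `Param #` column can be checked by hand: a standard convolution layer has one weight per kernel position, input channel, and filter, plus one bias per filter. A short sketch (`conv2d_params` is a hypothetical helper, not a Keras function):

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter for a standard Conv2D layer."""
    return kernel_h * kernel_w * in_channels * filters + filters

print(conv2d_params(5, 5, 1, 32))   # first 5x5 layer on the 1-channel input
print(conv2d_params(5, 5, 32, 32))  # second 5x5 layer on 32 channels
```

These reproduce the 832 and 25632 shown in the summary for the first two convolution layers.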

In [98]:
# conv kernel weights have shape (height, width, input channels, filters)
# print(len(model.layers))
# print(len(model.layers[0].get_weights()))
print(model.layers[0].get_weights()[0].shape)
# in the model defined above, layers[8] is the fourth Conv2D layer (64 filters);
# keep only the kernels that act on input channel 0
weight_conv2d_1 = model.layers[8].get_weights()[0][:,:,0,:]

col_size = 8
row_size = 8
filter_index = 0
fig, ax = plt.subplots(row_size, col_size, figsize=(12,12))
for row in range(0,row_size):
    for col in range(0,col_size):
        ax[row][col].imshow(weight_conv2d_1[:,:,filter_index],cmap="gray")
        ax[row][col].axis("off")
        filter_index += 1

(5, 5, 1, 32)
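
The printed shape is (height, width, input channels, filters). A NumPy sketch of the slicing used in the cell, with a random stand-in array in place of trained weights (the values are made up, only the shapes matter):

```python
import numpy as np

# stand-in for model.layers[k].get_weights()[0]; random values, not trained weights
rng = np.random.RandomState(0)
kernels = rng.normal(size=(3, 3, 64, 64))  # (height, width, in_channels, filters)

# kernels that read input channel 0: one 3x3 slice per filter
channel0 = kernels[:, :, 0, :]
print(channel0.shape)

# the slice for filter 5 is a single 3x3 image
print(channel0[:, :, 5].shape)
```

Only one input channel is shown per filter, so for layers with many input channels the grid is a partial view of each kernel.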



## First Layer

In [56]:
import time
# dimensions of the generated pictures for each filter.
img_width = 28
img_height = 28

# the name of the layer we want to visualize
layer_name = 'conv2d_5'

# util function to convert a tensor into a valid image

def deprocess_image(x):
    # normalize tensor: center on 0., ensure std is 0.1
    x -= x.mean()
    x /= (x.std() + K.epsilon())
    x *= 0.1

    # clip to [0, 1]
    x += 0.5
    x = np.clip(x, 0, 1)

    # convert to RGB array
    x *= 255
    if K.image_data_format() == 'channels_first':
        x = x.transpose((1, 2, 0))
    x = np.clip(x, 0, 255).astype('uint8')
    return x


# # build the VGG16 network with ImageNet weights
# model = vgg16.VGG16(weights='imagenet', include_top=False)
# print('Model loaded.')

model.summary()

# this is the placeholder for the input images
input_img = model.input
# input_img = np.expand_dims(X_train[1], axis=0)
# input_img *= 255
# input_img = input_img.astype('float')

# get the symbolic outputs of each "key" layer (we gave them unique names).
layer_dict = dict([(layer.name, layer) for layer in model.layers])
print(layer_dict.keys())


def normalize(x):
    # utility function to normalize a tensor by its root-mean-square value
    print(x)
    return x / (K.sqrt(K.mean(K.square(x))) + K.epsilon())


kept_filters = []
for filter_index in range(32):
    # scan all 32 filters of this layer
    print('Processing filter %d' % filter_index)
    start_time = time.time()

    # we build a loss function that maximizes the activation
    # of the nth filter of the layer considered
    layer_output = layer_dict[layer_name].output
#     layer_output = convout1.output
    if K.image_data_format() == 'channels_first':
        loss = K.mean(layer_output[:, filter_index, :, :])
    else:
        loss = K.mean(layer_output[:, :, :, filter_index])
    # we compute the gradient of the input picture wrt this loss
    grads = K.gradients(loss, input_img)[0]

    # normalization trick: we normalize the gradient
    grads = normalize(grads)

    # this function returns the loss and grads given the input picture
    iterate = K.function([input_img], [loss, grads])

    # step size for gradient ascent
    step = 1.

    # the original example starts from a gray RGB image with random noise:
#     if K.image_data_format() == 'channels_first':
#         input_img_data = np.random.random((1, 3, img_width, img_height))
#     else:
#         input_img_data = np.random.random((1, img_width, img_height, 3))
#     input_img_data = (input_img_data - 0.5) * 20 + 128
    # here we start from training image 58 instead, rescaled the same way
    input_img_data = np.expand_dims(np.copy(X_train[58]), axis=0)
    input_img_data = (input_img_data - 0.5) * 20 + 128

    # we run gradient ascent for 20 steps
    for i in range(20):
        loss_value, grads_value = iterate([input_img_data])
        input_img_data += grads_value * step

        print('Current loss value:', loss_value)
        if loss_value <= 0.:
            # some filters get stuck to 0, we can skip them
            break

    # decode the resulting input image
    if loss_value > 0:
        img = deprocess_image(input_img_data[0])
        kept_filters.append((img, loss_value))
    end_time = time.time()
    print('Filter %d processed in %ds' % (filter_index, end_time - start_time))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 28, 28, 32)        832       
_________________________________________________________________
activation_5 (Activation)    (None, 28, 28, 32)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 32)        25632     
_________________________________________________________________
activation_6 (Activation)    (None, 28, 28, 32)        0         
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 32)        0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 14, 14, 32)        0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 14, 14, 64)        18496     
__________

Tensor("gradients_457/conv2d_5/convolution_grad/Conv2DBackpropInput:0", shape=(?, 28, 28, 1), dtype=float32)
Current loss value: 12.031662
Current loss value: 12.186527
Current loss value: 12.341393
Current loss value: 12.496257
Current loss value: 12.651119
Current loss value: 12.8059845
Current loss value: 12.960849
Current loss value: 13.115713
Current loss value: 13.2705765
Current loss value: 13.42544
Current loss value: 13.580303
Current loss value: 13.7351675
Current loss value: 13.890032
Current loss value: 14.044894
Current loss value: 14.1997595
Current loss value: 14.354626
Current loss value: 14.509488
Current loss value: 14.664351
Current loss value: 14.819215
Current loss value: 14.974081
Filter 8 processed in 2s
Processing filter 9
Tensor("gradients_458/conv2d_5/convolution_grad/Conv2DBackpropInput:0", shape=(?, 28, 28, 1), dtype=float32)
Current loss value: 1.735148
Current loss value: 1.8529898
Current loss value: 1.9708318
Current loss value: 2.0886734
Current loss va

Tensor("gradients_468/conv2d_5/convolution_grad/Conv2DBackpropInput:0", shape=(?, 28, 28, 1), dtype=float32)
Current loss value: 11.488802
Current loss value: 11.700026
Current loss value: 11.911258
Current loss value: 12.122484
Current loss value: 12.333708
Current loss value: 12.544943
Current loss value: 12.756171
Current loss value: 12.967398
Current loss value: 13.178626
Current loss value: 13.390002
Current loss value: 13.602028
Current loss value: 13.814063
Current loss value: 14.0261
Current loss value: 14.238315
Current loss value: 14.451014
Current loss value: 14.664024
Current loss value: 14.877509
Current loss value: 15.090983
Current loss value: 15.30446
Current loss value: 15.51794
Filter 19 processed in 2s
Processing filter 20
Tensor("gradients_469/conv2d_5/convolution_grad/Conv2DBackpropInput:0", shape=(?, 28, 28, 1), dtype=float32)
Current loss value: 3.2765326
Current loss value: 3.480998
Current loss value: 3.6902223
Current loss value: 3.9032702
Current loss value: 

Tensor("gradients_479/conv2d_5/convolution_grad/Conv2DBackpropInput:0", shape=(?, 28, 28, 1), dtype=float32)
Current loss value: 12.596809
Current loss value: 12.7488785
Current loss value: 12.900949
Current loss value: 13.053024
Current loss value: 13.205095
Current loss value: 13.357165
Current loss value: 13.509237
Current loss value: 13.661309
Current loss value: 13.81338
Current loss value: 13.965451
Current loss value: 14.117524
Current loss value: 14.269596
Current loss value: 14.421668
Current loss value: 14.573735
Current loss value: 14.72581
Current loss value: 14.877881
Current loss value: 15.029953
Current loss value: 15.182027
Current loss value: 15.334094
Current loss value: 15.486168
Filter 30 processed in 2s
Processing filter 31
Tensor("gradients_480/conv2d_5/convolution_grad/Conv2DBackpropInput:0", shape=(?, 28, 28, 1), dtype=float32)
Current loss value: 13.1098385
Current loss value: 13.252125
Current loss value: 13.394414
Current loss value: 13.536699
Current loss va
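
The loop above is gradient ascent on the input image with a normalized gradient. A Keras-free sketch of the same update rule on a toy objective; the linear "filter" `w` and the starting image are made up for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
w = rng.normal(size=(28, 28))    # toy "filter": the objective is mean(w * x)
x = rng.uniform(size=(28, 28))   # starting image

def loss_and_grad(x):
    # objective and its exact gradient with respect to the input
    return np.mean(w * x), w / w.size

def normalize(g, eps=1e-7):
    # same scaling as the notebook's normalize(): divide by the RMS of the gradient
    return g / (np.sqrt(np.mean(np.square(g))) + eps)

step = 1.0
losses = []
for _ in range(20):
    loss, grad = loss_and_grad(x)
    losses.append(loss)
    x += step * normalize(grad)

print(losses[0], losses[-1])
```

For a linear objective the normalized step raises the loss by the same amount every iteration, which is consistent with the near-constant increments in the log above.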

In [34]:
# we will stitch the filters on an n x n grid.
n = 6

# the filters that have the highest loss are assumed to be better-looking.
# we would keep only the top n * n filters (currently disabled).
# kept_filters.sort(key=lambda x: x[1], reverse=True)
# kept_filters = kept_filters[:n * n]

# # build a black picture with enough space for
# # our n x n filters of size 28 x 28, with a 5px margin in between
# margin = 5
# width = n * img_width + (n - 1) * margin
# height = n * img_height + (n - 1) * margin
# stitched_filters = np.zeros((width, height, 3))

In [57]:
print(len(kept_filters))
# filters whose loss stays at or below 0 are skipped in the loop above,
# so this count can be smaller than 32
# Visualization of each filter of the layer
fig = plt.figure(figsize=(12,8))
# show at most 32 images on a 4 x 8 grid
for idx, (img, loss) in enumerate(kept_filters[:32]):
    ax = fig.add_subplot(4, 8, idx + 1)
    ax.axis('off')
    ax.imshow(img.reshape(28,28), cmap='gray')
# for i in range(convolutions.shape[2]):
#     ax = fig.add_subplot(n,n,i+1)
#     ax.imshow(convolutions[:,:,i], cmap='gray')

32
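
Each image in `kept_filters` has passed through `deprocess_image`, which maps an arbitrary float tensor to displayable 8-bit values. A NumPy-only sketch of the same steps, assuming `eps=1e-7` as a stand-in for `K.epsilon()` and omitting the `channels_first` transpose:

```python
import numpy as np

def deprocess(x, eps=1e-7):
    """NumPy-only variant of deprocess_image; eps stands in for K.epsilon()."""
    x = x.astype("float64")   # astype returns a copy, so the caller's array is untouched
    x -= x.mean()
    x /= x.std() + eps
    x *= 0.1                  # standard deviation is now 0.1
    x += 0.5                  # center on 0.5
    x = np.clip(x, 0, 1)      # values more than 5 standard deviations out saturate
    return (x * 255).astype("uint8")

rng = np.random.RandomState(0)
raw = rng.normal(loc=128, scale=10, size=(28, 28, 1))
img = deprocess(raw)
print(img.dtype, img.shape, img.min(), img.max())
```

Because the result is centered and rescaled, the displayed contrast says nothing about the absolute magnitude of the optimized input.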



## Deepest Layer

In [61]:
import time
# dimensions of the generated pictures for each filter.
img_width = 28
img_height = 28

# the name of the layer we want to visualize
layer_name = 'conv2d_8'

# util function to convert a tensor into a valid image

def deprocess_image(x):
    # normalize tensor: center on 0., ensure std is 0.1
    x -= x.mean()
    x /= (x.std() + K.epsilon())
    x *= 0.1

    # clip to [0, 1]
    x += 0.5
    x = np.clip(x, 0, 1)

    # convert to RGB array
    x *= 255
    if K.image_data_format() == 'channels_first':
        x = x.transpose((1, 2, 0))
    x = np.clip(x, 0, 255).astype('uint8')
    return x


# # build the VGG16 network with ImageNet weights
# model = vgg16.VGG16(weights='imagenet', include_top=False)
# print('Model loaded.')

model.summary()

# this is the placeholder for the input images
input_img = model.input
# input_img = np.expand_dims(X_train[1], axis=0)
# input_img *= 255
# input_img = input_img.astype('float')

# get the symbolic outputs of each "key" layer (we gave them unique names).
layer_dict = dict([(layer.name, layer) for layer in model.layers[1:]])
print(layer_dict.keys())


def normalize(x):
    # utility function to normalize a tensor by its root-mean-square value
    print(x)
    return x / (K.sqrt(K.mean(K.square(x))) + K.epsilon())


kept_filters = []
for filter_index in range(64):
    # scan all 64 filters of this layer
    print('Processing filter %d' % filter_index)
    start_time = time.time()

    # we build a loss function that maximizes the activation
    # of the nth filter of the layer considered
    layer_output = layer_dict[layer_name].output
#     layer_output = convout1.output
    if K.image_data_format() == 'channels_first':
        loss = K.mean(layer_output[:, filter_index, :, :])
    else:
        loss = K.mean(layer_output[:, :, :, filter_index])
    # NOTE: the next line replaces the filter loss computed above, so every
    # iteration maximizes the model's output for class 5 instead
    loss = K.mean(model.output[:,5])
    # we compute the gradient of the input picture wrt this loss
    grads = K.gradients(loss, input_img)[0]

    # normalization trick: we normalize the gradient
    grads = normalize(grads)

    # this function returns the loss and grads given the input picture
    iterate = K.function([input_img], [loss, grads])

    # step size for gradient ascent
    step = 1.

    # we start from a gray image with some random noise
    # (single channel, since the MNIST inputs here are 28x28x1)
    if K.image_data_format() == 'channels_first':
        input_img_data = np.random.random((1, 1, img_width, img_height))
    else:
        input_img_data = np.random.random((1, img_width, img_height, 1))
    input_img_data = (input_img_data - 0.5) * 20 + 128

    # we run gradient ascent for 20 steps
    for i in range(20):
        loss_value, grads_value = iterate([input_img_data])
        input_img_data += grads_value * step

        print('Current loss value:', loss_value)
        if loss_value <= 0.:
            # some filters get stuck to 0, we can skip them
            break

    # decode the resulting input image
#     if loss_value > 0:
#         img = deprocess_image(input_img_data[0])
#         kept_filters.append((img, loss_value))
    img = deprocess_image(input_img_data[0])
    kept_filters.append((img, loss_value))
    end_time = time.time()
    print('Filter %d processed in %ds' % (filter_index, end_time - start_time))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 28, 28, 32)        832       
_________________________________________________________________
activation_5 (Activation)    (None, 28, 28, 32)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 32)        25632     
_________________________________________________________________
activation_6 (Activation)    (None, 28, 28, 32)        0         
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 32)        0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 14, 14, 32)        0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 14, 14, 64)        18496     
__________

KeyboardInterrupt: 

In [37]:
# we will stitch the filters on an n x n grid.
n = 6

# the filters that have the highest loss are assumed to be better-looking.
# we would keep only the top n * n filters (currently disabled).
# kept_filters.sort(key=lambda x: x[1], reverse=True)
# kept_filters = kept_filters[:n * n]

# build a black picture with enough space for
# our n x n filters of size 28 x 28, with a 5px margin in between
margin = 5
width = n * img_width + (n - 1) * margin
height = n * img_height + (n - 1) * margin
stitched_filters = np.zeros((width, height, 3))

In [52]:
print(len(kept_filters))

64


In [60]:
# Visualization of each filter of the layer
fig = plt.figure(figsize=(12,12))
n=8
for i in range(8):
    for j in range(8):
        img, loss = kept_filters[i * n + j]
        ax = fig.add_subplot(8,8,i * n + j + 1)
        ax.axis('off')
        ax.imshow(img.reshape(28,28), cmap='gray')
# for i in range(convolutions.shape[2]):
#     ax = fig.add_subplot(n,n,i+1)
#     ax.imshow(convolutions[:,:,i], cmap='gray')


Code copied from the Keras example, changed to maximize the model's output for class 2 instead of a filter activation.

In [79]:
import time
# dimensions of the generated pictures for each filter.
img_width = 28
img_height = 28

# util function to convert a tensor into a valid image

def deprocess_image(x):
    # normalize tensor: center on 0., ensure std is 0.1
    x -= x.mean()
    x /= (x.std() + K.epsilon())
    x *= 0.1

    # clip to [0, 1]
    x += 0.5
    x = np.clip(x, 0, 1)

    # convert to RGB array
    x *= 255
    if K.image_data_format() == 'channels_first':
        x = x.transpose((1, 2, 0))
    x = np.clip(x, 0, 255).astype('uint8')
    return x


# # build the VGG16 network with ImageNet weights
# model = vgg16.VGG16(weights='imagenet', include_top=False)
# print('Model loaded.')

model.summary()

# this is the placeholder for the input images
input_img = model.input
# input_img = np.expand_dims(X_train[1], axis=0)
# input_img *= 255
# input_img = input_img.astype('float')

def normalize(x):
    # utility function to normalize a tensor by its root-mean-square value
    print(x)
    return x / (K.sqrt(K.mean(K.square(x))) + K.epsilon())


kept_filters = []
print(model.output.shape)
loss = K.mean(model.output[:, 2])
# we compute the gradient of the input picture wrt this loss
grads = K.gradients(loss, input_img)[0]

# normalization trick: we normalize the gradient
grads = normalize(grads)

# this function returns the loss and grads given the input picture
iterate = K.function([input_img], [loss, grads])

# step size for gradient ascent
step = 1.

# we start from a gray image with some random noise
# (single channel, since the MNIST inputs here are 28x28x1)
if K.image_data_format() == 'channels_first':
    input_img_data = np.random.random((1, 1, img_width, img_height))
else:
    input_img_data = np.random.random((1, img_width, img_height, 1))
input_img_data = (input_img_data - 0.5) * 20 + 128

# we run gradient ascent for 20 steps
for i in range(20):
    loss_value, grads_value = iterate([input_img_data])
    input_img_data += grads_value * step

    print('Current loss value:', loss_value)
    if loss_value <= 0.:
        # some filters get stuck to 0, we can skip them
        break

# decode the resulting input image
#     if loss_value > 0:
#         img = deprocess_image(input_img_data[0])
#         kept_filters.append((img, loss_value))
img = deprocess_image(input_img_data[0])
kept_filters.append((img, loss_value))
# filter_index and start_time are not defined in this cell, so report the loss instead
print('Final loss value:', loss_value)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 28, 28, 32)        832       
_________________________________________________________________
activation_5 (Activation)    (None, 28, 28, 32)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 32)        25632     
_________________________________________________________________
activation_6 (Activation)    (None, 28, 28, 32)        0         
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 32)        0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 14, 14, 32)        0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 14, 14, 64)        18496     
__________

In [80]:
img, loss = kept_filters[0]
input_test = np.expand_dims(img, axis=0)
y_out = model.predict(input_test)
print(y_out)

[[0.000000e+00 0.000000e+00 1.000000e+00 1.792211e-23 0.000000e+00
  0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00]]


In [81]:
plt.figure()
img, loss = kept_filters[0]
plt.imshow(img.reshape((28,28)), vmin=0, vmax=255, cmap="gray")
plt.show()


In [None]:
# fill the picture with our saved filters
for i in range(n):
    for j in range(n):
        img, loss = kept_filters[i * n + j]
        width_margin = (img_width + margin) * i
        height_margin = (img_height + margin) * j
        stitched_filters[
            width_margin: width_margin + img_width,
            height_margin: height_margin + img_height, :] = img
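
The stitching arithmetic can be checked on a small grid. A NumPy sketch with a 2 x 2 layout and constant gray placeholder tiles instead of real filter images:

```python
import numpy as np

n, margin = 2, 5
img_width = img_height = 28

# canvas size: n tiles plus (n - 1) margins along each axis
width = n * img_width + (n - 1) * margin
height = n * img_height + (n - 1) * margin
canvas = np.zeros((width, height, 3))

# placeholder tiles: constant gray levels, one per grid position
tiles = [np.full((img_width, img_height, 1), 50 * (k + 1)) for k in range(n * n)]

for i in range(n):
    for j in range(n):
        x0 = (img_width + margin) * i
        y0 = (img_height + margin) * j
        # a (28, 28, 1) tile broadcasts across the 3 color channels
        canvas[x0:x0 + img_width, y0:y0 + img_height, :] = tiles[i * n + j]

print(canvas.shape)
```

Note that the loop indexes `kept_filters[i * n + j]`, so the list must hold at least `n * n` images for the chosen `n`.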

In [None]:
plt.figure()
plt.imshow(stitched_filters)

In [None]:
# save the result to disk
# (scipy.misc.imsave was removed in SciPy 1.2, so matplotlib is used instead)
plt.imsave('stitched_filters_%dx%d.png' % (n, n), stitched_filters.astype('uint8'))

## Original Visualization Example

[Link to original example](https://github.com/yashk2810/Visualization-of-Convolutional-Layers/blob/master/Visualizing%20Filters%20Python3%20Theano%20Backend.ipynb)

In [None]:
# Model 
new_model = Sequential()

new_model.add(Convolution2D(32, (3, 3), input_shape=(28,28,1))) 
convout1 = Activation('relu')
new_model.add(convout1)
convout2 = MaxPooling2D()
new_model.add(convout2)

new_model.add(Flatten())

new_model.add(Dense(128))
new_model.add(Activation('relu'))
new_model.add(Dropout(0.2))
new_model.add(Dense(10))
new_model.add(Activation('softmax'))

new_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# `epochs` replaces the deprecated Keras 1 argument `nb_epoch`
new_model.fit(X_train, y_train, batch_size=128, epochs=5, validation_data=(X_test, y_test))

In [None]:
# choose any image you want by specifying the index
img_to_visualize = X_train[1]
# Keras requires the image to be in 4D
# So we add an extra dimension to it.
img_to_visualize = np.expand_dims(img_to_visualize, axis=0)
print(img_to_visualize.shape)
# print(new_model.inputs)

def layer_to_visualize(layer):
    inputs = [K.learning_phase()] + new_model.inputs

    _convout1_f = K.function(inputs, [layer.output])
    def convout1_f(X):
        # The [0] is to disable the training phase flag
        return _convout1_f([0] + [X])

    convolutions = convout1_f(img_to_visualize)
    convolutions = np.squeeze(convolutions)

    print ('Shape of conv:', convolutions.shape)
    
    n = convolutions.shape[2]
    n = int(np.ceil(np.sqrt(n)))
    
    # Visualization of each filter of the layer
    fig = plt.figure(figsize=(12,8))
    for i in range(convolutions.shape[2]):
        ax = fig.add_subplot(n,n,i+1)
        ax.imshow(convolutions[:,:,i], cmap='gray')

In [None]:
# Specify the layer you want to visualize
layer_to_visualize(convout1)

# # As convout2 is the result of a MaxPool2D layer
# # We can see that the image has blurred since
# # the resolution has reduced 
layer_to_visualize(convout2)
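
The resolution drop mentioned in the comment comes from 2x2 max pooling, which keeps only the largest value in each 2x2 block. A NumPy sketch (`max_pool_2x2` is a hypothetical helper that assumes even side lengths and stride 2):

```python
import numpy as np

def max_pool_2x2(a):
    """2x2 max pooling with stride 2 on a 2D array with even side lengths."""
    h, w = a.shape
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(a)
print(pooled)
```

Applied to the 26x26 activations of the first layer in this example, the same operation yields 13x13 maps.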

