# About
In this notebook, we will:
- Test our model from last time on unseen data
- Investigate whether a convolutional neural network can outperform the original neural network composed of Dense layers

# Imports

In [1]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import random

from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Input, Conv2D, MaxPool2D, Flatten
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

In [2]:
# set seeds for reproducible results
random.seed(1234)
tf.random.set_seed(1234)

## Data

In [3]:
X = np.load('data/X.npy')
y = np.load('data/y.npy')

In [4]:
print(f'X has a shape of {X.shape}')
print(f'y has a shape of {y.shape}')

X has a shape of (5000, 400)
y has a shape of (5000, 1)


## Load model
We will load the same model we trained from the last notebook.

In [5]:
overfitted_model = load_model('models/Denselayer.keras')

In [6]:
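m, n = X.shape
# draw 400 random row indices (with replacement); n, the number of pixels per image, doubles as the sample size
indices = np.random.randint(0, m, n)
# predict the whole sample in one call, then take the highest-scoring class for each image
predictions = np.argmax(overfitted_model.predict(X[indices], verbose=0), axis=1)
num_correct_predictions = np.sum(predictions == y[indices].reshape(-1))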

print(f'{num_correct_predictions} out of {len(indices)} digits correctly predicted \n{num_correct_predictions/len(indices) * 100:.1f}% success rate')

392 out of 400 digits correctly predicted 
98.0% success rate


This model performs extremely well at predicting target values for data it was trained on. It may have "overfitted the data". <br>Let's see what happens when we train the same model as last time but test it on a subset of data it has not seen before. 

## Retrain Model
We will load the same model, before it was trained (with randomly initialised weights), from the last notebook.

In [7]:
model = load_model('models/Denselayer_beforetraining.keras')

In [8]:
# you will need to run pip install scikit-learn
from sklearn.model_selection import train_test_split

In [9]:
# do an 80|20 training|test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
print(f'X_train has shape {X_train.shape}')
print(f'y_train has shape {y_train.shape}')
print(f'X_test has shape {X_test.shape}')
print(f'y_test has shape {y_test.shape}')

X_train has shape (4000, 400)
y_train has shape (4000, 1)
X_test has shape (1000, 400)
y_test has shape (1000, 1)


We have split our data into two subsets: training data, which the model learns from, and test data, which it never sees during training.
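
To make the split less of a black box, here is a minimal NumPy sketch of what an 80|20 split does. It is not scikit-learn's implementation; the helper `split_train_test`, its `seed` argument and the stand-in arrays are assumptions made for illustration.

```python
import numpy as np

def split_train_test(X, y, test_size=0.2, seed=1234):
    """Shuffle the row indices once, then cut them into a test part and a training part."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(X))        # every row index exactly once, in random order
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# stand-in arrays with the same shapes as X and y in this notebook
X_demo = np.zeros((5000, 400))
y_demo = np.arange(5000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = split_train_test(X_demo, y_demo)
print(X_tr.shape, X_te.shape)  # (4000, 400) (1000, 400)
```

Because each index appears exactly once in the shuffled order, no image can end up in both subsets.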

In [10]:
# define loss & optimizer
model.compile(
    loss = SparseCategoricalCrossentropy(from_logits = True),
    optimizer = Adam(0.001)
)

# train model
num_epochs = 20
history = model.fit(X_train, y_train,epochs = num_epochs)

Epoch 1/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 2.0538
Epoch 2/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.9513
Epoch 3/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.5479
Epoch 4/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.3858
Epoch 5/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.3135
Epoch 6/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.2746
Epoch 7/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.2479
Epoch 8/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.2274
Epoch 9/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.2110
Epoch 10/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - lo

In [11]:
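m, n = X_test.shape
# draw 400 random test indices (with replacement); n, the number of pixels per image, doubles as the sample size
indices = np.random.randint(0, m, n)
# predict the whole sample in one call, then take the highest-scoring class for each image
predictions = np.argmax(model.predict(X_test[indices], verbose=0), axis=1)
num_correct_predictions = np.sum(predictions == y_test[indices].reshape(-1))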

print(f'{num_correct_predictions} out of {len(indices)} digits correctly predicted \n{num_correct_predictions/len(indices)*100:.1f}% success rate')

370 out of 400 digits correctly predicted 
92.5% success rate


Still convincing, but the drop from 98.0% on data the model was trained on to 92.5% on unseen data suggests our model is overfitting to the training data a little. Let's explore convolutional neural networks, which are less prone to overfitting (I'll explain later). 

# Understanding Convolutional Neural Networks
A convolutional neural network responds to inputs differently from the neural network we built in the last notebook. I found that this article by Axel Thevenot has useful animations to explain it: https://towardsdatascience.com/conv2d-to-finally-understand-what-happens-in-the-forward-pass-1bbaafb0b148
![cnn](media/cnn.gif) <br>
Let's break it down:
- On the left, we have our 9×9 image, a total of 81 pixels
- The 3×3 grid in the middle represents a neuron in the convolutional neural network (CNN). Neurons in a CNN are often referred to as kernels
- On the right is the kernel's output

Still with me? You're doing great! Instead of consuming the entire input image in one go, as the neurons in our Dense layers did, our kernel slides over the image. Let's see what the kernel does in step 1. <br>
## Kernel Calculation
![cnn_step1](media/cnn_step1.jpg)<br><br>
Those 9 pixels on the left serve as input to our kernel. Each pixel is represented by a number between 0 and 255 (if this is new to you, look up greyscale values).<br>
The kernel itself has 9 weights because 3×3=9.<br><br>
![kernel](media/kernel.jpg)<br><br>
If we perform element-wise multiplication and sum the results, we get a single number as output.<br><br>
![kernel](media/kernel_calculation.jpg) <br><br>
This is the number representing the top-left pixel in the output. Repeat this calculation every time the kernel slides across to fill in the rest of the output.
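
The sliding calculation described above can be sketched in a few lines of NumPy. This is an illustration only, not how Keras implements `Conv2D` internally: the function name `slide_kernel`, the random pixel values and the random weights are all made up, and the bias term that a real layer adds is left out.

```python
import numpy as np

def slide_kernel(image, kernel):
    """Slide `kernel` over `image` one pixel at a time (no padding) and
    return the sum of element-wise products at each position."""
    k_rows, k_cols = kernel.shape
    out_rows = image.shape[0] - k_rows + 1
    out_cols = image.shape[1] - k_cols + 1
    output = np.zeros((out_rows, out_cols))
    for r in range(out_rows):
        for c in range(out_cols):
            patch = image[r:r + k_rows, c:c + k_cols]   # the pixels currently under the kernel
            output[r, c] = np.sum(patch * kernel)       # multiply element-wise, then sum
    return output

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(9, 9))   # made-up greyscale pixels
kernel = rng.normal(size=(3, 3))            # made-up weights

output = slide_kernel(image, kernel)
print(output.shape)  # (7, 7)
```

A 3×3 kernel fits into a 9×9 image at 7 positions along each axis, which is why the output is 7×7.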

## Activation Function
Let's assume the output of our kernel is the following 7×7 matrix <br><br>
![kernel_output_before_activation](media/kernel_output_before_activation.jpg) <br><br>
We are not done yet: each kernel has an activation function. Here we will use the Rectified Linear Unit (ReLU), which means anything negative turns to zero and anything else stays the same. <br> <br>
![kernel_output_before_activation](media/kernel_output_after_activation.jpg) <br><br>
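
In code, ReLU is a one-liner. The sketch below uses a small made-up matrix rather than the 7×7 one in the images, and the helper name `relu` is just for illustration.

```python
import numpy as np

def relu(x):
    """Replace negative values with zero and leave everything else unchanged."""
    return np.maximum(x, 0)

before = np.array([[-3.0, 2.0],
                   [0.5, -1.5]])   # made-up kernel outputs
after = relu(before)
print(after)   # the two negative entries become 0, the others are untouched
```
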
This 2D matrix is the output of one convolutional neuron. So after all that, how is this any better than the traditional dense neural network? Interestingly, this output can itself be read as a 7×7 grid of 49 pixels. If we plot it as a greyscale image, we can see which features of the input image our neurons are focusing on. I won't plot the output since all the numbers here are fictional, but if you're interested in what each neuron actually sees, I recommend: https://medium.com/@neethamadhu.ma/visualizing-the-magic-a-guide-to-understanding-convolutional-neural-networks-c701978373c0 <p>
So how are these networks less prone to overfitting? I urge you to think about the number of weights in a dense neural network (DNN) vs. a convolutional neural network (CNN). In a DNN, each neuron has one weight per pixel. In a CNN, each neuron with a 3×3 kernel has only 9 weights per input channel (ignoring the bias), however large the image is. So for a 20×20 image, we have 400 weights for the DNN neuron vs. 9 weights for the CNN neuron. Fewer weights give the network less room to memorise the training data, which helps it stay generalisable.
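
The weight comparison can be checked with a little arithmetic. The helper names below are hypothetical, biases are ignored, and the kernel is assumed to slide one pixel at a time without padding.

```python
def dense_neuron_weights(height, width):
    """A Dense neuron has one weight per input pixel."""
    return height * width

def conv_neuron_weights(kernel_size=3, in_channels=1):
    """A convolutional neuron has one weight per kernel cell, per input channel."""
    return kernel_size * kernel_size * in_channels

def kernel_positions(height, width, kernel_size=3):
    """Number of places the kernel lands on while sliding over the image."""
    return (height - kernel_size + 1) * (width - kernel_size + 1)

print(dense_neuron_weights(20, 20))  # 400
print(conv_neuron_weights())         # 9
print(kernel_positions(20, 20))      # 324
```

The same 9 weights are reused at all 324 positions, so the convolutional neuron still covers the whole image without needing a separate weight for every pixel.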

# Applying Convolutional Neural Networks 

The following image depicts what our neural network is doing to the original 20×20 greyscale digit image <br>
![cnn_architecture](media/cnn_architecture.png) <br>
Image generated with Python. Source code can be found at: https://github.com/gwding/draw_convnet <p>
Notice there are 10 outputs. These correspond to the digits 0 to 9, the 10 values our handwritten digit could take.

## Build and Train Network

In [12]:
# build network 
cnn = Sequential(
    [
        Input(shape = (20,20,1)),
        Conv2D(filters = 32, kernel_size=(3,3), activation = 'relu'),
        MaxPool2D(pool_size = (2,2)),
        Conv2D(filters = 64, kernel_size=(3,3), activation = 'relu'),
        MaxPool2D(pool_size = (2,2)),
        Flatten(),
        Dense(64, activation = 'relu'),
        Dense(10)
    ]
)
cnn.save('models/cnn_beforetraining.keras')
cnn.summary()
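
As a sanity check on the architecture, the layer shapes and parameter counts can be traced by hand. The sketch below assumes the Keras defaults used in the cell above (stride 1 and no padding for `Conv2D`, stride equal to the pool size for `MaxPool2D`); `conv_out`, `pool_out` and the dictionary keys are names invented for this illustration. The totals should agree with what `cnn.summary()` reports if those assumptions hold.

```python
def conv_out(size, kernel=3):
    """Output width/height of a convolution with stride 1 and no padding."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output width/height of max pooling whose stride equals its pool size."""
    return size // pool

size = 20
size = conv_out(size)   # 18 after the first Conv2D
size = pool_out(size)   # 9 after the first MaxPool2D
size = conv_out(size)   # 7 after the second Conv2D
size = pool_out(size)   # 3 after the second MaxPool2D
flattened = size * size * 64   # 64 feature maps of 3x3 values each

# weights + biases per layer
params = {
    "conv_1": 3 * 3 * 1 * 32 + 32,
    "conv_2": 3 * 3 * 32 * 64 + 64,
    "dense_1": flattened * 64 + 64,
    "dense_2": 64 * 10 + 10,
}
print(flattened)             # 576
print(sum(params.values()))  # 56394
```

Note that most of the parameters sit in the first Dense layer after `Flatten`, not in the convolutional layers.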

In [13]:
print(f'Currently, our inputs in X_train are numpy arrays of shape {X_train[0].shape}. Before we train our model, we need to reshape the inputs to shape {cnn.layers[0].input.shape[1:]} \nas this is what the CNN is expecting.')

Currently, our inputs in X_train are numpy arrays of shape (400,). Before we train our model, we need to reshape the inputs to shape (20, 20, 1) 
as this is what the CNN is expecting.


In [14]:
X_traincnn = []

for x in X_train:
    x = x.reshape(20,20).T.reshape(20,20,1) # included a transpose to get our pixels the right way around
    X_traincnn.append(x)

X_traincnn = np.array(X_traincnn)
y_traincnn = y_train
print(f'Old inputs shape: {X_train.shape}')
print(f'New inputs shape: {X_traincnn.shape}')

Old inputs shape: (4000, 400)
New inputs shape: (4000, 20, 20, 1)
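
The loop above can also be written as a single vectorised reshape. The sketch below is an alternative, not a replacement for the cell above; `to_cnn_input` is a hypothetical helper and `X_demo` is stand-in data with the same row length as `X_train`.

```python
import numpy as np

def to_cnn_input(X_flat, side=20):
    """Reshape rows of side*side pixels into (side, side, 1) images,
    transposing each image exactly as the loop does."""
    images = X_flat.reshape(-1, side, side)   # (m, 20, 20)
    images = images.transpose(0, 2, 1)        # swap rows and columns of every image
    return images[..., np.newaxis]            # (m, 20, 20, 1)

# stand-in data: 3 fake "images" of 400 pixels each
X_demo = np.arange(3 * 400, dtype=float).reshape(3, 400)

looped = np.array([x.reshape(20, 20).T.reshape(20, 20, 1) for x in X_demo])
vectorised = to_cnn_input(X_demo)
print(vectorised.shape)                    # (3, 20, 20, 1)
print(np.array_equal(looped, vectorised))  # True
```
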


In [15]:
# define loss & optimizer
cnn.compile(
    loss = SparseCategoricalCrossentropy(from_logits=True),
    optimizer =Adam(0.001)
)

# train model
cnn.fit(X_traincnn, y_traincnn, epochs = num_epochs)

Epoch 1/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - loss: 1.5351
Epoch 2/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - loss: 0.2487
Epoch 3/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - loss: 0.1495
Epoch 4/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - loss: 0.1115
Epoch 5/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - loss: 0.0881
Epoch 6/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - loss: 0.0685
Epoch 7/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - loss: 0.0516
Epoch 8/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - loss: 0.0384
Epoch 9/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - loss: 0.0332
Epoch 10/20
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - lo

<keras.src.callbacks.history.History at 0x25e17489990>
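
Both networks end in a `Dense(10)` layer with no activation, so they output raw scores (logits), and the loss is told so via `from_logits=True`. The NumPy sketch below shows the calculation as I understand it: a softmax over the logits followed by the mean negative log-probability of the correct class. It is an illustration under that assumption, not the Keras source, and the function name and sample values are made up.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean negative log-probability of the correct class.
    logits: (m, num_classes) raw scores; labels: (m,) integer class ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.zeros((4, 10))          # a model with no preference: all 10 digits equally likely
labels = np.array([0, 3, 7, 9])
loss = sparse_categorical_crossentropy_from_logits(logits, labels)
print(loss)  # ln(10), about 2.3026
```

A network that guesses uniformly among 10 digits therefore scores about 2.30, which gives a reference point for the first-epoch losses printed during training. Since softmax preserves the ordering of the scores, taking `np.argmax` of the logits picks the same digit as taking it of the probabilities.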

## Test Network

In [16]:
# process test data inputs
X_testcnn = []

for x in X_test:
    x = x.reshape(20,20).T.reshape(20,20,1) # included a transpose to get our pixels the right way around
    X_testcnn.append(x)

X_testcnn = np.array(X_testcnn)
y_testcnn = y_test

In [17]:
# feedforward data
outputs = cnn.predict(X_testcnn, verbose = 0)
outputs.shape

(1000, 10)

In [18]:
# metrics
m,n = outputs.shape
predictions = [np.argmax(outputs[i]) for i in range(m)]
num_correct_predictions = sum([predictions[i] == y_testcnn[i] for i in range(m)])
print(f'{num_correct_predictions[0]} out of {m} digits correctly predicted \n{num_correct_predictions[0]/m *100:.1f}% success rate')

963 out of 1000 digits correctly predicted 
96.3% success rate


Interesting. The CNN does roughly 4 percentage points better than the dense neural network (96.3% vs. 92.5%).
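
The metrics calculation can also be done without Python loops. The sketch below is one way to write it; `accuracy` is a hypothetical helper and the demo arrays are made up. The `reshape(-1)` matters: comparing predictions of shape `(m,)` against labels of shape `(m, 1)` would broadcast to an `(m, m)` matrix instead of comparing element by element.

```python
import numpy as np

def accuracy(outputs, labels):
    """Fraction of rows whose highest-scoring class matches the label.
    outputs: (m, num_classes); labels: (m,) or (m, 1)."""
    predictions = np.argmax(outputs, axis=1)
    return np.mean(predictions == labels.reshape(-1))

outputs_demo = np.array([[0.1, 2.0, -1.0],
                         [3.0, 0.0, 0.5],
                         [0.2, 0.1, 0.9],
                         [1.5, 1.0, 0.0]])
labels_demo = np.array([[1], [0], [2], [1]])
print(accuracy(outputs_demo, labels_demo))  # 0.75
```
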

# Appendix
The above networks have been trained for only 20 epochs, which is very little. Let us see what happens when we train for 200 epochs, which shouldn't take much longer.

## Dense Neural Network

In [19]:
model = load_model('models/Denselayer_beforetraining.keras')

In [20]:
# define loss & optimizer
model.compile(
    loss = SparseCategoricalCrossentropy(from_logits = True),
    optimizer = Adam(0.001)
)

# train model
num_epochs = 200
history = model.fit(X_train, y_train,epochs = num_epochs, verbose = 0)

In [21]:
# feedforward data
outputs = model.predict(X_test, verbose = 0)

# metrics
m,n = outputs.shape
predictions = [np.argmax(outputs[i]) for i in range(m)]
num_correct_predictions = sum([predictions[i] == y_test[i] for i in range(m)])
success_rate_dnn = num_correct_predictions[0]/m *100

print(f'{num_correct_predictions[0]} out of {m} digits correctly predicted \n{success_rate_dnn:.1f}% success rate')

921 out of 1000 digits correctly predicted 
92.1% success rate


Performs roughly the same as its 20-epoch counterpart (92.1% here vs. 92.5%).

## Convolutional Neural Network

In [22]:
cnn = load_model('models/cnn_beforetraining.keras')

In [23]:
# define loss & optimizer
cnn.compile(
    loss = SparseCategoricalCrossentropy(from_logits = True),
    optimizer = Adam(0.001)
)

# train model
num_epochs = 200
history = cnn.fit(X_traincnn, y_traincnn,epochs = num_epochs, verbose = 0)

In [24]:
# feedforward data
outputs = cnn.predict(X_testcnn, verbose = 0)

# metrics
m,n = outputs.shape
predictions = [np.argmax(outputs[i]) for i in range(m)]
num_correct_predictions = sum([predictions[i] == y_testcnn[i] for i in range(m)])
success_rate_cnn = num_correct_predictions[0]/m *100

print(f'{num_correct_predictions[0]} out of {m} digits correctly predicted \n{success_rate_cnn:.1f}% success rate')

973 out of 1000 digits correctly predicted 
97.3% success rate


In [25]:
print(f'CNN success rate is better than DNN by {success_rate_cnn - success_rate_dnn:.1f}%. This was tested on {len(X_test)} unseen images.')

CNN success rate is better than DNN by 5.2%. This was tested on 1000 unseen images.
