# Convolutional Neural Networks

Welcome to the seventh lab exercise. In this notebook, you will:

- Implement helper functions that you will use when implementing a CNN model using TensorFlow and Keras
- Implement a fully functioning ConvNet using TensorFlow 

In the previous exercises, you applied simple artificial neural networks (ANNs) to classify images. In this exercise we discuss a particular class of ANN that is well suited for image data: convolutional neural networks (CNNs). A CNN is a special case of an ANN that contains several convolutional layers. A convolutional layer is designed in such a way that it can learn to "detect" a specific pattern in an image. Such a pattern could be a simple geometric shape such as a circle or a higher-level concept such as a tree. 

Learning goals:

- understand the basic principles of convolutional layers
- understand the basic principles of a pooling layer 
- learn how a CNN is constructed by combining convolutional layers
- learn how to use the "padding" and "stride" parameters in a CNN
- learn how to determine the CNN parameters required for a given data set
- how to visualize the activations (outputs) of different neurons within a CNN

##### **NOTE: Use GPU as runtime type for this lab exercise**

## Recommended Reading

- Deep Learning with Python F.Chollet, [chapter 5](https://livebook.manning.com/book/deep-learning-with-python/chapter-5)
- Convolutional neural networks, [stanford](https://cs231n.github.io/neural-networks-3/)
- [CNN basics](https://mlnotebook.github.io/post/CNN1/)

- Have fun with [interactive CNN](https://poloclub.github.io/cnn-explainer/)
- [Convolution Layer](https://www.youtube.com/watch?v=jPOAS7uCODQ&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&index=7), Andrew Ng
- [Max pooling layer, Andrew Ng](https://www.youtube.com/watch?v=XTzDMvMXuAk)
- Convolutional Neural Networks [Stanford, CS231n](http://cs231n.github.io/convolutional-networks/)
- [Comprehensive guide to CNN](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)

### Tensorflow

- [Tensorflow 2](https://www.tensorflow.org/tutorials/quickstart/beginner) Quickstart for beginners
- [Tensorflow tutorials](https://www.tensorflow.org/tutorials)

Let's get started

First of all, mount Google Drive.
This mounts your Google Drive in Google Colab so that you can access the contents of your drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


import os
os.chdir('/content/drive/My Drive/Deep_learning_unit/Assignment7')
folder = os.path.join('/content/drive/My Drive/Deep_learning_unit/Assignment7')
!ls

## 1 - Packages

First, let's run the cell below to import all the packages that you will need during this exercise.
- [tensorflow](https://www.tensorflow.org) is the library which provides the functions for deep neural networks.
- [numpy](https://numpy.org) is the fundamental package for scientific computing with Python.
- [matplotlib](http://matplotlib.org) is a library to plot graphs in Python.
- [sklearn](http://scikit-learn.org/stable/) provides simple and efficient tools for data mining and data analysis. 


In [None]:
#  tensorflow library provides functions for deep neural networks 
import tensorflow as tf

# to load the mnist data set
from tensorflow.keras.datasets import mnist 

# import plt library which provides functions to visualize data
import matplotlib.pyplot as plt

# import numpy library which provides functions for matrix computations
import numpy as np

#  to get a text report showing the main classification metrics for each class
from sklearn.metrics import classification_report

# for reproducibility 
from numpy.random import seed
seed(1)
tf.random.set_seed(1)


## 2 - Dataset

We will use the same dataset as in Round 2, i.e., the [MNIST](https://www.tensorflow.org/datasets/catalog/mnist) dataset. Each data point is a grayscale image of size 28x28 pixels. Each image shows a handwritten digit from 0 to 9, and each data point is associated with a label $y=0,...,9$ indicating which of the 10 digit classes it belongs to.

We can load this dataset using the command `tf.keras.datasets.mnist.load_data()`

In [None]:
(X_train_orig, Y_train), (X_test_orig, Y_test) = mnist.load_data()

# shape of train and test datasets
print(f'Shape of training data: {X_train_orig.shape}')
print(f'Shape of test data: {X_test_orig.shape}')

Let's normalize our training and test data.

Then we reshape the training set, because we need to specify the number of channels (one for grayscale images and three for RGB). Since these are grayscale images, we set the number of channels to 1.

In [None]:
#  normalize train and test dataset

### START CODE HERE ### (approx. 2 lines)


### End CODE HERE ###

### START CODE HERE ### (approx. 1 lines). # Reshape X_train to specify channels as 1

### End CODE HERE ###

print(f'Shape of training data: {X_train.shape}')
print(f'Shape of test data: {X_test.shape}')

Until now, you have used fully connected networks for image datasets. For image data, however, it is more natural to apply a ConvNet.
To get started, let's examine the shapes of your data.

In [None]:
print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))

Let us check our dataset by visualizing one single example. You can check other images by changing the index number.

In [None]:
# Example of a picture
index = 8
plt.imshow(X_train_orig[index])
print ("y = " + str(np.squeeze(Y_train[index])))

## 3 - Convolutional neural network

In the previous exercises, we used an ANN to predict the label for classification of images. There, we have used an ANN constituted by dense layers with each neuron of the layer connected to each neuron in the preceding layer. 

Consider an ANN applied to images with a (rather low) resolution of $200 \times 200$ pixels. Let us assume that the first hidden layer consists of only $128$ neurons. 

This single dense layer alone would then have around $$ 200 \times 200 \times 128 = 5120000$$ tunable weights which need to be trained. This significantly exceeds the number of training samples in many image datasets. 

Hence, we focus on another option to reduce the number of tunable parameters (weights) in the ANN by using convolutional neural networks. 
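
To make the comparison concrete, the following sketch counts the weights of the dense layer above and of a convolutional layer. The choice of 32 filters of size 3x3 on a single-channel input is an assumption made only for illustration; biases are ignored in both counts.

```python
# Dense layer: each of the 200*200 input pixels connects to each of the 128 neurons.
height, width, neurons = 200, 200, 128
dense_weights = height * width * neurons

# Convolutional layer (illustrative assumption: 32 filters of size 3x3, 1 input channel).
# The same small filter is reused at every image location, so the count
# does not depend on the image resolution.
kernel_size, channels_in, filters = 3, 1, 32
conv_weights = kernel_size * kernel_size * channels_in * filters

print(f"dense layer weights: {dense_weights}")
print(f"conv layer weights:  {conv_weights}")
```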

A CNN consists of a sequence of different kinds of layers. There are three main types of layers in a CNN:
- Convolution layer (conv)
- Pooling layer (pooling)
- Fully connected Layer (FC)

A typical CNN architecture looks like this ([image source](https://cezannec.github.io/Convolutional_Neural_Networks/)): 

In [None]:
plt.figure(figsize=(12,8))
img=plt.imread('/content/drive/My Drive/Deep_learning_unit/Assignment7/images/cnn.png')
plt.imshow(img)
plt.axis('off') 
plt.show()

First, let's see what the input and convolution layers of a CNN are.

### Input Layer

In the ANN exercises, we "flattened" the image. Flattening refers to the process of stacking the image pixel intensities into a one-dimensional feature vector. 

A very useful representation of image and video data is in the form of **tensors**. A tensor is a multidimensional array of numbers. For the special case of two dimensions, tensors become **matrices** and for one dimension, tensors become **vectors**. In particular, we can represent an RGB colour image with $128 \times 128$ pixels by a three-dimensional tensor of shape $(128,128,3)$. 
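
As a small illustration of these terms, the sketch below builds a vector, a matrix and an RGB-sized tensor with NumPy and flattens the latter. The arrays are filled with zeros purely as placeholders; only the shapes matter here.

```python
import numpy as np

vector = np.zeros(5)                 # 1-D tensor
matrix = np.zeros((4, 4))            # 2-D tensor
rgb_image = np.zeros((128, 128, 3))  # 3-D tensor: height, width, colour channels

# Flattening stacks all entries into a single one-dimensional feature vector.
flat = rgb_image.reshape(-1)

print(vector.ndim, matrix.ndim, rgb_image.ndim)
print(rgb_image.shape, "->", flat.shape)
```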

See below an example input to CNN - an image matrix (volume) of a dimension 4x4x3 ([source](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)). The values of the volume are pixel values from 0 to 255.

In [None]:
plt.figure(figsize=(12,8))
img=plt.imread('/content/drive/My Drive/Deep_learning_unit/Assignment7/images/volume.png')
plt.imshow(img)
plt.axis('off') 
plt.show()

### Convolution Layer

The crucial building block of a CNN is the convolutional layer. A convolutional layer performs a **convolution** operation between the image and filters (or kernels). Basically, a filter is a set of weights, and convolving the image with the filter means sliding the filter across the image and computing a weighted sum over the small area (patch) of the image it currently covers. The filter is typically much smaller (a few pixels only) than the original image. Convolution of an image with kernels is also used in image processing for edge detection, blurring, sharpening, etc.; [read more here](https://en.wikipedia.org/wiki/Kernel_(image_processing))

Consider a grayscale image of shape $5 \times 5 \times 1$ (height/width/channels) and a filter of size $3 \times 3 \times 1$. For each location in the original image, we compute the sum of the element-wise products between the filter weights and the image pixel values:

`1*1 + 2*3 + 3*5 + 2*2 +0*(-1) + 0*9 + 3*1 + 4*1 + 4*(-1) = 29`
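
The sum above can be reproduced with NumPy. In this sketch the first factor of each product is assumed to be the image pixel and the second the filter weight; that assignment, and the row-by-row arrangement into 3x3 arrays, are assumptions, since the text only lists the products.

```python
import numpy as np

# Assumed arrangement: first factors form the image patch, second factors the filter.
patch = np.array([[1, 2, 3],
                  [2, 0, 0],
                  [3, 4, 4]])
kernel = np.array([[1, 3, 5],
                   [2, -1, 9],
                   [1, 1, -1]])

# Element-wise product followed by a sum gives one entry of the feature map.
value = int(np.sum(patch * kernel))
print(value)  # 29
```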

In [None]:
plt.figure(figsize=(12,8))
img=plt.imread('/content/drive/My Drive/Deep_learning_unit/Assignment7/images/c1.png')
plt.imshow(img)
plt.axis('off') 
plt.show()

Next, the kernel slides to the right by 1 pixel and produces the next value of the feature map, as shown below.

In [None]:
plt.figure(figsize=(12,8))
img=plt.imread('/content/drive/My Drive/Deep_learning_unit/Assignment7/images/c2.png')
plt.imshow(img)
plt.axis('off') 
plt.show()

In a similar fashion, the kernel slides over the whole image (right and down) and produces the output of the convolution: a feature map (or activation map).

In [None]:
plt.figure(figsize=(12,8))
img=plt.imread('/content/drive/My Drive/Deep_learning_unit/Assignment7/images/c3.png')
plt.imshow(img)
plt.axis('off') 
plt.show()

Note that the resulting output is smaller than the input image due to the convolution operation. The input to the convolution can be an image (pixel values) or the output from another convolution (feature map). The convolution of the input and filter may be seen as a filtering or feature extraction operation, where different structures or patterns (such as edges or shapes) are extracted. For example, edge kernels pass only the information about edges in the image.
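
The sliding procedure can be sketched in a few lines of NumPy. The function name `conv2d_valid`, the toy image and the vertical-edge kernel are illustrative assumptions and are not part of the assignment. As in deep learning libraries, the kernel is not flipped before it is applied.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide a square `kernel` over a 2D `image` without padding."""
    h, w = image.shape
    f = kernel.shape[0]
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = np.sum(patch * kernel)  # weighted sum over the patch
    return out

# Toy 5x5 image: dark on the left (0), bright on the right (1).
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# Simple vertical-edge kernel (hypothetical example).
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3): smaller than the 5x5 input
print(feature_map)
```

The feature map is largest where the patch covers the dark-to-bright transition and zero where the patch is uniformly bright.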

As discussed, convolution is a feature extraction operation. Each filter (or kernel) learns a particular feature from the image. In order to learn multiple features in parallel for a given input, a convolution layer has many different filters.

In a CNN, filters are initialized randomly and become parameters (weights), which are subsequently learnt by the network. To update the weights of the convolutional layer we use the same iterative process of gradient-based learning as for a simple ANN, but the backpropagation step is a bit more complicated due to the presence of the convolution operation.

The important thing to remember about the convolution operation is that the depth of a filter/kernel must match the depth of the input. If you have an image of volume 5x5x3 as an input (where 3 represents the 3 color channels: red, green and blue), the filter must have the same depth, for example 3x3x3. 

In [None]:
plt.figure(figsize=(12,8))
img=plt.imread('/content/drive/My Drive/Deep_learning_unit/Assignment7/images/c4.png')
plt.imshow(img)
plt.axis('off') 
plt.show()

The activation maps from each kernel are stacked together along the depth dimension, thus creating the 3D output volume.
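
A minimal NumPy sketch of this stacking, assuming stride 1 and no padding: the function name `conv_layer`, the random input and the random filters are hypothetical and only serve to show the shapes involved.

```python
import numpy as np

def conv_layer(volume, filters, bias):
    """volume: (H, W, D), filters: (K, F, F, D), bias: (K,). Stride 1, no padding."""
    h, w, d = volume.shape
    k, f, _, d_f = filters.shape
    assert d == d_f, "filter depth must match input depth"
    out = np.zeros((h - f + 1, w - f + 1, k))
    for n in range(k):                      # one activation map per filter
        for i in range(h - f + 1):
            for j in range(w - f + 1):
                patch = volume[i:i + f, j:j + f, :]
                # sum over height, width and all channels, plus the bias term
                out[i, j, n] = np.sum(patch * filters[n]) + bias[n]
    return out

rng = np.random.default_rng(0)
volume = rng.random((5, 5, 3))       # 5x5 image with 3 channels
filters = rng.random((2, 3, 3, 3))   # 2 filters of size 3x3x3
bias = np.zeros(2)

output = conv_layer(volume, filters, bias)
print(output.shape)  # (3, 3, 2)
```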



#### Padding

Above we saw that the filter traverses the image one pixel at a time. The number of pixels (step size) by which the filter moves in each slide is called the **stride**. \
Also, we saw that the output of the convolution operation, i.e. the feature map, is smaller than the input image. This means that we are losing some pixel values around the perimeter of the image. Since a CNN might consist of many convolutional layers, the loss of pixel values in each successive convolution layer might result in a loss of important features from the image. To get an output of the same size as the input, we employ a technique called **zero padding**: we add zero-valued pixels symmetrically around the image. 
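
Zero padding can be tried out directly with `np.pad`. The 3x3 toy image below is an arbitrary example.

```python
import numpy as np

image = np.arange(1, 10).reshape(3, 3)

# Add a border of zeros, one pixel wide, on every side of the image.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(image.shape, "->", padded.shape)
print(padded)
```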

Below you can see an example of zero padding, where size of zero padding is one.

In [None]:
plt.figure(figsize=(12,8))
img=plt.imread('/content/drive/My Drive/Deep_learning_unit/Assignment7/images/padding.png')
plt.imshow(img)
plt.axis('off') 
plt.show()

In order to get an output volume with the same spatial dimensions as the input volume for **stride=1**, we can find the required size of zero padding with the following formula: 

<center> zero padding size = $\large\frac{(F-1)}{2}$ </center>
<center>where $F$ is filter/kernel size</center>
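
As a quick check of this formula, the sketch below computes the padding for a few odd filter sizes and verifies that a 28-pixel-wide input keeps its width at stride 1. The helper name `same_padding` is made up for this illustration.

```python
def same_padding(filter_size):
    """Zero padding that keeps the spatial size at stride 1 (odd filter sizes only)."""
    if filter_size % 2 == 0:
        raise ValueError("the formula assumes an odd filter size")
    return (filter_size - 1) // 2

width = 28
results = {}
for f in (3, 5, 7):
    p = same_padding(f)
    out_width = width + 2 * p - f + 1   # output width for stride 1
    results[f] = (p, out_width)
    print(f"F={f}: P={p}, output width={out_width}")
```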

The animation below illustrates how the output is calculated for an MxNx3 image and 3x3x3 filter (kernel): 

In [None]:
from IPython.display import Image 
Image(open('/content/drive/My Drive/Deep_learning_unit/Assignment7/images/cnn.gif','rb').read())

Here is an animated example of a multi-filter, multi-channel convolution operation involving 2 kernels (W0 and W1) applied to a 3-channel image:

In [None]:
Image(open('/content/drive/My Drive/Deep_learning_unit/Assignment7/images/standford_CNN.gif','rb').read())

<figure>
    <figcaption style="text-align:center">Convolution demo from <a href="http://cs231n.github.io/convolutional-networks/">Stanford CS231n course</a>
         with K=2, F=3, S=2 and P=1 
        Where:
        <ul>
  <li>K = Number of filters</li>
  <li>F = filter size</li>
  <li>S = stride length</li>
  <li>P = amount of zero padding </li>
</ul> 
</figcaption>
</figure>

<br>

The above animation illustrates the application of a convolutional layer to a 5x5x3 input image, zero-padded to 7x7x3, with a kernel of size 3x3x3 (note that the depth of the convolution filter matches the depth of the image, both being 3). When the kernel is shifted to a particular location in the input image, it covers a small volume of the input (the receptive field) and performs the convolution operation with this input. For each location, we sum up the results of the convolutions from all channels (and in this case, we also add a bias term).

Since 3D volumes are difficult to visualize, the input volume (blue), the kernel volume (red), and the output volume (green) are depicted row-wise. The filter slides over the input and performs the convolution at every location aggregating the result in a feature map. 

This feature map is of size 3x3x1, shown as the green slice on the right. Since we used 2 different filters we  have 2 feature maps of size 3x3x1 and stacking them along the depth dimension would give us the final output of the convolution layer: a volume of size 3x3x2.

In contrast to dense layers, which learn global patterns in their input space (each neuron is connected to every input entry), convolutional layers aim at detecting local patterns (each neuron is connected only to its own receptive field). 


<b>In summary, a convolutional layer:</b>
<ul>
  <li>learns local patterns </li>
  <li>has the following hyperparameters:</li>
    <ul>
      <li>Number of filters, K</li>
      <li>Filter size, F</li>
      <li>Stride length, S</li>
      <li>Zero padding size, P</li>
   </ul>
  <li>It accepts an input volume of size:  $W_{in}$x$H_{in}$x$D_{in}$</li>
  <li>It outputs a volume of size: $W_{out}$x$H_{out}$x$D_{out}$</li>
    <br>
  where 
    $W_{out}$x$H_{out}$x$D_{out}$ = $[\frac{W_{in}+2P-F}{S} +1,\frac{H_{in}+2P-F}{S} +1,K]$
</ul>
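
The output-size formula can be wrapped in a small helper. The function name `conv_output_shape` is hypothetical, and the use of integer (floor) division for cases where the stride does not divide evenly is an assumption. The first call uses the demo settings K=2, F=3, S=2, P=1 with an assumed 5x5 input; the second uses a 28x28 input with 32 filters of size 3 and padding 1.

```python
def conv_output_shape(w_in, h_in, f, s, p, k):
    """Output volume (W_out, H_out, D_out) of a convolutional layer."""
    w_out = (w_in + 2 * p - f) // s + 1
    h_out = (h_in + 2 * p - f) // s + 1
    return (w_out, h_out, k)

demo_shape = conv_output_shape(5, 5, f=3, s=2, p=1, k=2)
mnist_shape = conv_output_shape(28, 28, f=3, s=1, p=1, k=32)

print(demo_shape)   # (3, 3, 2)
print(mnist_shape)  # (28, 28, 32)
```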


### Pooling Layer

Another important building block of a CNN is the [pooling layer](https://keras.io/api/layers/pooling_layers/). A pooling layer combines several close-by entries of the input activation map into a single output entry. Pooling reduces the spatial size of the activation maps and thereby the number of parameters in subsequent layers, which helps to reduce overfitting. 

Two of the most common pooling operations are:

- Max pooling 
- Average pooling 

Similar to a convolution layer, a pooling layer applies the same operation to small areas (patches) of its input. In contrast to a convolutional layer, a pooling layer has no weights to learn. The output of a max pooling layer consists of the largest value of each (small) patch of the input, which is a non-linear operation. In contrast, the output of an average pooling layer is obtained by the (local) average over all entries in the corresponding patch of the input.  
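
Both pooling variants can be sketched with NumPy as follows. The function name `pool2d`, the 4x4 toy activation map and the choice of a 2x2 window with stride 2 are assumptions for illustration.

```python
import numpy as np

def pool2d(activation, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D activation map."""
    h, w = activation.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = activation[i * stride:i * stride + size,
                               j * stride:j * stride + size]
            out[i, j] = reduce(patch)   # one output value per patch
    return out

activation = np.array([[1, 3, 2, 4],
                       [5, 6, 1, 2],
                       [7, 2, 9, 0],
                       [1, 4, 3, 8]], dtype=float)

max_pooled = pool2d(activation, mode="max")
avg_pooled = pool2d(activation, mode="average")
print(max_pooled)
print(avg_pooled)
```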

Below is a visualization of the max pooling operation. 

Max pooling with filter size of 2 and stride 1. We take the largest value from the window of the activation map overlapped by the filter.

In [None]:
from IPython.display import Image 
Image(open('/content/drive/My Drive/Deep_learning_unit/Assignment7/images/poolfig.gif','rb').read())

<b>In summary, a pooling layer:</b> 
<ul>
<li>reduces the spatial size of the activation maps (and thus the number of parameters in subsequent layers)</li>
<li>has the following hyperparameters:</li>
    <ul>
      <li>Filter size, F</li>
      <li>Stride length, S</li>
    </ul>
<li>It does not have learnable parameters (it just computes the maximum or average of each patch) </li>    
<li>It accepts an input volume of size:  $W_{in}$x$H_{in}$x$D_{in}$</li>
<li>It outputs a volume of size: $W_{out}$x$H_{out}$x$D_{out}$</li>

<br>
 where 
$W_{out}$x$H_{out}$x$D_{out}$ = $[\frac{W_{in}-F}{S} +1,\frac{H_{in}-F}{S} +1,D_{in}]$
</ul>

Max pooling is used much more often than average pooling, and the two most common hyperparameter choices are F=3, S=2, and F=2, S=2 (the latter being even more common) 
<a href='https://www.youtube.com/watch?v=8oOgPUO-TBY'>[1]</a>
<a href='http://cs231n.github.io/convolutional-networks/'>[2]</a>.
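
For the two common settings mentioned above, the pooling output size can be computed as in this sketch. The helper name `pool_output_shape` and the 28x28x32 input are assumptions, as is the use of floor division when the window does not fit evenly.

```python
def pool_output_shape(w_in, h_in, d_in, f, s):
    """Output volume (W_out, H_out, D_out) of a pooling layer without padding."""
    w_out = (w_in - f) // s + 1
    h_out = (h_in - f) // s + 1
    return (w_out, h_out, d_in)   # pooling leaves the depth unchanged

shape_f2 = pool_output_shape(28, 28, 32, f=2, s=2)
shape_f3 = pool_output_shape(28, 28, 32, f=3, s=2)

print(shape_f2)  # (14, 14, 32)
print(shape_f3)  # (13, 13, 32)
```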

### Fully-Connected layer

In this layer, the feature map from the last convolution or pooling layer is flattened into a single vector of values and fed into a fully connected layer. Fully connected layers are the same as in the ANNs we saw before and perform the same mathematical operations. After the fully connected layers, the final layer uses the softmax activation function, which gives the probabilities of the input belonging to each class.
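
The flatten and softmax steps can be illustrated with NumPy. The 7x7x64 feature maps and the weights below are random placeholders rather than trained values, and a single dense layer with 10 outputs is assumed for brevity.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    z = z - np.max(z)          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

rng = np.random.default_rng(0)
feature_maps = rng.random((7, 7, 64))         # placeholder for the last pooling output
flat = feature_maps.reshape(-1)               # flatten into one vector

weights = rng.normal(scale=0.01, size=(flat.size, 10))  # untrained, random weights
bias = np.zeros(10)

probs = softmax(flat @ weights + bias)        # dense layer followed by softmax
print(flat.shape, probs.shape, probs.sum())
```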

In [None]:
plt.figure(figsize=(12,8))
img=plt.imread('/content/drive/My Drive/Deep_learning_unit/Assignment7/images/fully_connected.png')
plt.imshow(img)
plt.axis('off') 
plt.show()

## 4 - Building and Using CNN in Keras

The following steps are used to build and train a CNN with Keras.

1. Define the CNN architecture. 
2. Configure the learning process by choosing a loss function and an optimizer. 
3. Train the model to find good choices for the network parameters (weights and biases). 

### 4.1 - Choose CNN Architecture 

Now, we will build and train a convolutional neural network using the Sequential API from Keras. Our network architecture consists of the following sequence of layers.

   <b><center>Input → 2 * (Conv → Conv → Pool) → Flatten → Dense → Dense</center></b>

We use the <b>Conv2D</b> Keras class to define a convolution layer:  `tf.keras.layers.Conv2D(args)`. Its constructor takes a number of parameters (arguments). The ones we set are:
    
- `filters`     - the number of filters the layers will learn; the dimensionality of the output space (i.e. the number of output filters in the convolution).


- `kernel_size` - integer specifying the height and width of the 2D convolution window. A single integer sets the same value for all spatial dimensions.


- `padding`     - types of padding to apply; one of "valid" or "same" (case-insensitive). "valid" means no padding, thus spatial dimension will be reduced. "same" results in padding evenly to the left/right or up/down of the input such that output has the same height/width dimension as the input.


- `activation`  - string specifying the activation function to apply after performing the convolution.

Similarly, we apply max pooling in our pooling layer by using the <b>MaxPool2D</b> class: `tf.keras.layers.MaxPool2D(args)`. The parameter we set in this layer is:
- `pool_size` - determines the kernel size or filter size

Conv2D and MaxPool2D have further optional parameters; read more [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D) and [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool2D).


We use the <b>Dense</b> Keras class to define a fully connected layer:  `tf.keras.layers.Dense(args)`. Its constructor takes a number of parameters (arguments). The ones we set are:

- `units` - the size of the fully connected layer, i.e. the number of neurons in the layer.

- `activation`  - string specifying the activation function to apply after performing the fully connected layer operation.



**Note!** Keras has its own docs, which are sometimes more useful, e.g. [Conv2D layer](https://keras.io/api/layers/convolution_layers/convolution2d/)

In addition, we can automatically flatten the input by calling `tf.keras.layers.Flatten()`

**Note!** The activation is `relu` for all hidden layers and `softmax` for the output layer.

In [None]:
# define the model architecture
model = tf.keras.models.Sequential([
    #"""
    #Remember to add comma (,) at the end of all the layers
    #"""
    # input + 1st(Conv → Conv → Pool) block, activation is relu for both conv layers
    # 1st conv layer --> 32 filters, kernel_size 3, padding same, input_shape=[28,28,1] So the input_shape is only given in the first layer and not in rest of the layers
    # 2nd conv layer --> 32 filters, kernel_size 3, padding same
    # Pool layer --> pool_size is 2 
    ### Start CODE HERE ### (3 lines of code) 
    


    ### End CODE HERE ###

    # 2nd(Conv → Conv → Pool) block, activation is relu for both conv layers
    # 1st conv layer --> 64 filters, kernel_size 3, padding same
    # 2nd conv layer --> 64 filters, kernel_size 3, padding same
    # Pool layer --> pool_size is 2 
    ### Start CODE HERE ### (3 lines of code)
   


    ### End CODE HERE ###

    # Flatten 
    ### Start CODE HERE ### (1 lines of code)
    
    ### End CODE HERE ###

    # Two fully connected layers with first layer of size 128 and second layer of size 10.
    # Remember to use relu activation in first layer and softmax activation in second layer 
    ### Start CODE HERE ### (2 lines of code)
    
    
    ### End CODE HERE ###
])

**Number of parameters in CNN layer**

A deep learning model can learn hundreds of thousands of parameters (weights and biases). Knowing the total number of learnable parameters helps to judge the sample size (number of training data points) required to avoid overfitting. 
For a convolutional layer, we can calculate the number of learnable parameters as follows (for a dense layer, the count is `(inputs + 1 (for bias)) * outputs`):

`Number of params  = (kernel_width * kernel_height * channels_in + 1 (for bias)) * channels_out`
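
A minimal sketch of this calculation in Python, applied to the layer sizes of the architecture described in section 4.1. The helper names `conv_params` and `dense_params` are made up for this illustration; pooling and flatten layers are omitted because they have no learnable parameters.

```python
def conv_params(kernel_size, channels_in, channels_out):
    """Weights plus one bias per filter of a convolutional layer with square kernels."""
    return (kernel_size * kernel_size * channels_in + 1) * channels_out

def dense_params(units_in, units_out):
    """Weights plus one bias per neuron of a fully connected layer."""
    return (units_in + 1) * units_out

layers = [
    ("Conv2D 1", conv_params(3, 1, 32)),
    ("Conv2D 2", conv_params(3, 32, 32)),
    ("Conv2D 3", conv_params(3, 32, 64)),
    ("Conv2D 4", conv_params(3, 64, 64)),
    ("Dense 1", dense_params(7 * 7 * 64, 128)),
    ("Dense 2", dense_params(128, 10)),
]

for name, count in layers:
    print(f"{name}: {count}")

total = sum(count for _, count in layers)
print("total:", total)
```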



Using the above formula, let's calculate the numbers of parameters in the CNN we defined above: 


param_calculation | input | output | layer | param
---|---|---|---|---
(3 * 3 * 1 + 1) * 32 | 28x28x1 | 28x28x32 | Conv2D | 320
(3 * 3 * 32 + 1) * 32 | 28x28x32 | 28x28x32 | Conv2D | 9248
0 | 28x28x32 | 14x14x32 | MaxPool2D | 0
(3 * 3 * 32 + 1) * 64 | 14x14x32 | 14x14x64 | Conv2D | 18496
(3 * 3 * 64 + 1) * 64 | 14x14x64 | 14x14x64 | Conv2D | 36928
0 | 14x14x64 | 7x7x64 | MaxPool2D | 0
0 | 7x7x64 | 3136 | Flatten | 0
(3136 + 1) * 128 | 3136 | 128 | Dense | 401536
(128 + 1) * 10 | 128 | 10 | Dense | 1290

We don't have to calculate the number of parameters manually as there is a built-in function `summary()`, which will do it automatically:

In [None]:
model.summary()

Furthermore, we can visualize the architecture of the model with `utils.plot_model()` function:

In [None]:
# plot the graph  of the model and save to file
tf.keras.utils.plot_model(
    model,
    to_file='model.png', show_shapes=True, show_layer_names=True)

You may notice that the spatial size of the output decreases while the number of filters increases as we go deeper into the network. This is a common CNN architecture.\
**Note!** The convention is to choose the number of filters as a power of 2 (e.g. $2^5=32$, $2^6=64$), and the kernel size as an odd integer (e.g. 3 or 5).

### 4.2 - Loss Function and optimizer

You can compile the model by using the following `model.compile(args)` where args are the parameters which are passed as:

- `loss` - In order to find good values for the CNN weights we use the `sparse_categorical_crossentropy`.  

**Note!** Use `categorical_crossentropy` loss function for multiclass classification and when labels are provided in one_hot representation. Use `sparse_categorical_crossentropy` when you want to provide labels as integers.\
What are the differences? In principle none, as they both compute categorical crossentropy, but read more [here](https://stackoverflow.com/questions/58565394/what-is-the-difference-between-sparse-categorical-crossentropy-and-categorical-c).

- `optimizer` - We can specify the RMSprop optimizer as follows: `optimizer='rmsprop'` or `optimizer=tf.keras.optimizers.RMSprop()`

**Note!** Other optimizers, such as Adam, also automatically adapt the learning rate during training, and would likely work similarly well here.

- `metrics` - the metrics we want to compute during training; here we compute accuracy, given as `metrics=["accuracy"]`
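
The note on the two cross-entropy variants can be checked numerically. The sketch below computes the per-example loss by hand with NumPy instead of calling Keras; the predicted probabilities and labels are made-up toy values.

```python
import numpy as np

# Toy predicted class probabilities for 2 examples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])

int_labels = np.array([0, 1])     # integer labels ("sparse" form)
one_hot = np.eye(3)[int_labels]   # the same labels in one-hot form

# Sparse variant: pick the probability of the true class directly.
sparse_loss = -np.log(probs[np.arange(len(int_labels)), int_labels])

# One-hot variant: sum over classes, where only the true class contributes.
categorical_loss = -np.sum(one_hot * np.log(probs), axis=1)

print(sparse_loss)
print(categorical_loss)
```

Both variants give the same values; they differ only in the label format they expect.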


In [None]:
# Compile your model
### Start CODE HERE ### (1 lines of code)

### End CODE HERE ###

### 4.3 - Train the network

We will train the CNN for 20 epochs; this may take a few minutes to run.

The loss and accuracy are good indicators of learning progress. In each epoch, the model makes predictions for the training data; the loss is computed by evaluating these predictions against the known labels, and the accuracy is the fraction of correct predictions.
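
As a small illustration of how accuracy follows from the predictions, the sketch below uses made-up probabilities for four examples and two classes.

```python
import numpy as np

# Toy predicted probabilities (rows: examples, columns: classes).
probs = np.array([[0.1, 0.9],
                  [0.8, 0.2],
                  [0.3, 0.7],
                  [0.6, 0.4]])
labels = np.array([1, 0, 0, 0])

# The predicted class is the one with the highest probability.
predictions = np.argmax(probs, axis=-1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = np.mean(predictions == labels)
print(predictions, accuracy)
```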

In [None]:
# training the network (~40 min on CPU, ~5 min on Colab GPU)
history = model.fit(X_train, Y_train, epochs=20, batch_size=32)

In [None]:
# accuracy values are stored in the dictionary `History.history` 
# the dictionary key to access these accuracy values is:
# "acc" in tensorflow versions <2
# "accuracy" in later versions

# check the current version
if int(tf.__version__.split('.')[0]) > 1:
    acc_key = 'accuracy'
else:
    acc_key = 'acc'

#-----------------------------------------------------------
# Retrieve the lists of training accuracy and loss values,
# one entry per training epoch
#-----------------------------------------------------------
acc      = history.history[acc_key]
loss     = history.history['loss']
epochs   = range(1,len(acc)+1) # epoch numbers 1..number of epochs
#------------------------------------------------
# Plot training accuracy per epoch
#------------------------------------------------
plt.plot(epochs, acc,  label='Training accuracy')
plt.title('Training accuracy')
plt.xticks(epochs)
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend();

Now, let's check the accuracy of the model on the test set, which is done by calling the `evaluate` method on the model. We will also define the `batch_size` parameter here.

In [None]:
test_loss, test_accuracy = model.evaluate(X_test.reshape(-1,28,28,1), Y_test, batch_size=128, verbose=2)
print('Accuracy on test dataset:', test_accuracy)

## 5 - Visualizing the activation maps of  convolutional layers

In this section we will plot the activation maps of a particular CNN layer. Visualizing the activation of individual layer helps to understand how the input is decomposed into some relevant pixel patterns within an image or in other words, what local patterns the layer is learning. 

In [None]:
plt.imshow(X_test[0], cmap='gray')
plt.show()

In [None]:
# extract the outputs from all layers:
layer_outputs = [layer.output for layer in model.layers]

# create a model that will return these outputs, given the model input:
activation_model = tf.keras.models.Model(inputs = model.input, outputs = layer_outputs)

# return feature maps of the first test image
activation = activation_model.predict(X_test[0].reshape(1, 28, 28, 1))

Let's take the activation of the first convolution layer for our input.

In [None]:
first_layer_activation = activation[0] 
print(first_layer_activation.shape)

The output of the first convolution layer consists of activation maps of size $28 \times 28 \times 32$. Instead of only three channels (red, green, blue), this output has $32$ channels, which correspond to the different filters applied to the input image. The code snippet below displays the activation maps of the first convolutional layer. 

In [None]:
# visualize activation maps of the first convolutional layer
plt.figure(figsize=(15,15))

for i in range(first_layer_activation.shape[-1]):
    plt.subplot(6,6,i+1)
    plt.xticks([]) # remove ticks on x-axis
    plt.yticks([]) # remove ticks on y-axis
    plt.imshow(first_layer_activation[0, :, :, i], cmap='gray')
    plt.title('act. map '+ str(i+1))

plt.show()

Each kernel encodes relatively independent features of the input image. We can see that the first convolutional layer has learned lower-level features/patterns from the image, such as various edges. \
Now, let's go deeper into the network and select the activation maps of the 5th layer.

In [None]:
# Calculate fifth_layer_activation
### Start CODE HERE ### (1 lines of code)

### END CODE HERE ###

print(fifth_layer_activation.shape)

In [None]:
# visualize activation maps of the fifth layer (a convolutional layer)
plt.figure(figsize=(15,15))

for i in range(fifth_layer_activation.shape[-1]):
    plt.subplot(8,8,i+1)
    plt.xticks([]) # remove ticks on x-axis
    plt.yticks([]) # remove ticks on y-axis
    plt.imshow(fifth_layer_activation[0, :, :, i], cmap='gray')
    plt.title('act. map '+ str(i+1))

plt.show()

Activations from deeper layers reveal that as we go deeper into the network, the learned features become less visually interpretable, as they encode higher-level features/patterns of an object.

You can plot activation maps for different number categories and check if those will be different from what we observe here.

## 6 - Prediction Accuracy on Test Set 

Let's evaluate the accuracy of our model on test set:

In [None]:
test_loss, test_accuracy = model.evaluate(X_test.reshape(-1,28,28,1), Y_test, verbose=2)
print('Accuracy on test dataset:', test_accuracy)

We can also inspect which items the CNN predicted incorrectly. Sometimes it is useful to manually check which types of images are misclassified, as this might give a hint on how to improve the model.
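
One possible way to summarize which classes get confused with each other is a confusion matrix. The sketch below builds one by hand with NumPy on made-up labels for 3 classes; it is an optional illustration and not part of the assignment code.

```python
import numpy as np

# Toy true and predicted labels for 7 examples and 3 classes.
y_true_toy = np.array([0, 1, 2, 2, 1, 0, 2])
y_pred_toy = np.array([0, 2, 2, 2, 1, 0, 1])
num_classes = 3

# Rows: true class, columns: predicted class.
confusion = np.zeros((num_classes, num_classes), dtype=int)
for t, p in zip(y_true_toy, y_pred_toy):
    confusion[t, p] += 1

# Indices of the misclassified examples.
misclassified = np.nonzero(y_true_toy != y_pred_toy)[0]

print(confusion)
print(misclassified)
```

Off-diagonal entries count the misclassifications; here class 1 and class 2 are each confused with the other once.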

In [None]:
#get the predictions for the test data
predicted_classes  = np.argmax(model.predict(X_test.reshape(-1,28,28,1)), axis=-1)
#get true test_label
y_true=Y_test

In [None]:
# get the indices of correct and incorrect predictions
correct=np.nonzero(predicted_classes==y_true)[0]
incorrect=np.nonzero(predicted_classes!=y_true)[0]

print("Correct predicted classes:",correct.shape[0])
print("Incorrect predicted classes:",incorrect.shape[0])

In [None]:
class_names = ['Zero', 'One', 'Two', 'Three', 'Four',
               'Five',      'Six',   'Seven',  'Eight',   'Nine']

In [None]:
def plot_images(data_index):
    '''
        This is a function to plot first 9 images.    
        data_index: indices of images.
    
    '''
    # plot the sample images 
    f, ax = plt.subplots(3,3, figsize=(10,10))

    for i, indx in enumerate(data_index[:9]):
        ax[i//3, i%3].imshow(X_test[indx].reshape(28,28), cmap='gray')
        ax[i//3, i%3].axis('off')
        ax[i//3, i%3].set_title("True:{}  Pred:{}".format(class_names[Y_test[indx]],class_names[predicted_classes[indx]]))
    plt.show()    

# display correctly classified images
plot_images(correct)

In [None]:
# display incorrectly classified images
plot_images(incorrect)

Congratulations! You have finished the assignment and built a model that recognizes handwritten digits with almost 99% accuracy on the test set. If you wish, feel free to play around with this dataset further. 

