[Previous Notebook](Part_2.ipynb)
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
[Home Page](../Start_Here.ipynb)
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
[Next Notebook](Resnets.ipynb)

# CNN Primer and Keras 101 - Continued 

This notebook introduces convolutional neural networks and associated terminology and concepts.

**Contents of this notebook:**

- [Convolutional neural networks (CNNs)](#Convolutional-Neural-Networks-(CNNs))
- [Why are CNNs good for image related tasks?](#Why-are-CNNs-good-for-image-related-tasks?)
- [Implementing image classification using CNNs](#Implementing-image-classification-using-CNNs)
- [Conclusion](#Conclusion)


**By the end of this notebook you will:**

- Understand how a convolutional neural network works
- Write your own CNN classifier and train it

## Convolutional Neural Networks (CNNs) 

Convolutional neural networks are widely used for image classification, object detection, and facial recognition because they need far fewer parameters than fully connected networks on image inputs, without sacrificing model quality.

Let's now understand what makes up a CNN architecture and how it works. Here is an example of a CNN architecture for a classification task:

![alt_text](images/cnn.jpeg)

*Source: https://fr.mathworks.com/solutions/deep-learning/convolutional-neural-network.html*

Each input image is passed through a series of convolution layers with filters (kernels), pooling layers, and fully connected layers. Finally, a softmax function assigns each class a probability between 0 and 1, and the image is classified accordingly.

We will discuss the following concepts:

- Convolution Layer 
- Strides and Padding 
- Pooling Layer
- Fully Connected Layer 

#### Convolution Layer: 

A convolution layer learns features from the input by preserving the relationships between neighbouring pixels, and is typically the first layer in the network. The kernel size is a hyperparameter and can be altered according to the complexity of the problem. Let's see how a kernel operates on input data:

![alt_text](images/conv.gif)

*Source: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53*
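
The sliding-window operation in the animation can be sketched in plain NumPy. This is only an illustrative sketch: the function name `conv2d_single_channel`, the 5 x 5 input, and the edge-detecting kernel are assumptions made for the example, and, as in most deep learning libraries, the kernel is not flipped (strictly speaking this is a cross-correlation).

```python
import numpy as np

def conv2d_single_channel(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Patch of the image currently covered by the kernel
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            output[i, j] = np.sum(patch * kernel)
    return output

# Hypothetical 5 x 5 input whose values grow by 1 from left to right
image = np.arange(25, dtype=float).reshape(5, 5)
# Hypothetical kernel that responds to horizontal intensity changes
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

feature_map = conv2d_single_channel(image, kernel)
print(feature_map.shape)  # (3, 3)
print(feature_map)        # every entry is -6.0 for this input
```

In a real network the kernel values are not chosen by hand as they are here; they are learned during training.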

Given this definition of a convolution operation, let us now see how a convolution layer is applied in the context of a neural network.

![alt_text](images/conv_depth.png)

*Source: https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215*

Defining the terms:

- Hin: Height dimension of the input layer
- Win: Width dimension of the input layer
- Din: Depth of the input layer
- h: Height of the kernel
- w: Width of the kernel
- Dout: Number of kernels/filters to apply (and number of output channels) 

Note: The depth of the input layer (Din) needs to match the depth of each kernel.

Din and Dout are also called the number of channels in the layer. We can notice from the introductory image that the number of channels typically increases as we go deeper into the network, while the height and width keep decreasing. This is done so that the filters aggregate features from the previous layers.
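
To make the role of Din and Dout concrete, here is a rough NumPy sketch of a convolution layer with stride 1 and no padding. The function name `conv_layer` and the sizes used (a 7 x 7 x 3 input and eight 3 x 3 kernels) are assumptions chosen for illustration only.

```python
import numpy as np

def conv_layer(inputs, kernels):
    """inputs: (H_in, W_in, D_in); kernels: (h, w, D_in, D_out). Stride 1, no padding."""
    h_in, w_in, d_in = inputs.shape
    h, w, kernel_depth, d_out = kernels.shape
    assert d_in == kernel_depth, "kernel depth must match input depth"
    h_out, w_out = h_in - h + 1, w_in - w + 1
    output = np.zeros((h_out, w_out, d_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = inputs[i:i + h, j:j + w, :]  # shape (h, w, D_in)
            # Each of the D_out kernels spans the full input depth and yields one number
            output[i, j, :] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return output

rng = np.random.default_rng(0)
inputs = rng.normal(size=(7, 7, 3))      # H_in = W_in = 7, D_in = 3
kernels = rng.normal(size=(3, 3, 3, 8))  # h = w = 3, D_in = 3, D_out = 8

out = conv_layer(inputs, kernels)
print(out.shape)     # (5, 5, 8): height and width shrink, depth becomes D_out
print(kernels.size)  # 216 weights = h * w * D_in * D_out (biases not included)
```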


#### Strides and Padding 

The stride is the number of pixels shifted over the input matrix during convolution. When the stride is 1, we move the filters 1 pixel at a time. When the stride is 2, we move the filters 2 pixels at a time, and so on.

Sometimes filters do not fit perfectly on the input image. So, we have two options:
- Pad the picture with zeros (zero-padding) so that it fits
- Drop the part of the image where the filter did not fit. This is called valid padding which keeps only the valid part of the image.
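
The effect of stride and padding on the output size can be checked with a small helper. This is a sketch that follows the "valid" and "same" conventions used by Keras; the function name `conv_output_size` is made up for this example.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial dimension."""
    if padding == "valid":
        # Only positions where the kernel fits entirely inside the input
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        # Zero-pad so that the output size depends only on the stride
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")

print(conv_output_size(28, 3, stride=1, padding="valid"))  # 26
print(conv_output_size(28, 3, stride=1, padding="same"))   # 28
print(conv_output_size(28, 3, stride=2, padding="valid"))  # 13
print(conv_output_size(28, 3, stride=2, padding="same"))   # 14
```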

#### Pooling Layer

Pooling layers shrink the feature maps, which reduces the amount of computation and the number of parameters in later layers when the images are large. Spatial pooling, also called subsampling or downsampling, reduces the dimensionality of each layer but retains important information. Spatial pooling can be of different types:
- Max Pooling
    - Take the largest element from the input data.
- Average Pooling
    - Take the average of the elements from the input data.
- Sum Pooling
    - Sum of all elements in the input data.

![alt_text](images/max_pool.png)

*Source: https://www.programmersought.com/article/47163598855/*
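
The three pooling types can be sketched with NumPy on a single feature map. The helper `pool2d` and the 4 x 4 example values are assumptions for illustration, and the sketch uses non-overlapping windows (stride equal to the pool size).

```python
import numpy as np

def pool2d(feature_map, pool_size=2, mode="max"):
    """Non-overlapping spatial pooling over a 2D feature map."""
    h, w = feature_map.shape
    out_h, out_w = h // pool_size, w // pool_size
    # Drop edge rows/columns that do not fill a complete window
    trimmed = feature_map[:out_h * pool_size, :out_w * pool_size]
    # Group pixels so that axes 1 and 3 index positions inside each window
    windows = trimmed.reshape(out_h, pool_size, out_w, pool_size)
    reducers = {"max": np.max, "average": np.mean, "sum": np.sum}
    return reducers[mode](windows, axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]], dtype=float)

print(pool2d(fm, 2, "max"))      # [[6. 8.] [3. 4.]]
print(pool2d(fm, 2, "average"))  # [[3.75 5.25] [2. 2.]]
print(pool2d(fm, 2, "sum"))      # [[15. 21.] [8. 8.]]
```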

#### Fully Connected Layer

We will then flatten the output from the convolution layers and feed it into a _fully connected layer_ to generate a prediction. The fully connected layer is a model whose inputs are the features extracted by the convolution layers.

These fully connected layers are then trained along with the _kernels_ during the training process.
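
As a rough sketch of the flatten-then-classify step, the NumPy snippet below flattens a stack of feature maps, applies one dense layer, and converts the result to probabilities with softmax. The 7 x 7 x 32 feature-map size, the ten output classes, and the random weights are assumptions for illustration; in a real network the weights are learned.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(7, 7, 32))  # hypothetical output of the last pooling layer

flat = feature_maps.reshape(-1)             # flatten to a 1D vector
weights = rng.normal(scale=0.01, size=(flat.size, 10))  # untrained, random weights
bias = np.zeros(10)

probs = softmax(flat @ weights + bias)      # dense layer followed by softmax
print(flat.shape)   # (1568,)
print(probs.shape)  # (10,)
print(probs.sum())  # sums to 1, up to floating-point rounding
```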

### Transposed Convolution

When we apply our convolution operation over an image, we find that the number of channels increases while the height and width of the image decrease. In some cases (for different applications) we will need to up-sample our images. _Transposed convolution_ helps to up-sample the images from neural network layers.

<table><tr>
<td> <img src="images/convtranspose.gif" alt="Drawing" style="width: 540px;"/></td>
<td> <img src="images/convtranspose_conv.gif" alt="Drawing" style="width: 500px;"/> </td>
</tr></table>

*Source https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215*

Transposed convolution can also be visualised as an ordinary convolution over a layer padded with a 2x2 border of zeros (as seen on the right).
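
One way to see the up-sampling is to let every input pixel "stamp" a scaled copy of the kernel onto the output. The sketch below does this in NumPy for a single channel; the function name `conv_transpose2d`, the 2 x 2 input, and the all-ones 3 x 3 kernel are assumptions chosen so the numbers are easy to follow.

```python
import numpy as np

def conv_transpose2d(inputs, kernel, stride=1):
    """Single-channel transposed convolution without output cropping."""
    in_h, in_w = inputs.shape
    kh, kw = kernel.shape
    out_h = (in_h - 1) * stride + kh
    out_w = (in_w - 1) * stride + kw
    output = np.zeros((out_h, out_w))
    for i in range(in_h):
        for j in range(in_w):
            # Each input value adds a scaled copy of the kernel to the output
            output[i * stride:i * stride + kh,
                   j * stride:j * stride + kw] += inputs[i, j] * kernel
    return output

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
k = np.ones((3, 3))

up = conv_transpose2d(x, k)
print(up.shape)  # (4, 4): the 2 x 2 input is up-sampled to 4 x 4
print(up)
```

With a larger stride the stamped copies are spread further apart, so the output grows faster; for example, `conv_transpose2d(x, k, stride=2)` has shape `(5, 5)`.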


## Why are CNNs good for image related tasks? 

In 1982, **David Marr**'s book [Vision](http://lolita.unice.fr/~scheer/cogsci/Marr%2082%20-%20Vision.pdf) was published. It was a breakthrough in understanding how the brain does vision: he stated that the vision task is performed in a hierarchical manner. You start simple and get complex. For example, you start with something as simple as identifying edges and colours, then build upon them to detect objects, then classify them, and so on.

The architecture of CNNs is designed to emulate this hierarchical approach. Early convolution layers extract low-level features such as edges and simple patterns, and deeper layers combine them into higher-level features. Individual filters act like image operations such as blurring or sharpening, and pooling then condenses each filter's output so that the most important information is kept. Because this layered processing resembles our understanding of how the brain deals with vision, CNNs are an appropriate choice for understanding and classifying images.

# Implementing Image Classification using CNNs

We will follow the same steps for data preprocessing as mentioned in the previous notebook.

In [None]:
# Import Necessary Libraries

from __future__ import absolute_import, division, print_function, unicode_literals

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

# Helper libraries
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Let's Import the Dataset
fashion_mnist = keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Print array size of training dataset
print("Size of Training Images: " + str(train_images.shape))
# Print array size of labels
print("Size of Training Labels: " + str(train_labels.shape))

# Print array size of test dataset
print("Size of Test Images: " + str(test_images.shape))
# Print array size of labels
print("Size of Test Labels: " + str(test_labels.shape))

# Let's see how our outputs look
print("Training Set Labels: " + str(train_labels))
# Data in the test dataset
print("Test Set Labels: " + str(test_labels))

train_images = train_images / 255.0
test_images = test_images / 255.0

## Further data preprocessing

You may have noticed by now that the training set is of shape `(60000,28,28)`.

In convolutional neural networks, we need to feed the data in the form of a 4D Array as follows:

`(num_images, x-dims, y-dims, num_channels_per_image)`

So, as our image is grayscale, we will reshape it to `(60000,28,28,1)` before passing it to our neural network architecture.
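
As a small sketch of this reshape on a dummy array (five blank 28 x 28 images stand in for the real dataset here), either `reshape` or `np.expand_dims` adds the trailing channel dimension:

```python
import numpy as np

# Hypothetical stand-in for a batch of five grayscale images
batch = np.zeros((5, 28, 28))

# Option 1: explicit reshape
with_channel = batch.reshape(batch.shape[0], 28, 28, 1)

# Option 2: append a new axis of length 1 at the end
with_channel_alt = np.expand_dims(batch, axis=-1)

print(with_channel.shape)      # (5, 28, 28, 1)
print(with_channel_alt.shape)  # (5, 28, 28, 1)
```

Neither option changes the pixel values; only the array's shape metadata differs.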

In [None]:
# Reshape input data from (28, 28) to (28, 28, 1)
w, h = 28, 28
train_images = train_images.reshape(train_images.shape[0], w, h, 1)
test_images = test_images.reshape(test_images.shape[0], w, h, 1)

## Defining Convolution Layers

Let us see how to define convolution, max pooling, and dropout layers.


#### Convolution Layer 

We will be using the following API to define the Convolution Layer.

```tf.keras.layers.Conv2D(filters, kernel_size, padding='valid', activation=None, input_shape)```


Let us briefly define the parameters:

- `filters`: The dimensionality of the output space (i.e. the number of output filters in the convolution).
- `kernel_size`: An integer or tuple/list of 2 integers, specifying the height and width of the 2D convolution window. Can be a single integer to specify the same value for all spatial dimensions.
- `padding`: One of "valid" or "same" (case-insensitive).
- `activation`: Activation function to use (see activations). If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).

Documentation: [Convolutional Layers](https://keras.io/layers/convolutional/)

#### Pooling Layer 

`tf.keras.layers.MaxPooling2D(pool_size=2)`

- `pool_size`: Size of the max pooling window.

Documentation: [Pooling Layers](https://keras.io/layers/pooling/)

#### Dropout 

Dropout is an approach to regularization in neural networks which helps reduce interdependent learning amongst the neurons.

Simply put, dropout refers to ignoring units (i.e. neurons) during the training phase, with a certain set of neurons chosen at random. By "ignoring," we mean these units are not considered during a particular forward or backward pass.

It is defined by the following function:

`tf.keras.layers.Dropout(0.3)`

- Parameter: Fraction of the input units to drop (float between 0 and 1).

Documentation: [Dropout](https://keras.io/layers/core/#dropout)
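
To illustrate what happens to the activations, here is a minimal NumPy sketch of dropout. It assumes the common "inverted dropout" formulation, in which the surviving units are scaled up by `1 / (1 - rate)` during training so that the expected activation stays the same; the function name `dropout` and the all-ones input are made up for the example.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of units during training; identity at inference."""
    if not training or rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate  # True for units that survive
    # Scale the survivors so the expected value of each unit is unchanged
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
acts = np.ones(1000)

dropped = dropout(acts, rate=0.3, rng=rng)
print((dropped == 0).mean())  # roughly 0.3 of the units are zeroed
print(dropped.mean())         # stays close to 1.0 thanks to the rescaling

unchanged = dropout(acts, rate=0.3, rng=rng, training=False)
print(np.array_equal(unchanged, acts))  # True: nothing is dropped at inference
```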

## Defining our Model and Training  

Now that we have seen the building blocks for a CNN in Keras, let us build a five layer model:

- Input layer: (28, 28, 1) 
    - Size of the input image
- Convolution layers:
    - First layer: Kernel size (2 x 2), resulting in 64 channels.
        - Pooling of size (2 x 2) makes the layer (14 x 14 x 64)
    - Second layer: Kernel size (2 x 2), resulting in 32 channels.
        - Pooling of size (2 x 2) makes the layer (7 x 7 x 32)
- Fully connected layers:
    - Flatten the convolution layers to 1568 nodes = (7 * 7 * 32)
    - Dense layer of size 256
- Output layer:
    - Dense layer with 10 classes using `softmax` activation

![alt_text](images/our_cnn.png)
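
The layer sizes above can be double-checked with a few lines of plain Python. This sketch assumes `'same'` padding for the convolutions (so only pooling changes the height and width) and one bias per output channel or unit; the helper names `conv_params` and `dense_params` are made up for the example.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per output channel for a square kernel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units

size, channels = 28, 1  # input image: 28 x 28 x 1
total = 0
for filters in (64, 32):
    total += conv_params(2, channels, filters)
    channels = filters
    size //= 2          # 2 x 2 pooling halves the height and width
    print((size, size, channels))  # (14, 14, 64) then (7, 7, 32)

flat_units = size * size * channels
total += dense_params(flat_units, 256) + dense_params(256, 10)

print(flat_units)  # 1568
print(total)       # 412778 trainable parameters
```

These numbers can be compared against the output of `model.summary()` once the model is defined.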

Now we can define our model in Keras.

In [None]:
from tensorflow.keras import backend as K
import tensorflow as tf
K.clear_session()
model = tf.keras.Sequential()

# Must define the input shape in the first layer of the neural network
model.add(tf.keras.layers.Conv2D(filters=64, kernel_size=2, padding='same', activation='relu', input_shape=(28,28,1))) 
model.add(tf.keras.layers.MaxPooling2D(pool_size=2))
# model.add(tf.keras.layers.Dropout(0.3))
# Second convolution layer
model.add(tf.keras.layers.Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=2))
model.add(tf.keras.layers.Dropout(0.3))
# Fully connected layer
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

# Take a look at the model summary
model.summary()

### Compile the model

Before the model is ready for training, it needs a few more settings. These are added during the model's *compile* step:

* *Loss function* —This measures how accurate the model is during training. You want to minimize this function to "steer" the model in the right direction.
* *Optimizer* —This is how the model is updated based on the data it sees and its loss function.
* *Metrics* —Used to monitor the training and testing steps. The following example uses *accuracy*, the fraction of the images that are correctly classified.
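
To give a feel for what the loss and the metric measure, the sketch below computes sparse categorical cross-entropy and accuracy by hand in NumPy. The three-class probabilities and labels are invented for the example, and the helper names are not part of Keras.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true (integer) labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(picked)))

def accuracy(probs, labels):
    """Fraction of samples whose most probable class equals the true label."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

# Hypothetical softmax outputs for three samples and three classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3]])
labels = np.array([0, 1, 2])  # integer class ids, as in `train_labels`

loss = sparse_categorical_crossentropy(probs, labels)
acc = accuracy(probs, labels)
print(loss)  # lower is better; the confident wrong guess on sample 3 dominates
print(acc)   # two of the three samples are classified correctly
```

The "sparse" variant is used because the labels are integer class ids rather than one-hot vectors.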

In [None]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

## Train the model

Training the neural network model requires the following steps:

1. Feed the training data to the model. In this example, the training data is in the `train_images` and `train_labels` arrays.
2. The model learns to associate images and labels.
3. You ask the model to make predictions about a test set—in this example, the `test_images` array. Verify that the predictions match the labels from the `test_labels` array.

To start training, call the `model.fit` method, so called because it "fits" the model to the training data:

In [None]:
model.fit(train_images, train_labels, batch_size=32, epochs=5)

In [None]:
# Evaluating the model using the test set

test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)

print('\nTest accuracy:', test_acc)

## Making Predictions
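
`model.predict` returns one row per image, with one softmax probability per class. As a rough sketch of how such rows map back to class names, the snippet below uses a made-up array in place of real model output (the name `fake_predictions` and its values are assumptions for illustration).

```python
import numpy as np

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Hypothetical stand-in for two rows of `model.predict` output
fake_predictions = np.array([
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.01, 0.90],
    [0.05, 0.00, 0.80, 0.00, 0.05, 0.00, 0.10, 0.00, 0.00, 0.00],
])

predicted_ids = np.argmax(fake_predictions, axis=1)  # most probable class per row
confidences = np.max(fake_predictions, axis=1)       # probability of that class
predicted_names = [class_names[i] for i in predicted_ids]

for name, conf in zip(predicted_names, confidences):
    print("{} ({:.0f}%)".format(name, 100 * conf))
# Ankle boot (90%)
# Pullover (80%)
```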

In [None]:
# Making predictions from the test_images

predictions = model.predict(test_images)

In [None]:
# Reshape the images back from (28, 28, 1) to (28, 28) so that plt.imshow can display them
w, h = 28, 28
train_images = train_images.reshape(train_images.shape[0], w, h)
test_images = test_images.reshape(test_images.shape[0], w, h)


# Helper functions to plot images 
def plot_image(i, predictions_array, true_label, img):
  predictions_array, true_label, img = predictions_array, true_label[i], img[i]
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])

  plt.imshow(img, cmap=plt.cm.binary)

  predicted_label = np.argmax(predictions_array)
  if predicted_label == true_label:
    color = 'blue'
  else:
    color = 'red'

  plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                100*np.max(predictions_array),
                                class_names[true_label]),
                                color=color)

def plot_value_array(i, predictions_array, true_label):
  predictions_array, true_label = predictions_array, true_label[i]
  plt.grid(False)
  plt.xticks(range(10))
  plt.yticks([])
  thisplot = plt.bar(range(10), predictions_array, color="#777777")
  plt.ylim([0, 1])
  predicted_label = np.argmax(predictions_array)

  thisplot[predicted_label].set_color('red')
  thisplot[true_label].set_color('blue')

In [None]:
# Plot the first X test images, their predicted labels, and the true labels.
# Color correct predictions in blue and incorrect predictions in red.
num_rows = 5
num_cols = 3
num_images = num_rows*num_cols
plt.figure(figsize=(2*2*num_cols, 2*num_rows))
for i in range(num_images):
  plt.subplot(num_rows, 2*num_cols, 2*i+1)
  plot_image(i, predictions[i], test_labels, test_images)
  plt.subplot(num_rows, 2*num_cols, 2*i+2)
  plot_value_array(i, predictions[i], test_labels)
plt.tight_layout()
plt.show()

### Conclusion

Here is a table comparing both of our models after five epochs of training (your results may be slightly different):

|  Model                                   | Train Accuracy | Train Loss | Test Accuracy | Test Loss |
|------------------------------------------|----------------|------------|---------------|-----------|
| Fully connected network - After 5 Epochs |         0.8923 |     0.2935 |        0.8731 |    0.2432 |
| Convolutional network - After 5 Epochs   |         0.8860 |     0.3094 |        0.9048 |    0.1954 |        

## Exercise

Play with different hyper-parameters (number of epochs, depth of layers, kernel size) to bring down the loss further.

## Important

<mark>Shut down the kernel before clicking on “Next Notebook” to free up the GPU memory.</mark>

## Acknowledgements


[Transposed Convolutions explained](https://medium.com/apache-mxnet/transposed-convolutions-explained-with-ms-excel-52d13030c7e8)

[Why are CNNs used more for computer vision tasks than other tasks?](https://www.quora.com/Why-are-CNNs-used-more-for-computer-vision-tasks-than-other-tasks)

[Comprehensive introduction to Convolution](https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215)

## Licensing
This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)
