<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti107/blob/main/session-1/first_cnn_for_image_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/></a>

# First Convolutional Neural Network for Image Classification

In this exercise, you will learn to build your first simple Convolutional Neural Network and use it to classify images. 

You will learn: 
- how to construct a Convolutional Neural Network 
- how to adjust the hyper-parameters of the network (e.g. number of filters, number of layers) and observe the effects 
- how to visualize the activations of the hidden layers 


## Fashion MNIST Dataset

We will be using the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset, which contains 70,000 grayscale images in 10 categories (60,000 for training and 10,000 for testing). 

![fashion-mnist](images/fashion-mnist.png)

The images are 28x28 NumPy arrays, with pixel values ranging from 0 to 255. The *labels* are an array of integers, ranging from 0 to 9. These correspond to the *class* of clothing the image represents:

|Label|Class|
|---|---|
|0|T-shirt/top|
|1|Trouser|
|2|Pullover|
|3|Dress|
|4|Coat|
|5|Sandal|
|6|Shirt|
|7|Sneaker|
|8|Bag|
|9|Ankle boot|       
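
The integer labels are easier to read once they are mapped back to class names. A minimal sketch of such a lookup follows; `CLASS_NAMES` and `label_to_name` are illustrative names and are not provided by Keras.

```python
# Class names in label order, copied from the table above.
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def label_to_name(label):
    """Return the class name for an integer label in the range 0-9."""
    return CLASS_NAMES[int(label)]


print(label_to_name(9))  # Ankle boot
```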

Let's load the data using `tf.keras.datasets`, since Fashion MNIST is one of the datasets bundled with Keras.
For a list of the datasets available from Keras, see https://www.tensorflow.org/api_docs/python/tf/keras/datasets



In [None]:
import tensorflow as tf

mnist = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()
print('Shape of training_images = {}'.format(training_images.shape))
print('Shape of test_images = {}'.format(test_images.shape))

## Preprocess the images



You need to preprocess the images before using them as input to the CNN.
A CNN expects its input to have the shape (batch, height, width, channels). Furthermore, the pixel values of the original images are in the range [0, 255]. A neural network learns better if the input values are normalized to the range [0.0, 1.0]. 


In [None]:
# reshape to 4-D tensors with 1 channel, since these are grayscale images
training_images = training_images.reshape(60000, 28, 28, 1)
test_images = test_images.reshape(10000, 28, 28, 1)

# scale the input to between 0.0 and 1.0
training_images = training_images / 255.0
test_images = test_images / 255.0
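
The same two steps can be written without hard-coding the number of images. The sketch below is one possible variant, run on a synthetic batch so that it stands alone; `preprocess` and `fake_batch` are hypothetical names used only for illustration.

```python
import numpy as np


def preprocess(images):
    """Scale pixel values to [0.0, 1.0] and add a trailing channel axis."""
    images = images.astype("float32") / 255.0
    return images[..., np.newaxis]  # (N, 28, 28) -> (N, 28, 28, 1)


# Synthetic stand-in for a small batch of 28x28 grayscale images.
fake_batch = np.random.randint(0, 256, size=(4, 28, 28), dtype=np.uint8)
processed = preprocess(fake_batch)
print(processed.shape, processed.dtype)
```

Because the batch dimension is left untouched, the same function would work for both the training and the test images.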


## Build your first CNN

A typical CNN consists of one or more blocks, each made up of a Conv2D layer followed by a MaxPooling2D layer. The feature maps from the last convolutional block (a 3D tensor of height, width and channels for each image) are then flattened into a 1D array before being fed into Dense (fully connected) layers for classification. The last layer uses `softmax` to output the probability of each of the 10 classes. Note that the last layer must have the same number of output units as there are classes (in our case, we have 10 classes, so we need 10 output units). Look at the model summary carefully and make sure you understand why each output shape is as shown and how to calculate the number of parameters. 
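
To make the role of `softmax` concrete, here is a small pure-Python sketch of the function itself. It is an illustration only, not how Keras implements it internally, and the input scores are made up.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score avoids overflow without changing the result.
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # made-up scores for 3 classes
print(probs)
print("predicted class:", probs.index(max(probs)))
```

The class with the largest raw score also receives the largest probability, so the predicted class is simply the index of the maximum.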


**Exercise**:

Construct a convnet that consists of the following: 
- Conv layer with 32 filters of size 3x3, and using 'relu' activation function, followed by Max Pooling layer of pool size 2x2. 
- Conv layer with 64 filters of size 3x3, and using 'relu' activation function, followed by Max Pooling layer of pool size 2x2. 
- Flatten the 2D array into 1D
- Fully connected layer with 128 neurons, using 'relu' activation function
- Fully connected layer with 10 neurons with a softmax function. 

Use Adam optimizer and specify 'sparse_categorical_crossentropy' as loss function. 

***Note***: If you are using one-hot-encoding for your output label, then you should specify 'categorical_crossentropy' as your loss function.
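
The two loss functions compute the same quantity and differ only in the label format they expect. The sketch below checks this by hand for a single example; the probabilities are made up and the variable names are illustrative.

```python
import math

probs = [0.1, 0.7, 0.2]      # hypothetical softmax output for 3 classes
sparse_label = 1             # integer label
one_hot_label = [0, 1, 0]    # the same label, one-hot encoded

# Sparse form: pick out the probability of the true class directly.
sparse_loss = -math.log(probs[sparse_label])

# One-hot form: the zeros cancel every term except the true class.
categorical_loss = -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))

print(sparse_loss, categorical_loss)
```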

<details><summary>Click here for answer</summary> 
<br/>
    
```
model = Sequential([
  Conv2D(32, (3,3), activation='relu', input_shape=(28, 28, 1)),
  MaxPooling2D(2, 2),
  Conv2D(64, (3,3), activation='relu'),
  MaxPooling2D(2,2),
  Flatten(),
  Dense(128, activation='relu'),
  Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```
    
</details>


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import plot_model


### Start your code here ###


### End your code ###

model.summary()
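
The output shapes and parameter counts reported by `model.summary()` can be checked by hand. The sketch below does the arithmetic for the architecture described in the exercise, assuming the Keras defaults of no padding (`'valid'`), a convolution stride of 1, and a pooling stride equal to the pool size. The helper names are illustrative.

```python
def conv_output(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


size = conv_output(28)            # 26 -> output shape (26, 26, 32)
p_conv1 = conv_params(3, 1, 32)
size = size // 2                  # 13 after 2x2 max pooling
size = conv_output(size)          # 11 -> output shape (11, 11, 64)
p_conv2 = conv_params(3, 32, 64)
size = size // 2                  # 5 after 2x2 max pooling (rounded down)
flat = size * size * 64           # 1600 values after Flatten
p_dense1 = dense_params(flat, 128)
p_dense2 = dense_params(128, 10)

total = p_conv1 + p_conv2 + p_dense1 + p_dense2
print(p_conv1, p_conv2, p_dense1, p_dense2, total)
# 320 18496 204928 1290 225034
```

The pooling and flatten layers have no trainable parameters, so only the convolutional and dense layers contribute to the total.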

In [None]:
# You can also plot the model to a png file

#plot_model(model, 'model.png', show_shapes=True)

## Train the model

In [None]:
# validate on the held-out test images rather than on the training data itself
history = model.fit(training_images, training_labels, batch_size=256, epochs=5, validation_data=(test_images, test_labels))
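
The progress bar printed during training counts batches, not images. A quick sketch of the arithmetic for the batch size used above:

```python
import math

num_samples = 60000   # size of the Fashion MNIST training set
batch_size = 256      # value passed to model.fit

# The last batch is smaller when the batch size does not divide the dataset evenly.
steps_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)  # 235 96
```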


## Plot the training vs validation accuracy and loss

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt


acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc)+1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and Validation accuracy')
plt.legend()
plt.show()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.show()

We can see that the model is doing very well on both training and validation accuracy, achieving close to 99% accuracy. However, the accuracy can go up and down from one epoch to the next, which makes the trend a bit difficult to see. Smoothing the curves with a moving average makes the trend clearer.
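
One common way to smooth such curves is an exponential moving average. The function below is a minimal sketch of that idea; the name `smooth_curve`, the default `factor` and the sample values are assumptions made for illustration.

```python
def smooth_curve(points, factor=0.8):
    """Blend each point with the previous smoothed value (exponential moving average)."""
    smoothed = []
    for point in points:
        if smoothed:
            smoothed.append(smoothed[-1] * factor + point * (1 - factor))
        else:
            smoothed.append(point)  # the first point has nothing to blend with
    return smoothed


noisy_acc = [0.80, 0.90, 0.85, 0.92, 0.88]  # made-up values for illustration
print(smooth_curve(noisy_acc))
```

To use it with the plots above, pass the smoothed list in place of the raw one, for example `plt.plot(epochs, smooth_curve(val_acc), 'b', label='Smoothed validation acc')`. A larger `factor` gives a smoother but more delayed curve.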


## Visualizing the Convolutions and Pooling

It is often said that a deep learning network is a black box. However, this is certainly not true for convnets. The representations learnt by convnets are highly interpretable, as they are representations of visual concepts. 

The following code allows us to visualize the feature maps learnt by the convnet. By looking at the outputs (activations) of these feature maps for different kinds of images, we can understand how a specific image is being classified. 


Let's first print out the labels of the first 10 test images.

In [None]:
print(test_labels[:10])

Let us look at two different images: image 0 with label 9 (ankle boot) and image 2 with label 1 (trouser).

In [None]:
ANKLE_BOOT_IDX = 0
TROUSER_IDX = 2

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.imshow(test_images[ANKLE_BOOT_IDX].reshape(28,28))
ax2.imshow(test_images[TROUSER_IDX].reshape(28,28))

Let's create an activation model for each individual layer.

In [None]:
import matplotlib.pyplot as plt
from tensorflow.keras import Model

# extract the outputs of first four layers (only the Conv2D, MaxPooling2D layers)
layer_outputs = [layer.output for layer in model.layers][:4]

# create activation models that will return these outputs given the model input
activation_model_conv1 = Model(inputs=model.input, outputs=layer_outputs[0])
activation_model_pool1 = Model(inputs=model.input, outputs=layer_outputs[1])
activation_model_conv2 = Model(inputs=model.input, outputs=layer_outputs[2])
activation_model_pool2 = Model(inputs=model.input, outputs=layer_outputs[3])

Let's look at the activations from the 1st Conv2D layer for both images. There are 32 feature maps from the 1st Conv layer, but we are going to look at only the first 10.

In [None]:
fig, axarr = plt.subplots(2,10, figsize=(20,4))
ankle_boot_activations_conv1 = activation_model_conv1.predict(test_images[ANKLE_BOOT_IDX].reshape(1, 28, 28, 1))
trouser_activations_conv1 = activation_model_conv1.predict(test_images[TROUSER_IDX].reshape(1, 28, 28, 1))

for filter_idx in range(0, 10):
    axarr[0, filter_idx].imshow(ankle_boot_activations_conv1[0,:,:, filter_idx])
    axarr[1, filter_idx].imshow(trouser_activations_conv1[0,:,:,filter_idx])

From the plots, we can see that the 1st Conv layer seems to act as a detector of lines and edges. Some filters, such as filter 0, behave more like a vertical line detector, whereas others, such as filter 4, seem to detect the edges of the shape.
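
To see why a 3x3 filter can behave as an edge detector, the sketch below applies a hand-picked kernel to a toy image. The kernel is chosen by hand for illustration and is not one learnt by the model; `correlate2d_valid` is a hypothetical helper that mimics what a Conv2D layer does with no padding and a stride of 1.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image without padding, with stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy image: dark left half, bright right half (a vertical edge in the middle).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Responds to dark-to-bright transitions going from left to right.
vertical_edge_kernel = np.array([[-1, 0, 1],
                                 [-1, 0, 1],
                                 [-1, 0, 1]])

# ReLU keeps only the positive responses, as in the Conv2D layers above.
response = np.maximum(correlate2d_valid(image, vertical_edge_kernel), 0)
print(response)
```

The response is non-zero only in the columns where the window straddles the edge, and zero over the flat regions on either side.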

Your filter output may not be the same as we have shown here as the specific filters learnt by the Conv layer are not deterministic.

**Exercise**

Now, complete the code below to examine the activations from the 2nd Conv2D layer for both images. There are 64 feature maps from the 2nd Conv layer, but we are going to look at only the first 10. What do you observe?

<details><summary>Click here for answer</summary>
<br/>
   
```
fig, axarr = plt.subplots(2,10, figsize=(20,4))

ankle_boot_activations_conv2 = activation_model_conv2.predict(test_images[ANKLE_BOOT_IDX].reshape(1, 28, 28, 1))
trouser_activations_conv2 = activation_model_conv2.predict(test_images[TROUSER_IDX].reshape(1, 28, 28, 1))

for filter_idx in range(0, 10):
    axarr[0, filter_idx].imshow(ankle_boot_activations_conv2[0,:,:, filter_idx])
    axarr[1, filter_idx].imshow(trouser_activations_conv2[0,:,:,filter_idx])
    
```
</details>

You will observe that the outputs seem more abstract and appear to detect higher-level constructs, such as the presence of a certain part of the object (e.g. the collar of the boot).

In [None]:
### Start your code here ###


### End your code ###

Now, complete the code below to examine the activations from the last max pooling layer for both images. It outputs 64 feature maps, one for each filter of the 2nd Conv layer, but we are going to look at only the first 10. What do you observe?

<details><summary>Click here for answer</summary>
<br/>
   
```
fig, axarr = plt.subplots(2,10, figsize=(20,4))

ankle_boot_activations_pool2 = activation_model_pool2.predict(test_images[ANKLE_BOOT_IDX].reshape(1, 28, 28, 1))
trouser_activations_pool2 = activation_model_pool2.predict(test_images[TROUSER_IDX].reshape(1, 28, 28, 1))

for filter_idx in range(0, 10):
    axarr[0, filter_idx].imshow(ankle_boot_activations_pool2[0,:,:, filter_idx])
    axarr[1, filter_idx].imshow(trouser_activations_pool2[0,:,:,filter_idx])

    
```
</details>

The MaxPooling2D layer simply highlights, or emphasizes more sharply, the abstract parts detected by the Conv layer. 

In [None]:
### Start your code here ###


### End your code ###
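
The effect of max pooling can be reproduced by hand on a small array. The sketch below implements 2x2 max pooling with a stride of 2 on a single feature map; `max_pool_2x2` is an illustrative helper and not part of Keras.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on one 2-D feature map (odd edges are dropped)."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2
    # Group the pixels into non-overlapping 2x2 blocks, then take each block's maximum.
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])
pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4 2]
#  [2 8]]
```

Each output value is the strongest activation in its 2x2 block, so the map shrinks to half its height and width while the strongest responses are kept.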


**Additional Exercises:**

1. Try editing the convolutions. Change the number of filters from 32 to either 16 or 64. What impact will this have on accuracy and/or training time?

2. Remove the final Convolution. What impact will this have on accuracy or training time?

3. How about adding more Convolutions? What impact do you think this will have? Experiment with it.

4. Remove all Convolutions but the first. What impact do you think this will have? Experiment with it. 
