<a href="https://colab.research.google.com/github/luigiselmi/dl_tensorflow/blob/main/computer_vision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Computer Vision
We have already seen fully connected neural networks applied to the MNIST digit classification task. In this notebook we use convolutional neural networks for the same task, since they offer several advantages over fully connected layers.

In [3]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import time

## The MNIST digits classification task
We use the Keras functional API to build our convolutional model. It has three convolutional layers, the first two each followed by a max pooling layer, and then a flatten layer. The flatten layer has no weights: it only reshapes the 3D output of the last convolutional layer into a one-dimensional vector with the same number of elements. Finally we have a fully connected layer with a softmax activation function to provide the probability for each digit. As we can see from the model summary, the spatial size of the feature maps, i.e. width and height, shrinks while the number of channels, or feature maps, increases. The shape of the inputs is Height x Width x Channels, in the MNIST case 28 x 28 x 1.

In [4]:
inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d (Conv2D)             (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d (MaxPooling2  (None, 13, 13, 32)        0         
 D)                                                              
                                                                 
 conv2d_1 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 5, 5, 64)          0         
 g2D)                                                            
                                                                 
 conv2d_2 (Conv2D)           (None, 3, 3, 128)         73856 
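
The output shapes and parameter counts listed in the summary can be checked by hand. The following is a minimal sketch in plain Python, assuming the layer defaults used above (stride 1 and no padding for the convolutions, a pooling stride equal to the pool size). The helpers `conv2d_valid` and `max_pool` are hypothetical names introduced here for illustration, not Keras functions.

```python
def conv2d_valid(shape, filters, kernel_size):
    """Output shape and parameter count of a convolution with stride 1 and no padding."""
    height, width, channels = shape
    out_shape = (height - kernel_size + 1, width - kernel_size + 1, filters)
    # Each filter has kernel_size * kernel_size * channels weights plus one bias.
    params = (kernel_size * kernel_size * channels + 1) * filters
    return out_shape, params


def max_pool(shape, pool_size):
    """Output shape of max pooling whose stride equals its pool size."""
    height, width, channels = shape
    return (height // pool_size, width // pool_size, channels)


shape = (28, 28, 1)  # Height x Width x Channels of an MNIST image
rows = []
# (number of filters, whether a max pooling layer follows) for each convolution
for filters, pooled in [(32, True), (64, True), (128, False)]:
    shape, params = conv2d_valid(shape, filters, kernel_size=3)
    rows.append((shape, params))
    if pooled:
        shape = max_pool(shape, pool_size=2)
        rows.append((shape, 0))  # pooling has no trainable parameters

for out_shape, params in rows:
    print(out_shape, params)
```

Each printed row should match the corresponding convolution or pooling line of the summary, for example `(26, 26, 32)` with 320 parameters for the first convolution.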

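We reuse most of the code from the MNIST digit classification with fully connected layers in the notebook [Machine Learning Fundamentals](ml_fundamentals.ipynb). The only difference is that the images are reshaped to 28 x 28 x 1, keeping their 2D structure and adding a channel dimension, instead of being flattened.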

In [1]:
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype("float32") / 255

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


We compile and fit the CNN model.

In [5]:
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x792256251930>

We evaluate the model's performance on the test set. We can see that our network with only three convolutional layers achieves a better accuracy than the fully connected model.

In [6]:
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc:.3f}")

Test accuracy: 0.994
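
The accuracy metric is the fraction of images whose most probable class matches the label. The sketch below illustrates the computation with NumPy on a small, made-up array of softmax outputs. The values in `probabilities` and `labels` are hypothetical and are not produced by the model above. With a trained model, an array of the same form could be obtained with `model.predict(test_images)`.

```python
import numpy as np

# Hypothetical softmax outputs for three images: one row per image,
# one column per digit, each row summing to 1. Not real model output.
probabilities = np.full((3, 10), 0.02)
probabilities[0, 7] = 0.82
probabilities[1, 2] = 0.82
probabilities[2, 1] = 0.82

# Hypothetical true labels for the same three images.
labels = np.array([7, 2, 4])

# The predicted digit is the class with the highest probability.
predicted = probabilities.argmax(axis=1)

# Accuracy is the fraction of predictions that equal the labels.
accuracy = float((predicted == labels).mean())
print(predicted, f"{accuracy:.3f}")
```

In this toy case two of the three predictions agree with the labels.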


## The convolution operation
The difference between a dense layer and a convolutional layer is that a dense (fully connected) layer learns global patterns in an image, while a convolutional layer learns local patterns, found within small 2D windows, that can be recognized regardless of where they appear in the image. In our convolutional layers we have used a window of size 3x3 pixels (kernel size). The advantages of using convolutional layers are that they learn patterns that are

* translation invariant
* hierarchically organized
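
The first property can be illustrated with a small NumPy experiment: the same kernel responds equally strongly to a pattern wherever it appears, and the peak of the response moves with the pattern. This is only a sketch. The kernel is hand-made rather than learnt, and `cross_correlate2d` and `image_with_line` are hypothetical helpers introduced for the illustration.

```python
import numpy as np


def cross_correlate2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and return the response map."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    response = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            response[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return response


def image_with_line(top, left):
    """8x8 blank image with a 3-pixel vertical line whose top pixel is at (top, left)."""
    image = np.zeros((8, 8))
    image[top:top + 3, left] = 1.0
    return image


# Hand-made 3x3 kernel that responds to a vertical line in its central column.
kernel = np.array([[-1.0, 2.0, -1.0]] * 3)

response_a = cross_correlate2d(image_with_line(1, 2), kernel)
response_b = cross_correlate2d(image_with_line(4, 5), kernel)

# Location of the strongest response in each map.
peak_a = np.unravel_index(response_a.argmax(), response_a.shape)
peak_b = np.unravel_index(response_b.argmax(), response_b.shape)
print(response_a.max(), peak_a)
print(response_b.max(), peak_b)
```

Both images give the same maximum response, and the peak is displaced by the same three rows and three columns as the line itself. A dense layer would instead need separate weights for each position of the line.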

A stack of convolutional layers can learn an object as a composition of simpler shapes, such as circles and lines, and recognize it wherever those shapes appear in the same configuration. A convolutional layer is defined by its

* window (or kernel) size
* number of filters (or channels / feature maps)

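The weights of each kernel of size K x K are learnt during training. The number of parameters of a convolutional layer depends on the kernel size, the number of input channels C, and the number of filters F: it is (K x K x C + 1) x F, where the 1 accounts for the bias of each filter. For our first layer this gives (3 x 3 x 1 + 1) x 32 = 320, as shown in the model summary. The way the kernel is moved over the feature maps, that is the padding and the stride, does not change the number of parameters but determines the spatial size of the output.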
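
Padding and stride control how the kernel is moved over the feature maps, and therefore the spatial size of the output. The following is a minimal sketch, assuming the Keras conventions for `padding="valid"` (no padding) and `padding="same"` (zero padding so that, with stride 1, the output has the same size as the input). `conv_output_size` is a hypothetical helper written for this example, not a Keras function.

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output size of a convolution along one spatial dimension."""
    if padding == "valid":
        # The kernel must fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        # The input is zero-padded so that only the stride reduces the size.
        return math.ceil(input_size / stride)
    raise ValueError(f"unknown padding: {padding}")


# A 28-pixel-wide MNIST image and a 3x3 kernel under different settings.
sizes = {
    (stride, padding): conv_output_size(28, 3, stride, padding)
    for stride in (1, 2)
    for padding in ("valid", "same")
}
for (stride, padding), size in sizes.items():
    print(f"stride={stride}, padding={padding}: {size}")
```

With stride 1 and no padding the width goes from 28 to 26, as in the first layer of the model above. A stride of 2 roughly halves the output size, much like a pooling layer with `pool_size=2`.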