# Deep Learning for Handwritten Digit Recognition using Convolutional Neural Networks

In this notebook, we will build a Convolutional Neural Network (CNN) using a sequential model. CNNs are commonly used for image recognition and computer vision tasks, making them well-suited for handwritten digit recognition.

## Import of modules and libraries 

In [42]:
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras import datasets

## Import of the training and test data

In [43]:
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()

In [44]:
print(x_train.shape)

(60000, 28, 28)


## Model building

### Layers in our CNN:

1. Conv2D:
   - This layer applies a set of learned filters (convolutional kernels) to the input image.
   - Each filter slides across the image, covering a 3x3 patch of pixels at each position.
   - At each position, it multiplies the filter's values by the corresponding pixel values and sums the results (a dot product).
   - Each dot product is stored in a feature map, which is passed to the next layer.
   - Because each filter responds to a specific pattern, such as an edge or a curve, the feature maps help the network recognize features in the images.

2. MaxPooling2D:
   - This layer downsamples the data while retaining the most important information.
   - It slides a small matrix (usually a 2x2 matrix) across the feature map.
   - At each position, it takes the maximum value within that window.
   - The maximum value is then kept, and the rest of the values are discarded.
   - This process generates a new downsampled feature map with reduced dimensions.

3. Flatten:
   - The flatten layer converts the multi-dimensional feature maps (height x width x channels) into a 1D vector.
   - This is needed because the dense layers that follow expect a one-dimensional input for each sample.
   - It rearranges the data by laying all values out along a single dimension; no values are changed or discarded.
   - The output is a 1-dimensional vector.

4. Dense:
   - This fully connected layer combines the extracted features to learn complex patterns in the data.
   - Each neuron holds one weight per input plus a bias, all of which are learned during training.
   - Each neuron's weighted sum passes through an activation function, which determines the value that is passed to the next layer.
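
The four layer types described above can be sketched in plain NumPy. This is a minimal illustration on a single-channel toy image, not how Keras implements these layers internally; the helper names (`conv2d_single`, `max_pool2d`, `dense_relu`), the 6x6 input, and the hand-picked kernel are all assumptions made for the example.

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide one 2D kernel over one 2D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Element-wise multiply and sum: the dot product of patch and kernel
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    out_h = feature_map.shape[0] // size
    out_w = feature_map.shape[1] // size
    # Drop trailing rows/columns that do not fill a complete window
    trimmed = feature_map[:out_h * size, :out_w * size]
    windows = trimmed.reshape(out_h, size, out_w, size)
    return windows.max(axis=(1, 3))

def dense_relu(vector, weights, bias):
    """Weighted sum plus bias, followed by a ReLU activation."""
    return np.maximum(vector @ weights + bias, 0.0)

# Toy 6x6 "image" and a hand-picked vertical-edge kernel (hypothetical values)
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

fmap = conv2d_single(image, kernel)   # shape (4, 4)
pooled = max_pool2d(fmap)             # shape (2, 2)
flat = pooled.reshape(-1)             # shape (4,), what Flatten does per sample

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))     # random stand-ins for learned weights
bias = np.zeros(3)
out = dense_relu(flat, weights, bias) # shape (3,)

print(fmap.shape, pooled.shape, flat.shape, out.shape)
```

In a real `Conv2D` layer the kernel values are learned rather than chosen by hand, and each filter spans all input channels.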

### Activation Function:

The final layer of our CNN uses the `softmax` function instead of the `relu` function used in the earlier layers. The `softmax` function transforms the layer's outputs into a vector of probabilities, one for each of the ten digit classes, that sum to 1.
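
As a rough illustration of what `softmax` computes, the sketch below applies it to a small vector of raw scores. The function and the example scores are assumptions for demonstration; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn a 1D vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw scores for three classes
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print(probs)
print(probs.sum())
```

The largest score always receives the largest probability, so taking `np.argmax(probs)` gives the predicted class.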

In [45]:
# Create an instance of the sequential model
model = Sequential()

# Add layers
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))
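
To see how the image shrinks as it moves through these layers, the shape arithmetic can be worked out by hand. The sketch below assumes the Keras defaults for this model (no padding, a convolution stride of 1, and a pooling stride equal to the pool size); the helper names are made up for the example.

```python
def conv_out(size, kernel=3):
    """Output width/height of a convolution with no padding and stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output width/height of non-overlapping pooling (incomplete windows dropped)."""
    return size // pool

def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units

size = 28
size = conv_out(size)   # Conv2D(32):  28 -> 26
size = pool_out(size)   # MaxPooling:  26 -> 13
size = conv_out(size)   # Conv2D(64):  13 -> 11
size = pool_out(size)   # MaxPooling:  11 -> 5
size = conv_out(size)   # Conv2D(128):  5 -> 3

flattened = size * size * 128  # 3 * 3 * 128 = 1152 values per image

total_params = (
    conv_params(3, 1, 32)         # 320
    + conv_params(3, 32, 64)      # 18496
    + conv_params(3, 64, 128)     # 73856
    + dense_params(flattened, 128)  # 147584
    + dense_params(128, 10)       # 1290
)

print(size, flattened, total_params)
```

Calling `model.summary()` on the model above is the usual way to check these numbers against what Keras reports.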

## Model compilation

In [46]:
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

When compiling this model, we have to specify our optimizer, loss function and the metrics for the evaluation of the performance.

### Optimizer - Adam:

Adam (adaptive moment estimation) is an extension of stochastic gradient descent that can be used to update the weights in neural networks. It differs from classical stochastic gradient descent in two main ways:
   - It maintains a per-parameter learning rate, which improves performance on problems with sparse gradients.
   - It adapts each learning rate using running averages of the recent gradients and of their squared magnitudes.
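
A single Adam update can be sketched in NumPy as follows. This is an illustrative version of the published update rule, not the Keras implementation; the hyperparameter values are the commonly used defaults and the toy objective (minimizing the sum of squared weights) is an assumption for the example.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Apply one Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # correct the bias toward zero
    v_hat = v / (1 - beta2 ** t)
    # Each parameter gets its own effective step size via sqrt(v_hat)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize sum(w ** 2), whose gradient is 2 * w
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 4):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t)

print(w)
```

Both weights move toward zero by roughly the same amount per step even though their gradients differ in size, which is the per-parameter scaling at work.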

### Loss function - Sparse Categorical Cross entropy

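Sparse categorical cross entropy is a loss function used in multiclass classification problems where the labels are integers (here, the digits 0 to 9) rather than one-hot vectors. It measures the difference between the predicted probability distribution and the true label: for each sample, the loss is the negative logarithm of the probability assigned to the correct class.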
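
The calculation can be written out in a few lines of NumPy. This is a sketch of the formula rather than the Keras implementation; the function name mirrors the Keras loss string for readability, and the labels and predicted probabilities are invented for the example.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample loss: -log of the probability given to the true class."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    picked = y_pred[np.arange(len(y_true)), y_true]
    return -np.log(picked)

# Two samples, three classes; labels are plain integers, not one-hot vectors
y_true = np.array([0, 2])
y_pred = np.array([[0.90, 0.05, 0.05],
                   [0.10, 0.20, 0.70]])

losses = sparse_categorical_crossentropy(y_true, y_pred)
mean_loss = losses.mean()

print(losses, mean_loss)
```

A confident correct prediction (0.90) yields a smaller loss than a less confident one (0.70), and the loss grows quickly as the probability of the true class approaches zero.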

## Model fit and evaluation

When fitting the model, we have to specify a number of parameters beyond the training and validation datasets. These parameters are:
1. `epochs`
   - the number of complete passes the model makes through the training set.
   - choosing the right number of epochs is important: too many can lead to a model that overfits the training data, whilst too few can leave the model underfit.

2. `batch_size`
   - the number of samples that are processed before the model's weights are updated.
   - the batch size influences the training time: larger batches mean fewer weight updates per epoch, which usually shortens the time each epoch takes.
   - smaller batch sizes make the gradient estimates noisier, which can help the model generalize to unseen data.
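
The relationship between these two parameters and the number of weight updates is simple arithmetic. The sketch below uses the 60,000 training images loaded earlier together with the values passed to `fit` in the next cell; the variable names are chosen for the example.

```python
import math

num_samples = 60000   # training images in x_train
batch_size = 32
epochs = 3

# One weight update per batch; a final partial batch still counts as a step
steps_per_epoch = math.ceil(num_samples / batch_size)
total_updates = steps_per_epoch * epochs

# For comparison: a larger, hypothetical batch size means fewer updates per epoch
steps_with_larger_batch = math.ceil(num_samples / 128)

print(steps_per_epoch, total_updates, steps_with_larger_batch)
```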

In [50]:
# Train the model; note that the test set is also used as the validation data here
history = model.fit(x_train, y_train, epochs=3, batch_size=32, validation_data=(x_test, y_test))

Epoch 1/3
Epoch 2/3
Epoch 3/3
