# Convolutional Neural Network

In [1]:
import tensorflow as tf
import keras as keras


In [2]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D

Build a CNN model

In [37]:
# Build the Sequential CNN model
model = Sequential([
    Conv2D(16, (3,3), activation='relu', input_shape=(28,28,1)),
    # Put more layers
    MaxPooling2D((3,3)), # it performs downsampling by selecting the maximum value from each pooling window.
    Flatten(),
    Dense(10, activation='softmax')
])

In [38]:
# Print the model summary
model.summary()

Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_8 (Conv2D)           (None, 26, 26, 16)        160       
                                                                 
 max_pooling2d_7 (MaxPoolin  (None, 8, 8, 16)          0         
 g2D)                                                            
                                                                 
 flatten_8 (Flatten)         (None, 1024)              0         
                                                                 
 dense_14 (Dense)            (None, 10)                10250     
                                                                 
Total params: 10410 (40.66 KB)
Trainable params: 10410 (40.66 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In Output Shape (None, 26, 26, 16), 
* None represents the batch size, which is not fixed during model definition.
  - This allows for flexibility, meaning the model can process inputs with any batch size.
* 26, 26 are the height and width of the feature map after convolution.
  - The size decreases from the input shape (28, 28) because the kernel size is (3, 3) and the default stride is (1,1), which reduces the dimensions:
  - Output size = Input size - Kernel size + 1 = 28 - 3 + 1 = 26.
* 16 is the number of filters.

* The total number of trainable parameters is calculated as:
  - $\text{Parameters} = (\text{Kernel height} \times \text{Kernel width} \times \text{Input channels} \times \text{Filters}) + \text{Bias terms}$
  - Each filter has $3 \times 3 \times 1 = 9$ weights
  - We have 16 filters: $9 \times 16 = 144$ weights
  - Each filter has one bias term: 16 bias terms
  - Thus 144 (kernel parameters) + 16 (bias terms) = 160
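
The two calculations above can be checked with a few lines of plain Python. This is a minimal sketch; the helper names `conv_output_size` and `conv2d_param_count` are illustrative and not part of Keras, and the output-size helper assumes no padding (`'valid'`).

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Output length along one spatial axis for a convolution without padding."""
    return (input_size - kernel_size) // stride + 1


def conv2d_param_count(kernel_h, kernel_w, in_channels, filters):
    """Weights of all filters plus one bias term per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


side = conv_output_size(28, 3)            # 28 - 3 + 1
params = conv2d_param_count(3, 3, 1, 16)  # 3*3*1*16 + 16
print(side, params)  # 26 160
```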

Output shape after pooling
* The output shape after applying MaxPooling2D((3,3)) changes because pooling reduces the spatial dimensions (height and width) of the input.
  - Here's how the dimensions are calculated:
  - $ \text{Output Size (H or W)} = \left\lfloor \frac{\text{Input Size} - \text{Pooling Window Size}}{\text{Stride}} + 1 \right\rfloor $
  - Here, the input size = 26, the pooling window size = 3, and the stride defaults to the pooling window size (stride = (3,3)).
  - Thus, $ \text{Output Size (H or W)} = \left\lfloor \frac{26 - 3}{3} + 1 \right\rfloor = \left\lfloor 8.67 \right\rfloor = 8$
  - The number of channels (16) remains unchanged.
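
As a quick check of the pooling arithmetic, here is a small sketch in plain Python. The function name `pool_output_size` is made up for illustration; it assumes `'valid'` pooling and a stride that defaults to the window size, as described above.

```python
def pool_output_size(input_size, pool_size, stride=None):
    """Output length along one axis for pooling without padding."""
    if stride is None:
        stride = pool_size  # assumption: stride defaults to the window size
    # Integer division implements the floor in the formula.
    return (input_size - pool_size) // stride + 1


pooled = pool_output_size(26, 3)
print(pooled)  # 8
```

The last two rows and columns of the 26 x 26 feature map do not fill a complete 3 x 3 window, so they are dropped.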
 
For Flatten: $ 8 \times 8 \times 16 = 1024$

For Dense layer:
* The number of parameters is given by
  - $\text{Parameters} = (\text{Input Features} \times \text{Output Units}) + \text{Output Units (Bias terms)} = 1024 \times 10 + 10 = 10250$
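
The flatten and dense numbers, together with the total reported by `model.summary()`, can be reproduced as follows. This is only a sketch of the arithmetic; `dense_param_count` is an illustrative helper, not a Keras function.

```python
def dense_param_count(input_features, units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return input_features * units + units


flat_features = 8 * 8 * 16                            # Flatten: H * W * channels
dense_params = dense_param_count(flat_features, 10)   # 1024 * 10 + 10
conv_params = 3 * 3 * 1 * 16 + 16                     # from the Conv2D layer
total_params = conv_params + dense_params
print(flat_features, dense_params, total_params)  # 1024 10250 10410
```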

In [12]:
# What if we use padding = 'SAME' in the convolution?
# Build the Sequential CNN model
model = Sequential([
    Conv2D(16, (3,3), padding = 'SAME', activation='relu', input_shape=(28,28,1)),
    # Put more layers
    MaxPooling2D((3,3)), # it performs downsampling by selecting the maximum value from each pooling window.
    Flatten(),
    Dense(10, activation='softmax')
])
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_4 (Conv2D)           (None, 28, 28, 16)        160       
                                                                 
 max_pooling2d_3 (MaxPoolin  (None, 9, 9, 16)          0         
 g2D)                                                            
                                                                 
 flatten_1 (Flatten)         (None, 1296)              0         
                                                                 
 dense_1 (Dense)             (None, 10)                12970     
                                                                 
Total params: 13130 (51.29 KB)
Trainable params: 13130 (51.29 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


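Since we use padding = 'SAME' in the convolution,
* the convolution output keeps the spatial size of the input (28, 28), so its shape is (None, 28, 28, 16)
* After pooling (MaxPooling2D still uses its default 'VALID' padding and a stride equal to the pooling window size):
  - $ \text{Output Size (H or W)} = \left\lfloor \frac{\text{Input Size} - \text{Pooling Window Size}}{\text{Stride}} + 1 \right\rfloor = \left\lfloor \frac{28 - 3}{3} + 1 \right\rfloor = \left\lfloor 9.33 \right\rfloor = 9$
* flatten: $9 \times 9 \times 16 = 1296$
* For Dense layer: the number of parameters is given by
  - $\text{Parameters} = (\text{Input Features} \times \text{Output Units}) + \text{Output Units (Bias terms)} = 1296 \times 10 + 10 = 12970$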
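
To see why the spatial size is preserved, it can help to compute how many zeros 'SAME' padding has to add along one axis. The sketch below uses the rule TensorFlow documents for 'SAME' padding (output size is the input size divided by the stride, rounded up); the function name `same_padding` is only illustrative.

```python
import math


def same_padding(input_size, kernel_size, stride=1):
    """Return (output_size, total_padding) along one axis for 'same' padding."""
    output_size = math.ceil(input_size / stride)
    # Total zeros needed so that every output position has a full window.
    total_padding = max((output_size - 1) * stride + kernel_size - input_size, 0)
    return output_size, total_padding


out_size, pad = same_padding(28, 3)  # stride 1
print(out_size, pad)  # 28 2
```

With a 3 x 3 kernel and stride 1, two zeros per axis (one on each side) are enough to keep the 28 x 28 size.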

In [15]:
# what if we add "strides = 2"
# Build the Sequential CNN model
model = Sequential([
    Conv2D(16, (3,3), padding = 'SAME', strides = 2, activation='relu', input_shape=(28,28,1)),
    # Put more layers
    MaxPooling2D((3,3)), # it performs downsampling by selecting the maximum value from each pooling window.
    Flatten(),
    Dense(10, activation='softmax')
])
model.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_6 (Conv2D)           (None, 14, 14, 16)        160       
                                                                 
 max_pooling2d_5 (MaxPoolin  (None, 4, 4, 16)          0         
 g2D)                                                            
                                                                 
 flatten_3 (Flatten)         (None, 256)               0         
                                                                 
 dense_3 (Dense)             (None, 10)                2570      
                                                                 
Total params: 2730 (10.66 KB)
Trainable params: 2730 (10.66 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


* Input shape (28, 28, 1)
* Filters 16
* Stride 2
* Padding same
  - $ \text{Output Size (H or W)} = \left\lceil \frac{\text{Input Size}}{\text{Stride}} \right\rceil = \left\lceil \frac{28}{2} \right\rceil = 14$
  - Thus output shape is (None, 14, 14, 16)
  - $\text{Parameters} = (\text{Kernel height} \times \text{Kernel width} \times \text{Input channels} \times \text{Filters}) + \text{Bias terms}$
  - $\text{Parameters} = (3 \times 3 \times 1 \times 16) + 16 = 144 + 16 = 160$

MaxPooling2D layer:
* input shape: (None, 14, 14, 16)
* Pooling Window size: (3,3)
* Stride: Defaults to the pooling window size (3, 3) if not explicitly specified.
* Padding: Defaults to 'VALID' (no padding).
  - $ \text{Output Size} = \left\lfloor \frac{\text{Input Size} - \text{Pooling Window Size}}{\text{Stride}} + 1 \right\rfloor $
  - So, $ \text{Output Size} = \left\lfloor \frac{\text{14} - \text{3}}{\text{3}} + 1 \right\rfloor =  \left\lfloor 4.67 \right\rfloor = 4$

Flatten layer reshapes the 3D output into a 1D vector: $ \text{Output Shape} = 4 \times 4 \times 16 = 256 $

Dense layer: 
* $\text{Parameters} = (\text{Input Features} \times \text{Output Units}) + \text{Output Units (Bias terms)} $
* So, $\text{Parameters} = \text{256} \times \text{10} + \text{10} = 2570$
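
The three variants built so far differ only in the padding and stride of the convolution, so their shapes and parameter totals can be compared side by side. The sketch below is plain Python; `trace_cnn` and its argument names are illustrative, and it assumes the pooling layer keeps its defaults ('valid' padding, stride equal to the window size).

```python
import math


def trace_cnn(padding="valid", stride=1, input_side=28, in_channels=1,
              filters=16, kernel=3, pool=3, units=10):
    """Return (conv_side, pool_side, flat_features, total_params)."""
    if padding == "same":
        conv_side = math.ceil(input_side / stride)
    else:
        conv_side = (input_side - kernel) // stride + 1
    conv_params = kernel * kernel * in_channels * filters + filters
    pool_side = (conv_side - pool) // pool + 1  # pooling stride = window size
    flat_features = pool_side * pool_side * filters
    dense_params = flat_features * units + units
    return conv_side, pool_side, flat_features, conv_params + dense_params


results = {
    "valid, stride 1": trace_cnn("valid", 1),
    "same, stride 1": trace_cnn("same", 1),
    "same, stride 2": trace_cnn("same", 2),
}
for name, values in results.items():
    print(name, values)
# valid, stride 1 (26, 8, 1024, 10410)
# same, stride 1 (28, 9, 1296, 13130)
# same, stride 2 (14, 4, 256, 2730)
```

The convolution always contributes 160 parameters; the totals differ only because the flattened feature count feeding the Dense layer changes.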

# Weight and bias initializers
* We investigate different ways to initialize weights and biases in the layers of NNs.

In [16]:
%matplotlib inline 
import pandas as pd

### Default weights and biases

In the models we have worked with so far, we have not specified the initial values of the weights and biases in each layer of our neural networks.

The default values of the weights and biases in TensorFlow depend on the type of layers we are using. 

For example, in a `Dense` layer, the biases are set to zero (`zeros`) by default, while the weights are set according to `glorot_uniform`, the Glorot uniform initialiser. 

The Glorot uniform initialiser draws the weights uniformly at random from the closed interval $[-c,c]$, where $$c = \sqrt{\frac{6}{n_{input}+n_{output}}}$$

and $n_{input}$ and $n_{output}$ are the number of inputs to, and outputs from the layer respectively.
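
To get a feel for the size of $c$, the sketch below evaluates the formula for the `Dense(10)` layer from the first CNN (1024 inputs, 10 outputs) and draws weights from $[-c, c]$ with the standard library. It only mimics the sampling rule; it is not how Keras implements `glorot_uniform`, and the function names are illustrative.

```python
import math
import random


def glorot_uniform_limit(n_input, n_output):
    """The bound c of the Glorot uniform interval [-c, c]."""
    return math.sqrt(6.0 / (n_input + n_output))


def sample_glorot_uniform(n_input, n_output, seed=0):
    """Draw an n_input x n_output weight matrix (list of rows) from U(-c, c)."""
    rng = random.Random(seed)
    c = glorot_uniform_limit(n_input, n_output)
    return [[rng.uniform(-c, c) for _ in range(n_output)] for _ in range(n_input)]


limit = glorot_uniform_limit(1024, 10)
weights = sample_glorot_uniform(1024, 10)
print(round(limit, 4), len(weights), len(weights[0]))
```

The more inputs and outputs a layer has, the narrower the interval becomes, which keeps the scale of the layer's output roughly independent of its width.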

### Initialising your own weights and biases
We often would like to initialize our own weights and biases, and TensorFlow makes this process quite straightforward.

When we construct a model in TensorFlow, each layer that has weights accepts the optional arguments `kernel_initializer` and `bias_initializer`, which are used to set the weights and biases respectively.

If a layer has no weights or biases (e.g. it is a max pooling layer), then trying to set either `kernel_initializer` or `bias_initializer` will throw an error.

Let's see an example, which uses some of the different initializations available in Keras.

In [17]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Conv1D, MaxPooling1D 

In [20]:
# Construct a model

model = Sequential([
    Conv1D(filters=16, kernel_size=3, input_shape=(128, 64), kernel_initializer='random_uniform', bias_initializer="zeros", activation='relu'),
    MaxPooling1D(pool_size=4),
    Flatten(),
    Dense(64, kernel_initializer='he_uniform', bias_initializer='ones', activation='relu'),
])

* Conv1D applies 1D convolutional filters over the input data to extract meaningful features from sequential data.
  - filters=16: 16 filters in the layer
  - kernel_size=3: size of the filter (each filter spans 3 sequential elements of the input)
  - input_shape=(128, 64): length of the sequence = 128; 64 features (channels) per step

* kernel_initializer='he_uniform' (used in the Dense layer above)
  - He uniform initialization is a weight initialization method introduced by Kaiming He et al. in the paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".
  - It is designed for layers with ReLU (Rectified Linear Unit) activations.
 
Weights are sampled from:
$$
w \sim \mathcal{U}\left(-\sqrt{\frac{6}{\text{fan\_in}}}, \sqrt{\frac{6}{\text{fan\_in}}}\right)
$$

Where:

- $\mathcal{U}$: Uniform distribution.
- $\text{fan\_in}$: The number of input units (or neurons) in the layer.
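
For the `Dense(64)` layer above, $\text{fan\_in}$ is the flattened feature count, 496 (derived from the model summary below). The following sketch compares the He uniform bound with the Glorot uniform bound for that layer; the helper names are illustrative and the code only evaluates the two formulas.

```python
import math


def he_uniform_limit(fan_in):
    """Bound of the He uniform interval: sqrt(6 / fan_in)."""
    return math.sqrt(6.0 / fan_in)


def glorot_uniform_limit(fan_in, fan_out):
    """Bound of the Glorot uniform interval: sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))


he_limit = he_uniform_limit(496)
glorot_limit = glorot_uniform_limit(496, 64)
print(he_limit > glorot_limit)  # True
```

Because He uniform divides by $\text{fan\_in}$ alone, its interval is always at least as wide as the Glorot interval for the same layer.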


In [21]:
model.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv1d_2 (Conv1D)           (None, 126, 16)           3088      
                                                                 
 max_pooling1d_2 (MaxPoolin  (None, 31, 16)            0         
 g1D)                                                            
                                                                 
 flatten_6 (Flatten)         (None, 496)               0         
                                                                 
 dense_6 (Dense)             (None, 64)                31808     
                                                                 
Total params: 34896 (136.31 KB)
Trainable params: 34896 (136.31 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


* Since there is no padding (padding='valid'):
  - Output Length = Input Length − Kernel Size + 1 = 128 − 3 + 1 = 126
  - Final output shape (126,16)

  - The total number of trainable parameters in a Conv1D layer is calculated as:
  - $\text{Parameters} = (\text{Kernel Size} \times \text{Input channels} \times \text{Filters}) + \text{Bias terms}$
  - So, $\text{Parameters} = (3 \times 64 \times 16) + 16 = 3088$ (Each filter has one bias term)

* MaxPooling1D(pool_size=4)
  - Downsamples the output of Conv1D by taking the maximum value from non-overlapping windows of size 4.
  - Output Length: 126 / 4 = 31 (truncated).
  - Final Shape After Pooling: (31, 16)

* Flatten()
  - Flattens the 2D output of shape (31, 16) into a 1D vector: 31 × 16 = 496.
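
The numbers in this summary can be reproduced step by step. This is a sketch of the arithmetic only; `conv1d_param_count` is an illustrative helper and the pooling line assumes the default stride equal to `pool_size`.

```python
def conv1d_param_count(kernel_size, in_channels, filters):
    """Weights of all 1D filters plus one bias term per filter."""
    return kernel_size * in_channels * filters + filters


conv_length = 128 - 3 + 1                    # 'valid' padding, stride 1
conv_params = conv1d_param_count(3, 64, 16)  # 3*64*16 + 16
pool_length = (conv_length - 4) // 4 + 1     # non-overlapping windows of 4
flat_features = pool_length * 16             # Flatten: length * channels
dense_params = flat_features * 64 + 64       # Dense(64)
total_params = conv_params + dense_params
print(conv_length, conv_params, pool_length, flat_features, dense_params, total_params)
# 126 3088 31 496 31808 34896
```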

As the following example illustrates, we can also instantiate initializers in a slightly different manner, allowing us to set optional arguments of the initialization method.

In [31]:
# Add some layers to our model

model.add(Dense(64, 
                kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05), 
                bias_initializer=tf.keras.initializers.Constant(value=0.4), 
                activation='relu'),)

model.add(Dense(8, 
                kernel_initializer=tf.keras.initializers.Orthogonal(gain=1.0, seed=None), 
                bias_initializer=tf.keras.initializers.Constant(value=0.4), 
                activation='relu'))

* kernel_initializer=tf.keras.initializers.Orthogonal(gain=1.0, seed=None)
  - Orthogonal initialization ensures that the weight matrix is orthogonal.

A weight matrix $W$ is orthogonal if:

$$
W^T W = I
$$

where $I$ is the identity matrix.

* gain=1.0
  - A multiplicative factor that scales the orthogonal matrix. The default is 1.0.
    - A gain of sqrt(2) is a common choice for ReLU activations (the same scaling idea as He initialization).
    - Use 1.0 for linear or sigmoid activations.
* seed = None: an optional integer used to seed the random number generator so that the initialization is reproducible. With None, the weights differ from run to run.
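
The orthogonality condition can be verified numerically. The sketch below builds a small square matrix with orthonormal rows by Gram-Schmidt on random Gaussian vectors and checks that $W^T W$ is close to the identity. It uses only the standard library and is an illustration of the property, not the procedure Keras uses internally; `orthogonal_matrix` and `gram` are hypothetical helper names.

```python
import random


def orthogonal_matrix(n, gain=1.0, seed=0):
    """Return an n x n matrix (list of rows) with orthonormal rows, scaled by gain."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the rows accepted so far.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # skip a (very unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return [[gain * a for a in row] for row in rows]


def gram(matrix):
    """Compute W^T W for a square matrix given as a list of rows."""
    n = len(matrix)
    return [[sum(matrix[k][i] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


W = orthogonal_matrix(4)
G = gram(W)
max_error = max(abs(G[i][j] - (1.0 if i == j else 0.0))
                for i in range(4) for j in range(4))
print(max_error < 1e-9)  # True
```

With a gain $g$ other than 1, the same check gives $W^T W = g^2 I$.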

### Custom weight and bias initialisers
It is also possible to define your own weight and bias initialisers.
Initializers must take in two arguments, the `shape` of the tensor to be initialised, and its `dtype`.

Here is a small example, which also shows how you can use your custom initializer in a layer.

In [24]:
import tensorflow.keras.backend as K

In [26]:
# Define a custom initializer

def my_init(shape, dtype=None):
    return K.random_normal(shape, dtype=dtype)

model.add(Dense(64, kernel_initializer=my_init))

* shape: A tuple specifying the dimensions of the weight matrix. For example:
  - For a Dense layer with 64 neurons and an input of size 128, shape = (128, 64).
* dtype: The data type of the weights, typically float32.

* K.random_normal(shape, dtype=dtype): generates a matrix of random numbers sampled from a normal distribution (mean=0, std=1).
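  - For the example above, the weight matrix has shape (128, 64). In our model the layer follows Dense(8), so the initializer is called with shape = (8, 64).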

In [30]:
# Print the model summary of finalized model

model.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv1d_2 (Conv1D)           (None, 126, 16)           3088      
                                                                 
 max_pooling1d_2 (MaxPoolin  (None, 31, 16)            0         
 g1D)                                                            
                                                                 
 flatten_6 (Flatten)         (None, 496)               0         
                                                                 
 dense_6 (Dense)             (None, 64)                31808     
                                                                 
 dense_7 (Dense)             (None, 64)                4160      
                                                                 
 dense_8 (Dense)             (None, 8)                 520       
                                                      

* Conv1D(filters=16, kernel_size=3, input_shape=(128, 64), kernel_initializer='random_uniform', bias_initializer="zeros", activation='relu')
  - No stride: Defaults to 1
  - No padding: defaults to 'valid' (no padding)
  - So, the output length is reduced to
  - Output Length = Input Length − Kernel Size + 1 = 128 − 3 + 1 = 126
  - Output shape: (None, 126, 16) (batch size remains None).
 
  - What if stride =2 and padding ="same"?
$$
\text{Output Length} = \left\lceil \frac{\text{Input Length}}{\text{Stride}} \right\rceil
$$
$$
\text{Output Length} = \left\lceil \frac{\text{128}}{\text{2}} \right\rceil = 64
$$
  - The output shape becomes: (None,64,16)
 
  - Number of parameters:
    - Parameters = (Kernel Size × Input Channels × Filters) + Bias Terms
    - Parameters = (3 × 64 × 16) + 16 = 3072 + 16 = 3088
    - Even though stride and padding are applied, the number of parameters is unchanged.

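* MaxPooling1D(pool_size=4): the stride defaults to the pooling window size (4) and the padding to 'valid', so the result is rounded down:

$$
\text{Output Length} = \left\lfloor \frac{\text{Input Length} - \text{Pooling Window Size}}{\text{Stride}} + 1 \right\rfloor
$$
$$
\text{Output Length} = \left\lfloor \frac{126 - 4}{4} + 1 \right\rfloor = \left\lfloor 31.5 \right\rfloor = 31
$$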

* Output Shape: (None, 31, 16).

* Flatten(): Flattened Size=31×16=496

* Dense(64, kernel_initializer='he_uniform', bias_initializer='ones', activation='relu')
  - Input Shape: (496).
  - Output Units: 64.
 
  - #parameters: Parameters = (Input Features × Output Units) + Bias Terms = (496 × 64) + 64 = 31808

* Dense(64, kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05), bias_initializer=tf.keras.initializers.Constant(value=0.4), activation='relu')
  - Input Shape: (64) (from previous Dense layer).
  - Output Units: 64
  - Parameters = (64 × 64) + 64 = 4096 + 64 = 4160

* Dense(8, kernel_initializer=tf.keras.initializers.Orthogonal(gain=1.0), bias_initializer=tf.keras.initializers.Constant(value=0.4), activation='relu')
  - Input Shape: (64) (from previous Dense layer).
  - Output Units: 8
  - Parameters = (64 × 8) + 8 = 512 + 8 = 520

* Dense(64, kernel_initializer=my_init)
  - Input Shape: (8) (from previous Dense layer).
  - Output Units: 64
  - Parameters = (8 × 64) + 64 = 512 + 64 = 576

* Dense(64, kernel_initializer=my_init)
  - Input Shape: (64) (from previous Dense layer).
  - Output Units: 64
  - Parameters = (64 × 64) + 64 = 4096 + 64 = 4160
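
The Dense parameter counts listed above follow from chaining the unit counts, since each layer's output width is the next layer's input width. The sketch below assumes the layers are stacked in the order listed, starting from the 496 flattened features; `dense_param_counts` is an illustrative helper. It also evaluates the stride-2, 'same'-padding output length from the Conv1D what-if.

```python
import math


def dense_param_counts(input_features, units_per_layer):
    """Parameter count of each Dense layer in a stack, in order."""
    counts = []
    features = input_features
    for units in units_per_layer:
        counts.append(features * units + units)  # weights + biases
        features = units  # this layer's output feeds the next layer
    return counts


counts = dense_param_counts(496, [64, 64, 8, 64, 64])
print(counts)  # [31808, 4160, 520, 576, 4160]

strided_length = math.ceil(128 / 2)  # Conv1D with strides=2, padding='same'
print(strided_length)  # 64
```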