# **Convolutional Neural Network Architectures for Image Analysis**

**Objectives:**
- Understand how to set up a convolutional neural network (CNN) in 5 steps: architecture setup, optimisation setup, training, validation and testing
- Understand the residual neural network (ResNet) architecture
- Understand the U-Net architecture

**Why can't we just use a dense NN to process images?**

Dense neural networks are not ideal for processing images. Firstly, image data has to be flattened into a very long vector of pixel values, which means we need many weights and biases. A 100x100 RGB image would have an input dimension of 30,000! Additionally, the forward pass would take a long time, and effectively training such a large model (with lots of learnable parameters) requires a large amount of data. Finally, flattening discards the 2D layout of the pixels, so a dense neural network cannot easily capture the spatial relationships within an image.
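
To make the scale concrete, here is a minimal sketch that counts the learnable parameters of a single dense layer fed with the flattened 100x100 RGB image. The hidden layer width of 1,000 units is an assumption chosen purely for illustration.

```python
# Count the learnable parameters of one dense layer applied to a flattened image.
height, width, channels = 100, 100, 3   # the 100x100 RGB example from the text
hidden_units = 1000                      # assumed layer width, for illustration only

input_dim = height * width * channels    # length of the flattened pixel vector
weights = input_dim * hidden_units       # one weight per (input, hidden unit) pair
biases = hidden_units                    # one bias per hidden unit
dense_params = weights + biases

print(f"input dimension: {input_dim:,}")             # 30,000
print(f"dense layer parameters: {dense_params:,}")   # 30,001,000
```

Under these assumptions, the first layer alone already holds about 30 million parameters.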

**What is a CNN?**

A CNN is a deep learning model designed to process and analyse visual data, such as images. It automatically learns patterns and features from an image by applying filters with learnable weights and biases. These filters detect important features like edges, textures, shapes and eventually entire objects. One of the key strengths of CNNs is their ability to recognise spatial hierarchies of features. This means the network learns to detect simple patterns (e.g. edges) in the early layers, then combines these patterns to identify more complex structures (e.g. corners and textures) in deeper layers, and finally recognises entire objects (e.g. faces, cars). This step-by-step feature extraction allows CNNs to understand images in a way that mimics how humans perceive what they see. CNNs can automatically learn which features are important, making them highly effective for tasks such as **image classification**, **facial recognition** and **object detection**.

*Here is a visualisation of a convolutional neural network:* 

<div style="text-align: center;">
    <img src="images/cnn_visualisation.png" width="750" height="250">
</div>

### Input Images

Images are represented as **3D tensors** (multi-dimensional arrays), structured as a stack of 2D colour channels. The most commonly used colour model is the RGB colour model, which has 3 channels: red, green and blue. Each pixel is defined by 3 values, one per channel, representing the intensity of each colour, where 0 is the lowest intensity and 255 is the highest intensity. The picture on the left shows a 4x4x3 RGB image, where the height and width are 4 pixels each, and the depth represents the 3 colour channels. Note that an image can have more than 3 channels (C > 3). The picture on the right represents a general image tensor.

<div style="display: flex; justify-content: center; gap: 10px;">
    <img src="images/rgb_grid.png" width="400" height="300">
    <img src="images/image_tensor.png" width="400" height="300">
</div>
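
As a rough illustration of this representation, the sketch below builds a 4x4x3 image tensor with NumPy. The pixel values are random placeholders rather than a real image, and the height x width x channels layout follows the description above.

```python
import numpy as np

# A hypothetical 4x4 RGB image filled with random 8-bit intensities (0 to 255).
rng = np.random.default_rng(seed=0)
image = rng.integers(low=0, high=256, size=(4, 4, 3), dtype=np.uint8)

print(image.shape)             # (4, 4, 3): height, width, colour channels

red_channel = image[:, :, 0]   # 2D slice holding only the red intensities
top_left_pixel = image[0, 0]   # the 3 values (R, G, B) of a single pixel

print(red_channel.shape)       # (4, 4)
print(top_left_pixel.shape)    # (3,)
```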


### Convolution Layer

**Kernel**

A kernel in a convolution layer is a small matrix of weights that acts like a filter, sliding across an input image to extract specific local features. At each position, it multiplies its weights element-wise with the corresponding pixels in the image and sums the results (convolution). It essentially identifies patterns within the data.

For example, suppose the kernel/filter is:

$$
K =
\begin{bmatrix}
1 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 1
\end{bmatrix}
$$


Consider a 5x5x1 image, representing a single-channel image with 5x5 pixels. For simplicity, let's assume pixel values are either 0 or 1, though in reality they can range from 0 to 255. We apply the 3x3x1 kernel defined above. This kernel will slide across the image, performing convolution at each step. The images below illustrate this process, with the left side showing the kernel's movement and the right showing the resulting convolved feature.

<div style="display: grid; grid-template-columns: repeat(3, 1fr); gap: 10px; max-width: 600px;">
    <img src="images/conv_layer_1.png" alt="Image 1" style="width: 100%; height: auto;">
    <img src="images/conv_layer_2.png" alt="Image 2" style="width: 100%; height: auto;">
    <img src="images/conv_layer_3.png" alt="Image 3" style="width: 100%; height: auto;">
    <img src="images/conv_layer_4.png" alt="Image 4" style="width: 100%; height: auto;">
    <img src="images/conv_layer_5.png" alt="Image 5" style="width: 100%; height: auto;">
    <img src="images/conv_layer_6.png" alt="Image 6" style="width: 100%; height: auto;">
    <img src="images/conv_layer_7.png" alt="Image 7" style="width: 100%; height: auto;">
    <img src="images/conv_layer_8.png" alt="Image 8" style="width: 100%; height: auto;">
    <img src="images/conv_layer_9.png" alt="Image 9" style="width: 100%; height: auto;">
</div>

Note the movement of the kernel:

<div style="text-align: center;">
    <img src="images/kernel_movement.png" width="280" height="270">
</div>
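
The sliding process can be sketched in a few lines of NumPy. The 5x5 binary image below is a hypothetical example chosen for illustration (it is not necessarily the one shown in the figures), and `convolve2d_valid` is a helper name introduced here. It assumes a stride of 1 and no padding, and, as is usual in deep learning, it does not flip the kernel.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    feature_map = np.zeros((out_h, out_w), dtype=int)
    for row in range(out_h):          # move top to bottom
        for col in range(out_w):      # move left to right
            patch = image[row:row + k_h, col:col + k_w]
            # Element-wise multiplication followed by a sum gives one output value.
            feature_map[row, col] = np.sum(patch * kernel)
    return feature_map

# Hypothetical 5x5 single-channel image with pixel values 0 or 1.
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])

# The 3x3 kernel K defined in the text.
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

A 5x5 input and a 3x3 kernel give a 3x3 feature map, because the kernel fits in only 3 positions along each axis.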

**How does this process extract important features?**

$$
K = 
\begin{bmatrix}
-1 & 0 & 1 \\
-1 & 0 & 1 \\
-1 & 0 & 1
\end{bmatrix}
$$
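
The kernel above has negative weights in its left column and positive weights in its right column, so its output is large wherever intensity increases from left to right, which is what a vertical edge looks like. The sketch below applies it to two hypothetical 5x5 images, one containing a vertical edge and one that is uniformly bright. The images and the helper `convolve2d_valid` are assumptions introduced for illustration, with a stride of 1 and no padding.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    feature_map = np.zeros((out_h, out_w), dtype=int)
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + k_h, col:col + k_w]
            feature_map[row, col] = np.sum(patch * kernel)
    return feature_map

# The kernel K from the text: left column negative, right column positive.
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
])

# Hypothetical image: dark (0) on the left, bright (1) on the right.
edge_image = np.tile([0, 0, 1, 1, 1], (5, 1))
# Hypothetical image with no edges at all.
flat_image = np.ones((5, 5), dtype=int)

edge_response = convolve2d_valid(edge_image, kernel)
flat_response = convolve2d_valid(flat_image, kernel)

print(edge_response)
# [[3 3 0]
#  [3 3 0]
#  [3 3 0]]
print(flat_response)
# [[0 0 0]
#  [0 0 0]
#  [0 0 0]]
```

The response is strongest where the window straddles the dark-to-bright boundary and zero where the window covers a uniform region.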
