# Convolutional Neural Networks - Basics

## A tidbit of history 📜

CNNs first appeared in the 1990s. The architecture was introduced by [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) to classify handwritten digits from images. Due to the outstanding performance of CNNs on image classification tasks, this type of feedforward neural network gained a lot of attention and led to many improvements in the field of computer vision. That's the reason why I am learning about CNNs today.

## Why CNNs... and not our good old MLPs? 🤷

- When using MLPs, all pixels have to be collapsed onto ONE axis before being fed into the network. So, instead of feeding in a matrix, we feed in a vector. The consequence? Our model loses the spatial information in the input image.

- Using a convolutional layer instead of a fully connected layer substantially reduces the number of weights.

- Using this architecture lets us feed the image into the model as is, thus preserving all the information.
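To get a feel for the second point, here is a rough back-of-the-envelope count. All the sizes below (a 28×28 grayscale image, 100 hidden units, 100 filters of size 5×5) are hypothetical values I picked for illustration, and biases are ignored.

```python
# Hypothetical sizes, chosen only for illustration.
height, width = 28, 28   # a small grayscale image
hidden_units = 100       # units in a fully connected layer
kernel_size = 5          # a 5x5 filter
num_filters = 100        # filters in a convolutional layer

# Fully connected: every pixel is connected to every hidden unit.
dense_weights = height * width * hidden_units

# Convolutional: each filter reuses the same kernel_size x kernel_size
# weights at every position of the image.
conv_weights = kernel_size * kernel_size * num_filters

print("Fully connected weights:", dense_weights)  # 78400
print("Convolutional weights:  ", conv_weights)   # 2500
```

The gap grows quickly with the image size, because the convolutional count does not depend on the image dimensions at all.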

## What is a CNN made of?

Typically, CNNs are composed of different types of layers.

1. **Convolutional** layers

2. **Subsampling** layers

3. **Fully connected** layers

Let's explore each layer type.

### Convolutional layers

**Convolutional layers** are, in my opinion, the most important of all the layers in a CNN. As a matter of fact, the architecture is named after them. I have a lot to learn about them. Let's get ready.

A **discrete** convolution (or simply convolution) is a fundamental operation in a CNN. It is the operation that happens within convolutional layers. Here is a description of what a convolution is. 

Imagine you have two sequences of numbers, let's call them sequence A and sequence B. At a given position, discrete convolution multiplies each number from sequence B with the number from sequence A it is aligned with, then adds up all these products to get ONE number. Repeating this at every position creates a new sequence, which we'll call the convolution result.

To better visualize it, think of sequence B as a sliding window that moves across sequence A. At each position, the numbers in both sequences align, we multiply them together, and the sum of these products becomes one element of the convolution result.

Discrete convolution is widely used in various fields, such as signal processing and image processing, to analyze and manipulate sets of data. It helps us find relationships, extract features, and perform operations like blurring, sharpening, or detecting patterns in signals or images.

Personally, every time I think about the word "convolution", the image of a sliding window pops up in my head. Here is an illustration of a 1-D discrete convolution. Notice: The filter is rotated before the convolution is computed.

With a basic intuition of what a convolution is, let's learn some basic definitions and notation. A discrete convolution for two vectors $x$ and $w$ is denoted by:

$$
y = x * w \to y[i] = \sum_{k=-\infty}^{+\infty} x[i - k] \hspace{1mm} w[k]
$$

- Vector $x$ is the **input**, also called "signal"
- Vector $w$ is known as the **filter**, or "kernel"
- Vector $y$ is the result of the convolution. It is called a **feature map**.

The first thing that I found weird was the summation from $-\infty$ to $+\infty$, mainly because we always deal with finite feature vectors, at least in my relatively limited machine learning experience. For example, if $x$ has 10 features with indices $0, 1, 2, \dots, 8, 9$, then indices $-1$ and $10$ are out of bounds for $x$. Therefore, to correctly compute the summation shown in the preceding formula, it is assumed that **x and w are filled with zeros** outside their defined indices. This will result in an output vector, $y$, that also has infinite size, with lots of zeros as well. Since this is not useful in practical situations, $x$ is padded only with a finite number of zeros.

This process, I learned, is called **zero-padding** or simply **padding**. The number of zeros padded on each side is denoted by the letter $p$.

![Example of padding](./images/img-2.jpg)
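As a small sketch of zero-padding in code, NumPy's `np.pad` can add $p$ zeros on each side of a vector. The input values and $p = 2$ below are example values of my own, not the ones from the picture.

```python
import numpy as np

x = np.array([1, 3, 2, 4, 5])   # example input with n = 5 elements
p = 2                           # number of zeros added on each side

# mode="constant" pads with a constant value, which defaults to 0.
x_padded = np.pad(x, pad_width=p, mode="constant")

print(x_padded)        # [0 0 1 3 2 4 5 0 0]
print(len(x_padded))   # n + 2p = 9
```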

Let's assume the original input, $x$, and filter, $w$, have $n$ and $m$ elements, respectively, where $m \leq n$.
The padded vector, $x^{p}$, then has size $n + 2p$. The practical formula for computing a discrete convolution will change to the following:

$$
y = x * w \to y[i] = \sum_{k=0}^{m - 1} x^{p}[i + m - 1 - k] \hspace{1mm} w[k]
$$

$x$ is the original input and has $n$ elements.

$w$ is the filter and has $m$ elements.

$x^{p}$ is the padded vector and has size $n + 2p$.

The second thing that is worth mentioning is the indexing. $x$ and $w$ are indexed in *different directions* in the summation above. To make sure both $x$ and $w$ are indexed in the same direction, we can rotate (i.e. flip) the filter. If we assume $w^{r}$ to be the rotated filter, then the dot product $x^{p}[i : i+m] \cdot w^{r}$ gives one element $y[i]$, where $x^{p}[i : i+m]$ is a patch of $x^{p}$ with size $m$. The operation is repeated, like in a *sliding window* approach, to get all the output elements. Here is an example:

![1-D Discrete convolution](./images/img-1.png)

In the previous example, the padding size is zero ($p = 0$). Notice that the rotated filter $w^{r}$ is shifted by two cells each time. This shift, in addition to the padding value, is another hyperparameter of a convolution. The shift I just described is called the **stride**, and is denoted $s$. In the previous example, $s = 2$.

In practice, the padding is not always set to zero. In fact, there are three ways to "pad" our input vector:

1. **Full** mode: The padding parameter $p$ is set to $m - 1$. Full mode increases the dimensions of the output, so it is rarely used in CNN architectures.

2. **Same** mode: The padding parameter $p$ is chosen so that the output vector has **the same size** as the input vector $x$. For this reason, $p$ is computed *according* to (1) the filter size, (2) the stride, and (3) the input size. This mode is the most commonly used. The output size $o$ of a convolution is given by the expression below. You can use it to determine the appropriate value of $p$. I'll let you try :) 
$$o = \left\lfloor \frac{n + 2p - m}{s} \right\rfloor + 1$$

3. **Valid** mode: The padding is set to zero ($p = 0$). It results in an output vector smaller than the input vector $x$. It is the case we explored in the example earlier.
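To make the three modes concrete, here is a small sketch that plugs numbers into the output-size formula. The helper name `conv_output_size` and the sizes $n = 8$, $m = 5$ are my own choices for illustration. For the same mode, I assume a stride of 1 and an odd filter size, in which case $p = (m - 1) / 2$ works.

```python
def conv_output_size(n, m, p, s):
    """Output length o = floor((n + 2p - m) / s) + 1."""
    return (n + 2 * p - m) // s + 1

n, m = 8, 5   # example input and filter sizes

full = conv_output_size(n, m, p=m - 1, s=1)         # full mode
same = conv_output_size(n, m, p=(m - 1) // 2, s=1)  # same mode (odd m, s = 1)
valid = conv_output_size(n, m, p=0, s=1)            # valid mode

print(full, same, valid)   # 12 8 4

# A strided case: n = 3, m = 3, p = 1, s = 2 gives an output of size 2.
print(conv_output_size(3, 3, p=1, s=2))   # 2
```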

Before closing this long section on convolution layers, I would like to talk about **Cross-correlation**.

A "Cross-correlation" (or simply correlation) between an input vector $x$ and a filter $w$ is denoted $y = x * w$ just like a convolution, but unlike convolutions... the filter is **not** rotated. Most deep learning frameworks (including PyTorch) implement correlation, but refer to it as "convolution."
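Here is a quick sketch of the difference using NumPy's `np.convolve` and `np.correlate` on example vectors of my own choosing. If the only difference really is the rotation, then correlating with $w$ should give the same result as convolving with the flipped filter.

```python
import numpy as np

x = np.array([1, 3, 2, 4, 5, 6, 1, 3])
w = np.array([1, 0, 3, 1, 2])

conv = np.convolve(x, w, mode="valid")    # the filter is rotated
corr = np.correlate(x, w, mode="valid")   # the filter is used as is

# Convolving with the flipped filter undoes the rotation.
corr_via_conv = np.convolve(x, w[::-1], mode="valid")

print("Convolution:      ", conv)
print("Cross-correlation:", corr)
print("Match:", np.array_equal(corr, corr_via_conv))   # Match: True
```

For a symmetric filter the two operations give identical results, so the distinction only matters when the filter is not symmetric.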

\*Exhale\*! We learned a lot of new terms and concepts. Let's now implement a basic 1-D convolution, like the one described in the image above.

In [1]:
#performing a discrete 1-D convolution
import numpy as np

def conv1d(x, w, p=0, s=1):
    #1- Rotate filter
    w_rot = np.array(w[::-1])
    
    #2- Pad input vector
    x_padded = np.array(x)
    if p > 0:
        zero_pad = np.zeros(shape=p)
        x_padded = np.concatenate([
            zero_pad, x_padded, zero_pad
        ])
        
    #3- Slide along the input.
    #   From 0 to (input_length - filter size) + 1, and making 's' steps per iteration
    res = []
    for i in range(0, int(len(x_padded) - len(w_rot)) + 1, s):
        res.append(
            np.sum(x_padded[i: i + w_rot.shape[0]] * w_rot)
        )
    
    return res

Let's test the `conv1d` function and use NumPy's `convolve` to double check our results.

In [2]:
x = [1, 3, 2, 4, 5, 6, 1, 3]
w = [1, 0, 3, 1, 2]

print('Conv1d Implementation:',
     conv1d(x, w, p=2, s=1))

Conv1d Implementation: [5.0, 14.0, 16.0, 26.0, 24.0, 34.0, 19.0, 22.0]


In [3]:
print('NumPy Results:', 
      np.convolve(x, w, mode='same'))

NumPy Results: [ 5 14 16 26 24 34 19 22]


Results match!

Before finishing and moving on to the next section, I would like to implement a 2-D convolution. It is particularly important that I get an intuition for 2-D convolutions because the CNNs I will implement will operate on 2-D matrices. I came across this GIF which I think illustrates the idea pretty nicely.

![2-D convolution GIF](./images/img-gif-1.gif)

It demonstrates the 2-D convolution of an input matrix of size $(6 \times 6)$ with a kernel (i.e. filter) of size $(3 \times 3)$. The output of the 2-D convolution is another matrix of size $(6 \times 6)$. Because the input and output matrices have the **same** size, can you tell what padding was used?😊

Let's consider another example. An input matrix of size $(3 \times 3)$ is convolved with a kernel of size $(3 \times 3)$ using padding $p = (1, 1)$ and stride $s = (2, 2)$.

![2-D Convolution](./images/img-3.jpg)

The result is the following $2 \times 2$ matrix:

$$
\begin{bmatrix}
4.6 & 1.6\\
7.5 & 2.9
\end{bmatrix}
$$

Let's write a naive implementation for the 2-D convolution just like what we did for the 1-D convolution.

In [4]:
import numpy as np
import scipy.signal

def conv2d(X, W, p=(0, 0), s=(1, 1)):    
    #Rotate filter
    W_rot = np.array(W)[::-1,::-1]
    
    #Copy input matrix
    X_orig = np.array(X)
    
    #Compute dimension of padded matrix
    n1 = X_orig.shape[0] + 2*p[0]
    n2 = X_orig.shape[1] + 2*p[1]
    
    #Create a matrix full of zeros THEN fill in the old array in the new array
    #The result is the original matrix, padded.
    X_padded = np.zeros(shape=(n1, n2))
    X_padded[p[0]:p[0]+X_orig.shape[0],
             p[1]:p[1]+X_orig.shape[1]] = X_orig

    res = []
    for i in range(0,
                   int((X_padded.shape[0] - W_rot.shape[0]))+1, s[0]):
        res.append([])
        
        for j in range(0,
                      int((X_padded.shape[1] - W_rot.shape[1])) + 1, s[1]):
            X_sub = X_padded[i:i+W_rot.shape[0],
                            j:j+W_rot.shape[1]]
            
            res[-1].append(np.sum(X_sub * W_rot))
            
    return np.array(res)

Let's test the `conv2d` function, and use SciPy's `convolve2d` to double check the results.

In [5]:
# REMEMBER: Numpy is by default 'row-major'
X = [[1, 3, 2, 4], [5, 6, 1, 3], [1, 2, 0, 2], [3, 4, 3, 2]]
W = [[1, 0, 3], [1, 2, 1], [0, 1, 1]]

print('Conv2d Implementation:\n',
      conv2d(X, W, p=(1, 1), s=(1, 1)))

Conv2d Implementation:
 [[11. 25. 32. 13.]
 [19. 25. 24. 13.]
 [13. 28. 25. 17.]
 [11. 17. 14.  9.]]


In [6]:
print('SciPy Results:\n',
      scipy.signal.convolve2d(X, W, mode='same'))

SciPy Results:
 [[11 25 32 13]
 [19 25 24 13]
 [13 28 25 17]
 [11 17 14  9]]


Interestingly, both results match!

*Note: We implemented both `conv1d` and `conv2d` for educational purposes, to get a sense of what is going on. They are not meant to be used in real-world NN applications.*

To conclude my discussion on convolutional layers, I would like to talk about **images**, specifically what happens to an image going through a convolution layer in a CNN. I talked about convolutions, how to compute them, but now... I want to say how we are going to apply this to images.

Images are made of pixels. But in the case of colored (i.e. RGB) images, I know that the color of **EACH** pixel is obtained by *combining different amounts* of the colors Red, Green, and Blue. A **color channel** refers to the set of values that represent the intensity of pixels for a specific color in an image. In RGB images, there are three color channels: red, green, and blue.

As a result, colored images are often represented by 3-D matrices. Different Deep Learning frameworks will follow different conventions, but I know for sure that one dimension represents the width of the image, another represents the height of the image, and the remaining one represents the number of **color channels** in the image... which in our case is **3**.
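As a small illustration of these conventions: PyTorch expects the channel dimension first (channels, height, width), while TensorFlow/Keras defaults to channels last (height, width, channels). The sketch below uses a hypothetical $4 \times 6$ RGB image filled with zeros and moves between the two layouts with `np.transpose`.

```python
import numpy as np

# A hypothetical 4x6 RGB image stored channels-last: (height, width, channels).
image_hwc = np.zeros((4, 6, 3))

# Move the channel axis to the front: (channels, height, width).
image_chw = np.transpose(image_hwc, (2, 0, 1))

print(image_hwc.shape)   # (4, 6, 3)
print(image_chw.shape)   # (3, 4, 6)
```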

But the question now is:

> **How do you convolve on a 3-D matrix?**

Luckily, I came across this beautiful picture on Stack Overflow, which I think illustrates the idea pretty nicely.

![3-D Convolution](./images/img-4.png)

Let's talk about the picture a little bit. 

Consider that the input is a $6 \times 6$ picture. The proportion of *green* in **EACH** pixel of the image is stored in the green matrix. The proportion of *blue* in **EACH** pixel of the image is stored in the blue matrix. And, the proportion of *red* in **EACH** pixel of the image is stored in the red-ish matrix. As you and I can see, this single image is stored as a $6 \times 6 \times 3$ array (i.e. a 3-D matrix).

Then we have the filters. Notice, we have 2 filters. In fact, we can have as many filters as we want. The height and width of a filter must not exceed those of the input image. But most importantly, the filters must have the *SAME* number of channels as the input image (3). Another way to put this would be this: If your input image is a 3-D matrix, your filters must also be 3-D matrices. If the image is a 1-D matrix, then the filters should also be 1-D matrices. Make sense? 🤔

With the image and the filters described, let's talk about the convolution. The input matrix (i.e. image) is convolved with **EACH** filter, resulting in a *feature map* of size $(4 \times 4)$. Since we have TWO filters, we end up with TWO feature maps, each of size $(4 \times 4)$. If the input matrix was convolved with 5 kernels, we'd have 5 feature maps as a result.

So, the resulting matrix after convolving the input image with both filters is of size $(4 \times 4 \times 2)$.
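Here is a naive sketch of that idea, in the same spirit as `conv2d`. The function name `conv2d_multichannel` is hypothetical, the inputs are random numbers, and I assume $3 \times 3$ filters with no padding and a stride of 1, which is what turns a $6 \times 6$ input into a $4 \times 4$ feature map. The key point is that each output value sums over height, width, AND channels.

```python
import numpy as np

def conv2d_multichannel(image, filters):
    """Valid-mode, stride-1 convolution of an (H, W, C) image with
    filters of shape (K, kh, kw, C). Returns (H-kh+1, W-kw+1, K)."""
    H, W, C = image.shape
    K, kh, kw, _ = filters.shape
    # Rotate each filter along its two spatial axes, as conv2d does.
    rotated = filters[:, ::-1, ::-1, :]
    out = np.zeros((H - kh + 1, W - kw + 1, K))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                patch = image[i:i + kh, j:j + kw, :]
                # Multiply element-wise, then sum over all three axes.
                out[i, j, k] = np.sum(patch * rotated[k])
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))        # one 6x6 RGB image
filters = rng.random((2, 3, 3, 3))   # two 3x3 filters, 3 channels each

feature_maps = conv2d_multichannel(image, filters)
print(feature_maps.shape)   # (4, 4, 2)
```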

### Subsampling layers

Subsampling layers usually come *after* convolutional layers in CNNs. These layers are where the **pooling** operation takes place. A pooling (i.e. subsampling) layer is usually denoted $P_{( n_1 \times n_2)}$. The subscript denotes the number of adjacent pixels in each dimension where the pooling operation is performed. We refer to such a neighborhood as the *pooling size*.

Typically two forms of pooling operations take place in CNNs: **max-pooling** and **mean-pooling**. Max-pooling takes the maximum value from a neighborhood of pixels, while mean-pooling computes their average. Here is an image that illustrates the point nicely:

![Pooling example](./images/img-5.jpg)

But why is pooling useful? The advantage is twofold:

1. Pooling reduces the size of features, which results in higher computational efficiency.

2. Pooling (max-pooling) introduces a local invariance. This means that small changes in a local neighborhood do not change the result of max-pooling. It helps with generating features that are more resistant to noise in the input data.
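Both points can be seen in a minimal sketch. The helper `pool2d` below is a hypothetical function of my own. It assumes non-overlapping neighborhoods and an input whose dimensions are divisible by the pooling size. The matrix values are example values, not the ones from the picture.

```python
import numpy as np

def pool2d(X, size=(2, 2), mode="max"):
    """Non-overlapping pooling; assumes X's shape is divisible by size."""
    X = np.asarray(X, dtype=float)
    n1, n2 = size
    rows, cols = X.shape[0] // n1, X.shape[1] // n2
    # Reshape so that each pooling neighborhood gets its own pair of axes.
    blocks = X[:rows * n1, :cols * n2].reshape(rows, n1, cols, n2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

X = [[1, 3, 2, 4],
     [5, 6, 1, 3],
     [1, 2, 0, 2],
     [3, 4, 3, 2]]

print(pool2d(X, mode="max"))    # values: [[6, 4], [4, 3]]
print(pool2d(X, mode="mean"))   # values: [[3.75, 2.5], [2.5, 1.75]]

# Local invariance: a small change to a non-maximal pixel
# leaves the max-pooled output untouched.
X_noisy = [row[:] for row in X]
X_noisy[0][0] = 2
print(np.array_equal(pool2d(X, mode="max"),
                     pool2d(X_noisy, mode="max")))   # True
```

The $4 \times 4$ input shrinks to $2 \times 2$, which is the size reduction from the first point.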

### Fully-connected layers

A fully-connected layer is a layer in which every single activation unit is connected to *ALL* activation units in the following layer. An MLP is an example of a network made of multiple fully-connected layers.
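In code, "every unit connected to all units" boils down to a matrix multiplication. The sketch below is only an illustration under my own assumptions: random numbers stand in for the $4 \times 4 \times 2$ feature maps, the 10 output units are a hypothetical choice, and no activation function is applied.

```python
import numpy as np

rng = np.random.default_rng(0)

feature_maps = rng.random((4, 4, 2))   # stand-in for the output of earlier layers
x = feature_maps.reshape(-1)           # flatten to a vector of 32 values

n_out = 10                             # hypothetical number of output units
W = rng.random((n_out, x.size))        # one weight per (output, input) pair
b = np.zeros(n_out)                    # one bias per output unit

z = W @ x + b
print(W.shape)   # (10, 32)
print(z.shape)   # (10,)
```

Note that the image-like structure is flattened before this layer, which is exactly the collapse to one axis that MLPs require.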