# Convolutional Neural Networks - Basics

## A tidbit of history 📜

CNNs first appeared in the 1990s. The architecture was introduced by [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) to classify handwritten digits from images. Due to the outstanding performance of CNNs on image classification tasks, this type of feedforward neural network gained a lot of attention and led to many improvements in the field of computer vision. That's the reason why I am learning about CNNs today.

## Why CNNs... and not our good old MLPs? 🤷

- When using MLPs, all pixels have to be collapsed to ONE axis before being fed into the network. So, instead of feeding in a matrix, we feed in a vector. The consequence? Our model loses the spatial information in the input image.

- Using a CNN instead of a fully connected layer substantially reduces the number of weights.

- Using this architecture lets us feed the image into the model as is, thus preserving all the information.
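To put a rough number on the weight savings, here is a back-of-the-envelope sketch. All sizes (a 28×28 grayscale image, 100 hidden units, eight 5×5 filters) are assumptions picked for illustration, not values from a particular model.

```python
# Hypothetical sizes, chosen only for illustration.
height, width = 28, 28      # grayscale input image
hidden_units = 100          # units in one fully-connected layer
num_filters = 8             # filters in one convolutional layer
kernel_h, kernel_w = 5, 5   # size of each filter

# Fully-connected: one weight per (pixel, unit) pair, plus one bias per unit.
dense_params = height * width * hidden_units + hidden_units

# Convolutional: each filter reuses the same weights at every position,
# so the count does not depend on the image size (single input channel).
conv_params = num_filters * (kernel_h * kernel_w + 1)

print(dense_params)  # 78500
print(conv_params)   # 208
```

Under these assumed sizes, the fully-connected layer needs several hundred times more parameters than the convolutional one.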

## What is a CNN made of?

Typically, CNNs are composed of different types of layers.

1. **Convolutional** layers

2. **Subsampling** layers (also known as pooling layers)

3. **Fully-connected** layers

Let's explore each layer type.

### Convolutional layers

**Convolutional layers** are, in my opinion, the most important of all the layers in a CNN. As a matter of fact, the architecture is named after them.

A **discrete** convolution (or simply convolution) is a fundamental operation in a CNN. It is the operation that happens within convolutional layers. Here is a description of what a convolution is. 

Imagine you have two sequences of numbers; let's call them sequence A and sequence B. Discrete convolution lines up the numbers of sequence B with numbers from sequence A, multiplies each aligned pair, and adds up the products to get one number. Doing this at every alignment creates a new sequence, which we'll call the convolution result.

To better visualize it, think of sequence B as a sliding window that moves across sequence A. At each position, the numbers in both sequences align, and we multiply them together. Each element of the convolution result is the sum of these products at one position of the window.

![1-D Discrete convolution](./images/img-1.png)

Discrete convolution is widely used in various fields, such as signal processing and image processing, to analyze and manipulate sets of data. It helps us find relationships, extract features, and perform operations like blurring, sharpening, or detecting patterns in signals or images.

Personally, every time I think about the word "convolution", the image of a sliding window pops up in my head. The illustration above shows a 1-D discrete convolution. Notice that the filter is rotated (flipped) before the convolution is computed.

With a basic intuition of what a convolution is, let's learn some basic definitions and notation. A discrete convolution for two vectors $x$ and $w$ is denoted by:

$$
y = x * w \to y[i] = \sum_{k=-\infty}^{+\infty} x[i - k] \hspace{1mm} w[k]
$$

- Vector $x$ is the **input**, also called "signal"
- Vector $w$ is known as the **filter**, or "kernel"
- Vector $y$ is the result of the convolution. It is called a **feature map**.
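As a sanity check on the formula, here is a minimal sketch that evaluates it literally for finite vectors. It assumes that any entry outside a vector counts as zero, and the function name `conv_full` is made up for this note.

```python
def conv_full(x, w):
    """Evaluate y[i] = sum_k x[i - k] * w[k], treating out-of-range entries as zero."""
    n, m = len(x), len(w)
    y = []
    # Outside 0 <= i <= n + m - 2 every product involves a zero, so y[i] = 0 there.
    for i in range(n + m - 1):
        total = 0
        for k in range(m):
            if 0 <= i - k < n:  # x[i - k] is zero outside the original vector
                total += x[i - k] * w[k]
        y.append(total)
    return y


x = [1, 2, 3]
w = [1, 0, -1]
print(conv_full(x, w))  # [1, 2, 2, -2, -3]
```

Only the $n + m - 1$ entries where the two vectors overlap can be non-zero, which is why the loop over $i$ stops there.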

The first thing that I found weird was the summation from $-\infty$ to $+\infty$, mainly because we always work with finite feature vectors, at least in my relatively small machine learning experience. For example, if $x$ has 10 features with indices $0, 1, 2, \dots, 8, 9$, then indices $-1$ and $10$ are out of bounds for $x$. Therefore, to correctly compute the summation shown in the preceding formula, it is assumed that **x and w are filled with zeros** outside their defined indices. This will result in an output vector, $y$, that also has infinite size, with lots of zeros as well. Since this is not useful in practical situations, $x$ is padded only with a finite number of zeros.

This process, I learned, is called **zero-padding** or simply **padding**. The number of zeros padded on each side is denoted by the letter $p$.

![Example of padding](./images/img-2.jpg)
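Padding is easy to try out in plain Python. The helper below is a small sketch; the name `zero_pad` and the example values are assumptions for illustration and are not taken from the figure.

```python
def zero_pad(x, p):
    """Return a copy of x with p zeros added on each side."""
    return [0] * p + list(x) + [0] * p


x = [3, 2, 1, 7]
x_padded = zero_pad(x, p=2)
print(x_padded)       # [0, 0, 3, 2, 1, 7, 0, 0]
print(len(x_padded))  # 8
```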

Assume the original input, $x$, and filter, $w$, have $n$ and $m$ elements, respectively, where $m \leq n$.
Then the padded vector, $x^{p}$, has size $n + 2p$. The practical formula for computing a discrete convolution will change to the following:

$$
y = x * w \to y[i] = \sum_{k=0}^{m - 1} x^{p}[i + m - 1 - k] \hspace{1mm} w[k]
$$

- $x$ is the original input and has $n$ elements.
- $w$ is the filter and has $m$ elements.
- $x^{p}$ is the padded vector and has size $n + 2p$.

TODO: Talk about types of padding

Let's now implement a basic 1-D convolution, like the one described in the image above.

#performing a discrete 1-D convolution
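Here is one possible from-scratch version in plain Python. It is a sketch under a few assumptions: the stride is fixed at 1, the function name `conv1d` is my own, and the example vectors are arbitrary rather than the ones from the figure. Each output value is the sum of products between the flipped filter and the window of the padded input under it.

```python
def conv1d(x, w, p=0):
    """1-D discrete convolution of x with filter w, using p zeros of padding per side."""
    x_padded = [0] * p + list(x) + [0] * p
    w_rot = list(w)[::-1]  # rotate (flip) the filter first
    m = len(w_rot)
    y = []
    # Slide the flipped filter over the padded input, one position at a time.
    for i in range(len(x_padded) - m + 1):
        window = x_padded[i:i + m]
        y.append(sum(a * b for a, b in zip(window, w_rot)))
    return y


x = [1, 3, 2, 4, 5, 6, 1, 3]
w = [1, 0, 3, 1, 2]
print(conv1d(x, w, p=2))  # [5, 14, 16, 26, 24, 34, 19, 22]
```

With $n = 8$, $m = 5$ and $p = 2$, the padded input has 12 elements and the window fits in 8 positions, so the output here has the same length as the input.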

Before finishing and moving on to other layers, let's implement a 2D convolution.
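The same sliding-window idea carries over to two dimensions. The sketch below again assumes a stride of 1 and uses plain lists of lists; the function name `conv2d`, the padding argument `p = (rows, cols)`, and the example matrices are all choices made for this note. In 2-D, "rotating" the filter means rotating it by 180 degrees.

```python
def conv2d(X, W, p=(0, 0)):
    """2-D discrete convolution of matrix X with kernel W (lists of lists).

    p = (rows, cols) is the number of zero rows/columns padded on each side.
    """
    # Rotate the kernel by 180 degrees: reverse the row order and each row.
    W_rot = [row[::-1] for row in W[::-1]]
    kh, kw = len(W_rot), len(W_rot[0])

    # Zero-pad the input on all four sides.
    n_cols = len(X[0]) + 2 * p[1]
    zero_row = [0] * n_cols
    X_pad = ([zero_row[:] for _ in range(p[0])]
             + [[0] * p[1] + list(row) + [0] * p[1] for row in X]
             + [zero_row[:] for _ in range(p[0])])

    out = []
    for i in range(len(X_pad) - kh + 1):
        out_row = []
        for j in range(n_cols - kw + 1):
            # Multiply the kernel with the patch under it and sum the products.
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += X_pad[i + a][j + b] * W_rot[a][b]
            out_row.append(total)
        out.append(out_row)
    return out


X = [[1, 3, 2, 4],
     [5, 6, 1, 3],
     [1, 2, 0, 2],
     [3, 4, 3, 2]]
W = [[1, 0, 3],
     [1, 2, 1],
     [0, 1, 1]]
for row in conv2d(X, W, p=(1, 1)):
    print(row)
```

The nested loops make the mechanics explicit but are slow; they are meant for understanding, not for real workloads.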

### Subsampling layers

### Fully-connected layers

A fully-connected layer is one in which every single activation unit in one layer is connected to *ALL* activation units in the following layer. An MLP is an example of multiple fully-connected layers stacked one after another.
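To make "connected to all units" concrete, here is a tiny sketch of one fully-connected layer in plain Python. The layer sizes, weights, and biases are made-up values, the name `fully_connected` is my own, and the activation function is left out to keep the example short.

```python
def fully_connected(inputs, weights, biases):
    """Compute out[j] = sum_i inputs[i] * weights[j][i] + biases[j] for every output unit j."""
    return [
        sum(x_i * w_ji for x_i, w_ji in zip(inputs, unit_weights)) + b_j
        for unit_weights, b_j in zip(weights, biases)
    ]


# Hypothetical layer: 3 input units fully connected to 2 output units.
inputs = [1.0, 2.0, 3.0]
weights = [[0.5, -1.0, 0.0],   # weights into output unit 0
           [1.0, 1.0, 1.0]]    # weights into output unit 1
biases = [0.0, -1.0]
print(fully_connected(inputs, weights, biases))  # [-1.5, 5.0]
```

Every output unit has one weight per input unit, so this layer holds 3 × 2 = 6 weights plus 2 biases.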