# Convolutional Neural Network

### Motivation:

- MNIST dataset: inputs are clean, all 28x28, with the handwriting centered in the image $\implies$ what if an image has a **different size or is distorted**?
- Very hard to make the NN bigger, since there would be too many parameters to train
- **Bad inductive bias**: Flattening the data ignores the geometry of the image (BAD)
- Not flexible: different image sizes require different models

## 1. Convolution Operator (in 2D)

Different types of <mark>**kernels**</mark> are used to **extract a layer of info**
- Low-level info
    - Vertical edge detector
    - Horizontal edge detector, etc.
- High-level info
    - Head? Nose? Ears?

Each step is like a dot product between the kernel and the image patch under it $\implies$ **results in a scalar for each kernel position**

<img src="images/Screenshot 2023-10-24 at 6.50.07 PM.png" width="50%" height="50%">
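The sliding dot product can be sketched in plain Python. This is a minimal illustration, assuming no padding and stride 1; the helper name `conv2d`, the toy image, and the kernel values are made up for the example. Like deep learning frameworks, it does not flip the kernel (strictly speaking, this is cross-correlation).

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the output map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply the kernel element-wise with the patch under it, then sum:
            # one scalar per kernel position.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_h)
                for j in range(k_w)
            )
            row.append(total)
        output.append(row)
    return output


# Toy 4x6 image (assumed for illustration): dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# A hand-crafted vertical edge detector (illustrative values).
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

edges = conv2d(image, vertical_edge)
print(edges)  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The output is non-zero only where the kernel window straddles the dark-to-bright boundary, which is what "detecting a vertical edge" means here.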

These kernels used to be **hand-crafted**, meaning their values are **set**, and then applied to classify images.

HOWEVER, this will not work for distorted images and even small tweaks $\implies$ we need something better

***
## 2. Convolutional Neural Networks

<mark>**CNN learns the kernel values**</mark>

- Weight sharing: the same kernel detects the same local feature anywhere in the image
- Locally connected layers: each unit only looks at local features in a small region of the image

**All parameters are trained end-to-end (updated at the same time)**

<img src="images/Screenshot 2023-10-24 at 7.13.26 PM.png">

- The last convolutional layers extract strong (high-level) features, which are then fed into a fully connected network for classification (as before)

### Terminologies

- <mark>**Zero padding**</mark>: Adding zeros to border of image before convolution (avoid bias towards center of image)
- <mark>**Stride**</mark>: Distance between 2 consecutive positions of the kernel $\implies$ **controls output resolution**

$$\text{Output size} = \lfloor {i + 2p - k \over s} \rfloor + 1$$

- i = image dimension (ixi)
- k = kernel size (kxk)
- p = padding size
- s = stride size

**NOTE**: If input size is m x n and kernel size is k x l: Output size is $O_w$ x $O_h$ where:

$$
\begin{cases}
    O_w = \lfloor {m + 2p - k \over s} \rfloor + 1\\
    O_h = \lfloor {n + 2p - l \over s} \rfloor + 1
\end{cases}
$$
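As a quick check of the output-size formula $\lfloor (i + 2p - k) / s \rfloor + 1$, here is a small sketch. The function name `conv_output_size` and the example sizes are illustrative assumptions, not part of the original notes.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length along one dimension: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


# 28x28 image, 5x5 kernel, no padding, stride 1 -> 24x24
no_padding = conv_output_size(28, 5)

# 28x28 image, 3x3 kernel, padding 1, stride 1 -> 28x28 (resolution preserved)
same_size = conv_output_size(28, 3, p=1)

# 32x32 image, 5x5 kernel, no padding, stride 2 -> 14x14 (resolution reduced)
strided = conv_output_size(32, 5, s=2)

# Non-square case: m x n input with a k x l kernel, one dimension at a time
o_w = conv_output_size(32, 5)  # m = 32, k = 5
o_h = conv_output_size(20, 3)  # n = 20, l = 3

print(no_padding, same_size, strided, (o_w, o_h))  # 24 28 14 (28, 18)
```

Padding grows the output and a larger stride shrinks it, which is why stride is said to control the output resolution.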

### CNN on RGB

An RGB image has **depth = 3** (one channel per color), so each kernel must also have **depth = 3**

$$\text{Kernel depth always = Input depth}$$
$$\text{Each kernel always produces an output of depth 1}$$

ALSO, we might want to include multiple **types of kernels** in each NN layer, extracting different information.

$\begin{align}
&\text{Number of input channels = input depth} \\
&\text{Number of output channels = number of kernels} \\
&\text{Number of trainable weights = number of kernels * kernel depth (= input depth) * kernel height * kernel width (+ 1 bias per kernel)}
\end{align}$

<img src="images/Screenshot 2023-10-25 at 6.40.30 PM.png" height=50% width=50%>
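The weight count can be worked out by hand. The sketch below uses a hypothetical helper `conv_param_count` and assumes one bias per kernel, which is the default for `nn.Conv2d`. The fully connected comparison (a flattened 32x32 RGB image feeding 32 hidden units) is an assumed example.

```python
def conv_param_count(num_kernels, in_depth, k_h, k_w, bias=True):
    """Trainable parameters of one convolutional layer."""
    weights = num_kernels * in_depth * k_h * k_w
    biases = num_kernels if bias else 0
    return weights + biases


# 5 kernels of size 5x5 over an RGB input (depth 3), as in nn.Conv2d(3, 5, 5)
first_layer = conv_param_count(5, 3, 5, 5)    # 5*3*5*5 + 5 = 380

# 10 kernels of size 5x5 over 5 input channels, as in nn.Conv2d(5, 10, 5)
second_layer = conv_param_count(10, 5, 5, 5)  # 10*5*5*5 + 10 = 1260

# For comparison: a fully connected layer on a flattened 32x32x3 image
# with 32 hidden units (weights only).
fully_connected = 32 * 32 * 3 * 32            # 98304

print(first_layer, second_layer, fully_connected)
```

The convolutional counts do not depend on the image resolution at all, whereas the fully connected count grows with every extra pixel.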

***
## 3. Pooling Operator

**Consolidates information**, discarding details that are not useful as we go deeper into the network

### <mark>Max Pooling</mark>

Extracts the max value within a region

$$\text{Output size} = \lfloor {i - k \over s} \rfloor + 1$$

### <mark>Average Pooling</mark>

Extracts the average value within a region (IN PRACTICE: max pooling generally works better)

$$\text{Output size} = \lfloor {i + 2p - k \over s} \rfloor + 1 \text{ (same as normal convolution)}$$
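Both pooling operators can be sketched with one sliding-window helper that differs only in how each window is reduced. This is a minimal illustration, assuming no padding; the names `pool2d` and `mean`, and the 4x4 feature map, are made up for the example.

```python
def pool2d(feature_map, k, s, reduce_fn):
    """Apply `reduce_fn` to every k x k window, moving `s` positions at a time."""
    out_h = (len(feature_map) - k) // s + 1
    out_w = (len(feature_map[0]) - k) // s + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Collect the values inside the current window.
            window = [
                feature_map[r * s + i][c * s + j]
                for i in range(k)
                for j in range(k)
            ]
            row.append(reduce_fn(window))
        output.append(row)
    return output


def mean(values):
    return sum(values) / len(values)


# Assumed 4x4 feature map for illustration.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

max_pooled = pool2d(feature_map, k=2, s=2, reduce_fn=max)
avg_pooled = pool2d(feature_map, k=2, s=2, reduce_fn=mean)

print(max_pooled)  # [[6, 2], [2, 9]]
print(avg_pooled)  # [[3.5, 1.0], [1.0, 6.0]]
```

With a 2x2 window and stride 2, the 4x4 map shrinks to 2x2, matching $\lfloor (4 - 2) / 2 \rfloor + 1 = 2$.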



***
## 4. CNN in PyTorch

### <mark>CNN Architecture Blueprint</mark>

- Filter **depth increases** (more kernels per layer)
    - In the first layers, we only extract simple, low-level features
    - The deeper we go, the more kernels are needed to combine low-level features into high-level features

- Feature map **height and width decrease** (image resolution decreases)
    - At the low level, we only learn simple features that can be extracted from small regions of the image (e.g. a simple edge)
    - Higher-level features require capturing a bigger area of the image (e.g. the entire face, the whole body)
    - So, keeping the kernel size constant, we shrink the feature map so that the same kernel covers a larger portion of the original image

<img src="images/Screenshot 2023-10-24 at 7.43.10 PM.png">

In [None]:
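import torch.nn as nn
import torch.nn.functional as F


class LargeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.name = "Large"
        # 3 input channels (RGB) -> 5 feature maps, 5x5 kernels
        self.conv1 = nn.Conv2d(3, 5, 5)
        # 2x2 max pooling with stride 2 halves height and width
        self.pool = nn.MaxPool2d(2, 2)
        # 5 input channels -> 10 feature maps, 5x5 kernels
        self.conv2 = nn.Conv2d(5, 10, 5)
        # 10 feature maps of 5x5 after the second pool (e.g. for 32x32 inputs)
        self.fc1 = nn.Linear(10 * 5 * 5, 32)
        self.fc2 = nn.Linear(32, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 10 * 5 * 5)  # flatten to (batch, 250)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)  # raw class scores (logits)
        return x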

NOTE: Why can we define the pooling layer once, but use it twice?
- It doesn't depend on input size (pooling is just an arithmetic operation)
- Pooling has NO WEIGHTS TO LEARN
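Both points can be checked with a short shape trace. This sketch rebuilds the same convolution and pooling layers outside the class and assumes a single 32x32 RGB input, since the notes do not state the input size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(3, 5, 5)
conv2 = nn.Conv2d(5, 10, 5)
pool = nn.MaxPool2d(2, 2)  # one pooling module, reused below

# Assumed input: a batch of one 32x32 RGB image.
x = torch.zeros(1, 3, 32, 32)

# The same `pool` object handles feature maps of different depth and resolution.
a = pool(F.relu(conv1(x)))  # 32 -> 28 (conv) -> 14 (pool)
b = pool(F.relu(conv2(a)))  # 14 -> 10 (conv) -> 5 (pool)

# Pooling has no trainable parameters, so there is nothing to share or duplicate.
num_pool_params = sum(p.numel() for p in pool.parameters())

print(tuple(a.shape), tuple(b.shape), num_pool_params)
```

The final shape `(1, 10, 5, 5)` is where the `10 * 5 * 5` input size of `fc1` comes from.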