# CNN Architectures

## 1. LeNet

- Took <mark>**Input Invariance**</mark> into consideration (rotated digits, different stroke thicknesses, etc.)
- Has trouble establishing **relationships between objects**

<img src="images/Screenshot 2023-10-25 at 6.41.47 PM.png">

In [None]:
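import torch
import torch.nn as nn
import torch.nn.functional as F


class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        # Expects 1x32x32 inputs: 32 -> 28 (conv1) -> 14 (pool) -> 10 (conv2) -> 5 (pool)
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(5 * 5 * 16, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)  # 10 output classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 5 * 5 * 16)  # flatten the 16 feature maps of size 5x5
        x = torch.tanh(self.fc1(x))  # torch.tanh replaces the deprecated F.tanh
        x = torch.tanh(self.fc2(x))
        x = self.fc3(x)  # raw logits, no softmax here
        return x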

***
## 2. AlexNet (UofT)

- Larger training dataset (from ImageNet)
- Increase in compute power
- **Larger model size/more layers**

It also applied several tricks to increase accuracy:

- Deeper model (more convolutional layers than LeNet)
- ReLU activation functions instead of sigmoids
- Dropout and <mark>**data augmentation**</mark>: random transformations and noise applied to the training data (crop and resize, flip, color distortion, rotation, etc.)
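
The augmentation idea can be sketched in a few lines of plain PyTorch. This is only an illustration, assuming image tensors of shape `(C, H, W)`; the function name `augment` and its parameters `crop_size` and `noise_std` are hypothetical, and it crops without resizing. In practice a library such as torchvision is commonly used for this.

```python
import torch


def augment(img: torch.Tensor, crop_size: int = 24, noise_std: float = 0.05) -> torch.Tensor:
    """Random crop, random horizontal flip, and Gaussian noise on a (C, H, W) image."""
    _, h, w = img.shape
    # Random crop: pick a top-left corner so the crop fits inside the image.
    top = torch.randint(0, h - crop_size + 1, (1,)).item()
    left = torch.randint(0, w - crop_size + 1, (1,)).item()
    out = img[:, top:top + crop_size, left:left + crop_size]
    # Random horizontal flip with probability 0.5 (flip along the width axis).
    if torch.rand(1).item() < 0.5:
        out = torch.flip(out, dims=[2])
    # Additive Gaussian noise.
    return out + noise_std * torch.randn_like(out)


augmented = augment(torch.rand(3, 32, 32))
```

Each call produces a different variant of the same image, so the model rarely sees exactly the same input twice during training.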

PROBLEM:

- Training deep models often failed due to <mark>**vanishing or exploding gradients**</mark>

<img src="images/Screenshot 2023-10-25 at 6.51.32 PM.png" height=70% width=70%>

***
## 3. GoogLeNet

Goes deeper while being much more parameter-efficient

<img src="images/Screenshot 2023-10-25 at 6.54.34 PM.png" height=70% width=70%>

### **Inception Block**

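- Applies filters of several sizes in parallel on one layer (1x1, 3x3, and 5x5 in the original design), then **concatenates** the results along the channel dimension
- In practice, 3x3 layers can cover most of this (see the VGG finding below)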
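
A minimal sketch of the parallel-branch idea, not the full GoogLeNet module: the class name `InceptionSketch`, the branch kernel sizes (1x1, 3x3, 5x5), and the channel counts are assumptions for illustration, and the pooling branch and 1x1 reductions of the real design are omitted.

```python
import torch
import torch.nn as nn


class InceptionSketch(nn.Module):
    """Parallel convolutions with different kernel sizes, concatenated along channels."""

    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        # Padding is chosen so every branch keeps the input height and width.
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)

    def forward(self, x):
        branches = [torch.relu(b(x)) for b in (self.b1, self.b3, self.b5)]
        # Concatenate on dim=1 (channels): output depth is 3 * branch_ch.
        return torch.cat(branches, dim=1)


block = InceptionSketch(in_ch=16, branch_ch=8)
out = block(torch.randn(2, 16, 28, 28))
```

Because the spatial size is preserved in every branch, the outputs can be stacked channel-wise without any cropping or resizing.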

#### *VGG (Visual Geometry Group) research*

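- Found that 3x3 filters are largely sufficient: stacked 3x3 filters cover the same receptive field as a larger convolution (two layers see 5x5, three see 7x7) with fewer parameters, so they can approximate larger convolutions more efficiently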
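
The efficiency claim can be checked with simple arithmetic. The helper names below (`conv_weights`, `receptive_field`) and the channel count of 64 are assumptions chosen for illustration; biases are ignored and all convolutions are taken to have stride 1 with the same number of input and output channels.

```python
def conv_weights(kernel: int, channels: int, layers: int = 1) -> int:
    """Weight count (no biases) for `layers` stacked kernel x kernel convolutions."""
    return layers * kernel * kernel * channels * channels


def receptive_field(kernel: int, layers: int) -> int:
    """Receptive field of `layers` stacked stride-1 convolutions of one kernel size."""
    return 1 + layers * (kernel - 1)


C = 64
# Two 3x3 layers vs. one 5x5 layer (both see a 5x5 region).
print(receptive_field(3, 2), conv_weights(3, C, layers=2), conv_weights(5, C))
# Three 3x3 layers vs. one 7x7 layer (both see a 7x7 region).
print(receptive_field(3, 3), conv_weights(3, C, layers=3), conv_weights(7, C))
```

The stacked version also inserts a nonlinearity between layers, which a single large kernel does not have.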

<img src="images/Screenshot 2023-10-25 at 6.56.37 PM.png" height=50% width=50%>

### **Pointwise convolution** (3x1x1 kernels)

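A single 1x1 kernel spans the full input depth and collapses it to an output depth of 1, so for a 3-channel input there are **only 3 weights to learn** (plus a bias).

Using fewer 1x1 kernels than there are input channels also reduces the depth that is passed on, so the next layer has fewer parameters.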
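
Both points can be checked directly with `nn.Conv2d`. The channel counts in the second half (256, 64, 128) are arbitrary values picked for illustration.

```python
import torch
import torch.nn as nn

# One 1x1 kernel over a 3-channel input: 3 weights (plus 1 bias), output depth 1.
pointwise = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=1)
y = pointwise(torch.randn(1, 3, 32, 32))
n_weights = pointwise.weight.numel()

# Depth reduction: shrink 256 channels to 64 with 1x1 kernels before a 3x3 conv.
reduce = nn.Conv2d(256, 64, kernel_size=1)
direct_3x3 = nn.Conv2d(256, 128, kernel_size=3, padding=1)
reduced_3x3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)

direct_params = direct_3x3.weight.numel()
reduced_params = reduce.weight.numel() + reduced_3x3.weight.numel()
print(n_weights, direct_params, reduced_params)
```

Even after counting the extra 1x1 layer, the reduced path needs far fewer weights than applying the 3x3 convolution to the full-depth input.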

### **Auxiliary Loss**

- Intermediate classifiers compute a loss at intermediate layers. The total loss combines these intermediate losses with the final loss
- Forces the model to also perform well at intermediate layers
- Helps mitigate the vanishing gradient problem
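
A rough sketch of how such a combined loss might be computed. The function name `combined_loss` and the `aux_weight` of 0.3 are assumptions for illustration; the notes do not specify how the losses are weighted.

```python
import torch
import torch.nn.functional as F


def combined_loss(final_logits, aux_logits_list, targets, aux_weight=0.3):
    """Final cross-entropy plus a down-weighted cross-entropy per auxiliary classifier."""
    loss = F.cross_entropy(final_logits, targets)
    for aux_logits in aux_logits_list:
        loss = loss + aux_weight * F.cross_entropy(aux_logits, targets)
    return loss


# Dummy logits standing in for the outputs of the final and two auxiliary classifiers.
targets = torch.tensor([1, 0, 3])
final_logits = torch.randn(3, 10)
aux_logits = [torch.randn(3, 10), torch.randn(3, 10)]
loss = combined_loss(final_logits, aux_logits, targets)
```

Since the auxiliary classifiers sit closer to the early layers, their loss terms give those layers a shorter gradient path during training. They are typically only needed at training time.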

<img src="images/Screenshot 2023-10-25 at 7.00.12 PM.png" height=50% width=50%>

### **Residual Networks**

Uses **skip connections** to give deeper layers more direct access to signals that would otherwise be lost to vanishing gradients.

In effect, even if the gradient through a block's own layers is close to 0, the skip connection still passes the block's input (and its gradient) through unchanged.
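
A minimal residual block sketch, assuming the input and output have the same number of channels; the class name `ResidualBlock` is illustrative, and batch normalization, which real ResNet blocks usually include, is left out. To mimic the "gradient is 0" case, the example zeroes every weight in the block and checks that signal and gradient still flow through the skip path.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        residual = self.conv2(torch.relu(self.conv1(x)))
        # Skip connection: the input is added back before the final activation.
        return torch.relu(x + residual)


block = ResidualBlock(4)
for p in block.parameters():
    nn.init.zeros_(p)  # the convolutional branch now contributes nothing

x = torch.randn(1, 4, 8, 8, requires_grad=True)
out = block(x)
out.sum().backward()
# out equals relu(x), and x.grad is non-zero wherever x > 0.
```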

<img src="images/Screenshot 2023-10-25 at 7.03.45 PM.png" height=50% width=50%>

***
## 4. ResNets (Residual Networks)

Applying the idea of skip connections:

<img src="images/Screenshot 2023-10-25 at 9.38.33 PM.png" height=70% width=70%>

NOTE:

- Downsamples with stride-2 convolutions instead of max/avg pooling
- Applies **global average pooling** after the last convolutional layer: starting from 512x5x5, each 5x5 feature map is averaged to a single scalar, producing **512 flattened scalar values**
- Only a **single fully-connected classification** layer is needed, since the learned embeddings are already so strong
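
The shapes involved can be verified with a short sketch. The batch size of 8, the 10 output classes, and the 64-to-128 channel change in the stride-2 convolution are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Global average pooling: one scalar per channel.
features = torch.randn(8, 512, 5, 5)           # (batch, channels, H, W)
gap = nn.AdaptiveAvgPool2d(1)                  # averages each 5x5 map down to 1x1
embedding = torch.flatten(gap(features), 1)    # shape (8, 512)

# A single fully-connected layer on top of the embedding.
classifier = nn.Linear(512, 10)
logits = classifier(embedding)                 # shape (8, 10)

# Downsampling with a stride-2 convolution instead of a pooling layer.
down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
halved = down(torch.randn(8, 64, 32, 32))      # shape (8, 128, 16, 16)
```

Global average pooling has no learnable parameters, so the only weights after the encoder are those of the final linear layer.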

*** 
## 5. Transfer Learning

The entire convolutional part of the network (the encoder), which comes before the classification part, produces something called an **embedding**.

This embedding is a learned, succinct representation of the input image. CNNs learn something general about representing images!

It can then be fed into various classification layers for different applications, so we don't have to reinvent the wheel and train the CNN layers again.
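
A sketch of this workflow: freeze the encoder and train only a new classification head. The small `encoder` below is a hypothetical stand-in for a pretrained network (in practice its weights would be loaded from a model trained on a large dataset), and the 5-class task, batch, and learning rate are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained convolutional encoder.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # produces a 32-dimensional embedding
)
for p in encoder.parameters():
    p.requires_grad = False                  # freeze: the encoder is not retrained

head = nn.Linear(32, 5)                      # new classifier for an assumed 5-class task
model = nn.Sequential(encoder, head)
optimizer = torch.optim.SGD(head.parameters(), lr=0.01)

# One training step on a dummy batch: only the head receives gradients.
x = torch.randn(4, 3, 64, 64)
y = torch.tensor([0, 1, 2, 3])
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

Swapping in a different `head` reuses the same embedding for another task without touching the encoder weights.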

<img src="images/Screenshot 2023-10-25 at 9.46.09 PM.png">