## Modern CNN

### 1. AlexNet (2012)

<img src="img/lnetalex.png" width=500 height=500 />

#### Architecture

- In AlexNet’s first layer, the convolution window shape is 11 × 11. (Objects in ImageNet images occupy many more pixels than in MNIST, so a larger convolution window is needed to capture them.)

- The convolution window shape is reduced to 5 × 5 in the second layer, and to 3 × 3 in the third, fourth, and fifth layers.

- After the first, second, and fifth convolutional layers, the network adds max-pooling layers with a window shape of 3 × 3 and a stride of 2.

- AlexNet has ten times more convolution channels than LeNet.

- After the final convolutional layer, there are two huge fully connected layers, each with 4096 outputs.
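
The window shapes and strides listed above determine the spatial size of every feature map. The following is a minimal sketch of that arithmetic in plain Python, assuming a 224 × 224 input and the stride and padding values used by torchvision's AlexNet (stride 4 and padding 2 for the first layer, padding 2 and 1 for the later ones); those values are not stated in the notes above, and `out_size` is a hypothetical helper name.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer (floor convention)."""
    return (n + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding)
# Assumption: stride/padding values follow torchvision's AlexNet.
layers = [
    ("conv1 11x11", 11, 4, 2),
    ("pool1 3x3", 3, 2, 0),
    ("conv2 5x5", 5, 1, 2),
    ("pool2 3x3", 3, 2, 0),
    ("conv3 3x3", 3, 1, 1),
    ("conv4 3x3", 3, 1, 1),
    ("conv5 3x3", 3, 1, 1),
    ("pool5 3x3", 3, 2, 0),
]

size = 224  # assumed input height and width
sizes = []
for name, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{name:12s} -> {size} x {size}")
```

Under these assumptions the sizes run 55, 27, 27, 13, 13, 13, 13, 6, so the last pooling layer hands a 6 × 6 map to the fully connected layers.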

#### Activation Functions

- AlexNet replaced the sigmoid activation function with the simpler ReLU activation function.

    - The computation of the ReLU activation function is simpler.
    - The ReLU activation function makes model training easier under different parameter initialization methods.

        - When the output of the sigmoid activation function is very close to 0 or 1, the gradient in these regions is almost 0, so backpropagation cannot continue to update some of the model parameters.

        - By contrast, the gradient of the ReLU activation function in the positive interval is always 1.

    - Local response normalization is applied:
        - to bound the output of the ReLU layer
        - to encourage lateral inhibition, wherein a neuron can suppress the activity of its neighbors
        - Unlike batch normalization, it contains no trainable parameters.
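
The saturation argument above can be checked numerically. This is a small sketch in plain Python, not taken from the AlexNet code: it evaluates the analytic derivative of the sigmoid, s(x)(1 - s(x)), next to the derivative of ReLU at a few illustrative inputs. The function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # ReLU has slope 1 for positive inputs and 0 otherwise
    return 1.0 if x > 0 else 0.0

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

The sigmoid gradient peaks at 0.25 for x = 0 and shrinks toward 0 as the output approaches 1, while the ReLU gradient stays at 1 for every positive input.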

#### Capacity Control and Preprocessing

- AlexNet controls the model complexity of the fully connected layers with dropout, while LeNet only uses weight decay.

- The AlexNet training loop also applied a great deal of image augmentation, such as flipping, cropping, and color changes. This makes the model more robust, and the larger effective sample size reduces overfitting.

In [40]:
import torch
import torch.nn as nn
from torchvision import models
from torchsummary import summary
import torchinfo

In [41]:
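# `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13);
# on older versions, use models.alexnet(pretrained=True) instead.
alex_model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
torchinfo.summary(alex_model, (3, 224, 224), batch_dim=0)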

Layer (type:depth-idx)                   Output Shape              Param #
AlexNet                                  [1, 1000]                 --
├─Sequential: 1-1                        [1, 256, 6, 6]            --
│    └─Conv2d: 2-1                       [1, 64, 55, 55]           23,296
│    └─ReLU: 2-2                         [1, 64, 55, 55]           --
│    └─MaxPool2d: 2-3                    [1, 64, 27, 27]           --
│    └─Conv2d: 2-4                       [1, 192, 27, 27]          307,392
│    └─ReLU: 2-5                         [1, 192, 27, 27]          --
│    └─MaxPool2d: 2-6                    [1, 192, 13, 13]          --
│    └─Conv2d: 2-7                       [1, 384, 13, 13]          663,936
│    └─ReLU: 2-8                         [1, 384, 13, 13]          --
│    └─Conv2d: 2-9                       [1, 256, 13, 13]          884,992
│    └─ReLU: 2-10                        [1, 256, 13, 13]          --
│    └─Conv2d: 2-11                      [1, 256, 13, 13]         

### 2. Networks Using Blocks (VGG) (2014)

The basic building block of CNNs is a sequence of the following:

- (i) a convolutional layer with padding to maintain the resolution,
- (ii) a nonlinearity such as a ReLU,
- (iii) a pooling layer such as max-pooling to reduce the resolution.

One of the problems with this approach is that **the spatial resolution decreases quite rapidly**. 

In particular, because each pooling step halves the resolution, this imposes a hard limit of $\log_2 d$ convolutional layers on the network before all dimensions ($d$) are used up.

>> For instance, in the case of ImageNet, it would be impossible to have more than 8 convolutional layers in this way.
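
The bound can be worked out directly. The sketch below assumes a 224 × 224 input (the input size used for the model summaries in these notes) and floor division at each halving; `max_halvings` is a hypothetical helper.

```python
import math

def max_halvings(d):
    """Count how many times d can be halved (with floor) before reaching 1."""
    count = 0
    while d > 1:
        d //= 2
        count += 1
    return count

d = 224  # assumed input height and width
print(f"log2({d}) = {math.log2(d):.3f}")
print(f"halvings until 1 x 1: {max_halvings(d)}")
```

With these assumptions, $\log_2 224$ lies between 7 and 8 and the resolution reaches 1 × 1 after 7 halvings, consistent with the limit of 8 layers quoted above.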

**The key idea of VGG** was to use multiple convolutions between downsampling steps (max-pooling), grouped in the form of a block.

**A VGG block** consists of a sequence of convolutions with 3 × 3 kernels and padding of 1 (keeping height and width), followed by a 2 × 2 max-pooling layer with a stride of 2 (halving height and width after each block).
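
A block of this form can be sketched in a few lines of PyTorch. This is a minimal illustration under the description above, not torchvision's implementation; `vgg_block`, `num_convs`, and `out_channels` are names chosen here, and `nn.LazyConv2d` is used so the input channel count is inferred on the first forward pass.

```python
import torch
import torch.nn as nn

def vgg_block(num_convs, out_channels):
    """num_convs 3x3 convolutions (padding 1) with ReLU, then 2x2 max-pooling."""
    layers = []
    for _ in range(num_convs):
        # padding=1 keeps height and width unchanged for a 3x3 kernel
        layers.append(nn.LazyConv2d(out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    # stride-2 pooling halves height and width
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

block = vgg_block(num_convs=2, out_channels=64)
x = torch.zeros(1, 3, 224, 224)  # assumed input size
y = block(x)
print(y.shape)
```

Two convolutions leave the 224 × 224 resolution untouched, and the pooling layer then halves it to 112 × 112.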

#### VGG Network

The VGG network can be partitioned into two parts:

- the first consisting mostly of convolutional and pooling layers

- and the second consisting of fully connected layers that are identical to those in AlexNet


**The key difference** from AlexNet is that the convolutional layers are grouped into nonlinear transformations that leave the dimensionality unchanged, followed by a resolution-reduction step.

<img src="img/vgg.png" width=500 height=500 />

In [36]:
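# `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13);
# on older versions, use models.vgg16(pretrained=True) instead.
vgg16_model = models.vgg16(weights=models.VGG16_Weights.DEFAULT)

torchinfo.summary(vgg16_model, (3, 224, 224), batch_dim=0)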

Layer (type:depth-idx)                   Output Shape              Param #
VGG                                      [1, 1000]                 --
├─Sequential: 1-1                        [1, 512, 7, 7]            --
│    └─Conv2d: 2-1                       [1, 64, 224, 224]         1,792
│    └─ReLU: 2-2                         [1, 64, 224, 224]         --
│    └─Conv2d: 2-3                       [1, 64, 224, 224]         36,928
│    └─ReLU: 2-4                         [1, 64, 224, 224]         --
│    └─MaxPool2d: 2-5                    [1, 64, 112, 112]         --
│    └─Conv2d: 2-6                       [1, 128, 112, 112]        73,856
│    └─ReLU: 2-7                         [1, 128, 112, 112]        --
│    └─Conv2d: 2-8                       [1, 128, 112, 112]        147,584
│    └─ReLU: 2-9                         [1, 128, 112, 112]        --
│    └─MaxPool2d: 2-10                   [1, 128, 56, 56]          --
│    └─Conv2d: 2-11                      [1, 256, 56, 56]          29
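
The parameter counts in the first rows of the summary above follow from the layer shapes: a k × k convolution with c_in input channels and c_out output channels has (k·k·c_in + 1)·c_out parameters, where the +1 is the bias. A short check in plain Python (`conv_params` is a hypothetical helper):

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return (k * k * c_in + 1) * c_out

# (input channels, output channels) of the first four 3x3 convolutions in VGG-16
vgg_first_convs = [(3, 64), (64, 64), (64, 128), (128, 128)]
counts = [conv_params(c_in, c_out, 3) for c_in, c_out in vgg_first_convs]

for (c_in, c_out), n in zip(vgg_first_convs, counts):
    print(f"Conv2d {c_in:3d} -> {c_out:3d}: {n:,} parameters")
```

These values match the 1,792, 36,928, 73,856, and 147,584 entries reported by `torchinfo`.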

### 3. Network in Network (NiN) (2013)

LeNet, AlexNet, and VGG all share a common design pattern: 

**extract features exploiting spatial structure via a sequence of convolutions and pooling layers and post-process the representations via fully connected layers**

##### This design poses two major challenges. 

- First, the fully connected layers at the end of the architecture consume tremendous numbers of parameters, which makes such networks poorly suited to mobile and embedded devices.

- Second, it is equally impossible to add fully connected layers earlier in the network to increase the degree of nonlinearity: doing so would destroy the spatial structure and require potentially even more memory.
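
To put a number on the first challenge, the sketch below counts the parameters of AlexNet's fully connected layers and compares them with the convolutional counts listed in the AlexNet summary earlier in these notes. It assumes the head is 256·6·6 → 4096 → 4096 → 1000, which is read off the shapes in that summary. Only the four convolution counts that appear there are used, since the fifth is cut off. `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Assumed head: flattened 256 x 6 x 6 features -> 4096 -> 4096 -> 1000 classes
fc_shapes = [(256 * 6 * 6, 4096), (4096, 4096), (4096, 1000)]
fc_counts = [linear_params(n_in, n_out) for n_in, n_out in fc_shapes]
fc_total = sum(fc_counts)

# Convolution counts copied from the AlexNet summary (first four layers only)
conv_counts = [23_296, 307_392, 663_936, 884_992]
conv_total = sum(conv_counts)

print(f"fully connected: {fc_total:,}")
print(f"first four convolutions: {conv_total:,}")
```

Under these assumptions the fully connected head alone holds roughly 58.6 million parameters, against under 2 million for the four listed convolutions.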

**The network in network (NiN) blocks** offer an alternative, capable of solving both problems in one simple strategy.

They were proposed based on a very simple insight: 

- (i) use 1 × 1 convolutions to add local nonlinearities across the channel activations and
  
- (ii) use global average pooling to integrate across all locations in the last representation layer.


The idea behind NiN is to apply a fully connected layer at each pixel location (for each height and width). The resulting 1×1 convolution can be thought of as a fully connected layer acting independently on each pixel location.
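
The equivalence between a 1 × 1 convolution and a per-pixel fully connected layer can be verified numerically. The following sketch copies the weights of an `nn.Conv2d` with `kernel_size=1` into an `nn.Linear` and applies the linear layer to the channel vector at each pixel; the channel counts and input size are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
c_in, c_out = 3, 5  # illustrative channel counts

conv = nn.Conv2d(c_in, c_out, kernel_size=1)
fc = nn.Linear(c_in, c_out)

# Share parameters: the conv weight has shape (c_out, c_in, 1, 1)
with torch.no_grad():
    fc.weight.copy_(conv.weight.view(c_out, c_in))
    fc.bias.copy_(conv.bias)

x = torch.randn(2, c_in, 4, 4)

y_conv = conv(x)
# Move channels last so nn.Linear acts on each pixel's channel vector, then move back
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

same = torch.allclose(y_conv, y_fc, atol=1e-6)
print(same)
```

Because the same weights are applied at every location, the spatial structure is preserved, unlike flattening the whole feature map into one fully connected layer.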

<img src="img/nin.png" width=500 height=500 />

In [42]:
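def nin_block(out_channels, kernel_size, strides, padding):
    """One spatial convolution followed by two 1x1 convolutions, each with ReLU."""
    return nn.Sequential(
        # Spatial convolution; LazyConv2d infers the input channels on first use
        nn.LazyConv2d(out_channels, kernel_size, strides, padding),
        nn.ReLU(),
        # 1x1 convolutions act as a per-pixel fully connected layer
        nn.LazyConv2d(out_channels, kernel_size=1),
        nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1),
        nn.ReLU())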

In [48]:
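class NiN(nn.Module):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.lr = lr  # learning rate, stored for a training loop; not used in forward()

        self.net = nn.Sequential(
            nin_block(96, kernel_size=11, strides=4, padding=0),
            nn.MaxPool2d(3, stride=2),
            nin_block(256, kernel_size=5, strides=1, padding=2),
            nn.MaxPool2d(3, stride=2),
            nin_block(384, kernel_size=3, strides=1, padding=1),
            nn.MaxPool2d(3, stride=2),
            nn.Dropout(0.5),
            # The last block outputs one channel per class
            nin_block(num_classes, kernel_size=3, strides=1, padding=1),
            # Global average pooling replaces the fully connected head
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten())

    def forward(self, x):
        # Returns class scores of shape (batch, num_classes)
        return self.net(x)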

### 4. Multi-Branch Networks (GoogLeNet) (2015)

- GoogLeNet uses a structure that combines the strengths of NiN, repeated blocks, and a cocktail of convolution kernels.

- It was arguably the first network that exhibited a clear distinction among the stem (data ingest), body (data processing), and head (prediction) in a CNN.

- **The key contribution in GoogLeNet** was the design of the network body.

    - It solved the problem of selecting convolution kernels (1 × 1 to 11 × 11) in an ingenious way: it simply concatenated multi-branch convolutions.

#### Inception Blocks

<img src="img/inception.png" width=500 height=500 />

- The first three branches use convolutional layers with window sizes of 1 × 1, 3 × 3, and 5 × 5 to extract information from different spatial sizes.
- The middle two branches first apply a 1 × 1 convolution to the input **to reduce the number of channels**, reducing the model’s complexity.
- The fourth branch uses a 3 × 3 max-pooling layer, followed by a 1 × 1 convolutional layer to change the number of channels.
- The four branches all use appropriate padding to give the input and output the same height and width.
- The outputs of the branches are concatenated along the channel dimension and comprise the block’s output.
- The commonly tuned hyperparameters of the Inception block are the number of output channels per layer, i.e., how to allocate capacity among convolutions of different sizes.
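
The four branches described above can be sketched as a small PyTorch module. This is an illustrative version that follows the list, not torchvision's implementation (which also applies batch normalization). The argument names `c1` to `c4` and the channel numbers in the usage lines are assumptions; the numbers were picked so the output has 256 channels on a 192-channel, 28 × 28 input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Inception(nn.Module):
    # c1..c4 are the output channels of the four branches (hypothetical names)
    def __init__(self, c1, c2, c3, c4):
        super().__init__()
        # Branch 1: 1x1 convolution
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        # Branch 2: 1x1 convolution (channel reduction), then 3x3 convolution
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        # Branch 3: 1x1 convolution (channel reduction), then 5x5 convolution
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        # Branch 4: 3x3 max-pooling, then 1x1 convolution
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        # Padding keeps height and width equal, so the branches concatenate on channels
        return torch.cat((b1, b2, b3, b4), dim=1)

block = Inception(64, (96, 128), (16, 32), 32)
out = block(torch.zeros(1, 192, 28, 28))
print(out.shape)
```

The output channel count is the sum of the branch outputs, here 64 + 128 + 32 + 32 = 256, while height and width stay at 28.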

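**To gain some intuition for why this network works so well, consider the combination of the filters.** They explore the image at a variety of filter sizes, so details at different extents can be recognized efficiently by filters of matching size.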

#### GoogLeNet Model

<img src="img/googlelenet.png" width=500 height=500 />

In [56]:
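# `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13);
# on older versions, use models.googlenet(pretrained=True) instead.
googlenet = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT)

torchinfo.summary(googlenet, (3, 224, 224), batch_dim=0)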

Layer (type:depth-idx)                   Output Shape              Param #
GoogLeNet                                [1, 1000]                 --
├─BasicConv2d: 1-1                       [1, 64, 112, 112]         --
│    └─Conv2d: 2-1                       [1, 64, 112, 112]         9,408
│    └─BatchNorm2d: 2-2                  [1, 64, 112, 112]         128
├─MaxPool2d: 1-2                         [1, 64, 56, 56]           --
├─BasicConv2d: 1-3                       [1, 64, 56, 56]           --
│    └─Conv2d: 2-3                       [1, 64, 56, 56]           4,096
│    └─BatchNorm2d: 2-4                  [1, 64, 56, 56]           128
├─BasicConv2d: 1-4                       [1, 192, 56, 56]          --
│    └─Conv2d: 2-5                       [1, 192, 56, 56]          110,592
│    └─BatchNorm2d: 2-6                  [1, 192, 56, 56]          384
├─MaxPool2d: 1-5                         [1, 192, 28, 28]          --
├─Inception: 1-6                         [1, 256, 28, 28]          --
│