## Implementing VGG from scratch

The VGG architecture was one of the key milestones in the development of CNN classifiers. Its main contribution was a thorough evaluation of increasing network depth using an architecture with very small (3×3) convolution filters. VGG pushed the depth to 19 weight layers and improved the state of the art on the ImageNet challenge.
The details can be found in the paper: https://arxiv.org/pdf/1409.1556.pdf

Compared to prior work, the VGG architectures made two major changes.

Prior architectures often used relatively large kernels (7x7, 11x11) in the first conv layers. VGG instead used very small 3×3 filters throughout the whole net.

Note that a stack of three 3x3 conv layers has the same receptive field as a single 7x7 filter. So what does replacing the 7x7 filter with three smaller filters buy?
1. With three smaller filters, there is more non-linearity, since a ReLU is applied after every filter.
2. The number of parameters is reduced from $49C^2$ ($7^2 C^2$) to $27C^2$ ($3 \cdot (3^2 C^2)$), assuming $C$ input and output channels per layer. Fewer parameters mean less to learn, and the paper argues that the reduction acts as a form of regularization, making the network less prone to over-fitting.
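
The two counts above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes bias-free, stride-1 convolutions with $C$ input and $C$ output channels, and the helper names `conv_weights` and `stacked_receptive_field` are hypothetical, not part of the original notebook.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a bias-free 2D convolution."""
    return kernel_size * kernel_size * in_channels * out_channels


def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


C = 64  # example channel count, chosen only for illustration

single_7x7 = conv_weights(7, C, C)       # 49 * C^2
three_3x3 = 3 * conv_weights(3, C, C)    # 27 * C^2

print(stacked_receptive_field(3, 3))     # 7, same as one 7x7 filter
print(single_7x7, three_3x3)             # 200704 110592
```

For any $C$, the ratio is $27/49 \approx 0.55$, so the stack of three 3x3 layers uses roughly 45% fewer weights than the single 7x7 layer.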

Additionally, prior architectures relied on a normalization layer, Local Response Normalization (LRN). The authors showed that LRN layers did not improve performance on ImageNet, so these layers were dropped.


In this notebook, we will look at how to implement a VGG-11 network from scratch. In practice, this is seldom done, since `torchvision.models` already provides ready-made implementations of all the VGG architectures. However, by building the network from scratch, we will gain a deeper understanding of the architecture.

Regardless of the specific architecture, all VGG networks follow a common structure. The commonalities are listed below:
1. All architectures work on (224, 224) sized input images.
2. All architectures have 5 Conv Blocks. 
    1. Each block can have multiple convolutional layers followed by a MaxPool layer.
    2. All the individual convolutional layers inside a block use 3x3 kernels with a padding of 1.
    3. The individual convolutional layers do not change the spatial resolution of the feature map.
    4. All the individual convolutional layers within a block have the same number of output features.
    5. Each convolutional block has a MaxPool layer at the end, which halves the spatial resolution of the feature map.
3. Since each block downsamples the spatial size by 2, the (224, 224) input image is reduced by a factor of $2^5 = 32$ by the end, i.e. to (7, 7). Additionally, the number of features doubles from block to block (64, 128, 256, 512) until it reaches 512, where it stays for the last block.
4. All architectures have a classifier which comprises 3 Fully Connected (FC) layers.
    1. The first FC takes a 512*7*7 input, converts it to a 4096 dimensional output
    2. The second FC takes the resulting 4096 dimensional output and converts it to another 4096 dimensional output
    3. The final FC layer converts the 4096 dimensional input to a 1000 dimensional output. 1000 being the number of classes for ImageNet.
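
The shape bookkeeping in the list above can be traced without building any network. This is a minimal sketch in plain Python; the helper name `trace_feature_map` is hypothetical, and it models only the halving done by the MaxPool at the end of each block, since the 3x3 convolutions with padding 1 leave the spatial size unchanged.

```python
def trace_feature_map(input_size, block_features):
    """Return (channels, height, width) after each conv block.

    The 3x3/padding-1 convs keep the spatial size, so only the 2x2
    MaxPool at the end of each block changes it (it halves it).
    """
    size = input_size
    shapes = []
    for num_features in block_features:
        size = size // 2  # effect of the MaxPool closing the block
        shapes.append((num_features, size, size))
    return shapes


# Output features of the 5 conv blocks
shapes = trace_feature_map(224, [64, 128, 256, 512, 512])
for shape in shapes:
    print(shape)

channels, height, width = shapes[-1]
print(channels * height * width)  # 25088, the input size of the first FC layer
```

The last shape is (512, 7, 7), which is where the 512*7*7 input dimension of the first FC layer comes from.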

Let us first implement the convolutional block. As mentioned earlier, each block can have multiple convolutional layers (depending on the architecture), followed by a MaxPool layer.

In [1]:
import torch
import torchvision

from torch import nn
from torchsummary import summary

In [2]:
class ConvBlock(nn.Module):
    def __init__(self, in_channels, num_conv_layers, num_features):
        super(ConvBlock, self).__init__()
        modules = []
        for i in range(num_conv_layers):
            modules.extend([
                nn.Conv2d(in_channels, num_features, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)
            ])
            in_channels = num_features
        modules.append(nn.MaxPool2d(kernel_size=2))
        self.conv_block = nn.Sequential(*modules)
    
    def forward(self, x):
        return self.conv_block(x)

In [3]:
conv_1 = ConvBlock(in_channels=3, num_conv_layers=1, num_features=64)

# Let us simulate a forward pass using a dummy input tensor
x = torch.rand([1, 3, 224, 224])

conv_1_out = conv_1(x)

assert conv_1_out.shape == torch.Size([1, 64, 112, 112])

Let us now implement the conv backbone builder. It takes a list of configurations, one for each of the individual convolutional blocks.

The config is a list of size 5 (one entry per Conv block). Each element is a 3-tuple of the form `(in_channels, num_conv_layers, num_features)` describing that Conv block.

In [4]:
class ConvBackbone(nn.Module):
    def __init__(self, cfg):
        super(ConvBackbone, self).__init__()
        
        self.cfg = cfg
        self.validate_config(cfg)
        
        modules = []
        for block_cfg in cfg:
            in_channels, num_conv_layers, num_features = block_cfg
            modules.append(ConvBlock(in_channels, num_conv_layers, num_features))
        self.features = nn.Sequential(*modules)
            
    def validate_config(self, cfg):
        assert len(cfg) == 5 # 5 Conv blocks
        for i, block_cfg in enumerate(cfg):
            assert isinstance(block_cfg, tuple) and len(block_cfg) == 3
            if i == 0:
                assert block_cfg[0] == 3 #Input channels always has to be 3
            else:
                assert block_cfg[0] == cfg[i-1][-1] #num_features of previous block is the input num features to this block
        
    def forward(self, x):
        return self.features(x)

In [5]:
# Let us look at the VGG 11 backbone configuration
vgg11_cfg = [
    (3, 1, 64),
    (64, 1, 128),
    (128, 2, 256),
    (256, 2, 512),
    (512, 2, 512)
]

vgg11_backbone = ConvBackbone(vgg11_cfg)
# Let us simulate a forward pass using a dummy input tensor
x = torch.rand([1, 3, 224, 224])

vgg11_conv_out = vgg11_backbone(x)

assert vgg11_conv_out.shape == torch.Size([1, 512, 7, 7])
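
The same config format can describe the deeper VGG variants, since they differ only in the number of conv layers per block. The sketch below is an illustration under assumptions: `make_cfg`, `vgg_cfgs`, and `num_weight_layers` are hypothetical names, not part of the original notebook, and the per-block layer counts are the ones for the 11, 13, 16, and 19 layer configurations in the paper.

```python
def make_cfg(layers_per_block):
    """Build a list of (in_channels, num_conv_layers, num_features) tuples."""
    features = [64, 128, 256, 512, 512]
    in_channels = [3] + features[:-1]  # each block consumes the previous block's output
    return list(zip(in_channels, layers_per_block, features))


vgg_cfgs = {
    "vgg11": make_cfg([1, 1, 2, 2, 2]),
    "vgg13": make_cfg([2, 2, 2, 2, 2]),
    "vgg16": make_cfg([2, 2, 3, 3, 3]),
    "vgg19": make_cfg([2, 2, 4, 4, 4]),
}


def num_weight_layers(cfg, num_fc_layers=3):
    """Conv layers across all blocks plus the FC layers of the classifier."""
    return sum(num_conv_layers for _, num_conv_layers, _ in cfg) + num_fc_layers


for name, cfg in vgg_cfgs.items():
    print(name, num_weight_layers(cfg))
```

The number in each model name is the count of weight layers (conv plus FC), which the loop reproduces. Passing, say, `vgg_cfgs["vgg16"]` to `ConvBackbone` should then build the corresponding backbone.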

Now let us implement the VGG module. As mentioned previously, the VGG module has 2 key components:
1. The Conv backbone, comprising 5 convolutional blocks
2. The classifier, comprising Fully Connected layers

In [6]:
class VGG(nn.Module):
    def __init__(self, conv_backbone, num_classes):
        super(VGG, self).__init__()
        self.conv_backbone = conv_backbone
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes)
        )
    
    def forward(self, x):
        conv_features = self.conv_backbone(x)
        # We need to flatten the conv features before passing it to the classifier
        logits = self.classifier(conv_features.view(conv_features.shape[0], -1)) 
        return logits

In [7]:
# We are now ready to create our VGG11. Let's say we want to use a VGG-11 classifier for a 1000 class problem
num_classes = 1000
vgg11 = VGG(vgg11_backbone, num_classes)

x = torch.rand([1, 3, 224, 224])
logits = vgg11(x)
assert logits.shape == torch.Size([1, num_classes])

In [8]:
vgg11

VGG(
  (conv_backbone): ConvBackbone(
    (features): Sequential(
      (0): ConvBlock(
        (conv_block): Sequential(
          (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): ReLU(inplace=True)
          (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        )
      )
      (1): ConvBlock(
        (conv_block): Sequential(
          (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): ReLU(inplace=True)
          (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        )
      )
      (2): ConvBlock(
        (conv_block): Sequential(
          (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): ReLU(inplace=True)
          (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (3): ReLU(inplace=True)
          (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=Fal

In [9]:
# We can now take a look at the summary to visualize the output shape, number of parameters and the layers
summary(vgg11, input_size=(3, 224, 224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 224, 224]           1,792
              ReLU-2         [-1, 64, 224, 224]               0
         MaxPool2d-3         [-1, 64, 112, 112]               0
         ConvBlock-4         [-1, 64, 112, 112]               0
            Conv2d-5        [-1, 128, 112, 112]          73,856
              ReLU-6        [-1, 128, 112, 112]               0
         MaxPool2d-7          [-1, 128, 56, 56]               0
         ConvBlock-8          [-1, 128, 56, 56]               0
            Conv2d-9          [-1, 256, 56, 56]         295,168
             ReLU-10          [-1, 256, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]         590,080
             ReLU-12          [-1, 256, 56, 56]               0
        MaxPool2d-13          [-1, 256, 28, 28]               0
        ConvBlock-14          [-1, 256,

In [10]:
# As an elementary check, let us compare the number of parameters between our implementation
# and the official torchvision implementation and assert that they are equal

num_vgg_params = sum(p.numel() for p in vgg11.parameters() if p.requires_grad)

torch_vgg11 = torchvision.models.vgg11()
num_torch_vgg_params = sum(p.numel() for p in torch_vgg11.parameters() if p.requires_grad)

assert num_vgg_params == num_torch_vgg_params
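
The parameter count can also be derived by hand from the config, which is a useful cross-check on the comparison above. This is a sketch in plain Python; `conv_params` and `linear_params` are hypothetical helpers that count weights plus biases, and the config is repeated here only to keep the snippet self-contained.

```python
def conv_params(in_channels, out_channels, kernel_size=3):
    """Weights plus biases of a 2D convolution."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Same (in_channels, num_conv_layers, num_features) config as the VGG-11 backbone
vgg11_cfg = [(3, 1, 64), (64, 1, 128), (128, 2, 256), (256, 2, 512), (512, 2, 512)]

conv_total = 0
for in_channels, num_conv_layers, num_features in vgg11_cfg:
    for _ in range(num_conv_layers):
        conv_total += conv_params(in_channels, num_features)
        in_channels = num_features  # later convs in a block keep the channel count

fc_total = (
    linear_params(512 * 7 * 7, 4096)
    + linear_params(4096, 4096)
    + linear_params(4096, 1000)
)

print(conv_total)             # 9220480
print(fc_total)               # 123642856
print(conv_total + fc_total)  # 132863336
```

Note that the three FC layers hold the vast majority of the parameters, with the first FC layer alone accounting for over 100 million of them.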

And voila! We have our own bare-bones implementation of VGG-11. Several refinements are possible: we could add BatchNorm between the Conv and ReLU layers, or apply average pooling (for example, adaptive average pooling to a fixed (7, 7) output) to the output of ConvBackbone to handle variable-sized images. The purpose here is to understand and get a sense of the broad architecture.
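
The two refinements mentioned above could look roughly like the following. This is a sketch under assumptions, not part of the original implementation: `ConvBlockBN` is a hypothetical variant of `ConvBlock` with `nn.BatchNorm2d` placed between each conv and its ReLU, and `nn.AdaptiveAvgPool2d((7, 7))` is used to force a fixed (7, 7) feature map regardless of the input resolution.

```python
import torch
from torch import nn


class ConvBlockBN(nn.Module):
    """Hypothetical ConvBlock variant with BatchNorm after each conv."""

    def __init__(self, in_channels, num_conv_layers, num_features):
        super().__init__()
        modules = []
        for _ in range(num_conv_layers):
            modules.extend([
                nn.Conv2d(in_channels, num_features, kernel_size=3, padding=1),
                nn.BatchNorm2d(num_features),
                nn.ReLU(inplace=True),
            ])
            in_channels = num_features
        modules.append(nn.MaxPool2d(kernel_size=2))
        self.conv_block = nn.Sequential(*modules)

    def forward(self, x):
        return self.conv_block(x)


block = ConvBlockBN(in_channels=3, num_conv_layers=1, num_features=64)
pool = nn.AdaptiveAvgPool2d((7, 7))  # output is always 7x7 spatially

block.eval()  # use BatchNorm running statistics for this dummy pass
with torch.no_grad():
    x = torch.rand([1, 3, 256, 320])  # deliberately not 224x224
    features = block(x)
    pooled = pool(features)

print(features.shape)  # torch.Size([1, 64, 128, 160])
print(pooled.shape)    # torch.Size([1, 64, 7, 7])
```

With the adaptive pooling layer placed between the backbone and the classifier, the flattened feature size stays at 512*7*7 for the full network even when the input is not (224, 224).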