# Visual Geometry Group Architecture

From individual operations to blocks of operations

In [1]:
import torch
from torch import nn
!pip install torchinfo



Rather than stacking individual layers, we stack blocks that we repeat $X$ times  
Blocks can contain a different number of convolution layers, but the convolution layers within a block all keep the same *resolution* and *stride*

<center>
    <img src='images/vgg.svg' width=55% style="margin-left:auto; margin-right:auto"/>
    <p style="font-size:14px;">Source: <a href='http://d2l.ai/'>D2L</a></p>
</center>

Each block is composed of convolutional layers padded to maintain the resolution, each followed by a ReLU activation function, and a final max-pooling layer that halves the resolution

In [2]:
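def vgg_block(num_convs, in_channels, out_channels):
    """Stack `num_convs` 3x3 conv + ReLU pairs, then halve the resolution."""
    layers = []
    for _ in range(num_convs):
        # padding=1 with a 3x3 kernel keeps the height and width unchanged
        layers.append(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        # Every convolution after the first maps out_channels to out_channels
        in_channels = out_channels
    # 2x2 max pooling with stride 2 halves the height and width
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)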
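
The resolution bookkeeping of such a block can be checked by hand with the usual output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, kernel size $k$, padding $p$ and stride $s$. The sketch below is plain Python arithmetic only; the helper names `out_size` and `block_resolution` are hypothetical and chosen for illustration.

```python
def out_size(n, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (dilation 1)."""
    return (n + 2 * padding - kernel_size) // stride + 1


def block_resolution(n, num_convs):
    """Resolution after one VGG-style block applied to an n x n input."""
    for _ in range(num_convs):
        # 3x3 convolution with padding=1 and stride=1 keeps the resolution
        n = out_size(n, kernel_size=3, stride=1, padding=1)
    # 2x2 max pooling with stride=2 halves the resolution
    return out_size(n, kernel_size=2, stride=2)


print(block_resolution(224, num_convs=1))  # 112
print(block_resolution(112, num_convs=2))  # 56
```

The number of convolutions does not affect the result: only the pooling layer at the end of the block changes the resolution.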

The original **VGG** architecture, called **VGG-11**, can be represented as follows:

In [3]:
conv_arch_11 = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))  # one (num_convs, out_channels) pair per block

A commonly used deeper version, called **VGG-16**, can be represented as follows:

In [4]:
conv_arch_16 = ((2, 64), (2, 128), (3, 256), (3, 512), (3, 512))  # one (num_convs, out_channels) pair per block
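
The numbers in the names count the layers with learnable weights: the convolutions listed in the architecture tuple plus the three fully connected layers of the classifier. A minimal sketch of that count, where `count_weight_layers` and `NUM_FC_LAYERS` are hypothetical names used only for illustration:

```python
conv_arch_11 = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
conv_arch_16 = ((2, 64), (2, 128), (3, 256), (3, 512), (3, 512))

# Assumption: the classifier has three fully connected (linear) layers
NUM_FC_LAYERS = 3


def count_weight_layers(conv_arch):
    """Number of convolutional layers plus fully connected layers."""
    num_convs = sum(num for num, _ in conv_arch)
    return num_convs + NUM_FC_LAYERS


print(count_weight_layers(conv_arch_11))  # 11
print(count_weight_layers(conv_arch_16))  # 16
```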

Thanks to PyTorch's object-oriented philosophy, it is easy to create blocks and assemble them

In [5]:
def vgg(conv_arch):
    conv_blks = []
    in_channels = 3
    # The convolutional part
    for (num_convs, out_channels) in conv_arch:
        conv_blks.append(vgg_block(num_convs, in_channels, out_channels))
        in_channels = out_channels

    return nn.Sequential(
        *conv_blks, nn.Flatten(),
        # The fully-connected part
        # 7 * 7 is the final feature-map size for ImageNet-sized (224, 224) inputs
        nn.Linear(out_channels * 7 * 7, 4096),
        nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 1000))

net = vgg(conv_arch_11)
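
The `7 * 7` factor in the first linear layer follows from the input size: each of the five blocks halves the resolution, so a 224 x 224 image ends up as a 7 x 7 feature map. The sketch below reproduces that arithmetic in plain Python; `flatten_features` is a hypothetical helper, and it assumes every block ends with a stride-2 pooling layer as in `vgg_block`.

```python
conv_arch_11 = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))


def flatten_features(conv_arch, input_size=224):
    """Return (channels, resolution, flattened size) after the conv blocks."""
    size = input_size
    for _ in conv_arch:
        # Each block ends with 2x2 max pooling (stride 2): resolution is halved
        size //= 2
    out_channels = conv_arch[-1][1]
    return out_channels, size, out_channels * size * size


print(flatten_features(conv_arch_11))                 # (512, 7, 25088)
print(flatten_features(conv_arch_11, input_size=96))  # (512, 3, 4608)
```

With any other input size, the hard-coded `7 * 7` would no longer match the flattened tensor, so the first `nn.Linear` would need to be resized accordingly.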

In [6]:
from torchinfo import summary
summary(net, input_size=(32, 3, 224, 224))

Layer (type:depth-idx)                   Output Shape              Param #
Sequential                               --                        --
├─Sequential: 1-1                        [32, 64, 112, 112]        --
│    └─Conv2d: 2-1                       [32, 64, 224, 224]        1,792
│    └─ReLU: 2-2                         [32, 64, 224, 224]        --
│    └─MaxPool2d: 2-3                    [32, 64, 112, 112]        --
├─Sequential: 1-2                        [32, 128, 56, 56]         --
│    └─Conv2d: 2-4                       [32, 128, 112, 112]       73,856
│    └─ReLU: 2-5                         [32, 128, 112, 112]       --
│    └─MaxPool2d: 2-6                    [32, 128, 56, 56]         --
├─Sequential: 1-3                        [32, 256, 28, 28]         --
│    └─Conv2d: 2-7                       [32, 256, 56, 56]         295,168
│    └─ReLU: 2-8                         [32, 256, 56, 56]         --
│    └─Conv2d: 2-9                       [32, 256, 56, 56]         590,08
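
The `Param #` column can be verified by hand: a convolution has one `kernel_size x kernel_size` filter per (input channel, output channel) pair, plus one bias per output channel. A minimal sketch, with `conv2d_params` as a hypothetical helper name, checked against the first three convolutions listed above:

```python
def conv2d_params(in_channels, out_channels, kernel_size=3):
    """Learnable parameters of a 2D convolution with bias."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    biases = out_channels
    return weights + biases


print(conv2d_params(3, 64))     # 1792   (Conv2d: 2-1)
print(conv2d_params(64, 128))   # 73856  (Conv2d: 2-4)
print(conv2d_params(128, 256))  # 295168 (Conv2d: 2-7)
```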

### In the VGG paper, the authors found that a deeper model with narrow convolution kernels (i.e. 3x3) performed better than a shallower CNN with larger kernels. Why?