# Networks Using Blocks (VGG)

In VLSI (very large scale integration) chip design, designers moved from placing individual transistors, to logical elements, to logic blocks. Similarly, neural network design has become ever more abstract. ML researchers may now even use entire pretrained models, called _foundation models_, as the basis of their engineering. The idea of using blocks like this came from the Visual Geometry Group at the University of Oxford, and is the origin of the VGG network.

In [3]:
import torch
from torch import nn
from d2l import torch as d2l

## VGG Blocks

The basic building block of CNNs is a sequence of a convolutional layer with padding, a nonlinearity such as ReLU, and a pooling layer. One problem with this approach is that the spatial resolution of the images decreases rapidly: every pooling layer halves the height and width, so the input size limits how many such blocks can be stacked.
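
To make the resolution limit concrete, here is a minimal sketch in plain Python. The helper name `count_halvings` is hypothetical (it is not part of d2l or PyTorch), and the sketch assumes that every stage maps a side length `n` to `n // 2`, as a 2x2 max pool with stride 2 does.

```python
def count_halvings(size):
    """Count how many 2x2, stride-2 pooling stages fit before the side length hits 1."""
    stages = 0
    while size > 1:
        size //= 2  # a 2x2 pool with stride 2 maps n to floor(n / 2)
        stages += 1
    return stages


# Side lengths of a native Fashion-MNIST image and of a 224x224 input
for side in (28, 224):
    print(side, count_halvings(side))
```

Under these assumptions, even a 224x224 input runs out of resolution after a handful of conv-pool stages, which is why adding depth requires more than one convolution per pooling step.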

The novel idea of Simonyan and Zisserman (2014) was to use multiple convolutional layers between successive max-pooling layers, grouped into a "block". Their original interest was in whether deep or wide networks perform better. For example, two stacked 3x3 convolutions touch the same number of pixels as a single 5x5 convolution, but the larger 5x5 convolution uses substantially more parameters and computation.
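
The comparison between stacked 3x3 convolutions and a single 5x5 convolution can be checked with a little arithmetic. The sketch below is illustrative only: the helper names and the channel count `c = 64` are assumptions, and it counts weights plus biases for stride-1 convolutions with the same number of input and output channels.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of one 2D convolution with a square kernel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field (side length) of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


c = 64  # illustrative channel count, used for both input and output

two_3x3 = 2 * conv_params(3, c, c)
one_5x5 = conv_params(5, c, c)

# Both options see a 5x5 window of the input
print(stacked_receptive_field(3, 2), stacked_receptive_field(5, 1))
# ...but the single 5x5 layer carries more parameters
print(two_3x3, one_5x5)
```

The stacked version also applies two nonlinearities instead of one, which is part of the appeal of going deeper with small kernels.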

They demonstrated that deep and narrow networks significantly outperform their shallower counterparts. Stacks of successive 3x3 convolutions have since become the gold standard for image recognition, with networks of over 100 layers in many applications.

A VGG block consists of a sequence of convolutions with a 3x3 kernel and padding of 1, which keeps the height and width of the image the same, followed by a 2x2 max-pooling layer with a stride of 2, which halves them.

In [4]:
def vgg_block(num_convs, out_channels):
    """Stack `num_convs` 3x3 conv + ReLU layers, then halve height and width with a 2x2 max pool."""
    layers = []

    for _ in range(num_convs):
        layers.append(nn.LazyConv2d(out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    
    return nn.Sequential(*layers)
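
The shape behaviour of such a block can be verified by hand with the standard output-size formula for convolution and pooling layers, `floor((n + 2p - k) / s) + 1`. The following is a sketch in plain Python rather than PyTorch; `out_size` and `vgg_block_out_size` are hypothetical helper names introduced only for this illustration.

```python
def out_size(n, kernel_size, padding=0, stride=1):
    """Output side length of a convolution or pooling layer on an n x n input."""
    return (n + 2 * padding - kernel_size) // stride + 1


def vgg_block_out_size(n, num_convs):
    """Side length after `num_convs` 3x3/pad-1 convolutions and one 2x2/stride-2 max pool."""
    for _ in range(num_convs):
        n = out_size(n, kernel_size=3, padding=1)  # unchanged by each convolution
    return out_size(n, kernel_size=2, stride=2)    # halved by the pooling layer


print(out_size(224, kernel_size=3, padding=1))
print(vgg_block_out_size(224, num_convs=2))
```

Because the convolutions leave the spatial size untouched, the number of convolutions per block can be chosen freely without affecting the resolution that the next block sees.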

## VGG Network

Like AlexNet and LeNet, VGG networks consist of two main sections: the convolution and pooling layers, followed by the fully connected layers.

In [6]:
class VGG(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        conv_blks = []

        for (num_convs, out_channels) in arch:
            conv_blks.append(vgg_block(num_convs, out_channels))

        self.net = nn.Sequential(
            # Expand convolutional blocks
            *conv_blks, 

            # Flatten output before linear section
            nn.Flatten(),

            # Fully connected linear section
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),

            # Output
            nn.LazyLinear(num_classes)
        )

        self.net.apply(d2l.init_cnn)

In [8]:
VGG(
    arch=(
        (1, 64), 
        (1, 128), 
        (2, 256),
        (2, 512), 
        (2, 512)
    )).layer_summary(
    (1, 1, 224, 224)
)



Sequential output shape:	 torch.Size([1, 64, 112, 112])
Sequential output shape:	 torch.Size([1, 128, 56, 56])
Sequential output shape:	 torch.Size([1, 256, 28, 28])
Sequential output shape:	 torch.Size([1, 512, 14, 14])
Sequential output shape:	 torch.Size([1, 512, 7, 7])
Flatten output shape:	 torch.Size([1, 25088])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 10])
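
The shapes in this summary follow directly from the block structure. As a rough cross-check (a sketch in plain Python, not a replacement for `layer_summary`), the loop below tracks only the channel count and side length through the five blocks, then derives the flattened size and the parameter count of the first fully connected layer.

```python
arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

channels, side = 1, 224  # input: 1 channel, 224x224 pixels
for num_convs, out_channels in arch:
    channels = out_channels  # 3x3 convs with padding 1 only change the channel count
    side //= 2               # the 2x2, stride-2 max pool halves height and width
    print((1, channels, side, side))

flat = channels * side * side          # size of the flattened feature map
first_fc_params = flat * 4096 + 4096   # weights plus biases of the first linear layer
print(flat, first_fc_params)
```

The first linear layer alone holds on the order of a hundred million parameters, which is a large part of why VGG is expensive to store and train.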


## Training

Because VGG is so computationally intensive, we test training with a much smaller number of channels. The original network was designed for ImageNet; here we only fit Fashion-MNIST, for which the smaller network is more than sufficient.
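
To get a feel for how much cheaper the reduced network is, the sketch below counts the 3x3 convolution weights of a VGG-style architecture with a single input channel. The helper `conv_weights` is hypothetical and ignores biases and the fully connected layers. At a fixed input resolution each weight is applied once per output position, so the convolutional compute shrinks by roughly the same factor.

```python
def conv_weights(arch, in_channels=1):
    """Total 3x3 convolution weights (biases ignored) for a VGG-style architecture."""
    total = 0
    for num_convs, out_channels in arch:
        for _ in range(num_convs):
            total += 3 * 3 * in_channels * out_channels
            in_channels = out_channels  # the next conv reads this layer's output
    return total


full = conv_weights(((1, 64), (1, 128), (2, 256), (2, 512), (2, 512)))
small = conv_weights(((1, 16), (1, 32), (2, 64), (2, 128), (2, 128)))
print(full, small, round(full / small, 2))
```

Dividing every channel count by 4 cuts both the input and the output width of almost every layer, hence the roughly 16-fold reduction.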

In [9]:
import datetime  # needed for the timestamp in the checkpoint filename below

model = VGG(arch=((1, 16), (1, 32), (2, 64), (2, 128), (2, 128)), lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)

if torch.cuda.is_available():
    trainer.fit(model, data)

    # Save the parameters since this is expensive
    name = "VGG_" + datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S') + ".pt"
    torch.save(model.state_dict(), name)
else: 
    print("You're not training this without a GPU. If you really wanna, you can remove this check")

It could be argued that VGG is the first truly modern CNN. AlexNet introduced many of the components that make deep learning effective, but VGG parameterized them as blocks and introduced the concept of a whole family of deep and narrow models.