In [2]:
from matplotlib import pyplot as plt
import pandas as pd
import random
import time
import math
import d2l
import os

from mxnet import autograd, np, npx, gluon, init
from mxnet.gluon import loss as gloss
from mxnet.gluon import nn
npx.set_np()

#  07. Modern Convolutional Neural Networks
Now that we understand the basics of wiring together convolutional neural networks, we will take you through a tour of modern deep learning. In this chapter, each section corresponds to a significant neural network architecture that was at some point (or still is) the base model upon which an enormous amount of research and many projects were built. Each of these networks was briefly a dominant architecture, and many were winners or runners-up in the famous `ImageNet` competition, which has served as a barometer of progress on supervised learning in computer vision since 2010.

These models include:
+ `AlexNet`: the first large-scale network deployed to beat conventional computer vision methods on a large-scale vision challenge
+ `VGG`: makes use of a number of repeating blocks of elements
+ `NiN`: network in network, which convolves whole neural networks patch-wise over inputs
+ `GoogLeNet`: makes use of networks with parallel concatenations
+ `ResNet`: residual networks, which are the most popular go-to architecture today
+ `DenseNet`: densely connected networks, which are expensive to compute but have set some recent benchmarks


## 7.2 Networks Using Blocks (VGG)
While `AlexNet` proved that deep convolutional neural networks can achieve good results, it did not offer a general template to guide subsequent researchers in designing new networks. In the following sections, we will introduce several heuristic concepts commonly used to design deep networks.

Progress in this field mirrors that in chip design, where engineers went from placing transistors to logical elements to logic blocks. Similarly, the design of neural network architectures has grown progressively more abstract, with researchers moving from thinking in terms of individual neurons to whole layers, and now to blocks, that is, repeating patterns of layers.

The idea of using blocks first emerged from the `Visual Geometry Group` (`VGG`) at `Oxford University`, in their eponymously named `VGG` network. These repeated structures are easy to implement in any modern deep learning framework by using loops and subroutines.

### 7.2.1 VGG Blocks
The basic building block of classic convolutional networks is a sequence of the following layers:
1. a convolutional layer (with padding to maintain the resolution)
2. a nonlinearity such as a ReLU
3. a pooling layer such as a max pooling layer

One `VGG` block consists of a sequence of convolutional layers, followed by a max pooling layer for spatial downsampling. In the original `VGG` paper (`Simonyan & Zisserman, 2014`), the authors employed convolutions with $3\times3$ kernels and $2 \times 2$ max pooling with stride of $2$ (halving the resolution after each block). In the code below, we define a function called `vgg_block` to implement one `VGG` block. The function takes two arguments corresponding to the number of convolutional layers `num_convs` and the number of output channels `num_channels`.

In [3]:
def vgg_block(num_convs, num_channels):
    blk = nn.Sequential()
    for _ in range(num_convs):
        blk.add(nn.Conv2D(num_channels, kernel_size=3, padding=1, activation='relu'))
    blk.add(nn.MaxPool2D(pool_size=2, strides=2))
    return blk
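
The effect of one block on the spatial resolution can be checked with plain arithmetic. The sketch below is a framework-free illustration, not part of the original notebook. The helper names `conv_out_size` and `vgg_block_out_size` are hypothetical, and it assumes the standard output-size formula $\lfloor (n + 2p - k) / s \rfloor + 1$ for both convolution and pooling.

```python
def conv_out_size(size, kernel, padding=0, stride=1):
    """Output size along one axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def vgg_block_out_size(size, num_convs):
    """Spatial size after one VGG block (hypothetical helper for illustration)."""
    # Each 3x3 convolution with padding 1 and stride 1 keeps the size unchanged
    for _ in range(num_convs):
        size = conv_out_size(size, kernel=3, padding=1, stride=1)
    # The 2x2 max pooling with stride 2 halves the size
    return conv_out_size(size, kernel=2, padding=0, stride=2)


print(vgg_block_out_size(224, num_convs=2))  # 112
```

Under these assumptions the number of convolutions in a block does not affect the output resolution. Only the pooling layer does.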

### 7.2.2 VGG Network
Like `AlexNet` and `LeNet`, the `VGG` Network can be partitioned into two parts: 
+ the first consisting mostly of convolutional and pooling layers 
+ a second consisting of fully-connected layers

The convolutional portion of the net connects several `vgg_block` modules in succession, as illustrated in `Fig. 7.2.1`. The variable `conv_arch` consists of a list of tuples (one per block), where each contains two values: the number of convolutional layers and the number of output channels, which are precisely the arguments required to call the `vgg_block` function. The fully-connected module is identical to that covered in `AlexNet`.

<img src="images/07_03.png" style="width:500px;"/>

The original `VGG` network has 5 convolutional blocks, of which the first two have one convolutional layer each and the latter three have two convolutional layers each. The first block has $64$ output channels and each subsequent block doubles the number of output channels, until that number reaches $512$. Since this network uses $8$ convolutional layers and $3$ fully-connected layers, it is often called `VGG-11`.

In [4]:
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
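
As a quick sanity check on the name `VGG-11`, the layer count can be read directly off `conv_arch`. This is a minimal sketch. The variable names other than `conv_arch` are introduced here for illustration only, and the count of 3 fully-connected layers is taken from the text above.

```python
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

# Total number of convolutional layers across all blocks
num_conv_layers = sum(num_convs for num_convs, _ in conv_arch)
# Two hidden fully-connected layers plus the output layer
num_dense_layers = 3

print(num_conv_layers, num_conv_layers + num_dense_layers)  # 8 11

# Channels double from block to block until they are capped at 512
channels = [num_channels for _, num_channels in conv_arch]
doubles = all(b == min(2 * a, 512) for a, b in zip(channels, channels[1:]))
print(doubles)  # True
```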

The following code implements `VGG-11`. This is a simple matter of executing a for loop over `conv_arch`.

In [5]:
def vgg(conv_arch):
    net = nn.Sequential()
    # The convolutional layer part
    for (num_convs, num_channels) in conv_arch:
        net.add(vgg_block(num_convs, num_channels))
    # The fully connected layer part
    net.add(nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
            nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
            nn.Dense(10))
    return net

net = vgg(conv_arch)

Next, we will construct a single-channel data example with a height and width of 224 to observe the output shape of each layer.

In [6]:
net.initialize()
X = np.random.uniform(size=(1, 1, 224, 224))
for blk in net:
    X = blk(X)
    print(blk.name, 'output shape:\t', X.shape)

sequential1 output shape:	 (1, 64, 112, 112)
sequential2 output shape:	 (1, 128, 56, 56)
sequential3 output shape:	 (1, 256, 28, 28)
sequential4 output shape:	 (1, 512, 14, 14)
sequential5 output shape:	 (1, 512, 7, 7)
dense0 output shape:	 (1, 4096)
dropout0 output shape:	 (1, 4096)
dense1 output shape:	 (1, 4096)
dropout1 output shape:	 (1, 4096)
dense2 output shape:	 (1, 10)


As you can see, we halve height and width at each block, finally reaching a height and width of 7 before flattening the representations for processing by the fully-connected layer.
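
The same shapes can be reproduced without running the network, which also shows how many features the first fully-connected layer receives after flattening. The sketch below is an illustration under the assumption that every block only halves the resolution, as described above. `trace_shapes` is a hypothetical helper.

```python
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))


def trace_shapes(conv_arch, size=224):
    """Return (channels, height, width) after each VGG block."""
    shapes = []
    for _, num_channels in conv_arch:
        # Padded 3x3 convolutions keep the size; 2x2 pooling with stride 2 halves it
        size //= 2
        shapes.append((num_channels, size, size))
    return shapes


shapes = trace_shapes(conv_arch)
channels, height, width = shapes[-1]
flattened = channels * height * width

print(shapes[-1], flattened)  # (512, 7, 7) 25088
```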

### 7.2.3 Model Training
Since `VGG-11` is more computationally heavy than `AlexNet`, we construct a network with a smaller number of channels. This is more than sufficient for training on `Fashion-MNIST`.

In [9]:
ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg(small_conv_arch)
# net.summary

In [10]:
net.initialize()
X = np.random.uniform(size=(1, 1, 224, 224))
for blk in net:
    X = blk(X)
    print(blk.name, 'output shape:\t', X.shape)

sequential19 output shape:	 (1, 16, 112, 112)
sequential20 output shape:	 (1, 32, 56, 56)
sequential21 output shape:	 (1, 64, 28, 28)
sequential22 output shape:	 (1, 128, 14, 14)
sequential23 output shape:	 (1, 128, 7, 7)
dense9 output shape:	 (1, 4096)
dropout6 output shape:	 (1, 4096)
dense10 output shape:	 (1, 4096)
dropout7 output shape:	 (1, 4096)
dense11 output shape:	 (1, 10)
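
The printed shapes show that dividing the channels by `ratio` leaves the fully-connected layers at 4096 units. To get a rough feel for where the parameters sit, the following sketch counts weights and biases by hand. It is an estimate under stated assumptions, not output from the notebook: `vgg_param_count` is a hypothetical helper, and it assumes a single input channel, $224 \times 224$ inputs, and 10 classes, as in the cells above.

```python
def vgg_param_count(conv_arch, in_channels=1, size=224, num_classes=10):
    """Return (conv_params, dense_params) for the VGG layout used above."""
    conv_params = 0
    for num_convs, out_channels in conv_arch:
        for _ in range(num_convs):
            # 3x3 kernel weights plus one bias per output channel
            conv_params += in_channels * out_channels * 3 * 3 + out_channels
            in_channels = out_channels
        size //= 2  # each block ends with 2x2 max pooling, stride 2
    dense_params = 0
    in_units = in_channels * size * size  # flattened feature count
    for out_units in (4096, 4096, num_classes):
        dense_params += in_units * out_units + out_units
        in_units = out_units
    return conv_params, dense_params


conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]

full_conv, full_dense = vgg_param_count(conv_arch)
small_conv, small_dense = vgg_param_count(small_conv_arch)
print('full :', full_conv, full_dense)
print('small:', small_conv, small_dense)
```

In both variants the fully-connected part holds most of the parameters, so shrinking the channels reduces the convolutional cost much more sharply than the total parameter count.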


Apart from using a slightly larger learning rate, the model training process is similar to that of `AlexNet` in the last section.

In [None]:
lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)

##### Summary
+ `VGG-11` constructs a network using reusable convolutional blocks. Different `VGG` models can be defined by the differences in the number of convolutional layers and output channels in each block.
+ The use of blocks leads to very compact representations of the network definition. It allows for efficient design of complex networks.
+ In their work, Simonyan and Zisserman experimented with various architectures. In particular, they found that several layers of deep and narrow convolutions (i.e., $3 \times 3$) were more effective than fewer layers of wider convolutions.
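
One way to make the last point concrete is to compare a stack of $3 \times 3$ convolutions with a single wider convolution that covers the same receptive field. The sketch below is illustrative only. It assumes stride 1, equal input and output channel counts, and ignores biases. The helper names and the choice of 256 channels are hypothetical.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    # Each additional layer widens the field by (kernel - 1)
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels, num_layers=1):
    """Weight count (biases ignored) with `channels` inputs and outputs per layer."""
    return num_layers * kernel * kernel * channels * channels


channels = 256  # arbitrary width chosen for illustration
for num_layers in (2, 3):
    field = stacked_receptive_field(num_layers)
    stacked = conv_weights(3, channels, num_layers)
    single = conv_weights(field, channels)
    print(f'{num_layers} x (3x3): field {field}x{field}, '
          f'{stacked} weights vs {single} for one {field}x{field} layer')
```

Under these assumptions, two $3 \times 3$ layers cover a $5 \times 5$ field with $18C^2$ weights instead of $25C^2$, and three cover a $7 \times 7$ field with $27C^2$ instead of $49C^2$, while also applying more nonlinearities.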

##### Exercises
1. When printing out the dimensions of the layers we only saw 8 results (not counting the dropout layers) rather than 11. Where did the information for the remaining 3 layers go?
2. Compared with `AlexNet`, `VGG` is much slower in terms of computation, and it also needs more GPU memory. Try to analyze the reasons for this.
3. Try to change the height and width of the images in Fashion-MNIST from 224 to 96. What influence does this have on the experiments?
4. Refer to Table 1 in (`Simonyan & Zisserman, 2014`) to construct other common models, such as `VGG-16` or `VGG-19`.