# Network in Network (NiN)
:label:`sec_nin`

LeNet, AlexNet, and VGG all share a common design pattern:
extract features exploiting *spatial* structure
via a sequence of convolutions and pooling layers
and post-process the representations via fully connected layers.
The improvements upon LeNet by AlexNet and VGG mainly lie
in how these later networks widen and deepen these two modules.

This design poses two major challenges.
First, the fully connected layers at the end
of the architecture consume tremendous numbers of parameters.
Second, we cannot simply add fully connected layers
earlier in the network to increase the degree of nonlinearity: doing so would destroy the
spatial structure and could require even more memory.
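
To make the first challenge concrete, the following back-of-the-envelope sketch counts the parameters of a single fully connected layer placed on top of a convolutional feature map. The sizes are assumptions chosen to resemble the first fully connected layer of a VGG-style network (a $512 \times 7 \times 7$ feature map feeding 4096 hidden units); they are not taken from this section.

```python
# Assumed sizes, resembling the first fully connected layer of a VGG-style network
channels, height, width = 512, 7, 7   # final convolutional feature map
hidden_units = 4096                   # width of the fully connected layer

in_features = channels * height * width
num_params = in_features * hidden_units + hidden_units  # weights + biases
megabytes_fp32 = num_params * 4 / 1e6                   # 4 bytes per FP32 value

print(f'{num_params:,} parameters, about {megabytes_fp32:.0f} MB in FP32')
```

Under these assumptions, one layer alone already holds roughly 100 million parameters.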

The *network in network* (*NiN*) blocks offer an alternative
that addresses both problems with one simple strategy.
They rest on two simple ideas: (i) use $1 \times 1$ convolutions to add
local nonlinearities across the channel activations and (ii) use global average pooling to integrate
across all locations in the last representation layer. Note that global average pooling would not
be effective were it not for the added nonlinearities.
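
Global average pooling itself has no parameters: it simply averages each channel over all spatial locations. The minimal sketch below, which uses an arbitrary tensor shape chosen purely for illustration, checks that `nn.AdaptiveAvgPool2d((1, 1))` followed by flattening agrees with taking the mean over the height and width axes.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(2, 10, 5, 5)  # illustrative (example, channel, height, width)

gap = nn.Sequential(nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten())
pooled = gap(X)               # shape (2, 10): one value per channel
manual = X.mean(dim=(2, 3))   # average over height and width directly

num_gap_params = sum(p.numel() for p in gap.parameters())
print(pooled.shape, torch.allclose(pooled, manual, atol=1e-6), num_gap_params)
```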


## NiN Blocks

Recall that the inputs and outputs of convolutional layers
consist of four-dimensional tensors with axes
corresponding to the example, channel, height, and width.
Also recall that the inputs and outputs of fully connected layers
are typically two-dimensional tensors corresponding to the example and feature.
The idea behind NiN is to apply a fully connected layer
at each pixel location (for each height and width).
**The resulting $1 \times 1$ convolution can be thought of as
a fully connected layer acting independently on each pixel location**.
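
This equivalence can be checked numerically. The sketch below is an illustration under assumed sizes (3 input channels, 6 output channels, a $4 \times 5$ feature map): it copies the parameters of a $1 \times 1$ convolution into an `nn.Linear` layer and applies that layer to the channel vector at every pixel.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(2, 3, 4, 5)  # illustrative (example, channel, height, width)

conv = nn.Conv2d(3, 6, kernel_size=1)
fc = nn.Linear(3, 6)
# Share parameters: the conv weight has shape (6, 3, 1, 1)
with torch.no_grad():
    fc.weight.copy_(conv.weight.reshape(6, 3))
    fc.bias.copy_(conv.bias)

Y_conv = conv(X)
# Move channels last so the linear layer sees one channel vector per pixel,
# then move the output channels back to axis 1
Y_fc = fc(X.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

same = torch.allclose(Y_conv, Y_fc, atol=1e-5)
print(Y_conv.shape, same)
```

Because the same weights are used at every location, the layer mixes channels but never mixes information between different pixels.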

NiN differs from VGG in two ways. Within each block, the initial convolution is followed by $1 \times 1$ convolutions, whereas VGG retains $3 \times 3$ convolutions. At the end of the network, a giant fully connected layer is no longer required.

```python
import torch
from torch import nn
from d2l import torch as d2l
```

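```python
def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    """One NiN block: a regular convolution followed by two 1x1 convolutions."""
    return nn.Sequential(
        # Spatial convolution; only this layer can change the height and width
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding), nn.ReLU(),
        # Two 1x1 convolutions act as per-pixel fully connected layers
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU())
```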
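
As a quick sanity check of the block, the sketch below repeats the `nin_block` definition so that it runs on its own, then records the output shape and parameter count of each convolution. The block configuration and the $224 \times 224$ single-channel input are chosen for illustration.

```python
import torch
from torch import nn

def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU())

block = nin_block(1, 96, kernel_size=11, strides=4, padding=0)
X = torch.randn(1, 1, 224, 224)

# Record the output shape after each convolution in the block
shapes = []
for layer in block:
    X = layer(X)
    if isinstance(layer, nn.Conv2d):
        shapes.append(tuple(X.shape))

# Parameters (weights + biases) of each convolution
conv_params = [sum(p.numel() for p in layer.parameters())
               for layer in block if isinstance(layer, nn.Conv2d)]
print(shapes)
print(conv_params)
```

Only the first convolution changes the spatial size; the two $1 \times 1$ convolutions keep the height and width and each add $96 \times 96 + 96$ parameters.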

## NiN Network

```python
net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, strides=4, padding=0),  # 1 input channel
    nn.MaxPool2d(3, stride=2),
    nin_block(96, 256, kernel_size=5, strides=1, padding=2),
    nn.MaxPool2d(3, stride=2),
    nin_block(256, 384, kernel_size=3, strides=1, padding=1),
    nn.MaxPool2d(3, stride=2), nn.Dropout(p=0.5),
    # The last block outputs one channel per category (10 categories)
    nin_block(384, 10, kernel_size=3, strides=1, padding=1),
    # Global average pooling reduces each channel to a single value
    nn.AdaptiveAvgPool2d((1, 1)),
    # Convert the 4D output to 2D with shape (batch_size, 10)
    nn.Flatten())
```

```python
X = torch.randn(1, 1, 224, 224)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'Output shape:\t', X.shape)
```

```
Sequential Output shape:	 torch.Size([1, 96, 54, 54])
MaxPool2d Output shape:	 torch.Size([1, 96, 26, 26])
Sequential Output shape:	 torch.Size([1, 256, 26, 26])
MaxPool2d Output shape:	 torch.Size([1, 256, 12, 12])
Sequential Output shape:	 torch.Size([1, 384, 12, 12])
MaxPool2d Output shape:	 torch.Size([1, 384, 5, 5])
Dropout Output shape:	 torch.Size([1, 384, 5, 5])
Sequential Output shape:	 torch.Size([1, 10, 5, 5])
AdaptiveAvgPool2d Output shape:	 torch.Size([1, 10, 1, 1])
Flatten Output shape:	 torch.Size([1, 10])
```
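
The spatial sizes printed above follow from the usual output-size formula $\lfloor (n + 2p - k) / s \rfloor + 1$ for input size $n$, kernel size $k$, padding $p$, and stride $s$. The sketch below reproduces them in plain Python; `out_size` is a helper introduced here for illustration, and it assumes floor rounding, which is what the layers above use with their default settings.

```python
def out_size(n, kernel_size, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer (floor rounding)."""
    return (n + 2 * padding - kernel_size) // stride + 1

# (kernel_size, stride, padding) of the layers that can change the spatial size
layers = [
    (11, 4, 0),  # first NiN block
    (3, 2, 0),   # max-pooling
    (5, 1, 2),   # second NiN block
    (3, 2, 0),   # max-pooling
    (3, 1, 1),   # third NiN block
    (3, 2, 0),   # max-pooling
    (3, 1, 1),   # fourth NiN block
]

n = 224
sizes = []
for kernel_size, stride, padding in layers:
    n = out_size(n, kernel_size, stride, padding)
    sizes.append(n)
print(sizes)  # [54, 26, 26, 12, 12, 5, 5]
```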


### Training

We train the model on the Fashion-MNIST dataset.
The images are resized to $224 \times 224$, matching the **single-channel data example** with both height and width of 224 that was constructed earlier **to observe the output shape of each layer**.

```python
lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
```