In [None]:
!pip install d2l==0.17.2


The figure below illustrates the main structural differences
between VGG and NiN, and between their blocks.

![Comparing architectures of VGG and NiN, and their blocks.](http://d2l.ai/_images/nin.svg)

In [None]:
import torch
from torch import nn
from d2l import torch as d2l


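def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    """Return one NiN block: a regular convolution followed by two 1x1 convolutions."""
    return nn.Sequential(
        # Spatial convolution with the user-specified window, stride and padding
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding),
        nn.ReLU(),
        # Two 1x1 convolutions that mix channels independently at every pixel
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU())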
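
The two 1×1 convolutions in the block act on each pixel separately, much like a small dense layer applied across the channels at that location. The plain-Python sketch below makes this explicit on a tiny input. The helper `conv1x1` and the names `weight` and `bias` are illustrative assumptions, not part of PyTorch or d2l.

```python
def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution to x, given as nested lists [C_in][H][W].

    weight has shape [C_out][C_in] and bias has shape [C_out]; these are
    hypothetical names used only for this sketch.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    out = []
    for o in range(len(weight)):
        channel = []
        for i in range(h):
            row = []
            for j in range(w):
                # Each output value depends only on the input channels at
                # the same pixel (i, j): a dense layer across channels.
                row.append(bias[o] + sum(weight[o][c] * x[c][i][j]
                                         for c in range(c_in)))
            channel.append(row)
        out.append(channel)
    return out


# Two input channels on a 2x2 image, mapped to three output channels.
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[10.0, 20.0], [30.0, 40.0]]]
weight = [[1.0, 0.0],   # copies input channel 0
          [0.0, 1.0],   # copies input channel 1
          [1.0, 1.0]]   # adds both input channels
bias = [0.0, 0.0, 0.5]

y = conv1x1(x, weight, bias)
print(y[2])  # third output channel: sum of both inputs plus 0.5
```

The spatial size is unchanged and only the channel dimension is mixed, which is why these layers can be stacked inside a block without affecting the output height and width.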

One significant difference between NiN and AlexNet is that NiN avoids fully-connected layers altogether. Instead, NiN uses an NiN block with a number of output channels equal to the number of label classes, followed by a global average pooling layer, yielding a vector of logits. One advantage of NiN’s design is that it significantly reduces the number of required model parameters. However, in practice, this design sometimes requires increased model training time.

In [None]:
net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, strides=4, padding=0),
    nn.MaxPool2d(3, stride=2),
    nin_block(96, 256, kernel_size=5, strides=1, padding=2),
    nn.MaxPool2d(3, stride=2),
    nin_block(256, 384, kernel_size=3, strides=1, padding=1),
    nn.MaxPool2d(3, stride=2),
    nn.Dropout(0.5),
    # There are 10 label classes
    nin_block(384, 10, kernel_size=3, strides=1, padding=1),
    nn.AdaptiveAvgPool2d((1, 1)),
    # Transform the four-dimensional output into two-dimensional output with a
    # shape of (batch size, 10)
    nn.Flatten())
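
The parameter savings mentioned above can be checked with simple arithmetic. The sketch below counts weights and biases for the four NiN blocks in `net` and compares the total with a dense head of widths 6400, 4096, 4096 and 10. Those dense widths are an assumption chosen to resemble an AlexNet-style classifier and do not come from this section. The helper names are likewise hypothetical.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of one square-kernel convolutional layer."""
    return in_channels * out_channels * kernel_size ** 2 + out_channels


def nin_block_params(in_channels, out_channels, kernel_size):
    """One k x k convolution followed by two 1x1 convolutions."""
    return (conv_params(in_channels, out_channels, kernel_size)
            + 2 * conv_params(out_channels, out_channels, 1))


def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# (in_channels, out_channels, kernel_size) of the four blocks in `net`.
# Pooling, dropout and flatten layers have no parameters.
nin_blocks = [(1, 96, 11), (96, 256, 5), (256, 384, 3), (384, 10, 3)]
nin_total = sum(nin_block_params(*block) for block in nin_blocks)

# ASSUMPTION: a hypothetical AlexNet-style dense head, for comparison only.
dense_widths = [6400, 4096, 4096, 10]
dense_total = sum(linear_params(i, o)
                  for i, o in zip(dense_widths, dense_widths[1:]))

print(f"NiN, all blocks:      {nin_total:,}")
print(f"assumed dense head:   {dense_total:,}")
```

Under these assumptions the whole NiN network has roughly two million parameters, while the assumed dense head alone has over forty million.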

We create a data example to see the output shape of each block.

In [None]:
X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape:\t', X.shape)

Sequential output shape:	 torch.Size([1, 96, 54, 54])
MaxPool2d output shape:	 torch.Size([1, 96, 26, 26])
Sequential output shape:	 torch.Size([1, 256, 26, 26])
MaxPool2d output shape:	 torch.Size([1, 256, 12, 12])
Sequential output shape:	 torch.Size([1, 384, 12, 12])
MaxPool2d output shape:	 torch.Size([1, 384, 5, 5])
Dropout output shape:	 torch.Size([1, 384, 5, 5])
Sequential output shape:	 torch.Size([1, 10, 5, 5])
AdaptiveAvgPool2d output shape:	 torch.Size([1, 10, 1, 1])
Flatten output shape:	 torch.Size([1, 10])
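
The spatial sizes printed above follow from the usual output-size formula for convolution and pooling, floor((n + 2p - k) / s) + 1. As a rough check, the sketch below reproduces them in plain Python. The helper `out_size` and the `layers` list are illustrative names, and the pooling layers are assumed to round down, which is the `nn.MaxPool2d` default.

```python
def out_size(n, kernel_size, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer (floor mode)."""
    return (n + 2 * padding - kernel_size) // stride + 1


# (name, kernel_size, stride, padding) for every layer that can change the
# spatial size. The 1x1 convolutions inside each block keep it unchanged.
layers = [
    ("nin_block", 11, 4, 0),
    ("MaxPool2d", 3, 2, 0),
    ("nin_block", 5, 1, 2),
    ("MaxPool2d", 3, 2, 0),
    ("nin_block", 3, 1, 1),
    ("MaxPool2d", 3, 2, 0),
    ("nin_block", 3, 1, 1),
]

size = 224  # input height and width
sizes = []
for name, kernel_size, stride, padding in layers:
    size = out_size(size, kernel_size, stride, padding)
    sizes.append(size)
    print(f"{name}: {size} x {size}")
```

The final 5×5 maps are then reduced to 1×1 by the global average pooling layer, independent of this formula.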


## Training

We use Fashion-MNIST to train the model, resizing the images to 224×224 to match the input size used above.

In [None]:
lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())


## Summary
* NiN uses blocks consisting of a convolutional layer and multiple 1×1 convolutional layers. This can be used within the convolutional stack to allow for more per-pixel nonlinearity.

* NiN removes the fully-connected layers and replaces them with global average pooling (i.e., averaging over all spatial locations) after reducing the number of channels to the desired number of outputs (e.g., 10 for Fashion-MNIST).

* Removing the fully-connected layers reduces overfitting, because NiN has dramatically fewer parameters.

* The NiN design influenced many subsequent CNN designs.
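
The global average pooling step from the summary can be sketched in plain Python: each channel of the final feature map is averaged over all spatial locations, giving one logit per class. The function name `global_avg_pool` and the toy values are assumptions for illustration. In the network itself this step is performed by `nn.AdaptiveAvgPool2d((1, 1))` followed by `nn.Flatten()`.

```python
def global_avg_pool(feature_maps):
    """Average each channel of a [C][H][W] nested list over all locations."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


# Toy example: three "class" channels, each a 2x2 feature map.
maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 4.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]

logits = global_avg_pool(maps)
# Index of the largest logit, i.e. the predicted class in this toy setup.
predicted = max(range(len(logits)), key=logits.__getitem__)
print(logits, predicted)
```

Because the pooling has no weights, the number of parameters in the classification head does not grow with the spatial size of the feature maps.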