In [3]:
import sys
import torch.nn as nn
import torch
import warnings
sys.path.append('/home/jovyan/work/d2l_solutions/notebooks/exercises/d2l_utils/')
import d2l
from torchsummary import summary
warnings.filterwarnings("ignore")

def nin_block(out_channels, kernel_size, strides, padding, nums_conv1):
    # Spatial convolution; the PyTorch keyword is `stride`, not `strides`.
    layers = [nn.LazyConv2d(out_channels, kernel_size=kernel_size, stride=strides, padding=padding),
              nn.ReLU()]
    # Append `nums_conv1` 1x1 convolutions, each followed by a ReLU.
    for _ in range(nums_conv1):
        layers.append(nn.LazyConv2d(out_channels, kernel_size=1))
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class Nin(d2l.Classifier):
    def __init__(self, arch, lr=0.1):
        super().__init__()
        self.save_hyperparameters()
        layers = []
        for i in range(len(arch)-1):
            layers.append(nin_block(*arch[i]))
            layers.append(nn.MaxPool2d(3, stride=2))
        layers.append(nn.Dropout(0.5))
        layers.append(nin_block(*arch[-1]))
        layers.append(nn.AdaptiveAvgPool2d((1, 1)))
        layers.append(nn.Flatten())
        self.net = nn.Sequential(*layers)
        self.net.apply(d2l.init_cnn)

# 1. Why are there two $1\times1$ convolutional layers per NiN block? Increase their number to three. Reduce their number to one. What changes?

In the Network in Network (NiN) architecture, the $1\times1$ convolutions that follow each spatial convolution act as a small fully connected network applied independently at every pixel location, mixing information across channels. They add nonlinearity and capacity at modest parameter cost: a $1\times1$ convolution with $c$ input and $c$ output channels has only $c^2 + c$ parameters (weights plus biases). Changing their number affects the network as follows:

1. **Two $1\times1$ Convolutional Layers per NiN Block**:
   - The layers are stacked sequentially, not in parallel, so each one transforms the channel vector produced by the previous layer. Together with the spatial convolution, this makes the block a three-layer MLP (each layer followed by a ReLU) applied to every local patch.
   - Two $1\times1$ convolutions give the block enough depth to model nonlinear interactions between channels while keeping the parameter count moderate, which is why this is the default design.

2. **Three $1\times1$ Convolutional Layers per NiN Block**:
   - A third $1\times1$ layer adds one more nonlinearity per block and so increases the block's capacity to model interactions between channels.
   - It also adds $c^2 + c$ parameters per block and a matching amount of per-pixel computation. The deeper network is slower to train and, on a small dataset, more prone to overfitting.

3. **One $1\times1$ Convolutional Layer per NiN Block**:
   - With a single $1\times1$ layer, each block is shallower and has less capacity to model channel interactions, so the network may underfit on a complex dataset or task.
   - It removes $c^2 + c$ parameters per block and the corresponding computation, which speeds up training and reduces memory usage.

Overall, the number of $1\times1$ convolutional layers trades capacity against parameter count and computation. The best choice depends on the dataset's complexity and the available compute, so it should be settled by training each variant and comparing validation accuracy.
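The parameter cost described above can be checked with a little arithmetic. The sketch below is a minimal illustration, not part of the original solution: the helper names `conv_params` and `nin_block_params` are hypothetical, and the block shape (1 input channel, 96 output channels, an $11\times11$ kernel) is an assumed example. It counts weights and biases for a block with one, two, or three $1\times1$ layers.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a square-kernel 2D convolution."""
    return in_channels * out_channels * kernel_size ** 2 + out_channels


def nin_block_params(in_channels, out_channels, kernel_size, num_1x1):
    """Parameters of one NiN block: a spatial conv followed by `num_1x1` 1x1 convs."""
    spatial = conv_params(in_channels, out_channels, kernel_size)
    # Every 1x1 layer maps out_channels -> out_channels.
    pointwise = num_1x1 * conv_params(out_channels, out_channels, 1)
    return spatial + pointwise


# Assumed example block: 1 input channel, 96 output channels, 11x11 kernel.
for n in (1, 2, 3):
    print(n, nin_block_params(1, 96, 11, n))
# 1 21024
# 2 30336
# 3 39648
```

Each extra $1\times1$ layer adds $96^2 + 96 = 9312$ parameters to this block regardless of the spatial kernel size, so the relative cost is largest in blocks with many channels.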

# 2. What changes if you replace the $1\times1$ convolutions by $3\times3$ convolutions?

# 3. What happens if you replace the global average pooling by a fully connected layer (speed, accuracy, number of parameters)?

# 4. Calculate the resource usage for NiN.

## 4.1 What is the number of parameters?

## 4.2 What is the amount of computation?

## 4.3 What is the amount of memory needed during training?

## 4.4 What is the amount of memory needed during prediction?

# 5. What are possible problems with reducing the $384\times5\times5$ representation to a $10\times5\times5$ representation in one step?

# 6. Use the structural design decisions in VGG that led to VGG-11, VGG-16, and VGG-19 to design a family of NiN-like networks.