## 1. Residual Networks (ResNet) and ResNeXt

When designing deeper networks, what matters is the ability to design networks where adding layers makes them strictly **more expressive rather than just different.**

#### Function Classes

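Let $F$ be the class of functions that a given network architecture can reach. Given a dataset with features $X$ and labels $y$, we might try finding the best function in that class, $ f_{F}^{*} $, by solving the following optimization problem: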

<img src="img/funclass.png" width=300 height=300 />

<img src="img/nest.png" width=500 height=500 />

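- Only if larger function classes contain the smaller ones (nested function classes) are we guaranteed that enlarging them strictly increases the expressive power of the network. For non-nested classes, a larger class may even move further away from the function we are looking for.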

- **For deep neural networks**, if we can train the newly-added layer into an identity function $f(x) = x$, the new model will be as effective as the original model. As the new model may get a better solution to fit the training dataset, the added layer might make it easier to reduce training errors.

  
- **The idea behind residual network** (ResNet) is that every additional layer should more easily contain **the identity function** as one of its elements.

### Residual Blocks

<img src="img/res.png" width=500 height=500 />

- The right figure illustrates the **residual block of ResNet**, where the solid line carrying the layer input $x$ to the addition operator is called a **residual connection** (or shortcut connection).

- The portion within the dotted-line box only needs to learn **the residual mapping** $g (x) = f (x)-x$. If **the identity mapping** $f (x) = x$ is the desired function, the block just has to push $g(x)$ toward zero, which is easier to learn.

- The residual block can be thought of as a special case of the multi-branch Inception block: it has two branches, one of which is the identity mapping.
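
To make the residual idea concrete, here is a minimal sketch (not the ResNet block itself). A hypothetical `TinyResidual` module computes $g(x) + x$, with a single linear layer standing in for $g$. Zeroing the weights of $g$ is done purely for illustration: it shows that the block then reduces to the identity mapping.

```python
import torch
import torch.nn as nn

class TinyResidual(nn.Module):
    """Hypothetical toy block computing g(x) + x (illustration only)."""
    def __init__(self, dim):
        super().__init__()
        self.g = nn.Linear(dim, dim)  # stands in for the residual mapping g

    def forward(self, x):
        return self.g(x) + x

block = TinyResidual(4)
# Push the residual mapping to zero: g(x) = 0 for every input
nn.init.zeros_(block.g.weight)
nn.init.zeros_(block.g.bias)

x = torch.randn(2, 4)
out = block(x)
identity_recovered = torch.equal(out, x)
print(identity_recovered)  # True
```

With a plain (non-residual) layer, reaching the identity would instead require learning a specific non-zero weight matrix.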

#### ResNet block with and without 1 × 1 convolution

<img src="img/resnet.png" width=500 height=500 />

In [14]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [None]:
class Residual(nn.Module): 
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1,
                                   stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            # Optional 1x1 convolution that reshapes the shortcut to match Y
            self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1,
                                        stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        # Residual connection: add the (possibly reshaped) input to the output
        Y += X
        return F.relu(Y)
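
The 1 × 1 convolution matters whenever the main branch changes the shape of its input. The sketch below uses plain `nn.Conv2d` layers (not the `Residual` class above), with sizes chosen only for illustration. With a stride of 2 and a different channel count, the output of the main branch no longer matches the input. A 1 × 1 convolution with the same stride brings the shortcut to the same shape so that the two can be added.

```python
import torch
import torch.nn as nn

X = torch.randn(4, 3, 6, 6)  # illustrative batch: 4 images, 3 channels, 6x6

# Main branch: 3x3 convolution that doubles the channels and halves H and W
main = nn.Conv2d(3, 6, kernel_size=3, padding=1, stride=2)
# Shortcut branch: 1x1 convolution with the same stride and channel count
shortcut = nn.Conv2d(3, 6, kernel_size=1, stride=2)

main_shape = tuple(main(X).shape)
shortcut_shape = tuple(shortcut(X).shape)
print(tuple(X.shape), main_shape, shortcut_shape)
# (4, 3, 6, 6) (4, 6, 3, 3) (4, 6, 3, 3)

Y = main(X) + shortcut(X)  # shapes agree, so the addition is valid
```

When the stride is 1 and the channel count is unchanged, the shortcut can carry $x$ directly and no 1 × 1 convolution is needed.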

#### ResNet Model

<img src="img/resnetm.png" width=500 height=500 />

In [1]:
from torchvision import models
import torchinfo

In [5]:
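# `pretrained=True` is deprecated in torchvision >= 0.13; the `weights`
# argument below selects the same ImageNet weights
resNet_model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
torchinfo.summary(resNet_model, (3, 224, 224), batch_dim=0)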

Layer (type:depth-idx)                   Output Shape              Param #
ResNet                                   [1, 1000]                 --
├─Conv2d: 1-1                            [1, 64, 112, 112]         9,408
├─BatchNorm2d: 1-2                       [1, 64, 112, 112]         128
├─ReLU: 1-3                              [1, 64, 112, 112]         --
├─MaxPool2d: 1-4                         [1, 64, 56, 56]           --
├─Sequential: 1-5                        [1, 64, 56, 56]           --
│    └─BasicBlock: 2-1                   [1, 64, 56, 56]           --
│    │    └─Conv2d: 3-1                  [1, 64, 56, 56]           36,864
│    │    └─BatchNorm2d: 3-2             [1, 64, 56, 56]           128
│    │    └─ReLU: 3-3                    [1, 64, 56, 56]           --
│    │    └─Conv2d: 3-4                  [1, 64, 56, 56]           36,864
│    │    └─BatchNorm2d: 3-5             [1, 64, 56, 56]           128
│    │    └─ReLU: 3-6                    [1, 64, 56, 56]           --
│
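
The parameter counts in the summary can be checked by hand. The sketch below rebuilds three of the listed layers in isolation and counts their parameters. It assumes the standard ResNet-18 settings: a 7 × 7 stem convolution and 3 × 3 block convolutions, all without bias.

```python
import torch.nn as nn

def count_params(module):
    """Total number of elements across all parameters of a module."""
    return sum(p.numel() for p in module.parameters())

# Stand-alone copies of three layers from the summary (assumed settings)
stem_conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
stem_bn = nn.BatchNorm2d(64)
block_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)

print(count_params(stem_conv))   # 64 * 3 * 7 * 7 = 9408
print(count_params(stem_bn))     # 64 scales + 64 shifts = 128
print(count_params(block_conv))  # 64 * 64 * 3 * 3 = 36864
```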

### ResNeXt

- One of the challenges in the design of ResNet is the trade-off between nonlinearity and dimensionality within a given block.
    - That is, we could add more nonlinearity by increasing the number of layers, or by increasing the width of the convolutions.
    - An alternative strategy is to increase the number of channels that can carry information between blocks. But this comes with a quadratic penalty: the cost of ingesting $c_{i}$ channels and emitting $c_{o}$ channels is proportional to $O(c_{i} \cdot c_{o})$.

<br>

- Taking inspiration from the Inception block, **ResNeXt applies multiple independent groups to the ResNet block**, so that information flows through the block in separate groups.
- Different from the smorgasbord of transformations in Inception, **ResNeXt** adopts the same transformation in all branches, thus minimizing the need for manual tuning of each branch.
  

<img src="img/resnext.png" width=500 height=500 />

- Breaking up a convolution from $c_{i}$ to $c_{o}$ channels into $g$ groups, each mapping $c_{i}/g$ input channels to $c_{o}/g$ output channels, is called **a grouped convolution**.
- The computational cost is reduced from $O(c_{i} \cdot c_{o})$ to $O(g \cdot (c_{i}/g) \cdot (c_{o}/g)) = O(c_{i} \cdot c_{o} / g)$, i.e., it is $g$ times faster.
- The number of parameters needed to generate the output is also reduced from a $c_{i} \times c_{o}$ matrix to $g$ smaller matrices of size $(c_{i}/g) \times (c_{o}/g)$, again a $g$-fold reduction.

- The only challenge in this design is that no information is exchanged between the $g$ groups.
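
The $g$-fold reduction in parameters can be checked directly with the `groups` argument of `nn.Conv2d`. This is a small sketch in which the channel sizes and the number of groups are illustrative choices, not values from the text.

```python
import torch
import torch.nn as nn

c_in, c_out, g = 64, 128, 4  # illustrative sizes

dense_conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)
grouped_conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1,
                         groups=g, bias=False)

dense_params = dense_conv.weight.numel()      # c_out * c_in * 3 * 3
grouped_params = grouped_conv.weight.numel()  # c_out * (c_in / g) * 3 * 3
print(dense_params, grouped_params, dense_params // grouped_params)
# 73728 18432 4

# Both layers still map c_in channels to c_out channels
X = torch.randn(2, c_in, 8, 8)
same_shape = dense_conv(X).shape == grouped_conv(X).shape
```

Each output channel of the grouped layer only sees the $c_{i}/g$ input channels of its own group, which is exactly why no information crosses group boundaries.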

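**The ResNeXt block** addresses the lack of information exchange between groups in two ways:

- The grouped convolution with a 3 × 3 kernel is sandwiched in between two 1 × 1 convolutions, which mix information across all channels and therefore across the $g$ groups.
- The second 1 × 1 convolution serves double duty by also changing the number of channels back to that of the block output.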

In [13]:
class ResNeXtBlock(nn.Module):
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
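        # Width of the bottleneck, i.e., of the grouped 3x3 convolution
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nn.LazyConv2d(bot_channels, kernel_size=1, stride=1)
        # Note: the `groups` argument of this class acts as the number of
        # channels per group, so the convolution uses bot_channels // groups
        # groups
        self.conv2 = nn.LazyConv2d(bot_channels, kernel_size=3,
                                   stride=strides, padding=1,
                                   groups=bot_channels//groups)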
        self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1, stride=1)
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()
        self.bn3 = nn.LazyBatchNorm2d()
        if use_1x1conv:
            self.conv4 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
            self.bn4 = nn.LazyBatchNorm2d()
        else:
            self.conv4 = None
    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = F.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return F.relu(Y + X)

## 2. DenseNet

- ResNet decomposes a function $f$ into a simple linear term $x$ and a more complex nonlinear term $g(x)$:


>> $f(x) = g(x) + x$

What if we wanted to capture (not necessarily add) information beyond two terms? One such solution is DenseNet.

<img src="img/res_vs_dens.png" width=300 height=300 />

- We perform a mapping from $x$ to its values after applying an increasingly complex sequence of functions, concatenating the results instead of adding them:

<img src="img/denseq.png" width=400 height=400 />

- The name DenseNet arises from the fact that the dependency graph between variables becomes quite dense. The final layer of such a chain is densely connected to all previous layers.

<img src="img/densgraph.png" width=300 height=300 />

>>> Note how the dimensionality increases with depth.
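
The difference between the two designs can be sketched in a few lines. The layers `g`, `f1`, and `f2` below are hypothetical stand-ins with arbitrarily chosen channel counts: the ResNet-style path adds, so the channel count stays fixed, while the DenseNet-style path concatenates, so it grows with every step.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 3, 8, 8)

# ResNet-style: g must keep the channel count so that g(x) + x is defined
g = nn.Conv2d(3, 3, kernel_size=3, padding=1)
res_out = g(x) + x

# DenseNet-style: each function sees everything computed so far
f1 = nn.Conv2d(3, 5, kernel_size=3, padding=1)
f2 = nn.Conv2d(3 + 5, 5, kernel_size=3, padding=1)
h1 = torch.cat((x, f1(x)), dim=1)    # [x, f1(x)]
h2 = torch.cat((h1, f2(h1)), dim=1)  # [x, f1(x), f2([x, f1(x)])]

print(res_out.shape[1], h1.shape[1], h2.shape[1])
# 3 8 13
```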

The main components that comprise a DenseNet are **dense blocks** and **transition layers**.

**dense blocks**

- define how the inputs and outputs are concatenated.



In [15]:
def conv_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=3, padding=1))

In [17]:
class DenseBlock(nn.Module):
    def __init__(self, num_convs, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate input and output of each block along the channels
            X = torch.cat((X, Y), dim=1)
        return X

In [18]:
blk = DenseBlock(2, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape



torch.Size([4, 23, 8, 8])

**transition layers**

- Transition layers control the number of channels so that it does not grow too large, since the expansion $x \rightarrow [x, f_{1}(x), f_{2}([x, f_{1}(x)]), \ldots]$ can be quite high-dimensional.

- A transition layer reduces the number of channels by using a 1 × 1 convolution.
- Moreover, it halves the height and width via average pooling with a stride of 2.


In [19]:
def transition_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))

In [20]:
blk = transition_block(10)
blk(Y).shape

torch.Size([4, 10, 4, 4])

### DenseNet Model

In [None]:
class DenseNet(nn.Module):
    def __init__(self, num_channels=64, growth_rate=32, arch=(4, 4, 4, 4),
                 lr=0.1, num_classes=10):
        super().__init__()
        # Kept so that a training loop can read the learning rate from the model
        self.lr = lr
        self.net = nn.Sequential(self.b1())
        for i, num_convs in enumerate(arch):
            self.net.add_module(f'dense_blk{i+1}', DenseBlock(num_convs,
                                                              growth_rate))
            # The number of output channels of the dense block just added
            num_channels += num_convs * growth_rate

            # A transition layer that halves the number of channels is added
            # between the dense blocks
            if i != len(arch) - 1:
                num_channels //= 2
                self.net.add_module(f'tran_blk{i+1}', transition_block(
                    num_channels))

        self.net.add_module('last', nn.Sequential(
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
            nn.LazyLinear(num_classes)))
        # No weight initialization here: lazy layers have no materialized
        # weights until the first forward pass, so a custom scheme such as
        # Xavier would have to be applied after a dummy forward pass.

    def b1(self):
        # Stem: 7x7 convolution followed by 3x3 max-pooling
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

    def forward(self, X):
        return self.net(X)

- Here DenseNet uses four dense blocks, each with four convolutional layers (`arch=(4, 4, 4, 4)`).
- We set the number of channels (i.e., **growth rate**) for the convolutional layers in the dense block to 32, so each dense block adds 4 × 32 = 128 channels.
- Between dense blocks, we use a transition layer to halve the height and width and halve the number of channels.
- A global pooling layer and a fully connected layer are connected at the end to produce the output.
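
The channel bookkeeping described above can be traced without building the network. The helper below is a hypothetical sketch that repeats only the arithmetic of the constructor, using the same default values (64 initial channels, growth rate 32, four blocks of four convolutions).

```python
def densenet_channel_trace(num_channels=64, growth_rate=32,
                           arch=(4, 4, 4, 4)):
    """Return channel counts after each dense block and each transition."""
    after_dense, after_transition = [], []
    for i, num_convs in enumerate(arch):
        # Every convolution in a dense block appends growth_rate channels
        num_channels += num_convs * growth_rate
        after_dense.append(num_channels)
        # A transition layer follows every dense block except the last
        if i != len(arch) - 1:
            num_channels //= 2
            after_transition.append(num_channels)
    return after_dense, after_transition

after_dense, after_transition = densenet_channel_trace()
print(after_dense)       # [192, 224, 240, 248]
print(after_transition)  # [96, 112, 120]
```

Without the transition layers, the same settings would reach 64 + 4 × 128 = 576 channels by the end of the last dense block.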