# Residual networks

In [1]:
import torch

First, let's start with a network that has layer normalization added.

In [None]:
class MyModelLN(torch.nn.Module):
    def __init__(self, layer_size = [512, 512, 512]):
        super().__init__()
        layers = []
        layers.append(torch.nn.Flatten())
        c = 128 * 128 * 3
        for s in layer_size:
            # the linear layer needs no bias: the layer norm that follows
            # it learns its own shift
            layers.append(torch.nn.Linear(c, s, bias=False))
            layers.append(torch.nn.LayerNorm(s))
            layers.append(torch.nn.ReLU())
            c = s
        layers.append(torch.nn.Linear(c, 102, bias=False))
        self.model = torch.nn.Sequential(*layers)


    def forward(self, x):
        return self.model(x)

Residual networks are no longer purely sequential. They add "skip connections" that carry the input of a group of sequential layers around that group and add it to the group's output.

## Benefits of residual networks

Residual networks reframe the function mapping as `f(x) = x + g(x)`. With this reformulation the layers only have to learn the residual `g(x)` of the network, which has some important benefits when training deep networks. The main ones are mitigating the vanishing gradient problem and letting the network effectively "skip" layers: when a residual is set to 0, that part of the network becomes the identity function.
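As a small sanity check of the identity claim, here is a minimal sketch. `TinyResidual` is a hypothetical block written only for this illustration (it is not the model built later in this notebook); its last layer is zeroed by hand so that `g(x) = 0`.

```python
import torch


class TinyResidual(torch.nn.Module):
    """Hypothetical block computing f(x) = x + g(x)."""

    def __init__(self, dim):
        super().__init__()
        self.g = torch.nn.Sequential(
            torch.nn.Linear(dim, dim),
            torch.nn.ReLU(),
            torch.nn.Linear(dim, dim),
        )
        # Zero the last layer so that g(x) = 0 for every input.
        torch.nn.init.zeros_(self.g[-1].weight)
        torch.nn.init.zeros_(self.g[-1].bias)

    def forward(self, x):
        return x + self.g(x)


torch.manual_seed(0)
block = TinyResidual(16)
x = torch.randn(4, 16)

# With a zero residual the block passes its input through unchanged.
is_identity = torch.equal(block(x), x)
print(is_identity)
```

A plain stack of the same layers without the `x +` term would have to learn weights that reproduce the input, which is a much harder target than driving `g` to zero.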

In practice, residual connections raise the trainable depth from roughly 20 layers at most (with normalization) to more than 1000 layers.

Here's what ChatGPT says about it (good summary actually):

### Practical Perspective
#### Ease of Training:

**Gradient Flow**: Residual connections help in mitigating the vanishing gradient problem by providing a direct path for gradients to flow back through the network during backpropagation. This ensures that even very deep networks can be trained effectively.

**Stable Gradients**: By allowing gradients to bypass certain layers, residual connections prevent the gradients from becoming too small (vanishing) or too large (exploding).
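The gradient-flow argument can be checked on a toy example. The sketch below is illustrative only, not a benchmark: `first_layer_grad_norm` is a hypothetical helper, and the depth, width, PyTorch default initialization, lack of normalization, and dummy loss are all assumptions chosen to keep it short.

```python
import torch


def first_layer_grad_norm(depth, residual, dim=64, seed=0):
    """Return the gradient norm at the first layer's weights of a deep ReLU MLP.

    Hypothetical helper for illustration: PyTorch default initialization,
    no normalization layers, and a dummy loss (the sum of the outputs).
    """
    torch.manual_seed(seed)
    layers = [torch.nn.Linear(dim, dim) for _ in range(depth)]
    h = torch.randn(8, dim)
    for layer in layers:
        out = torch.relu(layer(h))
        # With residual=True each layer adds to its input instead of replacing it.
        h = h + out if residual else out
    h.sum().backward()
    return layers[0].weight.grad.norm().item()


plain_grad = first_layer_grad_norm(depth=30, residual=False)
residual_grad = first_layer_grad_norm(depth=30, residual=True)
print(f"plain: {plain_grad:.3e}   residual: {residual_grad:.3e}")
```

Under these assumptions the plain stack's first-layer gradient should come out many orders of magnitude smaller than the residual stack's. The exact numbers depend on the seed and the initialization.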

#### Improved Convergence:

**Faster Convergence**: Residual networks often converge faster compared to plain networks because the residual connections facilitate more efficient gradient propagation.

**Ease of Optimization**: The skip connections make the optimization landscape smoother, which allows for better and more efficient training.

#### Enhanced Feature Propagation:

**Direct Information Flow**: Residual connections allow information to flow directly from earlier to later layers, which helps in preserving important features that might otherwise be lost through the depth of the network.

**Better Feature Utilization**: Layers can focus on learning residual mappings (the difference between the input and the desired output) rather than the full transformation, making it easier for each layer to learn.

### Theoretical Perspective
#### Identity Mapping:

**Learning Identity Functions**: If the identity mapping (i.e., passing input directly to output) is optimal, residual connections make it easier for the network to learn this mapping. Without residual connections, the network would have to learn this mapping through multiple nonlinear transformations, which is harder.

**Expressive Power**: Residual connections increase the expressive power of the network by allowing the network to model both the identity mapping and the residuals. This flexibility makes it easier for the network to learn complex functions.

#### Depth and Model Complexity:

**Depth with Stability**: Deeper networks can potentially represent more complex functions, but training them is challenging due to issues like vanishing/exploding gradients. Residual connections enable very deep networks to be trained by stabilizing the gradient flow.

**Model Regularization**: Residual connections can act as a form of implicit regularization, preventing the model from overfitting by making the network favor learning simpler functions that build upon the identity mapping.

#### Mathematical Insights:

**Residual Function**: The idea is to reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. Mathematically, for a group of layers computing $F(x)$ on input $x$, the network's output is $H(x) = x + F(x)$, where the $x$ term is carried by the identity (skip) connection and $F(x) = H(x) - x$ is the residual that is learned.

**Ease of Optimization**: Learning residuals is often easier because the network only needs to learn the modifications or differences from the identity function, which is a simpler and more stable task.
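
To see why the identity path helps gradients, one can differentiate the residual formulation directly. The sketch below assumes a stack of residual blocks $x_{l+1} = x_l + F_l(x_l)$ for $l = 0, \dots, L-1$ with a loss $\mathcal{L}$ on the final output $x_L$, and writes $J_i = \partial F_i / \partial x_i$. The indices, $\mathcal{L}$, and $J_i$ are notation introduced here for illustration.

$$
\frac{\partial H}{\partial x} = I + \frac{\partial F}{\partial x},
\qquad
\frac{\partial \mathcal{L}}{\partial x_l}
= \frac{\partial \mathcal{L}}{\partial x_L}\,(I + J_{L-1})(I + J_{L-2})\cdots(I + J_l)
$$

Expanding the product gives $I$ plus terms that each contain at least one $J_i$. The identity term means $\partial \mathcal{L} / \partial x_L$ reaches $x_l$ unattenuated, even when the individual Jacobians $J_i$ are small. In a plain stack the product is $J_{L-1} \cdots J_l$ with no identity term, so small Jacobians shrink the gradient multiplicatively.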


## Implementing the residual network

Residual networks are no longer sequential, so we need to update our model design accordingly.

We'll do this by creating a `ResidualBlock` class that bundles a linear layer, layer normalization, and the activation function, and then adds the residual connection around them.

In [8]:
class MyModelLN(torch.nn.Module):
    class ResidualBlock(torch.nn.Module):
        def __init__(self, in_channels, out_channels):
            super().__init__()
            self.model = torch.nn.Sequential(
                torch.nn.Linear(in_channels, out_channels),
                torch.nn.LayerNorm(out_channels),
                torch.nn.ReLU()
            )

            # we need to check if the shapes of the input and output channels
            # are the same. If they are, we can directly add +x in the forward
            # function. Otherwise, we need to apply a linear transformation to x
            # to make the shapes match.
            if in_channels != out_channels:
                # add a linear layer to the skip connection
                self.skip = torch.nn.Linear(in_channels, out_channels)
            else:
                # do not use a linear layer, but just return the input x
                self.skip = torch.nn.Identity()

        def forward(self, x):
            residual_connection = self.skip(x)
            return self.model(x) + residual_connection


    def __init__(self, layer_size = [512, 512, 512]):
        super().__init__()
        layers = []
        layers.append(torch.nn.Flatten())
        c = 128 * 128 * 3
        # normally you don't start with blocks immediately. You normally have
        # either a linear layer or a convolutional layer at the beginning.
        # For example, in images you'll have a convolutional layer in the
        # beginning, whereas in language models you'll have an embedding layer
        layers.append(torch.nn.Linear(c, 512, bias=False))
        c = 512
        for s in layer_size:
            layers.append(self.ResidualBlock(c, s))
            c = s
        layers.append(torch.nn.Linear(c, 102, bias=False))
        self.model = torch.nn.Sequential(*layers)


    def forward(self, x):
        return self.model(x)

Now let's create a batch of random inputs and instantiate our network.

In [9]:
x = torch.randn(10, 3, 128, 128)

In [10]:
net = MyModelLN([512]*4)
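
A forward pass is a quick way to confirm that the shapes line up. The sketch below restates the block in condensed form so that it runs on its own; the widths `[512, 256, 256]` are a hypothetical choice (not the `[512]*4` used above), picked so that both kinds of skip connection appear.

```python
import torch


class ResidualBlock(torch.nn.Module):
    """Condensed restatement of the block defined above."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.model = torch.nn.Sequential(
            torch.nn.Linear(in_channels, out_channels),
            torch.nn.LayerNorm(out_channels),
            torch.nn.ReLU(),
        )
        # Project the input only when the widths differ.
        if in_channels != out_channels:
            self.skip = torch.nn.Linear(in_channels, out_channels)
        else:
            self.skip = torch.nn.Identity()

    def forward(self, x):
        return self.model(x) + self.skip(x)


torch.manual_seed(0)
widths = [512, 256, 256]  # hypothetical widths, chosen to exercise both skip types
blocks = []
c = 512
for s in widths:
    blocks.append(ResidualBlock(c, s))
    c = s
net = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 128 * 128, 512, bias=False),
    *blocks,
    torch.nn.Linear(c, 102, bias=False),
)

x = torch.randn(10, 3, 128, 128)
with torch.no_grad():  # shape check only, no gradients needed
    logits = net(x)

skip_types = [type(b.skip).__name__ for b in blocks]
print(logits.shape)  # torch.Size([10, 102])
print(skip_types)    # ['Identity', 'Linear', 'Identity']
```

Only the block that changes width (512 to 256) gets a learned projection on its skip path. The other two add their input back unchanged.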

Now that we've generated our network, let's train.