# 4.4 Adding shortcut connections

Next, let’s discuss the concept behind shortcut connections, which are also known as skip or residual connections.
Initially, shortcut connections were proposed for deep networks in computer vision (particularly residual networks) to alleviate the challenge of vanishing gradients.
The vanishing gradient problem refers to gradients (which guide weight updates during training) becoming progressively smaller as they propagate backward through the layers, making it difficult to train the earlier layers effectively, as shown in Figure 4.12.

**Figure 4.12 A comparison between a deep neural network consisting of 5 layers without shortcut connections (left) and with shortcut connections (right).
A shortcut connection involves adding the input of a layer to its output, effectively creating an alternate path that bypasses certain layers.
The gradients shown in Figure 4.12 represent the mean absolute gradient for each layer, which we will calculate in the code example below.**

![fig4.12](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-4-12.jpg?raw=true)

As shown in Figure 4.12, shortcut connections create a shorter alternate path for gradients to flow through the network by skipping one or more layers. This is achieved by adding the output of one layer to the output of a later layer.
This is why these connections are also called skip connections.
They play a vital role in backpropagation during training to keep the gradients flowing.
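
To see why the added identity path matters, consider a minimal scalar sketch (not the network from Figure 4.12). Each hypothetical layer multiplies its input by a weight of 0.5, a value chosen purely for illustration. By the chain rule, the gradient of the output with respect to the input is the product of the per-layer derivatives, which is w without a shortcut and 1 + w with one.

```python
# Toy scalar illustration of the chain rule through 5 stacked layers.
# Assumption: each layer is y = w * x with w = 0.5 (illustrative value only).

def input_gradient(weights, use_shortcut):
    """Return d(output)/d(input) for a chain of scalar linear layers."""
    grad = 1.0
    for w in weights:
        # Plain layer:    y = w * x      -> dy/dx = w
        # Shortcut layer: y = x + w * x  -> dy/dx = 1 + w
        local_derivative = 1.0 + w if use_shortcut else w
        grad *= local_derivative
    return grad


weights = [0.5] * 5
grad_plain = input_gradient(weights, use_shortcut=False)
grad_shortcut = input_gradient(weights, use_shortcut=True)

print(grad_plain)     # 0.03125
print(grad_shortcut)  # 7.59375
```

The constant 1 contributed by the shortcut keeps every factor away from zero. In this toy setting the product even grows, so the sketch only illustrates that the identity term prevents the factors from collapsing toward zero; it says nothing about how large gradients become in a real network.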

In the following code example, we implement the neural network shown in Figure 4.12 and see how to add shortcut connections in the forward method:

### Code Example 4.5 A neural network to illustrate shortcut connections

In [1]:
import torch
import torch.nn as nn
from torch.nn import GELU


class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
            # Implement a 5-layer neural network
            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
        ])

    def forward(self, x):
        for layer in self.layers:
            # Calculate the output of the current layer
            layer_output = layer(x)
            # Check if shortcut connection can be applied
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x

This code implements a deep neural network with 5 layers, each consisting of a linear layer and a GELU activation function.
In the forward pass, we iteratively pass the input through the layers and, if the self.use_shortcut attribute is set to True, add the shortcut connections shown in Figure 4.12.

Let's use this code to initialize a neural network without shortcut connections.
Each of the first four layers accepts an example with 3 input values and returns 3 output values.
The last layer returns a single output value:

In [2]:
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])
torch.manual_seed(123)  # Specify a random seed for the initial weights for reproducibility
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)
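
Because the forward method only adds the shortcut when the input and output shapes match, the layer sizes above determine which layers can use it. The following sketch mirrors that check with plain integers. The helper name shortcut_eligibility is hypothetical and not part of the original code, and it assumes the batch dimension is unchanged so that only the layer widths need to be compared.

```python
def shortcut_eligibility(layer_sizes):
    """For each layer, report whether its input and output widths match."""
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return [n_in == n_out for n_in, n_out in pairs]


layer_sizes = [3, 3, 3, 3, 3, 1]
eligible = shortcut_eligibility(layer_sizes)
print(eligible)  # [True, True, True, True, False]
```

The final layer maps 3 values to 1, so even with use_shortcut=True it falls through to the plain branch and returns only the layer output.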

Next, we implement a function that computes the gradients in the model's backward pass:

In [3]:
def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])

    # Calculate the loss based on how close the target and output are
    loss_fn = nn.MSELoss()
    loss = loss_fn(output, target)

    # Backpropagate to compute gradients
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights as a plain float
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")

In the code above, we specify a loss function that measures how close the model output is to a user-specified target (here 0, for simplicity).
Then, when loss.backward() is called, PyTorch computes the gradient of the loss with respect to each parameter in the model.
We can iterate over the weight parameters via model.named_parameters().
Suppose a given layer has a 3×3 weight matrix.
In that case, the layer has 3×3 gradient values, and we print the mean absolute gradient of these values to obtain a single number per layer, which makes the gradients easier to compare across layers.
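
As a small worked example of this reduction, the sketch below averages the absolute values of a 3×3 matrix in plain Python. The gradient values are made up for illustration and do not come from the model above.

```python
# Hypothetical 3x3 gradient values, chosen only for illustration.
grad = [
    [0.5, -0.5, 0.25],
    [-0.25, 1.0, -1.0],
    [0.0, 0.5, -0.5],
]

# Flatten, take absolute values, and average: the same reduction that
# param.grad.abs().mean() performs on a tensor.
abs_values = [abs(g) for row in grad for g in row]
mean_abs_grad = sum(abs_values) / len(abs_values)

# For comparison, the signed mean lets positive and negative entries cancel.
signed_mean = sum(g for row in grad for g in row) / len(abs_values)

print(mean_abs_grad)  # 0.5
print(signed_mean)    # 0.0
```

Taking absolute values first means that entries of opposite sign do not cancel, so the resulting number reflects the typical gradient magnitude in the layer.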

In short, the .backward() method is a convenient way in PyTorch to calculate the loss gradients required during model training, without having to implement the mathematical operations of gradient calculations ourselves, making the use of deep neural networks easier.
If you are not familiar with the concepts of gradients and neural network training, we recommend reading sections A.4 "Automatic differentiation made easy" and A.7 "Typical training loop" in Appendix A.
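
As a rough picture of what .backward() automates, the sketch below differentiates a squared-error loss for a single scalar weight by hand and checks the result against a finite-difference estimate. The values of w, x, and target are arbitrary assumptions, and this one-weight model is far simpler than the network above.

```python
def loss_fn(w, x, target):
    """Squared error of a one-weight linear model: (w * x - target) ** 2."""
    return (w * x - target) ** 2


def analytic_grad(w, x, target):
    """Derivative of the loss with respect to w, derived via the chain rule."""
    return 2.0 * (w * x - target) * x


w, x, target = 0.5, 2.0, 0.0  # arbitrary illustrative values

# Central finite difference as an independent numerical check
eps = 1e-6
numeric_grad = (loss_fn(w + eps, x, target) - loss_fn(w - eps, x, target)) / (2 * eps)

print(analytic_grad(w, x, target))  # 4.0
print(abs(analytic_grad(w, x, target) - numeric_grad) < 1e-6)  # True
```

Automatic differentiation applies the same chain rule to every parameter in the model, which is why no hand-derived formulas like analytic_grad are needed in the training code.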

Now let's apply the print_gradients function to the model without shortcut connections:

In [4]:
print_gradients(model_without_shortcut, sample_input)

The output looks like this:

layers.0.0.weight has gradient mean of 0.00020173587836325169 \
layers.1.0.weight has gradient mean of 0.0001201116101583466 \
layers.2.0.weight has gradient mean of 0.0007152041653171182 \
layers.3.0.weight has gradient mean of 0.001398873864673078 \
layers.4.0.weight has gradient mean of 0.005049646366387606

From the output of the print_gradients function, we can see that the gradients become smaller as we move from the last layer (layers.4) to the first layer (layers.0): the first-layer gradient is roughly 25 times smaller than that of the last layer. This phenomenon is called the vanishing gradient problem.

Now, let's instantiate a model with shortcut connections and see how it differs from the previous model:

In [5]:
torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)

The output looks like this:

layers.0.0.weight has gradient mean of 0.22169792652130127 \
layers.1.0.weight has gradient mean of 0.20694105327129364 \
layers.2.0.weight has gradient mean of 0.32896995544433594 \
layers.3.0.weight has gradient mean of 0.2665732502937317 \
layers.4.0.weight has gradient mean of 1.3258541822433472

As the output shows, the last layer (layers.4) still has a larger gradient than the other layers.
However, the gradient values stabilize as we move toward the first layer (layers.0) and no longer shrink to vanishingly small values.
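
One way to quantify this difference is to compare the ratio between the last-layer and first-layer gradients in both runs. The sketch below uses the values reported after "The output looks like this:" for each model; the helper name last_to_first_ratio is hypothetical.

```python
# Mean absolute gradients copied from the two reported outputs
without_shortcut = {
    "layers.0": 0.00020173587836325169,
    "layers.4": 0.005049646366387606,
}
with_shortcut = {
    "layers.0": 0.22169792652130127,
    "layers.4": 1.3258541822433472,
}


def last_to_first_ratio(grads):
    """How many times larger the last-layer gradient is than the first-layer one."""
    return grads["layers.4"] / grads["layers.0"]


ratio_without = last_to_first_ratio(without_shortcut)
ratio_with = last_to_first_ratio(with_shortcut)

print(round(ratio_without, 1))  # 25.0
print(round(ratio_with, 1))     # 6.0
```

With shortcut connections, the gap between the last and first layers narrows from a factor of about 25 to about 6 in this particular run. These numbers depend on the random seed and the tiny network used here, so they should be read as an illustration rather than a general result.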

In summary, shortcut connections are important for overcoming the limitations imposed by the vanishing gradient problem in deep neural networks.
Shortcut connections are a core building block of large models such as LLMs, and when we train the GPT model in the next chapter, they will help facilitate more efficient training by ensuring consistent gradient flow across layers.

Having introduced shortcut connections, we will now combine all the previously introduced concepts (layer normalization, GELU activations, the feed-forward module, and shortcut connections) into a transformer block in the next section, which is the final building block we need to code the GPT architecture.