# Layers and Modules
:label:`sec_model_construction`

When we first introduced neural networks,
we focused on linear models with a single output.
Here, the entire model consists of just a single neuron.
Note that a single neuron
(i) takes some set of inputs;
(ii) generates a corresponding scalar output;
and (iii) has a set of associated parameters that can be updated
to optimize some objective function of interest.
Then, once we started thinking about networks with multiple outputs,
we leveraged vectorized arithmetic
to characterize an entire layer of neurons.
Just like individual neurons,
layers (i) take a set of inputs,
(ii) generate corresponding outputs,
and (iii) are described by a set of tunable parameters.
When we worked through softmax regression,
a single layer was itself the model.
However, even when we subsequently
introduced MLPs,
we could still think of the model as
retaining this same basic structure.

Interestingly, for MLPs,
both the entire model and its constituent layers
share this structure.
The entire model takes in raw inputs (the features),
generates outputs (the predictions),
and possesses parameters
(the combined parameters from all constituent layers).
Likewise, each individual layer ingests inputs
(supplied by the previous layer),
generates outputs (the inputs to the subsequent layer),
and possesses a set of tunable parameters that are updated
according to the signal that flows backwards
from the subsequent layer.


While you might think that neurons, layers, and models
give us enough abstractions to go about our business,
it turns out that we often find it convenient
to speak about components that are
larger than an individual layer
but smaller than the entire model.
For example, the ResNet-152 architecture,
which is wildly popular in computer vision,
possesses hundreds of layers.
These layers consist of repeating patterns of *groups of layers*. Implementing such a network one layer at a time can grow tedious.
This concern is not just hypothetical---such
design patterns are common in practice.
The ResNet architecture mentioned above
won the 2015 ImageNet and COCO computer vision competitions
for both recognition and detection :cite:`He.Zhang.Ren.ea.2016`
and remains a go-to architecture for many vision tasks.
Similar architectures in which layers are arranged
in various repeating patterns
are now ubiquitous in other domains,
including natural language processing and speech.

To implement these complex networks,
we introduce the concept of a neural network *module*.
A module could describe a single layer,
a component consisting of multiple layers,
or the entire model itself!
One benefit of working with the module abstraction
is that modules can be combined into larger artifacts,
often recursively. This is illustrated in :numref:`fig_blocks`.
By defining code to generate modules
of arbitrary complexity on demand,
we can write surprisingly compact code
and still implement complex neural networks.

![Multiple layers are combined into modules, forming repeating patterns of larger models.](../img/blocks.svg)
:label:`fig_blocks`


From a programming standpoint, a module is represented by a *class*.
Any subclass of it must define a forward propagation method
that transforms its input into output
and must store any necessary parameters.
Note that some modules do not require any parameters at all.
Finally, a module must possess a backpropagation method,
for purposes of calculating gradients.
Fortunately, due to some behind-the-scenes magic
supplied by automatic differentiation
(introduced in :numref:`sec_autograd`),
when defining our own module
we only need to worry about parameters
and the forward propagation method.
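
As a minimal sketch of this division of labor, the hypothetical `Scale` module below (not part of the original text) defines only a parameter and a `forward` method. The gradient is then obtained from automatic differentiation by calling `backward` on a scalar output.

```python
import torch
from torch import nn

class Scale(nn.Module):
    """Hypothetical module that multiplies its input by a learnable scalar."""
    def __init__(self):
        super().__init__()
        # Wrapping a tensor in nn.Parameter registers it with the module
        self.weight = nn.Parameter(torch.tensor(2.0))

    def forward(self, inputs):
        return self.weight * inputs

layer = Scale()
inputs = torch.tensor([1.0, 2.0, 3.0])
loss = layer(inputs).sum()  # scalar output: weight * (1 + 2 + 3)
loss.backward()             # no hand-written backward method is needed
grad = layer.weight.grad    # d(loss)/d(weight) = sum of the inputs
```

No backpropagation code appears anywhere in `Scale`; the gradient with respect to `weight` is derived from the operations recorded during the forward pass.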


In [1]:
import torch
from torch import nn
from torch.nn import functional as F

[**To begin, we revisit the code
that we used to implement MLPs**]
(:numref:`sec_mlp`).
The following code generates a network
with one fully connected hidden layer
with 256 units and ReLU activation,
followed by a fully connected output layer
with ten units (no activation function).


In [2]:
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

X = torch.rand(2, 20)
net(X).shape

torch.Size([2, 10])

In this example, we constructed
our model by instantiating an `nn.Sequential`, with layers in the order
that they should be executed passed as arguments.
In short, (**`nn.Sequential` defines a special kind of `Module`**),
the class that represents a module in PyTorch.
It maintains an ordered list of constituent `Module`s.
Note that each of the two fully connected layers is an instance of the `Linear` class
which is itself a subclass of `Module`.
The forward propagation (`forward`) method is also remarkably simple:
it chains each module in the list together,
passing the output of each as input to the next.
Note that until now, we have been invoking our models
via the construction `net(X)` to obtain their outputs.
This is actually just shorthand for `net.__call__(X)`.
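
The following sketch checks this equivalence on a small network. It uses `nn.Linear` with explicit input sizes instead of `nn.LazyLinear` purely to keep the example self-contained, and the names `demo_net` and `demo_X` are introduced here for illustration only.

```python
import torch
from torch import nn

demo_net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
demo_X = torch.rand(2, 20)

out_call = demo_net(demo_X)             # the usual shorthand
out_dunder = demo_net.__call__(demo_X)  # what the shorthand expands to
out_forward = demo_net.forward(demo_X)  # skips hooks; shown for comparison only

same = torch.equal(out_call, out_dunder) and torch.equal(out_call, out_forward)
# Every entry of the container is itself a Module
all_modules = all(isinstance(layer, nn.Module) for layer in demo_net)
```

In practice `net(X)` is preferable to `net.forward(X)`, because `__call__` also runs any hooks registered on the module before and after `forward`.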


## [**A Custom Module**]

Perhaps the easiest way to develop intuition
about how a module works
is to implement one ourselves.
Before we do that,
we briefly summarize the basic functionality
that each module must provide:


1. Ingest input data as arguments to its forward propagation method.
1. Generate an output by having the forward propagation method return a value. Note that the output may have a different shape from the input. For example, the first fully connected layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
1. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation method. Typically this happens automatically.
1. Store and provide access to those parameters necessary
   for executing the forward propagation computation.
1. Initialize model parameters as needed.


In the following snippet,
we code up a module from scratch
corresponding to an MLP
with one hidden layer with 256 hidden units,
and a 10-dimensional output layer.
Note that the `MLP` class below inherits from the class that represents a module.
We will heavily rely on the parent class's methods,
supplying only our own constructor (the `__init__` method in Python) and the forward propagation method.


In [3]:
class MLP(nn.Module):
    def __init__(self):
        # Call the constructor of the parent class nn.Module to perform
        # the necessary initialization
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input X
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

In [4]:
class MyMLP(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)
        
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

Let's first focus on the forward propagation method.
Note that it takes `X` as input,
calculates the hidden representation
with the activation function applied,
and outputs its logits.
In this `MLP` implementation,
both layers are instance variables.
To see why this is reasonable, imagine
instantiating two MLPs, `net1` and `net2`,
and training them on different data.
Naturally, we would expect them
to represent two different learned models.
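
A short sketch of this point follows. It restates the `MLP` design above with `nn.Linear` and an assumed input size of 20, so that the parameters exist right after construction; the class name `SmallMLP` is hypothetical.

```python
import torch
from torch import nn
from torch.nn import functional as F

class SmallMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(20, 256)
        self.out = nn.Linear(256, 10)

    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

net1, net2 = SmallMLP(), SmallMLP()

# Layers are instance attributes, so each network owns its own tensors
shares_storage = net1.hidden.weight.data_ptr() == net2.hidden.weight.data_ptr()

# Changing net1 in place leaves net2 untouched
before = net2.hidden.weight.clone()
with torch.no_grad():
    net1.hidden.weight.zero_()
net2_unchanged = torch.equal(net2.hidden.weight, before)
```

Had the layers been shared between instances (for example as class attributes), training `net1` would silently change `net2` as well.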

We [**instantiate the MLP's layers**]
in the constructor
(**and subsequently invoke these layers**)
on each call to the forward propagation method.
Note a few key details.
First, our customized `__init__` method
invokes the parent class's `__init__` method
via `super().__init__()`
sparing us the pain of restating
boilerplate code applicable to most modules.
We then instantiate our two fully connected layers,
assigning them to `self.hidden` and `self.out`.
Note that unless we implement a new layer,
we need not worry about the backpropagation method
or parameter initialization.
The system will generate these methods automatically.
Let's try this out.


In [5]:
net = MLP()
net(X).shape

torch.Size([2, 10])

In [6]:
my_net = MyMLP()
my_net(X).shape

torch.Size([2, 10])

A key virtue of the module abstraction is its versatility.
We can subclass a module to create layers
(such as the fully connected layer class),
entire models (such as the `MLP` class above),
or various components of intermediate complexity.
We exploit this versatility
throughout the coming chapters,
such as when addressing
convolutional neural networks.


## [**The Sequential Module**]
:label:`subsec_model-construction-sequential`

We can now take a closer look
at how the `Sequential` class works.
Recall that `Sequential` was designed
to daisy-chain other modules together.
To build our own simplified `MySequential`,
we just need to define two key methods:

1. A method for appending modules one by one to a list.
1. A forward propagation method for passing an input through the chain of modules, in the same order as they were appended.

The following `MySequential` class delivers the same
functionality as the default `Sequential` class.


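One way to provide the same functionality as the standard `Sequential` class is to store the modules in an `nn.ModuleList`:
```python
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        self.modules_list = nn.ModuleList(args)  # Container that registers each module

    def forward(self, X):
        for module in self.modules_list:  # Run the modules in the order they were added
            X = module(X)
        return X
```
Key implementation details:
1. The `nn.ModuleList` container ensures that the parameters are registered correctly.
2. Parameter management is inherited from the parent class.
3. Forward propagation runs the modules strictly in the order in which they were added.
4. Modules can also be added after construction, for example with `self.modules_list.append(module)`.

Usage example:
```python
# Build a model equivalent to one built with nn.Sequential
model = MySequential(
    nn.Linear(20, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)
```
This implementation remains compatible with PyTorch's automatic differentiation,
and `model(X)` invokes forward propagation in the same way as before.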

In [8]:
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

In [9]:
class ChiSequential(nn.Module):
    def __init__(self, *args) -> None:
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)
            
    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

In the `__init__` method, we add every module
by calling the `add_module` method. These modules can later be accessed through the `children` method.
In this way the system knows the added modules,
and it will properly initialize each module's parameters.
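
To see what this registration provides, the sketch below rebuilds the same class (renamed `DemoSequential` here to avoid clashing with the definition above) and inspects it. It uses `nn.Linear` with explicit sizes, an assumption made so that the parameters exist immediately.

```python
import torch
from torch import nn

class DemoSequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            # add_module registers the child under the given string name
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():  # yields children in insertion order
            X = module(X)
        return X

demo = DemoSequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))

child_names = [name for name, _ in demo.named_children()]
# ['0', '1', '2']
param_names = [name for name, _ in demo.named_parameters()]
# ['0.weight', '0.bias', '2.weight', '2.bias']
out_shape = demo(torch.rand(2, 20)).shape
```

Because every child is registered by name, its parameters show up in `named_parameters()` with that name as a prefix; the `ReLU` in position `1` contributes none.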


When the forward propagation method of our `MySequential` is invoked,
the added modules are executed
in the order in which they were added.
We can now reimplement an MLP
using our `MySequential` class.


In [10]:
net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(X).shape

torch.Size([2, 10])

In [16]:
chi_net = ChiSequential(
    nn.Flatten(), nn.LazyLinear(256),nn.ReLU(),
    nn.LazyLinear(10)
)
chi_net(X).shape

torch.Size([2, 10])

Note that this use of `MySequential`
is identical to the code we previously wrote
for the `Sequential` class
(as described in :numref:`sec_mlp`).


## [**Executing Code in the Forward Propagation Method**]

The `Sequential` class makes model construction easy,
allowing us to assemble new architectures
without having to define our own class.
However, not all architectures are simple daisy chains.
When greater flexibility is required,
we will want to define our own modules.
For example, we might want to execute
Python's control flow within the forward propagation method.
Moreover, we might want to perform
arbitrary mathematical operations,
not simply relying on predefined neural network layers.

You may have noticed that until now,
all of the operations in our networks
have acted upon our network's activations
and its parameters.
Sometimes, however, we might want to
incorporate terms
that are neither the result of previous layers
nor updatable parameters.
We call these *constant parameters*.
Say for example that we want a layer
that calculates the function
$f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$,
where $\mathbf{x}$ is the input, $\mathbf{w}$ is our parameter,
and $c$ is some specified constant
that is not updated during optimization.
So we implement a `FixedHiddenMLP` class as follows.


In [17]:
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weight parameters that will not compute gradients and
        # therefore keep constant during training
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()

In this model,
we implement a hidden layer whose weights
(`self.rand_weight`) are initialized randomly
at instantiation and are thereafter constant.
This weight is not a model parameter
and thus it is never updated by backpropagation.
The network then passes the output of this "fixed" layer
through a fully connected layer.

Note that before returning the output,
our model did something unusual.
We ran a while-loop that tests whether
the $\ell_1$ norm of the output is larger than $1$
and divides the output vector by $2$
until the norm is no larger than $1$.
Finally, we returned the sum of the entries in `X`.
To our knowledge, no standard neural network
performs this operation.
Note that this particular operation may not be useful
in any real-world task.
Our point is only to show you how to integrate
arbitrary code into the flow of your
neural network computations.
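
Returning to the function $f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$, the sketch below shows one possible way to keep $c$ out of the set of trainable parameters. The class name `ScaledDot` and the use of `register_buffer` are choices made for this illustration rather than part of `FixedHiddenMLP`. A buffer is stored with the module but is not returned by `parameters()`.

```python
import torch
from torch import nn

class ScaledDot(nn.Module):
    """Computes c * w^T x with a trainable w and a constant c."""
    def __init__(self, dim, c):
        super().__init__()
        self.w = nn.Parameter(torch.ones(dim))      # updated by the optimizer
        self.register_buffer('c', torch.tensor(c))  # constant, never updated

    def forward(self, x):
        return self.c * (self.w @ x)

layer = ScaledDot(dim=3, c=0.5)
x = torch.tensor([1.0, 2.0, 3.0])
y = layer(x)  # 0.5 * (1 + 2 + 3)
y.backward()  # fills layer.w.grad with c * x

param_names = [name for name, _ in layer.named_parameters()]
buffer_names = [name for name, _ in layer.named_buffers()]
```

An optimizer built from `layer.parameters()` sees only `w`. Unlike the plain tensor attribute `self.rand_weight` used above, a buffer is also included in `state_dict()` and follows the module when it is moved with `.to(device)`.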


In [18]:
net = FixedHiddenMLP()
net(X)

tensor(-0.0724, grad_fn=<SumBackward0>)

We can [**mix and match various
ways of assembling modules together.**]
In the following example, we nest modules
in some creative ways.


In [9]:
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                                 nn.LazyLinear(32), nn.ReLU())
        self.linear = nn.LazyLinear(16)

    def forward(self, X):
        return self.linear(self.net(X))

chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
chimera(X)

tensor(-0.0192, grad_fn=<SumBackward0>)
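
Nesting is also reflected in the names under which parameters are registered. The sketch below uses a reduced, hypothetical variant of the example (`Inner` and `nested` are names introduced here, and `nn.Linear` with explicit sizes replaces `nn.LazyLinear`) to list those names.

```python
import torch
from torch import nn

class Inner(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU())
        self.linear = nn.Linear(64, 16)

    def forward(self, X):
        return self.linear(self.net(X))

nested = nn.Sequential(Inner(), nn.Linear(16, 4))

# Each level of nesting adds one dotted prefix to the parameter name
names = [name for name, _ in nested.named_parameters()]
out_shape = nested(torch.rand(2, 20)).shape
```

Here `names` starts with `0.net.0.weight`: position `0` of the outer `Sequential`, then the attribute `net`, then position `0` inside it.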

In [23]:

class chiNestMLP(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
                                nn.LazyLinear(64), nn.ReLU(),
                                nn.LazyLinear(32), nn.ReLU()
                            )
        self.linear = nn.LazyLinear(64)
        
    def forward(self, X):
        return self.linear(self.net(X))
    
chi_chimera = nn.Sequential(chiNestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
chi_chimera(X)

tensor(-0.0497, grad_fn=<SumBackward0>)

## Summary

Individual layers can be modules.
Many layers can comprise a module.
Many modules can comprise a module.

A module can contain code.
Modules take care of lots of housekeeping, including parameter initialization and backpropagation.
Sequential concatenations of layers and modules are handled by the `Sequential` module.


## Exercises

1. What kinds of problems will occur if you change `MySequential` to store modules in a Python list?
1. Implement a module that takes two modules as arguments, say `net1` and `net2`, and returns the concatenated output of both networks in the forward propagation. This is also called a *parallel module*.
1. Assume that you want to concatenate multiple instances of the same network. Implement a factory function that generates multiple instances of the same module and build a larger network from it.


[Discussions](https://discuss.d2l.ai/t/55)


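### Exercise 1: Analysis Based on the Code
The `nn.ModuleList` implementation relies on one key step, registration:
```python
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        self.layers = nn.ModuleList(args)  # Registers every submodule

    def forward(self, X):
        for layer in self.layers:
            X = layer(X)
        return X
```
Switching to a plain Python list causes the following problems:
1. **Untracked parameters**: `parameters()` no longer returns the submodules' parameters, so the optimizer cannot find them.
2. **Incomplete serialization**: `state_dict()` omits the submodules' state.
3. **No device propagation**: `.to(device)` does not reach the submodules.
4. **Hidden structure**: the nested `Module` structure is no longer recognized, for example by `children()`.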

### An Extended Parallel Module
```python
class Parallel(nn.Module):
    def __init__(self, net1, net2, dim=1):
        super().__init__()
        self.net1 = net1
        self.net2 = net2
        self.dim = dim

    def forward(self, X):
        # Feed the same input to both networks and concatenate along `dim`
        return torch.cat([self.net1(X), self.net2(X)], dim=self.dim)

    def __repr__(self):
        return f"Parallel({self.net1.__class__.__name__}, {self.net2.__class__.__name__}, dim={self.dim})"

# Check: both branches accept the same input, so their outputs can be concatenated
net = Parallel(nn.Linear(20, 128), nn.Linear(20, 64))
print(net)
# Output: Parallel(Linear, Linear, dim=1)
print(net(torch.rand(2, 20)).shape)
# Output: torch.Size([2, 192])
```

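### A Module Factory
```python
from copy import deepcopy

def module_factory(blueprint, num_copies):
    """Generate copies of a module that do not share parameters."""
    modules = []
    for _ in range(num_copies):
        new_mod = deepcopy(blueprint)
        # Re-initialize the copy if the module provides reset_parameters
        # (nn.Linear does; containers such as nn.Sequential do not)
        if hasattr(new_mod, 'reset_parameters'):
            new_mod.reset_parameters()
        modules.append(new_mod)
    return nn.Sequential(*modules)

# Usage example: input and output sizes match so the copies can be chained
base_layer = nn.Linear(256, 256)
mega_net = module_factory(base_layer, 8)

# Check that the parameters are independent
for i, layer in enumerate(mega_net):
    print(f"Layer {i} weight id: {id(layer.weight)}")
# Prints a different id for each layer
```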

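### Updated Architecture Diagram
```mermaid
graph TD
    subgraph factory [Factory pattern]
    A[Original module] --> B[Deep copy]
    B --> C[Instance 1]
    B --> D[Instance 2]
    C --> E[Forward propagation 1]
    D --> F[Forward propagation 2]
    end

    subgraph parallel [Parallel computation]
    X[Input] --> Y[Parallel]
    Y --> Z[Network A]
    Y --> W[Network B]
    Z --> M[Output A]
    W --> N[Output B]
    M --> CAT[Concatenate]
    N --> CAT
    end
```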