# TUT6-1-model-construction

Previously, we implemented some simple models, including softmax regression and an MLP. Both contain only a few simple layers, so we could build the networks by stacking those layers one after another.

In practice, however, we often need to implement large models built from more complicated components. These components usually consist of repeating patterns of *groups of layers*, like the model in the figure below.

![Multiple layers are combined into blocks, forming repeating patterns of larger models.](http://d2l.ai/_images/blocks.svg)


## **1. Layers and Blocks**
Therefore, we introduce the concept of a neural network **block**.
A block could describe a single layer,
a component consisting of multiple layers,
or the entire model itself.

We can freely combine these blocks, often recursively, to build large models.

From a programming standpoint, a block is represented by a ***class***. In PyTorch, it is `nn.Module`.

The `nn.Module` class defines the basic structure shared by every layer and block.

So we can create a subclass of `nn.Module` to build our own layers or blocks.

We only need to define the forward propagation function and declare the parameters (or sub-layers) that it uses.

If you are interested in how `nn.Module` is implemented, you may refer to its [source code](https://pytorch.org/docs/stable/_modules/torch/nn/modules/module.html#Module).

**To begin, we revisit the code that we used to implement MLPs.**

The following code builds a network with one fully-connected hidden layer
with 256 units and ReLU activation,
followed by a fully-connected output layer
with 10 units (no activation function).


In [1]:
import torch
from torch import nn
from torch.nn import functional as F

net = nn.Sequential(nn.Linear(20, 256), 
                    nn.ReLU(), 
                    nn.Linear(256, 10))

X = torch.rand(2, 20)
net(X)

tensor([[ 0.0903,  0.0643,  0.1121, -0.0141, -0.0981, -0.2249,  0.2130, -0.0885,
         -0.0124, -0.1376],
        [ 0.0918,  0.0252,  0.1326, -0.1469, -0.2427, -0.3593,  0.1703,  0.0129,
         -0.0499,  0.0526]], grad_fn=<AddmmBackward0>)

In this example, we constructed
our model by instantiating an `nn.Sequential`, passing the layers in the order in which they should be executed.
In fact, **`nn.Sequential` defines a special kind of `Module`**, so it is itself a block.

It maintains an ordered list of constituent `nn.Module`s.
Note that each of the two fully-connected layers is an instance of the `nn.Linear` class, and `nn.Linear` is itself a subclass of `nn.Module`.
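
The class relationships described above can be checked directly. The following is a minimal sketch that rebuilds the same three-layer network and inspects it with Python's built-in `isinstance` and `issubclass`.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))

# The container and every layer inside it are `nn.Module` instances
print(isinstance(net, nn.Module))        # True
print(issubclass(nn.Linear, nn.Module))  # True

# `nn.Sequential` can be indexed and iterated in insertion order
layer_types = [type(layer).__name__ for layer in net]
print(layer_types)                       # ['Linear', 'ReLU', 'Linear']
```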

The forward propagation (`forward`) function is also remarkably simple:
it chains each block in the list together,
passing the output of each as the input to the next.

Note that until now, we have been invoking our models
via the construction `net(X)` to obtain their outputs.
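
The call `net(X)` is shorthand for `net.__call__(X)`, which `nn.Module` implements by running `forward` (together with any registered hooks). The sketch below compares the two ways of invoking the same network. Calling `forward` directly is shown only for illustration; `net(X)` remains the usual way to run a model.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
X = torch.rand(2, 20)

out_call = net(X)             # goes through nn.Module.__call__
out_forward = net.forward(X)  # calls forward directly, bypassing hooks

same = torch.allclose(out_call, out_forward)
print(same)                   # True
```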

Next, let's see how the block works.

## **2. A Custom Block**

Perhaps the easiest way to develop intuition
about how a block works
is to implement one ourselves.
Before we implement our own custom block,
we briefly summarize the basic functionality
that each block must provide:


1. Ingest input data as arguments to its forward propagation function.
1. Generate an output by having the forward propagation function return a value. Note that the output may have a different shape from the input. 
1. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation function. Typically this happens automatically.
1. Store and provide access to those parameters necessary
   to execute the forward propagation computation.
1. Initialize model parameters as needed.
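
As a rough illustration of the five points above, the sketch below walks through them with a single `nn.Linear` layer, which is already a block. The variable names are chosen for this example only.

```python
import torch
from torch import nn

block = nn.Linear(20, 10)
X = torch.rand(2, 20, requires_grad=True)

# 1-2: ingest an input and return an output (of a different shape here)
Y = block(X)
print(Y.shape)                   # torch.Size([2, 10])

# 3: gradients are produced automatically by autograd
Y.sum().backward()
print(X.grad.shape)              # torch.Size([2, 20])
print(block.weight.grad.shape)   # torch.Size([10, 20])

# 4: parameters are stored on the block and accessible by name
param_names = [name for name, _ in block.named_parameters()]
print(param_names)               # ['weight', 'bias']

# 5: parameters can be (re)initialized as needed
nn.init.zeros_(block.bias)
```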


Let's now code up a block from scratch
corresponding to an MLP
with one hidden layer of 256 hidden units
and a 10-dimensional output layer, like the one we built before.

Note that the `MLP` class below inherits from `nn.Module`, the class that represents a block.
We will rely heavily on the parent class's functions,
supplying only our own constructor (the `__init__` function in Python) and the forward propagation function.


In [2]:
class MLP(nn.Module):
    # Declare a layer with model parameters. Here, we declare two fully
    # connected layers
    def __init__(self):
        # Call the constructor of the parent class `nn.Module` first. It sets
        # up the internal bookkeeping that registers the layers assigned
        # below as sub-modules of this block
        super().__init__()
        self.hidden = nn.Linear(20, 256)  # Hidden layer
        self.relu = nn.ReLU(inplace=True) # ReLU layer
        self.out = nn.Linear(256, 10)  # Output layer

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input `X`
    def forward(self, X):
        X = self.hidden(X)
        X = self.relu(X)
        X = self.out(X)
        return X

Let us first focus on the forward propagation function.
Note that it takes `X` as the input,
calculates the hidden representation
with the activation function applied,
and outputs its logits.

We **instantiate the MLP's layers**
in the constructor
and **subsequently invoke these layers**
on each call to the forward propagation function.

Note a few key details.
First, our customized `__init__` function
invokes the parent class's `__init__` function
via `super().__init__()`,
sparing us the pain of restating
boilerplate code applicable to most blocks.
We then instantiate our two fully-connected layers and the ReLU activation,
assigning them to `self.hidden`, `self.out`, and `self.relu`.
Note that unless we implement a new operator,
we need not worry about the backpropagation function
or parameter initialization.
The system will generate these functions automatically.
Let us try this out.


In [3]:
net = MLP() # -> __init__(self)
net(X)      # -> forward(self, X)

tensor([[ 0.1539,  0.1288,  0.0579, -0.0442,  0.2028,  0.1582, -0.0716, -0.2399,
          0.0633,  0.1591],
        [ 0.1063,  0.0590, -0.0611,  0.0661,  0.1596,  0.0098, -0.0008, -0.2224,
          0.2123,  0.1943]], grad_fn=<AddmmBackward0>)
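
To see what the parent class does for us, we can inspect the block after construction. The sketch below uses a compact copy of the `MLP` class (with a plain `nn.ReLU()`) and lists the registered sub-modules and the total number of parameters.

```python
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(20, 256)
        self.relu = nn.ReLU()
        self.out = nn.Linear(256, 10)

    def forward(self, X):
        return self.out(self.relu(self.hidden(X)))

net = MLP()

# Layers assigned as attributes are registered as children automatically
child_names = [name for name, _ in net.named_children()]
print(child_names)   # ['hidden', 'relu', 'out']

# Their parameters are collected too: (20*256 + 256) + (256*10 + 10)
num_params = sum(p.numel() for p in net.parameters())
print(num_params)    # 7946
```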

A key virtue of the block abstraction is its versatility.
We can subclass a block (`nn.Module`) to create layers
(such as the fully-connected layer class),
entire models (such as the `MLP` class above),
or various components of intermediate complexity.

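## **3. Initializing a Model with Constructor Arguments**
We can also let the constructor take arguments, such as the number of output classes `num_class`, so that the same class can build models of different sizes.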

In [4]:
class MLP2(nn.Module):
    def __init__(self, num_class = 3):
        # The number of output classes is specified at instantiation
        super().__init__()
        self.hidden = nn.Linear(20, 256)  # Hidden layer
        self.relu = nn.ReLU(inplace=True) # ReLU layer
        self.out = nn.Linear(256, num_class)  # Output layer

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input `X`
    def forward(self, X):
        X = self.hidden(X)
        X = self.relu(X)
        X = self.out(X)
        return X

The input `X` is still the (2, 20) matrix created earlier:

In [5]:
X

tensor([[0.4151, 0.7088, 0.2830, 0.0239, 0.0146, 0.0602, 0.7729, 0.6212, 0.8712,
         0.0146, 0.1740, 0.7631, 0.4419, 0.2659, 0.9755, 0.3481, 0.7154, 0.3032,
         0.4802, 0.7021],
        [0.0927, 0.4254, 0.8712, 0.9307, 0.2788, 0.5742, 0.3066, 0.0033, 0.6408,
         0.0058, 0.2721, 0.5410, 0.2894, 0.5601, 0.1927, 0.8475, 0.4338, 0.8386,
         0.7734, 0.5987]])

Pass the argument when instantiating the model, here `num_class=8`:

In [6]:
net = MLP2(num_class=8) # -> __init__(self, num_class)
net(X)                  # -> forward(self, X)

tensor([[-0.1586,  0.0686,  0.0355, -0.1779,  0.0440,  0.1953,  0.0668, -0.0779],
        [-0.0716,  0.1717,  0.1281, -0.2385,  0.1536,  0.2048,  0.1064, -0.1016]],
       grad_fn=<AddmmBackward0>)

Omit the argument to use the default value of `num_class`:

In [7]:
net = MLP2() # -> __init__(self, num_class=3)
net(X)       # -> forward(self, X)

tensor([[-0.0055,  0.0099,  0.2133],
        [ 0.0239, -0.0593,  0.1162]], grad_fn=<AddmmBackward0>)
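
The same idea extends to other sizes. The following sketch is a hypothetical variant, `ConfigurableMLP`, whose extra arguments `num_inputs` and `num_hiddens` are assumptions made for illustration and are not part of `MLP2`.

```python
import torch
from torch import nn

class ConfigurableMLP(nn.Module):
    # `num_inputs` and `num_hiddens` are hypothetical extra arguments
    def __init__(self, num_inputs=20, num_hiddens=256, num_class=3):
        super().__init__()
        self.hidden = nn.Linear(num_inputs, num_hiddens)
        self.relu = nn.ReLU()
        self.out = nn.Linear(num_hiddens, num_class)

    def forward(self, X):
        return self.out(self.relu(self.hidden(X)))

X = torch.rand(2, 20)
small = ConfigurableMLP(num_hiddens=32, num_class=5)
default = ConfigurableMLP()

print(small(X).shape)     # torch.Size([2, 5])
print(default(X).shape)   # torch.Size([2, 3])
```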

## **4. The Sequential Block**

We can now take a closer look
at how the `Sequential` class works.
Recall that `Sequential` was designed
to daisy-chain other blocks together.
To build our own simplified `MySequential`,
we just need to define two key functions:
1. A function to append blocks one by one to a list.
2. A forward propagation function to pass an input through the chain of blocks, in the same order as they were appended.

The following `MySequential` class delivers the same
functionality as the default `Sequential` class.


In [8]:
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            # Here, `module` is an instance of a `Module` subclass. We save it
            # in the member variable `_modules` of the `Module` class, and its
            # type is OrderedDict
            self._modules[str(idx)] = module

    def forward(self, X):
        # OrderedDict guarantees that members will be traversed in the order
        # they were added
        for block in self._modules.values():
            X = block(X)
        return X

In the `__init__` method, we add every module
to the ordered dictionary `_modules` one by one.
You might wonder why every `Module`
possesses a `_modules` attribute
and why we used it rather than just
defining a Python list ourselves.
In short, the chief advantage of `_modules`
is that during our module's parameter initialization,
the system knows to look inside the `_modules`
dictionary to find sub-modules whose
parameters also need to be initialized.
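
The difference is easy to observe. The sketch below defines two hypothetical holder classes, one storing its layers in a plain Python list and one storing them in `_modules`, and counts the parameters that each one reports.

```python
import torch
from torch import nn

class ListHolder(nn.Module):
    def __init__(self, *layers):
        super().__init__()
        self.layers = list(layers)            # plain list: not registered

class ModulesHolder(nn.Module):
    def __init__(self, *layers):
        super().__init__()
        for idx, layer in enumerate(layers):
            self._modules[str(idx)] = layer   # registered as sub-modules

def make_layers():
    return nn.Linear(20, 256), nn.Linear(256, 10)

n_list = len(list(ListHolder(*make_layers()).parameters()))
n_modules = len(list(ModulesHolder(*make_layers()).parameters()))
print(n_list, n_modules)   # 0 4
```

With the plain list, the layers are invisible to `parameters()`, so an optimizer built from it would have nothing to update. PyTorch also provides `nn.ModuleList` as a list-like container that registers its contents.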


When our `MySequential`'s forward propagation function is invoked,
the blocks are executed
in the order in which they were added.
We can now reimplement an MLP
using our `MySequential` class.


In [9]:
net = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
net(X)

tensor([[-0.0936, -0.3120, -0.1840,  0.0635, -0.1278, -0.0626,  0.2373,  0.0758,
          0.2220,  0.2871],
        [ 0.0759, -0.2019, -0.1076,  0.0968, -0.0641, -0.1233,  0.3476,  0.0336,
          0.1752,  0.3002]], grad_fn=<AddmmBackward0>)

Note that this use of `MySequential`
is identical to the code we previously wrote
for the `Sequential` class.
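
As a further check, here is a sketch that copies the weights of an `nn.Sequential` into a `MySequential` with the same layers and confirms that the two produce the same output. It relies on both containers naming their children `"0"`, `"1"`, `"2"`.

```python
import torch
from torch import nn

class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self._modules[str(idx)] = module

    def forward(self, X):
        for block in self._modules.values():
            X = block(X)
        return X

ref = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
mine = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))

# The parameter keys line up because the child names are the same
keys = list(mine.state_dict().keys())
print(keys)   # ['0.weight', '0.bias', '2.weight', '2.bias']

# After copying the weights, both containers compute the same function
mine.load_state_dict(ref.state_dict())
X = torch.rand(2, 20)
match = torch.allclose(mine(X), ref(X))
print(match)  # True
```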


## **5. Executing Code in the Forward Propagation Function**

You may now be wondering: since the `nn.Sequential` class
makes it so easy to assemble new architectures without writing
additional code such as the `MLP` class,
why do we bother subclassing `nn.Module` to build blocks?

The reason is that not all architectures are simple daisy chains.
We often want to define more flexible blocks, for example, with
Python's control flow or arbitrary mathematical operations.

Sometimes, we might want to incorporate terms
that are neither the result of previous layers
nor updatable parameters.
We call these *constant parameters*.
Say for example that we want a layer
that calculates the function
$f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$,
where $\mathbf{x}$ is the input, $\mathbf{w}$ is our parameter,
and $c$ is some specified constant
that is not updated during optimization.
So we implement a `FixedHiddenMLP` class as follows.


In [10]:
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weight parameters that will not compute gradients and
        # therefore keep constant during training
        self.rand_weight = torch.rand((20, 20), requires_grad=False)
        self.linear = nn.Linear(20, 20)

    def forward(self, X):
        X = self.linear(X)
        # Use the created constant parameters, as well as the `relu` and `mm`
        # functions
        X = F.relu(torch.mm(X, self.rand_weight) + 1)
        # Reuse the fully-connected layer. This is equivalent to sharing
        # parameters with two fully-connected layers
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()

In this `FixedHiddenMLP` model,
we implement a hidden layer whose weights
(`self.rand_weight`) are initialized randomly
at instantiation and are thereafter constant.
This weight is not a model parameter
and thus it is never updated by backpropagation.
The network then passes the output of this "fixed" layer
through a fully-connected layer.
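
To confirm that the constant weight really stays fixed, the sketch below uses a trimmed, hypothetical variant of the class (named `FixedHidden`, without the loop) and runs one optimizer step. The learning rate and the use of `torch.optim.SGD` are assumptions made for this example.

```python
import torch
from torch import nn
from torch.nn import functional as F

class FixedHidden(nn.Module):
    # Trimmed, hypothetical variant of `FixedHiddenMLP` without the loop
    def __init__(self):
        super().__init__()
        self.rand_weight = torch.rand((20, 20), requires_grad=False)
        self.linear = nn.Linear(20, 20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(torch.mm(X, self.rand_weight) + 1)
        return self.linear(X).sum()

net = FixedHidden()

# Only the shared linear layer contributes trainable parameters
param_names = [name for name, _ in net.named_parameters()]
print(param_names)   # ['linear.weight', 'linear.bias']

before = net.rand_weight.clone()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
net(torch.rand(2, 20)).backward()
optimizer.step()

unchanged = torch.equal(before, net.rand_weight)
print(unchanged)             # True: the constant was not updated
print(net.rand_weight.grad)  # None: no gradient was tracked for it
```

One caveat: a plain tensor attribute like this is not moved by `net.to(device)` and is not saved in `state_dict()`. When that matters, `register_buffer` is the usual way to attach such constants to a module.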

Note that before returning the output,
our model did something unusual.
We ran a while-loop that tests whether
the $L_1$ norm of the output is larger than $1$
and divides the output vector by $2$
for as long as that is the case, i.e., until the norm is at most $1$.
Finally, we returned the sum of the entries in `X`.
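
A small worked example makes the loop concrete. The sketch below applies the same halving rule to a hand-picked vector whose $L_1$ norm is $8$, so the norm goes $8 \to 4 \to 2 \to 1$ and the loop stops after three halvings.

```python
import torch

X = torch.tensor([3.0, -5.0])    # L1 norm = |3| + |-5| = 8
steps = 0
while X.abs().sum() > 1:         # same condition as in `forward`
    X = X / 2
    steps += 1

print(steps)     # 3
print(X)         # tensor([ 0.3750, -0.6250])
print(X.sum())   # tensor(-0.2500)
```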

To our knowledge, no standard neural network
performs this operation.
Note that this particular operation may not be useful
in any real-world task.
Our point is only to show you how to integrate
arbitrary code into the flow of your
neural network computations.


In [11]:
net = FixedHiddenMLP()
net(X)

tensor(-0.1849, grad_fn=<SumBackward0>)

We can **mix and match various
ways of assembling blocks together.**
In the following example, we nest several blocks using different methods.


In [12]:
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                                 nn.Linear(64, 32), nn.ReLU())
        self.linear = nn.Linear(32, 16)

    def forward(self, X):
        return self.linear(self.net(X))

chimera = nn.Sequential(NestMLP(), nn.Linear(16, 20), FixedHiddenMLP())
chimera(X)

tensor(0.0132, grad_fn=<SumBackward0>)

However, I do not recommend doing this. Using the block style rather than the sequential style will make your code clearer.
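
As one possible illustration of the block style, the sketch below wraps the first two stages of the chimera in a class with named attributes and an explicit `forward`. The class name `Chimera` and the attribute names are hypothetical, and the `FixedHiddenMLP` stage is left out to keep the example short.

```python
import torch
from torch import nn

class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                                 nn.Linear(64, 32), nn.ReLU())
        self.linear = nn.Linear(32, 16)

    def forward(self, X):
        return self.linear(self.net(X))

class Chimera(nn.Module):
    # Hypothetical block-style wrapper; the attribute names are made up
    def __init__(self):
        super().__init__()
        self.features = NestMLP()
        self.project = nn.Linear(16, 20)

    def forward(self, X):
        X = self.features(X)     # (batch, 20) -> (batch, 16)
        return self.project(X)   # (batch, 16) -> (batch, 20)

model = Chimera()
out = model(torch.rand(2, 20))
stage_names = [name for name, _ in model.named_children()]

print(out.shape)     # torch.Size([2, 20])
print(stage_names)   # ['features', 'project']
```

Each stage now has a descriptive name instead of a position index, and the data flow is spelled out in one place.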