# Chapter 6 - Builders' Guide

## 6.1. Layers and Modules

A *module* could describe a single layer, a component consisting of multiple layers, or the entire model itself. Working with the module abstraction allows us to combine modules into larger artifacts and to reuse them across multiple models.

From a programming standpoint, a module is represented by a *class*. Any subclass of it must define a forward propagation method that transforms its input into output and must store any necessary parameters. A module must also possess a backpropagation method for calculating gradients; in PyTorch, automatic differentiation supplies this, so we only need to define the parameters and the forward propagation method.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
# build a simple MLP with nn.Sequential
net = nn.Sequential(
    nn.LazyLinear(256),
    nn.ReLU(),
    nn.LazyLinear(10)
)

X = torch.randn(2, 20)
net(X).shape



torch.Size([2, 10])

We just built our model by instantiating an `nn.Sequential`, passing the layers as arguments in the order in which they should be executed. The `nn.Sequential` class defines a special kind of `Module`, the class that represents a module in PyTorch. It maintains an ordered list of constituent `Module`s.

The `LazyLinear` class used here is itself a subclass of `Module` (via `Linear`). The forward propagation (`forward`) method of `nn.Sequential` chains each module in the list together, passing the output of each as input to the next.

We invoked our model via the construction `net(X)` to obtain the outputs, which is shorthand for `net.__call__(X)`.
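
As a quick check of this shorthand, the sketch below confirms that calling the module and calling `__call__` directly give the same result. It uses explicit `nn.Linear` sizes instead of lazy layers, purely for illustration. Calling `forward` directly returns the same values here, but it bypasses any registered hooks, which is why `net(X)` is the usual form.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
X = torch.randn(2, 20)

out_call = net(X)              # the usual shorthand
out_dunder = net.__call__(X)   # what the shorthand expands to
out_forward = net.forward(X)   # same values here, but skips hooks

print(torch.equal(out_call, out_dunder))
print(torch.equal(out_call, out_forward))
```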

### 6.1.1. A Custom Module

Each module must provide the following functionalities:
1. Ingest input data as arguments to its forward propagation method.
2. Generate an output by having the forward propagation method return a value. Note that the output may have a different shape from the input. 
3. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation method.
4. Store and provide access to those parameters necessary to execute the forward propagation computation.
5. Initialize these parameters as needed.
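
In PyTorch, item 3 in the list above is handled by automatic differentiation, so a custom module normally defines only `forward`. The following is a minimal sketch; the one-parameter `Scale` module is hypothetical and invented for this illustration.

```python
import torch
from torch import nn

class Scale(nn.Module):
    """Hypothetical module: multiplies its input by a learnable scalar."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(2.0))

    def forward(self, X):
        return self.alpha * X

layer = Scale()
X = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
layer(X).sum().backward()   # no hand-written backward method is needed

print(layer.alpha.grad)  # d/d(alpha) of sum(alpha * X) equals sum(X)
print(X.grad)            # d/dX of sum(alpha * X) equals alpha for each element
```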

We can code up a module from scratch by subclassing the `Module` class. Note that the `MLP` class inherits the `Module` class. We will heavily rely on the parent class's methods, supplying only our own constructor (the `__init__` method) and forward propagation method.

In [3]:
class MLP(nn.Module):
    def __init__(self):
        # call the constructor of the parent class nn.Module to perform the necessary initialization
        super().__init__()

        # define the layers
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    # define the forward pass, that is,
    # how to return the required model output based on the input X
    def forward(self, X):
        h = self.hidden(X)
        h = F.relu(h)
        out = self.out(h)
        return out

In this `MLP` implementation, both `self.hidden` and `self.out` are `LazyLinear` instances. Each has its own weight and bias parameters. We instantiate the `MLP`’s layers in the constructor and subsequently invoke these layers on each call to the forward propagation method.

The `__init__` method in `MLP` invokes the parent class's `__init__` method via `super().__init__()`, sparing us the pain of restating boilerplate code applicable to most modules.

In [4]:
net = MLP()
net(X).shape



torch.Size([2, 10])

### 6.1.2. The Sequential Module

The `Sequential` class was designed to daisy-chain other modules together.

In [5]:
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()

        # register each layer as a child module (kept, in order, in self._modules)
        for idx, module in enumerate(args):
            self.add_module(
                str(idx), # name of the child module
                module  # child module
            )

    def forward(self, X):
        # apply each module sequentially
        for module in self.children():
            X = module(X)
        return X

In the `__init__` method, we add every module by calling the `add_module` method. These modules can be accessed later through the `children` method.
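
To see what registration buys us, the sketch below rebuilds the same `MySequential` class with explicit `nn.Linear` sizes (an assumption made so the parameter count is known up front) and inspects its children and parameters.

```python
import torch
from torch import nn

class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

net = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))

# Children come back in insertion order, keyed by the names given to add_module
names = [name for name, _ in net.named_children()]
print(names)

# Because the children are registered, the parent can see their parameters
n_params = sum(p.numel() for p in net.parameters())
print(n_params)
```

Had we stored the layers in a plain Python list attribute instead, `net.parameters()` would not find them.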

In [6]:
net = MySequential(
    nn.LazyLinear(256),
    nn.ReLU(),
    nn.LazyLinear(10)
)

net(X).shape



torch.Size([2, 10])

### 6.1.3. Executing Code in the Forward Propagation Method

Sometimes we might want to incorporate terms that are neither the result of previous layers nor updatable parameters. We call these *constant parameters*. For example, we want a layer that calculates the function $f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$, where $\mathbf{x}$ is the input, $\mathbf{w}$ is our parameter, and $c$ is some specified constant that is not updated during optimization:

In [7]:
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()

        # random weight parameters that will not compute gradients and 
        # therefore keep constant during training
        self.rand_weight = torch.rand((20,20))
        
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1) # @ stands for matrix multiplication

        # reuse the fully-connected layer. This is equivalent to sharing parameters
        # with two fully-connected layers
        X = self.linear(X)

        # control flow
        while X.abs().sum() > 1:
            X /= 2

        return X.sum()

In this model, we implement a hidden layer whose weights (`self.rand_weight`) are initialized randomly at instantiation and are thereafter constant. This weight is not a model parameter and thus it is never updated by backpropagation. The network then passes the output of this **"fixed"** layer through a fully-connected layer.

Before returning the output, the model runs a while-loop that tests whether the $\ell_1$ norm of the output is larger than 1 and halves the output until the norm is at most 1. This is not standard practice in deep learning, but it illustrates that arbitrary code can be inserted into the forward propagation method.
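
We can confirm that a tensor stored as a plain attribute, like `self.rand_weight`, is invisible to the optimizer. The sketch below uses a hypothetical cut-down module, `FixedWeight`, and also shows `register_buffer`, an alternative that the model above does not use: a buffer is likewise never trained, but it is saved in the `state_dict` and moves with the module when `.to(device)` is called.

```python
import torch
from torch import nn

class FixedWeight(nn.Module):
    """Hypothetical cut-down variant of the model above, for illustration only."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(20, 20)
        # plain tensor attribute: not a parameter, not saved, not moved by .to()
        self.rand_weight = torch.rand(20, 20)
        # registered buffer: still not trained, but saved and moved with the module
        self.register_buffer("rand_buffer", torch.rand(20, 20))

    def forward(self, X):
        return self.linear(X) @ self.rand_weight

net = FixedWeight()
param_names = [name for name, _ in net.named_parameters()]
state_keys = list(net.state_dict().keys())
print(param_names)   # only the linear layer's weight and bias
print(state_keys)    # the parameters plus the registered buffer
```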

In [8]:
net = FixedHiddenMLP()
net(X)



tensor(0.0612, grad_fn=<SumBackward0>)

We can mix and match various ways of assembling modules together:

In [9]:
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()

        self.net = nn.Sequential(
            nn.LazyLinear(64),
            nn.ReLU(),
            nn.LazyLinear(32),
            nn.ReLU()
        )

        self.linear = nn.LazyLinear(16)

    def forward(self, X):
        return self.linear(self.net(X))

In [10]:
net = nn.Sequential(
    NestMLP(),
    nn.LazyLinear(20),
    FixedHiddenMLP()
)

net(X)



tensor(0.1345, grad_fn=<SumBackward0>)

## 6.2. Parameter Management

* Access parameters for debugging, diagnostics, and visualizations
* Share parameters across different model components

In [1]:
import torch
from torch import nn

In [2]:
# start with a simple MLP
net = nn.Sequential(
    nn.LazyLinear(8),
    nn.ReLU(),
    nn.LazyLinear(1)
)
X = torch.rand(size=(2, 4))
net(X).shape



torch.Size([2, 1])

### 6.2.1. Parameter Access

When a model is defined via the `Sequential` class, we can access any layer by indexing into the model as though it were a list.

In [3]:
# inspect parameters of the 2nd FC layer
net[2].state_dict()

OrderedDict([('weight',
              tensor([[ 0.0656, -0.1771,  0.2366, -0.0695,  0.3005, -0.1005, -0.1911, -0.2595]])),
             ('bias', tensor([-0.0613]))])

#### 6.2.1.1. Targeted Parameters

Each parameter is represented as an instance of the parameter class.

We can extract the bias from the second linear layer (`net[2]`), which returns a `Parameter` instance, and then access that parameter's value through its `data` attribute:

In [5]:
net[2].bias

Parameter containing:
tensor([-0.0613], requires_grad=True)

In [6]:
type(net[2].bias), net[2].bias.data

(torch.nn.parameter.Parameter, tensor([-0.0613]))

Parameters are complex objects, containing values, gradients, and additional information.

We can also access the gradient:

In [9]:
net[2].weight.grad is None

True
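
The gradient is `None` here only because backpropagation has not been invoked yet. A minimal sketch, assuming a small MLP with explicit layer sizes, shows the attribute being filled in by a backward pass:

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand(2, 4)

grad_before = net[2].weight.grad      # None: backward has not run yet
net(X).sum().backward()
grad_after = net[2].weight.grad       # now a tensor shaped like the weight

print(grad_before is None)
print(grad_after.shape)
```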

In [10]:
net[2].weight.grad?

Type:        NoneType
String form: None
Docstring:   <no docstring>

#### 6.2.1.2. All Parameters at Once
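
A minimal sketch of collecting every parameter in one pass, assuming a small MLP with explicit layer sizes chosen for illustration. The `named_parameters` method yields each parameter together with its dotted name:

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# (name, shape) for every parameter; the ReLU layer contributes none
shapes = [(name, tuple(param.shape)) for name, param in net.named_parameters()]
for name, shape in shapes:
    print(name, shape)
```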

Often we want to share parameters across multiple layers. We can allocate a fully connected layer and then reuse that same layer object, and hence its parameters, at several positions in the network:

In [11]:
# we need to give the shared layer a name so that we can refer to its parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(
    nn.LazyLinear(8), # 0th
    nn.ReLU(),       # 1st
    shared,         # 2nd
    nn.ReLU(),    # 3rd
    shared,     # 4th
    nn.ReLU(), # 5th
    nn.LazyLinear(1) # 6th
)
net(X)



tensor([[0.1471],
        [0.1573]], grad_fn=<AddmmBackward0>)

In [12]:
# check whether parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])

tensor([True, True, True, True, True, True, True, True])


In [13]:
net[2].weight.data[0,0] = 100
# make sure that they are actually the same object rather than just having the same value
print(net[2].weight.data[0] == net[4].weight.data[0])

tensor([True, True, True, True, True, True, True, True])


This shows that the parameters of the second and third linear layers (`net[2]` and `net[4]`) are tied. They are not just equal; they are the same object. Thus, if we change one of the parameters, the other one changes, too.

Since the model parameters contain gradients, the gradients of the second hidden layer and the third hidden layer are added together during backpropagation.
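
The accumulation of gradients can be checked by hand on a tiny case. The sketch below assumes a single bias-free $1 \times 1$ layer with weight $w = 3$ applied twice to $x = 2$, so that $y = w \cdot (w \cdot x)$ and $\partial y / \partial w = 2wx$:

```python
import torch
from torch import nn

# one bias-free 1x1 layer applied twice, so y = w * (w * x)
shared = nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(3.0)

x = torch.tensor([[2.0]])
y = shared(shared(x))
y.sum().backward()

# each use contributes w * x = 6, and autograd sums the two contributions
print(y.item(), shared.weight.grad.item())
```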

## 6.3. Parameter Initialization

The deep learning framework provides default random initializations to its layers. However, we often want to initialize our weights according to various other protocols. The framework provides the most commonly used protocols and also allows us to create a custom initializer.

In [14]:
import torch
from torch import nn

By default, PyTorch initializes weight and bias matrices uniformly by drawing from a range that is computed according to the input and output dimension.
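
For `nn.Linear` specifically, the documented default draws both weight and bias from $U(-1/\sqrt{n_\textrm{in}}, 1/\sqrt{n_\textrm{in}})$, where $n_\textrm{in}$ is the number of input features, so in that case only the input dimension enters. The sketch below checks this bound empirically; the layer sizes are arbitrary choices for the illustration.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(in_features=100, out_features=50)

bound = 1 / math.sqrt(layer.in_features)   # 0.1 for 100 inputs
w_max = layer.weight.abs().max().item()
b_max = layer.bias.abs().max().item()
print(bound, w_max, b_max)
```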

PyTorch's `nn.init` module provides a variety of preset initialization methods.

In [15]:
net = nn.Sequential(
    nn.LazyLinear(8),
    nn.ReLU(),
    nn.LazyLinear(1)
)
X = torch.rand(size=(2, 4))
net(X).shape



torch.Size([2, 1])

### 6.3.1. Built-in Initialization

We can initialize all weight parameters as Gaussian random variables with standard deviation 0.01, with bias parameters set to zero:

In [16]:
def init_normal(module):
    if type(module) == nn.Linear:
        # initialize weight parameter as Gaussian distribution
        nn.init.normal_(module.weight, mean=0, std=0.01)
        # initialize bias parameter as zero
        nn.init.zeros_(module.bias)

In [17]:
# apply the initializer to the net
net.apply(init_normal)

# check whether parameters are initialized as we expected
net[0].weight.data[0], net[0].bias.data[0]

(tensor([ 0.0116, -0.0126, -0.0066,  0.0054]), tensor(0.))

We can also initialize all the parameters to a given constant value:

In [18]:
def init_constant(module):
    if type(module) == nn.Linear:
        constant = 1
        nn.init.constant_(module.weight, constant)
        nn.init.zeros_(module.bias)

net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([1., 1., 1., 1.]), tensor(0.))

We can also apply different initializers for certain blocks. For example, below we initialize the first layer with the Xavier initializer and the second layer with the constant value of 10:

In [19]:
def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)

def init_constant10(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 10)

In [20]:
# apply the xavier initializer to the first layer
net[0].apply(init_xavier)
# apply the constant initializer to the second layer
net[2].apply(init_constant10)

# check whether parameters are initialized as we expected
print(net[0].weight.data[0])
print(net[2].weight.data)

tensor([-0.6387, -0.6261, -0.5671, -0.6601])
tensor([[10., 10., 10., 10., 10., 10., 10., 10.]])


#### 6.3.1.1. Custom Initialization

We can define an initializer for any weight parameter $w$ using the following strange distribution:

$$
w \sim \begin{cases}
    U(5, 10) & \textrm{ with probability } \frac{1}{4} \\
    0 & \textrm{ with probability } \frac{1}{2} \\
    U(-10, -5) & \textrm{ with probability } \frac{1}{4}
\end{cases}
$$

In [21]:
def my_init(module):
    if type(module) == nn.Linear:
        print("Initialize", *[
            (name, param.shape) for name, param in module.named_parameters()
        ][0])

        nn.init.uniform_(module.weight, -10, 10)
        module.weight.data *= module.weight.data.abs() >= 5

In [22]:
net.apply(my_init)
net[0].weight[:2]

Initialize weight torch.Size([8, 4])
Initialize weight torch.Size([1, 8])


tensor([[8.3590, -0.0000, 9.7783, 7.5524],
        [8.8788, 5.6372, 5.9302, -0.0000]], grad_fn=<SliceBackward0>)
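
The two lines inside `my_init` realize the distribution above because a draw from $U(-10, 10)$ satisfies $|w| < 5$ with probability $\frac{1}{2}$, and the surviving values are uniform on $(5, 10)$ or $(-10, -5)$ with probability $\frac{1}{4}$ each. A minimal sketch that applies the same two steps to a large standalone tensor (the sample size is an arbitrary choice) and measures the three fractions:

```python
import torch

torch.manual_seed(0)
w = torch.empty(100_000).uniform_(-10, 10)
w *= w.abs() >= 5          # zero out every draw with |w| < 5

frac_zero = (w == 0).float().mean().item()
frac_pos = (w >= 5).float().mean().item()
frac_neg = (w <= -5).float().mean().item()
print(frac_zero, frac_pos, frac_neg)
```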

We always have the option of setting parameters directly:

In [23]:
net[0].weight.data[:] += 1
net[0].weight.data[0, 0] = 42
net[0].weight.data[0]

tensor([42.0000,  1.0000, 10.7783,  8.5524])

## 6.4. Lazy Initialization

In [24]:
import torch
from torch import nn

from d2l import torch as d2l

In [25]:
# define the MLP architecture
net = nn.Sequential(
    nn.LazyLinear(256),
    nn.ReLU(),
    nn.LazyLinear(10)
)



At this point, the network cannot possibly know the dimensions of the input layer’s weights because the input dimension remains unknown. Consequently, the framework has not yet initialized any parameters.

In [26]:
net[0].weight

<UninitializedParameter>

In [27]:
# pass some data through the network to initialize the parameters
X = torch.rand(2, 20)
net(X)

net[0].weight.shape

torch.Size([256, 20])

As soon as we know the input dimensionality, 20, the framework can identify the shape of the first layer’s weight matrix by plugging in the value of 20. Having recognized the first layer’s shape, the framework proceeds to the second layer, and so on through the computational graph until all shapes are known.
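
The transition from uninitialized to materialized parameters can also be checked programmatically. A minimal sketch: the weight starts out as an `UninitializedParameter`, and after the first forward pass the layer reports itself as a regular `Linear`, which is why initializers that test for `nn.Linear` work once data has passed through the network.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

before = isinstance(net[0].weight, UninitializedParameter)
net(torch.rand(2, 20))      # the first forward pass infers in_features
after = isinstance(net[0].weight, UninitializedParameter)

print(before, after)
print(type(net[0]).__name__, tuple(net[0].weight.shape), tuple(net[2].weight.shape))
```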

The following method passes in dummy inputs through the network for a dry run to infer all parameter shapes and subsequently initializes the parameters. It will be used later when default random initializations are not desired:

In [29]:
@d2l.add_to_class(d2l.Module)
def apply_init(self, inputs, init=None):
    self.forward(*inputs)
    if init is not None:
        self.net.apply(init)
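
The same idea can be sketched in plain PyTorch without the `d2l` classes. The free function `apply_init` and the initializer `init_small_normal` below are hypothetical stand-ins written for this illustration, not part of any library:

```python
import torch
from torch import nn

def apply_init(net, inputs, init=None):
    """Dry-run `net` on `inputs` to materialize lazy parameters, then apply `init`."""
    with torch.no_grad():
        net(*inputs)
    if init is not None:
        net.apply(init)

def init_small_normal(module):
    # hypothetical initializer used only for this demonstration
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        nn.init.zeros_(module.bias)

net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
apply_init(net, [torch.rand(2, 4)], init_small_normal)
print(tuple(net[0].weight.shape), net[0].bias.abs().sum().item())
```

The dry run must come first: applying an initializer to a layer whose shapes are still unknown would fail.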

## 6.5. Custom Layers

In [30]:
import torch
from torch import nn
from torch.nn import functional as F

from d2l import torch as d2l

### 6.5.1. Layers without Parameters

We can define custom layers that do not have any parameters.

In [31]:
class CenteredLayer(nn.Module):
    '''Custom layer that subtracts the mean from the input'''
    def __init__(self):
        super().__init__()

    def forward(self, X):
        return X - X.mean()

In [33]:
layer = CenteredLayer()

layer(torch.tensor([1.,2.,3.,4.,5.]))

tensor([-2., -1.,  0.,  1.,  2.])

Now we can incorporate this layer as a component in constructing more complex models:

In [34]:
net = nn.Sequential(
    nn.LazyLinear(128),
    CenteredLayer()
)



To check that the layer works inside a network, we can send random data through it and confirm that the mean of the output is 0 up to floating-point error:

In [35]:
Y = net(torch.rand(4, 8))
Y.mean()

tensor(2.3283e-09, grad_fn=<MeanBackward0>)

### 6.5.2. Layers with Parameters

We can use built-in functions to create parameters, which provide some basic housekeeping functionality. In particular, they govern access, initialization, sharing, saving, and loading model parameters. This way, among other benefits, we will not need to write custom serialization routines for every custom layer.

In [36]:
class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        '''Custom fully connected layer

        Parameters
        ----------
        in_units : int
            Number of input units
        units : int
            Number of output units
        '''
        super().__init__()

        # create weight and bias parameters
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units,))

    def forward(self, X):
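        # NOTE: .data bypasses autograd, so no gradients reach weight or bias here;
        # use self.weight and self.bias directly if the layer should be trained
        linear = torch.matmul(X, self.weight.data) + self.bias.data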
        return F.relu(linear)

In [37]:
linear = MyLinear(5, 3)
# the parameters are randomly initialized
linear.weight

Parameter containing:
tensor([[ 1.1529,  0.4675,  1.8489],
        [ 0.5873, -0.4713, -1.9925],
        [-1.2103, -0.3342, -0.0992],
        [ 0.0434, -0.5535,  0.8387],
        [-0.4879,  0.2478,  0.6408]], requires_grad=True)

In [38]:
# forward pass
linear(torch.rand(2, 5))

tensor([[0.8388, 0.5045, 0.2561],
        [1.1452, 0.8066, 1.3693]])

In [39]:
# use the MyLinear custom layer to build an MLP
net = nn.Sequential(
    MyLinear(64, 8),
    MyLinear(8, 1)
)

net(torch.rand(2, 64))

tensor([[11.4477],
        [ 5.8700]])
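
Note that the outputs above carry no `grad_fn`: because `forward` reads `self.weight.data` and `self.bias.data`, the result is detached from the autograd graph. If the layer is meant to be trained, one option is to use the parameters themselves. The class `MyTrainableLinear` below is a hypothetical variant sketched for this purpose:

```python
import torch
from torch import nn
from torch.nn import functional as F

class MyTrainableLinear(nn.Module):
    """Hypothetical variant of MyLinear that keeps autograd attached."""
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units))

    def forward(self, X):
        # use the parameters themselves, not their .data, so gradients can flow
        return F.relu(X @ self.weight + self.bias)

layer = MyTrainableLinear(5, 3)
out = layer(torch.rand(2, 5))
out.sum().backward()

print(out.grad_fn is not None)          # the output is part of the autograd graph
print(tuple(layer.weight.grad.shape))   # the gradient has the weight's shape
print(sorted(layer.state_dict().keys()))
```

Because both tensors are wrapped in `nn.Parameter`, they show up in the `state_dict` without any custom serialization code.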

## 6.6. File I/O

In [1]:
import torch
from torch import nn
from torch.nn import functional as F

### 6.6.1. Loading and Saving Tensors

In [2]:
x = torch.arange(4)

# save a tensor to disk
torch.save(x, 'x-file')

In [3]:
# read a tensor from disk
x2 = torch.load('x-file')
x2

tensor([0, 1, 2, 3])

In [4]:
# save a list of tensors to disk
y = torch.zeros(4)
torch.save([x, y], 'x-files')

In [5]:
# read a list of tensors from disk
x2, y2 = torch.load('x-files')
x2, y2

(tensor([0, 1, 2, 3]), tensor([0., 0., 0., 0.]))

In [6]:
# save a dictionary of tensors to disk
mydict = {'x': x, 'y': y}
torch.save(mydict, 'mydict')

In [7]:
# read a dictionary of tensors from disk
mydict2 = torch.load('mydict')
mydict2

{'x': tensor([0, 1, 2, 3]), 'y': tensor([0., 0., 0., 0.])}

### 6.6.2. Loading and Saving Model Parameters

We prefer saving model parameters over entire models because models can contain arbitrary code and hence cannot be serialized as naturally.

In [10]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    def forward(self, x):
        return self.out(F.relu(self.hidden(x)))

In [11]:
net = MLP()
X = torch.randn(size=(2, 20))
Y = net(X)

In [12]:
# save the model parameters to disk
torch.save(net.state_dict(), 'mlp.params')

In [13]:
# read the model parameters from disk
clone = MLP() # initialize a clone MLP network
clone.load_state_dict(torch.load('mlp.params')) # load the model parameters
clone.eval() # set the clone MLP network to evaluation mode



MLP(
  (hidden): LazyLinear(in_features=0, out_features=256, bias=True)
  (out): LazyLinear(in_features=0, out_features=10, bias=True)
)

In [14]:
# check whether the results are the same
Y_clone = clone(X)
Y_clone == Y

tensor([[True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True]])
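
The same round trip can be written so that it leaves no files behind and loads on a machine without a GPU. This is a minimal sketch under a few assumptions: explicit `nn.Linear` sizes, a temporary directory for the checkpoint, and `map_location="cpu"`, none of which appear in the cells above.

```python
import os
import tempfile
import torch
from torch import nn

net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
X = torch.randn(2, 20)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "mlp.params")
    torch.save(net.state_dict(), path)
    # map_location="cpu" lets a checkpoint written on a GPU load on a CPU-only machine
    state = torch.load(path, map_location="cpu")

# the architecture must be rebuilt in code; only the parameters come from disk
clone = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
load_result = clone.load_state_dict(state)
clone.eval()

print(load_result)
print(torch.equal(clone(X), net(X)))
```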

## 6.7. GPUs

In PyTorch, every tensor has a device, often referred to as a *context*. By default, all variables and associated computation are assigned to the CPU.

In [1]:
import torch
from torch import nn

from d2l import torch as d2l

### 6.7.1. Computing Devices

In PyTorch, the CPU and GPU can be indicated by `torch.device('cpu')` and `torch.device('cuda')`. The `cpu` device means all physical CPUs and memory. A `cuda` device, by contrast, represents only one card and the corresponding memory. If there are multiple GPUs, we use `torch.device(f'cuda:{i}')` to represent the i-th GPU (i starts at 0). By default, `cuda:0` and `cuda` are equivalent.

In [2]:
def cpu():
    '''Get the CPU device'''
    return torch.device('cpu')

def gpu(i=0):
    '''Get a GPU device'''
    return torch.device(f'cuda:{i}')

In [3]:
cpu(), gpu(), gpu(1)

(device(type='cpu'),
 device(type='cuda', index=0),
 device(type='cuda', index=1))

In [4]:
# query the number of GPUs
def num_gpus():
    '''Get the number of GPUs'''
    return torch.cuda.device_count()

In [5]:
num_gpus()

1

We define two utility functions that request GPUs and fall back to the CPU when none is available:

In [6]:
def try_gpu(i=0):
    '''Return gpu(i) if exists, otherwise return cpu()'''
    if num_gpus() >= i + 1:
        return gpu(i)
    return cpu()

def try_all_gpus():
    '''Return all available GPUs, or [cpu(),] if no GPU exists'''
    if num_gpus() >= 1:
        return [try_gpu(i) for i in range(num_gpus())]
    return [cpu(),]

In [7]:
try_gpu()

device(type='cuda', index=0)

In [8]:
try_gpu(10)

device(type='cpu')

In [9]:
try_all_gpus()

[device(type='cuda', index=0)]
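
When only a single device is needed, a shorter idiom is often used in place of helper functions. This is a minimal sketch that relies on `torch.cuda.is_available()` and falls back to the CPU, so it runs on any machine:

```python
import torch

# pick the GPU if one is usable, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.ones(2, 3, device=device)
print(device.type, x.device.type)
```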

### 6.7.2. Tensors and GPUs

In [10]:
x = torch.tensor([1,2,3])
x.device # the device where the tensor is stored

device(type='cpu')

Whenever we want to operate on multiple terms, they need to be on the same device.

#### 6.7.2.1. Storage on the GPU

We can use the `nvidia-smi` command to view GPU memory usage.

In [11]:
# create a tensor and allocate on GPU
X = torch.ones(2, 3, device=try_gpu())
X

tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')

In [13]:
# since we only have one GPU, try_gpu(1) falls back to the CPU, so Y is allocated there
Y = torch.rand(2, 3, device=try_gpu(1))
Y

tensor([[0.8272, 0.3463, 0.0360],
        [0.7980, 0.3112, 0.0974]])

#### 6.7.2.2. Copying

Since `X` and `Y` are on different devices, we cannot operate on them. We need to copy `Y` to the same device as `X` first.

In [14]:
X + Y

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

In [15]:
Z = Y.cuda(0) # move Y to GPU 0
print(X)
print(Z)

tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')
tensor([[0.8272, 0.3463, 0.0360],
        [0.7980, 0.3112, 0.0974]], device='cuda:0')


In [16]:
# now we can perform the operation on GPU
X + Z

tensor([[1.8272, 1.3463, 1.0360],
        [1.7980, 1.3112, 1.0974]], device='cuda:0')

Transferring data is not only slow, it also makes parallelization a lot more difficult, since we have to wait for data to be sent (or rather to be received) before we can proceed with more operations. 

When we print tensors or convert tensors to the NumPy format, if the data is not in the main memory, the framework will copy it to the main memory first, resulting in additional transmission overhead. Even worse, it is now subject to the dreaded global interpreter lock that makes everything wait for Python to complete. This is something that we never want to do and should always be aware of in deep learning.
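
One way to act on this advice is to keep intermediate results on the device and copy to main memory once at the end. The sketch below assumes a device chosen with `torch.cuda.is_available()` so that it also runs on a CPU-only machine, where the transfers are no-ops:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X = torch.ones(2, 3, device=device)
Y = torch.rand(2, 3)             # created on the CPU

Z = Y.to(X.device)               # copies only if Y is not already on that device
total = torch.zeros((), device=device)
for _ in range(3):
    total += (X + Z).sum()       # stays on the device, no host transfer

result = total.item()            # single device-to-host copy at the end
print(Z.device == X.device, result)
```

Calling `.item()` or `print` inside the loop would instead force a synchronization and a copy on every iteration.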

### 6.7.3. Neural Networks and GPUs

In [17]:
# we can put model parameters on GPU
net = nn.Sequential(nn.LazyLinear(1))
# specify device
net = net.to(device=try_gpu())



In [18]:
# when the input is a tensor on the GPU, the model will calculate the result on the same GPU
net(X)

tensor([[-0.2615],
        [-0.2615]], device='cuda:0', grad_fn=<AddmmBackward0>)

In [19]:
# check where the model parameters are stored
net[0].weight.data.device

device(type='cuda', index=0)

We extend the trainer so that it supports GPUs:

In [20]:
@d2l.add_to_class(d2l.Trainer)
def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
    self.save_hyperparameters()

    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]

@d2l.add_to_class(d2l.Trainer)
def prepare_batch(self, batch):
    if self.gpus:
        batch = [a.to(self.gpus[0]) for a in batch]
    return batch

@d2l.add_to_class(d2l.Trainer)
def prepare_model(self, model):
    model.trainer = self
    model.board.xlim = [0, self.max_epochs]
    if self.gpus:
        model.to(self.gpus[0])
    self.model = model

In short, as long as all data and parameters are on the same device, we can learn models efficiently.
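
The same pattern can be sketched in plain PyTorch, without the `d2l` trainer: move the model once, move each batch as it arrives, and everything else is unchanged. The model, data, and hyperparameters below are arbitrary choices made for the illustration.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

model = nn.Linear(3, 1).to(device)          # parameters now live on `device`
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X, y = torch.rand(8, 3), torch.rand(8, 1)   # batches usually start on the CPU
X, y = X.to(device), y.to(device)           # move them next to the parameters

losses = []
for _ in range(5):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
print(losses[0], losses[-1])
```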