# Layers and Modules

A **module is a set of layers**. As such, it could be a single layer, a generic subset of the model or the entire model.\
Each module must have a forward and a backpropagation method. It may, or may not, have parameters inside itself.\
Since auto-differentiation provides us with backpropagation, we are only interested in the parameters and the forward propagation.

In [2]:
import torch
from torch import nn
from torch.nn import functional as F

Let's revisit the code for implementing MLPs.

In [3]:
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

X = torch.rand(2, 20)
net(X).shape



torch.Size([2, 10])

## A custom module

Let's create a module with a hidden layer of 256 units and a 10-dimensional output layer. We subclass `nn.Module` and override only what a module must provide: the constructor, which declares the layers, and the `forward` method.

In [4]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

In [5]:
net = MLP()
net(X).shape

torch.Size([2, 10])
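
As a small sketch of what the parent class does for us, the snippet below repeats the `MLP` definition and inspects it after one forward pass. Assigning a layer to an attribute is enough for `nn.Module` to register it, so its parameters show up without any extra bookkeeping. The variable names `child_names` and `param_shapes` are only illustrative.

```python
import torch
from torch import nn
from torch.nn import functional as F


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))


net = MLP()
# The first forward pass lets the lazy layers infer their input sizes.
net(torch.rand(2, 20))

# Sub-modules assigned as attributes are registered under the attribute name.
child_names = [name for name, _ in net.named_children()]
# Their parameters are collected recursively by the parent class.
param_shapes = {name: tuple(p.shape) for name, p in net.named_parameters()}

print(child_names)
print(param_shapes)
```

Nothing in `MLP` mentions gradients or parameter lists: both come from `nn.Module`.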

## Sequential Class

A `Sequential`-like class essentially needs two things: a way to register modules in an ordered list, and a `forward` method that calls them in that order.

In [8]:
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)

        return X

In [9]:
net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(X).shape



torch.Size([2, 10])
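
To illustrate why `add_module` is used instead of a plain Python list, here is a hedged comparison. `ListHolder` and `Registered` are hypothetical classes written only for this sketch, and `nn.Linear` with explicit sizes is used so that no forward pass is needed.

```python
import torch
from torch import nn


class ListHolder(nn.Module):
    """Hypothetical counter-example: layers kept in a plain Python list."""

    def __init__(self, *layers):
        super().__init__()
        self.layers = list(layers)  # a list is not registered as sub-modules


class Registered(nn.Module):
    """Same idea as MySequential: every layer is registered by index."""

    def __init__(self, *layers):
        super().__init__()
        for idx, layer in enumerate(layers):
            self.add_module(str(idx), layer)


plain = ListHolder(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
registered = Registered(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))

# Only registered sub-modules contribute to parameters().
n_plain = len(list(plain.parameters()))
n_registered = len(list(registered.parameters()))
registered_names = [name for name, _ in registered.named_children()]

print(n_plain, n_registered, registered_names)
```

With the plain list, an optimizer built from `plain.parameters()` would have nothing to update, which is why registration matters.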

## Customizing the forward propagation method

Let's implement a module whose forward pass computes a function of the type

$$f(\mathbf{x}, \mathbf{w}) = c \cdot \mathbf{w}^T\mathbf{x}$$

In [10]:
class FixedHiddenMlp(nn.Module):
    def __init__(self):
        super().__init__()
        #Random weights that receive no gradients and
        #therefore stay constant during training
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        #Reuse the fully connected layer. This is equivalent to sharing
        #parameters with two fully connected layers
        X = self.linear(X)
        #Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()

Note how `rand_weight` is (randomly) initialized as a constant tensor rather than as a parameter, so it is not updated by backpropagation.

In [16]:
net = FixedHiddenMlp()
net(X)

tensor(0.1442, grad_fn=<SumBackward0>)
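
The following sketch checks that a plain tensor attribute stays outside the trainable parameters. `FixedWeight` is a simplified, hypothetical variant of the module above. The `register_buffer` call is not part of the original code: it is shown as one possible alternative, assuming one also wants the constant saved in `state_dict`.

```python
import torch
from torch import nn


class FixedWeight(nn.Module):
    """Hypothetical, simplified variant of FixedHiddenMlp."""

    def __init__(self):
        super().__init__()
        # Plain tensor attribute: constant, not tracked by the module.
        self.rand_weight = torch.rand((20, 20))
        # Assumption for illustration: a buffer is also constant, but it is
        # stored in state_dict.
        self.register_buffer("rand_buffer", torch.rand((20, 20)))
        self.linear = nn.Linear(20, 20)

    def forward(self, X):
        return (self.linear(X) @ self.rand_weight).sum()


net = FixedWeight()
net(torch.rand(2, 20)).backward()

param_names = [name for name, _ in net.named_parameters()]
state_keys = list(net.state_dict().keys())

print(param_names)
print(net.rand_weight.requires_grad, net.rand_weight.grad)
print(state_keys)
```

Only the parameters of `linear` receive gradients. The plain tensor is invisible to both `named_parameters()` and `state_dict()`, while the buffer appears in `state_dict()` only.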

In [19]:
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                                 nn.LazyLinear(32), nn.ReLU())
        self.linear = nn.LazyLinear(16)

    def forward(self, X):
        return self.linear(self.net(X))
    
chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMlp())
chimera(X)

tensor(-0.1942, grad_fn=<SumBackward0>)

# Parameter Management


## Accessing parameters

In [20]:
import torch
from torch import nn

In [24]:
net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4))
net(X).shape

torch.Size([2, 1])

In `Sequential`, we can access parameters as if they were elements of a list.

In [25]:
net[2].state_dict()

OrderedDict([('weight',
              tensor([[ 0.2845, -0.1764,  0.1907, -0.2521, -0.3480, -0.3409,  0.3269,  0.1113]])),
             ('bias', tensor([-0.1759]))])

In [26]:
type(net[2].bias), net[2].bias.data

(torch.nn.parameter.Parameter, tensor([-0.1759]))

Gradients are not initialized until backpropagation is called.

In [28]:
net[2].weight.grad is None

True
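
As a minimal sketch of the other half of this behavior, the snippet below runs a backward pass and looks at the gradient again. It uses `nn.Linear` with explicit sizes instead of the lazy layers, and summing the output is just an arbitrary way to get a scalar to differentiate.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand(size=(2, 4))

# Before any backward pass the gradient slot is empty.
grad_missing_before = net[2].weight.grad is None

# Reduce the output to a scalar and backpropagate.
net(X).sum().backward()

# Now the gradient exists and has the same shape as the weight.
grad_shape_after = net[2].weight.grad.shape

print(grad_missing_before, grad_shape_after)
```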

We can also list all the parameters of the network at once, together with their names.

In [29]:
[(name, param.shape) for name, param in net.named_parameters()]

[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]
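
The same names can be used as keys of `state_dict()`, and iterating over `parameters()` makes whole-network operations short. The sketch below counts the scalar parameters of an equivalent network, built with explicit `nn.Linear` sizes so no forward pass is required. The variable names are illustrative.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Total number of scalar parameters: 4*8 + 8 + 8*1 + 1.
total_params = sum(p.numel() for p in net.parameters())

# Access by name through state_dict and by position through indexing.
bias_by_name = net.state_dict()["2.bias"]
bias_by_index = net[2].bias.data
same_values = torch.equal(bias_by_name, bias_by_index)

print(total_params, same_values)
```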

## Sharing parameters

In [None]:
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))

net(X)
#Let's check that the parameters are the same
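
One way to carry out that check is sketched below: it rebuilds the network, compares the two shared positions, and then modifies one of them to confirm that they are the same object rather than two equal copies. The variable names are illustrative and not part of the original notebook.

```python
import torch
from torch import nn

shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4))
net(X)  # materializes the lazy parameters

# Positions 2 and 4 hold the very same layer, hence the same weight tensor.
same_object = net[2].weight is net[4].weight

# Changing one changes the other, since there is only one tensor.
net[2].weight.data[0, 0] = 100
still_equal = torch.equal(net[2].weight.data, net[4].weight.data)

# The shared layer is counted once: 3 distinct linear layers, 2 tensors each.
n_param_tensors = len(list(net.parameters()))

print(same_object, still_equal, n_param_tensors)
```

Because the layer is used twice in the forward pass, the gradients from both uses are accumulated into the same parameters during backpropagation.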
