# Deep Learning Computation

## Layers and Blocks 

A block can describe a single layer, a component consisting of multiple layers, or the entire model itself.  
**Benefit**: Because blocks can be nested inside other blocks, complex networks can be built by composing simpler ones.  

When implementing a block, we mainly need to define its **parameters** and its **forward propagation function**.  

**A Custom Block**
The basic functionality that each block must provide:
1. Ingest input data as arguments to its forward propagation function.
2. Generate an output by having the forward propagation function return a value. Note that the output may have a different shape from the input. For example, the first fully connected layer in the model below ingests an input of dimension 20 but returns an output of dimension 256.
3. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation function. Typically this happens automatically.
4. Store and provide access to the parameters necessary to execute the forward propagation computation.
5. Initialize model parameters as needed.
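
The five responsibilities listed above can be checked on a very small block. This is only a sketch: `TinyBlock` and `X_demo` are hypothetical names introduced for illustration, and the layer sizes simply reuse the 20-to-256 example from the list.

```python
import torch
from torch import nn

# `TinyBlock` is a hypothetical name used only for this illustration.
class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        # Items 4 and 5: the layer creates, initializes, and stores its parameters.
        self.fc = nn.Linear(20, 256)

    def forward(self, X):
        # Items 1 and 2: take the input as an argument and return the output.
        return torch.relu(self.fc(X))

block = TinyBlock()
X_demo = torch.rand(2, 20)
out = block(X_demo)
out_shape = tuple(out.shape)  # input width 20, output width 256

# Item 3: autograd derives the backward pass from the forward computation.
out.sum().backward()
has_grad = block.fc.weight.grad is not None

param_names = [name for name, _ in block.named_parameters()]
print(out_shape, has_grad, param_names)
```

Only `__init__` and `forward` are written by hand; the gradient computation and the parameter bookkeeping come from `nn.Module` and autograd.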

In [1]:
import torch
from torch import nn
from torch.nn import functional as F

In [2]:
net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))

X = torch.rand(2, 20)
net(X)

tensor([[-0.1266, -0.1119, -0.0194,  0.0367,  0.0129, -0.1256, -0.0122,  0.0760,
          0.0910,  0.0641],
        [-0.1256, -0.1306,  0.1358,  0.1173, -0.0057, -0.0514, -0.0688,  0.0318,
          0.0220,  0.0721]], grad_fn=<AddmmBackward0>)

In [3]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(20, 256)
        self.out = nn.Linear(256, 10)
        
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

In [4]:
net = MLP()
net(X)

tensor([[ 0.1416,  0.0397,  0.2359,  0.0339,  0.0709, -0.1745,  0.0223, -0.0975,
          0.0793, -0.1087],
        [ 0.0816,  0.0416,  0.1981,  0.0437,  0.0139, -0.1104, -0.0383, -0.0727,
          0.2010, -0.2099]], grad_fn=<AddmmBackward0>)

To see how the **Sequential** class works, we can implement it ourselves.  
If we want to implement a new **MySequential** class that has the same functionality as the default **Sequential** class, we need to define two key functions: 
1. A function to append blocks one by one to a list. 
2. A forward propagation function to pass an input through the chain of blocks, in the same order as they were appended.

In [5]:
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            # Here, `module` is an instance of a `Module` subclass. We save it
            # in the member variable `_modules` of the `Module` class, whose
            # type is OrderedDict.
            # Advantage: during our module's parameter initialization,
            # the system knows to look inside the `_modules` dictionary to find
            # sub-modules whose parameters also need to be initialized.
            self._modules[str(idx)] = module

    def forward(self, X):
        # OrderedDict guarantees that the members are traversed in the order they were added
        for block in self._modules.values():
            X = block(X)
        return X

In [6]:
net = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
net(X)

tensor([[-0.1100,  0.0394, -0.0080, -0.1859, -0.1853, -0.0766,  0.1514, -0.2393,
         -0.1976, -0.0616],
        [-0.0349,  0.1682,  0.1669, -0.1522, -0.3319, -0.0587,  0.0477, -0.2973,
         -0.1834, -0.1772]], grad_fn=<AddmmBackward0>)
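
The point made in the `MySequential` comments, that sub-modules must be registered for their parameters to be found, can be illustrated by contrast. The sketch below is not part of the original notebook: `PlainListNet`, `RegisteredNet`, and `make_layers` are hypothetical names, and `nn.ModuleList` is used here as one standard way to register a list of layers.

```python
import torch
from torch import nn

def make_layers():
    # Two linear layers, so four parameter tensors in total.
    return [nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10)]

class PlainListNet(nn.Module):
    def __init__(self, *layers):
        super().__init__()
        # A plain Python list is NOT registered with the Module machinery.
        self.layers = list(layers)

    def forward(self, X):
        for layer in self.layers:
            X = layer(X)
        return X

class RegisteredNet(nn.Module):
    def __init__(self, *layers):
        super().__init__()
        # nn.ModuleList registers every element as a sub-module.
        self.layers = nn.ModuleList(layers)

    def forward(self, X):
        for layer in self.layers:
            X = layer(X)
        return X

n_plain = len(list(PlainListNet(*make_layers()).parameters()))
n_registered = len(list(RegisteredNet(*make_layers()).parameters()))
print(n_plain, n_registered)
```

Both networks compute the same forward pass, but only the registered one exposes its weights to `parameters()`, and therefore to initialization helpers, optimizers, and `state_dict()`.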

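**Advantage**: A custom block is more flexible than **Sequential**: its forward propagation function can run arbitrary Python code, such as control flow, reuse of the same layer, and operations on constant (untrained) tensors.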

In [7]:
class fixedhiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.rand_weight = torch.rand((20, 20), requires_grad=False)
        self.linear = nn.Linear(20, 20)
    
    def forward(self, X):
        X = self.linear(X)
        X = F.relu(torch.mm(X, self.rand_weight) + 1)
        X = self.linear(X)
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()

In [8]:
net = fixedhiddenMLP()
net(X)

tensor(-0.0256, grad_fn=<SumBackward0>)
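
In a block like the one above, the random weight matrix is a constant rather than a model parameter, and the linear layer is reused. A minimal sketch, assuming a smaller hypothetical class `ConstWeightBlock` with 4 units instead of 20, shows how this appears in the parameter list and in the gradients:

```python
import torch
from torch import nn

# `ConstWeightBlock` is a hypothetical, smaller stand-in used for illustration.
class ConstWeightBlock(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain tensor: fixed at construction and never updated by training.
        self.rand_weight = torch.rand((4, 4), requires_grad=False)
        self.linear = nn.Linear(4, 4)

    def forward(self, X):
        X = self.linear(X)
        X = torch.relu(torch.mm(X, self.rand_weight) + 1)
        # The same linear layer is applied a second time (shared parameters).
        return self.linear(X).sum()

block = ConstWeightBlock()
block(torch.rand(2, 4)).backward()

param_names = [name for name, _ in block.named_parameters()]
const_grad = block.rand_weight.grad           # stays None: not a parameter
linear_has_grad = block.linear.weight.grad is not None
print(param_names, const_grad, linear_has_grad)
```

Only the linear layer shows up in `named_parameters()`, and it shows up once even though it is applied twice.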

In [9]:
# Try to mix various blocks and layers together.
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
        self.linear = nn.Linear(32, 16)
        
    def forward(self, X):
        return self.linear(self.net(X))

In [10]:
chimera = nn.Sequential(NestMLP(), nn.Linear(16, 20), fixedhiddenMLP())
chimera(X)

tensor(0.0564, grad_fn=<SumBackward0>)

**Summary**:
1. Layers are blocks.
2. Many layers can comprise a block.
3. Many blocks can comprise a block.
4. A block can contain code.
5. Blocks take care of lots of housekeeping, including parameter initialization and backpropagation.
6. Sequential concatenations of layers and blocks are handled by the **Sequential block**.

## Parameter Management 

In [11]:
# Given an MLP with one hidden layer as an example
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand((2, 4))
net(X)

tensor([[0.3573],
        [0.3759]], grad_fn=<AddmmBackward0>)

**Parameter Access**

In [12]:
print(net[2].state_dict())

OrderedDict([('weight', tensor([[-0.0926,  0.0126, -0.0264,  0.1643,  0.1994, -0.1117,  0.0232,  0.1997]])), ('bias', tensor([0.2834]))])


In [13]:
print(type(net[2].bias))
print(net[2].bias)
print(net[2].bias.data)

<class 'torch.nn.parameter.Parameter'>
Parameter containing:
tensor([0.2834], requires_grad=True)
tensor([0.2834])


In [14]:
net[2].weight.grad is None

True
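
The gradient is `None` here because no backward pass has been run yet. A short sketch, using a hypothetical fresh network `net_demo` with the same layout, shows the attribute being filled in after calling `backward()`:

```python
import torch
from torch import nn

net_demo = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Before any backward pass, no gradient has been stored.
grad_before = net_demo[2].weight.grad

# Run a forward pass, reduce to a scalar, and backpropagate.
net_demo(torch.rand(2, 4)).sum().backward()

grad_after_shape = tuple(net_demo[2].weight.grad.shape)
print(grad_before, grad_after_shape)
```

The gradient has the same shape as the parameter it belongs to.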

In [15]:
print(*[(name, param.shape) for name, param in net[0].named_parameters()])
print(*[(name, param.shape) for name, param in net.named_parameters()])

('weight', torch.Size([8, 4])) ('bias', torch.Size([8]))
('0.weight', torch.Size([8, 4])) ('0.bias', torch.Size([8])) ('2.weight', torch.Size([1, 8])) ('2.bias', torch.Size([1]))


In [16]:
net.state_dict()['2.bias'].data

tensor([0.2834])
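
The access patterns above can be combined to count the parameters of the network. The following is a sketch on a hypothetical fresh copy `net_demo` of the same 4-8-1 MLP; the expected total is (4·8 + 8) + (8·1 + 1) = 49.

```python
import torch
from torch import nn

net_demo = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Names are prefixed with the position of each layer inside the Sequential.
names = [name for name, _ in net_demo.named_parameters()]

# numel() gives the number of scalar entries in each parameter tensor.
total = sum(param.numel() for param in net_demo.parameters())

# The state_dict entry holds the same values as the attribute on the layer.
same_values = torch.equal(net_demo.state_dict()['2.bias'], net_demo[2].bias.data)
print(names, total, same_values)
```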

In [17]:
def block1():
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 4), nn.ReLU())

def block2():
    net = nn.Sequential()
    for i in range(4):
        net.add_module(f'block {i}', block1())
    return net

In [18]:
rgnet = nn.Sequential(block2(), nn.Linear(4, 1))
rgnet(X)

tensor([[-0.2290],
        [-0.2290]], grad_fn=<AddmmBackward0>)

In [19]:
print(rgnet)

Sequential(
  (0): Sequential(
    (block 0): Sequential(
      (0): Linear(in_features=4, out_features=8, bias=True)
      (1): ReLU()
      (2): Linear(in_features=8, out_features=4, bias=True)
      (3): ReLU()
    )
    (block 1): Sequential(
      (0): Linear(in_features=4, out_features=8, bias=True)
      (1): ReLU()
      (2): Linear(in_features=8, out_features=4, bias=True)
      (3): ReLU()
    )
    (block 2): Sequential(
      (0): Linear(in_features=4, out_features=8, bias=True)
      (1): ReLU()
      (2): Linear(in_features=8, out_features=4, bias=True)
      (3): ReLU()
    )
    (block 3): Sequential(
      (0): Linear(in_features=4, out_features=8, bias=True)
      (1): ReLU()
      (2): Linear(in_features=8, out_features=4, bias=True)
      (3): ReLU()
    )
  )
  (1): Linear(in_features=4, out_features=1, bias=True)
)


In [20]:
rgnet[0][1][0].bias.data

tensor([-0.4816, -0.0984,  0.3902,  0.0649,  0.3528, -0.4240,  0.2899,  0.4091])
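
Nested indexing and dotted parameter names refer to the same objects. The sketch below rebuilds the nested network under the hypothetical name `rgnet_demo` and looks up the same bias both ways; the name `'0.block 1.0.bias'` follows from the module names printed above.

```python
import torch
from torch import nn

def block1():
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 4), nn.ReLU())

def block2():
    net = nn.Sequential()
    for i in range(4):
        net.add_module(f'block {i}', block1())
    return net

rgnet_demo = nn.Sequential(block2(), nn.Linear(4, 1))

# Every parameter gets a dotted path built from the names of its ancestors.
params = dict(rgnet_demo.named_parameters())
n_params = len(params)

# Positional indexing and the dotted name reach the same Parameter object.
same_object = params['0.block 1.0.bias'] is rgnet_demo[0][1][0].bias
print(n_params, same_object)
```

Four blocks with two linear layers each contribute 16 tensors, plus the weight and bias of the final layer.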

**Parameter Initialization**

In [21]:
def init_normal(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, mean=0, std=0.01)
        nn.init.zeros_(m.bias)
        
net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([ 0.0089, -0.0041, -0.0040, -0.0080]), tensor(0.))

In [22]:
def init_constant(m):
    if type(m) == nn.Linear:
        nn.init.constant_(m.weight, 1)
        nn.init.zeros_(m.bias)

net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([1., 1., 1., 1.]), tensor(0.))

In [23]:
# use different initialization methods in different blocks
net[0].apply(init_normal)
net[2].apply(init_constant)

print(net[0].weight.data[0], net[0].bias.data[0])
print(net[2].weight.data[0], net[2].bias.data[0])

tensor([ 0.0122, -0.0154,  0.0068, -0.0022]) tensor(0.)
tensor([1., 1., 1., 1., 1., 1., 1., 1.]) tensor(0.)
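
The same per-block pattern works with other initializers from `nn.init`. As a sketch, the example below applies Xavier uniform initialization to the first layer and a constant to the second; the helper names `init_xavier` and `init_42`, the network `net_demo`, and the constant 42 are illustrative choices, not part of the notebook above.

```python
import math
import torch
from torch import nn

net_demo = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

def init_xavier(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)

def init_42(m):
    if type(m) == nn.Linear:
        nn.init.constant_(m.weight, 42)

net_demo[0].apply(init_xavier)
net_demo[2].apply(init_42)

# Xavier uniform samples from [-a, a] with a = sqrt(6 / (fan_in + fan_out)).
bound = math.sqrt(6 / (4 + 8))
within_bound = bool((net_demo[0].weight.abs() <= bound + 1e-6).all())
all_42 = bool((net_demo[2].weight == 42).all())
print(within_bound, all_42)
```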


In [24]:
# custom initialization
def my_init(m):
    if type(m) == nn.Linear:
        # Report the parameters of the layer being initialized (`m`), not of the whole `net`
        print('Init', *[(name, param.shape) for name, param in m.named_parameters()])
        nn.init.uniform_(m.weight, -10, 10)
        # Zero out every weight whose absolute value is below 5
        m.weight.data *= m.weight.data.abs() >= 5
        
net.apply(my_init)
net[0].weight[:2]

Init ('weight', torch.Size([8, 4])) ('bias', torch.Size([8]))
Init ('weight', torch.Size([1, 8])) ('bias', torch.Size([1]))

tensor([[ 6.1249,  0.0000, -9.7135, -6.3069],
        [ 6.1423, -0.0000,  0.0000,  0.0000]], grad_fn=<SliceBackward0>)
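
The line `m.weight.data *= m.weight.data.abs() >= 5` is what produces the zeros in the output above: the comparison yields a boolean mask, and multiplying by it keeps entries where the mask is true and zeroes the rest. A small sketch on a hand-picked tensor (the values are arbitrary) makes the effect visible:

```python
import torch

w = torch.tensor([-7.0, -3.0, 0.5, 4.0, 6.0])

# True where the magnitude is at least 5, False elsewhere.
mask = w.abs() >= 5

# In-place multiplication treats True as 1 and False as 0.
w *= mask
print(mask.tolist(), w.tolist())
```

Only -7 and 6 survive; the other entries become zero (possibly printed as `-0.` for negative inputs, as in the output above).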

**Tied Parameters**

In [25]:
# We need to give the shared layer a name so that we can refer to its parameters
shared = nn.Linear(8, 8)

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), shared, nn.ReLU(), shared, nn.ReLU(), nn.Linear(8, 1))
net(X)

# check whether the parameters are shared
print(net[2].weight.data == net[4].weight.data)
net[2].weight.data[0] = 100
# check whether the shared parameters update together
print(net[2].weight.data == net[4].weight.data)

tensor([[True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True]])
tensor([[True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True]])
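
The second comparison is still all `True` because positions 2 and 4 do not merely hold equal values: they are the same layer object. A sketch with a hypothetical network `tied` built the same way checks this by identity and counts the distinct parameter tensors:

```python
import torch
from torch import nn

shared = nn.Linear(8, 8)
tied = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), shared, nn.ReLU(),
                     shared, nn.ReLU(), nn.Linear(8, 1))

# Both positions refer to one module, hence to one weight Parameter.
same_module = tied[2] is tied[4]
same_weight = tied[2].weight is tied[4].weight

# parameters() reports each distinct tensor once: 2 + 2 (shared) + 2.
n_tensors = len(list(tied.parameters()))

# During backpropagation, the contributions from both uses of the shared
# layer are accumulated into the single gradient of the shared weight.
tied(torch.rand(2, 4)).sum().backward()
grad_shape = tuple(shared.weight.grad.shape)
print(same_module, same_weight, n_tensors, grad_shape)
```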


## Custom Layers 

In [27]:
import torch
from torch import nn
from torch.nn import functional as F

# Layers without parameters
class CenteredLayer(nn.Module):
    def __init__(self):
        super().__init__()
        
    def forward(self, X):
        return X - X.mean()

In [28]:
layer = CenteredLayer()
layer(torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0]))

tensor([-2., -1.,  0.,  1.,  2.])

In [30]:
net = nn.Sequential(nn.Linear(8, 128), CenteredLayer())
Y = net(torch.rand(4, 8))
Y.mean()

tensor(-4.6566e-09, grad_fn=<MeanBackward0>)

In [31]:
# Layers with parameters
class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units, ))
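    def forward(self, X):
        # Note: `.data` bypasses autograd, so the output carries no `grad_fn` and
        # no gradients reach `weight` or `bias`; use `self.weight` and `self.bias`
        # directly if the layer should be trained.
        linear = torch.matmul(X, self.weight.data) + self.bias.data
        return F.relu(linear)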

In [32]:
linear = MyLinear(5, 3)
linear.weight

Parameter containing:
tensor([[ 1.6702,  0.6636, -1.7579],
        [ 1.3084, -0.5721, -1.6945],
        [-0.3314,  0.2430,  0.4385],
        [ 0.0226, -1.1508, -1.7520],
        [-1.6324,  0.3254, -1.1020]], requires_grad=True)

In [33]:
linear(torch.randn(2, 5))

tensor([[0.0000, 0.0000, 0.1374],
        [0.0000, 4.8473, 2.8738]])

In [34]:
net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))
net(torch.randn(5, 64))

tensor([[0.0000],
        [3.7473],
        [0.0000],
        [8.2525],
        [0.0000]])
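
The outputs of `MyLinear` above carry no `grad_fn` because its `forward` reads `self.weight.data` and `self.bias.data`, which are detached from autograd. If the layer is meant to be trained, one possible variant (a sketch under the hypothetical name `MyLinearTrainable`, not the notebook's own class) uses the parameters directly:

```python
import torch
from torch import nn
from torch.nn import functional as F

# Hypothetical variant of MyLinear that keeps the autograd graph intact.
class MyLinearTrainable(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units))

    def forward(self, X):
        # Using the Parameters themselves lets gradients flow back to them.
        return F.relu(torch.matmul(X, self.weight) + self.bias)

layer = MyLinearTrainable(5, 3)
out = layer(torch.randn(2, 5))
tracks_grad = out.grad_fn is not None

out.sum().backward()
grad_shape = tuple(layer.weight.grad.shape)
print(tracks_grad, grad_shape)
```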

**Summary**
1. We can design custom layers via the basic layer class. This allows us to define flexible new layers that behave differently from any existing layers in the library.
2. Once defined, custom layers can be invoked in arbitrary contexts and architectures. 
3. Layers can have local parameters, which can be created through built-in functionality such as `nn.Parameter`.

## File I/O 

In [38]:
import torch
from torch import nn
from torch.nn import functional as F

x = torch.arange(4)
torch.save(x, 'x-file')

In [39]:
x2 = torch.load('x-file')
x2

tensor([0, 1, 2, 3])
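
`torch.save` and `torch.load` also handle containers of tensors. The sketch below stores a list and a dictionary; the file names and the temporary directory are illustrative choices so that the example cleans up after itself.

```python
import os
import tempfile
import torch

x = torch.arange(4)
y = torch.zeros(4)

with tempfile.TemporaryDirectory() as tmp:
    # A list of tensors round-trips as a list.
    list_path = os.path.join(tmp, 'xy-list')
    torch.save([x, y], list_path)
    x2, y2 = torch.load(list_path)

    # A dictionary mapping strings to tensors round-trips as a dictionary.
    dict_path = os.path.join(tmp, 'xy-dict')
    torch.save({'x': x, 'y': y}, dict_path)
    loaded = torch.load(dict_path)

print(x2, y2, sorted(loaded))
```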

**Loading and Saving Model Parameters**

In [40]:
net = MLP()
X = torch.randn(size=(2, 20))
Y = net(X)

In [41]:
torch.save(net.state_dict(), 'mlp.params')

In [42]:
clone = MLP()
clone.load_state_dict(torch.load('mlp.params'))
clone.eval()

MLP(
  (hidden): Linear(in_features=20, out_features=256, bias=True)
  (out): Linear(in_features=256, out_features=10, bias=True)
)

In [43]:
Y_clone = clone(X)
Y_clone == Y

tensor([[True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True]])

**Summary**
1. The save and load functions can be used to perform file I/O for tensor objects.
2. We can save and load the entire set of parameters of a network via a parameter dictionary (`state_dict`).
3. Saving the architecture has to be done in code rather than in parameters.
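
The third point can be made concrete: a saved parameter dictionary contains only named tensors, so it loads only into a model whose code defines matching layers. In this sketch, `make_mlp` is a hypothetical helper and an in-memory buffer stands in for a file on disk.

```python
import io
import torch
from torch import nn

def make_mlp(hidden):
    # Hypothetical helper: the architecture lives in this code, not in the file.
    return nn.Sequential(nn.Linear(20, hidden), nn.ReLU(), nn.Linear(hidden, 10))

# Save the parameters of a network with 256 hidden units.
buffer = io.BytesIO()
torch.save(make_mlp(256).state_dict(), buffer)
buffer.seek(0)
state = torch.load(buffer)

# Loading into the same architecture works.
make_mlp(256).load_state_dict(state)

# Loading into a different architecture fails because the shapes differ.
try:
    make_mlp(128).load_state_dict(state)
    mismatch_rejected = False
except RuntimeError:
    mismatch_rejected = True

print(list(state.keys()), mismatch_rejected)
```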