# Parameter Management

The ultimate goal of training deep networks is to find good parameter values for a given architecture. When everything is standard, the __torch.nn.Sequential__ class is a perfectly good tool for it. However, very few models are entirely standard and most scientists want to build things that are novel. This section shows how to manipulate parameters. In particular we will cover the following aspects:

- Accessing parameters for debugging, diagnostics, to visualize them or to save them is the first step to understanding how to work with custom models.
- Secondly, we want to set them in specific ways, e.g. for initialization purposes. We discuss the structure of parameter initializers.
- Lastly, we show how this knowledge can be put to good use by building networks that share some parameters.

As always, we start from our trusty Multilayer Perceptron with a hidden layer. This will serve as our choice for demonstrating the various features.

In [2]:
import torch
import torch.nn as nn

net = nn.Sequential()
net.add_module('Linear_1', nn.Linear(20, 256, bias = False))
net.add_module('relu', nn.ReLU())
net.add_module('Linear_2', nn.Linear(256, 10, bias = False))

# the init_weights function initializes the weights of our multi-layer perceptron 
def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)

# the net.apply() applies the above stated initialization of weights to our net        
net.apply(init_weights) 

x = torch.randn(2,20)   # initializing a random tensor of shape (2, 20)
net(x)  #Forward computation

tensor([[-0.3476,  0.6260, -0.2687, -0.3632,  0.0774, -0.0411, -0.1944,  0.4926,
          0.2889, -0.0884],
        [-0.0889, -0.1205, -0.4132,  0.2207, -0.0655, -0.7331, -0.0264,  0.0929,
          0.5911, -0.0599]], grad_fn=<MmBackward>)

## Parameter Access


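In the case of a Sequential class we can reach the parameters of any layer with ease, simply by indexing into the network. Note that __net.parameters__ without parentheses is a bound method, so printing it shows the module it belongs to rather than the parameter values. Let’s try this out in practice by inspecting the layers.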

In [10]:
print(net[0].parameters)
print(net[1].parameters)
print(net[2].parameters)
print(net.parameters)

<bound method Module.parameters of Linear(in_features=20, out_features=256, bias=False)>
<bound method Module.parameters of ReLU()>
<bound method Module.parameters of Linear(in_features=256, out_features=10, bias=False)>
<bound method Module.parameters of Sequential(
  (Linear_1): Linear(in_features=20, out_features=256, bias=False)
  (relu): ReLU()
  (Linear_2): Linear(in_features=256, out_features=10, bias=False)
)>


The output tells us a number of things. Firstly, there are 3 layers: 2 linear layers and 1 ReLU layer, as we would expect. The output also gives the input and output sizes of each linear layer, which determine the shapes of their weights. The layer names (`Linear_1`, `relu`, `Linear_2`) are very useful since they allow us to identify parameters uniquely even in a network of hundreds of layers and with nontrivial structure. Finally, the output confirms that bias is __False__, as we specified.


### Targeted Parameters

In order to do something useful with the parameters we need to access them, though. There are several ways to do this, ranging from simple to general. Let’s look at some of them.

In [21]:
print(net[0].bias)

None


This prints the bias of the first linear layer. Since we created the layer with `bias = False`, it has no bias term and the output is None. <font color=red>We can also access a layer by the name we gave it, such as `Linear_1`. Both forms are entirely equivalent and return the same parameter object.</font>

In [4]:
print(net.Linear_1.weight)
print(net[0].weight)

Parameter containing:
tensor([[-0.0240, -0.0679,  0.0283,  ..., -0.1449, -0.0616,  0.0193],
        [-0.1202, -0.1113, -0.1229,  ..., -0.0826, -0.0808,  0.0885],
        [ 0.0946, -0.0338, -0.0716,  ...,  0.0438,  0.0479, -0.0715],
        ...,
        [-0.0840, -0.0112,  0.1035,  ..., -0.0980,  0.1118,  0.1394],
        [-0.0126,  0.0873,  0.0235,  ..., -0.0922,  0.0635,  0.0197],
        [-0.1274,  0.0696,  0.0887,  ...,  0.0605,  0.1470,  0.0634]],
       requires_grad=True)
Parameter containing:
tensor([[-0.0240, -0.0679,  0.0283,  ..., -0.1449, -0.0616,  0.0193],
        [-0.1202, -0.1113, -0.1229,  ..., -0.0826, -0.0808,  0.0885],
        [ 0.0946, -0.0338, -0.0716,  ...,  0.0438,  0.0479, -0.0715],
        ...,
        [-0.0840, -0.0112,  0.1035,  ..., -0.0980,  0.1118,  0.1394],
        [-0.0126,  0.0873,  0.0235,  ..., -0.0922,  0.0635,  0.0197],
        [-0.1274,  0.0696,  0.0887,  ...,  0.0605,  0.1470,  0.0634]],
       requires_grad=True)


Note that the weights are nonzero. This is by design since we applied __Xavier initialization__ to our network. We can also access the gradient with respect to the parameters. Once computed, it has the same shape as the weight. However, since we have not invoked backpropagation yet, it is None.

In [5]:
print(net[0].weight.grad)

None
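To see the gradient become populated, we can run a forward pass, reduce the output to a scalar, and call `backward()`. The sketch below rebuilds a network of the same shape under the hypothetical name `demo_net` so that it runs on its own; the `sum()` reduction is only a convenient scalar, not a meaningful loss.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the MLP above, rebuilt so the snippet is self-contained
demo_net = nn.Sequential(
    nn.Linear(20, 256, bias=False),
    nn.ReLU(),
    nn.Linear(256, 10, bias=False),
)

x = torch.randn(2, 20)
grad_before = demo_net[0].weight.grad   # None: no backward pass has run yet

# Any scalar works for illustration; sum() is not a meaningful loss
demo_net(x).sum().backward()

grad_after = demo_net[0].weight.grad    # now a tensor shaped like the weight
print(grad_before)
print(grad_after.shape == demo_net[0].weight.shape)
```

After the backward pass the gradient is a regular tensor, so it can be inspected or plotted like any other.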


### All parameters at once

Accessing parameters as described above can be a bit tedious, in particular if we have more complex blocks, or blocks of blocks (or even blocks of blocks of blocks), since we need to walk through the entire tree in reverse order to how the blocks were constructed. To avoid this, <font color=red>blocks come with a method __state_dict__ which collects all parameters of a network in one dictionary so that we can traverse it with ease. It does so by iterating over all constituents of a block and calling __state_dict__ on subblocks as needed.</font> To see the difference consider the following:

In [19]:
print(net[0].state_dict) # only for first layer, return a method
print(net.state_dict) # for entire network, similar to net.parameters

<bound method Module.state_dict of Linear(in_features=20, out_features=256, bias=False)>
<bound method Module.state_dict of Sequential(
  (Linear_1): Linear(in_features=20, out_features=256, bias=False)
  (relu): ReLU()
  (Linear_2): Linear(in_features=256, out_features=10, bias=False)
)>


This provides us with a third way of accessing the parameters of the network. If we wanted to get the value of the weight term of the first linear layer we could simply use this:

In [17]:
net.state_dict()['Linear_1.weight']

tensor([[ 0.1123, -0.1297, -0.0970,  ..., -0.1204, -0.0854,  0.0092],
        [ 0.0077,  0.0488,  0.1312,  ..., -0.0447, -0.0044,  0.0673],
        [ 0.0033,  0.0940,  0.1256,  ...,  0.1111,  0.0165,  0.0680],
        ...,
        [ 0.0944, -0.0028,  0.1007,  ..., -0.0975, -0.0777, -0.1427],
        [ 0.0014, -0.0281,  0.0673,  ..., -0.0432, -0.0399, -0.1075],
        [ 0.0391,  0.1284,  0.0727,  ...,  0.0479, -0.1375,  0.0133]])

<font color=red>**Three ways to access the network parameter values:**</font>
1. `net[0].weight`; 
2. `net.Linear_1.weight`;
3. `net.state_dict()['Linear_1.weight']` . `state_dict` is a method, while `state_dict()` returns the OrderedDict of net parameter values. 

<font color=red>`net.parameters` without parentheses is a bound method; printing it shows the net structure, just like printing `net.state_dict`. `net.parameters()` returns a generator that can be iterated over with `for`. `net.state_dict()` returns an OrderedDict.</font>
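The distinction between the bound method and its return value can be checked directly. The following is a minimal sketch on a rebuilt copy of the network (the name `demo_net` is hypothetical, chosen so the snippet does not depend on earlier cells).

```python
from collections import OrderedDict
import types

import torch
import torch.nn as nn

# Hypothetical copy of the MLP above, with the same layer names
demo_net = nn.Sequential()
demo_net.add_module('Linear_1', nn.Linear(20, 256, bias=False))
demo_net.add_module('relu', nn.ReLU())
demo_net.add_module('Linear_2', nn.Linear(256, 10, bias=False))

# Without parentheses we only get the bound method itself
is_method = isinstance(demo_net.parameters, types.MethodType)

# Calling it yields a generator of Parameter objects
shapes = [tuple(p.shape) for p in demo_net.parameters()]

# state_dict() returns an OrderedDict mapping names to tensors
sd = demo_net.state_dict()
keys = list(sd.keys())

print(is_method, shapes, keys, isinstance(sd, OrderedDict))
```

The ReLU layer contributes nothing to either collection because it has no parameters.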

### Rube Goldberg strikes again

Let's see how the parameter naming conventions work if we nest multiple blocks inside each other. For that we first define a function that produces blocks (a block factory, so to speak) and then we combine these inside yet larger blocks.

In [3]:
def block1():
    net = nn.Sequential(nn.Linear(16, 32),
                        nn.ReLU(),
                        nn.Linear(32, 16),
                        nn.ReLU())
    return net

def block2():
    net = nn.Sequential()
    for i in range(4):
        net.add_module('block' + str(i), block1())
    return net    
        
rgnet = nn.Sequential()
rgnet.add_module('model',block2())
rgnet.add_module('Last_linear_layer', nn.Linear(16,10))
rgnet.apply(init_weights)
x = torch.randn(2,16)
rgnet(x) # forward computation

tensor([[-0.0419,  0.0923,  0.0180, -0.0494,  0.0616, -0.0501, -0.0488,  0.0974,
         -0.1014,  0.1024],
        [-0.0367,  0.0878,  0.0306, -0.0513,  0.0600, -0.0401, -0.0568,  0.0841,
         -0.1074,  0.0789]], grad_fn=<AddmmBackward>)

Now that we are done designing the network, let's see how it is organized. __state_dict__ provides us with this information, both in terms of naming and in terms of logical structure.

In [6]:
# Count the parameter tensors; uncomment the print to inspect each one
idx = 0
for param in rgnet.parameters():
    idx += 1
    # print(param.size(), param.dtype)
print(idx)

# Print the hierarchical name of every entry in the state dictionary
for key, value in rgnet.state_dict().items():
    print(key)
print(len(rgnet.state_dict()))

# Equivalent form, where named_parameters() is a generator object
# for name, param in rgnet.named_parameters():
#     print(name)

18
model.block0.0.weight
model.block0.0.bias
model.block0.2.weight
model.block0.2.bias
model.block1.0.weight
model.block1.0.bias
model.block1.2.weight
model.block1.2.bias
model.block2.0.weight
model.block2.0.bias
model.block2.2.weight
model.block2.2.bias
model.block3.0.weight
model.block3.0.bias
model.block3.2.weight
model.block3.2.bias
Last_linear_layer.weight
Last_linear_layer.bias
18


### <font color=red> Three ways to iterate all the net parameters</font>
1. `for param in net.parameters()`, `param.size(), param.data, param.dtype`;
2. `for key, value in net.state_dict().items()`, `value` is a Tensor, equivalent to `param.data`;
3. `for name, param in net.named_parameters()`.

Since the layers are hierarchically generated, we can also access them accordingly. For instance, to access the bias of the first layer inside the second subblock of the first major block, we index as follows.

In [33]:
rgnet[0][1][0].bias.data

tensor([ 0.1900,  0.1228,  0.2197,  0.2124, -0.1286, -0.1921,  0.0118, -0.0039,
         0.1521,  0.1227, -0.1745, -0.0468, -0.1413,  0.1385,  0.1220,  0.0322,
         0.1811, -0.0116,  0.1389, -0.0166,  0.1364, -0.1574,  0.1024,  0.1932,
        -0.1661, -0.1360, -0.1226,  0.0430,  0.1220,  0.0052, -0.1072, -0.1957])
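The same parameter can be reached by index, by attribute name, or by its full dotted name. The sketch below rebuilds the nested network under the hypothetical names `make_block` and `demo_rgnet` and checks that all three routes return the very same object.

```python
import torch
import torch.nn as nn

def make_block():
    # Same layout as block1() above
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                         nn.Linear(32, 16), nn.ReLU())

# Hypothetical rebuild of the nested network with the same module names
inner = nn.Sequential()
for i in range(4):
    inner.add_module('block' + str(i), make_block())
demo_rgnet = nn.Sequential()
demo_rgnet.add_module('model', inner)
demo_rgnet.add_module('Last_linear_layer', nn.Linear(16, 10))

by_index = demo_rgnet[0][1][0].bias                  # positional indexing
by_attr = demo_rgnet.model.block1[0].bias            # names given to add_module
by_name = dict(demo_rgnet.named_parameters())['model.block1.0.bias']

print(by_index is by_attr, by_attr is by_name)
```

Which route reads best depends on whether the submodules were given meaningful names when the network was built.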

## Parameter Initialization

Now that we know how to access the parameters, let's look at how to initialize them properly. We discussed the need for initialization in section Numerical Stability. PyTorch's init module provides a variety of preset initialization methods, but if we want something out of the ordinary, we need a bit of extra work. To initialize the weights of a single layer, we call a function from __torch.nn.init__ on its weight tensor. For instance:

In [34]:
linear1 = nn.Linear(2,5,bias=True)
torch.nn.init.normal_(linear1.weight, mean=0, std =0.01)  

Parameter containing:
tensor([[-0.0004,  0.0117],
        [ 0.0066,  0.0032],
        [-0.0063, -0.0070],
        [ 0.0066, -0.0039],
        [ 0.0049, -0.0051]], requires_grad=True)

To initialize every linear layer of a network, we wrap the initializer in a function and pass it to `net.apply`, which calls the function on each submodule in turn.

In [35]:
def init_weight(m):
    if type(m) == nn.Linear:
        torch.nn.init.normal_(m.weight)
        
net = nn.Sequential()
net.add_module('Linear_1', nn.Linear(2, 5, bias = False))
net.add_module('Linear_2', nn.Linear(5, 5, bias = False))

net.apply(init_weight)
print(net.state_dict())

OrderedDict([('Linear_1.weight', tensor([[ 0.3055,  1.3963],
        [-0.4637,  0.4855],
        [-0.2062, -0.2978],
        [-1.8589,  0.4849],
        [-0.6600,  0.5511]])), ('Linear_2.weight', tensor([[ 1.0764,  0.3084,  1.9261, -0.2602, -0.4658],
        [-0.4218,  1.8713, -1.4679,  1.9242,  0.1827],
        [ 1.1824, -1.4874, -0.4012,  0.5764, -0.5203],
        [ 0.0208,  0.0575,  1.5464, -0.5157, -0.9706],
        [-0.4874, -0.2499, -0.1963,  0.6635,  0.0814]]))])


### Built-in Initialization

Let’s begin with the built-in initializers. The code below initializes all weights with Gaussian random variables; with its default arguments, `normal_` uses mean 0 and standard deviation 1.

In [36]:
def gaussian_normal(m):
    if type(m) == nn.Linear:
        torch.nn.init.normal_(m.weight)
        
net.apply(gaussian_normal)
print(net[0].weight)

Parameter containing:
tensor([[ 0.6190,  0.0903],
        [ 1.5809,  0.4545],
        [ 0.1126, -0.9759],
        [ 0.4031,  0.0233],
        [ 0.1432,  2.5701]], requires_grad=True)


<font color=red>If we wanted to initialize all parameters to 1, we could do this simply by changing the initializer to __torch.nn.init.constant_(tensor,1)__.</font>

In [37]:
def ones(m):
    if type(m) == nn.Linear:
        torch.nn.init.constant_(m.weight, 1)
        
net.apply(ones)
print(net.state_dict())

OrderedDict([('Linear_1.weight', tensor([[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]])), ('Linear_2.weight', tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]]))])


If we want to initialize only a specific parameter in a different manner, we can simply set the initializer only for the appropriate subblock (or parameter) for that matter. For instance, below we initialize the __second layer__ to a constant value of __42__ and we use the __Xavier initializer__ for the weights of the __first layer__.

In [38]:
block1 = nn.Sequential()
block1.add_module('Linear_1', nn.Linear(2,5,bias=False))
block2 = nn.Sequential()
block2.add_module('Linear_2', nn.Linear(5,5,bias=False))

model = nn.Sequential()
model.add_module('first', block1)
model.add_module('second', block2)

def xavier_uniform(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
def init_42(m):
    if type(m) == nn.Linear:
        torch.nn.init.constant_(m.weight, 42)

# initialize the blocks separately
block1.apply(xavier_uniform)
block2.apply(init_42)
print(model.state_dict())

OrderedDict([('first.Linear_1.weight', tensor([[-0.1309,  0.5675],
        [ 0.1852,  0.7134],
        [-0.5765, -0.3884],
        [ 0.3217, -0.0702],
        [ 0.4140, -0.2638]])), ('second.Linear_2.weight', tensor([[42., 42., 42., 42., 42.],
        [42., 42., 42., 42., 42.],
        [42., 42., 42., 42., 42.],
        [42., 42., 42., 42., 42.],
        [42., 42., 42., 42., 42.]]))])
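Another way to target specific parameters is to select them by name while iterating over `named_parameters()`. This is only a sketch: the model name `demo_model` is hypothetical, and the prefix test relies on the submodules being registered as `first` and `second`, as in the example above.

```python
import torch
import torch.nn as nn

# Hypothetical two-block model; the inner layers are unnamed, so they get index 0
demo_model = nn.Sequential()
demo_model.add_module('first', nn.Sequential(nn.Linear(2, 5, bias=False)))
demo_model.add_module('second', nn.Sequential(nn.Linear(5, 5, bias=False)))

for name, param in demo_model.named_parameters():
    if name.startswith('second.'):
        nn.init.constant_(param, 42)      # constant for the second block
    else:
        nn.init.xavier_uniform_(param)    # Xavier for everything else

first_weight = demo_model.state_dict()['first.0.weight']
second_weight = demo_model.state_dict()['second.0.weight']
print(first_weight)
print(second_weight)
```

Selecting by name avoids having to keep a separate handle on each subblock, at the cost of depending on the naming scheme.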


### Custom Initialization

Sometimes, the initialization methods we need are not provided in the init module. In that case, we can write our own function and use it to initialize the weights. In the example below, we pick a decidedly bizarre and nontrivial distribution, just to prove the point. We draw the coefficients from the following distribution:

$$ \begin{aligned} w \sim \begin{cases} U[5, 10] & \text{ with probability } \frac{1}{4} \\
0 & \text{ with probability } \frac{1}{2} \\
U[-10, -5] & \text{ with probability } \frac{1}{4} \end{cases} \end{aligned} $$

In [39]:
def custom(m):
    # Draw every weight from U[-10, 10], then zero the entries with |w| <= 5
    weight = m[0].weight
    torch.nn.init.uniform_(weight, -10, 10)
    weight.data[weight.data.abs() <= 5] = 0
    
    
m = nn.Sequential(nn.Linear(5,5,bias=False))
custom(m)
print(m.state_dict())

OrderedDict([('0.weight', tensor([[ 0.0000,  6.4634,  7.9618,  0.0000,  0.0000],
        [ 7.5903,  0.0000, -6.7692,  0.0000,  0.0000],
        [ 0.0000,  6.1067,  0.0000,  0.0000,  7.3671],
        [ 0.0000,  7.3875, -6.7359,  9.6736,  8.2806],
        [ 7.0169,  7.7853,  0.0000, -9.0504, -6.4618]]))])
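One way to convince ourselves that zeroing the mid-range values yields the stated mixture is to apply the same recipe to a large layer and measure the fractions. The helper name `sample_custom_` and the layer size are assumptions made for this sketch; the in-place updates are wrapped in `torch.no_grad()`.

```python
import torch
import torch.nn as nn

def sample_custom_(layer):
    # Hypothetical helper: U[-10, 10] draw, then zero entries with |w| <= 5
    with torch.no_grad():
        layer.weight.uniform_(-10, 10)
        layer.weight[layer.weight.abs() <= 5] = 0

torch.manual_seed(0)
big = nn.Linear(1000, 1000, bias=False)   # one million weights
sample_custom_(big)

w = big.weight.detach()
frac_zero = (w == 0).float().mean().item()
frac_pos = (w >= 5).float().mean().item()
frac_neg = (w <= -5).float().mean().item()
print(frac_zero, frac_pos, frac_neg)
```

With this many samples the three fractions should land close to 1/2, 1/4 and 1/4 respectively.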


If even this functionality is insufficient, we can set parameters directly. Since __.data__ returns a Tensor we can access it just like any other matrix.

In [40]:
m[0].weight.data +=1
m[0].weight.data[0][0] = 42
m[0].weight.data

tensor([[42.0000,  7.4634,  8.9618,  1.0000,  1.0000],
        [ 8.5903,  1.0000, -5.7692,  1.0000,  1.0000],
        [ 1.0000,  7.1067,  1.0000,  1.0000,  8.3671],
        [ 1.0000,  8.3875, -5.7359, 10.6736,  9.2806],
        [ 8.0169,  8.7853,  1.0000, -8.0504, -5.4618]])
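An alternative that avoids `.data` is to modify the parameter in place inside a `torch.no_grad()` block, so that the edits are not recorded by autograd. The sketch below uses a hypothetical fresh layer named `layer`, initialized to zeros so that the result is easy to read.

```python
import torch
import torch.nn as nn

# Hypothetical layer, zero-initialized so the effect of each edit is obvious
layer = nn.Linear(5, 5, bias=False)
nn.init.zeros_(layer.weight)

with torch.no_grad():
    layer.weight += 1          # shift every entry by one
    layer.weight[0, 0] = 42    # overwrite a single entry

print(layer.weight.requires_grad)
print(layer.weight[0, 0].item(), layer.weight[1, 1].item())
```

The parameter still requires gradients afterwards, so it keeps participating in training as usual.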

## Tied Parameters

In some cases, we want to <font color=red>share model parameters across multiple layers</font>. For instance when we want to find good word embeddings we may decide to use the same parameters both for encoding and decoding of words. Let’s see how to do this a bit more elegantly. In the following we <font color=red>allocate a linear layer and then use it multiple times for sharing the weights.</font>

In [41]:
# We need to give the shared layer a name such that we can reference its
# parameters

shared = nn.Sequential()
shared.add_module('linear_shared', nn.Linear(8,8,bias=False))
shared.add_module('relu_shared', nn.ReLU())                  
net = nn.Sequential(nn.Linear(20,8,bias=False),
               nn.ReLU(),
               shared,
               shared,
               nn.Linear(8,10,bias=False))

net.apply(init_weights)

print(net[2][0].weight==net[3][0].weight)


tensor([[True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True]])


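The above example shows that the parameters of the two occurrences of the shared block, `net[2]` and `net[3]`, are tied. The elementwise comparison only shows that the values are equal, but since both positions hold the same `shared` module, the parameters are identical rather than just being equal. That is, by changing one of the parameters the other one changes, too.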
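The difference between equal and identical can be checked with `is` and by editing one copy. This is a minimal sketch on a rebuilt network (the name `demo_net` is hypothetical); it also counts how many distinct parameter tensors the network reports.

```python
import torch
import torch.nn as nn

shared = nn.Sequential(nn.Linear(8, 8, bias=False), nn.ReLU())
# Hypothetical rebuild of the network above, reusing the same block twice
demo_net = nn.Sequential(nn.Linear(20, 8, bias=False), nn.ReLU(),
                         shared, shared,
                         nn.Linear(8, 10, bias=False))

# Identity check: both positions hold the very same Parameter object
same_object = demo_net[2][0].weight is demo_net[3][0].weight

# Editing the weight through one position is visible through the other
with torch.no_grad():
    demo_net[2][0].weight[0, 0] = 100.0
seen_through_other = demo_net[3][0].weight[0, 0].item()

# parameters() yields the shared tensor only once
num_params = len(list(demo_net.parameters()))
print(same_object, seen_through_other, num_params)
```

Because the shared weight appears only once in `parameters()`, an optimizer built from it would update that weight a single time per step.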

## Summary

* We have several ways to access, initialize, and tie model parameters.
* We can use custom initialization.
* PyTorch has a sophisticated mechanism for accessing parameters in a unique and hierarchical manner.

## Exercises

1. Use the FancyMLP defined in :numref:`chapter_model_construction` and access the parameters of the various layers.
1. Look at the [PyTorch documentation](https://pytorch.org/docs/stable/_modules/torch/nn/init.html) and explore different initializers.
1. Try accessing the model parameters after `net.apply(initialization)` and before `net(x)` to observe the shape of the model parameters. What changes? Why?
1. Construct a multilayer perceptron containing a shared parameter layer and train it. During the training process, observe the model parameters and gradients of each layer.
1. Why is sharing parameters a good idea?
