# 04 Computational graph

This tutorial is not related to the "[Deep Learning with PyTorch](https://pytorch.org/assets/deep-learning/Deep-Learning-with-PyTorch.pdf)" book.

## Contents


1. MiniNet  
2. PyTorch's computational graph
    1. Weight and gradient values 
    2. Updating weights
    3. Updating gradients
3. Good to know

  
## Some of Andrew's videos related to this topic

- [Computation Graph (C1W2L07)](https://www.youtube.com/watch?v=hCP1vGoCdYU&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=13)
- [Derivatives With Computation Graphs (C1W2L08)](https://www.youtube.com/watch?v=nJyUyKN-XBQ&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=14)
- [Deep L-Layer Neural Network (C1W4L01)](https://www.youtube.com/watch?v=2gw5tE2ziqA&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=36) (to clarify notations)
- [Forward Propagation in a Deep Network (C1W4L02)](https://www.youtube.com/watch?v=a8i2eJin0lY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=39) (to clarify notations)

In [1]:
import torch
from torch import nn

torch.manual_seed(123)

<torch._C.Generator at 0x7fd7e344b7d0>

## 1. MiniNet

Here is the definition of a very simple MLP that we will use in this exercise. It corresponds to the following network (following Andrew's notations from [Deep L-Layer Neural Network (C1W4L01)](https://www.youtube.com/watch?v=2gw5tE2ziqA&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=36) and [Forward Propagation in a Deep Network (C1W4L02)](https://www.youtube.com/watch?v=a8i2eJin0lY&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=39)):

![MiniNet architecture](MiniNet.png)

We will use the following implementation of MiniNet throughout this tutorial, so it is worth reading the code below carefully.

In [2]:
class MiniNet(nn.Module):
    def __init__(self):
        super().__init__() 
        
        # number of layers in our network (following Andrew's notations)
        self.L = 2
        
        # Where we will store our neuron values
        # - z: before activation function 
        # - a: after activation function (a=f(z))
        self.z = {i : None for i in range(1, self.L+1)}
        self.a = {i : None for i in range(self.L+1)}

        # Fully connected layers
        # We have to use nn.ModuleDict and to use strings as keys here to 
        # respect pytorch requirements (otherwise, the model does not learn)
        self.fc = nn.ModuleDict({str(i): None for i in range(1, self.L+1)})
        self.fc['1'] = nn.Linear(in_features=2, out_features=3)
        self.fc['2'] = nn.Linear(in_features=3, out_features=2)

        
    def forward(self, x):
        # The first dimension of the input must be the batch size
        out = torch.flatten(x, 1)

        # Input layer
        self.a[0] = out
        
        # First layer (hidden layer)
        self.z[1] = self.fc['1'](out)
        self.a[1] = torch.tanh(self.z[1])
        
        # Second layer (output layer)
        self.z[2] = self.fc['2'](self.a[1])
        self.a[2] = torch.tanh(self.z[2])

        return self.a[2]
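
The comment about `nn.ModuleDict` in the code above can be checked with a small sketch. The two classes below are hypothetical stand-ins (they are not part of MiniNet): one stores its layers in a plain Python `dict`, the other in an `nn.ModuleDict`. Only the second one exposes its weights through `parameters()`, which is what an optimizer would receive.

```python
import torch
from torch import nn


class PlainDictNet(nn.Module):
    """Hypothetical variant: layers kept in a regular Python dict."""
    def __init__(self):
        super().__init__()
        # A plain dict is stored as an ordinary attribute: its layers are NOT registered
        self.fc = {"1": nn.Linear(2, 3), "2": nn.Linear(3, 2)}


class ModuleDictNet(nn.Module):
    """Hypothetical variant: layers kept in an nn.ModuleDict (keys must be strings)."""
    def __init__(self):
        super().__init__()
        # nn.ModuleDict registers each layer as a submodule
        self.fc = nn.ModuleDict({"1": nn.Linear(2, 3), "2": nn.Linear(3, 2)})


n_plain = len(list(PlainDictNet().parameters()))
n_registered = len(list(ModuleDictNet().parameters()))

print("Parameters visible with a plain dict:  ", n_plain)
print("Parameters visible with nn.ModuleDict: ", n_registered)
```

With no visible parameters, an optimizer built from `model.parameters()` has nothing to update, which is consistent with the "model does not learn" remark in the code comment.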

## 2. PyTorch's computational graph

### 2.1 Weight and gradient values 

- In general, we can access trainable parameter values using ``model.layer_name.weight.data``
- In general, we can access trainable parameter gradients using ``model.layer_name.weight.grad``
- With our MiniNet, we access neuron values using ``model.a`` and ``model.z``

In [3]:
model = MiniNet()

def print_parameters(model):
    """
    Print trainable parameters of our MiniNet
    """
    for name, p in model.named_parameters():
        print("\nName : ", name, "\nValue: ", p.data)

def print_neuron_values(model):
    """
    Print neuron values (a and z) of our MiniNet 
    """
    print("\n -------------- Input ---------------- ")
    print("a0:            ", model.a[0] )
    print("\n -------------- First Layer ---------------- ")
    print("z1:            ", model.z[1] )
    print("a1 = tanh(z1): ", model.a[1] )
    print("\n --------------  2nd Layer  ---------------- ")
    print("z2:            ", model.z[2] )
    print("a2 = tanh(z2): ", model.a[2] )

print(" =================== Print parameters =================== ")
print_parameters(model) # We can see that all parameters are randomly initialized
print("\n ======== Access parameter values and gradients ========== ")
# We can access our layers using their name
print("\n -- Layer 'model.fc['1']':\n\n", model.fc['1'])
# As well as its corresponding trainable parameter
print("\n -- Trainable parameter 'model.fc['1'].weight':\n\n", model.fc['1'].weight)
# With its corresponding trainable parameter values
print("\n -- Parameter values 'model.fc['1'].weight.data':\n\n",model.fc['1'].weight.data)
# And its corresponding trainable parameter gradient
# Note that the gradient is None for now because we haven't called .backward() yet
print("\n -- Gradient values 'model.fc['1'].weight.grad':\n\n",model.fc['1'].weight.grad)


print("\n ========= Neuron values at initialization  ============== ")
# We have not given any input to our model yet, so all neuron values should be None
model = MiniNet()
print_neuron_values(model)

print("\n ========= Neuron values after first input  ============== ")
# Now we give some input...
input = torch.tensor([[1, 1]], dtype=torch.float)
output = model(input)
# ... and everything has been computed in the forward pass 
print_neuron_values(model)


Name :  fc.1.weight 
Value:  tensor([[-0.2883,  0.0234],
        [-0.3512,  0.2667],
        [-0.6025,  0.5183]])

Name :  fc.1.bias 
Value:  tensor([-0.5140, -0.5622, -0.4468])

Name :  fc.2.weight 
Value:  tensor([[ 0.2615, -0.2133,  0.2161],
        [-0.4900, -0.3503, -0.2120]])

Name :  fc.2.bias 
Value:  tensor([-0.1135, -0.4404])


 -- Layer 'model.fc['1']':

 Linear(in_features=2, out_features=3, bias=True)

 -- Trainable parameter 'model.fc['1'].weight':

 Parameter containing:
tensor([[-0.2883,  0.0234],
        [-0.3512,  0.2667],
        [-0.6025,  0.5183]], requires_grad=True)

 -- Parameter values 'model.fc['1'].weight.data':

 tensor([[-0.2883,  0.0234],
        [-0.3512,  0.2667],
        [-0.6025,  0.5183]])

 -- Gradient values 'model.fc['1'].weight.grad':

 None


 -------------- Input ---------------- 
a0:             None

 -------------- First Layer ---------------- 
z1:             None
a1 = tanh(z1):  None

 --------------  2nd Layer  ---------------- 
z2:      

### 2.2 Updating weights

If you take a closer look at the above output of "Trainable parameter 'model.fc['1'].weight'", you can see that it mentions ``requires_grad=True``. Likewise, the output of "Neuron values after first input" mentions ``grad_fn=<TanhBackward>`` and ``grad_fn=<AddmmBackward>``.

What is this all about? It has to do with the *computational graph*, which is how PyTorch records all the operations made during the forward pass (``outputs = model(inputs)``, i.e. the ``forward`` method) so that it can compute all the gradients in the backward pass (``loss.backward()``). The parameters are then updated from these gradients when calling ``optimizer.step()``.
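
To make the forward / backward / step sequence concrete, here is a minimal sketch on a single neuron rather than on MiniNet. The names `w`, `b`, `x` and the learning rate are illustrative assumptions. It shows that each intermediate tensor carries a `grad_fn`, that leaves do not, and that `optimizer.step()` uses the gradients filled in by `backward()`.

```python
import torch

# A single neuron: z = w . x + b, a = tanh(z)
w = torch.tensor([0.1, 0.2], requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
x = torch.tensor([1.0, 2.0])

optimizer = torch.optim.SGD([w, b], lr=0.1)

# Forward pass: every operation involving w or b is recorded in the graph
z = (w * x).sum() + b
a = torch.tanh(z)

print("a.grad_fn:", a.grad_fn)   # the tanh node
print("z.grad_fn:", z.grad_fn)   # the addition node
print("w.grad_fn:", w.grad_fn)   # None, because w is a leaf

# Backward pass: fills w.grad and b.grad by walking the graph backwards
a.backward()

# Chain rule by hand: da/dw = (1 - tanh(z)^2) * x, with z = 0.5 here
expected = (1 - torch.tanh(torch.tensor(0.5)) ** 2) * x
print("w.grad:  ", w.grad)
print("expected:", expected)

# Parameter update: plain SGD computes w <- w - lr * w.grad
w_before = w.detach().clone()
optimizer.step()
print("w after step:", w)
```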

Now to illustrate how necessary it is to have some understanding of this computational graph, let's try to initialize our weights in a custom way and check how easily we can mess up everything.

This is okay:

- ``model.layer.param.data = new_values``
- ``model.layer.param.data[:] = new_values``

This is **NOT** okay:

- ``model.layer.param = new_values``: raises an error unless ``new_values`` is an [nn.Parameter](https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html#torch.nn.parameter.Parameter) rather than a plain [torch.Tensor](https://pytorch.org/docs/stable/tensors.html?highlight=tensor#torch.Tensor): ``TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)``
- ``model.layer.param[:] = new_values``: an in-place write to a leaf that requires grad. In the run shown below it raises ``RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.`` In PyTorch versions that accept the assignment, it removes the parameter from the list of leaves and puts ``CopySlices`` as its gradient function.
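
The two lists above can be tried on a standalone `nn.Linear` layer. The sketch below is an illustration rather than part of the original notebook: it also shows one more safe option that the lists do not mention, an in-place `copy_` inside `torch.no_grad()`. The variable names are assumptions.

```python
import torch
from torch import nn

layer = nn.Linear(in_features=2, out_features=3)
new_values = torch.arange(1, 7, dtype=torch.float).view(3, 2) / 10

# Safe: in-place copy inside no_grad, so the write is not recorded in the graph
with torch.no_grad():
    layer.weight.copy_(new_values)
print(layer.weight)
print("is_leaf:", layer.weight.is_leaf, "| requires_grad:", layer.weight.requires_grad)

# Not okay: assigning a plain tensor to a registered parameter
raised_type_error = False
try:
    layer.weight = new_values
except TypeError as err:
    raised_type_error = True
    print("TypeError:", err)

# Accepted: the new values are wrapped in an nn.Parameter first
layer.weight = nn.Parameter(new_values.clone())

# Not okay: slice assignment on a leaf that requires grad
# (a separate layer is used so that 'layer' stays untouched)
other = nn.Linear(in_features=2, out_features=3)
try:
    other.weight[:] = new_values
except RuntimeError as err:
    print("RuntimeError:", err)
```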

Basically, we want each of our trainable parameters (weights) to require grad ([requires_grad](https://pytorch.org/docs/stable/autograd.html?highlight=requires#torch.Tensor.requires_grad)) and to be a leaf ([is_leaf](https://pytorch.org/docs/stable/autograd.html#torch.Tensor.is_leaf)). Variables that have nothing to do with the computational graph (i.e. that are not a part of the network) should be detached from the computational graph (see [detach](https://pytorch.org/docs/stable/autograd.html#torch.Tensor.detach)).

The concept of a leaf might be counterintuitive in PyTorch. The fact that a weight sits in the middle of your network does not mean that it should not be a leaf: it should always be one. In PyTorch, weights are leaves because they are not the result of any tracked operation: in the forward pass their values do not depend on the values of the input. Their values only change when calling ``optimizer.step()`` or when you initialize them manually.
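
A short sketch, using a standalone `nn.Linear` layer as a stand-in for one MiniNet layer, shows which tensors are leaves. The weight is a leaf that requires grad, the neuron values `z` and `a` are not leaves, and a detached copy is a leaf again but no longer requires grad.

```python
import torch
from torch import nn

layer = nn.Linear(in_features=2, out_features=3)
x = torch.tensor([[1.0, 1.0]])

z = layer(x)          # depends on the weights: recorded in the graph
a = torch.tanh(z)     # depends on z: recorded in the graph
a_detached = a.detach()

def describe(name, t):
    print(f"{name:12s} is_leaf={t.is_leaf!s:5s} requires_grad={t.requires_grad!s:5s} grad_fn={t.grad_fn}")

describe("weight", layer.weight)    # leaf, requires grad, no grad_fn
describe("x", x)                    # leaf, does not require grad
describe("z", z)                    # not a leaf
describe("a", a)                    # not a leaf
describe("a.detach()", a_detached)  # leaf again, cut off from the graph
```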


In [4]:
def check_computational_graph(model):
    """
    Make sure all trainable parameters require grad and are leaves
    """
    res = True
    # Go through all layers
    for i_layer in range(1, model.L+1):
        # Each layer has a weight and bias parameter
        for param_name in ['weight', 'bias']:
            
            # 'getattr(object, string variable)' is like `object.myattribute` when variable = "myattribute"
            param =  getattr(model.fc[str(i_layer)], param_name)
            msg = " !!!! WARNING !!!!\nmodel.fc[" + str(i_layer) + "]." + param_name
            if not param.requires_grad:
                print(msg + " does not require grad!")
                print(param)
                res = False
            if not param.is_leaf:
                print(msg + " is not a leaf!")
                print(param)
                res = False
    if res:
        print("\nAll parameters seem correctly attached to the computational graph! :) ")


print("\n ====================== Initialization ====================== ")

model = MiniNet()
check_computational_graph(model)      # So far so good, since we have not done anything yet



print("\n ==================== Updated parameters ==================== ")

# We can update weight values using '.data'
model.fc['1'].weight.data = torch.ones(3,2)
model.fc['1'].weight.data[0,0] = 42

model.fc['2'].bias.data[:] = torch.ones(2)
model.fc['2'].bias.data[0] = 42

print(model.fc['1'].weight)
print(model.fc['2'].bias)

check_computational_graph(model)      # To check that they are still correctly attached to the graph

print_parameters(model)               # To check that parameters have been updated
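print("\n ========== Updated parameters WITHOUT '.data' =============" )
# In-place assignment through a view of a leaf parameter that requires grad.
# With the PyTorch version that produced the output below, this line raises a
# RuntimeError, so the two calls after it are never reached.
model.fc['1'].weight[:,:] = torch.zeros_like(model.fc['1'].weight)
print_parameters(model)               # To check whether parameters have been updated
check_computational_graph(model)      # If the assignment went through, fc['1'].weight would no longer be a leaf (grad_fn=<CopySlices>)

# This would raise a TypeError if uncommented (a plain tensor is not an nn.Parameter)
#model.fc['1'].weight = torch.arange(1, 7, dtype=torch.float).view(3, 2) / 10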




All parameters seem correctly attached to the computational graph! :) 

Parameter containing:
tensor([[42.,  1.],
        [ 1.,  1.],
        [ 1.,  1.]], requires_grad=True)
Parameter containing:
tensor([42.,  1.], requires_grad=True)

All parameters seem correctly attached to the computational graph! :) 

Name :  fc.1.weight 
Value:  tensor([[42.,  1.],
        [ 1.,  1.],
        [ 1.,  1.]])

Name :  fc.1.bias 
Value:  tensor([-0.1320, -0.3793, -0.0643])

Name :  fc.2.weight 
Value:  tensor([[ 0.5470, -0.0455,  0.0183],
        [-0.0900,  0.0908,  0.5144]])

Name :  fc.2.bias 
Value:  tensor([42.,  1.])



RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

### 2.3 Updating gradients

By default, PyTorch keeps track of all the operations you make that involve a tensor that requires grad (i.e. with ``requires_grad=True``) in order to correctly update trainable parameters at the next training step.

This is "contagious". For example, if you define ``a`` using your model output and then ``b`` using ``a``, then both ``a`` and ``b`` will require grad because your output does (because your weights do...).

So whenever you manipulate your model, weights, outputs or losses outside the training loop, you should always put your operations inside a ``with torch.no_grad():`` context (see [torch.no_grad](https://pytorch.org/docs/stable/generated/torch.no_grad.html?highlight=no_grad#torch.no_grad)) so that PyTorch does not pollute your computational graph with unwanted operations. Note that this is what we did in the ``compute_accuracy`` function in the previous tutorials.
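
As an illustration, here is what such an evaluation helper might look like. The body of `compute_accuracy` below is a hypothetical reconstruction (the version from the previous tutorials is not reproduced here), and the small `nn.Sequential` model and the data are made-up stand-ins.

```python
import torch
from torch import nn

def compute_accuracy(model, inputs, labels):
    """Hypothetical helper: fraction of samples whose argmax prediction matches the label."""
    with torch.no_grad():                      # nothing in this block is recorded in the graph
        outputs = model(inputs)
        predictions = outputs.argmax(dim=1)
        return (predictions == labels).float().mean().item()

# Stand-in model with the same layer sizes as MiniNet
model = nn.Sequential(nn.Linear(2, 3), nn.Tanh(), nn.Linear(3, 2), nn.Tanh())
inputs = torch.tensor([[1.0, 1.0], [0.0, -1.0]])
labels = torch.tensor([0, 1])

acc = compute_accuracy(model, inputs, labels)
print("Accuracy:", acc)

# The same forward pass, with and without tracking
tracked = model(inputs)
with torch.no_grad():
    untracked = model(inputs)
print("tracked.requires_grad:  ", tracked.requires_grad)
print("untracked.requires_grad:", untracked.requires_grad)
```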

In addition, the PyTorch documentation discourages in-place operations in most cases (see [in-place operations with autograd](https://pytorch.org/docs/stable/notes/autograd.html#in-place-operations-with-autograd)). So even if you like to write ``a += b``, try to stick to ``a = a + b``.
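
One way an in-place operation can go wrong is sketched below, with illustrative variable names. The backward pass of `tanh` reuses the output of the forward pass, so overwriting that output in place leaves autograd without the value it needs, whereas the out-of-place version creates a new tensor and keeps the original intact.

```python
import torch

w = torch.tensor([0.5, -0.5], requires_grad=True)

# Out-of-place: 'a + 1' creates a new tensor, the tanh output is left untouched
a = torch.tanh(w)
a = a + 1
a.sum().backward()
grad_out_of_place = w.grad.clone()
print("Gradient with 'a = a + 1':", grad_out_of_place)

# In-place: 'b += 1' overwrites the tanh output that the backward pass needs
w.grad = None
b = torch.tanh(w)
b += 1
try:
    b.sum().backward()
    inplace_failed = False
except RuntimeError as err:
    inplace_failed = True
    print("RuntimeError with 'b += 1':", err)
```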


In [None]:
model = MiniNet()

print("\n ======== Gradient values 'model.fc['1'].weight.grad' ======== ")

# Initially there is no gradient
print("\n ------------- At initialization -------------- ")
print(model.fc['1'].weight.grad)

# Now we will do the forward and backward pass
print("\n --- After first input and backpropagation ---- ")

# First, forward pass
x = torch.tensor([[1., 1.]])
y = model(x)

# ... then compute some loss
y_exp = torch.tensor([[0., 1.]])
loss = torch.sum( (y - y_exp)**2 )

# ... and backpropagate the gradient
loss.backward()
print(model.fc['1'].weight.grad)

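# Remember that after each training iteration we zero out the gradients
# (set_to_none=False keeps a zero-filled tensor; recent PyTorch versions reset .grad to None by default)
print("\n ------ After zeroing out the gradients ------- ")
model.zero_grad(set_to_none=False)
print(model.fc['1'].weight.grad)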

print("\n ============ Which tensors require grad =============== ")

print("Does 'x' require grad?                  ", x.requires_grad)
print("Does 'model.fc['1'].weight' require grad?   ", model.fc['1'].weight.requires_grad)
print("Does 'y' require grad?                  ", y.requires_grad)
print("Does 'loss' require grad?               ", loss.requires_grad)

a = model.fc['1'].weight.data
print("'a = model.fc['1'].weight.data' requires grad?   ", a.requires_grad)
a = model.fc['1'].weight.grad
print("'a = model.fc['1'].weight.grad' requires grad?   ", a.requires_grad)
a = model.fc['1'].weight
print("'a = model.fc['1'].weight' requires grad?        ", a.requires_grad)


print("\n ======== Operations irrelevant to the training ======== ")
print("\n ------ Without torch.no_grad ------- ")
a = torch.cos(torch.tensor([[1., 1.]]))
print("Does 'a' require grad?    ", a.requires_grad)
a = 10 + y
# Now a requires grad (and pytorch will automatically compute its gradient)
print("Does 'a' require grad?    ", a.requires_grad)

print("\n ------ Using torch.no_grad ------- ")
a = torch.cos(torch.tensor([[1., 1.]]))
print("Does 'a' require grad?    ", a.requires_grad)
with torch.no_grad():
    a = 10 + y
# a doesn't require grad this time 
print("Does 'a' require grad?    ", a.requires_grad)
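
The reason for zeroing out the gradients after each iteration is that `backward()` adds to the existing `.grad` values instead of replacing them. The sketch below uses a standalone `nn.Linear` layer and a made-up loss as stand-ins for MiniNet and its training loss.

```python
import torch
from torch import nn

torch.manual_seed(123)
layer = nn.Linear(in_features=2, out_features=2)
x = torch.tensor([[1.0, 1.0]])

def compute_loss():
    # Made-up loss, only used to get some gradient flowing
    return torch.sum(layer(x) ** 2)

# First backward pass: .grad is created
compute_loss().backward()
first = layer.weight.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added to the old one
compute_loss().backward()
second = layer.weight.grad.clone()
print("Accumulated gradient equals twice the first one:", torch.allclose(second, 2 * first))

# After zeroing, a backward pass gives the single-pass gradient again
layer.zero_grad()
compute_loss().backward()
third = layer.weight.grad.clone()
print("Gradient after zero_grad equals the first one:  ", torch.allclose(third, first))
```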


## 3. Good to know

- We access trainable parameter values using ``model.layer_name.weight.data``.
- We access trainable parameter gradients using ``model.layer_name.weight.grad``.
- ``model.layer_name.weight`` is not a regular [torch.Tensor](https://pytorch.org/docs/stable/tensors.html?highlight=tensor#torch.Tensor) but an [nn.Parameter](https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html#torch.nn.parameter.Parameter), a ``torch.Tensor`` subclass that has ``requires_grad = True`` by default.
- By default, PyTorch keeps track of all operations that involve at least one tensor that has ``requires_grad = True``.
- [requires_grad](https://pytorch.org/docs/stable/autograd.html?highlight=requires#torch.Tensor.requires_grad) is contagious: if ``a`` requires grad, then ``b`` defined as ``b = a + 1`` requires grad as well.
- Use ``with torch.no_grad():`` context whenever you manipulate your model or weights or outputs or losses outside the training loop.
- Avoid in-place operations (e.g. don't use ``a += b``, but ``a = a + b`` instead).