In [1]:
import torch
from torch import nn
from tests_backpropagation import main_test


torch.manual_seed(123)
torch.set_default_dtype(torch.double)

## Class ``MyNet``

Read carefully how ``MyNet`` is implemented in the cell below. In particular:  
- ``n_l`` is a list of integers giving the number of units in each layer, from the input layer to the output layer.   
- ``MyNet([2, 3, 2])`` is equivalent to ``MiniNet()``, where ``MiniNet`` is the neural network defined in the fourth tutorial, in which the notation is also clarified.     
- ``model.L`` is the number of layers, ``L``, not counting the input layer (``len(n_l) - 1``)   
- ``model.f[l]`` is the activation function of layer ``l``, $f^{[l]}$ (here ``torch.tanh``)   
- ``model.df[l]`` is the derivative of the activation function, $f'^{[l]}$   
- ``model.a[l]``  is the tensor $A^{[l]}$ (shape: ``(1, n(l))``)   
- ``model.z[l]``  is the tensor $Z^{[l]}$ (shape: ``(1, n(l))``)  
- Weights $W^{[l]}$ (shape: ``(n(l), n(l-1))``) and biases $\mathbf{b}^{[l]}$ (shape: ``(n(l),)``) can be accessed as follows:
```
weights = model.fc[str(l)].weight.data
bias = model.fc[str(l)].bias.data
```
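
As a quick illustration of the shapes listed above, the sketch below builds the same kind of ``nn.ModuleDict`` of ``nn.Linear`` layers outside of any class and prints the weight and bias shapes. The standalone ``fc`` dictionary and the layer sizes ``[2, 3, 2]`` are assumptions made for this example only.

```python
import torch
from torch import nn

# Hypothetical layer sizes, chosen to match the MyNet([2, 3, 2]) example
n_l = [2, 3, 2]

# One nn.Linear per layer l = 1..L, stored under string keys
fc = nn.ModuleDict({
    str(l): nn.Linear(in_features=n_l[l - 1], out_features=n_l[l])
    for l in range(1, len(n_l))
})

for l in range(1, len(n_l)):
    weights = fc[str(l)].weight.data  # shape (n(l), n(l-1))
    bias = fc[str(l)].bias.data       # shape (n(l),)
    print(l, tuple(weights.shape), tuple(bias.shape))
# prints:
# 1 (3, 2) (3,)
# 2 (2, 3) (2,)
```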

In [2]:
class MyNet(nn.Module):
    def __init__(self, n_l = [2, 3, 2]):
        super().__init__() 
        
        
        # number of layers in our network (following Andrew's notations)
        self.L = len(n_l)-1
        self.n_l = n_l
        
        # Where we will store our neuron values
        # - z: before activation function 
        # - a: after activation function (a=f(z))
        self.z = {i : None for i in range(1, self.L+1)}
        self.a = {i : None for i in range(self.L+1)}

        # Where we will store the gradients for our custom backpropagation algo
        self.dL_dw = {i : None for i in range(1, self.L+1)}
        self.dL_db = {i : None for i in range(1, self.L+1)}

        # Our activation functions
        self.f = {i : lambda x : torch.tanh(x) for i in range(1, self.L+1)}

        # Derivatives of our activation functions
        self.df = {
            i : lambda x : (1 / (torch.cosh(x)**2)) 
            for i in range(1, self.L+1)
        }
        
        # fully connected layers
        # We have to use nn.ModuleDict and to use strings as keys here to 
        # respect pytorch requirements (otherwise, the model does not learn)
        self.fc = nn.ModuleDict({str(i): None for i in range(1, self.L+1)})
        for i in range(1, self.L+1):
            self.fc[str(i)] = nn.Linear(in_features=n_l[i-1], out_features=n_l[i])
        
    def forward(self, x):
        # Input layer
        self.a[0] = torch.flatten(x, 1)
        
        # Hidden layers until output layer
        for i in range(1, self.L+1):

            # fully connected layer
            self.z[i] = self.fc[str(i)](self.a[i-1])
            # activation
            self.a[i] = self.f[i](self.z[i])

        # return output
        return self.a[self.L] 
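
The derivative stored in ``self.df`` is written as $1 / \cosh^2(x)$. A minimal sanity check, independent of ``MyNet``, is to compare it with the derivative autograd computes for ``torch.tanh`` and with the equivalent form $1 - \tanh^2(x)$. The sample points below are arbitrary.

```python
import torch

# Arbitrary sample points in double precision
x = torch.linspace(-3.0, 3.0, 13, dtype=torch.double, requires_grad=True)

# Autograd derivative of tanh, evaluated elementwise
torch.tanh(x).sum().backward()

# The two closed forms of the derivative
analytic = 1 / torch.cosh(x.detach()) ** 2
alternative = 1 - torch.tanh(x.detach()) ** 2

print(torch.allclose(x.grad, analytic))
print(torch.allclose(analytic, alternative))
```

Since ``a = tanh(z)`` is already stored during the forward pass, the second form could also be evaluated as ``1 - a**2`` without recomputing anything from ``z``.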

## Tasks

Write a function ``backpropagation(model, y_true, y_pred)`` that computes:

- $\frac{\partial L}{\partial w^{[l]}_{i,j}}$ and store them in ``model.dL_dw[l][i,j]`` for $l \in [1 .. L]$ 
- $\frac{\partial L}{\partial b^{[l]}_{j}}$ and store them in ``model.dL_db[l][j]`` for $l \in [1 .. L]$ 

assuming ``model`` is an instance of the ``MyNet`` class.

A vectorized implementation would be appreciated.
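
For reference, one common way to write the four backpropagation equations is sketched below, in the row-vector convention used here, where $A^{[l]}$ and $Z^{[l]}$ have shape ``(1, n(l))``. The error term $\delta^{[l]} = \partial L / \partial Z^{[l]}$ and the labels BP1 to BP4 are notation introduced for this sketch and may differ from the tutorial's.

$$
\begin{aligned}
\text{(BP1)} \quad & \delta^{[L]} = \frac{\partial L}{\partial A^{[L]}} \odot f'^{[L]}\!\left(Z^{[L]}\right) \\
\text{(BP2)} \quad & \delta^{[l]} = \left(\delta^{[l+1]} \, W^{[l+1]}\right) \odot f'^{[l]}\!\left(Z^{[l]}\right) \\
\text{(BP3)} \quad & \frac{\partial L}{\partial \mathbf{b}^{[l]}} = \delta^{[l]} \\
\text{(BP4)} \quad & \frac{\partial L}{\partial W^{[l]}} = \left(\delta^{[l]}\right)^{\top} A^{[l-1]}
\end{aligned}
$$

Assuming a squared-error loss $L = \lVert A^{[L]} - y \rVert^2$, the first factor of BP1 is $\partial L / \partial A^{[L]} = 2\,(A^{[L]} - y)$. The shapes are consistent: $\delta^{[l+1]}$ is ``(1, n(l+1))`` and $W^{[l+1]}$ is ``(n(l+1), n(l))``, so their product is ``(1, n(l))``, and BP4 yields a ``(n(l), n(l-1))`` matrix, the same shape as $W^{[l]}$.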

In [3]:
def backpropagation(model, y_true, y_pred):
    with torch.no_grad():

        # Layer L: compute the output error (BP1)
        dL_dy = 2*torch.sub(y_pred, y_true)  # part one of BP1: dL/da^(L)
        tanh_prime_z_L = model.df[model.L](model.z[model.L])  # part two of BP1: f'(z^(L))
        dL_df_L = dL_dy * tanh_prime_z_L  # completes BP1 for layer L
        dL_df_k = dL_df_L  # running error term, reused for the earlier layers
        
        #dL_dw for layer L using BP4
        #here we can use dL_df_k from above
        h_k_T = model.a[model.L-1]
        dL_dw = dL_df_L.T @ h_k_T
        
        #dL_db for layer L using BP3
        #this is the same as our dL_df_L
        dL_db = dL_df_L

        #updating the weight and bias gradients
        model.dL_dw[model.L] = dL_dw
        model.dL_db[model.L] = dL_db.squeeze()
        
        for l in range(model.L-1, 0, -1):

            #finding the derivative of the activation function for the layer
            tanh_prime_z_l = model.df[l](model.z[l])

            # Error term of layer l, using BP2: we need w^(l+1) and the
            # error term of layer l+1.
            # w^(l+1) is read from the model; dL_df_k still holds the error
            # term of layer l+1 from the previous iteration and is overwritten
            # here with the error term of layer l.
            w_from_last = model.fc[str(l+1)].weight.data
            dL_df_k = (w_from_last.T @ dL_df_k.T).T * tanh_prime_z_l
            
            #dL_dw --> using BP4 we need activation from l-1 and dL_df_k 
            h_k_T = model.a[l-1]
            dL_dw = dL_df_k.T @ h_k_T
            
            #dL_db --> using BP3 we only need dl_df_k
            dL_db = dL_df_k
            
            # Store the weight and bias gradients of layer l.
            # Note: squeeze() assumes a single sample, i.e. shape (1, n(l)).
            model.dL_dw[l] = dL_dw
            model.dL_db[l] = dL_db.squeeze()
       

## Run the cells below, and check the output

- In the 1st cell, we use a toy dataset and the same architecture as the MiniNet class of the fourth tutorial. 
- In the 2nd cell, we use a few samples of the MNIST dataset with a consistent model architecture (``24x24`` black and white cropped images as input and ``10`` output classes). 

You can set ``verbose`` to ``True`` if you want more details about your computations versus what is expected.
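
The checks reported below compare the hand-written gradients with autograd and with finite differences. The sketch below shows what such a comparison could look like on a single ``tanh`` layer. It is not the code of ``main_test``, whose implementation is not shown here. The layer sizes, the random data, the step ``eps`` and the sum-of-squares loss are all assumptions made for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for one layer of MyNet: z = x W^T + b, a = tanh(z)
layer = nn.Linear(2, 3).double()
x = torch.randn(1, 2, dtype=torch.double)
y_true = torch.randn(1, 3, dtype=torch.double)

def loss_fn(y_pred):
    # Sum of squared errors, so that dL/dy_pred = 2 * (y_pred - y_true)
    return ((y_pred - y_true) ** 2).sum()

# Reference gradients from autograd
z = layer(x)
y_pred = torch.tanh(z)
loss_fn(y_pred).backward()

# Manual gradients for the same layer
with torch.no_grad():
    delta = 2 * (y_pred - y_true) / torch.cosh(z) ** 2  # shape (1, 3)
    manual_dw = delta.T @ x                             # shape (3, 2)
    manual_db = delta.squeeze(0)                        # shape (3,)

# Central finite difference on a single weight entry
eps = 1e-6
with torch.no_grad():
    w = layer.weight
    original = w[0, 0].item()
    w[0, 0] = original + eps
    loss_plus = loss_fn(torch.tanh(layer(x))).item()
    w[0, 0] = original - eps
    loss_minus = loss_fn(torch.tanh(layer(x))).item()
    w[0, 0] = original  # restore the weight
fd_dw00 = (loss_plus - loss_minus) / (2 * eps)

print(torch.allclose(manual_dw, layer.weight.grad))
print(torch.allclose(manual_db, layer.bias.grad))
print(abs(fd_dw00 - manual_dw[0, 0].item()) < 1e-6)
```

Double precision matters for the finite-difference part: with single precision, the rounding error of the loss would be comparable to the difference being measured.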

In [4]:
model = MyNet([2, 3, 2])
main_test(backpropagation, model, verbose=False, data='toy')


 __________________________________________________________________ 
                          Check gradients                             
 __________________________________________________________________ 

 TEST PASSED: Gradients consistent with autograd's computations.

 TEST PASSED: Gradients consistent with finite differences computations.

 __________________________________________________________________ 
                 Check that weights have been updated               
 __________________________________________________________________ 

 TEST PASSED: Weights have been updated.

 __________________________________________________________________ 
                      Check computational graph                     
 __________________________________________________________________ 

 TEST PASSED: All parameters seem correctly attached to the computational graph!

 __________________________________________________________________ 
                             Conclusion 

In [5]:
model = MyNet([24*24, 16, 10])
main_test(backpropagation, model, verbose=True, data='mnist')

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 24883774.71it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28881/28881 [00:00<00:00, 51220166.52it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1648877/1648877 [00:00<00:00, 32169929.28it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4542/4542 [00:00<00:00, 10613107.95it/s]
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw

 __________________________________________________________________ 
                          Check gradients                             
 __________________________________________________________________ 


  return F.mse_loss(input, target, reduction=self.reduction)




 -------- Gradcheck with finite differences  --------- 
 residual error:
 [0.0, 0.0, 0.0, 0.0, 0.0]

 --------- Comparing with autograd values  ----------- 

 ******* fc['1'].weight.grad ******* 
  Our computation:
 tensor([[  1.7421e-27,   1.7421e-27,   1.7421e-27,  ...,   1.7421e-27,
           1.7421e-27,   1.7421e-27],
        [-5.9696e-177, -5.9696e-177, -5.9696e-177,  ..., -5.9696e-177,
         -5.9696e-177, -5.9696e-177],
        [ -3.7006e-19,  -3.7006e-19,  -3.7006e-19,  ...,  -3.7006e-19,
          -3.7006e-19,  -3.7006e-19],
        ...,
        [-1.6763e-132, -1.6763e-132, -1.6763e-132,  ..., -1.6763e-132,
         -1.6763e-132, -1.6763e-132],
        [  1.4466e-40,   1.4466e-40,   1.4466e-40,  ...,   1.4466e-40,
           1.4466e-40,   1.4466e-40],
        [  9.5900e-56,   9.5900e-56,   9.5900e-56,  ...,   9.5900e-56,
           9.5900e-56,   9.5900e-56]])

  Autograd's computation:
 tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [-0., -0., -0.,  ..., -0., -0., -0.],
