# PyTorch Computational Graph and Automatic Differentiation
We use data models to fit unknown functions of which we have a set of samples. A data model is built from functions that we choose because we think they can represent the real unknown function once their parameters have been set using the samples. The procedure to find the right parameter values is always the same: we define a loss function, e.g. the mean squared error, and minimize it by gradient descent, with the gradient computed by the backpropagation algorithm. The data model can be as simple as a linear function, i.e. $y = ax + b$, with only the two parameters a and b, or much more complex, like a deep learning model, where the data model is a set of nested parametric functions that combine affine transformations and nonlinear activation functions. Backpropagation requires the gradient of the loss function, which quickly becomes difficult to derive by hand as the complexity of the model grows. In this notebook we will see how to train a model with PyTorch, which overcomes this difficulty through its computational graph and automatic differentiation.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import torch
import torch.nn as nn
warnings.filterwarnings('ignore')
print("NumPy version: %s"%np.__version__)
print("Pandas version: %s"%pd.__version__)
print("PyTorch version: %s"%torch.__version__)


NumPy version: 1.23.1
Pandas version: 1.4.3
PyTorch version: 1.13.0


## Computational graph
We build a graph for a very simple computation: 

$$s = 2 * (x - y) + z$$

The computational graph is a directed acyclic graph that can be built from operators such as sub(), mul(), add() that implement the arithmetic operations on scalars: subtraction, multiplication, addition. 

In [11]:
def compute_s(x, y, z):
    r1 = torch.sub(x, y)   # x - y
    r2 = torch.mul(r1, 2)  # 2 * (x - y)
    s = torch.add(r2, z)   # 2 * (x - y) + z
    return s

We assign values to the variables, which are represented as tensors of rank 0, and then compute the value of the dependent variable s.

In [14]:
x = torch.tensor(1)
y = torch.tensor(2)
z = torch.tensor(5)
s = compute_s(x, y, z)
print('s =', s)       

s = tensor(3)
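The graph built by these operators can also be inspected and differentiated. The following is a minimal sketch, not part of the original notebook: it assumes floating-point tensors created with `requires_grad=True` (integer tensors such as `torch.tensor(1)` cannot require gradients), so that autograd records each operation in the `grad_fn` attribute of its result.

```python
import torch

# Float leaf tensors with requires_grad=True so that autograd records the graph.
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(5.0, requires_grad=True)

r1 = torch.sub(x, y)    # x - y
r2 = torch.mul(r1, 2)   # 2 * (x - y)
s = torch.add(r2, z)    # 2 * (x - y) + z

# Each intermediate result remembers the operation that produced it.
print(type(r1.grad_fn).__name__)  # SubBackward0
print(type(r2.grad_fn).__name__)  # MulBackward0
print(type(s.grad_fn).__name__)   # AddBackward0

# Walk the graph backwards from s to get the partial derivatives.
s.backward()
print(x.grad, y.grad, z.grad)     # ds/dx = 2, ds/dy = -2, ds/dz = 1
```

The gradients agree with differentiating $s = 2 * (x - y) + z$ by hand, which is the same mechanism used later for the loss function.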


In [16]:
torch.manual_seed(1)
w = torch.empty(2, 3)
nn.init.xavier_normal_(w)
print(w)

tensor([[ 0.4183,  0.1688,  0.0390],
        [ 0.3930, -0.2858, -0.1051]])


## Backpropagation and autodifferentiation
A deep learning model is a parametric model of nested functions, e.g. y = f(g(h(x))), that we use to fit an unknown function $\hat{y}$ of which we have N sample data points, for example a function that maps a picture of a cat to its label "cat". Fitting this type of model is conceptually no different from fitting a linear model such as y = ax + b. In this simple case we have only two parameters, a and b, that we want to find using a set of examples. The procedure is to define a loss function, e.g.

$$\mathscr{L} = \frac{1}{2} \sum_{i=1}^N|| y_i - \hat{y}_i||^2$$

that we want to minimize by updating the parameters a and b until the loss is as close to zero as possible. The same idea applies to a deep neural network. We have a model built from nested parametric nonlinear functions whose form we know (a combination of affine transformations and well-known activation functions) but whose parameters we do not. So the procedure is again to write a loss function and minimize it by updating the model's parameters $w$ with gradient descent, using a set of examples and the gradient computed by the backpropagation algorithm. At each step $j$ the parameters move against the gradient, scaled by the learning rate $\gamma$

$$w_{j+1} = w_j - \gamma \nabla_w{\mathscr{L}}$$

where

$$\nabla_w{\mathscr{L}} = \sum_{i=1}^N (y_i - \hat{y}_i) \frac{\partial{y_i}}{\partial{w}}$$

Clearly the function y must be differentiable if we want to compute its gradient, that is, its partial derivatives with respect to the parameters $w$. For a simple model we could derive the equations for the derivatives symbolically with the chain rule from calculus, or approximate them numerically. With deep neural networks that have thousands or millions of parameters, the symbolic approach becomes impractical and numerical approximation is slow and prone to round-off error. A better technique is to compute the derivatives of y with respect to the model parameters using the computational graph of the model. PyTorch builds this graph as the operations that define the model are executed. To obtain the derivative with respect to a parameter, we have to tell PyTorch which parameters require the derivative (or the gradient, if the parameter is an array).

As an example we will build a linear model $y = ax + b$. We can draw our model's computational graph starting from the leaves, that is, the two parameters a and b and one data point (x1, y1). We assign initial values to the parameters. These values will be updated in the training process using the data points and the backpropagation algorithm. What is important to know here is that PyTorch requires the parameters to be enabled for automatic differentiation by setting the attribute `requires_grad` to `True`.

In [40]:
a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True) 
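To make the role of `requires_grad` more concrete, here is a small sketch (an illustration, not part of the original notebook) that compares a parameter with a data tensor. It relies on the tensor attributes `is_leaf`, `requires_grad` and `grad`, and on the `torch.no_grad()` context manager. The name `y_ng` is introduced only for this example.

```python
import torch

a = torch.tensor(1.0, requires_grad=True)  # trainable parameter
x1 = torch.tensor([1.4])                   # data point, no gradient needed

print(a.is_leaf, a.requires_grad)      # True True
print(x1.is_leaf, x1.requires_grad)    # True False

# A result computed from a tracked tensor is tracked too, but it is not a leaf.
y = a * x1
print(y.is_leaf, y.requires_grad)      # False True

# The gradient stays empty until backward() has been called.
print(a.grad)                          # None

# Inside torch.no_grad() operations are not recorded in the graph.
with torch.no_grad():
    y_ng = a * x1
print(y_ng.requires_grad)              # False
```

Only leaf tensors with `requires_grad=True`, such as a and b, have their `grad` attribute filled in by `backward()`.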

In our simple linear case the loss function, written out below, is a quadratic function of the two parameters a and b. It is therefore convex, so every local minimum is also a global minimum and gradient descent can find it easily.

$$\mathscr{L} = \frac{1}{2} \sum_{i=1}^N|| ax_i + b - \hat{y}_i||^2$$

The partial derivatives of the loss must be evaluated at the data points, in our case at (x1, y1)

$$\frac{\partial{\mathscr{L}}}{\partial{a}} = (y - \hat{y}) \frac{\partial{y}}{\partial{a}} = (ax + b - \hat{y})x$$

$$\frac{\partial{\mathscr{L}}}{\partial{b}} = (y - \hat{y}) \frac{\partial{y}}{\partial{b}} = ax + b -\hat{y}$$

We will use only one data point to compute the loss and its gradient

In [41]:
x1 = torch.tensor([1.4])
y1 = torch.tensor([2.1])

We can easily define our linear model using the PyTorch functions mul() for multiplication, and add() for addition 

In [42]:
y = torch.add(torch.mul(a, x1), b)
y

tensor([1.9000], grad_fn=<AddBackward0>)

Since the loss function is built from differentiable parameters we can compute its gradient using the backward() method available to all PyTorch tensors

In [43]:
loss = 0.5 * (y - y1).pow(2).sum()
loss.backward()

We can print the values of the partial derivatives of the loss function with respect to the parameters a and b. These values will be used to update the parameters in order to reduce the loss.

In [44]:
print('dL/da : ', a.grad)
print('dL/db : ', b.grad)

dL/da :  tensor(-0.2800)
dL/db :  tensor(-0.2000)
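How these gradients drive the parameter update can be sketched with a short gradient descent loop. This is an illustrative sketch rather than the notebook's own training code: the learning rate `lr` (the $\gamma$ of the update rule) and the number of steps are assumed values, and the tensors are re-created so that the example is self-contained.

```python
import torch

a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)
x1 = torch.tensor([1.4])
y1 = torch.tensor([2.1])

lr = 0.1      # learning rate (gamma), an assumed value
losses = []

for step in range(50):
    y = a * x1 + b                        # forward pass builds the graph
    loss = 0.5 * (y - y1).pow(2).sum()
    losses.append(loss.item())

    loss.backward()                       # fills a.grad and b.grad

    # Update the parameters without recording the update in the graph.
    with torch.no_grad():
        a -= lr * a.grad
        b -= lr * b.grad

    # backward() accumulates gradients, so reset them before the next step.
    a.grad.zero_()
    b.grad.zero_()

print('first loss:', losses[0])
print('last loss :', losses[-1])
```

With a single data point many pairs (a, b) fit it exactly, so the loop simply moves to the closest pair that drives the loss towards zero.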


We can compute the partial derivative of the loss with respect to the parameter a manually to check that it matches the result computed by PyTorch via automatic differentiation.

In [45]:
x1*(a*x1 + b - y1)

tensor([-0.2800], grad_fn=<MulBackward0>)
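As a further sanity check, the analytic derivatives can be compared with a numerical approximation. The sketch below is not part of the original notebook: it uses plain Python floats and a central finite difference with an assumed step `eps`, and the helper name `loss_fn` is hypothetical.

```python
# Same values as in the notebook: parameters a, b and the data point (x1, y1).
a, b = 1.0, 0.5
x1, y1 = 1.4, 2.1

def loss_fn(a, b):
    """Loss for a single data point: 0.5 * (a*x1 + b - y1)^2."""
    return 0.5 * (a * x1 + b - y1) ** 2

eps = 1e-6  # finite-difference step, an assumed value

# Central differences approximate the partial derivatives.
num_da = (loss_fn(a + eps, b) - loss_fn(a - eps, b)) / (2 * eps)
num_db = (loss_fn(a, b + eps) - loss_fn(a, b - eps)) / (2 * eps)

# Analytic derivatives from the formulas above.
ana_da = (a * x1 + b - y1) * x1
ana_db = a * x1 + b - y1

print('dL/da numeric vs analytic:', num_da, ana_da)
print('dL/db numeric vs analytic:', num_db, ana_db)
```

Both approaches should agree with the values -0.28 and -0.2 returned by autograd, up to floating-point rounding. Finite differences need extra loss evaluations for every parameter, which is one reason automatic differentiation is preferred for large models.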