# PyTorch Computational Graph and Automatic Differentiation

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import torch
import torch.nn as nn
warnings.filterwarnings('ignore')
print("NumPy version: %s"%np.__version__)
print("Pandas version: %s"%pd.__version__)
print("PyTorch version: %s"%torch.__version__)

NumPy version: 1.23.1
Pandas version: 1.4.3
PyTorch version: 1.13.0


We build a graph for a very simple computation: 

$$s = 2 * (x - y) + z$$

The computational graph is a directed acyclic graph whose nodes are operations and whose edges are the tensors flowing between them. Here it can be built from the operators `torch.sub()`, `torch.mul()` and `torch.add()`, which implement subtraction, multiplication and addition on scalars.

In [11]:
def compute_s(x, y, z):
    r1 = torch.sub(x, y)   # r1 = x - y
    r2 = torch.mul(r1, 2)  # r2 = 2 * r1
    s = torch.add(r2, z)   # s = r2 + z (avoids overwriting the input z)
    return s

We assign values to the variables, which are represented as tensors of rank 0 (scalars), and then compute the value of the dependent variable s.

In [14]:
x = torch.tensor(1)
y = torch.tensor(2)
z = torch.tensor(5)
s = compute_s(x, y, z)
print('s =', s)

s = tensor(3)
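
The graph behind this result can be inspected through the `grad_fn` attribute of the output tensor. The sketch below is an illustration rather than part of the original cells: it assumes floating-point inputs created with `requires_grad=True` (with the integer tensors used above PyTorch records no graph and `grad_fn` is `None`), and `walk` is a hypothetical helper name.

```python
import torch

def compute_s(x, y, z):
    r1 = torch.sub(x, y)
    r2 = torch.mul(r1, 2)
    return torch.add(r2, z)

# Assumption: float leaves with requires_grad=True so that autograd records the graph
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(5.0, requires_grad=True)
s = compute_s(x, y, z)

def walk(node, depth=0, names=None):
    """Hypothetical helper: print the backward graph, starting from the output node."""
    if names is None:
        names = []
    if node is None:  # constants such as the scalar 2 have no graph node
        return names
    names.append(node.name())
    print("  " * depth + node.name())
    for parent, _ in node.next_functions:
        walk(parent, depth + 1, names)
    return names

op_names = walk(s.grad_fn)
```

The traversal starts at the addition node and goes back through the multiplication and the subtraction to the leaf tensors, that is, in the reverse order of `compute_s`.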


In [16]:
torch.manual_seed(1)
w = torch.empty(2, 3)
nn.init.xavier_normal_(w)
print(w)

tensor([[ 0.4183,  0.1688,  0.0390],
        [ 0.3930, -0.2858, -0.1051]])


## Backpropagation and autodifferentiation
A deep learning model is a parametric model of nested functions, e.g. $y = f(g(h(x)))$, that we use to fit an unknown function $\hat{y}$, such as a function that maps a picture of a cat to its label "cat". Fitting this type of model is conceptually no different from fitting a linear model such as $y = ax + b$. In that simple case we have only two parameters, $a$ and $b$, that we want to estimate from a set of examples. The procedure is to define a loss function, e.g. $\mathscr{L} = || y - \hat{y}||^2$, and to minimize it by updating the parameters $a$ and $b$ until the loss is as close to zero as possible.

The same holds for a deep neural network. The model is built from nested parametric nonlinear functions whose form we know (a combination of affine transformations and well-known activation functions) but whose parameters we do not. So the procedure is again to write a loss function and to minimize it by updating the model's parameters $w$ on a set of examples with gradient descent, where the gradient is computed by the backpropagation algorithm:

$$w_{i+1} = w_i - \gamma \nabla_w{\mathscr{L}}$$

where $\gamma$ is the learning rate (step size) and, for a scalar output $y$, the chain rule gives

$$\nabla_w{\mathscr{L}} = 2 (y - \hat{y}) \frac{\partial{y}}{\partial{w}}$$
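
For the linear model $y = ax + b$ the update rule can be written out by hand, since $\partial y / \partial a = x$ and $\partial y / \partial b = 1$. The following is a minimal sketch under stated assumptions: the data (targets generated by $3x + 1$), the learning rate and the number of steps are hypothetical, and the loss is averaged over the examples so that the step size does not depend on their number.

```python
import torch

# Hypothetical data: targets generated by the function 3x + 1
x = torch.linspace(-1.0, 1.0, 50)
y_hat = 3.0 * x + 1.0

# Parameters to fit, starting from zero
a = torch.tensor(0.0)
b = torch.tensor(0.0)
gamma = 0.1  # learning rate (assumed value)

for step in range(200):
    y = a * x + b                          # model prediction
    residual = y - y_hat                   # y - y_hat
    grad_a = (2.0 * residual * x).mean()   # dL/da, with dy/da = x
    grad_b = (2.0 * residual).mean()       # dL/db, with dy/db = 1
    a = a - gamma * grad_a                 # w_{i+1} = w_i - gamma * grad
    b = b - gamma * grad_b

loss = ((a * x + b - y_hat) ** 2).mean()
print("a = %.4f, b = %.4f, loss = %.2e" % (a.item(), b.item(), loss.item()))
```

Here the derivatives are coded manually, which works only because the model is so small.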

Clearly the function $y$ must be differentiable if we want to compute its gradient, that is, its partial derivatives with respect to the parameters $w$. For a simple model we could approximate the derivatives numerically, or derive them symbolically with the chain rule known from calculus. With deep neural networks that have thousands or millions of parameters this becomes infeasible: symbolic expressions grow unmanageably large, and numerical differences are expensive and sensitive to round-off error. A better technique is to compute the derivatives of $y$ with respect to the model parameters from the computational graph of the model. When we define a model, PyTorch builds the corresponding computational graph. To compute the derivative with respect to a parameter (or the gradient, if the parameter is an array) we have to tell PyTorch which parameters require it, by setting `requires_grad=True` on those tensors.
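
As an illustration, automatic differentiation can be applied to the computation $s = 2 * (x - y) + z$ from the beginning of this notebook. This is a sketch, assuming the inputs are redefined as floating-point tensors, since PyTorch tracks gradients only for floating-point (and complex) tensors. The expected derivatives are $\partial s / \partial x = 2$, $\partial s / \partial y = -2$ and $\partial s / \partial z = 1$.

```python
import torch

def compute_s(x, y, z):
    r1 = torch.sub(x, y)
    r2 = torch.mul(r1, 2)
    return torch.add(r2, z)

# Mark the inputs as requiring a gradient (float dtype is needed for this)
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(5.0, requires_grad=True)

s = compute_s(x, y, z)  # forward pass: the graph is recorded
s.backward()            # backward pass: derivatives are stored in .grad

print('ds/dx =', x.grad)  # tensor(2.)
print('ds/dy =', y.grad)  # tensor(-2.)
print('ds/dz =', z.grad)  # tensor(1.)
```

Calling `backward()` on the scalar `s` traverses the recorded graph once and fills the `.grad` attribute of every leaf tensor that was marked with `requires_grad=True`.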