**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!apt install libgraphviz-dev pkg-config # to fix broken installation of pygraphviz
!{sys.executable} -m pip install pygraphviz==1.7
!{sys.executable} -m pip install git+https://gitlab.com/michalgregor/ani_torch.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
from ani_torch import TorchGraph, trackable_function
import matplotlib.pyplot as plt
import numpy as np
import torch

# hide a PYDEV warning triggered by the use of sys.gettrace in Google Colab
import warnings
warnings.filterwarnings('ignore', message='PYDEV DEBUGGER WARNING:.*')

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Automatic Differentiation using PyTorch

Given that most current machine learning techniques are based on optimization and many of the popular optimization methods are gradient-based (including the ones used in deep learning), we need to be able to compute gradients of mathematical expressions as easily and efficiently as possible.

Automatic differentiation (autodiff; in the theory of artificial neural networks also known as the backpropagation algorithm) is a method that computes gradients by constructing a graph of the expression and then running it forward (to compute the output) and backward (to propagate the gradients from the output back to the inputs). Autodiff can therefore compute the gradient at only roughly two times the cost of the forward run. This is far more efficient than the other two methods that we have discussed so far: numerical differentiation and symbolic differentiation.

### The Computation Graph and the Gradient

In PyTorch, the computation graph is created automatically, by running standard imperative code, but using special objects. Instead of standard arrays, one will use PyTorch tensors. Also, instead of `numpy` operations such as `np.cos` or `np.exp` one will use their PyTorch equivalents `torch.cos` and `torch.exp`. Otherwise the code will look virtually identical.

Let us start by defining a simple PyTorch function that will return $\sin(ax + c)$:



In [None]:
def func(x, a, c):
    y = torch.sin(a*x + c)
    return y

In order to run the function, all we need to do is to create PyTorch tensors. We can create them by converting standard Python data types or even numpy arrays. To compute gradients w.r.t. the inputs we will need to do two things though: ensure that the tensors have a floating-point data type and that their `requires_grad` flag is set to `True`. The latter is to prevent unnecessary computation: we rarely need to know gradients w.r.t. all variables.



In [None]:
x = torch.tensor(2, dtype=float, requires_grad=True)
a = torch.tensor(3, dtype=float, requires_grad=True)
c = torch.tensor(4, dtype=float, requires_grad=True)
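
As a quick sanity check of the two requirements mentioned above, the following sketch inspects a tensor created in the same way and shows what happens when we ask for gradients on an integer tensor. The names `x_demo` and `int_grad_error` are only illustrative.

```python
import torch

# dtype=float is Python's built-in float, which PyTorch maps to torch.float64
x_demo = torch.tensor(2, dtype=float, requires_grad=True)
print(x_demo.dtype)          # torch.float64
print(x_demo.requires_grad)  # True

# An integer tensor cannot track gradients: PyTorch raises a RuntimeError
try:
    torch.tensor(2, requires_grad=True)
    int_grad_error = None
except RuntimeError as err:
    int_grad_error = str(err)

print(int_grad_error)
```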

We will now run the function on our tensors and collect the output. We can also print it immediately.



In [None]:
y = func(x, a, c)
print(y.item())

To run the backward pass to compute the gradients, all we now need to do is to call `y.backward()`. The gradients then get backpropagated to the tensors and we can access them through the `.grad` attribute.



In [None]:
y.backward()

print(x.grad)
print(a.grad)
print(c.grad)
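
To convince ourselves that the printed values are correct, we can compare them with derivatives worked out by hand. For the function as coded, $y = \sin(ax + c)$, the chain rule gives $\partial y / \partial x = a \cos(ax + c)$, $\partial y / \partial a = x \cos(ax + c)$ and $\partial y / \partial c = \cos(ax + c)$. The sketch below recreates the tensors so that it is self-contained; the `expected` dictionary is just an illustrative helper.

```python
import math
import torch

x = torch.tensor(2, dtype=float, requires_grad=True)
a = torch.tensor(3, dtype=float, requires_grad=True)
c = torch.tensor(4, dtype=float, requires_grad=True)

y = torch.sin(a * x + c)
y.backward()

# Hand-derived partial derivatives of sin(a*x + c) at x=2, a=3, c=4
inner = math.cos(3 * 2 + 4)  # cos(a*x + c)
expected = {"x": 3 * inner, "a": 2 * inner, "c": inner}

# Print the autodiff gradient next to the hand-derived one
for name, tensor in [("x", x), ("a", a), ("c", c)]:
    print(name, tensor.grad.item(), expected[name])
```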

### Visualizing the Computation Graph

We will now use an auxiliary library to display the computation graph. The library is not part of PyTorch: we will only be using it here to better illustrate how automatic differentiation works. All we need to do is to create a `TorchGraph` object using our function and some input values (these can be numbers or numpy arrays – they will automatically be wrapped as PyTorch tensors).



In [None]:
graph = TorchGraph(func, [2, 3, 4])
graph.plot()

#### Visualizing the Forward and the Backward Run

Even more importantly, we are able to visualize autodiff's forward and backward run using an animated figure. This will enable us to give visual explanations of how backpropagation of gradients works.



In [None]:
graph.animate(direction="forward")

In [None]:
graph.animate(direction="backward")

### Gradient Propagation for Some Common Cases

Perhaps the easiest way to understand how autodiff works is to go through a few of the more common cases, such as addition and multiplication, and explain how the gradients get backpropagated in each of them.

#### Addition: Distributing the Gradient

Addition is probably the simplest case: it merely distributes the gradient from the output into the two input branches.



In [None]:
def func_add(a, b):
    y = a + b
    return y

graph = TorchGraph(func_add, [2, 3])
graph.plot(with_all=True)

In [None]:
graph = TorchGraph(func_add, [2, 3], [2])
graph.plot(with_all=True)
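
The same behaviour can be reproduced in plain PyTorch, without the visualization library. In this sketch we assume that the third argument `[2]` passed to `TorchGraph` above plays the role of the gradient arriving at the output, so we seed the backward pass with 2 using the `gradient` argument of `backward`.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

y = a + b
# Seed the backward pass with an output gradient of 2 (the default would be 1)
y.backward(gradient=torch.tensor(2.0))

# Addition passes the output gradient unchanged into both inputs
print(a.grad.item(), b.grad.item())  # 2.0 2.0
```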

#### Multiplication: Swapping and Multiplying

With multiplication, we merely swap the inputs from the forward run (and obviously multiply them by the output gradient as per the chain rule).



In [None]:
def func_mult(a, b):
    y = a * b
    return y

graph = TorchGraph(func_mult, [2, 3])
graph.plot(with_all=True)

In [None]:
graph = TorchGraph(func_mult, [2, 3], [2])
graph.plot(with_all=True)
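
A plain PyTorch sketch of the same case follows. As before, we assume that the `[2]` passed to `TorchGraph` stands for the gradient arriving at the output. Each input should receive the value of the other input multiplied by that output gradient.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

y = a * b
# Seed the backward pass with an output gradient of 2
y.backward(gradient=torch.tensor(2.0))

# a receives b * 2 and b receives a * 2: the forward values are swapped
print(a.grad.item(), b.grad.item())  # 6.0 4.0
```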

#### Branches: Accumulation of Gradients

Whenever branches occur in the graph and the same variable is used multiple times, gradients from all the branches accumulate in the backward run.



In [None]:
def func_branch(x):
    y1 = torch.sqrt(x)
    y2 = torch.sqrt(x)
    return y1, y2

graph = TorchGraph(func_branch, [4], [4, 8])
graph.plot(with_all=True)
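
The accumulation can also be checked numerically in plain PyTorch. In this sketch we assume that `[4, 8]` above are the gradients arriving at the two outputs. Since the derivative of $\sqrt{x}$ at $x = 4$ is $1 / (2 \sqrt{4}) = 0.25$, the two branches contribute $0.25 \cdot 4 = 1$ and $0.25 \cdot 8 = 2$, which should add up to 3.

```python
import torch

x = torch.tensor(4.0, requires_grad=True)

# The same variable feeds two separate branches
y1 = torch.sqrt(x)
y2 = torch.sqrt(x)

# Backpropagate from both outputs at once, with output gradients 4 and 8
torch.autograd.backward([y1, y2], [torch.tensor(4.0), torch.tensor(8.0)])

# The contributions of both branches are summed in x.grad
print(x.grad.item())  # 3.0
```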

#### The `max` Operator: A Gradient Switch

The `max` operator is frequently used as a pooling operation in deep convolutional networks. How do the gradients propagate through it? Clearly the output of the operator only depends on the greatest input. The full gradient propagates to that input. The gradients w.r.t. the other inputs are zero: the change in the other inputs has no effect.

One could, of course, object, because changing the values of the inputs will have an effect provided that they become the greatest input instead. However, we need to recall that when computing gradients, we are interested in the effect of infinitesimally small changes and an infinitesimally small change in the input is not going to make one input larger than the other.



In [None]:
# named func_max so that it does not overwrite func_branch defined earlier
def func_max(a, b):
    y = torch.max(a, b)
    return y

graph = TorchGraph(func_max, [2, 5], [2])
graph.plot(with_all=True)
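
The switching behaviour of `max` can be verified in plain PyTorch as well. Once again we assume that the `[2]` passed to `TorchGraph` is the gradient arriving at the output: the greater input should receive all of it and the other input should receive zero.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)

y = torch.max(a, b)
# Seed the backward pass with an output gradient of 2
y.backward(gradient=torch.tensor(2.0))

# Only the greater input (b) receives the gradient
print(a.grad.item(), b.grad.item())  # 0.0 2.0
```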

### Defining New Operations

To get an even fuller understanding of how autodiff works, we are going to implement a new operation: the sigmoid function. Its mathematical definition is as follows:

\begin{equation}
\sigma(x) = \frac{1}{1 + e^{-x}}
\end{equation}
and its derivative is:

\begin{equation}
\sigma'(x) = \sigma(x) (1 - \sigma(x))
\end{equation}

We will define our new function as a subclass of `torch.autograd.Function`. We will define two static methods (if you don't know what that means, don't worry: just add the `@staticmethod` decorator):

* **forward:** this takes care of the forward pass;
* **backward:** this backpropagates the gradients from the outputs to the inputs of our function.

Clearly, the output of the forward pass could be reused when computing the backward pass so that we do not have to needlessly recompute the expensive nonlinear function multiple times. To cache the output, we store it in the context object `ctx` using `ctx.save_for_backward`.

Finally, we decorate the class itself with `@trackable_function`. This decorator is not part of PyTorch: we are adding it so that our new function can be visualized. We also set the class attribute `name` to `"$\sigma$"` so that its name in the visualization is $\sigma$ and not `sigmoid`.



In [None]:
@trackable_function
class Sigmoid(torch.autograd.Function):
    # raw string, so that "\s" is not treated as an (invalid) escape sequence
    name = r"$\sigma$"
    
    @staticmethod
    def forward(ctx, x):
        y = 1 / (1 + torch.exp(-x))
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        y, = ctx.saved_tensors
        grad_input = y * (1 - y) * grad_output
        return grad_input

sigmoid = Sigmoid.apply

In [None]:
def func_sigmoid(x):
    y = sigmoid(x)
    return y

In [None]:
graph = TorchGraph(func_sigmoid, [2])
graph.plot()
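
Whenever we write a backward pass by hand, it is worth checking it against a reference. The sketch below repeats the same forward and backward code in a class called `PlainSigmoid` (a hypothetical name, without the `@trackable_function` decorator so that it only depends on PyTorch), compares its gradient with that of the built-in `torch.sigmoid`, and then runs `torch.autograd.gradcheck`, which compares the analytical gradient with finite differences and expects double-precision inputs.

```python
import torch

class PlainSigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = 1 / (1 + torch.exp(-x))
        ctx.save_for_backward(y)  # cache the output for the backward pass
        return y

    @staticmethod
    def backward(ctx, grad_output):
        y, = ctx.saved_tensors
        return y * (1 - y) * grad_output

x = torch.tensor([-2.0, 0.0, 2.0], dtype=torch.float64, requires_grad=True)

# Compare the hand-written backward pass with PyTorch's built-in sigmoid
custom_grad, = torch.autograd.grad(PlainSigmoid.apply(x).sum(), x)
builtin_grad, = torch.autograd.grad(torch.sigmoid(x).sum(), x)
grads_match = torch.allclose(custom_grad, builtin_grad)
print(grads_match)

# Compare the analytical gradient with a finite-difference estimate
gradcheck_ok = torch.autograd.gradcheck(PlainSigmoid.apply, (x,))
print(gradcheck_ok)
```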

---
### Task 1: Run Autodiff on a Function

**Run autodiff on the following function:** 
$$
y = a \sin(bx) + c
$$

**at** 
$$
a=5, b=4, c=7, x=2.
$$
**What is the gradient w.r.t. $x$?** 

---


In [None]:
def func(x, a, b, c):
    
    
    # ---
    
    
plt.figure(figsize=(10, 6))


# ---

