In [1]:
%load_ext notexbook

In [2]:
%texify

# Torch `autograd`

**TL;DR**:

(doc. reference: [torch autograd](https://pytorch.org/docs/stable/autograd.html))

`torch.autograd` provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions. 

It requires minimal changes to existing code: you only need to declare the `Tensor`s for which gradients should be computed with the `requires_grad=True` keyword.
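
As a first taste, here is a minimal sketch of that workflow. The function and the input value are arbitrary choices made only for illustration.

```python
import torch

# Declare the input as a tensor that autograd should track
x = torch.tensor(3.0, requires_grad=True)

# Any scalar-valued function built from tensor operations
y = x ** 2 + 2 * x

# Populate x.grad with dy/dx = 2x + 2, evaluated at x = 3
y.backward()
print(x.grad)  # tensor(8.)
```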

##### The Learning Process

<img src="./imgs/learning_process.png" class="maxw90" />

<span class="fn"><i>Source:</i> [1] - _Deep Learning with PyTorch_ </span>

### Warm-up: `numpy` ol' friend

Numpy does not know **anything** about computation graphs, or deep learning, or gradients. 

However, we can easily use `numpy` to fit a two-layer network to random data by **manually** implementing the `forward` and `backward` passes through the network using numpy primitives:

In [2]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

In [3]:
learning_rate = 1e-6

`relu(x) = max(0, x)`

In [4]:
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss SSE
    loss = np.square(y_pred - y).sum()
    if t % 50 == 0:
        print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
print('Final Loss: ', loss)

0 33052448.593803793
50 17876.250040513933
100 914.5740628432775
150 76.9590834372805
200 7.756566591819045
250 0.8591385843619725
300 0.10145726232634263
350 0.012588678201822967
400 0.001625548456419009
450 0.00021682230955448584
Final Loss:  3.087683054173224e-05
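
For reference, the gradients computed by hand in the loop above follow from the chain rule. The derivation below is a sketch; the symbols $W_1$, $W_2$ (for `w1`, `w2`), $\hat{y}$ (for `y_pred`) and $L$ (for `loss`) are introduced here only for illustration, and $\odot$ denotes the element-wise product.

$$
\begin{aligned}
h &= x W_1, \qquad h_{relu} = \max(0, h), \qquad \hat{y} = h_{relu} W_2, \qquad L = \sum_{n,k} (\hat{y}_{nk} - y_{nk})^2 \\
\frac{\partial L}{\partial \hat{y}} &= 2\,(\hat{y} - y) \\
\frac{\partial L}{\partial W_2} &= h_{relu}^{\top} \, \frac{\partial L}{\partial \hat{y}} \\
\frac{\partial L}{\partial h_{relu}} &= \frac{\partial L}{\partial \hat{y}} \, W_2^{\top} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial h_{relu}} \odot \mathbb{1}[h \geq 0] \\
\frac{\partial L}{\partial W_1} &= x^{\top} \, \frac{\partial L}{\partial h}
\end{aligned}
$$

Each line corresponds to one statement of the backward pass: `grad_y_pred`, `grad_w2`, `grad_h_relu`, `grad_h` (where entries with `h < 0` are zeroed) and `grad_w1`.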


### `torch.Tensor`

``torch.Tensor`` is the central class of the package. 

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of **50x or greater** ([some benchmark here](https://github.com/jcjohnson/cnn-benchmarks)), so unfortunately `numpy` won’t be enough for modern deep learning.

Let's rewrite the previous example using `torch.Tensor` and PyTorch API to _manually_ implement the forward and backward passes through the network:

In [5]:
import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6

In [6]:
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)  # matrix product x @ w1
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 50 == 0:
        print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    
print('Final Loss: ', loss)

0 27719548.0
50 12302.02734375
100 459.7351989746094
150 28.23314666748047
200 2.1183807849884033
250 0.1761164367198944
300 0.015943627804517746
350 0.0018115155398845673
400 0.0003535313589964062
450 0.00011185034236405045
Final Loss:  4.997898213332519e-05


In the above examples, we had to manually implement both the forward and backward passes of our neural network. 

Manually implementing the backward pass is **not** ideal for networks bigger than just two layers.

### `torch.autograd`

Thankfully, automatic differentiation can compute the backward pass for us, and the `autograd` package in PyTorch provides exactly this functionality. 

When using autograd, the forward pass of your network will define a **(dynamic) computational graph**;
_nodes_ in the graph will be _Tensors_, and _edges_ will be functions that produce output _Tensors_ from input _Tensors_. 

Backpropagating through this graph then allows you to easily compute gradients.
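
To make the graph a little more tangible, the sketch below builds a tiny graph and inspects it through `grad_fn` and its `next_functions` attribute. The tensors `a`, `b`, `c`, `d` are illustrative names, and the exact text printed for each node may vary between PyTorch versions.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

c = a * b  # recorded in the graph as a multiplication
d = c + a  # recorded in the graph as an addition

# The function that produced d, and the functions feeding into it
print(d.grad_fn)
for fn, _ in d.grad_fn.next_functions:
    print("  <-", fn)

# Backpropagate through the recorded graph
d.backward()
print(a.grad, b.grad)  # dd/da = b + 1 = 4, dd/db = a = 2
```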

##### `torch.autograd` & `torch.Tensor`

If you set the ``.requires_grad`` attribute of a ``torch.Tensor`` to ``True``, autograd starts to track all operations on it. 

When you finish your computation you can call ``.backward()`` and have all the
gradients computed automatically. 

The gradient for this tensor will be accumulated into its ``.grad`` attribute.

To stop a tensor from tracking history, you can call ``.detach()`` to detach
it from the computation history, and to prevent future computation from being
tracked.

To prevent tracking history (and using memory), you can also wrap the code block
in ``with torch.no_grad():``. 

This can be particularly helpful when evaluating a model because the model may have trainable parameters with ``requires_grad=True``, but for which we don't need the gradients.

There’s one more class which is very important for autograd implementation: a ``Function``.

``Tensor`` and ``Function`` are interconnected and build up an acyclic
graph, that encodes a complete history of computation. 

Each tensor has a ``.grad_fn`` attribute that references a ``Function`` that has created
the ``Tensor`` (except for Tensors created by the user - their ``grad_fn is None``).

If you want to compute the derivatives, you can call ``.backward()`` on a ``Tensor``. 

If ``Tensor`` is a scalar, you don’t need to specify any arguments to ``backward()``,
however if it has more elements, you need to specify a ``gradient``
argument that is a tensor of matching shape.

---
<span id="fn1">Source: [PyTorch Autograd Tutorial](https://github.com/pytorch/tutorials/blob/master/beginner_source/blitz/autograd_tutorial.py)</span>
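
The non-scalar case can be sketched as follows. The weighting vector `v` is an arbitrary choice for illustration: calling `y.backward(gradient=v)` accumulates into `x.grad` the vector-Jacobian product of `v` with the Jacobian of `y` with respect to `x`.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Calling backward() with no arguments on a non-scalar tensor raises an error
try:
    (x ** 2).backward()
except RuntimeError as err:
    print("RuntimeError:", err)

# Passing a `gradient` tensor with the same shape as y works
y = x ** 2
v = torch.tensor([1.0, 0.1, 0.01])
y.backward(gradient=v)

# Each entry is v_i * dy_i/dx_i = v_i * 2 * x_i
print(x.grad)
```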


In [6]:
# Create a tensor requiring a gradient
x = torch.ones(2, 2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


In [7]:
# Let's use this tensor in a function
y = x + 2
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


In [8]:
# y was created as a result of an operation, so it has a ``grad_fn``.
print(y.grad_fn)

<AddBackward0 object at 0x109cb5d50>


In [9]:
# Let's do more operations on y
z = y * y * 3
out = z.mean()

print(z, out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)


###### `tensor.requires_grad_` method

``.requires_grad_( ... )`` changes an existing Tensor's ``requires_grad`` flag in-place. 

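Note that a tensor's ``requires_grad`` flag defaults to ``False`` if not given at creation time, whereas the method's argument defaults to ``True``, so ``a.requires_grad_()`` enables tracking.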

In [10]:
a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)

False
True
<SumBackward0 object at 0x11fe40ed0>


##### Gradients

Let's backprop now!

Because ``out`` contains a single scalar, ``out.backward()`` is equivalent to ``out.backward(torch.tensor(1.))``.

In [11]:
out.backward()

In [12]:
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


In [13]:
x.data  # `.data` property returns the data of the tensor

tensor([[1., 1.],
        [1., 1.]])

You should have got a matrix of ``4.5``. 

Let’s call the ``out`` *Tensor* $o$.

We have that $o = \frac{1}{4}\sum_i z_i$, $z_i = 3(x_i+2)^2$ and $z_i\bigr\rvert_{x_i=1} = 27$.

Therefore, $\frac{\partial o}{\partial x_i} = \frac{3}{2}(x_i+2)$, 
hence
$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
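
The closed-form derivative can also be checked against autograd at arbitrary points. The sketch below rebuilds the same function on a random input (the seed and the name `x_rand` are arbitrary choices for illustration) and compares the result with $\frac{3}{2}(x_i+2)$.

```python
import torch

torch.manual_seed(0)
x_rand = torch.randn(2, 2, requires_grad=True)

# Same function as above: o = mean(3 * (x + 2)^2)
o = (3 * (x_rand + 2) ** 2).mean()
o.backward()

# Analytic gradient derived above: 3/2 * (x + 2)
expected = 1.5 * (x_rand.detach() + 2)
print(torch.allclose(x_rand.grad, expected))
```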

#### A few notes about `tensor.backward` and non-leaf tensors

```python
tensor.backward(gradient=None, retain_graph=None, create_graph=False)
```

[`tensor.backward`](https://pytorch.org/docs/stable/_modules/torch/tensor.html#Tensor.backward) 
computes the gradient of current tensor w.r.t. graph leaves.

The graph is differentiated using the chain rule. 

If the tensor is non-scalar (i.e. its data has more than one element) and requires gradient, the function additionally requires the `gradient` argument to be specified. 

It should be a tensor of matching type and location, that contains the gradient of the differentiated function w.r.t. self.

This function accumulates gradients in the **leaves** - you might need to zero them before calling it.

**Parameters**:

- `gradient` (Tensor or None): Gradient w.r.t. the
    tensor. If it is a tensor, it will be automatically converted
    to a Tensor that does not require grad unless ``create_graph`` is True.
    None values can be specified for scalar Tensors or ones that
    don't require grad. If a None value would be acceptable then
    this argument is optional.
    
- `retain_graph` (bool, optional): If ``False``, the graph used to compute
    the grads will be freed. Note that in nearly all cases setting
    this option to True is not needed and often can be worked around
    in a much more efficient way. Defaults to the value of
    ``create_graph``.

- `create_graph` (bool, optional): If ``True``, graph of the derivative will
  be constructed, allowing to compute higher order derivative
  products. Defaults to ``False``.
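
As a sketch of what `create_graph` enables, the example below computes a second derivative. Note that it uses the functional counterpart `torch.autograd.grad` (which accepts the same `create_graph` flag) rather than `tensor.backward`, so that nothing is accumulated into `.grad`; the function $y = x^3$ is an arbitrary choice.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# First derivative, keeping a graph of the derivative itself
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)

# Second derivative, obtained by differentiating the first one
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

print(dy_dx.item(), d2y_dx2.item())  # 3x^2 = 27.0 and 6x = 18.0
```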

In [14]:
# this is supposed to raise a warning
out.grad



##### `backward` and Dynamic Computational Graph

Gradient-enabled tensors (a.k.a. `variables`) along with functions (`operations`) combine to create the dynamic computational graph. 
The flow of data and the operations applied to the data are defined at runtime, hence the computational graph is constructed dynamically. 

This graph is built dynamically by the `autograd` engine under the hood.

A simple DCG for multiplication of two tensors would look like this:

![DCG Example](https://miro.medium.com/max/672/1*jGo_2J9UQeynwG_3olUD4w.png)

Each dotted outline box in the graph is a variable and the purple rectangular box is an operation.

Every variable object has several members some of which are:

`data`: It’s the data a variable is holding. 

`requires_grad`: if true starts tracking all the operation history and forms a backward graph for gradient calculation. 

`grad`: grad holds the value of gradient. If `requires_grad` is `False` it will hold a `None` value. Even if `requires_grad` is `True`, it will hold a `None` value unless `.backward()` function is called from some other node. 
For example, if you call `out.backward()` for some variable out that involved `x` in its calculations then `x.grad` will hold $\frac{\partial out}{\partial x}$.

`grad_fn`: This is the backward function used to calculate the gradient.

`is_leaf`: A node is a **leaf** if:

- It was initialized explicitly by some factory method (e.g. `x = torch.randn(1, 1)`).
- It is created after operations on tensors which all have `requires_grad=False`.
- It is created by calling `.detach()` method on some tensor (see below).

On calling `backward()`, gradients are populated only for the nodes which have both `requires_grad=True` **and** `is_leaf=True`. 

These are the gradients of the output node on which `.backward()` is called, taken w.r.t. the leaf nodes.
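
The rules above can be checked directly on a few tensors. This is a minimal sketch; the names `a` to `e` are illustrative.

```python
import torch

a = torch.randn(2, 2)                      # factory method, no grad required
b = torch.randn(2, 2, requires_grad=True)  # factory method, grad required
c = a * 2                                  # all inputs have requires_grad=False
d = b * 2                                  # result of a tracked operation
e = d.detach()                             # detached from the graph

for name, t in [("a", a), ("b", b), ("c", c), ("d", d), ("e", e)]:
    print(name, "is_leaf:", t.is_leaf, "requires_grad:", t.requires_grad)

# Only b is both a leaf and requires grad, so only b.grad gets populated
d.sum().backward()
print(a.grad)  # None
print(b.grad)  # a 2x2 tensor filled with 2.
```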

On turning `requires_grad=True` PyTorch will start tracking the operation and store the gradient functions at each step as follows:

![DCG Gradient](https://miro.medium.com/max/942/1*viCEZbSODfA8ZA4ECPwHxQ.png)


Algorithmically, here is how backpropagation happens with a computation graph. 

> (Not the actual implementation, only representative)

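```python
# Representative pseudocode: `self` stands for a node of the backward graph
def backward(self, incoming_gradient):
    # Store the gradient that flowed into this node
    self.Tensor.grad = incoming_gradient

    for inp in self.inputs:
        if inp.grad_fn is not None:
            # Chain rule: upstream gradient times the local derivative
            adjoint = incoming_gradient * local_grad(self.Tensor, inp)
            inp.grad_fn.backward(adjoint)
```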
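
The same idea can be sketched as a runnable toy in pure Python. The `Value` class below is hypothetical and is **not** how PyTorch implements autograd: it handles only scalar addition and multiplication, and it propagates gradients recursively along every path instead of visiting each node once in topological order.

```python
class Value:
    """Toy scalar node: a value plus links to the nodes that produced it."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # input nodes of the operation
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self, incoming_gradient=1.0):
        # Accumulate, since a node may be reached through several paths
        self.grad += incoming_gradient
        for parent, local_grad in zip(self.parents, self.local_grads):
            # Chain rule: upstream gradient times the local derivative
            parent.backward(incoming_gradient * local_grad)


a, b = Value(2.0), Value(3.0)
d = a * b + a  # d = a*b + a
d.backward()
print(a.grad, b.grad)  # dd/da = b + 1 = 4.0, dd/db = a = 2.0
```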

**Exercise**: Try to write the code that produces the graph above.

In [15]:
out.retain_grad()

# what if we try to re-run backward once again ?
out.backward()

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

In [16]:
# let's try again
y = x + 2
z = y * y * 3
out = z.mean()

out.retain_grad()

out.backward(create_graph=True, retain_graph=True)

In [17]:
# first of all let's have a look at the grad 
x.grad

tensor([[9., 9.],
        [9., 9.]], grad_fn=<AddBackward0>)

As you can see, the `gradient` has been **accumulated**, not replaced! 

This explains **why** we need to `zero_grad` the gradients of the model parameters before applying a 
new step in the `optimiser`.

In [18]:
# to zero the gradient of a tensor
x.grad.zero_()
x.grad

tensor([[0., 0.],
        [0., 0.]], grad_fn=<ZeroBackward>)

In [19]:
out.backward()  # this won't complain now

In [21]:
x.grad

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]], grad_fn=<ZeroBackward>)

In [20]:
out.grad  # 2 rather than 1: out.grad was not zeroed, so the second backward call accumulated into it

tensor(2.)

In [22]:
y.grad  # this returns None (with a warning in recent PyTorch versions): grads of "internal" (non-leaf) tensors are **not** retained by default

#### Detaching the Gradient (`Variable` $\mapsto$ `Tensor`)

###### `Variable`: Sorry what?


`Variable` (**now [deprecated](https://pytorch.org/docs/stable/autograd.html#variable-deprecated)**) was initially used to distinguish the `Tensor` objects that were used as input to the `autograd` module. 

In other words, it was necessary to convert a `tensor` into a `Variable` in order to use `autograd`.

This (explicit) transformation is **no longer** necessary!

This distinction was indeed confusing; it is reminiscent of the confusion in TF between `Variable` and `placeholder` (see [here](https://stackoverflow.com/questions/36693740/whats-the-difference-between-tf-placeholder-and-tf-variable)).

`Variable(tensor)` and `Variable(tensor, requires_grad=True)` still work, but they now return a `Tensor` rather than a `Variable`.

In addition, factory methods to create `Tensor` such as `torch.randn()`, `torch.zeros()`, `torch.ones()`, and others include the `requires_grad` parameter:

```python
autograd_tensor = torch.randn((2, 3, 4), requires_grad=True)
```

It is always possible to instruct `autograd` to stop tracking history on Tensors with `.requires_grad=True`, either by wrapping the code block in `with torch.no_grad():`

In [23]:
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

True
True
False


Or by using `.detach()` to get a **new** Tensor with the same content but that does not require gradients:

In [24]:
print(x.requires_grad)
y = x.detach()
print(y.requires_grad)
print(x.eq(y).all())

True
False
tensor(True)


**Note on `detach`** (see [doc](https://pytorch.org/docs/stable/autograd.html#torch.Tensor.detach))

> Returned Tensor shares the same storage with the original one. In-place modifications on either of them will be seen, and may trigger errors in correctness checks. 
>
> IMPORTANT NOTE: Previously, in-place `size` | `stride` | `storage` changes 
> (such as `resize_` | `resize_as_` | `set_` | `transpose_`) to the returned tensor 
also update the original tensor. 
> Now, these in-place changes will not update the original tensor anymore, and will instead trigger an error. For sparse tensors: In-place indices / values changes (such as `zero_` | `copy_` | `add_`) to the returned tensor will not update the original tensor anymore, and will instead trigger an error.
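
The shared-storage behaviour described in the note can be observed with a small sketch (tensor names are illustrative): an in-place write on the detached tensor is visible through the original one.

```python
import torch

x_orig = torch.ones(3, requires_grad=True)
x_det = x_orig.detach()

# Both tensors point to the same underlying memory
print(x_orig.data_ptr() == x_det.data_ptr())

# In-place modification of the detached tensor...
x_det[0] = 5.0

# ...is seen by the original tensor as well
print(x_orig)
print(x_det.requires_grad)
```

If an independent copy is needed instead, `x_orig.detach().clone()` allocates new storage.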

---

#### Two-Layer Network (MLP) using (built-in) `torch.autograd`

So, now going back to our previous example, let's rewrite our code in order to exploit the integrated `autograd` engine to calculate the gradients, without having to manually write the **backward** pass through the network:

In [7]:
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Leaving requires_grad at its default (False) indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

This time we're going to **require gradients** for the weights (i.e. `requires_grad=True`)

In [8]:
# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

In [9]:
learning_rate = 1e-6
for t in range(500):
    # Forward pass: 
    # ============
    # compute predicted y using operations on Tensors;
    # these are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand!
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional (scalar) Tensor, i.e. of shape ()
    loss = (y_pred - y).pow(2).sum()
    if t % 50 == 0:
        print(t, loss.item())  # loss.item() gets the scalar value held in the loss.

    # Backward pass: 
    # =============
    # Use autograd to compute the backward pass. 
    # This call will compute the gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call `w1.grad` and `w2.grad` will be Tensors holding the gradient
    # of the loss with respect to `w1` and `w2`, respectively.
    loss.backward()

    # Parameters Update (still manual)
    # =================
    # Manually update weights using gradient descent. 
    # NOTE: 
    # Using torch.no_grad()
    # because weights have `requires_grad=True`, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # We will use torch.optim.SGD later for the optimisation step.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 36546536.0
50 15414.4384765625
100 812.7899169921875
150 65.06072998046875
200 6.199832439422607
250 0.6663901805877686
300 0.07852109521627426
350 0.010125966742634773
400 0.001647276571020484
450 0.0004102089151274413


**Addendum**: [TensorFlow 1.x Static Graph](addendum_tf_static_graph.ipynb)

---

### References and Further Reading:

1. [(*Paper*) PyTorch: An Imperative Style, High-Performance Deep Learning Library](https://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf)
2. [PyTorch Autograd Function](https://pytorch.org/docs/stable/autograd.html#function)
3. [(**Terrific**) PyTorch Examples Repo](https://github.com/jcjohnson/pytorch-examples) (*from which, most of the examples in this section have been taken*)