<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">COMP5611M - Introduction to Autograd</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Marc de Kamps and University of Leeds</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

## Introduction to Autograd

### Introduction

In the notebook *'Datastructures in PyTorch'* we've seen that a network is a mathematical function that can be built from simple blocks: tensors, which are much like NumPy arrays, and built-in functions such as the sigmoid. This notebook shows how PyTorch computes the gradients of such functions automatically. It is meant to be an even gentler introduction than:
https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#differentiation-in-autograd,
which you should consider next.


In [1]:
import torch
x = torch.tensor([9.], requires_grad=True)

f=x**2
print(f)

tensor([81.], grad_fn=<PowBackward0>)


What's going on here? $f$ is the square of $x$, and its numerical value should be $x^2$, which for $x=9$ works out as 81. This is correct, but you also see that f is equipped with a gradient function, called *PowBackward0*, which we didn't ask for. Or did we? In x, we specified <code>requires_grad=True</code>, which overrides the default, <code>False</code>.

We can simply call backward on $f$, which appears to do nothing.

In [2]:
print(f.backward())

None


But if we inspect the gradient information of $x$, we find it is correct. If
$$
f(x) = x^2,
$$
then 
$$
f^{\prime}(x) = 2x
$$
which for $x=9$, works out as $2x =18$.

In [3]:
print(x.grad)

tensor([18.])


Calculating the gradient is state dependent: calling *backward* a second time on the same graph raises a <code>RuntimeError</code>:

In [4]:
try:
    f.backward()
except RuntimeError:
    print('This triggers a run time error.')

This triggers a run time error.
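
The error arises because PyTorch frees the intermediate buffers of the graph after the first *backward* call. The following is a minimal sketch, rebuilding the same `x` and `f` as above, of how passing `retain_graph=True` permits a second call. Note that the second gradient is then added to the value already stored in `x.grad`. The names `first` and `second` are only used here for illustration.

```python
import torch

x = torch.tensor([9.], requires_grad=True)
f = x ** 2

# Keep the graph's saved buffers alive so backward can be called again.
f.backward(retain_graph=True)
first = x.grad.clone()
print(first)    # tensor([18.])

# The second call now succeeds, but its result is accumulated into x.grad.
f.backward()
second = x.grad.clone()
print(second)   # tensor([36.])

# Reset the accumulated gradient before starting an unrelated computation.
x.grad.zero_()
print(x.grad)   # tensor([0.])
```

Accumulation is the reason training loops usually zero the gradients before each new backward pass.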


### A slightly more complex example:
Consider 
$$
f(x,y) = \sqrt{x^2 + y^2}
$$
We want to calculate
$$
\frac{\partial f}{\partial x} \mid_{(3,2)}
$$

**Exercise**: Verify that:
$$
\frac{\partial f}{\partial x} = \frac{x}{\sqrt{x^2 + y^2}}
$$
so 
$$
\frac{\partial f}{\partial x} \mid_{(3,2)} = \frac{3}{\sqrt{9+4}} = \frac{3}{\sqrt{13}} \approx 0.83205
$$
Also,
$$
\frac{\partial f}{\partial y} \mid_{(3,2)} = \frac{2}{\sqrt{9+4}} = \frac{2}{\sqrt{13}} \approx 0.5547
$$

In [5]:
xn = torch.tensor([3.],requires_grad=True)
yn = torch.tensor([2.],requires_grad=True)

fn = torch.sqrt(xn**2+yn**2)

fn.backward()
print(xn.grad)
print(yn.grad)

tensor([0.8321])
tensor([0.5547])
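
As an independent check that does not rely on autograd, the partial derivatives can also be approximated numerically. The sketch below uses central finite differences in plain Python. The helper name `central_difference` and the step size `h` are choices made for this illustration, not part of PyTorch.

```python
import math


def f(x, y):
    """The function under study: sqrt(x^2 + y^2)."""
    return math.sqrt(x**2 + y**2)


def central_difference(func, x, y, h=1e-6):
    """Approximate both partial derivatives of func at the point (x, y)."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return df_dx, df_dy


df_dx, df_dy = central_difference(f, 3.0, 2.0)
print(round(df_dx, 4), round(df_dy, 4))  # 0.8321 0.5547
```

Finite differences only approximate the derivative and need extra function evaluations for every input, whereas autograd applies the exact derivative rules in a single backward pass.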


### Perceptron Gradient

We know that the output of a perceptron is given by:
$$
o(x) = \sigma( \boldsymbol{w}^T \boldsymbol{x})
$$
Here, $\sigma$ is the sigmoid:
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

In the main text, we show that:
$$
\frac{\partial o}{\partial \boldsymbol{w}} = o(1-o)\boldsymbol{x}
$$

Let's pick:
$$
\boldsymbol{w}^T = (1,1,1)
$$
and 
$$
\boldsymbol{x}^T = (1,2,3)
$$
We find that 
$$
o = \sigma( 1 \cdot 1 + 1 \cdot 2 + 1 \cdot 3 ) = \frac{1}{1 + e^{-6}} \approx 0.99753
$$
For the gradient with respect to $\boldsymbol{w}$ we find:
$$
\left(\frac{\partial o}{\partial \boldsymbol{w}}\right)^T = o(1-o)\boldsymbol{x}^T = (0.00246651, 0.00493302, 0.00739953)
$$
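
The hand calculation above can be reproduced in plain Python, without PyTorch. This is only a sketch of the arithmetic; the variable names `a` and `grad` are chosen here for illustration.

```python
import math

w = [1.0, 1.0, 1.0]
x = [1.0, 2.0, 3.0]

# Weighted sum w^T x, followed by the sigmoid.
a = sum(wi * xi for wi, xi in zip(w, x))
o = 1.0 / (1.0 + math.exp(-a))

# Gradient of the output with respect to each weight: o(1 - o) x_i.
grad = [o * (1.0 - o) * xi for xi in x]

print(round(o, 8))                  # 0.99752738
print([round(g, 8) for g in grad])  # [0.00246651, 0.00493302, 0.00739953]
```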



In [6]:
torch.set_printoptions(precision=8)

w=torch.tensor([1.,1.,1.],requires_grad = True)
x=torch.tensor([1.,2.,3.])
o=torch.sigmoid(x@w)
o.backward()
print(w.grad)

tensor([0.00246647, 0.00493293, 0.00739940])


**Exercise**: The result is close enough to inspire confidence that the computation is by and large correct, yet small deviations are visible. Explain whether this is cause for concern. In your answer, consider what the machine representation of these tensors is (how would you find out?).

### How does PyTorch achieve this?

Any mathematical expression is decomposed into a graph. Every mathematical operator has a method called *forward*, which performs the actual computation. It also has a *backward* method, which implements the derivative. We have already seen the *forward* method when we built our own net. In general, it is not necessary to implement the *backward* method for your own objects, because it can be inferred as long as you use PyTorch objects to build your forward *computational graph*.
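
To make the forward/backward pairing concrete, here is a minimal sketch of an operator written by hand as a subclass of `torch.autograd.Function`. The class name `Square` is hypothetical and chosen for this illustration. PyTorch's built-in power operator already provides the same behaviour, so you would not normally write this yourself.

```python
import torch


class Square(torch.autograd.Function):
    """Illustrative operator computing y = x^2 with a hand-written derivative."""

    @staticmethod
    def forward(ctx, inp):
        # Store the input, because the derivative 2x needs it later.
        ctx.save_for_backward(inp)
        return inp ** 2

    @staticmethod
    def backward(ctx, grad_output):
        # Chain rule: incoming gradient times the local derivative 2x.
        (inp,) = ctx.saved_tensors
        return grad_output * 2 * inp


x = torch.tensor([9.], requires_grad=True)
y = Square.apply(x)
y.backward()
print(x.grad)  # tensor([18.])
```

The result matches the gradient obtained earlier for `x**2`, which suggests that the built-in operators follow the same pattern internally.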

**Exercise**: Define your own sigmoid function instead of torch.sigmoid. Use it in the implementation of the Net class. Run the rest of the notebook to see what happens.

In [7]:
import torch.nn as nn


class Net(nn.Module):
    
    def __init__(self):
        
        super(Net, self).__init__()
        
        self.weights = torch.tensor([1.,1.,1.],requires_grad=True)

    def forward(self,x):
        self.h=torch.sigmoid(x@self.weights)
        return self.h


Using the *forward* method we can apply the network directly on pattern $\boldsymbol{x}$.

In [8]:
net=Net()
x=torch.tensor([1.,2.,3.])
o=net(x)
print(o)

tensor(0.99752742, grad_fn=<SigmoidBackward>)


If we try to print the gradient of the weights, we will find it's not yet available.

In [9]:
print(net.weights.grad)


None


After calling backward on the calculation result, which is still inside the net object, we have the correct gradient information on the weights.

In [10]:
net.h.backward()
print(net.weights.grad)

tensor([0.00246647, 0.00493293, 0.00739940])


### Gradient of a loss function

We're usually interested in the gradient of a loss function with respect to the weights, not in the gradient of the network output itself. The gradient of the loss function tells us in which direction our weights need to change.

Consider the MSE loss function:
$$
\mathcal{L} = \frac{1}{2}(d - o)^2,
$$
where $d$ is the desired network output, and $o$, the actual network output. In our previous example, we entered
the pattern $\boldsymbol{x}^T = (1, 2, 3)$ in the network. The actual output already has been calculated to be
0.9975.

For this input point, what is the gradient? This depends on how far the output is removed from the desired output.
Let us assume that 
$$
d=0.9
$$

Working out the gradient algebraically with the chain rule, using $\partial \mathcal{L} / \partial o = -(d-o) = o-d$, we find:
$$
\frac{\partial \mathcal{L}}{\partial \boldsymbol{w}} = (o-d) \frac{\partial o}{\partial \boldsymbol{w}} =
(o-d)o(1-o)\boldsymbol{x}
$$

All these quantities are known for our example.

In [11]:
d=0.9
print((o-d)*o*(1-o)*x)

tensor([0.00024055, 0.00048110, 0.00072164], grad_fn=<MulBackward0>)


We will now show that automatic differentiation can produce the same result without need for deriving
the gradient algebraically.

In [12]:
d=0.9
# A new net
mod=Net()
# Produce an output, the naming of the variables is more in line with PyTorch conventions
pred=mod(x)
print(pred)
# Now we construct a loss function, which we can evaluate for this output
loss = 0.5*(pred-d)**2
print(loss)
# Now calculate the gradient - of the loss function!! - with respect to the weights
loss.backward()
print(mod.weights.grad)


tensor(0.99752742, grad_fn=<SigmoidBackward>)
tensor(0.00475580, grad_fn=<MulBackward0>)
tensor([0.00024055, 0.00048110, 0.00072164])
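
To illustrate how this gradient indicates the direction in which the weights should change, the sketch below performs a single gradient descent step on a standalone weight tensor. The learning rate `lr = 1.0` is an arbitrary value assumed for this example, and the names `loss_before` and `loss_after` are illustrative.

```python
import torch

w = torch.tensor([1., 1., 1.], requires_grad=True)
x = torch.tensor([1., 2., 3.])
d = 0.9
lr = 1.0  # assumed learning rate, chosen only for this illustration

loss_before = 0.5 * (torch.sigmoid(x @ w) - d) ** 2
loss_before.backward()

# Move the weights against the gradient, without recording the update
# itself in the computational graph.
with torch.no_grad():
    w -= lr * w.grad
w.grad.zero_()

loss_after = 0.5 * (torch.sigmoid(x @ w) - d) ** 2
print(loss_before.item(), loss_after.item())
```

The second loss should be slightly smaller than the first. The change is small because the sigmoid is nearly saturated at this input, so the gradient is tiny.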


## Computational Graphs

### How does autograd do it (and why do I need to do almost nothing?)

Consider the expression:
$$
\cos\left(\frac{1}{x+y}\right)
$$
and consider differentiation with respect to $x$ in the point $(1,2)$.
We can imagine the computation as a directed acyclic graph (DAG), where each node has a well-defined role.


   ![computational graph](forwardbackward.png)
   
To perform the differentiation we need to apply the chain rule:
$$
\frac{\partial}{\partial x} \cos\left( \frac{1}{x+y} \right) = -\sin\left( \frac{1}{x+y} \right) \cdot \left( -(x+y)^{-2} \right) \cdot 1
$$

We see that the chain rule walks the DAG in reverse order. This is possible because each node knows the derivative of its corresponding forward operation and evaluates it at the numerical values that were calculated in the forward pass.
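
The reverse walk can be written out by hand for the point $(1,2)$. The following plain Python sketch stores the value of each node in a forward pass and then multiplies the local derivatives in reverse order. The intermediate names `s` and `r` are labels chosen for this illustration and need not match those in the figure.

```python
import math

x, y = 1.0, 2.0

# Forward pass: evaluate each node and keep its value.
s = x + y          # sum node
r = 1.0 / s        # reciprocal node
f = math.cos(r)    # cosine node

# Backward pass: visit the nodes in reverse order and multiply the
# local derivatives, evaluated at the values stored above.
df_dr = -math.sin(r)    # derivative of cos(r) with respect to r
dr_ds = -1.0 / s**2     # derivative of 1/s with respect to s
ds_dx = 1.0             # derivative of x + y with respect to x
df_dx = df_dr * dr_ds * ds_dx

print(round(df_dx, 6))  # 0.036355
```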

**The chain rule is backpropagation**

The feedforward network discussed in the main text is just a particularly simple DAG.

Most mathematical functions in PyTorch have a forward and a backward pass. If you create complex functions
built from PyTorch objects, the backward pass is already available and you do not have to implement it yourself.