Training neural networks is essentially a very complicated application of the chain rule from calculus. The basic feature of neural network libraries like TensorFlow (Google) or PyTorch (Facebook) is **automatic differentiation**, which takes the derivative of arbitrarily complex code. To do this, the system has to keep track of the graph of computations that led up to a certain output. There are two ways to keep track of this graph:

1. Static. The code builds a graph, which is then compiled and executed. This is the approach TensorFlow 1.x takes (TensorFlow 2 executes eagerly by default).
2. Dynamic. The graph is built as the code is executed. This is the approach PyTorch takes. Code in this approach looks much like normal imperative Python code.
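
To make the dynamic approach concrete, here is a minimal sketch in which ordinary Python control flow decides which operations end up in the graph. The function `g` and the variables `x_pos` and `x_neg` are made up for illustration; `requires_grad` and `backward` are explained further down.

```python
import torch


def g(x):
    # Plain Python branching: only the operations that actually run
    # are recorded, so the graph can differ from one call to the next.
    if x.item() > 0:
        return x ** 2
    return -3 * x


x_pos = torch.tensor(2.0, requires_grad=True)
g(x_pos).backward()
print(x_pos.grad)  # derivative of x**2, evaluated at x = 2

x_neg = torch.tensor(-2.0, requires_grad=True)
g(x_neg).backward()
print(x_neg.grad)  # derivative of -3*x
```

Each call differentiates only the branch that was executed, which is why no separate graph-building step is needed.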

Torch implements a numpy-like syntax that should be familiar to most. For some reason, its developers made small, annoying changes to the numpy interface, which IMO is the best possible array interface. TensorFlow does the same thing.
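
As a rough illustration of the kind of small differences meant here, the sketch below runs the same two operations in numpy and in torch. It imports numpy, which the rest of this notebook does not use, and the array names are made up for the example.

```python
import numpy as np
import torch

a_np = np.arange(6.0).reshape(2, 3)
a_t = torch.from_numpy(a_np)  # torch view of the same data

# Same reduction, different keyword: `axis` in numpy, `dim` in torch.
col_sums_np = a_np.sum(axis=0)
col_sums_t = a_t.sum(dim=0)

# Same concatenation, different function name.
stacked_np = np.concatenate([a_np, a_np], axis=0)
stacked_t = torch.cat([a_t, a_t], dim=0)

print(col_sums_np, col_sums_t)
print(stacked_np.shape, tuple(stacked_t.shape))
```

The results agree; only the spelling differs.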

In [1]:
%matplotlib inline

import torch
import matplotlib.pyplot as plt

Here is how you initialize a torch tensor (i.e., an array):

In [2]:
a = torch.rand(10)
a

tensor([0.8701, 0.7510, 0.7492, 0.1008, 0.5165, 0.3492, 0.7933, 0.2739, 0.8511,
        0.0909])

By default, tensors don't track derivatives:

In [3]:
a.requires_grad

False

But we can enable that like this:

In [4]:
a.requires_grad = True
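
There are a couple of other ways to control tracking. The following is a small sketch, with variable names chosen only for this example, showing the in-place method `requires_grad_()`, the `requires_grad=True` constructor argument, and the `torch.no_grad()` context manager for switching tracking off temporarily.

```python
import torch

a = torch.rand(10)
a.requires_grad_()  # in-place equivalent of `a.requires_grad = True`

b = torch.rand(10, requires_grad=True)  # enable tracking at creation time

c = (a * b).sum()
print(c.requires_grad)  # anything computed from tracked tensors is tracked

with torch.no_grad():
    d = (a * b).sum()  # computed without recording a graph
print(d.requires_grad)
```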

Let's take the derivative of a Python function:

In [5]:
def f(x):
    return x**2

In [6]:
x = torch.tensor(1.0, requires_grad=True)
y = f(x)
y

tensor(1., grad_fn=<PowBackward0>)

We see that the answer is 1.0, as expected. Now how do we compute the gradient?

In [7]:
y.backward()

It is called `backward` because PyTorch uses "reverse-mode" automatic differentiation, which is also known as "back-propagation". This method is more efficient than forward-mode autodiff when there are many more independent variables (inputs) than dependent variables (outputs), as with a scalar loss that depends on many parameters. This operation stores the gradient in the `.grad` attribute of a torch tensor.

Here is the value of $f'(x)=2x$:

In [8]:
x.grad

tensor(2.)
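
The many-inputs, one-output case mentioned above can be sketched as follows. The function $s(w) = \sum_i w_i^2$ and the size of 1000 are arbitrary choices for illustration. A single `backward` call fills in all 1000 partial derivatives, which can be compared against the analytic result $\partial s / \partial w_i = 2 w_i$.

```python
import torch

# A scalar function of 1000 inputs: s(w) = sum_i w_i**2
w = torch.rand(1000, requires_grad=True)
s = (w ** 2).sum()
s.backward()  # one backward pass computes every partial derivative

# Compare against the analytic gradient 2 * w
print(torch.allclose(w.grad, 2 * w.detach()))

# Note: gradients accumulate in `.grad` across backward calls,
# so a second pass adds to the stored values instead of replacing them.
(w ** 2).sum().backward()
print(torch.allclose(w.grad, 4 * w.detach()))
```

Because of this accumulation, `.grad` is usually reset (for example with `w.grad.zero_()`) before computing a fresh gradient.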

This simple concept scales to much more complicated functions, such as a 3-layer neural network:

In [9]:
from torch import nn


f = nn.Sequential(
    nn.Linear(10,32),
    nn.ReLU(),
    nn.Linear(32,32),
    nn.ReLU(),
    nn.Linear(32,1),
)


x = torch.rand(1, 10, requires_grad=True)
y = f(x)
print("y is", y)

y.backward()
print()
print("dy/dx is", x.grad)

y is tensor([[-0.0363]], grad_fn=<AddmmBackward>)

dy/dx is tensor([[-0.0103,  0.0227, -0.0268,  0.0189, -0.0137,  0.0019, -0.0098,  0.0250,
          0.0173,  0.0084]])
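
One way to sanity-check a gradient like this is to compare it with a finite-difference approximation. The sketch below rebuilds a network with the same layer sizes under the hypothetical name `net`, switches to double precision so the differences are accurate, and uses an arbitrary seed and step size `eps`; none of these choices come from the cells above.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

# Same layer sizes as the network above, in float64.
net = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),
).double()

x = torch.rand(1, 10, dtype=torch.float64, requires_grad=True)
net(x).backward()  # autograd gradient ends up in x.grad

eps = 1e-6
fd = torch.zeros_like(x)
with torch.no_grad():
    for i in range(x.shape[1]):
        step = torch.zeros_like(x)
        step[0, i] = eps
        # Central-difference approximation of dy/dx_i
        fd[0, i] = (net(x + step) - net(x - step)).item() / (2 * eps)

# The two should agree closely, unless a ReLU happens to switch on or off
# within the tiny step, which is unlikely for a step this small.
print(torch.allclose(x.grad, fd, atol=1e-6))
```

The finite-difference loop needs two forward passes per input, while `backward` gets every component in a single pass.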


The parameters of the neural network can be accessed through the `f` object. For instance, here are the weights of the first layer:

In [10]:
W1 = f[0].weight

And here is the gradient of $y$ with respect to these weights.

In [11]:
W1.grad

tensor([[-1.6412e-04, -7.7425e-03, -4.2580e-04, -3.7917e-03, -4.7222e-04,
         -2.7842e-03, -4.5615e-03, -4.5665e-03, -2.0552e-03, -6.8821e-03],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
        [-3.3566e-04, -1.5835e-02, -8.7085e-04, -7.7548e-03, -9.6577e-04,
         -5.6942e-03, -9.3292e-03, -9.3393e-03, -4.2033e-03, -1.4075e-02],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+0