![PyTorch logo](https://upload.wikimedia.org/wikipedia/commons/thumb/9/96/Pytorch_logo.png/800px-Pytorch_logo.png)

<h2> PyTorch Autograd </h2>
<h4>Let's see NumPy do this!</h4>
Now on to something that makes PyTorch (and other deep learning frameworks) unique: auto-differentiable computational graphs! (Don't worry about exactly how this works.)<br>
Remember how we compute the gradients of a model's parameters (weights) by "backpropagation": first we calculate the gradient of the loss with respect to the model's output, then we use the chain rule to find the gradient of the loss with respect to the parameters or the input, and so on back through the layers of larger networks. Seems like a pretty repetitive process governed by some well-known rules, right? Well, you know what is good at doing repetitive, well-defined things? Computers!<br>
This "automatic" backpropagation (among other things) is what PyTorch REALLY gives us that makes training neural networks easy. So how does it do it? First, PyTorch keeps track of everything we do (unless we tell it not to). It does this by forming a "computational graph": a tree-like structure (strictly, a directed acyclic graph) of all the operations we perform, starting at some initial tensor. When we tell PyTorch to backpropagate from some point, it works backwards through this graph, calculating the gradient of that point with respect to each tensor that requires gradients and storing it in that tensor's .grad attribute.

Let's see an example of this!

In [1]:
import torch
import torchvision
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Define some values
x1 = torch.FloatTensor([4])
w1 = torch.FloatTensor([2])
b1 = torch.FloatTensor([3])

# Create a simple linear equation
y = w1 * x1 + b1    # y = 2 * x + 3

# We can easily work out the partial derivatives for this equation
dy_dx = w1
dy_dw = x1
dy_db = 1

print("Calculated Gradients") 
print("dy/dx", dy_dx.item())
print("dy/dw", dy_dw.item())
print("dy/db", dy_db)

Calculated Gradients
dy/dx 2.0
dy/dw 4.0
dy/db 1


In [3]:
# Let's create some tensors. requires_grad=True tells PyTorch we want to store the gradients for this tensor
# We need to do this if we are working with basic PyTorch tensors
x = torch.tensor([4.0], requires_grad=True)
w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)

#By performing a simple computation Pytorch will build a computational graph.
y = w * x + b    # y = 2 * x + 3

#Compute gradients via Pytorch's Autograd
y.backward()

# Print out the calculated gradients
# These are the gradients of y (the point we backpropagated from) with respect to each tensor
# Create your own equation and use the auto backprop to see the partial derivatives!
print("Calculated Gradients")
print("dy/dx", x.grad.item())    # x.grad = dy/dx = 2
print("dy/dw", w.grad.item())    # w.grad = dy/dw = 4
print("dy/db", b.grad.item())    # b.grad = dy/db = 1
# Note: .item() returns the value of a single-element tensor as a Python scalar

Calculated Gradients
dy/dx 2.0
dy/dw 4.0
dy/db 1.0
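
To tie this back to the chain rule, here is a minimal sketch that extends the same equation by one extra step. The squared output `z` and the variable `manual_dz_dw` are assumptions added for illustration; they are not part of the cells above. It shows that each intermediate tensor remembers the operation that produced it (`grad_fn`), and that the gradient PyTorch stores matches the chain rule worked out by hand.

```python
import torch

# Same starting values as above; requires_grad=True marks them as graph leaves
x = torch.tensor([4.0], requires_grad=True)
w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)

y = w * x + b        # y = 11
z = (y ** 2).sum()   # z = 121, a scalar we can backpropagate from

# Non-leaf tensors record the operation that created them; leaves do not
print("y has grad_fn:", y.grad_fn is not None)
print("x is leaf:", x.is_leaf, "| y is leaf:", y.is_leaf)

# Walk the graph backwards from z
z.backward()

# Chain rule by hand: dz/dw = dz/dy * dy/dw = (2 * y) * x
manual_dz_dw = (2 * y * x).item()

print("dz/dw autograd %.1f, by hand %.1f" % (w.grad.item(), manual_dz_dw))
print("dz/dx", x.grad.item())   # (2 * y) * w
print("dz/db", b.grad.item())   # (2 * y) * 1
```

With these values the hand calculation gives dz/dw = 2 * 11 * 4 = 88, dz/dx = 2 * 11 * 2 = 44 and dz/db = 22, which is what autograd stores in `.grad`.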


<h3> Finding the optimum point </h3>
We can use gradient descent to find the minimum of an equation.
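
Before handing the work to autograd, it may help to see the update rule on its own. The sketch below is plain Python with a derivative worked out by hand for the parabola used in this section; the starting point, step size, stopping threshold and the names `f`, `df_dx` and `x_val` are assumptions chosen for illustration.

```python
# Hand-written gradient descent on f(x) = 3x^2 + 2x - 1.2 (no autograd)

def f(x):
    return 3 * x**2 + 2 * x - 1.2


def df_dx(x):
    # Derivative worked out by hand: f'(x) = 6x + 2
    return 6 * x + 2


x_val = 1.0            # arbitrary starting point (assumption)
learning_rate = 0.01   # step size (assumption)

for step in range(2000):
    grad = df_dx(x_val)
    # Stop once the slope is nearly flat
    if abs(grad) < 1e-3:
        break
    # Step against the gradient, i.e. downhill
    x_val -= learning_rate * grad

print("x = %.3f, f(x) = %.3f after %d steps" % (x_val, f(x_val), step))
```

Setting f'(x) = 0 by hand gives x = -1/3 and f(-1/3) ≈ -1.533, so the loop should settle close to those values. Autograd removes the need to write `df_dx` ourselves.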

In [None]:
# Let's find the minimum of a parabola!

# Define the equation as a lambda function
fx = lambda x: 3 * x**2 + 2 * x - 1.2

# Create a random point X
x_ = torch.randn(1)
x_.requires_grad = True

# Let's use PyTorch's Autograd to find the gradient at this point
y_ = fx(x_)
y_.backward()

# The gradient tells us the direction to travel to increase Y
dy_dx_ = x_.grad.item()
# Use .item() so we format a Python float rather than a tensor that requires grad
print("dy/dx is %.2f when x is %.2f" % (dy_dx_, x_.item()))

In [None]:
# Let's take some steps to descend the gradient!

# Create a random point X
x_ = torch.randn(1)
x_.requires_grad = True

# Create some loggers
x_logger = []
y_logger = []

# We'll keep track of how many steps we've done
counter = 0

# Set a scale for the step size
learning_rate = 0.01

# Initialise the gradient to a large value
dy_dx_ = 1000

# We'll limit the max number of steps so we don't create an infinite loop
max_num_steps = 1000

# Keep taking steps until the gradient is small
while np.abs(dy_dx_) > 0.001:
    # Get the Y point at the current x value
    y_ = fx(x_)
    
    # Calculate the gradient at this point
    y_.backward()
    dy_dx_ = x_.grad.item()

    # Log the current X and Y points to plot
    # (logged before x_ is updated so each pair lies on the curve)
    x_logger.append(x_.item())
    y_logger.append(y_.item())

    # PyTorch will not keep track of operations within a torch.no_grad() block
    # We don't want PyTorch to add our gradient descent step to the computational graph!
    with torch.no_grad():
        # Take a step down (descend) the curve
        x_ -= learning_rate * dy_dx_

        # PyTorch will accumulate the gradient over multiple backward passes
        # For our use case we don't want this to happen so we need to set it to zero
        # after we have used it
        x_.grad.zero_()
        
    counter += 1
    
    if counter == max_num_steps:
        break

# Re-evaluate Y at the final x so the two printed values correspond to the same point
print("Y minimum is %.2f and is when X = %.2f, found after %d steps" % (fx(x_).item(), x_.item(), counter))
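
The loop above zeroes `x_.grad` after every step because `.backward()` adds to `.grad` rather than overwriting it. A small sketch of that behaviour follows; the tensor `t` and the function `3 * t` are hypothetical and only serve to make the numbers easy to check.

```python
import torch

t = torch.tensor([1.0], requires_grad=True)

# First backward pass: d(3t)/dt = 3
(3 * t).sum().backward()
first = t.grad.item()

# Second backward pass without zeroing: the new gradient is added to the old one
(3 * t).sum().backward()
accumulated = t.grad.item()

# Zero the stored gradient in place, then backpropagate again
t.grad.zero_()
(3 * t).sum().backward()
after_zero = t.grad.item()

print(first, accumulated, after_zero)  # 3.0 6.0 3.0
```

Without the `zero_()` call, each gradient descent step would use the sum of all previous gradients instead of the gradient at the current point.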

In [None]:
# Plot the steps we have taken
fig, ax = plt.subplots()
fig.set_size_inches(18.5, 10.5)
ax.plot(x_logger, y_logger, marker="x")

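# Use distinct loop names so we don't overwrite the x and y tensors defined earlier
for (x_pt, y_pt) in zip(x_logger, y_logger):
    txt = "(%.2f, %.2f)" % (x_pt, y_pt)
    ax.annotate(txt, (x_pt, y_pt))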


<h3> Curve fitting </h3>
Instead of finding the minimum of a given equation, let's try to fit a parabola to some data.

In [None]:
# Define a new equation
fx2 = lambda x: -0.55 * x**2 + 1.2 * x - 0.81

# Create some noisy data
x = torch.linspace(-5, 7, 200)
random_data = fx2(x) + 2 * torch.randn_like(x)

In [None]:
plt.figure(figsize=(5, 5))
plt.scatter(x, random_data)

In [None]:
# Randomly initialise the parameters of a parabola
a = torch.randn(1)
b = torch.randn(1)
c = torch.randn(1)

y_out = a * x**2 + b * x + c

# Plot against the data
plt.figure(figsize=(5, 5))
plt.scatter(x, y_out)
plt.scatter(x, random_data)
print("y = %.2fx^2 + %.2fx + %.2f" % (a, b, c))

Let's perform gradient descent on the parameters of the parabola.

In [None]:
# Set up the tensors for gradient descent
# Parameters as a column vector of shape (3, 1)
params = torch.cat((a, b, c)).unsqueeze(1)
# Create X matrix of shape (200, 3)
# Each row is [x^2 x 1]
x_data = torch.stack((x**2, x, torch.ones_like(x)), dim=1)
# Make sure the output Y is the right size
y_data = random_data.unsqueeze(1)

In [None]:
# Perform the Matrix multiplication using the created tensors
# we should get the same output as before
y_pred = torch.mm(x_data, params)

plt.figure(figsize=(5, 5))
plt.scatter(x, y_pred)
plt.scatter(x, random_data)
print("y = %.2fx^2 + %.2fx + %.2f" % (params[0], params[1], params[2]))
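
As a quick check of the claim that the matrix form gives the same output, here is a self-contained sketch that compares the two numerically. The fixed seed, the tolerance and the names `direct` and `matrix` are assumptions added for illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable (assumption)

x = torch.linspace(-5, 7, 200)
a, b, c = torch.randn(1), torch.randn(1), torch.randn(1)

# Direct evaluation of the parabola
direct = a * x**2 + b * x + c

# Matrix form: each row [x^2, x, 1] dotted with [a, b, c] gives one prediction
x_data = torch.stack((x**2, x, torch.ones_like(x)), dim=1)   # shape (200, 3)
params = torch.cat((a, b, c)).unsqueeze(1)                   # shape (3, 1)
matrix = torch.mm(x_data, params).squeeze(1)                 # shape (200,)

print(x_data.shape, params.shape, matrix.shape)
# A loose tolerance allows for float32 rounding differences between the two routes
print("Same output:", torch.allclose(direct, matrix, atol=1e-3))
```

Writing the model as one matrix multiplication keeps all three parameters in a single tensor, which is convenient when we ask PyTorch for gradients.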

<h4>Let's perform our gradient descent loop!</h4>
For curve fitting we DON'T want to minimize Y; we want to minimize the MAGNITUDE of the difference between the predicted outputs and the REAL (ground truth) outputs. One way to do that is to minimize the "mean squared error" between the real and predicted outputs. This objective is also known as our "loss function". <br><br>
It may seem obvious, but we need to ensure that whatever our "loss" is, it is differentiable. That is, it must be possible to find the partial derivative of the loss with respect to each parameter. <br>
We also need to make sure it's possible for PyTorch to compute those partial derivatives/gradients. Any break in the computational graph will prevent PyTorch from backpropagating any further! <br>
Examples of such breaks include turning a PyTorch tensor into a NumPy array (and back) and calling the .detach() method on a PyTorch tensor. We also cannot backprop through operations performed within a "torch.no_grad" block.
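
The breaks listed above can be seen directly by checking `requires_grad` on the result of each route. This is a minimal sketch; the tensor `p` and the variable names are hypothetical.

```python
import torch

p = torch.tensor([2.0], requires_grad=True)

# Normal operation: the result stays attached to the graph
tracked = p * 3

# .detach() returns a tensor that is cut off from the graph
detached = (p * 3).detach()

# NumPy round trip: .numpy() refuses tensors that require grad,
# so we have to detach first, and the graph is lost on the way
via_numpy = torch.from_numpy((p * 3).detach().numpy())

# Operations inside torch.no_grad() are never recorded
with torch.no_grad():
    untracked = p * 3

print(tracked.requires_grad, detached.requires_grad,
      via_numpy.requires_grad, untracked.requires_grad)

# Backpropagating from a tensor with no graph behind it raises a RuntimeError
try:
    detached.sum().backward()
    backward_failed = False
except RuntimeError:
    backward_failed = True

print("backward from detached tensor failed:", backward_failed)
```

Only `tracked` can carry gradients back to `p`; a loss built from any of the other three would leave `p.grad` untouched.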

In [None]:
# Log the error to plot
error_log = []
# Set the gradient scale
learning_rate = 1e-3

# initialise the error to a large value
error = 1000

# We'll limit the max number of steps so we don't create an infinite loop
max_num_steps = 1000

# Re-create the parameter matrix here
params = torch.cat((a, b, c)).unsqueeze(1)
params.requires_grad = True

counter = 0

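# Loop until the error is below some threshold
# (the data has noise with standard deviation 2, so the MSE cannot get much below 4;
# in practice this loop stops at max_num_steps)
while error > 0.1: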
    # Perform a "forward pass" 
    # aka get the output Y for all X values
    y_pred = torch.mm(x_data, params).squeeze()

    # Define our error/loss
    error = (y_pred - random_data).pow(2).mean()
    
    # NOTE: calling .backward() with no arguments only works on a single (scalar) value
    # So we must perform some sort of "reduction" aka take the average/sum etc
    error.backward()
    error_log.append(error.item())
    
    with torch.no_grad():
        params -= learning_rate * params.grad
        params.grad.zero_()

    counter += 1
    
    if counter == max_num_steps:
        break

In [None]:
plt.plot(error_log)

In [None]:
# How close is our equation?
with torch.no_grad():
    y_pred = torch.mm(x_data, params)
    plt.figure(figsize=(5, 5))
    plt.scatter(x, y_pred)
    plt.scatter(x, random_data)
    print("Predicted Equation:")
    print("y = %.2fx^2 + %.2fx + %.2f" % (params[0], params[1], params[2]))
    print("Real Equation:")
    print("y = %.2fx^2 + %.2fx + %.2f" % (-0.55, 1.2, -0.81))