In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch

# Introduction to Backpropagation

In this notebook, we'll cover the basics of backpropagation and how it is implemented in PyTorch using the autograd package.

## Part 1: What is Backpropagation?

Backpropagation is an algorithm used to train artificial neural networks by efficiently computing the gradients of the loss function with respect to the network's parameters. It is central to deep learning because it makes training deep neural networks on large datasets computationally feasible.

The basic idea behind backpropagation is to use the chain rule to compute the gradients of the loss function with respect to each trainable parameter in the network. This is done by working backwards through the network, starting from the output layer and moving towards the input layer.



The key steps of the backpropagation algorithm are:

- Forward pass: compute the output of the network given an input.
- Compute the loss: measure the discrepancy between the network's output and the target output.
- Backward pass: compute the gradients of the loss with respect to each parameter in the network using the chain rule.
- Update the parameters: update each parameter in the network using an optimization algorithm (e.g., stochastic gradient descent) to minimize the loss.
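
The four steps above can be sketched as a short PyTorch training loop. This is a minimal illustration, not a recommended setup: the toy data, the two-layer model, the learning rate, and the number of steps are all assumptions chosen only to keep the example small.

```python
import torch

torch.manual_seed(0)

# Toy regression data, made up for illustration: 8 samples with 2 features each.
X = torch.randn(8, 2)
y_true = torch.randn(8, 1)

# Assumed model: 2 inputs -> 2 hidden neurons (tanh) -> 1 output.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2),
    torch.nn.Tanh(),
    torch.nn.Linear(2, 1),
)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(50):
    y_pred = model(X)                # 1. forward pass
    loss = loss_fn(y_pred, y_true)   # 2. compute the loss
    optimizer.zero_grad()            # clear gradients left over from the previous step
    loss.backward()                  # 3. backward pass: fill .grad for every parameter
    optimizer.step()                 # 4. update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The call to `optimizer.zero_grad()` is needed because PyTorch accumulates gradients across calls to `backward()` instead of overwriting them.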

### Mathematics of Backpropagation

Let's consider a simple feedforward neural network with one hidden layer:

where x_1 and x_2 are the input features, w_{i,j} are the weights connecting the i-th input to the j-th hidden neuron, b_i is the bias of the i-th hidden neuron, f is the activation function, and o represents the output of each neuron.


We assume that the output layer has a single neuron that produces the final prediction y.

The pre-activation output is a linear combination of the hidden layer activations

where z_1 and z_2 are defined as:

where w_{i,j} is the weight connecting the i-th input to the j-th neuron in the hidden layer, b_j is the bias of the j-th hidden neuron, and x_1 and x_2 are the input values.
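
The forward pass described in this section can be written out as follows. This is a reconstruction from the definitions in the text, not necessarily the original notation: it assumes one bias per hidden neuron, and the output-layer weights v_j, the output bias c, and the name z_out are hypothetical symbols introduced here for illustration.

```latex
\begin{align}
% Hidden layer: pre-activation and activation of neuron j (j = 1, 2)
z_j &= w_{1,j}\,x_1 + w_{2,j}\,x_2 + b_j, \\
o_j &= f(z_j), \\
% Output neuron: linear combination of the hidden activations
% (v_j and c are assumed symbols, not taken from the source)
z_{\text{out}} &= v_1\,o_1 + v_2\,o_2 + c
\end{align}
```

The prediction y is then obtained from z_out, either directly or after a final activation, depending on the task.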

Our goal is to train the network to minimize a given loss function L(y, y_true) with respect to the weights and biases of the network, where y_true is the ground-truth label. The backpropagation algorithm computes the gradients of the loss function with respect to each weight and bias using the chain rule:

where i indexes the inputs and j the hidden neurons, and dz_j/dw_{i,j} and dz_j/db_j are the partial derivatives of the pre-activation output z_j with respect to its weights and bias, respectively.
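
For the hidden layer, the chain rule can be sketched as below. The sketch assumes that the pre-activation of hidden neuron j is z_j = w_{1,j} x_1 + w_{2,j} x_2 + b_j, with j indexing the hidden neuron as in the definition of w_{i,j}.

```latex
\begin{align}
\frac{\partial L}{\partial w_{i,j}}
  &= \frac{\partial L}{\partial z_j}\,\frac{\partial z_j}{\partial w_{i,j}}
   = \frac{\partial L}{\partial z_j}\,x_i, \\
\frac{\partial L}{\partial b_j}
  &= \frac{\partial L}{\partial z_j}\,\frac{\partial z_j}{\partial b_j}
   = \frac{\partial L}{\partial z_j}
\end{align}
```

The factor dL/dz_j is shared by every weight and the bias of neuron j, which is why the backward pass computes it once per neuron and reuses it.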

The gradients are computed in a backwards pass through the network, starting from the output layer and moving towards the input layer. 

For example, the gradients of the loss function with respect to the weights and biases of the output layer can be computed as:

where j is the index of the output
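
As a check on the chain-rule reasoning, the sketch below computes the gradients of a tiny network by hand and compares them with autograd. Everything specific here is an assumption made for illustration: the tanh activation, the squared-error loss, a linear output neuron, and the names `v` and `c` for the output-layer weights and bias.

```python
import torch

torch.manual_seed(0)

# Hypothetical network: 2 inputs -> 2 hidden neurons (tanh) -> 1 linear output.
x = torch.tensor([0.5, -1.0])               # inputs x_1, x_2
y_true = torch.tensor(1.0)                  # ground-truth label
W = torch.randn(2, 2, requires_grad=True)   # W[i, j]: input i -> hidden neuron j
b = torch.randn(2, requires_grad=True)      # hidden-layer biases
v = torch.randn(2, requires_grad=True)      # assumed output-layer weights
c = torch.randn((), requires_grad=True)     # assumed output-layer bias

# Forward pass
z = x @ W + b                  # z[j] = sum_i x[i] * W[i, j] + b[j]
o = torch.tanh(z)              # hidden activations
y = o @ v + c                  # linear output neuron
loss = 0.5 * (y - y_true) ** 2

# Backward pass with autograd
loss.backward()

# Backward pass by hand, from the output layer towards the input layer
with torch.no_grad():
    dL_dy = y - y_true                  # derivative of the squared-error loss
    grad_v = dL_dy * o                  # output-layer weights
    grad_c = dL_dy                      # output-layer bias
    dL_dz = dL_dy * v * (1 - o ** 2)    # through tanh: f'(z) = 1 - tanh(z)^2
    grad_W = torch.outer(x, dL_dz)      # grad_W[i, j] = x[i] * dL_dz[j]
    grad_b = dL_dz                      # hidden-layer biases

for name, manual, auto in [("v", grad_v, v.grad), ("c", grad_c, c.grad),
                           ("W", grad_W, W.grad), ("b", grad_b, b.grad)]:
    print(name, torch.allclose(manual, auto))
```

The output-layer gradients are computed first and the hidden-layer gradients reuse `dL_dy`, mirroring the backwards order described above.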

## Part 2: Autograd in PyTorch

PyTorch provides a powerful automatic differentiation package called autograd, which makes it easy to compute gradients of functions of tensors. Autograd works by recording the operations performed on tensors and building a computational graph that represents the chain of operations. This graph is then traversed backwards to efficiently compute the gradients of the loss function with respect to each parameter in the network.
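
The recorded graph can be inspected through the `grad_fn` attribute of a tensor. The snippet below is a small sketch with made-up values; the variable names `a`, `prod`, and `total` are only for illustration.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)   # leaf tensor created by the user
prod = a * 3                                 # recorded as a multiplication node
total = prod + 1                             # recorded as an addition node

# Leaf tensors have no grad_fn; results of tracked operations do.
print(a.is_leaf, a.grad_fn)
print(type(prod.grad_fn).__name__)
print(type(total.grad_fn).__name__)

# Each node links back to the nodes that produced its inputs.
print(total.grad_fn.next_functions)
```

Following `next_functions` from the final result leads back to the leaf tensors, which is the path autograd walks during the backward pass.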

In [4]:
# define a tensor variable
x = torch.randn(10)
x

tensor([ 0.9315,  0.2164,  0.0336,  0.2572, -1.5692, -1.5824, -1.2028, -0.3714,
         0.8283,  1.5677])

To use autograd, we need to create the tensor with requires_grad=True so that the operations applied to it are tracked.

In [5]:
x = torch.randn(10, requires_grad=True)
x

tensor([ 0.3928, -0.8631, -0.2639, -1.6774, -0.4690,  0.4454,  0.3350,  0.4344,
        -0.1703, -0.9364], requires_grad=True)

In [10]:
# define a function using the tensor variable
y = (x**2 + 2*x + 1).mean()

To compute the gradients in PyTorch, we call the backward() method on the output.

Called without arguments, backward() only works on scalar outputs, which is why y is reduced to a scalar with mean().
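
The scalar requirement can be seen in a small experiment. The sketch below uses a separate tensor `t` (a name chosen here so it does not interfere with `x` and `y`) and shows two ways around the restriction: reducing the output to a scalar, or passing an explicit `gradient` argument to backward().

```python
import torch

t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Calling backward() with no arguments on a non-scalar output raises an error.
out = t**2 + 2*t + 1
try:
    out.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print("backward() failed:", err)

# Option 1: reduce to a scalar first.
out = t**2 + 2*t + 1
out.sum().backward()
grad_sum = t.grad.clone()

# Option 2: pass a gradient tensor with the same shape as the output.
t.grad = None                         # reset, since gradients accumulate
out = t**2 + 2*t + 1
out.backward(gradient=torch.ones_like(out))
grad_vec = t.grad.clone()

print(grad_sum, grad_vec)
```

With a tensor of ones as the `gradient` argument, the second option gives the same result as summing the output first, namely 2t + 2 for each element.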

In [12]:
# compute the gradients using autograd
y.backward()

The gradients are stored in the grad attribute of x.

In [13]:
# print the gradients
print(x.grad)

tensor([ 0.2786,  0.0274,  0.1472, -0.1355,  0.1062,  0.2891,  0.2670,  0.2869,
         0.1659,  0.0127])


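This is the gradient of y = mean(x^2 + 2x + 1) with respect to x. Because y averages over the 10 elements of x, each entry of the gradient equals (2x_i + 2)/10. For example, the first entry is (2 * 0.3928 + 2)/10 ≈ 0.2786.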
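
The same result can be checked numerically against the analytic derivative. This is a small sketch using fresh variables (`x_check` and `y_check` are names introduced here) so that it does not disturb the tensors defined above.

```python
import torch

torch.manual_seed(0)

x_check = torch.randn(10, requires_grad=True)
y_check = (x_check**2 + 2*x_check + 1).mean()
y_check.backward()

# Analytic gradient of mean(x^2 + 2x + 1): (2x + 2) divided by the number of elements.
analytic = (2 * x_check.detach() + 2) / x_check.numel()

matches = torch.allclose(x_check.grad, analytic)
print(matches)
```

Calling `detach()` keeps the analytic computation out of the autograd graph, so the comparison uses plain values.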