# Backpropagation

In a regression problem, we want to minimize the LOSS function $L$, here the squared difference between the prediction $\hat{y}$ and the target $y$.

$\hat{y}$ is the product of the input $x$ and the weight $w$: $\hat{y} = w x$.

![Regression](./figures/regression.png)

To minimize the LOSS function we would like to know the derivative of the LOSS function with respect to the weight $w$.

$$\frac{\partial L}{\partial w} $$

### STEP 1: FORWARD PASS

The forward pass is the process of calculating the output $\hat{y}$ from the input $x$ and the weight $w$ (which can be randomly initialized). 

### STEP 2: LOCAL GRADIENTS 

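Each node of the computation graph has a local gradient that depends only on its own input and output. For example, with $s = \hat{y} - y$ and $L = s^2$: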

$$ \frac{\partial L}{\partial s} = 2s$$

![local gradient](./figures/local_gradients.png)

### STEP 3: BACKWARD PASS

The backward pass is the process of calculating the gradients of the LOSS function with respect to the weight $w$ using the chain rule. 

$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial s} \frac{\partial s}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} $$
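
Writing out each factor makes the chain rule concrete. Assuming $s = \hat{y} - y$, $L = s^2$ and $\hat{y} = wx$, which matches the code in this notebook, the three local gradients and their product are:

$$ \frac{\partial L}{\partial s} = 2s, \qquad \frac{\partial s}{\partial \hat{y}} = 1, \qquad \frac{\partial \hat{y}}{\partial w} = x \quad\Longrightarrow\quad \frac{\partial L}{\partial w} = 2s \cdot 1 \cdot x = 2(wx - y)\,x $$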


In [2]:
import torch 

### Forward pass to compute LOSS

In [5]:
# input data
x = torch.tensor(1.0)

# target
y = torch.tensor(2.0)

# weights
w = torch.tensor(1.0, requires_grad=True)

# forward pass and compute loss
y_hat = w * x 
loss = (y_hat - y)**2

print(loss)

tensor(1., grad_fn=<PowBackward0>)


### Backward pass to compute the gradient

In [6]:
# backward pass
loss.backward()

# print gradient of loss with respect to w
print(w.grad)

tensor(-2.)


We can easily check (by hand) that $\frac{\partial L}{\partial w} = 2(\hat{y} - y)\,x$. In this case $\hat{y} = 1$, $y = 2$ and $x = 1$, so the gradient is equal to $-2$.

![gradient](./figures/gradient.png)
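
The three steps can also be traced without autograd. This is a minimal plain-Python sketch of the scalar example, not the way PyTorch computes it internally. The intermediate names (`s`, `dL_ds`, `ds_dyhat`, `dyhat_dw`) are chosen here for illustration.

```python
# Scalar example: x = 1, y = 2, w = 1
x, y, w = 1.0, 2.0, 1.0

# STEP 1: forward pass
y_hat = w * x          # prediction
s = y_hat - y          # residual
loss = s ** 2          # squared error

# STEP 2: local gradients
dL_ds = 2 * s          # d(s^2)/ds
ds_dyhat = 1.0         # d(y_hat - y)/d(y_hat)
dyhat_dw = x           # d(w * x)/dw

# STEP 3: backward pass (chain rule)
dL_dw = dL_ds * ds_dyhat * dyhat_dw

print(loss, dL_dw)     # 1.0 -2.0
```

Both numbers agree with the `loss` and `w.grad` values printed by PyTorch above.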

In [30]:
x = torch.tensor([2.0,3.0,1.0])

# requires_grad=True is usually set on the
# parameters that will be learned by the model
w = torch.tensor([1.0,1.0,2.0], requires_grad=True)

y = torch.tensor(3.0)

# forward pass to compute loss
y_hat = w@x

loss = (y_hat - y)**2

In [31]:
print(y_hat)

print(loss)

tensor(7., grad_fn=<DotBackward0>)
tensor(16., grad_fn=<PowBackward0>)


In [32]:
# backward pass to compute gradients
loss.backward()

# print gradients
print(w.grad)


tensor([16., 24.,  8.])
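
The same hand check works for the vector case, where the chain rule gives $\frac{\partial L}{\partial w_i} = 2(\hat{y} - y)\,x_i$. Here is a small plain-Python sketch, with lists standing in for tensors:

```python
x = [2.0, 3.0, 1.0]
w = [1.0, 1.0, 2.0]
y = 3.0

# forward pass: dot product, then squared error
y_hat = sum(wi * xi for wi, xi in zip(w, x))
s = y_hat - y
loss = s ** 2

# backward pass: dL/dw_i = 2 * s * x_i
grad = [2 * s * xi for xi in x]

print(y_hat, loss, grad)   # 7.0 16.0 [16.0, 24.0, 8.0]
```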


In [42]:
# learning rate
lr = 0.01

# update weights
w.data = w.data - lr * w.grad.data

In [43]:
w.data

tensor([0.8400, 0.7600, 1.9200])

In [44]:
# next forward pass
y_hat = w@x

loss = (y_hat - y)**2

print(y_hat)

print(loss)

tensor(5.8800, grad_fn=<DotBackward0>)
tensor(8.2944, grad_fn=<PowBackward0>)


In [45]:
# backward pass: w.grad was not zeroed, so the new gradient
# is added to the one from the previous backward pass
loss.backward()

# print gradients
print(w.grad)

tensor([27.5200, 41.2800, 13.7600])
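
The gradient printed above is not the gradient of the current loss alone, because `loss.backward()` adds to whatever `w.grad` already holds. The plain-Python check below uses the closed form $2(w \cdot x - y)\,x$. The helper `loss_and_grad` is hypothetical and written only for this illustration.

```python
x = [2.0, 3.0, 1.0]
y = 3.0
lr = 0.01

def loss_and_grad(w):
    """Squared-error loss and its gradient 2 * (w.x - y) * x."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - y
    return s ** 2, [2 * s * xi for xi in x]

w = [1.0, 1.0, 2.0]
_, g1 = loss_and_grad(w)                         # first backward pass
w = [wi - lr * gi for wi, gi in zip(w, g1)]      # first weight update
_, g2 = loss_and_grad(w)                         # gradient of the second loss alone
accumulated = [a + b for a, b in zip(g1, g2)]    # what w.grad holds without a reset

print([round(g, 4) for g in g2])
print([round(g, 4) for g in accumulated])
```

The second list matches the `w.grad` printed by PyTorch. The first list is the gradient that a fresh backward pass would have produced.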


In [46]:
# update weights
w.data = w.data - lr * w.grad.data
w.data

tensor([0.5648, 0.3472, 1.7824])

In [47]:
# next forward pass
y_hat = w@x

loss = (y_hat - y)**2

print(y_hat)

print(loss)

tensor(3.9536, grad_fn=<DotBackward0>)
tensor(0.9094, grad_fn=<PowBackward0>)


In [48]:
# backward pass: the gradient is again added to the
# values already stored in w.grad
loss.backward()

# print gradients
print(w.grad)

tensor([31.3344, 47.0016, 15.6672])


In [49]:
# update weights
w.data = w.data - lr * w.grad.data
w.data

tensor([ 0.2515, -0.1228,  1.6257])

In [50]:
# next forward pass
y_hat = w@x

loss = (y_hat - y)**2

print(y_hat)

print(loss)

tensor(1.7602, grad_fn=<DotBackward0>)
tensor(1.5371, grad_fn=<PowBackward0>)


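Note that the loss falls from 16 to 8.2944 to 0.9094 and then rises to 1.5371: `loss.backward()` adds to `w.grad` instead of overwriting it, so the second and third updates use the sum of all gradients computed so far.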
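
This is a minimal sketch of the same training steps with `w.grad` reset after every update, so that each step uses only the gradient of the current loss (PyTorch accumulates gradients across `backward()` calls). The `for` loop, the `losses` list and the choice of 4 steps are illustrative and not part of the cells above. The update is also written with `torch.no_grad()` rather than `.data`.

```python
import torch

x = torch.tensor([2.0, 3.0, 1.0])
y = torch.tensor(3.0)
w = torch.tensor([1.0, 1.0, 2.0], requires_grad=True)
lr = 0.01

losses = []
for step in range(4):
    # forward pass
    y_hat = w @ x
    loss = (y_hat - y) ** 2
    losses.append(loss.item())

    # backward pass
    loss.backward()

    # update the weights without tracking the update in the graph
    with torch.no_grad():
        w -= lr * w.grad

    # reset the gradient so the next backward pass starts from zero
    w.grad.zero_()

print(losses)
```

With the reset in place, the recorded loss decreases at every step.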