# Chapter 9: Backpropagation

In [1]:
# Preface: Import the necessary packages
import numpy as np
import matplotlib.pyplot as plt
import math
import nnfs
from resources.classes import DenseLayer, ReLU, SoftMax, Loss, CategoricalCrossEntropy

## Section 1: Backprop. Intro

We'll start off the chapter by backpropagating the ReLU function for a single neuron with the goal of minimizing **the output** from this neuron. This won't directly translate to our model ops, since the goal there is to minimize **loss**, but it does serve as a good example showing how the process would work.

Let's initialize a neuron:


In [2]:
# Creating input list of length 3
x = [1.0, -2.0, 3.0]
# Creating example weights (hard-coded rather than random)
w = [-3.0, -1.0, 2.0]
# Setting bias variable
b = 1

xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]
z = xw0 + xw1 + xw2 + b
# This could have been done just using a "z = np.dot(x, w) + b", but the format we've chosen is more convenient for our experimentation
print(f"the layer output before act. function is {z}")

y = max(z, 0)
print(f"the neuron output is {y}")

the layer output before act. function is 6.0
the neuron output is 6.0


That was a full forward pass through the (made-up) data! Now we can think about how to approach backpropagation.

First, let's imagine what our function is actually doing, which can be roughly interpreted as $ReLU(\sum[inputs \cdot weights] + bias)$ and which we can write more specifically as $ReLU(x_{0}w_{0} + x_{1}w_{1} + x_{2}w_{2} + bias)$. We will rewrite this as $y = ReLU(sum(mul(x_{0}, w_{0}), mul(x_{1}, w_{1}), mul(x_{2}, w_{2}), bias))$ for the purposes of easier derivation. If we're trying to find the derivative of y with respect to $x_{0}$, we can write the following:
$$
\frac{\partial}{\partial x_{0}}[ReLU(sum(mul(x_{0}, w_{0}), mul(x_{1}, w_{1}), mul(x_{2}, w_{2}), bias))] = \\
\frac{dReLU()}{dsum()} \cdot \frac{\partial sum()}{\partial mul(x_{0}, w_{0})} \cdot \frac{\partial mul(x_{0}, w_{0})}{\partial x_{0}}
$$
Now, if we were to just solve this out, we would see the impact that $x_{0}$ is actually having on the output.
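
As a quick sanity check, we can approximate this derivative numerically instead of solving it by hand. The sketch below is not part of the original walkthrough: `neuron` is a hypothetical helper that wraps the forward pass, the `_check` names are made up so they don't clash with the notebook variables, and `h` is an arbitrarily chosen small step. Nudging $x_{0}$ by `h` and dividing the change in output by `h` should land very close to $w_{0}$, which is -3.0 here.

```python
# Hypothetical helper: the single neuron from above wrapped in a function.
def neuron(x, w, b):
    z = x[0] * w[0] + x[1] * w[1] + x[2] * w[2] + b
    return max(z, 0.0)

# Same values as the neuron above, under different names.
x_check = [1.0, -2.0, 3.0]
w_check = [-3.0, -1.0, 2.0]
b_check = 1.0

h = 1e-6  # small nudge applied to x0 only (an arbitrary choice)
x_nudged = [x_check[0] + h, x_check[1], x_check[2]]

# Forward-difference approximation of dy/dx0
approx_dy_dx0 = (neuron(x_nudged, w_check, b_check) - neuron(x_check, w_check, b_check)) / h

print(f"numerical dy/dx0 is roughly {approx_dy_dx0:.4f}, and w0 is {w_check[0]}")
```

This kind of finite-difference check is only an approximation, but it is a handy way to catch mistakes in a hand-derived gradient.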

During the backward pass, what we actually do is calculate the derivative of the loss function and multiply it with the derivative of the activation function, and then the derivative of the output layer, and so on, all the way through the hidden layers and activation functions.

In all of these layers, the derivative with respect to the weights and biases will form the gradients that will tell us how to update our weights and biases.

Let's work backwards through our network now, assuming that the neuron receives a gradient of 1 from the next layer.

The first step in our process is calculating the derivative of the ReLU activation function -- which we've already done before! I'll write it out below: 
$$
f(x) = max(x, 0) \rightarrow \frac{d}{dx} f(x) = 1(x > 0)
$$

Now, let's move to using this in Python.

In [3]:
# Make sure you have run the previous code cell so there is a z to go off.

# Hard-coding the gradient received from the next layer
dValue = 1.0

# The RHS below is the derivative of the ReLU function with respect to z (the neuron's pre-activation sum), multiplied by the incoming gradient per the chain rule.
dReluDz = dValue * (1. if (z > 0) else 0.)
print(dReluDz)

1.0


Now with our ReLU derivative handled, the immediately preceding operation was the summation of the weighted inputs and bias. So, here we need to calculate a partial derivative of the sum function and then use the chain rule to multiply it by the derivative of the outer function -- which is the ReLU.

We can begin defining the partial derivatives:
- dReluDxw0 -- the partial derivative of ReLU w.r.t. the first weighted input, x0w0
- dReluDxw1 -- the partial derivative of ReLU w.r.t. the second weighted input, x1w1
- dReluDxw2 -- the partial derivative of ReLU w.r.t. the third weighted input, x2w2
- dReluDb -- the partial derivative of ReLU w.r.t. the bias, b

As we know, the partial derivative of a sum with respect to any one of its inputs is always 1, no matter what the inputs are.

So, we can now incorporate this into our Python.

In [4]:
# Make sure you have run the previous code cells so there is a dReluDz to go off.

# I'm just going to make one variable, since all of it will just be 1
dSumDxwX = 1
dSumDb = 1

# Now let's calculate the derivative for each
dReluDxw0 = dReluDz * dSumDxwX
dReluDxw1 = dReluDz * dSumDxwX
dReluDxw2 = dReluDz * dSumDxwX
dReluDb = dReluDz * dSumDb

print(dReluDxw0, dReluDxw1, dReluDxw2, dReluDb)

1.0 1.0 1.0 1.0


Great, so that's the summation function! Now, we have to do arguably the most complex one: the multiplication function.

As we can remember, the derivative for a product is whatever the input is being multiplied by, as I'll show below:
$$
f(x,y) = x \cdot y \rightarrow \frac{\partial}{\partial x} f(x,y) = y, \quad \frac{\partial}{\partial y} f(x,y) = x
$$

Following this, the partial derivative of the first weighted input $(x \cdot w)$ with respect to the input (x) is just the weight (w) -- as it is the other input of the function.
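
Plugging in the numbers from our neuron makes this concrete. Assuming the incoming gradient of 1 and using $z = 6 > 0$, $x_{0} = 1$ and $w_{0} = -3$ from the forward pass, the full chain for the first input and the first weight works out to:
$$
\frac{\partial y}{\partial x_{0}} = 1(z > 0) \cdot 1 \cdot w_{0} = 1 \cdot 1 \cdot (-3) = -3 \\
\frac{\partial y}{\partial w_{0}} = 1(z > 0) \cdot 1 \cdot x_{0} = 1 \cdot 1 \cdot 1 = 1
$$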

So, let's add this functionality to our code.

In [5]:
# Pull the variables
dMulDx0 = w[0]
dMulDx1 = w[1]
dMulDx2 = w[2]
dMulDw0 = x[0]
dMulDw1 = x[1]
dMulDw2 = x[2]

# Actually calculate the derivative
dReluDx0 = dReluDxw0 * dMulDx0
dReluDx1 = dReluDxw1 * dMulDx1
dReluDx2 = dReluDxw2 * dMulDx2
dReluDw0 = dReluDxw0 * dMulDw0
dReluDw1 = dReluDxw1 * dMulDw1
dReluDw2 = dReluDxw2 * dMulDw2

print(dReluDx0, dReluDw0, dReluDx1, dReluDw1, dReluDx2, dReluDw2)

-3.0 1.0 -1.0 -2.0 2.0 3.0


Now that is our entire set of neuronal partial derivatives with respect to the inputs, weights, and the bias. We can now use this to optimize these calculations. 

All together, these can be represented as:

In [6]:
dx = [dReluDx0, dReluDx1, dReluDx2] # the gradients on inputs
dw = [dReluDw0, dReluDw1, dReluDw2] # the gradients on the weights
db = dReluDb # the gradient on the bias, of which there is just one

print(dx, dw, db)

[-3.0, -1.0, 2.0] [1.0, -2.0, 3.0] 1.0
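
The same backward pass can also be written with NumPy arrays instead of one variable per term. This is only a sketch under the same assumptions as above (a gradient of 1 arriving from the next layer); the `_vec` names are made up here so that they don't overwrite the notebook variables.

```python
import numpy as np

# Same neuron as above, stored as arrays (the "_vec" names are illustrative).
x_vec = np.array([1.0, -2.0, 3.0])
w_vec = np.array([-3.0, -1.0, 2.0])
b_vec = 1.0
dvalue_vec = 1.0  # gradient assumed to arrive from the next layer

# Forward pass up to the pre-activation sum
z_vec = np.dot(x_vec, w_vec) + b_vec

# ReLU derivative times the incoming gradient (chain rule)
drelu_dz_vec = dvalue_vec * (1.0 if z_vec > 0 else 0.0)

# The sum contributes a factor of 1; the multiplication contributes "the other operand"
dx_vec = drelu_dz_vec * w_vec   # gradients on the inputs
dw_vec = drelu_dz_vec * x_vec   # gradients on the weights
db_vec = drelu_dz_vec * 1.0     # gradient on the bias

print(dx_vec, dw_vec, db_vec)
```

The values should match the lists printed above; only the bookkeeping changes.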


We'll now use these to see how we can change our weights to minimize the output (as was our goal for this example), but we would normally hand them to our optimizer to reduce the loss.

If we take a look at our current weights, bias, and output:

In [7]:
print(f"{w}, {b}, {z}")

[-3.0, -1.0, 2.0], 1, 6.0


Now, we can use our calculated partial derivatives to play with this and see if we can decrease output:

In [8]:
w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db

print(w, b)

[-3.001, -0.998, 1.997] 0.999


Let's perform a forward pass to see how this impacts our final output:

In [9]:
# Multiply inputs and weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Add up mult + bias
z = xw0 + xw1 + xw2 + b

# ReLU function for output
y = max(z, 0)

print(y)

5.985


That means that we've reduced our output! While it's only by a very tiny bit, 6.0 vs 5.985, it shows us that we're trending in the right direction! Like I said, optimizing a single neuron for the pure sake of minimizing its output is something that won't translate into the real world, but it's a step. What we're actually going to be doing is working to decrease the final loss value.
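
To see the trend more clearly, we can repeat the forward pass, backward pass, and update a few times in a loop. This is a minimal sketch, not part of the original walkthrough: the `_loop` names and the choice of 5 steps are made up for illustration, and the step size of 0.001 is the same one used in the manual update.

```python
# Start again from the original neuron (illustrative "_loop" names).
x_loop = [1.0, -2.0, 3.0]
w_loop = [-3.0, -1.0, 2.0]
b_loop = 1.0
learning_rate = 0.001  # same step size as the manual update

outputs = []
for step in range(5):
    # Forward pass
    z_loop = sum(xi * wi for xi, wi in zip(x_loop, w_loop)) + b_loop
    y_loop = max(z_loop, 0.0)
    outputs.append(y_loop)

    # Backward pass, assuming a gradient of 1 from the next layer
    drelu_loop = 1.0 if z_loop > 0 else 0.0
    dw_loop = [drelu_loop * xi for xi in x_loop]
    db_loop = drelu_loop

    # Step against the gradient to push the output down
    w_loop = [wi - learning_rate * dwi for wi, dwi in zip(w_loop, dw_loop)]
    b_loop -= learning_rate * db_loop

print(outputs)
```

Each recorded output should be a little smaller than the one before it, dropping by about 0.015 per step while the neuron stays active.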

Our next objective will be to apply this to a list of samples and expand it to a whole layer of neurons. In this example, our neural net will consist of a single hidden layer with 3 neurons (each with 4 inputs and 4 weights). Let's set up below:

In [15]:
# We'll make up the gradients from the "next" layer for the sake of this example
dvalues = np.array([[1.0, 1.0, 1.0]])

# We have 3 sets of weights and 4 inputs, meaning we need 4 weights each.
weights = np.array([[0.2, 0.8, -0.5, 1],
                   [0.5, -0.91, 0.26, -0.5],
                   [-0.26, -0.27, 0.17, 0.87]]).T

# For each input, multiply its weights by the gradients and sum the results
dx0 = sum(weights[0]*dvalues[0])
dx1 = sum(weights[1]*dvalues[0])
dx2 = sum(weights[2]*dvalues[0])
dx3 = sum(weights[3]*dvalues[0])

dInputs = np.array([dx0, dx1, dx2, dx3])

print(dInputs)

[ 0.44 -0.38 -0.07  1.37]


From this, we see how dInputs is the gradient of the neuron function with respect to the inputs.

However, we can simplify this tremendously by just using np.dot!  

In [16]:
dInputs = np.dot(dvalues[0], weights.T)
print(dInputs)

[ 0.44 -0.38 -0.07  1.37]


That about does it -- but we're missing one thing: the ability to handle a batch of samples. Let's implement that now:

In [18]:
# We'll create a row of gradient values for each sample in the batch
dvalues = np.array([[1.0, 1.0, 1.0],
                    [2.0, 2.0, 2.0],
                    [3.0, 3.0, 3.0]])

dInputs = np.dot(dvalues, weights.T)

print(dInputs)

[[ 0.44 -0.38 -0.07  1.37]
 [ 0.88 -0.76 -0.14  2.74]
 [ 1.32 -1.14 -0.21  4.11]]
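
One way to convince ourselves that the batched dot product is doing the right thing is to compare it against a per-sample loop. The sketch below rebuilds the same `weights` and `dvalues` so it can run on its own; `looped` and `batched` are names made up for this comparison.

```python
import numpy as np

weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T   # shape (4 inputs, 3 neurons)

dvalues = np.array([[1.0, 1.0, 1.0],
                    [2.0, 2.0, 2.0],
                    [3.0, 3.0, 3.0]])                # shape (3 samples, 3 neurons)

# Per-sample version: handle each row exactly like the single-sample case
looped = np.array([np.dot(sample_grad, weights.T) for sample_grad in dvalues])

# Batched version: one matrix product for the whole batch
batched = np.dot(dvalues, weights.T)

print(looped.shape, batched.shape)
print(np.allclose(looped, batched))
```

Both results have one row per sample and one column per input, which is the shape the previous layer would expect to receive.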


Those are our gradients with respect to the inputs. That was a lot. So, now we should take a look at our gradients with respect to the weights. 

In [21]:
# We have 3 sets of sample inputs
inputs = np.array([[1, 2, 3, 2.5],
                   [2, 5, -1, 2],
                   [-1.5, 2.7, 3.3, -0.8]])

# Notice how this time we flip the position of inputs.T and dvalues so that the arrangement is (n x m) and (m x p).
dweights = np.dot(inputs.T, dvalues)

print(dweights)

[[ 0.5  0.5  0.5]
 [20.1 20.1 20.1]
 [10.9 10.9 10.9]
 [ 4.1  4.1  4.1]]
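
The dot product above can be read as a sum over samples: each sample contributes the outer product of its inputs with the gradients it received. The loop below is a sketch of that reading, using the same `inputs` and `dvalues`; `dweights_loop` is a name made up for this check.

```python
import numpy as np

inputs = np.array([[1, 2, 3, 2.5],
                   [2, 5, -1, 2],
                   [-1.5, 2.7, 3.3, -0.8]])

dvalues = np.array([[1.0, 1.0, 1.0],
                    [2.0, 2.0, 2.0],
                    [3.0, 3.0, 3.0]])

# Accumulate one (inputs x neurons) outer product per sample
dweights_loop = np.zeros((inputs.shape[1], dvalues.shape[1]))
for sample_inputs, sample_grad in zip(inputs, dvalues):
    dweights_loop += np.outer(sample_inputs, sample_grad)

print(dweights_loop.shape)
print(np.allclose(dweights_loop, np.dot(inputs.T, dvalues)))
```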


This matches the shape of our weights: each entry is the sum, over all samples, of an input multiplied by the gradient arriving at the corresponding neuron. We can do this for biases as well!

In [22]:
# One bias for each neuron
biases = np.array([[2, 3, 0.5]])

# Sum it over the samples and keep the row vector dimensions
dbiases = np.sum(dvalues, axis=0, keepdims=True)

print(dbiases)

[[6. 6. 6.]]


Finally, we should also account for the ReLU function, whose derivative is 1 when its input is > 0 and 0 otherwise.

In [27]:
# Creating a random array of layer outputs
z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])

dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])

# np.zeros_like(arg) is a function that returns an array of the same size as the arg but filled with 0's
drelu = np.zeros_like(z)
# Boolean mask: wherever z > 0, set the corresponding entry to 1.
drelu[z > 0] = 1
print(drelu)

# Apply the chain rule
drelu *= dvalues
print(drelu)

[[1 1 0 0]
 [1 0 0 1]
 [0 1 1 0]]
[[ 1  2  0  0]
 [ 5  0  0  8]
 [ 0 10 11  0]]
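
Since the ReLU derivative is only ever 1 or 0, multiplying by it either keeps the incoming gradient or zeroes it out. That suggests a shorter variant: copy the incoming gradients and zero the entries where the input was not positive. This is a sketch of that idea using the same arrays; `drelu_short` is a name made up here.

```python
import numpy as np

z = np.array([[1, 2, -3, -4],
              [2, -7, -1, 3],
              [-1, 2, 5, -1]])

dvalues = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])

# Copy first so the incoming gradients are not modified in place
drelu_short = dvalues.copy()
# Zero the gradient wherever the ReLU input was zero or negative
drelu_short[z <= 0] = 0

print(drelu_short)
```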


I'm going to update our classes to account for what we've learned so far in this chapter. Here is a summary of the changes to look out for:
- Within the DenseLayer class:
    - Added "self.inputs" as an attribute in the forward method
    - Created the "backward" method and its corresponding process
- Within the ReLU class:
    - Added "self.inputs" as an attribute in the forward method
    - Created the "backward" method and its corresponding process
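
To give a rough idea of what those changes could look like, here is a minimal sketch. It is not the code in `resources.classes`: the class names `DenseLayerSketch` and `ReLUSketch`, the attribute names `dweights`, `dbiases`, and `dinputs`, and the weight initialization are all assumptions made for illustration.

```python
import numpy as np

class DenseLayerSketch:
    """Hypothetical stand-in for DenseLayer, for illustration only."""

    def __init__(self, n_inputs, n_neurons):
        # Assumed initialization: small random weights, zero biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.inputs = inputs  # remembered for the backward pass
        self.output = np.dot(inputs, self.weights) + self.biases

    def backward(self, dvalues):
        # Gradients on the parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        # Gradient passed back to the previous layer
        self.dinputs = np.dot(dvalues, self.weights.T)


class ReLUSketch:
    """Hypothetical stand-in for ReLU, for illustration only."""

    def forward(self, inputs):
        self.inputs = inputs  # remembered for the backward pass
        self.output = np.maximum(0, inputs)

    def backward(self, dvalues):
        # Pass the gradient through where the input was positive, zero it elsewhere
        self.dinputs = dvalues.copy()
        self.dinputs[self.inputs <= 0] = 0


# Usage: one forward and backward pass on the sample inputs from this section
batch = np.array([[1, 2, 3, 2.5],
                  [2, 5, -1, 2],
                  [-1.5, 2.7, 3.3, -0.8]])

layer = DenseLayerSketch(4, 3)
activation = ReLUSketch()

layer.forward(batch)
activation.forward(layer.output)

# Assume a gradient of 1 for every output, as in the examples above
activation.backward(np.ones_like(activation.output))
layer.backward(activation.dinputs)

print(layer.dweights.shape, layer.dbiases.shape, layer.dinputs.shape)
```

Storing `self.inputs` during the forward pass is what makes the backward pass possible, since both gradients need the values that originally flowed through the layer.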

## Section 2: Categorical Cross-Entropy loss derivatives