# Module 4 Backpropagation

This is how our network performs gradient descent. By treating the network as a chain of functions we can use the chain rule to find the gradients for every single weight. Using the gradients, we can update the weights and biases and make the network perform slightly better. We repeat this process until the network stops improving or after a fixed number of iterations. 

So what is our actual chain of operations that we will have our network perform? The first operation is to multiply the weights by our input and add the biases. The second is the activation function which takes in the output of the previous operation. The third is the cost function which takes in the output of the activation function. This would be the exact steps if we had a network with only one layer. If we use more layers we just repeat the first two steps for each layer.


\begin{align}
& \quad x = input \\
\mathrm{(1)} & \quad z(x) = wx + b  \\
\mathrm{(2)} & \quad a(z) = \text{we could use sigmoid, relu, linear, or any other activation} \\
\mathrm{(3)} & \quad C(a) = \frac{1}{2}\|y - a\|^2
\end{align}


We ultimately want to lower the cost or make the answers the network gives us closer to their actual value which we get from our labeled data. Therefore, we want to find the gradient of the cost function and use that to get the gradients of our weights and biases with respect to the cost. In other words we will find out what effect each weight and bias has on the cost function. This effect represents the slope of the cost function in the dimension of a particular weight or bias. That slope is what we want to "roll" down by subtracting it from the particular weight.

How can we compute the gradient for the weights in our network? It is similar to how we found the gradients for the slope and y intercept of our best fit line in the previous module except we have one extra step which is the activation function.

The gradients for the weights and biases in our weighted input function, number 1 above, is the same as with our best fit line. The gradient for the cost function is the same as well. The only difference is now we need to know the gradient of the activation function. However, once we have that, we can find the gradients for all the weights and biases.

We find the gradients for every step in our chain

\begin{align}
x & = \text{input} \\
y & = \text{input label or "correct answer"} \\
\frac{\partial z}{\partial w} & = x \\
\frac{\partial z}{\partial b} & = 1 \\
\frac{\partial a}{\partial z} & = a'(z) \\
\frac{\partial C}{\partial a} & = a - y
\end{align}

For each of our activation functions we can find the derivative or gradient in the following way. The functions listed are linear, relu, and sigmoid respectively.

\begin{align}
y(z) & = z \\
y'(z) & = 1 \\
relu(z) & = \begin{cases}
    z & \quad \text{if } z \text{ > 0} \\
    0 & \quad \text{if } z \text{ <= 0}
    \end{cases} \\
relu'(z) & = 
    \begin{cases}
    1 & \quad \text{if } z \text{ > 0} \\
    0 & \quad \text{if } z \text{ <= 0}
    \end{cases} \\
\sigma(z) & = \frac{1}{1+e^{-z}} \\
\sigma'(z) & = \sigma(z)(1 - \sigma(z))
\end{align}


We can chain the gradients together to find what we really want which is the gradients for each weight and bias in the cost function. In other words, how much does a change in the weight or bias affect the final cost and in what direction (positive or negative). Using this value, we can subtract it from the weight or bias to make it "roll" down the hill. 

So the expression $\frac{\partial C}{\partial w}$ really means what is the change in cost (top) divided by the change in weight (bottom). Or $\frac{\partial C}{\partial b}$ means what is the change in final cost over the change in a certain bias. These values tell us how much a change in the weight or bias affects the final cost and whether that relationship is positive or negative. 

If the relationship is positive that means we have a positive slope and as our weight or bias increases, so too does our cost. Alternatively, if the relationship is negative then as one goes up the other goes down. The magnitude of this value represents how big of a change will happen or how steep our slope is. In that sense it tells us the steepness and direction of the hill. 

So when we chain the gradients together for each of our steps, we can find those slopes.

\begin{align}
\frac{\partial C}{\partial w} & =  \frac{\partial C}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial w} & = x a'(z) (a - y)\\
\frac{\partial C}{\partial b} & = \frac{\partial C}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial b} & = a'(z) (a - y) \\
\end{align}


Can you see how the tops and bottoms of the fractions in the middle cancel out to give us the left hand side of each equation? All of the a'(z) terms above can be replaced with the corresponding derivative of whichever activation function we decide to use. For instance if we're using the linear activation function, the $a'(z)$ just becomes 1.

Let's see if we can train a single layer network to predict house prices. This layer will only have 1 output for price and we will use a linear activation function. 

In [702]:
from sklearn.datasets import load_boston

dataset = load_boston()

# The house features are essentially a table with 13 columns
# each column is described in the dataset
house_features = dataset.data
house_prices = dataset.target

print(dataset.keys())
print(dataset.DESCR)

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX

# 1 Layer

To start with this problem of predicting house prices we have to build a network with only 1 layer and only 1 neuron. This what it looks like

![1layer](img/1-layer.svg)

In [706]:
from random import randint
import numpy as np

class Layer:
    def __init__(self, num_inputs, num_neurons):
        # we now have 1 row per neuron and 1 column per weight
        self.weights = np.random.uniform(-1, 1, size=(num_neurons, num_inputs))

        # we also randomly create a bias for each neuron
        self.biases = np.random.uniform(-1, 1, size=num_neurons)
    
    def predict(self, inputs):
        return self.weights.dot(inputs) + self.biases

            
    
# create a layer with the same number of inputs as there are features in our data
# and with only 1 output for the price
l1 = Layer(len(dataset.feature_names), 1)

# randomly pick a house from the dataset
house_idx = randint(0, len(dataset.data)-1)

# get the features and price of the house
house = house_features[house_idx]
price = house_prices[house_idx]

# have our layer predict a price for the house
predicted_price = l1.predict(house)[0]

print('predicted price: ${:.2f}, actual price: ${:.2f}'.format(predicted_price * 1000, price * 1000))

predicted price: $296500.06, actual price: $32200.00


Obviously the layer is not very good because we have not taught it anything. This is where gradient descent comes in. Now for every example in our data we can find the gradients and use them to update our layer. So we have 4 gradients to find for each input example

\begin{align}
1. \quad \frac{\partial z}{\partial w} & = \text{What effect does a change in our weight have on the output of the weighted input} \\
2. \quad \frac{\partial z}{\partial b} & = \text{What effect does a change in our bias have on the output of the weighted input} \\
3. \quad \frac{\partial a}{\partial z} & = \text{What effect does a change in our weighted input have on the output of the activation function} \\
4. \quad \frac{\partial C}{\partial a} & = \text{What effect does a change in our activation have on the output of the cost function}
\end{align}

We multiply (1, 3, 4) for the weight gradient and (2, 3, 4) for the biases for every input example. Take the average of all those gradients, multiply by the learning rate, and subtract it from the weights or biases. We are using the linear activation function so 3 just becomes 1 and we can basically ignore it because anything multiplied by 1 is just itself.

We already saw in the last module that $\frac{\partial z}{\partial w}$ is just $x$ or our input and $\frac{\partial z}{\partial b}$ is just 1. After ignoring all the 1's we are left with only 2 gradients we need to find.

\begin{align}
1. \quad \frac{\partial z}{\partial w} & = x \\
2. \quad \frac{\partial C}{\partial a} & = a - y
\end{align}

Then to find the gradients of our weights and biases we just do



\begin{align}
\text{weights} \quad \frac{\partial C}{\partial w} = \frac{\partial C}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial w} = (a-y) * 1 * x = x (a-y) \\
\text{biases} \quad \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial b} = (a-y) * 1 * 1 = (a-y)
\end{align}

Where $a$ is the output activation of our layer (i.e. the predicted price), $x$ is just the input (i.e. the house features), and $y$ is the target value (i.e. the correct house price). Numpy also allows us to do a fancy trick where we can take two lists and subtract them which in turn just performs element-wise subtraction. Similarly, if we multiply two numpy arrays, it will just do element-wise multiplication. Therefore, we can find the gradients for all input examples without needing to iterate over them.

In [711]:
l1 = Layer(len(dataset.feature_names), 1)


# try out different learning rates and see how horrible it can get
learning_rate = .0000008

def find_cost(y, a):
    return (1/2)*(y - a)**2


def get_predictions(input_data, predictor):
    return [predictor(x)[0] for x in input_data]


# take our 10 examples along with their 10 corresponding 
# predictions and find the cost for each one, then take the average
def find_avg_cost(test_set, predictor):
    test_a = get_predictions(test_set, predictor)
    test_costs = [find_cost(y, a) for (y, a) in zip(house_prices, test_a)]
    return sum(test_costs)/len(test_costs)


print('====== BEFORE TRAINING ======')

def evaluate_predictor(x, y, predictor, N=10):
    # let's pick 10 houses to try before and after we train
    start = randint(0, len(x) - N)
    test_set = x[start:start+N]
    
    print('cost: {:.2f}'.format(find_avg_cost(test_set, predictor)))

    trained_predictions = get_predictions(test_set, predictor)
    actual_values = y[start:start+10]

    print('\npredicted\tactual')
    print('---------------------')
    print('\n'.join('{:.2f}\t\t{:.2f}'\
                    .format(a, y) for (a, y) in zip(trained_predictions, actual_values)))

    
evaluate_predictor(house_features, house_prices, l1.predict)

for i in range(200):
    a = [l1.predict(x)[0] for x in house_features]

    # every element in dC/db is just a single value 
    # representing the gradient for the single bias 
    # because we only have 1 neuron. There are 506
    # gradient values, 1 for each example
    dC_db = a - y

    # every element in dC/dw has 13 values, 1 for 
    # each weight in our layer. There are 506
    # elements, 1 for each example
    dC_dw = (house_features.transpose() * (a - y)).transpose()


    # THIS IS THE ACTUAL LEARNING PART!!!
    l1.biases = l1.biases - dC_db.mean()*learning_rate

    # here the axis 0 means we want to take the 
    # mean of every column so we're left with 13 
    # means and not just one mean of every value 
    # in the list
    l1.weights = l1.weights - dC_dw.mean(axis=0)*learning_rate


print('\n\n====== AFTER TRAINING ======')

evaluate_predictor(house_features, house_prices, l1.predict)

cost: 188291.48

predicted	actual
---------------------
677.19		18.40
670.89		15.40
620.14		10.80
560.85		11.80
649.89		14.90
670.67		12.60
673.58		14.10
645.54		13.00
553.29		13.40
662.26		15.20


cost: 258.20

predicted	actual
---------------------
-14.63		20.10
6.37		19.90
14.52		19.60
1.97		23.20
-0.65		29.80
14.23		13.80
12.19		13.30
11.38		16.70
7.33		12.00
13.28		14.60


# From 1 To N Layers

IT LEARNS!!! Our cost always gets smaller and our prices are getting closer!!! That means our network is actually getting better at predicting house prices!!!

However, this network only has 1 layer. How might we find the gradients if we have more than 1 layer. We can continue to use the chain rule just as we did before, we just need to add an extra step. This is what our chain of operations looks like when we go from 1 to 2 layers.

![title](img/layers.svg)

We know the gradients for the weighted input $\frac{\partial z}{\partial w}$ and $\frac{\partial z}{\partial b}$, activation functions $\frac{\partial a}{\partial z}$, and cost function $\frac{\partial C}{\partial a}$. Now we just have to find the gradient between the two layers. The input to layer 2 is the output of layer 1. The first operation in layer 2 is the weighted input operation and its input is the activation function output of layer 1.

We want to know how the input to the weighted input function affects it's output. Before our gradients $\frac{\partial z}{\partial w}$ and $\frac{\partial z}{\partial b}$ told us how the weights and biases affect the output, but now we want $\frac{\partial z}{\partial x}$ where x is the output from layer 1. If we replace the variable $x$ with $a$ so that we know our intput represents and activation we get the following gradient for the equation $z(a) = wa + b$

\begin{equation*}
\frac{\partial z}{\partial a} = w
\end{equation*}

In this case $w$ represents our slope and tells us how much a change to $a$ will affect the output of $z(a)$. Now we can simply add this gradient into our chain to find the gradients for the weights and biases of each layer.

Now that we have multiple layers we need to identify which weights and biases we're talking about, the ones from layer 1 or 2. Therefore, from now on when we refer to the weights, biases, or activations we will add a superscript that indicates which layer they are in. The steps to find the gradients for layer one are as follows.

\begin{equation*}
\frac{\partial C}{\partial w^{L1}} = \frac{\partial C}{\partial a^{L2}} \frac{\partial a^{L2}}{\partial z^{L2}} \frac{\partial z^{L2}}{\partial a^{L1}} \frac{\partial a^{L1}}{\partial z^{L1}} \frac{\partial z^{L1}}{\partial w^{L1}}
\end{equation*}

You can see that the fraction or gradient $\frac{\partial z^{L2}}{\partial a^{L1}}$ represents the effect that the output activation of $L1$ has on the output from the weighted input of $L2$. This term is precisely how we can allow our gradients to propagate backwards across multiple layers, hence backpropagation. However, this is really just the chain rule which we're using because we have a chain of operations where the output from one goes on to be the input of the next. The equation for the biases in $L1$ is the same except for the very last term which we replace with $\frac{\partial z^{L1}}{\partial b^{L1}}$ and we've already determined that this term is just 1 so you can ignore it if you like.

Now we just need the gradients for the weights and biases in $L2$ which will look very similar to our network with only one layer. The gradient terms we would like to find are $\frac{\partial C}{\partial w^{L2}}$ and $\frac{\partial C}{\partial b^{L2}}$ and we can do so as follows.

\begin{equation*}
\frac{\partial C}{\partial w^{L2}} = \frac{\partial C}{\partial a^{L2}} \frac{\partial a^{L2}}{\partial z^{L2}} \frac{\partial z^{L2}}{\partial w^{L2}}
\end{equation*}

Just as before, by replacing the laste term $\frac{\partial z^{L2}}{\partial w^{L2}}$ with $\frac{\partial z^{L2}}{\partial b^{L2}}$ we can obtain the gradients for the $L2$ biases. Ultimately, the value of $\frac{\partial z^{L2}}{\partial b^{L2}}$ is just 1 therefore we can essentially ignore it.

# Putting It All Together

We now have all the information we need to find the gradients of the weights and biases for every layer in our network. We have only explicitly mentioned 2 layers, but this process of chaining gradients together can be repeated for any number of layers.

It's killing me!!! what are the final gradient values substituting each respective term in our chain??? If we continue to use the linear activation function, all gradients are just 1. However, if we used another activation function we could replace the 1 with the derivative of whichever activation function we choose.

\begin{align}
\frac{\partial C}{\partial w^{L1}} & = \frac{\partial C}{\partial a^{L2}} \frac{\partial a^{L2}}{\partial z^{L2}} \frac{\partial z^{L2}}{\partial a^{L1}} \frac{\partial a^{L1}}{\partial z^{L1}} \frac{\partial z^{L1}}{\partial w^{L1}}  = (Y - a^{L2}) * 1 * w^{L2} * 1 * x  = (Y - a^{L2}) * w^{L2} * x \\
\frac{\partial C}{\partial b^{L1}} & = \frac{\partial C}{\partial a^{L2}} \frac{\partial a^{L2}}{\partial z^{L2}} \frac{\partial z^{L2}}{\partial a^{L1}} \frac{\partial a^{L1}}{\partial z^{L1}} \frac{\partial z^{L1}}{\partial b^{L1}} = (Y - a^{L2}) * 1 * w^{L2} * 1 * 1  = (Y - a^{L2}) * w^{L2} \\
\frac{\partial C}{\partial w^{L2}} & = \frac{\partial C}{\partial a^{L2}} \frac{\partial a^{L2}}{\partial z^{L2}} \frac{\partial z^{L2}}{\partial w^{L2}} = (Y - a^{L2}) * 1 * a^{L1} = (Y - a^{L2}) * a^{L1} \\
\frac{\partial C}{\partial b^{L2}} & = \frac{\partial C}{\partial a^{L2}} \frac{\partial a^{L2}}{\partial z^{L2}} \frac{\partial z^{L2}}{\partial b^{L2}} = (Y - a^{L2}) * 1 * 1 = (Y - a^{L2})  \\
\end{align}

Just for the sake of clarity I will remove all of the substitution and canceling

\begin{align}
\frac{\partial C}{\partial w^{L1}} & = (Y - a^{L2}) * w^{L2} * x \\
\frac{\partial C}{\partial b^{L1}} & = (Y - a^{L2}) * w^{L2} \\
\frac{\partial C}{\partial w^{L2}} & = (Y - a^{L2}) * a^{L1} \\
\frac{\partial C}{\partial b^{L2}} & = (Y - a^{L2})  \\
\end{align}

And finally because I told you we could do this with any number of layers, here are the generic equations where $1 <= K < N$ and $N$ is our number of layers. Remember the super script of each term just says which layer we're talking about and the fractions are just the effect that a particular input (the denominator) has on the output (the numerator) of that step in the chain. 

The left 2 terms tell us what effect the weights in layer $K$ have on the output activation of layer $K$. The middle term in parenthesis gets expanded for each subsequent layer. In english the middle term tells us what effect the activation from layer $K$ has on the output activation of layer $K+1$ up to $N$ and the last term tells us what effect the output activation from layer $N$ has on the cost function. The funky $\prod$ symbol just means multiply everything together.

I like to think about the top and bottom terms canceling when they are the same. In other words dividing something by itself just gives you 1. Therefore, we can add as many terms in the middle series as we want because their numerators and denominators cancel out.

\begin{equation*}
\frac{\partial C}{\partial w^K} = \frac{\partial z^K}{\partial w^K} * \frac{\partial a^K} {\partial z^K} * \left(\prod_{l = K + 1}^{N} \frac{\partial z^l}{\partial a^{l-1}} * \frac{\partial a^l}{\partial z^l}\right) * \frac{\partial C}{\partial a^N}
\end{equation*}


# Let's Do It!


In [737]:
# because we're using the linear activation function
da_dz = 1

def get_dC_da(a, y): return a - y

class Network:
    """
    We are creating the network class to act as a new kind of predictor. 
    Because we will now have N layers, we don't want to call predict on each one.
    Instead we can pass all our layers in to the network and call predict
    just once using the network.
    """
    def __init__(self, *layers):
        self.layers = layers
        # Every time we run an input through the network
        # we will store the activations of each layer
        # so that we can use them to find our gradients
        self.dz_dw = None
        
        # dz/db = 1 and we have 1 bias per neuron in every layer
        # therefore, for each layer we can create an array of ones
        # that has the same length as the number of biases
        self.dz_db = [np.ones(l.biases.shape) for l in layers]
        
        # the same is true for our activation because we're using 
        # the linear activation function, its gradient is 1
        self.da_dz = [np.ones(l.biases.shape) for l in layers]
        
        # dz/da = w for each layer. This just means that the effect
        # that the activation input has on the output of z is w
        # because a is multiplied by w
        self.dz_da = [l.weights for l in self.layers]

        
    def predict(self, x):
        # one activation for each neuron in each layer
        dz_dw = [x]
        prediction = x
        
        # the first iteration will pass the input x to the first layer
        # the next iterations will pass the output from the previous layer
        # into the current layer
        for l in self.layers[:-1]:
            prediction = l.predict(prediction)
            dz_dw.append(prediction)
            
        
        self.dz_dw = dz_dw
        return self.layers[-1].predict(prediction)
    
    

num_inputs = len(house_features[0])
num_neurons_l1 = 5
num_neurons_l2 = 3

l1 = Layer(num_inputs, num_neurons_l1)
l2 = Layer(num_neurons_l1, num_neurons_l2)
l3 = Layer(num_neurons_l2, 1)

net = Network(l1, l2, l3)

evaluate_predictor(house_features, house_prices, net.predict)


# δl=((wl+1)Tδl+1)⊙σ′(zl),
# (da/dz)L = ((dz/da)T da/dz) o. da/dz

def get_next_to_end(l, N, net, dc_dz):
        
    dz_da_next_layers = list(reversed(net.dz_da[l+1:N]))
    
    final_product = (dc_dz * dz_da_next_layers[0]).transpose()
    
    for dz_da in dz_da_next_layers[1:]:
        final_product = dz_da.transpose().dot(final_product)

    return final_product



learning_rate = .0000008
# we introduce a new term called epoch. This is simply the number of times 
# we go through the entire training set. Because we're also take the mean
# across all gradients for every input example, it is also the number of 
# updates we make to the weights and biases.
for epoch in range(30):
    # every element in all_dC_dws represents the gradients given 
    # one house example for all the weights and biases in the network
    all_dC_dws = []
    all_dC_dbs = []

    for (x, y) in zip(house_features, house_prices):
        prediction = net.predict(x)
        dC_da = get_dC_da(prediction, y)
        da_dw = da_dz * net.dz_dw
        da_db = da_dz * net.dz_db

        # we find the gradients for the weights in each layer
        dC_dw = []
        dC_db = []

        # for the last layer, we don't need to worry about 
        # the gradients for any layers that come after because 
        # no layers come after
        last_dC_dw = da_dw[-1] * dC_da[0]
        last_dC_db = da_db[-1] * dC_da[0]
        
        # here we're ignoring da/dz because it's 1
        dC_dz = dC_da[0] # * da/dz = 1

        N = len(net.layers)


        # for the rest of the layers we have to take the 
        # gradients of the layers that come after into account
        for (i, l) in enumerate(net.layers[:-1]):
            dC_da_L = get_next_to_end(i, N, net, dC_dz)
            dC_dw.append(np.outer(dC_da_L, da_dw[i]))
            
            # here we're ignoring da/dz and dz/db because they're 1
            dC_db.append(dC_da_L) # * da/dz * dz/db = 1

        dC_dw.append(last_dC_dw)
        dC_db.append(last_dC_db)


        all_dC_dws.append(dC_dw)
        all_dC_dbs.append(dC_db)


    # add up all our gradients
    sum_all_dC_dws = [np.zeros(shape=l.weights.shape) for l in net.layers]
    sum_all_dC_dbs = [np.zeros(shape=l.biases.shape) for l in net.layers]

    num_layers = len(net.layers)
    num_examples = len(house_features)

    
    # sum up all the gradients for all the weights and biases for every example
    for j in range(num_layers):
        for i in range(num_examples):
            sum_all_dC_dws[j] += all_dC_dws[i][j]
            sum_all_dC_dbs[j] += all_dC_dbs[i][j].flatten()

    for j in range(num_layers):
        # take the average of all the gradients just like we did before
        mean_all_dC_dws = sum_all_dC_dws[j] / num_examples
        mean_all_dC_dbs = sum_all_dC_dbs[j] / num_examples

        # THIS IS THE LEARNING 
        # here we update all the weights and biases
        net.layers[j].weights -= mean_all_dC_dws*learning_rate
        net.layers[j].biases -= mean_all_dC_dbs*learning_rate

        
evaluate_predictor(house_features, house_prices, net.predict)

cost: 1998.15

predicted	actual
---------------------
-43.23		18.70
-42.97		18.50
-47.34		18.30
-45.13		21.20
-42.88		19.20
-49.87		20.40
-46.85		19.30
1.59		22.00
2.98		20.30
3.03		20.50
cost: 51.36

predicted	actual
---------------------
16.33		26.40
18.62		33.10
30.98		36.10
33.04		28.40
36.69		33.40
36.01		28.20
41.57		22.80
38.30		20.30
26.67		16.10
36.64		22.10
