## Backpropagation example

With this numerical example we illustrate how forward- and back-propagation work to learn the weights (parameters) of a **neural network**.

We use a very simple network: it is far from a real-world application, but it serves well to show with numbers what happens inside the network.

In particular, the following simplifications apply:
- only one training example (one record, with two features $x_1$ and $x_2$)
- two numerical outputs $y = [y_1,y_2]$ (e.g. two continuous measurements on the same training example)
- one hidden (intermediate) layer with two nodes
- no bias terms

<img src="simple_network.jpg" alt="network" style="width: 800px;"/>

We use the notation:
- $z_{ij}$ to indicate the linear combination of inputs in layer $i$ and node $j$
- $a_{ij}$ to indicate the activated linear combination $z$ in layer $i$ and node $j$
- $\hat{y}$ to indicate predicted values for the target variable $y$
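
With this notation, the whole network can be summarised in four pairs of equations. This is only a compact restatement, assuming the weights are numbered $w_1$ to $w_8$ as in the code cells of this example and that $\sigma$ denotes the sigmoid activation function:

$$
\begin{aligned}
z_{11} &= w_1 x_1 + w_2 x_2, & a_{11} &= \sigma(z_{11}) \\
z_{12} &= w_3 x_1 + w_4 x_2, & a_{12} &= \sigma(z_{12}) \\
z_{21} &= w_5 a_{11} + w_6 a_{12}, & \hat{y}_1 &= \sigma(z_{21}) \\
z_{22} &= w_7 a_{11} + w_8 a_{12}, & \hat{y}_2 &= \sigma(z_{22})
\end{aligned}
$$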

### Initialization

Let's define the input and target variables and **initialize the weights** (give some initial random values to the parameters of the neural network model).
As for hyperparameters, we set the learning rate $\alpha$.

In [2]:
## input vars
x1 = 0.15
x2 = 0.08

## training outputs (two continuous measurements -e.g. phenotypes- on the same example)
y1 = 0.05
y2 = 0.95

## initialization of model parameters
w1 = 0.15
w2 = 0.10
w3 = 0.12
w4 = 0.09
w5 = 0.18
w6 = 0.20
w7 = 0.16
w8 = 0.24

## setting the hyperparameters
alpha = 0.75 ## learning rate

### A little set-up

We import some needed Python libraries and define functions that we'll use in this example:

- **logistic (sigmoid) function**: used to activate the linear combinations ($z$'s) calculated in the intermediate and output layers
- **mean squared error (MSE)**: cost function used to evaluate the performance of the network

In [3]:
import math
import numpy as np

## vectors
y = np.array([y1,y2]) ## vector of training target variables

## logistic function
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

## cost function (mean squared error); the 1/2 is the normalization constant (2 outputs)
def MSE(y,y_hat):
    return sum((1/2)*(y-y_hat)**2)
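
The `sigmoid` function above relies on `math.exp`, so it only accepts scalars. As a possible alternative, here is a minimal sketch of array-friendly versions of the two helpers. The names `sigmoid_vec` and `mse_half` are hypothetical and not part of the original notebook.

```python
import numpy as np

def sigmoid_vec(x):
    """Element-wise logistic function; accepts scalars or NumPy arrays."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

def mse_half(y, y_hat):
    """Sum of squared errors, each scaled by 1/2 (same convention as MSE above)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sum(0.5 * (y - y_hat) ** 2))

print(sigmoid_vec(0.0))                   # 0.5: the sigmoid is centred at zero
print(sigmoid_vec([-1.0, 1.0]))           # the two values sum to 1 by symmetry
print(mse_half([1.0, 0.0], [1.0, 0.0]))   # 0.0 for a perfect prediction
```

The scalar version is enough for this example, where each node is computed by hand. The array version becomes convenient once whole layers are computed at once.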

### Forward propagation

We start moving forward along the network: from inputs to outputs.
First, we calculate the linear combinations of inputs $z$'s in the hidden layer: 

In [5]:
z11 = w1*x1 + w2*x2
z12 = w3*x1 + w4*x2

print('z_11 is: ', z11)
print('z_12 is: ', z12)

z_11 is:  0.0305
z_12 is:  0.0252


From the linear combinations $z$'s, we can now calculate the "**activated**" values $a$'s using the **sigmoid activation function**:

In [6]:
a11 = sigmoid(z11)
a12 = sigmoid(z12)

print('a_11 is: ', a11)
print('a_12 is: ', a12)

a_11 is:  0.5076244089586275
a_12 is:  0.5062996666251706


We now move to the next layer, the **output layer**, and calculate the linear combinations of its inputs (the activations from the previous, hidden layer):

In [7]:
z21 = w5*a11+w6*a12 ## linear combination from the first node in the output layer
z22 = w7*a11+w8*a12 ## linear combination from the second node in the output layer

print('z_21 is: ', z21)
print('z_22 is: ', z22)

z_21 is:  0.19263232693758708
z_22 is:  0.20273182542342133


We can now conclude the first forward propagation pass and calculate the predicted output values, using again the *sigmoid activation function*:

In [8]:
y_hat1 = sigmoid(z21)
y_hat2 = sigmoid(z22)

y_hat = np.array([y_hat1,y_hat2])

print("The two predicted outputs are")
'; '.join([str(round(x,5)) for x in y_hat])

The two predicted outputs are


'0.54801; 0.55051'
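
The node-by-node calculations above can also be written as two matrix-vector products. The sketch below is one possible vectorised form, assuming the weights are arranged so that each row of `W1` and `W2` (hypothetical names) holds the weights feeding one node:

```python
import numpy as np

def sigmoid(x):
    # element-wise logistic function
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.15, 0.08])          # inputs x1, x2

# Row j holds the weights feeding node j of the layer
W1 = np.array([[0.15, 0.10],        # w1, w2 -> z11
               [0.12, 0.09]])       # w3, w4 -> z12
W2 = np.array([[0.18, 0.20],        # w5, w6 -> z21
               [0.16, 0.24]])       # w7, w8 -> z22

z1 = W1 @ x          # hidden-layer linear combinations
a1 = sigmoid(z1)     # hidden-layer activations
z2 = W2 @ a1         # output-layer linear combinations
y_hat = sigmoid(z2)  # predictions

print(np.round(y_hat, 5))   # [0.54801 0.55051]
```

The result matches the predictions obtained step by step, which is a useful sanity check before moving on to the cost.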

### The cost

With the predicted and actual target values we can calculate the cost, i.e. by how much we are currently off.
Since in this simplified example we assumed that the target values are continuous variables, we can use **mean squared error** (*MSE*), a typical cost function for continuous variables:

In [9]:
print("The two target outputs are")
'; '.join([str(round(x,5)) for x in y])

The two target outputs are


'0.05; 0.95'

In [10]:
cost_1 = MSE(y,y_hat)

print('The cost after the first forward propagation pass is: ', cost_1)

The cost after the first forward propagation pass is:  0.20380293722737602
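
To see where the cost comes from, it can help to look at the contribution of each output separately. This is a small sketch that uses the rounded predictions printed above, so the total agrees with `cost_1` only up to rounding:

```python
import numpy as np

y = np.array([0.05, 0.95])
y_hat = np.array([0.54801, 0.55051])   # rounded predictions from the forward pass

per_output = 0.5 * (y - y_hat) ** 2    # one squared-error term per output
total = per_output.sum()

print(per_output)   # the first output contributes more than the second
print(total)        # approximately 0.2038
```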


### Backpropagation

Finally, we can start moving backwards across the network: from the **cost** to the **coefficients**.

Our goal with backpropagation is to **update the weights** in the network so that the network output moves closer to the target output, thereby minimizing the error for each output neuron and for the network as a whole.

Let's start from the coefficient used to weigh the first input ($a_{11}$) in the linear combination $z_{21}$ calculated in the first node of the second (output) layer: $w_5$.

We see from the figure below that we need to move backwards from the cost, to the prediction $\hat{y}_1$, to the linear combination $z_{21}$. This is because $z_{21}$ is where the weight $w_5$ is actually used:

$$
z_{21} = w_5 \cdot a_{11} + w_6 \cdot a_{12}
$$

Therefore: $MSE(y,\hat{y})$ $\rightarrow$ $\hat{y}_1$ $\rightarrow$ $z_{21}$ $\rightarrow$ $w_5$

<img src="first_back_prop.jpg" alt="first backpropagation step" style="width: 800px;"/>

This very first backpropagation step involves calculating the **derivative of the cost function** $MSE()$ with respect to the target coefficient $w_5$. Using the [chain rule](https://mathsathome.com/chain-rule-differentiation/), this can be expressed as:

$$
\frac{\partial MSE}{\partial w_5} = \frac{\partial MSE}{\partial \hat{y}_1} \cdot \frac{\partial \hat{y}_1}{\partial z_{21}} \cdot \frac{\partial z_{21}}{\partial w_5}
$$


#### The first link in the chain

Let's start with the first element in the chain: $\frac{\partial MSE}{\partial \hat{y}_1}$.
The mean squared error (cost) is the sum of the two squared prediction errors (for $\hat{y}_1$ and $\hat{y}_2$), each scaled by $\frac{1}{2}$:

$$
\frac{1}{2} (y_1-\hat{y}_1)^2 + \frac{1}{2}(y_2-\hat{y}_2)^2
$$

Since we are differentiating with respect to $\hat{y}_1$ (our variable of interest), the second term in the sum is a **constant**, and the derivative of a constant is 0. We only need to take the derivative of the first term, which is of the form $f(x) = (a-x)^2$ $\rightarrow$ $f'(x) = 2(a-x) \cdot (-1)$:

$$
\frac{\partial MSE}{\partial \hat{y}_1} = 2 \cdot \frac{1}{2} (y_1-\hat{y}_1) \cdot (-1) + 0 = -(y_1-\hat{y}_1)
$$

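The factor $-1$ appears because the variable in this partial derivative, $\hat{y}_1$, enters the squared term with coefficient $-1$:

- $-1$: derivative of the inner function $(y_1-\hat{y}_1)$ with respect to $\hat{y}_1$ (chain rule, function of a function)
- $0$: derivative of the constant second term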

In [11]:
dMSEdy_hat1 = 2*(1/2)*(y1-y_hat1)*-1 + 0
print('first derivative in the chain: ', dMSEdy_hat1)

first derivative in the chain:  0.49800971457469184
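
The same formula applies to both outputs at once. As an illustrative sketch (again using the rounded predictions, and a hypothetical variable name `dcost_dy_hat`), the sign of each entry shows in which direction the corresponding prediction has to move to reduce the cost:

```python
import numpy as np

y = np.array([0.05, 0.95])
y_hat = np.array([0.54801, 0.55051])   # rounded predictions from the forward pass

# derivative of the cost with respect to each prediction: -(y - y_hat)
dcost_dy_hat = -(y - y_hat)
print(dcost_dy_hat)   # approximately [0.498, -0.399]
```

The first entry is positive because $\hat{y}_1$ is too high (the cost grows if it increases further). The second is negative because $\hat{y}_2$ is too low.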


#### The second link in the chain

Now we need to calculate the derivative of the prediction with respect to the corresponding linear combination, $\frac{\partial \hat{y}_1}{ \partial z_{21}}$.
Recall that $\hat{y}_1 = \sigma(z_{21})$.

The **derivative of the logistic (sigmoid) function** is: $\sigma'(x) = \sigma(x) \cdot (1-\sigma(x))$.
In our case, this would be:

$$
\frac{\partial \hat{y}_1}{ \partial z_{21}} = \hat{y}_1 \cdot (1 - \hat{y}_1)
$$

In [13]:
dy_hat1dz21 = y_hat1*(1-y_hat1)
print('second derivative in the chain: ', dy_hat1dz21)

second derivative in the chain:  0.24769506730645663
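
For readers who want to see where the sigmoid derivative comes from, here is a short derivation. It is a standard result, starting from the definition $\sigma(x) = \frac{1}{1+e^{-x}}$ and using the fact that $1-\sigma(x) = \frac{e^{-x}}{1+e^{-x}}$:

$$
\sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x) \cdot (1-\sigma(x))
$$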


#### The final link in the chain

Now it's the turn of $\frac{\partial z_{21}}{\partial w_5}$. This is the derivative of the linear combination $z_{21} = w_5 \cdot a_{11} + w_6 \cdot a_{12}$, which is pretty simple: it is the coefficient of our variable of interest ($w_5$), i.e. $a_{11}$ (the remaining term of the expression is a constant with respect to $w_5$, so its derivative is zero).


$$
\frac{\partial z_{21}}{\partial w_5} = a_{11} + 0
$$

In [15]:
dz21dw5 = a11
print('third derivative in the chain: ', dz21dw5)

third derivative in the chain:  0.5076244089586275


#### Applying the chain rule

We now have all the elements to calculate the product in the chain:

$$
\frac{\partial MSE}{\partial w_5} = \frac{\partial MSE}{\partial \hat{y}_1} \cdot \frac{\partial \hat{y}_1}{\partial z_{21}} \cdot \frac{\partial z_{21}}{\partial w_5}
$$

In [17]:
dMSEdw5 = dMSEdy_hat1*dy_hat1dz21*dz21dw5
print('The derivative of the cost function with respect to w5 is: ', dMSEdw5)

The derivative of the cost function with respect to w5 is:  0.06261778041978408
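
A common way to double-check a gradient obtained with the chain rule is to compare it with a finite-difference approximation. The sketch below is one such check, not part of the original notebook: the helper `cost_for_w5` is hypothetical, and it reruns the forward pass with $w_5$ nudged slightly up and down while every other value stays as in the example.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def cost_for_w5(w5):
    """Run the forward pass with a given w5 (all other values as in the example)."""
    x1, x2 = 0.15, 0.08
    y1, y2 = 0.05, 0.95
    w1, w2, w3, w4 = 0.15, 0.10, 0.12, 0.09
    w6, w7, w8 = 0.20, 0.16, 0.24
    a11 = sigmoid(w1 * x1 + w2 * x2)
    a12 = sigmoid(w3 * x1 + w4 * x2)
    y_hat1 = sigmoid(w5 * a11 + w6 * a12)
    y_hat2 = sigmoid(w7 * a11 + w8 * a12)
    return 0.5 * (y1 - y_hat1) ** 2 + 0.5 * (y2 - y_hat2) ** 2

eps = 1e-6  # small perturbation of w5
# central difference: (cost(w5 + eps) - cost(w5 - eps)) / (2 * eps)
numeric_grad = (cost_for_w5(0.18 + eps) - cost_for_w5(0.18 - eps)) / (2 * eps)
print(numeric_grad)   # approximately 0.0626, in line with the chain-rule result
```

If the two numbers disagreed beyond a small tolerance, that would point to a mistake in one of the three links of the chain.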


### The updating step

With the derivative of the cost function with respect to the coefficient $w_5$, we can now proceed to the updating of the coefficient (**the "learning"**):

$$
w_5^{+} = w_5 - \alpha \cdot \frac{\partial MSE}{\partial w_5}
$$

where $\alpha$ is the l