In [1]:
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import numpy as np
np.set_printoptions(precision=3)
np.set_printoptions(suppress=True)

# Neural Network implementation with Matrices #1: Predefined Network

In the previous example, we implemented a neural network with 2 input neurons, 3 hidden neurons, and 2 output neurons. To compute the loss/cost of this network, we used the quadratic cost function. The naming style for the weights, biases and activations, and the programming style we were using, made it easy to understand what is going on in this relatively small network. Programming neural networks like this, however, is computationally very inefficient, and in a larger network the number of variables would inevitably explode, making the code long, slow and unreadable. The solution to this problem is to implement neural networks by means of __matrices__ and __matrix operations__.

The architecture of the network we will implement by using matrices will remain the same as in the previous example:

<img src="img/neural_networks_26.png" alt="drawing" width="950"/>

We will also use the same dataset we used before:

In [2]:
data = np.array([[ 1.2, 0.7],
                 [-0.3,-0.5],
                 [ 3.0, 0.1],
                 [-0.1,-1.0],
                 [-0.0, 1.1],
                 [ 2.1,-1.3],
                 [ 3.1,-1.8],
                 [ 1.1,-0.1],
                 [ 1.5,-2.2],
                 [ 4.0,-1.0]])

In [3]:
labels = np.array([  1,
                    -1,
                     1,
                    -1,
                    -1,
                     1,
                    -1,
                     1,
                    -1,
                    -1])

In [4]:
data[0]

array([1.2, 0.7])

## Naming the elements of a neural network

In this example, we will use the standard visualisation of the neural network architecture, abstracting away the inner workings of the circuits by displaying only the neurons and their connections. Our network consists of 3 layers. The input layer has 2 neurons, which used to be represented with $\mathbf{X}$ and $\mathbf{Y}$. The hidden layer consists of 3 neurons, and the output layer of 2 neurons. The cost function is not graphically represented here.

<img src="img/neural_networks_27.png" alt="drawing" width="350"/>

### Neurons

To name the neurons, our convention will be to mark with the superscript the layer to which the neuron belongs. For example $l_1$ marks neurons belonging to the first layer. With the subscript we mark the number of a neuron within the layer. With this in mind, the third neuron of the second layer can be symbolised as $neuron_3^{(l_2)}$ or shortly $n_3^{(l_2)}$.

<img src="img/neural_networks_28.png" alt="drawing" width="400"/>

### Weights

To name the weights, the convention is to use the superscript to mark the layer to which the weight belongs. For example, $w^{(l_2)}$ represents __all__ the weights of the second layer. To be more specific, we use a subscript that contains two numbers. The first number of the subscript is the number of the neuron in the current layer. For example, $w_1^{(l_2)}$ represents all the weights of the first neuron in the second layer. The second number of the subscript defines which activation of the previous layer the weight parametrises (multiplies). For example, $w_{21}^{(l_2)}$, or perhaps better $w_{2,1}^{(l_2)}$, indicates the weight belonging to the second neuron of the second layer that parametrises the first activation (output of the first neuron) of the previous layer $(l_1)$.

<img src="img/neural_networks_29.png" alt="drawing" width="400"/>

#### Weights of the layer $l_1$?

The first layer $l_1$ is an input layer, containing our data points (we used to mark them as $\mathbf{X}$ and $\mathbf{Y}$). Since our data points are constants in the computation and the neural network uses them to learn the weights and biases (variables), their values must not change. For this reason, this layer is not parametrised by any weights.

#### Weights of the layer $l_2$

The weights of layer 2 can be represented with a matrix whose number of rows matches the number of neurons in layer 2, and whose number of columns matches the number of neurons (more precisely, their activations) in the previous layer. Since layer 2 has 3 neurons and layer 1 has 2 neurons, this matrix will be of the size $3x2$. The first row of the matrix represents the first neuron and its weights, the second row represents the second neuron, and the third row represents the third neuron and its weights. The first column of the matrix represents the weights that parametrise the activation of the first neuron of the previous layer. The second column of the matrix represents the weights that parametrise the activation of the second neuron of the previous layer.

$$
w^{(l_2)}
=
\underset{\mathbf{3x2}}{
\begin{bmatrix}
w_{11}^{(l_2)} & w_{12}^{(l_2)}\\
w_{21}^{(l_2)} & w_{22}^{(l_2)}\\
w_{31}^{(l_2)} & w_{32}^{(l_2)}\\
\end{bmatrix}\\
}
$$

By using _numpy_ we can easily encode the matrix $w^{(l_2)}$ and store it in a variable `w_2`. By using the function `randn`, we fill it with normally distributed random values.

In [5]:
w_2 = np.random.randn(3,2)
w_2

array([[-0.124,  0.871],
       [ 0.692, -0.036],
       [ 0.455,  1.239]])
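Because `randn` draws new values on every run, the numbers printed in this notebook will differ when the cells are executed again. The following is a minimal sketch of a reproducible initialisation. It assumes NumPy's `default_rng` generator and an arbitrary seed of `0`, neither of which is used in the original notebook, and the names `w_2_seeded` and `w_2_again` are hypothetical:

```python
import numpy as np

# Assumption: seed 0 is arbitrary; any fixed integer gives repeatable values.
rng = np.random.default_rng(0)

# Same shape convention as above: rows = neurons in layer 2,
# columns = activations coming from layer 1.
w_2_seeded = rng.standard_normal((3, 2))

# Re-creating the generator with the same seed reproduces the same matrix.
w_2_again = np.random.default_rng(0).standard_normal((3, 2))

print(w_2_seeded.shape)                       # (3, 2)
print(np.array_equal(w_2_seeded, w_2_again))  # True
```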

#### Weights of the layer $l_3$

Layer 3 has 2 neurons and its previous layer ($l_2$) has 3 neurons, thus 3 activations. For this reason, the weight matrix $w^{(l_3)}$ will be of the size $2x3$.

$$
w^{(l_3)}
=
\underset{\mathbf{2x3}}{
\begin{bmatrix}
w_{11}^{(l_3)} & w_{12}^{(l_3)} & w_{13}^{(l_3)}\\
w_{21}^{(l_3)} & w_{22}^{(l_3)} & w_{23}^{(l_3)}\\
\end{bmatrix}\\
}
$$

Again, we can easily store this in a variable `w_3`:

In [6]:
w_3 = np.random.randn(2,3)
w_3

array([[-0.006,  0.295,  0.207],
       [ 0.126,  0.399, -0.637]])

### Biases

<img src="img/neural_networks_30.png" alt="drawing" width="400"/>

#### Biases of the layer $l_1$?

For the same reason that layer 1 is not parametrised by weights, it does not involve any biases either.

#### Biases of the layer $l_2$

The biases of layer 2 can be represented with a single-column matrix (also known as a column vector) whose number of rows matches the number of neurons in layer 2. Since layer 2 has 3 neurons, this matrix will be of the size $3x1$. The first row of the matrix represents the bias of the first neuron, the second represents the bias of the second neuron, and the third the bias of the third neuron.

$$
b^{(l_2)}
=
\underset{\mathbf{3x1}}{
\begin{bmatrix}
b_{1}^{(l_2)}\\
b_{2}^{(l_2)}\\
b_{3}^{(l_2)}\\
\end{bmatrix}\\
}
$$

In [7]:
b_2 = np.random.randn(3,1)
b_2

array([[-0.923],
       [ 0.02 ],
       [-2.918]])

#### Biases of the layer $l_3$

Since the third layer has two neurons, it will also have 2 biases:

$$
b^{(l_3)}
=
\underset{\mathbf{2x1}}{
\begin{bmatrix}
b_{1}^{(l_3)}\\
b_{2}^{(l_3)}\\
\end{bmatrix}\\
}
$$

In [8]:
b_3 = np.random.randn(2,1)
b_3

array([[-0.07 ],
       [ 1.912]])

### Activations

<img src="img/neural_networks_31.png" alt="drawing" width="400"/>

#### Activations of the layer $l_1$

The first layer (input layer) contains the data points. Since the data points do not change (but are in fact used to train the network by adjusting the weights and biases), this layer can be considered as an activation layer, __activating__ the neurons of the next layer. Since our data is two-dimensional (each data point holds 2 numbers), the layer will have 2 neurons, each containing a single number. Here, $a_1^{(l_1)}$ corresponds to $\mathbf{X}$ of our previous example, and $a_2^{(l_1)}$ corresponds to $\mathbf{Y}$.

$$
a^{(l_1)}
=
\underset{\mathbf{2x1}}{
\begin{bmatrix}
a_{1}^{(l_1)}\\
a_{2}^{(l_1)}\\
\end{bmatrix}\\
}
$$

In a full neural network implementation (which we will do in the next class), the cost/loss is calculated by averaging the cost over all the data points, and the partial derivatives are averaged in the same way. Here, for the sake of simplicity, we will compute only a single training step of the network using a single data point. We will use the first training example `[1.2, 0.7]`, associated with the label `1`, and store it as a column vector in the variable `a_1_single`.

In [9]:
a_1_single = data[0].reshape(1,2).T
a_1_single

array([[1.2],
       [0.7]])
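The expression `reshape(1,2).T` is only one way to obtain a column vector. The sketch below compares it with two alternatives that do not hard-code the number of features; the names `col_a`, `col_b` and `col_c` are hypothetical and are not used elsewhere in this notebook:

```python
import numpy as np

x = np.array([1.2, 0.7])     # first training example, shape (2,)

col_a = x.reshape(1, 2).T    # the form used above
col_b = x.reshape(-1, 1)     # -1 lets NumPy infer the number of rows
col_c = x[:, np.newaxis]     # adds a trailing axis of length 1

print(col_a.shape, col_b.shape, col_c.shape)   # (2, 1) (2, 1) (2, 1)
```

All three produce the same $2x1$ matrix, so the choice is a matter of taste.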

#### Activations of the layer $l_2$

To compute the activations of the layer 2 we used to multiply each component of the data point with a corresponding weight, add the bias, and then compute the sigmoid activation: <br>`
p1 = A1*X + B1*Y + C1 # intermediate value for the 1st neuron
p2 = A2*X + B2*Y + C2 # intermediate value for the 2nd neuron
p3 = A3*X + B3*Y + C3 # intermediate value for the 3rd neuron
N1 = sigmoid(p1) # 1st neuron activation
N2 = sigmoid(p2) # 2nd neuron activation
N3 = sigmoid(p3) # 3rd neuron activation`

We can achieve the same thing by using matrix multiplication! To compute the values that correspond to `p1`, `p2` and `p3`, called the __weighted sums__ (this notebook also refers to them as __weighted averages__), we simply compute the matrix product of the weight matrix of layer 2 with the activations of layer 1, and add the bias matrix of layer 2 to this product. To store these intermediate values, we will use the variable $z$, whose superscript denotes the layer and whose subscript denotes the neuron to which the value belongs. Thus, $z^{(l_2)}$ holds the weighted sums of layer 2.

$$
z^{(l_2)}=w^{(l_2)}\times a^{(l_1)}+b^{(l_2)}
$$

\begin{align*}  
z^{(l_2)}
&=
\underset{\mathbf{3x2}}{
\begin{bmatrix}
w_{11}^{(l_2)} & w_{12}^{(l_2)}\\
w_{21}^{(l_2)} & w_{22}^{(l_2)}\\
w_{31}^{(l_2)} & w_{32}^{(l_2)}\\
\end{bmatrix}\\
}
\times
\underset{\mathbf{2x1}}{
\begin{bmatrix}
a_{1}^{(l_1)}\\
a_{2}^{(l_1)}\\
\end{bmatrix}\\
}
+
\underset{\mathbf{3x1}}{
\begin{bmatrix}
b_{1}^{(l_2)}\\
b_{2}^{(l_2)}\\
b_{3}^{(l_2)}\\
\end{bmatrix}\\
}\\\\
&=
\underset{\mathbf{3x1}}{
\begin{bmatrix}
(w_{11}^{(l_2)} * a_{1}^{(l_1)}) + (w_{12}^{(l_2)}*a_{2}^{(l_1)})+b_{1}^{(l_2)}\\
(w_{21}^{(l_2)} * a_{1}^{(l_1)}) + (w_{22}^{(l_2)}*a_{2}^{(l_1)})+b_{2}^{(l_2)}\\
(w_{31}^{(l_2)} * a_{1}^{(l_1)}) + (w_{32}^{(l_2)}*a_{2}^{(l_1)})+b_{3}^{(l_2)}\\
\end{bmatrix}\\
}\\\\
&=
\underset{\mathbf{3x1}}{
\begin{bmatrix}
z_1^{(l_2)}\\
z_2^{(l_2)}\\
z_3^{(l_2)}\\
\end{bmatrix}}
\end{align*}  

This can be simply computed by using the numpy function `dot`. 

In [10]:
z_2 = w_2.dot(a_1_single) + b_2
z_2

array([[-0.462],
       [ 0.825],
       [-1.504]])
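To make the equivalence with the earlier scalar code concrete, the following sketch computes $z^{(l_2)}$ both ways and compares the results. It assumes the weight and bias values copied from the rounded outputs printed above, so the numbers agree with `z_2` only to about three decimals; the names `z_2_matrix` and `z_2_manual` are hypothetical:

```python
import numpy as np

# Assumption: values copied from the rounded outputs printed above.
w_2 = np.array([[-0.124, 0.871], [0.692, -0.036], [0.455, 1.239]])
b_2 = np.array([[-0.923], [0.02], [-2.918]])
a_1 = np.array([[1.2], [0.7]])

# Matrix form: one product and one addition for the whole layer.
z_2_matrix = w_2.dot(a_1) + b_2

# Scalar form: one expression per neuron, as in p1, p2 and p3.
z_2_manual = np.array([
    [w_2[i, 0] * a_1[0, 0] + w_2[i, 1] * a_1[1, 0] + b_2[i, 0]]
    for i in range(3)
])

print(np.allclose(z_2_matrix, z_2_manual))  # True
```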

Finally, to compute the activations, we just need to apply sigmoid activation to the matrix $z^{(l_2)}$ `(z_2)`

$$
a^{(l_2)}
\equiv
\begin{bmatrix}
a_{1}^{(l_2)}\\
a_{2}^{(l_2)}\\
a_{3}^{(l_2)}\\
\end{bmatrix}
=
\sigma(z^{(l_2)})
\equiv
\sigma
\left(\;
\begin{bmatrix}
z_{1}^{(l_2)}\\
z_{2}^{(l_2)}\\
z_{3}^{(l_2)}\\
\end{bmatrix}\;
\right)
\equiv
\begin{bmatrix}
\sigma(z_1^{(l_2)})\\
\sigma(z_2^{(l_2)})\\
\sigma(z_3^{(l_2)})\\
\end{bmatrix}
$$

Here we define the sigmoid function:

In [11]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

Here we compute the activations of the layer $l_2$,  $a^{(l_2)}$ and store it in the variable `a_2`:

In [12]:
a_2 = sigmoid(z_2)
a_2

array([[0.386],
       [0.695],
       [0.182]])

#### Activations of the layer $l_3$

To compute the activations of layer 3, we apply the same procedure as for layer 2. To compute the weighted sum matrix of layer 3, we compute the matrix product of the weight matrix of layer 3 with the activation matrix of the previous layer (layer 2), and add the bias matrix of layer 3:

$$
z^{(l_3)}=w^{(l_3)}\times a^{(l_2)}+b^{(l_3)}
$$

\begin{align*}  
z^{(l_3)}
&=
\underset{\mathbf{2x3}}{
\begin{bmatrix}
w_{11}^{(l_3)} & w_{12}^{(l_3)} & w_{13}^{(l_3)}\\
w_{21}^{(l_3)} & w_{22}^{(l_3)} & w_{23}^{(l_3)}\\
\end{bmatrix}\\
}
\times
\underset{\mathbf{3x1}}{
\begin{bmatrix}
a_{1}^{(l_2)}\\
a_{2}^{(l_2)}\\
a_{3}^{(l_2)}\\
\end{bmatrix}\\
}
+
\underset{\mathbf{2x1}}{
\begin{bmatrix}
b_{1}^{(l_3)}\\
b_{2}^{(l_3)}\\
\end{bmatrix}\\
}\\\\
&=
\underset{\mathbf{2x1}}{
\begin{bmatrix}
(w_{11}^{(l_3)} * a_{1}^{(l_2)}) + (w_{12}^{(l_3)}*a_{2}^{(l_2)})+ (w_{13}^{(l_3)}*a_{3}^{(l_2)})+b_{1}^{(l_3)}\\
(w_{21}^{(l_3)} * a_{1}^{(l_2)}) + (w_{22}^{(l_3)}*a_{2}^{(l_2)})+ (w_{23}^{(l_3)}*a_{3}^{(l_2)})+b_{2}^{(l_3)}\\
\end{bmatrix}\\
}\\
&=
\underset{\mathbf{2x1}}{
\begin{bmatrix}
z_1^{(l_3)}\\
z_2^{(l_3)}\\
\end{bmatrix}}
\end{align*} 

In [13]:
z_3 = w_3.dot(a_2)+b_3
z_3

array([[0.17 ],
       [2.122]])

Again, to compute the final activation of the layer 3 (output layer), we apply the sigmoid function to the weighted average:

$$
a^{(l_3)}
\equiv
\begin{bmatrix}
a_{1}^{(l_3)}\\
a_{2}^{(l_3)}\\
\end{bmatrix}
=
\sigma(z^{(l_3)})
\equiv
\begin{bmatrix}
\sigma(z_1^{(l_3)})\\
\sigma(z_2^{(l_3)})\\
\end{bmatrix}
$$

In [14]:
a_3 = sigmoid(z_3)
a_3

array([[0.542],
       [0.893]])
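The two layers follow the same pattern, so the whole forward pass can be wrapped in one function. The sketch below is one possible way to do this; the helper name `forward` is hypothetical, and the parameter values are copied from the rounded outputs printed above, so the result matches `a_3` only approximately:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(a_1, w_2, b_2, w_3, b_3):
    """Hypothetical helper: return the activations of layers 2 and 3."""
    a_2 = sigmoid(w_2.dot(a_1) + b_2)   # hidden layer, shape (3, 1)
    a_3 = sigmoid(w_3.dot(a_2) + b_3)   # output layer, shape (2, 1)
    return a_2, a_3

# Assumption: values copied from the rounded outputs printed above.
w_2 = np.array([[-0.124, 0.871], [0.692, -0.036], [0.455, 1.239]])
b_2 = np.array([[-0.923], [0.02], [-2.918]])
w_3 = np.array([[-0.006, 0.295, 0.207], [0.126, 0.399, -0.637]])
b_3 = np.array([[-0.07], [1.912]])
a_1 = np.array([[1.2], [0.7]])

a_2, a_3 = forward(a_1, w_2, b_2, w_3, b_3)
print(a_2.shape, a_3.shape)   # (3, 1) (2, 1)
```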

## Computing the backward pass:  Backpropagation

Now that we have the weighted averages and activations for each layer, the computation of the forward pass is complete. The next step is to compute the network cost for our training example `[1.2, 0.7]`, and then the backward pass by computing the partial derivatives of this cost with respect to all the weights and biases.

### Cost function

The data point we used to compute the forward pass contains the values `[1.2, 0.7]`, and it is associated with the label `1`.

As we have seen in the previous case, we need to interpret the data labels `-1` and `1` by means of 2 neurons. The output of the first neuron encodes the probability that the label is `-1`, and the second the probability that the label is `1`. To do this, we can reuse the function `convert_label` from the last lecture:

In [15]:
def convert_label(label):
    if (label == -1):
        return (1,0)
    if (label == 1):
        return (0,1)

In [16]:
convert_label(-1)

(1, 0)

In [17]:
convert_label(1)

(0, 1)

Here, we convert the label `1` associated with the data point `[1.2, 0.7]` into the variable `y`:

In [18]:
y = np.array(convert_label(labels[0])).reshape(2,1)
y

array([[0],
       [1]])
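`convert_label` handles one label at a time. When all data points are used, the same encoding can be produced for the whole label array at once. The following is a sketch of one way to do that, assuming the same convention as `convert_label` (`-1` becomes `(1, 0)` and `1` becomes `(0, 1)`); the names `hot_index` and `Y` are hypothetical:

```python
import numpy as np

labels = np.array([1, -1, 1, -1, -1, 1, -1, 1, -1, -1])

# Position of the 1 in each encoded label: 0 for label -1, 1 for label 1.
hot_index = (labels == 1).astype(int)

# Row i of the identity matrix is the encoding for position i.
Y = np.eye(2, dtype=int)[hot_index]   # shape (10, 2), one row per data point

print(Y[0])   # [0 1]  (label 1)
print(Y[1])   # [1 0]  (label -1)
```

A single row can be turned into the column vector used in this notebook with `Y[0].reshape(2, 1)`.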

### Computing the derivatives of the cost function

Again, we will be using the _quadratic cost function_ to compute the cost of our example. Since we have two output neurons, the formula for computing the cost of a single training example will be:

$$
C_x = \frac{1}{2}
\sum_{k=1}^2 (y_k-a_k^{(l_3)})^2
$$

This means that we need to subtract the activation of each neuron in the output layer from the true label associated with this neuron, square the difference, and sum the squared terms. The reason we multiply the sum by $\frac{1}{2}$ is that it simplifies the computation of the derivative. The same equation can be represented in simpler terms by using vectors:

$$
C_x =
\frac{1}{2} \|y-a^{(l_3)} \|^2
$$

We will store the result of this computation in the variable `C_x`:

In [19]:
C_x = 0.5*sum((a_3 - y)**2)[0]
C_x

0.15280874657485913
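The summation form and the vector-norm form of the cost give the same number. The sketch below checks this, assuming the output activations copied from the rounded values printed above (so the result agrees with `C_x` only to about three decimals); the names `cost_sum` and `cost_norm` are hypothetical:

```python
import numpy as np

# Assumption: output activations copied from the rounded output above.
a_3 = np.array([[0.542], [0.893]])
y = np.array([[0], [1]])

cost_sum = 0.5 * np.sum((y - a_3) ** 2)          # summation form
cost_norm = 0.5 * np.linalg.norm(y - a_3) ** 2   # vector-norm form

print(np.isclose(cost_sum, cost_norm))  # True
```

Using `np.sum` instead of the built-in `sum` also returns a scalar directly, so no indexing with `[0]` is needed.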

Because of multiplying the cost function by $\frac{1}{2}$, the computation of the partial derivatives in $\frac{\partial C_x}{\partial a^{(l_3)}}$ boils down to a difference between two column vectors:

$$\frac{\partial C}{\partial a^{(l_3)}} = (a^{(l_3)}-y)$$

In [20]:
(a_3 - y)

array([[ 0.542],
       [-0.107]])

To compute the derivatives with respect to all the weights and biases, we will use a strategy that can be well illustrated on the diagram of the previous example:

<img src="img/neural_networks_26.png" alt="drawing" width="950"/>

With the computation `(a_3-y)` we have computed the derivatives of the cost of a single data point with respect to the activations of layer 3 (output layer), designated in the last example by the variables $0_1$ and $0_2$.<br><br>
Efficiently computing the values of the partial derivatives with matrices requires a strategy. The idea is to first compute the partial derivatives of the cost/loss with respect to the __weighted sums__, the intermediate values in each layer of the network (excluding the first layer, which contains only our data points). Knowing these values allows us to easily compute the partial derivatives with respect to all the weights and biases.<br>
In our previous example, the intermediate values of the third (output) layer were $z_1$ and $z_2$, and the intermediate values of the second layer were $p_1$, $p_2$ and $p_3$.

Since we are using matrices, we will store these values accordingly:

$$
\delta^{(l_3)}
=
\begin{bmatrix}
\delta_1^{(l_3)}\\
\delta_2^{(l_3)}\\
\end{bmatrix};\quad
\delta^{(l_2)}
=
\begin{bmatrix}
\delta_1^{(l_2)}\\
\delta_2^{(l_2)}\\
\delta_3^{(l_2)}\\
\end{bmatrix};
$$

### Computing the derivatives of the  weighted sums in the output layer $\delta^{(l_3)}$:

To compute the $\delta^{(l_3)}$, we need to use the chain rule and multiply the partial derivatives $\frac{\partial C}{\partial a^{(l_3)}}$ which we computed as `(a_3-y)`, and the partial derivative $\frac{\partial a^{(l_3)}}{\partial z^{(l_3)}}$, which is simply a derivative of a sigmoid function $\sigma^{\prime}$ applied to each element of a vector:
$$
\sigma^{\prime}(z)
=
\sigma(z) * (1-\sigma(z))
$$

The following code computes the $\frac{\partial a^{(l_3)}}{\partial z^{(l_3)}}$:

In [21]:
sigmoid(z_3) * (1 - sigmoid(z_3))

array([[0.248],
       [0.096]])

Since we have already computed `a_3 = sigmoid(z_3)` in the forward pass, we can be efficient, and reuse this computation to compute the derivative of the sigmoid function:

In [22]:
a_3 * (1 - a_3)

array([[0.248],
       [0.096]])

Therefore computing the $\delta^{(l_3)}$ only requires us to use the chain rule:<br>
$$
\delta^{(l_3)}
\equiv
\frac{\partial C}{\partial z^{(l_3)}} = \frac{\partial C}{\partial a^{(l_3)}} * \frac{\partial a^{(l_3)}}{\partial z^{(l_3)}}
$$

or in code:

In [23]:
delta_3 = (a_3 - y) * a_3 * (1 - a_3)
delta_3

array([[ 0.135],
       [-0.01 ]])
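A common way to gain confidence in a derivative like this one is to compare it with a finite-difference approximation. The sketch below does this for $\delta^{(l_3)}$, assuming the values of `z_3` copied from the rounded output printed earlier; the helper `cost_from_z`, the step `eps` and the array `numeric` are hypothetical names introduced only for this check:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cost_from_z(z, y):
    """Quadratic cost written as a function of the weighted sums z."""
    return 0.5 * np.sum((sigmoid(z) - y) ** 2)

# Assumption: z_3 copied from the rounded output printed earlier.
z_3 = np.array([[0.17], [2.122]])
y = np.array([[0.0], [1.0]])

# Analytic result from the chain rule.
a_3 = sigmoid(z_3)
delta_3 = (a_3 - y) * a_3 * (1 - a_3)

# Central finite differences, one component of z_3 at a time.
eps = 1e-6
numeric = np.zeros_like(z_3)
for k in range(z_3.shape[0]):
    step = np.zeros_like(z_3)
    step[k, 0] = eps
    numeric[k, 0] = (cost_from_z(z_3 + step, y)
                     - cost_from_z(z_3 - step, y)) / (2 * eps)

print(np.allclose(delta_3, numeric, atol=1e-8))  # True
```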

### Computing the derivatives of the weighted sums in the hidden layer $\delta^{(l_2)}$:

#### Hadamard product

In computer science, we often need to multiply two vectors or matrices elementwise. This simple operation is not so well known in mathematics, so we will introduce it here. If $p$ and $t$ are two vectors of the same dimension, we can use $p \odot t$ to denote the elementwise product of the two vectors, called a __Hadamard product__.

\begin{eqnarray}
\left[\begin{array}{c} a \\ b \end{array}\right] 
  \odot \left[\begin{array}{c} c \\ d\end{array} \right]
= \left[ \begin{array}{c} a \cdot c \\ b \cdot d \end{array} \right]
\end{eqnarray}
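In numpy, the `*` operator on arrays is the Hadamard product, while `dot` is the matrix product. The sketch below contrasts the two on small made-up vectors (the values of `p` and `t` are illustrative only):

```python
import numpy as np

# Illustrative values, not taken from the network above.
p = np.array([[1.0], [2.0]])
t = np.array([[3.0], [4.0]])

hadamard = p * t      # elementwise product, shape (2, 1)
inner = p.T.dot(t)    # matrix product of (1, 2) and (2, 1), shape (1, 1)

print(hadamard.ravel())   # [3. 8.]
print(inner.item())       # 11.0
```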

An efficient computation of the $\delta^{(l_2)}$ should let us use the already computed partial derivatives in the matrix  $\delta^{(l_3)}$. 

In the last example, to compute the partial derivatives of the activations of the second layer `dN1`, `dN2` and `dN3`, we needed to take into consideration that the computation of the partial derivative of `dN1` involves considering two paths, one involving `dz1` and another `dz2`, and summing their outputs.

`dN1 = dz1*A4 + dz2*A5
dN2 = dz1*B4 + dz2*B5
dN3 = dz1*C4 + dz2*C5`

After that, to compute the partial derivatives of the cost with respect to the __weighted averages__, `dp1`, `dp2`, and `dp3`, we multiplied `dN1`, `dN2` and `dN3` by the partial derivatives of the sigmoid activations with respect to the weighted sums:<br>

`dp1 = dN1*N1*(1-N1)
dp2 = dN2*N2*(1-N2)
dp3 = dN3*N3*(1-N3)`

The same procedure can be implemented with matrices by using the __transpose__ of the weight matrix. For example the transpose of the matrix:

$$
w^{(l_3)}
=
\underset{\mathbf{2x3}}{
\begin{bmatrix}
w_{11}^{(l_3)} & w_{12}^{(l_3)} & w_{13}^{(l_3)}\\
w_{21}^{(l_3)} & w_{22}^{(l_3)} & w_{23}^{(l_3)}\\
\end{bmatrix}\\
}
$$

would be:

$$
\left(w^{(l_3)}\right)^T
=
\underset{\mathbf{3x2}}{
\begin{bmatrix}
w_{11}^{(l_3)} & w_{21}^{(l_3)}\\
w_{12}^{(l_3)} & w_{22}^{(l_3)}\\
w_{13}^{(l_3)} & w_{23}^{(l_3)}\\
\end{bmatrix}\\
}
$$

In [24]:
w_3.T

array([[-0.006,  0.126],
       [ 0.295,  0.399],
       [ 0.207, -0.637]])

The first column of this matrix contains all the weights of the first neuron of the output layer, and the second column all the weights of the second neuron.

If we compute a matrix product between $\left(w^{(l_3)}\right)^T$ and $\delta^{(l_3)}$ we get:

$$
\begin{align*}
\left(w^{(l_3)}\right)^T\delta^{(l_3)}
&=
\underset{\mathbf{3x2}}{
\begin{bmatrix}
w_{11}^{(l_3)} & w_{21}^{(l_3)}\\
w_{12}^{(l_3)} & w_{22}^{(l_3)}\\
w_{13}^{(l_3)} & w_{23}^{(l_3)}\\
\end{bmatrix}
}
\times
\underset{\mathbf{2x1}}{
\begin{bmatrix}
\delta_1^{(l_3)}\\
\delta_2^{(l_3)}\\
\end{bmatrix}}\\
\\
&=
\underset{\mathbf{3x1}}{
\begin{bmatrix}
(w_{11}^{(l_3)}*\delta_1^{(l_3)})+(w_{21}^{(l_3)}*\delta_2^{(l_3)})\\
(w_{12}^{(l_3)}*\delta_1^{(l_3)})+(w_{22}^{(l_3)}*\delta_2^{(l_3)})\\
(w_{13}^{(l_3)}*\delta_1^{(l_3)})+(w_{23}^{(l_3)}*\delta_2^{(l_3)})\\
\end{bmatrix}}
\end{align*}
$$

In [25]:
w_3.T.dot(delta_3)

array([[-0.002],
       [ 0.036],
       [ 0.034]])

In the previous example, this would be equivalent to <br><br>
`dN1 = dz1*A4 + dz2*A5
dN2 = dz1*B4 + dz2*B5
dN3 = dz1*C4 + dz2*C5`

The only thing left is to multiply this matrix elementwise with the partial derivatives of the sigmoid activation $\sigma'(z^{(l_2)})$, which we compute in the same way as for $\delta^{(l_3)}$:

In [26]:
a_2 * (1 - a_2)

array([[0.237],
       [0.212],
       [0.149]])

Finally, this is the formula to compute $\delta^{(l_2)}$, the matrix of partial derivatives of the cost with respect to the weighted averages of layer 2:

$$
\delta^{(l_2)} = ((w^{(l_3)})^T \delta^{(l_3)}) \odot \sigma'(z^{(l_2)})
$$

To compute this, we simply need to multiply `w_3.T.dot(delta_3)` and `a_2 * (1 - a_2)` elementwise:

In [27]:
delta_2 = np.dot(w_3.T,delta_3) * a_2 * (1 - a_2)
delta_2

array([[-0.   ],
       [ 0.008],
       [ 0.005]])
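The matrix expression packs the two scalar steps of the previous example (`dN` and `dp`) into a single line. The sketch below checks this, assuming the values of `w_3`, `delta_3` and `a_2` copied from the rounded outputs printed above; the name `manual` is hypothetical:

```python
import numpy as np

# Assumption: values copied from the rounded outputs printed above.
w_3 = np.array([[-0.006, 0.295, 0.207], [0.126, 0.399, -0.637]])
delta_3 = np.array([[0.135], [-0.01]])
a_2 = np.array([[0.386], [0.695], [0.182]])

# Matrix form.
delta_2 = w_3.T.dot(delta_3) * a_2 * (1 - a_2)

# Scalar form: sum the two paths through the output neurons (dN),
# then multiply by the sigmoid derivative of the hidden neuron (dp).
dz1, dz2 = delta_3[:, 0]
manual = np.array([
    [(dz1 * w_3[0, j] + dz2 * w_3[1, j]) * a_2[j, 0] * (1 - a_2[j, 0])]
    for j in range(3)
])

print(np.allclose(delta_2, manual))  # True
```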

### Computing the derivatives of the biases in the output layer $b^{(l_3)}$:

Once we know the partial derivatives of the cost with respect to the weighted averages, computing the partial derivatives with respect to the biases is trivial. Since a bias is simply added to the products of weights and activations, the derivative of a weighted average with respect to its bias is equal to 1. Therefore, the bias derivatives of a layer are equal to the $\delta$ of that layer:

In [28]:
db_3 = delta_3
db_3

array([[ 0.135],
       [-0.01 ]])

### Computing the derivatives of the biases of the hidden layer $b^{(l_2)}$:

In [29]:
db_2 = delta_2
db_2

array([[-0.   ],
       [ 0.008],
       [ 0.005]])

### Computing the derivatives of the weights in the output layer $w^{(l_3)}$:

In the last example, to compute the partial derivatives of the weights that multiply the first hidden activation, `dA4` and `dA5`, we simply multiplied the intermediate values `dz1` and `dz2` by the first activation `N1`. To compute the partial derivatives of the weights that multiply the second hidden activation, `dB4` and `dB5`, we multiplied `dz1` and `dz2` by the second activation `N2`. Finally, to compute the partial derivatives of the weights that multiply the third hidden activation, `dC4` and `dC5`, we multiplied `dz1` and `dz2` by the third activation `N3`.

`dA4 = dz1*N1
dA5 = dz2*N1`

`dB4 = dz1*N2
dB5 = dz2*N2`

`dC4 = dz1*N3
dC5 = dz2*N3`

To achieve this with matrix multiplication, we need to compute the transpose matrix of the activations of the previous layer $l_2$:

$$
\begin{align*}
a^{(l_2)}
&\equiv
\underset{\mathbf{3x1}}{
\begin{bmatrix}
a_{1}^{(l_2)}\\
a_{2}^{(l_2)}\\
a_{3}^{(l_2)}\\
\end{bmatrix}}
\\\\
\left(a^{(l_2)}\right)^T
&\equiv
\underset{\mathbf{1x3}}{
\begin{bmatrix}
a_{1}^{(l_2)} & a_{2}^{(l_2)} & a_{3}^{(l_2)}
\end{bmatrix}}
\end{align*}
$$

In [30]:
a_2.T

array([[0.386, 0.695, 0.182]])

Now we can compute the matrix product of the already computed $\delta^{(l_3)}$ and this transposed activation matrix:

$$
\begin{align*}
\delta^{(l_3)}\times\left(a^{(l_2)}\right)^T
&=
\underset{\mathbf{2x1}}{
\begin{bmatrix}
\delta_1^{(l_3)}\\
\delta_2^{(l_3)}\\
\end{bmatrix}}
\times
\underset{\mathbf{1x3}}{
\begin{bmatrix}
a_{1}^{(l_2)} & a_{2}^{(l_2)} & a_{3}^{(l_2)}
\end{bmatrix}}\\\\
&=
\underset{\mathbf{2x3}}{
\begin{bmatrix}
\delta_1^{(l_3)} a_{1}^{(l_2)} & \delta_1^{(l_3)} a_{2}^{(l_2)} & \delta_1^{(l_3)} a_{3}^{(l_2)} \\
\delta_2^{(l_3)} a_{1}^{(l_2)} & \delta_2^{(l_3)} a_{2}^{(l_2)} & \delta_2^{(l_3)} a_{3}^{(l_2)} \\
\end{bmatrix}}\\
\end{align*}
$$

In [31]:
dw_3 = delta_3.dot(a_2.T)
dw_3

array([[ 0.052,  0.094,  0.024],
       [-0.004, -0.007, -0.002]])

This gives the same 6 values as in `dA4`, `dA5`, `dB4`, `dB5`, `dC4`, and `dC5`, compactly represented in a single matrix: the first row holds `dA4`, `dB4`, `dC4` and the second row holds `dA5`, `dB5`, `dC5`.
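The product of a column vector with a row vector is also known as an outer product, and numpy offers it directly as `np.outer`. The sketch below shows that both forms give the same $2x3$ matrix, assuming the values of `delta_3` and `a_2` copied from the rounded outputs printed above; the name `dw_3_outer` is hypothetical:

```python
import numpy as np

# Assumption: values copied from the rounded outputs printed above.
delta_3 = np.array([[0.135], [-0.01]])
a_2 = np.array([[0.386], [0.695], [0.182]])

dw_3 = delta_3.dot(a_2.T)                            # (2, 1) times (1, 3)
dw_3_outer = np.outer(delta_3.ravel(), a_2.ravel())  # same values via np.outer

print(dw_3.shape)                      # (2, 3)
print(np.allclose(dw_3, dw_3_outer))   # True
```

The shape of the result always matches the shape of the weight matrix it belongs to, which is a useful sanity check.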

### Computing the derivatives of the weights in the hidden layer $w^{(l_2)}$:

In the last example, to compute the partial derivatives of the weights of the second layer `dA1`, `dA2`, `dA3`, we simply multiplied the intermediate values `dp1`, `dp2`, and `dp3` by the value of the first activation (data point component) $\mathbf{X}$. To compute the partial derivatives of the weights `dB1`, `dB2` and `dB3`, we multiplied the same intermediate values `dp1`, `dp2`, and `dp3` by the value of the second activation (data point component) $\mathbf{Y}$.

`dA1 = dp1*X
dA2 = dp2*X
dA3 = dp3*X`

`dB1 = dp1*Y
dB2 = dp2*Y
dB3 = dp3*Y`

To achieve this with matrix multiplication, we need to compute the transpose matrix of the activations of the previous (input) layer $l_1$:

$$
\begin{align*}
a^{(l_1)}
&\equiv
\underset{\mathbf{2x1}}{
\begin{bmatrix}
a_{1}^{(l_1)}\\
a_{2}^{(l_1)}\\
\end{bmatrix}}\\
\\
\left(a^{(l_1)}\right)^T
&\equiv
\underset{\mathbf{1x2}}{
\begin{bmatrix}
a_{1}^{(l_1)} & a_{2}^{(l_1)}
\end{bmatrix}}
\end{align*}
$$

In [32]:
a_1_single.T

array([[1.2, 0.7]])

Now we can compute the matrix product of the already computed $\delta^{(l_2)}$ and this transposed activation matrix:

$$
\begin{align*}
\delta^{(l_2)}\times\left(a^{(l_1)}\right)^T
&=
\underset{\mathbf{3x1}}{
\begin{bmatrix}
\delta_1^{(l_2)}\\
\delta_2^{(l_2)}\\
\delta_3^{(l_2)}\\
\end{bmatrix}}
\times
\underset{\mathbf{1x2}}{
\begin{bmatrix}
a_{1}^{(l_1)} & a_{2}^{(l_1)}
\end{bmatrix}}\\\\
&=
\underset{\mathbf{3x2}}{
\begin{bmatrix}
\delta_1^{(l_2)} a_{1}^{(l_1)} & \delta_1^{(l_2)} a_{2}^{(l_1)} \\
\delta_2^{(l_2)} a_{1}^{(l_1)} & \delta_2^{(l_2)} a_{2}^{(l_1)} \\
\delta_3^{(l_2)} a_{1}^{(l_1)} & \delta_3^{(l_2)} a_{2}^{(l_1)}
\end{bmatrix}}\\
\end{align*}
$$

In [33]:
dw_2 = delta_2.dot(a_1_single.T)
dw_2

array([[-0.001, -0.   ],
       [ 0.009,  0.005],
       [ 0.006,  0.004]])

This gives the same 6 values as in `dA1`, `dA2`, `dA3`, `dB1`, `dB2`, and `dB3`, compactly represented in a single matrix: the first column holds `dA1`, `dA2`, `dA3` and the second column holds `dB1`, `dB2`, `dB3`.
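All four gradient matrices can be collected in one function and checked against finite differences. The sketch below does this for `dw_2`. It is not the notebook's own code: the helper names `cost` and `backward`, the seeded generator, and the step `eps` are assumptions made for this check, and the random parameters differ from the ones printed above:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cost(a_1, y, w_2, b_2, w_3, b_3):
    a_2 = sigmoid(w_2.dot(a_1) + b_2)
    a_3 = sigmoid(w_3.dot(a_2) + b_3)
    return 0.5 * np.sum((a_3 - y) ** 2)

def backward(a_1, y, w_2, b_2, w_3, b_3):
    """Hypothetical helper: return dw_2, db_2, dw_3, db_3 for one data point."""
    a_2 = sigmoid(w_2.dot(a_1) + b_2)
    a_3 = sigmoid(w_3.dot(a_2) + b_3)
    delta_3 = (a_3 - y) * a_3 * (1 - a_3)
    delta_2 = w_3.T.dot(delta_3) * a_2 * (1 - a_2)
    return delta_2.dot(a_1.T), delta_2, delta_3.dot(a_2.T), delta_3

# Assumption: arbitrary seed, so these are not the values printed above.
rng = np.random.default_rng(0)
w_2, b_2 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))
w_3, b_3 = rng.standard_normal((2, 3)), rng.standard_normal((2, 1))
a_1 = np.array([[1.2], [0.7]])
y = np.array([[0.0], [1.0]])

dw_2, db_2, dw_3, db_3 = backward(a_1, y, w_2, b_2, w_3, b_3)

# Central finite differences: nudge one weight of w_2 at a time.
eps = 1e-6
numeric_dw_2 = np.zeros_like(w_2)
for i in range(w_2.shape[0]):
    for j in range(w_2.shape[1]):
        bump = np.zeros_like(w_2)
        bump[i, j] = eps
        numeric_dw_2[i, j] = (cost(a_1, y, w_2 + bump, b_2, w_3, b_3)
                              - cost(a_1, y, w_2 - bump, b_2, w_3, b_3)) / (2 * eps)

print(np.allclose(dw_2, numeric_dw_2, atol=1e-8))  # True
```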

***

## Updating the weights and the biases of the neural network

We want to update the weights gradually (slowly descend along the function's gradient). For this reason we will use the variable `step_size`, by which we multiply the derivative matrices of the weights and biases before subtracting them from the current values.

In [34]:
step_size = 0.1
# updating the weights and biases of the third (output) layer
w_3 -= dw_3*step_size
b_3 -= db_3*step_size
# updating the weights and biases of the second (hidden) layer
w_2 -= dw_2*step_size
b_2 -= db_2*step_size

What we want to achieve with training is that our network outputs the value as close as possible to the label, which the data is associated with. The current label is:

In [35]:
y

array([[0],
       [1]])

Our old activation was:

In [36]:
a_3

array([[0.542],
       [0.893]])

We can compute the new activation by computing the forward pass again with the updated weights and biases. The new activation is:

In [37]:
a_2 = sigmoid(w_2.dot(a_1_single) + b_2)
a_3 = sigmoid(w_3.dot(a_2) + b_3)
a_3

array([[0.537],
       [0.893]])

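This result is slightly better than the previous one: the first output moved closer to its target `0`, while the second output is unchanged at this precision. This suggests that the training step is working. The step now needs to be repeated until the output approximates the label closely enough.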
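Repeating the step amounts to wrapping the forward pass, the backward pass and the update in a loop. The following is a minimal sketch of such a loop for the single data point used here. The seeded generator, the choice of 1000 iterations and the list `costs` are assumptions made for illustration, and the starting parameters differ from the ones printed above:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Assumption: arbitrary seed, so these are not the values printed above.
rng = np.random.default_rng(0)
w_2, b_2 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))
w_3, b_3 = rng.standard_normal((2, 3)), rng.standard_normal((2, 1))

a_1 = np.array([[1.2], [0.7]])   # first training example
y = np.array([[0.0], [1.0]])     # its encoded label
step_size = 0.1

costs = []
for _ in range(1000):            # assumption: 1000 steps, chosen arbitrarily
    # forward pass
    a_2 = sigmoid(w_2.dot(a_1) + b_2)
    a_3 = sigmoid(w_3.dot(a_2) + b_3)
    costs.append(0.5 * np.sum((a_3 - y) ** 2))
    # backward pass
    delta_3 = (a_3 - y) * a_3 * (1 - a_3)
    delta_2 = w_3.T.dot(delta_3) * a_2 * (1 - a_2)
    # gradient descent update
    w_3 -= step_size * delta_3.dot(a_2.T)
    b_3 -= step_size * delta_3
    w_2 -= step_size * delta_2.dot(a_1.T)
    b_2 -= step_size * delta_2

print(costs[0], costs[-1])       # cost before the first and the last update
```

Training on a single data point only shows that the mechanics work; a network trained this way has learned nothing about the other nine points.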