In this module, we're going to upgrade the uni-variate chain rule so that it can tackle
multi-variate functions and then see
an example of where this might come in handy.

# Multivariate chain rule

Remember our formula for the total derivative of a function of three variables, each of which depends on $t$:

$\frac{df(x,y,z)}{dt} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial t} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial t}$

Now let us generalize this concept to a more computational approach:

Given the function $f(x_1, x_2, ..., x_n)$, we can write it as $f(\mathbf{x})$, where $\mathbf{x}$ is an n-dimensional vector.

So if we want to find the total derivative of $f(x)$ with respect to another scalar ($t$), we need to compute the following two vectors.

The derivative of $f$ with respect to each component of $x$:

$\frac{\partial f(x)}{\partial x} = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}$

and the derivative of each component of $x$ with respect to $t$:

$\frac{\partial x}{\partial t} = \begin{bmatrix} \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial t} \\ \vdots \\ \frac{\partial x_n}{\partial t} \end{bmatrix}$

So, we are looking to find the sum of the product
of each pair of terms in the same position in each vector.
Thinking back to our linear algebra,
this is exactly what the dot product does. 

And now to find our total derivative we can use the dot product between those vectors:

$\frac{df}{dt} = \frac{\partial f(x)}{\partial x} \cdot \frac{\partial x}{\partial t}$

Remember our Jacobian vector? Did you notice that it is the same as our $\frac{\partial f(x)}{\partial x}$, but written as a row instead of a column?

You can express this by saying that the transpose of the Jacobian is equal to our column vector of partial derivatives:

$(J_f)^T = \frac{\partial f(x)}{\partial x}$

More importantly, we can simplify the multivariate chain rule with the Jacobian vector, since multiplying a row vector by a column vector is the same operation as the dot product:

$\frac{df}{dt} = J_f \frac{\partial x}{\partial t}$
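The row-times-column form can be tried out numerically. The following is a minimal sketch, assuming a made-up function $f(x) = x_1^2 + 3x_2$ with $x_1(t) = \sin(t)$ and $x_2(t) = t^2$ (these functions and all names in the code are illustrative choices, not part of the course material). It compares the Jacobian product with a finite-difference estimate.

```python
import numpy as np

# Hypothetical example functions (illustrative only):
# f(x) = x1^2 + 3*x2, with x1(t) = sin(t) and x2(t) = t^2.
def f(x):
    return x[0] ** 2 + 3.0 * x[1]

def x_of_t(t):
    return np.array([np.sin(t), t ** 2])

def jacobian_f(x):
    # Row vector [df/dx1, df/dx2]
    return np.array([2.0 * x[0], 3.0])

def dx_dt(t):
    # Column vector [dx1/dt, dx2/dt]
    return np.array([np.cos(t), 2.0 * t])

t = 0.7
# Multivariate chain rule: df/dt = J_f . dx/dt
df_dt = jacobian_f(x_of_t(t)) @ dx_dt(t)

# Central finite difference as an independent check.
h = 1e-6
df_dt_numeric = (f(x_of_t(t + h)) - f(x_of_t(t - h))) / (2 * h)

print(df_dt, df_dt_numeric)
```

The two printed numbers should agree to several decimal places, which is a handy sanity check whenever you derive a Jacobian by hand.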

Now let's visualize how to construct the chain rule for a simple case. You can later use this knowledge to solve more complicated problems (which I hope you are doing on a computer and not by hand!).

How do you find $\frac{df}{dt}$ if this is your function: $f(x(u(t)))$?

Let's start by looking at what we have in hand:

$f(x) = f(x_1,x_2)$

$x(u) = \begin{bmatrix} x_1(u_1,u_2) \\ x_2(u_1,u_2)\end{bmatrix}$

$u(t) = \begin{bmatrix} u_1(t) \\ u_2(t) \end{bmatrix}$

Now let's construct our chain rule:

$\frac{df}{dt} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial u} \frac{\partial u}{\partial t}$

![image.png](attachment:a094637c-c39c-4629-8dd9-3b065d3c1853.png)

Now we are getting somewhere! If we already have the notation, we just need to understand what each partial represents:

$\frac{df}{dt} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial u} \frac{\partial u}{\partial t} = [\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2}] ... \begin{bmatrix} \frac{\partial u_1}{\partial t} \\ \frac{\partial u_2}{\partial t} \end{bmatrix}$

We've already seen that differentiating the scalar-valued function f
with respect to its input vector x gives us the Jacobian row vector (the orange one).\
We've also seen that differentiating a vector-valued function
u with respect to the scalar variable t gives us a column vector of derivatives (the magenta one).\
But what about the middle term, dx by du (the green one)?

Well, for the function x, we need to find the derivative of each of the two
output variables (x1 and x2) with respect to each of the two input variables (u1 and u2), which is what $\frac{\partial x}{\partial u}$ represents.\
So we end up with four terms in total,
which, as we saw in the last module (building a Jacobian matrix), can be conveniently arranged as a matrix.\
We still refer to this object as a Jacobian (but it's a matrix).

So in short, the middle green matrix holds the partial derivative of every one of the $x$s with respect to every one of the $u$s. Differentiating each term (x1 and x2) with respect to each u (u1 and u2) gives us the final expression:

![image.png](attachment:df10a1aa-503b-4527-8d8f-1d45c9e01425.png)

$\frac{df}{dt} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial u} \frac{\partial u}{\partial t} = [\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2}] \begin{bmatrix} \frac{\partial x_1}{\partial u_1} & \frac{\partial x_1}{\partial u_2} \\ \frac{\partial x_2}{\partial u_1} & \frac{\partial x_2}{\partial u_2} \end{bmatrix}\begin{bmatrix} \frac{\partial u_1}{\partial t} \\ \frac{\partial u_2}{\partial t} \end{bmatrix}$

Now we just need to calculate the partial derivatives and do the maths!
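To see the row vector, the Jacobian matrix and the column vector working together, here is a small sketch. The functions $f$, $x(u)$ and $u(t)$ below are invented purely for illustration (they are assumptions, not taken from the lecture), and the result is checked against a finite difference of the composed function.

```python
import numpy as np

# Hypothetical functions chosen only to illustrate f(x(u(t))):
#   u(t) = [cos(t), t^2]
#   x(u) = [u1 + u2^2, u1 * u2]
#   f(x) = x1 * x2
def u_of_t(t):
    return np.array([np.cos(t), t ** 2])

def x_of_u(u):
    return np.array([u[0] + u[1] ** 2, u[0] * u[1]])

def f_of_x(x):
    return x[0] * x[1]

def jacobian_f(x):
    # Row vector [df/dx1, df/dx2]
    return np.array([x[1], x[0]])

def jacobian_x(u):
    # 2x2 matrix: row i holds the derivatives of x_i with respect to u1 and u2
    return np.array([[1.0, 2.0 * u[1]],
                     [u[1], u[0]]])

def du_dt(t):
    # Column vector [du1/dt, du2/dt]
    return np.array([-np.sin(t), 2.0 * t])

t = 0.5
u = u_of_t(t)
x = x_of_u(u)

# Chain rule as a product: row vector @ matrix @ column vector
df_dt = jacobian_f(x) @ jacobian_x(u) @ du_dt(t)

# Finite-difference check on the fully composed function
def composed(t):
    return f_of_x(x_of_u(u_of_t(t)))

h = 1e-6
df_dt_numeric = (composed(t + h) - composed(t - h)) / (2 * h)

print(df_dt, df_dt_numeric)
```

Adding another link to the chain would only mean multiplying in one more Jacobian, which is what makes this form modular.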

Let's take an example:

$f(X) = f(x_1, x_2, x_3)$ \
$f(X) = x_1^3 \cos(x_2) e^{x_3}$

$x_1(t) = 2t$ \
$x_2(t) = 1-t^2$ \
$x_3(t) = e^t$

We are trying to find $\frac{df}{dt}$:

$\frac{df}{dt} = \frac{\partial f}{\partial X} \frac{\partial X}{\partial t}$

Since $\frac{\partial f}{\partial X}$ is the Jacobian of $f$, it holds the derivative of the output $f$ with respect to each of the inputs $(x_1, x_2, x_3)$.

The other Jacobian is $\frac{\partial X}{\partial t}$, which holds the derivative of each of the outputs $(x_1, x_2, x_3)$ with respect to every input (we have only one input here, which is $t$).

So we get $\frac{df}{dt} = [\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2},\frac{\partial f}{\partial x_3}] \cdot \begin{bmatrix} \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial t}  \\ \frac{\partial x_3}{\partial t} \end{bmatrix}$

We can also solve it without the Jacobians.

So we are going to use:

1 - chain rule\
2 - total derivative for X (all xs) and T (all ts)\
3 - chain rule for the found total derivatives\
4 - substitute to get the result

$\frac{df}{dt} = \frac{\partial f}{\partial X} \frac{\partial X}{\partial t}$

$df = \frac{\partial f}{\partial x_1} dx_1 + \frac{\partial f}{\partial x_2} dx_2 + \frac{\partial f}{\partial x_3} dx_3$

$\frac{dX}{dt} = [\frac{\partial x_1}{\partial t} , \frac{\partial x_2}{\partial t} , \frac{\partial x_3}{\partial t}]$

then dividing the total differential by $dt$ gives $\frac{df}{dt} =
\frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} +
\frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} +
\frac{\partial f}{\partial x_3} \frac{\partial x_3}{\partial t}$


So we arrive at the same result as we did using the Jacobians and the dot product. The Jacobian approach is recommended, as it is easier and more modular (we can add more links, that is, more Jacobians, to the chain).


$\frac{df}{dt} =
( 3x_1^2 \cos(x_2) e^{x_3} \cdot 2 ) +
( -x_1^3 \sin(x_2) e^{x_3} \cdot (-2t) ) +
( x_1^3 \cos(x_2) e^{x_3} \cdot e^t )$
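The result above can be checked on a computer. The sketch below codes the example's $f$, $x_1(t)$, $x_2(t)$ and $x_3(t)$ exactly as defined in the text; the function names, the evaluation point `t = 0.5` and the finite-difference step are my own assumptions.

```python
import numpy as np

def x_of_t(t):
    # x1 = 2t, x2 = 1 - t^2, x3 = e^t
    return np.array([2.0 * t, 1.0 - t ** 2, np.exp(t)])

def f(x):
    # f = x1^3 * cos(x2) * e^(x3)
    return x[0] ** 3 * np.cos(x[1]) * np.exp(x[2])

def jacobian_f(x):
    x1, x2, x3 = x
    return np.array([
        3 * x1 ** 2 * np.cos(x2) * np.exp(x3),   # df/dx1
        -x1 ** 3 * np.sin(x2) * np.exp(x3),      # df/dx2
        x1 ** 3 * np.cos(x2) * np.exp(x3),       # df/dx3
    ])

def dx_dt(t):
    return np.array([2.0, -2.0 * t, np.exp(t)])

t = 0.5  # arbitrary evaluation point
df_dt = jacobian_f(x_of_t(t)) @ dx_dt(t)

# Central finite difference of f(x(t)) as a check
h = 1e-6
df_dt_numeric = (f(x_of_t(t + h)) - f(x_of_t(t - h))) / (2 * h)

print(df_dt, df_dt_numeric)
```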

# Neural network in matrix format

A neural network has nodes (neurons) and edges (connectors).

A neural network is just a function that takes a variable in and gives you another variable back,
where both of these variables could be vectors.

![image.png](attachment:6b4c8eda-7161-44e5-bb1e-4b7ccf155e0a.png)

Let's look at a simple neural network which takes in a single scalar variable, which we'll call a0,
and returns another scalar, a1.

![image.png](attachment:a489d827-a489-482e-9914-7430bf089783.png)


We can write this function down as follows: a1 equals Sigma of w times a0 plus b,
where b and w are just numbers,
but Sigma is itself a function.
It's useful at this point to give each of these terms a name,
as it will help you keep track of what's going on when things get a bit more complicated.
So, a terms are called activities,
w is a weight,
b is a bias and Sigma is what we call an activation function.
Now, you might be thinking,
how come all the terms used a sensible letter except for Sigma.
But the answer to this comes from the fact that it is
Sigma that gives neural networks their association to the brain.
Neurons in the brain receive information from
their neighbors through chemical and electrical stimulation.
And when the sum of all these stimulations goes beyond a certain threshold amount,
the neuron is suddenly activated and starts stimulating its neighbors in turn. 

An example of a function which has this thresholding
property is the hyperbolic tangent function,
tanh,

![image.png](attachment:4cd691e9-ffc8-44b6-9f40-a1e3696c38b0.png)

which is a nice well-behaved function with a range from minus one to one.
You may not have met tanh before,
but it's just the ratio of some exponential terms
and nothing your calculus tools can't already handle.
Tanh actually belongs to a family of similar functions with
this characteristic S shape called sigmoids.
Hence why we use Sigma for this term.
So, here we are with our nonlinear function
that we can evaluate on a calculator, and we now also know what all the terms are called.
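For reference, the "ratio of some exponential terms" is $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$. A quick sketch (the helper name `tanh_from_exponentials` and the sample points are my own) confirms that this matches NumPy's built-in `np.tanh` and stays between minus one and one:

```python
import numpy as np

def tanh_from_exponentials(z):
    # tanh written out as a ratio of exponentials
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
values = tanh_from_exponentials(z)

print(np.allclose(values, np.tanh(z)))
print(values.min(), values.max())
```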

So, now, we're just going to start building up
some more complexity whilst keeping track of how the notation adapts to cope.
Let's now add an additional neuron to our input layer:

![image.png](attachment:78e740a0-c354-4a45-8d45-b6a748417416.png)

However, things are starting to get a bit messy.
So, let's now generalize our expression to take n inputs,
for which we can just use the summation notation. Or, even better,
notice that each input has a weight.
So, we can make a vector of weights and a vector of inputs,
and then just take the dot product to achieve the same effect.

![image.png](attachment:7fa91fdc-fdb4-4451-83fc-498ff49fe2c9.png)

So, let's now apply the same logic to the outputs.
Adding a second output neuron,
we'd call these two values a10 and a11,
where we now have twice the number of connectors,
each one with its own weighting and each neuron has its own bias.
So, we can write a pair of equations to describe this scenario,

![image.png](attachment:7644e486-1b4c-444f-b8b5-c99a72282c47.png)

(Note that each weight w belongs to one connection between an input node and an output node, while each bias b belongs to one output node and is added to the dot product of that node's weights and the inputs.)

with one for each of the outputs,
where each equation contains the same values of a0,
but each has a different bias and vector of weights.

Unsurprisingly, we can again crunch
these two equations down to a more compact vector form,
where the two outputs are each rows of a column vector,
meaning that we now hold our two weight vectors in a weight matrix,
and our two biases in a bias vector.

![image.png](attachment:d983eb24-e055-4995-a2e3-714b96c0a2a0.png)

Now, let's have a look at what our compact equation contains in all its glory.\
For what we call a single-layer neural network with m outputs and n inputs,
we can fully describe the function it represents with this equation.
And peering inside, we can see all the weights and biases at work.

![image.png](attachment:2a16040f-fe08-42f4-98cc-a7e2acc16130.png)

The last piece of the puzzle is that as we saw at the very beginning,
neural networks often have
one or several layers of neurons between the inputs and the outputs.
We refer to these as hidden layers,
and they behave in exactly the same way as we've seen so far,
except that outputs are now the inputs of the next layer. 

![image.png](attachment:f12dd1f7-9f54-4b24-8801-83b206ea443e.png)

So we have our complete formula:

![image.png](attachment:eb219dfe-be22-4637-9c5f-097de62c9041.png)


And with that, we have all the linear algebra in place
for us to calculate the outputs of a simple feed forward neural network. 

However, persuading your network to do something interesting, such as image recognition,
then becomes a matter of teaching it all the right weights
and biases, which is what we're going to be looking at in the next lesson,
as we bring the multivariate chain rule into play.

---

Here we are going to visualize what happens beneath a neural network in its matrix format.

Since no image is available to plot here, this part can be difficult to understand at first, but the maths is solid and can be useful for those who already have a visual understanding of a neural network:

> First example: a network with a single neuron

> This network takes a single scalar (input) $a^{(0)}$ and returns another scalar (output) $a^{(1)}$

We can write this function as: $a^{(1)} = \sigma(wa^{(0)} + b)$

a = activity\
w = weight\
b = bias\
$\sigma$ = activation function

Note that sigma is where the name "neural network" comes from, since it mimics a neuron: when the weighted input exceeds a threshold, the "neuron" is activated and starts stimulating its neighbours.

> Now we are going to extend our model by adding another input neuron, so now we have two input neurons:

$a_0^{(0)}$ and $a_1^{(0)}$, which together feed a single output scalar: $a^{(1)}$

To add this to our math notation, we simply say that $a^{(1)}$ is equal to sigma applied to the weighted sum of those neurons plus the bias:

$a^{(1)} = \sigma(w_0a_0^{(0)} + w_1a_1^{(0)} + b)$

If we add another neuron (or as many as we like), we simply follow the same logic:

$a^{(1)} = \sigma(w_0a_0^{(0)} + w_1a_1^{(0)} + w_2a_2^{(0)} + b)$

Note that we can generalize this expression to n inputs with the summation notation:

$a^{(1)} = \sigma( (\sum_{j=0}^{n-1} w_ja_j^{(0)}) + b)$

But note! Each input has a corresponding weight, so we can make a vector of inputs and a vector of weights and take their dot product:

$a^{(1)} = \sigma (\hat{w} \cdot \hat{a}^{(0)} + b)$

Now that we can add as many inputs as we like, let's finally extend to as many outputs as we like:

> Now we have 3 inputs ($a_0^{(0)}, a_1^{(0)}, a_2^{(0)})$ and 2 outputs ($a_0^{(1)}, a_1^{(1)}$)

Now each output will have its own formula with its own weights and bias:

$a_0^{(1)} = \sigma (\hat{w}_0 \cdot \hat{a}^{(0)} + b_0)$

$a_1^{(1)} = \sigma (\hat{w}_1 \cdot \hat{a}^{(0)} + b_1)$

Each equation for an output has the same inputs (and sigma!) but different biases and weights.

And now, finally, we write this in a compact way using matrix notation:

$a^{(1)} = \sigma (W^{(1)} \cdot \hat{a}^{(0)} + b^{(1)})$

Note that our weight vectors are now represented as a matrix, our biases are represented as a vector and our outputs are represented as a column vector, where each position represents an output.

Now we extend our single-layer neural network to m outputs (indexed from 0 to m-1) and n inputs (indexed from 0 to n-1), and the formula remains the same:

$a^{(1)} = \sigma (W^{(1)} \cdot \hat{a}^{(0)} + b^{(1)})$

but our formula can be seen in its matrix format as:

$\begin{bmatrix} a_0^{(1)} \\ a_1^{(1)} \\ \vdots \\ a_{m-1}^{(1)} \end{bmatrix} = \sigma \left ( \begin{bmatrix} w_{(0,0)}^{(1)} & w_{(0,1)}^{(1)} & \cdots & w_{(0,n-1)}^{(1)} \\ w_{(1,0)}^{(1)} & w_{(1,1)}^{(1)} & \cdots & w_{(1,n-1)}^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{(m-1,0)}^{(1)} & w_{(m-1,1)}^{(1)} & \cdots & w_{(m-1,n-1)}^{(1)} \end{bmatrix} \cdot \begin{bmatrix} a_0^{(0)} \\ a_1^{(0)} \\ \vdots \\ a_{n-1}^{(0)} \end{bmatrix} + \begin{bmatrix} b_0^{(1)} \\ b_1^{(1)} \\ \vdots \\ b_{m-1}^{(1)} \end{bmatrix}\right )$
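To convince yourself that the matrix equation is just the per-neuron summation written compactly, here is a small sketch with m = 2 outputs and n = 3 inputs. The numerical values of `W`, `b` and `a0` are made up for illustration, and $\sigma$ is assumed to be tanh.

```python
import numpy as np

sigma = np.tanh  # assumed activation function

# Hypothetical weights, biases and inputs (m = 2 outputs, n = 3 inputs)
W = np.array([[0.5, -1.0, 2.0],
              [1.5, 0.25, -0.75]])
b = np.array([0.1, -0.2])
a0 = np.array([0.3, 0.4, 0.1])

# Compact matrix form: a1 = sigma(W a0 + b)
a1_matrix = sigma(W @ a0 + b)

# The same computation, neuron by neuron, with the sum over inputs spelled out
a1_loop = np.array([
    sigma(sum(W[i, j] * a0[j] for j in range(a0.size)) + b[i])
    for i in range(b.size)
])

print(a1_matrix)
print(a1_loop)
```

Row `i` of `W` is the weight vector of output neuron `i`, so each row contributes exactly one of the per-output equations.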

With this we have our final formula for a feed-forward neural network with any number of layers, where each layer takes the previous layer's outputs as its inputs:

$a^{(L)} = \sigma (W^{(L)} \cdot \hat{a}^{(L-1)} + b^{(L)})$
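Applied layer after layer, this formula is a short loop in code. The sketch below is one possible way to write it; the function name `feed_forward`, the layer sizes (4 inputs, 3 hidden units, 2 outputs) and the random initial values are all assumptions made for the example.

```python
import numpy as np

def feed_forward(a0, weights, biases, sigma=np.tanh):
    """Apply a^(L) = sigma(W^(L) a^(L-1) + b^(L)) one layer at a time."""
    a = a0
    for W, b in zip(weights, biases):
        a = sigma(W @ a + b)
    return a

rng = np.random.default_rng(0)

# Hypothetical architecture: 4 inputs -> 3 hidden units -> 2 outputs
sizes = [4, 3, 2]
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n_out) for n_out in sizes[1:]]

a0 = np.array([0.1, 0.5, -0.3, 0.8])
output = feed_forward(a0, weights, biases)

n_weights = sum(W.size for W in weights)
n_biases = sum(b.size for b in biases)
print(output)
print(n_weights, n_biases)
```

Each weight matrix has shape (units in this layer, units in the previous layer), which is what lets `W @ a` map one layer's activities onto the next.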

# Quiz one (very important practical one)

Simple Artificial Neural Networks


Question 1

Recall from the video the structure of one of the simplest neural networks,

![image.png](attachment:96363060-b0dc-4fc3-8e83-846317dd85f2.png)

Here there are only two neurons (or nodes), and they are linked by a single edge.

The activation of neurons in the final layer, (1), is determined by the activation of neurons in the previous layer, (0),

![image.png](attachment:267528dd-7804-413b-8ca9-84a48bc9af52.png)

where $w^{(1)}$ is the weight of the connection between Neuron (0) and Neuron (1), and $b^{(1)}$ is the bias of Neuron (1). These are then subject to the activation function $\sigma$ to give the activation of Neuron (1).

Our small neural network won't be able to do a lot - it's far too simple. It is however worth plugging a few numbers into it to get a feel for the parts.

Let's assume we want to train the network to give a NOT function, that is if you input 1 it returns 0, and if you input 0 it returns 1.

For simplicity, let's use $\sigma(z) = \tanh(z)$ for our activation function, and randomly initialise our weight and bias to $w^{(1)}=1.3$ and $b^{(1)} = -0.1$.

Use the code block below to see what output values the neural network initially returns for training data.

In [None]:
import numpy as np

# First we set the state of the network
σ = np.tanh
w1 = 1.3 #answer is -5
b1 = -0.1 #answer is 5

# Then we define the neuron activation.
def a1(a0) :
    return σ(w1 * a0 + b1)
  
# Finally let's try the network out!
# Replace x with 0 or 1 below,
a1(1)

It's not very good! But it's not trained yet; experiment by changing the weight and bias and see what happens.

Choose the weight and bias that gives the best result for a NOT function out of all the options presented.

Answer is

w1 = -5\
b1 = 5
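A quick check of why this choice works: with $w^{(1)} = -5$ and $b^{(1)} = 5$, an input of 0 gives $\tanh(5)$, which is very close to 1, and an input of 1 gives $\tanh(0) = 0$. The sketch below simply re-runs the quiz network with those values.

```python
import numpy as np

sigma = np.tanh
w1 = -5.0
b1 = 5.0

def a1(a0):
    # Activation of the output neuron for input a0
    return sigma(w1 * a0 + b1)

print(a1(0))  # tanh(5), very close to 1
print(a1(1))  # tanh(0), which is 0
```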

Question 2

Let's extend our simple network to include more neurons.

![image.png](attachment:7a16de04-8a83-4d28-ba6a-4f081f0aec20.png)

We now have a slightly changed notation. The neurons, which are labelled by their layer with a superscript in brackets, are now also labelled with their number in that layer as a subscript, and form vectors $\mathbf{a}^{(0)}$ and $\mathbf{a}^{(1)}$.

The weights now form a matrix $\mathbf{W}^{(1)}$, where each element, $w^{(1)}_{ij}$, is the link between neuron $j$ in the previous layer and neuron $i$ in the current layer. For example, $w^{(1)}_{12}$ is highlighted, linking $a^{(0)}_2$ to $a^{(1)}_1$.

The biases similarly form a vector $\mathbf{b}^{(1)}$.

We can update our activation function to give,

![image.png](attachment:e5762c90-b72d-4968-b5c8-1f163ea5f3bc.png)

where all the quantities of interest have been upgraded to their vector and matrix form and $\sigma$ acts upon each element of the resulting weighted sum vector separately.

![image.png](attachment:8c8bedd2-ea02-41c6-adca-5bd90ce7a117.png)

You may do this calculation either by hand (to 2 decimal places), or by writing python code. Input your answer into the code block below.

(If you choose to code, remember that you can use the @ operator in Python to apply a matrix to a vector.)

In [2]:
import numpy as np
# First set up the network.
sigma = np.tanh
W = np.array([[-2, 4, -1],[6, 0, -3]])
b = np.array([0.1, -2.5])

# Define our input vector
x = np.array([0.3, 0.4, 0.1])

# Calculate the values by hand,
# and replace a1_0 and a1_1 here (to 2 decimal places)
# (Or if you feel adventurous, find the values with code!)

a1_0 = sigma((W[0] @ x ) + b[0])
a1_1 = sigma((W[1] @ x ) + b[1])

a1 = np.array([a1_0, a1_1])

a1

array([ 0.76159416, -0.76159416])

Question 3

Now let's look at a network with a hidden layer.

![image.png](attachment:7724768c-6629-4881-a859-8d80c46f8769.png)

Here, data is input at layer (0), this activates neurons in layer (1), which become the inputs for neurons in layer (2).

(We've stopped explicitly drawing the biases here.)

Which of the following statements are true?

![image.png](attachment:d85b2595-e9c6-4fc6-9b8f-059330d4ec88.png)

![image.png](attachment:1b01b2fa-10d1-480c-bb70-d995f841b503.png)

Question 5

So far, we have concentrated mainly on the structure of neural networks, let's look a bit closer at the function, and what the parts actually do.

We'll introduce another network, this time with a one dimensional input, a one dimensional output, and a hidden layer with two neurons.

Use the tool below to change the values of the four weights and three biases, and observe what effect this has on the network's function.

With the weights and biases set here, observe how $a^{(1)}_0$ activates when $a^{(0)}_0$ is active, and $a^{(1)}_1$ activates when $a^{(0)}_0$ is inactive. Then the output neuron, $a^{(2)}_0$, activates when neither $a^{(1)}_0$ nor $a^{(1)}_1$ are too active.

(Interact with the plugin below to score the point for this question.)

![image.png](attachment:a8d66e74-9491-4850-8963-ce7e756e4d72.png)

# Training a neural network

We're going to see how
the multivariate chain rule will enable us to iteratively update the values of
all the weights and biases such that the network learns to
classify input data based on a set of training examples.

This method is called **back propagation** because it looks
first at the output neurons and then works back through the network.


If we start by choosing a simple structure such as the
one shown here with four input units,

![image.png](attachment:09a73cc7-888c-4d7d-8fda-d42a20ecdafd.png)

three units in the hidden layer and two units in the output layer,
what we're trying to do is find the 18 weights and five biases that
cause our network to best match the training inputs to their labels.
Initially, we will set all of our weights and biases to be a random number.
And so initially, when we pass some data into our network,
what we get out will be meaningless. 

However, we can then define a cost function,
which is simply the sum of the squares of the differences between the desired output y,
and the output that our untrained network currently gives us. 

![image.png](attachment:329ff74b-a6a7-463a-bcab-e67cd0736968.png)

If we were to focus on the relationship between
one specific weight and the resulting cost function,
it might look something like this,

![image.png](attachment:ed001e88-cd72-4425-9c49-6a416f5e95f6.png)

where if it's either too large or too small,
the cost is high.
But, at one specific value,
the cost is at a minimum.
Now, based on our understanding of calculus,
if we were somehow able to work out the gradient of C with respect to the variable W,
at some initial point W0,
then we can simply head in the opposite direction (that is, downhill).

For example, at the point shown on the graph,
the gradient is positive and therefore increasing W would also increase the cost.
So, we should make W smaller to improve our network.
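The idea of stepping against the gradient can be sketched for the two-node network from the quiz. Here the bias is held fixed so the cost depends on the weight alone, the slope is estimated with a finite difference, and the step size `learning_rate` is a hypothetical value chosen for the example (the notes do not specify one).

```python
import numpy as np

sigma = np.tanh

# Two-node network with the bias held fixed, so the cost depends on w alone.
b = -0.1
x, y = 1.0, 0.0   # one training pair for a NOT function: input 1 -> output 0

def cost(w):
    return (sigma(w * x + b) - y) ** 2

w0 = 1.3
h = 1e-6
gradient = (cost(w0 + h) - cost(w0 - h)) / (2 * h)  # numerical slope at w0

learning_rate = 0.5  # hypothetical step size, not from the notes
w_new = w0 - learning_rate * gradient  # step in the direction opposite to the slope

print(gradient)               # positive here, so w should decrease
print(cost(w0), cost(w_new))  # the cost after the step is lower
```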

However, at this point it's worth noting that
our cost function may look something more like this wiggly curve here,

![image.png](attachment:324f3561-7377-45a3-8179-8d7a3852e571.png)

which has several local minima and is more complicated to navigate.
Furthermore, this part is just considering one of our weights in isolation.


But what we'd really like to do is find the minimum of
the multi-dimensional hyper surface much
like the 2D examples we saw in the previous module.

![image.png](attachment:cf95b52a-6cb7-4a3f-8be2-8dcfddda13ac.png)

So, also like the previous module,
if we want to head down hill,
we will need to build the Jacobian by gathering together
the partial derivatives of
the cost function with respect to all of the relevant variables.

To do this, we just need to look again at our simple two-node network.
And at this point, we could immediately write down a chain rule expression for
the partial derivative of the cost with respect to either our weight or our bias.
And I've highlighted the a1 term which links these two derivatives. 

![image.png](attachment:cb40664c-b3b1-45e7-a80a-00073f86bce6.png)

However, it's often convenient to make use of an additional term which we will call z1,
that will hold our weighted activation plus bias terms.

![image.png](attachment:ce0c0753-83d6-4b41-9cb7-549298d02faf.png)

This will allow us to think about differentiating
the particular sigmoid function that we happened to choose separately.
So, we must therefore include an additional link in our derivative chain.
We now have the two chain rule expressions we'd require to navigate
the two-dimensional (w, b) space in order to minimize
the cost of this simple network for a set of training examples.

So let's generalize the rule:

![image.png](attachment:d35660be-31c6-4aca-9678-1d8734fd4cd9.png)

Clearly, things are going to get a little more complicated when we add more neurons.
But fundamentally, we're still just applying
the chain rule to link each of our weights and biases back to its effect on the cost,
ultimately allowing us to train our network.
In the following exercises,
we're going to work through how to extend what we saw
for the simple case to multi-layer networks.

---
To improve the performance of the neural network on the training data, we can vary the weight and bias. We can calculate the derivative of the example cost with respect to these quantities using the chain rule.

![image.png](attachment:9b0f855c-b382-4e06-98f4-09e1951fab8c.png)

Individually, these derivatives take a fairly simple form. Go ahead and calculate them. We'll repeat the defining equations for convenience,

![image.png](attachment:dfdd4628-430e-4ded-9af2-d58d29e8d968.png)

![image.png](attachment:d3fbad53-feb8-41e6-a52b-3fc447a8a23e.png)\
![image.png](attachment:4d9f6e52-11dc-4219-8502-69274c0d4055.png)\
![image.png](attachment:d590042d-fbb8-4665-b45b-b121420f5af7.png)\
![image.png](attachment:4718a46e-1c36-47b2-b49d-a6e32bbdc084.png)


---

# Exercise

Question 1

In this exercise we'll look in more detail at back-propagation, using the chain rule, in order to train our neural networks.

Let's look again at our two-node network.

![image.png](attachment:7f88fecd-8f84-466f-9fc8-c8246ce98e21.png)

![image.png](attachment:6aa4eced-4eb3-4b2d-9e33-fb9a3de26f3d.png)

In [5]:
# First we set the state of the network
σ = np.tanh
w1 = 1.3
b1 = -0.1

# Then we define the neuron activation.
def a1(a0) :
  z = w1 * a0 + b1
  return σ(z)

# Experiment with different values of x below.
x = 0

print(a1(x))

a_1 = a1(x)
C_0 = (a1(x) - 1)*(a1(x) - 1)
C_0

-0.09966799462495582


1.209269698402472

Question 2

![image.png](attachment:e6fb11ad-9912-4c63-81d5-2c7d49c31a64.png)

![image.png](attachment:031d3118-9adf-4772-8a15-1e58fc91415c.png)

![image.png](attachment:546601b9-9afe-47b1-bf9d-7064510a0e9e.png)

Question 3

![image.png](attachment:671c6054-92f7-4604-a495-61f7dc6cc0d8.png)

![image.png](attachment:1226851c-6e78-4891-98c8-63c80c15e137.png)

![image.png](attachment:2c640666-c5c0-458e-99b9-921a62c682cd.png)

![image.png](attachment:049007f9-93f4-4f3a-a71e-b3dc9030c40b.png)

![image.png](attachment:e05763fe-86b8-4cd5-880c-8729374d45f6.png)

Question 4

![image.png](attachment:98fb8757-8ed5-423f-bfc1-a9dcae337826.png)

In [None]:
# First define our sigma function.
sigma = np.tanh

# Next define the feed-forward equation.
def a1 (w1, b1, a0) :
  z = w1 * a0 + b1
  return sigma(z)

# The individual cost function is the square of the difference between
# the network output and the training data output.
def C (w1, b1, x, y) :
  return (a1(w1, b1, x) - y)**2

# This function returns the derivative of the cost function with
# respect to the weight.
def dCdw (w1, b1, x, y) :
  z = w1 * x + b1
  dCda = 2 * (a1(w1, b1, x) - y) # Derivative of cost with activation
  dadz = 1/np.cosh(z)**2 # derivative of activation with weighted sum z
  dzdw = x # derivative of weighted sum z with weight
  return dCda * dadz * dzdw # Return the chain rule product.

# This function returns the derivative of the cost function with
# respect to the bias.
# It is very similar to the previous function.
# You should complete this function.
def dCdb (w1, b1, x, y) :
  z = w1 * x + b1
  dCda = 2 * (a1(w1, b1, x) - y)
  dadz = 1/np.cosh(z)**2
  """ Change the next line to give the derivative of
      the weighted sum, z, with respect to the bias, b. """
  dzdb = 1
  return dCda * dadz * dzdb

"""Test your code before submission:"""
# Let's start with an unfit weight and bias.
w1 = 2.3
b1 = -1.2
# We can test on a single data point pair of x and y.
x = 0
y = 1
# Output how the cost would change
# in proportion to a small change in the bias
print( dCdb(w1, b1, x, y) )
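With both derivatives available, training amounts to repeatedly nudging the weight and bias against their gradients. The loop below is a minimal sketch of that idea for the NOT function; it re-declares compact versions of the functions above so it runs on its own, and the learning rate, the number of steps and the helper `total_cost` are assumptions of mine rather than part of the exercise.

```python
import numpy as np

sigma = np.tanh

def a1(w1, b1, a0):
    return sigma(w1 * a0 + b1)

def dCdw(w1, b1, x, y):
    z = w1 * x + b1
    return 2 * (a1(w1, b1, x) - y) * (1 / np.cosh(z) ** 2) * x

def dCdb(w1, b1, x, y):
    z = w1 * x + b1
    return 2 * (a1(w1, b1, x) - y) * (1 / np.cosh(z) ** 2) * 1

def total_cost(w1, b1, data):
    # Hypothetical helper: sum of the individual costs over the training set
    return sum((a1(w1, b1, x) - y) ** 2 for x, y in data)

# Training data for a NOT function: 0 -> 1 and 1 -> 0
data = [(0.0, 1.0), (1.0, 0.0)]

w1, b1 = 1.3, -0.1     # same starting point as the earlier quiz
learning_rate = 0.1    # assumed step size
n_steps = 2000         # assumed number of iterations

initial_cost = total_cost(w1, b1, data)
for _ in range(n_steps):
    grad_w = sum(dCdw(w1, b1, x, y) for x, y in data)
    grad_b = sum(dCdb(w1, b1, x, y) for x, y in data)
    w1 -= learning_rate * grad_w
    b1 -= learning_rate * grad_b
final_cost = total_cost(w1, b1, data)

print(w1, b1)
print(initial_cost, final_cost)
print(a1(w1, b1, 0.0), a1(w1, b1, 1.0))
```

The weight should end up negative and the bias positive, in line with the quiz answer, although plain gradient descent approaches that regime slowly because tanh flattens out for large inputs.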

Question 5

Recall that when we add more neurons to the network, our quantities are upgraded to vectors or matrices.

![image.png](attachment:0a56b08a-db44-4ef1-9b34-7900f2a4838b.png)

![image.png](attachment:ad8f2e89-7f9e-4819-b9c0-919e693d8fd8.png)

In [3]:
import numpy as np

# Define the activation function.
sigma = np.tanh

# Let's use a random initial weight and bias.
W = np.array([[-0.94529712, -0.2667356 , -0.91219181],
              [ 2.05529992,  1.21797092,  0.22914497]])
b = np.array([ 0.61273249,  1.6422662 ])

# define our feed forward function
def a1 (a0) :
  # Notice the next line is almost the same as previously,
  # except we are using matrix multiplication rather than scalar multiplication
  # hence the '@' operator, and not the '*' operator.
  z = W @ a0 + b
  # Everything else is the same though,
  return sigma(z)

# Next, if a training example is,
x = np.array([0.7, 0.6, 0.2])
y = np.array([0.9, 0.6])

# Then the cost function is,
d = a1(x) - y # Vector difference between observed and expected activation
C = d @ d # Absolute value squared of the difference.

print(d)
print(C)

[-1.27261408  0.39910838]
1.7788340952508743


![image.png](attachment:72abd9be-4546-4534-bcd5-b8ffaa4f16af.png)
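For the vector case, the same three chain rule links apply to each output neuron, and the derivatives with respect to all the weights can be collected into a matrix. The sketch below reuses the `W`, `b`, `x` and `y` from the cell above; the outer-product arrangement and the finite-difference check are my own illustration, not the official solution to the question.

```python
import numpy as np

sigma = np.tanh

W = np.array([[-0.94529712, -0.2667356 , -0.91219181],
              [ 2.05529992,  1.21797092,  0.22914497]])
b = np.array([0.61273249, 1.6422662])
x = np.array([0.7, 0.6, 0.2])
y = np.array([0.9, 0.6])

def cost(W, b):
    d = sigma(W @ x + b) - y
    return d @ d

# Chain rule links, evaluated element by element for each output neuron
z = W @ x + b
dCda = 2 * (sigma(z) - y)      # derivative of cost with respect to activation
dadz = 1 / np.cosh(z) ** 2     # derivative of tanh
delta = dCda * dadz            # dC/dz for each output neuron

dCdW = np.outer(delta, x)      # entry (i, j) is dC/dz_i * dz_i/dW_ij = delta_i * x_j
dCdb = delta                   # dz_i/db_i = 1

# Finite-difference check of every entry of dC/dW
h = 1e-6
dCdW_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = h
        dCdW_numeric[i, j] = (cost(W + step, b) - cost(W - step, b)) / (2 * h)

print(dCdW)
print(dCdW_numeric)
```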

Question 6

Let's now consider a neural network with hidden layers.

![image.png](attachment:6ed50626-92bf-4578-bde0-0cda35271378.png)

![image.png](attachment:f268ecc1-355d-46a0-a39c-5987111c7014.png)

![image.png](attachment:412f3e85-4b84-400b-90c7-12725ce916ed.png)

![image.png](attachment:93af9922-29fa-496b-b141-8be81c3c3583.png)

# Important note

Review the back propagation notebook.


# Neural network applied with the chain rule

The classic method to train a network is backpropagation.

When given some input, we want it to pass through our network (with all its weights and biases) and produce something as close as possible to our desired output.

To achieve this, we first set the weights and biases to random values, which gives us a certain output.

And how can we fine-tune these weights and biases to give us the best possible output? We simply define a cost function (C):

$C = \sum_i(a_i^{(L)} - y_i)^2$

which is simply the sum of the squares of the differences between the desired output (y) and the output our untrained network gives us (a).

Now what do we do? We minimize this cost function C! When this function is at its minimum, it means our network is giving us the output closest to the one we desire (y).

As we have seen before, we can head towards the minimum of a multidimensional system with the Jacobian, that is, with the derivatives of the cost function with respect to all relevant variables.

Imagine a network with one input and one output:

$a^{(1)} = \sigma(wa^{(0)} + b)$

$C = (a^{(1)} - y)^2$

The derivatives of C would be with respect to w and b (note the chain rule here):

$\frac{dC}{dw} = \frac{\partial C}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial w}$

$\frac{dC}{db} = \frac{\partial C}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial b}$

Note that it's often convenient to add one more function to our chain rule expression, so that the weighted sum and the activation function can be differentiated separately:

$z^{(1)} = wa^{(0)} + b$, with this our equation becomes:

$a^{(1)} = \sigma(z^{(1)})$, and our partial derivatives now need to include our z term:

$\frac{dC}{dw} = \frac{\partial C}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial w}$

$\frac{dC}{db} = \frac{\partial C}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial b}$
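
As a worked instance of these two chains, each link can be written out explicitly. This is a sketch that uses the cost $C = (a^{(1)} - y)^2$ from above and assumes, as an illustrative choice, $\sigma = \tanh$ (the activation used in the quiz code):

$\frac{\partial C}{\partial a^{(1)}} = 2(a^{(1)} - y)$

$\frac{\partial a^{(1)}}{\partial z^{(1)}} = \sigma'(z^{(1)}) = \frac{1}{\cosh^2(z^{(1)})}$

$\frac{\partial z^{(1)}}{\partial w} = a^{(0)}$

$\frac{\partial z^{(1)}}{\partial b} = 1$

Multiplying the links together gives:

$\frac{dC}{dw} = 2(a^{(1)} - y) \frac{1}{\cosh^2(z^{(1)})} a^{(0)}$

$\frac{dC}{db} = 2(a^{(1)} - y) \frac{1}{\cosh^2(z^{(1)})}$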

Things become harder from here on with the generalized form of what has been exemplified above, but the logic is the same:

$z^{(L)} = W^{(L)} \cdot a^{(L-1)} + b^{(L)}$

$a^{(L)} = \sigma(z^{(L)})$

$C = \sum_i(a_i^{(L)} - y_i)^2$
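
To show how the chain extends backwards through a hidden layer, here is a compact sketch for a network with one hidden layer. Everything specific in it is an assumption made for illustration: the layer sizes (4, 3, 2), the random initial values, tanh as $\sigma$, and the function names. The analytic gradients are compared against finite differences, which is a common way to catch mistakes in a hand-derived chain rule.

```python
import numpy as np

sigma = np.tanh

def d_sigma(z):
    # Derivative of tanh
    return 1 / np.cosh(z) ** 2

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = sigma(z1)
    z2 = W2 @ a1 + b2
    a2 = sigma(z2)
    return z1, a1, z2, a2

def cost(x, y, W1, b1, W2, b2):
    a2 = forward(x, W1, b1, W2, b2)[3]
    return np.sum((a2 - y) ** 2)

def gradients(x, y, W1, b1, W2, b2):
    z1, a1, z2, a2 = forward(x, W1, b1, W2, b2)
    # Output layer: dC/dz2 = dC/da2 * da2/dz2
    delta2 = 2 * (a2 - y) * d_sigma(z2)
    # Hidden layer: dz2/da1 = W2, so the chain passes back through W2 transposed
    delta1 = (W2.T @ delta2) * d_sigma(z1)
    # dz/dW is the layer's input and dz/db is 1
    return np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)   # hypothetical hidden layer
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)   # hypothetical output layer
x = np.array([0.1, 0.5, -0.3, 0.8])
y = np.array([1.0, 0.0])

dW1, db1, dW2, db2 = gradients(x, y, W1, b1, W2, b2)

# Finite-difference check of one first-layer weight
h = 1e-6
step = np.zeros_like(W1)
step[0, 0] = h
numeric = (cost(x, y, W1 + step, b1, W2, b2)
           - cost(x, y, W1 - step, b1, W2, b2)) / (2 * h)

print(dW1[0, 0], numeric)
```

The quantity `delta1` is computed from `delta2`, which is the "working back through the network" that gives backpropagation its name.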