# Demystifying Neural Networks 

---

# Training an ANN

We already said that we were using *Stochastic Gradient Descent* (SGD)
to train the network.  But what is SGD, actually?

The *Stochastic* bit just means that we use a random sample as a batch
at every step in the training.  We have an example of this in our `pytorch` ANN.
But the *Gradient Descent* is more mathematical.

## ANN the function

Whatever an ANN looks like, we can say that it takes a multidimensional input
and produces a multidimensional output.
We generated completely random matrices and they worked as an ANN.
The random matrices produced completely random outputs
but produced outputs in the correct number of dimensions.

In other words, the only difference between an untrained network and
a trained network is the values of the weights.
Well, we kind of knew that already but now we can see it in mathematical terms.

Since we just said that both a trained and an untrained ANN are something
which, given a multidimensional input, gives a multidimensional output,
we can argue that an ANN can be understood as a function:
a function parametrized by all the weights inside all the matrices.
Specifically, we call the action of our ANN $N$ and say:

$$
N_{w_1, w_2, \dots, w_n}: \mathbb{R}^n \rightarrow \mathbb{R}^m
$$

For the case of our ANN dealing with the pulsars dataset we have:

$$
N_{w_1, w_2, \dots, w_n}: \mathbb{R}^8 \rightarrow \mathbb{R}^2
$$

We also say that our ANN is a model, i.e. an estimator:

$$
\hat{\vec{y}} = N_{w_1, w_2, \dots, w_n}(\vec{x})
$$
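
To make the "ANN as a function" idea concrete, here is a minimal sketch in `numpy`.
The layer sizes (8, 4, 2), the sigmoid activation, the absence of biases, and the helper names
`make_weights` and `network` are assumptions for illustration only,
not necessarily the network built earlier.
It only shows that random weights already give a function from $\mathbb{R}^8$ to $\mathbb{R}^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_weights(sizes, rng):
    """One random matrix per layer (hypothetical helper); sizes like [8, 4, 2]."""
    return [rng.standard_normal((n_out, n_in))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def network(weights, x):
    """Apply N_w to the input x: one matrix product and activation per layer."""
    a = x
    for W in weights:
        a = sigmoid(W @ a)
    return a

weights = make_weights([8, 4, 2], rng)  # untrained: completely random weights
x = rng.standard_normal(8)              # one made-up 8-dimensional sample
y_hat = network(weights, x)             # the estimate for that sample
print(y_hat.shape)  # (2,)
```

Training changes only the numbers stored in `weights`; the function `network` itself stays the same.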

Next we imagine that out there exists a perfect model of our data.
We do not know the perfect model but we know the values it would output.
In the case of pulsars we know that for a certain input we have
$1$ for pulsar and $0$ for non-pulsar.
Or more exactly $[0, 1]$ for pulsar and $[1, 0]$ for non-pulsar
since the output is 2-dimensional.  We call this output $\vec{y}$ (the label).

The difference between the correct label and our estimated label is the error
our ANN is making.

$$
E = \vec{y} - \hat{\vec{y}}
$$

There's a problem here though.
Since the error can be positive or negative it is difficult to compare two errors.
Therefore we use the squared error.

$$
SE = (\vec{y} - \hat{\vec{y}})^2
$$

Now we can define the function $F$ as follows:

$$
F_{w_1, w_2, \dots, w_n} = (\vec{y} -  N_{w_1, w_2, \dots, w_n}(\vec{x}))^2
$$

And this has a nice property:
if $F$ decreases our ANN is getting better, if $F$ increases our ANN is getting worse!
So all we need to do is to change the values of the weights until
we get a minimum value of $F$.

## Extending $F$

As we saw when we wrote the SGD ourselves,
we never train the ANN with a single sample.
We train it with a small batch of samples at a time.

Therefore we are not really using a single $\vec{y}$ as the comparison.
Instead we are using several $\vec{y}$ together, which we will call $Y$,
a matrix with each column containing a $\vec{y}$.
In the same way we are not using a single $\vec{x}$ but several samples at a time.
We will write $X$ for a matrix with the $\vec{x}$ as columns.

Finally we write $F$ as:

$$
F_{w_1, w_2, \dots, w_n} = (Y -  N_{w_1, w_2, \dots, w_n}(X))^2
$$

There's a problem with this though.
Since the squared error $(\vec{y} - \hat{\vec{y}})^2$ was defined for a single sample,
the output of $F$ is no longer one error but several.
This is not too difficult to solve: we can just take the mean of all those values,
resulting in yet another approach to $F$:

$$
MSE = F_{w_1, w_2, \dots, w_n} = \text{mean}\left((Y -  N_{w_1, w_2, \dots, w_n}(X))^2\right)
$$

This is often called the *Mean Squared Error* measure.
Other error functions (e.g. cross-entropy) exist but for simplicity
we will stick with MSE.
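
As a small illustration, the MSE of a batch can be sketched in `numpy` as follows.
The labels in `Y` and the estimates in `Y_hat` are made-up numbers,
with one sample per column as in the text, and the function name `mse` is our own choice.

```python
import numpy as np

def mse(Y, Y_hat):
    """Mean of the element-wise squared differences between labels and estimates."""
    return np.mean((Y - Y_hat) ** 2)

# Three made-up samples, one per column: [0, 1] is a pulsar, [1, 0] is not.
Y = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0]])
# Hypothetical outputs of the ANN for the same three samples.
Y_hat = np.array([[0.2, 0.9, 0.4],
                  [0.8, 0.1, 0.6]])

squared_errors = (Y - Y_hat) ** 2  # still several errors, one per entry
loss = mse(Y, Y_hat)               # a single number, roughly 0.137
print(squared_errors.shape, loss)
```

Squaring happens before taking the mean, so positive and negative errors cannot cancel each other.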

## Gradient

Now that we have a function $F$ that measures how badly our ANN behaves,
we could perturb the weights at random until we find a minimum.
Yet, there is a better way.
We can use the following fact.

> The *sign* of the partial derivative of a function with respect to one of its parameters
> points in the direction in which the function is increasing in that dimension.

So, for every single weight we have a possible dimension in which to tune our function.
And for every one of those dimensions (say, dimension $w_1$) we can compute

$$
g_1 = \frac{\partial MSE}{\partial w_1}
$$

And we know that the function increases in the direction of $g_1$.
But we want to find a minimum, so we also know that the function
decreases in the direction of $-g_1$.
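
The sign argument can be checked numerically.
The sketch below estimates $g_1$ with a central finite difference.
It assumes a toy linear model `w @ X` with only two weights as a stand-in for the ANN,
and the names `mse_for_weights` and `partial_derivative` are hypothetical helpers.
Real implementations compute derivatives differently; this is only for intuition.

```python
import numpy as np

def mse_for_weights(w, X, Y):
    """MSE of a toy linear model Y_hat = w @ X (a stand-in for the ANN)."""
    return np.mean((Y - w @ X) ** 2)

def partial_derivative(f, w, index, eps=1e-6):
    """Central finite-difference estimate of the partial derivative of f at w."""
    step = np.zeros_like(w)
    step[index] = eps
    return (f(w + step) - f(w - step)) / (2 * eps)

# Four made-up samples as columns; the row of ones lets w[1] act as an offset.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.0, 1.0, 1.0, 1.0]])
Y = 2.0 * X[0] + 1.0          # labels produced by the "perfect model"
w = np.array([0.0, 0.0])      # current (bad) weights

f = lambda weights: mse_for_weights(weights, X, Y)
g_1 = partial_derivative(f, w, 0)   # negative here: MSE falls as the first weight grows

w_new = w.copy()
w_new[0] -= 0.01 * g_1              # small step in the direction of -g_1
print(g_1, f(w), f(w_new))
```

Since $g_1$ is negative at this point, stepping along $-g_1$ increases the first weight, and the MSE drops.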

This technique is called *Gradient Descent* because it is described through
the calculation of the gradient.  The gradient is:

$$
\nabla MSE_{w_1, w_2, \dots, w_n} =
\left[
\frac{\partial MSE}{\partial w_1},
\frac{\partial MSE}{\partial w_2},
\dots,
\frac{\partial MSE}{\partial w_n}
\right]
$$

In other words, the gradient collects the partial derivatives with respect to every single weight.
The negative of the gradient therefore gives us the direction in which to go in order to make the ANN
perform better; it does not tell us how far we need to go though.
Since the gradient may be quite large sometimes, we multiply it by a small constant
to make sure we do not wander too far.  This small constant is called the *learning rate*.

If we could flatten all the weights into a single array, we could write the update as

$$
[w_1, w_2, \dots, w_n] = [w_1, w_2, \dots, w_n] - \alpha \cdot \nabla MSE_{w_1, w_2, \dots, w_n}
$$

Where $\alpha$ is the learning rate.
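
The update rule can be sketched as a short loop.
This is a minimal illustration under several assumptions:
a toy linear model with two weights instead of a full ANN,
a gradient derived by hand for that model,
noiseless made-up data,
and a batch size (16), learning rate (0.1) and step count (500) picked arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data from a "perfect model" y = 2x + 1; samples are columns of X.
x = rng.uniform(-1.0, 1.0, size=200)
X = np.vstack([x, np.ones_like(x)])  # shape (2, 200); row of ones for the offset
Y = 2.0 * x + 1.0

def mse(w, X, Y):
    return np.mean((Y - w @ X) ** 2)

def gradient(w, X, Y):
    """Hand-derived gradient of the MSE for the toy model Y_hat = w @ X."""
    residual = Y - w @ X
    return -2.0 * (X @ residual) / Y.size

w = np.zeros(2)   # the flattened weights [w_1, w_2]
alpha = 0.1       # learning rate
initial_loss = mse(w, X, Y)

for step in range(500):
    # The stochastic part: a random batch of samples at every step.
    batch = rng.choice(Y.size, size=16, replace=False)
    # The gradient descent part: move against the gradient, scaled by alpha.
    w = w - alpha * gradient(w, X[:, batch], Y[batch])

final_loss = mse(w, X, Y)
print(initial_loss, final_loss, w)
```

With these settings the weights should end up close to $[2, 1]$, the values used to generate the data.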

Note: the actual implementation keeps the gradients together with the weights,
in the same matrices.  Next we will look at `autograd` which is an implementation
that allows one to calculate the gradients and extends `numpy` arrays to
keep the computed gradients together with the weight matrices.