### Simple NN Layer Transform

A basic neural network layer transforms input data $x$ with a function such as

$f(W, x) = \mathrm{relu}(W \cdot x + b)$

$W$ and $b$ are the *weights* or *trainable parameters* of the layer.
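
The transform above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the helper names `relu` and `dense` and the example values for `W`, `b`, and `x` are assumptions made up for this sketch.

```python
def relu(values):
    """Element-wise max(v, 0) over a list of floats."""
    return [max(v, 0.0) for v in values]


def dense(W, b, x):
    """Compute relu(W . x + b) for a weight matrix W (list of rows),
    a bias vector b, and an input vector x."""
    pre_activation = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]
    return relu(pre_activation)


# Hypothetical example values: 3 inputs, 2 outputs.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, -1.0]
x = [1.0, 2.0, 3.0]

output = dense(W, b, x)
print(output)  # [0.0, 2.0]
```

The first output is clipped to zero because its pre-activation value is negative ($1 - 3 = -2$); the second passes through unchanged ($3 - 1 = 2$).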

### Training

Initially, `W` is populated with small random values (*random initialization*). These values are then adjusted iteratively through a *training* process.

A typical *training loop* follows these steps:
1. Draw a batch of training samples $x$ and corresponding targets $y$
2. Run the network on $x$ (a step called *forward pass*) to obtain predictions $y_{pred}$
3. Compute the loss of the network on the batch, a measure of the mismatch between $y_{pred}$ and $y$
4. Update all weights of the network in a way that slightly reduces the loss on the batch.

To perform step 4, take advantage of the fact that the layer transformations are *differentiable*: compute the *gradient* of the *loss* with respect to `W`, then move the weights in the opposite direction of the gradient to reduce the *loss*.

If $x$ and $y$ are fixed, i.e. if we consider only the batch at hand, the loss becomes a function of $W$ alone:

$l(W) = \mathrm{loss}(f(W, x), y)$

and

$W_1 = W_0 - s \cdot \nabla l(W_0)$

where $s$ is a small *step* scaling factor, usually called the *learning rate*.

Computing $\nabla l(W)$ is called the *backward pass*.
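
The four steps of the training loop can be sketched end to end on a toy problem. Everything specific here is an assumption chosen for illustration: a one-weight model $y_{pred} = W \cdot x$, a mean squared error loss, synthetic data generated from the weight 2.0, a batch size of 5, and a hand-derived gradient in place of automatic differentiation.

```python
import random


def forward(W, xs):
    """Forward pass of a one-weight model: y_pred = W * x."""
    return [W * x for x in xs]


def mse_loss(y_pred, ys):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(y_pred, ys)) / len(ys)


def gradient(W, xs, ys):
    """Backward pass: d(loss)/dW for the MSE loss above, derived by hand."""
    return sum(2.0 * (W * x - y) * x for x, y in zip(xs, ys)) / len(ys)


rng = random.Random(0)                      # fixed seed for reproducibility
data_x = [i / 10.0 for i in range(1, 21)]   # hypothetical inputs 0.1 .. 2.0
data_y = [2.0 * x for x in data_x]          # targets generated with weight 2.0

W = rng.uniform(-0.1, 0.1)   # small random initialization
s = 0.1                      # step scaling factor
for step in range(200):
    batch = rng.sample(range(len(data_x)), 5)   # 1. draw a batch
    xs = [data_x[i] for i in batch]
    ys = [data_y[i] for i in batch]
    y_pred = forward(W, xs)                     # 2. forward pass
    loss = mse_loss(y_pred, ys)                 # 3. loss on the batch
    W = W - s * gradient(W, xs, ys)             # 4. step against the gradient

final_loss = mse_loss(forward(W, data_x), data_y)
print(W, final_loss)
```

With these settings `W` should end up very close to 2.0, the weight used to generate the targets. Real networks have many weights and rely on a framework to compute the gradient, but the loop has the same shape.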

### Stochastic Gradient Descent

We want to find the minimum of the differentiable function $l$, which must occur at a point where $\nabla l(W) = 0$. In practice we approach it by repeatedly taking the small steps described in step 4 above, each computed on a randomly drawn batch. This is called *mini-batch stochastic gradient descent*.

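*Momentum* addresses two problems of plain SGD: getting stuck in a local minimum and taking too long to converge. Each update combines the current gradient with the previous update (the *velocity*), so the weights keep moving in a consistent direction.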

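```python
# Toy problem (an assumption for illustration): minimize l(W) = (W - 3)^2.
def loss_and_gradient(W):
    return (W - 3.0) ** 2, 2.0 * (W - 3.0)

W = 0.0
past_velocity = 0.0
momentum = 0.1        # fraction of the previous velocity carried over
learning_rate = 0.1   # step scaling factor
loss, gradient = loss_and_gradient(W)
while loss > 0.01:
    velocity = past_velocity * momentum - learning_rate * gradient
    W = W + momentum * velocity - learning_rate * gradient
    past_velocity = velocity
    loss, gradient = loss_and_gradient(W)
```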

### Backpropagation

A network $f$ may be composed of three tensor operations $f_1, f_2, f_3$ with weight matrices $W_1, W_2, W_3$:

$f(W_1, W_2, W_3) = f_1(W_1, f_2(W_2, f_3(W_3)))$

The chain rule tells us how to differentiate such a composition of functions:

$(f(g(x)))' = f'(g(x)) \cdot g'(x)$
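
The chain rule can be checked numerically. The sketch below picks $f = \sin$ and $g(x) = x^2$ purely as an example (these functions are an assumption, not part of the network above) and compares the chain-rule derivative of $f(g(x))$ with a central finite-difference estimate.

```python
import math


def f(u):
    return math.sin(u)


def f_prime(u):
    return math.cos(u)


def g(x):
    return x ** 2


def g_prime(x):
    return 2.0 * x


x = 1.5
# Chain rule: derivative of f(g(x)) is f'(g(x)) * g'(x).
analytic = f_prime(g(x)) * g_prime(x)

# Central finite difference as an independent estimate.
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)

print(analytic, numeric)
```

The two values should agree to many decimal places; a mismatch would indicate a mistake in one of the hand-written derivatives.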

This rule implies we can start at the final loss value and work backward from the top layers to the bottom layers, applying the chain rule to compute the contribution that each parameter had in the loss value. This is called *backpropagation* or *reverse-mode differentiation*.
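
To make the backward sweep concrete, here is a minimal sketch for a three-layer chain of scalar operations. The choices are assumptions for illustration only: each layer is $f_i(W_i, a) = \tanh(W_i \cdot a)$, the loss is the squared error against a target `y`, and all numeric values are made up. The gradients from the backward pass are compared with finite-difference estimates.

```python
import math


def forward(W1, W2, W3, x):
    """Forward pass; returns every intermediate value for reuse in backward."""
    a3 = math.tanh(W3 * x)    # f_3: bottom layer, closest to the input
    a2 = math.tanh(W2 * a3)   # f_2
    a1 = math.tanh(W1 * a2)   # f_1: top layer, produces the prediction
    return a3, a2, a1


def loss_fn(W1, W2, W3, x, y):
    """Squared error between the top-layer output and the target."""
    return (forward(W1, W2, W3, x)[2] - y) ** 2


def backward(W1, W2, W3, x, y):
    """Reverse-mode pass: start at the loss and apply the chain rule downward."""
    a3, a2, a1 = forward(W1, W2, W3, x)
    d_a1 = 2.0 * (a1 - y)            # d loss / d a1
    d_z1 = d_a1 * (1.0 - a1 ** 2)    # through the tanh of the top layer
    grad_W1 = d_z1 * a2
    d_a2 = d_z1 * W1                 # signal passed down to f_2
    d_z2 = d_a2 * (1.0 - a2 ** 2)
    grad_W2 = d_z2 * a3
    d_a3 = d_z2 * W2                 # signal passed down to f_3
    d_z3 = d_a3 * (1.0 - a3 ** 2)
    grad_W3 = d_z3 * x
    return grad_W1, grad_W2, grad_W3


# Hypothetical weights, input, and target.
W1, W2, W3, x, y = 0.5, -0.8, 1.2, 0.7, 0.3
grads = backward(W1, W2, W3, x, y)

# Finite-difference estimates of the same three gradients.
h = 1e-6
numeric = (
    (loss_fn(W1 + h, W2, W3, x, y) - loss_fn(W1 - h, W2, W3, x, y)) / (2 * h),
    (loss_fn(W1, W2 + h, W3, x, y) - loss_fn(W1, W2 - h, W3, x, y)) / (2 * h),
    (loss_fn(W1, W2, W3 + h, x, y) - loss_fn(W1, W2, W3 - h, x, y)) / (2 * h),
)
max_error = max(abs(g - n) for g, n in zip(grads, numeric))
print(grads, max_error)
```

Note how each `d_a*` value is computed once and reused by the layer below it. That reuse is what makes working backward from the loss cheaper than differentiating with respect to each weight separately.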