%run ../../common/import_all.py

from common.setup_notebook import set_css_style, setup_matplotlib, config_ipython
config_ipython()
setup_matplotlib()
set_css_style()

# Building blocks of ANNs

The way a neural network is set up changes the way it performs. In general these models have a very large number of parameters (all the weights and biases), and with enough parameters you can fit almost anything, which is why overfitting is a very common problem with neural networks. The discussion here loosely follows the brilliant chapter 2 of Nielsen's wonderful [book](#nielsen), plus some other bits and bobs.

## Data representations: tensors

Tensors are generalisations of vectors and matrices to higher dimensions and furnish a representation which is very convenient for working with neural networks. In fact TensorFlow, Google's machine learning library, devoted especially (but not only) to deep learning, is built on top of them.

Tensors are such that:

* a 0D tensor (tensor of rank 0) is a scalar (a number)
* a 1D tensor (tensor of rank 1) is a vector
* a 2D tensor (tensor of rank 2) is a matrix
* a 3D tensor is ... a 3D tensor
* ...

Each dimension of a tensor is called an *axis*, and you can create tensors of any rank with NumPy. Note that from the mathematical and physical points of view this is an abuse of language: tensors aren't simply higher-order matrices, they are objects that obey specific transformation laws.
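As a minimal sketch of the list above, the following builds NumPy arrays of rank 0 to 3 and inspects their number of axes. The variable names are purely illustrative.

```python
import numpy as np

scalar = np.array(5.0)               # rank 0: a single number
vector = np.array([1.0, 2.0, 3.0])   # rank 1: one axis
matrix = np.ones((2, 3))             # rank 2: two axes
cube = np.zeros((4, 2, 3))           # rank 3: three axes

# ndim is the number of axes (the rank), shape is the size along each axis
for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, "rank:", t.ndim, "shape:", t.shape)
```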

When passing a training set to a deep model, it is customary to stack the samples into a single tensor, as this representation is very convenient.

Suppose you're working on an image task, for instance. Greyscale images are matrices, that is, tensors of rank 2. The typical way to pass a training set of such images to a deep network is to build a tensor of rank 3 in which the first axis indexes the sample. The same goes for colour images, which are tensors of rank 3, the third dimension being the colour channel; a set of those is a tensor of rank 4 where the first axis is the sample, the second and third are height and width, and the fourth is the colour channel.

Note that TensorFlow uses the convention described here, while Theano puts the colour channel on the second axis, followed by height and width. The same structure is employed whatever the dimensionality of the data: you build a tensor whose first axis indexes the samples. A set of videos goes in a 5D tensor, as each video is a sequence of frames and each frame is an image!
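A small sketch of these layouts with NumPy, using random arrays as stand-ins for real images (the sizes 10 and 28×28 are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten hypothetical 28x28 greyscale images, each a rank-2 tensor
grey_images = [rng.random((28, 28)) for _ in range(10)]
grey_batch = np.stack(grey_images, axis=0)     # (sample, height, width)

# Ten hypothetical colour images, channel on the last axis
colour_batch = rng.random((10, 28, 28, 3))     # (sample, height, width, channel)

# The same data with the channel on the second axis instead
channels_first = np.transpose(colour_batch, (0, 3, 1, 2))  # (sample, channel, height, width)

print(grey_batch.shape, colour_batch.shape, channels_first.shape)
```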

### Operations

Operations on tensors are the scaled up versions of operations on vectors and matrices. 

### Tensor broadcasting

Broadcasting is the procedure that makes it possible to compute operations over tensors of different rank, for instance an addition between a vector and a matrix. What broadcasting does is "extend" the smaller of the tensors by replicating it across the missing axes so that it matches the shape of the other tensor. For example, if you want to add the vector $v = (1, 2, 2)$ to the matrix $A = \begin{bmatrix}
    2  & 3 & 1 \\
    1  & 1 & 1
\end{bmatrix}
$, you replicate the vector over the missing axis to build a matrix, effectively adding $V = \begin{bmatrix}
    1  & 2 & 2 \\
    1  & 2 & 2
\end{bmatrix}$ to $A$, which yields $\begin{bmatrix}
    3  & 5 & 3 \\
    2  & 3 & 3
\end{bmatrix}$.
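The same example can be reproduced in NumPy, which broadcasts automatically. The explicit `np.broadcast_to` call is only there to show the replicated matrix $V$.

```python
import numpy as np

v = np.array([1, 2, 2])
A = np.array([[2, 3, 1],
              [1, 1, 1]])

# Explicit replication of v along the missing (row) axis
V = np.broadcast_to(v, A.shape)

# NumPy performs the same replication implicitly
result = A + v

print(V)
print(result)
```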

## The cost function and the optimiser

The cost function you will use with neural networks depends on the task at hand. Cross-entropy costs are good for classification tasks, squared errors for regression tasks.

The cost function (also called the loss function) gets minimised by an optimiser, which is the algorithm that numerically searches for the minimum. The optimiser is gradient descent or a variation of it, and in a typical implementation you get to choose which one to use.
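To make the role of the optimiser concrete, here is a minimal sketch of plain gradient descent on a toy one-parameter cost, $f(w) = (w - 3)^2$. The cost, the starting point and the learning rate are all illustrative assumptions, not something a real network would use.

```python
def toy_cost(w):
    """Toy cost with its minimum at w = 3."""
    return (w - 3.0) ** 2

def toy_cost_gradient(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
learning_rate = 0.1  # step size

for _ in range(100):
    # Move against the gradient, by an amount proportional to it
    w -= learning_rate * toy_cost_gradient(w)

print(w, toy_cost(w))
```

Variations of gradient descent change how the step is computed, but the loop has the same shape.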

## The backpropagation algorithm

The backpropagation algorithm is the core of how artificial neural networks manage to *learn*: it iteratively corrects the error between the actual value and the predicted value by propagating it backwards through the network. The original paper proposing this now universally adopted mechanism is a [Nature paper](#paper-original) from 1986 by Rumelhart, Hinton and Williams.

Backpropagation is a brilliant way to train a network: each weight is perturbed iteratively by an amount proportional to the partial derivative of the cost function with respect to it, and these derivatives are propagated backwards through the network. This is what gradient descent needs in order to train the network, reducing the error between what gets predicted and what is actually observed.

The idea per se is simple, but the implementation is hard and it took some research to figure out an efficient mechanism for it, which arrived with the 1986 paper by Rumelhart and colleagues. From that point on, interest in AI research resurged.

The reason why backpropagation is the core of the learning procedure of neural networks is that, by adjusting the weights through small, repeated kicks, the hidden layers of the network come to *learn* features. While what happens at the input and output layers is controllable, it is the hidden layer(s) that do all the painstaking work of representing the features of the input data. If a network had no hidden layer, it would be easy to change the weights in such a way that the output matches the expected output, but the network wouldn't be learning and wouldn't do anything worth getting excited about. It is via backpropagation that the network can learn, in its hidden neurons, how to represent the data.

### The procedure in detail

#### Prologue: gradient descent

The notes here follow both the original paper cited above and [this very helpful paragraph on Wikipedia](https://en.wikipedia.org/wiki/Backpropagation#Finding_the_derivative_of_the_error) about the topic, and refer to a feedforward network of sigmoid neurons. The backpropagation procedure applies to a generic activation function, so long as it is differentiable, but the sigmoid makes for very nice calculations.

Let's consider the transmission of information to a neuron $k$ in the $l$-th layer. We will use $i$ to indicate an input and $o$ to indicate an output (the letter $i$ also serves as the index of the neurons in the previous layer), and we will treat the bias as a further weight. The neuron receives input from all the neurons in the previous, $(l-1)$-th, layer as a weighted combination of their outputs:

$$
i_k^l = \sum_i o_i^{l-1} w_{ik}^{l-1,l} 
$$

where the superscript indicates the layer we are referring to. Note that $w_{ik}^{l-1,l}$ represents the weight of the connection between neuron $i$ in layer $l-1$ and our reference neuron $k$ in layer $l$.

The output of $k$ is obtained by applying the activation function to this input. Using a sigmoid, as per tradition (this will prove to be a very convenient choice later on), we have

$$
o_k^l = \frac{1}{1 + e^{- \sum_i o_i^{l-1} w_{ik}^{l-1,l}}}
$$
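The two formulas above can be sketched in NumPy for a whole layer at once. The function names, weights and inputs below are hypothetical; the bias is handled, as in the text, as a further weight attached to a constant input of 1.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(o_prev, W):
    """Outputs of layer l given the outputs of layer l-1.

    o_prev has shape (n_prev,); W[i, k] is the weight of the connection
    from neuron i in layer l-1 to neuron k in layer l.
    """
    i_l = o_prev @ W        # weighted input: sum_i o_i * w_ik
    return sigmoid(i_l)     # output of each neuron in layer l

# Last entry fixed to 1 so that the last row of W plays the role of the bias
o_prev = np.array([0.5, 0.1, 1.0])
W = np.array([[0.2, -0.4],
              [0.7,  0.1],
              [0.0,  0.3]])

o_l = layer_forward(o_prev, W)
print(o_l)
```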

The goal of training the network is finding the weights such that the network output is near the expected one. For this goal, we have to define a cost function which measures the difference between expected and obtained result, and minimise it. Now, the cost function can be given as the mean squared error

$$
E = \frac{1}{2n} \sum_i^n [o_i - e_i]^2 \ ,
$$

where the sum goes over all training samples, $o$ is the network output for the sample and $e$ the expected output. The cost is minimised via gradient descent, which requires calculating its partial derivatives with respect to all the weights (its parameters).
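A direct transcription of this cost into NumPy, for scalar outputs and with made-up numbers, might look as follows (`mse_cost` is an illustrative name):

```python
import numpy as np

def mse_cost(outputs, expected):
    """Mean squared error with the 1/(2n) normalisation used above."""
    outputs = np.asarray(outputs, dtype=float)
    expected = np.asarray(expected, dtype=float)
    n = outputs.shape[0]    # number of training samples
    return np.sum((outputs - expected) ** 2) / (2 * n)

# Three hypothetical samples: network outputs versus expected outputs
cost = mse_cost([0.9, 0.2, 0.6], [1.0, 0.0, 1.0])
print(cost)  # (0.01 + 0.04 + 0.16) / 6 = 0.035
```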

Note that the cost function could in principle be given differently, as long as it satisfies two requirements:

1. it is differentiable
2. it can be written as a sum of contributions (or, we can say as a mean) of the single training data points

The first requirement is needed because gradient descent has to compute derivatives of the cost function; the second because the calculation is carried out on a single training data point and then generalised to the whole training set by summing over the data points.

Also note that in practice the version of gradient descent applied is the stochastic one.

#### Calculating the derivatives of the cost function: backpropagation

Backpropagation propagates the partial derivatives of the cost function with respect to the weights from the output layer back to the first one, iteratively. The derivative of the cost function with respect to a weight is computable via the chain rule (for simplicity, we are not indicating the layer):

$$
\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial i_j} \frac{\partial i_j}{\partial w_{ij}} \ .
$$

Now let's break down the components. For the last one, we have 

$$
\frac{\partial i_j}{\partial w_{ij}} = o_i \ .
$$

For the second one, using the form of the activation function from above, we have (the equality follows from differentiating the sigmoid)

$$
\frac{\partial o_j}{\partial i_j} = o_j (1 - o_j) \ .
$$
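For completeness, the steps of that differentiation are, writing $o_j = 1/(1 + e^{-i_j})$ and noting that $1 - o_j = e^{-i_j}/(1 + e^{-i_j})$:

$$
\frac{\partial o_j}{\partial i_j} = \frac{\partial}{\partial i_j} \frac{1}{1 + e^{-i_j}} = \frac{e^{-i_j}}{(1 + e^{-i_j})^2} = \frac{1}{1 + e^{-i_j}} \cdot \frac{e^{-i_j}}{1 + e^{-i_j}} = o_j (1 - o_j) \ .
$$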

We are left with calculating the first bit, $\frac{\partial E}{\partial o_j}$. The derivative of the cost function with respect to $o_j$ can be calculated if we think of the cost as a function of all outputs of all the neurons in layer $\bar{l}$ which receive input from neuron $j$, 

$$
\frac{\partial E}{\partial o_j} = \frac{\partial E(\{o_k \forall k \in \bar l\})}{\partial o_j} \ ,
$$

which can be written as

$$
\frac{\partial E}{\partial o_j} = \sum_{k \in \bar l} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial o_j} = \sum_{k \in \bar l} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial i_k} \frac{\partial i_k}{\partial o_j} = \sum_{k \in \bar l} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial i_k} w_{jk}
$$

This recursive equation states that the derivative of the cost function with respect to $o_j$ can be calculated from the same derivatives with respect to the outputs of the neurons in the next layer.

Defining $\delta_j = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial i_j}$, all of this can be written as

$$
\frac{\partial E}{\partial w_{ij}} = o_i \delta_j \ ,
$$

with

$$
\delta_j = o_j (1-o_j) \sum_{k \in \bar l} \delta_k w_{jk} \ .
$$

In the case of a neuron in the output layer things are much easier to compute: for a single training sample with cost $\frac{1}{2}(o_j - e_j)^2$, this leads to $\delta_j = (o_j-e_j)o_j(1-o_j)$.

These expressions encapsulate the essence of backpropagation: you calculate the difference between the expected output and the obtained one and, starting from the output layer, you propagate the derivatives back through the network. $\delta$ is the error we are propagating. We have to start from where we can compute it directly (the output layer) and then iteratively go backwards, using the $\delta$ of each layer to compute the $\delta$ of the layer that precedes it.

Following the gradient descent procedure, we need to update the parameters of the cost function (the weights) by perturbing them by an amount proportional to the derivative, the constant of proportionality being the learning rate $\alpha$:

$$
\Delta w_{ij} = - \alpha o_i \delta_j
$$
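The following is a minimal sketch of these formulas for a tiny network with one hidden layer, on a single training sample with cost $\frac{1}{2}\sum_j (o_j - e_j)^2$. The network size (3 inputs, 4 hidden neurons, 2 outputs), the random weights, the sample and all the function names are assumptions made for illustration, and explicit biases are left out. A finite-difference check on one weight is included to verify the backpropagated gradient.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, W2):
    """Hidden and output activations for one sample."""
    h = sigmoid(x @ W1)     # outputs of the hidden layer
    o = sigmoid(h @ W2)     # outputs of the output layer
    return h, o

def cost(x, e, W1, W2):
    """Squared error for a single sample, E = 1/2 * sum_j (o_j - e_j)^2."""
    _, o = forward(x, W1, W2)
    return 0.5 * np.sum((o - e) ** 2)

def backprop(x, e, W1, W2):
    """Gradients dE/dW1 and dE/dW2 via the delta recursion."""
    h, o = forward(x, W1, W2)
    # Output layer: delta_j = (o_j - e_j) o_j (1 - o_j)
    delta_out = (o - e) * o * (1 - o)
    # Hidden layer: delta_j = o_j (1 - o_j) sum_k delta_k w_jk
    delta_hid = h * (1 - h) * (W2 @ delta_out)
    # dE/dw_ij = o_i delta_j, for every pair (i, j) at once
    grad_W2 = np.outer(h, delta_out)
    grad_W1 = np.outer(x, delta_hid)
    return grad_W1, grad_W2

rng = np.random.default_rng(0)
x = np.array([0.5, -0.2, 1.0])     # hypothetical input sample
e = np.array([1.0, 0.0])           # hypothetical expected output
W1 = rng.normal(size=(3, 4))       # input -> hidden weights
W2 = rng.normal(size=(4, 2))       # hidden -> output weights

grad_W1, grad_W2 = backprop(x, e, W1, W2)

# Finite-difference check of dE/dW1[0, 1]
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 1] += eps
W1_minus[0, 1] -= eps
numeric = (cost(x, e, W1_plus, W2) - cost(x, e, W1_minus, W2)) / (2 * eps)
print(numeric, grad_W1[0, 1])

# One gradient descent step: Delta w_ij = -alpha o_i delta_j
alpha = 0.1
cost_before = cost(x, e, W1, W2)
W1_new = W1 - alpha * grad_W1
W2_new = W2 - alpha * grad_W2
cost_after = cost(x, e, W1_new, W2_new)
print(cost_before, cost_after)
```

The finite-difference value and the backpropagated one should agree to several decimal places, and for a small enough learning rate the cost after the update is lower than before.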

## Regularising a network

To tackle overfitting, regularisation is a common choice, as in machine learning tasks in general. One can apply $L_2$ or $L_1$ regularisation terms as usual; another common choice in neural networks is the so-called *dropout*, which works by modifying the network itself.

You start with the whole network as it is and temporarily remove some of the neurons in the hidden layers (call them "dropout neurons"), choosing them at random. You train the thinned network as normal on a batch of data and then repeat the procedure, choosing another set of dropout neurons. Because only some of the original hidden neurons are active during training, their outputs have to be rescaled to compensate. The whole mechanism behaves like an average over the training of many different networks: it reduces overfitting because each thinned network has fewer hidden neurons, and hence less complexity to learn with, and because the results of these networks are averaged.
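As a rough sketch, here is dropout applied to the outputs of one hidden layer. It uses the "inverted dropout" variant, which is an assumption on our part: the kept outputs are rescaled by $1/p$ during training (with $p$ the probability of keeping a neuron), rather than adjusting the weights afterwards. The function name and the numbers are illustrative.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Inverted dropout on a vector of hidden-layer outputs.

    Each neuron is kept with probability keep_prob; kept outputs are
    divided by keep_prob so that the expected output is unchanged.
    """
    mask = rng.random(activations.shape) < keep_prob   # True = neuron kept
    return activations * mask / keep_prob, mask

rng = np.random.default_rng(0)
hidden = np.ones(10_000)    # hypothetical hidden-layer outputs, all equal to 1

dropped, mask = dropout(hidden, keep_prob=0.5, rng=rng)

print("fraction of neurons kept:", mask.mean())
print("mean output after dropout:", dropped.mean())
```

A fresh mask would be drawn at every training step; at prediction time the full network is used with no mask.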

## Weights initialisation

The easiest way to initialise the network weights is to draw them at random from a Gaussian distribution. If you choose a Gaussian with standard deviation 1 for all the weights, the variable $\sum_j w_j x_j + b$ ends up following a very broad Gaussian, because it sums many independent terms. This makes it easy for neurons to saturate: with such a broad distribution the probability of large absolute values is not small, so the result of the sigmoid function will easily be close to 0 or 1.

To prevent this, the usual choice is to draw the weights from a Gaussian with standard deviation equal to $\frac{a}{\sqrt{n_{in}}}$, where $a$ is a constant and $n_{in}$ is the number of input weights of the neuron.
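The effect can be checked numerically. This sketch assumes a neuron with $n_{in} = 1000$ inputs all equal to 1, takes $a = 1$ and ignores the bias; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000         # number of input weights of the neuron
n_trials = 2000     # number of independent random initialisations
x = np.ones(n_in)   # hypothetical input: every feature equal to 1

# Weights drawn with standard deviation 1
w_broad = rng.normal(0.0, 1.0, size=(n_trials, n_in))
z_broad = w_broad @ x

# Weights drawn with standard deviation 1 / sqrt(n_in)
w_narrow = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_trials, n_in))
z_narrow = w_narrow @ x

# Spread of the weighted input under the two schemes
print("sd = 1:            ", z_broad.std())    # about sqrt(1000), i.e. roughly 32
print("sd = 1/sqrt(n_in): ", z_narrow.std())   # about 1
```

With the first scheme the weighted input routinely lands far out in the flat tails of the sigmoid; with the second it stays in the region where the sigmoid is responsive.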

## Data augmentation

The main reason why neural networks typically require lots of training data is that they have to learn so many parameters. Augmenting the training set by perturbing it to create new, artificial data points is a common trick. For instance, in the case of images, you can slightly rotate them to create new ones.
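A minimal sketch of the idea with plain NumPy follows. Small rotations need an image-processing library, so this example swaps in two other simple perturbations, a horizontal flip and a one-pixel shift; the function name and the tiny "image" are hypothetical.

```python
import numpy as np

def augment(image):
    """Return the original image plus two perturbed copies."""
    flipped = np.fliplr(image)                  # mirror left-right
    shifted = np.roll(image, shift=1, axis=1)   # shift one pixel right (wraps around)
    return [image, flipped, shifted]

# A tiny 3x4 stand-in for a greyscale image
image = np.arange(12).reshape(3, 4)

# One original sample becomes three training samples
augmented = np.stack(augment(image))
print(augmented.shape)
print(augmented[1][0], augmented[2][0])
```

Note that `np.roll` wraps pixels around the edge, which is acceptable for a sketch but not necessarily for real images.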

## Choosing hyperparameters

### How to choose the number of neurons in layers?

There isn't a recipe. In general, more neurons allow the network to learn more sophisticated patterns in the input data, and hence to fit it more accurately. But too many neurons risk overfitting (besides being more expensive to train, of course).

## References

### General

1. <a name="nielsen"></a> M Nielsen, [**Neural networks and deep learning**](http://neuralnetworksanddeeplearning.com/), 2017

### On tensors

1. [**TensorFlow on tensors**](https://www.tensorflow.org/programmers_guide/tensors)
2. [**TensorFlow on broadcasting**](https://www.tensorflow.org/performance/xla/broadcasting)

### On backpropagation

1. <a name="paper-original"></a> D E Rumelhart, G E Hinton, R J Williams, [**Learning representations by back-propagating errors**](http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf), *Nature*, 323.6088, 1986
2. M Nielsen, [**Neural networks and deep learning**](http://neuralnetworksanddeeplearning.com/chap2.html), 2017