# Deep Neural Networks

## The Layer

In previous sections we saw a simple linear regression model. It had two inputs and a single output. This is a simple neural network with very little depth.

Most real-life dependencies cannot be modeled with a simple linear combination. Because we want to be better forecasters, we need better models.

Most of the time this means working with a model more sophisticated than a linear model. Mixing linear combinations and non-linearities allows us to model arbitrary functions.

Our model now starts with inputs that are linearly combined and then passed through a non-linear transformation to produce the outputs.

An example of non-linearity is the sigmoid function:

$sigmoid(x) = \sigma(x) = \frac{1}{1+e^{-x}}$

The initial linear combination and the added non-linearity are called a 'layer'. The layer is the building block of neural networks.
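As a minimal sketch of a single layer (assuming NumPy; the names `layer`, `x`, `w`, `b` and the numbers are made up for illustration), the linear combination and the sigmoid can be written as:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def layer(x, w, b):
    """One layer: a linear combination followed by a non-linearity."""
    return sigmoid(x @ w + b)

# Hypothetical values: 2 inputs feeding 1 output unit.
x = np.array([[1.0, -2.0]])    # shape (1, 2)
w = np.array([[0.5], [0.25]])  # shape (2, 1)
b = np.array([0.0])

# Linear part: 1.0 * 0.5 + (-2.0) * 0.25 = 0, and sigmoid(0) = 0.5.
out = layer(x, w, b)
print(out)
```

The output has one value per unit in the layer, here a single number squashed into the range (0, 1).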

When we have more than one layer, we are talking about a deep neural network.

## What is a Deep Net?

When you take a number of inputs and run them through a layer, i.e. a linear combination followed by a non-linearity, you get outputs. If you then run those outputs through another layer, and those outputs through another layer, and so on, you get a deep net.

The final layer we build is the output layer. Its values are the ones we compare to the targets.

All layers between the start and end are hidden layers. We call them hidden as we know the inputs and the outputs, but the rest remains hidden.

The building blocks of a hidden layer are called hidden units, or hidden nodes. 

In mathematical terms, if h is the tensor relating to a hidden layer, each hidden unit is an element of that tensor. The number of hidden units (nodes) in each layer is often referred to as the width of the layer. Usually (but not always) we stack layers with the same width so the width is the same across the entire network. 

Depth refers to the number of hidden layers in a network. When we create an ML algorithm we choose its width and depth, and we call these 'hyperparameters'. Hyperparameters should not be mistaken for parameters.

Parameters are the weights and biases. Hyperparameters are the width, the depth and the learning rate.

The main difference between the two is that hyperparameters are pre-set by us to run the model, while parameters are found by optimisation.
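To make the distinction concrete, here is a rough sketch in plain Python. The data, the step count and the variable names are all hypothetical. The learning rate is fixed before training, while the weight `w` is found by gradient descent:

```python
# Hyperparameters: chosen by us before training starts (hypothetical values).
learning_rate = 0.1
n_steps = 100

# Parameter: found by optimisation, starting from an arbitrary guess.
w = 0.0

# Toy data generated by the relationship t = 2x (an assumption for this sketch).
xs = [1.0, 2.0, 3.0]
ts = [2.0, 4.0, 6.0]

for _ in range(n_steps):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2.0 * (w * x - t) * x for x, t in zip(xs, ts)) / len(xs)
    w -= learning_rate * grad

print(w)  # very close to 2.0
```

Changing `learning_rate` changes how training proceeds, but it is never updated by the training loop itself.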

## Really understand deep nets

In the TensorFlow model, the inputs of the deep net are declared as `tf.placeholder` objects (a TensorFlow 1.x construct).

Imagine you have a neural net with 8 inputs, possibly characteristics about the weather, such as humidity, sunlight, wind, rainfall etc.

You might pass it through a neural net with a width of 9 in its hidden layer. We would combine these inputs linearly and then add in non-linearity for the output.

The linear part is easy to do: each input is multiplied by a weight, $x \times w$, and the results are summed.

We need to get the weights to use, and in this case the weights are in an 8x9 matrix. As the inputs are essentially a 1x8 matrix, we end up with a 1x9 matrix for the output.

This is the number of hidden units we have in the first layer.

In the standard diagrams you see for neural nets, each line represents a mathematical transformation (weight + non-linearity) to get from the inputs to the first layer. 

Non-linearities don't change the shape of the expression, just its linearity. 

As the weights matrix is 8x9, there are 72 weights between the inputs and the first layer. The diagram representing this example therefore has 72 arrows.

Each weight is written as $w_{36}$, where 3 = input unit and 6 = hidden unit.

The hidden unit $h_6$ is based on the weights $w_{16},w_{26},w_{36},w_{46},w_{56},w_{66},w_{76},w_{86}$, which are combined with the eight inputs before the non-linearity is applied.
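The shapes in this example can be checked with a short NumPy sketch. The random values, the zero biases and the choice of sigmoid are assumptions made only for illustration. Note that $h_6$ sits at index 5 because Python counts from 0:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(1, 8))  # 8 inputs as a 1x8 matrix (made-up values)
w = rng.normal(size=(8, 9))  # 8x9 weights matrix, 72 weights in total
b = np.zeros(9)              # biases, assumed to be zero here

a = x @ w + b                # linear combination, shape (1, 9)
h = 1.0 / (1.0 + np.exp(-a)) # sigmoid keeps the shape (1, 9)

# h_6 only uses column 6 of the weights: w_16, w_26, ..., w_86.
a_6 = sum(x[0, i] * w[i, 5] for i in range(8)) + b[5]
h_6 = 1.0 / (1.0 + np.exp(-a_6))

print(a.shape, h.shape, w.size)
print(np.isclose(h_6, h[0, 5]))
```

The second print confirms that computing $h_6$ by hand from its eight weights matches the matrix version.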

With the first hidden layer done, we can use the same logic to apply another layer, taking the 9 hidden units as inputs and applying a linear combination and a non-linear transformation. As this would need a 9x9 weights matrix (81 weights), it is slightly more complex.

We can add as many layers as we want. 

Eventually you get to the final hidden layer, after which you apply the operation to get to the output layer, which can often have a different number of elements than the hidden layer. 

To get to an output layer with, say, 4 units, you would need a 9x4 weights matrix, i.e. 36 weights.

Our optimisation goal is to find the values of the weight matrices that convert inputs to outputs as well as we can. This time we are not using a simple linear model, but a more complex structure that can capture more complicated relationships.
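Putting the whole example together, here is a sketch of a net with 8 inputs, two hidden layers of width 9 and 4 outputs. The choice of exactly two hidden layers, the random weights, the missing biases and the sigmoid on every layer are assumptions for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes: inputs, hidden, hidden, outputs (two hidden layers assumed).
sizes = [8, 9, 9, 4]

# One weights matrix between each pair of consecutive layers.
shapes = list(zip(sizes[:-1], sizes[1:]))
weight_counts = [rows * cols for rows, cols in shapes]
total_weights = sum(weight_counts)

rng = np.random.default_rng(0)
weights = [rng.normal(size=shape) for shape in shapes]

# Forward pass: each layer is a linear combination plus a non-linearity.
h = rng.normal(size=(1, 8))
for w in weights:
    h = sigmoid(h @ w)

print(shapes)         # [(8, 9), (9, 9), (9, 4)]
print(weight_counts)  # [72, 81, 36]
print(total_weights)  # 189
print(h.shape)        # (1, 4)
```

Biases are left out to keep the weight count in line with the text. A real model would normally add one bias per unit.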

## Non-Linearities and their Purpose

Non-linearities are needed so we can break the linearity and represent more complicated relationships. An important consequence of adding non-linearity is the ability to stack layers. 

Stacking layers is the process of placing one layer after another in a meaningful way. 

Stacking layers gains us nothing when we only have linear relationships. A linear transformation of a linear transformation is itself linear, so the stack would just be a very long-winded way of doing a single linear transformation.
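A quick numerical check of this point, assuming NumPy and made-up random weights: two purely linear layers give exactly the same result as one linear layer with combined weights, and the equivalence breaks as soon as a non-linearity (here tanh, chosen arbitrarily) sits between them.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(1, 8))
w1, b1 = rng.normal(size=(8, 9)), rng.normal(size=9)
w2, b2 = rng.normal(size=(9, 4)), rng.normal(size=4)

# Two stacked layers with no activation function.
two_linear_layers = (x @ w1 + b1) @ w2 + b2

# The same mapping written as a single linear layer.
w_combined = w1 @ w2
b_combined = b1 @ w2 + b2
one_linear_layer = x @ w_combined + b_combined

collapses = np.allclose(two_linear_layers, one_linear_layer)

# With a non-linearity in between, no such collapse happens.
with_nonlinearity = np.tanh(x @ w1 + b1) @ w2 + b2
still_equal = np.allclose(with_nonlinearity, one_linear_layer)

print(collapses, still_equal)
```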

To summarise:

In order to have deep nets and find complex relationships through arbitrary functions, we need non-linearities.

## Activation Functions

In a ML context, non-linearities are also called activation functions. This is how we will refer to them. Activation functions transform inputs into outputs of a different kind.

The basic logic behind an activation function is that it can take a linear output and turn it into a binary outcome given certain conditions.

In ML there are different activation functions, but there are a few used more frequently than others:
- Sigmoid (logistic function)
- TanH (hyperbolic tangent)
- ReLU (rectified linear unit)
- Softmax

(the formulas, derivatives, graphs and ranges are in course notes)

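Sigmoid, TanH and ReLU are monotonic and continuous, and they are differentiable everywhere except for ReLU at 0. Softmax is also continuous and differentiable, but it acts on the whole vector rather than on one element at a time.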
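For reference, the three element-wise activations could be sketched in NumPy as follows. The function names and test values are illustrative, and softmax is left out because it does not work element by element:

```python
import numpy as np

def sigmoid(x):
    """Logistic function, outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent, outputs in (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Rectified linear unit, outputs in [0, infinity)."""
    return np.maximum(0.0, x)

a = np.array([-2.0, 0.0, 2.0])  # made-up values

print(sigmoid(a))
print(tanh(a))
print(relu(a))
```

Each function is applied to every element separately, so the output always has the same shape as the input.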

Activation functions are also called transfer functions. The names have similar meanings in ML, but have different meanings in other fields. 

## Softmax Activation

The softmax function considers the information from all elements of its input at once.

Example for a deep neural net:

$ a=xw+b$ (a typical linear combination) <br>
$ y=softmax(a)$ (an activation function)

Given 4 elements in an input layer, 3 elements in a hidden layer, and 3 elements in an output layer, the linear combination of the hidden layer outputs $h$,

$a = hw+b$

would give us an array of three values, such as:

$a = [-0.21,0.47,1.72]$

Using an activation method such as sigmoid, each element of the array would be fed into the sigmoid function:

$sigmoid(a) = [sigmoid(-0.21),sigmoid(0.47),sigmoid(1.72)]$

Softmax is special. Each element in the output depends on the entire set of elements of the input.

$softmax(a)_i = \frac{e^{a_i}}{\sum_je^{a_j}}$

$\sum_je^{a_j} = e^{-0.21}+e^{0.47}+e^{1.72} \approx 8$

$softmax(a) \approx [\frac{e^{-0.21}}{8},\frac{e^{0.47}}{8},\frac{e^{1.72}}{8}]$

$y \approx [0.1,0.2,0.7]$, which is our output layer.
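The same numbers can be reproduced with a small NumPy sketch. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the notes above. It does not change the result:

```python
import numpy as np

def softmax(a):
    """Softmax over a 1-D array: every output depends on every input."""
    shifted = a - np.max(a)  # stability trick (an addition, not in the notes)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

a = np.array([-0.21, 0.47, 1.72])

denominator = np.sum(np.exp(a))  # roughly 8, as in the worked example
y = softmax(a)

print(np.round(y, 1))  # [0.1 0.2 0.7]
```

Changing any single element of `a` changes the denominator, and therefore all three outputs.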

A key point about the softmax transformation is that the outputs are in the range between 0 and 1 and their sum is exactly 1.

Probabilities have exactly the same characteristics.

The softmax transformation transforms a bunch of arbitrarily large or small numbers into a valid probability distribution.

Softmax is often used as the activation of the output layer in classification problems.

## Backpropagation

The process of optimisation consisted of minimising the loss. Our updates were directly related to the partial derivatives of the loss and indirectly related to the errors, or deltas as we called them. 

The deltas were the differences between the targets and the observed outputs. 

Deltas for the hidden layers are trickier to define, but they have a similar meaning.

The procedure for calculating the deltas in hidden layers is called backpropagation of errors. Having these deltas allows us to vary the parameters using the familiar update rule.

Forward propagation is the process of pushing inputs through the net. At the end of each epoch, the obtained outputs are compared to the targets to form the errors. Then we backpropagate through partial derivatives and change each parameter so that the errors at the next epoch are smaller.

For the minimal example, the backpropagation consisted of a single step: adjusting the weights, given the errors we obtained.

When we have a deep net, we must update all the weights related to the input layer and the hidden layers, and we have to take the activation functions into account. To update the weights of a layer we need the errors for that layer, which so far meant the difference between the outputs and the targets. However, we have no targets for the hidden units, so we have no errors...

Backpropagation solves this problem of no errors for hidden layers. We must derive the appropriate updates as if we had targets.

We can trace the contribution of each unit to the error of the output. 

Mathematically, this process is fairly involved: it relies on repeatedly applying the chain rule to the partial derivatives of the loss.
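To give a feel for it, here is a minimal sketch of backpropagation for a net with one hidden layer. Everything specific is an assumption made for illustration: the 2-3-1 layout, the sigmoid hidden layer, the linear output layer, the loss $\frac{1}{2}\sum(y-t)^2$ and the random data. The output deltas are the differences between outputs and targets, and the hidden deltas are obtained by pushing those back through the weights and the derivative of the activation function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2):
    h = sigmoid(x @ w1 + b1)  # hidden layer
    y = h @ w2 + b2           # linear output layer (an assumption)
    return h, y

def loss(y, t):
    return 0.5 * np.sum((y - t) ** 2)

def backward(x, t, w1, b1, w2, b2):
    h, y = forward(x, w1, b1, w2, b2)
    delta_out = y - t                               # deltas at the output
    delta_hid = (delta_out @ w2.T) * h * (1.0 - h)  # deltas at the hidden layer
    return {
        "w1": x.T @ delta_hid, "b1": delta_hid.sum(axis=0),
        "w2": h.T @ delta_out, "b2": delta_out.sum(axis=0),
    }

rng = np.random.default_rng(0)
x, t = rng.normal(size=(4, 2)), rng.normal(size=(4, 1))
w1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
w2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

grads = backward(x, t, w1, b1, w2, b2)

# Sanity check: compare one backpropagated gradient with a finite difference.
eps = 1e-6
w1_plus, w1_minus = w1.copy(), w1.copy()
w1_plus[0, 0] += eps
w1_minus[0, 0] -= eps
numeric = (loss(forward(x, w1_plus, b1, w2, b2)[1], t)
           - loss(forward(x, w1_minus, b1, w2, b2)[1], t)) / (2 * eps)

print(np.isclose(numeric, grads["w1"][0, 0], atol=1e-6))
```

With these gradients, the familiar update rule would subtract the learning rate times each gradient from the corresponding parameter.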

Backpropagation is one of the biggest challenges for the speed of an algorithm.