# Neural Nets

## Set goals and introduce the framework

Think of a model as a black box: we feed it an input and it produces an output.

The input could be a number of factors.

Before we can get comfortable with the output, we must train the model. It is a key part of ML.

Once we have trained the model we can feed it with data and get an output.

Training an algorithm involves four ingredients:
- data
- model
- objective function
- optimisation algorithm

Data:<br>
We must prepare a certain amount of data to train with.

Model:<br>
The simplest kind of model we could use is a linear model. This is just the tip of the iceberg. ML can help us create complex non-linear models that fit data better than a simple linear relationship.

Objective function: <br>
The objective function estimates how correct the model's outputs are on average; we want the outputs to be as close to reality as possible. The entire ML framework boils down to optimising this function.

Optimisation algorithm:<br>
It consists of the mechanics through which we vary the parameters of the model to optimise the objective function.

These are ingredients rather than steps, because training is iterative.



## Training the model

In an ML setting we don't explicitly provide instructions to solve the problem; we just set our goals. The ML process is a game of trial and error. A reasonable optimisation algorithm would not try every possible combination of parameters; it uses the result of each trial to choose the next one.



## Types of ML

There are 3 major types of machine learning:
- supervised
- unsupervised
- reinforcement

Supervised: <br>
We provide the algorithm with inputs and their corresponding desired outputs. Based on this, it learns to produce outputs close to the ones that we are looking for.

Unsupervised: <br>
We provide inputs but there are no target outputs, so we don't tell the algorithm what the answer is. Instead we ask it to find dependencies or patterns in the underlying data. This is useful for grouping data when you don't know what the categories are. It is a lot quicker than using supervised models, which require people to label items beforehand. Humans come in at the end of the model's run and label the resulting groups.

Reinforcement: <br>
We train a model to act in an environment based on rewards it receives, like training a pet. We can train a model to play Super Mario by rewarding it every time it progresses or scores points. 

This course focuses on supervised learning, which can be split into two types:
- classification (provides outputs which are categories)
- regression (provides outputs which are numerical types)





## Linear model

We want the algorithm to find a $y$, given an $x$.
$y = f(x)$

We provide the algorithm with as many pairs of $x$ and $y$ as possible and let it learn the relationship between them.

The linear model is the simplest model. In the linear universe $f(x) = xw + b$, where $x$ is the input, $w$ is the weight and $b$ is the bias.

The linear model can be written in several equivalent forms:
- $wx+b$
- $xw+b$ (we will use this one)
- $x^Tw+b$
- $w^Tx+b$

The goal of the ML algorithm would be to find such values for $w$ and $b$ so that the output of $xw+b$ is as close to the observed values as possible.

E.g. predicting the price of an apartment. $x$ is the size, $y$ is the price. If you have values for $w$ and $b$, you can input any value of $x$ (size) and get a prediction for the price, based on the linear model.
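As a minimal sketch of the apartment example, the snippet below applies $y = xw + b$ to a single size. The function name `predict_price` and the values of `w` and `b` are hypothetical, picked only to make the arithmetic easy to follow; in practice they would be found by training.

```python
def predict_price(size, w, b):
    """Linear model y = x*w + b for a single scalar input."""
    return size * w + b


# Hypothetical parameter values, chosen only for illustration.
w = 2000.0    # price increase per unit of size
b = 50000.0   # base price

price = predict_price(80.0, w, b)
print(price)  # 80 * 2000 + 50000 = 210000.0
```

Once `w` and `b` are fixed, any size can be passed in to get a prediction.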





## Linear model - multiple inputs

What if we had additional information, say the apartment's proximity to the sea? A better model would use both the size of the apartment and its proximity to the beach.

$\text{size} \times \text{size weight} + \text{proximity} \times \text{proximity weight} + \text{bias}$

This is still $y = xw + b$.

$x$ is a 1x2 vector and $w$ is a 2x1 vector. Multiplying the two gives a 1x1 scalar, per matrix mathematics. 
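The shapes can be checked with a short NumPy sketch. The input values, weights and bias below are made up for illustration; only the shapes matter here.

```python
import numpy as np

# One apartment: size and distance to the beach (made-up values).
x = np.array([[80.0, 1.5]])            # shape (1, 2)
w = np.array([[2000.0], [-10000.0]])   # shape (2, 1), hypothetical weights
b = 50000.0                            # hypothetical bias

y = x @ w + b   # (1, 2) @ (2, 1) -> (1, 1)
print(y.shape)  # (1, 1)
```

The `@` operator performs the matrix multiplication, so the two inputs and two weights collapse into a single predicted value.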



## Multiple inputs and multiple outputs

What if we are interested in predicting the purchase price of an apartment, but also the price we can get for renting it out?

Our inputs are unchanged: size and proximity to the beach. We have two outputs this time, so we create two linear models, each with its own weights and bias.

$\text{price} = \text{size} \times \text{size weight} + \text{proximity} \times \text{proximity weight} + \text{bias}$ <br>
$\text{rent} = \text{size} \times \text{size weight} + \text{proximity} \times \text{proximity weight} + \text{bias}$

$y_1 = x_1w_{11}+x_2w_{21}+b_1$ <br>
$y_2 = x_1w_{12}+x_2w_{22}+b_2$

There are two numbers in the indices for the weights. The first relates to the input, the second relates to the output. We have 2 outputs, 2 inputs, 4 weights and 2 biases. The number of weights depends on the inputs and the outputs. 

In general if we have M outputs and K inputs, the number of weights would be KxM and the number of biases is equal to M.

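$\begin{bmatrix} y_1 & y_2 \end{bmatrix} = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} + \begin{bmatrix} b_1 & b_2 \end{bmatrix}$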

This example had only one observation. It can be extended to N observations, where the output is an NxM matrix and M is the number of output variables. The input matrix would be NxK, where K is the number of input variables. The weights matrix stays KxM, as the weights don't change with the number of observations. The same is true for the biases, which stay 1xM. This last point is important: we can feed as much data into our model as we want without changing it, as each model is determined solely by its weights and biases.

In ML, we vary only the weights and the biases but the logic of the model remains the same.
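The shape bookkeeping for N observations can be sketched in NumPy as follows. The sizes `N`, `K`, `M` and the random values are arbitrary assumptions; the point is that `W` and `b` keep their shapes no matter how many rows of data are fed in.

```python
import numpy as np

N, K, M = 5, 2, 2  # observations, inputs, outputs (arbitrary sizes)

rng = np.random.default_rng(0)
X = rng.uniform(size=(N, K))   # inputs:  N x K
W = rng.uniform(size=(K, M))   # weights: K x M
b = rng.uniform(size=(1, M))   # biases:  1 x M, broadcast across all rows

Y = X @ W + b
print(Y.shape)  # (5, 2), i.e. N x M

# Doubling the number of observations leaves W and b untouched.
X_more = rng.uniform(size=(2 * N, K))
Y_more = X_more @ W + b
print(Y_more.shape)  # (10, 2)
```

NumPy broadcasting adds the single 1xM row of biases to every row of the NxM product, which matches the statement that the biases stay 1xM.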



## Graphical representation of neural networks

Linear classifiers are useful if the data is linearly separable, i.e. the data split makes sense when separated by a straight line.

This is not always the case though.

What if you have several categories and you can't separate them with a straight line? In these instances, we must use non-linear models.



## Objective function

The objective function is the measure used to evaluate how well the model's outputs match the desired correct values.

Objective functions are generally split into two types:
- loss functions
- reward functions

Loss functions are also called cost functions. A loss function typically measures the prediction error, so the lower its value, the more accurate the model. These are the functions we work with most often, and they are typically used in supervised learning.

Reward functions are the opposite. The higher the reward function, the higher the accuracy of the model. These are typically used in reinforcement learning. 

This course looks at loss functions most, due to their prevalence in supervised learning.




## Types of loss functions

Two common types of loss functions:
- L2-norm (used in regression)
- Cross-entropy (used in classification)

The following is true for all models, irrespective of their linearity.

The target, T, is the desired value at which we are aiming. Generally we want our output to be as close as possible to T.

### L2-norm

The L2-norm loss is calculated as in the OLS (ordinary least squares) method commonly used in statistics.

$L2norm = \sum_i(y_i-t_i)^2$

'Norm' comes from the fact that it is the squared vector norm, or squared Euclidean distance, between the outputs and the targets.

The lower the error, the lower the loss.
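A minimal sketch of this loss in plain Python, using made-up outputs and targets. The function name `l2_norm_loss` is hypothetical.

```python
def l2_norm_loss(outputs, targets):
    """Sum of squared differences between outputs y_i and targets t_i."""
    return sum((y - t) ** 2 for y, t in zip(outputs, targets))


# Made-up outputs and targets for three observations.
y = [2.0, 3.5, 5.0]
t = [2.0, 3.0, 7.0]

loss = l2_norm_loss(y, t)
print(loss)  # 0 + 0.25 + 4 = 4.25
```

A perfect prediction contributes zero, and larger misses are penalised quadratically.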

### Cross-entropy loss

Cross-entropy = $L(y,t) = -\sum_it_i\ln y_i$

Imagine you had an image classifier that is classifying images as a Cat, a Dog or a Horse.

It would do so by giving each image a score where the targets (t) for each are:
- [1,0,0] for a cat
- [0,1,0] for a dog
- [0,0,1] for a horse

Given output scores (y) as:
- [0.4,0.4,0.2] for a dog
- [0.1,0.2,0.7] for a horse

You can use the cross-entropy formula to compute the loss for each image, remembering that a lower number is better:
- $L(y,t) = -0\times\ln0.4-1\times\ln0.4-0\times\ln0.2 = 0.92$ for the dog
- $L(y,t) = -0\times\ln0.1-0\times\ln0.2-1\times\ln0.7 = 0.36$ for the horse

The model is not sure whether the first image is a dog or a cat, but it is quite sure that the second image is a horse.

Given that the target scores have a number of zeros in them, you can simplify both calculations as follows:
- $L(y,t) = -1\times\ln0.4$
- $L(y,t) = -1\times\ln0.7$
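The two calculations can be reproduced with a short Python sketch. The function name `cross_entropy` is hypothetical, and skipping terms where the target is zero is a choice made here to mirror the simplification above.

```python
import math


def cross_entropy(outputs, targets):
    """L(y, t) = -sum_i t_i * ln(y_i); terms with t_i == 0 contribute nothing."""
    return -sum(t * math.log(y) for y, t in zip(outputs, targets) if t != 0)


dog_loss = cross_entropy([0.4, 0.4, 0.2], [0, 1, 0])
horse_loss = cross_entropy([0.1, 0.2, 0.7], [0, 0, 1])

print(round(dog_loss, 2), round(horse_loss, 2))  # 0.92 0.36
```

The less confident prediction (0.4 for the dog) gets the larger loss.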

Most regression and classification models are solved using these functions. There are others that are used and in general, any function that holds the basic property *'higher for worse results, lower for better results'* can be a loss function. 

## Gradient descent

The simplest and most fundamental optimisation algorithm is gradient descent.

The gradient is the multivariate generalisation of the derivative concept. 

$f(x) = 5x^2+3x-3$

Goal: find the minimum of this function.

The first step is to find the first derivative of the function.

$f'(x) = 10x+3$

The second step is to choose any arbitrary number (e.g. $x_0=4$) and then calculate a different number ($x_1=?$) using the update rule:

$x_{i+1}=x_i-\eta f'(x_i)$

$x_1=4-\eta[10\times4+3] = 4 - 43\eta$

$\eta$ (eta) is the learning rate: the rate at which the ML algorithm forgets old beliefs in favour of new ones.

Using the update rule we can find $x_1, x_2, x_3, \dots, x_n$ 

When the values stop updating, we know the minimum has been reached. At that point the first derivative of the function is zero, so the update term vanishes:

$\eta f'(x_i) = 0$

and the update rule becomes:

$x_{i+1} = x_i$


If $\eta$ is too small, it will take too long to get to the minimum point. If it is too large, the updates overshoot: the values oscillate around the minimum and never settle on it.

Generally, we want the learning rate to be high enough that we reach the closest minimum in a reasonable amount of time, but low enough that we don't oscillate around the minimum.
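The update rule and the effect of the learning rate can be tried out with a small sketch. It uses the derivative $f'(x) = 10x + 3$ and the starting point $x_0 = 4$ from above; the function names, the step counts and the two learning rates are assumptions chosen for illustration.

```python
def f_prime(x):
    """The derivative f'(x) = 10x + 3 used in the text."""
    return 10 * x + 3


def gradient_descent(x0, eta, steps):
    """Apply x_{i+1} = x_i - eta * f'(x_i) a fixed number of times."""
    x = x0
    for _ in range(steps):
        x = x - eta * f_prime(x)
    return x


# f'(x) = 0 at x = -0.3, so that is where the minimum lies.
converged = gradient_descent(4.0, eta=0.01, steps=1000)
oscillating = gradient_descent(4.0, eta=0.2, steps=50)

print(converged)    # very close to -0.3
print(oscillating)  # still as far from -0.3 as the starting point
```

With $\eta = 0.2$ each update maps $x$ to $-x - 0.6$, so the value keeps jumping between roughly $4$ and $-4.6$ and never gets closer to the minimum. Larger values of $\eta$ make the iterates move further away at every step.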

Several key takeaways:
- we can find the minimum by trial and error using gradient descent
- there is an update rule that lets us choose each new trial sensibly and reach the minimum faster
- learning rate should be high enough so we don't iterate forever, but low enough so we don't oscillate forever
- we should stop updating once we've converged







## n-Parameter Gradient Descent

Consider the linear model we have discussed so far:

$xw+b=y$ -> model

Now each output $y_i$ can be represented using the linear model equation, where the input is just the corresponding $x_i$. The weights and the bias remain unchanged.

Using the apartment example, $y_i$ would be the price of a single apartment. The corresponding $x_i$ is information we have about that apartment.

$x_iw+b=y_i$ -> single observation

Therefore, the output $y_i$ is a scalar and is equal to the corresponding $x_i \times w + b$.

We are interested in the target $t_i$ and this is what we will compare the output $y_i$ against.

We need to choose our loss function, denoted as $L(y,t)$.

$L(y,t)$ -> loss <br>
$C(y,t)$ -> cost <br>
$E(y,t)$ -> error <br>

We are going to look into a regression example, so let's choose the L2-norm function (modified slightly with the /2).

loss: $L(y,t) = \frac{L2norm}{2} = \frac{\sum_i(y_i-t_i)^2}{2}$

Dividing by the constant 2 does not move the minimum, and the result still has the basic property of being higher for worse results and lower for better results. It simply cancels the factor of 2 that appears when the square is differentiated.

To perform gradient descent we need old beliefs (the current parameter values), which are updated at each step.

The update rule: $x_{i+1} = x_i-\eta f'(x_i)$

Becomes: $w_{i+1} = w_i - \eta \nabla_wL(y,t)$ for weights

And: $b_{i+1} = b_i - \eta \nabla_bL(y,t)$ for biases

We want to minimise the loss function by varying the weights and the biases. This means we are trying to optimise the loss function with respect to $w$ and $b$.

$\nabla_wL = \sum_i\nabla_w\frac{1}{2}(y_i-t_i)^2 = \sum_ix_i(y_i-t_i) = \sum_ix_i\delta_i$

$\nabla_bL = \sum_i\delta_i$

where $\delta_i = y_i - t_i$ is the difference between the output and the target for observation $i$.
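Putting the update rules and gradients together, the following NumPy sketch trains a one-input linear model, taking $\delta_i = y_i - t_i$ as the derivation implies. The synthetic data (targets generated from $t = 2x + 5$), the learning rate and the number of iterations are assumptions made for illustration.

```python
import numpy as np

# Synthetic data from a known linear relationship (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(100, 1))   # inputs,  N x 1
t = 2.0 * x + 5.0                           # targets, N x 1

w = np.zeros((1, 1))   # weights, K x M
b = np.zeros((1, 1))   # biases,  1 x M
eta = 0.01             # learning rate (hypothetical choice)

for _ in range(1000):
    y = x @ w + b                               # outputs of the linear model
    delta = y - t                               # delta_i = y_i - t_i
    grad_w = x.T @ delta                        # sum_i x_i * delta_i
    grad_b = delta.sum(axis=0, keepdims=True)   # sum_i delta_i
    w = w - eta * grad_w                        # update rule for the weights
    b = b - eta * grad_b                        # update rule for the biases

print(w.ravel(), b.ravel())  # should approach 2 and 5
```

Because the gradients are sums over all observations, a suitable learning rate depends on how much data is used; dividing the loss by the number of observations is a common way to make that choice less sensitive to the dataset size.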

