## Part 1 - Deep Learning from Scratch

### Linear Regression

Linear regression assumes that the true function we are trying to replicate can be written as:

\begin{equation*} 
   y = X\beta + \epsilon
\end{equation*}

In other words, the true function is linear in its parameters $\beta$, with $\epsilon$ an unobserved error term.

The vector of residuals given an estimate of $\beta$ is thus:

\begin{equation*} 
   e = y - X\beta^{OLS}
\end{equation*}

where OLS indicates that this is the Ordinary Least Squares estimate of $\beta$.

By definition, the OLS estimate minimises the sum of squared residuals:

\begin{equation*} 
   e'e = (y - X\beta^{OLS})'(y - X\beta^{OLS})
\end{equation*}

\begin{equation*} 
   e'e = y'y - y'X\beta^{OLS} - \beta^{OLS'}X'y + \beta^{OLS'}X'X\beta^{OLS}
\end{equation*}

Since the transpose of a scalar is the same scalar, $y'X\beta^{OLS} = (y'X\beta^{OLS})' = \beta^{OLS'}X'y$, and we get the Residual Sum of Squares (RSS):

\begin{equation*} 
   RSS = e'e = y'y - 2\beta^{OLS'}X'y + \beta^{OLS'}X'X\beta^{OLS}
\end{equation*}
  
We take the derivative of this with respect to $\beta^{OLS}$ and set it equal to 0:

\begin{equation*} 
   0 = -2X'y + 2X'X\beta^{OLS}
\end{equation*}
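
For reference, the derivative above relies on two standard matrix-calculus identities. They are stated here as a reminder, with $a$ a constant vector and $A$ a symmetric matrix; these two symbols are placeholders and are not used elsewhere in this document:

\begin{equation*}
   \frac{\partial (\beta'a)}{\partial \beta} = a, \qquad \frac{\partial (\beta'A\beta)}{\partial \beta} = 2A\beta
\end{equation*}

Setting $a = X'y$ and $A = X'X$ (which is symmetric) and differentiating the RSS term by term gives $-2X'y + 2X'X\beta^{OLS}$.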

The chart below shows an example of such a linear relationship; we assume that wages are a linear function of height:

```R
X <- runif(100, -5, 5)  # Height
y <- X + rnorm(100) + 3  # Wages
```

The line of best fit here is the line that minimises the sum of squared residuals; it has a slope of 0.95 and an intercept of 2.95.

![Figure1](pic_support/linearreg_0.png)

Assuming that $X'X$ is a positive definite matrix (no variable is a perfect linear combination of the others, and we have at least as many observations as variables), we can solve the first-order condition to get a closed-form solution for $\beta^{OLS}$:

\begin{equation*} 
   \beta^{OLS} = (X'X)^{-1}X'y
\end{equation*}

In R we can run:

```R
X_mat <- as.matrix(X)
# Add column of 1s for intercept coefficient
intcpt <- rep(1, length(y))
# Combine predictors with intercept
X_mat <- cbind(intcpt, X_mat)
# OLS (closed-form solution)
beta_hat <- solve(t(X_mat) %*% X_mat) %*% t(X_mat) %*% y
```

This gives an intercept of 2.95 and a slope of 0.95.
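
As a sanity check, the closed-form estimate can be compared with R's built-in `lm()`. The sketch below regenerates data in the same way as above; the seed and the name `max_diff` are assumptions made for this illustration, so the fitted numbers will not exactly match the 2.95 and 0.95 reported here.

```R
set.seed(42)  # hypothetical seed, chosen only for reproducibility
X <- runif(100, -5, 5)   # Height
y <- X + rnorm(100) + 3  # Wages

# Closed-form OLS with an explicit intercept column
X_mat <- cbind(intcpt = 1, X)
beta_hat <- solve(t(X_mat) %*% X_mat) %*% t(X_mat) %*% y

# Same model fitted by lm(), which adds the intercept itself
lm_coef <- coef(lm(y ~ X))

# Largest absolute difference between the two sets of coefficients
max_diff <- max(abs(as.numeric(beta_hat) - as.numeric(lm_coef)))
print(max_diff)
```

The two estimates should agree up to floating-point error, since `lm()` solves the same least-squares problem (via a QR decomposition rather than an explicit inverse).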

However, we can also use an iterative method known as **Gradient Descent** (GD), a generic method for continuous optimisation. With GD we randomly initialise $\beta^{GD}$, calculate the residual (error), and then move in the opposite direction to the gradient by a small amount proportional to a parameter we call the **learning-rate**. GD is a bit like rolling a ball down a hill: it will gradually converge to a stationary point. If the function is convex, then with a small enough step-size (learning-rate) and a high enough number of iterations we are guaranteed to converge to a global minimiser. **Stochastic Gradient Descent** is usually used for neural-networks, partly because each update is cheaper and partly because the noise in its updates can help it avoid getting stuck in a local minimum of a non-convex cost function (along with other methods).

The general-formula for GD:

1. Find a cost-function
2. Randomly initialise your $\beta$ vector
3. Get the derivative of the cost-function given $\beta$
4. Move the $\beta$ vector in the opposite direction to the gradient, then repeat from step 3 until convergence
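
The steps above can be sketched on a toy problem. The example below is only an illustration: the cost function $(\beta - 3)^2$, the helper `gradient_descent`, the seed, and the two learning-rates are assumptions and are not part of the regression example.

```R
# Toy cost function with its minimum at beta = 3, and its derivative
cost <- function(beta) (beta - 3)^2
cost_grad <- function(beta) 2 * (beta - 3)

# Hypothetical helper: plain gradient descent on a scalar parameter
gradient_descent <- function(grad, beta_init, lr, epochs)
{
  beta <- beta_init
  for (j in 1:epochs)
  {
    # Step against the gradient, scaled by the learning-rate
    beta <- beta - lr * grad(beta)
  }
  beta
}

set.seed(1)            # hypothetical seed
beta_init <- rnorm(1)  # random initialisation

# A small learning-rate converges towards the minimiser
beta_small_lr <- gradient_descent(cost_grad, beta_init, lr = 0.1, epochs = 100)
# A learning-rate that is too large overshoots further at every step
beta_large_lr <- gradient_descent(cost_grad, beta_init, lr = 1.1, epochs = 100)

print(c(cost(beta_small_lr), cost(beta_large_lr)))
```

For this particular cost function each step multiplies the distance to the minimiser by $(1 - 2 \cdot lr)$, which is why a learning-rate of 0.1 shrinks the error while 1.1 makes it grow.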

In the case of this linear-regression:

Our cost-function is the RSS: $y'y - 2\beta^{OLS'}X'y + \beta^{OLS'}X'X\beta^{OLS}$

The derivative of the cost-function is: $-2X'y + 2X'X\beta^{OLS}$

This can be simplified to: $2X'(X\beta^{OLS} - y)$
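
A derivative like this can be checked numerically. The sketch below compares the analytic gradient $2X'(X\beta - y)$ with central finite differences on simulated data; the seed, the test point `beta`, and the helper names are assumptions for illustration.

```R
set.seed(42)  # hypothetical seed
X_mat <- cbind(1, runif(100, -5, 5))
y <- X_mat[, 2] + rnorm(100) + 3

# Cost-function (RSS) and its analytic gradient
rss <- function(beta) sum((y - X_mat %*% beta)^2)
rss_grad <- function(beta) 2 * t(X_mat) %*% (X_mat %*% beta - y)

# Central finite-difference approximation of the gradient
numeric_grad <- function(f, beta, h = 1e-5)
{
  sapply(seq_along(beta), function(k)
  {
    # Perturb only the k-th coefficient by h
    step <- replace(numeric(length(beta)), k, h)
    (f(beta + step) - f(beta - step)) / (2 * h)
  })
}

beta <- c(0.5, -1)  # arbitrary point at which to compare the gradients
analytic_grad <- as.numeric(rss_grad(beta))
numeric_est <- numeric_grad(rss, beta)
print(rbind(analytic_grad, numeric_est))
```

If the two rows agree closely, the analytic expression is very likely correct.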

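So, averaging over the $N$ observations, we can write our 'delta' as:

\begin{equation*} 
    \frac{dLoss}{d\beta} = \frac{2}{N}\sum_i x_i(x_i'\beta^{OLS} - y_i)
\end{equation*}

where $x_i$ is the $i$-th row of $X$ written as a column vector. Absorbing the constant 2 into the learning-rate $lr$, our update for $\beta^{OLS}$ becomes:

\begin{equation*} 
    \beta^{OLS} \leftarrow \beta^{OLS} - \frac{lr}{N}\sum_i x_i(x_i'\beta^{OLS} - y_i)
\end{equation*}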

```R
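lr <- 0.1      # learning-rate
epochs <- 200  # number of passes over the data
# Randomly initialise the coefficients (intercept, slope)
beta_hat <- matrix(rnorm(ncol(X_mat)), nrow = ncol(X_mat))
for (j in 1:epochs)
{
    # Residual of the current fit
    residual <- (X_mat %*% beta_hat) - y
    # Average gradient (the constant 2 is absorbed into lr)
    delta <- (t(X_mat) %*% residual) * (1/nrow(X_mat))
    beta_hat <- beta_hat - (lr*delta)
}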
```

With the learning-rate set to 0.1 and epochs set to 200 we converge to the same result: 2.95, 0.95. We can track how the line of best fit is gradually fitted with this method by plotting it at each iteration:

![Figure1](pic_support/linearreg_2.png)

### Logistic Regression

A logistic regression is a linear regression whose output is passed through a function that bounds it between 0 and 1. This makes it useful for classification problems, where we want to predict the probability of something happening. A binomial logistic regression is used when there are just two classes; to extend beyond two classes we would typically use a multinomial logistic regression.

Consider the iris dataset, where we try to predict whether a flower is "virginica" or "versicolor" by looking only at petal-length and sepal-length. We fit a straight line to 'best' split the categories:

![Figure1](pic_support/logit_0.png)

The above line has an intercept of -39.84, and coefficients of -31.73 for sepal-length and 105.17 for petal-length. These estimates are obtained by maximising the likelihood.

Because the log function is monotone, maximising the likelihood is the same as maximising the log-likelihood (or minimising the negative of the log-likelihood):

\begin{equation*} 
   l_x(\theta) = \log L_x(\theta)
\end{equation*}

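It is usually more convenient to work with the log-likelihood than with the likelihood, since the product over observations becomes a sum. For a logistic regression with labels $y_i$ coded as 0 or 1, it is:

\begin{equation*}
   \log L_x
   =
   \sum_{i=1}^{N} \left[ y_i\beta^Tx_i - \log(1+e^{\beta^Tx_i}) \right]
\end{equation*}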
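
As a reminder of where this expression comes from, here is a sketch of the standard derivation. It uses the shorthand $p_i = \frac{1}{1+e^{-\beta^Tx_i}}$ for the predicted probability (a symbol introduced only for this aside) and treats each label as a Bernoulli draw with success probability $p_i$:

\begin{equation*}
   \log L_x
   = \sum_{i=1}^{N} \left[ y_i\log p_i + (1-y_i)\log(1-p_i) \right]
   = \sum_{i=1}^{N} \left[ y_i\log\frac{p_i}{1-p_i} + \log(1-p_i) \right]
\end{equation*}

Since $\log\frac{p_i}{1-p_i} = \beta^Tx_i$ and $\log(1-p_i) = -\log(1+e^{\beta^Tx_i})$, each term reduces to $y_i\beta^Tx_i - \log(1+e^{\beta^Tx_i})$.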

```R
log_likelihood <- function(X_mat, y, beta_hat)
{
  scores <- X_mat %*% beta_hat
  ll <- (y * scores) - log(1+exp(scores))
  sum(ll)
}
```

The log-likelihood in this example is -11.92.
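
One practical caveat: `exp(scores)` overflows to `Inf` once a score exceeds roughly 709, which turns the log-likelihood above into `-Inf`. A minimal sketch of a more robust variant is shown below, using the identity $\log(1+e^{s}) = \max(s, 0) + \log(1+e^{-|s|})$. The function name `log_likelihood_stable` and the toy inputs are assumptions for illustration.

```R
# Original formulation, repeated here so the example is self-contained
log_likelihood <- function(X_mat, y, beta_hat)
{
  scores <- X_mat %*% beta_hat
  sum((y * scores) - log(1 + exp(scores)))
}

# Hypothetical variant: same quantity, computed without overflow
log_likelihood_stable <- function(X_mat, y, beta_hat)
{
  scores <- X_mat %*% beta_hat
  # log(1 + exp(s)) rewritten as max(s, 0) + log1p(exp(-|s|))
  log1pexp <- pmax(scores, 0) + log1p(exp(-abs(scores)))
  sum((y * scores) - log1pexp)
}

# Toy data: an intercept column plus one predictor
X_toy <- cbind(1, c(-2, -1, 0, 1, 2))
y_toy <- c(0, 0, 1, 1, 1)

# Moderate coefficients: both versions agree
ll_naive <- log_likelihood(X_toy, y_toy, c(0.5, 1))
ll_stable <- log_likelihood_stable(X_toy, y_toy, c(0.5, 1))

# Very large coefficients: only the stable version stays finite
ll_naive_big <- log_likelihood(X_toy, y_toy, c(0, 1000))
ll_stable_big <- log_likelihood_stable(X_toy, y_toy, c(0, 1000))
print(c(ll_naive, ll_stable, ll_naive_big, ll_stable_big))
```

This matters for gradient descent with a large learning-rate or unscaled inputs, where intermediate scores can become very large.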

Typically [BFGS](https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm) or other numerical optimisation procedures are used instead of GD to minimise the cost (maximise the log-likelihood), because the parameter space is much smaller than for neural-networks.
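
As an illustration of this alternative, the sketch below minimises the negative log-likelihood with base R's `optim()` using `method = "BFGS"`, and fits the same model with `glm()` for comparison. The data preparation is an assumption: it uses the built-in `iris` data with unscaled predictors and codes "virginica" as 1, which may differ from the preprocessing behind the figures quoted in this section.

```R
# Assumed data preparation: two classes, two unscaled predictors
iris_sub <- iris[iris$Species != "setosa", ]
X_mat <- cbind(1, iris_sub$Sepal.Length, iris_sub$Petal.Length)
y <- as.numeric(iris_sub$Species == "virginica")

sigmoid <- function(z) 1 / (1 + exp(-z))

# Negative log-likelihood (the cost to minimise), in an overflow-safe form
neg_log_likelihood <- function(beta)
{
  scores <- as.numeric(X_mat %*% beta)
  sum(pmax(scores, 0) + log1p(exp(-abs(scores))) - y * scores)
}

# Gradient of the negative log-likelihood: X'(prediction - y)
neg_log_likelihood_grad <- function(beta)
{
  as.numeric(t(X_mat) %*% (sigmoid(as.numeric(X_mat %*% beta)) - y))
}

beta_start <- rep(0, ncol(X_mat))
fit <- optim(beta_start, neg_log_likelihood, gr = neg_log_likelihood_grad,
             method = "BFGS")

# Reference fit with glm() for comparison
glm_fit <- glm(y ~ X_mat[, 2] + X_mat[, 3], family = binomial)

print(fit$par)
print(-fit$value)                   # log-likelihood reached by BFGS
print(as.numeric(logLik(glm_fit)))  # log-likelihood reached by glm()
```

Supplying the analytic gradient through `gr` is optional; without it `optim()` falls back to a finite-difference approximation.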

It can be shown that the derivative of the log-likelihood is:

\begin{equation*}
     delta = X'(y - prediction)
\end{equation*}

where $prediction = \sigma(X\beta)$ and the sigmoid function $\sigma$ (the activation, or inverse-link, function) that transforms our score into a probability is given by $\sigma(z)=\frac{1}{1+e^{-z}}$.
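
For completeness, here is a short sketch of the derivation, using standard calculus on the log-likelihood given earlier in this section. Differentiating a single term with respect to $\beta$ gives:

\begin{equation*}
   \frac{\partial}{\partial\beta}\left[ y_i\beta^Tx_i - \log(1+e^{\beta^Tx_i}) \right]
   = y_ix_i - \frac{e^{\beta^Tx_i}}{1+e^{\beta^Tx_i}}x_i
   = x_i\left(y_i - \sigma(\beta^Tx_i)\right)
\end{equation*}

Summing over the $N$ observations and stacking the $x_i$ as the rows of $X$ gives $X'(y - \sigma(X\beta))$. The code below works with the negative of this quantity, $X'(\sigma(X\beta) - y)$, because it minimises the negative log-likelihood.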

The process for using GD for a logistic regression is similar to that of a simple linear-regression:

```R
# Calculate activation function (sigmoid for logit)
sigmoid <- function(z){1.0/(1.0+exp(-z))}

logistic_reg <- function(X, y, epochs, lr)
{
  X_mat <- cbind(1, X)
  beta_hat <- matrix(1, nrow=ncol(X_mat))
  for (j in 1:epochs)
  {
    residual <- sigmoid(X_mat %*% beta_hat) - y
    # Update weights with gradient descent
    delta <- (t(X_mat) %*% residual) * (1/nrow(X_mat))
    beta_hat <- beta_hat - (lr*delta)
  }
  # Print log-likelihood
  print(log_likelihood(X_mat, y, beta_hat))
  # Return
  beta_hat
}
```

The only major difference is that we apply a sigmoid function to our prediction - to turn it into a probability. Below we can see why: the output is bounded between 0 and 1:

![Figure1](pic_support/logit_1.png)

The shape of the sigmoid curve also means that we can usually increase the speed of convergence by scaling the variables so that the scores stay closer to 0, where the slope of the curve is highest. Imagine our inputs have a value of 100: the score can then land far out in the flat part of the curve, where the predicted probability is pinned near 0 or 1 and barely responds to changes in the coefficients.
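
To make the flat regions of the curve concrete, the sketch below evaluates the slope of the sigmoid, $\sigma(z)(1-\sigma(z))$, at a few scores. The helper name `sigmoid_slope` and the chosen scores are assumptions for illustration.

```R
sigmoid <- function(z) 1 / (1 + exp(-z))

# Derivative of the sigmoid with respect to its input
sigmoid_slope <- function(z) sigmoid(z) * (1 - sigmoid(z))

scores <- c(0, 2, 10, 100)
slopes <- sigmoid_slope(scores)
print(slopes)
```

The slope peaks at 0.25 for a score of 0 and shrinks rapidly as the score moves away from 0.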

We run the below to optimise our logistic regression using GD:

```R
beta_hat <- logistic_reg(X, y, 300000, 5)
```

We match the original results with the coefficients: -39.84, -31.73, 105.17

![Figure1](pic_support/logit_2.png)

### Neural Network 

...