## Content

1. Vanilla Gradient Descent 
2. Gradient Descent with Momentum 
3. Steepest Gradient Descent 
4. Stochastic Gradient Descent 
5. Newton's Method of Gradient Descent
6. Quasi-Newton Methods of Gradient Descent

$\nabla$ denotes the gradient, the vector of partial derivatives with respect to each weight; $\nabla^2$ denotes the Hessian, the matrix of second partial derivatives.

### Vanilla Gradient Descent 

Put simply, gradient descent iteratively corrects the weight estimates in an optimisation problem using the derivative of a certain error term $\epsilon(w)$.

The gradient of the error, given $w$, a vector of weights, is:

($\nabla$ indicating the vector of partial derivatives)

$\large \nabla \epsilon(w) = \begin{pmatrix} \frac {\partial \epsilon} {\partial w_0} \\ \frac {\partial \epsilon} {\partial w_1} \\ \vdots \\ \frac {\partial \epsilon}{\partial w_n} \end{pmatrix} $

And the training rule to converge on an optimal set of weights will be: 

$ w = w + \Delta w $  where $\Delta w = -\eta \nabla \epsilon (w)$

and $\eta$ is the learning rate, a small positive constant.

This is the algorithm that I used in my MLP notebook for the creation of a single node
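As a minimal sketch of the training rule, assuming a simple quadratic error $\epsilon(w) = (w_1 - 2)^2 + (w_2 - 3)^2$ chosen purely for illustration (the function names, starting point and step count are likewise assumptions, not taken from the MLP notebook):

```python
def grad_eps(w):
    """Gradient of the illustrative error eps(w) = (w1 - 2)^2 + (w2 - 3)^2."""
    return [2 * (w[0] - 2), 2 * (w[1] - 3)]


def gradient_descent(w, eta=0.25, steps=50):
    """Repeat w = w + delta_w, with delta_w = -eta * grad_eps(w)."""
    for _ in range(steps):
        g = grad_eps(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


w_final = gradient_descent([0.0, 6.0])
print(w_final)  # should end up very close to the minimiser (2, 3)
```

With this error and $\eta = 0.25$, every step halves the distance to the minimum, so the weights settle quickly.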

#### Gradient Descent with Momentum

This is an extension of simple gradient descent that lets the search build up inertia in a consistent direction through the search space, which helps it overcome the oscillations of noisy gradients and keep moving across flat regions.

This is done by adding a momentum term that "remembers" the previous update. This smooths the path, making it easier to avoid bouncing back and forth across our minima. The diagram below represents this nicely.

$w_{t+1} = w_t - \eta \nabla \epsilon (w_t) + \alpha (w_t - w_{t - 1})$ where $\alpha$ is the momentum parameter, and $t$ is the iteration number. $t = 0$ means initial weights, for which there is no previous step and the momentum term is zero.

<img src = "media/gradient_descent_with_momentum.png" width="60%"/>

Then, as an example, assume an error term to be minimized:

$\epsilon(w) = (w_1 - x_1)^2 + (w_2 - x_2)^2$

which, with $x_1 = 2$ and $x_2 = 3$, becomes

$\epsilon(w) = (w_1 - 2)^2 + (w_2 - 3)^2$

With $\alpha = 0.5$, $\eta = 0.25$ and initial weights $ w_{t=0} = \begin{pmatrix} 0 \\ 6 \end{pmatrix} $ we get

$ \nabla \epsilon (w_{t=0}) = \begin{pmatrix} {2(w_1-2)} \\ {2(w_2-3)} \end{pmatrix} = \begin{pmatrix} -4 \\ 6 \end{pmatrix}$

and so, since there is no previous step at $t = 0$ and the momentum term is zero,

$ w_{t =1} =  \begin{pmatrix} 0 \\ 6 \end{pmatrix} - 0.25 \begin{pmatrix} -4 \\ 6 \end{pmatrix} = \begin{pmatrix} 1 \\ 4.5 \end{pmatrix}$

For the next step the gradient is $\nabla \epsilon (w_{t=1}) = \begin{pmatrix} -2 \\ 3 \end{pmatrix}$ and the momentum term uses $w_{t=1} - w_{t=0} = \begin{pmatrix} 1 \\ -1.5 \end{pmatrix}$:

$ w_{t =2} =  \begin{pmatrix} 1 \\ 4.5 \end{pmatrix} - 0.25 \begin{pmatrix} -2 \\ 3 \end{pmatrix} + 0.5 \begin{pmatrix} 1 \\ -1.5 \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$

Keeping this in mind, we can apply it: 
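A minimal sketch of the momentum update for the error $\epsilon(w) = (w_1 - 2)^2 + (w_2 - 3)^2$, with $\eta = 0.25$, $\alpha = 0.5$ and starting weights $(0, 6)$. The function names are assumptions for illustration, and the momentum term is taken to be zero on the first step because there is no earlier iterate:

```python
def grad_eps(w):
    """Gradient of eps(w) = (w1 - 2)^2 + (w2 - 3)^2."""
    return (2 * (w[0] - 2), 2 * (w[1] - 3))


def momentum_descent(w0, eta=0.25, alpha=0.5, steps=2):
    """Return the list of iterates w_0, w_1, ..., w_steps."""
    w_prev = w_curr = tuple(w0)  # w_prev == w_curr makes the first momentum term zero
    history = [w_curr]
    for _ in range(steps):
        g = grad_eps(w_curr)
        # w_{t+1} = w_t - eta * grad + alpha * (w_t - w_{t-1})
        w_next = tuple(
            wc - eta * gc + alpha * (wc - wp)
            for wc, gc, wp in zip(w_curr, g, w_prev)
        )
        w_prev, w_curr = w_curr, w_next
        history.append(w_curr)
    return history


history = momentum_descent((0.0, 6.0))
for t, w in enumerate(history):
    print(t, w)
```

Raising `steps` shows how the remembered step keeps pushing the weights even where the gradient itself is zero.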

#### Steepest Gradient Descent

The steepest descent method chooses, at every iteration, the step size that gives the largest decrease in the error along the direction of the negative gradient, which is perpendicular to the contour lines of the error surface. For example, take:

$\epsilon (w) = \epsilon(w_1, w_2) = w_1^3 - 2w_1^2 + w_2^3+ 3w_2^2-8$

$\nabla \epsilon (w_1, w_2) = \begin{pmatrix} {\frac {\partial \epsilon} {\partial w_1}} \\ {\frac {\partial \epsilon} {\partial w_2}} \end{pmatrix} = \begin{pmatrix} {3w_1^2 - 4w_1} \\ {3w_2^2+6w_2} \end{pmatrix} $

We find $d$, **the direction of descent**, which is opposite to the direction of steepest ascent (the gradient), hence we multiply by $-1$:

$d = -\nabla \epsilon (w_1, w_2)$

**Now**

Given an initial starting $w_0 = \begin{pmatrix} {1} \\ {-1} \end{pmatrix}$

$d = -\begin{pmatrix} {-1} \\ {-3} \end{pmatrix} = \begin{pmatrix} {1} \\ {3} \end{pmatrix}$

Then we update
$ w_{t+1} = w_t + \lambda d$, giving us: 

$w + \lambda d = \begin{pmatrix} {1 + \lambda} \\ {-1 + 3\lambda} \end{pmatrix}$

$\nabla \epsilon (w + \lambda d) = \begin{pmatrix} {3(1+\lambda)^2-4(1+\lambda)} \\ {3(-1+3\lambda)^2+6(-1+3\lambda)} \end{pmatrix}$

**Then, differentiating w.r.t. $\lambda$ to find the optimal $\lambda$**

$\epsilon ' (\lambda) = \nabla \epsilon (w + \lambda d)^Td = \nabla \epsilon (w + \lambda d)^T\begin{pmatrix} {1} \\ {3} \end{pmatrix} = 3(1+\lambda)^2 -4(1+\lambda)+9(-1+3\lambda)^2+18(-1+3\lambda) $

Then solving for $\epsilon'(\lambda) = 0$, we get: 

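$ \lambda = \frac 1 3, - \frac 5 {14}$

We take the positive root, $\lambda = \frac 1 3$, since we want to step forwards along $d$, and update the weights accordingly:

$ w_1 = w_0 + \lambda d = \begin{pmatrix} {\frac 4 3} \\ {0} \end{pmatrix}$

Then substitute this back into the gradient to check the new point:

$ \nabla \epsilon (w_1) = \begin{pmatrix} {3(\frac 4 3)^2 - 4(\frac 4 3)} \\ {3(0)^2 + 6(0)} \end{pmatrix} = \begin{pmatrix} {0} \\ {0} \end{pmatrix}$

The gradient vanishes, so $w_1$ is a stationary point (a local minimum here, since the second derivatives $6w_1 - 4 = 4$ and $6w_2 + 6 = 6$ are both positive), and the first iteration is complete.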
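The same iteration can be checked numerically. The sketch below finds $\lambda$ with a golden-section search instead of solving $\epsilon'(\lambda) = 0$ by hand. The search interval $[0, 1]$ and the helper names are assumptions made for this example, and the search relies on the error having a single dip along $d$ within that interval:

```python
import math


def eps(w):
    w1, w2 = w
    return w1**3 - 2 * w1**2 + w2**3 + 3 * w2**2 - 8


def grad_eps(w):
    w1, w2 = w
    return (3 * w1**2 - 4 * w1, 3 * w2**2 + 6 * w2)


def golden_section(phi, lo, hi, tol=1e-8):
    """Minimise a one-dimensional function phi on [lo, hi], assuming one dip."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if phi(c) < phi(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2


w0 = (1.0, -1.0)
g0 = grad_eps(w0)
d = (-g0[0], -g0[1])  # direction of descent


def along_d(lam):
    """Error as a function of the step size lam along d."""
    return eps((w0[0] + lam * d[0], w0[1] + lam * d[1]))


lam = golden_section(along_d, 0.0, 1.0)
w1 = (w0[0] + lam * d[0], w0[1] + lam * d[1])
print("lambda:", lam)
print("w1:", w1)
print("gradient at w1:", grad_eps(w1))
```

A numerical line search like this is useful when $\epsilon'(\lambda) = 0$ has no convenient closed-form solution.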

#### Stochastic Gradient Descent

So far, the gradient descent we have done is only for two weight terms, and every step uses the gradient of the full error. However, for large neural networks, where the number of features (and hence weights) and the number of training samples might be extremely large, this is a lot of computation for every single step!

This is where stochastic gradient descent comes in. Stochastic gradient descent uses 1 randomly chosen sample for a single training step (mini-batch gradient descent uses a small subset), which works because our samples often have redundancies (similar data points).

This is good for large problems where computing the full gradient on every step is not computationally feasible.
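A minimal sketch of the one-sample-per-step idea, fitting a line $y = wx + b$ to toy data. The data, the squared error per sample, the learning rate and the function name are all assumptions chosen for illustration:

```python
import random


def sgd_linear_fit(xs, ys, eta=0.1, epochs=200, seed=0):
    """Fit y = w * x + b, updating on one sample at a time."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order
        for i in indices:
            err = (w * xs[i] + b) - ys[i]
            # Gradient of the single-sample error err^2 w.r.t. w and b
            w -= eta * 2 * err * xs[i]
            b -= eta * 2 * err
    return w, b


# Toy data generated from y = 2x + 1 with no noise
xs = [i / 10 - 1 for i in range(21)]
ys = [2 * x + 1 for x in xs]

w_fit, b_fit = sgd_linear_fit(xs, ys)
print(w_fit, b_fit)  # should be close to 2 and 1
```

Each update touches a single data point, so the cost of a step does not grow with the size of the dataset.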

#### Newton's Method of Gradient Descent

Recall that Newton's method, applied to finding a root of $f'(w)$ (that is, a stationary point of $f$), is:

$w_{n+1} = w_n - \frac {f'(w_n)} {f''(w_n)}$

Translating this into matrix form for our gradient descent:

$w_{n+1} = w_n - (\nabla^2\epsilon (w_n))^{-1}\nabla\epsilon(w_n)$

Unlike the gradient descent method, this algorithm typically needs far fewer steps to reach an optimum, but it is expensive to compute because the Hessian and its inverse have to be calculated on every iteration: the cost scales as $O(n^3)$ per iteration, compared to $O(n)$ for simple gradient descent. It therefore comes with a tradeoff.

It is also highly dependent on initial conditions.
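As a rough sketch, here is the matrix form applied to the cubic error from the steepest descent section, $\epsilon(w) = w_1^3 - 2w_1^2 + w_2^3 + 3w_2^2 - 8$. The Hessian is worked out by hand, and the starting point $(2, 1)$, the tolerance and the helper names are assumptions for this example:

```python
def grad_eps(w):
    w1, w2 = w
    return (3 * w1**2 - 4 * w1, 3 * w2**2 + 6 * w2)


def hess_eps(w):
    """Hessian of eps; the cross terms are zero for this particular error."""
    w1, w2 = w
    return ((6 * w1 - 4, 0.0), (0.0, 6 * w2 + 6))


def solve_2x2(A, b):
    """Solve A x = b for a 2x2 matrix A using Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ZeroDivisionError("Hessian is singular; Newton step is undefined")
    return ((a22 * b[0] - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det)


def newton(w, tol=1e-10, max_iter=50):
    """Return the final weights and the number of Newton steps taken."""
    for n in range(max_iter):
        g = grad_eps(w)
        if max(abs(g[0]), abs(g[1])) < tol:
            return w, n
        # Solve H * step = g rather than forming the inverse explicitly
        step = solve_2x2(hess_eps(w), g)
        w = (w[0] - step[0], w[1] - step[1])
    return w, max_iter


# Note: at w = (1, -1) the Hessian is singular (6 * w2 + 6 = 0),
# so that starting point would raise an error here.
w_star, n_iter = newton((2.0, 1.0))
print(w_star, n_iter)
```

From this starting point only a handful of steps are needed, while a start on the line $w_2 = -1$ fails outright, which illustrates the sensitivity to initial conditions.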

#### Quasi-Newton Methods of Gradient Descent

A more practical option is one of the quasi-Newton methods, which build up an approximation of the Hessian (or its inverse) from successive gradients instead of computing it exactly, for example:
* BFGS (Broyden-Fletcher-Goldfarb-Shanno) update
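
For reference, a sketch of the standard BFGS update in the notation of the Newton section, where $B_n$ (a symbol introduced here for illustration) is the current approximation of the Hessian $\nabla^2\epsilon(w_n)$, $s_n = w_{n+1} - w_n$ and $y_n = \nabla\epsilon(w_{n+1}) - \nabla\epsilon(w_n)$:

$B_{n+1} = B_n + \frac {y_n y_n^T} {y_n^T s_n} - \frac {B_n s_n s_n^T B_n} {s_n^T B_n s_n}$

The Newton step then uses $B_n$ in place of $\nabla^2\epsilon(w_n)$, so no second derivatives are ever computed.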
