# Gradient Descent/Optimization Techniques

## Jacobian
The Jacobian matrix stores all of the first-order partial derivatives of a vector-valued function.
(Reminder: a vector-valued function is a function with multiple output values, collected into a vector.)

E.g.,

$f(x,y) = \left[
  \begin{array}{c}
  x^2y\\
  5x + \sin y
  \end{array}
\right]$

Then we can break it down into:

$f_1(x,y) = x^2y$

and

$f_2(x,y) = 5x + \sin y$

Then the Jacobian matrix becomes:

$J_f(x,y) = \left[
  \begin{array}{cc}
  \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y}\\
  \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y}
  \end{array}
\right]$

$J_f(x,y) = \left[
  \begin{array}{cc}
  2xy & x^2\\
  5 & \cos y
  \end{array}
\right]$
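
A quick way to sanity-check a hand-derived Jacobian like the one above is to compare it against a finite-difference approximation. The sketch below assumes NumPy is available; `numerical_jacobian` and the evaluation point are hypothetical choices made for illustration, not part of any library.

```python
import numpy as np

def f(v):
    """The example function f(x, y) = [x^2 * y, 5x + sin y]."""
    x, y = v
    return np.array([x**2 * y, 5 * x + np.sin(y)])

def analytic_jacobian(v):
    """The Jacobian derived by hand for the example function."""
    x, y = v
    return np.array([[2 * x * y, x**2],
                     [5.0, np.cos(y)]])

def numerical_jacobian(func, v, eps=1e-6):
    """Approximate the Jacobian with central differences (illustrative helper)."""
    v = np.asarray(v, dtype=float)
    n_out = func(v).size
    jac = np.zeros((n_out, v.size))
    for j in range(v.size):
        step = np.zeros_like(v)
        step[j] = eps
        # Column j holds the derivative of every output with respect to input j.
        jac[:, j] = (func(v + step) - func(v - step)) / (2 * eps)
    return jac

point = np.array([1.5, 0.5])  # arbitrary evaluation point
print(analytic_jacobian(point))
print(numerical_jacobian(f, point))
```

Each row corresponds to one output ($f_1$, $f_2$) and each column to one input ($x$, $y$), so the two printed matrices should agree closely.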

The __Hessian__ matrix is similar, but it is defined for a scalar-valued function (one output) and stores its second-order partial derivatives (the derivative of the derivative).
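
As a worked example (not part of the original notes), we can treat the first component from the Jacobian example, $f_1(x,y) = x^2y$, as a scalar-valued function and differentiate each of its first-order partial derivatives once more:

$H_{f_1}(x,y) = \left[
  \begin{array}{cc}
  \frac{\partial^2 f_1}{\partial x^2} & \frac{\partial^2 f_1}{\partial x \partial y}\\
  \frac{\partial^2 f_1}{\partial y \partial x} & \frac{\partial^2 f_1}{\partial y^2}
  \end{array}
\right] = \left[
  \begin{array}{cc}
  2y & 2x\\
  2x & 0
  \end{array}
\right]$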

## Gradient Descent

The gradient descent training process is as follows:

1. Randomly choose some weights for your network
2. Perform a forward pass on a batch of your data to get predictions
3. Use a loss function to calculate how well your network is performing. The loss function will calculate some measure of difference between your predictions and the correct values. The lower the loss, the better.
4. Calculate the gradient: this is the derivative of the loss (cost) function with respect to the weights. Recall that the gradient points in the direction of steepest ascent.
5. Update your weights by doing `weights - (gradient*learning rate)`. Because the gradient points "uphill", subtracting it moves us "downhill", which reduces the loss and so improves our weights. (Makes the weights closer to values that will give more accurate predictions.)
6. Repeat steps 2-5 for the next batch of data
7. Once we have processed all of our data, repeat 2-6 for another epoch, and so on until we have good weights
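
The steps above can be sketched as a short training loop. This is a minimal illustration, assuming NumPy, a linear model, mean squared error as the loss, and made-up toy data; the function names, learning rate, batch size, and the per-epoch shuffle are all assumptions chosen for the example rather than anything prescribed by these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): y = 3*x1 - 2*x2 + small noise.
X = rng.normal(size=(200, 2))
true_weights = np.array([3.0, -2.0])
y = X @ true_weights + 0.01 * rng.normal(size=200)

def predict(X, weights):
    """Step 2: forward pass of a linear model."""
    return X @ weights

def mse_loss(preds, targets):
    """Step 3: mean squared error between predictions and targets."""
    return np.mean((preds - targets) ** 2)

def mse_gradient(X, preds, targets):
    """Step 4: gradient of the MSE loss with respect to the weights."""
    return 2 * X.T @ (preds - targets) / len(targets)

weights = rng.normal(size=2)          # Step 1: random initial weights
learning_rate = 0.1
batch_size = 20
initial_loss = mse_loss(predict(X, weights), y)

for epoch in range(20):               # Step 7: repeat for several epochs
    order = rng.permutation(len(X))   # shuffle each epoch (common practice)
    for start in range(0, len(X), batch_size):    # Step 6: next batch
        batch = order[start:start + batch_size]
        preds = predict(X[batch], weights)                   # Step 2
        loss = mse_loss(preds, y[batch])                     # Step 3
        gradient = mse_gradient(X[batch], preds, y[batch])   # Step 4
        weights = weights - gradient * learning_rate         # Step 5

final_loss = mse_loss(predict(X, weights), y)
print(initial_loss, final_loss, weights)
```

On this toy problem the final loss should be far smaller than the initial loss, and the learned weights should end up close to `true_weights`.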

#### "Stochastic"

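What makes gradient descent __stochastic__ is that we evaluate our loss (aka error or cost) using just a subset of the data -- a mini-batch -- rather than the full dataset. Each mini-batch gradient is therefore a noisy estimate of the gradient over all of the data.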
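
To illustrate the idea, the sketch below compares the gradient computed on the full dataset with gradients computed on mini-batches. It assumes NumPy, a linear model with mean squared error, and made-up data; the names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear-regression data (hypothetical, for illustration only).
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
weights = np.zeros(3)

def mse_gradient(X, y, weights):
    """Gradient of mean squared error with respect to the weights."""
    return 2 * X.T @ (X @ weights - y) / len(y)

full_gradient = mse_gradient(X, y, weights)

# Split the shuffled data into 10 equal mini-batches of 100 examples each.
batches = np.array_split(rng.permutation(len(X)), 10)
batch_gradients = np.array([mse_gradient(X[b], y[b], weights) for b in batches])

print("full gradient:        ", full_gradient)
print("one mini-batch:       ", batch_gradients[0])
print("mean over all batches:", batch_gradients.mean(axis=0))
```

Any single mini-batch gradient differs somewhat from the full gradient, but because the batches here are equal-sized and together cover the whole dataset, their average matches the full gradient.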

#### Backpropagation

Since our cost function involves calculating a prediction by using the various operations in our network (weighted sums, ReLU, Sigmoid, etc.), taking the derivative of this function will necessarily involve taking the derivative of our entire network. In practical terms, this is implemented by treating each operation as a "gate" in a "circuit". During the forward pass, we compute each gate's output and also its local derivative (the derivative of its output with respect to its inputs). During the backward pass, we start at the loss and keep applying the chain rule, multiplying each gate's local derivative by the gradient flowing back from the gates after it, to work out what effect this gate (mathematical operation) has on the final output.
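
As a minimal sketch of the "circuit" view, the example below runs a forward and backward pass through a single neuron by hand. The choice of gates (multiply, add, sigmoid, squared error), the input values, and the function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(w, b, x, y):
    """Forward and backward pass through a tiny circuit:
    multiply -> add -> sigmoid -> squared error."""
    # Forward pass: compute each gate's output and keep it for later.
    m = w * x               # multiply gate
    z = m + b               # add gate
    a = sigmoid(z)          # sigmoid gate
    loss = (a - y) ** 2     # squared-error gate

    # Backward pass: apply the chain rule from the loss back to the inputs.
    dloss_da = 2 * (a - y)              # local derivative of the squared error
    dloss_dz = dloss_da * a * (1 - a)   # sigmoid'(z) = a * (1 - a)
    dloss_dm = dloss_dz * 1.0           # add gate passes the gradient through
    dloss_db = dloss_dz * 1.0
    dloss_dw = dloss_dm * x             # multiply gate: d(w*x)/dw = x
    return loss, dloss_dw, dloss_db

w, b, x, y = 0.5, -0.3, 2.0, 1.0
loss, grad_w, grad_b = forward_backward(w, b, x, y)

# Sanity check against central finite-difference estimates.
eps = 1e-6
numeric_grad_w = (forward_backward(w + eps, b, x, y)[0]
                  - forward_backward(w - eps, b, x, y)[0]) / (2 * eps)
numeric_grad_b = (forward_backward(w, b + eps, x, y)[0]
                  - forward_backward(w, b - eps, x, y)[0]) / (2 * eps)
print(grad_w, numeric_grad_w)
print(grad_b, numeric_grad_b)
```

Each backward line multiplies the gradient arriving from the later gate by the current gate's local derivative, which is the chain rule applied one gate at a time. Deep learning frameworks automate this bookkeeping for much larger circuits.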

This lecture has a good walkthrough of the process: https://www.youtube.com/watch?v=i94OvYb6noo.