# Gradient descent

Gradient descent is an iterative way of estimating the parameters that minimize a function. Depending on the function, gradient descent may converge on a local optimum (a local "low point") rather than the global optimum. The steps are:

1. choose an arbitrary starting value for each parameter
2. "step downhill" (towards lower nearby function values)
3. continue until you converge on a minimum of your function (which may be a local minimum rather than the global one)
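
The steps above can be sketched for a single parameter. This is a minimal illustration, not a general implementation: the function $f(x) = x^4 - 3x^2 + x$, the names `slope` and `descend`, the learning rate, and the fixed step count (standing in for a real convergence check) are all assumptions chosen because this function has two separate low points.

```go
package main

import "fmt"

// slope is the derivative of the illustrative function f(x) = x^4 - 3x^2 + x.
// (Hypothetical example function, not taken from the text.)
func slope(x float64) float64 {
	return 4*x*x*x - 6*x + 1
}

// descend starts at an arbitrary value (step 1) and repeatedly
// steps downhill (step 2) for a fixed number of steps (step 3).
func descend(start, alpha float64, steps int) float64 {
	x := start
	for i := 0; i < steps; i++ {
		x -= alpha * slope(x) // move against the slope
	}
	return x
}

func main() {
	fmt.Printf("start  2: x = %.3f\n", descend(2, 0.01, 1000))
	fmt.Printf("start -2: x = %.3f\n", descend(-2, 0.01, 1000))
}
```

Starting from 2, the run should settle near $x \approx 1.13$, which is only a local minimum. Starting from -2 it should settle near $x \approx -1.30$, the lower of the two low points. The starting value alone decides which one is found.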

> There is a [normal equations method](normal_equations_method.ipynb) which solves the same problems in a single step instead of iterating, but it scales poorly as the number of features grows, because it has to solve a linear system whose size depends on the feature count.

There can be an arbitrary number of $\theta$'s, but the following graph shows just two: $\theta_0$ and $\theta_1$.

![](../static/gradient_descent_graph.png)

**aka:** "Batch gradient descent" (each step uses the entire training set; some other variants use only part of it).

## Update rule

repeat until convergence:
$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)$ (for $j = 0$ and $j = 1$ - aka do it for both thetas)

$\alpha$ = **learning rate** - controls how big a step we take "downhill". If it is too small, gradient descent can take too long to reach the minimum. If it is too large, a step can overshoot the minimum, and the algorithm may fail to converge or even diverge.
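
The effect of $\alpha$ is easy to see on a toy cost. The sketch below assumes $J(\theta) = \theta^2$ (so the derivative is $2\theta$ and the minimum is at 0); the function name `run` and the three learning rates are illustrative choices, not values from the text.

```go
package main

import (
	"fmt"
	"math"
)

// run applies theta := theta - alpha * dJ/dtheta for the assumed cost
// J(theta) = theta^2, starting from theta = 1.
func run(alpha float64, steps int) float64 {
	theta := 1.0
	for i := 0; i < steps; i++ {
		theta -= alpha * 2 * theta // derivative of theta^2 is 2*theta
	}
	return theta
}

func main() {
	for _, alpha := range []float64{0.001, 0.1, 1.1} {
		fmt.Printf("alpha=%.3f  |theta| after 50 steps: %.3g\n",
			alpha, math.Abs(run(alpha, 50)))
	}
}
```

For this cost each step multiplies $\theta$ by $(1 - 2\alpha)$. With the smallest rate, $\theta$ barely moves in 50 steps. With the middle rate it shrinks quickly towards 0. With the largest rate the factor has magnitude greater than 1, so every step overshoots further and $\theta$ grows without bound.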

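$\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)$ = **derivative term** - the partial derivative of $J$ with respect to $\theta_j$, i.e. the slope of $J$ along $\theta_j$ at the current $\theta_0$ and $\theta_1$. Because the slope shrinks towards 0 near a minimum, the "steps" get smaller automatically as we approach it, even with a fixed $\alpha$. Note that the slope is 0 at any minimum, local or global.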

All thetas must be updated simultaneously:

```go
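// dJdTheta0 and dJdTheta1 are placeholders for the two partial derivatives of J.
// Compute both updates from the old values before assigning either one.
temp0 := theta0 - alpha*dJdTheta0(theta0, theta1)
temp1 := theta1 - alpha*dJdTheta1(theta0, theta1)
theta0, theta1 = temp0, temp1 // plain "=": theta0 and theta1 already exist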
```
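
To see why the order matters, the following sketch takes one step both ways. It assumes an illustrative cost $J = \theta_0^2 + \theta_0\theta_1 + \theta_1^2$, chosen only because each partial derivative depends on both parameters; the function names are hypothetical.

```go
package main

import "fmt"

// Partial derivatives of the assumed cost J = t0^2 + t0*t1 + t1^2.
func dJd0(t0, t1 float64) float64 { return 2*t0 + t1 }
func dJd1(t0, t1 float64) float64 { return t0 + 2*t1 }

// simultaneous evaluates both partial derivatives at the old point.
func simultaneous(t0, t1, alpha float64) (float64, float64) {
	return t0 - alpha*dJd0(t0, t1), t1 - alpha*dJd1(t0, t1)
}

// sequential overwrites t0 first, so the update for t1 sees the new t0.
func sequential(t0, t1, alpha float64) (float64, float64) {
	t0 = t0 - alpha*dJd0(t0, t1)
	t1 = t1 - alpha*dJd1(t0, t1)
	return t0, t1
}

func main() {
	a0, a1 := simultaneous(1, 1, 0.1)
	b0, b1 := sequential(1, 1, 0.1)
	fmt.Println("simultaneous:", a0, a1)
	fmt.Println("sequential:  ", b0, b1)
}
```

From the same starting point the two versions agree on $\theta_0$ but produce different values of $\theta_1$, because the sequential version computes its second derivative at a point that mixes old and new parameters. Only the simultaneous version is a true step along the gradient.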

## Examples

### With zero-value for $\theta_0$

Simplify by setting $\theta_0 = 0$ (same as removing $\theta_0$ from the equations).

Gradient descent: $\theta_1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_1)$

![](../static/gradient_descent_example.png)

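When the derivative term is positive (the curve slopes upward at the current $\theta_1$), the update $\theta_1 := \theta_1 - \alpha \cdot (\text{positive number})$ decreases $\theta_1$, moving it towards the minimum. When the slope is negative, the same rule increases $\theta_1$, so the update moves towards the minimum from either side.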
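
A small numeric sketch of this one-parameter case, assuming the hypothesis $h_\theta(x) = \theta_1 x$ with the squared-error cost and a made-up data set where $y = 2x$ (so the minimum is at $\theta_1 = 2$). The names `slopeAt` and `descend`, the data, and the learning rate are all illustrative assumptions.

```go
package main

import "fmt"

// slopeAt is dJ/dtheta1 for h(x) = theta1*x with the squared-error cost:
// (1/m) * sum((theta1*x - y) * x).
func slopeAt(theta1 float64, xs, ys []float64) float64 {
	var sum float64
	for i := range xs {
		sum += (theta1*xs[i] - ys[i]) * xs[i]
	}
	return sum / float64(len(xs))
}

// descend repeats theta1 := theta1 - alpha*slope for a fixed number of steps.
func descend(theta1, alpha float64, steps int, xs, ys []float64) float64 {
	for i := 0; i < steps; i++ {
		theta1 -= alpha * slopeAt(theta1, xs, ys)
	}
	return theta1
}

func main() {
	// Toy data on the line y = 2x.
	xs, ys := []float64{1, 2, 3}, []float64{2, 4, 6}
	// Right of the minimum: positive slope, theta1 decreases.
	fmt.Println(slopeAt(4, xs, ys) > 0, descend(4, 0.1, 100, xs, ys))
	// Left of the minimum: negative slope, theta1 increases.
	fmt.Println(slopeAt(0, xs, ys) < 0, descend(0, 0.1, 100, xs, ys))
}
```

Both starting points should end up at essentially the same $\theta_1$, approached from opposite sides.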

### Solving the cost function

Plug in the [cost function](cost_function.ipynb):

$$
\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) = \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2
$$

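For the linear hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$, evaluating this partial derivative for $j = 0$ and $j = 1$ gives:

$$
\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})
$$

$$
\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x^{(i)}
$$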
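
The step between the cost function and these two results is the chain rule. As a sketch, assuming the linear hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ (which the two results imply):

$$
\frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m}\sum_{i=1}^m 2\,(h_\theta(x^{(i)}) - y^{(i)})\,\frac{\partial h_\theta(x^{(i)})}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\,\frac{\partial h_\theta(x^{(i)})}{\partial\theta_j}
$$

Under that assumption $\frac{\partial h_\theta(x^{(i)})}{\partial\theta_0} = 1$ and $\frac{\partial h_\theta(x^{(i)})}{\partial\theta_1} = x^{(i)}$, which is where the extra factor of $x^{(i)}$ in the $j=1$ case comes from. The 2 from differentiating the square cancels the $\frac{1}{2}$ in $\frac{1}{2m}$.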

Substituting these into the update rule, repeat until convergence (updating $\theta_0$ and $\theta_1$ simultaneously):

$$
\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})
$$

$$
\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x^{(i)}
$$
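
The two update equations can be combined into a short program. This is a minimal sketch rather than a definitive implementation: the function name `fitLine`, the toy data (points on $y = 1 + 2x$), the learning rate, and the fixed iteration count used in place of a convergence test are all assumptions.

```go
package main

import "fmt"

// fitLine runs batch gradient descent for h(x) = theta0 + theta1*x,
// starting from theta0 = theta1 = 0.
func fitLine(xs, ys []float64, alpha float64, iters int) (theta0, theta1 float64) {
	m := float64(len(xs))
	for n := 0; n < iters; n++ {
		// Sum the derivative terms over the whole training set ("batch").
		var grad0, grad1 float64
		for i := range xs {
			residual := theta0 + theta1*xs[i] - ys[i] // h(x_i) - y_i
			grad0 += residual
			grad1 += residual * xs[i]
		}
		// Simultaneous update: both right-hand sides use the old thetas.
		theta0, theta1 = theta0-alpha*grad0/m, theta1-alpha*grad1/m
	}
	return theta0, theta1
}

func main() {
	// Toy data lying exactly on y = 1 + 2x.
	xs := []float64{0, 1, 2, 3, 4}
	ys := []float64{1, 3, 5, 7, 9}
	theta0, theta1 := fitLine(xs, ys, 0.05, 5000)
	fmt.Printf("theta0=%.4f theta1=%.4f\n", theta0, theta1)
}
```

On this data the fitted parameters should approach $\theta_0 = 1$ and $\theta_1 = 2$. A production version would more likely stop when the change in $J$ (or in the parameters) falls below a tolerance instead of running a fixed number of iterations.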