# Gradient Descent

A `first-order method`. Assume `f` is a continuous, twice-differentiable function, and we want to solve: $\min_{x} f(x)$

An intuitive approach is to start at some initial point, and iteratively move in the direction that decreases `f`.<br>

A natural choice for the direction is the negative [gradient](../calculus/gradients.ipynb): <br>
$x^{(k+1)} = x^{(k)} - t_k \nabla f(x^{(k)})$ where $t_k$ is a step size


The algorithm is as follows:<br>
#### English version
1. choose an initial solution $x^0$
2. choose a descent direction $d^0$
3. choose a step size $\alpha_0 > 0$
4. update the solution $x^1 = x^0 + \alpha_0 d^0$
5. if some stopping criterion is met, `stop`, else repeat with the current solution.

#### Math version
---
1: guess $x^{(0)}$, set k = 0<br>
2: while $\|\nabla f(x^{(k)})\| \geq \epsilon$ do <br>
&nbsp;&nbsp;&nbsp;&nbsp;3: $x^{(k+1)} = x^{(k)} - t_k \nabla f(x^{(k)})$<br>
&nbsp;&nbsp;&nbsp;&nbsp;4: k += 1<br>
5: end while<br>
6: return $x^{(k)}$<br>


Let $x^k$ be the current iterate. We want to choose a "downhill" direction $d^k$ and a step size $\alpha > 0$ such that $f(x^k + \alpha d^k) < f(x^k)$

By [Taylor's approximation](../../calculus/taylors_approximation.ipynb):
$$f(x^k + \alpha d^k) \approx f(x^k) + \alpha \nabla f(x^k)^T d^k$$

So we want $\nabla f(x^k)^Td^k < 0$. The steepest descent direction is $d^k = -\nabla f(x^k)$
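The first-order argument above can be checked numerically. The sketch below is only an illustration: the quadratic `f_quad`, its gradient `grad_quad`, the point `x`, and the step `alpha` are assumptions chosen for the example, not part of the notes.

```python
import numpy as np

# Hypothetical test function (an assumption for illustration): f(x) = x1^2 + 3*x2^2
def f_quad(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2

def grad_quad(x):
    return np.array([2.0 * x[0], 6.0 * x[1]])

x = np.array([1.0, -2.0])
d = -grad_quad(x)               # steepest descent direction
slope = grad_quad(x) @ d        # grad f(x)^T d = -||grad f(x)||^2, always <= 0

alpha = 1e-3                    # small step, so the linear model is accurate
actual = f_quad(x + alpha * d)
linear = f_quad(x) + alpha * slope   # first-order Taylor estimate

print(f"slope along d: {slope}")
print(f"f(x) = {f_quad(x)}, f(x + alpha d) = {actual}, Taylor estimate = {linear}")
```

The slope is negative, the function value drops, and the Taylor estimate differs from the true value only by a term of order $\alpha^2$.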

---

We know what direction to move in, but how much should we move by? What should the step size be? 

* line search: define $g(\alpha) := f(x^k + \alpha d^k)$. Choose $\alpha$ to minimize $g$
* fixed step size: fix $\alpha$ a priori (may not converge if $\alpha$ is too big)
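To see why a fixed step can fail, consider $f(x) = x^2$: the update becomes $x^{k+1} = (1 - 2\alpha)x^k$, so the iterates shrink only when $|1 - 2\alpha| < 1$. The helper `run_fixed_step` below is a hypothetical name, and the starting point and step values are assumptions picked to show the three regimes.

```python
def run_fixed_step(x0, alpha, n_steps=20):
    """Apply n_steps of x <- x - alpha * f'(x) for f(x) = x^2, where f'(x) = 2x."""
    x = float(x0)
    for _ in range(n_steps):
        x = x - alpha * 2.0 * x
    return x

for alpha in (0.25, 1.0, 1.1):
    print(f"alpha = {alpha}: x after 20 steps = {run_fixed_step(10.0, alpha)}")
```

With `alpha = 0.25` the iterate halves each step, with `alpha = 1.0` it flips sign forever without shrinking, and with `alpha = 1.1` its magnitude grows by a factor of 1.2 per step.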


---
Let's look at this algorithm with a simple example: $f(x) = x^2$

We know from the start that $\nabla f(x) = 2x$

In [9]:
import numpy as np

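def gradient_descent(f, gf, tk, size=1):
    """
    Fixed-step gradient descent from a random integer starting point.

    :parameters
    f: function (kept for a uniform interface; not evaluated here)
    gf: gradient
    tk: step size
    size: dimension of x
    """
    x = np.random.randint(0, 1000, size=size)
    print(f"random initialization: {x}")
    eps = 1.0  # stop once the gradient norm drops below this tolerance
    steps = 0

    # use the gradient norm, matching the stopping rule ||grad f(x)|| >= eps
    while np.linalg.norm(gf(x)) >= eps:
        x_new = x - tk * gf(x)
        print(x_new)
        print(f"{x} => {x_new}")
        x = x_new
        steps += 1
    print(f"finished in {steps} steps")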

tk = 0.25
gradient_descent(lambda x: x ** 2, lambda x: 2 * x, tk)

random initialization: [592]
[296.]
[592] => [296.]
[148.]
[296.] => [148.]
[74.]
[148.] => [74.]
[37.]
[74.] => [37.]
[18.5]
[37.] => [18.5]
[9.25]
[18.5] => [9.25]
[4.625]
[9.25] => [4.625]
[2.3125]
[4.625] => [2.3125]
[1.15625]
[2.3125] => [1.15625]
[0.578125]
[1.15625] => [0.578125]
[0.2890625]
[0.578125] => [0.2890625]
finished in 11 steps


How quickly we find the solution depends on the step size. We can alter our current approach to adaptively adjust the step size.

<strong>Exact line search:</strong><br>
In each iteration, choose the step that minimizes $f(x^{(k+1)})$:

$\arg\min_{t\geq0} f(x^{(k)} - t\nabla f(x^{(k)}))$

In [3]:
f = lambda x: x ** 2
gf = lambda x: 2 * x
x = np.random.randint(0, 1000)
res = list(map(lambda t: f(x - t * gf(x)), np.arange(0, 1, 0.1)))
print(res)
np.argmin(res)

[1600.0, 1024.0, 576.0, 255.9999999999999, 64.0, 0.0, 64.00000000000011, 256.0000000000002, 576.0, 1024.0]


5
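The index 5 returned by `np.argmin` corresponds to $t = 0.5$ on the grid `np.arange(0, 1, 0.1)`. For $f(x) = x^2$ this agrees with the closed-form minimizer, which the short derivation below works out. The first printed value, 1600 at $t = 0$, implies the sampled point was $x = 40$.

$$g(t) = f(x - t\nabla f(x)) = (x - 2tx)^2 = x^2(1 - 2t)^2$$

$$g'(t) = -4x^2(1 - 2t) = 0 \iff t = \tfrac{1}{2}, \qquad g\left(\tfrac{1}{2}\right) = 0$$

With $x = 40$, $g(0.1) = 1600 \cdot 0.8^2 = 1024$ and $g(0.5) = 0$, matching the printed list.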

In [4]:
def gradient_descent_line_search(f, gf):
    x = np.random.randint(0, 1000)
    print(f"random initialization: {x}")
    eps = 1.0
    steps = 0
    # candidate step sizes; this grid only approximates an exact line search
    tk_range = np.arange(0, 1.0, 0.1)

    while np.abs(gf(x)) >= eps:
        # evaluate f at every candidate step and keep the best one
        tk_i = np.argmin([f(x - t * gf(x)) for t in tk_range])
        tk = tk_range[tk_i]
        if tk == 0:
            # no positive step on the grid decreases f; stop instead of looping forever
            break
        print(f"chosen step: {tk}")
        x_new = x - tk * gf(x)
        print(f"{x:.4f} => {x_new:.4f}")
        x = x_new
        steps += 1
    print(f"finished in {steps} steps")

gradient_descent_line_search(lambda x: x ** 2, lambda x: 2 * x)

f_ex = lambda x1, x2: 4*x1**2 + 2*x2**2 - 4*x1*x2
gf_ex = np.array([lambda x1, x2: 8*x1 - 4*x2, 
                  lambda x1, x2: 4*x2 - 4*x1])

random initialization: 543
chosen step: 0.5
543.0000 => 0.0000
finished in 1 steps
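The grid above tests only ten candidate steps, so it finds the exact minimizer here only because $t = 0.5$ happens to lie on the grid. One possible refinement, sketched below, is golden-section search on $g(t) = f(x - t\nabla f(x))$. This is a swapped-in technique rather than the method used in these notes, and it assumes $g$ has a single minimum on the bracket. The function name `golden_section_minimize` and the bracket $[0, 1]$ are hypothetical choices.

```python
import math

def golden_section_minimize(g, lo=0.0, hi=1.0, tol=1e-8):
    """Minimize a unimodal scalar function g on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0   # about 0.618
    a, b = lo, hi
    while abs(b - a) > tol:
        c = b - inv_phi * (b - a)   # left interior point
        d = a + inv_phi * (b - a)   # right interior point
        if g(c) < g(d):
            b = d                   # minimum lies in [a, d]
        else:
            a = c                   # minimum lies in [c, b]
    return (a + b) / 2.0

# Same toy problem as above: f(x) = x^2 with gradient 2x, at the point x = 40
f_sq = lambda x: x ** 2
gf_sq = lambda x: 2 * x
x = 40.0
t_star = golden_section_minimize(lambda t: f_sq(x - t * gf_sq(x)))
print(f"step from golden-section search: {t_star:.6f}")
```

For simplicity this version re-evaluates both interior points each iteration; a tuned implementation would reuse one of them.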


<strong>Backtracking line search:</strong><br>
Start with an initial $t$ and then in iteration $k$, use $\frac{t}{2^{(k-1)}}$, or in general shrink the step by a constant factor, $t \leftarrow C\,t$ where $C\in(0,1)$

In [5]:
def find_step(f, gf, x, c=0.9):
    # start from a random step in [0, 1)
    t = np.random.rand()

    # backtrack: shrink t until the step along -gf(x) actually decreases f
    while f(x - t * gf(x)) >= f(x):
        t = t * c
    return t


def gradient_descent_backtracking(f, gf):
    x = np.random.randint(0, 1000)
    print(f"random initialization: {x}")
    eps = 1.0
    steps = 0

    while np.abs(gf(x)) >= eps:
        tk = find_step(f, gf, x)
        print(f"chosen step: {tk:0.4f}")
        x_new = x - tk * gf(x)
        print(f"{x:.4f} => {x_new:.4f}")
        x = x_new
        steps += 1
    print(f"finished in {steps} steps")

gradient_descent_backtracking(lambda x: x ** 2, lambda x: 2 * x)

random initialization: 846
chosen step: 0.1044
846.0000 => 669.3692
chosen step: 0.7397
669.3692 => -320.9215
chosen step: 0.6443
-320.9215 => 92.6214
chosen step: 0.6007
92.6214 => -18.6573
chosen step: 0.4067
-18.6573 => -3.4821
chosen step: 0.8668
-3.4821 => 2.5547
chosen step: 0.5018
2.5547 => -0.0091
finished in 7 steps


---
Another example

$$\min f(x) = (x_1 + 1)^4 + x_1x_2 + (x_2 + 1)^4$$

The gradient is:

$$\nabla f(x) = \begin{bmatrix}
       \frac{\partial f}{\partial x_1} \\
       \frac{\partial f}{\partial x_2}
   \end{bmatrix}
$$

$$\frac{\partial f}{\partial x_1} = 4(x_1 + 1)^3 + x_2$$

$$\frac{\partial f}{\partial x_2} = x_1 + 4(x_2 + 1)^3$$


$$\nabla f(x) = \begin{bmatrix}
       4(x_1 + 1)^3 + x_2 \\
       x_1 + 4(x_2 + 1)^3
   \end{bmatrix}
$$

In [12]:
def f(x): return (x[0] + 1) ** 4 + x[0]*x[1] + (x[1] + 1)**4
def pg1(x): return 4 * (x[0] + 1) ** 3 + x[1]
def pg2(x): return x[0] + 4 * (x[1] + 1) ** 3
def gf(x): return np.array([pg1(x),pg2(x)])

gradient_descent(f, gf, tk, size=2)

random initialization: [287 313]
[-23887663.25 -30958902.75]
[287 313] => [-23887663.25 -30958902.75]
[1.36307876e+22 2.96726708e+22]
[-23887663.25 -30958902.75] => [1.36307876e+22 2.96726708e+22]
[-2.53257811e+66 -2.61258190e+67]
[1.36307876e+22 2.96726708e+22] => [-2.53257811e+66 -2.61258190e+67]
[1.62438342e+199 1.78323976e+202]
[-2.53257811e+66 -2.61258190e+67] => [1.62438342e+199 1.78323976e+202]
[-inf -inf]
[1.62438342e+199 1.78323976e+202] => [-inf -inf]
[nan nan]
[-inf -inf] => [nan nan]
finished in 6 steps
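The run above diverges: the starting gradient is of order $10^8$, so the fixed step 0.25 overshoots by a wider margin on every update until the values overflow to `inf` and `nan`. A backtracking step avoids this. The sketch below is one possible variant, not the code used in these notes: it uses the Armijo sufficient-decrease test, and the names `backtracking_gradient_descent`, `t0`, `beta`, `c1`, and the tolerance values are all assumptions made for illustration. The starting point is copied from the run above.

```python
import numpy as np

def f_quartic(x):
    return (x[0] + 1) ** 4 + x[0] * x[1] + (x[1] + 1) ** 4

def grad_quartic(x):
    return np.array([4 * (x[0] + 1) ** 3 + x[1],
                     x[0] + 4 * (x[1] + 1) ** 3])

def backtracking_gradient_descent(f, gf, x0, t0=1.0, beta=0.5, c1=1e-4,
                                  eps=1e-6, max_iter=10_000):
    """Gradient descent where each step is halved until f decreases sufficiently."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = gf(x)
        if np.linalg.norm(g) < eps:
            return x, k
        t = t0
        # Armijo test: accept t once f drops by at least c1 * t * ||g||^2
        for _ in range(60):          # cap the number of halvings
            if f(x - t * g) <= f(x) - c1 * t * (g @ g):
                break
            t *= beta
        x = x - t * g
    return x, max_iter

x_min, n_iter = backtracking_gradient_descent(f_quartic, grad_quartic, [287.0, 313.0])
print(f"x = {x_min}, f(x) = {f_quartic(x_min):.6f}, iterations = {n_iter}")
```

Setting the gradient to zero with $x_1 = x_2 = s$ gives $4(s+1)^3 + s = 0$, which holds at $s = -0.5$, so the iterates should settle near $(-0.5, -0.5)$ with $f = 0.375$.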


  


#### Resources:
* [ISyE 6669 Discrete Optimization - Gradient Descent](https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2019/courseware/4f017d33a98749118de2413c8a8e4660/5131388fa021401a88546a6414af852e/1?activate_block_id=block-v1%3AGTx%2BISYE6669%2B2T2019%2Btype%40vertical%2Bblock%40ac9c0cad50ca493299d2832467f26b04)