Exercise 2: Advanced Gradient Descent
-------------------------------------

Before we start modifying our gradient descent algorithm, let us first
put it in a more "useful" form.  Remember from the lecture on higher-order
functions that we can pass any function `f` as a parameter to another
function, which can then simply call it.
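
As a quick reminder of that idea, here is a minimal sketch. The names `apply_twice` and `add_three` are made up for illustration and do not come from the lecture.

```python
def apply_twice(f, x):
    # f is an ordinary parameter that happens to hold a function,
    # so we can call it like any other function.
    return f(f(x))

def add_three(x):
    return x + 3

result = apply_twice(add_three, 1)  # add_three(add_three(1))
print(result)  # prints 7
```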

Copy your function `gradient_descent()` from the previous lecture and
modify it such that it takes an additional parameter: `grad` (which is
a function returning the gradient).
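
For reference, the update this function performs is presumably the standard gradient descent step. This is an assumption, since the previous lecture is not reproduced in this notebook. Starting from `x0`, repeat for at most `max_iter` steps:
$$
        x_{t+1} = x_t - \eta \, \mathrm{grad}(x_t)
$$
Here $\eta$ corresponds to the parameter `eta` and $\mathrm{grad}$ to the new function parameter `grad`.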

You can remove any `print()` statements from the function 
(we will do this in a better way soon).

In [None]:
def gradient_descent(grad, x0, eta, max_iter):
    # YOUR CODE HERE
    raise NotImplementedError()

Now we have a truly useful gradient descent function, which works for
any differentiable 1D function whose gradient we can supply.

In [None]:
def mexican_hat(x):
    return x**4 - 2 * x**2
    
def mexican_hat_grad(x):
    return 4 * (x**3 - x)

xmin = gradient_descent(mexican_hat_grad, -1.5, 0.1, 20)

In [None]:
print("f(", xmin, ") =", mexican_hat(xmin))

Try this out!  Make up some potential with a minimum, define its derivative,
and see whether your `gradient_descent` function can handle it.

Lists and plotting
---------------------

Let's understand graphically what the `mexican_hat` function looks like.  To
do this, let us first create two lists, either by appending elements in a loop or
with a list comprehension:

  - a list `x` with values from `[-1.5, -1.4, ..., 1.5]`
  - a list `y` with the associated values of `mexican_hat` for each of the `x`
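
Both approaches can be sketched on a different, made-up example (the names `ts`, `ts_comp`, `squares` and the grid of values are chosen only for illustration). Computing each value as `i * 0.5` from an integer counter, rather than repeatedly adding the step size, avoids accumulating floating-point rounding errors.

```python
# Variant 1: append elements in a loop
ts = []
for i in range(5):
    ts.append(i * 0.5)

# Variant 2: the same list written as a list comprehension
ts_comp = [i * 0.5 for i in range(5)]

# A second list holding a function value for each element of ts
squares = [t**2 for t in ts]

print(ts)       # [0.0, 0.5, 1.0, 1.5, 2.0]
print(squares)  # [0.0, 0.25, 1.0, 2.25, 4.0]
```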

In [None]:
# x = ...
# y = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(x) == len(y), "Lists x and y should be of the same length"
assert abs(x[0] + 1.5) < 1e-10, "First value of x should be -1.5"
assert abs(x[-1] - 1.5) < 1e-10, "Last value of x should be 1.5"
assert abs(y[0] - mexican_hat(-1.5)) < 1e-5, "y values do not match x"
assert abs(y[-1] - mexican_hat(1.5)) < 1e-5, "y values do not match x"


Let's plot this now: plot the values `y` over `x` using matplotlib. Don't forget to import the library.

Remember for this and all subsequent plots: 
 - include a title
 - include axis labels
 - if you plot more than one function in a single figure, use labels and a legend
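
A minimal sketch of a figure that follows these conventions, assuming matplotlib is installed. The plotted functions and all label texts are placeholders, not part of the exercise.

```python
import math
import matplotlib.pyplot as plt

ts = [i * 0.1 for i in range(63)]  # roughly one period of sine and cosine

fig, ax = plt.subplots()
ax.plot(ts, [math.sin(t) for t in ts], label="sin(t)")
ax.plot(ts, [math.cos(t) for t in ts], label="cos(t)")
ax.set_title("Sine and cosine")
ax.set_xlabel("t")
ax.set_ylabel("function value")
ax.legend()  # builds the legend from the label= arguments given to plot()
```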

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Now we would like to understand how gradient descent converges to one of the minima shown above.

For this, copy the function from above and give it the new name `gradient_descent_all`.
Modify it as follows: instead of returning only the final `x` value, it should
return a list containing the value of `x` at each iteration, starting with the initial value `x0`.
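
The general pattern of recording every intermediate value can be sketched on an unrelated toy iteration. `halve_all` is a hypothetical helper invented for this illustration, not the solution to the exercise.

```python
def halve_all(x0, max_iter):
    x = x0
    history = [x]          # start the record with the initial value
    for _ in range(max_iter):
        x = x / 2          # one update step of the toy iteration
        history.append(x)  # record the value after this step
    return history

print(halve_all(8.0, 3))  # [8.0, 4.0, 2.0, 1.0]
```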

In [None]:
def gradient_descent_all(grad, x0, eta, max_iter):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
gradient_descent_all(mexican_hat_grad, 1.5, 0.1, 20)

In [None]:
assert iter(gradient_descent_all(mexican_hat_grad, 1.5, 0.1, 20)), "should give list"
assert abs(gradient_descent_all(mexican_hat_grad, 1.77, 0.1, 20)[0] - 1.77) < 1e-5, "initial value missing"
assert abs(gradient_descent_all(mexican_hat_grad, 1.5, 0.1, 20)[-1] - 1) < 1e-5, "last value should converge"


Now you can plot this list over the iteration number `t` to see how it converges.
Make two figures, each with about three curves: one where you vary `eta` and keep the
initial position constant, and one where you vary the initial position but keep
`eta` constant.

(Hint: look into the documentation of `plot` to make your life easier.)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Nesterov-accelerated Gradient Descent
-----------------------------------------------
Finally, let's explore the effect of momentum on gradient descent.

Copy your function `gradient_descent_all`, and give it a new name
`nesterov_all`.  Modify the function to implement the Nesterov acceleration scheme.  You
now need the additional mixing parameter `gamma`.
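
As a reminder, one common way to write the Nesterov scheme is given below. This is an assumption about the convention: the lecture's formulation is not reproduced here and may differ, for example in the sign of the velocity $v_t$. Starting from $v_0 = 0$:
$$
        v_{t+1} = \gamma v_t + \eta \, \mathrm{grad}(x_t - \gamma v_t), \qquad x_{t+1} = x_t - v_{t+1}
$$
The gradient is evaluated at the look-ahead point $x_t - \gamma v_t$ rather than at $x_t$. For $\gamma = 0$ the scheme reduces to plain gradient descent.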

In [None]:
def nesterov_all(grad, x0, eta, gamma, max_iter):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
nesterov_all(mexican_hat_grad, 2.0, 0.1, 0.1, 20)

In [None]:
# Sanity checks
assert iter(nesterov_all(mexican_hat_grad, 1.5, 0.1, 0.1, 20)), "should give list"
assert abs(nesterov_all(mexican_hat_grad, 1.77, 0.1, 0.1, 20)[0] - 1.77) < 1e-5, "initial value missing"
assert abs(nesterov_all(mexican_hat_grad, 1.5, 0.1, 0.1, 20)[-1] - 1) < 1e-5, "last value should converge"



Let's compare how Nesterov and regular GD converge. Make a figure with two plots: one showing the steps for regular gradient descent and one showing the steps for Nesterov acceleration, both for the `mexican_hat` function and the same starting point.

Play around a little with `eta` and `gamma`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Describe in words how the behaviour changes when momentum is included.

What happens when you vary `eta`? What happens when you vary `gamma`?

YOUR ANSWER HERE

Back-propagation
--------------------------
For a super-advanced final touch, let us briefly touch on back-propagation. Remember that it allows us to compute the derivative of a composition of functions:
$$
        E(x) = f(g(x))
$$
by using the chain rule.
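
Written out for this composition, the chain rule reads
$$
        E'(x) = f'(g(x)) \, g'(x)
$$
that is, the outer derivative $f'$ is evaluated at the inner function value $g(x)$ and then multiplied by the inner derivative $g'(x)$.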

Let's try this out: write a function that takes four other functions as arguments, representing $f$, $g$ and the derivatives $f'$ and $g'$, and **returns a new function** that computes $E'(x)$.
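
Returning a function from a function works with a nested `def`. Here is a minimal sketch on an unrelated task; `make_scaled`, `square` and `double_square` are hypothetical names used only for this illustration.

```python
def make_scaled(f, c):
    def scaled(x):
        # f and c are remembered from the enclosing call to make_scaled
        return c * f(x)
    return scaled  # return the function itself, without calling it

def square(x):
    return x**2

double_square = make_scaled(square, 2)
print(double_square(3))  # 2 * 3**2 = 18
```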

In [None]:
def get_back_prop(f, g, gradf, gradg):
    def gradfg(x):
        # Here you should use f, g, gradf, and/or gradg to compute E'(x)
        # YOUR CODE HERE
        raise NotImplementedError()
        
    # This returns the function we have constructed above.  In other words,
    # get_back_prop is a function that combines functions to a new function.
    # (Cue horn sounds from the movie Inception.)
    return gradfg

Let's try out your function:

I have assumed that $E(x) =$ `mexican_hat(linear_trafo(x))`, in other words,
$f(x) =$ `mexican_hat(x)` and $g(x) =$ `linear_trafo(x)`.  This is common in
Machine Learning, where linear and non-linear parts alternate.  Let's see
if we can apply back-propagation.

In [None]:
def linear_trafo(x):
    return 2 * x + 1

def linear_trafo_grad(x):
    return 2

def E(x):
    return mexican_hat(linear_trafo(x))

In [None]:
dEdx = get_back_prop(mexican_hat, linear_trafo, mexican_hat_grad, linear_trafo_grad)

In [None]:
assert abs(dEdx(0)) < 1e-5, "dEdx(0) should be zero"
assert abs(dEdx(-1)) < 1e-5, "dEdx(-1) should be zero"

assert abs(dEdx(0.5) - 48) < 1e-5, "dEdx(0.5) should be 48 (really? yes!)"
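
The value 48 in the last check can be verified by hand with the chain rule $E'(x) = f'(g(x)) \, g'(x)$, using the definitions of `mexican_hat_grad` and `linear_trafo` above:
$$
        g(0.5) = 2 \cdot 0.5 + 1 = 2, \qquad f'(2) = 4 \, (2^3 - 2) = 24, \qquad g'(0.5) = 2, \qquad E'(0.5) = 24 \cdot 2 = 48
$$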
