<a href="https://colab.research.google.com/github/pserebrennikov/3rd-year-project/blob/master/1_first_order_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 1 - First order methods
### Course on Optimization for Machine Learning - Dr. F. Ballarin
### Master Degree in Data Analytics for Business, Catholic University of the Sacred Heart, Milano

In this notebook we implement first order methods using the [`numpy` library](https://numpy.org/). Furthermore, we rely on [`plotly`](https://plotly.com/) for the visualization and creation of interactive plots.

In [None]:
import typing

In [None]:
import numpy as np
import plotly.colors
import plotly.graph_objects as go
import plotly.subplots

## Exercise 1.1
Let $\boldsymbol{w} \in \mathbb{R}^2$. Consider the following quadratic function, known as *Booth function* and defined as
$$f(\boldsymbol{w}) = (w^{(0)} + 2 w^{(1)} - 7)^2 + (2 w^{(0)} + w^{(1)} - 5)^2.$$
Note that inside notebooks we will denote $\boldsymbol{w} = (w^{(0)}, w^{(1)})$ rather than $\boldsymbol{w} = (w^{(1)}, w^{(2)})$, because Python starts counting from 0!

1. Draw a surface plot and a contour plot of the function $f$ on the square domain $[-10, 10]^2$.

*Solution*:
> First of all we define the domain for each component of $\boldsymbol{w}$. Note that in this case they correspond to the same interval, but in general (e.g. a rectangular domain) they might be different.

In [None]:
domain_component_0 = [-10, 10]
domain_component_1 = [-10, 10]

> In order to prepare a plot, we define a uniform subdivision in each coordinate direction, made up of 100 points.

In [None]:
w_component_0 = np.linspace(domain_component_0[0], domain_component_0[1], 100)
w_component_1 = np.linspace(domain_component_1[0], domain_component_1[1], 100)

In [None]:
w_component_0

> We then evaluate the function $f$ at every $\boldsymbol{w} = (w^{(0)}, w^{(1)})$ pair by means of a simple `for` loop.

In [None]:
# Rows of f_w follow w^(1) and columns follow w^(0), as plotly expects for z
f_w = np.zeros((len(w_component_1), len(w_component_0)))
for i in range(f_w.shape[1]):
    for j in range(f_w.shape[0]):
        f_w[j, i] = (
            (w_component_0[i] + 2 * w_component_1[j] - 7)**2
            + (2 * w_component_0[i] + w_component_1[j] - 5)**2
        )

> We show a surface plot using `plotly`. Note that you can interact with the plot in multiple ways. Try hovering your mouse over the surface, and changing the options at the top right of the figure.

In [None]:
fig = go.Figure(data=[go.Surface(x=w_component_0, y=w_component_1, z=f_w)])
fig.update_layout(title="Booth function - surface plot")
fig.show()

> Note how `plotly` shows contour lines when hovering the mouse on the surface. Still, we can prepare a contour plot using similar instructions.

In [None]:
fig = go.Figure(data=[go.Contour(x=w_component_0, y=w_component_1, z=f_w)])
fig.update_layout(title="Booth function - contour plot", width=512, height=512, autosize=False)
fig.show()

> One can even combine surface and contour plots in the same figure.

In [None]:
fig = go.Figure(data=[go.Surface(x=w_component_0, y=w_component_1, z=f_w)])
fig.update_traces(contours_z=dict(show=True, project_z=True, usecolormap=True))
fig.update_layout(title="Booth function - surface and contour plots in the same figure")
fig.show()

*Further suggestions*:
> * if you have a programming background or some experience with `numpy`, you might have heard about *vectorization* as a better (i.e., more efficient) way to evaluate a function at several points. Feel free to replace the nested `for` loop that we used to evaluate $f(\boldsymbol{w})$ with a vectorized version. You may want to use [`numpy.meshgrid`](https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html). While devising algorithms that extract the best performance out of the computing architectures and libraries is certainly beneficial in machine learning applications, in this course we are more interested in using Python and its libraries as tools to understand optimization algorithms;
> * if you have had courses in Python before, you might have used [`matplotlib`](https://matplotlib.org/) to create your plots. `matplotlib` is arguably the most widely used plotting library in Python. In this course I have chosen to show you `plotly` because it offers simpler ways to interact with the generated plots compared to `matplotlib`. I believe that this is especially important for data science students, e.g. when carrying out preprocessing tasks on the dataset (e.g., you have a scatterplot and you would like to easily determine the IDs of points which are outliers, so that you can treat them appropriately).
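
> As a minimal sketch of the vectorized evaluation suggested above (the names `W0`, `W1` and `f_w_vectorized` are introduced here only for illustration), `numpy.meshgrid` builds two 2d arrays containing the coordinates of every grid point, so that the Booth function can be evaluated on the whole grid with a single expression.

```python
import numpy as np

# Same uniform subdivision of [-10, 10] used for the plots, restated so the cell is self-contained
w_component_0 = np.linspace(-10, 10, 100)
w_component_1 = np.linspace(-10, 10, 100)

# With the default indexing="xy": W0[j, i] == w_component_0[i] and W1[j, i] == w_component_1[j]
W0, W1 = np.meshgrid(w_component_0, w_component_1)

# Evaluate the Booth function at every grid point at once, without explicit loops
f_w_vectorized = (W0 + 2 * W1 - 7)**2 + (2 * W0 + W1 - 5)**2
```

> The entry `f_w_vectorized[j, i]` is the value at $(w^{(0)}_i, w^{(1)}_j)$, which is the layout `plotly` expects for the `z` argument.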

2. Compute the gradient $\nabla f$ and determine the global minimum of the function $f$.

*Solution*:
> Recall that $$f(\boldsymbol{w}) = (w^{(0)} + 2 w^{(1)} - 7)^2 + (2 w^{(0)} + w^{(1)} - 5)^2.$$
> By taking the partial derivatives of $f$ we see that the gradient $\nabla f$ is
\begin{equation*}
\nabla f(\boldsymbol{w}) = \begin{bmatrix}
2 \cdot (w^{(0)} + 2 w^{(1)} - 7) + 2 \cdot 2 \cdot (2 w^{(0)} + w^{(1)} - 5)\\
2 \cdot 2 \cdot (w^{(0)} + 2 w^{(1)} - 7) + 2 \cdot (2 w^{(0)} + w^{(1)} - 5)
\end{bmatrix} = \begin{bmatrix}
10 w^{(0)} + 8 w^{(1)} - 34\\
8 w^{(0)} + 10 w^{(1)} - 38\\
\end{bmatrix}.
\end{equation*}
> Since stationary points are such that $\nabla f(\boldsymbol{w}) = \boldsymbol{0}$, we have to solve
\begin{equation*}
\begin{cases}
10 w^{(0)} + 8 w^{(1)} - 34 = 0\\
8 w^{(0)} + 10 w^{(1)} - 38 = 0\\
\end{cases}.
\end{equation*}
This is a linear system that you can easily solve on paper to obtain $\boldsymbol{w}^* = (1, 3)$.
>
> Is $\boldsymbol{w}^*$ a local minimum? We can proceed in at least a couple of ways:
> * *graphical way*: we have just plotted $f$ in the square domain $[-10, 10]^2$, and we notice that $\boldsymbol{w}^*$ is inside that domain. If we hover our mouse on the surface plot we can easily locate $\boldsymbol{w}^*$ on the plot and confirm that it is a minimum.
> * *mathematical way*: we can make use of the second order optimality conditions. They require us to compute the Hessian of $f$
\begin{equation*}
\nabla^2 f(\boldsymbol{w}) =
\begin{bmatrix}
10 & 8\\
8 & 10\\
\end{bmatrix}
\end{equation*}
and check if it is positive definite. In order to check the positive definiteness of such matrix we use the equivalent definition in terms of its eigenvalues, which can be computed by means of the [`numpy.linalg.eig`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html) function.

In [None]:
hessian_f = np.array([[10, 8], [8, 10]])
hessian_f

In [None]:
eigs, _ = np.linalg.eig(hessian_f)
eigs

In [None]:
assert (eigs > 0).all()

> Since all the eigenvalues of the Hessian matrix (evaluated at $\boldsymbol{w}^*$) are strictly positive we conclude that $\boldsymbol{w}^*$ is a local minimum.
>
> Is $\boldsymbol{w}^*$ a global minimum? We have just shown that it is a local minimum; furthermore, there are no other possible minima (i.e., the equation $\nabla f(\boldsymbol{w}) = \boldsymbol{0}$ has only one solution), so $\boldsymbol{w}^*$ is actually a global minimum.
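
> The stationary point obtained on paper can also be cross-checked numerically. The following is a small sketch, assuming we rewrite $\nabla f(\boldsymbol{w}) = \boldsymbol{0}$ as the linear system $A \boldsymbol{w} = \boldsymbol{b}$; the names `A`, `b` and `w_star_numeric` are introduced here only for illustration.

```python
import numpy as np

# The gradient of the Booth function is A w - b, so stationary points solve A w = b
A = np.array([[10.0, 8.0], [8.0, 10.0]])
b = np.array([34.0, 38.0])

# Solve the 2x2 linear system
w_star_numeric = np.linalg.solve(A, b)
w_star_numeric
```

> Note that `A` coincides with the Hessian of $f$, because $f$ is quadratic.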

3. Implement the gradient descent method with constant step length in a Python function. Use the stopping criterion based on the error of the cost. Such function should:
   * take as input the value $\alpha$ of the step length, the tolerance $\varepsilon$ for the stopping criterion, and the initial condition $\boldsymbol{w}_{0}$;
   * return as outputs the optimization variable iterations $\{\boldsymbol{w}_k\}_k$, the corresponding function values $\{f(\boldsymbol{w}_k)\}_k$ and gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$.

*Solution*:
> We first prepare two functions for the evaluation of $f$ and $\nabla f$. Note that the input of both functions is the point $\boldsymbol{w}$ stored as a [`numpy array`](https://numpy.org/doc/stable/reference/generated/numpy.array.html), so we can access its components $w^{(0)}$ and $w^{(1)}$ using the code `w[0]` and `w[1]`.

In [None]:
def f_ex_1_1(w: np.ndarray) -> float:
    """Evaluate f(w)."""
    return (w[0] + 2 * w[1] - 7)**2 + (2 * w[0] + w[1] - 5)**2

> Furthermore, in preparing the function for $\nabla f$ we return the vector as a `numpy array` as well.

In [None]:
def grad_f_ex_1_1(w: np.ndarray) -> np.ndarray:
    r"""Evaluate \nabla f(w)."""
    return np.array([10 * w[0] + 8 * w[1] - 34, 8 * w[0] + 10 * w[1] - 38])

> We can do a simple check to verify that we have not made any mistake when copying the formulae. We know that $\boldsymbol{w}^* = (1, 3)$ is such that $f(\boldsymbol{w}^*) = 0$ (by substitution in the expression of $f$) and that it is a minimum (so $\nabla f(\boldsymbol{w}^*) = \boldsymbol{0}$).

In [None]:
w_star = np.array([1, 3])
w_star

In [None]:
f_ex_1_1(w_star)

In [None]:
grad_f_ex_1_1(w_star)
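
> Checking the gradient at $\boldsymbol{w}^*$ only tests a single point. A further check, sketched below, is to compare the analytical gradient with a central finite difference approximation at another point. The helper `finite_difference_gradient` and the names `f_booth` and `grad_f_booth` (copies of the two functions above, restated so that the cell is self-contained) are hypothetical and not part of the exercise.

```python
import numpy as np


def f_booth(w: np.ndarray) -> float:
    """Evaluate the Booth function (same formula as f_ex_1_1)."""
    return (w[0] + 2 * w[1] - 7)**2 + (2 * w[0] + w[1] - 5)**2


def grad_f_booth(w: np.ndarray) -> np.ndarray:
    """Evaluate the gradient of the Booth function (same formula as grad_f_ex_1_1)."""
    return np.array([10 * w[0] + 8 * w[1] - 34, 8 * w[0] + 10 * w[1] - 38])


def finite_difference_gradient(f, w: np.ndarray, h: float = 1e-6) -> np.ndarray:
    """Approximate the gradient of f at w by central finite differences."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(w.shape[0]):
        # Perturb only the i-th component
        e_i = np.zeros_like(w)
        e_i[i] = h
        grad[i] = (f(w + e_i) - f(w - e_i)) / (2 * h)
    return grad


w_test = np.array([-8.0, -8.0])
fd_error = np.linalg.norm(finite_difference_gradient(f_booth, w_test) - grad_f_booth(w_test))
fd_error
```

> A value of `fd_error` close to zero (up to rounding) suggests that the formula for $\nabla f$ has been transcribed correctly.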

> We next implement the gradient descent method with constant step length.

In [None]:
def gradient_descent_ex_1_1(alpha: float, epsilon: float, w_0: np.ndarray) -> typing.Tuple[
        np.ndarray, np.ndarray, np.ndarray]:
    """
    Run the gradient descent method with constant step length.

    Parameters
    ----------
    alpha : float
        constant step length.
    epsilon : float
        tolerance for the stopping criterion on the error on the cost.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
    2d numpy array
        history of the gradient of the cost function.
    """
    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f_ex_1_1(w_0)]
    all_grad_f = [grad_f_ex_1_1(w_0)]

    # Prepare iteration counter
    k = 0

    # Use the error on the cost to determine when the while loop should stop.
    # Since the optimal cost function is zero and f >= 0 everywhere, the error is actually the evaluation of f.
    while all_f[k] > epsilon:
        w_k = all_w[k]
        grad_f_k = all_grad_f[k]
        w_k_plus_1 = w_k - alpha * grad_f_k

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f_ex_1_1(w_k_plus_1))
        all_grad_f.append(grad_f_ex_1_1(w_k_plus_1))

        # Bail out if the descent condition is not satisfied
        if all_f[k + 1] >= all_f[k]:
            print("WARNING: descent condition is not satisfied")
            break

        # Increment iteration counter
        k += 1

    # For convenience we transform the outputs into numpy array before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f)

4. Choose $\alpha = 10^{-2}$, $\varepsilon = 10^{-5}$ and $\boldsymbol{w}_0 = (-8, -8)$. Visualize:
   * the optimization variable iterations $\{\boldsymbol{w}_k\}_k$ on a surface plot of $f$;
   * the optimization variable iterations $\{\boldsymbol{w}_k\}_k$ on a contour plot of $f$;
   * a semilogarithmic plot of the error in the function value $\{f(\boldsymbol{w}_k) - f(\boldsymbol{w}^*)\}_k$ versus the iteration counter $k$;
   * a semilogarithmic plot of the norm of the gradients $\{\left\|\nabla f(\boldsymbol{w}_k)\right\|\}_k$ versus the iteration counter $k$.
 
*Solution*:
> First of all we query our implementation `gradient_descent_ex_1_1`.

In [None]:
all_w, all_f, all_grad_f = gradient_descent_ex_1_1(1e-2, 1e-5, np.array([-8.0, -8.0]))

> To get an intuition of how the iterative process went we may have a look at the outputs of the function.

In [None]:
assert all_w.shape[0] == all_f.shape[0] == all_grad_f.shape[0]
all_w.shape[0]  # number of stored iterates (initial condition included)

In [None]:
all_w

In [None]:
all_f

In [None]:
all_grad_f

> We first prepare the image containing the optimization variable iterations $\{\boldsymbol{w}_k\}_k$ on a surface plot of $f$.

In [None]:
fig = go.Figure(
    data=[go.Surface(x=w_component_0, y=w_component_1, z=f_w, opacity=0.5)]
)
fig.add_scatter3d(
    x=all_w[:, 0], y=all_w[:, 1], z=all_f,
    marker=dict(color="black", size=4),
    line=dict(color="black", width=2)
)
fig.update_layout(title="Booth function - optimization variable iterations over surface plot")
fig.show()

> Similarly, the optimization variable iterations $\{\boldsymbol{w}_k\}_k$ can be plotted on a contour plot of $f$. Note how the gradient method proceeds in a direction orthogonal to the contour lines, as expected (because from calculus courses we know that the gradient is orthogonal to the contour lines).

In [None]:
fig = go.Figure(data=[go.Contour(x=w_component_0, y=w_component_1, z=f_w, opacity=0.5)])
fig.add_scatter(
    x=all_w[:, 0], y=all_w[:, 1],
    marker=dict(color="black", size=10),
    line=dict(color="black", width=2),
    mode="lines+markers"
)
fig.update_layout(
    title="Booth function - optimization variable iterations over contour plot",
    width=512, height=512, autosize=False
)
fig.show()

> The next plot requires the error in the function value $f(\boldsymbol{w}_k) - f(\boldsymbol{w}^*)$. However, since in this case $f(\boldsymbol{w}^*)$ is zero, that is actually the same as plotting $f(\boldsymbol{w}_k)$. Preparing such a plot with a linear vertical axis shows that there is a very steep decrease of the objective function in the first 10 iterations, as we could have also realized from the plots above. However, we cannot appreciate the effect of subsequent improvements because they are very close to the horizontal axis.

In [None]:
fig = go.Figure(data=go.Scatter(x=np.arange(all_f.shape[0]), y=all_f))
fig.update_layout(title="Booth function - error in the function value")
fig.show()

> In these situations (data that are distributed across several orders of magnitude) it is convenient to employ semilogarithmic plots. In this case, we adopt a logarithmic scale on the vertical axis.

In [None]:
fig = go.Figure(data=go.Scatter(x=np.arange(all_f.shape[0]), y=all_f))
fig.update_layout(title="Booth function - error on the function value - semilog plot")
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> Note how, for $k > 25$, the curve in the previous plot very closely resembles a line. We will come back to this point later on in the lecture.
>
> Similarly, we prepare a plot of the norm of the gradients $\left\|\nabla f(\boldsymbol{w}_k)\right\|$. We use a semilog plot as well, since these norms will also vary by several orders of magnitude.

In [None]:
fig = go.Figure(data=go.Scatter(x=np.arange(all_f.shape[0]), y=np.linalg.norm(all_grad_f, axis=1)))
fig.update_layout(title="Booth function - violation of first order optimality conditions - semilog plot")
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> Note how the plot of the norm of the gradient also resembles a linear profile for $k > 25$. Note however that while the cost function error at the final iteration is of the order of $10^{-5}$, the violation of the first order optimality conditions is considerably larger, of the order of $10^{-3}$ to $10^{-2}$. We will also come back to this point later on in the lecture.
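
> The linear profile in the semilog plot can be quantified by fitting a line to $\log_{10} f(\boldsymbol{w}_k)$ for large $k$. The sketch below reruns a compact version of the gradient descent loop so that the cell is self-contained; the names `f_booth`, `grad_f_booth`, `measured_slope` and `predicted_slope` are introduced only for illustration. The predicted slope relies on an assumption not discussed in the text so far: for a quadratic function and large $k$, the error is dominated by the smallest eigenvalue $\lambda_{\min}$ of the Hessian, and $f$ is multiplied by $(1 - \alpha \lambda_{\min})^2$ at every iteration.

```python
import numpy as np


def f_booth(w: np.ndarray) -> float:
    """Evaluate the Booth function."""
    return (w[0] + 2 * w[1] - 7)**2 + (2 * w[0] + w[1] - 5)**2


def grad_f_booth(w: np.ndarray) -> np.ndarray:
    """Evaluate the gradient of the Booth function."""
    return np.array([10 * w[0] + 8 * w[1] - 34, 8 * w[0] + 10 * w[1] - 38])


# Gradient descent with constant step length, same parameters as in this point of the exercise
alpha = 1e-2
w = np.array([-8.0, -8.0])
f_history = [f_booth(w)]
while f_history[-1] > 1e-5:
    w = w - alpha * grad_f_booth(w)
    f_history.append(f_booth(w))
f_history = np.array(f_history)

# Fit a line to log10(f) on the tail of the history, where the profile looks linear
k = np.arange(f_history.shape[0])
tail = k >= 50
measured_slope, _ = np.polyfit(k[tail], np.log10(f_history[tail]), 1)

# Assumption: the tail is governed by the smallest eigenvalue of the Hessian
lambda_min = np.linalg.eigvalsh(np.array([[10.0, 8.0], [8.0, 10.0]])).min()
predicted_slope = 2 * np.log10(1 - alpha * lambda_min)

measured_slope, predicted_slope
```

> If the two slopes agree, the straight line in the semilog plot corresponds to a constant reduction factor per iteration.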

5. Now keep $\alpha = 10^{-2}$ and $\varepsilon = 10^{-5}$, but change the initial conditions $\boldsymbol{w}_0$ by trying eight equispaced points on the circle centered at the origin and of radius 8. Does the gradient method always converge to $\boldsymbol{w}^*$?

*Solution*:

In [None]:
fig = go.Figure(data=[go.Contour(x=w_component_0, y=w_component_1, z=f_w, opacity=0.5, showscale=False)])
for guess in range(8):
    theta = guess * np.pi / 4
    w_0_theta = 8 * np.array([np.cos(theta), np.sin(theta)])
    all_w_theta, _, _ = gradient_descent_ex_1_1(1e-2, 1e-5, w_0_theta)
    fig.add_scatter(
        x=all_w_theta[:, 0], y=all_w_theta[:, 1],
        marker=dict(color=plotly.colors.qualitative.Set1[guess], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[guess], width=2),
        mode="lines+markers", name="Theta = " + str(guess / 4) + " pi"
    )
fig.update_layout(
    title="Booth function - optimization variable iterations over contour plot - different initial points",
    width=612, height=512, autosize=False
)
fig.show()

> We do always have convergence to $\boldsymbol{w}^*$. We will justify this later on in the lecture.

6. Consider now $\varepsilon = 10^{-5}$, $\boldsymbol{w}_0 = (-3, -3)$ and let the step length $\alpha$ vary among the following five possible choices: 1, 1/9, 1/9.4, 1/18 and 1/36. Does the gradient method always converge to $\boldsymbol{w}^*$?

*Solution*:
> Let us now consider first $\alpha = 1$.

In [None]:
all_w_alpha_1, all_f_alpha_1, _ = gradient_descent_ex_1_1(1, 1e-5, np.array([-3.0, -3.0]))

> The descent condition has been violated at the first iteration: with such a large value of $\alpha$ the iterative method is diverging.

In [None]:
all_f_alpha_1

> Let us now move to $\alpha = 1/9$.

In [None]:
all_w_alpha_9, all_f_alpha_9, _ = gradient_descent_ex_1_1(1 / 9, 1e-5, np.array([-3.0, -3.0]))

> Also in this case the iterations have been stopped. However, in contrast to $\alpha=1$, it seems that the cost function is getting stuck at a value around $450$. Such a value is not associated with a minimum.

In [None]:
all_f_alpha_9

> The optimization variables seem to oscillate between the point (6, 8) and the point (-4, -2) without making any sensible progress.

In [None]:
all_w_alpha_9

> We move now to the case $\alpha = 1/9.4$.

In [None]:
all_w_alpha_94, all_f_alpha_94, _ = gradient_descent_ex_1_1(1 / 9.4, 1e-5, np.array([-3.0, -3.0]))

In [None]:
all_f_alpha_94

> We can query how many iterations were required to reach convergence: around 100 iterations in this case.

In [None]:
all_f_alpha_94.shape[0]

> Let us now try $\alpha = 1/18$.

In [None]:
all_w_alpha_18, all_f_alpha_18, _ = gradient_descent_ex_1_1(1 / 18, 1e-5, np.array([-3.0, -3.0]))

In [None]:
all_f_alpha_18

> We can query how many iterations were required to reach convergence: around 50 iterations in this case.

In [None]:
all_f_alpha_18.shape[0]

> Finally, we try with $\alpha = 1/36$.

In [None]:
all_w_alpha_36, all_f_alpha_36, _ = gradient_descent_ex_1_1(1 / 36, 1e-5, np.array([-3.0, -3.0]))

In [None]:
all_f_alpha_36

> We also have convergence for $\alpha = 1/36$, again with around 100 iterations.

In [None]:
all_f_alpha_36.shape[0]

> We prepare four plots of the optimization variable iterations over the contour plot of $f$, each corresponding to a value of $\alpha$, discarding the case $\alpha = 1$ which did not converge.

In [None]:
fig = plotly.subplots.make_subplots(rows=2, cols=2)
denominators = [9, 9.4, 18, 36]
rows = [1, 1, 2, 2]
cols = [1, 2, 1, 2]
all_w_alpha = [all_w_alpha_9, all_w_alpha_94, all_w_alpha_18, all_w_alpha_36]
for alpha_index in range(4):
    fig.add_contour(
        x=w_component_0, y=w_component_1, z=f_w, opacity=0.5, showscale=False,
        row=rows[alpha_index], col=cols[alpha_index]
    )
    fig.add_scatter(
        x=all_w_alpha[alpha_index][:, 0], y=all_w_alpha[alpha_index][:, 1],
        marker=dict(color=plotly.colors.qualitative.Set1[alpha_index], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[alpha_index], width=2),
        mode="lines+markers", name="Alpha = 1/" + str(denominators[alpha_index]),
        row=rows[alpha_index], col=cols[alpha_index]
    )
fig.update_layout(
    title="Booth function - optimization variable iterations over contour plot - different step lengths",
    width=768, height=768, autosize=False
)
fig.show()

> These plots provide a graphical summary of the behaviors that we encountered when varying $\alpha$:
> * we have not shown the case $\alpha = 1$, which is the diverging case;
> * *top left* corresponds to $\alpha = 1/9$, a case in which the oscillations do not reduce and the iterations do not converge;
> * *top right* corresponds to $\alpha = 1/9.4$, a case of slow and oscillatory convergence;
> * *bottom left* corresponds to $\alpha = 1/18$, the reciprocal of the largest eigenvalue of the Hessian, which gives the fastest convergence among the step lengths tested;
> * *bottom right* corresponds to $\alpha = 1/36$, a case of slow and non-oscillatory convergence.
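
> The behaviors listed above can be anticipated with a short computation. This is a sketch based on a standard argument for quadratic functions, which is an assumption with respect to what has been shown so far: since $\nabla f(\boldsymbol{w}) = H (\boldsymbol{w} - \boldsymbol{w}^*)$ with $H$ the Hessian, each iteration multiplies the component of the error along the $i$-th eigenvector of $H$ by $1 - \alpha \lambda_i$. The names `step_lengths` and `worst_factor` are introduced only for illustration.

```python
import numpy as np

hessian_f = np.array([[10.0, 8.0], [8.0, 10.0]])
eigenvalues = np.linalg.eigvalsh(hessian_f)  # returned in ascending order

step_lengths = {"1": 1.0, "1/9": 1 / 9, "1/9.4": 1 / 9.4, "1/18": 1 / 18, "1/36": 1 / 36}
worst_factor = {}
for name, alpha in step_lengths.items():
    # Per-iteration multiplier of the error along each eigenvector of the Hessian
    factors = 1 - alpha * eigenvalues
    worst_factor[name] = np.abs(factors).max()
    print(f"alpha = {name}: factors = {factors}, largest magnitude = {worst_factor[name]:.4f}")
```

> A largest magnitude above 1 indicates divergence, a value equal to 1 with a negative factor indicates oscillations that do not reduce, and a value below 1 indicates convergence (oscillatory when the factor of largest magnitude is negative).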

> We next plot the history of the error on the function value for different step lengths.

In [None]:
fig = plotly.subplots.make_subplots(rows=2, cols=2)
all_f_alpha = [all_f_alpha_9, all_f_alpha_94, all_f_alpha_18, all_f_alpha_36]
for alpha_index in range(4):
    fig.add_scatter(
        x=np.arange(all_f_alpha[alpha_index].shape[0]), y=all_f_alpha[alpha_index],
        marker=dict(color=plotly.colors.qualitative.Set1[alpha_index], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[alpha_index], width=2),
        mode="lines+markers", name="Alpha = 1/" + str(denominators[alpha_index]),
        row=rows[alpha_index], col=cols[alpha_index]
    )
fig.update_layout(
    title="Booth function - error on the function value - different step lengths",
    width=768, height=768, autosize=False
)
fig.update_xaxes(range=[0, 110])
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> Note how the slowly converging cases ($\alpha = 1/9.4$ *top right* and $\alpha = 1/36$ *bottom right*), even though characterized by a similar number of iterations, have a very different convergence history: the cost function in the *top right* decreases at the same rate throughout the optimization iterations, while the cost function in the *bottom right* has a very fast decrease during the first few iterations followed by a slower decrease.
>
> The case $\alpha = 1/18$ is characterized by the fastest decrease in the very first iteration, followed by a slow decrease in the subsequent iterations. For a generic function, is there a general way to determine good choices of $\alpha$ (i.e., exclude values that may cause divergence) and have an idea of how many iterations will be required to get convergence? We will discuss these points later on in the lecture.

## Exercise 1.2
Let $\boldsymbol{w} \in \mathbb{R}^2$. Consider the following function
$$f(\boldsymbol{w}) = e^{w^{(0)} + 3 w^{(1)} - 0.1} + e^{w^{(0)} - 3 w^{(1)} - 0.1} + e^{- w^{(0)} - 0.1}.$$

1. Draw a surface plot and a contour plot of the function $f$ on the square domain $[-2, 2]^2$.

*Solution*:
> As in the previous exercise we define a uniform subdivision in each coordinate direction, made up of 100 points.

In [None]:
domain_component_0 = [-2, 2]
domain_component_1 = [-2, 2]

In [None]:
w_component_0 = np.linspace(domain_component_0[0], domain_component_0[1], 100)
w_component_1 = np.linspace(domain_component_1[0], domain_component_1[1], 100)

> We then evaluate the function $f$ at every $\boldsymbol{w}$.

In [None]:
def f_ex_1_2(w: np.ndarray) -> float:
    """Evaluate f(w)."""
    return np.exp(w[0] + 3 * w[1] - 0.1) + np.exp(w[0] - 3 * w[1] - 0.1) + np.exp(- w[0] - 0.1)

In [None]:
# Rows of f_w follow w^(1) and columns follow w^(0), as plotly expects for z
f_w = np.zeros((len(w_component_1), len(w_component_0)))
for i in range(f_w.shape[1]):
    for j in range(f_w.shape[0]):
        f_w[j, i] = f_ex_1_2([w_component_0[i], w_component_1[j]])

> When preparing a contour plot with `plotly` we notice that the resulting plot is not very informative because most of the figure is associated to a single contour level.

In [None]:
fig = go.Figure(data=[go.Contour(x=w_component_0, y=w_component_1, z=f_w)])
fig.update_layout(title="Function exercise 1.2 - contour plot", width=512, height=512, autosize=False)
fig.show()

> This is again due to the fact that the function $f$ spans several orders of magnitude. As in the previous exercise, we can use a log scale to improve the visualization.

In [None]:
fig = go.Figure(data=[go.Contour(
    x=w_component_0, y=w_component_1, z=np.log10(f_w),
    hovertemplate="x: %{x:.2f}<br>y: %{y:.2f}<br>z: 10^%{z:.2f} = %{customdata}", customdata=f_w,
    colorbar=dict(tickprefix="10^")
)])
fig.update_layout(title="Function exercise 1.2 - contour plot (log scale)", width=512, height=512, autosize=False)
fig.show()

2. Compute the gradient $\nabla f$ and determine the global minimum of the function $f$.

*Solution*:
> We compute the partial derivatives of $f(\boldsymbol{w}) = e^{w^{(0)} + 3 w^{(1)} - 0.1} + e^{w^{(0)} - 3 w^{(1)} - 0.1} + e^{- w^{(0)} - 0.1}$ to obtain the gradient
\begin{equation*}
\nabla f(\boldsymbol{w}) = \begin{bmatrix}
e^{w^{(0)} + 3 w^{(1)} - 0.1} + e^{w^{(0)} - 3 w^{(1)} - 0.1} - e^{- w^{(0)} - 0.1}\\
3 e^{w^{(0)} + 3 w^{(1)} - 0.1} - 3 e^{w^{(0)} - 3 w^{(1)} - 0.1}
\end{bmatrix}
\end{equation*}
Solving the equation $\nabla f(\boldsymbol{w}) = 0$ is more complicated than the previous exercise, but we are still in a favorable situation in which, with a bit of algebra, we can solve this problem analytically. Let us start from the second equation
$$3 e^{w^{(0)} + 3 w^{(1)} - 0.1} = 3 e^{w^{(0)} - 3 w^{(1)} - 0.1}.$$
Dividing by 3 and taking the $\ln$ on both sides we get
$$w^{(0)} + 3 w^{(1)} - 0.1 = w^{(0)} - 3 w^{(1)} - 0.1$$
from which we obtain ${w^*}^{(1)} = 0$. Substituting this value into the first equation gives
$$e^{w^{(0)} - 0.1} + e^{w^{(0)} - 0.1} - e^{- w^{(0)} - 0.1} = 0$$
which can be equivalently rewritten as
$$2 e^{w^{(0)}} e^{- 0.1} - e^{- w^{(0)}}e^{- 0.1} = 0$$
which divided by $e^{- 0.1}$ gives
$$2 e^{w^{(0)}} - e^{- w^{(0)}} = 0.$$
Finally, multiplying both sides by $e^{w^{(0)}}$ we end up with
$$2 e^{2 w^{(0)}} - 1 = 0, \qquad\text{that is}\qquad e^{2 w^{(0)}} = \frac{1}{2},$$
from which we obtain ${w^*}^{(0)} = - \frac{\ln 2}{2}$. In conclusion, the only stationary point is
$$ \boldsymbol{w}^* = \left(- \frac{\ln 2}{2}, 0\right).$$
>
> In order to determine if $\boldsymbol{w}^*$ is a minimum (and therefore a global minimum, being the only stationary point), we compute the hessian matrix
\begin{equation*}
\nabla^2 f(\boldsymbol{w}) = \begin{bmatrix}
e^{w^{(0)} + 3 w^{(1)} - 0.1} + e^{w^{(0)} - 3 w^{(1)} - 0.1} + e^{- w^{(0)} - 0.1}
&3 e^{w^{(0)} + 3 w^{(1)} - 0.1} - 3 e^{w^{(0)} - 3 w^{(1)} - 0.1}\\
3 e^{w^{(0)} + 3 w^{(1)} - 0.1} - 3 e^{w^{(0)} - 3 w^{(1)} - 0.1}
&9 e^{w^{(0)} + 3 w^{(1)} - 0.1} + 9 e^{w^{(0)} - 3 w^{(1)} - 0.1}
\end{bmatrix}
\end{equation*}
> We notice that $\nabla^2 f(\boldsymbol{w}^*)$ is of the following form
\begin{equation*}
\nabla^2 f(\boldsymbol{w}^*) = \begin{bmatrix}
e^{{w^*}^{(0)} + 3 {w^*}^{(1)} - 0.1} + e^{{w^*}^{(0)} - 3 {w^*}^{(1)} - 0.1} + e^{- {w^*}^{(0)} - 0.1}
&0\\
0
&9 e^{{w^*}^{(0)} + 3 {w^*}^{(1)} - 0.1} + 9 e^{{w^*}^{(0)} - 3 {w^*}^{(1)} - 0.1}
\end{bmatrix}
\end{equation*}
because the off-diagonal terms coincide with the left-hand side of the second equation of the first order optimality condition, which vanishes at $\boldsymbol{w}^*$. This is a diagonal matrix, so its eigenvalues are the diagonal elements. Furthermore, since such elements are positive (being sums of exponential functions), we conclude that $\nabla^2 f(\boldsymbol{w}^*)$ is positive definite; thus $\boldsymbol{w}^*$ is a local minimum and, being the only stationary point, a global minimum.

> We conclude this point by:
> * defining $\boldsymbol{w}^*$, and determining the optimal value $f(\boldsymbol{w}^*)$,
> * implementing a Python function for $\nabla f$ as well, checking that it is implemented correctly by evaluating $\nabla f(\boldsymbol{w}^*)$.

In [None]:
w_star = np.array([- np.log(2) / 2, 0])
w_star

In [None]:
f_star = f_ex_1_2(w_star)
f_star

In [None]:
def grad_f_ex_1_2(w: np.ndarray) -> np.ndarray:
    r"""Evaluate \nabla f(w)."""
    return np.array([
        np.exp(w[0] + 3 * w[1] - 0.1) + np.exp(w[0] - 3 * w[1] - 0.1) - np.exp(- w[0] - 0.1),
        3 * np.exp(w[0] + 3 * w[1] - 0.1) - 3 * np.exp(w[0] - 3 * w[1] - 0.1)
    ])

In [None]:
grad_f_ex_1_2(w_star)
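
> As in the previous exercise, the positive definiteness of the Hessian at $\boldsymbol{w}^*$ can also be confirmed numerically. The function `hessian_f_ex_1_2` below is a hypothetical helper, transcribed from the formula of $\nabla^2 f$ derived above; it is a sketch and not part of the exercise.

```python
import numpy as np


def hessian_f_ex_1_2(w: np.ndarray) -> np.ndarray:
    r"""Evaluate \nabla^2 f(w) for the function of exercise 1.2."""
    e_plus = np.exp(w[0] + 3 * w[1] - 0.1)
    e_minus = np.exp(w[0] - 3 * w[1] - 0.1)
    e_neg = np.exp(- w[0] - 0.1)
    return np.array([
        [e_plus + e_minus + e_neg, 3 * e_plus - 3 * e_minus],
        [3 * e_plus - 3 * e_minus, 9 * e_plus + 9 * e_minus]
    ])


w_star = np.array([- np.log(2) / 2, 0.0])
hessian_at_w_star = hessian_f_ex_1_2(w_star)

# The Hessian is symmetric, so eigvalsh can be used to compute its (real) eigenvalues
eigs_at_w_star = np.linalg.eigvalsh(hessian_at_w_star)
assert (eigs_at_w_star > 0).all()
eigs_at_w_star
```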

3. Implement the gradient descent method with exact line search in a Python function. Use the stopping criterion based on the error of the cost. Such function should:
   * take as input the tolerance $\varepsilon$ for the stopping criterion and the initial condition $\boldsymbol{w}_{0}$;
   * return as outputs the optimization variable iterations $\{\boldsymbol{w}_k\}_k$, the corresponding function values $\{f(\boldsymbol{w}_k)\}_k$ and gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$, and the values of the step lengths $\{\alpha_k\}_k$.
 
*Solution*:
> We will provide a very simple (and inefficient!) implementation of the exact line search. Define
$$E_k(a) = f(\boldsymbol{w}_k + a \boldsymbol{g}_k),$$
where $\boldsymbol{g}_k = - \nabla f(\boldsymbol{w}_k)$ is the current descent direction.
A very simple (but expensive) way to find the minimum of $E_k(a)$ is to evaluate it at several points, and then pick the point which has the minimum value. This is not feasible in practical problems (especially in more than one dimension), but it will suffice for the goals of our exercise.
> In particular, we will pick 1001 equispaced points in the interval $[0, 1]$.

In [None]:
alpha_choices = np.linspace(0, 1, 1001, endpoint=True)
alpha_choices

In [None]:
def gradient_descent_exact_line_search_ex_1_2(epsilon: float, w_0: np.ndarray) -> typing.Tuple[
        np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Run the gradient descent method with exact line search.

    Parameters
    ----------
    epsilon : float
        tolerance for the stopping criterion on the error on the cost.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
    2d numpy array
        history of the gradient of the cost function.
    1d numpy array
        history of the step lengths selected by the exact line search.
    """
    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f_ex_1_2(w_0)]
    all_grad_f = [grad_f_ex_1_2(w_0)]
    all_alpha = []

    # Prepare iteration counter
    k = 0

    # Use the error on the cost to determine when the while loop should stop.
    # Since we have stored the value of f(w^*) in the variable f_star, we can use it to
    # compute f(w_k) - f^*
    while all_f[k] - f_star > epsilon:
        w_k = all_w[k]
        grad_f_k = all_grad_f[k]
        g_k = - grad_f_k

        # Carry out an exact line search
        E_choices = [f_ex_1_2(w_k + a * g_k) for a in alpha_choices]
        min_choice = np.argmin(E_choices)
        alpha_k = alpha_choices[min_choice]

        # Bail out if the descent condition is not satisfied
        if alpha_k == 0:
            print("WARNING: descent condition is not satisfied")
            break

        # Compute w_{k+1}
        w_k_plus_1 = w_k + alpha_k * g_k

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f_ex_1_2(w_k_plus_1))
        all_grad_f.append(grad_f_ex_1_2(w_k_plus_1))
        all_alpha.append(alpha_k)

        # Increment iteration counter
        k += 1

    # For convenience we transform the outputs into numpy array before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f), np.array(all_alpha)
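
> The grid search used above costs 1001 evaluations of $f$ per iteration. One possible alternative, sketched below, is a golden-section search, which only assumes that $E_k(a)$ is unimodal on $[0, 1]$ (this holds here because $f$ is convex). This is a different technique from the one used in the solution: the function `golden_section_search` and the names `f_sketch`, `grad_f_sketch` and `E_0` are hypothetical and introduced only for illustration.

```python
import numpy as np


def f_sketch(w: np.ndarray) -> float:
    """Evaluate the function of exercise 1.2 (same formula as f_ex_1_2)."""
    return np.exp(w[0] + 3 * w[1] - 0.1) + np.exp(w[0] - 3 * w[1] - 0.1) + np.exp(- w[0] - 0.1)


def grad_f_sketch(w: np.ndarray) -> np.ndarray:
    """Evaluate the gradient of the function of exercise 1.2 (same formula as grad_f_ex_1_2)."""
    return np.array([
        np.exp(w[0] + 3 * w[1] - 0.1) + np.exp(w[0] - 3 * w[1] - 0.1) - np.exp(- w[0] - 0.1),
        3 * np.exp(w[0] + 3 * w[1] - 0.1) - 3 * np.exp(w[0] - 3 * w[1] - 0.1)
    ])


def golden_section_search(E, a_min: float = 0.0, a_max: float = 1.0, tol: float = 1e-8) -> float:
    """Approximate the minimizer of a unimodal scalar function E on [a_min, a_max]."""
    inv_phi = (np.sqrt(5) - 1) / 2  # reciprocal of the golden ratio
    c = a_max - inv_phi * (a_max - a_min)
    d = a_min + inv_phi * (a_max - a_min)
    E_c, E_d = E(c), E(d)
    while a_max - a_min > tol:
        if E_c < E_d:
            # The minimizer lies in [a_min, d]: reuse c as the new right interior point
            a_max, d, E_d = d, c, E_c
            c = a_max - inv_phi * (a_max - a_min)
            E_c = E(c)
        else:
            # The minimizer lies in [c, a_max]: reuse d as the new left interior point
            a_min, c, E_c = c, d, E_d
            d = a_min + inv_phi * (a_max - a_min)
            E_d = E(d)
    return (a_min + a_max) / 2


# Line search along the steepest descent direction from w_0 = (-1, -1.9)
w_0 = np.array([-1.0, -1.9])
g_0 = - grad_f_sketch(w_0)


def E_0(a: float) -> float:
    """Evaluate E_0(a) = f(w_0 + a g_0)."""
    return f_sketch(w_0 + a * g_0)


alpha_golden = golden_section_search(E_0)
alpha_golden
```

> Each pass of the loop requires a single new evaluation of $E_k$ and shrinks the search interval by a constant factor, so far fewer evaluations are needed than with the grid for a comparable accuracy.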

4. Choose $\varepsilon = 10^{-5}$ and $\boldsymbol{w}_0 = (-1, -1.9)$, and run the gradient descent method with exact line search. Visualize:
   * the optimization variable iterations $\{\boldsymbol{w}_k\}_k$ on a contour plot of $f$;
   * a semilogarithmic plot of the error in the function value $\{f(\boldsymbol{w}_k) - f(\boldsymbol{w}^*)\}_k$ versus the iteration counter $k$;
   * a video of the functions $E_k(a)$ as $k$ progresses.

*Solution*:
> We run the function we have implemented.

In [None]:
all_w_exact, all_f_exact, all_grad_f_exact, all_alpha_exact = gradient_descent_exact_line_search_ex_1_2(
    1e-5, np.array([-1.0, -1.9]))

> We next create the required contour plot, with the optimization variable iterations overlaid. Note how with three iterations we get very close to the minimum from a graphical point of view.

In [None]:
fig = go.Figure(data=[go.Contour(
    x=w_component_0, y=w_component_1, z=np.log10(f_w),
    hovertemplate="x: %{x:.2f}<br>y: %{y:.2f}<br>z: 10^%{z:.2f} = %{customdata}", customdata=f_w,
    colorbar=dict(tickprefix="10^")
)])
fig.add_scatter(
    x=all_w_exact[:, 0], y=all_w_exact[:, 1],
    marker=dict(color="black", size=10),
    line=dict(color="black", width=2),
    mode="lines+markers"
)
fig.update_layout(
    title="Function exercise 1.2 - optimization variable iterations over contour plot (log scale)",
    width=512, height=512, autosize=False)
fig.show()

> The semilog plot of the error on the function value is very similar to the one in the previous exercise. Also in this case we may notice that the decrease of the function value has a linear trend when the vertical axis is shown in logarithmic scale.

In [None]:
fig = go.Figure(data=go.Scatter(x=np.arange(all_f_exact.shape[0]), y=all_f_exact - f_star))
fig.update_layout(title="Function exercise 1.2 - error on the function value - semilog plot")
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> Finally, we create an animation of the function $E_k(a)$, where each slide of the animation corresponds to a different value of $k$. We can visually confirm that the selected $\alpha_k$ attains the minimum of $E_k(a)$. Notice how, as we get closer to the minimum point, the range of variation of $E_k(a)$ for $a \in [0, 1]$ decreases considerably.

In [None]:
K = all_alpha_exact.shape[0]

fig = go.Figure()
slides = []
for k in range(K):
    w_k = all_w_exact[k]
    grad_f_k = all_grad_f_exact[k]
    g_k = - grad_f_k
    alpha_k = all_alpha_exact[k]

    # Evaluate E_k
    E_0 = f_ex_1_2(w_k)
    E_alpha_k = f_ex_1_2(w_k + alpha_k * g_k)
    E_choices = [f_ex_1_2(w_k + a * g_k) for a in alpha_choices]

    # Add line to plot
    fig.add_trace(
        go.Scatter(x=alpha_choices, y=E_choices, visible=False,
                   line=dict(color="blue"), name="E_k"))
    fig.add_trace(
        go.Scatter(x=[alpha_k], y=[E_alpha_k], visible=False,
                   marker=dict(color="red"), name="alpha_k"))

    # Add slider tick
    slide = {
        "method": "update",
        "args": [
            {"visible": [False] * (2 * K)},
            {"title": "k = " + str(k),
             "yaxis.range": [E_alpha_k - 0.1 * (E_0 - E_alpha_k), E_alpha_k + 2 * (E_0 - E_alpha_k)]}
        ]
    }
    slide["args"][0]["visible"][2 * k] = True
    slide["args"][0]["visible"][2 * k + 1] = True
    slides.append(slide)

fig.update_layout(
    title="k = 0", yaxis_range=[2, 10],
    sliders=[dict(steps=slides)])
fig.data[0].visible = True
fig.data[1].visible = True

fig.show()

5. Implement the gradient descent method with backtracking line search in a Python function. Use the stopping criterion based on the error of the cost. The function should:
   * take as input the constants $\alpha$, $c_1$ and $c_2$ of the backtracking algorithm, the tolerance $\varepsilon$ for the stopping criterion and the initial condition $\boldsymbol{w}_{0}$;
   * return as outputs the optimization variable iterations $\{\boldsymbol{w}_k\}_k$, the corresponding function values $\{f(\boldsymbol{w}_k)\}_k$ and gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$, and the values of the step lengths $\{\alpha_k\}_k$.
 
*Solution*:
> We closely follow the previous implementation, and change only the line search procedure.

In [None]:
def gradient_descent_backtracking_line_search_ex_1_2(
    alpha: float, c_1: float, c_2: float, epsilon: float, w_0: np.ndarray
) -> typing.Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Run the gradient descent method with backtracking line search.

    Parameters
    ----------
    alpha : float
        initial step length.
    c_1, c_2 : float
        constants of the backtracking algorithm.
    epsilon : float
        tolerance for the stopping criterion on the error on the cost.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
    2d numpy array
        history of the gradient of the cost function.
    1d numpy array
        history of the step lengths selected by the backtracking line search.
    """
    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f_ex_1_2(w_0)]
    all_grad_f = [grad_f_ex_1_2(w_0)]
    all_alpha = []

    # Prepare iteration counter
    k = 0

    # Use the error on the cost to determine when the while loop should stop.
    while all_f[k] - f_star > epsilon:
        w_k = all_w[k]
        f_k = all_f[k]
        grad_f_k = all_grad_f[k]
        norm_grad_f_k = np.linalg.norm(grad_f_k)

        # Carry out a backtracking line search
        alpha_k = alpha
        while f_ex_1_2(w_k - alpha_k * grad_f_k) > f_k - c_1 * alpha_k * norm_grad_f_k**2:
            alpha_k = c_2 * alpha_k

        # Compute w_{k+1}
        w_k_plus_1 = w_k - alpha_k * grad_f_k

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f_ex_1_2(w_k_plus_1))
        all_grad_f.append(grad_f_ex_1_2(w_k_plus_1))
        all_alpha.append(alpha_k)

        # Increment iteration counter
        k += 1

    # For convenience we transform the outputs into numpy arrays before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f), np.array(all_alpha)

6. Choose $\alpha = 1$, $c_1 = 0.1$, $c_2 = 0.7$, $\varepsilon = 10^{-5}$ and $\boldsymbol{w}_0 = (-1, -1.9)$, and run the gradient descent method with backtracking line search. Visualize:
   * the optimization variable iterations $\{\boldsymbol{w}_k\}_k$ on a contour plot of $f$;
   * a semilogarithmic plot of the error in the function value $\{f(\boldsymbol{w}_k) - f(\boldsymbol{w}^*)\}_k$ versus the iteration counter $k$;
   * a video of the functions $E_k(a)$ as $k$ progresses.

*Solution*:
> We run our implementation as follows.

In [None]:
all_w_backtracking, all_f_backtracking, all_grad_f_backtracking, all_alpha_backtracking = (
    gradient_descent_backtracking_line_search_ex_1_2(
        1, 0.1, 0.7, 1e-10, np.array([-1.0, 1.9])))

> We first visually compare the value $\alpha_k$ selected by backtracking with the minimum of $E_k(a)$. This gives us an idea of how far the step length $\alpha_k$ selected by the inexact line search is from the optimal one, which would have been selected by the exact line search.

In [None]:
K = all_alpha_backtracking.shape[0]

fig = go.Figure()
slides = []
for k in range(K):
    w_k = all_w_backtracking[k]
    grad_f_k = all_grad_f_backtracking[k]
    g_k = - grad_f_k
    alpha_k = all_alpha_backtracking[k]

    # Determine what would have been the optimal step length
    E_choices = [f_ex_1_2(w_k + a * g_k) for a in alpha_choices]
    min_choice = np.argmin(E_choices)
    optimal_alpha_k = alpha_choices[min_choice]

    # Evaluate E_k
    E_0 = f_ex_1_2(w_k)
    E_alpha_k = f_ex_1_2(w_k + alpha_k * g_k)
    E_optimal_alpha_k = f_ex_1_2(w_k + optimal_alpha_k * g_k)

    # Add line to plot
    fig.add_trace(
        go.Scatter(x=alpha_choices, y=E_choices, visible=False,
                   line=dict(color="blue"), name="E_k"))
    fig.add_trace(
        go.Scatter(x=[optimal_alpha_k], y=[E_optimal_alpha_k], visible=False,
                   marker=dict(color="red"), name="optimal alpha_k"))
    fig.add_trace(
        go.Scatter(x=[alpha_k], y=[E_alpha_k], visible=False,
                   marker=dict(color="orange"), name="backtracking alpha_k"))

    # Add slider tick
    slide = {
        "method": "update",
        "args": [
            {"visible": [False] * (3 * K)},
            {"title": "k = " + str(k),
             "yaxis.range": [
                 E_optimal_alpha_k - 0.1 * (E_0 - E_optimal_alpha_k),
                 E_optimal_alpha_k + 2 * (E_0 - E_optimal_alpha_k)
            ]}
        ]
    }
    slide["args"][0]["visible"][3 * k] = True
    slide["args"][0]["visible"][3 * k + 1] = True
    slide["args"][0]["visible"][3 * k + 2] = True
    slides.append(slide)

fig.update_layout(
    title="k = 0", yaxis_range=[2, 10],
    sliders=[dict(steps=slides)])
fig.data[0].visible = True
fig.data[1].visible = True
fig.data[2].visible = True

fig.show()

> We may notice that the value $\alpha_k$ selected by the backtracking consistently overestimates the optimal step length. Therefore, the steps taken by the gradient descent with backtracking are too long, as can be seen on a contour plot in which the backtracking and the exact iterations are overlaid for comparison.

In [None]:
fig = go.Figure(data=[go.Contour(
    x=w_component_0, y=w_component_1, z=np.log10(f_w),
    hovertemplate="x: %{x:.2f}<br>y: %{y:.2f}<br>z: 10^%{z:.2f} = %{customdata}", customdata=f_w,
    showscale=False
)])
fig.add_scatter(
    x=all_w_exact[:, 0], y=all_w_exact[:, 1],
    marker=dict(color="red", size=10),
    line=dict(color="red", width=2),
    mode="lines+markers", name="Exact line search"
)
fig.add_scatter(
    x=all_w_backtracking[:, 0], y=all_w_backtracking[:, 1],
    marker=dict(color="orange", size=10),
    line=dict(color="orange", width=2),
    mode="lines+markers", name="Backtracking line search"
)
fig.update_layout(
    title="Function exercise 1.2 - optimization variable iterations over contour plot (log scale)",
    width=612, height=512, autosize=False)
fig.show()

> This overshooting behavior causes the gradient method with backtracking to take a few more iterations than the one with exact line search to converge to the same tolerance on the error in the cost function. However, from the next plot we notice that even with backtracking the error in the cost function still decreases linearly in a semilog plot. This indicates that, while the backtracking procedure is certainly inferior to the exact line search, it does not deteriorate the convergence speed of the gradient method. This is welcome news, because in practice we never use the exact line search: it is too costly! (Count how many function evaluations we had to spend at each iteration $k$!)

In [None]:
fig = go.Figure()
fig.add_scatter(
    x=np.arange(all_f_exact.shape[0]), y=all_f_exact - f_star,
    line=dict(color="red"), name="Exact line search"
)
fig.add_scatter(
    x=np.arange(all_f_backtracking.shape[0]), y=all_f_backtracking - f_star,
    line=dict(color="orange"), name="Backtracking line search"
)
fig.update_layout(title="Function exercise 1.2 - error on the function value - semilog plot")
fig.update_yaxes(type="log", exponentformat="power")
fig.show()
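
> To make the remark on the cost concrete, the following sketch counts the function evaluations spent by one grid-based line search and by one backtracking line search. It is only an illustration under stated assumptions: the Booth function of Exercise 1.1 stands in for the cost, `hypothetical_grid` (101 equispaced step lengths in $[0, 1]$) stands in for `alpha_choices`, and `CountingFunction`, `grid_step` and `backtracking_step` are helper names introduced here, not part of the solution above.

```python
import numpy as np


def booth(w):
    """Booth function of Exercise 1.1, used here as a stand-in cost."""
    return (w[0] + 2 * w[1] - 7)**2 + (2 * w[0] + w[1] - 5)**2


def grad_booth(w):
    """Gradient of the Booth function."""
    return np.array([10 * w[0] + 8 * w[1] - 34, 8 * w[0] + 10 * w[1] - 38])


class CountingFunction:
    """Wrap a function and count how many times it is evaluated."""

    def __init__(self, f):
        self.f = f
        self.calls = 0

    def __call__(self, w):
        self.calls += 1
        return self.f(w)


def grid_step(f, w, g, grid):
    """Pick the step length on the grid that minimizes f along the direction g."""
    values = [f(w + a * g) for a in grid]
    return grid[int(np.argmin(values))]


def backtracking_step(f, w, grad, alpha=1.0, c_1=0.1, c_2=0.7):
    """Shrink alpha by the factor c_2 until the sufficient decrease condition holds."""
    f_w = f(w)
    alpha_k = alpha
    while f(w - alpha_k * grad) > f_w - c_1 * alpha_k * np.dot(grad, grad):
        alpha_k = c_2 * alpha_k
    return alpha_k


w = np.array([-8.0, -8.0])
grad = grad_booth(w)
hypothetical_grid = np.linspace(0.0, 1.0, 101)  # assumption: stand-in for alpha_choices

f_grid = CountingFunction(booth)
alpha_grid = grid_step(f_grid, w, -grad, hypothetical_grid)

f_backtracking = CountingFunction(booth)
alpha_backtracking = backtracking_step(f_backtracking, w, grad)

print("grid search evaluations:", f_grid.calls)
print("backtracking evaluations:", f_backtracking.calls)
```

> The grid search always spends one evaluation per candidate step length, whereas the backtracking cost depends on how many times the initial step has to be shrunk.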

## Exercise 1.3
In this exercise we consider a binary classification via logistic regression.

1. Generate a dataset of $m = 100$ points as follows: 50 points should be uniformly distributed in the interval $[-1, -0.5]$, and 50 points should be uniformly distributed in the interval $[0.5, 1]$. The corresponding label should be 0 for negative points, and 1 for positive points.

*Solution*:
> We use [`numpy.random.uniform`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html) to draw samples from a uniform distribution, and [`numpy.hstack`](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html) to join the two generated datasets.
> We already introduce here a few *best practices* that will be helpful when implementing more complex machine learning models:
> * in order to ensure reproducibility (between different runs on the same laptop, or across different devices) we set the seed at the beginning of every cell that generates random numbers using [`numpy.random.seed`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html)
> * after stacking, the generated dataset would have the first 50 entries with negative numbers, and the last 50 entries with positive numbers. We then reshuffle the order of the entries with [`numpy.random.shuffle`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.shuffle.html). We will see that this can be important when creating a validation set in more complex machine learning models.

In [None]:
np.random.seed(13)
x = np.hstack((np.random.uniform(-1, -0.5, 50), np.random.uniform(0.5, 1, 50)))
np.random.shuffle(x)

In [None]:
x

> We then assign label 0 if the corresponding point is negative, label 1 otherwise. This can be easily done by checking the condition `x > 0`, which returns a boolean value. Since the dataset should contain floating point numbers, we convert the boolean value `True` to 1 and `False` to 0.

In [None]:
y = (x > 0).astype(np.float64)

In [None]:
y

2. Implement the prediction function $\hat{y}(x; \boldsymbol{w}) = \sigma(w^{(0)} x + w^{(1)})$ associated to a logistic regression, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.

*Solution*:

In [None]:
def sigmoid(z: float) -> float:
    """Evaluate the sigmoid function."""
    return 1 / (1 + np.exp(-z))

In [None]:
def y_hat(x_j: float, w: np.ndarray) -> float:
    """Evaluate the prediction function associated to a logistic regression."""
    return sigmoid(w[0] * x_j + w[1])
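
> Since `sigmoid` is built on `np.exp`, the same two functions also accept a whole numpy array of points, even though the type hints mention scalars. The sketch below restates the two definitions so that it is self-contained, and compares a point-by-point evaluation with a single vectorized call. The names `w_demo` and `x_demo` are illustrative values introduced here.

```python
import numpy as np


def sigmoid(z):
    """Evaluate the sigmoid function (elementwise when z is an array)."""
    return 1 / (1 + np.exp(-z))


def y_hat(x_j, w):
    """Evaluate the logistic regression prediction at one point or at an array of points."""
    return sigmoid(w[0] * x_j + w[1])


w_demo = np.array([1.0, 0.0])
x_demo = np.linspace(-1, 1, 5)

# Point-by-point evaluation, as done with list comprehensions in the plots
predictions_loop = np.array([y_hat(x_p, w_demo) for x_p in x_demo])

# Single call on the whole array
predictions_vectorized = y_hat(x_demo, w_demo)

print(np.allclose(predictions_loop, predictions_vectorized))
```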

3. Implement the cross entropy loss $\ell(x, y; \boldsymbol{w}) = - y \log \hat{y}(x; \boldsymbol{w}) - (1 - y) \log (1 - \hat{y}(x; \boldsymbol{w}))$, and the corresponding empirical risk function $f$ associated to the dataset.

*Solution*:

In [None]:
def logistic_loss(x_j: float, y_j: float, w: np.ndarray) -> float:
    """Evaluate the logistic loss."""
    return - y_j * np.log(y_hat(x_j, w)) - (1 - y_j) * np.log(1 - y_hat(x_j, w))

In [None]:
def f_ex_1_3(x: np.ndarray, y: np.ndarray, w: np.ndarray) -> float:
    """Evaluate the empirical risk."""
    m = x.shape[0]
    return 1 / m * sum(logistic_loss(x_j, y_j, w) for (x_j, y_j) in zip(x, y))
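
> The direct formula takes the logarithm of $\hat{y}$ and $1 - \hat{y}$, which round to zero in floating point arithmetic when $|w^{(0)} x + w^{(1)}|$ is large. A possible alternative, sketched below, relies on the identity $\ell(x, y; \boldsymbol{w}) = \log(1 + e^{z}) - y z$ with $z = w^{(0)} x + w^{(1)}$, evaluated through `np.logaddexp`. The function names and the small dataset `x_demo`, `y_demo` are assumptions made for this illustration only.

```python
import numpy as np


def empirical_risk_direct(x, y, w):
    """Empirical risk computed from the cross entropy formula as stated."""
    p = 1 / (1 + np.exp(-(w[0] * x + w[1])))
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))


def empirical_risk_logaddexp(x, y, w):
    """Empirical risk computed as the mean of log(1 + exp(z)) - y * z."""
    z = w[0] * x + w[1]
    return np.mean(np.logaddexp(0.0, z) - y * z)


# Illustrative dataset (not the one generated in the exercise)
x_demo = np.array([-0.9, -0.6, 0.7, 0.8])
y_demo = (x_demo > 0).astype(np.float64)

# For moderate weights the two expressions agree
w_demo = np.array([1.0, 0.0])
risk_direct = empirical_risk_direct(x_demo, y_demo, w_demo)
risk_stable = empirical_risk_logaddexp(x_demo, y_demo, w_demo)

# For very large weights the logaddexp variant still returns a finite value
risk_large_w = empirical_risk_logaddexp(x_demo, y_demo, np.array([1000.0, 0.0]))

print(risk_direct, risk_stable, risk_large_w)
```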

4. Initialize $\boldsymbol{w}$ to $\boldsymbol{w}_0 = (1, 0)$. Prepare a plot of the corresponding predictions for 500 equispaced points on $[-1, 1]$.

*Solution*:

In [None]:
w_0 = np.array([1, 0])

In [None]:
x_plot = np.linspace(-1, 1, 500)

In [None]:
fig = go.Figure()
fig.add_scatter(x=x, y=y, marker=dict(color="red", size=10), mode="markers", name="Data")
y_hat_plot = [y_hat(x_p, w_0) for x_p in x_plot]
fig.add_scatter(x=x_plot, y=y_hat_plot, marker=dict(color="blue", size=5), mode="markers", name="Prediction")
fig.update_layout(title="Logistic regression - dataset (x, y), w = w_0")
fig.show()

5. Determine the gradient of the empirical risk $f$.

*Solution*:
> Recall that $$\ell(x, y; \boldsymbol{w}) = - y \log \sigma(w^{(0)} x + w^{(1)}) - (1 - y) \log (1 - \sigma(w^{(0)} x + w^{(1)})).$$
> When taking the partial derivative w.r.t. $w^{(0)}$ we have
> $$\frac{\partial}{\partial w^{(0)}} \ell(x, y; \boldsymbol{w}) = - y \frac{1}{\sigma(w^{(0)} x + w^{(1)})} \sigma'(w^{(0)} x + w^{(1)}) x - (1 - y) \frac{1}{1 - \sigma(w^{(0)} x + w^{(1)})} [- \sigma'(w^{(0)} x + w^{(1)}) x].$$
> Since $\sigma'(z) = \sigma(z) (1 - \sigma(z))$ we end up with
> $$\frac{\partial}{\partial w^{(0)}} \ell(x, y; \boldsymbol{w}) = [- y + \sigma(w^{(0)} x + w^{(1)})] x.$$
> Similarly
> $$\frac{\partial}{\partial w^{(1)}} \ell(x, y; \boldsymbol{w}) = [- y + \sigma(w^{(0)} x + w^{(1)})].$$
> Therefore
> $$\nabla_{\boldsymbol{w}} \ell(x, y; \boldsymbol{w}) = (- y + \sigma(w^{(0)} x + w^{(1)})) \begin{bmatrix}x\\1\end{bmatrix}.$$

In [None]:
def grad_logistic_loss(x_j: float, y_j: float, w: np.ndarray) -> np.ndarray:
    """Evaluate the gradient of the logistic loss."""
    return (y_hat(x_j, w) - y_j) * np.array([x_j, 1])

In [None]:
def grad_f_ex_1_3(x: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Evaluate the gradient of the empirical risk."""
    m = x.shape[0]
    return 1 / m * sum(grad_logistic_loss(x_j, y_j, w) for (x_j, y_j) in zip(x, y))
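
> A hand-derived gradient can be sanity-checked against central finite differences. The sketch below is self-contained: it restates the empirical risk and its gradient in vectorized form, and `central_difference_gradient`, `x_demo`, `y_demo` and `w_demo` are names and values introduced here for illustration.

```python
import numpy as np


def sigmoid(z):
    """Evaluate the sigmoid function."""
    return 1 / (1 + np.exp(-z))


def empirical_risk(x, y, w):
    """Empirical risk with the cross entropy loss (vectorized over the dataset)."""
    p = sigmoid(w[0] * x + w[1])
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))


def grad_empirical_risk(x, y, w):
    """Analytical gradient: average of (y_hat - y) * [x, 1] over the dataset."""
    residual = sigmoid(w[0] * x + w[1]) - y
    return np.array([np.mean(residual * x), np.mean(residual)])


def central_difference_gradient(f, w, h=1e-6):
    """Approximate the gradient of f at w by central finite differences."""
    grad = np.zeros_like(w, dtype=float)
    for i in range(w.shape[0]):
        e_i = np.zeros_like(w, dtype=float)
        e_i[i] = h
        grad[i] = (f(w + e_i) - f(w - e_i)) / (2 * h)
    return grad


# Illustrative dataset and evaluation point (not the ones of the exercise)
x_demo = np.array([-0.9, -0.6, 0.7, 0.8])
y_demo = (x_demo > 0).astype(np.float64)
w_demo = np.array([0.3, -0.2])

grad_exact = grad_empirical_risk(x_demo, y_demo, w_demo)
grad_approx = central_difference_gradient(
    lambda w: empirical_risk(x_demo, y_demo, w), w_demo)
max_discrepancy = np.max(np.abs(grad_exact - grad_approx))

print(max_discrepancy)
```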

> As the plot above has graphically shown, the choice $\boldsymbol{w}_0$ is definitely not optimal: the gradient of the empirical risk at $\boldsymbol{w}_0$ is not zero.

In [None]:
grad_f_ex_1_3(x, y, w_0)

6. Implement the gradient descent method with constant step size equal to $1/L$, where $L$ is the smoothness constant of the empirical risk.

*Solution*:
> We have shown in lecture 0 a formula for the smoothness constant of the empirical risk.

In [None]:
def smoothness_constant(x: np.ndarray) -> float:
    """Evaluate the smoothness constant L of the empirical risk."""
    m = x.shape[0]
    return 0.5 + sum(x**2) / (2 * m)

In [None]:
smoothness_constant(x)
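
> The formula from the lecture is not re-derived here, but it can be checked numerically. The Hessian of the empirical risk is $\frac{1}{m} \sum_j \sigma(z_j) (1 - \sigma(z_j)) \begin{bmatrix}x_j\\1\end{bmatrix} \begin{bmatrix}x_j & 1\end{bmatrix}$ with $z_j = w^{(0)} x_j + w^{(1)}$, and its largest eigenvalue should not exceed $L$ at any $\boldsymbol{w}$. The sketch below tests this at a few points; the dataset `x_demo`, the test points and the helper `hessian_empirical_risk` are assumptions introduced for the illustration.

```python
import numpy as np


def sigmoid(z):
    """Evaluate the sigmoid function."""
    return 1 / (1 + np.exp(-z))


def hessian_empirical_risk(x, w):
    """Hessian of the logistic regression empirical risk at w."""
    s = sigmoid(w[0] * x + w[1])
    weights = s * (1 - s)
    features = np.column_stack((x, np.ones_like(x)))
    return (features * weights[:, None]).T @ features / x.shape[0]


# Stand-in dataset with the same structure as the one of the exercise
rng = np.random.default_rng(0)
x_demo = np.hstack((rng.uniform(-1, -0.5, 50), rng.uniform(0.5, 1, 50)))

# Smoothness constant according to the formula used above
L_formula = 0.5 + np.sum(x_demo**2) / (2 * x_demo.shape[0])

# Largest Hessian eigenvalue over a few arbitrary test points
test_points = [np.array([1.0, 0.0]), np.array([0.0, 0.0]), np.array([5.0, -2.0])]
largest_eigenvalue = max(
    np.linalg.eigvalsh(hessian_empirical_risk(x_demo, w))[-1] for w in test_points)

print(largest_eigenvalue, L_formula)
```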

> We then implement the gradient method, closely following Exercise 1.1. Note that, compared to that exercise, we change the stopping criterion because we do not know $\boldsymbol{w}^*$. Here instead we use a stopping criterion based on the norm of the gradient.

In [None]:
def gradient_descent_ex_1_3(x: np.ndarray, y: np.ndarray, epsilon: float, w_0: np.ndarray) -> typing.Tuple[
        np.ndarray, np.ndarray, np.ndarray]:
    """
    Run the gradient descent method with constant step length.

    Parameters
    ----------
    x, y : np.ndarray
        training dataset.
    epsilon : float
        tolerance for the stopping criterion on the error on the norm of the gradient of the cost.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
    2d numpy array
        history of the gradient of the cost function.
    """
    # Compute the constant step length from the smoothness constant
    alpha = 1 / smoothness_constant(x)

    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f_ex_1_3(x, y, w_0)]
    all_grad_f = [grad_f_ex_1_3(x, y, w_0)]

    # Prepare iteration counter
    k = 0

    # We cannot use here a stopping criterion based on the difference between the function value
    # f(w_k) and f(w^*), because we do not know w^*! We use instead a stopping criterion based
    # on the norm of the gradient.
    # The rest of the implementation is as in Exercise 1.1 (except for changing the function f)
    while np.linalg.norm(all_grad_f[k]) > epsilon:
        w_k = all_w[k]
        grad_f_k = all_grad_f[k]
        w_k_plus_1 = w_k - alpha * grad_f_k

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f_ex_1_3(x, y, w_k_plus_1))
        all_grad_f.append(grad_f_ex_1_3(x, y, w_k_plus_1))

        # Bail out if the descent condition is not satisfied
        if all_f[k + 1] >= all_f[k]:
            print("WARNING: descent condition is not satisfied")
            break

        # Increment iteration counter
        k += 1
        print(k, np.linalg.norm(all_w[-1]), all_f[-1], np.linalg.norm(all_grad_f[-1]))

    # For convenience we transform the outputs into numpy arrays before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f)

In [None]:
all_w, all_f, all_grad_f = gradient_descent_ex_1_3(x, y, 1e-3, w_0)

> The gradient method converges to the required tolerance after approximately 200 iterations.

In [None]:
all_w[-1]

> Predictions are now more accurate, as the plot below shows.

In [None]:
fig = go.Figure()
fig.add_scatter(x=x, y=y, marker=dict(color="red", size=10), mode="markers", name="Data")
y_hat_plot = [y_hat(x_p, all_w[-1]) for x_p in x_plot]
fig.add_scatter(x=x_plot, y=y_hat_plot, marker=dict(color="blue", size=5), mode="markers", name="Prediction")
fig.update_layout(title="Logistic regression - dataset (x, y), optimal w")
fig.show()

> We next plot the progress of the cost function value w.r.t. the iterations.

In [None]:
fig = go.Figure(data=go.Scatter(x=np.arange(all_f.shape[0]), y=all_f))
fig.update_layout(title="Logistic regression - progress of the function value - semilog plot")
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> We finally plot the progress of the norm of the gradient w.r.t. $k$. We notice that the convergence curve is almost horizontal, i.e. the slope of the line is *very low*, which means that very little progress is made at each iteration. In other words, if the convergence were linear, the constant $C$ in its definition would be very close to 1. (This is actually a case in which the gradient method converges sublinearly, i.e. the constant $C$ is equal to $1$, which is not allowed in the definition of linear convergence! This is due to the fact that the logistic regression empirical risk on this dataset is not strongly convex.)

In [None]:
fig = go.Figure(data=go.Scatter(x=np.arange(all_f.shape[0]), y=np.linalg.norm(all_grad_f, axis=1)))
fig.update_layout(title="Logistic regression - violation of first order optimality conditions - semilog plot")
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

7. Graphically justify why the convergence of the gradient method is so slow.

*Solution*:
> To get an intuition, we create an animation of the prediction as $\boldsymbol{w}_k$ changes over the iterations $k$ of the gradient descent. We see that the prediction curve gets steeper and steeper as $k$ increases. Indeed, a vertical line (characterized by $w^{(0)} \to +\infty$) would be the best prediction, as it would best reflect that $y = 0$ when $x < 0$ and $y = 1$ when $x > 0$.

In [None]:
K = all_f.shape[0]

fig = go.Figure()
fig.add_trace(
    go.Scatter(x=x, y=y, marker=dict(color="red", size=10), mode="markers", name="Data"))
slides = []
for k in range(0, K):
    y_hat_plot = [y_hat(x_p, all_w[k]) for x_p in x_plot]
    fig.add_trace(
        go.Scatter(x=x_plot, y=y_hat_plot, marker=dict(color="blue", size=5),
                   mode="markers", name="Prediction", visible=False))

    # Add slider tick
    slide = {
        "method": "update",
        "args": [
            {"visible": [False] * (K + 1)},
            {"title": "Logistic regression - dataset (x, y), w at iteration " + str(k)}
        ]
    }
    slide["args"][0]["visible"][0] = True
    slide["args"][0]["visible"][k + 1] = True
    slides.append(slide)

fig.update_layout(
    title="Logistic regression - dataset (x, y), w at iteration 0",
    sliders=[dict(steps=slides)])
fig.data[0].visible = True
fig.data[1].visible = True

fig.show()

> This is confirmed by creating a plot of the empirical risk as a function of $\boldsymbol{w}$. Notice how the function tends to zero along the line $(w^{(0)}, 0)$ as $w^{(0)} \to +\infty$.

In [None]:
domain_component_0 = [-10, 10]
domain_component_1 = [-10, 10]

In [None]:
w_component_0 = np.linspace(domain_component_0[0], domain_component_0[1], 100)
w_component_1 = np.linspace(domain_component_1[0], domain_component_1[1], 100)

In [None]:
f_w = np.zeros((len(w_component_0), len(w_component_1)))
for i in range(f_w.shape[0]):
    for j in range(f_w.shape[1]):
        f_w[j, i] = f_ex_1_3(x, y, np.array([w_component_0[i], w_component_1[j]]))

In [None]:
fig = go.Figure(data=[go.Contour(x=w_component_0, y=w_component_1, z=f_w)])
fig.update_layout(title="Logistic regression - contour plot", width=512, height=512, autosize=False)
fig.add_scatter(
    x=all_w[:, 0], y=all_w[:, 1],
    marker=dict(color="black", size=10),
    line=dict(color="black", width=2),
    mode="lines+markers"
)
fig.show()

> The gradient method enters a region where the function is almost flat, therefore the gradient is almost zero, forcing the iterative method to take very small steps.
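
> The flat region can also be probed numerically by evaluating the empirical risk along the line $(w^{(0)}, 0)$ for increasing $w^{(0)}$. The sketch below uses a stand-in dataset `x_demo`, `y_demo` with the same structure as the one of the exercise and a helper `risk_on_axis`; these names and the sampled values of $w^{(0)}$ are illustrative assumptions.

```python
import numpy as np

# Stand-in separable dataset: negative points labeled 0, positive points labeled 1
rng = np.random.default_rng(0)
x_demo = np.hstack((rng.uniform(-1, -0.5, 50), rng.uniform(0.5, 1, 50)))
y_demo = (x_demo > 0).astype(np.float64)


def risk_on_axis(w_component_0_value):
    """Empirical risk at w = (w^(0), 0), written as the mean of log(1 + exp(z)) - y * z."""
    z = w_component_0_value * x_demo
    return np.mean(np.logaddexp(0.0, z) - y_demo * z)


# The risk keeps decreasing as w^(0) grows, while staying non-negative
risk_values = [risk_on_axis(value) for value in (1.0, 10.0, 100.0)]
print(risk_values)
```

> On such a dataset a larger $w^{(0)}$ always gives a smaller risk, which is consistent with the iterates drifting slowly towards $w^{(0)} \to +\infty$ rather than settling at a finite minimizer.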

## Exercise 1.4

Let $\boldsymbol{w} \in \mathbb{R}^2$. Consider the following non-convex function, known as *Ackley function* and defined as
$$f(\boldsymbol{w}) = - 20 \exp\left(- 0.2 \sqrt{\frac{[w^{(0)}]^2 + [w^{(1)}]^2}{2}}\right) - \exp\left(\frac{\cos(2 \pi w^{(0)}) + \cos(2 \pi w^{(1)})}{2}\right) + 20 + \exp(1).$$

1. Draw a surface plot and a contour plot of the function $f$ on the square domain $[-10, 10]^2$.

*Solution*:

In [None]:
domain_component_0 = [-10, 10]
domain_component_1 = [-10, 10]

In [None]:
w_component_0 = np.linspace(domain_component_0[0], domain_component_0[1], 200)
w_component_1 = np.linspace(domain_component_1[0], domain_component_1[1], 200)

In [None]:
def f_ex_1_4(w: np.ndarray) -> float:
    """Evaluate f(w)."""
    return (
        - 20 * np.exp(- 0.2 * np.sqrt((w[0]**2 + w[1]**2) / 2))
        - np.exp((np.cos(2 * np.pi * w[0]) + np.cos(2 * np.pi * w[1])) / 2)
        + 20 + np.exp(1)
    )

In [None]:
f_w = np.zeros((len(w_component_0), len(w_component_1)))
for i in range(f_w.shape[0]):
    for j in range(f_w.shape[1]):
        f_w[j, i] = f_ex_1_4(np.array([w_component_0[i], w_component_1[j]]))

In [None]:
fig = go.Figure(data=[go.Surface(x=w_component_0, y=w_component_1, z=f_w)])
fig.update_layout(title="Ackley function - surface plot")
fig.show()

> We notice that the function has several small "bumps" that result in the formation of many local minima (and local maxima). Looking at the expression it can be shown that such points are associated to $(w^{(0)}, w^{(1)})$ given by a pair of integer coordinates. Such points are also visible in a contour plot. The global minimum is $\boldsymbol{w}^* = (0, 0)$.

In [None]:
fig = go.Figure(data=[go.Contour(x=w_component_0, y=w_component_1, z=f_w)])
fig.update_layout(title="Ackley function - contour plot", width=512, height=512, autosize=False)
fig.show()

2. Implement a gradient method with backtracking line search and a stopping criterion based on the norm of the increment of the optimization variables.

*Solution*:
> We first need to write the gradient of the function $f$. *Bonus point*: the expression is very complicated to differentiate, but the partial derivatives can be obtained symbolically with `sympy`.

In [None]:
def grad_f_ex_1_4(w: np.ndarray) -> np.ndarray:
    r"""Evaluate \nabla f(w)."""
    return np.array([
        2.0 * w[0] * np.exp(- 0.2 * np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)) / np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)
        + np.pi * np.exp(np.cos(2 * np.pi * w[0]) / 2 + np.cos(2 * np.pi * w[1]) / 2) * np.sin(2 * np.pi * w[0]),
        2.0 * w[1] * np.exp(- 0.2 * np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)) / np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)
        + np.pi * np.exp(np.cos(2 * np.pi * w[0]) / 2 + np.cos(2 * np.pi * w[1]) / 2) * np.sin(2 * np.pi * w[1])
    ])

In [None]:
def gradient_descent_backtracking_line_search_ex_1_4(
    alpha: float, c_1: float, c_2: float, epsilon: float, w_0: np.ndarray
) -> typing.Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Run the gradient descent method with backtracking line search.

    Parameters
    ----------
    alpha : float
        initial step length.
    c_1, c_2 : float
        constants of the backtracking algorithm.
    epsilon : float
        tolerance for the stopping criterion on the increment on the optimization variables.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
    2d numpy array
        history of the gradient of the cost function.
    """
    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f_ex_1_4(w_0)]
    all_grad_f = [grad_f_ex_1_4(w_0)]

    # Prepare iteration counter
    k = 0

    # Use the norm of the variable increment as stopping criterion.
    variable_increment = 2 * epsilon
    while variable_increment > epsilon:
        w_k = all_w[k]
        f_k = all_f[k]
        grad_f_k = all_grad_f[k]
        norm_grad_f_k = np.linalg.norm(grad_f_k)

        # Carry out a backtracking line search
        alpha_k = alpha
        while f_ex_1_4(w_k - alpha_k * grad_f_k) > f_k - c_1 * alpha_k * norm_grad_f_k**2:
            alpha_k = c_2 * alpha_k

        # Compute w_{k+1}
        w_k_plus_1 = w_k - alpha_k * grad_f_k

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f_ex_1_4(w_k_plus_1))
        all_grad_f.append(grad_f_ex_1_4(w_k_plus_1))

        # Increment iteration counter
        k += 1
        variable_increment = np.linalg.norm(all_w[k] - all_w[k - 1])

    # For convenience we transform the outputs into numpy arrays before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f)

3. Apply the gradient descent method with backtracking starting from several initial conditions, obtained by an equispaced subdivision of the domain $[-10, 10]^2$ in squares of side length two. How many times does the gradient descent method converge to the global minimum $\boldsymbol{w}^* = (0, 0)$?

*Solution*:

In [None]:
solutions = dict()
for i in range(-10, 11, 2):
    for j in range(-10, 11, 2):
        all_w, _, _ = gradient_descent_backtracking_line_search_ex_1_4(
            1, 0.1, 0.7, 1e-5, np.array([i + 0.5, j + 0.5]))
        optimal_w = all_w[-1]
        component_0_int = np.round(optimal_w[0], 0)
        component_1_int = np.round(optimal_w[1], 0)
        assert np.isclose(optimal_w[0] - component_0_int, 0., atol=1e-1)
        assert np.isclose(optimal_w[1] - component_1_int, 0., atol=1e-1)
        if (component_0_int, component_1_int) not in solutions:
            solutions[(component_0_int, component_1_int)] = 0
        solutions[(component_0_int, component_1_int)] += 1
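
> The tally kept in the dictionary could equivalently be built with `collections.Counter`, which removes the need for the explicit initialization to zero. The sketch below only illustrates the counting step: `hypothetical_endpoints` are made-up final iterates, not results of the runs above.

```python
from collections import Counter

import numpy as np

# Made-up final iterates, each close to a pair of integer coordinates
hypothetical_endpoints = [
    np.array([0.02, -0.01]),
    np.array([0.97, 1.01]),
    np.array([-0.01, 0.03]),
]

# Round each endpoint to the nearest integer pair and count the occurrences
counts = Counter(
    (int(np.round(w[0])), int(np.round(w[1]))) for w in hypothetical_endpoints)

print(counts)
```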

> We convert the dictionary into a matrix with three columns.

In [None]:
solutions_keys_np = np.array(list(solutions.keys()))
solutions_values_np = np.array(list(solutions.values())).reshape(-1, 1)
solutions_np = np.hstack((solutions_keys_np, solutions_values_np))

In [None]:
solutions_np

> We prepare a contour plot overlaid with a scatter plot of the minima found by the optimization method. Markers in the scatter plot are larger if a minimum has been found several times.

In [None]:
fig = go.Figure(data=[go.Contour(x=w_component_0, y=w_component_1, z=f_w, opacity=0.5)])
fig.add_scatter(
    x=solutions_np[:, 0], y=solutions_np[:, 1],
    marker=dict(size=5 * np.sqrt(solutions_np[:, 2]), color="black"), mode="markers",
    hovertemplate="(%{x}, %{y}) found %{customdata} times", customdata=solutions_np[:, 2],
)
fig.update_layout(title="Ackley function - convergence to minima", width=512, height=512, autosize=False)
fig.show()

> Fraction of runs in which the global minimum has been found:

In [None]:
solutions[(0, 0)] / sum(solutions_np[:, 2])

> Due to the small bumps, the gradient method gets stuck in local minima in almost 80% of the cases. Are there extensions of the gradient method that are less susceptible to local minima?

## Exercise 1.5 (continuation of Exercise 1.1)
Let $\boldsymbol{w} \in \mathbb{R}^2$. Consider the *Booth function*
$$f(\boldsymbol{w}) = (w^{(0)} + 2 w^{(1)} - 7)^2 + (2 w^{(0)} + w^{(1)} - 5)^2.$$

7. Implement Nesterov accelerated gradient method with constant step length and constant momentum coefficient in a Python function. Use the stopping criterion based on the error of the cost. The function should:
   * take as input the value $\alpha$ of the step length, the value $\beta$ of the momentum coefficient, the tolerance $\varepsilon$ for the stopping criterion, and the initial condition $\boldsymbol{w}_{0}$;
   * return as outputs the optimization variable iterations $\{\boldsymbol{w}_k\}_k$, the corresponding function values $\{f(\boldsymbol{w}_k)\}_k$ and gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$.
 
*Solution*:

In [None]:
def f_ex_1_5(w: np.ndarray) -> float:
    """Evaluate f(w)."""
    return (w[0] + 2 * w[1] - 7)**2 + (2 * w[0] + w[1] - 5)**2

In [None]:
def grad_f_ex_1_5(w: np.ndarray) -> np.ndarray:
    r"""Evaluate \nabla f(w)."""
    return np.array([10 * w[0] + 8 * w[1] - 34, 8 * w[0] + 10 * w[1] - 38])

In [None]:
def nesterov_accelerated_gradient_method_ex_1_5(
    alpha: float, beta: float, epsilon: float, w_0: np.ndarray
) -> typing.Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Run Nesterov accelerated gradient descent method with constant step length and constant momentum coefficient.

    Parameters
    ----------
    alpha : float
        constant step length.
    beta : float
        constant momentum coefficient.
    epsilon : float
        tolerance for the stopping criterion on the error on the cost.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
    2d numpy array
        history of the gradient of the cost function.
    """
    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f_ex_1_5(w_0)]
    all_grad_f = [grad_f_ex_1_5(w_0)]

    # Prepare iteration counter
    k = 0

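    # Use the error on the cost to determine when the while loop should stop.
    # Since f(w^*) = 0 for the Booth function, the error on the cost is f(w_k) itself.
    while all_f[k] > epsilon:
        # For k = 0 the index k - 1 = -1 refers to the last (and only) entry w_0,
        # so that w_{-1} = w_0 and the first iteration has no momentum.
        w_k_minus_1 = all_w[k - 1]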
        w_k = all_w[k]
        z_k = w_k + beta * (w_k - w_k_minus_1)
        grad_f_z_k = grad_f_ex_1_5(z_k)
        w_k_plus_1 = z_k - alpha * grad_f_z_k

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f_ex_1_5(w_k_plus_1))
        all_grad_f.append(grad_f_ex_1_5(w_k_plus_1))

        # Increment iteration counter
        k += 1

    # For convenience we transform the outputs into numpy array before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f)
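
> The `while` loop above terminates only once the cost drops below the tolerance, so a slowly converging or diverging choice of $\alpha$ and $\beta$ can keep it running for a long time. The following self-contained sketch shows one possible safeguard, a cap on the number of iterations. The names `booth`, `grad_booth`, `nesterov_with_cap` and the parameter `max_iter` are assumptions introduced here for illustration and are not part of the exercise.

```python
import numpy as np


def booth(w: np.ndarray) -> float:
    """Booth function, with minimum value 0 at w = (1, 3)."""
    return (w[0] + 2 * w[1] - 7)**2 + (2 * w[0] + w[1] - 5)**2


def grad_booth(w: np.ndarray) -> np.ndarray:
    """Gradient of the Booth function."""
    return np.array([10 * w[0] + 8 * w[1] - 34, 8 * w[0] + 10 * w[1] - 38])


def nesterov_with_cap(alpha, beta, epsilon, w_0, max_iter=10_000):
    """Nesterov iteration that stops when f(w_k) <= epsilon or after max_iter steps."""
    w_prev, w = w_0, w_0  # w_{-1} = w_0, so the first step has no momentum
    history = [w_0]
    for _ in range(max_iter):
        if booth(w) <= epsilon:
            break
        z = w + beta * (w - w_prev)
        w_prev, w = w, z - alpha * grad_booth(z)
        history.append(w)
    return np.array(history)


history = nesterov_with_cap(1 / 18, 0.5, 1e-5, np.array([-8.0, -8.0]))
print(history.shape[0] - 1, "iterations, final iterate", history[-1])
```

> With these parameters the final iterate should be close to the minimum point $(1, 3)$ and the cap should not be reached.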

8. Choose $\alpha = 1/L = 1/18$, $\varepsilon = 10^{-5}$ and $\boldsymbol{w}_0 = (-8, -8)$, and four possible choices of $\beta$, corresponding to $\{0, 0.5, 1, 1.5\}$ multiplied by the momentum coefficient suggested by the convergence result. Run the Nesterov method, and visualize a semilogarithmic plot of the error in the function value $\{f(\boldsymbol{w}_k) - f(\boldsymbol{w}^*)\}_k$ versus the iteration counter $k$ (for each value of $\beta$).

*Solution*:
> We first run the case $\beta = 0$. This corresponds to the gradient method.

In [None]:
all_w_0, all_f_0, all_grad_f_0 = nesterov_accelerated_gradient_method_ex_1_5(
    1 / 18, 0, 1e-5, np.array([-8.0, -8.0]))

In [None]:
all_w_0.shape[0]

> We then compute the suggested value of the momentum coefficient, given by $\frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}$, where we have seen in Exercise 1.1 that $\mu = 2$ and $L = 18$.

In [None]:
suggested_beta = (np.sqrt(18) - np.sqrt(2)) / (np.sqrt(18) + np.sqrt(2))
suggested_beta
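
> As a quick cross-check of the values $\mu = 2$ and $L = 18$, one can compute the eigenvalues of the Hessian of $f$, which is constant because $f$ is quadratic. The sketch below reads the Hessian entries off the gradient implemented in `grad_f_ex_1_5`; the names `hessian` and `beta_check` are introduced here only for illustration.

```python
import numpy as np

# Hessian of the Booth function: constant, because f is quadratic.
hessian = np.array([[10.0, 8.0], [8.0, 10.0]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
mu, L = np.linalg.eigvalsh(hessian)
beta_check = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
print(mu, L, beta_check)
```

> Up to rounding, the smallest and largest eigenvalues are $2$ and $18$, and the resulting momentum coefficient is $0.5$.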

> We then run the remaining cases.

In [None]:
all_w_05, all_f_05, all_grad_f_05 = nesterov_accelerated_gradient_method_ex_1_5(
    1 / 18, 0.5 * suggested_beta, 1e-5, np.array([-8.0, -8.0]))

In [None]:
all_w_05.shape[0]

In [None]:
all_w_1, all_f_1, all_grad_f_1 = nesterov_accelerated_gradient_method_ex_1_5(
    1 / 18, suggested_beta, 1e-5, np.array([-8.0, -8.0]))

In [None]:
all_w_1.shape[0]

In [None]:
all_w_15, all_f_15, all_grad_f_15 = nesterov_accelerated_gradient_method_ex_1_5(
    1 / 18, 1.5 * suggested_beta, 1e-5, np.array([-8.0, -8.0]))

In [None]:
all_w_15.shape[0]

> We prepare a plot of the error on the function value.

In [None]:
fig = go.Figure()
all_f_beta = [all_f_0, all_f_05, all_f_1, all_f_15]
beta_factors = [0, 0.5, 1, 1.5]
for beta_index in range(4):
    fig.add_scatter(
        x=np.arange(all_f_beta[beta_index].shape[0]), y=all_f_beta[beta_index],
        marker=dict(color=plotly.colors.qualitative.Set1[beta_index], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[beta_index], width=2),
        mode="lines+markers", name="Factor in front of beta = " + str(beta_factors[beta_index])
    )
fig.update_layout(
    title="Booth function - error on the function value - different momentum coefficients",
    width=768, height=768, autosize=False
)
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> From this plot we conclude that:
> * $\beta = 0$ corresponds to the gradient descent method, which is indeed slower than the three other accelerated methods;
> * taking $\beta$ less than the suggested value causes the acceleration to be less effective;
> * taking $\beta$ as suggested by the convergence theory yields the fastest convergence among the tested values;
> * taking $\beta$ above the suggested value causes a non-monotone decrease of the cost function, resulting in a method which does not satisfy the descent condition. While a non-monotone decrease is not an issue per se, in this case (and also in several practical cases) it causes a larger number of iterations to be necessary to reach the prescribed tolerance.
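
> One possible way to interpret these observations (a sketch that is not part of the original exercise) exploits the fact that $f$ is quadratic. Along an eigenvector of the Hessian with eigenvalue $\lambda$, the error $e_k$ of the Nesterov iteration satisfies $e_{k+1} = (1 - \alpha \lambda) \left[(1 + \beta) e_k - \beta e_{k-1}\right]$. The largest root modulus of this recurrence, taken over both eigenvalues, gives the asymptotic contraction factor of the error per iteration. The helper name `nesterov_rate` is an assumption introduced here.

```python
import numpy as np


def nesterov_rate(alpha: float, beta: float, eigenvalues) -> float:
    """Largest spectral radius of the per-mode Nesterov recurrence on a quadratic."""
    radii = []
    for lam in eigenvalues:
        c = 1 - alpha * lam
        # e_{k+1} = c * ((1 + beta) * e_k - beta * e_{k-1}) written as a 2x2 system
        companion = np.array([[c * (1 + beta), -c * beta], [1.0, 0.0]])
        radii.append(np.max(np.abs(np.linalg.eigvals(companion))))
    return max(radii)


suggested = 0.5  # (sqrt(18) - sqrt(2)) / (sqrt(18) + sqrt(2))
for factor in [0, 0.5, 1, 1.5]:
    rate = nesterov_rate(1 / 18, factor * suggested, [2.0, 18.0])
    print(f"factor {factor}: asymptotic error contraction per iteration = {rate:.3f}")
```

> The smallest contraction factor is obtained for the suggested value, where the recurrence for $\lambda = \mu$ has a double root at $1 - \sqrt{\mu / L} = 2/3$. For the factor 1.5 the roots are complex, which is consistent with the oscillating, non-monotone behavior seen in the plot.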

9. Since in many practical applications the value of $L$ is not known, the actual step length $\alpha$ may be smaller than the one suggested by the convergence results. Consider for instance $\alpha = 1/36$, which is half the suggested one. One might try to "compensate" for the halved step by approximately doubling the coefficient $\beta$.
So, choose $\alpha = 1/36$, $\varepsilon = 10^{-5}$ and $\boldsymbol{w}_0 = (-8, -8)$, and four possible choices of $\beta$, corresponding to $\{0, 1, 1.5, 2\}$ multiplied by the momentum coefficient suggested by the convergence result. Run the Nesterov method, and visualize:
   * a semilogarithmic plot of the error in the function value $\{f(\boldsymbol{w}_k) - f(\boldsymbol{w}^*)\}_k$ versus the iteration counter $k$ (for each value of $\beta$);
   * the optimization variable iterations $\{\boldsymbol{w}_k\}_k$ on a contour plot of $f$.
 
*Solution*:
> We run our implementation for all required values of $\beta$.

In [None]:
all_w_0, all_f_0, all_grad_f_0 = nesterov_accelerated_gradient_method_ex_1_5(
    1 / 36, 0, 1e-5, np.array([-8.0, -8.0]))

In [None]:
all_w_0.shape[0]

In [None]:
all_w_1, all_f_1, all_grad_f_1 = nesterov_accelerated_gradient_method_ex_1_5(
    1 / 36, suggested_beta, 1e-5, np.array([-8.0, -8.0]))

In [None]:
all_w_1.shape[0]

In [None]:
all_w_15, all_f_15, all_grad_f_15 = nesterov_accelerated_gradient_method_ex_1_5(
    1 / 36, 1.5 * suggested_beta, 1e-5, np.array([-8.0, -8.0]))

In [None]:
all_w_15.shape[0]

In [None]:
all_w_2, all_f_2, all_grad_f_2 = nesterov_accelerated_gradient_method_ex_1_5(
    1 / 36, 2 * suggested_beta, 1e-5, np.array([-8.0, -8.0]))

In [None]:
all_w_2.shape[0]

> We prepare a plot of the error on the function value. Following the discussion at the previous point, we suggest choosing a factor of 1.5.

In [None]:
fig = go.Figure()
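all_f_beta = [all_f_0, all_f_1, all_f_15, all_f_2]
beta_factors = [0, 1, 1.5, 2]  # factors of this point; the list from point 8 would mislabel the legend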
for beta_index in range(4):
    fig.add_scatter(
        x=np.arange(all_f_beta[beta_index].shape[0]), y=all_f_beta[beta_index],
        marker=dict(color=plotly.colors.qualitative.Set1[beta_index], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[beta_index], width=2),
        mode="lines+markers", name="Factor in front of beta = " + str(beta_factors[beta_index])
    )
fig.update_layout(
    title="Booth function - error on the function value - different momentum coefficients",
    width=768, height=768, autosize=False
)
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> We conclude with the visualization of the iterations over a contour plot.

In [None]:
domain_component_0 = [-10, 10]
domain_component_1 = [-10, 10]

In [None]:
w_component_0 = np.linspace(domain_component_0[0], domain_component_0[1], 100)
w_component_1 = np.linspace(domain_component_1[0], domain_component_1[1], 100)

In [None]:
f_w = np.zeros((len(w_component_0), len(w_component_1)))
for i in range(f_w.shape[0]):
    for j in range(f_w.shape[1]):
        f_w[j, i] = f_ex_1_5([w_component_0[i], w_component_1[j]])

In [None]:
fig = plotly.subplots.make_subplots(rows=2, cols=2)
rows = [1, 1, 2, 2]
cols = [1, 2, 1, 2]
all_w_beta = [all_w_0, all_w_1, all_w_15, all_w_2]
beta_factors = [0, 1, 1.5, 2]
for beta_index in range(4):
    fig.add_contour(
        x=w_component_0, y=w_component_1, z=f_w, opacity=0.5, showscale=False,
        row=rows[beta_index], col=cols[beta_index]
    )
    fig.add_scatter(
        x=all_w_beta[beta_index][:, 0], y=all_w_beta[beta_index][:, 1],
        marker=dict(color=plotly.colors.qualitative.Set1[beta_index], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[beta_index], width=2),
        mode="lines+markers", name="Factor in front of beta = " + str(beta_factors[beta_index]),
        row=rows[beta_index], col=cols[beta_index]
    )
fig.update_layout(
    title="Booth function - optimization variable iterations over contour plot - different momentum coefficients",
    width=768, height=768, autosize=False
)
fig.update_xaxes(range=[0, 5])
fig.update_yaxes(range=[0, 5])
fig.show()

> For every value of $\beta$ the first iterations of the accelerated method take considerably larger steps than the gradient method. However, at least for moderate values of $\beta$ (factors 1 and 1.5) this is beneficial overall for later iterations. In contrast, too large a value of $\beta$ (factor 2) may cause the method to overshoot the minimum point.

## Exercise 1.6 (continuation of Exercise 1.4)

Let $\boldsymbol{w} \in \mathbb{R}^2$. Consider the *Ackley function*
$$f(\boldsymbol{w}) = - 20 \exp\left(- 0.2 \sqrt{\frac{[w^{(0)}]^2 + [w^{(1)}]^2}{2}}\right) - \exp\left(\frac{\cos(2 \pi w^{(0)}) + \cos(2 \pi w^{(1)})}{2}\right) + 20 + \exp(1).$$

4. Implement the heavy ball method with backtracking line search and constant momentum coefficient. Use the stopping criterion based on the norm of the increment of the optimization variables. Such function should:
   * take as input the constants $\alpha$, $c_1$ and $c_2$ of the backtracking algorithm, the momentum coefficient $\beta$, the tolerance $\varepsilon$ for the stopping criterion, and the initial condition $\boldsymbol{w}_{0}$;
   * return as outputs the optimization variable iterations $\{\boldsymbol{w}_k\}_k$, the corresponding function values $\{f(\boldsymbol{w}_k)\}_k$ and gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$.
 
*Solution*:

In [None]:
def f_ex_1_6(w: np.ndarray) -> float:
    """Evaluate f(w)."""
    return (
        - 20 * np.exp(- 0.2 * np.sqrt((w[0]**2 + w[1]**2) / 2))
        - np.exp((np.cos(2 * np.pi * w[0]) + np.cos(2 * np.pi * w[1])) / 2)
        + 20 + np.exp(1)
    )

In [None]:
def grad_f_ex_1_6(w: np.ndarray) -> np.ndarray:
    r"""Evaluate \nabla f(w)."""
    return np.array([
        2.0 * w[0] * np.exp(- 0.2 * np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)) / np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)
        + np.pi * np.exp(np.cos(2 * np.pi * w[0]) / 2 + np.cos(2 * np.pi * w[1]) / 2) * np.sin(2 * np.pi * w[0]),
        2.0 * w[1] * np.exp(- 0.2 * np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)) / np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)
        + np.pi * np.exp(np.cos(2 * np.pi * w[0]) / 2 + np.cos(2 * np.pi * w[1]) / 2) * np.sin(2 * np.pi * w[1])
    ])
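
> Since the analytical gradient of the Ackley function is fairly involved, it can be worth comparing it against a finite difference approximation at a test point. The following self-contained sketch repeats the expressions above under the names `ackley` and `grad_ackley`; these names, the helper `central_difference`, the step `h` and the test point are assumptions chosen here for illustration.

```python
import numpy as np


def ackley(w):
    """Ackley function, as defined in this exercise."""
    return (
        - 20 * np.exp(- 0.2 * np.sqrt((w[0]**2 + w[1]**2) / 2))
        - np.exp((np.cos(2 * np.pi * w[0]) + np.cos(2 * np.pi * w[1])) / 2)
        + 20 + np.exp(1)
    )


def grad_ackley(w):
    """Analytical gradient, same expression as grad_f_ex_1_6 (undefined at w = 0)."""
    r = np.sqrt((w[0]**2 + w[1]**2) / 2)
    e = np.exp((np.cos(2 * np.pi * w[0]) + np.cos(2 * np.pi * w[1])) / 2)
    return np.array([
        2.0 * w[0] * np.exp(- 0.2 * r) / r + np.pi * e * np.sin(2 * np.pi * w[0]),
        2.0 * w[1] * np.exp(- 0.2 * r) / r + np.pi * e * np.sin(2 * np.pi * w[1]),
    ])


def central_difference(f, w, h=1e-6):
    """Approximate the gradient of f at w with central finite differences."""
    grad = np.zeros_like(w)
    for i in range(w.shape[0]):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (f(w + step) - f(w - step)) / (2 * h)
    return grad


w_test = np.array([1.3, -2.7])
print(grad_ackley(w_test), central_difference(ackley, w_test))
```

> The two printed vectors should agree to several digits. Note that the analytical expression divides by the norm of $\boldsymbol{w}$, so it cannot be evaluated at $\boldsymbol{w} = (0, 0)$.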

In [None]:
def heavy_ball_backtracking_line_search_ex_1_6(
    alpha: float, c_1: float, c_2: float, beta: float, epsilon: float, w_0: np.ndarray
) -> typing.Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Run the heavy ball method with backtracking line search and constant momentum.

    Parameters
    ----------
    alpha : float
        initial step length.
    c_1, c_2 : float
        constants of the backtracking algorithm.
    beta : float
        constant momentum coefficient.
    epsilon : float
        tolerance for the stopping criterion on the increment on the optimization variables.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
    2d numpy array
        history of the gradient of the cost function.
    """
    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f_ex_1_6(w_0)]
    all_grad_f = [grad_f_ex_1_6(w_0)]

    # Prepare iteration counter
    k = 0

    # Use the norm of the variable increment as stopping criterion.
    variable_increment = 2 * epsilon
    while variable_increment > epsilon:
        # For k = 0 the index k - 1 = -1 refers to the only entry w_0, so that w_{-1} = w_0
        w_k_minus_1 = all_w[k - 1]
        w_k = all_w[k]
        f_k = all_f[k]
        grad_f_k = all_grad_f[k]
        norm_grad_f_k = np.linalg.norm(grad_f_k)

        # Carry out a backtracking line search
        alpha_k = alpha
        while f_ex_1_6(w_k - alpha_k * grad_f_k) > f_k - c_1 * alpha_k * norm_grad_f_k**2:
            alpha_k = c_2 * alpha_k

        # Compute z_k and w_{k+1}
        z_k = w_k - alpha_k * grad_f_k
        w_k_plus_1 = z_k + beta * (w_k - w_k_minus_1)

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f_ex_1_6(w_k_plus_1))
        all_grad_f.append(grad_f_ex_1_6(w_k_plus_1))

        # Increment iteration counter
        k += 1
        variable_increment = np.linalg.norm(all_w[k] - all_w[k - 1])

    # For convenience we transform the outputs into numpy array before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f)
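
> For reference, the iteration implemented above can be summarized as follows. This summary is read off the code in this cell, with the convention $\boldsymbol{w}_{-1} = \boldsymbol{w}_0$; it is not meant to be the only way to combine backtracking and momentum. The step length is $\alpha_k = c_2^{m} \alpha$, where $m \geq 0$ is the smallest integer such that the sufficient decrease condition holds for the plain gradient step:
> $$f(\boldsymbol{w}_k - \alpha_k \nabla f(\boldsymbol{w}_k)) \leq f(\boldsymbol{w}_k) - c_1 \alpha_k \|\nabla f(\boldsymbol{w}_k)\|^2.$$
> The new iterate is then obtained by adding the momentum term to the gradient step:
> $$\boldsymbol{z}_k = \boldsymbol{w}_k - \alpha_k \nabla f(\boldsymbol{w}_k), \qquad \boldsymbol{w}_{k+1} = \boldsymbol{z}_k + \beta (\boldsymbol{w}_k - \boldsymbol{w}_{k-1}),$$
> and the iterations stop as soon as $\|\boldsymbol{w}_{k+1} - \boldsymbol{w}_k\| \leq \varepsilon$. Since the line search only tests $\boldsymbol{z}_k$, the sufficient decrease is not guaranteed to hold at $\boldsymbol{w}_{k+1}$.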

5. Apply the heavy ball method with backtracking starting from several initial conditions, obtained by an equispaced subdivision of the domain $[-10, 10]^2$ into squares of side length two. How many times does the heavy ball method converge to the global minimum $\boldsymbol{w}^* = (0, 0)$? Is it better than gradient descent?

*Solution*:

In [None]:
# Count how many runs end at each (rounded) minimum point
solutions = dict()
for i in range(-10, 11, 2):
    for j in range(-10, 11, 2):
        # Start from the initial condition (i + 0.5, j + 0.5)
        all_w, _, _ = heavy_ball_backtracking_line_search_ex_1_6(
            1, 0.1, 0.7, 0.5, 1e-5, np.array([i + 0.5, j + 0.5]))
        optimal_w = all_w[-1]
        # Round the final iterate, and check that it lies within 0.1 of a point with integer coordinates
        component_0_int = np.round(optimal_w[0], 0)
        component_1_int = np.round(optimal_w[1], 0)
        assert np.isclose(optimal_w[0] - component_0_int, 0., atol=1e-1)
        assert np.isclose(optimal_w[1] - component_1_int, 0., atol=1e-1)
        # Use the rounded point as dictionary key and increment its counter
        if (component_0_int, component_1_int) not in solutions:
            solutions[(component_0_int, component_1_int)] = 0
        solutions[(component_0_int, component_1_int)] += 1

> We convert the dictionary into a matrix with three columns.

In [None]:
solutions_keys_np = np.array(list(solutions.keys()))
solutions_values_np = np.array(list(solutions.values())).reshape(-1, 1)
solutions_np = np.hstack((solutions_keys_np, solutions_values_np))
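
> To make the layout of the resulting matrix explicit, the following sketch applies the same conversion to a small dictionary. The contents of `toy_solutions` are hypothetical counts made up for illustration and are not results of the optimization.

```python
import numpy as np

# Hypothetical counts: keys are rounded minimum locations, values are hit counts.
toy_solutions = {(0.0, 0.0): 3, (1.0, -1.0): 2}

# One row per key, with the two coordinates in the columns: shape (2, 2)
keys_np = np.array(list(toy_solutions.keys()))
# Counts as a column vector: shape (2, 1)
values_np = np.array(list(toy_solutions.values())).reshape(-1, 1)
# Columns 0 and 1 hold the coordinates, column 2 holds the count: shape (2, 3)
toy_solutions_np = np.hstack((keys_np, values_np))

# Fraction of runs that ended at the global minimum (0, 0)
fraction = toy_solutions[(0.0, 0.0)] / toy_solutions_np[:, 2].sum()
print(toy_solutions_np, fraction)
```

> The conversion relies on the fact that `keys()` and `values()` of a dictionary are returned in the same order, so each row pairs a location with its own count.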

> We prepare a contour plot overlaid with a scatter plot of the minima found by the optimization method. Markers in the scatter plot are larger if a minimum has been found several times.

In [None]:
domain_component_0 = [-10, 10]
domain_component_1 = [-10, 10]

In [None]:
w_component_0 = np.linspace(domain_component_0[0], domain_component_0[1], 100)
w_component_1 = np.linspace(domain_component_1[0], domain_component_1[1], 100)

In [None]:
f_w = np.zeros((len(w_component_0), len(w_component_1)))
for i in range(f_w.shape[0]):
    for j in range(f_w.shape[1]):
        f_w[j, i] = f_ex_1_6([w_component_0[i], w_component_1[j]])

In [None]:
fig = go.Figure(data=[go.Contour(x=w_component_0, y=w_component_1, z=f_w, opacity=0.5)])
fig.add_scatter(
    x=solutions_np[:, 0], y=solutions_np[:, 1],
    marker=dict(size=5 * np.sqrt(solutions_np[:, 2]), color="black"), mode="markers",
    hovertemplate="(%{x}, %{y}) found %{customdata} times", customdata=solutions_np[:, 2],
)
fig.update_layout(title="Ackley function - convergence to minima", width=512, height=512, autosize=False)
fig.show()

> Fraction of runs in which the global minimum has been found:

In [None]:
solutions[(0, 0)] / sum(solutions_np[:, 2])

> The heavy ball method converged to the global minimum in more than $60\%$ of the cases, compared to roughly $20\%$ for the gradient method.