<a href="https://colab.research.google.com/github/pserebrennikov/3rd-year-project/blob/master/3_stochastic_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 3 - Stochastic methods
### Course on Optimization for Machine Learning - Dr. F. Ballarin
### Master Degree in Data Analytics for Business, Catholic University of the Sacred Heart, Milano

In this notebook we implement the stochastic versions of the first-order methods seen in the first lecture.

In [None]:
import typing

In [None]:
import numpy as np
import plotly.colors
import plotly.graph_objects as go
import plotly.subplots

## Exercise 3.1 (continuation of Exercises 1.1, 1.5 and 2.1)
Let $\boldsymbol{w} \in \mathbb{R}^2$. Consider the *Booth function*
$$f(\boldsymbol{w}) = (w^{(0)} + 2 w^{(1)} - 7)^2 + (2 w^{(0)} + w^{(1)} - 5)^2$$
as the sum of two functions
$$f_0(\boldsymbol{w}) = (w^{(0)} + 2 w^{(1)} - 7)^2, \qquad f_1(\boldsymbol{w}) = (2 w^{(0)} + w^{(1)} - 5)^2.$$
(Note that the numbering of the addends starts from 0, for consistency with Python indexing.)

13. Draw a contour plot of the functions $f$, $f_0$ and $f_1$ on the square domain $[-10, 10]^2$.

*Solution*:
> As in previous exercises, we start by defining an equispaced sampling of the domain.

In [None]:
domain_component_0 = [-10, 10]
domain_component_1 = [-10, 10]

In [None]:
w_component_0 = np.linspace(domain_component_0[0], domain_component_0[1], 100)
w_component_1 = np.linspace(domain_component_1[0], domain_component_1[1], 100)

> We then define $f$, $f_0$ and $f_1$ using a single Python function. In addition to the usual `w` argument, this function takes a second optional argument:
> * when the second argument is not provided, i.e. `f_ex_3_1(w)` is called, we return the evaluation of $f$, as we are used to from previous exercises
> * when the second argument is either 0 or 1, we return the evaluation of either $f_0$ or $f_1$, respectively, at the point `w`.

In [None]:
def f_ex_3_1(w: np.ndarray, addend: int = None) -> float:
    r"""Evaluate f(w) if addend is None, or f_{addend}(w) if addend is an integer."""
    if addend == 0:
        return (w[0] + 2 * w[1] - 7)**2
    elif addend == 1:
        return (2 * w[0] + w[1] - 5)**2
    elif addend is None:
        return (w[0] + 2 * w[1] - 7)**2 + (2 * w[0] + w[1] - 5)**2

> In preparation for the contour plot we store the evaluation of $f$, $f_0$ and $f_1$ in three different matrices. Note the three different calls to `f_ex_3_1`:
> * we do not provide the second argument when evaluating $f$,
> * when interested in evaluating $f_0$ or $f_1$, we may either directly provide the corresponding number 0 or 1 as the second argument, or prefix it with the `addend=` keyword syntax if we want to highlight that the value is associated with the `addend` input variable.

In [None]:
f_w = np.zeros((len(w_component_0), len(w_component_1)))
f0_w = np.zeros((len(w_component_0), len(w_component_1)))
f1_w = np.zeros((len(w_component_0), len(w_component_1)))
for i in range(f_w.shape[0]):
    for j in range(f_w.shape[1]):
        f_w[j, i] = f_ex_3_1([w_component_0[i], w_component_1[j]])
        f0_w[j, i] = f_ex_3_1([w_component_0[i], w_component_1[j]], addend=0)
        f1_w[j, i] = f_ex_3_1([w_component_0[i], w_component_1[j]], 1)
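
> The double loop above can also be written without explicit loops. The following is a minimal sketch, assuming a hypothetical helper `booth` (not part of the course code) that accepts the two components as separate arrays; it relies on `np.meshgrid` to build the grid.

```python
import numpy as np


def booth(w0, w1, addend=None):
    """Booth function (addend=None) or one of its two addends, evaluated elementwise.

    Hypothetical helper used only for this illustration.
    """
    f0 = (w0 + 2 * w1 - 7)**2
    f1 = (2 * w0 + w1 - 5)**2
    if addend == 0:
        return f0
    if addend == 1:
        return f1
    if addend is None:
        return f0 + f1
    raise ValueError("addend must be 0, 1 or None")


grid_0 = np.linspace(-10, 10, 100)
grid_1 = np.linspace(-10, 10, 100)

# With the default indexing="xy", W0[j, i] = grid_0[i] and W1[j, i] = grid_1[j],
# which matches the [j, i] layout filled by the loops.
W0, W1 = np.meshgrid(grid_0, grid_1)
f_grid = booth(W0, W1)
f0_grid = booth(W0, W1, addend=0)
f1_grid = booth(W0, W1, addend=1)
```

> The resulting arrays can be passed as `z` to `fig.add_contour` in the same way as `f_w`, `f0_w` and `f1_w`.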

> We then prepare a contour plot. We prepare three subplots aligned vertically, that will contain (from top to bottom) $f$, $f_0$ and $f_1$. Notice that, with `coloraxis="coloraxis"`, we are instructing `plotly` to share the colorbar among the three subplots.

In [None]:
fig = plotly.subplots.make_subplots(rows=3, cols=1, vertical_spacing=0.05)
fig.add_contour(x=w_component_0, y=w_component_1, z=f_w, row=1, col=1, coloraxis="coloraxis")
fig.add_contour(x=w_component_0, y=w_component_1, z=f0_w, row=2, col=1, coloraxis="coloraxis")
fig.add_contour(x=w_component_0, y=w_component_1, z=f1_w, row=3, col=1, coloraxis="coloraxis")
fig.update_layout(title="Booth function - contour plot", width=512, height=2.5 * 512, autosize=False)
fig.show()

14. Implement the stochastic gradient method with constant step length in a Python function. The function should:
    * take as input the number $m$ of addends, the function $f$, its gradient $\nabla f$, the value $\alpha$ of the step length, the tolerance $\varepsilon$ for the stopping criterion, the maximum number $K_{\max}$ of allowed iterations, and the initial condition $\boldsymbol{w}_{0}$;
    * return as outputs the optimization variable iterations $\{\boldsymbol{w}_k\}_k$, the corresponding function values $\{f(\boldsymbol{w}_k)\}_k$ and gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$.

    Use the stopping criterion based on the norm of the gradient. (Note that such a stopping criterion is not realistic in big data applications, because it requires evaluating the gradient of the whole sum!)

*Solution*:
> Before starting the implementation of the stochastic gradient method we still have to implement the gradient of the functions $f$, $f_0$ and $f_1$. We follow a design similar to the one used for function evaluation, by adding an optional `addend` argument.

In [None]:
def grad_f_ex_3_1(w: np.ndarray, addend: int = None) -> np.ndarray:
    r"""Evaluate \nabla f(w) if addend is None, or \nabla f_{addend}(w) if addend is an integer."""
    if addend == 0:
        return (w[0] + 2 * w[1] - 7) * np.array([2, 4])
    elif addend == 1:
        return (2 * w[0] + w[1] - 5) * np.array([4, 2])
    elif addend is None:
        return np.array([10 * w[0] + 8 * w[1] - 34, 8 * w[0] + 10 * w[1] - 38])
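
> Since $f = f_0 + f_1$, the gradients of the two addends must sum to the gradient of $f$. The sketch below checks this at a few random points; `booth_grad` is a hypothetical stand-alone copy of the formulas above, and the sample points and seed are arbitrary choices.

```python
import numpy as np


def booth_grad(w, addend=None):
    """Gradient of the Booth function (addend=None) or of one of its addends."""
    r0 = w[0] + 2 * w[1] - 7
    r1 = 2 * w[0] + w[1] - 5
    if addend == 0:
        return r0 * np.array([2.0, 4.0])
    if addend == 1:
        return r1 * np.array([4.0, 2.0])
    if addend is None:
        return np.array([10 * w[0] + 8 * w[1] - 34, 8 * w[0] + 10 * w[1] - 38])
    raise ValueError("addend must be 0, 1 or None")


# Arbitrary test points in the plotting domain
rng = np.random.default_rng(0)
points = rng.uniform(-10, 10, size=(5, 2))

# Largest mismatch between the sum of the addend gradients and the full gradient
max_gap = max(
    np.linalg.norm(booth_grad(w, 0) + booth_grad(w, 1) - booth_grad(w))
    for w in points
)
print(max_gap)
```

> A mismatch at the level of rounding errors suggests that the three branches are consistent with each other.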

> We test the implemented functions at the optimal point $\boldsymbol{w}^*$.

In [None]:
w_star = np.array([1, 3])

In [None]:
f_ex_3_1(w_star)

In [None]:
grad_f_ex_3_1(w_star)

> Notice that $\boldsymbol{w}^*$ is not only a minimum of $f$, but also of $f_0$ and $f_1$. This is a particularly favorable situation for the Booth function, but it will not be true in general! (See Exercise 3.2 below.)

In [None]:
f_ex_3_1(w_star, 0)

In [None]:
grad_f_ex_3_1(w_star, 0)

In [None]:
f_ex_3_1(w_star, 1)

In [None]:
grad_f_ex_3_1(w_star, 1)
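
> One way to see why the minimum is shared: each addend is the square of an affine function, so it attains its minimum value 0 wherever that affine function vanishes. The point $\boldsymbol{w}^*$ is the intersection of the two lines $w^{(0)} + 2 w^{(1)} = 7$ and $2 w^{(0)} + w^{(1)} = 5$. A minimal sketch (the names `A`, `b` and `w_common` are introduced here only for illustration):

```python
import numpy as np

# Coefficients of the two affine functions inside the squares of f_0 and f_1
A = np.array([[1.0, 2.0], [2.0, 1.0]])
b = np.array([7.0, 5.0])

# Point at which both affine functions vanish simultaneously
w_common = np.linalg.solve(A, b)
print(w_common)
```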

> In order to implement the stochastic gradient we need to select at every iteration a value $j_k$ as either 0 or 1.
> To this end, we can use [`numpy.random.randint`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html#numpy.random.randint). Notice that we have to pass the number `2` as argument to get samples made of 0 or 1.
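
> As a quick illustration of the sampling call (a sketch separate from the method itself, with an arbitrary seed and sample size): the upper bound passed to `numpy.random.randint` is exclusive, so passing `2` yields only the values 0 and 1. The second part uses `numpy.random.default_rng`, the newer NumPy generator interface, as an alternative that is not used in this notebook.

```python
import numpy as np

# Legacy global-state interface, as used in this notebook
np.random.seed(0)
samples = np.random.randint(2, size=1000)
print(np.unique(samples))

# Re-seeding reproduces exactly the same sequence
np.random.seed(0)
samples_again = np.random.randint(2, size=1000)

# Alternative: a local generator object (upper bound is exclusive here too)
rng = np.random.default_rng(0)
samples_rng = rng.integers(2, size=1000)
```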

In [None]:
def stochastic_gradient(
    m: int, f: typing.Callable, grad_f: typing.Callable, alpha: float, epsilon: float, maxit: int, w_0: np.ndarray
) -> typing.Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Run the stochastic gradient method with constant step length.

    Parameters
    ----------
    m : int
        number of addends in the expression of the cost function.
    f, grad_f : Python function
        callable evaluating the cost function and its gradient, respectively.
    alpha : float
        constant step length.
    epsilon : float
        tolerance for the stopping criterion on the error on the norm of the gradient of the cost.
    maxit : int
        maximum number of allowed iterations.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
    2d numpy array
        history of the gradient of the cost function.
    """
    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f(w_0)]
    all_grad_f = [grad_f(w_0)]

    # Prepare iteration counter
    k = 0

    # Use the norm of the gradient as stopping criterion.
    while np.linalg.norm(all_grad_f[k]) > epsilon:
        w_k = all_w[k]

        # Draw a random index
        j_k = np.random.randint(m)

        # Compute the update direction
        g_k = - grad_f(w_k, addend=j_k)

        # Compute w_{k + 1}
        w_k_plus_1 = w_k + alpha * g_k

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f(w_k_plus_1))
        all_grad_f.append(grad_f(w_k_plus_1))

        # Increment iteration counter
        k += 1

        # Bail out if exceeded allowed number of iterations
        if k >= maxit:
            print("WARNING: stochastic gradient method exceeded number of allowed iterations")
            break

    # For convenience we transform the outputs into numpy arrays before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f)
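
> To relate the update above to the standard gradient method, it may help to look at the expected update direction. Assuming that $j_k$ is drawn uniformly from $\{0, \dots, m - 1\}$ (as `np.random.randint(m)` does) and that $f = \sum_{j=0}^{m-1} f_j$, the expectation with respect to $j_k$, for a given $\boldsymbol{w}_k$, is

$$\mathbb{E}_{j_k}\left[\nabla f_{j_k}(\boldsymbol{w}_k)\right] = \sum_{j=0}^{m-1} \frac{1}{m} \nabla f_j(\boldsymbol{w}_k) = \frac{1}{m} \nabla f(\boldsymbol{w}_k).$$

> On average, one step of this implementation therefore moves along $-\nabla f(\boldsymbol{w}_k)$ with an effective step length $\alpha / m$, even though each individual step only uses one addend.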

15. Choose $\alpha = 1 / 18$, $\varepsilon = 10^{-2}$, $K_{\max} = 1000$ and $\boldsymbol{w}_0 = (-8, -8)$. Visualize:
    * an animation of the optimization variable iterations $\{\boldsymbol{w}_k\}_k$ on a contour plot of $f$;
    * a semilogarithmic plot of the error in the function value $\{f(\boldsymbol{w}_k) - f(\boldsymbol{w}^*)\}_k$ versus the iteration counter $k$;
    * a semilogarithmic plot of the norm of the gradients $\{\|\nabla f(\boldsymbol{w}_k)\|\}_k$ versus the iteration counter $k$.
 
    Since the choice of addend is random, set a seed (for reproducibility) and run the stochastic gradient method four times (and expect slightly different results every time).
 
*Solution*:
> We set the seed and run the stochastic gradient method implementation four times, and collect the results.

In [None]:
all_w = [None] * 4
all_f = [None] * 4
all_grad_f = [None] * 4

np.random.seed(31)
for run in range(4):
    all_w[run], all_f[run], all_grad_f[run] = stochastic_gradient(
        2, f_ex_3_1, grad_f_ex_3_1, 1 / 18, 1e-2, 1000, np.array([-8.0, -8.0]))
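
> The value $\alpha = 1 / 18$ is presumably chosen as $1 / L$, where $L$ is the largest eigenvalue of the (constant) Hessian of the Booth function. Differentiating the gradient implemented above gives the matrix used in this sketch; the variable names are introduced here only for illustration.

```python
import numpy as np

# Hessian of the Booth function, obtained by differentiating
# [10 w0 + 8 w1 - 34, 8 w0 + 10 w1 - 38] with respect to w0 and w1
hessian_booth = np.array([[10.0, 8.0], [8.0, 10.0]])

# eigvalsh is suited to symmetric matrices and returns eigenvalues in ascending order
eigs_booth = np.linalg.eigvalsh(hessian_booth)
L_booth = eigs_booth.max()
print(eigs_booth, 1 / L_booth)
```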

> We create a table that compares the cost function for different iterations $k$ (rows) and different runs (columns). Since different runs may have reached the prescribed tolerance $\varepsilon$ in a different number of iterations, we will compare different runs using the minimum of such numbers.

In [None]:
all_f[0].shape[0]

In [None]:
all_f[1].shape[0]

In [None]:
all_f[2].shape[0]

In [None]:
all_f[3].shape[0]

In [None]:
min_K = min([all_f[run].shape[0] for run in range(4)])
min_K

In [None]:
max_K = max([all_f[run].shape[0] for run in range(4)])
max_K

In [None]:
np.vstack([all_f[run][:min_K] for run in range(4)]).T

> We notice that the overall behavior is quite similar between the different runs, even though the cost function in the last row of the table may differ by about an order of magnitude. Furthermore, we notice that, especially towards the final iterations, the cost function may increase from one iteration to the next!
>
> The non-monotone behavior is particularly evident when looking at the plots of the error on the function value, as well as the plot of the norms of the gradient.
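
> The non-monotone behavior can also be quantified by counting how many times the cost increases from one iteration to the next. The sketch below uses a hypothetical helper `count_increases` and a made-up toy history, not the actual output of the runs.

```python
import numpy as np


def count_increases(values):
    """Count the iterations at which the value is larger than at the previous one."""
    values = np.asarray(values, dtype=float)
    return int(np.sum(np.diff(values) > 0))


# Made-up history: increases occur at 2.0 -> 2.5 and at 1.0 -> 1.2
toy_history = [3.0, 2.0, 2.5, 1.0, 1.2, 0.4]
print(count_increases(toy_history))  # prints 2
```

> In the notebook one could call `count_increases(all_f[run])` for each run; the standard gradient method with a suitable step length would be expected to give 0.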

In [None]:
fig = go.Figure()
for run in range(4):
    fig.add_scatter(
        x=np.arange(all_f[run].shape[0]), y=all_f[run],
        marker=dict(color=plotly.colors.qualitative.Set1[run], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
        mode="lines+markers", name="run " + str(run)
    )
fig.update_layout(
    title="Booth function - error on the function value",
    width=768, height=768, autosize=False
)
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

In [None]:
fig = go.Figure()
for run in range(4):
    fig.add_scatter(
        x=np.arange(all_f[run].shape[0]), y=np.linalg.norm(all_grad_f[run], axis=1),
        marker=dict(color=plotly.colors.qualitative.Set1[run], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
        mode="lines+markers", name="run " + str(run)
    )
fig.update_layout(
    title="Booth function - norm of the gradient",
    width=768, height=768, autosize=False
)
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> Judging from these plots we are led to believe that, at least on average (to cancel out the oscillations), we still have a linear convergence rate.
>
> Oscillations are also quite visible when plotting the optimization variables over a contour plot. The first iterations have considerably larger oscillations, but the optimization variables still oscillate even at later iterations (to see this, uncomment the lines that restrict the horizontal and vertical axes).
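
> One possible way to estimate an average linear rate is to fit a straight line to the logarithm of the error versus $k$: if the error behaves like $C \rho^k$, the slope of the fit is $\log \rho$. The helper `estimate_linear_rate` and the synthetic data below are assumptions made for this sketch; the helper requires strictly positive errors.

```python
import numpy as np


def estimate_linear_rate(errors):
    """Least-squares estimate of rho in errors[k] ~ C * rho**k (errors must be > 0)."""
    errors = np.asarray(errors, dtype=float)
    k = np.arange(errors.shape[0])
    # Degree-1 fit of log(error) against k: the slope approximates log(rho)
    slope, _ = np.polyfit(k, np.log(errors), 1)
    return float(np.exp(slope))


# Synthetic, exactly geometric sequence with rate 0.5
toy_errors = 4.0 * 0.5 ** np.arange(20)
print(estimate_linear_rate(toy_errors))
```

> Applied to `all_f[run]` for the Booth function (whose optimal value is 0), the estimate averages out the oscillations; a value below 1 is consistent with linear convergence.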

In [None]:
# Add opaque contour plot
fig = go.Figure(data=[go.Contour(
    x=w_component_0, y=w_component_1, z=f_w,
    showscale=False, visible=True, opacity=0.5
)])

# Add a red cross marker to locate w*
fig.add_trace(
    go.Scatter(x=[w_star[0]], y=[w_star[1]],
               marker=dict(color="red", size=10, symbol="x"),
               mode="markers", name="w*", visible=True))

# Prepare a slider for each iteration k
slides = []
for k in range(max_K):
    for run in range(4):
        # Set non uniform marker size to highlight the current iteration k
        marker_size = np.zeros((k + 1, ))
        marker_size[-1] = 10
        # Add lines
        fig.add_trace(
            go.Scatter(x=all_w[run][:k + 1, 0],
                       y=all_w[run][:k + 1, 1],
                       visible=False,
                       marker=dict(color=plotly.colors.qualitative.Set1[run], size=marker_size),
                       line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
                       mode="lines+markers",
                       name="run " + str(run) + " at k = " + str(k).zfill(3)))

    # Add slider tick
    slide = {
        "method": "update",
        "args": [
            {"visible": [False] * (4 * max_K + 2)},
            {}
        ]
    }
    slide["args"][0]["visible"][0] = True
    slide["args"][0]["visible"][1] = True
    for run in range(4):
        slide["args"][0]["visible"][4 * k + run + 2] = True
    slides.append(slide)

for run in range(4):
    fig.data[run + 2].visible = True

fig.update_layout(
    title="Booth function - optimization variable iterations over contour plot",
    width=612, height=512, autosize=False,
    sliders=[dict(steps=slides)]
)
# fig.update_xaxes(range=[1 - 0.1, 1 + 0.1])
# fig.update_yaxes(range=[3 - 0.1, 3 + 0.1])
fig.show()

## Exercise 3.2
Let $\boldsymbol{w} \in \mathbb{R}^2$. Consider the following function, known as the *Trid function*,
$$f(\boldsymbol{w}) = (w^{(0)} - 1)^2 + (w^{(1)} - 1)^2 - w^{(0)}w^{(1)}$$
as the sum of three functions
$$f_0(\boldsymbol{w}) = (w^{(0)} - 1)^2, \qquad f_1(\boldsymbol{w}) = (w^{(1)} - 1)^2, \qquad f_2(\boldsymbol{w}) = - w^{(0)}w^{(1)}.$$

1. Draw a contour plot of the functions $f$, $f_0$, $f_1$ and $f_2$ on the square domain $[-4, 4]^2$.

*Solution*:
> The code is very similar to the previous exercise.

In [None]:
domain_component_0 = [-4, 4]
domain_component_1 = [-4, 4]

In [None]:
w_component_0 = np.linspace(domain_component_0[0], domain_component_0[1], 100)
w_component_1 = np.linspace(domain_component_1[0], domain_component_1[1], 100)

In [None]:
def f_ex_3_2(w: np.ndarray, addend: int = None) -> float:
    r"""Evaluate f(w) if addend is None, or f_{addend}(w) if addend is an integer."""
    if addend == 0:
        return (w[0] - 1)**2
    elif addend == 1:
        return (w[1] - 1)**2
    elif addend == 2:
        return - w[0] * w[1]
    elif addend is None:
        return (w[0] - 1)**2 + (w[1] - 1)**2 - w[0] * w[1]

In [None]:
f_w = np.zeros((len(w_component_0), len(w_component_1)))
f0_w = np.zeros((len(w_component_0), len(w_component_1)))
f1_w = np.zeros((len(w_component_0), len(w_component_1)))
f2_w = np.zeros((len(w_component_0), len(w_component_1)))
for i in range(f_w.shape[0]):
    for j in range(f_w.shape[1]):
        f_w[j, i] = f_ex_3_2([w_component_0[i], w_component_1[j]])
        f0_w[j, i] = f_ex_3_2([w_component_0[i], w_component_1[j]], 0)
        f1_w[j, i] = f_ex_3_2([w_component_0[i], w_component_1[j]], 1)
        f2_w[j, i] = f_ex_3_2([w_component_0[i], w_component_1[j]], 2)

In [None]:
fig = plotly.subplots.make_subplots(rows=4, cols=1, vertical_spacing=0.05)
fig.add_contour(x=w_component_0, y=w_component_1, z=f_w, row=1, col=1, coloraxis="coloraxis")
fig.add_contour(x=w_component_0, y=w_component_1, z=f0_w, row=2, col=1, coloraxis="coloraxis")
fig.add_contour(x=w_component_0, y=w_component_1, z=f1_w, row=3, col=1, coloraxis="coloraxis")
fig.add_contour(x=w_component_0, y=w_component_1, z=f2_w, row=4, col=1, coloraxis="coloraxis")
fig.update_layout(title="Trid function - contour plot", width=512, height=3.25 * 512, autosize=False)
fig.show()

> The plot of (especially) the third function $f_2$ is very different from that of $f$. There is a poor correlation between $f$ and $f_2$, meaning that the gradient of $f_2$ may not give accurate information about $f$. How will the stochastic gradient method be affected by this?

2. Compute the gradient $\nabla f$ and determine the global minimum of the function $f$. Also compute the gradients $\nabla f_0$, $\nabla f_1$ and $\nabla f_2$.

*Solution*:
> Taking the partial derivatives of
$$f(\boldsymbol{w}) = (w^{(0)} - 1)^2 + (w^{(1)} - 1)^2 - w^{(0)}w^{(1)}$$
and
$$f_0(\boldsymbol{w}) = (w^{(0)} - 1)^2, \qquad f_1(\boldsymbol{w}) = (w^{(1)} - 1)^2, \qquad f_2(\boldsymbol{w}) = - w^{(0)}w^{(1)}$$
we can easily see that
$$\nabla f(\boldsymbol{w}) = \begin{bmatrix}
2 w^{(0)} - w^{(1)} - 2 \\
- w^{(0)} + 2 w^{(1)} - 2
\end{bmatrix},
\quad
\nabla f_0(\boldsymbol{w}) = \begin{bmatrix}
2 w^{(0)} - 2\\
0
\end{bmatrix},
\quad
\nabla f_1(\boldsymbol{w}) = \begin{bmatrix}
0\\
2 w^{(1)} - 2
\end{bmatrix},
\quad
\nabla f_2(\boldsymbol{w}) = \begin{bmatrix}
- w^{(1)}\\
- w^{(0)}
\end{bmatrix}.
$$

The only stationary point of $f$ is $\boldsymbol{w}^* = (2, 2)$, as can be seen by solving $\nabla f(\boldsymbol{w}) = \boldsymbol{0}$. Furthermore, it is a global minimum because the Hessian of $f$ is constant and positive definite, as the following calculation shows.

In [None]:
hessian_f = np.array([[2, -1], [-1, 2]])
hessian_f

In [None]:
eigs, _ = np.linalg.eig(hessian_f)
eigs

In [None]:
assert (eigs > 0).all()

> From the calculated eigenvalues we see that $f$ is $L$-smooth with $L = 3$ and $\mu$-strongly convex with $\mu = 1$.
>
> We next implement the gradient evaluations

In [None]:
def grad_f_ex_3_2(w: np.ndarray, addend: int = None) -> np.ndarray:
    r"""Evaluate \nabla f(w) if addend is None, or \nabla f_{addend}(w) if addend is an integer."""
    if addend == 0:
        return np.array([2 * w[0] - 2, 0])
    elif addend == 1:
        return np.array([0, 2 * w[1] - 2])
    elif addend == 2:
        return np.array([- w[1], - w[0]])
    elif addend is None:
        return np.array([2 * w[0] - w[1] - 2, - w[0] + 2 * w[1] - 2])

> We test the implemented functions at $\boldsymbol{w}^*$. First of all, we get the optimal function value $f(\boldsymbol{w}^*)$ and check that $\nabla f(\boldsymbol{w}^*) = \boldsymbol{0}$.

In [None]:
w_star = np.array([2, 2])
w_star

In [None]:
f_ex_3_2(w_star)

In [None]:
grad_f_ex_3_2(w_star)

> We then evaluate $f_j(\boldsymbol{w}^*)$ and $\nabla f_j(\boldsymbol{w}^*)$, for $j = 0, 1, 2$.

In [None]:
f_ex_3_2(w_star, 0)

In [None]:
f_ex_3_2(w_star, 1)

In [None]:
f_ex_3_2(w_star, 2)

In [None]:
grad_f_ex_3_2(w_star, 0)

In [None]:
grad_f_ex_3_2(w_star, 1)

In [None]:
grad_f_ex_3_2(w_star, 2)
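
> The checks above are all carried out at $\boldsymbol{w}^* = (2, 2)$, a point with equal components. Such a point cannot reveal, for instance, whether the two components of $\nabla f_2 = (-w^{(1)}, -w^{(0)})$ have been swapped. A more discriminating test compares the analytical gradients with central finite differences at a point with $w^{(0)} \neq w^{(1)}$. The helper names, the test point and the step `h` below are assumptions made for this sketch.

```python
import numpy as np


def trid_addend(w, j):
    """Addend f_j of the Trid function, as defined in this exercise."""
    if j == 0:
        return (w[0] - 1)**2
    if j == 1:
        return (w[1] - 1)**2
    if j == 2:
        return -w[0] * w[1]
    raise ValueError("j must be 0, 1 or 2")


def trid_addend_grad(w, j):
    """Analytical gradient of f_j, following the formulas derived in item 2."""
    if j == 0:
        return np.array([2 * w[0] - 2, 0.0])
    if j == 1:
        return np.array([0.0, 2 * w[1] - 2])
    if j == 2:
        return np.array([-w[1], -w[0]])
    raise ValueError("j must be 0, 1 or 2")


def central_difference(fun, w, h=1e-6):
    """Approximate the gradient of fun at w with central finite differences."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(w.shape[0]):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (fun(w + step) - fun(w - step)) / (2 * h)
    return grad


# Test point with different components, so that swapped entries would be detected
w_test = np.array([0.5, -1.5])
fd_errors = [
    np.linalg.norm(
        central_difference(lambda w: trid_addend(w, j), w_test) - trid_addend_grad(w_test, j)
    )
    for j in range(3)
]
print(fd_errors)
```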

> Notice that *none* of the gradients $\nabla f_j(\boldsymbol{w}^*)$, for $j = 0, 1, 2$, is equal to $\boldsymbol{0}$. How will the stochastic gradient method be affected by this? This is particularly worrisome, because $\nabla f_j(\boldsymbol{w}^*) \neq \boldsymbol{0}$ means that the stochastic gradient method believes that it should take a non-zero step even from the optimum $\boldsymbol{w}^*$!
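
> The concern raised above can be made concrete by computing the step that the method would take when started exactly at $\boldsymbol{w}^*$. The sketch assumes the step length $\alpha = 0.1 / 3$ chosen in the next item and uses the analytical addend gradients derived in item 2; the variable names are introduced only for illustration.

```python
import numpy as np

w_opt = np.array([2.0, 2.0])
alpha = 0.1 / 3  # step length assumed from the next item

# Gradients of f_0, f_1 and f_2 evaluated at the optimum
addend_grads_at_opt = [
    np.array([2 * w_opt[0] - 2, 0.0]),
    np.array([0.0, 2 * w_opt[1] - 2]),
    np.array([-w_opt[1], -w_opt[0]]),
]

# One stochastic gradient step from the optimum, for each possible index j
for j, g in enumerate(addend_grads_at_opt):
    w_next = w_opt - alpha * g
    print(j, w_next, np.linalg.norm(w_next - w_opt))

# The addend gradients cancel only when summed
print(sum(addend_grads_at_opt))
```

> Each individual step moves away from $\boldsymbol{w}^*$ by a distance proportional to $\alpha$, which suggests that a smaller (or decreasing) step length reduces the size of the oscillations.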

3. Choose $\alpha = 0.1 / 3$, $\varepsilon = 10^{-2}$, $K_{\max} = 1000$ and $\boldsymbol{w}_0 = (0, -2)$. Visualize:
   * an animation of the optimization variable iterations $\{\boldsymbol{w}_k\}_k$ on a contour plot of $f$;
   * a semilogarithmic plot of the error in the function value $\{f(\boldsymbol{w}_k) - f(\boldsymbol{w}^*)\}_k$ versus the iteration counter $k$;
   * a semilogarithmic plot of the norm of the gradients $\{\|\nabla f(\boldsymbol{w}_k)\|\}_k$ versus the iteration counter $k$.
 
   Since the choice of addend is random, set a seed (for reproducibility) and run the stochastic gradient method four times (and expect slightly different results every time).
 
*Solution*:
> We may reuse the implementation of the stochastic gradient method from the previous exercise.

In [None]:
all_w = [None] * 4
all_f = [None] * 4
all_grad_f = [None] * 4

np.random.seed(32)
for run in range(4):
    all_w[run], all_f[run], all_grad_f[run] = stochastic_gradient(
        3, f_ex_3_2, grad_f_ex_3_2, 0.1 / 3, 1e-2, 1000, np.array([0.0, -2.0]))

> We notice that none of the runs converged. The computation of `min_K` below indeed returns 1001.

In [None]:
min_K = min([all_f[run].shape[0] for run in range(4)])
min_K

In [None]:
max_K = max([all_f[run].shape[0] for run in range(4)])
max_K

> A table of the evolution of the cost function clearly shows that the method is oscillating close to the optimum.

In [None]:
np.vstack([all_f[run][:min_K] for run in range(4)]).T

> A plot of the error on the function value clearly shows that after an almost monotone improvement in the first iterations, the stochastic gradient method starts to oscillate.

In [None]:
fig = go.Figure()
for run in range(4):
    fig.add_scatter(
        x=np.arange(all_f[run].shape[0]), y=(all_f[run] - f_ex_3_2(w_star)),
        marker=dict(color=plotly.colors.qualitative.Set1[run], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
        mode="lines+markers", name="run " + str(run)
    )
fig.update_layout(
    title="Trid function - error on the function value",
    width=768, height=768, autosize=False
)
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> A similar phenomenon happens for the norm of the gradient of the cost function $f$.

In [None]:
fig = go.Figure()
for run in range(4):
    fig.add_scatter(
        x=np.arange(all_f[run].shape[0]), y=np.linalg.norm(all_grad_f[run], axis=1),
        marker=dict(color=plotly.colors.qualitative.Set1[run], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
        mode="lines+markers", name="run " + str(run)
    )
fig.update_layout(
    title="Trid function - norm of the gradient",
    width=768, height=768, autosize=False
)
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> The animation of the optimization variables over the contour plot provides us with an intuition of what is happening. After a first phase in which the stochastic method gets closer to the optimum, the iterations start to have large oscillations around the minimum point. This phenomenon happens quite frequently with the stochastic gradient method with constant step size. It is due to the fact that, as we saw at item 2, minimum points of $f$ are not necessarily minimum points of $f_j$, and therefore a non-zero (and possibly big) update step is performed even close to the global minimum. We will discuss more about why it happens in the lecture.
>
> Uncomment the axis update lines to zoom in close to the optimum and better appreciate the oscillations.

In [None]:
# Add opaque contour plot
fig = go.Figure(data=[go.Contour(
    x=w_component_0, y=w_component_1, z=f_w,
    showscale=False, visible=True, opacity=0.5
)])

# Add a red cross marker to locate w*
fig.add_trace(
    go.Scatter(x=[w_star[0]], y=[w_star[1]],
               marker=dict(color="red", size=10, symbol="x"),
               mode="markers", name="w*", visible=True))

# Prepare a slider every 10 iterations
slides = []
for k in range(int(max_K / 10)):
    for run in range(4):
        # Set non uniform marker size to highlight the current iteration k
        marker_size = np.zeros((10 * k + 1, ))
        marker_size[-1] = 10
        # Add lines
        fig.add_trace(
            go.Scatter(x=all_w[run][:10 * k + 1, 0],
                       y=all_w[run][:10 * k + 1, 1],
                       visible=False,
                       marker=dict(color=plotly.colors.qualitative.Set1[run], size=marker_size),
                       line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
                       mode="lines+markers",
                       name="run " + str(run) + " at k = " + str(10 * k).zfill(3)))

    # Add slider tick
    slide = {
        "method": "update",
        "args": [
            {"visible": [False] * (4 * int(max_K / 10) + 2)},
            {}
        ]
    }
    slide["args"][0]["visible"][0] = True
    slide["args"][0]["visible"][1] = True
    for run in range(4):
        slide["args"][0]["visible"][4 * k + run + 2] = True
    slides.append(slide)

for run in range(4):
    fig.data[run + 2].visible = True

fig.update_layout(
    title="Trid function - optimization variable iterations over contour plot",
    width=612, height=512, autosize=False,
    sliders=[dict(steps=slides)]
)
# fig.update_xaxes(range=[0, 4])
# fig.update_yaxes(range=[0, 4])
fig.show()

> Notice that the step length suggested by the gradient method theory was $1 / L = 1 / 3$. Here we used $0.1 / 3$, which is 10 times smaller. To convince ourselves that the oscillations are an issue for the stochastic gradient method, but not for the standard gradient descent, we will also run a standard gradient descent method with the same step length. The implementation of the gradient method is copied from Exercise 1.1.

In [None]:
def gradient_descent(
    f: typing.Callable, grad_f: typing.Callable, alpha: float, epsilon: float, maxit: int, w_0: np.ndarray
) -> typing.Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Run the gradient descent method with constant step length.

    Parameters
    ----------
    f, grad_f : Python function
        callable evaluating the cost function and its gradient, respectively.
    alpha : float
        constant step length.
    epsilon : float
        tolerance for the stopping criterion on the error on the norm of the gradient of the cost.
    maxit : int
        maximum number of allowed iterations.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
    2d numpy array
        history of the gradient of the cost function.
    """
    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f(w_0)]
    all_grad_f = [grad_f(w_0)]

    # Prepare iteration counter
    k = 0

    # Use the norm of the gradient as stopping criterion.
    while np.linalg.norm(all_grad_f[k]) > epsilon:
        w_k = all_w[k]
        grad_f_k = all_grad_f[k]
        w_k_plus_1 = w_k - alpha * grad_f_k

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f(w_k_plus_1))
        all_grad_f.append(grad_f(w_k_plus_1))

        # Bail out if the descent condition is not satisfied
        if all_f[k + 1] >= all_f[k]:
            print(all_f[k + 1], all_f[k])
            print("WARNING: descent condition is not satisfied")
            break

        # Increment iteration counter
        k += 1

        # Bail out if exceeded allowed number of iterations
        if k >= maxit:
            print("WARNING: gradient method exceeded number of allowed iterations")
            break

    # For convenience we transform the outputs into numpy arrays before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f)

> The gradient descent method does converge, in approximately 200 iterations, and the function values do not oscillate.

In [None]:
all_w_gradient, all_f_gradient, all_grad_f_gradient = gradient_descent(
    f_ex_3_2, grad_f_ex_3_2, 0.1 / 3, 1e-2, 1000, np.array([0.0, -2.0]))

In [None]:
all_w_gradient.shape[0]

In [None]:
all_f_gradient - f_ex_3_2(w_star)
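
> The iteration count of the gradient descent run can be cross-checked with a simple estimate. For a quadratic cost with constant Hessian $H$, the gradients satisfy $\nabla f(\boldsymbol{w}_{k+1}) = (I - \alpha H) \nabla f(\boldsymbol{w}_k)$, so the gradient norm shrinks at each iteration by at least the factor $\rho = \max_i |1 - \alpha \lambda_i|$, where $\lambda_i$ are the eigenvalues of $H$. The sketch below turns this into an upper bound on the number of iterations needed to reach the tolerance; the variable names are introduced only for illustration.

```python
import math

import numpy as np

hessian = np.array([[2.0, -1.0], [-1.0, 2.0]])
alpha = 0.1 / 3
epsilon = 1e-2
w_init = np.array([0.0, -2.0])

# Gradient of the Trid function at the initial condition
grad_init = np.array([
    2 * w_init[0] - w_init[1] - 2,
    -w_init[0] + 2 * w_init[1] - 2,
])

# Worst-case contraction factor of the gradient norm per iteration
eigs = np.linalg.eigvalsh(hessian)
rho = float(np.max(np.abs(1 - alpha * eigs)))

# Smallest k such that rho**k * ||grad_init|| <= epsilon
k_bound = math.ceil(math.log(epsilon / np.linalg.norm(grad_init)) / math.log(rho))
print(rho, k_bound)
```

> The bound is slightly below 200, which is compatible with the number of iterations reported by the run above. The actual count may be somewhat smaller, because part of the initial gradient lies along the direction that contracts faster.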

## Exercise 3.3 (continuation of Exercise 3.2)
Let $\boldsymbol{w} \in \mathbb{R}^2$. Consider the *Trid function*
$$f(\boldsymbol{w}) = (w^{(0)} - 1)^2 + (w^{(1)} - 1)^2 - w^{(0)}w^{(1)}$$
as the sum of three functions
$$f_0(\boldsymbol{w}) = (w^{(0)} - 1)^2, \qquad f_1(\boldsymbol{w}) = (w^{(1)} - 1)^2, \qquad f_2(\boldsymbol{w}) = - w^{(0)}w^{(1)}.$$

4. Implement the stochastic gradient method with decreasing step length in a Python function. The function should:
   * take as input the number $m$ of addends, the function $f$, its gradient $\nabla f$, the values $\beta$ and $\gamma$ used in the computation of the step length, the tolerance $\varepsilon$ for the stopping criterion, the maximum number $K_{\max}$ of allowed iterations, and the initial condition $\boldsymbol{w}_{0}$;
   * return as outputs the optimization variable iterations $\{\boldsymbol{w}_k\}_k$, the corresponding function values $\{f(\boldsymbol{w}_k)\}_k$ and gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$.

   Use the stopping criterion based on the norm of the gradient.
 
*Solution*:
> We start from our previous implementation of the stochastic gradient and change the computation of the step length $\alpha_k$.

In [None]:
def stochastic_gradient_decreasing_step_length(
    m: int, f: typing.Callable, grad_f: typing.Callable, beta: float, gamma: float, epsilon: float,
    maxit: int, w_0: np.ndarray
) -> typing.Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Run the stochastic gradient method with decreasing step length.

    Parameters
    ----------
    m : int
        number of addends in the expression of the cost function.
    f, grad_f : Python function
        callable evaluating the cost function and its gradient, respectively.
    beta, gamma : float
        constants used in the computation of the step length.
    epsilon : float
        tolerance for the stopping criterion on the error on the norm of the gradient of the cost.
    maxit : int
        maximum number of allowed iterations.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
    2d numpy array
        history of the gradient of the cost function.
    """
    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f(w_0)]
    all_grad_f = [grad_f(w_0)]

    # Prepare iteration counter
    k = 0

    # Use the norm of the gradient as stopping criterion.
    while np.linalg.norm(all_grad_f[k]) > epsilon:
        w_k = all_w[k]

        # Draw a random index
        j_k = np.random.randint(m)

        # Compute the update direction
        g_k = - grad_f(w_k, addend=j_k)

        # Compute alpha_k
        alpha_k = beta / (k + gamma)

        # Compute w_{k + 1}
        w_k_plus_1 = w_k + alpha_k * g_k

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f(w_k_plus_1))
        all_grad_f.append(grad_f(w_k_plus_1))

        # Increment iteration counter
        k += 1

        # Bail out if exceeded allowed number of iterations
        if k >= maxit:
            print("WARNING: stochastic gradient method exceeded number of allowed iterations")
            break

    # For convenience we transform the outputs into numpy arrays before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f)
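
> To get a feeling for the step length rule $\alpha_k = \beta / (k + \gamma)$ used in the implementation above, the sketch below tabulates a few values. It assumes $\beta = 1.01$ and $\gamma = 3$, the values requested in the next item; the selection of iterations $k$ is arbitrary.

```python
import numpy as np

beta, gamma = 1.01, 3.0  # values assumed from the next item
k_values = np.array([0, 1, 10, 100, 1000, 10000])

# Same rule as in the implementation: alpha_k = beta / (k + gamma)
alpha_values = beta / (k_values + gamma)

for k, alpha_k in zip(k_values, alpha_values):
    print(f"k = {int(k):5d}, alpha_k = {alpha_k:.2e}")
```

> With these values the first step length is close to $1 / 3$, while from $k = 28$ onwards $\alpha_k$ is smaller than the constant step length $0.1 / 3$ used in Exercise 3.2, and it keeps shrinking proportionally to $1 / k$.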

5. Choose $\beta = 1.01$, $\gamma = 3$, $\varepsilon = 10^{-2}$, $K_{\max} = 10000$ and $\boldsymbol{w}_0 = (0, -2)$. Visualize:
   * an animation of the optimization variable iterations $\{\boldsymbol{w}_k\}_k$ on a contour plot of $f$;
   * a semilogarithmic plot of the error in the function value $\{f(\boldsymbol{w}_k) - f(\boldsymbol{w}^*)\}_k$ versus the iteration counter $k$;
   * a semilogarithmic plot of the norm of the gradients $\{\|\nabla f(\boldsymbol{w}_k)\|\}_k$ versus the iteration counter $k$.
 
   Since the choice of addend is random, set a seed (for reproducibility) and run the stochastic gradient method with decreasing step size four times (and expect slightly different results every time).

*Solution*:
> Notice that the Hessian of $f$ has eigenvalues $1$ and $3$, hence $\mu = 1$ and $L = 3$, so indeed $\beta > \frac{1}{\mu}$ and $\gamma = \frac{L}{\mu}$ as required by the convergence result. We run our implementation of the stochastic gradient method with diminishing step size.
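
> As a quick sanity check of the constants $\mu$ and $L$, the sketch below computes the eigenvalues of the Hessian of the cost function numerically. The variable names are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np

# Hessian of f(w) = (w0 - 1)^2 + (w1 - 1)^2 - w0 * w1 (constant, since f is quadratic)
hessian = np.array([[2.0, -1.0], [-1.0, 2.0]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
eigenvalues = np.linalg.eigvalsh(hessian)
mu, L = eigenvalues[0], eigenvalues[-1]
print("mu =", mu, "L =", L, "L / mu =", L / mu)
```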

In [None]:
def f_ex_3_3(w: np.ndarray, addend: int = None) -> float:
    r"""Evaluate f(w) if addend is None, or f_{addend}(w) if addend is an integer."""
    if addend == 0:
        return (w[0] - 1)**2
    elif addend == 1:
        return (w[1] - 1)**2
    elif addend == 2:
        return - w[0] * w[1]
    elif addend is None:
        return (w[0] - 1)**2 + (w[1] - 1)**2 - w[0] * w[1]

In [None]:
def grad_f_ex_3_3(w: np.ndarray, addend: int = None) -> np.ndarray:
    r"""Evaluate \nabla f(w) if addend is None, or \nabla f_{addend}(w) if addend is an integer."""
    if addend == 0:
        return np.array([2 * w[0] - 2, 0])
    elif addend == 1:
        return np.array([0, 2 * w[1] - 2])
    elif addend == 2:
        # The partial derivatives of - w[0] * w[1] are - w[1] and - w[0], respectively
        return np.array([- w[1], - w[0]])
    elif addend is None:
        return np.array([2 * w[0] - w[1] - 2, - w[0] + 2 * w[1] - 2])
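
> Hand-derived gradients are easy to get wrong (for the cross term $- w^{(0)} w^{(1)}$ the partial derivatives are $- w^{(1)}$ and $- w^{(0)}$), so a finite-difference check is a cheap safeguard. The following is a minimal sketch: `f_addend`, `grad_f_addend` and `central_difference` are hypothetical local helpers written for this check only, and the test point is arbitrary.

```python
import numpy as np

def f_addend(w, addend):
    """Hypothetical local copy of the three addends of the cost function."""
    if addend == 0:
        return (w[0] - 1)**2
    elif addend == 1:
        return (w[1] - 1)**2
    else:
        return - w[0] * w[1]

def grad_f_addend(w, addend):
    """Hand-derived gradients of the three addends."""
    if addend == 0:
        return np.array([2 * w[0] - 2, 0.0])
    elif addend == 1:
        return np.array([0.0, 2 * w[1] - 2])
    else:
        return np.array([- w[1], - w[0]])

def central_difference(fun, w, h=1e-6):
    """Approximate the gradient of fun at w by central finite differences."""
    grad = np.zeros_like(w)
    for i in range(w.shape[0]):
        e_i = np.zeros_like(w)
        e_i[i] = h
        grad[i] = (fun(w + e_i) - fun(w - e_i)) / (2 * h)
    return grad

w_test = np.array([0.3, -1.7])  # arbitrary test point
max_error = max(
    np.linalg.norm(grad_f_addend(w_test, j) - central_difference(lambda w: f_addend(w, j), w_test))
    for j in range(3)
)
print("largest discrepancy between exact and approximate gradients:", max_error)
```

> Since every addend is quadratic, central differences are exact up to rounding errors, so the reported discrepancy should be tiny.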

In [None]:
all_w = [None] * 4
all_f = [None] * 4
all_grad_f = [None] * 4

np.random.seed(33 + 500)
for run in range(4):
    all_w[run], all_f[run], all_grad_f[run] = stochastic_gradient_decreasing_step_length(
        3, f_ex_3_3, grad_f_ex_3_3, 1.01, 3, 1e-2, 10000, np.array([0.0, -2.0]))

> We notice that none of the runs converged in the prescribed number of iterations.

In [None]:
min_K = min([all_f[run].shape[0] for run in range(4)])
min_K

In [None]:
max_K = max([all_f[run].shape[0] for run in range(4)])
max_K

> A plot of the error on the function values, as well as a plot of the norm of the gradient, reveals that:
> * the convergence is no longer affected by oscillations, but
> * it is extremely slow, because the step lengths are decreasing very rapidly.

In [None]:
fig = go.Figure()
for run in range(4):
    fig.add_scatter(
        x=np.arange(all_f[run].shape[0])[::10], y=(all_f[run][::10] - f_ex_3_3(w_star)),
        marker=dict(color=plotly.colors.qualitative.Set1[run], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
        mode="lines+markers", name="run " + str(run)
    )
fig.update_layout(
    title="Trid function - error on the function value",
    width=768, height=768, autosize=False
)
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

In [None]:
fig = go.Figure()
for run in range(4):
    fig.add_scatter(
        x=np.arange(all_f[run].shape[0])[::10], y=np.linalg.norm(all_grad_f[run][::10], axis=1),
        marker=dict(color=plotly.colors.qualitative.Set1[run], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
        mode="lines+markers", name="run " + str(run)
    )
fig.update_layout(
    title="Trid function - norm of the gradient",
    width=768, height=768, autosize=False
)
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> The slow convergence is quite clear also from the animation of the optimization variables. Zoom in close to the optimum by uncommenting two lines towards the end of the last cell, and compare iteration 3000 to iteration 10000.

In [None]:
domain_component_0 = [-4, 4]
domain_component_1 = [-4, 4]

In [None]:
w_component_0 = np.linspace(domain_component_0[0], domain_component_0[1], 100)
w_component_1 = np.linspace(domain_component_1[0], domain_component_1[1], 100)

In [None]:
f_w = np.zeros((len(w_component_0), len(w_component_1)))
for i in range(f_w.shape[0]):
    for j in range(f_w.shape[1]):
        f_w[j, i] = f_ex_3_3([w_component_0[i], w_component_1[j]])

In [None]:
w_star = np.array([2, 2])

In [None]:
# Add semi-transparent contour plot
fig = go.Figure(data=[go.Contour(
    x=w_component_0, y=w_component_1, z=f_w,
    showscale=False, visible=True, opacity=0.5
)])

# Add a red cross marker to locate w*
fig.add_trace(
    go.Scatter(x=[w_star[0]], y=[w_star[1]],
               marker=dict(color="red", size=10, symbol="x"),
               mode="markers", name="w*", visible=True))

# Prepare a slider every 100 iterations
slides = []
for k in range(int(max_K / 100)):
    for run in range(4):
        # Set non uniform marker size to highlight the current iteration k
        # (the trace below keeps one point every 10 iterations, i.e. 10 * k + 1 points)
        marker_size = np.zeros((10 * k + 1, ))
        marker_size[-1] = 10
        # Add lines
        fig.add_trace(
            go.Scatter(x=all_w[run][:100 * k + 1:10, 0],
                       y=all_w[run][:100 * k + 1:10, 1],
                       visible=False,
                       marker=dict(color=plotly.colors.qualitative.Set1[run], size=marker_size),
                       line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
                       mode="lines+markers",
                       name="run " + str(run) + " at k = " + str(100 * k).zfill(4)))

    # Add slider tick
    slide = {
        "method": "update",
        "args": [
            {"visible": [False] * (4 * int(max_K / 100) + 2)},
            {}
        ]
    }
    slide["args"][0]["visible"][0] = True
    slide["args"][0]["visible"][1] = True
    for run in range(4):
        slide["args"][0]["visible"][4 * k + run + 2] = True
    slides.append(slide)

for run in range(4):
    fig.data[run + 2].visible = True

fig.update_layout(
    title="Trid function - optimization variable iterations over contour plot",
    width=612, height=512, autosize=False,
    sliders=[dict(steps=slides)]
)
# fig.update_xaxes(range=[0, 4])
# fig.update_yaxes(range=[0, 4])
fig.show()

> The learning schedule $\alpha_k = \frac{\beta}{k + \gamma}$ decreases the step length too aggressively, to the point that the step $\alpha_k$ becomes negligibly small for large $k$. Note that we have already encountered a similar issue when discussing Adagrad (and RMSProp, which solves the issue).
> Overall, the linearly decaying schedule is not satisfactory. Feel free to try your implementation of RMSProp or Adam (Homework $\alpha.1$) to see if they perform better than the linearly decaying schedule. We will propose an alternative method later in this exercise that gives more satisfactory results.
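
> To get a feeling for how small the step length becomes, the sketch below evaluates $\alpha_k = \frac{\beta}{k + \gamma}$ at a few iterations, using the values $\beta = 1.01$ and $\gamma = 3$ chosen above. The selection of iterations is arbitrary.

```python
import numpy as np

beta, gamma = 1.01, 3.0

# A few (arbitrarily chosen) iteration counters
ks = np.array([0, 10, 100, 1000, 10000])
alphas = beta / (ks + gamma)

for k, alpha_k in zip(ks, alphas):
    print(f"k = {k:5d}: alpha_k = {alpha_k:.2e}")

# Ratio between the first and the last step length
print("alpha_0 / alpha_10000 =", alphas[0] / alphas[-1])
```

> By the end of the run the step length is more than three orders of magnitude smaller than the initial one.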

6. Implement the mini-batch stochastic gradient method with constant step length in a Python function. Such function should:
   * take as input the number $m$ of addends, the size $m_b$ of a mini-batch, the function $f$, its gradient $\nabla f$, the value $\alpha$ of the step length, the tolerance $\varepsilon$ for the stopping criterion, the maximum number $K_{\max}$ of allowed iterations, and the initial condition $\boldsymbol{w}_{0}$;
   * return as outputs the optimization variable iterations $\{\boldsymbol{w}_k\}_k$, the corresponding function values $\{f(\boldsymbol{w}_k)\}_k$ and gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$.

   Use the stopping criterion based on the norm of the gradient.
 
*Solution*:
> We start from our previous implementation of the stochastic gradient method and change the computation of the update direction. To extract multiple indices we use [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html). Note that, in contrast to the convergence results in the slides, we select indices *without* replacement, because we do not want any of the extracted indices to be repeated.
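
> The difference between sampling with and without replacement can be seen in a small sketch. The seed and the variable names are arbitrary choices made for this illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for reproducibility of this sketch
m = 3

# Without replacement: the m_b = 2 drawn indices are always distinct
J_without = np.random.choice(m, size=2, replace=False)

# With m_b = m, sampling without replacement returns a permutation of 0, ..., m - 1
J_full = np.random.choice(m, size=m, replace=False)

# With replacement: the same index may be drawn more than once
J_with = np.random.choice(m, size=m, replace=True)

print("without replacement, m_b = 2:", J_without)
print("without replacement, m_b = m:", J_full)
print("with replacement, m_b = m:", J_with)
```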

In [None]:
def mini_batch_stochastic_gradient(
    m: int, m_b: int, f: typing.Callable, grad_f: typing.Callable, alpha: float, epsilon: float,
    maxit: int, w_0: np.ndarray
) -> typing.Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Run the mini-batch stochastic gradient method with constant step length.

    Parameters
    ----------
    m : int
        number of addends in the expression of the cost function.
    m_b : int
        size of the mini-batch.
    f, grad_f : Python function
        callable evaluating the cost function and its gradient, respectively.
    alpha : float
        constant step length.
    epsilon : float
        tolerance for the stopping criterion on the error on the norm of the gradient of the cost.
    maxit : int
        maximum number of allowed iterations.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
    2d numpy array
        history of the gradient of the cost function.
    """
    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f(w_0)]
    all_grad_f = [grad_f(w_0)]

    # Prepare iteration counter
    k = 0

    # Use the norm of the gradient as stopping criterion.
    while np.linalg.norm(all_grad_f[k]) > epsilon:
        w_k = all_w[k]

        # Draw random indices
        J_k = np.random.choice(m, size=m_b, replace=False)

        # Compute the update direction
        g_k = - 1 / m_b * sum([grad_f(w_k, addend=j) for j in J_k])

        # Compute w_{k + 1}
        w_k_plus_1 = w_k + alpha * g_k

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f(w_k_plus_1))
        all_grad_f.append(grad_f(w_k_plus_1))

        # Increment iteration counter
        k += 1

        # Bail out if exceeded allowed number of iterations
        if k >= maxit:
            print("WARNING: mini-batch stochastic gradient method exceeded number of allowed iterations")
            break

    # For convenience we transform the outputs into numpy array before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f)

7. Choose $\alpha = 0.1 / 3$, $\varepsilon = 10^{-2}$, $K_{\max} = 1000$ and $\boldsymbol{w}_0 = (0, -2)$. For $m_b = 1, 2$ or $3$, visualize:
   * a semilogarithmic plot of the norm of the gradients $\{\|\nabla f(\boldsymbol{w}_k)\|\}_k$ versus the iteration counter $k$.
 
   Since the choice of addend is random, set a seed (for reproducibility) and run the mini-batch stochastic gradient method four times (and expect slightly different results every time).
 
*Solution*:

In [None]:
all_w = {m_b: [None] * 4 for m_b in range(1, 4)}
all_f = {m_b: [None] * 4 for m_b in range(1, 4)}
all_grad_f = {m_b: [None] * 4 for m_b in range(1, 4)}

np.random.seed(33 + 700)
for m_b in range(1, 4):
    for run in range(4):
        all_w[m_b][run], all_f[m_b][run], all_grad_f[m_b][run] = mini_batch_stochastic_gradient(
            3, m_b, f_ex_3_3, grad_f_ex_3_3, 0.1 / 3, 1e-2, 1000, np.array([0.0, -2.0]))

> Many runs have exceeded the number of allowed iterations, so we expect to see oscillations in the plot of the norm of the gradient. We plot the three cases $m_b = 1, 2$ or $3$ from the top to the bottom.

In [None]:
fig = plotly.subplots.make_subplots(rows=3, cols=1, vertical_spacing=0.05)
for m_b in range(1, 4):
    for run in range(4):
        fig.add_scatter(
            x=np.arange(all_f[m_b][run].shape[0]), y=np.linalg.norm(all_grad_f[m_b][run], axis=1),
            marker=dict(color=plotly.colors.qualitative.Set1[run], size=10),
            line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
            mode="lines+markers", name="run " + str(run), showlegend=(m_b == 1),
            row=m_b, col=1
        )
fig.update_layout(
    title="Trid function - norm of the gradient",
    width=768, height=2.5 * 768, autosize=False
)
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> * The most striking difference between the three plots is that the third one ($m_b = 3$) does not show any oscillation. Indeed, since we are extracting indices without replacement, the case $m_b = m = 3$ corresponds to the gradient method!
> * The first and second plots ($m_b = 1$ and $m_b = 2$, respectively) look similar, as they both present oscillations. However, one should notice that the second plot is better than the first one (i.e., oscillations are less pronounced). Compare for instance $k \in [800, 1000]$: the norm of the gradient oscillates between 2 and 20 for $m_b = 1$, while it is between $0.5$ and $5$ for $m_b = 2$.
>
> Unfortunately, even if the two options proposed in the lecture gave some improvement compared to the basic stochastic gradient method, none of them gave fully satisfactory results. Therefore, in the final question of this exercise we will propose a (very popular) combination of them: using a mini-batch method with step sizes that are decreasing (but not as fast as $1/k$).
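
> The observation that $m_b = m$ recovers the gradient method can be checked directly: averaging the gradients of all addends gives the full gradient divided by $m$, regardless of the order in which the indices are drawn. The sketch below uses hypothetical local helpers `grad_addend` and `grad_full`, which restate the gradients of the cost function of this exercise.

```python
import numpy as np

def grad_addend(w, addend):
    """Gradient of the addend-th term of f(w) = (w0 - 1)^2 + (w1 - 1)^2 - w0 * w1."""
    if addend == 0:
        return np.array([2 * w[0] - 2, 0.0])
    elif addend == 1:
        return np.array([0.0, 2 * w[1] - 2])
    else:
        return np.array([- w[1], - w[0]])

def grad_full(w):
    """Gradient of the full cost function."""
    return np.array([2 * w[0] - w[1] - 2, - w[0] + 2 * w[1] - 2])

m = 3
w = np.array([0.0, -2.0])

# A mini-batch of size m drawn without replacement contains every index exactly once
J = np.random.choice(m, size=m, replace=False)
g_mini_batch = - 1 / m * sum(grad_addend(w, j) for j in J)

# Direction of the gradient method, scaled by 1 / m
g_gradient = - grad_full(w) / m

print(g_mini_batch, g_gradient)
```

> With the $\frac{1}{m_b}$ scaling used here, a mini-batch step of length $\alpha$ with $m_b = m$ therefore coincides with a gradient step of length $\frac{\alpha}{m}$.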

8. Consider a mini-batch stochastic method with the following learning schedule:
   1. the initial step length is set as $\alpha_0 = \frac{\beta}{\gamma}$.
   2. proceed with the iterations of a mini-batch stochastic method with constant step $\alpha_0$ until oscillations are detected.
   3. as soon as oscillations are detected, decrease the step length to $\alpha_1 = \frac{\beta}{(1 + \gamma)}$.
   4. proceed with the iterations of a mini-batch stochastic method with constant step $\alpha_1$ until oscillations are detected.
   5. as soon as oscillations are detected, decrease the step length to $\alpha_2 = \frac{\beta}{(2 + \gamma)}$.
   6. ... and so on ...
 
   The proposed learning schedule decreases the step size less aggressively than the linear decay shown in the lecture. Indeed, the proposed sequence of step sizes is $[\alpha_0, \alpha_0, \dots, \alpha_0, \alpha_1, \alpha_1, \dots, \alpha_1, \alpha_2, \dots]$, while the linearly decaying strategy would be $[\alpha_0, \alpha_1, \alpha_2, \alpha_3, \dots]$.
 
   Implement the proposed method in a Python function. Such function should:
   * take as input the number $m$ of addends, the size $m_b$ of a mini-batch, the function $f$, its gradient $\nabla f$, the values $\beta$ and $\gamma$ used in the computation of the step length, the tolerance $\varepsilon$ for the stopping criterion, the maximum number $K_{\max}$ of allowed iterations, and the initial condition $\boldsymbol{w}_{0}$;
   * return as outputs the optimization variable iterations $\{\boldsymbol{w}_k\}_k$, the corresponding function values $\{f(\boldsymbol{w}_k)\}_k$ and gradients $\{\nabla f(\boldsymbol{w}_k)\}_k$, and the resulting sequence of step sizes.

   Use the stopping criterion based on the norm of the gradient. Oscillations are considered detected when both of the following conditions hold:
   * the function value at iteration $k$ is larger than the function value at iteration $k - 10$, and
   * at least 10 iterations have passed since the last oscillations were detected.
 
*Solution*:

In [None]:
def mini_batch_stochastic_gradient_variable_step_size(
    m: int, m_b: int, f: typing.Callable, grad_f: typing.Callable, beta: float, gamma: float, epsilon: float,
    maxit: int, w_0: np.ndarray
) -> typing.Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Run the mini-batch stochastic gradient method with variable step length.

    Parameters
    ----------
    m : int
        number of addends in the expression of the cost function.
    m_b : int
        size of the mini-batch.
    f, grad_f : Python function
        callable evaluating the cost function and its gradient, respectively.
    beta, gamma : float
        constants used in the computation of the step length.
    epsilon : float
        tolerance for the stopping criterion on the error on the norm of the gradient of the cost.
    maxit : int
        maximum number of allowed iterations.
    w_0 : 1d numpy array
        numpy array containing the initial condition.

    Returns
    -------
    2d numpy array
        history of the optimization variables iterations.
    1d numpy array
        history of the cost function values.
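    2d numpy array
        history of the gradient of the cost function.
    2d numpy array
        history of the step lengths: each row contains the iteration at which
        the step length was set and the corresponding step length.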
    """
    # Prepare lists collecting the required outputs over the iterations
    all_w = [w_0]
    all_f = [f(w_0)]
    all_grad_f = [grad_f(w_0)]

    # Prepare iteration counter
    k = 0

    # Last iteration at which the step length was changed
    k_osc = 0
    alpha = beta / gamma
    all_alphas = [(k_osc, alpha)]

    # Use the norm of the gradient as stopping criterion.
    while np.linalg.norm(all_grad_f[k]) > epsilon:
        w_k = all_w[k]

        # Draw random indices
        J_k = np.random.choice(m, size=m_b, replace=False)

        # Compute the update direction
        g_k = - 1 / m_b * sum([grad_f(w_k, addend=j) for j in J_k])

        # Compute w_{k + 1}
        w_k_plus_1 = w_k + alpha * g_k

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f(w_k_plus_1))
        all_grad_f.append(grad_f(w_k_plus_1))

        # Detect oscillations
        if k >= k_osc + 10 and all_f[-1] > all_f[-11]:
            alpha = beta / (len(all_alphas) + gamma)
            k_osc = k
            all_alphas.append((k_osc, alpha))

        # Increment iteration counter
        k += 1

        # Bail out if exceeded allowed number of iterations
        if k >= maxit:
            print("WARNING: mini-batch stochastic gradient method exceeded number of allowed iterations")
            break

    # For convenience we transform the outputs into numpy array before returning
    return np.array(all_w), np.array(all_f), np.array(all_grad_f), np.array(all_alphas)
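
> To look at the oscillation detection rule in isolation, the sketch below applies the same two conditions to a synthetic history of function values. Both the helper `detect_step_changes` and the history `f_history` are hypothetical and only serve this illustration: they are not produced by any of the methods above.

```python
import numpy as np

def detect_step_changes(f_history, window=10):
    """Return the iterations k at which the oscillation rule would trigger."""
    k_osc = 0
    changes = []
    # f_history[k + 1] plays the role of f(w_{k + 1}), computed during iteration k
    for k in range(len(f_history) - 1):
        enough_iterations = k >= k_osc + window
        if enough_iterations and f_history[k + 1] > f_history[k + 1 - window]:
            k_osc = k
            changes.append(k)
    return changes

# Synthetic history: a steady decrease followed by small oscillations around 1
f_history = np.concatenate([
    np.linspace(10.0, 1.0, 30),
    1.0 + 0.1 * np.cos(np.arange(40))
])

changes = detect_step_changes(f_history)
gaps = np.diff([0] + changes)
print("iterations at which the step length would change:", changes)
print("iterations between consecutive changes:", gaps)
```

> No change is triggered while the function values keep decreasing, and two consecutive changes are always at least `window` iterations apart.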

9. Choose $m_b = 2$, $\beta = 1$, $\gamma = 3$, $\varepsilon = 10^{-2}$, $K_{\max} = 10000$ and $\boldsymbol{w}_0 = (0, -2)$. Visualize:
   * a semilogarithmic plot of the norm of the gradients $\{\|\nabla f(\boldsymbol{w}_k)\|\}_k$ versus the iteration counter $k$.
 
   Since the choice of addend is random, set a seed (for reproducibility) and run our proposed stochastic gradient method four times (and expect slightly different results every time).
 
*Solution*:

In [None]:
all_w = [None] * 4
all_f = [None] * 4
all_grad_f = [None] * 4
all_alphas = [None] * 4

np.random.seed(33 + 900)
for run in range(4):
    all_w[run], all_f[run], all_grad_f[run], all_alphas[run] = mini_batch_stochastic_gradient_variable_step_size(
        3, 2, f_ex_3_3, grad_f_ex_3_3, 1, 3, 1e-2, 10000, np.array([0.0, -2.0]))

> Note that now all runs converged in fewer than the allowed number of iterations.

In [None]:
min_K = min([all_f[run].shape[0] for run in range(4)])
min_K

In [None]:
max_K = max([all_f[run].shape[0] for run in range(4)])
max_K

> Our criterion for oscillations ensures that the same step length will be kept for at least 10 iterations. With the help of the code below, note that it often happens that the same step length is kept for more than 10 iterations. This guarantees that the step length does not decay too fast, as happened in the case of the linearly decaying sequence.

In [None]:
for run in range(4):
    print("Run =", run)
    print(np.diff(all_alphas[run][:, 0]))
    print("")

In [None]:
for run in range(4):
    print("Run =", run)
    print("\t final step with our sequence is ", all_alphas[run][-1, 1])
    print("\t linearly decaying step at the same iteration would have been", 1 / (all_f[run].shape[0] + 3))
    print("")

> Thanks to the improved gradient estimation (mini-batch) and the better sequence of step lengths, the oscillations in the norm of the gradient are less pronounced, and the norm decreases below the provided tolerance in fewer than 10000 iterations.

In [None]:
fig = go.Figure()
for run in range(4):
    fig.add_scatter(
        x=np.arange(all_f[run].shape[0])[::10], y=np.linalg.norm(all_grad_f[run], axis=1)[::10],
        marker=dict(color=plotly.colors.qualitative.Set1[run], size=10),
        line=dict(color=plotly.colors.qualitative.Set1[run], width=2),
        mode="lines+markers", name="run " + str(run)
    )
fig.update_layout(
    title="Trid function - norm of the gradient",
    width=768, height=768, autosize=False
)
fig.update_yaxes(type="log", exponentformat="power")
fig.show()