<a href="https://colab.research.google.com/github/pserebrennikov/3rd-year-project/blob/master/5_automatic_differentiation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 5 - Automatic differentiation
### Course on Optimization for Machine Learning - Dr. F. Ballarin
### Master Degree in Data Analytics for Business, Catholic University of the Sacred Heart, Milano

In this notebook we discuss automatic differentiation and also introduce the [`JAX library`](https://github.com/google/jax), a Google research project that includes automatic differentiation capabilities.

In [None]:
import typing

In [None]:
import jax
import jax.numpy as jnp
import numpy as np
import plotly.graph_objects as go

## Exercise 5.1

1. Implement forward and backward passes for the expression $(a + b) \cdot c$, with $a = -2$, $b = 5$, $c = -4$.

*Solution*
> We follow the computational graph we have derived for the evaluation of the expression...

In [None]:
# Nodes in the first layer
a = - 2
b = 5
c = - 4

# Nodes in the second layer
d = a + b

# Nodes in the third layer
e = d * c

In [None]:
e

In [None]:
assert e == -12

> ... and its gradient

In [None]:
# Nodes in the third layer
de_de = 1

# Edges starting from any node in the third layer
de_dd = c
de_dc = d

# Nodes in the second layer
de_dd = de_de * de_dd

# Edges starting from any node in the second layer
dd_da = 1
dd_db = 1

# Nodes in the first layer
de_da = de_dd * dd_da
de_db = de_dd * dd_db
de_dc = de_de * de_dc

In [None]:
(de_da, de_db, de_dc)

In [None]:
assert (de_da, de_db, de_dc) == (-4, -4, 3)
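> As an optional sanity check (not part of the original solution), the hand-coded backward pass can be compared against central finite differences. The helper names `expression` and `central_difference` below are hypothetical and introduced only for this sketch.

```python
def expression(a, b, c):
    """Evaluate (a + b) * c."""
    return (a + b) * c


def central_difference(func, point, index, h=1e-6):
    """Approximate the partial derivative of func w.r.t. point[index]."""
    forward = list(point)
    backward = list(point)
    forward[index] += h
    backward[index] -= h
    return (func(*forward) - func(*backward)) / (2 * h)


point = (-2.0, 5.0, -4.0)
fd_gradient = [central_difference(expression, point, i) for i in range(3)]
# Expected to be close to the hand-computed values (-4, -4, 3)
print(fd_gradient)
```

> Finite differences only approximate the derivative, and they need two function evaluations per input. The backward pass obtains all three partial derivatives in a single sweep of the graph.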

2. Implement forward and backward passes for the expression $(a + b) \cdot (b + 1)$, with $a = 2$, $b = 1$.

*Solution*
> We follow the computational graph we have derived for the evaluation of the expression...

In [None]:
# Nodes in the first layer
a = 2
b = 1

# Nodes in the second layer
c = a + b
d = b + 1

# Nodes in the third layer
e = c * d

In [None]:
e

In [None]:
assert e == 6

> ... and its gradient

In [None]:
# Nodes in the third layer
de_de = 1

# Edges starting from any node in the third layer
de_dc = d
de_dd = c

# Nodes in the second layer
de_dc = de_de * de_dc
de_dd = de_de * de_dd

# Edges starting from any node in the second layer
dc_da = 1
dc_db = 1
dd_db = 1

# Nodes in the first layer
de_da = de_dc * dc_da
de_db = de_dc * dc_db + de_dd * dd_db

In [None]:
(de_da, de_db)

In [None]:
assert (de_da, de_db) == (2, 5)
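> The two backward passes above follow the same recipe: visit the nodes from the output back to the inputs, and accumulate into each node the contributions of all edges leaving it. The sketch below automates that recipe for sums and products only. The class `Node` and the function `backward` are hypothetical helpers written for illustration. They are not how any specific library implements reverse automatic differentiation.

```python
class Node:
    """A scalar node of a computational graph (illustrative helper)."""

    def __init__(self, value, parents=()):
        self.value = value
        # Each parent is stored together with the derivative of this node w.r.t. it
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value * other.value, ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node of the graph."""
    order, visited = [], set()

    def visit(node):
        # Depth-first search: a node is appended only after all of its parents
        if id(node) not in visited:
            visited.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Reverse order: a node is processed only after all nodes that depend on it
    for node in reversed(order):
        for parent, local_derivative in node.parents:
            parent.grad += node.grad * local_derivative


a, b = Node(2.0), Node(1.0)
e = (a + b) * (b + 1)
backward(e)
print(e.value, a.grad, b.grad)
```

> Since `b` feeds both `a + b` and `b + 1`, its gradient receives two contributions, which is the same sum that appears in the hand-written expression for `de_db`.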

3. Let $\boldsymbol{w} \in \mathbb{R}^2$. Consider the *Ackley function*
$$f(\boldsymbol{w}) = - 20 \exp\left(- 0.2 \sqrt{\frac{[w^{(0)}]^2 + [w^{(1)}]^2}{2}}\right) - \exp\left(\frac{\cos(2 \pi w^{(0)}) + \cos(2 \pi w^{(1)})}{2}\right) + 20 + \exp(1).$$

   Compute the relative error between the evaluation of the exact derivative, which we used in Exercises 1.4 and 1.6, and the result of reverse automatic differentiation.

*Solution*:
> The evaluation of the Ackley function requires the following operations:
> * 5 sums
> * 8 multiplications or divisions
> * 2 squares
> * 1 square root
> * 2 cosine evaluations
> * 3 applications of $\exp$
>
> for a total of 21 operations. This means that the computational graph will have 23 nodes (2 inputs + 21 operations). Manually drawing and manually coding the computational graph would be a very tedious and error-prone task. Fortunately, there are several automatic differentiation libraries that carry out this task. Today we will use the [`JAX library`](https://github.com/google/jax), a Google research project that includes automatic differentiation capabilities.
>
> `jax` defines an interface very similar to `numpy` in the submodule `jax.numpy`, which we imported as `jnp`.
> Using `jnp` rather than `np` is a technical detail which `jax` needs so that it can create the computational graph.
>
> We then implement the Ackley function using `jnp`.

In [None]:
def f_ex_5_1(w: np.ndarray) -> float:
    """Evaluate f(w)."""
    return (
        - 20 * jnp.exp(- 0.2 * jnp.sqrt((w[0]**2 + w[1]**2) / 2))
        - jnp.exp((jnp.cos(2 * jnp.pi * w[0]) + jnp.cos(2 * jnp.pi * w[1])) / 2)
        + 20 + jnp.exp(1)
    )

> Note that if we evaluate the Ackley function at the point $(1, 1)$, the result is stored in a new datatype which is again part of `jax` (called `DeviceArray` in older releases, and `jax.Array` in recent ones). Again, this is a technical detail related to their implementation, and for our goals we will simply consider it a representation of a scalar (for arrays with a single entry) or of a vector (for arrays with multiple entries).

In [None]:
f_ex_5_1(np.array([1.0, 1.0]))

> `jax` offers a `grad` command that will create the computational graph and perform reverse automatic differentiation of the function provided as the first input argument.

In [None]:
grad_f_ex_5_1_ad = jax.grad(f_ex_5_1)

In [None]:
grad_f_ex_5_1_ad(np.array([1.0, 1.0]))

> To compare the result of the automatic differentiation with the exact differentiation, we copy the gradient $\nabla f$ from Exercises 1.4 and 1.6.
>
> Should we be using `np` or should we be using `jnp` here? The rule of thumb is: use `jnp` for any expression that we would like to differentiate (like the implementation of $f$, above). Use `np` for any expression that you are not interested in differentiating (like the implementation of $\nabla f$).

In [None]:
def grad_f_ex_5_1_exact(w: np.ndarray) -> np.ndarray:
    r"""Evaluate \nabla f(w) on paper, for comparison with the result from jax."""
    return np.array([
        2.0 * w[0] * np.exp(- 0.2 * np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)) / np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)
        + np.pi * np.exp(np.cos(2 * np.pi * w[0]) / 2 + np.cos(2 * np.pi * w[1]) / 2) * np.sin(2 * np.pi * w[0]),
        2.0 * w[1] * np.exp(- 0.2 * np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)) / np.sqrt(w[0]**2 / 2 + w[1]**2 / 2)
        + np.pi * np.exp(np.cos(2 * np.pi * w[0]) / 2 + np.cos(2 * np.pi * w[1]) / 2) * np.sin(2 * np.pi * w[1])
    ])

In [None]:
grad_f_ex_5_1_exact(np.array([1.0, 1.0]))

> We compare the automatic differentiation and the exact differentiation on $14^2$ points on an equispaced rectangular grid in $[-1, 1]^2$. The comparison is carried out in terms of relative error. Given a number $a$ and its approximation $b$, the relative error is defined as
$$ \frac{\left|b - a\right|}{\left|a\right|}.$$
Similarly, given a vector $\boldsymbol{v}$ and its approximation $\boldsymbol{z}$, the relative error is defined as
$$ \frac{\left\|\boldsymbol{z} - \boldsymbol{v}\right\|}{\left\|\boldsymbol{v}\right\|}.$$

In [None]:
grad_relative_errors = []
for w_i in np.linspace(-1, 1, 14):
    for w_j in np.linspace(-1, 1, 14):
        w_ij = np.array([w_i, w_j])
        grad_f_ij_exact = grad_f_ex_5_1_exact(w_ij)
        grad_f_ij_ad = grad_f_ex_5_1_ad(w_ij)
        grad_f_ij_error = grad_f_ij_exact - grad_f_ij_ad
        grad_relative_errors.append(np.linalg.norm(grad_f_ij_error) / np.linalg.norm(grad_f_ij_exact))

In [None]:
np.max(grad_relative_errors)

> The relative error in the evaluation of the gradient is $O(10^{-6})$, which is a reasonable accuracy for most applications in machine learning.
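> A plausible explanation for an error of this size (our reading, not a statement from the exercise) is that `jax` works in single precision (32-bit floats) by default, while `numpy` works in double precision. The sketch below uses only `numpy` and the vector relative error defined above. The helper `relative_error` and the sample vector are hypothetical. It shows how much error is introduced by merely rounding a vector to single precision.

```python
import numpy as np


def relative_error(v, z):
    """Relative error of the approximation z of the vector v."""
    return np.linalg.norm(z - v) / np.linalg.norm(v)


v = np.array([0.1, 0.2, 0.3])                # double precision reference
z = v.astype(np.float32).astype(np.float64)  # same values after rounding to single precision

print(relative_error(v, z))
print(np.finfo(np.float32).eps)  # spacing of single precision numbers around 1
```

> Rounding alone already produces a relative error just below the single precision machine epsilon. A gradient that requires about twenty operations accumulates several such rounding errors, so an overall error a few times larger is not surprising.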

## Exercise 5.2

The California Housing Data Set contains data collected during the 1990 U.S. Census. The dataset divides California into neighborhoods (called blocks), and contains the following fields:
* <code><font color="blue">longitude</font></code>: longitudinal position of the block (neighborhood, composed of several houses), i.e. a measure of how far west the block is,
* <code><font color="blue">latitude</font></code>: latitudinal position of the block, i.e. a measure of how far north the block is,
* <code><font color="blue">housing_median_age</font></code>: median age of houses (measured in years) within the block,
* <code><font color="blue">total_rooms</font></code>: total number of rooms within the block,
* <code><font color="blue">total_bedrooms</font></code>: total number of bedrooms within the block,
* <code><font color="red">population</font></code>: total number of people residing within the block,
* <code><font color="blue">households</font></code>: total number of households, i.e. a number of home units or apartments, for the block,
* <code><font color="red">median_income</font></code>: median income for households within the block of houses (measured in tens of thousands of US dollars),
* <code><font color="blue">median_house_value</font></code>: median house value for households within the block (measured in US dollars).

A non-profit company, based in California, would like to fund urban requalification projects for <code><font color="red">highly populated</font></code> blocks with <code><font color="red">low income</font></code> residents. Unfortunately, they cannot just use an up-to-date (updated to last year, not 1990) version of the <code><font color="red">population</font></code> and <code><font color="red">median_income</font></code> fields, because:
* interviewing every single household to ask for a precise and an up-to-date quantification of the <code><font color="red">population</font></code> would take too much time,
* their ethics committee has advised against asking every single resident their income (from which <code><font color="red">median_income</font></code> could be computed).

They are currently collecting up-to-date information on the remaining fields, written in <code><font color="blue">blue</font></code>:
* the fields <code><font color="blue">longitude</font></code> and <code><font color="blue">latitude</font></code> will not have changed since 1990,
* the fields <code><font color="blue">housing_median_age</font></code>, <code><font color="blue">total_rooms</font></code>, <code><font color="blue">total_bedrooms</font></code>, <code><font color="blue">households</font></code> can be inferred from public records,
* they are willing to have technicians provide an up-to-date estimate of the current <code><font color="blue">median_house_value</font></code>. (The ethics committee has not advised against collecting this data, and the company feels that this field can surely be used to describe the impact of their urban requalification project to the residents. Indeed, after requalification the house value is expected to increase).

We have been contacted and asked to provide them with a model that uses the fields written in <code><font color="blue">blue</font></code> to predict <code><font color="red">highly populated</font></code> blocks with <code><font color="red">low income</font></code> residents. Our model will be trained and tested on the 1990 dataset. After validation, the company will then deploy it on the updated data they are currently collecting, and it will be used to help them identify which blocks might be candidates for one of their urban requalification projects.

1. Load and clean the training dataset.

*Solution*:
> The dataset is already available on Colab, in the `sample_data` folder.

In [None]:
import os

In [None]:
import pandas as pd

In [None]:
if os.path.isfile("data/california_housing_train.csv"):
    csv_path = "data/california_housing_train.csv"
elif os.path.isfile("sample_data/california_housing_train.csv"):
    csv_path = "sample_data/california_housing_train.csv"
else:
    csv_path = (
        "https://dmf.unicatt.it/~fball/public/optimization_for_machine_learning"
        + "/california_housing_train.csv"
    )
train = pd.read_csv(csv_path)

> We display the first few lines of the dataset, and some basic information from which we can see basic statistics and check that there are no missing data.

In [None]:
train.head()

In [None]:
train.info()

In [None]:
train.describe()

> Looking at the maximum values we see that there are two fields that seem odd:
> * the maximum of `housing_median_age` is only 52. It seems a bit odd that none of the blocks have been built before 1938, so we should investigate this field further.
> * the maximum of `median_house_value` (which is one of the input features in this exercise) is 500 001, which is strangely close to the round number 500 000.
>
> We graphically investigate these two issues using the `seaborn` library, to help us in the visualization of histograms of each field and an estimation of the density function of the field distribution.

In [None]:
import seaborn as sns

> We start from the median house value field.

In [None]:
sns.histplot(train["median_house_value"], kde=True, stat="density", linewidth=0)

> From the plot we are led to believe that the median house value was clipped at $500 000$, and values above that threshold were reported as $500 001$. Since we do not have the original data to replace the clipped values, for simplicity we will simply discard blocks with a median house value larger than $500 000$.

In [None]:
train = train[train["median_house_value"] < 500001]

In [None]:
sns.histplot(train["median_house_value"], kde=True, stat="density", linewidth=0)

> We then perform a similar analysis for the housing median age field.

In [None]:
sns.histplot(train["housing_median_age"], kde=True, stat="density", linewidth=0)

> Also in this case, it seems that a median age equal to $52$ has been introduced to denote values $> 51$. We discard such blocks as well.

In [None]:
train = train[train["housing_median_age"] < 52]

In [None]:
sns.histplot(train["housing_median_age"], kde=True, stat="density", linewidth=0)

> After this cleaning, the dataset size has decreased from $17 000$ to $15 000$. Still, this size is large enough for our goals.

In [None]:
train.info()

In [None]:
train.describe()

> We finally standardize the dataset. In previous tutorials we have used the normalization formula
> `(train - train.min()) / (train.max() - train.min())`. Here we use instead the standardization `(train - train.mean()) / train.std()`. On one hand this choice shows you a different option for the same feature scaling task; on the other, the reasons for this specific choice will become clear in the following. Note that we are defining two variables `train_mean` and `train_std` because we will need them during the testing phase as well.

In [None]:
train_mean = train.mean()
train_std = train.std()
train = (train - train_mean) / train_std

In [None]:
train.describe()
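> The effect of the standardization can be illustrated on synthetic data. The sketch below uses `numpy` instead of `pandas`, and all array names are hypothetical. We pass `ddof=1` because `pandas` computes the sample standard deviation by default, whereas `numpy` defaults to `ddof=0`.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Statistics are computed on the training data only
toy_mean = toy_train.mean(axis=0)
toy_std = toy_train.std(axis=0, ddof=1)

toy_train_scaled = (toy_train - toy_mean) / toy_std
# The test data are scaled with the *training* statistics
toy_test_scaled = (toy_test - toy_mean) / toy_std

print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0, ddof=1))
print(toy_test_scaled.mean(axis=0), toy_test_scaled.std(axis=0, ddof=1))
```

> After scaling, every training column has zero mean and unit standard deviation (up to rounding). The test columns are only approximately standardized, because they are scaled with statistics computed on a different sample.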

> We finally separate features and observations into $(\boldsymbol{X}_{\text{train}}, \boldsymbol{Y}_{\text{train}})$. Note that, for the first time in this course, we have more than one observation for each data point.

In [None]:
Y_train = train[["population", "median_income"]].to_numpy()
Y_train

In [None]:
X_train = train.drop(["population", "median_income"], axis=1).to_numpy()
X_train

2. Define a function to evaluate a feedforward neural network with two hidden layers, with 16 neurons in the first hidden layer and 12 neurons in the second hidden layer. The $\tanh$ activation function should be used for the hidden layers, while no activation function (i.e., the identity) should be used in the output layer.

*Solution*:
> Note that the text does not say how many neurons there are in the input and output layer. Indeed, it is understood that we should use as many neurons in the input layer as features (i.e., 7 neurons) and as many neurons in the output layer as observations (i.e., 2 neurons).
>
> The following parameters (optimization variables, for us) define the neural network:
> * a matrix $\boldsymbol{W}_1 \in \mathbb{R}^{16 \times 7}$, which collects the weights connecting nodes in the input layer to the first hidden layer;
> * a vector $\boldsymbol{b}_1 \in \mathbb{R}^{16}$, which collects the biases associated to each node in the first hidden layer;
> * a matrix $\boldsymbol{W}_2 \in \mathbb{R}^{12 \times 16}$, which collects the weights connecting nodes in the first hidden layer to the second hidden layer;
> * a vector $\boldsymbol{b}_2 \in \mathbb{R}^{12}$, which collects the biases associated to each node in the second hidden layer;
> * a matrix $\boldsymbol{W}_3 \in \mathbb{R}^{2 \times 12}$, which collects the weights connecting nodes in the second hidden layer to the output layer;
> * a vector $\boldsymbol{b}_3 \in \mathbb{R}^{2}$, which collects the biases associated to each node in the output layer.
>
> To closely resemble previous implementations (e.g., the prediction function $\hat{y}$ in the linear or logistic regression exercises), we will collect all such arguments in a list `w`.
> Note that, since the prediction function will be used in the definition of the empirical risk (which is our cost function, of which we need to compute the gradient), here we use `jnp` to enable automatic differentiation in `jax`.

In [None]:
def feedforward_neural_network_regression(x: np.ndarray, w: typing.List[np.ndarray]) -> np.ndarray:
    """
    Evaluate the feedforward neural network.

    Parameters
    ----------
    x : 1d or 2d numpy array
        a single feature vector (1d array) or multiple feature vectors (2d array) for which we desire a prediction
        by evaluation of the neural network.
    w : list of 2d numpy arrays
        weights and biases of the neural network.

    Returns
    -------
    1d or 2d numpy array
        prediction associated to the feature vector (1d array) or multiple feature vectors (2d array) which
        were provided as inputs.
    """
    (W_1, b_1, W_2, b_2, W_3, b_3) = w

    # Handle x of different shapes (come back to this after point 3).
    # Remember the shape of the input, because x is overwritten by a 2d array in both cases.
    x_is_2d = len(x.shape) == 2
    if x_is_2d:
        x = x.T
    else:
        x = x.reshape(-1, 1)

    # Layer 0 is composed by the input features x
    layer_0 = x

    # Use layer 0, the weights W_1 and the biases b_1 to activate layer 1
    layer_1 = jnp.tanh(jnp.dot(W_1, layer_0) + b_1)

    # Use layer 1, the weights W_2 and the biases b_2 to activate layer 2
    layer_2 = jnp.tanh(jnp.dot(W_2, layer_1) + b_2)

    # Use layer 2, the weights W_3 and the biases b_3 to compute (without activation) the output layer
    layer_3 = jnp.dot(W_3, layer_2) + b_3

    # Apply the transformation back before returning: one row per input row for 2d inputs,
    # a 1d array (as stated in the docstring) for 1d inputs
    if x_is_2d:
        return layer_3.T
    else:
        return layer_3.reshape(-1)
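> To get a feeling for the size of the optimization problem, the sketch below counts the optimization variables and follows the shape of the data through the layers for a batch of `m` points. It uses `numpy` arrays filled with zeros, since only the shapes matter here. The biases are stored as matrices with one column, which is the convention adopted in the next point. The names `num_parameters`, `m` and `layer` are hypothetical.

```python
import numpy as np

d = [7, 16, 12, 2]  # neurons per layer: input, two hidden layers, output

# Each layer l contributes a d[l] x d[l - 1] weight matrix and d[l] biases
num_parameters = sum(d[l] * d[l - 1] + d[l] for l in range(1, len(d)))
print("Number of optimization variables:", num_parameters)

m = 5  # number of feature vectors evaluated at once
layer = np.zeros((d[0], m))  # features stored by columns, as after x.T
for l in range(1, len(d)):
    W_l = np.zeros((d[l], d[l - 1]))
    b_l = np.zeros((d[l], 1))
    # The column b_l is broadcast over the m columns of the product
    layer = np.dot(W_l, layer) + b_l
    print("layer", l, "shape:", layer.shape)
```

> The count is $16 \cdot 7 + 16 + 12 \cdot 16 + 12 + 2 \cdot 12 + 2 = 358$, far too many variables to differentiate by hand.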

3. Initialize the weights and biases using the following procedure, known as Glorot initialization or Xavier initialization (from the name of the researcher who proposed it, Xavier Glorot):
   * initialize biases to zero,
   * initialize weights at layer $l$ sampling from a Gaussian distribution with zero mean and with standard deviation
$$
\sqrt{\frac{2}{d_{l - 1} + d_l}}
$$
where the vector $\boldsymbol{d}$ contains the number of neurons per layer (e.g., in our example $\boldsymbol{d} = [7, 16, 12, 2]$).

*Solution*:
> We use [`numpy.random.normal`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html) to perform the initialization. For technical reasons we define the biases as matrices with one column (rather than vectors).

In [None]:
np.random.seed(52 + 300)

d = [7, 16, 12, 2]

W_1 = np.random.normal(0, np.sqrt(2 / (d[0] + d[1])), size=(d[1], d[0]))
print("W_1 shape:", W_1.shape)

b_1 = np.zeros((d[1], 1))
print("b_1 shape:", b_1.shape)

W_2 = np.random.normal(0, np.sqrt(2 / (d[1] + d[2])), size=(d[2], d[1]))
print("W_2 shape:", W_2.shape)

b_2 = np.zeros((d[2], 1))
print("b_2 shape:", b_2.shape)

W_3 = np.random.normal(0, np.sqrt(2 / (d[2] + d[3])), size=(d[3], d[2]))
print("W_3 shape:", W_3.shape)

b_3 = np.zeros((d[3], 1))
print("b_3 shape:", b_3.shape)
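> The same initialization can be written as a loop over the layers, which avoids repeating the code when the architecture changes. This is only a sketch: the function name `glorot_normal_initialization` is hypothetical, and it draws from `numpy.random.default_rng` rather than from the global seed used above, so the sampled values differ from those of the cell above.

```python
import numpy as np


def glorot_normal_initialization(d, rng):
    """Return [W_1, b_1, W_2, b_2, ...] for a network with d[l] neurons at layer l."""
    w = []
    for l in range(1, len(d)):
        std_l = np.sqrt(2 / (d[l - 1] + d[l]))
        # Weights: zero mean Gaussian with the layer dependent standard deviation
        w.append(rng.normal(0, std_l, size=(d[l], d[l - 1])))
        # Biases: zero, stored as a matrix with one column
        w.append(np.zeros((d[l], 1)))
    return w


rng = np.random.default_rng(352)
w_init = glorot_normal_initialization([7, 16, 12, 2], rng)
print([p.shape for p in w_init])
```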

> We collect them in a variable `w_0`, which we will use as initial point for the optimization.

In [None]:
w_0 = [W_1, b_1, W_2, b_2, W_3, b_3]

> With this initialization, we may test the evaluation of the neural network for a specific value of $\boldsymbol{x}$.

In [None]:
feedforward_neural_network_regression(np.ones(7), w_0)

> `numpy` and `jax` are able to optimize repeated function evaluations when passed multiple points (this is called vectorization). We need multiple function evaluations, for instance, to compute the empirical risk. This is the reason why there is an `if` statement at the beginning of the implementation of `feedforward_neural_network_regression`. We may try to evaluate the network for a matrix composed of a single row (e.g., one row of the training dataset) or for a matrix composed of three rows (e.g., three rows of the training dataset).

In [None]:
feedforward_neural_network_regression(np.ones((1, 7)), w_0)

In [None]:
feedforward_neural_network_regression(
    np.vstack((np.ones((1, 7)), 2 * np.ones((1, 7)), 3 * np.ones((1, 7)))), w_0)
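> The sketch below illustrates vectorization on a single $\tanh$ layer with `numpy`. All names and sizes are hypothetical, and the weights are random. Evaluating the layer on the whole matrix at once gives the same rows as evaluating it on one feature vector at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 7))  # one layer with 4 neurons and 7 inputs
b = np.zeros((4, 1))         # biases as a matrix with one column
X = rng.normal(size=(3, 7))  # three feature vectors, stored by rows

# Vectorized evaluation: transpose, apply the layer to all columns at once, transpose back
batched = np.tanh(np.dot(W, X.T) + b).T

# Evaluation of one feature vector at a time
one_by_one = np.vstack([
    np.tanh(np.dot(W, x.reshape(-1, 1)) + b).T for x in X
])

print(batched.shape, np.allclose(batched, one_by_one))
```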

4. Implement the evaluation of the empirical risk associated to a least squares regression, and its gradient.

*Solution*:
> The least squares loss is defined here as
> $$ \ell(\boldsymbol{x}, \boldsymbol{y}; \boldsymbol{w}) = \left\|\hat{\boldsymbol{y}}(\boldsymbol{x}; \boldsymbol{w}) - \boldsymbol{y}\right\|^2,$$
> where $\hat{\boldsymbol{y}}$ is a symbol that encodes the evaluation of the neural network. Note that we are using the squared norm (rather than the square of a number) because the observation is now a vector with two entries!
>
> We are used to defining the empirical risk by summing the least squares loss over the training dataset in a Python loop. However, this is typically very slow when the dataset is large. For performance reasons, we instead use the `jnp.mean` function and the vectorized evaluation of the prediction.
> A call to the slow (but more readable! Note the use of `np.linalg.norm`, because the observation is now a vector with two entries) version of the empirical risk is left commented below.

In [None]:
def empirical_risk_regression_slow(w: typing.List[np.ndarray]) -> float:
    """
    Evaluate the empirical risk on the training dataset.

    This implementation follows what we did in previous tutorials, but may be slow on large datasets.
    """
    m = X_train.shape[0]
    return 1 / m * sum(
        np.linalg.norm(feedforward_neural_network_regression(X_train[j], w) - Y_train[j])**2
        for j in range(m)
    )

In [None]:
# empirical_risk_regression_slow(w_0)

In [None]:
def empirical_risk_regression(w: typing.List[np.ndarray]) -> float:
    """
    Evaluate the empirical risk on the training dataset.

    This implementation relies on jax for faster computations.
    """
    return jnp.mean(
        jnp.linalg.norm(feedforward_neural_network_regression(X_train, w) - Y_train, axis=1)**2
    )

In [None]:
empirical_risk_regression(w_0)
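> The slow and the vectorized implementations compute the same quantity. The `numpy` sketch below checks this on a hypothetical matrix `residuals`, which plays the role of the differences between predictions and observations (one row per data point, two columns).

```python
import numpy as np

rng = np.random.default_rng(0)
residuals = rng.normal(size=(6, 2))
m = residuals.shape[0]

# Loop over the data points, as in the slow implementation
risk_loop = 1 / m * sum(np.linalg.norm(residuals[j])**2 for j in range(m))

# Row-wise norms computed at once, as in the vectorized implementation
risk_vectorized = np.mean(np.linalg.norm(residuals, axis=1)**2)

# Equivalent formulation which avoids computing a square root
risk_sum_of_squares = np.mean(np.sum(residuals**2, axis=1))

print(risk_loop, risk_vectorized, risk_sum_of_squares)
```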

> When using the gradient method, we are interested in computing the derivative w.r.t. the optimization variables of the empirical risk associated to the whole dataset. Therefore, we apply `jax.grad` to `empirical_risk_regression`.

In [None]:
grad_empirical_risk_regression = jax.grad(empirical_risk_regression)

> Note that, since `w` contains $\boldsymbol{W}_1$, $\boldsymbol{b}_1$, etc, `jax` has returned the gradient w.r.t. $\boldsymbol{W}_1$, $\boldsymbol{b}_1$, ...

In [None]:
grad_empirical_risk_regression(w_0)
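> The structure of the output of `jax.grad` mirrors the structure of its input: for a list of arrays, it returns a list of arrays with the same shapes. The toy cost function below is hypothetical and is small enough for the gradient to be checked by hand.

```python
import jax
import jax.numpy as jnp


def toy_cost(params):
    """Sum of squares of W @ (1, 1, 1) + b."""
    W, b = params
    return jnp.sum((jnp.dot(W, jnp.ones(3)) + b)**2)


params = [jnp.ones((2, 3)), jnp.zeros(2)]
grads = jax.grad(toy_cost)(params)

print(type(grads), [g.shape for g in grads])
print(grads[0])  # gradient w.r.t. W
print(grads[1])  # gradient w.r.t. b
```

> Here $\boldsymbol{W} (1, 1, 1)^T + \boldsymbol{b} = (3, 3)^T$, so every entry of both gradients is equal to $2 \cdot 3 = 6$.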

> When using the mini-batch stochastic gradient method, we are interested in computing the derivative w.r.t. the optimization variables of the empirical risk associated to a subset of the dataset. Therefore, we need to replicate the definition of the `empirical_risk_regression` function, but with additional arguments `X` and `Y` that encode the rows selected by the mini-batch index selection.

In [None]:
def mini_batch_empirical_risk_regression(X: np.ndarray, Y: np.ndarray, w: typing.List[np.ndarray]) -> float:
    """Evaluate the empirical risk on a mini-batch of the training dataset."""
    return jnp.mean(
        jnp.linalg.norm(feedforward_neural_network_regression(X, w) - Y, axis=1)**2
    )

> For instance, using a mini-batch made of four rows we obtain the following result...

In [None]:
mini_batch_empirical_risk_regression(X_train[:4], Y_train[:4], w_0)

> ... while using a mini-batch made up of all rows we obtain the same value as calling `empirical_risk_regression`.

In [None]:
mini_batch_empirical_risk_regression(X_train, Y_train, w_0)

> However, when calling `jax.grad` to compute the gradient of the `mini_batch_empirical_risk_regression`, we should inform `jax` that we are only interested in the gradient w.r.t. the argument `w` (and not `X` or `Y`). This can be done as follows.

In [None]:
grad_mini_batch_empirical_risk_regression = jax.grad(mini_batch_empirical_risk_regression, argnums=2)

In [None]:
grad_mini_batch_empirical_risk_regression(X_train, Y_train, w_0)
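> The role of `argnums` can be seen on a small linear least squares problem. The function `toy_risk` and the toy data below are hypothetical. Without `argnums`, `jax.grad` differentiates with respect to the first argument, which here is the feature matrix.

```python
import jax
import jax.numpy as jnp


def toy_risk(X, Y, w):
    """Mean squared residual of the linear prediction X @ w."""
    return jnp.mean((jnp.dot(X, w) - Y)**2)


X_toy = jnp.array([[1.0, 2.0], [3.0, 4.0]])
Y_toy = jnp.array([1.0, 1.0])
w_toy = jnp.zeros(2)

# Gradient w.r.t. the third argument (index 2), i.e. the optimization variable
grad_w = jax.grad(toy_risk, argnums=2)(X_toy, Y_toy, w_toy)
# Default behavior: gradient w.r.t. the first argument, i.e. the features
grad_X = jax.grad(toy_risk)(X_toy, Y_toy, w_toy)

print(grad_w.shape, grad_w)
print(grad_X.shape)
```

> With $\boldsymbol{w} = \boldsymbol{0}$ both residuals are equal to $-1$, so the gradient with respect to $\boldsymbol{w}$ is $\frac{2}{2} \boldsymbol{X}^T (-1, -1)^T = (-4, -6)^T$.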

5. Implement the mini-batch stochastic gradient method with constant momentum and constant step length in a Python function. Such a function should:
   * take as input the features and observations in the training dataset, the percentage of training features to use in a mini-batch, the function $f$ to be used for the evaluation of the empirical risk on a mini-batch, its gradient $\nabla f$, the value $\alpha$ of the step length, the value $\beta$ of the momentum coefficient, the maximum number $E_{\max}$ of allowed epochs, and the initial condition $\boldsymbol{w}_{0}$;
   * return as outputs the optimization variable iterations $\{\boldsymbol{w}_k\}_k$ and the history of the function values $\{f(\boldsymbol{w}_k)\}_k$.

   Do not use any stopping criteria: termination should occur as soon as the maximum number of epochs is reached.
 
*Solution*:
> We start from the `mini_batch_stochastic_gradient` implementation we have used in the previous tutorial, and change it as follows:
> * since the variable `w` is now a list containing the optimization variables $\boldsymbol{W}_1$, $\boldsymbol{b}_1$, ... as components, we need to carry out the update step from $k$ to $k + 1$ for each component separately (`for c in range(len(w_k))`),
> * the signature of `f` and `grad_f` is different from the previous tutorials: in previous tutorials we used to pass a single index $j \in J_k$ to the function evaluation (and then sum all function evaluations), instead here the restriction to $J_k$ of the training dataset should be provided to the function evaluation.

In [None]:
def mini_batch_stochastic_gradient_momentum(
    X: np.ndarray, Y: np.ndarray, perc: float, f: typing.Callable, grad_f: typing.Callable,
    alpha: float, beta: float, maxep: int, w_0: typing.List[np.ndarray]
) -> typing.Tuple[typing.List[typing.List[np.ndarray]], np.ndarray]:
    """
    Run the mini-batch stochastic gradient method with constant step length and constant momentum coefficient.

    Parameters
    ----------
    X, Y : np.ndarray
        features and observations of the training dataset.
    perc : float
        percentage of the training dataset to use in a mini-batch, expressed as a fraction between 0 and 1.
    f, grad_f : Python function
        callable evaluating the cost function and its gradient, respectively.
    alpha : float
        constant step length.
    beta : float
        constant momentum coefficient.
    maxep : int
        maximum number of allowed epochs.
    w_0 : list of 2d numpy arrays
        initial condition for weights and biases of the neural network.

    Returns
    -------
    list of 2d numpy arrays
        history of the weights and biases of the neural network over the optimization iterations.
    1d numpy array
        history of the empirical risk function values.
    """
    # Determine m and m_b from input arguments
    assert X.shape[0] == Y.shape[0]
    m = Y.shape[0]
    m_b = int(perc * m)

    # Use JAX just-in-time compilation to improve performance
    f = jax.jit(f)
    grad_f = jax.jit(grad_f)

    # Prepare lists collecting the required outputs over the iterations
    assert isinstance(w_0, list)
    all_w = [w_0]
    all_f = [f(X, Y, w_0)]

    # Prepare iteration counter
    k = 0

    # Use the epoch number as stopping criterion
    while k < maxep * m / m_b:
        w_k = all_w[k]
        # At k = 0 the index -1 refers to the only entry w_0, so the momentum term vanishes
        w_k_minus_1 = all_w[k - 1]

        # Draw random indices
        J_k = np.random.choice(m, size=m_b, replace=False)

        # Compute the update direction
        g_k = grad_f(X[J_k], Y[J_k], w_k)

        # Compute w_{k + 1}
        w_k_plus_1 = [
            w_k[c] - alpha * g_k[c] + beta * (w_k[c] - w_k_minus_1[c])
            for c in range(len(w_k))
        ]

        # Update required outputs
        all_w.append(w_k_plus_1)
        all_f.append(f(X, Y, w_k_plus_1))

        # Increment iteration counter
        k += 1

    # Return the history of the optimization variables and costs
    return all_w, np.array(all_f)
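> The update formula with momentum is easier to follow on a problem with a single variable. The sketch below applies the same update, $w_{k + 1} = w_k - \alpha \nabla f(w_k) + \beta (w_k - w_{k - 1})$, to the function $f(w) = w^2 / 2$, using the full gradient rather than mini-batches. The function name `heavy_ball` and the test problem are hypothetical choices made for illustration.

```python
def heavy_ball(grad_f, w_0, alpha, beta, num_iterations):
    """Gradient method with constant step length alpha and momentum coefficient beta."""
    # There is no previous iterate at the first step, so the momentum term is zero
    w_prev, w = w_0, w_0
    history = [w_0]
    for _ in range(num_iterations):
        w_next = w - alpha * grad_f(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
        history.append(w)
    return history


# f(w) = w**2 / 2 has gradient w and minimum at w = 0
history = heavy_ball(lambda w: w, 1.0, 0.05, 0.9, 500)
print(history[1], history[-1])
```

> In the mini-batch implementation the number of iterations is determined by the number of epochs: with mini-batches made of $10\%$ of the dataset, one epoch corresponds to about 10 iterations, so 150 epochs correspond to about 1500 iterations.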

6. Train the neural network using mini-batches made of $10\%$ of the training dataset, step length $\alpha = 0.05$ and momentum coefficient $\beta = 0.9$. Limit your training to 150 epochs.

*Solution*:

In [None]:
np.random.seed(52 + 600)
all_w_regression, all_f_regression = mini_batch_stochastic_gradient_momentum(
    X_train, Y_train, 0.1, mini_batch_empirical_risk_regression, grad_mini_batch_empirical_risk_regression,
    0.05, 0.9, 150, w_0)

> We compare the square root of the training loss at the first epoch and at the last epoch. Note that, since we are using a least squares loss function, we can easily print the RMSE on the training dataset by taking the square root of the empirical risk.

In [None]:
np.sqrt(all_f_regression[0]), np.sqrt(all_f_regression[-1])

> We plot the history of the RMSE on the training set. We notice that there are some oscillations in the plot. Therefore, rather than using the optimization variable `all_w_regression[-1]` at the last iteration, we will use `all_w_regression[k_best_regression]`, where `k_best_regression` is the iteration index at which the RMSE is lowest.

In [None]:
fig = go.Figure()
fig.add_scatter(x=np.arange(all_f_regression.shape[0]), y=np.sqrt(all_f_regression))
fig.update_layout(title="History of RMSE on training set")
fig.update_xaxes(type="log", exponentformat="power")
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

In [None]:
K_regression = all_f_regression.shape[0] - 1
k_best_regression = np.argmin(np.sqrt(all_f_regression))
print(k_best_regression, "vs", K_regression)

In [None]:
np.sqrt(all_f_regression[0]), np.sqrt(all_f_regression[-1]), np.sqrt(all_f_regression[k_best_regression])

7. After loading and cleaning the test dataset, assess the accuracy of the prediction on the test dataset.

*Solution*:
> We load the test dataset with `pandas`, and carry out the same filtering based on the `median_house_value` and `housing_median_age` features.

In [None]:
if os.path.isfile("data/california_housing_test.csv"):
    csv_path = "data/california_housing_test.csv"
elif os.path.isfile("sample_data/california_housing_test.csv"):
    csv_path = "sample_data/california_housing_test.csv"
else:
    csv_path = (
        "https://dmf.unicatt.it/~fball/public/optimization_for_machine_learning"
        + "/california_housing_test.csv"
    )
test = pd.read_csv(csv_path)

In [None]:
test = test[test["median_house_value"] < 500001]
test = test[test["housing_median_age"] < 52]

> We then proceed to the standardization. Watch out that, in order to be consistent with the training phase, the mean and standard deviation of the *training* dataset should be used to carry out this standardization, even if we are operating on the test dataset!

In [None]:
test = (test - train_mean) / train_std

> We finally separate features and observations into $(\boldsymbol{X}_{\text{test}}, \boldsymbol{Y}_{\text{test}})$.

In [None]:
Y_test = test[["population", "median_income"]].to_numpy()
X_test = test.drop(["population", "median_income"], axis=1).to_numpy()

> We then evaluate the RMSE on the test dataset.

In [None]:
RMSE_test = np.sqrt(np.mean(np.linalg.norm(
    feedforward_neural_network_regression(X_test, all_w_regression[k_best_regression]) - Y_test, axis=1
)**2))
RMSE_test

> To present the results to the company, we consider the two outputs separately, and convert the results back into their original units multiplying by their standard deviation.

In [None]:
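# Column 0 contains the population and column 1 the median income:
# we select columns with [:, 0] and [:, 1] (indexing with [0] would select the first row instead)
RMSE_test_normalized_population = np.sqrt(np.mean(
    (feedforward_neural_network_regression(X_test, all_w_regression[k_best_regression])[:, 0] - Y_test[:, 0])**2
))
RMSE_test_normalized_income = np.sqrt(np.mean(
    (feedforward_neural_network_regression(X_test, all_w_regression[k_best_regression])[:, 1] - Y_test[:, 1])**2
))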
RMSE_test_normalized_population, RMSE_test_normalized_income

In [None]:
std_population = train_std[["population"]].to_numpy()[0]
std_income = train_std[["median_income"]].to_numpy()[0]
std_population, std_income

In [None]:
RMSE_test_population = RMSE_test_normalized_population * std_population
RMSE_test_income = RMSE_test_normalized_income * std_income
RMSE_test_population, RMSE_test_income
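
> The rescaling by the standard deviation can be justified as follows, assuming that each output was normalized as $z = (v - \mu) / \sigma$. The symbols are introduced only for this derivation: $v$ is the value in original units, $z$ the normalized value, $\mu$ and $\sigma$ the training mean and standard deviation of that output, and $N$ the number of test entries.

$$
\text{RMSE}_{\text{original}}
= \sqrt{\frac{1}{N} \sum_{j=1}^{N} \left(\hat{v}_j - v_j\right)^2}
= \sqrt{\frac{1}{N} \sum_{j=1}^{N} \sigma^2 \left(\hat{z}_j - z_j\right)^2}
= \sigma \, \text{RMSE}_{\text{normalized}}
$$

> The mean $\mu$ cancels in the difference $\hat{v}_j - v_j = \sigma (\hat{z}_j - z_j)$, which is why only the standard deviation appears in the conversion.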

> What would we need to change in this notebook if, instead of running a regression by neural networks, we were interested in running a classification?

8. The RMSE values on the test set are reported back to the company, which sends us back the two following remarks:
   * they are not satisfied with the accuracy of the estimation on the <code><font color="red">population</font></code> field. Indeed, they define a block as <code><font color="red">highly populated</font></code> if the number of its residents lies in the right distribution tail starting from half a standard deviation of the <code><font color="red">population</font></code> field. Thus, having an RMSE of 500 people on a field that has a standard deviation of 1000 people is unsatisfactory for their goals;
   * their ethics committee feels that making decisions based on an estimated value for <code><font color="red">median_income</font></code> is as unethical as directly asking people what their income is. What they are looking for instead is to detect <code><font color="red">low income</font></code> blocks, defined as blocks which lie in the left distribution tail ending at half a standard deviation of the <code><font color="red">median_income</font></code> field.

   What should we do to address these remarks?
 
*Solution*:
> Thanks to the choice of the normalization (w.r.t. mean and standard deviation, rather than min and max) the requirement
> * "number of its residents lies in the right distribution tail starting from half a standard deviation of the <code><font color="red">population</font></code> field" corresponds to `Y_train[:, 0] > 0.5`, and
> * "lie in the left distribution tail ending at half a standard deviation of the <code><font color="red">median_income</font></code>" corresponds to `Y_train[:, 1] < - 0.5`.
>
> We then multiply the two arrays together to detect blocks in which both conditions hold. For technical reasons we also reshape the resulting vector as a matrix with one column, so that we can reuse some of the previous implementations (in which the two outputs were stored each in a separate column).

In [None]:
y_train = ((Y_train[:, 0] > 0.5) * (Y_train[:, 1] < - 0.5)).astype(np.float64).reshape(-1, 1)

In [None]:
np.sum(y_train) / y_train.shape[0]

> We notice that only a small percentage of the dataset satisfies both conditions, and thus is eligible for the urban requalification project.
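
> To make the correspondence between the thresholds $\pm 0.5$ and the original units concrete, here is a small sketch on synthetic data. The arrays `population` and `income`, their distributions and all helper names are assumptions made for this example, not the housing dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two outputs in their original units (made-up numbers)
population = rng.normal(1400.0, 1000.0, size=1000)
income = rng.normal(3.8, 1.9, size=1000)

# Standardize with mean and standard deviation, as done for the training data
pop_mean, pop_std = population.mean(), population.std()
inc_mean, inc_std = income.mean(), income.std()
pop_z = (population - pop_mean) / pop_std
inc_z = (income - inc_mean) / inc_std

# Labels built on the standardized values (same expression as for y_train)
label_z = ((pop_z > 0.5) * (inc_z < - 0.5)).astype(np.float64).reshape(-1, 1)

# The same conditions written in original units
label_orig = np.logical_and(
    population > pop_mean + 0.5 * pop_std, income < inc_mean - 0.5 * inc_std)

print(np.array_equal(label_z.ravel() > 0.5, label_orig))
print(label_z.shape)
```

> Multiplying two boolean arrays acts as an elementwise logical "and", so the product is nonzero only where both conditions hold.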

9. Define a function to evaluate a feedforward neural network with two hidden layers, with 16 neurons in the first hidden layer and 12 neurons in the second hidden layer. The sigmoid activation function should be used in the hidden layers and, since we are interested in a classification task, in the output layer as well.

*Solution*:
> There are two differences compared to the previous implementation:
> * the activation function is a sigmoid (while previously we used a $\tanh$ activation function). We implement the sigmoid function as we did in previous tutorials (using `jnp` rather than `np`!);
> * the activation function should be applied to the output layer too, otherwise the output may not be between 0 and 1.

In [None]:
def sigmoid(z: float) -> float:
    """Evaluate the sigmoid function."""
    return 1 / (1 + jnp.exp(-z))
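
> A short check of why `jnp` matters here: functions written with `jax.numpy` can be differentiated by `jax.grad`. The sketch below redefines the same sigmoid so that it is self-contained, and compares its automatic derivative with the closed form $\sigma'(z) = \sigma(z) (1 - \sigma(z))$. The sample points are arbitrary.

```python
import jax
import jax.numpy as jnp


def sigmoid(z):
    """Evaluate the sigmoid function (same formula as above)."""
    return 1 / (1 + jnp.exp(-z))


# Derivative obtained by automatic differentiation
sigmoid_prime = jax.grad(sigmoid)

for z in (-2.0, 0.0, 3.0):
    closed_form = sigmoid(z) * (1 - sigmoid(z))
    print(z, sigmoid_prime(z), closed_form)
```

> JAX also ships a ready-made `jax.nn.sigmoid`, which could be used in place of the hand-written formula.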

In [None]:
def feedforward_neural_network_classification(x: np.ndarray, w: typing.List[np.ndarray]) -> np.ndarray:
    """
    Evaluate the feedforward neural network for classification.

    Parameters
    ----------
    x : 1d or 2d numpy array
        a single feature vector (1d array) or multiple feature vectors (2d array) for which we desire a prediction
        by evaluation of the neural network.
    w : list of 2d numpy arrays
        weights and biases of the neural network.

    Returns
    -------
    1d or 2d numpy array
        prediction associated to the feature vector (1d array) or multiple feature vectors (2d array) which
        were provided as inputs.
    """
    (W_1, b_1, W_2, b_2, W_3, b_3) = w

    # Handle x of different shapes (come back to this after point 3)
    if len(x.shape) == 2:
        x = x.T
    else:
        x = x.reshape(-1, 1)

    # Layer 0 is composed by the input features x
    layer_0 = x

    # Use layer 0, the weights W_1 and the biases b_1 to activate layer 1
    layer_1 = sigmoid(jnp.dot(W_1, layer_0) + b_1)

    # Use layer 1, the weights W_2 and the biases b_2 to activate layer 2
    layer_2 = sigmoid(jnp.dot(W_2, layer_1) + b_2)

    # Use layer 2, the weights W_3 and the biases b_3 to compute (with sigmoid activation) the output layer
    layer_3 = sigmoid(jnp.dot(W_3, layer_2) + b_3)

    # Apply the transformation back before returning (x is always 2d here, so a 2d array is returned)
    if len(x.shape) == 2:
        return layer_3.T
    else:
        return layer_3.reshape(-1, 1)

10. Initialize the weights and biases for the classification neural network using the Glorot initialization.

*Solution*:
> The code is very similar to the one used for the regression task. The main difference is that here the output layer has only one neuron (instead of two).

In [None]:
np.random.seed(52 + 1000)

d = [7, 16, 12, 1]

W_1 = np.random.normal(0, np.sqrt(2 / (d[0] + d[1])), size=(d[1], d[0]))
print("W_1 shape:", W_1.shape)

b_1 = np.zeros((d[1], 1))
print("b_1 shape:", b_1.shape)

W_2 = np.random.normal(0, np.sqrt(2 / (d[1] + d[2])), size=(d[2], d[1]))
print("W_2 shape:", W_2.shape)

b_2 = np.zeros((d[2], 1))
print("b_2 shape:", b_2.shape)

W_3 = np.random.normal(0, np.sqrt(2 / (d[2] + d[3])), size=(d[3], d[2]))
print("W_3 shape:", W_3.shape)

b_3 = np.zeros((d[3], 1))
print("b_3 shape:", b_3.shape)

w_0 = [W_1, b_1, W_2, b_2, W_3, b_3]
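
> Since the same pattern repeats for every layer, the initialization could also be written as a loop over consecutive layer sizes. The helper `glorot_initialization` below is a hypothetical name introduced for this sketch; it uses a local `np.random.RandomState` instead of the global seed.

```python
import typing

import numpy as np


def glorot_initialization(d: typing.List[int], seed: int) -> typing.List[np.ndarray]:
    """Hypothetical helper: Glorot-initialized weights and zero biases for layer sizes d."""
    random_state = np.random.RandomState(seed)
    w = []
    for (n_in, n_out) in zip(d[:-1], d[1:]):
        # Weights: normal distribution with variance 2 / (n_in + n_out)
        w.append(random_state.normal(0, np.sqrt(2 / (n_in + n_out)), size=(n_out, n_in)))
        # Biases: initialized to zero, stored as column vectors
        w.append(np.zeros((n_out, 1)))
    return w


w_sketch = glorot_initialization([7, 16, 12, 1], 52 + 1000)
for array in w_sketch:
    print(array.shape)
```

> The list is ordered as `[W_1, b_1, W_2, b_2, W_3, b_3]`, matching the unpacking done inside the network evaluation.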

11. Implement the evaluation of the empirical risk associated to the cross-entropy loss (i.e., the loss function of a logistic regression problem), and its gradient.

*Solution*:
> We directly implement the mini-batch version of the empirical risk, since it is the one that we use in the training of the network.

In [None]:
def mini_batch_empirical_risk_classification(X: np.ndarray, y: np.ndarray, w: typing.List[np.ndarray]) -> float:
    """Evaluate the empirical risk on a mini-batch of the training dataset."""
    y_hat = feedforward_neural_network_classification(X, w)
    return jnp.mean(- y * jnp.log(y_hat) - (1 - y) * jnp.log(1 - y_hat))
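
> One caveat of this formula is that `jnp.log` returns `-inf` at zero, so a prediction that saturates exactly at 0 or 1 can turn the whole mean into `nan`. The sketch below shows a possible safeguard based on clipping. The function `clipped_cross_entropy` and the value of `eps` are assumptions of this example and are not part of the implementation above.

```python
import jax.numpy as jnp


def cross_entropy(y, y_hat):
    """Mean cross-entropy, same formula as in the empirical risk above."""
    return jnp.mean(- y * jnp.log(y_hat) - (1 - y) * jnp.log(1 - y_hat))


def clipped_cross_entropy(y, y_hat, eps=1e-7):
    """Hypothetical variant: keep predictions inside [eps, 1 - eps] before taking logarithms."""
    y_hat = jnp.clip(y_hat, eps, 1 - eps)
    return cross_entropy(y, y_hat)


y = jnp.array([[1.0], [0.0]])
y_hat_saturated = jnp.array([[1.0], [0.0]])  # predictions saturated exactly at 1 and 0

print(cross_entropy(y, y_hat_saturated))          # not finite, because of 0 * log(0)
print(clipped_cross_entropy(y, y_hat_saturated))  # finite
```

> Clipping changes the loss only for predictions closer than `eps` to 0 or 1, so it leaves typical evaluations essentially untouched.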

> We test the implemented function both on a mini-batch and on the full training dataset.

In [None]:
mini_batch_empirical_risk_classification(X_train[:4], y_train[:4], w_0)

In [None]:
mini_batch_empirical_risk_classification(X_train, y_train, w_0)

> We define the gradient of the empirical risk, to be employed by the stochastic gradient method, using `jax`.

In [None]:
grad_mini_batch_empirical_risk_classification = jax.grad(mini_batch_empirical_risk_classification, argnums=2)

In [None]:
grad_mini_batch_empirical_risk_classification(X_train, y_train, w_0)
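
> The role of `argnums=2` can be seen on a much smaller problem: `jax.grad` differentiates with respect to the third argument only and, when that argument is a list of arrays, returns a list with the same structure. The function `toy_risk` and the data below are made up for this sketch.

```python
import jax
import jax.numpy as jnp


def toy_risk(X, y, w):
    """Mean squared error of a linear model; w is the list [W, b]."""
    (W, b) = w
    return jnp.mean((jnp.dot(X, W) + b - y) ** 2)


# Differentiate with respect to the argument in position 2, i.e. w
grad_toy_risk = jax.grad(toy_risk, argnums=2)

X = jnp.array([[1.0, 2.0], [3.0, 4.0]])
y = jnp.array([1.0, 2.0])
w = [jnp.zeros(2), jnp.array(0.0)]

g = grad_toy_risk(X, y, w)
print(type(g), len(g))
print(g[0], g[1])
```

> Each entry of `g` has the same shape as the corresponding entry of `w`, which is what allows an update of the form `w_i - alpha * g_i` to be carried out entry by entry.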

12. Train the (classification) neural network using mini-batches made of $10\%$ of the training dataset, step length $\alpha = 2$ and momentum coefficient $\beta = 0.9$. Limit your training to 150 epochs.

*Solution*:
> We may reuse our previous implementation of `mini_batch_stochastic_gradient_momentum` by passing `y_train` as the second argument.

In [None]:
np.random.seed(52 + 1200)
all_w_classification, all_f_classification = mini_batch_stochastic_gradient_momentum(
    X_train, y_train, 0.1, mini_batch_empirical_risk_classification, grad_mini_batch_empirical_risk_classification,
    2.0, 0.9, 150, w_0)
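
> As a reminder of what the step length $\alpha$ and the momentum coefficient $\beta$ control, the sketch below shows one common form of a momentum update applied to a list of arrays. The function `momentum_step` is hypothetical, and the previously implemented `mini_batch_stochastic_gradient_momentum` may follow a different convention.

```python
import numpy as np


def momentum_step(w, v, g, alpha, beta):
    """Hypothetical single update: v <- beta * v - alpha * g, then w <- w + v."""
    v_new = [beta * v_i - alpha * g_i for (v_i, g_i) in zip(v, g)]
    w_new = [w_i + v_i for (w_i, v_i) in zip(w, v_new)]
    return w_new, v_new


# Toy weights and a fixed toy gradient (made-up numbers)
w = [np.array([1.0, 2.0])]
v = [np.zeros(2)]
g = [np.array([0.5, -1.0])]

w_1, v_1 = momentum_step(w, v, g, alpha=2.0, beta=0.9)
w_2, v_2 = momentum_step(w_1, v_1, g, alpha=2.0, beta=0.9)
print(w_1, v_1)
print(w_2, v_2)
```

> With a constant gradient the second step is longer than the first, because the velocity accumulates a fraction $\beta$ of the previous displacement.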

> We compare the cross-entropy loss at the beginning and at the end, as well as plot its history.

In [None]:
all_f_classification[0], all_f_classification[-1]

In [None]:
fig = go.Figure()
fig.add_scatter(x=np.arange(all_f_classification.shape[0]), y=all_f_classification)
fig.update_layout(title="History of cross-entropy loss on training set")
fig.update_xaxes(type="log", exponentformat="power")
fig.update_yaxes(type="log", exponentformat="power")
fig.show()

> We notice some oscillations at the end of the training. Therefore, also here we determine the iteration index `k_best` that minimizes the cross-entropy loss.

In [None]:
K_classification = all_f_classification.shape[0] - 1
k_best_classification = np.argmin(all_f_classification)
print(k_best_classification, "vs", K_classification)

13. Present the prediction results on the test dataset by means of the following *table of confusion*
<table>
 <tr>
  <th></th>
  <th>$\hat{y} \leq \text{tr}$ (blocks not suggested for urban requalification)</th>
  <th>$\hat{y} > \text{tr}$ (blocks suggested for urban requalification)</th>
 </tr>
 <tr>
  <td>$y = 0$ (block is not eligible for urban requalification)</td>
  <td><i># of true negative</i></td>
  <td># of false positive</td>
 </tr>
 <tr>
  <td>$y = 1$ (block is eligible for urban requalification)</td>
  <td># of false negative</td>
  <td><i># of true positive</i></td>
 </tr>
</table>
and choose the threshold value $\text{tr}$ in the logistic regression accounting for the following policies of the company:

   * the company would like to have as many true positive cases as possible, i.e. blocks which are eligible for urban requalification and are suggested as such by the model;
   * when comparing the results of the classification to the former ones obtained with the regression, the company is willing to consider the new model based on classification provided that it has a lower number of false negative cases, as long as the number of false positive cases predicted by the classification model is not larger than twice the number of false positive cases predicted by the regression model. This means that the company is worried about wrongly excluding blocks that are indeed eligible (false negative), and to avoid this they are willing to mistakenly evaluate a moderate number of blocks that in the end will turn out to be not eligible (false positive).
 
*Solution*:
> We first determine the entries that fit the requalification criteria.

In [None]:
y_test = ((Y_test[:, 0] > 0.5) * (Y_test[:, 1] < - 0.5)).astype(np.float64).reshape(-1, 1)

In [None]:
np.sum(y_test) / y_test.shape[0]

> As requested, we use the regression model as base model. We compute the corresponding confusion matrix. This requires evaluating the regression neural network, and checking whether its first output is above 0.5 and its second output is below -0.5.

In [None]:
confusion_regression = np.zeros((2, 2))
for (x_j, y_j) in zip(X_test, y_test):
    Y_hat_j = feedforward_neural_network_regression(x_j, all_w_regression[k_best_regression])
    y_hat_j = ((Y_hat_j[0, 0] > 0.5) * (Y_hat_j[0, 1] < - 0.5)).astype(np.float32)
    confusion_regression[int(y_j > 0.5), int(y_hat_j > 0.5)] += 1
confusion_regression

> We then compare classification models obtained using different thresholds $0.1$, $0.3$ or the default $0.5$.

In [None]:
confusion_classification_1 = np.zeros((2, 2))
confusion_classification_3 = np.zeros((2, 2))
confusion_classification_5 = np.zeros((2, 2))
for (x_j, y_j) in zip(X_test, y_test):
    y_hat_j = feedforward_neural_network_classification(x_j, all_w_classification[k_best_classification])
    confusion_classification_1[int(y_j > 0.1), int(y_hat_j > 0.1)] += 1
    confusion_classification_3[int(y_j > 0.3), int(y_hat_j > 0.3)] += 1
    confusion_classification_5[int(y_j > 0.5), int(y_hat_j > 0.5)] += 1
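
> The loop above evaluates the network one entry at a time. When the predictions for all entries are available as a single array, the same table can be filled without an explicit loop. The sketch below uses the hypothetical helper `confusion_matrix_at_threshold` and made-up labels and scores, and only illustrates the indexing convention (rows: true class, columns: predicted class).

```python
import numpy as np


def confusion_matrix_at_threshold(y_true, y_score, threshold):
    """Hypothetical helper: 2x2 table with rows = true class and columns = predicted class."""
    y_true = (np.asarray(y_true).ravel() > 0.5).astype(int)
    y_pred = (np.asarray(y_score).ravel() > threshold).astype(int)
    confusion = np.zeros((2, 2))
    # Unbuffered accumulation: each (true, predicted) pair adds 1 to its cell
    np.add.at(confusion, (y_true, y_pred), 1)
    return confusion


# Made-up labels and network outputs, for illustration only
toy_labels = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
toy_scores = np.array([0.05, 0.4, 0.2, 0.35, 0.9])

for threshold in (0.1, 0.3, 0.5):
    print(threshold)
    print(confusion_matrix_at_threshold(toy_labels, toy_scores, threshold))
```

> Lowering the threshold moves entries from the first column to the second one, i.e. it trades false negatives for false positives.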

In [None]:
confusion_classification_1

In [None]:
confusion_classification_3

In [None]:
confusion_classification_5

> * We see that the classification model with threshold 0.5 is characterized by a number of true positive cases that is lower than that of the regression model. This goes against the first policy of the company, and thus we discard the threshold 0.5.
> * The two remaining classification models (thresholds 0.1 or 0.3) are both characterized by a lower number of false negative cases than the regression model. However, the threshold 0.1 leads to a very large number of false positive cases. Therefore, we suggest that the company use the classification model with a threshold equal to 0.3.
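
> The reasoning above can also be encoded as an explicit check. The sketch below is one possible reading of the two policies: the function name `satisfies_company_policy` is hypothetical, and the confusion matrices are made-up numbers, not the results obtained in this notebook.

```python
import numpy as np


def satisfies_company_policy(confusion_model, confusion_baseline):
    """Hypothetical check of the two policies against a baseline confusion matrix."""
    (_, fp_model), (fn_model, tp_model) = confusion_model
    (_, fp_base), (fn_base, tp_base) = confusion_baseline
    # First policy (as read here): do not lose true positives w.r.t. the baseline
    keeps_true_positives = tp_model >= tp_base
    # Second policy: fewer false negatives, at most twice the false positives
    fewer_false_negatives = fn_model < fn_base
    bounded_false_positives = fp_model <= 2 * fp_base
    return bool(keeps_true_positives and fewer_false_negatives and bounded_false_positives)


# Made-up tables: rows = true class, columns = predicted class
toy_baseline = np.array([[900, 20], [30, 50]])
toy_moderate = np.array([[880, 40], [10, 70]])     # fewer FN, FP within the bound
toy_aggressive = np.array([[700, 220], [2, 78]])   # too many FP
toy_cautious = np.array([[915, 5], [45, 35]])      # fewer TP than the baseline

for candidate in (toy_moderate, toy_aggressive, toy_cautious):
    print(satisfies_company_policy(candidate, toy_baseline))
```

> In this toy setting only the moderate candidate passes, mirroring the kind of trade-off that led to the choice of an intermediate threshold.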