In [9]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

from sklearn.pipeline import make_pipeline
from sklearn.inspection import DecisionBoundaryDisplay

from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn.datasets import make_circles, make_classification, make_moons
from sklearn.model_selection import train_test_split

from utils import color_cycle

## Lecture 4: Learning in neural networks

* **4.2 Feedforward networks**
    * Multi-layered perceptrons, error backpropagation
    * Regularization
    * MLPs are universal approximators

_Recommended readings_:
* A classic introduction to early neural networks is contained in the book series [Parallel Distributed Processing](https://direct.mit.edu/books/monograph/4424/Parallel-Distributed-Processing-Volume)
* The wiki article on the [Universal approximation theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem) is a good starting point for reading about recent developments after the [pioneering paper](https://link.springer.com/article/10.1007/BF02551274) by Cybenko.

## Multi-layered neural networks

A Multilayer Perceptron (MLP) employs Fully Connected (FC) layers followed by nonlinearities $h$. An example of a two-layer (one-hidden layer) network with $M$ hidden neurons is the following:
$$y_k(x,w)=h_2\left(w_{k0}^{(2)}+\sum_{j=1}^M w^{(2)}_{kj} h_1\left(w_{j0}^{(1)}+\sum_{i=1}^N w^{(1)}_{ji}x_i\right)\right)$$
$h_{1,2}$ are scalar functions such as the identity $x$, the sigmoid $\sigma(x)$ or $\text{ReLU}(x)$ [recall that $\text{ReLU}(x) =x$ when $x>0$ and $\text{ReLU}(x) =0$ otherwise].

For classification, one usually takes the output activation $h_2$ to be the identity and applies a softmax operation to the outputs.
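
As a minimal sketch of the formula above, the forward pass of a two-layer network can be written in a few lines of NumPy. The function names, the layer sizes and the random parameters are illustrative assumptions; the weights follow the index convention $w_{ji}$ of the formula (rows index the receiving unit), which is the transpose of the convention used in `sklearn`'s `coefs_`.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    # subtract the row-wise maximum for numerical stability
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def two_layer_forward(x, W1, b1, W2, b2, h1=np.tanh, h2=softmax):
    """y = h2(b2 + W2 h1(b1 + W1 x)) for a batch x of shape (P, N)."""
    z = h1(x @ W1.T + b1)       # hidden activities, shape (P, M)
    return h2(z @ W2.T + b2)    # outputs, shape (P, K)

rng = np.random.default_rng(0)
N, M, K, P = 4, 5, 3, 6         # inputs, hidden units, outputs, patterns (arbitrary)
W1, b1 = rng.normal(size=(M, N)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)
x = rng.normal(size=(P, N))

y = two_layer_forward(x, W1, b1, W2, b2)
print(y.shape, y.sum(axis=1))   # with a softmax output, each row sums to one
```

Passing `h1=relu` or `h2=lambda a: a` switches to a ReLU hidden layer or a linear output, as used for regression.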

## Cybenko Theorem (1989)

Let $\sigma$ be a continuous sigmoidal function, i.e. $\lim_{x\to+\infty}\sigma\left(x\right)=1$ and $\lim_{x\to-\infty}\sigma\left(x\right)=0$. Then finite sums of the form
$$g\left(x\right)=\sum_{k=1}^{K}w_{k}^{\left(2\right)}\sigma\left(w_{k}^{\left(1\right)}\cdot x+b_{k}^{\left(1\right)}\right)$$
are dense with respect to the supremum norm in the space $C\left(I_{N}\right)$ of continuous functions in the $N$-dimensional unit cube $\left[0,1\right]^{N}$ , i.e. given any $f\in C\left(I_{N}\right)$ and any $\epsilon>0$ there exists a $g$ such that
$$|g\left(x\right)-f\left(x\right)|<\epsilon$$
for all $x\in I_{N}$. Recall that the supremum norm is $\|f\|_\infty=\sup_{x \in I_N} \left|f(x)\right|$.
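
The theorem is an existence statement and does not say how to find the weights. As a rough numerical illustration only (not a proof, and not Cybenko's construction), the sketch below draws the hidden weights and biases at random, fits the output weights $w^{(2)}_k$ by least squares, and prints the sup-norm error on a grid for increasing $K$. The target function, the weight scales and the helper name `sup_error` are arbitrary assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 400)       # grid on the unit interval I_1
f = np.sin(2 * np.pi * x)            # target function (arbitrary choice)

K_max = 50
w1 = rng.normal(scale=10.0, size=K_max)      # random hidden weights (assumption)
b1 = rng.uniform(-10.0, 10.0, size=K_max)    # random hidden biases (assumption)
hidden = sigmoid(np.outer(x, w1) + b1)       # sigma(w_k x + b_k), shape (400, K_max)

def sup_error(K):
    """Sup-norm error on the grid when only the first K hidden units are used."""
    w2, *_ = np.linalg.lstsq(hidden[:, :K], f, rcond=None)   # least-squares output weights
    return np.abs(hidden[:, :K] @ w2 - f).max()

for K in (2, 5, 10, 50):
    print(K, sup_error(K))
```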

#### BUT

A one-hidden-layer network will typically need an **exponentially large** $K$ to approximate a function with an $N$-dimensional input.

## Universal approximation theorems

More general existence theorems exist for
* non-polynomial activation functions;
* MLPs with more than one hidden layer.

### Example with a one-hidden layer network

Consider $2^N$ binary patterns $x_i^\mu= \pm 1$ in $N$ dimensions and two classes $x^\mu \rightarrow t^\mu=\pm 1$.

Use $2^N$ hidden units, labeled $k=0,\ldots,2^N-1$, and make sure that each hidden unit responds to only one of the $2^N$ possible patterns.

To this aim, set:
* $w^{(1)}_{ki}=b$ if the $i$th digit in the binary representation of $k$ is 1
* $w^{(1)}_{ki}=-b$ otherwise

and use a threshold of $(N-1)b$ at each hidden unit, with a Heaviside activation function for the hidden layer and a sign activation function at the output:
$$z_k=\Theta\left[\sum_i w^{(1)}_{ki}x_i-(N-1)b\right]$$
$$y=\text{sign}\left[\sum_{k=0}^{K-1} w^{(2)}_k z_k\right]$$

Here's an example with $N=2$-dimensional inputs and $2^N=4$ hidden units:

| $k$ | binary repr. | $w^{(1)}_{k1}$ | $w^{(1)}_{k2}$ |
|---|---|---|---|
| 0 | 00 | -b | -b |
| 1 | 01 | -b | b |
| 2 | 10 | b | -b |
| 3 | 11 | b | b |

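Consider the pre-activations $\sum_i w^{(1)}_{ki}x_i$ for all $P=2^N$ possible inputs $x$ (for each input, the only entry above the threshold $(N-1)b=b$ is in bold):

| $x_1$ | $x_2$ | $\sum_i w^{(1)}_{0i}x_i$ | $\sum_i w^{(1)}_{1i}x_i$ | $\sum_i w^{(1)}_{2i}x_i$ | $\sum_i w^{(1)}_{3i}x_i$ |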
|---|---|---|---|---|---|
| -1 | -1 | **2b** | 0 | 0 | -2b |
| -1 | 1 | 0 | **2b** | -2b | 0 |
| 1 | -1 | 0 | -2b | **2b** | 0 |
| 1 | 1 | -2b | 0 | 0 | **2b** |

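In the $2^N$-dimensional $z$ space the problem is trivially linearly separable: each pattern activates exactly one hidden unit, so it suffices to give $w^{(2)}_k$ the sign of the target of the pattern that activates unit $k$.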

| $x_1$ | $x_2$ | $z_0$ | $z_1$ | $z_2$ | $z_3$ | $y$ |
|---|---|---|---|---|---|---|
| -1 | -1 | 1 | 0 | 0 | 0 | $\text{sign}[w^{(2)}_0]$ |
| -1 | 1 | 0 | 1 | 0 | 0 | $\text{sign}[w^{(2)}_1]$ |
| 1 | -1 | 0 | 0 | 1 | 0 | $\text{sign}[w^{(2)}_2]$ |
| 1 | 1 | 0 | 0 | 0 | 1 | $\text{sign}[w^{(2)}_3]$ |
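
The construction can be checked numerically for any $N$. The sketch below builds the weights as described above, taking the most significant binary digit first as in the $N=2$ table, and verifies that an arbitrary labeling of the $2^N$ patterns is reproduced. The function names and the choice $w^{(2)}_k=t^k$ are illustrative assumptions.

```python
import numpy as np

def build_lookup_network(targets, b=1.0):
    """Weights of the 2^N-hidden-unit network for targets t = +-1, indexed by pattern number k."""
    n_patterns = len(targets)
    N = n_patterns.bit_length() - 1          # n_patterns = 2^N
    k = np.arange(n_patterns)
    # bits[k, i] = i-th digit (most significant first) of the binary representation of k
    bits = (k[:, None] >> np.arange(N - 1, -1, -1)) & 1
    W1 = b * (2 * bits - 1)                  # +b where the digit is 1, -b otherwise
    threshold = (N - 1) * b
    w2 = np.asarray(targets, dtype=float)    # output weight of unit k = target of pattern k
    return W1, threshold, w2

def forward(x, W1, threshold, w2):
    z = (x @ W1.T - threshold > 0).astype(float)   # Heaviside hidden layer
    return z, np.sign(z @ w2)                      # sign output

N = 3
rng = np.random.default_rng(0)
targets = rng.choice([-1.0, 1.0], size=2**N)   # an arbitrary labeling of the 2^N patterns
W1, threshold, w2 = build_lookup_network(targets)
patterns = np.sign(W1)                         # pattern k has x_i = +1 where digit i of k is 1
z, y = forward(patterns, W1, threshold, w2)
print(np.array_equal(z, np.eye(2**N)), np.array_equal(y, targets))
```

Each row of `z` is a one-hot vector, so the hidden layer acts as a lookup table: the network memorizes any labeling, but needs a number of hidden units that grows exponentially with $N$.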

## MLP for regression

Let us use a two-layer NN with 3 $\tanh$ hidden units and linear output to approximate functions. We take equally spaced points in $x\in [-1,1]$.

In [12]:
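# re-implement the forward pass of a trained one-hidden-layer sklearn MLP
def my_own_net(mlp, x, act_name="tanh"):
    """Return hidden activations and linear pre-outputs computed from the parameters of `mlp`."""
    if act_name == "tanh":
        act = np.tanh
    elif act_name == "relu":
        act = lambda a: np.maximum(a, 0)
    else:
        raise ValueError("Unknown activation function")
    W, v = mlp.coefs_              # input-to-hidden and hidden-to-output weights
    b_W, b_v = mlp.intercepts_     # hidden and output biases
    a_pred = act(x @ W + b_W)      # hidden activations
    y_pred = a_pred @ v + b_v      # linear output (pre-output for a classifier)
    return a_pred, y_pred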

In [None]:
# set input
x = np.linspace(-1, +1, 100)

# uncomment to select output function
# y = x**2
# y = np.sin(x)
# y = np.abs(x)
y = np.sign(x)

# regularization controls the final scale of learned weights and biases: look at the different effect of 0.1 and 0.0
regularization = 0.0001

# train the network (momentum and learning_rate_init are ignored by the lbfgs solver)
mlp = MLPRegressor(hidden_layer_sizes=3, alpha=regularization, activation="tanh", solver="lbfgs", momentum=0.,
                   max_iter=100000, learning_rate_init=0.01, tol=1e-5, random_state=1, verbose=False)
mlp.fit(x[:,None], y)
# compute predicted outputs
y_pred = mlp.predict(x[:,None])

a_pred, y_pred_check = my_own_net(mlp, x[:,None], act_name = "tanh")

# plot network output
plt.figure(figsize=(10,4))
plt.subplot(121)
plt.plot(x, y, label="true output");
plt.plot(x, y_pred, '.', label="predicted output");
plt.plot(x, y_pred_check[:,0], '.', alpha=0.2, label="just a check"); # just a check
lines = plt.plot(x, a_pred, '--', alpha=0.3, label="hidden activations"); # visualize internal activities
plt.setp(lines[1:], label="_")
plt.xlabel("x")
plt.legend();

plt.subplot(122)
plt.plot(mlp.coefs_[0][0], '.-', label="W");
plt.plot(mlp.coefs_[1][:,0], '.-', label="v");
plt.legend();

plt.tight_layout();

## MLP for classification

Adapted from sklearn tutorial [Classifier comparison](https://scikit-learn.org/1.5/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py)

In [None]:
# generate dataset
np.random.seed(1)
num_hidden = 30
n_samples = 500
noise = 0.2

# get dataset
X, y = make_circles(n_samples, noise=noise, factor=0.5, random_state=1)
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

# split dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# train classifier and get boundaries
mlp = MLPClassifier(activation="relu", hidden_layer_sizes=num_hidden, alpha=1, max_iter=1000, random_state=42)
mlp.fit(X_train, y_train)
score = mlp.score(X_test, y_test)

figure = plt.figure(figsize=(12, 4))
# just plot the dataset first
ax = plt.subplot(1, 2, 1)
cm = plt.cm.RdBu
cm_bright = ListedColormap(["#FF0000", "#0000FF"])
# Plot the training and test points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k")
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.5, edgecolors="k");
ax.set_xlim(x_min, x_max)
ax.set_ylim(y_min, y_max)

# Plot the decision boundary together with the training and test points
ax = plt.subplot(1, 2, 2)
DecisionBoundaryDisplay.from_estimator(mlp, X, ax=ax, cmap=cm, alpha=0.8, eps=0.5)
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k")
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, edgecolors="k",alpha=0.6);
ax.set_xlim(x_min, x_max)
ax.set_ylim(y_min, y_max);

#### Distribution of pre-outputs

In [None]:
# get labels from trained mlp
label_pred = mlp.predict(X_train)

# extract parameters and compute internal activities
_, y_pred_check = my_own_net(mlp, X_train, act_name="relu")

# predicted probabilities: uncomment the lines below to compare them with your own computation
proba_pred = mlp.predict_proba(X_train)
# plt.plot(proba_pred[:,1], proba_pred[:,1], ':')
# plt.plot(proba_pred[:,1], np.exp(y_pred_check)/(1.+np.exp(y_pred_check)), '.');
# plt.xlabel('proba pred')
# plt.ylabel('proba pred check');

# visualize last layer activation per label
plt.hist(y_pred_check[label_pred==0], bins="auto", label="label=0");
plt.hist(y_pred_check[label_pred==1], bins="auto", label="label=1");
plt.xlabel("pre-output")
plt.ylabel("count")
plt.legend();
plt.show()

#### Have a look at internal representations of each neuron

In [None]:
# extract internal representation in a 2d grid
num_a = num_hidden
X, Y = np.meshgrid(np.linspace(x_min, x_max, 20), np.linspace(y_min, y_max, 20))
x_stacked = np.dstack([X,Y])
y_stacked = np.zeros((20, 20, num_a))
for irow, row in enumerate(x_stacked):
    rowy, _ = my_own_net(mlp, row, act_name="relu")
    y_stacked[irow] = rowy[:,:num_a]

nrows, ncols = num_a // 5, 5
fig, axes = plt.subplots(nrows, ncols)
fig.set_size_inches(8, nrows)
extent = [x_min, x_max, y_min, y_max]
for count in range(num_a):
    i, j = count // ncols, count % ncols
    # origin="lower": row 0 of the grid corresponds to y_min, so draw it at the bottom
    axes[i,j].imshow(y_stacked[:,:,count], cmap=cm, extent=extent, origin="lower")
plt.tight_layout();

## Local minima

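The error is a non-convex function of the weights, so training can end up in different local minima depending on the initial conditions.

Part of the cause of local minima is the **saturation of the sigmoid functions** $\sigma\left(\sum_j w_{ij} x_j\right)$. When $|w_{ij}|$ becomes large, the unit saturates and any change in the weight hardly affects the output, implying $\nabla_{ij}E\approx 0$.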

You can partly prevent this from happening by:
* Choosing **tanh** instead of sigmoid transfer functions and scaling inputs and outputs to have mean zero and standard deviation one.
* Choosing **ReLU** activation functions.
* **Proper initialization** of $w_{ij}$ with mean zero and a standard deviation of order $1/\sqrt{n_1}$, where $n_1$ is the number of inputs to neuron $i$.
* Adding a **regularizer** such as $\sum_{i} w_i^2$ to the cost function to keep weights small.
* Using techniques like dropout, layer normalization, adding noise, etc.
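
Two of the points above can be illustrated numerically. The sketch below first prints the sigmoid derivative $\sigma'(a)=\sigma(a)(1-\sigma(a))$ for growing pre-activations, then compares the average slope $h'(a)$ of tanh units when the weights are drawn with standard deviation $1$ and with standard deviation $1/\sqrt{n}$. The sizes, the Gaussian inputs and the helper name `mean_tanh_slope` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# derivative of the sigmoid at growing pre-activations: it vanishes when the unit saturates
for a in (0.0, 2.0, 5.0, 10.0):
    s = sigmoid(a)
    print(f"a = {a:5.1f}   sigma'(a) = {s * (1 - s):.2e}")

# effect of the initial weight scale on tanh units receiving n_in standardized inputs
rng = np.random.default_rng(0)
n_in, n_hidden, n_patterns = 400, 100, 200
x = rng.normal(size=(n_patterns, n_in))

def mean_tanh_slope(weight_std):
    W = rng.normal(scale=weight_std, size=(n_hidden, n_in))
    a = x @ W.T                              # pre-activations, shape (n_patterns, n_hidden)
    return np.mean(1.0 - np.tanh(a) ** 2)    # average of h'(a) over units and patterns

slope_unit = mean_tanh_slope(1.0)                      # pre-activations of order sqrt(n_in)
slope_scaled = mean_tanh_slope(1.0 / np.sqrt(n_in))    # pre-activations of order 1
print(slope_unit, slope_scaled)
```

With unit-variance weights most units start out saturated, so the gradients are small from the first step; the $1/\sqrt{n}$ scaling keeps the pre-activations of order one.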

In [None]:
# train networks starting from different initial conditions
num_hidden = 3
num_seeds = 100
print_every = 20
noise = 0.1
regularization = 0.01

# generate input and output
np.random.seed(1)
X_train = np.linspace(-np.pi, np.pi, 10)
y_train = np.sin(X_train) + np.random.randn(len(X_train)) * noise
X_test = np.linspace(-np.pi, np.pi, 100)
y_test = np.sin(X_test)

plt.plot(X_test, y_test, ':')
plt.plot(X_train, y_train, 'o');
plt.xlabel('x')
plt.ylabel('y');

In [None]:
y_test_preds = np.zeros((num_seeds, len(y_test)))
errs = np.zeros(num_seeds)

for seed in range(1, num_seeds+1):
    if seed % print_every == 0:
        print("training with seed", seed)
    mlp = MLPRegressor(hidden_layer_sizes=num_hidden, alpha=regularization, activation="tanh", solver="lbfgs", momentum=0.,
                       max_iter=10000, learning_rate_init=0.01, tol=1e-5, random_state=seed, verbose=False)
    mlp.fit(X_train[:,None], y_train)
    y_test_pred = mlp.predict(X_test[:,None])
    y_test_preds[seed-1] = y_test_pred
    errs[seed-1] = ((y_test - y_test_pred)**2).mean()
print("\n")

# plot
plt.figure(figsize=(12,4))
plt.subplot(121)
plt.plot(X_test, y_test, '.-', c="black", ms=3, label="y test")
lines = plt.plot(X_test, y_test_preds.T, '.-', alpha=0.2, color='gray', ms=3, label="y test pred");
plt.setp(lines[1:], label="_")
plt.legend();
plt.xlabel('x')
plt.ylabel('y')
plt.subplot(122)
plt.hist(errs);
plt.xlabel('err')
plt.ylabel('# err');

## Training an MLP with gradient descent: backpropagation of error #1

We will use the Mean Squared Error (MSE) loss. The total error is the sum of the errors on the individual patterns $\mu$:
$$E(w) = \sum_\mu E_\mu(w)\qquad E_\mu(w)=\frac{1}{2}\sum_k\left(y_k(x^\mu,w)-t_k^\mu\right)^2$$

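Consider a two-layer network with the same activation function $h$ in both layers, omitting the biases to keep the notation light:
$$y_k=h\left(\underbrace{\sum_j w_{kj}^{(2)} z_j^{(1)}}_{a_k^{(2)}}\right)\qquad z_j^{(1)}=h\left(\sum_i w_{ji}^{(1)} x_i\right)$$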

$$\frac{\partial}{\partial w^{(2)}_{kj}}\frac{1}{2}\sum_{k'}(y_{k'}-t_{k'})^2=(y_k-t_k) \frac{\partial y_k}{\partial w^{(2)}_{kj}}={\color{red} \underbrace{ (y_k-t_k) h'(a_k^{(2)}) }_{\delta_k^{(2)}}}z_j^{(1)}={\color{red} \delta^{(2)}_k} z_j^{(1)}$$

## Training an MLP with gradient descent: backpropagation of error #2

Similarly:
$$y_k =h\left(\sum_j w_{kj}^{(2)} h\left(\sum_i w_{ji}^{(1)} z_i^{(0)}\right)\right)$$

where $z_i^{(0)}=x_i$ and $a_j^{(1)}=\sum_i w_{ji}^{(1)} z_i^{(0)}$. Then

$$\frac{\partial}{\partial w^{(1)}_{ji}}\frac{1}{2}\sum_{k}(y_{k}-t_k)^2=\sum_{k=1}^K (y_k-t_k) \frac{\partial y_k}{\partial w^{(1)}_{ji}}$$

$$=\underbrace{\sum_{k=1}^K \underbrace{(y_k-t_k) h'(a^{(2)}_k)}_{\delta_k^{(2)}}w_{kj}^{(2)}h'(a_j^{(1)})}_{\delta_j^{(1)}}z_i^{(0)}= {\color{red} \delta_j^{(1)}} z_i^{(0)}$$

with

$$\delta_j^{(1)}=h'(a_j^{(1)})\sum_{k=1}^K \delta_k^{(2)} w^{(2)}_{kj}$$
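
The two gradient formulas can be checked against a finite-difference estimate. The following is a minimal sketch for a single pattern, assuming $h=\tanh$ in both layers and no biases, as in the derivation; the sizes, the random values and the function names are arbitrary.

```python
import numpy as np

h = np.tanh

def h_prime(a):
    return 1.0 - np.tanh(a) ** 2

def error(W1, W2, x, t):
    y = h(W2 @ h(W1 @ x))
    return 0.5 * np.sum((y - t) ** 2)

def backprop_two_layers(W1, W2, x, t):
    a1 = W1 @ x                                # hidden pre-activations a_j^(1)
    z1 = h(a1)                                 # hidden activities z_j^(1)
    a2 = W2 @ z1                               # output pre-activations a_k^(2)
    y = h(a2)
    delta2 = (y - t) * h_prime(a2)             # delta_k^(2)
    delta1 = h_prime(a1) * (W2.T @ delta2)     # delta_j^(1) = h'(a_j) sum_k delta_k w_kj
    return np.outer(delta1, x), np.outer(delta2, z1)   # dE/dW1, dE/dW2

rng = np.random.default_rng(0)
N, M, K = 3, 4, 2
W1, W2 = rng.normal(size=(M, N)), rng.normal(size=(K, M))
x, t = rng.normal(size=N), rng.normal(size=K)
grad_W1, grad_W2 = backprop_two_layers(W1, W2, x, t)

# central finite difference for one first-layer weight
eps = 1e-6
j, i = 1, 2
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[j, i] += eps
W1_minus[j, i] -= eps
numeric = (error(W1_plus, W2, x, t) - error(W1_minus, W2, x, t)) / (2 * eps)
print(grad_W1[j, i], numeric)   # the two values should agree to several digits
```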

## Training an MLP with gradient descent: backpropagation of error #3

Backpropagation extends to $T$ layers. For each pattern $\mu$:
1.  Compute forward activities $a_j^{\mu,(t)}=\sum_i w_{ji}^{(t)}z_i^{\mu,(t-1)}$ and $z_j^{\mu,(t)}=h\left(a_j^{\mu,(t)}\right), t=1,\ldots, T$.

    With $z_i^{\mu,(0)}=x_i^\mu$.
    
2.  Compute the errors $\delta_j^{\mu,(t-1)}=h'(a_j^{\mu,(t-1)})\sum_k w_{kj}^{(t)} \delta_k^{\mu,(t)}, t=T,\ldots,2$.

    With $\delta_k^{\mu,(T)}=(y^\mu_k-t^\mu_k)h'(a_k^{\mu,(T)})$.

3.  Compute the gradient for pattern $\mu$: $\frac{\partial E_\mu}{\partial w^{(t)}_{ji}} = \delta_j^{\mu,(t)} z^{\mu,(t-1)}_i$, $t=1,\ldots,T$.

The gradient is $\frac{\partial E}{ \partial w^{(t)}_{ji}}=\sum_\mu \frac{\partial E_\mu}{\partial w^{(t)}_{ji} }$.
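
The three steps can be sketched for an arbitrary number of layers, processing all patterns at once as the rows of a matrix. This is an illustrative implementation under the same assumptions as the formulas (no biases, the same activation $h$ in every layer, here $\tanh$); the names `backprop` and `total_error` and the layer sizes are hypothetical.

```python
import numpy as np

def backprop(weights, X, targets, h=np.tanh, h_prime=lambda a: 1.0 - np.tanh(a) ** 2):
    """Gradient of E = sum_mu E_mu; weights[t] maps layer t to layer t+1, shape (n_out, n_in)."""
    # step 1: forward pass, storing pre-activations and activities of every layer
    zs, pre = [X], []
    for W in weights:
        pre.append(zs[-1] @ W.T)
        zs.append(h(pre[-1]))
    # step 2: errors at the output layer, then propagated backwards
    delta = (zs[-1] - targets) * h_prime(pre[-1])
    grads = [None] * len(weights)
    for t in range(len(weights) - 1, -1, -1):
        # step 3: gradient, summed over patterns by the matrix product
        grads[t] = delta.T @ zs[t]
        if t > 0:
            delta = h_prime(pre[t - 1]) * (delta @ weights[t])
    return grads

def total_error(weights, X, targets, h=np.tanh):
    z = X
    for W in weights:
        z = h(z @ W.T)
    return 0.5 * np.sum((z - targets) ** 2)

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]    # input, two hidden layers, output (arbitrary)
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
X, targets = rng.normal(size=(7, 3)), rng.normal(size=(7, 2))
grads = backprop(weights, X, targets)

# finite-difference check on one weight of the first layer
eps = 1e-6
w_plus = [W.copy() for W in weights]
w_minus = [W.copy() for W in weights]
w_plus[0][2, 1] += eps
w_minus[0][2, 1] -= eps
numeric = (total_error(w_plus, X, targets) - total_error(w_minus, X, targets)) / (2 * eps)
print(grads[0][2, 1], numeric)
```

Storing the forward activities is what makes the backward pass cheap: each layer's error is obtained from the next one with a single matrix product.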

# <center>Assignments</center>

#### Ex 4.4

Write your own version of a one-hidden-layer neural network trained with MSE as in the section **Local minima** and train it on the same dataset using gradient descent. Experiment with both $\tanh$ and ReLU activation functions.

Report the training and test error over training and visualize the final regression performance for three values of `num_hidden`: 3, 10, 50.