<img src=docs/tudelft_logo.jpg width=50%>

## LIGHTEN Project: Training School #2

## Practical Session 1: Introduction to Gaussian Processes and Artificial Neural Networks

### Miguel A. Bessa | <a href="mailto:M.A.Bessa@tudelft.nl">M.A.Bessa@tudelft.nl</a>  | Associate Professor

**What:** A simple tutorial on Supervised Regression for the Training School of the EU project LIGHTEN

**Where:** This notebook comes from this [repository](https://github.com/mabessa/Intro2ML/LIGHTEN)

**References/resources to create this notebook:**
* This simple tutorial is still based on a script I created for this article: https://imechanica.org/node/23957
* It follows from some examples provided by the scikit-learn user guide, which seem to have originated from Mathieu Blondel, Jake Vanderplas, Vincent Dubourg, and Jan Hendrik Metzen.
* [Black box figure](https://openclipart.org/download/218057/black_boxes.svg)
* For artificial neural networks, using materials from Andreas Mueller (Lecture 22): https://github.com/amueller/COMS4995-s19

Apologies in advance if I missed some reference used in this notebook. Please contact me if that is the case, and I will gladly include it here.

## **OPTION 1**. Run this notebook **locally on your computer**:
1. Install miniconda3 [here](https://docs.conda.io/en/latest/miniconda.html)
2. Open a command window and create a virtual environment called "3dasm":
```
conda create -n 3dasm python=3 numpy scipy jupyter nb_conda matplotlib pandas scikit-learn rise tensorflow -c conda-forge
```
3. Install [git](https://github.com/git-guides/install-git), open command window & clone the repository to your computer:
```
git clone https://github.com/mabessa/Intro2ML
```
4. Load jupyter notebook by typing in (anaconda) command window (it will open in your internet browser):
```
conda activate 3dasm
jupyter notebook
```
5. Open notebook (Intro2ML/LIGHTEN/TS2_Session1.ipynb)

## **OPTION 2**. Use **Google's Colab** (no installation required, but times out if idle):

1. Go to https://colab.research.google.com
2. Log in
3. File > Open notebook
4. Click on GitHub (no need to log in or authorize anything)
5. Paste the git link: https://github.com/mabessa/Intro2ML
6. Click search and then click on the notebook for this Lecture (Intro2ML/LIGHTEN/TS2_Session1.ipynb).

In [None]:
# Basic plotting tools needed in Python.

import matplotlib.pyplot as plt # import plotting tools to create figures
import numpy as np # import numpy to handle a lot of things!
from IPython.display import display, Math # to print with Latex math

%config InlineBackend.figure_format = "retina" # render higher resolution images in the notebook
plt.style.use("seaborn-v0_8" if "seaborn-v0_8" in plt.style.available else "seaborn") # seaborn-like style (renamed in matplotlib >= 3.6)
plt.rcParams["figure.figsize"] = (8,4) # rescale figure size appropriately for slides

## Outline for today

* Introductory tutorial on Gaussian Processes (no theory today!)
    - Using Scikit-learn for Gaussian Process Regression (noiseless and noisy datasets)
* Introductory tutorial on Artificial Neural Networks (no theory today!)
    - Using Keras for regression with Artificial Neural Networks (noiseless and noisy datasets)

The goal is for you to be able to use these models as *black boxes*.

* This session focuses on how to train:
    - **Gaussian Processes** using [scikit-learn](https://scikit-learn.org)
    - **Artificial Neural Networks** (ANNs) using [keras](https://keras.io/) and [tensorflow](https://www.tensorflow.org/)

Is it a good idea to use machine learning models without understanding them?

* No.

But I am not talented enough to teach Machine Learning in 1.5 hours!

So, this session is like a *teaser*. Hopefully it motivates you to do a Machine Learning course!

* The course I teach about machine learning is in the following repo: https://github.com/bessagroup/3dasm_course

So, let's consider supervised Machine Learning (ML) models for regression as black boxes.

<img src="docs/black_box.png" title="Machine learning as a black box" width="50%">

* Each ML model has its own *expressivity* that depends on a set of parameters called **hyperparameters**.

* Hyperparameters are parameters that you define before you start training the model using data (observations for your problem of interest)

* If you understand the ML model you are using, then its hyperparameters can be meaningful to you. In that case, you *might* be able to choose reasonable hyperparameter values.

## 1. Short tutorial on 1D regression with Gaussian Processes

Gaussian Processes are a simple-to-use but very powerful ML model.

They are simple because they only require the definition of two hyperparameters:

1. The **kernel function** $k(x_i,x_j)$, which can be many different kinds of functions (with some special properties).


2. And the noise level $\sigma_i^2$ which represents uncertainty (variance) of your data at each output point $y_i$. If the function you want to approximate is noiseless, i.e. if the measurements are exact, then you can use a small value for this parameter (e.g. $1\times 10^{-10}$). 

The kernel function is important because it assigns specific properties to your approximation of the data.

* An example of a kernel function is the RBF: $k(x_i,x_j) = {\color{red}\eta}^2\exp{\left(-\frac{||x_i-x_j||^2}{2{\color{red}\lambda}^2}\right)}$


* Every kernel function has a set of unknown parameters (in red) that will be learned by **training on the data**.
    * For example, the RBF kernel has 2 parameters.
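To make the RBF formula concrete, here is a minimal NumPy sketch that evaluates it for 1D inputs. The function name `rbf_kernel` and the argument names `eta` and `lam` are chosen only for this illustration (they mirror $\eta$ and $\lambda$ above); this is not the scikit-learn implementation.

```python
import numpy as np

def rbf_kernel(x_i, x_j, eta=1.0, lam=10.0):
    """Evaluate k(x_i, x_j) = eta^2 * exp(-|x_i - x_j|^2 / (2 * lam^2)) for all pairs of 1D points."""
    x_i = np.asarray(x_i, dtype=float).reshape(-1, 1)  # column vector
    x_j = np.asarray(x_j, dtype=float).reshape(1, -1)  # row vector
    sq_dist = (x_i - x_j) ** 2  # pairwise squared distances via broadcasting
    return eta**2 * np.exp(-sq_dist / (2.0 * lam**2))

x_demo = np.array([0.0, 1.0, 5.0])
K_demo = rbf_kernel(x_demo, x_demo, eta=1.0, lam=1.0)
print(K_demo)  # 3x3 symmetric matrix with eta^2 on the diagonal
```

Points that are close to each other get a kernel value near $\eta^2$, while distant points get a value near zero. The length scale $\lambda$ controls how quickly that decay happens.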
    
Here are two resources that are very useful to understand the role of the kernel function:

* Kernel cookbook: https://www.cs.toronto.edu/~duvenaud/cookbook/
* Visualizing different kernel functions: https://distill.pub/2019/visual-exploration-gaussian-processes/#MultipleKernels

### Recap: Viewing a Gaussian Process as a black box

<img src="docs/black_box.png" title="Machine learning as a black box" width="50%">

* Where the hyperparameters for the Gaussian Process method are:

    * Kernel function, for example the RBF: $k(x_i,x_j) = {\color{red}\eta}^2\exp{\left(-\frac{||x_i-x_j||^2}{2{\color{red}\lambda}^2}\right)}$
    
    * Noise at each data point: $\sigma_i^2$

* The <span style="color:red">**key concept**</span> is that the prediction of the mean and variance of a new point $y^*$ depends on the values of the parameters of the kernel function (which are **UNKNOWN**).


* However, despite the fact that we don't know the parameters of the kernel function, they can be obtained by **Bayesian inference**.

Bayesian inference is possible by using Bayes' rule to find the **posterior** information. This involves marginalization and conditioning.

Here we don't cover this. Instead, we just let the code do it for us!
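For readers who want a peek inside the box, the conditioning step can be sketched in a few lines of NumPy. This is only a rough sketch under strong assumptions: the kernel parameters are fixed by hand ($\eta=1$, $\lambda=1$) instead of being learned, the prior mean is zero, and the helper names `rbf` and `gp_posterior` are made up for this illustration. It is not meant to reproduce what scikit-learn does internally.

```python
import numpy as np

def rbf(a, b, eta=1.0, lam=1.0):
    """RBF kernel matrix between two sets of 1D points."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return eta**2 * np.exp(-((a - b) ** 2) / (2.0 * lam**2))

def gp_posterior(x_obs, y_obs, x_new, noise_var=1e-10, eta=1.0, lam=1.0):
    """Posterior mean and standard deviation at x_new for a zero-mean GP with fixed kernel parameters."""
    K = rbf(x_obs, x_obs, eta, lam) + noise_var * np.eye(len(x_obs))  # covariance of the observations
    K_s = rbf(x_obs, x_new, eta, lam)   # covariance between observations and new points
    K_ss = rbf(x_new, x_new, eta, lam)  # prior covariance of the new points
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # clip tiny negative round-off values
    return mean, std

x_obs = np.array([1.0, 3.0, 5.0, 6.0, 8.0])
y_obs = x_obs * np.sin(x_obs)
mean_obs, std_obs = gp_posterior(x_obs, y_obs, x_obs)             # predict at the observed points
mean_far, std_far = gp_posterior(x_obs, y_obs, np.array([20.0]))  # predict far away from the data
print(mean_obs, std_obs)
print(mean_far, std_far)
```

At the observed points the (nearly noiseless) posterior reproduces the data with almost zero uncertainty. Far from the data, the prediction falls back to the prior: mean close to zero and standard deviation close to $\eta$.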

Example for Gaussian Process regression of one-dimensional datasets.

Let's start with a **noiseless** case.

In [None]:
from sklearn.model_selection import train_test_split

# Function to "learn"
def f(x):
    return x * np.sin(x)

n_data = 50 # number of points in our dataset
testset_ratio = 0.90 # ratio of test set points from the dataset
x_data = np.linspace(0, 10, n_data) # uniformly spaced points
y_data = f(x_data) # function values at x_data

X_data = np.reshape(x_data,(-1,1)) # a 2D array that scikit-learn likes

seed = 1987 # set a random seed so that everyone gets the same result
np.random.seed(seed)

# Let's split into 10% training points and the rest for testing:
X_train, X_test, y_train, y_test = train_test_split(X_data,
                                    y_data, test_size=testset_ratio,
                                    random_state=seed)

x_train = X_train.ravel() # just for plotting later
x_test = X_test.ravel() # just for plotting later

print("Here's a print of X_train:\n", X_train)

### Gaussian Process Regression (GPR) for noiseless datasets

Let's use the RBF kernel for our predictions:

$$
k(x_i,x_j) = {\color{red}\eta}^2\exp{\left(-\frac{||x_i-x_j||^2}{2{\color{red}\lambda}^2}\right)}
$$

with an initial guess for the parameters as: $\eta = 1$ and $\lambda = 10$.

In [None]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, ExpSineSquared, ConstantKernel

# Define the kernel function
kernel = ConstantKernel(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2)) # This is the standard RBF kernel
#kernel = 1.0 * RBF(10, (1e-2, 1e2)) # Same kernel as above (scikit-learn assumes constant variance if you just
                                     # write RBF without the constant kernel or without multiplying by 1.0)

# Other examples of kernels:
#kernel = ExpSineSquared(length_scale=3.0, periodicity=3.14,
#                       length_scale_bounds=(0.1, 10.0),
#                       periodicity_bounds=(0.1, 10)) * RBF(3.0, (1e-2, 1e2))
#kernel = Matern(length_scale=1.0, length_scale_bounds=(1e-2, 1e2),nu=1.5)
                
gp_model = GaussianProcessRegressor(kernel=kernel, alpha=1e-10, n_restarts_optimizer=20) # using a small alpha

# Fit to data to determine parameters
gp_model.fit(X_train, y_train)

# Make the prediction on the entire dataset (for plotting)
y_data_pred, sigma_data = gp_model.predict(X_data, return_std=True) # also output the uncertainty (std)

# Predict for test set (for error metric)
y_pred, sigma = gp_model.predict(X_test, return_std=True) # also output the uncertainty (std)

In [None]:
# Plot the function, the prediction and the 95% confidence interval
fig1, ax1 = plt.subplots()

ax1.plot(x_data, y_data, 'r:', label=r'ground truth: $f(x) = x\,\sin(x)$') # function to learn (raw string avoids an invalid escape warning)

ax1.plot(x_data, y_data_pred, 'b-', label="GPR prediction")
ax1.fill(np.concatenate([x_data, x_data[::-1]]),
         np.concatenate([y_data_pred - 1.9600 * sigma_data,
                        (y_data_pred + 1.9600 * sigma_data)[::-1]]),
         alpha=.5, fc='b', ec='None', label='95% confidence interval')

ax1.plot(x_train, y_train, 'ro', markersize=6, label="training points") # noiseless data
ax1.plot(x_test, y_test, 'kX', markersize=6, label="testing points") # Plot test points

ax1.set_xlabel('$x$', fontsize=20)
ax1.set_ylabel('$f(x)$', fontsize=20)
ax1.set_title("Posterior kernel: %s"
              % gp_model.kernel_, fontsize=20) # Show in the title the value of the hyperparameters
ax1.set_ylim(-10, 15) # just to provide more space for the legend
ax1.legend(loc='upper left', fontsize=15)
fig1.set_size_inches(8,8)
plt.close(fig1) # close the plot to see it in next cell

In [None]:
fig1 # plot figure.

### Example 1: Compare with the approximation of a polynomial of degree 4

Let's fit a polynomial of degree 4 and compute the error metrics for that model as well as the above mentioned Gaussian process.
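As a rough picture of what a polynomial least-squares fit does, the sketch below builds the matrix of powers $1, x, \dots, x^4$ by hand and solves the least-squares problem with NumPy. The five sample points are arbitrary values picked for this illustration (they are not the training points of the notebook), and the variable names are hypothetical.

```python
import numpy as np

x_fit = np.array([0.0, 2.5, 5.0, 7.5, 10.0])  # five illustrative points
y_fit = x_fit * np.sin(x_fit)                  # same function as in this notebook
degree_demo = 4

# Each row holds the polynomial features [1, x, x^2, x^3, x^4] of one point
V = np.vander(x_fit, degree_demo + 1, increasing=True)

# Least-squares solution for the polynomial coefficients
coeffs, *_ = np.linalg.lstsq(V, y_fit, rcond=None)
y_fit_pred = V @ coeffs
print(coeffs)
```

With five distinct points and a polynomial of degree 4 there are as many coefficients as data points, so the fitted polynomial passes through every training point. That says nothing about how it behaves in between them.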

In [None]:
# We start by importing the polynomial predictor from scikit-learn
from sklearn.preprocessing import PolynomialFeatures # For Polynomial fit
from sklearn.linear_model import LinearRegression # For Least Squares
from sklearn.pipeline import make_pipeline # to link different objects

degree = 4 # degree of polynomial we want to fit
poly_model = make_pipeline(PolynomialFeatures(degree),LinearRegression())
poly_model.fit(X_train,y_train) # fit the polynomial to our 5 points in X_train which is a 2D array  
y_poly_pred = poly_model.predict(X_test) # prediction of our polynomial

In [None]:
# Plot the function and the polynomial prediction
fig2, ax2 = plt.subplots()

ax2.plot(x_data, y_data, 'r:', label=r'ground truth: $f(x) = x\,\sin(x)$') # function to learn (raw string avoids an invalid escape warning)

y_poly_data_pred = poly_model.predict(X_data) # prediction of our polynomial
ax2.plot(x_data, y_poly_data_pred, 'b-', label="Polynomial prediction")

ax2.plot(x_train, y_train, 'ro', markersize=6, label="training points") # noiseless data
ax2.plot(x_test, y_test, 'kX', markersize=6, label="testing points") # Plot test points

ax2.set_xlabel('$x$', fontsize=20)
ax2.set_ylabel('$f(x)$', fontsize=20)
ax2.set_title("Polynomial approximation", fontsize=20)
ax2.set_ylim(-10, 15) # just to provide more space for the legend
ax2.legend(loc='upper left', fontsize=15)
fig2.set_size_inches(8,8)
plt.close(fig2) # close the plot to see it in next cell

In [None]:
fig2

Gaussian Process Regression clearly approximates the function much better! And predicts the uncertainty!

In practice, assessing the quality of approximations should be done with **error metrics**.

* Scikit-learn has very useful error metrics for regression problems!
    * The most common ones are the **mean squared error** ($\text{MSE}$) and the **R-squared** score ($R^2$, also called the coefficient of determination)
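For reference, both metrics are short formulas: $\text{MSE} = \frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$ and $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below computes them by hand on made-up numbers. The function names `mse_by_hand` and `r2_by_hand` are hypothetical and exist only for this illustration.

```python
import numpy as np

def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r2_by_hand(y_true, y_pred):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.0, 2.0, 3.0, 5.0])  # only the last prediction is off (by 1)
print(mse_by_hand(y_true_demo, y_pred_demo))  # 0.25
print(r2_by_hand(y_true_demo, y_pred_demo))   # 0.8
```

A lower MSE is better, while $R^2$ is at most 1 (perfect prediction). A model that always predicts the mean of the targets gets $R^2 = 0$.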

In [None]:
from sklearn.metrics import mean_squared_error, r2_score # Import error metrics
# Compute MSE and R2 for the GP model
gp_mse_value = mean_squared_error(y_test, y_pred)
gp_r2_value = r2_score(y_test, y_pred)
print('MSE for GPR = ', gp_mse_value)
print('R2 score for GPR = ', gp_r2_value)

# Compute MSE and R2 for the polynomial model
poly_mse_value = mean_squared_error(y_test, y_poly_pred)
poly_r2_value = r2_score(y_test, y_poly_pred)
print('MSE for polynomial = ', poly_mse_value)
print('R2 score for polynomial = ', poly_r2_value)

Clearly, the GPR approximation is much better!

### Example 2: Redo the Gaussian Process approximation but using the Matern kernel

Of course, the **choice of kernel** used in Gaussian Process Regression (GPR) affects the quality of the prediction...

* Let's go back to the code where we trained the Gaussian process, but now let's use a different kernel: the Matern kernel. Re-run the cells and compare the error metrics.

You probably found that the GPR prediction is still much better than the polynomial approximation, but not as good as the approximation obtained with the RBF kernel.

* Can you hypothesize why that happened?

### Gaussian Process regression for noisy datasets

Let's recreate the noisy dataset from $f(x)=x\sin{x}$, as we did in Lecture 9:

In [None]:
# Now let's also create the noisy dataset:
random_std = 0.5 + 1.0 * np.random.random(y_data.shape) # np.random.random returns random number between [0.0, 1.0)
noise = np.random.normal(0, random_std) # sample vector from Gaussians with random standard deviation
y_noisy_data = y_data + noise # Perturb every y_data point with Gaussian noise

# Pair up points with their associated noise level, i.e. the standard deviation (because of train_test_split):
Y_noisy_data = np.column_stack((y_noisy_data,random_std))

# Split into 10% training points and the rest for testing:
X_train, X_test, Y_noisy_train, Y_noisy_test = train_test_split(X_data,
                                    Y_noisy_data, test_size=testset_ratio,
                                    random_state=seed) # "noisy_train" is a great name for a variable, hein?
# NOTE: since we are using the same seed and we do train_test_split on the same X_data and y_noisy_data is
#       just y_data + noise, we are splitting the dataset exactly in the same way! This is nice because we
#       want to keep the comparison as fair as possible.

# Finally, for plotting purposes, let's convert the 2D arrays into 1D arrays (vectors):
x_train = X_train.ravel()
x_test = X_test.ravel()
y_noisy_train = Y_noisy_train[:,0]
noise_train = Y_noisy_train[:,1]
y_noisy_test = Y_noisy_test[:,0]
noise_test = Y_noisy_test[:,1]

print("Note that X_train and X_test are the same data that we used for the noiseless case.")
print("Here's a print of X_train:\n", X_train)

In [None]:
# Instantiate a Gaussian Process model
kernel = ConstantKernel(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))

# When fitting noisy data, if we have access to the uncertainty at the training points (usually we don't!), then
# we can pass the corresponding noise level (as a variance) through the alpha parameter
gp_model = GaussianProcessRegressor(kernel=kernel, alpha=noise_train**2, n_restarts_optimizer=5)

# Fit to data to determine the parameters of the model
gp_model.fit(X_train, y_noisy_train)

# Make the predictions
y_noisy_pred, sigma_noisy = gp_model.predict(X_test, return_std=True) # predictions including uncertainty (std)
y_noisy_data_pred, sigma_noisy_data = gp_model.predict(X_data, return_std=True) # for plotting

# Plot the function, the prediction and the 95% confidence interval
fig3, ax3 = plt.subplots() # This opens a new figure

ax3.plot(x_data, f(x_data), 'r:', label=r'ground truth: $f(x) = x\,\sin(x)$') # function to learn (raw string avoids an invalid escape warning)
ax3.errorbar(x_train, y_noisy_train, noise_train, fmt='ro', markersize=6, label=u'training points inc. uncertainty')
ax3.errorbar(x_test, y_noisy_test, noise_test, fmt='kX', markersize=6, label=u'testing points inc. uncertainty')

ax3.plot(x_data, y_noisy_data_pred, 'b-', label="GPR prediction")
ax3.fill(np.concatenate([x_data, x_data[::-1]]),
         np.concatenate([y_noisy_data_pred - 1.9600 * sigma_noisy_data,
                        (y_noisy_data_pred + 1.9600 * sigma_noisy_data)[::-1]]),
         alpha=.5, fc='b', ec='None', label='95% confidence interval')
ax3.set_xlabel('$x$', fontsize=20)
ax3.set_ylabel('$f(x)$', fontsize=20)
ax3.set_ylim(-10, 15) # just to provide more space for the legend
ax3.legend(loc='upper left', fontsize=15)
fig3.set_size_inches(8,8)
plt.close(fig3)

In [None]:
fig3 # plot figure.

### Exercise

Fit a polynomial of degree 4 (like we did last class) and compute the error metrics for that model as well as the above mentioned Gaussian process.

In [None]:
# Exercise.

# until here.

Well done...

Consider playing a bit with the notebook, using higher degrees for the polynomial approximation and a different number of training points.

## 2. Short tutorial on 1D regression with Artificial Neural Networks

Artificial Neural Networks (ANNs) have a lot more hyperparameters than Gaussian Processes...

<img src="docs/nn_basic_arch.png" title="Simple ANN" width="50%" align="right">

Each circle represents a node (called a neuron).

Input layer: number of neurons equals number of features of our data

Hidden layer: $ h(x) = f(W_1x+b_1) $

Output layer: $ o(x) = g(W_2h(x) + b_2) $
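The two formulas above can be evaluated directly. Below is a minimal NumPy sketch of a forward pass through a network with one hidden layer, assuming a ReLU for $f$ and a linear output for $g$ (as used for regression). The weights are arbitrary numbers picked by hand for this illustration, not trained values.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z) applied element-wise."""
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    """h(x) = f(W1 x + b1) with f = ReLU, then o(x) = W2 h(x) + b2 (linear output)."""
    h = relu(W1 @ x + b1)
    return W2 @ h + b2

# One input feature, three hidden neurons, one output (hand-picked weights)
W1_demo = np.array([[1.0], [-1.0], [2.0]])
b1_demo = np.array([0.0, 0.0, -1.0])
W2_demo = np.array([[1.0, 1.0, 1.0]])
b2_demo = np.array([0.5])

o_demo = forward(np.array([2.0]), W1_demo, b1_demo, W2_demo, b2_demo)
print(o_demo)  # hidden layer gives [2, 0, 3], so the output is 2 + 0 + 3 + 0.5 = 5.5
```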

<img src="docs/nn_manylayers.png" title="ANN with two hidden layers" width="50%" align="right">

This is called a multilayer perceptron or fully-connected feed-forward neural network.

* Neurons in hidden layers usually share the same non-linear activation function (ReLU is popular), but the weights are different for every neuron.

* Many layers $\rightarrow$ “deep learning”.

* In theory: more hidden layers $\rightarrow$ more complex functions and feature representation. But there's more to this story...

* For regression each output neuron corresponds to a single output variable and the last layer uses a linear activation function.

We will look into this model carefully in a few lectures.

For now, I just want to draw a schematic so that you understand the number of parameters that starts appearing!

* Draw on the board a feedforward ANN with 2 hidden layers for 1D case.
    * First hidden layer with 3 neurons and second hidden layer with 2 neurons.
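To get a feeling for how quickly the number of parameters grows, the small sketch below counts weights and biases of a fully-connected network from its layer sizes. The helper `count_parameters` is a hypothetical name used only here. Each dense layer contributes (inputs x outputs) weights plus one bias per output neuron.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully-connected network given [inputs, hidden..., outputs]."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix entries + bias vector entries
    return total

n_small = count_parameters([1, 3, 2, 1])     # the network described above
n_large = count_parameters([1, 200, 10, 1])  # a wider network with the same depth
print(n_small)  # (1*3+3) + (3*2+2) + (2*1+1) = 17
print(n_large)  # (1*200+200) + (200*10+10) + (10*1+1) = 2421
```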

Two examples of these nonlinear functions.


<div>
<img style="float: left;" src="docs/nonlin_fn.png" width="500px"></div>

* ReLU is the most common choice nowadays (easier optimization)

* tanh has small gradients in most places (harder to optimize)
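The remark about gradients can be checked numerically. The sketch below evaluates the derivatives of both activation functions at a few arbitrary points. The function names are chosen for this illustration only.

```python
import numpy as np

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

z_demo = np.array([-5.0, -1.0, 0.5, 5.0])
print(relu_grad(z_demo))  # [0. 0. 1. 1.]
print(tanh_grad(z_demo))  # close to zero at -5 and 5, where tanh saturates
```

For inputs of large magnitude the tanh derivative is nearly zero, so very little gradient signal flows back through such a neuron. The ReLU derivative stays at 1 for any positive input.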

## Quick overview about neural networks

* Non-linear regression models

* Powerful for very large datasets

* Non-convex optimization

* Notoriously slow to train (state of the art models take days or even weeks to train, often on multiple GPUs)

* Important to scale and transform the data properly (preprocessing)

* MANY variants (Convolutional nets, Recurrent neural networks, variational autoencoders, generative adversarial networks, deep reinforcement learning, ...)

## Training Objective

$ h(x) = f(W_1x+b_1) $
$ o(x) = g(W_2h(x)+b_2) = g(W_2f(W_1x + b_1) + b_2)$

$ \min_{W_1,W_2,b_1,b_2} \sum\limits_{i=1}^N l(y_i,o(x_i)) $

$ =\min_{W_1,W_2,b_1,b_2} \sum\limits_{i=1}^N l(y_i,g(W_2f(W_1x_i+b_1)+b_2))$

- $l$ is the MSE (or squared loss function) for regression.

## Backpropagation

* Need $ \frac{\partial l(y, o)}{\partial W_i} $ and $\frac{\partial l(y, o)}{\partial b_i}$


* Example for network with one hidden layer where $ \text{net}(x) := W_1x + b_1 $

<img src="docs/backprop_eqn.png" title="Backpropagation equations for one hidden layer" width="50%">
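A hand-written version of these derivatives may help to see that backpropagation is just the chain rule. The sketch below assumes one hidden layer with a tanh activation, a linear output and the squared loss $l = (o - y)^2$ for a single data point. These are choices made for this illustration (tanh is used because it is smooth, which makes the numerical check at the end reliable), and all names are hypothetical.

```python
import numpy as np

def loss_and_grads(x, y, W1, b1, W2, b2):
    """Forward pass, squared loss and its gradients via the chain rule."""
    net = W1 @ x + b1            # net(x) = W1 x + b1
    h = np.tanh(net)             # hidden layer output
    o = W2 @ h + b2              # linear output layer
    loss = float(np.sum((o - y) ** 2))
    d_o = 2.0 * (o - y)          # dl/do
    grad_W2 = np.outer(d_o, h)   # dl/dW2
    grad_b2 = d_o                # dl/db2
    d_h = W2.T @ d_o             # propagate back to the hidden layer
    d_net = d_h * (1.0 - h**2)   # multiply by the tanh derivative
    grad_W1 = np.outer(d_net, x) # dl/dW1
    grad_b1 = d_net              # dl/db1
    return loss, grad_W1, grad_b1, grad_W2, grad_b2

rng = np.random.default_rng(0)
x_bp, y_bp = np.array([0.7]), np.array([0.3])
W1_bp, b1_bp = rng.normal(size=(3, 1)), rng.normal(size=3)
W2_bp, b2_bp = rng.normal(size=(1, 3)), rng.normal(size=1)

loss_bp, gW1, gb1, gW2, gb2 = loss_and_grads(x_bp, y_bp, W1_bp, b1_bp, W2_bp, b2_bp)

# Numerical check of one entry of dl/dW1 with a central finite difference
step = 1e-6
W1_plus, W1_minus = W1_bp.copy(), W1_bp.copy()
W1_plus[1, 0] += step
W1_minus[1, 0] -= step
fd_estimate = (loss_and_grads(x_bp, y_bp, W1_plus, b1_bp, W2_bp, b2_bp)[0]
               - loss_and_grads(x_bp, y_bp, W1_minus, b1_bp, W2_bp, b2_bp)[0]) / (2 * step)
print(gW1[1, 0], fd_estimate)  # the two numbers should agree closely
```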

## Optimizing W, b

* Batch

    $ W_i \leftarrow W_i - \eta\sum\limits_{j=1}^N \frac{\partial l(x_j,y_j)}{\partial W_i} $


* Online/Stochastic

    $ W_i \leftarrow W_i - \eta\frac{\partial l(x_j,y_j)}{\partial W_i}$


* Minibatch

    $ W_i \leftarrow W_i - \eta\sum\limits_{j=k}^{k+m} \frac{\partial l(x_j,y_j)}{\partial W_i}$

Below you will find some notes about different optimizers.

**A nice resource about optimizers** to understand what is learning rate (step size) and momentum: https://distill.pub/2017/momentum/


To note:

1. Standard gradient descent with constant learning rate (or step size) $\eta$ is slow because we update a weight matrix $W_i$ using the old $W_i$ and taking a gradient step after summing over the whole training set.

    * Using the entire training set for each forward pass through the network means that we need to make predictions for every point (without updating the weights) and then do a backward pass with backpropagation... One pass through the entire training set is what we call an epoch; with full-batch training, that is exactly one forward pass and one backward pass. So, if we do this, a single epoch requires a lot of matrix multiplications for just one gradient step.
    
    
2. To speed this up we can do a stochastic approximation, i.e. stochastic gradient descent (or online gradient descent). Here, you pick a sample at random, compute the gradient just considering that sample, and then update the parameter. So you update the weights much more often, but you have a much less stable estimate of the gradient. In practice, we often just iterate through the data instead of picking a sample at random. And as with linear models, this is much faster than doing full batches for large datasets.

    * Stochastic gradient descent is less stable (of course!).
    
    
3. A compromise is to consider batch sizes of $k$ samples of the training set (also called mini-batches). For example, we could use 64 points per weight update (one forward and backward pass over the mini-batch); an epoch is then complete once all mini-batches have been processed. In other words: we look at $k=64$ samples, compute the gradients, average them, and update the weights. That allows us to update much more often than looking at the whole dataset, while still having a more stable gradient. This strategy is easy to parallelize in modern CPUs and GPUs and it is very commonly used in practice. The reason why this is faster is basically that doing a matrix-matrix multiplication is faster than doing a bunch of matrix-vector operations.
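The three update strategies differ only in how many samples enter each gradient step. The sketch below illustrates this on a deliberately simple problem, assuming a linear model $wx + b$ fitted to noiseless data from $y = 2x + 1$ (not a neural network). The function `minibatch_gd` and all values in it are made up for this illustration.

```python
import numpy as np

def minibatch_gd(x, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y = w*x + b by gradient descent on the mean squared error of each mini-batch."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n_updates = 0
    for _ in range(epochs):                  # one epoch = one pass over the training set
        order = rng.permutation(len(x))      # shuffle the samples every epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            err = w * x[idx] + b - y[idx]
            w -= lr * 2.0 * np.mean(err * x[idx])  # averaged gradient with respect to w
            b -= lr * 2.0 * np.mean(err)           # averaged gradient with respect to b
            n_updates += 1
    return w, b, n_updates

x_lin = np.linspace(-1.0, 1.0, 64)
y_lin = 2.0 * x_lin + 1.0

w_full, b_full, n_full = minibatch_gd(x_lin, y_lin, batch_size=len(x_lin))  # batch
w_mb, b_mb, n_mb = minibatch_gd(x_lin, y_lin, batch_size=16)                # mini-batch
w_sgd, b_sgd, n_sgd = minibatch_gd(x_lin, y_lin, batch_size=1)              # online/stochastic
print(n_full, n_mb, n_sgd)  # number of weight updates for the same number of epochs
```

For the same number of epochs, smaller batches give more weight updates: 200 for the full batch, 800 for mini-batches of 16, and 12800 for single samples.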


Finally, a short note: we could also use smarter optimization methods, like second-order methods or L-BFGS, but these are often not very effective on these large non-convex problems. One of them, Levenberg-Marquardt, is actually a possibility, but it's not really used these days.

## Learning Heuristics

* Constant learning rate $\eta$ not good


* Better: adaptive $\eta$ for each entry of $W_i$ (large $\eta$ in the beginning and small at the end)


* Common approach: Adam optimizer


* There are many variants of optimizers... Remember: you will often get different solutions depending on how you pick the learning rate because you are solving a (very) non-convex problem. It's nearly impossible to actually find a global optimum. So nearly all of these strategies are heuristics that have simply proven to work well in practice.
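As an illustration of an adaptive step size, here is a minimal sketch of the Adam update rule applied to a single scalar parameter. It assumes a toy objective $(w-3)^2$ and commonly quoted default values for `beta1`, `beta2` and `eps`. The function `adam_minimize` is a hypothetical name, and this is not the Keras implementation.

```python
import math

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Minimize a scalar function with the Adam update rule, given its gradient."""
    w, m, v = float(w0), 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g       # running average of the gradient
        v = beta2 * v + (1.0 - beta2) * g * g   # running average of the squared gradient
        m_hat = m / (1.0 - beta1**t)            # bias corrections for the zero initialization
        v_hat = v / (1.0 - beta2**t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # step scaled by the gradient history
    return w

def grad_quadratic(w):
    """Gradient of the toy objective (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_one_step = adam_minimize(grad_quadratic, w0=0.0, n_steps=1)
w_final = adam_minimize(grad_quadratic, w0=0.0, n_steps=1000)
print(w_one_step)  # about 0.1: the first step has size lr regardless of the gradient magnitude
print(w_final)     # close to the minimizer w = 3
```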

## Complexity Control

These are the main ways to control complexity.

* Decrease number of parameters


* Regularization:

    * L2 & L1 regularization (just like what you did [will do] in the Lab Assignment for Regularized Least Squares)
    * Dropout: randomly deactivate (drop) neurons of the network during training


* Early Stopping:

    * Early stopping means that you compute the loss from a validation set and then you stop when you start to overfit.
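The early stopping logic can be sketched in plain Python. The version below is a simplified illustration with a `patience` and a `min_delta` argument (names borrowed from the Keras callback used later in this notebook), applied to a made-up list of validation losses. It is an assumption-laden sketch and not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the (0-based) epoch at which training stops, given the validation loss per epoch."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:  # validation loss improved
            best = loss
            wait = 0
        else:                        # no improvement in this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1       # never triggered: train until the last epoch

# Made-up validation losses: improving until epoch 3, then getting worse (overfitting)
val_demo = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.9]
stop_epoch = early_stopping_epoch(val_demo, patience=3)
print(stop_epoch)  # 6: three epochs in a row without improving on the best loss (0.5)
```

A larger `patience` tolerates more noisy epochs before stopping, at the cost of training longer after the best validation loss was reached.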

Let's now focus on how to create an ANN model for 1D regression using the following hyperparameters:
1. A feedforward architecture with 2 dense hidden layers
2. The ReLU activation function
3. Adam optimizer

In [None]:
# Let's create a function defining our Artificial Neural Network.
from tensorflow import keras # fast library for ANNs
from tensorflow.keras.optimizers import Adam # import the optimizer you want to use to calculate the parameters
from tensorflow.keras.models import Sequential # to create a feedforward neural network
from tensorflow.keras.layers import Dense # to create a feedforward neural network with dense layers
#
# Function to create the ANN model (in this case, a feedforward network with 2 dense hidden layers)
def create_ANN(input_dimensions=1, # number of input variables
               neurons1=3, # number of neurons in first hidden layer
               neurons2=2, # number of neurons in second hidden layer
               activation='relu', # activation function
               optimizer='adam'): # optimization algorithm to compute the weights and biases
    # create model
    model = Sequential() # Feedforward architecture
    model.add(Dense(neurons1, input_dim=input_dimensions, activation=activation)) # first hidden layer
    model.add(Dense(neurons2, activation=activation)) # second hidden layer
    model.add(Dense(1)) # output layer with just one neuron because we have only one output (1D problem!)
    model.compile(loss='mse', # error metric to measure our NLL (loss)
                  optimizer=optimizer)
    return model

In addition, let's introduce something important: dataset preprocessing.

Standardizing our dataset is good practice and can be important for many ML algorithms (ANNs included).
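Standardization simply means shifting each feature to zero mean and scaling it to unit standard deviation, using statistics computed on the training set only. The sketch below does this by hand on made-up numbers. The variable names are hypothetical, and the population standard deviation (`ddof=0`, the NumPy default) is assumed.

```python
import numpy as np

X_demo = np.array([[0.0], [5.0], [10.0]])  # made-up training inputs (2D array: samples x features)

mean_demo = X_demo.mean(axis=0)            # per-feature mean of the training set
std_demo = X_demo.std(axis=0)              # per-feature standard deviation of the training set

X_demo_scaled = (X_demo - mean_demo) / std_demo
X_new_scaled = (np.array([[7.5]]) - mean_demo) / std_demo  # new data reuses the training statistics
print(X_demo_scaled.ravel())
print(X_new_scaled.ravel())
```

Note that new points (for example the test set) are transformed with the mean and standard deviation of the training set, so that no information from the test set leaks into the preprocessing.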

In [None]:
# Standardizing your dataset is good practice and can be important for ANNs!
from sklearn.preprocessing import StandardScaler # standardize the dataset with scikit-learn
#
scaler = StandardScaler().fit(X_train) # Check scikit-learn to see what this does!
#
X_train_scaled=scaler.transform(X_train)
X_test_scaled=scaler.transform(X_test)
X_data_scaled=scaler.transform(X_data)
#

In [None]:
from keras.wrappers.scikit_learn import KerasRegressor # NOTE: removed from recent Keras/TensorFlow releases, where the scikeras package provides the replacement
# Now create your first ANN model!
neurons1=3 # number of neurons for the first hidden layer
neurons2=2 # number of neurons for the second hidden layer
batch_size = len(X_train) # use the entire training set to update the weights and biases (one update per epoch)
optimizer = Adam(learning_rate=0.001) # specifying the learning rate value for the optimizer (PLAY WITH THIS!)
ANN_model = KerasRegressor(build_fn=create_ANN, neurons1=neurons1, neurons2=neurons2,
                           batch_size=batch_size, epochs=150, optimizer=optimizer,
                           validation_data=(scaler.transform(X_test), y_test))

In [None]:
# Now that we created our first ANN model, let's fit it to our (scaled) dataset!
history = ANN_model.fit(X_train_scaled, y_train)

In [None]:
fig_ANN, (ax1_ANN, ax2_ANN) = plt.subplots(1,2)
# Create a plot for the loss history
ax1_ANN.plot(history.history['loss']) # plot training loss
ax1_ANN.plot(history.history['val_loss']) # plot testing loss
ax1_ANN.set_title('Training and testing loss', fontsize=20)
ax1_ANN.set_ylabel('loss', fontsize=20)
ax1_ANN.set_xlabel('epoch', fontsize=20)
ax1_ANN.legend(['training', 'testing'], loc='upper right', fontsize=15)

# Create a plot for the ANN prediction
ax2_ANN.plot(x_data, f(x_data), 'r:', label=r'ground truth: $f(x) = x\,\sin(x)$') # show ground truth function
ax2_ANN.plot(x_train, y_train, 'ro', markersize=6, label="training points") # show training data
ax2_ANN.plot(x_test, y_test, 'kX', markersize=6, label="testing points") # show testing data

y_pred = history.model.predict(X_data_scaled) # predict all data points with ANN

ax2_ANN.plot(x_data, y_pred, 'b-', label="Neural Network prediction") # plot prediction
ax2_ANN.set_title(r'NN with '+str(neurons1)+' neurons in the 1st hidden layer, and '+str(neurons2)+' in the 2nd',
                 fontsize=20)
ax2_ANN.set_xlabel('$x$', fontsize=20)
ax2_ANN.set_ylabel('$f(x)$', fontsize=20)
ax2_ANN.legend(loc='upper left', fontsize=15)

# Create figure with specified size
fig_ANN.set_size_inches(16, 8)
plt.close(fig_ANN) # do not plot the figure now. We will show it in the next cell

In [None]:
fig_ANN # show figure now.

Not the most amazing model you have ever seen, right?

* Try again but now using 200 neurons for the first hidden layer and 10 for the second hidden layer.
    - spoiler alert: a bit better, but far from amazing...

It's possible to try to find better hyperparameters (and ANNs have many!).

The notes below (not shown in presentation) contain a simple code to do this by "brute force" in a procedure called grid search.

But there are much better ways to find better hyperparameters...
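Before looking at the grid search code, it may help to see what "brute force" means here: every combination of the listed hyperparameter values is tried. The sketch below only enumerates the combinations for a small made-up grid and trains nothing. The dictionary `param_grid_demo` and the number of folds are assumptions for this illustration.

```python
from itertools import product

# A small made-up grid of hyperparameter values
param_grid_demo = {
    "neurons1": [5, 20, 200],
    "neurons2": [5, 10],
    "optimizer": ["adam"],
}

names = list(param_grid_demo)
combos = [dict(zip(names, values)) for values in product(*param_grid_demo.values())]

n_folds_demo = 3                          # assumed number of cross-validation folds
n_fits_demo = len(combos) * n_folds_demo  # one training run per combination and per fold

for combo in combos:
    print(combo)
print(n_fits_demo)  # 3 * 2 * 1 combinations times 3 folds = 18 training runs
```

The number of training runs is the product of the number of values per hyperparameter times the number of folds, which is why grid search becomes expensive very quickly.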

In [None]:
from tensorflow.keras.callbacks import EarlyStopping # a strategy for complexity control
from sklearn.model_selection import GridSearchCV # simple (brute force) approach to find better hyperparameters.
#
# Function to create the ANN model
#(I am writing the function again, in case we want to change some hyperparameters, e.g. use more layers)
def create_ANN(input_dimensions=1,neurons1=10,neurons2=10,neurons3=10,neurons4=10,
                 activation='relu',optimizer='adam'):
    # create model
    model = Sequential()
    model.add(Dense(neurons1, input_dim=input_dimensions, activation=activation)) # first hidden layer
    model.add(Dense(neurons2, activation=activation)) # second hidden layer
    model.add(Dense(neurons3, activation=activation)) # third hidden layer (comment out if you don't want it)
    model.add(Dense(neurons4, activation=activation)) # fourth hidden layer (comment out if you don't want it), etc.
    model.add(Dense(1)) # output layer with just one neuron (we only have one output)
    model.compile(loss='mse', optimizer=optimizer) # choose error metric and optimizer.
    return model
# -----------------------------------------------------------------------------
#
# create model
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.0, patience=30, mode='min')

# define the grid search parameters:
neurons1 = [5,20,200] # number of neurons in hidden layer 1 (a list because the grid search trains the model once per value)
#neurons1 = [5] # number of neurons in hidden layer 1
#neurons2 = [5,10] # number of neurons in hidden layer 2 (if present; uncomment in create_ANN function)
neurons2 = [5] # number of neurons in hidden layer 2 (if present; uncomment in create_ANN function)
neurons3 = [10] # number of neurons in hidden layer 3 (if present; uncomment in create_ANN function)
neurons4 = [10] # number of neurons in hidden layer 4 (if present; uncomment in create_ANN function)
#
batch_size = [len(X_train)] # here considering batch size as large as the training data.
#
epochs = [1000]
#
optimizer = ['adam'] # if we specify the optimizer as a string, then you use the default parameters
#optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
#init_mode = ['uniform', 'lecun_uniform', 'normal', 'orthogonal', 'zero', 'one', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']    
#
param_grid = dict(batch_size=batch_size, epochs=epochs,neurons1=neurons1,neurons2=neurons2,
                  neurons3=neurons3,neurons4=neurons4, # comment this line if you don't want to use layer 3 and 4
                  #init_mode=init_mode, # comment this line if you are not specifying the initialization mode
                  optimizer=optimizer)
NN_model = KerasRegressor(build_fn=create_ANN)
grid = GridSearchCV(estimator=NN_model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train_scaled, y_train, callbacks=[early_stopping],
                       validation_data=(scaler.transform(X_test), y_test))
history = grid_result.best_estimator_.fit(X_train_scaled, y_train,callbacks=[early_stopping],
                                          validation_data=(scaler.transform(X_test), y_test))
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
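
To get a feeling for how much work the grid search above is doing, it can help to enumerate the grid by hand. The sketch below uses only the Python standard library (it does not call scikit-learn or Keras). The values are copied from the cell above, except for `batch_size`, which is a placeholder because the real value depends on `len(X_train)`; the name `param_grid_demo` is also made up so that it does not overwrite `param_grid`. `GridSearchCV` trains one model per parameter combination and per cross-validation fold and, with its default `refit=True`, trains the best combination once more on the whole training set.

```python
from itertools import product

# Same grid as in the cell above; batch_size is a placeholder value (assumption),
# since the real one is len(X_train)
param_grid_demo = {
    "batch_size": [5],
    "epochs": [1000],
    "neurons1": [5, 20, 200],
    "neurons2": [5],
    "neurons3": [10],
    "neurons4": [10],
    "optimizer": ["adam"],
}
cv_folds = 3  # same as cv=3 in GridSearchCV

# Every combination of one value per hyperparameter
names = list(param_grid_demo)
combinations = [dict(zip(names, values)) for values in product(*param_grid_demo.values())]

n_fits = len(combinations) * cv_folds + 1  # +1 for the final refit of the best model
for combo in combinations:
    print(combo)
print("combinations:", len(combinations), "| total fits:", n_fits)
```

Only `neurons1` has more than one value here, so the grid has three combinations. Uncommenting the longer lists (for example the list of optimizers) multiplies this count quickly.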

In [None]:
fig_ANN, (ax1_ANN, ax2_ANN) = plt.subplots(1,2)
# Create a plot for the loss history
ax1_ANN.plot(history.history['loss']) # plot training loss
ax1_ANN.plot(history.history['val_loss']) # plot testing loss
ax1_ANN.set_title('Training and testing loss', fontsize=20)
ax1_ANN.set_ylabel('loss', fontsize=20)
ax1_ANN.set_xlabel('epoch', fontsize=20)
ax1_ANN.legend(['training', 'testing'], loc='upper right', fontsize=15)

# Create a plot for the ANN prediction
ax2_ANN.plot(x_data, f(x_data), 'r:', label=r'ground truth: $f(x) = x\,\sin(x)$') # show ground truth function (raw string keeps the LaTeX backslashes)
ax2_ANN.plot(x_train, y_train, 'ro', markersize=6, label="training points") # show training data
ax2_ANN.plot(x_test, y_test, 'kX', markersize=6, label="testing points") # show testing data

y_pred = history.model.predict(X_data_scaled) # predict all data points with ANN

ax2_ANN.plot(x_data, y_pred, 'b-', label="Neural Network prediction") # plot prediction
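best_params = grid_result.best_params_ # neurons1 and neurons2 are lists here, so report the values selected by the grid search
ax2_ANN.set_title('NN with '+str(best_params['neurons1'])+' neurons in the 1st hidden layer, and '
                  +str(best_params['neurons2'])+' in the 2nd', fontsize=20)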
ax2_ANN.set_xlabel('$x$', fontsize=20)
ax2_ANN.set_ylabel('$f(x)$', fontsize=20)
ax2_ANN.legend(loc='upper left', fontsize=15)

# Create figure with specified size
fig_ANN.set_size_inches(16, 8)
plt.close(fig_ANN) # do not plot the figure now. We will show it in the next cell

In [None]:
fig_ANN # show figure now.

This neural network approximation is also not brilliant... Were you expecting this?

# <font color='red'>HOMEWORK</font>

Redo the neural network regression but now for the noisy dataset (use the same network we used for the noiseless dataset).

In [None]:
# HOMEWORK.

# until here.

### Solution to Exercise 1

``` python
degree = 4 # degree of polynomial we want to fit
poly_model = make_pipeline(PolynomialFeatures(degree),LinearRegression())
poly_model.fit(X_train,y_noisy_train) # fit the polynomial to our 5 points in X_train which is a 2D array
y_poly_noisy_pred = poly_model.predict(X_test) # prediction of our polynomial

# Plot the function and the polynomial prediction
fig_ex1, ax_ex1 = plt.subplots()

ax_ex1.plot(x_data, f(x_data), 'r:', label=r'ground truth: $f(x) = x\,\sin(x)$') # function to learn (raw string keeps the LaTeX backslashes)
ax_ex1.errorbar(x_train, y_noisy_train, noise_train, fmt='ro', markersize=6, label=u'training points inc. uncertainty')
ax_ex1.errorbar(x_test, y_noisy_test, noise_test, fmt='kX', markersize=6, label=u'testing points inc. uncertainty')

y_poly_noisy_data_pred = poly_model.predict(X_data) # prediction of our polynomial for all data points
ax_ex1.plot(x_data, y_poly_noisy_data_pred, 'b-', label="Polynomial prediction")

ax_ex1.set_xlabel('$x$', fontsize=20)
ax_ex1.set_ylabel('$f(x)$', fontsize=20)
ax_ex1.set_title("Polynomial approximation", fontsize=20)
ax_ex1.set_ylim(-10, 15) # just to provide more space for the legend
ax_ex1.legend(loc='upper left', fontsize=15)
fig_ex1.set_size_inches(8,8)

# Compute MSE and R2 for the GP model
# NOTE: here we will compare with the noiseless function (in practice we don't have this information!).
gp_mse_value = mean_squared_error(y_test, y_noisy_pred)
gp_r2_value = r2_score(y_test, y_noisy_pred)
print('MSE for GPR = ', gp_mse_value)
print('R2 score for GPR = ', gp_r2_value)

# Compute MSE and R2 for the polynomial model
poly_mse_value = mean_squared_error(y_test, y_poly_noisy_pred)
poly_r2_value = r2_score(y_test, y_poly_noisy_pred)
print('MSE for polynomial = ', poly_mse_value)
print('R2 score for polynomial = ', poly_r2_value)
```
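
As a reminder of what the two metrics printed above measure, here is a minimal NumPy sketch of their definitions. The helper names `mse_by_hand` and `r2_by_hand` and the small arrays are made up for illustration; in the notebook you would keep using `mean_squared_error` and `r2_score` from scikit-learn.

```python
import numpy as np

def mse_by_hand(y_true, y_pred):
    # Mean of the squared residuals
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def r2_by_hand(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares around the mean of y_true)
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up values, only to exercise the two functions
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.5, 2.0, 2.5, 4.0])

print("MSE =", mse_by_hand(y_true_demo, y_pred_demo))
print("R2  =", r2_by_hand(y_true_demo, y_pred_demo))
```

An R2 score of 1 means a perfect prediction, 0 means the model is no better than always predicting the mean of the test targets, and negative values mean it is worse than that.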

### Solution to Homework of this Lecture
``` python
# Function to create the ANN model (a feedforward architecture with 2 hidden layers)
def create_ANN(input_dimensions=1, # number of input variables
               neurons1=3, # number of neurons in first hidden layer
               neurons2=2, # number of neurons in second hidden layer
               activation='relu', # activation function
               optimizer='adam'): # optimization algorithm to compute the weights and biases
    # create model
    model = Sequential() # Feedforward architecture
    model.add(Dense(neurons1, input_dim=input_dimensions, activation=activation)) # first hidden layer
    model.add(Dense(neurons2, activation=activation)) # second hidden layer
    model.add(Dense(1)) # output layer with just one neuron because we have only one output (1D problem!)
    model.compile(loss='mse', # mean squared error as the loss (a Gaussian NLL with fixed noise variance, up to scaling and an additive constant)
                  optimizer=optimizer)
    return model
# -----------------------------------------------------------------------------
#
# Standardizing your dataset is usually a good practice and can be important for ANNs:
scaler = StandardScaler().fit(X_train) # Check scikit-learn to see what this does!
#
X_train_scaled=scaler.transform(X_train)
X_test_scaled=scaler.transform(X_test)
X_data_scaled=scaler.transform(X_data)
#

# Now create your first ANN model!
neurons1=200 # number of neurons for the first hidden layer
neurons2=10 # number of neurons for the second hidden layer
batch_size = len(X_train) # full batch: the entire training set is used for each update of the weights and biases
ANN_model = KerasRegressor(build_fn=create_ANN, neurons1=neurons1, neurons2=neurons2,
                           batch_size=batch_size, epochs=150, optimizer='adam',
                           validation_data=(scaler.transform(X_test), y_noisy_test))

# Now let's train the model on our (scaled) dataset!
history = ANN_model.fit(X_train_scaled, y_noisy_train)

# Finally, plot the loss history and the predicted function.
fig_ANN, (ax1_ANN, ax2_ANN) = plt.subplots(1,2)
# Create a plot for the loss history
ax1_ANN.plot(history.history['loss']) # plot training loss
ax1_ANN.plot(history.history['val_loss']) # plot testing loss
ax1_ANN.set_title('Training and testing loss', fontsize=20)
ax1_ANN.set_ylabel('loss', fontsize=20)
ax1_ANN.set_xlabel('epoch', fontsize=20)
ax1_ANN.legend(['training', 'testing'], loc='upper right', fontsize=15)

# Create a plot for the ANN prediction
ax2_ANN.plot(x_data, f(x_data), 'r:', label=r'ground truth: $f(x) = x\,\sin(x)$') # show ground truth function (raw string keeps the LaTeX backslashes)
ax2_ANN.errorbar(x_train, y_noisy_train, noise_train, fmt='ro', markersize=6, label=u'training points inc. uncertainty')
ax2_ANN.errorbar(x_test, y_noisy_test, noise_test, fmt='kX', markersize=6, label=u'testing points inc. uncertainty')

y_pred = history.model.predict(X_data_scaled) # predict all data points with ANN

ax2_ANN.plot(x_data, y_pred, 'b-', label="Neural Network prediction") # plot prediction
ax2_ANN.set_title(r'NN with '+str(neurons1)+' neurons in the 1st hidden layer, and '+str(neurons2)+' in the 2nd',
                 fontsize=20)
ax2_ANN.set_xlabel('$x$', fontsize=20)
ax2_ANN.set_ylabel('$f(x)$', fontsize=20)
ax2_ANN.legend(loc='upper left', fontsize=15)

# Create figure with specified size
fig_ANN.set_size_inches(16, 8)
```
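
The solution above standardizes the inputs with `StandardScaler` before training. As a rough sketch of what that step does for a single feature, the NumPy code below computes the mean and standard deviation from the training inputs only and then applies the same shift and scale to the test inputs. The arrays `X_train_demo` and `X_test_demo` are made-up values, not the notebook's data.

```python
import numpy as np

# Made-up inputs with shape (n_samples, 1), like X_train and X_test in the notebook
X_train_demo = np.array([[1.0], [3.0], [5.0], [7.0]])
X_test_demo = np.array([[2.0], [8.0]])

# "fit": the statistics come from the training inputs only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# "transform": the same shift and scale is applied to any other inputs
X_train_demo_scaled = (X_train_demo - mean) / std
X_test_demo_scaled = (X_test_demo - mean) / std

print("train mean/std after scaling:", X_train_demo_scaled.mean(), X_train_demo_scaled.std())
print("scaled test inputs:", X_test_demo_scaled.ravel())
```

Test inputs outside the training range, such as 8.0 here, end up outside the range of the scaled training inputs, because the scaler is not refitted to them.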

### End of Practical Session 1

Have fun!