# Exercise: Scaling and learning rates

In this notebook, we'll train a neural network on an EEG dataset. The objective is to detect whether the eyes of the experimental subject are open or closed.  

First we do a couple of necessary imports and load the dataset. We also print the description. 

In [None]:
import math
import tensorflow as tf
import numpy as np
import sklearn 
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
plt.style.use("seaborn-v0_8")

# fetch data from openml.org
from sklearn.datasets import fetch_openml
data = fetch_openml('eeg-eye-state', cache=True)
print(data.DESCR)

Let's start examining the dataset. First question: how large is the dataset?

In [None]:
print("number of samples: {}".format(len(data["data"])))

Next, we split off a training set and a validation set. Because there is not much data and we just want to compare different algorithms, we don't define a test set. 

In [None]:
X,y=sklearn.utils.shuffle(np.array(data["data"]), data["target"].astype('int')-1) # let's make sure the data is in random order
train_size=10000
X_train,X_val=X[:train_size],X[train_size:]
y_train,y_val=y[:train_size],y[train_size:]
X_train.shape,X_val.shape

## Scaling

We continue inspecting the data. As I already know that the features are continuous and mostly take values in the same range, a boxplot of the features is a sensible next step. 

In [None]:
_,axs=plt.subplots(1,2,figsize=(10,5))
axs[0].boxplot(X_train[:,:5])
axs[0].set_title("feature range of first five features, with outliers")
axs[1].boxplot(X_train,showfliers=False)
axs[1].set_title("feature range, without outliers")
plt.show()

We observe: there are substantial outliers and the data takes quite large values.

### Task: Train NN
* Use tensorflow to define a ReLU neural network for binary classification with two hidden layers, the first with 50 neurons, the second with 10 neurons. You can either have a single output neuron with logistic activation ('sigmoid') or two output neurons with softmax activation.
* Take <code>tf.keras.losses.SparseCategoricalCrossentropy</code> as loss if you use two softmax output neurons. With a single sigmoid output neuron, take <code>tf.keras.losses.BinaryCrossentropy</code> instead.
* Train the neural network for 30 epochs and pass along the validation set <code>validation_data=(X_val,y_val)</code> to <code>fit</code>. Otherwise use the default parameters.

You should see terrible performance. (If this exercise seems challenging to you, have a look in [tfintro](https://colab.research.google.com/github/henningbruhn/math_of_ml_course/blob/main/neural_networks/tfintro.ipynb).)

In [None]:
### insert your code here ###



What happened? Obviously, the model does not perform well at all. The reason: when the weights are initialised at the beginning of training, it is assumed that the data takes values more or less in the range $[-1,1]$. Our data, however, takes much larger values. Because of that, the neural network starts training in a sort of off-balance state and then takes a very long time to move away from there (if that happens at all). We'll rectify that by scaling the data. Here, because it's simple, we use <code>sklearn.preprocessing.StandardScaler</code>, a class of *scikit-learn*. There's also a tensorflow way of doing this, which you'll see in the solution. 
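The off-balance start can be made concrete with a small NumPy sketch. It assumes Glorot-uniform initialisation (the Keras default for dense layers), 14 input features, and made-up input values around 4000 that only mimic "large" inputs; none of this is the actual EEG data. The sketch compares the pre-activations of a first hidden layer for small and for large inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 14, 50  # assumed: 14 input features, 50 hidden neurons as in the task

# Glorot-uniform initialisation: weights drawn uniformly from [-limit, limit]
limit = np.sqrt(6.0 / (n_features + n_hidden))
W = rng.uniform(-limit, limit, size=(n_features, n_hidden))

# made-up inputs: one batch roughly in [-1, 1], one batch with large values
x_small = rng.normal(0.0, 1.0, size=(1000, n_features))
x_large = rng.normal(4000.0, 50.0, size=(1000, n_features))

# pre-activations of the first hidden layer (biases start at 0)
z_small = x_small @ W
z_large = x_large @ W

mean_abs_small = np.abs(z_small).mean()
mean_abs_large = np.abs(z_large).mean()
print("mean |pre-activation|, small inputs: {:.2f}".format(mean_abs_small))
print("mean |pre-activation|, large inputs: {:.2f}".format(mean_abs_large))
```

With the large inputs, the pre-activations are larger by several orders of magnitude, so each ReLU neuron starts out either far in its active region or far in its dead region.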

What does <code>StandardScaler</code> do? First, it ensures that every feature has mean 0 by subtracting the feature's mean from every sample. Then each feature is multiplied by a factor so that its variance becomes 1. Thus, values larger than 1 in absolute value are possible, but mostly the values range from -1 to 1. Let's have a look at the boxplots again.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # we need to fit, to learn the mean and the variance
X_val_scaled=scaler.transform(X_val) # the validation/test set is only transformed, not fit 

### plotting
_,axs=plt.subplots(1,2,figsize=(10,5))
axs[0].boxplot(X_train_scaled[:,:5])
axs[0].set_title("feature range of first five features, with outliers")
axs[1].boxplot(X_train_scaled,showfliers=False)
axs[1].set_title("feature range, without outliers")
plt.show()

Note that the line in the boxes shows the *median*, not the *mean* (which should be at 0). There are still substantial outliers, but as most of the values lie in $[-1,1]$, training should hopefully work better now. Let's try it out!
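For reference, the transformation itself is easy to reproduce by hand. The following sketch standardises a small made-up array with NumPy; the array <code>X_demo</code> and its numbers are purely illustrative. It uses the population standard deviation (<code>ddof=0</code>), which, as far as I know, is also what <code>StandardScaler</code> uses.

```python
import numpy as np

rng = np.random.default_rng(0)
# made-up data: 500 samples, 3 features with large means (not the EEG data)
X_demo = rng.normal(loc=[4000.0, 4300.0, 4100.0],
                    scale=[30.0, 50.0, 10.0],
                    size=(500, 3))

# "fit": learn mean and standard deviation of every feature
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)  # population standard deviation (ddof=0)

# "transform": subtract the mean, then divide by the standard deviation
X_demo_scaled = (X_demo - mean) / std

print("means after scaling:", X_demo_scaled.mean(axis=0))
print("standard deviations after scaling:", X_demo_scaled.std(axis=0))
```

A validation set would be transformed with the same <code>mean</code> and <code>std</code>, that is, with the statistics learnt on the training set.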

### Task: Train with scaling
* Repeat the steps of the previous task, but use <code>X_train_scaled</code> and <code>X_val_scaled</code> this time. 

In [None]:
### insert your code here ###



Much better!

## Learning rates

Next, we'll see how learning rates can be set. So far, we have always used the default values for the optimiser. This needs to change now. The optimiser is specified when calling <code>compile</code> on a model. If we just want to choose between different optimisers without setting any optimiser-specific parameters, we can prescribe the optimiser by passing a string: 

<code>model.compile(loss=loss_fn, metrics=['accuracy'],optimizer="SGD")  # stochastic gradient descent</code> 

or 

<code>model.compile(loss=loss_fn, metrics=['accuracy'],optimizer="Adam")  # Adam optimiser</code> 

As we want to set a learning rate, however, we actually need to instantiate an optimiser object:

<code>my_SGD=tf.keras.optimizers.SGD(learning_rate=0.42)
model.compile(loss=loss_fn, metrics=['accuracy'],optimizer=my_SGD)</code> 
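To get a feeling for what the learning rate does, here is a toy sketch in plain Python. It runs gradient descent on the one-dimensional function $f(w)=w^2$, which has nothing to do with the EEG data or with the Keras optimiser; the helper <code>gradient_descent</code> is made up for this illustration. The only point is that the step size decides whether the iteration crawls, converges quickly, or fails to settle.

```python
def gradient_descent(learning_rate, steps=30, w=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; return the loss after every step."""
    losses = []
    for _ in range(steps):
        grad = 2.0 * w                # derivative of w**2
        w = w - learning_rate * grad  # gradient descent update
        losses.append(w ** 2)
    return losses

# final loss after 30 steps for three learning rates
final_losses = {lr: gradient_descent(lr)[-1] for lr in [0.01, 0.1, 1.0]}
for lr, loss in final_losses.items():
    print("learning rate {}: final loss {:.4g}".format(lr, loss))
```

On this toy function, the smallest learning rate makes slow progress, the middle one converges quickly, and with learning rate 1 the iterate jumps back and forth between $1$ and $-1$ without ever reducing the loss. Which learning rate works best for the neural network is a separate question.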

### Task: different learning rates
* For each of the learning rates $[0.01,0.1,1]$ train a neural network of the same type as above (i.e., two hidden layers with 50 and 10 neurons etc.), again for 30 epochs each.
* Use the <code>history</code> object returned by <code>fit</code> to plot the three losses (for the different learning rates) in the same plot. Which learning rate works best? For plotting see [plt_intro](https://colab.research.google.com/github/henningbruhn/math_of_ml_course/blob/main/python_intro/plt_intro.ipynb).
* Plot also the validation loss. (You find it in <code>history.history["val_loss"]</code>.)

By the way, if all the output of <code>fit</code> annoys you, you can switch it off by setting <code>verbose=0</code>.

In [None]:
### insert your code here ###