# Tuning of regularisation strength

The strength of the regularisation is a parameter of the ML algorithm, often termed a *hyperparameter*, that needs to be tuned. Usually there are several of these: others are the learning rate, the number of hidden neurons or even the whole architecture of the neural network. Hyperparameter tuning aims to find the set of hyperparameters that yields the best-performing algorithm. Special care needs to be taken that hyperparameter tuning does not exploit the test set. In practice this means: as usual, the test set should only be fed into the algorithm (or algorithms) once, at the end.

Because I am lazy, we use the Fashion-MNIST data set again, together with the neural network class from scikit-learn. (TensorFlow would be better.)

## Fashion MNIST

In [1]:
import numpy as np
import sklearn 
from sklearn.datasets import fetch_openml
from sklearn.utils import shuffle

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import zero_one_loss

In [2]:
# fetch data from openml.org
# see https://www.openml.org/d/40996
fashion = fetch_openml('Fashion-MNIST', cache=True)
fashion.target = fashion.target.astype(np.int8) # fetch_openml() returns targets as strings
X, y = fashion["data"]/255, fashion["target"]
X.shape

(70000, 784)

In [3]:
print(fashion.DESCR)

**Author**: Han Xiao, Kashif Rasul, Roland Vollgraf  
**Source**: [Zalando Research](https://github.com/zalandoresearch/fashion-mnist)  
**Please cite**: Han Xiao and Kashif Rasul and Roland Vollgraf, Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, arXiv, cs.LG/1708.07747  

Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. 

Raw data available at: https://github.com/zalandoresearch/fashion-mnist

### Target classes
Each training and test example is assigned to one of the following labels:
Label  Description  
0  T-shirt/top  
1  Trouser  
2  Pullover  
3  Dress  
4  

# Training set, validation set and test set

We split the data set into **three** parts: training set, validation set and test set. The validation set is used for hyperparameter tuning.

In [4]:
# we shuffle the data and pick a random subset as training set
# the next val_size samples form the validation set, the rest makes up the test set
train_size=5000
val_size=5000
X, y = sklearn.utils.shuffle(X,y)

# introduce validation set
# here, it is a bit large in comparison with the training set
X_train, X_val, X_test = X[:train_size], X[train_size:train_size+val_size], X[train_size+val_size:]
y_train, y_val, y_test = y[:train_size], y[train_size:train_size+val_size], y[train_size+val_size:]
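
The same three-way split can also be sketched with two calls to `train_test_split` from scikit-learn. This is only an illustration: the arrays `X_demo` and `y_demo` are random stand-ins that we assume to have the same layout as `X` and `y`, and the sizes are scaled down so that the sketch runs instantly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in data with the same layout as X, y (assumption: random values,
# only the shapes matter for this illustration)
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 784))
y_demo = rng.integers(0, 10, size=1000)

train_size, val_size = 100, 100

# first split off the training set ...
X_train, X_rest, y_train, y_rest = train_test_split(
    X_demo, y_demo, train_size=train_size, random_state=0)
# ... then divide the remainder into validation set and test set
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=val_size, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)
```

Both variants shuffle the data before splitting, so neither of them relies on the original order of the samples.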

## Hyperparameter tuning

We search semi-systematically for the best value of the L2-penalty alpha. The neural network classifier <code>MLPClassifier</code> in scikit-learn has a parameter <code>alpha</code> with which large weights can be penalised. Up to a scaling factor that depends on the implementation, the loss function comprises a regularisation term

$$
R(w)=\alpha||w||_2^2,
$$

where $w$ is the weight vector of the neural network.

Note that the network here is over-parameterised: the training set comprises just 5000 samples, while the network has about 90000 weights. And indeed, we'll be able to achieve zero or close to zero training error.
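
The weight count can be checked with a few lines of arithmetic. The sketch assumes 784 input pixels, the two hidden layers of 100 neurons each that are used below, and 10 output classes.

```python
# layer widths: input, two hidden layers, output
layer_sizes = [784, 100, 100, 10]

# one weight per connection between consecutive layers
n_weights = sum(n_in * n_out
                for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
# one bias per neuron outside the input layer
n_biases = sum(layer_sizes[1:])

print(n_weights, n_biases)  # 89400 210
```

That is 89400 weights (plus 210 biases) against 5000 training samples.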

In [5]:
hidden_layer_sizes=(100,100)
alphas=[0,0.001,0.01,0.1,1]  
for alpha in alphas:
    net=MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,max_iter=3000,\
                      alpha=alpha,solver='sgd',learning_rate='constant',learning_rate_init=0.01)
    net.fit(X_train,y_train)
    train_err=zero_one_loss(y_train,net.predict(X_train))
    val_err=zero_one_loss(y_val,net.predict(X_val))
    # the test set is deliberately not touched here: it is reserved for the final run
    print("alpha / iterations: {} / {}".format(alpha,net.n_iter_))
    print("train / val error: {:.1f}% / {:.1f}%".format(train_err*100,val_err*100))
    print(" ")

alpha / iterations: 0 / 167
train / val error: 0.2% / 15.9%
 
alpha / iterations: 0.001 / 255
train / val error: 0.0% / 15.8%
 
alpha / iterations: 0.01 / 251
train / val error: 0.0% / 15.8%
 
alpha / iterations: 0.1 / 553
train / val error: 0.0% / 15.6%
 
alpha / iterations: 1 / 180
train / val error: 3.9% / 14.9%
 


In [6]:
hidden_layer_sizes=(100,100)
alphas=[0.5,1,2,10]  
for alpha in alphas:
    net=MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,max_iter=3000,\
                      alpha=alpha,solver='sgd',learning_rate='constant',learning_rate_init=0.01)
    net.fit(X_train,y_train)
    train_err=zero_one_loss(y_train,net.predict(X_train))
    val_err=zero_one_loss(y_val,net.predict(X_val))
    print("alpha / iterations: {} / {}".format(alpha,net.n_iter_))
    print("train / val error: {:.1f}% / {:.1f}%".format(train_err*100,val_err*100))
    print(" ")

alpha / iterations: 0.5 / 291
train / val error: 0.9% / 15.2%
 
alpha / iterations: 1 / 216
train / val error: 3.1% / 14.5%
 
alpha / iterations: 2 / 165
train / val error: 9.7% / 16.7%
 
alpha / iterations: 10 / 131
train / val error: 21.6% / 23.3%
 


In [7]:
hidden_layer_sizes=(100,100)
alphas=[0.8,0.9,1.1,1.2]  
for alpha in alphas:
    net=MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,max_iter=3000,\
                      alpha=alpha,solver='sgd',learning_rate='constant',learning_rate_init=0.01)
    net.fit(X_train,y_train)
    train_err=zero_one_loss(y_train,net.predict(X_train))
    val_err=zero_one_loss(y_val,net.predict(X_val))
    print("alpha / iterations: {} / {}".format(alpha,net.n_iter_))
    print("train / val error: {:.1f}% / {:.1f}%".format(train_err*100,val_err*100))
    print(" ")

alpha / iterations: 0.8 / 310
train / val error: 1.7% / 14.7%
 
alpha / iterations: 0.9 / 190
train / val error: 4.9% / 16.0%
 
alpha / iterations: 1.1 / 211
train / val error: 6.7% / 15.8%
 
alpha / iterations: 1.2 / 220
train / val error: 5.0% / 15.5%
 


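Okay, alpha=1 seems to give the best results, although the differences to neighbouring values of alpha are about as large as the variation between the two runs with alpha=1. Note that we look at the error on the validation set here: it works as a sort of "test set" for hyperparameter tuning.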
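
The manual search can be condensed into a small helper that returns the validation error for each alpha, so that the best value can be picked programmatically. This is only a sketch: the function `sweep_alphas` is a hypothetical helper, and so that it runs in seconds we assume the small digits data set bundled with scikit-learn as a stand-in, a single small hidden layer and the default solver.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import zero_one_loss


def sweep_alphas(X_train, y_train, X_val, y_val, alphas, **mlp_kwargs):
    """Fit one network per alpha and return {alpha: validation error}."""
    val_errors = {}
    for alpha in alphas:
        net = MLPClassifier(alpha=alpha, **mlp_kwargs)
        net.fit(X_train, y_train)
        val_errors[alpha] = zero_one_loss(y_val, net.predict(X_val))
    return val_errors


# small stand-in data set (assumption), not Fashion-MNIST
X_d, y_d = load_digits(return_X_y=True)
X_d = X_d / 16  # pixel values range from 0 to 16
X_tr, X_va, y_tr, y_va = train_test_split(
    X_d, y_d, train_size=500, test_size=500, random_state=0)

val_errors = sweep_alphas(X_tr, y_tr, X_va, y_va, alphas=[0.001, 0.1, 10],
                          hidden_layer_sizes=(32,), max_iter=500, random_state=0)

# the best alpha is the one with the smallest validation error
best_alpha = min(val_errors, key=val_errors.get)
for alpha, err in val_errors.items():
    print("alpha / val error: {} / {:.1f}%".format(alpha, err * 100))
print("best alpha:", best_alpha)
```

Note that the test set does not appear anywhere in the helper.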

## Final training and test run
Train on all of the available training data, i.e. training set plus validation set. Why? Because more training data usually helps, and now that the validation set has been used for tuning, we can simply lump it together with the training set.
Then compute the test error.

In [8]:
alpha=1
net=MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,max_iter=3000,\
                  alpha=alpha,solver='sgd',learning_rate='constant',learning_rate_init=0.01)
net.fit(X[:train_size+val_size],y[:train_size+val_size])
train_err=zero_one_loss(y[:train_size+val_size],net.predict(X[:train_size+val_size]))
test_err=zero_one_loss(y[train_size+val_size:],net.predict(X[train_size+val_size:]))
print("alpha / iterations: {} / {}".format(alpha,net.n_iter_))
print("train / test error: {:.1f}% / {:.1f}%".format(train_err*100,test_err*100))
print(" ")

alpha / iterations: 1 / 180
train / test error: 6.9% / 14.0%
 


Just for the sake of comparison, let's train an unregularised network.

In [9]:
alpha=0
net=MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,max_iter=3000,\
                  alpha=alpha,solver='sgd',learning_rate='constant',learning_rate_init=0.01)
net.fit(X[:train_size+val_size],y[:train_size+val_size])
train_err=zero_one_loss(y[:train_size+val_size],net.predict(X[:train_size+val_size]))
test_err=zero_one_loss(y[train_size+val_size:],net.predict(X[train_size+val_size:]))
print("alpha / iterations: {} / {}".format(alpha,net.n_iter_))
print("train / test error: {:.1f}% / {:.1f}%".format(train_err*100,test_err*100))
print(" ")

alpha / iterations: 0 / 170
train / test error: 0.2% / 15.0%
 


Obviously, we'd need to repeat this experiment to make sure that the results are not a statistical fluke. 
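
One way to sketch such a repetition is to redo the split and the training for several random seeds and to report mean and standard deviation of the held-out error. The helper `repeated_error` is hypothetical, and as a fast stand-in we again assume the small digits data set bundled with scikit-learn, a single small hidden layer and the default solver.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import zero_one_loss

# small stand-in data set (assumption), not Fashion-MNIST
X_d, y_d = load_digits(return_X_y=True)
X_d = X_d / 16  # pixel values range from 0 to 16


def repeated_error(alpha, seeds):
    """Mean and standard deviation of the held-out error over several seeds."""
    errors = []
    for seed in seeds:
        # a fresh random split and a fresh weight initialisation for every seed
        X_tr, X_ho, y_tr, y_ho = train_test_split(
            X_d, y_d, train_size=500, test_size=500, random_state=seed)
        net = MLPClassifier(hidden_layer_sizes=(32,), alpha=alpha,
                            max_iter=500, random_state=seed)
        net.fit(X_tr, y_tr)
        errors.append(zero_one_loss(y_ho, net.predict(X_ho)))
    return np.mean(errors), np.std(errors)


results = {alpha: repeated_error(alpha, seeds=range(3)) for alpha in [0, 1]}
for alpha, (mean, std) in results.items():
    print("alpha {}: {:.1f}% +/- {:.1f}%".format(alpha, mean * 100, std * 100))
```

If the gap between the two means is not clearly larger than the standard deviations, the difference may well be noise.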

Moreover, what we did here was an at most semi-systematic search for the best value of a single hyperparameter. There are more sophisticated methods. A good starting point is the [tutorial](https://scikit-learn.org/stable/modules/grid_search.html) by scikit-learn.
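
As a taste of what the tutorial covers, here is a minimal sketch with `GridSearchCV`, which replaces the fixed validation set by cross-validation on the training data. The data set (the small digits data bundled with scikit-learn), the grid of alphas and the network size are assumptions chosen so that the sketch runs quickly.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# small stand-in data set (assumption), not Fashion-MNIST
X_d, y_d = load_digits(return_X_y=True)
X_d = X_d / 16  # pixel values range from 0 to 16
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.3, random_state=0)

# every alpha in the grid is evaluated with 3-fold cross-validation
param_grid = {"alpha": [0.001, 0.1, 10]}
search = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    param_grid, cv=3)
search.fit(X_tr, y_tr)  # by default, refits on all of X_tr with the best alpha

print("best alpha:", search.best_params_["alpha"])
# the test set is used only once, at the very end
test_accuracy = search.score(X_te, y_te)
print("test error: {:.1f}%".format((1 - test_accuracy) * 100))
```

Because the search refits the best model on the whole training data, the step of lumping training and validation set together happens automatically.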