# Algorithm configuration: hyperparameters, sampling, grid search

## Importing data

For this tutorial, we will use the Melbourne Housing Snapshot dataset, available on Kaggle. To download it, follow the [first stage of this tutorial](https://medium.com/@yvettewu.dw/tutorial-kaggle-api-google-colaboratory-1a054a382de0), which shows how to obtain Kaggle access credentials (kaggle.json). Once you have the credentials file, use the side menu to upload it to Colab, then run the cells below:

In [1]:
import pandas as pd

In [3]:
!mkdir -p /root/.kaggle
!cp /content/kaggle.json /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json


In [4]:
!kaggle datasets download -d dansbecker/melbourne-housing-snapshot/

Downloading melbourne-housing-snapshot.zip to /content
  0% 0.00/451k [00:00<?, ?B/s]
100% 451k/451k [00:00<00:00, 57.0MB/s]


In [5]:
!unzip melbourne-housing-snapshot.zip

Archive:  melbourne-housing-snapshot.zip
  inflating: melb_data.csv           


In [6]:
data = pd.read_csv('melb_data.csv')
data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


## Sampling

An important step in building a model is testing its predictions on a set of data in order to evaluate the algorithm's performance. One might think of doing that with the training dataset itself. However, a model scored on the same data it was fitted to can obtain a near-perfect score simply by memorizing that data (overfitting), and still fail to make useful predictions on unseen data. For that reason, there are techniques for partitioning the dataset across the different development stages.
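
The gap between the score on training data and the score on unseen data can be illustrated with a minimal sketch. The synthetic data, the decision tree model, and all variable names below are assumptions made for this illustration only; they are not part of the housing example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.5, size=200)

# Fit on the first 120 samples and keep the remaining 80 unseen
X_seen, y_seen = X_demo[:120], y_demo[:120]
X_unseen, y_unseen = X_demo[120:], y_demo[120:]

# An unpruned decision tree can memorize the samples it was fitted to
tree = DecisionTreeRegressor(random_state=0).fit(X_seen, y_seen)
train_r2 = tree.score(X_seen, y_seen)
unseen_r2 = tree.score(X_unseen, y_unseen)

print("R^2 on the data used for fitting: %0.2f" % train_r2)
print("R^2 on unseen data: %0.2f" % unseen_r2)
```

The score on the fitted data says little about how the model behaves on new samples, which is why a held-out set is needed.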

A frequently used method is *train_test_split*, which splits the data into two parts: one subset for training and the other for testing. Its *test_size* parameter sets the proportion of data held out for testing; for example, *test_size=0.4* keeps 60% of the data for training and 40% for testing. Let's take a look at an example below.

In [7]:
# Removing missing values for the purpose of this analysis
data = data.dropna()
# Input data
X = data[['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']].values
# Target data
y = data['Price'].values

In [8]:
from sklearn.model_selection import train_test_split
from sklearn import svm
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
clf = svm.SVR(kernel='linear').fit(X_train, y_train)
clf.score(X_test, y_test)

0.2851704875718537

Note that the score is low (for a regressor, *score* returns the coefficient of determination, R²), which indicates that the model's predictions are not good. To keep this tutorial short, we will not go further into feature engineering; the only preprocessing is the removal of rows with missing values.

Furthermore, even though we just used *train_test_split* to guard against overfitting, it can still fail in that regard when you are evaluating different hyperparameters. If you repeatedly tweak the parameters to maximize the score on the test set, knowledge of the test set leaks into the model, and the score no longer reflects how well the model generalizes. An approach that can solve that is cross-validation.

Cross-validation splits the data into k parts (e.g. k=3, k=5 or k=10), each called a "fold". The model is trained on k-1 folds, and the held-out fold is used as test data for validation. This process is repeated until each fold has been used for testing once. In the end, you have a set of k scores, whose mean and standard deviation summarize the model's performance. Let's check out how we can make this happen.

In [9]:
from sklearn.model_selection import cross_val_score
from sklearn import svm
clf = svm.SVR(kernel='linear')
scores = cross_val_score(clf, X, y, cv=5)
print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

R^2: 0.32 (+/- 0.39)
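
To make the k-fold procedure described above concrete, the loop can also be written by hand. This is a minimal sketch on synthetic data with a linear regression model (both are assumptions for illustration, chosen because they fit quickly); it reproduces what `cross_val_score` computes for a regressor when `cv=5`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# Train on k-1 folds, score on the held-out fold, repeat for every fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    reg = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(reg.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The library helper runs the same loop internally
library_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)

print("Manual:  %0.3f (+/- %0.3f)" % (manual_scores.mean(), manual_scores.std() * 2))
print("Library: %0.3f (+/- %0.3f)" % (library_scores.mean(), library_scores.std() * 2))
```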


# Cross-validation iterators
This section lists utilities that generate indices for splitting a dataset according to different cross-validation strategies.

## KFold
KFold divides all the samples into *k* groups, called folds, of equal size (if possible).



In [10]:
import numpy as np
from sklearn.model_selection import KFold


In [11]:
X = data[['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']].values
kf = KFold(n_splits=2)

In [12]:
for train, test in kf.split(X):
    print("%s %s" % (train, test))

[3098 3099 3100 ... 6193 6194 6195] [   0    1    2 ... 3095 3096 3097]
[   0    1    2 ... 3095 3096 3097] [3098 3099 3100 ... 6193 6194 6195]


Each split consists of two arrays of indices: one for the training set and one for the test set.
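
The "equal sizes (if possible)" behaviour is easier to see on a tiny array. The sketch below uses 10 made-up samples and 3 folds (values chosen only for illustration), so the folds cannot all have the same size. It also shows the `shuffle` option, which KFold leaves off by default.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, which do not divide evenly into 3 folds
X_demo = np.arange(10).reshape(-1, 1)

fold_sizes = []
for train_idx, test_idx in KFold(n_splits=3).split(X_demo):
    fold_sizes.append(len(test_idx))
    print("train:", train_idx, "test:", test_idx)

# The first folds receive the extra samples
print("Test fold sizes:", fold_sizes)  # [4, 3, 3]

# Without shuffling, folds are contiguous blocks; shuffle=True mixes the indices
shuffled_kf = KFold(n_splits=3, shuffle=True, random_state=0)
shuffled_tests = [test_idx for _, test_idx in shuffled_kf.split(X_demo)]
```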

## Repeated K-Fold
RepeatedKFold repeats K-Fold *n* times. It can be used when you need to run KFold *n* times, producing different splits in each repetition.

In [13]:
import numpy as np
from sklearn.model_selection import RepeatedKFold

In [14]:
X = data[['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']].values
random_state = 12883823
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)

In [15]:
for train, test in rkf.split(X):
    print("%s %s" % (train, test))

[   4    5    6 ... 6190 6192 6193] [   0    1    2 ... 6191 6194 6195]
[   0    1    2 ... 6191 6194 6195] [   4    5    6 ... 6190 6192 6193]
[   0    5    9 ... 6183 6188 6189] [   1    2    3 ... 6193 6194 6195]
[   1    2    3 ... 6193 6194 6195] [   0    5    9 ... 6183 6188 6189]
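
As a quick check of how the repetitions are organised, the following sketch uses a small made-up array and hypothetical variable names. It counts the splits and verifies that, within each repetition, the test folds together cover every sample exactly once.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Ten hypothetical samples
X_demo = np.arange(10).reshape(-1, 1)
n_splits, n_repeats = 2, 3

rkf_demo = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
splits = list(rkf_demo.split(X_demo))

# The total number of train/test pairs is n_splits * n_repeats
print("Number of splits:", len(splits))

# Splits are yielded one repetition after another
covers_all = []
for rep in range(n_repeats):
    rep_splits = splits[rep * n_splits:(rep + 1) * n_splits]
    test_indices = np.sort(np.concatenate([test for _, test in rep_splits]))
    covers_all.append(test_indices.tolist() == list(range(len(X_demo))))
print("Each repetition covers all samples:", covers_all)
```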


## Leave One Out (LOO)
LeaveOneOut (or LOO) is a simple cross-validation scheme. Each learning set is created by taking all the samples except one, the test set being the sample left out.
Thus, for ***n*** samples, we have ***n*** different training sets and ***n*** different test sets. This cross-validation procedure does not waste much data, as only one sample is removed from the training set:

In [16]:
from sklearn.model_selection import LeaveOneOut
X = data[['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']].values
loo = LeaveOneOut()


In [17]:
for train, test in loo.split(X):
    print("%s %s" % (train, test))

Streaming output truncated to the last 5000 lines.
[   0    1    2 ... 6193 6194 6195] [1197]
[   0    1    2 ... 6193 6194 6195] [1198]
[   0    1    2 ... 6193 6194 6195] [1199]
[   0    1    2 ... 6193 6194 6195] [1200]
[   0    1    2 ... 6193 6194 6195] [1201]
[   0    1    2 ... 6193 6194 6195] [1202]
[   0    1    2 ... 6193 6194 6195] [1203]
[   0    1    2 ... 6193 6194 6195] [1204]
[   0    1    2 ... 6193 6194 6195] [1205]
[   0    1    2 ... 6193 6194 6195] [1206]
[   0    1    2 ... 6193 6194 6195] [1207]
[   0    1    2 ... 6193 6194 6195] [1208]
[   0    1    2 ... 6193 6194 6195] [1209]
[   0    1    2 ... 6193 6194 6195] [1210]
[   0    1    2 ... 6193 6194 6195] [1211]
[   0    1    2 ... 6193 6194 6195] [1212]
[   0    1    2 ... 6193 6194 6195] [1213]
[   0    1    2 ... 6193 6194 6195] [1214]
[   0    1    2 ... 6193 6194 6195] [1215]
[   0    1    2 ... 6193 6194 6195] [1216]
[   0    1    2 ... 6193 6194 6195] [1217]
[   0    1    2 ... 6193

As an estimator of the test error, LOO often has high variance, and it requires fitting the model ***n*** times.
5- or 10-fold cross-validation should generally be preferred over LOO.
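
The sketch below uses a small synthetic dataset and a linear model (both assumptions for illustration) to show two practical points. LeaveOneOut produces the same splits as KFold with as many folds as samples, and each test set holds a single sample, so R² cannot be computed per split and an error-based scorer such as `neg_mean_squared_error` is used instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Twenty hypothetical samples
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(20, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=20)

# LeaveOneOut yields the same splits as KFold with n_splits = number of samples
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_demo)]
kfold_splits = [(tr.tolist(), te.tolist())
                for tr, te in KFold(n_splits=len(X_demo)).split(X_demo)]
same_splits = loo_splits == kfold_splits
print("Same splits as KFold(n_splits=n):", same_splits)

# One model fit per sample; scored with (negated) squared error per left-out sample
loo_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                             cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("Number of fits:", len(loo_scores))
print("Mean squared error: %0.4f" % -loo_scores.mean())
```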

# Hyper-parameter


## Grid search

Grid search is the process of trying every combination of a predefined set of hyperparameter values to find the best-performing ones for a given model. Automating the assignment of hyperparameters in this way is important, since it improves the performance of the model without manual trial and error.

First, we need to import GridSearchCV from the sklearn library. Its *estimator* parameter takes the model whose hyperparameters we want to tune.

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

In [19]:
# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

In [20]:
# Set the parameters grid
param_grid = {
    'C': [0.1, 1, 100, 1000],
    'gamma': [0.1, 1, 3, 5],
    'epsilon': [0.1 , 0.5, 1, 5, 10]
}

The param_grid parameter takes a dictionary that maps each hyperparameter name of the specified estimator to the list of values to try. The most significant hyperparameters when working with the SVR model with the rbf kernel are C, gamma and epsilon.

**NOTE**: You can change these values and experiment more to see which ranges perform best. Cross-validation is performed for every combination to determine the set of hyperparameter values that gives the best score (here, the negative mean squared error).
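
Because every combination in the grid is evaluated, the cost of the search grows quickly. As a rough sketch, `ParameterGrid` can count the candidates for the grid above; the copy named `param_grid_demo` and the assumption of 5-fold cross-validation are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid above, copied so this snippet is self-contained
param_grid_demo = {
    'C': [0.1, 1, 100, 1000],
    'gamma': [0.1, 1, 3, 5],
    'epsilon': [0.1, 0.5, 1, 5, 10],
}

# Every combination of values is one candidate: 4 * 4 * 5
n_candidates = len(ParameterGrid(param_grid_demo))

# Assuming 5-fold cross-validation, each candidate is fitted 5 times
n_fits = n_candidates * 5

print("Candidates:", n_candidates)  # 80
print("Model fits:", n_fits)        # 400
```

Adding one more value to any list multiplies the work, so it can help to start with a coarse grid and refine it around the best region.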

In [21]:
gsc = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=5, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
grid_result = gsc.fit(X, y)

Then, use the best set of hyperparameter values chosen by the grid search in the final model to obtain better performance.

In [None]:
print("Best parameters set found on development set:")
print(grid_result.best_params_)

Best parameters set found on development set:
{'C': 1000, 'epsilon': 10, 'gamma': 0.1}
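
One way to carry the selected values into a final model is sketched below. It runs on a small synthetic dataset with a deliberately tiny grid so that it finishes quickly; the data, the grid values, and the variable names are assumptions for illustration, not the housing results. With the default `refit=True`, GridSearchCV also exposes an already fitted `best_estimator_`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(120, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# Search a small grid on the training half only
small_grid = {'C': [0.1, 1, 10], 'gamma': [0.1, 1]}
search_demo = GridSearchCV(SVR(kernel='rbf'), small_grid, cv=5,
                           scoring='neg_mean_squared_error')
search_demo.fit(X_tr, y_tr)

# Option 1: build a new model from the best hyperparameters
final_model = SVR(kernel='rbf', **search_demo.best_params_).fit(X_tr, y_tr)
final_r2 = final_model.score(X_te, y_te)

# Option 2: reuse the estimator that GridSearchCV refitted on the training half
refit_r2 = search_demo.best_estimator_.score(X_te, y_te)

print("Best parameters:", search_demo.best_params_)
print("Held-out R^2: %0.3f (new model), %0.3f (best_estimator_)" % (final_r2, refit_r2))
```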


## Random Search for Classification

We will explore hyperparameter optimization of the logistic regression model.

First, we will define the model that will be optimized and use default values for the hyperparameters that will not be optimized.

In [34]:
# random search logistic regression model on the sonar dataset
from scipy.stats import loguniform
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

In [35]:
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)

In [36]:
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]

In [38]:
# define model
model = LogisticRegression()

We will evaluate model configurations using repeated stratified k-fold cross-validation with three repeats and 10 folds.

In [26]:
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

Next, we can define the search space.

This is a dictionary where names are arguments to the model and values are distributions from which to draw samples. We will optimize the solver, the penalty, and the C hyperparameters of the model with discrete distributions for the solver and penalty type and a log-uniform distribution from 1e-5 to 100 for the C value.

Log-uniform is useful for searching penalty values as we often explore values at different orders of magnitude, at least as a first step.
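
To see what "different orders of magnitude" means in practice, the sketch below draws samples from the same log-uniform range and counts how many land in each decade. The sample size and the seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import loguniform

# Draw 1000 candidate values for C from the log-uniform range 1e-5 to 100
samples = loguniform(1e-5, 100).rvs(size=1000, random_state=0)

# Group the samples by order of magnitude (the floor of log10)
exponents = np.floor(np.log10(samples)).astype(int)
decades, counts = np.unique(exponents, return_counts=True)

# Each decade receives a similar share of the samples
for decade, count in zip(decades, counts):
    print("1e%+d to 1e%+d: %d samples" % (decade, decade + 1, count))
```

A plain uniform distribution over the same range would place almost all samples between 1 and 100 and would rarely try very small values of C.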

In [39]:
# define search space
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = loguniform(1e-5, 100)

Next, we can define the search procedure with all of these elements.

Importantly, we must set the number of iterations or samples to draw from the search space via the “n_iter” argument. In this case, we will set it to 500.

In [40]:
# define search
search = RandomizedSearchCV(model, space, n_iter=500, scoring='accuracy', n_jobs=-1, cv=cv, random_state=1)

Finally, we can perform the optimization and report the results.

In [41]:
# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Best Score: 0.7897619047619049
Best Hyperparameters: {'C': 4.878363034905756, 'penalty': 'l2', 'solver': 'newton-cg'}


Running the example may take a minute. It is fast because we are using a small search space and a fast model to fit and evaluate. You may see some warnings during the optimization for invalid configuration combinations. These can be safely ignored.

At the end of the run, the best score and hyperparameter configuration that achieved the best performance are reported.

Your specific results will vary given the stochastic nature of the optimization procedure. Try running the example a few times.

In this case, we can see that the best configuration achieved a mean accuracy of about 79.0 percent, which is fair, along with the specific values of the solver, penalty, and C hyperparameters used to achieve that score.