<a href="https://colab.research.google.com/github/jfogarty/machine-learning-intro-workshop/blob/master/notebooks/hyperparameter_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter Tuning

From [Hyperparameter Tuning](https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624) by [Tara Boyle](https://taraboyle.me/data-science/) in [towardsdatascience.com](https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624)

Updated by [John Fogarty](https://github.com/jfogarty) for Python 3.6 and [Base2 MLI](https://github.com/base2solutions/mli) and [colab](https://colab.research.google.com) standalone evaluation.

[Kaggle’s](https://www.kaggle.com/c/dont-overfit-ii) Don’t Overfit II competition presents an interesting problem. We have 20,000 rows of continuous variables, with only 250 of them belonging to the training set.
The challenge is not to overfit.

With such a small dataset, and an even smaller training set, this can be a difficult task!
In this article, we’ll explore hyperparameter optimization as a means of preventing overfitting.

## Hyperparameter Tuning

[Wikipedia states](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that “hyperparameter tuning is choosing a set of optimal hyperparameters for a learning algorithm”. So what is a [hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))?

> *A hyperparameter is a parameter whose value is set before the learning process begins.
Some examples of hyperparameters include penalty in logistic regression and loss in stochastic gradient descent.*

In [sklearn](https://scikit-learn.org/stable/modules/grid_search.html#grid-search), hyperparameters are passed in as arguments to the constructor of the model classes.
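
As a small illustration of that point, the sketch below builds a `LogisticRegression` with a few hyperparameters set in the constructor and reads them back. The specific values (`C=0.5`, `C=2.0`) are arbitrary choices for the example, not settings used later in this article.

```python
from sklearn.linear_model import LogisticRegression

# Hyperparameters are fixed when the estimator is constructed, before fit() is called.
model = LogisticRegression(C=0.5, penalty='l2', solver='liblinear')

# get_params() returns the current hyperparameter settings as a dict.
params = model.get_params()
print(params['C'], params['penalty'], params['solver'])

# set_params() changes them on an existing estimator without refitting anything.
model.set_params(C=2.0)
print(model.get_params()['C'])
```

Nothing is learned from data here; the values only control how a later call to `fit` behaves.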

## Tuning Strategies

We will explore two different methods for optimizing hyperparameters:

- **Grid Search**

- **Random Search**

We’ll begin by preparing the data and trying several different models with their default hyperparameters. From these we’ll select the top two performing methods for hyperparameter tuning.

In [0]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

from sklearn import linear_model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

import warnings
warnings.filterwarnings('ignore')

np.random.seed(27)

import os
REPODATA='https://github.com/plotly/datasets/blob/master/titanic.csv'
RAWDATA='https://raw.githubusercontent.com/plotly/datasets/master/titanic.csv'
filename='titanic.csv'
TMPDATA='./tmpData'
os.makedirs(TMPDATA, exist_ok=True)
datafile=os.path.join(TMPDATA, filename)
!curl $RAWDATA -o $datafile

In [0]:
# setting up default plotting parameters
%matplotlib inline

plt.rcParams['figure.figsize'] = [20.0, 7.0]
plt.rcParams.update({'font.size': 22,})

sns.set_palette('viridis')
sns.set_style('white')
sns.set_context('talk', font_scale=0.8)

In [4]:

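# NOTE: train.csv and test.csv are the Don't Overfit II competition files;
# the download cell above does not fetch them, so place them in TMPDATA first.
train = pd.read_csv(os.path.join(TMPDATA, 'train.csv'))
test = pd.read_csv(os.path.join(TMPDATA, 'test.csv'))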

print('Train Shape: ', train.shape)
print('Test Shape: ', test.shape)

train.head()


Prepare the data sets for training.

In [0]:
# prepare for modeling
X_train = train.drop(['id', 'target'], axis=1)
y_train = train['target']

X_test = test.drop(['id'], axis=1)

# scaling data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Baseline Models

In [0]:
# define models
ridge = linear_model.Ridge()
lasso = linear_model.Lasso()
elastic = linear_model.ElasticNet()
lasso_lars = linear_model.LassoLars()
bayesian_ridge = linear_model.BayesianRidge()
logistic = linear_model.LogisticRegression(solver='liblinear')
sgd = linear_model.SGDClassifier()

Here we select seven common linear models from sklearn: five regressors and two classifiers.

In [0]:
models = [ridge, lasso, elastic, lasso_lars, bayesian_ridge, logistic, sgd]

### Get Basic Metrics

We then find the mean cross validation score and standard deviation:

In [0]:
# function to get cross validation scores
def get_cv_scores(model):
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print('CV Mean: ', np.mean(scores))
    print('STD: ', np.std(scores))
    print('\n')
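
To make the cross validation step more concrete, here is a rough standard-library sketch of what a 5-fold split over 250 training rows looks like and how the per-fold scores are summarized. It uses plain contiguous folds, whereas sklearn typically stratifies folds for classifiers, and the `fold_scores` values are hypothetical placeholders rather than results from this notebook.

```python
from statistics import mean, pstdev

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, validation) index pairs."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, val))
        start += size
    return folds

# 250 training rows split 5 ways: each model is fit on 200 rows and scored on 50.
folds = k_fold_indices(n_samples=250, k=5)
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))

# Hypothetical per-fold scores, only to show the summary step.
fold_scores = [0.70, 0.75, 0.80, 0.72, 0.78]
print('CV Mean: ', mean(fold_scores))
print('STD: ', pstdev(fold_scores))  # population std, matching np.std's default
```

With validation folds of only 50 rows, a large standard deviation across folds is to be expected.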

In [0]:
# loop through list of models
for model in models:
    print(model)
    get_cv_scores(model)

From this we can see our best performing models out of the box are logistic regression and stochastic gradient descent. Let's see if we can optimize these models with hyperparameter tuning.


## Logistic Regression and Grid Search

Grid search is a traditional way to perform hyperparameter optimization. It works by searching exhaustively through a specified subset of hyperparameters.
Using sklearn’s **[GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)**, we first define our grid of parameters to search over and then run the grid search.

In [0]:
penalty = ['l1', 'l2']
C = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
class_weight = [{1:0.5, 0:0.5}, {1:0.4, 0:0.6}, {1:0.6, 0:0.4}, {1:0.7, 0:0.3}]
solver = ['liblinear', 'saga']

param_grid = dict(penalty=penalty,
                  C=C,
                  class_weight=class_weight,
                  solver=solver)

grid = GridSearchCV(estimator=logistic, param_grid=param_grid, scoring='roc_auc', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)

print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)

We improved our cross validation score from 0.744 to 0.789!

The benefit of grid search is that it is guaranteed to find the best-scoring combination among the parameter values supplied. The drawback is that it can be very time consuming and computationally expensive.
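
To get a feel for that cost, the following sketch enumerates the logistic regression grid with `itertools.product`, which is in essence what an exhaustive search has to walk through. The fold count `n_folds = 5` is an assumption; the actual number depends on the `cv` setting and the sklearn version in use.

```python
from itertools import product

# Same value lists as the logistic regression grid in this article.
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'class_weight': [{1: 0.5, 0: 0.5}, {1: 0.4, 0: 0.6}, {1: 0.6, 0: 0.4}, {1: 0.7, 0: 0.3}],
    'solver': ['liblinear', 'saga'],
}

# Every combination of one value per hyperparameter is a candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumption: depends on the cv argument / sklearn version
print('candidates:', len(candidates))            # 2 * 8 * 4 * 2
print('model fits:', len(candidates) * n_folds)  # one fit per candidate per fold
print(candidates[0])
```

Adding one more hyperparameter with four values would multiply the number of fits by four, which is why grids grow expensive quickly.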

We can combat this with random search.

In [0]:
logistic = linear_model.LogisticRegression(C=1, class_weight={1:0.6, 0:0.4}, penalty='l1', solver='liblinear')
get_cv_scores(logistic)

In [0]:
# predict_proba returns one column per class; keep the positive class (column 1)
predictions = logistic.fit(X_train, y_train).predict_proba(X_test)[:, 1]
#### score 0.828 on public leaderboard

In [0]:
submission = pd.read_csv('../input/sample_submission.csv')
submission['target'] = predictions
#submission.to_csv('submission.csv', index=False)
submission.head()

## Stochastic Gradient Descent and Random Search
Random search is a random (obviously) search over specified parameter values.

### Random Search

Random search differs from grid search mainly in that it searches the specified subset of hyperparameters randomly instead of exhaustively. The major benefit is decreased processing time.

There is a tradeoff to decreased processing time, however. We aren’t guaranteed to find the optimal combination of hyperparameters.
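
The idea can be sketched in a few lines of plain Python. This is only an illustration, not how sklearn implements it: the `random_search` helper, the small `space`, and the `toy_score` function are all hypothetical, with `toy_score` standing in for a cross-validated metric. The sketch also samples with replacement for simplicity.

```python
import math
import random

def random_search(space, score_fn, n_iter, seed=0):
    """Try n_iter randomly drawn settings and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float('-inf')
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and with replacement.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space with 6 * 3 = 18 combinations.
space = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10],
         'penalty': ['l1', 'l2', 'elasticnet']}

def toy_score(params):
    # Hypothetical score that peaks at alpha=0.1 with the 'elasticnet' penalty.
    bonus = 0.5 if params['penalty'] == 'elasticnet' else 0.0
    return -abs(math.log10(params['alpha']) + 1) + bonus

few_params, few_score = random_search(space, toy_score, n_iter=5)
many_params, many_score = random_search(space, toy_score, n_iter=500)
print(few_params, few_score)
print(many_params, many_score)
```

With only a handful of draws the best setting may be missed; with many draws it is found, but at that point little time has been saved over a full grid.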

Let’s give random search a try with sklearn’s **[RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)**. Very similar to grid search above, we define the hyperparameters to search over before running the search.
An important additional parameter to specify here is n_iter. This specifies the number of combinations to randomly try.

- Selecting too low a number will decrease our chance of finding the best combination.

- Selecting too large a number will increase our processing time.
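
A quick back-of-the-envelope calculation shows how `n_iter` relates to the size of the search space. The list sizes below are taken from the SGD search space defined in the next cell. The sketch assumes candidates are drawn without replacement, which is what sklearn documents when every parameter is given as a list; under that assumption the chance that the single best combination is among the draws is simply `n_iter / total`.

```python
from functools import reduce
from operator import mul

# Number of values listed for each hyperparameter in the SGD search space.
space_sizes = {'loss': 5, 'penalty': 3, 'alpha': 8,
               'learning_rate': 4, 'class_weight': 4, 'eta0': 3}

# Total number of distinct combinations an exhaustive grid would need to try.
total = reduce(mul, space_sizes.values(), 1)
print('total combinations:', total)

# Fraction of the space covered for a few choices of n_iter.
coverage = {n_iter: n_iter / total for n_iter in (10, 100, 1000)}
for n_iter, fraction in coverage.items():
    print(n_iter, '{:.1%}'.format(fraction))
```

Under these assumptions, `n_iter=1000` covers a bit more than a sixth of the space while costing a bit more than a sixth of the fits of a full grid.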

In [0]:
# note: newer scikit-learn releases name the 'log' loss 'log_loss'
loss = ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron']
penalty = ['l1', 'l2', 'elasticnet']
alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
learning_rate = ['constant', 'optimal', 'invscaling', 'adaptive']
class_weight = [{1:0.5, 0:0.5}, {1:0.4, 0:0.6}, {1:0.6, 0:0.4}, {1:0.7, 0:0.3}]
eta0 = [1, 10, 100]

param_distributions = dict(loss=loss,
                           penalty=penalty,
                           alpha=alpha,
                           learning_rate=learning_rate,
                           class_weight=class_weight,
                           eta0=eta0)

random = RandomizedSearchCV(estimator=sgd, param_distributions=param_distributions, scoring='roc_auc', verbose=1, n_jobs=-1, n_iter=1000)
random_result = random.fit(X_train, y_train)

print('Best Score: ', random_result.best_score_)
print('Best Params: ', random_result.best_params_)

In [0]:
sgd = linear_model.SGDClassifier(alpha=0.1,
                                 class_weight={1:0.7, 0:0.3},
                                 eta0=100,
                                 learning_rate='optimal',
                                 loss='log',
                                 penalty='elasticnet')
get_cv_scores(sgd)

In [0]:
# predict_proba returns one column per class; keep the positive class (column 1)
predictions = sgd.fit(X_train, y_train).predict_proba(X_test)[:, 1]
#### score 0.790 on public leaderboard

Here we improved the cross validation score from 0.733 to 0.780!

## Conclusion

Here we explored two methods for hyperparameter tuning and saw improvement in model performance.

While this is an important step in modeling, it is by no means the only way to improve performance.

In future articles we will explore other means to prevent overfitting including feature selection and ensembling.

In [0]:
submission = pd.read_csv('../input/sample_submission.csv')
submission['target'] = predictions
submission.to_csv('submission.csv', index=False)
submission.head()

Because this notebook targets a Kaggle competition, the predictions are saved to a submission CSV file.

### End of notebook.