#### Parameter tuning tutorial

Our focus here is how to find good model parameters (strictly speaking, hyperparameters) for a given dataset. We can always train a model with the default parameters, whether we use sklearn or another machine learning module, but what we really want are parameters that give higher accuracy or lower loss. So how do we find them?

There are two common ways to search for the parameters that suit our dataset: Grid Search and Random Search. One thing to notice: whichever one we use, we are not guaranteed to find the best possible model! Both methods only try a finite set of candidates, so what we get is a good, near-optimal model. This is a useful point for interviews, really important!

Here we use sklearn to explain these two methods.

In [1]:
# first, import the modules
import numpy as np
from scipy.stats import distributions
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
import warnings

np.random.seed(1)

# I don't want to see the warning again!
warnings.simplefilter('ignore')

In [2]:
# load the dataset; this is a regression dataset
x, y = load_diabetes(return_X_y=True)

# train_test_split shuffles the data and splits it at random (it is not stratified unless `stratify` is passed)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=.2, random_state=1234)

In [3]:
# construct the estimator instance that we will tune later
# for linear regression models there are 3 common types: linear regression, lasso, ridge.
# They share the linear function Y = W*X (plus an intercept), but lasso and ridge
# add a regularization term. Lasso adds l1-regularization,
# which pushes the weights of features that don't provide much info to exactly 0;
# ridge adds l2-regularization, which shrinks the weights and spreads them more evenly across correlated features.
# I will explain more about this later.
lr = Ridge()

# get the default parameters; alpha is the regularization strength that we want to tune.
lr.get_params()

{'alpha': 1.0,
 'copy_X': True,
 'fit_intercept': True,
 'max_iter': None,
 'normalize': False,
 'random_state': None,
 'solver': 'auto',
 'tol': 0.001}
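
To make the difference between l1 and l2 regularization concrete, here is a minimal sketch that fits both `Ridge` and `Lasso` on the same diabetes data and counts how many coefficients end up exactly zero. The value `alpha=1.0` is only an illustrative choice and the variable names are my own, not part of the tutorial's cells.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

# alpha=1.0 is an arbitrary illustrative strength, not a tuned value
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Lasso can set coefficients exactly to zero; Ridge only shrinks them
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("zero coefficients (ridge):", n_zero_ridge)
print("zero coefficients (lasso):", n_zero_lasso)
```

With this setting the lasso should drop several features completely, while ridge keeps every feature with a smaller weight.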

In [4]:
# train the model using the default parameters
lr.fit(xtrain, ytrain)

pred = lr.predict(xtest)

base_loss = metrics.mean_squared_error(ytest, pred)

# here we use the mean squared error for evaluation: loss = mean((y - pred) ** 2)
print("Default parameters MSE: ", base_loss)

Default parameters MSE:  3227.4560610246067
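
The number printed above is a loss, so lower is better. As a small sketch of what the metric computes, here is a plain Python version of the mean squared error. The helper name `mse` and the toy numbers are made up for illustration; in practice you would keep using `metrics.mean_squared_error`.

```python
def mse(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# toy values: squared errors are 1, 0 and 4, so the mean is 5 / 3
example_loss = mse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(example_loss)
```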


#### Grid search

The grid search algorithm is really simple: you predefine the values you want to try for each parameter, and at each step the model is refit with the next combination from that grid.
One thing to notice is that we use cross-validation to score each combination and pick the best model. I will also explain cross-validation later.
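
Before using sklearn, here is a rough sketch of the idea behind grid search in plain Python: enumerate every combination in the grid, score it, and keep the best. The function `toy_score` is a hypothetical stand-in for a cross-validated score, and the grid values are invented for the example.

```python
from itertools import product


def toy_score(alpha, fit_intercept):
    """Hypothetical stand-in for a cross-validated score (higher is better)."""
    return -(alpha - 0.1) ** 2 + (0.05 if fit_intercept else 0.0)


# illustrative grid, not recommended ranges
toy_grid = {"alpha": [0.001, 0.01, 0.1, 1, 10], "fit_intercept": [True, False]}

names = list(toy_grid)
best_params, best_score = None, float("-inf")

# try every combination of the listed values
for values in product(*(toy_grid[name] for name in names)):
    params = dict(zip(names, values))
    score = toy_score(**params)
    if score > best_score:
        best_params, best_score = params, score

print("best parameters:", best_params)
```

The cost grows with the product of the list lengths (here 5 x 2 = 10 candidates), which is why the grid has to stay small.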

Now we can start the grid search.

In [5]:
# there are two ways to define the parameters; both create a dictionary
# whose keys are parameter names and whose values are lists or arrays of the values to try.
# the first way
param_grid = {'alpha': [.0001, .001, .01, .1, 1, 10]}

print("First way parameter:", param_grid)

# the second way
alpha = [.0001, .001, .01, .1, 1, 10, 100]
param_grid = dict(alpha=alpha)

print("Second way parameter:", param_grid)

# both ways build the same kind of dictionary and both work! (note the second list also includes 100)

First way parameter: {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10]}
Second way parameter: {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}


In [6]:
# cross-validated grid search; cv is the number of cross-validation folds
grid_search = GridSearchCV(estimator=lr, param_grid=param_grid, cv=3, verbose=1)

# fit the search on the training data!
# Note: whichever method we choose, we should train and tune the model only on the training data;
# the test data should be used just once, when you evaluate your final model
grid_search.fit(xtrain, ytrain)

Fitting 3 folds for each of 7 candidates, totalling 21 fits


[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.0s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
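
The log line "3 folds for each of 7 candidates" comes from cross-validation: each candidate is trained 3 times, each time holding out a different third of the training data for scoring. Below is a simplified sketch of how consecutive, unshuffled folds can be built. The helper `k_fold_indices` is my own illustration and not sklearn's implementation.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs for consecutive, unshuffled folds."""
    # spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

Every sample is used for validation exactly once, and the score of a candidate is the average over the folds.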

In [7]:
# after fitting finishes, we can get the best parameters found, the best estimator,
# and the per-candidate cross-validation info, as below.
# with scoring=None, best_score_ is the estimator's default score (R^2 for Ridge), so higher is better

best_score_grid = grid_search.best_score_
print('best score: ', best_score_grid)

print('*'*30)
print("Best find parameter:", grid_search.best_params_)

best_model = grid_search.best_estimator_
print("Best fitted model: ", best_model)

print('*'*30)
print('training step evaluation metrics:', grid_search.cv_results_.keys())

print('*'*30)
print('model training step info:')
print(grid_search.cv_results_)

best score:  0.48875139903342185
******************************
Best find parameter: {'alpha': 0.001}
Best fitted model:  Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
******************************
training step evaluation metrics: dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_alpha', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'mean_train_score', 'std_train_score'])
******************************
model training step info:
{'mean_fit_time': array([0.00133006, 0.00034237, 0.00165606, 0.00031964, 0.0006539 ,
       0.00067862, 0.00065136]), 'std_fit_time': array([0.00046924, 0.00048418, 0.00095272, 0.00045204, 0.00046255,
       0.00048014, 0.00046089]), 'mean_score_time': array([0.        , 0.00066471, 0.       

In [8]:
# compare with the base model on the test set; note the improvement is expressed relative to the tuned model's loss
pred_grid = best_model.predict(xtest)

grid_loss = metrics.mean_squared_error(ytest, pred_grid)

print("compare with base model, grid improve %.2f %s" % ((base_loss - grid_loss) / grid_loss * 100, '%'))

print('base:', base_loss)
print('grid:', grid_loss)

compare with base model, grid improve 8.92 %
base: 3227.4560610246067
grid: 2963.2130044925148


#### Tips
One thing to notice: to keep this model selection tutorial simple, I tune only one parameter. In fact, you can tune more than one parameter in a single pass, and choosing a sensible range for each parameter is something you learn from real machine learning projects!
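
As a sketch of tuning two parameters in one pass, the grid can simply carry a second key. The name `multi_grid` and its value ranges are illustrative guesses, not recommended settings.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

# two parameters: 4 alpha values x 2 fit_intercept values = 8 candidates
multi_grid = {"alpha": [0.001, 0.01, 0.1, 1.0], "fit_intercept": [True, False]}

multi_search = GridSearchCV(Ridge(), param_grid=multi_grid, cv=3)
multi_search.fit(X_train, y_train)

n_candidates = len(multi_search.cv_results_["params"])
print("candidates tried:", n_candidates)
print("best parameters:", multi_search.best_params_)
```

Each extra parameter multiplies the number of candidates, so the search time grows quickly as the grid gets wider.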

#### Random Search

I used grid search many times in my previous work, but random search is often more efficient at finding good parameters. The only difference between the two is that random search draws each parameter's value from a distribution (the example below uses an exponential distribution) instead of a fixed list, so we have to say how many candidates to draw with the n_iter parameter. I will show you.
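
To see what "drawing values from a distribution" means, here is a small standard-library sketch that samples `n_iter` candidate alphas from an exponential distribution with rate 1. The seed and the variable names are assumptions for the example; sklearn does its own sampling internally from the distribution object you pass in.

```python
import random

rng = random.Random(0)  # fixed seed so the draw is repeatable
n_iter = 10

# each candidate alpha is one independent draw from Exponential(rate=1)
sampled_alphas = [rng.expovariate(1.0) for _ in range(n_iter)]

for alpha in sampled_alphas:
    print(round(alpha, 4))
```

Unlike a grid, the candidates are not limited to a few hand-picked values, so the search can land on values between the grid points.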

So here we start!

In [9]:
expon = distributions.expon()

param_distribution = {'alpha': expon}

# n_iter is how many candidate values we want to sample
rand_search = RandomizedSearchCV(estimator=lr, param_distributions=param_distribution, n_iter=10, cv=3)

# start to train
rand_search.fit(xtrain, ytrain)

RandomizedSearchCV(cv=3, error_score='raise',
          estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
          fit_params=None, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000013D00357828>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)

In [10]:
# evaluate the result just like the grid search

# after fitting finishes, we can get the best parameters found, the best estimator,
# and the per-candidate cross-validation info, as below.

best_score_rand = rand_search.best_score_
print('best score: ', best_score_rand)

print('*'*30)
print("Best find parameter:", rand_search.best_params_)

best_model = rand_search.best_estimator_
print("Best fitted model: ", best_model)

print('*'*30)
print('training step evaluation metrics:', rand_search.cv_results_.keys())

print('*'*30)
print('model training step info:')
print(rand_search.cv_results_)

best score:  0.48824244509444026
******************************
Best find parameter: {'alpha': 0.09688387165373345}
Best fitted model:  Ridge(alpha=0.09688387165373345, copy_X=True, fit_intercept=True,
   max_iter=None, normalize=False, random_state=None, solver='auto',
   tol=0.001)
******************************
training step evaluation metrics: dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_alpha', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'mean_train_score', 'std_train_score'])
******************************
model training step info:
{'mean_fit_time': array([0.00098745, 0.0006748 , 0.00066503, 0.00099627, 0.00098999,
       0.00067639, 0.00098594, 0.0006671 , 0.        , 0.        ]), 'std_fit_time': array([1.13932043e-05, 4.77919635e-04, 4.70246800e-04, 3.54879913e-06,
       1.67802572e-05

In [11]:
# compare the random search model with the base model result
pred_rand = best_model.predict(xtest)

rand_loss = metrics.mean_squared_error(ytest, pred_rand)

print("compare with base model, random search improve %.2f %s" % ((base_loss - rand_loss) / rand_loss * 100, '%'))

print('base:', base_loss)
print('rand:', rand_loss)

compare with base model, random search improve 10.70 %
base: 3227.4560610246067
rand: 2915.3913293918163


In [12]:
# not a big improvement in fact, but don't worry about that for this sample tutorial. What matters is that you learn how to use these two methods.