# Feature engineering and model selection - Model selection

In [1]:
%pylab
%matplotlib inline

%config InlineBackend.figure_format = 'retina'

import numpy as np

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib


### `GridSearchCV` selection

In the previous notebooks we saw several supervised and unsupervised models, all of which need parameters to be set when they are constructed. To choose those values, `scikit-learn` provides the `GridSearchCV` class, which fits and cross-validates the model for every combination in a parameter grid and keeps the best one. We are going to try it with Lasso regression on the `winequality-white.csv` file:

In [2]:
import pandas as pd

# Read file
data = pd.read_csv('data/winequality-white.csv', sep = ';')

# Separate the target (quality) from the features
target = 'quality'
features = list(data.columns)
features.remove(target)

x = data[features]
y = data[target]

In [3]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

In [4]:
# Alpha values to evaluate
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001])

# Create the model
model = Lasso()

# Set up the grid search
grid = GridSearchCV(estimator = model,                 # Estimator to tune
                    param_grid = dict(alpha = alphas), # Values of alpha to try
                    cv = 10)                           # Number of cross-validation folds

# Fit the model
grid.fit(x, y)

# Print the results
print('The best param is:', grid.best_params_)
print('The best score is:', grid.best_score_)

The best param is: {'alpha': 0.001}
The best score is: 0.243657025021
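Since no `scoring` argument is given, `best_score_` is the mean cross-validated R² of the best candidate. To see how every alpha performed, not only the winner, the `cv_results_` attribute can be turned into a table. The following is a minimal sketch that assumes a synthetic dataset from `make_regression` in place of the wine data, so its scores do not match the ones above; the names `X_demo`, `y_demo`, `demo_grid` and `summary` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the wine data (assumption: not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

# Same alpha grid as in the cell above, with fewer folds to keep the sketch fast
alphas = [1, 0.1, 0.01, 0.001, 0.0001]
demo_grid = GridSearchCV(estimator=Lasso(max_iter=10000),
                         param_grid={'alpha': alphas},
                         cv=5)
demo_grid.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the score across the folds
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[['param_alpha', 'mean_test_score',
                   'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between two alphas is larger than the fold-to-fold variation.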


`GridSearchCV` can also search over several parameters at once, evaluating every combination of their values. Let's see that:

In [5]:
# Parameter values to evaluate
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001])
fit_intercept = np.array([True, False])

# Create the model
model = Lasso()

# Set up the grid search
grid = GridSearchCV(estimator = model,                                 # Estimator to tune
                    param_grid = dict(alpha = alphas,
                                      fit_intercept = fit_intercept),  # Values to try for each parameter
                    cv = 10)                                           # Number of cross-validation folds

# Fit the model
grid.fit(x, y)

# Print the results
print('The best param is:', grid.best_params_)
print('The best score is:', grid.best_score_)



The best param is: {'alpha': 0.001, 'fit_intercept': True}
The best score is: 0.243657025021
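The cost of a grid search grows with the product of the number of values per parameter. As a rough sketch, `ParameterGrid` from `sklearn.model_selection` can enumerate the combinations that a grid like the one above produces; the variable names `candidates`, `n_folds` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, written as a plain dict of lists
param_grid = {'alpha': [1, 0.1, 0.01, 0.001, 0.0001],
              'fit_intercept': [True, False]}

# Every combination of the listed values is one candidate
candidates = list(ParameterGrid(param_grid))
n_folds = 10
n_fits = len(candidates) * n_folds

print('Candidates:', len(candidates))          # 5 alphas x 2 intercept settings = 10
print('Cross-validation fits:', n_fits)        # 10 candidates x 10 folds = 100
```

With the default `refit = True`, one more fit on the full data follows the search, using the best combination found.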




### `RandomizedSearchCV` selection

So far we have had to list the parameter values to test, but good candidate values are often not known in advance. An alternative is to use the `RandomizedSearchCV` class together with a random generator such as `sp_rand()`, an alias for `scipy.stats.uniform` that by default draws values from the interval [0, 1]. Let's see that class:

In [6]:
from scipy.stats             import uniform as sp_rand
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model    import Lasso

In [7]:
# Sample alpha from a uniform distribution on [0, 1]
param_grid = dict(alpha = sp_rand())

# Lasso model
model = Lasso()

# Set up the randomized search
rsearch = RandomizedSearchCV(estimator = model,                # Estimator to tune
                             param_distributions = param_grid, # Distributions to sample from
                             n_iter = 100,                     # Number of parameter settings sampled
                             cv = 10,                          # Number of cross-validation folds
                             random_state = 1)                 # Seed for reproducible sampling

# Fit the model
rsearch.fit(x, y)

# Print the results
print('The best param is:', rsearch.best_params_)
print('The best score is:', rsearch.best_score_)

The best param is: {'alpha': 0.00011437481734488664}
The best score is: 0.243620917135
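The best alpha found above is very small, while a uniform distribution on [0, 1] proposes such small values only rarely. One possible variation, not used in this notebook, is to sample alpha on a logarithmic scale with `scipy.stats.loguniform`. The sketch below assumes bounds of `1e-4` and `1` (taken from the earlier grid) and a synthetic dataset from `make_regression` in place of the wine data; the names `uniform_draws`, `log_draws` and `log_search` are illustrative.

```python
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

# Compare how often each distribution proposes a small alpha (below 0.01)
uniform_draws = uniform().rvs(size=1000, random_state=0)
log_draws = loguniform(1e-4, 1).rvs(size=1000, random_state=0)
print('uniform    share below 0.01:', (uniform_draws < 0.01).mean())
print('loguniform share below 0.01:', (log_draws < 0.01).mean())

# Synthetic stand-in for the wine data (assumption: not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

# Randomized search that samples alpha log-uniformly between 1e-4 and 1
log_search = RandomizedSearchCV(estimator=Lasso(max_iter=10000),
                                param_distributions={'alpha': loguniform(1e-4, 1)},
                                n_iter=20,
                                cv=5,
                                random_state=1)
log_search.fit(X_demo, y_demo)

print('The best param is:', log_search.best_params_)
print('The best score is:', log_search.best_score_)
```

A log-uniform distribution gives each order of magnitude the same chance of being sampled, which may suit regularization strengths better than a plain uniform draw.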
