## CHAPTER 12
---
# MODEL SELECTION

---
- In this book, model selection means choosing both the best learning algorithm and its best hyperparameters.
- Hyperparameters are the settings of a learning algorithm that we must choose before training starts.
- This chapter covers techniques for efficiently selecting the best model from a set of candidates.

## 12.1 Selecting Best Models Using Exhaustive Search

**Problem:** You want to select the best model by searching over a range of hyperparameters

**Solution:** Use scikit-learn’s GridSearchCV
- GridSearchCV is a brute-force approach to model selection using cross-validation.
- Specifically, a user defines sets of possible values for one or multiple hyperparameters, and then GridSearchCV trains a model using every value and/or combination of values. 
- The model with the best performance score is selected as the best model

In [1]:
# Load libraries
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression
logistic = linear_model.LogisticRegression(max_iter=10000)

# Create range of candidate regularization values
C = np.logspace(0, 4, 10)

# Create range of candidate solver values
l1_solver = ['liblinear', 'saga']
l2_solver = ['newton-cg', 'lbfgs', 'sag', 'saga']

# Create dictionary hyperparameter candidates
hyperparameters = [dict(C=C, penalty=['l1'], solver=l1_solver), 
                   dict(C=C, penalty=['l2'], solver=l2_solver)]

# Create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0)

# Fit grid search
best_model = gridsearch.fit(features, target)

# View best hyperparameters
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])
print('Best Solver:', best_model.best_estimator_.get_params()['solver'])

Best Penalty: l1
Best C: 2.7825594022071245
Best Solver: saga


**My additions to book's code:**
- The code, as written in the book, threw a lot of FitFailedWarning messages because some hyperparameter values cannot be combined (not every solver supports every penalty). I wish scikit-learn silently skipped those combinations, but it reports each failed fit.
- I added the solver values and modified the hyperparameters variable to separate l1 parameters from l2 parameters. 
- To my surprise it worked perfectly and I actually learned something new.
- I also added max_iter=10000 in the logistic variable because I was getting ConvergenceWarning

#### Discussion:
- Let's calculate the number of models from which the best was selected:
    - l1 grid: 10 values of C * 1 penalty (l1) * 2 solvers = 20 models
    - l2 grid: 10 values of C * 1 penalty (l2) * 4 solvers = 40 models
    - *Total:* 60 models
- The best model's parameters are: solver=saga, penalty=l1, and C=2.78
- By default, after identifying the best hyperparameters, GridSearchCV will retrain a model using the best hyperparameters on the entire dataset (rather than leaving a fold out for cross-validation). We can use this model to predict values like any other scikit-learn model:

In [2]:
# Predict target vector
best_model.predict(features)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [3]:
best_model.best_estimator_

LogisticRegression(C=2.7825594022071245, max_iter=10000, penalty='l1',
                   solver='saga')
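
The candidate count worked out in the discussion can also be checked in code. This is a minimal sketch, not part of the book's solution: it assumes scikit-learn's ParameterGrid helper, which expands a grid specification into individual combinations, and rebuilds the same two dictionaries used above. The names n_candidates and n_fits are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same candidate values as in the solution above
C = np.logspace(0, 4, 10)
hyperparameters = [
    dict(C=C, penalty=['l1'], solver=['liblinear', 'saga']),
    dict(C=C, penalty=['l2'], solver=['newton-cg', 'lbfgs', 'sag', 'saga']),
]

# Each dictionary is expanded separately, so l1 and l2 solvers never mix
n_candidates = len(ParameterGrid(hyperparameters))

# With cv=5, every candidate is trained once per fold
n_fits = n_candidates * 5

print(n_candidates)  # 60
print(n_fits)        # 300
```

The final refit on the whole dataset adds one more fit on top of these.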

One GridSearchCV parameter is worth noting: verbose. While mostly unnecessary, it can be reassuring during long searches to receive an indication that the search is progressing. The verbose parameter determines how many messages are printed during the search, with 0 showing no output and 1 to 3 printing messages with increasing detail.

## 12.2 Selecting Best Models Using Randomized Search

**Problem:** You want a computationally cheaper method than exhaustive search to select the best model.

**Solution:** Use scikit-learn’s RandomizedSearchCV

In [4]:
# Load libraries
from scipy.stats import uniform
from sklearn import linear_model, datasets
from sklearn.model_selection import RandomizedSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression
logistic = linear_model.LogisticRegression(max_iter=1000)

# Create range of candidate regularization penalty hyperparameter values
penalty = ['l1', 'l2']

# Create distribution of candidate regularization hyperparameter values
C = uniform(loc=0, scale=4)

# Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)

# Create randomized search
randomizedsearch = RandomizedSearchCV(
    logistic, hyperparameters, random_state=1, n_iter=100, cv=5, verbose=0,
    n_jobs=-1)

# Fit randomized search
best_model = randomizedsearch.fit(features, target)

# View best hyperparameters
best_params = best_model.best_estimator_.get_params()
print("Best penalty: {}".format(best_params['penalty']))
print("Best C: {}".format(best_params['C']))
print("Best Solver: {}".format(best_params['solver']))
print(best_model.predict(features))

Best penalty: l2
Best C: 3.730229437354635
Best Solver: lbfgs
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [5]:
# Define a uniform distribution between 0 and 4, sample 10 values
uniform(loc=0, scale=4).rvs(10)

array([0.34703081, 1.65515723, 2.10020241, 3.60456704, 3.75272077,
       0.02328594, 0.73854804, 1.88678024, 1.50348456, 2.63012833])

#### Discussion:
- With RandomizedSearchCV, if we specify a distribution, scikit-learn randomly samples hyperparameter values from that distribution. As an example of the general concept, above we randomly sample 10 values from a uniform distribution ranging from 0 to 4.
- Alternatively, if we specify a list of values, such as the two regularization penalty values ['l1', 'l2'], RandomizedSearchCV randomly samples from the list. According to the scikit-learn documentation, sampling is done without replacement when every hyperparameter is given as a list, and with replacement when at least one is given as a distribution.
- The number of sampled combinations of hyperparameters (i.e., the number of candidate models trained) is specified with the n_iter (number of iterations) setting.
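
To see what the sampled candidates look like, the sketch below draws a few of them directly. It is an illustration rather than the book's code: it assumes scikit-learn's ParameterSampler, which generates parameter settings from the same kind of dictionary that RandomizedSearchCV accepts. The variable name candidates and the choice of n_iter=5 are arbitrary.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Same search space as the solution: a distribution for C, a list for penalty
param_distributions = {
    "C": uniform(loc=0, scale=4),
    "penalty": ["l1", "l2"],
}

# n_iter controls how many hyperparameter combinations are drawn
candidates = list(
    ParameterSampler(param_distributions, n_iter=5, random_state=1))

# Each candidate is one dictionary of hyperparameters, i.e. one model to train
for candidate in candidates:
    print(candidate)
```

Each printed dictionary corresponds to one candidate model, so n_iter=100 in the solution means 100 such combinations are evaluated.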

## 12.3 Selecting Best Models from Multiple Learning Algorithms

**Problem:** You want to select the best model by searching over a range of learning algorithms and their respective hyperparameters.

**Solution:** Create a dictionary of candidate learning algorithms and their hyperparameters

I got a lot of FitFailedWarning messages when I ran the code from the book. One way around them is to search over the solver parameter instead of the penalty, as the code below does. If this were a real project, I would then tune the model further by refining the hyperparameters.

In [6]:
# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Set random seed
np.random.seed(0)

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create a pipeline
pipe = Pipeline([("classifier", RandomForestClassifier())])

# Create dictionary with candidate learning algorithms and their hyperparameters
search_space = [{"classifier": [LogisticRegression(max_iter=10000)],
                 "classifier__solver": ['liblinear', 'newton-cg', 
                                        'lbfgs', 'sag', 'saga'],
                 "classifier__C": np.logspace(0, 4, 10)},
                {"classifier": [RandomForestClassifier()],
                 "classifier__n_estimators": [10, 100, 1000],
                 "classifier__max_features": [1, 2, 3]}]

# Create grid search
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)

# Fit grid search
best_model = gridsearch.fit(features, target)

# view best model
print(best_model.best_estimator_.get_params()['classifier'])

best_model.predict(features)

LogisticRegression(C=2.7825594022071245, max_iter=10000, solver='sag')


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
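
The search only reports the overall winner, so it can be useful to see how each algorithm did. The following is a rough sketch under my own assumptions: it uses a much smaller search space than the one above so that it runs quickly, and it reads the standard cv_results_ attribute of GridSearchCV. The dictionary best_per_algorithm is a name made up for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

features, target = datasets.load_iris(return_X_y=True)

# The "classifier" step is a placeholder that the search space replaces
pipe = Pipeline([("classifier", LogisticRegression(max_iter=10000))])

# Deliberately small search space (an assumption, to keep the run short)
search_space = [
    {"classifier": [LogisticRegression(max_iter=10000)],
     "classifier__C": [1.0, 10.0]},
    {"classifier": [RandomForestClassifier(random_state=0)],
     "classifier__n_estimators": [10, 50]},
]

search = GridSearchCV(pipe, search_space, cv=5).fit(features, target)

# Keep the best mean cross-validated score seen for each algorithm
best_per_algorithm = {}
results = search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    name = type(params["classifier"]).__name__
    best_per_algorithm[name] = max(
        score, best_per_algorithm.get(name, -np.inf))

for name, score in best_per_algorithm.items():
    print(name, round(score, 4))
```

If the two scores are close, the choice between algorithms may depend more on the fold split than on a real difference.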

## 12.4 Selecting Best Models When Preprocessing

**Problem:** You want to include a preprocessing step during model selection.

**Solution:** Create a pipeline that includes the preprocessing step and any of its parameters.

In [7]:
# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Set random seed
np.random.seed(0)

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create a preprocessing object that includes StandardScaler features and PCA
preprocess = FeatureUnion([("std", StandardScaler()), ("pca", PCA())])

# Create a pipeline
pipe = Pipeline([("preprocess", preprocess),
                 ("classifier", LogisticRegression())])

# Create space of candidate values
search_space = [{"preprocess__pca__n_components": [1, 2, 3],
                 "classifier__penalty": ["l1", "l2"],
                 "classifier__C": np.logspace(0, 4, 10)}]

# Create grid search
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)

# Fit grid search
best_model = clf.fit(features, target)

# View best model
best_model.best_estimator_.get_params()['preprocess__pca__n_components']

2

#### Discussion:
- FeatureUnion allows us to combine multiple preprocessing actions properly. 
- In our solution we use FeatureUnion to combine two preprocessing steps: 
    - standardize the feature values (StandardScaler) and 
    - Principal Component Analysis (PCA). 
- This object is called preprocess and contains both of our preprocessing steps. 
- We then include preprocess into a pipeline with our learning algorithm. 
    - The end result is that this allows us to outsource the proper (and confusing) handling of fitting, transforming, and training the models with combinations of hyperparameters to scikit-learn.
- We also defined 'preprocess__pca__n_components': [1, 2, 3] in the search space to indicate that we wanted to discover if one, two, or three principal components produced the best model.
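
To make the role of FeatureUnion concrete, the sketch below fits the preprocessing object on its own and looks at the result. It is an illustration, not the book's code, and it fixes n_components=2 as an assumption. FeatureUnion applies each transformer to the input and concatenates their outputs side by side.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

features, _ = datasets.load_iris(return_X_y=True)

# Both transformers see the original 4 features; outputs are stacked column-wise
preprocess = FeatureUnion([("std", StandardScaler()),
                           ("pca", PCA(n_components=2))])
combined = preprocess.fit_transform(features)

# 4 standardized columns + 2 principal components
print(combined.shape)  # (150, 6)

# Nested parameter names follow the step names joined by double underscores
print("pca__n_components" in preprocess.get_params())  # True
```

Inside the pipeline the same parameter gains the step prefix, which is why the search space refers to it as preprocess__pca__n_components.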

## 12.5 Speeding Up Model Selection with Parallelization

**Problem:** You need to speed up model selection.

**Solution:** Use all the cores in your machine by setting n_jobs=-1

In [12]:
# Load libraries
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression
logistic = linear_model.LogisticRegression(max_iter=1000)

# Create range of candidate regularization penalty hyperparameter values
penalty = ["l1", "l2"]

# Create range of candidate values for C
C = np.logspace(0, 4, 1000)

# Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)


#### Discussion:
- Setting n_jobs to -1 tells scikit-learn to use all cores. By default (n_jobs=None) the search runs as a single job, meaning it only uses one core.
- To demonstrate this, running the same grid search as in the solution with n_jobs=1 takes significantly longer to find the best model: 3.7 min, versus 48.6 s with n_jobs=-1.
- The n_jobs=1 run also emits a lot of FitFailedWarning messages, so its output is not shown.
- Set verbose to 1 or higher to see the elapsed times.

In [13]:
# Create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1, verbose=1)

# Fit grid search
best_model = gridsearch.fit(features, target)
best_model.best_estimator_

Fitting 5 folds for each of 2000 candidates, totalling 10000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  58 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 2160 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done 6160 tasks      | elapsed:   22.1s
[Parallel(n_jobs=-1)]: Done 9200 tasks      | elapsed:   43.2s
[Parallel(n_jobs=-1)]: Done 10000 out of 10000 | elapsed:   48.6s finished


LogisticRegression(C=5.354620899273607, max_iter=1000)
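
The comparison described in the discussion can be reproduced on a small scale. This is only a sketch under assumptions of my own: a much smaller grid than the one above, l2 regularization only (to avoid failed fits), and time.perf_counter for timing. The dictionary timings is a name invented for the example.

```python
import time

import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import GridSearchCV

features, target = datasets.load_iris(return_X_y=True)

logistic = linear_model.LogisticRegression(max_iter=5000)

# Small grid (assumption) so that both runs finish quickly
hyperparameters = {"C": np.logspace(0, 2, 10)}

timings = {}
for n_jobs in (1, -1):
    search = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=n_jobs)
    start = time.perf_counter()
    search.fit(features, target)
    timings[n_jobs] = time.perf_counter() - start
    print(f"n_jobs={n_jobs}: {timings[n_jobs]:.2f} s, "
          f"best C={search.best_params_['C']:.3f}")
```

On a grid this small, the cost of starting worker processes can outweigh the gain, so n_jobs=-1 is not guaranteed to be faster here. The benefit shows up on large searches like the 10,000 fits above.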

## 12.6 Speeding Up Model Selection Using Algorithm-Specific Methods

**Problem:** You need to speed up model selection.

**Solution:** If you are using a select number of learning algorithms, use scikit-learn’s model-specific cross-validation hyperparameter tuning. 
- For example, LogisticRegressionCV:

In [16]:
# Load libraries
from sklearn import linear_model, datasets

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create cross-validated logistic regression
logit = linear_model.LogisticRegressionCV(Cs=100, max_iter=1000)

# Train model
logit.fit(features, target)

LogisticRegressionCV(Cs=100, max_iter=1000)
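
After fitting, the candidate values and the selected value of C can be read from the model. The sketch below is my own illustration and assumes the fitted attributes Cs_ (the grid of candidates) and C_ (the selected value) of LogisticRegressionCV. It uses Cs=10 instead of 100 to keep the run short, and np.atleast_1d because the shape of C_ can differ between scikit-learn versions.

```python
import numpy as np
from sklearn import datasets, linear_model

features, target = datasets.load_iris(return_X_y=True)

# Cs=10 asks for 10 candidate values of C on a logarithmic scale
logit = linear_model.LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
logit.fit(features, target)

# Candidate values tried during cross-validation
candidate_Cs = logit.Cs_

# Value(s) of C chosen by cross-validation
selected_C = np.atleast_1d(logit.C_)

print("Candidates tried:", candidate_Cs)
print("Selected C:", selected_C)
```

Passing an integer for Cs lets the estimator build the grid itself. A list of floats can be passed instead when specific values should be tried.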

#### Discussion:
