# Finding the Optimal Hyperparameters
---
## GridSearchCV
- Used to find the optimal hyperparameters of a model (for example, the regularization strength)
- Grid searching can also be done with model-specific cross-validation estimators such as `LassoCV`, `RidgeCV`, `ElasticNetCV`, `LogisticRegressionCV` etc.
- Those model-specific estimators are often preferred where one exists, but `GridSearchCV` is a general way to find the optimal hyperparameters of any estimator
- Check out the scikit-learn documentation for [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
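
To make the contrast between the two approaches concrete, here is a minimal sketch that tunes `C` once with `LogisticRegressionCV` and once with `GridSearchCV`. The variable names (`X_demo`, `Cs`, `logreg_builtin`, `logreg_grid`), the range of `C` values and `max_iter=1000` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = load_iris(return_X_y=True)
Cs = np.logspace(-3, 3, 7)  # candidate values for C (assumed range)

# Model-specific search: the estimator cross-validates over Cs internally
logreg_builtin = LogisticRegressionCV(Cs=Cs, cv=5, max_iter=1000)
logreg_builtin.fit(X_demo, y_demo)

# General search: wrap a plain estimator in GridSearchCV
logreg_grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": Cs}, cv=5)
logreg_grid.fit(X_demo, y_demo)

print("LogisticRegressionCV chose C =", logreg_builtin.C_)
print("GridSearchCV chose", logreg_grid.best_params_)
```

The two searches are not guaranteed to pick the same value, since they may score and refit candidates differently.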


## RandomizedSearchCV
- `GridSearchCV` can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters.
- A solution to this is to use `RandomizedSearchCV`, in which not all hyperparameter values are tried out.
- Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions.
- Check out the scikit-learn documentation for [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
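
The sampling idea described above can be sketched as follows. This is only an illustration: the log-uniform range for `C`, the names `param_dist_demo` and `random_search`, and the choice of `n_iter=10` are assumptions made for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = load_iris(return_X_y=True)

# C is drawn from a log-uniform distribution instead of a fixed list of values
param_dist_demo = {"C": loguniform(1e-3, 1e3)}

random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_dist_demo,
    n_iter=10,       # number of hyperparameter settings to sample
    cv=5,
    random_state=0,  # makes the sampled settings reproducible
)
random_search.fit(X_demo, y_demo)

# cv_results_ holds one entry per sampled setting
print(len(random_search.cv_results_["params"]))
print(random_search.best_params_)
```

The cost of the search is controlled by `n_iter`, regardless of how many hyperparameters are being tuned.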

by Will Campbell

In [58]:
import numpy as np
import pandas as pd

In [59]:
# Load dataset
from sklearn import datasets
iris = datasets.load_iris()

In [60]:
# Create feature matrix and target variable
X = iris.data
y = iris.target

In [61]:
# Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [62]:
# Standardize variables
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
# Apply the scaling learned on the training set; do not refit on the test set
X_test = ss.transform(X_test)

### Using GridSearchCV on a Logistic Regression model

In [63]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the full, unscaled iris data (X, y); the train/test split above is not used here
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

Tuned Logistic Regression Parameters: {'C': 268.26957952797272}
Best score is 0.98


### Using RandomizedSearchCV on a Decision Tree Regressor

In [64]:
# Load dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn import datasets
boston = datasets.load_boston()

In [65]:
# Create feature matrix and target variable
X = boston.data
y = boston.target

In [66]:
# Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [67]:
# Standardizing variables
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
# Apply the scaling learned on the training set; do not refit on the test set
X_test = ss.transform(X_test)

In [68]:
# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              'min_samples_split': range(2,16,2)}

# Instantiate a Decision Tree regressor: tree
tree = DecisionTreeRegressor()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Tuned Decision Tree Parameters: {'max_depth': None, 'max_features': 8, 'min_samples_leaf': 2, 'min_samples_split': 10}
Best score is 0.7489924408002584


Useful attributes for `GridSearchCV` and `RandomizedSearchCV`:

- `best_estimator_`: Estimator that was chosen by the search, i.e. the estimator which gave the highest score (or smallest loss if specified) on the left-out data
- `best_score_`: Mean cross-validated score of the `best_estimator_`
- `best_params_`: Parameter setting that gave the best results on the held-out data
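
As a small illustration of how these attributes relate to one another, the sketch below runs a grid search on the iris data and then inspects the results. The kNN model, the grid of `n_neighbors` values and the names `search`, `X_tr`, `X_te` are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_tr, y_tr)

# best_params_ is the setting at position best_index_ in cv_results_
print(search.best_params_)
print(search.cv_results_["params"][search.best_index_])

# best_score_ is the mean cross-validated score of that setting
print(search.best_score_)

# best_estimator_ is refit on all of X_tr (refit=True is the default),
# so it can be scored directly on data the search never saw
print(search.best_estimator_.score(X_te, y_te))
```

Because `best_score_` was used to select the setting, the score on `X_te` is the more honest estimate of how the tuned model generalizes.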

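- Note that `RandomizedSearchCV` cannot beat `GridSearchCV` when both draw from the same finite set of values and use the same folds, because the grid search tries every combination. When it samples from continuous distributions, it can land on values that a grid would skip.
- Its main advantage is that it saves computation time.
- Check out Adam's **'randomized_search'** notebook for more information on the differences between `GridSearchCV` and `RandomizedSearchCV`, as well as on putting these steps into a Pipeline.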
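
For a rough idea of what putting these steps into a Pipeline can look like, here is a minimal sketch. It is not taken from the notebook referenced above; the step names (`scaler`, `logreg`), the grid of `C` values and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The scaler is refit on the training part of every CV fold,
# so the validation part never influences the scaling
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
pipe_grid = {"logreg__C": np.logspace(-3, 3, 7)}

pipe_cv = GridSearchCV(pipe, pipe_grid, cv=5)
pipe_cv.fit(X_tr, y_tr)

print(pipe_cv.best_params_)
print(pipe_cv.score(X_te, y_te))  # scaling and prediction happen in one call
```

Searching over the whole pipeline means the raw, unscaled features can be passed to `fit` and `score`.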

## EXTRAS

### Build different types of SVM models on "car dataset" 
#### (no GridSearch yet)

In [69]:
from sklearn.svm import SVC

In [70]:
car = pd.read_csv('./datasets/car_evaluation/car.csv')

In [71]:
car.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,acceptability
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [72]:
y = car.acceptability.map(lambda x: 1 if x in ['vgood','good'] else 0)

import patsy

X = patsy.dmatrix('~ buying + maint + doors + persons + lug_boot + safety -1',
                  data=car, return_type='dataframe')

In [73]:
ss = StandardScaler()
Xs = ss.fit_transform(X)

In [74]:
y.value_counts() / len(y)
# baseline accuracy (always predicting class 0) is 92.2%

0    0.922454
1    0.077546
Name: acceptability, dtype: float64

In [75]:
from sklearn.model_selection import cross_val_score
lin_model = SVC(kernel='linear')

scores = cross_val_score(lin_model, Xs, y, cv=5)
print(scores)
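# Use distinct names so the StandardScaler stored in `ss` is not overwritten
score_mean = scores.mean()
score_std = scores.std()
print("Average score of linear SVM model: {:0.3} +/- {:0.3}".format(score_mean, score_std))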

[ 0.98554913  0.96242775  0.90462428  0.66473988  0.875     ]
Average score of linear SVM model: 0.878 +/- 0.114


In [76]:
rbf_model = SVC(kernel='rbf')

# Xs is the standardized feature matrix created above
scores = cross_val_score(rbf_model, Xs, y, cv=5)
print(scores)
score_mean = scores.mean()
score_std = scores.std()
print("Average score of rbf SVM model: {:0.3} +/- {:0.3}".format(score_mean, score_std))

[ 0.94219653  0.95086705  0.93930636  0.68786127  0.53488372]
Average score of rbf SVM model: 0.811 +/- 0.17


In [77]:
from sklearn.metrics import classification_report

def print_cm_cr(y_true, y_pred):
    """prints the confusion matrix and the classification report"""
    confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
    print(confusion)
    print(classification_report(y_true, y_pred))

In [78]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Xs, y, stratify=y, test_size=0.33)
lin_model.fit(X_train, y_train)
y_pred = lin_model.predict(X_test)
print_cm_cr(y_test, y_pred)

Predicted    0   1  All
Actual                 
0          521   6  527
1           10  34   44
All        531  40  571
             precision    recall  f1-score   support

          0       0.98      0.99      0.98       527
          1       0.85      0.77      0.81        44

avg / total       0.97      0.97      0.97       571



## Now compare SVM Classifier, kNN Classifier and logistic regression using the "car" dataset.

You should:

- Gridsearch optimal parameters for each model (for SVM, just gridsearch C and kernel).
- Cross-validate scores.
- Examine confusion matrices and classification reports.



In [79]:
# gridsearch kNN
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn_params = {
    'n_neighbors':[1,3,5,7,9,11,13,15],
    'weights':['distance','uniform']
}

knn_gs = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5, verbose=1)
knn_gs.fit(Xs, y)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    1.5s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15], 'weights': ['distance', 'uniform']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [80]:
knn_best = knn_gs.best_estimator_
print(knn_gs.best_params_)
print(knn_gs.best_score_)

{'n_neighbors': 5, 'weights': 'distance'}
0.7951388888888888


In [81]:
# gridsearch SVM
from sklearn.svm import SVC

svc_params = {
    'C':np.logspace(-3, 2, 10),
    'gamma':np.logspace(-5, 2, 10),
    'kernel':['linear','rbf']
}

svc_gs = GridSearchCV(SVC(), svc_params, cv=3, verbose=1)
svc_gs.fit(Xs, y)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:   21.7s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': array([  1.00000e-03,   3.59381e-03,   1.29155e-02,   4.64159e-02,
         1.66810e-01,   5.99484e-01,   2.15443e+00,   7.74264e+00,
         2.78256e+01,   1.00000e+02]), 'gamma': array([  1.00000e-05,   5.99484e-05,   3.59381e-04,   2.15443e-03,
         1.29155e-02,   7.74264e-02,   4.64159e-01,   2.78256e+00,
         1.66810e+01,   1.00000e+02]), 'kernel': ['linear', 'rbf']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [82]:
best_svc = svc_gs.best_estimator_
print(svc_gs.best_params_)
print(svc_gs.best_score_)

{'C': 0.001, 'gamma': 1.0000000000000001e-05, 'kernel': 'linear'}
0.9224537037037037
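
The best score here equals the 92.2% baseline computed earlier, which is the accuracy a model would get by always predicting the majority class. Two adjustments that may help are sketched below: passing the grid as a list of dictionaries, so that `gamma` is only varied for the `rbf` kernel (the linear kernel does not use it), and scoring with F1 on shuffled, stratified folds. The synthetic data stands in for the car dataset, and the grid sizes and names (`svc_grid`, `svc_search`, `cv_folds`) are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic imbalanced data used as a stand-in for the car dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

C_values = np.logspace(-3, 2, 6)
svc_grid = [
    # gamma has no effect on the linear kernel, so it is left out here
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["rbf"], "C": C_values, "gamma": np.logspace(-4, 1, 6)},
]

# Shuffled, stratified folds keep the class balance similar in every fold
cv_folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# F1 on the minority class is harder to satisfy by predicting only class 0
svc_search = GridSearchCV(SVC(), svc_grid, cv=cv_folds, scoring="f1")
svc_search.fit(X_demo, y_demo)

# 6 linear candidates + 6 * 6 rbf candidates
print(len(svc_search.cv_results_["params"]))
print(svc_search.best_params_)
print(svc_search.best_score_)
```

Some candidates may trigger an undefined-metric warning when they predict no positive samples in a fold; scikit-learn scores those folds as 0 for F1.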


In [83]:
# gridsearch logistic regression
from sklearn.linear_model import LogisticRegression

lr_params = {
    'penalty':['l1','l2'],
    'C':np.logspace(-4, 2, 40),
    'solver':['liblinear']
}

lr_gs = GridSearchCV(LogisticRegression(), lr_params, cv=5, verbose=1)
lr_gs.fit(Xs, y)

Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=1)]: Done 400 out of 400 | elapsed:   11.2s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-04,   1.42510e-04,   2.03092e-04,   2.89427e-04,
         4.12463e-04,   5.87802e-04,   8.37678e-04,   1.19378e-03,
         1.70125e-03,   2.42446e-03,   3.45511e-03,   4.92388e-03,
         7.01704e-03,   1.00000e-02,   1.42510e-02,   2.0...6e+01,
         3.45511e+01,   4.92388e+01,   7.01704e+01,   1.00000e+02]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [84]:
best_lr = lr_gs.best_estimator_
print(lr_gs.best_params_)
print(lr_gs.best_score_)

{'C': 0.0001, 'penalty': 'l1', 'solver': 'liblinear'}
0.9224537037037037
