
# Introduction to Gridsearching Hyperparameters


### Learning Objectives
- Understand what the terms gridsearch and hyperparameter refer to.
- Understand how to manually build a gridsearching procedure.
- Apply sklearn's `GridSearchCV` object with basketball data to optimize a KNN model.
- Practice using and evaluating attributes of the gridsearch object.
- Understand the pitfalls of searching large hyperparameter spaces.
- Practice the gridsearch procedure independently by optimizing a regularized logistic regression.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

<a id='intro'></a>

## What is "Gridsearching"? What are "hyperparameters"?

---

Models often have specifications that can be set. For example, when we choose a linear regression, we may decide to add a penalty to the loss function such as the Ridge or the Lasso. Those penalties require the regularization strength, alpha, to be set. 

**Model settings that we choose before fitting, such as alpha, are called hyperparameters.**

>What hyperparameters do you know for k-nearest neighbours?


Hyperparameters are different from the parameters of the model resulting from a fit, such as the coefficients. The hyperparameters are set prior to the fit and determine the behavior of the model.
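
To make the distinction concrete, here is a minimal sketch on made-up data (the arrays `X_demo` and `y_demo` are assumptions for illustration, not part of the basketball dataset). The `alpha` we pass in is a hyperparameter; `coef_` is a parameter learned by the fit.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (made up for illustration): y depends linearly on two features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# alpha is a hyperparameter: we choose it before the fit.
ridge = Ridge(alpha=1.0)
ridge.fit(X_demo, y_demo)

# coef_ holds parameters: they are estimated from the data during the fit.
print(ridge.get_params()['alpha'])
print(ridge.coef_)
```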

There is often more than one kind of hyperparameter to set for a model.

**The search for the optimal set of hyperparameters is called gridsearching.**



Gridsearching gets its name from the fact that we are searching over a "grid" of hyperparameters. For example, imagine the `n_neighbors` hyperparameter on the x-axis and `weights` on the y-axis: we need to test every point on that grid.
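
As a rough illustration of what "every point on the grid" means, sklearn's `ParameterGrid` helper can enumerate the combinations. The candidate values below are example choices, not a recommendation.

```python
from sklearn.model_selection import ParameterGrid

# Example grid: 5 values of n_neighbors crossed with 2 values of weights.
grid = {
    'n_neighbors': [1, 3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
}

# Each element is one dictionary of hyperparameters, i.e. one point on the grid.
points = list(ParameterGrid(grid))

print(len(points))
for point in points[:3]:
    print(point)
```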

**Gridsearching uses cross-validation internally to evaluate the performance of each set of hyperparameters.** More on this later.
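
For a single set of hyperparameters, that evaluation looks roughly like the sketch below. It uses synthetic data from `make_classification` as a stand-in (an assumption for illustration); the mean of the fold scores is the number a gridsearch would compare across candidates.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the basketball data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# One candidate set of hyperparameters.
candidate = KNeighborsClassifier(n_neighbors=5, weights='uniform')

# Five folds give five accuracy scores; their mean summarizes this candidate.
fold_scores = cross_val_score(candidate, X_demo, y_demo, cv=5)
mean_score = fold_scores.mean()

print(fold_scores)
print(mean_score)
```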

<a id='basketball-data'></a>

## Basketball data

---

To explore the process of gridsearching over sets of hyperparameters, we will use some basketball data. The data below has statistics for 4 different seasons of NBA basketball: 2013-2016.
- Each row holds aggregate statistics for one game, aggregated across all players in that match.
- Scraped from http://www.basketball-reference.com

Many of the columns in the dataset represent, for example, the mean of a statistic across the last 10 games. Non-target statistics are for *prior* games: they do not include information about player performance in the current game.

**We are interested in predicting whether the home team will win the game or not.** This is a classification problem.



### Load the data and create the target and predictor matrix
- The target will be a binary column of whether the home team wins.
- The predictors should be numeric statistics columns.

We're going to exclude these columns from the predictor matrix:

    ['GameId','GameDate','GameTime','HostName',
     'GuestName','total_score','total_line','game_line',
     'winner','loser','host_wins','Season']

In [2]:
data = pd.read_csv('./datasets/basketball_data.csv')

In [3]:
data.head()

Unnamed: 0,Season,GameId,GameDate,GameTime,HostName,GuestName,total_score,total_line,game_line,Host_HostRank,...,gPTS_avg10,gTS%_avg10,g3PAR_avg10,gFTr_avg10,gDRB%_avg10,gTRB%_avg10,gAST%_avg10,gSTL%_avg10,gBLK%_avg10,gDRtg_avg10
0,2013,201212090LAL,2012-12-09,6:30 pm,Los Angeles Lakers,Utah Jazz,227.0,207.5,7.5,13,...,99.0,0.5206,0.223,0.2981,69.22,50.05,61.57,8.63,10.31,110.87
1,2013,201212100PHI,2012-12-10,7:00 pm,Philadelphia 76ers,Detroit Pistons,201.0,186.5,5.5,13,...,90.3,0.5077,0.2144,0.3095,71.46,49.48,59.83,6.48,9.46,107.91
2,2013,201212100HOU,2012-12-10,7:00 pm,Houston Rockets,San Antonio Spurs,240.0,212.0,-7.0,12,...,108.0,0.5915,0.2743,0.2518,74.26,50.99,61.82,8.3,6.85,101.41
3,2013,201212110BRK,2012-12-11,7:00 pm,Brooklyn Nets,New York Knicks,197.0,195.5,-3.5,12,...,100.3,0.5473,0.3595,0.2544,74.23,47.88,52.07,9.31,7.64,109.24
4,2013,201212110DET,2012-12-11,7:30 pm,Detroit Pistons,Denver Nuggets,195.0,203.5,-4.5,11,...,101.1,0.5605,0.2173,0.3177,68.45,50.4,56.33,7.67,7.83,114.86


In [4]:
data.columns

Index([u'Season', u'GameId', u'GameDate', u'GameTime', u'HostName',
       u'GuestName', u'total_score', u'total_line', u'game_line',
       u'Host_HostRank', u'Host_GameRank', u'Guest_GuestRank',
       u'Guest_GameRank', u'host_win_count', u'host_lose_count',
       u'guest_win_count', u'guest_lose_count', u'game_behind', u'winner',
       u'loser', u'host_place_streak', u'guest_place_streak', u'hq1_avg10',
       u'hq2_avg10', u'hq3_avg10', u'hq4_avg10', u'hPace_avg10',
       u'heFG%_avg10', u'hTOV%_avg10', u'hORB%_avg10', u'hFT/FGA_avg10',
       u'hORtg_avg10', u'hFG_avg10', u'hFGA_avg10', u'hFG%_avg10',
       u'h3P_avg10', u'h3PA_avg10', u'h3P%_avg10', u'hFT_avg10', u'hFTA_avg10',
       u'hFT%_avg10', u'hORB_avg10', u'hDRB_avg10', u'hTRB_avg10',
       u'hAST_avg10', u'hSTL_avg10', u'hBLK_avg10', u'hTOV_avg10',
       u'hPF_avg10', u'hPTS_avg10', u'hTS%_avg10', u'h3PAR_avg10',
       u'hFTr_avg10', u'hDRB%_avg10', u'hTRB%_avg10', u'hAST%_avg10',
       u'hSTL%_avg10', u'hBLK%_

In [5]:
data.shape

(3768, 96)

In [6]:
data.Season.unique()

array([2013, 2014, 2015, 2016])

In [7]:
data.winner.head()

0             Utah Jazz
1    Philadelphia 76ers
2     San Antonio Spurs
3       New York Knicks
4        Denver Nuggets
Name: winner, dtype: object

Let's create our target variable: does the home team win?

In [8]:
data['host_wins'] = (data.HostName == data.winner).astype(int)

In [9]:
predictors = [c for c in data.columns if c not in ['GameId','GameDate','GameTime','HostName',
                                                   'GuestName','total_score','total_line','game_line',
                                                   'winner','loser','host_wins','Season']]
X = data[predictors]
y = data.host_wins.values

### Create the training and testing data
- Test data should be the 2016 season data, train data will be the previous seasons.
- Make sure to standardize your predictor matrix (easiest to do prior to splitting the data into training and testing, but technically not totally correct, because the test rows then influence the means and standard deviations used for scaling)!
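
The stricter alternative is to learn the scaling from the training rows only and then apply it to the test rows. A minimal sketch on made-up arrays (the names `X_train_raw` and `X_test_raw` are hypothetical and are not used elsewhere in this lesson):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 8 "training" rows and 4 "testing" rows with 3 features.
rng = np.random.RandomState(42)
X_train_raw = rng.normal(loc=10.0, scale=2.0, size=(8, 3))
X_test_raw = rng.normal(loc=10.0, scale=2.0, size=(4, 3))

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train_raw)  # learn mean/std from training rows only
X_test_std = scaler.transform(X_test_raw)        # reuse the training mean/std

# Training columns are centered at 0; test columns are only approximately so.
print(X_train_std.mean(axis=0))
print(X_test_std.mean(axis=0))
```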

In [10]:
from sklearn.preprocessing import StandardScaler

In [11]:
ss = StandardScaler()
Xs = ss.fit_transform(X)

In [12]:
X_train = Xs[data.Season.isin([2013, 2014, 2015])]
X_test = Xs[data.Season == 2016]

y_train = y[data.Season.isin([2013, 2014, 2015])]
y_test = y[data.Season == 2016]

<a id='fit-knn'></a>

## Fitting the default KNN

---

Below we can fit a default `KNeighborsClassifier` to predict win vs. not on the training data, then score it on the testing data. 

Make sure to compare your score to the baseline accuracy!
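
As a reminder, the baseline accuracy is what we would get by always predicting the majority class. The sketch below uses a small made-up target vector and shows two equivalent ways to compute it; `DummyClassifier` ignores the predictors, so a placeholder matrix is enough.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up binary target: 6 ones and 4 zeros.
y_demo = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])
X_placeholder = np.zeros((len(y_demo), 1))  # predictors are ignored by the dummy model

# By hand: the proportion of the majority class.
baseline_by_hand = max(np.mean(y_demo), 1 - np.mean(y_demo))

# With sklearn: a classifier that always predicts the most frequent class.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_placeholder, y_demo)
baseline_by_model = dummy.score(X_placeholder, y_demo)

print(baseline_by_hand)
print(baseline_by_model)
```

When the positive class is the majority, as with home wins here, `np.mean(y)` alone already gives the baseline.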

In [13]:
from sklearn.neighbors import KNeighborsClassifier

In [14]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [15]:
knn.score(X_test, y_test)

0.57258883248730963

In [16]:
print(np.mean(y_test))

0.603045685279


> Is this good?

## Searching for the best hyperparameters


Our default KNN performs quite poorly on the test data. But what if we changed the number of neighbors? The weighting? The distance metric?

These are all hyperparameters of the KNN. How would we do this manually? 







### Gridsearch pseudocode for our KNN

```python
accuracies = {}
for k in neighbors_to_test:
    for w in weightings_to_test:
        for d in distance_metrics_to_test:
            hyperparam_set = (k, w, d)
            knn = KNeighborsClassifier(n_neighbors=k, weights=w, metric=d)
            cv_accuracies = cross_val_score(knn, X_train, y_train, cv=5)
            accuracies[hyperparam_set] = np.mean(cv_accuracies)
```

In the pseudocode above, we would then find the key in the dictionary (a hyperparameter set) that has the largest value (mean cross-validated accuracy).
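
Finding that key is a one-liner with `max`. The dictionary below is filled with made-up accuracies purely to show the lookup.

```python
# Hypothetical results: (k, weights, metric) -> mean cross-validated accuracy.
accuracies = {
    (5, 'uniform', 'euclidean'): 0.58,
    (21, 'uniform', 'manhattan'): 0.61,
    (21, 'distance', 'manhattan'): 0.60,
}

# max() iterates over the keys and compares them by their stored accuracy.
best_set = max(accuracies, key=accuracies.get)
best_accuracy = accuracies[best_set]

print(best_set)
print(best_accuracy)
```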

<a id='gscv'></a>
### Using `GridSearchCV`

This would be an annoying process to have to do manually. Luckily sklearn comes with a convenience class for performing gridsearch:

```python
from sklearn.model_selection import GridSearchCV
```

The `GridSearchCV` has a handful of important arguments:

| Argument | Description |
| --- | ---|
| **`estimator`** | The sklearn instance of the model to fit on |
| **`param_grid`** | A dictionary where keys are hyperparameters for the model and values are lists of values to test |
| **`cv`** | The number of internal cross-validation folds to run for each set of hyperparameters |
| **`n_jobs`** | How many cores to use on your computer to run the folds (-1 means use all cores) |
| **`verbose`** | How much output to display (0 is none, 1 is limited, 2 is printouts for every internal fit) |


Below is an example for how one might set up the gridsearch for our KNN:

```python
knn_parameters = {
    'n_neighbors':[1,3,5,7,9],
    'weights':['uniform','distance']
}

knn_gridsearcher = GridSearchCV(KNeighborsClassifier(), knn_parameters, verbose=1)
knn_gridsearcher.fit(X_train, y_train)
```

**Try out the sklearn gridsearch below on the training data.**

In [17]:
from sklearn.model_selection import GridSearchCV

In [18]:
knn_params = {
    'n_neighbors':[1,3,5,9,15,21],
    'weights':['uniform','distance'],
    'metric':['euclidean','manhattan']
}

knn_gridsearch = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5, verbose=1)

knn_gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed:  2.0min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 3, 5, 9, 15, 21], 'metric': ['euclidean', 'manhattan'], 'weights': ['uniform', 'distance']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

<a id='gs-results'></a>
### Examining the results of the gridsearch

Once the gridsearch has fit (this can take a while!) we can pull out a variety of information and useful objects from the gridsearch object, stored as attributes. In the table below, `results` stands for the fitted gridsearch object:

| Property | Use |
| --- | ---|
| **`results.param_grid`** | Displays parameters searched over. |
| **`results.best_score_`** | Best mean cross-validated score achieved. |
| **`results.best_estimator_`** | Reference to model with best score.  Is usable / callable. |
| **`results.best_params_`** | The parameters that have been found to perform with the best score. |
| **`results.cv_results_`** | Dictionary of scores and timings for every parameter combination (replaces the older `grid_scores_` attribute, which was deprecated and later removed). |
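
In recent versions of sklearn the per-candidate scores live in the `cv_results_` attribute, which converts neatly to a DataFrame. The sketch below runs a small search on synthetic data (an assumption for illustration) and shows the columns that are usually most useful.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the basketball data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_search = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per hyperparameter combination.
results_df = pd.DataFrame(demo_search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results_df[cols].sort_values('rank_test_score'))
```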

**Print out the best score found in the search.**

In [19]:
knn_gridsearch.best_score_

0.61264822134387353

**Print out the set of hyperparameters that achieved the best score.**

In [20]:
knn_gridsearch.best_params_

{'metric': 'manhattan', 'n_neighbors': 21, 'weights': 'uniform'}

**Assign the best fit model (`best_estimator_`) to a variable and score it on the test data.**

Compare this model to the baseline accuracy and your default KNN.

In [21]:
best_knn = knn_gridsearch.best_estimator_
best_knn.score(X_test, y_test)

0.62538071065989853

In [22]:
print('baseline: {}'.format(np.mean(y_test)))
print('default KNN: {}'.format(knn.score(X_test, y_test)))

baseline: 0.603045685279
default KNN: 0.572588832487


<a id='caution'></a>

## A caution on gridsearching

---

Sklearn models often have many options/hyperparameters with many different possible values. It may be tempting to search over a wide variety of them. In general, this is not wise.

Remember that **gridsearch searches over all possible combinations of hyperparameters in the parameter dictionary!**

The KNN model class takes a wider range of options during instantiation than we have explored. Imagine that we had this as our parameter dictionary:

```python
parameter_grid = {
    'n_neighbors':range(1,151),
    'weights':['uniform','distance',custom_function],
    'algorithm':['ball_tree','kd_tree','brute','auto'],
    'leaf_size':range(1,152),
    'metric':['minkowski','euclidean'],
    'p':[1,2]
}
```

**How many different combinations will need to be tested?**

| Parameter | Potential Values | Unique Values |
| --- | ---| ---: |
| **n_neighbors** | int range 1-150 | 150 |
| **weights** | strs:  "uniform", "distance" or user defined function | 3 |
| **algorithm** | strs: "ball_tree", "kd_tree", "brute", "auto" | 4 |
| **leaf_size** | int range 1-151 | 151 |
| **metric** | strs: "minkowski", "euclidean" | 2 |
| **p** | int: 1=manhattan_distance, 2= euclidean_distance | 2 |
|| <br>_150 \* 3 \* 4 \* 151 \* 2 \* 2 = n combinations_ <br><br>| _1,087,200_ |

And that is over a million tests *before we even consider the number of cross-validation folds!*
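
The arithmetic can be checked directly by multiplying the lengths of the value lists. In this sketch `custom_function` is a hypothetical placeholder for a user-defined weighting function, and the 5 folds are an assumed cross-validation setting.

```python
from functools import reduce
from operator import mul


def custom_function(distances):
    """Hypothetical placeholder for a user-defined weighting function."""
    return distances


parameter_grid = {
    'n_neighbors': range(1, 151),
    'weights': ['uniform', 'distance', custom_function],
    'algorithm': ['ball_tree', 'kd_tree', 'brute', 'auto'],
    'leaf_size': range(1, 152),
    'metric': ['minkowski', 'euclidean'],
    'p': [1, 2],
}

# Number of grid points = product of the number of values per hyperparameter.
n_combinations = reduce(mul, (len(values) for values in parameter_grid.values()), 1)

# Each grid point is fit once per cross-validation fold (assuming cv=5).
n_fits = n_combinations * 5

print(n_combinations)
print(n_fits)
```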

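If you're not careful, gridsearching can quickly blow up. A lot of the hyperparameters we put in the dumb example above are either redundant or not useful. For instance, `metric='minkowski'` with `p=2` is the same distance as `metric='euclidean'`, and `leaf_size` affects the speed and memory use of the tree-based algorithms rather than the predictions.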

> **It is extremely important to understand what the hyperparameters do and think critically about what ranges are useful and relevant to your model!**


## Gridsearching over some hyperparameters while overriding defaults for others

If you look at the documentation for `KNeighborsClassifier`, by default, the `weights` option is 'uniform' and the number of neighbours is 5.  

What if I want to override the weights metric to 'distance' and grid search only through number of neighbours?

In [23]:
knn_params = {
    'n_neighbors':[1,3,5,9,15,21],
}

knn_gridsearch = GridSearchCV(KNeighborsClassifier(weights='distance'), knn_params, cv=5, verbose=1)

knn_gridsearch.fit(X_train, y_train)
knn_gridsearch.best_estimator_

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:   28.8s finished


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=21, p=2,
           weights='distance')

<a id='practice'></a>

## Independent practice: gridsearch regularization penalties with logistic regression

---

Logistic regression models can also apply the Lasso and Ridge penalties. The `LogisticRegression` class takes these regularization-relevant hyperparameters:

| Argument | Description |
| --- | ---|
| **`penalty`** | `'l1'` for Lasso, `'l2'` for Ridge |
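| **`solver`** | Must be a solver that supports the Lasso penalty, such as `'liblinear'`, for the Lasso penalty to work. |
| **`C`** | The inverse of the regularization strength: smaller values mean stronger regularization. Equivalent to `1./alpha` |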
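
Because `C` is `1./alpha`, small values of `C` regularize more heavily. The sketch below, on synthetic data from `make_classification` (an assumption for illustration), counts how many Lasso coefficients are driven exactly to zero at a few example values of `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 20 features, only 3 of them informative.
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=3, random_state=0)

zero_counts = {}
for C in [0.01, 1.0, 100.0]:
    lasso_lr = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    lasso_lr.fit(X_demo, y_demo)
    # Count coefficients that the Lasso penalty set exactly to zero.
    zero_counts[C] = int(np.sum(lasso_lr.coef_ == 0))

for C in sorted(zero_counts):
    print(C)
    print(zero_counts[C])
```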

**You should:**
1. Fit and validate the accuracy of a default logistic regression on the basketball data.
2. Perform a gridsearch over different regularization strengths and Lasso and Ridge penalties.
3. Compare the accuracy on the test set of your optimized logistic regression to the baseline accuracy and the default model.
4. Look at the best parameters found. What was chosen? What does this suggest about our data?
5. Look at the (non-zero, if Lasso was selected as best) coefficients and associated predictors for your optimized model. What appear to be the most important predictors of winning the game?


In [24]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

lr = LogisticRegression()
lr.fit(X_train, y_train)

print(lr.score(X_test, y_test))

0.695431472081


In [25]:
print(np.mean(y_test))

0.603045685279


In [26]:
# default is already much better than KNN!

In [27]:
# Set up the parameters. Looking at C regularization strengths on a log scale.
# This takes a while...
gs_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,0,100)
}

lr_gridsearch = GridSearchCV(LogisticRegression(), gs_params, cv=5, verbose=1)


In [28]:
lr_gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:   35.6s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.12332e-05, ...,   8.90215e-01,   1.00000e+00]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [29]:
# best mean cross-validated score on the training data:
lr_gridsearch.best_score_

0.66079770032339202

In [30]:
# best parameters on the training data:
lr_gridsearch.best_params_

{'C': 0.0029836472402833404, 'penalty': 'l1', 'solver': 'liblinear'}

In [31]:
# Lasso was chosen: this suggests that unimportant (noise) variables
# may be more of an issue in our data than multicollinearity.

In [32]:
# assign the best estimator to a variable:
best_lr = lr_gridsearch.best_estimator_

In [33]:
# Score it on the testing data:
best_lr.score(X_test, y_test)

0.66598984771573599

In [34]:
# slightly worse than the default logistic regression on the test set (0.666 vs. 0.695),
# but still better than the baseline (0.603).

In [35]:
coef_df = pd.DataFrame({
        'coef':best_lr.coef_[0],
        'feature':X.columns
    })

In [36]:
coef_df['abs_coef'] = np.abs(coef_df.coef)

In [37]:
# sort by absolute value of coefficient (magnitude)
coef_df.sort_values('abs_coef', ascending=False, inplace=True)

In [38]:
# Show non-zero coefs and predictors
coef_df[coef_df.coef != 0]

Unnamed: 0,coef,feature,abs_coef
8,0.218062,game_behind,0.218062


In [40]:
gs_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,0,100)
}

## Independent practice part 2:

Recreate the above grid search from scratch to verify that you get the same answer for optimal hyperparams!

*hint*: use a for loop, and `cross_val_score`

In [49]:
from sklearn.model_selection import cross_val_score

results = []
for pen in gs_params['penalty']:
    for c in gs_params['C']:
        lr = LogisticRegression(C=c, penalty=pen, solver='liblinear')
        results.append((pen, c, cross_val_score(lr, X_train, y_train, cv=5).mean()))

In [50]:
optimized = pd.DataFrame(results, columns=['penalty', 'c', 'score'])
optimized.sort_values('score', ascending=False)


Unnamed: 0,penalty,c,score
49,l1,0.002984,0.660794
50,l1,0.003352,0.660794
51,l1,0.003765,0.660794
52,l1,0.004229,0.660794
53,l1,0.004751,0.660794
59,l1,0.009545,0.659721
67,l1,0.024201,0.659718
71,l1,0.038535,0.659001
69,l1,0.030539,0.658282
72,l1,0.043288,0.658280


In [None]:
# alternate solution - saves space

best_c = None
best_pen = None
best_score = 0
for pen in gs_params['penalty']:
    for c in gs_params['C']:
        lr = LogisticRegression(C=c, penalty=pen, solver='liblinear')
        score = cross_val_score(lr, X_train, y_train, cv=5).mean()
        if score > best_score:
            best_score = score
            best_pen = pen
            best_c = c

In [None]:
print('{} {} {}'.format(best_pen, best_c, best_score))