# Grid Search and Cross-Validation

_Author: Michael Frantz (LA)_

---


In [None]:
!pip --quiet install mglearn

In [None]:
import mglearn
from sklearn.linear_model import Ridge, Lasso, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Review: Bias and Variance

---
![](http://tomrobertshaw.net/img/2015/12/overfitting.jpg)

### Discussion:
Answer these questions about the models above, given the data.

1. Out of these three, which one would you pick as a low bias model? Why?
1. Out of these three, which would you pick as a low variance model? Why?
1. Out of these three, which one would you choose? Why?
1. What methods have we learned so far for **reducing variance**?

## Tuning Hyperparameters
---

We've now covered a number of regressors and classifiers, including:
* `KNeighborsClassifier`/`KNeighborsRegressor`
* `LogisticRegression`
* `Ridge`
* `Lasso`
* `SVC`/`SVR`
* `DecisionTreeClassifier`/`DecisionTreeRegressor`

Each of these models has what are called **hyperparameters**, which affect how the model learns from the data. Think of them like dials: by changing them, we can optimize a model's performance on new data (a.k.a. **reduce variance**)!

* What is the hyperparameter for `KNeighborsClassifier`?
* What are the hyperparameters for `Ridge`?

Up until now, we've used:
> **"Training set":** the subset of the data that we fit our model on.

> **"Testing set":** the subset of the data that we evaluate the quality of our predictions on.

We've tuned our hyperparameters by hand so our model performs well on a test set. But now, instead of overfitting to our training set, we may be overfitting to our test set! Ultimately, we want to find optimal values for our hyperparameters **without touching our test data**. **BUT HOW?!?!?!**

The trick is to add *another* set of splits. We're going to use cross-validation **within the training set** to fit our models and tune our hyperparameters, then **evaluate** on the test set. Let's spend a minute on the figure below. Notice:

* When we originally fit our models, we use cross-validation to identify bias and variance, and adjust our hyperparameters to improve our model.
* In the top panel, we don't do anything with our test set.
* In the bottom panel, once we've tuned our hyperparameters, we can **re-fit** on all our training data to predict our **test** data and perform a model evaluation.
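
Here's a rough sketch of that workflow done by hand. It assumes synthetic data from `make_classification` (not our lesson dataset) and three made-up candidate values for `n_neighbors`. The test set is only touched once, at the very end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the lesson's dataset)
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

# Hold out a test set that we won't look at while tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Compare candidate hyperparameter values using ONLY the training data
cv_means = {}
for k in [1, 5, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5)
    cv_means[k] = scores.mean()

# Re-fit the best candidate on all the training data, then evaluate once
best_k = max(cv_means, key=cv_means.get)
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(cv_means, best_k, test_accuracy)
```

Writing this loop for every model and every combination of hyperparameters gets tedious fast, which is the motivation for automating it.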


In [None]:
mglearn.plot_improper_preprocessing.plot_improper_processing()

## Introducing (drum roll please...) `GridSearchCV`

What if there was a way we could reduce variance by tuning hyperparameters without even touching our test dataset? To do this, we use the `sklearn` class `GridSearchCV`. Let's dive in:

### Setting up `GridSearchCV`

##### 1. Estimator

First, pick your estimator. This can be any `sklearn` model. Today, we'll use `KNeighborsClassifier`. This is passed to the `estimator` argument.

##### 2. Parameter Grid:

The second thing you should do when setting up a grid search is create a parameter grid. This allows you to input values for all the hyperparameters you'd like to tune! Here's an example of a parameter grid for `KNeighborsClassifier`. The main hyperparameter we'll tune here is `n_neighbors`, which makes this task pretty simple. (There are others, such as `weights`, but we'll leave them at their defaults.) Let's try every value between 1 and 10.

```
knc_params = {
        'n_neighbors':range(1,11)
    }
```
Your param grid is then passed to your grid search object through the argument `param_grid`.

##### 3. Cross-Validation:

Cross-validation is defined by the `cv=` argument. `cv` can be an integer or a splitter object such as `ShuffleSplit` or `StratifiedShuffleSplit`.
* `cv=5` will perform 5-fold cross-validation.
* `ShuffleSplit` will shuffle your samples to draw random validation sets. The number of splits is set by its `n_splits` argument, which defaults to 10. If you want 5 random splits that each hold out 20% of the data, use `cv=ShuffleSplit(n_splits=5, test_size=.2)`. Unlike k-fold, these validation sets can overlap, so not every sample is guaranteed to land in one.
* `StratifiedShuffleSplit` is similar to `ShuffleSplit`, but can only be used for classification. It ensures that the class proportions in your target are preserved in each validation split.
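
To see the difference between the two splitters, here's a small sketch on a made-up, imbalanced toy target (25% positive class). The array names are just for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Toy data (an assumption): 20 samples, 5 of which belong to class 1
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Share of class 1 in each validation split
ss_rates = [y_toy[val_idx].mean() for _, val_idx in ss.split(X_toy)]
sss_rates = [y_toy[val_idx].mean() for _, val_idx in sss.split(X_toy, y_toy)]

print('ShuffleSplit:          ', ss_rates)
print('StratifiedShuffleSplit:', sss_rates)
```

The plain `ShuffleSplit` rates can bounce around from split to split, while the stratified rates should stay at the overall class proportion.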


##### Note:

`GridSearchCV` also takes an argument for `n_jobs`. In recent `sklearn` versions the default is `None`, which in most settings means a single job. Many `sklearn` objects take this argument. If you set `n_jobs=-1`, `sklearn` will parallelize this process across all the cores in your machine. This can really help speed up model training!
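
Putting the three pieces and `n_jobs` together, a grid search object might be set up like the sketch below. The name `knc_gs_example` is made up for this illustration, and nothing is fit yet.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Parameter grid from the example above
knc_params = {'n_neighbors': range(1, 11)}

knc_gs_example = GridSearchCV(
    estimator=KNeighborsClassifier(),  # 1. the estimator
    param_grid=knc_params,             # 2. the parameter grid
    cv=5,                              # 3. 5-fold cross-validation
    n_jobs=-1,                         # use all available cores when fitting
)
print(knc_gs_example)
```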


## Ok, now what does `GridSearchCV` do with that info?

When we use `.fit()` on your instantiated `GridSearchCV` object, it does a few things:
1. For each combination of hyperparameters, it performs cross-validation and gets a mean test score (and a mean train score, if you set `return_train_score=True`). **Question:** If we do 5-fold cross-validation on a `param_grid` for `Ridge` where `param_grid = {'alpha':range(1,6)}`, how many individual models are fit?
2. For the combination of parameters with the **best mean test score**, `GridSearchCV` re-fits on **all the training data** using those best parameters.
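
If you'd like to check your counting, `sklearn`'s `ParameterGrid` can enumerate the combinations for you. Here's a sketch with a different, made-up grid and 3 folds, so the question above is still yours to answer.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid with two hyperparameters: 3 values x 2 values
example_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
n_folds = 3

n_combinations = len(ParameterGrid(example_grid))  # one entry per combination
n_cv_fits = n_combinations * n_folds               # one fit per combination per fold
n_total_fits = n_cv_fits + 1                       # plus the final re-fit on all training data

print(n_combinations, n_cv_fits, n_total_fits)
```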

**Question:** If your `GridSearchCV` is taking *REEEEEEEAAALLY* long to run `.fit()`, what are some options you have to reduce the time?

Let's spend a minute on the graphic below to review what `GridSearchCV` is doing.

In [None]:
mglearn.plot_grid_search.plot_grid_search_overview()

## Ok, enough talk! Let's code it up. 

**NOTE:** Because we want to spend time with GridSearch, I've left the data cleaning in for you. However, if you have time, it would be great to review the steps taken to clean this data and try replicating them on your own. I've explained what we're doing step-by-step:

In [None]:
from sklearn.model_selection import GridSearchCV, ShuffleSplit, StratifiedShuffleSplit, train_test_split

Load the data:

In [None]:
data = pd.read_csv('datasets/basketball_data.csv')
# Set the index to the `GameId` column
data.set_index('GameId', inplace=True)
data.head()

Extract the month and day of week from the data using `pandas datetime` functionality:

In [None]:
# Change the GameDate column to datetime type
data['GameDate'] = pd.to_datetime(data['GameDate'])
# Extract the day of week and month of the game into their own columns
data['GameMonth'] = data['GameDate'].dt.month
data['GameDayofWeek'] = data['GameDate'].dt.dayofweek

View the non-numeric columns in this dataset using `df.select_dtypes`

In [None]:
data.select_dtypes(include=['object']).head()

Make our target whether the home team wins, stored in a new column called `host_wins`:

In [None]:
data['host_wins'] = data['HostName'] == data['winner']

Drop the `winner`, `loser`, and `GameDate` columns from the dataset.

In [None]:
data.drop(['winner','loser','GameDate'],axis=1, inplace=True)

We still have a few `object` columns in this dataset. Let's use `pd.get_dummies` to change these into numeric columns. 

In [None]:
data_dummies = pd.get_dummies(data)
data_dummies.head()

Finally, let's split the data into our `features` and `target`. 

In [None]:
target = data_dummies['host_wins']
features = data_dummies.drop('host_wins',axis=1)

Now's where you take over! 

### Exercise: `GridSearchCV` Workflow

Perform a `train_test_split` on the data: 
* Name your split data `X_train, X_test, y_train, y_test`.
* Use a `test_size` of .25
* Use a `random_state` of 42 so we all have the same results

Import `StandardScaler` from `sklearn.preprocessing`. (`KNeighborsClassifier` and `LogisticRegression` were already imported at the top of the notebook.)

Instantiate a `StandardScaler` object. `.fit` the scaler object on your training data (`X_train`), and then `.transform` both your training and test features so they are all scaled.

Coming out of this, you should have two new `np.ndarray`s: `X_train_scaled` and `X_test_scaled`. 
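
If the fit/transform pattern is new to you, here's a sketch on made-up random data. The `_demo`, `_tr`, and `_te` names are placeholders so they don't collide with the variables you create for the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data (an assumption, not the basketball data)
rng = np.random.RandomState(42)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))
y_demo = rng.randint(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

scaler = StandardScaler()
scaler.fit(X_tr)                       # learn the means and standard deviations from training data only
X_tr_scaled = scaler.transform(X_tr)   # apply those training statistics to the training features
X_te_scaled = scaler.transform(X_te)   # apply the SAME training statistics to the test features

print(X_tr_scaled.mean(axis=0).round(3))  # training columns are centered at 0
print(X_te_scaled.mean(axis=0).round(3))  # test columns are close to 0, but not exactly
```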

**Question:** Why do we fit the scaler on only the training data? Why do we transform both the train and test data?

Ok, let's set up a `param_grid` for `KNeighborsClassifier`. Let's try neighbors from 5 to 50 with intervals of 5. Ex: `[5,10,15,...,45,50]`

Almost there! Let's instantiate our `GridSearchCV` object where...
* our `estimator` is `KNeighborsClassifier`
* our `param_grid` is our `knc_params`
* our `cv=5` (5-fold cross-validation)

In [None]:
knc_gs = None

Use `.fit` once to fit ALL OUR MODELS!

**Question:** Which of the below dataset pairs do we use to fit on? Why?
1. `.fit(X_train, y_test)`
1. `.fit(X_test, y_train)`
1. `.fit(X_train, y_train)`
1. `.fit(X_test, y_test)`
1. `.fit(X_train_scaled, y_train)`
1. `.fit(X_test_scaled, y_test)`
1. `.fit(X_train_scaled, y_train_scaled)`

That took way longer than we're used to... 
**Question:** Why?

## Let's explore the **methods** and **attributes** available to our fit `GridSearchCV` object.

### `.cv_results_`

This attribute summarizes the results of all our cross-validated models. Let's put it in a DataFrame and see what we get!

Notice the column names. Since we did 5-fold cross-validation, every time we see a mean, we're evaluating the mean of 5 individual fits on different folds of our cross-validation.

Below, let's sort our `cv_results` dataframe by `mean_test_score` using `.sort_values`. 

**Question:** How many neighbors give us our best score?
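
For reference, here's what that looks like on a small grid search fit to synthetic data. The `demo_` names and the three-value grid are made up for illustration. Note that the `mean_train_score` column only appears when `return_train_score=True`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the lesson's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

demo_gs = GridSearchCV(KNeighborsClassifier(),
                       {'n_neighbors': [1, 5, 15]},
                       cv=5,
                       return_train_score=True)
demo_gs.fit(X_demo, y_demo)

# One row per hyperparameter combination
demo_results = pd.DataFrame(demo_gs.cv_results_)
cols = ['param_n_neighbors', 'mean_train_score', 'mean_test_score', 'rank_test_score']
demo_sorted = demo_results[cols].sort_values('mean_test_score', ascending=False)
print(demo_sorted)
```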

### `.best_score_`

Let's say we don't want to look at the `cv_results_`. This attribute gives the **best mean test score** from our `cv_results_`. 

### `.best_params_`

This attribute returns the parameters for the model that got the best mean test score.

### `.best_estimator_`

The `best_estimator_` is an **actual model**. When we call `.predict()` on our `GridSearchCV` object, this is the model that it's using in the background. Notice that the `best_estimator_` has the hyperparameters from `best_params_`. Let's check it out!
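
Here's a sketch showing how these three attributes relate to each other, again using a made-up grid search on synthetic data rather than our `knc_gs`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the lesson's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

demo_gs = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 5, 15]}, cv=5)
demo_gs.fit(X_demo, y_demo)

print(demo_gs.best_score_)      # the highest mean test score in cv_results_
print(demo_gs.best_params_)     # the hyperparameters that produced it
print(demo_gs.best_estimator_)  # a model re-fit on all the data passed to .fit()

# best_score_ matches the top mean_test_score, and the re-fit model uses best_params_
matches_top_score = np.isclose(demo_gs.best_score_,
                               demo_gs.cv_results_['mean_test_score'].max())
same_k = demo_gs.best_estimator_.n_neighbors == demo_gs.best_params_['n_neighbors']
print(matches_top_score, same_k)
```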

### `.predict()`

This works just like the `predict` method on any other `sklearn` estimator. Let's try it out by creating a `y_test_pred` array!

**Note:** If your `estimator` is a classifier, you also have `.predict_proba()` available to you.
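
As a quick illustration of both methods, here's a sketch on synthetic data. The `demo_`, `_tr`, and `_te` names are placeholders and the features here are unscaled, so it's not a drop-in answer for the cell below.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the lesson's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

demo_gs = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 5, 15]}, cv=5)
demo_gs.fit(X_tr, y_tr)

demo_pred = demo_gs.predict(X_te)         # one class label per row
demo_proba = demo_gs.predict_proba(X_te)  # one probability per class per row

print(demo_pred[:5])
print(demo_proba[:5])
```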

**Question:** Which dataset goes into the `.predict()` method if we want our test predictions?

1. `X_train`
1. `X_train_scaled`
1. `X_test`
1. `X_test_scaled`
1. `y_train`
1. `y_train_scaled`
1. `y_test`
1. `y_test_scaled`

In [None]:
y_test_pred = None

### `.score()` 

This **inherits** the `.score()` function from our `estimator`. So, if we're doing classification, the default will be accuracy, and for regressors, the default will be $R^2$.

What is the **train score**? What is the **test score**? What's your interpretation about our model's capacity to generalize to unseen data? What does that mean in terms of bias and variance?

In our `train` and `test` y's, what percent of the time does the home team win? i.e. what is the accuracy if we guessed the home team won every time? (`y_pred` is all ones)

Does our model out-perform chance?
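
One way to get a baseline like this is `sklearn`'s `DummyClassifier`, which always predicts the most frequent class it saw in training. The sketch below uses synthetic data and an arbitrary `n_neighbors=5`, so the numbers won't match ours.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the lesson's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

# Baseline: always guess the majority class from the training data
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)

# A real model to compare against the baseline
model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)

print(f'baseline: {baseline_acc:.3f}  train: {train_acc:.3f}  test: {test_acc:.3f}')
```

A large gap between the train and test scores points toward high variance, and a test score near the baseline suggests the model isn't learning much.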

**Let's go back up to where we define our grid search object and use `StratifiedShuffleSplit` as an example.**

### (if time permits) Exercise:

Use `GridSearchCV` with `LogisticRegression` to see if this model out-performs your nearest neighbors model.

* What hyperparameters can we tune for `LogisticRegression`? Let's tune `C` and `penalty`. Let's use `np.logspace(-3,3,7)` for our `C` hyperparameter, and try both an `l1` and an `l2` penalty. (In recent `sklearn` versions the default `lbfgs` solver doesn't support `l1`, so you'll need a solver that does, such as `liblinear`.)
* Try using `ShuffleSplit` or `StratifiedShuffleSplit` with 5 splits.

What are the `best_params_`? The `best_score_`?
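
If you get stuck, here's one way the setup might look on synthetic data. The `demo_` names are made up, and `solver='liblinear'` is an assumption chosen because it supports both penalties. Try the exercise on our real data yourself first!

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic stand-in data (an assumption, not the lesson's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# 7 values of C x 2 penalties = 14 combinations
demo_lr_params = {'C': np.logspace(-3, 3, 7), 'penalty': ['l1', 'l2']}

# 5 random stratified splits, each holding out 20% for validation
demo_cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

demo_lrgs = GridSearchCV(LogisticRegression(solver='liblinear'),
                         demo_lr_params,
                         cv=demo_cv)
demo_lrgs.fit(X_demo, y_demo)

print(demo_lrgs.best_params_)
print(demo_lrgs.best_score_)
```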

In [None]:
lr_params = {
    
}

In [None]:
lrgs = None

Let's plot out our mean test score over different values of `C` from our `.cv_results_`.

In [None]:
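best_penalty = lrgs.best_params_['penalty']
best_C = lrgs.best_params_['C']
lrgs_results_df = pd.DataFrame(lrgs.cv_results_)

# NOTE: `mean_train_score` only exists if `lrgs` was created with return_train_score=True
(lrgs_results_df[lrgs_results_df['param_penalty'] == best_penalty]
 [['mean_train_score','mean_test_score','param_C']]
 .astype({'param_C': float})  # make sure C is numeric so the log axis works
 .plot(x='param_C'))

# Mark the value of C (an x-axis value, not a score) that gave the best mean test score
plt.axvline(best_C, c='r', ls='--', label='optimal C')

plt.legend()
plt.xscale('log')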

## In Conclusion...

`GridSearchCV` performs two primary functions:

* Cross-validation, to **identify bias and variance** during training
* Grid Search, to **reduce bias and variance** by testing many combinations of model hyperparameters at once **without looking at our test data!**