<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Gridsearching Hyperparameters


---

<img src="gridsearch_meme.png" style="width: 500px;">

## Learning Objectives


### Core

- Understand what the terms gridsearch and hyperparameter refer to
- Understand how to manually build a gridsearching procedure
- Apply sklearn's `GridSearchCV` object to optimize a scikit-learn model
- Interpret the output from gridsearch

### Target

- Understand the pitfalls of searching large hyperparameter spaces
- Understand how to set up parameter dictionaries for gridsearch


### Stretch

- Look at the documentation for GridSearchCV on scikit-learn's website, and investigate the different options that are available (for example, the attribute `.cv_results_`)

<h1>Lesson Guide<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-to-Gridsearching-Hyperparameters" data-toc-modified-id="Introduction-to-Gridsearching-Hyperparameters-1">Introduction to Gridsearching Hyperparameters</a></span><ul class="toc-item"><li><span><a href="#Learning-Objectives" data-toc-modified-id="Learning-Objectives-1.1">Learning Objectives</a></span><ul class="toc-item"><li><span><a href="#Core" data-toc-modified-id="Core-1.1.1">Core</a></span></li><li><span><a href="#Target" data-toc-modified-id="Target-1.1.2">Target</a></span></li><li><span><a href="#Stretch" data-toc-modified-id="Stretch-1.1.3">Stretch</a></span></li></ul></li><li><span><a href="#Big-Picture:-What-part-of-the-modeling-process-are-we-focusing-on?" data-toc-modified-id="Big-Picture:-What-part-of-the-modeling-process-are-we-focusing-on?-1.2">Big Picture: What part of the modeling process are we focusing on?</a></span></li><li><span><a href="#Gridsearching-and-hyperparameters" data-toc-modified-id="Gridsearching-and-hyperparameters-1.3">Gridsearching and hyperparameters</a></span></li><li><span><a href="#Basketball-data" data-toc-modified-id="Basketball-data-1.4">Basketball data</a></span><ul class="toc-item"><li><span><a href="#Load-the-data-and-create-the-target-and-predictor-matrix" data-toc-modified-id="Load-the-data-and-create-the-target-and-predictor-matrix-1.4.1">Load the data and create the target and predictor matrix</a></span></li><li><span><a href="#1.-Cleaning,-descriptive-stats,-correlation-heatmap,-plots-&amp;-visualizations.-Find-baseline-(accuracy)." data-toc-modified-id="1.-Cleaning,-descriptive-stats,-correlation-heatmap,-plots-&amp;-visualizations.-Find-baseline-(accuracy).-1.4.2">1. Cleaning, descriptive stats, correlation heatmap, plots &amp; visualizations. Find baseline (accuracy).</a></span></li><li><span><a href="#2.-Set-up-predictor-matrix-(X)-and-target-array-(y).--Dummify-if-necessary." 
data-toc-modified-id="2.-Set-up-predictor-matrix-(X)-and-target-array-(y).--Dummify-if-necessary.-1.4.3">2. Set up predictor matrix (X) and target array (y).  Dummify if necessary.</a></span></li><li><span><a href="#3.-Train/test-split-and-StandardScaler(-).--In-this-case,-instead-of-using-train_test_split,-we're-going-to-use-the-most-recent-season-as-our-testing-data-(2016-data),-and-previous-seasons-as-our-training-data." data-toc-modified-id="3.-Train/test-split-and-StandardScaler(-).--In-this-case,-instead-of-using-train_test_split,-we're-going-to-use-the-most-recent-season-as-our-testing-data-(2016-data),-and-previous-seasons-as-our-training-data.-1.4.4">3. Train/test split and <code>StandardScaler( )</code>.  In this case, instead of using train_test_split, we're going to use the most recent season as our testing data (2016 data), and previous seasons as our training data.</a></span></li><li><span><a href="#4.-Use-cross-validation-to-optimize-the-hyperparameters-for-your-model." data-toc-modified-id="4.-Use-cross-validation-to-optimize-the-hyperparameters-for-your-model.-1.4.5">4. Use cross-validation to optimize the hyperparameters for your model.</a></span><ul class="toc-item"><li><span><a href="#Gridsearch-pseudocode-for-our-KNN" data-toc-modified-id="Gridsearch-pseudocode-for-our-KNN-1.4.5.1">Gridsearch pseudocode for our KNN</a></span></li><li><span><a href="#Using-GridSearchCV" data-toc-modified-id="Using-GridSearchCV-1.4.5.2">Using <code>GridSearchCV</code></a></span></li><li><span><a href="#Examing-the-results-of-GridSearch(-)" data-toc-modified-id="Examing-the-results-of-GridSearch(-)-1.4.5.3">Examing the results of GridSearch( )</a></span></li></ul></li></ul></li><li><span><a href="#5.-Once-you're-happy-with-your-hyperparameters,-evaluate-your-model-on-the-training-and-test-data" data-toc-modified-id="5.-Once-you're-happy-with-your-hyperparameters,-evaluate-your-model-on-the-training-and-test-data-1.5">5. 
Once you're happy with your hyperparameters, evaluate your model on the training and test data</a></span><ul class="toc-item"><li><span><a href="#6.-Evaluation" data-toc-modified-id="6.-Evaluation-1.5.1">6. Evaluation</a></span></li></ul></li><li><span><a href="#Gridsearch-regularization-penalties-with-logistic-regression" data-toc-modified-id="Gridsearch-regularization-penalties-with-logistic-regression-1.6">Gridsearch regularization penalties with logistic regression</a></span><ul class="toc-item"><li><span><a href="#Analyze-the-feature-importances" data-toc-modified-id="Analyze-the-feature-importances-1.6.1">Analyze the feature importances</a></span></li></ul></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-1.7">Conclusions</a></span></li></ul></li></ul></div>

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Big Picture: What part of the modeling process are we focusing on?

---

A general approach you can use for modeling questions would look like this:

1. Cleaning, descriptive stats, correlation heatmap, plots & visualizations. Find baseline score.

1. Set up predictor matrix (X) and target array (y).  Dummify if necessary.

1. Train/test split and StandardScaler( ).

1. Use cross-validation to optimise the hyperparameters for your model. You might try different types of models at this stage as well, and you might use GridSearchCV (or any of the other CVs like RidgeCV).

1. Once you're happy with your hyperparameters, fit your model on your whole training data and test it on your whole testing data.

1. Then you might want to 
    - evaluate the performance of the model (R2 score, accuracy, confusion matrix, etc)
    - find the actual predictions that your model is providing and store them in a dataframe
    - plot your predictions against your actual target variable to visualize how well your model performed
    - investigate feature importance with .coef_ if you have a parametric model

Now, we're learning more about 4.

## Gridsearching and hyperparameters

---

Models often have specifications that can be set. For example, when we choose a linear regression, we may decide to add a penalty to the loss function such as the Ridge or the Lasso. Those penalties require the regularization strength to be set. 

**Model settings that we choose before fitting are called hyperparameters.**

Hyperparameters are different from the parameters that result from a fit, such as the coefficients. The hyperparameters are set prior to the fit and determine the behavior of the model.

There is often more than one hyperparameter to set for a model. For example, in the KNN algorithm 
- we have a hyperparameter to set the number of neighbors 
- we also have a hyperparameter for the weights: uniform or distance

We want to know the *optimal* hyperparameter settings, the set that results in the best model evaluation. 

**Systematically trying every combination from a predefined set of candidate hyperparameter values, and keeping the best one, is called gridsearching.**

Gridsearching gets its name from the fact that we are searching over a **"grid" of parameters**. For example, imagine the `n_neighbors` hyperparameter on the x-axis and `weights` on the y-axis; we need to test every point on the grid.

**Gridsearching uses cross-validation internally to evaluate the performance of each set of hyperparameters.** More on this later.
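
To make the grid picture concrete, here is a minimal sketch that enumerates every combination of two small candidate lists with `itertools.product`. The candidate values are assumptions chosen for illustration, not the values searched later in this lesson.

```python
from itertools import product

# Hypothetical candidate values, chosen only for illustration
n_neighbors_options = [3, 5, 7]
weights_options = ['uniform', 'distance']

# Each point on the grid is one (n_neighbors, weights) combination
grid = [{'n_neighbors': k, 'weights': w}
        for k, w in product(n_neighbors_options, weights_options)]

for point in grid:
    print(point)

print(len(grid), 'combinations to evaluate')
```

Three values on one axis and two on the other give six grid points, and each one would be cross-validated separately.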

## Basketball data

---

To explore the process of gridsearching over sets of hyperparameters, we will use some basketball data. The data below has statistics for 4 different seasons of NBA basketball: 2013-2016.
- Each row holds aggregate statistics for one game
- The statistics are aggregated over all players in that game
- Scraped from https://www.basketball-reference.com
- For explanations of some of the statistics, see https://en.wikipedia.org/wiki/Basketball_statistics

Many of the columns in the dataset represent the mean of a statistic across the last 10 games (the `_avg10` suffix). Non-target statistics describe *prior* games only; they do not include information about player performance in the current game.

**We are interested in predicting whether the home team will win the game or not.** This is a classification problem.

### Load the data and create the target and predictor matrix
- The target will be a binary column of whether the home team wins.
- The predictors should be numeric columns.

Exclude these columns from the predictor matrix:

    ['GameId','GameDate','GameTime','HostName',
     'GuestName','total_score','total_line','game_line',
     'winner','loser','host_wins','Season']

### 1. Cleaning, descriptive stats, correlation heatmap, plots & visualizations. Find baseline (accuracy).

In [2]:
data = pd.read_csv('./datasets/basketball_data.csv')



In [3]:
data.head()



Unnamed: 0,Season,GameId,GameDate,GameTime,HostName,GuestName,total_score,total_line,game_line,Host_HostRank,...,gPTS_avg10,gTS%_avg10,g3PAR_avg10,gFTr_avg10,gDRB%_avg10,gTRB%_avg10,gAST%_avg10,gSTL%_avg10,gBLK%_avg10,gDRtg_avg10
0,2013,201212090LAL,2012-12-09,6:30 pm,Los Angeles Lakers,Utah Jazz,227.0,207.5,7.5,13,...,99.0,0.5206,0.223,0.2981,69.22,50.05,61.57,8.63,10.31,110.87
1,2013,201212100PHI,2012-12-10,7:00 pm,Philadelphia 76ers,Detroit Pistons,201.0,186.5,5.5,13,...,90.3,0.5077,0.2144,0.3095,71.46,49.48,59.83,6.48,9.46,107.91
2,2013,201212100HOU,2012-12-10,7:00 pm,Houston Rockets,San Antonio Spurs,240.0,212.0,-7.0,12,...,108.0,0.5915,0.2743,0.2518,74.26,50.99,61.82,8.3,6.85,101.41
3,2013,201212110BRK,2012-12-11,7:00 pm,Brooklyn Nets,New York Knicks,197.0,195.5,-3.5,12,...,100.3,0.5473,0.3595,0.2544,74.23,47.88,52.07,9.31,7.64,109.24
4,2013,201212110DET,2012-12-11,7:30 pm,Detroit Pistons,Denver Nuggets,195.0,203.5,-4.5,11,...,101.1,0.5605,0.2173,0.3177,68.45,50.4,56.33,7.67,7.83,114.86


In [4]:
data.columns



Index(['Season', 'GameId', 'GameDate', 'GameTime', 'HostName', 'GuestName',
       'total_score', 'total_line', 'game_line', 'Host_HostRank',
       'Host_GameRank', 'Guest_GuestRank', 'Guest_GameRank', 'host_win_count',
       'host_lose_count', 'guest_win_count', 'guest_lose_count', 'game_behind',
       'winner', 'loser', 'host_place_streak', 'guest_place_streak',
       'hq1_avg10', 'hq2_avg10', 'hq3_avg10', 'hq4_avg10', 'hPace_avg10',
       'heFG%_avg10', 'hTOV%_avg10', 'hORB%_avg10', 'hFT/FGA_avg10',
       'hORtg_avg10', 'hFG_avg10', 'hFGA_avg10', 'hFG%_avg10', 'h3P_avg10',
       'h3PA_avg10', 'h3P%_avg10', 'hFT_avg10', 'hFTA_avg10', 'hFT%_avg10',
       'hORB_avg10', 'hDRB_avg10', 'hTRB_avg10', 'hAST_avg10', 'hSTL_avg10',
       'hBLK_avg10', 'hTOV_avg10', 'hPF_avg10', 'hPTS_avg10', 'hTS%_avg10',
       'h3PAR_avg10', 'hFTr_avg10', 'hDRB%_avg10', 'hTRB%_avg10',
       'hAST%_avg10', 'hSTL%_avg10', 'hBLK%_avg10', 'hDRtg_avg10', 'gq1_avg10',
       'gq2_avg10', 'gq3_avg10', 'gq

In [5]:
data.shape



(3768, 96)

In [6]:
data.Season.value_counts()



2014    998
2016    985
2015    984
2013    801
Name: Season, dtype: int64

In [7]:
data.winner.head()



0             Utah Jazz
1    Philadelphia 76ers
2     San Antonio Spurs
3       New York Knicks
4        Denver Nuggets
Name: winner, dtype: object

In [8]:
# let's create a column called 'host_wins' which will indicate
# whether the host team won the game or not
data['host_wins'] = (data['HostName'] == data['winner']).astype(int)



In [9]:
data['host_wins'].value_counts()



1    2239
0    1529
Name: host_wins, dtype: int64

In [10]:
# let's look at the baseline accuracy for this data
baseline = data['host_wins'].value_counts(normalize=True).max()
baseline



0.5942144373673036
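
The baseline is simply the share of the majority class. As a quick check, the same number can be reproduced by hand from the class counts printed above:

```python
# Class counts taken from the value_counts() output above
home_wins = 2239
home_losses = 1529

# Always predicting the majority class gets this share of games right
baseline_accuracy = max(home_wins, home_losses) / (home_wins + home_losses)
print(round(baseline_accuracy, 4))
```

Any model we build should beat this number to be worth using.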

At this point, we could definitely do a bit more EDA!  We could
- use `.describe()` to find some descriptive statistics
- create some plots and visualizations to better understand the shapes and relationships in our data
- use a correlation heatmap to help us find variables that could predict `host_wins`

... but we're going to skip that for this lesson!  Be aware that if you're undertaking your own data investigation, skipping EDA is a bad idea. (Obviously.)  

### 2. Set up predictor matrix (X) and target array (y).  Dummify if necessary.

In [11]:
predictors = [c for c in data.columns if c not in
              ['GameId', 'GameDate', 'GameTime', 'HostName',
               'GuestName', 'total_score', 'total_line', 'game_line',
               'winner', 'loser', 'host_wins', 'Season']]
X = data[predictors]
y = data['host_wins']

In [12]:
X.head()

Unnamed: 0,Host_HostRank,Host_GameRank,Guest_GuestRank,Guest_GameRank,host_win_count,host_lose_count,guest_win_count,guest_lose_count,game_behind,host_place_streak,...,gPTS_avg10,gTS%_avg10,g3PAR_avg10,gFTr_avg10,gDRB%_avg10,gTRB%_avg10,gAST%_avg10,gSTL%_avg10,gBLK%_avg10,gDRtg_avg10
0,13,21,13,22,9,11,11,10,-1.5,1,...,99.0,0.5206,0.223,0.2981,69.22,50.05,61.57,8.63,10.31,110.87
1,13,21,13,23,11,9,7,15,5.0,1,...,90.3,0.5077,0.2144,0.3095,71.46,49.48,59.83,6.48,9.46,107.91
2,12,20,13,22,9,10,17,4,-7.0,2,...,108.0,0.5915,0.2743,0.2518,74.26,50.99,61.82,8.3,6.85,101.41
3,12,20,13,21,11,8,15,5,-3.5,4,...,100.3,0.5473,0.3595,0.2544,74.23,47.88,52.07,9.31,7.64,109.24
4,11,24,16,22,7,16,10,11,-4.0,1,...,101.1,0.5605,0.2173,0.3177,68.45,50.4,56.33,7.67,7.83,114.86


### 3. Train/test split and `StandardScaler( )`.  In this case, instead of using train_test_split, we're going to use the most recent season as our testing data (2016 data), and previous seasons as our training data.

In [13]:
from sklearn.preprocessing import StandardScaler

In [14]:
data['Season'].value_counts()

2014    998
2016    985
2015    984
2013    801
Name: Season, dtype: int64

In [15]:
# create your training and testing sets
# or, you could use X[data['Season'] < 2016]
X_train = X[data['Season'].isin([2013, 2014, 2015])]
X_test = X[data['Season'] == 2016]

y_train = y[data['Season'].isin([2013, 2014, 2015])]
y_test = y[data['Season'] == 2016]

In [16]:
# training and test set baseline
print(y_train.value_counts(normalize=True).max())
print(y_test.value_counts(normalize=True).max())

0.5910887531440892
0.6030456852791878


In [17]:
# standardise your predictor matrices
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

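
Note that the scaler is fit on the training data only and then applied to both sets. The sketch below imitates that with plain NumPy on tiny made-up arrays (the `toy_*` names and values are assumptions for illustration), to show why `fit_transform` is used for training data and `transform` for test data.

```python
import numpy as np

# Tiny made-up arrays standing in for X_train and X_test
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# "fit": learn the column means and standard deviations from the training rows only
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the training statistics to both sets
toy_train_std = (toy_train - train_mean) / train_std
toy_test_std = (toy_test - train_mean) / train_std

print(toy_train_std.mean(axis=0))  # approximately 0 for each column
print(toy_test_std)
```

The test rows are scaled with the training statistics, so no information from the test set leaks into the preprocessing.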


### 4. Use cross-validation to optimize the hyperparameters for your model. 


> You might try different types of models at this stage as well, and you might use `GridSearchCV` (or any of the other CVs like `RidgeCV`).

Below we fit a `KNeighborsClassifier` with `n_neighbors=15` and otherwise default settings to predict `host_wins`.

We can use cross-validation with our training data to see how well it performs.

In [18]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

In [19]:
# set up a KNN model with n_neighbors=15 (all other settings left at their defaults)
# and cross-validate it on the training data using 5-fold cross-validation
knn = KNeighborsClassifier(n_neighbors=15)

In [20]:
# find the mean for your cross_val_scores
knn_cv_accuracy = cross_val_score(knn, X_train_std, y_train, cv=5).mean()
print('Mean cross-validated accuracy for default knn:', knn_cv_accuracy)
print('Baseline accuracy:', y_train.value_counts(normalize=True).max())

Mean cross-validated accuracy for default knn: 0.606901050075559
Baseline accuracy: 0.5910887531440892


In [21]:
knn.fit(X_train_std, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=15, p=2,
           weights='uniform')

In [22]:
knn.score(X_test_std, y_test)

0.6294416243654822

So far we have tried only one KNN configuration. But what if we changed the number of neighbors? The weighting? The distance metric?

These are all hyperparameters of the KNN. How would we tune them manually? We would cross-validate each combination of hyperparameter values on the training data and keep the combination with the best mean score. We could then judge the quality of the selected model by evaluating it on the test set.

#### Gridsearch pseudocode for our KNN

```python
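# candidate values to try (illustrative choices)
neighbors_to_test = [5, 15, 25]
weightings_to_test = ['uniform', 'distance']
distance_metrics_to_test = ['euclidean', 'manhattan']

accuracies = {}
for k in neighbors_to_test:
    for w in weightings_to_test:
        for d in distance_metrics_to_test:
            hyperparam_set = (k, w, d)
            # build a KNN with this combination and cross-validate it
            knn = KNeighborsClassifier(n_neighbors=k, weights=w, metric=d)
            cv_accuracies = cross_val_score(knn, X_train_std, y_train, cv=5)
            accuracies[hyperparam_set] = np.mean(cv_accuracies)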
```

In the pseudocode above, we would find the key in the dictionary (a hyperparameter set) that has the largest value (mean cross-validated accuracy).
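
Picking the winner from such a dictionary takes one line. A minimal sketch, using a made-up `accuracies` dictionary (the keys and scores are assumptions, not results from the basketball data):

```python
# Hypothetical results dictionary: (k, weights, metric) -> mean CV accuracy
accuracies = {
    (5, 'uniform', 'euclidean'): 0.59,
    (15, 'uniform', 'euclidean'): 0.61,
    (15, 'distance', 'manhattan'): 0.60,
}

# The key whose value (mean cross-validated accuracy) is largest
best_hyperparams = max(accuracies, key=accuracies.get)
best_accuracy = accuracies[best_hyperparams]

print(best_hyperparams, best_accuracy)
```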

#### Using `GridSearchCV`

This would be an annoying process to have to do manually. Luckily sklearn comes with a convenience class for performing gridsearch:

```python
from sklearn.model_selection import GridSearchCV
```

`GridSearchCV` has a handful of important arguments:

| Argument | Description |
| --- | ---|
| **`estimator`** | The sklearn instance of the model to fit on |
| **`param_grid`** | A dictionary where keys are hyperparameters for the model and values are lists of values to test |
| **`cv`** | The number of internal cross-validation folds to run for each set of hyperparameters |
| **`n_jobs`** | How many cores to use on your computer to run the folds (-1 means use all cores) |
| **`verbose`** | How much output to display (0 is none, 1 is limited, 2 is printouts for every internal fit) |
| **`return_train_score`** | Whether to also compute the training scores for each fold and include them in `cv_results_` |


Below is an example for how one might set up the gridsearch for our KNN:

```python
knn_parameters = {
    'n_neighbors':[1, 3, 5, 7, 9],
    'weights':['uniform', 'distance']
}
knn = KNeighborsClassifier()
knn_gridsearcher = GridSearchCV(knn, knn_parameters, cv=4, verbose=1)
knn_gridsearcher.fit(X_train_std, y_train)
```

**Try out the sklearn gridsearch below on the training data.**

In [23]:
from sklearn.model_selection import GridSearchCV

In [24]:
knn_params = {
    'n_neighbors': [5, 9, 15, 25, 40, 50, 60],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']}
knn = KNeighborsClassifier()
knn_gridsearch = GridSearchCV(knn,
                              knn_params,
                              n_jobs=2, 
                              cv=5, 
                              verbose=1, 
                              return_train_score=True)

knn_gridsearch.fit(X_train_std, y_train)

Fitting 5 folds for each of 28 candidates, totalling 140 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   21.1s
[Parallel(n_jobs=2)]: Done 140 out of 140 | elapsed:   58.1s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=2,
       param_grid={'n_neighbors': [5, 9, 15, 25, 40, 50, 60], 'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

Great!  We've fit our gridsearch.
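
The "28 candidates, 140 fits" in the log follows directly from the size of the grid. Here is a small sketch of that arithmetic; `count_fits` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
from math import prod

def count_fits(param_grid, cv):
    """Return (number of candidates, number of cross-validation fits)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv

knn_params = {
    'n_neighbors': [5, 9, 15, 25, 40, 50, 60],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan'],
}

n_candidates, n_fits = count_fits(knn_params, cv=5)
print(n_candidates, n_fits)  # 28 candidates, 140 fits
```

Because the counts multiply, adding one more hyperparameter with three candidate values would triple the work. This is the main pitfall of searching large hyperparameter spaces. With `refit=True`, one additional fit on the full training data follows the search.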

#### Examing the results of GridSearch( )

Once the gridsearch has fit (this can take a while!) we can pull out a variety of information and useful objects from the gridsearch object, stored as attributes:

| Property | Use |
| --- | ---|
| **`results.param_grid`** | Displays parameters searched over. |
| **`results.best_score_`** | Best mean cross-validated score achieved. |
| **`results.best_estimator_`** | The model refit with the best hyperparameters. It can be used like any fitted estimator (`.predict()`, `.score()`). |
| **`results.best_params_`** | The parameters that have been found to perform with the best score. |
| **`results.cv_results_`** | Display score attributes with corresponding parameters. | 

**Print out the parameter grid**

In [25]:
knn_gridsearch.param_grid

{'n_neighbors': [5, 9, 15, 25, 40, 50, 60],
 'weights': ['uniform', 'distance'],
 'metric': ['euclidean', 'manhattan']}

**Print out the best score found in the search.**

In [26]:
# print out the best mean cross-validated accuracy from the gridsearch
# hopefully this should be much better than our default mean cross-validated accuracy
knn_gridsearch.best_score_

0.6266618756737333

**Best estimator**

In [27]:
knn_gridsearch.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='manhattan',
           metric_params=None, n_jobs=None, n_neighbors=50, p=2,
           weights='uniform')

**Print out the set of hyperparameters that achieved the best score.**

In [28]:
# print out your best hyperparameters
knn_gridsearch.best_params_

{'metric': 'manhattan', 'n_neighbors': 50, 'weights': 'uniform'}

**Print out all intermediate results (returned as dictionary, cast into dataframe).**

In [29]:
pd.DataFrame(knn_gridsearch.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_metric,param_n_neighbors,param_weights,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.012028,0.002321,0.215343,0.031844,euclidean,5,uniform,"{'metric': 'euclidean', 'n_neighbors': 5, 'wei...",0.590664,0.579892,...,0.589651,0.014536,27,0.746631,0.7646,0.742588,0.751684,0.754827,0.752066,0.00754
1,0.009416,0.000997,0.206012,0.032022,euclidean,5,distance,"{'metric': 'euclidean', 'n_neighbors': 5, 'wei...",0.590664,0.579892,...,0.589651,0.014536,27,1.0,1.0,1.0,1.0,1.0,1.0,0.0
2,0.009865,0.001314,0.190792,0.022895,euclidean,9,uniform,"{'metric': 'euclidean', 'n_neighbors': 9, 'wei...",0.62298,0.605027,...,0.607977,0.016086,17,0.714286,0.728212,0.70575,0.715312,0.720251,0.716762,0.007386
3,0.008405,0.000726,0.1514,0.011414,euclidean,9,distance,"{'metric': 'euclidean', 'n_neighbors': 9, 'wei...",0.62298,0.605027,...,0.607977,0.016086,17,1.0,1.0,1.0,1.0,1.0,1.0,0.0
4,0.008865,0.000852,0.158837,0.004173,euclidean,15,uniform,"{'metric': 'euclidean', 'n_neighbors': 15, 'we...",0.603232,0.59605,...,0.606899,0.010421,20,0.694519,0.701707,0.687332,0.691962,0.694207,0.693946,0.004654
5,0.00901,0.000716,0.162534,0.013111,euclidean,15,distance,"{'metric': 'euclidean', 'n_neighbors': 15, 'we...",0.603232,0.59605,...,0.606899,0.010421,20,1.0,1.0,1.0,1.0,1.0,1.0,0.0
6,0.008992,0.001677,0.162062,0.007382,euclidean,25,uniform,"{'metric': 'euclidean', 'n_neighbors': 25, 'we...",0.612208,0.617594,...,0.609055,0.012375,13,0.668913,0.686433,0.671608,0.681634,0.672654,0.676248,0.006648
7,0.008758,0.001081,0.150942,0.004897,euclidean,25,distance,"{'metric': 'euclidean', 'n_neighbors': 25, 'we...",0.612208,0.617594,...,0.608696,0.012199,15,1.0,1.0,1.0,1.0,1.0,1.0,0.0
8,0.008766,0.000683,0.180395,0.008294,euclidean,40,uniform,"{'metric': 'euclidean', 'n_neighbors': 40, 'we...",0.617594,0.615799,...,0.609774,0.013528,10,0.662174,0.665319,0.656334,0.669511,0.667265,0.664121,0.004578
9,0.00816,0.000367,0.152203,0.008141,euclidean,40,distance,"{'metric': 'euclidean', 'n_neighbors': 40, 'we...",0.61939,0.612208,...,0.612289,0.012491,9,1.0,1.0,1.0,1.0,1.0,1.0,0.0
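
A table this wide is easier to read once it is sorted and trimmed. The sketch below shows one way to do that. It runs on small synthetic data from `make_classification` so that it is self-contained; the data, the grid, and the `demo_search` name are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the basketball data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_search = GridSearchCV(KNeighborsClassifier(),
                           {'n_neighbors': [3, 9], 'weights': ['uniform', 'distance']},
                           cv=3,
                           return_train_score=True)
demo_search.fit(X_demo, y_demo)

results = pd.DataFrame(demo_search.cv_results_)

# Gap between training and validation accuracy: large gaps suggest overfitting
results['train_minus_test'] = results['mean_train_score'] - results['mean_test_score']

columns = ['param_n_neighbors', 'param_weights', 'mean_test_score',
           'mean_train_score', 'train_minus_test', 'rank_test_score']
print(results.sort_values('rank_test_score')[columns])
```

In the basketball results above, every `weights='distance'` row has a training score of 1.0. With distance weighting, each training point is its own nearest neighbor at distance zero, so the training score tells us little and the test-fold score is the one to compare.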


## 5. Once you're happy with your hyperparameters, evaluate your model on the training and test data

> When you use a gridsearch's `.best_estimator_`, it will already have fit a model with the best hyperparameters on your training data, so all you have to do is score it on your training and testing data.

**Assign the best fit model (`best_estimator_`) to a variable and score it on the test data.**

Compare this model to the baseline accuracy and your default KNN.

In [30]:
# assign your best_estimator_ to the variable, then use .score( ) on your testing data
best_knn = knn_gridsearch.best_estimator_
print(best_knn.score(X_train_std, y_train))
print(best_knn.score(X_test_std, y_test))

0.6665468918433345
0.6467005076142132


You can also call `.score()` (or `.predict()`) directly on the fitted gridsearch object. With the default `refit=True`, these calls are passed on to the best estimator, so the results are identical.

In [31]:
print(knn_gridsearch.score(X_train_std, y_train))
print(knn_gridsearch.score(X_test_std, y_test))

0.6665468918433345
0.6467005076142132


### 6. Evaluation

Then  
- evaluate the performance of the model (accuracy, confusion matrix, etc)
- find the actual predictions that your model is providing and store them in a dataframe
- plot your predictions against your actual target variable to visualize how well your model performed
- investigate feature importance with `.coef_` if available

In [32]:
from sklearn.metrics import confusion_matrix, classification_report

predictions = best_knn.predict(X_test_std)
confusion = confusion_matrix(y_test, predictions, labels=[1, 0])
pd.DataFrame(confusion,
             columns=['predicted_home_win', 'predicted_home_loss'],
             index=['True_home_win', 'True_home_loss'])

Unnamed: 0,predicted_home_win,predicted_home_loss
True_home_win,501,93
True_home_loss,255,136


In [33]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.59      0.35      0.44       391
           1       0.66      0.84      0.74       594

   micro avg       0.65      0.65      0.65       985
   macro avg       0.63      0.60      0.59       985
weighted avg       0.64      0.65      0.62       985
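
The numbers in the classification report can be recomputed by hand from the confusion matrix above. A short sketch of that arithmetic, treating a home win as the positive class:

```python
# Counts read from the confusion matrix above (positive class = home win)
tp, fn = 501, 93    # true home wins: predicted win / predicted loss
fp, tn = 255, 136   # true home losses: predicted win / predicted loss

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision_win = tp / (tp + fp)    # of predicted home wins, how many were right
recall_win = tp / (tp + fn)       # of actual home wins, how many were found
precision_loss = tn / (tn + fn)   # of predicted home losses, how many were right
recall_loss = tn / (tn + fp)      # of actual home losses, how many were found

print(round(accuracy, 4))
print(round(precision_win, 2), round(recall_win, 2))
print(round(precision_loss, 2), round(recall_loss, 2))
```

The low recall for home losses (0.35) shows that the model leans heavily toward predicting a home win, which the accuracy figure alone does not reveal.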



## Gridsearch regularization penalties with logistic regression

---

Logistic regression models can also apply the Lasso and Ridge penalties. The `LogisticRegression` class takes these regularization-relevant hyperparameters:

| Argument | Description |
| --- | ---|
| **`penalty`** | `'l1'` for Lasso, `'l2'` for Ridge |
| **`solver`** | Must support the chosen penalty: `'liblinear'` (or `'saga'`) is needed for the Lasso penalty, while the `'lbfgs'` solver does not support it |
| **`C`** | The regularization strength (lower means more regularization)|


1. Fit and validate the accuracy of a default logistic regression on the basketball data.
1. Perform a gridsearch over different regularization strengths and Lasso and Ridge penalties.
1. Compare the accuracy on the test set of your optimized logistic regression to the baseline accuracy and the default model.
1. Look at the best parameters found. What was chosen? What does this suggest about our data?
1. Look at the coefficients and associated predictors for your optimized model. What appears to be the most important predictor?


In [34]:
from sklearn.linear_model import LogisticRegression

In [35]:
# set up a logistic regression model,
# and find its cross-validated scores for your training data
# you should get a mean accuracy of about 0.65, which is better than the KNN above!
lr = LogisticRegression(solver='lbfgs', max_iter=1000)

# find the mean of those cross-validated scores:
cross_val_score(lr, X_train_std, y_train, cv=5).mean()

0.649658370251734

In [36]:
# Set up the parameters.
# Use a list with 'l1' and 'l2' for the penalties,
# Use a list with 'liblinear' for the solver,
# Use a logspace from -3 to 0, with 50 different values

# fill the dictionary of parameters
gs_params = {'penalty': ['l1', 'l2'],
             'solver': ['liblinear'],
             'C': np.logspace(-3, 0, 50)}

# create your gridsearch object
lr_gridsearch = GridSearchCV(lr,
                             gs_params,
                             n_jobs=2, 
                             cv=5, 
                             verbose=1)

In [37]:
# fit your gridsearch object on your training data
lr_gridsearch.fit(X_train_std, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 500 out of 500 | elapsed:   13.1s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1000, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=2,
       param_grid={'penalty': ['l1', 'l2'], 'solver': ['liblinear'], 'C': array([0.001  , 0.00115, 0.00133, 0.00153, 0.00176, 0.00202, 0.00233,
       0.00268, 0.00309, 0.00356, 0.00409, 0.00471, 0.00543, 0.00625,
       0.0072 , 0.00829, 0.00954, 0.01099, 0.01265, 0.01456, 0.01677,
       0.01931, 0.02223...18, 0.32375,
       0.37276, 0.42919, 0.49417, 0.56899, 0.65513, 0.75431, 0.86851,
       1.     ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [38]:
# find the best mean cross-validated score that your gridsearch found:
# (this should be better than the mean cross-validated score for your default logistic regression above)
lr_gridsearch.best_score_

0.660797700323392

In [39]:
# find the best hyperparameters that your gridsearch found:
lr_gridsearch.best_params_

{'C': 0.0030888435964774815, 'penalty': 'l1', 'solver': 'liblinear'}
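
To see where this `C` sits within the searched range, the sketch below rebuilds the grid in pure Python (equivalent in spirit to `np.logspace(-3, 0, 50)`) and locates the selected value. The variable names are illustrative.

```python
# Pure-Python equivalent of np.logspace(-3, 0, 50):
# 50 exponents evenly spaced from -3 to 0, then 10 ** exponent
n_values = 50
c_grid = [10 ** (-3 + 3 * i / (n_values - 1)) for i in range(n_values)]

best_c = 0.0030888435964774815  # value reported by best_params_ above
closest_index = min(range(n_values), key=lambda i: abs(c_grid[i] - best_c))

print(c_grid[0], c_grid[-1])
print(closest_index, c_grid[closest_index])
```

The selected value is the ninth of fifty grid points, close to the small-`C` end. Since lower `C` means more regularization, the search favoured a strongly regularized Lasso model.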

In [40]:
# assign the best estimator to a variable:
best_lr = lr_gridsearch.best_estimator_

In [41]:
# score your best estimator on the testing data:
lr_gridsearch.score(X_test_std, y_test)

0.665989847715736

### Analyze the feature importances

In [42]:
# create a dataframe to look at the coefficients
coef_df = pd.DataFrame({'feature': X.columns,
                        'coef': best_lr.coef_[0],
                        'abs_coef': np.abs(best_lr.coef_[0])})


In [43]:
# sort by absolute value of coefficient (magnitude)
coef_df.sort_values('abs_coef', ascending=False, inplace=True)
coef_df.head()

Unnamed: 0,feature,coef,abs_coef
8,game_behind,0.240558,0.240558
0,Host_HostRank,0.0,0.0
54,gTOV%_avg10,0.0,0.0
62,g3PA_avg10,0.0,0.0
61,g3P_avg10,0.0,0.0
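
Most coefficients in the table are exactly zero, which is what a strong Lasso penalty typically produces. The sketch below illustrates that on synthetic data by comparing how many coefficients each penalty sets to exactly zero. The data, the value `C=0.01`, and the `zero_counts` name are assumptions for illustration, and recent scikit-learn versions may warn about the `penalty` argument.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the basketball data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=3, random_state=0)
X_demo_std = StandardScaler().fit_transform(X_demo)

zero_counts = {}
for penalty in ['l1', 'l2']:
    # same strong regularization for both penalties
    model = LogisticRegression(penalty=penalty, solver='liblinear', C=0.01)
    model.fit(X_demo_std, y_demo)
    zero_counts[penalty] = int(np.sum(model.coef_[0] == 0))
    print(penalty, 'zero coefficients:', zero_counts[penalty],
          'of', X_demo_std.shape[1])
```

The Ridge penalty shrinks coefficients toward zero without usually making them exactly zero, whereas the Lasso penalty removes predictors from the model altogether.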


## Conclusions

- You always want to use your training data to search for your best hyperparameters! You can do this with GridSearchCV, or with other sklearn objects like RidgeCV, LassoCV, ElasticNetCV, or LogisticRegressionCV.  


- You instantiate GridSearchCV with:
    - a model
    - a dictionary mapping that model's hyperparameter names to lists of values to try
    - the number of cross-validation folds you want it to perform (`cv=5`)
    - how many cores to use on your computer for this job (`n_jobs=1`)
    - whether you want your model to give you some print-outs while it works (`verbose=1`)
    

- Once you've instantiated the GridSearch object, you can fit it on your training data


- Once it's finished searching, you can access some useful attributes:
    - `.best_score_`, to find the mean cross-validated score of the best estimator
    - `.best_params_`, to find the best hyperparameters 
    - refer to `.best_estimator_` to extract `.coef_`, etc
    - `.score()` and `.predict()` used on the gridsearched model will make use of the best estimator
    - `.cv_results_` to display results for every hyperparameter combination (training scores are included only if you set `return_train_score=True`)