## Introduction

There are many models to choose from in ``sklearn``, and each comes with its own parameters and hyper-parameters.  How can you find the **best** (or, at least, pretty good) ones for your data?

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits, load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

---
## Grid Search

If you were posed this question and you didn't know a whole lot about the ``sklearn`` universe, you might say something like this:

"For every (hyper)parameter, let's take a list of values to try and ``for``-loop over them all."

That's pretty much hitting the nail on the head. Instead of writing ugly nested ``for``-loops (potentially indenting past what your monitor can show!), ``sklearn`` has ``GridSearchCV``.
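To make the "``for``-loop over them all" idea concrete, here's a minimal sketch of what doing it by hand might look like. The tiny grid, the 3-fold ``cross_val_score`` scoring, and the variable names are my own assumptions for illustration; they are not the setup used in the rest of this post.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

features, targets = load_digits(return_X_y=True)

# A deliberately tiny, hypothetical grid so the loops finish quickly.
n_components_values = [5, 10]
n_estimators_values = [10, 25]

best_score, best_params = -1.0, None
for n_components in n_components_values:
    for n_estimators in n_estimators_values:
        pipeline = Pipeline(
            [
                ("pca", PCA(n_components=n_components)),
                (
                    "random_forest",
                    RandomForestClassifier(
                        n_estimators=n_estimators, random_state=1234
                    ),
                ),
            ]
        )
        # Mean accuracy over 3 cross-validation folds.
        score = cross_val_score(pipeline, features, targets, cv=3).mean()
        if score > best_score:
            best_score = score
            best_params = {
                "n_components": n_components,
                "n_estimators": n_estimators,
            }

print(best_params, best_score)
```

Every extra parameter adds another level of nesting, which is exactly the bookkeeping ``GridSearchCV`` takes off your hands.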

Let's give an example, then chat about it.  If you've not read about [Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), check out [this post](/2021-01-08-sklearn-pipelines-how-to.html).

In [2]:
# Sample data, the Sklearn Digits Dataset.
df_features, df_targets = load_digits(return_X_y=True, as_frame=True)
x_train, x_test, y_train, y_test = train_test_split(
    df_features, df_targets, train_size=0.33, random_state=1234
)

# Create our pipelines: Preprocess, Model.
pipeline_preprocess = Pipeline([("pca", PCA(n_components=3))])

pipeline_model = Pipeline([("random_forest", RandomForestClassifier(n_estimators=100))])

# Hook up our pipelines together and train.
pipeline_full = Pipeline(
    [("preprocessing", pipeline_preprocess), ("modeling", pipeline_model)]
)

pipeline_full.fit(x_train, y_train)

# Score our model.
pipeline_full.score(x_test, y_test)

0.7491694352159468

As we can see, we've made pipelines for preprocessing, modeling, and then tying those together.  It might seem verbose, but it makes things much easier when attempting to extend one part of the model, or swap things out.

While not a perfect model, it gets a respectable accuracy with ``n_components=3`` in ``PCA`` and the default ``n_estimators=100`` in ``RandomForestClassifier``.  Maybe tweaking these values would give a better result.  Suppose we try ``[1, 5, 10, 15, 20, 25, 30, 35]`` for ``n_components`` in ``PCA`` and ``[1, 10, 25, 50, 75, 100, 125]`` for ``n_estimators`` in the Random Forest.  If you tried to do this yourself, you'd have to manually type in these values and run the model **56 times** (8 × 7 combinations).  That's much too much.  Instead, let's let grid-search do it for us.
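If you'd like to double-check that count before committing to a long run, ``ParameterGrid`` from ``sklearn.model_selection`` will enumerate the combinations for you. This is just a quick sketch; the short key names are placeholders of mine, not the pipeline parameter names used later.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder key names; only the number of values per key matters here.
param_grid = {
    "n_components": [1, 5, 10, 15, 20, 25, 30, 35],
    "n_estimators": [1, 10, 25, 50, 75, 100, 125],
}

grid = ParameterGrid(param_grid)
n_candidates = len(grid)

# GridSearchCV defaults to 5-fold cross-validation, so each candidate
# is fit once per fold (plus one final refit of the best candidate).
n_fits = n_candidates * 5

print(n_candidates)
print(n_fits)
```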

(Note that ``GridSearchCV`` scores every parameter combination with cross-validation on the data passed to ``fit``, so we no longer need a separate split for choosing parameters.  However, we will still hold out data: the former test set is renamed to a validation set and used at the end to score the chosen model.)



In [3]:
# Sample data, the Sklearn Digits Dataset.
df_features, df_targets = load_digits(return_X_y=True, as_frame=True)
x_train, x_validation, y_train, y_validation = train_test_split(
    df_features, df_targets, train_size=0.33, random_state=1234
)

# Create our pipelines: Preprocess, Model.
pipeline_preprocess = Pipeline([("pca", PCA(n_components=3))])

pipeline_model = Pipeline([("random_forest", RandomForestClassifier(n_estimators=100))])

# Hook up our pipelines together.
pipeline_full = Pipeline(
    [("preprocessing", pipeline_preprocess), ("modeling", pipeline_model)]
)

# Parameters we're making a grid of.
#
# In our case, since `pipeline_full` is a pipeline of pipelines, we must
# use (pipeline_name)__(estimator_name)__(param_name).
#
# For example, `n_estimators` is given by `modeling__random_forest__n_estimators`.
#
# If you're not sure what to use, you can always print ``pipeline_full.get_params()``.
#
# See: https://scikit-learn.org/stable/modules/compose.html#nested-parameters

param_grid = {
    "modeling__random_forest__n_estimators": [1, 10, 25, 50, 75, 100, 125],
    "preprocessing__pca__n_components": [1, 5, 10, 15, 20, 25, 30, 35],
}

# NOTE: This takes about a minute.
grid_search = GridSearchCV(pipeline_full, param_grid)
grid_search.fit(x_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('preprocessing',
                                        Pipeline(steps=[('pca',
                                                         PCA(n_components=3))])),
                                       ('modeling',
                                        Pipeline(steps=[('random_forest',
                                                         RandomForestClassifier())]))]),
             param_grid={'modeling__random_forest__n_estimators': [1, 10, 25,
                                                                   50, 75, 100,
                                                                   125],
                         'preprocessing__pca__n_components': [1, 5, 10, 15, 20,
                                                              25, 30, 35]})
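As mentioned in the comments above, ``get_params()`` will tell you which nested names are available. Here's a small sketch that rebuilds the same pipeline-of-pipelines and lists only the fully nested names; filtering on exactly two ``__`` separators is just a convenience I'm assuming fits this particular two-level layout.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

pipeline_full = Pipeline(
    [
        ("preprocessing", Pipeline([("pca", PCA(n_components=3))])),
        (
            "modeling",
            Pipeline([("random_forest", RandomForestClassifier(n_estimators=100))]),
        ),
    ]
)

# Names of the form (pipeline_name)__(estimator_name)__(param_name)
# contain exactly two "__" separators in this two-level layout.
tunable_names = sorted(
    name for name in pipeline_full.get_params() if name.count("__") == 2
)

for name in tunable_names:
    print(name)
```

Any of the printed names can be used as a key in ``param_grid``.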

In [4]:
print(grid_search.score(x_validation, y_validation))
print(grid_search.best_score_)
print(grid_search.best_estimator_)

0.9493355481727574
0.9628827802307363
Pipeline(steps=[('preprocessing',
                 Pipeline(steps=[('pca', PCA(n_components=30))])),
                ('modeling',
                 Pipeline(steps=[('random_forest',
                                  RandomForestClassifier(n_estimators=125))]))])


On the digits dataset, using more PCA components and more estimators gave us better accuracy (up to a point: 30 components beat 35); this isn't always the case, which is a good reason to grid-search in the first place.

One thing you might have noticed: this took a while to run.  Modeling the digit dataset is typically do-able in a second or less, but this took around a minute!  This seems trivial until we think about training datasets much larger than the digit dataset.  That's a problem.  This can be resolved in a few ways:

- Reduce your parameter space (using commonly accepted "good" parameters may work well!),
- Use a smarter grid-search (there are several out there which are a bit more complicated and situational),
- Try a coarse grid of spread-out parameters to home in on areas which may be worth a closer look,
- Try something like ``RandomizedSearchCV``.
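For the last option, here's a rough sketch of what ``RandomizedSearchCV`` might look like on the digits data. Instead of fixed lists, it samples ``n_iter`` candidates from distributions. The flat (non-nested) pipeline, the ``scipy.stats.randint`` ranges, ``n_iter=8``, and ``cv=3`` are all choices I made up to keep the example short and quick.

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

features, targets = load_digits(return_X_y=True)

pipeline = Pipeline(
    [("pca", PCA()), ("random_forest", RandomForestClassifier(random_state=1234))]
)

# Distributions to sample from, instead of fixed lists of values.
# randint(low, high) draws integers from low up to high - 1.
param_distributions = {
    "pca__n_components": randint(1, 36),
    "random_forest__n_estimators": randint(1, 126),
}

# Only 8 sampled candidates are evaluated, rather than all 56 grid points.
random_search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=8, cv=3, random_state=1234
)
random_search.fit(features, targets)

print(random_search.best_params_)
print(random_search.best_score_)
```

The trade-off is that you control the budget directly with ``n_iter``, but you're no longer guaranteed to visit any particular combination.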

There are many, many other potential solutions for the problem of "too big of a grid", but we will note one other thing here.  For parameters like regularization (which are commonly gridded), the workload can be reduced by computing the [regularization path](https://scikit-learn.org/stable/modules/grid_search.html#grid-search-tips).
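As one hedged illustration of that idea: ``LogisticRegressionCV`` is, as far as I understand the linked tips, one of the estimators that searches over its own regularization strength ``C`` internally. The sketch below assumes the iris data and a ``Cs`` range picked purely for demonstration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features, targets = load_iris(return_X_y=True)

# The estimator cross-validates over the candidate values in `Cs` itself,
# so no outer GridSearchCV is needed for this one parameter.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=np.logspace(-4, 2, 10), cv=5, max_iter=1_000),
)
model.fit(features, targets)

# The selected regularization strength(s) live on the fitted final step.
print(model[-1].C_)

training_accuracy = model.score(features, targets)
print(training_accuracy)
```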

It may also be worth checking out parallelization methods if you're going to be using larger grids on significant amounts of data.
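The simplest parallelization knob I know of is the ``n_jobs`` parameter on ``GridSearchCV`` itself, which spreads the fits over multiple CPU cores. A minimal sketch, with a tiny made-up grid and ``cv=3`` so it runs quickly:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

features, targets = load_digits(return_X_y=True)

# A small illustrative grid; real grids are where parallelism pays off.
param_grid = {"n_estimators": [10, 25, 50]}

# n_jobs=-1 asks for all available cores; the candidate/fold fits are
# independent of each other, so they can run side by side.
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=1234), param_grid, cv=3, n_jobs=-1
)
grid_search.fit(features, targets)

print(grid_search.best_params_)
```

How much this helps depends on your machine and on how expensive each individual fit is.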

---
## What do we get from GridSearchCV?

When we ran ``GridSearchCV`` above, we took the ``.best_estimator_`` and were done with it. Can we look a bit closer into the results?  Sure.

In [5]:
df_grid_search_results = pd.DataFrame(grid_search.cv_results_)
df_grid_search_results = df_grid_search_results[
    [
        "param_modeling__random_forest__n_estimators",
        "param_preprocessing__pca__n_components",
        "mean_test_score",
    ]
]
df_grid_search_results.head(5)

| | param_modeling__random_forest__n_estimators | param_preprocessing__pca__n_components | mean_test_score |
|---|---|---|---|
| 0 | 1 | 1 | 0.227603 |
| 1 | 1 | 5 | 0.741988 |
| 2 | 1 | 10 | 0.704885 |
| 3 | 1 | 15 | 0.684475 |
| 4 | 1 | 20 | 0.709927 |
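``cv_results_`` also carries columns such as ``rank_test_score`` and ``std_test_score``, which make it easy to see the top candidates and how stable their scores were across folds. Here's a small self-contained sketch; the flat pipeline, the tiny grid, and ``cv=3`` are assumptions of mine to keep it fast, not the grid from above.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

features, targets = load_digits(return_X_y=True)

pipeline = Pipeline(
    [("pca", PCA()), ("random_forest", RandomForestClassifier(random_state=1234))]
)
param_grid = {
    "pca__n_components": [5, 15],
    "random_forest__n_estimators": [1, 10, 25],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(features, targets)

df_results = pd.DataFrame(grid_search.cv_results_)

# Sort so that the best-ranked candidate (rank 1) comes first.
columns = [
    "param_pca__n_components",
    "param_random_forest__n_estimators",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
df_ranked = df_results.sort_values("rank_test_score")[columns]
print(df_ranked.head(3))
```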


In [6]:
# Plot these values.
chart = (
    alt.Chart(df_grid_search_results)
    .encode(
        x="param_modeling__random_forest__n_estimators:Q",
        y="param_preprocessing__pca__n_components",
        color=alt.Color("mean_test_score", scale=alt.Scale(scheme="redblue")),
    )
    .configure_axis(grid=False)
    .mark_circle()
)
chart

Seems like a lot of the grid was pretty close, in terms of scoring.  This is reasonable, given how small and simple the data is.
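If you'd rather see numbers than colors, one option is to pivot the results into a two-dimensional table, one row per ``n_components`` and one column per ``n_estimators``. The sketch below is self-contained and uses a small made-up grid with a flat pipeline, so the values won't match the chart above.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

features, targets = load_digits(return_X_y=True)

pipeline = Pipeline(
    [("pca", PCA()), ("random_forest", RandomForestClassifier(random_state=1234))]
)
param_grid = {
    "pca__n_components": [5, 15],
    "random_forest__n_estimators": [1, 10, 25],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(features, targets)

df_results = pd.DataFrame(grid_search.cv_results_)

# Rows: PCA components. Columns: number of trees. Cells: mean CV accuracy.
df_pivot = df_results.pivot(
    index="param_pca__n_components",
    columns="param_random_forest__n_estimators",
    values="mean_test_score",
)
print(df_pivot.round(3))

# Gap between the best and worst candidate on the grid.
score_range = df_pivot.max().max() - df_pivot.min().min()
print(score_range)
```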

---
## One More Example

Let's do one last easy example to solidify this.  We'll use the iris dataset, but we'll have a ridiculously small training size.  Let's see how well we can do.

In [7]:
df_features, df_targets = load_iris(return_X_y=True, as_frame=True)
x_train, x_validation, y_train, y_validation = train_test_split(
    df_features, df_targets, train_size=0.15, random_state=1234
)

pipeline_preprocess = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA(n_components=3))]
)

# NOTE: Here, we could have used LogisticRegressionCV to grid values for C.
# Since we're focusing on GridSearchCV, I decided to use the standard LogReg.
pipeline_model = Pipeline(
    [("logistic_regression", LogisticRegression(C=1.0, max_iter=1_000))]
)

# Hook up our pipelines together.
pipeline_full = Pipeline(
    [("preprocessing", pipeline_preprocess), ("modeling", pipeline_model)]
)

param_grid = {
    "modeling__logistic_regression__C": np.logspace(-4, 2, 10),
    "preprocessing__pca__n_components": [1, 2, 3, 4],
}

# NOTE: Takes a few seconds.
grid_search = GridSearchCV(pipeline_full, param_grid)
grid_search.fit(x_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('preprocessing',
                                        Pipeline(steps=[('scaler',
                                                         StandardScaler()),
                                                        ('pca',
                                                         PCA(n_components=3))])),
                                       ('modeling',
                                        Pipeline(steps=[('logistic_regression',
                                                         LogisticRegression(max_iter=1000))]))]),
             param_grid={'modeling__logistic_regression__C': array([1.00000000e-04, 4.64158883e-04, 2.15443469e-03, 1.00000000e-02,
       4.64158883e-02, 2.15443469e-01, 1.00000000e+00, 4.64158883e+00,
       2.15443469e+01, 1.00000000e+02]),
                         'preprocessing__pca__n_components': [1, 2, 3, 4]})

In [8]:
df_grid_search_results = pd.DataFrame(grid_search.cv_results_)
df_grid_search_results = df_grid_search_results[
    [
        "param_modeling__logistic_regression__C",
        "param_preprocessing__pca__n_components",
        "mean_test_score",
    ]
]

chart = (
    alt.Chart(df_grid_search_results)
    .encode(
        x="param_modeling__logistic_regression__C:Q",
        y="param_preprocessing__pca__n_components",
        color=alt.Color("mean_test_score", scale=alt.Scale(scheme="redblue")),
    )
    .configure_axis(grid=False)
    .mark_circle()
)
chart

In [9]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'modeling__logistic_regression__C': 21.54434690031882, 'preprocessing__pca__n_components': 3}
1.0


Interesting!  Of course, this isn't meant to show the best models for these smaller datasets, but rather how to use the tools for your larger, more complex data.
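One caveat worth checking for yourself: ``best_score_`` is the mean cross-validation score on the (tiny) training set, not a score on held-out data. A sketch of scoring the held-out validation set follows; it reuses the same split as above but, as a simplification of mine, drops the PCA step and the nested pipelines.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

features, targets = load_iris(return_X_y=True)
x_train, x_validation, y_train, y_validation = train_test_split(
    features, targets, train_size=0.15, random_state=1234
)

# Simplified, flat pipeline (no PCA) to keep the sketch short.
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("logistic_regression", LogisticRegression(max_iter=1_000)),
    ]
)
param_grid = {"logistic_regression__C": np.logspace(-4, 2, 10)}

grid_search = GridSearchCV(pipeline, param_grid)
grid_search.fit(x_train, y_train)

# Cross-validated score on the training data vs. score on unseen data.
cv_score = grid_search.best_score_
validation_score = grid_search.score(x_validation, y_validation)
print(cv_score)
print(validation_score)
```

Comparing the two numbers gives a more honest picture of how the chosen parameters generalize.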

_Happy gridding!_  ``:']``