# Exercises - Week 6 - Grid Search - Blackjack

## References

- https://scikit-learn.org/stable/modules/grid_search.html
- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
- https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
- https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
- https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
- https://scikit-learn.org/stable/modules/model_evaluation.html
- https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
- https://scikit-learn.org/stable/tutorial/statistical_inference/index.html

## Contents
1. Setup 
2. Grid search
3. Datasets

## 1. Setup

Below, libraries are loaded, a helper function is defined, and the features dataframe and target series are created for use by the `GridSearchCV` demonstrations that follow.

Load libraries and display version numbers.

In [8]:
import pandas  as pd
import numpy   as np
import sklearn as sk
print('sklearn',sk.__version__)
print('pandas ',pd.__version__)
print('numpy  ',np.__version__)

These version numbers may not be the most recent or correspond to the documentation you locate via Google.

The `display_pdf` function displays a pandas dataframe using the Databricks `display` function.

In [11]:
def display_pdf(a_pdf):
  display(spark.createDataFrame(a_pdf))

The demonstrations below all use the Boston housing dataset. The following code cell displays its dimensions.

In [13]:
from sklearn.datasets import load_boston
load_boston().data.shape

The following cell displays the first three rows of features.

In [15]:
load_boston().data[ :3]

Notice that all values are numeric (which makes sense as they are stored in a numpy array).

The following cell displays the first three values of the target.

In [18]:
load_boston().target[:3]

Count the number of missing values in the dataset. (There aren't any.)

In [20]:
import numpy  as np
from sklearn.datasets import load_boston
np.sum(np.isnan(load_boston().data))

It would be helpful to have some missing values in the dataset. The following code cell creates up to 20 missing values in `bos_fea_array` (fewer if the same cell is picked twice at random).

In [22]:
import pandas as pd
from sklearn.datasets import load_boston
from random           import randint
bos_fea_array = load_boston().data 
for n in range(20): 
  bos_fea_array[randint(0,502), 
                randint(0,12)
               ] = np.nan

The following cell checks how many missing values were created (20, unless the same cell was picked more than once).

In [24]:
np.sum(np.isnan(bos_fea_array))
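
Because `randint` can pick the same row and column twice, the loop above may create fewer than 20 missing values. The following is a minimal sketch of one way to guarantee exactly 20, by sampling positions without replacement. It uses a synthetic array with the same shape as the Boston features (506 rows, 13 columns) rather than the real data; the names `demo_array` and `rng` and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded only so the sketch is repeatable
demo_array = rng.normal(size=(506, 13))   # synthetic stand-in for the Boston features

# Draw 20 distinct flat positions, so no cell can be chosen twice
flat_idx = rng.choice(demo_array.size, size=20, replace=False)
rows, cols = np.unravel_index(flat_idx, demo_array.shape)
demo_array[rows, cols] = np.nan

n_missing = int(np.isnan(demo_array).sum())
print(n_missing)
```

Sampling flat positions and converting them back to row and column indices keeps the code short while ensuring every chosen cell is different.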

Define two functions to provide the target series and feature dataframe for the Boston housing dataset, including the additional missing values.

In [26]:
def bos_tgt_ser(): 
  import pandas as pd
  from sklearn import datasets
  return pd.Series(data=datasets.load_boston().target)

def bos_fea_pdf(): 
  import numpy  as np
  import pandas as pd
  from random           import randint
  from sklearn.datasets import load_boston
  boston          = load_boston()
  bos_fea_array   = boston.data 
  bos_fea_columns = boston.feature_names
  # Overwrite up to 20 randomly chosen cells with missing values
  for n in range(20): 
    bos_fea_array[randint(0,502), 
                  randint(0,12)
                 ] = np.nan
  return pd.DataFrame(data   =          bos_fea_array,
                      columns=pd.Series(bos_fea_columns).str.lower()
                     )

The next two cells display summary details about the features and the target.

In [28]:
bos_fea_pdf().info()

The features dataframe only contains `float` values/columns.

In [30]:
bos_tgt_ser()[:3]

The target series is also of type `float`.

## 2. Grid search

Grid search is a technique for creating models using a range of hyper-parameters and then scoring the predictions made by these models.

First, create a pipeline below which:
1. Imputes missing values
2. Scales values to the `0` to `1` range
3. Performs a linear regression on the data

In [34]:
from sklearn.pipeline        import Pipeline
from sklearn.linear_model    import LogisticRegression, LinearRegression
from sklearn.preprocessing   import MinMaxScaler
from sklearn.impute          import SimpleImputer

est_pipe = Pipeline([
  ('imp', SimpleImputer()),
  ('sca', MinMaxScaler()),
  ('lrg', LinearRegression())
])

This pipeline will be used in the grid search example below.
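
To see what the first two steps of the pipeline do, here is a small sketch on a made-up three-row array (the array `toy` and the name `prep_pipe` are assumptions for illustration, not part of the exercise). The missing value is replaced by its column mean, and each column is then rescaled to the `0` to `1` range.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data: the second column has one missing value
toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [5.0, 8.0]])

prep_pipe = Pipeline([
    ('imp', SimpleImputer()),   # default strategy is 'mean'
    ('sca', MinMaxScaler()),    # rescales each column to [0, 1]
])

toy_out = prep_pipe.fit_transform(toy)
print(toy_out)
```

The missing entry becomes 6 (the mean of 4 and 8), which scales to 0.5 in its column.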

Consider the `parameters` dictionary below:
- The first three characters of the keys correspond to names of pipeline steps
- Two underscores follow the name of the pipeline step
- The remainder of the key is the name of an init parameter of the named step
- The value (of the key) is a list of possible values for that named init parameter

In [37]:
parameters = {
  'imp__strategy': ['mean', 'median'],
  'lrg__fit_intercept': [True, False]
}
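
If you are unsure which keys are valid, one way to check is to list the parameter names that the pipeline itself reports through `get_params`. The sketch below rebuilds the same pipeline under the assumed name `demo_pipe` and prints every `step__parameter` name.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

demo_pipe = Pipeline([
    ('imp', SimpleImputer()),
    ('sca', MinMaxScaler()),
    ('lrg', LinearRegression()),
])

# Keep only the names of the form <step name>__<init parameter>
valid_keys = sorted(k for k in demo_pipe.get_params() if '__' in k)
for key in valid_keys:
    print(key)
```

Any key used in the `parameters` dictionary should appear in this list.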

The following cell performs a grid search. This means that the above pipeline is run:
- for all combinations of parameters listed in the `parameters` dictionary
- in this case, for each combination of hyper-parameters, a model is fit on the train dataset and scored on the test dataset of each of three cross validation pairs (see the `cv` parameter)

The keys of the `parameters` dictionary are hyper-parameters of the transformers or final estimator of the pipeline.
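
The combinations can be listed before running anything. The sketch below uses `ParameterGrid` from `sklearn.model_selection` to enumerate them; the variable names `demo_parameters`, `combos` and `n_cv_fits` are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid

demo_parameters = {
    'imp__strategy': ['mean', 'median'],
    'lrg__fit_intercept': [True, False],
}

# Every combination of the listed values, one dictionary per combination
combos = list(ParameterGrid(demo_parameters))
for combo in combos:
    print(combo)

# With cv=3, each combination is fit and scored three times
n_cv_fits = len(combos) * 3
print(len(combos), n_cv_fits)
```

With two values for each of two parameters there are four combinations, so three cross validation pairs give twelve fits (plus one final refit on all the data when `refit` is left at its default of `True`).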

The code below:
1. Creates a `GridSearchCV` object, which is an estimator
2. Fits this object to the features and target of the Boston housing dataset

Notice three key parameters of this object:
1. `estimator`: the pipeline, which is a series of transformer objects and a final estimator
2. `param_grid`: the parameters (hyper-parameters of the step objects)
3. `cv`: the number of cross validation pairs

In [39]:
from sklearn.model_selection import GridSearchCV

est_grid_obj = GridSearchCV(estimator=est_pipe, 
                            param_grid=parameters,
                            cv=3, 
                            iid=False,
                            return_train_score=True
                           )
est_grid_obj.fit(bos_fea_pdf(), 
                 bos_tgt_ser()
                )

The output indicates the steps of the pipeline, including the default parameters for each.
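
The cell above depends on `load_boston` and the `iid` argument, both of which are tied to the scikit-learn version used in this notebook. As a hedged alternative, the sketch below runs the same kind of search on synthetic data from `make_regression` and leaves `iid` out, since newer scikit-learn releases no longer accept it. The data, the seed and all `demo_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data with 13 features, plus 20 missing values
X_demo, y_demo = make_regression(n_samples=120, n_features=13,
                                 noise=10.0, random_state=0)
rng = np.random.default_rng(0)
flat_idx = rng.choice(X_demo.size, size=20, replace=False)
rows, cols = np.unravel_index(flat_idx, X_demo.shape)
X_demo[rows, cols] = np.nan

demo_pipe = Pipeline([
    ('imp', SimpleImputer()),
    ('sca', MinMaxScaler()),
    ('lrg', LinearRegression()),
])
demo_parameters = {
    'imp__strategy': ['mean', 'median'],
    'lrg__fit_intercept': [True, False],
}

demo_grid = GridSearchCV(estimator=demo_pipe,
                         param_grid=demo_parameters,
                         cv=3,
                         return_train_score=True)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
print(len(demo_grid.cv_results_['params']))
```

The fitted object exposes `best_params_` and `cv_results_` in the same way as the Boston example.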

Grid search (`GridSearchCV` in Python) is often used to optimize the choice of hyper-parameters in order that the pipeline (as determined by its hyper-parameters) produces the best prediction score. 

We use `GridSearchCV`  to investigate the effects of the chosen hyper-parameter values on the scoring of the predictions made by the estimator pipeline. This will allow us to better understand a range of estimators and transformers, through the effect of their hyper-parameters on the scoring of their predictions.

In particular, the `GridSearchCV` object contains the `cv_results_` attribute, which contains information on the hyper-parameters and the scoring.

In [43]:
est_grid_obj.best_params_

In [44]:
est_grid_obj.cv_results_

There is a lot to notice here:
- The keys that start with `param_` contain values from hyper-parameter options in `parameters`
- The keys that end with `_test_score` contain results from the scoring of cross validation train-test pairs

Below, this dictionary is converted into a dataframe. The keys become the columns of the dataframe.

In [46]:
import pandas as pd
pd.DataFrame(data=est_grid_obj.cv_results_)

__Aside:__ what does it mean to summarize one variable in terms of another?

The scoring results can be summarized in terms of the parameter options.
There are three reasons for this:
1. The results are a __dataframe__
2. The parameter options are __categorical variables__ (to be used by the `groupby` method)
3. The score results are __numeric variables__ (to be used by the `agg` method)

The first and last items above are always true, but the second will require some work. We need to ensure that the values of the parameter options are either strings or integers (so they can be used by the `groupby` method).

The example below is a demonstration where this is not the case.

I expect to use this often and so have placed the code above in a function (below).

In [50]:
def est_grid_results_pdf(my_est_grid_obj): 
  import pandas as pd
  return pd.DataFrame(data=my_est_grid_obj.cv_results_)

__Exercise:__ Create a copy of `est_grid_results_pdf` and modify it to only return the parameter options and score results. Run it to produce a pandas dataframe with the reduced set of variables. 

See `Cmd 24` (starting "There is a lot to notice here") for the variables.
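
One possible approach to this exercise, sketched under assumptions: the function name `est_grid_params_scores_pdf` is hypothetical, and the columns are picked by name pattern with the pandas `filter` method. The demonstration fits a tiny grid search on synthetic data so that the block runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

def est_grid_params_scores_pdf(my_est_grid_obj):
    """Hypothetical variant: keep only param_* and *_test_score columns."""
    results = pd.DataFrame(data=my_est_grid_obj.cv_results_)
    # '^param_' keeps the per-parameter columns (but not the 'params' dict column);
    # '_test_score$' keeps split, mean, std and rank test score columns
    return results.filter(regex=r'^param_|_test_score$')

# Small synthetic demonstration (not the Boston data)
X_demo, y_demo = make_regression(n_samples=60, n_features=4,
                                 noise=5.0, random_state=0)
demo_grid = GridSearchCV(LinearRegression(),
                         {'fit_intercept': [True, False]},
                         cv=3)
demo_grid.fit(X_demo, y_demo)

reduced = est_grid_params_scores_pdf(demo_grid)
print(list(reduced.columns))
```

Selecting by pattern means the function keeps working when parameters are added to the grid later.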

Three functions defined above are used below:
- `bos_tgt_ser`: returns the target series from the Boston dataset using the `sklearn.datasets.load_boston` function 
- `bos_fea_pdf`: returns the feature dataframe from the Boston dataset using the `sklearn.datasets.load_boston` function 
- `est_grid_results_pdf`: returns the dataframe of results from a fitted `GridSearchCV` object

The exercises below will focus on the three code cells immediately below.

In [54]:
from sklearn.pipeline        import Pipeline
from sklearn.linear_model    import LogisticRegression, LinearRegression
from sklearn.preprocessing   import MinMaxScaler
from sklearn.impute          import SimpleImputer

est_pipe = Pipeline([
  ('imp', SimpleImputer()),
  ('sca', MinMaxScaler()),
  ('lrg', LinearRegression())
])

In [55]:
parameters = {
  'imp__strategy': ['mean', 'median'],
  'lrg__fit_intercept': [True, False]
}

from sklearn.model_selection import GridSearchCV

est_grid_obj = GridSearchCV(estimator=est_pipe, 
                        param_grid=parameters,
                        cv=3, 
                        iid=False,
                        return_train_score=False
                       )
est_grid_obj.fit(bos_fea_pdf(), 
                 bos_tgt_ser())

In [56]:
est_grid_results_pdf(est_grid_obj)

__Exercise:__ Summarize the output from `est_grid_results_pdf(est_grid_obj)` with the `groupby` and `agg` methods.

I understand that there are only four cases and that you can do this visually, but use these methods to learn the technique.

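The following two cells show how to call the `groupby` method on the results dataframe. They only create the groupby objects; no `agg` call is applied yet. Note that grouping by a timing column such as `mean_fit_time` typically places each row in its own group, because the timing values rarely repeat.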

In [59]:
x = est_grid_results_pdf(est_grid_obj).groupby(est_grid_results_pdf(est_grid_obj)['mean_fit_time'])

In [60]:
x1 = est_grid_results_pdf(est_grid_obj).groupby(est_grid_results_pdf(est_grid_obj)['std_score_time'])
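
For comparison, here is a sketch of a summary that groups by a parameter option and aggregates a score column. It runs on synthetic data so it is self-contained; the `demo_` names are assumptions. The parameter column is converted to strings first so that it behaves as a categorical variable for `groupby`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_demo, y_demo = make_regression(n_samples=90, n_features=5,
                                 noise=5.0, random_state=1)
X_demo[::10, 0] = np.nan   # a few missing values so the imputer has work to do

demo_pipe = Pipeline([('imp', SimpleImputer()),
                      ('lrg', LinearRegression())])
demo_grid = GridSearchCV(demo_pipe,
                         {'imp__strategy': ['mean', 'median'],
                          'lrg__fit_intercept': [True, False]},
                         cv=3)
demo_grid.fit(X_demo, y_demo)

results = pd.DataFrame(demo_grid.cv_results_)
# Make the grouping column a plain string column
results['param_imp__strategy'] = results['param_imp__strategy'].astype(str)

# One row per imputing strategy, summarizing the mean test score
summary = (results
           .groupby('param_imp__strategy')['mean_test_score']
           .agg(['mean', 'min', 'max']))
print(summary)
```

The same pattern works for any other `param_` column once its values are strings or integers.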

__Exercise:__ copy the three code cells above and modify as below:
- Add `SelectKBest()` to the pipeline in the third position, just before the `LinearRegression` 
- Run `GridSearchCV` on this pipeline (with the same parameters as before)
- Check that reasonable looking results are produced by the `est_grid_results_pdf` function

This initial block of text explains our understanding of the grid search method and works through the three code cells. The first cell imports the required classes from sklearn and creates an estimator pipeline. Because the pipeline functions as an estimator, its final step must be an estimator; `SelectKBest` is a transformer, so we place it before `LinearRegression`. Had we placed it last, the pipeline could not have made predictions. The second cell defines the parameter grid over which the grid search is conducted. For our purposes, the initial parameters are the imputing strategy (mean or median) and whether the linear regression fits an intercept (when `fit_intercept` is `False`, the fitted line is forced through the origin). The same cell imports `GridSearchCV` and creates a grid search object, which takes the estimator, the parameter grid, and the number of cross validation pairs as arguments. We also set `iid` to `False` and `return_train_score` to `False`. Finally, we fit the object to the Boston features dataframe and the Boston target series, mimicking what will be expected of us later in the project. The third cell calls the `est_grid_results_pdf` function defined above to show the results for the object we created (`est_grid_obj`). Future text will refer back to this cell as the "original explanation".

In [63]:
from sklearn.pipeline        import Pipeline
from sklearn.linear_model    import LogisticRegression, LinearRegression
from sklearn.preprocessing   import MinMaxScaler
from sklearn.impute          import SimpleImputer
from sklearn.feature_selection import SelectKBest

est_pipe = Pipeline([
  ('imp', SimpleImputer()),
  ('sca', MinMaxScaler()),
  ('kbest', SelectKBest()),
  ('lrg', LinearRegression())
])

In [64]:
parameters = {
  'imp__strategy': ['mean', 'median'],
  'lrg__fit_intercept': [True, False]
}

from sklearn.model_selection import GridSearchCV

est_grid_obj = GridSearchCV(estimator=est_pipe, 
                        param_grid=parameters,
                        cv=3, 
                        iid=False,
                        return_train_score=False
                       )
est_grid_obj.fit(bos_fea_pdf(), 
                 bos_tgt_ser())

In [65]:
est_grid_results_pdf(est_grid_obj)

__Exercise:__ copy the three code cells from the previous exercise and modify as below:
- Add a key-value pair to the `parameters` dictionary for the `k` parameter of the `SelectKBest` class/object with value `[13,10,7,4]`
- Run `GridSearchCV` on the pipeline with the new `parameters` dictionary
- Use the `est_grid_results_pdf` function modified in a previous exercise
- Check that reasonable looking results are produced and notice the column for the new parameter

This text block discusses the code added since the "original explanation". The `parameters` dictionary in the second cell now includes a new parameter to be tested, the `k` parameter of `SelectKBest`, with the values specified in the exercise. This means that the grid search will produce 16 results (4 x 2 x 2), and we hope to find a better score than in the previously run cells.

In [68]:
from sklearn.pipeline        import Pipeline
from sklearn.linear_model    import LogisticRegression, LinearRegression
from sklearn.preprocessing   import MinMaxScaler
from sklearn.impute          import SimpleImputer
from sklearn.feature_selection import SelectKBest

est_pipe = Pipeline([
  ('imp', SimpleImputer()),
  ('sca', MinMaxScaler()),
  ('kbest', SelectKBest()),
  ('lrg', LinearRegression())
])

Note the presence of the `kbest__k` parameter in the dictionary below.

In [70]:
parameters = {
  'imp__strategy': ['mean', 'median'],
  'lrg__fit_intercept': [True, False],
  'kbest__k': [13,10,7,4]
}

from sklearn.model_selection import GridSearchCV

est_grid_obj = GridSearchCV(estimator=est_pipe, 
                        param_grid=parameters,
                        cv=3, 
                        iid=False,
                        return_train_score=False
                       )
est_grid_obj.fit(bos_fea_pdf(), 
                 bos_tgt_ser())

In [71]:
est_grid_obj.best_params_

Note that there are 16 rows, one for each of the 16 parameter combinations in the grid.

In [73]:
est_grid_results_pdf(est_grid_obj)

__Exercise:__ copy the three code cells from the previous exercise and modify as below:
- Import `f_regression` and `mutual_info_regression` from `sklearn.feature_selection` (first line of first code cell)
- Add the `score_func` parameter of `SelectKBest` to the `parameters` dictionary with value `[f_regression,mutual_info_regression]`
- Run `GridSearchCV` with this modified `parameters` dictionary
- Check the column for the `score_func` parameter in the resulting dataframe. Notice that it is not a string or number.

This text block discusses the code added since the "original explanation". We add a second `SelectKBest` parameter to the grid: `score_func`, which selects the method used to score the features, either `f_regression` or `mutual_info_regression`. Note that the grid search was able to run before this because `SelectKBest` has a default scoring function (`f_classif`). With the additional parameter there are now 32 parameter combinations (2 x 2 x 4 x 2), and so 32 rows, to consider when looking at our grid search results.

In [76]:
from sklearn.pipeline        import Pipeline
from sklearn.linear_model    import LogisticRegression, LinearRegression
from sklearn.preprocessing   import MinMaxScaler
from sklearn.impute          import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

est_pipe = Pipeline([
  ('imp', SimpleImputer()),
  ('sca', MinMaxScaler()),
  ('kbest', SelectKBest()),
  ('lrg', LinearRegression())
])

Note the presence of the `kbest__score_func` parameter, which tests `f_regression` and `mutual_info_regression`.

In [78]:
parameters = {
  'imp__strategy': ['mean', 'median'],
  'lrg__fit_intercept': [True, False],
  'kbest__k': [13,10,7,4],
  'kbest__score_func': [f_regression, mutual_info_regression]
}

from sklearn.model_selection import GridSearchCV

est_grid_obj = GridSearchCV(estimator=est_pipe, 
                        param_grid=parameters,
                        cv=3, 
                        iid=False,
                        return_train_score=False
                       )
est_grid_obj.fit(bos_fea_pdf(), 
                 bos_tgt_ser())

In [79]:
est_grid_obj.best_params_

The results contain 32 rows, one for each parameter combination.

In [81]:
est_grid_results_pdf(est_grid_obj)
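
The `score_func` column holds function objects rather than strings or numbers. As a small sketch (variable names are assumptions), the grid below is enumerated with `ParameterGrid` to confirm the count of 32, and each function is mapped to its `__name__`, which is one way to obtain a string suitable for `groupby`.

```python
from sklearn.feature_selection import f_regression, mutual_info_regression
from sklearn.model_selection import ParameterGrid

demo_parameters = {
    'imp__strategy': ['mean', 'median'],
    'lrg__fit_intercept': [True, False],
    'kbest__k': [13, 10, 7, 4],
    'kbest__score_func': [f_regression, mutual_info_regression],
}

combos = list(ParameterGrid(demo_parameters))
print(len(combos))

# Replace each function object by its name, giving a plain string
score_func_names = sorted({combo['kbest__score_func'].__name__
                           for combo in combos})
print(score_func_names)
```

The wrapper class in the next exercise addresses the same issue from the other direction, by letting the grid contain the names directly.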

__Exercise:__ create a wrapper class called `SelectKBestWrap` for `SelectKBest`:
- Create a `SelectKBest` object in the `init` method
- Create an init parameter called `score_func_str` 
- When this parameter is `'f_regression'` then pass `f_regression` to `SelectKBest`
- When this parameter is `'mutual_info_regression'` then pass `mutual_info_regression` to `SelectKBest`
- You will need to create `fit` and `transform` methods (see examples from a previous week) which return results 
- Test this function on the Boston features dataframe and the target series

The code below creates the wrapper class for `SelectKBest`. In the `__init__` method we test the value of `score_func_str` to determine which scoring function to pass to `SelectKBest`. Specifying the scoring function by name is important because it lets us run the analyses without passing the function objects themselves. This could prove useful in machine learning, because we can now take the output of our grid search and use it directly as the input to the wrapper class for further analyses.

In [85]:
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.base import BaseEstimator, TransformerMixin

class SelectKBestWrap(BaseEstimator, TransformerMixin):
  def __init__(self, score_func_str):
    self.score_func_str = score_func_str
    # Map the string name to the corresponding scoring function
    if self.score_func_str == 'f_regression':
      score_func = f_regression
    elif self.score_func_str == 'mutual_info_regression':
      score_func = mutual_info_regression
    else:
      raise ValueError('Unknown score_func_str: ' + str(score_func_str))
    self.SK = SelectKBest(score_func=score_func, k=6)
  def fit(self, X, y):
    self.SK.fit(X, y)
    return self
  def transform(self, X):
    return self.SK.transform(X)

We see below that the code functions as expected. After replacing the missing values in the features dataframe with the column means (a simple form of imputing), we are able to use the wrapper class defined above. The output looks right because it has the same number of rows as the original dataset, and the number of columns matches the `k=6` specified in the `SelectKBest` wrapper class.

In [87]:
raw1=bos_fea_pdf()
raw2=raw1.fillna(raw1.mean())  # fill each column's missing values with that column's mean
a=SelectKBestWrap('f_regression').fit_transform(raw2, bos_tgt_ser())
a.shape
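
One caveat if the wrapper is later placed in a grid search: `GridSearchCV` applies each parameter combination with `set_params`, which sets the attribute without re-running `__init__`, so an inner `SelectKBest` built in `__init__` would keep its original scoring function. The sketch below is an alternative design, not the exercise's required one: it builds the inner selector in `fit` instead. The class name `SelectKBestByName`, the `k` init parameter, the `SCORE_FUNCS` lookup and the synthetic data are all assumptions for illustration.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_regression
from sklearn.feature_selection import (SelectKBest, f_regression,
                                       mutual_info_regression)

# Lookup from the string name to the scoring function
SCORE_FUNCS = {
    'f_regression': f_regression,
    'mutual_info_regression': mutual_info_regression,
}

class SelectKBestByName(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper that resolves the scoring function at fit time."""
    def __init__(self, score_func_str='f_regression', k=6):
        # Only store the parameters here, so set_params takes full effect
        self.score_func_str = score_func_str
        self.k = k

    def fit(self, X, y):
        self.selector_ = SelectKBest(
            score_func=SCORE_FUNCS[self.score_func_str], k=self.k)
        self.selector_.fit(X, y)
        return self

    def transform(self, X):
        return self.selector_.transform(X)

X_demo, y_demo = make_regression(n_samples=100, n_features=13,
                                 n_informative=5, random_state=0)

sel = SelectKBestByName('f_regression', k=6)
X_sel = sel.fit_transform(X_demo, y_demo)
print(X_sel.shape)

# Changing the name through set_params is picked up on the next fit
sel.set_params(score_func_str='mutual_info_regression')
sel.fit(X_demo, y_demo)
print(sel.selector_.score_func.__name__)
```

With this design, a grid key such as `kbest__score_func_str` would hold plain strings, which also suits the `groupby` summaries discussed earlier.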

## 3. Datasets

This section contains the location and a few notes about the datasets from a previous week.

Display the paths of the three files in our dataset.

In [91]:
%sh ls -hot /dbfs/mnt/group-ma707/data/*

__Note:__ we will create, from the above files, at least three dataframes (with features and target). Each is described below.

From only the 5TC dataset: 
- target will be `BCI`
- features will include lagged versions of the other columns 
- features will include date and time components (hour, day of week, etc.)
- features may include external time series
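
As a rough sketch of the lagged and date-based features described above, the block below uses a small made-up daily series. Only the `BCI` column name comes from the notes; the `other_col` column, the values and the two lags are assumptions for illustration.

```python
import pandas as pd

# Made-up daily data standing in for the 5TC dataset
idx = pd.date_range('2020-01-01', periods=6, freq='D')
tc = pd.DataFrame({'BCI':       [100.0, 101.5, 99.0, 102.0, 103.5, 101.0],
                   'other_col': [10.0, 11.0, 9.5, 12.0, 12.5, 11.5]},
                  index=idx)

features = pd.DataFrame(index=tc.index)
# Lagged versions of a non-target column
features['other_col_lag1'] = tc['other_col'].shift(1)
features['other_col_lag2'] = tc['other_col'].shift(2)
# Date and time components taken from the index
features['day_of_week'] = tc.index.dayofweek
features['hour'] = tc.index.hour

# Rows without a full set of lags are dropped; the target is aligned to match
complete = features.dropna()
target = tc.loc[complete.index, 'BCI']
print(complete)
```

Dropping the first rows is one simple way to handle the missing lags; imputing them in the pipeline would be another.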

From only the _mining_ dataset(s):
- the target may be one or more tags (from the `tags` variable)
- features would be words present in the `content` or `title` variables
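
A minimal sketch of word-presence features, assuming scikit-learn's `CountVectorizer` with `binary=True` is an acceptable way to build them. The three short texts are invented stand-ins for the `content` or `title` variables.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example texts, one per document
docs = ['iron ore prices rise',
        'copper mine strike',
        'iron mine expansion']

# binary=True records presence (1) or absence (0) instead of counts
vectorizer = CountVectorizer(binary=True)
word_features = vectorizer.fit_transform(docs)

print(word_features.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row is a document and each column is a word, so the result can serve as a features matrix for a tag target.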

From the 5TC and _mining_ dataset(s): 
- target will be `BCI` (from 5TC dataframe)
- include all features from either of the above dataframes
- the dataframes would need to be joined by either:
    1. aggregating the features from the _mining_ dataframe (by date)
    1. spreading the 5TC dataframe onto the _mining_ dataframe (duplicating 5TC rows)
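
The two joining options can be sketched with pandas as follows. All data here is invented, and the column names other than `BCI` (`date`, `word_iron`, `word_mine`) are assumptions for illustration.

```python
import pandas as pd

# Invented 5TC data: one row per date
tc = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
    'BCI': [100.0, 102.0, 101.0],
})
# Invented mining data: possibly several rows per date
mining = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-03']),
    'word_iron': [1, 0, 1],
    'word_mine': [0, 1, 1],
})

# Option 1: aggregate the mining features by date, then join onto 5TC
mining_by_date = mining.groupby('date', as_index=False).sum()
aggregated = tc.merge(mining_by_date, on='date', how='left').fillna(0)
print(aggregated)

# Option 2: spread 5TC onto the mining rows (5TC rows are duplicated)
spread = mining.merge(tc, on='date', how='left')
print(spread)
```

Option 1 keeps one row per date, while option 2 keeps one row per mining document and repeats the `BCI` value for documents sharing a date.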

__The End__