# Week 8 - Code Samples - [Group name]

## References

## Contents
1. Setup 
1. Read dataset
3. Preprocessing pipeline
4. Estimator pipeline
5. Inspect pipeline attributes
6. Compare `SelectKBest` and `SelectKBestDF`

## 1. Setup

Below libraries are loaded, a helper function is defined, and the features dataframe and target series are created for use by the `GridSearchCV` demonstrations that follow.

Load libraries and display version numbers.

In [7]:
import pandas  as pd
import numpy   as np
import sklearn as sk
print('sklearn',sk.__version__)
print('pandas ',pd.__version__)
print('numpy  ',np.__version__)

These version numbers may not be the most recent or correspond to the documentation you locate via Google.

The `display_pdf` function displays a pandas dataframe using the databricks display function.

In [10]:
def display_pdf(a_pdf):
  display(spark.createDataFrame(a_pdf))

In [11]:
def est_grid_results_pdf(my_est_grid_obj): 
  import pandas as pd
  import numpy  as np
  return pd.DataFrame(data=my_est_grid_obj.cv_results_) \
           .loc[:,lambda df: np.logical_or(df.columns.str.startswith('param_'),
                                           df.columns.str.endswith('test_score'))
               ]

## 2. Read dataset

In [13]:
%sh ls -hot /dbfs/mnt/group-ma707/data/*

In [14]:
%sh head -n 3 /dbfs/mnt/group-ma707/data/5tc_plus_ind_vars.csv

In [15]:
import pandas as pd
bci_pdf = pd.read_csv('/dbfs/mnt/group-ma707/data/5tc_plus_ind_vars.csv') \
            .rename(columns={'P3A~IV':'P3A_IV'}) \
            .assign(datetime=lambda pdf: pd.to_datetime(pdf.Date)) \
            .drop('Date', axis=1)
bci_pdf.columns = bci_pdf.columns.str.lower()
bci_pdf.info()

The BCI dataset has no missing values. See below:

In [17]:
bci_pdf.isnull().sum().sum()

## 3. Preprocessing pipeline

In [19]:
%run "./W08.2 Feature creation (inc)"

See [W08.2 Feature creation (inc)](https://bentley.cloud.databricks.com/#notebook/1176459)

Create a preprocessing pipeline to create the feature-target dataset from the `bci_pdf` dataframe.

In [22]:
from sklearn.pipeline      import Pipeline, FeatureUnion

fea_tgt_pipe = Pipeline([
  ('fea_tgt', FeatureUnionDF([('tgt', CreateTargetVarDF(var='bci_5tc')), 
                              ('nam', CreateNamedVarsDF(var_list=['datetime'])),
                              ('lag', CreateLagVarsDF  (var_list=list(set(bci_pdf.columns) - 
                                                                      set(['bci_5tc','datetime'])),
                                                        lag_list=range(3,11))),
                              ('dat', CreateDatetimeVarsDF(var='datetime',
                                                           var_list=['year','month','day','weekofyear','weekday','dayofyear']))
                             ])),
  ('drop_na', DropNaRowsDF())
])

In [23]:
bci_fea_tgt_pdf = fea_tgt_pipe.fit_transform(bci_pdf)
bci_fea_tgt_pdf.info()

Create a function to create train and test datasets from 
- a start date for the test set, and 
- the features and target of the preprocessed dataframe produced immediately above.

In [25]:
def get_ts_train_test_split(x,y,start_test=None):
  return (x.loc[:start_test ], y.loc[:start_test ],
          x.loc[ start_test:], y.loc[ start_test:]
         )

Now get the train and test datasets (for features and target).

In [27]:
import pandas as pd
(train_bci_fea_pdf, train_bci_tgt_ser,
  test_bci_fea_pdf,  test_bci_tgt_ser
) = get_ts_train_test_split(x=bci_fea_tgt_pdf.set_index('datetime').drop('target',axis=1), 
                            y=bci_fea_tgt_pdf.set_index('datetime').target,
                            start_test=pd.to_datetime('2018-01-01')
                           )
(train_bci_fea_pdf.shape, train_bci_tgt_ser.shape,
  test_bci_fea_pdf.shape,  test_bci_tgt_ser.shape
)

## 4. Estimator pipeline

Define a transformer class `SelectKBestDF` which wraps the `SelectKBest` class from Scikit-learn.

Notice:
- the `score_func_str` init parameter expects a string
- the use of the `super` function to call methods from the wrapped class
- the use of `self` to access the attributes of the wrapped class

In [30]:
from sklearn.feature_selection import SelectKBest
from sklearn.base              import BaseEstimator,TransformerMixin
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

class SelectKBestDF(SelectKBest):
  import numpy as np
  
  def __init__(self, k=10, score_func_str=None):
    if score_func_str is None                    : score_func = f_regression
    if score_func_str == 'f_regression'          : score_func = f_regression
    if score_func_str == 'mutual_info_regression': score_func = mutual_info_regression
    super().__init__(k=k, score_func=score_func)

  def fit(self, X, y=None):
    super().fit(X=X,y=y)
    scores_ = self.scores_
    pvalues_ = self.pvalues_

    var_ndx_sorted    = self.scores_.argsort()[::-1][:self.k]
    self.columns_keep = [X.columns[i] for i in var_ndx_sorted]
    self.columns_drop = list(set(X.columns)-set(self.columns_keep))
    return self
  
  def transform(self, X):
    return pd.DataFrame(data=super().transform(X),
                        columns=self.columns_keep)

Create a pipeline with the wrapped transformer class `SelectKBestDF` and the estimator `LinearRegression`.

In [32]:
from sklearn.pipeline          import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model      import LinearRegression

est_pipe = Pipeline([
  ('skb', SelectKBestDF()),
  ('lrg', LinearRegression())
])

There are two ways to use this estimator pipeline.
1. Fit the feature and target train datasets to this estimator and make predictions on the test feature dataset. 
2. Run `GridSearchCV` with the pipeline and a parameter grid. Then compare results for different sets of hyperparameters.

Both ways are demonstrated below.

The first way is demonstrated in the following code cell.

The pipeline `est_pipe` was created with default parameters. The `set_params` method can be used to set parameters. This is demonstated in the following code cell, but will be useful further below.

In [36]:
est_pipe.set_params(**{'skb__k':1,
                       'skb__score_func_str':'f_regression'
                      })
est_pipe.fit(X=train_bci_fea_pdf, 
             y=train_bci_tgt_ser)
est_pipe.predict(X=test_bci_fea_pdf)

The second way is demonstated in the following code cell where 
- a parameter grid is created, and 
- a grid search object is created from the pipeline and parameter grid

In [38]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics         import make_scorer, r2_score

parameters = {'skb__k'             : [10, 7, 4, 1],
              'skb__score_func_str': ['f_regression', 'mutual_info_regression']
              }

est_grid_obj = GridSearchCV(estimator=est_pipe, 
                            param_grid=parameters,
                            scoring=make_scorer(r2_score),
                            cv=3, 
                            iid=False,
                            return_train_score=False
                           )

Continuing wih the second way, fit the train and test datasets to the grid search object.

In [40]:
est_grid_obj.fit(X=train_bci_fea_pdf,
                 y=train_bci_tgt_ser
                )

Display the results. The second code cell below produces more readable and workable results by converting the dictionary to a dataframe.

In [42]:
est_grid_obj.cv_results_

A pandas dataframe will be easier to analyze. In addition, the columns can be easily sorted in the table below.

In [44]:
display_pdf(est_grid_results_pdf(est_grid_obj))

mean_test_score,param_skb__k,param_skb__score_func_str,rank_test_score,split0_test_score,split1_test_score,split2_test_score,std_test_score
0.8667903232663242,10,f_regression,1,0.8806511977723495,0.8285565527757559,0.8911632192508674,0.0273738507014181
0.8667903232663242,10,mutual_info_regression,1,0.8806511977723495,0.8285565527757559,0.8911632192508674,0.0273738507014181
0.7888989393527116,7,f_regression,5,0.8766313548607985,0.8331036359174122,0.656961827279924,0.09497093146742
0.7888989393527116,7,mutual_info_regression,5,0.8766313548607985,0.8331036359174122,0.656961827279924,0.09497093146742
0.6500215104731023,4,f_regression,7,0.6038700140048274,0.8168163507428787,0.5293781666716006,0.1217994194984644
0.6500215104731023,4,mutual_info_regression,7,0.6038700140048274,0.8168163507428787,0.5293781666716006,0.1217994194984644
0.7890911125079342,1,f_regression,3,0.748078442220349,0.833457159617723,0.7857377356857306,0.034936277212366
0.7890911125079342,1,mutual_info_regression,3,0.748078442220349,0.833457159617723,0.7857377356857306,0.034936277212366


Notice that a single variable is nearly as good in making predictions as 10 variables with respect to the R squared metric.

## 5. Inspect pipeline attributes

After inspecting a dataframe of results, you may want to inspect the attributes of a transformer or the estimator contained in a pipeline used by `GridSearchCV`. This section demonstrates this process. 

For instance, we inspect the attributes determined by `skb__k=1` and `skb__score_func_str='f_regression'`.

In [48]:
est_pipe.set_params(**{'skb__k':1,
                       'skb__score_func_str':'f_regression'
                      })

In [49]:
est_pipe.fit(X=train_bci_fea_pdf,
             y=train_bci_tgt_ser
            )

In [50]:
est_pipe.named_steps['skb'].columns_keep

In [51]:
est_pipe.named_steps['lrg'].coef_

## 6. Compare `SelectKBest` and `SelectKBestDF`

In [53]:
skb_df = SelectKBestDF(k=4,score_func_str='f_regression')
skb_df \
  .fit(X=train_bci_fea_pdf,
       y=train_bci_tgt_ser) 
skb_df \
  .transform(X=train_bci_fea_pdf) \
  .head()

In [54]:
skb_df.scores_

In [55]:
np.set_printoptions(suppress=True)
skb = SelectKBest(k=4,score_func=f_regression)
skb \
  .fit(X=train_bci_fea_pdf,
       y=train_bci_tgt_ser) 
skb \
  .transform(X=train_bci_fea_pdf)[:5,:]

In [56]:
skb.scores_

__The End__