# Exercises - Week 7 - Mini Pipeline - Blackjack

## References
- http://flennerhag.com/2017-01-08-Recursive-Override/
- https://www.pythonforbeginners.com/super/working-python-super-function
- https://scikit-learn.org/stable/modules/compose.html
- https://scikit-learn.org/stable/modules/feature_extraction.html 
- https://scikit-learn.org/stable/modules/feature_selection.html 
- https://scikit-learn.org/stable/modules/ensemble.html 
- https://scikit-learn.org/stable/modules/grid_search.html
- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

## Contents
1. Setup 
2. Grid search
3. Datasets

## 1. Setup

The cells below load libraries, define helper functions, and create the features dataframe and target series used by the `GridSearchCV` demonstrations that follow.

Load libraries and display version numbers.

In [7]:
import pandas  as pd
import numpy   as np
import sklearn as sk
print('sklearn',sk.__version__)
print('pandas ',pd.__version__)
print('numpy  ',np.__version__)

The installed versions may be older than the latest releases, so documentation found through a web search may describe parameters or behavior that differ from what runs here.

The `display_pdf` function converts a pandas dataframe to a Spark dataframe and shows it with the Databricks `display` function.

In [10]:
def display_pdf(a_pdf):
  display(spark.createDataFrame(a_pdf))
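The `display` and `spark` names used by `display_pdf` are globals supplied by Databricks notebooks, so the call raises `NameError` elsewhere. A minimal sketch of a fallback, assuming plain-text output is acceptable outside Databricks (the name `display_pdf_or_print` is hypothetical and not part of the original notebook):

```python
import pandas as pd

def display_pdf_or_print(a_pdf):
    # Hypothetical helper: use the Databricks display when its globals exist,
    # otherwise fall back to printing the dataframe as plain text.
    try:
        display(spark.createDataFrame(a_pdf))  # Databricks-only globals
    except NameError:
        print(a_pdf.to_string(index=False))

display_pdf_or_print(pd.DataFrame({'k': [1, 4], 'score': [0.1, 0.2]}))
```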

In [11]:
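def est_grid_results_pdf(my_est_grid_obj): 
  # Return the grid search results as a pandas dataframe, keeping only the
  # parameter columns ('param_*') and the test score columns ('*test_score').
  import pandas as pd
  import numpy  as np
  return pd.DataFrame(data=my_est_grid_obj.cv_results_) \
           .loc[:,lambda df: np.logical_or(df.columns.str.startswith('param_'),
                                           df.columns.str.endswith('test_score'))
               ]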
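To make the column filter in `est_grid_results_pdf` concrete, here is a small sketch that applies the same `startswith`/`endswith` mask to a toy dictionary. The dictionary is a made-up stand-in for `cv_results_`, not real search output, and the names `toy_cv_results` and `kept_columns` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for GridSearchCV.cv_results_ (values are made up for illustration).
toy_cv_results = {
    'mean_fit_time':   [0.01, 0.02],
    'param_skb__k':    [4, 1],
    'params':          [{'skb__k': 4}, {'skb__k': 1}],
    'mean_test_score': [0.20, -0.76],
    'rank_test_score': [1, 2],
}

toy_pdf = pd.DataFrame(data=toy_cv_results)

# Boolean mask over the column labels: keep 'param_*' and '*test_score' columns.
keep = np.logical_or(toy_pdf.columns.str.startswith('param_'),
                     toy_pdf.columns.str.endswith('test_score'))
kept_columns = list(toy_pdf.loc[:, keep].columns)
print(kept_columns)  # ['param_skb__k', 'mean_test_score', 'rank_test_score']
```

Note that the `params` column is dropped because it does not start with `param_` (with the underscore), and timing columns are dropped because they match neither test.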

In [12]:
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so these helpers need an older scikit-learn.

def bos_tgt_ser(): 
  # Target values as a pandas series.
  import pandas as pd
  from sklearn.datasets import load_boston
  return pd.Series(data=load_boston().target)

def bos_fea_pdf(): 
  # Features as a pandas dataframe with lower-case column names.
  import pandas as pd
  from sklearn.datasets import load_boston
  bos = load_boston()  # load once, reuse for the data and the column names
  return pd.DataFrame(data   =          bos.data,
                      columns=pd.Series(bos.feature_names).str.lower()
                     )

def bos_tgt_npa(): 
  # Target values as a numpy array.
  from sklearn.datasets import load_boston
  return load_boston().target

def bos_fea_npa(): 
  # Features as a numpy array.
  from sklearn.datasets import load_boston
  return load_boston().data 
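On scikit-learn 1.2 and later, `load_boston` is no longer available. A minimal sketch of the same helper pattern using `load_diabetes` as a stand-in regression dataset follows. This is a swapped-in dataset, so any scores will differ from those shown in this notebook, and the `dia_*` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

def dia_fea_pdf():
    # Load once and reuse the Bunch for both the data and the column names.
    bunch = load_diabetes()
    return pd.DataFrame(data=bunch.data,
                        columns=pd.Series(bunch.feature_names).str.lower())

def dia_tgt_ser():
    # Target values as a pandas series.
    return pd.Series(data=load_diabetes().target)

print(dia_fea_pdf().shape, dia_tgt_ser().shape)  # (442, 10) (442,)
```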

In [13]:
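from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.base              import BaseEstimator,TransformerMixin

class SelectKBest2(SelectKBest):
  # SelectKBest variant whose score function is chosen by name (a string).

  def __init__(self, k=10, score_func_str=None):
    # Keep the argument as an attribute so that get_params() and clone() work.
    self.score_func_str = score_func_str
    super().__init__(k=k, score_func=self._score_func())

  def _score_func(self):
    # Translate the name into the scoring function; None means f_regression.
    funcs = {None                    : f_regression,
             'f_regression'          : f_regression,
             'mutual_info_regression': mutual_info_regression}
    if self.score_func_str not in funcs:
      raise ValueError('unknown score_func_str: %r' % self.score_func_str)
    return funcs[self.score_func_str]

  def fit(self, X, y=None):
    # Resolve the name again so that set_params(score_func_str=...) takes effect.
    self.score_func = self._score_func()
    super().fit(X,y)
    return self
  
  def transform(self, X):
    return super().transform(X)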

In [14]:
SelectKBest2() \
  .fit(X=bos_fea_npa(),
       y=bos_tgt_npa()) \
  .transform(X=bos_fea_npa()) \
  .shape
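The chained call above fits the selector, transforms the features, and returns the shape of the reduced array. The sketch below runs the same fit-then-transform chain with the stock `SelectKBest` on synthetic data. The data comes from `make_regression`, an assumption standing in for the Boston arrays, with 13 features to match their width.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: 100 rows, 13 features.
X_demo, y_demo = make_regression(n_samples=100, n_features=13,
                                 n_informative=3, random_state=0)

selector   = SelectKBest(score_func=f_regression, k=10).fit(X=X_demo, y=y_demo)
X_selected = selector.transform(X=X_demo)

print(X_selected.shape)              # (100, 10)
print(selector.get_support().sum())  # 10
```

`get_support()` returns a boolean mask over the input columns, which is a quick way to see which features were kept.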

In [15]:
from sklearn.pipeline      import FeatureUnion
from sklearn.decomposition import PCA, TruncatedSVD

union = FeatureUnion([('pca', PCA(n_components=7)),
                      ('svd', TruncatedSVD(n_components=7))
                     ])
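`FeatureUnion` fits each transformer on the same input and concatenates their outputs column-wise, so the union above yields 7 + 7 = 14 columns. A short sketch on synthetic data (random values standing in for the Boston features; `X_demo` and `demo_union` are illustrative names):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA, TruncatedSVD

rng    = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 13))  # synthetic: 50 rows, 13 features

demo_union = FeatureUnion([('pca', PCA(n_components=7)),
                           ('svd', TruncatedSVD(n_components=7))
                          ])
X_union = demo_union.fit_transform(X_demo)
print(X_union.shape)  # (50, 14)
```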

In [16]:
from sklearn.pipeline        import Pipeline
from sklearn.linear_model    import LogisticRegression, LinearRegression
from sklearn.preprocessing   import MinMaxScaler
from sklearn.impute          import SimpleImputer

est_pipe = Pipeline([
  ('fea', union),
  ('skb', SelectKBest(score_func=f_regression)), # alternative: SelectKBest2(score_func_str='f_regression')
  ('lrg', LinearRegression())
])

In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics         import make_scorer, r2_score

parameters = {'skb__k': [10, 7, 4, 1],
              'fea__pca__n_components' : [3, 5]
              }

est_grid_obj = GridSearchCV(estimator=est_pipe, 
                            param_grid=parameters,
                            scoring=make_scorer(r2_score),
                            cv=3, 
                            iid=False, # deprecated in scikit-learn 0.22 and removed in 0.24; drop this argument on newer versions
                            return_train_score=False
                           )
est_grid_obj.fit(X=bos_fea_npa(),
                 y=bos_tgt_npa()
                )
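The keys in `parameters` use the `step__parameter` naming convention, with one more `__` level for a transformer nested inside the `FeatureUnion`. The sketch below rebuilds the same pipeline under illustrative names (`demo_pipe`, `demo_grid`), checks the grid keys against `get_params()`, and counts the candidates with `ParameterGrid`.

```python
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline

demo_pipe = Pipeline([
    ('fea', FeatureUnion([('pca', PCA(n_components=7)),
                          ('svd', TruncatedSVD(n_components=7))])),
    ('skb', SelectKBest(score_func=f_regression)),
    ('lrg', LinearRegression()),
])

demo_grid = {'skb__k': [10, 7, 4, 1],
             'fea__pca__n_components': [3, 5]}

# Every grid key must be a name reported by get_params().
valid_names = demo_pipe.get_params().keys()
print(all(name in valid_names for name in demo_grid))  # True

# 4 values of k times 2 values of n_components gives 8 candidates.
print(len(ParameterGrid(demo_grid)))  # 8
```

Printing `sorted(valid_names)` is a convenient way to discover which names are available before writing a grid.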

In [18]:
display_pdf(est_grid_results_pdf(est_grid_obj))

| `mean_test_score` | `param_fea__pca__n_components` | `param_skb__k` | `rank_test_score` | `split0_test_score` | `split1_test_score` | `split2_test_score` | `std_test_score` |
|---|---|---|---|---|---|---|---|
| -13.764852483897416 | 3 | 10 | 7 | 0.6971431434274243 | 0.3587915060140209 | -42.350492101133696 | 20.21357159109553 |
| -0.0945536972340353 | 3 | 7 | 4 | 0.4836931903028533 | 0.377072262824947 | -1.1444265448299062 | 0.7436472069619031 |
| 0.1009151030013393 | 3 | 4 | 2 | 0.4323486332677186 | 0.1814887503427305 | -0.3110920746064312 | 0.3088096566015225 |
| -0.7587825397738547 | 3 | 1 | 6 | -0.4114264851776676 | -0.5941432638196442 | -1.2707778703242525 | 0.3696401576113212 |
| -14.076774295979208 | 5 | 10 | 8 | 0.6966655169846774 | 0.3585613728703106 | -43.28554977779261 | 20.65418444227553 |
| -0.070809537496518 | 5 | 7 | 3 | 0.4858881085838432 | 0.3732931382619974 | -1.0716098593353949 | 0.7091640007690012 |
| 0.2053725589445404 | 5 | 4 | 1 | 0.4323486332677051 | 0.18148875034277 | 0.0022802932231461 | 0.1763850362363828 |
| -0.758782539773851 | 5 | 1 | 5 | -0.4114264851776676 | -0.5941432638196462 | -1.2707778703242392 | 0.3696401576113147 |
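To read the best candidate off the results above, look for the highest `mean_test_score`, which is the row with `rank_test_score` equal to 1. The sketch below rebuilds three of the columns by hand, with scores rounded to four decimals, and picks that row. `results_pdf` and `best_row` are illustrative names.

```python
import pandas as pd

# Three columns copied from the results above, scores rounded to 4 decimals.
results_pdf = pd.DataFrame({
    'param_fea__pca__n_components': [3, 3, 3, 3, 5, 5, 5, 5],
    'param_skb__k':                 [10, 7, 4, 1, 10, 7, 4, 1],
    'mean_test_score': [-13.7649, -0.0946, 0.1009, -0.7588,
                        -14.0768, -0.0708, 0.2054, -0.7588],
})

# Row with the highest mean test score.
best_row = results_pdf.loc[results_pdf['mean_test_score'].idxmax()]
best_n_components = int(best_row['param_fea__pca__n_components'])
best_k            = int(best_row['param_skb__k'])
print(best_n_components, best_k)  # 5 4
```

On a fitted search object the same information is available directly as `est_grid_obj.best_params_` and `est_grid_obj.best_score_`.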
