This notebook contains code to create estimator pipelines. 
This code will include the feature selection classes and the wrapped estimator classes. 
The code will be reused in the `Investigation` notebook.

# MA707 Report - Estimator Pipelines (spring 2019, Blackjack)

## Introduction

## Contents
1. Setup
2. Create train-test datasets
3. Estimator pipeline
4. Summary

## 1. Setup

In [6]:
%run "/Courses/MA707/Groups/Blackjack/Report - Final2/2. Preprocessing pipeline"

Running the preprocessing notebook above makes our earlier work available here, so we do not need to repeat that code in this notebook.

## Create train-test datasets

The function below splits a feature data frame and a target series into train and test sets, so that part of the data is held out for testing after we have built our final model(s). It takes a feature data frame and a target series as input and, by default, assigns the first 80% of rows to the train set and the remaining 20% to the test set. Because the split is positional rather than random, the test set holds the most recent observations, assuming the rows are in chronological order.

In [10]:
def create_train_test_ts(fea_pdf, tgt_ser, trn_prop=0.8):
  trn_len = int(trn_prop * len(fea_pdf))
  return (fea_pdf.iloc[:trn_len],
          fea_pdf.iloc[ trn_len:],
          tgt_ser.iloc[:trn_len],
          tgt_ser.iloc[ trn_len:]
         )
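As a quick sanity check, the split can be tried on a small toy frame. This is only a sketch: the date index, the `lag1` column, and the values are hypothetical and stand in for the real feature-target frames created by the preprocessing notebook. The function is repeated so the snippet runs on its own.

```python
import pandas as pd

def create_train_test_ts(fea_pdf, tgt_ser, trn_prop=0.8):
    # Same logic as the cell above: the first trn_prop of rows go to train, the rest to test
    trn_len = int(trn_prop * len(fea_pdf))
    return (fea_pdf.iloc[:trn_len], fea_pdf.iloc[trn_len:],
            tgt_ser.iloc[:trn_len], tgt_ser.iloc[trn_len:])

# Hypothetical toy data: 10 daily observations with one feature and a target
idx = pd.date_range("2019-01-01", periods=10, freq="D")
toy_pdf = pd.DataFrame({"lag1": range(10), "target": range(1, 11)}, index=idx)

trn_fea, tst_fea, trn_tgt, tst_tgt = create_train_test_ts(
    fea_pdf=toy_pdf.drop("target", axis=1),
    tgt_ser=toy_pdf.loc[:, "target"],
)

# Row counts of each part, and whether every train date precedes every test date
print(len(trn_fea), len(tst_fea))
print(trn_fea.index.max() < tst_fea.index.min())
```

No shuffling takes place, so the order of the rows in the input frame fully determines which observations end up in the test set.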

Below, we create the train and test datasets. Because the function returns four objects, each call unpacks the result into four variables: train features, test features, train target, and test target.

In [12]:
#Create train and test dataset with different lag1 variables Coal
(trn_fea_pdf_lag1, tst_fea_pdf_lag1, 
 trn_tgt_ser_lag1, tst_tgt_ser_lag1
) = \
create_train_test_ts(fea_pdf = fea_tgt_coal_pdf_lag1.drop( 'target',axis=1),
                     tgt_ser = fea_tgt_coal_pdf_lag1.loc[:,'target'],
                    )

In [13]:
#Create train and test dataset with different lag3 variables Coal
(trn_fea_pdf_lag3, tst_fea_pdf_lag3, 
 trn_tgt_ser_lag3, tst_tgt_ser_lag3
) = \
create_train_test_ts(fea_pdf = fea_tgt_coal_pdf_lag3.drop( 'target',axis=1),
                     tgt_ser = fea_tgt_coal_pdf_lag3.loc[:,'target'],
                    )

In [14]:
#Create train and test for tfidf coal
(trn_fea_pdf_tfidf_lag1, tst_fea_pdf_tfidf_lag1, 
 trn_tgt_ser_tfidf_lag1, tst_tgt_ser_tfidf_lag1
) = \
create_train_test_ts(fea_pdf = fea_tgt_coal_pdf_tfidf_lag1.drop( 'target',axis=1),
                     tgt_ser = fea_tgt_coal_pdf_tfidf_lag1.loc[:,'target'],
                    )

In [15]:
#Create train and test dataset with different lag1 variables Ore
(trn_fea_pdf_lag1_ore, tst_fea_pdf_lag1_ore, 
 trn_tgt_ser_lag1_ore, tst_tgt_ser_lag1_ore
) = \
create_train_test_ts(fea_pdf = fea_tgt_ore_pdf_lag1.drop( 'target',axis=1),
                     tgt_ser = fea_tgt_ore_pdf_lag1.loc[:,'target'],
                    )

In [16]:
#Create train and test dataset with different lag3 variables Ore
(trn_fea_pdf_lag3_ore, tst_fea_pdf_lag3_ore, 
 trn_tgt_ser_lag3_ore, tst_tgt_ser_lag3_ore
) = \
create_train_test_ts(fea_pdf = fea_tgt_ore_pdf_lag3.drop( 'target',axis=1),
                     tgt_ser = fea_tgt_ore_pdf_lag3.loc[:,'target'],
                    )

In [17]:
#Create train and test for tfidf ore
(trn_fea_pdf_tfidf_lag1_ore, tst_fea_pdf_tfidf_lag1_ore, 
 trn_tgt_ser_tfidf_lag1_ore, tst_tgt_ser_tfidf_lag1_ore
) = \
create_train_test_ts(fea_pdf = fea_tgt_ore_pdf_tfidf_lag1.drop( 'target',axis=1),
                     tgt_ser = fea_tgt_ore_pdf_tfidf_lag1.loc[:,'target'],
                    )

## Estimator Pipeline

This section creates the estimator pipelines that follow from the goals specified in the objectives section. We created two pipelines each for the Ridge, Lasso, and Elastic Net models: one with a scaler and one without. Every pipeline applies PCA before the regressor.
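One reason to compare scaled and unscaled variants is that PCA is sensitive to feature scale. The sketch below illustrates this on hypothetical synthetic data (two independent features on very different scales); it is not drawn from the report's datasets.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
# Hypothetical features: one measured in thousands, one in hundredths
X = np.column_stack([rng.normal(0, 1000, size=200),
                     rng.normal(0, 0.01, size=200)])

# Share of variance captured by each principal component, without and with scaling
raw_ratio = PCA().fit(X).explained_variance_ratio_
sca_ratio = PCA().fit(MinMaxScaler().fit_transform(X)).explained_variance_ratio_

print("unscaled:", raw_ratio)
print("scaled:  ", sca_ratio)
```

Without scaling, the first component is driven almost entirely by the large-scale feature; after `MinMaxScaler`, both features contribute on a comparable footing.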

In [20]:
from sklearn.pipeline        import FeatureUnion, Pipeline
from sklearn.linear_model    import Ridge, Lasso, ElasticNet
from sklearn.decomposition   import PCA
from spark_sklearn           import GridSearchCV
from sklearn.preprocessing   import MinMaxScaler
#from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score

RIDGE pipeline no scaler

In [22]:
%python 
PCA_Ridge_Pipeline=Pipeline(steps=[
    ('pca',PCA()),
    ('rdg', Ridge())
  ])

RIDGE pipeline with scaler

In [24]:
%python 
Sca_PCA_Ridge_Pipeline=Pipeline(steps=[
    ('sca', MinMaxScaler()),
    ('pca',PCA()),
    ('rdg', Ridge())
  ])

LASSO pipeline no scaler

In [26]:
%python 
PCA_Lasso_Pipeline=Pipeline(steps=[
    ('pca',PCA()),
    ('lso', Lasso())
  ])

LASSO pipeline with scaler

In [28]:
%python 
PCA_Lasso_SCA_Pipeline=Pipeline(steps=[
    ('sca', MinMaxScaler()),
    ('pca',PCA()),
    ('lso', Lasso())
  ])

Elastic Net pipeline no scaler

In [30]:
%python 
PCA_Ela_Pipeline=Pipeline(steps=[
    ('pca',PCA()),
    ('ela', ElasticNet())
  ])

Elastic Net pipeline with scaler

In [32]:
%python 
PCA_Ela_SCA_Pipeline=Pipeline(steps=[
    ('sca', MinMaxScaler()),
    ('pca',PCA()),
    ('ela', ElasticNet())
  ])
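The imports above suggest that these pipelines are later tuned with a grid search over time-series splits. A minimal sketch of how that might look for the scaled Ridge pipeline follows. Several things here are assumptions: the data is synthetic, the parameter grid is illustrative only, and `GridSearchCV` is taken from `sklearn.model_selection` rather than `spark_sklearn` so the snippet runs without a Spark context.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import make_scorer, mean_absolute_error

# Hypothetical synthetic data standing in for a train feature frame and target series
rng = np.random.RandomState(0)
X_trn = rng.normal(size=(120, 6))
y_trn = X_trn[:, 0] * 2.0 + rng.normal(scale=0.1, size=120)

pipe = Pipeline(steps=[
    ('sca', MinMaxScaler()),
    ('pca', PCA()),
    ('rdg', Ridge()),
])

# Illustrative grid only; each key is "<step name>__<parameter name>"
param_grid = {
    'pca__n_components': [2, 4, 6],
    'rdg__alpha': [0.1, 1.0, 10.0],
}

gs = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # each fold trains on earlier rows, validates on later rows
    scoring=make_scorer(mean_absolute_error, greater_is_better=False),
)
gs.fit(X_trn, y_trn)
print(gs.best_params_)
```

The step names given in each pipeline (`'sca'`, `'pca'`, `'rdg'`, `'lso'`, `'ela'`) are what make the hyperparameters addressable in the grid, and `TimeSeriesSplit` keeps every validation fold later in time than its training fold.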

## Summary

In this notebook we created six estimator pipelines to be used in later notebooks, where we analyze regressor type and important components. We began by considering how best to create train and test datasets, and then created two pipelines for each of our three regressor types: one with a scaler and one without. We explain in depth how these regressors perform on our data in the investigations notebook, where we also explore what might cause success or failure.