This notebook contains code to create feature-target datasets. 
This code will include the `FeatureUnionDF` class, the `DropNaRowsDF` class and the feature creation classes. 
The code will be reused in the `Investigation` notebook.

# MA707 Report - Preprocessing (spring 2019, Blackjack)

## Introduction

## Contents
1. Setup

## 1. Setup

In [6]:
%run "./1. Class demonstrations"

Rather than copy code that has already been written, we use the `%run` command introduced earlier to execute the `1. Class demonstrations` notebook, which makes its classes available here.

Below, we create a preprocessing pipeline built from `Pipeline` and our `FeatureUnionDF` wrapper. It contains two distinct feature unions. The first feature union combines the target variable, the date and time variables, and the lag-1 versions of the numeric and text variables. We then drop every row that contains an NA value, so that the missing values introduced by lagging do not cause problems when the data runs through later pipelines or when we attempt to predict. The second feature union joins the variables created above with the count-vectorized versions of the lagged tags and title variables. Note that we chose to ignore content because of the computing power it would take to run the model, and because in our opinion content will not be useful for prediction.

In [9]:
%python 
def get_count_plus_ts_lag1_pipe():
  from sklearn.pipeline import FeatureUnion, Pipeline
  return Pipeline(steps=[
    ('fea_one', FeatureUnionDF(transformer_list=[
      ('tgt_var'    ,CreateTargetVarDF(var='bci_5tc')),
      ('dt_vars'    ,CreateDatetimeVarsDF(var='date')),
      ('lag_ts_vars',CreateLagVarsDF(
        var_list=['cme_ln2','rici','p1a_03','p4_03','c7',
                  'cme_ln3','p3a_03','shfe_rb3','shfe_al3',
                  'shfe_cu3','ice_tib3','cme_fc3','opec_orb',
                  'ice_sb3','p3a_iv','ice_kc3','c5',
                  'p2a_03','cme_lc2','cme_sm3','ice_tib4','bci','cme_ln1','cme_s2'],
        lag_list=[1])),
      ('lag_txt_vars',CreateLagVarsDF(var_list=['tags','title'],
                                      lag_list=[1])),
    ])),
    ('drop_na_rows'  ,DropNaRowsDF(how='any')),
    ('fea_two', FeatureUnionDF(transformer_list=[
      ('named_vars' ,CreateNamedVarsDF(except_list=['tags_lag1','title_lag1'])),
      ('cnt_tags'   , CountVectColDF(col_name=   'tags_lag1',prefix='cnt_tags_'   ,add_stop_words=[])),
     # ('cnt_content', CountVectColDF(col_name='content_lag1',prefix='cnt_content_',add_stop_words=[])),  
      ('cnt_title'  , CountVectColDF(col_name=  'title_lag1',prefix='cnt_title_'  ,add_stop_words=[])),  
    ])),
    ('drop_na_rows_again'  ,DropNaRowsDF(how='any')),
  ])
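
The lagging and row-dropping steps in the pipeline above can be illustrated with a small plain-Python sketch. It is not the implementation of `CreateLagVarsDF` or `DropNaRowsDF`; the helper names `add_lag` and `drop_na_rows` and the toy rows are hypothetical, and the only thing taken from the notebook is the `<var>_lag<k>` column naming suggested by names such as `tags_lag1`.

```python
# Plain-Python sketch of a lag-1 feature followed by dropping incomplete rows.
# The helpers and the toy data are hypothetical, not taken from the notebook.

def add_lag(rows, var, lag):
    """Return copies of rows with a '<var>_lag<lag>' key holding the value `lag` rows earlier."""
    out = []
    for i, row in enumerate(rows):
        new_row = dict(row)
        # The first `lag` rows have no earlier observation, so they get None.
        new_row[f"{var}_lag{lag}"] = rows[i - lag][var] if i >= lag else None
        out.append(new_row)
    return out

def drop_na_rows(rows):
    """Keep only rows without any missing value (the analogue of how='any')."""
    return [row for row in rows if all(v is not None for v in row.values())]

rows = [
    {"date": "2019-01-01", "bci": 100.0},
    {"date": "2019-01-02", "bci": 110.0},
    {"date": "2019-01-03", "bci": 105.0},
]

lagged = add_lag(rows, var="bci", lag=1)
clean = drop_na_rows(lagged)

print(len(lagged), len(clean))
print(clean[0])
```

With a lag of 1 the first row has no earlier value, so it is the only row removed; a lag of 3 would remove the first three rows in the same way.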

Rather than restate what has been done, note that the pipeline below is the same as the one above except that the lag changes from 1 to 3.

In [11]:
%python 
def get_count_plus_ts_lag3_pipe():
  from sklearn.pipeline import FeatureUnion, Pipeline
  return Pipeline(steps=[
    ('fea_one', FeatureUnionDF(transformer_list=[
      ('tgt_var'    ,CreateTargetVarDF(var='bci_5tc')),
      ('dt_vars'    ,CreateDatetimeVarsDF(var='date')),
      ('lag_ts_vars',CreateLagVarsDF(
        var_list=['cme_ln2','rici','p1a_03','p4_03','c7',
                  'cme_ln3','p3a_03','shfe_rb3','shfe_al3',
                  'shfe_cu3','ice_tib3','cme_fc3','opec_orb',
                  'ice_sb3','p3a_iv','ice_kc3','c5',
                  'p2a_03','cme_lc2','cme_sm3','ice_tib4','bci','cme_ln1','cme_s2'],
        lag_list=[3])),
      ('lag_txt_vars',CreateLagVarsDF(var_list=['tags','title'],
                                      lag_list=[3])),
    ])),
    ('drop_na_rows'  ,DropNaRowsDF(how='any')),
    ('fea_two', FeatureUnionDF(transformer_list=[
      ('named_vars' ,CreateNamedVarsDF(except_list=['tags_lag3','title_lag3'])),
      ('cnt_tags'   , CountVectColDF(col_name=   'tags_lag3',prefix='cnt_tags_'   ,add_stop_words=[])),
      ('cnt_title'  , CountVectColDF(col_name=  'title_lag3',prefix='cnt_title_'  ,add_stop_words=[])),  
     # ('cnt_content', CountVectColDF(col_name='content_lag3',prefix='cnt_content_',add_stop_words=[])),  
    ])),
    ('drop_na_rows_again'  ,DropNaRowsDF(how='any')),
  ])
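
The lag-1 and lag-3 pipelines differ only in the `lag_list` value and in the column names derived from it. As a hedged sketch of that dependency, the hypothetical helper `lag_col_names` below builds the names that are passed to `except_list` and `col_name`, assuming the `<var>_lag<k>` naming that the column names in the pipelines suggest.

```python
# Sketch of how the lag-dependent column names could be derived from one parameter.
# `lag_col_names` is a hypothetical helper; the naming rule is an assumption
# inferred from names such as 'tags_lag1' and 'title_lag3'.

def lag_col_names(var_list, lag_list):
    """Return '<var>_lag<k>' for every variable and every lag."""
    return [f"{var}_lag{lag}" for var in var_list for lag in lag_list]

text_vars = ["tags", "title"]

names_lag1 = lag_col_names(text_vars, [1])
names_lag3 = lag_col_names(text_vars, [3])

print(names_lag1)
print(names_lag3)
```

A helper of this kind would allow a single pipeline function that takes the lag as an argument instead of one function per lag.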

Here we create our feature-target data frames for the coal and ore raw datasets by calling the pipelines created above. These data frames combine the lagged numeric data with the count-vectorizer results for the tags and title (as well as the actual words, which will be ignored in any later pipeline).

In [13]:
fea_tgt_coal_pdf_lag1 = get_count_plus_ts_lag1_pipe().fit(bci_coal_pdf).transform(bci_coal_pdf)
fea_tgt_coal_pdf_lag3 = get_count_plus_ts_lag3_pipe().fit(bci_coal_pdf).transform(bci_coal_pdf)

In [14]:
fea_tgt_ore_pdf_lag1 = get_count_plus_ts_lag1_pipe().fit(bci_ore_pdf).transform(bci_ore_pdf)
fea_tgt_ore_pdf_lag3 = get_count_plus_ts_lag3_pipe().fit(bci_ore_pdf).transform(bci_ore_pdf)
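
To show what the count-vectorized columns in these data frames look like, here is a minimal plain-Python sketch. It is not the code behind `CountVectColDF`: the function `count_vectorize`, the whitespace tokenization, and the toy titles are assumptions made for illustration, and only the `cnt_title_` prefix comes from the pipelines above.

```python
from collections import Counter

# Sketch of count vectorization: one column per distinct word, one row per text.
# Whitespace tokenization is a simplification; a real vectorizer may tokenize differently.

def count_vectorize(texts, prefix):
    """Return (column names, count matrix) for a list of texts."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    columns = [prefix + tok for tok in vocab]
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # A Counter returns 0 for words that do not occur in this text.
        matrix.append([counts[tok] for tok in vocab])
    return columns, matrix

# Hypothetical titles, not taken from the coal or ore datasets.
titles = ["iron ore prices rise", "coal prices fall", "iron ore and coal"]
columns, matrix = count_vectorize(titles, prefix="cnt_title_")

print(columns)
for row in matrix:
    print(row)
```

Each distinct word becomes its own numeric column, which is why the number of features grows quickly with longer text such as content.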

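Below, we take the pipeline applied above and replace the count vectorizer with a TF-IDF vectorizer. TF-IDF weights each word by how often it appears in a row's tags or title, discounted by how common it is across all rows, so it should give an estimate of how relevant a tag or title word is to a given row.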

In [16]:
%python 
def get_tfidf_plus_ts_lag1_pipe():
  from sklearn.pipeline import FeatureUnion, Pipeline
  return Pipeline(steps=[
    ('fea_one', FeatureUnionDF(transformer_list=[
      ('tgt_var'    ,CreateTargetVarDF(var='bci_5tc')),
      ('dt_vars'    ,CreateDatetimeVarsDF(var='date')),
      ('lag_ts_vars',CreateLagVarsDF(
        var_list=['cme_ln2','rici','p1a_03','p4_03','c7',
                  'cme_ln3','p3a_03','shfe_rb3','shfe_al3',
                  'shfe_cu3','ice_tib3','cme_fc3','opec_orb',
                  'ice_sb3','p3a_iv','ice_kc3','c5',
                  'p2a_03','cme_lc2','cme_sm3','ice_tib4','bci','cme_ln1','cme_s2'],
        lag_list=[1])),
      ('lag_txt_vars',CreateLagVarsDF(var_list=['tags','title'],
                                      lag_list=[1])),
    ])),
    ('drop_na_rows'  ,DropNaRowsDF(how='any')),
    ('fea_two', FeatureUnionDF(transformer_list=[
      ('named_vars' ,CreateNamedVarsDF(except_list=['tags_lag1','title_lag1'])),
      ('tfidf_tags'   , TfidfVectColDF(col_name=   'tags_lag1',prefix='tfidf_tags_'   ,add_stop_words=[])),
      ('tfidf_title'  , TfidfVectColDF(col_name=  'title_lag1',prefix='tfidf_title_'  ,add_stop_words=[])),  
     # ('tfidf_content', TfidfVectColDF(col_name='content_lag1',prefix='tfidf_content_',add_stop_words=[])),  
    ])),
    ('drop_na_rows_again'  ,DropNaRowsDF(how='any')),
  ])
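
The difference between raw counts and TF-IDF weights can be sketched in plain Python. This is not the code behind `TfidfVectColDF`; the function `tfidf`, the whitespace tokenization, and the toy tags are hypothetical. The sketch uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, the form scikit-learn documents as its default, and omits the row normalization that a full vectorizer would typically also apply.

```python
import math
from collections import Counter

# Sketch of TF-IDF: term count in a row times a weight that shrinks for common terms.

def tfidf(texts):
    """Return (idf per term, list of {term: tf-idf weight} dicts, one per text)."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Document frequency: in how many texts each term occurs at least once.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append({tok: tf[tok] * idf[tok] for tok in tf})
    return idf, rows

# Hypothetical tags, not taken from the coal or ore datasets.
tags = ["iron ore", "iron coal", "iron shipping"]
idf, rows = tfidf(tags)

print(idf)
print(rows[0])
```

Here `iron` appears in every row and receives the lowest weight, while `ore` appears in only one row and is weighted more heavily in that row.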

Below, we create the feature-target data frames for the TF-IDF pipeline.

In [18]:
fea_tgt_coal_pdf_tfidf_lag1 = get_tfidf_plus_ts_lag1_pipe().fit(bci_coal_pdf).transform(bci_coal_pdf)

In [19]:
fea_tgt_ore_pdf_tfidf_lag1 = get_tfidf_plus_ts_lag1_pipe().fit(bci_ore_pdf).transform(bci_ore_pdf)

## Summary

In this notebook, we sought to create feature-target data frames that can then be used in our estimator pipelines. To accomplish this, we used classes discussed previously, such as `DropNaRowsDF` and `FeatureUnionDF`, our wrapper class for `FeatureUnion`. We also considered two ways of processing words, using the count vectorizer and the TF-IDF vectorizer in separate pipelines. Moving forward, we will create separate train and test datasets and explore different estimator pipelines in line with our stated objectives and goals.