# MA707 Report - Class Demonstrations (spring 2019, Blackjack)

## Introduction

## Contents
1. Setup
2. Class demonstrations

## 1. Setup

In [5]:
%run "./0.1 Raw dataset (inc)"

In [6]:
%run "./0.2 Feature creation (inc)"

In [7]:
%run "./0.3 Feature selection (inc)"

In [8]:
%run "./0.4 Estimators (inc)"

## 2. Class demonstrations

The following subsections demonstrate the transformer classes that `FeatureUnion` and `Pipeline` combine to create a feature-target data frame.

### 2.1 `CreateTargetVarDF`

The code below creates a target data frame from the `bci_coal_pdf` data frame, with `bci_5tc` as the target variable. This class will be useful in our preprocessing pipeline, but going forward we should take care to include only the predictor variables we actually want to consider.

In [13]:
%python
CreateTargetVarDF(var='bci_5tc') \
  .fit_transform(bci_coal_pdf) \
  .head()
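Because `CreateTargetVarDF` is defined in an included notebook and not shown here, the following is only a minimal sketch of what a target-extraction transformer of this kind might look like. The class name `TargetColumnSketch`, the toy data frame, and its values are all hypothetical; the real class may behave differently.

```python
import pandas as pd

class TargetColumnSketch:
    """Hypothetical stand-in for a target-extraction transformer."""

    def __init__(self, var):
        self.var = var

    def fit(self, X, y=None):
        # Nothing is learned; fail early if the target column is missing.
        if self.var not in X.columns:
            raise KeyError(f"column {self.var!r} not found")
        return self

    def transform(self, X):
        # Double brackets keep the result a one-column DataFrame.
        return X[[self.var]]

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

# Toy data with made-up values, for illustration only.
toy_pdf = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-04"]),
    "bci": [1200.0, 1215.0, 1190.0],
    "bci_5tc": [9800.0, 9950.0, 9700.0],
})

target_df = TargetColumnSketch(var="bci_5tc").fit_transform(toy_pdf)
print(target_df)
```

Returning a one-column data frame rather than a series keeps the output shape consistent with the other transformers, which is one reason a wrapper like this can be convenient inside a pipeline.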

### 2.2 `CreateDatetimeVarsDF`

The code below takes the `date` variable as its argument and splits the datetime values already in our dataset into the separate components listed in `var_list`. This may prove useful if we choose to explore how day of week or day of year affects the target variable, and it could reveal price fluctuations that align with seasonality or other macroeconomic factors we might not see elsewhere.

In [16]:
%python
CreateDatetimeVarsDF(var='date',
                     var_list=['year','month','day',
                               'dayofyear','weekofyear','weekday']) \
  .fit_transform(bci_coal_pdf) \
  .head()
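The splitting step can be sketched with the pandas `.dt` accessor. This is an illustration under stated assumptions, not the implementation of `CreateDatetimeVarsDF`: the function name `datetime_parts_sketch`, the `date_<component>` column naming, and the toy dates are hypothetical.

```python
import pandas as pd

def datetime_parts_sketch(df, var, var_list):
    """Return one column per requested datetime component of df[var]."""
    dates = pd.to_datetime(df[var])
    out = pd.DataFrame(index=df.index)
    for name in var_list:
        if name == "weekofyear":
            # Series.dt.weekofyear was removed in pandas 2.0;
            # the ISO calendar week is its replacement.
            values = dates.dt.isocalendar().week.astype("int64")
        else:
            # year, month, day, dayofyear and weekday are .dt attributes.
            values = getattr(dates.dt, name)
        out[f"{var}_{name}"] = values
    return out

# Toy dates, for illustration only.
toy_pdf = pd.DataFrame({"date": ["2019-01-01", "2019-03-15", "2019-12-31"]})

parts = datetime_parts_sketch(
    toy_pdf,
    var="date",
    var_list=["year", "month", "day", "dayofyear", "weekofyear", "weekday"],
)
print(parts)
```

Note that `weekday` counts Monday as 0, and that an ISO week near the end of December can belong to week 1 of the following year, which matters if week number is used as a seasonality feature.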

### 2.3 `CreateLagVarsDF`

Lagged variables will be very important going forward because they are the easiest way to formulate a prediction from our datasets. One thing to pay special attention to is the lag length: a one-day lag may give a much different result than a three-day lag, given how quickly information now reaches the market.

In [19]:
%python
CreateLagVarsDF(var_list=['cme_ln2','rici','p1a_03','p4_03','c7',
                          'cme_ln3','p3a_03','shfe_rb3','shfe_al3',
                          'shfe_cu3','ice_tib3','cme_fc3','opec_orb',
                          'ice_sb3','p3a_iv','ice_kc3','c5',
                          'p2a_03','cme_lc2','content','cme_sm3',
                          'ice_tib4','bci','tags','cme_ln1','cme_s2'],
                lag_list=range(0,2)) \
  .fit_transform(bci_coal_pdf) \
  .loc[:5,['bci_lag0','bci_lag1']] \
  .head()

Comparing the lagged output above with the original `date` and `bci` columns below confirms that the lagged data frame exists and that `bci_lag1` holds the `bci` value shifted back by one full day.

In [21]:
bci_coal_pdf \
  .loc[:,['date','bci']] \
  .head()
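The lagging idea can be illustrated with `Series.shift`. This is a minimal sketch, assuming the `<var>_lag<k>` column naming seen in the output above; the function name `lag_columns_sketch` and the toy values are hypothetical, and the real `CreateLagVarsDF` may differ.

```python
import pandas as pd

def lag_columns_sketch(df, var_list, lag_list):
    """Build <var>_lag<k> columns by shifting each variable k rows down."""
    out = pd.DataFrame(index=df.index)
    for var in var_list:
        for lag in lag_list:
            # shift(k) moves values k rows later, leaving NaN in the first k rows.
            out[f"{var}_lag{lag}"] = df[var].shift(lag)
    return out

# Toy values, for illustration only.
toy_pdf = pd.DataFrame({"bci": [100.0, 110.0, 105.0, 120.0]})

lagged = lag_columns_sketch(toy_pdf, var_list=["bci"], lag_list=range(0, 2))
print(lagged)
```

A lag of 0 reproduces the original column, while each positive lag leaves that many missing values at the top of the frame.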

### 2.4 `DropNaRowsDF`

Creating lagged variables leaves missing values at the start of the dataset, because the earliest rows have no prior observations to draw from. The NAs in the dataset will throw an exception in downstream steps. To combat this problem, we use `DropNaRowsDF` to remove the affected rows. While this solves our NA problem, one must weigh dataset size against the number of periods lagged. We are currently working with a large dataset, so there is not much concern, but with a smaller dataset the information lost by dropping rows could create problems.

In [24]:
%python 
from sklearn.pipeline import Pipeline
xfm_pipe = Pipeline(
  steps=[('lag',CreateLagVarsDF(var_list=['cme_ln2','rici','p1a_03','p4_03','c7','cme_ln3','p3a_03','shfe_rb3','shfe_al3',
                                          'shfe_cu3','ice_tib3','cme_fc3','opec_orb','ice_sb3','p3a_iv','ice_kc3','c5',
                                          'p2a_03','cme_lc2','content','cme_sm3','ice_tib4','bci','tags','cme_ln1','cme_s2'],
                                lag_list=range(0,3)))
        ])
xfm_pipe \
  .fit_transform(bci_pdf) \
  .loc[:,['bci_lag0',
          'bci_lag1',
          'bci_lag2']] \
  .head(3)

Above, we see what happens when the NaN rows are left in place; below, what happens when they are dropped. Note that dropping them costs us the first two days' worth of observations.

In [26]:
%python 
from sklearn.pipeline import Pipeline
xfm_pipe = Pipeline(
  steps=[('lag',CreateLagVarsDF(var_list=['cme_ln2','rici','p1a_03','p4_03','c7','cme_ln3','p3a_03','shfe_rb3','shfe_al3',
                                          'shfe_cu3','ice_tib3','cme_fc3','opec_orb','ice_sb3','p3a_iv','ice_kc3','c5',
                                          'p2a_03','cme_lc2','content','cme_sm3','ice_tib4','bci','tags','cme_ln1','cme_s2'],
                                lag_list=range(0,3))),
         ('row',DropNaRowsDF(how='any'))
        ])
xfm_pipe \
  .fit_transform(bci_pdf) \
  .loc[:,['bci_lag0',
          'bci_lag1',
          'bci_lag2']] \
  .head(2)
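The cost of dropping rows can be made concrete with plain pandas. The sketch below uses `DataFrame.dropna` directly rather than `DropNaRowsDF`, whose implementation is not shown here; the toy values and variable names are hypothetical.

```python
import pandas as pd

# Toy values, for illustration only.
toy_pdf = pd.DataFrame({"bci": [100.0, 110.0, 105.0, 120.0, 118.0]})

max_lag = 2
# One column per lag 0..max_lag; the dict keys become the column names.
lagged = pd.concat(
    {f"bci_lag{k}": toy_pdf["bci"].shift(k) for k in range(0, max_lag + 1)},
    axis=1,
)

# how='any' drops a row if at least one column is missing.
dropped_any = lagged.dropna(how="any")
# how='all' drops a row only if every column is missing.
dropped_all = lagged.dropna(how="all")

rows_lost = len(lagged) - len(dropped_any)
print(f"rows before: {len(lagged)}, after how='any': {len(dropped_any)}, lost: {rows_lost}")
print(f"rows after how='all': {len(dropped_all)}")
```

With `how='any'` the number of rows lost equals the largest lag, so the trade-off between lag depth and sample size is easy to estimate before fitting anything.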

### 2.5 `CountVectColDF`

The code below shows how the `CountVectorizer` wrapper class is defined and put to work. We will use it, along with other transformers (Tfidf, PCA, etc.), in our preprocessing pipelines and when evaluating our working model.

In [29]:
%python 
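import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
class CountVectColDF(CountVectorizer):
  """CountVectorizer applied to one text column, returning a DataFrame."""
  def __init__(self,col_name,prefix='cnt_',
               stop_words=list(ENGLISH_STOP_WORDS),
               add_stop_words=None
              ):
    # None replaces the mutable [] default and means "no extra stop words"
    stop_words_list = list(stop_words)+list(add_stop_words or [])
    self.col_name = col_name
    self.prefix   = prefix
    # stored so get_params()/clone() can find every __init__ argument
    self.add_stop_words = add_stop_words
    super().__init__(stop_words=stop_words_list)
    return
  
  def fit(self,X,y=None):
    # fit the vocabulary on the selected text column only
    super().fit(X[self.col_name])
    return self
  
  def transform(self,X,y=None):
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    return pd.DataFrame(data=super().transform(X[self.col_name]).toarray(),
                        columns=[self.prefix+feature_name for feature_name in self.get_feature_names_out()]
                       )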

Below, we apply the `CountVectColDF` wrapper class to the `tags` column to demonstrate how it will function in our preprocessing pipeline.

In [31]:
%python
CountVectColDF(col_name='tags',
               prefix='cnt_',
               add_stop_words=['2012']) \
  .fit(bci_coal_pdf) \
  .transform(bci_coal_pdf) \
  .head() \
  .columns
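To see the effect of an added stop word in isolation, here is a small sketch that uses `CountVectorizer` directly on a toy `tags` column. The tag strings are invented for illustration, and the column names are read from the fitted `vocabulary_` attribute so that the sketch does not depend on a particular scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Invented tag strings, for illustration only.
toy_pdf = pd.DataFrame({
    "tags": ["coal prices 2012", "coal exports and shipping", "iron ore 2012"]
})

# Built-in English stop words plus one extra token to exclude.
vec = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS) + ["2012"])
matrix = vec.fit_transform(toy_pdf["tags"])

# vocabulary_ maps each term to its column index; sort terms by that index.
names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

cnt_df = pd.DataFrame(
    matrix.toarray(),
    columns=["cnt_" + name for name in names],
    index=toy_pdf.index,  # keep row labels aligned with the input frame
)
print(cnt_df.columns.tolist())
```

Passing `index=` is an addition in this sketch; it keeps the count columns aligned with the source rows if rows were dropped earlier in a pipeline.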

### 2.6 `TfidfVectColDF`

The code below is very similar to the code above; the main difference is that it wraps `TfidfVectorizer` rather than `CountVectorizer`. Where `CountVectorizer` records raw term counts, `TfidfVectorizer` down-weights terms that appear in many documents. This difference may prove relevant depending on how we process the text fields and on whether rare, distinctive words carry more signal than common ones.

In [34]:
%python 
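import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
class TfidfVectColDF(TfidfVectorizer):
  """TfidfVectorizer applied to one text column, returning a DataFrame."""
  # default prefix changed from 'cnt_' so tf-idf columns do not collide with count columns
  def __init__(self,col_name,prefix='tfidf_',
               stop_words=list(ENGLISH_STOP_WORDS),
               add_stop_words=None
              ):
    # None replaces the mutable [] default and means "no extra stop words"
    stop_words_list = list(stop_words)+list(add_stop_words or [])
    self.col_name = col_name
    self.prefix   = prefix
    # stored so get_params()/clone() can find every __init__ argument
    self.add_stop_words = add_stop_words
    super().__init__(stop_words=stop_words_list)
    return
  
  def fit(self,X,y=None):
    # fit the vocabulary and idf weights on the selected text column only
    super().fit(X[self.col_name])
    return self
  
  def transform(self,X,y=None):
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    return pd.DataFrame(data=super().transform(X[self.col_name]).toarray(),
                        columns=[self.prefix+feature_name for feature_name in self.get_feature_names_out()]
                       )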
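The difference between the two vectorizers can be seen on a toy corpus. This sketch uses the plain scikit-learn classes with default settings rather than the wrapper classes; the documents, the helper `to_frame`, and the `tfidf_` prefix are hypothetical choices made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented documents: 'coal' appears in all three, every other term in one.
docs = ["coal prices rise", "coal exports fall", "coal iron ore"]

def to_frame(vectorizer, documents, prefix):
    """Fit a vectorizer and return its output as a prefixed DataFrame."""
    matrix = vectorizer.fit_transform(documents)
    # vocabulary_ maps each term to its column index; sort terms by that index.
    names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return pd.DataFrame(matrix.toarray(), columns=[prefix + n for n in names])

counts = to_frame(CountVectorizer(), docs, "cnt_")
tfidf = to_frame(TfidfVectorizer(), docs, "tfidf_")

# In the first document both terms occur once, so their counts are equal,
# but tf-idf gives the corpus-wide term 'coal' a smaller weight.
print(counts.loc[0, ["cnt_coal", "cnt_prices"]])
print(tfidf.loc[0, ["tfidf_coal", "tfidf_prices"]])
```

Whether the down-weighting helps depends on the text: if a tag appears on nearly every row, tf-idf treats it as less informative than the raw counts would suggest.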

## Summary

Above, we demonstrate some of the classes that will be used in our future notebooks. Showing them here not only confirms that they work but also shows what their output looks like, which matters because knowing the output helps us determine which feature selection methods will work best for our investigation.