# Exercises - Week 4 - Feature Union - Blackjack

## References
- https://scikit-learn.org/stable/data_transforms.html
- https://scikit-learn.org/stable/modules/preprocessing.html
- https://scikit-learn.org/stable/modules/compose.html#combining-estimators
- https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces

## Contents
1. Setup 
2. Data Lab notebooks
3. Dataset
4. Wrapper transformer classes
5. Types of pipelines
6. `FeatureUnion`

## 1. Setup

Load libraries and display version numbers.

In [6]:
import pandas  as pd
import numpy   as np
import sklearn as sk
print('sklearn',sk.__version__)
print('pandas ',pd.__version__)
print('numpy  ',np.__version__)

In [7]:
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin

These version numbers may not be the most recent or correspond to the documentation you locate via Google.

The `display_pdf` function displays a pandas dataframe using the Databricks `display` function.

In [10]:
def display_pdf(a_pdf):
  display(spark.createDataFrame(a_pdf))

## 2. Data Lab notebooks
- [sklearn/Introduction](https://bentley.cloud.databricks.com/#notebook/210807) 
- [sklearn/Preprocessing](https://bentley.cloud.databricks.com/#notebook/404771)
- [Variance Thresholding For Feature Selection](https://bentley.cloud.databricks.com/#notebook/434422)
- [Univariate Feature Selection](https://bentley.cloud.databricks.com/#notebook/478847)

## 3. Dataset

Display the paths of the three files in our dataset.

In [14]:
%sh ls -hot /dbfs/mnt/group-ma707/data/*

__Exercise:__ write functions `get_coal_pdf`, `get_iron_ore_pdf` and `get_bci_pdf` which 
- read dataframes from the respective CSV files
- replace the tilde with an underscore (in the variable name with the tilde)
- rename the variables to snake case (lowercase and underscores between words)
- create a variable `timestamp` in the mining dataframes of type `datetime64`
- create `date` variables of type `datetime64` with only year, month, and day (in all 3 functions)

The cell below imports pandas and defines `get_coal_pdf`. The first line of the function reads the CSV file using the ISO-8859-1 encoding. The next line converts the column names to snake case, and the last line parses the `date` column into a `timestamp` column of type `datetime64`.

In [17]:
import pandas as pd
import numpy as np
def get_coal_pdf():
  # Read the coal file using the ISO-8859-1 encoding.
  x = pd.read_csv('/dbfs/mnt/group-ma707/data/mining_com_coal.csv', encoding = "ISO-8859-1")
  # Snake-case the column names: lowercase, underscores for spaces and hyphens.
  x.columns = x.columns.str.lower().str.replace(' ','_').str.replace('-', '_')
  # Parse the date strings into a datetime64 column.
  x['timestamp'] = pd.to_datetime(x['date'],format='%m/%d/%Y')
  return x


The function below follows the same basic principle as the one above. It additionally slices the first ten characters of `date` (the year, month, and day) and parses them into a date-only column, `date_new`, for the iron ore file.

In [19]:
import pandas as pd
def get_iron_ore_pdf():
  mining_com_iron_ore_pdf = pd.read_csv ('/dbfs/mnt/group-ma707/data/mining_com_iron_ore.csv', encoding = "ISO-8859-1")
  mining_com_iron_ore_pdf['timestamp'] = pd.to_datetime(mining_com_iron_ore_pdf['date'],format='%Y-%m-%d %H:%M:%S')
  mining_com_iron_ore_pdf['date2']=mining_com_iron_ore_pdf['date'].str.slice(0,10)
  mining_com_iron_ore_pdf['date_new'] = pd.to_datetime(mining_com_iron_ore_pdf['date2'],format='%Y-%m-%d')
  
  return mining_com_iron_ore_pdf
mining_com_iron_ore_pdf= get_iron_ore_pdf()
#mining_com_iron_ore_pdf.head()

mining_com_coal_pdf=pd.read_csv('/dbfs/mnt/group-ma707/data/mining_com_coal.csv', encoding = "ISO-8859-1")
mining_com_coal_pdf.head()

Note the differences from the functions above: no ISO-8859-1 encoding is needed to read the file, and no separate date-only column is created because the dates in the BCI 5TC file have no time portion.

In [21]:
def get_bci_pdf():
  bci_5tc_plus_pdf = pd.read_csv('/dbfs/mnt/group-ma707/data/5tc_plus_ind_vars.csv')
  bci_5tc_plus_pdf['Date_new']=pd.to_datetime(bci_5tc_plus_pdf['Date'],format='%Y-%m-%d')
  return bci_5tc_plus_pdf
# The function has its own name, so re-running this cell does not overwrite it.
bci_5tc_plus_pdf=get_bci_pdf()
#bci_5tc_plus_pdf.head()

__Exercise:__ demonstrate that there are multiple entries for some dates (year, month, day) in the mining dataframes, using: 
- the `to_datetime` function from pandas
- the `dt` and `date` attributes
- the `value_counts` and `sort_values` methods 

Do so with a single method chain (hence no variables).

Using `value_counts` (and `describe`), the output shows that some dates occur more than once. This will create problems later in our analysis: the target variable has only one value per date, but the features have multiple entries on the same date.

In [24]:
a= mining_com_iron_ore_pdf['date_new'].value_counts()
mining_com_iron_ore_pdf['date_new'].describe()
a
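
The exercise asks for a single method chain. A minimal sketch of one, run on made-up timestamps (the `toy_pdf` frame and its values are illustrative assumptions, not taken from the mining files; the result is assigned to `counts` only so that it can be printed):

```python
import pandas as pd

# Hypothetical stand-in for a mining dataframe: two rows share 2019-01-02.
toy_pdf = pd.DataFrame({'date': ['2019-01-01 08:00:00',
                                 '2019-01-02 09:30:00',
                                 '2019-01-02 17:45:00']})

# One chain: parse the strings, keep the calendar date, count rows per date,
# and sort so that the most frequent dates come first.
counts = (pd.to_datetime(toy_pdf['date'], format='%Y-%m-%d %H:%M:%S')
            .dt.date
            .value_counts()
            .sort_values(ascending=False))
print(counts.head())
```

Any count greater than one at the top of the sorted result indicates a date with multiple entries.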

__Exercise:__ demonstrate that there are no multiple entries for any dates in the 5TC dataframe, using: 
- the `to_datetime` function from pandas
- the `dt` and `date` attributes
- the `value_counts` and `describe` methods 

Do so with a single method chain (hence no variables).

Using the same approach as above, we see that the count and unique statistics are identical, which indicates that there are exactly 1602 date values in the 5TC dataframe and that all of them are unique.

In [27]:
bci_5tc_plus_pdf['Date_new'].value_counts().describe()


__Note:__ we will create, from the above files, at least three dataframes (with features and target). Each is described below.

From only the 5TC dataset: 
- target will be `BCI`
- features will include lagged versions of the other columns 
- features will include date and time components (hour, day of week, etc.)
- features may include external time series

From only the _mining_ dataset(s):
- the target may be one or more tags (from the `tags` variable)
- features would be words present in the `content` or `title` variables

From the 5TC and _mining_ dataset(s): 
- target will be `BCI` (from 5TC dataframe)
- include all features from either of the above dataframes
- the dataframes would need to be joined by either:
    1. aggregating the features from the _mining_ dataframe (by date)
    1. spreading the 5TC dataframe onto the _mining_ dataframe (duplicating 5TC rows)

__Exercise:__ discuss with your group which of the two options for the third (combined) dataset look more promising or most reasonable.

After discussing at length, we believe that aggregating the features from the mining dataframe is the better option. Our reasoning is as follows: since the 5TC variable is our target, we do not want multiple instances of the same target value. Aggregating the mining dataframes gives a more accurate picture of everything that affected the target on the day it was measured. We also think it is important to understand when the target value was collected each day, and perhaps to group the data by that collection time rather than by the standard 24-hour clock (i.e. for our analysis, the day either starts or ends with the collection of the target value, and all features are collected before or after it).
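
The aggregation option can be sketched as follows. This is only an illustration under assumed inputs: `target_pdf`, `articles_pdf`, and their column names and values are hypothetical stand-ins for the 5TC and mining dataframes, and joining titles and counting articles is just one possible aggregation.

```python
import pandas as pd

# Hypothetical daily target, one row per date (stand-in for the 5TC data).
target_pdf = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-01-02']),
    'bci': [100.0, 105.0]})

# Hypothetical article data with several rows on the same date.
articles_pdf = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-01-01 08:00:00',
                                 '2019-01-02 09:30:00',
                                 '2019-01-02 17:45:00']),
    'title': ['coal prices rise', 'iron ore slips', 'coal exports up']})

# Aggregate to one row per date: concatenate the titles and count the articles.
daily_pdf = (articles_pdf
             .assign(date=articles_pdf['timestamp'].dt.normalize())
             .groupby('date', as_index=False)
             .agg(titles=('title', ' '.join), n_articles=('title', 'size')))

# A left join keeps exactly one row per target date.
combined_pdf = target_pdf.merge(daily_pdf, on='date', how='left')
print(combined_pdf)
```

Because the mining side is reduced to one row per date before the join, no target row is duplicated.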

## 4. Wrapper transformer classes

__Note:__ Below is a template for a transformer class.

In [33]:
class TransformerExample(BaseEstimator, TransformerMixin):
  def __init__(self, init_param1, init_param2):
    self.init_param1 = init_param1 
    self.init_param2 = init_param2    
    self.vec = CountVectorizer()
  def fit(self, X, y=None):
    # set attributes of self and then return self
    self.vec.fit(X)
    return self
  def transform(self, X):
    # return a dataframe/array from X
    # do not modify X
    my_df = pd.DataFrame(self.vec.transform(X).toarray())
    my_df.columns = self.init_param1
    return my_df

__Exercise:__ create wrapper transformer classes called `CountVectorizerDF` and `TfidfVectorizerDF` which call their counterparts above (`CountVectorizer` and `TfidfVectorizer`) and return a pandas dataframe with:
- column names from the `get_feature_names` method (of the counterpart object; newer scikit-learn releases replace it with `get_feature_names_out`)
- values from the `fit_transform` method (of the counterpart object)

You may need to use the `toarray` method on the `fit_transform` output. Demonstrate these two classes using `corpus` (below).

The code below shows a wrapper transformer class for `CountVectorizer`. The init method only calls the `CountVectorizer` init method. The `fit` method fits the vectorizer to the data that is passed and stores the column names obtained from `get_feature_names`. Finally, the `transform` method builds and returns a pandas dataframe from the transformed data and the stored column names.

In [36]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

The following code wraps the `CountVectorizer` class in our own class by subclassing it. Note the use of `super()` to delegate to the parent class: for example, our `fit` method calls the `fit` method of `CountVectorizer` unchanged. The main difference between our class and `CountVectorizer` is that `transform` returns a dataframe.

In [38]:
class CountVectorizerDF2(CountVectorizer):
  def __init__(self):
    super().__init__()
  def fit(self,X, y = None):
    # Fit once, then store the learned terms as column names.
    super().fit(X,y)
    self.columns = self.get_feature_names()
    return self
  def transform(self,X):
    # Densify the sparse counts and label the columns.
    abc = pd.DataFrame(super().transform(X).toarray())
    abc.columns = self.columns
    return abc

The `corpus` list is simply test input for the classes we created.

In [40]:
corpus = [
  'dogs and cats.',
  'dogs, more dogs and horses.',
  'cats or birds.']

Results of the corpus experiment on the `CountVectorizer` wrapper class. The code works as expected when `fit` and `transform` are chained. The inherited `fit_transform` method is not overridden in our class, so it does not go through our `transform` method and returns a sparse matrix rather than a dataframe. To get a dataframe from `fit_transform`, we would have to define that method in our class as well.

In [42]:
CountVectorizerDF2().fit(corpus).transform(corpus)

Our results shown above match the results generated earlier.
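
One way to get a working `fit_transform` is to hold the vectorizer as an attribute instead of subclassing it, so that `TransformerMixin` supplies a `fit_transform` that chains our own `fit` and `transform`. The sketch below is an alternative design, not the approach used in this notebook; the class name `CountVectorizerFrame` is hypothetical, and the column names are read from the fitted `vocabulary_` attribute so that the code does not depend on which feature-name method the installed scikit-learn provides.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer

class CountVectorizerFrame(BaseEstimator, TransformerMixin):
  """Hypothetical wrapper: holds a CountVectorizer and returns a dataframe."""
  def fit(self, X, y=None):
    self.vec_ = CountVectorizer().fit(X)
    # vocabulary_ maps each term to its column index; order terms by that index.
    self.columns_ = sorted(self.vec_.vocabulary_, key=self.vec_.vocabulary_.get)
    return self
  def transform(self, X):
    return pd.DataFrame(self.vec_.transform(X).toarray(), columns=self.columns_)

docs = ['dogs and cats.', 'dogs, more dogs and horses.', 'cats or birds.']
# fit_transform comes from TransformerMixin and calls fit, then transform.
counts_pdf = CountVectorizerFrame().fit_transform(docs)
print(counts_pdf)
```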

As above, we define the class with an init method that takes only `self`. The steps for `TfidfVectorizer` do not differ from those for `CountVectorizer`; the class simply subclasses `TfidfVectorizer` instead.

In [45]:
class TfidfVectorizerDF2(TfidfVectorizer):
  def __init__(self):
    super().__init__()
  def fit(self, X, y = None):
    # Fit once, then store the learned terms as column names.
    super().fit(X, y)
    self.columns = self.get_feature_names()
    return self
  def transform(self, X):
    # Densify the sparse tf-idf weights and label the columns.
    abc = pd.DataFrame(super().transform(X).toarray(), columns  = self.columns)
    return abc

Here we see the output of the tf-idf wrapper. It differs from the output above because `TfidfVectorizer` returns normalized tf-idf weights, whereas `CountVectorizer` returns raw term counts.

In [47]:
result=TfidfVectorizerDF2().fit(corpus).transform(corpus)
result

__Exercise:__ create a wrapper transformer class called `CountVectColDF`.
1. The init method should have a parameter called `col_name`
1. The init method creates and then stores a `CountVectorizer` object in an attribute called `vec`
1. The `transform` method expects a pandas dataframe in the `X` argument
1. The `transform` method passes the `col_name` column of the `X` dataframe to the `transform` method of the `CountVectorizerDF` object stored in `vec`
1. The `transform` method then returns the result of the previous item (4).

Test your class with the command `CountVectColDF('tags').fit_transform(mining_com_coal_pdf)`

The following code wraps the `CountVectorizerDF2` class in a new class called `CountVectColDF2`. We again use `super()`: the class selects the named column from the dataframe and passes it to the count-vectorizer wrapper. As with the classes above, we chain `fit` and `transform` rather than calling `fit_transform`. Passing `'tags'` pulls the `tags` column from the given dataframe. This will be useful later in our analysis, as we will be able to pull out the needed columns and run a count-vectorizer or tf-idf analysis on them.

In [50]:
class CountVectColDF2(CountVectorizerDF2):
  def __init__ (self, col_name):
    super().__init__()
    self.col_name = col_name
  def fit (self, X, y = None):
    # Cast the column to strings so that missing values can be tokenized.
    doc=np.array(X[self.col_name]).astype('U').tolist()
    super().fit(doc,y)
    return self
  def transform (self, X):
    # Vectorize the column of the dataframe passed in, not the one seen in fit.
    doc=np.array(X[self.col_name]).astype('U').tolist()
    return super().transform(doc)

In [51]:
result=CountVectColDF2('tags').fit(mining_com_coal_pdf).transform(mining_com_coal_pdf)
result.info()
print(result)

## 5. Types of pipelines

There are two key observations about pipelines in `sklearn` that determine how I use them:
1. The `transform` method does not transform the target variable (`y`)
1. Transformer steps in an estimator pipeline cannot add or drop rows 

This implies two types of pipelines, transformer pipelines and estimator pipelines, and suggests how to use them.

Transformer pipelines: 
- create features for the estimator pipeline 
- includes, but doesn't modify, target variable values
- fills in missing values of feature variables
- may drop rows to remove rows with missing values 
- result in feature variables and a target variable with no missing values

Keep transformation steps to a minimum for two reasons:
1. there is no distinction between train and test data 
2. these pipelines cannot be part of grid search (as they aren't estimators)

Estimator pipelines: 
- should not contain missing values
- include transformations that modify columns
- can be run using grid search
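
As an illustration of the last point, the sketch below runs a small estimator pipeline through a grid search. The data are synthetic and the choice of `StandardScaler`, `Ridge`, and the `alpha` grid is an assumption made only for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, complete data (no missing values), used only for illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Estimator pipeline: a column-modifying transformer followed by a regressor.
est_pipe = Pipeline([('scale', StandardScaler()), ('reg', Ridge())])

# Grid search refits the whole pipeline on each training fold, so the scaler
# is never fitted on the corresponding validation fold.
search = GridSearchCV(est_pipe, param_grid={'reg__alpha': [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

For time-ordered data such as the BCI series, a splitter like `TimeSeriesSplit` may be a more suitable value for `cv` than the default folds.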

Initially, we focus on creating classes for a transformer pipeline. Two of these classes are `DatetimeVarsDF` and `LagVarsDF`, to be built in exercises below.

__Exercise:__ create a transformer called `DatetimeVarsDF` that: 
- has an init method with one parameter `dt_col_name` that is the name of a `datetime` column (that is stored as an attribute of self)
- has a `transform` method that returns a dataframe of all individual date and time variables derived from the datetime variable named in `dt_col_name`

The following code defines `DatetimeVarsDF`, a step for a transformer pipeline. It stores the column name given by the user in `dt_col_name`, and its `transform` method parses that column and returns a pandas Series holding the date and time of each observation. This is useful for spotting dates that have several times associated with them, which matters for the merging of target and feature dataframes discussed earlier.

In [59]:
class DatetimeVarsDF():
  def __init__(self,dt_col_name):
    self.dt_col_name = dt_col_name
  def transform(self,df):
    # Parse the named column into datetimes without storing or modifying df.
    results=pd.to_datetime(df[self.dt_col_name],format='%Y-%m-%d %H:%M:%S')
    return results
  

In [60]:
a=DatetimeVarsDF('date').transform(mining_com_coal_pdf)
a
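
The exercise also asks for the individual date and time variables. A minimal sketch of how they could be derived with the `dt` accessor is given below; the class name `DatetimePartsDF`, the choice of components, and the toy data are assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DatetimePartsDF(BaseEstimator, TransformerMixin):
  """Hypothetical variant that expands one datetime column into its parts."""
  def __init__(self, dt_col_name):
    self.dt_col_name = dt_col_name
  def fit(self, X, y=None):
    return self  # nothing to learn from the data
  def transform(self, X):
    # Parse without modifying X, then read each component from the accessor.
    dt = pd.to_datetime(X[self.dt_col_name]).dt
    return pd.DataFrame({'year': dt.year, 'month': dt.month, 'day': dt.day,
                         'dayofweek': dt.dayofweek, 'hour': dt.hour,
                         'minute': dt.minute, 'second': dt.second},
                        index=X.index)

# Made-up timestamps in the same layout as the iron ore dates.
toy_pdf = pd.DataFrame({'date': ['2019-01-02 09:30:00', '2019-01-05 17:45:10']})
parts_pdf = DatetimePartsDF('date').fit_transform(toy_pdf)
print(parts_pdf)
```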

## 6. `FeatureUnion`

`FeatureUnion` combines several transformer objects into a new transformer that combines their output. --- Scikit-learn 

The `fit` and `transform` methods of a `FeatureUnion` object initiate the same methods on each component transformer object.
The result of the `transform` method (of the `FeatureUnion` object) is the column-wise concatenation of the results of the `transform` methods applied to the component transformer objects. 

For example:

In [63]:
from sklearn.pipeline                import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = [
  'dogs and cats.',
  'dogs, more dogs and horses.',
  'cats or birds.'
]
fea = FeatureUnion([('cnt_vec', CountVectorizer()),
                ('idf_vec', TfidfVectorizer())
                   ])
fea_pdf = \
fea.fit_transform(corpus) \
   .toarray() \
   .round(3)
display_pdf(pd.DataFrame(data=fea_pdf))


0,1,2,3,4,5,6,7,8,9,10,11,12,13
1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.577,0.0,0.577,0.577,0.0,0.0,0.0
1.0,0.0,0.0,2.0,1.0,1.0,0.0,0.344,0.0,0.0,0.688,0.452,0.452,0.0
0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.623,0.474,0.0,0.0,0.0,0.623
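
The column-wise concatenation can be checked directly. The sketch below, which assumes SciPy is available (scikit-learn depends on it), compares the `FeatureUnion` result with a manual horizontal stack of the two component outputs; the names `docs`, `union`, `combined`, and `manual` are introduced only for this example.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

docs = ['dogs and cats.', 'dogs, more dogs and horses.', 'cats or birds.']

union = FeatureUnion([('cnt_vec', CountVectorizer()),
                      ('idf_vec', TfidfVectorizer())])
combined = union.fit_transform(docs)

# Fit each transformer separately and stack the outputs side by side.
manual = sparse.hstack([CountVectorizer().fit_transform(docs),
                        TfidfVectorizer().fit_transform(docs)])

# Same shape and same values: 7 count columns followed by 7 tf-idf columns.
print(combined.shape)
print(np.allclose(combined.toarray(), manual.toarray()))
```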


__Exercise:__ copy the `FeatureUnion` example above and modify it to use your two new transformer classes `CountVectorizerDF` and `TfidfVectorizerDF`.

By modifying the `FeatureUnion` example above, we run the same data through both of our transformers and concatenate the results. This is useful for comparing results side by side and could help in the project analysis, but it provides little new information and serves mainly as a viewing tool. Note that the `CountVectorizer` output occupies columns 0-6 and the tf-idf output occupies columns 7-13, which shows that `FeatureUnion` places the results side by side. We should keep this in mind when applying `FeatureUnion` in our analysis.

In [66]:
fea_Vec = FeatureUnion([('cnt_vec',CountVectorizerDF2()),
                       ('idf_vec',TfidfVectorizerDF2())
                      ])
# FeatureUnion calls fit_transform on each component. Our classes inherit that
# method unchanged, so it returns sparse matrices, and toarray() yields an ndarray.
fea_pdf2 = fea_Vec.fit_transform(corpus).toarray().round(3)
print(type(fea_pdf2))
display_pdf(pd.DataFrame(data=fea_pdf2))


0,1,2,3,4,5,6,7,8,9,10,11,12,13
1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.577,0.0,0.577,0.577,0.0,0.0,0.0
1.0,0.0,0.0,2.0,1.0,1.0,0.0,0.344,0.0,0.0,0.688,0.452,0.452,0.0
0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.623,0.474,0.0,0.0,0.0,0.623


__The End__