# Part 2. Featurization jobs

## Overview

In the previous demo we learned how to use the `DataSource` and `Selector` classes to build a tidy dataset from an external data source. For small datasets that can be built in a few minutes, these two components are enough. For bigger datasets, however, additional concerns arise:

* Sometimes the dataset is too large to hold in memory. Since the selector produces rows one by one, this is not a big problem: we can just split them into several smaller dataframes.
* Sometimes we want to exclude some records from the dataset, or produce several rows per object.
* Sometimes we do not want the resulting dataframe itself, but some aggregated statistics instead.
* And finally, sometimes we do not want to execute this procedure on our local machine. Instead, we want to deliver it to the cloud.

To address these concerns, TG offers the `FeaturizationJob` class, which is described in this demo.

First, we will create a data source and a selector in a way similar to the previous demo, but without the artificial distortion.

In [1]:
from tg.common.datasets.access import MockDfDataSource
import pandas as pd

source = MockDfDataSource(pd.read_csv('titanic.csv'))
selector = lambda z: z
selector(source.get_data().first())

{'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.25, 'Cabin': nan, 'Embarked': 'S'}

`MockDfDataSource` is a class that converts a dataframe into a DOF of its rows. It is handy, for example, in unit tests. Since the output of `MockDfDataSource` is already in an appropriate format, we don't need a complex selector, so we just use the identity function.
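To make the idea of turning a dataframe into a flow of rows concrete, here is a minimal sketch in plain pandas. It is not the TG implementation: the helper name `iterate_rows` and the toy dataframe are assumptions made purely for illustration.

```python
import pandas as pd

def iterate_rows(df):
    """Yield each row of the dataframe as a plain dict, one at a time."""
    # Hypothetical helper, not part of TG: it only mimics the idea of a row flow
    for _, row in df.iterrows():
        yield row.to_dict()

# Toy data standing in for the Titanic file
toy_df = pd.DataFrame({
    'PassengerId': [1, 2],
    'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley'],
    'Pclass': [3, 1],
})

rows = list(iterate_rows(toy_df))
first = rows[0]
```

Each element of `rows` is a dictionary keyed by column name, which is the same shape as the object printed by the cell above.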

Now we can create a data frame in the most primitive way:

In [2]:
df = pd.DataFrame(selector(z) for z in source.get_data())
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The featurization job that produces the same result can be created as follows:

In [3]:
from tg.common.datasets.featurization import FeaturizationJob, DataframeFeaturizer, InMemoryJobDestination


destination = InMemoryJobDestination()

job = FeaturizationJob(
    name = 'job',
    version = 'v1',
    source = source,
    featurizers = {
        'passengers': DataframeFeaturizer(row_selector = selector)
    },
    destination = destination
)

job.run()
destination.buffer['passengers'][0].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The new classes in the code above are:

* `DataframeFeaturizer`: When used in this way, it just applies `row_selector` to each data object from `source` and collects the results into pandas dataframes.
* `InMemoryJobDestination`: In general, `destination` defines how and where the resulting dataframes are stored. When set to `InMemoryJobDestination`, they are stored in memory, in `destination.buffer`:
  * keys of this dictionary are the names of featurizers (this is where `passengers` came from).
  * values are lists of dataframes (because, as we will see soon, more than one dataframe can be produced).
  
Other implementations of the `JobDestination` base class allow you to store dataframes in the local filesystem, or send them to S3.
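The layout of `destination.buffer` described above (featurizer name mapped to a list of partitions) can be sketched with a small stand-in class. This is only an illustration: `ToyInMemoryDestination` and its `store` method are hypothetical names and do not reflect the actual `JobDestination` interface, and plain lists of dicts stand in for dataframes.

```python
from collections import defaultdict

class ToyInMemoryDestination:
    """Hypothetical stand-in for an in-memory destination (not the TG class)."""
    def __init__(self):
        # featurizer name -> list of stored partitions, in arrival order
        self.buffer = defaultdict(list)

    def store(self, featurizer_name, partition):
        # `store` is an assumed method name used only in this sketch
        self.buffer[featurizer_name].append(partition)

toy_destination = ToyInMemoryDestination()
toy_destination.store('passengers', [{'PassengerId': 1}, {'PassengerId': 2}])
toy_destination.store('passengers', [{'PassengerId': 3}])
toy_destination.store('cabins', [{'id': 'C85', 'count': 1}])

# How many partitions were stored under each featurizer name
partition_counts = {name: len(parts) for name, parts in toy_destination.buffer.items()}
```

Indexing with `[0]`, as in `destination.buffer['passengers'][0]`, then simply picks the first stored partition for that featurizer.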

## Partitioning

What if the data is too big? In itself, this is not a problem: data sources do not normally keep all the data in memory at once, and selectors process data objects one by one. But in the last step, when we assemble the rows into a dataframe, we might run into a problem. Let's use the `DataframeFeaturizer` arguments to prevent it:

In [4]:
destination = InMemoryJobDestination()

job = FeaturizationJob(
    name = 'job',
    version = 'v1',
    source = source,
    featurizers = {
        'passengers': DataframeFeaturizer(buffer_size=3, row_selector = selector)
    },
    destination = destination
)

job.run()
print(len(destination.buffer['passengers']))
destination.buffer['passengers'][0]

297


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


Here we have limited the number of rows that can be put into one dataframe to 3. As a result, each dataframe in `destination.buffer['passengers']` has no more than 3 rows, and of course we have a lot of them. With destinations that keep dataframes in files (locally or on S3), each dataframe is placed into its own file.
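The buffering idea behind `buffer_size` can be sketched independently of TG. The helper below is a hypothetical illustration, not the library's code: it groups a stream of row dicts into dataframes of at most `buffer_size` rows and emits a shorter final partition if rows are left over.

```python
import pandas as pd

def partition_rows(rows, buffer_size):
    """Group a stream of row dicts into dataframes of at most `buffer_size` rows."""
    # Hypothetical helper that mimics the buffering idea; not TG code
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == buffer_size:
            yield pd.DataFrame(buffer)
            buffer = []
    if buffer:
        # Emit the last, possibly shorter, partition
        yield pd.DataFrame(buffer)

toy_rows = ({'PassengerId': i} for i in range(1, 11))  # a lazy stream of 10 rows
frames = list(partition_rows(toy_rows, buffer_size=3))
sizes = [len(frame) for frame in frames]
```

Only one partition's worth of rows is held in `buffer` at any time, which is what keeps memory use bounded regardless of the size of the source.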

## Filtering / expanding

What if our data are more complicated, and there is no 1-to-1 correspondence between data objects and rows? Examples are:
* We want to filter out some rows. In this case, 1 incoming data object corresponds to 0 rows.
* We are processing data that are organized not as a flow of passengers, but as a flow of cabins, where each cabin is a list of passengers. In this case, 1 incoming data object corresponds to an arbitrary number of rows.

Let's implement the first option by subclassing `DataframeFeaturizer`, and also explore some additional features of this class:

In [5]:
import numpy as np

class MyDataFrameFeaturizer(DataframeFeaturizer):
    def __init__(self):
        super(MyDataFrameFeaturizer, self).__init__()
        
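    def _featurize(self, obj):
        # Return zero rows for passengers under 18, and one row otherwise.
        # A missing Age (NaN) is not less than 18, so such passengers are kept.
        if obj['Age'] < 18:
            return []
        else:
            return [obj]

    def _postprocess(self, df):
        # Runs on the assembled dataframe: replace missing cabins with a marker
        df.Cabin = np.where(df.Cabin.isnull(), 'NONE', df.Cabin)
        return df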
     
destination = InMemoryJobDestination()

job = FeaturizationJob(
    name = 'job',
    version = 'v1',
    source = source,
    featurizers = {
        'passengers': MyDataFrameFeaturizer()
    },
    destination = destination
)

job.run()
destination.buffer['passengers'][0].sort_values('Age').head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
597,678,1,3,"Turja, Miss. Anna Sofia",female,18.0,0,0,4138,9.8417,NONE,S
728,835,0,3,"Allum, Mr. Owen George",male,18.0,0,0,2223,8.3,NONE,S
707,808,0,3,"Pettersson, Miss. Ellen Natalia",female,18.0,0,0,347087,7.775,NONE,S
509,586,1,1,"Taussig, Miss. Ruth",female,18.0,0,2,110413,79.65,E68,S
332,386,0,2,"Davies, Mr. Charles Henry",male,18.0,0,0,S.O.C. 14879,73.5,NONE,S


Here we have created a special class just for this particular dataset, so we don't need to pass the `selector` to it.

In the `_featurize` method we process a given data object in an arbitrary way and return a list of rows. An empty list drops the object, and a longer list produces several rows from it.

In `_postprocess` we may perform additional operations on the dataframe. Here we have imputed the values of the `Cabin` field, but **do not do this** in real projects: imputation belongs to the machine learning part of the pipeline, not to data cleaning.
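The second case from the list at the start of this section, where one data object expands into several rows, follows the same "return a list of rows" contract. The sketch below shows only that contract in plain Python. The function `featurize_cabin`, the layout of the cabin objects, and the cabin id `X0` are made up for illustration and are not taken from TG or from the dataset.

```python
def featurize_cabin(cabin):
    """Turn one cabin object into a list of passenger rows (possibly empty)."""
    # Hypothetical function and data layout, used only to illustrate 1-to-many expansion
    return [
        {'Cabin': cabin['id'], 'Name': passenger['Name'], 'Age': passenger['Age']}
        for passenger in cabin['passengers']
    ]

# Toy flow of cabins: the first holds two passengers, the second is empty
toy_cabins = [
    {'id': 'C123', 'passengers': [
        {'Name': 'Futrelle, Mrs. Jacques Heath', 'Age': 35.0},
        {'Name': 'Futrelle, Mr. Jacques Heath', 'Age': 37.0},
    ]},
    {'id': 'X0', 'passengers': []},
]

# Flatten the per-cabin lists into one sequence of rows
expanded_rows = [row for cabin in toy_cabins for row in featurize_cabin(cabin)]
```

The same logic could live inside a `_featurize` override, since that method is likewise expected to return a list of rows.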

## Aggregating 

Sometimes we are not really interested in the dataframe as it is, but want to compute some aggregated statistics and use them as features. In our case, we may wish to compute the average fare and age for each cabin. (_Note that this would be an awful idea in reality, as it would leak data from test to train_).

In this case, we need to step back in our inheritance hierarchy and use the `StreamFeaturizer` class.

In [6]:
from tg.common.datasets.featurization import StreamFeaturizer

class CabinStatisticsFeaturizer(StreamFeaturizer):
    def start(self):
        self.cabins = {}
    
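    def observe_data_point(self, item):
        cabin = item['Cabin']
        # Skip passengers without a cabin, fare or age. Missing values read from
        # a dataframe are NaN rather than None, so test them with pd.isnull
        if not isinstance(cabin, str) or pd.isnull(item['Fare']) or pd.isnull(item['Age']):
            return
        if cabin not in self.cabins:
            self.cabins[cabin] = dict(count=0, age=0, fare=0, id=cabin)
        # Accumulate running totals; they are turned into averages in finish()
        self.cabins[cabin]['count']+=1
        self.cabins[cabin]['age']+=item['Age']
        self.cabins[cabin]['fare']+=item['Fare']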
        
    def finish(self):
        df = pd.DataFrame(list(self.cabins.values()))
        df.age = df.age/df['count']
        df.fare = df.fare/df['count']
        return df.set_index('id')
        
destination = InMemoryJobDestination()

job = FeaturizationJob(
    name = 'job',
    version = 'v1',
    source = source,
    featurizers = {
        'passengers': MyDataFrameFeaturizer(),
        'cabins': CabinStatisticsFeaturizer()
    },
    destination = destination
)

job.run()
destination.buffer['cabins'][0].head()

id,count,age,fare
C85,1,38.0,71.2833
C123,2,36.0,53.1
E46,1,54.0,51.8625
G6,4,14.75,13.58125
C103,1,58.0,26.55


Note that the passengers dataset is produced as well:

In [7]:
destination.buffer['passengers'][0].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,NONE,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,NONE,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,NONE,S


So it is possible, and often very useful, to build several datasets in a single run over the source. In our experience so far, retrieving data from the source has been the most time-consuming part of featurization, so this _really_ saves time.
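Why a single run saves time can be seen in a small sketch of a driver loop that reads each data object once and hands it to every consumer. The method names `start`, `observe_data_point` and `finish` follow the `StreamFeaturizer` example above. The driver `run_single_pass` and the two toy consumers are assumptions for illustration, and this is not how `FeaturizationJob` is necessarily implemented.

```python
class CountingConsumer:
    """Toy consumer that counts the observed data objects."""
    def start(self):
        self.count = 0

    def observe_data_point(self, item):
        self.count += 1

    def finish(self):
        return self.count

class MaxFareConsumer:
    """Toy consumer that tracks the largest fare seen."""
    def start(self):
        self.max_fare = None

    def observe_data_point(self, item):
        fare = item['Fare']
        if self.max_fare is None or fare > self.max_fare:
            self.max_fare = fare

    def finish(self):
        return self.max_fare

def run_single_pass(source_rows, consumers):
    """Hypothetical driver: read the source once and feed every consumer."""
    for consumer in consumers.values():
        consumer.start()
    reads = 0
    for item in source_rows:
        reads += 1  # each data object is fetched from the source exactly once
        for consumer in consumers.values():
            consumer.observe_data_point(item)
    results = {name: consumer.finish() for name, consumer in consumers.items()}
    return results, reads

# An iterator can be consumed only once, like an expensive remote source
toy_source = iter([{'Fare': 7.25}, {'Fare': 71.2833}, {'Fare': 7.925}])
results, reads = run_single_pass(
    toy_source,
    {'count': CountingConsumer(), 'max_fare': MaxFareConsumer()},
)
```

Adding another consumer adds only in-memory work per object. The number of reads from the source stays the same.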

## Summary

In this demo, we covered a higher level of abstraction in the featurization process: the `FeaturizationJob` class, which encapsulates the data source and the featurizers, and controls the overall featurization process. The only remaining question is how to run this job on a remote server, and this will be answered in the next demo.