# Part 1. Featurization, Sources and Selectors

## Overview

In this demo, we cover the featurization of data: essentially, how one item from the database is transformed into one row of a tidy dataset. Training Grounds (TG) offers the following workflow for that:

- Data are retrieved from a *data source* in the format of a *data objects flow* (DOF).
  *Data objects* are non-relational JSONs, typically huge
  and containing nested lists and dictionaries.
  *Flow* means that the objects are not placed in memory all at once,
  but are accessible through a Python iterator.
  `yo_fluq_ds` is employed for more functionality (https://pypi.org/project/yo-fluq-ds/).
- The data can be cached locally in zipped pickle format in order
  to speed up access while prototyping. This cache file can be read in the same DOF format
  as the original data source, so they are mutually interchangeable.
- *Selectors* are applied to the DOF. A selector is an object that behaves as a *pure function*
  (in the functional programming sense) and converts a data object into a tidy row.
- The rows are assembled in a `pandas` dataframe
  and stored in `parquet` format in one or multiple files.

## Data sources

In this section, we will learn how to access data in the DOF format. The fundamental class here is `DataSource` with its `get_data` method, which returns data in the form of a DOF. `DataSource` is a necessary abstraction that hides the details of how the data are actually stored: be it a relational database, AWS S3 storage or simply a file, as long as the data can be represented as a flow of data objects, you will be able to use it in your project. If the data storage changes, you can adapt by replacing the `DataSource` implementation and keeping the rest of the featurization process intact. Typically, you need to implement your own `DataSource` subclasses for the storages in your environment.

In this demo, we will work with the well-known Titanic dataset, which is stored in the local folder as a `csv` file. Of course, it already contains all the data in the tidy format, but for the sake of the demonstration we will distort this format, and then restore the tidiness again with the TG-pipeline. 

The first step is to write your own `DataSource` class that makes the Titanic dataset available as a DOF.

In [1]:
from yo_fluq_ds import Query, Queryable
from tg.common.datasets.access import DataSource
import pandas as pd

class CsvDataSource(DataSource):
    def __init__(self, filename):
        self.filename = filename

    def _get_data_iter(self):
        df = pd.read_csv(self.filename)
        for row in df.iterrows():
            d = row[1].to_dict()
            yield  {
                'id': d['PassengerId'],
                'ticket': {
                    'ticket.id': d['Ticket'],
                    'fare': d['Fare'],
                    'Pclass': d['Pclass']
                },
                'passenger': {
                    'Name': d['Name'],
                    'Sex': d['Sex'],
                    'Age': d['Age']
                },
                'trip': {
                    'Survived': d['Survived'],
                    'SibSp': d['SibSp'],
                    'Patch': d['Parch'],
                    'Cabin': d['Cabin'],
                    'Embarked' : d['Embarked']
                    
                }
            }
            
    def get_data(self) -> Queryable:
        return Query.en(self._get_data_iter())
    
source = CsvDataSource('titanic.csv')

for row in source.get_data():
    print(row)
    break

{'id': 1, 'ticket': {'ticket.id': 'A/5 21171', 'fare': 7.25, 'Pclass': 3}, 'passenger': {'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22.0}, 'trip': {'Survived': 0, 'SibSp': 1, 'Patch': 0, 'Cabin': nan, 'Embarked': 'S'}}


Here `_get_data_iter` creates an iterator that yields objects one after another. In `get_data`, we simply wrap this iterator in the `Queryable` type from `yo_fluq_ds`. It is still an iterator, so we can iterate over it, as the `for` loop demonstrates.
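
The "flow" behaviour can be illustrated without TG at all. The sketch below uses only the standard library; `fake_source` is a hypothetical stand-in for a data source, not part of TG or `yo_fluq_ds`. It shows that a generator builds objects only when they are requested:

```python
from itertools import islice

produced = []  # records which objects were actually built

def fake_source(n):
    """Hypothetical stand-in for a data source: yields nested dicts lazily."""
    for i in range(1, n + 1):
        produced.append(i)
        yield {'id': i, 'passenger': {'Age': 20.0 + i}}

flow = fake_source(1_000_000)         # nothing is produced yet
first_three = list(islice(flow, 3))   # only three objects are built

print([obj['id'] for obj in first_three])
print(len(produced))
```

Although the source could yield a million objects, only the three that were requested are ever created.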

The `Queryable` class contains a variety of methods for easy-to-write data processing, which are a Python port of the LINQ technology in C#. The methods are explained in full detail at https://pypi.org/project/yo-fluq-ds/. Access to the DOF in `Queryable` format allows you to quickly perform exploratory data analysis. As an example, consider the following code:

In [2]:
(source
 .get_data()
 .where(lambda z: z['passenger']['Sex']=='male')
 .order_by(lambda z: z['passenger']['Age'])
 .select(lambda z: z['ticket'])
 .take(3)
 .to_dataframe()
)

| | ticket.id | fare | Pclass |
|---|---|---|---|
| 0 | 2625 | 8.5167 | 3 |
| 1 | 250649 | 14.5 | 2 |
| 2 | 248738 | 29.0 | 2 |


The meaning is self-evident: filter by `Sex`, order by `Age`, select the `ticket` information from the records, then take 3 of them as a pandas `DataFrame`.
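
For readers unfamiliar with LINQ-style chains, the same sequence of steps can be written with plain Python built-ins. This is only an illustration on a small made-up list of records; it does not use `yo_fluq_ds` and makes no claim about how that library implements its methods:

```python
from itertools import islice

# A small made-up sample in the same nested shape as the data objects
records = [
    {'passenger': {'Sex': 'male', 'Age': 22.0}, 'ticket': {'fare': 7.25}},
    {'passenger': {'Sex': 'female', 'Age': 38.0}, 'ticket': {'fare': 71.28}},
    {'passenger': {'Sex': 'male', 'Age': 2.0}, 'ticket': {'fare': 21.07}},
    {'passenger': {'Sex': 'male', 'Age': 54.0}, 'ticket': {'fare': 51.86}},
]

males = (r for r in records if r['passenger']['Sex'] == 'male')   # where
by_age = sorted(males, key=lambda r: r['passenger']['Age'])       # order_by
tickets = (r['ticket'] for r in by_age)                           # select
youngest_two = list(islice(tickets, 2))                           # take

print(youngest_two)
```

Filtering and selecting stay lazy here, while `sorted` necessarily reads its whole input before producing the first element.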

Quite often we want to make the data available offline, so that it can be accessed faster and does not create load on the external server. The typical use cases are:

* Exploratory data analysis
* Functional tests in your service: these tests often use the real data, and it's impractical to wait each time for this data to be delivered.
* Debugging your services: most data services download some data at startup, and to speed up the startup when debugging on a local machine, it is helpful to create a cache.

To make a data source cacheable, create a wrapper:

In [3]:
from tg.common.datasets.access import ZippedFileDataSource, CacheableDataSource

cacheable_source = CacheableDataSource(
    inner_data_source = source,
    file_data_source = ZippedFileDataSource(path='./temp/titanic')
)

`CacheableDataSource` is still a `DataSource` and can be called directly. In this case, the original source will be called.

In [4]:
cacheable_source.get_data().first()

{'id': 1,
 'ticket': {'ticket.id': 'A/5 21171', 'fare': 7.25, 'Pclass': 3},
 'passenger': {'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22.0},
 'trip': {'Survived': 0,
  'SibSp': 1,
  'Patch': 0,
  'Cabin': nan,
  'Embarked': 'S'}}

However, we can also access data this way:

In [5]:
cacheable_source.safe_cache('default').get_data().first()

{'id': 1,
 'ticket': {'ticket.id': 'A/5 21171', 'fare': 7.25, 'Pclass': 3},
 'passenger': {'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22.0},
 'trip': {'Survived': 0,
  'SibSp': 1,
  'Patch': 0,
  'Cabin': nan,
  'Embarked': 'S'}}

`safe_cache` accepts the following modes:
* `default`: `safe_cache` creates the cache in the `path` folder provided to `ZippedFileDataSource` if it does not exist, and then reads from it.
* `use`: an error is thrown if the cache does not exist locally.
* `no`: the underlying source is called directly; the cache is neither created nor used.

So, when developing, we can use caches to save time, and when deploying, disable caching with a simple change of the argument.
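
The logic of the three modes can be sketched in a few lines of standard-library Python. The helper `read_with_cache` below is hypothetical and is not TG's implementation; it only mirrors the behaviour described above, using a single pickle file as the cache:

```python
import os
import pickle
import tempfile

def read_with_cache(mode, cache_file, fetch):
    """Hypothetical helper mimicking the 'default', 'use' and 'no' modes."""
    if mode not in ('default', 'use', 'no'):
        raise ValueError(f'Unknown mode: {mode}')
    if mode == 'no':
        return fetch()                        # bypass the cache entirely
    if not os.path.exists(cache_file):
        if mode == 'use':
            raise FileNotFoundError(f'No cache at {cache_file}')
        with open(cache_file, 'wb') as f:     # 'default': build the cache once
            pickle.dump(fetch(), f)
    with open(cache_file, 'rb') as f:
        return pickle.load(f)

calls = []

def fetch():
    calls.append(1)                           # count hits on the "remote" source
    return [{'id': 1}, {'id': 2}]

cache_file = os.path.join(tempfile.mkdtemp(), 'cache.pkl')
read_with_cache('default', cache_file, fetch)   # creates the cache
read_with_cache('default', cache_file, fetch)   # served from disk
read_with_cache('use', cache_file, fetch)       # served from disk
print(len(calls))
```

The underlying source is hit only once, no matter how many times the cached data are read afterwards.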

The created cache file is a zipped folder of files containing pickled data separated into bins. Normally, you don't need to adjust the bin size; increasing it increases both performance and memory consumption. Theoretically, you may use another format by implementing your own class instead of `ZippedFileDataSource`. However, this is not recommended: the current format is the result of comparative research, and other, more obvious ways of caching (caching everything in one file, or caching each object in an individual file) perform much slower.
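
To make the idea of bins concrete, here is a minimal standard-library sketch of a binned cache. The file naming, the `write_bins` and `read_bins` helpers and the archive layout are assumptions made for illustration; the actual layout used by `ZippedFileDataSource` may differ:

```python
import os
import pickle
import tempfile
import zipfile

def write_bins(objects, zip_path, bin_size):
    """Write objects to a zip archive, `bin_size` pickled objects per entry."""
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        buffer, index = [], 0
        for obj in objects:
            buffer.append(obj)
            if len(buffer) == bin_size:
                zf.writestr(f'{index}.pkl', pickle.dumps(buffer))
                buffer, index = [], index + 1
        if buffer:                            # last, incomplete bin
            zf.writestr(f'{index}.pkl', pickle.dumps(buffer))

def read_bins(zip_path):
    """Yield objects one by one; only one bin is unpickled at a time."""
    with zipfile.ZipFile(zip_path) as zf:
        names = sorted(zf.namelist(), key=lambda n: int(n.split('.')[0]))
        for name in names:
            yield from pickle.loads(zf.read(name))

zip_path = os.path.join(tempfile.mkdtemp(), 'cache.zip')
write_bins(({'id': i} for i in range(10)), zip_path, bin_size=4)

with zipfile.ZipFile(zip_path) as zf:
    bin_names = zf.namelist()
restored = [obj['id'] for obj in read_bins(zip_path)]
print(bin_names)
print(restored)
```

A larger `bin_size` means fewer archive entries to open but more objects held in memory per bin, which matches the trade-off described above.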

## Selectors

### Basics

Selectors are objects that:
* define a pure function that transforms a data object into a row in the dataset
* track errors and warnings that happen during this conversion
* fully maintain the inner structure of selectors, making it possible to e.g. visualize the selector

**Note**: selectors are slow! They are not designed for processing hundreds of gigabytes of data. If this use case arises:
* they can potentially be parallelized in PySpark
* they can potentially be partially translated into e.g. PrestoSQL queries, since they maintain the inner structure

For demonstration purposes, we will take one object from our data source (the twelfth one) and process it with various selectors.

In [6]:
obj = source.get_data().skip(11).first()
obj

{'id': 12,
 'ticket': {'ticket.id': '113783', 'fare': 26.55, 'Pclass': 1},
 'passenger': {'Name': 'Bonnell, Miss. Elizabeth',
  'Sex': 'female',
  'Age': 58.0},
 'trip': {'Survived': 1,
  'SibSp': 0,
  'Patch': 0,
  'Cabin': 'C103',
  'Embarked': 'S'}}

Let's start with simply selecting one field:

In [7]:
from tg.common.datasets.selectors import Selector

selector = (Selector()
            .select('id')
            )

selector(obj)

{'id': 12}

`Selector` is a high-level abstraction that allows you to define the featurization function through a fluent API. `Selector` builds a complex object of interconnected smaller processors, and we will look at these processors a little later. For now, we consider `Selector` on a purely syntactic level: how exactly this or that use case can be covered with it.

We can rename the field as follows:

In [8]:
selector = (Selector()
            .select(passenger_id = 'id')
            )

selector(obj)

{'passenger_id': 12}

We can select nested fields using several syntax options:

In [9]:
from tg.common.datasets.selectors import Selector, FieldGetter, Pipeline

selector = (Selector()
            .select(
                'passenger.Name',
                ['passenger','Age'],
                ticket_id = ['ticket',FieldGetter('ticket.id')],
                sex = Pipeline(FieldGetter('passenger'), FieldGetter('Sex'))
            ))
selector(obj)

{'ticket_id': '113783',
 'sex': 'female',
 'Name': 'Bonnell, Miss. Elizabeth',
 'Age': 58.0}

* the first one (for `Name`) represents the highest level of abstraction; it is very easy to define lots of fields for selection in this way.
* the second one (for `Age`) shows that arrays can be used instead of dotted names. The elements of the array are applied sequentially to the input. In this particular case the array consists of two strings, and strings are used as keys to extract values from dictionaries. Therefore, `passenger` is first extracted from the top-level dictionary, and then `Age` is extracted from `passenger`.
* the third one (for `ticket.id`) is the only way to access fields with the symbol `.` in their name. `FieldGetter` is one of the aforementioned small processors: it processes the given object by extracting an element from a dictionary, or a field from a Python object.
* the fourth one (for `Sex`) fully represents how selection works under the hood: it is a sequential application (`Pipeline`) of two `FieldGetter` objects. The arrays for `Age` and `ticket.id` are likewise converted to a `Pipeline` under the hood.

The best practice is to use the first method wherever possible, and the third one in other cases.
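
The relationship between dotted names, arrays and pipelines of getters can be sketched in plain Python. The functions `field_getter`, `pipeline` and `from_address` below are hypothetical simplifications written for this illustration; they are not TG's classes, and in this sketch strings inside a list are used as keys as-is:

```python
def field_getter(name):
    """Return a function that extracts `name` from a dict, or None if absent."""
    return lambda obj: None if obj is None else obj.get(name)

def pipeline(*steps):
    """Return a function that applies `steps` one after another."""
    def run(obj):
        for step in steps:
            obj = step(obj)
        return obj
    return run

def from_address(address):
    """Turn 'a.b' or ['a', 'b', func] into a pipeline of getters and functions."""
    parts = address.split('.') if isinstance(address, str) else address
    steps = [field_getter(p) if isinstance(p, str) else p for p in parts]
    return pipeline(*steps)

obj = {
    'ticket': {'ticket.id': '113783'},
    'passenger': {'Name': 'Bonnell, Miss. Elizabeth', 'Age': 58.0},
}

print(from_address('passenger.Name')(obj))
print(from_address(['passenger', 'Age'])(obj))
print(from_address(['ticket', field_getter('ticket.id')])(obj))
print(from_address('trip.Cabin')(obj))   # missing fields yield None
```

A dotted name cannot address `ticket.id` in this sketch either, because splitting on `.` would look for a key `ticket` and then a key `id`; an explicit getter avoids the split.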

If you select several fields from the same nested object, use the `with_prefix` method for optimization:

In [10]:
from tg.common.datasets.selectors import Selector, FieldGetter, Pipeline

selector = (Selector()
            .with_prefix('passenger')
            .select('Name','Age','Sex')
            .select(ticket_id = ['ticket',FieldGetter('ticket.id')])
            )
selector(obj)

{'Name': 'Bonnell, Miss. Elizabeth',
 'Age': 58.0,
 'Sex': 'female',
 'ticket_id': '113783'}

The `with_prefix` method only affects the `select` that immediately follows it.

Often, we need to post-process the values. For instance, the name by itself is not likely to be a feature (and would be GDPR-protected for actual customers, thus making the entire output dataset GDPR-affected, which is better to avoid). However, we can extract the title from the name, as it can indeed be a predictor.

In [11]:
import re

def get_title(name):
    title = re.search(r' ([A-Za-z]+)\.', name).group().strip()[:-1]
    if title in ['Lady', 'Countess','Capt', 'Col',
                 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']:
        return 'Rare'
    elif title == 'Mlle':
        return 'Miss'
    elif title == 'Ms':
        return 'Miss'
    elif title == 'Mme':
        return 'Mrs'
    else:
        return title

string_size = (Selector()
               .select(['passenger','Name', get_title])
              )

string_size(obj)

{'Name': 'Miss'}

What if we want to apply multiple featurizers to the same field? We can use `Ensemble` for that:

In [12]:
from tg.common.datasets.selectors import Ensemble

def get_length(s):
    return len(s)

string_size = (Selector()
               .select(['passenger','Name', Ensemble(title=get_title, length=get_length)])
              )


string_size(obj)

{'Name': {'title': 'Miss', 'length': 24}}
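
Conceptually, an ensemble applies several named functions to the same input and collects the results in a dictionary. The following plain-Python sketch shows the idea; the `ensemble` function is a hypothetical simplification, not TG's `Ensemble` class:

```python
def ensemble(**named_functions):
    """Return a function that applies every named function to the same input."""
    def run(obj):
        return {name: func(obj) for name, func in named_functions.items()}
    return run

name_features = ensemble(length=len, upper=str.upper)
print(name_features('Miss'))

# Because an ensemble is itself a function, ensembles can be nested
combined = ensemble(name=name_features, first_letter=lambda s: s[0])
print(combined('Miss'))
```

The nesting in the second call is what produces nested dictionaries in the output.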

`Ensemble` can also be used to combine several selectors together. When your Data Objects are huge and complicated, it makes more sense to write several smaller selectors, instead of writing one that selects all the fields you have. It's easier to read and reuse this way.

In [13]:
ticket_selector = (Selector()
                   .with_prefix('ticket')
                   .select(
                       'fare',
                       'PClass',
                       id = [FieldGetter('ticket.id')]
                   )
)
passenger_selector = (Selector()
                      .with_prefix('passenger')
                      .select(
                          'Sex',
                          'Age',
                          name=['Name', Ensemble(
                              length=get_length,
                              title=get_title
                          )]
                      ))

combined_selector = Ensemble(
    ticket = ticket_selector,
    passenger = passenger_selector,
)
combined_selector(obj)

{'ticket': {'id': '113783', 'fare': 26.55, 'PClass': None},
 'passenger': {'name': {'length': 24, 'title': 'Miss'},
  'Sex': 'female',
  'Age': 58.0}}

Note that `PClass` is `None` in the output: the selector asks for `PClass`, while the data object stores the key as `Pclass`. Fields are optional by default, so a misspelled key yields `None` rather than an error.

Pipelines, too, can be used for combination purposes. The typical use case is postprocessing: at the first step, we select fields from the initial object, and then, we want to compute some functions from these fields (e.g., we may want to compute BMI for the person from their weight and height). 

In the Titanic example, let's compute the sum of `SibSp` and `Patch` as a new feature, `Relatives`. We will place it into the new `trip_selector` (a selector describing the trip in general, rather than the person or the ticket).

For that, we will use `Pipeline`. The arguments of `Pipeline` are functions that are applied sequentially to the input.

In [14]:
def add_relatives_count(d):
    d['Relatives'] = d['SibSp'] + d['Patch']
    return d

trip_selector = Pipeline(
    Selector()
     .select('id')
     .with_prefix('trip')
     .select('Survived','Cabin','Embarked','SibSp','Patch'),
    add_relatives_count
)

trip_selector(obj)

{'id': 12,
 'Survived': 1,
 'Cabin': 'C103',
 'Embarked': 'S',
 'SibSp': 0,
 'Patch': 0,
 'Relatives': 0}

Now we need to add some finishing touches:
* for a smooth conversion to a dataframe, we need a flat `dict`, not a nested one. TG has a method for that, `flatten_dict`
* we will also insert the current time as the processing time.

In [15]:
from tg.common.datasets.selectors import flatten_dict
import datetime

def add_meta(obj):
    obj['processed'] = datetime.datetime.now()
    return obj

titanic_selector = Pipeline(
    Ensemble(
        passenger = passenger_selector,
        ticket = ticket_selector,
        trip = trip_selector
    ),
    add_meta,
    flatten_dict
)
titanic_selector(obj)

{'passenger_name_length': 24,
 'passenger_name_title': 'Miss',
 'passenger_Sex': 'female',
 'passenger_Age': 58.0,
 'ticket_id': '113783',
 'ticket_fare': 26.55,
 'ticket_PClass': None,
 'trip_id': 12,
 'trip_Survived': 1,
 'trip_Cabin': 'C103',
 'trip_Embarked': 'S',
 'trip_SibSp': 0,
 'trip_Patch': 0,
 'trip_Relatives': 0,
 'processed': datetime.datetime(2021, 4, 8, 14, 3, 34, 267566)}
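
The flattening step can be reproduced in plain Python. The `flatten` function below is a hypothetical sketch, not TG's `flatten_dict`; it assumes that keys are joined with an underscore, as the output above suggests:

```python
def flatten(d, prefix='', separator='_'):
    """Flatten nested dicts into one level, joining keys with `separator`."""
    flat = {}
    for key, value in d.items():
        full_key = f'{prefix}{separator}{key}' if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, separator))
        else:
            flat[full_key] = value
    return flat

nested = {
    'passenger': {'name': {'length': 24, 'title': 'Miss'}, 'Sex': 'female'},
    'ticket': {'fare': 26.55},
    'processed': '2021-04-08',
}
flat = flatten(nested)
print(flat)
```

Top-level scalar values such as `processed` keep their key unchanged, which is why `add_meta` can be applied before flattening.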

### Representation

Selectors always keep their internal structure and thus can be analyzed and represented in different formats. The following code demonstrates how this structure can be retrieved.

In [16]:
from tg.common.datasets.selectors import CombinedSelector
import json

def process_selector(selector):
    if isinstance(selector, CombinedSelector):
        children = selector.get_structure()
        if children is None:
            return selector.__repr__()
        result = {} # {'@type': str(type(selector))}
        for key, value in children.items():
            result[key] = process_selector(value)
        return result
    return selector.__repr__()


representation = process_selector(titanic_selector)

print(json.dumps(representation, indent=1)[:300]+"...")
            

{
 "0": {
  "passenger": {
   "0": {
    "0": {
     "0": "[?passenger]"
    },
    "1": {
     "name": {
      "0": "[?Name]",
      "1": {
       "length": "<function get_length at 0x7f020f803200>",
       "title": "<function get_title at 0x7f023c7833b0>"
      }
     },
     "Sex": {
      "0": "...


So far, we have not found a format that is both readable and sufficiently representative, so we encourage you to explore and extend the representation code, adding the fields you need for effective debugging.

### Error handling

Sometimes selectors throw an error while processing an object. They provide a powerful tracing mechanism to find the cause of the error both in their complicated structure and in the original piece of data.

Let us create an erroneous object for processing. The `Name` field, which is normally a string, will be replaced with an integer value.

In [17]:
err_obj = source.get_data().first()
err_obj['passenger']['Name'] = 0
err_obj

{'id': 1,
 'ticket': {'ticket.id': 'A/5 21171', 'fare': 7.25, 'Pclass': 3},
 'passenger': {'Name': 0, 'Sex': 'male', 'Age': 22.0},
 'trip': {'Survived': 0,
  'SibSp': 1,
  'Patch': 0,
  'Cabin': nan,
  'Embarked': 'S'}}

In [18]:
from tg.common.datasets.selectors import SelectorException
exception = None
try:
    titanic_selector(err_obj)
except SelectorException as ex:
    exception = ex
    
print(exception.context.original_object)
print(exception.context.get_data_path())
print(exception.context.get_code_path())


{'id': 1, 'ticket': {'ticket.id': 'A/5 21171', 'fare': 7.25, 'Pclass': 3}, 'passenger': {'Name': 0, 'Sex': 'male', 'Age': 22.0}, 'trip': {'Survived': 0, 'SibSp': 1, 'Patch': 0, 'Cabin': nan, 'Embarked': 'S'}}
[?passenger].[?Name]
/0/passenger/0/1/name/1/length:get_length


Selectors are usually applied to long sequences of data, which may not be reproducible. It is therefore wise to wrap your featurization in a try-except block and store the exception on the hard drive, so that you can later build a test case with `original_object`.
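
A minimal sketch of this practice is given below. It uses only the standard library and a hypothetical `name_length` function in place of a real selector; with TG, one would presumably catch `SelectorException` and store `exception.context.original_object` instead. JSON is used for readability and assumes that the objects are JSON-serializable:

```python
import json
import os
import tempfile

def featurize_all(objects, selector, failures_dir):
    """Apply `selector` to every object; dump objects that fail as JSON files."""
    rows = []
    for index, obj in enumerate(objects):
        try:
            rows.append(selector(obj))
        except Exception as ex:   # a broad catch, acceptable for a sketch only
            path = os.path.join(failures_dir, f'failure_{index}.json')
            with open(path, 'w') as f:
                json.dump({'error': repr(ex), 'object': obj}, f)
    return rows

def name_length(obj):
    """Hypothetical selector used for the demonstration."""
    return {'length': len(obj['Name'])}

failures_dir = tempfile.mkdtemp()
rows = featurize_all([{'Name': 'Braund'}, {'Name': 0}], name_length, failures_dir)
print(rows)
print(os.listdir(failures_dir))
```

The stored file contains everything needed to turn the failing object into a regression test later.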

`get_data_path()` returns the string representation of the path inside the data where the error has occurred: somewhere around `obj['passenger']['Name']`. The symbol `?` means that these fields are optional, and _all fields_ are optional by default. If you want a selector that raises an exception when a field does not exist, pass `True` to the `Selector` constructor.

`get_code_path()` describes where the error occurred within the hierarchy of selectors. By looking at this string, we can easily figure out that the error occurred somewhere around processing `name` with the `get_length` function. If a deeper analysis is required, we may use the `representation` object we have previously built:

In [19]:
representation[0]['passenger'][0]

{0: {0: '[?passenger]'},
 1: {'name': {0: '[?Name]',
   1: {'length': '<function get_length at 0x7f020f803200>',
    'title': '<function get_title at 0x7f023c7833b0>'}},
  'Sex': {0: '[?Sex]'},
  'Age': {0: '[?Age]'}}}

Here we input the beginning of `get_code_path()` output and see the closer surroundings of the error.
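
Following the code path by hand can be automated. The `walk` helper below is a hypothetical sketch; it assumes the path format seen in the single example above (segments separated by `/`, numeric segments being integer keys, and an optional `:function` suffix) and uses a shortened copy of the representation:

```python
def walk(representation, code_path, depth):
    """Follow the first `depth` segments of a code path inside a nested dict."""
    node = representation
    segments = [s for s in code_path.split('/') if s]
    for segment in segments[:depth]:
        key = segment.split(':')[0]        # drop a ':function' suffix, if any
        node = node[int(key) if key.isdigit() else key]
    return node

# A shortened copy of the representation shown above
representation = {0: {'passenger': {0: {
    0: {0: '[?passenger]'},
    1: {'name': {0: '[?Name]',
                 1: {'length': 'get_length', 'title': 'get_title'}}},
}}}}

path = '/0/passenger/0/1/name/1/length:get_length'
print(walk(representation, path, 5))   # surroundings of the error
print(walk(representation, path, 7))   # the failing leaf itself
```

Varying `depth` lets you zoom in on the failing processor or step back to see its neighbours.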

## List featurization

In this section we will consider building features for a list of objects. This use case is rather rare; one example is building features for a customer based on the articles they have purchased in the past.

In the Titanic setup, imagine that:
1. Our task is actually to produce features for cabins, not for passengers.
2. Our DOF is also a flow of cabins, not passengers.

So one object looks like this:

In [20]:
cabin_obj = source.get_data().where(lambda z: z['trip']['Cabin']=='F2').to_list()
cabin_obj

[{'id': 149,
  'ticket': {'ticket.id': '230080', 'fare': 26.0, 'Pclass': 2},
  'passenger': {'Name': 'Navratil, Mr. Michel ("Louis M Hoffman")',
   'Sex': 'male',
   'Age': 36.5},
  'trip': {'Survived': 0,
   'SibSp': 0,
   'Patch': 2,
   'Cabin': 'F2',
   'Embarked': 'S'}},
 {'id': 194,
  'ticket': {'ticket.id': '230080', 'fare': 26.0, 'Pclass': 2},
  'passenger': {'Name': 'Navratil, Master. Michel M',
   'Sex': 'male',
   'Age': 3.0},
  'trip': {'Survived': 1,
   'SibSp': 1,
   'Patch': 1,
   'Cabin': 'F2',
   'Embarked': 'S'}},
 {'id': 341,
  'ticket': {'ticket.id': '230080', 'fare': 26.0, 'Pclass': 2},
  'passenger': {'Name': 'Navratil, Master. Edmond Roger',
   'Sex': 'male',
   'Age': 2.0},
  'trip': {'Survived': 1,
   'SibSp': 1,
   'Patch': 1,
   'Cabin': 'F2',
   'Embarked': 'S'}}]

We want to build the following features for this `cabin_obj`: the average fare and age of the passengers. 

To build such aggregated selectors, the following practice is recommended:
1. build a `Selector` that selects the fields, and apply it to the list, building list of dictionaries
2. convert list of dictionaries into dictionary of lists
3. apply averager to each list.

Let's first do it step-by-step. `Listwise` applies arbitrary function (e.g., your selector) to the elements of the list:

In [21]:
from tg.common.datasets.selectors import Listwise, Dictwise, transpose_list_of_dicts_to_dict_of_lists

cabin_features_selector = (Selector()
                           .select('passenger.Age','ticket.fare')
                          )
list_of_dicts = Listwise(cabin_features_selector)(cabin_obj)
list_of_dicts

[{'Age': 36.5, 'fare': 26.0},
 {'Age': 3.0, 'fare': 26.0},
 {'Age': 2.0, 'fare': 26.0}]

`transpose_list_of_dicts_to_dict_of_lists` makes the "transposition" of list of dicts into dict of lists.

In [22]:
dict_of_lists = transpose_list_of_dicts_to_dict_of_lists(list_of_dicts)
dict_of_lists

{'Age': [36.5, 3.0, 2.0], 'fare': [26.0, 26.0, 26.0]}
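
The transposition itself is a small operation. A plain-Python sketch of it (a hypothetical `transpose` function, not TG's implementation) could look as follows:

```python
def transpose(list_of_dicts):
    """Convert a list of dicts into a dict of lists, key by key."""
    result = {}
    for row in list_of_dicts:
        for key, value in row.items():
            result.setdefault(key, []).append(value)
    return result

rows = [
    {'Age': 36.5, 'fare': 26.0},
    {'Age': 3.0, 'fare': 26.0},
    {'Age': 2.0, 'fare': 26.0},
]
transposed = transpose(rows)
print(transposed)
```

This sketch assumes that every row has the same keys; if a key is missing in some rows, the resulting lists will have different lengths.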

Finally, `Dictwise` applies a function to the elements of a dictionary:

In [23]:
import numpy as np

Dictwise(np.mean)(dict_of_lists)

{'Age': 13.833333333333334, 'fare': 26.0}

If you need a more complicated logic, such as applying different functions to different fields, you will need to extend `Dictwise` class.
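
The interface for extending `Dictwise` is not shown here, so the sketch below illustrates only the idea of per-field aggregation with a plain function. The `aggregate` helper and its parameters are hypothetical and use the standard library instead of `numpy`:

```python
from statistics import mean

def aggregate(dict_of_lists, aggregators, default=mean):
    """Apply a per-key aggregator, falling back to `default` for other keys."""
    return {
        key: aggregators.get(key, default)(values)
        for key, values in dict_of_lists.items()
    }

dict_of_lists = {'Age': [36.5, 3.0, 2.0], 'fare': [26.0, 26.0, 26.0]}
aggregated = aggregate(dict_of_lists, {'Age': max})
print(aggregated)
```

Here the oldest passenger's age is taken for `Age`, while `fare` falls back to the mean.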

All we have to do now is assemble the steps into a pipeline. Since we have used this pipeline several times in our use cases, it is standardized in the following class:

In [24]:
from tg.common.datasets.selectors import ListFeaturizer

selector = ListFeaturizer(cabin_features_selector, np.mean)
selector(cabin_obj)

{'Age': 13.833333333333334, 'fare': 26.0}

### Quick dataset creation

The combination of a `DataSource` and a selector allows you to quickly build the tidy dataset:

In [25]:
cacheable_source.safe_cache('default').get_data().take(10).select(titanic_selector).to_dataframe()

| | passenger_name_length | passenger_name_title | passenger_Sex | passenger_Age | ticket_id | ticket_fare | ticket_PClass | trip_id | trip_Survived | trip_Cabin | trip_Embarked | trip_SibSp | trip_Patch | trip_Relatives | processed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | Mr | male | 22.0 | A/5 21171 | 7.25 | | 1 | 0 | | S | 1 | 0 | 1 | 2021-04-08 14:03:34.427351 |
| 1 | 51 | Mrs | female | 38.0 | PC 17599 | 71.2833 | | 2 | 1 | C85 | C | 1 | 0 | 1 | 2021-04-08 14:03:34.427597 |
| 2 | 22 | Miss | female | 26.0 | STON/O2. 3101282 | 7.925 | | 3 | 1 | | S | 0 | 0 | 0 | 2021-04-08 14:03:34.427831 |
| 3 | 44 | Mrs | female | 35.0 | 113803 | 53.1 | | 4 | 1 | C123 | S | 1 | 0 | 1 | 2021-04-08 14:03:34.428067 |
| 4 | 24 | Mr | male | 35.0 | 373450 | 8.05 | | 5 | 0 | | S | 0 | 0 | 0 | 2021-04-08 14:03:34.428298 |
| 5 | 16 | Mr | male | | 330877 | 8.4583 | | 6 | 0 | | Q | 0 | 0 | 0 | 2021-04-08 14:03:34.428528 |
| 6 | 23 | Mr | male | 54.0 | 17463 | 51.8625 | | 7 | 0 | E46 | S | 0 | 0 | 0 | 2021-04-08 14:03:34.428757 |
| 7 | 30 | Master | male | 2.0 | 349909 | 21.075 | | 8 | 0 | | S | 3 | 1 | 4 | 2021-04-08 14:03:34.428986 |
| 8 | 49 | Mrs | female | 27.0 | 347742 | 11.1333 | | 9 | 1 | | S | 0 | 2 | 2 | 2021-04-08 14:03:34.429218 |
| 9 | 35 | Mrs | female | 14.0 | 237736 | 30.0708 | | 10 | 1 | | C | 1 | 0 | 1 | 2021-04-08 14:03:34.429464 |


Featurizing bigger datasets, however, adds some complications (memory control and storing the output), and will be covered in the next demo.

## Summary

In this demo, we have covered two topics:
* How to access the storage and retrieve the data elements one by one
* How to transform each of them into a row of the tidy dataset.

By simply combining these steps, we can already featurize the Titanic dataset and build a `DataFrame`.