<cite>Darryl Oatridge, August 2022</cite>

In [1]:
import os

In [2]:
os.environ['HADRON_PM_PATH'] = '0_hello_meta/demo/contracts'
os.environ['HADRON_DEFAULT_PATH'] = '0_hello_meta/demo/data'

## Feature Engineering
This new component works in exactly the same way as the selection component: we create an instance for our purpose, give it a source to retrieve data from and a location to persist the results, and then add the component intent. In this case the intent is to engineer the features we have selected and make them suitable for a machine learning model or for further investigation.

The component that contains the feature engineering intent is called `Wrangle`.

In [3]:
from ds_discovery import Wrangle, Transition

In [4]:
# get the instance
wr = Wrangle.from_env('hello_wr', has_contract=False)

For the source, we want to retrieve the outcome of the previous select component, as this contains the selected features of interest. To do so we need to access the select component's Domain Contract which, remember, holds all the knowledge for any component. As this is a common thing to do, there is a first-class method, `get_persist_contract`, that can be called directly.

To retrieve the name of the source we are interested in, we reload the previous component, `Transition`, giving it the unique name we used when creating the select component, in this case `hello_tr`. This loads the select component's Domain Contract, and `get_persist_contract` then returns the string value of the outcome of that select component.

In [5]:
source = Transition.from_env('hello_tr').get_persist_contract()
wr.set_source_contract(source)
wr.set_persist()

As a check we can run the canonical report and see that we have loaded the output of the previous component (Transition component) into the current source.  



In [6]:
df = wr.load_source_canonical()

In [7]:
wr.canonical_report(df, stylise=False)

Unnamed: 0,Attributes (10),dType,%_Null,%_Dom,Count,Unique,Observations
0,age,float64,0.201,0.201,1309,99,max=80.0 | min=0.1667 | mean=29.88 | dominant=24.0
1,cabin,string,0.775,0.775,1309,187,Sample: C23 C25 C27 | G6 | B57 B59 B63 B66 | F4 | F33
2,embarked,category,0.0,0.698,1309,4,Sample: S | C | Q | nan
3,fare,float64,0.001,0.046,1309,282,max=512.3292 | min=0.0 | mean=33.3 | dominant=8.05
4,parch,category,0.0,0.765,1309,8,Sample: 0 | 1 | 2 | 3 | 4
5,pclass,category,0.0,0.542,1309,3,Sample: 3 | 1 | 2
6,sex,category,0.0,0.644,1309,2,Sample: male | female
7,sibsp,category,0.0,0.681,1309,7,Sample: 0 | 1 | 2 | 4 | 3
8,survived,category,0.0,0.618,1309,2,Sample: 0 | 1
9,ticket,string,0.0,0.008,1309,929,Sample: CA. 2343 | 1601 | CA 2144 | PC 17608 | 347077


### Engineering the Features
As mentioned in the previous component demo, the component's intent methods are not first-class methods but part of the intent_model_class. Therefore, to access the intent, specify the controller instance name, in this case `wr`, and then reference the intent_model_class. To make this easier to remember, the intent_model name has been given the shorter alias `tools`. You can see that all references to the intent actions start with `wr.tools`.

Now that we have the source, we can deal with the feature engineering. As this is for the purpose of demonstration, we are only sampling a small selection of Intent methods. It is well worth looking through the other Intent methods to get to know the full extent of the feature engineering package.

To get started, the column `sibsp`, the number of siblings or spouses a passenger had onboard, and `parch`, the number of parents or children each passenger was travelling with, can be added together to give a new value: the size of each family.

In [8]:
df['family'] = wr.tools.correlate_aggregate(df, headers=['parch', 'sibsp'], agg='sum', column_name='family')
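
The idea behind this aggregation can be shown without the library. The following is a minimal plain-Python sketch, not the `correlate_aggregate` implementation; the passenger records are hypothetical and only illustrate the row-wise sum.

```python
# Hypothetical passenger records; the values are illustrative only.
passengers = [
    {"sibsp": 1, "parch": 0},
    {"sibsp": 0, "parch": 0},
    {"sibsp": 1, "parch": 2},
]

# Family size is the row-wise sum of the two counts.
family = [p["sibsp"] + p["parch"] for p in passengers]
print(family)  # [1, 0, 3]
```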

The column `cabin` records the cabin each passenger was allocated. Taking the first letter of each cabin gives us the deck the passenger was on, which provides a useful categorical feature.

In [9]:
df['deck'] = wr.tools.correlate_custom(df, code_str="@['cabin'].str[0]", column_name='deck')
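
As a rough illustration of what the expression `@['cabin'].str[0]` is doing, here is a plain-Python sketch under the assumption that a missing cabin should stay missing. The helper `deck_from_cabin` and the sample cabins are hypothetical and are not part of `ds_discovery`.

```python
from typing import Optional

def deck_from_cabin(cabin: Optional[str]) -> Optional[str]:
    """Return the first character of a cabin string, or None when it is missing."""
    if not cabin:
        return None
    return cabin[0]

# Hypothetical cabin values, including one missing entry.
cabins = ["C23 C25 C27", "G6", None, "F4"]
decks = [deck_from_cabin(c) for c in cabins]
print(decks)  # ['C', 'G', None, 'F']
```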

We also note that a passenger travelling alone seems to have an improved survival rate. By selecting the rows where `family` is zero and giving them a value of one, with all other rows given a zero, we can create a new column `is_alone` that indicates passengers travelling on their own.

In [10]:
selection = [wr.tools.select2dict(column='family', condition='@==0')]
df['is_alone'] = wr.tools.correlate_selection(df, selection=selection, action=1, default_action=0, column_name='is_alone')
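
The selection logic reduces to "apply an action where a condition holds, otherwise a default". The sketch below shows that idea in plain Python. The function `flag_selection` is a hypothetical stand-in whose parameter names simply mirror the call above; it is not the library's implementation.

```python
def flag_selection(values, condition, action=1, default_action=0):
    """Return `action` where `condition(value)` holds, else `default_action`."""
    return [action if condition(v) else default_action for v in values]

family = [1, 0, 3, 0]  # hypothetical family sizes
is_alone = flag_selection(family, condition=lambda v: v == 0)
print(is_alone)  # [0, 1, 0, 1]
```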

Finally, we ensure each of our new features is appropriately typed as a category. We also want the change to category to run after the newly created columns, so we add the parameter `intent_order` with a value of one.

In [11]:
df = wr.tools.model_to_category(df, headers=['family','deck','is_alone'], intent_order=1, column_name='to_category')

By running the Intent report we can observe the change of order of the intent level.

In [12]:
wr.report_intent()

Unnamed: 0,level,order,intent,parameters,creator
0,deck,0,correlate_custom,"[""code_str='@['cabin'].str[0]'"", ""column_name='deck'"", 'kwargs={}']",doatridge
1,family,0,correlate_aggregate,"[""headers=['parch', 'sibsp']"", ""agg='sum'"", ""column_name='family'""]",doatridge
2,is_alone,0,correlate_selection,"[""selection=[{'column': 'family', 'condition': '@==0'}]"", 'action=1', 'default_action=0', ""column_name='is_alone'""]",doatridge
3,to_category,1,model_to_category,"[""headers=['family', 'deck', 'is_alone']"", ""column_name='to_category'""]",doatridge


## Run Component Pipeline
To run a component we use the common method `run_component_pipeline`, which loads the source data, executes the component task (in this case the component's intent), then persists the results. This is the only method you can use to run the tasks of a component and produce its results, so it is worth becoming familiar with.

At this point we can run the pipeline and see the results of the new features.

In [13]:
wr.run_component_pipeline()

In [14]:
wr.canonical_report(df, stylise=False)

Unnamed: 0,Attributes (13),dType,%_Null,%_Dom,Count,Unique,Observations
0,age,float64,0.201,0.201,1309,99,max=80.0 | min=0.1667 | mean=29.88 | dominant=24.0
1,cabin,string,0.775,0.775,1309,187,Sample: C23 C25 C27 | G6 | B57 B59 B63 B66 | F4 | F33
2,deck,category,0.0,0.775,1309,9,Sample: <NA> | C | B | D | E
3,embarked,category,0.0,0.698,1309,4,Sample: S | C | Q | nan
4,family,category,0.0,0.604,1309,9,Sample: 0 | 1 | 2 | 3 | 5
5,fare,float64,0.001,0.046,1309,282,max=512.3292 | min=0.0 | mean=33.3 | dominant=8.05
6,is_alone,category,0.0,0.604,1309,2,Sample: 1 | 0
7,parch,category,0.0,0.765,1309,8,Sample: 0 | 1 | 2 | 3 | 4
8,pclass,category,0.0,0.542,1309,3,Sample: 3 | 1 | 2
9,sex,category,0.0,0.644,1309,2,Sample: male | female


## Imputation
Imputation is the act of replacing missing data with statistical estimates of the missing values. The goal of any imputation technique is to produce a complete dataset that can be used to train machine learning models. There are three types of missing data:
- Missing Completely at Random (MCAR), where the missingness has nothing to do with any feature, observed or missing
- Missing at Random (MAR), where the missingness can be explained by other, observed features
- Missing Not at Random (MNAR), where the missingness is not random and depends on the missing values themselves
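
To make the three mechanisms concrete, the sketch below builds each kind of missingness on a handful of hypothetical `(pclass, age)` rows, using the usual statistical reading of the terms. The rows, the 0.3 drop rate and the thresholds are assumptions chosen purely for illustration, and the MAR and MNAR rules are deliberately extreme so the pattern is easy to see.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical rows: (pclass, age). Values are illustrative only.
rows = [(1, 38.0), (3, 22.0), (2, 71.0), (3, 65.0), (1, 29.0), (3, 4.0)]

# MCAR: every age has the same chance of being dropped, whatever the row holds.
mcar = [None if rng.random() < 0.3 else age for _, age in rows]

# MAR: whether age is missing depends on another, observed feature (pclass).
mar = [None if pclass == 3 else age for pclass, age in rows]

# MNAR: whether age is missing depends on the unobserved age value itself.
mnar = [None if age > 60 else age for _, age in rows]

print(mar)   # [38.0, None, 71.0, None, 29.0, None]
print(mnar)  # [38.0, 22.0, None, None, 29.0, 4.0]
```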

With `deck` and `fare` we can assume MCAR, but `age` appears to be associated with other features. For the purposes of the demo, however, we will also treat it as MCAR.

With `deck`, the conversion to categorical has already imputed the nulls with the new categorical value `<NA>`, so we do not need to do anything.

In [15]:
df['deck'].value_counts()

<NA>    1014
C         94
B         65
D         46
E         41
A         22
F         21
G          5
T          1
Name: deck, dtype: int64

With `fare` we choose a random number, whereby this number is more likely to fall within a populated area and so preserves the distribution of the data. This works particularly well with a small amount of missing data.

In [16]:
df['fare'] = wr.tools.correlate_missing(df, header='fare', method='random', column_name='fare')
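
One common way to get this behaviour is to fill each gap with a value drawn at random from the observed values, so frequent values are picked more often. The sketch below shows that technique in plain Python. It is an assumption about the general approach, not a description of how `correlate_missing` works internally, and `impute_random` and the sample fares are hypothetical.

```python
import random

def impute_random(values, seed=0):
    """Replace each None with a value drawn at random from the observed values."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to sample from")
    return [rng.choice(observed) if v is None else v for v in values]

fares = [7.25, 8.05, None, 8.05, 71.28, None]  # hypothetical fares
filled = impute_random(fares)
print(filled)
```

Because the draw is taken from the observed values, a fare that occurs twice is twice as likely to be chosen as one that occurs once.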

Age is slightly more tricky, as it has a large proportion of null values. In this instance we will use probability frequency which, like random values, preserves the distribution of the data. Quite often, in these cases, we can add an additional boolean column that tells us which values were generated to replace nulls.

In [17]:
df['age'] = wr.tools.correlate_missing_weighted(df, header='age', granularity=5.0, column_name='age')
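
A frequency-weighted fill can be sketched as follows: bin the observed values, pick a bin with probability proportional to its count, then draw a value inside that bin. This is one plausible reading of the technique and is not taken from the `correlate_missing_weighted` source. The function `impute_weighted`, its use of `granularity` as a bin width, and the sample ages are all assumptions. The sketch also returns the boolean indicator mentioned above.

```python
import math
import random
from collections import Counter

def impute_weighted(values, granularity=5.0, seed=0):
    """Fill None by sampling a bin in proportion to its frequency, then a value inside it.

    Returns the filled list and a parallel list of booleans marking imputed entries.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to sample from")
    # Count how many observed values fall into each bin of width `granularity`.
    bins = Counter(math.floor(v / granularity) for v in observed)
    labels = list(bins)
    weights = [bins[b] for b in labels]
    filled, was_imputed = [], []
    for v in values:
        if v is None:
            # Pick a bin by frequency, then a uniform value within that bin.
            b = rng.choices(labels, weights=weights, k=1)[0]
            v = rng.uniform(b * granularity, (b + 1) * granularity)
            was_imputed.append(True)
        else:
            was_imputed.append(False)
        filled.append(v)
    return filled, was_imputed

ages = [22.0, 24.0, None, 24.0, 58.0, None, 23.0]  # hypothetical ages
filled, was_imputed = impute_weighted(ages)
print(was_imputed)  # [False, False, True, False, False, True, False]
```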

Using the Intent report we can check on the additional intent added.

In [18]:
wr.report_intent()

Unnamed: 0,level,order,intent,parameters,creator
0,age,0,correlate_missing_weighted,"[""header='age'"", 'granularity=5.0', ""column_name='age'""]",doatridge
1,deck,0,correlate_custom,"[""code_str='@['cabin'].str[0]'"", ""column_name='deck'"", 'kwargs={}']",doatridge
2,family,0,correlate_aggregate,"[""headers=['parch', 'sibsp']"", ""agg='sum'"", ""column_name='family'""]",doatridge
3,fare,0,correlate_missing,"[""header='fare'"", ""method='random'"", ""column_name='fare'""]",doatridge
4,is_alone,0,correlate_selection,"[""selection=[{'column': 'family', 'condition': '@==0'}]"", 'action=1', 'default_action=0', ""column_name='is_alone'""]",doatridge
5,to_category,1,model_to_category,"[""headers=['family', 'deck', 'is_alone']"", ""column_name='to_category'""]",doatridge


### Run Book

We have touched on the Run Book before: it allows us to define a run order that is preserved longer term. With the need for `to_category` to run as the final intent, the Run Book fulfils this perfectly.

Adding a Run Book is a simple task of listing the intent in the order in which you wish it to run.  As discussed before we are using the default Run Book which will automatically be picked up by the run component as its run order.

In [19]:
wr.add_run_book(run_levels=['age','deck','family','fare','is_alone','to_category'])
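
Conceptually, a run book is an ordered list of level names that decides the sequence in which steps are applied. The sketch below illustrates why that order matters, using hypothetical step functions on a single record. It is a simplified model of the idea and makes no claim about how `ds_discovery` stores or executes a Run Book.

```python
# Hypothetical intent levels; each one is a function applied to a record.
def add_family(rec):
    rec["family"] = rec["sibsp"] + rec["parch"]
    return rec

def add_is_alone(rec):
    rec["is_alone"] = 1 if rec["family"] == 0 else 0
    return rec

def to_category(rec):
    # Stand-in for a type conversion: store the engineered values as strings.
    for key in ("family", "is_alone"):
        rec[key] = str(rec[key])
    return rec

levels = {"to_category": to_category, "is_alone": add_is_alone, "family": add_family}

def run(record, run_book):
    """Apply the named levels in the order the run book lists them."""
    for name in run_book:
        record = levels[name](record)
    return record

result = run({"sibsp": 0, "parch": 0}, run_book=["family", "is_alone", "to_category"])
print(result)  # {'sibsp': 0, 'parch': 0, 'family': '0', 'is_alone': '1'}
```

In this sketch, listing `to_category` before `family` would fail because the columns it converts would not exist yet, which is the kind of dependency an explicit run order protects against.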

In [20]:
wr.run_component_pipeline()

Finally, we can finish off by checking the Run Book with the Run Book report and produce the Canonical Report to see the changes the feature engineering has made.

In [21]:
wr.report_run_book()

Unnamed: 0,name,run_book
0,primary_run_book,"['age', 'deck', 'family', 'fare', 'is_alone', 'to_category']"


In [22]:
wr.canonical_report(wr.load_persist_canonical(), stylise=False)

Unnamed: 0,Attributes (13),dType,%_Null,%_Dom,Count,Unique,Observations
0,age,float64,0.0,0.036,1309,361,max=80.0 | min=0.1667 | mean=30.22 | dominant=24.0
1,cabin,string,0.775,0.775,1309,187,Sample: C23 C25 C27 | G6 | B57 B59 B63 B66 | F4 | F33
2,deck,category,0.0,0.775,1309,9,Sample: <NA> | C | B | D | E
3,embarked,category,0.0,0.698,1309,4,Sample: S | C | Q | nan
4,family,category,0.0,0.604,1309,9,Sample: 0 | 1 | 2 | 3 | 5
5,fare,float64,0.0,0.046,1309,281,max=512.3292 | min=0.0 | mean=33.28 | dominant=8.05
6,is_alone,category,0.0,0.604,1309,2,Sample: 1 | 0
7,parch,category,0.0,0.765,1309,8,Sample: 0 | 1 | 2 | 3 | 4
8,pclass,category,0.0,0.542,1309,3,Sample: 3 | 1 | 2
9,sex,category,0.0,0.644,1309,2,Sample: male | female
