In this notebook I automate the feature engineering process using the [featuretools](https://www.featuretools.com/) library.

###  What is feature engineering?

It can be defined simply as the process of creating new features from the existing features in a dataset.
It is one of the critical elements in solving a data science problem, and it requires domain knowledge and a substantial amount of time.
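
As a minimal illustration of the idea, the sketch below derives one new column from two existing ones with pandas. The tiny DataFrame and its values are invented for illustration, `AMT_CREDIT` and `AMT_INCOME_TOTAL` are used only as example column names, and `CREDIT_INCOME_RATIO` is a hypothetical feature name.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "AMT_CREDIT": [200000.0, 450000.0, 90000.0],
    "AMT_INCOME_TOTAL": [100000.0, 150000.0, 90000.0],
})

# A hand-made feature: how large the loan is relative to income.
toy["CREDIT_INCOME_RATIO"] = toy["AMT_CREDIT"] / toy["AMT_INCOME_TOTAL"]

print(toy)
```

Writing features like this by hand for every column and every table is the time-consuming part that the rest of the notebook tries to automate.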

In [1]:
import pandas as pd
import numpy as np
import time
import featuretools as ft

Read in the original dataset:

In [2]:
app_train = pd.read_csv('../data/application_train.csv')
app_test = pd.read_csv('../data/application_test.csv')
bureau = pd.read_csv('../data/bureau.csv')
bureau_balance = pd.read_csv('../data/bureau_balance.csv')
cash = pd.read_csv('../data/POS_CASH_balance.csv')
credit = pd.read_csv('../data/credit_card_balance.csv')
previous = pd.read_csv('../data/previous_application.csv')
installments = pd.read_csv('../data/installments_payments.csv')

The schema of the tables:

![](../images/home_credit_data.png)

In [3]:
# Add name to dataframe
app_train.name = 'app_train'
app_test.name = 'app_test'
bureau.name = 'bureau'
bureau_balance.name = 'bureau_balance'
cash.name = 'cash'
credit.name = 'credit'
previous.name = 'previous'
installments.name = 'installments'

In [4]:
# Number of rows in each table of the dataset
datasets_list = [app_train, app_test, bureau, bureau_balance, cash, credit, previous, installments]

for ds in datasets_list:
    # len(ds) counts rows directly; count() on a column would skip missing values
    print('{}\t - \t{} rows'.format(ds.name, len(ds)))

app_train	 - 	307511 rows
app_test	 - 	48744 rows
bureau	 - 	1716428 rows
bureau_balance	 - 	27299925 rows
cash	 - 	10001358 rows
credit	 - 	3840312 rows
previous	 - 	1670214 rows
installments	 - 	13605401 rows


In [5]:
# Replace the anomalous value 365243 with NaN in every table
for ds in datasets_list:
    ds.replace({365243: np.nan}, inplace=True)

I join the train and test sets to make sure that the same features are created for each set. They will be separated again later.

In [6]:
app_test['TARGET'] = np.nan
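# pd.concat replaces DataFrame.append, which is deprecated and removed in newer pandas
app = pd.concat([app_train, app_test], ignore_index=True, sort=False)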
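
The later separation can rely on the missing `TARGET` values that mark the test rows. The following is a minimal sketch on toy data, not the notebook's actual split; `toy_train`, `toy_test`, `train_part` and `test_part` are hypothetical names and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for app_train / app_test (values invented for illustration).
toy_train = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1], "AMT_CREDIT": [100.0, 200.0]})
toy_test = pd.DataFrame({"SK_ID_CURR": [3], "AMT_CREDIT": [300.0]})

# Give the test rows a missing target, then stack both sets.
toy_test["TARGET"] = np.nan
toy_app = pd.concat([toy_train, toy_test], ignore_index=True, sort=False)

# Later: rows with a known target are the training set, the rest are the test set.
is_train = toy_app["TARGET"].notnull()
train_part = toy_app[is_train].copy()
test_part = toy_app[~is_train].drop(columns=["TARGET"])

print(len(train_part), len(test_part))
```

This only works if no training row has a missing `TARGET`, which is an assumption worth checking on the real data before splitting.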

### Use Featuretools library

First we need to define an `EntitySet`. The [docs](https://docs.featuretools.com/loading_data/using_entitysets.html) define an EntitySet as a collection of entities and the relationships between them. EntitySets are useful for preparing raw, structured datasets for feature engineering.

The EntitySet keeps track of all the data.

In [7]:
es = ft.EntitySet(id = 'clients')

Explicitly define some feature types:

In [8]:
import featuretools.variable_types as vtypes

In [9]:
app_types = {}

In [10]:
# Boolean variables
for col in app.columns:
    if (app[col].nunique() == 2) and (app[col].dtype == float):
        app_types[col] = vtypes.Boolean
        
del app_types['TARGET']


# Ordinal variables
app_types['REGION_RATING_CLIENT'] = vtypes.Ordinal
app_types['REGION_RATING_CLIENT_W_CITY'] = vtypes.Ordinal
app_types['HOUR_APPR_PROCESS_START'] = vtypes.Ordinal

previous_types = {}

for col in previous.columns:
    if ( previous[col].nunique() == 2) and (previous[col].dtype == float):
        previous_types[col] = vtypes.Boolean
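
To make the detection rule concrete, here is a small sketch that applies the same condition to a toy frame. The column names and values are invented for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({
    "FLAG_AS_FLOAT": [0.0, 1.0, 1.0, np.nan],  # float, two distinct non-null values
    "FLAG_AS_INT": [0, 1, 1, 0],               # int, two distinct values
    "AMOUNT": [10.5, 20.0, 30.0, 40.0],        # float, many distinct values
})

# Same rule as in the cell above: exactly two distinct non-null values and float dtype.
detected = [
    col for col in toy.columns
    if toy[col].nunique() == 2 and toy[col].dtype == float
]
print(detected)
```

Only the float column with two distinct values is selected. Integer-coded flags are skipped by the dtype check, so if such columns should also be treated as Boolean, the condition would need to be relaxed.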


Drop `SK_ID_CURR` from `installments`, `credit`, and `cash`, because these datasets will be linked through `previous` via `SK_ID_PREV`. See the table schema above.

In [11]:
installments = installments.drop(columns = ['SK_ID_CURR'])
credit = credit.drop(columns = ['SK_ID_CURR'])
cash = cash.drop(columns = ['SK_ID_CURR'])

Add entities to the EntitySet. An entity can be thought of as a representation of a pandas DataFrame.

In [13]:
es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR', variable_types=app_types)
es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')
es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV', variable_types= previous_types )

# Entities without a unique index: ask Featuretools to create one with make_index
es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, 
                             make_index = True, index = 'bureaubalance_index')

es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, 
                             make_index = True, index = 'cash_index')

es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments,
                             make_index = True, index = 'installments_index')

es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit,
                             make_index = True, index = 'credit_index')

We need to define the relationships between tables using their indices. This tells Featuretools how the tables are related and how they can be joined. It requires specific knowledge of this particular dataset.

In [None]:
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])

In [None]:
# Add relationships to EntitySet
es = es.add_relationships([r_app_bureau, r_bureau_balance,  r_app_previous, r_previous_cash, r_previous_installments, r_previous_credit])
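
Each relationship above is one-to-many: one row in the parent table can have many rows in the child table. Before trusting such a link, it can help to check the keys in plain pandas. The sketch below uses toy tables with invented values and is not part of the Featuretools API; `parent`, `child` and `orphans` are hypothetical names.

```python
import pandas as pd

# Toy parent and child tables, invented for illustration.
parent = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
child = pd.DataFrame({
    "SK_ID_BUREAU": [10, 11, 12, 13],
    "SK_ID_CURR": [1, 1, 2, 2],
})

# The parent key should identify each parent row uniquely.
parent_key_is_unique = parent["SK_ID_CURR"].is_unique

# Child rows whose key does not appear in the parent table.
orphans = child[~child["SK_ID_CURR"].isin(parent["SK_ID_CURR"])]

# One parent may have many children, or none at all.
children_per_parent = (
    child.groupby("SK_ID_CURR").size()
    .reindex(parent["SK_ID_CURR"], fill_value=0)
)

print(parent_key_is_unique, len(orphans))
print(children_per_parent)
```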

Next we need to define `feature primitives`. Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. See the [docs](https://docs.featuretools.com/automated_feature_engineering/primitives.html).

In [None]:
primitives = ft.list_primitives()

In [None]:
primitives.head(5)

In [None]:
# Define default primitives
agg_primitives = ["sum", "max", "min", "mean", "count", "percent_true", "num_unique", "mode"]
trans_primitives = ['percentile', 'and']
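
The two kinds of primitives differ in shape: an aggregation primitive collapses many child rows into one value per parent, while a transform primitive returns one value per input row. The sketch below shows rough pandas analogues of `mean` and `percentile` on a toy table. The values are invented, and the pandas calls are my approximation of what these primitives compute, not Featuretools code.

```python
import pandas as pd

# Toy child table, invented for illustration.
loans = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [100.0, 300.0, 50.0],
})

# Aggregation-style: many child rows collapse to one value per client.
mean_per_client = loans.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"].mean()

# Transform-style: one output value per input row (here, a percentile rank).
loans["AMT_CREDIT_SUM_PCT"] = loans["AMT_CREDIT_SUM"].rank(pct=True)

print(mean_per_client)
print(loans)
```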

Featuretools uses **Deep Feature Synthesis** (DFS) to generate new features. More information about DFS is available [here](https://www.featurelabs.com/blog/deep-feature-synthesis/).

In [None]:
# Deep feature synthesis
feature_names = ft.dfs(entityset = es, target_entity = 'app', 
                        trans_primitives = trans_primitives,
                        agg_primitives = agg_primitives, 
                        max_depth = 2, n_jobs = 1, verbose = 1,
                        features_only = True)

When `features_only` is `True`, only the feature names (definitions) are created; the actual values of the features are not computed.
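
To give a feel for what `max_depth = 2` means, the sketch below stacks two aggregations by hand in pandas: payments are first summed per previous loan, and that new column is then averaged per client. The toy tables and values are invented for illustration, and this is a manual approximation of one stacked feature rather than what DFS executes internally. Featuretools would name such a feature along the lines of `MEAN(previous.SUM(installments.AMT_PAYMENT))`.

```python
import pandas as pd

# Toy three-level hierarchy, invented for illustration:
# client (SK_ID_CURR) -> previous loan (SK_ID_PREV) -> installment payments.
previous_toy = pd.DataFrame({"SK_ID_PREV": [10, 11, 12], "SK_ID_CURR": [1, 1, 2]})
installments_toy = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 11, 12],
    "AMT_PAYMENT": [100.0, 50.0, 30.0, 70.0],
})

# Depth 1: aggregate the grandchild table up to its parent (per previous loan).
paid_per_loan = (
    installments_toy.groupby("SK_ID_PREV")["AMT_PAYMENT"].sum()
    .rename("SUM_AMT_PAYMENT").reset_index()
)
previous_toy = previous_toy.merge(paid_per_loan, on="SK_ID_PREV", how="left")

# Depth 2: aggregate that new feature up again, to the client level.
mean_paid_per_client = previous_toy.groupby("SK_ID_CURR")["SUM_AMT_PAYMENT"].mean()

print(mean_paid_per_client)
```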

In [None]:
ft.save_features(feature_names, '../input/features2.txt')

**Note:** Due to the constraints of my local machine, the code above was run on AWS EC2. DFS generated 1800 feature names.