In this notebook I will automate the feature engineering process using the [featuretools](https://www.featuretools.com/) library.

###  What is feature engineering?

Feature engineering can be defined simply as the process of creating new features from the existing features in a dataset.
It is one of the critical elements of solving a data science problem, and it requires domain knowledge and a substantial amount of time.
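As a minimal illustration of that definition, here is one hand-made feature computed with pandas. The data is made up; only the column names imitate the style of the competition tables, and `CREDIT_INCOME_RATIO` is a name I chose for the example.

```python
import pandas as pd

# Toy data: values are invented for illustration only.
loans = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 50000.0, 80000.0],
    "AMT_CREDIT": [200000.0, 150000.0, 40000.0],
})

# A new feature built from two existing ones: credit amount relative to income.
loans["CREDIT_INCOME_RATIO"] = loans["AMT_CREDIT"] / loans["AMT_INCOME_TOTAL"]
print(loans)
```

Writing features like this by hand is what takes domain knowledge and time; the rest of the notebook tries to automate it.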

### Featuretools basics

**Featuretools** is an open-source Python library for automatically creating features out of a set of related tables using a technique called **Deep Feature Synthesis**. Automated feature engineering, like many topics in machine learning, is a complex subject built upon a foundation of simpler ideas. By going through these ideas one at a time, we can build up an understanding of Featuretools that will later allow us to get the most out of it.

___

Coming up with features is difficult, time-consuming, and requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.

— Andrew Ng

### Dataset and problem

The Kaggle competition **Home Credit Default Risk** is a supervised classification task where the objective is to predict whether an applicant for a loan (known as a client) will default on the loan. The data comprises socio-economic indicators for the clients, loan-specific financial information, and comprehensive data on previous loans at Home Credit (the institution sponsoring the competition) and other credit agencies. The metric for this competition is the area under the Receiver Operating Characteristic curve (ROC AUC), with predictions made as probabilities of default. We can evaluate our submissions either through cross-validation on the training data (for which we have the labels) or by submitting our test predictions to Kaggle to see where we place on the public leaderboard (which is calculated with only 10% of the testing data).
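To make the metric concrete, here is a small pure-Python sketch of ROC AUC. It uses the pairwise interpretation (the probability that a randomly chosen defaulter gets a higher predicted probability than a randomly chosen non-defaulter, with ties counted as half). The function name `roc_auc` and the toy labels are my own; in practice a library implementation would normally be used.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = default, 0 = repaid; scores are predicted default probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_prob)
print(auc)
```

Because only the ordering of the scores matters, the metric rewards ranking risky clients above safe ones rather than producing well-calibrated probabilities.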

In [1]:
import pandas as pd
import numpy as np
import time
import featuretools as ft

Read in the original dataset:

In [2]:
app_train = pd.read_csv('../data/application_train.csv')
app_test = pd.read_csv('../data/application_test.csv')
bureau = pd.read_csv('../data/bureau.csv')
bureau_balance = pd.read_csv('../data/bureau_balance.csv')
cash = pd.read_csv('../data/POS_CASH_balance.csv')
credit = pd.read_csv('../data/credit_card_balance.csv')
previous = pd.read_csv('../data/previous_application.csv')
installments = pd.read_csv('../data/installments_payments.csv')

The schema of the tables:

![](../images/home_credit_data.png)

The variables that tie the tables together will be important to understand when it comes to adding relationships between entities. **The only domain knowledge we need for a full Featuretools approach to the problem is the indexes of the tables and the relationships between the tables.**

In [3]:
# Add name to dataframe
app_train.name = 'app_train'
app_test.name = 'app_test'
bureau.name = 'bureau'
bureau_balance.name = 'bureau_balance'
cash.name = 'cash'
credit.name = 'credit'
previous.name = 'previous'
installments.name = 'installments'

In [4]:
# Number of rows in each table of the dataset:
datasets_list = [app_train, app_test, bureau, bureau_balance, cash, credit, previous, installments]

for ds in datasets_list:
    print('{}\t - \t{} rows'.format(ds.name, len(ds)))

app_train	 - 	307511 rows
app_test	 - 	48744 rows
bureau	 - 	1716428 rows
bureau_balance	 - 	27299925 rows
cash	 - 	10001358 rows
credit	 - 	3840312 rows
previous	 - 	1670214 rows
installments	 - 	13605401 rows


In [5]:
# Replace the anomalous placeholder value 365243 (it appears in day-count columns such as DAYS_EMPLOYED) with NaN
for ds in datasets_list:
    ds.replace({365243: np.nan}, inplace=True)
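A tiny sketch of what this replacement does, on made-up data. I am assuming here that 365243 acts as a "missing" placeholder in columns counted in days; the values below are invented.

```python
import numpy as np
import pandas as pd

# Toy column: negative values are days before the application, 365243 is the placeholder.
toy = pd.DataFrame({"DAYS_EMPLOYED": [-1200, 365243, -300]})

toy = toy.replace({365243: np.nan})

n_missing = int(toy["DAYS_EMPLOYED"].isnull().sum())
mean_days = toy["DAYS_EMPLOYED"].mean()  # NaN is skipped by default
print(n_missing, mean_days)
```

Without the replacement, the placeholder would dominate any aggregation (mean, max, sum) that is later built on the column.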

I join the train and test sets to make sure that the same features are created for both. They will be separated again later.

In [7]:
app_test['TARGET'] = np.nan
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
app = pd.concat([app_train, app_test], ignore_index=True)
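Separating the two sets again later can rely on the fact that only the test rows have a missing `TARGET`. A minimal sketch on toy frames (the names `train_again` and `test_again` are mine):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for app_train and app_test.
train = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0.0, 1.0]})
test = pd.DataFrame({"SK_ID_CURR": [3]})
test["TARGET"] = np.nan

combined = pd.concat([train, test], ignore_index=True)

# Rows with a label belong to the training set, rows without one to the test set.
train_again = combined[combined["TARGET"].notnull()].copy()
test_again = combined[combined["TARGET"].isnull()].drop(columns=["TARGET"])
```

This only works as long as the training labels themselves contain no missing values, which is an assumption of the sketch.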

### Use Featuretools

There are a few concepts that we will cover along the way:

- **Entities and EntitySets**: our tables and a data structure for keeping track of them all
- **Relationships between tables**: how the tables can be related to one another
- **Feature primitives**: aggregations and transformations that are stacked to build features
- **Deep feature synthesis**: the method that uses feature primitives to generate thousands of new features

#### EntitySet

First we need to define an `EntitySet`. The [docs](https://docs.featuretools.com/loading_data/using_entitysets.html) define an EntitySet as a collection of entities and the relationships between them. EntitySets are useful for preparing raw, structured datasets for feature engineering.

An EntitySet keeps track of all the data.

In [7]:
es = ft.EntitySet(id = 'clients')

Explicitly define some feature types:

In [8]:
import featuretools.variable_types as vtypes

In [9]:
app_types = {}

In [10]:
# Boolean variables
for col in app.columns:
    if (app[col].nunique() == 2) and (app[col].dtype == float):
        app_types[col] = vtypes.Boolean
        
# TARGET is the label we want to predict, so it should not be typed as a Boolean feature
del app_types['TARGET']


# Ordinal variables
app_types['REGION_RATING_CLIENT'] = vtypes.Ordinal
app_types['REGION_RATING_CLIENT_W_CITY'] = vtypes.Ordinal
app_types['HOUR_APPR_PROCESS_START'] = vtypes.Ordinal

previous_types = {}

for col in previous.columns:
    if ( previous[col].nunique() == 2) and (previous[col].dtype == float):
        previous_types[col] = vtypes.Boolean
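The rule used above (exactly two distinct non-null values and a float dtype) can be checked on a toy frame. This is only a sketch with invented column names; note that it skips two-valued integer columns, because their dtype is not float.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "FLAG": [0.0, 1.0, np.nan, 1.0],     # float, two distinct non-null values
    "AMOUNT": [10.5, 20.0, 30.0, 40.0],  # float, many distinct values
    "COUNT": [0, 1, 0, 1],               # two distinct values, but integer dtype
})

# nunique() ignores NaN by default, so FLAG counts as two-valued.
boolean_like = [col for col in toy.columns
                if toy[col].nunique() == 2 and toy[col].dtype == float]
print(boolean_like)
```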


Drop `SK_ID_CURR` from `installments`, `credit`, and `cash`, because I will link these tables to `app` through `previous` and `SK_ID_PREV`. See the table schema above.

In [11]:
bureau['SK_ID_BUREAU'] = bureau['SK_ID_BUREAU'].astype(np.int64)

In [12]:
previous['SK_ID_PREV'] = previous['SK_ID_PREV'].astype(np.int64)
cash['SK_ID_PREV'] = cash['SK_ID_PREV'].astype(np.int64)
installments['SK_ID_PREV'] = installments['SK_ID_PREV'].astype(np.int64)

In [13]:
installments = installments.drop(columns = ['SK_ID_CURR'])
credit = credit.drop(columns = ['SK_ID_CURR'])
cash = cash.drop(columns = ['SK_ID_CURR'])

Add the entities to the EntitySet. An entity can be thought of as a representation of a pandas DataFrame.

In [14]:
es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR', variable_types=app_types)
es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')
es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV', variable_types= previous_types )

# Entities without a unique index: we ask Featuretools to create one with make_index
es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, 
                             make_index = True, index = 'bureaubalance_index')

es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, 
                             make_index = True, index = 'cash_index')

es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments,
                             make_index = True, index = 'installments_index')

es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit,
                             make_index = True, index = 'credit_index')
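Why `make_index` is needed can be seen on a toy table: no existing column identifies a row uniquely, so a new one has to be added. The pandas code below is my own approximation of the idea (a running integer per row), not the Featuretools implementation.

```python
import pandas as pd

# Toy version of bureau_balance: several monthly rows per bureau loan.
balance = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 11],
    "MONTHS_BALANCE": [0, -1, 0],
})

had_unique_key = bool(balance["SK_ID_BUREAU"].is_unique)  # False: 10 appears twice

# Add a synthetic row identifier, one distinct integer per row.
balance.insert(0, "bureaubalance_index", range(len(balance)))
print(balance)
```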

#### Relationships

We need to define the relationships between the tables using their indices. This tells Featuretools how the tables are related and how they can be joined. It requires specific knowledge of this particular dataset.

In [16]:
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])

In [17]:
# Add relationships to EntitySet
es = es.add_relationships([r_app_bureau, r_bureau_balance,  r_app_previous, r_previous_cash, r_previous_installments, r_previous_credit])

In [18]:
es

Entityset: clients
  Entities:
    app [Rows: 356255, Columns: 122]
    bureau [Rows: 1716428, Columns: 17]
    previous [Rows: 1670214, Columns: 37]
    bureau_balance [Rows: 27299925, Columns: 4]
    cash [Rows: 10001358, Columns: 8]
    installments [Rows: 13605401, Columns: 8]
    credit [Rows: 3840312, Columns: 23]
  Relationships:
    bureau.SK_ID_CURR -> app.SK_ID_CURR
    bureau_balance.SK_ID_BUREAU -> bureau.SK_ID_BUREAU
    previous.SK_ID_CURR -> app.SK_ID_CURR
    cash.SK_ID_PREV -> previous.SK_ID_PREV
    installments.SK_ID_PREV -> previous.SK_ID_PREV
    credit.SK_ID_PREV -> previous.SK_ID_PREV
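Each relationship listed above is one-to-many: one parent row (for example a client in `app`) can have many child rows (for example loans in `bureau`). Before declaring such a relationship, it can be worth checking in plain pandas that the parent key is unique and that no child points to a missing parent. A sketch on toy data:

```python
import pandas as pd

# Toy parent (clients) and child (bureau loans) tables.
parent = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
child = pd.DataFrame({
    "SK_ID_BUREAU": [10, 11, 12],
    "SK_ID_CURR": [1, 1, 3],
})

# The parent key must identify each parent row uniquely.
parent_key_unique = bool(parent["SK_ID_CURR"].is_unique)

# Child rows whose key has no matching parent ("orphans").
orphans = child[~child["SK_ID_CURR"].isin(parent["SK_ID_CURR"])]

# Number of children per parent; client 2 has none and is simply absent here.
children_per_parent = child.groupby("SK_ID_CURR").size()
print(parent_key_unique, len(orphans), children_per_parent.to_dict())
```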

#### Feature Primitives

A **feature primitive** is an operation applied to a table or a set of tables to create a feature. These represent simple calculations, many of which we already use in manual feature engineering, that can be stacked on top of each other to create complex deep features. Feature primitives fall into two categories:

- `Aggregation`: a function that groups together the children of each parent and calculates a statistic such as the mean, min, max, or standard deviation across the children. An example is the maximum previous loan amount for each client. An aggregation spans multiple tables, using the relationships between them.
- `Transformation`: an operation applied to one or more columns in a single table. An example would be taking the absolute value of a column, or finding the difference between two columns in one table.
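The two categories correspond to familiar pandas operations. The sketch below uses a toy version of `previous` with invented values; the new column name `CREDIT_MINUS_APPLICATION` is mine.

```python
import pandas as pd

# Toy version of the previous-loans table (child of the clients table).
previous = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT": [1000.0, 3000.0, 500.0],
    "AMT_APPLICATION": [1200.0, 2500.0, 500.0],
})

# Aggregation: group the children by parent and compute one statistic per parent.
max_credit = previous.groupby("SK_ID_CURR")["AMT_CREDIT"].max()

# Transformation: combine columns within a single table, row by row.
previous["CREDIT_MINUS_APPLICATION"] = previous["AMT_CREDIT"] - previous["AMT_APPLICATION"]

print(max_credit.to_dict())
print(previous["CREDIT_MINUS_APPLICATION"].tolist())
```

The aggregation produces one value per client, while the transformation keeps one value per previous loan.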

Next we need to choose the feature primitives. Feature primitives are the building blocks of Featuretools: they define individual computations that can be applied to raw datasets to create new features. See the [docs](https://docs.featuretools.com/automated_feature_engineering/primitives.html).

In [19]:
primitives = ft.list_primitives()

In [45]:
pd.set_option('max_colwidth',300)

In [47]:
primitives

| | name | type | description |
|---|---|---|---|
| 0 | max | aggregation | Finds the maximum non-null value of a numeric feature. |
| 1 | avg_time_between | aggregation | Computes the average time between consecutive events. |
| 2 | mean | aggregation | Computes the average value of a numeric feature. |
| 3 | num_unique | aggregation | Returns the number of unique categorical variables. |
| 4 | any | aggregation | Test if any value is 'True'. |
| 5 | n_most_common | aggregation | Finds the N most common elements in a categorical feature. |
| 6 | std | aggregation | Finds the standard deviation of a numeric feature ignoring null values. |
| 7 | skew | aggregation | Computes the skewness of a data set. |
| 8 | count | aggregation | Counts the number of non null values. |
| 9 | all | aggregation | Test if all values are 'True'. |


In [46]:
primitives[primitives['name'] =='percentile']

| | name | type | description |
|---|---|---|---|
| 39 | percentile | transform | For each value of the base feature, determines the percentile in relation |
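To get a feel for what a percentile transform produces, here is a pandas sketch. I am assuming the primitive behaves like a percentage rank (`Series.rank(pct=True)`); that is my reading of the description above, not something confirmed by the output shown.

```python
import pandas as pd

# Toy numeric column.
values = pd.Series([10.0, 20.0, 30.0, 40.0])

# Each value is replaced by its rank divided by the number of values.
percentiles = values.rank(pct=True)
print(percentiles.tolist())
```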


In [48]:
# Define the primitives to use
agg_primitives = ["sum", "max", "min", "mean", "count", "percent_true", "num_unique", "mode"]
trans_primitives = ['percentile', 'and', 'diff']

#### Deep feature synthesis

**Deep Feature Synthesis** (DFS) is the method Featuretools uses to make new features. DFS stacks feature primitives to form features with a "depth" equal to the number of primitives. For example, if we take the maximum value of a client's previous loans (say `MAX(previous.loan_amount)`), that is a "deep feature" with a depth of 1. To create a feature with a depth of 2, we could stack primitives by taking the maximum value of a client's average monthly payments per previous loan (such as `MAX(previous(MEAN(installments.payment)))`). In manual feature engineering, this would require two separate groupings and aggregations, and writing the code could take more than 15 minutes per feature.
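For comparison, this is roughly what the depth-2 example looks like when written by hand in pandas: one grouping per level of the table hierarchy. The data is a toy, and the column names `AMT_PAYMENT` and `MEAN_PAYMENT` are chosen for the sketch.

```python
import pandas as pd

# Toy grandchild table: payments belonging to previous loans.
installments = pd.DataFrame({
    "SK_ID_PREV": [100, 100, 101, 200],
    "AMT_PAYMENT": [10.0, 30.0, 50.0, 5.0],
})
# Toy child table: previous loans belonging to clients.
previous = pd.DataFrame({
    "SK_ID_PREV": [100, 101, 200],
    "SK_ID_CURR": [1, 1, 2],
})

# Depth 1: mean payment per previous loan.
mean_payment = (installments.groupby("SK_ID_PREV")["AMT_PAYMENT"].mean()
                .rename("MEAN_PAYMENT").reset_index())
previous = previous.merge(mean_payment, on="SK_ID_PREV", how="left")

# Depth 2: maximum of those means per client.
max_mean_payment = previous.groupby("SK_ID_CURR")["MEAN_PAYMENT"].max()
print(max_mean_payment.to_dict())
```

DFS generates this kind of stacked aggregation for every compatible combination of primitive and column, which is why the number of features grows so quickly with depth.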

More information about DFS is available [here](https://www.featurelabs.com/blog/deep-feature-synthesis/).

In [49]:
# Deep feature synthesis
feature_names = ft.dfs(entityset = es, target_entity = 'app', 
                        trans_primitives = trans_primitives,
                        agg_primitives = agg_primitives, 
                        max_depth = 3, n_jobs = 1, verbose = 1,
                        features_only = True)

Built 5152 features


When `features_only` is `True`, `ft.dfs` returns only the feature definitions; the actual feature values are not computed.

In [53]:
feature_names[4000:4010]

[<Feature: MEAN(previous.PERCENTILE(SUM(credit.AMT_PAYMENT_CURRENT)))>,
 <Feature: MEAN(previous.PERCENTILE(SUM(credit.AMT_PAYMENT_TOTAL_CURRENT)))>,
 <Feature: MEAN(previous.PERCENTILE(SUM(credit.AMT_RECEIVABLE_PRINCIPAL)))>,
 <Feature: MEAN(previous.PERCENTILE(SUM(credit.AMT_RECIVABLE)))>,
 <Feature: MEAN(previous.PERCENTILE(SUM(credit.AMT_TOTAL_RECEIVABLE)))>,
 <Feature: MEAN(previous.PERCENTILE(SUM(credit.CNT_DRAWINGS_ATM_CURRENT)))>,
 <Feature: MEAN(previous.PERCENTILE(SUM(credit.CNT_DRAWINGS_CURRENT)))>,
 <Feature: MEAN(previous.PERCENTILE(SUM(credit.CNT_DRAWINGS_OTHER_CURRENT)))>,
 <Feature: MEAN(previous.PERCENTILE(SUM(credit.CNT_DRAWINGS_POS_CURRENT)))>,
 <Feature: MEAN(previous.PERCENTILE(SUM(credit.CNT_INSTALMENT_MATURE_CUM)))>]

In [51]:
ft.save_features(feature_names, '../input/features_names_depth3_5152.txt')

**Note:** Due to the constraints of my local machine, the code above was run on AWS EC2. DFS generated 1800 feature names.