## Dataset and approach:
The data comes from the Kaggle competition [Home Credit Default Risk](https://www.kaggle.com/c/home-credit-default-risk).

I implement automated feature engineering with the open-source library [Featuretools](https://www.featuretools.com/).


In [1]:
import pandas as pd
import numpy as np

In [2]:
import featuretools as ft

In [3]:
# Read in the datasets  
app_train = pd.read_csv('../../Home_Credit_data/data/application_train.csv', sep=',')
app_test = pd.read_csv('../../Home_Credit_data/data/application_test.csv')
bureau = pd.read_csv('../../Home_Credit_data/data/bureau.csv')
bureau_balance = pd.read_csv('../../Home_Credit_data/data/bureau_balance.csv')
cash = pd.read_csv('../../Home_Credit_data/data/POS_CASH_balance.csv')
credit = pd.read_csv('../../Home_Credit_data/data/credit_card_balance.csv')
previous = pd.read_csv('../../Home_Credit_data/data/previous_application.csv')
installments = pd.read_csv('../../Home_Credit_data/data/installments_payments.csv')

![](../images/home_credit_data.png)

In [9]:
datasets_list = [app_train, app_test, bureau, bureau_balance, cash, credit, previous, installments]

In [10]:
# Replace the anomalous value 365243 with NaN in every dataset
for ds in datasets_list:
    ds.replace({365243: np.nan}, inplace=True)
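
As a small illustration of this replacement, the sketch below applies the same `replace` call to a toy frame. The values are made up; only the sentinel 365243 comes from the notebook. Because the replacement is by value, it affects every column, so a genuine 365243 in a non-day column would also become missing. Restricting the call to columns whose name starts with `DAYS` is one possible refinement, shown here as an assumption and not as the approach used above.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "DAYS_EMPLOYED": [-1200, 365243, -30],
    "AMT_CREDIT": [365243.0, 50000.0, 120000.0],
})

# Same call as in the notebook: every 365243 becomes NaN, in any column.
replaced_everywhere = toy.replace({365243: np.nan})

# Possible refinement (an assumption, not the notebook's method):
# only touch columns whose name starts with "DAYS".
replaced_days_only = toy.copy()
for col in toy.columns:
    if col.startswith("DAYS"):
        replaced_days_only[col] = toy[col].replace(365243, np.nan)

print(replaced_everywhere)
print(replaced_days_only)
```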

In [None]:
# Join the train and test sets so that the same features are created for both.
# They will be separated again later.

In [None]:
app_test['TARGET'] = np.nan
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
app = pd.concat([app_train, app_test], ignore_index=True)
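
The sketch below shows the stack-then-split idea on toy data. The frames `toy_train` and `toy_test` are hypothetical stand-ins, and the split relies on the assumption that `TARGET` is never missing in the training set.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for app_train and app_test (values are illustrative).
toy_train = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
toy_test = pd.DataFrame({"SK_ID_CURR": [4, 5]})

# Mark test rows with a missing target, then stack the two frames.
toy_test["TARGET"] = np.nan
toy_app = pd.concat([toy_train, toy_test], ignore_index=True)

# ... feature engineering on toy_app would happen here ...

# Split back: rows with a known target are train, the rest are test.
train_again = toy_app[toy_app["TARGET"].notnull()].copy()
test_again = toy_app[toy_app["TARGET"].isnull()].drop(columns=["TARGET"])

print(len(train_again), len(test_again))
```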

### Featuretools

In [54]:
# Entity set to keep track of all the data
es = ft.EntitySet(id = 'clients')

#### Variable Types

In [55]:
import featuretools.variable_types as vtypes

In [56]:
app_types = {}

In [57]:
# Boolean variables

for col in app.columns:
    if (app[col].nunique() == 2) and (app[col].dtype == float):
        app_types[col] = vtypes.Boolean
        
del app_types['TARGET']

print('Number of Boolean variables: {}'.format(len(app_types)))

Number of Boolean variables: 32
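
To make the selection rule concrete, here is a minimal sketch on toy data. The helper name `float_two_valued_columns` and the column names are hypothetical. One thing it highlights is that the rule only picks float columns, so a 0/1 flag stored as an integer is skipped. I have not checked here whether that matters for the real data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "FLOAT_FLAG": [0.0, 1.0, np.nan, 1.0],  # float, two distinct values
    "INT_FLAG": [0, 1, 1, 0],               # integer, two distinct values
    "AMOUNT": [10.5, 20.0, 30.25, 40.0],    # float, four distinct values
})

def float_two_valued_columns(df):
    """Columns with exactly two distinct non-null values and float dtype."""
    return [col for col in df.columns
            if df[col].nunique() == 2 and df[col].dtype == float]

selected = float_two_valued_columns(toy)
print(selected)
```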


In [58]:
# Ordinal variables
app_types['REGION_RATING_CLIENT'] = vtypes.Ordinal
app_types['REGION_RATING_CLIENT_W_CITY'] = vtypes.Ordinal
app_types['HOUR_APPR_PROCESS_START'] = vtypes.Ordinal

In [59]:
previous_types = {}

for col in previous.columns:
    if ( previous[col].nunique() == 2) and (previous[col].dtype == float):
        previous_types[col] = vtypes.Boolean
        
print('Number of Boolean variables: {}'.format(len(previous_types)))

Number of Boolean variables: 2


Drop `SK_ID_CURR` from `installments`, `credit`, and `cash`, because these datasets will be linked to `app` through `previous` via `SK_ID_PREV`.

This prevents `featuretools` from creating useless statistical aggregations of the ids.

In [61]:
installments = installments.drop(columns = ['SK_ID_CURR'])
credit = credit.drop(columns = ['SK_ID_CURR'])
cash = cash.drop(columns = ['SK_ID_CURR'])
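
To see why a leftover id column is unhelpful, the sketch below aggregates a toy child table by `SK_ID_PREV`. It uses a plain pandas `groupby` as a stand-in for an aggregation step, which is an assumption about the kind of statistic that would be produced, not a reproduction of Featuretools' behaviour. The values are made up.

```python
import pandas as pd

# Toy child table (values are illustrative).
toy_cash = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 11],
    "SK_ID_CURR": [100, 100, 101],
    "SK_DPD": [0, 5, 2],
})

# With the id column left in, it is averaged like any numeric column.
with_id = toy_cash.groupby("SK_ID_PREV").mean()

# After dropping it, only the meaningful column is aggregated.
without_id = toy_cash.drop(columns=["SK_ID_CURR"]).groupby("SK_ID_PREV").mean()

print(list(with_id.columns))
print(list(without_id.columns))
```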

In [63]:
# Add Entities to EntitySet

es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR', variable_types=app_types)
es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')
es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV', variable_types= previous_types )


In [75]:
# These entities have no unique index, so one is created with make_index = True.
es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, 
                              make_index = True, index = 'bureaubalance_index')

es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, 
                              make_index = True, index = 'cash_index')

es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments,
                              make_index = True, index = 'installments_index')

es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit,
                              make_index = True, index = 'credit_index')
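
A rough pandas picture of what adding an index means: when no existing column identifies a row uniquely, a running integer can serve as the key. The sketch below is an illustration of the idea on toy data, assuming a simple 0-based counter is acceptable. It is not Featuretools' internal code.

```python
import pandas as pd

# Toy table (values are illustrative).
toy_balance = pd.DataFrame({
    "SK_ID_BUREAU": [7, 7, 8],
    "MONTHS_BALANCE": [0, -1, 0],
})

# SK_ID_BUREAU repeats, so it cannot serve as a unique index.
has_unique_key = toy_balance["SK_ID_BUREAU"].is_unique

# Add a running integer as the first column to act as the index.
indexed = toy_balance.copy()
indexed.insert(0, "bureaubalance_index", list(range(len(indexed))))

print(has_unique_key)
print(indexed)
```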

In [73]:
es

Entityset: clients
  Entities:
    app [Rows: 356255, Columns: 122]
    bureau [Rows: 1716428, Columns: 17]
    previous [Rows: 1670214, Columns: 37]
  Relationships:
    No relationships

In [None]:
# Define relationships: the parent variable comes first, then the child variable
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])

r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])

In [76]:
# Add relationships to EntitySet

In [None]:
es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous, r_previous_cash, r_previous_installments, r_previous_credit])
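
Each relationship pairs a parent key with a child key. Before adding relationships it can help to confirm that the parent key is unique and that every child key has a matching parent row. The helper `check_relationship` below is hypothetical (it is not part of Featuretools) and uses plain pandas on toy data.

```python
import pandas as pd

def check_relationship(parent, child, key):
    """Return (parent_key_is_unique, number_of_orphan_child_rows)."""
    parent_unique = parent[key].is_unique
    orphans = (~child[key].isin(parent[key])).sum()
    return bool(parent_unique), int(orphans)

# Toy parent and child tables (values are illustrative).
toy_app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
toy_bureau = pd.DataFrame({
    "SK_ID_BUREAU": [10, 11, 12, 13],
    "SK_ID_CURR": [1, 1, 2, 9],  # 9 has no matching parent row
})

result = check_relationship(toy_app, toy_bureau, "SK_ID_CURR")
print(result)
```

The same check could be run for each parent/child pair defined above before building features.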

#### Feature primitives

In [77]:
primitives = ft.list_primitives()

In [79]:
primitives

| | name | type | description |
|---|---|---|---|
| 0 | n_most_common | aggregation | Finds the N most common elements in a categori... |
| 1 | max | aggregation | Finds the maximum non-null value of a numeric ... |
| 2 | skew | aggregation | Computes the skewness of a data set. |
| 3 | trend | aggregation | Calculates the slope of the linear trend of va... |
| 4 | mode | aggregation | Finds the most common element in a categorical... |
| 5 | any | aggregation | Test if any value is 'True'. |
| 6 | min | aggregation | Finds the minimum non-null value of a numeric ... |
| 7 | sum | aggregation | Counts the number of elements of a numeric or ... |
| 8 | time_since_last | aggregation | Time since last related instance. |
| 9 | all | aggregation | Test if all values are 'True'. |
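
For intuition, aggregation primitives such as `max`, `min` and `sum` summarize the child rows that belong to each parent. The pandas sketch below computes comparable statistics by hand on toy data. The values are made up, and the output column names are pandas defaults, not the names Featuretools would generate.

```python
import pandas as pd

# Toy child table: several bureau records per client (values are illustrative).
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
})

# One row per client, one column per aggregation.
agg = toy_bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"].agg(["max", "min", "sum"])
print(agg)
```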
