## Dataset and approach
The data comes from the Kaggle competition [Home Credit Default Risk](https://www.kaggle.com/c/home-credit-default-risk).

I apply automated feature engineering using the open-source library [Featuretools](https://www.featuretools.com/).


In [1]:
import pandas as pd
import numpy as np

In [2]:
import featuretools as ft

In [3]:
# Read in the datasets  
app_train = pd.read_csv('../../Home_Credit_data/data/application_train.csv', sep=',')
app_test = pd.read_csv('../../Home_Credit_data/data/application_test.csv')
bureau = pd.read_csv('../../Home_Credit_data/data/bureau.csv')
bureau_balance = pd.read_csv('../../Home_Credit_data/data/bureau_balance.csv')
cash = pd.read_csv('../../Home_Credit_data/data/POS_CASH_balance.csv')
credit = pd.read_csv('../../Home_Credit_data/data/credit_card_balance.csv')
previous = pd.read_csv('../../Home_Credit_data/data/previous_application.csv')
installments = pd.read_csv('../../Home_Credit_data/data/installments_payments.csv')

![](../images/home_credit_data.png)

In [9]:
datasets_list = [app_train, app_test, bureau, bureau_balance, cash, credit, previous, installments]

In [10]:
# Replace the anomalous placeholder value 365243 with NaN in every dataset
for ds in datasets_list:
    ds.replace({365243: np.nan}, inplace=True)
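
The replacement above touches every column, so an id or amount that happens to equal 365243 would also become `NaN`. A narrower variant is sketched below on a made-up toy frame. It assumes the placeholder only carries meaning in columns whose names start with `DAYS`; that prefix rule is an assumption for illustration, not something the original analysis states.

```python
import numpy as np
import pandas as pd

# Toy frame: column names mimic the dataset, the values are invented.
toy = pd.DataFrame({
    "SK_ID_CURR": [365243, 100002],    # an id that happens to equal the placeholder
    "DAYS_EMPLOYED": [365243, -1200],  # here the placeholder marks a missing value
})

# Assumption: only day-count columns (prefix "DAYS") use the placeholder.
days_cols = [c for c in toy.columns if c.startswith("DAYS")]
toy[days_cols] = toy[days_cols].replace({365243: np.nan})

print(toy)
```

Whether the narrower rule is preferable depends on the data; the global replacement is simpler if no other column can legitimately contain that value.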

In [None]:
# Join the train and test sets so that the same features are created for both.
# They will be separated again later.

In [None]:
app_test['TARGET'] = np.nan
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
app = pd.concat([app_train, app_test], ignore_index=True)
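
One possible way to separate the two sets again is to use the `TARGET` column as the marker, since test rows were given `NaN` there. The sketch below uses invented toy frames and assumes every training row has a non-missing `TARGET`; the names `train_again` and `test_again` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test application tables.
train = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})
test = pd.DataFrame({"SK_ID_CURR": [3]})

# Same pattern as above: mark test rows with a missing label, then stack.
test["TARGET"] = np.nan
combined = pd.concat([train, test], ignore_index=True)

# Rows with a known label are training rows; the rest are test rows.
train_again = combined[combined["TARGET"].notnull()].copy()
test_again = combined[combined["TARGET"].isnull()].drop(columns=["TARGET"])
```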

### Featuretools

In [None]:
# Entity set to keep track of all the data
es = ft.EntitySet(id = 'clients')

#### Variable Types

In [37]:
import featuretools.variable_types as vtypes

In [38]:
app_types = {}

In [39]:
# Boolean variables

for col in app.columns:
    if (app[col].nunique() == 2) and (app[col].dtype == float):
        app_types[col] = vtypes.Boolean
        
# TARGET is the label, not a feature, so remove it from the Boolean types
del app_types['TARGET']

print('Number of Boolean variables: {}'.format(len(app_types)))

Number of Boolean variables: 32
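
The detection rule relies on `nunique()` ignoring `NaN` by default, which is why `TARGET` (0, 1, and `NaN` for the test rows) is picked up and has to be removed by hand. A minimal sketch on an invented toy frame illustrates which columns the rule selects:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names are for illustration only.
toy = pd.DataFrame({
    "FLAG": [0.0, 1.0, np.nan, 1.0],       # float, two distinct values plus NaN
    "TARGET": [0.0, 1.0, np.nan, np.nan],  # label column, NaN for test rows
    "AMOUNT": [1.5, 2.5, 3.5, 4.5],        # float, more than two distinct values
    "COUNT": [0, 1, 0, 1],                 # two distinct values, but integer dtype
})

# Same rule as in the notebook: exactly two distinct non-missing values and float dtype.
bool_like = [
    col for col in toy.columns
    if toy[col].nunique() == 2 and toy[col].dtype == float
]

print(bool_like)
```

Note that the integer column is skipped even though it holds only two values, because the rule also checks for a float dtype.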


In [41]:
# Ordinal variables
app_types['REGION_RATING_CLIENT'] = vtypes.Ordinal
app_types['REGION_RATING_CLIENT_W_CITY'] = vtypes.Ordinal
app_types['HOUR_APPR_PROCESS_START'] = vtypes.Ordinal

In [49]:
previous_types = {}

for col in previous.columns:
    if (previous[col].nunique() == 2) and (previous[col].dtype == float):
        previous_types[col] = vtypes.Boolean
        
print('Number of Boolean variables: {}'.format(len(previous_types)))

Number of Boolean variables: 2


Drop `SK_ID_CURR` from `installments`, `credit`, and `cash`: these tables will be linked to the clients through `previous` via `SK_ID_PREV`, so the column is redundant.

Dropping it also keeps `featuretools` from creating useless statistical aggregations of the ids.

In [None]:
installments = installments.drop(columns = ['SK_ID_CURR'])
credit = credit.drop(columns = ['SK_ID_CURR'])
cash = cash.drop(columns = ['SK_ID_CURR'])
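
To illustrate why the dropped column is not needed, the sketch below recovers the client id for a child table through its parent using plain pandas. The frames and values are invented, and the merge only stands in for the relationship that the entity set would later encode; it is not the Featuretools API.

```python
import pandas as pd

# Toy parent table: each previous application belongs to one client.
previous = pd.DataFrame({"SK_ID_PREV": [10, 11, 12], "SK_ID_CURR": [1, 1, 2]})

# Toy child table: payments belong to previous applications.
installments = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 11, 12],
    "SK_ID_CURR": [1, 1, 1, 2],
    "AMT_PAYMENT": [100.0, 50.0, 75.0, 20.0],
})

# Drop the redundant client id, as in the notebook.
installments = installments.drop(columns=["SK_ID_CURR"])

# The client id is still reachable through the parent table via SK_ID_PREV.
linked = installments.merge(
    previous[["SK_ID_PREV", "SK_ID_CURR"]], on="SK_ID_PREV", how="left"
)
per_client = linked.groupby("SK_ID_CURR")["AMT_PAYMENT"].sum()

print(per_client)
```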