#Data Engineering: Automated Feature Engineering
Machine learning models commonly require data engineers to combine related data from multiple sources into one denormalized table. 

###Featuretools
Featuretools is an open-source Python library that automatically creates many features from a set of related tables. It is based on a method known as Deep Feature Synthesis, which builds features by stacking simple operations on top of one another. In a typical workflow, data engineers may use pandas to prepare data, Featuretools for feature engineering, and scikit-learn to train models.

[Documentation is at alteryx.com](https://featuretools.alteryx.com/en/stable/).

##Problem
Suppose a client in the lending industry wants to predict which customers are most likely to repay loans. Data from three tables is available:
1. **clients:** information about clients at a credit union
2. **loans:** previous loans taken out by the clients
3. **payments:** payments made/missed on the previous loans.

##Approach
Use Featuretools to create candidate features, combine them in a single table, and pass them to a machine learning model.

Note that some generated features may not improve the model. Be careful to avoid the *curse of dimensionality*: with many features and few observations, a model can overfit.

In [None]:
# to ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')
#Read files from: /content/drive/MyDrive/datasets/filename.ext

Mounted at /content/drive


##Understanding Data Relationships

Before using Featuretools, it's important to understand:
1. Entities and EntitySets
2. Relationships between tables
3. Feature primitives: aggregations and transformations
4. Deep feature synthesis

###Entities and EntitySets
An **entity** is a table, such as a pandas DataFrame. The observations are the rows, and the features are the columns. An **EntitySet** is a collection of tables and the relationships between them. Grouping multiple tables in an EntitySet makes them much quicker to manipulate than handling each one separately.

###Table Relationships
Relationships in Featuretools follow the same idea as in a relational database. Each relationship is one-to-many: one parent row can link to many child rows. The clients dataframe is the parent of the loans dataframe, and loans is the parent of payments. For each pair, find the variable (usually an ID) that links the two tables, then formalize the relationship in Featuretools. The client_id variable links the clients and loans tables, whereas loan_id links the loans and payments tables.
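
The parent/child idea can be checked before any relationship is declared. The following is a minimal plain-Python sketch, not Featuretools code; the miniature tables and the helper name `orphan_keys` are hypothetical and only illustrate that the parent key should be unique and that every child key should match a parent row.

```python
# Hypothetical miniature tables; the real data is loaded later in the notebook.
clients = [{"client_id": 1}, {"client_id": 2}]
loans = [
    {"loan_id": 10, "client_id": 1},
    {"loan_id": 11, "client_id": 1},
    {"loan_id": 12, "client_id": 2},
]

def orphan_keys(parent_rows, child_rows, key):
    """Return child key values that have no matching parent row."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)

# The parent key must be unique for the relationship to be one-to-many.
parent_ids = [row["client_id"] for row in clients]
is_unique = len(parent_ids) == len(set(parent_ids))

# Every loan should point at an existing client.
orphans = orphan_keys(clients, loans, "client_id")
print(is_unique, orphans)
```

An empty list of orphans together with a unique parent key suggests the pair of tables is ready to be linked.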

###Feature Primitives
A feature primitive is an operation applied to source data to create a new feature. They are simple calculations that can be chained to create complex features. There are two main categories:

1. **Aggregation:** a grouping to which a statistic is applied, such as mean, min, max, or standard deviation. An aggregation works across multiple tables using relationships between tables. An example is calculating the maximum loan amount for each client.
2. **Transformation:** an operation applied to one or more columns in a single table. An example would be extracting the day from dates, or finding the difference between two columns in one table.
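
To make the two categories concrete, here is a small plain-Python sketch under stated assumptions: the three loans are hypothetical, and the code imitates what an aggregation (maximum loan amount per client) and a transformation (month of the loan start date) compute. It is not how Featuretools implements them.

```python
from collections import defaultdict
from datetime import date

# Hypothetical miniature loans table.
loans = [
    {"client_id": 1, "loan_amount": 5000, "loan_start": date(2010, 3, 15)},
    {"client_id": 1, "loan_amount": 9000, "loan_start": date(2012, 7, 1)},
    {"client_id": 2, "loan_amount": 3000, "loan_start": date(2011, 11, 30)},
]

# Aggregation: group child rows by the parent key, then apply a statistic.
amounts_by_client = defaultdict(list)
for loan in loans:
    amounts_by_client[loan["client_id"]].append(loan["loan_amount"])
max_loan_amount = {cid: max(vals) for cid, vals in amounts_by_client.items()}

# Transformation: derive a new value per row from columns of the same table.
loan_start_month = [loan["loan_start"].month for loan in loans]

print(max_loan_amount)
print(loan_start_month)
```

The aggregation yields one value per client, while the transformation yields one value per loan.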

###Deep Feature Synthesis
Deep Feature Synthesis (DFS) stacks primitives to form features with a "depth" equal to the number of primitives. For example, LAST(loans.MEAN(payments.payment_amount)) has depth = 2 because it is made by stacking two feature primitives, both of them aggregations: MEAN is applied to the payments of each loan, and then LAST is applied to the loans of each client. This feature represents the average payment amount for the last (most recent) loan for each client.
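
The stacking can be traced by hand. The sketch below uses hypothetical miniature tables and plain Python to compute the mean payment per loan and then keep the value of each client's last loan. It assumes the loans are already in loan_start order, and it illustrates the idea rather than the library's implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical miniature tables; loans are listed in loan_start order.
loans = [
    {"loan_id": 10, "client_id": 1},
    {"loan_id": 11, "client_id": 1},
    {"loan_id": 12, "client_id": 2},
]
payments = [
    {"loan_id": 10, "payment_amount": 100},
    {"loan_id": 10, "payment_amount": 300},
    {"loan_id": 11, "payment_amount": 500},
    {"loan_id": 11, "payment_amount": 700},
    {"loan_id": 12, "payment_amount": 250},
]

# Depth 1: MEAN(payments.payment_amount) for each loan.
amounts_by_loan = defaultdict(list)
for p in payments:
    amounts_by_loan[p["loan_id"]].append(p["payment_amount"])
mean_payment = {lid: mean(v) for lid, v in amounts_by_loan.items()}

# Depth 2: LAST(loans.MEAN(payments.payment_amount)) for each client.
# Later loans overwrite earlier ones, so the final value is the last loan's.
last_mean_payment = {}
for loan in loans:
    last_mean_payment[loan["client_id"]] = mean_payment[loan["loan_id"]]

print(last_mean_payment)
```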

##Install Featuretools
Install featuretools first, as the runtime may have to be restarted afterward and all local variables recreated.

In [None]:
!pip install featuretools

Collecting featuretools
  Downloading featuretools-1.6.0-py3-none-any.whl (356 kB)

##Read Datasets

In [None]:
# Import the required libraries
import pandas as pd
import numpy as np

#### Clients Dataset

In [None]:
#Load the client dataset.
clients = pd.read_csv('/content/drive/MyDrive/datasets/clients.txt', parse_dates = ['joined'])
clients.head()

Unnamed: 0,client_id,joined,income,credit_score
0,46109,2002-04-16,172677,527
1,49545,2007-11-14,104564,770
2,41480,2013-03-11,122607,585
3,46180,2001-11-06,43851,562
4,25707,2006-10-06,211422,621


**Clients** data consist of basic information about clients at a credit union. Each client has only one row in this dataframe.

### **Loans** dataset

In [None]:
loans = pd.read_csv('/content/drive/MyDrive/datasets/loans.txt', parse_dates = ['loan_start', 'loan_end'])
loans.head()

Unnamed: 0,client_id,loan_type,loan_amount,repaid,loan_id,loan_start,loan_end,rate
0,46109,home,13672,0,10243,2002-04-16,2003-12-20,2.15
1,46109,credit,9794,0,10984,2003-10-21,2005-07-17,1.25
2,46109,home,12734,1,10990,2006-02-01,2007-07-05,0.68
3,46109,cash,12518,1,10596,2010-12-08,2013-05-05,1.24
4,46109,credit,14049,1,11415,2010-07-07,2012-05-21,3.13


The **loans** data consist of loans made to the clients. Each loan has only one row in this dataframe, but clients may have multiple loans.


###**Payments** dataset

In [None]:
payments = pd.read_csv('/content/drive/MyDrive/datasets/payments.txt', parse_dates = ['payment_date'])
payments.head()

Unnamed: 0,loan_id,payment_amount,payment_date,missed
0,10243,2369,2002-05-31,1
1,10243,2439,2002-06-18,1
2,10243,2662,2002-06-29,0
3,10243,2268,2002-07-20,0
4,10243,2027,2002-07-31,1


The **Payments** data contains the payments made on the loans. Each payment has only one row but each loan will have multiple payments.

## Combining Dataframes
With the three datasets loaded, use Featuretools to combine them. Here's how. 
1. Instantiate an EntitySet and store it in the variable "es". 
2. Add each dataframe with es.add_dataframe. Provide 4 things: 
    - the original dataset: dataframe = dataset
    - a name for the dataframe: dataframe_name = "some name"
    - the index of the dataset: index = 'some_id'
    - the time index: time_index = 'some date field'

In [None]:
import featuretools as ft
# Create new entityset
es = ft.EntitySet(id = 'clients')



Each entity must have an index which is a column with unique elements.

In [None]:
# Create an entity from the clients dataframe.
# The dataframe already has an index (client_id) and a time index (the joined column).
es = es.add_dataframe(dataframe = clients,
                      dataframe_name = 'clients', 
                      index = 'client_id', 
                      time_index = 'joined')

Featuretools will attempt to infer types for any columns that do not have types defined by the user.
* Automatic type inferencing is provided by the Woodwork library. 
* Optionally, bypass inferencing by passing logical types in a dictionary when calling add_dataframe.

In [None]:
#We create an entity from the loan dataframe
# This dataframe already has an index and a time index
es = es.add_dataframe(dataframe_name = 'loans', 
                      dataframe = loans,
                      index = 'loan_id',
                      time_index = 'loan_start')

In [None]:
#We create an entity from the payments dataframe
es = es.add_dataframe(dataframe_name = 'payments', 
                      dataframe = payments,
                      make_index = True,
                      index = 'payment_id',
                      time_index = 'payment_date')
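
The payments file has no column that uniquely identifies a row, which is why the cell above passes `make_index = True`. As a rough plain-Python illustration (an assumption about the effect, not the library's implementation, and with hypothetical rows), creating an index amounts to numbering the rows:

```python
# Hypothetical payments rows without a unique identifier.
payments = [
    {"loan_id": 10243, "payment_amount": 2369},
    {"loan_id": 10243, "payment_amount": 2439},
    {"loan_id": 10596, "payment_amount": 2369},
]

# loan_id repeats, so it cannot serve as the index of this table.
loan_ids = [row["loan_id"] for row in payments]
loan_id_is_unique = len(loan_ids) == len(set(loan_ids))

# Number the rows to create a surrogate key, here named payment_id.
indexed = [{"payment_id": i, **row} for i, row in enumerate(payments)]
payment_ids = [row["payment_id"] for row in indexed]

print(loan_id_is_unique, payment_ids)
```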

View Results

In [None]:
es['clients']

Unnamed: 0,client_id,joined,income,credit_score
42320,42320,2000-04-27,229481,563
39384,39384,2000-06-18,191204,617
26945,26945,2000-11-26,214516,806
41472,41472,2001-11-06,152214,638
46180,46180,2001-11-06,43851,562
46109,46109,2002-04-16,172677,527
32885,32885,2002-05-13,58955,642
29841,29841,2002-08-17,38354,523
38537,38537,2002-10-21,127183,643
35214,35214,2003-08-08,95849,696


In [None]:
clients.shape
#Note: the feature matrix built later will also have 25 rows, one per client, after the loans and payments tables are combined.

(25, 4)

In [None]:
#Use the Woodwork library to view the metadata created by Featuretools.
es['clients'].ww

Column,Physical Type,Logical Type,Semantic Tag(s)
client_id,int64,Integer,['index']
joined,datetime64[ns],Datetime,['time_index']
income,int64,Integer,['numeric']
credit_score,int64,Integer,['numeric']


In [None]:
es['loans']

Unnamed: 0,client_id,loan_type,loan_amount,repaid,loan_id,loan_start,loan_end,rate
11140,39505,home,2274,1,11140,2000-01-26,2002-01-29,1.00
11251,26326,home,2847,1,11251,2000-03-06,2001-09-26,1.32
10816,49545,home,8354,1,10816,2000-03-08,2001-08-02,0.45
11965,29841,credit,6012,0,11965,2000-03-25,2002-07-10,4.63
10166,41472,home,13657,1,10166,2000-04-11,2001-09-08,5.68
...,...,...,...,...,...,...,...,...
11595,35089,other,773,1,11595,2014-09-26,2016-04-23,7.63
10985,26695,other,13918,1,10985,2014-10-03,2016-10-25,0.90
10684,48177,credit,659,1,10684,2014-10-05,2017-01-16,1.52
10131,49068,other,10082,1,10131,2014-10-10,2016-05-25,0.63


In [None]:
es['loans'].ww

Column,Physical Type,Logical Type,Semantic Tag(s)
client_id,int64,Integer,['numeric']
loan_type,category,Categorical,['category']
loan_amount,int64,Integer,['numeric']
repaid,int64,Integer,['numeric']
loan_id,int64,Integer,['index']
loan_start,datetime64[ns],Datetime,['time_index']
loan_end,datetime64[ns],Datetime,[]
rate,float64,Double,['numeric']


In [None]:
es['payments']

Unnamed: 0,payment_id,loan_id,payment_amount,payment_date,missed
2113,2113,11988,2053,2000-03-05,0
726,726,11140,402,2000-03-19,0
2114,2114,11988,2627,2000-03-30,0
3223,3223,11430,1284,2000-04-05,0
2115,2115,11988,1911,2000-04-11,1
...,...,...,...,...,...
1415,1415,11072,957,2015-07-01,0
1308,1308,10684,115,2015-07-06,0
1416,1416,11072,988,2015-07-14,1
1417,1417,11072,940,2015-07-29,0


In [None]:
es['payments'].ww

Column,Physical Type,Logical Type,Semantic Tag(s)
payment_id,int64,Integer,['index']
loan_id,int64,Integer,['numeric']
payment_amount,int64,Integer,['numeric']
payment_date,datetime64[ns],Datetime,['time_index']
missed,int64,Integer,['numeric']


<br>

The next step is to specify how the tables in the entityset are related.

##Defining Relationships
With the three entities created, define relationships. Here's how.
1. Pass the dataframes and column names to EntitySet.add_relationship.
2. Identify the parent_dataframe and parent_column_name (usually primary key)
3. Identify the child_dataframe and child_column_name (usually foreign key from parent)

For each parameter, provide the string representation of the name. 

In [None]:
es.add_relationship(parent_dataframe_name = 'clients',
                    parent_column_name = 'client_id',
                    child_dataframe_name = 'loans',
                    child_column_name = 'client_id')

Entityset: clients
  DataFrames:
    clients [Rows: 25, Columns: 4]
    loans [Rows: 443, Columns: 8]
    payments [Rows: 3456, Columns: 5]
  Relationships:
    loans.client_id -> clients.client_id

In [None]:
es.add_relationship(parent_dataframe_name = 'loans',
                    parent_column_name = 'loan_id',
                    child_dataframe_name = 'payments',
                    child_column_name = 'loan_id')

Entityset: clients
  DataFrames:
    clients [Rows: 25, Columns: 4]
    loans [Rows: 443, Columns: 8]
    payments [Rows: 3456, Columns: 5]
  Relationships:
    loans.client_id -> clients.client_id
    payments.loan_id -> loans.loan_id

In [None]:
#ALTERNATIVELY: Create the relationship first, then add it to es using the
#relationship keyword.

# Define relationship between clients and loans
#client_loan = ft.Relationship(es, 'clients', 'client_id',
#                              'loans', 'client_id')

# Add the relationship to the entity set
#es = es.add_relationship(relationship = client_loan)

# Relationship between loans and payments
#loans_payment = ft.Relationship(es, 'loans', 'loan_id',
#                                'payments', 'loan_id')

# Add the relationship to the entity set
#es = es.add_relationship(relationship = loans_payment)

The next step is to create new features using stacks of **feature primitives**.

##Feature Primitives

Featuretools provides a list of the **primitives** it can apply. Filter the list by type: aggregation or transform.

In [None]:
primitives = ft.list_primitives()
pd.options.display.max_colwidth = 80
primitives[primitives['type'] == 'aggregation'].head(20)

Unnamed: 0,name,type,dask_compatible,koalas_compatible,description,valid_inputs,return_type
0,std,aggregation,True,True,"Computes the dispersion relative to the mean value, ignoring `NaN`.",<ColumnSchema (Semantic Tags = ['numeric'])>,
1,median,aggregation,False,False,Determines the middlemost number in a list of values.,<ColumnSchema (Semantic Tags = ['numeric'])>,
2,avg_time_between,aggregation,False,False,Computes the average number of seconds between consecutive events.,<ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>,
3,percent_true,aggregation,True,False,Determines the percent of `True` values.,"<ColumnSchema (Logical Type = Boolean)>, <ColumnSchema (Logical Type = Boole...",
4,num_unique,aggregation,True,True,"Determines the number of distinct values, ignoring `NaN` values.",<ColumnSchema (Semantic Tags = ['category'])>,
5,min,aggregation,True,True,"Calculates the smallest value, ignoring `NaN` values.",<ColumnSchema (Semantic Tags = ['numeric'])>,
6,trend,aggregation,False,False,Calculates the trend of a column over time.,"<ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>, <...",
7,mean,aggregation,True,True,Computes the average for a list of values.,<ColumnSchema (Semantic Tags = ['numeric'])>,
8,time_since_last,aggregation,False,False,Calculates the time elapsed since the last datetime (default in seconds).,<ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>,
9,count,aggregation,True,True,"Determines the total number of values, excluding `NaN`.",<ColumnSchema (Semantic Tags = ['index'])>,


In [None]:
primitives[primitives['type'] == 'transform'].head(30)

Unnamed: 0,name,type,dask_compatible,koalas_compatible,description
22,add_numeric,transform,True,True,Element-wise addition of two lists.
23,cum_count,transform,False,False,Calculates the cumulative count.
24,weekday,transform,True,True,Determines the day of the week from a datetime.
25,not_equal,transform,True,False,Determines if values in one list are not equal to another list.
26,or,transform,True,True,Element-wise logical OR of two lists.
27,year,transform,True,True,Determines the year value of a datetime.
28,equal,transform,True,True,Determines if values in one list are equal to another list.
29,modulo_by_feature,transform,True,True,Return the modulo of a scalar by each element in the list.
30,week,transform,True,True,Determines the week of the year from a datetime.
31,haversine,transform,False,False,Calculates the approximate haversine distance between two LatLong


##Creating New Features
You can specify primitives to be created or rely on FeatureTools to create all automatically.

In [None]:
# Create new features using specified primitives. ft.dfs returns the feature
# matrix and the list of feature definitions; with features_only = True it
# returns only the definitions.
features, feature_names = ft.dfs(entityset = es, 
                                 target_dataframe_name = 'clients', 
                                 agg_primitives = ['mean','sum', 'max', 'percent_true', 'count','last'],
                                 trans_primitives = ["day", "year", "month", "weekday"])

  agg_primitives: ['percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data.
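
The primitive list shown earlier gives Boolean as the valid input type for `percent_true`, while `repaid` and `missed` were inferred as Integer, which is consistent with the primitive going unused. A plain-Python sketch of what the primitive would compute if a 0/1 column were treated as Boolean (the sample values are hypothetical):

```python
# Hypothetical 0/1 flags for one client's loans, as stored in the CSV.
repaid_flags = [0, 0, 1, 1, 1]

# Treat the integers as Booleans, which is the input type percent_true expects.
repaid_bools = [bool(flag) for flag in repaid_flags]

# Fraction of True values among the client's loans.
percent_true = sum(repaid_bools) / len(repaid_bools)
print(percent_true)
```

In the notebook, one possible way to get there is to convert such columns with pandas `astype(bool)` before adding the dataframes, or to pass logical types to add_dataframe as mentioned earlier; either route is a suggestion rather than something this notebook does.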


In [None]:
len(feature_names)

96

In [None]:
features.head()

Unnamed: 0_level_0,income,credit_score,COUNT(loans),LAST(loans.loan_amount),LAST(loans.loan_id),LAST(loans.loan_type),LAST(loans.rate),LAST(loans.repaid),MAX(loans.loan_amount),MAX(loans.rate),MAX(loans.repaid),MEAN(loans.loan_amount),MEAN(loans.rate),MEAN(loans.repaid),SUM(loans.loan_amount),SUM(loans.rate),SUM(loans.repaid),COUNT(payments),LAST(payments.missed),LAST(payments.payment_amount),LAST(payments.payment_id),MAX(payments.missed),MAX(payments.payment_amount),MEAN(payments.missed),MEAN(payments.payment_amount),SUM(payments.missed),SUM(payments.payment_amount),DAY(joined),MONTH(joined),WEEKDAY(joined),YEAR(joined),LAST(loans.COUNT(payments)),LAST(loans.DAY(loan_end)),LAST(loans.DAY(loan_start)),LAST(loans.MAX(payments.missed)),LAST(loans.MAX(payments.payment_amount)),LAST(loans.MEAN(payments.missed)),LAST(loans.MEAN(payments.payment_amount)),LAST(loans.MONTH(loan_end)),LAST(loans.MONTH(loan_start)),...,MEAN(loans.LAST(payments.payment_amount)),MEAN(loans.LAST(payments.payment_id)),MEAN(loans.MAX(payments.missed)),MEAN(loans.MAX(payments.payment_amount)),MEAN(loans.MEAN(payments.missed)),MEAN(loans.MEAN(payments.payment_amount)),MEAN(loans.SUM(payments.missed)),MEAN(loans.SUM(payments.payment_amount)),SUM(loans.LAST(payments.missed)),SUM(loans.LAST(payments.payment_amount)),SUM(loans.LAST(payments.payment_id)),SUM(loans.MAX(payments.missed)),SUM(loans.MAX(payments.payment_amount)),SUM(loans.MEAN(payments.missed)),SUM(loans.MEAN(payments.payment_amount)),LAST(payments.loans.loan_amount),LAST(payments.loans.loan_type),LAST(payments.loans.rate),LAST(payments.loans.repaid),MAX(payments.loans.loan_amount),MAX(payments.loans.rate),MAX(payments.loans.repaid),MEAN(payments.loans.loan_amount),MEAN(payments.loans.rate),MEAN(payments.loans.repaid),SUM(payments.loans.loan_amount),SUM(payments.loans.rate),SUM(payments.loans.repaid),DAY(LAST(loans.loan_end)),DAY(LAST(loans.loan_start)),DAY(LAST(payments.payment_date)),MONTH(LAST(loans.loan_end)),MONTH(LAST(loans.loan_start)
),MONTH(LAST(payments.payment_date)),WEEKDAY(LAST(loans.loan_end)),WEEKDAY(LAST(loans.loan_start)),WEEKDAY(LAST(payments.payment_date)),YEAR(LAST(loans.loan_end)),YEAR(LAST(loans.loan_start)),YEAR(LAST(payments.payment_date))
client_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
42320,229481,563,15,8090,10156,home,3.18,0,13887.0,6.74,1.0,7062.066667,2.457333,0.6,105931.0,36.86,9.0,120,1,1082,1542,1.0,2769.0,0.516667,1021.483333,62.0,122578.0,27,4,3,2000,6,17,22,1.0,1484.0,0.5,1192.333333,4,9,...,1001.733333,1714.933333,1.0,1325.066667,0.50373,1042.907672,4.133333,8171.866667,4.0,15026.0,25724.0,15.0,19876.0,7.555952,15643.615079,8090,home,3.18,0,13887.0,6.74,1.0,7028.058333,2.523667,0.583333,843367.0,302.84,70.0,17,22,12,4,9,4,4,5,4,2015,2012,2013
39384,191204,617,19,14654,11735,other,2.26,0,14654.0,9.23,1.0,7865.473684,3.538421,0.631579,149444.0,67.23,12.0,146,1,2045,2562,1.0,2822.0,0.513699,1193.630137,75.0,174270.0,18,6,6,2000,7,13,21,1.0,2822.0,0.428571,2311.285714,3,7,...,1265.0,2565.526316,1.0,1503.473684,0.508649,1194.628227,3.947368,9172.105263,11.0,24035.0,48745.0,19.0,28566.0,9.664332,22697.936321,14654,other,2.26,0,14654.0,9.23,1.0,7957.130137,3.41863,0.575342,1161741.0,499.12,84.0,13,21,13,3,7,3,6,0,4,2016,2014,2015
26945,214516,806,15,9249,11482,cash,2.86,1,14593.0,5.65,1.0,7125.933333,2.855333,0.4,106889.0,42.83,6.0,112,1,1597,3340,1.0,2768.0,0.508929,1109.473214,57.0,124261.0,26,11,6,2000,6,11,24,1.0,1834.0,0.833333,1598.666667,5,12,...,1237.4,3390.4,1.0,1411.6,0.511431,1115.150112,3.8,8284.066667,11.0,18561.0,50856.0,15.0,21174.0,7.671459,16727.251679,9249,cash,2.86,1,14593.0,5.65,1.0,6884.401786,2.947589,0.339286,771053.0,330.13,38.0,11,24,30,5,12,7,2,1,2,2016,2013,2014
41472,152214,638,16,10122,11936,cash,1.03,0,13657.0,9.82,1.0,7510.8125,3.98125,0.5,120173.0,63.7,8.0,105,0,1453,3129,1.0,2436.0,0.485714,1129.07619,51.0,118553.0,6,11,1,2001,5,22,6,0.0,1803.0,0.0,1427.0,6,8,...,1076.9375,3112.0,0.9375,1382.6875,0.483904,1133.414162,3.1875,7409.5625,5.0,17231.0,49792.0,15.0,22123.0,7.74246,18134.626587,10122,cash,1.03,0,13657.0,9.82,1.0,7473.628571,4.146286,0.533333,784731.0,435.36,56.0,22,6,21,6,8,3,2,2,5,2016,2014,2015
46180,43851,562,20,3834,10887,other,1.38,0,14081.0,9.26,1.0,7700.85,3.5025,0.5,154017.0,70.05,10.0,149,1,697,437,1.0,2660.0,0.416107,1186.550336,62.0,176796.0,6,11,1,2001,8,24,3,1.0,706.0,0.375,557.125,2,5,...,1176.4,527.85,0.95,1491.45,0.408829,1163.803383,3.1,8839.8,9.0,23528.0,10557.0,19.0,29829.0,8.176587,23276.067659,3834,other,1.38,0,14081.0,9.26,1.0,7668.899329,3.882081,0.496644,1142666.0,578.43,74.0,24,3,2,2,5,2,2,5,0,2016,2014,2015


In [None]:
#Get row
features.loc[42320]

income                                  229481
credit_score                               563
COUNT(loans)                                15
LAST(loans.loan_amount)                   8090
LAST(loans.loan_id)                      10156
                                         ...  
WEEKDAY(LAST(loans.loan_start))              5
WEEKDAY(LAST(payments.payment_date))         4
YEAR(LAST(loans.loan_end))                2015
YEAR(LAST(loans.loan_start))              2012
YEAR(LAST(payments.payment_date))         2013
Name: 42320, Length: 96, dtype: object

In [None]:
#Get column
features['income'][:5]

client_id
42320    229481
39384    191204
26945    214516
41472    152214
46180     43851
Name: income, dtype: int64

In [None]:
#Get column and display as dataframe
pd.DataFrame(features['MEAN(payments.payment_amount)'].head())

client_id,MEAN(payments.payment_amount)
42320,1021.483333
39384,1193.630137
26945,1109.473214
41472,1129.07619
46180,1186.550336


Featuretools has created new features by combining and stacking the primitives.

<br>

##**Deep Feature Synthesis**

In [None]:
pd.DataFrame(features['LAST(loans.MEAN(payments.payment_amount))'].head(10))

client_id,LAST(loans.MEAN(payments.payment_amount))
42320,1192.333333
39384,2311.285714
26945,1598.666667
41472,1427.0
46180,557.125
46109,1708.875
32885,1729.0
29841,1125.5
38537,1348.833333
35214,1410.25


##**Automated Deep Feature Synthesis**
Featuretools can create features automatically if no primitives are specified in advance. Instead, specify the *max_depth* parameter and allow Featuretools to build combinations of its default feature primitives up to that depth.

In [None]:
# Do not specify primitives; instead specify the maximum depth.
auto_features, auto_feature_names = ft.dfs(entityset = es, 
                                           target_dataframe_name='clients', 
                                           max_depth = 4)
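
As a quick way to see how deep a generated feature is, the primitives in its name can be counted. The helper below, `primitive_depth`, is a hypothetical illustration that follows this notebook's definition of depth (number of stacked primitives); it may differ from how the library tracks depth internally.

```python
import re

def primitive_depth(feature_name):
    """Count the primitives stacked in a feature name such as 'MAX(loans.rate)'."""
    # A primitive appears as an upper-case name immediately followed by "(".
    return len(re.findall(r"[A-Z_]+\(", feature_name))

# Feature names taken from the outputs in this notebook.
names = [
    "income",
    "MEAN(loans.loan_amount)",
    "LAST(loans.MEAN(payments.payment_amount))",
    "MAX(loans.NUM_UNIQUE(payments.DAY(payment_date)))",
]
depths = {name: primitive_depth(name) for name in names}
print(depths)
```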

In [None]:
auto_features.head()

Unnamed: 0_level_0,income,credit_score,COUNT(loans),MAX(loans.loan_amount),MAX(loans.rate),MAX(loans.repaid),MEAN(loans.loan_amount),MEAN(loans.rate),MEAN(loans.repaid),MIN(loans.loan_amount),MIN(loans.rate),MIN(loans.repaid),MODE(loans.loan_type),NUM_UNIQUE(loans.loan_type),SKEW(loans.loan_amount),SKEW(loans.rate),SKEW(loans.repaid),STD(loans.loan_amount),STD(loans.rate),STD(loans.repaid),SUM(loans.loan_amount),SUM(loans.rate),SUM(loans.repaid),COUNT(payments),MAX(payments.missed),MAX(payments.payment_amount),MEAN(payments.missed),MEAN(payments.payment_amount),MIN(payments.missed),MIN(payments.payment_amount),SKEW(payments.missed),SKEW(payments.payment_amount),STD(payments.missed),STD(payments.payment_amount),SUM(payments.missed),SUM(payments.payment_amount),DAY(joined),MONTH(joined),WEEKDAY(joined),YEAR(joined),...,SKEW(payments.loans.rate),SKEW(payments.loans.repaid),STD(payments.loans.loan_amount),STD(payments.loans.rate),STD(payments.loans.repaid),SUM(payments.loans.loan_amount),SUM(payments.loans.rate),SUM(payments.loans.repaid),MAX(loans.NUM_UNIQUE(payments.DAY(payment_date))),MAX(loans.NUM_UNIQUE(payments.MONTH(payment_date))),MAX(loans.NUM_UNIQUE(payments.WEEKDAY(payment_date))),MAX(loans.NUM_UNIQUE(payments.YEAR(payment_date))),MEAN(loans.NUM_UNIQUE(payments.DAY(payment_date))),MEAN(loans.NUM_UNIQUE(payments.MONTH(payment_date))),MEAN(loans.NUM_UNIQUE(payments.WEEKDAY(payment_date))),MEAN(loans.NUM_UNIQUE(payments.YEAR(payment_date))),MIN(loans.NUM_UNIQUE(payments.DAY(payment_date))),MIN(loans.NUM_UNIQUE(payments.MONTH(payment_date))),MIN(loans.NUM_UNIQUE(payments.WEEKDAY(payment_date))),MIN(loans.NUM_UNIQUE(payments.YEAR(payment_date))),MODE(loans.MODE(payments.DAY(payment_date))),MODE(loans.MODE(payments.MONTH(payment_date))),MODE(loans.MODE(payments.WEEKDAY(payment_date))),MODE(loans.MODE(payments.YEAR(payment_date))),NUM_UNIQUE(loans.MODE(payments.DAY(payment_date))),NUM_UNIQUE(loans.MODE(payments.MONTH(payment_date))),NUM_UNIQUE(loans.MODE(payments.WE
EKDAY(payment_date))),NUM_UNIQUE(loans.MODE(payments.YEAR(payment_date))),SKEW(loans.NUM_UNIQUE(payments.DAY(payment_date))),SKEW(loans.NUM_UNIQUE(payments.MONTH(payment_date))),SKEW(loans.NUM_UNIQUE(payments.WEEKDAY(payment_date))),SKEW(loans.NUM_UNIQUE(payments.YEAR(payment_date))),STD(loans.NUM_UNIQUE(payments.DAY(payment_date))),STD(loans.NUM_UNIQUE(payments.MONTH(payment_date))),STD(loans.NUM_UNIQUE(payments.WEEKDAY(payment_date))),STD(loans.NUM_UNIQUE(payments.YEAR(payment_date))),SUM(loans.NUM_UNIQUE(payments.DAY(payment_date))),SUM(loans.NUM_UNIQUE(payments.MONTH(payment_date))),SUM(loans.NUM_UNIQUE(payments.WEEKDAY(payment_date))),SUM(loans.NUM_UNIQUE(payments.YEAR(payment_date)))
client_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
42320,229481,563,15,13887.0,6.74,1.0,7062.066667,2.457333,0.6,1070.0,0.38,0.0,home,4,0.185406,0.993713,-0.455083,4165.826885,1.984938,0.507093,105931.0,36.86,9.0,120,1.0,2769.0,0.516667,1021.483333,0.0,130.0,-0.067551,0.613867,0.501817,640.794189,62.0,122578.0,27,4,3,2000,...,0.872916,-0.342356,3929.351652,1.891948,0.495074,843367.0,302.84,70.0,12.0,11.0,6.0,3.0,6.666667,6.533333,4.533333,1.6,3.0,4.0,2.0,1.0,11,3,0,2003,11,9,7,9,0.664732,0.803159,-0.378644,0.547317,2.225395,1.84649,1.355764,0.632456,100.0,98.0,68.0,24.0
39384,191204,617,19,14654.0,9.23,1.0,7865.473684,3.538421,0.631579,1770.0,0.43,0.0,credit,4,-0.242626,0.992152,-0.593464,3964.28684,2.629599,0.495595,149444.0,67.23,12.0,146,1.0,2822.0,0.513699,1193.630137,0.0,195.0,-0.055386,0.234089,0.501533,630.722531,75.0,174270.0,18,6,6,2000,...,0.971757,-0.308024,3765.368569,2.685701,0.495992,1161741.0,499.12,84.0,12.0,9.0,7.0,4.0,6.789474,6.105263,4.789474,1.684211,4.0,3.0,3.0,1.0,3,1,1,2014,14,9,5,9,0.972597,-0.193919,0.002498,1.603463,2.043389,1.559727,1.182227,0.945905,129.0,116.0,91.0,32.0
26945,214516,806,15,14593.0,5.65,1.0,7125.933333,2.855333,0.4,653.0,0.13,0.0,credit,4,0.174492,-0.002227,0.455083,4543.621769,1.619717,0.507093,106889.0,42.83,6.0,112,1.0,2768.0,0.508929,1109.473214,0.0,69.0,-0.036207,0.421394,0.502167,769.466481,57.0,124261.0,26,11,6,2000,...,0.020986,0.688133,4372.558827,1.599556,0.475595,771053.0,330.13,38.0,10.0,10.0,7.0,3.0,6.6,6.0,4.466667,1.6,5.0,3.0,3.0,1.0,1,3,4,2002,11,7,6,9,0.915046,0.366335,0.726471,0.547317,1.549193,1.889822,1.245946,0.632456,99.0,90.0,67.0,24.0
41472,152214,638,16,13657.0,9.82,1.0,7510.8125,3.98125,0.5,986.0,0.01,0.0,cash,4,-0.075884,0.416789,0.0,4257.668536,3.198366,0.516398,120173.0,63.7,8.0,105,1.0,2436.0,0.485714,1129.07619,0.0,108.0,0.057998,0.091418,0.502193,669.804703,51.0,118553.0,6,11,1,2001,...,0.311596,-0.135575,4250.127784,3.129716,0.50128,784731.0,435.36,56.0,8.0,8.0,7.0,2.0,5.875,5.625,4.4375,1.6875,4.0,3.0,3.0,1.0,1,1,1,2002,13,8,7,11,0.073805,0.136238,0.705987,-0.895257,1.360147,1.258306,1.093542,0.478714,94.0,90.0,71.0,27.0
46180,43851,562,20,14081.0,9.26,1.0,7700.85,3.5025,0.5,1607.0,0.57,0.0,other,4,0.081292,0.945069,0.0,3835.726436,2.550263,0.512989,154017.0,70.05,10.0,149,1.0,2660.0,0.416107,1186.550336,0.0,163.0,0.343868,0.301708,0.494574,566.732413,62.0,176796.0,6,11,1,2001,...,0.649145,0.01356,3815.702946,2.714351,0.501675,1142666.0,578.43,74.0,15.0,8.0,6.0,3.0,6.65,5.8,4.6,1.5,5.0,3.0,2.0,1.0,1,1,0,2014,12,7,7,11,3.16417,-0.086708,-0.683319,0.784528,2.183069,1.281447,1.095445,0.606977,133.0,116.0,92.0,30.0


In [None]:
auto_features.shape

(25, 183)

The loans and payments data were collapsed into 25 rows, one per client, by aggregating each client's loans and payments into features.
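
With 183 columns for only 25 clients, some pruning is likely needed before modeling, as the earlier note on the curse of dimensionality suggests. One simple first step is to drop columns that hold a single value; for instance, `MAX(payments.missed)` is 1.0 in every row displayed by `auto_features.head()`. A plain-Python sketch on a small slice of the matrix (the helper `drop_constant_columns` is illustrative, not a Featuretools function):

```python
# Small slice of a feature matrix: one dict per client.
feature_rows = [
    {"income": 229481, "MAX(payments.missed)": 1.0, "COUNT(loans)": 15},
    {"income": 191204, "MAX(payments.missed)": 1.0, "COUNT(loans)": 19},
    {"income": 214516, "MAX(payments.missed)": 1.0, "COUNT(loans)": 15},
]

def drop_constant_columns(rows):
    """Remove columns whose value is identical in every row."""
    keep = [col for col in rows[0] if len({row[col] for row in rows}) > 1]
    return [{col: row[col] for col in keep} for row in rows]

pruned = drop_constant_columns(feature_rows)
print(list(pruned[0]))
```

A constant column carries no information for distinguishing clients, so removing it shrinks the table without losing signal.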