# Deep feature synthesis

We first generate two datasets: a monthly set of transactions and a set of static loan data.

The final dataset consists of features engineered automatically from these two datasets.
We apply three transformations (addition, multiplication and division) that combine the loan data with different aggregations of the monthly transactions data.
Each transformation generates a large number of features, so we then run a variable selection step.
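
To make the idea concrete, the following is a minimal pandas-only sketch of how one such feature could be built by hand: aggregate a monthly column per loan, join it to the static loan data, then combine two columns with a transformation. The toy frames and their values are made up for illustration, and this is not how featuretools computes features internally.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are invented for illustration).
loans = pd.DataFrame({"loan_id": [1, 2], "payments": [100.0, 200.0]})
trans = pd.DataFrame({
    "loan_id": [1, 1, 2, 2],
    "month": [1, 2, 1, 2],
    "balance_mean": [10.0, 30.0, 5.0, 15.0],
})

# Step 1: aggregate the monthly rows down to one row per loan.
agg = trans.groupby("loan_id")["balance_mean"].agg(["min", "mean", "max"])
agg.columns = [f"{name.upper()}(trans.balance_mean)" for name in agg.columns]

# Step 2: join the aggregates onto the static loan data.
features = loans.set_index("loan_id").join(agg)

# Step 3: apply a transformation (here multiplication) to a pair of columns.
features["MEAN(trans.balance_mean) * payments"] = (
    features["MEAN(trans.balance_mean)"] * features["payments"]
)
print(features)
```

Repeating step 3 over every pair of columns is what makes the feature count grow so quickly.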

In [1]:
import featuretools as ft
import pandas as pd
import pickle

In [2]:
!python dfs_prep.py



In [19]:
with open('dfs_data_loans', 'rb') as file:
    loans = pickle.load(file, encoding="latin1")

with open('dfs_data_trans', 'rb') as file:
    trans = pickle.load(file, encoding="latin1")

In [20]:
y = loans.target
loans_x = loans.drop('target', axis=1)
trans.reset_index(inplace=True)
trans

,index,loan_id,month,amount_trans_sum,amount_trans_mean,amount_trans_max,amount_trans_min,balance_sum,balance_mean,balance_max,...,interest_max,interest_min,c_deposit_sum,c_deposit_mean,c_deposit_max,c_deposit_min,c_withdr_sum,c_withdr_mean,c_withdr_max,c_withdr_min
0,0,4959,23918,2200.0,1100.000000,1100.0,1100.0,2200.0,1100.000000,1100.0,...,0,0,2,1.000000,1,1,0,0.000000,0,0
1,1,4959,23919,47899.0,7983.166667,20236.0,13.5,142843.0,23807.166667,25049.5,...,1,0,2,0.333333,1,0,0,0.000000,0,0
2,2,4959,23920,62691.0,10448.500000,20236.0,109.5,227931.8,37988.633333,45285.5,...,1,0,0,0.000000,0,0,2,0.333333,1,0
3,3,4959,23921,75961.4,12660.233333,20236.0,144.7,257674.8,42945.800000,54630.9,...,1,0,0,0.000000,0,0,2,0.333333,1,0
4,4,4959,23922,105827.8,17637.966667,30354.0,159.9,315897.4,52649.566667,67529.6,...,1,0,0,0.000000,0,0,2,0.333333,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9426,9426,7308,23960,34251.2,3805.688889,16141.0,56.0,206105.1,22900.566667,30208.2,...,1,0,1,0.111111,1,0,3,0.333333,1,0
9427,9427,7308,23961,31309.4,3913.675000,16141.0,56.0,195245.5,24405.687500,31014.9,...,1,0,1,0.125000,1,0,2,0.250000,1,0
9428,9428,7308,23962,29102.8,3233.644444,16141.0,56.0,242661.2,26962.355556,32119.6,...,1,0,1,0.111111,1,0,3,0.333333,1,0
9429,9429,7308,23963,29730.3,3716.287500,16141.0,56.0,200993.7,25124.212500,30857.8,...,1,0,1,0.125000,1,0,2,0.250000,1,0


### Create entityset, generate features

In [21]:
# Create the entity set and register both dataframes as entities
es = ft.EntitySet(id='loan')
es.entity_from_dataframe(entity_id='loan', dataframe=loans_x, index='loan_id')
es.entity_from_dataframe(entity_id='trans',
                         dataframe=trans,
                         index='index',
                         time_index='month')

# Each loan (parent) has many monthly transaction rows (child)
rl = ft.Relationship(es['loan']['loan_id'],
                     es['trans']['loan_id'])

# Add the relationship to the entity set
es = es.add_relationship(rl)

In [23]:
mult_feature_matrix, _ = ft.dfs(entityset=es,
                       target_entity = 'loan',
                       agg_primitives = ['min','mean','max'],
                       trans_primitives = ['multiply_numeric'],
                       max_depth = 2, 
                       verbose = 1, 
                       n_jobs = -1)

Built 12900 features

EntitySet scattered to 2 workers in 5 seconds
Elapsed: 00:57 | Progress: 100%

In [24]:
mult_feature_matrix = mult_feature_matrix.loc[:,(mult_feature_matrix.isna().sum() == 0)]
mult_feature_matrix = mult_feature_matrix.loc[:,(mult_feature_matrix.var() > 0)]
mult_feature_matrix

loan_id,amount,duration,payments,A4,A5,A6,A7,A8,A9,A10,...,MAX(trans.c_withdr_max) * MEAN(trans.b_withdr_mean),MEAN(trans.b_deposit_sum) * MEAN(trans.hhold_mean),MAX(trans.b_withdr_sum) * MEAN(trans.balance_sum),MEAN(trans.balance_min) * payments,MEAN(trans.interest_max) * MIN(trans.amount_trans_min),accnt_age * MIN(trans.interest_max),MAX(trans.hhold_max) * MEAN(trans.insur_max),MEAN(trans.hhold_sum) * MIN(trans.c_withdr_mean),MAX(trans.amount_trans_mean) * MIN(trans.c_deposit_max),A7 * MEAN(trans.c_deposit_sum)
4959,80952.0,24.0,3373.0,1204953.0,0.0,0.0,0.0,1.0,1.0,100.0,...,0.000000,0.141873,0.000000,8.348865e+07,12.272727,0.000000,0.000000,0.000000,0.000000,0.000000
4961,30276.0,12.0,2523.0,103347.0,87.0,16.0,7.0,1.0,7.0,67.0,...,0.000000,0.098901,0.000000,4.094421e+07,43.476923,0.000000,0.000000,0.000000,0.000000,4.846154
4962,30276.0,12.0,2523.0,228848.0,15.0,40.0,18.0,2.0,6.0,57.2,...,0.111655,0.000000,531748.022222,8.815989e+07,112.333333,0.000000,0.777778,0.000000,0.000000,36.000000
4967,318480.0,60.0,5308.0,70646.0,94.0,14.0,3.0,1.0,4.0,58.4,...,0.105849,0.000000,282637.973333,1.244936e+08,0.373333,0.000000,0.800000,0.000000,0.000000,3.600000
4968,110736.0,48.0,2307.0,51428.0,50.0,11.0,3.0,1.0,4.0,52.7,...,0.076058,0.057540,177773.300000,5.506504e+07,42.844444,0.000000,0.444444,0.000000,0.000000,2.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7294,39168.0,24.0,1632.0,94725.0,38.0,28.0,1.0,3.0,6.0,63.4,...,0.000000,0.000000,0.000000,6.837736e+07,41.166667,0.000000,0.000000,0.000000,14086.466667,1.083333
7295,280440.0,60.0,4674.0,387570.0,0.0,0.0,0.0,1.0,1.0,100.0,...,0.000000,0.000000,0.000000,1.202509e+08,58.176190,0.000000,0.000000,0.000000,0.000000,0.000000
7304,419880.0,60.0,6998.0,1204953.0,0.0,0.0,0.0,1.0,1.0,100.0,...,0.000000,0.055556,0.000000,3.014256e+08,26.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7305,54024.0,12.0,4502.0,117897.0,139.0,28.0,5.0,1.0,6.0,53.8,...,0.118851,0.118851,214978.309091,1.272831e+08,67.300000,1.848049,0.000000,0.193182,0.000000,0.227273
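
The two filtering lines in the cell above keep only columns with no missing values and non-zero variance. Since the same filter is applied to each feature matrix, it could be wrapped in a small helper. The sketch below uses a hypothetical function name, `drop_uninformative`, and a toy frame; neither appears in the original notebook.

```python
import numpy as np
import pandas as pd

def drop_uninformative(df: pd.DataFrame) -> pd.DataFrame:
    """Keep columns that have no missing values and non-zero variance."""
    # First drop any column containing at least one NaN.
    complete = df.loc[:, df.isna().sum() == 0]
    # Then drop constant columns, whose variance is zero.
    return complete.loc[:, complete.var() > 0]

# Toy frame: one useful column, one constant column, one with a missing value.
demo = pd.DataFrame({
    "varies": [1.0, 2.0, 3.0],
    "constant": [5.0, 5.0, 5.0],
    "has_nan": [1.0, np.nan, 2.0],
})
kept = drop_uninformative(demo)
print(list(kept.columns))  # ['varies']
```

With such a helper, each filtering cell would reduce to a single call, for example `mult_feature_matrix = drop_uninformative(mult_feature_matrix)`.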


In [25]:
div_feature_matrix, _ = ft.dfs(entityset=es,
                       target_entity = 'loan',
                       agg_primitives = ['mean'],
                       trans_primitives = ['divide_numeric'],
                       max_depth = 2, 
                       verbose = 1, 
                       n_jobs = -1)

Built 5484 features

EntitySet scattered to 2 workers in 5 seconds
Elapsed: 00:23 | Progress: 100%


In [26]:
div_feature_matrix = div_feature_matrix.loc[:,(div_feature_matrix.isna().sum() == 0)]
div_feature_matrix = div_feature_matrix.loc[:,(div_feature_matrix.var() > 0)]
div_feature_matrix
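
One point worth checking with division features: floating-point division by zero in pandas yields `inf` rather than `NaN`, and `isna()` does not flag `inf` by default. The sketch below shows this on toy data and converts infinities to `NaN` so that a missing-value filter would catch them. The helper name `replace_inf_with_nan` is hypothetical, and whether the generated division features actually contain infinities is an assumption that should be verified on the real matrix.

```python
import numpy as np
import pandas as pd

def replace_inf_with_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with +inf and -inf replaced by NaN."""
    return df.replace([np.inf, -np.inf], np.nan)

# Toy ratio feature with one zero denominator.
ratios = pd.DataFrame({"num": [1.0, 2.0, 3.0], "den": [2.0, 0.0, 4.0]})
ratios["num / den"] = ratios["num"] / ratios["den"]

# The infinite value is not counted as missing until it is converted.
n_inf = int(np.isinf(ratios["num / den"]).sum())
n_missing_before = int(ratios["num / den"].isna().sum())
cleaned = replace_inf_with_nan(ratios)
n_missing_after = int(cleaned["num / den"].isna().sum())
print(n_inf, n_missing_before, n_missing_after)
```

Applying such a conversion before the `isna()` filter makes the removal of these columns explicit rather than relying on the variance filter to discard them.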