# Deep feature synthesis

https://featuretools.alteryx.com/en/stable/getting_started/afe.html

The meat of Featuretools: automated feature engineering based on table relationships and timestamps.

## Setup

In [1]:
import featuretools as ft
import pandas

# Display options
pandas.set_option('display.max_rows', 10)

es = ft.demo.load_mock_customer(return_entityset=True)
es

Entityset: transactions
  DataFrames:
    transactions [Rows: 500, Columns: 6]
    products [Rows: 5, Columns: 3]
    sessions [Rows: 35, Columns: 5]
    customers [Rows: 5, Columns: 5]
  Relationships:
    transactions.product_id -> products.product_id
    transactions.session_id -> sessions.session_id
    sessions.customer_id -> customers.customer_id

## Primitives

Primitives are the basic transformations used by Featuretools: functions operating on the data.

In [2]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count"],
    trans_primitives=["month"],
    max_depth=1,
)

feature_matrix

customer_id,zip_code,COUNT(sessions),MONTH(birthday),MONTH(join_date)
5,60091,6,7,7
4,60091,8,8,4
1,60091,8,7,4
3,13244,6,11,8
2,13244,7,8,4


In the example above, `count` is an aggregation primitive because it computes a single value from the many sessions related to one customer. `month` is a transform primitive because it takes one value for a customer and transforms it into another.
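To make the distinction concrete, here is a minimal plain-Python sketch. It is not how Featuretools implements primitives; the toy rows and the names `count_sessions` and `month_of` are invented for illustration only.

```python
from collections import Counter
from datetime import date

# Toy data (invented for illustration, not the mock customer dataset).
customers = {1: date(2011, 4, 17), 2: date(2012, 8, 3)}  # customer_id -> join_date
sessions = [(10, 1), (11, 1), (12, 2), (13, 1)]          # (session_id, customer_id)


def count_sessions(session_rows):
    """Aggregation: many child rows in, one value per parent (customer) out."""
    return Counter(customer_id for _, customer_id in session_rows)


def month_of(day):
    """Transform: one value in, one value out, within the same row."""
    return day.month


count_by_customer = count_sessions(sessions)
month_by_customer = {cid: month_of(d) for cid, d in customers.items()}

print(count_by_customer[1], count_by_customer[2])  # 3 1
print(month_by_customer)                           # {1: 4, 2: 8}
```

The aggregation needs the relationship between sessions and customers to know which rows to group, while the transform only looks at a single column of the customers table.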

## Deep features

Primitives can be "stacked" on one another, creating features up to the specified `max_depth`.

In [3]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "mode"],
    trans_primitives=["month", "hour"],
    max_depth=2,
)

feature_matrix

customer_id,zip_code,MODE(sessions.device),MEAN(transactions.amount),MODE(transactions.product_id),SUM(transactions.amount),HOUR(birthday),HOUR(join_date),MONTH(birthday),MONTH(join_date),MEAN(sessions.MEAN(transactions.amount)),MEAN(sessions.SUM(transactions.amount)),MODE(sessions.HOUR(session_start)),MODE(sessions.MODE(transactions.product_id)),MODE(sessions.MONTH(session_start)),SUM(sessions.MEAN(transactions.amount)),MODE(transactions.sessions.device)
5,60091,mobile,80.375443,5,6349.66,0,5,7,7,78.705187,1058.276667,0,3,1,472.231119,mobile
4,60091,mobile,80.070459,2,8727.68,0,20,8,4,81.207189,1090.96,1,1,1,649.657515,mobile
1,60091,mobile,71.631905,4,9025.62,0,10,7,4,72.77414,1128.2025,6,4,1,582.193117,mobile
3,13244,desktop,67.06043,1,6236.62,0,15,11,8,67.539577,1039.436667,5,1,1,405.237462,desktop
2,13244,desktop,77.422366,4,7200.28,0,23,8,4,78.415122,1028.611429,3,3,1,548.905851,desktop


Note that with `max_depth=2`, Featuretools applied the aggregation primitives a second time, on top of depth 1 features such as `SUM(transactions.amount)` per session.

With a depth of 2, a number of features are generated using the supplied primitives. The algorithm to synthesize these definitions is described in this paper. To see what a depth 2 feature means, consider one from the returned feature matrix:

In [4]:
feature_matrix[["MEAN(sessions.SUM(transactions.amount))"]]

customer_id,MEAN(sessions.SUM(transactions.amount))
5,1058.276667
4,1090.96
1,1128.2025
3,1039.436667
2,1028.611429


For each customer this feature:

1. Calculates the sum of all transaction amounts per session to get the total amount per session,

2. then applies the mean to the total amounts across multiple sessions to identify the average amount spent per session.

We call this feature a “deep feature” with a depth of 2.
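The two steps behind `MEAN(sessions.SUM(transactions.amount))` can be sketched by hand in plain Python. This is only an illustration on invented toy rows, not the Featuretools implementation; the variable names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Toy rows (invented): (transaction_id, session_id, amount)
transactions = [
    (1, 10, 20.0), (2, 10, 30.0),
    (3, 11, 10.0),
    (4, 12, 5.0), (5, 12, 15.0),
]
session_to_customer = {10: 1, 11: 1, 12: 2}

# Step 1 (depth 1): SUM(transactions.amount) for each session.
session_totals = defaultdict(float)
for _, session_id, amount in transactions:
    session_totals[session_id] += amount

# Step 2 (depth 2): MEAN of those per-session totals for each customer.
totals_by_customer = defaultdict(list)
for session_id, total in session_totals.items():
    totals_by_customer[session_to_customer[session_id]].append(total)

mean_session_total = {
    cid: mean(totals) for cid, totals in totals_by_customer.items()
}
print(mean_session_total)  # {1: 30.0, 2: 20.0}
```

Customer 1 has sessions totalling 50.0 and 10.0, so the average amount per session is 30.0. Note that this differs from a plain mean over that customer's transactions, which would be 20.0.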

Let’s look at another depth 2 feature that calculates for every customer the most common hour of the day when they start a session:

In [5]:
feature_matrix[["MODE(sessions.HOUR(session_start))"]]

customer_id,MODE(sessions.HOUR(session_start))
5,0
4,1
1,6
3,5
2,3


For each customer this feature:

1. Calculates the hour of the day at which each of their sessions started,

2. then uses the statistical function mode to identify the most common hour at which they started a session.
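The same stacking, a transform followed by an aggregation, can be written out by hand. The sketch below uses invented toy sessions and Python's `statistics.mode`; it illustrates the idea and is not how Featuretools computes the feature.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mode

# Toy rows (invented): (session_id, customer_id, session_start)
sessions = [
    (10, 1, datetime(2014, 1, 1, 6, 5)),
    (11, 1, datetime(2014, 1, 1, 6, 40)),
    (12, 1, datetime(2014, 1, 1, 9, 15)),
    (13, 2, datetime(2014, 1, 1, 3, 0)),
]

# Step 1 (transform): HOUR(session_start) for every session.
hours_by_customer = defaultdict(list)
for _, customer_id, start in sessions:
    hours_by_customer[customer_id].append(start.hour)

# Step 2 (aggregation): MODE over each customer's session hours.
mode_hour = {cid: mode(hours) for cid, hours in hours_by_customer.items()}
print(mode_hour)  # {1: 6, 2: 3}
```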

Stacking results in features that are more expressive than individual primitives themselves. This enables the automatic creation of complex patterns for machine learning.

In [6]:
ft.describe_feature(feature_defs[11])

'The most frequently occurring value of the hour value of the "session_start" of all instances of "sessions" for each "customer_id" in "customers".'

DFS is powerful because we can create a feature matrix for any dataframe in our dataset. If we switch our target dataframe to “sessions”, we can synthesize features for each session instead of each customer. Now, we can use these features to predict the outcome of a session.

In [7]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="sessions",
    agg_primitives=["mean", "sum", "mode"],
    trans_primitives=["month", "hour"],
    max_depth=2,
)

feature_matrix

session_id,customer_id,device,MEAN(transactions.amount),MODE(transactions.product_id),SUM(transactions.amount),HOUR(session_start),MONTH(session_start),customers.zip_code,MODE(transactions.HOUR(transaction_time)),MODE(transactions.MONTH(transaction_time)),customers.MODE(sessions.device),customers.MEAN(transactions.amount),customers.MODE(transactions.product_id),customers.SUM(transactions.amount),customers.HOUR(birthday),customers.HOUR(join_date),customers.MONTH(birthday),customers.MONTH(join_date)
1,2,desktop,76.813125,3,1229.01,0,1,13244,0,1,desktop,77.422366,4,7200.28,0,23,8,4
2,5,mobile,74.696000,5,746.96,0,1,60091,0,1,mobile,80.375443,5,6349.66,0,5,7,7
3,4,mobile,88.600000,1,1329.00,0,1,60091,0,1,mobile,80.070459,2,8727.68,0,20,8,4
4,1,mobile,64.557200,5,1613.93,0,1,60091,0,1,mobile,71.631905,4,9025.62,0,10,7,4
5,4,mobile,70.638182,5,777.02,1,1,60091,1,1,mobile,80.070459,2,8727.68,0,20,8,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31,2,mobile,68.899444,3,1240.19,7,1,13244,7,1,desktop,77.422366,4,7200.28,0,23,8,4
32,5,mobile,67.897500,3,543.18,8,1,60091,8,1,mobile,80.375443,5,6349.66,0,5,7,7
33,2,mobile,61.910000,3,804.83,8,1,13244,8,1,desktop,77.422366,4,7200.28,0,23,8,4
34,3,desktop,82.109444,4,1477.97,8,1,13244,8,1,desktop,67.060430,1,6236.62,0,15,11,8
