# Automated Feature Engineering

https://github.com/WillKoehrsen/automated-feature-engineering

In this notebook, we will walk through an implementation of [Featuretools](https://www.featuretools.com/), an open-source Python library for automatically creating features from relational data (data stored in structured, related tables). Although many efforts now aim to automate model selection and hyperparameter tuning, far less work has gone into automating the feature engineering stage of the pipeline. This library seeks to close that gap, and the general methodology has proven effective in both [machine learning competitions with the data science machine](https://github.com/HDI-Project/Data-Science-Machine) and [business use cases](https://www.featurelabs.com/blog/predicting-credit-card-fraud/).

## Dataset

To show the basic idea of featuretools we will use an example dataset consisting of three tables:

* `clients`: basic information about clients at a credit union
* `loans`: previous loans taken out by the clients
* `payments`: payments made/missed on the previous loans

The general problem of feature engineering is taking disparate data, often distributed across multiple tables, and combining it into a single table that can be used for training a machine learning model. Featuretools has the ability to do this for us, creating many new candidate features with minimal effort. These features are combined into a single table that can then be passed on to our model. 

First, let's load in the data and look at the problem we are working with.

In [1]:
# !pip install -U featuretools

Collecting featuretools
  Downloading https://files.pythonhosted.org/packages/52/5f/57c526a0ea506b29a0029f234f09b1fcbf4dbb1e32a1996fe5228d2b2833/featuretools-0.9.1-py3-none-any.whl (221kB)
Installing collected packages: featuretools
  Found existing installation: featuretools 0.7.0
    Uninstalling featuretools-0.7.0:
      Successfully uninstalled featuretools-0.7.0
Successfully installed featuretools-0.9.1


In [2]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# featuretools for automated feature engineering
import featuretools as ft

# ignore warnings from pandas
# import warnings
# warnings.filterwarnings('ignore')

In [3]:
# Read in the data
clients = pd.read_csv('data/clients.csv', parse_dates = ['joined'])
loans = pd.read_csv('data/loans.csv', parse_dates = ['loan_start', 'loan_end'])
payments = pd.read_csv('data/payments.csv', parse_dates = ['payment_date'])

In [4]:
clients.head()

Unnamed: 0,client_id,joined,income,credit_score
0,46109,2002-04-16,172677,527
1,49545,2007-11-14,104564,770
2,41480,2013-03-11,122607,585
3,46180,2001-11-06,43851,562
4,25707,2006-10-06,211422,621


In [5]:
loans.sample(5)

Unnamed: 0,client_id,loan_type,loan_amount,repaid,loan_id,loan_start,loan_end,rate
284,32885,cash,12291,1,10198,2012-03-28,2014-08-14,1.16
112,39505,credit,10436,0,10509,2007-06-15,2009-03-02,1.34
205,26326,home,12760,0,11708,2003-12-11,2006-04-13,5.43
440,26945,other,9329,0,10154,2001-12-17,2004-07-22,5.65
96,25707,other,11467,1,11499,2011-10-27,2014-01-09,4.56


In [6]:
payments.sample(5)

Unnamed: 0,loan_id,payment_amount,payment_date,missed
550,11649,1995,2006-05-02,1
2750,11728,444,2002-09-20,1
2104,10116,937,2001-10-06,1
2720,10612,407,2005-01-05,1
2057,10732,834,2003-10-24,0


## Manual Feature Engineering Examples

Let's show a few examples of features we might make by hand. We will keep this relatively simple to avoid doing too much work! 

First we will focus on a single dataframe before combining them together. In the `clients` dataframe, we can take the month of the `joined` column and the natural log of the `income` column. 

Later, we will see that these are known in featuretools as transformation feature primitives because they act on columns in a single table.

In [7]:
# Create a month column
clients['join_month'] = clients['joined'].dt.month

# Create a log of income column
clients['log_income'] = np.log(clients['income'])

clients.head()

Unnamed: 0,client_id,joined,income,credit_score,join_month,log_income
0,46109,2002-04-16,172677,527,4,12.059178
1,49545,2007-11-14,104564,770,11,11.557555
2,41480,2013-03-11,122607,585,3,11.716739
3,46180,2001-11-06,43851,562,11,10.688553
4,25707,2006-10-06,211422,621,10,12.261611


To incorporate information about the other tables, we use the `df.groupby` method, followed by a suitable aggregation function, followed by `df.merge`.  

For example, let's calculate the average, minimum, and maximum amount of previous loans for each client. 

In featuretools terms, this would be considered an aggregation feature primitive because we are using multiple tables in a one-to-many relationship to calculate aggregate statistics (don't worry, this will be explained shortly!).

In [8]:
# Groupby client id and calculate mean, max, min previous loan size
stats = loans.groupby('client_id')['loan_amount'].agg(['mean', 'max', 'min'])
stats.columns = ['mean_loan_amount', 'max_loan_amount', 'min_loan_amount']
stats.head()

Unnamed: 0_level_0,mean_loan_amount,max_loan_amount,min_loan_amount
client_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
25707,7963.95,13913,1212
26326,7270.0625,13464,1164
26695,7824.722222,14865,2389
26945,7125.933333,14593,653
29841,9813.0,14837,2778


In [9]:
# Merge with the clients dataframe
clients.merge(stats, left_on = 'client_id', right_index=True, how = 'left').head(5)

Unnamed: 0,client_id,joined,income,credit_score,join_month,log_income,mean_loan_amount,max_loan_amount,min_loan_amount
0,46109,2002-04-16,172677,527,4,12.059178,8951.6,14049,559
1,49545,2007-11-14,104564,770,11,11.557555,10289.3,14971,3851
2,41480,2013-03-11,122607,585,3,11.716739,7894.85,14399,811
3,46180,2001-11-06,43851,562,11,10.688553,7700.85,14081,1607
4,25707,2006-10-06,211422,621,10,12.261611,7963.95,13913,1212
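
The groupby, aggregate, merge pattern above can be wrapped in a small helper. The following is a minimal sketch on made-up toy data; the `toy_clients` and `toy_loans` frames and the `add_loan_stats` helper are hypothetical names used only for illustration. It also shows what the left join does for a client with no previous loans.

```python
import pandas as pd

# Toy stand-ins for the clients and loans tables (values are made up)
toy_clients = pd.DataFrame({'client_id': [1, 2, 3],
                            'income': [50000, 72000, 31000]})
toy_loans = pd.DataFrame({'client_id': [1, 1, 2],
                          'loan_amount': [1000, 3000, 500]})

def add_loan_stats(clients_df, loans_df):
    """Aggregate loan_amount per client and left-join the result onto clients."""
    stats = loans_df.groupby('client_id')['loan_amount'].agg(['mean', 'max', 'min'])
    stats.columns = ['mean_loan_amount', 'max_loan_amount', 'min_loan_amount']
    # how='left' keeps every client, including those with no loans
    return clients_df.merge(stats, left_on='client_id', right_index=True, how='left')

toy_features = add_loan_stats(toy_clients, toy_loans)
print(toy_features)
```

Client 3 has no rows in `toy_loans`, so its three loan statistics come back as `NaN` rather than the client being dropped. Every additional statistic or table means more code of this kind, which is the tedium automated feature engineering is meant to remove.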


## Featuretools
https://www.featuretools.com/

Now that we know what we are trying to avoid (tedious manual feature engineering), let's figure out how to automate this process. Featuretools operates on an idea known as [Deep Feature Synthesis](https://docs.featuretools.com/api_reference.html#deep-feature-synthesis). You can read the [original paper here](http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf), and although it's quite readable, it's not necessary to understand the details to do automated feature engineering. 


***The concept of Deep Feature Synthesis is to use basic building blocks known as feature primitives (like the transformations and aggregations done above) that can be stacked on top of each other to form new features. The depth of a "deep feature" is equal to the number of stacked primitives.***

I threw out some terms there, but don't worry because we'll cover them as we go. Featuretools builds on simple ideas to create a powerful method, and we will build up our understanding in much the same way. 

The first part of Featuretools to understand [is an `entity`](https://docs.featuretools.com/loading_data/using_entitysets.html#adding-entities). This is simply a table, or in `pandas`, a `DataFrame`. 



### EntitySet
We corral multiple entities into a [single object called an `EntitySet`](https://docs.featuretools.com/loading_data/using_entitysets.html). This is just a large data structure composed of many individual entities and the relationships between them.  

Creating a new `EntitySet` is pretty simple: 

In [12]:
es = ft.EntitySet(id = 'clients')

#### Entities 

An entity is simply a table, which is represented in Pandas as a `dataframe`. 

***Each entity must have a uniquely identifying column, known as an index.***

For the clients dataframe, this is the `client_id` because each id only appears once in the `clients` data. In the `loans` dataframe, `client_id` is not an index because each id might appear more than once. The index for this dataframe is instead `loan_id`. 

When we create an `entity` in featuretools, we have to identify which column of the dataframe is the index. If the data does not have a unique index we can tell featuretools to make an index for the entity by passing in `make_index = True` and specifying a name for the index. 
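
Before choosing an index, it can help to check uniqueness directly in pandas. The sketch below uses made-up toy data and shows one way to add a surrogate key by hand. It is only an approximation of what `make_index = True` does for us, not the library's actual implementation.

```python
import pandas as pd

# Toy tables with made-up values
toy_loans = pd.DataFrame({'loan_id': [10, 11, 12], 'client_id': [1, 1, 2]})
toy_payments = pd.DataFrame({'loan_id': [10, 10, 11],
                             'payment_amount': [100, 150, 80]})

# A column can serve as an index only if every value is unique
print(toy_loans['loan_id'].is_unique)    # True: one row per loan
print(toy_loans['client_id'].is_unique)  # False: client 1 has two loans

# payments has no unique column, so add a surrogate key by hand
toy_payments = toy_payments.reset_index(drop=True)
toy_payments.insert(0, 'payment_id', range(len(toy_payments)))
print(toy_payments)
```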

If the data also has a uniquely identifying time index, we can pass that in as the `time_index` parameter. 

Featuretools will automatically infer the variable types (numeric, categorical, datetime) of the columns in our data, but we can also pass in specific datatypes to override this behavior. As an example, even though the `repaid` column in the `loans` dataframe is represented as an integer, we can tell featuretools that this is a categorical feature since it can only take on two discrete values. This is done using a dictionary with the variables as keys and the feature types as values.
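
The same idea, a mapping from column name to type, exists in plain pandas. As a rough analogue only (this is pandas `astype`, not the featuretools `variable_types` argument), here is a sketch on made-up data:

```python
import pandas as pd

# Toy loans table with made-up values
toy_loans = pd.DataFrame({'loan_id': [10, 11, 12],
                          'loan_amount': [1000, 3000, 500],
                          'repaid': [1, 0, 1]})

# pandas infers an integer dtype for repaid
print(toy_loans.dtypes)

# Column name -> desired type; only the listed columns are changed
type_overrides = {'repaid': 'category'}
toy_loans = toy_loans.astype(type_overrides)
print(toy_loans.dtypes)
```

Treating a 0/1 flag as categorical matters because it changes which operations make sense: a mean of loan amounts is meaningful, while for a category we usually want counts or the most common value.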

In the code below we create the three entities and add them to the `EntitySet`.  The syntax is relatively straightforward with a few notes: for the `payments` dataframe we need to make an index, for the `loans` dataframe, we specify that `repaid` is a categorical variable, and for the `payments` dataframe, we specify that `missed` is a categorical feature. 

In [13]:
# Create an entity from the client dataframe
# This dataframe already has an index and a time index
es = es.entity_from_dataframe(entity_id = 'clients', dataframe = clients, 
                              index = 'client_id', time_index = 'joined')

# Create an entity from the loans dataframe
# This dataframe already has an index and a time index
es = es.entity_from_dataframe(entity_id = 'loans', dataframe = loans, 
                              variable_types = {'repaid': ft.variable_types.Categorical},
                              index = 'loan_id', 
                              time_index = 'loan_start')

# Create an entity from the payments dataframe
# This does not yet have a unique index
es = es.entity_from_dataframe(entity_id = 'payments', 
                              dataframe = payments,
                              variable_types = {'missed': ft.variable_types.Categorical},
                              make_index = True,
                              index = 'payment_id',
                              time_index = 'payment_date')

In [14]:
es

Entityset: clients
  Entities:
    clients [Rows: 25, Columns: 6]
    loans [Rows: 443, Columns: 8]
    payments [Rows: 3456, Columns: 5]
  Relationships:
    No relationships

In [15]:
es['loans']

Entity: loans
  Variables:
    loan_id (dtype: index)
    client_id (dtype: numeric)
    loan_type (dtype: categorical)
    loan_amount (dtype: numeric)
    loan_start (dtype: datetime_time_index)
    loan_end (dtype: datetime)
    rate (dtype: numeric)
    repaid (dtype: categorical)
  Shape:
    (Rows: 443, Columns: 8)

In [16]:
es['payments']

Entity: payments
  Variables:
    payment_id (dtype: index)
    loan_id (dtype: numeric)
    payment_amount (dtype: numeric)
    payment_date (dtype: datetime_time_index)
    missed (dtype: categorical)
  Shape:
    (Rows: 3456, Columns: 5)

#### Relationships

After defining the entities (tables) in an `EntitySet`, we now need to tell featuretools [how they are related with a relationship](https://docs.featuretools.com/loading_data/using_entitysets.html#adding-a-relationship). 

The most intuitive way to think of relationships is with the parent to child analogy: a parent-to-child relationship is one-to-many because for each parent, there can be multiple children. 

The `clients` dataframe is therefore the parent of the `loans` dataframe because while there is only one row for each client in the `clients` dataframe, each client may have several previous loans covering multiple rows in the `loans` dataframe. Likewise, the `loans` dataframe is the parent of the `payments` dataframe because each loan will have multiple payments.

These relationships are what allow us to group together datapoints using aggregation primitives and then create new features. As an example, we can group all of the previous loans associated with one client and find the average loan amount. We will discuss the features themselves more in a little bit, but for now let's define the relationships. 

To define relationships, we need to specify the parent variable and the child variable. These are the variables that link two entities together. In our example, the `clients` and `loans` dataframes are linked together by the `client_id` column. Again, this is a parent-to-child relationship because for each `client_id` in the parent `clients` dataframe, there may be multiple entries of the same `client_id` in the child `loans` dataframe.
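
A quick way to sanity-check a parent-to-child link is to confirm that the key is unique in the parent and that every child row points at an existing parent. The sketch below does this in pandas on made-up data; `check_one_to_many` is a hypothetical helper written for this example and is not part of featuretools.

```python
import pandas as pd

def check_one_to_many(parent, child, key):
    """Return True if `key` is unique in parent and every child key exists in parent."""
    parent_unique = parent[key].is_unique
    children_have_parent = child[key].isin(parent[key]).all()
    return bool(parent_unique and children_have_parent)

# Toy tables with made-up values
toy_clients = pd.DataFrame({'client_id': [1, 2, 3]})
toy_loans = pd.DataFrame({'loan_id': [10, 11, 12], 'client_id': [1, 1, 2]})

# clients is a valid parent of loans
print(check_one_to_many(toy_clients, toy_loans, 'client_id'))  # True

# Reversing the roles fails because client_id repeats in loans
print(check_one_to_many(toy_loans, toy_clients, 'client_id'))  # False
```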

We codify relationships in the language of featuretools by specifying the parent variable and then the child variable. After creating a relationship, we add it to the `EntitySet`. 

In [17]:
# Relationship between clients and previous loans
r_client_previous = ft.Relationship(es['clients']['client_id'],
                                    es['loans']['client_id'])

# Add the relationship to the entity set
es = es.add_relationship(r_client_previous)

In [18]:
# Relationship between previous loans and previous payments
r_payments = ft.Relationship(es['loans']['loan_id'],
                                      es['payments']['loan_id'])

# Add the relationship to the entity set
es = es.add_relationship(r_payments)

es

Entityset: clients
  Entities:
    clients [Rows: 25, Columns: 6]
    loans [Rows: 443, Columns: 8]
    payments [Rows: 3456, Columns: 5]
  Relationships:
    loans.client_id -> clients.client_id
    payments.loan_id -> loans.loan_id

We now have our entities in an entityset along with the relationships between them. We can now start making new features from all of the tables using stacks of feature primitives to form deep features. First, let's cover feature primitives.

### Feature Primitives

A [feature primitive](https://docs.featuretools.com/automated_feature_engineering/primitives.html) is, at a very high level, an operation applied to data to create a feature. These represent very simple calculations that can be stacked on top of each other to create complex features. Feature primitives fall into two categories:

* __Aggregation__: a function that groups together child datapoints for each parent and then calculates a statistic such as mean, min, max, or standard deviation. An example is calculating the maximum loan amount for each client. An aggregation works across multiple tables using relationships between tables.
* __Transformation__: an operation applied to one or more columns in a single table. An example would be extracting the day from dates, or finding the difference between two columns in one table.
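
To make the distinction concrete, here is a small pandas sketch of both categories on made-up data (plain pandas, not featuretools calls). A transformation returns one value per row of the same table, while an aggregation returns one value per parent.

```python
import pandas as pd

# Toy loans table with made-up values
toy_loans = pd.DataFrame({
    'loan_id': [10, 11, 12],
    'client_id': [1, 1, 2],
    'loan_amount': [1000, 3000, 500],
    'loan_start': pd.to_datetime(['2010-01-15', '2011-06-01', '2012-03-20']),
    'loan_end': pd.to_datetime(['2011-01-15', '2013-06-01', '2012-09-20']),
})

# Transformation: one output value per row, using columns of a single table
toy_loans['loan_start_month'] = toy_loans['loan_start'].dt.month
toy_loans['loan_days'] = (toy_loans['loan_end'] - toy_loans['loan_start']).dt.days

# Aggregation: one output value per parent (client), computed over its children
max_loan = toy_loans.groupby('client_id')['loan_amount'].max()

print(toy_loans[['loan_id', 'loan_start_month', 'loan_days']])
print(max_loan)
```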

Let's take a look at feature primitives in featuretools. We can view the list of primitives:

In [19]:
primitives = ft.list_primitives()
pd.options.display.max_colwidth = 100
primitives[primitives['type'] == 'aggregation'].head(10)

Unnamed: 0,name,type,description
0,percent_true,aggregation,Determines the percent of `True` values.
1,num_unique,aggregation,"Determines the number of distinct values, ignoring `NaN` values."
2,time_since_first,aggregation,Calculates the time elapsed since the first datetime (in seconds).
3,min,aggregation,"Calculates the smallest value, ignoring `NaN` values."
4,all,aggregation,Calculates if all values are 'True' in a list.
5,mean,aggregation,Computes the average for a list of values.
6,last,aggregation,Determines the last value in a list.
7,count,aggregation,"Determines the total number of values, excluding `NaN`."
8,skew,aggregation,Computes the extent to which a distribution differs from a normal distribution.
9,n_most_common,aggregation,Determines the `n` most common elements.


In [20]:
primitives[primitives['type'] == 'transform'].head(10)

Unnamed: 0,name,type,description
20,cum_min,transform,Calculates the cumulative minimum.
21,longitude,transform,Returns the second tuple value in a list of LatLong tuples.
22,multiply_numeric,transform,Element-wise multiplication of two lists.
23,is_weekend,transform,Determines if a date falls on a weekend.
24,less_than_equal_to,transform,Determines if values in one list are less than or equal to another list.
25,modulo_numeric_scalar,transform,Return the modulo of each element in the list by a scalar.
26,cum_count,transform,Calculates the cumulative count.
27,percentile,transform,Determines the percentile rank for each value in a list.
28,subtract_numeric_scalar,transform,Subtract a scalar from each element in the list.
29,divide_by_feature,transform,Divide a scalar by each value in the list.


If featuretools does not have enough primitives for us, we can [also make our own.](https://docs.featuretools.com/automated_feature_engineering/primitives.html#defining-custom-primitives) 
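
The linked documentation describes how to register a custom primitive with the library itself. To show only the underlying idea, the sketch below defines a custom aggregation in plain pandas on made-up data. The `value_range` function is a hypothetical example, and nothing here uses the featuretools API.

```python
import pandas as pd

def value_range(values):
    """Hypothetical custom aggregation: spread between largest and smallest value."""
    return values.max() - values.min()

# Toy loans table with made-up values
toy_loans = pd.DataFrame({'client_id': [1, 1, 2, 2, 2],
                          'loan_amount': [1000, 3000, 500, 700, 1500]})

# Apply the custom function once per client, exactly like a built-in statistic
loan_range = toy_loans.groupby('client_id')['loan_amount'].agg(value_range)
print(loan_range)
```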

To get an idea of what a feature primitive actually does, let's try out a few on our data. Applying primitives is surprisingly easy with the `ft.dfs` function (which stands for deep feature synthesis). In this function, we specify the entityset to use; the `target_entity`, which is the dataframe we want to make the features for (where the features end up); the `agg_primitives`, which are the aggregation feature primitives; and the `trans_primitives`, which are the transformation primitives to apply.

In the following example, we are using the `EntitySet` we already created, the target entity is the `clients` dataframe because we want to make new features about each client, and then we specify a few aggregation and transformation primitives. 

In [28]:
# Create new features using specified primitives
features, feature_names = ft.dfs(entityset = es, target_entity = 'clients', 
                                 agg_primitives = ['mean', 'max', 'percent_true', 'last'],
                                 trans_primitives = ['year', 'month', 'subtract_numeric', 'divide_numeric'])

In [31]:
features.columns.to_list()

['income',
 'credit_score',
 'join_month',
 'log_income',
 'MEAN(loans.loan_amount)',
 'MEAN(loans.rate)',
 'MAX(loans.loan_amount)',
 'MAX(loans.rate)',
 'LAST(loans.loan_id)',
 'LAST(loans.loan_type)',
 'LAST(loans.loan_amount)',
 'LAST(loans.rate)',
 'LAST(loans.repaid)',
 'MEAN(payments.payment_amount)',
 'MAX(payments.payment_amount)',
 'LAST(payments.payment_id)',
 'LAST(payments.payment_amount)',
 'LAST(payments.missed)',
 'YEAR(joined)',
 'MONTH(joined)',
 'income - log_income',
 'join_month - log_income',
 'income - join_month',
 'credit_score - join_month',
 'credit_score - log_income',
 'credit_score - income',
 'join_month / log_income',
 'join_month / income',
 'join_month / credit_score',
 'income / join_month',
 'credit_score / join_month',
 'credit_score / log_income',
 'credit_score / income',
 'log_income / income',
 'log_income / credit_score',
 'income / log_income',
 'income / credit_score',
 'log_income / join_month',
 'MEAN(loans.MEAN(payments.payment_amount))',


In [32]:
features.head()

Unnamed: 0_level_0,income,credit_score,join_month,log_income,MEAN(loans.loan_amount),MEAN(loans.rate),MAX(loans.loan_amount),MAX(loans.rate),LAST(loans.loan_id),LAST(loans.loan_type),...,LAST(payments.payment_amount) / MAX(payments.payment_amount),log_income / join_month - log_income,credit_score / income - join_month,MAX(payments.payment_amount) / MEAN(loans.loan_amount),credit_score - log_income / income,income - join_month / MAX(loans.rate),MAX(loans.loan_amount) / MEAN(loans.rate),income / MAX(loans.loan_amount),LAST(payments.payment_amount) / MEAN(loans.rate),income - log_income / income - join_month
client_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
25707,211422,621,10,12.261611,7963.95,3.477,13913,9.44,10363,home,...,0.089127,-5.421626,0.002937,0.33953,0.002879,22395.338983,4001.438021,15.196004,69.312626,0.999989
26326,227920,633,5,12.33675,7270.0625,2.5175,13464,6.73,11072,credit,...,0.35064,-1.681501,0.002777,0.365609,0.002723,33865.527489,5348.16286,16.928105,370.20854,0.999968
26695,174532,680,8,12.069863,7824.722222,2.466111,14865,6.51,10985,other,...,0.710778,-2.965668,0.003896,0.37471,0.003827,26808.602151,6027.708943,11.741137,845.055193,0.999977
26945,214516,806,11,12.27614,7125.933333,2.855333,14593,5.65,11482,cash,...,0.576951,-9.619747,0.003757,0.38844,0.0037,37965.486726,5110.786832,14.699925,559.304226,0.999994
29841,38354,523,8,10.554614,9813.0,3.445,14837,6.76,11188,home,...,0.276052,-4.131588,0.013639,0.295323,0.013361,5672.485207,4306.82148,2.585024,232.22061,0.999933


### Deep Feature Synthesis

While feature primitives are useful by themselves, the main benefit of using featuretools arises when we stack primitives to get deep features. The depth of a feature is simply the number of primitives required to make a feature. So, a feature that relies on a single aggregation would be a deep feature with a depth of 1, a feature that stacks two primitives would have a depth of 2, and so on. The idea itself is a lot simpler than the name "deep feature synthesis" implies. (I think the authors were trying to ride the wave of deep neural network hype when they named the method!) To read more about deep feature synthesis, check out [the documentation](https://docs.featuretools.com/automated_feature_engineering/afe.html) or the [original paper by Max Kanter et al](http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf).

Already in the dataframe we made by specifying the primitives manually we can see the idea of feature depth. For instance, the MEAN(loans.loan_amount) feature has a depth of 1 because it is made by applying a single aggregation primitive. This feature represents the average size of a client's previous loans.

In [33]:
# Show a feature with a depth of 1
pd.DataFrame(features['MEAN(loans.loan_amount)'].head(10))

Unnamed: 0_level_0,MEAN(loans.loan_amount)
client_id,Unnamed: 1_level_1
25707,7963.95
26326,7270.0625
26695,7824.722222
26945,7125.933333
29841,9813.0
32726,6633.263158
32885,9920.4
32961,7882.235294
35089,6939.2
35214,7173.555556


In [34]:
# Show a feature with a depth of 2
pd.DataFrame(features['LAST(loans.MEAN(payments.payment_amount))'].head(10))

Unnamed: 0_level_0,LAST(loans.MEAN(payments.payment_amount))
client_id,Unnamed: 1_level_1
25707,293.5
26326,977.375
26695,1769.166667
26945,1598.666667
29841,1125.5
32726,799.5
32885,1729.0
32961,282.6
35089,110.4
35214,1410.25


We can create features of arbitrary depth by stacking more primitives. 
__However, when I have used featuretools I've never gone beyond a depth of 2!__ 

After this point, the features become very convoluted to understand. I'd encourage anyone interested to experiment with increasing the depth (maybe for a real problem) and see if there is value to "going deeper".
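
To see what a depth-2 feature involves, we can build one by hand. The sketch below reproduces the idea behind `MEAN(loans.MEAN(payments.payment_amount))` in plain pandas on made-up data: first aggregate payments up to loans, then aggregate that result up to clients.

```python
import pandas as pd

# Toy tables with made-up values
toy_loans = pd.DataFrame({'loan_id': [10, 11, 12], 'client_id': [1, 1, 2]})
toy_payments = pd.DataFrame({'loan_id': [10, 10, 11, 12, 12],
                             'payment_amount': [100, 200, 400, 50, 150]})

# Depth 1: MEAN(payments.payment_amount) for each loan
mean_payment = (toy_payments.groupby('loan_id')['payment_amount']
                .mean().rename('mean_payment'))
loans_d1 = toy_loans.merge(mean_payment, left_on='loan_id',
                           right_index=True, how='left')

# Depth 2: MEAN(loans.MEAN(payments.payment_amount)) for each client
deep_feature = loans_d1.groupby('client_id')['mean_payment'].mean()
print(deep_feature)
```

For client 1 the result is 275 (the average of the per-loan means 150 and 400), which differs from the plain average of that client's three payments (about 233). This is why the stacked feature and `MEAN(payments.payment_amount)` appear as separate columns: they measure different things.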

### Automated Deep Feature Synthesis

In addition to manually specifying aggregation and transformation feature primitives, we can let featuretools automatically generate many new features. We do this by making the same `ft.dfs` function call, but without passing in any primitives. We just set the `max_depth` parameter and featuretools will automatically try many combinations of feature primitives up to the specified depth.
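
To build intuition for why the number of features grows with depth, here is a deliberately simplified pure-Python sketch that only enumerates feature names. It is not how featuretools is implemented: it assumes a single chain of tables, considers only two aggregation primitives, and ignores transformations, variable types, and features that reach across more than one relationship at once. The `candidate_features` function and the dictionaries it reads are hypothetical.

```python
# Simplified model: a chain of tables, each child aggregated into its parent
children = {'clients': 'loans', 'loans': 'payments'}
numeric_columns = {'loans': ['loan_amount', 'rate'],
                   'payments': ['payment_amount']}
agg_primitives = ['MEAN', 'MAX']

def candidate_features(table, max_depth):
    """List aggregation feature names for `table`, stacking up to max_depth primitives."""
    child = children.get(table)
    if child is None or max_depth < 1:
        return []
    # Inputs on the child: its raw columns plus its own shallower features
    inputs = numeric_columns.get(child, []) + candidate_features(child, max_depth - 1)
    return [f'{agg}({child}.{col})' for agg in agg_primitives for col in inputs]

depth_1 = candidate_features('clients', max_depth=1)
depth_2 = candidate_features('clients', max_depth=2)
print(len(depth_1), len(depth_2))  # 4 8
print(depth_2)
```

Even in this tiny model, going from depth 1 to depth 2 doubles the number of candidates, and each extra primitive or column multiplies the count further.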

When running on large datasets, this process can take quite a while, but for our example data, it will be relatively quick. For this call, we only need to specify the `entityset`, the `target_entity` (which will again be `clients`), and the `max_depth`. 

In [35]:
# Perform deep feature synthesis without specifying primitives
features, feature_names = ft.dfs(entityset=es, target_entity='clients', 
                                 max_depth = 2)

In [36]:
features.head()

Unnamed: 0_level_0,income,credit_score,join_month,log_income,SUM(loans.loan_amount),SUM(loans.rate),STD(loans.loan_amount),STD(loans.rate),MAX(loans.loan_amount),MAX(loans.rate),...,NUM_UNIQUE(loans.WEEKDAY(loan_end)),MODE(loans.MODE(payments.missed)),MODE(loans.DAY(loan_start)),MODE(loans.DAY(loan_end)),MODE(loans.YEAR(loan_start)),MODE(loans.YEAR(loan_end)),MODE(loans.MONTH(loan_start)),MODE(loans.MONTH(loan_end)),MODE(loans.WEEKDAY(loan_start)),MODE(loans.WEEKDAY(loan_end))
client_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
25707,211422,621,10,12.261611,159279,69.54,4149.486062,2.484186,13913,9.44,...,6,0,27,1,2010,2007,1,8,3,0
26326,227920,633,5,12.33675,116321,40.28,4393.666631,2.057142,13464,6.73,...,5,0,6,6,2003,2005,4,7,5,2
26695,174532,680,8,12.069863,140845,44.39,4196.462499,1.561659,14865,6.51,...,6,0,3,14,2003,2005,9,4,1,1
26945,214516,806,11,12.27614,106889,42.83,4543.621769,1.619717,14593,5.65,...,6,0,16,1,2002,2004,12,5,0,1
29841,38354,523,8,10.554614,176634,62.01,4209.224171,2.122904,14837,6.76,...,7,1,1,15,2005,2007,3,2,5,1


### Next Steps

While automatic feature engineering solves one problem, it provides us with another problem: too many features! Although it's difficult to say which features will be important to a given machine learning task ahead of time, it's likely that not all of the features made by featuretools add value. In fact, having too many features is a significant issue in machine learning because it makes training a model much harder. The [irrelevant features can drown out the important features](https://pdfs.semanticscholar.org/a83b/ddb34618cc68f1014ca12eef7f537825d104.pdf), leaving a model unable to learn how to map the features to the target.

This problem is known as the ["curse of dimensionality"](https://en.wikipedia.org/wiki/Curse_of_dimensionality#Machine_learning) and is addressed through the process of [feature reduction and selection](http://scikit-learn.org/stable/modules/feature_selection.html), which means [removing low-value features](https://machinelearningmastery.com/feature-selection-machine-learning-python/) from the data. Defining which features are useful is an important problem where a data scientist can still add considerable value to the feature engineering task. Feature reduction will have to be another topic for another day!
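
Without going into that topic properly, one common first pass is to drop columns that are constant or nearly duplicate another column. The sketch below is a minimal illustration on made-up data. The `drop_redundant_features` helper and the 0.95 correlation threshold are assumptions chosen for the example, not a recommendation from featuretools.

```python
import numpy as np
import pandas as pd

def drop_redundant_features(feature_matrix, corr_threshold=0.95):
    """Drop constant columns and one column from each highly correlated numeric pair."""
    numeric = feature_matrix.select_dtypes(include='number')
    # Constant columns carry no information
    constant = [col for col in numeric.columns if numeric[col].nunique() <= 1]
    numeric = numeric.drop(columns=constant)
    # Keep only the upper triangle so each pair is examined once
    corr = numeric.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    correlated = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return feature_matrix.drop(columns=constant + correlated)

# Toy feature matrix with made-up values
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'a_doubled': [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with a
                    'constant': [5, 5, 5, 5],
                    'b': [4.0, 1.0, 3.0, 2.0]})
reduced = drop_redundant_features(toy)
print(list(reduced.columns))
```

This removes only the most obvious redundancy. Deciding which of the remaining features actually help a model still requires the selection methods linked above.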

### Conclusions

In this notebook, we saw how to apply automated feature engineering to an example dataset. This is a powerful method which allows us to overcome the human limits of time and imagination to create many new features from multiple tables of data. Featuretools is built on the idea of deep feature synthesis, which means stacking multiple simple feature primitives - __aggregations and transformations__ - to create new features. Feature engineering allows us to combine information across many tables into a single dataframe that we can then use for machine learning model training. Finally, the next step after creating all of these features is figuring out which ones are important. 

Featuretools is currently the only Python option for this process, but with the recent emphasis on automating aspects of the machine learning pipeline, other competitors will probably enter the sphere. While the exact tools will change, the idea of automatically creating new features out of existing data will grow in importance. Staying up-to-date on methods such as automated feature engineering is crucial in the rapidly changing field of data science. Now go out there and find a problem on which to apply featuretools!

For more information, check out the [documentation for featuretools](https://docs.featuretools.com/index.html). Also, read about how featuretools is [used in the real world by Feature Labs](https://www.featurelabs.com/), the company behind the open-source library.