# Automated Feature Engineering

https://github.com/WillKoehrsen/automated-feature-engineering

In this notebook, we will walk through an implementation using [Featuretools](https://www.featuretools.com/), an open-source Python library for automatically creating features from relational data (data stored in structured, related tables). Although many efforts now aim to automate model selection and hyperparameter tuning, comparatively little work has gone into automating the feature engineering stage of the pipeline. This library seeks to close that gap, and the general methodology has proven effective in both [machine learning competitions with the Data Science Machine](https://github.com/HDI-Project/Data-Science-Machine) and [business use cases](https://www.featurelabs.com/blog/predicting-credit-card-fraud/).

## Dataset

To show the basic idea of Featuretools, we will use an example dataset consisting of three tables:

* `clients`: information about clients at a credit union
* `loans`: previous loans taken out by the clients
* `payments`: payments made/missed on the previous loans

The general problem of feature engineering is taking disparate data, often distributed across multiple tables, and combining it into a single table that can be used to train a machine learning model. Featuretools can do this for us, creating many new candidate features with minimal effort. These features are collected in a single table that can then be passed to our model.
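
To make "combining it into a single table" concrete, here is a minimal sketch of the manual approach in plain pandas: aggregate a child table to one row per client, then join the result onto the parent table. The tiny tables and their values are made up for illustration and are not the notebook's data. This is the kind of step Featuretools is meant to automate, not how the library itself works internally.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are made up for illustration.
clients = pd.DataFrame({"client_id": [1, 2], "income": [50000, 82000]})
loans = pd.DataFrame({
    "client_id": [1, 1, 2],
    "loan_amount": [1000, 3000, 5000],
})

# Aggregate the child table (loans) down to one row per client.
loan_stats = (
    loans.groupby("client_id")["loan_amount"]
    .agg(["mean", "max", "count"])
    .add_prefix("loan_amount_")
    .reset_index()
)

# Join the aggregated features onto the parent table (clients).
features = clients.merge(loan_stats, on="client_id", how="left")
print(features)
```

Each extra aggregation, column, or table multiplies the amount of code like this that has to be written by hand, which is the motivation for automating it.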

First, let's load in the data and look at the problem we are working with.

In [1]:
# !pip install -U featuretools

Collecting featuretools
  Downloading https://files.pythonhosted.org/packages/52/5f/57c526a0ea506b29a0029f234f09b1fcbf4dbb1e32a1996fe5228d2b2833/featuretools-0.9.1-py3-none-any.whl (221kB)
Installing collected packages: featuretools
  Found existing installation: featuretools 0.7.0
    Uninstalling featuretools-0.7.0:
      Successfully uninstalled featuretools-0.7.0
Successfully installed featuretools-0.9.1


In [2]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# featuretools for automated feature engineering
import featuretools as ft

# ignore warnings from pandas
# import warnings
# warnings.filterwarnings('ignore')

In [3]:
# Read in the data, parsing the date columns as datetimes
clients = pd.read_csv('data/clients.csv', parse_dates=['joined'])
loans = pd.read_csv('data/loans.csv', parse_dates=['loan_start', 'loan_end'])
payments = pd.read_csv('data/payments.csv', parse_dates=['payment_date'])
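
As a small aside on `parse_dates`, the sketch below shows what the argument changes. It inlines two rows from the `clients` table displayed in this notebook, so it runs without `data/clients.csv`. The variable names `csv_text`, `raw`, and `parsed` are only for this illustration.

```python
import io
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Two rows of the clients table, inlined so no file on disk is needed.
csv_text = """client_id,joined,income,credit_score
46109,2002-04-16,172677,527
49545,2007-11-14,104564,770
"""

# Without parse_dates, the 'joined' column is read as plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, pandas converts the column to datetimes on load.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["joined"])

print(is_datetime64_any_dtype(raw["joined"]))     # False
print(is_datetime64_any_dtype(parsed["joined"]))  # True
print(parsed["joined"].dt.year.tolist())          # [2002, 2007]
```

Having real datetime columns allows date-based operations (such as extracting the year or month) to work on them directly.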

In [4]:
clients.head()

Unnamed: 0,client_id,joined,income,credit_score
0,46109,2002-04-16,172677,527
1,49545,2007-11-14,104564,770
2,41480,2013-03-11,122607,585
3,46180,2001-11-06,43851,562
4,25707,2006-10-06,211422,621


In [5]:
loans.sample(5)

Unnamed: 0,client_id,loan_type,loan_amount,repaid,loan_id,loan_start,loan_end,rate
284,32885,cash,12291,1,10198,2012-03-28,2014-08-14,1.16
112,39505,credit,10436,0,10509,2007-06-15,2009-03-02,1.34
205,26326,home,12760,0,11708,2003-12-11,2006-04-13,5.43
440,26945,other,9329,0,10154,2001-12-17,2004-07-22,5.65
96,25707,other,11467,1,11499,2011-10-27,2014-01-09,4.56


In [6]:
payments.sample(5)

Unnamed: 0,loan_id,payment_amount,payment_date,missed
550,11649,1995,2006-05-02,1
2750,11728,444,2002-09-20,1
2104,10116,937,2001-10-06,1
2720,10612,407,2005-01-05,1
2057,10732,834,2003-10-24,0
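
Judging by the column names in the samples above, `client_id` appears to link `loans` back to `clients`, and `loan_id` appears to link `payments` back to `loans`. Before relying on those links, it can help to confirm that each key is unique in its parent table and that no child row points at a missing parent. The sketch below does this on small stand-in tables whose ids and amounts are made up for illustration. The same checks could be run on the real DataFrames.

```python
import pandas as pd

# Tiny stand-in tables; ids and amounts are made up for illustration.
clients = pd.DataFrame({"client_id": [1, 2]})
loans = pd.DataFrame({"loan_id": [10, 11, 12], "client_id": [1, 1, 2]})
payments = pd.DataFrame({
    "loan_id": [10, 10, 11, 12],
    "payment_amount": [100, 150, 200, 50],
})

# Each key should be unique in its parent table.
clients_key_ok = clients["client_id"].is_unique
loans_key_ok = loans["loan_id"].is_unique

# Every child row should point at an existing parent row.
orphan_loans = loans[~loans["client_id"].isin(clients["client_id"])]
orphan_payments = payments[~payments["loan_id"].isin(loans["loan_id"])]

print(clients_key_ok, loans_key_ok)
print(len(orphan_loans), len(orphan_payments))
```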
