# Overview
This notebook is a tutorial on the basic functionality of the automated feature engineering library 'Featuretools'. 

# Why is automated feature engineering something to be excited about?
Good question! 

As data scientists, much of our job revolves around feature engineering. When we engineer a feature, we're trying to answer an implicit question: "is there a relationship between this feature that I'm creating and our target?" 

For example, our goal will be to predict whether or not water pumps are broken. Our data includes features like which basin the water is drawn from, and the size of the population that uses each water pump. For me this raises the question "Do pumps that draw water from basins that support larger populations experience greater failure?" 

Typically we'd perform a pandas groupby for the 'basin' column of our dataframe to engineer a feature that allows us to test this. 
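
As a rough illustration of that manual approach, here is a minimal sketch on a toy dataframe. The rows and the `broken` column are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy stand-in for the pump data; every value here is invented.
pumps = pd.DataFrame({
    "basin": ["Pangani", "Pangani", "Rufiji", "Rufiji", "Rufiji"],
    "population": [100, 300, 50, 0, 250],
    "broken": [1, 0, 0, 1, 1],  # hypothetical target: 1 = broken
})

# Total population served in each basin, broadcast back onto every pump row.
pumps["basin_population"] = pumps.groupby("basin")["population"].transform("sum")

# Failure rate per basin, side by side with the population it supports.
summary = pumps.groupby("basin").agg(
    basin_population=("population", "sum"),
    failure_rate=("broken", "mean"),
)
print(summary)
```

One engineered column like `basin_population` answers one question. Every new question means another groupby.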

What if instead of "Do pumps that draw water from basins that support larger populations experience greater failure?" we asked "Is there any relationship between the basins and any of our other features?"

That's a much bigger question, and generating all of those interaction terms could entail a lot of work. Featuretools is able to handle it for us.

Or maybe you want to know if the relationship between any of your numeric features and your target is better modeled by the natural log or a polynomial of those features? We could try to eyeball it with a pair plot, or we could simply let Featuretools handle that for us too. 
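
For reference, building those log and polynomial candidates by hand might look like the sketch below. The dataframe is a toy example, and using `log1p` rather than a plain log is my assumption so that populations of zero stay finite.

```python
import numpy as np
import pandas as pd

# Toy numeric feature; the values are illustrative only.
df = pd.DataFrame({"population": [0, 1, 10, 100, 1000]})

# log1p computes log(1 + x), so a population of 0 maps to 0 instead of -inf.
df["log_population"] = np.log1p(df["population"])

# A simple polynomial term.
df["population_squared"] = df["population"] ** 2

print(df)
```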

# So are data scientists out of the job?
Nope! Or at least not yet...

As we'll see in this notebook, Featuretools is powerful, but not "smart". Using its default settings and the data from a Kaggle competition, the features generated by Featuretools perform worse than all of the following:

1. A random forest classifier with no feature engineering
2. A logistic regression with no feature engineering
3. A majority class baseline model

# Then why should I learn it? 
Because it is a useful tool in the right context. Like many of the tools at our disposal, Featuretools has an appropriate time and place. Once you understand its basic functionality, it quickly becomes clear where there could be valuable use cases, and how it might fit into your workflow as a data scientist.

### Let's begin!

---
---



In [1]:
# You can skip to "Understanding Featuretools"

### Imports

In [2]:
# The basics
import pandas as pd
import numpy as np 

### Utility Functions and Lists (skip to "Understanding Featuretools")
I'm defining a few utility functions to make cleaning our data faster. The purpose of this tutorial is understanding Featuretools, so we won't spend too much time worrying about picture-perfect cleaning.

In [3]:
#Fix dates
def fix_dates(x):

    #Import
    import datetime
    now = datetime.datetime.today()

    #Turn string to datetime
    x.date_recorded = pd.to_datetime(x['date_recorded'],format = '%Y-%m-%d')

    #Turn date into how long ago it happened (now minus the date, so the age is positive)
    x['age'] = now - x['date_recorded']

    #sklearn doesn't like time. Turn it into an int
    x['age'] = x['age'].dt.days
    
    return x

#remove the columns that we don't want
def drop_stuff(x):
    x = x.drop(to_drop, axis=1)

    return x

#label NaNs
def label_nans(x):
    cols = ['funder', 'permit', 'installer', 'subvillage', 'scheme_name',
            'public_meeting', 'scheme_management']
    #Assign the result back; inplace=True on a single column may not update x
    x[cols] = x[cols].fillna('unknown')

    return x

#Clean Data
def clean_data(x):
    x = label_nans(x)
    x = fix_dates(x)
    x = drop_stuff(x)
    
    return x

In [4]:
#drop categories that are excessive, or drop redundant
to_drop = ['funder', 'installer', 'wpt_name', 'subvillage','region_code',
          'ward', 'scheme_name','payment', 'quantity_group', 'recorded_by',]

#reserved in case I need to drop gps coords
angry_model = ['gps_height','longitude','latitude']

# Understanding Featuretools
---
---

## Organizational Structure 
There are 3 key pieces of terminology to understand the structure that Featuretools uses.

### Entities
An entity is just a table. Each pandas dataframe is an entity.

### Entity Sets
This is just a group of entities. If you have three pandas dataframes that you want to use with Featuretools then they will all be contained in the same Entity Set.

### Relationships
The most important thing to understand to get started is how relationships function in Featuretools. 

Relationships are organized as "parent" and "child" relationships. The "parent" and "child" are both dataframes. The relationship between parent and child is a shared feature (column). The parent can only have unique values in the shared feature, while the child can have repeats of values in the shared feature. 

This can be hard to wrap your head around, so I drew a picture.

'Dog Breeds Dataframe' (Entity 1) and 'Dogs at Park Today Dataframe' (Entity 2) are both entities that are contained in the entity set called 'Entity Set'. 

Entity 1 has a list of dog breeds and their (fake) attributes. Entity 2 is a list of observations about dogs at the dog park today. 

Each breed is only listed once under the column "Breed" in Entity 1, but breeds may appear multiple times in the "breed" column of Entity 2. That makes Entity 1 the parent and Entity 2 the child for this particular relationship.
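
The same parent/child rule can be checked directly in pandas. This is a small sketch with invented dog data. The column names (`breed`, `avg_weight_kg`, `dog_name`) are placeholders of mine, not the exact ones in the picture.

```python
import pandas as pd

# Parent: one row per breed (attribute values are invented).
breeds = pd.DataFrame({
    "breed": ["Corgi", "Husky", "Poodle"],
    "avg_weight_kg": [12, 23, 25],
})

# Child: one row per dog seen at the park today; breeds may repeat.
dogs_at_park = pd.DataFrame({
    "dog_name": ["Rex", "Bo", "Luna", "Max"],
    "breed": ["Corgi", "Husky", "Corgi", "Corgi"],
})

# The parent's shared column must hold unique values.
parent_is_unique = breeds["breed"].is_unique

# The child's shared column is allowed to repeat values.
child_has_repeats = dogs_at_park["breed"].duplicated().any()

# Every breed seen in the child should exist in the parent.
all_breeds_known = dogs_at_park["breed"].isin(breeds["breed"]).all()

print(parent_is_unique, child_has_repeats, all_breeds_known)
```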

<img src="relationship_diagram_updated.png" />

# Prepare our data
---
---
The data that we're using is from a previous Kaggle competition. The goal of the competition was to take the data provided and predict which water pumps would be in need of repair. The data is fairly dirty, has lots of different data types, and is great to learn on.

Our first step is to take our dataframe and cut it into a more digestible chunk so that it's easier for us to understand exactly what Featuretools is doing.

In [6]:
# Allow us to view up to 500 columns so that we don't deal with the '...'
pd.set_option('display.max_columns', 500)

# Load in our full set of train data
X = pd.read_csv('train_features.csv')
y = pd.read_csv('train_labels.csv')
# Use cleaning function
X = clean_data(X)

# features that we want included for instructional purposes
subset = ['id', 'date_recorded', 'num_private', 'basin', 'population', 'public_meeting']

# a dataframe with a random sample of 1000 rows, including only our selected features
practice = X[subset].sample(1000, random_state=13)


In [7]:
practice_target = y.sample(1000, random_state=13)
practice_target = practice_target.status_group

In [8]:
practice.head()

index,id,date_recorded,num_private,basin,population,public_meeting
49126,12237,2013-01-23,0,Ruvuma / Southern Coast,123,False
9836,24566,2013-02-07,0,Lake Victoria,0,True
6587,20536,2011-07-15,0,Lake Victoria,0,True
24047,30633,2011-03-25,0,Lake Nyasa,0,True
314,2993,2011-02-17,0,Wami / Ruvu,500,True


id                0
date_recorded     0
num_private       0
basin             0
population        0
public_meeting    0
dtype: int64

In [22]:
# What does the data in our sub-sample look like?
for col in practice.columns:
    # Look each column up by name; integer lookups on label-indexed Series are deprecated in recent pandas
    print(col,'\nData Type:',practice[col].dtype,'\nUnique Values:', practice[col].nunique(),'\nNull:', practice[col].isna().sum(),'\n')

id 
Data Type: int64 
Unique Values: 1000 
Null: 0 

date_recorded 
Data Type: datetime64[ns] 
Unique Values: 246 
Null: 0 

num_private 
Data Type: int64 
Unique Values: 11 
Null: 0 

basin 
Data Type: object 
Unique Values: 9 
Null: 0 

population 
Data Type: int64 
Unique Values: 157 
Null: 0 

public_meeting 
Data Type: object 
Unique Values: 3 
Null: 0 



# Automated feature engineering time!
---
---

In [22]:
# Step 1, import featuretools
import featuretools as ft


In [23]:
# Step 2, create a new entity set
es = ft.EntitySet('Entity Set')

In [24]:
# Step 3, add our entities
es.entity_from_dataframe(dataframe=practice, # the dataframe used to construct the entity
                        entity_id='entity_1', # the ft reference name for this entity 
                        index='id' # the feature with unique values to use as an index
                        )


Entityset: Entity Set
  Entities:
    entity_1 [Rows: 1000, Columns: 6]
  Relationships:
    No relationships

# Houston, we have a problem.
---
---
We have only 1 entity under Entities, and no relationships. In our example with the dog breeds, we had two dataframes so that we could create a relationship for featuretools to use. We only have a single dataframe to work with, which means only 1 entity. 

Luckily, featuretools allows us to create new entities from existing entities using "normalize_entity". 

# WAIT!!! Why should I care about normalize_entity?

When we create a new entity, we're able to generate aggregations and interaction terms for it at a large scale. 

Think back to the question: "Is there any relationship between the basins and any of our other features?" When we use normalize_entity on a feature, we are able to use Featuretools to examine a large number of relationships between that feature and the rest of the features in our dataframe. 

In [25]:
# Create a new entity from an existing entity's features

es.normalize_entity(base_entity_id='entity_1', # the entity that has the column of interest
                   new_entity_id='basins', # the ft reference name for this entity
                    index='basin' # the name of the column of interest
                   )

Entityset: Entity Set
  Entities:
    entity_1 [Rows: 1000, Columns: 6]
    basins [Rows: 9, Columns: 1]
  Relationships:
    entity_1.basin -> basins.basin
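
Conceptually, the new `basins` entity is a table with one row per unique basin, which makes it a valid parent of `entity_1`. The pandas sketch below is only an analogy on a toy dataframe, not a description of how Featuretools implements it internally.

```python
import pandas as pd

# Toy version of our practice dataframe; the values are illustrative.
practice_demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "basin": ["Pangani", "Rufiji", "Pangani", "Internal"],
    "population": [10, 0, 250, 40],
})

# Pull the column of interest out into its own table, one row per unique value.
basins_demo = practice_demo[["basin"]].drop_duplicates().reset_index(drop=True)

print(basins_demo)
```

`basins_demo` is the parent (unique basins) and `practice_demo` is the child (basins repeat), linked by the shared `basin` column.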

# Actually time to do some automated feature engineering!
---
---
It will be easier to run the code and look at what featuretools generated before I explain exactly what featuretools is doing. After that, we'll discuss the featuretools approach to feature generation. The only thing that you need to know for this next step is that the target_entity is the entity that we are adding the new features to. It will usually be the dataframe that your model uses to generate predictions.

In [26]:
# this is standard syntax for featuretools.
# fm is your dataframe with both the old and newly engineered features
# features is your list of features in fm 
fm, features = ft.dfs(entityset=es,
                     target_entity='entity_1') 

In [27]:
fm.head()

id,num_private,basin,population,public_meeting,DAY(date_recorded),YEAR(date_recorded),MONTH(date_recorded),WEEKDAY(date_recorded),basins.SUM(entity_1.num_private),basins.SUM(entity_1.population),basins.STD(entity_1.num_private),basins.STD(entity_1.population),basins.MAX(entity_1.num_private),basins.MAX(entity_1.population),basins.SKEW(entity_1.num_private),basins.SKEW(entity_1.population),basins.MIN(entity_1.num_private),basins.MIN(entity_1.population),basins.MEAN(entity_1.num_private),basins.MEAN(entity_1.population),basins.COUNT(entity_1),basins.NUM_UNIQUE(entity_1.public_meeting),basins.MODE(entity_1.public_meeting)
6,0,Internal,0,True,20,2012,10,5,0,24734,0.0,494.74989,0,3000,0.0,4.270376,0,0,0.0,209.610169,118,3,True
146,0,Rufiji,0,True,21,2011,3,0,0,18870,0.0,413.1729,0,4310,0.0,8.479129,0,0,0.0,150.96,125,2,True
253,0,Lake Tanganyika,520,True,3,2013,2,6,0,27733,0.0,355.356273,0,2350,0.0,2.794793,0,0,0.0,231.108333,120,3,True
298,0,Wami / Ruvu,50,True,22,2011,3,1,0,26694,0.0,1006.729314,0,6922,0.0,6.215221,0,0,0.0,278.0625,96,3,True
348,0,Pangani,1,False,3,2013,7,2,224,32174,7.900171,449.86718,65,3620,6.228074,5.02688,0,1,1.493333,214.493333,150,2,True


In [28]:
features

[<Feature: num_private>,
 <Feature: basin>,
 <Feature: population>,
 <Feature: public_meeting>,
 <Feature: DAY(date_recorded)>,
 <Feature: YEAR(date_recorded)>,
 <Feature: MONTH(date_recorded)>,
 <Feature: WEEKDAY(date_recorded)>,
 <Feature: basins.SUM(entity_1.num_private)>,
 <Feature: basins.SUM(entity_1.population)>,
 <Feature: basins.STD(entity_1.num_private)>,
 <Feature: basins.STD(entity_1.population)>,
 <Feature: basins.MAX(entity_1.num_private)>,
 <Feature: basins.MAX(entity_1.population)>,
 <Feature: basins.SKEW(entity_1.num_private)>,
 <Feature: basins.SKEW(entity_1.population)>,
 <Feature: basins.MIN(entity_1.num_private)>,
 <Feature: basins.MIN(entity_1.population)>,
 <Feature: basins.MEAN(entity_1.num_private)>,
 <Feature: basins.MEAN(entity_1.population)>,
 <Feature: basins.COUNT(entity_1)>,
 <Feature: basins.NUM_UNIQUE(entity_1.public_meeting)>,
 <Feature: basins.MODE(entity_1.public_meeting)>]

# So what just happened? 
---
---
The first thing that we can see is that ft separated our date_recorded feature into DAY, YEAR, MONTH, and WEEKDAY, and then dropped date_recorded. We can tell which feature a new feature was derived from by looking inside the parentheses of its name, for example DAY(date_recorded). 

### Transformations
This act of taking one feature and transforming it into a new feature is called, simply enough, "transformation". It is one of the two categories of operations that featuretools refers to as "primitives". 
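
The date transformations above can be reproduced by hand with the pandas `.dt` accessor. In this sketch the two dates are reconstructed from the DAY, MONTH, and YEAR values in the first two rows of `fm`, and the column names simply mimic the Featuretools naming.

```python
import pandas as pd

# Dates rebuilt from the first two rows of the feature matrix above.
dates = pd.DataFrame({"date_recorded": pd.to_datetime(["2012-10-20", "2011-03-21"])})

dates["DAY(date_recorded)"] = dates["date_recorded"].dt.day
dates["YEAR(date_recorded)"] = dates["date_recorded"].dt.year
dates["MONTH(date_recorded)"] = dates["date_recorded"].dt.month
# pandas counts weekdays from Monday = 0 through Sunday = 6
dates["WEEKDAY(date_recorded)"] = dates["date_recorded"].dt.weekday

print(dates)
```

The weekday values come out as 5 and 0, matching the WEEKDAY column for ids 6 and 146 in the output above.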

### Aggregations
The other primitive operation is called an aggregation. An example of an aggregation is basins.SUM(entity_1.num_private). We'll explore this in more depth in the next cells. 

If you look below, you'll see 2 pandas groupby aggregations: one numeric and one categorical. These are the same kind of aggregations that featuretools generated for us.  

In [29]:
# group by basin, aggregate our numeric columns, and show...
# ...the sum, stddev, max, skew, min, and mean values 
# (note the double brackets: selecting several columns requires a list)
numeric_aggregation = practice.groupby('basin')[['num_private','population']].agg(['sum', 'std', 'max', 'skew', 'min', 'mean'])

# group by basin, aggregate our categorical column, and show...
# ...the number of unique values
# mode for categorical data takes some extra work in pandas, so we're skipping it here 
categorical_aggregation = practice.groupby('basin')['public_meeting'].agg(['nunique']) 

In [30]:
numeric_aggregation

,num_private,num_private,num_private,num_private,num_private,num_private,population,population,population,population,population,population
basin,sum,std,max,skew,min,mean,sum,std,max,skew,min,mean
Internal,0,0.0,0,0.0,0,0.0,24734,494.74989,3000,4.270376,0,209.610169
Lake Nyasa,56,5.642722,55,9.741917,0,0.589474,5313,131.061886,800,3.725055,0,55.926316
Lake Rukwa,0,0.0,0,0.0,0,0.0,12112,1456.556731,8848,5.872797,0,327.351351
Lake Tanganyika,0,0.0,0,0.0,0,0.0,27733,355.356273,2350,2.794793,0,231.108333
Lake Victoria,0,0.0,0,0.0,0,0.0,25622,480.661905,4000,6.09809,0,149.836257
Pangani,224,7.900171,65,6.228074,0,1.493333,32174,449.86718,3620,5.02688,1,214.493333
Rufiji,0,0.0,0,0.0,0,0.0,18870,413.1729,4310,8.479129,0,150.96
Ruvuma / Southern Coast,0,0.0,0,0.0,0,0.0,26706,388.356454,2210,2.528716,0,303.477273
Wami / Ruvu,0,0.0,0,0.0,0,0.0,26694,1006.729314,6922,6.215221,0,278.0625


In [31]:
categorical_aggregation

basin,nunique
Internal,3
Lake Nyasa,3
Lake Rukwa,2
Lake Tanganyika,3
Lake Victoria,3
Pangani,2
Rufiji,2
Ruvuma / Southern Coast,3
Wami / Ruvu,3
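
For completeness, here is one way the categorical mode could be computed per group in pandas. This is a sketch on toy data. The values are stored as strings for simplicity, and taking the first mode on ties is my assumption, so Featuretools may break ties differently.

```python
import pandas as pd

# Toy data; public_meeting is stored as strings to keep the example simple.
demo = pd.DataFrame({
    "basin": ["Pangani", "Pangani", "Pangani", "Rufiji", "Rufiji"],
    "public_meeting": ["True", "True", "False", "unknown", "unknown"],
})

# Series.mode() can return several values when there is a tie; keep the first.
mode_by_basin = demo.groupby("basin")["public_meeting"].agg(lambda s: s.mode().iloc[0])

print(mode_by_basin)
```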


---
---
Aggregations are bread-and-butter feature engineering operations. Groupby, aggregation, join to the original dataframe. It can get tedious. Featuretools was able to handle it for us.
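
To make the "groupby, aggregation, join" routine concrete, here is a minimal sketch of doing it by hand on a toy dataframe. The column names imitate the ones Featuretools produced above. The data and the `fm_demo` name are invented for illustration.

```python
import pandas as pd

# Toy child table; values are illustrative only.
demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "basin": ["Pangani", "Rufiji", "Pangani", "Rufiji"],
    "population": [10, 0, 250, 40],
})

# 1. Group by the parent key and aggregate the child rows.
agg = demo.groupby("basin")["population"].agg(["sum", "mean", "max"])

# 2. Rename the columns in the style Featuretools used above.
agg.columns = [f"basins.{name.upper()}(entity_1.population)" for name in agg.columns]

# 3. Join the aggregates back onto every original row.
fm_demo = demo.merge(agg, left_on="basin", right_index=True, how="left").set_index("id")

print(fm_demo)
```

Each extra column or aggregation function means repeating these three steps, which is the bookkeeping that `ft.dfs` took care of in a single call.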

However, the functions that we've demonstrated so far have only scratched the surface of what featuretools can do. 