# Featuretools 
Featuretools is a framework for automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.

In [1]:
import featuretools as ft

## Loading Raw Data
The cells below load two separate tables, each represented as a pandas DataFrame:
1. A merge of the transactions, sessions, and customers tables
2. A list of products
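
To see what the chained merge does before running it on the demo data, here is a minimal sketch on three tiny tables. The table contents are made up for illustration; only the column names mirror the demo dataset. It relies on the pandas default of joining on whichever columns the two frames share.

```python
import pandas as pd

# Tiny hypothetical tables shaped like the demo data (values are made up).
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "session_id": [10, 10, 11],
    "amount": [5.0, 7.5, 2.0],
})
sessions = pd.DataFrame({
    "session_id": [10, 11],
    "customer_id": [1, 2],
    "device": ["mobile", "desktop"],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "zip_code": ["60091", "13244"],
})

# With no `on=` argument, merge joins on the columns the two frames share:
# session_id for the first merge, customer_id for the second.
flat = transactions.merge(sessions).merge(customers)

# One row per transaction, with session and customer columns repeated on each.
print(flat)
```

The result keeps one row per transaction, which is why the merged table is later registered under the entity id "transactions".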

In [2]:
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
transactions_df.sample(10)

| | transaction_id | session_id | transaction_time | product_id | amount | customer_id | device | session_start | zip_code | join_date | date_of_birth |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 264 | 380 | 21 | 2014-01-01 05:14:10 | 5 | 57.09 | 4 | desktop | 2014-01-01 05:02:15 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
| 19 | 244 | 10 | 2014-01-01 02:34:55 | 2 | 116.95 | 2 | tablet | 2014-01-01 02:31:40 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
| 314 | 299 | 6 | 2014-01-01 01:32:05 | 4 | 64.99 | 1 | tablet | 2014-01-01 01:23:25 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
| 290 | 78 | 4 | 2014-01-01 00:54:10 | 1 | 37.5 | 1 | mobile | 2014-01-01 00:44:25 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
| 379 | 457 | 27 | 2014-01-01 06:37:35 | 1 | 19.16 | 1 | mobile | 2014-01-01 06:34:20 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
| 335 | 477 | 9 | 2014-01-01 02:30:35 | 3 | 41.7 | 1 | desktop | 2014-01-01 02:15:25 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
| 293 | 103 | 4 | 2014-01-01 00:57:25 | 5 | 20.79 | 1 | mobile | 2014-01-01 00:44:25 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
| 271 | 390 | 22 | 2014-01-01 05:21:45 | 2 | 54.83 | 4 | desktop | 2014-01-01 05:21:45 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
| 404 | 476 | 29 | 2014-01-01 07:24:10 | 4 | 121.59 | 1 | mobile | 2014-01-01 07:10:05 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
| 179 | 90 | 3 | 2014-01-01 00:35:45 | 1 | 75.73 | 4 | mobile | 2014-01-01 00:28:10 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |


In [6]:
# only 5 unique products in this example
products_df = data["products"]
products_df

| | product_id | brand |
|---|---|---|
| 0 | 1 | B |
| 1 | 2 | B |
| 2 | 3 | B |
| 3 | 4 | B |
| 4 | 5 | A |


## Creating an EntitySet
<font color='red'>**Naming Conventions**</font> 
- **Entity:** Equivalent to a table in a relational database. Represented by the Entity class.
- **Instance:** Equivalent to a row in a relational database. Each entity has many instances, and each instance has a value for each variable and feature defined on the entity.
- **Variable:** Equivalent to a column in a relational database. Represented by the Variable class.
- **Feature:** A transformation of data used for machine learning. Featuretools has its own language for defining features, described in the Featuretools documentation. All features are represented by subclasses of FeatureBase.
- **EntitySet:** A collection of entities and the relationships between them. Represented by the EntitySet class.
- **Parent Entity:** An entity that is referenced by another entity via a relationship. The "one" in a one-to-many relationship.
- **Target Entity:** The entity for which we will build features.

<font color='red'>**Things to Note**</font><br>
<font color='red'>I always need the ability to easily find and differentiate between raw dataframes and entity objects as well as raw columns and derived features</font>
1. There's a clear separation between the raw dataframes I loaded and the explicit entities I define below as part of my EntitySet
2. There's a clear separation between the columns that are loaded as part of the raw dataframes (referred to as "variables" above) and the features I will create later on

### Creating and Adding Individual Entities
<font color='red'>**Things to Note**</font><br>
<font color='red'>I always need the ability to do the following w.r.t. creating entities:
1. **Add a friendly name to a given entity**
2. **Add a friendly description to a given entity:** For example, "customer_data" represents a combination of three separate data sources: transactions, sessions, and customers. If I can leave myself an entity description to this effect, I don't have to carry the cognitive burden of keeping track of this information throughout my workflow.
3. **Create an index:** to identify the column that uniquely identifies rows in the entity
4. **Create a time index:** to determine the first and last dated records in any entity
5. **Type-cast columns:** For example, I cast product_id to categorical below so that the appropriate primitives apply to it
6. **View variable vs. feature count:** I want to know how many columns correspond to "variables" vs. "features" given the definitions above</font>
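
Items 3 to 5 can be checked on a raw dataframe before it becomes an entity. The following is a minimal pandas sketch on a made-up three-row table, not part of the Featuretools API. It checks that a candidate index is unique, reads the first and last timestamps from the would-be time index, and casts an integer ID column to a categorical dtype.

```python
import pandas as pd

# Hypothetical stand-in for the merged transactions table (values are made up).
df = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "transaction_time": pd.to_datetime([
        "2014-01-01 00:44:25",
        "2014-01-01 02:31:40",
        "2014-01-01 05:02:15",
    ]),
    "product_id": [5, 2, 5],
})

# An index column should identify each row uniquely.
index_is_unique = df["transaction_id"].is_unique

# The time index gives the first and last dated records.
first_seen = df["transaction_time"].min()
last_seen = df["transaction_time"].max()

# Integer IDs are labels, not quantities, so treat them as categories.
df["product_id"] = df["product_id"].astype("category")

print(index_is_unique, first_seen, last_seen, df["product_id"].dtype)
```

Running these checks up front makes it easier to trust the `index`, `time_index`, and `variable_types` arguments passed when the entity is created.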

In [9]:
# First, we initialize an empty EntitySet
es = ft.EntitySet(id="customer_data")
es

Entityset: customer_data
  Entities:
  Relationships:
    No relationships

In [13]:
# Next, we add our first individual entity to the entityset
# Great naming convention here: 'entity_from_X'
es = es.entity_from_dataframe(entity_id="transactions", 
                              dataframe=transactions_df,
                              index="transaction_id",
                              time_index="transaction_time",
                              variable_types={"product_id": ft.variable_types.Categorical,
                                              "zip_code": ft.variable_types.ZIPCode})

es

Entityset: customer_data
  Entities:
    transactions [Rows: 500, Columns: 11]
  Relationships:
    No relationships

In [14]:
# Each column (from raw dataframes) is loaded in as a variable
es["transactions"].variables

[<Variable: transaction_id (dtype = index)>,
 <Variable: session_id (dtype = numeric)>,
 <Variable: transaction_time (dtype: datetime_time_index, format: None)>,
 <Variable: amount (dtype = numeric)>,
 <Variable: customer_id (dtype = numeric)>,
 <Variable: device (dtype = categorical)>,
 <Variable: session_start (dtype: datetime, format: None)>,
 <Variable: join_date (dtype: datetime, format: None)>,
 <Variable: date_of_birth (dtype: datetime, format: None)>,
 <Variable: product_id (dtype = categorical)>,
 <Variable: zip_code (dtype = zip_code)>]

In [16]:
# Next, we add another individual entity
es = es.entity_from_dataframe(entity_id="products",
                              dataframe=products_df,
                              index="product_id")

es

Entityset: customer_data
  Entities:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    No relationships

<font color='red'>**Things to Note**</font><br>

<font color='red'>I have the notion of parent-child entities and separately, parent variables as well as child variables. Explicitly noting certain variables as child variables helps me keep track of the direction of aggregation. Normally, I'd have to explicitly think about these relationships when doing feature engineering work in SQL or Python. Now that I don't have to anymore, help me build trust in the system and interpret it more easily by explicitly calling out relationships that define the direction of aggregation.</font>

In [18]:
# Next we define a parent-child relationship between the transactions and products entities. 
# "Products" is the parent entity and "Transactions" is the child entity (one product to many transactions)
# Note that each ft.Relationship must denote a one-to-many relationship rather than one-to-one or many-to-many
new_relationship = ft.Relationship(es["products"]["product_id"],
                                   es["transactions"]["product_id"])

es = es.add_relationship(new_relationship)

es

Entityset: customer_data
  Entities:
    transactions [Rows: 500, Columns: 11]
    products [Rows: 5, Columns: 2]
  Relationships:
    transactions.product_id -> products.product_id
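
To make the direction of aggregation concrete, here is a minimal pandas sketch of what a one-to-many relationship implies. The tables and the feature names `n_transactions` and `mean_amount` are made up for illustration and are not Featuretools output. Child rows (transactions) are aggregated up to the parent (products), never the other way around.

```python
import pandas as pd

# Hypothetical parent and child tables (values are made up).
products = pd.DataFrame({"product_id": [1, 2], "brand": ["B", "A"]})
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "product_id": [1, 1, 2],
    "amount": [10.0, 20.0, 5.0],
})

# The parent side of a one-to-many relationship must have unique keys.
assert products["product_id"].is_unique

# Aggregate the many child rows up to one row per parent key.
per_product = (
    transactions.groupby("product_id")
    .agg(n_transactions=("amount", "count"), mean_amount=("amount", "mean"))
    .reset_index()
)

# Attach the aggregated values to the parent table.
product_features = products.merge(per_product, on="product_id", how="left")
print(product_features)
```

Doing this by hand requires remembering which table is the "one" side. Declaring the relationship once in the EntitySet records that direction explicitly.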

### Normalize Entities
Normalizing lets me create new entities from the columns of an existing entity, which reduces the amount of redundant information. Here, sessions and customers can be normalized out of the original transactions_df, which was loaded with the entity_id "transactions".
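
The notebook does not show the normalization call itself. In the entity-based Featuretools API used here, this is, as far as I know, done with `EntitySet.normalize_entity`. The sketch below deliberately sticks to plain pandas and a made-up flat table to illustrate what normalization achieves conceptually: repeated session and customer columns are pulled out into their own one-row-per-key tables.

```python
import pandas as pd

# Hypothetical flat table: session and customer columns repeat on every row.
flat = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "session_id": [10, 10, 11, 12],
    "amount": [5.0, 7.5, 2.0, 9.0],
    "device": ["mobile", "mobile", "desktop", "tablet"],
    "customer_id": [1, 1, 2, 1],
    "zip_code": ["60091", "60091", "13244", "60091"],
})

# Session-level columns become their own table, one row per session.
sessions = (
    flat[["session_id", "device", "customer_id"]]
    .drop_duplicates("session_id")
    .reset_index(drop=True)
)

# Customer-level columns become their own table, one row per customer.
customers = (
    flat[["customer_id", "zip_code"]]
    .drop_duplicates("customer_id")
    .reset_index(drop=True)
)

# The transactions table keeps only its own columns plus the session key.
transactions = flat[["transaction_id", "session_id", "amount"]]

# Joining the three tables back together recovers the flat table's shape.
rebuilt = transactions.merge(sessions).merge(customers)
print(len(sessions), len(customers), rebuilt.shape)
```

The normalized tables form a chain of one-to-many relationships (customers to sessions to transactions), which is the structure an EntitySet is meant to capture.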

### Visualize EntitySets

In [19]:
es.plot()

ImportError: Please install graphviz to plot. (See https://docs.featuretools.com/en/stable/getting_started/install.html#installing-graphviz for details)
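
The error above means plotting could not find graphviz. As a rough pre-flight check, the sketch below uses only the standard library to look for the `graphviz` Python package and a `dot` executable on the PATH. The helper name `graphviz_available` is hypothetical, and the assumption that both pieces are required comes from the linked install page, so treat this as a convenience check, not a guarantee that `es.plot()` will succeed.

```python
import importlib.util
import shutil


def graphviz_available() -> bool:
    """Return True if the graphviz Python package and the `dot` binary are both found."""
    has_python_package = importlib.util.find_spec("graphviz") is not None
    has_dot_binary = shutil.which("dot") is not None
    return has_python_package and has_dot_binary


if graphviz_available():
    print("graphviz found: es.plot() should be able to render")
else:
    print("graphviz missing: install it before calling es.plot()")
```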