Run FeatureTools to automate Feature Engineering distributionally on Spark.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.
.gitignore Initial commit Oct 10, 2018

FeatureTools for Spark (featuretools4s)

1. What's FeatureTools?

FeatureTools is a Python library open-sourced by MIT's FeatureLab aiming to automate the process of feature engineering in Machine Learning applications.

Please visit the official website for more details about FeatureTools.

FeatureTools4S is a Python library written by me aiming to scale FeatureTools with Spark, making it capable of generating features for billions of rows of data, which is usually considered impossible to process on single machine using original FeatureTools library with Pandas.

FeatureTools4S provides almost the same API as original FeatureTools, which make its users completely free of transferring between FeatureTools and FeatureTools4S. Hence we suggest the readers first to learn FeatureTools and then you can easily work on FeatureTools4S.

2. How to use FeatureTools4S?

First install featuretools4s through pip:

pip3 install featuretools4s 

Then a simple example of using featuretools4s is as follows:

import featuretools4s as fts
from pyspark.sql import SparkSession

import os
import pandas as pd

os.environ["SPARK_HOME"] = "C:\Python36\Lib\site-packages\pyspark"
os.environ["PATH"] = "C:\Python36;" + os.environ["PATH"]
pd.set_option('display.expand_frame_repr', False)
spark = SparkSession.builder.master("local[*]").getOrCreate()

order_df ="C:/Users/MengPan/PycharmProjects/BelleTire/resources/order.csv", header=True, inferSchema=True).sort("sales_tax")
customer_df ="C:/Users/MengPan/PycharmProjects/BelleTire/resources/customer.csv", header=True, inferSchema=True)

es = fts.EntitySet(id="test")
es.entity_from_dataframe("order", order_df, index="order_num", time_index="wo_timestamp")
es.entity_from_dataframe("order2", order_df, index="order_num", time_index="wo_timestamp")
es.entity_from_dataframe("customer", customer_df, index="cust_num")
es.add_relationship(fts.Relationship(es["customer"]["cust_num"], es["order"]["cust_num"]))
es.add_relationship(fts.Relationship(es["customer"]["cust_num"], es["order2"]["cust_num"]))

features = fts.dfs(spark, entityset=es, target_entity="customer", primary_entity="customer",
                   primary_col="cust_num", num_partition=5)