Run FeatureTools on Spark to automate feature engineering in a distributed way.

pan5431333/featuretools4s

FeatureTools for Spark (featuretools4s)

1. What's FeatureTools?

FeatureTools is a Python library open-sourced by Feature Labs, an MIT spin-off, that aims to automate feature engineering in machine learning applications.

Please visit the official website for more details about FeatureTools.

FeatureTools4S is a Python library I wrote to scale FeatureTools with Spark, so that it can generate features for billions of rows of data, a volume that is usually considered impossible to process on a single machine using the original FeatureTools library with Pandas.

FeatureTools4S provides almost the same API as the original FeatureTools, so users can move between FeatureTools and FeatureTools4S with little effort. We therefore suggest that readers learn FeatureTools first; working with FeatureTools4S should then be straightforward.

2. How to use FeatureTools4S?

First install featuretools4s through pip:

pip3 install featuretools4s 

Then a simple example of using featuretools4s is as follows:

import featuretools4s as fts
from pyspark.sql import SparkSession

import os
import pandas as pd

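# Point Spark at the local PySpark installation (Windows paths; adjust for your machine).
# Raw strings keep the backslashes from being read as escape sequences.
os.environ["SPARK_HOME"] = r"C:\Python36\Lib\site-packages\pyspark"
os.environ["PATH"] = r"C:\Python36;" + os.environ["PATH"]
# Print wide Pandas frames on one line instead of wrapping them.
pd.set_option('display.expand_frame_repr', False)
# Start a local Spark session that uses all available cores.
spark = SparkSession.builder.master("local[*]").getOrCreate()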

# Load the order and customer tables as Spark DataFrames.
order_df = spark.read.csv("C:/Users/MengPan/PycharmProjects/BelleTire/resources/order.csv", header=True, inferSchema=True).sort("sales_tax")
customer_df = spark.read.csv("C:/Users/MengPan/PycharmProjects/BelleTire/resources/customer.csv", header=True, inferSchema=True)

# Build the entity set: two order entities (both backed by order_df) and one customer entity.
es = fts.EntitySet(id="test")
es.entity_from_dataframe("order", order_df, index="order_num", time_index="wo_timestamp")
es.entity_from_dataframe("order2", order_df, index="order_num", time_index="wo_timestamp")
es.entity_from_dataframe("customer", customer_df, index="cust_num")
# Link each order entity to its parent customer through cust_num.
es.add_relationship(fts.Relationship(es["customer"]["cust_num"], es["order"]["cust_num"]))
es.add_relationship(fts.Relationship(es["customer"]["cust_num"], es["order2"]["cust_num"]))

# Run deep feature synthesis for the customer entity and print the generated features.
features = fts.dfs(spark, entityset=es, target_entity="customer", primary_entity="customer",
                   primary_col="cust_num", num_partition=5)
features.show()
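
The `primary_col` and `num_partition` arguments above suggest that rows are split into partitions by a key column so that features can be computed for each partition independently. The following is a minimal standard-library sketch of that idea only; it is an assumption about the general approach, not the actual featuretools4s implementation. The toy `orders` data and the helpers `partition_by_key`, `aggregate_partition` and `distributed_features` are hypothetical names made up for illustration. COUNT, SUM and MEAN stand in for the kind of aggregation features FeatureTools generates.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (order_num, cust_num, amount). Not from the README.
orders = [
    (1, "a", 10.0),
    (2, "b", 5.0),
    (3, "a", 30.0),
    (4, "c", 8.0),
    (5, "b", 15.0),
]


def partition_by_key(rows, key_index, num_partition):
    """Place all rows that share a key into the same partition."""
    partitions = [[] for _ in range(num_partition)]
    for row in rows:
        # Sum of character codes gives a slot that is stable across runs
        # (Python's built-in hash() of a str is randomized per process).
        slot = sum(map(ord, str(row[key_index]))) % num_partition
        partitions[slot].append(row)
    return partitions


def aggregate_partition(rows):
    """Compute COUNT, SUM and MEAN of the order amount for each customer."""
    amounts = defaultdict(list)
    for _order_num, cust_num, amount in rows:
        amounts[cust_num].append(amount)
    return {
        cust: {"COUNT": len(vals), "SUM": sum(vals), "MEAN": mean(vals)}
        for cust, vals in amounts.items()
    }


def distributed_features(rows, num_partition):
    """Aggregate each partition on its own, then merge the results.

    Because a customer never spans two partitions, merging is a plain
    dictionary update with no cross-partition combination step.
    """
    features = {}
    for part in partition_by_key(rows, key_index=1, num_partition=num_partition):
        features.update(aggregate_partition(part))
    return features


features = distributed_features(orders, num_partition=2)
```

In this sketch the result does not depend on `num_partition`, because every customer's rows stay together; on a cluster, each partition could be handed to a different worker.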
