# Auto feature engineering on NYC Taxi Fare

* ### [Example 1](#Example-1): process with Featuretools; took about 2908 secs, expanding the 6 input features into a 12-column output (label included)
* ### [Example 2](#Example-2): process with RecDP on Spark; took about 115 secs, expanding the 6 input features into a 12-column output (label included)
* ### [RMSE evaluation](#Estimator): use RMSE to evaluate whether auto feature engineering improved the score
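
The totals quoted above can be cross-checked against the timing logs printed further down in this notebook. This is a minimal sketch, assuming the Featuretools total counts the three processing stages (coordinate conversion, EntitySet loading, DFS) and excludes the CSV read; the variable names are illustrative.

```python
# Stage timings (seconds) copied from the Timer logs in this notebook.
featuretools_stages = {
    "manually convert geo points": 426.60777373984456,
    "load data to entityset": 1607.2522095814347,
    "DFS feature generation": 875.3592023644596,
}
recdp_transform = 115.15144114103168  # "transform took ..." in Example 2

featuretools_total = sum(featuretools_stages.values())
speedup = featuretools_total / recdp_transform

print(f"Featuretools total: {featuretools_total:.1f} sec")
print(f"RecDP transform:    {recdp_transform:.1f} sec")
print(f"Speedup:            {speedup:.1f}x")
```

The three Featuretools stages sum to roughly 2909 secs, in line with the figure quoted above, which is about 25 times the RecDP transform time.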

In [1]:
# data set schema
import pandas as pd
from pathlib import Path
from utils import Timer
import os, sys
pathlib = str(Path(os.path.abspath('')).parent.parent.parent.resolve())
train_data = pd.read_csv(f"{pathlib}/dataset/nyc_taxi_fare/nyc_taxi_fare_cleaned.csv")
train_data

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.841610,40.712278,1
1,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.761270,-73.991242,40.750562,2
3,7.7,2012-04-21 04:30:42 UTC,-73.987130,40.733143,-73.991567,40.758092,1
4,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
...,...,...,...,...,...,...,...
54315950,14.0,2014-03-15 03:28:00 UTC,-74.005272,40.740027,-73.963280,40.762555,1
54315951,4.2,2009-03-24 20:46:20 UTC,-73.957784,40.765530,-73.951640,40.773959,1
54315952,14.1,2011-04-02 22:04:24 UTC,-73.970505,40.752325,-73.960537,40.797342,1
54315953,28.9,2011-10-26 05:57:51 UTC,-73.980901,40.764629,-73.870605,40.773963,1
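
The cells in this notebook wrap each step in a `Timer` context manager imported from a local `utils` module that is not shown. The sketch below is a hypothetical stand-in, assuming the helper only measures wall-clock time and prints it in the `<name> took <seconds> sec` format seen in the logs; the real implementation may differ.

```python
import time


class Timer:
    """Context manager that prints how long the enclosed block took.

    Hypothetical stand-in for `utils.Timer`; the real helper may differ.
    """

    def __init__(self, name):
        self.name = name
        self.elapsed = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.name} took {self.elapsed} sec")
        return False  # never swallow exceptions raised inside the block


with Timer("sum a range") as t:
    total = sum(range(1_000_000))
```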


# Example 1

### Using Featuretools on 55M records took about 2908 secs

In [8]:
import featuretools as ft
from featuretools.primitives import TransformPrimitive
from woodwork.logical_types import LatLong, Ordinal

import pandas as pd
from utils import Timer

def manual_coordination_convert(df):
    df["pickup_latlong"] = df[['pickup_latitude', 'pickup_longitude']].apply(tuple, axis=1)
    df["dropoff_latlong"] = df[['dropoff_latitude', 'dropoff_longitude']].apply(tuple, axis=1)
    df = df.drop(["pickup_latitude", "pickup_longitude", "dropoff_latitude", "dropoff_longitude"], axis = 1)
    return df

with Timer("read train data from csv"):
    print(f"train_data shape is {train_data.shape}")

with Timer("manually convert geo points to coordination"):
    #prepare feature tool entityset
    train_data = manual_coordination_convert(train_data)

with Timer("Load data to entityset"):
    es = ft.EntitySet("nyc_taxi_fare")
    trip_logical_types = {
        'passenger_count': Ordinal(order=list(range(0, 10))), 
        'pickup_latlong': 'LatLong',
        'dropoff_latlong': 'LatLong',
    }
    es.add_dataframe(dataframe_name="trips",
                     dataframe=train_data,
                     index="id",
                     make_index=True,  # train_data has no "id" column, so create an integer index
                     time_index='pickup_datetime',
                     logical_types=trip_logical_types)
    
with Timer("DFS feature generation"):
    cutoff_time = es['trips'][['id', 'pickup_datetime']]
    trans_primitives = ["day", "year", "month", "weekday", "hour", "is_weekend", "is_working_hours", "part_of_day"]
    trans_primitives += ["cityblock_distance", "haversine"]
    # calculate feature_matrix using deep feature synthesis
    ret_df, features = ft.dfs(entityset=es,
                      target_dataframe_name="trips",
                      trans_primitives=trans_primitives,
                      verbose=True,
                      cutoff_time=cutoff_time,
                      approximate='36d',
                      max_depth=3,
                      max_features=40)
ret_df

train_data shape is (54315955, 7)
read train data from csv took 46.042598474770784 sec
manually convert geo points to coordination took 426.60777373984456 sec




Load data to entityset took 1607.2522095814347 sec




Built 12 features
Elapsed: 13:58 | Progress: 100%|██████████
DFS feature generation took 875.3592023644596 sec


id,fare_amount,passenger_count,"CITYBLOCK_DISTANCE(dropoff_latlong, pickup_latlong)",DAY(pickup_datetime),"HAVERSINE(dropoff_latlong, pickup_latlong)",HOUR(pickup_datetime),IS_WEEKEND(pickup_datetime),IS_WORKING_HOURS(pickup_datetime),MONTH(pickup_datetime),PART_OF_DAY(pickup_datetime),WEEKDAY(pickup_datetime),YEAR(pickup_datetime)
43310508,30.2,1,11.682842,1,9.756261,0,False,False,1,midnight,3,2009
862908,15.0,1,4.439169,1,3.177903,0,False,False,1,midnight,3,2009
13073257,4.2,1,0.275526,1,0.195552,0,False,False,1,midnight,3,2009
647957,5.8,2,0.938679,1,0.793177,0,False,False,1,midnight,3,2009
12655086,14.6,1,4.305175,1,3.180219,0,False,False,1,midnight,3,2009
...,...,...,...,...,...,...,...,...,...,...,...,...
40210315,24.5,2,5.876328,30,4.770929,23,False,False,6,midnight,1,2015
13957545,6.0,2,1.241293,30,0.883764,23,False,False,6,midnight,1,2015
48940597,33.5,1,10.384043,30,7.340707,23,False,False,6,midnight,1,2015
22295217,9.5,1,2.593244,30,2.107357,23,False,False,6,midnight,1,2015


In [9]:
ret_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54315955 entries, 43310508 to 9085761
Data columns (total 12 columns):
 #   Column                                               Dtype   
---  ------                                               -----   
 0   fare_amount                                          float64 
 1   passenger_count                                      category
 2   CITYBLOCK_DISTANCE(dropoff_latlong, pickup_latlong)  float64 
 3   DAY(pickup_datetime)                                 category
 4   HAVERSINE(dropoff_latlong, pickup_latlong)           float64 
 5   HOUR(pickup_datetime)                                category
 6   IS_WEEKEND(pickup_datetime)                          boolean 
 7   IS_WORKING_HOURS(pickup_datetime)                    boolean 
 8   MONTH(pickup_datetime)                               category
 9   PART_OF_DAY(pickup_datetime)                         category
 10  WEEKDAY(pickup_datetime)                             category
 11  YEA

# Example 2

### Using RecDP with Spark on 55M records took about 115 secs

In [4]:
from pyrecdp.autofe import FeatureWrangler
with Timer("initiate autofe pipeline"):
    pipeline = FeatureWrangler(dataset=train_data, label="fare_amount")

with Timer("transform"):
    ret = pipeline.fit_transform(engine_type = 'spark')
    
print(f"transformed shape is {ret.shape}")
ret

initiate autofe pipeline took 13.097141648642719 sec
Will assign 48 cores and 308513 M memory for spark
per core memory size is 6.277 GB and shuffle_disk maximum capacity is 8589934592.000 GB
append DataFrame
append type_infer
append DataFrameToRDDConverter
DataframeConvert partition pandas dataframe to spark RDD took 26.007 secs
append tuple
append tuple
append fillna
append datetime_feature
append haversine
append drop
append RDDToDataFrameConverter
execute with spark started ...
23/03/06 09:56:51 WARN TaskSetManager: Stage 0 contains a task of very large size (86521 KiB). The maximum recommended task size is 1000 KiB.


                                                                                

DataframeTransform took 85.178 secs, processed 54315955 rows with num_partitions as 200
DataframeTransform combine to one pandas dataframe took 2.410 secs
execute with spark took 87.71510096127167 sec
transform took 115.15144114103168 sec
transformed shape is (54315955, 12)


Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime__day,pickup_datetime__month,pickup_datetime__weekday,pickup_datetime__year,pickup_datetime__hour,haversine_pickup_coordinates_dropoff_coordinates
0,4.5,-73.844311,40.721319,-73.841610,40.712278,1,15,6,0,2009,17,0.640488
1,16.9,-74.016048,40.711303,-73.979268,40.782004,1,5,1,1,2010,16,5.250677
2,5.7,-73.982738,40.761270,-73.991242,40.750562,2,18,8,3,2011,0,0.863412
3,7.7,-73.987130,40.733143,-73.991567,40.758092,1,21,4,5,2012,4,1.739388
4,5.3,-73.968095,40.768008,-73.956655,40.783762,1,9,3,1,2010,7,1.242220
...,...,...,...,...,...,...,...,...,...,...,...,...
54315950,14.0,-74.005272,40.740027,-73.963280,40.762555,1,15,3,5,2014,3,2.693273
54315951,4.2,-73.957784,40.765530,-73.951640,40.773959,1,24,3,1,2009,20,0.665235
54315952,14.1,-73.970505,40.752325,-73.960537,40.797342,1,2,4,5,2011,22,3.153803
54315953,28.9,-73.980901,40.764629,-73.870605,40.773963,1,26,10,2,2011,5,5.807441
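
The `pickup_datetime__*` columns in the table above are consistent with plain calendar-field extraction. The following is a minimal sketch, assuming that is all the `datetime_feature` step does (RecDP's actual implementation is not shown in this notebook); `datetime_features` is a hypothetical helper written with the standard library only.

```python
from datetime import datetime, timezone


def datetime_features(ts: str) -> dict:
    """Split a 'YYYY-MM-DD HH:MM:SS UTC' string into calendar features."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S UTC").replace(tzinfo=timezone.utc)
    return {
        "day": dt.day,
        "month": dt.month,
        "weekday": dt.weekday(),  # Monday == 0, Sunday == 6
        "year": dt.year,
        "hour": dt.hour,
    }


# First two pickup_datetime values from the input table.
for ts in ["2009-06-15 17:26:21 UTC", "2010-01-05 16:52:16 UTC"]:
    print(ts, datetime_features(ts))
```

For these two rows the helper yields day/month/weekday/year/hour of 15/6/0/2009/17 and 5/1/1/2010/16, matching the first two rows of the table. On a full DataFrame the same fields are available through the pandas `.dt` accessor.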


In [3]:
ret.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54315955 entries, 0 to 54315954
Data columns (total 12 columns):
 #   Column                                            Dtype  
---  ------                                            -----  
 0   fare_amount                                       float64
 1   pickup_longitude                                  float64
 2   pickup_latitude                                   float64
 3   dropoff_longitude                                 float64
 4   dropoff_latitude                                  float64
 5   passenger_count                                   int64  
 6   pickup_datetime__day                              int64  
 7   pickup_datetime__month                            int64  
 8   pickup_datetime__weekday                          int64  
 9   pickup_datetime__year                             int64  
 10  pickup_datetime__hour                             int64  
 11  haversine_pickup_coordinates_dropoff_coordinates  float64
dty
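
The `haversine_pickup_coordinates_dropoff_coordinates` column holds the great-circle distance between pickup and dropoff points. Below is a minimal sketch of the standard haversine formula, assuming a mean Earth radius of 3958.8 miles (an assumption, not a value taken from RecDP or Featuretools); `haversine_miles` is a hypothetical helper.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius; not taken from RecDP


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# First row of the input table: pickup and dropoff coordinates.
d = haversine_miles(40.721319, -73.844311, 40.712278, -73.841610)
print(f"{d:.3f} miles")
```

The first row comes out at about 0.64, which agrees with the value in the RecDP output and suggests the column is expressed in miles. For 55M rows a vectorized version of the same formula would be the practical choice.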

# Estimator

In [3]:
from utils import Timer
import pandas as pd
from sklearn.metrics import mean_squared_error
import lightgbm as lgbm
import numpy as np
           
params = {
        'boosting_type':'gbdt',
        'objective': 'regression',
        'nthread': 4,
        'num_leaves': 31,
        'learning_rate': 0.05,
        'max_depth': -1,
        # 'subsample' is an alias of 'bagging_fraction'; the conflicting
        # duplicate ('subsample': 0.8) is dropped so only one value is set
        'bagging_fraction' : 1,
        'max_bin' : 5000 ,
        'bagging_freq': 20,
        'colsample_bytree': 0.6,
        'metric': 'rmse',
        'min_split_gain': 0.5,
        'min_child_weight': 1,
        'min_child_samples': 10,
        'scale_pos_weight':1,
        'zero_as_missing': True,
        'seed':0,
        # 'num_rounds' is an alias of 'num_boost_round', so it is set once
        'num_boost_round': 2000,
        'early_stopping_rounds': 50
    }

with Timer("split data"):
    test_sample = ret.sample(frac = 0.1)
    train_sample = ret.drop(test_sample.index)

with Timer("prepare train and validate for lgbm"):
    x_train = train_sample.drop(columns=['fare_amount'])
    y_train = train_sample['fare_amount'].values

    x_val = test_sample.drop(columns=['fare_amount'])
    y_val = test_sample['fare_amount'].values

    lgbm_train = lgbm.Dataset(x_train, y_train, silent=False)
    lgbm_val = lgbm.Dataset(x_val, y_val, silent=False)

with Timer("train"):
    model = lgbm.train(params=params, train_set=lgbm_train, valid_sets=lgbm_val, verbose_eval=100)
    
with Timer("predict"):
    pred = model.predict(x_val, num_iteration=model.best_iteration)
    
with Timer("calculate rmse"):
    rmse = np.sqrt(mean_squared_error(y_val, pred))

print('LightGBM RMSE', rmse)

split data took 18.846782117150724 sec
prepare train and validate for lgbm took 1.839367987588048 sec




You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 25091
[LightGBM] [Info] Number of data points in the train set: 48884359, number of used features: 11




[LightGBM] [Info] Start training from score 11.324507
Training until validation scores don't improve for 50 rounds
[100]	valid_0's rmse: 3.92054
[200]	valid_0's rmse: 3.80928
[300]	valid_0's rmse: 3.74226
[400]	valid_0's rmse: 3.70658
[500]	valid_0's rmse: 3.68103
[600]	valid_0's rmse: 3.65886
[700]	valid_0's rmse: 3.64397
[800]	valid_0's rmse: 3.62826
[900]	valid_0's rmse: 3.61487
[1000]	valid_0's rmse: 3.60162
[1100]	valid_0's rmse: 3.59362
[1200]	valid_0's rmse: 3.58307
[1300]	valid_0's rmse: 3.57557
[1400]	valid_0's rmse: 3.56744
[1500]	valid_0's rmse: 3.56027
[1600]	valid_0's rmse: 3.55456
[1700]	valid_0's rmse: 3.54906
[1800]	valid_0's rmse: 3.54471
[1900]	valid_0's rmse: 3.5408
[2000]	valid_0's rmse: 3.53851
Did not meet early stopping. Best iteration is:
[1997]	valid_0's rmse: 3.53851
train took 1506.3177826348692 sec
predict took 338.8403431503102 sec
calculate rmse took 0.04924597032368183 sec
LightGBM RMSE 3.5385067280108027
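
The score above is a root mean squared error, expressed in the same units as `fare_amount`. To judge whether the engineered features help, it needs a reference point such as a model trained on the raw columns or a constant predictor. The sketch below shows the metric and a constant-mean baseline on a handful of fares; `rmse` is a hypothetical helper and the five values are toy data taken from the first rows of the input table, not a reproduction of the experiment.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# First five fares from the input table, used purely as toy data.
fares = [4.5, 16.9, 5.7, 7.7, 5.3]

# Naive baseline: always predict the mean fare.
mean_fare = sum(fares) / len(fares)
baseline_rmse = rmse(fares, [mean_fare] * len(fares))
print(f"baseline RMSE: {baseline_rmse:.4f}")
```

Running the same LightGBM setup on the original columns and comparing its validation RMSE with the 3.5385 reported here would give the like-for-like comparison.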
