# Feature Engineering

### Previous demand as input

Since we are working with time series data, a common approach is to use the demand of previous hours (or days) as input for the prediction. The underlying assumption is that the factors influencing demand do not change dramatically within the time frames used. We construct the following features from previous demand:

* 1 and 2 hours: The assumption is that demand does not change dramatically across three consecutive hours.
* 24 hours: The assumption is that the current demand is comparable to the demand exactly one day earlier, since factors such as season and time of day are the same.
* Average demand of the past week at the same time of day: This feature is the average of the 7 demand observations from the past week at the same time of day.
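
One possible way to compute the third feature is to average the lags of 24, 48, ..., 168 hours within each hexagon. The following is a minimal sketch, not necessarily the method used in this project: the function name `add_weekly_mean_lag` and the output column `demand_week_mean` are hypothetical, and it assumes a gap-free hourly series per hexagon with the columns `pickup_hex`, `timestamp` and `demand`.

```python
import pandas as pd


def add_weekly_mean_lag(df, group_col="pickup_hex", time_col="timestamp", value_col="demand"):
    """Add the mean demand at the same hour of day over the previous 7 days.

    Assumes one row per hexagon and hour, without gaps (hypothetical helper).
    """
    df = df.sort_values([group_col, time_col]).copy()
    grouped = df.groupby(group_col)[value_col]
    # One shifted copy per previous day: 24, 48, ..., 168 hours back
    lags = pd.concat([grouped.shift(24 * d) for d in range(1, 8)], axis=1)
    # skipna=False keeps the feature NaN until a full week of history exists
    df["demand_week_mean"] = lags.mean(axis=1, skipna=False)
    return df


# Toy data: one hexagon, 8 days of hourly rows, demand equal to the day index
toy = pd.DataFrame({
    "pickup_hex": "hex_a",
    "timestamp": pd.date_range("2017-01-01", periods=8 * 24, freq=pd.Timedelta(hours=1)),
    "demand": [i // 24 for i in range(8 * 24)],
})
toy_features = add_weekly_mean_lag(toy)
```

Setting `skipna=False` is a design choice: rows with less than a full week of history stay `NaN` instead of receiving an average over fewer days.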

In [1]:
import vaex
import h3
import pandas as pd

df_taxi_trips = vaex.open('./data/trips_prepared.hdf5')
df_taxi_trips.head()

df_taxi_trips["trip_start_day"] = df_taxi_trips.trip_start_timestamp.dt.day
df_taxi_trips["trip_start_month"] = df_taxi_trips.trip_start_timestamp.dt.month
df_taxi_trips["trip_start_hour"] = df_taxi_trips.trip_start_timestamp.dt.hour
df_taxi_trips["trip_start_minute"] = df_taxi_trips.trip_start_timestamp.dt.minute

In [2]:
df_taxi_trips = df_taxi_trips.sample(1000000, random_state=42)

In [3]:
RESOLUTION = 10

def geo_to_h3(lat, lng):
    # h3-py 3.x API; in h3-py 4.x the equivalent call is h3.latlng_to_cell
    return h3.geo_to_h3(lat, lng, RESOLUTION)

# Step 1: For each pickup and drop-off calculate the correct hexagon in the resolution
df_taxi_trips['pickup_hex'] = df_taxi_trips.apply(geo_to_h3, [df_taxi_trips['pickup_centroid_latitude'], df_taxi_trips['pickup_centroid_longitude']])
df_taxi_trips['dropoff_hex'] = df_taxi_trips.apply(geo_to_h3, [df_taxi_trips['dropoff_centroid_latitude'], df_taxi_trips['dropoff_centroid_longitude']])

In [4]:
# Note: this aggregation takes a long time to run
df_demand = df_taxi_trips.groupby(['trip_start_hour', 'trip_start_month', 'trip_start_day', 'pickup_hex']).agg({'demand': 'count'})

In [5]:
# craft timestamp column
df_demand['timestamp']=pd.to_datetime({'year': 2017, 'month': df_demand['trip_start_month'].to_numpy(), 'day': df_demand['trip_start_day'].to_numpy(), 'hour': df_demand['trip_start_hour'].to_numpy()}).to_numpy()

In [6]:
# convert to pandas df
df_demand = df_demand.to_pandas_df()

In [7]:
# insert 0 values for hours without demand
df_demand=df_demand.set_index('timestamp')
df_demand_resampled = df_demand.groupby('pickup_hex').resample('H').sum()



In [11]:
# insert features: demand 1, 2 and 24 hours earlier, per hexagon
demand_by_hex = df_demand_resampled.sort_values('timestamp').groupby('pickup_hex')['demand']
for lag in (1, 2, 24):
    df_demand_resampled[f'demand_h-{lag}'] = demand_by_hex.shift(lag)
df_demand_resampled.reset_index(inplace=True)

In [13]:
# sanity check: inspect the lag features of one hexagon on two consecutive days
date_to_compare = pd.to_datetime('2017-07-12')
date_to_compare2 = pd.to_datetime('2017-07-11')

df_demand_resampled[(df_demand_resampled["pickup_hex"] == "8a2664c1e2effff") & ((df_demand_resampled["timestamp"].dt.date == date_to_compare.date()) | (df_demand_resampled["timestamp"].dt.date == date_to_compare2.date()))]

| | pickup_hex | timestamp | trip_start_hour | trip_start_month | trip_start_day | demand | demand_h-1 | demand_h-2 | demand_h-24 |
|---|---|---|---|---|---|---|---|---|---|
| 634771 | 8a2664c1e2effff | 2017-07-11 00:00:00 | 0 | 7 | 11 | 1 | 3.0 | 3.0 | 1.0 |
| 634772 | 8a2664c1e2effff | 2017-07-11 01:00:00 | 0 | 0 | 0 | 0 | 1.0 | 3.0 | 0.0 |
| 634773 | 8a2664c1e2effff | 2017-07-11 02:00:00 | 0 | 0 | 0 | 0 | 0.0 | 1.0 | 1.0 |
| 634774 | 8a2664c1e2effff | 2017-07-11 03:00:00 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| 634775 | 8a2664c1e2effff | 2017-07-11 04:00:00 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 1.0 |
| 634776 | 8a2664c1e2effff | 2017-07-11 05:00:00 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 3.0 |
| 634777 | 8a2664c1e2effff | 2017-07-11 06:00:00 | 6 | 7 | 11 | 4 | 0.0 | 0.0 | 6.0 |
| 634778 | 8a2664c1e2effff | 2017-07-11 07:00:00 | 7 | 7 | 11 | 10 | 4.0 | 0.0 | 21.0 |
| 634779 | 8a2664c1e2effff | 2017-07-11 08:00:00 | 8 | 7 | 11 | 19 | 10.0 | 4.0 | 28.0 |
| 634780 | 8a2664c1e2effff | 2017-07-11 09:00:00 | 9 | 7 | 11 | 23 | 19.0 | 10.0 | 17.0 |
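
In the output above, the zero-filled hours show 0 for `trip_start_hour`, `trip_start_month` and `trip_start_day`, because summing over an empty hour yields 0 for these columns as well. If these columns are needed later, for example to filter by hour or month, one option is to recompute them from `timestamp`. The following is a minimal sketch; `restore_calendar_columns` is a hypothetical helper name and the toy frame is illustrative only.

```python
import pandas as pd


def restore_calendar_columns(df, time_col="timestamp"):
    """Recompute hour, month and day from the timestamp column (hypothetical helper)."""
    df = df.copy()
    df["trip_start_hour"] = df[time_col].dt.hour
    df["trip_start_month"] = df[time_col].dt.month
    df["trip_start_day"] = df[time_col].dt.day
    return df


# Toy frame mimicking one zero-filled row and one observed row
toy_resampled = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-07-11 01:00:00", "2017-07-11 06:00:00"]),
    "trip_start_hour": [0, 6],
    "trip_start_month": [0, 7],
    "trip_start_day": [0, 11],
    "demand": [0, 4],
})
toy_fixed = restore_calendar_columns(toy_resampled)
```

The helper returns a copy, so the input frame stays unchanged and the fix can be applied at whichever step the calendar columns are first needed.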


In [14]:
def get_mean_demand(df, hour_shift, hour, month, day):
    winter = [12, 1, 2]
    spring = [3,4,5]
    summer = [6,7,8]
    autumn = [9,10,11]

    months = []
    if month in winter:
        months = winter
    elif month in spring:
        months = spring
    elif month in summer:
        months = summer
    else:
        months = autumn
    
    # Assumption: average over rows in the same season and at the same hour of day.
    # hour_shift and day are currently unused.
    in_season = df['trip_start_month'].isin(months)
    same_hour = df['trip_start_hour'] == hour
    return df.loc[in_season & same_hour, 'demand'].mean()