# Feature Engineering

### Previous demand as input

Because we are working with time series data, a common approach is to use the demand of previous hours (or days, etc.) as input for the prediction. The underlying assumption is that the factors influencing demand have not changed dramatically within the time frames used. We construct the following features from previous demand:

* 1 and 2 hours: The assumption is that demand does not change dramatically within a few hours.
* 24 hours: The assumption is that the current demand is comparable to the demand exactly one day earlier, as factors such as season and time of day are the same.
* Average demand of the past week at the same time of day: This feature is the average of the 7 demand observations from the past week at the same time of day.
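
The three feature types listed above can be sketched on a toy frame as follows. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the single hexagon `hex_a`, the synthetic `demand` values, and the column name `demand_week_avg` are hypothetical, and the frame is assumed to contain one row per hexagon and hour with no gaps.

```python
import pandas as pd

# Toy data (hypothetical): one hexagon, 8 days of gap-free hourly demand.
hours = pd.date_range("2017-01-01", periods=24 * 8, freq="h")
df = pd.DataFrame({"pickup_hex": "hex_a", "timestamp": hours, "demand": range(len(hours))})

# Sort once so that shifting by k rows means shifting by k hours within a hexagon.
df = df.sort_values(["pickup_hex", "timestamp"])
by_hex = df.groupby("pickup_hex")["demand"]

# Demand 1, 2 and 24 hours earlier.
for lag in (1, 2, 24):
    df[f"demand_h-{lag}"] = by_hex.shift(lag)

# Average of the 7 observations at the same time of day during the past week.
same_hour_lags = pd.concat([by_hex.shift(24 * d) for d in range(1, 8)], axis=1)
# skipna=False keeps the feature undefined until a full week of history exists.
df["demand_week_avg"] = same_hour_lags.mean(axis=1, skipna=False)
```

Shifting by row count only corresponds to shifting by hours when every hexagon has a complete hourly grid, which is why hours without demand need to be filled with 0 first.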

In [81]:
import vaex
import h3
import pandas as pd

df_taxi_trips = vaex.open('./data/trips_prepared.hdf5')
df_taxi_trips.head()

df_taxi_trips["trip_start_day"] = df_taxi_trips.trip_start_timestamp.dt.day
df_taxi_trips["trip_start_month"] = df_taxi_trips.trip_start_timestamp.dt.month
df_taxi_trips["trip_start_hour"] = df_taxi_trips.trip_start_timestamp.dt.hour
df_taxi_trips["trip_start_minute"] = df_taxi_trips.trip_start_timestamp.dt.minute

In [82]:
RESOLUTION = 10
def geo_to_h3(row1, row2):
    return h3.geo_to_h3(row1,row2, RESOLUTION)

# Step 1: For each pickup and drop-off calculate the correct hexagon in the resolution
df_taxi_trips['pickup_hex'] = df_taxi_trips.apply(geo_to_h3, [df_taxi_trips['pickup_centroid_latitude'], df_taxi_trips['pickup_centroid_longitude']])
df_taxi_trips['dropoff_hex'] = df_taxi_trips.apply(geo_to_h3, [df_taxi_trips['dropoff_centroid_latitude'], df_taxi_trips['dropoff_centroid_longitude']])

In [83]:
### LONG LOADING TIME
df_demand = df_taxi_trips.groupby(['trip_start_hour', 'trip_start_month', 'trip_start_day', 'pickup_hex']).agg({'demand': 'count'})

In [84]:
# craft timestamp column
df_demand['timestamp']=pd.to_datetime({'year': 2017, 'month': df_demand['trip_start_month'].to_numpy(), 'day': df_demand['trip_start_day'].to_numpy(), 'hour': df_demand['trip_start_hour'].to_numpy()}).to_numpy()

In [85]:
# convert to pandas df
df_demand = df_demand.to_pandas_df()

In [86]:
# Build a complete (hexagon, hour) grid so that hours without trips get demand 0.
# Timestamps are already hourly, so a plain groupby replaces the failing
# resample on a MultiIndex.
df_resampled = df_demand.groupby(['pickup_hex', 'timestamp'], as_index=False)['demand'].sum()
all_hexagons = df_resampled['pickup_hex'].unique()
all_hours = pd.date_range(start=df_resampled['timestamp'].min().floor('D'),
                          end=df_resampled['timestamp'].max().ceil('D'),
                          freq='H')
index = pd.MultiIndex.from_product([all_hexagons, all_hours], names=['pickup_hex', 'timestamp'])
df_all_combinations = pd.DataFrame(index=index).reset_index()
# Left merge keeps every grid cell; missing demand is filled with 0.
df_merged = pd.merge(df_all_combinations, df_resampled, on=['pickup_hex', 'timestamp'], how='left')
df_merged = df_merged.fillna(0)

In [72]:
len(df_demand)

203506

In [63]:
df_demand.groupby('pickup_hex').head()

timestamp,trip_start_hour,trip_start_month,trip_start_day,pickup_hex,demand
2017-12-20 09:00:00,9,12,20,8a2664c1e4effff,5
2017-01-11 16:00:00,16,1,11,8a2664c1e8cffff,5
2017-03-20 12:00:00,12,3,20,8a2664c1e32ffff,8
2017-04-29 21:00:00,21,4,29,8a2664c1e0effff,21
2017-04-18 17:00:00,17,4,18,8a2664c1acd7fff,16
...,...,...,...,...,...
2017-12-03 00:00:00,0,12,3,8a2664ca1a0ffff,1
2017-09-06 15:00:00,15,9,6,8a2664d9d76ffff,1
2017-07-06 21:00:00,21,7,6,8a2664d88457fff,1
2017-09-15 23:00:00,23,9,15,8a2664cab057fff,1


In [52]:
df_demand=df_demand.set_index('timestamp')
df_demand_resampled = df_demand.groupby('pickup_hex').resample('D').sum()
df_demand_resampled = df_demand_resampled.reset_index()


In [61]:
hexagons = df_demand_resampled["pickup_hex"].unique()

for hexagon in hexagons:
    if len(df_demand_resampled[df_demand_resampled["pickup_hex"] == str(hexagon)]['timestamp'].dt.date.unique()) == 365:
        print("true")

true
...
true


In [54]:
hexagons = df_demand_resampled["pickup_hex"].unique()
hexagons_with_empty_days = []
for hexagon in hexagons:
    df_one_hexagon = df_demand_resampled[df_demand_resampled["pickup_hex"] == str(hexagon)]
    df_one_hexagon.reset_index(inplace=True)
    if len(df_one_hexagon['timestamp'].dt.date.unique()) != 365:
        hexagons_with_empty_days.append(str(hexagon))
    
print(len(hexagons_with_empty_days))

210
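
The per-hexagon loop above can also be written as a single grouped aggregation. The following is a sketch on toy data rather than the notebook's actual frame: `df_daily`, its values, and `EXPECTED_DAYS` are hypothetical stand-ins (the notebook compares against 365 days).

```python
import pandas as pd

# Toy daily frame (hypothetical): hexagon "b" is missing 2017-01-02.
df_daily = pd.DataFrame({
    "pickup_hex": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-01", "2017-01-03"]
    ),
    "demand": [3, 0, 1, 2, 4],
})
EXPECTED_DAYS = 3  # stand-in for the 365 days used in the notebook

# Count distinct calendar days per hexagon in one pass.
days_per_hex = df_daily.groupby("pickup_hex")["timestamp"].apply(
    lambda s: s.dt.normalize().nunique()
)

# Hexagons whose day count differs from the expected number of days.
hexagons_with_empty_days = days_per_hex[days_per_hex != EXPECTED_DAYS].index.tolist()
```

This avoids filtering the full frame once per hexagon, which may matter when there are many hexagons.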


In [7]:
# insert 0 values for hours without demand
df_demand=df_demand.set_index('timestamp')

df_demand_resampled = df_demand.groupby('pickup_hex').resample('H').sum()


In [8]:
# insert features 1, 2 and 24 hours previous demand
df_demand_resampled['demand_h-1'] = df_demand_resampled.sort_values('timestamp').groupby('pickup_hex')['demand'].shift(1)
df_demand_resampled['demand_h-2'] = df_demand_resampled.sort_values('timestamp').groupby('pickup_hex')['demand'].shift(2)
df_demand_resampled['demand_h-24'] = df_demand_resampled.sort_values('timestamp').groupby('pickup_hex')['demand'].shift(24)
df_demand_resampled.reset_index(inplace=True)

In [10]:
def get_mean_demand(df, hour_shift, hour, month, day):
    # Mean demand at the given hour of day across the season that contains `month`.
    # NOTE: `hour_shift` and `day` are kept in the signature but are not used yet.
    winter = [12, 1, 2]
    spring = [3, 4, 5]
    summer = [6, 7, 8]
    autumn = [9, 10, 11]

    if month in winter:
        months = winter
    elif month in spring:
        months = spring
    elif month in summer:
        months = summer
    else:
        months = autumn

    # Assumption: rows are restricted to the requested hour; the original
    # filter expression omitted this comparison.
    mask = (df['trip_start_hour'] == hour) & df['trip_start_month'].isin(months)
    return df.loc[mask, 'demand'].mean()

# Weather features
In the descriptive analysis, particularly the analysis of temporal demand patterns, we found that the temperature and demand curves move in similar directions. Therefore, we construct temperature-based features so that models can capture this relationship.

### Include weather data
First, we have to add the weather data to the dataframe. Both datasets are already at hourly frequency, so a merge on the timestamp is sufficient. The weather data are recorded at minute 53 of each hour, so we round each observation up to the next full hour. We assume that weather changes within these seven minutes are negligible.
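
As a small illustration of this rounding and merge step, the sketch below uses two hypothetical weather rows and three hypothetical demand rows; the values are made up and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical weather observations taken at minute 53 of each hour.
weather = pd.DataFrame({
    "date_time": pd.to_datetime(["2017-01-01 00:53", "2017-01-01 01:53"]),
    "temp": ["33 °F", "32 °F"],
})
# Round up to the next full hour: 00:53 becomes 01:00.
weather["timestamp"] = weather["date_time"].dt.ceil("h")

# Hypothetical hourly demand for a single hexagon.
demand = pd.DataFrame({
    "pickup_hex": ["hex_a"] * 3,
    "timestamp": pd.to_datetime(
        ["2017-01-01 01:00", "2017-01-01 02:00", "2017-01-01 03:00"]
    ),
    "demand": [4, 2, 0],
})

# Left merge keeps every demand row; hours without a weather reading get NaN.
merged = demand.merge(weather[["timestamp", "temp"]], on="timestamp", how="left")
```

Because the merge is a left join, any hour without a matching weather observation stays in the result with missing weather values, so it is worth checking for NaNs after merging.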

In [11]:
import numpy as np

df_weather = pd.read_csv('data/weather_data_final.csv')
df_weather['date_time'] = pd.to_datetime(df_weather['date_time'])
df_weather['date_time'] = df_weather['date_time'].dt.ceil('H')
df_weather.rename(columns={'date_time': 'timestamp'}, inplace=True)

In [12]:
df_weather.head(1)

Unnamed: 0,date,time,temp,dew_point,humidity,wind_speed,wind_gust,pressure,precip,condition,timestamp
0,2017-01-01,00:53,33 °F,24 °F,70 °%,8 °mph,0 °mph,29.45 °in,0.0 °in,Partly Cloudy,2017-01-01 01:00:00


In [13]:
df_demand_merged = df_demand_resampled.merge(df_weather, on='timestamp', how='left')

### Temperature features
In addition to the current temperature, we add the temperature from 1, 2, and 3 hours before the time of taxi demand. We hypothesize that past temperature conditions could influence the decision to hire a taxi.

In [14]:
df_demand_merged['temp_h-1'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['temp'].shift(1)
df_demand_merged['temp_h-2'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['temp'].shift(2)
df_demand_merged['temp_h-3'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['temp'].shift(3)

### Precipitation
We hypothesize that precipitation has a significant impact on demand. Therefore, we construct features holding the precipitation of each of the previous three hours, which indicate whether it has rained in the last 1-3 hours.

In [15]:
df_demand_merged['precip_h-1'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['precip'].shift(1)
df_demand_merged['precip_h-2'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['precip'].shift(2)
df_demand_merged['precip_h-3'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['precip'].shift(3)
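
The lagged columns above carry the raw precipitation values. If an explicit yes/no indicator for rain in the last three hours is wanted, it could be derived roughly as follows. This is a sketch under assumptions: the toy data, the numeric helper column `precip_in`, and the feature name `rained_last_3h` are hypothetical, and the raw `precip` values are assumed to be strings such as `0.0 °in`, as in the `df_weather.head(1)` output.

```python
import pandas as pd

# Toy frame for one hexagon; "precip" mimics the raw string format of the weather data.
df = pd.DataFrame({
    "pickup_hex": ["hex_a"] * 5,
    "timestamp": pd.date_range("2017-01-01", periods=5, freq="h"),
    "precip": ["0.0 °in", "0.2 °in", "0.0 °in", "0.0 °in", "0.0 °in"],
})

# Strip the unit suffix so the column can be compared numerically.
df["precip_in"] = df["precip"].str.extract(r"(\d*\.?\d+)", expand=False).astype(float)

# Precipitation 1, 2 and 3 hours earlier, computed per hexagon.
df = df.sort_values(["pickup_hex", "timestamp"])
lagged = pd.concat(
    [df.groupby("pickup_hex")["precip_in"].shift(k) for k in (1, 2, 3)], axis=1
)

# True if any of the three previous hours had measurable precipitation.
# Missing lags at the start of the series compare as False.
df["rained_last_3h"] = lagged.gt(0).any(axis=1)
```

A binary flag like this is easier for some models to use than three separate lag columns, at the cost of discarding the precipitation amount.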