# Feature Engineering

### Previous demand as input

Because we are working with time series data, a common approach is to use the demand of previous hours (or days) as input for the prediction. The underlying assumption is that the factors influencing demand have not changed dramatically within the time frames used. We construct the following features from previous demand:

* 1 and 2 hours: The assumption is that demand does not change dramatically across three consecutive hours.
* 24 hours: The assumption is that the current demand is comparable to the demand exactly one day earlier, as factors such as season and time of day are the same.
* Average demand of the past week at the same time of day: This feature is the average of the 7 demand observations from the past week at the same time of day.
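
The weekly average described in the last bullet could be built from the same group-wise `shift` pattern used for the hourly lags. The following is only a minimal sketch under assumptions: the function name `add_weekly_mean_feature` and the column name `demand_week_mean` are hypothetical, and the data is assumed to contain one row per hexagon and hour without gaps.

```python
import numpy as np
import pandas as pd

def add_weekly_mean_feature(df, days=7):
    """Add the mean demand at the same hour over the previous `days` days."""
    df = df.sort_values(['pickup_hex', 'timestamp']).copy()
    grouped = df.groupby('pickup_hex')['demand']
    # One shifted series per previous day (24, 48, ..., 24 * days hours back);
    # this relies on a gap-free hourly grid per hexagon
    lags = pd.concat([grouped.shift(24 * d) for d in range(1, days + 1)], axis=1)
    # skipna=False leaves the feature undefined until a full week of history exists
    df['demand_week_mean'] = lags.mean(axis=1, skipna=False)
    return df

# Toy data (made up): one hexagon, 8 days, demand equal to the day index
timestamps = pd.date_range('2017-01-01', periods=8 * 24, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({
    'timestamp': timestamps,
    'pickup_hex': 'hex_a',
    'demand': np.repeat(np.arange(8), 24),
})
toy = add_weekly_mean_feature(toy)
```

In the toy data the first seven days have no complete history and stay `NaN`; on the eighth day the feature is the mean of the day indices 0 to 6.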

In [1]:
import vaex
import h3
import pandas as pd
import numpy as np

df_taxi_trips = vaex.open('./data/trips_prepared.hdf5')
df_taxi_trips.head()

df_taxi_trips["trip_start_day"] = df_taxi_trips.trip_start_timestamp.dt.day
df_taxi_trips["trip_start_month"] = df_taxi_trips.trip_start_timestamp.dt.month
df_taxi_trips["trip_start_hour"] = df_taxi_trips.trip_start_timestamp.dt.hour
df_taxi_trips["trip_start_minute"] = df_taxi_trips.trip_start_timestamp.dt.minute

In [2]:
RESOLUTION = 10
def geo_to_h3(row1, row2):
    return h3.geo_to_h3(row1,row2, RESOLUTION)

# Step 1: For each pickup and drop-off calculate the correct hexagon in the resolution
df_taxi_trips['pickup_hex'] = df_taxi_trips.apply(geo_to_h3, [df_taxi_trips['pickup_centroid_latitude'], df_taxi_trips['pickup_centroid_longitude']])
df_taxi_trips['dropoff_hex'] = df_taxi_trips.apply(geo_to_h3, [df_taxi_trips['dropoff_centroid_latitude'], df_taxi_trips['dropoff_centroid_longitude']])

In [3]:
### Group by hour
df_demand_vaex = df_taxi_trips.groupby(['trip_start_hour', 'trip_start_month', 'trip_start_day', 'pickup_hex']).agg({'demand': 'count'})

# Add timestamp as preparation for resampling
df_demand_vaex['timestamp'] = pd.to_datetime({'year': 2017, 'month': df_demand_vaex['trip_start_month'].to_numpy(), 'day': df_demand_vaex['trip_start_day'].to_numpy(), 'hour': df_demand_vaex['trip_start_hour'].to_numpy()}).to_numpy()

# convert to pandas df
df_demand = df_demand_vaex.to_pandas_df()

In [18]:
### Create a placeholder df with one row per hour of the year for each hexagon

# Create a DateTimeIndex with hourly intervals for the year 2017
start_date = '2017-01-01 00:00:00'
end_date = '2017-12-31 23:00:00'
hourly_range = pd.date_range(start=start_date, end=end_date, freq='H')
num_entries_per_year = len(hourly_range)

hourly_range = np.tile(hourly_range,len(np.unique(df_demand.pickup_hex)))

# -1 values indicate that these rows were artificially generated
data = {
    'trip_start_hour': -1,
    'trip_start_month': -1,
    'trip_start_day': -1,
    'pickup_hex': np.repeat(np.unique(df_demand.pickup_hex), num_entries_per_year),
    'demand': 0,
}

df_demand_hourly = pd.DataFrame(data, index=hourly_range)
df_demand_hourly= df_demand_hourly.set_index([df_demand_hourly.index, 'pickup_hex'])

# introduce multiindex for filling up the df with hourly index later on
df_demand=df_demand.set_index(['timestamp', 'pickup_hex'])

# insert df_demand 
df_demand_hourly.update(df_demand)

# clear up multi-index
df_demand_hourly=df_demand_hourly.reset_index()
df_demand_hourly.columns = ['timestamp','pickup_hex','trip_start_hour','trip_start_month','trip_start_day','demand']
df_demand_hourly

In [24]:
# insert features 1, 2 and 24 hours previous demand
df_demand_hourly['demand_h-1'] = df_demand_hourly.sort_values('timestamp').groupby('pickup_hex')['demand'].shift(1)
df_demand_hourly['demand_h-2'] = df_demand_hourly.sort_values('timestamp').groupby('pickup_hex')['demand'].shift(2)
df_demand_hourly['demand_h-24'] = df_demand_hourly.sort_values('timestamp').groupby('pickup_hex')['demand'].shift(24)
# df_demand_hourly.reset_index(inplace=True)

In [25]:
df_demand_hourly

Unnamed: 0,timestamp,pickup_hex,trip_start_hour,trip_start_month,trip_start_day,demand,demand_h-1,demand_h-2,demand_h-24
0,2017-01-01 00:00:00,8a266452180ffff,0,1,1,1,,,
1,2017-01-01 01:00:00,8a266452180ffff,1,1,1,1,1.0,,
2,2017-01-01 02:00:00,8a266452180ffff,-1,-1,-1,0,1.0,1.0,
3,2017-01-01 03:00:00,8a266452180ffff,-1,-1,-1,0,0.0,1.0,
4,2017-01-01 04:00:00,8a266452180ffff,-1,-1,-1,0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...
3127315,2017-12-31 19:00:00,8a275936bc4ffff,-1,-1,-1,0,0.0,0.0,0.0
3127316,2017-12-31 20:00:00,8a275936bc4ffff,-1,-1,-1,0,0.0,0.0,0.0
3127317,2017-12-31 21:00:00,8a275936bc4ffff,-1,-1,-1,0,0.0,0.0,0.0
3127318,2017-12-31 22:00:00,8a275936bc4ffff,-1,-1,-1,0,0.0,0.0,0.0


In [None]:
def get_mean_demand(df, hour_shift, hour, month, day):
    winter = [12, 1, 2]
    spring = [3,4,5]
    summer = [6,7,8]
    autumn = [9,10,11]

    months = []
    if month in winter:
        months = winter
    elif month in spring:
        months = spring
    elif month in summer:
        months = summer
    else:
        months = autumn
    
    # Mean demand at the given hour across all months of the same season.
    # Note: hour_shift and day are not used yet.
    in_season = df['trip_start_month'].isin(months)
    return df[(df['trip_start_hour'] == hour) & in_season]['demand'].mean()

# Weather features
In the descriptive analysis, particularly the analysis of temporal demand patterns, we found that the temperature and demand curves follow similar trends. Therefore, we construct temperature-based features so that models can capture this relationship.

### Include weather data
First, we merge the weather data into the dataframe. Both datasets are already at hourly frequency, so a simple merge is sufficient. The weather observations are recorded at minute 53 of each hour, so we round each timestamp up to the next full hour, assuming that changes in the weather within seven minutes are negligible.

In [None]:
import numpy as np

df_weather = pd.read_csv('data/weather_data_final.csv')
df_weather['date_time'] = pd.to_datetime(df_weather['date_time'])
df_weather['date_time'] = df_weather['date_time'].dt.ceil('H')
df_weather.rename(columns={'date_time': 'timestamp'}, inplace=True)
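
To illustrate the rounding step, and one possible safeguard for the merge that follows, here is a small sketch on made-up readings. It assumes that the file could contain more than one reading per hour, which is not confirmed by the data shown here; the names `weather` and `weather_hourly` are hypothetical.

```python
import pandas as pd

# Made-up readings: two of them fall into the same hour after rounding up
weather = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2017-01-01 00:53', '2017-01-01 01:53', '2017-01-01 01:58',
    ]),
    'temp': [33, 32, 31],
})

# 00:53 becomes 01:00, 01:53 and 01:58 both become 02:00
weather['timestamp'] = weather['date_time'].dt.ceil('h')

# Keep the first reading per hour so a later merge on 'timestamp'
# cannot duplicate demand rows
weather_hourly = weather.drop_duplicates(subset='timestamp', keep='first')
```

pandas' `merge` also accepts `validate='many_to_one'`, which raises an error if the weather side still contains duplicate timestamps.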

In [None]:
df_weather.head(1)

Unnamed: 0,date,time,temp,dew_point,humidity,wind_speed,wind_gust,pressure,precip,condition,timestamp
0,2017-01-01,00:53,33 °F,24 °F,70 °%,8 °mph,0 °mph,29.45 °in,0.0 °in,Partly Cloudy,2017-01-01 01:00:00
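
The sample row shows that the measurements are stored as strings with unit suffixes (for example `33 °F`). Before such columns can serve as numeric features they would need to be converted. A minimal sketch, assuming every value starts with the number; the helper name `strip_units` is hypothetical, and apart from the first row the sample values are made up:

```python
import pandas as pd

def strip_units(series):
    """Extract the leading number from strings such as '33 °F' or '29.45 °in'."""
    number = series.astype(str).str.extract(r'(-?\d+(?:\.\d+)?)', expand=False)
    # Values without a number become NaN instead of raising an error
    return pd.to_numeric(number, errors='coerce')

sample = pd.DataFrame({
    'temp': ['33 °F', '-4 °F'],
    'humidity': ['70 °%', '85 °%'],
    'precip': ['0.0 °in', '0.1 °in'],
})
for column in ['temp', 'humidity', 'precip']:
    sample[column] = strip_units(sample[column])
```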


In [None]:
df_demand_merged = df_demand_hourly.merge(df_weather, on='timestamp', how='left')

### Temperature features
In addition to the current temperature, we add the temperature from 1, 2, and 3 hours before the time of the taxi demand. We hypothesize that past temperature conditions could influence the decision to hire a taxi.

In [None]:
df_demand_merged['temp_h-1'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['temp'].shift(1)
df_demand_merged['temp_h-2'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['temp'].shift(2)
df_demand_merged['temp_h-3'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['temp'].shift(3)

### Precipitation
We hypothesize that precipitation has a significant impact on demand. Therefore, we construct features that capture the precipitation recorded 1, 2, and 3 hours earlier.

In [None]:
df_demand_merged['precip_h-1'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['precip'].shift(1)
df_demand_merged['precip_h-2'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['precip'].shift(2)
df_demand_merged['precip_h-3'] = df_demand_merged.sort_values('timestamp').groupby('pickup_hex')['precip'].shift(3)
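
The shifted columns above hold precipitation amounts. If a yes/no indicator of rain within the last three hours is wanted instead, it could be derived along the following lines. This is a sketch under assumptions: `precip` is taken to be numeric already, and the function name `add_recent_rain_flag` and the column name `rained_last_3h` are hypothetical.

```python
import pandas as pd

def add_recent_rain_flag(df, hours=3):
    """Flag rows where any precipitation was recorded in the previous `hours` hours."""
    df = df.sort_values(['pickup_hex', 'timestamp']).copy()
    grouped = df.groupby('pickup_hex')['precip']
    lags = pd.concat([grouped.shift(h) for h in range(1, hours + 1)], axis=1)
    # Missing history (NaN) compares as False, so the first rows per hexagon get 0
    df[f'rained_last_{hours}h'] = (lags > 0).any(axis=1).astype(int)
    return df

# Toy data (made up): one hexagon with a single rainy hour
toy_rain = pd.DataFrame({
    'timestamp': pd.date_range('2017-01-01', periods=6, freq=pd.Timedelta(hours=1)),
    'pickup_hex': 'hex_a',
    'precip': [0.0, 0.2, 0.0, 0.0, 0.0, 0.0],
})
toy_rain = add_recent_rain_flag(toy_rain)
```

The flag is set for the three hours following the rainy hour and is 0 elsewhere.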