<div>
    <span style ="font-size: 30px; font-weight: bold; color: #8EB944">
        Load Libraries and Data
    </span>
    
<hr style="color: #8EB944; height: 3px;background-color: #8EB944;border: none">
</div>

In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely import wkt
import h3

In [2]:
# Define the file path
file_path = "./data/"

# Import the dataset
df = pd.read_csv(f'{file_path}clean_taxi_data.csv')

# load the census tract areas as a geodataframe
census_gdf = gpd.read_file(f'{file_path}Boundaries.geojson', crs='epsg:4326')

# Load the weather dataset
weather = pd.read_csv(f'{file_path}weather_chic.csv')

<div class="alert alert-danger">
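<b>IMPORTANT:</b> Run this notebook three times to generate all feature sets: once with hexagon_resolution set to 6, once with hexagon_resolution set to 7, and once with census_resolution set to True and hexagon_resolution set to 0. Keep census_resolution set to False for the two hexagon runs, because the aggregation step uses this flag to choose the spatial key.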
</div>

In [3]:
# Please adjust these values
hexagon_resolution = 0
census_resolution = False

# Import the spatial features that match the chosen resolution; raise an error if no valid setting is given
if hexagon_resolution == 6:
    spatial_features = gpd.read_file('spatial_features_hex6.geojson', crs='epsg:4326')
    
elif hexagon_resolution == 7:
    spatial_features = gpd.read_file('spatial_features_hex7.geojson', crs='epsg:4326')
    
elif census_resolution:
    spatial_features = gpd.read_file('census_spatial_features.geojson', crs='epsg:4326')
    
else:
    raise ValueError('Please adjust the hexagon resolution to either 6 or 7 OR set the census_resolution to true')
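The three runs differ only in which spatial file is read, which column is aggregated on, and which CSV is written. As a minimal sketch, that mapping could be kept in one place so the import and export cells cannot drift apart. The helper name `resolve_run_config` and the dictionary keys are illustrative assumptions, not part of this notebook; the file and column names are the ones used in its cells.

```python
def resolve_run_config(hexagon_resolution: int, census_resolution: bool) -> dict:
    """Map the two notebook settings to the files and key column of one run.

    Hypothetical helper for illustration; a hexagon resolution of 6 or 7 takes
    precedence over the census flag, mirroring the if/elif order in the notebook.
    """
    hex_runs = {
        6: {"spatial_file": "spatial_features_hex6.geojson",
            "output_file": "features_hex_6.csv",
            "agg_column": "h3_hex_id"},
        7: {"spatial_file": "spatial_features_hex7.geojson",
            "output_file": "features_hex_7.csv",
            "agg_column": "h3_hex_id"},
    }
    if hexagon_resolution in hex_runs:
        return hex_runs[hexagon_resolution]
    if census_resolution:
        return {"spatial_file": "census_spatial_features.geojson",
                "output_file": "features_census.csv",
                "agg_column": "pickup_census"}
    raise ValueError(
        "Please adjust the hexagon resolution to either 6 or 7 "
        "OR set the census_resolution to true"
    )


config = resolve_run_config(hexagon_resolution=7, census_resolution=False)
print(config["spatial_file"], config["output_file"])
```

With such a helper, an invalid combination fails once at the top of the notebook rather than again in the export cell.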

<div>
    <span style ="font-size: 30px; font-weight: bold; color: #8EB944">
        Define important functions
    </span>
    
<hr style="color: #8EB944; height: 3px;background-color: #8EB944;border: none">
</div>

In [4]:
def convert_point_to_hexagon(point_wkt: str, hex_resolution: int) -> str:
    """
    Convert a Well-Known Text (WKT) point string to an H3 hexagon ID.

    Parameters:
    - point_wkt (str): The Well-Known Text (WKT) representation of a point.
    - hex_resolution (int): The resolution of the H3 hexagon (0-15).

    Returns:
    - hex_id (str): The H3 hexagon ID corresponding to the point.
    """
    
    # Convert the WKT point string to a Point object
    point_obj = wkt.loads(point_wkt)
    
    # Convert the latitude and longitude of the Point object to an H3 hexagon ID
    hex_id = h3.geo_to_h3(point_obj.y, point_obj.x, hex_resolution)
    
    return hex_id
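One detail in the function above is easy to get wrong: WKT stores a point as `POINT (x y)`, that is longitude first, whereas `h3.geo_to_h3` expects latitude first. The standard-library sketch below only illustrates that swap. `parse_wkt_point` is a hypothetical helper (not part of shapely or h3), it handles plain 2D decimal points only, and the coordinates are an arbitrary example.

```python
import re
from typing import Tuple

# Matches plain 2D points such as "POINT (-87.6298 41.8781)"
_POINT_PATTERN = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_wkt_point(point_wkt: str) -> Tuple[float, float]:
    """Return (latitude, longitude) from a 2D WKT point string (hypothetical helper)."""
    match = _POINT_PATTERN.match(point_wkt)
    if match is None:
        raise ValueError(f"Not a 2D WKT point: {point_wkt!r}")
    # WKT order is x (longitude) followed by y (latitude)
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon


lat, lon = parse_wkt_point("POINT (-87.6298 41.8781)")
print(lat, lon)
```

Passing the two values in the wrong order would still return a valid hex ID, just for a different place on the globe, so the mistake does not raise an error.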

<div>
    <span style ="font-size: 30px; font-weight: bold; color: #8EB944">
        Feature Engineering (Weather Data)
    </span>
    
<hr style="color: #8EB944; height: 3px;background-color: #8EB944;border: none">
</div>

We create binary variables for rain and snow because whether it rains or snows matters more than the absolute amount of precipitation. We also drop columns that are unlikely to correlate with the number of trips, which keeps the dataset small.

In [5]:
# Derive binary rain and snow indicators from the preciptype column
weather['snow_binary'] = weather['preciptype'].apply(lambda x: 1 if x == 'snow' or x == 'rain,snow'  else 0)
weather['rain_binary'] = weather['preciptype'].apply(lambda x: 1 if x == 'rain' or x == 'rain,snow' else 0)
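For illustration, the same flags can be computed without a row-wise `apply`. The sketch below works on a toy frame and assumes that `preciptype` only contains the labels `rain`, `snow`, `rain,snow`, or a missing value. Any other label would yield 0 in both columns, exactly as in the cell above.

```python
import pandas as pd

# Toy data standing in for the weather table (values are made up)
toy_weather = pd.DataFrame({"preciptype": ["rain", "snow", "rain,snow", None]})

# isin() returns False for missing values, so dry hours become 0
toy_weather["snow_binary"] = toy_weather["preciptype"].isin(["snow", "rain,snow"]).astype(int)
toy_weather["rain_binary"] = toy_weather["preciptype"].isin(["rain", "rain,snow"]).astype(int)

print(toy_weather)
```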

In [6]:
# Convert the timestamps to the datetime format
weather['datetime'] = pd.to_datetime(weather.datetime)

# Drop unnecessary columns
weather.drop(['name','dew','humidity','windgust','winddir','sealevelpressure','cloudcover',
         'visibility','solarradiation','solarenergy','uvindex','conditions','icon','stations','Unnamed: 0'],axis=1,inplace=True)

<div>
    <span style ="font-size: 30px; font-weight: bold; color: #8EB944">
        Feature Engineering (Trip Data)
    </span>
    
<hr style="color: #8EB944; height: 3px;background-color: #8EB944;border: none">
</div>

To merge the trip data with the hourly weather data, we round each trip start down to the full hour. We also drop the columns related to the drop-off, since we are only predicting demand. Since the underlying data is in census resolution, each pickup point must be converted to a hex ID whenever a hexagon resolution is used.

In [7]:
# Convert the trip_start column to the Datetime format
df['trip_start'] = pd.to_datetime(df.trip_start)

# Round the time in the df down to the full hour (this is needed to aggregate the data later)
df['rounded_time'] = df['trip_start'].dt.floor('H')

# Since we only predict demand, we drop all unnecessary columns
df.drop(['Unnamed: 0','taxi_id','dropoff_location','dropoff_census'],axis=1,inplace=True)

if hexagon_resolution != 0:
    # Convert each pickup point to its H3 hex ID at the chosen resolution
    df['h3_hex_id'] = df.pickup_location.apply(lambda x: convert_point_to_hexagon(x, hexagon_resolution))
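The difference between rounding down and rounding to the nearest hour matters for the merge with the hourly weather data. A small sketch on two made-up timestamps; the frequency string `"60min"` is used here as an assumption-free equivalent of the hourly alias, since the spelling of that alias differs between pandas versions.

```python
import pandas as pd

# Two made-up trip starts within the same clock hour
toy_trips = pd.DataFrame(
    {"trip_start": pd.to_datetime(["2023-01-01 08:05:00", "2023-01-01 08:55:00"])}
)

# floor() always maps to the start of the current hour
toy_trips["floored"] = toy_trips["trip_start"].dt.floor("60min")

# round() maps 08:55 to 09:00, which would assign the trip to the next hour
toy_trips["rounded"] = toy_trips["trip_start"].dt.round("60min")

print(toy_trips)
```

Flooring keeps both trips in the 08:00 bucket, which matches how the trips are aggregated later.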

<div>
    <span style ="font-size: 30px; font-weight: bold; color: #8EB944">
        Merging the Data
    </span>
    
<hr style="color: #8EB944; height: 3px;background-color: #8EB944;border: none">
</div>

We merge the trips dataset with the weather dataframe. We then aggregate the entire dataframe so that we have one data point for each hour and spatial bucket. During aggregation we add a column called "rides" that counts the number of rows in each group; it serves as the dependent variable in our models later. Finally, the spatial features are merged with the data frame to provide detailed spatial information about each data point.

In [8]:
# Merge the dataframes
df = pd.merge(df, weather, left_on='rounded_time', right_on='datetime', how='left')

# Drop the (now) unnecessary columns
df.drop(['datetime','trip_end','end_day','start_time','end_time','start_day','preciptype'],axis=1,inplace=True)
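A left merge like the one above silently duplicates trips if the weather table contains the same hour twice. As a hedged sketch on toy data, the `validate` argument of `pd.merge` can guard against that; whether the real weather file has duplicate hours is not checked in this notebook, so this is an optional safeguard rather than a required fix.

```python
import pandas as pd

# Made-up trips and hourly weather records
toy_trips = pd.DataFrame({
    "rounded_time": pd.to_datetime(
        ["2023-01-01 08:00", "2023-01-01 08:00", "2023-01-01 09:00"]
    ),
    "fare": [10.0, 12.5, 8.0],
})
toy_weather = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-01-01 08:00", "2023-01-01 09:00"]),
    "temp": [-2.0, -1.5],
})

# validate="many_to_one" raises pandas.errors.MergeError
# if an hour appears more than once in the weather table
merged = pd.merge(
    toy_trips,
    toy_weather,
    left_on="rounded_time",
    right_on="datetime",
    how="left",
    validate="many_to_one",
)

print(merged[["rounded_time", "fare", "temp"]])
```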

In [9]:
# Define the column on which to aggregate the DataFrame based on which resolution we are using
if census_resolution:
    agg_column = 'pickup_census'
else:
    agg_column = 'h3_hex_id'

# Aggregate the dataframe so that it contains one row with the number of rides per hour and spatial unit
df_grouped = df.groupby(['rounded_time', agg_column]).agg(
    rides=(agg_column, 'size'),  
    trip_seconds =('trip_seconds', 'mean'),
    trip_miles =('trip_miles', 'mean'), 
    fare=('fare', 'mean'),  # Fare is a per-trip value, so we average it like the other trip metrics
    temp=('temp','first'), # All weather values are the same for the hour, so we take the first
    precip =('precip','first'), 
    preciprob=('precipprob','first'),
    snow=('snow','first'),
    snowdepth=('snowdepth','first'),
    windspeed=('windspeed','first'),
    severerisk=('severerisk','first'),
    snow_binary =('snow_binary','first'),
    rain_binary = ('rain_binary','first'),
).reset_index()
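To make the aggregation concrete, here is a toy version of the same named-aggregation pattern with four made-up trips. The hex IDs are placeholders, and the row count is taken over `trip_miles` purely for illustration, since `size` counts rows regardless of the column it is attached to.

```python
import pandas as pd

# Four made-up trips: three in the 08:00 hour, one in the 09:00 hour
toy = pd.DataFrame({
    "rounded_time": pd.to_datetime(
        ["2023-01-01 08:00"] * 3 + ["2023-01-01 09:00"]
    ),
    "h3_hex_id": ["hex_a", "hex_a", "hex_b", "hex_a"],  # placeholder IDs
    "trip_miles": [1.0, 3.0, 2.0, 4.0],
    "temp": [-2.0, -2.0, -2.0, -1.5],
})

toy_grouped = toy.groupby(["rounded_time", "h3_hex_id"]).agg(
    rides=("trip_miles", "size"),       # number of trips in the group
    trip_miles=("trip_miles", "mean"),  # trip-level value, so averaged
    temp=("temp", "first"),             # identical within an hour
).reset_index()

print(toy_grouped)
```

The result has one row per hour and hexagon: two rides for `hex_a` at 08:00, and one ride each for `hex_b` at 08:00 and `hex_a` at 09:00.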

In [10]:
# Merge the DataFrame with the spatial features
if census_resolution:
    
    # Rename the pickup_census column to census
    df_grouped.rename(columns={'pickup_census': 'census'}, inplace=True)
    
    # Change the datatype of the census column to int to merge it
    spatial_features['census'] = spatial_features.census.astype('int')
    
    # Merge the spatial features with the DataFrame
    df_final = pd.merge(df_grouped, spatial_features, on='census', how='left')
    
else:
    
    # Merge the spatial features with the DataFrame
    df_final = pd.merge(df_grouped, spatial_features, on='h3_hex_id', how='left')
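Because this is a left merge, spatial units without a match in the spatial features keep their rows but receive missing values. A minimal sketch of how such rows could be detected with the `indicator` argument of `pd.merge`; the toy frames and hex IDs are made up, and only the column names `h3_hex_id` and `num_stadiums` are taken from this notebook.

```python
import pandas as pd

# Made-up aggregated rides and spatial features; "hex_c" has no spatial record
toy_grouped = pd.DataFrame({"h3_hex_id": ["hex_a", "hex_b", "hex_c"], "rides": [5, 2, 1]})
toy_spatial = pd.DataFrame({"h3_hex_id": ["hex_a", "hex_b"], "num_stadiums": [1, 0]})

# indicator=True adds a "_merge" column that records where each row was found
checked = pd.merge(toy_grouped, toy_spatial, on="h3_hex_id", how="left", indicator=True)

# Rows marked "left_only" had no matching spatial features
unmatched = checked.loc[checked["_merge"] == "left_only", "h3_hex_id"].tolist()
print(unmatched)
```

A mismatch in key data types, which the census branch above avoids by casting `census` to int, is a typical cause of unmatched rows.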

<div>
    <span style ="font-size: 30px; font-weight: bold; color: #8EB944">
        Feature Engineering
    </span>
    
<hr style="color: #8EB944; height: 3px;background-color: #8EB944;border: none">
</div>

To get additional features, especially regarding the time variables, we create several binary variables. First, the time is separated into hour, day of the week, and month. In addition, we saw in the descriptive analysis that there are fewer trips on weekends, so we create a binary weekend variable. Further variables capture the evening and night hours ("bar hours") on weekends and the morning and evening commutes on weekdays. We chose the times based on the demand from the descriptive analysis. Finally, for the census resolution, we drop all columns that we consider to be weak predictors.

In [11]:
# Create an hour, day of the week and month variable
df_final['hour'] = df_final['rounded_time'].dt.hour
df_final['day_of_week'] = df_final['rounded_time'].dt.dayofweek  # Monday=0, Sunday=6
df_final['month'] = df_final['rounded_time'].dt.month

# Binary variable that indicates whether the corresponding datapoint is on a weekend
df_final['weekend_binary'] = df_final['rounded_time'].dt.dayofweek >= 5 

# Binary variable for the bar hours (19:00 to 03:59) and its restriction to weekends
df_final['bar_hours'] = df_final['hour'].apply(lambda x: 0 if 4 <= x <= 18 else 1)
df_final['bar_hours_weekend'] = df_final['bar_hours'] * df_final['weekend_binary']

# Binary variables that indicate morning (05:00 to 10:59) or evening (13:00 to 18:59) commuting
df_final['morning_commuting'] = df_final['hour'].apply(lambda x: 1 if 5 <= x <= 10 else 0)
df_final['evening_commuting'] = df_final['hour'].apply(lambda x: 1 if 13 <= x <= 18 else 0)

# Restrict the commuting variables to weekdays
df_final['morning_commuting_week'] = df_final['morning_commuting'] * (1 - df_final['weekend_binary'])
df_final['evening_commuting_week'] = df_final['evening_commuting'] * (1 - df_final['weekend_binary'])

if census_resolution:
    df_final.drop(['trip_seconds', 'trip_miles', 'fare','preciprob', 'snow', 'snowdepth', 'windspeed',
       'severerisk', 'snow_binary', 'rain_binary','num_stadiums','day_of_week','month','geometry'],axis=1,inplace=True)
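To make the flag definitions concrete, the sketch below rebuilds a subset of them on three hand-picked timestamps. It uses `Series.between` instead of the row-wise `apply` from the cell above, which is a swapped-in but equivalent formulation for these hour ranges, and it stores the weekend flag as 0/1 rather than as a boolean.

```python
import pandas as pd

toy = pd.DataFrame({"rounded_time": pd.to_datetime([
    "2023-01-06 08:00",  # Friday morning
    "2023-01-07 23:00",  # Saturday night
    "2023-01-08 08:00",  # Sunday morning
])})

toy["hour"] = toy["rounded_time"].dt.hour
toy["weekend_binary"] = (toy["rounded_time"].dt.dayofweek >= 5).astype(int)  # Monday=0

# Bar hours are all hours outside 04:00 to 18:59
toy["bar_hours"] = (~toy["hour"].between(4, 18)).astype(int)
toy["bar_hours_weekend"] = toy["bar_hours"] * toy["weekend_binary"]

# Morning commute hours, restricted to weekdays
toy["morning_commuting"] = toy["hour"].between(5, 10).astype(int)
toy["morning_commuting_week"] = toy["morning_commuting"] * (1 - toy["weekend_binary"])

print(toy)
```

Only the Friday row counts as a weekday morning commute, and only the Saturday night row counts as weekend bar hours.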

<div>
    <span style ="font-size: 30px; font-weight: bold; color: #8EB944">
        Exporting the final dataset
    </span>
    
<hr style="color: #8EB944; height: 3px;background-color: #8EB944;border: none">
</div>

In [12]:
# Export the final features DataFrame depending on the case
if hexagon_resolution == 6:
    df_final.to_csv('features_hex_6.csv', index=False)
    
elif hexagon_resolution == 7:
    df_final.to_csv('features_hex_7.csv', index=False)
    
elif census_resolution:
    df_final.to_csv('features_census.csv', index=False)
    
else:
    raise ValueError('Please adjust the hexagon resolution to either 6 or 7 OR set the census_resolution to true')

<hr style="color: #8EB944; height: 3px;background-color: #8EB944;border: none">