## 1.0a Taxi Data - Preprocessing

In this first notebook, the main dataframe is preprocessed. This includes:
- Reading and displaying the data file
- Checking data logic and removing invalid data
- Changing datatypes
- Dropping trivial or null columns/rows
- Splitting the data into subsets and saving them for further work

As the original data is rather large, it is not included in the repository. To run this notebook, it is necessary to:
1. Download the data from https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew#column-menu, filtering via Actions->Query Data->Trip Start Timestamp // In Between // 2022 Jan 01 12:00:00 AM AND 2022 Dec 31 11:59:59 PM
2. Rename the file to "taxidata" (the notebook reads it as `data/taxidata.csv`)
3. Run the code cell below to create the directories, then place the file in `./data`

In [43]:
import os
# this directory for the original data file
os.makedirs('./data', exist_ok=True)
# this directory to later save the prepared data
os.makedirs('./data/prepped', exist_ok=True)
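
Once the file has been placed in `./data`, a quick check can confirm that it is where the notebook expects it. The following is a minimal sketch; the helper name `find_data_file` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def find_data_file(directory, stem="taxidata"):
    """Return the first file in `directory` whose name starts with `stem`, or None."""
    matches = sorted(Path(directory).glob(f"{stem}*"))
    files = [p for p in matches if p.is_file()]
    return files[0] if files else None

data_file = find_data_file("./data")
if data_file is None:
    print("No taxidata file found in ./data; download it first.")
else:
    print(f"Found {data_file} ({data_file.stat().st_size / 1e6:.1f} MB)")
```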

In [44]:
# Standard libraries
import pandas as pd
import numpy as np

# Geospatial libraries
from h3 import h3 
import geopandas as gp
from shapely.geometry.polygon import Polygon

### 1.1 Read and display datafile

The data file is not included in the project and needs to be downloaded separately (see above). This step can take a few minutes due to the size of the original file.
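
Because the read is slow, it can help to load the file in chunks and report progress. The sketch below is one possible approach and not what the notebook itself does; the helper `read_csv_in_chunks` and the chunk size are assumptions, and the in-memory sample uses made-up values. Chunking does not make the read faster, it only gives feedback while it runs; passing `usecols` through `**kwargs` could additionally reduce memory use.

```python
import io
import pandas as pd

def read_csv_in_chunks(source, chunksize=500_000, **kwargs):
    """Read a CSV piece by piece, print progress, and return one dataframe."""
    chunks = []
    rows_read = 0
    for chunk in pd.read_csv(source, chunksize=chunksize, **kwargs):
        chunks.append(chunk)
        rows_read += len(chunk)
        print(f"rows read so far: {rows_read}")
    return pd.concat(chunks, ignore_index=True)

# Tiny in-memory stand-in for data/taxidata.csv (made-up values)
sample = io.StringIO(
    "trip_id,trip_seconds,trip_miles\na,600,0.9\nb,812,0.0\nc,2081,4.42\n"
)
df_sample = read_csv_in_chunks(sample, chunksize=2)
print(df_sample.shape)
```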

In [45]:
df = pd.read_csv("data/taxidata.csv")

In [46]:
df.head(3)

Unnamed: 0,trip_id,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,...,extras,trip_total,payment_type,company,pickup_centroid_latitude,pickup_centroid_longitude,pickup_centroid_location,dropoff_centroid_latitude,dropoff_centroid_longitude,dropoff_centroid_location
0,4404c6835b9e74e9f74d70f235200a8ce09db14a,7e179f8ef66ae99ec2d1ec89224e0b7ee5469fe5627f6d...,2022-12-31T23:45:00.000,2023-01-01T00:15:00.000,2081.0,4.42,,,2.0,3.0,...,0.0,20.5,Prcard,Flash Cab,42.001571,-87.695013,POINT (-87.6950125892 42.001571027),41.965812,-87.655879,POINT (-87.6558787862 41.96581197)
1,466473fd2a196ebe92fb2983cb7e8af32e39aa1f,d1d88b89ceb6d753007b6e795e3c24f4bea905a51e9d47...,2022-12-31T23:45:00.000,2023-01-01T00:00:00.000,812.0,0.0,,,8.0,24.0,...,0.0,16.57,Mobile,Flash Cab,41.899602,-87.633308,POINT (-87.6333080367 41.899602111),41.901207,-87.676356,POINT (-87.6763559892 41.9012069941)
2,3f5cd3f78e5cab455606a31372a95d3204b2fb3f,847cf962bd6f62040673e6c24c24940aeb2d7fdaa54677...,2022-12-31T23:45:00.000,2023-01-01T00:00:00.000,600.0,0.9,,,8.0,8.0,...,3.0,12.0,Credit Card,Taxi Affiliation Services,41.899602,-87.633308,POINT (-87.6333080367 41.899602111),41.899602,-87.633308,POINT (-87.6333080367 41.899602111)


In [47]:
# Data types
df.dtypes

trip_id                        object
taxi_id                        object
trip_start_timestamp           object
trip_end_timestamp             object
trip_seconds                  float64
trip_miles                    float64
pickup_census_tract           float64
dropoff_census_tract          float64
pickup_community_area         float64
dropoff_community_area        float64
fare                          float64
tips                          float64
tolls                         float64
extras                        float64
trip_total                    float64
payment_type                   object
company                        object
pickup_centroid_latitude      float64
pickup_centroid_longitude     float64
pickup_centroid_location       object
dropoff_centroid_latitude     float64
dropoff_centroid_longitude    float64
dropoff_centroid_location      object
dtype: object

In [48]:
# Convert time types to check if entries are from correct range
df["trip_start_timestamp"] = pd.to_datetime(df["trip_start_timestamp"])
df["trip_end_timestamp"] = pd.to_datetime(df["trip_end_timestamp"])

# In range of 2022:
print(f"Min date: {df['trip_start_timestamp'].min()}")
print(f"Max date: {df['trip_start_timestamp'].max()}")

# Convert trip duration to a numeric type
df["trip_seconds"] =  pd.to_numeric(df['trip_seconds'])

Min date: 2022-01-01 00:00:00
Max date: 2022-12-31 23:45:00


Check that these are the expected dates: the output should read **2022-01-01 00:00:00** and **2022-12-31 23:45:00**.
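
The manual check above could also be automated with an assertion. This is a minimal sketch; the helper `assert_within_year` is hypothetical, and missing timestamps (NaT) are deliberately not flagged by it.

```python
import pandas as pd

def assert_within_year(timestamps, year):
    """Raise an AssertionError if any timestamp falls outside the given calendar year."""
    start = pd.Timestamp(year=year, month=1, day=1)
    end = pd.Timestamp(year=year + 1, month=1, day=1)
    out_of_range = (timestamps < start) | (timestamps >= end)
    assert not out_of_range.any(), f"{out_of_range.sum()} timestamps outside {year}"

# Same format as the raw trip_start_timestamp column
ts = pd.to_datetime(pd.Series(["2022-01-01T00:00:00.000", "2022-12-31T23:45:00.000"]))
assert_within_year(ts, 2022)
```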

In [49]:
# Convert trip duration to hours
df['trip_hours'] = df['trip_seconds'] / 3600

In [50]:
# Look into null values
print(f"General shape of dataframe: {df.shape}")
print(df.isna().sum())

General shape of dataframe: (6382425, 24)
trip_id                             0
taxi_id                           354
trip_start_timestamp                0
trip_end_timestamp                212
trip_seconds                     1465
trip_miles                         56
pickup_census_tract           3758594
dropoff_census_tract          3707094
pickup_community_area          513853
dropoff_community_area         633684
fare                             3536
tips                             3536
tolls                            3536
extras                           3536
trip_total                       3536
payment_type                        0
company                             0
pickup_centroid_latitude       511551
pickup_centroid_longitude      511551
pickup_centroid_location       511551
dropoff_centroid_latitude      597931
dropoff_centroid_longitude     597931
dropoff_centroid_location      597931
trip_hours                       1465
dtype: int64


In [55]:
# Drop trips that end before they start
invalid_time = df[df['trip_end_timestamp'] < df['trip_start_timestamp']].index

df = df.drop(invalid_time)

In [56]:
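# Fill missing trip durations from the timestamps where both are available
# (trip_hours is not recomputed here, so it keeps its nulls for these rows)
mask = df['trip_seconds'].isna()
df.loc[mask, 'trip_seconds'] = (df.loc[mask, 'trip_end_timestamp'] - df.loc[mask, 'trip_start_timestamp']).dt.total_seconds()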
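
To illustrate the fill step on toy data (made-up values): a missing duration is derived from the two timestamps, and it stays missing when the end timestamp is missing as well. The sketch also recomputes `trip_hours` afterwards, which is an addition for illustration and not part of the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "trip_start_timestamp": pd.to_datetime(
        ["2022-03-01 10:00", "2022-03-01 11:00", "2022-03-01 12:00"]
    ),
    "trip_end_timestamp": pd.to_datetime(
        ["2022-03-01 10:15", "2022-03-01 11:30", None]
    ),
    "trip_seconds": [900.0, None, None],
})

# Only rows without a recorded duration are touched
mask = toy["trip_seconds"].isna()
toy.loc[mask, "trip_seconds"] = (
    toy.loc[mask, "trip_end_timestamp"] - toy.loc[mask, "trip_start_timestamp"]
).dt.total_seconds()

# Recompute the derived column so it reflects the filled durations
toy["trip_hours"] = toy["trip_seconds"] / 3600
print(toy[["trip_seconds", "trip_hours"]])
```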

In [57]:
print(df.isna().sum())

trip_id                             0
taxi_id                           354
trip_start_timestamp                0
trip_end_timestamp                212
trip_seconds                      212
trip_miles                         56
pickup_census_tract           3758493
dropoff_census_tract          3706993
pickup_community_area          513818
dropoff_community_area         633644
fare                             3533
tips                             3533
tolls                            3533
extras                           3533
trip_total                       3533
payment_type                        0
company                             0
pickup_centroid_latitude       511516
pickup_centroid_longitude      511516
pickup_centroid_location       511516
dropoff_centroid_latitude      597891
dropoff_centroid_longitude     597891
dropoff_centroid_location      597891
trip_hours                       1343
dtype: int64


### 1.2 Checking data logic & removing invalid data

#### 1.2.1 Duplicate entries

In [9]:
# Check duplicates 
print("Number of duplicate entries: ", df.duplicated(subset = [
    'taxi_id',
    'trip_start_timestamp',
    'trip_end_timestamp',
    'trip_seconds',
    'trip_miles',
    'pickup_census_tract',
    'dropoff_census_tract',
    'pickup_community_area',
    'dropoff_community_area',
    'pickup_centroid_latitude',
    'pickup_centroid_longitude',
    'pickup_centroid_location',
    'dropoff_centroid_latitude',
    'dropoff_centroid_longitude',
    'dropoff_centroid_location',
]).sum())

Number of duplicate entries:  21772


In [10]:
df = df.drop_duplicates(subset = [
    'taxi_id',
    'trip_start_timestamp',
    'trip_end_timestamp',
    'trip_seconds',
    'trip_miles',
    'pickup_census_tract',
    'dropoff_census_tract',
    'pickup_community_area',
    'dropoff_community_area',
    'pickup_centroid_latitude',
    'pickup_centroid_longitude',
    'pickup_centroid_location',
    'dropoff_centroid_latitude',
    'dropoff_centroid_longitude',
    'dropoff_centroid_location',
], keep='first')
df.shape

(6360653, 24)

#### 1.2.2 Drop Outliers

In [20]:
df_outliers = df

In [21]:
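def drop_outliers(df, column, mean, std):
    # Keep rows within 3 standard deviations of the mean
    # (rows with NaN in `column` fail both comparisons and are dropped too)
    return df[(df[column] > mean - 3 * std) & (df[column] < mean + 3 * std)]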

In [22]:
# Mean and standard deviation per column (used for the 3-sigma rule)
std_trip_seconds = df_outliers['trip_seconds'].std()
mean_trip_seconds = df_outliers['trip_seconds'].mean()

std_trip_miles = df_outliers['trip_miles'].std()
mean_trip_miles = df_outliers['trip_miles'].mean()

std_trip_total = df_outliers['trip_total'].std()
mean_trip_total = df_outliers['trip_total'].mean()

In [23]:
df_outliers = drop_outliers(df_outliers, "trip_seconds", mean_trip_seconds, std_trip_seconds)
df_outliers = drop_outliers(df_outliers, "trip_miles", mean_trip_miles, std_trip_miles)
df_outliers = drop_outliers(df_outliers, "trip_total", mean_trip_total, std_trip_total)
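
The 3-sigma rule used here can be illustrated on toy data (made-up values). The sketch uses a hypothetical variant, `drop_outliers_3sigma`, that computes mean and standard deviation itself. It shows that one extreme value is removed and that a row with a missing value is removed as well, because both comparisons evaluate to False for NaN.

```python
import pandas as pd

def drop_outliers_3sigma(df, column):
    """Keep rows whose value lies strictly within mean +/- 3 standard deviations."""
    mean, std = df[column].mean(), df[column].std()
    return df[(df[column] > mean - 3 * std) & (df[column] < mean + 3 * std)]

# 20 ordinary trips, one 24-hour trip, and one missing duration
toy = pd.DataFrame({"trip_seconds": [600.0] * 20 + [86_400.0, None]})
kept = drop_outliers_3sigma(toy, "trip_seconds")
print(len(toy), "->", len(kept))
```

Since mean and standard deviation are themselves influenced by extreme values, this rule is fairly lenient on heavily skewed columns.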

In [24]:
df_outliers.shape

(6306811, 24)

In [25]:
# Number of zero-valued entries remaining after dropping outliers
print("Number of zero entries for trip_seconds: ", len(df_outliers[df_outliers["trip_seconds"] == 0]))
print("Number of zero entries for trip_miles: ", len(df_outliers[df_outliers["trip_miles"] == 0]))
print("Number of zero entries for trip_total: ", len(df_outliers[df_outliers["trip_total"] == 0]))

Number of zero entries for trip_seconds:  147287
Number of zero entries for trip_miles:  781031
Number of zero entries for trip_total:  5325


#### 1.2.3 Mph Logic

As a reference for speed limits in Chicago/Cook County, we used https://www.arcgis.com/home/item.html?id=5e279cbe89794bcba87809d9ae95594d. Based on this, we set an upper limit of 65 mph for our taxi data, as anything above is unrealistic.

In [18]:
df_outliers['mph'] = np.where(df_outliers['trip_hours'] != 0, df_outliers['trip_miles'] / df_outliers['trip_hours'], np.nan)

In [19]:
# Keep plausible speeds only; rows with NaN mph (zero or missing duration) are dropped as well
df_outliers = df_outliers[(df_outliers["mph"] <= 65)]
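
The speed check can be sketched on toy data (made-up values). This variant replaces zero durations with NaN before dividing, so no infinite speeds are ever created; the constant name `MAX_MPH` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

MAX_MPH = 65  # threshold used in this notebook

toy = pd.DataFrame({
    "trip_miles": [5.4, 1.0, 30.0],
    "trip_seconds": [3600.0, 0.0, 600.0],
})

hours = toy["trip_seconds"] / 3600
# Zero durations become NaN, so the division yields NaN instead of inf
toy["mph"] = toy["trip_miles"] / hours.replace(0, np.nan)

# NaN compares as False, so trips without a usable duration are removed too
plausible = toy[toy["mph"] <= MAX_MPH]
print(plausible)
```

In this example only the first trip survives: the second has no usable duration and the third would correspond to 180 mph.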

In [20]:
df_outliers["mph"].max()

65.0

In [21]:
## change order
## duration 
## kmh

#### 1.2.4 Tips Logic

Through some research we found that the base fare of any taxi ride is $3.25 (https://www.chicago.gov/content/dam/city/depts/bacp/publicvehicleinfo/Chicabs/chicagotaxiplacard20200629.pdf). Hence, entries where the **"trip_total" is at most 3.25** are dropped.

In [22]:
print("Number of entries with trip total at or below $3.25: ", len(df_outliers[df_outliers["trip_total"] <= 3.25]))

Number of entries with trip total at or below $3.25:  99404


In [23]:
df_outliers = df_outliers[df_outliers["trip_total"] > 3.25]

In [24]:
df_outliers["trip_total"].min()

3.26

#### 1.2.5 Cancelled Trips Logic

We consider a trip cancelled if
- trip_miles = 0
- trip_seconds = 0
- pickup = dropoff (regarding lat, lng, census tract and centroid location)
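
The three criteria above could be expressed as a boolean mask. This is a minimal sketch under stated assumptions: the criteria are combined with AND (the text does not say whether all or any must hold), and only the centroid latitude and longitude are compared for "pickup = dropoff". The helper `cancelled_mask` is hypothetical and is not what the following notebook cell does.

```python
import pandas as pd

def cancelled_mask(df):
    """True for trips with zero miles, zero seconds, and identical pickup/dropoff centroid."""
    same_location = (
        (df["pickup_centroid_latitude"] == df["dropoff_centroid_latitude"])
        & (df["pickup_centroid_longitude"] == df["dropoff_centroid_longitude"])
    )
    return (df["trip_miles"] == 0) & (df["trip_seconds"] == 0) & same_location

# Made-up rows: only the first one meets all three criteria
toy = pd.DataFrame({
    "trip_miles": [0.0, 0.0, 0.9],
    "trip_seconds": [0.0, 812.0, 600.0],
    "pickup_centroid_latitude": [41.899602, 41.899602, 41.899602],
    "pickup_centroid_longitude": [-87.633308, -87.633308, -87.633308],
    "dropoff_centroid_latitude": [41.899602, 41.901207, 41.899602],
    "dropoff_centroid_longitude": [-87.633308, -87.676356, -87.633308],
})
print(cancelled_mask(toy).tolist())  # [True, False, False]
```

Rows flagged this way could then be excluded with `toy[~cancelled_mask(toy)]`.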

In [25]:
print(len(df_outliers[df_outliers['trip_miles']==0]))

570298


In [26]:
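# Keep trips with usable location information: at least one of census tract,
# centroid coordinates, or community area must be present for both pickup and dropoff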
condition_census_tract = (df_outliers['pickup_census_tract'].notna() & df_outliers['dropoff_census_tract'].notna())

condition_geolocation = (
    (df_outliers['pickup_centroid_latitude'].notna() & df_outliers['dropoff_centroid_latitude'].notna()) &
    (df_outliers['pickup_centroid_longitude'].notna() & df_outliers['dropoff_centroid_longitude'].notna())
)

condition_community_area = df_outliers['pickup_community_area'].notna() & df_outliers['dropoff_community_area'].notna()

condition = (condition_census_tract | condition_geolocation | condition_community_area)

df_cancelled = df_outliers[condition]

In [27]:
df_cancelled.shape

(5257291, 25)

In [28]:
df_cancelled[df_cancelled["trip_miles"]== 0].shape

(524407, 25)

#### 1.2.6 Merging Census Data

In [29]:
# Import prepped dfs
community_df = gp.read_file("data/prepped/community_df.geojson")
census_df = gp.read_file("data/prepped/census_tracts_df.geojson")

In [30]:
community_df.head()

Unnamed: 0,community,area_number,geometry
0,DOUGLAS,35,"MULTIPOLYGON (((-87.60914 41.84469, -87.60915 ..."
1,OAKLAND,36,"MULTIPOLYGON (((-87.59215 41.81693, -87.59231 ..."
2,FULLER PARK,37,"MULTIPOLYGON (((-87.62880 41.80189, -87.62879 ..."
3,GRAND BOULEVARD,38,"MULTIPOLYGON (((-87.60671 41.81681, -87.60670 ..."
4,KENWOOD,39,"MULTIPOLYGON (((-87.59215 41.81693, -87.59215 ..."


In [31]:
def merge_geodata(df, lat_col, lon_col, community_col, area_number_col, prefix):
    gdf = gp.GeoDataFrame(df, geometry=gp.points_from_xy(df[lon_col], df[lat_col]))
    gdf = gdf.rename_geometry(f"{prefix}_geometry")
    
    gdf.set_crs(epsg=4326, inplace=True)
    community_df.to_crs(epsg=4326, inplace=True)

    # Perform the spatial join
    merged_gdf = gp.sjoin(gdf, community_df, how="left", predicate="within")

    # Drop unnecessary columns
    merged_gdf = merged_gdf.drop([f"{prefix}_geometry", "index_right"], axis=1)

    # Rename columns
    merged_gdf = merged_gdf.rename(columns={
        community_col: f"{prefix}_community",
        area_number_col: f"{prefix}_area_number"
    })

    return pd.DataFrame(merged_gdf)

# Merge for pickup
merged_pickup_df = merge_geodata(df_cancelled, 'pickup_centroid_latitude', 'pickup_centroid_longitude', 'community', 'area_number', 'pickup')
# Merge for dropoff
merged_dropoff_df = merge_geodata(merged_pickup_df, 'dropoff_centroid_latitude', 'dropoff_centroid_longitude', 'community', 'area_number', 'dropoff')

In [32]:
merged_dropoff_df

Unnamed: 0,trip_id,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,...,pickup_centroid_location,dropoff_centroid_latitude,dropoff_centroid_longitude,dropoff_centroid_location,trip_hours,mph,pickup_community,pickup_area_number,dropoff_community,dropoff_area_number
0,4404c6835b9e74e9f74d70f235200a8ce09db14a,7e179f8ef66ae99ec2d1ec89224e0b7ee5469fe5627f6d...,2022-12-31 23:45:00,2023-01-01 00:15:00,2081.0,4.42,,,2.0,3.0,...,POINT (-87.6950125892 42.001571027),41.965812,-87.655879,POINT (-87.6558787862 41.96581197),0.578056,7.646324,WEST RIDGE,2,UPTOWN,3
1,466473fd2a196ebe92fb2983cb7e8af32e39aa1f,d1d88b89ceb6d753007b6e795e3c24f4bea905a51e9d47...,2022-12-31 23:45:00,2023-01-01 00:00:00,812.0,0.00,,,8.0,24.0,...,POINT (-87.6333080367 41.899602111),41.901207,-87.676356,POINT (-87.6763559892 41.9012069941),0.225556,0.000000,NEAR NORTH SIDE,8,WEST TOWN,24
2,3f5cd3f78e5cab455606a31372a95d3204b2fb3f,847cf962bd6f62040673e6c24c24940aeb2d7fdaa54677...,2022-12-31 23:45:00,2023-01-01 00:00:00,600.0,0.90,,,8.0,8.0,...,POINT (-87.6333080367 41.899602111),41.899602,-87.633308,POINT (-87.6333080367 41.899602111),0.166667,5.400000,NEAR NORTH SIDE,8,NEAR NORTH SIDE,8
3,38292159642750da7b20419330566f9eb0961cde,81092e4881f56106fae845c3ae4492f8b3c3213c33c920...,2022-12-31 23:45:00,2023-01-01 00:00:00,546.0,0.85,,,8.0,8.0,...,POINT (-87.6333080367 41.899602111),41.899602,-87.633308,POINT (-87.6333080367 41.899602111),0.151667,5.604396,NEAR NORTH SIDE,8,NEAR NORTH SIDE,8
4,3e01498f8ff771ad7eb37e4844cef20201b6c339,4ae32e2eb244ce143800e0c40055e537cc50e3358a07ce...,2022-12-31 23:45:00,2023-01-01 00:00:00,574.0,0.33,,,8.0,8.0,...,POINT (-87.6333080367 41.899602111),41.899602,-87.633308,POINT (-87.6333080367 41.899602111),0.159444,2.069686,NEAR NORTH SIDE,8,NEAR NORTH SIDE,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6382420,54d812a0b88f8f9707825261014b3563a0a60ace,f98ae5e71fdda8806710af321dce58002146886c013f41...,2022-01-01 00:00:00,2022-01-01 00:00:00,536.0,4.83,,,28.0,22.0,...,POINT (-87.6635175498 41.874005383),41.922761,-87.699155,POINT (-87.6991553432 41.9227606205),0.148889,32.440299,NEAR WEST SIDE,28,LOGAN SQUARE,22
6382421,7125b9e03a0f16c2dfb5eaf73ed057dc51eb68ef,8eca35a570101ad24c638f1f43eecce9d0cb7843e13a75...,2022-01-01 00:00:00,2022-01-01 00:15:00,897.0,2.07,,,8.0,32.0,...,POINT (-87.6333080367 41.899602111),41.878866,-87.625192,POINT (-87.6251921424 41.8788655841),0.249167,8.307692,NEAR NORTH SIDE,8,LOOP,32
6382422,52d1bd00d97eaed338bd98faf80c5709e22fef3d,b5e2695a2f44b9bce7a0a86148ac418802f0067be1f6d4...,2022-01-01 00:00:00,2022-01-01 00:00:00,598.0,6.64,,,8.0,77.0,...,POINT (-87.6333080367 41.899602111),41.986712,-87.663416,POINT (-87.6634164054 41.9867117999),0.166111,39.973244,NEAR NORTH SIDE,8,EDGEWATER,77
6382423,0f0c856e620e6b4dfd2bb1e921d966dd179eeca1,b21050ab3ad3d0972fd6378f6bf4d0251a8a7af42e6e0e...,2022-01-01 00:00:00,2022-01-01 00:00:00,33.0,0.17,,,3.0,3.0,...,POINT (-87.6558787862 41.96581197),41.965812,-87.655879,POINT (-87.6558787862 41.96581197),0.009167,18.545455,UPTOWN,3,UPTOWN,3


In [33]:
merged_dropoff_df.shape

(5257291, 29)

In [35]:
merged_dropoff_df.isna().sum()

trip_id                             0
taxi_id                             1
trip_start_timestamp                0
trip_end_timestamp                  0
trip_seconds                        0
trip_miles                          0
pickup_census_tract           2841436
dropoff_census_tract          2841436
pickup_community_area            1365
dropoff_community_area          33496
fare                                0
tips                                0
tolls                               0
extras                              0
trip_total                          0
payment_type                        0
company                             0
pickup_centroid_latitude          283
pickup_centroid_longitude         283
pickup_centroid_location          283
dropoff_centroid_latitude        5157
dropoff_centroid_longitude       5157
dropoff_centroid_location        5157
trip_hours                          0
mph                                 0
pickup_community                  288
pickup_area_

In [37]:
## MERGE HERE --------------------

In [51]:
condition = ~(merged_dropoff_df['pickup_area_number'].isna() | merged_dropoff_df['dropoff_area_number'].isna())

df_merged = merged_dropoff_df[condition]

In [52]:
df_merged.isna().sum()

trip_id                             0
taxi_id                             1
trip_start_timestamp                0
trip_end_timestamp                  0
trip_seconds                        0
trip_miles                          0
pickup_census_tract           2841436
dropoff_census_tract          2841436
pickup_community_area            1074
dropoff_community_area          28335
fare                                0
tips                                0
tolls                               0
extras                              0
trip_total                          0
payment_type                        0
company                             0
pickup_centroid_latitude            0
pickup_centroid_longitude           0
pickup_centroid_location            0
dropoff_centroid_latitude           0
dropoff_centroid_longitude          0
dropoff_centroid_location           0
trip_hours                          0
mph                                 0
pickup_community                    0
pickup_area_

In [53]:
df_merged

Unnamed: 0,trip_id,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,...,pickup_centroid_location,dropoff_centroid_latitude,dropoff_centroid_longitude,dropoff_centroid_location,trip_hours,mph,pickup_community,pickup_area_number,dropoff_community,dropoff_area_number
0,4404c6835b9e74e9f74d70f235200a8ce09db14a,7e179f8ef66ae99ec2d1ec89224e0b7ee5469fe5627f6d...,2022-12-31 23:45:00,2023-01-01 00:15:00,2081.0,4.42,,,2.0,3.0,...,POINT (-87.6950125892 42.001571027),41.965812,-87.655879,POINT (-87.6558787862 41.96581197),0.578056,7.646324,WEST RIDGE,2,UPTOWN,3
1,466473fd2a196ebe92fb2983cb7e8af32e39aa1f,d1d88b89ceb6d753007b6e795e3c24f4bea905a51e9d47...,2022-12-31 23:45:00,2023-01-01 00:00:00,812.0,0.00,,,8.0,24.0,...,POINT (-87.6333080367 41.899602111),41.901207,-87.676356,POINT (-87.6763559892 41.9012069941),0.225556,0.000000,NEAR NORTH SIDE,8,WEST TOWN,24
2,3f5cd3f78e5cab455606a31372a95d3204b2fb3f,847cf962bd6f62040673e6c24c24940aeb2d7fdaa54677...,2022-12-31 23:45:00,2023-01-01 00:00:00,600.0,0.90,,,8.0,8.0,...,POINT (-87.6333080367 41.899602111),41.899602,-87.633308,POINT (-87.6333080367 41.899602111),0.166667,5.400000,NEAR NORTH SIDE,8,NEAR NORTH SIDE,8
3,38292159642750da7b20419330566f9eb0961cde,81092e4881f56106fae845c3ae4492f8b3c3213c33c920...,2022-12-31 23:45:00,2023-01-01 00:00:00,546.0,0.85,,,8.0,8.0,...,POINT (-87.6333080367 41.899602111),41.899602,-87.633308,POINT (-87.6333080367 41.899602111),0.151667,5.604396,NEAR NORTH SIDE,8,NEAR NORTH SIDE,8
4,3e01498f8ff771ad7eb37e4844cef20201b6c339,4ae32e2eb244ce143800e0c40055e537cc50e3358a07ce...,2022-12-31 23:45:00,2023-01-01 00:00:00,574.0,0.33,,,8.0,8.0,...,POINT (-87.6333080367 41.899602111),41.899602,-87.633308,POINT (-87.6333080367 41.899602111),0.159444,2.069686,NEAR NORTH SIDE,8,NEAR NORTH SIDE,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6382420,54d812a0b88f8f9707825261014b3563a0a60ace,f98ae5e71fdda8806710af321dce58002146886c013f41...,2022-01-01 00:00:00,2022-01-01 00:00:00,536.0,4.83,,,28.0,22.0,...,POINT (-87.6635175498 41.874005383),41.922761,-87.699155,POINT (-87.6991553432 41.9227606205),0.148889,32.440299,NEAR WEST SIDE,28,LOGAN SQUARE,22
6382421,7125b9e03a0f16c2dfb5eaf73ed057dc51eb68ef,8eca35a570101ad24c638f1f43eecce9d0cb7843e13a75...,2022-01-01 00:00:00,2022-01-01 00:15:00,897.0,2.07,,,8.0,32.0,...,POINT (-87.6333080367 41.899602111),41.878866,-87.625192,POINT (-87.6251921424 41.8788655841),0.249167,8.307692,NEAR NORTH SIDE,8,LOOP,32
6382422,52d1bd00d97eaed338bd98faf80c5709e22fef3d,b5e2695a2f44b9bce7a0a86148ac418802f0067be1f6d4...,2022-01-01 00:00:00,2022-01-01 00:00:00,598.0,6.64,,,8.0,77.0,...,POINT (-87.6333080367 41.899602111),41.986712,-87.663416,POINT (-87.6634164054 41.9867117999),0.166111,39.973244,NEAR NORTH SIDE,8,EDGEWATER,77
6382423,0f0c856e620e6b4dfd2bb1e921d966dd179eeca1,b21050ab3ad3d0972fd6378f6bf4d0251a8a7af42e6e0e...,2022-01-01 00:00:00,2022-01-01 00:00:00,33.0,0.17,,,3.0,3.0,...,POINT (-87.6558787862 41.96581197),41.965812,-87.655879,POINT (-87.6558787862 41.96581197),0.009167,18.545455,UPTOWN,3,UPTOWN,3


In [38]:
## MERGE END -----------

In [39]:
## if merged add rest of function

In [58]:
df_merged = df_merged.drop(['pickup_community_area', 'dropoff_community_area'], axis=1)

In [59]:
df_merged.isna().sum()

trip_id                             0
taxi_id                             1
trip_start_timestamp                0
trip_end_timestamp                  0
trip_seconds                        0
trip_miles                          0
pickup_census_tract           2841436
dropoff_census_tract          2841436
fare                                0
tips                                0
tolls                               0
extras                              0
trip_total                          0
payment_type                        0
company                             0
pickup_centroid_latitude            0
pickup_centroid_longitude           0
pickup_centroid_location            0
dropoff_centroid_latitude           0
dropoff_centroid_longitude          0
dropoff_centroid_location           0
trip_hours                          0
mph                                 0
pickup_community                    0
pickup_area_number                  0
dropoff_community                   0
dropoff_area

### 1.3 Checking null values

In [63]:
df_merged = df_merged.dropna(subset=['taxi_id'])

In [64]:
df_merged.isna().sum()

trip_id                             0
taxi_id                             0
trip_start_timestamp                0
trip_end_timestamp                  0
trip_seconds                        0
trip_miles                          0
pickup_census_tract           2841435
dropoff_census_tract          2841435
fare                                0
tips                                0
tolls                               0
extras                              0
trip_total                          0
payment_type                        0
company                             0
pickup_centroid_latitude            0
pickup_centroid_longitude           0
pickup_centroid_location            0
dropoff_centroid_latitude           0
dropoff_centroid_longitude          0
dropoff_centroid_location           0
trip_hours                          0
mph                                 0
pickup_community                    0
pickup_area_number                  0
dropoff_community                   0
dropoff_area

#### 1.3.1 Time null values

In [None]:
# If start and end time are the same then trip seconds would still 

In [None]:
# Drop entries where trip seconds are null and time is equal
print("Entries with missing trip seconds and equal start and end time: ", len(df[df["trip_seconds"].isna() & (df["trip_start_timestamp"] ==  df["trip_end_timestamp"])]))
print("Entries with missing trip seconds and different start and end time: ", len(df[df["trip_seconds"].isna() & (df["trip_start_timestamp"] !=  df["trip_end_timestamp"])]))

In [None]:
# Drop where cannot be calculated:
#df = df[~(df["trip_seconds"].isna() & (df["trip_start_timestamp"] ==  df["trip_end_timestamp"]))]

# Else calculate the trip seconds:
# temp_trip = df[df["trip_seconds"].isna() & (df["trip_start_timestamp"] !=  df["trip_end_timestamp"])].copy()
# temp_trip["calculated_trip_seconds"] = (temp_trip["trip_end_timestamp"] - temp_trip["trip_start_timestamp"]).dt.seconds
# df["trip_seconds"].fillna(temp_trip["calculated_trip_seconds"], inplace=True)

# df["trip_seconds"].isna().sum()

#### Save Data

In [67]:
df_merged.to_csv('data/prepped/prep_taxidata.csv')
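
When the prepared file is loaded again later, two details matter: the index written by `to_csv` comes back as an extra `Unnamed: 0` column unless `index=False` is used, and the timestamp columns are read as strings unless `parse_dates` is given. The following sketch shows a round trip on toy data (made-up values), with an in-memory buffer standing in for the file; it is an illustration and not a change to the save cell above.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "trip_id": ["a", "b"],
    "trip_start_timestamp": pd.to_datetime(["2022-01-01 00:00", "2022-12-31 23:45"]),
})

buffer = io.StringIO()            # stands in for 'data/prepped/prep_taxidata.csv'
toy.to_csv(buffer, index=False)   # index=False avoids an extra "Unnamed: 0" column on reload
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV cannot store
reloaded = pd.read_csv(buffer, parse_dates=["trip_start_timestamp"])
print(reloaded.dtypes)
```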

### 1.4 Hexagons - res8 and res7 take too long (therefore commented out for now)

In [None]:
# Get hex ids
def add_h3_ids(df, res):
    df[f"h3_res{res}_pickup"] = np.vectorize(h3.geo_to_h3)(
        df['pickup_centroid_latitude'], df['pickup_centroid_longitude'], res)
    df[f"h3_res{res}_dropoff"] = np.vectorize(h3.geo_to_h3)(
        df['dropoff_centroid_latitude'], df['dropoff_centroid_longitude'], res)
    return df

# Get poly from hex ids - vectorized form to save time
def poly_from_hex(df, colname, res):
    hex_ids = df[f"h3_res{res}_{colname}"].values
    polygons = np.vectorize(lambda hex_id: Polygon(h3.h3_to_geo_boundary(hex_id, geo_json=True)))(hex_ids)
    df[f"poly_res{res}_{colname}"] = polygons
    return df

# Get count for each trip happening in the same hexagon
def get_poly_count(df, colname):
    name = colname.split("_")[1] + "_" + colname.split("_")[2]
    df[f"count{name}"] = df.groupby(colname)['trip_id'].transform('count')
    return df

In [None]:
# For hexagon resolution, adapted: https://towardsdatascience.com/exploring-location-data-using-a-hexagon-grid-3509b68b04a2 table
# hex_df = add_h3_ids(df, 7)
#hex_df = add_h3_ids(df, 8)

In [None]:
# Entries where hex id is 0, we cannot use for visualization, hence drop
## TODO: maybe if census tract is given/community area, we can use this to merge with later census data to get hexagons for these entries? 
print("Number of hex ids equal to 0: ",len(hex_df[(hex_df["h3_res7_pickup"] == "0") | (hex_df["h3_res7_dropoff"] == "0")]))

hex_df_clear = hex_df[(hex_df["h3_res7_pickup"] != "0") & (hex_df["h3_res7_dropoff"] != "0")]

In [None]:
hex_df_clear.head(2)

In [None]:
# Get polygon from hex ids
#hex_df_poly = poly_from_hex(hex_df_clear, "pickup", 7)
#hex_df_poly = poly_from_hex(hex_df_clear, "dropoff", 7)
# hex_df_poly = poly_from_hex(hex_df_clear, "pickup", 8)
# hex_df_poly = poly_from_hex(hex_df_clear, "dropoff", 8)

In [None]:
hex_df_poly.head(2)

In [None]:
# Get count for each polygon
# hex_df_poly = get_poly_count(hex_df_poly, "poly_res7_pickup")
#hex_df_poly = get_poly_count(hex_df_poly, "poly_res7_dropoff")
# hex_df_poly = get_poly_count(hex_df_poly, "poly_res8_pickup")
# hex_df_poly = get_poly_count(hex_df_poly, "poly_res8_dropoff")

In [None]:
#hex_df_clear.head(1)

In [None]:
# Make a geodf out of it for simple plotting
gdf_res7_pickup = gp.GeoDataFrame(hex_df_clear, geometry=hex_df_poly['poly_res7_pickup'], crs='EPSG:4326')
gdf_res7_dropoff = gp.GeoDataFrame(hex_df_clear, geometry=hex_df_poly['poly_res7_dropoff'], crs='EPSG:4326')
# gdf_res8_pickup = gp.GeoDataFrame(hex_df_clear, geometry=hex_df_poly['poly_res8_pickup'], crs='EPSG:4326')
# gdf_res8_dropoff = gp.GeoDataFrame(hex_df_clear, geometry=hex_df_poly['poly_res8_dropoff'], crs='EPSG:4326')

#Visualize
# fig, axs = plt.subplots(nrows = 1, ncols = 2, figsize=(10, 10))

# titles = ["Hex resolution 7", 
#           # "Hex resolution 8"
#          ]
# dfs = [gdf_res7_pickup, gdf_res7_dropoff,  
#        # gdf_res8_pickup, gdf_res8_dropoff
#       ]

# axs = axs.flatten()
# for ind in range(0, 3):
#     dfs[ind].plot(column="count", ax=axs[ind], legend=True)
#     axs[ind].set_title(titles[ind])

# plt.tight_layout()
# plt.show()

In [None]:
# df_pickup_res8['geometry'] = df_pickup_res7.apply(lambda x: Polygon(h3.h3_to_geo_boundary(x["h3_pickup_res8"], geo_json=True)), axis=1)
# trips_starts_geo = gp.GeoDataFrame(df_pickup_res7, geometry=df_pickup_res7['geometry'], crs='EPSG:4326')
# trips_starts_geo.head()

In [None]:
# trips_starts_geo.plot(column='count')

In [1]:
# print(df_outliers)