In [1]:
import pandas as pd
import os

## Data Overview

The trip data was downloaded from the New York City Taxi & Limousine Commission (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The following is the data description from TLC's data page:

For-Hire Vehicle (“FHV”) trip records include fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID (shape file below). These records are generated from the FHV Trip Record submissions made by bases. Note: The TLC publishes base trip record data as submitted by the bases, and we cannot guarantee or confirm their accuracy or completeness. Therefore, this may not represent the total amount of trips dispatched by all TLC-licensed bases. The TLC performs routine reviews of the records and takes enforcement actions when necessary to ensure, to the extent possible, complete and accurate information.

The weather data was downloaded from Kaggle (https://www.kaggle.com/datasets/meinertsen/new-york-city-taxi-trip-hourly-weather-data). The uploader scraped the data from Wunderground. It contains hourly weather observations for New York City.

## Data Assumptions

1. Each row represents a completed trip.
2. The FHV data, although not limited to Uber, Lyft, and other ride-hailing apps, is representative of demand for these apps.
3. 

## Data Preprocessing

Since the weather data is recorded at the hour level, we aggregate the trip data to the same granularity. We have 12 files of trip data, one for each month of 2016; we aggregate each file to the hour level first and then combine the results. We count the number of records in every hour, and that count is our outcome variable: demand.
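
As a quick illustration of the hourly count described above, here is a minimal sketch on a toy table. The three rows are made up for the example; only the column names `pickup_datetime` and `dropOff_datetime` are taken from the trip files used in this notebook.

```python
import pandas as pd

# Toy trip records (made-up values); one row per trip.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-01-01 00:05", "2016-01-01 00:40", "2016-01-01 02:15",
    ]),
    "dropOff_datetime": pd.to_datetime([
        "2016-01-01 00:20", "2016-01-01 01:10", "2016-01-01 02:30",
    ]),
})

# Count the rows that fall in each 60-minute bin of the pick-up time.
hourly = (
    trips.resample("60min", on="pickup_datetime")
    .size()
    .rename("trip_count")
    .reset_index()
    .rename(columns={"pickup_datetime": "datetime"})
)
print(hourly)
```

Note that the hour starting at 01:00 has no pick-ups in this toy table but still appears in the result with a count of 0, so hours without trips inside the covered range are kept rather than dropped.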

In [3]:
# Go over the 12 trip data tables (one per month),
# aggregate each table to the hour level by counting rows (each row is a trip),
# then combine the aggregated tables.

def agg_hourly(df):
    # Count the rows in each 60-minute bin of pickup_datetime
    df = df.resample('60min', on="pickup_datetime").agg({'dropOff_datetime':'size'}).reset_index()
    df = df.rename(columns={'pickup_datetime':'datetime','dropOff_datetime':'trip_count'})
    return df

raw_dir = '../data/trip-data-tlc/raw/'

# Collect the hourly tables in a list and concatenate once at the end
hourly_tables = []

# os.listdir returns files in arbitrary order, so sort for a reproducible result
for filename in sorted(os.listdir(raw_dir)):
    trip_data = pd.read_parquet(os.path.join(raw_dir, filename))
    hourly_tables.append(agg_hourly(trip_data))

combined_tripdata = pd.concat(hourly_tables)

# combined_tripdata.to_csv('../data/trip-data-tlc/aggregated/combined_tripdata.csv')

In [None]:
combined_tripdata

In [None]:
filename

In [None]:
agg_hourly(data)

In [None]:
index = pd.date_range('1/1/2000', periods=9, freq='min')

In [None]:
series = pd.Series(range(9), index=index)

In [None]:
series

In [None]:
series.resample('2min').count()

In [None]:
data