## Data Aggregation
In this notebook, we aggregate the trip data into hourly counts of starting and ending trips for use in predictive analysis.

In [1]:
import pandas as pd

In [2]:
trips_df = pd.read_pickle('../00_data/trips.pkl')

In [3]:
# floor trip start and end times to the hour, i.e. drop minutes and seconds
# ('h' is the current hourly alias; the uppercase 'H' is deprecated in recent pandas versions)
trips_df['start_time_floored'] = trips_df['start_time'].dt.floor('h')
trips_df['end_time_floored'] = trips_df['end_time'].dt.floor('h')
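
As a quick illustration of what flooring does, the sketch below applies `dt.floor` to a few made-up timestamps (the values are hypothetical and not taken from the trips data). Each timestamp is mapped to the start of the hour it falls in, so trips within the same hour end up with the same key.

```python
import pandas as pd

# Hypothetical timestamps, for illustration only
sample_times = pd.Series(pd.to_datetime([
    "2019-01-01 08:05:30",
    "2019-01-01 08:59:59",
    "2019-01-01 09:00:00",
]))

# Floor each timestamp to the start of its hour
floored = sample_times.dt.floor("h")
print(floored.tolist())
```

A timestamp that already sits exactly on the hour is left unchanged.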

In [4]:
# calculate the number of starting and ending trips for each hour
starting_trips_grouped = (
    trips_df.groupby(["start_time_floored"])
    .size()
    .to_frame("starting_trips")
)
ending_trips_grouped = (
    trips_df.groupby(["end_time_floored"])
    .size()
    .to_frame("ending_trips")
)

In [5]:
# combine both counts into a single dataframe and count the missing values per column
trips_hourly = pd.concat([starting_trips_grouped, ending_trips_grouped], axis=1)
trips_hourly.isna().sum()

starting_trips    142
ending_trips       85
dtype: int64

The missing values arise because `pd.concat` aligns the two frames on the union of their hourly indexes: some hours have trips starting but none ending, and vice versa. We fill these nulls with 0, meaning that no trips started (or ended) in that hour, and save the aggregated data to a new file.
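
To see where such nulls come from, here is a minimal sketch on made-up counts (the series `starts` and `ends` and their values are hypothetical). Concatenating along `axis=1` aligns on the union of both indexes, so any hour present on only one side gets `NaN` on the other.

```python
import pandas as pd

# Hypothetical hourly counts: trips start at 08:00 and 09:00 but end at 09:00 and 10:00
starts = pd.Series(
    [3, 1],
    index=pd.to_datetime(["2019-01-01 08:00", "2019-01-01 09:00"]),
    name="starting_trips",
)
ends = pd.Series(
    [2, 2],
    index=pd.to_datetime(["2019-01-01 09:00", "2019-01-01 10:00"]),
    name="ending_trips",
)

# axis=1 aligns on the union of both indexes, leaving NaN where one side has no entry
combined = pd.concat([starts, ends], axis=1)
print(combined.isna().sum())

# Replacing NaN with 0 records "no trips" for those hours
filled = combined.fillna(0).astype(int)
print(filled)
```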

In [6]:
# NaN forces the columns to float; after filling, cast back to integer counts
trips_hourly = trips_hourly.fillna(0).astype(int)
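
One case that filling nulls does not cover: an hour in which no trip started or ended never appears in the index at all. Whether that matters depends on the downstream model. If a gap-free hourly series is needed (an assumption, not something this notebook does), one option is to reindex onto a complete hourly range. The sketch below uses a made-up frame, `toy_hourly`, rather than the real data.

```python
import pandas as pd

# Hypothetical hourly frame with a gap at 09:00
toy_hourly = pd.DataFrame(
    {"starting_trips": [3, 1], "ending_trips": [2, 4]},
    index=pd.to_datetime(["2019-01-01 08:00", "2019-01-01 10:00"]),
)

# Build every hour between the first and last observed hour
full_range = pd.date_range(toy_hourly.index.min(), toy_hourly.index.max(), freq="h")

# Reindex onto the full range; hours that were absent get 0 trips
toy_complete = toy_hourly.reindex(full_range, fill_value=0)
print(toy_complete)
```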

In [7]:
trips_hourly.to_pickle('../00_data/trips_hourly.pkl')