# Police Call Data Resampling

This notebook resamples the cleaned police call data into one summary row per hour. See the column descriptions below.

This notebook could probably be part of the data cleaning notebook, but it's fine to keep it separate.

## Setup

This notebook depends on the cleaned data generated by the Data Cleaning notebook. Run that notebook first; it should generate four CSV files in the `data/cleaned/` directory, two of which are the inputs for this notebook.

In [1]:
# Imports
import pandas as pd
import os

In [12]:
# Read frames
# These csv files should exist if you ran the Data Cleaning notebook.

def output_dir(filename = ''):
    return os.path.join('data', 'cleaned', filename)

fremont_calls_file = output_dir('fremont_calls.csv')
greenway_calls_file = output_dir('greenway_calls.csv')

# Output files that we'll write to at the very end
resampled_fremont_calls_file = output_dir('resampled_fremont_calls.csv')
resampled_greenway_calls_file = output_dir('resampled_greenway_calls.csv')

# Make sure the files exist
if not os.path.exists(fremont_calls_file):
    print(f"{fremont_calls_file} doesn't exist. Run the Data Cleaning notebook first to generate.")
if not os.path.exists(greenway_calls_file):
    print(f"{greenway_calls_file} doesn't exist. Run the Data Cleaning notebook first to generate.")

In [3]:
# Load the call data from the paths defined above
fc = pd.read_csv(fremont_calls_file)
gc = pd.read_csv(greenway_calls_file)

## Summarizing the Data
We want hourly summaries of the call data. Each hour can contain many calls, so we'll resample to collapse them into a single row per hour.

For numeric values (`priority` and `arrived_delay`), we'll take the mean. For categorical values like `call_type`, we'll keep each unique value. We'll also count the number of calls in each hour.
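
As a minimal sketch of the idea, the toy example below resamples three made-up calls into hourly rows. The `calls` frame and its values are hypothetical and only loosely mirror the real columns. It assumes a pandas version that accepts the lowercase `'h'` hourly alias.

```python
import pandas as pd

# Hypothetical toy data: two calls in the 10:00 hour, none at 11:00, one at 12:00.
calls = pd.DataFrame({
    'arrived_time': pd.to_datetime([
        '2020-01-01 10:05:00',
        '2020-01-01 10:40:00',
        '2020-01-01 12:15:00',
    ]),
    'call_type': ['911', 'ONVIEW', '911'],
    'priority': [1, 3, 2],
})

hourly = calls.resample('h', on='arrived_time').agg({
    # Keep each unique value seen in the hour
    'call_type': lambda s: list(s.unique()),
    # Average the numeric column
    'priority': 'mean',
})
print(hourly)
```

Note that hours with no calls still get a row, so their numeric averages come out as NaN.
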

In [4]:
# Parse dates and times
def parse_dates_and_times(frame):
    # Parse to datetime objects
    frame['time_queued'] = pd.to_datetime(frame['time_queued'], format="%m/%d/%Y %I:%M:%S %p")
    frame['arrived_time'] = pd.to_datetime(frame['arrived_time'], format="%H:%M:%S")
    
    # arrived_time only contains a time of day, so we'll take the date from time_queued and apply it to arrived_time.
    # That way arrived_time and time_queued both contain full datetimes and can be compared directly.
    frame['arrived_time'] = frame.apply(
        lambda row: row['time_queued'].replace(
            hour=row['arrived_time'].hour,
            minute=row['arrived_time'].minute,
            second=row['arrived_time'].second
        ),
        axis=1
    )
    
    # Some calls cross the date boundary (i.e. queued just before midnight, arrived just after).
    # Check for that and move arrived_time to the next day so the date is accurate.
    # If the arrival time of day is earlier than the queued time of day, we know it crossed midnight.
    for index, row in frame.iterrows():
        if row['arrived_time'].time() < row['time_queued'].time():
            frame.at[index, 'arrived_time'] += pd.DateOffset(days=1)
    
    # Make sure the column ends up with a datetime dtype (a no-op if it already has one)
    frame['arrived_time'] = pd.to_datetime(frame['arrived_time'])

# Run the above function for both frames
parse_dates_and_times(fc)
parse_dates_and_times(gc)
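
The row-by-row `apply` and `iterrows` steps above can be slow on large frames. As a hedged sketch (not the method used in this notebook), the same date handling can be done with vectorized operations. The `toy` frame below is hypothetical; it assumes the same string formats as the cleaned data.

```python
import pandas as pd

# Hypothetical toy frame using the same raw string formats the notebook parses.
toy = pd.DataFrame({
    'time_queued': ['01/01/2020 11:55:00 PM', '01/02/2020 08:00:00 AM'],
    'arrived_time': ['00:10:00', '08:12:30'],
})

queued = pd.to_datetime(toy['time_queued'], format='%m/%d/%Y %I:%M:%S %p')

# Interpret the arrival time of day as an offset from midnight.
arrived_offset = pd.to_timedelta(toy['arrived_time'])

# Combine the queued date (at midnight) with the arrival time of day.
arrived = queued.dt.normalize() + arrived_offset

# If the arrival lands before the queue time, assume it happened the next day.
arrived = arrived.where(arrived >= queued, arrived + pd.Timedelta(days=1))

# Delay in minutes, computed the same way as in the notebook.
delay = (arrived - queued).dt.total_seconds() / 60
print(pd.DataFrame({'queued': queued, 'arrived': arrived, 'delay': delay}))
```

Like the loop above, this assumes no call takes more than 24 hours to arrive.
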

In [5]:
# Calculate arrived_delay
fc['arrived_delay'] = (fc['arrived_time'] - fc['time_queued']).dt.total_seconds() / 60 # units in minutes
gc['arrived_delay'] = (gc['arrived_time'] - gc['time_queued']).dt.total_seconds() / 60

In [6]:
# Here we'll do the actual resampling
def resample(frame):
    # 'h' is the hourly frequency alias; the uppercase 'H' is deprecated in pandas 2.2+
    resampled_frame = frame.resample('h', on='arrived_time').agg({
        # Keep each unique call type
        'call_type': lambda call_type: call_type.unique(),
        # Keep each unique initial call type
        'initial_call_type': lambda initial_call_type: initial_call_type.unique(),
        # Keep each unique final call type
        'final_call_type': lambda final_call_type: final_call_type.unique(),
        # Keep each unique clearance description
        'clearance_desc': lambda clearance_desc: clearance_desc.unique(),
        # Keep the average priority
        'priority': 'mean',
        # Keep the average arrived delay that we calculated
        'arrived_delay': 'mean',
        # This is just the number of rows aggregated, i.e. the number of calls in that hour
        'cad_num': 'size'
    })
    
    # Rename the columns to be more descriptive
    resampled_frame.columns = [
        'call_types',
        'initial_call_type',
        'final_call_types',
        'clearance_desc',
        'average_priority',
        'average_arrived_delay',
        'call_count'
    ]
    resampled_frame.index.name = 'hour'
    return resampled_frame

resampled_fc = resample(fc)
resampled_gc = resample(gc)
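
Renaming columns by position works, but it silently mislabels columns if the order of the aggregation dictionary ever changes. As an alternative sketch (a swapped-in technique, not what this notebook does), pandas named aggregation on a `groupby` with `pd.Grouper` sets the output names right next to each aggregation. The `toy` frame is hypothetical and only includes the numeric columns.

```python
import pandas as pd

# Hypothetical toy data with the numeric columns used by the notebook.
toy = pd.DataFrame({
    'arrived_time': pd.to_datetime([
        '2020-01-01 10:05:00',
        '2020-01-01 10:40:00',
        '2020-01-01 11:15:00',
    ]),
    'cad_num': [101, 102, 103],
    'priority': [1, 3, 2],
    'arrived_delay': [5.0, 15.0, 7.5],
})

# Group into hourly bins on arrived_time, naming each output column explicitly.
hourly = toy.groupby(pd.Grouper(key='arrived_time', freq='h')).agg(
    average_priority=('priority', 'mean'),
    average_arrived_delay=('arrived_delay', 'mean'),
    call_count=('cad_num', 'size'),
)
hourly.index.name = 'hour'
print(hourly)
```
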

In [13]:
# Write the resampled data to file
resampled_fc.to_csv(resampled_fremont_calls_file)
resampled_gc.to_csv(resampled_greenway_calls_file)
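
One thing to keep in mind: the unique-value columns hold arrays, and CSV stores them as plain strings, so they won't come back as arrays when the file is read again. A hedged sketch of one way around that is to join the values with a separator before writing and split them after reading. The `hourly` frame, the `SEP` separator, and the in-memory buffer below are all hypothetical stand-ins.

```python
import io
import pandas as pd

# Hypothetical resampled frame with one list-valued column.
hourly = pd.DataFrame(
    {'call_types': [['911', 'ONVIEW'], []], 'call_count': [2, 0]},
    index=pd.DatetimeIndex(['2020-01-01 10:00', '2020-01-01 11:00'], name='hour'),
)

# Assumed separator; pick one that never appears in the call type values.
SEP = '|'

to_write = hourly.copy()
to_write['call_types'] = to_write['call_types'].apply(SEP.join)

# Write to an in-memory buffer here; the notebook would write to a file path instead.
buffer = io.StringIO()
to_write.to_csv(buffer)
buffer.seek(0)

# Read it back and split the joined strings into lists again.
restored = pd.read_csv(buffer, index_col='hour', parse_dates=['hour'])
restored['call_types'] = (
    restored['call_types']
    .fillna('')  # empty hours are read back as NaN
    .apply(lambda s: s.split(SEP) if s else [])
)
print(restored)
```
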