# Project Benson

## Exploring... and Aggregating

In [3]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels
import seaborn as sns
from numpy import linalg

import math
import patsy

from statsmodels.formula.api import ols

%matplotlib inline

In [4]:
!python -V

Python 3.6.3 :: Anaconda custom (64-bit)


In [5]:
print("Pandas version:",pd.__version__)
print("Numpy version:",np.__version__)

Pandas version: 0.20.3
Numpy version: 1.13.3


## Pick a week and play...

In [22]:
week = '170422' # File dated April 22, 2017; covers the week starting April 15
datafile = 'turnstile_%s.txt' % week
url = 'http://web.mta.info/developers/data/nyct/turnstile/%s' % datafile

# Specify location to store dataframes
#mydir = '/home/joseph/ds/Projects/Project_Benson/Data'
mydir =

df_pickle = '%s/turnstile_%s.pkl' % (mydir,week)
hourly_pickle = '%s/turnstile_%s_hourly.pkl' % (mydir,week)

## Data input

I start with the raw data from the MTA site.  I want a single DateTime column immediately, so DATE and TIME are combined into DATE_TIME on read.

In [8]:
df = pd.read_csv(url, parse_dates = [['DATE','TIME']])

# When reading straight from the MTA site, the last column name ('EXITS')
# comes with trailing whitespace, so strip the column names now.
# Comment this out otherwise.
df.columns = df.columns.str.strip()
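
To make the header correction above concrete, here is a minimal sketch on a made-up two-line CSV. The sample row and the `toy` name are assumptions for illustration only, not values from the MTA file. It also builds DATE_TIME explicitly with `pd.to_datetime`, which may be a useful alternative on recent pandas releases where the nested-list form of `parse_dates` is deprecated.

```python
import io
import pandas as pd

# Made-up sample that mimics the raw header, including trailing spaces after EXITS.
raw = io.StringIO(
    "UNIT,SCP,STATION,DATE,TIME,ENTRIES,EXITS   \n"
    "R051,02-00-00,59 ST,04/15/2017,00:00:00,100,50\n"
)
toy = pd.read_csv(raw)

# Strip whitespace from every column name, so 'EXITS   ' becomes 'EXITS'.
toy.columns = toy.columns.str.strip()

# Combine the DATE and TIME strings into one timestamp column.
toy["DATE_TIME"] = pd.to_datetime(
    toy["DATE"] + " " + toy["TIME"], format="%m/%d/%Y %H:%M:%S"
)

print(list(toy.columns))
print(toy["DATE_TIME"].iloc[0])
```

Stripping every name rather than renaming only the last column means the fix does not depend on column order.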


Or, if desired, read in previous work...

In [9]:
#df = pd.read_pickle(df_pickle, compression='gzip')
#hourly = pd.read_pickle(hourly_pickle, compression='gzip')

## Data Processing

In [10]:
df.set_index(['UNIT','SCP','STATION','DATE_TIME'], inplace=True)

In [11]:
def resampler(x):
    """ A function to resample time series data - to be used in groupby apply."""
    return (x.set_index('DATE_TIME')       # Resample based on our timestamp
            .resample('1H')                # Set the desired time period here
            .mean()                        # The aggregate function used to create the new sampled rows.
            .interpolate()                 # With the new rows, use interpolate to create the data
            .diff()                        # Now, use diff to create deltas
           )

The raw data comes from polling each turnstile's cumulative counters roughly every 4 hours.

So, setting the resampling period to 1 hour may permit further analysis on an hourly basis, but the hourly values are interpolated estimates rather than measurements, since we don't truly have that granularity.  Setting the period to more than 4 hours drops data, because the diff sets the first row in each series to N/A, and with a longer period that first row covers more of the week.
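
To see what the resample, interpolate, diff chain does to 4-hour readings, here is a small sketch on a made-up counter series. The three readings and the names `readings` and `hourly_deltas` are assumptions for illustration; `'60min'` is used as an equivalent spelling of the `'1H'` period.

```python
import pandas as pd

# Made-up cumulative counter readings, polled every 4 hours.
readings = pd.Series(
    [1000.0, 1040.0, 1120.0],
    index=pd.to_datetime(
        ["2017-04-15 00:00", "2017-04-15 04:00", "2017-04-15 08:00"]
    ),
)

hourly_deltas = (readings
                 .resample("60min")   # one row per hour; new rows start as NaN
                 .mean()
                 .interpolate()       # linear fill between the real readings
                 .diff())             # per-hour change; the first row becomes NaN

print(hourly_deltas)
```

Each 4-hour change is spread evenly over its four hourly rows (10 per hour, then 20 per hour here), so the totals are preserved but the within-block hourly shape is an artifact of the interpolation.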

In [12]:
hourly = (df.reset_index(level=3)
 .groupby(level=[0,1,2])
 .apply(resampler)
)

## Data Cleaning

In [13]:
# The diff operation created N/A entries.  Let's set those to zero.
hourly.fillna(0,inplace=True)

We know we need to clean or drop the "leftover" rows from the diff operation.

Things that remain:
* Some turnstiles are actually counting backwards, believe it or not.  So we'll use *abs* to correct that.
* Turnstile rollovers and resets create enormous anomalous values after diff.  These are rare (way less than 1%), so let's just set them to zero.
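
One way to check the "way less than 1%" figure is to measure the share of deltas at or above a cutoff before cleaning. This is a rough sketch: the helper name `anomaly_fraction`, the toy values, and the cutoff of 1000 per hour are assumptions for illustration.

```python
import pandas as pd

def anomaly_fraction(deltas, threshold=1000):
    """Return the share of deltas whose magnitude is at or above threshold."""
    return float((deltas.abs() >= threshold).mean())

# Made-up deltas: one huge value stands in for a counter reset.
toy = pd.Series([5.0, -3.0, 12.0, 2500000.0, 8.0, 0.0, 1.0, 4.0, -6.0, 9.0])
share = anomaly_fraction(toy)
print(share)
```

On the notebook's data this would be called as `anomaly_fraction(hourly["ENTRIES"])` and `anomaly_fraction(hourly["EXITS"])`, before the cleaning step runs.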

In [14]:
# There remain anomalies from a variety of causes.  Let's clean things up...
def cleanitup(x):
    """Take the absolute value; zero out deltas of 1000 or more (rollovers, resets)."""
    x = abs(x)
    return x if x < 1000 else 0
hourly["ENTRIES"] = hourly["ENTRIES"].map(cleanitup)
hourly["EXITS"] = hourly["EXITS"].map(cleanitup)
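
The same rule can also be written with vectorized pandas operations instead of an element-wise `map`. This is a sketch on made-up values, not the notebook's method; `THRESHOLD`, `deltas`, and `cleaned` are names introduced here for illustration.

```python
import pandas as pd

THRESHOLD = 1000  # same cutoff as the cleaning cell above

# Made-up deltas: a backwards counter, a reset, and a rollover among normal values.
deltas = pd.Series([12.0, -7.5, 0.0, 250000.0, -4000000.0, 999.0])

magnitude = deltas.abs()                              # fix backwards-counting turnstiles
cleaned = magnitude.where(magnitude < THRESHOLD, 0)   # zero out anomalous jumps

print(cleaned.tolist())
```

`Series.where` keeps values where the condition holds and substitutes 0 elsewhere, which matches the `abs(x) if abs(x) < 1000 else 0` rule.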

## Data Aggregation

In [15]:
weekly_aggregate = hourly.groupby('STATION').sum()

In [16]:
weekly_aggregate['TOTAL'] = weekly_aggregate['ENTRIES'] + weekly_aggregate['EXITS']

In [17]:
print(weekly_aggregate.sort_values('TOTAL',ascending=False).head(10))

                  ENTRIES     EXITS      TOTAL
STATION                                       
34 ST-PENN STA   975276.5  819934.5  1795211.0
GRD CNTRL-42 ST  873820.0  767634.0  1641454.0
34 ST-HERALD SQ  781894.0  685535.0  1467429.0
23 ST            714458.0  528202.0  1242660.0
TIMES SQ-42 ST   631281.0  591907.0  1223188.0
14 ST-UNION SQ   648214.0  563431.0  1211645.0
42 ST-PORT AUTH  640614.0  442887.0  1083501.0
86 ST            528811.0  449156.0   977967.0
FULTON ST        539670.0  437482.0   977152.0
125 ST           508231.0  375817.0   884048.0


Now, let's save our work...

In [23]:
df.to_pickle(df_pickle, compression='gzip')
hourly.to_pickle(hourly_pickle, compression='gzip')

Yay!