## PUI 2016 HOMEWORK 3, ASSIGNMENT 2
#### Ian Wright, iw453
#### September 2016

In [2]:
import urllib, json
import pandas as pd
import numpy as np

#### IDEA:

There is demand for expanded Citibike service in Upper Manhattan / Harlem / Columbia University.

#### HYPOTHESIS:

$H_0:$ The average number of daily rides (normalized per bike) that *start* along the north edge of the Citibike service zone in Manhattan (all Citibike stations on or above 106 St) is *less than or equal to* that of Citibike stations in a similar, but non-boundary, region of Manhattan (between 91 St and 106 St). The timeframe is the most recent complete 4 weeks of available data.

##### $H_0$: $\mathrm{avg}(\text{rides}/\text{bike}/\text{day})_{\text{north of 106 St, Manhattan}} \le \mathrm{avg}(\text{rides}/\text{bike}/\text{day})_{\text{south of 106 St, Manhattan}}$

$H_1:$ The average number of daily rides (normalized per bike) that *start* along the north edge of the Citibike service zone in Manhattan (all Citibike stations on or above 106 St) is *greater than* that of Citibike stations in a similar, but non-boundary, region of Manhattan (between 91 St and 106 St). The timeframe is the most recent complete 4 weeks of available data.

##### $H_1$: $\mathrm{avg}(\text{rides}/\text{bike}/\text{day})_{\text{north of 106 St, Manhattan}} > \mathrm{avg}(\text{rides}/\text{bike}/\text{day})_{\text{south of 106 St, Manhattan}}$

#### SIGNIFICANCE LEVEL:

For this test, I'll use a significance level of $\alpha=0.05$
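The notebook does not say which statistical test will be applied. Purely as an illustration, a one-sided permutation test on the difference in mean daily rides per bike is one way to compare the two groups against $\alpha=0.05$. The helper name `one_sided_permutation_p` and the sample numbers below are hypothetical and are not Citibike data.

```python
import random

def one_sided_permutation_p(boundary, non_boundary, n_perm=10000, seed=0):
    """Estimate P(diff >= observed) under H_0 by shuffling group labels.

    boundary, non_boundary: lists of rides-per-bike-per-day values
    (hypothetical inputs, one value per observation).
    """
    rng = random.Random(seed)
    n_b = len(boundary)
    observed = sum(boundary) / float(n_b) - sum(non_boundary) / float(len(non_boundary))
    pooled = list(boundary) + list(non_boundary)
    n_rest = len(pooled) - n_b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_b]) / float(n_b) - sum(pooled[n_b:]) / float(n_rest)
        if diff >= observed:
            hits += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return float(hits + 1) / (n_perm + 1)

ALPHA = 0.05
# Made-up values, chosen only so the two groups are clearly separated.
p = one_sided_permutation_p([2.1, 2.4, 2.2, 2.6], [1.0, 1.2, 0.9, 1.1])
reject_null = p < ALPHA
```

A permutation test makes no normality assumption, which may suit a small number of stations; a one-sided t-test would be another reasonable choice.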

#### DATA STRATEGY:

Citibike provides a JSON feed of all stations in the system. We'll parse this feed to build a list of station IDs, names, and total dock capacity for each station.

Then we'll cross-reference a citibike map for the relevant stations for our study, and group our dataset into boundary and non-boundary stations (and drop unneeded stations).

In [52]:
# get station data from citibike station feed
# (urllib.urlopen is Python 2 only; on Python 3 the equivalent is urllib.request.urlopen)
url = "https://feeds.citibikenyc.com/stations/stations.json"
response = urllib.urlopen(url)
stations = json.loads(response.read())

In [53]:
# use list comprehension to build a master list of relevant station data
station_data = [{'id':station['id'],
                 'stationName':station['stationName'],
                 'totalDocks':station['totalDocks']} 
                for station in stations['stationBeanList']] 

Sadly, Citibike doesn't provide a map that shows the unique integer ID of each station. Instead, we need to inspect the map and collect the stationName strings for our boundary and non-boundary zones, then search the station_data list for those names.

In [54]:
# list of stationName for boundary zone: 
boundary_names = ['Cathedral Pkwy & Broadway', 'West End Ave & W 107 St', 
                  'W 106 St & Amsterdam Ave', 'W 107 St & Columbus Ave',
                 'W 106 St & Central Park West', 'Central Park North & Adam Clayton Powell Blvd',
                 'E 110 St & Madison Ave', 'E 106 St & Madison Ave', 'E 106 St & Lexington Ave',
                 'E 109 St & 3 Ave', 'E 106 St & 1 Av']

In [55]:
# list of stationName for non-boundary zone: 
non_boundary_names = ['Riverside Dr & W 104 St', 'West End Ave & W 94 St', 'W 92 St & Broadway',
                     'W 100 St & Broadway', 'W 95 St & Broadway', 'W 104 St & Amsterdam Ave',
                     'Columbus Ave & W 95 St', 'Columbus Ave & W 103 St', 'W 100 St & Manhattan Ave',
                     'Central Park W & W 96 St', 'Central Park West & W 100 St',
                      'Central Park West & W 102 St', '5 Ave & E 93 St', '5 Ave & E 103 St',
                      'E 97 St & Madison Ave', 'Madison Ave & E 99 St', 'E 91 St & Park Ave',
                      'E 102 St & Park Ave','E 103 St & Lexington Ave', 'E 95 St & 3 Ave',
                      'E 97 St & 3 Ave', '3 Ave & E 100 St', 'E 91 St & 2 Ave', '2 Ave & E 99 St',
                      '2 Ave & E 104 St', '1 Ave & E 94 St', 'E 102 St & 1 Ave']

This is not the most efficient approach, but we need to iterate through all of the station data to pick out the stations that belong to the boundary or non-boundary group.

In [56]:
boundary_stations = []
non_boundary_stations = []
for station in station_data:
    if station['stationName'] in boundary_names:
        boundary_stations.append(station)
    elif station['stationName'] in non_boundary_names:
        non_boundary_stations.append(station)
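The two name lists hold 11 and 27 names, while the printed results contain 10 and 26 stations, so one name in each list apparently had no exact match in the feed. A minimal sketch of the same partition using sets, which also reports the names that were never matched; `split_stations` is a hypothetical helper and the sample records are made up.

```python
def split_stations(station_data, boundary_names, non_boundary_names):
    """Partition station dicts by name and list names with no exact match."""
    boundary_set = set(boundary_names)
    non_boundary_set = set(non_boundary_names)
    boundary, non_boundary = [], []
    for station in station_data:
        name = station['stationName']
        if name in boundary_set:
            boundary.append(station)
        elif name in non_boundary_set:
            non_boundary.append(station)
    # Names we asked for that never appeared in the feed (e.g. spelling differences).
    seen = {station['stationName'] for station in station_data}
    unmatched = sorted((boundary_set | non_boundary_set) - seen)
    return boundary, non_boundary, unmatched

# Made-up sample records using the same keys as station_data.
sample = [
    {'id': 1, 'stationName': 'A St & 1 Ave', 'totalDocks': 20},
    {'id': 2, 'stationName': 'B St & 2 Ave', 'totalDocks': 30},
]
b, nb, missing = split_stations(sample,
                                ['A St & 1 Ave', 'C St & 3 Av'],
                                ['B St & 2 Ave'])
print(missing)
```

Set membership is a constant-time lookup on average, so the scan stays linear in the number of stations, and the `unmatched` list makes silent mismatches visible.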

In [57]:
# now that the lists are built, we'll be using integer IDs to identify the stations
# we can drop all the station names to simplify things
boundary_stations = [{'id':station['id'],
                      'totalDocks':station['totalDocks']}
                     for station in boundary_stations]

non_boundary_stations = [{'id':station['id'],
                      'totalDocks':station['totalDocks']}
                     for station in non_boundary_stations]

In [58]:
boundary_stations

[{'id': 3323, 'totalDocks': 59},
 {'id': 3343, 'totalDocks': 23},
 {'id': 3357, 'totalDocks': 35},
 {'id': 3366, 'totalDocks': 19},
 {'id': 3374, 'totalDocks': 36},
 {'id': 3383, 'totalDocks': 25},
 {'id': 3387, 'totalDocks': 25},
 {'id': 3390, 'totalDocks': 24},
 {'id': 3400, 'totalDocks': 24},
 {'id': 3424, 'totalDocks': 27}]

In [59]:
non_boundary_stations

[{'id': 3292, 'totalDocks': 43},
 {'id': 3293, 'totalDocks': 24},
 {'id': 3294, 'totalDocks': 35},
 {'id': 3295, 'totalDocks': 59},
 {'id': 3301, 'totalDocks': 39},
 {'id': 3302, 'totalDocks': 23},
 {'id': 3305, 'totalDocks': 39},
 {'id': 3307, 'totalDocks': 31},
 {'id': 3309, 'totalDocks': 30},
 {'id': 3312, 'totalDocks': 39},
 {'id': 3314, 'totalDocks': 32},
 {'id': 3316, 'totalDocks': 47},
 {'id': 3320, 'totalDocks': 31},
 {'id': 3325, 'totalDocks': 31},
 {'id': 3327, 'totalDocks': 27},
 {'id': 3328, 'totalDocks': 39},
 {'id': 3331, 'totalDocks': 39},
 {'id': 3336, 'totalDocks': 41},
 {'id': 3338, 'totalDocks': 31},
 {'id': 3341, 'totalDocks': 59},
 {'id': 3345, 'totalDocks': 35},
 {'id': 3350, 'totalDocks': 39},
 {'id': 3351, 'totalDocks': 25},
 {'id': 3363, 'totalDocks': 33},
 {'id': 3367, 'totalDocks': 35},
 {'id': 3379, 'totalDocks': 35}]
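Because the hypothesis normalizes rides per bike, each group needs a denominator. As a rough sketch, assuming `totalDocks` is an acceptable stand-in for the number of bikes at a station (an assumption; the notebook does not state how the normalization is done), the dock counts printed above can be totalled like this:

```python
# Dock counts copied from the two outputs above.
boundary_docks = [59, 23, 35, 19, 36, 25, 25, 24, 24, 27]
non_boundary_docks = [43, 24, 35, 59, 39, 23, 39, 31, 30, 39, 32, 47, 31,
                      31, 27, 39, 39, 41, 31, 59, 35, 39, 25, 33, 35, 35]

total_boundary_docks = sum(boundary_docks)          # 297 docks across 10 stations
total_non_boundary_docks = sum(non_boundary_docks)  # 941 docks across 26 stations

print(total_boundary_docks, total_non_boundary_docks)
```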

Now we need to load some actual trip data from citibike to use as our sample.

In [None]:
# Download the citibike trip data for the last available month.
# At the time of this writing (9/24/2016) that is June 2016.
!curl -O 'https://s3.amazonaws.com/tripdata/201606-citibike-tripdata.zip'

In [38]:
# Import trip data into a pandas dataframe
# ⚠️ WARNING ⚠️: This import will take a long time. Seriously, like 5-10 minutes.
# This is because of the datetime conversion.
trips = pd.read_csv('201606-citibike-tripdata.zip',
                    dtype={'starttime': 'str',
                           'start station id': 'int64'},
                    parse_dates=['starttime'])

In [49]:
# Drop all columns except for starttime and start station id
trips = trips[["starttime", "start station id"]]

In [50]:
# Take a look at the data
trips.head(3)

| | starttime | start station id |
|---|---|---|
| 109067 | 2016-06-03 00:00:12 | 164 |
| 109068 | 2016-06-03 00:00:37 | 264 |
| 109069 | 2016-06-03 00:00:36 | 368 |


In [43]:
# Alternative: convert starttime to datetime datatype in a new column.
# Left commented out because read_csv above already parses starttime via parse_dates.
# WARNING: Processor intensive, will take some time
# trips.loc[:, 'start_datetime'] = pd.to_datetime(trips.starttime, format='%m/%d/%Y %H:%M:%S')
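Passing an explicit `format` to `pd.to_datetime` is a common way to avoid per-value format guessing, and it may shorten the conversion time noted above. The sketch below assumes the raw timestamps look like `6/3/2016 00:00:12` (month/day/year), which is suggested by the commented-out format string but has not been verified against the file; the rows are illustrative.

```python
import pandas as pd

# Illustrative rows in the assumed raw format (month/day/year, 24-hour time).
raw = pd.DataFrame({
    'starttime': ['6/3/2016 00:00:12', '6/3/2016 00:00:37', '6/30/2016 23:59:59'],
    'start station id': [164, 264, 368],
})

# An explicit format string tells pandas exactly how to read each value.
raw['starttime'] = pd.to_datetime(raw['starttime'], format='%m/%d/%Y %H:%M:%S')
print(raw.dtypes)
```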

In [48]:
# Filter to exactly 4 weeks of data so that every day of the week appears exactly 4 times (4 Mondays, 4 Tuesdays, etc).
# The last day of June 2016 was Thursday the 30th. Thus, this dataset now captures all rides from Friday the 3rd at
# midnight to Thursday the 30th at 23:59:59.
trips = trips[trips["starttime"] >= "2016-06-03 00:00:00"]
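The four-week window can be sanity-checked with the standard library alone. This sketch lists the days from June 3 to June 30, 2016 and counts how often each weekday occurs.

```python
from collections import Counter
from datetime import date, timedelta

start = date(2016, 6, 3)
end = date(2016, 6, 30)

# Every calendar day in the window, inclusive of both endpoints.
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# date.weekday() returns 0 for Monday through 6 for Sunday.
weekday_counts = Counter(d.weekday() for d in days)

print(len(days))        # 28 days, i.e. exactly 4 weeks
print(start.weekday())  # 4, a Friday
print(end.weekday())    # 3, a Thursday
```

Each of the seven weekdays appears exactly four times in the window, so no day of the week is over-represented in the sample.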

In [97]:
trips[trips["start station id"].isin([168])]

| | starttime | start station id |
|---|---|---|
| 109263 | 2016-06-03 00:18:51 | 168 |
| 109315 | 2016-06-03 00:26:03 | 168 |
| 110368 | 2016-06-03 06:02:27 | 168 |
| 110572 | 2016-06-03 06:18:04 | 168 |
| 110599 | 2016-06-03 06:19:47 | 168 |
| 111182 | 2016-06-03 06:48:23 | 168 |
| 111672 | 2016-06-03 07:06:25 | 168 |
| 112299 | 2016-06-03 07:32:30 | 168 |
| 112694 | 2016-06-03 07:45:24 | 168 |
| 112771 | 2016-06-03 07:47:20 | 168 |


In [82]:
# Get relevant trips 
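# Build explicit lists of IDs; a bare map() object is not list-like under Python 3, so isin() would reject it
boundary_ids = [station["id"] for station in boundary_stations]
non_boundary_ids = [station["id"] for station in non_boundary_stations]
boundary_trips = trips[trips["start station id"].isin(boundary_ids)]
non_boundary_trips = trips[trips["start station id"].isin(non_boundary_ids)]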
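Once the trips are split by group, the quantity in the hypothesis can be computed. The sketch below is one possible reading of "rides per bike per day": total trips starting in a group, divided by the group's total docks and the number of days. The helper `rides_per_dock_per_day` is hypothetical, it uses `totalDocks` as a proxy for bikes, and the toy data is made up.

```python
import pandas as pd

def rides_per_dock_per_day(trips, stations, n_days=28):
    """Group-level rate: trips started / (total docks * number of days)."""
    station_ids = [s['id'] for s in stations]
    total_docks = sum(s['totalDocks'] for s in stations)
    n_trips = trips['start station id'].isin(station_ids).sum()
    return float(n_trips) / (total_docks * n_days)

# Toy data: five of the six trips start at stations in the group.
toy_trips = pd.DataFrame({'start station id': [1, 1, 2, 3, 3, 3]})
toy_stations = [{'id': 1, 'totalDocks': 2}, {'id': 3, 'totalDocks': 3}]

rate = rides_per_dock_per_day(toy_trips, toy_stations, n_days=1)
print(rate)
```

Computing the rate per station and then averaging across stations would be an equally valid interpretation, and it would give the per-station values a two-sample test needs.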

In [91]:
non_boundary_trips

Unnamed: 0,starttime,start station id
