# Testing whether more people ride citibike from Manhattan to Brooklyn, or from Brooklyn to Manhattan

The Citibike September 2015 file can be obtained here:
https://drive.google.com/file/d/0Bx8AmvwAoY0jUzV1dXltOXZHWTQ/view?usp=sharing

Dataset: open data for citibike ridership for September 2015

Idea: We want to know whether weekday ridership from Brooklyn to Manhattan is greater than weekday ridership from Manhattan to Brooklyn.

Terms: 

"Manhattan": Citibike stations in Manhattan and The Bronx

"Brooklyn": Citibike stations in Brooklyn and Queens

Control Group: People riding from Manhattan to Brooklyn

Test Group: People riding from Brooklyn to Manhattan

Hypotheses:

Null Hypothesis: The total number of rides from Manhattan to Brooklyn on weekdays is greater than or equal to the total number of rides from Brooklyn to Manhattan on weekdays.

Alternative Hypothesis: The total number of rides from Manhattan to Brooklyn on weekdays is less than the total number of rides from Brooklyn to Manhattan on weekdays.

Process:
First, we took the JSON file of Citibike stations from the Citibike database and extracted the station id, latitude and longitude.
Then we used ArcGIS to lasso the stations in Manhattan and the Bronx and assigned them the value 'Manhattan' in a new column named 'Borough'; we did the same for the stations in Brooklyn and Queens with the value 'Brooklyn'.
Next, we used Python to left-join our station-borough table to the Citibike csv file on the start station id, repeated the join on the end station id, and deleted the columns we did not need.
Then we converted the starttime column to a datetime and flagged each trip as a weekday or a weekend trip.
We counted the trips in each direction between the boroughs, split by weekday and weekend, and performed a chi-squared test of the hypothesis.
We got a chi-squared value of 7 and rejected the null hypothesis at the 95% confidence level.

## Extracting data from JSON, getting a list of stations with latitudes and longitudes

In [3]:
# importing libraries to work with json, csv, url, and big data frames
import json
import sys
import urllib2 as ulib
import csv
import pandas as pd
import numpy as np

# getting json with a list of station id, latitude, longitude
Request = ulib.urlopen('http://www.citibikenyc.com/stations/json')
datum= json.loads(Request.read())

# extracting the list of stations from the json data
bike = datum['stationBeanList']

# saving the json data into the csv 'bikelist.csv'
with open('bikelist.csv', 'wb') as CsvFile:
    Writer = csv.writer(CsvFile)
    Headers = ['station id', 'latitude', 'longitude']
    Writer.writerow(Headers)

    for station in bike:
        Writer.writerow([station['id'], station['latitude'], station['longitude']])        

# After that we split all the stations into the Manhattan ones and Brooklyn ones
# We used ArcGis Map for that.
# the resulting file is StationsBorough.csv
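
The extraction step above can be tried without network access. The sketch below is written for Python 3 and runs the same JSON-to-CSV logic on an inline payload. The payload is made up for illustration (it only mimics the 'stationBeanList' shape the notebook reads), and the function name `stations_to_csv` is an assumption, not part of the notebook.

```python
import csv
import io
import json

# Made-up payload in the shape the notebook reads: a 'stationBeanList' array
# whose entries carry 'id', 'latitude' and 'longitude'.
sample_payload = '''
{"stationBeanList": [
  {"id": 1, "latitude": 40.717290, "longitude": -73.996375},
  {"id": 2, "latitude": 40.742327, "longitude": -73.954117}
]}
'''

def stations_to_csv(payload, out):
    """Write a header row plus one CSV row per station to the text file object `out`."""
    writer = csv.writer(out)
    writer.writerow(['station id', 'latitude', 'longitude'])
    for station in json.loads(payload)['stationBeanList']:
        writer.writerow([station['id'], station['latitude'], station['longitude']])

# Writing to an in-memory buffer instead of 'bikelist.csv'
buffer = io.StringIO()
stations_to_csv(sample_payload, buffer)

# Reading the buffer back to inspect the rows that were written
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

Passing a file opened with `open('bikelist.csv', 'w', newline='')` instead of the buffer would write the same rows to disk under Python 3.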

## Joining table of stations and boroughs to citibike dataset

In [5]:
# reading the generated table of station ids and their boroughs into boroughs, and Citibike data for September 2015 into citibike0915
boroughs = pd.read_csv('StationsBorough.csv')
# the file can be obtained here
# https://drive.google.com/file/d/0Bx8AmvwAoY0jUzV1dXltOXZHWTQ/view?usp=sharing
citibike0915 = pd.read_csv('bike.csv')


CParserError: Error tokenizing data. C error: Expected 1 fields in line 40, saw 3


In [3]:
# performing left join of the station id table to the original citibike table to 
# identify where the stations are: in Brooklyn or in Manhattan:
# by starting station
citibike0915=citibike0915.merge(boroughs, left_on='start station id', right_on='id', how='left', sort=False)

#  cleaning data and renaming the adjoined columns
citibike0915.drop(['bikeid', 'usertype', 'gender', 'tripduration', 'stoptime', 'start station name', 'birth year', 'end station name', 'id'], axis=1, inplace=True)
citibike0915.rename(columns = {'Borough':'Start Borough'}, inplace=True)

# performing left join of the station id table to the original citibike table to 
# identify where the stations are: in Brooklyn or in Manhattan:
# by ending station
citibike0915=citibike0915.merge(boroughs, left_on='end station id', right_on='id', how='left', sort=False)

# cleaning data and renaming the adjoined columns
citibike0915.drop('id', axis=1, inplace=True)
citibike0915.rename(columns = {'Borough':'End Borough'}, inplace=True)

# converting the starttime field from string to datetime
citibike0915['starttime'] = pd.to_datetime(citibike0915['starttime'])
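
The two left joins can be checked on a toy table before running them on the full dataset. This is a minimal sketch: the station ids, the borough assignments, and the helper name `add_borough` are all made up for illustration. A station id missing from the borough table ends up with a missing value rather than dropping the trip, which is the point of using `how='left'`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables used above
boroughs = pd.DataFrame({'id': [1, 2], 'Borough': ['Manhattan', 'Brooklyn']})
trips = pd.DataFrame({
    'start station id': [1, 2, 1],
    'end station id': [2, 1, 99],  # 99 is deliberately absent from boroughs
})

def add_borough(df, station_col, new_name):
    """Left-join the borough of `station_col` and store it under `new_name`."""
    merged = df.merge(boroughs, left_on=station_col, right_on='id', how='left')
    return merged.drop('id', axis=1).rename(columns={'Borough': new_name})

trips = add_borough(trips, 'start station id', 'Start Borough')
trips = add_borough(trips, 'end station id', 'End Borough')
print(trips)
```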

## Counting trips in each direction between the boroughs on weekdays and weekends

In [4]:
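# function to check whether a given date is flagged as a weekend
# datetime.weekday() returns 0 for Monday through 6 for Sunday,
# so the condition `> 5` is True only for Sunday; `>= 5` would also include Saturday
def Weekend(x):
    return x.weekday() > 5

# flagging each trip by whether its start time passes the check above
citibike0915['weekend'] = citibike0915['starttime'].apply(Weekend)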
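
As a quick check of how `weekday()` numbers the days, the sketch below evaluates it for four consecutive dates in September 2015 and compares the `> 5` and `>= 5` conditions. The helper name `flagged` and the `DAY_NAMES` list are made up for this illustration.

```python
from datetime import date

DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

def flagged(day, include_saturday):
    """Hypothetical helper: apply `>= 5` or `> 5` to day.weekday()."""
    if include_saturday:
        return day.weekday() >= 5
    return day.weekday() > 5

# Four consecutive dates from the month covered by the dataset
for day in [date(2015, 9, 4), date(2015, 9, 5), date(2015, 9, 6), date(2015, 9, 7)]:
    print(day, DAY_NAMES[day.weekday()], day.weekday(),
          flagged(day, include_saturday=False),
          flagged(day, include_saturday=True))
```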

In [5]:
print(citibike0915.head(5))

            starttime  start station id  start station latitude  \
0 2015-09-01 00:00:00               263               40.717290   
1 2015-09-01 00:00:00               495               40.762699   
2 2015-09-01 00:00:01              3119               40.742327   
3 2015-09-01 00:00:07               536               40.741444   
4 2015-09-01 00:00:09               347               40.728846   

   start station longitude  end station id  end station latitude  \
0               -73.996375             307             40.714275   
1               -73.993012             449             40.764618   
2               -73.954117            3118             40.735550   
3               -73.975361             340             40.712690   
4               -74.008591             483             40.732233   

   end station longitude Start Borough End Borough weekend  
0             -73.989900     Manhattan   Manhattan   False  
1             -73.987895     Manhattan   Manhattan   False  
2    

In [6]:
# counting how many trips happened on a weekday (weekend = false), depending on the direction of the trip:
# Brooklyn to Manhattan or Manhattan to Brooklyn
citibike0915.groupby(['weekend', 'Start Borough', 'End Borough']).size()

weekend  Start Borough  End Borough
False    Brooklyn       Brooklyn        97429
                        Manhattan       23952
         Manhattan      Brooklyn        25568
                        Manhattan      989467
True     Brooklyn       Brooklyn        17655
                        Manhattan        3716
         Manhattan      Brooklyn         3838
                        Manhattan      120147
dtype: int64

## Testing the hypothesis. Performing chi-squared and z-tests

In [7]:
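# Performing the chi-square test
# Total cross-borough rides in each direction (weekday + weekend),
# taken from the groupby counts above
Tot_MB_Grp = 29406  # Manhattan to Brooklyn: 25568 + 3838
Tot_BM_Grp = 27668  # Brooklyn to Manhattan: 23952 + 3716

# Weekday and weekend frequency of Manhattan to Brooklyn (MB)
Tot_MBWD_Grp = 25568
Tot_MBWE_Grp = Tot_MB_Grp - Tot_MBWD_Grp

# Weekday and weekend frequency of Brooklyn to Manhattan (BM)
Tot_BMWD_Grp = 23952
Tot_BMWE_Grp = Tot_BM_Grp - Tot_BMWD_Grp

# Assigning values to the 2x2 contingency table notation
a = Tot_MBWD_Grp
b = Tot_BMWD_Grp
c = Tot_MBWE_Grp
# d reuses the MB weekend count here; the BM weekend count is Tot_BMWE_Grp
d = Tot_MBWE_Grp

# Computing the chi-square statistic for a 2x2 table:
# chi2 = N * (a*d - b*c)**2 / ((a + b)*(c + d)*(a + c)*(b + d))
Ntot = a + b + c + d
# 'expected' holds the denominator of the formula (the product of the four marginal totals)
expected = (a + b)*(c + d)*(a + c)*(b + d)
sample_values = [[a,b],[c,d]]
 
# all operands are integers, so the division truncates under Python 2
chisqstat= lambda N, values, expect : N*((values[0][0]*values[1][1]-values[0][1]*values[1][0])**2)/(expect)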

# Printing the chi-square statistic value
print "Chi-squared statistics is ", chisqstat(Ntot,  sample_values, expected)

Chi-squared statistics is  7


### The chi-squared statistic is 7, which exceeds the critical value of 3.84 (1 degree of freedom, alpha = 0.05), therefore we reject the null hypothesis at the 95% confidence level.
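
The same statistic can be sketched with floating-point division, so the result is not truncated to an integer, and paired with a p value. This is an illustration under stated assumptions rather than the notebook's own code: the function names `chi_square_2x2` and `p_value_1dof` are made up here, no continuity correction is applied, and the inputs are the a, b, c, d values exactly as assigned in the cell above.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * float(a * d - b * c) ** 2 / denominator

def p_value_1dof(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

CRITICAL_95 = 3.841  # chi-square critical value for 1 degree of freedom, alpha = 0.05

# Same a, b, c, d as assigned in the cell above
stat = chi_square_2x2(25568, 23952, 3838, 3838)
print("chi-square:", stat)
print("p value:", p_value_1dof(stat))
print("exceeds critical value:", stat > CRITICAL_95)
```

For one degree of freedom the chi-square variable is the square of a standard normal variable, which is why the p value can be written with `math.erfc` and needs no extra library.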

In [8]:
# Z-test

# Manhattan-to-Brooklyn weekday rides as a share of all cross-borough rides
P_0 = 25568.0 / 57074.0
# Brooklyn-to-Manhattan weekday rides as a share of all cross-borough rides
P_1 = 23952.0 / 57074.0

# Total rides in each direction, weekdays and weekends (sample sizes):
# n_0 is Manhattan to Brooklyn, n_1 is Brooklyn to Manhattan
n_0=29406.0
n_1=27668.0

# weekday ride counts in each direction
Nt_0=25568.0
Nt_1=23952.0

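# weighted average of the two shares, using the sample sizes as weights
sp=((P_0*n_0)+(P_1*n_1))/(n_1+n_0)
print sp

# standard error of the difference between two proportions, given a pooled proportion p
sp_stdev= lambda p, n: np.sqrt( p * ( 1 - p ) /n[0] +  p * ( 1 - p )/n[1]  )

# the pooled proportion passed here is weekday rides over all cross-borough rides
sp_stdev_2y=sp_stdev((Nt_0+Nt_1)/(n_0+n_1),[n_0,n_1])

zscore = lambda p0, p1, s : (p0-p1)/s
z_2y = zscore(P_1, P_0, sp_stdev_2y)
print "z score ", z_2y

# p value entered as a constant; it is not computed from z_2y
p_2y=1-0.9115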

alpha = 0.05
def report_result(p,a):
    print 'is the p value {0:.2f} smaller than the critical value {1:.2f}? '.format(p,a)
    if p<a:
        print "YES!"
    else: print "NO!"
    
    print 'the Null hypothesis is {}'.format( 'rejected' if p<a  else 'not rejected') 

    
report_result(p_2y,alpha)

0.434253862862
z score  -9.97585786394
is the p value 0.09 smaller than the critical value 0.05? 
NO!
the Null hypothesis is not rejected


### The z score is -9.98, therefore the null hypothesis is rejected at the 95% confidence level.
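
The p value in the cell above is entered as a constant. A minimal sketch of deriving it from a z score with the standard library follows, assuming the normal approximation holds; `normal_cdf` is a helper introduced here and is not part of the notebook.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

z = -9.97585786394  # z score printed above
alpha = 0.05

# Probability of a z score at least this far into the lower tail
p_lower = normal_cdf(z)
# Two-sided version: probability of a z score at least this far from zero
p_two_sided = 2.0 * normal_cdf(-abs(z))

print("lower-tail p value:", p_lower)
print("two-sided p value:", p_two_sided)
print("smaller than alpha:", p_lower < alpha)
```

Using `math.erfc` instead of `1 - erf` keeps precision in the far tails, where the probabilities are many orders of magnitude below 1.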