# Overview of the analyses that follow:

For this assignment, I performed a Kolmogorov-Smirnov (KS) test for differences in ride duration between all CitiBike rides that begin during the day versus all rides that begin at night. I also performed a KS test for differences in the age of riders who begin their CitiBike rides in Manhattan versus riders who begin in Brooklyn.

After performing the KS tests using all of the relevant rides, I took random samples of 200 rides for each category (day, night, Manhattan, and Brooklyn rides) and performed two more KS tests, day versus night and Manhattan versus Brooklyn, on the smaller samples. I conducted these additional tests because, with very large samples, the KS test is more likely to find differences between distributions. (Two real-world samples are almost never identically distributed, and the more information provided to the KS test, the more likely it is to detect the difference.) With the smaller samples, any differences the test does detect are more likely to be practically meaningful.
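The sample-size effect described above can be illustrated with synthetic data. The sketch below is not part of the CitiBike analysis: it draws two normal samples whose means differ by a small, assumed shift of 0.1 and runs `scipy.stats.ks_2samp` at two sample sizes. The distributions, the shift, the seed, and the helper name `ks_at_size` are all illustrative assumptions.

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(101)  # seed chosen for illustration only

def ks_at_size(n, shift=0.1):
    """Run a two-sample KS test on two normal samples of size n whose means differ by shift."""
    a = rng.normal(loc=0.0, scale=1.0, size=n)
    b = rng.normal(loc=shift, scale=1.0, size=n)
    return scipy.stats.ks_2samp(a, b)

small = ks_at_size(200)
large = ks_at_size(200000)

# The underlying shift is identical in both runs; only the amount of data changes.
print("n = 200:    KS = %.3f, p = %.3g" % (small.statistic, small.pvalue))
print("n = 200000: KS = %.3f, p = %.3g" % (large.statistic, large.pvalue))
```

With 200,000 points per sample, even this small shift should produce a p value far below .05, while the 200-point run may or may not reject depending on the draw.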

In addition to the KS tests, I also performed Pearson's and Spearman's correlation tests on random samples of 500 data points per category. (I used the same sample size for every category because both tests compare paired values and will not run on inputs of unequal length.)
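As a small illustration of the equal-length requirement, the sketch below calls `scipy.stats.pearsonr` on toy arrays (the values are made up). In the SciPy versions I am aware of, a length mismatch raises `ValueError`; treat the exact exception type as an assumption.

```python
import numpy as np
import scipy.stats

x = np.arange(5, dtype=float)
y_same = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y_short = y_same[:4]  # one element fewer than x

# Paired inputs of equal length: the test runs
r, p = scipy.stats.pearsonr(x, y_same)

# Unequal lengths: there is no pairing, so the call fails
try:
    scipy.stats.pearsonr(x, y_short)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(r, p, mismatch_raised)
```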

See below for the null hypotheses and results for each test. For all of the analyses that follow, I used an $\alpha$ level of .05 as the threshold for statistical significance.

**Note:** For my analysis of rides by time of day, I define daytime as rides that start from 5 am through the 4 pm hour (start hours 5 to 16) and night as rides that start from 8 pm through the 3 am hour (start hours 20 to 23 and 0 to 3). For my analysis of rides by place of origin, I used a rectangular bounding box (minimum and maximum latitude and longitude) to roughly define each of Manhattan and Brooklyn. This method is not ideal: the two boxes overlap, and they also capture parts of other boroughs (especially the Bronx and Queens).
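The overlap problem can be seen with a single station. The sketch below uses the bounding-box coordinates that appear later in this notebook; the helper `in_box` and the tuple layout are hypothetical names introduced only for illustration.

```python
# Bounding boxes as (lat_min, lat_max, lon_min, lon_max), copied from the
# coordinates used later in this notebook.
MANHATTAN = (40.680396, 40.882214, -74.047285, -73.907000)
BROOKLYN = (40.551042, 40.739446, -74.056630, -73.833365)

def in_box(lat, lon, box):
    """Return True if the point lies inside the rectangular bounding box."""
    lat_min, lat_max, lon_min, lon_max = box
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

# Metropolitan Ave & Bedford Ave, a start station that appears in the trip data
lat, lon = 40.715348, -73.960241
in_manhattan = in_box(lat, lon, MANHATTAN)
in_brooklyn = in_box(lat, lon, BROOKLYN)
print(in_manhattan, in_brooklyn)
```

Both checks return `True` for this station, so a ride starting there is counted in both groups.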

In [30]:
from __future__ import print_function, division
import pylab as pl
import pandas as pd
import numpy as np
import os
import scipy.stats

%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Data wrangling:

In [31]:
os.environ["PUIdata"] = "{}/PUIdata".format(os.getenv("HOME"))

In [32]:
# Read the July 2017 data
!curl -O https://s3.amazonaws.com/tripdata/201707-citibike-tripdata.csv.zip
# Unpack it into PUIdata
!unzip 201707-citibike-tripdata.csv.zip -d $PUIdata

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 57.7M  100 57.7M    0     0  41.8M      0  0:00:01  0:00:01 --:--:-- 41.8M
Archive:  201707-citibike-tripdata.csv.zip
  inflating: /nfshome/shb395/PUIdata/201707-citibike-tripdata.csv  


In [33]:
# Read the December 2017 data
!curl -O https://s3.amazonaws.com/tripdata/201712-citibike-tripdata.csv.zip
# Unpack it into PUIdata
!unzip 201712-citibike-tripdata.csv.zip -d $PUIdata

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30.6M  100 30.6M    0     0  40.0M      0 --:--:-- --:--:-- --:--:-- 40.0M
Archive:  201712-citibike-tripdata.csv.zip
  inflating: /nfshome/shb395/PUIdata/201712-citibike-tripdata.csv  


In [34]:
# Make sure all the data has been loaded
!ls $PUIdata

201707-citibike-tripdata.csv  PLUTO_for_WEB  README.md
201712-citibike-tripdata.csv  prac	     times.txt
ACS_16			      puma	     ZIP_CODE_040114.shp


In [35]:
# Read and concatenate data into a dataframe
jul = pd.read_csv(os.getenv("PUIdata") + '/' + '201707-citibike-tripdata.csv')
dec = pd.read_csv(os.getenv("PUIdata") + '/' + '201712-citibike-tripdata.csv')
frames = [jul,dec]
df_new = pd.concat(frames)

In [36]:
df_new.columns

Index(['tripduration', 'starttime', 'stoptime', 'start station id',
       'start station name', 'start station latitude',
       'start station longitude', 'end station id', 'end station name',
       'end station latitude', 'end station longitude', 'bikeid', 'usertype',
       'birth year', 'gender'],
      dtype='object')

In [37]:
df_new.rename(columns={'start station latitude':'start_lat', \
                       'start station longitude':'start_long'}, inplace=True)

In [38]:
df_new = df_new.dropna(subset=['tripduration', 'starttime','start_lat','start_long'])

In [39]:
df_new['hour'] = pd.to_datetime(df_new['starttime'])
df_new['hour'] = df_new['hour'].dt.hour

In [40]:
# Had to reset index because there was a problem of duplicate indexes
df_new = df_new.reset_index(drop=True)

In [41]:
# Adding column for rides started during the day
df_new['day_dur'] = df_new['tripduration'][(df_new['hour'] >= 5) & \
                                               (df_new['hour'] <= 16)]

In [42]:
# Adding column for rides started at night (start hours 20-23 and 0-3)
df_new['night_dur'] = df_new['tripduration'][(df_new['hour'] >= 20) | \
                                             (df_new['hour'] <= 3)]
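The day and night definitions leave some start hours in neither group. A minimal sketch on a handful of made-up hour values, using `np.select` to label each hour (the `period` labels are an illustrative assumption, not a column in the notebook):

```python
import numpy as np
import pandas as pd

hours = pd.Series([0, 3, 4, 5, 16, 17, 19, 20, 23])

is_day = (hours >= 5) & (hours <= 16)
is_night = (hours >= 20) | (hours <= 3)

# np.select picks the first matching label; unmatched hours fall to "neither"
period = np.select([is_day.values, is_night.values], ["day", "night"],
                   default="neither")
print(dict(zip(hours, period)))
```

Hours 4 and 17 through 19 come out as "neither", so rides starting then are excluded from both samples.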

In [59]:
df_day2 = df_new[(df_new['hour'] >= 5) & (df_new['hour'] <= 16)]

In [62]:
df_del_cols = ['night_dur','mnhtn','brkln']
# Reassign instead of dropping in place, which avoids pandas' SettingWithCopyWarning
df_day2 = df_day2.drop(df_del_cols, axis=1)
df_day2.head()


Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start_lat,start_long,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,hour,day_dur
1491,763,2017-07-01 05:00:12,2017-07-01 05:12:56,494,W 26 St & 8 Ave,40.747348,-73.997236,411,E 6 St & Avenue D,40.722281,-73.976687,26519,Subscriber,1977.0,1,5,763.0
1492,546,2017-07-01 05:01:55,2017-07-01 05:11:02,499,Broadway & W 60 St,40.769155,-73.981918,519,Pershing Square North,40.751873,-73.977706,26024,Customer,,0,5,546.0
1493,185,2017-07-01 05:03:19,2017-07-01 05:06:25,423,W 54 St & 9 Ave,40.765849,-73.986905,479,9 Ave & W 45 St,40.760193,-73.991255,29416,Subscriber,1991.0,1,5,185.0
1494,380,2017-07-01 05:03:30,2017-07-01 05:09:50,3058,Lewis Ave & Kosciuszko St,40.692371,-73.937054,3058,Lewis Ave & Kosciuszko St,40.692371,-73.937054,19977,Subscriber,1990.0,2,5,380.0
1495,196,2017-07-01 05:04:26,2017-07-01 05:07:43,472,E 32 St & Park Ave,40.745712,-73.981948,527,E 33 St & 2 Ave,40.744023,-73.976056,15005,Subscriber,1984.0,1,5,196.0


In [65]:
df_day2 = df_day2.dropna(subset=['day_dur'])


In [67]:
df_night = df_new[(df_new['hour'] >= 20) | (df_new['hour'] <= 3)]

In [68]:
df_del_cols = ['day_dur','mnhtn','brkln']
# Reassign instead of dropping in place, which avoids pandas' SettingWithCopyWarning
df_night = df_night.drop(df_del_cols, axis=1)
df_night.head()


Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start_lat,start_long,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,hour,night_dur
0,364,2017-07-01 00:00:00,2017-07-01 00:06:05,539,Metropolitan Ave & Bedford Ave,40.715348,-73.960241,3107,Bedford Ave & Nassau Ave,40.723117,-73.952123,14744,Subscriber,1986.0,1,0,364.0
1,2142,2017-07-01 00:00:03,2017-07-01 00:35:46,293,Lafayette St & E 8 St,40.730207,-73.991026,3425,2 Ave & E 104 St,40.78921,-73.943708,19587,Subscriber,1981.0,1,0,2142.0
2,328,2017-07-01 00:00:08,2017-07-01 00:05:37,3242,Schermerhorn St & Court St,40.691029,-73.991834,3397,Court St & Nelson St,40.676395,-73.998699,27937,Subscriber,1984.0,2,0,328.0
3,2530,2017-07-01 00:00:11,2017-07-01 00:42:22,2002,Wythe Ave & Metropolitan Ave,40.716887,-73.963198,398,Atlantic Ave & Furman St,40.691652,-73.999979,26066,Subscriber,1985.0,1,0,2530.0
4,2534,2017-07-01 00:00:15,2017-07-01 00:42:29,2002,Wythe Ave & Metropolitan Ave,40.716887,-73.963198,398,Atlantic Ave & Furman St,40.691652,-73.999979,29408,Subscriber,1982.0,2,0,2534.0


In [70]:
df_night = df_night.dropna(subset=['night_dur'])


In [45]:
# Add county origin location using lat long boundaries for Manhattan and Brooklyn
# Source: http://www.mapdevelopers.com/geocode_bounding_box.php
df_new['mnhtn'] = 2018 - df_new['birth year'][(df_new['start_lat'] <= 40.882214) & \
                           (df_new['start_lat'] >= 40.680396) & \
                           (df_new['start_long'] <= -73.907000) & \
                           (df_new['start_long'] >= -74.047285)]

df_new['brkln'] = 2018 - df_new['birth year'][(df_new['start_lat'] <= 40.739446) & \
                           (df_new['start_lat'] >= 40.551042) & \
                           (df_new['start_long'] <= -73.833365) & \
                           (df_new['start_long'] >= -74.056630)]

In [46]:
df_new.head(5)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start_lat,start_long,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,hour,day_dur,night_dur,mnhtn,brkln
0,364,2017-07-01 00:00:00,2017-07-01 00:06:05,539,Metropolitan Ave & Bedford Ave,40.715348,-73.960241,3107,Bedford Ave & Nassau Ave,40.723117,-73.952123,14744,Subscriber,1986.0,1,0,,364.0,32.0,32.0
1,2142,2017-07-01 00:00:03,2017-07-01 00:35:46,293,Lafayette St & E 8 St,40.730207,-73.991026,3425,2 Ave & E 104 St,40.78921,-73.943708,19587,Subscriber,1981.0,1,0,,2142.0,37.0,37.0
2,328,2017-07-01 00:00:08,2017-07-01 00:05:37,3242,Schermerhorn St & Court St,40.691029,-73.991834,3397,Court St & Nelson St,40.676395,-73.998699,27937,Subscriber,1984.0,2,0,,328.0,34.0,34.0
3,2530,2017-07-01 00:00:11,2017-07-01 00:42:22,2002,Wythe Ave & Metropolitan Ave,40.716887,-73.963198,398,Atlantic Ave & Furman St,40.691652,-73.999979,26066,Subscriber,1985.0,1,0,,2530.0,33.0,33.0
4,2534,2017-07-01 00:00:15,2017-07-01 00:42:29,2002,Wythe Ave & Metropolitan Ave,40.716887,-73.963198,398,Atlantic Ave & Furman St,40.691652,-73.999979,29408,Subscriber,1982.0,2,0,,2534.0,36.0,36.0


# KS tests:

## Null hypotheses (for both the KS test of all data and the test of 200 random points):

### CitiBike ride duration by time of day:

When we compare the sample of CitiBike rides that begin at night (between 8 pm and 3 am) with the sample of rides that begin during the day (between 5 am and 4 pm), we will see the same distribution of ride durations. In other words, we will not see a large difference in ride duration between CitiBike rides that begin during the day and those that begin at night.

**$H_0: P_0 = P_1$**

**$H_a: P_0 \neq P_1$**

$P_0$ = The distribution of durations of CitiBike rides that begin at night (between 8 pm and 3 am).

$P_1$ = The distribution of durations of CitiBike rides that begin during the day (between 5 am and 4 pm).

### CitiBike rider age by ride originating location:
 
When we compare the sample of CitiBike rides that begin in Manhattan with the sample of rides that begin in Brooklyn, we will see the same distribution of rider ages. In other words, we will not see a large difference in the age of riders between CitiBike rides that begin in Manhattan and those that begin in Brooklyn.

**$H_0: P_0 = P_1$**

**$H_a: P_0 \neq P_1$**

$P_0$ = The distribution of CitiBike rider ages for rides that begin in Manhattan.

$P_1$ = The distribution of CitiBike rider ages for rides that begin in Brooklyn.

## KS statistics:

In [47]:
# For an alpha level of .05
ca = 1.36
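The value 1.36 is the rounded large-sample coefficient $c(\alpha) = \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$ for $\alpha = .05$, and the null is rejected when the KS statistic exceeds $c(\alpha)\sqrt{(n_1 + n_2)/(n_1 n_2)}$. A short sketch of that arithmetic follows; the function names `c_alpha` and `ks_critical_value` are hypothetical and not used elsewhere in this notebook.

```python
import math

def c_alpha(alpha):
    """Large-sample coefficient for the two-sample KS critical value."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0))

def ks_critical_value(alpha, n1, n2):
    """Threshold that the KS statistic must exceed to reject at level alpha."""
    return c_alpha(alpha) * math.sqrt((n1 + n2) / float(n1 * n2))

print(round(c_alpha(0.05), 3))                      # 1.358
print(round(ks_critical_value(0.05, 200, 200), 3))  # 0.136
```

For two samples of 200 points the threshold is about 0.136, which is the bar the reduced-sample KS statistics are compared against.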

In [48]:
# Test for rejecting the null for full samples
def ks_sig(ca, samp1, samp2):
    """Compare the two-sample KS statistic with its large-sample critical value."""
    stat = scipy.stats.ks_2samp(samp1, samp2)
    ks = stat.statistic
    pval = stat.pvalue
    # count() ignores NaN, so n1 and n2 are the numbers of valid observations
    n1 = samp1.count()
    n2 = samp2.count()
    # Critical value: c(alpha) * sqrt((n1 + n2) / (n1 * n2))
    crit_val = ca * np.sqrt((n1 + n2) / (n1 * n2))
    if ks > crit_val:
        return ("Reject the null: Our KS stat of " + str(round(ks, 3)) + " meets "
                "our threshold for an alpha level of .05."
                " (p value = " + str(round(pval, 3)) + ")")
    else:
        return ("Do not reject the null: Our KS stat of " + str(round(ks, 3)) +
                " does not surpass our threshold for an alpha "
                "level of .05. (p value = " + str(round(pval, 3)) + ")")

In [49]:
# Result of test
print("Test for time of day - " + str(ks_sig(ca, df_new.day_dur, df_new.night_dur)))
print("")
print("Test for location - " + str(ks_sig(ca, df_new.mnhtn, df_new.brkln)))

Test for time of day - Reject the null: Our KS stat of 0.463 meets our threshold for an alpha level of .05. (p value = 0.0)

Test for location - Reject the null: Our KS stat of 0.391 meets our threshold for an alpha level of .05. (p value = 0.0)


In [71]:
# Result of new test: rerun on the day and night subsets with NaN rows dropped
print("Test for time of day - " + str(ks_sig(ca, df_day2.day_dur, df_night.night_dur)))

Test for time of day - Reject the null: Our KS stat of 0.012 meets our threshold for an alpha level of .05. (p value = 0.0)
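The full-sample run above passes columns that still contain `NaN` for rides outside each category, whereas this rerun uses subsets with those rows dropped. A minimal sketch of dropping missing values immediately before the test, on made-up data (the toy values and the `_clean` names are assumptions):

```python
import numpy as np
import pandas as pd
import scipy.stats

# Toy columns shaped like day_dur / night_dur: NaN marks rides outside the category
day = pd.Series([300.0, np.nan, 450.0, 600.0, np.nan, 720.0])
night = pd.Series([np.nan, 280.0, np.nan, np.nan, 510.0, np.nan])

# Keep only the valid observations in each sample
day_clean = day.dropna()
night_clean = night.dropna()

result = scipy.stats.ks_2samp(day_clean, night_clean)
print(len(day_clean), len(night_clean), result.statistic)
```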


In [50]:
# 200 data points for day, night, Manhattan, and Brooklyn rides
np.random.seed(101)
sample_size = 200

# Get an array of row positions to keep (sampled with replacement)
def new_indexes(df):
    indexes = np.random.choice(np.arange(len(df)), sample_size, replace=True)
    return indexes

# day
df_day_reduced_ks = df_new.dropna(subset=['day_dur']).reset_index()
day_indexes_ks = new_indexes(df_day_reduced_ks)
df_day_reduced_ks = df_day_reduced_ks.iloc[day_indexes_ks]

# night
df_night_reduced_ks = df_new.dropna(subset=['night_dur']).reset_index()
night_indexes_ks = new_indexes(df_night_reduced_ks)
df_night_reduced_ks = df_night_reduced_ks.iloc[night_indexes_ks]

# Manhattan
df_mnhtn_reduced_ks = df_new.dropna(subset=['mnhtn']).reset_index()
mnhtn_indexes_ks = new_indexes(df_mnhtn_reduced_ks)
df_mnhtn_reduced_ks = df_mnhtn_reduced_ks.iloc[mnhtn_indexes_ks]

# Brooklyn
df_brkln_reduced_ks = df_new.dropna(subset=['brkln']).reset_index()
brkln_indexes_ks = new_indexes(df_brkln_reduced_ks)
df_brkln_reduced_ks = df_brkln_reduced_ks.iloc[brkln_indexes_ks]
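An alternative way to draw a reproducible random subset is pandas' own `DataFrame.sample`. The sketch below is a swapped-in technique, not the method used in the cell above, and it runs on a toy frame whose single column only borrows the name `day_dur`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one category's rides; 'day_dur' mirrors the notebook's column name
toy = pd.DataFrame({"day_dur": np.arange(1000, dtype=float)})

# Draw 200 distinct rows reproducibly; replace=True would instead sample
# with replacement, as the cell above does
subset = toy.sample(n=200, replace=False, random_state=101)
print(len(subset), subset.index.is_unique)
```

Sampling without replacement guarantees 200 distinct rides, which may be preferable when the goal is a smaller sample rather than a bootstrap.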

In [51]:
# Tests for samples with 200 data points
print("Test for time of day - " + str(ks_sig(ca, df_day_reduced_ks.day_dur, \
                                             df_night_reduced_ks.night_dur)))
print("")
print("Test for location - " + str(ks_sig(ca, df_mnhtn_reduced_ks.mnhtn, \
                                          df_brkln_reduced_ks.brkln)))

Test for time of day - Do not reject the null: Our KS stat of 0.055 does not surpass our threshold for an alpha level of .05. (p value = 0.915)

Test for location - Do not reject the null: Our KS stat of 0.09 does not surpass our threshold for an alpha level of .05. (p value = 0.377)


## KS tests conclusion:

For the analyses of the complete samples of CitiBike ride duration by time of day and CitiBike rider age by ride origin, our KS statistics are greater than our critical values, so we reject the null hypotheses. We conclude that the samples in each pair (day vs night rides, Manhattan vs Brooklyn rides) are not drawn from the same continuous distribution. The p values, which are well below the $\alpha$ level of .05 established at the beginning of this analysis, support this conclusion.

However, once we reduce our data sets to only 200 data points each, the resulting KS statistics are less than the critical values and we therefore fail to reject the null. In other words, when we limit the amount of information we have about each data set, we can no longer detect a difference between their distributions at the .05 level.

# Pearson's and Spearman's tests

For the Pearson's and Spearman's tests that follow, I drew a reproducible random sample of 500 rides from each category (day, night, Manhattan, and Brooklyn rides). I then sorted the relevant column of data for each sample (ride duration for the time test and rider age for the location test) in ascending order and ran Pearson's and Spearman's tests on each pair of sorted samples.

## More data wrangling:

In [52]:
# Reduced dfs for day, night, Manhattan, and Brooklyn rides
np.random.seed(101)
sample_size = 500

# Get an array of row positions to keep (sampled with replacement)
def new_indexes(df):
    indexes = np.random.choice(np.arange(len(df)), sample_size, replace=True)
    return indexes

df_day_reduced = df_new.dropna(subset=['day_dur']).reset_index()
day_indexes = new_indexes(df_day_reduced)
df_day_reduced = df_day_reduced.iloc[day_indexes]
df_day_reduced = df_day_reduced.sort_values('day_dur')

df_night_reduced = df_new.dropna(subset=['night_dur']).reset_index()
night_indexes = new_indexes(df_night_reduced)
df_night_reduced = df_night_reduced.iloc[night_indexes]
df_night_reduced = df_night_reduced.sort_values('night_dur')

df_mnhtn_reduced = df_new.dropna(subset=['mnhtn']).reset_index()
mnhtn_indexes = new_indexes(df_mnhtn_reduced)
df_mnhtn_reduced = df_mnhtn_reduced.iloc[mnhtn_indexes]
df_mnhtn_reduced = df_mnhtn_reduced.sort_values('mnhtn')

df_brkln_reduced = df_new.dropna(subset=['brkln']).reset_index()
brkln_indexes = new_indexes(df_brkln_reduced)
df_brkln_reduced = df_brkln_reduced.iloc[brkln_indexes]
df_brkln_reduced = df_brkln_reduced.sort_values('brkln')

## Null hypotheses for Pearson's tests:

### CitiBike ride duration by time of day:

When we compare the samples of CitiBike rides that begin at night (between 8 pm and 3 am) with the sample of rides that begin during the day (between 5 am and 4 pm), we will find no **linear** correlation in the ride durations of the two samples.

**$H_0: \rho = 0$**

**$H_a: \rho \neq 0$**

$\rho$ is the correlation coefficient.

### CitiBike rider age by ride originating location:
 
When we compare the samples of CitiBike rides that begin in Manhattan with those that begin in Brooklyn, we will find no **linear** correlation in rider age across the two samples.

**$H_0: \rho = 0$**

**$H_a: \rho \neq 0$**

$\rho$ is the correlation coefficient.

In [53]:
# Code to calculate Pearson's tests:
def peartest(dfsamp1, dfsamp2):
    stat = scipy.stats.pearsonr(dfsamp1, dfsamp2)
    pear = stat[0]
    pval = stat[1]
    return pear, round(pval,3)    

In [54]:
print("Pearson's test for day vs night rides - correlation coefficient and p value: "
      + str(peartest(df_day_reduced.day_dur, df_night_reduced.night_dur)))
print("Pearson's test for Manhattan vs Brooklyn rides - correlation coefficient "
      "and p value: " + str(peartest(df_mnhtn_reduced.mnhtn, df_brkln_reduced.brkln)))

Pearson's test for day vs night rides - correlation coefficient and p value: (0.98236800718567308, 0.0)
Pearson's test for Manhattan vs Brooklyn rides - correlation coefficient and p value: (0.99140045898652429, 0.0)


## Pearson's tests conclusion:

The results of our Pearson's tests tell us that the sorted ride duration and rider age samples we compared by time of day and by origin are highly correlated, and the very low p values lead us to reject the null.

This makes sense, intuitively, because we have taken two columns of numerical data, sorted them, and then compared them position by position. As the numbers in one column increase, the numbers in the second column also increase. There is therefore a predictable positive correlation between the two, and in this case the relationship is almost perfectly linear. Note that this correlation reflects the sorting step rather than any pairing between individual rides.
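The effect of sorting can be checked on data that are independent by construction. This is a minimal sketch on synthetic values, assuming exponential distributions as a rough stand-in for ride durations; the scales and the seed are arbitrary.

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(101)

# Two independent samples: neither value carries information about the other
a = rng.exponential(scale=600.0, size=500)
b = rng.exponential(scale=900.0, size=500)

r_unsorted, _ = scipy.stats.pearsonr(a, b)
r_sorted, _ = scipy.stats.pearsonr(np.sort(a), np.sort(b))

print("unsorted r = %.3f, sorted r = %.3f" % (r_unsorted, r_sorted))
```

The unsorted coefficient should sit near zero, while the sorted one should be close to 1 even though the two samples are unrelated.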

## Null hypotheses for Spearman's tests:

### CitiBike ride duration by time of day:

When we compare the samples of CitiBike rides that begin at night (between 8 pm and 3 am) with the sample of rides that begin during the day (between 5 am and 4 pm), we will find no **monotonic** correlation in the ride durations of the two samples.

**$H_0: \rho = 0$**

**$H_a: \rho \neq 0$**

$\rho$ is the correlation coefficient.

### CitiBike rider age by ride originating location:
 
When we compare the samples of CitiBike rides that begin in Manhattan with those that begin in Brooklyn, we will find no **monotonic** correlation in rider age across the two samples.

**$H_0: \rho = 0$**

**$H_a: \rho \neq 0$**

$\rho$ is the correlation coefficient.

In [55]:
# Code to calculate Spearman's tests:
def speartest(dfsamp1, dfsamp2):
    stat = scipy.stats.spearmanr(dfsamp1, dfsamp2)
    spear = stat[0]
    pval = stat[1]
    return spear, round(pval,3)    

In [56]:
print("Spearman's test for day vs night rides - correlation coefficient and p value: "
      + str(speartest(df_day_reduced.day_dur, df_night_reduced.night_dur)))
print("Spearman's test for Manhattan vs Brooklyn rides - correlation coefficient "
      "and p value: " + str(speartest(df_mnhtn_reduced.mnhtn, df_brkln_reduced.brkln)))

Spearman's test for day vs night rides - correlation coefficient and p value: (0.99999529595748027, 0.0)
Spearman's test for Manhattan vs Brooklyn rides - correlation coefficient and p value: (0.99893952004956355, 0.0)


## Spearman's tests conclusion:

The results of our Spearman's tests tell us that the sorted ride duration and rider age samples we compared by time of day and by origin are highly correlated, and the very low p values lead us to reject the null.

Again, this makes intuitive sense because we have taken two columns of numerical data, sorted them, and then compared them position by position. As the numbers in one column increase, the numbers in the second column also increase, so there is a predictable positive correlation between the two. In this case, the relationship is even stronger than in the Pearson's test, which makes sense because Spearman's test works on ranks and can therefore detect any monotonic relationship, not only a strictly linear one.
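The difference between the two coefficients shows up clearly on a relationship that is monotonic but not linear. The sketch below uses a made-up exponential curve, not CitiBike data:

```python
import numpy as np
import scipy.stats

x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)  # strictly increasing in x, but far from a straight line

pearson_r, _ = scipy.stats.pearsonr(x, y)
spearman_r, _ = scipy.stats.spearmanr(x, y)

print("Pearson r = %.3f, Spearman r = %.3f" % (pearson_r, spearman_r))
```

Because the ranks of `x` and `y` agree exactly, Spearman's coefficient is 1, while Pearson's coefficient is noticeably lower for the same data.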