Comparison of Citibike Weekend Ridership
====

The goal of this analysis is to evaluate whether customers or subscribers use Citibikes
more on the weekend. The data comes from October 2014, and tripduration (i.e. the number of seconds
each trip lasted) is used as the measure of ridership. The two user types compared are
customers and subscribers.

My previous evaluations of Citibike data (done in R for City Challenge Week) showed that subscribers are much more common, and ride for longer, during the week. Given that customers are much more likely to use Citibikes for leisure rides (whether they are tourists or New Yorkers who ride only occasionally), it's possible that customer usage on weekends matches subscriber usage. I'd like to test that.

Hypotheses
======

Null hypothesis: On weekends, the mean trip duration of subscribers is the same as the mean trip duration of customers.

Alternative hypothesis: On weekends, the mean trip duration of subscribers is different from the mean trip duration of customers.

Significance level
======

This analysis will use an alpha level of 0.05.

In [1]:
#package imports
import pylab as pl
import pandas as pd
import scipy.stats as sps
import numpy as np
import os
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
#read in the data and convert the starttime to a datetime object
data = pd.read_csv('201410-citibike-tripdata.csv')
data['date'] = pd.to_datetime(data['starttime'])

In [3]:
print(data.head())

   tripduration           starttime            stoptime  start station id  \
0          1027  10/1/2014 00:00:27  10/1/2014 00:17:34               479   
1           534  10/1/2014 00:00:36  10/1/2014 00:09:30               417   
2           416  10/1/2014 00:00:42  10/1/2014 00:07:38               327   
3           428  10/1/2014 00:00:50  10/1/2014 00:07:58               515   
4           281  10/1/2014 00:01:08  10/1/2014 00:05:49               497   

         start station name  start station latitude  start station longitude  \
0           9 Ave & W 45 St               40.760193               -73.991255   
1    Barclay St & Church St               40.712912               -74.010202   
2  Vesey Pl & River Terrace               40.715338               -74.016584   
3          W 43 St & 10 Ave               40.760094               -73.994618   
4        E 17 St & Broadway               40.737050               -73.990093   

   end station id           end station name  end statio

In [3]:
#create a new column holding the day of the week as an integer (Monday = 0, Sunday = 6),
#then keep only Saturday (5) and Sunday (6)
data['weekdays'] = data['date'].dt.weekday
weekends_only = data[data['weekdays'] > 4]

#subset the data to divide it into customers and subscribers
weekend_subscriber = weekends_only.loc[weekends_only['usertype'] == 'Subscriber']
weekend_customer = weekends_only.loc[weekends_only['usertype'] == 'Customer']
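
The `> 4` filter relies on the `weekday()` convention in which Monday is 0 and Sunday is 6, so 5 and 6 are Saturday and Sunday. As a quick sanity check, here is a minimal sketch that uses only the standard library's `datetime`, which follows the same numbering as pandas timestamps. The sample timestamps are illustrative values written in the same format as `starttime`, and the names `samples`, `weekday_numbers`, and `is_weekend` are hypothetical.

```python
from datetime import datetime

# Illustrative timestamps in the same month/day/year format as the starttime column
samples = [
    "10/1/2014 00:00:27",   # a Wednesday
    "10/4/2014 12:00:00",   # a Saturday
    "10/5/2014 12:00:00",   # a Sunday
    "10/6/2014 12:00:00",   # a Monday
]
fmt = "%m/%d/%Y %H:%M:%S"

# weekday() returns 0 for Monday through 6 for Sunday
weekday_numbers = {s: datetime.strptime(s, fmt).weekday() for s in samples}

# Same rule as the notebook: anything above 4 counts as a weekend day
is_weekend = {s: n > 4 for s, n in weekday_numbers.items()}

for s in samples:
    print("%s -> weekday %d, weekend: %s" % (s, weekday_numbers[s], is_weekend[s]))
```

Only the Saturday and Sunday timestamps pass the filter, which is the behaviour the weekend subset depends on.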

In [4]:
#print the mean trip duration (in seconds) of the weekend subscribers and weekend customers
print("Subscriber mean: %s" % weekend_subscriber['tripduration'].mean())
print("Customer mean: %s" % weekend_customer['tripduration'].mean())

Subscriber mean: 807.197731839
Customer mean: 1750.75005592


In [5]:
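#compute test statistic: t test for two independent means
#equal_var = False selects Welch's t-test, which does not assume equal variances
model_t, model_p = sps.ttest_ind(weekend_subscriber['tripduration'], weekend_customer['tripduration'], 
                                 equal_var = False)

print("T = %s, p = %s" % (model_t, model_p))
#note: n1 + n2 - 2 is the pooled (equal-variance) degrees of freedom; Welch's test
#uses the Welch-Satterthwaite approximation, which is never larger than this value
print("Degrees of freedom: %s" % (len(weekend_subscriber) + len(weekend_customer) - 2))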

T = -20.1240564051, p = 1.96874457874e-89
Degrees of freedom: 162705
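
Because the test was run with `equal_var = False`, SciPy performs Welch's unequal-variance t-test, whose degrees of freedom come from the Welch-Satterthwaite approximation rather than from n1 + n2 - 2. The sketch below computes both quantities by hand with the standard library so the difference is visible. The function name `welch_t` and the two short lists of durations are made up for illustration and are not the Citibike data.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(a), len(b)
    # Sample variances (n - 1 in the denominator), each divided by its sample size
    se_a = statistics.variance(a) / n_a
    se_b = statistics.variance(b) / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Made-up trip durations in seconds, for illustration only
short_trips = [300, 420, 510, 640, 700, 815]
long_trips = [900, 1500, 1750, 2400]

t_stat, welch_df = welch_t(short_trips, long_trips)
pooled_df = len(short_trips) + len(long_trips) - 2

print("t = %.3f" % t_stat)
print("Welch df = %.2f, pooled df = %d" % (welch_df, pooled_df))
```

The Welch degrees of freedom always fall between the smaller of n1 - 1 and n2 - 1 and the pooled value n1 + n2 - 2, so the pooled figure is an upper bound, not the value the test actually uses.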


Interpretations
======

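The test was a two-tailed, two-sample t-test run with equal_var = False, i.e. Welch's t-test. Its degrees of freedom come from the Welch-Satterthwaite approximation and are at most the pooled value of 162705 printed above. Provided both groups are large, as the combined count of 162,707 weekend trips suggests, the critical t-value at an alpha level of 0.05 is approximately +/-1.96. The observed t-value of -20.124 lies far outside that range, and the p-value of about 1.97e-89 is far below 0.05, so we reject the null hypothesis that the two means are the same. The sign of the statistic reflects the direction of the difference: on weekends, customers had a longer mean trip duration (about 1751 seconds) than subscribers (about 807 seconds).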
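
The +/-1.96 threshold is the two-tailed critical value of the standard normal distribution at alpha = 0.05, which the t distribution approaches when the degrees of freedom are very large. A minimal sketch of that calculation, using the standard library's `NormalDist` as a large-sample stand-in for the t distribution (an assumption that would not hold for small samples); the variable names are hypothetical, and the observed t-value is copied from the output above.

```python
from statistics import NormalDist

alpha = 0.05

# Two-tailed test: put alpha / 2 in each tail of the standard normal distribution
critical_z = NormalDist().inv_cdf(1 - alpha / 2)

# t statistic reported by the notebook's ttest_ind call
observed_t = -20.1240564051

# Reject the null hypothesis when the statistic falls outside +/- critical_z
reject_null = abs(observed_t) > critical_z

print("critical value: +/-%.3f" % critical_z)
print("reject null hypothesis: %s" % reject_null)
```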