# Overview

This is a Python script that addresses the following questions:

- What are the most popular departing and destination stations?
- What are the common origin and arrival area clusters (not just based on ZIP Code)?
- Which 10 stations have the fewest bikes at the end of the day (11 p.m.) over the month of September?
- For each station from question 3, which 3 nearby stations have the most bikes?
- Do bike stations encourage multi-modal travel?
- How many trips end near a subway station?

# Hypothesis Testing
Bike stations near a subway station have a higher average number of arrival trips than stations that are remote from any subway station.
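
One way to test a hypothesis of this shape is a one-sided permutation test on the difference in mean arrivals between the two groups of stations. The sketch below is only an illustration under assumptions: the function name `permutation_test_greater` is hypothetical, the arrival counts are made up (they are not taken from the Citi Bike data), and the notebook does not say which test it actually uses.

```python
import numpy as np

def permutation_test_greater(near, far, n_perm=10_000, seed=0):
    """One-sided permutation test for mean(near) > mean(far).

    Returns the observed difference in means and an approximate p-value.
    """
    rng = np.random.default_rng(seed)
    near = np.asarray(near, dtype=float)
    far = np.asarray(far, dtype=float)
    observed = near.mean() - far.mean()
    pooled = np.concatenate([near, far])
    count = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the difference
        shuffled = rng.permutation(pooled)
        diff = shuffled[:near.size].mean() - shuffled[near.size:].mean()
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value

# Made-up arrival counts per station, for illustration only
near_subway = [120, 95, 130, 110]
far_from_subway = [60, 75, 50, 80, 65]

diff, p = permutation_test_greater(near_subway, far_from_subway)
print(f"difference in means: {diff:.2f}, one-sided p-value: {p:.4f}")
```

A small p-value would be evidence against the null hypothesis that proximity to a subway station makes no difference to arrivals. With real data, the grouping of stations into "near" and "remote" depends on a distance threshold that has to be chosen separately.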

# Outputs
- Statistical and Graphical Analysis
- Scenario Simulation

# Key Methods 
- Pairwise Distance (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html)
- Spatial Visualization
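
As a rough illustration of the pairwise-distance idea, the sketch below computes great-circle (haversine) distances between every bike station and every transit entrance, then flags stations whose nearest entrance is within a threshold. It uses plain NumPy as a stand-in rather than the scikit-learn function linked above. The helper name `haversine_matrix`, the transit coordinates, and the 250 m threshold are all assumptions made for this example; only the two bike station coordinates come from the sample rows in this notebook.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_matrix(a_deg, b_deg):
    """Great-circle distance in metres between every point in a and every point in b.

    a_deg has shape (n, 2) and b_deg has shape (m, 2); columns are (lat, lon) in degrees.
    The result has shape (n, m).
    """
    a = np.radians(np.asarray(a_deg, dtype=float))[:, None, :]  # shape (n, 1, 2)
    b = np.radians(np.asarray(b_deg, dtype=float))[None, :, :]  # shape (1, m, 2)
    dlat = b[..., 0] - a[..., 0]
    dlon = b[..., 1] - a[..., 1]
    h = np.sin(dlat / 2) ** 2 + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(h))

# Bike station coordinates as they appear in the trip data
bike = np.array([
    [40.719586, -74.043117],  # Grove St PATH
    [40.727596, -74.044247],  # Hamilton Park
])

# Hypothetical transit entrance coordinates, for illustration only
transit = np.array([
    [40.7196, -74.0431],
    [40.7330, -74.0630],
])

dist = haversine_matrix(bike, transit)  # rows: bike stations, columns: entrances
nearest = dist.min(axis=1)              # distance to the closest entrance
is_near = nearest <= 250                # 250 m cutoff is an assumption
print(np.round(dist), is_near)
```

Haversine is used here instead of Euclidean distance on raw degrees because a degree of longitude is shorter than a degree of latitude at this latitude, which would distort a "near" threshold.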

---

## Data Ingestion and Feature Engineering

In [26]:
import pandas as pd
import geopandas as gpd
pd.set_option('display.max_columns', 500)
import numpy as np
import matplotlib.pyplot as plt  # plt conventionally refers to pyplot, not the top-level package
from sklearn.cluster import KMeans
from sklearn import preprocessing
import warnings
import matplotlib.style as style
warnings.filterwarnings('ignore')
style.use('fivethirtyeight')

%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
url = "https://s3.amazonaws.com/tripdata/JC-201709-citibike-tripdata.csv.zip"
df = pd.read_csv(url, compression = "zip")

In [23]:
url = "https://data.cityofnewyork.us/api/views/he7q-3hwy/rows.csv?accessType=DOWNLOAD"
subway = pd.read_csv(url)

In [3]:
df.shape

(33119, 15)

In [24]:
subway.shape

(1928, 5)

In [4]:
df.sample(10)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
7916,1081,2017-09-09 07:58:10,2017-09-09 08:16:12,3203,Hamilton Park,40.727596,-74.044247,3183,Exchange Place,40.716247,-74.033459,29270,Subscriber,1977.0,2
8315,593,2017-09-09 14:13:40,2017-09-09 14:23:34,3275,Columbus Drive,40.718355,-74.038914,3199,Newport Pkwy,40.728745,-74.032108,29442,Subscriber,1991.0,1
31768,370,2017-09-29 16:52:49,2017-09-29 16:58:59,3202,Newport PATH,40.727224,-74.033759,3203,Hamilton Park,40.727596,-74.044247,29598,Subscriber,1980.0,1
15958,299,2017-09-15 18:12:57,2017-09-15 18:17:57,3183,Exchange Place,40.716247,-74.033459,3187,Warren St,40.721124,-74.038051,26223,Subscriber,1956.0,1
29599,239,2017-09-27 19:21:44,2017-09-27 19:25:43,3213,Van Vorst Park,40.718489,-74.047727,3185,City Hall,40.717733,-74.043845,30356,Subscriber,1983.0,1
14024,317,2017-09-14 09:11:14,2017-09-14 09:16:31,3209,Brunswick St,40.724176,-74.050656,3186,Grove St PATH,40.719586,-74.043117,26220,Subscriber,1986.0,2
20876,183,2017-09-20 08:05:48,2017-09-20 08:08:51,3279,Dixon Mills,40.72163,-74.049968,3186,Grove St PATH,40.719586,-74.043117,29243,Subscriber,1986.0,1
13931,104,2017-09-14 08:30:19,2017-09-14 08:32:04,3211,Newark Ave,40.721525,-74.046305,3186,Grove St PATH,40.719586,-74.043117,29265,Subscriber,1982.0,1
24919,419,2017-09-23 11:17:52,2017-09-23 11:24:51,3268,Lafayette Park,40.713464,-74.062859,3185,City Hall,40.717733,-74.043845,29665,Subscriber,1986.0,1
14501,434,2017-09-14 17:49:54,2017-09-14 17:57:09,3202,Newport PATH,40.727224,-74.033759,3203,Hamilton Park,40.727596,-74.044247,26229,Subscriber,1979.0,1


In [5]:
# infer_datetime_format is deprecated in pandas 2.x; format inference is now the default
df['starttime'] = pd.to_datetime(df['starttime'])
df['stoptime'] = pd.to_datetime(df['stoptime'])

In [6]:
df['day'] = df['starttime'].dt.day
df['start_hour'] = df['starttime'].dt.hour
df['DOW'] = df['starttime'].dt.dayofweek

In [7]:
df.sample(10)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,day,start_hour,DOW
20148,236,2017-09-19 13:44:42,2017-09-19 13:48:38,3199,Newport Pkwy,40.728745,-74.032108,3480,WS Don't Use,0.0,0.0,26314,Subscriber,1986.0,2,19,13,1
3527,661,2017-09-05 09:05:30,2017-09-05 09:16:31,3277,Communipaw & Berry Lane,40.714358,-74.066611,3185,City Hall,40.717733,-74.043845,26311,Subscriber,1986.0,2,5,9,1
17877,270,2017-09-17 14:55:17,2017-09-17 14:59:47,3203,Hamilton Park,40.727596,-74.044247,3186,Grove St PATH,40.719586,-74.043117,29595,Subscriber,,0,17,14,6
20279,494,2017-09-19 17:45:40,2017-09-19 17:53:54,3183,Exchange Place,40.716247,-74.033459,3199,Newport Pkwy,40.728745,-74.032108,26300,Subscriber,1981.0,1,19,17,1
23835,890,2017-09-22 11:09:38,2017-09-22 11:24:28,3206,Hilltop,40.731169,-74.057574,3267,Morris Canal,40.712419,-74.038526,29490,Subscriber,1975.0,2,22,11,4
113,121,2017-09-01 07:46:58,2017-09-01 07:48:59,3211,Newark Ave,40.721525,-74.046305,3186,Grove St PATH,40.719586,-74.043117,26224,Subscriber,1977.0,1,1,7,4
11462,531,2017-09-12 08:56:52,2017-09-12 09:05:43,3203,Hamilton Park,40.727596,-74.044247,3205,JC Medical Center,40.71654,-74.049638,26152,Subscriber,1978.0,1,12,8,1
3738,386,2017-09-05 13:56:43,2017-09-05 14:03:10,3186,Grove St PATH,40.719586,-74.043117,3202,Newport PATH,40.727224,-74.033759,29276,Subscriber,1985.0,1,5,13,1
22324,1031,2017-09-21 08:42:01,2017-09-21 08:59:12,3212,Christ Hospital,40.734786,-74.050444,3183,Exchange Place,40.716247,-74.033459,29586,Subscriber,1982.0,2,21,8,3
28184,587,2017-09-26 12:02:27,2017-09-26 12:12:14,3205,JC Medical Center,40.71654,-74.049638,3202,Newport PATH,40.727224,-74.033759,26255,Subscriber,1988.0,1,26,12,1


---

## Quantitative Analysis

- Count the number of trips, grouped by start station and by end station.
- Build a table of bike station, hour, total departures, total arrivals, number of bikes at the dock, and latitude/longitude, assuming each station holds 20 bikes at the beginning of the day (12 a.m.).

In [12]:
uni_dep_stations = df[['start station id', 'start station name', 'start station latitude', 'start station longitude']].drop_duplicates()
uni_arv_stations = df[['end station id', 'end station name', 'end station latitude', 'end station longitude']].drop_duplicates()

In [14]:
uni_dep_stations.shape

(49, 4)

In [15]:
uni_arv_stations.shape

(69, 4)

In [20]:
# Departures per start station, busiest first
start_count = (
    df.groupby("start station id").size()
      .reset_index(name="departure_cnt")
      .sort_values(by="departure_cnt", ascending=False)
)
# Attach station name and coordinates
start_count = start_count.merge(uni_dep_stations, how="left", on="start station id")
start_count

Unnamed: 0,start station id,departure_cnt,start station name,start station latitude,start station longitude
0,3186,4079,Grove St PATH,40.719586,-74.043117
1,3203,2415,Hamilton Park,40.727596,-74.044247
2,3183,2146,Exchange Place,40.716247,-74.033459
3,3195,1672,Sip Ave,40.730743,-74.063784
4,3202,1487,Newport PATH,40.727224,-74.033759
5,3267,1402,Morris Canal,40.712419,-74.038526
6,3213,1079,Van Vorst Park,40.718489,-74.047727
7,3211,1061,Newark Ave,40.721525,-74.046305
8,3187,1008,Warren St,40.721124,-74.038051
9,3199,951,Newport Pkwy,40.728745,-74.032108


In [22]:
# Arrivals per end station, busiest first
end_count = (
    df.groupby("end station id").size()
      .reset_index(name="arrival_cnt")
      .sort_values(by="arrival_cnt", ascending=False)
)
# Attach station name and coordinates
end_count = end_count.merge(uni_arv_stations, how="left", on="end station id")
end_count

Unnamed: 0,end station id,arrival_cnt,end station name,end station latitude,end station longitude
0,3186,5005,Grove St PATH,40.719586,-74.043117
1,3183,2585,Exchange Place,40.716247,-74.033459
2,3203,2044,Hamilton Park,40.727596,-74.044247
3,3202,1491,Newport PATH,40.727224,-74.033759
4,3195,1421,Sip Ave,40.730743,-74.063784
5,3267,1154,Morris Canal,40.712419,-74.038526
6,3211,958,Newark Ave,40.721525,-74.046305
7,3185,950,City Hall,40.717733,-74.043845
8,3213,947,Van Vorst Park,40.718489,-74.047727
9,3214,924,Essex Light Rail,40.712774,-74.036486


In [None]:
# Calculate end-of-hour bike balance in each station
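
The end-of-hour balance described above could be sketched roughly as follows: count departures and arrivals per station, day, and hour, take the net flow, and add its running total to the assumed starting stock of 20 bikes. This is a minimal sketch, not the notebook's own implementation. The function name `hourly_balance`, the constant `START_BIKES`, and the three-row `trips` frame are hypothetical; the column names match the trip data loaded earlier.

```python
import pandas as pd

START_BIKES = 20  # assumed stock at every station at 12 a.m.

def hourly_balance(trips, start_bikes=START_BIKES):
    """Departures, arrivals, and running bike count per station, day, and hour."""
    keys = ["station id", "day", "hour"]

    # Departures are bucketed by the hour the trip started
    dep = (
        trips.assign(day=trips["starttime"].dt.day, hour=trips["starttime"].dt.hour)
             .groupby(["start station id", "day", "hour"]).size()
             .rename("departure_cnt")
    )
    # Arrivals are bucketed by the hour the trip ended
    arr = (
        trips.assign(day=trips["stoptime"].dt.day, hour=trips["stoptime"].dt.hour)
             .groupby(["end station id", "day", "hour"]).size()
             .rename("arrival_cnt")
    )
    dep.index = dep.index.set_names(keys)
    arr.index = arr.index.set_names(keys)

    # Outer-align the two counts; a missing count means zero trips in that hour
    out = pd.concat([dep, arr], axis=1).fillna(0).astype(int).sort_index()
    out["net"] = out["arrival_cnt"] - out["departure_cnt"]
    # The running total restarts for each station and each day
    out["bikes_at_dock"] = start_bikes + out.groupby(level=["station id", "day"])["net"].cumsum()
    return out.reset_index()

# Tiny made-up trip table, for illustration only
trips = pd.DataFrame({
    "start station id": [3186, 3186, 3203],
    "end station id": [3203, 3183, 3186],
    "starttime": pd.to_datetime(["2017-09-01 08:05", "2017-09-01 08:40", "2017-09-01 09:10"]),
    "stoptime": pd.to_datetime(["2017-09-01 08:20", "2017-09-01 08:55", "2017-09-01 09:25"]),
})

balance = hourly_balance(trips)
# The last active hour of each station-day carries the end-of-day balance
end_of_day = balance.groupby(["station id", "day"]).tail(1)
print(balance)
```

Hours with no activity at a station do not appear in the result, so the balance at 11 p.m. is the value in the last row for that station and day. Because the sketch ignores rebalancing and dock capacity, `bikes_at_dock` can go negative or exceed what a real dock holds.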
