Analyzing NYC's bike sharing program


---
title: "Citi Bike Data Analysis"
output: html_document
---

The files in this repository form an exploratory data analysis of the publicly
provided Citi Bike data available [here.](
Citi Bike is a bike sharing system in New York City that lets riders rent 
bikes for commuting purposes.

*  **cb_analysis.R** - R data processing
*  **cb_eda.Rmd** - Exploratory analysis. A knitted version is also available.
*  **README.Rmd** - this file  

The processing script downloads and reads in all the historical trip data from 
Citi Bike. The format of the data has changed several times, so files in certain 
date ranges need to be handled differently, specifically in the format of the 
filename and the parsing of date-times. The data are split into two data.tables: 
the dat table, which contains trip information, and a lookup table called 
stations. The stations table takes the start station from each trip and 
deduplicates it, which ensures we capture all stations, old and new, across the 
full date range.
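
The two steps described above, parsing date-times that arrive in more than one layout and building a deduplicated station lookup, might look roughly like the sketch below. It is written in base R so it runs without extra packages. The helper name `parse_trip_time`, the column names, the three date-time layouts, and the toy values are all assumptions for illustration, not taken from cb_analysis.R.

```r
# Hypothetical helper: try each date-time layout in turn and keep the first
# one that parses. The layouts listed here are assumptions for illustration.
parse_trip_time <- function(x,
                            formats = c("%Y-%m-%d %H:%M:%S",
                                        "%m/%d/%Y %H:%M:%S",
                                        "%m/%d/%Y %H:%M")) {
  # Start with all-NA timestamps and fill in whatever each format can parse.
  out <- as.POSIXct(rep(NA_real_, length(x)), origin = "1970-01-01", tz = "UTC")
  for (fmt in formats) {
    todo <- is.na(out)
    if (!any(todo)) break
    out[todo] <- as.POSIXct(x[todo], format = fmt, tz = "UTC")
  }
  out
}

# Toy trip data with made-up values; column names are assumptions.
dat <- data.frame(
  starttime = c("2013-07-01 08:15:00", "7/1/2015 08:15:00", "1/1/2016 8:15"),
  start_station_id = c(72L, 79L, 72L),
  start_station_name = c("Station A", "Station B", "Station A"),
  stringsAsFactors = FALSE
)
dat$starttime <- parse_trip_time(dat$starttime)

# Lookup table: one row per start station, duplicates dropped.
stations <- dat[!duplicated(dat$start_station_id),
                c("start_station_id", "start_station_name")]
```

With data.table, the deduplication step could instead be written as `unique(dat, by = "start_station_id")`.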

The function stationDistance takes in a pair of coordinates and makes a call 
to the Google Directions API. The bicycling distance and estimated duration are returned.
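
As a rough illustration of such a call, the sketch below builds a Directions API request URL in bicycling mode and pulls the distance and duration out of a response that has already been parsed into nested lists. The helpers `directions_url` and `extract_leg` are hypothetical and are not the repository's stationDistance; the coordinates, the key, and the mock response values are made up. Fetching and parsing are left out so the block runs offline; one option, assuming the jsonlite package is installed, would be `jsonlite::fromJSON(url, simplifyVector = FALSE)`.

```r
# Hypothetical helper: build a Directions API request for a bicycling route.
# `from` and `to` are c(lat, lon) pairs; `key` is an API key.
directions_url <- function(from, to, key) {
  sprintf(
    paste0("https://maps.googleapis.com/maps/api/directions/json",
           "?origin=%s&destination=%s&mode=bicycling&key=%s"),
    paste(from, collapse = ","), paste(to, collapse = ","), key
  )
}

# Hypothetical helper: take a parsed response (nested lists) and return the
# first leg's distance in metres and duration in seconds, or NAs when the
# request did not succeed.
extract_leg <- function(resp) {
  if (!identical(resp$status, "OK")) {
    return(c(distance_m = NA_real_, duration_s = NA_real_))
  }
  leg <- resp$routes[[1]]$legs[[1]]
  c(distance_m = leg$distance$value, duration_s = leg$duration$value)
}

# Made-up coordinates and a placeholder key.
url <- directions_url(c(40.7295, -73.9911), c(40.7349, -73.9921), "YOUR_KEY")

# Mock response with the same nesting as the API's JSON; values are invented.
resp <- list(
  status = "OK",
  routes = list(list(legs = list(list(
    distance = list(text = "1.2 km", value = 1200),
    duration = list(text = "7 mins", value = 420)
  ))))
)
leg <- extract_leg(resp)
```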

Aggregation is done on a file-by-file (monthly) basis, since there is not 
enough memory to load all the files and aggregate them at once. Initially, 
aggregation is done by station pair, e.g. Lafayette St & E 8 St to 
University Pl & E 14 St.
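
One way to aggregate month by month without holding every file in memory is to reduce each month to per-pair counts and duration totals, then add those small summaries together. The sketch below assumes hypothetical column names (`start_station_id`, `end_station_id`, `tripduration`) and uses toy data frames in place of the monthly files.

```r
# Sum trip counts and durations for each (start, end) station pair.
sum_by_pair <- function(df) {
  aggregate(cbind(n_trips, tripduration) ~ start_station_id + end_station_id,
            data = df, FUN = sum)
}

# Reduce one month's raw trips to one row per station pair.
aggregate_month <- function(trips) {
  trips$n_trips <- 1L
  sum_by_pair(trips)
}

# Combine the small monthly summaries. Keeping counts and totals (rather
# than monthly means) lets the overall mean be computed exactly.
combine_months <- function(monthly) {
  out <- sum_by_pair(do.call(rbind, monthly))
  out$mean_duration <- out$tripduration / out$n_trips
  out
}

# Toy stand-ins for two monthly files (durations in seconds, values made up).
jul <- data.frame(start_station_id = c(1, 1, 2),
                  end_station_id   = c(2, 2, 1),
                  tripduration     = c(300, 500, 400))
aug <- data.frame(start_station_id = c(1, 3),
                  end_station_id   = c(2, 1),
                  tripduration     = c(400, 900))

# In practice each file would be read inside the loop so that only one
# month of raw trips is in memory at a time.
monthly <- lapply(list(jul, aug), aggregate_month)
totals <- combine_months(monthly)
```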


*  Right-type dat columns for space efficiency
*  Add **speed** to the trip data as a function of distance and duration
*  Aggregate one month's file at a time. There is not enough memory to load the 
entire data set and then aggregate. The average month has ~650K rows. Even 
grouping by hour and start location would yield about 1/3 of the rows. Need to 
reduce by at least an order of magnitude. Instead of every hour of the month, maybe:  
    * ~~break into hour intervals by station (24 * 330 rows/file).~~  
    * ~~Or days by station (30 * 330 rows/file).~~  
    * By start station and end station combinations (< 330^2, since people will 
  not pick up in BK and return in midtown, for example). This lets us answer
  questions related to neighborhoods/stations
*  Lifetime of bicycle. Track each bike by ID and see which bikes accumulate more miles. Are they taken out of rotation?  
*  Use average speed to determine the bike friendliness of a neighborhood. Use average speed vs traffic to determine whether bike or car is faster at certain points throughout the day.
*  Do people bike downhill and not return bikes to uphill stations?
*  Cluster stations by lat/long to determine neighborhood for results 
*  Calculate standard error of start/end combinations across all the different 
months. Average ride duration from station A to station B should not change much
between months and seasons
*  Hist, summary of distance
*  Read from zip
*  Use birth year and gender as partial identifiers
*  ~~Functionize code for adding new station combinations~~
*  Download distance data for 90 stations added Aug '15.
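
Two of the items above, adding speed and calculating the standard error of a station pair across months, could be sketched as follows. The function names, column names, and numbers are assumptions for illustration; the only fixed inputs are the units, with distance in metres and duration in seconds, as the Google Directions API reports them.

```r
# Speed in miles per hour from distance in metres and duration in seconds.
speed_mph <- function(distance_m, duration_s) {
  (distance_m / 1609.344) / (duration_s / 3600)
}

# Standard error of the mean: sample standard deviation over sqrt(n).
std_error <- function(x) sd(x) / sqrt(length(x))

# Made-up monthly mean durations (seconds) for two station pairs.
monthly <- data.frame(
  start_station_id = c(1, 1, 1, 2, 2, 2),
  end_station_id   = c(2, 2, 2, 1, 1, 1),
  mean_duration    = c(400, 410, 390, 600, 900, 300)
)

# One standard error per station pair across the months.
se_by_pair <- aggregate(mean_duration ~ start_station_id + end_station_id,
                        data = monthly, FUN = std_error)

# One mile covered in six minutes.
example_speed <- speed_mph(1609.344, 360)
```

A pair whose standard error is large relative to its mean, like the second pair in the toy data, would be a candidate for a closer look.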