GitHub - jwillage/citibike: Analyzing NYC's bike sharing program

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
cb_eda_files/figure-html		cb_eda_files/figure-html
citibike		citibike
data		data
figure		figure
README.Rmd		README.Rmd
cb_analysis.R		cb_analysis.R
cb_eda.Rmd		cb_eda.Rmd
cb_eda.html		cb_eda.html
cb_eda.md		cb_eda.md
citibike.gif		citibike.gif
descriptive_plots.R		descriptive_plots.R
gender.R		gender.R
gg_animate2.R		gg_animate2.R
maps.R		maps.R

Repository files navigation

---
title: "Citi Bike Data Analysis"
output: html_document
---

The set of files included are an exploratory data analysis of the publicly
provided Citi Bike data available [here.](http://www.citibikenyc.com/system-data)
Citi Bike is a bike sharing system in New York City that lets riders rent 
bikes for commuting purposes.

##Files  
*  **cb_analysis.R** - R data processing
*  **cb_eda.Rmd** - Exploratory analysis. See cb_eda.md for knitted version.
*  **README.Rmd** - this file  


###cb_analysis.R     
Downloads and reads in all the historical trip data from Citi Bike. The format 
of data has changed several times, so files in certain date ranges need to be 
handled differently. This specifically relates to the format of the filename and 
the handling of date-times. Files are split into two data.tables: the dat table
which contains trip information, and a lookup table called stations. Stations
takes the start station from each trip and dedups. This ensures we capture all 
stations, old and new, throughout the dates.  

The function stationDistance takes in a pair of coordinates and makes a call 
to the Google Directions API. The bicylcing distance and estimated duration are 
returned.  

Aggregation is done on a file-by-file (monthly) basis, since there's not 
enough memory to download all the files and aggregate at that point. Initially, 
aggregation will be done by station-pairs, i.e. Lafayette St & E 8 St to 
University Pl & E 14 St


##TODO##

*  Right-type dat columns for space efficiency
*  Add to trip data **speed** as a function of distance and duration
*  Aggregate each month's file at a time. Not enough memory to load entire data
set then aggregate. Average month has ~650K rows. Even grouping by hour and 
start location would yield 1/3 the rows. Need to reduce at least an order of 
magnitude. Instead of every hour of the month, maybe:  
    * ~~break into hour intervals by station (24 * 330 rows/file).~~  
    * ~~Or days by station (30 * 330 rows/file).~~  
    * By start station and end station combinations (< 330^2) since people will 
  not pick up in BK and return in midtown, for example). This lets us answer
  questions related to neighborhoods/stations
*  Lifetime of bicycle. Track bike by ID and see which bikes accumlate more miles, are they taken out of rotation?  
*  Use average speed to determine bike friendliness of an a neighborhood. Use average speed vs traffic to determine if bike or car is faster as certain points throughout the day.
*  Do people bike downhill and not return bikes to uphill stations?
*  Cluster stations by lat/long to determine neighborhood for results 
communication
*  Calculate standard error of start/end combinations across all the different 
months. Average ride duration from station A to station B should not change much
between months and seasons
*  Hist, summary of distance
*  Read from zip
*  Use birth year and gender as partial identifiers
*  ~~Functionize code for adding new station combinations~~
*  Download distance data for 90 stations added Aug '15.