Join GitHub today
GitHub is home to over 36 million developers working together to host and review code, manage projects, and build software together.Sign up
Analyzing NYC's bike sharing program
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Type||Name||Latest commit message||Commit time|
|Failed to load latest commit information.|
--- title: "Citi Bike Data Analysis" output: html_document --- The set of files included are an exploratory data analysis of the publicly provided Citi Bike data available [here.](http://www.citibikenyc.com/system-data) Citi Bike is a bike sharing system in New York City that lets riders rent bikes for commuting purposes. ##Files * **cb_analysis.R** - R data processing * **cb_eda.Rmd** - Exploratory analysis. See cb_eda.md for knitted version. * **README.Rmd** - this file ###cb_analysis.R Downloads and reads in all the historical trip data from Citi Bike. The format of data has changed several times, so files in certain date ranges need to be handled differently. This specifically relates to the format of the filename and the handling of date-times. Files are split into two data.tables: the dat table which contains trip information, and a lookup table called stations. Stations takes the start station from each trip and dedups. This ensures we capture all stations, old and new, throughout the dates. The function stationDistance takes in a pair of coordinates and makes a call to the Google Directions API. The bicylcing distance and estimated duration are returned. Aggregation is done on a file-by-file (monthly) basis, since there's not enough memory to download all the files and aggregate at that point. Initially, aggregation will be done by station-pairs, i.e. Lafayette St & E 8 St to University Pl & E 14 St ##TODO## * Right-type dat columns for space efficiency * Add to trip data **speed** as a function of distance and duration * Aggregate each month's file at a time. Not enough memory to load entire data set then aggregate. Average month has ~650K rows. Even grouping by hour and start location would yield 1/3 the rows. Need to reduce at least an order of magnitude. Instead of every hour of the month, maybe: * ~~break into hour intervals by station (24 * 330 rows/file).~~ * ~~Or days by station (30 * 330 rows/file).~~ * By start station and end station combinations (< 330^2) since people will not pick up in BK and return in midtown, for example). This lets us answer questions related to neighborhoods/stations * Lifetime of bicycle. Track bike by ID and see which bikes accumlate more miles, are they taken out of rotation? * Use average speed to determine bike friendliness of an a neighborhood. Use average speed vs traffic to determine if bike or car is faster as certain points throughout the day. * Do people bike downhill and not return bikes to uphill stations? * Cluster stations by lat/long to determine neighborhood for results communication * Calculate standard error of start/end combinations across all the different months. Average ride duration from station A to station B should not change much between months and seasons * Hist, summary of distance * Read from zip * Use birth year and gender as partial identifiers * ~~Functionize code for adding new station combinations~~ * Download distance data for 90 stations added Aug '15.